{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# De-novo signatures\n",
    "\n",
    "We can attempt to discover signatures from new data. For these examples we will use a simple synthetic dataset with 4 signatures which slightly overlap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import pathlib\n",
    "from cvanmf.denovo import rank_selection, plot_rank_selection, decompositions, Decomposition\n",
    "from cvanmf import models, data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "blocks = data.synthetic_blocks(\n",
    "    overlap=0.4, \n",
    "    k=4, \n",
    "    scale_lognormal_params=dict(sigma=0.75)\n",
    ")\n",
    "blocks\n",
    "# Each feature is assigned a weight drawn from a lognormal distribution, to roughly simulate some features (such a taxon) \n",
    "# generally taking higher values in input data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data generated or provided in the `data` module is provided as a named tuple containing description and metadata, as well as the data. The dataframe can be accessed as `x.data`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = blocks.data\n",
    "sns.heatmap(blocks.data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rank selection\n",
    "The number of ranks $k$ is unknown, however we can attempt to estimate it using bi-crossvalidation. This tries learning a held out part of the data using the other parts for a range of ranks a large number of times, and we look at measures of how well the held out part was estimated to identify a suitable rank. These are run from largest rank to smallest, so the estimate of time tends to start of conservative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rank_sel_res = rank_selection(x=x, ranks=range(2, 7), shuffles=20, seed=4298, progress_bar=False)\n",
    "# shuffles sets the number iterations of bi-crossvalidation which are run. \n",
    "# The default is 100, it has been set lower here to make the documentation easier to compile, but 100 is a more \n",
    "# sensible value for real data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_rank_selection(rank_sel_res, jitter=False, n_col=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In these plots you are typically looking for elbow points; ranks after which the quality of the decomposition does not improve as rapidly, suggesting that adding more signatures is no longer contributing much explanatory power.\n",
    "\n",
    "In our example, cosine similarity increases rapidly until $k=4$ where the rate of increase slows, indicating this is a suitable rank. Real world examples will rarely be so clear, and often several ranks will appear suitable.\n",
    "\n",
    "The different measures here are:\n",
    "* *Cosine Similarity*: Higher is better. The angle between the true data and estimated data, considering each as a flattened 1d vector\n",
    "* *L2 Norm*: Euclidean distance between true and estimated data. Lower is better.\n",
    "* *R-squared*: Coefficient of determination between true and estimated data. Higher is better.\n",
    "* *Reconstruction Error*: Measure of the error in the estimated data compared to true. Lower is better.\n",
    "* *Residual Sum of Sqaures*: Lower is better\n",
    "* *Sparsity H*: Sparsity of the H matrix\n",
    "* *Sparsity W*: Sparsity of the W matrix\n",
    "\n",
    "When there are several candidate ranks, it is useful to generate decompositions for each and manually inspect them. This is usually the case, and some researcher judgement is required in eventually selecting which rank decomposition to use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regularisation selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Background\n",
    "$L1$ regularisation can be applied to encourage sparsity in the decomposition (fewer entries in $W$ and $H$ with high values).\n",
    "A sparse solution can have benefits for intuitive understanding, for instance if we are decomposing a table of genes, a signature with high weights for a few genes, rather than lower weights for many genes, can be easier to interpret biologically. \n",
    "\n",
    "$L2$ regularisation instead encourages a more even distribution of weights across in both $W$ and $H$, leading to more evenly distributed weights and lower sparsity.\n",
    "\n",
    "The degree of regularisation is controlled by two parameters, $l1\\_ratio$ and $alpha$ from the underlying NMF implementation we use from `scikit-learn`. Currently regularisation is applied to both the $H$ and $W$ matrices, though `sklearn` supports only applying to one.\n",
    "\n",
    "* $l1\\_ratio$ - Between 0 and 1, with 1 applying only $L1$ regularisation, and 0 applying only $L2$ regularisation, and any value between being a mix.\n",
    "* $alpha$ - Regularisation coefficient. A value of 0 means applying no regularisation. By default when selecting an $alpha$ parameter, `cvanmf` will try values $2^i$ for $i \\in \\{-5, -4, \\dots, 2\\}$\n",
    "\n",
    "### Application\n",
    "We give an example here of $L1$ regularisation to encourage sparsity. `cvanmf` provides the function `regu_selection` to test values of $alpha$ for a given $l1\\_ratio$. As we want a sparse solution, we will use $l1\\_ratio = 1$. The function only accepts one rank, we have to have already selected the rank of decomposition we want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cvanmf.denovo import regu_selection, plot_regu_selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_alpha, regu_sel_res = regu_selection(x=x, rank=4, shuffles=10, seed=4298, l1_ratio=1.0, progress_bar=False)\n",
    "# shuffles sets the number iterations of bi-crossvalidation which are run. \n",
    "# The default is 100, it has been set lower here to make the documentation easier to compile, but 100 is a more \n",
    "# sensible value for real data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_alpha"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_regu_selection(regu_sel_res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A higher value of alpha leads to greater sparsity, although in this case not by much until high values of $alpha$. However, the measure of how well the decomposition reconstructs the input data also monotonically decrease as $alpha$ increases. `cvanmf` by default identifies the tested value of $alpha$ for which the $R^2$ is not significantly worse than $alpha=0$, which is returned as the first element of the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_alpha"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating decompositions\n",
    "For the ranks of interest, we can generate several decompositions to investigate. NMF solutions are non-unique and depend on the initialisation of the $H$ and $W$ matrices; some initialisations may give a better solution than others. One approach is to make many decompositions, and select those which optimise some criteria, such as reconstruction error.\n",
    "\n",
    "Here we make 100 decompositions for each of ranks 3, 4, and 5, and keep only the top 5 for each rank. The result is a dictionary, with the key being the rank, and the value a list of Decompositions with the first being the best (lowest reconstruction error)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decomps = decompositions(x, ranks=[3, 4, 5], top_n=5, random_starts=5, seed=4928, progress_bar=False)\n",
    "# or to use regularisation\n",
    "# decomps = decompositions(x, ranks=[3, 4, 5], top_n=5, random_starts=5, seed=4928, l1_ratio=1.0, alpha=best_alpha)\n",
    "decomps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use regularisation parameters here, but be aware the than same $alpha$ is applied to all ranks, which may not be suitable as $alpha$ is selected for a decomposition of specific rank."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can inspect the decompositions using some inbuilt plotting functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decomps[4][0].plot_relative_weight(heights=[0.3, 1, 0.2]).render()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decomps[4][0].plot_pcoa()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decomps[4][0].plot_modelfit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decomps[4][0].plot_feature_weight()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reapplying this model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have generated a model, we can reapply it to new data. We'll make two datasets:\n",
    "* y_similar: Has exactly the same properties as our original X. The model should describe this well.\n",
    "* y_different: Has fewer features, and different number of blocks, so should handle this poorly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_similar = data.synthetic_blocks(\n",
    "    overlap=0.4, \n",
    "    k=4, \n",
    "    scale_lognormal_params=dict(sigma=0.75)\n",
    ").data\n",
    "sns.heatmap(y_similar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_different = data.synthetic_blocks(\n",
    "    m=75,\n",
    "    overlap=0.4, \n",
    "    k=3, \n",
    "    scale_lognormal_params=dict(sigma=0.75)\n",
    ").data\n",
    "sns.heatmap(y_different)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applying our existing model to the data is simple..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d_similar = decomps[4][0].reapply(\n",
    "    y=y_similar\n",
    ")\n",
    "d_similar.plot_relative_weight(heights=[0.3, 1, 0.2]).render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the similar data, the model looks to have described the data reasonably well, with overlapping blocks of signatures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d_different = decomps[4][0].reapply(\n",
    "    y=y_different\n",
    ")\n",
    "d_different.plot_relative_weight(heights=[0.3, 1, 0.2]).render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For dissimilar data is has still performed okay. If we look at the decomposition, we can see that to align the data `cvanmf` has removed from the signature weight matrix and features which are not in the new data (y_different), so only 75 features are used here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decomps[4][0].w.loc[d_different.w.index] == d_different.w"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Slicing decompositions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We might be interested in visualising or analysis only a subset of a decomposition; perhaps we only want to look at one cohort of samples in a model, or only at some of the signatures. Decomposition objects can be indexed by sample, feature, and signature in that order. Normal slice syntax (1:5), lists of index names, or lists of index positions are supported. You can supply a mix of these.\n",
    "\n",
    "This can also be used for abitrary reordering of samples and features for visualisations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "decomp_slcd = decomps[4][0][:50, :, [\"S1\", \"S3\", \"S4\"]]\n",
    "decomp_slcd.plot_relative_weight(heights=[0.3, 1, 0.2]).render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we have restricted to just the first 50 samples, and only signatures S1, S3 and S4. In the full decomposition, S2is important to described many samples from samp_35 to samp_49, so the model fit for these is much poorer in this reduced model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another potential use for the slicing syntax is to reorder samples based on some metadata condition. The method is illustrated below with a random reordering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reorder samples - here we're just generating a random order\n",
    "import random\n",
    "new_order = list(range(decomps[4][0].h.shape[1]))\n",
    "random.shuffle(new_order)\n",
    "\n",
    "# Reorder the decomposition\n",
    "decomps[4][0][new_order].plot_relative_weight(heights=[0.3, 1, 0.2]).render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving and loading\n",
    "Decompositions can be saved to disk for analysis in other environments, or saved and reloaded later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "Decomposition.save_decompositions(decomps, output_dir=pathlib.Path(\"test_save\"), symlink=False, plots=False)\n",
    "# We're not writing any plots here (as they take some time), but you can set this to\n",
    "# - True: make all plots\n",
    "# - List of plots names: make all the listed plots\n",
    "# - None: Make all plots, unless there are many samples (>250), in which case omit any plots which have an element for each sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! ls test_save/**/**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These decompositions can be loaded from disk again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loaded_decomps = Decomposition.load_decompositions(in_dir=pathlib.Path(\"test_save\"))\n",
    "loaded_decomps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An individual decomposition can also be loaded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "one_decomp = Decomposition.load(in_dir=pathlib.Path(\"test_save/5/0/\"))\n",
    "one_decomp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! rm -rf test_save"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stability based rank selection\n",
    "We implement to other popular methods for rank selection, the cophenetic correlation and dispersion coefficients.\n",
    "These are both based on the stability of discrete clusting provided by the decomposition.\n",
    "Each sample (or feature) is 'assigned' to the signature for which it has maximum weight, and a consensus matrix created indicating which pairs of elements are found assigned to the same signature.\n",
    "How similar these consensus matrices are is evaluated across multiple random initialisations, with the intuition that a good factorisation should be stable across intitialisation conditions.\n",
    "\n",
    "We implement these with the functions `cophenetic_correlation`, `dispersion`, and `plot_stability_rank_selection`.\n",
    "\n",
    "### Making decompositions with random intialisations\n",
    "The first step is to make a number of decompositions for each rank of interest using `decompositions`. We are using 10 here, but more (100) would be good for real experiments. The initialisation method should be one which includes a random factor, so either `random` or `nndssvdar`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rs_decomps = decompositions(x, ranks=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], top_n=10, random_starts=10, seed=4928, progress_bar=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting stability coefficients\n",
    "We can fetch each as a series using the respective function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cvanmf.denovo import dispersion, cophenetic_correlation, plot_stability_rank_selection\n",
    "import plotnine as pn\n",
    "\n",
    "dispersion_series = dispersion(rs_decomps)\n",
    "cophenetic_series = cophenetic_correlation(rs_decomps)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dispersion_series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cophenetic_series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also plot these, either by providing the series or the `decompositions` results.\n",
    "\n",
    "If you need the values to use in calculations later, it is more efficient to calculate and provide series. Passing `decompositions` results in the function internally calculating the consensus matrices, and for cophenetic correlation perform average linkage clustering, which can be time consuming for large decompositions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_stability_rank_selection(rs_decomps) + pn.theme(figure_size=(6, 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is how you would pass the series to plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_stability_rank_selection(series=[cophenetic_series, dispersion_series]) + pn.theme(figure_size=(6, 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  },
  "mystnb": {
   "execution_timeout": -1
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}