Example application to cell composition data¶

In this example, we will look at the compositions of samples from a non-small cell lung cancer dataset.

The data orirginates from to this study. It was subsequently added to cellxgene, which is the source we have taken the data from. The data on cellxgene is gene expression, but also includes cell type labels for each cell. From this we derived the relative cell type composition of each sample, the proportion of each cell type in each sample.

from cvanmf import denovo, combine, data, stability
import pandas as pd
import plotnine as pn
import numpy as np
import matplotlib.pyplot as plt
import pathlib

lung_data = data.lung_cancer_cells()
lung_data

Non-small cell lung cancer
cvanmf example data
shape	(33, 224)
rank	[]
description	Relative cell-compositions from non-small cell lung cancer studies. Gives the number of cells of different types in lung tissue samples from a non-small cell lung cancer atlas, which was compiled from 29 studies and includes 556 samples, from 318 individuals (86 of which are healthy controls). The data was uploaded to cellxgene using their standard ontologies, which is the source we have taken the data from. Metadata provided here is a mixture of metadata from cellxgene, and some from the original paper. We have selected out only the tissues samples labelled as "lung". In total, this gives 224 samples, and 33 cell types. The data here is total-sum-scaled, i.e. each sample sums to 1.
doi	https://doi.org/10.1016/j.ccell.2022.10.008
citation	Salcher, S. et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell 40, 1503-1520.e8 (2022). Program, C. S.-C. B. et al. CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data. 2023.10.30.563174 Preprint at https://doi.org/10.1101/2023.10.30.563174 (2023).

This is a named tuple object, with properties giving tables and metadata associated to the data. The data is in the data property.

composition = lung_data.data
composition.head().iloc[:,:3]

	Adams_Kaminski_2020_001C	Adams_Kaminski_2020_002C	Adams_Kaminski_2020_003C
cell_type
B cell	0.019485	0.030466	0.038913
CD1c-positive myeloid dendritic cell	0.012571	0.007168	0.032539
CD4-positive, alpha-beta T cell	0.006285	0.021505	0.023482
CD8-positive, alpha-beta T cell	0.012571	0.014337	0.042603
alveolar macrophage	0.293526	0.367384	0.300906

The data has been total-sum-scaled, i.e. each sample sums to 1, which we can check.

np.allclose(composition.sum(axis=0), 1.0)

True

The metadata for each column (sample) is stored in the col_metadata property. One of the labels for disease is quite long, so here we’ll tidy it a bit.

# Disease labels are very long, so we will recode these for more compact plots
sample_md = lung_data.col_metadata.copy()
sample_md['disease_long'] = sample_md['disease']
disease_code = {
    "chronic obstructive pulmonary disease": "NC: COPD",
    "normal": "NC",
    "non-small cell lung carcinoma": "C: non-small cell",
    "squamous cell lung carcinoma": "C: squamous cell",
    "lung adenocarcinoma": "C: adenocarcinoma"
}
sample_md['disease'] = sample_md['disease_long'].apply(lambda x: disease_code[x])
sample_md = sample_md.loc[~sample_md.index.duplicated()]
sample_md.head().iloc[:, :3]

	assay	assay_ontology_term_id	development_stage
patient
Adams_Kaminski_2020_001C	10x 3' v2	EFO:0009899	22-year-old human stage
Adams_Kaminski_2020_002C	10x 3' v2	EFO:0009899	25-year-old human stage
Adams_Kaminski_2020_003C	10x 3' v2	EFO:0009899	67-year-old human stage
Adams_Kaminski_2020_034C	10x 3' v2	EFO:0009899	49-year-old human stage
Adams_Kaminski_2020_052CO	10x 3' v2	EFO:0009899	62-year-old human stage

Rank Selection¶

We’ll search from ranks 2 to 10. You may want to search a wider range in other data, but here we will see a clear rank is indicated within this range.

rank_res = denovo.rank_selection(
    x=composition,
    ranks=list(range(2, 11)),
    seed=4298,
    shuffles=100,
    progress_bar=False
)

We can plot the results to look for an elbow point in the values. Throughout the notebook we apply some customisations to make publication ready figures, but these are not neccessary. For most figures the customisation is done using plotnine, which has very similar syntax to ggplot in R. The purpose of any complex changes to the plot are indicated by comments.

# This is a flag indicating whether to write figure produced
PUB = True

def figure_output(plot, name, pub):
    """Output a figure and the underlying data in PNG, PDF and TSV format."""
    if not pub:
        return plot
    prefix = pathlib.Path("cell_imgs")
    out_path = prefix / name
    plot.save(out_path.with_suffix('.png'), dpi=300)
    plot.save(out_path.with_suffix('.pdf'))
    plot.data.to_csv(out_path.with_suffix('.tsv'), sep='\t')
    return plot

plt_ranksel = (
    denovo.plot_rank_selection(rank_res, jitter=False) 
    + pn.guides(fill = "none")
    + pn.theme(figure_size=(2.66 * 2, 1.8))
    + pn.ggtitle("Bicrossvalidation Rank Selection")
    + pn.xlab("Rank")
    + pn.ylab("")
    # We're adding a little more room for the stars - the internal calculation doesn't always get the right space for custom size plots
    + pn.scale_y_continuous(expand=[0.1, 0, 0.1, 0])
)

figure_output(plt_ranksel, "a_ranksel", PUB)
# You can show the figure in the notebook by using it as the final line of the cell
plt_ranksel

/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:1194: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:1298: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:1303: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 5.32 x 1.8 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/a_ranksel.png
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 5.32 x 1.8 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/a_ranksel.pdf

_images/b5001da9ae781245929c26c1c04cb9595278ce69bc3fc0d04525de7bee1a58e8.png

Regularisation selection¶

We would like a sparse solution so we will use L1 regularisation, and select a value for parameter alpha. Regularisation selection is performed in much the same way, however the results returned are a tuple with the first element the suggested best value for alpha based on a heuristic, the second element being results in the same structure as rank selection.

best_alpha, regu_res = denovo.regu_selection(
    x=composition,
    rank=3,
    seed=4298,
    l1_ratio=1.0,
    progress_bar=False
)

plt_regusel = (
    denovo.plot_regu_selection(regu_res, jitter=False) 
    + pn.theme(
        figure_size=(2.66, 1.8),
        axis_text_x=pn.element_text(rotation=45, hjust=1.0)
    )
    + pn.guides(fill = "none")
    + pn.ggtitle("Regularisation Selection")
    # We're adding a little more room for the stars
    + pn.scale_y_continuous(expand=[0.1, 0, 0.1, 0]) 
)
figure_output(plt_regusel, "a_regusel", PUB)
plt_regusel

/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:1194: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:1298: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 2.66 x 1.8 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/a_regusel.png
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 2.66 x 1.8 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/a_regusel.pdf

_images/9893f8c0c9654a4adeaab3941b800d68605c104e88308fb6e5fbbd060f77004a.png

best_alpha

0.008928571428571428

The heuristic suggested value is quite a drop in \(R^2\) compared to the previous, so we will instead take the value one before.

Decomposition with selected parameters¶

decomps = denovo.decompositions(
    x=composition,
    ranks=[3],
    alpha=2.23e-03,
    l1_ratio=1.0,
    seed=4298,
    # We keep the 100 best decompositions so we can look at how similar signatures are across them later
    # You could chose to retain a lower number
    top_n=100
)
k3 = decomps[3][0]

100%|███████████████████████████████████| 100/100 [00:00<00:00, 142.97it/s]

Model fit for disease status¶

Dose our model describe certain diseases types better than others? We can plot the model fit for each sample, grouped by disease status. The model fit is relatively good for all groups, though slightly higher for all the cancer samples. We could hypothesise that this might be that there is more variation in composition among healthy participants; or perhaps as there are more cancer samples the model has learnt those types better.

plt_modelfit = (
    k3.plot_modelfit(group=sample_md['disease']) 
    + pn.coord_flip() 
    + pn.theme(figure_size=(3, 1.5)) 
    + pn.ggtitle("Model Fit")
)
plt_modelfit

_images/cbd75384f863edbb6d917669a3cf2643a4f0cf1522d33d71c4cc1acf982dd914.png

We can do the same for tumour stage

k3.plot_modelfit(group=sample_md['tumor_stage']) + pn.coord_flip() + pn.theme(figure_size=(3, 1.5))

_images/6567f773a32486115f5f5f563150e9c5f4e032ca3a67b007ffa1dd5467998881.png

Interpreting signatures¶

Knowing which features contribute to each signature can help give a domain-specific interpretation, such as understanding which types of cells are contributing to the cancer associated signature.

The default plot for this is plot_feature_weight, which shows all features above a certain proportion of the overall signature weights (by default 4%).

Here we see S2 is largely inflammatory immune type / T-cells, S3 malignant cancer cells, and S1 a mix including functional lung cells (pneumocyte, alveolar macrophage).

plt_weight = (
    k3.plot_feature_weight() 
    + pn.theme(
        figure_size=(4.2, 2.5)
    )
    + pn.ggtitle("Feature Weight")
    + pn.scale_x_discrete(expand=(0.1, 0.0, 0.15, 0.0)) # This adds a little more room left and right of the bubbles
)
plt_weight

/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/scales/scales.py:48: PlotnineWarning: Scale for 'x' is already present.
Adding another scale for 'x',
which will replace the existing scale.

_images/71c5d396a8c348803cb42af52d33dea359b84d0e1ae01fa00e4074ef823c7661.png

To make understanding future plots simpler, we can rename signatures. Here we retain the numbering (S1 .. S3), and add a brief tag indicating the broad types of cells represented.

k3.names = ['S1 (Func.)', 'S2 (Infl.)', 'S3 (Tumor)']
plt_weight = (
    k3.plot_feature_weight() 
    + pn.theme(
        figure_size=(4.2, 2.8)
    )
    + pn.ggtitle("Feature Weight")
    + pn.scale_x_discrete(expand=(0.1, 0.0, 0.15, 0.0)) # This adds a little more room left and right of the bubbles
)
figure_output(plt_weight, "f_feature_weight", PUB)
plt_weight

/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/scales/scales.py:48: PlotnineWarning: Scale for 'x' is already present.
Adding another scale for 'x',
which will replace the existing scale.

/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 4.2 x 2.8 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs.png
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 4.2 x 2.8 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs.pdf

_images/519e2e49d3e0d1c637f1c59dbb6260c5ce0e14870ed7534fa4b7fa44189c1044.png

Ordination of samples¶

To look at broad patterns in the data, we can look at PCoA ordination of samples. With few signatures, these willl largely divide by primary signature. In the PCoA below, we can see cancer and non-cancer samples separating along the first axis, and early and late stage tumors separating on the second axis.

We could run the PCoA on the input data, or WH, using the on argument of plot_pcoa/pcoa.

# We're defining a colour scale for tumor stage which gives an intution of progression
stage_color = dict(
    none="grey",
    early="#c8b1e4",
    advanced="#532b88"
)

plt_pcoa = (
    k3.plot_pcoa(on="h", color=sample_md['tumor_stage'], shape="signature") 
    + pn.theme(figure_size=(4.3, 4), legend_position="bottom") 
    + pn.ggtitle("PCoA on Signature Weights")
    + pn.scale_color_manual(
        values=list(stage_color.values()),
        breaks=list(stage_color.keys())
    )
    # We're removing the point colour legend, as it will be indicated on another figure in the publication
    + pn.guides(
        color="none"
    )
)
figure_output(plt_pcoa, "c_pcoa", PUB)
plt_pcoa

/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:3831: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:3835: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 4.3 x 4 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/c_pcoa.png
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 4.3 x 4 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/c_pcoa.pdf

_images/c168ea0b61b79b570cfd4ae51088cc9dca0d23b5b9cdc4f0ea574fe1c1978f78.png

Visualising signature composition¶

An overview of the signature composition for samples can be a useful exploratory step. The plot_relative_weight methods produces a combination of stacked bar plot, a ribbon indicating some categorical metadata value for each sample, and a point plot of model fit.

Here we can see again what we observed above - there is a shift in S2 / S3 abundance between early and late stage tumours, and S3 is more abundant in advanced tumors.

fig = k3.plot_relative_weight(
    # The metadata to show groups for. It will appear in the order and with the name we give it here in this series.
    group=(
        sample_md['tumor_stage']
        .rename('Tumor Stage')
        .sort_values(ascending=False)
    ),
    # The colours should be provided as a series
    # There is currently a bug here, where the colours are applied reversed.
    group_colors=pd.Series(
        reversed(list(stage_color.values())),
        index=list(stage_color.keys())
    ),
    model_fit=False,
    heights=[0.25, 1, 0],
    sample_label_size=False, 
    legend_cols_grp=2,
    legend_cols_sig=2,
    width=4, point_size=0.1,
    threshold=None,
    legend_side="top"
)
# The relative weight plot is produced by Marsilea, so unfortunately currently has a slightly different way to display and save
fig.render()
plt.show()

_images/2461c6e38c65f9fa6bd44e00b14809f901e6801db71b54ef3b252ffb875927be.png

fig.save("cell_imgs/d_relative_weight.pdf")
fig.save("cell_imgs/d_relative_weight.png", dpi=300)

Relationships to metadata¶

Throught the previous plots we have seen what look like some associations between tumor stage and signature. We can plot this relationship to metadata as box or scatter plots. plot_metadata takes a single datframe with all the metadata we want to plot against, and returns two figures with discrete and continuous variables respectively.

By default, statistical tests are run for discrete comparison: a Kruskal-Wallis test for more than two classes, and a Mann-Whitney U-test for two samples. Significance is indicated with stars.

Here, we see that there are differences in signature weight between disease and tumour stage.

disc.data.head()

	sample	metadata_field	metadata_value	signature	signature_weight
0	Adams_Kaminski_2020_001C	Disease	NC	S1 (Func.)	0.961407
1	Adams_Kaminski_2020_001C	Disease	NC	S2 (Infl.)	0.028294
2	Adams_Kaminski_2020_001C	Disease	NC	S3 (Tumor)	0.010300
3	Adams_Kaminski_2020_001C	Disease	NC	modelfit	0.914390
4	Adams_Kaminski_2020_001C	Tumor Stage	none	S1 (Func.)	0.961407

disc, cont = k3.plot_metadata(sample_md[['disease', 'tumor_stage', 'age']].rename(
    # Rename our metadata fields to more readable form
    columns=dict(tumor_stage="Tumor Stage", disease="Disease")
))

disc = (
    disc 
    + pn.guides(fill="none") 
    + pn.theme(
        figure_size=(3,4),
        axis_text_x=pn.element_text(rotation=60, hjust=1, vjust=1)
    )
    + pn.ggtitle("Metadata Plots")
    # We're adding a little more room for the stars - the internal calculation doesn't always get the right space for custom size plots
    + pn.scale_y_continuous(expand=[0.1, 0, 0.2, 0]) 
    # This plot is facetted by default, but we're redefining it so we can add linebreaks to row values
    # To see the data and columns which the plot is based on, your can look at disc.data
    + pn.facet_grid(
        rows='signature',
        cols='metadata_field',
        labeller=pn.labeller(rows=lambda x: x.replace("(", "\n(").replace('model', '\nmodel')),
        scales='free_x',
        space='free_x'
    )
)
figure_output(disc, "e_metadata", PUB)
disc

/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:5147: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:4310: FutureWarning: The provided callable <function min at 0x7f88c86d0360> is currently using DataFrameGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:4326: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/home/kam24goz/repos/cvanmf/src/cvanmf/denovo.py:4331: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/scales/scales.py:48: PlotnineWarning: Scale for 'y' is already present.
Adding another scale for 'y',
which will replace the existing scale.

/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 3 x 4 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/e_metadata.png
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:606: PlotnineWarning: Saving 3 x 4 in image.
/home/kam24goz/miniforge3/envs/cvanmf/lib/python3.12/site-packages/plotnine/ggplot.py:607: PlotnineWarning: Filename: cell_imgs/e_metadata.pdf

_images/627627e45f4a8db71b06eabea3dc9352eddcab4268534235ea1ea0e31b5e6d70.png

Here it is might be interesting to look which of the tumor stages are different in post-hoc tests. That isn’t available in the plot currently, but we can run the tests and get a dataframe of the results

univar_results = k3.univariate_tests(sample_md[['disease', 'tumor_stage']].rename(
    # Rename our metadata fields to more readable form
    columns=dict(tumor_stage="Tumor Stage", disease="Disease")
))
univar_results.head()

	statistic	p	test	signature	md	max_median	max_mean	posthoc_str	posthoc_mat	alpha	local_reject	local_adj_p	global_reject	global_adj_p	adj_method
S1 (Func.)	111.551123	0.0	kruskal	S1 (Func.)	Disease	NC	NC	C: adenocarcinoma\|C: squamous cell(0.041462685...	[[1.0, 0.326832357772083, 0.04146268546446127,...	0.05	True	0.0	True	0.0	fdr_bh
S2 (Infl.)	56.195448	0.0	kruskal	S2 (Infl.)	Disease	C: adenocarcinoma	C: non-small cell	C: adenocarcinoma\|C: squamous cell(4.287959966...	[[1.0, 0.6297354012486505, 4.287959966752344e-...	0.05	True	0.0	True	0.0	fdr_bh
S3 (Tumor)	60.076882	0.0	kruskal	S3 (Tumor)	Disease	C: squamous cell	C: squamous cell	C: adenocarcinoma\|C: squamous cell(0.000366053...	[[1.0, 0.4083706712324559, 0.00036605306670249...	0.05	True	0.0	True	0.0	fdr_bh
modelfit	5.183176	0.269014	kruskal	modelfit	Disease	C: squamous cell	C: adenocarcinoma		[[1.0, 0.8855997748258781, 0.8855997748258781,...	0.05	False	0.269014	False	0.269014	fdr_bh
S1 (Func.)	115.403315	0.0	kruskal	S1 (Func.)	Tumor Stage	none	none	advanced\|early(0.0004988496148587597);advanced...	[[1.0, 0.0004988496148587597, 9.12382034786168...	0.05	True	0.0	True	0.0	fdr_bh

# To extract only the modelfit / stage tests
univar_results[
    (univar_results['signature'] == "modelfit")
    & (univar_results['md'] == 'Tumor Stage')
]['posthoc_str']

modelfit    advanced|none(0.049496854141262696)
Name: posthoc_str, dtype: object

This indicates that only the comparison between advanced and none was significant below the default alpha of 0.05.

We also get a plot for continuous metadata variables, though there is not much interesting to see here.

cont + pn.theme(figure_size=(6, 6))

_images/745b86d1477d908ae206e39db851b4e5f66207bcbb4da18916f3637bb04bd2e0.png

Stability of Signatures¶

NMF solutions vary depending on their initialisation - most methods offered include some randomisation, so solutions can vary. If highly similar signatures are obtained across random initialisations, this can help characterise how robust each signature is. Using the 100 decompositions we produced earlier, we can visualise how similar the signatures in other decompositions are to the best decomposition which we used for analysis.

sig_stability = stability.signature_stability(
    decomps[3],
    decomps[3][0]
)
plt_stability = (
    stability.plot_signature_stability(
        sig_stability,
        k3.colors
    ) +
    pn.ggtitle("Signature\nStability") +
    pn.theme(figure_size=(2.5, 3)) +
    pn.guides(fill="none")
)
plt_stability

_images/6b16308a012855cbfc2ecd2902461982e72f8ff86bc7bba354a359e463709780.png

Signatures are mostly highly similar across decompositions, with a handful of outliers where the solution is more different. In particular, signature S3 seems to have no similar signature in some decompositions.

We can compare this across decompositions of multiple ranks to see if signature stability supports our choice of rank.

decomps_multi = denovo.decompositions(
    x=composition,
    ranks=range(2, 11),
    # Suitable alpha hasn't been selected for all ranks, so set to 0
    alpha=0.0,
    l1_ratio=1.0,
    seed=4298,
    # We want to retain all our decompositions so we can compare them
    top_n=101,
    # We choose 101 so we can take the best, and compare it to 100 others
    random_starts=101,
    progress_bar=False
)

sig_stability_multi = stability.signature_stability(
    decomps_multi
)
plt_stability_multi = (
    stability.plot_signature_stability(
        sig_stability_multi,
        geom_boxplot=dict(outlier_size=0.1),
        geom_line=True
    ) +
    pn.ggtitle("Signature\nStability") +
    pn.theme(figure_size=(7, 5)) +
    pn.guides(fill="none") +
    pn.facet_wrap(facets="k", ncol=6, scales="free_x")
)
plt_stability_multi

_images/8be322b688c0b5270bf09e939e1d76eb760a2ce840d7bc4db5636f78903b92d4.png

The signatures here show much more variation across intialisations, suggesting a stable solution is not being found.