cvanmf.denovo¶

Generate new models using NMF decomposition

This module provides functions to generate new models from data, which encompasses three main steps: rank selection, regularisation selection, and model inspection. The first two of these steps involve running decompositions multiple times for a range of values, and can be time-consuming. Methods are provided to run the whole process on a single machine, and also to run individual decompositions; the latter are used by the accompanying nextflow pipeline to spread the computation across multiple nodes in an HPC environment.

The main functions for each step are

rank_selection() and plot_rank_selection()
regu_selection() and plot_regu_selection()
decompositions() which produces Decomposition objects

Individual decompositions are represented by a Decomposition object, and visualisation and analysis are carried out using object methods (such as Decomposition.plot_feature_weight()).

Attributes¶

`Numeric`	Alias for python numeric types (a union of int and float).
`PcoaMatrices`	Allowed matrices which PCoA can be constructed from. Allows values w, x,
`logger`	Logger object.

Classes¶

`BicvFold`	One fold from a shuffled matrix
`BicvResult`	Results from a single bi-cross validation run. For each BicvSplit there
`BicvSplit`	Shuffled matrix for bi-cross validation, split into mn matrices in a
`Decomposition`	Decomposition of a matrix.
`NMFParameters`	Parameters for a single decomposition, or iterations of bi-cross

Functions¶

`bicv`(→ BicvResult)	Perform a single run of bicrossvalidation.
`cli_decompose`(→ None)	Decompositions for RANKS.
`cli_rank_selection`(→ None)	Rank selection for NMF using mn-fold bi-cross validation
`cli_regu_selection`(→ None)	Regularisation selection for NMF on ALPHA 9 fold bi-cross validation
`cophenetic_correlation`(→ pandas.Series)	Cophenetic correlation coefficient for rank selection
`decompose`(→ Decomposition)	Perform a single decomposition of a matrix.
`decompositions`(→ Dict[int, List[Decomposition]])	Get the best decompositions for input matrix for one or more ranks.
`dispersion`(→ pandas.Series)	Dispersion coefficient for rank selection
`plot_rank_selection`(→ plotnine.ggplot)	Plot rank selection results from bicrossvalidation.
`plot_regu_selection`(→ plotnine.ggplot)	Plot regularisation selection results.
`plot_stability_rank_selection`(→ plotnine.ggplot)	Plot results for stability based rank selection methods (
`rank_selection`(, progress_bar, List[BicvResult]])	Bi-cross validation for rank selection.
`regu_selection`(, progress_bar, Dict[float, ...)	Bicrossvalidation for regularisation selection.
`signature_similarity`(→ pandas.Series)	Mean cosine similarity of signatures for rank selection
`suggest_alpha`(→ float)	Suggest a suitable value for alpha.
`suggest_rank`(→ Dict[str, int])	Suggest a suitable rank.
`suggest_rank_stability`(→ Dict[str, int])	Suggest a suitable rank in stability based measures.

Module Contents¶

class cvanmf.denovo.BicvFold[source]¶

Bases: NamedTuple

One fold from a shuffled matrix

The submatrices have been joined into the structure shown below

A B . B
C D . D
. . . .
C D . D

from which A will be estimated as A’ using only B, C, D.

A: pandas.DataFrame¶

B: pandas.DataFrame¶

C: pandas.DataFrame¶

D: pandas.DataFrame¶

class cvanmf.denovo.BicvResult[source]¶

Bases: NamedTuple

Results from a single bi-cross validation run. For each BicvSplit there are \(mn\) folds (for \(m\) splits on rows, \(n\) splits on columns), for which the top left submatrix (\(A\)) is estimated (\(A'\)) using the other portions.

static join_folds(results: List[BicvResult]) → BicvResult[source]¶

Join results from individual folds

Each fold returns a BicvResult with a length one array. This method joins these into a single object summarising all the folds. Could also join other sets of results.

Parameters:: results – Results from individual folds
Returns:: Single object with individual arrays joined

static results_to_table(results: Iterable[BicvResult] | Dict[Numeric, Iterable[BicvResult]], summarise: Callable[[numpy.ndarray], float] = np.mean) → pandas.DataFrame[source]¶

Convert bi-fold crossvalidation results to a table

For results run with the same parameters, convert the output to a table suitable for plotting.

Parameters:

results – List of results for bicv runs with the same parameters on different shuffles of the data, or dict of runs across multiple values on the same shuffles.
summarise – Function to reduce each of the measures (r_squared etc) to a single value for each shuffle.

to_series(summarise: Callable[[numpy.ndarray], float] = np.mean) → pandas.Series[source]¶

Convert bi-fold cross validation results to series

Parameters:: summarise – Function to reduce each of the measures (r_squared etc) to a single value for each shuffle.
Returns:: Series with entry for each non-parameter measure

a: List[numpy.ndarray] | None¶: Reconstructed matrix A for each fold. Not included when keep_mats is False.

cosine_similarity: numpy.ndarray¶: Cosine similarity between each A and A’ considered as a flattened vector.

i: int¶: Shuffle number when there are multiple shuffles. Included to allow spreading bicv across multiple processes, but without needing to return a copy of the full matrix.

l2_norm: numpy.ndarray¶: L2 norm between each A and A’.

parameters: NMFParameters¶: Parameters used during this run.

r_squared: numpy.ndarray¶: Explained variance between each A and A’, with each considered as a flattened vector.

reconstruction_error: numpy.ndarray¶: Reconstruction error between each A and A’.

rss: numpy.ndarray¶: Residual sum of squares between each A and A’

sparsity_h: numpy.ndarray¶: Sparsity of H matrix for each A’

sparsity_w: numpy.ndarray¶: Sparsity of W matrix for each A’
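As a rough illustration of the per-fold measures listed above, the sketch below compares a held-out block A with an estimate A’ as flattened vectors. It is plain Python and independent of cvanmf; the helper names (`flatten`, `fold_metrics`) are hypothetical, and treating the L2 norm as the norm of the difference A - A’ is an assumption rather than a statement about the package internals.

```python
import math


def flatten(matrix):
    """Flatten a list-of-rows matrix into a single list of values."""
    return [value for row in matrix for value in row]


def fold_metrics(a, a_prime):
    """Compare a held-out block A with its estimate A' as flat vectors."""
    u, v = flatten(a), flatten(a_prime)
    rss = sum((x - y) ** 2 for x, y in zip(u, v))
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return {
        "rss": rss,
        # Assumption: L2 norm taken as the norm of the difference A - A'
        "l2_norm": math.sqrt(rss),
        "cosine_similarity": dot / (norm_u * norm_v),
    }


a = [[1.0, 2.0], [3.0, 4.0]]
a_prime = [[1.0, 2.0], [3.0, 2.0]]
metrics = fold_metrics(a, a_prime)
```

In a BicvResult each measure holds one value per fold, which is why the summarise argument of to_series() and results_to_table() reduces an array to a single number.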

class cvanmf.denovo.BicvSplit(mx: List[pandas.DataFrame], design: Tuple[int, int], i: int | None = None)[source]¶

Shuffled matrix for bi-cross validation, split into mn matrices in an m x n pattern. To shuffle and split an existing matrix, use the static method BicvSplit.from_matrix().

Create a shuffled matrix containing the 9 split matrices. These should be in the order

0, 1, 2
3, 4, 5
6, 7, 8

Parameters:

mx – Split matrices in a flat list. These should be all of one row, then all of the next row etc.
design – Number of even splits on rows and columns of mx.
i – Index of the split if one of many

col(col: int, join: bool = False) → List[pandas.DataFrame] | pandas.DataFrame[source]¶

Get a column of the submatrices by index. Convenience method for readability.

Parameters:

col – Column index to get
join – Join into a single matrix

Returns:

List of submatrices making up the column, or these submatrices joined if requested.

fold(i: int) → BicvFold[source]¶

Construct a fold of the data

There are m*n possible folds of the data, this function constructs the i-th fold.

Parameters:: i – Index of the fold to construct, from [0, mn)
Returns:: A, B, C, and D matrices for this fold
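The relationship between a fold index and the A, B, C, D blocks can be sketched with block numbers alone. This is a minimal illustration, assuming that folds are numbered row by row (fold i holds out the block in row i // n, column i % n); that numbering and the function name `fold_blocks` are assumptions, not taken from the cvanmf source.

```python
def fold_blocks(design, i):
    """Return which numbered blocks form A, B, C and D for fold i."""
    m, n = design
    if not 0 <= i < m * n:
        raise IndexError(f"fold index must be in [0, {m * n})")
    # Assumption: folds are numbered in row-major order
    r, c = divmod(i, n)
    grid = [[row * n + col for col in range(n)] for row in range(m)]
    return {
        # Held-out block, to be estimated from the others
        "A": grid[r][c],
        # Remaining blocks in the same row as A
        "B": [grid[r][col] for col in range(n) if col != c],
        # Remaining blocks in the same column as A
        "C": [grid[row][c] for row in range(m) if row != r],
        # Blocks sharing neither a row nor a column with A
        "D": [grid[row][col]
              for row in range(m) if row != r
              for col in range(n) if col != c],
    }


first_fold = fold_blocks((3, 3), 0)
centre_fold = fold_blocks((3, 3), 4)
```

For the 3 x 3 design, the centre fold holds out block 4 and leaves the four corner blocks as D.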

static from_matrix(df: pandas.DataFrame, n: int = 1, design: Tuple[int, int] = (3, 3), random_state: int | numpy.random.Generator | None = None) → Generator[BicvSplit][source]¶

Create random shuffles and splits of a matrix

Parameters:

df – Matrix to shuffle and split
n – Number of shuffles
design – Number of blocks to divide rows and columns into. Default is 3x3 9-fold bicrossvalidation.
random_state – Random state, either int seed or numpy Generator; None for default numpy random Generator.

Returns:

A generator of splits, as BicvSplit objects
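The idea of shuffling and then splitting into even blocks can be sketched on indices alone. This is plain Python rather than the package's implementation; `shuffled_blocks` is a hypothetical helper, and it assumes the number of rows or columns divides evenly by the number of blocks (how cvanmf handles remainders is not covered here).

```python
import random


def shuffled_blocks(n_items, n_blocks, rng):
    """Shuffle item indices and cut them into n_blocks equal groups."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    # Assumption: n_items is an exact multiple of n_blocks
    size = n_items // n_blocks
    return [indices[k * size:(k + 1) * size] for k in range(n_blocks)]


rng = random.Random(4298)  # fixed seed for a reproducible shuffle
row_blocks = shuffled_blocks(6, 3, rng)  # 6 features into 3 row blocks
col_blocks = shuffled_blocks(9, 3, rng)  # 9 samples into 3 column blocks
```

Pairing every row block with every column block gives the 9 submatrices of the default (3, 3) design.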

static load_all_npz(path: pathlib.Path | str, allow_pickle: bool = True, fix_i: bool = False) → Generator[BicvSplit][source]¶

Read shuffles from files.

Reads either all the npz files in a directory, or those specified by a glob. The expectation is the filenames are in format prefix_i.npz, where i is the number of this shuffle. If not, use fix_i to renumber in order loaded.

Parameters:

path – Directory with .npz files, or glob identifying .npz files
allow_pickle – Allow unpickling when loading; necessary for compressed files.
fix_i – Renumber shuffles.

Returns:

Generator of BicvSplit objects.

static load_npz(path: pathlib.Path, allow_pickle: bool = True, i: int | None = None) → BicvSplit[source]¶

Load splits from file.

Load splits from an npz file. The column and row names are not stored in this format and so are lost, but this is unimportant for cross-validation. Pickling is required for loading compressed files, but is not secure, so the option is provided to turn it off if you don’t need it.

Parameters:

path – File to load
allow_pickle – Allow unpickling when loading; necessary for compressed files
i – Shuffle number. Will attempt to parse from filename if blank. Only parses files like prefix_1.npz, which will be i=1, prefix only alphanumeric.
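The filename convention described for i can be sketched with a regular expression. The exact pattern used by cvanmf is not shown in this documentation, so the pattern and the helper name `parse_shuffle_number` below are assumptions that simply follow the stated rule: an alphanumeric prefix, an underscore, the shuffle number, and the .npz extension.

```python
import re
from pathlib import Path

# Assumed pattern: alphanumeric prefix, underscore, digits, ".npz"
SHUFFLE_NAME = re.compile(r"^[A-Za-z0-9]+_(\d+)\.npz$")


def parse_shuffle_number(path):
    """Return the shuffle number encoded in a filename, or None."""
    match = SHUFFLE_NAME.match(Path(path).name)
    return int(match.group(1)) if match else None


parsed = parse_shuffle_number("shuffles/prefix_1.npz")
unparsed = parse_shuffle_number("shuffles/my-prefix_1.npz")
```

Under this assumed pattern a prefix containing a hyphen is not parsed, in which case i would need to be passed explicitly.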

row(row: int, join: bool = False) → List[pandas.DataFrame] | pandas.DataFrame[source]¶

Get a row of the submatrices by index. Convenience method for readability.

Parameters:

row – Row index to get
join – Join into a single matrix

Returns:

List of submatrices making up the row, or these submatrices joined if requested.

static save_all_npz(splits: Iterable[BicvSplit], path: pathlib.Path, fix_i: bool = False, force: bool = False, compress: bool = True) → None[source]¶

Save a collection of splits to a directory as npz files.

Parameters:

splits – Iterable of BicvSplit objects
path – Directory to write to
fix_i – Renumber all splits starting from 0. Does not check if existing numbering is unique.
compress – Use compression
force – Overwrite existing files

save_npz(path: pathlib.Path, compress: bool = True, force: bool = False) → None[source]¶

Save these splits to file.

Write the splits to a numpy format file. This will lose the row and column names, however this is unimportant for rank selection. Compression is enabled by default, as sparse data such as microbiome counts tends to create large files.

Parameters:

path – Path to write to. If passed a directory, will output with filename shuffle_{i}.npz. If i is not set, an error is raised.
compress – Use compression.
force – Overwrite existing files.

property design: Tuple[int, int]¶: Design of holdout pattern, given as (rows, columns).

property folds: List[BicvFold]¶: List of the mn possible folds of these submatrices.

property i: int¶: This is the i-th shuffle of the input.

property mx: List[List[pandas.DataFrame]]¶: Submatrices as a 2d list.

property num_folds: int¶: Total number of folds in this design.

property shape: Tuple[int, int]¶: Dimensions of the input matrix.

property size: int¶: Size of input matrix.

property x: pandas.DataFrame¶

The input matrix

This reproduces the input matrix by concatenating the submatrices.

Returns:: Input matrix

class cvanmf.denovo.Decomposition(parameters: NMFParameters, h: pandas.DataFrame, w: pandas.DataFrame, feature_mapping: reapply.FeatureMapping | None = None)[source]¶

Decomposition of a matrix.

Note that we use the naming conventions and orientation common in NMF literature:

\(X\) is the input matrix, with m features on rows, and n samples on columns.
\(H\) is the transformed data, with k signatures on rows, and n samples on columns.
\(W\) is the feature weight matrix, with m features on rows, and k signatures on columns.

The scikit-learn implementation has these transposed; this package handles transposing back and forth internally, expects input in the features x samples orientation, and provides \(W\) and \(H\) in line with the literature rather than scikit-learn.
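The orientation can be checked with a toy product in plain Python: a feature weight matrix with m rows and k columns multiplied by a signature weight matrix with k rows and n columns gives an approximation of \(X\) with m rows and n columns. The matrices and the `matmul` helper below are made up for illustration.

```python
def matmul(w, h):
    """Multiply an m x k matrix by a k x n matrix (lists of rows)."""
    k, n = len(h), len(h[0])
    return [[sum(w_row[s] * h[s][j] for s in range(k)) for j in range(n)]
            for w_row in w]


# m = 3 features on rows, k = 2 signatures on columns
w = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
# k = 2 signatures on rows, n = 4 samples on columns
h = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 1.0]]

wh = matmul(w, h)  # m x n: features on rows, samples on columns
shape = (len(wh), len(wh[0]))
```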

Decomposition objects can be sliced using the syntax:

sliced_model = model[samples, features, signatures]
# Only slice on one dimension
sliced_signatures = model[:, :, ['S1', 'S2']]

Slices must be ordered collections of strings, integer indices, or booleans.

compare_signatures(b: Comparable) → pandas.DataFrame[source]¶

Similarity between these signatures and one other set.

Similarity here is defined as the cosine of the angle between each pair of signature vectors, so 1 is identical (ignoring scale) and 0 is perpendicular.

This is a convenience method which calls stability.compare_signatures().

Parameters:: b – Signature matrix, or object with signature matrix
Returns:: Matrix with cosine of angles between signature vectors.
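A minimal sketch of the pairwise comparison, in plain Python and independent of stability.compare_signatures(): each signature is a vector of feature weights, and every signature in one set is compared with every signature in the other. The signature names and values are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def compare(a, b):
    """Nested dict of cosine similarity for each pair of signatures."""
    return {name_a: {name_b: cosine(vec_a, vec_b)
                     for name_b, vec_b in b.items()}
            for name_a, vec_a in a.items()}


model_a = {"S1": [1.0, 0.0, 0.0], "S2": [0.0, 1.0, 1.0]}
model_b = {"T1": [2.0, 0.0, 0.0], "T2": [0.0, 1.0, 0.0]}
similarity = compare(model_a, model_b)
```

S1 and T1 differ only in scale, so their similarity is 1; S1 and T2 share no features, so their similarity is 0.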

consensus_matrix(on: Literal['h', 'w'] | pandas.DataFrame = 'h') → scipy.sparse.csr_array[source]¶

Consensus matrix of either \(H\) or \(W\).

Most typically, the consensus matrix is calculated on the \(H\) matrix, and is a binary matrix representing whether sample \(i\) is assigned to the same signature as sample \(j\). Samples are assigned to signatures based on their maximum weight. When calculated on \(W\), it is the same but for features assigned.

The primary use of this is in generating a \(\bar{C}\) matrix, the mean number of times two elements are assigned to the same signature. \(\bar{C}\) is used to calculate the cophenetic_correlation() and dispersion() coefficients, a method of determining suitable rank.

This is returned as a lower triangular matrix in sparse format.
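The construction can be sketched in plain Python for a small \(H\) matrix. This is an illustration of the idea only; the function names are hypothetical, the result is a dense list of lists rather than a scipy sparse array, and including the diagonal is an assumption.

```python
def assign_signatures(h):
    """Index of the maximum-weight signature (row) for each sample (column)."""
    n_samples = len(h[0])
    return [max(range(len(h)), key=lambda s: h[s][j])
            for j in range(n_samples)]


def consensus_lower(h):
    """Lower triangular binary matrix: 1 where two samples share a signature."""
    labels = assign_signatures(h)
    n = len(labels)
    # Assumption: the diagonal is kept (each sample matches itself)
    return [[1 if j <= i and labels[i] == labels[j] else 0
             for j in range(n)]
            for i in range(n)]


# 2 signatures on rows, 3 samples on columns
h = [[0.9, 0.2, 0.7],
     [0.1, 0.8, 0.3]]
labels = assign_signatures(h)
consensus = consensus_lower(h)
```

Averaging such matrices element-wise over repeated decompositions gives the \(\bar{C}\) matrix described above.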

discrete_signature_scale(axis: Literal['x', 'y']) → plotnine.scale_x_discrete | plotnine.scale_y_discrete[source]¶

Make a plotnine scale which puts the signatures in order.

By default, plotnine will alphabetically sort (S1, S11 .. S2, S21), this produces a scale object which can be added to a plot to put the signatures in their order in this object.
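The ordering problem is easy to reproduce in plain Python. The sketch below shows the alphabetical order of four signature names and then restores the order in which they appear in a model; the names are examples only.

```python
# Order of signatures as they appear in a (hypothetical) model
model_order = ["S1", "S2", "S11", "S21"]

# Plain string sorting interleaves the names
alphabetical = sorted(model_order)

# Restore the model order by looking up each name's position
position = {name: index for index, name in enumerate(model_order)}
restored = sorted(alphabetical, key=position.__getitem__)
```

With plotnine, an explicit order like this is typically supplied through the limits argument of a discrete scale, which is presumably what the returned scale object encapsulates.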

static load(in_dir: os.PathLike, x: pandas.DataFrame | str | os.PathLike | None = None, delim: str = '\t')[source]¶

Load a decomposition from disk.

Loads a decomposition previously saved using save(). Will automatically determine whether this is a directory or .tar.gz. Can provide a DataFrame of the \(X\) input matrix, primarily this is so when loading multiple decompositions they can all reference the same object. Can also provide an explicit path; if not provided will attempt to load from x.tsv.

Parameters:

in_dir – Directory or .tar.gz containing decomposition.
x – Either the X input matrix as a DataFrame, or a path to a delimiter-separated copy of the X matrix. If None, will attempt to load from x.tsv.
delim – Delimiter for tabular data

static load_decompositions(in_dir: os.PathLike, delim: str = '\t') → Dict[int, List[Decomposition]][source]¶

Load multiple decompositions.

Load a set of decompositions previously saved using save_decompositions(). Will attempt to share a reference to the same \(X\) matrix for memory reasons. The output is a dictionary with keys being ranks, and values being lists of decompositions for that rank.

Parameters:

in_dir – Directory to read from
delim – Delimiter for tabular data files

match_signatures(b: Comparable) → pandas.DataFrame[source]¶

Identify optimal matches between these signatures and one other set

Find the pairing of signatures which are most similar. More technically, this finds the pairing of signatures which maximises the total cosine similarity using the Hungarian algorithm. It is possible that a signature gets paired with another for which the cosine similarity is not highest, suggesting a potentially bad match between some signatures in the model.

The return is a dataframe with columns a and b for which signatures are paired, the cosine similarity of the pairing, and the maximum ‘off-target’ cosine value for any of the signatures which it was not assigned to. The intention for the off-target score is that ideally this would be low, and the paired similarity high: signatures match well their paired one, while being dissimilar to all others.

This is a convenience method which calls stability.match_signatures().

Parameters:: b – Signature matrix, or object with signature matrix
Returns:: DataFrame with pairing and scores
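To make the pairing and the off-target score concrete, the sketch below finds the assignment with the highest total similarity by brute force over all permutations. This replaces the Hungarian algorithm with exhaustive search, which is only practical for a handful of signatures; in practice `scipy.optimize.linear_sum_assignment` is the usual tool. The similarity values and the names `best_pairing`, `pair_score` and `max_off_target` are invented for the example, and the off-target score is computed along rows only as a simplification.

```python
from itertools import permutations


def best_pairing(similarity):
    """Pair row signatures with column signatures to maximise total similarity."""
    n = len(similarity)
    # Brute force stand-in for the Hungarian algorithm (fine for small n)
    best = max(permutations(range(n)),
               key=lambda perm: sum(similarity[i][perm[i]] for i in range(n)))
    pairs = []
    for i, j in enumerate(best):
        # Highest similarity to any signature other than the assigned one
        off_target = max((similarity[i][k] for k in range(n) if k != j),
                         default=0.0)
        pairs.append({"a": i, "b": j,
                      "pair_score": similarity[i][j],
                      "max_off_target": off_target})
    return pairs


similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.10, 0.20],
    [0.10, 0.20, 0.70],
]
pairs = best_pairing(similarity)
```

Here the first signature is paired with the second column (0.80) even though it is most similar to the first (0.90), because that pairing gives the higher total; its off-target score of 0.90 flags the potentially poor match.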

monodominant_samples(threshold: float = 0.9) → pandas.DataFrame[source]¶

Which samples have a monodominant signature.

A monodominant signature is one which represents at least the threshold amount of the weight in the scaled() \(H\) matrix.

Parameters:: threshold – Proportion of the scaled H matrix weight to consider a signature dominant.
Returns:: Dataframe with column is_monodominant indicating if a sample has a monodominant signature, and signature_name indicating the name of the signature, or none if not.
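The rule can be sketched in plain Python: scale each sample's signature weights to proportions and check whether the largest one reaches the threshold. The function below is a hypothetical stand-in that returns a dictionary rather than a DataFrame, and treating "at least" as greater than or equal is taken from the description above.

```python
def monodominant(h, sample_names, signature_names, threshold=0.9):
    """Map each sample to (is_monodominant, signature name or None)."""
    result = {}
    for j, sample in enumerate(sample_names):
        column = [row[j] for row in h]  # weights of all signatures in sample
        total = sum(column)
        top = max(range(len(column)), key=column.__getitem__)
        is_mono = column[top] / total >= threshold
        result[sample] = (is_mono, signature_names[top] if is_mono else None)
    return result


# 2 signatures on rows, 2 samples on columns
h = [[9.5, 3.0],
     [0.5, 7.0]]
flags = monodominant(h, ["sampleA", "sampleB"], ["S1", "S2"])
```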

name_signatures_by_weight(cumulative_sum: float = 0.4, max_char_length: int = 10, max_num_features: int = 5, feature_delimiter: str = '+', number: bool = True, clean: Callable[[str], str] = lambda x: ...) → None[source]¶

Give a slightly more descriptive name to each signature.

Append features with highest relative weights to the end of signature names. This alters the object in place.

Parameters:

cumulative_sum – Add features up to this cumulative sum (from max to min).
max_char_length – Maximum length of new name (before joining with feature delimiter).
max_num_features – Maximum number of features to use in name.
feature_delimiter – When multiple features used, will join with this character
number – Number the signatures. When true, starts each new name with S1, S2, etc.
clean – Function to clean the string. Defaults to replacing spaces with underscores.

pcoa(on: pandas.DataFrame | Literal['x', 'h', 'wh', 'signatures'] = 'h', distance: str = 'braycurtis', wisconsin_standardise: bool = True, sqrt: bool = True) → skbio.OrdinationResults[source]¶

Principal Coordinates Analysis of decomposition.

Performs PCoA on the specified matrix, and returns a scikit-bio OrdinationResults object. Can base distances on any matrix which has a column for each sample, or specify one of these via string. Defaults to distances based on scaled() \(H\) (signature weight in sample) matrix.

Matrix is Wisconsin double standardised by default, as described in R function cmdscale.

Distance defaults to Bray-Curtis dissimilarity, and is square root transformed. Distance is calculated with scipy pdist function, and any method supported there can be specified in distance argument.

Parameters:

on – Matrix to derive distances from
distance – Distance method to use
wisconsin_standardise – Apply Wisconsin double standardisation
sqrt – Square root transform distances

Returns:

PCoA results object from scikit-bio
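The preprocessing steps before ordination can be sketched in plain Python. The version below assumes Wisconsin double standardisation means dividing each row (feature or signature) by its maximum and then each column (sample) by its total, followed by Bray-Curtis dissimilarity between samples and an optional square root; the order of operations inside cvanmf may differ, and the eigendecomposition step of PCoA itself is omitted.

```python
import math


def wisconsin(x):
    """Divide each row by its maximum, then each column by its total."""
    by_max = []
    for row in x:
        row_max = max(row)
        by_max.append([v / row_max if row_max > 0 else 0.0 for v in row])
    n_cols = len(x[0])
    totals = [sum(row[j] for row in by_max) for j in range(n_cols)]
    return [[row[j] / totals[j] if totals[j] > 0 else 0.0
             for j in range(n_cols)]
            for row in by_max]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative vectors."""
    return (sum(abs(a - b) for a, b in zip(u, v))
            / sum(a + b for a, b in zip(u, v)))


def sample_distances(x, sqrt=True):
    """Pairwise distances between the columns (samples) of x."""
    samples = list(zip(*wisconsin(x)))
    n = len(samples)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = bray_curtis(samples[i], samples[j])
            dist[i][j] = dist[j][i] = math.sqrt(d) if sqrt else d
    return dist


# 2 rows (e.g. signatures), 3 samples on columns
x = [[4.0, 0.0, 2.0],
     [0.0, 6.0, 3.0]]
dist = sample_distances(x)
```

The first two samples share nothing and so sit at the maximum distance of 1, while the third sample is an even mixture and lies between them.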

plot_feature_weight(threshold: float = 0.04, label_fn: Callable[[str], str] = None) → plotnine.ggplot[source]¶

Plot features which contribute to each signature.

Represent the relative contribution of features to signatures, showing any features which contribute over a threshold proportion of the weight.

Parameters:

threshold – Show any features which contribute more than this proportion of the weight for this signature.
label_fn – Function to map labels (use to make shortened labels for example)

plot_metadata(metadata: pandas.DataFrame, against: pandas.DataFrame | Literal['signature', 'model_fit', 'both'] | None = None, continuous_fn: Callable[[pandas.Series], bool] | None = None, discrete_fn: Callable[[pandas.Series], bool] | None = None, boxplot_params: Dict | None = None, point_params: Dict | None = None, disc_rotate_labels: float | None = None, show_significance: bool = True, significance_formatter: Callable[[float, float, float], str] | None = None, univariate_test_params: Dict[str, Any] = None) → Tuple[plotnine.ggplot, plotnine.ggplot][source]¶

Plot relative signature weight against metadata.

Produce plots of signature weight against metadata. Produces two plots, one with boxplots for categorical metadata, one with scatter plots for continuous metadata. Will infer which type each column is. To use an integer as categorical, convert it to Categorical type in pandas. Will conduct univariate tests as described in univariate_tests() and indicate significance with symbols. This will be skipped if show_significance is False, which may be sensible for larger numbers of samples and metadata categories.

Parameters:

metadata – Dataframe with samples on rows, and metadata on columns.
against – DataFrame to plot the metadata against. Should contain an entry for each sample, with samples on rows. Defaults to scaled() \(H\) matrix (transpose of typical \(H\) orientation).
continuous_fn – Function to determine if a column is continuous. Defaults to considering any floating type or integer to be continuous. May want to customise if you want to use things such as date time formats.
discrete_fn – Function to determine if a column is categorical. Defaults to considering any string or object type column with a number of unique values < 3/4 its length as categorical.
boxplot_params – Dictionary of parameters to pass to geom_boxplot. These will be fixed parameters (so color="pink" to set all box outlines to pink).
point_params – Dictionary of parameters to pass to geom_point. Will be fixed parameters, see above.
disc_rotate_labels – Angle to rotate x axis labels by for boxplots.
show_significance – Add significance to each subplot for discrete metadata.
significance_formatter – Function which takes the unadjusted p-value, the locally adjusted p-value, and the globally adjusted p-value, and returns a string to use as label. Defaults to Decomposition.significance_format().
univariate_test_params – Parameters passed to univariate_tests()

Returns:: A tuple of plotnine ggplot objects, first is boxplots, second is scatter plots.

plot_modelfit(group: pandas.Series | None = None) → plotnine.ggplot[source]¶

Plot model fit distribution.

This provides a histogram of the model fit of samples by default. If a grouping is provided, this will instead produce boxplots with each box being the distribution within a group.

Parameters:: group – Series giving label for group which each sample belongs to. Samples which are not in the group series will be dropped from results with a warning.
Returns:: Histogram or boxplots

plot_modelfit_point(threshold: float | None = 0.4, yrange: Tuple[float, float] | None = (0, 1), point_size: float = 1.0) → plotnine.ggplot[source]¶

Model fit for each sample as a point on a vertical scale.

It may be of interest to look at the model fit of individual samples, so this plot shows the model fit of each sample as a point on a vertical scale. A threshold can be set below which the point will be coloured red to indicate low model fit, by default this is 0.4.

Parameters:: threshold – Value below which to colour the model fit red. If omitted will not color any samples. The default of 0.4 is specific to the 5ES model (cvanmf.models.five_es()) and does not necessarily represent a good threshold for other models.

plot_pcoa(axes: Tuple[int, int] = (0, 1), color: pandas.Series | Literal['signature'] = 'signature', shape: pandas.Series | Literal['signature'] | None = None, signature_arrows: bool = False, point_aes: Dict[str, Any] = None, **kwargs) → plotnine.ggplot[source]¶

Ordination of samples.

Perform PCoA of samples and plot first two axes. PCoA performed by the pcoa() method, and arguments in kwargs are passed on to this method. Samples are coloured by primary ES.

Parameters:

axes – Indices of PCoA axes to plot
color – Metadata to use to color the points, or ‘signature’ to color based on the primary signature
shape – Metadata used to decide shape of points, or ‘signature’ to base shape on the primary signature
signature_arrows – Plot location of signatures as arrows
point_aes – Dictionary of arguments to pass to geom_point
kwargs – arguments to pass to pcoa()

Returns:

Scatter plot of samples

plot_relative_weight(group: pandas.Series | None = None, group_colors: pandas.Series | None = None, model_fit: bool = True, heights: Dict[str, float] | Iterable[float] = None, width: float = 6.0, sample_label_size: float = 5.0, legend_cols_sig: int = 3, legend_cols_grp: int = 3, legend_side: str = 'bottom', **kwargs)[source]¶

Plot relative weight of each signature in each sample.

To display the plot in a notebook environment, use result.render(). Please note this plot uses the marsilea package rather than plotnine like other plots. Unfortunately, the options for combining multiple elements are not yet well developed in plotnine.

Plots a stacked bar chart with a bar for each sample displaying the relative weight of each signature. Optionally the plot can also include a section at the top summarising the model fit for each sample, and a ribbon along the bottom displaying categorical metadata for samples.

Parameters:

group – Categorical metadata for each sample to plot on ribbon at the bottom
group_colors – Colour to associate with each of the metadata categories.
model_fit – Include a top row indicating model fit per sample.
heights – Height in inches for each component of the plot. Specify as a dictionary with keys ‘dot’, ‘ribbon’, ‘bar’, ‘labels’, or a list with heights for the elements included from top to bottom.
width – Width of plot.
sample_label_size – Size for sample labels. Set to 0 to remove sample labels.
legend_cols_sig – Number of columns in Signature legend.
legend_cols_grp – Number of columns in group legend.
legend_side – Location of Signature and group legend. One of ‘top’, ‘right’, ‘left’, ‘bottom’

Returns:

Marsilea whiteboard object. Call .render() to show plot.

plot_weight_distribution(threshold: float = 0.0, scale_transform: str | None = 'log10', nrows: int = 1) → plotnine.ggplot[source]¶

Plot the distribution of feature weights in each signature.

The distribution of signature weights helps describe how mixed the features are which describe a sample. This will sort feature weights for each signature independently, and plot a bar for the weight of each feature. So distributions which are longer indicate more features contribute to that signature, and the height of bars indicates whether this is a long tail of low weights, all even, etc.

Parameters:

threshold – Set any weight below this to 0. Effectively, consider very low weights to not contribute to the signature.
scale_transform – Transformation to apply to the feature weight axis. Can be any of the transforms in mizani. For no transformation, pass None or “identity”.
nrows – Number of rows in the plot. Defaults to having all plots on one row for comparability.

reapply(y: pandas.DataFrame, input_validation: cvanmf.reapply.InputValidation | None = None, feature_match: cvanmf.reapply.FeatureMatch | None = None, **kwargs) → Decomposition[source]¶

Get signature weights for new data.

When the features in y exactly match those used to learn this decomposition, you can set the input_validation and feature_match parameters as None.

In some cases, the features in new data y may not exactly match those used in the original decomposition. For instance, new microbiome data may contain different taxa, or may use a different naming format. The feature_match function can be used to handle these cases, by defining a function to map names between new and existing data. The input_validation function is largely used for existing models, to validate that the data being provided is in the expected format.

Parameters:

y – New data of the same type used to generate this decomposition
input_validation – Function to validate and transform y
feature_match – Function to match features in y and w
kwargs – Arguments to pass to validate_input and feature_match

Returns:

Decomposition with signature weights for samples in y.

representative_signatures(threshold: float = 0.9) → pandas.DataFrame[source]¶

Which signatures describe a sample.

Identify which signatures contribute to describing a sample. Representative signatures are those for which the cumulative sum is equal to or lower than the threshold value.

This is done by considering each sample in the sample scaled() \(H\) matrix, and taking a cumulative sum of weights in descending order. Any signature for which the cumulative sum is less than or equal to the threshold is considered representative.

Parameters:: threshold – Cumulative sum below which signatures are considered representative.
Returns:: Boolean dataframe indicating whether a signature is representative for a given sample.
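The cumulative sum rule can be sketched for a single sample in plain Python. This follows the description literally (weights scaled to proportions, sorted in descending order, signatures kept while the running total is at most the threshold); the function name and the example weights are hypothetical.

```python
def representative(weights, threshold=0.9):
    """Flag signatures whose descending cumulative proportion <= threshold."""
    total = sum(weights.values())
    flags = {name: False for name in weights}
    running = 0.0
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    for name, weight in ranked:
        running += weight / total  # proportion of this sample's total weight
        flags[name] = running <= threshold
    return flags


# Unscaled signature weights for one sample
sample = {"S1": 50, "S2": 30, "S3": 15, "S4": 5}
flags = representative(sample)
```

The running totals are 0.5, 0.8, 0.95 and 1.0, so only S1 and S2 fall within the default threshold of 0.9.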

Write decomposition to disk.

Export this decomposition and associated data. This is written to text type files (tab separated for tables, yaml for dictionaries) to allow simpler reading in other analysis environments such as R. Exceptions are raised if any tables cannot be written; plots are allowed to fail, but failures will produce log entries.

Parameters:

out_dir – Directory to write to. Must be empty.
compress – Create compressed .tar.gz rather than directory.
param_path – Path to YAML file containing parameters used. If not given will create a copy in the directory. If given and symlink is True, will try to make a symlink to parameters file.
x_path – Path to X matrix used. Behaves as param_path for copies/ symlinks.
symlink – Make symlinks to param_path and x_path if possible.
delim – Delimiter to use for tabular output.
plots – Determine which plots to write. When left as the default (None), all plots are produced if there are 500 or fewer samples. If True, all plots will be produced; if False, no plots will be produced. If a list is provided, any plots named in the list will be produced, i.e. if given [‘pcoa’, ‘modelfit’, ‘radar’], plots from plot_pcoa() and plot_modelfit() would be produced. ‘radar’ would be ignored as there is no plot_radar method.

static save_decompositions(decompositions: Dict[int, List[Decomposition]], output_dir: pathlib.Path, symlink: bool = True, delim: str = '\t', compress: bool = False, **kwargs) → None[source]¶

Save multiple decompositions to disk.

Write multiple decompositions to disk. The structure is that a directory is created for each rank, then within that a directory for each decomposition. By default the input data and parameters will be saved at the top level, and symlinked to by each individual decomposition.

The files output are tables for W and H matrices, scaled W and H, tables of basic analyses (primary es etc), and all default plots where possible.

Parameters:

decompositions – Decompositions in form output by decompositions().
output_dir – Directory to write to which is either empty or does not exist.
symlink – Symlink the parameters and input X files.
delim – Delimiter for tabular output.
compress – Compress each decomposition folder to .tar.gz
**kwargs –
Passed to Decomposition.save()

scaled(matrix: pandas.DataFrame | Literal['h', 'w'], by: str | None = None) → pandas.DataFrame[source]¶

Total sum scaled version of a matrix.

Scale a matrix to a proportion of the feature/sample total, or to a proportion of the signature total.

Parameters:

matrix – Matrix to be scaled, one of h or w, or a string from {'h', 'w'}.
by –
Scale to proportion of sample, feature, or signature total. This defaults to
- \(H\): sample
- \(W\): signature

Returns:

Scaled version of matrix.
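Total sum scaling with the default choices can be sketched in plain Python. With the orientation used in this package, samples are the columns of \(H\) and signatures are the columns of \(W\), so both defaults amount to dividing each column by its total. The helper `scale_columns` is hypothetical and assumes no column sums to zero.

```python
def scale_columns(matrix):
    """Divide every value by the total of its column."""
    n_cols = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_cols)]
    # Assumption: every column has a non-zero total
    return [[row[j] / totals[j] for j in range(n_cols)] for row in matrix]


# H: 2 signatures on rows, 2 samples on columns (scaled per sample)
h = [[2.0, 1.0],
     [6.0, 3.0]]
scaled_h = scale_columns(h)

# W: 2 features on rows, 2 signatures on columns (scaled per signature)
w = [[1.0, 0.0],
     [3.0, 5.0]]
scaled_w = scale_columns(w)
```

After scaling, each sample's signature weights in \(H\) sum to 1, which is the form used by methods such as monodominant_samples() and representative_signatures().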

static significance_format(p: float, local_adj: float, global_adj: float) → str[source]¶

Convert p-values from univariate tests to display strings.

By default, this will use the following strategy:

global_adj <= 0.01 -> ***
global_adj <= 0.05 -> **
global_adj <= 0.1 -> *
p <= 0.01 -> ..
p <= 0.05 -> .

If implementing a custom formatter, p is the unadjusted p-value, local_adj the adjusted p-value only considering the tests for one metadata category, and global_adj considering all tests.
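A custom formatter only needs to accept the three values and return a string. The sketch below reimplements the default strategy listed above as a starting point; checking the rules from top to bottom and returning an empty string when none applies are assumptions about the default behaviour, and `significance_label` is a hypothetical name.

```python
def significance_label(p, local_adj, global_adj):
    """Map p-values to a display string, following the listed strategy."""
    # Globally adjusted p-values take precedence (assumed top-to-bottom order)
    if global_adj <= 0.01:
        return "***"
    if global_adj <= 0.05:
        return "**"
    if global_adj <= 0.1:
        return "*"
    # Fall back to the unadjusted p-value; local_adj is unused here
    if p <= 0.01:
        return ".."
    if p <= 0.05:
        return "."
    return ""  # Assumption: no marker when nothing is significant


strong = significance_label(p=0.0001, local_adj=0.001, global_adj=0.004)
nominal = significance_label(p=0.03, local_adj=0.2, global_adj=0.4)
```

A function with this signature could be passed as the significance_formatter argument of plot_metadata().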

univariate_tests(metadata: pandas.DataFrame, against: pandas.DataFrame | Literal['signature', 'model_fit', 'both'] | None = None, drop_na: bool = True, adj_method: str = 'fdr_bh', alpha: float = 0.05) → pandas.DataFrame[source]¶

Test if signature relative weights vary between categories

Test whether model weights are different between groups using non-parametric univariate tests. Currently uses the Mann-Whitney U-test for two-category cases, and the Kruskal-Wallis test for cases with more than two categories.

For K-W tests, post-hoc tests are performed using Dunn’s test, with the same adjustment and alpha values. Significant post-hoc tests are returned as a string in the results table, with the format A|B(0.001) for a significant result for pair A and B with adjusted p value of 0.001.

Parameters:

metadata – Dataframe of metadata variables to test against. Can only handle discrete values.
against – What to test the metadata against. This can be signature for relative \(H\) weights, model_fit for per sample cosine similarity, or both (default). You can also provide any arbitrary matrix with the correct dimensions, for instance if you had done some custom processing of the \(H\) matrix, or wanted to use absolute \(H\) weights. An arbitrary matrix should not contain any NA values; any columns with NAs will have NA for all statistical test results.
drop_na – Remove any samples with NA values from metadata before testing. This is done on a per test basis, so one NA will not cause a sample to be removed for all tests.
adj_method – Method to adjust for multiple tests. This is applied both locally (for each metadata category), and globally (considering all tests). Accepts any method supported by statsmodels multipletests.
alpha – Threshold value to reject \(H0\).

Returns:

Dataframe with results for each signature and each metadata variable.

LOAD_FILES: List[str] = ['x.tsv', 'h.tsv', 'w.tsv', 'parameters.yaml', 'properties.yaml']¶: Defines the files which are loaded to recreate a decomposition object from disk.

TOP_CRITERIA: Dict[str, bool]¶

Defines which criteria are available to select the best decomposition based on, and whether to take high values (True) or low values (False).

property beta_divergence: float¶: The beta divergence (using the method defined in the parameters object) between \(X\) and \(WH\).

property color_scale: plotnine.scale_color_discrete¶: Plotnine scale for color aesthetic using signature colors.

property colors: List[str]¶

Colors which represent each signature in plots.

Colors default to a colorblind distinct palette.

Colors can be changed by setting this property. A list can be provided, or a dictionary mapping signature name to new color:

# For a model with three signatures S1, S2, S3
# Change all colors with list
model.colors = ['red', 'blue', '#ffffff']
# Change two colors using dictionary
model.colors = dict(S1='green', S3='#000000')

property cosine_similarity: float¶

Cosine angle between flattened \(X\) and \(WH\).

A measure of how well the model reconstructs the input data. Ranges between 1 and 0, with 1 being perfect correlation, and 0 meaning the model is perpendicular to the input (no correlation). The same measure is available for each sample using model_fit.
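The measure can be sketched in plain Python as follows. This is an illustration of the cosine between two flattened matrices, and of the per-sample variant computed column by column, not cvanmf's own implementation; the helper names are hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (NaN if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else float("nan")


def flatten(matrix: List[List[float]]) -> List[float]:
    return [value for row in matrix for value in row]


def column(matrix: List[List[float]], j: int) -> List[float]:
    return [row[j] for row in matrix]


# Toy input X and reconstruction WH (features on rows, samples on columns).
x = [[1.0, 0.0], [0.0, 2.0]]
wh = [[2.0, 0.0], [0.0, 4.0]]

# Whole-model fit: cosine between the flattened matrices.
overall = cosine(flatten(x), flatten(wh))

# Per-sample fit: cosine between matching columns.
per_sample = [cosine(column(x, j), column(wh, j)) for j in range(len(x[0]))]

# Perpendicular vectors give 0.
perpendicular = cosine([1.0, 0.0], [0.0, 1.0])
```

Because cosine similarity ignores overall scale, the reconstruction above, which is exactly twice the input, still scores as a perfect fit.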

property feature_mapping: reapply.FeatureMapping¶

Mapping of new data features to those in the model being reapplied

When fitting new data to an existing model, the naming of features may vary or some features may not exist in the model. This property holds an object which maps from the new data features to the model features. For de-novo decompositions this will be None.

property fill_scale: plotnine.scale_fill_discrete¶: Plotnine scale for fill aesthetic using signature colors.

property h: pandas.DataFrame¶

Signature weights in each sample.

Matrix with samples on columns, and signatures on rows, with each entry being a signature weight. This is not scaled, see scaled().

property input_hash: int¶: Hash of the input matrix. Used to validate loads where data was not included in the saved form.

property l2_norm: float¶: L2 norm between flattened \(X\) and \(WH\).

property model_fit: pandas.Series¶: How well each sample \(i\) is described by the model, expressed by the cosine angle between \(X_i\) and \((WH)_i\). Cosine angle ranges between 0 and 1 in this case, with 1 being good and 0 poor (perpendicular).

property names: List[str]¶

Names for each of the signatures.

New names for signatures can be given as a list. This will change the name in the w and h matrices:

# Set new names for a model with 4 signatures
model.names = ['A', 'B', 'X', 'Y']

property parameters: NMFParameters¶: Parameters used during decomposition.

property primary_signature: pandas.Series¶

Signature with the highest weight for each sample.

The primary signature for a sample is the one with the highest weight in the \(H\) matrix. In the unusual case where all signatures have 0 weight for a sample, this will return NaN, and is likely a sign of a poor model.
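The rule can be sketched without cvanmf as below. The function name, the dictionary representation of \(H\), and breaking ties in favour of the first signature are all assumptions made for this illustration.

```python
import math
from typing import Dict, List, Union


def primary_signature(
    h: Dict[str, List[float]], samples: List[str]
) -> Dict[str, Union[str, float]]:
    """Return the highest-weight signature per sample, or NaN if all are 0."""
    result: Dict[str, Union[str, float]] = {}
    for j, sample in enumerate(samples):
        # max() keeps the first signature on ties (an assumption of this sketch).
        best = max(h, key=lambda signature: h[signature][j])
        result[sample] = best if h[best][j] > 0 else float("nan")
    return result


# Toy H matrix: signature -> weight in each sample.
h = {"S1": [0.9, 0.1, 0.0], "S2": [0.1, 0.7, 0.0]}
primary = primary_signature(h, ["a", "b", "c"])
```

Sample c has zero weight for every signature, so it is reported as NaN rather than being assigned arbitrarily.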

property quality_series: pandas.Series¶

Quality measures (r_squared, cosine similarity etc) as series.

Each decomposition has a range of values describing its properties and approximation of the input data. This property is a series which includes all of these properties.

property r_squared: float¶

Coefficient of determination (\(R^2\)) between flattened \(X\) and \(WH\).

A measure of how well the model reconstructs the input data.

property rss: float¶: Residual sum of squares between flattened \(X\) and \(WH\).
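As a worked illustration of these two measures, the sketch below computes the residual sum of squares and \(R^2\) on flattened toy data. It uses the common definition \(R^2 = 1 - RSS/TSS\), with the total sum of squares taken around the mean of \(X\); whether cvanmf uses exactly this form is an assumption.

```python
import math
from typing import List


def rss(x: List[float], wh: List[float]) -> float:
    """Residual sum of squares between flattened X and WH."""
    return sum((a - b) ** 2 for a, b in zip(x, wh))


def r_squared(x: List[float], wh: List[float]) -> float:
    """Coefficient of determination, assuming R^2 = 1 - RSS / TSS."""
    mean_x = sum(x) / len(x)
    tss = sum((a - mean_x) ** 2 for a in x)
    return 1.0 - rss(x, wh) / tss


x_flat = [1.0, 2.0, 3.0, 4.0]
wh_flat = [1.0, 2.0, 3.0, 5.0]

residual = rss(x_flat, wh_flat)      # one entry is off by 1
fit = r_squared(x_flat, wh_flat)     # 1 - 1/5
```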

property sparsity_h: float¶

Sparsity of h matrix.

This is the proportion of entries in the \(H\) matrix which are 0.

property sparsity_w: float¶

Sparsity of w matrix.

This is the proportion of entries in the \(W\) matrix which are 0.
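The same calculation applies to both sparsity_h and sparsity_w. A minimal sketch, using a hypothetical helper and a toy matrix:

```python
from typing import List


def sparsity(matrix: List[List[float]]) -> float:
    """Proportion of entries in the matrix which are exactly 0."""
    values = [value for row in matrix for value in row]
    return sum(1 for value in values if value == 0) / len(values)


# Toy W matrix: 3 features (rows) x 2 signatures (columns).
w = [
    [0.0, 1.5],
    [2.0, 0.0],
    [0.0, 0.0],
]
w_sparsity = sparsity(w)  # 4 of the 6 entries are zero
```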

property w: pandas.DataFrame¶

Feature weights in each signature.

Matrix with signatures on columns, and features on rows, with each entry being the weight of a feature in a signature. This is not scaled, see scaled().

property wh: pandas.DataFrame¶: Product of decomposed matrices \(W\) and \(H\) which approximates input.

class cvanmf.denovo.NMFParameters[source]¶

Bases: NamedTuple

Parameters for a single decomposition, or iterations of bi-cross validation. See sklearn NMF documentation for more detail on parameters.

to_yaml(path: pathlib.Path)[source]¶

Write parameters to a YAML file.

Save the parameters, except the input matrix, to a YAML file.

Parameters:: path – File to write to

alpha: float = 0.0¶: Regularisation parameter applied to both \(H\) and \(W\) matrices.

beta_loss: str = 'kullback-leibler'¶: Beta loss function for NMF decomposition.

init: str = 'nndsvdar'¶: Initialisation method for \(H\) and \(W\) matrices on first step. Defaults to randomised non-negative SVD with small random values added to 0s.

keep_mats: bool = False¶: Whether to return the \(H\) and \(W\) matrices as part of the results.

l1_ratio: float = 0.0¶: Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0.

property log_str: str¶: Format parameters in readable way for logs/console.

max_iter: int = 3000¶: Maximum number of iterations during decomposition. Will terminate earlier if solution converges.

rank: int¶: Rank of the decomposition.

seed: int | numpy.random.Generator | str | None = None¶: Random seed for initialising decomposition matrices; if None no seed used so results will not be reproducible.

x: BicvSplit | pandas.DataFrame | None¶: For a simple decomposition, a matrix as a dataframe. For a bi-cross validation iteration, this should be the shuffled matrix split into mn parts, where m is the number of parts along rows, n along columns. When returning results and keep_mats is False, this will be set to None to avoid passing and saving large data.

cvanmf.denovo.bicv(params: NMFParameters | None = None, **kwargs) → BicvResult[source]¶

Perform a single run of bicrossvalidation.

Perform one run of bicrossvalidation. Parameters can either be passed as a NMFParameters tuple and are documented there, or by keyword arguments using the same names as NMFParameters.

Returns:: Comparisons of the held out submatrix and estimate for each fold

cvanmf.denovo.cli_decompose(input: str, output_dir: str, delimiter: str, progress: bool, verbosity: str, seed: int, l1_ratio: float, alpha: float, max_iter: int, beta_loss: str, init: str, n_runs: int, top_n: int, top_criteria: str, compress: bool, ranks: List[int], symlink: bool) → None¶

Decompositions for RANKS.

RANKS is a list of ranks for which to generate decompositions.

Generate a number of decompositions for each of the specified ranks. NMF solutions are non-unique and depend on initialisation, so when using an initialisation with randomness multiple solutions can be produced. From these solutions, the best can be retained based on criteria such as reconstruction error or cosine similarity.

Some initialisation methods are deterministic, and as such only a single decomposition will be produced.

The output is H and W matrices for each decomposition, tables of quality scores, and some analyses with default parameters. For further analysis, decompositions can be loaded using Decomposition.from_dir, or tables used directly for custom analyses. By default, a symlink to the input data

cvanmf.denovo.cli_rank_selection(input: str, output_dir: str, delimiter: str, shuffles: int, progress: bool, verbosity: str, seed: int, rank_min: int, rank_max: int, rank_step: int, l1_ratio: float, alpha: float, max_iter: int, beta_loss: str, init: str, design: Tuple[int, int]) → None¶

Rank selection for NMF using mn-fold bi-cross validation

Attempt to identify a suitable rank k for decomposition of input matrix X. This is done by shuffling the matrix a number of times, and for each shuffle dividing it into m x n submatrices (m splits on rows, n splits on columns). Each of these submatrices is held out and an estimate learnt from the remaining matrices, and the quality of the estimated matrix used to identify a suitable rank.

The underlying NMF implementation is from scikit-learn, and more documentation for many of the NMF specific parameters is available there.

cvanmf.denovo.cli_regu_selection(input: str, output_dir: str, delimiter: str, shuffles: int, progress: bool, verbosity: str, seed: int, rank: int, alpha: List[float], l1_ratio: float, max_iter: int, beta_loss: str, init: str, scale: bool, design: Tuple[int, int]) → None¶

Regularisation selection for NMF on ALPHA 9 fold bi-cross validation

Attempt to identify a suitable regularisation parameter alpha for decomposition of input matrix X at a given rank with a given ratio between L1 and L2 regularisation. This is done by shuffling the matrix a number of times, and for each shuffle dividing it into 9 submatrices. Each of these nine is held out and an estimate learnt from the remaining matrices, and the quality of the estimated matrix used to identify a suitable alpha.

The underlying NMF implementation is from scikit-learn, and more documentation for many of the NMF specific parameters is available there.

ALPHA is a list of values to be tested. 0.0 will always be added.

cvanmf.denovo.cophenetic_correlation(decompositions: Dict[int, List[Decomposition]], on: Literal['h', 'w'] = 'h') → pandas.Series[source]¶

Cophenetic correlation coefficient for rank selection

The cophenetic correlation coefficient (ccc) is a commonly used way to select a suitable rank for decompositions (Brunet 2004). It is based on assigning each sample or feature to a single signature, and looking for stability in which are assigned to the same signature across multiple random initialisations.

Our primary method for rank selection is bicrossvalidation, but we offer the ability to calculate ccc when you have performed multiple decompositions for a rank using decompositions().

Parameters:

decompositions – Results from the decompositions() function. A dictionary with the key being a rank, the value a list of decompositions for that rank.
on – Look for stability in the assignment in the H matrix (samples) or W matrix (features).

Returns:

Series indexed by rank and with value being the ccc.

cvanmf.denovo.decompose(params: NMFParameters) → Decomposition[source]¶

Perform a single decomposition of a matrix.

Parameters:: params – Decomposition parameters as a NMFParameters object.
Returns:: A single decomposition

cvanmf.denovo.decompositions(x: pandas.DataFrame, ranks: Iterable[int], random_starts: int = 100, top_n: int = 5, top_criteria: str = 'beta_divergence', seed: int | numpy.random.Generator | None = None, alpha: float | None = None, l1_ratio: float | None = None, max_iter: int | None = None, beta_loss: str | None = None, init: str | None = 'random', progress_bar: bool = True) → Dict[int, List[Decomposition]][source]¶

Get the best decompositions for input matrix for one or more ranks.

The model obtained by NMF decomposition depends on the initial values of the two matrices W and H; different initialisations lead to different solutions. Two approaches to initialising H and W are to attempt multiple random initialisations and select the best ones based on criteria such as reconstruction error, or to adopt a deterministic method (such as nndsvd) to set initial values.

This function provides both approaches, but defaults to multiple random initialisations. To use one of the deterministic methods, change the initialisation method using init.

A dictionary with one entry for each rank of decomposition requested is returned, with the values being a list of top_n best decompositions for that rank. Where a deterministic method is used, the list will only have one item.
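The selection step can be pictured with the sketch below, which ranks the results of several random starts by one criterion and keeps the best top_n. It does not call cvanmf: the CRITERIA dictionary is a hypothetical stand-in for TOP_CRITERIA (True meaning higher is better), its values are assumptions, and each run is represented by a plain dictionary of quality scores.

```python
from typing import Dict, List

# Hypothetical stand-in for TOP_CRITERIA: True = take high values.
CRITERIA: Dict[str, bool] = {
    "beta_divergence": False,
    "rss": False,
    "r_squared": True,
    "cosine_similarity": True,
    "l2_norm": False,
}


def select_top(
    runs: List[Dict[str, float]], criterion: str, top_n: int
) -> List[Dict[str, float]]:
    """Sort runs best-first on the criterion and keep the first top_n."""
    ordered = sorted(runs, key=lambda run: run[criterion], reverse=CRITERIA[criterion])
    return ordered[:top_n]


# Quality scores from three random initialisations at one rank (made-up values).
runs = [
    {"seed": 1, "beta_divergence": 12.0, "r_squared": 0.90},
    {"seed": 2, "beta_divergence": 9.5, "r_squared": 0.94},
    {"seed": 3, "beta_divergence": 10.1, "r_squared": 0.95},
]

best_by_divergence = select_top(runs, "beta_divergence", top_n=2)
best_by_r_squared = select_top(runs, "r_squared", top_n=2)
```

Different criteria can rank the same runs differently, as the two selections above show.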

Parameters:

x – Matrix to be decomposed
ranks – Rank(s) of decompositions to be produced
random_starts – Number of random initialisations to be tried for each rank. Ignored if using a deterministic initialisation.
top_n – Number of decompositions to be returned for each rank.
top_criteria – Criteria to use when determining which are the top decompositions. Can be one of beta_divergence, rss, r_squared, cosine_similarity, or l2_norm.
seed – Seed or random generator used
alpha – Regularisation parameter applied to both H and W matrices.
l1_ratio – Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0. This controls the mix between sparsifying and densifying regularisation. 1.0 will encourage sparsity, 0.0 density
max_iter – Maximum number of iterations during decomposition. Will terminate earlier if solution converges
beta_loss – Beta loss function for NMF decomposition
init – Initialisation method for H and W matrices on first step. Defaults to random
progress_bar – Display progress bar

cvanmf.denovo.dispersion(decompositions: Dict[int, List[Decomposition]], on: Literal['h', 'w'] = 'h') → pandas.Series[source]¶

Dispersion coefficient for rank selection

The dispersion coefficient is a method for rank selection which looks for consistency in the average consensus matrix (Park 2007). This shares the same underlying data structure as cophenetic_correlation(), the average consensus matrix, looking at how often elements are assigned to the same signature, with elements assigned to the signature with maximum weight. The value for dispersion ranges between 0 and 1, with 1 indicating perfect stability, and 0 a highly scattered consensus matrix.

Our primary method for rank selection is bicrossvalidation, but we offer the ability to calculate dispersion when you have performed multiple decompositions for a rank using decompositions().
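For intuition, the sketch below builds an average consensus matrix from hard assignments across runs and computes the dispersion coefficient as \(\frac{1}{n^2}\sum_{i,j} 4(C_{ij} - 0.5)^2\), the form given in the NMF literature for this measure. It is an illustration rather than cvanmf's implementation, and the function names and toy assignments are assumptions.

```python
from typing import List


def consensus_matrix(assignments: List[List[int]]) -> List[List[float]]:
    """Fraction of runs in which each pair of elements shares a signature.

    assignments[run][i] is the signature assigned to element i in that run.
    """
    n_runs, n = len(assignments), len(assignments[0])
    return [
        [sum(run[i] == run[j] for run in assignments) / n_runs for j in range(n)]
        for i in range(n)
    ]


def dispersion(consensus: List[List[float]]) -> float:
    """Dispersion coefficient: 1 when every entry is 0 or 1, 0 when all are 0.5."""
    n = len(consensus)
    return sum(4 * (c - 0.5) ** 2 for row in consensus for c in row) / (n * n)


# Two runs giving the same grouping (signature labels are merely swapped).
stable = dispersion(consensus_matrix([[0, 0, 1, 1], [1, 1, 0, 0]]))

# Two runs giving conflicting groupings.
unstable = dispersion(consensus_matrix([[0, 0, 1, 1], [0, 1, 0, 1]]))
```

Only co-assignment matters, so relabelling signatures between runs does not reduce the score.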

Parameters:

decompositions – Results from the decompositions() function. A dictionary with the key being a rank, the value a list of decompositions for that rank.
on – Look for stability in the assignment in the H matrix (samples) or W matrix (features).

Returns:

Series indexed by rank and with value being the dispersion coefficient.

cvanmf.denovo.plot_rank_selection(results: Dict[int | float, List[BicvResult]], exclude: Iterable[str] | None = None, include: Iterable[str] | None = None, show_all: bool = False, geom: str = 'box', summarise: Literal['mean', 'median'] = 'mean', suggested_rank: bool = True, stars_at: Dict[str, int] | None = None, star_size: int = 4, jitter: bool = None, jitter_size: float = 0.3, n_col: int = None, xaxis: str = 'rank', rotate_x_labels: float | None = None, geom_params: Dict[str, Any] = None, **kwargs) → plotnine.ggplot[source]¶

Plot rank selection results from bicrossvalidation.

Draw either box plots or violin plots showing statistics comparing \(A\) and \(A'\) from all bicrossvalidation results across a range of ranks. The plotting library used is plotnine; the returned plot object can be saved or drawn using plt_obj.save or plt_obj.draw respectively. By default, only cosine_similarity and r_squared are plotted. You can define which measures to include using include, or which to exclude using exclude. You can also use show_all to show all the measures.

For cosine_similarity and r_squared, a suggestion of optimal rank is given by identifying an elbow point in the graph using the package kneed, indicated by a star above that rank.

Parameters:

results – Dictionary of results, with rank as key and a list of BicvResult for that rank as value
exclude – Measures from BicvResult not to plot.
include – Measures from BicvResult to plot.
show_all – Show all measures, ignoring anything set in include or exclude.
geom – Type of plot to draw. Accepts either ‘box’ or ‘violin’
summarise – How to summarise the statistics across the folds of a given shuffle.
suggested_rank – Estimate rank using suggest_rank().
stars_at – Manually define x-axis values at which to place stars above the main plot. Mainly used to allow plot_regu_selection() to pass where to plot stars for regularisation selection.
star_size – Size of star indicating suggested rank.
jitter – Draw individual points for each shuffle above the main plot.
jitter_size – Size of jitter points.
n_col – Number of columns in the plot. If blank, attempts to guess a sensible value.
xaxis – Value to plot along the x-axis. “rank” for rank selection, “alpha” for regularisation selection.
rotate_x_labels – Degrees to rotate x-axis labels by. If None will rotate if x-axis is float.
**kwargs –
Passed to suggest_rank().

Returns:

plotnine.ggplot instance

cvanmf.denovo.plot_regu_selection(regu_res: Tuple[float, Dict] | Dict, alpha_star: bool = True, **kwargs) → plotnine.ggplot[source]¶

Plot regularisation selection results.

Takes a result from regu_selection() and passes to plot_rank_selection() to plot with alpha values along the x-axis. Consequently, pass any parameters for plotting as kwargs.

Parameters:

regu_res – Results from regu_selection().
alpha_star – Suggest and plot a suitable alpha value using suggest_alpha().

cvanmf.denovo.plot_stability_rank_selection(decompositions: Dict[int, List[Decomposition]] | None = None, series: List[pandas.Series] | None = None, include: List[str] = ['cophenetic_correlation', 'dispersion', 'signature_similarity'], suggested_rank: bool = True, on: Literal['h', 'w'] = 'h') → plotnine.ggplot[source]¶

Plot results for stability based rank selection methods (signature_similarity(), cophenetic_correlation(), dispersion()).

Automated rank selection uses suggest_rank_stability().

Parameters:

decompositions – Results from decompositions(). Not used if series is passed.
series – Series to plot, resulting from signature_similarity(), cophenetic_correlation(), or dispersion().
include – Which method to include in the plot, a list containing values from {'cophenetic_correlation', 'dispersion', 'signature_similarity'}.
suggested_rank – Estimate a suggested rank using suggest_rank_stability().
on – Calculate stability of H (samples) or W (features). Not used if passed series.

cvanmf.denovo.rank_selection(x: pandas.DataFrame, ranks: Iterable[int], shuffles: int = 100, keep_mats: bool | None = None, seed: int | numpy.random.Generator | None = None, alpha: float | None = None, l1_ratio: float | None = None, max_iter: int | None = None, beta_loss: str | None = None, init: str | None = None, design: Tuple[int, int] | None = (3, 3), progress_bar: bool = True) → Dict[int, List[BicvResult]][source]¶

Bi-cross validation for rank selection.

Run \(mn\)-fold bicrossvalidation across a range of ranks. Briefly, the input matrix is shuffled shuffles times. Each shuffle is then split into \(m \times n\) submatrices (\(m\) splits on rows, \(n\) splits on columns). The rows and columns of submatrices are permuted, and the top left submatrix (\(A\)) is estimated through NMF decompositions of the other matrices producing an estimate \(A'\). Various measures of how well \(A'\) reconstructed \(A\) are provided, see BicvResult for details on the measures.

No multiprocessing is used, as the majority of builds of scikit-learn seem to make good use of multiple processors anyway (depending on compilation of underlying libraries and matrix size).

This method returns a dictionary with each rank as a key, and a list containing one BicvResult for each shuffle.
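The shuffle-and-split step described above can be sketched as follows. This is not cvanmf's BicvSplit: it assumes that shuffling means permuting rows and columns independently and that blocks are contiguous and near-equal in size, and the helper names are hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def split_indices(n_items: int, parts: int) -> List[List[int]]:
    """Split range(n_items) into contiguous, near-equal chunks."""
    base, extra = divmod(n_items, parts)
    chunks, start = [], 0
    for part in range(parts):
        size = base + (1 if part < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def shuffle_and_split(x: Matrix, design: Tuple[int, int], seed: int) -> List[List[Matrix]]:
    """Permute rows and columns, then cut the matrix into m x n blocks."""
    rng = random.Random(seed)
    row_order, col_order = list(range(len(x))), list(range(len(x[0])))
    rng.shuffle(row_order)
    rng.shuffle(col_order)
    shuffled = [[x[r][c] for c in col_order] for r in row_order]
    m, n = design
    return [
        [[[shuffled[r][c] for c in cols] for r in rows]
         for cols in split_indices(len(x[0]), n)]
        for rows in split_indices(len(x), m)
    ]


# 6 x 6 toy matrix split with the default 3 x 3 design.
x = [[float(6 * r + c) for c in range(6)] for r in range(6)]
blocks = shuffle_and_split(x, design=(3, 3), seed=42)
held_out = blocks[0][0]  # plays the role of A, the top left submatrix
```

Each of the nine blocks would take a turn as the held-out submatrix, with the remaining blocks used to estimate it.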

Parameters:

x – Input matrix.
ranks – Ranks of k to be searched. Iterable of unique ints.
shuffles – Number of times to shuffle x.
keep_mats – Return A’ and shuffle as part of results.
seed – Random value generator or seed for creation of the same. If not provided, will initialise with entropy from system.
alpha – Regularisation coefficient
l1_ratio – Ratio between L1 and L2 regularisation. L2 regularisation (0.0) is densifying, L1 (1.0) sparsifying.
max_iter – Maximum iterations of NMF updates. Will end early if solution converges.
beta_loss – Beta-loss function, see sklearn documentation for details.
init – Initialisation method for H and W during decomposition. Used only where one of the matrices during bi-cross steps is not fixed. See sklearn documentation for values.
design – How many blocks to split the input matrix into on rows and columns respectively. Defaults to 3x3 9-fold design.
progress_bar – Show a progress bar while running.

Returns:

Dictionary with entry for each rank, containing a list of results for each shuffle (as a BicvResult object)

cvanmf.denovo.regu_selection(x: pandas.DataFrame, rank: int, alphas: Iterable[float] | None | None = None, scale_samples: bool | None = None, shuffles: int = 100, keep_mats: bool | None = None, seed: int | numpy.random.Generator | None = None, l1_ratio: float | None = 1.0, max_iter: int | None = None, beta_loss: str | None = None, init: str | None = None, design: Tuple[int, int] = (3, 3), progress_bar: bool = True) → Tuple[float, Dict[float, List[BicvResult]]][source]¶

Bicrossvalidation for regularisation selection.

Run \(mn\)-fold bicrossvalidation across a range of regularisation ratios, for a single rank. For a brief description of bi-cross validation see rank_selection().

No multiprocessing is used, as the majority of builds of scikit-learn seem to make good use of multiple processors anyway.

This method returns a tuple with

a float which is the tested alpha which meets the criteria in the ES paper
a dictionary with each alpha value as a key, and a list containing one BicvResult for each shuffle

Parameters:

x – Input matrix.
rank – Rank of decomposition.
alphas – Regularisation alpha parameters to be searched. If left blank a default range will be used.
scale_samples – Divide alpha by number of samples. This is provided as the way regularisation is performed changed in newer sklearn versions, and alpha is multiplied by n_samples. Setting this to True results in the same calculation as earlier sklearn versions, such as the one used in the Enterosignatures paper. If this is set it is honoured; if left as None, when automatic alpha range is calculated they will be scaled by sample, when alpha range specified will not be scaled.
shuffles – Number of times to shuffle x.
keep_mats – Return \(A'\) and shuffle as part of results.
seed – Random value generator or seed for creation of the same. If not provided, will initialise with entropy from system.
alpha – Regularisation coefficient
l1_ratio – Ratio between L1 and L2 regularisation. L2 regularisation (0.0) is densifying, L1 (1.0) sparsifying.
max_iter – Maximum iterations of NMF updates. Will end early if solution converges.
beta_loss – Beta-loss function, see sklearn documentation for details.
init – Initialisation method for H and W during decomposition. Used only where one of the matrices during bi-cross steps is not fixed. See sklearn documentation for values.
progress_bar – Show a progress bar while running.
design – Number of blocks to split input into on rows and columns respectively for bicrossvalidation.

Returns:

Tuple of the suggested alpha, and a dictionary with an entry for each alpha containing a list of results for each shuffle (as BicvResult objects).

cvanmf.denovo.signature_similarity(decompositions: Dict[int, List[Decomposition]]) → pandas.Series[source]¶

Mean cosine similarity of signatures for rank selection

This rank selection criterion is based on the intuition that if a solution is good, it should be similar across multiple random initialisations of the data, similar to the motivation for cophenetic_correlation() and dispersion().

We pair signatures based on a cosine similarity (see cvanmf.stability.match_signatures()) and take the mean value between paired signatures at a rank, and look for clear peaks.

Similarity is calculated between the best decomposition and all others, not all possible pairs.

The paired cosine similarity can also be visualised in more detail using cvanmf.stability.plot_signature_stability().
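The pairing idea can be illustrated with the sketch below, which compares the signatures of two decompositions at one rank. It pairs signatures by trying every permutation and keeping the one with the highest total cosine similarity, which is only practical for small ranks; how cvanmf.stability.match_signatures() actually performs the matching is not shown here, so treat the matching strategy and the helper names as assumptions.

```python
import math
from itertools import permutations
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else float("nan")


def mean_paired_similarity(best: List[List[float]], other: List[List[float]]) -> float:
    """Mean cosine similarity under the best one-to-one pairing of signatures."""
    k = len(best)
    best_total = max(
        sum(cosine(best[i], other[perm[i]]) for i in range(k))
        for perm in permutations(range(k))
    )
    return best_total / k


# Each inner list is one signature (a column of W) over three features.
w_best = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
# Same signatures found by another run, but reordered and rescaled.
w_other = [[0.0, 2.0, 2.0], [3.0, 0.0, 0.0]]

similarity = mean_paired_similarity(w_best, w_other)
```

Because signatures are matched before comparison, a run which recovers the same signatures in a different order still scores close to 1.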

Parameters:: decompositions – Decompositions for several ranks as output by decompositions().

cvanmf.denovo.suggest_alpha(regu_results: Dict[float, List[BicvResult]]) → float[source]¶

Suggest a suitable value for alpha.

We want to select the largest value of \(\alpha\) possible which does not detrimentally affect the quality of the decomposition. To gauge this, we adopt the heuristic of [REF], selecting the highest value of \(\alpha\) for which the mean \(R^2\) is not lower than the (mean \(R^2\) + standard deviation) at \(\alpha=0\).

This is called by default in regu_selection(). It is provided as a public method because the Nextflow pipeline splits the Bicv process and doesn’t use regu_selection(), so it can be called afterwards.

Parameters:: regu_results – Dictionary with keys being alpha values, and values a list of BicvResult objects.

cvanmf.denovo.suggest_rank(rank_selection_results: Dict[int, List[BicvResult]] | pandas.DataFrame, summarise: Callable[[numpy.ndarray], float] = np.mean, measures: List[str] = ['cosine_similarity', 'r_squared'], **kwargs) → Dict[str, int][source]¶

Suggest a suitable rank.

Attempt to identify an elbow point in the graphs of cosine similarity and \(R^2\) which represent points where the rate of improvement in the decomposition slows.

Please note this is only a suggestion of a suitable rank; the plots should still be inspected and decompositions of candidate ranks inspected to make a final decision.

This is implemented using the excellent kneed package, and **kwargs are passed to the constructor of KneeLocator, you can use this if you wish to customise the behaviour. We use the online mode of kneed by default.

Parameters:

rank_selection_results – Results from rank_selection(), or these results in DataFrame format from BicvResult.results_to_table()
summarise – Function to summarise results from a shuffle. Roughly speaking, determines which point represents the middle of the distribution of values for purposes of the curve.
measures – The measures to consider if passed a DataFrame
kwargs – Arguments passed to KneeLocator constructor

cvanmf.denovo.suggest_rank_stability(rank_selection_results: pandas.DataFrame | Iterable[pandas.Series] | Dict[int, List[Decomposition]], measures: List[str] = ['cophenetic_correlation', 'dispersion', 'signature_similarity'], near_max: float = 0.02, **kwargs) → Dict[str, int][source]¶

Suggest a suitable rank in stability based measures.

Attempt to identify peaks in stability based rank selection criteria (cophenetic correlation, dispersion, signature similarity). By default the highest peak is selected. Where there are many similar ranks (defined by near_max), the one with the most consecutively decreasing values after it is selected.

Please note this is only a suggestion of a suitable rank; the plots should still be inspected and decompositions of candidate ranks inspected to make a final decision.

When making a plot multiple times (changing parameters etc), it may be preferable to calculate the measures then pass the results as a list of Series, as the calculation can be time consuming.

Parameters:

rank_selection_results – Results from decompositions(), or a collection of series produced by dispersion(), cophenetic_correlation(), and signature_similarity(), or a DataFrame of those series joined.
measures – The measures to consider if passed a DataFrame
near_max – Consider peaks (\(p\)) candidates if they are within a certain distance of global maximum (\(gm\)): \(p \geq gm(1 - \text{near\_max})\).
kwargs – Passed to np.argrelmax.
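The peak selection rule can be sketched in plain Python as below. This is an approximation for illustration, not the library's code: it treats only strict interior local maxima as peaks, takes the global maximum over those peaks, assumes positive values, and breaks ties on run length by preferring the higher value. The function name suggest_peak and the example values are hypothetical.

```python
from typing import Dict


def suggest_peak(values: Dict[int, float], near_max: float = 0.02) -> int:
    """Pick a rank from a stability measure indexed by rank."""
    ranks = sorted(values)
    y = [values[rank] for rank in ranks]

    # Strict interior local maxima.
    peaks = [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1]]
    if not peaks:
        raise ValueError("no interior peak found")

    # Keep peaks within near_max of the highest peak: p >= gm * (1 - near_max).
    gm = max(y[i] for i in peaks)
    candidates = [i for i in peaks if y[i] >= gm * (1 - near_max)]

    def decreasing_run(i: int) -> int:
        """Number of consecutively decreasing values after position i."""
        length = 0
        while i + length + 1 < len(y) and y[i + length + 1] < y[i + length]:
            length += 1
        return length

    best = max(candidates, key=lambda i: (decreasing_run(i), y[i]))
    return ranks[best]


# Made-up stability values for ranks 2 to 8.
ccc = {2: 0.90, 3: 0.99, 4: 0.95, 5: 0.985, 6: 0.97, 7: 0.93, 8: 0.94}

suggested = suggest_peak(ccc)                    # ranks 3 and 5 are both candidates
highest_only = suggest_peak(ccc, near_max=0.0)   # only the highest peak qualifies
```

With the default near_max, rank 5 is preferred over the slightly higher peak at rank 3 because it is followed by a longer run of decreasing values.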

cvanmf.denovo.Numeric¶: Alias for python numeric types (a union of int and float).

cvanmf.denovo.PcoaMatrices¶: Allowed matrices which PCoA can be constructed from. Allows values w, x, wh, signatures (alias for w).

cvanmf.denovo.logger: logging.Logger¶: Logger object.