cvanmf.denovo ============= .. py:module:: cvanmf.denovo .. autoapi-nested-parse:: Generate new models using NMF decomposition This module provides functions to generate new models from data, which encompasses three main steps: rank selection, regularisation selection, and model inspection. The first two of these steps involve running decompositions multiple times for a range of values, and can be time-consuming. Methods are provided to run the whole process on a single machine, but also for running individual decompositions, which are used by the accompanying Nextflow pipeline to allow spreading the computation across multiple nodes in an HPC environment. The main functions for each step are * :func:`rank_selection` and :func:`plot_rank_selection` * :func:`regu_selection` and :func:`plot_regu_selection` * :func:`decompositions` which produces :class:`Decomposition` objects Individual decompositions are represented by a :class:`Decomposition` object, and visualisation and analysis are carried out using object methods (such as :meth:`Decomposition.plot_feature_weight()`). Attributes ---------- .. autoapisummary:: cvanmf.denovo.Numeric cvanmf.denovo.PcoaMatrices cvanmf.denovo.logger Classes ------- .. autoapisummary:: cvanmf.denovo.BicvFold cvanmf.denovo.BicvResult cvanmf.denovo.BicvSplit cvanmf.denovo.Decomposition cvanmf.denovo.NMFParameters Functions --------- .. autoapisummary:: cvanmf.denovo.bicv cvanmf.denovo.cli_decompose cvanmf.denovo.cli_rank_selection cvanmf.denovo.cli_regu_selection cvanmf.denovo.cophenetic_correlation cvanmf.denovo.decompose cvanmf.denovo.decompositions cvanmf.denovo.dispersion cvanmf.denovo.plot_rank_selection cvanmf.denovo.plot_regu_selection cvanmf.denovo.plot_stability_rank_selection cvanmf.denovo.rank_selection cvanmf.denovo.regu_selection cvanmf.denovo.signature_similarity cvanmf.denovo.suggest_alpha cvanmf.denovo.suggest_rank cvanmf.denovo.suggest_rank_stability Module Contents --------------- .. 
py:class:: BicvFold Bases: :py:obj:`NamedTuple` One fold from a shuffled matrix The submatrices have been joined into the structure shown below .. code-block:: text A B . B C D . D . . . . C D . D from which A will be estimated as A' using only B, C, D. .. py:attribute:: A :type: pandas.DataFrame .. py:attribute:: B :type: pandas.DataFrame .. py:attribute:: C :type: pandas.DataFrame .. py:attribute:: D :type: pandas.DataFrame .. py:class:: BicvResult Bases: :py:obj:`NamedTuple` Results from a single bi-cross validation run. For each BicvSplit there are :math:`mn` folds (for :math:`m` splits on rows, :math:`n` splits on columns), for which the top left submatrix (:math:`A`) is estimated (:math:`A'`) using the other portions. .. py:method:: join_folds(results: List[BicvResult]) -> BicvResult :staticmethod: Join results from individual folds Each fold returns a BicvResult with a length one array. This method joins these into a single object summarising all the folds. Could also join other sets of results. :param results: Results from individual folds :returns: Single object with individual arrays joined .. py:method:: results_to_table(results: Union[Iterable[BicvResult], Dict[Numeric, Iterable[BicvResult]]], summarise: Callable[[numpy.ndarray], float] = np.mean) -> pandas.DataFrame :staticmethod: Convert bi-fold cross-validation results to a table For results run with the same parameters, convert the output to a table suitable for plotting. :param results: List of results for bicv runs with the same parameters on different shuffles of the data, or dict of runs across multiple values on the same shuffles. :param summarise: Function to reduce each of the measures (r_squared etc) to a single value for each shuffle. .. py:method:: to_series(summarise: Callable[[numpy.ndarray], float] = np.mean) -> pandas.Series Convert bi-fold cross-validation results to series :param summarise: Function to reduce each of the measures (r_squared etc) to a single value for each shuffle. 
:returns: Series with entry for each non-parameter measure .. py:attribute:: a :type: Optional[List[numpy.ndarray]] Reconstructed matrix A for each fold. Not included when keep_mats is False. .. py:attribute:: cosine_similarity :type: numpy.ndarray Cosine similarity between each A and A' considered as a flattened vector. .. py:attribute:: i :type: int Shuffle number when there are multiple shuffles. Included to allow spreading bicv across multiple processes, but without needing to return a copy of the full matrix. .. py:attribute:: l2_norm :type: numpy.ndarray L2 norm between each A and A'. .. py:attribute:: parameters :type: NMFParameters Parameters used during this run. .. py:attribute:: r_squared :type: numpy.ndarray Explained variance between each A and A', with each considered as a flattened vector. .. py:attribute:: reconstruction_error :type: numpy.ndarray Reconstruction error between each A and A'. .. py:attribute:: rss :type: numpy.ndarray Residual sum of squares between each A and A'. .. py:attribute:: sparsity_h :type: numpy.ndarray Sparsity of H matrix for each A' .. py:attribute:: sparsity_w :type: numpy.ndarray Sparsity of W matrix for each A' .. py:class:: BicvSplit(mx: List[pandas.DataFrame], design: Tuple[int, int], i: Optional[int] = None) Shuffled matrix for bi-cross validation, split into mn matrices in a m x n pattern. To shuffle and split an existing matrix, use the static method :meth:`BicvSplit.from_matrix`. Create a shuffled matrix containing the 9 split matrices. These should be in the order .. code-block:: text 0, 1, 2 3, 4, 5 6, 7, 8 :param mx: Split matrices in a flat list. These should be all of one row, then all of the next row etc. :param design: Number of even splits on rows and columns of mx. :param i: Index of the split if one of many .. py:method:: col(col: int, join: bool = False) -> Union[List[pandas.DataFrame], pandas.DataFrame] Get a column of the submatrices by index. Convenience method for readability. 
:param col: Column index to get :param join: Join into a single matrix :returns: List of submatrices making up the column, or these submatrices joined if requested. .. py:method:: fold(i: int) -> BicvFold Construct a fold of the data There are m*n possible folds of the data, this function constructs the i-th fold. :param i: Index of the fold to construct, from [0, mn) :returns: A, B, C, and D matrices for this fold .. py:method:: from_matrix(df: pandas.DataFrame, n: int = 1, design: Tuple[int, int] = (3, 3), random_state: Optional[Union[int, numpy.random.Generator]] = None) -> Generator[BicvSplit] :staticmethod: Create random shuffles and splits of a matrix :param df: Matrix to shuffle and split :param n: Number of shuffles :param design: Number of blocks to divide rows and columns into. Default is 3x3 9-fold bi-cross-validation. :param random_state: Random state, either int seed or numpy Generator; None for default numpy random Generator. :returns: A generator of splits, as BicvSplit objects .. py:method:: load_all_npz(path: Union[pathlib.Path, str], allow_pickle: bool = True, fix_i: bool = False) -> Generator[BicvSplit] :staticmethod: Read shuffles from files. Reads either all the npz files in a directory, or those specified by a glob. The expectation is the filenames are in format prefix_i.npz, where i is the number of this shuffle. If not, use fix_i to renumber in order loaded. :param path: Directory with .npz files, or glob identifying .npz files :param allow_pickle: Allow unpickling when loading; necessary for compressed files. :param fix_i: Renumber shuffles. :returns: Generator of BicvSplit objects. .. py:method:: load_npz(path: pathlib.Path, allow_pickle: bool = True, i: Optional[int] = None) -> BicvSplit :staticmethod: Load splits from file. Load splits from an npz file. This means the column and row names are lost, but this is unimportant for cross-validation. 
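The block rearrangement behind :meth:`fold` can be sketched as follows. This is an illustration only: it assumes fold ``i`` holds out block ``i`` of the row-major numbering and moves that block's row and column of blocks to the front, which may differ from how the package orders folds. ``fold_layout`` is a hypothetical helper, not part of cvanmf.

```python
from typing import List, Tuple


def fold_layout(i: int, design: Tuple[int, int] = (3, 3)) -> List[List[int]]:
    """Rearrange row-major block indices so the held-out block is top left.

    Assumption: fold ``i`` holds out block ``i``; the real package may
    map fold numbers to blocks differently.
    """
    m, n = design
    if not 0 <= i < m * n:
        raise ValueError(f"fold index must be in [0, {m * n})")
    r, c = divmod(i, n)
    # Held-out block row and block column first, the rest in original order.
    row_order = [r] + [x for x in range(m) if x != r]
    col_order = [c] + [x for x in range(n) if x != c]
    return [[ro * n + co for co in col_order] for ro in row_order]


layout = fold_layout(4)
a_block = layout[0][0]                      # block to be estimated (A)
b_blocks = layout[0][1:]                    # same block row (B)
c_blocks = [row[0] for row in layout[1:]]   # same block column (C)
d_blocks = [row[1:] for row in layout[1:]]  # everything else (D)
```

With the default 3x3 design, fold 4 places the centre block in the A position, and A is then estimated from the B, C, and D blocks alone.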
Pickling is required for loading compressed files, but is not secure, so the option is provided to turn it off if you don't need it. :param path: File to load :param allow_pickle: Allow unpickling when loading; necessary for compressed files :param i: Shuffle number. Will attempt to parse from filename if blank. Only parses files like prefix_1.npz, which will be i=1, prefix only alphanumeric. .. py:method:: row(row: int, join: bool = False) -> Union[List[pandas.DataFrame], pandas.DataFrame] Get a row of the submatrices by index. Convenience method for readability. :param row: Row index to get :param join: Join into a single matrix :returns: List of submatrices making up the row, or these submatrices joined if requested. .. py:method:: save_all_npz(splits: Iterable[BicvSplit], path: pathlib.Path, fix_i: bool = False, force: bool = False, compress: bool = True) -> None :staticmethod: Save a collection of splits to a directory as npz files. :param splits: Iterable of BicvSplit objects :param path: Directory to write to :param fix_i: Renumber all splits starting from 0. Does not check if existing numbering is unique. :param compress: Use compression :param force: Overwrite existing files .. py:method:: save_npz(path: pathlib.Path, compress: bool = True, force: bool = False) -> None Save these splits to file. Write the splits to a numpy format file. This will lose the row and column names, however this is unimportant for rank selection. Compression is enabled by default, as sparse data such as microbiome counts tends to create large files. :param path: Path to write to. If passed a directory, will output with filename `shuffle_{i}.npz`. If `i` is not set, an error is raised. :param compress: Use compression. :param force: Overwrite existing files. .. py:property:: design :type: Tuple[int, int] Design of holdout pattern, given as (rows, columns). .. py:property:: folds :type: List[BicvFold] List of the mn possible folds of these submatrices. .. 
py:property:: i :type: int This is the i-th shuffle of the input. .. py:property:: mx :type: List[List[pandas.DataFrame]] Submatrices as a 2d list. .. py:property:: num_folds :type: int Total number of folds in this design. .. py:property:: shape :type: Tuple[int, int] Dimensions of the input matrix. .. py:property:: size :type: int Size of input matrix. .. py:property:: x :type: pandas.DataFrame The input matrix This reproduces the input matrix by concatenating the submatrices. :returns: Input matrix .. py:class:: Decomposition(parameters: NMFParameters, h: pandas.DataFrame, w: pandas.DataFrame, feature_mapping: Optional[reapply.FeatureMapping] = None) Decomposition of a matrix. Note that we use the naming conventions and orientation common in NMF literature: * :math:`X` is the input matrix, with m features on rows, and n samples on columns. * :math:`H` is the transformed data, with k signatures on rows, and n samples on columns. * :math:`W` is the feature weight matrix, with m features on rows, and k signatures on columns. The scikit-learn implementation has these transposed; this package handles transposing back and forth internally, and expects input in the features x samples orientation, and provides :math:`W` and :math:`H` in line with the literature rather than scikit-learn. Decomposition objects can be sliced using the syntax:: sliced_model = model[samples, features, signatures] # Only slice on one dimension sliced_signatures = model[:, :, ['S1', 'S2']] Slices must be ordered collections of strings, integer indices, or booleans. .. py:method:: compare_signatures(b: Comparable) -> pandas.DataFrame Similarity between these signatures and one other set. Similarity here is defined as the cosine of the angle between each pair of signature vectors, so 1 is identical (ignoring scale) and 0 is perpendicular. This is a convenience method which calls :func:`stability.compare_signatures`. 
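As a minimal sketch of the similarity just described, assuming signatures are the columns of a features-by-signatures matrix. ``pairwise_cosine`` is a hypothetical helper written in plain Python for illustration; it is not the package's implementation, and treating a zero-length vector as similarity 0 is an assumption of this sketch.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_cosine(w_a: List[List[float]], w_b: List[List[float]]) -> List[List[float]]:
    """Cosine similarity between every column of w_a and every column of w_b."""
    cols_a = list(zip(*w_a))  # one tuple per signature in the first model
    cols_b = list(zip(*w_b))  # one tuple per signature in the second model
    return [[cosine(a, b) for b in cols_b] for a in cols_a]


# Two toy models, each with 3 features (rows) and 2 signatures (columns).
w_a = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
w_b = [[3.0, 0.0], [0.0, 0.0], [0.0, 5.0]]
sim = pairwise_cosine(w_a, w_b)
```

Here the first signatures of the two models point in the same direction despite differing in scale, so their similarity is 1, while the other pairs share no features and score 0.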
:param b: Signature matrix, or object with signature matrix :returns: Matrix with cosine of angles between signature vectors. .. py:method:: consensus_matrix(on: Union[Literal['h', 'w'], pandas.DataFrame] = 'h') -> scipy.sparse.csr_array Consensus matrix of either :math:`H` or :math:`W`. Most typically, the consensus matrix is calculated on the :math:`H` matrix, and is a binary matrix representing whether sample :math:`i` is assigned to the same signature as sample :math:`j`. Samples are assigned to signatures based on their maximum weight. When calculated on :math:`W`, it is the same but for features assigned. The primary use of this is in generating a :math:`\bar{C}` matrix, the mean number of times two elements are assigned to the same signature. :math:`\bar{C}` is used to calculate the :meth:`cophenetic_correlation` and :meth:`dispersion` coefficients, a method of determining suitable rank. This is returned as a lower triangular matrix in sparse format. .. py:method:: discrete_signature_scale(axis: Literal['x', 'y']) -> Union[plotnine.scale_x_discrete, plotnine.scale_y_discrete] Make a plotnine scale which puts the signatures in order. By default, plotnine will alphabetically sort (S1, S11 .. S2, S21), this produces a scale object which can be added to a plot to put the signatures in their order in this object. .. py:method:: load(in_dir: os.PathLike, x: Optional[Union[pandas.DataFrame, str, os.PathLike]] = None, delim: str = '\t') :staticmethod: Load a decomposition from disk. Loads a decomposition previously saved using :meth:`save`. Will automatically determine whether this is a directory or .tar.gz. Can provide a DataFrame of the :math:`X` input matrix, primarily this is so when loading multiple decompositions they can all reference the same object. Can also provide an explicit path; if not provided will attempt to load from ``x.tsv``. :param in_dir: Directory or .tar.gz containing decomposition. 
:param x: Either the X input matrix as a DataFrame, or a path to a delimiter-separated copy of the X matrix. If None, will attempt to load from x.tsv. :param delim: Delimiter for tabular data .. py:method:: load_decompositions(in_dir: os.PathLike, delim: str = '\t') -> Dict[int, List[Decomposition]] :staticmethod: Load multiple decompositions. Load a set of decompositions previously saved using :meth:`save_decompositions`. Will attempt to share a reference to the same :math:`X` matrix for memory reasons. The output is a dictionary with keys being ranks, and values being lists of decompositions for that rank. :param in_dir: Directory to read from :param delim: Delimiter for tabular data files .. py:method:: match_signatures(b: Comparable) -> pandas.DataFrame Identify optimal matches between these signatures and one other set Find the pairing of signatures which are most similar. More technically, this finds the pairing of signatures which maximises the total cosine similarity using the Hungarian algorithm. It is possible that a signature gets paired with another for which the cosine similarity is not highest, suggesting a potentially bad match between some signatures in the model. The return is a dataframe with columns a and b for which signatures are paired, the cosine similarity of the pairing, and the maximum 'off-target' cosine value for any of the signatures which it was not assigned to. The intention for the off-target score is that ideally this would be low, and the paired similarity high: signatures match well their paired one, while being dissimilar to all others. This is a convenience method which calls :func:`stability.match_signatures`. :param b: Signature matrix, or object with signature matrix :returns: DataFrame with pairing and scores .. py:method:: monodominant_samples(threshold: float = 0.9) -> pandas.DataFrame Which samples have a monodominant signature. 
A monodominant signature is one which represents at least the threshold amount of the weight in the :meth:`scaled` :math:`H` matrix. :param threshold: Proportion of the scaled H matrix weight to consider a signature dominant. :return: Dataframe with column is_monodominant indicating if a sample has a monodominant signature, and signature_name indicating the name of the signature, or none if not. .. py:method:: name_signatures_by_weight(cumulative_sum: float = 0.4, max_char_length: int = 10, max_num_features: int = 5, feature_delimiter: str = '+', number: bool = True, clean: Callable[[str], str] = lambda x: x.replace(' ', '_')) -> None Give a slightly more descriptive name to each signature. Append features with highest relative weights to the end of signature names. This alters the object in place. :param cumulative_sum: Add features up to this cumulative sum (from max to min). :param max_char_length: Maximum length of new name (before joining with feature delimiter). :param max_num_features: Maximum number of features to use in name. :param feature_delimiter: When multiple features used, will join with this character :param number: Number the signatures. When true, starts each new name with S1, S2, etc. :param clean: Function to clean the string. Defaults to replacing spaces with underscores. .. py:method:: pcoa(on: Union[pandas.DataFrame, Literal['x', 'h', 'wh', 'signatures']] = 'h', distance: str = 'braycurtis', wisconsin_standardise: bool = True, sqrt: bool = True) -> skbio.OrdinationResults Principal Coordinates Analysis of decomposition. Performs PCoA on the specified matrix, and returns a scikit-bio OrdinationResults object. Can base distances on any matrix which has a column for each sample, or specify one of these via string. Defaults to distances based on :meth:`scaled` :math:`H` (signature weight in sample) matrix. Matrix is Wisconsin double standardised by default, as described in R function ``cmdscale``. 
Distance defaults to Bray-Curtis dissimilarity, and is square root transformed. Distance is calculated with scipy ``pdist`` function, and any method supported there can be specified in distance argument. :param on: Matrix to derive distances from :param distance: Distance method to use :param wisconsin_standardise: Apply Wisconsin double standardisation :param sqrt: Square root transform distances :return: PCoA results object from scikit-bio .. py:method:: plot_feature_weight(threshold: float = 0.04, label_fn: Callable[[str], str] = None) -> plotnine.ggplot Plot features which contribute to each signature. Represent the relative contribution of features to signatures, showing any features which contribute over a threshold proportion of the weight. :param threshold: Show any features which contribute more than this proportion of the weight for this signature. :param label_fn: Function to map labels (use to make shortened labels for example) .. py:method:: plot_metadata(metadata: pandas.DataFrame, against: Optional[Union[pandas.DataFrame, Literal['signature', 'model_fit', 'both']]] = None, continuous_fn: Optional[Callable[[pandas.Series], bool]] = None, discrete_fn: Optional[Callable[[pandas.Series], bool]] = None, boxplot_params: Optional[Dict] = None, point_params: Optional[Dict] = None, disc_rotate_labels: Optional[float] = None, show_significance: bool = True, significance_formatter: Optional[Callable[[float, float, float], str]] = None, univariate_test_params: Dict[str, Any] = None) -> Tuple[plotnine.ggplot, plotnine.ggplot] Plot relative signature weight against metadata. Produce plots of signature weight against metadata. Produces two plots, one with boxplots for categorical metadata, one with scatter plots for continuous metadata. Will infer which type each column is. To use an integer as categorical, convert it to Categorical type in pandas. Will conduct univariate tests as described in :meth:`univariate_tests` and indicate significance with symbols. 
This will be skipped if ``show_significance`` is False, which may be sensible for larger numbers of samples and metadata categories. :param metadata: Dataframe with samples on rows, and metadata on columns. :param against: DataFrame to plot the metadata against. Should contain an entry for each sample, with samples on rows. Defaults to :meth:`scaled` :math:`H` matrix (transpose of typical :math:`H` orientation). :param continuous_fn: Function to determine if a column is continuous. Defaults to considering any floating type or integer to be continuous. May want to customise if you want to use things such as date time formats. :param discrete_fn: Function to determine if a column is categorical. Defaults to considering any string, or object type column with a number of unique values < 3/4 its length as categorical. :param boxplot_params: Dictionary of parameters to pass to ``geom_boxplot``. These will be fixed parameters (so color="pink" to set all box outlines to pink). :param point_params: Dictionary of parameters to pass to geom_point. Will be fixed parameters, see above. :param disc_rotate_labels: Angle to rotate x axis labels by for boxplots. :param show_significance: Add significance to each subplot for discrete metadata. :param significance_formatter: Function which takes the p-value and adjusted p-values and returns a string to use as label. Defaults to :meth:`Decomposition.significance_format`. :param univariate_test_params: Parameters passed to :meth:`univariate_tests` :return: A tuple of plotnine ggplot objects, first is boxplots, second is scatter plots. .. py:method:: plot_modelfit(group: Optional[pandas.Series] = None) -> plotnine.ggplot Plot model fit distribution. This provides a histogram of the model fit of samples by default. If a grouping is provided, this will instead produce boxplots with each box being the distribution within a group. :param group: Series giving label for group which each sample belongs to. 
Samples which are not in the group series will be dropped from results with warning. :return: Histogram or boxplots .. py:method:: plot_modelfit_point(threshold: Optional[float] = 0.4, yrange: Optional[Tuple[float, float]] = (0, 1), point_size: float = 1.0) -> plotnine.ggplot Model fit for each sample as a point on a vertical scale. It may be of interest to look at the model fit of individual samples, so this plot shows the model fit of each sample as a point on a vertical scale. A threshold can be set below which the point will be coloured red to indicate low model fit, by default this is 0.4. :param threshold: Value below which to colour the model fit red. If omitted will not color any samples. The default of 0.4 is specific to the 5ES model (:func:`cvanmf.models.five_es`) and does not necessarily represent a good threshold for other models. .. py:method:: plot_pcoa(axes: Tuple[int, int] = (0, 1), color: Union[pandas.Series, Literal['signature']] = 'signature', shape: Optional[Union[pandas.Series, Literal['signature']]] = None, signature_arrows: bool = False, point_aes: Dict[str, Any] = None, **kwargs) -> plotnine.ggplot Ordination of samples. Perform PCoA of samples and plot first two axes. PCoA performed by the :meth:`pcoa` method, and arguments in kwargs are passed on to this method. Samples are coloured by primary ES. :param axes: Indices of PCoA axes to plot :param color: Metadata to use to color the points, or 'signature' to color based on the primary signature :param shape: Metadata used to decide shape of points, or 'signature' to base shape on the primary signature :param signature_arrows: Plot location of signatures as arrows :param point_aes: Dictionary of arguments to pass to geom_point :param kwargs: arguments to pass to :meth:`pcoa` :return: Scatter plot of samples .. 
py:method:: plot_relative_weight(group: Optional[pandas.Series] = None, group_colors: Optional[pandas.Series] = None, model_fit: bool = True, heights: Union[Dict[str, float], Iterable[float]] = None, width: float = 6.0, sample_label_size: float = 5.0, legend_cols_sig: int = 3, legend_cols_grp: int = 3, legend_side: str = 'bottom', **kwargs) Plot relative weight of each signature in each sample. To display the plot in a notebook environment, use ``result.render()``. Please note this plot uses the marsilea package rather than plotnine like other plots. Unfortunately, the options for combining multiple elements are not yet well developed in plotnine. Plots a stacked bar chart with a bar for each sample displaying the relative weight of each signature. Optionally the plot can also include sections at the top summarising the model fit for each sample, and a ribbon along the bottom displaying categorical metadata for samples. :param group: Categorical metadata for each sample to plot on ribbon at the bottom :param group_colors: Colour to associate with each of the metadata categories. :param model_fit: Include a top row indicating model fit per sample. :param heights: Height in inches for each component of the plot. Specify as a dictionary with keys 'dot', 'ribbon', 'bar', 'labels', or a list with heights for the elements included from top to bottom. :param width: Width of plot. :param sample_label_size: Size for sample labels. Set to 0 to remove sample labels. :param legend_cols_sig: Number of columns in Signature legend. :param legend_cols_grp: Number of columns in group legend. :param legend_side: Location of Signature and group legend. One of 'top', 'right', 'left', 'bottom' :return: Marsilea whiteboard object. Call ``.render()`` to show plot. .. py:method:: plot_weight_distribution(threshold: float = 0.0, scale_transform: Optional[str] = 'log10', nrows: int = 1) -> plotnine.ggplot Plot the distribution of feature weights in each signature. 
The distribution of signature weights helps describe how mixed the features are which describe a sample. This will sort feature weights for each signature independently, and plot a bar for the weight of each feature. So distributions which are longer indicate more features contribute to that signature, and the height of bars indicates whether this is a long tail of low weights, all even, etc. :param threshold: Set any weight below this to 0. Effectively, consider very low weights to not contribute to the signature. :param scale_transform: Transformation to apply to the feature weight axis. Can be any of the transforms in `mizani`. For no transformation, pass None or "identity". :param nrows: Number of rows in the plot. Defaults to having all plots on one row for comparability. .. py:method:: reapply(y: pandas.DataFrame, input_validation: Optional[cvanmf.reapply.InputValidation] = None, feature_match: Optional[cvanmf.reapply.FeatureMatch] = None, **kwargs) -> Decomposition Get signature weights for new data. When the features in ``y`` exactly match those used to learn this decomposition, you can set the ``input_validation`` and ``feature_match`` parameters as None. In some cases, the features in new data y may not exactly match those used in the original decomposition, for instance if you have new microbiome data there may be different taxa present, or a different naming format may be used in the new data. The function ``feature_match`` can be used to handle these cases, by defining a function to map names between new and existing data. The ``input_validation`` function is largely used for existing models, to validate that data being provided is the expected format. 
:param y: New data of the same type used to generate this decomposition :param input_validation: Function to validate and transform ``y`` :param feature_match: Function to match features in ``y`` and :attr:`w` :param kwargs: Arguments to pass to ``input_validation`` and ``feature_match`` :return: :class:`Decomposition` with signature weights for samples in ``y``. .. py:method:: representative_signatures(threshold: float = 0.9) -> pandas.DataFrame Which signatures describe a sample. Identify which signatures contribute to describing a sample. Representative signatures are those for which the cumulative sum is equal to or lower than the threshold value. This is done by considering each sample in the :meth:`scaled` :math:`H` matrix, and taking a cumulative sum of weights in descending order. Any signature for which the cumulative sum is less than or equal to the threshold is considered representative. :param threshold: Cumulative sum below which signatures are considered representative. :return: Boolean dataframe indicating whether a signature is representative for a given sample. .. py:method:: save(out_dir: Union[str, pathlib.Path], compress: bool = False, param_path: Optional[pathlib.Path] = None, x_path: Optional[pathlib.Path] = None, symlink: bool = True, delim: str = '\t', plots: Optional[Union[bool, Iterable[str]]] = None) -> None Write decomposition to disk. Export this decomposition and associated data. This is written to text type files (tab separated for tables, yaml for dictionaries) to allow simpler reading in other analysis environments such as R. Exceptions are raised if any tables cannot be written, but plots are allowed to fail though will produce log entries. :param out_dir: Directory to write to. Must be empty. :param compress: Create compressed .tar.gz rather than directory. :param param_path: Path to YAML file containing parameters used. If not given will create a copy in the directory. 
If given and symlink is True, will try to make a symlink to parameters file. :param x_path: Path to X matrix used. Behaves as param_path for copies/ symlinks. :param symlink: Make symlinks to param_path and x_path if possible. :param delim: Delimiter to use for tabular output. :param plots: Determine which plots to write. When left default (None) this will produce all plots if there are 500 or fewer samples. If True, all plots will be produced; if False no plots will be produced. If a list is provided, any plots named in the list will be produced, i.e. if given ['pcoa', 'modelfit', 'radar'], plots from :meth:`plot_pcoa` and :meth:`plot_modelfit` would be produced. 'radar' would be ignored as there is no `plot_radar` method. .. py:method:: save_decompositions(decompositions: Dict[int, List[Decomposition]], output_dir: pathlib.Path, symlink: bool = True, delim: str = '\t', compress: bool = False, **kwargs) -> None :staticmethod: Save multiple decompositions to disk. Write multiple decompositions to disk. The structure is that a directory is created for each rank, then within that a directory for each decomposition. By default the input data and parameters will be saved at the top level, and symlinked to by each individual decomposition. The files output are tables for W and H matrices, scaled W and H, tables of basic analyses (primary es etc), and all default plots where possible. :param decompositions: Decompositions in form output by :func:`decompositions`. :param output_dir: Directory to write to which is either empty or does not exist. :param symlink: Symlink the parameters and input X files. :param delim: Delimiter for tabular output. :param compress: Compress each decomposition folder to .tar.gz :param **kwargs: Passed to :meth:`Decomposition.save` .. py:method:: scaled(matrix: Union[pandas.DataFrame, Literal['h', 'w']], by: Optional[str] = None) -> pandas.DataFrame Total sum scaled version of a matrix. 
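As a rough sketch of total sum scaling, assuming the case where each column is divided by its own total (a sample when the matrix is :math:`H`, a signature when it is :math:`W`). ``scale_columns`` is a hypothetical plain-Python helper for illustration, not the package's implementation, and leaving all-zero columns as zeros is an assumption of this sketch.

```python
from typing import List


def scale_columns(matrix: List[List[float]]) -> List[List[float]]:
    """Divide each entry by its column total so every column sums to 1."""
    totals = [sum(col) for col in zip(*matrix)]
    # Assumption: a column whose total is zero is left as zeros
    # rather than raising a division error.
    return [[v / t if t else 0.0 for v, t in zip(row, totals)] for row in matrix]


# Toy H matrix: 2 signatures (rows) by 2 samples (columns).
h = [[2.0, 1.0],
     [6.0, 1.0]]
h_scaled = scale_columns(h)
# h_scaled is [[0.25, 0.5], [0.75, 0.5]]: each sample's weights now sum to 1.
```

After scaling, each value reads as the proportion of a sample's total weight carried by one signature, which is what makes samples with very different totals comparable.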
Scale a matrix to a proportion of the feature/sample total, or to a proportion of the signature total. :param matrix: Matrix to be scaled, one of :attr:`h` or :attr:`w`, or a string from ``{'h', 'w'}``. :param by: Scale to proportion of ``sample``, ``feature``, or ``signature`` total. This defaults to * :math:`H`: ``sample`` * :math:`W`: ``signature`` :return: Scaled version of matrix. .. py:method:: significance_format(p: float, local_adj: float, global_adj: float) -> str :staticmethod: Convert p-values from univariate tests to display strings. By default, this will use the following strategy: * global_adj <= 0.01 -> *** * global_adj <= 0.05 -> ** * global_adj <= 0.1 -> * * p <= 0.01 -> .. * p <= 0.05 -> . If implementing a custom formatter, ``p`` is the unadjusted p-value, ``local_adj`` the adjusted p-value only considering the tests for one metadata category, and ``global_adj`` considering all tests. .. py:method:: univariate_tests(metadata: pandas.DataFrame, against: Optional[Union[pandas.DataFrame, Literal['signature', 'model_fit', 'both']]] = None, drop_na: bool = True, adj_method: str = 'fdr_bh', alpha: float = 0.05) -> pandas.DataFrame Test if signature relative weights vary between categories Test whether model weights are different between groups using non-parametric univariate tests. Currently uses the Mann-Whitney U-test on two sample cases, and Kruskal-Wallis tests on multiple category tests. For K-W tests, post-hoc tests are performed using Dunn's test, with the same adjustment and alpha values. Significant post-hoc tests are returned as a string in the results table, with the format A|B(0.001) for a significant result for pair A and B with adjusted p value of 0.001. :param metadata: Dataframe of metadata variables to test against. Can only handle discrete values. :param against: What to test the metadata against. This can be ``signature`` for relative :math:`H` weights, ``model_fit`` for per sample cosine similarity, or ``both`` (default). 
You can also provide any arbitrary matrix with the correct dimensions, for instance if you had done some custom processing of the :math:`H` matrix, or wanted to use absolute :math:`H` weights. An arbitrary matrix should not contain any NA values; any columns with NAs will have NA for all statistical test results. :param drop_na: Remove any samples with ``NA`` values from metadata before testing. This is done on a per test basis, so one ``NA`` will not cause a sample to be removed for all tests. :param adj_method: Method to adjust for multiple tests. This is applied both locally (for each metadata category), and globally (considering all tests). Accepts any method supported by statsmodels ``multipletests``. :param alpha: Threshold value to reject :math:`H0`. :return: Dataframe with results for each signature and each metadata variable. .. py:attribute:: LOAD_FILES :type: List[str] :value: ['x.tsv', 'h.tsv', 'w.tsv', 'parameters.yaml', 'properties.yaml'] Defines the files which are loaded to recreate a decomposition object from disk. .. py:attribute:: TOP_CRITERIA :type: Dict[str, bool] Defines which criteria are available to select the best decomposition based on, and whether to take high values (True) or low values (False). :meta private: .. py:property:: beta_divergence :type: float The beta divergence (using the method defined in the parameters object) between :math:`X` and :math:`WH`. .. py:property:: color_scale :type: plotnine.scale_color_discrete Plotnine scale for color aesthetic using signature colors. .. py:property:: colors :type: List[str] Colors which represent each signature in plots. Colors default to a colorblind distinct palette. Colors can be changed by setting this property.
A list can be provided, or a dictionary mapping signature name to new color:: # For a model with three signatures S1, S2, S3 # Change all colors with list model.colors = ['red', 'blue', '#ffffff'] # Change two colors using dictionary model.colors = dict(S1='green', S3='#000000') .. py:property:: cosine_similarity :type: float Cosine angle between flattened :math:`X` and :math:`WH`. A measure of how well the model reconstructs the input data. Ranges between 1 and 0, with 1 being perfect correlation, and 0 meaning the model is perpendicular to the input (no correlation). The same measure is available for each sample using :attr:`model_fit`. .. py:property:: feature_mapping :type: reapply.FeatureMapping Mapping of new data features to those in the model being reapplied When fitting new data to an existing model, the naming of features may vary or some features may not exist in the model. This property holds an object which maps from the new data features to the model features. For de-novo decompositions this will be None. .. py:property:: fill_scale :type: plotnine.scale_fill_discrete Plotnine scale for fill aesthetic using signature colors. .. py:property:: h :type: pandas.DataFrame Signature weights in each sample. Matrix with samples on columns, and signatures on rows, with each entry being a signature weight. This is not scaled, see :meth:`scaled`. .. py:property:: input_hash :type: int Hash of the input matrix. Used to validate loads where data was not included in the saved form. .. py:property:: l2_norm :type: float L2 norm between flattened :math:`X` and :math:`WH`. .. py:property:: model_fit :type: pandas.Series How well each sample :math:`i` is described by the model, expressed by the cosine angle between :math:`X_i` and :math:`(WH)_i`. Cosine angle ranges between 0 and 1 in this case, with 1 being good and 0 poor (perpendicular). .. py:property:: names :type: List[str] Names for each of the signatures. New names for signatures can be given as a list.
This will change the name in the :attr:`w` and :attr:`h` matrices:: # Set new names for a model with 4 signatures model.names = ['A', 'B', 'X', 'Y'] .. py:property:: parameters :type: NMFParameters Parameters used during decomposition. .. py:property:: primary_signature :type: pandas.Series Signature with the highest weight for each sample. The primary signature for a sample is the one with the highest weight in the :math:`H` matrix. In the unusual case where all signatures have 0 weight for a sample, this will return NaN, and is likely a sign of a poor model. .. py:property:: quality_series :type: pandas.Series Quality measures (r_squared, cosine similarity etc) as series. Each decomposition has a range of values describing its properties and approximation of the input data. This property is a series which includes all of these properties. .. py:property:: r_squared :type: float Coefficient of determination (:math:`R^2`) between flattened :math:`X` and :math:`WH`. A measure of how well the model reconstructs the input data. .. py:property:: rss :type: float Residual sum of squares between flattened :math:`X` and :math:`WH`. .. py:property:: sparsity_h :type: float Sparsity of :attr:`h` matrix. This is the proportion of entries in the :math:`H` matrix which are 0. .. py:property:: sparsity_w :type: float Sparsity of :attr:`w` matrix. This is the proportion of entries in the :math:`W` matrix which are 0. .. py:property:: w :type: pandas.DataFrame Feature weights in each signature. Matrix with signatures on columns, and features on rows, with each entry being a feature weight. This is not scaled, see :meth:`scaled`. .. py:property:: wh :type: pandas.DataFrame Product of decomposed matrices :math:`W` and :math:`H` which approximates input. .. py:class:: NMFParameters Bases: :py:obj:`NamedTuple` Parameters for a single decomposition, or iterations of bi-cross validation. See sklearn NMF documentation for more detail on parameters. .. 
py:method:: to_yaml(path: pathlib.Path) Write parameters to a YAML file. Save the parameters, except the input matrix, to a YAML file. :param path: File to write to .. py:attribute:: alpha :type: float :value: 0.0 Regularisation parameter applied to both :math:`H` and :math:`W` matrices. .. py:attribute:: beta_loss :type: str :value: 'kullback-leibler' Beta loss function for NMF decomposition. .. py:attribute:: init :type: str :value: 'nndsvdar' Initialisation method for :math:`H` and :math:`W` matrices on first step. Defaults to randomised non-negative SVD with small random values added to 0s. .. py:attribute:: keep_mats :type: bool :value: False Whether to return the :math:`H` and :math:`W` matrices as part of the results. .. py:attribute:: l1_ratio :type: float :value: 0.0 Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0. .. py:property:: log_str :type: str Format parameters in a readable way for logs/console. .. py:attribute:: max_iter :type: int :value: 3000 Maximum number of iterations during decomposition. Will terminate earlier if solution converges. .. py:attribute:: rank :type: int Rank of the decomposition. .. py:attribute:: seed :type: Optional[Union[int, numpy.random.Generator, str]] :value: None Random seed for initialising decomposition matrices; if None, no seed is used, so results will not be reproducible. .. py:attribute:: x :type: Optional[Union[BicvSplit, pandas.DataFrame]] For a simple decomposition, a matrix as a dataframe. For a bi-cross validation iteration, this should be the shuffled matrix split into mn parts, where m is the number of parts along rows, n along columns. When returning results and keep_mats is False, this will be set to None to avoid passing and saving large data. .. py:function:: bicv(params: Optional[NMFParameters] = None, **kwargs) -> BicvResult Perform a single run of bicrossvalidation. Perform one run of bicrossvalidation.
Parameters can either be passed as a :class:`NMFParameters` tuple and are documented there, or by keyword arguments using the same names as :class:`NMFParameters`. :returns: Comparisons of the held out submatrix and estimate for each fold .. py:function:: cli_decompose(input: str, output_dir: str, delimiter: str, progress: bool, verbosity: str, seed: int, l1_ratio: float, alpha: float, max_iter: int, beta_loss: str, init: str, n_runs: int, top_n: int, top_criteria: str, compress: bool, ranks: List[int], symlink: bool) -> None Decompositions for RANKS. RANKS is a list of ranks for which to generate decompositions. Generate a number of decompositions for each of the specified ranks. NMF solutions are non-unique and depend on initialisation, so when using an initialisation with randomness multiple solutions can be produced. From these solutions, the best can be retained based on criteria such as reconstruction error or cosine similarity. Some initialisation methods are deterministic, and as such only a single decomposition will be produced. The output is H and W matrices for each decomposition, tables of quality scores, and some analyses with default parameters. For further analysis, decompositions can be loaded using Decomposition.from_dir, or tables used directly for custom analyses. By default, a symlink to the input data .. py:function:: cli_rank_selection(input: str, output_dir: str, delimiter: str, shuffles: int, progress: bool, verbosity: str, seed: int, rank_min: int, rank_max: int, rank_step: int, l1_ratio: float, alpha: float, max_iter: int, beta_loss: str, init: str, design: Tuple[int, int]) -> None Rank selection for NMF using mn-fold bi-cross validation Attempt to identify a suitable rank k for decomposition of input matrix X. This is done by shuffling the matrix a number of times, and for each shuffle dividing it into m x n submatrices (m splits on rows, n splits on columns).
Each of these submatrices is held out in turn and an estimate learnt from the remaining matrices, and the quality of the estimated matrix used to identify a suitable rank. The underlying NMF implementation is from scikit-learn, and more documentation is available there for many of the NMF specific parameters. .. py:function:: cli_regu_selection(input: str, output_dir: str, delimiter: str, shuffles: int, progress: bool, verbosity: str, seed: int, rank: int, alpha: List[float], l1_ratio: float, max_iter: int, beta_loss: str, init: str, scale: bool, design: Tuple[int, int]) -> None Regularisation selection for NMF over ALPHA using 9-fold bi-cross validation Attempt to identify a suitable regularisation parameter alpha for decomposition of input matrix X at a given rank with a given ratio between L1 and L2 regularisation. This is done by shuffling the matrix a number of times, and for each shuffle dividing it into 9 submatrices. Each of these nine is held out in turn and an estimate learnt from the remaining matrices, and the quality of the estimated matrix used to identify a suitable alpha. The underlying NMF implementation is from scikit-learn, and more documentation is available there for many of the NMF specific parameters. ALPHA is a list of values to be tested. 0.0 will always be added. .. py:function:: cophenetic_correlation(decompositions: Dict[int, List[Decomposition]], on: Literal['h', 'w'] = 'h') -> pandas.Series Cophenetic correlation coefficient for rank selection The cophenetic correlation coefficient (ccc) is a commonly used way to select a suitable rank for decompositions (Brunet 2004). It is based on assigning each sample or feature to a single signature, and looking for stability in which are assigned to the same signature across multiple random initialisations. Our primary method for rank selection is bicrossvalidation, but we offer the ability to calculate ccc when you have performed multiple decompositions for a rank using :func:`decompositions`.
:param decompositions: Results from the :func:`decompositions` function. A dictionary with the key being a rank, the value a list of decompositions for that rank. :param on: Look for stability in the assignment in the H matrix (samples) or W matrix (features). :returns: Series indexed by rank and with value being the ccc. .. py:function:: decompose(params: NMFParameters) -> Decomposition Perform a single decomposition of a matrix. :param params: Decomposition parameters as a :class:`NMFParameters` object. :return: A single decomposition .. py:function:: decompositions(x: pandas.DataFrame, ranks: Iterable[int], random_starts: int = 100, top_n: int = 5, top_criteria: str = 'beta_divergence', seed: Optional[Union[int, numpy.random.Generator]] = None, alpha: Optional[float] = None, l1_ratio: Optional[float] = None, max_iter: Optional[int] = None, beta_loss: Optional[str] = None, init: Optional[str] = 'random', progress_bar: bool = True) -> Dict[int, List[Decomposition]] Get the best decompositions for input matrix for one or more ranks. The model obtained by NMF decomposition depends on the initial values of the two matrices W and H; different initialisations lead to different solutions. Two approaches to initialising H and W are to attempt multiple random initialisations and select the best ones based on criteria such as reconstruction error, or to adopt a deterministic method (such as nndsvd) to set initial values. This function provides both approaches, but defaults to multiple random initialisations. To use one of the deterministic methods, change the initialisation method using `init`. A dictionary with one entry for each rank of decomposition requested is returned, with the values being a list of the top_n best decompositions for that rank. Where a deterministic method is used, the list will only have one item.
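As a rough illustration of the selection step described above, the sketch below ranks several hypothetical runs by a quality criterion and keeps the best ``top_n``. The run records, the ``HIGHER_IS_BETTER`` table, and ``select_top`` are illustrative names and are not part of cvanmf; the direction assumed for each criterion follows the usual meaning of each measure and is an assumption, not taken from the library.

```python
from typing import Dict, List

# Assumed direction for each criterion: True means higher values are better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "beta_divergence": False,
    "rss": False,
    "l2_norm": False,
    "r_squared": True,
    "cosine_similarity": True,
}


def select_top(
    runs: List[Dict[str, float]], criterion: str, top_n: int
) -> List[Dict[str, float]]:
    """Return the top_n runs, best first, according to criterion."""
    ordered = sorted(
        runs,
        key=lambda run: run[criterion],
        reverse=HIGHER_IS_BETTER[criterion],
    )
    return ordered[:top_n]


# Hypothetical quality scores from four random initialisations at one rank.
runs = [
    {"start": 0, "beta_divergence": 12.5, "cosine_similarity": 0.91},
    {"start": 1, "beta_divergence": 15.0, "cosine_similarity": 0.95},
    {"start": 2, "beta_divergence": 11.0, "cosine_similarity": 0.93},
    {"start": 3, "beta_divergence": 14.0, "cosine_similarity": 0.89},
]

best_by_divergence = select_top(runs, "beta_divergence", top_n=2)
best_by_cosine = select_top(runs, "cosine_similarity", top_n=2)
```

The two criteria can disagree on which runs are best, which is one reason to inspect more than one retained decomposition per rank.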
:param x: Matrix to be decomposed :param ranks: Rank(s) of decompositions to be produced :param random_starts: Number of random initialisations to be tried for each rank. Ignored if using a deterministic initialisation. :param top_n: Number of decompositions to be returned for each rank. :param top_criteria: Criteria to use when determining which are the top decompositions. Can be one of beta_divergence, rss, r_squared, cosine_similarity, or l2_norm. :param seed: Seed or random generator used :param alpha: Regularisation parameter applied to both H and W matrices. :param l1_ratio: Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0. This controls the mix between sparsifying and densifying regularisation. 1.0 will encourage sparsity, 0.0 density :param max_iter: Maximum number of iterations during decomposition. Will terminate earlier if solution converges :param beta_loss: Beta loss function for NMF decomposition :param init: Initialisation method for H and W matrices on first step. Defaults to random :param progress_bar: Display progress bar .. py:function:: dispersion(decompositions: Dict[int, List[Decomposition]], on: Literal['h', 'w'] = 'h') -> pandas.Series Dispersion coefficient for rank selection The dispersion coefficient is a method for rank selection which looks for consistency in the average consensus matrix (Park 2007). This shares the same underlying data structure as :func:`cophenetic_correlation`, the average consensus matrix, looking at how often elements are assigned to the same signature, with elements assigned to the signature with maximum weight. The value for dispersion ranges between 0 and 1, with 1 indicating perfect stability, and 0 a highly scattered consensus matrix. Our primary method for rank selection is bicrossvalidation, but we offer the ability to calculate dispersion when you have performed multiple decompositions for a rank using :func:`decompositions`.
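To make the idea of the consensus matrix and the dispersion coefficient concrete, here is a minimal pure-Python sketch. It uses the commonly cited definition, the mean over all entries of :math:`4(C_{ij} - 0.5)^2`, where :math:`C_{ij}` is the fraction of runs in which elements :math:`i` and :math:`j` share a signature. The function names and toy assignments are hypothetical, and cvanmf's own implementation may differ in detail.

```python
from typing import List


def consensus_matrix(assignments: List[List[int]]) -> List[List[float]]:
    """Fraction of runs in which each pair of elements shares a signature.

    assignments holds one list per run, giving the index of the
    maximum-weight signature for each element (sample or feature).
    """
    n_runs = len(assignments)
    n = len(assignments[0])
    counts = [[0] * n for _ in range(n)]
    for labels in assignments:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / n_runs for count in row] for row in counts]


def dispersion_coefficient(consensus: List[List[float]]) -> float:
    """Mean of 4 * (C_ij - 0.5)^2 over every entry of the consensus matrix."""
    n = len(consensus)
    total = sum(4.0 * (value - 0.5) ** 2 for row in consensus for value in row)
    return total / (n * n)


# Signature labels may be permuted between runs; only co-assignment matters.
stable = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
unstable = [[0, 0, 1, 1], [0, 1, 0, 1]]

stable_score = dispersion_coefficient(consensus_matrix(stable))
unstable_score = dispersion_coefficient(consensus_matrix(unstable))
```

Entries of the consensus matrix near 0 or 1 contribute most to the score, while entries near 0.5 (pairs that are grouped together in about half of the runs) contribute nothing, so the stable toy example scores higher than the unstable one.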
:param decompositions: Results from the :func:`decompositions` function. A dictionary with the key being a rank, the value a list of decompositions for that rank. :param on: Look for stability in the assignment in the H matrix (samples) or W matrix (features). :returns: Series indexed by rank and with value being the dispersion coefficient. .. py:function:: plot_rank_selection(results: Dict[Union[int, float], List[BicvResult]], exclude: Optional[Iterable[str]] = None, include: Optional[Iterable[str]] = None, show_all: bool = False, geom: str = 'box', summarise: Literal['mean', 'median'] = 'mean', suggested_rank: bool = True, stars_at: Optional[Dict[str, int]] = None, star_size: int = 4, jitter: bool = None, jitter_size: float = 0.3, n_col: int = None, xaxis: str = 'rank', rotate_x_labels: Optional[float] = None, geom_params: Dict[str, Any] = None, **kwargs) -> plotnine.ggplot Plot rank selection results from bicrossvalidation. Draw either box plots or violin plots showing statistics comparing :math:`A` and :math:`A'` from all bicrossvalidation results across a range of ranks. The plotting library used is ``plotnine``; the returned plot object can be saved or drawn using ``plt_obj.save`` or ``plt_obj.draw`` respectively. By default, only `cosine_similarity` and `r_squared` are plotted. You can define which measures to include using include, or which to exclude using exclude. You can also use show_all to show all the measures. For `cosine_similarity` and `r_squared`, a suggestion of optimal rank is given by identifying an elbow point in the graph using the package ``kneed``, indicated by a star above that rank. :param results: Dictionary of results, with rank as key and a list of :class:`BicvResult` for that rank as value :param exclude: Measures from :class:`BicvResult` not to plot. :param include: Measures from :class:`BicvResult` to plot. :param show_all: Show all measures, ignoring anything set in include or exclude. :param geom: Type of plot to draw.
Accepts either 'box' or 'violin' :param summarise: How to summarise the statistics across the folds of a given shuffle. :param suggested_rank: Estimate rank using :func:`suggest_rank`. :param stars_at: Manually define x-axis values at which to place stars above the main plot. Mainly used to allow :func:`plot_regu_selection` to pass where to plot stars for regularisation selection. :param star_size: Size of star indicating suggested rank. :param jitter: Draw individual points for each shuffle above the main plot. :param jitter_size: Size of jitter points. :param n_col: Number of columns in the plot. If blank, attempts to guess a sensible value. :param xaxis: Value to plot along the x-axis. "rank" for rank selection, "alpha" for regularisation selection. :param rotate_x_labels: Degrees to rotate x-axis labels by. If None will rotate if x-axis is float. :param **kwargs: Passed to :func:`suggest_rank`. :return: :class:`plotnine.ggplot` instance .. py:function:: plot_regu_selection(regu_res: Union[Tuple[float, Dict], Dict], alpha_star: bool = True, **kwargs) -> plotnine.ggplot Plot regularisation selection results. Takes a result from :func:`regu_selection` and passes to :func:`plot_rank_selection` to plot with alpha values along the x-axis. Consequently, pass any parameters for plotting as kwargs. :param regu_res: Results from :func:`regu_selection`. :param alpha_star: Suggest and plot a suitable alpha value using :func:`suggest_alpha`. .. py:function:: plot_stability_rank_selection(decompositions: Optional[Dict[int, List[Decomposition]]] = None, series: Optional[List[pandas.Series]] = None, include: List[str] = ['cophenetic_correlation', 'dispersion', 'signature_similarity'], suggested_rank: bool = True, on: Literal['h', 'w'] = 'h') -> plotnine.ggplot Plot results for stability based rank selection methods (:func:`signature_similarity`, :func:`cophenetic_correlation`, :func:`dispersion`). Automated rank selection uses :func:`suggest_rank_stability`.
:param decompositions: Results from :func:`decompositions`. Not used if series is passed. :param series: Series to plot, resulting from :func:`signature_similarity`, :func:`cophenetic_correlation`, or :func:`dispersion`. :param include: Which method to include in the plot, a list containing values from ``{'cophenetic_correlation', 'dispersion', 'signature_similarity'}``. :param suggested_rank: Estimate a suggested rank using :func:`suggest_rank_stability`. :param on: Calculate stability of H (samples) or W (features). Not used if passed series. .. py:function:: rank_selection(x: pandas.DataFrame, ranks: Iterable[int], shuffles: int = 100, keep_mats: Optional[bool] = None, seed: Optional[Union[int, numpy.random.Generator]] = None, alpha: Optional[float] = None, l1_ratio: Optional[float] = None, max_iter: Optional[int] = None, beta_loss: Optional[str] = None, init: Optional[str] = None, design: Optional[Tuple[int, int]] = (3, 3), progress_bar: bool = True) -> Dict[int, List[BicvResult]] Bi-cross validation for rank selection. Run :math:`mn`-fold bicrossvalidation across a range of ranks. Briefly, the input matrix is shuffled `shuffles` times. Each shuffle is then split into :math:`m \times n` submatrices (:math:`m` splits on rows, :math:`n` splits on columns). The rows and columns of submatrices are permuted, and the top left submatrix (:math:`A`) is estimated through NMF decompositions of the other matrices producing an estimate :math:`A'`. Various measures of how well :math:`A'` reconstructed :math:`A` are provided, see :class:`BicvResult` for details on the measures. No multiprocessing is used, as the majority of builds of scikit-learn seem to make good use of multiple processors anyway (depending on compilation of underlying libraries and matrix size). This method returns a dictionary with each rank as a key, and a list containing one :class:`BicvResult` for each shuffle. :param x: Input matrix. :param ranks: Ranks of k to be searched.
Iterable of unique ints. :param shuffles: Number of times to shuffle `x`. :param keep_mats: Return A' and shuffle as part of results. :param seed: Random value generator or seed for creation of the same. If not provided, will initialise with entropy from system. :param alpha: Regularisation coefficient :param l1_ratio: Ratio between L1 and L2 regularisation. L2 regularisation (0.0) is densifying, L1 (1.0) sparsifying. :param max_iter: Maximum iterations of NMF updates. Will end early if solution converges. :param beta_loss: Beta-loss function, see sklearn documentation for details. :param init: Initialisation method for H and W during decomposition. Used only where one of the matrices during bi-cross steps is not fixed. See sklearn documentation for values. :param design: How many blocks to split the input matrix into on rows and columns respectively. Defaults to 3x3 9-fold design. :param progress_bar: Show a progress bar while running. :returns: Dictionary with entry for each rank, containing a list of results for each shuffle (as a :class:`BicvResult` object) .. py:function:: regu_selection(x: pandas.DataFrame, rank: int, alphas: Optional[Iterable[float], None] = None, scale_samples: Optional[bool] = None, shuffles: int = 100, keep_mats: Optional[bool] = None, seed: Optional[Union[int, numpy.random.Generator]] = None, l1_ratio: Optional[float] = 1.0, max_iter: Optional[int] = None, beta_loss: Optional[str] = None, init: Optional[str] = None, design: Tuple[int, int] = (3, 3), progress_bar: bool = True) -> Tuple[float, Dict[float, List[BicvResult]]] Bicrossvalidation for regularisation selection. Run :math:`mn`-fold bicrossvalidation across a range of regularisation (alpha) values, for a single rank. For a brief description of bi-cross validation see :func:`rank_selection`. No multiprocessing is used, as the majority of builds of scikit-learn seem to make good use of multiple processors anyway.
This method returns a tuple with * a float which is the tested alpha which meets the criteria in the ES paper * a dictionary with each alpha value as a key, and a list containing one :class:`BicvResult` for each shuffle :param x: Input matrix. :param rank: Rank of decomposition. :param alphas: Regularisation alpha parameters to be searched. If left blank a default range will be used. :param scale_samples: Divide alpha by number of samples. This is provided as the way regularisation is performed changed in newer sklearn versions, and alpha is multiplied by n_samples. Setting this to True results in the same calculation as earlier sklearn versions, such as the one used in the Enterosignatures paper. If this is set it is honoured; if left as None, an automatically calculated alpha range will be scaled by sample, while a specified alpha range will not be scaled. :param shuffles: Number of times to shuffle `x`. :param keep_mats: Return :math:`A'` and `shuffle` as part of results. :param seed: Random value generator or seed for creation of the same. If not provided, will initialise with entropy from system. :param alpha: Regularisation coefficient :param l1_ratio: Ratio between L1 and L2 regularisation. L2 regularisation (0.0) is densifying, L1 (1.0) sparsifying. :param max_iter: Maximum iterations of NMF updates. Will end early if solution converges. :param beta_loss: Beta-loss function, see sklearn documentation for details. :param init: Initialisation method for H and W during decomposition. Used only where one of the matrices during bi-cross steps is not fixed. See sklearn documentation for values. :param progress_bar: Show a progress bar while running. :param design: Number of blocks to split input into on rows and columns respectively for bicrossvalidation. :returns: Tuple of the suggested alpha and a dictionary with an entry for each alpha value, containing a list of results for each shuffle (as a :class:`BicvResult` object) .. 
py:function:: signature_similarity(decompositions: Dict[int, List[Decomposition]]) -> pandas.Series Mean cosine similarity of signatures for rank selection This rank selection criterion is based on the intuition that if a solution is good, it should be similar across multiple random initialisations, similar to the motivation for :func:`cophenetic_correlation` and :func:`dispersion`. We pair signatures based on cosine similarity (see :func:`cvanmf.stability.match_signatures`) and take the mean value between paired signatures at a rank, and look for clear peaks. Similarity is calculated between the best decomposition and all others, not all possible pairs. The paired cosine similarity can also be visualised in more detail using :func:`cvanmf.stability.plot_signature_stability`. :param decompositions: Decompositions for several ranks as output by :func:`decompositions`. .. py:function:: suggest_alpha(regu_results: Dict[float, List[BicvResult]]) -> float Suggest a suitable value for alpha. We want to select the largest value of :math:`\alpha` possible which does not detrimentally affect the quality of the decomposition. To gauge this, we adopt the heuristic of [REF], selecting the highest value of :math:`\alpha` for which the mean :math:`R^2` is not lower than the (mean :math:`R^2` + standard deviation) at :math:`\alpha=0`. This is called by default in :func:`regu_selection`. It is provided as a public method as the Nextflow pipeline splits the Bicv process, and doesn't use :func:`regu_selection`, and so it can be called after. :param regu_results: Dictionary with keys being alpha values, and values a list of :class:`BicvResult` objects. .. py:function:: suggest_rank(rank_selection_results: Union[Dict[int, List[BicvResult]], pandas.DataFrame], summarise: Callable[[numpy.ndarray], float] = np.mean, measures: List[str] = ['cosine_similarity', 'r_squared'], **kwargs) -> Dict[str, int] Suggest a suitable rank.
Attempt to identify an elbow point in the graphs of cosine similarity and :math:`R^2` which represent points where the rate of improvement in the decomposition slows. Please note this is only a suggestion of a suitable rank; the plots should still be inspected and decompositions of candidate ranks inspected to make a final decision. This is implemented using the excellent `kneed` package, and `**kwargs` are passed to the constructor of `KneeLocator`; you can use this if you wish to customise the behaviour. We use the online mode of kneed by default. :param rank_selection_results: Results from :func:`rank_selection`, or these results in DataFrame format from :meth:`BicvResult.results_to_table` :param summarise: Function to summarise results from a shuffle. Roughly speaking, determines which point represents the middle of the distribution of values for purposes of the curve. :param measures: The measures to consider if passed a DataFrame :param kwargs: Arguments passed to ``KneeLocator`` constructor .. py:function:: suggest_rank_stability(rank_selection_results: Union[pandas.DataFrame, Iterable[pandas.Series], Dict[int, List[Decomposition]]], measures: List[str] = ['cophenetic_correlation', 'dispersion', 'signature_similarity'], near_max: float = 0.02, **kwargs) -> Dict[str, int] Suggest a suitable rank in stability based measures. Attempt to identify peaks in stability based rank selection criteria (cophenetic correlation, dispersion, signature similarity). By default the highest peak is selected. Where there are many similar ranks (defined by `near_max`), the one with the most consecutively decreasing values after it is selected. Please note this is only a suggestion of a suitable rank; the plots should still be inspected and decompositions of candidate ranks inspected to make a final decision.
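The peak selection rule described above can be sketched in plain Python as follows. This is one reading of the description rather than cvanmf's implementation: ``suggest_peak`` and its helpers are hypothetical names, peaks are taken to be interior points strictly greater than both neighbours, a candidate peak is any peak within a fraction ``near_max`` of the global maximum, and ties on the length of the decreasing run are broken by the higher value.

```python
from typing import Dict, List


def local_maxima(values: List[float]) -> List[int]:
    """Indices of interior points strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


def decreasing_run(values: List[float], start: int) -> int:
    """Number of consecutive strictly decreasing steps after index start."""
    run = 0
    i = start
    while i + 1 < len(values) and values[i + 1] < values[i]:
        run += 1
        i += 1
    return run


def suggest_peak(scores: Dict[int, float], near_max: float = 0.02) -> int:
    """Pick a rank from a mapping of rank to stability score."""
    ranks = sorted(scores)
    values = [scores[rank] for rank in ranks]
    global_max = max(values)
    # Candidate peaks are those within near_max of the global maximum.
    candidates = [
        i for i in local_maxima(values)
        if values[i] >= global_max * (1 - near_max)
    ]
    if not candidates:
        # Assumed fallback: no qualifying peak, use the global maximum.
        return ranks[values.index(global_max)]
    best = max(candidates, key=lambda i: (decreasing_run(values, i), values[i]))
    return ranks[best]


# Hypothetical stability scores for ranks 2 to 8.
scores = {2: 0.90, 3: 0.975, 4: 0.95, 5: 0.93, 6: 0.91, 7: 0.98, 8: 0.97}
suggested = suggest_peak(scores)
strict = suggest_peak(scores, near_max=0.0)
```

In this toy series the peaks at ranks 3 and 7 are both within 2% of the maximum, and rank 3 is followed by the longer run of decreasing values, so it is preferred; with ``near_max=0.0`` only the global maximum at rank 7 qualifies.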
When making a plot multiple times (changing parameters etc), it may be preferable to calculate the measures then pass the results as a list of Series, as the calculation can be time consuming. :param rank_selection_results: Results from :func:`decompositions`, or a collection of series produced by :func:`dispersion`, :func:`cophenetic_correlation`, and :func:`signature_similarity`, or a DataFrame of those series joined. :param measures: The measures to consider if passed a DataFrame :param near_max: Consider peaks (:math:`p`) candidates if they are within a certain distance of global maximum (:math:`gm`): :math:`p \geq gm(1 - \text{near\_max})`. :param kwargs: Passed to ``np.argrelmax``. .. py:data:: Numeric Alias for python numeric types (a union of int and float). .. py:data:: PcoaMatrices Allowed matrices which PCoA can be constructed from. Allows values w, x, wh, signatures (alias for w). .. py:data:: logger :type: logging.Logger Logger object.