cvanmf.denovo
=============

.. py:module:: cvanmf.denovo

.. autoapi-nested-parse::

   Generate new models using NMF decomposition

   This module provides functions to generate new models from data,
   which encompasses three main steps: rank selection, regularisation
   selection, and model inspection. The first two of these steps involve
   running decompositions multiple times for a range of values, and can be
   time-consuming. Methods are provided to run the whole process on a single
   machine, but also for running individual decompositions, which are used by the
   accompanying nextflow pipeline to allow spreading the computation across
   multiple nodes in an HPC environment.

   The main functions for each step are

   * :func:`rank_selection` and :func:`plot_rank_selection`
   * :func:`regu_selection` and :func:`plot_regu_selection`
   * :func:`decompositions` which produces :class:`Decomposition` objects

   Individual decompositions are represented by a :class:`Decomposition` object,
   and visualisation and analysis are carried out using object methods (such as
   :meth:`Decomposition.plot_feature_weight()`).


Attributes
----------

.. autoapisummary::

   cvanmf.denovo.Numeric
   cvanmf.denovo.PcoaMatrices
   cvanmf.denovo.logger


Classes
-------

.. autoapisummary::

   cvanmf.denovo.BicvFold
   cvanmf.denovo.BicvResult
   cvanmf.denovo.BicvSplit
   cvanmf.denovo.Decomposition
   cvanmf.denovo.NMFParameters


Functions
---------

.. autoapisummary::

   cvanmf.denovo.bicv
   cvanmf.denovo.cli_decompose
   cvanmf.denovo.cli_rank_selection
   cvanmf.denovo.cli_regu_selection
   cvanmf.denovo.cophenetic_correlation
   cvanmf.denovo.decompose
   cvanmf.denovo.decompositions
   cvanmf.denovo.dispersion
   cvanmf.denovo.plot_rank_selection
   cvanmf.denovo.plot_regu_selection
   cvanmf.denovo.plot_stability_rank_selection
   cvanmf.denovo.rank_selection
   cvanmf.denovo.regu_selection
   cvanmf.denovo.signature_similarity
   cvanmf.denovo.suggest_alpha
   cvanmf.denovo.suggest_rank
   cvanmf.denovo.suggest_rank_stability


Module Contents
---------------

.. py:class:: BicvFold

   Bases: :py:obj:`NamedTuple`


   One fold from a shuffled matrix

   The submatrices have been joined into the structure shown below

   .. code-block:: text

       A B . B
       C D . D
       . . . .
       C D . D

   from which A will be estimated as A' using only B, C, D.
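   
   As a rough illustration of how the blocks relate to a fold, the sketch
   below labels each block of an m x n grid for a given fold index. It is plain
   Python rather than the library's implementation; ``fold_layout`` is a
   hypothetical helper, and the row-by-row fold numbering is an assumption.

   .. code-block:: python

      from typing import List


      def fold_layout(m: int, n: int, i: int) -> List[List[str]]:
          """Label each block of an m x n grid for the i-th fold."""
          # Assumption: folds are numbered row by row, so fold i holds out
          # the block at row i // n, column i % n.
          held_row, held_col = divmod(i, n)
          layout = []
          for r in range(m):
              row = []
              for c in range(n):
                  if r == held_row and c == held_col:
                      row.append("A")  # held-out block to be estimated
                  elif r == held_row:
                      row.append("B")  # same rows, other columns
                  elif c == held_col:
                      row.append("C")  # same columns, other rows
                  else:
                      row.append("D")  # shares neither rows nor columns
              layout.append(row)
          return layout


      centre_fold = fold_layout(3, 3, 4)

   For fold 4 of a 3 x 3 design the held-out block sits in the centre of the
   grid; reordering rows and columns so that ``A`` comes first gives the
   structure shown above.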


   .. py:attribute:: A
      :type:  pandas.DataFrame


   .. py:attribute:: B
      :type:  pandas.DataFrame


   .. py:attribute:: C
      :type:  pandas.DataFrame


   .. py:attribute:: D
      :type:  pandas.DataFrame


.. py:class:: BicvResult

   Bases: :py:obj:`NamedTuple`


   Results from a single bi-cross validation run. For each BicvSplit there
   are :math:`mn` folds (for :math:`m` splits on rows, :math:`n` splits on
   columns), for which the top left submatrix (:math:`A`) is estimated
   (:math:`A'`) using the other portions.
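   
   The per-fold measures compare :math:`A` with its estimate :math:`A'`. The
   sketch below shows how such measures can be computed on flattened matrices
   in plain Python. It is not the library's code, and the :math:`R^2`
   definition used (one minus the residual sum of squares over the total sum of
   squares) is an assumption.

   .. code-block:: python

      import math
      from typing import List

      Matrix = List[List[float]]


      def flatten(mx: Matrix) -> List[float]:
          """Flatten a matrix given as a list of rows."""
          return [value for row in mx for value in row]


      def cosine_similarity(a: Matrix, a_est: Matrix) -> float:
          x, y = flatten(a), flatten(a_est)
          dot = sum(p * q for p, q in zip(x, y))
          norms = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
          return dot / norms


      def rss(a: Matrix, a_est: Matrix) -> float:
          """Residual sum of squares between A and its estimate."""
          return sum((p - q) ** 2 for p, q in zip(flatten(a), flatten(a_est)))


      def r_squared(a: Matrix, a_est: Matrix) -> float:
          # Assumed definition: 1 - RSS / TSS, with TSS taken around the mean of A.
          x = flatten(a)
          mean = sum(x) / len(x)
          tss = sum((p - mean) ** 2 for p in x)
          return 1.0 - rss(a, a_est) / tss


      a = [[1.0, 2.0], [3.0, 4.0]]
      a_est = [[1.0, 2.0], [3.0, 3.0]]
      fold_rss = rss(a, a_est)
      fold_r2 = r_squared(a, a_est)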


   .. py:method:: join_folds(results: List[BicvResult]) -> BicvResult
      :staticmethod:


      Join results from individual folds

      Each fold returns a BicvResult with a length one array. This method
      joins these into a single object summarising all the folds. Could
      also join other sets of results.

      :param results: Results from individual folds
      :returns: Single object with individual arrays joined


   .. py:method:: results_to_table(results: Union[Iterable[BicvResult], Dict[Numeric, Iterable[BicvResult]]], summarise: Callable[[numpy.ndarray], float] = np.mean) -> pandas.DataFrame
      :staticmethod:


      Convert bi-cross validation results to a table

      For results run with the same parameters, convert the output to a
      table suitable for plotting.

      :param results: List of results for bicv runs with the same
          parameters on different shuffles of the data, or dict of runs
          across multiple values on the same shuffles.
      :param summarise: Function to reduce each of the measures (r_squared etc)
          to a single value for each shuffle.


   .. py:method:: to_series(summarise: Callable[[numpy.ndarray], float] = np.mean) -> pandas.Series

      Convert bi-cross validation results to a series

      :param summarise: Function to reduce each of the measures (r_squared etc)
          to a single value for each shuffle.
      :returns: Series with entry for each non-parameter measure


   .. py:attribute:: a
      :type:  Optional[List[numpy.ndarray]]

      Reconstructed matrix A for each fold. Not included when keep_mats is
      False.


   .. py:attribute:: cosine_similarity
      :type:  numpy.ndarray

      Cosine similarity between each A and A' considered as a flattened
      vector.


   .. py:attribute:: i
      :type:  int

      Shuffle number when there are multiple shuffles. Included to allow
      spreading bicv across multiple processes, but without needing to return
      a copy of the full matrix.


   .. py:attribute:: l2_norm
      :type:  numpy.ndarray

      L2 norm between each A and A'.


   .. py:attribute:: parameters
      :type:  NMFParameters

      Parameters used during this run.


   .. py:attribute:: r_squared
      :type:  numpy.ndarray

      Explained variance between each A and A', with each considered as a
      flattened vector.


   .. py:attribute:: reconstruction_error
      :type:  numpy.ndarray

      Reconstruction error between each A and A'.


   .. py:attribute:: rss
      :type:  numpy.ndarray

      Residual sum of squares between each A and A'


   .. py:attribute:: sparsity_h
      :type:  numpy.ndarray

      Sparsity of H matrix for each A'


   .. py:attribute:: sparsity_w
      :type:  numpy.ndarray

      Sparsity of W matrix for each A'


.. py:class:: BicvSplit(mx: List[pandas.DataFrame], design: Tuple[int, int], i: Optional[int] = None)

   Shuffled matrix for bi-cross validation, split into :math:`mn` submatrices
   in an :math:`m \times n` pattern. To shuffle and split an existing matrix,
   use the static method :meth:`BicvSplit.from_matrix`.

   Create a shuffled matrix from the split submatrices. For the default 3 x 3
   design there are 9 of these, which should be in the order

   .. code-block:: text

       0, 1, 2
       3, 4, 5
       6, 7, 8

   :param mx: Split matrices in a flat list. These should be all of one
       row, then all of the next row etc.
   :param design: Number of even splits on rows and columns of mx.
   :param i: Index of the split if one of many
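   
   To make the expected ordering concrete, the sketch below arranges a flat,
   row-major list of blocks into a 2d grid for a given design. ``to_grid`` is a
   hypothetical helper used only for illustration, with integers standing in
   for the submatrices.

   .. code-block:: python

      from typing import List, Sequence, Tuple, TypeVar

      T = TypeVar("T")


      def to_grid(flat: Sequence[T], design: Tuple[int, int]) -> List[List[T]]:
          """Arrange a row-major flat list of blocks into m rows of n blocks."""
          m, n = design
          if len(flat) != m * n:
              raise ValueError(f"Expected {m * n} blocks, got {len(flat)}")
          return [list(flat[r * n:(r + 1) * n]) for r in range(m)]


      grid = to_grid(list(range(9)), (3, 3))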


   .. py:method:: col(col: int, join: bool = False) -> Union[List[pandas.DataFrame], pandas.DataFrame]

      Get a column of the submatrices by index. Convenience method for
      readability.

      :param col: Column index to get
      :param join: Join into a single matrix
      :returns: List of submatrices making up the column, or these submatrices
          joined if requested.


   .. py:method:: fold(i: int) -> BicvFold

      Construct a fold of the data

      There are m*n possible folds of the data; this method constructs the
      i-th fold.

      :param i: Index of the fold to construct, from [0, mn)
      :returns: A, B, C, and D matrices for this fold


   .. py:method:: from_matrix(df: pandas.DataFrame, n: int = 1, design: Tuple[int, int] = (3, 3), random_state: Optional[Union[int, numpy.random.Generator]] = None) -> Generator[BicvSplit]
      :staticmethod:


      Create random shuffles and splits of a matrix

      :param df: Matrix to shuffle and split
      :param n: Number of shuffles
      :param design: Number of blocks to divide rows and columns into.
          Default is 3x3 9-fold bicrossvalidation.
      :param random_state: Random state, either int seed or numpy Generator;
          None for default numpy random Generator.
      :returns: A generator of splits, as BicvSplit objects
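      
      The idea of shuffling and splitting can be sketched without the library:
      permute rows and columns, then cut the result into a grid of blocks. The
      code below is an illustration only. It uses the standard library
      ``random`` module rather than a numpy Generator, and the way uneven
      dimensions are divided is an assumption.

      .. code-block:: python

         import random
         from typing import List, Optional, Tuple

         Matrix = List[List[float]]


         def bounds(length: int, parts: int) -> List[int]:
             """Boundaries splitting `length` items into `parts` near-even chunks."""
             return [k * length // parts for k in range(parts + 1)]


         def shuffle_and_split(
             mx: Matrix, design: Tuple[int, int] = (3, 3), seed: Optional[int] = None
         ) -> List[Matrix]:
             """Shuffle rows and columns, then return the blocks in row-major order."""
             rng = random.Random(seed)
             row_order = list(range(len(mx)))
             col_order = list(range(len(mx[0])))
             rng.shuffle(row_order)
             rng.shuffle(col_order)
             shuffled = [[mx[r][c] for c in col_order] for r in row_order]
             m, n = design
             rb, cb = bounds(len(row_order), m), bounds(len(col_order), n)
             return [
                 [row[cb[j]:cb[j + 1]] for row in shuffled[rb[i]:rb[i + 1]]]
                 for i in range(m)
                 for j in range(n)
             ]


         x = [[float(10 * r + c) for c in range(6)] for r in range(6)]
         blocks = shuffle_and_split(x, design=(3, 3), seed=0)

      Every value of the input appears in exactly one block, and reusing the
      same seed reproduces the same shuffle.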


   .. py:method:: load_all_npz(path: Union[pathlib.Path, str], allow_pickle: bool = True, fix_i: bool = False) -> Generator[BicvSplit]
      :staticmethod:


      Read shuffles from files.

      Reads either all the npz files in a directory, or those specified by a
      glob. The expectation is the filenames are in format prefix_i.npz,
      where i is the number of this shuffle. If not, use fix_i to renumber
      in order loaded.

      :param path: Directory with .npz files, or glob identifying .npz files
      :param allow_pickle: Allow unpickling when loading; necessary for
          compressed files.
      :param fix_i: Renumber shuffles.
      :returns: Generator of BicvSplit objects.


   .. py:method:: load_npz(path: pathlib.Path, allow_pickle: bool = True, i: Optional[int] = None) -> BicvSplit
      :staticmethod:


      Load splits from file.

      Load splits from an npz file. The column and row names are not
      restored, but this is unimportant for cross-validation.
      Pickling is required for loading compressed files, but is not secure,
      so the option is provided to turn it off if you don't need it.

      :param path: File to load
      :param allow_pickle: Allow unpickling when loading; necessary for
          compressed files
      :param i: Shuffle number. Will attempt to parse from filename
          if blank. Only parses files like prefix_1.npz, which will be i=1,
          prefix only alphanumeric.
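      
      The filename convention described for ``i`` can be illustrated with a
      small regular expression. The pattern and the ``shuffle_number`` helper
      below are assumptions made for this sketch, not the parser the library
      uses.

      .. code-block:: python

         import re
         from pathlib import Path
         from typing import Optional

         # Alphanumeric prefix, underscore, shuffle number, .npz extension.
         SHUFFLE_NAME = re.compile(r"^[A-Za-z0-9]+_(\d+)\.npz$")


         def shuffle_number(path: Path) -> Optional[int]:
             """Return the shuffle number encoded in a filename, or None."""
             match = SHUFFLE_NAME.match(path.name)
             return int(match.group(1)) if match else None


         parsed = shuffle_number(Path("splits/shuffle_12.npz"))

      Names whose prefix contains other characters, such as
      ``my-shuffle_3.npz``, do not match, in which case ``i`` would need to be
      given explicitly.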


   .. py:method:: row(row: int, join: bool = False) -> Union[List[pandas.DataFrame], pandas.DataFrame]

      Get a row of the submatrices by index. Convenience method for
      readability.

      :param row: Row index to get
      :param join: Join into a single matrix
      :returns: List of submatrices making up the row, or these submatrices
          joined if requested.


   .. py:method:: save_all_npz(splits: Iterable[BicvSplit], path: pathlib.Path, fix_i: bool = False, force: bool = False, compress: bool = True) -> None
      :staticmethod:


      Save a collection of splits to a directory as npz files.

      :param splits: Iterable of BicvSplit objects
      :param path: Directory to write to
      :param fix_i: Renumber all splits starting from 0. Does not check
          if existing numbering is unique.
      :param compress: Use compression
      :param force: Overwrite existing files


   .. py:method:: save_npz(path: pathlib.Path, compress: bool = True, force: bool = False) -> None

      Save these splits to file.

      Write the splits to a numpy format file. This will lose the row
      and column names, however this is unimportant for rank selection.
      Compression is enabled by default, as sparse data such as microbiome
      counts tends to create large files.

      :param path: Path to write to. If passed a directory, will output
          with filename `shuffle_{i}.npz`. If `i` is not set, an error is raised.
      :param compress: Use compression.
      :param force: Overwrite existing files.


   .. py:property:: design
      :type: Tuple[int, int]


      Design of holdout pattern, given as (rows, columns).


   .. py:property:: folds
      :type: List[BicvFold]


      List of the mn possible folds of these submatrices.


   .. py:property:: i
      :type: int


      This is the i-th shuffle of the input.


   .. py:property:: mx
      :type: List[List[pandas.DataFrame]]


      Submatrices as a 2d list.


   .. py:property:: num_folds
      :type: int


      Total number of folds in this design.


   .. py:property:: shape
      :type: Tuple[int, int]


      Dimensions of the input matrix.


   .. py:property:: size
      :type: int


      Size of input matrix.


   .. py:property:: x
      :type: pandas.DataFrame


      The input matrix

      This reproduces the input matrix by concatenating the submatrices.

      :returns: Input matrix


.. py:class:: Decomposition(parameters: NMFParameters, h: pandas.DataFrame, w: pandas.DataFrame, feature_mapping: Optional[reapply.FeatureMapping] = None)

   Decomposition of a matrix.

   Note that we use the naming conventions and orientation common in NMF
   literature:

   * :math:`X` is the input matrix, with m features on rows, and n samples on
     columns.
   * :math:`H` is the transformed data, with k signatures on rows, and n
     samples on columns.
   * :math:`W` is the feature weight matrix, with m features on rows, and k
     signatures on columns.

   The scikit-learn implementation has these transposed; this package
   handles transposing back and forth internally, and expects input in the
   features x samples orientation, and provides :math:`W` and :math:`H` in line
   with the literature rather than scikit-learn.
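   
   The orientation can be checked with a tiny worked example: with m features,
   k signatures and n samples, :math:`W` is m x k, :math:`H` is k x n, and
   their product :math:`WH` approximates the m x n matrix :math:`X`. The values
   below are made up for illustration, and ``matmul`` is a plain Python helper
   rather than anything from this package.

   .. code-block:: python

      from typing import List

      Matrix = List[List[float]]


      def matmul(w: Matrix, h: Matrix) -> Matrix:
          """Multiply an m x k matrix by a k x n matrix."""
          k, n = len(h), len(h[0])
          return [[sum(w_row[s] * h[s][j] for s in range(k)) for j in range(n)]
                  for w_row in w]


      # m = 3 features, k = 2 signatures, n = 4 samples
      w = [[1.0, 0.0],
           [0.5, 0.5],
           [0.0, 2.0]]             # m x k: features on rows, signatures on columns
      h = [[1.0, 0.0, 2.0, 1.0],
           [0.0, 1.0, 1.0, 3.0]]   # k x n: signatures on rows, samples on columns
      wh = matmul(w, h)            # m x n: same shape as the input X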

   Decomposition objects can be sliced using the syntax::

       sliced_model = model[samples, features, signatures]
       # Only slice on one dimension
       sliced_signatures = model[:, :, ['S1', 'S2']]

   Slices must be ordered collections of strings, integer indices, or booleans.


   .. py:method:: compare_signatures(b: Comparable) -> pandas.DataFrame

      Similarity between these signatures and one other set.

      Similarity here is defined as the cosine of the angle between each
      pair of signature vectors, so 1 is identical (ignoring scale) and
      0 is perpendicular.

      This is a convenience method which calls
      :func:`stability.compare_signatures`.

      :param b: Signature matrix, or object with signature matrix
      :returns: Matrix with cosine of angles between signature vectors.


   .. py:method:: consensus_matrix(on: Union[Literal['h', 'w'], pandas.DataFrame] = 'h') -> scipy.sparse.csr_array

      Consensus matrix of either :math:`H` or :math:`W`.

      Most typically, the consensus matrix is calculated on the :math:`H`
      matrix, and is a binary matrix representing whether sample :math:`i` is
      assigned to the same signature as sample :math:`j`. Samples are
      assigned to signatures based on their maximum weight. When calculated
      on :math:`W`, it is the same but for features assigned.

      The primary use of this is in generating a :math:`\bar{C}` matrix, the
      mean number of times two elements are assigned to the same signature.
      :math:`\bar{C}` is used to calculate the :meth:`cophenetic_correlation`
      and :meth:`dispersion` coefficients, a method of determining suitable
      rank.

      This is returned as a lower triangular matrix in sparse format.
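      
      A minimal sketch of the idea, using nested lists instead of a sparse
      array: each sample is assigned to the signature with its largest weight,
      and entry (i, j) is 1 when samples i and j share an assignment. Whether
      the diagonal is included in the library's output is an assumption here.

      .. code-block:: python

         from typing import List


         def consensus_lower(h: List[List[float]]) -> List[List[int]]:
             """Lower triangular consensus matrix for a k x n matrix H."""
             n_samples = len(h[0])
             # Index of the signature with maximum weight for each sample.
             assigned = [
                 max(range(len(h)), key=lambda s: h[s][j]) for j in range(n_samples)
             ]
             return [
                 [1 if j <= i and assigned[i] == assigned[j] else 0
                  for j in range(n_samples)]
                 for i in range(n_samples)
             ]


         h = [[0.9, 0.2, 0.7],
              [0.1, 0.8, 0.3]]
         consensus = consensus_lower(h)

      Here the first and third samples are both assigned to the first
      signature, so they share an entry below the diagonal. Averaging such
      matrices over repeated decompositions gives :math:`\bar{C}`.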


   .. py:method:: discrete_signature_scale(axis: Literal['x', 'y']) -> Union[plotnine.scale_x_discrete, plotnine.scale_y_discrete]

      Make a plotnine scale which puts the signatures in order.

      By default, plotnine sorts discrete values alphabetically (S1, S11 ..
      S2, S21). This method produces a scale object which can be added to a
      plot to put the signatures in the order they have in this object.
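      
      The ordering problem is easy to reproduce with plain strings. The
      signature names below are made up for illustration.

      .. code-block:: python

         # Order of signatures as they might appear in a decomposition.
         signature_order = ["S1", "S2", "S3", "S11", "S21"]

         # Alphabetical sorting compares character by character, so S11
         # sorts before S2.
         alphabetical = sorted(signature_order)

      One way to keep the intended order in plotnine is to pass the list as
      ``limits`` to a discrete scale; the scale returned by this method is
      presumably equivalent in effect.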


   .. py:method:: load(in_dir: os.PathLike, x: Optional[Union[pandas.DataFrame, str, os.PathLike]] = None, delim: str = '\t')
      :staticmethod:


      Load a decomposition from disk.

      Loads a decomposition previously saved using :meth:`save`. Will
      automatically determine whether this is a directory or .tar.gz.
      Can provide a DataFrame of the :math:`X` input matrix, primarily this is
      so when loading multiple decompositions they can all reference the
      same object. Can also provide an explicit path; if not provided will
      attempt to load from ``x.tsv``.

      :param in_dir: Directory or .tar.gz containing decomposition.
      :param x: Either the X input matrix as a DataFrame, or a path to
          a delimiter-separated copy of the X matrix. If None, will attempt
          to load from x.tsv.
      :param delim: Delimiter for tabular data


   .. py:method:: load_decompositions(in_dir: os.PathLike, delim: str = '\t') -> Dict[int, List[Decomposition]]
      :staticmethod:


      Load multiple decompositions.

      Load a set of decompositions previously saved using
      :meth:`save_decompositions`. Will attempt to share a reference to
      the same :math:`X` matrix for memory reasons. The output is a dictionary
      with keys being ranks, and values being lists of decompositions
      for that rank.

      :param in_dir: Directory to read from
      :param delim: Delimiter for tabular data files


   .. py:method:: match_signatures(b: Comparable) -> pandas.DataFrame

      Identify optimal matches between these signatures and one other set

      Find the pairing of signatures which are most similar. More technically,
      this finds the pairing of signatures which maximises the total cosine
      similarity using the Hungarian algorithm. It is possible that a
      signature gets paired with another for which the cosine similarity is
      not highest, suggesting a potentially bad match between some signatures
      in the model.

      The return is a dataframe with columns a and b for which signatures
      are paired, the cosine similarity of the pairing, and the maximum
      'off-target' cosine value for any of the signatures which it was not
      assigned to. The intention for the off-target score is that ideally
      this would be low, and the paired similarity high: signatures match
      well their paired one, while being dissimilar to all others.

      This is a convenience method which calls
      :func:`stability.match_signatures`.

      :param b: Signature matrix, or object with signature matrix
      :returns: DataFrame with pairing and scores
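      
      The pairing and off-target score can be sketched on a toy example. For
      brevity this illustration tries every permutation instead of using the
      Hungarian algorithm, which gives the same optimum for small inputs but
      does not scale. It assumes both sets hold the same number of signatures,
      and the names used are hypothetical.

      .. code-block:: python

         import math
         from itertools import permutations
         from typing import Dict, List, Tuple

         Signatures = Dict[str, List[float]]


         def cosine(u: List[float], v: List[float]) -> float:
             dot = sum(p * q for p, q in zip(u, v))
             return dot / (math.sqrt(sum(p * p for p in u))
                           * math.sqrt(sum(q * q for q in v)))


         def match_by_cosine(
             a: Signatures, b: Signatures
         ) -> List[Tuple[str, str, float, float]]:
             """Pair signatures to maximise total cosine similarity."""
             a_names, b_names = list(a), list(b)
             sim = {(i, j): cosine(a[i], b[j]) for i in a_names for j in b_names}
             # Brute force stand-in for the Hungarian algorithm.
             best = max(
                 permutations(b_names),
                 key=lambda perm: sum(sim[i, j] for i, j in zip(a_names, perm)),
             )
             rows = []
             for i, j in zip(a_names, best):
                 # Highest similarity to any signature it was not paired with.
                 off_target = max(
                     (sim[i, other] for other in b_names if other != j), default=0.0
                 )
                 rows.append((i, j, sim[i, j], off_target))
             return rows


         a = {"S1": [1.0, 0.0, 0.0], "S2": [0.0, 1.0, 1.0]}
         b = {"T1": [0.0, 1.0, 1.0], "T2": [1.0, 0.1, 0.0]}
         pairs = match_by_cosine(a, b)

      In this example ``S1`` pairs with ``T2`` and ``S2`` with ``T1``, each
      with high paired similarity and low off-target similarity.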


   .. py:method:: monodominant_samples(threshold: float = 0.9) -> pandas.DataFrame

      Which samples have a monodominant signature.

      A monodominant signature is one which represents at least the
      threshold amount of the weight in the :meth:`scaled` :math:`H` matrix.

      :param threshold: Proportion of the scaled H matrix weight to consider
          a signature dominant.
      :return: Dataframe with column is_monodominant indicating if a
          sample has a monodominant signature, and signature_name indicating
          the name of the signature, or none if not.
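      
      As a small illustration of the threshold rule, the sketch below checks
      one sample at a time. It assumes that scaling means dividing a sample's
      signature weights by their total; ``monodominant`` is a hypothetical
      helper and the weights are made up.

      .. code-block:: python

         from typing import Dict, Optional, Tuple


         def monodominant(
             weights: Dict[str, float], threshold: float = 0.9
         ) -> Tuple[bool, Optional[str]]:
             """Return (is_monodominant, signature_name) for one sample."""
             total = sum(weights.values())
             name, top = max(weights.items(), key=lambda item: item[1])
             if total > 0 and top / total >= threshold:
                 return True, name
             return False, None


         samples = {
             "sample_a": {"S1": 9.5, "S2": 0.5},  # S1 holds 95% of the weight
             "sample_b": {"S1": 3.0, "S2": 2.0},  # S1 holds 60% of the weight
         }
         result = {sample: monodominant(w) for sample, w in samples.items()}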


   .. py:method:: name_signatures_by_weight(cumulative_sum: float = 0.4, max_char_length: int = 10, max_num_features: int = 5, feature_delimiter: str = '+', number: bool = True, clean: Callable[[str], str] = lambda x: x.replace(' ', '_')) -> None

      Give a slightly more descriptive name to each signature.

      Append features with highest relative weights to the end of
      signature names. This alters the object in place.

      :param cumulative_sum: Add features up to this cumulative sum (from
          max to min).
      :param max_char_length: Maximum length of new name (before joining
          with feature delimiter).
      :param max_num_features: Maximum number of features to use in name.
      :param feature_delimiter: When multiple features used, will join with
          this character
      :param number: Number the signatures. When true, starts each new name
          with S1, S2, etc.
      :param clean: Function to clean the string. Defaults to replacing
          spaces with underscores.


   .. py:method:: pcoa(on: Union[pandas.DataFrame, Literal['x', 'h', 'wh', 'signatures']] = 'h', distance: str = 'braycurtis', wisconsin_standardise: bool = True, sqrt: bool = True) -> skbio.OrdinationResults

      Principal Coordinates Analysis of decomposition.

      Performs PCoA on the specified matrix, and returns a scikit-bio
      OrdinationResults object. Can base distances on any matrix which has
      a column for each sample, or specify one of these via string. Defaults
      to distances based on :meth:`scaled` :math:`H` (signature weight in
      sample) matrix.

      Matrix is Wisconsin double standardised by default, as described in R
      function ``cmdscale``.

      Distance defaults to Bray-Curtis dissimilarity, and is square root
      transformed. Distance is calculated with scipy ``pdist`` function, and
      any method supported there can be specified in distance argument.

      :param on: Matrix to derive distances from
      :param distance: Distance method to use
      :param wisconsin_standardise: Apply Wisconsin double standardisation
      :param sqrt: Square root transform distances
      :return: PCoA results object from scikit-bio
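      
      The preprocessing steps can be sketched in plain Python. This assumes
      Wisconsin double standardisation in its usual form (each feature divided
      by its maximum, then each sample divided by its total), followed by
      square root transformed Bray-Curtis dissimilarity between samples. It
      illustrates the calculation only, is not the code path used by this
      method, and does not handle all-zero samples.

      .. code-block:: python

         import math
         from typing import List

         Matrix = List[List[float]]


         def wisconsin(mx: Matrix) -> Matrix:
             """Double standardise a features x samples matrix."""
             by_max = []
             for row in mx:
                 row_max = max(row)
                 by_max.append([v / row_max if row_max > 0 else 0.0 for v in row])
             totals = [sum(row[j] for row in by_max) for j in range(len(mx[0]))]
             return [[v / totals[j] if totals[j] > 0 else 0.0
                      for j, v in enumerate(row)] for row in by_max]


         def bray_curtis(u: List[float], v: List[float]) -> float:
             return (sum(abs(p - q) for p, q in zip(u, v))
                     / sum(p + q for p, q in zip(u, v)))


         def sample_distances(mx: Matrix, sqrt: bool = True) -> Matrix:
             """Pairwise distances between samples (columns)."""
             cols = [list(col) for col in zip(*wisconsin(mx))]
             dist = [[bray_curtis(a, b) for b in cols] for a in cols]
             if sqrt:
                 dist = [[math.sqrt(d) for d in row] for row in dist]
             return dist


         x = [[4.0, 0.0, 2.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 3.0]]
         distances = sample_distances(x)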


   .. py:method:: plot_feature_weight(threshold: float = 0.04, label_fn: Callable[[str], str] = None) -> plotnine.ggplot

      Plot features which contribute to each signature.

      Represent the relative contribution of features to signatures, showing
      any features which contribute over a threshold proportion of the weight.

      :param threshold: Show any features which contribute more than this
          proportion of the weight for this signature.
      :param label_fn: Function to map labels (use to make shortened labels
          for example)


   .. py:method:: plot_metadata(metadata: pandas.DataFrame, against: Optional[Union[pandas.DataFrame, Literal['signature', 'model_fit', 'both']]] = None, continuous_fn: Optional[Callable[[pandas.Series], bool]] = None, discrete_fn: Optional[Callable[[pandas.Series], bool]] = None, boxplot_params: Optional[Dict] = None, point_params: Optional[Dict] = None, disc_rotate_labels: Optional[float] = None, show_significance: bool = True, significance_formatter: Optional[Callable[[float, float, float], str]] = None, univariate_test_params: Dict[str, Any] = None) -> Tuple[plotnine.ggplot, plotnine.ggplot]

      Plot relative signature weight against metadata.

      Produce plots of signature weight against metadata. Produces two plots,
      one with boxplots for categorical metadata, one with scatter plots for
      continuous metadata. Will infer which type each column is. To use an
      integer as categorical, convert it to Categorical type in pandas. Will
      conduct univariate tests as described in :meth:`univariate_tests` and
      indicate significance with symbols. This will be skipped if
      ``show_significance`` is False, which may be sensible for larger
      numbers of samples and metadata categories.

      :param metadata: Dataframe with samples on rows, and metadata on
          columns.
      :param against: DataFrame to plot the metadata against. Should
          contain an entry for each sample, with samples on rows. Defaults to
          :meth:`scaled` :math:`H` matrix (transpose of typical :math:`H`
          orientation).
      :param continuous_fn: Function to determine if a column is
          continuous. Defaults to considering any floating type or integer to
          be continuous. May want to customise if you want to use things such
          as date time formats.
      :param discrete_fn: Function to determine if a column is categorical.
          Defaults to considering any string, or object type column with
          a number of unique values < 3/4 its length as categorical.
      :param boxplot_params: Dictionary of parameters to pass to
          ``geom_boxplot``. These will be fixed parameters (so color="pink"
          to set all box outlines to pink).
      :param point_params: Dictionary of parameters to pass to ``geom_point``.
          Will be fixed parameters, see above.
      :param disc_rotate_labels: Angle to rotate x axis labels by for
          boxplots.
      :param show_significance: Add significance to each subplot for discrete
          metadata.
      :param significance_formatter: Function which takes the p-value and
          adjusted p-values and returns a string to use as label. Defaults
          to :meth:`Decomposition.significance_format`.
      :param univariate_test_params: Parameters passed to
          :meth:`univariate_tests`
      :return: A tuple of plotnine ggplot objects, first is boxplots,
          second is scatter plots.


   .. py:method:: plot_modelfit(group: Optional[pandas.Series] = None) -> plotnine.ggplot

      Plot model fit distribution.

      This provides a histogram of the model fit of samples by default. If
      a grouping is provided, this will instead produce boxplots with each
      box being the distribution within a group.

      :param group: Series giving label for group which each sample
          belongs to. Sample which are not in the group series will
          be dropped from results with warning.
      :return: Histogram or boxplots


   .. py:method:: plot_modelfit_point(threshold: Optional[float] = 0.4, yrange: Optional[Tuple[float, float]] = (0, 1), point_size: float = 1.0) -> plotnine.ggplot

      Model fit for each sample as a point on a vertical scale.

      It may be of interest to look at the model fit of individual samples,
      so this plot shows the model fit of each sample as a point on a
      vertical scale. A threshold can be set below which the point will be
      coloured red to indicate low model fit, by default this is 0.4.

      :param threshold: Value below which to colour the model fit red. If
          omitted will not colour any samples. The default of 0.4 is specific
          to the 5ES model (:func:`cvanmf.models.five_es`) and does not necessarily
          represent a good threshold for other models.


   .. py:method:: plot_pcoa(axes: Tuple[int, int] = (0, 1), color: Union[pandas.Series, Literal['signature']] = 'signature', shape: Optional[Union[pandas.Series, Literal['signature']]] = None, signature_arrows: bool = False, point_aes: Dict[str, Any] = None, **kwargs) -> plotnine.ggplot

      Ordination of samples.

      Perform PCoA of samples and plot two axes (the first two by default).
      PCoA is performed by the :meth:`pcoa` method, and arguments in kwargs are
      passed on to this method. By default, samples are coloured by their
      primary signature.

      :param axes: Indices of PCoA axes to plot
      :param color: Metadata to use to color the points, or 'signature' to
          color based on the primary signature
      :param shape: Metadata used to decide the shape of points,
          or 'signature' to base shape on the primary signature
      :param signature_arrows: Plot location of signatures as arrows
      :param point_aes: Dictionary of arguments to pass to geom_point
      :param kwargs: arguments to pass to :meth:`pcoa`
      :return: Scatter plot of samples


   .. py:method:: plot_relative_weight(group: Optional[pandas.Series] = None, group_colors: Optional[pandas.Series] = None, model_fit: bool = True, heights: Union[Dict[str, float], Iterable[float]] = None, width: float = 6.0, sample_label_size: float = 5.0, legend_cols_sig: int = 3, legend_cols_grp: int = 3, legend_side: str = 'bottom', **kwargs)

      Plot relative weight of each signature in each sample.

      To display the plot in a notebook environment, use ``result.render()``.
      Please note this plot uses the marsilea package rather than
      plotnine like other plots. Unfortunately, the options for combining
      multiple elements are not yet well developed in plotnine.

      Plots a stacked bar chart with a bar for each sample displaying the
      relative weight of each signature. Optionally the plot can also
      include sections at the top summarising the model fit for each
      sample, and a ribbon along the bottom displaying categorical metadata
      for samples.

      :param group: Categorical metadata for each sample to plot on ribbon
          at the bottom
      :param group_colors: Colour to associate with each of the metadata
          categories.
      :param model_fit: Include a top row indicating model fit per sample.
      :param heights: Height in inches for each component of the plot.
          Specify as a dictionary with keys 'dot', 'ribbon', 'bar', 'labels',
          or a list with heights for the elements included from top to bottom.
      :param width: Width of plot.
      :param sample_label_size: Size for sample labels. Set to 0 to remove
          sample labels.
      :param legend_cols_sig: Number of columns in Signature legend.
      :param legend_cols_grp: Number of columns in group legend.
      :param legend_side: Location of Signature and group legend. One of
          'top', 'right', 'left', 'bottom'
      :return: Marsilea whiteboard object. Call ``.render()`` to show plot.


   .. py:method:: plot_weight_distribution(threshold: float = 0.0, scale_transform: Optional[str] = 'log10', nrows: int = 1) -> plotnine.ggplot

      Plot the distribution of feature weights in each signature.

      The distribution of signature weights helps describe how mixed the
      features are which describe a sample. This will sort feature weights
      for each signature independently, and plot a bar for the weight of
      each feature. So distributions which are longer indicate more features
      contribute to that signature, and the height of bars indicates
      whether this is a long tail of low weights, all even, etc.

      :param threshold: Set any weight below this to 0. Effectively, consider
          very low weights to not contribute to the signature.
      :param scale_transform: Transformation to apply to the feature weight
          axis. Can be any of the transforms in `mizani`. For no transformation,
          pass None or "identity".
      :param nrows: Number of rows in the plot. Defaults to having all plots
          on one row for comparability.


   .. py:method:: reapply(y: pandas.DataFrame, input_validation: Optional[cvanmf.reapply.InputValidation] = None, feature_match: Optional[cvanmf.reapply.FeatureMatch] = None, **kwargs) -> Decomposition

      Get signature weights for new data.

      When the features in ``y`` exactly match those used to learn this
      decomposition, you can set the ``input_validation`` and
      ``feature_match`` parameters as None.

      In some cases, the features in new data y may not exactly match
      those used in the original decomposition, for instance if you have new
      microbiome data there may be different taxa present, or a different
      naming format may be used in the new data. The function
      ``feature_match`` can be used to handle these cases, by defining a
      function to map names between new and existing data. The
      ``input_validation`` function is largely used for existing models, to
      validate that the data being provided is in the expected format.

      :param y: New data of the same type used to generate this decomposition
      :param input_validation: Function to validate and transform ``y``
      :param feature_match: Function to match features in ``y`` and :attr:`w`
      :param kwargs: Arguments to pass to ``input_validation`` and
          ``feature_match``
      :return: :class:`Decomposition` with signature weights for samples in
          ``y``.


   .. py:method:: representative_signatures(threshold: float = 0.9) -> pandas.DataFrame

      Which signatures describe a sample.

      Identify which signatures contribute to describing a sample.
      Representative signatures are those for which the cumulative sum is
      equal to or lower than the threshold value.

      This is done by considering each sample in the :meth:`scaled`
      :math:`H` matrix, and taking a cumulative sum of weights in descending
      order. Any signature for which the cumulative sum is less than or equal
      to the threshold is considered representative.

      :param threshold: Cumulative sum below which signatures are considered
          representative.
      :return: Boolean dataframe indicating whether a signature is
          representative for a given sample.
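      
      The cumulative sum rule can be sketched for a single sample as follows.
      This follows the description literally and assumes scaling means dividing
      by the sample's total weight; the library may treat edge cases
      differently, for instance a sample whose largest signature alone exceeds
      the threshold. ``representative`` is a hypothetical helper.

      .. code-block:: python

         from typing import Dict


         def representative(
             weights: Dict[str, float], threshold: float = 0.9
         ) -> Dict[str, bool]:
             """Flag signatures whose descending cumulative sum is <= threshold."""
             total = sum(weights.values())
             flags = {name: False for name in weights}
             running = 0.0
             ordered = sorted(weights.items(), key=lambda item: item[1], reverse=True)
             for name, weight in ordered:
                 running += weight / total
                 if running > threshold:
                     break
                 flags[name] = True
             return flags


         # Scaled weights are 0.5, 0.25 and 0.25.
         sample = {"S1": 2.0, "S2": 1.0, "S3": 1.0}
         flags = representative(sample, threshold=0.75)

      With a threshold of 0.75 the cumulative sums are 0.5, 0.75 and 1.0, so
      the first two signatures are flagged and the third is not.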


   .. py:method:: save(out_dir: Union[str, pathlib.Path], compress: bool = False, param_path: Optional[pathlib.Path] = None, x_path: Optional[pathlib.Path] = None, symlink: bool = True, delim: str = '\t', plots: Optional[Union[bool, Iterable[str]]] = None) -> None

      Write decomposition to disk.

      Export this decomposition and associated data. This is written to text
      type files (tab separated for tables, yaml for dictionaries) to allow
      simpler reading in other analysis environments such as R. Exceptions
      are raised if any tables cannot be written, but plots are allowed to
      fail though will produce log entries.

      :param out_dir: Directory to write to. Must be empty.
      :param compress: Create compressed .tar.gz rather than directory.
      :param param_path: Path to YAML file containing parameters used. If not
          given will create a copy in the directory. If given and symlink
          is True, will try to make a symlink to parameters file.
      :param x_path: Path to X matrix used. Behaves as param_path for copies/
          symlinks.
      :param symlink: Make symlinks to param_path and x_path if possible.
      :param delim: Delimiter to use for tabular output.
      :param plots: Determine which plots to write. When left default
          (None) this will produce all plots if there are 500 or fewer samples.
          If True, all plots will be produced; if False no plots will be produced.
          If a list is provided, any plots named in the list will be produced,
          i.e. if given ['pcoa', 'modelfit', 'radar'], plots from
          :meth:`plot_pcoa` and :meth:`plot_modelfit` would be produced.
          'radar' would be ignored as there is no ``plot_radar`` method.


   .. py:method:: save_decompositions(decompositions: Dict[int, List[Decomposition]], output_dir: pathlib.Path, symlink: bool = True, delim: str = '\t', compress: bool = False, **kwargs) -> None
      :staticmethod:


      Save multiple decompositions to disk.

      Write multiple decompositions to disk. The structure is that a
      directory is created for each rank, then within that a directory for
      each decomposition. By default the input data and parameters will be
      saved at the top level, and symlinked to by each individual
      decomposition.

      The files output are tables for W and H matrices, scaled W and H,
      tables of basic analyses (primary signature etc.), and all default plots
      where possible.

      :param decompositions: Decompositions in form output by
          :func:`decompositions`.
      :param output_dir: Directory to write to which is either empty or
          does not exist.
      :param symlink: Symlink the parameters and input X files.
      :param delim: Delimiter for tabular output.
      :param compress: Compress each decomposition folder to .tar.gz
      :param **kwargs: Passed to :meth:`Decomposition.save`


   .. py:method:: scaled(matrix: Union[pandas.DataFrame, Literal['h', 'w']], by: Optional[str] = None) -> pandas.DataFrame

      Total sum scaled version of a matrix.

      Scale a matrix to a proportion of the feature/sample total, or
      to a proportion of the signature total.

      :param matrix: Matrix to be scaled, one of :attr:`h` or :attr:`w`, or a
          string from ``{'h', 'w'}``.
      :param by: Scale to proportion of ``sample``, ``feature``, or
          ``signature`` total. This defaults to

          * :math:`H`: ``sample``
          * :math:`W`: ``signature``

      :return: Scaled version of matrix.


   .. py:method:: significance_format(p: float, local_adj: float, global_adj: float) -> str
      :staticmethod:


      Convert p-values from univariate tests to display strings.

      By default, this will use the following strategy:

      * ``global_adj <= 0.01`` -> ``***``
      * ``global_adj <= 0.05`` -> ``**``
      * ``global_adj <= 0.1`` -> ``*``
      * ``p <= 0.01`` -> ``..``
      * ``p <= 0.05`` -> ``.``

      If implementing a custom formatter, ``p`` is the unadjusted p-value,
      ``local_adj`` the adjusted p-value only considering the tests for one
      metadata category, and ``global_adj`` considering all tests.


   .. py:method:: univariate_tests(metadata: pandas.DataFrame, against: Optional[Union[pandas.DataFrame, Literal['signature', 'model_fit', 'both']]] = None, drop_na: bool = True, adj_method: str = 'fdr_bh', alpha: float = 0.05) -> pandas.DataFrame

      Test if signature relative weights vary between categories

      Test whether model weights are different between groups using
      non-parametric univariate tests. Currently uses the Mann-Whitney
      U test for variables with two categories, and the Kruskal-Wallis test
      for variables with more than two categories.

      For K-W tests, post-hoc tests are performed using Dunn's
      test, with the same adjustment and alpha values. Significant
      post-hoc tests are returned as a string in the results table, with the
      format A|B(0.001) for a significant result for pair A and B with
      adjusted p value of 0.001.

      :param metadata: Dataframe of metadata variables to test against. Can
          only handle discrete values.
      :param against: What to test the metadata against. This can be
          ``signature`` for relative :math:`H` weights, ``model_fit`` for per
          sample cosine similarity, or ``both`` (default). You can also
          provide any arbitrary matrix with the correct dimensions, for
          instance if you had done some custom processing of the :math:`H`
          matrix, or wanted to use absolute :math:`H` weights. An arbitrary
          matrix should not contain any NA values; any columns with NAs will
          have NA for all statistical test results.
      :param drop_na: Remove any samples with ``NA`` values from metadata
          before testing. This is done on a per test basis, so one ``NA`` will
          not cause a sample to be removed for all tests.
      :param adj_method: Method to adjust for multiple tests. This is applied
          both locally (for each metadata category), and globally (considering
          all tests). Accepts any method supported by statsmodels
          ``multipletests``.
      :param alpha: Threshold value to reject the null hypothesis :math:`H_0`.
      :return: Dataframe with results for each signature and each metadata
          variable.


   .. py:attribute:: LOAD_FILES
      :type:  List[str]
      :value: ['x.tsv', 'h.tsv', 'w.tsv', 'parameters.yaml', 'properties.yaml']


      Defines the files which are loaded to recreate a decomposition object
      from disk.


   .. py:attribute:: TOP_CRITERIA
      :type:  Dict[str, bool]

      Defines which criteria are available to select the best decomposition
      based on, and whether to take high values (True) or low values (False).

      :meta private:


   .. py:property:: beta_divergence
      :type: float


      The beta divergence (using the method defined in the parameters
      object) between :math:`X` and :math:`WH`.


   .. py:property:: color_scale
      :type: plotnine.scale_color_discrete


      Plotnine scale for color aesthetic using signature colors.


   .. py:property:: colors
      :type: List[str]


      Colors which represent each signature in plots.

      Colors default to a colorblind distinct palette.

      Colors can be changed by setting this property. A list can be provided,
      or a dictionary mapping signature name to new color::

          # For a model with three signatures S1, S2, S3
          # Change all colors with list
          model.colors = ['red', 'blue', '#ffffff']
          # Change two colors using dictionary
          model.colors = dict(S1='green', S3='#000000')


   .. py:property:: cosine_similarity
      :type: float


      Cosine angle between flattened :math:`X` and :math:`WH`.

      A measure of how well the model reconstructs the input data. Ranges
      between 1 and 0, with 1 being perfect correlation, and 0 meaning the
      model is perpendicular to the input (no correlation). The same
      measure is available for each sample using :attr:`model_fit`.


   .. py:property:: feature_mapping
      :type: reapply.FeatureMapping


      Mapping of new data features to those in the model being reapplied

      When fitting new data to an existing model, the naming of features may
      vary or some features may not exist in the model. This property holds an
      object which maps from the new data features to the model features.
      For de-novo decompositions this will be None.


   .. py:property:: fill_scale
      :type: plotnine.scale_fill_discrete


      Plotnine scale for fill aesthetic using signature colors.


   .. py:property:: h
      :type: pandas.DataFrame


      Signature weights in each sample.

      Matrix with samples on columns, and signatures on rows, with each entry
      being a signature weight. This is not scaled, see :meth:`scaled`.


   .. py:property:: input_hash
      :type: int


      Hash of the input matrix. Used to validate loads where data
      was not included in the saved form.


   .. py:property:: l2_norm
      :type: float


      L2 norm between flattened :math:`X` and :math:`WH`.


   .. py:property:: model_fit
      :type: pandas.Series


      How well each sample :math:`i` is described by the model, expressed
      by the cosine angle between :math:`X_i` and :math:`(WH)_i`. Cosine
      angle ranges between 0 and 1 in this case, with 1 being good and 0
      poor (perpendicular).


   .. py:property:: names
      :type: List[str]


      Names for each of the signatures.

      New names for signatures can be given as a list. This will change
      the name in the :attr:`w` and :attr:`h` matrices::

          # Set new names for a model with 4 signatures
          model.names = ['A', 'B', 'X', 'Y']


   .. py:property:: parameters
      :type: NMFParameters


      Parameters used during decomposition.


   .. py:property:: primary_signature
      :type: pandas.Series


      Signature with the highest weight for each sample.

      The primary signature for a sample is the one with the highest weight
      in the :math:`H` matrix. In the unusual case where all signatures have 0
      weight for a sample, this will return NaN, and is likely a sign of
      a poor model.


   .. py:property:: quality_series
      :type: pandas.Series


      Quality measures (r_squared, cosine similarity etc) as series.

      Each decomposition has a range of values describing its properties and
      approximation of the input data. This property is a series which
      includes all of these properties.


   .. py:property:: r_squared
      :type: float


      Coefficient of determination (:math:`R^2`) between flattened
      :math:`X` and :math:`WH`.

      A measure of how well the model reconstructs the input data.


   .. py:property:: rss
      :type: float


      Residual sum of squares between flattened :math:`X`
      and :math:`WH`.


   .. py:property:: sparsity_h
      :type: float


      Sparsity of :attr:`h` matrix.

      This is the proportion of entries in the :math:`H` matrix which are 0.


   .. py:property:: sparsity_w
      :type: float


      Sparsity of :attr:`w` matrix.

      This is the proportion of entries in the :math:`W` matrix which are 0.


   .. py:property:: w
      :type: pandas.DataFrame


      Feature weights in each signature.

      Matrix with signatures on columns, and features on rows, with each entry
      being a feature weight. This is not scaled, see :meth:`scaled`.


   .. py:property:: wh
      :type: pandas.DataFrame


      Product of decomposed matrices :math:`W` and :math:`H` which
      approximates input.


.. py:class:: NMFParameters

   Bases: :py:obj:`NamedTuple`


   Parameters for a single decomposition, or iterations of bi-cross
   validation. See sklearn NMF documentation for more detail on parameters.


   .. py:method:: to_yaml(path: pathlib.Path)

      Write parameters to a YAML file.

      Save the parameters, except the input matrix, to a YAML file.

      :param path: File to write to


   .. py:attribute:: alpha
      :type:  float
      :value: 0.0


      Regularisation parameter applied to both :math:`H` and :math:`W`
      matrices.


   .. py:attribute:: beta_loss
      :type:  str
      :value: 'kullback-leibler'


      Beta loss function for NMF decomposition.


   .. py:attribute:: init
      :type:  str
      :value: 'nndsvdar'


      Initialisation method for :math:`H` and :math:`W` matrices on first step.
      Defaults to ``nndsvdar``: non-negative double singular value
      decomposition (NNDSVD) with zeros filled with small random values.


   .. py:attribute:: keep_mats
      :type:  bool
      :value: False


      Whether to return the :math:`H` and :math:`W` matrices as part of the
      results.


   .. py:attribute:: l1_ratio
      :type:  float
      :value: 0.0


      Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0.


   .. py:property:: log_str
      :type: str


      Format parameters in readable way for logs/console.


   .. py:attribute:: max_iter
      :type:  int
      :value: 3000


      Maximum number of iterations during decomposition. Will terminate earlier
      if solution converges.


   .. py:attribute:: rank
      :type:  int

      Rank of the decomposition.


   .. py:attribute:: seed
      :type:  Optional[Union[int, numpy.random.Generator, str]]
      :value: None


      Random seed for initialising decomposition matrices; if None no seed
      used so results will not be reproducible.


   .. py:attribute:: x
      :type:  Optional[Union[BicvSplit, pandas.DataFrame]]

      For a simple decomposition, a matrix as a dataframe. For a bi-cross
      validation iteration, this should be the shuffled matrix split into mn
      parts, where m is the number of parts along rows, n along columns. When
      returning results and keep_mats is False, this will be set to None to
      avoid passing and saving large data.


.. py:function:: bicv(params: Optional[NMFParameters] = None, **kwargs) -> BicvResult

   Perform a single run of bicrossvalidation.

   Perform one run of bicrossvalidation. Parameters can either be passed
   as a :class:`NMFParameters` tuple and are documented there, or by keyword
   arguments using the same names as :class:`NMFParameters`.

   :returns: Comparisons of the held out submatrix and estimate for each fold


.. py:function:: cli_decompose(input: str, output_dir: str, delimiter: str, progress: bool, verbosity: str, seed: int, l1_ratio: float, alpha: float, max_iter: int, beta_loss: str, init: str, n_runs: int, top_n: int, top_criteria: str, compress: bool, ranks: List[int], symlink: bool) -> None

   Decompositions for RANKS.

   RANKS is a list of ranks for which to generate decompositions.

   Generate a number of decompositions for each of the specified ranks. NMF
   solutions are non-unique and depend on initialisation, so when using an
   initialisation with randomness multiple solutions can be produced.
   From these solutions, the best can be retained based on criteria such
   as reconstruction error or cosine similarity.

   Some initialisation methods are deterministic, and as such only a single
   decomposition will be produced.

   The output is H and W matrices for each decomposition, tables of quality
   scores, and some analyses with default parameters. For further analysis,
   decompositions can be loaded using Decomposition.from_dir, or tables used
   directly for custom analyses. By default, a symlink to the input data


.. py:function:: cli_rank_selection(input: str, output_dir: str, delimiter: str, shuffles: int, progress: bool, verbosity: str, seed: int, rank_min: int, rank_max: int, rank_step: int, l1_ratio: float, alpha: float, max_iter: int, beta_loss: str, init: str, design: Tuple[int, int]) -> None

   Rank selection for NMF using mn-fold bi-cross validation

   Attempt to identify a suitable rank k for decomposition of input matrix X.
   This is done by shuffling the matrix a number of times, and for each
   shuffle dividing it into m x n submatrices (m splits on rows, n splits on
   columns). Each of these submatrices is held out and an estimate learnt from
   the remaining matrices, and the quality of the estimated matrix used to
   identify a suitable rank.

   The underlying NMF implementation is from scikit-learn, and there is more
   documentation available there for many of the NMF specific parameters.


.. py:function:: cli_regu_selection(input: str, output_dir: str, delimiter: str, shuffles: int, progress: bool, verbosity: str, seed: int, rank: int, alpha: List[float], l1_ratio: float, max_iter: int, beta_loss: str, init: str, scale: bool, design: Tuple[int, int]) -> None

   Regularisation selection for NMF over ALPHA using 9-fold bi-cross validation

   Attempt to identify a suitable regularisation parameter alpha for
   decomposition of input matrix X at a given rank with a given ratio
   between L1 and L2 regularisation.
   This is done by shuffling the matrix a number of times, and for each
   shuffle dividing it into 9 submatrices. Each of these nine is held out and
   an estimate learnt from the remaining matrices, and the quality of the
   estimated matrix used to identify a suitable alpha.

   The underlying NMF implementation is from scikit-learn, and there is more
   documentation available there for many of the NMF specific parameters.

   ALPHA is a list of values to be tested. 0.0 will always be added.


.. py:function:: cophenetic_correlation(decompositions: Dict[int, List[Decomposition]], on: Literal['h', 'w'] = 'h') -> pandas.Series

   Cophenetic correlation coefficient for rank selection

   The cophenetic correlation coefficient (ccc) is a commonly used way to
   select a suitable rank for decompositions (Brunet 2004). It is based on
   assigning each sample or feature to a single signature, and looking for
   stability in which are assigned to the same signature across multiple random
   initialisations.

   Our primary method for rank selection is bicrossvalidation, but we offer
   the ability to calculate ccc when you have performed multiple
   decompositions for a rank using :func:`decompositions`.

   :param decompositions: Results from the :func:`decompositions` function.
       A dictionary with the key being a rank, the value a list of
       decompositions for that rank.
   :param on: Look for stability in the assignment in the H matrix (samples)
       or W matrix (features).
   :returns: Series indexed by rank and with value being the ccc.
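
As a rough illustration of the assignment step described for ``cophenetic_correlation``, the sketch below builds an average consensus matrix in plain Python. The helper names ``primary_assignment`` and ``consensus_matrix`` and the toy weights are assumptions made for this example and are not part of ``cvanmf``. The cophenetic correlation itself also needs a hierarchical clustering step, which is omitted here.

```python
from typing import Dict, List

# Toy H matrices from three hypothetical runs: sample -> signature weights.
Weights = Dict[str, List[float]]


def primary_assignment(h: Weights) -> Dict[str, int]:
    """Assign each sample to the index of its highest-weight signature."""
    return {s: max(range(len(w)), key=w.__getitem__) for s, w in h.items()}


def consensus_matrix(runs: List[Weights]) -> Dict[str, Dict[str, float]]:
    """Fraction of runs in which each pair of samples shares a signature."""
    samples = sorted(runs[0])
    counts = {a: {b: 0.0 for b in samples} for a in samples}
    for h in runs:
        assigned = primary_assignment(h)
        for a in samples:
            for b in samples:
                if assigned[a] == assigned[b]:
                    counts[a][b] += 1.0
    return {a: {b: counts[a][b] / len(runs) for b in samples} for a in samples}


runs = [
    {"s1": [0.9, 0.1], "s2": [0.8, 0.2], "s3": [0.1, 0.9]},
    {"s1": [0.2, 0.8], "s2": [0.3, 0.7], "s3": [0.9, 0.1]},  # labels swapped
    {"s1": [0.7, 0.3], "s2": [0.4, 0.6], "s3": [0.2, 0.8]},
]
consensus = consensus_matrix(runs)
```

Only co-assignment is counted, so the arbitrary order of signatures between runs (as in the second toy run) does not affect the result.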


.. py:function:: decompose(params: NMFParameters) -> Decomposition

   Perform a single decomposition of a matrix.

   :param params: Decomposition parameters as a :class:`NMFParameters` object.
   :return: A single decomposition


.. py:function:: decompositions(x: pandas.DataFrame, ranks: Iterable[int], random_starts: int = 100, top_n: int = 5, top_criteria: str = 'beta_divergence', seed: Optional[Union[int, numpy.random.Generator]] = None, alpha: Optional[float] = None, l1_ratio: Optional[float] = None, max_iter: Optional[int] = None, beta_loss: Optional[str] = None, init: Optional[str] = 'random', progress_bar: bool = True) -> Dict[int, List[Decomposition]]

   Get the best decompositions for input matrix for one or more ranks.

   The model obtained by NMF decomposition depends on the initial values of the
   two matrices W and H; different initialisations lead to different solutions.
   Two approaches to initialising H and W are to attempt multiple random
   initialisations and select the best ones based on criteria such as
   reconstruction error, or to adopt a deterministic method (such as
   nndsvd) to set initial values.

   This function provides both approaches, but defaults to multiple random
   initialisations. To use one of the deterministic methods, change the
   initialisation method using ``init``.

   A dictionary with one entry for each rank of decomposition requested is
   returned, with the values being a list of the ``top_n`` best decompositions
   for that rank. Where a deterministic method is used, the list will only have
   one item.

   :param x: Matrix to be decomposed
   :param ranks: Rank(s) of decompositions to be produced
   :param random_starts: Number of random initialisations to be tried for
       each rank. Ignored if using a deterministic initialisation.
   :param top_n: Number of decompositions to be returned for each rank.
   :param top_criteria: Criteria to use when determining which are the top
       decompositions. Can be one of beta_divergence, rss, r_squared,
       cosine_similarity, or l2_norm.
   :param seed: Seed or random generator used
   :param alpha: Regularisation parameter applied to both H and W matrices.
   :param l1_ratio: Regularisation mixing parameter. In range 0.0 <= l1_ratio
         <= 1.0. This controls the mix between sparsifying and densifying
         regularisation. 1.0 will encourage sparsity, 0.0 density.
   :param max_iter: Maximum number of iterations during decomposition. Will
       terminate earlier if solution converges
   :param beta_loss: Beta loss function for NMF decomposition
   :param init: Initialisation method for H and W matrices on first step.
       Defaults to random
   :param progress_bar: Display progress bar
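
The selection of the ``top_n`` best runs can be pictured with the minimal sketch below. It assumes that lower is better for error-like criteria and higher is better for similarity-like ones. The table ``HIGHER_IS_BETTER``, the helper ``select_top``, and the quality values are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
from typing import Dict, List

# Hypothetical direction table: True means higher values are better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "beta_divergence": False,
    "rss": False,
    "l2_norm": False,
    "r_squared": True,
    "cosine_similarity": True,
}


def select_top(runs: List[Dict[str, float]], criterion: str, top_n: int) -> List[Dict[str, float]]:
    """Return the top_n runs, best first, for the given quality criterion."""
    if criterion not in HIGHER_IS_BETTER:
        raise ValueError(f"Unknown criterion: {criterion}")
    ranked = sorted(runs, key=lambda r: r[criterion], reverse=HIGHER_IS_BETTER[criterion])
    return ranked[:top_n]


# Quality values from four hypothetical random starts at a single rank.
runs = [
    {"start": 0, "beta_divergence": 12.4, "r_squared": 0.91},
    {"start": 1, "beta_divergence": 10.1, "r_squared": 0.94},
    {"start": 2, "beta_divergence": 15.0, "r_squared": 0.88},
    {"start": 3, "beta_divergence": 11.3, "r_squared": 0.93},
]
best_by_divergence = select_top(runs, "beta_divergence", top_n=2)
best_by_r_squared = select_top(runs, "r_squared", top_n=2)
```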


.. py:function:: dispersion(decompositions: Dict[int, List[Decomposition]], on: Literal['h', 'w'] = 'h') -> pandas.Series

   Dispersion coefficient for rank selection

   The dispersion coefficient is a method for rank selection which
   looks for consistency in the average consensus matrix (Park 2007).
   This shares the same underlying data structure as
   :func:`cophenetic_correlation`, the average consensus matrix, looking at
   how often elements are assigned to the same signature, with elements
   assigned to the signature with maximum weight. The value for dispersion
   ranges between 0 and 1, with 1 indicating perfect stability, and 0 a highly
   scattered consensus matrix.

   Our primary method for rank selection is bicrossvalidation, but we offer
   the ability to calculate dispersion when you have performed multiple
   decompositions for a rank using :func:`decompositions`.

   :param decompositions: Results from the :func:`decompositions` function.
       A dictionary with the key being a rank, the value a list of
       decompositions for that rank.
   :param on: Look for stability in the assignment in the H matrix (samples)
       or W matrix (features).
   :returns: Series indexed by rank and with value being the dispersion
       coefficient.
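
A small sketch of the calculation, assuming the commonly used definition of the dispersion coefficient as the mean of ``4 * (c - 0.5) ** 2`` over all entries ``c`` of the average consensus matrix. The function name ``dispersion_coefficient`` and the toy matrices are made up for this example.

```python
from typing import List


def dispersion_coefficient(consensus: List[List[float]]) -> float:
    """Mean of 4 * (c - 0.5) ** 2 over all entries of a square matrix."""
    n = len(consensus)
    if any(len(row) != n for row in consensus):
        raise ValueError("Consensus matrix must be square")
    total = sum(4.0 * (c - 0.5) ** 2 for row in consensus for c in row)
    return total / (n * n)


# Every pair is always (1.0) or never (0.0) co-assigned across runs.
stable = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Every distinct pair is co-assigned in exactly half of the runs.
scattered = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]

stable_score = dispersion_coefficient(stable)
scattered_score = dispersion_coefficient(scattered)
```

Entries of 0 or 1 each contribute 1 to the sum and entries of 0.5 contribute 0, which is why a consistent consensus matrix scores higher than a scattered one.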


.. py:function:: plot_rank_selection(results: Dict[Union[int, float], List[BicvResult]], exclude: Optional[Iterable[str]] = None, include: Optional[Iterable[str]] = None, show_all: bool = False, geom: str = 'box', summarise: Literal['mean', 'median'] = 'mean', suggested_rank: bool = True, stars_at: Optional[Dict[str, int]] = None, star_size: int = 4, jitter: bool = None, jitter_size: float = 0.3, n_col: int = None, xaxis: str = 'rank', rotate_x_labels: Optional[float] = None, geom_params: Dict[str, Any] = None, **kwargs) -> plotnine.ggplot

   Plot rank selection results from bicrossvalidation.

   Draw either box plots or violin plots showing statistics comparing
   :math:`A` and :math:`A'` from all bicrossvalidation results across a
   range of ranks.
   The plotting library used is ``plotnine``; the returned plot object
   can be saved or drawn using ``plt_obj.save`` or ``plt_obj.draw``
   respectively.
   By default, only ``cosine_similarity`` and ``r_squared`` are plotted. You can
   define which measures to include using ``include``, or which to exclude using
   ``exclude``. You can also use ``show_all`` to show all the measures.

   For ``cosine_similarity`` and ``r_squared``, a suggestion of optimal rank
   is given by identifying an elbow point in the graph using the package
   ``kneed``, indicated by a star above that rank.

   :param results: Dictionary of results, with rank as key and a list of
       :class:`BicvResult` for that rank as value
   :param exclude: Measures from :class:`BicvResult` not to plot.
   :param include: Measures from :class:`BicvResult` to plot.
   :param show_all: Show all measures, ignoring anything set in include or
       exclude.
   :param geom: Type of plot to draw. Accepts either 'box' or 'violin'
   :param summarise: How to summarise the statistics across the folds
       of a given shuffle.
   :param suggested_rank: Estimate rank using :func:`suggest_rank`.
   :param stars_at: Manually define x-axis values at which to place stars
       above the main plot. Mainly used to allow :func:`plot_regu_selection`
       to pass where to plot stars for regularisation selection.
   :param star_size: Size of star indicating suggested rank.
   :param jitter: Draw individual points for each shuffle above the main plot.
   :param jitter_size: Size of jitter points.
   :param n_col: Number of columns in the plot. If blank, attempts to guess
       a sensible value.
   :param xaxis: Value to plot along the x-axis. "rank" for rank selection,
       "alpha" for regularisation selection.
   :param rotate_x_labels: Degrees to rotate x-axis labels by. If None
       will rotate if x-axis is float.
   :param **kwargs: Passed to :func:`suggest_rank`.
   :return: :class:`plotnine.ggplot` instance


.. py:function:: plot_regu_selection(regu_res: Union[Tuple[float, Dict], Dict], alpha_star: bool = True, **kwargs) -> plotnine.ggplot

   Plot regularisation selection results.

   Takes a result from :func:`regu_selection` and passes to
   :func:`plot_rank_selection` to plot with alpha values along the
   x-axis. Consequently, pass any parameters for plotting as kwargs.

   :param regu_res: Results from :func:`regu_selection`.
   :param alpha_star: Suggest and plot a suitable alpha value using
       :func:`suggest_alpha`.


.. py:function:: plot_stability_rank_selection(decompositions: Optional[Dict[int, List[Decomposition]]] = None, series: Optional[List[pandas.Series]] = None, include: List[str] = ['cophenetic_correlation', 'dispersion', 'signature_similarity'], suggested_rank: bool = True, on: Literal['h', 'w'] = 'h') -> plotnine.ggplot

   Plot results for stability based rank selection methods
   (:func:`signature_similarity`, :func:`cophenetic_correlation`,
   :func:`dispersion`).

   Automated rank selection uses :func:`suggest_rank_stability`.

   :param decompositions: Results from :func:`decompositions`. Not used if
       series is passed.
   :param series: Series to plot, resulting from :func:`signature_similarity`,
       :func:`cophenetic_correlation`, or :func:`dispersion`.
   :param include: Which method to include in the plot,
       a list containing values from ``{'cophenetic_correlation', 'dispersion',
       'signature_similarity'}``.
   :param suggested_rank: Estimate a suggested rank using
       :func:`suggest_rank_stability`.
   :param on: Calculate stability of H (samples) or W (features). Not used if
       passed series.


.. py:function:: rank_selection(x: pandas.DataFrame, ranks: Iterable[int], shuffles: int = 100, keep_mats: Optional[bool] = None, seed: Optional[Union[int, numpy.random.Generator]] = None, alpha: Optional[float] = None, l1_ratio: Optional[float] = None, max_iter: Optional[int] = None, beta_loss: Optional[str] = None, init: Optional[str] = None, design: Optional[Tuple[int, int]] = (3, 3), progress_bar: bool = True) -> Dict[int, List[BicvResult]]

   Bi-cross validation for rank selection.

   Run :math:`mn`-fold bicrossvalidation across a range of ranks. Briefly, the
   input matrix is shuffled ``shuffles`` times. Each shuffle is then split
   into :math:`m \times n` submatrices (:math:`m` splits on rows, :math:`n`
   splits on columns). The rows and columns of submatrices are permuted, and
   the top left submatrix (:math:`A`) is estimated through NMF decompositions
   of the other matrices producing an estimate :math:`A'`. Various measures of
   how well :math:`A'` reconstructed :math:`A` are provided, see
   :class:`BicvResult` for details on the measures.

   No multiprocessing is used, as the majority of builds of scikit-learn seem
   to make good use of multiple processors anyway (depending on compilation of
   underlying libraries and matrix size).

   This method returns a dictionary with each rank as a key, and a list
   containing one :class:`BicvResult` for each shuffle.

   :param x: Input matrix.
   :param ranks: Ranks of k to be searched. Iterable of unique ints.
   :param shuffles: Number of times to shuffle `x`.
   :param keep_mats: Return A' and shuffle as part of results.
   :param seed: Random value generator or seed for creation of the same.
       If not provided, will initialise with entropy from system.
   :param alpha: Regularisation coefficient
   :param l1_ratio: Ratio between L1 and L2 regularisation. L2 regularisation
       (0.0) is densifying, L1 (1.0) sparsifying.
   :param max_iter: Maximum iterations of NMF updates. Will end early if
       solution converges.
   :param beta_loss: Beta-loss function, see sklearn documentation for
       details.
   :param init: Initialisation method for H and W during decomposition.
       Used only where one of the matrices during bi-cross steps is not
       fixed. See sklearn documentation for values.
   :param design: How many blocks to split the input matrix into on rows and
       columns respectively. Defaults to 3x3 9-fold design.
   :param progress_bar: Show a progress bar while running.
   :returns: Dictionary with entry for each rank, containing a list of
       results for each shuffle (as a :class:`BicvResult` object)
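
The splitting step described above can be sketched in plain Python as follows. The helpers ``split_indices`` and ``bicv_folds`` are hypothetical names used only for this illustration, and the use of contiguous, near-equal blocks after shuffling is an assumption. Each fold yields the row and column indices of the held-out block; the NMF estimation of that block is not shown.

```python
import random
from typing import Iterator, List, Tuple


def split_indices(n: int, parts: int) -> List[List[int]]:
    """Split range(n) into `parts` contiguous blocks of near-equal size."""
    base, extra = divmod(n, parts)
    blocks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def bicv_folds(
    n_rows: int, n_cols: int, design: Tuple[int, int], seed: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (held_out_rows, held_out_cols) for each fold of one shuffle."""
    rng = random.Random(seed)
    rows, cols = list(range(n_rows)), list(range(n_cols))
    # One permutation of rows and columns, shared by all folds of the shuffle.
    rng.shuffle(rows)
    rng.shuffle(cols)
    row_blocks = [[rows[i] for i in b] for b in split_indices(n_rows, design[0])]
    col_blocks = [[cols[j] for j in b] for b in split_indices(n_cols, design[1])]
    for held_rows in row_blocks:
        for held_cols in col_blocks:
            yield held_rows, held_cols


folds = list(bicv_folds(n_rows=10, n_cols=7, design=(3, 3), seed=4298))
```

With a ``(3, 3)`` design this produces nine folds per shuffle, and every cell of the matrix is held out exactly once.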


.. py:function:: regu_selection(x: pandas.DataFrame, rank: int, alphas: Optional[Iterable[float], None] = None, scale_samples: Optional[bool] = None, shuffles: int = 100, keep_mats: Optional[bool] = None, seed: Optional[Union[int, numpy.random.Generator]] = None, l1_ratio: Optional[float] = 1.0, max_iter: Optional[int] = None, beta_loss: Optional[str] = None, init: Optional[str] = None, design: Tuple[int, int] = (3, 3), progress_bar: bool = True) -> Tuple[float, Dict[float, List[BicvResult]]]

   Bicrossvalidation for regularisation selection.

   Run :math:`mn`-fold bicrossvalidation across a range of regularisation
   parameter (alpha) values, for a single rank. For a brief description of
   bi-cross validation see :func:`rank_selection`.

   No multiprocessing is used, as the majority of builds of scikit-learn
   seem to make good use of multiple processors anyway.

   This method returns a tuple with

   * a float which is the tested alpha which meets the criteria in
     the ES paper
   * a dictionary with each alpha value as a key, and a list containing one
     :class:`BicvResult` for each shuffle

   :param x: Input matrix.
   :param rank: Rank of decomposition.
   :param alphas: Regularisation alpha parameters to be searched. If left
       blank a default range will be used.
   :param scale_samples: Divide alpha by number of samples. This is provided
       as the way regularisation is performed changed in newer sklearn
       versions, and alpha is multiplied by n_samples. Setting this to True
       results in the same calculation as earlier sklearn versions, such as
       the one used in the Enterosignatures paper. If this is set it is
       honoured; if left as None, when automatic alpha range is calculated
       they will be scaled by sample, when alpha range specified will not be
       scaled.
   :param shuffles: Number of times to shuffle `x`.
   :param keep_mats: Return :math:`A'` and `shuffle` as part of results.
   :param seed: Random value generator or seed for creation of the same.
       If not provided, will initialise with entropy from system.
   :param alpha: Regularisation coefficient
   :param l1_ratio: Ratio between L1 and L2 regularisation. L2 regularisation
       (0.0) is densifying, L1 (1.0) sparsifying.
   :param max_iter: Maximum iterations of NMF updates. Will end early if
       solution converges.
   :param beta_loss: Beta-loss function, see sklearn documentation for
       details.
   :param init: Initialisation method for H and W during decomposition.
       Used only where one of the matrices during bi-cross steps is not
       fixed. See sklearn documentation for values.
   :param progress_bar: Show a progress bar while running.
   :param design: Number of blocks to split input into on rows and columns
       respectively for bicrossvalidation.
   :returns: Tuple of the suggested alpha, and a dictionary with an entry for
       each alpha value containing a list of results for each shuffle (as
       :class:`BicvResult` objects)
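
The defaulting rule for ``scale_samples`` can be read as the small sketch below. ``resolve_alphas`` is a hypothetical helper and ``HYPOTHETICAL_DEFAULT_ALPHAS`` is a placeholder range; the default range actually used by the library is not given in this document.

```python
from typing import List, Optional, Sequence

# Placeholder default range; the range used by cvanmf is not shown here.
HYPOTHETICAL_DEFAULT_ALPHAS = [0.0, 0.5, 1.0, 2.0]


def resolve_alphas(
    alphas: Optional[Sequence[float]],
    scale_samples: Optional[bool],
    n_samples: int,
) -> List[float]:
    """Apply the documented defaulting rule for scale_samples."""
    automatic = alphas is None
    values = list(HYPOTHETICAL_DEFAULT_ALPHAS if automatic else alphas)
    # An explicit True/False is honoured; None scales only automatic ranges.
    scale = automatic if scale_samples is None else scale_samples
    return [a / n_samples for a in values] if scale else values


auto = resolve_alphas(None, None, n_samples=100)
given = resolve_alphas([0.0, 1.0], None, n_samples=100)
forced = resolve_alphas([0.0, 1.0], True, n_samples=100)
```

Here ``auto`` is divided by the number of samples, ``given`` is left untouched, and ``forced`` is scaled because ``scale_samples`` was set explicitly.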


.. py:function:: signature_similarity(decompositions: Dict[int, List[Decomposition]]) -> pandas.Series

   Mean cosine similarity of signatures for rank selection

   This rank selection criterion is based on the intuition that if a solution
   is good, it should be similar across multiple random initialisations,
   much like the motivation for :func:`cophenetic_correlation` and
   :func:`dispersion`.

   We pair signatures based on a cosine similarity (see
   :func:`cvanmf.stability.match_signatures`) and take the mean value between
   paired signatures at a rank, and look for clear peaks.

   Similarity is calculated between the best decomposition and all others,
   not between all possible pairs.

   The paired cosine similarity can also be visualised in more detail using
   :func:`cvanmf.stability.plot_signature_stability`.

   :param decompositions: Decompositions for several ranks as output by
       :func:`decompositions`.
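
   The pairing and averaging described above can be sketched with NumPy and
   SciPy. This is only an illustration: the helper ``mean_paired_cosine`` is
   hypothetical, and pairing signatures with
   ``scipy.optimize.linear_sum_assignment`` is an assumption rather than a
   description of what :func:`cvanmf.stability.match_signatures` does.

   .. code-block:: python

      import numpy as np
      from scipy.optimize import linear_sum_assignment


      def mean_paired_cosine(w_best, w_other):
          """Mean cosine similarity between one-to-one paired signatures.

          Both inputs are (features x rank) arrays whose columns are signatures.
          """
          # Scale columns to unit length so dot products are cosine similarities
          a = w_best / np.linalg.norm(w_best, axis=0, keepdims=True)
          b = w_other / np.linalg.norm(w_other, axis=0, keepdims=True)
          cosine = a.T @ b  # rank x rank similarity matrix
          # Assumption: pair signatures so that total similarity is maximised
          rows, cols = linear_sum_assignment(cosine, maximize=True)
          return float(cosine[rows, cols].mean())


      rng = np.random.default_rng(0)
      w_best = rng.random((20, 3))
      # The same signatures in a different column order, with a little noise
      others = [
          w_best[:, [2, 0, 1]] + rng.normal(0, 0.01, size=(20, 3))
          for _ in range(4)
      ]
      # Compare the best decomposition against every other one, then average
      score = np.mean([mean_paired_cosine(w_best, w) for w in others])
      print(round(float(score), 3))

   A value close to 1 at a given rank indicates that the signatures are
   recovered consistently across initialisations.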


.. py:function:: suggest_alpha(regu_results: Dict[float, List[BicvResult]]) -> float

   Suggest a suitable value for alpha.

   The aim is to select the largest value of :math:`\alpha` which does not
   detrimentally affect the quality of the decomposition. To gauge this,
   we adopt the heuristic of [REF], selecting the highest value of
   :math:`\alpha` for which the mean :math:`R^2` is not lower than the (mean
   :math:`R^2` + standard deviation) at :math:`\alpha=0`.

   This is called by default in :func:`regu_selection`. It is provided as a
   public function because the Nextflow pipeline splits the bicrossvalidation
   process and does not use :func:`regu_selection`, so it can be called
   afterwards on the collected results.

   :param regu_results: Dictionary with keys being alpha values, and values
       a list of :class:`BicvResult` objects.
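
   The heuristic can be sketched on plain :math:`R^2` values. The helper
   ``pick_alpha`` is hypothetical and works on a dictionary of lists of
   floats rather than :class:`BicvResult` objects. The threshold follows the
   wording above literally, while the use of the population standard
   deviation and the fallback to 0 are assumptions.

   .. code-block:: python

      from typing import Dict, List

      import numpy as np


      def pick_alpha(r2_by_alpha: Dict[float, List[float]]) -> float:
          """Largest alpha whose mean R^2 meets the threshold set at alpha = 0."""
          baseline = np.asarray(r2_by_alpha[0.0])
          # Threshold as worded above: mean R^2 plus standard deviation at alpha = 0
          threshold = baseline.mean() + baseline.std()
          passing = [a for a, r2 in r2_by_alpha.items() if np.mean(r2) >= threshold]
          # Assumption: fall back to no regularisation when no alpha passes
          return max(passing, default=0.0)


      r2 = {
          0.0: [0.80, 0.82, 0.84],
          0.5: [0.86, 0.87, 0.88],
          1.0: [0.85, 0.86, 0.84],
          2.0: [0.60, 0.62, 0.61],
      }
      print(pick_alpha(r2))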


.. py:function:: suggest_rank(rank_selection_results: Union[Dict[int, List[BicvResult]], pandas.DataFrame], summarise: Callable[[numpy.ndarray], float] = np.mean, measures: List[str] = ['cosine_similarity', 'r_squared'], **kwargs) -> Dict[str, int]

   Suggest a suitable rank.

   Attempt to identify an elbow point in the graphs of cosine similarity
   and :math:`R^2` which represent points where the rate of improvement in
   the decomposition slows.

   Please note this is only a suggestion of a suitable rank; the plots
   should still be inspected and decompositions of candidate ranks examined to
   make a final decision.

   This is implemented using the excellent `kneed` package. Any `**kwargs`
   are passed to the constructor of `KneeLocator`, which can be used to
   customise its behaviour. The online mode of kneed is used by default.

   :param rank_selection_results: Results from :func:`rank_selection`, or
       these results in DataFrame format from
       :meth:`BicvResult.results_to_table`
   :param summarise: Function to summarise results from a shuffle. Roughly
       speaking, determines which point represents the middle of the
       distribution of values for purposes of the curve.
   :param measures: The measures to consider if passed a DataFrame
   :param kwargs: Arguments passed to ``KneeLocator`` constructor
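
   The notion of an elbow can be illustrated without kneed. The sketch below
   is not kneed's implementation: it is a simplified heuristic that rescales
   both axes to the unit interval and picks the rank where the curve rises
   furthest above the straight line joining its end points. The helper
   ``elbow_rank`` and the example values are hypothetical.

   .. code-block:: python

      import numpy as np


      def elbow_rank(ranks, scores) -> int:
          """Rank at which an increasing, flattening curve bends most sharply."""
          x = np.asarray(ranks, dtype=float)
          y = np.asarray(scores, dtype=float)
          # Rescale both axes to [0, 1] so they are comparable
          x_norm = (x - x.min()) / (x.max() - x.min())
          y_norm = (y - y.min()) / (y.max() - y.min())
          # Height of the curve above the line joining its end points
          difference = y_norm - x_norm
          return int(x[np.argmax(difference)])


      # One list of per-shuffle values for each rank, summarised with the mean
      per_rank = {
          2: [0.49, 0.51], 3: [0.69, 0.71], 4: [0.84, 0.86], 5: [0.89, 0.91],
          6: [0.91, 0.93], 7: [0.92, 0.94], 8: [0.93, 0.95],
      }
      ranks = sorted(per_rank)
      summary = [np.mean(per_rank[k]) for k in ranks]
      print(elbow_rank(ranks, summary))

   Swapping ``np.mean`` for another summary such as ``np.median`` changes
   which curve the elbow is located on, which is the role ``summarise``
   plays.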


.. py:function:: suggest_rank_stability(rank_selection_results: Union[pandas.DataFrame, Iterable[pandas.Series], Dict[int, List[Decomposition]]], measures: List[str] = ['cophenetic_correlation', 'dispersion', 'signature_similarity'], near_max: float = 0.02, **kwargs) -> Dict[str, int]

   Suggest a suitable rank using stability-based measures.

   Attempt to identify peaks in stability-based rank selection criteria
   (cophenetic correlation, dispersion, signature similarity). By default the
   highest peak is selected. Where there are many similar ranks (defined by
   `near_max`), the one with the most consecutively decreasing values after it
   is selected.

   Please note this is only a suggestion of a suitable rank; the plots
   should still be inspected and decompositions of candidate ranks examined to
   make a final decision.

   When running this multiple times (changing parameters etc.), it may be
   preferable to calculate the measures once and pass the results as a list
   of Series, as the calculation can be time-consuming.

   :param rank_selection_results: Results from :func:`decompositions`,
       or a collection of series produced by :func:`dispersion`,
       :func:`cophenetic_correlation`, and :func:`signature_similarity`, or a
       DataFrame of those series joined.
   :param measures: The measures to consider if passed a DataFrame
   :param near_max: Consider peaks (:math:`p`) candidates if they are within a
       certain distance of the global maximum (:math:`gm`):
       :math:`p \geq gm \times (1 - \mathrm{near\_max})`.
   :param kwargs: Passed to ``scipy.signal.argrelmax``.
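
   The peak selection rule can be sketched as follows, using
   ``scipy.signal.argrelmax`` to find local maxima. The helper ``pick_peak``
   is hypothetical. Taking the global maximum to be the highest peak,
   breaking ties in favour of the lower rank, and falling back to the largest
   value when no interior peak exists are all assumptions made for this
   illustration.

   .. code-block:: python

      import numpy as np
      from scipy.signal import argrelmax


      def pick_peak(ranks, values, near_max=0.02) -> int:
          """Pick a peak near the maximum with the longest decline after it."""
          ranks = np.asarray(ranks)
          values = np.asarray(values, dtype=float)
          peaks = argrelmax(values)[0]  # strict local maxima, end points excluded
          if peaks.size == 0:
              # Assumption: no interior peak, so use the largest value
              return int(ranks[np.argmax(values)])
          # Assumption: the global maximum is taken over the detected peaks
          global_max = values[peaks].max()
          candidates = peaks[values[peaks] >= global_max * (1 - near_max)]

          def run_length(i):
              """Count consecutive strictly decreasing steps after index i."""
              n = 0
              while i + n + 1 < len(values) and values[i + n + 1] < values[i + n]:
                  n += 1
              return n

          # max() keeps the first candidate on ties, i.e. the lower rank
          best = max(candidates, key=run_length)
          return int(ranks[best])


      ranks = [2, 3, 4, 5, 6, 7, 8, 9]
      stability = [0.90, 0.975, 0.96, 0.95, 0.94, 0.98, 0.97, 0.975]
      print(pick_peak(ranks, stability))                # peaks at ranks 3 and 7 compete
      print(pick_peak(ranks, stability, near_max=0.0))  # only the highest peak remains

   With the default ``near_max`` both peaks are candidates and the one
   followed by the longer decline is chosen. With ``near_max=0.0`` only the
   highest peak qualifies.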


.. py:data:: Numeric

   Alias for python numeric types (a union of int and float).

.. py:data:: PcoaMatrices

   Matrices from which a PCoA can be constructed. Allowed values are ``w``,
   ``x``, ``wh``, and ``signatures`` (an alias for ``w``).

.. py:data:: logger
   :type:  logging.Logger

   Logger object.