cvanmf.data
===========

.. py:module:: cvanmf.data

.. autoapi-nested-parse::

   Example data for decomposition.

   This module provides both real-world datasets and functions to generate
   synthetic data with a known number of signatures. Each is returned as an
   :class:`ExampleData` object, which includes metadata and
   citations for the data where appropriate. Each can be loaded by calling the
   relevant function (e.g. `data.leukemia()`).

   Real Data
   ---------
   :func:`leukemia` is gene expression data from ALL/AML-type leukemia
   patients. :func:`lung_cancer_cells` is the count of types of cells in
   non-small cell lung cancer tissues. :func:`swimmer` is an image dataset
   with stick figure representations of a swimmer from above.

   Synthetic Data
   --------------
   :func:`synthetic_blocks` makes data with an overlapping block pattern along
   the diagonal. :func:`synthetic_dense` makes data with a dense structure
   (each sample can contain any number of signatures).


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/cvanmf/data/utils/index


Classes
-------

.. autoapisummary::

   cvanmf.data.ExampleData


Functions
---------

.. autoapisummary::

   cvanmf.data.example_abundance
   cvanmf.data.leukemia
   cvanmf.data.lung_cancer_cells
   cvanmf.data.swimmer
   cvanmf.data.synthetic_blocks
   cvanmf.data.synthetic_dense


Package Contents
----------------

.. py:class:: ExampleData

   Bases: :py:obj:`NamedTuple`


   Example data, including citations, description, and other metadata.


   .. py:attribute:: citation
      :type:  Optional[str]

      Preferred citation if you use this data.


   .. py:attribute:: col_metadata
      :type:  Optional[pandas.DataFrame]

      Metadata associated to each column.


   .. py:attribute:: data
      :type:  pandas.DataFrame

      Table containing the data.


   .. py:attribute:: description
      :type:  str

      Longform description of the data.


   .. py:attribute:: doi
      :type:  Optional[str]

      DOI for data or original paper.


   .. py:attribute:: name
      :type:  str

      Descriptive name.


   .. py:attribute:: other_metadata
      :type:  Optional[Dict[str, Any]]

      Any other metadata related to this data (data dictionaries etc.)


   .. py:attribute:: rank
      :type:  Optional[Union[int, List[int]]]

      Correct rank for this data, or ranks if more than one can be
      considered correct.


   .. py:attribute:: row_metadata
      :type:  Optional[pandas.DataFrame]

      Metadata associated to each row.


   .. py:attribute:: short_name
      :type:  str

      Short name.
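
   A minimal sketch of how a container with these fields behaves, assuming only
   the attribute names and types listed above. ``ExampleDataSketch`` is a
   hypothetical stand-in defined for illustration, not the library's class, and
   the field order and default values are assumptions:

   .. code-block:: python

      from typing import Any, Dict, List, NamedTuple, Optional, Union

      import pandas as pd


      class ExampleDataSketch(NamedTuple):
          """Hypothetical stand-in mirroring the documented attributes."""

          data: pd.DataFrame
          name: str
          short_name: str
          description: str
          # Optional fields default to None here; this is an assumption.
          rank: Optional[Union[int, List[int]]] = None
          row_metadata: Optional[pd.DataFrame] = None
          col_metadata: Optional[pd.DataFrame] = None
          other_metadata: Optional[Dict[str, Any]] = None
          citation: Optional[str] = None
          doi: Optional[str] = None


      toy = ExampleDataSketch(
          data=pd.DataFrame(
              [[1.0, 0.0], [0.0, 2.0]], index=["f1", "f2"], columns=["s1", "s2"]
          ),
          name="Toy data",
          short_name="toy",
          description="A two-by-two table, for illustration only.",
          rank=[1, 2],
      )

      # Fields are read by attribute name; a NamedTuple cannot be modified in place.
      print(toy.short_name, toy.data.shape, toy.rank)

   Because ``rank`` may be a single integer or a list of integers, code that
   consumes it should handle both cases.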


.. py:function:: example_abundance() -> pandas.DataFrame

   Genus-level Non-Western cohort bacterial microbiome abundance.

   From Frioux et al. (2023, https://doi.org/10.1016/j.chom.2023.05.024).

   :return: Genus-level relative abundance table using GTDB r207 taxonomy.
   :rtype: pd.DataFrame


.. py:function:: leukemia() -> ExampleData

   Gene expression data for ALL and AML B- and T-cell type leukemia.

   This data was analysed in Brunet et al. (2004), and has often been used as a
   standard dataset for biological applications of NMF since. It has two
   broad categories (ALL/AML), but ALL can be refined into two subtypes (B/T).
   B-cell ALL appears to contain a further stable sub-grouping, so we have
   indicated the true rank of this data as 3 or 4.


.. py:function:: lung_cancer_cells() -> ExampleData

   Relative cell-compositions from non-small cell lung cancer studies.

   Gives the proportion of cells of each type in lung tissue samples from a
   non-small cell lung cancer atlas, which was compiled from 29 studies and
   includes 556 samples from 318 individuals (86 of whom are healthy
   controls). The data was uploaded to cellxgene using their standard
   ontologies, which is the source we have taken the data from. Metadata
   provided here is a mixture of metadata from cellxgene and some from the
   original paper. We have selected only the tissue samples labelled as
   "lung". In total, this gives 224 samples and 33 cell types. The data
   here is total-sum-scaled, i.e. each sample sums to 1.
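
   Total-sum scaling can be sketched as follows. This is an illustration with
   made-up counts, not the packaged data, and it assumes that samples are
   stored as columns:

   .. code-block:: python

      import pandas as pd

      # Made-up cell counts: rows are cell types, columns are samples.
      counts = pd.DataFrame(
          {"sample_a": [10, 30, 60], "sample_b": [5, 5, 10]},
          index=["T cell", "B cell", "macrophage"],
      )

      # Divide each column by its total so that every sample sums to 1.
      scaled = counts / counts.sum(axis=0)
      print(scaled.sum(axis=0))

   After scaling, values are comparable between samples with different total
   cell counts, but the absolute counts can no longer be recovered.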


.. py:function:: swimmer() -> ExampleData

   Stick figure images of a swimmer with 4 limbs in 4 positions.

   Designed by Donoho et al. to be partially representable by NMF, each image
   has a line torso, accompanied by four straight limbs which can be in one
   of four positions (0, 45, 90, 135 and 180 degrees from torso). With the
   exception of the torso, each limb position should be representable by
   NMF decomposition. As such, the true rank of this data is 17 (4*4 limbs,
   plus torso), but with conventional NMF as implemented here the torso cannot
   be learnt, only the 16 limbs.


.. py:function:: synthetic_blocks(m: int = 100, n: int = 100, overlap: float = 0.25, k: int = 3, normal_noise_params: Optional[Dict] = None, scale_lognormal_params: Optional[Dict] = None) -> ExampleData

   Generate simple synthetic data.

   Create an :math:`m \times n` matrix with blocks along the diagonal, which
   overlap to an extent defined by `overlap`.

   :param m: Number of rows in matrix
   :param n: Number of columns in matrix
   :param overlap: Proportion of block length to participate in overlap
   :param k: Number of signatures
   :param normal_noise_params: Parameters to pass to `numpy.random.normal`
       to apply noise to entries. Leave as None to use default parameters.
   :param scale_lognormal_params: Parameters to pass to
       `numpy.random.lognormal` to scale each feature (give some features
       higher values than others). If set to true, will use default parameters
       for distribution. Leave as None to skip feature scaling.
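
   The overlapping block pattern can be sketched with NumPy as below.
   ``block_sketch`` is a hypothetical helper written for this illustration and
   is not the library's implementation: it assumes each block is extended past
   its lower-right edges by ``overlap`` times its length, and it omits the
   noise and feature-scaling steps.

   .. code-block:: python

      import numpy as np


      def block_sketch(m=100, n=100, k=3, overlap=0.25):
          """Hypothetical helper: k diagonal blocks, each widened to overlap."""
          x = np.zeros((m, n))
          row_edges = np.linspace(0, m, k + 1).astype(int)
          col_edges = np.linspace(0, n, k + 1).astype(int)
          for i in range(k):
              r0, r1 = row_edges[i], row_edges[i + 1]
              c0, c1 = col_edges[i], col_edges[i + 1]
              # Extend the block by a fraction of its own length (an assumption).
              r_ext = int(round((r1 - r0) * overlap))
              c_ext = int(round((c1 - c0) * overlap))
              # Entries covered by two blocks accumulate a value of 2.
              x[r0:min(r1 + r_ext, m), c0:min(c1 + c_ext, n)] += 1.0
          return x


      x = block_sketch(m=12, n=12, k=3, overlap=0.25)
      print(x.astype(int))

   In the printed matrix, entries equal to 2 mark where neighbouring blocks
   overlap, and off-diagonal corners remain 0.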


.. py:function:: synthetic_dense(m: int = 100, n: int = 100, h_sparsity: float = 0.0, shared_features: float = 0.25, k: int = 3, normal_noise_params: Optional[Dict] = None, scale_lognormal_params: Optional[Dict] = None, keep_mats: bool = False) -> ExampleData

   Generate dense synthetic data.

   Dense data is generated by making a :math:`W` matrix with :math:`k`
   signatures, and multiplying this with a randomly filled :math:`H` matrix.
   Optionally, a proportion of the entries of the :math:`H` matrix can be
   randomly set to 0 using `h_sparsity`. The extent to which features are shared
   between signatures is defined via `shared_features`. Each signature is
   initially assigned an even proportion of the :math:`m` features (remainder
   spread as evenly as possible between them), so there are no shared features.
   Then, if :math:`|k|` is the number of features assigned to a signature, each
   signature is assigned a further :math:`|k|*shared\_features` features,
   randomly selected from the remaining features. This means the overlapping
   structure is potentially quite different from that of
   :func:`synthetic_blocks`.

   :param m: Number of features.
   :param n: Number of samples.
   :param h_sparsity: Proportion of :math:`H` matrix to randomly set to 0.
   :param shared_features: Amount of shared features to add to a signature, as
       a proportion of its base size.
   :param k: Number of signatures.
   :param normal_noise_params: Parameters passed to
       :func:`numpy.random.normal` when adding noise to data.
   :param scale_lognormal_params: Parameters passed to
       :func:`numpy.random.lognormal` when selecting weights for features in a
       signature. If this is None, a uniform distribution between 0 and 1 is
       used instead.
   :param keep_mats: Return the :math:`H` and :math:`W` matrices used to
       generate the data in the :attr:`ExampleData.other_metadata`.
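
   The procedure described above can be sketched with NumPy as below.
   ``dense_sketch`` and its ``seed`` argument are hypothetical and are not the
   library's implementation. The sketch assumes uniform weights in both
   matrices, zeroes each entry of H independently with probability
   ``h_sparsity`` (so the zeroed proportion is only approximate), and omits
   the noise step.

   .. code-block:: python

      import numpy as np


      def dense_sketch(m=100, n=100, k=3, shared_features=0.25,
                       h_sparsity=0.0, seed=0):
          """Hypothetical helper: build W and H, then return W @ H."""
          rng = np.random.default_rng(seed)
          w = np.zeros((m, k))
          # Split the m features as evenly as possible between k signatures.
          for j, own in enumerate(np.array_split(np.arange(m), k)):
              # Add shared features, sized relative to the signature's base.
              others = np.setdiff1d(np.arange(m), own)
              n_shared = int(round(len(own) * shared_features))
              extra = rng.choice(others, size=n_shared, replace=False)
              members = np.concatenate([own, extra])
              # Uniform weights between 0 and 1 for this signature's features.
              w[members, j] = rng.uniform(0.0, 1.0, size=len(members))
          # Randomly filled H (uniform is an assumption), optionally sparsified.
          h = rng.uniform(0.0, 1.0, size=(k, n))
          h[rng.random((k, n)) < h_sparsity] = 0.0
          return w @ h, w, h


      x, w, h = dense_sketch(m=30, n=20, k=3, shared_features=0.5, seed=1)
      print(x.shape, w.shape, h.shape)

   Returning W and H alongside the product mirrors the purpose of
   ``keep_mats``: a decomposition can then be compared against the matrices
   that generated the data.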