cvanmf.data
===========

.. py:module:: cvanmf.data

.. autoapi-nested-parse::

   Example data for decomposition.

   This module provides both real-world datasets and functions to generate
   synthetic data with a known number of signatures. Each is returned as an
   :class:`ExampleData` object, which includes metadata and citations for the
   data where appropriate. Each can be loaded by calling the relevant function
   (e.g. `data.leukemia()`).

   Real Data
   ---------
   :func:`leukemia` is gene expression data from ALL/AML-type leukemia patients.

   :func:`lung_cancer_cells` is the count of types of cells in non-small cell
   lung cancer tissues.

   :func:`swimmer` is an image dataset with stick figure representations of a
   swimmer from above.

   Synthetic Data
   --------------
   :func:`synthetic_blocks` makes data with an overlapping block pattern along
   the diagonal.

   :func:`synthetic_dense` makes data with a dense structure (each sample can
   contain any number of signatures).


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/cvanmf/data/utils/index


Classes
-------

.. autoapisummary::

   cvanmf.data.ExampleData


Functions
---------

.. autoapisummary::

   cvanmf.data.example_abundance
   cvanmf.data.leukemia
   cvanmf.data.lung_cancer_cells
   cvanmf.data.swimmer
   cvanmf.data.synthetic_blocks
   cvanmf.data.synthetic_dense


Package Contents
----------------

.. py:class:: ExampleData

   Bases: :py:obj:`NamedTuple`

   Example data, including citations, description, and other metadata.

   .. py:attribute:: citation
      :type: Optional[str]

      Preferred citation if you use this data.

   .. py:attribute:: col_metadata
      :type: Optional[pandas.DataFrame]

      Metadata associated with each column.

   .. py:attribute:: data
      :type: pandas.DataFrame

      Table containing the data.

   .. py:attribute:: description
      :type: str

      Long-form description of the data.

   .. py:attribute:: doi
      :type: Optional[str]

      DOI for data or original paper.

   .. py:attribute:: name
      :type: str

      Descriptive name.

   .. py:attribute:: other_metadata
      :type: Optional[Dict[str, Any]]

      Any other metadata related to this data (data dictionaries etc.).

   .. py:attribute:: rank
      :type: Optional[Union[int, List[int]]]

      Correct rank for this data, or ranks if more than one can be considered
      correct.

   .. py:attribute:: row_metadata
      :type: Optional[pandas.DataFrame]

      Metadata associated with each row.

   .. py:attribute:: short_name
      :type: str

      Short name.


.. py:function:: example_abundance() -> pandas.DataFrame

   Genus level Non-Western cohort bacterial microbiome abundance.

   From Frioux et al. (2023, https://doi.org/10.1016/j.chom.2023.05.024).

   :return: Genus level relative abundance table using GTDB r207 taxonomy.
   :rtype: pd.DataFrame


.. py:function:: leukemia() -> ExampleData

   Gene expression data for ALL and AML B- and T-cell type leukemia.

   This data was analysed in Brunet et al. (2004), and has often been used
   since as a standard dataset for biological applications of NMF. It has two
   broad categories (ALL/AML), but ALL can be refined into two subtypes (B/T).
   B-cell ALL appears to contain a further stable sub-grouping, so we have
   indicated the true rank of this data as 3 or 4.


.. py:function:: lung_cancer_cells() -> ExampleData

   Relative cell-compositions from non-small cell lung cancer studies.

   Gives the number of cells of different types in lung tissue samples from a
   non-small cell lung cancer atlas, which was compiled from 29 studies and
   includes 556 samples from 318 individuals (86 of whom are healthy
   controls). The data was uploaded to cellxgene using their standard
   ontologies, which is the source we have taken the data from. Metadata
   provided here is a mixture of metadata from cellxgene and some from the
   original paper.

   We have selected only the tissue samples labelled as "lung". In total, this
   gives 224 samples and 33 cell types. The data here is total-sum-scaled,
   i.e. each sample sums to 1.


.. py:function:: swimmer() -> ExampleData

   Stick figure images of a swimmer with 4 limbs in 4 positions.

   Designed by Donoho et al. to be partially representable by NMF, each image
   has a line torso, accompanied by four straight limbs which can be in one of
   four positions (0, 45, 90, 135 and 180 degrees from torso). With the
   exception of the torso, each limb position should be representable by NMF
   decomposition. As such, the true rank of this data is 17 (4*4 limbs, plus
   torso), but with conventional NMF as implemented here the torso cannot be
   learnt, only the 16 limbs.


.. py:function:: synthetic_blocks(m: int = 100, n: int = 100, overlap: float = 0.25, k: int = 3, normal_noise_params: Optional[Dict] = None, scale_lognormal_params: Optional[Dict] = None) -> ExampleData

   Generate simple synthetic data.

   Create an m x n matrix with blocks along the diagonal which overlap to an
   extent defined by `overlap`.

   :param m: Number of rows in matrix.
   :param n: Number of columns in matrix.
   :param overlap: Proportion of block length to participate in overlap.
   :param k: Number of signatures.
   :param normal_noise_params: Parameters to pass to `numpy.random.normal` to
       apply noise to entries. Leave as None to use default parameters.
   :param scale_lognormal_params: Parameters to pass to
       `numpy.random.lognormal` to scale each feature (give some features
       higher values than others). If set to true, will use default parameters
       for distribution. Leave as None to skip feature scaling.


.. py:function:: synthetic_dense(m: int = 100, n: int = 100, h_sparsity: float = 0.0, shared_features: float = 0.25, k: int = 3, normal_noise_params: Optional[Dict] = None, scale_lognormal_params: Optional[Dict] = None, keep_mats: bool = False) -> ExampleData

   Generate dense synthetic data.

   Dense data is generated by making a :math:`W` matrix with :math:`k`
   signatures, and multiplying this with a randomly filled :math:`H` matrix.
   Optionally, a proportion of the entries of the :math:`H` matrix can be
   randomly set to 0 using `h_sparsity`.

   The extent to which features are shared between signatures is defined via
   `shared_features`. Each signature is initially assigned an even proportion
   of the :math:`m` features (remainder spread as evenly as possible between
   them), so there are no shared features. Then if :math:`|k|` is the number
   of features assigned to a signature, each signature is assigned
   :math:`|k|*shared\_features` randomly selected from the remaining
   features. This means the overlapping structure is potentially quite
   different from that of :func:`synthetic_blocks`.

   :param m: Number of features.
   :param n: Number of samples.
   :param h_sparsity: Proportion of :math:`H` matrix to randomly set to 0.
   :param shared_features: Amount of shared features to add to a signature,
       as a proportion of its base size.
   :param k: Number of signatures.
   :param normal_noise_params: Parameters passed to
       :func:`numpy.random.normal` when adding noise to data.
   :param scale_lognormal_params: Parameters passed to
       :func:`numpy.random.lognormal` when selecting weights for features in a
       signature. If this is None, a uniform distribution between 0 and 1 is
       used instead.
   :param keep_mats: Return the :math:`H` and :math:`W` matrices used to
       generate the data in the :attr:`ExampleData.other_metadata`.
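
The feature assignment described for :func:`synthetic_dense` can be illustrated with a minimal, standard-library-only sketch. This is not the library's implementation: the helper name ``assign_features``, the ``seed`` argument, and the choice to round the number of shared features down are all assumptions made for illustration.

```python
import random


def assign_features(m, k, shared_features, seed=0):
    """Hypothetical helper: return one set of feature indices per signature."""
    rng = random.Random(seed)

    # Split the m features as evenly as possible between the k signatures;
    # the first (m % k) signatures receive one extra feature.
    base, extra = divmod(m, k)
    own_sets = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        own_sets.append(set(range(start, start + size)))
        start += size

    # Give each signature additional features drawn at random from those it
    # does not already own. Rounding down is an assumption of this sketch.
    signatures = []
    for own in own_sets:
        remaining = [f for f in range(m) if f not in own]
        n_shared = min(int(len(own) * shared_features), len(remaining))
        signatures.append(own | set(rng.sample(remaining, n_shared)))
    return signatures


# Mirror the documented defaults: m=100, k=3, shared_features=0.25.
sigs = assign_features(m=100, k=3, shared_features=0.25)
sizes = [len(s) for s in sigs]
```

With these values the base sizes are 34, 33 and 33 features, and each signature gains 8 shared features under the round-down assumption. Because the shared features are drawn from anywhere outside a signature's own block, the overlap is scattered rather than confined to neighbouring blocks.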