cvanmf.data

Example data for decomposition.

This module provides both real-world datasets and functions to generate synthetic data with a known number of signatures. Each is returned as an ExampleData object, which includes metadata and citations for the data where appropriate. Each can be loaded by calling the relevant function (e.g. data.leukemia()).

Real Data

leukemia() is gene expression data from ALL/AML-type leukemia patients. lung_cancer_cells() is the count of types of cells in non-small cell lung cancer tissues. swimmer() is an image dataset with stick figure representations of a swimmer from above.

Synthetic Data

synthetic_blocks() makes data with an overlapping block pattern along the diagonal. synthetic_dense() makes data with a dense structure (each sample can contain any number of signatures).

Submodules

cvanmf.data.utils

Classes

ExampleData

Example data, including citations, description, and other metadata.

Functions

`example_abundance`(→ pandas.DataFrame)	Genus level Non-Western cohort bacterial microbiome abundance.
`leukemia`(→ ExampleData)	Gene expression data for ALL and AML B- and T-cell type leukemia.
`lung_cancer_cells`(→ ExampleData)	Relative cell-compositions from non-small cell lung cancer studies.
`swimmer`(→ ExampleData)	Stick figure images of a swimmer with 4 limbs in 4 positions.
`synthetic_blocks`(→ ExampleData)	Generate simple synthetic data.
`synthetic_dense`(→ ExampleData)	Generate dense synthetic data.

Package Contents

class cvanmf.data.ExampleData

Bases: NamedTuple

Example data, including citations, description, and other metadata.

citation: str | None – Preferred citation if you use this data.

col_metadata: pandas.DataFrame | None – Metadata associated to each column.

data: pandas.DataFrame – Table containing the data.

description: str – Longform description of the data.

doi: str | None – DOI for data or original paper.

name: str – Descriptive name.

other_metadata: Dict[str, Any] | None – Any other metadata related to this data (data dictionaries etc.).

rank: int | List[int] | None – Correct rank for this data, or ranks if more than one can be considered correct.

row_metadata: pandas.DataFrame | None – Metadata associated to each row.

short_name: str – Short name.
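Because rank may be a single integer, a list of integers, or None, code that consumes it usually has to handle all three forms. The following is a minimal sketch of one way to do that. ExampleDataSketch and candidate_ranks are hypothetical names used only for illustration, and the sketch mirrors just a few of the fields listed above (with data held as a plain list rather than a pandas.DataFrame).

```python
from typing import Any, Dict, List, NamedTuple, Optional, Union


class ExampleDataSketch(NamedTuple):
    """Hypothetical stand-in mirroring a few ExampleData fields."""

    data: Any  # a pandas.DataFrame in the real class
    name: str
    short_name: str
    description: str
    rank: Union[int, List[int], None] = None
    other_metadata: Optional[Dict[str, Any]] = None


def candidate_ranks(rank: Union[int, List[int], None]) -> List[int]:
    """Normalise the three possible forms of `rank` to a list of ints."""
    if rank is None:
        return []
    if isinstance(rank, int):
        return [rank]
    return list(rank)


example = ExampleDataSketch(
    data=[[1.0, 0.0], [0.0, 1.0]],
    name="Toy data",
    short_name="toy",
    description="Two samples, two features.",
    rank=[3, 4],
)

# NamedTuple fields are read by attribute name.
ranks = candidate_ranks(example.rank)
```

The same helper should work on a real object, for instance candidate_ranks(data.leukemia().rank), assuming cvanmf is installed.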

cvanmf.data.example_abundance() → pandas.DataFrame

Genus level Non-Western cohort bacterial microbiome abundance.

From Frioux et al. (2023, https://doi.org/10.1016/j.chom.2023.05.024).

Returns: Genus level relative abundance table using GTDB r207 taxonomy.
Return type: pandas.DataFrame

cvanmf.data.leukemia() → ExampleData

Gene expression data for ALL and AML B- and T-cell type leukemia.

This data was analysed in Brunet et al. (2004), and has often been used since as a standard dataset for biological applications of NMF. It has two broad categories (ALL/AML), but ALL can be refined into two subtypes (B-cell/T-cell). B-cell ALL appears to contain a further stable sub-grouping, so we have indicated the true rank of this data as 3 or 4.

cvanmf.data.lung_cancer_cells() → ExampleData

Relative cell-compositions from non-small cell lung cancer studies.

Gives the cell-type composition of lung tissue samples from a non-small cell lung cancer atlas, which was compiled from 29 studies and includes 556 samples from 318 individuals (86 of whom are healthy controls). The atlas was uploaded to cellxgene using their standard ontologies, and that is where we obtained the data. The metadata provided here is a mixture of metadata from cellxgene and some from the original paper. We have selected only the tissue samples labelled as “lung”. In total, this gives 224 samples and 33 cell types. The data here is total-sum-scaled, i.e. each sample sums to 1.
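Total-sum scaling divides each value in a sample by that sample's total, so the scaled values are proportions. A minimal dependency-free sketch is shown below. The function name total_sum_scale and the cell-type labels and counts are made up for illustration and are not taken from the dataset.

```python
from typing import Dict


def total_sum_scale(counts: Dict[str, float]) -> Dict[str, float]:
    """Divide every count in one sample by the sample total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot scale a sample with no cells")
    return {cell_type: n / total for cell_type, n in counts.items()}


# Illustrative counts for a single sample (not real data).
sample_counts = {"T cell": 30, "B cell": 10, "macrophage": 60}
scaled = total_sum_scale(sample_counts)
```

After scaling, the values for the sample add up to 1, which is the property described above.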

cvanmf.data.swimmer() → ExampleData

Stick figure images of a swimmer with 4 limbs in 4 positions.

Designed by Donoho et al. to be partially representable by NMF, each image has a line torso accompanied by four straight limbs, each of which can be in one of four positions (0, 45, 90, 135 and 180 degrees from torso). Each limb position should be recoverable by NMF decomposition, but the torso should not. As such, the true rank of this data is 17 (4 limbs × 4 positions, plus the torso), but with conventional NMF as implemented here only the 16 limb positions can be learnt, not the torso.
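One way to build intuition for why the torso is hard to separate is to encode each image as 17 part indicators and compare the columns. The sketch below assumes that every combination of limb positions appears exactly once, which is an assumption for illustration rather than something stated above. It works on abstract indicators, not on the pixel data.

```python
from itertools import product

N_LIMBS, N_POSITIONS = 4, 4

# Every way of choosing one position for each limb.
configurations = list(product(range(N_POSITIONS), repeat=N_LIMBS))


def indicator_row(config):
    """17 indicators: one per (limb, position) pair, then the torso."""
    row = [0] * (N_LIMBS * N_POSITIONS + 1)
    for limb, position in enumerate(config):
        row[limb * N_POSITIONS + position] = 1
    row[-1] = 1  # the torso is drawn in every image
    return row


rows = [indicator_row(config) for config in configurations]
n_parts = len(rows[0])

# Each limb is in exactly one position per image, so the torso indicator
# always equals the sum of any single limb's four position indicators.
torso_is_redundant = all(
    row[-1] == sum(row[limb * N_POSITIONS:(limb + 1) * N_POSITIONS])
    for row in rows
    for limb in range(N_LIMBS)
)
```

Under this encoding the torso never varies independently of the limbs, so its contribution can be absorbed into the limb components. That is consistent with only the 16 limb positions being learnt.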

cvanmf.data.synthetic_blocks(m: int = 100, n: int = 100, overlap: float = 0.25, k: int = 3, normal_noise_params: Dict | None = None, scale_lognormal_params: Dict | None = None) → ExampleData

Generate simple synthetic data.

Create an m x n matrix with blocks along the diagonal which overlap to an extent defined by overlap.

Parameters:

m – Number of rows in matrix.
n – Number of columns in matrix.
overlap – Proportion of block length to participate in overlap.
k – Number of signatures.
normal_noise_params – Parameters to pass to numpy.random.normal to apply noise to entries. Leave as None to use default parameters.
scale_lognormal_params – Parameters to pass to numpy.random.lognormal to scale each feature (giving some features higher values than others). If set to True, default parameters are used for the distribution. Leave as None to skip feature scaling.
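To make the block pattern concrete, here is a minimal dependency-free sketch of overlapping blocks along the diagonal. It is not the library's implementation. The function name overlapping_blocks, the rounding of the overlap and the way the last block is extended to the matrix edge are all assumptions, and the noise and feature-scaling steps are left out.

```python
from typing import List


def overlapping_blocks(
    m: int = 12, n: int = 12, k: int = 3, overlap: float = 0.25
) -> List[List[float]]:
    """Build an m x n matrix of k diagonal blocks that spill into the next block."""
    row_len, col_len = m // k, n // k
    # Assumed rule: each block is extended by overlap * block length.
    row_ext, col_ext = round(row_len * overlap), round(col_len * overlap)
    matrix = [[0.0] * n for _ in range(m)]
    for block in range(k):
        row_start, col_start = block * row_len, block * col_len
        is_last = block == k - 1
        # The last block simply runs to the edge of the matrix.
        row_end = m if is_last else min(m, row_start + row_len + row_ext)
        col_end = n if is_last else min(n, col_start + col_len + col_ext)
        for i in range(row_start, row_end):
            for j in range(col_start, col_end):
                matrix[i][j] = 1.0
    return matrix


blocks = overlapping_blocks()
```

With these defaults each block covers four rows and columns plus one extra of each, so the entry at row 4, column 4 belongs to both the first and the second block.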

cvanmf.data.synthetic_dense(m: int = 100, n: int = 100, h_sparsity: float = 0.0, shared_features: float = 0.25, k: int = 3, normal_noise_params: Dict | None = None, scale_lognormal_params: Dict | None = None, keep_mats: bool = False) → ExampleData

Generate dense synthetic data.

Dense data is generated by making a \(W\) matrix with \(k\) signatures, and multiplying it by a randomly filled \(H\) matrix. Optionally, a proportion of the entries of \(H\) can be randomly set to 0 using h_sparsity. The extent to which features are shared between signatures is set by shared_features. Each signature is initially assigned an equal share of the \(m\) features (with any remainder spread as evenly as possible between them), so at this stage no features are shared. Then, if \(|k|\) is the number of features assigned to a signature, that signature is assigned a further \(|k| \times \mathrm{shared\_features}\) features, randomly selected from the remaining features. This means the overlapping structure is potentially quite different from that of synthetic_blocks().

Parameters:

m – Number of features.
n – Number of samples.
h_sparsity – Proportion of \(H\) matrix to randomly set to 0.
shared_features – Amount of shared features to add to a signature, as a proportion of its base size.
k – Number of signatures.
normal_noise_params – Parameters passed to numpy.random.normal() when adding noise to data.
scale_lognormal_params – Parameters passed to numpy.random.lognormal() when selecting weights for features in a signature. If this is None, a uniform distribution between 0 and 1 is used instead.
keep_mats – Store the \(H\) and \(W\) matrices used to generate the data in ExampleData.other_metadata.
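The generation procedure described above can be sketched without any dependencies as follows. This is an illustration under stated assumptions, not the library's code. The names assign_features and dense_sketch, the use of round() for the number of shared features, and the uniform weights in \(W\) and \(H\) are all assumptions, and the noise step is omitted.

```python
import random
from typing import List, Set, Tuple

Matrix = List[List[float]]


def assign_features(
    m: int, k: int, shared_features: float, rng: random.Random
) -> List[Set[int]]:
    """Split m features evenly over k signatures, then add shared ones."""
    base, remainder = divmod(m, k)
    signatures: List[Set[int]] = []
    start = 0
    for s in range(k):
        size = base + (1 if s < remainder else 0)  # spread the remainder
        signatures.append(set(range(start, start + size)))
        start += size
    for own in signatures:
        # Extra features are a proportion of the signature's base size.
        n_extra = round(len(own) * shared_features)
        others = [f for f in range(m) if f not in own]
        own.update(rng.sample(others, min(n_extra, len(others))))
    return signatures


def dense_sketch(
    m: int = 20, n: int = 10, k: int = 3, h_sparsity: float = 0.0,
    shared_features: float = 0.25, seed: int = 0,
) -> Tuple[Matrix, Matrix, Matrix, List[Set[int]]]:
    """Return X = W H together with W (m x k), H (k x n) and the feature sets."""
    rng = random.Random(seed)
    signatures = assign_features(m, k, shared_features, rng)
    # W holds a uniform weight where a feature belongs to a signature.
    w = [[rng.random() if f in signatures[s] else 0.0 for s in range(k)]
         for f in range(m)]
    # H is filled at random; a proportion of entries is zeroed.
    h = [[0.0 if rng.random() < h_sparsity else rng.random() for _ in range(n)]
         for _ in range(k)]
    x = [[sum(w[f][s] * h[s][j] for s in range(k)) for j in range(n)]
         for f in range(m)]
    return x, w, h, signatures


x, w, h, signatures = dense_sketch()
```

Returning \(W\) and \(H\) alongside the product plays the same role here as keep_mats does in the real function, where the matrices are kept in other_metadata instead.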