cvanmf.reapply

Reapply existing Enterosignature models to new abundance data.

The easiest way to do this is through the reapply() function, which is most flexible about parameter types. The other functions perform individual steps, which are useful if you want fine control of a given step, but probably not necessary for most uses.

Attributes

Classes

FeatureMapping

Connect new data table features to those in the model.

FeatureMatch

Signature for functions which perform feature matching.

InputValidation

Signature for functions which perform input validation.

Functions

cli(→ None)

Command line interface to fit new data to an existing NMF Signatures model.

match_genera(→ FeatureMapping)

Match taxonomic names in the input table and the Enterosignatures W matrix.

match_identical(→ FeatureMapping)

Match features by identical labels only.

nmf_transform(→ pandas.DataFrame)

Transform the input data into model weights.

reapply(→ denovo.Decomposition)

Load and transform abundances to an existing model.

validate_genus_table(→ pandas.DataFrame)

Basic checks and transformations of the abundance table.

Module Contents

class cvanmf.reapply.FeatureMapping(target_features: Set[str], source_features: Set[str], hard_map: Dict[str, str] | None = None)[source]

Connect new data table features to those in the model.

Manage the mappings from input features to model features. Source features are the features in the new abundance table we want to fit to the model; target features are the features in the model we’re trying to match to. User-defined mappings can be provided via hard_map; any subsequent mappings for a source feature that appears in hard_map will be ignored. New mappings are added via add(). Once the mappings are fully defined, the model W matrix and the new data table can be matched using transform_w() and transform_abundance().

Parameters:
  • target_features (Set[str]) – Model features to map to

  • source_features (Set[str]) – Input features to be mapped from

  • hard_map (Optional[Dict[str, str]]) – User defined mappings, as a dictionary with source as key and target as value.

add(feature_from: str, feature_to: str) None[source]

Add a mapping. If a mapping from this feature already exists, the new one is appended to it. Use the conflicts property to identify features with more than one mapping.

Parameters:
  • feature_from (str) – Feature in the new table

  • feature_to (str) – Model feature to map to

Raises:

EnteroException – Feature not in the relevant sets

missing() Collection[str][source]

Identify input features which currently have no mapping.

Returns:

Source features which are not mapped to any model feature

Return type:

Collection[str]

to_df() pandas.DataFrame[source]

Produce a dataframe of the mapping. Where mappings are ambiguous, multiple rows will be included. Where mappings are missing, one row with a blank target will be included.

Returns:

DataFrame with two columns, first source feature, second target feature.

Return type:

pd.DataFrame

transform_abundance(abd_tbl: pandas.DataFrame) pandas.DataFrame[source]

Apply the mapping to the input table.

Make a table with renamed and combined rows based on the identified mappings.

Parameters:

abd_tbl (pd.DataFrame) – New table, samples on columns

Returns:

Table with mappings applied

Return type:

pd.DataFrame
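
To make "renamed and combined" concrete, the sketch below applies a mapping to a small table held as a plain dictionary (feature name to per-sample values) rather than a DataFrame. It assumes that rows mapped to the same model feature are combined by summing and that unmapped rows keep their original name; both are assumptions made for illustration. apply_mapping is a hypothetical helper, not part of cvanmf.

```python
from typing import Dict, List


def apply_mapping(table: Dict[str, List[float]],
                  mapping: Dict[str, str]) -> Dict[str, List[float]]:
    """Rename rows via `mapping` and sum rows that end up sharing a name."""
    combined: Dict[str, List[float]] = {}
    for feature, values in table.items():
        # Assumption: rows without a mapping keep their original name.
        new_name = mapping.get(feature, feature)
        if new_name in combined:
            combined[new_name] = [a + b for a, b in zip(combined[new_name], values)]
        else:
            combined[new_name] = list(values)
    return combined


# Two samples per row; both Blautia labels map to one model feature.
abundance = {
    "g__Blautia_A": [0.5, 0.25],
    "g__Blautia": [0.25, 0.25],
    "g__Prevotella": [0.25, 0.5],
}
renamed = apply_mapping(abundance, {"g__Blautia_A": "g__Blautia"})
```

In this toy table the two Blautia rows collapse into a single row, so the result has two rows instead of three.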

transform_w(w: pandas.DataFrame, abd_tbl: pandas.DataFrame) pandas.DataFrame[source]

Match the model w matrix to the new table.

Make a W matrix which has features not in the abundance table removed, and rows added for features which are in the abundance table but not the model.

Parameters:
  • w (pd.DataFrame) – Model W matrix

  • abd_tbl (pd.DataFrame) – New matrix. Should not have been transformed with transform_abundance().

Returns:

W matrix matched to new table

Return type:

pd.DataFrame
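
The row alignment described for transform_w() can be pictured with plain dictionaries. The sketch below is hypothetical (align_w is not a cvanmf function). It assumes feature names already match between model and table, and that rows added for features unknown to the model are filled with zeros, which the description above does not state.

```python
from typing import Dict, List


def align_w(w: Dict[str, List[float]],
            table_features: List[str]) -> Dict[str, List[float]]:
    """Return W restricted to, and ordered by, the table's features."""
    n_signatures = len(next(iter(w.values())))
    aligned: Dict[str, List[float]] = {}
    for feature in table_features:
        # Assumption: features unknown to the model get an all-zero row.
        aligned[feature] = list(w.get(feature, [0.0] * n_signatures))
    return aligned


# Each row holds one feature's weight in two signatures.
model_w = {"g__A": [0.9, 0.1], "g__B": [0.2, 0.8], "g__C": [0.5, 0.5]}
aligned_w = align_w(model_w, ["g__B", "g__A", "g__Z"])
```

Here g__C is dropped because the table does not contain it, and g__Z gains a zero row so that the aligned matrix and the table share the same row labels and order.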

property conflicts: List[Tuple[str, List[str]]]

Features for which more than one target exists.

property mapping: Dict[str, List[str]]

Mapping from source to target features.
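
The bookkeeping described for FeatureMapping can be illustrated with a small stand-alone sketch. ToyMapping below is a hypothetical toy built on plain sets and dictionaries, not the cvanmf class. It assumes that hard_map entries are recorded first and that later mappings for those sources are silently ignored, and it raises ValueError where the library documents EnteroException.

```python
from typing import Dict, List, Optional, Set, Tuple


class ToyMapping:
    """Hypothetical stand-in for FeatureMapping; not the cvanmf class."""

    def __init__(self, target_features: Set[str], source_features: Set[str],
                 hard_map: Optional[Dict[str, str]] = None) -> None:
        self.target = set(target_features)
        self.source = set(source_features)
        self.hard = dict(hard_map or {})
        # User-defined mappings are recorded before anything else.
        self.mapping: Dict[str, List[str]] = {
            src: [tgt] for src, tgt in self.hard.items()
        }

    def add(self, feature_from: str, feature_to: str) -> None:
        if feature_from not in self.source or feature_to not in self.target:
            raise ValueError("feature not in the relevant set")
        if feature_from in self.hard:
            return  # assumption: hard-mapped sources ignore later mappings
        targets = self.mapping.setdefault(feature_from, [])
        if feature_to not in targets:
            targets.append(feature_to)

    def missing(self) -> List[str]:
        return sorted(self.source - set(self.mapping))

    @property
    def conflicts(self) -> List[Tuple[str, List[str]]]:
        return [(s, t) for s, t in sorted(self.mapping.items()) if len(t) > 1]


toy = ToyMapping(
    target_features={"g__A", "g__B", "g__C"},
    source_features={"g__A", "g__B_new", "g__X", "g__Y"},
    hard_map={"g__B_new": "g__B"},
)
toy.add("g__A", "g__A")
toy.add("g__B_new", "g__C")  # ignored: g__B_new is hard-mapped
toy.add("g__X", "g__A")
toy.add("g__X", "g__C")      # second target, so g__X becomes a conflict
```

After these calls, g__Y is the only source feature without a mapping, and g__X is the only conflict because it points at two model features.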

class cvanmf.reapply.FeatureMatch[source]

Bases: Protocol

Signature for functions which perform feature matching.

class cvanmf.reapply.InputValidation[source]

Bases: Protocol

Signature for functions which perform input validation.

cvanmf.reapply.cli(input: str, model: str, hard_mapping: str, rollup: bool, separator: str, output_dir: str) None

Command line interface to fit new data to an existing NMF Signatures model. The new data must use the same features as the model, though some differences are tolerated (features in the new data but not in the model, and vice versa). For the 5 Enterosignatures model, the feature taxonomy is currently GTDB r207.

For more on Enterosignatures see:

cvanmf.reapply.match_genera(w: pandas.DataFrame, y: pandas.DataFrame, hard_mapping: Dict[str, str] | None = None, family_rollup: bool = True, **kwargs) FeatureMapping[source]

Match taxonomic names in the input table and the Enterosignatures W matrix.

This function is currently based on the R script (prepare_matrices.R) provided by Clemence in the Enterosignatures (ES) GitLab repo, https://gitlab.inria.fr/cfrioux/enterosignature-paper/. It attempts to match taxonomic names between the two tables. Mappings in the hard_mapping parameter go from new names to ES names, and are applied before any other matches are identified.

Parameters:
  • w (pd.DataFrame) – Enterosignatures W matrix

  • y (pd.DataFrame) – Abundance table being transformed

  • hard_mapping (Dict[str, str]) – Mapping from input to ES name

  • family_rollup (bool) – Move abundance of genera which are not matched to the family level entry if one exists

  • logger (Callable[[Any], None]) – Function to log messages

Returns:

Mapping object connecting features in the abundance table to those in the ES W matrix

Return type:

FeatureMapping
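
The family_rollup option can be pictured as follows. The sketch assumes semicolon-separated lineage labels that end in a genus field, and assumes that a family-level entry in the model is written as the same lineage with an empty genus field. Both label formats are illustrative assumptions rather than the documented format, and rollup_target is a hypothetical helper, not cvanmf code.

```python
from typing import Optional, Set


def rollup_target(label: str, model_features: Set[str]) -> Optional[str]:
    """Return the model feature a label maps to, or None if there is none."""
    if label in model_features:
        return label  # exact genus match
    family, _, _genus = label.rpartition(";")
    family_entry = family + ";g__"  # assumed family-level label format
    return family_entry if family_entry in model_features else None


model = {"f__Lachnospiraceae;g__Blautia", "f__Lachnospiraceae;g__"}
exact = rollup_target("f__Lachnospiraceae;g__Blautia", model)
rolled = rollup_target("f__Lachnospiraceae;g__Novel", model)
unmatched = rollup_target("f__Unknownaceae;g__Novel", model)
```

In this toy model the unmatched genus under Lachnospiraceae is moved to the family-level entry, while a genus whose family is also absent stays unmapped.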

cvanmf.reapply.match_identical(w: pandas.DataFrame, y: pandas.DataFrame, **kwargs) FeatureMapping[source]

Match features by identical labels only.

Parameters:
  • w – W matrix from model

  • y – Table of new data
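
Matching by identical labels amounts to keeping the input features whose label also appears in the model. The sketch below shows the idea with plain lists; identical_matches is a hypothetical helper, and the exact, case-sensitive comparison is an assumption about what "identical" means here.

```python
from typing import Dict, Iterable, List


def identical_matches(model_features: Iterable[str],
                      input_features: Iterable[str]) -> Dict[str, List[str]]:
    """Map each input feature to itself when the model has the same label."""
    known = set(model_features)
    return {f: [f] for f in input_features if f in known}


# Only g__B is present, character for character, in both lists.
matches = identical_matches(["g__A", "g__B"], ["g__B", "g__b", "g__C"])
```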

cvanmf.reapply.nmf_transform(new_abd: pandas.DataFrame, w_prime: pandas.DataFrame) pandas.DataFrame[source]

Transform the input data into model weights.

Takes the matched up W matrix and feature matrix. Expects the row ordering of W and feature matrix to be the same. Any NA values will be filled with 0.

Parameters:
  • new_abd – Feature matrix matched to W

  • w_prime – Model weights

Returns:

Model weights for the given model and abundances. Note that these are not relative abundances (they do not sum to 1).
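
With W held fixed, the transform amounts to finding non-negative sample weights H such that the abundance matrix X is approximately W multiplied by H. The sketch below uses multiplicative updates, one common approach to this problem; it is not necessarily the solver cvanmf uses. Nested lists stand in for DataFrames, and fit_h, matmul, and transpose are hypothetical helpers.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def fit_h(x: Matrix, w: Matrix, n_iter: int = 500, eps: float = 1e-12) -> Matrix:
    """Estimate non-negative H with X ~ W @ H, holding W fixed."""
    wt = transpose(w)
    wtx = matmul(wt, x)  # signatures x samples
    wtw = matmul(wt, w)  # signatures x signatures
    h = [[1.0] * len(x[0]) for _ in range(len(w[0]))]
    for _ in range(n_iter):
        wtwh = matmul(wtw, h)
        # Multiplicative update keeps every entry of H non-negative.
        h = [[h_kj * n / (d + eps) for h_kj, n, d in zip(h_row, n_row, d_row)]
             for h_row, n_row, d_row in zip(h, wtx, wtwh)]
    return h


# Rows are features; columns are signatures (W) or samples (X).
w = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
x = [[1.0], [3.0], [2.0]]  # equals W @ [[1], [2]]
h = fit_h(x, w)
```

Here h approaches [[1], [2]], whose single column sums to 3 rather than 1, which illustrates why the returned weights should not be read as relative abundances.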

cvanmf.reapply.reapply(y: str | pandas.DataFrame, model: str | cvanmf.models.Signatures = '5es', hard_mapping: str | pandas.DataFrame | None = None, separator: str = '\t', output_dir: str | None = None, **kwargs) denovo.Decomposition[source]

Load and transform abundances to an existing model.

The new data must be annotated against the same taxonomy the model uses. Currently for the 5 ES models this is GTDB r207. Feature names will be automatically matched between the abundance table and model where possible (see match_genera()). Most of the work is done in transform_table(); this function mainly adds the convenience of accepting parameters as either paths or DataFrames, and models as either a string or an object.

Parameters:
  • y – Feature matrix to transform. Can be a string giving path, or a DataFrame.

  • model – Model to use. Can be a Signature object, or the name of one of the provided Signature objects. Currently this is ‘5es’ for the 5ES model of Frioux et al. (2023, https://doi.org/10.1016/j.chom.2023.05.024).

  • hard_mapping – Define matchups between feature identifiers in y and those in the model W matrix. These will be used in preference to any automated matches. Should be a table whose index is the y matrix identifier and whose first column is the model W identifier. Can be either a path or a DataFrame.

  • separator – Separator to use when reading and writing matrices.

  • output_dir – Directory to write results to. Directory will be created if it does not exist. Pass None for no output to disk.

  • **kwargs

    Passed to the Signature validate_input and match_feature functions.

cvanmf.reapply.validate_genus_table(abd_tbl: pandas.DataFrame, **kwargs) pandas.DataFrame[source]

Basic checks and transformations of the abundance table.

Some transformations may be made here, such as transposition. Any transformation will be written out to inform the user. Transformations are done in place.

Parameters:
  • abd_tbl (pd.DataFrame) – Abundance table to check

  • logger (Callable[[str], None]) – Function to report errors

Returns:

Validated, potentially transformed, dataframe

Return type:

pd.DataFrame

cvanmf.reapply.logger: logging.Logger