cvanmf.combine¶
Classes¶
Functions¶
| combine_signatures | Combine signatures into a non-redundant set. |
| example_cohort_structure | Make 5 cohorts of data with a complex structure. |
| split_dataframe_to_cohorts | Split a DataFrame into multiple, based on the provided cohort labels. |
Module Contents¶
- class cvanmf.combine.Cluster(signatures: Iterable[Signature] | None = None, label: str | None = None)¶
- static as_dataframe(clusters: List[Cluster]) pandas.DataFrame¶
Concatenated table with the mean signatures for each cluster.
- Parameters:
clusters – Clusters to concatenate
- Returns:
Concatenated DataFrame with each mean signature as a column
- static cosine_similarity(clusters: List[Cluster]) pandas.DataFrame¶
Get the pairwise cosine similarity between clusters.
- Parameters:
clusters – Clusters to calculate similarity between
- Returns:
DataFrame with pairwise cosine similarity
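The comparison between clusters can be illustrated in plain Python. This is a minimal sketch and not the cvanmf implementation: it assumes each cluster is summarised by the feature-wise mean of its member signatures, and the toy clusters, `mean_signature` and `cosine` helpers are hypothetical stand-ins for the pandas-based objects.

```python
import math
from typing import Dict, List

# Hypothetical toy data: each cluster holds member signatures as
# feature -> weight dictionaries (stand-ins for pandas.Series).
clusters: Dict[str, List[Dict[str, float]]] = {
    "A": [{"f1": 1.0, "f2": 0.0, "f3": 1.0}, {"f1": 3.0, "f2": 0.0, "f3": 1.0}],
    "B": [{"f1": 0.0, "f2": 2.0, "f3": 0.0}],
}


def mean_signature(members: List[Dict[str, float]]) -> Dict[str, float]:
    """Feature-wise mean of the member signatures."""
    features = sorted({f for m in members for f in m})
    return {f: sum(m.get(f, 0.0) for m in members) / len(members) for f in features}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two feature-weight vectors (0.0 if either is all zero)."""
    dot = sum(a.get(f, 0.0) * b.get(f, 0.0) for f in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


means = {label: mean_signature(members) for label, members in clusters.items()}
# Nested dict playing the role of the square similarity DataFrame.
similarity = {r: {c: cosine(means[r], means[c]) for c in means} for r in means}
```

Clusters A and B share no weighted features here, so their similarity is zero, while each cluster has a similarity of one with itself (up to floating point error).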
- member_feature_weights(top_n: int = None, unit_scale: bool = True) pandas.DataFrame¶
Weights of features in all member signatures of this cluster.
This is returned in long form, intended to be used for plotting using plotnine.
- member_similarity(utri: bool = False) pandas.DataFrame¶
Pairwise cosine similarity of the member signatures.
- static merge(clusters: Iterable[Cluster]) Cluster¶
Merge clusters.
- Parameters:
clusters – Clusters to merge.
- Returns:
A new cluster containing the union of the signatures.
- plot_feature_variance(top_n: int | None = None, split_cohort: bool = False)¶
Box plot for how much a feature weight varies.
- plot_mds(**kwargs) plotnine.ggplot¶
Plot signatures in this cluster in a 2D space.
Produce a 2D representation of the member signatures of this cluster, with color indicating cohort and shape indicating rank. Any arguments accepted by the scikit-learn MDS constructor can be passed in kwargs.
- plot_member_similarity(split_cohort: bool = True) plotnine.ggplot¶
- support(type: Literal['signature', 'model', 'cohort'] = 'signature', samples: bool = False) int¶
The support for this cluster.
We might want to place more confidence in signatures which appear more frequently, either having more similar signatures overall, appearing in more models, or appearing in more cohorts. This function provides either the number of signatures in the cluster, the number of models the cluster members originate from (likely to be the same as number of signatures), or the number of cohorts the signature is found in. Alternatively, we can express this as the total number of samples for the model or cohort case, for when some cohorts are small and we might want to weight those signatures from large cohorts.
- Parameters:
type – What to count. One of ‘signature’, ‘model’, ‘cohort’
samples – Count the number of samples for model or cohort rather than treating each as one.
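The counting options described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `Member` is a hypothetical record of each member signature's origin, and every model in a cohort is assumed to be fitted to that cohort's full data, so that `n_samples` is the same for all models in a cohort.

```python
from typing import List, Literal, NamedTuple


class Member(NamedTuple):
    """Hypothetical record of where one member signature came from."""
    model: str
    cohort: str
    n_samples: int  # samples in the data the model was fitted to


def support(
    members: List[Member],
    type: Literal["signature", "model", "cohort"] = "signature",
    samples: bool = False,
) -> int:
    """Count signatures, distinct models, or distinct cohorts (optionally by samples)."""
    if type == "signature":
        return len(members)
    # Collapse members to distinct models or cohorts, remembering their size.
    if type == "model":
        groups = {m.model: m.n_samples for m in members}
    else:
        groups = {m.cohort: m.n_samples for m in members}
    return sum(groups.values()) if samples else len(groups)


members = [
    Member("c1_k3", "c1", 100),
    Member("c1_k4", "c1", 100),
    Member("c2_k3", "c2", 20),
]
```

With these toy members, the cluster has support 3 by signature or model, 2 by cohort, and 120 by cohort when counting samples, which shows how a small cohort contributes less once `samples=True`.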
- property cohort_model_count: collections.Counter¶
Number of models from each cohort which contain member signatures.
- property label: str¶
Label for this cluster in plots.
- property mean_signature: pandas.Series¶
Return the mean signature for this cluster.
- property member_data: pandas.DataFrame¶
Data from all cohorts which any member signature originates from.
- class cvanmf.combine.Cohort(name: str, models: List[Model] | None = None, x: pandas.DataFrame | None = None)¶
- static from_comparables(comparables: Iterable[cvanmf.stability.Comparable], name: str = '', x: pandas.DataFrame | None = None) Cohort¶
- property name: str¶
- property x: pandas.DataFrame | None¶
- class cvanmf.combine.Combiner(cohorts: Iterable[Cohort])¶
- identify_cohort_specific(cohort_proportion: float = 0.95) List[Cluster]¶
Find clusters which are consistent in one or more cohorts.
A cluster may be unique to a cohort, being consistently recovered in data from that cohort but not in some others. These may appear to have poor support globally (considering all cohorts), but when there is strong consensus within a cohort it may be of interest to retain these clusters.
- label_by_match(external_model: cvanmf.stability.Comparable) pandas.DataFrame¶
Label clusters by their similarity to an existing model.
- merge_similar(cosine_threshold: float = 0.9, density_threshold: float = 0.98) None¶
Merge signatures which are highly similar based on cosine similarity.
Signatures are grouped into clusters when they all share similarity greater than the specified threshold. A signature can end up in multiple clusters as a result of this grouping. Use refine_multimembers to force a singular membership after merging.
- Parameters:
cosine_threshold – Cosine similarity above which to consider signatures similar.
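One way to read "all share similarity greater than the specified threshold" is as maximal cliques in a graph whose edges join signatures above the threshold. The sketch below takes that reading as an assumption; it is not necessarily how cvanmf groups signatures, it ignores `density_threshold`, and the toy signatures and helper names are hypothetical. It does show why a signature can end up in more than one cluster.

```python
import math
from typing import Dict, FrozenSet, List, Set


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def maximal_cliques(adjacent: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    cliques: List[FrozenSet[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adjacent[v], x & adjacent[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacent), set())
    return cliques


# Hypothetical toy signatures; s2 is similar to both s1 and s3,
# but s1 and s3 are not similar to each other.
signatures = {"s1": [1.0, 0.0], "s2": [1.0, 1.0], "s3": [0.0, 1.0]}
cosine_threshold = 0.7  # lowered from the documented default of 0.9 for the toy data

adjacent = {
    n: {
        m
        for m in signatures
        if m != n and cosine(signatures[n], signatures[m]) > cosine_threshold
    }
    for n in signatures
}
clusters = maximal_cliques(adjacent)
```

Here `s2` belongs to both resulting clusters, which is the multiple-membership situation that a later refinement step would need to resolve.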
- plot_feature_variance(split_cohort: bool = True, top_n: int | None = 20, label_fn: Callable[[List[str]], List[str]] = None, unit_scale: bool = True) plotnine.ggplot¶
Plot feature variance for all signatures.
- plot_mds(hull: bool = True, **kwargs) plotnine.ggplot¶
Plot the member signatures of all clusters in reduced dimensions.
- plot_member_similarity(split_cohort: bool = True)¶
- remove_linear_combinations(cosine_threshold: float = 0.9, min_small_support_ratio: float = 0.5, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, do_removal: bool = True) None¶
Remove signatures which can be expressed as linear combinations.
Some signatures might be a combination of multiple others which commonly co-occur in some of the cohorts. We can remove these and keep only the constituent signatures. However, we do not want to discard well supported large signatures because they are a combination of smaller but less well supported signatures.
- Parameters:
cosine_threshold – Model fit threshold
do_removal – Remove the signatures which are combinations. Set to False to return the identified signatures without removing.
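The idea of testing whether a signature is a linear combination of others can be sketched by fitting it with non-negative weights on the remaining signatures and checking the cosine similarity between the signature and its reconstruction. The fitting method used by cvanmf is not stated here, so the multiplicative-update fit below is a swapped-in technique, and all names and data are hypothetical. The support ratio check (`min_small_support_ratio`) is not modelled.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def reconstruct(weights: List[float], basis: List[List[float]]) -> List[float]:
    """Weighted sum of the basis signatures."""
    return [sum(w * b[i] for w, b in zip(weights, basis)) for i in range(len(basis[0]))]


def nonnegative_fit(
    target: List[float], basis: List[List[float]], iterations: int = 500
) -> List[float]:
    """Non-negative least squares weights via multiplicative updates."""
    h = [1.0] * len(basis)
    for _ in range(iterations):
        recon = reconstruct(h, basis)
        for k, b in enumerate(basis):
            numer = sum(bi * ti for bi, ti in zip(b, target))
            denom = sum(bi * ri for bi, ri in zip(b, recon))
            h[k] = h[k] * numer / denom if denom > 0 else 0.0
    return h


# Hypothetical toy signatures over four features.
basis = [[1.0, 0.2, 0.0, 0.0], [0.1, 1.0, 0.0, 0.0]]
mixed = [0.8 * a + 0.2 * b for a, b in zip(*basis)]  # a true combination
unrelated = [0.0, 0.0, 1.0, 1.0]  # weights only on features the basis lacks

cosine_threshold = 0.9
weights_mixed = nonnegative_fit(mixed, basis)
fit_mixed = cosine(mixed, reconstruct(weights_mixed, basis))
fit_unrelated = cosine(unrelated, reconstruct(nonnegative_fit(unrelated, basis), basis))

is_combination = {
    "mixed": fit_mixed >= cosine_threshold,
    "unrelated": fit_unrelated >= cosine_threshold,
}
```

The combined signature is reconstructed almost exactly and would be flagged, while the unrelated signature cannot be reconstructed at all and would be kept.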
- remove_linear_combinations_2(cosine_threshold: float = 0.9, min_small_support_ratio: float = 0.5, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, do_removal: bool = True)¶
Linear combination removal with removal of smaller poorly supported signatures along with larger ones.
- remove_low_support(support_required: float | int = 0.2, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, signature_floor: int = 2, exempt_clusters: List[Cluster] = None, retain_alpha: float = 0.05, only_cohort_data: bool = False) List[Cluster]¶
Remove signatures which do not appear frequently among models or cohorts.
This looks for clusters with signatures which appear in a small number of models (or which have a small number of signatures, or signatures from a small number of cohorts). Any signatures below the threshold for low support are retained only if removing them causes a worse model fit than removing the other signatures (the model fit for all the others pooled). This can be evaluated on the full data, or only on the data from the cohorts which member signatures are drawn from.
- Parameters:
support_required – Either the proportion or the absolute number below which a signature will be considered to have low support.
support_type – What to use to determine support; either the number of signatures, number of models cluster members appear in, or number of cohorts cluster members appear in.
support_samples – For model or cohort, count the number of samples rather than each model or cohort as one. Useful when cohorts are of very uneven size.
signature_floor – Remove any clusters with fewer than this number of member signatures.
exempt_clusters – Clusters which will not be removed even if meeting low support criteria.
retain_alpha – Signatures are retained if removing them has a significantly different impact on model fit compared to any good signature. Tested with a Kruskal-Wallis test using this parameter as a threshold. Set to 0 to reject all low support signatures.
only_cohort_data – Only use data from the cohorts from which this cluster has support when evaluating the change in model fit from omitting this signature.
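The first stage of this filtering, deciding which clusters are candidates for removal, might look like the sketch below. It rests on two assumptions that the documentation does not confirm: that a float `support_required` is a proportion of some total (for example the total number of models) and that an int is an absolute count. The function name and toy values are hypothetical, and the model-fit comparison with the Kruskal-Wallis test is deliberately left out.

```python
from typing import Dict, List, Tuple, Union


def low_support_candidates(
    supports: Dict[str, int],
    sizes: Dict[str, int],
    total: int,
    support_required: Union[float, int] = 0.2,
    signature_floor: int = 2,
) -> Tuple[List[str], List[str]]:
    """Split clusters into those below the signature floor and those with low support.

    Assumption: a float threshold is a proportion of `total`, an int is absolute.
    """
    if isinstance(support_required, float):
        cutoff = support_required * total
    else:
        cutoff = support_required
    below_floor = [c for c in supports if sizes[c] < signature_floor]
    low_support = [
        c for c in supports if sizes[c] >= signature_floor and supports[c] < cutoff
    ]
    return below_floor, low_support


# Hypothetical clusters: support counted by model, size counted by member signatures.
supports = {"c1": 8, "c2": 1, "c3": 3}
sizes = {"c1": 8, "c2": 1, "c3": 3}

by_proportion = low_support_candidates(supports, sizes, total=10, support_required=0.4)
by_count = low_support_candidates(supports, sizes, total=10, support_required=2)
```

In this toy case `c2` falls below the signature floor either way, while `c3` is a low support candidate only under the proportional threshold (3 of 10 models is below 0.4). Candidates would then go on to the model-fit test governed by `retain_alpha`.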
- property cohort_data: pandas.DataFrame¶
Combined data for all cohorts.
- class cvanmf.combine.Model(signatures: Iterable[Signature] | None = None, cohort: Cohort | None = None)¶
One decomposition which is being combined.
- class cvanmf.combine.Signature(model: Model | None = None, **kwargs)¶
Bases: pandas.Series
One signature which is being combined.
- static mds(signatures: Iterable[Signature] | pandas.DataFrame, **kwargs) pandas.DataFrame¶
Perform MDS ordination of a set of signatures.
Positions signatures in an n-dimensional space using cosine distance. Using the sklearn MDS implementation, so all arguments to the MDS constructor can be passed in kwargs. Generally useful ones are n_components for number of dimensions, metric for metric or non-metric MDS.
- Parameters:
signatures – Iterable of signatures, or a DataFrame of signatures
n – Dimensions to use in NMDS
- Returns:
DataFrame with signatures on rows and dimension coordinates on columns.
- cvanmf.combine.combine_signatures(signatures: Iterable[cvanmf.stability.Comparable], x: pandas.DataFrame = None, merge_threshold: float = 0.9, split_threshold: float = 0.9, prune_low_support: bool = True, low_support_threshold: int | float = 0.2, low_support_alpha: float = 0.05) cvanmf.denovo.Decomposition¶
Combine signatures into a non-redundant set.
- cvanmf.combine.example_cohort_structure() Dict[str, pandas.DataFrame]¶
Make 5 cohorts of data with a complex structure.
This set of cohorts is suitable for evaluating whether we are capturing the different kinds of situations we expect during signature combining. These are:
* highly similar signatures which should be merged
* signatures which are linear combinations of others which should be removed
* low support signatures (present in few cohorts) which can be
  * uninformative, in which case removed
  * informative cohort specific, in which case retained
To this end, we make 5 cohorts based on the 5 ES model signatures.
1. Contains all 5 ES
2. Doesn’t contain ES_Bifi
3. Contains an extra signature IS_1 which is informative
4. Completely shuffled, does not fit at all
5. Doesn’t contain ES_Bact or ES_Esch
6. Merged ES_Firm and ES_Prev
Each is returned as a named tuple containing name, h, w, x
- cvanmf.combine.split_dataframe_to_cohorts(x: pandas.DataFrame, cohort_labels: pandas.Series, min_size: int = 0) Dict[Any, pandas.DataFrame]¶
Split a DataFrame into multiple, based on the provided cohort labels.
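The grouping step can be sketched without pandas. The meaning of `min_size` is not documented above, so the sketch assumes it drops cohorts with fewer than `min_size` samples; the helper name and sample labels are hypothetical, and a plain dict stands in for the `cohort_labels` Series.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def split_samples_to_cohorts(
    cohort_labels: Dict[str, Hashable], min_size: int = 0
) -> Dict[Hashable, List[str]]:
    """Group sample names by cohort label.

    Assumption: cohorts with fewer than `min_size` samples are dropped.
    """
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for sample, label in cohort_labels.items():
        groups[label].append(sample)
    return {label: names for label, names in groups.items() if len(names) >= min_size}


# Hypothetical sample -> cohort mapping.
labels = {"s1": "UK", "s2": "UK", "s3": "UK", "s4": "DE"}
all_cohorts = split_samples_to_cohorts(labels)
large_cohorts = split_samples_to_cohorts(labels, min_size=2)
```

Each list of sample names could then be used to select the matching rows or columns of the data table, depending on which axis holds the samples.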