cvanmf.combine
==============

.. py:module:: cvanmf.combine


Classes
-------

.. autoapisummary::

   cvanmf.combine.Cluster
   cvanmf.combine.Cohort
   cvanmf.combine.Combiner
   cvanmf.combine.Model
   cvanmf.combine.Signature


Functions
---------

.. autoapisummary::

   cvanmf.combine.combine_signatures
   cvanmf.combine.example_cohort_structure
   cvanmf.combine.split_dataframe_to_cohorts


Module Contents
---------------

.. py:class:: Cluster(signatures: Optional[Iterable[Signature]] = None, label: Optional[str] = None)

   .. py:method:: as_dataframe(clusters: List[Cluster]) -> pandas.DataFrame
      :staticmethod:


      Concatenated table with the mean signatures for each cluster.

      :param clusters: Clusters to concatenate

      :returns: Concatenated DataFrame with each mean signature as a column



   .. py:method:: cosine_similarity(clusters: List[Cluster]) -> pandas.DataFrame
      :staticmethod:


      Get the pairwise cosine similarity between clusters.

      :param clusters: Clusters to calculate similarity between

      :returns: DataFrame with pairwise cosine similarity



   .. py:method:: member_feature_weights(top_n: int = None, unit_scale: bool = True) -> pandas.DataFrame

      Weights of features in all member signatures of this cluster.

      This is returned in long form, intended for plotting with plotnine.



   .. py:method:: member_similarity(utri: bool = False) -> pandas.DataFrame

      Pairwise cosine similarity of the member signatures.



   .. py:method:: merge(clusters: Iterable[Cluster]) -> Cluster
      :staticmethod:


      Merge clusters.

      :param clusters: Clusters to merge.

      :returns: A new cluster containing the union of the signatures.



   .. py:method:: plot_feature_variance(top_n: Optional[int] = None, split_cohort: bool = False)

      Box plot for how much a feature weight varies.



   .. py:method:: plot_mds(**kwargs) -> plotnine.ggplot

      Plot signatures in this cluster in a 2D space.

      Produce a 2D representation of member signatures of this cluster,
      with color indicating cohort and shape indicating rank.
      Any arguments accepted by the scikit-learn MDS constructor can be
      passed in kwargs.



   .. py:method:: plot_member_similarity(split_cohort: bool = True) -> plotnine.ggplot


   .. py:method:: support(type: Literal['signature', 'model', 'cohort'] = 'signature', samples: bool = False) -> int

      The support for this cluster.

      We might want to place more confidence in signatures which appear more
      frequently, either having more similar signatures overall, appearing
      in more models, or appearing in more cohorts.

      This function provides either the number of signatures in the cluster,
      the number of models the cluster members originate from (likely to be
      the same as the number of signatures), or the number of cohorts the
      signature is found in. Alternatively, for the model or cohort case,
      support can be expressed as the total number of samples, which is
      useful when some cohorts are small and we want to give more weight to
      signatures from large cohorts.

      :param type: What to count. One of 'signature', 'model', 'cohort'
      :param samples: Count the number of samples for model or cohort
                      rather than treating each as one.



   .. py:property:: cohort_model_count
      :type: collections.Counter


      Number of models from each cohort which contain member signatures.



   .. py:property:: label
      :type: str


      Label for this cluster in plots.



   .. py:property:: mean_signature
      :type: pandas.Series


      Return the mean signature for this cluster.



   .. py:property:: member_data
      :type: pandas.DataFrame


      Data from all cohorts which any member signature originates from.



   .. py:property:: signatures
      :type: List[Signature]



.. py:class:: Cohort(name: str, models: Optional[List[Model]] = None, x: Optional[pandas.DataFrame] = None)

   .. py:method:: add_models(models: Iterable[Model]) -> Cohort


   .. py:method:: from_comparables(comparables: Iterable[cvanmf.stability.Comparable], name: str = '', x: Optional[pandas.DataFrame] = None) -> Cohort
      :staticmethod:



   .. py:property:: models
      :type: List[Model]



   .. py:property:: name
      :type: str



   .. py:property:: signatures
      :type: List[Signature]



   .. py:property:: x
      :type: Optional[pandas.DataFrame]



.. py:class:: Combiner(cohorts: Iterable[Cohort])

   .. py:method:: identify_cohort_specific(cohort_proportion: float = 0.95) -> List[Cluster]

      Find clusters which are consistent in one or more cohorts.

      A cluster may be unique to a cohort, being consistently recovered in
      data from that cohort but not in some others. Such clusters may appear
      to have poor support globally (considering all cohorts), but when
      there is strong consensus within a cohort it may be of interest to
      retain them.



   .. py:method:: label_by_match(external_model: cvanmf.stability.Comparable) -> pandas.DataFrame

      Label clusters by their similarity to an existing model.



   .. py:method:: merge_similar(cosine_threshold: float = 0.9, density_threshold: float = 0.98) -> None

      Merge signatures which are highly similar based on cosine similarity.

      Signatures are grouped into clusters when they all share similarity
      greater than the specified threshold. A signature can end up in
      multiple clusters as a result of this grouping. Use
      refine_multimembers to force a single membership after merging.

      :param cosine_threshold: Cosine similarity above which to consider
                               signatures similar.



   .. py:method:: plot_feature_variance(split_cohort: bool = True, top_n: Optional[int] = 20, label_fn: Callable[[List[str]], List[str]] = None, unit_scale: bool = True) -> plotnine.ggplot

      Plot feature variance for all signatures.



   .. py:method:: plot_mds(hull: bool = True, **kwargs) -> plotnine.ggplot

      Plot the member signatures of all clusters in reduced dimensions.



   .. py:method:: plot_member_similarity(split_cohort: bool = True)


   .. py:method:: remove_linear_combinations(cosine_threshold: float = 0.9, min_small_support_ratio: float = 0.5, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, do_removal: bool = True) -> None

      Remove signatures which can be expressed as linear combinations.

      Some signatures might be a combination of multiple others which
      commonly co-occur in some of the cohorts. We can remove these and keep
      only the constituent signatures. However, we do not want to discard
      well supported large signatures just because they are a combination of
      smaller but less well supported signatures.

      :param cosine_threshold: Model fit threshold
      :param do_removal: Remove the signatures which are combinations. Set
                         to False to return the identified signatures
                         without removing them.



   .. py:method:: remove_linear_combinations_2(cosine_threshold: float = 0.9, min_small_support_ratio: float = 0.5, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, do_removal: bool = True)

      Linear combination removal with removal of smaller poorly supported
      signatures along with larger ones.



   .. py:method:: remove_low_support(support_required: Union[float, int] = 0.2, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, signature_floor: int = 2, exempt_clusters: List[Cluster] = None, retain_alpha: float = 0.05, only_cohort_data: bool = False) -> List[Cluster]

      Remove signatures which do not appear frequently among models or
      cohorts.

      This looks for clusters with signatures which appear in a small number
      of models (or just a small number of signatures, or signatures from a
      small number of cohorts). Any signatures below the threshold for low
      support are retained only if removing them causes a worse model fit
      than removing the other signatures (the model fit for all the others
      pooled). This can be evaluated on the full data, or only on the data
      from the cohorts which member signatures are drawn from.

      :param support_required: Either the proportion or the absolute number
                               below which the signature will be considered
                               to have low support.
      :param support_type: What to use to determine support; either the
                           number of signatures, the number of models
                           cluster members appear in, or the number of
                           cohorts cluster members appear in.
      :param support_samples: For model or cohort, count the number of
                              samples rather than each model or cohort as
                              one. Useful when cohorts are of very uneven
                              size.
      :param signature_floor: Remove any clusters with fewer than this
                              number of member signatures.
      :param exempt_clusters: Clusters which will not be removed even if
                              they meet the low support criteria.
      :param retain_alpha: Signatures are retained if removing them has a
                           significantly different impact on model fit
                           compared to any good signature. Tested with a
                           Kruskal-Wallis test using this parameter as the
                           threshold. Set to 0 to reject all low support
                           signatures.
      :param only_cohort_data: Only use data from the cohorts from which
                               this cluster has support when evaluating the
                               change in model fit from omitting this
                               signature.



   .. py:property:: clusters
      :type: List[Cluster]



   .. py:property:: cohort_data
      :type: pandas.DataFrame


      Combined data for all cohorts.



   .. py:property:: cohorts
      :type: List[Cohort]



   .. py:property:: removed_clusters
      :type: List[Cluster]



.. py:class:: Model(signatures: Optional[Iterable[Signature]] = None, cohort: Optional[Cohort] = None)

   One decomposition which is being combined.


   .. py:method:: add_signatures(signatures: Iterable[Signature]) -> Model

      Add signatures and set them to refer to this model.



   .. py:method:: from_comparable(comparable: cvanmf.stability.Comparable) -> Model
      :staticmethod:



   .. py:property:: cohort
      :type: Cohort



   .. py:property:: signatures
      :type: List[Signature]



.. py:class:: Signature(model: Optional[Model] = None, **kwargs)

   Bases: :py:obj:`pandas.Series`


   One signature which is being combined.


   .. py:method:: from_comparable(c: cvanmf.stability.Comparable) -> List[Signature]
      :staticmethod:



   .. py:method:: mds(signatures: Union[Iterable[Signature], pandas.DataFrame], **kwargs) -> pandas.DataFrame
      :staticmethod:


      Perform MDS ordination of a set of signatures.

      Positions signatures in an n-dimensional space using cosine distance.
      This uses the sklearn MDS implementation, so all arguments to the MDS
      constructor can be passed in kwargs. Generally useful ones are
      n_components for the number of dimensions, and metric to choose
      between metric and non-metric MDS.

      :param signatures: Iterable of signatures, or a DataFrame of
                         signatures
      :param n: Dimensions to use in NMDS

      :returns: DataFrame with signatures on rows and dimension coordinates
                on columns.



   .. py:property:: model
      :type: Optional[Model]



.. py:function:: combine_signatures(signatures: Iterable[cvanmf.stability.Comparable], x: pandas.DataFrame = None, merge_threshold: float = 0.9, split_threshold: float = 0.9, prune_low_support: bool = True, low_support_threshold: Union[int, float] = 0.2, low_support_alpha: float = 0.05) -> cvanmf.denovo.Decomposition

   Combine signatures into a non-redundant set.


.. py:function:: example_cohort_structure() -> Dict[str, pandas.DataFrame]

   Make 5 cohorts of data with a complex structure.

   This set of cohorts is suitable for evaluating whether we capture the
   different kinds of situation we expect during signature combining.
   These are:

   * highly similar signatures which should be merged
   * signatures which are linear combinations of others which should be
     removed
   * low support signatures (present in few cohorts) which can be

     * uninformative, in which case removed
     * informative cohort specific, in which case retained

   To this end, we make 5 cohorts based on the 5 ES model signatures.

   1. Contains all 5 ES
   2. Doesn't contain ES_Bifi
   3. Contains an extra signature IS_1 which is informative
   4. Completely shuffled, does not fit at all
   5. Doesn't contain ES_Bact or ES_Esch
   6. Merged ES_Firm and ES_Prev

   Each is returned as a named tuple containing name, h, w, x


.. py:function:: split_dataframe_to_cohorts(x: pandas.DataFrame, cohort_labels: pandas.Series, min_size: int = 0) -> Dict[Any, pandas.DataFrame]

   Split a DataFrame into multiple, based on the provided cohort labels.
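
The kind of label-based split described for ``split_dataframe_to_cohorts`` can be sketched with plain pandas. This is an illustrative stand-in and not the cvanmf implementation: the function name ``split_by_cohort_sketch`` is hypothetical, samples are assumed to lie on the rows of ``x``, and ``min_size`` is assumed to mean that cohorts with fewer samples are dropped. The real function may differ on any of these points.

```python
from typing import Any, Dict

import pandas as pd


def split_by_cohort_sketch(
    x: pd.DataFrame, cohort_labels: pd.Series, min_size: int = 0
) -> Dict[Any, pd.DataFrame]:
    """Hypothetical stand-in, not the cvanmf implementation."""
    # Assumption: rows of x are samples, and cohort_labels is indexed by
    # the same sample identifiers.
    labels = cohort_labels.reindex(x.index)
    cohorts: Dict[Any, pd.DataFrame] = {}
    for label, members in labels.groupby(labels):
        subset = x.loc[members.index]
        # Assumption: cohorts with fewer than min_size samples are dropped.
        if len(subset) >= min_size:
            cohorts[label] = subset
    return cohorts


# Toy data: five samples, two features, three cohort labels.
x = pd.DataFrame(
    {"feat_a": [1.0, 2.0, 3.0, 4.0, 5.0], "feat_b": [0.5, 0.1, 0.9, 0.3, 0.7]},
    index=["s1", "s2", "s3", "s4", "s5"],
)
cohort_labels = pd.Series(
    ["north", "north", "south", "south", "west"], index=x.index
)
cohorts = split_by_cohort_sketch(x, cohort_labels, min_size=2)
```

With ``min_size=2`` the single-sample ``west`` cohort is left out under the assumption above, so the returned dictionary holds one DataFrame each for ``north`` and ``south``.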