cvanmf.combine
==============

.. py:module:: cvanmf.combine


Classes
-------

.. autoapisummary::

   cvanmf.combine.Cluster
   cvanmf.combine.Cohort
   cvanmf.combine.Combiner
   cvanmf.combine.Model
   cvanmf.combine.Signature


Functions
---------

.. autoapisummary::

   cvanmf.combine.combine_signatures
   cvanmf.combine.example_cohort_structure
   cvanmf.combine.split_dataframe_to_cohorts


Module Contents
---------------

.. py:class:: Cluster(signatures: Optional[Iterable[Signature]] = None, label: Optional[str] = None)

   .. py:method:: as_dataframe(clusters: List[Cluster]) -> pandas.DataFrame
      :staticmethod:


      Concatenated table with the mean signatures for each cluster.

      :param clusters: Clusters to concatenate
      :returns: Concatenated DataFrame with each mean signature as a column


   .. py:method:: cosine_similarity(clusters: List[Cluster]) -> pandas.DataFrame
      :staticmethod:


      Get the pairwise cosine similarity between clusters.

      :param clusters: Clusters to calculate similarity between
      :returns: DataFrame with pairwise cosine similarity


   .. py:method:: member_feature_weights(top_n: int = None, unit_scale: bool = True) -> pandas.DataFrame

      Weights of features in all member signatures of this cluster.

      This is returned in long form, intended to be used for plotting using
      plotnine.


   .. py:method:: member_similarity(utri: bool = False) -> pandas.DataFrame

      Pairwise cosine similarity of the member signatures.


   .. py:method:: merge(clusters: Iterable[Cluster]) -> Cluster
      :staticmethod:


      Merge clusters.

      :param clusters: Clusters to merge.
      :returns: A new cluster containing the union of the signatures.


   .. py:method:: plot_feature_variance(top_n: Optional[int] = None, split_cohort: bool = False)

      Box plot for how much a feature weight varies.


   .. py:method:: plot_mds(**kwargs) -> plotnine.ggplot

      Plot signatures in this cluster in a 2D space.

      Produce a 2D representation of the member signatures of this cluster, with
      color indicating cohort and shape indicating rank. Any argument accepted
      by the scikit-learn MDS constructor can be passed in kwargs.


   .. py:method:: plot_member_similarity(split_cohort: bool = True) -> plotnine.ggplot


   .. py:method:: support(type: Literal['signature', 'model', 'cohort'] = 'signature', samples: bool = False) -> int

      The support for this cluster.

      We might want to place more confidence in signatures which appear more
      frequently, either having more similar signatures overall, appearing
      in more models, or appearing in more cohorts. This function provides
      either the number of signatures in the cluster, the number of models
      the cluster members originate from (likely to be the same as number of
      signatures), or the number of cohorts the signature is found in.
      Alternatively, we can express this as the total number of samples for
      the model or cohort case, which is useful when some cohorts are small
      and signatures from large cohorts should carry more weight.

      :param type: What to count. One of 'signature', 'model', 'cohort'
      :param samples: Count the number of samples for model or cohort
          rather than treating each as one.


   .. py:property:: cohort_model_count
      :type: collections.Counter


      Number of models from each cohort which contain member signatures.


   .. py:property:: label
      :type: str


      Label for this cluster in plots.


   .. py:property:: mean_signature
      :type: pandas.Series


      Return the mean signature for this cluster.


   .. py:property:: member_data
      :type: pandas.DataFrame


      Data from all cohorts which any member signature originates from.


   .. py:property:: signatures
      :type: List[Signature]
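
The quantities reported by ``Cluster.mean_signature``, ``Cluster.cosine_similarity`` and ``Cluster.support`` can be illustrated without the library. The sketch below is not the cvanmf implementation: ``Sig``, ``cosine``, ``mean_signature`` and ``support`` are hypothetical stand-ins written in plain Python, and it assumes that similarity between clusters is computed on their mean signatures.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a signature: feature weights plus where it came from.
Sig = namedtuple("Sig", ["weights", "model", "cohort"])


def cosine(a, b):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_signature(members):
    """Element-wise mean of the member signatures' weights."""
    return [sum(col) / len(members) for col in zip(*(m.weights for m in members))]


def support(members, kind="signature"):
    """Count members, or the distinct models or cohorts they originate from."""
    if kind == "signature":
        return len(members)
    return len({getattr(m, kind) for m in members})


# Illustrative data only: three similar signatures and one unrelated signature.
cluster_a = [
    Sig([1.0, 0.0, 0.0], model="m1", cohort="c1"),
    Sig([0.8, 0.2, 0.0], model="m2", cohort="c1"),
    Sig([0.9, 0.1, 0.0], model="m3", cohort="c2"),
]
cluster_b = [Sig([0.0, 0.0, 1.0], model="m1", cohort="c1")]

similarity = cosine(mean_signature(cluster_a), mean_signature(cluster_b))
counts = {kind: support(cluster_a, kind) for kind in ("signature", "model", "cohort")}
```

Here ``cluster_a`` has three member signatures from three models but only two cohorts, which is the distinction the ``type`` argument of ``support`` is meant to expose.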


.. py:class:: Cohort(name: str, models: Optional[List[Model]] = None, x: Optional[pandas.DataFrame] = None)

   .. py:method:: add_models(models: Iterable[Model]) -> Cohort


   .. py:method:: from_comparables(comparables: Iterable[cvanmf.stability.Comparable], name: str = '', x: Optional[pandas.DataFrame] = None) -> Cohort
      :staticmethod:


   .. py:property:: models
      :type: List[Model]


   .. py:property:: name
      :type: str


   .. py:property:: signatures
      :type: List[Signature]


   .. py:property:: x
      :type: Optional[pandas.DataFrame]


.. py:class:: Combiner(cohorts: Iterable[Cohort])

   .. py:method:: identify_cohort_specific(cohort_proportion: float = 0.95) -> List[Cluster]

      Find clusters which are consistent in one or more cohorts.

      A cluster may be unique to a cohort, being consistently recovered in
      data from that cohort but not in some others. These may appear to
      have poor support globally (considering all cohorts), but when
      there is strong consensus within a cohort it may be of interest to
      retain these clusters.


   .. py:method:: label_by_match(external_model: cvanmf.stability.Comparable) -> pandas.DataFrame

      Label clusters by their similarity to an existing model.


   .. py:method:: merge_similar(cosine_threshold: float = 0.9, density_threshold: float = 0.98) -> None

      Merge signatures which are highly similar based on cosine similarity.

      Signatures are grouped into clusters when they all share similarity
      greater than the specified threshold. A signature can end up in
      multiple clusters as a result of this grouping. Use
      ``refine_multimembers`` to force membership of a single cluster after
      merging.

      :param cosine_threshold: Cosine similarity above which to consider
          signatures similar.


   .. py:method:: plot_feature_variance(split_cohort: bool = True, top_n: Optional[int] = 20, label_fn: Callable[[List[str]], List[str]] = None, unit_scale: bool = True) -> plotnine.ggplot

      Plot feature variance for all signatures.


   .. py:method:: plot_mds(hull: bool = True, **kwargs) -> plotnine.ggplot

      Plot the member signatures of all clusters in reduced dimensions.


   .. py:method:: plot_member_similarity(split_cohort: bool = True)


   .. py:method:: remove_linear_combinations(cosine_threshold: float = 0.9, min_small_support_ratio: float = 0.5, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, do_removal: bool = True) -> None

      Remove signatures which can be expressed as linear combinations.

      Some signatures might be a combination of multiple others which
      commonly co-occur in some of the cohorts. We can remove these and
      keep only the constituent signatures. However, we do not want to discard
      well supported large signatures just because they are a combination of
      smaller but less well supported signatures.

      :param cosine_threshold: Model fit threshold
      :param do_removal: Remove the signatures which are combinations. Set to
          False to return the identified signatures without removing.


   .. py:method:: remove_linear_combinations_2(cosine_threshold: float = 0.9, min_small_support_ratio: float = 0.5, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, do_removal: bool = True)

      Linear combination removal with removal of smaller poorly supported
      signatures along with larger ones.


   .. py:method:: remove_low_support(support_required: Union[float, int] = 0.2, support_type: Literal['signature', 'model', 'cohort'] = 'model', support_samples: bool = False, signature_floor: int = 2, exempt_clusters: List[Cluster] = None, retain_alpha: float = 0.05, only_cohort_data: bool = False) -> List[Cluster]

      Remove signatures which do not appear frequently among models or cohorts.

      This looks for clusters with signatures which appear in a small number
      of models (or just a small number of signatures, or signatures
      from a small number of cohorts). Any signatures below the threshold
      for low support are retained only if removing them causes a worse
      model fit than removing the other signatures (the model fit for all
      the others pooled). This can be evaluated on the full data, or only on
      the data from the cohorts which member signatures are drawn from.

      :param support_required: Either the proportion or the absolute number
          below which the signature will be considered to have low support.
      :param support_type: What to use to determine support; either the
          number of signatures, number of models cluster members appear in,
          or number of cohorts cluster members appear in.
      :param support_samples: For model or cohort, count the number of
          samples rather than each model or cohort as one. Useful when
          cohorts are of very uneven size.
      :param signature_floor: Remove any clusters with fewer than this
          number of member signatures.
      :param exempt_clusters: Clusters which will not be removed even if
          meeting low support criteria.
      :param retain_alpha: Signatures are retained if removing them
          has a significantly different impact on model fit compared
          to any good signature. Tested with a Kruskal-Wallis test using
          this parameter as a threshold. Set to 0 to reject all low support
          signatures.
      :param only_cohort_data: Only use data from the cohorts from which
          this cluster has support when evaluating the change in model fit
          from omitting this signature.


   .. py:property:: clusters
      :type: List[Cluster]


   .. py:property:: cohort_data
      :type: pandas.DataFrame


      Combined data for all cohorts.


   .. py:property:: cohorts
      :type: List[Cohort]


   .. py:property:: removed_clusters
      :type: List[Cluster]
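
The grouping described for ``Combiner.merge_similar`` (every pair in a cluster exceeds the threshold, and a signature may land in several clusters) can be read as finding maximal cliques in a thresholded similarity graph. That reading is an assumption, not a statement about how cvanmf implements the method. The sketch below uses plain Python and hypothetical helper names (``cosine``, ``maximal_cliques``, ``unit``) with made-up two-feature signatures.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


def unit(degrees):
    """Illustrative two-feature signature pointing at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]


# Hypothetical signatures: a~b and b~c are similar, but a and c are not.
signatures = {"a": unit(0), "b": unit(20), "c": unit(40), "d": unit(90)}
threshold = 0.9

# Link every pair whose cosine similarity exceeds the threshold.
adjacency = {
    name: {
        other
        for other in signatures
        if other != name and cosine(signatures[name], signatures[other]) > threshold
    }
    for name in signatures
}
clusters = maximal_cliques(adjacency)
```

In this toy input ``b`` is similar to both ``a`` and ``c``, which are not similar to each other, so ``b`` is a member of two clusters. This is the kind of overlapping membership that a later refinement step would need to resolve.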


.. py:class:: Model(signatures: Optional[Iterable[Signature]] = None, cohort: Optional[Cohort] = None)

   One decomposition which is being combined.


   .. py:method:: add_signatures(signatures: Iterable[Signature]) -> Model

      Add signatures and set them to refer to this model.


   .. py:method:: from_comparable(comparable: cvanmf.stability.Comparable) -> Model
      :staticmethod:


   .. py:property:: cohort
      :type: Cohort


   .. py:property:: signatures
      :type: List[Signature]


.. py:class:: Signature(model: Optional[Model] = None, **kwargs)

   Bases: :py:obj:`pandas.Series`


   One signature which is being combined.


   .. py:method:: from_comparable(c: cvanmf.stability.Comparable) -> List[Signature]
      :staticmethod:


   .. py:method:: mds(signatures: Union[Iterable[Signature], pandas.DataFrame], **kwargs) -> pandas.DataFrame
      :staticmethod:


      Perform MDS ordination of a set of signatures.

      Positions signatures in an n-dimensional space using cosine distance.
      This uses the sklearn MDS implementation, so any argument accepted by
      the MDS constructor can be passed in kwargs. Generally useful ones are
      n_components for the number of dimensions and metric to choose between
      metric and non-metric MDS.

      :param signatures: Iterable of signatures, or a DataFrame of signatures
      :param kwargs: Arguments passed to the sklearn MDS constructor, such as
          n_components for the number of dimensions.
      :returns: DataFrame with signatures on rows and dimension coordinates
          on columns.


   .. py:property:: model
      :type: Optional[Model]


.. py:function:: combine_signatures(signatures: Iterable[cvanmf.stability.Comparable], x: pandas.DataFrame = None, merge_threshold: float = 0.9, split_threshold: float = 0.9, prune_low_support: bool = True, low_support_threshold: Union[int, float] = 0.2, low_support_alpha: float = 0.05) -> cvanmf.denovo.Decomposition

   Combine signatures into a non-redundant set.
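
The ``low_support_threshold`` argument, like ``support_required`` in ``Combiner.remove_low_support``, accepts either a proportion or an absolute number. A minimal sketch of one way such a threshold could be interpreted follows. It assumes a float is a proportion of the total number of models (or cohorts, or signatures) and an int is an absolute count. ``is_low_support`` and the counts are hypothetical, and the subsequent model-fit test controlled by ``retain_alpha`` is not reproduced.

```python
def is_low_support(support, total, support_required=0.2):
    """Return True when a cluster's support falls below the requirement.

    Assumption: a float is read as a proportion of ``total``, an int as an
    absolute count.
    """
    if isinstance(support_required, float):
        return support / total < support_required
    return support < support_required


# Hypothetical counts: models contributing to each cluster, out of 20 models.
model_support = {"cluster_1": 18, "cluster_2": 3, "cluster_3": 9}
total_models = 20

# Threshold given as a proportion of all models.
flagged_by_proportion = [
    name for name, n in model_support.items() if is_low_support(n, total_models, 0.2)
]
# Threshold given as an absolute number of models.
flagged_by_count = [
    name for name, n in model_support.items() if is_low_support(n, total_models, 10)
]
```

With a proportion of 0.2 only ``cluster_2`` is flagged, while an absolute requirement of 10 models also flags ``cluster_3``.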


.. py:function:: example_cohort_structure() -> Dict[str, pandas.DataFrame]

   Make example cohorts of data with a complex structure.

   This set of cohorts is suitable for evaluating whether we capture the
   different kinds of situations we expect during signature combining. These
   are:

   * highly similar signatures which should be merged
   * signatures which are linear combinations of others which should be removed
   * low support signatures (present in few cohorts) which can be

     * uninformative, in which case they are removed
     * informative and cohort-specific, in which case they are retained

   To this end, we make the following cohorts based on the 5 ES model signatures.

   1.  Contains all 5 ES
   2.  Doesn't contain ES_Bifi
   3.  Contains an extra signature IS_1 which is informative
   4.  Completely shuffled, does not fit at all
   5.  Doesn't contain ES_Bact or ES_Esch
   6.  Merged ES_Firm and ES_Prev

   Each is returned as a named tuple containing name, h, w, x
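
A cohort in which ES_Firm and ES_Prev are merged produces the kind of signature that ``Combiner.remove_linear_combinations`` is meant to detect. The sketch below shows the idea for the simplest case of two constituents: fit weights by least squares, rebuild the candidate, and compare the rebuilt vector to the original with a cosine threshold. It is an illustration under stated assumptions, not the cvanmf implementation. The weight vectors are invented, ``fit_pair`` and ``is_combination`` are hypothetical helpers, and the weights are not constrained to be non-negative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_pair(target, a, b):
    """Least-squares weights (wa, wb) so that wa*a + wb*b approximates target."""
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    ab = sum(x * y for x, y in zip(a, b))
    ta = sum(t * x for t, x in zip(target, a))
    tb = sum(t * y for t, y in zip(target, b))
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("constituent signatures are collinear")
    # Solve the 2x2 normal equations by Cramer's rule.
    return (ta * bb - tb * ab) / det, (tb * aa - ta * ab) / det


def is_combination(target, a, b, cosine_threshold=0.9):
    """True when the best fit of a and b reproduces target closely enough."""
    wa, wb = fit_pair(target, a, b)
    rebuilt = [wa * x + wb * y for x, y in zip(a, b)]
    return cosine(target, rebuilt) > cosine_threshold


# Invented four-feature weights standing in for two ES signatures.
es_firm = [0.7, 0.3, 0.0, 0.0]
es_prev = [0.0, 0.0, 0.4, 0.6]
merged = [0.35, 0.15, 0.2, 0.3]   # an equal mix of the two
other = [0.0, 0.5, 0.5, 0.0]      # not well explained by the two

merged_is_combination = is_combination(merged, es_firm, es_prev)
other_is_combination = is_combination(other, es_firm, es_prev)
```

The merged vector is rebuilt almost exactly and so passes the threshold, while the unrelated vector does not. A fuller treatment would consider more than two constituents and the support of each signature, as the method's parameters suggest.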


.. py:function:: split_dataframe_to_cohorts(x: pandas.DataFrame, cohort_labels: pandas.Series, min_size: int = 0) -> Dict[Any, pandas.DataFrame]

   Split a DataFrame into multiple, based on the provided cohort labels.