Command Line

reapply

Command line interface to fit new data to an existing NMF Signatures model. The new data must use the same features as the model, though some mismatch is tolerated (features in the new data but not in the model, and vice versa). For the 5 Enterosignatures model, the features currently follow GTDB r207.

For more on Enterosignatures see:

Usage

reapply [OPTIONS]

Options

-i, --input <input>

Required. New feature matrix, with features on rows and samples on columns.

-m, --model <model>

Name of the model to reapply.

Options:

5es

-h, --hard_mapping <hard_mapping>

Mapping between features in input table and model W matrix. Provide as a csv, with first column features in input table, second column the name in model W to map to.
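As an illustration of the hard mapping format, the sketch below writes a two-column CSV using Python's csv module. The feature names are hypothetical, and whether a header row is expected is not stated here; this sketch assumes none.

```python
import csv
import io

# Hypothetical mapping: name in the input table -> name in the model W matrix.
mapping = {
    "g__Ruminococcus_A": "g__Ruminococcus",
    "g__Bacteroides_B": "g__Bacteroides",
}


def write_hard_mapping(pairs, handle):
    """Write one 'input feature,model feature' row per pair (no header row)."""
    writer = csv.writer(handle, lineterminator="\n")
    for input_name, model_name in pairs.items():
        writer.writerow([input_name, model_name])


# Written to an in-memory buffer here; pass an open file handle to save to disk.
buffer = io.StringIO()
write_hard_mapping(mapping, buffer)
mapping_csv = buffer.getvalue()
print(mapping_csv)
```

The first column holds the name as it appears in the input table, the second the name in model W that it should map to.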

--rollup, --no-rollup

Only used when genera are features. For genera in the abundance table which do not match the model W matrix, add their abundance to a family-level entry if one exists.
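The roll-up rule can be sketched as below. The semicolon-delimited lineage format, the feature names, and the choice to drop features that match at neither level are all assumptions made for illustration; the tool's own matching logic may differ.

```python
def rollup(abundance, model_features):
    """Add unmatched genus abundances to a matching family entry, if any.

    abundance: dict of lineage string -> abundance for one sample.
    model_features: set of lineage strings present in the model W matrix.
    Assumes (hypothetically) that lineages look like 'f__Family;g__Genus'.
    """
    rolled = {}
    for lineage, value in abundance.items():
        if lineage in model_features:
            target = lineage
        else:
            # Strip the genus level to get the family-level lineage.
            family = lineage.rsplit(";", 1)[0]
            # Roll up only when the model has a family-level entry.
            target = family if family in model_features else None
        if target is not None:
            rolled[target] = rolled.get(target, 0.0) + value
    return rolled


model = {"f__Lachnospiraceae;g__Blautia", "f__Lachnospiraceae"}
sample = {
    "f__Lachnospiraceae;g__Blautia": 0.5,
    "f__Lachnospiraceae;g__Novelgenus": 0.25,  # genus not in model, family is
    "f__Unknownaceae;g__Other": 0.25,          # no match at either level
}
rolled = rollup(sample, model)
print(rolled)
```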

-s, --separator <separator>

Separator used in input and output files.

-o, --output_dir <output_dir>

Required. Directory to write output to.

rank_select

Rank selection for NMF using mn-fold bi-cross validation

Attempt to identify a suitable rank k for decomposition of input matrix X. This is done by shuffling the matrix a number of times, and for each shuffle dividing it into m x n submatrices (m splits on rows, n splits on columns). Each of these submatrices is held out in turn and an estimate of it learnt from the remaining submatrices; the quality of the estimated matrix is used to identify a suitable rank.
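The splitting step can be sketched in plain Python as below. This only generates the held-out row and column indices for each fold of one shuffle; it does not fit any model. The function names, and the choice to shuffle rows and columns independently, are assumptions for illustration rather than the tool's own implementation.

```python
import random


def split_indices(n_items, n_splits):
    """Split range(n_items) into n_splits contiguous, near-even groups."""
    base, extra = divmod(n_items, n_splits)
    groups, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def bicv_folds(n_rows, n_cols, m, n, seed=None):
    """Yield (held_out_rows, held_out_cols) for each of the m * n folds.

    Indices refer to positions in the original, unshuffled matrix.
    """
    rng = random.Random(seed)
    row_order = list(range(n_rows))
    col_order = list(range(n_cols))
    rng.shuffle(row_order)  # shuffle rows and columns independently
    rng.shuffle(col_order)
    for row_group in split_indices(n_rows, m):
        for col_group in split_indices(n_cols, n):
            yield ([row_order[i] for i in row_group],
                   [col_order[j] for j in col_group])


folds = list(bicv_folds(n_rows=12, n_cols=9, m=3, n=3, seed=0))
print(len(folds))  # 9 folds for a 3 x 3 design
```

Every cell of the matrix falls in exactly one held-out block per shuffle, so each shuffle contributes m * n estimates.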

The underlying NMF implementation is from scikit-learn, whose documentation describes many of the NMF-specific parameters in more detail.

Usage

rank_select [OPTIONS]

Options

-i, --input <input>

Required. Matrix to be decomposed, in character delimited format. Use -d/--delimiter to set delimiter.

-o, --output_dir <output_dir>

Directory to write output. Defaults to current directory. Output is a table with a row for each shuffle and rank combination, and columns for each of the rank selection measures (R^2, cosine similarity, etc.)

-d, --delimiter <delimiter>

Delimiter to use for input and output tables. Defaults to tab.

-n, --shuffles <shuffles>

Number of times to shuffle input matrix. Bi-cross validation is run once on each shuffle, for each rank.

Default:

100

--progress, --no-progress

Display progress bar showing number of bi-cross validation iterations completed and remaining.

Default:

True

--log_warning

Log only warnings or higher.

--log_info

Log progress information as well as warnings etc.

--log_debug

Log debug info as well as info, warnings, etc.

--seed <seed>

Seed to initialise random state. Specify if results need to be reproducible.

-l, --rank_min <rank_min>

Required. Lower bound of ranks to search. Must be >= 2.

-u, --rank_max <rank_max>

Required. Upper bound of ranks to search. Must be >= 2.

-s, --rank_step <rank_step>

Step between ranks to search.

--l1_ratio <l1_ratio>

Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0. This controls the mix between sparsifying and densifying regularisation. 1.0 will encourage sparsity, 0.0 density.

Default:

0.0

--alpha <alpha>

Multiplier for regularisation terms.

Default:

0.0
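To make the interaction between the regularisation multiplier and l1_ratio concrete, the sketch below splits a multiplier into its sparsifying (L1) and densifying (L2) parts, in the elastic-net style used by scikit-learn. The function name is hypothetical, and scikit-learn applies further scaling factors that are omitted here.

```python
def penalty_weights(alpha, l1_ratio):
    """Split a regularisation multiplier into L1 and L2 weights."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must be in the range [0.0, 1.0]")
    l1_weight = alpha * l1_ratio          # sparsifying part
    l2_weight = alpha * (1.0 - l1_ratio)  # densifying part
    return l1_weight, l2_weight


print(penalty_weights(0.5, 1.0))   # all of the penalty is L1
print(penalty_weights(0.5, 0.25))  # a quarter L1, three quarters L2
```

With alpha at 0.0 both weights are zero, so l1_ratio has no effect unless alpha is also set.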

--max_iter <max_iter>

Maximum number of iterations during decomposition. Will terminate earlier if solution converges. Warnings will be emitted when the solutions fail to converge.

Default:

3000

--beta_loss <beta_loss>

Beta loss function for NMF decomposition.

Default:

'kullback-leibler'

Options:

kullback-leibler | frobenius | itakura-saito

--init <init>

Method to use when initialising H and W for decomposition.

Default:

'nndsvdar'

Options:

nndsvdar | random | nndsvd | nndsvda

--design <design>

Two numbers stating how to split the input matrix for bi-cross validation. The first is the number of even splits to make along the rows; the second is the number along the columns. This will result in mn folds for each shuffle. Provide in the format '--design 4 3'. More folds mean more iterations per shuffle, which increases execution time.

Default:

3, 3
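As a rough guide to run time, the number of fold evaluations grows with the number of ranks, shuffles and folds. The sketch below counts them; it assumes the upper rank bound is inclusive, which is not stated here, and the function name is hypothetical.

```python
def n_fold_evaluations(rank_min, rank_max, rank_step, shuffles, design):
    """Count fold evaluations: ranks x shuffles x (m * n) folds."""
    # Assumes rank_max is included in the search.
    ranks = range(rank_min, rank_max + 1, rank_step)
    m, n = design
    return len(ranks) * shuffles * m * n


# Ranks 2..10 in steps of 1, 100 shuffles, 3 x 3 design.
total = n_fold_evaluations(2, 10, 1, 100, (3, 3))
print(total)
```

Widening the rank range or moving to a larger design multiplies this count, so a smaller number of shuffles can be a reasonable first pass.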

regu_select

Regularisation selection for NMF, testing each ALPHA value with 9-fold bi-cross validation

Attempt to identify a suitable regularisation parameter alpha for decomposition of input matrix X at a given rank with a given ratio between L1 and L2 regularisation. This is done by shuffling the matrix a number of times, and for each shuffle dividing it into 9 submatrices (the default 3 x 3 design). Each of these nine is held out in turn and an estimate of it learnt from the remaining submatrices; the quality of the estimated matrix is used to identify a suitable alpha.

The underlying NMF implementation is from scikit-learn, whose documentation describes many of the NMF-specific parameters in more detail.

ALPHA is a list of values to be tested. 0.0 will always be added.

Usage

regu_select [OPTIONS] [ALPHA]...

Options

-i, --input <input>

Required. Matrix to be decomposed, in character delimited format. Use -d/--delimiter to set delimiter.

-o, --output_dir <output_dir>

Directory to write output. Defaults to current directory. Output is a table with a row for each shuffle and rank combination, and columns for each of the rank selection measures (R^2, cosine similarity, etc.)

-d, --delimiter <delimiter>

Delimiter to use for input and output tables. Defaults to tab.

-s, --shuffles <shuffles>

Number of times to shuffle input matrix. Bi-cross validation is run once on each shuffle, for each rank.

Default:

100

--progress, --no-progress

Display progress bar showing number of bi-cross validation iterations completed and remaining.

Default:

True

--log_warning

Log only warnings or higher.

--log_info

Log progress information as well as warnings etc.

--log_debug

Log debug info as well as info, warnings, etc.

--seed <seed>

Seed to initialise random state. Specify if results need to be reproducible.

--l1_ratio <l1_ratio>

Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0. This controls the mix between sparsifying and densifying regularisation. 1.0 will encourage sparsity, 0.0 density.

Default:

1.0

-k, --rank <rank>

Required. Number of signatures in the decomposition. Regularisation is selected for a given rank, and the optimal value may vary between ranks.

--max_iter <max_iter>

Maximum number of iterations during decomposition. Will terminate earlier if solution converges. Warnings will be emitted when the solutions fail to converge.

Default:

3000

--beta_loss <beta_loss>

Beta loss function for NMF decomposition.

Default:

'kullback-leibler'

Options:

kullback-leibler | frobenius | itakura-saito

--init <init>

Method to use when initialising H and W for decomposition.

Default:

'nndsvdar'

Options:

nndsvdar | random | nndsvd | nndsvda

--scale, --no-scale

Scale alpha parameter by number of samples. Setting this to True provides the same behaviour as was applied in earlier versions of scikit-learn. This is done by default, as the default alpha values are selected to work with this regularisation calculation. The alpha values reported in the output will be the scaled alpha values.

Default:

True

--design <design>

Two numbers stating how to split the input matrix for bi-cross validation. The first is the number of even splits to make along the rows; the second is the number along the columns. This will result in mn folds for each shuffle. Provide in the format '--design 4 3'. More folds mean more iterations per shuffle, which increases execution time.

Default:

3, 3

Arguments

ALPHA

Optional argument(s)

decompose

Decompositions for RANKS.

RANKS is a list of ranks for which to generate decompositions.

Generate a number of decompositions for each of the specified ranks. NMF solutions are non-unique and depend on initialisation, so when using an initialisation with randomness multiple solutions can be produced. From these solutions, the best can be retained based on criteria such as reconstruction error or cosine similarity.

Some initialisation methods are deterministic, and as such only a single decomposition will be produced.

The output is H and W matrices for each decomposition, tables of quality scores, and some analyses with default parameters. For further analysis, decompositions can be loaded using Decomposition.from_dir, or tables used directly for custom analyses. By default, a symlink to the input data

Usage

decompose [OPTIONS] [RANKS]...

Options

-i, --input <input>

Required. Matrix to be decomposed, in character delimited format. Use -d/--delimiter to set delimiter.

-o, --output_dir <output_dir>

Directory to write output. Defaults to current directory. Output is a table with a row for each shuffle and rank combination, and columns for each of the rank selection measures (R^2, cosine similarity, etc.)

-d, --delimiter <delimiter>

Delimiter to use for input and output tables. Defaults to tab.

--progress, --no-progress

Display progress bar showing number of bi-cross validation iterations completed and remaining.

Default:

True

--log_warning

Log only warnings or higher.

--log_info

Log progress information as well as warnings etc.

--log_debug

Log debug info as well as info, warnings, etc.

--seed <seed>

Seed to initialise random state. Specify if results need to be reproducible.

--l1_ratio <l1_ratio>

Regularisation mixing parameter. In range 0.0 <= l1_ratio <= 1.0. This controls the mix between sparsifying and densifying regularisation. 1.0 will encourage sparsity, 0.0 density.

Default:

0.0

--alpha <alpha>

Multiplier for regularisation terms.

Default:

0.0

--max_iter <max_iter>

Maximum number of iterations during decomposition. Will terminate earlier if solution converges. Warnings will be emitted when the solutions fail to converge.

Default:

3000

--beta_loss <beta_loss>

Beta loss function for NMF decomposition.

Default:

'kullback-leibler'

Options:

kullback-leibler | frobenius | itakura-saito

--init <init>

Method to use when initialising H and W for decomposition.

Default:

'random'

Options:

nndsvdar | random | nndsvd | nndsvda

--n_runs <n_runs>

Number of times to run decomposition for each rank. Ignored when init is a deterministic method (nndsvd/nndsvda).

Default:

20

--top_n <top_n>

Keep and report only the best top_n decompositions of the n_runs decompositions produced. Which are the best decompositions is determined by top_criteria. Ignored when init is a deterministic method (nndsvd/nndsvda).

Default:

5

--top_criteria <top_criteria>

Criteria used to determine which of the n_runs decompositions to keep and report.

Default:

'beta_divergence'

Options:

cosine_similarity | r_squared | rss | l2_norm | beta_divergence
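The interplay of n_runs, top_n and top_criteria can be sketched as a simple ranking. The direction of each criterion (higher or lower is better) is inferred from what the measure means and is an assumption about the tool's behaviour; the function name and the scores are made up for illustration.

```python
# Whether larger values indicate a better decomposition (assumed directions).
HIGHER_IS_BETTER = {
    "cosine_similarity": True,
    "r_squared": True,
    "rss": False,
    "l2_norm": False,
    "beta_divergence": False,
}


def select_top(runs, top_n=5, top_criteria="beta_divergence"):
    """Return the best top_n runs, best first.

    runs: list of dicts, each holding a score under the criterion name.
    """
    reverse = HIGHER_IS_BETTER[top_criteria]
    ranked = sorted(runs, key=lambda run: run[top_criteria], reverse=reverse)
    return ranked[:top_n]


runs = [
    {"run": 0, "beta_divergence": 12.0, "r_squared": 0.90},
    {"run": 1, "beta_divergence": 9.5, "r_squared": 0.93},
    {"run": 2, "beta_divergence": 15.1, "r_squared": 0.88},
]
best = select_top(runs, top_n=2)
print([run["run"] for run in best])  # [1, 0]
```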

--compress, --no_compress

Compress output folders to .tar.gz. Default is to output each decomposition to a separate folder.

Default:

False

Create a symlink for files which do not vary between runs (input, parameters, etc). If disabled, will make redundant copies.

Default:

False

Arguments

RANKS

Optional argument(s)