patpy.tl.MOFA

class patpy.tl.MOFA(sample_key, cell_group_key, layer=None, seed=67, n_factors=10, aggregate_cell_types=True, aggregation_mode='mean', scale_views=False, scale_groups=False, center_groups=True, use_float32=False, ard_factors=False, ard_weights=True, spikeslab_weights=True, spikeslab_factors=False, iterations=1000, convergence_mode='fast', startELBO=1, freqELBO=1, gpu_mode=False, gpu_device=None, verbose=False, quiet=False, outfile=None, save_interrupted=False)

Patient representation using MOFA2 model, treating patients as samples with optional cell type views.

Parameters:
  • sample_key (str) – Column in .obs containing sample (patient) IDs.

  • cell_group_key (str) – Column in .obs containing cell type information.

  • layer (str | None (default: None)) – Layer in AnnData to use for gene expression data. If None, uses .X.

  • seed (int (default: 67)) – Random seed for reproducibility.

  • n_factors (int (default: 10)) – Number of latent factors to learn.

  • aggregate_cell_types (bool (default: True)) – If True, treat each cell type as a separate view. If False, aggregate gene expression across all cell types into a single view.

  • aggregation_mode (str (default: 'mean')) – Name of the aggregation function to use (e.g., ‘mean’, ‘median’, ‘sum’)

  • scale_views (bool (default: False)) – Scale each view to unit variance.

  • scale_groups (bool (default: False)) – Scale each group to unit variance.

  • center_groups (bool (default: True)) – Center each group.

  • use_float32 (bool (default: False)) – Use 32-bit floating point precision.

  • ard_factors (bool (default: False)) – Use Automatic Relevance Determination (ARD) prior on factors.

  • ard_weights (bool (default: True)) – Use ARD prior on weights.

  • spikeslab_weights (bool (default: True)) – Use spike-and-slab prior on weights.

  • spikeslab_factors (bool (default: False)) – Use spike-and-slab prior on factors.

  • iterations (int (default: 1000)) – Maximum number of training iterations.

  • convergence_mode (str (default: 'fast')) – Convergence mode passed to MOFA2. One of 'fast', 'medium', 'slow'; slower modes apply a stricter convergence threshold.

  • startELBO (int (default: 1)) – Iteration number to start computing the Evidence Lower Bound (ELBO).

  • freqELBO (int (default: 1)) – Frequency of ELBO computation after startELBO.

  • gpu_mode (bool (default: False)) – Use GPU for training.

  • gpu_device (int | None (default: None)) – GPU device ID to use.

  • verbose (bool (default: False)) – Verbose output during training.

  • quiet (bool (default: False)) – Suppress training output.

  • outfile (str | None (default: None)) – Path to save the trained model.

  • save_interrupted (bool (default: False)) – Save the model if training is interrupted.
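
A minimal usage sketch follows. It assumes the usual order is to construct the model, call prepare_anndata, and then call calculate_distance_matrix; this page lists those methods but does not state the order. The helper name and the column names "donor_id" and "cell_type" are hypothetical.

```python
def mofa_patient_distances(adata, sample_key="donor_id", cell_group_key="cell_type"):
    """Sketch: build a MOFA patient representation and return its distances.

    Assumptions (not confirmed by this page): `adata.obs` has the columns
    named by `sample_key` and `cell_group_key`, and prepare_anndata must be
    called before calculate_distance_matrix.
    """
    # Imported lazily so the sketch can be read without patpy installed.
    import patpy

    model = patpy.tl.MOFA(
        sample_key=sample_key,
        cell_group_key=cell_group_key,
        n_factors=10,  # documented default
        seed=67,       # documented default
    )
    model.prepare_anndata(adata)
    distances = model.calculate_distance_matrix(dist="euclidean")
    return model, distances
```

The returned model can then be passed to the plotting and evaluation methods listed below.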

Methods table

calculate_distance_matrix([force, ...])

Calculate distances between patients using MOFA2 latent factors.

embed([method, n_jobs, verbose])

Embed distances into 2-D coordinates.

evaluate_representation(target[, method, ...])

Evaluate representation of target for the given distance matrix

fit_linear_probe(target[, task, test_size, ...])

Fit a linear probe on top of sample embeddings.

plot_clustermap([metadata_cols, figsize])

Plot a hierarchically-clustered heat-map of the distance matrix.

plot_embedding([method, metadata_cols, ...])

Plot a 2-D embedding of distances, optionally coloured by metadata.

plot_metadata_distribution(metadata_columns, ...)

Predict metadata columns, and plot embeddings colorised by metadata values

predict_metadata(target[, metadata, ...])

Predict classes from metadata column target for samples using K-Nearest Neighbors classifier

prepare_anndata(adata)

Prepare AnnData for MOFA2, optionally treating cell types as separate views.

to_adata([metadata])

Convert samples data to AnnData object

Methods

MOFA.calculate_distance_matrix(force=False, store_weights=False, dist='euclidean')

Calculate distances between patients using MOFA2 latent factors.

Parameters:
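  • force (bool (default: False)) – If True, recalculate the distance matrix even if it exists.

  • store_weights (bool (default: False)) – If True, store the weights (relation of factors to genes) in self.adata.uns.

  • dist (str (default: 'euclidean')) – Distance metric used to compare patients in latent factor space.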

Returns:

distances (ndarray) – Matrix of distances between patients.
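
To illustrate what a patient distance matrix looks like with the default dist='euclidean', here is a small sketch. It assumes each patient is described by a vector of latent factor values; the function name and the toy factor values are hypothetical, and this is not patpy's implementation.

```python
import math


def pairwise_euclidean(factors):
    """Return the symmetric matrix of Euclidean distances between rows."""
    n = len(factors)
    distances = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(factors[i], factors[j])
            # Distances are symmetric, so fill both triangles at once.
            distances[i][j] = distances[j][i] = d
    return distances


# Three hypothetical patients, each described by two latent factors.
patient_factors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
distances = pairwise_euclidean(patient_factors)
print(distances[0][1])  # 5.0
```

The result is square with shape (n_samples, n_samples) and a zero diagonal.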

MOFA.embed(method='UMAP', n_jobs=-1, verbose=False)

Embed distances into 2-D coordinates.

Parameters:
  • distances – Square distance matrix of shape (n_samples, n_samples).

  • method (str (default: 'UMAP')) – One of "MDS", "TSNE", "UMAP".

  • n_jobs (int (default: -1)) – Number of parallel threads (-1 = all).

  • verbose (bool (default: False)) – Print progress information.

Return type:

ndarray

Returns:

coordinates (ndarray) – Array of shape (n_samples, 2).

MOFA.evaluate_representation(target, method='knn', metadata=None, num_donors_subset=None, proportion_donors_subset=None, **parameters)

Evaluate representation of target for the given distance matrix

Parameters:
  • target (str) – A sample-level covariate to evaluate the representation for.

  • method (Literal['knn', 'distances', 'proportions', 'silhouette', 'persistence', 'permanova'] (default: 'knn')) –

    Method to use for evaluation:

    • knn: predict values of target using K-nearest neighbors and evaluate the prediction

    • distances: test if distances between samples are significantly different from the null distribution

    • proportions: test if distribution of target differs between groups (e.g. clusters)

    • silhouette: calculate silhouette score for the given distances

  • num_donors_subset (int, optional) – Absolute number of donors to include in the evaluation.

  • proportion_donors_subset (float, optional) – Proportion of donors to include in the evaluation.

  • parameters (dict) –

    Parameters for the evaluation method. The following parameters are used:

    • knn:
      • n_neighbors: number of neighbors to use for prediction

      • task: type of prediction task. One of “classification”, “regression”, “ranking”. See documentation of predict_knn for more information

    • distances:
      • control_level: value of target that should be used as a control group

      • normalization_type: type of normalization to use. One of “total”, “shift”, “var”. See documentation of test_distances_significance for more information

      • n_bootstraps: number of bootstrap iterations to use

      • trimmed_fraction: fraction of the most extreme values to remove from the distribution

      • compare_by_difference: if True, normalization is defined as difference (as in the original paper). Otherwise, it is defined as a ratio

    • proportions:
      • groups: groups (e.g. cluster numbers) of the observations

Returns:

result (dict) – Result of evaluation with the following keys:

  • score: a number evaluating the representation. The higher the better

  • metric: name of the metric used for evaluation

  • n_unique: number of unique values in target

  • n_observations: number of observations used for evaluation. Can be different for different targets, even within one dataset (because of NAs)

  • method: name of the method used for evaluation

There are other optional keys depending on the method used for evaluation.
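
As an illustration of a score where higher is better, the sketch below computes a mean silhouette coefficient directly from a precomputed distance matrix. It is a generic textbook formulation, not patpy's code; the function name and the toy data are hypothetical, and at least two distinct labels are assumed.

```python
def silhouette_from_distances(distances, labels):
    """Mean silhouette coefficient for a precomputed distance matrix."""
    n = len(labels)
    groups = sorted(set(labels))
    scores = []
    for i in range(n):
        same = [distances[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            scores.append(0.0)  # common convention for singleton groups
            continue
        a = sum(same) / len(same)  # mean distance to own group
        b = min(  # mean distance to the nearest other group
            sum(distances[i][j] for j in range(n) if labels[j] == g)
            / sum(1 for j in range(n) if labels[j] == g)
            for g in groups
            if g != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Four hypothetical samples placed on a line; distance = absolute difference.
positions = [0.0, 1.0, 10.0, 11.0]
toy_distances = [[abs(p - q) for q in positions] for p in positions]
well_separated = silhouette_from_distances(toy_distances, ["A", "A", "B", "B"])
mixed = silhouette_from_distances(toy_distances, ["A", "B", "A", "B"])
print(well_separated > mixed)  # True
```

Labels that follow the geometry give a score close to 1, while labels that cut across it give a negative score.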

MOFA.fit_linear_probe(target, task='classification', test_size=0.2, random_state=42, test_sample_labels=None)

Fit a linear probe on top of sample embeddings.

Parameters:
  • target (str) – Column in self.adata.obs to predict.

  • task (Literal['classification', 'regression'] (default: 'classification')) – "classification" or "regression".

  • test_size (float (default: 0.2)) – Fraction of donors held out for evaluation when test_sample_labels is not provided.

  • random_state (int (default: 42)) – Random seed for the train/test split (used only when test_sample_labels is not provided).

  • test_sample_labels (list | None (default: None)) – Explicit list of sample labels (index values of sample_representation) to use as the test set. When provided, test_size and random_state are ignored. When None, a random split is performed and the chosen test labels are stored in test_sample_labels for reproducibility.

Return type:

dict

Returns:

dict Keys: "model", "test_sample_labels", "{target}_test", "{target}_pred".

For classification: additionally "accuracy" and "f1". For regression: additionally "r2" and "pearson".

Examples

>>> result = model.fit_linear_probe(target="age", task="regression")
>>> print(f"Pearson r = {result['pearson']:.3f}")

MOFA.plot_clustermap(metadata_cols=None, figsize=(10, 12), *args, **kwargs)

Plot a hierarchically-clustered heat-map of the distance matrix.

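Parameters:
  • metadata_cols (default: None) – Metadata columns to annotate the heat-map with.

  • figsize (tuple (default: (10, 12))) – Figure size as (width, height).
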
Returns:

seaborn.matrix.ClusterGrid

MOFA.plot_embedding(method='UMAP', metadata_cols=None, continuous_palette='viridis', categorical_palette='tab10', na_color='lightgray', axes=None, use_uns_colors=True, color_key_suffix='_colors', show_legend=True)

Plot a 2-D embedding of distances, optionally coloured by metadata.

Parameters:
  • method (str (default: 'UMAP')) – Embedding method. One of "MDS", "TSNE", "UMAP".

  • metadata_cols (list[str] | None (default: None)) – Columns from .obs used for colouring.

  • continuous_palette (str (default: 'viridis')) – Seaborn palette name for continuous metadata.

  • categorical_palette (str (default: 'tab10')) – Seaborn palette name for categorical metadata.

  • na_color (str (default: 'lightgray')) – Colour used for samples with missing metadata values.

  • axes (default: None) – Existing matplotlib Axes (or array of Axes) to plot into.

  • use_uns_colors (bool (default: True)) – If True, look for colors in adata.uns[f'{col}{color_key_suffix}'] and use them if available (similar to scanpy).

  • color_key_suffix (str (default: '_colors')) – Suffix of the color key in adata.uns. For example, with suffix "_colors" and column "cell_type", colors are looked up in adata.uns['cell_type_colors'].

  • show_legend (bool (default: True)) – If True, display the legend. If False, hide it.

Returns:

matplotlib Axes or array of Axes
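
The color lookup described for use_uns_colors and color_key_suffix can be sketched as follows. This assumes the scanpy convention that the stored list of colors is aligned with the category order; the function name, the fallback behaviour, and the toy uns dictionary are hypothetical.

```python
def resolve_colors(uns, col, categories, color_key_suffix="_colors", fallback=None):
    """Map each category to a colour stored under uns[f"{col}{color_key_suffix}"].

    Returns `fallback` when the key is missing or holds too few colours.
    """
    key = f"{col}{color_key_suffix}"
    stored = uns.get(key)
    if stored is None or len(stored) < len(categories):
        return fallback
    # Assumed convention: the i-th stored colour belongs to the i-th category.
    return dict(zip(categories, stored))


# A plain dict standing in for adata.uns.
uns = {"cell_type_colors": ["#1f77b4", "#ff7f0e"]}
print(resolve_colors(uns, "cell_type", ["B cell", "T cell"]))
print(resolve_colors(uns, "disease", ["healthy", "sick"]))  # no stored colours
```

When no stored colors are found, a plotting routine would fall back to a palette such as categorical_palette.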

MOFA.plot_metadata_distribution(metadata_columns, tasks, method='knn', embedding='UMAP', metadata=None, metric_threshold=0.4)

Predict metadata columns, and plot embeddings colorised by metadata values

Parameters:
  • metadata_columns (list[str]) – List of metadata columns to show

  • tasks (list[str]) – Tasks for each metadata column (classification, ranking or regression). Can be one string for all columns.

  • method (Literal['knn', 'distances', 'proportions', 'silhouette', 'persistence', 'permanova'] (default: 'knn')) – Method to use for evaluation. See documentation of evaluate_representation for more information

  • embedding (str (default: 'UMAP')) – Embedding to use for plotting

  • metric_threshold (float (default: 0.4)) – Results whose metric falls below this threshold are not displayed.
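
The handling of tasks and metric_threshold can be sketched as below. This is an illustration under assumptions, not patpy's code: the function name and the scores are hypothetical, and the default threshold is taken from the method signature.

```python
def select_columns_to_plot(metadata_columns, tasks, scores, metric_threshold=0.4):
    """Pair each column with its task and keep those scoring at or above the threshold."""
    # A single task string applies to every column.
    if isinstance(tasks, str):
        tasks = [tasks] * len(metadata_columns)
    if len(tasks) != len(metadata_columns):
        raise ValueError("tasks must be one string or match metadata_columns in length")
    return [
        (col, task, scores[col])
        for col, task in zip(metadata_columns, tasks)
        if scores[col] >= metric_threshold
    ]


# Hypothetical evaluation scores per metadata column.
scores = {"age": 0.2, "sex": 0.9, "disease": 0.55}
kept = select_columns_to_plot(
    ["age", "sex", "disease"],
    ["regression", "classification", "classification"],
    scores,
)
print(kept)
```

Here "age" is dropped because its score of 0.2 is below the threshold.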

MOFA.predict_metadata(target, metadata=None, n_neighbors=3, task='classification')

Predict classes from metadata column target for samples using K-Nearest Neighbors classifier

Parameters:
  • target (str) – Column name from adata.obs, which will be used for classification

  • metadata (DataFrame | None (default: None)) – Table with metadata about samples. The index should contain sample IDs. If None, adata.obs is used.

  • n_neighbors (int (default: 3)) – Number of neighbors to use for classification.

  • task (str (default: 'classification')) – Type of prediction task.

Returns:

y_true (array-like) – True values of target from metadata for samples with known values.

y_predicted (array-like) – Predicted values of target for samples with known values.
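
The idea of K-Nearest Neighbors prediction on a distance matrix can be sketched as follows. The leave-one-out scheme, the majority vote, the function name, and the toy data are assumptions made for illustration; patpy's actual procedure may differ. Samples with an unknown target (None here) are skipped, matching the "samples with known values" wording above.

```python
from collections import Counter


def knn_predict_from_distances(distances, labels, n_neighbors=3):
    """Leave-one-out k-NN classification on a precomputed distance matrix.

    Samples whose label is None are skipped, both as queries and as neighbours.
    """
    known = [i for i, label in enumerate(labels) if label is not None]
    y_true, y_pred = [], []
    for i in known:
        # Rank all other labelled samples by their distance to sample i.
        neighbours = sorted((j for j in known if j != i), key=lambda j: distances[i][j])
        votes = Counter(labels[j] for j in neighbours[:n_neighbors])
        y_true.append(labels[i])
        y_pred.append(votes.most_common(1)[0][0])
    return y_true, y_pred


# Seven hypothetical samples on a line; the last one has no known label.
positions = [0, 1, 2, 10, 11, 12, 5]
labels = ["ctrl", "ctrl", "ctrl", "case", "case", "case", None]
knn_distances = [[abs(p - q) for q in positions] for p in positions]
y_true, y_pred = knn_predict_from_distances(knn_distances, labels, n_neighbors=3)
print(y_true == y_pred)  # True
```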

MOFA.prepare_anndata(adata)

Prepare AnnData for MOFA2, optionally treating cell types as separate views.

Parameters:
  • adata (AnnData) – Annotated data matrix.
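
To make the notion of views concrete, the sketch below averages cell-level expression per patient, either separately for each cell type (one view per cell type) or over all cells (a single view). It mirrors the documented aggregate_cell_types and aggregation_mode='mean' options only conceptually; the function name, the per_cell_type flag, and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pseudobulk_views(cells, per_cell_type=True):
    """Average expression per patient, optionally per cell type.

    `cells` is a list of (patient, cell_type, expression_vector) tuples.
    Returns {view_name: {patient: mean_expression_vector}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for patient, cell_type, expression in cells:
        view = cell_type if per_cell_type else "all_cells"
        grouped[view][patient].append(expression)
    return {
        view: {
            # zip(*vectors) iterates gene by gene across the patient's cells.
            patient: [mean(gene_values) for gene_values in zip(*vectors)]
            for patient, vectors in patients.items()
        }
        for view, patients in grouped.items()
    }


# Four hypothetical cells measured on two genes.
cells = [
    ("P1", "T", [1.0, 3.0]),
    ("P1", "T", [3.0, 5.0]),
    ("P1", "B", [2.0, 2.0]),
    ("P2", "T", [0.0, 4.0]),
]
views = pseudobulk_views(cells)
single_view = pseudobulk_views(cells, per_cell_type=False)
print(views["T"]["P1"])  # [2.0, 4.0]
```

In this toy data patient P2 has no B cells, so the "B" view holds no entry for P2.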

MOFA.to_adata(metadata=None, *args, **kwargs)

Convert samples data to AnnData object

Parameters:
  • metadata (DataFrame (default: None)) – Metadata about samples to be added to .obs of AnnData object. Should contain samples in index

  • *args – Additional arguments to pass to calculate_distance_matrix method

  • **kwargs – Additional arguments to pass to calculate_distance_matrix method

Returns:

samples_adata (AnnData) – AnnData object with samples data.