patpy.tl.GroupedPseudobulk#

class patpy.tl.GroupedPseudobulk(sample_key, cell_group_key, layer='X_pca', seed=67)#

Baseline method in which the distance between two samples is the average distance between their per-cell-group pseudobulks.

Methods table#

`calculate_distance_matrix`([force, ...])	Calculate distances between samples as average distance between per cell-type pseudobulks
`embed`([method, n_jobs, verbose])	Embed distances into 2-D coordinates.
`evaluate_representation`(target[, method, ...])	Evaluate representation of `target` for the given distance matrix
`fit_linear_probe`(target[, task, test_size, ...])	Fit a linear probe on top of sample embeddings.
`plot_clustermap`([metadata_cols, figsize])	Plot a hierarchically-clustered heat-map of the distance matrix.
`plot_embedding`([method, metadata_cols, ...])	Plot a 2-D embedding of distances, optionally coloured by metadata.
`plot_metadata_distribution`(metadata_columns, ...)	Predict metadata columns, and plot embeddings colorised by metadata values
`predict_metadata`(target[, metadata, ...])	Predict classes from metadata column `target` for samples using K-Nearest Neighbors classifier
`prepare_anndata`(adata)	Prepare adata for analysis.
`to_adata`([metadata])	Convert samples data to AnnData object

Methods#

GroupedPseudobulk.calculate_distance_matrix(force=False, aggregate='mean', dist='euclidean')#

Calculate distances between samples as the average distance between their per-cell-type pseudobulks.
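A minimal sketch of the idea behind this baseline, written in plain Python and not taken from patpy's implementation. Each sample's cells are averaged per cell group (the `aggregate='mean'` case), and the distance between two samples is the mean Euclidean distance over the cell groups they share. The toy data, the helper names `pseudobulks` and `grouped_distance`, and the rule of using only shared cell groups are assumptions made for illustration.

```python
import math
from collections import defaultdict

# Toy cells: (sample, cell_group, embedding vector). Values are illustrative only.
cells = [
    ("s1", "T", [0.0, 0.0]), ("s1", "T", [2.0, 0.0]),
    ("s1", "B", [1.0, 1.0]),
    ("s2", "T", [1.0, 4.0]),
    ("s2", "B", [1.0, 3.0]), ("s2", "B", [1.0, 5.0]),
]

def pseudobulks(cells):
    """Average the cell vectors of every (sample, cell_group) pair."""
    grouped = defaultdict(list)
    for sample, group, vec in cells:
        grouped[(sample, group)].append(vec)
    return {
        key: [sum(col) / len(vecs) for col in zip(*vecs)]
        for key, vecs in grouped.items()
    }

def grouped_distance(bulks, a, b):
    """Mean Euclidean distance over cell groups present in both samples.

    Restricting to shared groups is an assumption of this sketch.
    """
    groups_a = {g for s, g in bulks if s == a}
    groups_b = {g for s, g in bulks if s == b}
    shared = sorted(groups_a & groups_b)
    dists = [math.dist(bulks[(a, g)], bulks[(b, g)]) for g in shared]
    return sum(dists) / len(dists)

bulks = pseudobulks(cells)
d = grouped_distance(bulks, "s1", "s2")
```

Here the T-cell pseudobulks are 4 units apart and the B-cell pseudobulks 3 units apart, so `d` is their average.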

GroupedPseudobulk.embed(method='UMAP', n_jobs=-1, verbose=False)#

Embed distances into 2-D coordinates.

Parameters:

distances – Square distance matrix of shape (n_samples, n_samples).
method (str (default: 'UMAP')) – One of "MDS", "TSNE", "UMAP".
n_jobs (int (default: -1)) – Number of parallel threads (-1 = all).
verbose (bool (default: False)) – Print progress information.

Return type:

ndarray

Returns:

coordinates (ndarray) – Array of shape (n_samples, 2).
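To illustrate what embedding a distance matrix into 2-D coordinates means, the sketch below implements classical (Torgerson) MDS with NumPy. It is a stand-in chosen for brevity: it is not UMAP or t-SNE, and the "MDS" option in patpy may rely on a different implementation. The function name `classical_mds` and the toy points are hypothetical.

```python
import numpy as np

def classical_mds(distances, n_components=2):
    """Classical MDS: eigendecomposition of the double-centred squared distances."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)             # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]      # indices of the largest ones
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))   # guard against tiny negatives
    return eigvecs[:, top] * scale

# Four toy samples placed on a unit square, turned into a distance matrix.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(D)  # shape (n_samples, 2)
D_embedded = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the toy distances are exactly Euclidean in two dimensions, the embedded coordinates reproduce the input distances up to rotation and reflection.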

GroupedPseudobulk.evaluate_representation(target, method='knn', metadata=None, num_donors_subset=None, proportion_donors_subset=None, **parameters)#

Evaluate representation of target for the given distance matrix

Parameters:

target ("str") – A sample-level covariate to evaluate representation for
method (Literal['knn', 'distances', 'proportions', 'silhouette', 'persistence', 'permanova'] (default: 'knn')) –
Method to use for evaluation:
- knn: predict values of target using K-nearest neighbors and evaluate the prediction
- distances: test if distances between samples are significantly different from the null distribution
- proportions: test if distribution of target differs between groups (e.g. clusters)
- silhouette: calculate silhouette score for the given distances
num_donors_subset (int, optional) – Absolute number of donors to include in the evaluation.
proportion_donors_subset (float, optional) – Proportion of donors to include in the evaluation.
parameters (dict) –
Parameters for the evaluation method. The following parameters are used:
- knn:
  - n_neighbors: number of neighbors to use for prediction
  - task: type of prediction task. One of “classification”, “regression”, “ranking”. See documentation of predict_knn for more information
- distances:
  - control_level: value of target that should be used as a control group
  - normalization_type: type of normalization to use. One of “total”, “shift”, “var”. See documentation of test_distances_significance for more information
  - n_bootstraps: number of bootstrap iterations to use
  - trimmed_fraction: fraction of the most extreme values to remove from the distribution
  - compare_by_difference: if True, normalization is defined as difference (as in the original paper). Otherwise, it is defined as a ratio
- proportions:
  - groups: groups (e.g. cluster numbers) of the observations

Returns:

result (dict) – Result of evaluation with the following keys:
- score: a number evaluating the representation; higher is better
- metric: name of the metric used for evaluation
- n_unique: number of unique values in target
- n_observations: number of observations used for evaluation. Can differ between targets, even within one dataset (because of NAs)
- method: name of the method used for evaluation

There are other optional keys depending on the method used for evaluation.
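The `knn` method can be pictured as leave-one-out nearest-neighbour prediction on a precomputed distance matrix. The sketch below is an illustration in plain Python, not patpy's code: the function name `knn_loo_evaluate` is hypothetical, and using accuracy as the metric is an assumption (the library may report a different one). Samples with a missing target are skipped, which is why `n_observations` can be smaller than the number of samples.

```python
from collections import Counter

def knn_loo_evaluate(distances, labels, n_neighbors=3):
    """Leave-one-out kNN classification on a square distance matrix."""
    # Keep only samples with a known target value (None stands in for NA).
    known = [i for i, y in enumerate(labels) if y is not None]
    correct = 0
    for i in known:
        # Rank the other labelled samples by their distance to sample i.
        neighbours = sorted((j for j in known if j != i), key=lambda j: distances[i][j])
        votes = Counter(labels[j] for j in neighbours[:n_neighbors])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return {
        "score": correct / len(known),
        "metric": "accuracy",  # assumption: chosen for this sketch only
        "n_unique": len({labels[i] for i in known}),
        "n_observations": len(known),
        "method": "knn",
    }

# Toy distance matrix: samples 0-2 are close to each other, as are samples 3-4.
D = [
    [0, 1, 1, 9, 9],
    [1, 0, 1, 9, 9],
    [1, 1, 0, 9, 9],
    [9, 9, 9, 0, 1],
    [9, 9, 9, 1, 0],
]
labels = ["ctrl", "ctrl", "ctrl", "case", None]  # the last sample has no label
result = knn_loo_evaluate(D, labels, n_neighbors=1)
```

The single labelled "case" sample has no labelled neighbour of its own class, so it is misclassified and the score reflects three correct predictions out of four observations.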

GroupedPseudobulk.fit_linear_probe(target, task='classification', test_size=0.2, random_state=42, test_sample_labels=None, store=False)#

Fit a linear probe on top of sample embeddings.

The probe is a plain sklearn model (Ridge for regression, balanced LogisticRegression for classification) trained on the method’s sample representation. This works for any method that produces a per-sample embedding, including supervised methods whose native head solves a different task (e.g. training a regression probe on top of a classification model such as MixMIL).

Parameters:

target (str) – Column in self.adata.obs to predict.
task (Literal['classification', 'regression'] (default: 'classification')) – "classification" or "regression".
test_size (float (default: 0.2)) – Fraction of donors held out for evaluation when test_sample_labels is not provided.
random_state (int (default: 42)) – Random seed for the train/test split (used only when test_sample_labels is not provided).
test_sample_labels (list | None (default: None)) – Explicit list of sample labels (index values of sample_representation) to use as the test set. When provided, test_size and random_state are ignored. Pass an empty list to train the probe on all samples — useful when fitting a probe that will be applied to a different cohort; the returned metrics are then computed on the train set (see evaluated_on below). When None, a random split is performed and the chosen test labels are stored in test_sample_labels for reproducibility.
store (bool (default: False)) – When True (supervised methods only), register the fitted probe so that predict can reuse it on the current (or a swapped-in) cohort. The probe is saved in self._probes[target] and target is added to self.label_keys / self.tasks if not already present. This is how a regression head is attached to a classification-only model.

Return type:

dict

Returns:

dict – Keys: "model", "test_sample_labels", "evaluated_on", "{target}_test", "{target}_pred".

For classification: additionally "accuracy" and "f1". For regression: additionally "r2", "pearson", "spearman" and "mae".

evaluated_on is "test" when a non-empty test set is used and "train" when the probe was trained on all samples; in the latter case the metrics and "{target}_test"/"{target}_pred" describe the train set.

Examples

>>> result = model.fit_linear_probe(target="age", task="regression")
>>> print(f"Pearson r = {result['pearson']:.3f}")

Attach a regression head to a classification model and predict:

>>> model.fit_linear_probe("age", task="regression", store=True)
>>> ages = model.predict("age")
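The rules for `test_sample_labels` documented for this method can be sketched without any model fitting. The helper `split_samples` below is hypothetical and only mirrors the documented branching: `None` triggers a seeded random split, an explicit list is used as given, and an empty list means the probe is trained and evaluated on all samples. Rounding the test size and keeping at least one test sample are assumptions of this sketch.

```python
import random

def split_samples(sample_labels, test_size=0.2, random_state=42, test_sample_labels=None):
    """Return (train, test, evaluated_on) following the documented rules."""
    if test_sample_labels is None:
        # Random split; the chosen test labels would be kept for reproducibility.
        rng = random.Random(random_state)
        shuffled = list(sample_labels)
        rng.shuffle(shuffled)
        n_test = max(1, round(test_size * len(shuffled)))  # assumption: at least one
        test = shuffled[:n_test]
    else:
        # Explicit test set: test_size and random_state are ignored.
        test = list(test_sample_labels)
    test_set = set(test)
    train = [s for s in sample_labels if s not in test_set]
    evaluated_on = "test" if test else "train"
    return train, test, evaluated_on

samples = [f"donor_{i}" for i in range(10)]

# Empty list: train on every sample, metrics describe the train set.
train_all, test_none, where_all = split_samples(samples, test_sample_labels=[])

# None: seeded random split with 20% of donors held out.
train_rnd, test_rnd, where_rnd = split_samples(samples)
```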

GroupedPseudobulk.plot_clustermap(metadata_cols=None, figsize=(10, 12), *args, **kwargs)#

Plot a hierarchically-clustered heat-map of the distance matrix.

Parameters:

metadata_cols (list[str] or None) – .obs columns to annotate the heat-map.
figsize (tuple)
*args – Passed to calculate_distance_matrix().
**kwargs – Passed to calculate_distance_matrix().

Returns:

seaborn.matrix.ClusterGrid

GroupedPseudobulk.plot_embedding(method='UMAP', metadata_cols=None, continuous_palette='viridis', categorical_palette='tab10', na_color='lightgray', axes=None, use_uns_colors=True, color_key_suffix='_colors', show_legend=True)#

Plot a 2-D embedding of distances, optionally coloured by metadata.

Parameters:

method (str (default: 'UMAP')) – Embedding method. One of "MDS", "TSNE", "UMAP".
metadata_cols (list[str] | None (default: None)) – Columns from .obs used for colouring.
continuous_palette (str (default: 'viridis')) – Seaborn palette name used for continuous metadata.
categorical_palette (str (default: 'tab10')) – Seaborn palette name used for categorical metadata.
na_color (str (default: 'lightgray')) – Colour used for samples with missing metadata values.
axes (default: None) – Existing matplotlib Axes (or array of Axes) to plot into.
use_uns_colors (bool (default: True)) – If True, look for colors in adata.uns[f'{col}{color_key_suffix}'] and use them if available (similar to scanpy).
color_key_suffix (str (default: '_colors')) – Suffix for the color key in adata.uns. Default is "_colors". For example, with suffix "_colors", will look for adata.uns['cell_type_colors'].
show_legend (bool (default: True)) – If True, display the legend. If False, hide it.

Returns:

matplotlib Axes or array of Axes
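The interplay of `use_uns_colors` and `color_key_suffix` can be illustrated with a small lookup helper. This is a sketch under assumptions, not patpy's code: `resolve_colors` is a hypothetical name, a plain dict stands in for `adata.uns`, and the stored colors are assumed to follow the scanpy convention of a list aligned with the column's categories.

```python
def resolve_colors(uns, col, categories, use_uns_colors=True, color_key_suffix="_colors"):
    """Return {category: color} from uns, or None when the caller should use a palette."""
    key = f"{col}{color_key_suffix}"
    if not use_uns_colors or key not in uns:
        return None
    colors = list(uns[key])
    if len(colors) < len(categories):
        return None  # assumption: too few stored colors means falling back
    return dict(zip(categories, colors))

# Stand-in for adata.uns with colors stored for the "condition" column.
uns = {"condition_colors": ["#1f77b4", "#ff7f0e"]}

mapping = resolve_colors(uns, "condition", ["case", "control"])
missing = resolve_colors(uns, "batch", ["a", "b"])  # no "batch_colors" key
disabled = resolve_colors(uns, "condition", ["case", "control"], use_uns_colors=False)
```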

GroupedPseudobulk.plot_metadata_distribution(metadata_columns, tasks, method='knn', embedding='UMAP', metadata=None, metric_threshold=0.4)#

Predict metadata columns, and plot embeddings colorised by metadata values

Parameters:

metadata_columns (list[str]) – List of metadata columns to show
tasks (list[str]) – Tasks for each metadata column (classification, ranking or regression). Can be one string for all columns.
method (Literal['knn', 'distances', 'proportions', 'silhouette', 'persistence', 'permanova'] (default: 'knn')) – Method to use for evaluation. See documentation of evaluate_representation for more information
embedding (str (default: 'UMAP')) – Embedding to use for plotting
metric_threshold (float (default: 0.4)) – Results whose metric value is below this threshold are not displayed

GroupedPseudobulk.predict_metadata(target, metadata=None, n_neighbors=3, task='classification')#

Predict classes from metadata column target for samples using K-Nearest Neighbors classifier

Parameters:

target (str) – Column name from adata.obs, which will be used for classification
metadata (Optional[pd.DataFrame] (default: None)) – Table with metadata about samples. Index should contain samples. If None, adata.obs is used
n_neighbors (int (default: 3)) – Number of neighbors to use for classification
task (str (default: 'classification'))

Returns:

y_true (array-like) – True values of target from metadata for samples with known values
y_predicted (array-like) – Predicted values of target for samples with known values

GroupedPseudobulk.prepare_anndata(adata)#

Prepare adata for analysis.

Calls BaseSampleMethod.prepare_anndata() and checks that the model is not already fitted (to avoid silent re-use of stale state). Subclasses must call super().prepare_anndata(adata) first.

GroupedPseudobulk.to_adata(metadata=None, *args, **kwargs)#

Convert samples data to AnnData object

Parameters:

metadata (DataFrame (default: None)) – Metadata about samples to be added to .obs of AnnData object. Should contain samples in index
*args – Additional arguments to pass to calculate_distance_matrix method
**kwargs – Additional arguments to pass to calculate_distance_matrix method

Returns:

samples_adata (AnnData) – AnnData object with samples data
