patpy.tl.GroupedPseudobulk#
- class patpy.tl.GroupedPseudobulk(sample_key, cell_group_key, layer='X_pca', seed=67)#
Baseline method in which the distance between two samples is the average distance between their cell-group pseudobulks.
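Methods table#
| Method | Description |
|---|---|
| calculate_distance_matrix() | Calculate distances between samples as average distance between per cell-type pseudobulks |
| embed() | Embed distances into 2-D coordinates. |
| evaluate_representation() | Evaluate representation of target for the given distance matrix |
| fit_linear_probe() | Fit a linear probe on top of sample embeddings. |
| plot_clustermap() | Plot a hierarchically-clustered heat-map of the distance matrix. |
| plot_embedding() | Plot a 2-D embedding of distances, optionally coloured by metadata. |
| plot_metadata_distribution() | Predict metadata columns, and plot embeddings colorised by metadata values |
| predict_metadata() | Predict classes from metadata column target for samples using K-Nearest Neighbors classifier |
| prepare_anndata() | Prepare adata for analysis. |
| to_adata() | Convert samples data to AnnData object |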
Methods#
- GroupedPseudobulk.calculate_distance_matrix(force=False, aggregate='mean', dist='euclidean')#
Calculate distances between samples as average distance between per cell-type pseudobulks
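The idea behind this distance can be sketched in plain NumPy. This is an illustration under stated assumptions, not patpy's implementation: the function name `grouped_pseudobulk_distances` is hypothetical, it hard-codes the documented defaults (mean aggregation, Euclidean distance), and it averages only over cell groups present in both samples, which is an assumption about how missing groups are handled.

```python
import numpy as np


def grouped_pseudobulk_distances(X, samples, cell_groups):
    """Average per-cell-group Euclidean distance between sample pseudobulks."""
    X = np.asarray(X, dtype=float)
    samples = np.asarray(samples)
    cell_groups = np.asarray(cell_groups)
    sample_ids = np.unique(samples)
    n = len(sample_ids)
    total = np.zeros((n, n))   # sum of per-group distances for each sample pair
    counts = np.zeros((n, n))  # number of cell groups shared by each sample pair

    for group in np.unique(cell_groups):
        in_group = cell_groups == group
        # Pseudobulk: mean representation of this group's cells in each sample.
        # Rows stay NaN for samples that have no cells in the group.
        bulks = np.full((n, X.shape[1]), np.nan)
        for i, sample in enumerate(sample_ids):
            mask = in_group & (samples == sample)
            if mask.any():
                bulks[i] = X[mask].mean(axis=0)
        diff = bulks[:, None, :] - bulks[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        shared = ~np.isnan(dist)  # assumption: skip groups missing in either sample
        total[shared] += dist[shared]
        counts[shared] += 1

    with np.errstate(invalid="ignore", divide="ignore"):
        return sample_ids, np.where(counts > 0, total / counts, np.nan)


# Toy data: two samples, two cell groups, 2-D cell representations.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [1.0, 5.0], [0.0, 0.0], [0.0, 2.0]])
samples = ["s1", "s1", "s2", "s2", "s1", "s2"]
cell_groups = ["T", "T", "T", "T", "B", "B"]
sample_ids, D = grouped_pseudobulk_distances(X, samples, cell_groups)
print(sample_ids, D)
```

In the toy data the "T" pseudobulks are 4 apart and the "B" pseudobulks are 2 apart, so the distance between the two samples is their average.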
- GroupedPseudobulk.embed(method='UMAP', n_jobs=-1, verbose=False)#
Embed distances into 2-D coordinates.
- Return type:
ndarray
- Returns:
coordinates (ndarray) – Array of shape (n_samples, 2).
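To illustrate what embedding a distance matrix into 2-D coordinates means, here is a minimal sketch of classical (Torgerson) multidimensional scaling in NumPy. It is not necessarily the algorithm `embed` runs for any of its `method` values; the function name `classical_mds` is hypothetical and chosen for this example only.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed a symmetric distance matrix into n_components dimensions."""
    D = np.asarray(distances, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Four points on a 3 x 4 rectangle; their distance matrix is exactly Euclidean.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D_in = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coordinates = classical_mds(D_in)
D_out = np.linalg.norm(coordinates[:, None, :] - coordinates[None, :, :], axis=-1)
print(coordinates.shape)
```

The recovered coordinates match the original points only up to rotation, reflection and translation, so the comparison is made on pairwise distances rather than on the coordinates themselves.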
- GroupedPseudobulk.evaluate_representation(target, method='knn', metadata=None, num_donors_subset=None, proportion_donors_subset=None, **parameters)#
Evaluate representation of target for the given distance matrix.
- Parameters:
target (str) – A sample-level covariate to evaluate representation for
method (Literal['knn', 'distances', 'proportions', 'silhouette', 'persistence', 'permanova'] (default: 'knn')) – Method to use for evaluation:
  knn: predict values of target using K-nearest neighbors and evaluate the prediction
  distances: test if distances between samples are significantly different from the null distribution
  proportions: test if distribution of target differs between groups (e.g. clusters)
  silhouette: calculate silhouette score for the given distances
num_donors_subset (int, optional) – Absolute number of donors to include in the evaluation.
proportion_donors_subset (float, optional) – Proportion of donors to include in the evaluation.
parameters (dict) – Parameters for the evaluation method. The following parameters are used:
- knn:
  n_neighbors: number of neighbors to use for prediction
  task: type of prediction task. One of “classification”, “regression”, “ranking”. See documentation of predict_knn for more information
- distances:
  control_level: value of target that should be used as a control group
  normalization_type: type of normalization to use. One of “total”, “shift”, “var”. See documentation of test_distances_significance for more information
  n_bootstraps: number of bootstrap iterations to use
  trimmed_fraction: fraction of the most extreme values to remove from the distribution
  compare_by_difference: if True, normalization is defined as difference (as in the original paper). Otherwise, it is defined as a ratio
- proportions:
  groups: groups (e.g. cluster numbers) of the observations
- Returns:
result (dict) – Result of evaluation with the following keys:
  score: a number evaluating the representation. The higher the better
  metric: name of the metric used for evaluation
  n_unique: number of unique values in target
  n_observations: number of observations used for evaluation. Can be different for different targets, even within one dataset (because of NAs)
  method: name of the method used for evaluation
There are other optional keys depending on the method used for evaluation.
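As a rough illustration of the silhouette option, the sketch below computes a mean silhouette score directly from a precomputed distance matrix and packs it into a dictionary that mirrors the documented keys. Everything here is an assumption made for the example: the helper name `silhouette_from_distances`, the value stored under `metric`, and the way the score is computed are not taken from patpy.

```python
import numpy as np


def silhouette_from_distances(D, labels):
    """Mean silhouette score from a precomputed distance matrix.

    Requires at least two distinct labels.
    """
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # singleton group: score stays 0 by convention
        a = D[i, same].mean()  # mean distance to own group
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


# Two well-separated groups on a line.
positions = np.array([0.0, 1.0, 10.0, 11.0])
D_samples = np.abs(positions[:, None] - positions[None, :])
labels = ["a", "a", "b", "b"]

# Hypothetical result dictionary shaped like the documented return value.
result = {
    "score": silhouette_from_distances(D_samples, labels),
    "metric": "silhouette",
    "n_unique": len(set(labels)),
    "n_observations": len(labels),
    "method": "silhouette",
}
print(result)
```

Because the two groups are far apart relative to their internal spread, the score is close to 1, consistent with "the higher the better".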
- GroupedPseudobulk.fit_linear_probe(target, task='classification', test_size=0.2, random_state=42, test_sample_labels=None)#
Fit a linear probe on top of sample embeddings.
- Parameters:
target (str) – Column in self.adata.obs to predict.
task (Literal['classification', 'regression'] (default: 'classification')) – "classification" or "regression".
test_size (float (default: 0.2)) – Fraction of donors held out for evaluation when test_sample_labels is not provided.
random_state (int (default: 42)) – Random seed for the train/test split (used only when test_sample_labels is not provided).
test_sample_labels (list | None (default: None)) – Explicit list of sample labels (index values of sample_representation) to use as the test set. When provided, test_size and random_state are ignored. When None, a random split is performed and the chosen test labels are stored in test_sample_labels for reproducibility.
- Return type:
dict
- Returns:
dict – Keys: "model", "test_sample_labels", "{target}_test", "{target}_pred". For classification: additionally "accuracy" and "f1". For regression: additionally "r2" and "pearson".
Examples
>>> result = model.fit_linear_probe(target="age", task="regression")
>>> print(f"Pearson r = {result['pearson']:.3f}")
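The regression case of a linear probe with a held-out donor split can be sketched as follows. This is a standalone approximation, not the library's code: `linear_probe_regression` is a hypothetical helper, it fits ordinary least squares with NumPy rather than whatever estimator patpy uses, and it omits the "model" key. It does show how passing `test_sample_labels` back in reproduces the same split.

```python
import numpy as np


def linear_probe_regression(embeddings, y, sample_labels, target="age",
                            test_size=0.2, random_state=42, test_sample_labels=None):
    """Least-squares probe evaluated on held-out samples."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(y, dtype=float)
    sample_labels = np.asarray(sample_labels)
    if test_sample_labels is None:
        # Random split, used only when no explicit test set is given.
        rng = np.random.default_rng(random_state)
        n_test = max(1, int(round(test_size * len(sample_labels))))
        test_sample_labels = rng.choice(sample_labels, size=n_test, replace=False)
    is_test = np.isin(sample_labels, test_sample_labels)

    A = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(A[~is_test], y[~is_test], rcond=None)
    y_test, y_pred = y[is_test], A[is_test] @ coef

    ss_res = ((y_test - y_pred) ** 2).sum()
    ss_tot = ((y_test - y_test.mean()) ** 2).sum()
    return {
        "test_sample_labels": list(test_sample_labels),
        f"{target}_test": y_test,
        f"{target}_pred": y_pred,
        "r2": float(1.0 - ss_res / ss_tot),
        "pearson": float(np.corrcoef(y_test, y_pred)[0, 1]),
    }


# Synthetic sample embeddings with a nearly linear relationship to the target.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 3))
age = embeddings @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.01, size=30)
donors = [f"donor{i}" for i in range(30)]

probe = linear_probe_regression(embeddings, age, donors)
# Re-using the stored labels reproduces the same test set.
repeat = linear_probe_regression(embeddings, age, donors,
                                 test_sample_labels=probe["test_sample_labels"])
print(f"Pearson r = {probe['pearson']:.3f}")
```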
- GroupedPseudobulk.plot_clustermap(metadata_cols=None, figsize=(10, 12), *args, **kwargs)#
Plot a hierarchically-clustered heat-map of the distance matrix.
- Parameters:
metadata_cols (list[str] or None) – .obs columns to annotate the heat-map.
figsize (tuple)
*args – Passed to calculate_distance_matrix().
**kwargs – Passed to calculate_distance_matrix().
- Returns:
seaborn.matrix.ClusterGrid
- GroupedPseudobulk.plot_embedding(method='UMAP', metadata_cols=None, continuous_palette='viridis', categorical_palette='tab10', na_color='lightgray', axes=None, use_uns_colors=True, color_key_suffix='_colors', show_legend=True)#
Plot a 2-D embedding of distances, optionally coloured by metadata.
- Parameters:
method (str (default: 'UMAP')) – Embedding method. One of "MDS", "TSNE", "UMAP".
metadata_cols (list[str] | None (default: None)) – Columns from .obs used for colouring.
continuous_palette (str (default: 'viridis')) – Seaborn palette name for continuous metadata.
categorical_palette (str (default: 'tab10')) – Seaborn palette name for categorical metadata.
na_color (str (default: 'lightgray')) – Colour used for samples with missing metadata values.
axes (default: None) – Existing matplotlib Axes (or array of Axes) to plot into.
use_uns_colors (bool (default: True)) – If True, look for colors in adata.uns[f'{col}{color_key_suffix}'] and use them if available (similar to scanpy).
color_key_suffix (str (default: '_colors')) – Suffix for the color key in adata.uns. For example, with suffix "_colors", will look for adata.uns['cell_type_colors'].
show_legend (bool (default: True)) – If True, display the legend. If False, hide it.
- Returns:
matplotlib Axes or array of Axes
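The colour lookup described for use_uns_colors and color_key_suffix can be pictured with a plain dictionary standing in for adata.uns. This is a sketch of the lookup pattern only: the helper `resolve_category_colors`, its fallback palette, and the rule for when stored colours are usable are assumptions for illustration, following the scanpy convention that the stored list is aligned with the category order.

```python
def resolve_category_colors(uns, col, categories, color_key_suffix="_colors",
                            fallback=("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")):
    """Map categories to colours, preferring colours stored under uns[col + suffix]."""
    key = f"{col}{color_key_suffix}"
    stored = uns.get(key)
    # Assumption: stored colours are used only if there is one per category.
    if stored is not None and len(stored) >= len(categories):
        return dict(zip(categories, stored))
    # Otherwise cycle through the fallback palette.
    return {cat: fallback[i % len(fallback)] for i, cat in enumerate(categories)}


uns = {"cell_type_colors": ["#ff0000", "#00ff00"]}  # stand-in for adata.uns
from_uns = resolve_category_colors(uns, "cell_type", ["B", "T"])
from_fallback = resolve_category_colors(uns, "condition", ["case", "control"])
print(from_uns, from_fallback)
```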
- GroupedPseudobulk.plot_metadata_distribution(metadata_columns, tasks, method='knn', embedding='UMAP', metadata=None, metric_threshold=0.4)#
Predict metadata columns, and plot embeddings colorised by metadata values
- Parameters:
metadata_columns (list[str]) – List of metadata columns to show
tasks (list[str]) – Tasks for each metadata column (classification, ranking or regression). Can be one string for all columns.
method (Literal['knn', 'distances', 'proportions', 'silhouette', 'persistence', 'permanova'] (default: 'knn')) – Method to use for evaluation. See documentation of evaluate_representation for more information
embedding (str (default: 'UMAP')) – Embedding to use for plotting
metric_threshold (float (default: 0.4)) – Results whose metric value is below this threshold are not displayed
- GroupedPseudobulk.predict_metadata(target, metadata=None, n_neighbors=3, task='classification')#
Predict classes from metadata column target for samples using K-Nearest Neighbors classifier
- Parameters:
target (str) – Column name from adata.obs, which will be used for classification
metadata (Optional[pd.DataFrame] = None) – Table with metadata about samples. Index should contain samples. If None, adata.obs is used
n_neighbors (int (default: 3)) – Number of neighbors to use for classification
task (str = "classification")
- Returns:
y_true (array-like) – True values of target from metadata for samples with known values
y_predicted (array-like) – Predicted values of target for samples with known values
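A K-nearest-neighbours prediction on a precomputed distance matrix can be sketched as below. The helper `knn_predict_from_distances` is hypothetical, and two details are assumptions rather than documented behaviour: each sample is predicted from the other samples only (leave-one-out), and ties are broken by first occurrence. As in the documented return value, samples with an unknown target are dropped from both outputs.

```python
from collections import Counter

import numpy as np


def knn_predict_from_distances(D, target, n_neighbors=3):
    """Leave-one-out majority vote among the nearest samples with a known target."""
    D = np.asarray(D, dtype=float)
    known = [i for i, value in enumerate(target) if value is not None]
    y_true, y_predicted = [], []
    for i in known:
        # Candidate neighbours: every other sample with a known target value.
        others = [j for j in known if j != i]
        nearest = sorted(others, key=lambda j: D[i, j])[:n_neighbors]
        votes = Counter(target[j] for j in nearest)
        y_true.append(target[i])
        y_predicted.append(votes.most_common(1)[0][0])
    return y_true, y_predicted


# Seven samples on a line; the last one has no known target value.
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 5.0])
D_knn = np.abs(positions[:, None] - positions[None, :])
target = ["ctrl", "ctrl", "ctrl", "case", "case", "case", None]

y_true, y_predicted = knn_predict_from_distances(D_knn, target, n_neighbors=3)
print(y_true, y_predicted)
```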
- GroupedPseudobulk.prepare_anndata(adata)#
Prepare adata for analysis.
Calls BaseSampleMethod.prepare_anndata() and checks that the model is not already fitted (to avoid silent re-use of stale state). Subclasses must call super().prepare_anndata(adata) first.
- GroupedPseudobulk.to_adata(metadata=None, *args, **kwargs)#
Convert samples data to AnnData object
- Parameters:
metadata (DataFrame (default: None)) – Metadata about samples to be added to .obs of AnnData object. Should contain samples in index
*args – Additional arguments to pass to calculate_distance_matrix method
**kwargs – Additional arguments to pass to calculate_distance_matrix method
- Returns:
samples_adata (AnnData) – AnnData object with samples data