Understanding sources of variation in single-cell data with GloScope

Understanding sources of variation in single-cell data with GloScope#

What explains differences between samples in your data? A disease condition of patients or maybe batch effects? In this tutorial, we will use patpy to build a sample representation from single-cell transcriptomics data and evaluate the sources of variation in it.

We will use GloScope, a density-based sample representation method. It treats samples as point clouds, where each point is a cell. To understand how dissimilar samples are, GloScope computes divergence between these point clouds. Intuitively, if some cells states are present in every sample, they will have a little contribution to differences between samples. On the other hand, sample-specific cell states will have a high contribution. You can find more details about GloScope in the original publication.

Gloscope has shown the best performance among methods in the SPARE sample representation benchmark. It captures clinical information well and uncovers disease trajectories as demonstrated in our patient trajectory analysis tutorial. In this notebook, we’ll run our Python reimplementation of GloScope. We’ll see that it gives identical results to the original R implemetation, but is orders of magnitude faster.

1. Setup#

There are 2 possible ways to run GloScope in python: with GPU and without. While both are more efficient than R version, GPU uses accelerated engine for mathematical operations and performs particularly rapidly. But all good things come at a cost: to use GPU you must have it in your computer and install additional packages. If this works for you, follow the installation steps below. For running CPU version, jump to step 1.3, all the required dependencies already come with patpy.

1.1 Option 1 (with conda)#

Download the environment file from here or manually create a file rapids_singlecell.yaml with the following content:

name: rapids_singlecell
channels:
  - rapidsai
  - nvidia
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - anndata=0.11
  - scanpy=1.11
  - leidenalg
  - louvain
  - zarr
  - cuda-version=12
  - rapids=24.06
  - scikit-learn<1.6 # required by dask-ml
  - scipy<1.15
  - dask
  - dask-ml
  - sparse
  - harmonypy
  - pytorch
  - pip
  - pip:
    - rapids-singlecell
    - scikit-learn<1.6

Create a new conda environment based on this file:

conda env create -f rapids_singlecell.yaml

Activate the environment:

conda activate rapids_singlecell

Install ipykernel to be able to use this environment in a jupyter notebook:

conda install ipykernel

1.2 Option 2 (conda-free). Install the RAPIDS package#

You can use the script provided by rapidsai developers. Activate a desired environment, and run the following:

git clone https://github.com/rapidsai/rapidsai-csp-utils.git
python rapidsai-csp-utils/colab/pip-install.py

Run this cell to enable consistent usage of array API:

import os
os.environ["SCIPY_ARRAY_API"] = "1"

1.3 Install the patpy package#

For latest released version, simply run:

pip install patpy

To get a newer development version (might be unstable), install from github:

pip install git+https://github.com/lueckenlab/patpy.git@main

1.4 Import packages#

import pandas as pd
import numpy as np
import scanpy as sc
import patpy
import seaborn as sns
import matplotlib.pyplot as plt

import random

random.seed(42)
np.random.seed(42)

patpy.__version__

'0.11.2'

2. Data Preparation#

2.1 Read the data#

Here, we use COMBAT dataset. This dataset contains 783k cells (after some additional preprocessing) from 138 COVID-19 patients and healthy donors.

ADATA_PATH = "/home/icb/vladimir.shitov/projects/vladimir.shitov/2023_05_patient_representation_benchmark/reproducibility/pat_rep_benchmark/data/combat/combat_processed.h5ad"

adata = sc.read_h5ad(ADATA_PATH)
adata

AnnData object with n_obs × n_vars = 783677 × 3000
    obs: 'Annotation_cluster_id', 'Annotation_cluster_name', 'Annotation_minor_subset', 'Annotation_major_subset', 'Annotation_cell_type', 'GEX_region', 'QC_ngenes', 'QC_total_UMI', 'QC_pct_mitochondrial', 'QC_scrub_doublet_scores', 'TCR_chain_composition', 'TCR_clone_ID', 'TCR_clone_count', 'TCR_clone_proportion', 'TCR_contains_unproductive', 'TCR_doublet', 'TCR_chain_TRA', 'TCR_v_gene_TRA', 'TCR_d_gene_TRA', 'TCR_j_gene_TRA', 'TCR_c_gene_TRA', 'TCR_productive_TRA', 'TCR_cdr3_TRA', 'TCR_umis_TRA', 'TCR_chain_TRA2', 'TCR_v_gene_TRA2', 'TCR_d_gene_TRA2', 'TCR_j_gene_TRA2', 'TCR_c_gene_TRA2', 'TCR_productive_TRA2', 'TCR_cdr3_TRA2', 'TCR_umis_TRA2', 'TCR_chain_TRB', 'TCR_v_gene_TRB', 'TCR_d_gene_TRB', 'TCR_j_gene_TRB', 'TCR_c_gene_TRB', 'TCR_productive_TRB', 'TCR_chain_TRB2', 'TCR_v_gene_TRB2', 'TCR_d_gene_TRB2', 'TCR_j_gene_TRB2', 'TCR_c_gene_TRB2', 'TCR_productive_TRB2', 'TCR_cdr3_TRB2', 'TCR_umis_TRB2', 'BCR_umis_HC', 'BCR_contig_qc_HC', 'BCR_functionality_HC', 'BCR_v_call_HC', 'BCR_v_score_HC', 'BCR_j_call_HC', 'BCR_j_score_HC', 'BCR_junction_aa_HC', 'BCR_total_mut_HC', 'BCR_s_mut_HC', 'BCR_r_mut_HC', 'BCR_c_gene_HC', 'BCR_clone_per_replicate_HC', 'BCR_clone_global_HC', 'BCR_clonal_abundance_HC', 'BCR_locus_LC', 'BCR_umis_LC', 'BCR_contig_qc_LC', 'BCR_functionality_LC', 'BCR_v_call_LC', 'BCR_v_score_LC', 'BCR_j_call_LC', 'BCR_j_score_LC', 'BCR_junction_aa_LC', 'BCR_total_mut_LC', 'BCR_s_mut_LC', 'BCR_r_mut_LC', 'BCR_c_gene_LC', 'COMBAT_ID', 'scRNASeq_sample_ID', 'COMBAT_participant_timepoint_ID', 'Source', 'Age', 'Sex', 'Race', 'BMI', 'Hospitalstay', 'Death28', 'Institute', 'PreExistingHeartDisease', 'PreExistingLungDisease', 'PreExistingKidneyDisease', 'PreExistingDiabetes', 'PreExistingHypertension', 'PreExistingImmunocompromised', 'Smoking', 'Symptomatic', 'Requiredvasoactive', 'Respiratorysupport', 'SARSCoV2PCR', 'Outcome', 'TimeSinceOnset', 'Ethnicity', 'Tissue', 'DiseaseClassification', 'Pool_ID', 'Channel_ID', 'ifn_1_score', '_scvi_batch', '_scvi_labels', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'vertex', 'eigenvector_centrality'
    var: 'gene_ids', 'feature_types', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'Institute', 'ObjectCreateDate', 'Source_colors', 'Technology', 'X_gloscope_cuml_distances', 'X_gloscope_pynndescent_distances', 'X_scpoli', '_scvi_manager_uuid', '_scvi_uuid', 'genome_annotation_version', 'gloscope_representation', 'gloscope_scpoli_distances', 'hvg', 'log1p', 'neighbors', 'pca', 'scpoli_distances', 'scpoli_parameters', 'scpoli_samples'
    obsm: 'X_pca', 'X_scANVI_batch', 'X_scANVI_sample', 'X_scVI_batch', 'X_scVI_sample', 'X_scpoli', 'X_umap', 'X_umap_source'
    varm: 'PCs'
    layers: 'X_raw_counts'
    obsp: 'connectivities', 'distances'

2.2 Set columns containing sample IDs, cell types and metadata#

For convenience, let’s store columns contatining patient-level information to python variables:

sample_id_col = "scRNASeq_sample_ID"

# Columns from adata.obs containing metadata about samples
samples_metadata_cols = ["Source", "Outcome", "Death28", "Hospitalstay", "TimeSinceOnset", "Institute", "Pool_ID",
                         "Age", "Sex", "BMI", "PreExistingHeartDisease", "PreExistingLungDisease", "PreExistingDiabetes",
                         "PreExistingHypertension", "Smoking", "Requiredvasoactive", "SARSCoV2PCR"]

# Columns we will use for plotting
cols_to_plot = ["Source", "Outcome", "Death28", "Institute", "Pool_ID"]

Additionally, let’s set a layer that we will use as cell representations. It must be "X" or one of the keys in either .layers or .obsm. Here, we will use scVI embedding:

layer = "X_scVI_batch"

2.3 Store metadata and calculate QC metrics#

To evaluate sample representation, let’s extract a sample metadata table from .obs data frame in annotated data object:

metadata = adata.obs[samples_metadata_cols + [sample_id_col]].drop_duplicates()
metadata.set_index(sample_id_col, inplace=True)
metadata

	Source	Outcome	Death28	Hospitalstay	TimeSinceOnset	Institute	Pool_ID	Age	Sex	BMI	PreExistingHeartDisease	PreExistingLungDisease	PreExistingDiabetes	PreExistingHypertension	Smoking	Requiredvasoactive	SARSCoV2PCR
scRNASeq_sample_ID
S00109-Ja001E-PBCa	COVID_SEV	2.0	0	1.0	12.0	Oxford	gPlexA	5.0	1	5	0.0	0.0	0.0	1.0	0	1.0	1
S00112-Ja003E-PBCa	COVID_MILD	5.0	0	2.0	12.0	Oxford	gPlexA	5.0	1	5	0.0	0.0	0.0	0.0	0	0.0	1
S00005-Ja005E-PBCa	COVID_CRIT	2.0	0	4.0	14.0	Oxford	gPlexA	7.0	0	1	1.0	0.0	0.0	1.0	0	1.0	1
S00061-Ja003E-PBCa	COVID_SEV	4.0	0	2.0	6.0	Oxford	gPlexA	5.0	1	5	0.0	0.0	0.0	0.0	0	0.0	1
S00056-Ja003E-PBCa	COVID_SEV	3.0	0	3.0	23.0	Oxford	gPlexA	7.0	0	5	0.0	0.0	0.0	0.0	0	0.0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
S00065-Ja003E-PBCa	COVID_CRIT	2.0	0	3.0	16.0	Oxford	gPlexK	5.0	0	5	0.0	0.0	1.0	0.0	0	1.0	1
S00048-Ja003E-PBCa	COVID_SEV	4.0	0	2.0	12.0	Oxford	gPlexK	7.0	1	3	0.0	0.0	0.0	1.0	0	0.0	1
G05112-Ja005E-PBCa	COVID_HCW_MILD	6.0	0	NaN	7.0	Oxford	gPlexK	4.0	0	5	NaN	NaN	NaN	NaN	0	NaN	1
N00038-Ja001E-PBGa	Sepsis	NaN	0	0.0	7.0	Oxford	gPlexK	4.0	1	5	0.0	0.0	0.0	0.0	0	0.0	0
U00501-Ua005E-PBUa	Flu	2.0	0	NaN	11.0	St_Georges	gPlexK	6.0	1	1	NaN	NaN	0.0	0.0	0	1.0	0

138 rows × 17 columns

Let’s further add quality control information to evaluate how much representations are affected by number of genes, mRNA counts and percentage of mitochondrial genes. We’ll aggregate this cell-level QC metric using a median for every sample:

cell_qc_metadata = patpy.pp.calculate_cell_qc_metrics(
    adata, sample_key=sample_id_col, cell_qc_vars=["QC_ngenes", "QC_total_UMI", "QC_pct_mitochondrial"]
)
cell_qc_metadata.columns = ["median_n_genes_by_counts", "median_total_counts", "median_pct_counts_mt"]
cell_qc_metadata

	median_n_genes_by_counts	median_total_counts	median_pct_counts_mt
scRNASeq_sample_ID
G05061-Ja005E-PBCa	1107.0	3125.5	3.011159
G05064-Ja005E-PBCa	975.0	2933.0	1.332430
G05073-Ja005E-PBCa	1141.0	3554.0	2.422559
G05077-Ja005E-PBCa	1125.0	3268.0	2.946723
G05078-Ja005E-PBCa	999.0	3106.0	2.825308
...	...	...	...
U00607-Ua005E-PBUa	1827.0	5749.0	2.982509
U00613-Ua005E-PBUa	1251.5	3325.0	2.053083
U00617-Ua005E-PBUa	1410.5	4524.0	3.886215
U00619-Ua005E-PBUa	1532.0	4105.0	2.688350
U00701-Ua005E-PBUa	1093.5	3074.5	3.714650

138 rows × 3 columns

n_genes_metadata = patpy.pp.calculate_n_cells_per_sample(adata, sample_id_col)
n_genes_metadata

	n_cells
scRNASeq_sample_ID
S00052-Ja005E-PBCa	13918
H00054-Ha001E-PBGa	10938
H00067-Ha001E-PBGa	10781
N00023-Ja001E-PBGa	10484
H00053-Ha001E-PBGa	10458
...	...
U00607-Ua005E-PBUa	1021
U00613-Ua005E-PBUa	970
U00701-Ua005E-PBUa	872
U00601-Ua005E-PBUa	619
U00504-Ua005E-PBUa	161

138 rows × 1 columns

Combine all the tables into one:

metadata = pd.concat(
    [
        metadata,
        cell_qc_metadata.loc[metadata.index],
        n_genes_metadata.loc[metadata.index],
    ],
    axis=1,
)

metadata.shape

(138, 21)

Merge aggregated QC to cell metadata:

adata.obs = adata.obs.merge(metadata.drop(columns=samples_metadata_cols), left_on=sample_id_col, right_index=True)

2.4 Quality control#

To reduce noise in the representations, we need to remove samples with too few cells:

adata = patpy.pp.filter_small_samples(adata, sample_key=sample_id_col, sample_size_threshold=500)

1 samples removed: U00504-Ua005E-PBUa

3. Running GloScope in Python#

The nearest neighbor computation for the distance matrix calculation can either be done using a CPU or a GPU. By setting the parameter use_gpu=True the matrix will be calculated using either cuML or PyNNDescent.

4.1 Run the GloScope with GPU acceleration#

Using patpy, we can run sample representation methods with a simple interface:

gloscope_py = patpy.tl.GloScope_py(sample_key=sample_id_col, layer=layer, k=25, use_gpu=True)
gloscope_py.prepare_anndata(adata)

4.2 Calculate the distance matrix between samples#

GloScope does not create a vector representation of samples, but instead gives a pairwise KL-divergence matrix. The smaller the values, the more similar samples are. GloScope does not produce “distances” in mathematical sense, but we can treat the divergencies like it. Even with that caveat, it captures more information than other sample representation methods.

%%time
distances_py = gloscope_py.calculate_distance_matrix(force=True)
distances_py

CPU times: user 21.4 s, sys: 3.54 s, total: 25 s
Wall time: 2min 14s

array([[0.        , 2.98802018, 4.96645102, ..., 3.36136601, 8.29764793,
        4.77171363],
       [2.98802018, 0.        , 4.58604968, ..., 4.24508584, 7.28937722,
        2.99966394],
       [4.96645102, 4.58604968, 0.        , ..., 5.72771941, 7.25859998,
        3.11516585],
       ...,
       [3.36136601, 4.24508584, 5.72771941, ..., 0.        , 7.92112695,
        5.27670011],
       [8.29764793, 7.28937722, 7.25859998, ..., 7.92112695, 0.        ,
        4.89489186],
       [4.77171363, 2.99966394, 3.11516585, ..., 5.27670011, 4.89489186,
        0.        ]], shape=(137, 137))

Our implementation only takes about two minutes for the whole COMBAT data set when using GPU! Perfect time for you to go for another cup of coffee, but not long enough to make you feel bored.

4.3 Visualise the sample representation using UMAP#

But does GloScope capture information about samples? Let’s build a UMAP plot to get an idea. On this plot, similar patients are localised together:

gloscope_py.plot_embedding(method="UMAP", metadata_cols=cols_to_plot);

../../_images/35bb5109a0a6e95236f01eca5738449de246f90846db617a310fa599d146a996.png

We can see that patients with similar “Source” and “Outcome” are grouped together. The latter reflects the severity of the disease: from 1, which means death, to 6, which encodes healthy donors. “Death28” column encodes whether patients died in the period of 28 days. The last 2 columns represent locations where samples were processed and the batches. The visualisation suggests that the batches mix with each other and explain the embedding much worse than the Outcome. We will further validate this finding numerically.

4. Evaluating sources of variation in the data#

While somebody is already writing a wrathful twitter thread on our interpretation of UMAPs, let’s compute some metrics to understand what explains sample-level variation in the data. We need to classify our sample-level covariates (columns in the metadata) to clinically relevant, technical and contextual. The former 2 categories can be used to judge whether sample representation method is working well for our analysis goals. The “contextual” category contains columns, which we don’t aim to analyse, but may want to know whether they impact the representation.

Furthermore, we need to define prediction tasks for every metadata columns. They must be one of:

“classification” – for discrete labels such as disease status
“regression” – for continuous covariates such as number of hospital stay days
“ranking” – for ranked variables. For example, “Outcome” in our dataset is encoded as numbers from 1 to 6. In this case, if a sample has label “6” (healthy), predicting “5” (hospitalised, no extra oxigen) is better than “3” (non-invasive ventilation).

We will predict each covariate based on the nearest neighbors (the most similar samples) in the representation. Setting the task will define metrics used to evaluate quality of the prediction. For classification, it is F1-macro score (corrected for random prediction), and for other tasks Spearman correlation is used.

benchmark_schema = {
    "relevant": {
        "Source": "classification",
        "Outcome": "ranking",
        "Death28": "classification",
        "Hospitalstay": "regression",
        "TimeSinceOnset": "regression"
    },
    "technical": {
        "Institute": "classification",
        "Pool_ID": "classification",
        "median_n_genes_by_counts": "regression",
        "median_total_counts": "regression",
        "median_pct_counts_mt": "regression",
        "n_cells": "regression"
    },
    "contextual": {
        "Age": "regression",
        "Sex": "classification",
        "BMI": "regression",
        "PreExistingHeartDisease": "classification",
        "PreExistingLungDisease": "classification",
        "PreExistingDiabetes": "classification",
        "PreExistingHypertension": "classification",
        "Smoking": "classification",
        "Requiredvasoactive": "classification",
        "SARSCoV2PCR": "classification"
    }
}

4.1 Run the covariate prediction#

metadata = metadata.loc[gloscope_py.samples]  # Ensure the order matches the distance matrix

results = []

for type_name, type_data in benchmark_schema.items():
    for target, task in type_data.items():
        result = patpy.tl.evaluate_representation(
            distances=distances_py,
            target=metadata[target],
            method="knn",
            task=task,
            n_neighbors=5
        )

        result["covariate"] = target
        result["covariate_type"] = type_name

        if result["metric"] == "spearman_r":
            result["score"] = abs(result["score"])

        results.append(result)

We can order covariates based on how well they can be predicted from a representation. It gives us an insight on what explains variability in the data:

knn_results = pd.DataFrame(results)
knn_results.sort_values(by="score", ascending=False)

	score	metric	n_unique	n_observations	method	covariate	covariate_type
9	0.745459	spearman_r	137	137	knn	median_pct_counts_mt	technical
7	0.631919	spearman_r	118	137	knn	median_n_genes_by_counts	technical
1	0.626460	spearman_r	6	112	knn	Outcome	relevant
20	0.599415	f1_macro_calibrated	2	137	knn	SARSCoV2PCR	contextual
8	0.585656	spearman_r	131	137	knn	median_total_counts	technical
15	0.549561	f1_macro_calibrated	2	99	knn	PreExistingLungDisease	contextual
4	0.423161	spearman_r	26	127	knn	TimeSinceOnset	relevant
2	0.421534	f1_macro_calibrated	2	137	knn	Death28	relevant
10	0.395705	spearman_r	137	137	knn	n_cells	technical
13	0.317015	spearman_r	6	137	knn	BMI	contextual
0	0.314356	f1_macro_calibrated	8	137	knn	Source	relevant
5	0.286458	f1_macro_calibrated	2	137	knn	Institute	technical
11	0.275583	spearman_r	8	137	knn	Age	contextual
3	0.260062	spearman_r	14	102	knn	Hospitalstay	relevant
14	0.259044	f1_macro_calibrated	2	99	knn	PreExistingHeartDisease	contextual
12	0.113460	f1_macro_calibrated	2	137	knn	Sex	contextual
6	0.078434	f1_macro_calibrated	10	137	knn	Pool_ID	technical
18	0.031751	f1_macro_calibrated	3	137	knn	Smoking	contextual
16	0.003726	f1_macro_calibrated	2	112	knn	PreExistingDiabetes	contextual
17	0.000000	f1_macro_calibrated	2	112	knn	PreExistingHypertension	contextual
19	0.000000	f1_macro_calibrated	2	115	knn	Requiredvasoactive	contextual

4.2 Visualise the results#

Let’s build a plot based on the table above for a nice visual comparision:

plt.figure(figsize=(10, 8), dpi=300)
plt.rcParams.update({
    'font.size': 14,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10
})

sns.barplot(
    data=knn_results,
    y="covariate",
    x="score",
    hue="covariate_type",
    dodge=False,
    palette="colorblind",
    order=knn_results.sort_values('score', ascending = False).covariate
)

plt.xlabel("KNN prediction score")
plt.ylabel("Covariate")

plt.legend(title="Covariate Type")
plt.title("Sources of variation in the COMBAT dataset", fontweight='bold')

sns.despine()
plt.tight_layout()
plt.show()

../../_images/425219a522b729b611905ce51155f2429c9f9f7ba7f330f58fc8ab5afc3973eb.png

We can see that median percentage of mitochondrial counts and median number of genes can be predicted very well from a representation. This is rather bad news, as these covariates are technical. On the bright side, the next best-predicted covariate is the “Outcome”, indicating that disease outcome effects the transcriptomics in blood! Let’s visualise these covariates on the representation:

well_represented_cols = ["Outcome", "median_pct_counts_mt", "median_n_genes_by_counts"]

axes = gloscope_py.plot_embedding(method="UMAP", metadata_cols=well_represented_cols)

for ax, col in zip(axes, well_represented_cols):
    ax.legend(loc=(0, 1.02), title=col)

../../_images/3316de5ecd775bfbc3e85d5e9b8c79d59ea6ef504593d5df720174470dc896c9.png

As we previously seen, outcome has a clear grouping on the UMAP. We can also see a notable gradient of median percentage of mitochondrial gene counts and number of genes in a sample.

Another insight we can learn from this sample representation is what doesn’t affect the data. For example, pre-existing diabetes and hypertension cannot be predicted well from the data we have. That can mean the following:

These are not the main axes of variation in the data. Perhaps, they do affect the transcriptomics, but this effect is hidden behind stronger sources of variation. If you are interested in analysis of those covariates, a more targeted approach is needed.
These covariates do not affect the data at all. There are limitations of what we can predict from transcriptomics. Perhaps, these information is simply beyond them.
Metadata is too sparse. We don’t have too many samples with positive values of these covariates. Larger datasets could uncover them better.

In any case, sample representation allows us to immediately see what the data hides.

Note another interesting phenomenon: smoking status can be predicted worse than BMI. Isn’t that surprising in the context of respiratory diseases?

4.3 Run the GloScope using CPU#

If you don’t have a GPU, you can still run GloScope in Python way faster than in R! The interface is exactly the same, just set use_gpu=False this time:

gloscope_py_cpu = patpy.tl.GloScope_py(sample_key=sample_id_col, layer=layer, k=25, use_gpu=False)
gloscope_py_cpu.prepare_anndata(adata)

%%time
distances_py_cpu = gloscope_py_cpu.calculate_distance_matrix(force=True)
distances_py_cpu

CPU times: user 22min 13s, sys: 5.63 s, total: 22min 19s
Wall time: 22min 9s

array([[0.        , 3.01316958, 4.98487267, ..., 3.3902534 , 8.34240482,
        4.79270835],
       [3.01316958, 0.        , 4.62207723, ..., 4.27903188, 7.35136176,
        3.02328651],
       [4.98487267, 4.62207723, 0.        , ..., 5.754768  , 7.31083296,
        3.14633485],
       ...,
       [3.3902534 , 4.27903188, 5.754768  , ..., 0.        , 7.98588573,
        5.30291834],
       [8.34240482, 7.35136176, 7.31083296, ..., 7.98588573, 0.        ,
        4.93305717],
       [4.79270835, 3.02328651, 3.14633485, ..., 5.30291834, 4.93305717,
        0.        ]], shape=(137, 137))

Well, that took a bit longer than a minute, but still better than 3 hours it takes for the R implementation. You can notice that numbers in the matrix are almost identical to what we had before. Let’s take a look at UMAP visualisations:

axes = gloscope_py_cpu.plot_embedding(method="UMAP", metadata_cols=well_represented_cols)

for ax, col in zip(axes, well_represented_cols):
    ax.legend(loc=(0, 1.02), title=col)

../../_images/40650c13d476d14dab1a8df29d1fb2541bdba606ec984c33186af0849942d83d.png

We can see that embeddings look very similar to what we had before. Qualitative assessment would give the similar results.

5. Original GloScope Implementation#

This version is implemented in R and can be used via patpy. We do not recommend this approach for your workflow and only show it here for comparison.

If you want to run it, you need to install R dependencies. The easiest way to do it is via conda. You can use the following environment:

name: gloscope
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base=4.3
  - r-data.table
  - r-reticulate
  - r-anndata
  - bioconductor-gloscope
  - anndata2ri

Or install the required dependencies into an existing environment (don’t forget to activate it in prior):

conda install -c conda-forge -c bioconda r-base=4.3 r-data.table r-reticulate r-anndata bioconductor-gloscope anndata2ri -y

Initialise the sample represenation method. You can set n_workers to the number of CPUs on your machine to make the computation faster:

gloscope_r = patpy.tl.GloScope(sample_key=sample_id_col, layer=layer, k=25, n_workers=6)
gloscope_r.prepare_anndata(adata)

%%time
distances_r = gloscope_r.calculate_distance_matrix(force=True)
distances_r

Calculating GloScope distance matrix
CPU times: user 32.2 s, sys: 4.59 s, total: 36.8 s
Wall time: 7h 50min 4s

array([[0.        , 2.88381577, 4.86649874, ..., 3.25397249, 8.19277845,
        4.66033323],
       [2.88381577, 0.        , 4.48540625, ..., 4.13700121, 7.18381653,
        2.88759288],
       [4.86649874, 4.48540625, 0.        , ..., 5.62388687, 7.15729136,
        3.00734636],
       ...,
       [3.25397249, 4.13700121, 5.62388687, ..., 0.        , 7.81237691,
        5.16143919],
       [8.19277845, 7.18381653, 7.15729136, ..., 7.81237691, 0.        ,
        4.78215528],
       [4.66033323, 2.88759288, 3.00734636, ..., 5.16143919, 4.78215528,
        0.        ]], shape=(137, 137))

axes = gloscope_r.plot_embedding(method="UMAP", metadata_cols=well_represented_cols)

for ax, col in zip(axes, well_represented_cols):
    ax.legend(loc=(0, 1.02), title=col)

../../_images/0016ddafc47e2424870f4c3538916c7e7616754665d94fcac2866744b8556325.png

The numbers are again very similar to what we had before, and so it the representation. But it took us 8 hours to run!

Conclusion#

In this tutorial we saw how to obtain a high quality sample representation from single-cell transcriptomic data with GloScope and patpy. It enables understanding the sources of variation in the data. With GloScope we saw that COVID-19 outcome has a high influence on the data. To learn how sample representation can be further used to derive markers of COVID-19 severity, see our tutorial on patient trajectory analysis.

Benchmarking GloScope reimplementation#

Here are some more technical results on the reimplementation of GloScope for interested readers.

How much faster is the Python reimplementation of GloScope?#

Up to 160x faster with GPU and up to 7x faster without it. The bigger your dataset, the faster our implementation is. Here are the results of an experiment with COMBAT dataset subsets of a different size:

images/gloscope_eval_speedup.png

How similar are the results of different GloScope implementations?#

For all practical purposes, they are identical. Here is the comparison of KL divergence matrices obtained on a different subsets of the COMBAT dataset by different reimplementations. Mean squared error (MSE) is used to compare numerical results, and Spearman correlation is used to check whether the order of samples is preserved. Both metrics are close to their limits (0 and 1 correspondingly):

images/gloscope_eval_matrices.png

The GloScope reimplementation in Python was developed by Emma Schonner as a part of her Master thesis at the Technical University of Munich (TUM)