Understanding sources of variation in single-cell data with GloScope#

What explains differences between samples in your data? A disease condition of patients or maybe batch effects? In this tutorial, we will use patpy to build a sample representation from single-cell transcriptomics data and evaluate the sources of variation in it.

We will use GloScope, a density-based sample representation method. It treats samples as point clouds, where each point is a cell. To understand how dissimilar samples are, GloScope computes divergence between these point clouds. Intuitively, if some cells states are present in every sample, they will have a little contribution to differences between samples. On the other hand, sample-specific cell states will have a high contribution. You can find more details about GloScope in the original publication.

../../_images/gloscope_illustration.png

Gloscope has shown the best performance among methods in the SPARE sample representation benchmark. It captures clinical information well and uncovers disease trajectories as demonstrated in our patient trajectory analysis tutorial. In this notebook, we’ll run our Python reimplementation of GloScope. We’ll see that it gives identical results to the original R implemetation, but is orders of magnitude faster.

1. Setup#

There are 2 possible ways to run GloScope in python: with GPU and without. While both are more efficient than R version, GPU uses accelerated engine for mathematical operations and performs particularly rapidly. But all good things come at a cost: to use GPU you must have it in your computer and install additional packages. If this works for you, follow the installation steps below. For running CPU version, jump to step 1.3, all the required dependencies already come with patpy.

1.1 Option 1 (with conda)#

  1. Download the environment file from here or manually create a file rapids_singlecell.yaml with the following content:

name: rapids_singlecell
channels:
  - rapidsai
  - nvidia
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - anndata=0.11
  - scanpy=1.11
  - leidenalg
  - louvain
  - zarr
  - cuda-version=12
  - rapids=24.06
  - scikit-learn<1.6 # required by dask-ml
  - scipy<1.15
  - dask
  - dask-ml
  - sparse
  - harmonypy
  - pytorch
  - pip
  - pip:
    - rapids-singlecell
    - scikit-learn<1.6
  1. Create a new conda environment based on this file:

conda env create -f rapids_singlecell.yaml
  1. Activate the environment:

conda activate rapids_singlecell
  1. Install ipykernel to be able to use this environment in a jupyter notebook:

conda install ipykernel

1.2 Option 2 (conda-free). Install the RAPIDS package#

You can use the script provided by rapidsai developers. Activate a desired environment, and run the following:

git clone https://github.com/rapidsai/rapidsai-csp-utils.git
python rapidsai-csp-utils/colab/pip-install.py

Run this cell to enable consistent usage of array API:

import os
os.environ["SCIPY_ARRAY_API"] = "1"

1.3 Install the patpy package#

For latest released version, simply run:

pip install patpy

To get a newer development version (might be unstable), install from github:

pip install git+https://github.com/lueckenlab/patpy.git@main

1.4 Import packages#

import pandas as pd
import numpy as np
import scanpy as sc
import patpy
import seaborn as sns
import matplotlib.pyplot as plt

import random

random.seed(42)
np.random.seed(42)
patpy.__version__
'0.11.2'

2. Data Preparation#

2.1 Read the data#

Here, we use COMBAT dataset. This dataset contains 783k cells (after some additional preprocessing) from 138 COVID-19 patients and healthy donors.

ADATA_PATH = "/home/icb/vladimir.shitov/projects/vladimir.shitov/2023_05_patient_representation_benchmark/reproducibility/pat_rep_benchmark/data/combat/combat_processed.h5ad"
adata = sc.read_h5ad(ADATA_PATH)
adata
AnnData object with n_obs × n_vars = 783677 × 3000
    obs: 'Annotation_cluster_id', 'Annotation_cluster_name', 'Annotation_minor_subset', 'Annotation_major_subset', 'Annotation_cell_type', 'GEX_region', 'QC_ngenes', 'QC_total_UMI', 'QC_pct_mitochondrial', 'QC_scrub_doublet_scores', 'TCR_chain_composition', 'TCR_clone_ID', 'TCR_clone_count', 'TCR_clone_proportion', 'TCR_contains_unproductive', 'TCR_doublet', 'TCR_chain_TRA', 'TCR_v_gene_TRA', 'TCR_d_gene_TRA', 'TCR_j_gene_TRA', 'TCR_c_gene_TRA', 'TCR_productive_TRA', 'TCR_cdr3_TRA', 'TCR_umis_TRA', 'TCR_chain_TRA2', 'TCR_v_gene_TRA2', 'TCR_d_gene_TRA2', 'TCR_j_gene_TRA2', 'TCR_c_gene_TRA2', 'TCR_productive_TRA2', 'TCR_cdr3_TRA2', 'TCR_umis_TRA2', 'TCR_chain_TRB', 'TCR_v_gene_TRB', 'TCR_d_gene_TRB', 'TCR_j_gene_TRB', 'TCR_c_gene_TRB', 'TCR_productive_TRB', 'TCR_chain_TRB2', 'TCR_v_gene_TRB2', 'TCR_d_gene_TRB2', 'TCR_j_gene_TRB2', 'TCR_c_gene_TRB2', 'TCR_productive_TRB2', 'TCR_cdr3_TRB2', 'TCR_umis_TRB2', 'BCR_umis_HC', 'BCR_contig_qc_HC', 'BCR_functionality_HC', 'BCR_v_call_HC', 'BCR_v_score_HC', 'BCR_j_call_HC', 'BCR_j_score_HC', 'BCR_junction_aa_HC', 'BCR_total_mut_HC', 'BCR_s_mut_HC', 'BCR_r_mut_HC', 'BCR_c_gene_HC', 'BCR_clone_per_replicate_HC', 'BCR_clone_global_HC', 'BCR_clonal_abundance_HC', 'BCR_locus_LC', 'BCR_umis_LC', 'BCR_contig_qc_LC', 'BCR_functionality_LC', 'BCR_v_call_LC', 'BCR_v_score_LC', 'BCR_j_call_LC', 'BCR_j_score_LC', 'BCR_junction_aa_LC', 'BCR_total_mut_LC', 'BCR_s_mut_LC', 'BCR_r_mut_LC', 'BCR_c_gene_LC', 'COMBAT_ID', 'scRNASeq_sample_ID', 'COMBAT_participant_timepoint_ID', 'Source', 'Age', 'Sex', 'Race', 'BMI', 'Hospitalstay', 'Death28', 'Institute', 'PreExistingHeartDisease', 'PreExistingLungDisease', 'PreExistingKidneyDisease', 'PreExistingDiabetes', 'PreExistingHypertension', 'PreExistingImmunocompromised', 'Smoking', 'Symptomatic', 'Requiredvasoactive', 'Respiratorysupport', 'SARSCoV2PCR', 'Outcome', 'TimeSinceOnset', 'Ethnicity', 'Tissue', 'DiseaseClassification', 'Pool_ID', 'Channel_ID', 'ifn_1_score', '_scvi_batch', '_scvi_labels', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'vertex', 'eigenvector_centrality'
    var: 'gene_ids', 'feature_types', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'Institute', 'ObjectCreateDate', 'Source_colors', 'Technology', 'X_gloscope_cuml_distances', 'X_gloscope_pynndescent_distances', 'X_scpoli', '_scvi_manager_uuid', '_scvi_uuid', 'genome_annotation_version', 'gloscope_representation', 'gloscope_scpoli_distances', 'hvg', 'log1p', 'neighbors', 'pca', 'scpoli_distances', 'scpoli_parameters', 'scpoli_samples'
    obsm: 'X_pca', 'X_scANVI_batch', 'X_scANVI_sample', 'X_scVI_batch', 'X_scVI_sample', 'X_scpoli', 'X_umap', 'X_umap_source'
    varm: 'PCs'
    layers: 'X_raw_counts'
    obsp: 'connectivities', 'distances'

2.2 Set columns containing sample IDs, cell types and metadata#

For convenience, let’s store columns contatining patient-level information to python variables:

sample_id_col = "scRNASeq_sample_ID"

# Columns from adata.obs containing metadata about samples
samples_metadata_cols = ["Source", "Outcome", "Death28", "Hospitalstay", "TimeSinceOnset", "Institute", "Pool_ID",
                         "Age", "Sex", "BMI", "PreExistingHeartDisease", "PreExistingLungDisease", "PreExistingDiabetes",
                         "PreExistingHypertension", "Smoking", "Requiredvasoactive", "SARSCoV2PCR"]

# Columns we will use for plotting
cols_to_plot = ["Source", "Outcome", "Death28", "Institute", "Pool_ID"]

Additionally, let’s set a layer that we will use as cell representations. It must be "X" or one of the keys in either .layers or .obsm. Here, we will use scVI embedding:

layer = "X_scVI_batch"

2.3 Store metadata and calculate QC metrics#

To evaluate sample representation, let’s extract a sample metadata table from .obs data frame in annotated data object:

metadata = adata.obs[samples_metadata_cols + [sample_id_col]].drop_duplicates()
metadata.set_index(sample_id_col, inplace=True)
metadata
Source Outcome Death28 Hospitalstay TimeSinceOnset Institute Pool_ID Age Sex BMI PreExistingHeartDisease PreExistingLungDisease PreExistingDiabetes PreExistingHypertension Smoking Requiredvasoactive SARSCoV2PCR
scRNASeq_sample_ID
S00109-Ja001E-PBCa COVID_SEV 2.0 0 1.0 12.0 Oxford gPlexA 5.0 1 5 0.0 0.0 0.0 1.0 0 1.0 1
S00112-Ja003E-PBCa COVID_MILD 5.0 0 2.0 12.0 Oxford gPlexA 5.0 1 5 0.0 0.0 0.0 0.0 0 0.0 1
S00005-Ja005E-PBCa COVID_CRIT 2.0 0 4.0 14.0 Oxford gPlexA 7.0 0 1 1.0 0.0 0.0 1.0 0 1.0 1
S00061-Ja003E-PBCa COVID_SEV 4.0 0 2.0 6.0 Oxford gPlexA 5.0 1 5 0.0 0.0 0.0 0.0 0 0.0 1
S00056-Ja003E-PBCa COVID_SEV 3.0 0 3.0 23.0 Oxford gPlexA 7.0 0 5 0.0 0.0 0.0 0.0 0 0.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
S00065-Ja003E-PBCa COVID_CRIT 2.0 0 3.0 16.0 Oxford gPlexK 5.0 0 5 0.0 0.0 1.0 0.0 0 1.0 1
S00048-Ja003E-PBCa COVID_SEV 4.0 0 2.0 12.0 Oxford gPlexK 7.0 1 3 0.0 0.0 0.0 1.0 0 0.0 1
G05112-Ja005E-PBCa COVID_HCW_MILD 6.0 0 NaN 7.0 Oxford gPlexK 4.0 0 5 NaN NaN NaN NaN 0 NaN 1
N00038-Ja001E-PBGa Sepsis NaN 0 0.0 7.0 Oxford gPlexK 4.0 1 5 0.0 0.0 0.0 0.0 0 0.0 0
U00501-Ua005E-PBUa Flu 2.0 0 NaN 11.0 St_Georges gPlexK 6.0 1 1 NaN NaN 0.0 0.0 0 1.0 0

138 rows × 17 columns

Let’s further add quality control information to evaluate how much representations are affected by number of genes, mRNA counts and percentage of mitochondrial genes. We’ll aggregate this cell-level QC metric using a median for every sample:

cell_qc_metadata = patpy.pp.calculate_cell_qc_metrics(
    adata, sample_key=sample_id_col, cell_qc_vars=["QC_ngenes", "QC_total_UMI", "QC_pct_mitochondrial"]
)
cell_qc_metadata.columns = ["median_n_genes_by_counts", "median_total_counts", "median_pct_counts_mt"]
cell_qc_metadata
median_n_genes_by_counts median_total_counts median_pct_counts_mt
scRNASeq_sample_ID
G05061-Ja005E-PBCa 1107.0 3125.5 3.011159
G05064-Ja005E-PBCa 975.0 2933.0 1.332430
G05073-Ja005E-PBCa 1141.0 3554.0 2.422559
G05077-Ja005E-PBCa 1125.0 3268.0 2.946723
G05078-Ja005E-PBCa 999.0 3106.0 2.825308
... ... ... ...
U00607-Ua005E-PBUa 1827.0 5749.0 2.982509
U00613-Ua005E-PBUa 1251.5 3325.0 2.053083
U00617-Ua005E-PBUa 1410.5 4524.0 3.886215
U00619-Ua005E-PBUa 1532.0 4105.0 2.688350
U00701-Ua005E-PBUa 1093.5 3074.5 3.714650

138 rows × 3 columns

n_genes_metadata = patpy.pp.calculate_n_cells_per_sample(adata, sample_id_col)
n_genes_metadata
n_cells
scRNASeq_sample_ID
S00052-Ja005E-PBCa 13918
H00054-Ha001E-PBGa 10938
H00067-Ha001E-PBGa 10781
N00023-Ja001E-PBGa 10484
H00053-Ha001E-PBGa 10458
... ...
U00607-Ua005E-PBUa 1021
U00613-Ua005E-PBUa 970
U00701-Ua005E-PBUa 872
U00601-Ua005E-PBUa 619
U00504-Ua005E-PBUa 161

138 rows × 1 columns

Combine all the tables into one:

metadata = pd.concat(
    [
        metadata,
        cell_qc_metadata.loc[metadata.index],
        n_genes_metadata.loc[metadata.index],
    ],
    axis=1,
)

metadata.shape
(138, 21)

Merge aggregated QC to cell metadata:

adata.obs = adata.obs.merge(metadata.drop(columns=samples_metadata_cols), left_on=sample_id_col, right_index=True)

2.4 Quality control#

To reduce noise in the representations, we need to remove samples with too few cells:

adata = patpy.pp.filter_small_samples(adata, sample_key=sample_id_col, sample_size_threshold=500)
1 samples removed: U00504-Ua005E-PBUa

3. Running GloScope in Python#

The nearest neighbor computation for the distance matrix calculation can either be done using a CPU or a GPU. By setting the parameter use_gpu=True the matrix will be calculated using either cuML or PyNNDescent.

4.1 Run the GloScope with GPU acceleration#

Using patpy, we can run sample representation methods with a simple interface:

gloscope_py = patpy.tl.GloScope_py(sample_key=sample_id_col, layer=layer, k=25, use_gpu=True)
gloscope_py.prepare_anndata(adata)

4.2 Calculate the distance matrix between samples#

GloScope does not create a vector representation of samples, but instead gives a pairwise KL-divergence matrix. The smaller the values, the more similar samples are. GloScope does not produce “distances” in mathematical sense, but we can treat the divergencies like it. Even with that caveat, it captures more information than other sample representation methods.

%%time
distances_py = gloscope_py.calculate_distance_matrix(force=True)
distances_py
CPU times: user 21.4 s, sys: 3.54 s, total: 25 s
Wall time: 2min 14s
array([[0.        , 2.98802018, 4.96645102, ..., 3.36136601, 8.29764793,
        4.77171363],
       [2.98802018, 0.        , 4.58604968, ..., 4.24508584, 7.28937722,
        2.99966394],
       [4.96645102, 4.58604968, 0.        , ..., 5.72771941, 7.25859998,
        3.11516585],
       ...,
       [3.36136601, 4.24508584, 5.72771941, ..., 0.        , 7.92112695,
        5.27670011],
       [8.29764793, 7.28937722, 7.25859998, ..., 7.92112695, 0.        ,
        4.89489186],
       [4.77171363, 2.99966394, 3.11516585, ..., 5.27670011, 4.89489186,
        0.        ]], shape=(137, 137))

Our implementation only takes about two minutes for the whole COMBAT data set when using GPU! Perfect time for you to go for another cup of coffee, but not long enough to make you feel bored.

4.3 Visualise the sample representation using UMAP#

But does GloScope capture information about samples? Let’s build a UMAP plot to get an idea. On this plot, similar patients are localised together:

gloscope_py.plot_embedding(method="UMAP", metadata_cols=cols_to_plot);
../../_images/35bb5109a0a6e95236f01eca5738449de246f90846db617a310fa599d146a996.png

We can see that patients with similar “Source” and “Outcome” are grouped together. The latter reflects the severity of the disease: from 1, which means death, to 6, which encodes healthy donors. “Death28” column encodes whether patients died in the period of 28 days. The last 2 columns represent locations where samples were processed and the batches. The visualisation suggests that the batches mix with each other and explain the embedding much worse than the Outcome. We will further validate this finding numerically.

4. Evaluating sources of variation in the data#

While somebody is already writing a wrathful twitter thread on our interpretation of UMAPs, let’s compute some metrics to understand what explains sample-level variation in the data. We need to classify our sample-level covariates (columns in the metadata) to clinically relevant, technical and contextual. The former 2 categories can be used to judge whether sample representation method is working well for our analysis goals. The “contextual” category contains columns, which we don’t aim to analyse, but may want to know whether they impact the representation.

Furthermore, we need to define prediction tasks for every metadata columns. They must be one of:

  • “classification” – for discrete labels such as disease status

  • “regression” – for continuous covariates such as number of hospital stay days

  • “ranking” – for ranked variables. For example, “Outcome” in our dataset is encoded as numbers from 1 to 6. In this case, if a sample has label “6” (healthy), predicting “5” (hospitalised, no extra oxigen) is better than “3” (non-invasive ventilation).

We will predict each covariate based on the nearest neighbors (the most similar samples) in the representation. Setting the task will define metrics used to evaluate quality of the prediction. For classification, it is F1-macro score (corrected for random prediction), and for other tasks Spearman correlation is used.

benchmark_schema = {
    "relevant": {
        "Source": "classification",
        "Outcome": "ranking",
        "Death28": "classification",
        "Hospitalstay": "regression",
        "TimeSinceOnset": "regression"
    },
    "technical": {
        "Institute": "classification",
        "Pool_ID": "classification",
        "median_n_genes_by_counts": "regression",
        "median_total_counts": "regression",
        "median_pct_counts_mt": "regression",
        "n_cells": "regression"
    },
    "contextual": {
        "Age": "regression",
        "Sex": "classification",
        "BMI": "regression",
        "PreExistingHeartDisease": "classification",
        "PreExistingLungDisease": "classification",
        "PreExistingDiabetes": "classification",
        "PreExistingHypertension": "classification",
        "Smoking": "classification",
        "Requiredvasoactive": "classification",
        "SARSCoV2PCR": "classification"
    }
}

4.1 Run the covariate prediction#

metadata = metadata.loc[gloscope_py.samples]  # Ensure the order matches the distance matrix
results = []

for type_name, type_data in benchmark_schema.items():
    for target, task in type_data.items():
        result = patpy.tl.evaluate_representation(
            distances=distances_py,
            target=metadata[target],
            method="knn",
            task=task,
            n_neighbors=5
        )

        result["covariate"] = target
        result["covariate_type"] = type_name

        if result["metric"] == "spearman_r":
            result["score"] = abs(result["score"])

        results.append(result)

We can order covariates based on how well they can be predicted from a representation. It gives us an insight on what explains variability in the data:

knn_results = pd.DataFrame(results)
knn_results.sort_values(by="score", ascending=False)
score metric n_unique n_observations method covariate covariate_type
9 0.745459 spearman_r 137 137 knn median_pct_counts_mt technical
7 0.631919 spearman_r 118 137 knn median_n_genes_by_counts technical
1 0.626460 spearman_r 6 112 knn Outcome relevant
20 0.599415 f1_macro_calibrated 2 137 knn SARSCoV2PCR contextual
8 0.585656 spearman_r 131 137 knn median_total_counts technical
15 0.549561 f1_macro_calibrated 2 99 knn PreExistingLungDisease contextual
4 0.423161 spearman_r 26 127 knn TimeSinceOnset relevant
2 0.421534 f1_macro_calibrated 2 137 knn Death28 relevant
10 0.395705 spearman_r 137 137 knn n_cells technical
13 0.317015 spearman_r 6 137 knn BMI contextual
0 0.314356 f1_macro_calibrated 8 137 knn Source relevant
5 0.286458 f1_macro_calibrated 2 137 knn Institute technical
11 0.275583 spearman_r 8 137 knn Age contextual
3 0.260062 spearman_r 14 102 knn Hospitalstay relevant
14 0.259044 f1_macro_calibrated 2 99 knn PreExistingHeartDisease contextual
12 0.113460 f1_macro_calibrated 2 137 knn Sex contextual
6 0.078434 f1_macro_calibrated 10 137 knn Pool_ID technical
18 0.031751 f1_macro_calibrated 3 137 knn Smoking contextual
16 0.003726 f1_macro_calibrated 2 112 knn PreExistingDiabetes contextual
17 0.000000 f1_macro_calibrated 2 112 knn PreExistingHypertension contextual
19 0.000000 f1_macro_calibrated 2 115 knn Requiredvasoactive contextual

4.2 Visualise the results#

Let’s build a plot based on the table above for a nice visual comparision:

plt.figure(figsize=(10, 8), dpi=300)
plt.rcParams.update({
    'font.size': 14,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10
})

sns.barplot(
    data=knn_results,
    y="covariate",
    x="score",
    hue="covariate_type",
    dodge=False,
    palette="colorblind",
    order=knn_results.sort_values('score', ascending = False).covariate
)

plt.xlabel("KNN prediction score")
plt.ylabel("Covariate")

plt.legend(title="Covariate Type")
plt.title("Sources of variation in the COMBAT dataset", fontweight='bold')

sns.despine()
plt.tight_layout()
plt.show()
../../_images/425219a522b729b611905ce51155f2429c9f9f7ba7f330f58fc8ab5afc3973eb.png

We can see that median percentage of mitochondrial counts and median number of genes can be predicted very well from a representation. This is rather bad news, as these covariates are technical. On the bright side, the next best-predicted covariate is the “Outcome”, indicating that disease outcome effects the transcriptomics in blood! Let’s visualise these covariates on the representation:

well_represented_cols = ["Outcome", "median_pct_counts_mt", "median_n_genes_by_counts"]

axes = gloscope_py.plot_embedding(method="UMAP", metadata_cols=well_represented_cols)

for ax, col in zip(axes, well_represented_cols):
    ax.legend(loc=(0, 1.02), title=col)
../../_images/3316de5ecd775bfbc3e85d5e9b8c79d59ea6ef504593d5df720174470dc896c9.png

As we previously seen, outcome has a clear grouping on the UMAP. We can also see a notable gradient of median percentage of mitochondrial gene counts and number of genes in a sample.

Another insight we can learn from this sample representation is what doesn’t affect the data. For example, pre-existing diabetes and hypertension cannot be predicted well from the data we have. That can mean the following:

  • These are not the main axes of variation in the data. Perhaps, they do affect the transcriptomics, but this effect is hidden behind stronger sources of variation. If you are interested in analysis of those covariates, a more targeted approach is needed.

  • These covariates do not affect the data at all. There are limitations of what we can predict from transcriptomics. Perhaps, these information is simply beyond them.

  • Metadata is too sparse. We don’t have too many samples with positive values of these covariates. Larger datasets could uncover them better.

In any case, sample representation allows us to immediately see what the data hides.

Note another interesting phenomenon: smoking status can be predicted worse than BMI. Isn’t that surprising in the context of respiratory diseases?

4.3 Run the GloScope using CPU#

If you don’t have a GPU, you can still run GloScope in Python way faster than in R! The interface is exactly the same, just set use_gpu=False this time:

gloscope_py_cpu = patpy.tl.GloScope_py(sample_key=sample_id_col, layer=layer, k=25, use_gpu=False)
gloscope_py_cpu.prepare_anndata(adata)
%%time
distances_py_cpu = gloscope_py_cpu.calculate_distance_matrix(force=True)
distances_py_cpu
CPU times: user 22min 13s, sys: 5.63 s, total: 22min 19s
Wall time: 22min 9s
array([[0.        , 3.01316958, 4.98487267, ..., 3.3902534 , 8.34240482,
        4.79270835],
       [3.01316958, 0.        , 4.62207723, ..., 4.27903188, 7.35136176,
        3.02328651],
       [4.98487267, 4.62207723, 0.        , ..., 5.754768  , 7.31083296,
        3.14633485],
       ...,
       [3.3902534 , 4.27903188, 5.754768  , ..., 0.        , 7.98588573,
        5.30291834],
       [8.34240482, 7.35136176, 7.31083296, ..., 7.98588573, 0.        ,
        4.93305717],
       [4.79270835, 3.02328651, 3.14633485, ..., 5.30291834, 4.93305717,
        0.        ]], shape=(137, 137))

Well, that took a bit longer than a minute, but still better than 3 hours it takes for the R implementation. You can notice that numbers in the matrix are almost identical to what we had before. Let’s take a look at UMAP visualisations:

axes = gloscope_py_cpu.plot_embedding(method="UMAP", metadata_cols=well_represented_cols)

for ax, col in zip(axes, well_represented_cols):
    ax.legend(loc=(0, 1.02), title=col)
../../_images/40650c13d476d14dab1a8df29d1fb2541bdba606ec984c33186af0849942d83d.png

We can see that embeddings look very similar to what we had before. Qualitative assessment would give the similar results.

5. Original GloScope Implementation#

This version is implemented in R and can be used via patpy. We do not recommend this approach for your workflow and only show it here for comparison.

If you want to run it, you need to install R dependencies. The easiest way to do it is via conda. You can use the following environment:

name: gloscope
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base=4.3
  - r-data.table
  - r-reticulate
  - r-anndata
  - bioconductor-gloscope
  - anndata2ri

Or install the required dependencies into an existing environment (don’t forget to activate it in prior):

conda install -c conda-forge -c bioconda r-base=4.3 r-data.table r-reticulate r-anndata bioconductor-gloscope anndata2ri -y

Initialise the sample represenation method. You can set n_workers to the number of CPUs on your machine to make the computation faster:

gloscope_r = patpy.tl.GloScope(sample_key=sample_id_col, layer=layer, k=25, n_workers=6)
gloscope_r.prepare_anndata(adata)
%%time
distances_r = gloscope_r.calculate_distance_matrix(force=True)
distances_r
Calculating GloScope distance matrix
CPU times: user 32.2 s, sys: 4.59 s, total: 36.8 s
Wall time: 7h 50min 4s
array([[0.        , 2.88381577, 4.86649874, ..., 3.25397249, 8.19277845,
        4.66033323],
       [2.88381577, 0.        , 4.48540625, ..., 4.13700121, 7.18381653,
        2.88759288],
       [4.86649874, 4.48540625, 0.        , ..., 5.62388687, 7.15729136,
        3.00734636],
       ...,
       [3.25397249, 4.13700121, 5.62388687, ..., 0.        , 7.81237691,
        5.16143919],
       [8.19277845, 7.18381653, 7.15729136, ..., 7.81237691, 0.        ,
        4.78215528],
       [4.66033323, 2.88759288, 3.00734636, ..., 5.16143919, 4.78215528,
        0.        ]], shape=(137, 137))
axes = gloscope_r.plot_embedding(method="UMAP", metadata_cols=well_represented_cols)

for ax, col in zip(axes, well_represented_cols):
    ax.legend(loc=(0, 1.02), title=col)
../../_images/0016ddafc47e2424870f4c3538916c7e7616754665d94fcac2866744b8556325.png

The numbers are again very similar to what we had before, and so it the representation. But it took us 8 hours to run!

Conclusion#

In this tutorial we saw how to obtain a high quality sample representation from single-cell transcriptomic data with GloScope and patpy. It enables understanding the sources of variation in the data. With GloScope we saw that COVID-19 outcome has a high influence on the data. To learn how sample representation can be further used to derive markers of COVID-19 severity, see our tutorial on patient trajectory analysis.

Benchmarking GloScope reimplementation#

Here are some more technical results on the reimplementation of GloScope for interested readers.

How much faster is the Python reimplementation of GloScope?#

Up to 160x faster with GPU and up to 7x faster without it. The bigger your dataset, the faster our implementation is. Here are the results of an experiment with COMBAT dataset subsets of a different size:

images/gloscope_eval_speedup.png

How similar are the results of different GloScope implementations?#

For all practical purposes, they are identical. Here is the comparison of KL divergence matrices obtained on a different subsets of the COMBAT dataset by different reimplementations. Mean squared error (MSE) is used to compare numerical results, and Spearman correlation is used to check whether the order of samples is preserved. Both metrics are close to their limits (0 and 1 correspondingly):

images/gloscope_eval_matrices.png

The GloScope reimplementation in Python was developed by Emma Schonner as a part of her Master thesis at the Technical University of Munich (TUM)