Patient trajectory analysis

Patient trajectory analysis#

In this notebook, I’ll show you an example of sample representation analysis to infer COVID severity trajectory

First, let’s import some packages. You might be unfamiliar with ehrapy: a package from Theis lab. It is an extension of scanpy for electronic health record (EHR) analysis. Many methods fully overlap (e.g. scanpy.tl.pca -> ehrapy.tl.pca) but there are a bit more. Think of it as scanpy but with a bit more cool features and support for categorical data.

import patpy
import ehrapy as ep
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scanpy as sc

Here, I use COMBAT dataset. It contains PBMC from 140 individuals with various COVID severity. You can find processed dataset, more information about it, and link to the publication on Kaggle.

adata = sc.read_h5ad("/Users/vladimir.shitov/Documents/programming/pat_rep_benchmark/data/combat/combat_processed.h5ad")
adata
AnnData object with n_obs × n_vars = 783677 × 3000
    obs: 'Annotation_cluster_id', 'Annotation_cluster_name', 'Annotation_minor_subset', 'Annotation_major_subset', 'Annotation_cell_type', 'GEX_region', 'QC_ngenes', 'QC_total_UMI', 'QC_pct_mitochondrial', 'QC_scrub_doublet_scores', 'TCR_chain_composition', 'TCR_clone_ID', 'TCR_clone_count', 'TCR_clone_proportion', 'TCR_contains_unproductive', 'TCR_doublet', 'TCR_chain_TRA', 'TCR_v_gene_TRA', 'TCR_d_gene_TRA', 'TCR_j_gene_TRA', 'TCR_c_gene_TRA', 'TCR_productive_TRA', 'TCR_cdr3_TRA', 'TCR_umis_TRA', 'TCR_chain_TRA2', 'TCR_v_gene_TRA2', 'TCR_d_gene_TRA2', 'TCR_j_gene_TRA2', 'TCR_c_gene_TRA2', 'TCR_productive_TRA2', 'TCR_cdr3_TRA2', 'TCR_umis_TRA2', 'TCR_chain_TRB', 'TCR_v_gene_TRB', 'TCR_d_gene_TRB', 'TCR_j_gene_TRB', 'TCR_c_gene_TRB', 'TCR_productive_TRB', 'TCR_chain_TRB2', 'TCR_v_gene_TRB2', 'TCR_d_gene_TRB2', 'TCR_j_gene_TRB2', 'TCR_c_gene_TRB2', 'TCR_productive_TRB2', 'TCR_cdr3_TRB2', 'TCR_umis_TRB2', 'BCR_umis_HC', 'BCR_contig_qc_HC', 'BCR_functionality_HC', 'BCR_v_call_HC', 'BCR_v_score_HC', 'BCR_j_call_HC', 'BCR_j_score_HC', 'BCR_junction_aa_HC', 'BCR_total_mut_HC', 'BCR_s_mut_HC', 'BCR_r_mut_HC', 'BCR_c_gene_HC', 'BCR_clone_per_replicate_HC', 'BCR_clone_global_HC', 'BCR_clonal_abundance_HC', 'BCR_locus_LC', 'BCR_umis_LC', 'BCR_contig_qc_LC', 'BCR_functionality_LC', 'BCR_v_call_LC', 'BCR_v_score_LC', 'BCR_j_call_LC', 'BCR_j_score_LC', 'BCR_junction_aa_LC', 'BCR_total_mut_LC', 'BCR_s_mut_LC', 'BCR_r_mut_LC', 'BCR_c_gene_LC', 'COMBAT_ID', 'scRNASeq_sample_ID', 'COMBAT_participant_timepoint_ID', 'Source', 'Age', 'Sex', 'Race', 'BMI', 'Hospitalstay', 'Death28', 'Institute', 'PreExistingHeartDisease', 'PreExistingLungDisease', 'PreExistingKidneyDisease', 'PreExistingDiabetes', 'PreExistingHypertension', 'PreExistingImmunocompromised', 'Smoking', 'Symptomatic', 'Requiredvasoactive', 'Respiratorysupport', 'SARSCoV2PCR', 'Outcome', 'TimeSinceOnset', 'Ethnicity', 'Tissue', 'DiseaseClassification', 'Pool_ID', 'Channel_ID', 'ifn_1_score', '_scvi_batch', '_scvi_labels', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'Institute', 'ObjectCreateDate', 'Source_colors', 'Technology', 'X_scpoli', '_scvi_manager_uuid', '_scvi_uuid', 'genome_annotation_version', 'hvg', 'log1p', 'pca', 'scpoli_distances', 'scpoli_parameters', 'scpoli_samples'
    obsm: 'X_pca', 'X_scANVI_batch', 'X_scANVI_sample', 'X_scVI_batch', 'X_scVI_sample', 'X_scpoli', 'X_umap', 'X_umap_source'
    varm: 'PCs'
    layers: 'X_raw_counts'
sample_key = "scRNASeq_sample_ID"
cell_type_key = "Annotation_major_subset"

The following file contains metadata for donors in COMBAT dataset. It is available on GitHub. We will work with it in ehrapy but note that this is just anndata you are used to in single-cell.

There is a bunch of patient representation methods precalculated. Each of them has distances matrix and UMAP in obsm, and donors clustering in obs. Think of patient representation as distances between patiens: if a distance is equal to 0, transcriptomics profiles of patients are identical. The bigger the number, the bigger the differences.

meta_adata = ep.io.read_h5ad("data/combat_meta_adata.h5ad")
meta_adata
AnnData object with n_obs × n_vars = 137 × 88
    obs: 'Race', 'Institute', 'Tissue', 'DiseaseClassification', 'Pool_ID', 'Source', 'Ethnicity', 'longitudinal_oxygen_phase', 'sampled_at_max_severity', 'potential_exclude', 'low_cellnumber_exclude', 'oxygen_status_sample', 'diabetes', 'comp_total', 'comp_minimal', 'comp_minimal_hosp_covid', 'comp_minimal_plus_transitional', 'MJ_samples', 'random_vec_leiden', 'composition_leiden', 'pseudobulk_leiden', 'pseudobulk_scanvi_pool_leiden', 'pseudobulk_scanvi_leiden', 'pseudobulk_scpoli_leiden', 'pseudobulk_scvi_pool_leiden', 'pseudobulk_scvi_leiden', 'scpoli_leiden', 'mrvi_leiden', 'wasserstein_scanvi_pool_leiden', 'wasserstein_scanvi_leiden', 'wasserstein_scvi_pool_leiden', 'wasserstein_scvi_leiden', 'pilot_leiden', 'gloscope_combat_JS_GMM_25_leiden', 'gloscope_combat_KL_GMM_25_leiden', 'gloscope_combat_KL_KNN_25_leiden', 'WHO_ordinal_at_sample', 'Outcome'
    uns: 'Institute_colors', 'Pool_ID_colors', 'Source_colors', 'composition_leiden_colors', 'composition_neighbors', 'encoded_non_numerical_columns', 'encoding_to_var', 'gloscope_combat_JS_GMM_25_leiden_colors', 'gloscope_combat_JS_GMM_25_neighbors', 'gloscope_combat_KL_GMM_25_leiden_colors', 'gloscope_combat_KL_GMM_25_neighbors', 'gloscope_combat_KL_KNN_25_leiden_colors', 'gloscope_combat_KL_KNN_25_neighbors', 'leiden', 'mrvi_leiden_colors', 'mrvi_neighbors', 'non_numerical_columns', 'numerical_columns', 'original_values_categoricals', 'pilot_leiden_colors', 'pilot_neighbors', 'pseudobulk_leiden_colors', 'pseudobulk_neighbors', 'pseudobulk_scanvi_leiden_colors', 'pseudobulk_scanvi_neighbors', 'pseudobulk_scanvi_pool_leiden_colors', 'pseudobulk_scanvi_pool_neighbors', 'pseudobulk_scpoli_leiden_colors', 'pseudobulk_scpoli_neighbors', 'pseudobulk_scvi_leiden_colors', 'pseudobulk_scvi_neighbors', 'pseudobulk_scvi_pool_leiden_colors', 'pseudobulk_scvi_pool_neighbors', 'random_vec_leiden_colors', 'random_vec_neighbors', 'scpoli_leiden_colors', 'scpoli_neighbors', 'var_to_encoding', 'wasserstein_scanvi_leiden_colors', 'wasserstein_scanvi_neighbors', 'wasserstein_scanvi_pool_leiden_colors', 'wasserstein_scanvi_pool_neighbors', 'wasserstein_scvi_leiden_colors', 'wasserstein_scvi_neighbors', 'wasserstein_scvi_pool_leiden_colors', 'wasserstein_scvi_pool_neighbors'
    obsm: 'composition_UMAP', 'composition_distances', 'gloscope_combat_JS_GMM_25_UMAP', 'gloscope_combat_JS_GMM_25_distances', 'gloscope_combat_KL_GMM_25_UMAP', 'gloscope_combat_KL_GMM_25_distances', 'gloscope_combat_KL_KNN_25_UMAP', 'gloscope_combat_KL_KNN_25_distances', 'mrvi_UMAP', 'mrvi_distances', 'pilot_UMAP', 'pilot_distances', 'pseudobulk_UMAP', 'pseudobulk_distances', 'pseudobulk_scanvi_UMAP', 'pseudobulk_scanvi_distances', 'pseudobulk_scanvi_pool_UMAP', 'pseudobulk_scanvi_pool_distances', 'pseudobulk_scpoli_UMAP', 'pseudobulk_scpoli_distances', 'pseudobulk_scvi_UMAP', 'pseudobulk_scvi_distances', 'pseudobulk_scvi_pool_UMAP', 'pseudobulk_scvi_pool_distances', 'random_vec_UMAP', 'random_vec_distances', 'scpoli_UMAP', 'scpoli_distances', 'umap', 'wasserstein_scanvi_UMAP', 'wasserstein_scanvi_distances', 'wasserstein_scanvi_pool_UMAP', 'wasserstein_scanvi_pool_distances', 'wasserstein_scvi_UMAP', 'wasserstein_scvi_distances', 'wasserstein_scvi_pool_UMAP', 'wasserstein_scvi_pool_distances'
    layers: 'original'
    obsp: 'composition_neighbors_connectivities', 'composition_neighbors_distances', 'gloscope_combat_JS_GMM_25_neighbors_connectivities', 'gloscope_combat_JS_GMM_25_neighbors_distances', 'gloscope_combat_KL_GMM_25_neighbors_connectivities', 'gloscope_combat_KL_GMM_25_neighbors_distances', 'gloscope_combat_KL_KNN_25_neighbors_connectivities', 'gloscope_combat_KL_KNN_25_neighbors_distances', 'mrvi_neighbors_connectivities', 'mrvi_neighbors_distances', 'pilot_neighbors_connectivities', 'pilot_neighbors_distances', 'pseudobulk_neighbors_connectivities', 'pseudobulk_neighbors_distances', 'pseudobulk_scanvi_neighbors_connectivities', 'pseudobulk_scanvi_neighbors_distances', 'pseudobulk_scanvi_pool_neighbors_connectivities', 'pseudobulk_scanvi_pool_neighbors_distances', 'pseudobulk_scpoli_neighbors_connectivities', 'pseudobulk_scpoli_neighbors_distances', 'pseudobulk_scvi_neighbors_connectivities', 'pseudobulk_scvi_neighbors_distances', 'pseudobulk_scvi_pool_neighbors_connectivities', 'pseudobulk_scvi_pool_neighbors_distances', 'random_vec_neighbors_connectivities', 'random_vec_neighbors_distances', 'scpoli_neighbors_connectivities', 'scpoli_neighbors_distances', 'wasserstein_scanvi_neighbors_connectivities', 'wasserstein_scanvi_neighbors_distances', 'wasserstein_scanvi_pool_neighbors_connectivities', 'wasserstein_scanvi_pool_neighbors_distances', 'wasserstein_scvi_neighbors_connectivities', 'wasserstein_scvi_neighbors_distances', 'wasserstein_scvi_pool_neighbors_connectivities', 'wasserstein_scvi_pool_neighbors_distances'

Here, we’ll focus on the GloScope representation obtained on scPoli cell embeddings. Notably, this method doesn’t use any information about patients: neither their clinical status nor even cell type annotation of the data. However, as we’ll see, it provides extremely good patient representation.

representation = "gloscope_combat_KL_KNN_25"
# Set obsm key so that standart umap plotting uses this representation
meta_adata.obsm["umap"] = meta_adata.obsm[f"{representation}_UMAP"]
# Plot UMAP, just like in scanpy. Note that every dot is a donor here!
ep.pl.umap(meta_adata, color=[f"{representation}_leiden", "Source"])
../../_images/18d5e7c7cb02821355c16c2b53a4760bf7206707344204cad68a4ecf4fc46d6c.png

Let’s rename some of the columns for a better readability:

meta_adata.obs["Source"] = meta_adata.obs["Source"].map(
    {
        "HV": "Healthy",
        "COVID_MILD": "Mild",
        "COVID_SEV": "Severe",
        "COVID_CRIT": "Critical",
        "COVID_HCW_MILD": "Mild",  # These are healthcare workers with mild disease
        "COVID_LDN": "Critical",
        "Sepsis": "Sepsis",
        "Flu": "Flu",
    }
)
source_palette = {
    "Healthy": "#1f77b4",  # HV
    "Mild": "#ff9999",  # COVID_MILD
    "Severe": "#ff4c4c",  # COVID_SEV
    "Critical": sns.color_palette("Reds")[-1],  # COVID_CRIT
    # "Other": "darkgrey",
    "Sepsis": "#8c564b",  # Sepsis
    "Flu": "#ff7f0e",  # Flu
}
no_box = {
    "axes.spines.top": False,
    "axes.spines.right": False,
    "axes.spines.left": False,
    "axes.spines.bottom": False,
    "xtick.bottom": False,
    "xtick.labelbottom": False,
    "ytick.left": False,
    "ytick.labelleft": False,
}
plot_params = {"figure.figsize": (5, 5), "figure.dpi": 150, "axes.grid": False}
plot_params.update(no_box)
with plt.rc_context(plot_params):
    sns.scatterplot(
        x=meta_adata.obsm[f"{representation}_UMAP"][:, 0],
        y=meta_adata.obsm[f"{representation}_UMAP"][:, 1],
        hue=meta_adata.obs["Source"],
        palette=source_palette,
        hue_order=source_palette.keys(),
        s=150,
    )
    plt.legend(loc=(1.05, 0.6), title="Condition")
../../_images/3ec4018236bd98e12c261123712ea52d1ce0a3d3c72e272bfb687389b2548b69.png

Note how all the healthy donors group together, and the severity gradient to the top. This is remarkable because sample representation methods knows nothing about donors condition and only evaluates how similar their transcriptomics profiles are. There are also some mild and even a septic patient grouping together with healthy. Are those errors of the method or misannotations?

With ehrapy, we can apply some single-cell methods to our data. For example, having a sample representation, we can discover trajectories in it via diffusion pseudotime. For this, we need to set the start of trajectory. Let’s take a look at the lowest patient on the UMAP:

control_donor_idx = meta_adata.obsm[f"{representation}_UMAP"][:, 1].argmin()
control_donor = meta_adata.obs_names[control_donor_idx]
meta_adata.obs.loc[control_donor, "Source"]
'Healthy'

Some variables are stored in meta_adata.X rather then in .obs. You can check which ones by running meta_adata.var_names

meta_adata[control_donor, ["Age", "Sex", "BMI", "Smoking"]].X
ArrayView([[74.,  0.,  5.,  0.]], dtype=float32)

Ok, so we have a 74-years old non-smoking individual. Perhaps, somebody younger would be a better control but this should be a good start. Also, the dataset has a bit of bias for the older population:

sns.histplot(meta_adata[:, "Age"].X)
<Axes: ylabel='Count'>
../../_images/dc7bf4e2fc4c952f65842bcc472fc0efbc7b2a82b3c3e761bbdbec7fdf6cf062.png

Let’s set this individual as a root for pseudotime, and run the diffusion:

meta_adata.uns["iroot"] = np.flatnonzero(meta_adata.obs_names == control_donor)[0]
ep.tl.dpt(meta_adata, neighbors_key=f"{representation}_neighbors")
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.

We now have a contunuous score from 0 (for the root we set) to 1 (for a most dissimilar sample), putting every donor on a continuous trajectory

with plt.rc_context(plot_params):
    sns.scatterplot(
        x=meta_adata.obsm[f"{representation}_UMAP"][:, 0],
        y=meta_adata.obsm[f"{representation}_UMAP"][:, 1],
        hue=meta_adata.obs["dpt_pseudotime"],
        s=150,
        palette="Reds",
    )
    plt.legend(loc=(1.05, 0.6), title="Diffusion pseudotime")
../../_images/b9b9e788ae7471ae08ba66418ff16fae3ae0296432e15b7abdcac09fd6ce3e40.png

But is this score meaningful? Let’s take a look at how it corresponds to the severity:

sns.boxplot(
    x=meta_adata.obs["dpt_pseudotime"],
    y=meta_adata.obs["Source"],
    palette=source_palette,
    order=[key for key in source_palette.keys() if key in meta_adata.obs["Source"].unique()],
)
plt.xlabel("Diffusion pseudotime (severity score)")
plt.ylabel("")
Text(0, 0.5, '')
../../_images/a46de19d8e23272c42a622d3a0a65ea560ba547421298cdb0947878ac68ee587.png

We see that higher severity patients have a higher pseudotime score! But instead of a coarse clinical resolution, we now have a continuous trajectory. We now can take a look at what correlates with this trajectory. It could be:

  • Cell type proportions

  • Gene expression

For the former, let’s calculate cell type composition for each patient. You can easily do it with patpy package:

cell_type_corrs = patpy.tl.correlate_composition(
    meta_adata=meta_adata,
    expression_adata=adata,
    sample_key=sample_key,
    cell_type_key=cell_type_key,
    target="dpt_pseudotime",
)
cell_type_corrs
correlation p_value p_value_adj -log_p_value_adj
PB 0.451239 3.131071e-08 5.322821e-07 14.446092
PLT 0.435727 1.028075e-07 5.825759e-07 14.355806
DC -0.438586 8.293978e-08 5.825759e-07 14.355806
cMono 0.430760 1.485891e-07 6.315037e-07 14.275162
MAIT -0.422133 2.778327e-07 9.446311e-07 13.872471
HSC 0.339631 4.902797e-05 1.389126e-04 8.881666
NK -0.333424 6.850823e-05 1.663771e-04 8.701253
iNKT -0.298163 4.019908e-04 8.542304e-04 7.065310
CD8 -0.291742 5.421241e-04 1.024012e-03 6.884027
DN -0.238690 4.970933e-03 7.682351e-03 4.868830
GDT -0.240101 4.712976e-03 7.682351e-03 4.868830
DP -0.231360 6.524369e-03 9.242856e-03 4.683904
ncMono -0.220700 9.553095e-03 1.249251e-02 4.382626
RET 0.205614 1.593459e-02 1.934915e-02 3.945107
CD4 -0.193011 2.383841e-02 2.701686e-02 3.611294
B 0.172737 4.353834e-02 4.625949e-02 3.073489
Mast -0.051713 5.484133e-01 5.484133e-01 0.600726

Note that in the COMBAT paper, authors define the following hallmarks of COVID severity:

  • Increased level of plasmoblasts (PB), classical monocytes (cMono), platelets (PLT), hemapoetic stem cells (HSCs)

  • Decreased levels of dendritic cells (DC), NK cells, T

This is precicely what we see! Interestingly, But for MAIT, our results contradict the paper: they suggest that this cell type should be positively correlated with severity.

But are there particular cell states or gene progrmas that explain COVID? To find them, let’s take a look at the average gene expression within each cell type, and correlate it with our severity score

expression_dpt_correlation_df = patpy.tl.correlate_cell_type_expression(
    meta_adata=meta_adata,
    expression_adata=adata,
    sample_key=sample_key,
    cell_type_key=cell_type_key,
    target="dpt_pseudotime",
    layer="X",
    min_sample_size=500,
)
1 samples removed: U00504-Ua005E-PBUa
expression_dpt_correlation_df
cell_type gene_name correlation p_value n_observations p_value_adj -log_p_value_adj
9161 cMono S100A8 0.884355 1.671033e-46 137 8.079777e-42 94.619210
11813 cMono NKG7 0.876631 1.003818e-44 137 2.426831e-40 91.216818
11694 cMono MCEMP1 0.868571 5.428303e-43 137 8.748976e-39 87.631882
11946 cMono LGALS1 0.867283 1.002285e-42 137 1.211562e-38 87.306323
11277 cMono CYP19A1 0.866300 1.593919e-42 137 1.541384e-38 87.065553
... ... ... ... ... ... ... ...
49299 Mast CRLF2 -0.074605 6.344566e-01 43 1.000000e+00 -0.000000
49584 Mast LCN6 -0.074605 6.344566e-01 43 1.000000e+00 -0.000000
49651 Mast TCN1 -0.074605 6.344566e-01 43 1.000000e+00 -0.000000
49827 Mast GLRX3 -0.075151 6.319715e-01 43 1.000000e+00 -0.000000
49645 Mast PRG2 -0.075699 6.294827e-01 43 1.000000e+00 -0.000000

48352 rows × 7 columns

sns.histplot(expression_dpt_correlation_df["correlation"]);
../../_images/f7460ca43076538ef1eb26af6089404f4d6805e3f821ec8c8fe7127850af814e.png
sns.histplot(expression_dpt_correlation_df["p_value"]);
../../_images/6d836863a59276271970611168c3d6a0a023b205f9ef00d593be362396235df3.png

The p-values distribution above is not uniform, and has a peak around 0, which suggests that there are some genes with statistically significant correlation. Let’s take a look at top correlated genes:

expression_dpt_correlation_df.head(50)
cell_type gene_name correlation p_value n_observations p_value_adj -log_p_value_adj
9161 cMono S100A8 0.884355 1.671033e-46 137 8.079777e-42 94.619210
11813 cMono NKG7 0.876631 1.003818e-44 137 2.426831e-40 91.216818
11694 cMono MCEMP1 0.868571 5.428303e-43 137 8.748976e-39 87.631882
11946 cMono LGALS1 0.867283 1.002285e-42 137 1.211562e-38 87.306323
11277 cMono CYP19A1 0.866300 1.593919e-42 137 1.541384e-38 87.065553
10655 cMono MS4A4A 0.863694 5.352660e-42 137 4.313530e-38 86.036477
9159 cMono S100A9 0.848246 4.359564e-39 137 3.011338e-35 79.488094
10026 cMono HLA-DMB -0.844671 1.849540e-38 137 1.117862e-34 78.176475
10528 cMono ALDH1A1 -0.844298 2.146335e-38 137 1.153106e-34 78.145434
9160 cMono S100A12 0.816351 5.465913e-34 137 2.642878e-30 68.105684
10417 cMono CLU 0.811288 2.856435e-33 137 1.255585e-29 66.547366
10894 cMono TROAP 0.810760 3.383662e-33 137 1.363390e-29 66.464993
9011 cMono ENO1 0.809244 5.490735e-33 137 2.042215e-29 66.060932
11035 cMono RNASE2 0.804894 2.149305e-32 137 7.423085e-29 64.770373
10918 cMono METTL7B 0.804443 2.471023e-32 137 7.965261e-29 64.699878
10679 cMono CFL1 0.801459 6.164316e-32 137 1.862856e-28 63.850272
10623 cMono LDHA 0.793166 7.222266e-31 137 2.054182e-27 61.449920
10921 cMono MYL6B 0.784690 7.972819e-30 137 2.141676e-26 59.105624
11347 cMono CIITA -0.768981 5.193181e-28 137 1.321583e-24 54.983212
10599 cMono CTSD 0.765873 1.141259e-27 137 2.759108e-24 54.247135
10030 cMono HLA-DPB1 -0.764123 1.768498e-27 137 4.071925e-24 53.857926
7616 ncMono ADM 0.762942 2.371437e-27 137 5.211987e-24 53.611081
10545 cMono SEC61B 0.762414 2.701950e-27 137 5.602700e-24 53.538794
10665 cMono FTH1 -0.762298 2.780956e-27 137 5.602700e-24 53.538794
10444 cMono GGH 0.761924 3.049297e-27 137 5.897585e-24 53.487499
9758 cMono ANXA3 0.758692 6.721239e-27 137 1.249944e-23 52.736359
11034 cMono RNASE3 0.755460 1.463405e-26 137 2.620688e-23 51.996020
10179 cMono HSPB1 0.755162 1.571551e-26 137 2.713844e-23 51.961091
11441 cMono TXNDC17 0.752651 2.850176e-26 137 4.752128e-23 51.400865
10386 cMono HMGB3 0.751162 4.043102e-26 137 6.516403e-23 51.085135
9287 cMono CENPF 0.750981 4.218092e-26 137 6.579135e-23 51.075554
10696 cMono FOLR2 -0.750695 4.509140e-26 137 6.813311e-23 51.040579
9191 cMono CD1C -0.750489 4.731778e-26 137 6.933059e-23 51.023156
11650 cMono SLPI 0.747222 1.007990e-25 137 1.433480e-22 50.296767
10552 cMono TXN 0.746588 1.165624e-25 137 1.610293e-22 50.180456
9174 cMono FDPS 0.742649 2.849516e-25 137 3.827217e-22 49.314734
9355 cMono BOLA3 0.740722 4.387039e-25 137 5.733030e-22 48.910628
10021 cMono HLA-DQA1 -0.737926 8.147764e-25 137 1.036739e-21 48.318207
14921 CD4 MIF 0.737698 8.568028e-25 137 1.062260e-21 48.293888
10018 cMono HLA-DRA -0.734641 1.669887e-24 137 2.018560e-21 47.651903
9866 cMono MEF2C -0.734319 1.790537e-24 137 2.111611e-21 47.606836
10166 cMono SEC61G 0.730622 3.959189e-24 137 4.557969e-21 46.837410
11566 cMono BIRC5 0.729768 4.747083e-24 137 5.337930e-21 46.679449
10848 cMono CLEC4A -0.728686 5.969285e-24 137 6.559701e-21 46.473342
13623 CD4 LDHA 0.725540 1.154193e-23 137 1.240167e-20 45.836456
9228 cMono F5 0.724369 1.471996e-23 137 1.547260e-20 45.615216
10916 cMono ZNF385A -0.724229 1.515288e-23 137 1.558876e-20 45.607737
11573 cMono P4HB 0.722814 2.028650e-23 137 2.043527e-20 45.337025
11667 cMono SLCO4A1 0.720362 3.350763e-23 137 3.306451e-20 44.855826
10826 cMono MKI67 0.719580 3.927259e-23 137 3.733561e-20 44.734339

This list is largely dominated by classical monocytes. Look at the top correlated gene! In the paper, authors write that S100A8 is a known biomarker, and its expression in cMonocytes correlates with severity. Other markers they mention (S100A9, S100A12, negative marker HLA-DRA) can also be found in the table. Thus, our severity score works

Let’s visualize the correlated genes in classical monocytes as they by far dominate in the table. Average expression per sample is now stored in meta_adata.obsm in keys {cell_type}_pseudobulk. This is controlled by the argument keep_pseudobulks_in_data in patpy.tl.correlate_cell_type_expression, set it to False if you want to save memory

cmono_expression_df = meta_adata.obsm["cMono_pseudobulk"]
cmono_expression_df
SAMD11 ISG15 TNFRSF18 TNFRSF4 TMEM240 PRDM16 SMIM1 ACOT7 HES2 TAS1R1 ... S100B MT-CO1 MT-CO2 MT-CO3 MT-ND6 MT-CYB AC136616.2 AC141272.1 AC233755.2 AC233755.1
scRNASeq_sample_ID
S00078-Ja003E-PBCa 0.000000 0.666305 0.001183 0.001026 0.000000 0.0 0.007825 0.001336 0.000000 0.000000 ... 0.002201 4.252630 4.054601 3.459366 0.253607 2.783757 0.0 0.000000 0.008152 0.024226
G05064-Ja005E-PBCa 0.000000 0.329458 0.000000 0.001716 0.000000 0.0 0.009752 0.006973 0.000000 0.000000 ... 0.005361 3.937209 3.933448 3.238184 0.120764 2.759863 0.0 0.000000 0.008778 0.001008
N00027-Ja001E-PBGa 0.000000 0.161210 0.000000 0.001394 0.000000 0.0 0.015552 0.001609 0.000000 0.000000 ... 0.010047 4.248434 4.193965 3.502021 0.204099 2.744123 0.0 0.000000 0.012055 0.003364
N00040-Ja001E-PBGa 0.000215 0.186256 0.000875 0.000637 0.000000 0.0 0.002057 0.023947 0.000204 0.000218 ... 0.005497 4.030371 3.831872 3.575325 0.584262 2.572893 0.0 0.000219 0.005552 0.018434
S00066-Ja001E-PBCa 0.000000 0.702043 0.000779 0.000201 0.000000 0.0 0.008773 0.008580 0.000407 0.000000 ... 0.003274 4.106230 3.834374 3.595163 0.707436 2.959860 0.0 0.000255 0.012751 0.018012
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
G05112-Ja005E-PBCa 0.000000 2.783315 0.000000 0.002990 0.000000 0.0 0.005321 0.004147 0.000000 0.000000 ... 0.001852 3.401807 3.298067 2.567472 0.232005 1.962488 0.0 0.000000 0.008089 0.004275
S00042-Ja001E-PBCa 0.000000 0.076010 0.000700 0.000000 0.000000 0.0 0.011200 0.007901 0.000000 0.000987 ... 0.003222 4.230716 4.037877 3.988372 0.513297 2.790164 0.0 0.000000 0.008693 0.024193
S00067-Ja001E-PBCa 0.000000 0.403914 0.000000 0.002425 0.000541 0.0 0.009077 0.005603 0.000926 0.000000 ... 0.030901 4.086512 3.805610 3.605104 0.434191 2.720078 0.0 0.000000 0.032481 0.014675
N00006-Ja003E-PBGa 0.000000 0.219691 0.000000 0.000739 0.000000 0.0 0.018713 0.000641 0.000000 0.000000 ... 0.001720 4.694515 4.340000 4.050896 0.706753 3.369446 0.0 0.000000 0.010999 0.023512
S00113-Ja005E-PBCa 0.000000 0.506833 0.000000 0.000000 0.000000 0.0 0.007888 0.010469 0.000000 0.000000 ... 0.001530 4.045323 4.325161 4.005432 0.671465 3.210887 0.0 0.000000 0.018759 0.026449

137 rows × 3000 columns

top_genes = expression_dpt_correlation_df.sort_values(["p_value", "correlation"]).head(15)
pseudotime = meta_adata.obs["dpt_pseudotime"]

# Set up the plot
fig, axes = plt.subplots(3, 5, figsize=(15, 9))
fig.suptitle("Cell Type gene expression vs Pseudotime", fontsize=16)

# Flatten the axes array for easier iteration
axes_flat = axes.flatten()

# Plot scatterplots for each significant cell type
for i, (_, row) in enumerate(top_genes.iterrows()):
    ax = axes_flat[i]

    gene_expression = cmono_expression_df.loc[pseudotime.index, row["gene_name"]]
    sns.scatterplot(x=pseudotime, y=gene_expression, ax=ax, alpha=0.6)

    # Add regression line
    sns.regplot(x=pseudotime, y=gene_expression, ax=ax, scatter=False, color="red")

    # Set title with correlation value
    correlation = row["correlation"]
    ax.set_title(f"{row['gene_name']}\nr = {correlation:.2f}", fontsize=10)

    # Set labels
    ax.set_xlabel("Pseudotime", fontsize=8)
    ax.set_ylabel("Expression", fontsize=8)

    # Adjust tick label size
    ax.tick_params(axis="both", which="major", labelsize=6)

# Remove any unused subplots
for j in range(i + 1, len(axes_flat)):
    fig.delaxes(axes_flat[j])

plt.tight_layout()
plt.show()
../../_images/bca07297960fc19dbdc832d8644ba37985faff25527df8ec58fe4c91a3a5ca4b.png

There are certainly non-linear dependencies but S100A8 is a great marker indeed. Let’s take a look how it alone explains the severity:

gene_expression = cmono_expression_df.loc[pseudotime.index, "S100A8"]
sns.boxplot(
    x=gene_expression.loc[meta_adata.obs_names],
    y=meta_adata.obs["Source"],
    palette=source_palette,
    order=[key for key in source_palette.keys() if key in meta_adata.obs["Source"].unique()],
)
plt.xlabel("Gene expression")
plt.ylabel("")
Text(0, 0.5, '')
../../_images/d74e2f6e045a846693ce0cf9f048fc5cad059b11505ad276b62f88ee4286d14b.png

Well, our score works a bit better but this is definitely a good marker!

We can also visualize results on a volcano plot:

patpy.pl.correlation_volcano(expression_dpt_correlation_df, top_n=15);
../../_images/0a874cb7cebc3dfe62dc21b98eacc6f37fbf0d72cc661d147bfaa32533d6e141.png

In summary, we have seen an example of sample-level analysis that allowed us to obtain a continuous severity score, find potential errors in the original clinical annotation, and easily rediscover known markers. Applying such methods to less explored areas might help you discover something new. With ehrapy and patpy it only requires running a few lines of code