Basic Cell Clustering Using mCG-5Kb Bins

Content

Here we go through the basic steps to perform cell clustering using genome non-overlapping 5Kb bins as features. We start from hypo-methylation probability data stored in MCDS (quantified by the hypo- or hyper-methylation score option, see allcools generate-dataset). This notebook can be used to quickly evaluate cell-type composition in a single-cell methylome dataset (e.g., the dataset from a single experiment). Comparing with the 100Kb bins clustering process, this clustering process is more suitable for samples with low mCH fraction (many non-brain tissues) and narrow methylation diversity (so smaller feature works better).

Dataset used in this notebook

Input

  • MCDS with chrom5k hypo-score matrix

  • Cell metadata

Output

  • Cell-by-5kb-bin AnnData (sparse matrix) with embedding coordinates and cluster labels.

Import

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
from ALLCools.mcds import MCDS
from ALLCools.clustering import tsne, significant_pc_test, filter_regions, lsi, binarize_matrix
from ALLCools.plot import *

Parameters

metadata_path = '../../data/PIT/PIT.CellMetadata.csv.gz'
mcds_path = '../../data/PIT/RufZamojski2021NC.mcds'

# Basic filtering parameters
mapping_rate_cutoff = 0.5
mapping_rate_col_name = 'MappingRate'  # Name may change
final_reads_cutoff = 500000
final_reads_col_name = 'FinalmCReads'  # Name may change
mccc_cutoff = 0.03
mccc_col_name = 'mCCCFrac'  # Name may change
mch_cutoff = 0.2
mch_col_name = 'mCHFrac'  # Name may change
mcg_cutoff = 0.5
mcg_col_name = 'mCGFrac'  # Name may change

# PC cutoff
pc_cutoff = 0.1

# KNN
knn = -1  # -1 means auto determine

# Leiden
resolution = 1

Load Cell Metadata

metadata = pd.read_csv(metadata_path, index_col=0)
print(f'Metadata of {metadata.shape[0]} cells')
metadata.head()
Metadata of 2756 cells
CellInputReadPairs MappingRate FinalmCReads mCCCFrac mCGFrac mCHFrac Plate Col384 Row384 CellTypeAnno
index
PIT_P1-PIT_P2-A1-AD001 1858622.0 0.685139 1612023.0 0.003644 0.679811 0.005782 PIT_P1 0 0 Outlier
PIT_P1-PIT_P2-A1-AD004 1599190.0 0.686342 1367004.0 0.004046 0.746012 0.008154 PIT_P1 1 0 Gonadotropes
PIT_P1-PIT_P2-A1-AD006 1932242.0 0.669654 1580990.0 0.003958 0.683584 0.005689 PIT_P1 1 1 Somatotropes
PIT_P1-PIT_P2-A1-AD007 1588505.0 0.664612 1292770.0 0.003622 0.735217 0.005460 PIT_P2 0 0 Rbpms+
PIT_P1-PIT_P2-A1-AD010 1738409.0 0.703835 1539676.0 0.003769 0.744640 0.006679 PIT_P2 1 0 Rbpms+

Filter Cells

judge = (metadata[mapping_rate_col_name] > mapping_rate_cutoff) & \
        (metadata[final_reads_col_name] > final_reads_cutoff) & \
        (metadata[mccc_col_name] < mccc_cutoff) & \
        (metadata[mch_col_name] < mch_cutoff) & \
        (metadata[mcg_col_name] > mcg_cutoff)

metadata = metadata[judge].copy()
# cell metadata for this example is filtered already
print(f'{metadata.shape[0]} cells passed filtering')
2756 cells passed filtering

Load hypo-methylation score matrix from MCDS

mcds = MCDS.open(mcds_path, var_dim='chrom5k')
mcds.add_cell_metadata(metadata)
mcds = mcds.remove_black_list_region(black_list_path='../../data/genome/mm10-blacklist.v2.bed.gz')
49908 chrom5k features removed due to overlapping (bedtools intersect -f 0.2) with black list regions.
mcad = mcds.get_score_adata(mc_type='CGN', quant_type='hypo-score')
mcad
AnnData object with n_obs × n_vars = 2756 × 495206
    obs: 'CellInputReadPairs', 'MappingRate', 'FinalmCReads', 'mCCCFrac', 'mCGFrac', 'mCHFrac', 'Plate', 'Col384', 'Row384', 'CellTypeAnno'
    var: 'chrom', 'end', 'start'

Binarize

binarize_matrix(mcad, cutoff=0.95)

Filter Features

filter_regions(mcad)
Filter out 206777 regions with # of non-zero cells <= 13
array([False, False, False, ...,  True,  True,  True])
mcad
AnnData object with n_obs × n_vars = 2756 × 288429
    obs: 'CellInputReadPairs', 'MappingRate', 'FinalmCReads', 'mCCCFrac', 'mCGFrac', 'mCHFrac', 'Plate', 'Col384', 'Row384', 'CellTypeAnno'
    var: 'chrom', 'end', 'start'

TF-IDF Transform and Dimension Reduction

# by default we save the results in adata.obsm['X_pca'] which is the scanpy defaults in many following functions
# But this matrix is not calculated by PCA
lsi(mcad, algorithm='arpack', obsm='X_pca')
TruncatedSVD(algorithm='arpack', n_components=100, random_state=0)
# choose significant components
significant_pc_test(mcad, p_cutoff=pc_cutoff, update=True)
13 components passed P cutoff of 0.1.
Changing adata.obsm['X_pca'] from shape (2756, 100) to (2756, 13)
13

Clustering

Calculate Nearest Neighbors

if knn == -1:
    knn = max(15, int(np.log2(mcad.shape[0])*2))
sc.pp.neighbors(mcad, n_neighbors=knn)

Leiden Clustering

sc.tl.leiden(mcad, resolution=resolution)

Manifold learning

tSNE

tsne(mcad,
     obsm='X_pca',
     metric='euclidean',
     exaggeration=-1,  # auto determined
     perplexity=30,
     n_jobs=-1)
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
_ = categorical_scatter(data=mcad,
                        ax=ax,
                        coord_base='tsne',
                        hue='leiden',
                        text_anno='leiden',
                        show_legend=True)
../../_images/mcg_5kb_basic_30_0.png

UMAP

sc.tl.umap(mcad)
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
_ = categorical_scatter(data=mcad,
                        ax=ax,
                        coord_base='umap',
                        hue='leiden',
                        text_anno='leiden',
                        show_legend=True)
../../_images/mcg_5kb_basic_33_0.png

Interactive plot

interactive_scatter(data=mcad, hue='leiden', coord_base='umap')

Save results

mcad.write_h5ad('PIT.mCG-5K-clustering.h5ad')
mcad
mcad.obs.to_csv('PIT.ClusteringResults.csv.gz')
mcad.obs.head()

Sanity test

This test dataset come from [Ruf-Zamojski et al., 2021], so we already annotated the cell types. For new datasets, see following notebooks about identifying cluster markers and annotate clusters

if 'CellTypeAnno' in mcad.obs:
    mcad.obs['CellTypeAnno'] = mcad.obs['CellTypeAnno'].fillna('Outlier')
    
    fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
    _ = categorical_scatter(data=mcad,
                            ax=ax,
                            coord_base='umap',
                            hue='CellTypeAnno',
                            text_anno='CellTypeAnno',
                            palette='tab20',
                            show_legend=True)
../../_images/mcg_5kb_basic_40_0.png