Basic Cell Clustering Using mCG-5Kb Bins
Contents
Basic Cell Clustering Using mCG-5Kb Bins¶
Content¶
Here we go through the basic steps to perform cell clustering using genome non-overlapping 5Kb bins
as features. We start from hypo-methylation probability data stored in MCDS (quantified by the hypo- or hyper-methylation score option, see allcools generate-dataset
). This notebook can be used to quickly evaluate cell-type composition in a single-cell methylome dataset (e.g., the dataset from a single experiment).
Comparing with the 100Kb bins clustering process, this clustering process is more suitable for samples with low mCH fraction (many non-brain tissues) and narrow methylation diversity (so smaller feature works better).
Dataset used in this notebook¶
Adult (age P56) male mouse brain pituitary (PIT) snmC-seq2 data from [Ruf-Zamojski et al., 2021].
Input¶
MCDS with chrom5k hypo-score matrix
Cell metadata
Output¶
Cell-by-5kb-bin AnnData (sparse matrix) with embedding coordinates and cluster labels.
Import¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
from ALLCools.mcds import MCDS
from ALLCools.clustering import tsne, significant_pc_test, filter_regions, lsi, binarize_matrix
from ALLCools.plot import *
Parameters¶
metadata_path = '../../data/PIT/PIT.CellMetadata.csv.gz'
mcds_path = '../../data/PIT/RufZamojski2021NC.mcds'
# Basic filtering parameters
mapping_rate_cutoff = 0.5
mapping_rate_col_name = 'MappingRate' # Name may change
final_reads_cutoff = 500000
final_reads_col_name = 'FinalmCReads' # Name may change
mccc_cutoff = 0.03
mccc_col_name = 'mCCCFrac' # Name may change
mch_cutoff = 0.2
mch_col_name = 'mCHFrac' # Name may change
mcg_cutoff = 0.5
mcg_col_name = 'mCGFrac' # Name may change
# PC cutoff
pc_cutoff = 0.1
# KNN
knn = -1 # -1 means auto determine
# Leiden
resolution = 1
Load Cell Metadata¶
metadata = pd.read_csv(metadata_path, index_col=0)
print(f'Metadata of {metadata.shape[0]} cells')
metadata.head()
Metadata of 2756 cells
CellInputReadPairs | MappingRate | FinalmCReads | mCCCFrac | mCGFrac | mCHFrac | Plate | Col384 | Row384 | CellTypeAnno | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
PIT_P1-PIT_P2-A1-AD001 | 1858622.0 | 0.685139 | 1612023.0 | 0.003644 | 0.679811 | 0.005782 | PIT_P1 | 0 | 0 | Outlier |
PIT_P1-PIT_P2-A1-AD004 | 1599190.0 | 0.686342 | 1367004.0 | 0.004046 | 0.746012 | 0.008154 | PIT_P1 | 1 | 0 | Gonadotropes |
PIT_P1-PIT_P2-A1-AD006 | 1932242.0 | 0.669654 | 1580990.0 | 0.003958 | 0.683584 | 0.005689 | PIT_P1 | 1 | 1 | Somatotropes |
PIT_P1-PIT_P2-A1-AD007 | 1588505.0 | 0.664612 | 1292770.0 | 0.003622 | 0.735217 | 0.005460 | PIT_P2 | 0 | 0 | Rbpms+ |
PIT_P1-PIT_P2-A1-AD010 | 1738409.0 | 0.703835 | 1539676.0 | 0.003769 | 0.744640 | 0.006679 | PIT_P2 | 1 | 0 | Rbpms+ |
Filter Cells¶
judge = (metadata[mapping_rate_col_name] > mapping_rate_cutoff) & \
(metadata[final_reads_col_name] > final_reads_cutoff) & \
(metadata[mccc_col_name] < mccc_cutoff) & \
(metadata[mch_col_name] < mch_cutoff) & \
(metadata[mcg_col_name] > mcg_cutoff)
metadata = metadata[judge].copy()
# cell metadata for this example is filtered already
print(f'{metadata.shape[0]} cells passed filtering')
2756 cells passed filtering
Load hypo-methylation score matrix from MCDS¶
mcds = MCDS.open(mcds_path, var_dim='chrom5k')
mcds.add_cell_metadata(metadata)
mcds = mcds.remove_black_list_region(black_list_path='../../data/genome/mm10-blacklist.v2.bed.gz')
49908 chrom5k features removed due to overlapping (bedtools intersect -f 0.2) with black list regions.
mcad = mcds.get_score_adata(mc_type='CGN', quant_type='hypo-score')
mcad
AnnData object with n_obs × n_vars = 2756 × 495206
obs: 'CellInputReadPairs', 'MappingRate', 'FinalmCReads', 'mCCCFrac', 'mCGFrac', 'mCHFrac', 'Plate', 'Col384', 'Row384', 'CellTypeAnno'
var: 'chrom', 'end', 'start'
Binarize¶
binarize_matrix(mcad, cutoff=0.95)
Filter Features¶
filter_regions(mcad)
Filter out 206777 regions with # of non-zero cells <= 13
array([False, False, False, ..., True, True, True])
mcad
AnnData object with n_obs × n_vars = 2756 × 288429
obs: 'CellInputReadPairs', 'MappingRate', 'FinalmCReads', 'mCCCFrac', 'mCGFrac', 'mCHFrac', 'Plate', 'Col384', 'Row384', 'CellTypeAnno'
var: 'chrom', 'end', 'start'
TF-IDF Transform and Dimension Reduction¶
# by default we save the results in adata.obsm['X_pca'] which is the scanpy defaults in many following functions
# But this matrix is not calculated by PCA
lsi(mcad, algorithm='arpack', obsm='X_pca')
TruncatedSVD(algorithm='arpack', n_components=100, random_state=0)
# choose significant components
significant_pc_test(mcad, p_cutoff=pc_cutoff, update=True)
13 components passed P cutoff of 0.1.
Changing adata.obsm['X_pca'] from shape (2756, 100) to (2756, 13)
13
Clustering¶
Calculate Nearest Neighbors¶
if knn == -1:
knn = max(15, int(np.log2(mcad.shape[0])*2))
sc.pp.neighbors(mcad, n_neighbors=knn)
Leiden Clustering¶
sc.tl.leiden(mcad, resolution=resolution)
Manifold learning¶
tSNE¶
tsne(mcad,
obsm='X_pca',
metric='euclidean',
exaggeration=-1, # auto determined
perplexity=30,
n_jobs=-1)
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
_ = categorical_scatter(data=mcad,
ax=ax,
coord_base='tsne',
hue='leiden',
text_anno='leiden',
show_legend=True)
UMAP¶
sc.tl.umap(mcad)
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
_ = categorical_scatter(data=mcad,
ax=ax,
coord_base='umap',
hue='leiden',
text_anno='leiden',
show_legend=True)
Interactive plot¶
interactive_scatter(data=mcad, hue='leiden', coord_base='umap')
Save results¶
mcad.write_h5ad('PIT.mCG-5K-clustering.h5ad')
mcad
mcad.obs.to_csv('PIT.ClusteringResults.csv.gz')
mcad.obs.head()
Sanity test¶
This test dataset come from [Ruf-Zamojski et al., 2021], so we already annotated the cell types. For new datasets, see following notebooks about identifying cluster markers and annotate clusters
if 'CellTypeAnno' in mcad.obs:
mcad.obs['CellTypeAnno'] = mcad.obs['CellTypeAnno'].fillna('Outlier')
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
_ = categorical_scatter(data=mcad,
ax=ax,
coord_base='umap',
hue='CellTypeAnno',
text_anno='CellTypeAnno',
palette='tab20',
show_legend=True)