Cell Basic Filtering

Content

The purpose of this step is to get rid of cells having obvious issues, including the cells with low mapping rate (potentially contaminated), low final reads (empty well or lost a large amount of DNA during library prep.), or abnormal methylation fractions (failed in bisulfite conversion or contaminated).

We have two principles when applying these filters:

  1. We set the cutoff based on the distribution of the whole dataset, where we assume the input dataset is largely successful (mostly > 80-90% cells will pass QC). The cutoffs below are typical values we used in brain methylome analysis. Still, you may need to adjust cutoffs based on different data quality or sample source.

  2. The cutoff is intended to be loose. We do not use stringent cutoffs here to prevent potential data loss. Abormal cells may remain after basic filtering, and will likely be identified in the analysis based filtering (see later notebooks about doublet score and outliers in clustering)

Input

  • Cell metadata table that contains mapping metric for basic QC filtering.

Output

  • Filtered cell metadata table that contains only cells passed QC.

About Cell Mapping Metrics

We usually gather many mapping metrics from each processing step, but not all of the metrics are relevant to the cell filtering. Below are the most relevant metrics that we use to filter cells. The name of these metrics might be different in your dataset. Change it according to the file you have.

If you use YAP to do mapping, you can find up-to-date mapping metrics documentation for key metrics and all metrics in YAP doc.

Import

import pandas as pd
import seaborn as sns
sns.set_context(context='notebook', font_scale=1.3)

Parameters

# change this to the path to your metadata
metadata_path = '../../../data/PIT/PIT.CellMetadata.csv.gz'

# Basic filtering parameters
mapping_rate_cutoff = 0.5
mapping_rate_col_name = 'MappingRate'  # Name may change
final_reads_cutoff = 500000
final_reads_col_name = 'FinalmCReads'  # Name may change
mccc_cutoff = 0.03
mccc_col_name = 'mCCCFrac'  # Name may change
mch_cutoff = 0.2
mch_col_name = 'mCHFrac'  # Name may change
mcg_cutoff = 0.5
mcg_col_name = 'mCGFrac'  # Name may change

Load metadata

metadata = pd.read_csv(metadata_path, index_col=0)
total_cells = metadata.shape[0]
print(f'Metadata of {total_cells} cells')
Metadata of 2756 cells
metadata.head()
CellInputReadPairs MappingRate FinalmCReads mCCCFrac mCGFrac mCHFrac Plate Col384 Row384 CellTypeAnno
index
PIT_P1-PIT_P2-A1-AD001 1858622.0 0.685139 1612023.0 0.003644 0.679811 0.005782 PIT_P1 0 0 Outlier
PIT_P1-PIT_P2-A1-AD004 1599190.0 0.686342 1367004.0 0.004046 0.746012 0.008154 PIT_P1 1 0 Gonadotropes
PIT_P1-PIT_P2-A1-AD006 1932242.0 0.669654 1580990.0 0.003958 0.683584 0.005689 PIT_P1 1 1 Somatotropes
PIT_P1-PIT_P2-A1-AD007 1588505.0 0.664612 1292770.0 0.003622 0.735217 0.005460 PIT_P2 0 0 Rbpms+
PIT_P1-PIT_P2-A1-AD010 1738409.0 0.703835 1539676.0 0.003769 0.744640 0.006679 PIT_P2 1 0 Rbpms+

Filter by key mapping metrics

Bismark Mapping Rate

  • Low mapping rate indicates potential contamination.

  • Usually R1 mapping rate is 8-10% higher than R2 mapping rate for snmC based technologies, but they should be highly correlated. Here I am using the combined mapping rate. If you are using the R1MappingRate or R2MappingRate, change the cutoff accordingly.

  • Usually there is a peak on the left, which corresponding to the empty wells.

_cutoff = mapping_rate_cutoff
_col_name = mapping_rate_col_name

# plot distribution to make sure cutoff is appropriate
g = sns.displot(metadata[_col_name], binrange=(0, 1))
g.ax.plot((_cutoff, _cutoff), g.ax.get_ylim(), c='r', linestyle='--')

mapping_rate_judge = metadata[_col_name] > _cutoff
_passed_cells = mapping_rate_judge.sum()
print(
    f'{_passed_cells} / {total_cells} cells ({_passed_cells / total_cells * 100:.1f}%) '
    f'passed the {_col_name} cutoff {_cutoff}.')
2756 / 2756 cells (100.0%) passed the MappingRate cutoff 0.5.
../../../_images/01-CellBasicFiltering_11_11.png

Final Reads

  • The cutoff may change depending on how deep the library has been sequenced.

  • Usually there is a peak on the left, which corresponding to the empty wells.

  • There are also some cells having small number of reads, these wells may lost most of the DNA during library prep. Cells having too less reads can be hard to classify, since the methylome sequencing is an untargeted whole-genome sequencing.

_cutoff = final_reads_cutoff
_col_name = final_reads_col_name

# plot distribution to make sure cutoff is appropriate
g = sns.displot(metadata[_col_name], binrange=(0, 5e6))
g.ax.plot((_cutoff, _cutoff), g.ax.get_ylim(), c='r', linestyle='--')

final_reads_judge = metadata[_col_name] > _cutoff
_passed_cells = final_reads_judge.sum()
print(
    f'{_passed_cells} / {total_cells} cells ({_passed_cells / total_cells * 100:.1f}%) '
    f'passed the {_col_name} cutoff {_cutoff}.')
2756 / 2756 cells (100.0%) passed the FinalmCReads cutoff 500000.
../../../_images/01-CellBasicFiltering_13_11.png

mCCC / CCC

  • The mCCC fraction is used as the proxy of the upper bound of the non-conversion rate for cell-level QC. The methylation level at CCC sites is the lowest among all of the different 3 base-contexts (CNN), and, in fact, it is very close to the unmethylated lambda mC fraction.

  • However, mCCC fraction is correlated with mCH (especially in brain data), so you can see a similar shape of distribution of mCCC and mCH, but the range is different.

_cutoff = mccc_cutoff
_col_name = mccc_col_name

# plot distribution to make sure cutoff is appropriate
g = sns.displot(metadata[_col_name], binrange=(0, 0.05))
g.ax.plot((_cutoff, _cutoff), g.ax.get_ylim(), c='r', linestyle='--')

mccc_judge = metadata[_col_name] < _cutoff
_passed_cells = mccc_judge.sum()
print(
    f'{_passed_cells} / {total_cells} cells ({_passed_cells / total_cells * 100:.1f}%) '
    f'passed the {_col_name} cutoff {_cutoff}.')
2756 / 2756 cells (100.0%) passed the mCCCFrac cutoff 0.03.
../../../_images/01-CellBasicFiltering_15_11.png

mCH / CH

  • Usually failed cells (empty well or contaminated) tend to have abormal methylation level as well.

_cutoff = mch_cutoff
_col_name = mch_col_name

# plot distribution to make sure cutoff is appropriate
g = sns.displot(metadata[_col_name], binrange=(0, 0.3))
g.ax.plot((_cutoff, _cutoff), g.ax.get_ylim(), c='r', linestyle='--')

mch_judge = metadata[_col_name] < _cutoff
_passed_cells = mch_judge.sum()
print(
    f'{_passed_cells} / {total_cells} cells ({_passed_cells / total_cells * 100:.1f}%) '
    f'passed the {_col_name} cutoff {_cutoff}.')
2756 / 2756 cells (100.0%) passed the mCHFrac cutoff 0.2.
../../../_images/01-CellBasicFiltering_17_11.png

mCG

  • Usually failed cells (empty well or contaminated) tend to have abormal methylation level as well.

_cutoff = mcg_cutoff
_col_name = mcg_col_name

# plot distribution to make sure cutoff is appropriate
g = sns.displot(metadata[_col_name], binrange=(0.3, 1))
g.ax.plot((_cutoff, _cutoff), g.ax.get_ylim(), c='r', linestyle='--')

mcg_judge = metadata[_col_name] > _cutoff
_passed_cells = mcg_judge.sum()
print(
    f'{_passed_cells} / {total_cells} cells ({_passed_cells / total_cells * 100:.1f}%) '
    f'passed the {_col_name} cutoff {_cutoff}.')
2756 / 2756 cells (100.0%) passed the mCGFrac cutoff 0.5.
../../../_images/01-CellBasicFiltering_19_11.png

Combine filters

judge = mapping_rate_judge & final_reads_judge & mccc_judge & mch_judge & mcg_judge
passed_cells = judge.sum()
print(
    f'{passed_cells} / {total_cells} cells ({passed_cells / total_cells * 100:.1f}%) '
    f'passed all the filters.')
2756 / 2756 cells (100.0%) passed all the filters.

Sanity Test

try:
    assert (passed_cells / total_cells) > 0.6
except AssertionError as e:
    e.args += (
        'A large amount of the cells do not pass filter, check your cutoffs or overall dataset quality.',
    )
    raise e

try:
    assert passed_cells > 0
except AssertionError as e:
    e.args += ('No cell remained after all the filters.', )
    raise e

print('Feel good')
Feel good

Save filtered metadata

metadata_filtered = metadata[judge].copy()

metadata_filtered.to_csv('CellMetadata.PassQC.csv.gz')