Feature Basic Filtering

Contents

Feature Basic Filtering¶

Purpose¶

Apply basic filters to remove these problematic features:

Extremly low coverage or high coverage features
ENCODE Blcaklist
Some chromosomes (usually, chrY and chrM)

Input¶

Cell metadata (after basic cell filter)
MCDS files

Output¶

FeatureList.BasicFilter.txt: List of feature ids passed all filters

Import¶

import pathlib
import pandas as pd
import seaborn as sns
from ALLCools.mcds import MCDS

sns.set_context(context='notebook', font_scale=1.3)

Parameters¶

# change this to the path to your filtered metadata
metadata_path = 'CellMetadata.PassQC.csv.gz'

# change this to the paths to your MCDS files
mcds_path = '../../../data/Brain/snmC-seq2/Liu2021Nature.mcds'

# Dimension name used to do clustering
obs_dim = 'cell'  # observation
var_dim = 'chrom100k'  # feature

min_cov = 500
max_cov = 3000

# change this to the path to ENCODE blacklist.
# The ENCODE blacklist can be download from https://github.com/Boyle-Lab/Blacklist/
black_list_path = '../../../data/genome/mm10-blacklist.v2.bed.gz'
f = 0.2

exclude_chromosome = ['chrM', 'chrY']

Load Data¶

Metadata¶

metadata = pd.read_csv(metadata_path, index_col=0)
total_cells = metadata.shape[0]
print(f'Metadata of {total_cells} cells')

Metadata of 16985 cells

metadata.head()

	AllcPath	mCCCFrac	mCGFrac	mCGFracAdj	mCHFrac	mCHFracAdj	FinalReads	InputReads	MappedReads	DissectionRegion	BamFilteringRate	MappingRate	Plate	Col384	Row384	FANSDate	Slice	Sample
10E_M_0	/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E...	0.008198	0.822633	0.821166	0.041640	0.033718	1626504.0	4407752	2892347.0	10E	0.562347	0.656195	CEMBA190625-10E-1	0	0	190625	10	10E_190625
10E_M_1	/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E...	0.006019	0.743035	0.741479	0.024127	0.018218	2009998.0	5524084	3657352.0	10E	0.549577	0.662074	CEMBA190625-10E-1	0	1	190625	10	10E_190625
10E_M_10	/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E...	0.006569	0.750172	0.748520	0.027665	0.021235	1383636.0	3455260	2172987.0	10E	0.636744	0.628892	CEMBA190625-10E-1	19	0	190625	10	10E_190625
10E_M_101	/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E...	0.006353	0.760898	0.759369	0.026547	0.020323	2474670.0	7245482	4778768.0	10E	0.517847	0.659551	CEMBA190625-10E-1	18	3	190625	10	10E_190625
10E_M_102	/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E...	0.005409	0.752980	0.751637	0.019497	0.014164	2430290.0	7004754	4609570.0	10E	0.527227	0.658063	CEMBA190625-10E-1	19	2	190625	10	10E_190625

MCDS¶

mcds = MCDS.open(mcds_path, 
                 var_dim='chrom100k', 
                 use_obs=metadata.index)
total_feature = mcds.get_index(var_dim).size

mcds

<xarray.MCDS>
Dimensions:              (cell: 16985, chrom100k: 27269, count_type: 2, mc_type: 2)
Coordinates:
  * cell                 (cell) <U10 '10E_M_207' '10E_M_338' ... '9J_M_2969'
  * chrom100k            (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268
    chrom100k_bin_end    (chrom100k) int64 dask.array<chunksize=(27269,), meta=np.ndarray>
    chrom100k_bin_start  (chrom100k) int64 dask.array<chunksize=(27269,), meta=np.ndarray>
    chrom100k_chrom      (chrom100k) <U5 dask.array<chunksize=(27269,), meta=np.ndarray>
  * count_type           (count_type) <U3 'mc' 'cov'
  * mc_type              (mc_type) <U3 'CGN' 'CHN'
    strand_type          <U4 'both'
Data variables:
    chrom100k_da         (cell, chrom100k, mc_type, count_type) uint16 dask.array<chunksize=(3397, 2479, 2, 2), meta=np.ndarray>
Attributes:
    obs_dim:  cell
    var_dim:  chrom100k

Filter Features¶

Filter by mean coverage¶

mcds.add_feature_cov_mean()

Feature chrom100k mean cov across cells added in MCDS.coords['chrom100k_cov_mean'].

../../../_images/02-BasicFeatureFiltering_13_1.png

mcds = mcds.filter_feature_by_cov_mean(
    min_cov=min_cov,  # minimum coverage
    max_cov=max_cov  # Maximum coverage
)

Before cov mean filter: 27269 chrom100k
 After cov mean filter: 25242 chrom100k 92.6%

Filter by ENCODE Blacklist¶

mcds = mcds.remove_black_list_region(
    black_list_path=black_list_path,
    f=f  # Features having overlap > f with any black list region will be removed.
)

1189 chrom100k features removed due to overlapping (bedtools intersect -f 0.2) with black list regions.

Remove chromosomes¶

mcds = mcds.remove_chromosome(exclude_chromosome)

20 chrom100k features in ['chrM', 'chrY'] removed.

Save Feature List¶

print(
    f'{mcds.get_index(var_dim).size} ({mcds.get_index(var_dim).size * 100 / total_feature:.1f}%) '
    f'{var_dim} remained after all the basic filter.')

24045 (88.2%) chrom100k remained after all the basic filter.

with open('FeatureList.BasicFilter.txt', 'w') as f:
    for var in mcds.get_index(var_dim).astype(str):
        f.write(var + '\n')

mcds

<xarray.MCDS>
Dimensions:              (cell: 16985, chrom100k: 24045, count_type: 2, mc_type: 2)
Coordinates:
  * cell                 (cell) <U10 '10E_M_207' '10E_M_338' ... '9J_M_2969'
  * chrom100k            (chrom100k) int64 30 31 32 33 ... 26335 26336 26337
    chrom100k_bin_end    (chrom100k) int64 dask.array<chunksize=(24045,), meta=np.ndarray>
    chrom100k_bin_start  (chrom100k) int64 dask.array<chunksize=(24045,), meta=np.ndarray>
    chrom100k_chrom      (chrom100k) <U5 dask.array<chunksize=(24045,), meta=np.ndarray>
  * count_type           (count_type) <U3 'mc' 'cov'
  * mc_type              (mc_type) <U3 'CGN' 'CHN'
    strand_type          <U4 'both'
    chrom100k_cov_mean   (chrom100k) float64 1.4e+03 1.378e+03 ... 715.3 696.9
Data variables:
    chrom100k_da         (cell, chrom100k, mc_type, count_type) uint16 dask.array<chunksize=(3397, 2364, 2, 2), meta=np.ndarray>
Attributes:
    obs_dim:  cell
    var_dim:  chrom100k

previous

Cell Basic Filtering

next

Calculate Highly Variable Features And Get mC Fraction AnnData