{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Basic Filtering\n", "\n", "## Purpose\n", "Apply basic filters to remove these problematic features:\n", "- Extremly low coverage or high coverage features\n", "- ENCODE Blcaklist\n", "- Some chromosomes (usually, chrY and chrM)\n", "\n", "## Input\n", "- Cell metadata (after basic cell filter)\n", "- MCDS files\n", "\n", "## Output\n", "- FeatureList.BasicFilter.txt: List of feature ids passed all filters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:08:30.573306Z", "start_time": "2022-02-15T22:08:28.334624Z" } }, "outputs": [], "source": [ "import pathlib\n", "import pandas as pd\n", "import seaborn as sns\n", "from ALLCools.mcds import MCDS" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:08:30.577985Z", "start_time": "2022-02-15T22:08:30.574931Z" } }, "outputs": [], "source": [ "sns.set_context(context='notebook', font_scale=1.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameters" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:08:47.499885Z", "start_time": "2022-02-15T22:08:47.492245Z" }, "tags": [ "parameters" ] }, "outputs": [], "source": [ "# change this to the path to your filtered metadata\n", "metadata_path = 'CellMetadata.PassQC.csv.gz'\n", "\n", "# change this to the paths to your MCDS files\n", "mcds_path = '../../../data/Brain/snmC-seq2/Liu2021Nature.mcds'\n", "\n", "# Dimension name used to do clustering\n", "obs_dim = 'cell' # observation\n", "var_dim = 'chrom100k' # feature\n", "\n", "min_cov = 500\n", "max_cov = 3000\n", "\n", "# change this to the path to ENCODE blacklist.\n", "# The ENCODE blacklist can be download from https://github.com/Boyle-Lab/Blacklist/\n", "black_list_path = '../../../data/genome/mm10-blacklist.v2.bed.gz'\n", "f = 0.2\n", "\n", "exclude_chromosome = ['chrM', 'chrY']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data\n", "\n", "### Metadata" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:08:51.448782Z", "start_time": "2022-02-15T22:08:51.346615Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Metadata of 16985 cells\n" ] } ], "source": [ "metadata = pd.read_csv(metadata_path, index_col=0)\n", "total_cells = metadata.shape[0]\n", "print(f'Metadata of {total_cells} cells')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:08:51.703556Z", "start_time": "2022-02-15T22:08:51.685896Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | AllcPath | \n", "mCCCFrac | \n", "mCGFrac | \n", "mCGFracAdj | \n", "mCHFrac | \n", "mCHFracAdj | \n", "FinalReads | \n", "InputReads | \n", "MappedReads | \n", "DissectionRegion | \n", "BamFilteringRate | \n", "MappingRate | \n", "Plate | \n", "Col384 | \n", "Row384 | \n", "FANSDate | \n", "Slice | \n", "Sample | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10E_M_0 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.008198 | \n", "0.822633 | \n", "0.821166 | \n", "0.041640 | \n", "0.033718 | \n", "1626504.0 | \n", "4407752 | \n", "2892347.0 | \n", "10E | \n", "0.562347 | \n", "0.656195 | \n", "CEMBA190625-10E-1 | \n", "0 | \n", "0 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_1 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.006019 | \n", "0.743035 | \n", "0.741479 | \n", "0.024127 | \n", "0.018218 | \n", "2009998.0 | \n", "5524084 | \n", "3657352.0 | \n", "10E | \n", "0.549577 | \n", "0.662074 | \n", "CEMBA190625-10E-1 | \n", "0 | \n", "1 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_10 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.006569 | \n", "0.750172 | \n", "0.748520 | \n", "0.027665 | \n", "0.021235 | \n", "1383636.0 | \n", "3455260 | \n", "2172987.0 | \n", "10E | \n", "0.636744 | \n", "0.628892 | \n", "CEMBA190625-10E-1 | \n", "19 | \n", "0 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_101 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.006353 | \n", "0.760898 | \n", "0.759369 | \n", "0.026547 | \n", "0.020323 | \n", "2474670.0 | \n", "7245482 | \n", "4778768.0 | \n", "10E | \n", "0.517847 | \n", "0.659551 | \n", "CEMBA190625-10E-1 | \n", "18 | \n", "3 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_102 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.005409 | \n", "0.752980 | \n", "0.751637 | \n", "0.019497 | \n", "0.014164 | \n", "2430290.0 | \n", "7004754 | \n", "4609570.0 | \n", "10E | \n", "0.527227 | \n", "0.658063 | \n", "CEMBA190625-10E-1 | \n", "19 | \n", "2 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
<xarray.MCDS>\n", "Dimensions: (cell: 16985, chrom100k: 27269, count_type: 2, mc_type: 2)\n", "Coordinates:\n", " * cell (cell) <U10 '10E_M_207' '10E_M_338' ... '9J_M_2969'\n", " * chrom100k (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268\n", " chrom100k_bin_end (chrom100k) int64 dask.array<chunksize=(27269,), meta=np.ndarray>\n", " chrom100k_bin_start (chrom100k) int64 dask.array<chunksize=(27269,), meta=np.ndarray>\n", " chrom100k_chrom (chrom100k) <U5 dask.array<chunksize=(27269,), meta=np.ndarray>\n", " * count_type (count_type) <U3 'mc' 'cov'\n", " * mc_type (mc_type) <U3 'CGN' 'CHN'\n", " strand_type <U4 'both'\n", "Data variables:\n", " chrom100k_da (cell, chrom100k, mc_type, count_type) uint16 dask.array<chunksize=(3397, 2479, 2, 2), meta=np.ndarray>\n", "Attributes:\n", " obs_dim: cell\n", " var_dim: chrom100k
<xarray.MCDS>\n", "Dimensions: (cell: 16985, chrom100k: 24045, count_type: 2, mc_type: 2)\n", "Coordinates:\n", " * cell (cell) <U10 '10E_M_207' '10E_M_338' ... '9J_M_2969'\n", " * chrom100k (chrom100k) int64 30 31 32 33 ... 26335 26336 26337\n", " chrom100k_bin_end (chrom100k) int64 dask.array<chunksize=(24045,), meta=np.ndarray>\n", " chrom100k_bin_start (chrom100k) int64 dask.array<chunksize=(24045,), meta=np.ndarray>\n", " chrom100k_chrom (chrom100k) <U5 dask.array<chunksize=(24045,), meta=np.ndarray>\n", " * count_type (count_type) <U3 'mc' 'cov'\n", " * mc_type (mc_type) <U3 'CGN' 'CHN'\n", " strand_type <U4 'both'\n", " chrom100k_cov_mean (chrom100k) float64 1.4e+03 1.378e+03 ... 715.3 696.9\n", "Data variables:\n", " chrom100k_da (cell, chrom100k, mc_type, count_type) uint16 dask.array<chunksize=(3397, 2364, 2, 2), meta=np.ndarray>\n", "Attributes:\n", " obs_dim: cell\n", " var_dim: chrom100k