{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculate Highly Variable Features And Get mC Fraction AnnData\n", "\n", "## Purpose\n", "The purpose of this step is to select highly variable features (HVF) and generate cell-by-feature methylation fraction matrix for clustering. The highly variable features are selected by comparing feature's normalized dispersion among cells.\n", "\n", "## Input\n", "- Filtered cell metadata;\n", "- MCDS files;\n", "- Feature list from basic feature filtering\n", "\n", "## Output\n", "- cell-by-HVF methylation fraction matrix stored in AnnData format, e.g., mCH adata and mCG adata." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:10:22.450380Z", "start_time": "2022-02-15T22:10:20.482397Z" } }, "outputs": [], "source": [ "import pathlib\n", "import pandas as pd\n", "import dask\n", "from ALLCools.mcds import MCDS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameters" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:10:47.527199Z", "start_time": "2022-02-15T22:10:47.524254Z" }, "tags": [ "parameters" ] }, "outputs": [], "source": [ "# If True, will load all data into memory.\n", "# Computation will be much faster, but also very memory intensive, only use this for small number of cells (<10,000)\n", "load = True\n", "\n", "# change this to the path to your filtered metadata\n", "metadata_path = 'CellMetadata.PassQC.csv.gz'\n", "\n", "# change this to the paths to your MCDS files\n", "mcds_path = '../../../data/Brain/snmC-seq2/Liu2021Nature.mcds'\n", "\n", "# Feature list after basic filter\n", "feature_path = 'FeatureList.BasicFilter.txt'\n", "\n", "# Dimension name used to do clustering\n", "obs_dim = 'cell' # observation\n", "var_dim = 'chrom100k' # feature\n", "\n", "# HVF method:\n", "# SVR: regression based\n", "# Bins: normalize dispersion per bin\n", "hvf_method = 'SVR'\n", "mch_pattern = 'CHN'\n", "mcg_pattern = 'CGN'\n", "n_top_feature = 20000\n", "\n", "# Downsample cells\n", "downsample = 20000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data\n", "\n", "### Metadata" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:10:25.821604Z", "start_time": "2022-02-15T22:10:25.709131Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Metadata of 16985 cells\n" ] } ], "source": [ "metadata = pd.read_csv(metadata_path, index_col=0)\n", "total_cells = metadata.shape[0]\n", "print(f'Metadata of {total_cells} cells')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2022-02-15T22:10:27.233579Z", "start_time": "2022-02-15T22:10:27.199147Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | AllcPath | \n", "mCCCFrac | \n", "mCGFrac | \n", "mCGFracAdj | \n", "mCHFrac | \n", "mCHFracAdj | \n", "FinalReads | \n", "InputReads | \n", "MappedReads | \n", "DissectionRegion | \n", "BamFilteringRate | \n", "MappingRate | \n", "Plate | \n", "Col384 | \n", "Row384 | \n", "FANSDate | \n", "Slice | \n", "Sample | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10E_M_0 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.008198 | \n", "0.822633 | \n", "0.821166 | \n", "0.041640 | \n", "0.033718 | \n", "1626504.0 | \n", "4407752 | \n", "2892347.0 | \n", "10E | \n", "0.562347 | \n", "0.656195 | \n", "CEMBA190625-10E-1 | \n", "0 | \n", "0 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_1 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.006019 | \n", "0.743035 | \n", "0.741479 | \n", "0.024127 | \n", "0.018218 | \n", "2009998.0 | \n", "5524084 | \n", "3657352.0 | \n", "10E | \n", "0.549577 | \n", "0.662074 | \n", "CEMBA190625-10E-1 | \n", "0 | \n", "1 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_10 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.006569 | \n", "0.750172 | \n", "0.748520 | \n", "0.027665 | \n", "0.021235 | \n", "1383636.0 | \n", "3455260 | \n", "2172987.0 | \n", "10E | \n", "0.636744 | \n", "0.628892 | \n", "CEMBA190625-10E-1 | \n", "19 | \n", "0 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_101 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.006353 | \n", "0.760898 | \n", "0.759369 | \n", "0.026547 | \n", "0.020323 | \n", "2474670.0 | \n", "7245482 | \n", "4778768.0 | \n", "10E | \n", "0.517847 | \n", "0.659551 | \n", "CEMBA190625-10E-1 | \n", "18 | \n", "3 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
10E_M_102 | \n", "/gale/raidix/rdx-4/mapping/10E/CEMBA190625-10E... | \n", "0.005409 | \n", "0.752980 | \n", "0.751637 | \n", "0.019497 | \n", "0.014164 | \n", "2430290.0 | \n", "7004754 | \n", "4609570.0 | \n", "10E | \n", "0.527227 | \n", "0.658063 | \n", "CEMBA190625-10E-1 | \n", "19 | \n", "2 | \n", "190625 | \n", "10 | \n", "10E_190625 | \n", "
<xarray.MCDS>\n", "Dimensions: (cell: 16985, chrom100k: 24045, count_type: 2, mc_type: 2)\n", "Coordinates:\n", " * cell (cell) <U10 '10E_M_207' '10E_M_338' ... '9J_M_2969'\n", " * chrom100k (chrom100k) int64 30 31 32 33 ... 26335 26336 26337\n", " chrom100k_bin_end (chrom100k) int64 dask.array<chunksize=(24045,), meta=np.ndarray>\n", " chrom100k_bin_start (chrom100k) int64 dask.array<chunksize=(24045,), meta=np.ndarray>\n", " chrom100k_chrom (chrom100k) <U5 dask.array<chunksize=(24045,), meta=np.ndarray>\n", " * count_type (count_type) <U3 'mc' 'cov'\n", " * mc_type (mc_type) <U3 'CGN' 'CHN'\n", " strand_type <U4 'both'\n", "Data variables:\n", " chrom100k_da (cell, chrom100k, mc_type, count_type) uint16 dask.array<chunksize=(3397, 2364, 2, 2), meta=np.ndarray>\n", " chrom100k_da_frac (cell, chrom100k, mc_type) float64 dask.array<chunksize=(3397, 2364, 2), meta=np.ndarray>\n", "Attributes:\n", " obs_dim: cell\n", " var_dim: chrom100k