{ "cells": [ { "cell_type": "markdown", "id": "0cd18eb7", "metadata": { "nteract": { "transient": { "deleting": false } }, "pycharm": { "name": "#%% md\n" } }, "source": [ "# Basic Cell Clustering Using mCG-5Kb Bins\n", "\n", "## Content\n", "\n", "Here we go through the basic steps of cell clustering using non-overlapping 5Kb genomic bins\n", "as features. We start from hypo-methylation probability data stored in an MCDS (quantified with the hypo- or hyper-methylation score option; see `allcools generate-dataset`). This notebook can be used to quickly evaluate the cell-type composition of a single-cell methylome dataset (e.g., the dataset from a single experiment).\n", "Compared with clustering on 100Kb bins, this process is better suited to samples with a low mCH fraction (as in many non-brain tissues) and narrow methylation diversity, where smaller features work better.\n", "\n", "### Dataset used in this notebook\n", "- Adult (age P56) male mouse brain pituitary (PIT) snmC-seq2 data from {cite:p}`RufZamojski2021`.\n", "\n", "## Input\n", "- MCDS with chrom5k hypo-score matrix\n", "- Cell metadata\n", "\n", "## Output\n", "- Cell-by-5Kb-bin AnnData (sparse matrix) with embedding coordinates and cluster labels." 
] }, { "cell_type": "markdown", "id": "7d134ba5", "metadata": {}, "source": [ "## Import" ] }, { "cell_type": "code", "execution_count": 1, "id": "a87974bf", "metadata": { "ExecuteTime": { "end_time": "2022-02-15T21:39:16.611583Z", "start_time": "2022-02-15T21:39:14.558031Z" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import scanpy as sc\n", "from ALLCools.mcds import MCDS\n", "from ALLCools.clustering import tsne, significant_pc_test, filter_regions, lsi, binarize_matrix\n", "from ALLCools.plot import *" ] }, { "cell_type": "markdown", "id": "2d0d5ee8", "metadata": {}, "source": [ "## Parameters" ] }, { "cell_type": "code", "execution_count": 2, "id": "8e9ac2db", "metadata": { "ExecuteTime": { "end_time": "2022-02-15T21:39:16.616454Z", "start_time": "2022-02-15T21:39:16.613058Z" } }, "outputs": [], "source": [ "metadata_path = '../../data/PIT/PIT.CellMetadata.csv.gz'\n", "mcds_path = '../../data/PIT/RufZamojski2021NC.mcds'\n", "\n", "# Basic filtering parameters\n", "mapping_rate_cutoff = 0.5\n", "mapping_rate_col_name = 'MappingRate' # Name may change\n", "final_reads_cutoff = 500000\n", "final_reads_col_name = 'FinalmCReads' # Name may change\n", "mccc_cutoff = 0.03\n", "mccc_col_name = 'mCCCFrac' # Name may change\n", "mch_cutoff = 0.2\n", "mch_col_name = 'mCHFrac' # Name may change\n", "mcg_cutoff = 0.5\n", "mcg_col_name = 'mCGFrac' # Name may change\n", "\n", "# PC cutoff\n", "pc_cutoff = 0.1\n", "\n", "# KNN\n", "knn = -1 # -1 means auto determine\n", "\n", "# Leiden\n", "resolution = 1" ] }, { "cell_type": "markdown", "id": "884c3a61", "metadata": {}, "source": [ "## Load Cell Metadata" ] }, { "cell_type": "code", "execution_count": 3, "id": "77bf9062", "metadata": { "ExecuteTime": { "end_time": "2022-02-15T21:39:16.646725Z", "start_time": "2022-02-15T21:39:16.618010Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Metadata of 2756 cells\n" ] }, { "data": { 
"text/html": [ "
\n", "| index | CellInputReadPairs | MappingRate | FinalmCReads | mCCCFrac | mCGFrac | mCHFrac | Plate | Col384 | Row384 | CellTypeAnno |\n",
"|---|---|---|---|---|---|---|---|---|---|---|\n",
"| PIT_P1-PIT_P2-A1-AD001 | 1858622.0 | 0.685139 | 1612023.0 | 0.003644 | 0.679811 | 0.005782 | PIT_P1 | 0 | 0 | Outlier |\n",
"| PIT_P1-PIT_P2-A1-AD004 | 1599190.0 | 0.686342 | 1367004.0 | 0.004046 | 0.746012 | 0.008154 | PIT_P1 | 1 | 0 | Gonadotropes |\n",
"| PIT_P1-PIT_P2-A1-AD006 | 1932242.0 | 0.669654 | 1580990.0 | 0.003958 | 0.683584 | 0.005689 | PIT_P1 | 1 | 1 | Somatotropes |\n",
"| PIT_P1-PIT_P2-A1-AD007 | 1588505.0 | 0.664612 | 1292770.0 | 0.003622 | 0.735217 | 0.005460 | PIT_P2 | 0 | 0 | Rbpms+ |\n",
"| PIT_P1-PIT_P2-A1-AD010 | 1738409.0 | 0.703835 | 1539676.0 | 0.003769 | 0.744640 | 0.006679 | PIT_P2 | 1 | 0 | Rbpms+ |\n", "