{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Motif Scan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transcription Factor binding motifs are commonly found enriched in cis-regulatory elements and can inform the potential regulatory mechanism of the elements. The first step to study these DNA motifs is to scan their occurence in the genome regions.\n", "\n", "## The Default MotifSet\n", "Currently, the motif scan function uses a default motif dataset that contains >2000 motifs from three databases {cite}`Khan2018,Kulakovskiy2018,Jolma2013`, each motif is also annotated with human and mouse gene names to facilitate further interpretation. \n", "\n", "Following the analysis in {cite}`Vierstra2020` (see also [this great blog](https://www.vierstra.org/resources/motif_clustering)), these motifs are clustered into 286 motif-clusters based on their similarity (and some motifs are almost identical). We will scan all individual motifs here, but also aggregate the results to motif-cluster level. It is recommended to perform futher analysis at the motif-cluster level (such as motif enrichment analysis)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:44:38.283532Z", "start_time": "2022-01-10T00:44:38.281512Z" } }, "outputs": [], "source": [ "from ALLCools.mcds import RegionDS\n", "from ALLCools.motif import MotifSet, get_default_motif_set" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:45:10.601636Z", "start_time": "2022-01-10T00:45:10.344270Z" } }, "outputs": [ { "data": { "text/plain": [ "2179" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check out the default motif set\n", "default_motif_set = get_default_motif_set()\n", "default_motif_set.n_motifs" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:45:14.687326Z", "start_time": "2022-01-10T00:45:14.666887Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
human_genesmouse_genescluster_iddatabaseconcensusrelative_orientationwidthleft_offsetright_offset
motif
LHX6_homeodomain_3LHX6Lhx6c1Taipale_Cell_2013TGATTGCAATCAPositive1200
Lhx8.mouse_homeodomain_3LHX8Lhx8c1Taipale_Cell_2013TGATTGCAATTANegative1200
LHX2_MOUSE.H11MO.0.ALHX2Lhx2c2HOCOMOCO_v11ACTAATTAACNegative1079
LHX2_HUMAN.H11MO.0.ALHX2Lhx2c2HOCOMOCO_v11AACTAATTAAAANegative1268
LHX3_MOUSE.H11MO.0.CLHX3Lhx3c2HOCOMOCO_v11TTAATTAGCNegative989
..............................
Ahr+Arnt_MA0006.1AMTAmtc284Jaspar2018TGCGTGPositive621
KLF8_HUMAN.H11MO.0.CKLF8Klf8c285HOCOMOCO_v11CAGGGGGTGPositive900
KLF8_MOUSE.H11MO.0.CKLF8Klf8c285HOCOMOCO_v11CAGGGGGTGPositive900
ZSCAN4_MA1155.1ZSCAN4Zscan4bc286Jaspar2018TGCACACACTGAAAAPositive1500
ZSCAN4_C2H2_1ZSCAN4Zscan4bc286Taipale_Cell_2013TGCACACACTGAAAAPositive1500
\n", "

2220 rows × 9 columns

\n", "
" ], "text/plain": [ " human_genes mouse_genes ... left_offset right_offset\n", "motif ... \n", "LHX6_homeodomain_3 LHX6 Lhx6 ... 0 0\n", "Lhx8.mouse_homeodomain_3 LHX8 Lhx8 ... 0 0\n", "LHX2_MOUSE.H11MO.0.A LHX2 Lhx2 ... 7 9\n", "LHX2_HUMAN.H11MO.0.A LHX2 Lhx2 ... 6 8\n", "LHX3_MOUSE.H11MO.0.C LHX3 Lhx3 ... 8 9\n", "... ... ... ... ... ...\n", "Ahr+Arnt_MA0006.1 AMT Amt ... 2 1\n", "KLF8_HUMAN.H11MO.0.C KLF8 Klf8 ... 0 0\n", "KLF8_MOUSE.H11MO.0.C KLF8 Klf8 ... 0 0\n", "ZSCAN4_MA1155.1 ZSCAN4 Zscan4b ... 0 0\n", "ZSCAN4_C2H2_1 ZSCAN4 Zscan4b ... 0 0\n", "\n", "[2220 rows x 9 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# metadata of the motifs\n", "default_motif_set.meta_table" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:45:34.401020Z", "start_time": "2022-01-10T00:45:34.397088Z" } }, "outputs": [ { "data": { "text/plain": [ "motif\n", "LHX6_homeodomain_3 c1\n", "Lhx8.mouse_homeodomain_3 c1\n", "LHX2_MOUSE.H11MO.0.A c2\n", "LHX2_HUMAN.H11MO.0.A c2\n", "LHX3_MOUSE.H11MO.0.C c2\n", " ... \n", "Ahr+Arnt_MA0006.1 c284\n", "KLF8_HUMAN.H11MO.0.C c285\n", "KLF8_MOUSE.H11MO.0.C c285\n", "ZSCAN4_MA1155.1 c286\n", "ZSCAN4_C2H2_1 c286\n", "Length: 2174, dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# motif cluster\n", "default_motif_set.motif_cluster" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:52:30.466688Z", "start_time": "2022-01-10T00:52:30.463906Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# single motif object\n", "default_motif_set.motif_list[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Motif PSSM Cutoffs" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:57:48.335776Z", "start_time": "2022-01-10T00:57:48.333894Z" } }, "outputs": [], "source": [ "# To re-calculate motif thresholds with a different method or parameter\n", "\n", "# default_motif_set.calculate_threshold(method='balanced', cpu=1, threshold_value=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scan Default Motifs" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:45:53.131357Z", "start_time": "2022-01-10T00:45:52.822824Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using dmr as region_dim\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.RegionDS>\n",
       "Dimensions:              (count_type: 2, dmr: 132, sample: 20, sample_collapsed: 10)\n",
       "Coordinates:\n",
       "  * count_type           (count_type) <U3 'mc' 'cov'\n",
       "  * dmr                  (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'\n",
       "    dmr_chrom            (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19'\n",
       "    dmr_end              (dmr) int64 10002172 10003542 ... 5099203 5099952\n",
       "    dmr_length           (dmr) int64 2 305 54 2 2 2 ... 924 632 842 195 399 335\n",
       "    dmr_ndms             (dmr) int64 1 7 2 1 1 1 1 13 3 ... 2 7 13 19 9 9 3 6 13\n",
       "    dmr_start            (dmr) int64 10002170 10003237 ... 5098804 5099617\n",
       "  * sample               (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC'\n",
       "  * sample_collapsed     (sample_collapsed) object 'ASC' 'CA1' ... 'ODC' 'OPC'\n",
       "Data variables:\n",
       "    dmr_da               (sample, dmr, count_type) uint32 4294967295 ... 4294...\n",
       "    dmr_da_frac          (sample, dmr) float32 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0\n",
       "    dmr_state            (sample, dmr) int8 -1 0 0 1 -1 -1 -1 ... 0 0 0 0 0 0 0\n",
       "    dmr_state_collapsed  (dmr, sample_collapsed) int8 0 0 0 0 0 0 ... 0 0 0 0 0\n",
       "Attributes:\n",
       "    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...\n",
       "    region_dim:          dmr\n",
       "    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
" ], "text/plain": [ "\n", "Dimensions: (count_type: 2, dmr: 132, sample: 20, sample_collapsed: 10)\n", "Coordinates:\n", " * count_type (count_type) ` method of RegionDS will perform motif scan using the default motif set over all the regions. This is a time consumming step, scanning 2M regions with 40 CPUs take ~3 days." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:49:21.671177Z", "start_time": "2022-01-10T00:46:51.245092Z" }, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "index file ../../data/genome/mm10.fa.fai not found, generating...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Scan 2179 motif in 132 sequences.\n", "Job 0 returned\n" ] } ], "source": [ "dmr_ds.scan_motifs(genome_fasta='../../data/genome/mm10.fa',\n", " cpu=45,\n", " standardize_length=500,\n", " motif_set_path=None,\n", " chrom_size_path=None,\n", " combine_cluster=True,\n", " chunk_size=10000,\n", " dim='motif')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Motif Values\n", "After motif scaning, three value for each motif in each region is stored:\n", "- n_motifs\n", "- max_score\n", "- total_score" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:58:47.983428Z", "start_time": "2022-01-10T00:58:47.979567Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['n_motifs', 'max_score', 'total_score'], dtype='object', name='motif_value')" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dmr_ds.get_index('motif_value')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Individual motifs" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:51:46.393278Z", "start_time": "2022-01-10T00:51:46.373330Z" }, "collapsed": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'dmr_motif_da' (dmr: 132, motif: 2179, motif_value: 3)>\n",
       "array([[[0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        ...,\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0]],\n",
       "\n",
       "       [[0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        ...,\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0]],\n",
       "\n",
       "       [[0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        ...,\n",
       "...\n",
       "        ...,\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0]],\n",
       "\n",
       "       [[0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        ...,\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0]],\n",
       "\n",
       "       [[0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        ...,\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0],\n",
       "        [0, 0, 0]]], dtype=uint16)\n",
       "Coordinates:\n",
       "  * dmr          (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'\n",
       "    dmr_chrom    (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'\n",
       "    dmr_end      (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952\n",
       "    dmr_length   (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335\n",
       "    dmr_ndms     (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13\n",
       "    dmr_start    (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617\n",
       "  * motif        (motif) <U29 'ALX3_homeodomain_2' ... 'ZSC31_HUMAN.H11MO.0.C'\n",
       "  * motif_value  (motif_value) <U11 'n_motifs' 'max_score' 'total_score'
" ], "text/plain": [ "\n", "array([[[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", "...\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]]], dtype=uint16)\n", "Coordinates:\n", " * dmr (dmr) \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'dmr_motif-cluster_da' (motif-cluster: 286, dmr: 132, motif_value: 3)>\n",
       "[113256 values with dtype=uint16]\n",
       "Coordinates:\n",
       "  * dmr            (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'\n",
       "    dmr_chrom      (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'\n",
       "    dmr_end        (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952\n",
       "    dmr_length     (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335\n",
       "    dmr_ndms       (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 1 2 7 13 19 9 9 3 6 13\n",
       "    dmr_start      (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617\n",
       "  * motif_value    (motif_value) <U11 'n_motifs' 'max_score' 'total_score'\n",
       "  * motif-cluster  (motif-cluster) object 'c1' 'c10' 'c100' ... 'c98' 'c99'
" ], "text/plain": [ "\n", "[113256 values with dtype=uint16]\n", "Coordinates:\n", " * dmr (dmr)