{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Motif Scan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transcription factor binding motifs are commonly enriched in cis-regulatory elements and can inform the potential regulatory mechanisms of those elements. The first step in studying these DNA motifs is to scan for their occurrences in the genomic regions.\n", "\n", "## The Default MotifSet\n", "Currently, the motif scan function uses a default motif dataset that contains >2000 motifs from three databases {cite}`Khan2018,Kulakovskiy2018,Jolma2013`. Each motif is also annotated with human and mouse gene names to facilitate further interpretation.\n", "\n", "Following the analysis in {cite}`Vierstra2020` (see also [this great blog](https://www.vierstra.org/resources/motif_clustering)), these motifs are clustered into 286 motif-clusters based on their similarity (some motifs are almost identical). We will scan all individual motifs here, but also aggregate the results to the motif-cluster level. It is recommended to perform further analysis (such as motif enrichment analysis) at the motif-cluster level."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:44:38.283532Z", "start_time": "2022-01-10T00:44:38.281512Z" } }, "outputs": [], "source": [ "from ALLCools.mcds import RegionDS\n", "from ALLCools.motif import MotifSet, get_default_motif_set" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:45:10.601636Z", "start_time": "2022-01-10T00:45:10.344270Z" } }, "outputs": [ { "data": { "text/plain": [ "2179" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check out the default motif set\n", "default_motif_set = get_default_motif_set()\n", "default_motif_set.n_motifs" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2022-01-10T00:45:14.687326Z", "start_time": "2022-01-10T00:45:14.666887Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "| motif | human_genes | mouse_genes | cluster_id | database | concensus | relative_orientation | width | left_offset | right_offset |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| LHX6_homeodomain_3 | LHX6 | Lhx6 | c1 | Taipale_Cell_2013 | TGATTGCAATCA | Positive | 12 | 0 | 0 |\n",
"| Lhx8.mouse_homeodomain_3 | LHX8 | Lhx8 | c1 | Taipale_Cell_2013 | TGATTGCAATTA | Negative | 12 | 0 | 0 |\n",
"| LHX2_MOUSE.H11MO.0.A | LHX2 | Lhx2 | c2 | HOCOMOCO_v11 | ACTAATTAAC | Negative | 10 | 7 | 9 |\n",
"| LHX2_HUMAN.H11MO.0.A | LHX2 | Lhx2 | c2 | HOCOMOCO_v11 | AACTAATTAAAA | Negative | 12 | 6 | 8 |\n",
"| LHX3_MOUSE.H11MO.0.C | LHX3 | Lhx3 | c2 | HOCOMOCO_v11 | TTAATTAGC | Negative | 9 | 8 | 9 |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| Ahr+Arnt_MA0006.1 | AMT | Amt | c284 | Jaspar2018 | TGCGTG | Positive | 6 | 2 | 1 |\n",
"| KLF8_HUMAN.H11MO.0.C | KLF8 | Klf8 | c285 | HOCOMOCO_v11 | CAGGGGGTG | Positive | 9 | 0 | 0 |\n",
"| KLF8_MOUSE.H11MO.0.C | KLF8 | Klf8 | c285 | HOCOMOCO_v11 | CAGGGGGTG | Positive | 9 | 0 | 0 |\n",
"| ZSCAN4_MA1155.1 | ZSCAN4 | Zscan4b | c286 | Jaspar2018 | TGCACACACTGAAAA | Positive | 15 | 0 | 0 |\n",
"| ZSCAN4_C2H2_1 | ZSCAN4 | Zscan4b | c286 | Taipale_Cell_2013 | TGCACACACTGAAAA | Positive | 15 | 0 | 0 |\n",
"\n", "2220 rows × 9 columns
\n", "<xarray.RegionDS>\n", "Dimensions: (count_type: 2, dmr: 132, sample: 20, sample_collapsed: 10)\n", "Coordinates:\n", " * count_type (count_type) <U3 'mc' 'cov'\n", " * dmr (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'\n", " dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19'\n", " dmr_end (dmr) int64 10002172 10003542 ... 5099203 5099952\n", " dmr_length (dmr) int64 2 305 54 2 2 2 ... 924 632 842 195 399 335\n", " dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 ... 2 7 13 19 9 9 3 6 13\n", " dmr_start (dmr) int64 10002170 10003237 ... 5098804 5099617\n", " * sample (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC'\n", " * sample_collapsed (sample_collapsed) object 'ASC' 'CA1' ... 'ODC' 'OPC'\n", "Data variables:\n", " dmr_da (sample, dmr, count_type) uint32 4294967295 ... 4294...\n", " dmr_da_frac (sample, dmr) float32 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0\n", " dmr_state (sample, dmr) int8 -1 0 0 1 -1 -1 -1 ... 0 0 0 0 0 0 0\n", " dmr_state_collapsed (dmr, sample_collapsed) int8 0 0 0 0 0 0 ... 0 0 0 0 0\n", "Attributes:\n", " chrom_size_path: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...\n", " region_dim: dmr\n", " region_ds_location: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
<xarray.DataArray 'dmr_motif_da' (dmr: 132, motif: 2179, motif_value: 3)>\n", "array([[[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", "...\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " ...,\n", " [0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]]], dtype=uint16)\n", "Coordinates:\n", " * dmr (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'\n", " dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'\n", " dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952\n", " dmr_length (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335\n", " dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13\n", " dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617\n", " * motif (motif) <U29 'ALX3_homeodomain_2' ... 'ZSC31_HUMAN.H11MO.0.C'\n", " * motif_value (motif_value) <U11 'n_motifs' 'max_score' 'total_score'
<xarray.DataArray 'dmr_motif-cluster_da' (motif-cluster: 286, dmr: 132, motif_value: 3)>\n", "[113256 values with dtype=uint16]\n", "Coordinates:\n", " * dmr (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'\n", " dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'\n", " dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952\n", " dmr_length (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335\n", " dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 1 2 7 13 19 9 9 3 6 13\n", " dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617\n", " * motif_value (motif_value) <U11 'n_motifs' 'max_score' 'total_score'\n", " * motif-cluster (motif-cluster) object 'c1' 'c10' 'c100' ... 'c98' 'c99'" ], "text/plain": [ "