Motif Scan
Motif Scan¶
Transcription factor (TF) binding motifs are often enriched in cis-regulatory elements and can point to the regulatory mechanisms acting on those elements. The first step in studying these DNA motifs is to scan for their occurrences in the genomic regions of interest.
The Default MotifSet¶
Currently, the motif scan function uses a default motif dataset that contains more than 2,000 motifs from three databases [Khan et al., 2018, Kulakovskiy et al., 2018, Jolma et al., 2013]. Each motif is also annotated with human and mouse gene names to facilitate interpretation.
Following the analysis in [Vierstra et al., 2020] (see also this great blog), these motifs are grouped into 286 motif clusters based on their similarity, since some motifs are almost identical. We scan all individual motifs here and also aggregate the results to the motif-cluster level. Further analyses, such as motif enrichment analysis, are best performed at the motif-cluster level.
from ALLCools.mcds import RegionDS
from ALLCools.motif import MotifSet, get_default_motif_set
# check out the default motif set
default_motif_set = get_default_motif_set()
default_motif_set.n_motifs
2179
# metadata of the motifs
default_motif_set.meta_table
motif | human_genes | mouse_genes | cluster_id | database | concensus | relative_orientation | width | left_offset | right_offset |
---|---|---|---|---|---|---|---|---|---|
LHX6_homeodomain_3 | LHX6 | Lhx6 | c1 | Taipale_Cell_2013 | TGATTGCAATCA | Positive | 12 | 0 | 0 |
Lhx8.mouse_homeodomain_3 | LHX8 | Lhx8 | c1 | Taipale_Cell_2013 | TGATTGCAATTA | Negative | 12 | 0 | 0 |
LHX2_MOUSE.H11MO.0.A | LHX2 | Lhx2 | c2 | HOCOMOCO_v11 | ACTAATTAAC | Negative | 10 | 7 | 9 |
LHX2_HUMAN.H11MO.0.A | LHX2 | Lhx2 | c2 | HOCOMOCO_v11 | AACTAATTAAAA | Negative | 12 | 6 | 8 |
LHX3_MOUSE.H11MO.0.C | LHX3 | Lhx3 | c2 | HOCOMOCO_v11 | TTAATTAGC | Negative | 9 | 8 | 9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Ahr+Arnt_MA0006.1 | AMT | Amt | c284 | Jaspar2018 | TGCGTG | Positive | 6 | 2 | 1 |
KLF8_HUMAN.H11MO.0.C | KLF8 | Klf8 | c285 | HOCOMOCO_v11 | CAGGGGGTG | Positive | 9 | 0 | 0 |
KLF8_MOUSE.H11MO.0.C | KLF8 | Klf8 | c285 | HOCOMOCO_v11 | CAGGGGGTG | Positive | 9 | 0 | 0 |
ZSCAN4_MA1155.1 | ZSCAN4 | Zscan4b | c286 | Jaspar2018 | TGCACACACTGAAAA | Positive | 15 | 0 | 0 |
ZSCAN4_C2H2_1 | ZSCAN4 | Zscan4b | c286 | Taipale_Cell_2013 | TGCACACACTGAAAA | Positive | 15 | 0 | 0 |
2220 rows × 9 columns
# motif cluster
default_motif_set.motif_cluster
motif
LHX6_homeodomain_3 c1
Lhx8.mouse_homeodomain_3 c1
LHX2_MOUSE.H11MO.0.A c2
LHX2_HUMAN.H11MO.0.A c2
LHX3_MOUSE.H11MO.0.C c2
...
Ahr+Arnt_MA0006.1 c284
KLF8_HUMAN.H11MO.0.C c285
KLF8_MOUSE.H11MO.0.C c285
ZSCAN4_MA1155.1 c286
ZSCAN4_C2H2_1 c286
Length: 2174, dtype: object
# single motif object
default_motif_set.motif_list[0]
<ALLCools.motif.motifs.Motif at 0x7f1bbc8c4730>
Motif PSSM Cutoffs¶
# To re-calculate motif thresholds with a different method or parameter
# default_motif_set.calculate_threshold(method='balanced', cpu=1, threshold_value=1000)
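To make the role of a score cutoff concrete, here is a minimal sketch of scanning a sequence with a position-specific scoring matrix (PSSM) and keeping only windows that pass a threshold. Everything in it is hypothetical and for illustration only: the toy counts, the uniform background, the cutoff of 5.0, and the helper names `build_pssm` and `scan` are not part of ALLCools, and only the forward strand is scanned.

```python
import math

# Toy base counts for a hypothetical 4-bp motif with consensus TGAC.
# These numbers are made up and are not taken from the default motif set.
COUNTS = [
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 17, "G": 1, "T": 1},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def build_pssm(counts, background=BACKGROUND):
    """Convert per-position base counts into log2-odds scores."""
    pssm = []
    for column in counts:
        total = sum(column.values())
        pssm.append(
            {base: math.log2((n / total) / background) for base, n in column.items()}
        )
    return pssm


def scan(sequence, pssm, cutoff):
    """Return (start, score) for each forward-strand window scoring >= cutoff."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with N or other ambiguous bases
        score = sum(pssm[i][base] for i, base in enumerate(window))
        if score >= cutoff:
            hits.append((start, score))
    return hits


pssm = build_pssm(COUNTS)
hits = scan("GGTGACTTTGACAA", pssm, cutoff=5.0)
print(hits)  # windows whose log-odds score passes the cutoff
```

A stricter cutoff reports fewer, higher-confidence matches, while a looser one reports more matches at the cost of more false positives. That trade-off is what the choice of threshold method and value controls.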
Scan Default Motifs¶
dmr_ds = RegionDS.open('test_HIP', select_dir=['dmr'])
dmr_ds
Using dmr as region_dim
<xarray.RegionDS>
Dimensions:              (count_type: 2, dmr: 132, sample: 20, sample_collapsed: 10)
Coordinates:
  * count_type           (count_type) <U3 'mc' 'cov'
  * dmr                  (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom            (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19'
    dmr_end              (dmr) int64 10002172 10003542 ... 5099203 5099952
    dmr_length           (dmr) int64 2 305 54 2 2 2 ... 924 632 842 195 399 335
    dmr_ndms             (dmr) int64 1 7 2 1 1 1 1 13 3 ... 2 7 13 19 9 9 3 6 13
    dmr_start            (dmr) int64 10002170 10003237 ... 5098804 5099617
  * sample               (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC'
  * sample_collapsed     (sample_collapsed) object 'ASC' 'CA1' ... 'ODC' 'OPC'
Data variables:
    dmr_da               (sample, dmr, count_type) uint32 4294967295 ... 4294...
    dmr_da_frac          (sample, dmr) float32 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0
    dmr_state            (sample, dmr) int8 -1 0 0 1 -1 -1 -1 ... 0 0 0 0 0 0 0
    dmr_state_collapsed  (dmr, sample_collapsed) int8 0 0 0 0 0 0 ... 0 0 0 0 0
Attributes:
    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
    region_dim:          dmr
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
Default Motif Database¶
The scan_motifs method of RegionDS scans every region with the default motif set. This is a time-consuming step: scanning 2M regions with 40 CPUs takes about 3 days.
dmr_ds.scan_motifs(genome_fasta='../../data/genome/mm10.fa',
cpu=45,
standardize_length=500,
motif_set_path=None,
chrom_size_path=None,
combine_cluster=True,
chunk_size=10000,
dim='motif')
index file ../../data/genome/mm10.fa.fai not found, generating...
Scan 2179 motif in 132 sequences.
Job 0 returned
Motif Values¶
After motif scanning, three values are stored for each motif in each region:
n_motifs
max_score
total_score
dmr_ds.get_index('motif_value')
Index(['n_motifs', 'max_score', 'total_score'], dtype='object', name='motif_value')
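The names of the three values suggest a count, a maximum, and a sum over the motif hits found in a region. The sketch below illustrates that reading on made-up hit scores. The exact definitions used by ALLCools are not stated here, so the interpretation and the helper `summarize_hits` are assumptions for illustration only.

```python
def summarize_hits(scores):
    """Collapse the hit scores of one motif in one region into three values.

    Assumed meaning (not confirmed by the text above):
    n_motifs    = number of hits that passed the score cutoff
    max_score   = highest score among those hits
    total_score = sum of the scores of those hits
    """
    if not scores:
        # A region without any hit gets zeros for all three values.
        return {"n_motifs": 0, "max_score": 0, "total_score": 0}
    return {
        "n_motifs": len(scores),
        "max_score": max(scores),
        "total_score": sum(scores),
    }


# Hypothetical integer scores of three hits of one motif in one region.
region_hits = [12, 9, 15]
summary = summarize_hits(region_hits)
empty_summary = summarize_hits([])
print(summary)
print(empty_summary)
```

Under this reading, n_motifs suits occurrence-based enrichment tests, while max_score and total_score keep information about how strong the matches are.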
Individual motifs¶
dmr_ds['dmr_motif_da']
<xarray.DataArray 'dmr_motif_da' (dmr: 132, motif: 2179, motif_value: 3)>
array([[[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]],
       [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]],
       [[0, 0, 0], [0, 0, 0], [0, 0, 0], ...,
       ...
        ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]],
       [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]],
       [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]]], dtype=uint16)
Coordinates:
  * dmr          (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom    (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end      (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length   (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335
    dmr_ndms     (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13
    dmr_start    (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * motif        (motif) <U29 'ALX3_homeodomain_2' ... 'ZSC31_HUMAN.H11MO.0.C'
  * motif_value  (motif_value) <U11 'n_motifs' 'max_score' 'total_score'
Motif Clusters¶
dmr_ds['dmr_motif-cluster_da']
<xarray.DataArray 'dmr_motif-cluster_da' (motif-cluster: 286, dmr: 132, motif_value: 3)>
[113256 values with dtype=uint16]
Coordinates:
  * dmr            (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom      (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end        (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length     (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335
    dmr_ndms       (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 1 2 7 13 19 9 9 3 6 13
    dmr_start      (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * motif_value    (motif_value) <U11 'n_motifs' 'max_score' 'total_score'
  * motif-cluster  (motif-cluster) object 'c1' 'c10' 'c100' ... 'c98' 'c99'
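As a rough illustration of what aggregating motif-level values to the motif-cluster level can look like, the sketch below combines per-motif values for a single region using a motif-to-cluster mapping. The aggregation rule (sum the counts, take the maximum of the maximum scores, sum the total scores), the toy numbers, and the helper `aggregate_by_cluster` are assumptions for illustration. The rule actually applied when combine_cluster=True may differ.

```python
# Motif-to-cluster mapping, using three motif names from the metadata table.
motif_to_cluster = {
    "LHX6_homeodomain_3": "c1",
    "Lhx8.mouse_homeodomain_3": "c1",
    "LHX2_MOUSE.H11MO.0.A": "c2",
}

# Hypothetical per-motif values for one region (made-up numbers).
motif_values = {
    "LHX6_homeodomain_3": {"n_motifs": 2, "max_score": 14, "total_score": 25},
    "Lhx8.mouse_homeodomain_3": {"n_motifs": 1, "max_score": 11, "total_score": 11},
    "LHX2_MOUSE.H11MO.0.A": {"n_motifs": 0, "max_score": 0, "total_score": 0},
}


def aggregate_by_cluster(values_by_motif, cluster_of):
    """Combine per-motif values into per-cluster values (assumed rule)."""
    clusters = {}
    for motif, values in values_by_motif.items():
        agg = clusters.setdefault(
            cluster_of[motif],
            {"n_motifs": 0, "max_score": 0, "total_score": 0},
        )
        agg["n_motifs"] += values["n_motifs"]  # sum hit counts
        agg["max_score"] = max(agg["max_score"], values["max_score"])  # best hit
        agg["total_score"] += values["total_score"]  # sum of scores
    return clusters


cluster_values = aggregate_by_cluster(motif_values, motif_to_cluster)
print(cluster_values)
```

Because motifs in the same cluster are nearly identical, several of them can match the same site. A simple sum like this one may therefore count one binding site more than once, which is worth keeping in mind when interpreting cluster-level counts.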
Scan Other Motifs¶
Coming soon…