Motif Scan

Transcription factor binding motifs are often enriched in cis-regulatory elements and can suggest the regulatory mechanisms acting on those elements. The first step in studying these DNA motifs is to scan for their occurrences in the genomic regions of interest.

The Default MotifSet

Currently, the motif scan function uses a default motif dataset containing >2000 motifs from three databases [Khan et al., 2018, Kulakovskiy et al., 2018, Jolma et al., 2013]. Each motif is also annotated with human and mouse gene names to facilitate further interpretation.

Following the analysis in [Vierstra et al., 2020] (see also this great blog), these motifs are clustered into 286 motif-clusters based on their similarity (some motifs are almost identical). We scan all individual motifs here and also aggregate the results to the motif-cluster level. Further analyses, such as motif enrichment analysis, are best performed at the motif-cluster level.
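
To make the idea of aggregating motif-level results to the cluster level concrete, here is a minimal pure-Python sketch. The three motif names and their cluster ids are taken from the default set's metadata; the hit values are made up, and the aggregation rule (sum the counts and total scores, take the maximum of the max scores) is an assumption for illustration, not necessarily the rule the package applies.

```python
from collections import defaultdict

# Motif -> cluster mapping (three entries from the default motif set's metadata).
motif_cluster = {
    "LHX6_homeodomain_3": "c1",
    "Lhx8.mouse_homeodomain_3": "c1",
    "LHX2_MOUSE.H11MO.0.A": "c2",
}

# Made-up scan results for one region: motif -> (n_motifs, max_score, total_score).
motif_hits = {
    "LHX6_homeodomain_3": (2, 11, 20),
    "Lhx8.mouse_homeodomain_3": (1, 9, 9),
    "LHX2_MOUSE.H11MO.0.A": (0, 0, 0),
}


def aggregate_by_cluster(hits, clusters):
    """Collapse per-motif values into per-cluster values (assumed rule)."""
    grouped = defaultdict(list)
    for motif, values in hits.items():
        grouped[clusters[motif]].append(values)

    summary = {}
    for cluster, rows in grouped.items():
        summary[cluster] = {
            "n_motifs": sum(row[0] for row in rows),     # total hit count
            "max_score": max(row[1] for row in rows),    # best single hit
            "total_score": sum(row[2] for row in rows),  # summed hit scores
        }
    return summary


cluster_hits = aggregate_by_cluster(motif_hits, motif_cluster)
print(cluster_hits)
```

Because near-identical motifs in one cluster tend to hit the same sites, a simple sum can count one site several times; this is one reason to treat cluster-level counts as relative rather than absolute measures.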

# check out the default motif set
default_motif_set = get_default_motif_set()
default_motif_set.n_motifs
2179
# metadata of the motifs
default_motif_set.meta_table
human_genes mouse_genes cluster_id database concensus relative_orientation width left_offset right_offset
motif
LHX6_homeodomain_3 LHX6 Lhx6 c1 Taipale_Cell_2013 TGATTGCAATCA Positive 12 0 0
Lhx8.mouse_homeodomain_3 LHX8 Lhx8 c1 Taipale_Cell_2013 TGATTGCAATTA Negative 12 0 0
LHX2_MOUSE.H11MO.0.A LHX2 Lhx2 c2 HOCOMOCO_v11 ACTAATTAAC Negative 10 7 9
LHX2_HUMAN.H11MO.0.A LHX2 Lhx2 c2 HOCOMOCO_v11 AACTAATTAAAA Negative 12 6 8
LHX3_MOUSE.H11MO.0.C LHX3 Lhx3 c2 HOCOMOCO_v11 TTAATTAGC Negative 9 8 9
... ... ... ... ... ... ... ... ... ...
Ahr+Arnt_MA0006.1 AMT Amt c284 Jaspar2018 TGCGTG Positive 6 2 1
KLF8_HUMAN.H11MO.0.C KLF8 Klf8 c285 HOCOMOCO_v11 CAGGGGGTG Positive 9 0 0
KLF8_MOUSE.H11MO.0.C KLF8 Klf8 c285 HOCOMOCO_v11 CAGGGGGTG Positive 9 0 0
ZSCAN4_MA1155.1 ZSCAN4 Zscan4b c286 Jaspar2018 TGCACACACTGAAAA Positive 15 0 0
ZSCAN4_C2H2_1 ZSCAN4 Zscan4b c286 Taipale_Cell_2013 TGCACACACTGAAAA Positive 15 0 0

2220 rows × 9 columns

# motif cluster
default_motif_set.motif_cluster
motif
LHX6_homeodomain_3            c1
Lhx8.mouse_homeodomain_3      c1
LHX2_MOUSE.H11MO.0.A          c2
LHX2_HUMAN.H11MO.0.A          c2
LHX3_MOUSE.H11MO.0.C          c2
                            ... 
Ahr+Arnt_MA0006.1           c284
KLF8_HUMAN.H11MO.0.C        c285
KLF8_MOUSE.H11MO.0.C        c285
ZSCAN4_MA1155.1             c286
ZSCAN4_C2H2_1               c286
Length: 2174, dtype: object
# single motif object
default_motif_set.motif_list[0]
<ALLCools.motif.motifs.Motif at 0x7f1bbc8c4730>

Motif PSSM Cutoffs

# To re-calculate motif thresholds with a different method or parameter

# default_motif_set.calculate_threshold(method='balanced', cpu=1, threshold_value=1000)

Scan Default Motifs

dmr_ds = RegionDS.open('test_HIP', select_dir=['dmr'])
dmr_ds
Using dmr as region_dim
<xarray.RegionDS>
Dimensions:              (count_type: 2, dmr: 132, sample: 20, sample_collapsed: 10)
Coordinates:
  * count_type           (count_type) <U3 'mc' 'cov'
  * dmr                  (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom            (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19'
    dmr_end              (dmr) int64 10002172 10003542 ... 5099203 5099952
    dmr_length           (dmr) int64 2 305 54 2 2 2 ... 924 632 842 195 399 335
    dmr_ndms             (dmr) int64 1 7 2 1 1 1 1 13 3 ... 2 7 13 19 9 9 3 6 13
    dmr_start            (dmr) int64 10002170 10003237 ... 5098804 5099617
  * sample               (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC'
  * sample_collapsed     (sample_collapsed) object 'ASC' 'CA1' ... 'ODC' 'OPC'
Data variables:
    dmr_da               (sample, dmr, count_type) uint32 4294967295 ... 4294...
    dmr_da_frac          (sample, dmr) float32 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0
    dmr_state            (sample, dmr) int8 -1 0 0 1 -1 -1 -1 ... 0 0 0 0 0 0 0
    dmr_state_collapsed  (dmr, sample_collapsed) int8 0 0 0 0 0 0 ... 0 0 0 0 0
Attributes:
    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
    region_dim:          dmr
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...

Default Motif Database

The scan_motifs method of RegionDS scans all regions with the default motif set. This is a time-consuming step: scanning 2M regions with 40 CPUs takes ~3 days.

dmr_ds.scan_motifs(genome_fasta='../../data/genome/mm10.fa',
                   cpu=45,
                   standardize_length=500,
                   motif_set_path=None,
                   chrom_size_path=None,
                   combine_cluster=True,
                   chunk_size=10000,
                   dim='motif')
index file ../../data/genome/mm10.fa.fai not found, generating...
Scan 2179 motif in 132 sequences.
Job 0 returned
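
The standardize_length=500 argument presumably resizes every region to a fixed width before scanning, which matters here because some DMRs are only 2 bp long. The sketch below shows one way such a resize could work: center a fixed-length window on the region midpoint and keep it inside the chromosome. The helper name standardize_region and the chromosome size are hypothetical, and the package's actual behavior may differ.

```python
def standardize_region(start, end, length, chrom_size):
    """Return a window of `length` bp centered on the region midpoint.

    Hypothetical helper: the window is shifted, not shrunk, when it would
    run past either end of the chromosome.
    """
    center = (start + end) // 2
    new_start = max(0, center - length // 2)
    new_end = min(chrom_size, new_start + length)
    # If the window was clipped at the chromosome end, shift the start back.
    new_start = max(0, new_end - length)
    return new_start, new_end


# First DMR in the dataset above (start 10002170, end 10002172),
# with a made-up chromosome size.
print(standardize_region(10002170, 10002172, length=500, chrom_size=200_000_000))
```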

Motif Values

After motif scanning, three values are stored for each motif in each region:

  • n_motifs

  • max_score

  • total_score

dmr_ds.get_index('motif_value')
Index(['n_motifs', 'max_score', 'total_score'], dtype='object', name='motif_value')
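
The following pure-Python sketch illustrates how these three values could be derived for one motif in one sequence. It assumes that n_motifs is the number of windows scoring at or above the motif's threshold, max_score is the best such score, and total_score is their sum; these definitions are inferred from the names. The toy matrix, the uniform background, the threshold, and the forward-strand-only scan are all simplifications, and the scores here are floats whereas the stored values are uint16.

```python
import math

BASES = "ACGT"

# Toy position probability matrix for the 4-bp motif "AGTA" (made-up values).
# Each row is one motif position; columns follow the order of BASES.
PPM = [
    [0.7, 0.1, 0.1, 0.1],  # A
    [0.1, 0.1, 0.7, 0.1],  # G
    [0.1, 0.1, 0.1, 0.7],  # T
    [0.7, 0.1, 0.1, 0.1],  # A
]


def to_pssm(ppm, background=0.25):
    """Convert probabilities to log-odds scores against a uniform background."""
    return [[math.log2(p / background) for p in row] for row in ppm]


def scan_sequence(seq, pssm, threshold):
    """Slide the PSSM along the forward strand and summarize hits."""
    width = len(pssm)
    hit_scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(pssm[i][BASES.index(base)] for i, base in enumerate(window))
        if score >= threshold:
            hit_scores.append(score)
    return {
        "n_motifs": len(hit_scores),
        "max_score": max(hit_scores, default=0.0),
        "total_score": sum(hit_scores),
    }


result = scan_sequence("CCAGTACCAGTA", to_pssm(PPM), threshold=5.0)
print(result)
```

A real scan would also score the reverse complement and use the per-motif threshold held by the motif set.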

Individual motifs

dmr_ds['dmr_motif_da']
<xarray.DataArray 'dmr_motif_da' (dmr: 132, motif: 2179, motif_value: 3)>
array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
...
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]], dtype=uint16)
Coordinates:
  * dmr          (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom    (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end      (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length   (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335
    dmr_ndms     (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13
    dmr_start    (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * motif        (motif) <U29 'ALX3_homeodomain_2' ... 'ZSC31_HUMAN.H11MO.0.C'
  * motif_value  (motif_value) <U11 'n_motifs' 'max_score' 'total_score'

Motif Clusters

dmr_ds['dmr_motif-cluster_da']
<xarray.DataArray 'dmr_motif-cluster_da' (motif-cluster: 286, dmr: 132, motif_value: 3)>
[113256 values with dtype=uint16]
Coordinates:
  * dmr            (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom      (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end        (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length     (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335
    dmr_ndms       (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 1 2 7 13 19 9 9 3 6 13
    dmr_start      (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * motif_value    (motif_value) <U11 'n_motifs' 'max_score' 'total_score'
  * motif-cluster  (motif-cluster) object 'c1' 'c10' 'c100' ... 'c98' 'c99'

Scan Other Motifs

Coming soon…