# Motif Scan

Transcription factor (TF) binding motifs are often enriched in cis-regulatory elements and can hint at the regulatory mechanisms acting on those elements. The first step in studying these DNA motifs is to scan for their occurrences in the genomic regions of interest.

## The Default MotifSet
Currently, the motif scan function uses a default motif dataset containing more than 2,000 motifs from three databases {cite}`Khan2018,Kulakovskiy2018,Jolma2013`. Each motif is also annotated with human and mouse gene names to facilitate further interpretation.

Following the analysis in {cite}`Vierstra2020` (see also [this great blog](https://www.vierstra.org/resources/motif_clustering)), these motifs are grouped into 286 motif clusters based on their similarity (some motifs are almost identical). We will scan all individual motifs here and also aggregate the results to the motif-cluster level. Further analysis, such as motif enrichment analysis, is best performed at the motif-cluster level.
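To make the idea of aggregating to the motif-cluster level concrete, here is a minimal sketch in plain Python. It assumes each motif carries three per-region values (`n_motifs`, `max_score`, `total_score`) and collapses the motifs of one cluster by taking the element-wise maximum, on the reasoning that near-identical motifs tend to hit the same sites and summing would double count them. The aggregation rule, the helper `aggregate_to_clusters`, and all numbers are assumptions for illustration; the rule ALLCools actually applies is not stated here.

```python
from collections import defaultdict

# Hypothetical values for ONE region: motif -> (n_motifs, max_score, total_score).
# Motif names and cluster ids follow the metadata table; the numbers are made up.
motif_values = {
    "LHX6_homeodomain_3": (2, 9.5, 17.0),
    "Lhx8.mouse_homeodomain_3": (1, 8.0, 8.0),
    "LHX2_MOUSE.H11MO.0.A": (0, 0.0, 0.0),
    "LHX2_HUMAN.H11MO.0.A": (3, 11.0, 27.5),
}
motif_cluster = {
    "LHX6_homeodomain_3": "c1",
    "Lhx8.mouse_homeodomain_3": "c1",
    "LHX2_MOUSE.H11MO.0.A": "c2",
    "LHX2_HUMAN.H11MO.0.A": "c2",
}


def aggregate_to_clusters(values, clusters):
    """Collapse per-motif values to per-cluster values (assumed rule: max)."""
    grouped = defaultdict(list)
    for motif, vals in values.items():
        grouped[clusters[motif]].append(vals)
    # Element-wise maximum over all motifs that belong to the same cluster.
    return {
        cluster: tuple(max(column) for column in zip(*rows))
        for cluster, rows in grouped.items()
    }


cluster_values = aggregate_to_clusters(motif_values, motif_cluster)
for cluster, vals in sorted(cluster_values.items()):
    print(cluster, vals)
```

Whatever rule is used, the point is the same: downstream statistics then operate on 286 clusters rather than on more than 2,000 partly redundant motifs.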

In [3]:
from ALLCools.mcds import RegionDS
from ALLCools.motif import MotifSet, get_default_motif_set

In [7]:
# check out the default motif set
default_motif_set = get_default_motif_set()
default_motif_set.n_motifs

2179

In [8]:
# metadata of the motifs
default_motif_set.meta_table

| motif | human_genes | mouse_genes | cluster_id | database | concensus | relative_orientation | width | left_offset | right_offset |
|---|---|---|---|---|---|---|---|---|---|
| LHX6_homeodomain_3 | LHX6 | Lhx6 | c1 | Taipale_Cell_2013 | TGATTGCAATCA | Positive | 12 | 0 | 0 |
| Lhx8.mouse_homeodomain_3 | LHX8 | Lhx8 | c1 | Taipale_Cell_2013 | TGATTGCAATTA | Negative | 12 | 0 | 0 |
| LHX2_MOUSE.H11MO.0.A | LHX2 | Lhx2 | c2 | HOCOMOCO_v11 | ACTAATTAAC | Negative | 10 | 7 | 9 |
| LHX2_HUMAN.H11MO.0.A | LHX2 | Lhx2 | c2 | HOCOMOCO_v11 | AACTAATTAAAA | Negative | 12 | 6 | 8 |
| LHX3_MOUSE.H11MO.0.C | LHX3 | Lhx3 | c2 | HOCOMOCO_v11 | TTAATTAGC | Negative | 9 | 8 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Ahr+Arnt_MA0006.1 | AMT | Amt | c284 | Jaspar2018 | TGCGTG | Positive | 6 | 2 | 1 |
| KLF8_HUMAN.H11MO.0.C | KLF8 | Klf8 | c285 | HOCOMOCO_v11 | CAGGGGGTG | Positive | 9 | 0 | 0 |
| KLF8_MOUSE.H11MO.0.C | KLF8 | Klf8 | c285 | HOCOMOCO_v11 | CAGGGGGTG | Positive | 9 | 0 | 0 |
| ZSCAN4_MA1155.1 | ZSCAN4 | Zscan4b | c286 | Jaspar2018 | TGCACACACTGAAAA | Positive | 15 | 0 | 0 |


In [9]:
# motif cluster
default_motif_set.motif_cluster

motif
LHX6_homeodomain_3            c1
Lhx8.mouse_homeodomain_3      c1
LHX2_MOUSE.H11MO.0.A          c2
LHX2_HUMAN.H11MO.0.A          c2
LHX3_MOUSE.H11MO.0.C          c2
                            ... 
Ahr+Arnt_MA0006.1           c284
KLF8_HUMAN.H11MO.0.C        c285
KLF8_MOUSE.H11MO.0.C        c285
ZSCAN4_MA1155.1             c286
ZSCAN4_C2H2_1               c286
Length: 2174, dtype: object

In [22]:
# single motif object
default_motif_set.motif_list[0]

<ALLCools.motif.motifs.Motif at 0x7f1bbc8c4730>

### Motif PSSM Cutoffs

In [23]:
# To re-calculate motif thresholds with a different method or parameter

# default_motif_set.calculate_threshold(method='balanced', cpu=1, threshold_value=1000)
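
For readers unfamiliar with PSSM cutoffs, the following sketch shows what a cutoff does during scanning: a position-specific scoring matrix assigns a log-odds score to every window of a sequence, and only windows scoring at or above the cutoff count as motif occurrences. The count matrix, pseudocount, uniform background, and the hand-picked cutoff of 5.0 are all made up for illustration, and only the forward strand is scanned. This is not how `calculate_threshold` derives its thresholds; it only illustrates how a threshold is applied once it exists.

```python
import math

BACKGROUND = 0.25   # assumed uniform base frequency
PSEUDOCOUNT = 0.5   # avoids log(0) for bases never seen at a position

# Hypothetical count matrix for a 4-bp motif with consensus TGAT.
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 20},
    {"A": 0, "C": 0, "G": 20, "T": 0},
    {"A": 18, "C": 0, "G": 0, "T": 2},
    {"A": 0, "C": 1, "G": 0, "T": 19},
]


def counts_to_pssm(count_matrix):
    """Convert per-position base counts to log2 odds against the background."""
    pssm = []
    for column in count_matrix:
        total = sum(column.values()) + 4 * PSEUDOCOUNT
        pssm.append({
            base: math.log2(((n + PSEUDOCOUNT) / total) / BACKGROUND)
            for base, n in column.items()
        })
    return pssm


def scan(sequence, pssm, cutoff):
    """Return (start, score) for every forward-strand window with score >= cutoff."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= cutoff:
            hits.append((start, score))
    return hits


pssm = counts_to_pssm(counts)
hits = scan("ACTGATCCTGATGG", pssm, cutoff=5.0)
print([start for start, _ in hits])  # the two TGAT sites, at offsets 2 and 8
```

Raising the cutoff makes the scan stricter (fewer, more consensus-like hits); lowering it admits weaker matches, which is why the threshold method and its parameter matter for every downstream count.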

## Scan Default Motifs

In [10]:
dmr_ds = RegionDS.open('test_HIP', select_dir=['dmr'])
dmr_ds

Using dmr as region_dim


### Default Motif Database

The {func}`scan_motifs <ALLCools.mcds.region_ds.RegionDS.scan_motifs>` method of RegionDS scans all regions with the default motif set. This is a time-consuming step: scanning 2M regions with 40 CPUs takes ~3 days.

In [11]:
dmr_ds.scan_motifs(genome_fasta='../../data/genome/mm10.fa',
                   cpu=45,
                   standardize_length=500,
                   motif_set_path=None,
                   chrom_size_path=None,
                   combine_cluster=True,
                   chunk_size=10000,
                   dim='motif')

index file ../../data/genome/mm10.fa.fai not found, generating...


Scan 2179 motif in 132 sequences.
Job 0 returned
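
Two parameters in the call above shape the workload: `standardize_length=500` and `chunk_size=10000`. The sketch below shows one plausible reading of each, under stated assumptions: that regions are resized to a fixed length around their midpoint (clipped to chromosome bounds), and that `chunk_size` is the number of regions handed to each job. Neither behaviour is confirmed by this page, and `standardize_region` is a hypothetical helper, not part of ALLCools.

```python
import math


def standardize_region(start, end, length=500, chrom_size=None):
    """Resize a region to `length` bp around its midpoint (assumed behaviour)."""
    center = (start + end) // 2
    new_start = max(center - length // 2, 0)
    new_end = new_start + length
    # Shift the window back if it would run past the chromosome end.
    if chrom_size is not None and new_end > chrom_size:
        new_end = chrom_size
        new_start = max(new_end - length, 0)
    return new_start, new_end


print(standardize_region(1000, 1100))                   # (800, 1300)
print(standardize_region(100, 200))                     # (0, 500)
print(standardize_region(1000, 1100, chrom_size=1200))  # (700, 1200)

# If chunk_size is the number of regions per job, 2M regions split into:
n_chunks = math.ceil(2_000_000 / 10_000)
print(n_chunks)  # 200
```

Fixed-length regions make per-region motif counts comparable, since a longer region would otherwise collect more hits simply by being longer.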


### Motif Values
After motif scanning, three values are stored for each motif in each region:
- n_motifs
- max_score
- total_score

In [24]:
dmr_ds.get_index('motif_value')

Index(['n_motifs', 'max_score', 'total_score'], dtype='object', name='motif_value')
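
As a rough illustration of how these three values relate to one another, the sketch below derives them from a list of hit scores for one motif in one region. It assumes the values summarize the hits that passed the motif's score cutoff and that a region without hits gets zeros; both points, and the helper `summarize_hits`, are assumptions rather than documented ALLCools behaviour.

```python
def summarize_hits(scores):
    """Summarize the above-cutoff hit scores of one motif in one region."""
    if not scores:
        # Assumed convention: no hits means all three values are zero.
        return {"n_motifs": 0, "max_score": 0.0, "total_score": 0.0}
    return {
        "n_motifs": len(scores),       # number of motif occurrences
        "max_score": max(scores),      # best single occurrence
        "total_score": sum(scores),    # sum over all occurrences
    }


print(summarize_hits([7.5, 9.0, 8.0]))
print(summarize_hits([]))
```

`n_motifs` suits count-based enrichment tests, while `max_score` and `total_score` keep information about how strong the matches are.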

### Individual motifs

In [18]:
dmr_ds['dmr_motif_da']

### Motif Clusters

In [25]:
dmr_ds['dmr_motif-cluster_da']

## Scan Other Motifs

Coming soon...