Annotate RegionDS¶

After getting an RegionDS from DMR calling or any other genome region sets, we can annotate the regions with other epigenomic profiles or genomic features stored in the BigWig or BED format.

For example, in this section, we will annotate the DMR RegionDS with chromatin accessibility profiles and some general genomic features.

Import¶

import pandas as pd
import pathlib
from ALLCools.mcds import RegionDS

DMR Chromatin Accessibility Profile (BigWig)¶

For example, here we annotate cluster-matched chromatin accessibility profiles from a mouse hippocampus snATAC-seq dataset. Each profile is stored in BigWig format. The annotate_by_bigwigs method of RegionDS can help scan the regions on each BigWig file, resulting a new data variable stored in the RegionDS

# prepare the bigwig tab-separated table, first column is cluster name, second column is BigWig path
bigwig_dir = '../../data/HIPBulk/atac_bulk/'
bigwigs = pd.Series({
    p.name.split('.')[0].split('_')[-1]: str(p)
    for p in pathlib.Path(bigwig_dir).glob('HIP_snATAC_*.bw')
})
bigwigs.to_csv('test_bigwig.csv', header=False)
bigwigs

CA23    ../../data/HIPBulk/atac_bulk/HIP_snATAC_CA23.bw
CGE      ../../data/HIPBulk/atac_bulk/HIP_snATAC_CGE.bw
ASC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_ASC.bw
MGE      ../../data/HIPBulk/atac_bulk/HIP_snATAC_MGE.bw
CA1      ../../data/HIPBulk/atac_bulk/HIP_snATAC_CA1.bw
ODC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_ODC.bw
MGC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_MGC.bw
NonN    ../../data/HIPBulk/atac_bulk/HIP_snATAC_NonN.bw
OPC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_OPC.bw
DG        ../../data/HIPBulk/atac_bulk/HIP_snATAC_DG.bw
dtype: object

dmr_ds.annotate_by_bigwigs(slop=250,
                           bigwig_table='test_bigwig.csv',
                           dim='snATAC',
                           cpu=30)

Use chunk size 1

# the annotated matrix are stored in a new data variable
dmr_ds['dmr_snATAC_da']

<xarray.DataArray 'dmr_snATAC_da' (dmr: 132, snATAC: 10)>
dask.array<open_dataset-03cdf5fb902771957a3508f7758e6aeedmr_snATAC_da, shape=(132, 10), dtype=float32, chunksize=(132, 1), chunktype=numpy.ndarray>
Coordinates:
  * dmr         (dmr) <U9 'chr1-0' 'chr1-1' 'chr1-2' ... 'chr19-122' 'chr19-123'
    dmr_chrom   (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end     (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length  (dmr) int64 2 305 54 2 2 2 2 440 ... 589 924 632 842 195 399 335
    dmr_ndms    (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13
    dmr_start   (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * snATAC      (snATAC) <U4 'CA23' 'CGE' 'ASC' 'MGE' ... 'NonN' 'OPC' 'DG'

DMR Overlapping Genome Features (BED)¶

Next, we overlap the DMR regions with a set of BED files that containing different kinds of genome features (e.g. CGI, promoter). The annotate_by_beds method of RegionDS can help scan the regions on each BigWig file, resulting a new data variable stored in the RegionDS. The output dataset is a boolean matrix recording whether a DMR is overlapping with one feature.

genome_feature_dir = '../../data/genome/genome_feature/'
genome_feature_beds = {
    '.'.join(p.name.split('.')[:-4]): str(p)
    for p in pathlib.Path(genome_feature_dir).glob('*.bed.gz')
}

beds = pd.Series(genome_feature_beds)
beds.to_csv('test_genome_featue_bed.csv', header=False)
beds

CGI                          ../../data/genome/genome_feature/CGI.merge.sor...
CGI_Shore                    ../../data/genome/genome_feature/CGI_Shore.mer...
blacklist                    ../../data/genome/genome_feature/blacklist.mm1...
gene.all                     ../../data/genome/genome_feature/gene.all.merg...
gene.lincRNA                 ../../data/genome/genome_feature/gene.lincRNA....
intron.all                   ../../data/genome/genome_feature/intron.all.me...
intron.first                 ../../data/genome/genome_feature/intron.first....
intron.protein_coding        ../../data/genome/genome_feature/intron.protei...
promoter.all                 ../../data/genome/genome_feature/promoter.all....
splicing_site_slop100        ../../data/genome/genome_feature/splicing_site...
exon.all                     ../../data/genome/genome_feature/exon.all.merg...
exon.first                   ../../data/genome/genome_feature/exon.first.me...
stop_codon.all               ../../data/genome/genome_feature/stop_codon.al...
start_codon.all              ../../data/genome/genome_feature/start_codon.a...
exon.protein_coding          ../../data/genome/genome_feature/exon.protein_...
transcript.all               ../../data/genome/genome_feature/transcript.al...
TSS.all                      ../../data/genome/genome_feature/TSS.all.merge...
TSS.protein_coding           ../../data/genome/genome_feature/TSS.protein_c...
UTR3.all                     ../../data/genome/genome_feature/UTR3.all.merg...
UTR3.protein_coding          ../../data/genome/genome_feature/UTR3.protein_...
UTR5.all                     ../../data/genome/genome_feature/UTR5.all.merg...
UTR5.protein_coding          ../../data/genome/genome_feature/UTR5.protein_...
gene.protein_coding          ../../data/genome/genome_feature/gene.protein_...
promoter.protein_coding      ../../data/genome/genome_feature/promoter.prot...
transcript.protein_coding    ../../data/genome/genome_feature/transcript.pr...
dtype: object

dmr_ds.annotate_by_beds(slop=250,
                        bed_table='test_genome_featue_bed.csv',
                        dim='genome-features',
                        bed_sorted=False,
                        cpu=30)

Use chunk size 1

# the annotated matrix are stored in a new data variable
dmr_ds['dmr_genome-features_da']

<xarray.DataArray 'dmr_genome-features_da' (dmr: 132, genome-features: 25)>
dask.array<open_dataset-50fa93a20cd098fbd6ef8798079cfba4dmr_genome-features_da, shape=(132, 25), dtype=bool, chunksize=(132, 1), chunktype=numpy.ndarray>
Coordinates:
  * dmr              (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom        (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end          (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length       (dmr) int64 2 305 54 2 2 2 2 ... 924 632 842 195 399 335
    dmr_ndms         (dmr) int64 1 7 2 1 1 1 1 13 3 2 ... 1 2 7 13 19 9 9 3 6 13
    dmr_start        (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * genome-features  (genome-features) <U25 'CGI' ... 'transcript.protein_cod...

Annotate RegionDS

Contents

Annotate RegionDS¶

Import¶

Open RegionDS¶

DMR Chromatin Accessibility Profile (BigWig)¶

DMR Overlapping Genome Features (BED)¶

After Annotation¶

Selectively open¶