Annotate RegionDS

After getting an RegionDS from DMR calling or any other genome region sets, we can annotate the regions with other epigenomic profiles or genomic features stored in the BigWig or BED format.

For example, in this section, we will annotate the DMR RegionDS with chromatin accessibility profiles and some general genomic features.

Import

import pandas as pd
import pathlib
from ALLCools.mcds import RegionDS

Open RegionDS

dmr_ds = RegionDS.open('test_HIP')
dmr_ds
Using dmr as region_dim
<xarray.RegionDS>
Dimensions:      (count_type: 2, dmr: 132, sample: 20)
Coordinates:
  * count_type   (count_type) <U3 'mc' 'cov'
  * dmr          (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom    (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end      (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length   (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335
    dmr_ndms     (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13
    dmr_start    (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * sample       (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_ODC' 'snmC_OPC'
Data variables:
    dmr_da       (sample, dmr, count_type) uint32 4294967295 ... 4294967295
    dmr_da_frac  (sample, dmr) float32 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
    dmr_state    (sample, dmr) int8 -1 0 0 1 -1 -1 -1 0 0 ... 0 0 0 0 0 0 0 0 0
Attributes:
    region_dim:          dmr
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...

DMR Chromatin Accessibility Profile (BigWig)

For example, here we annotate cluster-matched chromatin accessibility profiles from a mouse hippocampus snATAC-seq dataset. Each profile is stored in BigWig format. The annotate_by_bigwigs method of RegionDS can help scan the regions on each BigWig file, resulting a new data variable stored in the RegionDS

# prepare the bigwig tab-separated table, first column is cluster name, second column is BigWig path
bigwig_dir = '../../data/HIPBulk/atac_bulk/'
bigwigs = pd.Series({
    p.name.split('.')[0].split('_')[-1]: str(p)
    for p in pathlib.Path(bigwig_dir).glob('HIP_snATAC_*.bw')
})
bigwigs.to_csv('test_bigwig.csv', header=False)
bigwigs
CA23    ../../data/HIPBulk/atac_bulk/HIP_snATAC_CA23.bw
CGE      ../../data/HIPBulk/atac_bulk/HIP_snATAC_CGE.bw
ASC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_ASC.bw
MGE      ../../data/HIPBulk/atac_bulk/HIP_snATAC_MGE.bw
CA1      ../../data/HIPBulk/atac_bulk/HIP_snATAC_CA1.bw
ODC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_ODC.bw
MGC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_MGC.bw
NonN    ../../data/HIPBulk/atac_bulk/HIP_snATAC_NonN.bw
OPC      ../../data/HIPBulk/atac_bulk/HIP_snATAC_OPC.bw
DG        ../../data/HIPBulk/atac_bulk/HIP_snATAC_DG.bw
dtype: object
dmr_ds.annotate_by_bigwigs(slop=250,
                           bigwig_table='test_bigwig.csv',
                           dim='snATAC',
                           cpu=30)
Use chunk size 1
# the annotated matrix are stored in a new data variable
dmr_ds['dmr_snATAC_da']
<xarray.DataArray 'dmr_snATAC_da' (dmr: 132, snATAC: 10)>
dask.array<open_dataset-03cdf5fb902771957a3508f7758e6aeedmr_snATAC_da, shape=(132, 10), dtype=float32, chunksize=(132, 1), chunktype=numpy.ndarray>
Coordinates:
  * dmr         (dmr) <U9 'chr1-0' 'chr1-1' 'chr1-2' ... 'chr19-122' 'chr19-123'
    dmr_chrom   (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end     (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length  (dmr) int64 2 305 54 2 2 2 2 440 ... 589 924 632 842 195 399 335
    dmr_ndms    (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13
    dmr_start   (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * snATAC      (snATAC) <U4 'CA23' 'CGE' 'ASC' 'MGE' ... 'NonN' 'OPC' 'DG'

DMR Overlapping Genome Features (BED)

Next, we overlap the DMR regions with a set of BED files that containing different kinds of genome features (e.g. CGI, promoter). The annotate_by_beds method of RegionDS can help scan the regions on each BigWig file, resulting a new data variable stored in the RegionDS. The output dataset is a boolean matrix recording whether a DMR is overlapping with one feature.

genome_feature_dir = '../../data/genome/genome_feature/'
genome_feature_beds = {
    '.'.join(p.name.split('.')[:-4]): str(p)
    for p in pathlib.Path(genome_feature_dir).glob('*.bed.gz')
}

beds = pd.Series(genome_feature_beds)
beds.to_csv('test_genome_featue_bed.csv', header=False)
beds
CGI                          ../../data/genome/genome_feature/CGI.merge.sor...
CGI_Shore                    ../../data/genome/genome_feature/CGI_Shore.mer...
blacklist                    ../../data/genome/genome_feature/blacklist.mm1...
gene.all                     ../../data/genome/genome_feature/gene.all.merg...
gene.lincRNA                 ../../data/genome/genome_feature/gene.lincRNA....
intron.all                   ../../data/genome/genome_feature/intron.all.me...
intron.first                 ../../data/genome/genome_feature/intron.first....
intron.protein_coding        ../../data/genome/genome_feature/intron.protei...
promoter.all                 ../../data/genome/genome_feature/promoter.all....
splicing_site_slop100        ../../data/genome/genome_feature/splicing_site...
exon.all                     ../../data/genome/genome_feature/exon.all.merg...
exon.first                   ../../data/genome/genome_feature/exon.first.me...
stop_codon.all               ../../data/genome/genome_feature/stop_codon.al...
start_codon.all              ../../data/genome/genome_feature/start_codon.a...
exon.protein_coding          ../../data/genome/genome_feature/exon.protein_...
transcript.all               ../../data/genome/genome_feature/transcript.al...
TSS.all                      ../../data/genome/genome_feature/TSS.all.merge...
TSS.protein_coding           ../../data/genome/genome_feature/TSS.protein_c...
UTR3.all                     ../../data/genome/genome_feature/UTR3.all.merg...
UTR3.protein_coding          ../../data/genome/genome_feature/UTR3.protein_...
UTR5.all                     ../../data/genome/genome_feature/UTR5.all.merg...
UTR5.protein_coding          ../../data/genome/genome_feature/UTR5.protein_...
gene.protein_coding          ../../data/genome/genome_feature/gene.protein_...
promoter.protein_coding      ../../data/genome/genome_feature/promoter.prot...
transcript.protein_coding    ../../data/genome/genome_feature/transcript.pr...
dtype: object
dmr_ds.annotate_by_beds(slop=250,
                        bed_table='test_genome_featue_bed.csv',
                        dim='genome-features',
                        bed_sorted=False,
                        cpu=30)
Use chunk size 1
# the annotated matrix are stored in a new data variable
dmr_ds['dmr_genome-features_da']
<xarray.DataArray 'dmr_genome-features_da' (dmr: 132, genome-features: 25)>
dask.array<open_dataset-50fa93a20cd098fbd6ef8798079cfba4dmr_genome-features_da, shape=(132, 25), dtype=bool, chunksize=(132, 1), chunktype=numpy.ndarray>
Coordinates:
  * dmr              (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom        (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end          (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length       (dmr) int64 2 305 54 2 2 2 2 ... 924 632 842 195 399 335
    dmr_ndms         (dmr) int64 1 7 2 1 1 1 1 13 3 2 ... 1 2 7 13 19 9 9 3 6 13
    dmr_start        (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * genome-features  (genome-features) <U25 'CGI' ... 'transcript.protein_cod...

After Annotation

After annotation, the RegionDS will contain additional dmr-by-feature matrix.

# note the dmr_genome-features and dmr_snATAC dir is newly added by the annotation functions
!tree -L 2 test_HIP/
test_HIP/
├── chrom_sizes.txt
├── dmr
│   ├── count_type
│   ├── dmr
│   ├── dmr_chrom
│   ├── dmr_da
│   ├── dmr_da_frac
│   ├── dmr_end
│   ├── dmr_length
│   ├── dmr_ndms
│   ├── dmr_start
│   ├── dmr_state
│   └── sample
├── dmr_genome-features
│   ├── dmr
│   ├── dmr_genome-features_da
│   └── genome-features
├── dmr_snATAC
│   ├── dmr
│   ├── dmr_snATAC_da
│   └── snATAC
└── dms
    ├── count_type
    ├── dms
    ├── dms_chrom
    ├── dms_contexts
    ├── dms_da
    ├── dms_da_frac
    ├── dms_pos
    ├── dms_p-values
    ├── dms_residual
    └── sample

31 directories, 1 file

Selectively open

When you open a RegionDS containing multiple annotation, you can select specific datasets with select_dir and select specific DMRs with use_regions.

# select 2 DMR and their snATAC annotation
RegionDS.open('test_HIP/', 
              select_dir=['dmr', 'dmr_snATAC'],
              use_regions=['chr1-0', 'chr1-1'])
Using dmr as region_dim
<xarray.RegionDS>
Dimensions:        (count_type: 2, dmr: 2, sample: 20, snATAC: 10)
Coordinates:
  * count_type     (count_type) <U3 'mc' 'cov'
  * dmr            (dmr) <U9 'chr1-0' 'chr1-1'
    dmr_chrom      (dmr) <U5 'chr1' 'chr1'
    dmr_end        (dmr) int64 10002172 10003542
    dmr_length     (dmr) int64 2 305
    dmr_ndms       (dmr) int64 1 7
    dmr_start      (dmr) int64 10002170 10003237
  * sample         (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC'
  * snATAC         (snATAC) <U4 'CA23' 'CGE' 'ASC' 'MGE' ... 'NonN' 'OPC' 'DG'
Data variables:
    dmr_da         (sample, dmr, count_type) uint32 4294967295 ... 4294967295
    dmr_da_frac    (sample, dmr) float32 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
    dmr_state      (sample, dmr) int8 -1 0 0 1 0 1 -1 0 -1 ... 0 1 0 0 1 0 0 0
    dmr_snATAC_da  (dmr, snATAC) float32 0.09983 0.02372 ... 0.02372 0.03385
Attributes:
    region_dim:          dmr
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...