Annotate RegionDS
Contents
Annotate RegionDS¶
After getting an RegionDS
from DMR calling or any other genome region sets, we can annotate the regions with other epigenomic profiles or genomic features stored in the BigWig or BED format.
For example, in this section, we will annotate the DMR RegionDS with chromatin accessibility profiles and some general genomic features.
Import¶
import pandas as pd
import pathlib
from ALLCools.mcds import RegionDS
Open RegionDS¶
dmr_ds = RegionDS.open('test_HIP')
dmr_ds
Using dmr as region_dim
<xarray.RegionDS> Dimensions: (count_type: 2, dmr: 132, sample: 20) Coordinates: * count_type (count_type) <U3 'mc' 'cov' * dmr (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123' dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19' dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952 dmr_length (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335 dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13 dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617 * sample (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_ODC' 'snmC_OPC' Data variables: dmr_da (sample, dmr, count_type) uint32 4294967295 ... 4294967295 dmr_da_frac (sample, dmr) float32 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 dmr_state (sample, dmr) int8 -1 0 0 1 -1 -1 -1 0 0 ... 0 0 0 0 0 0 0 0 0 Attributes: region_dim: dmr region_ds_location: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus... chrom_size_path: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
DMR Chromatin Accessibility Profile (BigWig)¶
For example, here we annotate cluster-matched chromatin accessibility profiles from a mouse hippocampus snATAC-seq dataset. Each profile is stored in BigWig format. The annotate_by_bigwigs
method of RegionDS can help scan the regions on each BigWig file, resulting a new data variable stored in the RegionDS
# prepare the bigwig tab-separated table, first column is cluster name, second column is BigWig path
bigwig_dir = '../../data/HIPBulk/atac_bulk/'
bigwigs = pd.Series({
p.name.split('.')[0].split('_')[-1]: str(p)
for p in pathlib.Path(bigwig_dir).glob('HIP_snATAC_*.bw')
})
bigwigs.to_csv('test_bigwig.csv', header=False)
bigwigs
CA23 ../../data/HIPBulk/atac_bulk/HIP_snATAC_CA23.bw
CGE ../../data/HIPBulk/atac_bulk/HIP_snATAC_CGE.bw
ASC ../../data/HIPBulk/atac_bulk/HIP_snATAC_ASC.bw
MGE ../../data/HIPBulk/atac_bulk/HIP_snATAC_MGE.bw
CA1 ../../data/HIPBulk/atac_bulk/HIP_snATAC_CA1.bw
ODC ../../data/HIPBulk/atac_bulk/HIP_snATAC_ODC.bw
MGC ../../data/HIPBulk/atac_bulk/HIP_snATAC_MGC.bw
NonN ../../data/HIPBulk/atac_bulk/HIP_snATAC_NonN.bw
OPC ../../data/HIPBulk/atac_bulk/HIP_snATAC_OPC.bw
DG ../../data/HIPBulk/atac_bulk/HIP_snATAC_DG.bw
dtype: object
dmr_ds.annotate_by_bigwigs(slop=250,
bigwig_table='test_bigwig.csv',
dim='snATAC',
cpu=30)
Use chunk size 1
# the annotated matrix are stored in a new data variable
dmr_ds['dmr_snATAC_da']
<xarray.DataArray 'dmr_snATAC_da' (dmr: 132, snATAC: 10)> dask.array<open_dataset-03cdf5fb902771957a3508f7758e6aeedmr_snATAC_da, shape=(132, 10), dtype=float32, chunksize=(132, 1), chunktype=numpy.ndarray> Coordinates: * dmr (dmr) <U9 'chr1-0' 'chr1-1' 'chr1-2' ... 'chr19-122' 'chr19-123' dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19' dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952 dmr_length (dmr) int64 2 305 54 2 2 2 2 440 ... 589 924 632 842 195 399 335 dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13 dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617 * snATAC (snATAC) <U4 'CA23' 'CGE' 'ASC' 'MGE' ... 'NonN' 'OPC' 'DG'
DMR Overlapping Genome Features (BED)¶
Next, we overlap the DMR regions with a set of BED files that containing different kinds of genome features (e.g. CGI, promoter). The annotate_by_beds
method of RegionDS can help scan the regions on each BigWig file, resulting a new data variable stored in the RegionDS. The output dataset is a boolean matrix recording whether a DMR is overlapping with one feature.
genome_feature_dir = '../../data/genome/genome_feature/'
genome_feature_beds = {
'.'.join(p.name.split('.')[:-4]): str(p)
for p in pathlib.Path(genome_feature_dir).glob('*.bed.gz')
}
beds = pd.Series(genome_feature_beds)
beds.to_csv('test_genome_featue_bed.csv', header=False)
beds
CGI ../../data/genome/genome_feature/CGI.merge.sor...
CGI_Shore ../../data/genome/genome_feature/CGI_Shore.mer...
blacklist ../../data/genome/genome_feature/blacklist.mm1...
gene.all ../../data/genome/genome_feature/gene.all.merg...
gene.lincRNA ../../data/genome/genome_feature/gene.lincRNA....
intron.all ../../data/genome/genome_feature/intron.all.me...
intron.first ../../data/genome/genome_feature/intron.first....
intron.protein_coding ../../data/genome/genome_feature/intron.protei...
promoter.all ../../data/genome/genome_feature/promoter.all....
splicing_site_slop100 ../../data/genome/genome_feature/splicing_site...
exon.all ../../data/genome/genome_feature/exon.all.merg...
exon.first ../../data/genome/genome_feature/exon.first.me...
stop_codon.all ../../data/genome/genome_feature/stop_codon.al...
start_codon.all ../../data/genome/genome_feature/start_codon.a...
exon.protein_coding ../../data/genome/genome_feature/exon.protein_...
transcript.all ../../data/genome/genome_feature/transcript.al...
TSS.all ../../data/genome/genome_feature/TSS.all.merge...
TSS.protein_coding ../../data/genome/genome_feature/TSS.protein_c...
UTR3.all ../../data/genome/genome_feature/UTR3.all.merg...
UTR3.protein_coding ../../data/genome/genome_feature/UTR3.protein_...
UTR5.all ../../data/genome/genome_feature/UTR5.all.merg...
UTR5.protein_coding ../../data/genome/genome_feature/UTR5.protein_...
gene.protein_coding ../../data/genome/genome_feature/gene.protein_...
promoter.protein_coding ../../data/genome/genome_feature/promoter.prot...
transcript.protein_coding ../../data/genome/genome_feature/transcript.pr...
dtype: object
dmr_ds.annotate_by_beds(slop=250,
bed_table='test_genome_featue_bed.csv',
dim='genome-features',
bed_sorted=False,
cpu=30)
Use chunk size 1
# the annotated matrix are stored in a new data variable
dmr_ds['dmr_genome-features_da']
<xarray.DataArray 'dmr_genome-features_da' (dmr: 132, genome-features: 25)> dask.array<open_dataset-50fa93a20cd098fbd6ef8798079cfba4dmr_genome-features_da, shape=(132, 25), dtype=bool, chunksize=(132, 1), chunktype=numpy.ndarray> Coordinates: * dmr (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123' dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19' dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952 dmr_length (dmr) int64 2 305 54 2 2 2 2 ... 924 632 842 195 399 335 dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 2 ... 1 2 7 13 19 9 9 3 6 13 dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617 * genome-features (genome-features) <U25 'CGI' ... 'transcript.protein_cod...
After Annotation¶
After annotation, the RegionDS will contain additional dmr-by-feature matrix.
# note the dmr_genome-features and dmr_snATAC dir is newly added by the annotation functions
!tree -L 2 test_HIP/
test_HIP/
├── chrom_sizes.txt
├── dmr
│ ├── count_type
│ ├── dmr
│ ├── dmr_chrom
│ ├── dmr_da
│ ├── dmr_da_frac
│ ├── dmr_end
│ ├── dmr_length
│ ├── dmr_ndms
│ ├── dmr_start
│ ├── dmr_state
│ └── sample
├── dmr_genome-features
│ ├── dmr
│ ├── dmr_genome-features_da
│ └── genome-features
├── dmr_snATAC
│ ├── dmr
│ ├── dmr_snATAC_da
│ └── snATAC
└── dms
├── count_type
├── dms
├── dms_chrom
├── dms_contexts
├── dms_da
├── dms_da_frac
├── dms_pos
├── dms_p-values
├── dms_residual
└── sample
31 directories, 1 file
Selectively open¶
When you open a RegionDS containing multiple annotation, you can select specific datasets with select_dir
and select specific DMRs with use_regions
.
# select 2 DMR and their snATAC annotation
RegionDS.open('test_HIP/',
select_dir=['dmr', 'dmr_snATAC'],
use_regions=['chr1-0', 'chr1-1'])
Using dmr as region_dim
<xarray.RegionDS> Dimensions: (count_type: 2, dmr: 2, sample: 20, snATAC: 10) Coordinates: * count_type (count_type) <U3 'mc' 'cov' * dmr (dmr) <U9 'chr1-0' 'chr1-1' dmr_chrom (dmr) <U5 'chr1' 'chr1' dmr_end (dmr) int64 10002172 10003542 dmr_length (dmr) int64 2 305 dmr_ndms (dmr) int64 1 7 dmr_start (dmr) int64 10002170 10003237 * sample (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC' * snATAC (snATAC) <U4 'CA23' 'CGE' 'ASC' 'MGE' ... 'NonN' 'OPC' 'DG' Data variables: dmr_da (sample, dmr, count_type) uint32 4294967295 ... 4294967295 dmr_da_frac (sample, dmr) float32 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 dmr_state (sample, dmr) int8 -1 0 0 1 0 1 -1 0 -1 ... 0 1 0 0 1 0 0 0 dmr_snATAC_da (dmr, snATAC) float32 0.09983 0.02372 ... 0.02372 0.03385 Attributes: region_dim: dmr region_ds_location: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus... chrom_size_path: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...