Alternative Way to Create RegionDS

Parse Methylpy DMRfind output

If you used the methylpy DMRfind function to identify DMRs, you can create a RegionDS from its output by running methylpy_to_region_ds.

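# import paths follow the ALLCools package layout; adjust if your version differs
from ALLCools.mcds import RegionDS
from ALLCools.dmr.parse_methylpy import methylpy_to_region_ds

# DMR output of methylpy DMRfind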
methylpy_dmr = '../../data/HIPBulk/DMR/snmC_CT/_rms_results_collapsed.tsv'
methylpy_to_region_ds(dmr_path=methylpy_dmr, output_dir='test_HIP_methylpy')
RegionDS.open('test_HIP_methylpy', region_dim='dmr')
<xarray.RegionDS>
Dimensions:      (dmr: 2337497, sample: 10)
Coordinates:
  * dmr          (dmr) <U15 'snmC_CT-0' 'snmC_CT-1' ... 'snmC_CT-2337496'
    dmr_chrom    (dmr) <U5 'chr1' 'chr1' 'chr1' 'chr1' ... 'chrY' 'chrY' 'chrY'
    dmr_end      (dmr) int64 3001020 3003900 3006189 ... 90811943 90812481
    dmr_ndms     (dmr) int64 1 3 2 1 3 2 1 1 1 1 3 1 ... 11 5 4 2 5 7 2 6 9 1 4
    dmr_start    (dmr) int64 3001018 3003640 3005998 ... 90811941 90812266
  * sample       (sample) <U17 'snmC_ASC' 'snmC_CA1' ... 'snmC_ODC' 'snmC_OPC'
Data variables:
    dmr_da_frac  (sample, dmr) float64 ...
    dmr_state    (sample, dmr) int16 ...
Attributes:
    region_dim:          dmr
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...

Create RegionDS from a BED file

You can create an empty RegionDS from a BED file; only the region coordinates are recorded. You can then perform annotation, motif scanning, and further analysis using the methods described in the following sections.

The BED file contains three required columns and one optional column:

  1. chrom: required

  2. start: required

  3. end: required

  4. region_id: optional but recommended. If it is not provided, RegionDS automatically generates f"{region_dim}_{i_row}" as the region_id. Each region_id must be unique.
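The column rules above can be sketched in plain Python. This is only an illustration of the described behavior, not the ALLCools implementation: `assign_region_ids` is a hypothetical helper, and treating `i_row` as a 0-based row number is an assumption.

```python
from collections import Counter


def assign_region_ids(bed_lines, region_dim="bed_region"):
    """Parse tab-separated BED lines into (chrom, start, end, region_id) tuples.

    Hypothetical helper for illustration; not part of ALLCools.
    """
    records = []
    for i_row, line in enumerate(bed_lines):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            raise ValueError(f"row {i_row}: need chrom, start and end")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Use the fourth column when present, otherwise generate an ID.
        # Assumption: i_row is the 0-based row number.
        if len(fields) > 3 and fields[3]:
            region_id = fields[3]
        else:
            region_id = f"{region_dim}_{i_row}"
        records.append((chrom, start, end, region_id))

    # region_id must be unique across the whole file.
    counts = Counter(record[3] for record in records)
    duplicated = sorted(rid for rid, n in counts.items() if n > 1)
    if duplicated:
        raise ValueError(f"duplicated region_id values: {duplicated}")
    return records
```

Running a check like this before building the RegionDS makes a duplicated or missing region_id visible early, while the file is still easy to fix.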

You also need to provide a chrom_size_path, which tells RegionDS the sizes of your chromosomes.

Important

About BED Sorting

Region order matters throughout the genomic analysis. The best practice is to sort your BED file according to the chrom_size_path you provide. RegionDS.from_bed sorts the BED file by default (sort_bed=True); if your BED file is already sorted, you can set sort_bed=False.
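As a rough sketch of what sorting "according to the chrom_size_path" can mean, the hypothetical helper below ranks chromosomes by their order of appearance in a chrom sizes file (assumed here to be the usual two-column chrom/size text format) and then sorts regions by start and end. Whether sort_bed=True does exactly this internally is an assumption; the helper is not part of ALLCools.

```python
def sort_bed_records(bed_records, chrom_sizes_lines):
    """Sort (chrom, start, end, ...) tuples by chrom-sizes order, then position.

    Hypothetical helper for illustration; not part of ALLCools.
    """
    # Rank chromosomes by the order in which they appear in the chrom sizes file.
    chrom_rank = {}
    for line in chrom_sizes_lines:
        fields = line.split()
        if fields:
            chrom_rank.setdefault(fields[0], len(chrom_rank))

    # Regions on chromosomes absent from the chrom sizes file cannot be ranked.
    unknown = sorted({record[0] for record in bed_records} - set(chrom_rank))
    if unknown:
        raise ValueError(f"chromosomes missing from chrom sizes: {unknown}")

    return sorted(bed_records, key=lambda r: (chrom_rank[r[0]], r[1], r[2]))
```

This differs from a plain lexicographic sort, which places chr10 and chr11 directly after chr1. A chrom sizes file that lists chr1, chr2, ..., chrX keeps chr2 ahead of chr10.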

# example BED file with region ID
!head test_from_bed_func.bed
chr1	45388517	45388519	snmC_CT-40246
chr1	58086003	58086005	snmC_CT-51693
chr10	96777313	96777315	snmC_CT-270457
chr10	97954303	97954318	snmC_CT-271472
chr10	106769860	106769862	snmC_CT-279004
chr10	111530721	111530723	snmC_CT-283627
chr10	116428091	116429149	snmC_CT-288520
chr11	10168273	10168460	snmC_CT-309845
chr11	19559808	19559926	snmC_CT-318359
chr11	42477074	42477076	snmC_CT-339956
bed_region_ds = RegionDS.from_bed(
    bed='test_from_bed_func.bed',
    location='test_from_bed_RegionDS',
    chrom_size_path='../../data/genome/mm10.main.nochrM.chrom.sizes',
    region_dim='bed_region',
    # True by default, set to False if bed is already sorted
    sort_bed=True)
# the RegionDS is stored at {location}
RegionDS.open('test_from_bed_RegionDS')
Using bed_region as region_dim
<xarray.RegionDS>
Dimensions:           (bed_region: 100)
Coordinates:
    bed_region_end    (bed_region) int64 45388519 58086005 ... 161863422
    bed_region_start  (bed_region) int64 45388517 58086003 ... 161863420
  * bed_region        (bed_region) <U15 'snmC_CT-40246' ... 'snmC_CT-2330818'
    bed_region_chrom  (bed_region) <U5 'chr1' 'chr1' 'chr2' ... 'chrX' 'chrX'
Data variables:
    *empty*
Attributes:
    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
    region_dim:          bed_region
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...