DMR Calling

The genomic analysis starts from identifying genome region of interests. For DNA methylation data, we mostly interested in Differentially Methylated Regions (DMR), which can be identified from multiple pseudo-bulk merged ALLC files. The ALLCools DMR calling function is based on the methylpy package, which identifies cell-type-specific DMRs in two steps:

  1. call_dms: Call Differentially Methylated Sites (DMS) with permutation based root-mean-square test;

  2. call_dmr: Call DMR from the DMS results.

For more details, please read the methods section of [Schultz et al., 2015] and [He et al., 2020]. If you used the the call_dmr function, please also cite these references.

Importantly, the DMR calling results are stored in RegionDS format, which serves as the basic data structure for all the following genomic analysis.

Import

Parameters

mc_bulk_dir = '../../data/HIPBulk/mc_bulk/'
# make a dict, key is sample name, value is allc path
import pathlib
allc_table = {
    allc_path.name.split('.')[0]: str(allc_path)
    for allc_path in pathlib.Path(mc_bulk_dir).glob(
        '*/*.CGN-Merge.allc.tsv.gz')
}
samples = list(allc_table.keys())
allc_paths = list(allc_table.values())

chrom_size_path = '../../data/genome/mm10.main.nochrM.chrom.sizes'
output_dir = 'test_HIP'
len(samples)
20

Call Differentially Methylated Sites

Identify DMS from multiple ALLC files via permutation based root-mean-square test [Schultz et al., 2015].

This step is time-consuming, can take hours to days to run depending on how large your dataset is. If you want to have an estimate on run time, you may use a small region to have a test run first. The run time scales (roughly) linearly as the number of sites. For example, passing region="chr1:10000000-11000000" to call_dms.

You can also separate the DMS calling process with this region parameter and combine all the RegionDS in the end.

call_dms(
    output_dir=output_dir,
    allc_paths=allc_paths,
    samples=samples,
    chrom_size_path=chrom_size_path,
    cpu=45,
    max_row_count=50,
    n_permute=3000,
    min_pvalue=0.01,
    # here we just calculate some small regions for demo
    # do not provide region parameter if you want to run DMR calling for the whole genome
    # This parameter can also be used for call DMR/DMS in specific region of interest
    region=['chr1:0-100', 'chr1:10000000-10010000', 'chr19:5000000-5100000'])
RMS tests for 105 sites.
RMS tests for 1923 sites.

Call Differentially Methylated Regions

Identify DMR by combining adjacent DMSs (max_dist) with some filtering criteria [Schultz et al., 2015]. See call_dmr for more details.

call_dmr(output_dir=output_dir,
         p_value_cutoff=0.001,
         frac_delta_cutoff=0.3,
         max_dist=250,
         residual_quantile=0.7,
         corr_cutoff=0.3,
         cpu=30)

All Results Stored in RegionDS

After DMR is identified, the output_dir will contain multiple xarray.Dataset for DMR and DMS stored in zarr format. You may open each of these datasets via xarray.open_zarr. More conveniently, the RegionDS class can handle all of these datasets by RegionDS.open.

Some key design principles for the RegionDS:

  • Similar to MCDS, the RegionDS is based on xarray.Dataset class, inherit all of its APIs and can handle large matrix (with size exceed physical memory) efficently.

  • The RegionDS is stored on disk with xarray’s zarr backend.

  • In addtion to xarray’s functions, the RegionDS provides more functions related to genomic region analysis, including region annotation, motif analysis, and correlation analysis, etc.

  • A key parameter for RegionDS is the region_dim, which tells most of its functions which region set to focus on. By default, after DMR calling, RegionDS use 'dmr' as the region_dim. When open a RegionDS that contains multiple datasets (e.g., different annotations adding by following sections), only those related to region_dim will be loaded.

  • By changing the region_dim parameter, you can open other dataset (e.g., dms) as well. The select_dir parameter also allows you to specify datasets to open.

DMR Dataset

DMR dataset contians three data variables by default:

  • dmr_da: contains the raw methylated and total cytosine counts;

  • dmr_da_frac: contains the mC fraction of each DMR;

  • dmr_state: contains the DMR state judgement of each DMR in each sample. Three posible values means:

    • -1: the DMR is hyper-methylated in this sample;

    • 0: the DMR methylation state is insignificant or inconsistent among the DMSs in this sample;

    • 1: the DMR is hypo-methylated in this sample.

    • The stringency of DMR state judgement can be controled by residual_quantile parameter. See [Schultz et al., 2015] for more details on how DMR states are identified based on the DMS-sample residual of goodness-of-fit statistics.

dmr_ds = RegionDS.open(output_dir)
dmr_ds
Using dmr as region_dim
<xarray.RegionDS>
Dimensions:      (count_type: 2, dmr: 132, sample: 20)
Coordinates:
  * count_type   (count_type) <U3 'mc' 'cov'
  * dmr          (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom    (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end      (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length   (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335
    dmr_ndms     (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13
    dmr_start    (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * sample       (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_ODC' 'snmC_OPC'
Data variables:
    dmr_da       (sample, dmr, count_type) uint32 4294967295 ... 4294967295
    dmr_da_frac  (sample, dmr) float32 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
    dmr_state    (sample, dmr) int8 -1 0 0 1 -1 -1 -1 0 0 ... 0 0 0 0 0 0 0 0 0
Attributes:
    region_dim:          dmr
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...

DMS matrix

DMS matrix is also stored in the same output_dir, you can open it by changing region_dim. You can also use this to rerun the call_dmr function above.

The three data variables in DMS dataset are:

  • dms_da: contains the raw methylated and total cytosine counts;

  • dms_da_frac: contains the mC fraction of each DMS;

  • dms_residual: contains the goodness-of-fit residual of each DMS from the root-mean-square test. The residual is used to determine DMR methylation state (hypo-methylated or hyper-methylated) in call_dmr, see also [Schultz et al., 2015].

dmr_ds = RegionDS.open(output_dir, region_dim='dms')
dmr_ds
<xarray.RegionDS>
Dimensions:       (count_type: 2, dms: 767, sample: 20)
Coordinates:
  * count_type    (count_type) <U3 'mc' 'cov'
  * dms           (dms) <U13 'chr1-10000325' 'chr1-10001697' ... 'chr19-5099951'
    dms_chrom     (dms) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dms_contexts  (dms) <U3 'CGG' 'CGC' 'CGG' 'CGG' ... 'CGG' 'CGA' 'CGC' 'CGA'
    dms_p-values  (dms) float64 0.001 0.007667 0.008 ... 0.0003333 0.0003333
    dms_pos       (dms) int64 10000325 10001697 10001708 ... 5099924 5099951
  * sample        (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC'
Data variables:
    dms_da        (sample, dms, count_type) float64 59.0 79.0 58.0 ... 3.0 7.0
    dms_da_frac   (sample, dms) float64 0.7468 0.9667 0.9123 ... 0.6667 0.4286
    dms_residual  (sample, dms) float64 -1.951 2.119 1.626 ... 0.1779 -0.8692
Attributes:
    region_dim:          dms
    region_ds_location:  /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
    chrom_size_path:     /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...

On-disk Structure of RegionDS

RegionDS store all the datasets in output_dir with xarray.Dataset.to_zarr

!tree -L 1 -h test_HIP/
test_HIP/
├── [ 320]  chrom_sizes.txt
├── [ 290]  dmr
└── [ 278]  dms

2 directories, 1 file

Each dataset dir contains a zarr-based xarray.Dataset, i.e., in a spetial zarr format.

!tree -L 1 -h test_HIP/dmr
test_HIP/dmr
├── [  61]  count_type
├── [ 276]  dmr
├── [ 276]  dmr_chrom
├── [4.0K]  dmr_da
├── [ 310]  dmr_da_frac
├── [ 276]  dmr_end
├── [ 276]  dmr_length
├── [ 276]  dmr_ndms
├── [ 196]  dmr_start
├── [ 310]  dmr_state
└── [  61]  sample

11 directories, 0 files

The {output_dir}/.ALLCools file contains configuration recognized by RegionDS.open.

!cat test_HIP/.ALLCools
ds_region_dim:
  dmr: dmr
  dms: dms
region_dim: dmr