DMR Calling
Contents
DMR Calling¶
The genomic analysis starts from identifying genome region of interests. For DNA methylation data, we mostly interested in Differentially Methylated Regions (DMR), which can be identified from multiple pseudo-bulk merged ALLC files. The ALLCools DMR calling function is based on the methylpy package, which identifies cell-type-specific DMRs in two steps:
call_dms
: Call Differentially Methylated Sites (DMS) with permutation based root-mean-square test;call_dmr
: Call DMR from the DMS results.
For more details, please read the methods section of [Schultz et al., 2015] and [He et al., 2020]. If you used the the call_dmr
function, please also cite these references.
Importantly, the DMR calling results are stored in RegionDS
format, which serves as the basic data structure for all the following genomic analysis.
Import¶
from ALLCools.mcds import RegionDS
from ALLCools.dmr import call_dms, call_dmr
Parameters¶
mc_bulk_dir = '../../data/HIPBulk/mc_bulk/'
# make a dict, key is sample name, value is allc path
import pathlib
allc_table = {
allc_path.name.split('.')[0]: str(allc_path)
for allc_path in pathlib.Path(mc_bulk_dir).glob(
'*/*.CGN-Merge.allc.tsv.gz')
}
samples = list(allc_table.keys())
allc_paths = list(allc_table.values())
chrom_size_path = '../../data/genome/mm10.main.nochrM.chrom.sizes'
output_dir = 'test_HIP'
len(samples)
20
Call Differentially Methylated Sites¶
Identify DMS from multiple ALLC files via permutation based root-mean-square test [Schultz et al., 2015].
This step is time-consuming, can take hours to days to run depending on how large your dataset is. If you want to have an estimate on run time, you may use a small region to have a test run first. The run time scales (roughly) linearly as the number of sites. For example, passing region="chr1:10000000-11000000"
to call_dms
.
You can also separate the DMS calling process with this region
parameter and combine all the RegionDS in the end.
call_dms(
output_dir=output_dir,
allc_paths=allc_paths,
samples=samples,
chrom_size_path=chrom_size_path,
cpu=45,
max_row_count=50,
n_permute=3000,
min_pvalue=0.01,
# here we just calculate some small regions for demo
# do not provide region parameter if you want to run DMR calling for the whole genome
# This parameter can also be used for call DMR/DMS in specific region of interest
region=['chr1:0-100', 'chr1:10000000-10010000', 'chr19:5000000-5100000'])
RMS tests for 105 sites.
RMS tests for 1923 sites.
Call Differentially Methylated Regions¶
Identify DMR by combining adjacent DMSs (max_dist
) with some filtering criteria [Schultz et al., 2015]. See call_dmr
for more details.
call_dmr(output_dir=output_dir,
p_value_cutoff=0.001,
frac_delta_cutoff=0.3,
max_dist=250,
residual_quantile=0.7,
corr_cutoff=0.3,
cpu=30)
All Results Stored in RegionDS¶
After DMR is identified, the output_dir
will contain multiple xarray.Dataset
for DMR and DMS stored in zarr format. You may open each of these datasets via xarray.open_zarr
. More conveniently, the RegionDS
class can handle all of these datasets by RegionDS.open
.
Some key design principles for the RegionDS
:
Similar to
MCDS
, the RegionDS is based onxarray.Dataset
class, inherit all of its APIs and can handle large matrix (with size exceed physical memory) efficently.The RegionDS is stored on disk with xarray’s zarr backend.
In addtion to xarray’s functions, the RegionDS provides more functions related to genomic region analysis, including region annotation, motif analysis, and correlation analysis, etc.
A key parameter for RegionDS is the
region_dim
, which tells most of its functions which region set to focus on. By default, after DMR calling, RegionDS use'dmr'
as theregion_dim
. When open a RegionDS that contains multiple datasets (e.g., different annotations adding by following sections), only those related toregion_dim
will be loaded.By changing the
region_dim
parameter, you can open other dataset (e.g., dms) as well. Theselect_dir
parameter also allows you to specify datasets to open.
DMR Dataset¶
DMR dataset contians three data variables by default:
dmr_da
: contains the raw methylated and total cytosine counts;dmr_da_frac
: contains the mC fraction of each DMR;dmr_state
: contains the DMR state judgement of each DMR in each sample. Three posible values means:-1
: the DMR is hyper-methylated in this sample;0
: the DMR methylation state is insignificant or inconsistent among the DMSs in this sample;1
: the DMR is hypo-methylated in this sample.The stringency of DMR state judgement can be controled by
residual_quantile
parameter. See [Schultz et al., 2015] for more details on how DMR states are identified based on the DMS-sample residual of goodness-of-fit statistics.
dmr_ds = RegionDS.open(output_dir)
dmr_ds
Using dmr as region_dim
<xarray.RegionDS> Dimensions: (count_type: 2, dmr: 132, sample: 20) Coordinates: * count_type (count_type) <U3 'mc' 'cov' * dmr (dmr) <U9 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123' dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19' dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952 dmr_length (dmr) int64 2 305 54 2 2 2 2 ... 589 924 632 842 195 399 335 dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 2 1 ... 2 1 2 7 13 19 9 9 3 6 13 dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617 * sample (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_ODC' 'snmC_OPC' Data variables: dmr_da (sample, dmr, count_type) uint32 4294967295 ... 4294967295 dmr_da_frac (sample, dmr) float32 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 dmr_state (sample, dmr) int8 -1 0 0 1 -1 -1 -1 0 0 ... 0 0 0 0 0 0 0 0 0 Attributes: region_dim: dmr region_ds_location: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus... chrom_size_path: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
DMS matrix¶
DMS matrix is also stored in the same output_dir
, you can open it by changing region_dim
. You can also use this to rerun the call_dmr
function above.
The three data variables in DMS dataset are:
dms_da
: contains the raw methylated and total cytosine counts;dms_da_frac
: contains the mC fraction of each DMS;dms_residual
: contains the goodness-of-fit residual of each DMS from the root-mean-square test. The residual is used to determine DMR methylation state (hypo-methylated or hyper-methylated) incall_dmr
, see also [Schultz et al., 2015].
dmr_ds = RegionDS.open(output_dir, region_dim='dms')
dmr_ds
<xarray.RegionDS> Dimensions: (count_type: 2, dms: 767, sample: 20) Coordinates: * count_type (count_type) <U3 'mc' 'cov' * dms (dms) <U13 'chr1-10000325' 'chr1-10001697' ... 'chr19-5099951' dms_chrom (dms) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19' dms_contexts (dms) <U3 'CGG' 'CGC' 'CGG' 'CGG' ... 'CGG' 'CGA' 'CGC' 'CGA' dms_p-values (dms) float64 0.001 0.007667 0.008 ... 0.0003333 0.0003333 dms_pos (dms) int64 10000325 10001697 10001708 ... 5099924 5099951 * sample (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC' Data variables: dms_da (sample, dms, count_type) float64 59.0 79.0 58.0 ... 3.0 7.0 dms_da_frac (sample, dms) float64 0.7468 0.9667 0.9123 ... 0.6667 0.4286 dms_residual (sample, dms) float64 -1.951 2.119 1.626 ... 0.1779 -0.8692 Attributes: region_dim: dms region_ds_location: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus... chrom_size_path: /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/clus...
On-disk Structure of RegionDS¶
RegionDS store all the datasets in output_dir
with xarray.Dataset.to_zarr
!tree -L 1 -h test_HIP/
test_HIP/
├── [ 320] chrom_sizes.txt
├── [ 290] dmr
└── [ 278] dms
2 directories, 1 file
Each dataset dir contains a zarr-based xarray.Dataset
, i.e., in a spetial zarr format.
!tree -L 1 -h test_HIP/dmr
test_HIP/dmr
├── [ 61] count_type
├── [ 276] dmr
├── [ 276] dmr_chrom
├── [4.0K] dmr_da
├── [ 310] dmr_da_frac
├── [ 276] dmr_end
├── [ 276] dmr_length
├── [ 276] dmr_ndms
├── [ 196] dmr_start
├── [ 310] dmr_state
└── [ 61] sample
11 directories, 0 files
The {output_dir}/.ALLCools
file contains configuration recognized by RegionDS.open
.
!cat test_HIP/.ALLCools
ds_region_dim:
dmr: dmr
dms: dms
region_dim: dmr