Filter DMR
Contents
Filter DMR¶
With the RegionDS
, you can further filter DMRs based on additional information. For example, remove those DMRs overlapping with ENCODE blacklist regions and readjust dmr state based on replicate information.
Import¶
from ALLCools.mcds import RegionDS
from ALLCools.dmr import collapse_replicates
Open RegionDS¶
region_ds = RegionDS.open('test_HIP', select_dir=['dmr', 'dmr_genome-features'])
Using dmr as region_dim
Remove Blacklist¶
Only if your genome has a blacklist and you’ve annotated the RegionDS with blacklist bed file through RegionDS.annotate_by_beds
.
# blacklist is annotated in the genome-features dataset
is_blacklist= region_ds.get_feature('blacklist', 'genome-features').astype(bool)
is_blacklist
dmr
chr1-0 False
chr1-1 False
chr1-2 False
chr1-3 False
chr1-4 False
...
chr19-119 False
chr19-120 False
chr19-121 False
chr19-122 False
chr19-123 False
Length: 132, dtype: bool
Filter by change DMR state¶
Here we do not really delete the “bad” DMRs, instead, just change the DMR state to 0 so they will not be included in the following selections. It is the recommended way to handel in-memory filtering
# assign blacklist-overlapping DMR state to 0 (not significant)
region_ds['dmr_state'].loc[{'dmr': is_blacklist.values}] = 0
Save the state change¶
If you do want to save the state change to the RegionDS on-disk storage, you would need to esplicitly save the whole dataset.
# mode='a' will overwrite the existing dataarray. See xarray.Dataset.to_zarr for details
region_ds.save(da_name='dmr_state', mode='a')
Filter by recreate in-memory RegionDS view¶
Alternatively, you can also recreate an RegionDS object by selecting the regions you need, like what you can do with any xarray objects.
Important
Changes only happen in-memory The xarray selection will not apply to on-disk storage of RegionDS unless esplicitly wrote. However, giving xarray’s lazy loading and dask backend, subsetting DMRs on-the-fly will not slow down computation. On the other hand, change the on-disk storage (zarr dataset) is (usually) much more expensive because the large matrix may need to be chunked again.
# select non-blacklist regions
new_region_ds = region_ds.sel({'dmr': ~is_blacklist.values})
Replicate Consistency¶
# each cell cluster has two "replicates"
sample_to_replicate = region_ds.get_feature('sample').apply(lambda i: i.split('_')[1])
sample_to_replicate.value_counts()
ASC 2
CA1 2
CA23 2
CGE-VipLamp5 2
DG 2
MGC 2
MGE-PvSst 2
NonN 2
ODC 2
OPC 2
dtype: int64
# add sample level DMR state matrix
collapse_replicates(region_ds=region_ds,
replicate_label=sample_to_replicate,
state_da='dmr_state')
Collapsed sample state added in exist RegionDS at /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/cluster_level/RegionDS/test_HIP
Final DMR hypo- hyper- state matrix¶
set dmr_state_collapsed
in futher sample based analysis or set use_collapsed=True
# this dataarray is newly added
region_ds['dmr_state_collapsed']
<xarray.DataArray 'dmr_state_collapsed' (dmr: 132, sample_collapsed: 10)> array([[ 0, 0, 0, ..., 0, 0, 0], [ 0, 1, 1, ..., 0, 0, 0], [ 0, 0, 0, ..., -1, 0, -1], ..., [ 0, 0, -1, ..., 0, 0, 0], [ 0, -1, -1, ..., 0, 1, 0], [ 0, 0, 0, ..., 0, 0, 0]], dtype=int8) Coordinates: * dmr (dmr) object 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123' dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19' dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952 dmr_length (dmr) int64 2 305 54 2 2 2 2 ... 924 632 842 195 399 335 dmr_ndms (dmr) int64 1 7 2 1 1 1 1 13 3 2 ... 2 7 13 19 9 9 3 6 13 dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617 * sample_collapsed (sample_collapsed) object 'ASC' 'CA1' ... 'ODC' 'OPC'
# you can get collapsed sample DMR ids
ca1_hypo, ca1_hyper = region_ds.get_hypo_hyper_index('CA1')
ca1_hypo
Index(['chr1-3', 'chr1-5', 'chr19-1', 'chr19-4', 'chr19-16', 'chr19-23',
'chr19-24', 'chr19-25', 'chr19-34', 'chr19-43', 'chr19-45', 'chr19-46',
'chr19-48', 'chr19-52', 'chr19-54', 'chr19-57', 'chr19-58', 'chr19-60',
'chr19-61', 'chr19-63', 'chr19-64', 'chr19-67', 'chr19-68', 'chr19-79',
'chr19-91', 'chr19-95', 'chr19-96', 'chr19-97', 'chr19-98', 'chr19-101',
'chr19-103', 'chr19-108', 'chr19-109', 'chr19-112', 'chr19-114',
'chr19-116', 'chr19-117', 'chr19-118', 'chr19-119', 'chr19-120',
'chr19-122'],
dtype='object', name='dmr')
ca1_hyper
Index(['chr1-1', 'chr19-5', 'chr19-6', 'chr19-7', 'chr19-10', 'chr19-12',
'chr19-17', 'chr19-40', 'chr19-82', 'chr19-89'],
dtype='object', name='dmr')