Filter DMR¶

With the RegionDS, you can further filter DMRs based on additional information. For example, remove those DMRs overlapping with ENCODE blacklist regions and readjust dmr state based on replicate information.

Import¶

from ALLCools.mcds import RegionDS
from ALLCools.dmr import collapse_replicates

Open RegionDS¶

region_ds = RegionDS.open('test_HIP', select_dir=['dmr', 'dmr_genome-features'])

Using dmr as region_dim

Remove Blacklist¶

Only if your genome has a blacklist and you’ve annotated the RegionDS with blacklist bed file through RegionDS.annotate_by_beds.

# blacklist is annotated in the genome-features dataset
is_blacklist= region_ds.get_feature('blacklist', 'genome-features').astype(bool)
is_blacklist

dmr
chr1-0       False
chr1-1       False
chr1-2       False
chr1-3       False
chr1-4       False
             ...  
chr19-119    False
chr19-120    False
chr19-121    False
chr19-122    False
chr19-123    False
Length: 132, dtype: bool

Filter by change DMR state¶

Here we do not really delete the “bad” DMRs, instead, just change the DMR state to 0 so they will not be included in the following selections. It is the recommended way to handel in-memory filtering

# assign blacklist-overlapping DMR state to 0 (not significant)
region_ds['dmr_state'].loc[{'dmr': is_blacklist.values}] = 0

Save the state change¶

If you do want to save the state change to the RegionDS on-disk storage, you would need to esplicitly save the whole dataset.

# mode='a' will overwrite the existing dataarray. See xarray.Dataset.to_zarr for details
region_ds.save(da_name='dmr_state', mode='a')

Filter by recreate in-memory RegionDS view¶

Alternatively, you can also recreate an RegionDS object by selecting the regions you need, like what you can do with any xarray objects.

Important

Changes only happen in-memory The xarray selection will not apply to on-disk storage of RegionDS unless esplicitly wrote. However, giving xarray’s lazy loading and dask backend, subsetting DMRs on-the-fly will not slow down computation. On the other hand, change the on-disk storage (zarr dataset) is (usually) much more expensive because the large matrix may need to be chunked again.

# select non-blacklist regions 
new_region_ds = region_ds.sel({'dmr': ~is_blacklist.values})

Replicate Consistency¶

# each cell cluster has two "replicates"
sample_to_replicate = region_ds.get_feature('sample').apply(lambda i: i.split('_')[1])
sample_to_replicate.value_counts()

ASC             2
CA1             2
CA23            2
CGE-VipLamp5    2
DG              2
MGC             2
MGE-PvSst       2
NonN            2
ODC             2
OPC             2
dtype: int64

# add sample level DMR state matrix
collapse_replicates(region_ds=region_ds,
                    replicate_label=sample_to_replicate,
                    state_da='dmr_state')

Collapsed sample state added in exist RegionDS at /home/hanliu/pkg/ALLCools_pycharm/docs/allcools/cluster_level/RegionDS/test_HIP

Final DMR hypo- hyper- state matrix¶

set dmr_state_collapsed in futher sample based analysis or set use_collapsed=True

# this dataarray is newly added
region_ds['dmr_state_collapsed']

<xarray.DataArray 'dmr_state_collapsed' (dmr: 132, sample_collapsed: 10)>
array([[ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  1,  1, ...,  0,  0,  0],
       [ 0,  0,  0, ..., -1,  0, -1],
       ...,
       [ 0,  0, -1, ...,  0,  0,  0],
       [ 0, -1, -1, ...,  0,  1,  0],
       [ 0,  0,  0, ...,  0,  0,  0]], dtype=int8)
Coordinates:
  * dmr               (dmr) object 'chr1-0' 'chr1-1' ... 'chr19-122' 'chr19-123'
    dmr_chrom         (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19'
    dmr_end           (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952
    dmr_length        (dmr) int64 2 305 54 2 2 2 2 ... 924 632 842 195 399 335
    dmr_ndms          (dmr) int64 1 7 2 1 1 1 1 13 3 2 ... 2 7 13 19 9 9 3 6 13
    dmr_start         (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617
  * sample_collapsed  (sample_collapsed) object 'ASC' 'CA1' ... 'ODC' 'OPC'

# you can get collapsed sample DMR ids
ca1_hypo, ca1_hyper = region_ds.get_hypo_hyper_index('CA1')

ca1_hypo

Index(['chr1-3', 'chr1-5', 'chr19-1', 'chr19-4', 'chr19-16', 'chr19-23',
       'chr19-24', 'chr19-25', 'chr19-34', 'chr19-43', 'chr19-45', 'chr19-46',
       'chr19-48', 'chr19-52', 'chr19-54', 'chr19-57', 'chr19-58', 'chr19-60',
       'chr19-61', 'chr19-63', 'chr19-64', 'chr19-67', 'chr19-68', 'chr19-79',
       'chr19-91', 'chr19-95', 'chr19-96', 'chr19-97', 'chr19-98', 'chr19-101',
       'chr19-103', 'chr19-108', 'chr19-109', 'chr19-112', 'chr19-114',
       'chr19-116', 'chr19-117', 'chr19-118', 'chr19-119', 'chr19-120',
       'chr19-122'],
      dtype='object', name='dmr')

ca1_hyper

Index(['chr1-1', 'chr19-5', 'chr19-6', 'chr19-7', 'chr19-10', 'chr19-12',
       'chr19-17', 'chr19-40', 'chr19-82', 'chr19-89'],
      dtype='object', name='dmr')

Filter DMR

Contents