ALLCools: ALL methyl-Cytosine tools


Contents

Merge Cluster
Cluster Enrichment


Prepare Cluster-by-Gene mC Profile

This notebook sums the cell-level feature counts in a cell MCDS by cell cluster. The resulting cluster-level MCDS can then be transformed into a RegionDS and saved.
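As a rough illustration of the aggregation described above (not the ALLCools implementation), the toy sketch below sums methylated (`mc`) and covered (`cov`) counts per cluster with plain pandas and then derives a cluster-level fraction. The cell names, cluster labels, and counts are all made up for the example.

```python
import pandas as pd

# Hypothetical per-cell counts for one gene:
# mc = methylated cytosines, cov = covered cytosines
counts = pd.DataFrame(
    {'mc': [2, 3, 10, 6], 'cov': [10, 10, 20, 20]},
    index=['cell1', 'cell2', 'cell3', 'cell4'])

# Hypothetical cluster assignment for the same cells
clusters = pd.Series(['A', 'A', 'B', 'B'], index=counts.index, name='sample')

# Sum the raw counts within each cluster, then compute the fraction
cluster_counts = counts.groupby(clusters).sum()
cluster_counts['frac'] = cluster_counts['mc'] / cluster_counts['cov']
print(cluster_counts)
```

Summing the counts before taking the ratio weights every cell by its coverage, which is generally more stable than averaging per-cell fractions when single cells are sparsely covered.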

import pandas as pd
from ALLCools.mcds import MCDS
from ALLCools.clustering import cluster_enriched_features

Merge Cluster

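# The `squeeze` keyword of read_csv was removed in pandas 2.0;
# DataFrame.squeeze('columns') performs the same one-column-to-Series conversion.
mc_cluster = pd.read_csv(
    '../../data/HIPBulk/DesideBulkLevel/snmC.anno.csv',
    header=None,
    index_col=0).squeeze('columns')

m3c_cluster = pd.read_csv(
    '../../data/HIPBulk/DesideBulkLevel/snm3C.anno.csv',
    header=None,
    index_col=0).squeeze('columns')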

mc_cluster = 'snmC_' + mc_cluster
m3c_cluster = 'snm3C_' + m3c_cluster
cell_cluster = pd.concat([mc_cluster, m3c_cluster])
cell_cluster.index.name = 'cell'
cell_cluster.name = 'sample'
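
To make the labeling step concrete, here is a small sketch with made-up annotations (the cell IDs and cluster labels are hypothetical). It shows how the prefix keeps the same cell type from the two technologies as separate groups after concatenation.

```python
import pandas as pd

# Hypothetical annotations: cell ID -> cluster label, one Series per technology
mc_cluster = pd.Series({'cellA': 'DG', 'cellB': 'CA1'})
m3c_cluster = pd.Series({'cellC': 'DG'})

# Prefix the labels so that 'DG' from snmC and 'DG' from snm3C stay distinct
cell_cluster = pd.concat(['snmC_' + mc_cluster, 'snm3C_' + m3c_cluster])
cell_cluster.index.name = 'cell'
cell_cluster.name = 'sample'
print(cell_cluster.to_dict())
```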

mcds = MCDS.open('../../data/Brain/snm*C-seq*/*.mcds')

Open MCDS with netcdf4 engine.

mcds.coords['sample'] = cell_cluster

cluster_mcds = mcds.merge_cluster(cluster_col='sample').to_region_ds('geneslop2k')

cluster_mcds

<xarray.RegionDS>
Dimensions:              (mc_type: 2, count_type: 2, geneslop2k: 55487, sample: 20, chrom100k: 27269)
Coordinates: (12/14)
  * mc_type              (mc_type) object 'CGN' 'CHN'
  * count_type           (count_type) object 'mc' 'cov'
    strand_type          <U4 'both'
  * geneslop2k           (geneslop2k) object 'ENSMUSG00000102693.1' ... 'ENSM...
    geneslop2k_chrom     (geneslop2k) object 'chr1' 'chr1' ... 'chrM' 'chrM'
    geneslop2k_start     (geneslop2k) int64 3071252 3100015 ... 13288 13355
    ...                   ...
    sample_CGN           (sample) float64 0.7203 0.7387 0.7422 ... 0.7434 0.7196
    sample_CHN           (sample) float64 0.007835 0.02362 ... 0.01121 0.005743
  * chrom100k            (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268
    chrom100k_chrom      (chrom100k) object 'chr1' 'chr1' ... 'chrY' 'chrM'
    chrom100k_bin_start  (chrom100k) int64 0 100000 200000 ... 91700000 0
    chrom100k_bin_end    (chrom100k) int64 100000 200000 ... 91744698 16299
Data variables:
    geneslop2k_da        (sample, geneslop2k, mc_type, count_type) float64 1....
    chrom100k_da         (sample, chrom100k, mc_type, count_type) float64 0.0...
    geneslop2k_da_frac   (sample, geneslop2k, mc_type) float64 1.252 ... 4.546
    chrom100k_da_frac    (sample, chrom100k, mc_type) float64 1.0 1.0 ... 5.264
Attributes:
    region_dim:  geneslop2k

The two coordinates elided from the summary above are:

    geneslop2k_end       (geneslop2k) int64 3076321 3104124 ... 16299 16299
    sample               (sample) object 'snm3C_ASC' ... 'snmC_OPC'

The full list of 20 sample labels is:

array(['snm3C_ASC', 'snm3C_CA1', 'snm3C_CA23', 'snm3C_CGE-VipLamp5',
       'snm3C_DG', 'snm3C_MGC', 'snm3C_MGE-PvSst', 'snm3C_NonN', 'snm3C_ODC',
       'snm3C_OPC', 'snmC_ASC', 'snmC_CA1', 'snmC_CA23', 'snmC_CGE-VipLamp5',
       'snmC_DG', 'snmC_MGC', 'snmC_MGE-PvSst', 'snmC_NonN', 'snmC_ODC',
       'snmC_OPC'], dtype=object)

cluster_mcds.to_zarr('test_HIP_Cluster')

<xarray.backends.zarr.ZarrStore at 0x7fb99abac040>

Cluster Enrichment

mcds.add_mc_frac(var_dim='geneslop2k')

adata = mcds.get_adata(mc_type='CHN',
                       var_dim='geneslop2k',
                       da_suffix='frac',
                       obs_dim='cell',
                       select_hvf=False)

/home/hanliu/miniconda3/envs/allcools_new/lib/python3.8/site-packages/dask/core.py:119: RuntimeWarning: invalid value encountered in true_divide
  return func(*(_execute_task(a, cache) for a in args))
/home/hanliu/miniconda3/envs/allcools_new/lib/python3.8/site-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in true_divide
  return func(*(_execute_task(a, cache) for a in args))
/home/hanliu/miniconda3/envs/allcools_new/lib/python3.8/site-packages/dask/array/numpy_compat.py:39: RuntimeWarning: invalid value encountered in true_divide
  x = np.divide(x1, x2, out)

adata.obs['cluster'] = adata.obs['sample'].str.split('_').str[1]

# Keep at most 1500 cells per cluster; larger clusters are randomly sampled down to 1500
cell_ids = adata.obs.groupby('cluster').apply(
    lambda i: i if i.shape[0] < 1500 else i.sample(1500)
).index.get_level_values(1)
adata = adata[cell_ids, :]
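
The per-cluster downsampling above can also be written as an explicit loop, which avoids relying on the index layout that `groupby(...).apply(...)` returns. The sketch below uses a toy `obs` table; the helper name `downsample_per_cluster` and its `seed` argument are hypothetical and not part of ALLCools.

```python
import pandas as pd


def downsample_per_cluster(obs, cluster_col, max_cells, seed=0):
    """Return cell IDs with at most `max_cells` cells kept per cluster."""
    kept = []
    for _, group in obs.groupby(cluster_col):
        if len(group) > max_cells:
            # Randomly sample large clusters; the fixed seed makes it reproducible
            group = group.sample(max_cells, random_state=seed)
        kept.extend(group.index)
    return pd.Index(kept)


# Toy example: cluster A has 5 cells, cluster B has 2
obs = pd.DataFrame({'cluster': ['A'] * 5 + ['B'] * 2},
                   index=[f'cell{i}' for i in range(7)])
cell_ids = downsample_per_cluster(obs, 'cluster', max_cells=3)
print(obs.loc[cell_ids, 'cluster'].value_counts().to_dict())
```

Capping the cluster size keeps large clusters from dominating the enrichment statistics and bounds the memory needed in the next step.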

cluster_enriched_features(adata,
                          cluster_col='cluster',
                          top_n=200,
                          alpha=0.05,
                          stat_plot=True,
                          method='mc')

Found 10 clusters to compute feature enrichment score
Computing enrichment score
Computing enrichment score FDR-corrected P values

Trying to set attribute `._uns` of view, copying.

Selected 1653 unique features

enrichment = adata.uns['cluster_feature_enrichment']
qvals = pd.DataFrame(
    enrichment['qvals'],
    index=adata.var_names,
    columns=enrichment['cluster_order'])

qvals.to_hdf('mCH.cluster_enrichment_qvals.hdf', key='data')
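
A table like `qvals` (features as rows, clusters as columns) can then be thresholded to list the enriched features of each cluster. The sketch below uses made-up q-values and gene names, and assumes that a smaller q-value means stronger evidence of enrichment, with the same `alpha=0.05` cutoff used in the call above.

```python
import pandas as pd

# Hypothetical q-value table: rows are features (genes), columns are clusters
qvals = pd.DataFrame(
    {'ASC': [0.001, 0.20, 0.03], 'DG': [0.50, 0.01, 0.04]},
    index=['gene1', 'gene2', 'gene3'])

alpha = 0.05
# For each cluster, collect the features whose q-value passes the threshold
enriched = {cluster: qvals.index[qvals[cluster] < alpha].tolist()
            for cluster in qvals.columns}
print(enriched)
```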


By Hanqing Liu
© Copyright 2019 - 2022.