DMR-Gene Correlation
Contents
DMR-Gene Correlation¶
Calculate the partial correlation between DMR and Genes using clusters/samples as features.
Import¶
import seaborn as sns
import matplotlib.pyplot as plt
from ALLCools.mcds import RegionDS
from ALLCools.mcds.correlation import region_correlation, get_corr_table
Load Data¶
# DMR data
dmr = RegionDS.open('../RegionDS/test_HIP/')
# gene data
gene = RegionDS.open('test_HIP_Cluster/', region_dim='geneslop2k')
Using dmr as region_dim
# global mC fractions, calculated from cell level MCDS by sum up cluster and genome base calls
sample_mch = gene.get_coords('sample_CHN')
sample_mcg = gene.get_coords('sample_CGN')
DMR Input¶
A 2-D sample-by-DMR DataArray, values are mCG fractions
data_a = dmr['dmr_da_frac'].transpose('sample', 'dmr')
data_a
<xarray.DataArray 'dmr_da_frac' (sample: 20, dmr: 131)> array([[0.538462, 0.816129, 0.70068 , ..., 0.144186, 0.340136, 0.292444], [0.717949, 0.97006 , 0.922078, ..., 0.03 , 0.142259, 0.249802], [0.717949, 0.974227, 0.886179, ..., 0.011628, 0.085575, 0.196755], ..., [0.888889, 0.670588, 0.038462, ..., 0.55 , 0.25974 , 0.621951], [0.857143, 0.727823, 0.666667, ..., 0.295455, 0.562387, 0.544554], [0.777778, 0.780822, 0.5625 , ..., 0.0625 , 0.297297, 0.292553]], dtype=float32) Coordinates: * dmr (dmr) <U9 'chr1-0' 'chr1-1' 'chr1-2' ... 'chr19-121' 'chr19-122' dmr_chrom (dmr) <U5 'chr1' 'chr1' 'chr1' ... 'chr19' 'chr19' 'chr19' dmr_end (dmr) int64 10002172 10003542 10003967 ... 5099203 5099952 dmr_length (dmr) int64 2 305 54 2 2 2 10 ... 589 924 632 842 195 399 335 dmr_ndms (dmr) int64 1 7 2 1 1 1 2 13 3 2 1 ... 2 1 2 7 13 19 8 8 3 6 13 dmr_start (dmr) int64 10002170 10003237 10003913 ... 5098804 5099617 * sample (sample) <U18 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_ODC' 'snmC_OPC'
Gene Input¶
A 2-D sample-by-gene DataArray, values are cluster-normalized mCH fractions
data_b = gene.sel(
mc_type='CHN',
sample=data_a.get_index('sample'))['geneslop2k_da_frac'].transpose(
'sample', 'geneslop2k')
data_b
<xarray.DataArray 'geneslop2k_da_frac' (sample: 20, geneslop2k: 55487)> [1109740 values with dtype=float64] Coordinates: mc_type <U3 'CHN' * sample (sample) object 'snm3C_ASC' 'snm3C_CA1' ... 'snmC_OPC' sample_CGN (sample) float64 0.7203 0.7387 0.7422 ... 0.7434 0.7196 sample_CHN (sample) float64 0.007835 0.02362 ... 0.01121 0.005743 strand_type <U4 'both' * geneslop2k (geneslop2k) object 'ENSMUSG00000102693.1' ... 'ENSMUSG... geneslop2k_chrom (geneslop2k) object 'chr1' 'chr1' 'chr1' ... 'chrM' 'chrM' geneslop2k_end (geneslop2k) int64 3076321 3104124 3673497 ... 16299 16299 geneslop2k_start (geneslop2k) int64 3071252 3100015 3203900 ... 13288 13355
Calculate correlations between DMR and genes¶
true_results, null_results = region_correlation(data_a,
data_b,
sample_mch,
sample_mcg,
method='pearson',
max_dist=1000000,
cpu=40,
null='sample',
null_n=100000)
Using dmr as region_dim
Using dmr_start and dmr_end to calculate pos
Using geneslop2k as region_dim
Using geneslop2k_start and geneslop2k_end to calculate pos
Calculating chr1
Calculating chr19
true_results.X.data.size, null_results.size
(19100, 19100)
Plot correlation distribution¶
import numpy as np
fig, ax = plt.subplots(dpi=200)
sns.histplot(ax=ax,
data=np.random.choice(null_results, 100000),
label='Null',
bins=100,
color='grey')
sns.histplot(ax=ax,
data=np.random.choice(true_results.X.data, 100000),
label='True',
bins=100,
color='red')
ax.legend()
<matplotlib.legend.Legend at 0x7f5ce2c67b80>
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
findfont: Generic family 'sans-serif' not found because none of the following families were found: Helvetica