ALLCools.mcds

Core Data Structure

Package Contents

class MCDS(dataset, obs_dim=None, var_dim=None)[source]

Bases: xarray.Dataset

MCDS Class

__slots__ = []
property var_dim(self)
property obs_dim(self)
property obs_names(self)
property var_names(self)
_verify_dim(self, dim, mode)
classmethod open(cls, mcds_paths, obs_dim='cell', use_obs=None, var_dim=None, chunks='auto', split_large_chunks=True, obj_to_str=True, engine=None)

Take one or multiple MCDS file paths and create a single MCDS concatenated on obs_dim.

Parameters
  • mcds_paths – A single MCDS path, an MCDS path pattern with wildcards, or a list of MCDS paths

  • obs_dim – Dimension name of observations, default is ‘cell’

  • use_obs – Subset the MCDS by a list of observation IDs.

  • var_dim – Which var_dim dataset to use; required when the MCDS has multiple var_dim datasets stored in the same directory

  • chunks – If not None, xarray loads the data as dask arrays using these chunks; 'auto' lets xarray determine the chunks automatically. For more options, read the xarray.open_dataset chunks parameter documentation. If None, xarray does not use dask, which is undesirable in most cases.

  • split_large_chunks – Whether to split large chunks via the dask config option array.slicing.split_large_chunks

  • obj_to_str – Whether to convert object-dtype coordinates to string data type

  • engine – xarray engine used to store the MCDS; if multiple MCDS are provided, they must all use the same engine

Returns

Return type

MCDS
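
A minimal usage sketch; the glob pattern and the "chrom100k" var_dim value are hypothetical:

>>> from ALLCools.mcds import MCDS
>>> # open several MCDS stores and concatenate them along the cell dimension
>>> mcds = MCDS.open("dataset/*.mcds", obs_dim="cell", var_dim="chrom100k")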

add_mc_frac(self, var_dim=None, da=None, normalize_per_cell=True, clip_norm_value=10, da_suffix='frac')

Add a posterior mC rate data array for a certain feature type (var_dim).

Parameters
  • var_dim – Name of the feature type

  • da – If None, the data array named f'{var_dim}_da' will be used

  • normalize_per_cell – If True, normalize the mC rate data array per cell

  • clip_norm_value – Clip values in the normalized mC rate data array that exceed this value

  • da_suffix – Name suffix appended to the calculated mC rate data array
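
For example, continuing with the hypothetical "chrom100k" feature type from the open sketch above:

>>> # add the per-cell normalized mC fraction, stored as "chrom100k_da_frac"
>>> mcds.add_mc_frac(var_dim="chrom100k", normalize_per_cell=True, clip_norm_value=10)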

add_mc_rate(self, *args, **kwargs)
add_feature_cov_mean(self, obs_dim=None, var_dim=None, plot=True)

Add the feature coverage mean across obs_dim.

Parameters
  • var_dim – Name of var dimension

  • obs_dim – Name of obs dimension

  • plot – If True, plot the distribution of the feature coverage mean

Returns

Return type

None
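
For example, with the hypothetical "chrom100k" feature type:

>>> # compute and plot the mean coverage of each feature across cells
>>> mcds.add_feature_cov_mean(var_dim="chrom100k")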

add_cell_metadata(self, metadata, obs_dim=None)
filter_feature_by_cov_mean(self, var_dim=None, min_cov=0, max_cov=999999)

Filter the MCDS by feature coverage mean. add_feature_cov_mean() must be called before this function.

Parameters
  • var_dim – Name of var dimension

  • min_cov – Minimum coverage cutoff

  • max_cov – Maximum coverage cutoff

Returns

Return type

MCDS
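
A sketch of the filtering step; the cutoffs are hypothetical and should be read off the distribution plotted by add_feature_cov_mean():

>>> # keep features whose mean coverage falls inside the chosen range
>>> mcds = mcds.filter_feature_by_cov_mean(var_dim="chrom100k", min_cov=100, max_cov=3000)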

get_feature_bed(self, var_dim=None)

Get a BED-format data frame of the var_dim features.

Parameters
  • var_dim – Name of var_dim

Returns

Return type

pd.DataFrame
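
For example:

>>> # BED-format coordinates of the hypothetical "chrom100k" features
>>> bed_df = mcds.get_feature_bed(var_dim="chrom100k")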

remove_black_list_region(self, black_list_path, var_dim=None, f=0.2)

Remove regions that overlap (bedtools intersect -f {f}) with regions in the black_list_path.

Parameters
  • var_dim – Name of var_dim

  • black_list_path – Path to the blacklist BED file

  • f – Fraction of overlap when calling bedtools intersect

Returns

Return type

MCDS
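
A sketch assuming a hypothetical ENCODE blacklist file:

>>> # drop features overlapping blacklist regions by at least 20%
>>> mcds = mcds.remove_black_list_region(
...     "hg38-blacklist.v2.bed.gz", var_dim="chrom100k", f=0.2
... )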

remove_chromosome(self, exclude_chromosome, var_dim=None)

Remove regions on specific chromosomes.

Parameters
  • var_dim – Name of var_dim

  • exclude_chromosome – Chromosome(s) to remove

Returns

Return type

MCDS (xr.Dataset)
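
For example:

>>> # drop features on the sex and mitochondrial chromosomes
>>> mcds = mcds.remove_chromosome(["chrX", "chrY", "chrM"], var_dim="chrom100k")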

calculate_hvf_svr(self, mc_type=None, var_dim=None, obs_dim=None, n_top_feature=5000, da_suffix='frac', plot=True)
calculate_hvf(self, mc_type=None, var_dim=None, obs_dim=None, min_disp=0.5, max_disp=None, min_mean=0, max_mean=5, n_top_feature=5000, bin_min_features=5, mean_binsize=0.05, cov_binsize=100, da_suffix='frac', plot=True)

Calculate normalized dispersion to select highly variable features.

Parameters
  • mc_type – Type of mC to calculate

  • var_dim – Name of variable

  • obs_dim – Name of observation, default is cell

  • min_disp – minimum dispersion for a feature to be considered

  • max_disp – maximum dispersion for a feature to be considered

  • min_mean – minimum mean for a feature to be considered

  • max_mean – maximum mean for a feature to be considered

  • n_top_feature – Top N features to use as highly variable features. If set, all the cutoffs are ignored and HVFs are selected based on the order of normalized dispersion.

  • bin_min_features – Minimum number of features for a bin to be kept separate; if below this number, the bin is merged into its closest bin.

  • mean_binsize – Bin size used to bin features by their mean

  • cov_binsize – Bin size used to bin features by their coverage

  • plot – If True, plot mean, coverage, and normalized dispersion scatter plots.

Returns

Return type

pd.DataFrame
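
A sketch; the "CGN" mc_type value is an assumption about how mC types are named in a given dataset:

>>> # select the top 5000 highly variable mCG features
>>> hvf_df = mcds.calculate_hvf(mc_type="CGN", var_dim="chrom100k", n_top_feature=5000)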

get_score_adata(self, mc_type, quant_type, obs_dim=None, var_dim=None, sparse=True)
get_adata(self, mc_type=None, obs_dim=None, var_dim=None, da_suffix='frac', select_hvf=True, split_large_chunks=True)

Get an anndata.AnnData object from the MCDS mC rate matrix.

Parameters
  • mc_type – mC rate type

  • var_dim – Name of variable

  • da_suffix – Suffix of the mC rate matrix

  • obs_dim – Name of observation

  • select_hvf – Whether to select HVFs; if True, mcds.coords['{var_dim}_{mc_type}_feature_select'] is used to select HVFs

  • split_large_chunks – Whether to split large chunks via the dask config option array.slicing.split_large_chunks

Returns

Return type

anndata.AnnData
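
For example, after calculate_hvf() has flagged the highly variable features:

>>> # export the HVF-selected mC fraction matrix as an AnnData object
>>> adata = mcds.get_adata(mc_type="CGN", var_dim="chrom100k", select_hvf=True)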

merge_cluster(self, cluster_col, obs_dim=None, add_mc_frac=True, add_overall_mc=True, overall_mc_da='chrom100k_da')
to_region_ds(self, region_dim=None)
write_dataset(self, output_path, mode='w-', obs_dim=None, var_dims: Union[str, list] = None, use_obs=None, chunk_size=1000)

Write the MCDS into an on-disk zarr dataset. Data arrays for each var_dim will be saved in separate sub-directories of output_path.

Parameters
  • output_path – Path of the zarr dataset

  • mode – 'w-' means write to output_path and fail if the path exists; 'w' means write to output_path and overwrite if the var_dim sub-directory exists

  • obs_dim – dimension name of observations

  • var_dims – Dimension name, or a list of dimension names, of the variables to write

  • use_obs – Select AND order observations when writing.

  • chunk_size – The chunk size for loading and writing; set this as large as possible given available memory.

Returns

Return type

output_path
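
A minimal sketch with a hypothetical output path:

>>> # write the filtered dataset to a new zarr store; fail if the path exists
>>> mcds.write_dataset("filtered.mcds", mode="w-", var_dims=["chrom100k"], chunk_size=1000)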

class RegionDS(dataset, region_dim=None, location=None, chrom_size_path=None)[source]

Bases: xarray.Dataset

A multi-dimensional, in-memory array database.

A dataset resembles an in-memory representation of a NetCDF file, and consists of variables, coordinates and attributes which together form a self-describing dataset.

Dataset implements the mapping interface with keys given by variable names and values given by DataArray objects for each variable name.

One dimensional variables with name equal to their dimension are index coordinates used for label based indexing.

To load data from a file or file-like object, use the open_dataset function.

Parameters
  • data_vars (dict-like, optional) –

    A mapping from variable names to DataArray objects, Variable objects or to tuples of the form (dims, data[, attrs]) which can be used as arguments to create a new Variable. Each dimension must have the same length in all variables in which it appears.

    The following notations are accepted:

    • mapping {var name: DataArray}

    • mapping {var name: Variable}

    • mapping {var name: (dimension name, array-like)}

    • mapping {var name: (tuple of dimension names, array-like)}

    • mapping {dimension name: array-like} (it will be automatically moved to coords, see below)

    Each dimension must have the same length in all variables in which it appears.

  • coords (dict-like, optional) –

    Another mapping in similar form as the data_vars argument, except that each item is saved on the dataset as a “coordinate”. These variables have an associated meaning: they describe constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in variables. Coordinate values may be given by 1-dimensional arrays or scalars, in which case dims do not need to be supplied: 1D arrays will be assumed to give index values along the dimension with the same name.

    The following notations are accepted:

    • mapping {coord name: DataArray}

    • mapping {coord name: Variable}

    • mapping {coord name: (dimension name, array-like)}

    • mapping {coord name: (tuple of dimension names, array-like)}

    • mapping {dimension name: array-like} (the dimension name is implicitly set to be the same as the coord name)

    The last notation implies that the coord name is the same as the dimension name.

  • attrs (dict-like, optional) – Global attributes to save on this dataset.

Examples

Create data:

>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
>>> precipitation = 10 * np.random.rand(2, 2, 3)
>>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>>> lat = [[42.25, 42.21], [42.63, 42.59]]
>>> time = pd.date_range("2014-09-06", periods=3)
>>> reference_time = pd.Timestamp("2014-09-05")

Initialize a dataset with multiple dimensions:

>>> ds = xr.Dataset(
...     data_vars=dict(
...         temperature=(["x", "y", "time"], temperature),
...         precipitation=(["x", "y", "time"], precipitation),
...     ),
...     coords=dict(
...         lon=(["x", "y"], lon),
...         lat=(["x", "y"], lat),
...         time=time,
...         reference_time=reference_time,
...     ),
...     attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset>
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 29.11 18.2 22.83 ... 18.28 16.15 26.63
    precipitation   (x, y, time) float64 5.68 9.256 0.7104 ... 7.992 4.615 7.805
Attributes:
    description:  Weather related data.

Find out where the coldest temperature was and what values the other variables had:

>>> ds.isel(ds.temperature.argmin(...))
<xarray.Dataset>
Dimensions:         ()
Coordinates:
    lon             float64 -99.32
    lat             float64 42.21
    time            datetime64[ns] 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Data variables:
    temperature     float64 7.182
    precipitation   float64 8.326
Attributes:
    description:  Weather related data.

__slots__ = []
property region_dim(self)
property chrom_size_path(self)
property location(self)
classmethod from_bed(cls, bed, location, chrom_size_path, region_dim='region', sort_bed=True)

Create an empty RegionDS from a BED file.

Parameters
  • bed

  • location

  • region_dim

  • chrom_size_path

  • sort_bed
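
A minimal sketch, assuming RegionDS is importable alongside MCDS; the file paths and the "dmr" region_dim value are hypothetical:

>>> from ALLCools.mcds import RegionDS
>>> # build an empty RegionDS from a DMR bed file
>>> region_ds = RegionDS.from_bed(
...     bed="dmr.bed",
...     location="dmr_ds",
...     chrom_size_path="hg38.chrom.sizes",
...     region_dim="dmr",
... )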

classmethod open(cls, path, region_dim=None, use_regions=None, split_large_chunks=True, chrom_size_path=None, select_dir=None, chunks='auto', engine='zarr')
classmethod _open_single_dataset(cls, path, region_dim, split_large_chunks=True, chrom_size_path=None, location=None, engine=None)

Take one or multiple RegionDS file paths and create a single RegionDS concatenated on region_dim.

Parameters
  • path – A single RegionDS path, a RegionDS path pattern with wildcards, or a list of RegionDS paths

  • region_dim – Dimension name of regions

  • split_large_chunks – Split large dask array chunks if True

Returns

Return type

RegionDS
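
For example, reopening the hypothetical store from the from_bed sketch above:

>>> region_ds = RegionDS.open("dmr_ds", region_dim="dmr")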

iter_index(self, chunk_size=100000, dim=None)
iter_array(self, chunk_size=100000, dim=None, da=None, load=False)
get_fasta(self, genome_fasta, output_path, slop=None, chrom_size_path=None, standardize_length=None)
get_bed(self, with_id=True, bedtools=False, slop=None, chrom_size_path=None, standardize_length=None)
_chunk_annotation_executor(self, annotation_function, cpu, save=True, **kwargs)
annotate_by_bigwigs(self, bigwig_table, dim, slop=100, chrom_size_path=None, value_type='mean', chunk_size='auto', dtype='float32', cpu=1, save=True)
annotate_by_beds(self, bed_table, dim, slop=100, chrom_size_path=None, chunk_size='auto', dtype='bool', bed_sorted=True, cpu=1, fraction=0.2, save=True)
get_feature(self, feature_name, dim=None, da_name=None)
scan_motifs(self, genome_fasta, cpu=1, standardize_length=500, motif_set_path=None, chrom_size_path=None, combine_cluster=True, fnr_fpr_fold=1000, chunk_size=None, motif_dim='motif', snakemake=False)
_scan_motif_local(self, fasta_path, cpu=1, motif_set_path=None, combine_cluster=True, fnr_fpr_fold=1000, chunk_size=None, motif_dim='motif')
_scan_motifs_snakemake(self, fasta_path, output_dir, cpu, motif_dim='motif', combine_cluster=True, motif_set_path=None, fnr_fpr_fold=1000, chunk_size=50000)
get_hypo_hyper_index(self, a, region_dim=None, region_state_da=None, sample_dim='sample', use_collapsed=True)
get_pairwise_differential_index(self, a, b, dmr_type='hypo', region_dim=None, region_state_da=None, sample_dim='sample', use_collapsed=True)
motif_enrichment(self, true_regions, background_regions, region_dim=None, motif_dim='motif-cluster', motif_da=None, alternative='two-sided')
sample_dmr_motif_enrichment(self, sample, region_dim=None, sample_dim='sample', motif_dim='motif-cluster', region_state_da=None, motif_da=None, alternative='two-sided', use_collapsed=True)
pairwise_dmr_motif_enrichment(self, a, b, dmr_type='hypo', region_dim=None, region_state_da=None, sample_dim='sample', motif_dim='motif-cluster', motif_da=None, alternative='two-sided')
object_coords_to_string(self, dtypes=None)
save(self, da_name=None, output_path=None, mode='w', change_region_dim=True)
get_coords(self, name)