ALLCools.mcds

Core Data Structure

Package Contents

class MCDS(dataset, obs_dim=None, var_dim=None)[source]

Bases: xarray.Dataset

MCDS Class

__slots__ = []
property var_dim(self)
property obs_dim(self)
property obs_names(self)
property var_names(self)
_verify_dim(self, dim, mode)
classmethod open(cls, mcds_paths, obs_dim='cell', use_obs=None, var_dim=None, chunks='auto', split_large_chunks=True, obj_to_str=True, engine=None)

Take one or multiple MCDS file paths and create a single MCDS concatenated on obs_dim.

Parameters
  • mcds_paths – A single MCDS path, an MCDS path pattern with wildcards, or a list of MCDS paths

  • obs_dim – Dimension name of observations, default is ‘cell’

  • use_obs – Subset the MCDS by a list of observation IDs.

  • var_dim – Which var_dim dataset to use; required when the MCDS has multiple var_dim datasets stored in the same directory

  • chunks – If not None, xarray loads the data as dask arrays using these chunks; 'auto' lets xarray determine the chunks automatically. For more options, read the xarray.open_dataset chunks parameter documentation. If None, xarray does not use dask, which is undesirable in most cases.

  • split_large_chunks – Whether to split large chunks via the dask config option array.slicing.split_large_chunks

  • obj_to_str – Whether to convert object-dtype coordinates to string data type

  • engine – xarray engine used to store the MCDS; if multiple MCDS are provided, they must all use the same engine

Returns

Return type

MCDS
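
A minimal usage sketch; the glob pattern and the "chrom100k" var_dim value are hypothetical:

>>> from ALLCools.mcds import MCDS
>>> # open several MCDS stores and concatenate them along the cell dimension
>>> mcds = MCDS.open("dataset/*.mcds", obs_dim="cell", var_dim="chrom100k")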

add_mc_frac(self, var_dim=None, da=None, normalize_per_cell=True, clip_norm_value=10, da_suffix='frac')

Add a posterior mC rate data array for a certain feature type (var_dim).

Parameters
  • var_dim – Name of the feature type

  • da – If None, the data array named f'{var_dim}_da' will be used

  • normalize_per_cell – If True, normalize the mC rate data array per cell

  • clip_norm_value – Clip values in the normalized mC rate data array that exceed this value

  • da_suffix – Name suffix appended to the calculated mC rate data array
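
For example, continuing with the hypothetical "chrom100k" feature type from the open sketch above:

>>> # add the per-cell normalized mC fraction, stored as "chrom100k_da_frac"
>>> mcds.add_mc_frac(var_dim="chrom100k", normalize_per_cell=True, clip_norm_value=10)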

add_mc_rate(self, *args, **kwargs)
add_feature_cov_mean(self, obs_dim=None, var_dim=None, plot=True)

Add the feature coverage mean across obs_dim.

Parameters
  • var_dim – Name of var dimension

  • obs_dim – Name of obs dimension

  • plot – If True, plot the distribution of the feature coverage mean

Returns

Return type

None
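
For example, with the hypothetical "chrom100k" feature type:

>>> # compute and plot the mean coverage of each feature across cells
>>> mcds.add_feature_cov_mean(var_dim="chrom100k")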

add_cell_metadata(self, metadata, obs_dim=None)
filter_feature_by_cov_mean(self, var_dim=None, min_cov=0, max_cov=999999)

Filter the MCDS by feature coverage mean. add_feature_cov_mean() must be called before this function.

Parameters
  • var_dim – Name of var dimension

  • min_cov – Minimum coverage cutoff

  • max_cov – Maximum coverage cutoff

Returns

Return type

MCDS
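
A sketch of the filtering step; the cutoffs are hypothetical and should be read off the distribution plotted by add_feature_cov_mean():

>>> # keep features whose mean coverage falls inside the chosen range
>>> mcds = mcds.filter_feature_by_cov_mean(var_dim="chrom100k", min_cov=100, max_cov=3000)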

get_feature_bed(self, var_dim=None)

Get a BED-format data frame of the var_dim features.

Parameters
  • var_dim – Name of var_dim

Returns

Return type

pd.DataFrame
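
For example:

>>> # BED-format coordinates of the hypothetical "chrom100k" features
>>> bed_df = mcds.get_feature_bed(var_dim="chrom100k")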

remove_black_list_region(self, black_list_path, var_dim=None, f=0.2)

Remove regions that overlap (bedtools intersect -f {f}) with regions in the black_list_path.

Parameters
  • var_dim – Name of var_dim

  • black_list_path – Path to the blacklist BED file

  • f – Fraction of overlap when calling bedtools intersect

Returns

Return type

MCDS
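
A sketch assuming a hypothetical ENCODE blacklist file:

>>> # drop features overlapping blacklist regions by at least 20%
>>> mcds = mcds.remove_black_list_region(
...     "hg38-blacklist.v2.bed.gz", var_dim="chrom100k", f=0.2
... )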

remove_chromosome(self, exclude_chromosome, var_dim=None)

Remove regions on specific chromosomes.

Parameters
  • var_dim – Name of var_dim

  • exclude_chromosome – Chromosome(s) to remove

Returns

Return type

MCDS (xr.Dataset)
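
For example:

>>> # drop features on the sex and mitochondrial chromosomes
>>> mcds = mcds.remove_chromosome(["chrX", "chrY", "chrM"], var_dim="chrom100k")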

calculate_hvf_svr(self, mc_type=None, var_dim=None, obs_dim=None, n_top_feature=5000, da_suffix='frac', plot=True)
calculate_hvf(self, mc_type=None, var_dim=None, obs_dim=None, min_disp=0.5, max_disp=None, min_mean=0, max_mean=5, n_top_feature=5000, bin_min_features=5, mean_binsize=0.05, cov_binsize=100, da_suffix='frac', plot=True)

Calculate normalized dispersion to select highly variable features.

Parameters
  • mc_type – Type of mC to calculate

  • var_dim – Name of variable

  • obs_dim – Name of observation, default is cell

  • min_disp – minimum dispersion for a feature to be considered

  • max_disp – maximum dispersion for a feature to be considered

  • min_mean – minimum mean for a feature to be considered

  • max_mean – maximum mean for a feature to be considered

  • n_top_feature – Top N features to use as highly variable features. If set, all the cutoffs are ignored and HVFs are selected based on the order of normalized dispersion.

  • bin_min_features – Minimum number of features for a bin to be kept separate; if below this number, the bin is merged into its closest bin.

  • mean_binsize – Bin size used to bin features by their mean

  • cov_binsize – Bin size used to bin features by their coverage

  • plot – If True, plot mean, coverage, and normalized dispersion scatter plots.

Returns

Return type

pd.DataFrame
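
A sketch; the "CGN" mc_type value is an assumption about how mC types are named in a given dataset:

>>> # select the top 5000 highly variable mCG features
>>> hvf_df = mcds.calculate_hvf(mc_type="CGN", var_dim="chrom100k", n_top_feature=5000)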

get_score_adata(self, mc_type, quant_type, obs_dim=None, var_dim=None, sparse=True)
get_adata(self, mc_type=None, obs_dim=None, var_dim=None, da_suffix='frac', select_hvf=True, split_large_chunks=True)

Get an anndata.AnnData object from the MCDS mC rate matrix.

Parameters
  • mc_type – mC rate type

  • var_dim – Name of variable

  • da_suffix – Suffix of the mC rate matrix

  • obs_dim – Name of observation

  • select_hvf – Whether to select HVFs; if True, mcds.coords['{var_dim}_{mc_type}_feature_select'] is used to select HVFs

  • split_large_chunks – Whether to split large chunks via the dask config option array.slicing.split_large_chunks

Returns

Return type

anndata.AnnData
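
For example, after calculate_hvf() has flagged the highly variable features:

>>> # export the HVF-selected mC fraction matrix as an AnnData object
>>> adata = mcds.get_adata(mc_type="CGN", var_dim="chrom100k", select_hvf=True)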

merge_cluster(self, cluster_col, obs_dim=None, add_mc_frac=True, add_overall_mc=True, overall_mc_da='chrom100k_da')
to_region_ds(self, region_dim=None)
write_dataset(self, output_path, mode='w-', obs_dim=None, var_dims: Union[str, list] = None, use_obs=None, chunk_size=1000)

Write the MCDS into an on-disk zarr dataset. Data arrays for each var_dim will be saved in separate sub-directories of output_path.

Parameters
  • output_path – Path of the zarr dataset

  • mode – 'w-' means write to output_path and fail if the path exists; 'w' means write to output_path and overwrite if the var_dim sub-directory exists

  • obs_dim – dimension name of observations

  • var_dims – Dimension name, or a list of dimension names, of the variables to write

  • use_obs – Select AND order observations when writing.

  • chunk_size – The chunk size for loading and writing; set this as large as possible given available memory.

Returns

Return type

output_path
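
A minimal sketch with a hypothetical output path:

>>> # write the filtered dataset to a new zarr store; fail if the path exists
>>> mcds.write_dataset("filtered.mcds", mode="w-", var_dims=["chrom100k"], chunk_size=1000)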

class RegionDS(dataset, region_dim=None, location=None, chrom_size_path=None)[source]

Bases: xarray.Dataset

A multi-dimensional, in-memory array database.

A dataset resembles an in-memory representation of a NetCDF file, and consists of variables, coordinates and attributes which together form a self-describing dataset.

Dataset implements the mapping interface with keys given by variable names and values given by DataArray objects for each variable name.

One dimensional variables with name equal to their dimension are index coordinates used for label based indexing.

To load data from a file or file-like object, use the open_dataset function.

Parameters
  • data_vars (dict-like, optional) –

    A mapping from variable names to DataArray objects, Variable objects or to tuples of the form (dims, data[, attrs]) which can be used as arguments to create a new Variable. Each dimension must have the same length in all variables in which it appears.

    The following notations are accepted:

    • mapping {var name: DataArray}

    • mapping {var name: Variable}

    • mapping {var name: (dimension name, array-like)}

    • mapping {var name: (tuple of dimension names, array-like)}

    • mapping {dimension name: array-like} (it will be automatically moved to coords, see below)

    Each dimension must have the same length in all variables in which it appears.

  • coords (dict-like, optional) –

    Another mapping in similar form as the data_vars argument, except that each item is saved on the dataset as a “coordinate”. These variables have an associated meaning: they describe constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in variables. Coordinate values may be given by 1-dimensional arrays or scalars, in which case dims do not need to be supplied: 1D arrays will be assumed to give index values along the dimension with the same name.

    The following notations are accepted:

    • mapping {coord name: DataArray}

    • mapping {coord name: Variable}

    • mapping {coord name: (dimension name, array-like)}

    • mapping {coord name: (tuple of dimension names, array-like)}

    • mapping {dimension name: array-like} (the dimension name is implicitly set to be the same as the coord name)

    The last notation implies that the coord name is the same as the dimension name.

  • attrs (dict-like, optional) – Global attributes to save on this dataset.

Examples

Create data:

>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
>>> precipitation = 10 * np.random.rand(2, 2, 3)
>>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>>> lat = [[42.25, 42.21], [42.63, 42.59]]
>>> time = pd.date_range("2014-09-06", periods=3)
>>> reference_time = pd.Timestamp("2014-09-05")

Initialize a dataset with multiple dimensions:

>>> ds = xr.Dataset(
...     data_vars=dict(
...         temperature=(["x", "y", "time"], temperature),
...         precipitation=(["x", "y", "time"], precipitation),
...     ),
...     coords=dict(
...         lon=(["x", "y"], lon),
...         lat=(["x", "y"], lat),
...         time=time,
...         reference_time=reference_time,
...     ),
...     attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset>
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 29.11 18.2 22.83 ... 18.28 16.15 26.63
    precipitation   (x, y, time) float64 5.68 9.256 0.7104 ... 7.992 4.615 7.805
Attributes:
    description:  Weather related data.

Find out where the coldest temperature was and what values the other variables had:

>>> ds.isel(ds.temperature.argmin(...))
<xarray.Dataset>
Dimensions:         ()
Coordinates:
    lon             float64 -99.32
    lat             float64 42.21
    time            datetime64[ns] 2014-09-08
    reference_time  datetime64[ns] 2014-09-05
Data variables:
    temperature     float64 7.182
    precipitation   float64 8.326
Attributes:
    description:  Weather related data.

__slots__ = []
property region_dim(self)
property chrom_size_path(self)
property location(self)
classmethod from_bed(cls, bed, location, chrom_size_path, region_dim='region', sort_bed=True)

Create an empty RegionDS from a BED file.

Parameters
  • bed

  • location

  • region_dim

  • chrom_size_path

  • sort_bed
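
A minimal sketch, assuming RegionDS is importable alongside MCDS; the file paths and the "dmr" region_dim value are hypothetical:

>>> from ALLCools.mcds import RegionDS
>>> # build an empty RegionDS from a DMR bed file
>>> region_ds = RegionDS.from_bed(
...     bed="dmr.bed",
...     location="dmr_ds",
...     chrom_size_path="hg38.chrom.sizes",
...     region_dim="dmr",
... )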

classmethod open(cls, path, region_dim=None, use_regions=None, split_large_chunks=True, chrom_size_path=None, select_dir=None, chunks='auto', engine='zarr')
classmethod _open_single_dataset(cls, path, region_dim, split_large_chunks=True, chrom_size_path=None, location=None, engine=None)

Take one or multiple RegionDS file paths and create a single RegionDS concatenated on region_dim.

Parameters
  • path – A single RegionDS path, a RegionDS path pattern with wildcards, or a list of RegionDS paths

  • region_dim – Dimension name of regions

  • split_large_chunks – Split large dask array chunks if True

Returns

Return type

RegionDS
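
For example, reopening the hypothetical store from the from_bed sketch above:

>>> region_ds = RegionDS.open("dmr_ds", region_dim="dmr")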

iter_index(self, chunk_size=100000, dim=None)
iter_array(self, chunk_size=100000, dim=None, da=None, load=False)
get_fasta(self, genome_fasta, output_path, slop=None, chrom_size_path=None, standardize_length=None)
get_bed(self, with_id=True, bedtools=False, slop=None, chrom_size_path=None, standardize_length=None)
_chunk_annotation_executor(self, annotation_function, cpu, save=True, **kwargs)
annotate_by_bigwigs(self, bigwig_table, dim, slop=100, chrom_size_path=None, value_type='mean', chunk_size='auto', dtype='float32', cpu=1, save=True)
annotate_by_beds(self, bed_table, dim, slop=100, chrom_size_path=None, chunk_size='auto', dtype='bool', bed_sorted=True, cpu=1, fraction=0.2, save=True)
get_feature(self, feature_name, dim=None, da_name=None)
scan_motifs(self, genome_fasta, cpu=1, standardize_length=500, motif_set_path=None, chrom_size_path=None, combine_cluster=True, fnr_fpr_fold=1000, chunk_size=None, motif_dim='motif', snakemake=False)
_scan_motif_local(self, fasta_path, cpu=1, motif_set_path=None, combine_cluster=True, fnr_fpr_fold=1000, chunk_size=None, motif_dim='motif')
_scan_motifs_snakemake(self, fasta_path, output_dir, cpu, motif_dim='motif', combine_cluster=True, motif_set_path=None, fnr_fpr_fold=1000, chunk_size=50000)
get_hypo_hyper_index(self, a, region_dim=None, region_state_da=None, sample_dim='sample', use_collapsed=True)
get_pairwise_differential_index(self, a, b, dmr_type='hypo', region_dim=None, region_state_da=None, sample_dim='sample', use_collapsed=True)
motif_enrichment(self, true_regions, background_regions, region_dim=None, motif_dim='motif-cluster', motif_da=None, alternative='two-sided')
sample_dmr_motif_enrichment(self, sample, region_dim=None, sample_dim='sample', motif_dim='motif-cluster', region_state_da=None, motif_da=None, alternative='two-sided', use_collapsed=True)
pairwise_dmr_motif_enrichment(self, a, b, dmr_type='hypo', region_dim=None, region_state_da=None, sample_dim='sample', motif_dim='motif-cluster', motif_da=None, alternative='two-sided')
object_coords_to_string(self, dtypes=None)
save(self, da_name=None, output_path=None, mode='w', change_region_dim=True)
get_coords(self, name)