ALLCools._merge_allc

Some of the functions are modified from methylpy https://github.com/yupenghe/methylpy

Original Author: Yupeng He

Module Contents

log
DEFAULT_MAX_ALLC = 150
PROCESS
class _ALLC(path, region)
readline(self)
close(self)
_increase_soft_fd_limit()

Increase the soft file descriptor limit to the hard limit, which is the highest value an unprivileged process can set for itself. Use this in merge_allc, because single-cell data require many files to be open at the same time.

Some useful discussions:
  • https://unix.stackexchange.com/questions/36841/why-is-number-of-open-files-limited-in-linux
  • https://docs.python.org/3.6/library/resource.html
  • https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles/6776345
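
The behavior described above can be sketched with the standard-library resource module. This is a minimal illustration, not the module's actual implementation: the function name raise_soft_fd_limit is hypothetical, and falling back to the existing limit when the system rejects the request is an assumption of this sketch.

```python
import resource


def raise_soft_fd_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit (hypothetical helper).

    Returns the (soft, hard) pair in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to,
            # but not beyond, the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Assumption: if the platform refuses the request,
            # keep the current limit instead of failing.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_soft_fd_limit())
```

The change applies only to the current process and the child processes it starts afterwards; it does not modify system-wide settings.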

_batch_merge_allc_files_tabix(allc_files, out_file, chrom_size_file, bin_length, cpu=10, binarize=False, snp=False)
_merge_allc_files_tabix(allc_files, out_file, chrom_size_file, query_region=None, buffer_line_number=10000, binarize=False)
_merge_allc_files_tabix_with_snp_info(allc_files, out_file, chrom_size_file, query_region=None, buffer_line_number=10000, binarize=False)
merge_allc_files(allc_paths, output_path, chrom_size_path, bin_length=10000000, cpu=10, binarize=False, snp=False)

Merge N ALLC files into 1 ALLC file.

Parameters
  • allc_paths – {allc_paths_doc}

  • output_path – Path to the output merged ALLC file.

  • chrom_size_path – {chrom_size_path_doc}

  • bin_length – Length of the genome bin processed by each parallel job. A larger value means more memory usage.

  • cpu – {cpu_basic_doc} The real CPU usage is about 1.5 times this number, because of the subprocesses that handle ALLC files through tabix/bgzip. Monitor CPU and memory usage when running this function.

  • binarize – {binarize_doc}

  • snp – {snp_doc}
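
To illustrate how bin_length relates to the number of parallel jobs, the sketch below splits a genome into fixed-length bins. It is an illustration under stated assumptions, not the library's code: the helper name iter_genome_bins is hypothetical, the chromosome sizes are passed as a plain dict rather than read from chrom_size_path, and the 1-based inclusive chrom:start-end region strings follow the tabix convention.

```python
def iter_genome_bins(chrom_sizes, bin_length=10_000_000):
    """Yield tabix-style region strings covering each chromosome.

    chrom_sizes: dict mapping chromosome name to its length in bases
    (assumed input format for this sketch).
    """
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, bin_length):
            # The last bin of a chromosome is truncated at its end.
            end = min(start + bin_length, size)
            # Convert the 0-based start to a 1-based inclusive coordinate.
            yield f"{chrom}:{start + 1}-{end}"


if __name__ == "__main__":
    # Toy sizes chosen so the bins are easy to check by eye.
    for region in iter_genome_bins({"chr1": 25, "chr2": 8}, bin_length=10):
        print(region)
```

A smaller bin_length yields more, smaller regions, so each job holds less data in memory at the cost of more jobs overall.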