`ALLCools.utilities`

Module Contents

parse_chrom_size(path, remove_chr_list=None)

Parse a simple UCSC chrom size file, or a .fai file (whose 1st and 2nd columns are the same as in a chrom size file).

Returns a chrom:length dict.
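As a rough illustration of the parsing described above, here is a minimal sketch that reads the first two whitespace-separated columns into a dict. The function name `read_chrom_sizes` is hypothetical and is not the library's implementation; treating `remove_chr_list` as a list of chromosome names to exclude is an assumption.

```python
from typing import Dict, Iterable, Optional


def read_chrom_sizes(path: str,
                     remove_chr_list: Optional[Iterable[str]] = None) -> Dict[str, int]:
    """Hypothetical stand-in: read the first two columns into a chrom -> length dict."""
    # Assumption: remove_chr_list names chromosomes that should be left out.
    skip = set(remove_chr_list or [])
    chrom_sizes: Dict[str, int] = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            chrom, length = fields[0], int(fields[1])
            if chrom in skip:
                continue
            chrom_sizes[chrom] = length
    return chrom_sizes
```

Because only the first two columns are read, the same loop handles both a two-column chrom size file and a five-column .fai file.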

get_bin_id(chrom, chrom_index_dict, bin_start, bin_size) → int

genome_region_chunks(chrom_size_path: str, bin_length: int = 10000000, combine_small: bool = True) → List[str]

Split the whole genome into bins of bin_length bp each. Used for tabix region queries.

Returns

list of records in tabix query format
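To make the binning concrete, the sketch below cuts each chromosome into fixed-length regions written in the `chrom:start-end` form that tabix accepts (1-based, inclusive). The name `region_chunks` is hypothetical, it takes an already-parsed chrom:length dict rather than a file path, and it does not model `combine_small`, whose exact behavior is not described here.

```python
from typing import Dict, List


def region_chunks(chrom_sizes: Dict[str, int],
                  bin_length: int = 10_000_000) -> List[str]:
    """Hypothetical sketch: cut each chromosome into 'chrom:start-end' regions."""
    regions: List[str] = []
    for chrom, length in chrom_sizes.items():
        # tabix region strings are 1-based and inclusive at both ends
        for start in range(1, length + 1, bin_length):
            end = min(start + bin_length - 1, length)  # clip the last bin
            regions.append(f"{chrom}:{start}-{end}")
    return regions


print(region_chunks({"chr1": 25, "chr2": 8}, bin_length=10))
# ['chr1:1-10', 'chr1:11-20', 'chr1:21-25', 'chr2:1-8']
```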

generate_chrom_bin_bed_dataframe(chrom_size_path: str, window_size: int, step_size: int = None) → pandas.DataFrame

Generate a BED-format dataframe based on a UCSC chrom size file and window_size. The returned dataframe contains 3 columns: chrom, start, end. The index is a 0-based, continuous bin index.
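The window layout can be sketched without pandas as a list of (chrom, start, end) rows, where a row's position in the list plays the role of the 0-based bin index. The name `chrom_windows` is hypothetical. Two behaviors are assumptions rather than documented facts: `step_size=None` is taken to mean non-overlapping windows, and the last window of each chromosome is clipped at the chromosome end.

```python
from typing import Dict, List, Optional, Tuple


def chrom_windows(chrom_sizes: Dict[str, int],
                  window_size: int,
                  step_size: Optional[int] = None) -> List[Tuple[str, int, int]]:
    """Hypothetical sketch: BED-style (0-based, half-open) windows per chromosome."""
    # Assumption: a missing step_size means windows do not overlap.
    step = window_size if step_size is None else step_size
    rows: List[Tuple[str, int, int]] = []
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, step):
            # Assumption: clip the final window at the chromosome end.
            rows.append((chrom, start, min(start + window_size, length)))
    return rows


for bin_index, row in enumerate(chrom_windows({"chr1": 25}, window_size=10)):
    print(bin_index, *row)
```

Passing these rows to `pandas.DataFrame(rows, columns=["chrom", "start", "end"])` would give a three-column frame with a default 0-based integer index.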

profile_allc(allc_path, drop_n=True, n_rows=1000000, output_path=None)

Generate summary statistics for 1 ALLC file. 1e8 rows finish in about 5 min.

Parameters

allc_path – Path of the input ALLC file.
drop_n – Whether to drop contexts that contain N, such as CCN. These are usually very rare and need to be dropped.
n_rows – Number of rows to calculate the profile from. The default number is usually sufficient to get a fairly precise estimate.
output_path – Path of the output file. If None, the profile is saved next to the input ALLC file.

is_gz_file(filepath)

Check whether a file is a gzip file; bgzip files also return True. Learned from here: https://stackoverflow.com/questions/3703276/how-to-tell-if-a-file-is-gzip-compressed
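A common way to perform this check is to compare the first two bytes of the file with the gzip magic number `1f 8b`. The sketch below takes that approach; the name `looks_like_gzip` is hypothetical, and this is not necessarily how the library does it. bgzip output is a series of valid gzip members, so it starts with the same two bytes.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def looks_like_gzip(filepath: str) -> bool:
    """Hypothetical sketch: True if the file starts with the gzip magic number."""
    with open(filepath, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

This only inspects the header, so it is cheap but does not verify that the rest of the file decompresses cleanly.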

tabix_allc(allc_path, reindex=False)

A simple wrapper of the tabix command to index 1 ALLC file.
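A wrapper of this kind can be sketched with `subprocess`. The names `build_tabix_command` and `index_allc` are hypothetical. The sketch assumes that the ALLC file is bgzipped with the chromosome in column 1 and the position in column 2, and that `reindex=True` means an existing `.tbi` index should be overwritten; none of these details are stated above.

```python
import os
import subprocess
from typing import List


def build_tabix_command(allc_path: str, force: bool = False) -> List[str]:
    """Hypothetical sketch: assemble the tabix call for a position-sorted ALLC file."""
    # -s / -b / -e: columns for sequence name, begin and end.
    # Assumption: chromosome is column 1 and position is column 2.
    cmd = ["tabix", "-s", "1", "-b", "2", "-e", "2"]
    if force:
        cmd.append("-f")  # overwrite an existing index
    cmd.append(allc_path)
    return cmd


def index_allc(allc_path: str, reindex: bool = False) -> None:
    """Run tabix unless an index already exists and reindex is False."""
    if os.path.exists(allc_path + ".tbi") and not reindex:
        return  # index already present, nothing to do
    subprocess.run(build_tabix_command(allc_path, force=reindex), check=True)
```

Separating command construction from execution keeps the argument list easy to inspect without having tabix installed.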


standardize_allc(allc_path, chrom_size_path, compress_level=5, remove_additional_chrom=False)

Standardize 1 ALLC file by checking:

No header in the ALLC file;
Chromosome names in the ALLC file must be the same as those in the chrom_size_path file, including “chr”;
Output file will be bgzipped with a .tbi index;
Remove additional chromosomes (remove_additional_chrom=True) or raise KeyError if an unknown chromosome is found (default).
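The chromosome check in the list above can be illustrated on plain text lines. This is only a sketch of that one step: the name `check_chroms` is hypothetical, the lines are assumed to be tab-separated with the chromosome in the first column, and header detection, bgzip compression, and indexing are not modeled.

```python
from typing import Dict, Iterable, Iterator


def check_chroms(lines: Iterable[str],
                 chrom_sizes: Dict[str, int],
                 remove_additional_chrom: bool = False) -> Iterator[str]:
    """Hypothetical sketch: pass through lines whose chromosome is known."""
    for line in lines:
        # Assumption: the chromosome name is the first tab-separated field.
        chrom = line.split("\t", 1)[0]
        if chrom not in chrom_sizes:
            if remove_additional_chrom:
                continue  # silently drop records on unknown chromosomes
            raise KeyError(f"{chrom} not found in the chrom size file")
        yield line


records = ["chr1\t10\t+\tCGA\t1\t2\t1", "chrUn\t5\t-\tCHH\t0\t3\t1"]
print(list(check_chroms(records, {"chr1": 1000}, remove_additional_chrom=True)))
```

With the default `remove_additional_chrom=False`, the same input raises KeyError on the `chrUn` record.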


_transfer_bin_size(bin_size: int) → str

Get proper str for a large bin_size
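The exact string format this helper produces is not documented here. As a sketch of the general idea, the hypothetical function below shortens round sizes with an assumed `k`/`m` suffix and leaves other values as plain digits.

```python
def format_bin_size(bin_size: int) -> str:
    """Hypothetical sketch: shorten round bin sizes, e.g. 10000000 -> '10m'."""
    # Assumption: 'm' means 1,000,000 bp and 'k' means 1,000 bp.
    for factor, suffix in ((1_000_000, "m"), (1_000, "k")):
        if bin_size >= factor and bin_size % factor == 0:
            return f"{bin_size // factor}{suffix}"
    return str(bin_size)  # not a round multiple: keep the exact number


print(format_bin_size(10_000_000), format_bin_size(5_000), format_bin_size(1_500))
# 10m 5k 1500
```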