ALLCools.utilities
Module Contents
- parse_chrom_size(path, remove_chr_list=None)
Parse a simple UCSC chrom size file, or a .fai file (whose 1st and 2nd columns are the same as in a chrom size file).
Returns a dict mapping chrom to length.
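The parsing described above can be sketched with the standard library alone. This is a minimal illustration, not the ALLCools implementation: `read_chrom_sizes` is a hypothetical name, and treating `remove_chr_list` as a list of chromosome names to exclude is an assumption based on the parameter name.

```python
import os
import tempfile


def read_chrom_sizes(path, remove_chr_list=None):
    """Hypothetical stand-in: map chromosome name -> length from columns 1 and 2."""
    # Assumption: remove_chr_list holds chromosome names to leave out.
    skip = set(remove_chr_list or [])
    sizes = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # ignore blank or malformed lines
            chrom, length = fields[0], int(fields[1])
            if chrom not in skip:
                sizes[chrom] = length
    return sizes


# A .fai-style demo file: columns after the second are ignored.
with tempfile.NamedTemporaryFile("w", suffix=".fai", delete=False) as tmp:
    tmp.write("chr1\t1000\t6\t60\t61\n")
    tmp.write("chr2\t500\t1030\t60\t61\n")
    tmp.write("chrM\t16\t1545\t60\t61\n")
    demo_path = tmp.name

chrom_sizes = read_chrom_sizes(demo_path, remove_chr_list=["chrM"])
os.remove(demo_path)
print(chrom_sizes)
```

Because only the first two columns are read, the same loop handles both a two-column chrom size file and a five-column .fai index.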
- genome_region_chunks(chrom_size_path: str, bin_length: int = 10000000, combine_small: bool = True) -> List[str]
Split the whole genome into bins, where each bin is bin_length bp long. Used for tabix region queries.
- Parameters
chrom_size_path – Path of the UCSC genome size file
bin_length – Length of each bin, in bp
combine_small – Whether to combine small regions into one record
- Returns
List of records in tabix query format
- Return type
List[str]
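The binning idea can be sketched as follows. This is an illustration under assumptions, not the library's code: `region_chunks` is a hypothetical helper that takes an already parsed chrom-to-length dict, it emits 1-based inclusive `chrom:start-end` strings as tabix region queries expect, and it does not reproduce the `combine_small` behavior. The exact coordinates ALLCools emits may differ.

```python
def region_chunks(chrom_sizes, bin_length=10_000_000):
    """Hypothetical sketch: cut each chromosome into 1-based, inclusive tabix regions."""
    regions = []
    for chrom, length in chrom_sizes.items():
        for start in range(1, length + 1, bin_length):
            # The last bin of a chromosome is clipped to the chromosome length.
            end = min(start + bin_length - 1, length)
            regions.append(f"{chrom}:{start}-{end}")
    return regions


# Tiny made-up chromosome lengths keep the output readable.
demo_regions = region_chunks({"chr1": 25, "chr2": 8}, bin_length=10)
print(demo_regions)
```

Each string can then be passed as the region argument of a tabix query, one bin at a time.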
- generate_chrom_bin_bed_dataframe(chrom_size_path: str, window_size: int, step_size: int = None) -> pandas.DataFrame
Generate a BED-format dataframe from a UCSC chrom size file and window_size. The returned dataframe contains 3 columns: chrom, start, end. The index is a 0-based continuous bin index.
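To make the window and step parameters concrete, here is a small sketch that builds the same kind of rows as plain tuples rather than a dataframe. `chrom_bins` is a hypothetical helper, and two details are assumptions: that `step_size` falls back to `window_size` when it is None, and that the final window of each chromosome is clipped to the chromosome length.

```python
def chrom_bins(chrom_sizes, window_size, step_size=None):
    """Hypothetical sketch: rows of (chrom, start, end) in 0-based, half-open BED style."""
    if step_size is None:
        step_size = window_size  # assumption: non-overlapping bins by default
    rows = []
    for chrom, length in chrom_sizes.items():
        for start in range(0, length, step_size):
            rows.append((chrom, start, min(start + window_size, length)))
    return rows


demo_bins = chrom_bins({"chr1": 25, "chr2": 8}, window_size=10)
# enumerate() plays the role of the 0-based continuous bin index.
for bin_index, row in enumerate(demo_bins):
    print(bin_index, *row, sep="\t")
```

Passing a `step_size` smaller than `window_size` yields overlapping (sliding) windows; with pandas installed, `pandas.DataFrame(demo_bins, columns=["chrom", "start", "end"])` would give the three-column layout described above.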
- profile_allc(allc_path, drop_n=True, n_rows=1000000, output_path=None)
Generate summary statistics for 1 ALLC file. Profiling 1e8 rows finishes in about 5 min.
- Parameters
allc_path – {allc_path_doc}
drop_n – Whether to drop contexts that contain N, such as CCN. Such contexts are usually very rare and should be dropped.
n_rows – Number of rows to calculate the profile from. The default number is usually sufficient to get a fairly precise estimate.
output_path – Path of the output file. If None, the profile is saved next to the input ALLC file.
- is_gz_file(filepath)
Check whether a file is a gzip file; bgzip files also return True. Learned from here: https://stackoverflow.com/questions/3703276/how-to-tell-if-a-file-is-gzip-compressed
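A common way to perform this check is to compare the first two bytes of the file with the gzip magic number `1f 8b`. The sketch below shows that technique; `looks_like_gzip` is a hypothetical name and the library's own implementation may differ in detail. BGZF (bgzip) output is a series of valid gzip blocks, so it starts with the same two bytes.

```python
import gzip
import os
import shutil
import tempfile


def looks_like_gzip(filepath):
    """Hypothetical sketch: True if the file starts with the gzip magic bytes 1f 8b."""
    with open(filepath, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


# Write one compressed and one plain demo file, then test both.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "demo.tsv.gz")
txt_path = os.path.join(tmp_dir, "demo.tsv")
with gzip.open(gz_path, "wt") as handle:
    handle.write("chr1\t1\t+\tCGA\t1\t1\t1\n")
with open(txt_path, "w") as handle:
    handle.write("chr1\t1\t+\tCGA\t1\t1\t1\n")

gz_result = looks_like_gzip(gz_path)
txt_result = looks_like_gzip(txt_path)
shutil.rmtree(tmp_dir)
print(gz_result, txt_result)
```

Checking the content rather than the `.gz` extension avoids being misled by renamed files.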
- tabix_allc(allc_path, reindex=False)
A simple wrapper around the tabix command to index 1 ALLC file.
- Parameters
allc_path – {allc_path_doc}
reindex – If True, force regeneration of the ALLC index.
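A wrapper of this kind might assemble its command line roughly as sketched below. This is an assumption-laden illustration, not the ALLCools source: `build_tabix_command` is a hypothetical helper, the column options assume an ALLC layout with the chromosome in column 1 and the position in column 2, and the rule of reusing an existing `.tbi` file unless `reindex` is set is inferred from the parameter description. The sketch only builds the argument list; it does not run tabix.

```python
import os


def build_tabix_command(allc_path, reindex=False):
    """Hypothetical sketch: return the tabix argument list, or None if the index is reused."""
    index_path = allc_path + ".tbi"
    if os.path.exists(index_path) and not reindex:
        return None  # an index already exists and no rebuild was requested
    # Assumed ALLC layout: chromosome in column 1, position in column 2.
    command = ["tabix", "-s", "1", "-b", "2", "-e", "2"]
    if reindex:
        command.append("-f")  # tabix flag to overwrite an existing index
    command.append(allc_path)
    return command


# "sample.allc.tsv.gz" is a made-up file name used only for display.
demo_command = build_tabix_command("sample.allc.tsv.gz", reindex=True)
print(" ".join(demo_command))
```

With tabix installed and a bgzipped input file, the returned list could be handed to `subprocess.run(command, check=True)`.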
- standardize_allc(allc_path, chrom_size_path, compress_level=5, remove_additional_chrom=False)
- Standardize 1 ALLC file by checking:
No header in the ALLC file;
Chromosome names in the ALLC file must be the same as those in the chrom_size_path file, including “chr”;
The output file will be bgzipped with a .tbi index;
Additional chromosomes are removed (remove_additional_chrom=True), or a KeyError is raised if an unknown chromosome is found (default).
- Parameters
allc_path – {allc_path_doc}
chrom_size_path – {chrom_size_path_doc}
compress_level – {compress_level_doc}
remove_additional_chrom – {remove_additional_chrom_doc}
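The chromosome check in the list above can be sketched on in-memory lines. This is a minimal illustration under assumptions: `filter_known_chroms` is a hypothetical helper, the demo rows are made up, and the header check, bgzip compression and .tbi indexing steps are not reproduced.

```python
def filter_known_chroms(lines, chrom_sizes, remove_additional_chrom=False):
    """Hypothetical sketch: keep ALLC lines whose chromosome is in chrom_sizes."""
    kept = []
    for line in lines:
        chrom = line.split("\t", 1)[0]
        if chrom in chrom_sizes:
            kept.append(line)
        elif not remove_additional_chrom:
            # Default behavior described above: unknown chromosomes are an error.
            raise KeyError(f"{chrom} not found in the chrom size file")
        # Otherwise the line is dropped.
    return kept


# Made-up ALLC-like rows; "chrUn" is absent from the chrom sizes.
demo_lines = [
    "chr1\t10\t+\tCGA\t1\t2\t1",
    "chrUn\t5\t-\tCAT\t0\t3\t1",
]
demo_sizes = {"chr1": 1000}

kept_lines = filter_known_chroms(demo_lines, demo_sizes, remove_additional_chrom=True)
print(kept_lines)

try:
    filter_known_chroms(demo_lines, demo_sizes)
    raised = False
except KeyError:
    raised = True
print(raised)
```

Note that the comparison is an exact string match, so "1" and "chr1" count as different chromosomes, in line with the "including “chr”" requirement.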