ALLCools.count_matrix.mcds

Module Contents

DEFAULT_MCDS_DTYPE
clip_too_large_cov(data_df, machine_max)
_region_count_table_to_csr_npz(region_count_tables, region_id_map, output_prefix, compression=True, dtype=DEFAULT_MCDS_DTYPE)

Helper function for _aggregate_region_count_to_mcds.

Take a list of region count table paths, read them, aggregate them into a 2D sparse matrix, and save the mC and COV counts separately. This function does not perform any path selection; it assumes all region count tables are of the same type. It returns the saved file path.
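The aggregation described above can be sketched with SciPy sparse matrices. This is a minimal illustration, not the library's implementation: the helper names, the (cell, region, count) record layout, and the ".mc.npz" / ".cov.npz" file suffixes are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def records_to_csr(records, n_cells, n_regions, dtype=np.uint32):
    """Build a cell-by-region CSR matrix from (cell, region, count) records.

    Hypothetical helper: duplicate (cell, region) pairs are summed when the
    COO matrix is converted to CSR.
    """
    rows, cols, values = zip(*records)
    coo = sparse.coo_matrix(
        (values, (rows, cols)), shape=(n_cells, n_regions), dtype=dtype
    )
    return coo.tocsr()


def save_mc_and_cov(mc_records, cov_records, n_cells, n_regions,
                    output_prefix, compression=True):
    """Save the mC and COV matrices to separate .npz files (assumed naming)."""
    paths = {}
    for count_type, records in (("mc", mc_records), ("cov", cov_records)):
        matrix = records_to_csr(records, n_cells, n_regions)
        path = f"{output_prefix}.{count_type}.npz"
        sparse.save_npz(path, matrix, compressed=compression)
        paths[count_type] = path
    return paths


# Toy input: 2 cells, 3 regions; cell 0 / region 1 appears twice.
mc_records = [(0, 1, 3), (0, 1, 1), (1, 0, 2), (1, 2, 5)]
cov_records = [(0, 1, 10), (0, 1, 2), (1, 0, 4), (1, 2, 5)]

with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_mc_and_cov(
        mc_records, cov_records, n_cells=2, n_regions=3,
        output_prefix=os.path.join(tmp_dir, "chunk0"),
    )
    mc_matrix = sparse.load_npz(saved["mc"])
    cov_matrix = sparse.load_npz(saved["cov"])
```

Keeping mC and COV in separate files means either matrix can be loaded on its own, at the cost of tracking two paths per chunk.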

_csr_matrix_to_dataarray(matrix_table, row_name, row_index, col_name, col_index, other_dim_info)

Helper function for _aggregate_region_count_to_mcds.

This function aggregates sparse array files into a single xarray.DataArray, combining the cell chunks and the mc/cov count types. The matrix_table provides all file paths; each row corresponds to one cell chunk, with separate paths for its mc and cov matrices.
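The combination step can be illustrated with NumPy and SciPy alone. This sketch swaps the xarray.DataArray for a plain dense array to stay dependency-light; the function name, the in-memory chunk dictionaries, and the (cell, region, count_type) axis order are assumptions, not the library's actual layout.

```python
import numpy as np
from scipy import sparse

# Two cell chunks, each holding an mc and a cov matrix over the same 3 regions.
chunks = [
    {
        "mc": sparse.csr_matrix(np.array([[0, 3, 0], [2, 0, 5]], dtype=np.uint32)),
        "cov": sparse.csr_matrix(np.array([[0, 10, 0], [4, 0, 5]], dtype=np.uint32)),
    },
    {
        "mc": sparse.csr_matrix(np.array([[1, 0, 0]], dtype=np.uint32)),
        "cov": sparse.csr_matrix(np.array([[6, 0, 0]], dtype=np.uint32)),
    },
]


def combine_chunks(chunks, count_types=("mc", "cov")):
    """Stack cell chunks row-wise, then stack count types on a new last axis."""
    per_type = []
    for count_type in count_types:
        # Concatenate all chunks along the cell (row) axis.
        stacked = sparse.vstack([chunk[count_type] for chunk in chunks])
        per_type.append(stacked.toarray())
    # Resulting shape: (n_cells, n_regions, n_count_types)
    return np.stack(per_type, axis=-1)


combined = combine_chunks(chunks)
```

In a labeled array, the three axes would carry the row name, the column name, and a count-type coordinate; here they are positional only.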

_aggregate_region_count_to_mcds(output_dir, dataset_name, chunk_size=100, row_name='cell', cpu=1, dtype=DEFAULT_MCDS_DTYPE)

Aggregate all the region count tables into a single MCDS.

generate_mcds(allc_table, output_prefix, chrom_size_path, mc_contexts, rna_table=None, split_strand=False, bin_sizes=None, region_bed_paths=None, region_bed_names=None, cov_cutoff=9999, cpu=1, remove_tmp=True, max_per_mcds=3072, cell_chunk_size=100, dtype=DEFAULT_MCDS_DTYPE, binarize=False, engine='zarr')

Generate an MCDS from a list of ALLC files, each provided with a file ID.

Parameters
  • allc_table – {allc_table_doc}

  • output_prefix – Output prefix of the MCDS

  • chrom_size_path – {chrom_size_path_doc}

  • mc_contexts – {mc_contexts_doc}

  • rna_table – {rna_table_doc}

  • split_strand – {split_strand_doc}

  • bin_sizes – {bin_sizes_doc}

  • region_bed_paths – {region_bed_paths_doc}

  • region_bed_names – {region_bed_names_doc}

  • cov_cutoff – {cov_cutoff_doc}

  • cpu – {cpu_basic_doc}

  • remove_tmp – Whether to remove the temp directory for generating MCDS

  • max_per_mcds – Maximum number of ALLC files to aggregate into one MCDS. If the number of ALLC files provided exceeds max_per_mcds, the MCDS is generated in chunks that share the provided prefix.

  • cell_chunk_size – Size of each cell chunk in parallel aggregation. This has no effect on the results, but a larger chunk size requires more memory.

  • dtype – Data type of the MCDS count matrix. Default is np.uint32. For single-cell feature counts, this can be set to np.uint16, which limits values to 0-65535. Values exceeding the maximum are clipped.

  • binarize – {binarize_doc}

  • engine – Storage engine for the dataset, either zarr or netcdf. Default is zarr.
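
Two of the parameters above, max_per_mcds and dtype, describe behavior that is easy to show on toy data. The sketch below is illustrative only: split_into_chunks and clip_to_dtype are hypothetical helpers, and the way files are grouped into consecutive chunks and the lower clipping bound of 0 are assumptions rather than the library's documented behavior.

```python
import numpy as np


def split_into_chunks(items, max_per_chunk):
    """Split a list into consecutive chunks of at most max_per_chunk items."""
    return [
        items[start:start + max_per_chunk]
        for start in range(0, len(items), max_per_chunk)
    ]


def clip_to_dtype(values, dtype=np.uint16):
    """Clip counts to the largest value the integer dtype can hold, then cast."""
    machine_max = np.iinfo(dtype).max  # 65535 for np.uint16
    # Lower bound of 0 is an assumption: counts are expected to be non-negative.
    return np.clip(values, 0, machine_max).astype(dtype)


# Seven hypothetical ALLC file IDs with at most three per MCDS chunk.
allc_ids = [f"cell_{i}" for i in range(7)]
chunks = split_into_chunks(allc_ids, max_per_chunk=3)

# Counts above the uint16 maximum are clipped instead of wrapping around.
raw_counts = np.array([0, 12, 65535, 70000])
clipped = clip_to_dtype(raw_counts)
```

Clipping before the cast matters: casting 70000 directly to np.uint16 would wrap around to a small value instead of saturating at the maximum.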