Conclusion from benchmark tests:
Future directions:
conda create -n ext_strobemer -c bioconda -c conda-forge -c conda --file ./src/requirements.txt python=3.7
conda activate ext_strobemer
Build scripts to perform specific analysis.
Basic functions of the k-mer object for extension-based randstrobe analysis. To accommodate the purpose of sequence alignment, the sequence name (e.g. “>chr1”) and k-mer locations are both stored as values in the k-mer dict.
Parameters (also attributes)
Keyword | Default value | Information |
---|---|---|
k | k-mer length | |
n | number of strobes | |
l | length of a single strobe | |
smin | 5 | length of sub-strobe for extension-based randstrobe |
wmin | 25 | min window size for strobemer |
wmax | 50 | max window size for strobemer |
prime | 6229 | prime number to mod for randstrobe selection |
label | "no_label" | keyword for output files |
filename | "no_filename" | store the corresponding genome file if needed |
Functions:
Function name | Purpose |
---|---|
print_info | Print parameters and status of k-mer dict of this object |
load_seq_dict | Load a dict of [seq_id : sequence] pairs into this object |
build_kmer_dict_with_start_pos | Build k-mer dict of [k-mer : seq_id + location] pairs and save k-mers into a ternary search tree |
build_both_randstrobe_dict | 1. Build regular randstrobe dict of [k-mer : seq_id + location] pairs 2. Build extension-based randstrobe dict of [k-mer : seq_id + location] pairs 3. Save k-mers from extension-based randstrobe into a ternary search tree for prefix lookup |
export_to_pkl | Save this object into a pickle object |
kmer_query | Check if a k-mer exists in this object |
prefix_query | Check if a prefix exists in this object |
Name | Contents |
---|---|
seq_dict | a dict for [seq_id : sequence] pairs |
kmer_dict | a dict for [k-mer : seq_id + location] pairs |
kmer_tst | a ternary search tree for all k-mers |
regu_randstrobe_dict | a dict for [randstrobe : seq_id + location] pairs |
ext_randstrobe_dict | a dict for [ext_randstrobe : seq_id + location] pairs |
ext_randstrobe_tst | a ternary search tree for all ext_randstrobes |
(To be added) |
Examples
from src.extension_strobemer_obj import *
# use the toy data
input_file = "test/test_data/toy_data.fna"
# build a dict of [seq_id : sequence] pair from fasta file. It's necessary for multi-contig genomes.
# the reason why I leave this step outside the object is for comparison of mutated genomes
raw_seq_dict = generate_string_dict_from_genome(input_file)
# build a extension_strobemer object with k=30 and 2 strobes (each of length 15). By default, window size wmin=25, wmax=50
temp_obj = ext_strobemer_obj(k=30, n=2, l=15, smin=10, label="test", filename="none")
# load sequences into the object
temp_obj.load_seq_dict(raw_seq_dict)
# build k-mer dict
temp_obj.build_kmer_dict_with_start_pos()
# build randstrobe and extension-randstrobe dict
temp_obj.build_both_randstrobe_dict()
# print the stats of this object
temp_obj.print_info()
### Now we can perform some simple searching of k-mers or prefixes
# 'ers' for extension-randstrobe
# 'rs' for randstrobe
# 'kmer' for regular k-mer
some_manual_prefix = 'CTAATGGAGAAACTCAT'
# now only support prefix search in [kmer, ers]
temp_obj.prefix_query(some_manual_prefix, kmer_type='ers') #this is empty
temp_obj.prefix_query(some_manual_prefix, kmer_type='kmer')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='kmer')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='rs')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='ers')
A figure illustration:
Select $k{min} \text{ and } k{max} \leq W_{min}$, the extension/truncation only happens in this region
Build strobemers from reference genomes on $k=k_{max}$:
a) generate strobemer with parameter $(n, k{min}, W{min}, W_{max})$
b) extend each strobe from $k{min}$ to $k{max}$
c) note that the 2 steps above can be done at the same time as 2a) actually select strobe starting point
a) concatenate all $k_{min}-mers$ from each strobe
b) for all remaining letters from each strobe, we iteratively append 1 letter from each strobe to the sequence in a)
c) step b) ensures that truncation will get a prefix match for strobemers
d) then we can truncate this KTST by step of n instead of 1 (n is number of strobes) to get shorter strobemers
a) generate strobemer with parameter $(n, k{min}, W{min}, W_{max})$
b) extend each strobe from $k_{min}$ to $k'$
c) this ensures that the sampling protocol are the same in query and ref though being random