Apply the truncation idea to Strobemer

Current tasks:

Init a genome object to incorporate the following features: k-mer, randstrobe, extension-randstrobe, syncmer, and extension-syncmer
Write a document for the object usage

Brainstorm

Conclusion from benchmark tests:
1. Strobemer == Extension_strobemer: evaluation matrices (match, sequence coverage etc.) are similar in strobemer and extension-strobemer
2. Strobemer always have better sequence coverage than k-mer; and its match ratio will outperform k-mer when mutation rate p is high (>0.2)
Future directions:
1. Apply the extension idea to syncmer and integrate it into Strobealign
2. Apply the extension idea to randstrobe and use the adjustable sequence match to compare sequences (show superority than k-mer)
3. Relate extension-randstrobe-based CI to k-mer based CI. Similar to the Mash Distance plot, try if we can find high correlation between the CIs. If so, some theory works might be added.

Reproducibility
- Usage (TBD)
- K-mer object document
Algorithm overview: k-min extension for strobemer

Reproducibility

Create environment

conda create -n ext_strobemer -c bioconda -c conda-forge -c conda --file ./src/requirements.txt python=3.7
conda activate ext_strobemer

Usage

Build scripts to perform specific analysis.

K-mer object document

Basic functions of the k-mer object for extension-based randstrobe analysis. To accommodate the purpose of sequence alignment, the sequence name (e.g. “>chr1”) and k-mer locations are both stored as values in the k-mer dict.

Parameters (also attributes)

Keyword	Default value	Information
k		k-mer length
n		number of strobes
l		length of a single strobe
smin	5	length of sub-strobe for extension-based randstrobe
wmin	25	min window size for strobemer
wmax	50	max window size for strobemer
prime	6229	prime number to mod for randstrobe selection
label	"no_label"	keyword for output files
filename	"no_filename"	store the corresponding genome file if needed

Functions:

Function name	Purpose
print_info	Print parameters and status of k-mer dict of this object
load_seq_dict	Load a dict of [seq_id : sequence] pairs into this object
build_kmer_dict_with_start_pos	Build k-mer dict of [k-mer : seq_id + location] pairs and save k-mers into a ternary search tree
build_both_randstrobe_dict	1. Build regular randstrobe dict of [k-mer : seq_id + location] pairs 2. Build extension-based randstrobe dict of [k-mer : seq_id + location] pairs 3. Save k-mers from extension-based randstrobe into a ternary search tree for prefix lookup
export_to_pkl	Save this object into a pickle object
kmer_query	Check if a k-mer exists in this object
prefix_query	Check if a prefix exists in this object

Additional attributes

Name	Contents
seq_dict	a dict for [seq_id : sequence] pairs
kmer_dict	a dict for [k-mer : seq_id + location] pairs
kmer_tst	a ternary search tree for all k-mers
regu_randstrobe_dict	a dict for [randstrobe : seq_id + location] pairs
ext_randstrobe_dict	a dict for [ext_randstrobe : seq_id + location] pairs
ext_randstrobe_tst	a ternary search tree for all ext_randstrobes
(To be added)

Examples

from src.extension_strobemer_obj import *

# use the toy data
input_file = "test/test_data/toy_data.fna"

# build a dict of [seq_id : sequence] pair from fasta file. It's necessary for multi-contig genomes.
# the reason why I leave this step outside the object is for comparison of mutated genomes
raw_seq_dict = generate_string_dict_from_genome(input_file)

# build a extension_strobemer object with k=30 and 2 strobes (each of length 15). By default, window size wmin=25, wmax=50
temp_obj = ext_strobemer_obj(k=30, n=2, l=15, smin=10, label="test", filename="none")

# load sequences into the object
temp_obj.load_seq_dict(raw_seq_dict)

# build k-mer dict
temp_obj.build_kmer_dict_with_start_pos()

# build randstrobe and extension-randstrobe dict
temp_obj.build_both_randstrobe_dict()

# print the stats of this object
temp_obj.print_info()


### Now we can perform some simple searching of k-mers or prefixes
# 'ers' for extension-randstrobe
# 'rs' for randstrobe
# 'kmer' for regular k-mer

some_manual_prefix = 'CTAATGGAGAAACTCAT'
# now only support prefix search in [kmer, ers]
temp_obj.prefix_query(some_manual_prefix, kmer_type='ers') #this is empty
temp_obj.prefix_query(some_manual_prefix, kmer_type='kmer')

temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='kmer')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='rs')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='ers')

Algorithm overview

A figure illustration:

Select $k{min} \text{ and } k{max} \leq W_{min}$, the extension/truncation only happens in this region
Build strobemers from reference genomes on $k=k_{max}$:

a) generate strobemer with parameter $(n, k{min}, W{min}, W_{max})$

b) extend each strobe from $k{min}$ to $k{max}$

c) note that the 2 steps above can be done at the same time as 2a) actually select strobe starting point

Build KTST via strobe concatenation

a) concatenate all $k_{min}-mers$ from each strobe

b) for all remaining letters from each strobe, we iteratively append 1 letter from each strobe to the sequence in a)

c) step b) ensures that truncation will get a prefix match for strobemers

d) then we can truncate this KTST by step of n instead of 1 (n is number of strobes) to get shorter strobemers

For arbitrary query data with interested $k-mer$ length $k=k' \in [k{min}, k{max}]$, we can build its strobemer in the same way

a) generate strobemer with parameter $(n, k{min}, W{min}, W_{max})$

b) extend each strobe from $k_{min}$ to $k'$

c) this ensures that the sampling protocol are the same in query and ref though being random

README.md 7.4 KB Permalink Riwayat Mentahan