刘凡 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
..
figure 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
input 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
output 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
script 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
src 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
test 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
.gitignore 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ
README.md 9ff4d1d109 add S3,archive,truncate 2 gadi atpakaļ

README.md

Apply the truncation idea to Strobemer

Current tasks:

  • Init a genome object to incorporate the following features: k-mer, randstrobe, extension-randstrobe, syncmer, and extension-syncmer
  • Write a document for the object usage

Brainstorm

  1. Conclusion from benchmark tests:

    1. Strobemer == Extension_strobemer: evaluation matrices (match, sequence coverage etc.) are similar in strobemer and extension-strobemer
    2. Strobemer always have better sequence coverage than k-mer; and its match ratio will outperform k-mer when mutation rate p is high (>0.2)
  2. Future directions:

    1. Apply the extension idea to syncmer and integrate it into Strobealign
    2. Apply the extension idea to randstrobe and use the adjustable sequence match to compare sequences (show superority than k-mer)
    3. Relate extension-randstrobe-based CI to k-mer based CI. Similar to the Mash Distance plot, try if we can find high correlation between the CIs. If so, some theory works might be added.

Contents

Reproducibility


Create environment

conda create -n ext_strobemer -c bioconda -c conda-forge -c conda --file ./src/requirements.txt python=3.7
conda activate ext_strobemer

Usage

Build scripts to perform specific analysis.

K-mer object document

Basic functions of the k-mer object for extension-based randstrobe analysis. To accommodate the purpose of sequence alignment, the sequence name (e.g. “>chr1”) and k-mer locations are both stored as values in the k-mer dict.

Parameters (also attributes)

Keyword Default value Information
k k-mer length
n number of strobes
l length of a single strobe
smin 5 length of sub-strobe for extension-based randstrobe
wmin 25 min window size for strobemer
wmax 50 max window size for strobemer
prime 6229 prime number to mod for randstrobe selection
label "no_label" keyword for output files
filename "no_filename" store the corresponding genome file if needed

Functions:

Function name Purpose
print_info Print parameters and status of k-mer dict of this object
load_seq_dict Load a dict of [seq_id : sequence] pairs into this object
build_kmer_dict_with_start_pos Build k-mer dict of [k-mer : seq_id + location] pairs and save k-mers into a ternary search tree
build_both_randstrobe_dict 1. Build regular randstrobe dict of [k-mer : seq_id + location] pairs
2. Build extension-based randstrobe dict of [k-mer : seq_id + location] pairs
3. Save k-mers from extension-based randstrobe into a ternary search tree for prefix lookup
export_to_pkl Save this object into a pickle object
kmer_query Check if a k-mer exists in this object
prefix_query Check if a prefix exists in this object

Additional attributes

Name Contents
seq_dict a dict for [seq_id : sequence] pairs
kmer_dict a dict for [k-mer : seq_id + location] pairs
kmer_tst a ternary search tree for all k-mers
regu_randstrobe_dict a dict for [randstrobe : seq_id + location] pairs
ext_randstrobe_dict a dict for [ext_randstrobe : seq_id + location] pairs
ext_randstrobe_tst a ternary search tree for all ext_randstrobes
(To be added)

Examples

from src.extension_strobemer_obj import *

# use the toy data
input_file = "test/test_data/toy_data.fna"

# build a dict of [seq_id : sequence] pair from fasta file. It's necessary for multi-contig genomes.
# the reason why I leave this step outside the object is for comparison of mutated genomes
raw_seq_dict = generate_string_dict_from_genome(input_file)

# build a extension_strobemer object with k=30 and 2 strobes (each of length 15). By default, window size wmin=25, wmax=50
temp_obj = ext_strobemer_obj(k=30, n=2, l=15, smin=10, label="test", filename="none")

# load sequences into the object
temp_obj.load_seq_dict(raw_seq_dict)

# build k-mer dict
temp_obj.build_kmer_dict_with_start_pos()

# build randstrobe and extension-randstrobe dict
temp_obj.build_both_randstrobe_dict()

# print the stats of this object
temp_obj.print_info()


### Now we can perform some simple searching of k-mers or prefixes
# 'ers' for extension-randstrobe
# 'rs' for randstrobe
# 'kmer' for regular k-mer

some_manual_prefix = 'CTAATGGAGAAACTCAT'
# now only support prefix search in [kmer, ers]
temp_obj.prefix_query(some_manual_prefix, kmer_type='ers') #this is empty
temp_obj.prefix_query(some_manual_prefix, kmer_type='kmer')

temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='kmer')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='rs')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='ers')

Algorithm overview

A figure illustration: darkmatter_plan

  1. Select $k{min} \text{ and } k{max} \leq W_{min}$, the extension/truncation only happens in this region

  2. Build strobemers from reference genomes on $k=k_{max}$:

a) generate strobemer with parameter $(n, k{min}, W{min}, W_{max})$

b) extend each strobe from $k{min}$ to $k{max}$

c) note that the 2 steps above can be done at the same time as 2a) actually select strobe starting point

  1. Build KTST via strobe concatenation

a) concatenate all $k_{min}-mers$ from each strobe

b) for all remaining letters from each strobe, we iteratively append 1 letter from each strobe to the sequence in a)

c) step b) ensures that truncation will get a prefix match for strobemers

d) then we can truncate this KTST by step of n instead of 1 (n is number of strobes) to get shorter strobemers

  1. For arbitrary query data with interested $k-mer$ length $k=k' \in [k{min}, k{max}]$, we can build its strobemer in the same way

a) generate strobemer with parameter $(n, k{min}, W{min}, W_{max})$

b) extend each strobe from $k_{min}$ to $k'$

c) this ensures that the sampling protocol are the same in query and ref though being random