# Apply the truncation idea to Strobemer
### Current tasks:
- [ ] Init a genome object to incorporate the following features: k-mer, randstrobe, extension-randstrobe, syncmer, and extension-syncmer
- [ ] Write a document for the object usage
### Brainstorm
1. Conclusion from benchmark tests:
1. **Strobemer == Extension_strobemer**: evaluation matrices (match, sequence coverage etc.) are similar in strobemer and extension-strobemer
2. **Strobemer always have better sequence coverage** than k-mer; and its match ratio will outperform k-mer when mutation rate p is high (>0.2)
2. Future directions:
1. Apply the extension idea to syncmer and integrate it into Strobealign
2. Apply the extension idea to randstrobe and use the adjustable sequence match to compare sequences (show superority than k-mer)
3. Relate extension-randstrobe-based CI to k-mer based CI. Similar to the Mash Distance plot, try if we can find high correlation between the CIs. If so, some theory works might be added.
### Contents
- [Reproducibility](#repro)
- [Usage (TBD)](#usage)
- [K-mer object document](#obj)
- [Algorithm overview: k-min extension for strobemer](#alg)
### Reproducibility
---
#### Create environment
```
conda create -n ext_strobemer -c bioconda -c conda-forge -c conda --file ./src/requirements.txt python=3.7
conda activate ext_strobemer
```
#### Usage
Build scripts to perform specific analysis.
#### K-mer object document
Basic functions of the k-mer object for extension-based randstrobe analysis. To accommodate the purpose of sequence alignment, the sequence name (e.g. “>chr1”) and k-mer locations are both stored as values in the k-mer dict.
Parameters (also attributes)
| Keyword | Default value | Information |
| -------- | ------------- | --------------------------------------------------- |
| k | | k-mer length |
| n | | number of strobes |
| l | | length of a single strobe |
| smin | 5 | length of sub-strobe for extension-based randstrobe |
| wmin | 25 | min window size for strobemer |
| wmax | 50 | max window size for strobemer |
| prime | 6229 | prime number to mod for randstrobe selection |
| label | "no_label" | keyword for output files |
| filename | "no_filename" | store the corresponding genome file if needed |
Functions:
| Function name | Purpose |
| ------------------------------ | ------------------------------------------------------------ |
| print_info | Print parameters and status of k-mer dict of this object |
| load_seq_dict | Load a dict of [seq_id : sequence] pairs into this object |
| build_kmer_dict_with_start_pos | Build k-mer dict of [k-mer : seq_id + location] pairs and save k-mers into a ternary search tree |
| build_both_randstrobe_dict | 1. Build regular randstrobe dict of [k-mer : seq_id + location] pairs
2. Build extension-based randstrobe dict of [k-mer : seq_id + location] pairs
3. Save k-mers from extension-based randstrobe into a ternary search tree for prefix lookup |
| export_to_pkl | Save this object into a pickle object |
| kmer_query | Check if a k-mer exists in this object |
| prefix_query | Check if a prefix exists in this object |
[Additional attributes](https://github.com/KoslickiLab/Strobemer_extension/blob/902eab72676788b25ebadf2f387a65d8e9a23b2f/src/extension_strobemer_obj.py#L93)
| Name | Contents |
| -------------------- | ----------------------------------------------------- |
| seq_dict | a dict for [seq_id : sequence] pairs |
| kmer_dict | a dict for [k-mer : seq_id + location] pairs |
| kmer_tst | a ternary search tree for all k-mers |
| regu_randstrobe_dict | a dict for [randstrobe : seq_id + location] pairs |
| ext_randstrobe_dict | a dict for [ext_randstrobe : seq_id + location] pairs |
| ext_randstrobe_tst | a ternary search tree for all ext_randstrobes |
| (To be added) | |
Examples
```
from src.extension_strobemer_obj import *
# use the toy data
input_file = "test/test_data/toy_data.fna"
# build a dict of [seq_id : sequence] pair from fasta file. It's necessary for multi-contig genomes.
# the reason why I leave this step outside the object is for comparison of mutated genomes
raw_seq_dict = generate_string_dict_from_genome(input_file)
# build a extension_strobemer object with k=30 and 2 strobes (each of length 15). By default, window size wmin=25, wmax=50
temp_obj = ext_strobemer_obj(k=30, n=2, l=15, smin=10, label="test", filename="none")
# load sequences into the object
temp_obj.load_seq_dict(raw_seq_dict)
# build k-mer dict
temp_obj.build_kmer_dict_with_start_pos()
# build randstrobe and extension-randstrobe dict
temp_obj.build_both_randstrobe_dict()
# print the stats of this object
temp_obj.print_info()
### Now we can perform some simple searching of k-mers or prefixes
# 'ers' for extension-randstrobe
# 'rs' for randstrobe
# 'kmer' for regular k-mer
some_manual_prefix = 'CTAATGGAGAAACTCAT'
# now only support prefix search in [kmer, ers]
temp_obj.prefix_query(some_manual_prefix, kmer_type='ers') #this is empty
temp_obj.prefix_query(some_manual_prefix, kmer_type='kmer')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='kmer')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='rs')
temp_obj.kmer_query('GTCTTGCAATAATGGCAAAACTAAATGTAC', kmer_type='ers')
```
## Algorithm overview
A figure illustration:
1. Select $k_{min} \text{ and } k_{max} \leq W_{min}$, the extension/truncation only happens in this region
2. Build strobemers from reference genomes on $k=k_{max}$:
a) generate strobemer with parameter $(n, k_{min}, W_{min}, W_{max})$
b) extend each strobe from $k_{min}$ to $k_{max}$
c) note that the 2 steps above can be done at the same time as 2a) actually select strobe starting point
3. Build KTST via strobe concatenation
a) concatenate all $k_{min}-mers$ from each strobe
b) for all remaining letters from each strobe, we iteratively append 1 letter from each strobe to the sequence in a)
c) step b) ensures that truncation will get a prefix match for strobemers
d) then we can truncate this KTST by step of n instead of 1 (n is number of strobes) to get shorter strobemers
4. For arbitrary query data with interested $k-mer$ length $k=k' \in [k_{min}, k_{max}]$, we can build its strobemer in the same way
a) generate strobemer with parameter $(n, k_{min}, W_{min}, W_{max})$
b) extend each strobe from $k_{min}$ to $k'$
c) this ensures that the sampling protocol are the same in query and ref though being random