Our work on chloroplast structural haplotype scaffolding has been published in Algorithms for Molecular Biology (BMC)!
From a set of short-read contigs and a set of links between the contigs, we aim to assemble several chloroplast structural haplotypes.
https://almob.biomedcentral.com/articles/10.1186/s13015-023-00243-1
The literature describes several circular chloroplast genome structures, each divided into regions: direct repeats (DR), inverted repeats (IR), and single-copy regions (SC).
The most studied structure consists of a pair of IRs joined by SCs.
This pair of IRs is known to be involved in flip-flop inversions: one of the two SCs can be reversed during DNA replication.
These two versions of the genome are structural haplotypes.
They co-exist in the same chloroplast: this phenomenon is named heteroplasmy.
To retrieve several structural haplotypes, our method scaffolds each region type hierarchically.
We first scaffold the repeats, then join them with single-copy regions.
The first tricky step is to move from the biological definition of a repeat to one we can exploit mathematically:
here we define a repeat as a pair of regions that are identical (for DRs) or perfectly reversed (for IRs).
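As a minimal sketch of this definition, assume a region is written as a list of oriented contigs, each a hypothetical `(name, orientation)` pair with `'+'` for forward and `'-'` for reverse. The function names below are illustrative and are not taken from khloraascaf.

```python
FLIP = {'+': '-', '-': '+'}


def reverse_region(region):
    """Read a region backwards: reverse the contig order and flip each orientation."""
    return [(name, FLIP[orient]) for name, orient in reversed(region)]


def is_direct_repeat(region_a, region_b):
    """DR: the two regions are identical."""
    return region_a == region_b


def is_inverted_repeat(region_a, region_b):
    """IR: one region is exactly the other one read backwards."""
    return region_b == reverse_region(region_a)


first = [('C1', '+'), ('C2', '-')]
print(is_direct_repeat(first, [('C1', '+'), ('C2', '-')]))    # True
print(is_inverted_repeat(first, [('C2', '+'), ('C1', '-')]))  # True
```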
In our application case, the contigs come from short-read data.
Each contig comes with a multiplicity (an upper bound on how many times it can be used) and an existence weight.
We also have one contig that is known to be part of a single-copy region (multiplicity = 1).
The links are ordered pairs of oriented contigs.
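To make the input concrete, here is a small sketch of how it could be held in memory. The `Contig` class, the contig names, and the weight values are all made up for illustration; they are not the khloraascaf input format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    name: str
    multiplicity: int  # upper bound on the number of occurrences
    weight: float      # existence weight (illustrative values below)


contigs = [Contig('C0', 1, 5.0), Contig('C1', 2, 3.0), Contig('C2', 1, 4.0)]
starter = 'C0'  # the contig known to lie in a single-copy region

# Links are ordered pairs of oriented contigs.
links = [
    (('C0', '+'), ('C1', '+')),
    (('C1', '+'), ('C2', '+')),
    (('C2', '+'), ('C1', '-')),
    (('C1', '-'), ('C0', '+')),
]


def respects_multiplicities(walk, contigs):
    """Check that no contig occurs more often in the walk than its multiplicity allows."""
    bound = {contig.name: contig.multiplicity for contig in contigs}
    used = Counter(name for name, _ in walk)
    return all(count <= bound.get(name, 0) for name, count in used.items())


walk = [('C0', '+'), ('C1', '+'), ('C2', '+'), ('C1', '-')]
print(respects_multiplicities(walk, contigs))                  # True
print(respects_multiplicities(walk + [('C1', '+')], contigs))  # False
```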
We define the chloroplast scaffolding problem (CHSP) as:
- using the maximum number of contig occurrences only if they assemble the minimum number of repeats under a region order constraint;
- joining the repeats by single-copies of maximum weights.
The contigs and their links are represented in a directed fragment graph.
For each contig, there is one vertex for each of its two possible orientations (forward/reverse).
The genome is a circuit in this graph.
We model the region order constraints for the DRs and the IRs using integer linear programs (ILPs).
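The fragment graph can be sketched with a plain dictionary. This is only an illustration: the helper names are hypothetical, and adding the reverse-strand counterpart of each link is an assumption of this sketch, not a statement about how khloraascaf stores its graph. The ILPs themselves are not shown.

```python
def flip(oriented):
    """Return the same contig in the opposite orientation."""
    name, orient = oriented
    return (name, '-' if orient == '+' else '+')


def build_fragment_graph(contig_names, links):
    """One vertex per contig orientation, one directed edge per link."""
    graph = {(name, orient): set() for name in contig_names for orient in '+-'}
    for u, v in links:
        graph[u].add(v)
        # Assumption: a link u -> v also exists as flip(v) -> flip(u)
        # when the genome is read on the other strand.
        graph[flip(v)].add(flip(u))
    return graph


def is_circuit(graph, walk):
    """Check that consecutive oriented contigs, cyclically, are joined by edges."""
    return all(
        walk[(i + 1) % len(walk)] in graph[walk[i]] for i in range(len(walk))
    )


links = [
    (('A', '+'), ('R', '+')),
    (('R', '+'), ('B', '+')),
    (('B', '+'), ('R', '-')),
    (('R', '-'), ('A', '+')),
]
graph = build_fragment_graph(['A', 'R', 'B'], links)

# Both orientations of B between the two copies of R give a circuit.
print(is_circuit(graph, [('A', '+'), ('R', '+'), ('B', '+'), ('R', '-')]))  # True
print(is_circuit(graph, [('A', '+'), ('R', '+'), ('B', '-'), ('R', '-')]))  # True
```

In this toy example, `R` plays the role of an inverted repeat, and the two circuits are the two flip-flop versions of the same genome.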
When scaffolding each region type, we fix the previously scaffolded regions.
When no region type remains, we extract the regions from the final circuit and represent them in a region graph.
An Eulerian circuit in this last graph corresponds to a structural haplotype.
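To illustrate why Eulerian circuits can yield several haplotypes, here is a toy enumeration. The encoding is an assumption made for this sketch only and may differ from the region graph defined in the paper: regions are edges of an undirected multigraph, `P` and `Q` are hypothetical junction vertices, and a sign records the direction in which each region is traversed.

```python
def eulerian_circuits(edges, start_index):
    """Enumerate circuits using every edge once, starting with one edge read forward.

    edges: list of (name, u, v). Traversing u -> v is '+', v -> u is '-'.
    Returns a set of tuples of (name, sign).
    """
    start_name, start_vertex, after_start = edges[start_index]
    found = set()

    def extend(current, used, walk):
        if len(walk) == len(edges):
            if current == start_vertex:  # the walk closes into a circuit
                found.add(tuple(walk))
            return
        for i, (name, a, b) in enumerate(edges):
            if i in used:
                continue
            # A loop (a == b) can be taken in both directions from its vertex.
            for tail, head, sign in ((a, b, '+'), (b, a, '-')):
                if tail == current:
                    extend(head, used | {i}, walk + [(name, sign)])

    extend(after_start, {start_index}, [(start_name, '+')])
    return found


# Toy flip-flop structure: two single copies (loops) and one IR used twice.
regions = [
    ('LSC', 'P', 'P'),
    ('IR', 'P', 'Q'),
    ('IR', 'P', 'Q'),
    ('SSC', 'Q', 'Q'),
]
for circuit in sorted(eulerian_circuits(regions, 0)):
    print(' '.join(name + sign for name, sign in circuit))
# LSC+ IR+ SSC+ IR-
# LSC+ IR+ SSC- IR-
```

The two circuits differ only in the direction of `SSC`, which matches the flip-flop inversion described earlier.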
We tested our approach on synthetic contigs coming from various chloroplast genome structures and obtained very encouraging results.
Interestingly, the results suggest considering not only one optimal solution but a pool of near-optimal solutions.
The method is available through the khloraascaf Python PyPI package: https://pypi.org/project/khloraascaf/.
The code to run the tests is available at https://khloraascaf-results.readthedocs.io/en/latest/.