Coding for Insertions /Deletions and DNA Storage
Codes for correcting insertions and deletions, first proposed for communication applications, have particular importance in DNA storage, which is recently attracting attention as a dense and robust storage solution. The data is written on DNA strands by synthesis and read by sequencing. DNA storage has to cope with sequence losses, loss of ordering, and errors such as insertions and deletions.
Focus Group: Coding for DNA Storage
Prof. Eitan Yaakobi (Technion – Israel Institute of Technology), Alumnus Hans Fischer Fellow | Dr. Rawad Bitar (TUM), Postdoctoral Researcher | Andreas Lenz, Lorenz Welter, Anisha Banerjee (TUM), Doctoral Candidates | Host: Prof. Antonia Wachter-Zeh (TUM)
(Image: Nitzan Zohar (Yaakobi), Astrid Eckert (Wachter-Zeh))
In the past decade, drawing on advances in biotechnology, several experiments have demonstrated the viability of DNA as a storage medium that can offer unprecedented information density, archival durability, and device independence. One major impediment to the adoption of this technology is that data coding requirements are fundamentally different from those of existing storage media. To overcome this and realize the full potential of DNA-based storage, it is necessary to establish the technology’s information-theoretic limits and to develop coding methods to achieve these limits. To attain this goal, our collaboration focused on several fundamental problems in coding and information theory to improve the reliability and performance of DNA-based storage solutions
Figure 1
A typical DNA storage system consists of three components (see Fig. 1): (1) DNA synthesis that produces the molecules encoding the data. In state-of-the-art technology acceptable error rates are achieved for synthetic oligos of length no more than 250-300nt; (2) a storage container to store the synthetic DNA; (3) a DNA sequencing device that serves for reading and for retrieving the data. The encoding and decoding are external processes that convert the binary data into molecules of DNA in a way that enables reconstruction even under errors. From a coding perspective DNA as a storage system has several attributes that distinguish it from other storage systems. The most prominent one is that the oligos are not ordered in the memory and thus it is not possible to know the order in which they were stored. One solution uses indices, or barcodes, that are stored as part of the oligos. Furthermore, errors in DNA are typically substitutions, insertions, and deletions, while the error rates depend upon the specific technology used for synthesis and sequencing.
In [1], we studied error-correcting codes for the storage of data in synthetic DNA. We investigated a storage model where a data set is represented by an unordered set of M sequences, each of length L (see Fig. 4). Errors within that model are a loss of whole sequences and point errors inside the sequences, such as insertions, deletions, and substitutions. We derived Gilbert-Varshamov lower bounds and sphere packing upper bounds on achievable cardinalities of error-correcting codes within this storage model. We further proposed explicit code constructions that can be encoded and decoded efficiently. Comparing the sizes of these codes to the upper bounds, we showed that many of the constructions are close to optimal.
To explore the information theoretical limits of DNA-based data storage, we studied in [2] a communication system where information is conveyed over many sequences in parallel. The receiver cannot control the access to these sequences and can only draw from these sequences, unaware which sequence has been drawn (see Fig. 2). Further, the drawn sequences are susceptible to errors. We considered a suitable channel model that models this input-output relationship and computed its information capacity for a wide range of parameters. We believe that this analysis can guide future DNA-based data storage experiments by establishing theoretical limits on achievable information rates and by proposing decoding techniques. In [3], we studied coding for the synthesizing DNA, which is the most costly part of existing systems. As a step toward more efficient synthesis, we designed codes that minimize the time and number of cycles to produce the DNA strands. We considered a popular synthesis process that builds many strands in parallel in a step-by-step fashion using a fixed supersequence S. The machine iterates through S one nucleotide at a time, and in each cycle, it adds the next nucleotide to a subset of the strands (see Fig. 3). The synthesis time is determined by the length of S. We showed that by introducing redundancy to the synthesized strands, we can significantly decrease the number of synthesis cycles. We derived the maximum amount of information per synthesis cycle assuming S is an arbitrary periodic sequence. To prove our results, we exhibited new connections to cost-constrained codes.
In [4], we analyzed synthetic polymer-based data storage, which involves designing molecules of distinct masses to represent the respective bits, followed by the synthesis of a polymer of molecular units that reflects the order of bits. Reading out the stored data requires the use of a tandem mass spectrometer that fragments the polymer into shorter substrings and provides their corresponding masses, from which the composition, i.e., the number of 1s and 0s in the concerned substring, can be inferred. Prior works have dealt with the problem of unique string reconstruction from the set of all possible compositions, called composition multiset. Additionally, error-correcting schemes to deal with substitution errors caused by imprecise fragmentation during the readout process have also been suggested. Our work built on this research by extending previously considered error models, mainly confined to substitution of compositions. To this end, we defined new error models that consider insertions of spurious compositions and deletions of existing ones, thereby corrupting the composition multiset. We analyzed if the reconstruction codebook proposed in a previous work is indeed robust to such errors and, if not, proposed new coding constraints to remedy this. In [5], we extended the study of codes correcting deletions and insertions to the problem of constructing codes correcting deletions in arrays. Under this model, it is assumed that an array can experience deletions (insertions) of rows and columns. These deletion errors are referred to as crisscross deletions (insertions). We first showed that the problems of correcting crisscross deletions and crisscross insertions are equivalent. Our focus was on code correcting one row and one column deletion. A non-asymptotic bound assures that the redundancy is at least 2n − 3 + 2logn bits, and a code construction with an existential encoding and an explicit decoding algorithm was presented. The redundancy of the construction is at most 2n + 4logn + 7+ 2 log e.
Figure 2
Figure 3
Figure 4
[1]
Lenz, A. et al. (2020).
[2]
Lenz, A. et al. (2023).
[3]
Lenz, A. et al. (June 2020).
[4]
Banerjee, A. et al. (2023).
[5]
Bitar, I. et al. (2021).
Selected publications
- Lenz, A., Siegel, P.H., Wachter-Zeh, A. & Yaakobi, E. Coding over Sets for DNA Storage. IEEE Trans. Inform. Theory 66, 2331–2351 (2020).
- Lenz, A., Siegel, P.H., Wachter-Zeh, A. & Yaakobi, E. The Noisy Drawing Channel: Reliable Data Storage in DNA Sequences. IEEE Trans. Inform. Theory 69, 2757–2778 (2023).
- Lenz, A., Liu, Y., Rashtchian, C., Siegel, P.H., Wachter-Zeh, A. & Yaakobi, E. Coding for
Efficient DNA Synthesis. IEEE Int’l Symp. on Information Theory, Los Angeles, California, 2903–2908 (June 2020).
- Banerjee, A., Wachter-Zeh, A. & Yaakobi, E. Insertion and Deletion Correction in Polymer-based Data Storage. IEEE Trans. Inform. Theory 69, 4384–4406 (2023).
- Bitar, R., Smagloy, I., Welter, L., Wachter-Zeh, A. & Yaakobi, E. Criss-Cross Insertion and Deletion Correcting Codes. IEEE Trans. Inform. Theory 67, 7999–8015 (2021).