Coding for DNA Storage Applications

The surge of Big Data platforms and energy conservation concerns are creating new challenges for the storage community in identifying extremely high-volume, non-volatile, and durable recording media. Despite continuing advances in traditional data recording techniques, innovative approaches must be developed to meet these challenges. It is reported that 2.7 zettabytes of data exist in the digital universe and that 90% of this data was generated in the past two years. Moreover, the world's data is expected to reach 163 zettabytes by 2025. Because of the capacity limitations of existing storage solutions, however, the amount of available storage is not predicted to scale at nearly the same pace.

The potential capacity and endurance of DNA storage make it an attractive option for near-future storage, mostly for archiving applications. However, this medium still faces major challenges in the areas of device reliability and performance. These challenges can be overcome, in part, through innovative coding and data handling techniques, which are the subject of the proposed research in the Focus Group around Hans Fischer Fellow Prof. Eitan Yaakobi and his Host Rudolf Mößbauer Tenure Track Prof. Antonia Wachter-Zeh (TUM Department of Electrical and Computer Engineering). Specifically, this project focuses on three important topics in coding and algorithms for DNA-based storage systems.

  1. Codes for clustering. Clustering is the first step in decoding the DNA strands. The goal is to partition all received strands into groups such that every group corresponds to one of the original input strands. Designing codes for this step is an important task to guarantee its success.

  2. Codes for the reconstruction problem. In this paradigm, the information is transmitted over several noisy channels and the decoder needs to reconstruct the transmitted word given access to all channels' outputs. This problem mimics the synthesis and sequencing processes in DNA where every DNA strand has a large number of copies, thereby providing several noisy versions of the information.

  3. Error-correcting codes. We will study the design of error-correcting codes over sets that are suitable for data storage in DNA. Errors within this model are losses of sequences and point errors inside the sequences, such as insertions, deletions, and substitutions.
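
To make the first two topics concrete, the following is a minimal Python sketch of clustering followed by reconstruction. It is an illustration under strong simplifying assumptions, not the project's method: it assumes a substitution-only channel with equal-length strands, so Hamming distance can be used, whereas the errors named in topic 3 also include insertions, deletions, and lost sequences. The function names (`hamming`, `cluster_reads`, `majority_reconstruct`), the `radius` parameter, and the example strands are all hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strands differ."""
    if len(a) != len(b):
        raise ValueError("strands must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, radius):
    """Greedy clustering (assumed heuristic): a read joins the first cluster
    whose representative (its first read) is within `radius` substitutions;
    otherwise it starts a new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if hamming(read, cluster[0]) <= radius:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def majority_reconstruct(cluster):
    """Reconstruct one strand by a per-position majority vote over its noisy copies."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*cluster)
    )


# Hypothetical example: noisy copies of the strands ACGTACGT and TTGGCCAA,
# each copy carrying exactly one substitution.
reads = [
    "ACGAACGT", "TTGGCCAT", "ACGTACCT",
    "TAGGCCAA", "CCGTACGT", "TTGGGCAA",
]
recovered = [majority_reconstruct(c) for c in cluster_reads(reads, radius=2)]
print(recovered)  # ['ACGTACGT', 'TTGGCCAA']
```

In this toy setting the two stored strands differ in six positions, so a radius of 2 separates their noisy copies cleanly. Coding for clustering and reconstruction aims to provide comparable guarantees under realistic error models, where such a simple distance threshold no longer suffices.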

The proposed research involves the information-theoretic analysis and development of novel coding schemes. We anticipate that work on the proposed problems will contribute not only to coding solutions that improve the reliability of DNA storage, but also to innovative concepts in the fields of information and coding theory.