Abstract
DNA sequencing offers an opportunity to unravel life’s rules. With this information, synthetic biology has made significant progress in controlling living systems. Central to cellular engineering is the design of genetic parts to build biological programs. For many genetic parts, connecting DNA sequence to function remains elusive. This knowledge gap hampers our ability to develop physics-based sequence-to-function models, limiting our design options for genetic parts. Recent progress in genetic part testing has led to the collection of large-scale datasets linking sequence to function. Coupled with representation learning methods, such datasets could accelerate genetic part design. Here, I investigate data-driven methods to improve sequence-to-function models for genetic parts, focusing mainly on the 5’ untranslated region (5’UTR).
First, data quality for Flow-seq is examined using a computational model. Simulations reveal the experimental determinants of data precision, and an alternative inference method based on maximum likelihood is proposed to improve the quality of the estimates. This computational pipeline is also used to evaluate data precision for various Flow-seq datasets.
Next, deep learning methods are explored to model 5’UTR-mediated protein expression regulation. Sequence-based models are developed, which achieve optimal prediction accuracy when trained on enough examples. However, sequence-based models perform poorly when applied to different genetic contexts. Transfer learning is studied as a way to enhance the reusability of sequence-based models across contexts, and it lowers the number of examples needed for model retraining. Improving model generalization is further examined by integrating RNA secondary structure predictions into the model. This structural information is processed through graph neural networks, revealing superior generalization on multiple datasets exhibiting significant structural diversity. This brings us closer to the reliable, general-purpose prediction of 5’UTR-mediated protein expression regulation.
Overall, this thesis presents a systematic approach for integrating high-throughput experimental data to predict gene expression, with potential applicability to other genetic parts.
| Date of Award | 23 Jan 2024 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Thomas E Gorochowski (Supervisor) & Christophe Andrieu (Supervisor) |