Abstract
This thesis explores two main approaches for making and improving explanatory predictions of phenotype and protein function from genotype. Both approaches seek to leverage the work of researchers around the world who contribute their results to community databases, and combine these results where possible to get a fuller picture of the complex system of interacting molecules.
The first part of this thesis comprises three chapters that cover the necessary background. Chapter 1 briefly introduces the philosophy of this thesis. The biology background chapter (2) then presents a detailed overview of the scientific model that links genotype and phenotype. It does not contain any of my own research. The computational biology background chapter (3) follows on from the previous chapter by discussing popular resources in computational biology, their provenance, and the impact of this on the field. In this chapter, I also present my contributions to collaborative projects: the Proteome Quality Index paper[2] and the 2014 SUPERFAMILY update paper[3].
In Chapter 4, I briefly present the Snowflake phenotype predictor, which uses variant conservation scores, variant prevalence in the population, and protein domain architectures as input to an unsupervised learning method. This predictor, the development of which resulted in a patent[4], finds unusual combinations of variants associated with phenotypes, and is designed to create explanatory predictions of complex traits.
While investigating Snowflake’s predictions, I found that it could include protein-coding SNPs in predictions about phenotypes that exist in tissues in which the protein is never expressed, which brings us to the third and final part of this thesis. Chapter 5 discusses the Filip protein function prediction filter, which uses gene expression data to filter out predictions for proteins that are not expressed in the tissue relating to a given phenotype. I discuss attempts to validate Filip’s predictions, including its performance in the CAFA3 protein function prediction competition[5]. In addition, this part presents tools and datasets that were developed while creating Filip: Ontolopy, a Python package for querying OBO files, in Chapter 6, and a combined dataset of gene expression data in Chapter 7.
| Date of Award | 22 Mar 2022 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Oliver Ray (Supervisor) |