The rapid acceleration of next-generation sequencing has led to the identification of more than 600 million single nucleotide variants (SNVs) within human populations around the world (National Library of Medicine). Despite this progress, the interpretation of these variants and their functional consequences remains a formidable challenge (Zhang et al., 2024). While numerous variant effect prediction tools have been developed over the past three decades (Ng and Henikoff, 2003; Adzhubei et al., 2010; Kircher et al., 2014; Quang et al., 2015), most adopt a disease-agnostic approach, failing to account for the unique mutational landscapes and selective pressures specific to cancer. Furthermore, experimental functional assays, though valuable, cannot comprehensively characterise the vast landscape of potential variants due to the combinatorial explosion of mutations (Wei and Li, 2023), making computational prediction tools particularly valuable for translational genomics. There is a pressing need for cancer-specific methods to distinguish driver variants that actively promote oncogenesis from functionally neutral variants. Identifying these key mutations is crucial for deepening our understanding of tumour biology and has multiple translational applications: developing innovative therapies, predicting drug resistance mechanisms, guiding gene editing strategies, and enabling personalised medicine. By identifying functionally relevant variants, we can develop more accurate predictive biomarkers for drug response and design treatments tailored to each tumour’s unique genetic landscape, ultimately leading to more effective cancer care. In Chapter 1, we establish the context through a critical review of the literature prior to this PhD, identifying significant gaps in feature extraction and predictive modelling in cancer genomics. 
Chapter 2 details the genomic and protein features central to building supervised machine learning models in this thesis, exploring their biological significance and extraction methods. In Chapter 3, we introduce DrivR-Base, a publicly available codebase that facilitates the efficient mining of more than 1500 features for use in machine learning models, published in Bioinformatics (Francis et al., 2024). DrivR-Base addresses a critical bottleneck by automating the collection of features previously scattered across diverse sources (Rentzsch et al., 2019; Adzhubei et al., 2013), and is designed to be widely adopted by the research community, thus accelerating progress in the field by providing a reproducible feature extraction toolkit. Chapter 4 extends our work with CScape-XF, a model that integrates novel features from DrivR-Base, particularly DNA shape features not previously explored in variant effect prediction, to improve accuracy in the classification of cancer variants (Francis et al., 2024; Chiu et al., 2016). In Chapter 5, we introduce CanDrivR-CS, a framework to predict the pathogenicity of genetic variants in specific cancer sub-types. Unlike existing tools that aggregate cancer mutations across all tissues (Rogers et al., 2017, 2020), we propose that unique driver variants across cancer contexts might benefit from specialised predictors, using 50 custom models to tackle tumour heterogeneity. Chapters 6 and 7 provide a discussion of our findings, including a critical analysis of the thesis’s contributions, limitations, and future research priorities, whilst integrating recent advances and comparing state-of-the-art approaches. We explore the potential integration of our methods with emerging foundation models (Brixi et al., 2025; Meier et al., 2021) and propose lab-in-the-loop methodologies to address persistent challenges in obtaining high-quality training data.
By combining the extensive feature set of DrivR-Base with embeddings from recent language foundation models and multi-modal data integration (incorporating genomic sequences, protein sequences, and cellular imaging), we outline future work that opens new opportunities for cancer therapeutics. This convergence of approaches has the potential to transform variant effect prediction by capturing complementary perspectives on mutation impacts, enabling more precise identification of therapeutic targets, elucidation of resistance mechanisms, and guidance for gene editing strategies in clinical applications.
| Date of Award | 30 Sept 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Tom R Gaunt (Supervisor), Pau Erola (Supervisor) & I C G Campbell (Supervisor) |