Investigating breast cancer risk factors and mediation pathways by integrating genetic and literature-mined evidence

Student thesis: Doctoral Thesis, Doctor of Philosophy (PhD)


Breast cancer is the leading cause of cancer-related deaths among women
worldwide. It is a heterogeneous disease with a complex aetiology shaped by
both genetic and lifestyle risk factors. As incidence
rates continue to rise globally, there is an urgent need to identify new and
modifiable breast cancer risk factors. Over the years, the breast cancer research
field has produced abundant data at molecular, genetic, and population levels,
and has made significant advances in understanding disease development. In
this Thesis, I combined hypothesis-driven and hypothesis-free/data mining
approaches to investigate breast cancer risk factors and their mechanisms. I
focused on the link between early life adiposity and breast cancer and explored
the potential use of biomedical ‘Big Data’ platforms in epidemiology.

In Chapter 3, I investigated the unexplained protective effect that childhood
adiposity has on breast cancer risk. Through reviewing the literature, I
identified a number of potential mediator traits, i.e. traits that are affected by
adiposity in childhood and are also known to influence breast cancer risk.
The aim was to perform a mediation analysis using Mendelian randomization
(MR) causal estimates to investigate their mediating role. I designed an MR
mediation workflow and applied it to 15 hypothesised mediators from four
categories: hormones, reproductive traits, physical traits, and glycaemic traits.
None of the tested traits appeared to strongly mediate the effect, but there was
evidence of a small mediating effect of IGF-1.
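As a rough illustration of what an MR mediation estimate involves, the sketch below uses the product-of-coefficients approach, one common way to split a total effect into indirect and direct parts. It is not necessarily the workflow used in the thesis. The function name and all numbers are hypothetical toy values on the log-odds scale, not results from this work.

```python
def proportion_mediated(total_effect, exposure_to_mediator, mediator_to_outcome):
    """Split a total effect into indirect and direct parts.

    Product-of-coefficients method (an assumption for illustration):
      indirect = (exposure -> mediator) * (mediator -> outcome)
      direct   = total - indirect
      proportion mediated = indirect / total
    """
    if total_effect == 0:
        raise ValueError("total effect must be non-zero")
    indirect = exposure_to_mediator * mediator_to_outcome
    direct = total_effect - indirect
    return indirect, direct, indirect / total_effect


# Toy values only (hypothetical, not estimates from the thesis).
total = -0.50   # total effect of the exposure on the outcome
a = -0.40       # effect of the exposure on the mediator
b = 0.50        # effect of the mediator on the outcome

indirect, direct, prop = proportion_mediated(total, a, b)
print(f"indirect={indirect:.2f}, direct={direct:.2f}, proportion={prop:.0%}")
```

A proportion close to zero would correspond to a trait that does not appear to mediate the effect. In practice each estimate also carries uncertainty, which this sketch ignores.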

In Chapter 4, I used EpiGraphDB, a biomedical knowledge graph, to search
for breast cancer risk factors in a hypothesis-free way, as a discovery screen.
Using the MR causal estimates data (MR Everything-vs-Everything, MR-EvE)
stored in the graph, I replicated the results for previously known causally
related traits and identified novel risk factors. The results were made publicly
available in a Shiny app. I also used MR-EvE data to perform a rapid screen
for ‘potential mediators’ of the identified risk factor traits. This Chapter
presents the complete risk factor search results, and also focuses on several
case study traits, including childhood body size, for which, in Chapter 3, the
hypothesis-driven mediator search was not entirely successful.
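The idea of a rapid mediator screen over pre-computed MR estimates can be sketched as a simple set intersection: keep traits that the risk factor appears to affect and that in turn appear to affect the outcome. This is a minimal sketch under stated assumptions. The record layout, the p-value threshold, and the placeholder traits ("trait A", "trait B", "trait C") are hypothetical and do not reflect the EpiGraphDB schema or any real estimates.

```python
from typing import NamedTuple


class MREstimate(NamedTuple):
    """One pre-computed MR estimate (hypothetical record layout)."""
    exposure: str
    outcome: str
    beta: float
    pval: float


def candidate_mediators(estimates, exposure, outcome, p_threshold=0.05):
    """Traits M with evidence for both exposure -> M and M -> outcome."""
    # Traits the exposure appears to affect.
    affected = {e.outcome for e in estimates
                if e.exposure == exposure and e.pval < p_threshold}
    # Traits that appear to affect the outcome.
    influencing = {e.exposure for e in estimates
                   if e.outcome == outcome and e.pval < p_threshold}
    return sorted((affected & influencing) - {exposure, outcome})


# Toy records with made-up values, for illustration only.
toy_estimates = [
    MREstimate("childhood body size", "trait A", 0.3, 1e-6),
    MREstimate("trait A", "breast cancer", 0.2, 1e-4),
    MREstimate("childhood body size", "trait B", 0.1, 0.4),   # fails threshold
    MREstimate("trait B", "breast cancer", 0.5, 1e-8),
    MREstimate("trait C", "breast cancer", 0.2, 1e-3),        # no exposure link
    MREstimate("childhood body size", "breast cancer", -0.5, 1e-9),
]

mediators = candidate_mediators(toy_estimates, "childhood body size", "breast cancer")
print(mediators)
```

A screen of this kind only shortlists candidates; each one would still need a formal mediation analysis.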

In Chapter 5, I used the other part of EpiGraphDB, literature-mined biomedical
entity relationship data, to investigate how the selected risk factor traits
(Chapter 4 case studies) may be linked to breast cancer. I developed a method
for connecting ‘literature spaces’ between traits that enables the identification
of potential mechanisms or intermediates between them. The intermediates
gathered in this way could provide an overview of the shared biology between
a trait and breast cancer (or another disease outcome). The identified
intermediates could also be used to generate hypotheses about potential
mediators, which could then be tested and validated using an MR mediation
framework similar to the one employed in Chapters 3 and 6.
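One simplified way to picture connecting two ‘literature spaces’ is to treat each as a set of subject-predicate-object triples and look for terms that sit downstream of the trait and upstream of the outcome. The sketch below shows only this single-intermediate overlap and is not the method developed in the thesis. The triples, predicate names, and placeholder terms are hypothetical.

```python
def shared_intermediates(triples, trait, outcome):
    """Terms linked as trait -> term and term -> outcome in the triples."""
    # Objects of statements whose subject is the trait.
    downstream_of_trait = {obj for subj, _pred, obj in triples if subj == trait}
    # Subjects of statements whose object is the outcome.
    upstream_of_outcome = {subj for subj, _pred, obj in triples if obj == outcome}
    return sorted((downstream_of_trait & upstream_of_outcome) - {trait, outcome})


# Toy literature-mined triples (made up for illustration).
toy_triples = [
    ("trait", "AFFECTS", "term X"),
    ("term X", "CAUSES", "outcome"),
    ("trait", "AFFECTS", "term Y"),   # not linked to the outcome
    ("term Z", "CAUSES", "outcome"),  # not linked to the trait
]

intermediates = shared_intermediates(toy_triples, "trait", "outcome")
print(intermediates)
```

Longer chains with more than one intermediate would need a graph search rather than a single set intersection.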

In Chapter 6, I carried out a follow-up study on the hypothesis-driven
investigation in Chapter 3. Previously, I was not able to explore one of the most
plausible mediators, mammographic density (MD), due to the unavailability
of the full summary data. Through collaboration, I gained access to MD
data and applied the MR mediation framework devised in Chapter 3 to test
MD as a potential mediator. I found that mammographic dense area is a
plausible mediator, accounting for 56% of the protective effect that childhood
adiposity has on breast cancer risk. In this Chapter, I also performed an
extensive investigation of MD genetic instruments and the effect of MD on
breast cancer.

Overall, this Thesis is a study of breast cancer aetiology, with a focus on
deciphering the protective effect of childhood adiposity on breast cancer and
leveraging the available biomedical ‘Big Data’ for hypothesis generation and
evidence triangulation in epidemiology. I identified a plausible mediator –
mammographic density – of the unexplained protective effect, which has been
a subject of discussion and debate in the literature for over two decades. I
also presented a novel and efficient approach for undertaking a systematic
investigation of disease risk factors by mining EpiGraphDB, using both genetic
and literature data. The identified risk factors can be followed up in future
studies, and the overall approach can be used to study the aetiology of other

Date of Award: 19 Mar 2024
Original language: English
Awarding Institution
  • The University of Bristol
Supervisors: Tom R Gaunt, Tim Robinson & Yi Liu


  • Genetic epidemiology
  • Mendelian randomization
  • Breast cancer
  • Risk factors
  • data mining
  • literature mining
  • childhood adiposity
  • adiposity
