Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets

Motivation: Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none perfectly represents the entire human phenome and exposome, and mapping trait names across large datasets is therefore time-consuming and challenging. Recent developments in language modelling have created new methods for the semantic representation of words and phrases, and these methods offer new opportunities to map human trait names, in the form of words and short phrases, both to ontologies and to each other. Here we present a comparison between a range of established and more recent language modelling approaches for the task of mapping trait names from UK Biobank to the Experimental Factor Ontology (EFO), and also explore how they compare to each other in direct trait-to-trait mapping.

Results: In our analyses of 1191 traits from UK Biobank with manual EFO mappings, the BioSentVec model performed best, matching 40.3% of the manual mappings correctly. The BlueBERT-EFO model (finetuned on EFO) performed nearly as well (38.8% of traits matching the manual mapping). In contrast, Levenshtein edit distance mapped only 22% of traits correctly. Pairwise mapping of traits to each other demonstrated that many of the models can accurately group similar traits based on their semantic similarity.

Availability and Implementation: Our code is available at https://github.com/MRCIEU/vectology.


Introduction

Kadoorie Biobank 5 and others to discover new predictive biomarkers and interventions. Such studies measure many thousands of phenotypic variables. Systematic analyses such as phenome-wide association studies (PheWAS) 6-8 can describe relationships between thousands of variables, producing large datasets. However, many variables are inconsistently named across studies, and can prove difficult to map to each other or to an existing ontology such as the Experimental Factor Ontology (EFO) 9, the Human Phenotype Ontology (HPO) 10 or the Disease Ontology 11. In parallel, the biomedical literature contains a wealth of data on human diseases, traits and risk factors described using free text (with some mappings to Medical Subject Headings; MeSH). Systematically integrating knowledge across these different datasets and domains would enable us to triangulate the evidence 12 for different risk factor/disease combinations, but at present this is hindered by inconsistencies in trait nomenclature.

The complexity of variable names is illustrated by UK Biobank, an internationally important population study that has collected a wealth of data on half a million people 1. Examples of text labels for variables in UK Biobank include easily recognizable traits such as "systolic blood pressure" and disease names such as "coronary heart disease".
However, the study also includes more complex variables, including those derived from questionnaire data, such as "able to walk or cycle unaided for 10 minutes" and "cough on most days". An array of other variables also exists, including International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD10) codes such as "anaemia due to enzyme disorders" (D55) and "syncope and collapse" (R55), the former mapping directly to the EFO (EFO:0009529), but the latter not. Direct mapping by text matching to ontology terms is therefore not realistic, and whilst manual mapping to ontologies is sometimes appropriate, it is time consuming, especially if mapping to multiple different ontologies (which cover different domains of the human phenome and exposome).

Here we apply a range of text embedding methods and BERT language models (including one trained on EFO) to the problem of mapping biomedical variables (from UK Biobank) to an ontology (EFO) and compare their performance, strengths and weaknesses. We also illustrate the use of these models on a direct trait-to-trait mapping problem.

System and Methods
The EFO dataset

The Experimental Factor Ontology (EFO) 9 contains parts of several biological ontologies as well as variables from many large-scale databases. Whilst many other ontologies exist, this particular ontology is widely used for human traits and is well documented, so was considered a good choice for this evaluation. A version of the EFO data set containing 25,390 terms was downloaded from the EBI RDF platform 37 in March 2021. This was used for all subsequent analysis and is available in the supplementary material (supplementary files S1 and S2).

Ontology distance metric

To understand the relative distance between any two EFO terms, and to enable us to measure how well a trait was mapped, we used the nxontology Python library 38. By creating a parent-child network of EFO terms we could compute a similarity measure between any pair of EFO terms and use this to create a measure of how close two terms are within the EFO hierarchy. For this analysis we used the Batet measure 39 (parameter "batet" in the library), as it was developed using biomedical taxonomy data and correlates well with manual biomedical concept comparisons. The measure ([0, 1]) is a ratio calculated from the shared and non-shared information between a pair of concepts, where a lower score indicates less shared ancestry between the two ontology concepts. From here on we will refer to this metric as the EFO-Batet score.

To create an nxontology instance, we provided the parent/child EFO edge data to the NXOntology class (supplementary code block S1).

Trait-to-trait mapping distance score

The different models use different approaches for measuring distance between text terms (supplementary table S2). For simplicity we refer to these metrics (edit distance, cosine similarity, semantic distance) as the "trait similarity score" throughout.
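In our pipeline the EFO-Batet score comes from nxontology, but the underlying Batet measure is simple enough to illustrate directly. Below is a minimal, stdlib-only sketch (the toy hierarchy and function names are illustrative, not taken from our code; the real score is computed by nxontology over the full EFO graph):

```python
import math

def ancestors(parents, term):
    """Return the set of ancestors of `term` (including itself),
    given a child -> list-of-parents mapping."""
    seen = {term}
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def batet_similarity(parents, a, b):
    """Batet-style taxonomic similarity scaled to [0, 1]:
    1 - log2(1 + non-shared / total) over the two ancestor sets.
    Identical terms score 1.0; terms with little shared ancestry
    score close to 0."""
    anc_a, anc_b = ancestors(parents, a), ancestors(parents, b)
    union = anc_a | anc_b
    shared = anc_a & anc_b
    non_shared = len(union) - len(shared)
    return 1.0 - math.log2(1 + non_shared / len(union))

# Toy hierarchy: root -> disease -> {angina, asthma}
toy = {"disease": ["root"], "angina": ["disease"], "asthma": ["disease"]}
print(batet_similarity(toy, "angina", "angina"))  # 1.0
print(batet_similarity(toy, "angina", "asthma"))
```

Sibling terms share two of four ancestors here, giving 1 - log2(1.5) ≈ 0.415, consistent with the intuition that more shared ancestry yields a higher score.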
Mapping methods

We used a range of existing string-comparison methods and language models, representing different approaches to language representation and different pre-training datasets, to enable us to evaluate the impact of these differences on mapping performance.

Levenshtein edit distance ratio 40 was used to understand how well a basic string comparison performs; using the ratio() function we obtained a measure of similarity between two strings.

Zooma 41 is an established tool for mapping text to ontologies using a combination of curated mappings to existing data sets and standard text matching (the exact method is undocumented). For this analysis we used the Zooma API with the "required" parameter set to "None" and the "ontologies" parameter set to "efo" (supplementary code block S2) to avoid circularity.

ScispaCy is built on spaCy and provides models for processing biomedical, scientific or clinical text 46.
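The Levenshtein baseline described above can be sketched in plain Python. This is a stdlib-only illustration: the analysis itself uses the python-Levenshtein ratio() function, whose normalisation differs slightly from the simple form assumed here.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)
    via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ratio(a, b):
    """Similarity in [0, 1] as 1 - distance / max length.
    (A simplification: Levenshtein.ratio() uses a different
    normalisation, so absolute values will differ.)"""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # 3
print(ratio("systolic blood pressure", "systolic blood pressure"))  # 1.0
```

As the results below show, such purely character-level similarity struggles whenever a trait and its ontology term share meaning but not spelling.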

Text embedding methods
The en_core_sci_lg model was downloaded and installed as described in the documentation (https://allenai.github.io/scispacy). The model is accessed via the same spaCy methods as above.

BlueBERT 31 (NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12) and BioBERT 47 (biobert_v1.1_pubmed) are biomedical language model implementations based on the original BERT pretrained weights, with further language model training on biomedical corpora to improve language understanding in the biomedical domain. For transformer models, the vector representation of an entity is computed as the average of the hidden state tensor of the n−1 (second-to-last) layer, used as a fixed representation of the tokenised sequence (i.e. the default strategy in 48). These models were obtained from their respective model repositories, then served via the bert-as-service 48 API (see supplementary code block S6 for example usage and the code repository for detailed set-up).

Bespoke ontology classifier

In addition to the established language models, we also explored the effect of tailoring a transformer model to the EFO using transfer learning.
BlueBERT-EFO was developed by finetuning BlueBERT with an ontology entity alignment training process designed as a sequence classification task (for details see supplementary text S1). To create a similarity matrix of the entities, for each pair of terms the model produces a score representing the inferred ontology distance: the lower the predicted number of steps between two entities, the closer they are represented in the ontology graph.

We first explored how well the top prediction of each method compared to the manual annotation (Figure 2). For results that exactly agree with the manual annotation (Figure 2A), the best performing methods were BioSentVec (40.3%), BlueBERT-EFO (38.8%), Zooma (37.2%) and ScispaCy (36.5%), whose results were notably higher than those of the other methods included in the analysis. Pairwise proportions Z-test results (supplementary Table S4) between each of the mapped proportions confirm that there is a notable difference between the results of the best performing group and those of the other methods, but that the differences within the group are minimal (the largest difference is between BioSentVec and ScispaCy, P-value = 0.058).

Whilst none of the methods exceeded 40.3% exact mapping, it is important to consider three key points: (1) some of the manual predictions are likely to be incorrect; (2) the methods and models used here to automate this approach are quick and easy to use, and would scale to a task size that would be impractical for manual annotation; (3) even the most sophisticated natural language processing models will struggle to predict the same result as a human, particularly in cases where the query string contains two un-linked entities, or even a negated term, e.g. "enduring personality changes not attributable to brain damage and disease".

In some situations (e.g.
a recommender of similar concepts), an exact match may not be required, and if the top prediction from a model is sufficiently close to the manual annotation, this may be a suitable result. We then examined how well the top predictions from each method align with the manual annotation in terms of their EFO-Batet distance to the manual EFO terms. Figure 2B shows the aggregate results for the subset of methods (see supplementary Figure S9 for full results) over a range of EFO-Batet score thresholds for top predictions to be included: from top predictions that are strictly identical to the manual annotation (threshold == 1, i.e. as in Figure 2A), to those that are sufficiently close to the manual annotation in the ontology space (e.g. threshold >= 0.9), and then to results with a greater ontology distance tolerance (e.g. threshold >= 0.6).

predictions (which we would expect to be enriched for related terms, and potentially to contain the correct mapping term). We then investigated the distribution of EFO-Batet scores for both the top prediction (Figure 3A) and the top 10 predictions (weighted average EFO-Batet scores, Figure 3B), and the aggregate results of generalised top ranges, to determine which models prioritise the most relevant set of traits. As shown in Figure 3A, for top predictions BlueBERT-EFO retrieves a higher number of candidates with high ontology relevance to the manual annotation (greater mass in the upper tail) and a lower number of candidates with low relevance (lower mass in the lower tail), which is also confirmed by pairwise Kolmogorov-Smirnov two-sample tests (supplementary table S4) on the difference between its distribution and those of the other methods (P-value ≤ 3.3e-09).
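The Kolmogorov-Smirnov statistic underlying these two-sample tests is the maximum gap between the two empirical distribution functions. A stdlib-only sketch of the statistic itself (the reported P-values would come from a stats library such as scipy, which is not assumed here):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D: the maximum absolute
    difference between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    n_a, n_b = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n_a and j < n_b:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties
        # are handled consistently on both sides.
        while i < n_a and a[i] <= x:
            i += 1
        while j < n_b and b[j] <= x:
            j += 1
        d = max(d, abs(i / n_a - j / n_b))
    return d

print(ks_statistic([0.2, 0.5, 0.9], [0.2, 0.5, 0.9]))  # 0.0 (identical)
print(ks_statistic([0.1, 0.2], [0.8, 0.9]))            # 1.0 (disjoint)
```

Applied to the per-method EFO-Batet score distributions, a larger D indicates a method whose score distribution differs more from the comparison method's.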

Figure 3: Distribution of predicted EFO-Batet scores by method. (A) Distribution of EFO-Batet score for the highest-ranking (top 1) match for each query term; (B) Distribution of weighted average EFO-Batet score for the top 10 matches for each query term; (C) Averaged sum of the top N weighted average EFO-Batet scores of the predicted EFO candidates for a query term, for the subset of methods BlueBERT-EFO, BlueBERT and BioSentVec (full results are available in Supplementary Figure S10).
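The rank-weighted average behind panels (B) and (C) — the top prediction weighted N, the next N-1, and so on down to 1 — can be sketched as follows (the function name is ours; a sketch of the weighting scheme described in the text, not the analysis code itself):

```python
def weighted_average_score(scores):
    """Weighted average of ranked EFO-Batet scores: with N scores,
    the highest-ranked prediction gets weight N, the next N-1, ..., 1,
    so early ranks dominate the average."""
    n = len(scores)
    weights = range(n, 0, -1)  # N, N-1, ..., 1
    total = sum(w * s for w, s in zip(weights, scores))
    return total / sum(weights)

# EFO-Batet scores for one query term's top 3 predictions, best first
print(weighted_average_score([1.0, 0.8, 0.5]))  # (3*1.0 + 2*0.8 + 1*0.5) / 6
```

This rewards methods that place ontologically relevant candidates near the top of the ranking rather than merely somewhere in the top N.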

We then extended the analysis to consider a set of top results. Figure 3B shows the weighted average EFO-Batet score for the top 10 predictions for each query term (the top result given a weighting of 10, the second 9, and so on) to show the aggregate ontology relevance of the retrieved candidates. Figure 3C shows the averaged sum of the weighted average scores at each top N level, providing an overall measure of the general ontology relevance of candidate retrieval for a subset of methods (see supplementary figure S10 for full results). The results suggest that BlueBERT-EFO will generally return a set of traits more closely associated with the correct part of the EFO than the other methods, corroborating the earlier finding that finetuning the BlueBERT language model with EFO structure information notably improves EFO candidate retrieval.

We also investigated the performance of a hybrid method (BioSentVec-X-BlueBERT-EFO), where BioSentVec is applied in the first stage to select the top X candidates, which are then reranked by BlueBERT-EFO in the second stage.

and Methods) and all others (excluding itself). Figure 4 shows the results of a Spearman rank correlation analysis comparing the matrices of these pairwise trait-mapping scores between each pair of models. The results broadly indicate three clusters of models. One contains the EFO-Batet (manual mapping) and BlueBERT-EFO scores, suggesting again that the BlueBERT-EFO model, as expected, predicts distances most similar to those found in the EFO hierarchy. A second group contains the other BERT models (BioBERT and BlueBERT), highlighting the similarity between those two transformer models. A third group contains the spaCy, ScispaCy and BioSentVec models, which may reflect their shared underlying methodology (i.e. variations of word2vec).
Whilst this analysis cannot tell us which method performs "best" at trait-to-trait mapping, it highlights that these models perform differently at this task, which should be taken into account in the development of future automated trait-to-trait mapping methods.

to illustrate how these models perform at this task. Supplementary Figure S11 is provided as a reference and shows a clustered dendrogram of EFO-Batet scores for the distance between traits in the EFO hierarchy. The clusters represent the relationships between EFO terms as determined by the EFO hierarchy and Batet scores. We observe a sharp separation between measurement-based quantitative traits and disease traits. This reflects the structure of the EFO, with quantitative traits falling into the "information entity" and disease traits into the "material property" top-level branches of EFO (https://www.ebi.ac.uk/ols/ontologies/efo).

Using the same 43 traits, we then produced a matrix of trait-to-trait distance scores for each model, this time based on cosine distances (or equivalent; see System and Methods). These matrices were compared to each other using the Mantel test 51, a method to compute correlation distances between matrices (supplementary Figure 12). Here we see a similar pattern, with the BlueBERT-EFO and EFO-Batet (i.e. position in the EFO hierarchy) scores clustered together. This similarity is apparent in the BlueBERT-EFO clustermap (supplementary Figure 13): there are some clear differences, but the major distinction between quantitative traits and disease is present, with almost exactly the same traits clustering into the same two groups. This likely reflects the finetuning of this model to EFO.

Discussion

A number of approaches exist for text matching and semantic representation of text.
We set out to investigate the use of these approaches for the automated mapping of human trait names to ontologies (using the specific example of EFO) and to explore how they perform at direct trait-to-trait mapping.

Comparison of approaches for automated mapping to ontology

Our analyses illustrate that using text embeddings to map biomedical variables to EFO has a fairly high error rate, but is at least comparable to existing approaches (e.g. Zooma 14). Given the ease of use and scalability of some of the models, we recommend this approach when tackling problems that involve many thousands of variables, where manual annotation is not feasible. When attempting an exact match (i.e. top match), BioSentVec 21 appears to perform best in terms of speed, precision and accuracy. However, if it is more important that the top N predictions are close to the truth, then BlueBERT-EFO consistently out-performed all other models.

It is important to note that several of the models had similar performance at finding a top match, with the group comprising BioSentVec, BlueBERT-EFO, Zooma and ScispaCy 46 showing little statistical evidence of a difference. The standard Zooma tool also brings the benefit of continually updated, manually curated mappings 14.

Embedding methods appear to perform well when the query string describes a single event or entity, e.g. "whooping cough / pertussis". They perform poorly when the query string describes multiple entities, e.g. "hiv disease resulting in malignant neoplasms". This is perhaps not surprising, as the embedding of this phrase is unlikely to be close to either HIV or cancer terms. Addressing such traits therefore remains a complex challenge, i.e. properly identifying the mentioned concepts via named entity recognition (NER) and then incorporating pretrained concept embeddings from the knowledge base into the document embeddings 52,53.
In other words, a complex processing system comprising NER, document-level embeddings and concept embeddings is required to approach the mapping of complex traits in a generalised and robust manner; we are keen to explore this aspect in future research.

We compared our models to a manually mapped set of trait names, but it is important to recognise that this mapping may itself contain errors. Supplementary file S7 lists examples where no model predicted an EFO term with an EFO-Batet score >0.95. Here, for example, the query term "malignant neoplasm of colon" was manually mapped to "colon carcinoma". However, six of the models predicted the EFO term "malignant colon neoplasm", which has an EFO-Batet score of 0.86 and is therefore a better fit. (It is possible these differences reflect changes in the EFO since the initial mapping rather than a mapping error.)

Comparison of approaches for trait-to-trait mapping

Mapping traits directly between two datasets has potential value, but in the absence of a benchmark it is hard to validate. We therefore focused on variables that had been mapped to a single EFO term, and then refined that set further for closer inspection. The use of clustering methods enabled us to manually inspect groups of traits and describe groupings that agree with standard biomedical knowledge. Our analyses show that by including topological information from a well-established ontology like the EFO, the BlueBERT-EFO model can create sensible pairwise distances between variables, without actually mapping to the ontology.
When focusing on a specific set of traits, we see that whilst the finetuning of BlueBERT-EFO has produced a model which reflects major patterns in the EFO hierarchy, there are some differences. One example is the loss of the "angina" / "worrier / anxious feeling" cluster (present in EFO, Supplementary Figure S11), with "angina" joining the larger disease cluster next to "atrial fibrillation and flutter" and "worrier / anxious feeling" moving next to "neuroticism score" (Supplementary Figure S13). The manual EFO term assigned to "angina" was EFO_0003913 (angina pectoris, http://www.ebi.ac.uk/efo/EFO_0003913), which is found within the "material phenotype" EFO group as it is listed as a phenotype abnormality and not a disease. Even though the BlueBERT-EFO model has been finetuned on the EFO hierarchy, the biomedical literature underpinning the model has created distances placing "angina" with other diseases rather than measurements. This highlights the subtle balance of information contained within this model.

Interestingly, the BlueBERT-EFO model fails to group together the neurological illnesses ("parkinson's disease", "alzheimer's disease" and "secondary parkinsonism"). Several of the other models also fail to do this, often grouping traits containing the word "disease" together (Supplementary Figures S14-20). However, BioSentVec, BlueBERT and BioBERT appear to group these appropriately. This highlights one of the key challenges that the developers of these models face: how to distinguish informative words and ignore the generic (e.g. "disease"). This point is again present in the BioBERT cluster map (Supplementary Figure S19), with "weight" an outlier to all other traits, suggesting this term was not sufficiently similar to the anthropometric traits.
It is worth noting that the alternatives to language models for this type of distance analysis appear to perform less well (e.g. Levenshtein edit distance, Supplementary Figure S14), and other established methods, such as Zooma, simply cannot be used to compare data in this way.

At the moment there is no practical alternative automated approach to trait-to-trait mapping, so our results using language models are promising. However, they are far from perfect, with many cases of traits not grouping together as we might expect, and the models often focusing on generic words such as "disease" over other more defining terms. This approach therefore requires further development before it can be of practical use.

Use cases of these models

The models are imperfect but still successfully map 40% of trait names in the dataset we used. One obvious use case would be a semi-automated mapping tool which provides a suggestion for the user to approve or edit. As highlighted above, many simple trait names map well, and it is the more complex traits (e.g. combinations of entities) that would need manual intervention.

Another scenario in which an imperfect one-to-many mapping tool like those presented here may be useful is a "trait name recommender". One example is our OpenGWAS 54 recommender, which provides recommended trait matches from amongst thousands of GWAS datasets, enabling a user to see other relevant GWAS traits they may be interested in. The OpenGWAS recommender uses a combination of ScispaCy and BlueBERT-EFO to search for the top matching GWAS traits in the semantic embedding vector space and optionally predict the ontology relationships between the query term and the match candidates 55.
In a follow-up study 56 we applied ScispaCy and BlueBERT-EFO as an ontology mapper in a hybrid architecture, where a first-stage model is used to efficiently filter EFO ontology candidates associated with the query UMLS terms, and in the second stage BlueBERT-EFO is then used to predict the ranking of the top N results (similar to the results in supplementary figures S5-S8, where BioSentVec was applied as the first-stage model).
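The retrieve-and-rerank pattern used by this hybrid architecture can be sketched as follows. Everything here is illustrative: toy 2-d vectors stand in for BioSentVec/ScispaCy embeddings, and a stand-in scoring function replaces BlueBERT-EFO's learned ontology-distance model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_then_rerank(query_vec, candidates, rerank_score, top_x=2):
    """Stage 1: a cheap embedding filter keeps the top X candidates by
    cosine similarity to the query. Stage 2: a slower, more accurate
    scorer reranks only that shortlist."""
    shortlist = sorted(candidates,
                       key=lambda c: cosine(query_vec, c["vec"]),
                       reverse=True)[:top_x]
    return sorted(shortlist, key=rerank_score, reverse=True)

# Hypothetical EFO candidates with toy 2-d embeddings
efo_terms = [
    {"label": "angina pectoris", "vec": [0.9, 0.1]},
    {"label": "asthma",          "vec": [0.1, 0.9]},
    {"label": "heart disease",   "vec": [0.8, 0.3]},
]

# Stand-in for a BlueBERT-EFO-style scorer (higher = closer)
mock_rerank = lambda c: -len(c["label"])

ranked = retrieve_then_rerank([1.0, 0.0], efo_terms, mock_rerank)
print([c["label"] for c in ranked])
```

The design choice is the usual precision/cost trade-off: the expensive second-stage model is only ever run on X candidates per query rather than the full ontology.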