Abstract

Word embeddings have become an integral part of machine learning solutions for natural language processing (NLP) challenges. They are learned by taking the statistical co-occurrence information from a corpus and representing it in a dense vector (often hundreds of dimensions). As a result, many NLP tasks employ word embedding solutions to achieve state-of-the-art performance. However, the embedding process makes these vectors difficult for humans to understand and interpret.
In this thesis I aim to explore how to understand and explain what word embeddings represent and why they improve the performance of many tasks. I achieve this by utilising concepts: human-understandable lists of words that aim to define an abstract object or class.
Using concepts, I define a method to show that they remain present in word embeddings; I then use this method to measure a word embedding's understanding of these same concepts. These measurements provide a basis for choosing a ``better'' word embedding algorithm, or for finding corpora that, when embedded, better represent a domain (such as medicine).
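One common way to test whether a concept survives the embedding process is to represent the concept as the centroid of its word vectors and check how strongly member words align with it. The sketch below is purely illustrative (the toy vectors, word list, and function names are assumptions, not the thesis's actual method):

```python
import numpy as np

# Illustrative toy embedding; real vectors are learned from corpora
# (e.g. word2vec or GloVe) and typically have hundreds of dimensions.
embedding = {
    "doctor":   np.array([0.9, 0.1, 0.0]),
    "nurse":    np.array([0.8, 0.3, 0.1]),
    "hospital": np.array([0.7, 0.2, 0.2]),
    "banana":   np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def concept_vector(words):
    """Represent a concept as the centroid of its members' vectors."""
    return np.mean([embedding[w] for w in words], axis=0)

def concept_score(word, concept_words):
    """How strongly a word aligns with a concept (cosine similarity)."""
    return cosine(embedding[word], concept_vector(concept_words))

medicine = ["doctor", "nurse", "hospital"]
```

Under this sketch, a word belonging to the concept (e.g. "doctor") should score higher against the medicine centroid than an unrelated word (e.g. "banana").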
I then define and use a method to measure a word's association with a concept, and by extension its association with a bias. I show that unwanted biases (such as gender and racial biases) exist in word embeddings. I further show that gendered biases are representative of real-world statistics. I then show that while removing these biases may seem like an ideal solution, it decreases a word embedding's representation of the same real-world statistics. I also show colour biases and compare them to real-world psychology studies, showing that pink has a feminine bias and that pink and blue are positively biased.
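A standard way to quantify this kind of association (in the style of the Word Embedding Association Test, WEAT) is to take a word's mean similarity to one attribute set minus its mean similarity to another. The vectors and words below are toy assumptions for illustration only, not data from the thesis:

```python
import numpy as np

# Illustrative toy vectors; real embeddings are learned from corpora.
emb = {
    "he":       np.array([ 1.0, 0.0, 0.1]),
    "she":      np.array([-1.0, 0.0, 0.1]),
    "engineer": np.array([ 0.6, 0.5, 0.2]),
    "nurse":    np.array([-0.6, 0.5, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word, set_a, set_b):
    """Signed association of a word with attribute set A vs set B:
    mean similarity to A minus mean similarity to B.
    Positive values lean towards A, negative towards B."""
    sim_a = np.mean([cosine(emb[word], emb[a]) for a in set_a])
    sim_b = np.mean([cosine(emb[word], emb[b]) for b in set_b])
    return float(sim_a - sim_b)
```

In this toy setup, "engineer" sits closer to the masculine attribute set and "nurse" closer to the feminine one, giving association scores of opposite sign.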
I apply the methods I have defined for measuring biases to historical corpora (1800 - 1959) and examine whether occupational words change in gendered or emotional bias over time. I then look at the semantic changes of words over that period. Finally, I test whether there is any correlation between bias changes and changes in the semantic meaning of a word over time.
Date of Award: 11 May 2021
Sponsors: Engineering and Physical Sciences Research Council
Supervisors: Nello Cristianini (Supervisor) & Trevor P Martin (Supervisor)
- Artificial Intelligence
- Machine Learning
- Natural Language Processing