CLARITY: comparing heterogeneous data using dissimilarity

Daniel J. Lawson*, Vinesh Solanki, Igor Yanovich, Johannes Dellert, Damian Ruck, Phillip Endicott

*Corresponding author for this work

Research output: Contribution to journal › Article (Academic Journal) › peer-review

Abstract

Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation versus expression, evolution of language sounds versus word use, and country-level economic metrics versus cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a ‘structural’ component analogous to a clustering, and an underlying ‘relationship’ between those structures. This allows a ‘structural comparison’ between two similarity matrices using their predictability from ‘structure’. Significance is assessed with the help of re-sampling appropriate for each dataset. The software, CLARITY, is available as an R package from github.com/danjlawson/CLARITY.
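
The decomposition described in the abstract can be illustrated with a toy example. The sketch below is not the CLARITY R package and makes strong simplifying assumptions: the 'structure' is a hard cluster assignment that is taken as given (here, `labels`, assumed to have been learned from the first dataset), the 'relationship' is the matrix of block-mean dissimilarities between clusters, and self-dissimilarities are skipped. All function and variable names are hypothetical. Because the relationship is refitted for the second dataset while the structure is held fixed, a pure rescaling of the second dataset does not by itself produce residuals; the resampling-based significance test is omitted.

```python
def fit_relationship(Y, labels, k):
    """Block-mean 'relationship' X between k clusters (off-diagonal entries only).

    For a fixed hard clustering this minimises the squared error of
    predicting Y[i][j] by X[labels[i]][labels[j]] over all i != j.
    Assumes every cluster has at least two members.
    """
    sums = [[0.0] * k for _ in range(k)]
    counts = [[0] * k for _ in range(k)]
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            if i != j:  # skip self-dissimilarities
                sums[labels[i]][labels[j]] += y
                counts[labels[i]][labels[j]] += 1
    return [[sums[a][b] / counts[a][b] for b in range(k)] for a in range(k)]


def entity_residuals(Y, labels, X):
    """Root-mean-square prediction error for each entity (row of Y)."""
    n = len(Y)
    out = []
    for i in range(n):
        sq = sum((Y[i][j] - X[labels[i]][labels[j]]) ** 2
                 for j in range(n) if j != i)
        out.append((sq / (n - 1)) ** 0.5)
    return out


def toy_dissimilarity(groups, within=1.0, between=5.0, scale=1.0):
    """Synthetic dissimilarity matrix: small within groups, large between."""
    n = len(groups)
    return [[0.0 if i == j
             else scale * (within if groups[i] == groups[j] else between)
             for j in range(n)] for i in range(n)]


# Structure assumed to have been learned from the first dataset.
labels = [0, 0, 0, 1, 1, 1]

# Dataset 1 follows that structure exactly.
Y1 = toy_dissimilarity([0, 0, 0, 1, 1, 1])
# Dataset 2 is on a different scale, and entity 5 sits with the other group.
Y2 = toy_dissimilarity([0, 0, 0, 1, 1, 0], scale=10.0)

res1 = entity_residuals(Y1, labels, fit_relationship(Y1, labels, 2))
res2 = entity_residuals(Y2, labels, fit_relationship(Y2, labels, 2))
worst = max(range(len(res2)), key=lambda i: res2[i])

print("residuals, dataset 1:", res1)
print("residuals, dataset 2:", [round(r, 2) for r in res2])
print("least consistent entity in dataset 2:", worst)
```

In this toy setting the first dataset is predicted without error, while in the second dataset the entity that changed groups has the largest residual, which is the kind of localized inconsistency the abstract refers to.
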
Original language: English
Article number: 202182
Number of pages: 25
Journal: Royal Society Open Science
Volume: 8
Issue number: 12
Early online date: 8 Dec 2021
Publication status: E-pub ahead of print - 8 Dec 2021

Bibliographical note

R package available from https://github.com/danjlawson/CLARITY. 23 pages, 6 figures.

Keywords

  • stat.ME
  • stat.ML
