This Thesis brings together complementary research from higher-order computational logic and workflow systems to investigate software and theoretical frameworks for profiling and matching heterogeneous data. One motivating use case is submission sifting, which matches submitted conference or journal papers to potential peer reviewers based on the similarity between the paper's abstract and the reviewer's publications as found in online bibliographic databases. Inspired by e-Science workflows, we introduce the SubSift submission sifting framework for developing web-based research intelligence applications that profile and match heterogeneous textual content from web pages and documents. Abstracting SubSift we define a formal higher-order dataflow framework that ranges over a class of higher-order relations that are sufficiently expressive to represent a wide variety data types and structures. This dataflow model is shown to be embarrassingly parallel. Our serial proof of concept implementation, JSONMatch, is used to demonstrate that the combination of this model and higher-order representation provides a flexible approach to analysing heterogeneous data. Finally we propose a theoretical framework for querying structured data, elevating Codd's relational algebra to a higher-order algebra defined on the basic terms of a higher-order logic. An extension incorporates approximate joins on structured data and is demonstrated to be feasible and have promise for future work.
|Date of Award||2014|