Inferring causal networks behind observed data is an active area of research with wide applicability to areas such as epidemiology, microbiology and social science. In particular, recent research has focused on identifying how information propagates through the Internet. This research has so far only used temporal features of observations, and while reasonable results have been achieved, there is often further information that can be used. In this paper we show that additional features of the observed data can be used very effectively to improve an existing method. Our particular example is one of inferring an underlying network for how text is reused on the Internet, although the general approach is applicable to other inference methods and information sources. We develop a method to identify how a piece of text evolves as it moves through the underlying network, and how substring information can be used to narrow down where in the evolutionary process a particular observation at a node lies, and hence to narrow down the number of ways the node could have acquired the infection. Text reuse is detected using a suffix tree, which is also used to identify the substring relations between chunks of reused text. We then use a modification of the NetCover method to infer the underlying network. Experimental results on both synthetic and real-life data show that using more information than just timing leads to greater accuracy in the inferred networks.
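As a rough illustration of how substring information can narrow down the possible sources of an observation, the following minimal sketch compares timing-only candidates with candidates filtered by a substring relation. It is not the paper's implementation: the `Observation` type, the `candidate_sources` helper and the example data are hypothetical, a plain `in` check stands in for the suffix tree, and the rule that a copy must be a substring of (or equal to) its source's text is an assumption made only for this example.

```python
from typing import List, NamedTuple


class Observation(NamedTuple):
    """A hypothetical record: which node showed which text, and when."""
    node: str
    time: float
    text: str


def candidate_sources(target: Observation,
                      observations: List[Observation]) -> List[Observation]:
    """Return earlier observations at other nodes whose text contains the target's text.

    Assumption for this sketch only: a copy is a substring of its source.
    """
    return [
        o for o in observations
        if o.node != target.node
        and o.time < target.time       # timing constraint
        and target.text in o.text      # substring constraint (stand-in for a suffix tree)
    ]


# Made-up example data, not taken from the paper.
observations = [
    Observation("a", 1.0, "the quick brown fox jumps"),
    Observation("b", 2.0, "quick brown fox"),
    Observation("c", 3.0, "the quick brown"),
    Observation("d", 4.0, "brown fox"),
]
target = observations[3]

# Timing alone allows every earlier node as a possible source.
timing_only = [o.node for o in observations if o.time < target.time]

# Adding the substring constraint rules out node "c".
with_text = [o.node for o in candidate_sources(target, observations)]

print(timing_only)
print(with_text)
```

In this toy case the substring check removes one of the three timing-compatible sources. A network inference step would then have fewer explanations to consider for each observation.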
|Translated title of the contribution||Refining causality: who copied from whom?|
|Title of host publication||The 17th ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD)|
|Number of pages||9|
|Publication status||Published - 2011|