Abstract
A large number of algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that
the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise
similarity measures between individuals. Similarity matrices have been
constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different
frequencies. Additionally, methods are now being developed that take
linkage into account. We review several such matrices and evaluate their
information content. A two-stage approach for population identification is to first construct a similarity matrix and then perform clustering.
We review a range of common clustering algorithms and evaluate their
performance through a simulation study. The clustering step can be
performed either on the matrix or by first using a dimension-reduction
technique; we find that the latter approach substantially improves the
performance of most algorithms. Based on these results, we describe
the population structure signal contained in each similarity matrix and
find that accounting for linkage leads to significant improvements for
sequence data. We also perform a comparison on real data, where we
find that population genetics models outperform generic clustering approaches,
particularly in their robustness to features such as
relatedness between individuals.
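
The two-stage approach described above (build a pairwise similarity matrix, reduce its dimension, then cluster) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the review: the toy two-population simulation, the frequency-standardized similarity (one common weighting that treats markers as independent), the power-iteration step standing in for principal components analysis, and the sign split standing in for a clustering algorithm. All function names are hypothetical.

```python
import random


def simulate_genotypes(n_per_pop, n_markers, rng, drift=0.3):
    """Toy data (assumption): two populations whose allele frequencies differ."""
    freqs = []
    for _ in range(n_markers):
        p = rng.uniform(0.2, 0.8)
        d = rng.uniform(-drift, drift)
        # Clamp so that every marker stays polymorphic in both populations.
        freqs.append((min(max(p + d, 0.05), 0.95), min(max(p - d, 0.05), 0.95)))
    genotypes, labels = [], []
    for pop in (0, 1):
        for _ in range(n_per_pop):
            # Diploid genotype: number of copies (0, 1 or 2) of one allele.
            row = [sum(rng.random() < f[pop] for _ in range(2)) for f in freqs]
            genotypes.append(row)
            labels.append(pop)
    return genotypes, labels


def similarity_matrix(genotypes):
    """Stage 1: pairwise similarity from frequency-standardized genotypes."""
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for j in range(m):
        p = sum(row[j] for row in genotypes) / (2 * n)
        if p <= 0 or p >= 1:
            continue  # skip markers that are monomorphic in the sample
        sd = (2 * p * (1 - p)) ** 0.5
        cols.append([(row[j] - 2 * p) / sd for row in genotypes])
    used = len(cols)
    return [[sum(c[a] * c[b] for c in cols) / used for b in range(n)]
            for a in range(n)]


def leading_eigenvector(matrix, rng, n_iter=200):
    """Dimension reduction: top eigenvector by power iteration."""
    n = len(matrix)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(matrix[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def two_stage_accuracy(seed=0, n_per_pop=20, n_markers=300):
    """Stage 2: cluster by the sign of each individual's coordinate."""
    rng = random.Random(seed)
    genotypes, labels = simulate_genotypes(n_per_pop, n_markers, rng)
    sim = similarity_matrix(genotypes)
    axis = leading_eigenvector(sim, rng)
    assigned = [1 if x > 0 else 0 for x in axis]
    agree = sum(a == b for a, b in zip(assigned, labels)) / len(labels)
    # Cluster labels are arbitrary, so score the better of the two matchings.
    return max(agree, 1 - agree)


if __name__ == "__main__":
    print("fraction correctly assigned:", two_stage_accuracy())
```

The sign split only works for two groups. With more populations, one would keep several leading eigenvectors and apply a general-purpose clustering algorithm to those coordinates, which corresponds to the dimension-reduction route mentioned in the abstract.
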
Original language | English |
---|---|
Pages (from-to) | 337-361 |
Number of pages | 27 |
Journal | Annual Review of Genomics and Human Genetics |
DOIs | |
Publication status | Published - 2012 |