## Abstract

A large number of algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different frequencies. Additionally, methods are now being developed that take linkage into account. We review several such matrices and evaluate their information content. A two-stage approach for population identification is to first construct a similarity matrix and then perform clustering. We review a range of common clustering algorithms and evaluate their performance through a simulation study. The clustering step can be performed either on the matrix or by first using a dimension-reduction technique; we find that the latter approach substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix and find that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly with regard to robustness for features such as relatedness between individuals.
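
As a rough illustration of the two-stage approach described above, the following sketch builds a frequency-weighted similarity matrix from unlinked markers, reduces it to one dimension, and then clusters. It is a minimal toy example and not the method of any particular study: the simulated data, the choice of weighting (dividing by the square root of 2p(1-p), one common option among several), the use of power iteration for the leading eigenvector, and the one-dimensional 2-means step are all assumptions made here, and every function name is hypothetical.

```python
import random


def simulate_genotypes(n_per_pop=20, n_markers=200, seed=0):
    """Toy data (assumption): two populations with shifted allele frequencies."""
    rng = random.Random(seed)
    freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
    # Shift each frequency up or down for the second population, then clip.
    freqs_b = [min(0.95, max(0.05, f + rng.choice([-0.3, 0.3]))) for f in freqs_a]
    genotypes, labels = [], []
    for label, freqs in ((0, freqs_a), (1, freqs_b)):
        for _ in range(n_per_pop):
            # Diploid genotype: count of alleles (0, 1 or 2) at each marker.
            genotypes.append([(rng.random() < f) + (rng.random() < f) for f in freqs])
            labels.append(label)
    return genotypes, labels


def similarity_matrix(genotypes):
    """Stage 1: pairwise similarity, treating markers as independent."""
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for j in range(m):
        p = sum(g[j] for g in genotypes) / (2 * n)  # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic markers carry no information
        scale = (2 * p * (1 - p)) ** 0.5  # weighting by allele frequency
        cols.append([(g[j] - 2 * p) / scale for g in genotypes])
    used = len(cols)
    return [[sum(c[a] * c[b] for c in cols) / used for b in range(n)]
            for a in range(n)]


def leading_eigenvector(matrix, n_iter=200, seed=1):
    """Dimension reduction: top eigenvector by power iteration."""
    rng = random.Random(seed)
    n = len(matrix)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(row[k] * v[k] for k in range(n)) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def two_means_1d(values, n_iter=20):
    """Stage 2: 2-means clustering on a single coordinate."""
    lo, hi = min(values), max(values)
    assign = []
    for _ in range(n_iter):
        assign = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, a in zip(values, assign) if a == 0]
        group1 = [v for v, a in zip(values, assign) if a == 1]
        lo, hi = sum(group0) / len(group0), sum(group1) / len(group1)
    return assign


genotypes, labels = simulate_genotypes()
sim = similarity_matrix(genotypes)
pc1 = leading_eigenvector(sim)
assign = two_means_1d(pc1)
# Cluster labels are arbitrary, so agreement is only meaningful up to a swap.
agreement = sum(a == l for a, l in zip(assign, labels))
print(f"{agreement} of {len(labels)} individuals share the simulated label")
```

Clustering on the leading coordinate rather than on the full matrix mirrors the dimension-reduction variant mentioned in the abstract; a real analysis would retain more components and use established software rather than this hand-rolled version.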

| Original language | English |
|---|---|
| Pages (from-to) | 337-361 |
| Number of pages | 27 |
| Journal | Annual Review of Genomics and Human Genetics |
| Publication status | Published - 2012 |