HIPred: an integrative approach to predicting haploinsufficient genes

Abstract Motivation A major cause of autosomal dominant disease is haploinsufficiency, whereby a single copy of a gene is not sufficient to maintain the normal function of the gene. A large proportion of existing methods for predicting haploinsufficiency incorporate biological networks, e.g. protein-protein interaction networks that have recently been shown to introduce study bias. As a result, these methods tend to perform best on well-studied genes, but underperform on less studied genes. The advent of large genome sequencing consortia, such as the 1000 genomes project, NHLBI Exome Sequencing Project and the Exome Aggregation Consortium creates an urgent need for unbiased haploinsufficiency prediction methods. Results Here, we describe a machine learning approach, called HIPred, that integrates genomic and evolutionary information from ENSEMBL, with functional annotations from the Encyclopaedia of DNA Elements consortium and the NIH Roadmap Epigenomics Project to predict haploinsufficiency, without the study bias described earlier. We benchmark HIPred using several datasets and show that our unbiased method performs as well as, and in most cases, outperforms existing biased algorithms. Availability and Implementation HIPred scores for all gene identifiers are available at: https://github.com/HAShihab/HIPred. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Technological advances and the falling costs of next-generation sequencing technologies have accelerated the identification of genetic variation in the human genome (The 1000 Genomes Project Consortium, 2012). The most common form of genetic variation is single nucleotide variants (SNVs) and small insertions/deletions (INDELs). Identifying which of these are functional promises to improve our understanding of the molecular mechanisms of human disease and lead to novel treatments. As a result, there is a plethora of in silico algorithms capable of predicting the functional impact of SNVs and INDELs, e.g. (Choi et al., 2012;Kircher et al., 2014;Ritchie et al., 2014;Shihab et al., 2013Shihab et al., , 2015. On the other hand, loss-of-function (LoF) variants, i.e. truncating mutations and whole gene deletions, have traditionally been considered to be rare and highly deleterious. However, there is growing evidence to suggest that LoF variants are common amongst healthy individuals (MacArthur et al., 2012;Ng et al., 2008;Pelak et al., 2010). Haploinsufficiency, whereby a single copy of a gene product is not sufficient to maintain the normal function of the gene, is just one possible biological mechanism implicating LoF variants to abnormal phenotypes (Veitia and Birchler, 2010). Prediction of haploinsufficiency is an important aspect to interpreting whole genome sequence data, in which each individual will have a number of nonsense and missense mutations. Therefore, accurate methods for identifying haploinsufficiency within the genome are of increasing importance.
A large proportion of the existing algorithms for predicting haploinsufficiency utilize biological networks, such as protein-protein interaction networks. However, it has been shown that commonly used biological networks are heavily affected by study bias (Steinberg et al., 2015); i.e. well studied genes are over-represented with respect to the number of networks they are part of and the number of links they form within these networks. As a result, these methods tend to perform best on well-studied genes but underperform on less studied genes. Steinberg et al. (2015) constructed an unbiased genome-wide haploinsufficiency score (GHIS) by replacing these biological networks with co-expression networks. However, other potentially informative sources for functional annotation include the Encyclopaedia of DNA Elements (ENCODE) consortium (The ENCODE Project Consortium, 2012) and the NIH Roadmap Epigenomics Project (Roadmap Epigenomics Consortium et al., 2015). Following our previous work (Shihab et al., 2015), we describe a machine learning approach (called HIPred) that integrates genomic and evolutionary features with functional annotations from ENCODE and NIH Roadmap Epigenomics to predict haploinsufficiency. We observe improved performance when compared with five existing methods, but without the potential study bias described earlier. Pre-computed HIPred scores for all gene identifiers (GRCh37) are available at: https://github.com/HAShihab/HIPred.

Datasets
We assembled two datasets for training: our positive dataset was constructed using 299 known haploinsufficient genes from (Dang et al., 2008) and our negative dataset was constructed using 386 putative loss-of-function tolerant (LoFT) genes from (MacArthur et al., 2012). After removing records with conflicting annotations and ambiguous mappings, we retained 298 haploinsufficient and 386 LoFT genes.
Following a similar procedure described in Steinberg et al. (2015), we used the following benchmarks from (Petrovski et al., 2013) to evaluate the performance of HIPred: 175 genes listed as haploinsufficient in OMIM (OMIM HI), 108 genes listed as haploinsufficient with known de novo mutations in OMIM (OMIM HI de novo), 91 genes for which a heterozygous gene knockout causes 'lethality' phenotypes in mouse (MGI Lethality) and 95 genes for which a heterozygous gene knockout causes seizures in mouse (MGI Seizures). Next, we collected a list of 59 genes disrupted by de novo LoF mutations in autism probands (ASD1) (Iossifov et al., 2012) and a further 64 genes disrupted by de novo LoF mutations in other sets of autism probands (ASD2). (Neale et al., 2012;O'Roak et al., 2012;Sanders et al., 2012). The composition of haploinsufficient genes across these benchmarks, and their overlap with our training data, is summarized in Supplementary Table S1. The actual genes used in these benchmarks are given as Supplementary Material.
Finally, for each gene, we also obtained the number of associated publications in PubMed using the NCBI Entrez Search and Retrieval System and used this as a measure of how 'well-studied' these genes are.

Feature groups
Following our previous work (Shihab et al., 2015), we annotated our datasets using a number of feature groups, which could be predictive of haploinsufficiency. A detailed description of these feature groups can be found in Supplementary Table S2, but a short description is as follows: • Genomic and evolutionary: we used a number of genomic properties such as the length of the gene, number of transcripts and the average number of predicted protein domains across transcripts. A comprehensive set of conservation-based measures, such as dN/dS ratios between human and 65 different species (one-to-one orthologues), was also used. In addition, we also tested whether the number of observed rare variants (MAF < 0.01) from the Exome Aggregation Consortium (ExAC) (Lek et al., 2016), the number of expected rare variants across the gene, and a z-score representing the deviation of observed from expected added any predictive value. • Histone modifications: we used regions of enrichment based on histone ChIP-seq peak calls from ENCODE and NIH Roadmap Epigenomics. • Open chromatin: we used regions of enrichment based on DNase-seq and Formaldehyde-Assisted Isolation of Regulatory Elements peak calls from ENCODE and NIH Roadmap Epigenomics. • Transcription factor-binding sites: based on PeakSeq and SPP peak calls for 119 transcription factors across 77 cell lines from ENCODE. • Gene expression: based on RNA-seq signal coverage using consolidated epigenomes from NIH Roadmap Epigenomics. • Methylation: based on whole genome bisulphite sequencing from NIH Roadmap Epigenomics. • Digital genomic footprinting sites: for transcription factor recognition sequences within DNase-hypersensitive sites using consolidated epigenomes from the NIH Roadmap Epigenomics Project. • Networks: we used measures of centrality from cell-type specific interactome and tissue-specific co-expression networks.
As described in the Supplementary Material, the majority of our feature groups comprise multiple annotations across a gene. For example, there could be multiple ChIP-seq values across a given region (one value for each position). In these instances, we used the median value across the region. We also tested other summary measures, specifically the mean and maximum value across a region; however, these summaries did not yield any significant improvements in the overall performance of our method (data not shown).

Data integration
The resulting product of our data preparation is several large matrices comprising data from the above feature groups, each of which can have different measurement scales. Therefore, we investigated three approaches for data integration (see Fig. 1). First, we evaluated data integration at the data level (i.e. concatenating datasets into a single matrix). This form of data integration is the simplest and most intuitive; however, combining feature groups in this way creates additional analytical challenges. For example, classifiers will need to handle a larger number of heterogeneous features. Therefore, we used a gradient boosted machine (Chen and Guestrin, 2016) as they can handle heterogeneous datasets, are robust to missing data and can estimate the relative importance of features. To illustrate the potential benefits of using a gradient boosted machine on this type of data, as opposed to alternative machine learning algorithms, we also evaluated the performance of a support vector machine (SVM) (Campbell and Ying, 2011) on the same task.
Next, we evaluated data integration based on multiple kernel learning (MKL). In MKL, feature groups are encoded into a corresponding base kernel K ' (where ' ¼ 1; . . . ; p if there are p feature groups), from which we can derive a composite kernel matrix This composite kernel can then be used with a kernel-based classifier such as the SVM, which was the classifier used here. The k ' are kernel weights where P p '¼1 k ' ¼ 1 and k ' ! 0. These weights can be adjusted according to the relative informativeness of the different feature groups. We used an L1-norm to yield sparse solutions that implicitly excludes uninformative feature groups by assigning them zero weight.
Finally, we evaluated data integration based on stacking. Here, each feature group was tested against a number of machine learning algorithms, e.g. naïve Bayes, SVMs and random forests, and the best performing algorithm was chosen as the base classifier for the group C ' (where ' ¼ 1; . . . ; p if there are p feature groups). These base classifiers were then 'stacked' (i.e. combined) using a logistic regression: where the b i of each base classifier was deduced through the regression process. As with MKL, we used an L1-norm to implicitly exclude uninformative feature groups by assigning them zero coefficient.
We present our results using several performance statistics, such as the overall accuracy, sensitivity and specificity. In addition, we provide receiver operating characteristic (ROC) curves and area under the curve (AUC) statistics. Individual algorithm parameters, e.g. the SVM cost parameter C, were optimized through a 10-fold cross-validation and grid search.
To remove the potential bias caused by the random partitioning of the datasets during cross validation, we repeated our analysis 30 times and report the mean values and SDs above 0.01. In order to alleviate any performance artifacts arising from potential gene similarity within our training dataset, we performed a gene similarity analysis using NCBI's BLASTCLUST algorithm using the following parameters: -p F, -L 0.6, -b F and -S 10.
Finally, we performed a feature selection analysis to identify important features and improve model interpretation. All analyses in this study were performed using scikit-learn (Pedregosa et al., 2011), SHOGUN (Sonnenburg et al., 2010) and xgboost (Chen and Guestrin, 2016).

Sequential learning
Both MKL and stacking rely on the assumption that data are available in every feature group for every training example. However, in practice, data can be absent from some feature groups, e.g. not every position in the genome will have annotations for Open Chromatin. We could use all available feature groups, but this would result in fewer training examples. With all values present for all feature groups, our training dataset consisted 156 haploinsufficient genes and just 52 LoFT genes. On the other hand, we could use a smaller number of feature groups, but with many more training examples. Therefore, we opted for an iterative sequential learning approach to determine the best combination of base kernels (MKL) and classifiers (stacking) (Rogers et al., 2015). Here, we rank the base kernels/ classifiers based on their individual cross-validation performance. Then, starting with the best performing base kernel/classifier, we iteratively add feature groups and retest the combined model using the same cross-validation procedure. If performance improves, then the feature group is added to the final model and the process is repeated until no more feature groups can be added.

Comparison with existing methods
For each of the benchmarks described in Datasets, we compared HIPred with 5 alternative methods that could be used to predict haploinsufficiency: predicted haploinsufficiency probabilities, HIS and HIS Imputed (Huang et al., 2010); predicted gene indispensability scores (IS) (Khurana et al., 2013); Residual Variance Intolerance Scores (RVIS) (Petrovski et al., 2013); Evolutionary Intolerance (EvoTol) (Rackham et al., 2014); and predicted genome-wide haploinsufficiency probabilities (GHIS) (Steinberg et al., 2015). In addition, we evaluated the correlation between each of the methods evaluated based on the absolute Spearmando rank correlation coefficient.

Performance of the method
On our training dataset, the performance of existing methods ranged from 0.6929 to 0.8549, with the HIS (Imputed) probabilities achieving the highest AUC (see Table 1). However, due to potential overlaps between this dataset and the datasets used to train these algorithms, the performances reported here may be over-inflated and may not represent the true generalizability of existing methods (see section 3.2 below).
Next, we evaluated the performance of a gradient boosted machine, i.e. data integration at the data level. In terms of AUC, the performance of our gradient boosted machine outperformed all existing methods with an average AUC of 0.8940. Comparing the performance of a gradient boosted machine and SVMs, we achieved a nominal AUC of 0.8133 using SVMs; thereby highlighting the In our experiments, the highest performing MKL model comprised seven feature groups and achieved an average AUC of 0.8747. Here, Genomic and Evolutionary was the highest performing individual feature group with an average AUC of 0.8179, followed by Open Chromatin and Histone Modifications from the NIH Roadmap Epigenomics Project (i.e. gappedPeak and narrowPeak) with an average AUC of 0.8103 and 0.8035, respectively. Histone Modifications from ENCODE yielded an average AUC of 0.7518. We observed a performance boost of 4.61% during the first stage of our sequential learning approach. However, we observed minor improvements at each subsequent iteration. Interestingly, MKL assigned the largest weight to Histone Modifications from ENCODE (0.6056), whose individual performance was ranked 4th overall, followed by Genomic and Evolutionary (0.2233) and Open Chromatin/Histone Modifications (narrowPeak) from NIH Roadmap Epigenomes Project (0.1701). A lower weight was assigned to gappedPeak and broadPeak (0.0008 and 0.0001, respectively), probably because of the similarities between these feature groups and the narrowPeak feature group (see Supplementary Tables S3 and S4).
The best performing stacked model comprised four feature groups and had an average AUC of 0.8866. As with MKL, Genomic and Evolutionary was the best performing feature group with an average AUC of 0.8196, followed by Open Chromatin and Histone Modifications (narrowPeak) with an AUC of 0.8794. However, in contrast to MKL, we observed a small performance boost of 0.38% after the initial sequential learning iteration. In our experiments, the most informative feature group was Genomic and Evolutionary with a regression coefficient of 4.0810, followed by Gene Expression and Open Chromatin (regression coefficients 2.1805 and 2.1391, respectively). Despite the best performing model comprising four feature groups during cross-validation, the final logistic regression assigned zero coefficients to one of these feature groups (see Supplementary Tables S5 and S6).
From our analysis, it would appear that all 3 data integration classifiers evaluated outperform existing methods, with a classifier based on integration at the data level performing best. However, the difference in performance between these three data integration techniques evaluated is marginal. Nonetheless, for maximum performance and improved model interpretation, the final version of HIPred is based on a gradient boosted machine. All subsequent analyses presented are based on this version.
Next, we tested for potential gene similarity (at the nucleotide level) within our training data using the NCBI BLASTCLUST algorithm. Using a minimum sequence identity and sequence coverage of 60%, we did not find any gene clusters.
Finally, we performed a feature selection analysis to identify the most informative features (see Fig. 2). The most important feature identified from this analysis was the ExAC (E) missense z-scores for the deviation of observed missense variants from expectation (gain ¼ 0.34), followed by several cell-type specific interactomes (I), such as the Mesenchymal Precursor (gain ¼ 0.09), and genomic/evolutionary features (G), such as the dN/dS ratios and percent identity with other Ensembl genomes. We also assessed the performance of our final model using a progressive number of features and found that a maximum tree depth of 2 yields optimal performance (see Supplementary Material).

Performance using known and candidate disease genes
We evaluated HIPred on a set of known human disease-associated genes and mouse model equivalents. After removing genes that were present in our training data (see Supplementary Table S1), we were left with 81 OMIM HI and 49 OMIM HI de novo genes. Following the procedure described in Steinberg et al. (2015), we matched these genes to an equal number of random genes based on gene length, which were assumed to be neutral. In general, HIPred outperformed the RVIS, EvoTol and GHIS across both OMIM datasets (see Table 2). Furthermore, HIPred marginally outperformed the HIS (both imputed and non-imputed) and Indispensability probabilities. However, these OMIM datasets comprise some of the most studied genes which could artificially inflate the observed performances of biased predictors. ROC curves are made available as Supplementary Figure S1.
Next, we tested these methods using a set of genes for which a heterozygous gene knockout causes 'lethality' phenotypes and seizures in mouse. After removing genes that were also present in our training data, we were left with 75 MGI Lethality and 90 MGI Seizure genes, which were matched using the same procedure as above. From our analysis, it would appear that HIPred outperforms all other methods across these datasets. Although the performance The reported performance of HIPred is the average performance observed across our repeated cross-validation procedure. of HIPred appears to drop on the MGI Seizures datasets, the drop in performance is not as drastic as that observed with other methods. Using the number of associated publications in PubMed as a proxy of how well genes are studied, we tested whether the MGI dataset was enriched for less-studied genes compared with the OMIM datasets. Although the median number of publications was lower in the MGI datasets, we did not reach statistical significance using a Mann-Whitney U-test (P ¼ 0.13 for MGI Seizures versus OMIM HI and P ¼ 0.34 for MGI Lethality versus OMIM HI). Therefore, we also tested these methods on a set of candidate disease genes linked to autism (ASD1 and ASD2). These datasets were statistically enriched for less studied genes than the OMIM datasets (P ¼ 0.02 for ASD1 versus OMIM HI and P ¼ 0.01 ASD2 versus OMIM HI). After removing genes that were also present in our training data, we matched the remaining genes to to a random set of genes based on gene length as above. Our analysis shows that the performance of all methods drops significantly on these datasets, with RVIS performing best. The performance of HIPred is comparable to GHIS across the ASD datasets. However, it should be noted that we cannot be sure which ASD genes are casual (Steinberg et al., 2015). Therefore, the results of this benchmark should be interpreted with some caution.

Rank correlation between methods
Following the above benchmarks, we tested the correlation in gene ranks between the methods (based on absolute Spearman's rank correlation coefficient, see Table 3). Unsurprisingly, the highest correlation was observed between HIS and HIS (Imputed). Disregarding the HIS scores, rank correlations fall in the range 0.03-0.58, with correlations between EvoTol and all other methods being generally low. It appears that HIPred has a moderate correlation with all existing methods (coefficients range from 0.4994 to 0.5739, with the exception of EvoTol which yields a correlation coefficient of 0.0478).

Discussion
In this study, we outlined HIPred, an integrative approach that combines genomic and evolutionary features with functional annotations from ENCODE and Roadmap Epigenomics to predict haploinsufficiency. We evaluated 3 approaches for data integration: integration at the data, kernel (MKL) and classifier (stacking) level; and observed improved performances over existing methods using all data integration techniques. In our experiments, we observed that MKL and stacking classifiers outperformed classifiers constructed for one type of data. However, we found that the most intuitive data integration technique, i.e. integration at the data level, outperformed other (more complex) data integration techniques. We observed comparable performances to existing methods using SVMs on the integrated data, but improved performances using a gradient boosted machine. The improved performance may be the result of the implicit feature selection performed in gradient boosted machines. Therefore, it may be possible to improve the performance of our MKL-based classifier using feature selection before data integration. However, our stacking classifier uses random forests (which are also tree-based methods similar to gradient boosted machines) for most feature groups and therefore performs some form of feature selection before data integration, so it is unclear how much benefit MKL would gain from feature selection. One main advantage to integration at the data level is the ability to capture the potential relationships between features across feature groups (which are missed using MKL and stacking based approaches). We benchmarked HIPred using several datasets and have shown that our method performs as well as, and in most cases, outperforms existing algorithms. An important issue to consider when comparing the performance of any prediction algorithm is the benchmark being used. Here, it is preferable to use blind datasets, i.e. data that have not been used to train the algorithm, to minimize any bias in the observed performance. Although we took care to reduce this bias in our results by performing an extensive cross-validation analysis and excluding genes from our benchmarks that were also present in our training data, this level of testing is not possible with alternative methods as it would require training each method using common data. Therefore, the performance of alternative methods may be inflated. Furthermore, it has been shown that most biological networks used in existing methods are effected by study bias (Steinberg et al., 2015), i.e. well-studied genes are over-represented in these networks compared with less studied genes. As a result, existing algorithms may not generalize well to less studied genes. For example, the performance of most existing algorithms drops when predicting on the MGI datasets, which comprise less studied genes. In contrast, HIPred doesn't appear to be affected by this study bias and outperforms existing methods on these datasets.
Other important factors to consider when evaluating predictive methods are potential artifacts in performance arising due to gene similarity. Although we did not observe any gene similarities within our training data at the nucleotide sequence (up to 60% sequence similarity), we did not test for potential gene similarities at the protein sequence level.
The advent of large genome sequencing consortia, such as the 1000 genomes project (The 1000 Genomes Project Consortium, 2012), NHLBI Exome Sequencing Project (ESP) and the ExAC (Lek et al., 2016), creates an urgent need for unbiased haploinsufficiency prediction methods such as HIPred.