Bayesian networks for feature selection and patient pre-screening for depressive symptomatology: a prototype

Eduardo Maekawa*, Eoin M Grua, Carina A Nakamura, Marcia Scazufca, Ricardo Araya, Tim J Peters, Pepijn van de Ven

*Corresponding author for this work

Research output: Contribution to journal › Article (Academic Journal) › peer-review


Background: Identifying individuals with depressive symptomatology (DS) promptly and effectively is of paramount importance for providing timely treatment. Machine learning models have shown promise in this area, yet studies often fall short in demonstrating the practical benefits of utilizing these models and fail to provide tangible real-world applications.

Objective: The objectives of this study were: 1) to establish a novel methodology for identifying individuals likely to exhibit DS; 2) to identify the most influential features in a more explainable way via probabilistic measures; 3) to propose tools that can be used in real-world applications.

Methods: Three datasets were utilized in this study: the PROACTIVE dataset, along with the Brazilian National Health Survey (PNS) datasets from 2013 and 2019, comprising socio-demographic and health-related features. A Bayesian Network was used for feature selection. Selected features were then employed to train machine learning models to predict DS, operationalized as a score of 10 or higher on the 9-item Patient Health Questionnaire (PHQ-9). Furthermore, an analysis was conducted to evaluate the influence of different sensitivities on the reduction in number of screening interviews achieved through the utilization of the model compared with a random approach.
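The labelling rule and the interview-reduction comparison described in the Methods can be illustrated with a minimal sketch. This is not the authors' code. The function names (`has_ds`, `interviews_needed`, `reduction_vs_random`) are hypothetical, and the definition of "reduction" is an assumption: people are interviewed in order of predicted risk until the target sensitivity is reached, and this is compared with a random order, which in expectation needs a fraction of the population equal to the target sensitivity. The study may define the reduction differently.

```python
from typing import Sequence


def has_ds(phq9_items: Sequence[int], cutoff: int = 10) -> bool:
    """DS label: total PHQ-9 score (nine items, each scored 0-3) >= cutoff."""
    if len(phq9_items) != 9 or any(not 0 <= s <= 3 for s in phq9_items):
        raise ValueError("PHQ-9 needs nine item scores between 0 and 3")
    return sum(phq9_items) >= cutoff


def interviews_needed(scores: Sequence[float], labels: Sequence[int],
                      target_sensitivity: float) -> float:
    """Fraction of people to interview, highest predicted risk first,
    until the target share of true DS cases has been found."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank people by predicted probability, highest first.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    found = 0
    for n, (_, y) in enumerate(ranked, start=1):
        found += y
        if found / total_pos >= target_sensitivity:
            return n / len(ranked)
    return 1.0


def reduction_vs_random(scores: Sequence[float], labels: Sequence[int],
                        target_sensitivity: float) -> float:
    """Assumed definition: relative saving in interviews compared with a
    random order, which needs about `target_sensitivity` of the population
    in expectation to find the same share of cases."""
    model_fraction = interviews_needed(scores, labels, target_sensitivity)
    return 1.0 - model_fraction / target_sensitivity


# Toy data (invented for illustration, not taken from the study).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1]
print(reduction_vs_random(toy_scores, toy_labels, 0.75))
```

Sweeping `target_sensitivity` over a range of values gives the kind of trade-off curve between sensitivity and number of interviews that the Methods describe.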

Results: The methodology allows the end-user to make an informed trade-off between sensitivity, specificity and a reduction in the number of interviews. At the thresholds of 0.444, 0.412, and 0.472, determined by maximizing Youden's index, the models achieved sensitivities of 0.717, 0.741, and 0.718, and specificities of 0.644, 0.737, and 0.766 for PROACTIVE, PNS 2013, and PNS 2019, respectively. The area under the receiver operating characteristic curve (AUC) was 0.736, 0.801, and 0.809 for the three datasets, respectively. For the PROACTIVE dataset, the most influential features identified were postural balance, shortness of breath, and how old people feel they are. In the PNS 2013 dataset, the most influential features were the ability to do usual activities, chest pain, sleep problems, and chronic back problems. The PNS 2019 dataset shared three of these features with the PNS 2013 dataset; the only difference was that verbal abuse replaced chronic back problems. It is important to note that the features contained in the PNS datasets differ from those found in the PROACTIVE dataset. An empirical analysis demonstrated that utilizing the proposed model led to a potential reduction in screening interviews of up to 52% while maintaining a sensitivity of 0.80.
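For readers unfamiliar with it, Youden's index is J = sensitivity + specificity - 1, and the threshold that maximizes J is one common way to pick an operating point. The sketch below shows that selection on invented toy data; the function name `youden_threshold` is hypothetical, and classifying a score as positive when it is greater than or equal to the threshold is an assumption, not necessarily the convention used in the study.

```python
from typing import Sequence, Tuple


def youden_threshold(scores: Sequence[float],
                     labels: Sequence[int]) -> Tuple[float, float, float, float]:
    """Return (J, threshold, sensitivity, specificity) for the threshold
    that maximizes Youden's index J = sensitivity + specificity - 1.
    A score counts as a positive prediction when score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    best = None
    # Every distinct predicted score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / pos
        spec = tn / neg
        j = sens + spec - 1.0
        # Strict comparison keeps the lowest threshold when J values tie.
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best


# Toy data (invented for illustration, not taken from the study).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1, 1, 1]
j, threshold, sens, spec = youden_threshold(toy_scores, toy_labels)
print(threshold, round(sens, 3), round(spec, 3), round(j, 3))
```

Applying the same formula to the sensitivities and specificities reported above gives J of about 0.361 for PROACTIVE, 0.478 for PNS 2013, and 0.484 for PNS 2019.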

Conclusion: This study developed a novel methodology for identifying individuals with DS by demonstrating the practical benefits of employing Bayesian networks to identify the most significant features to be used in a machine learning model for the prediction of DS in three general health and socio-economic datasets. Moreover, simulations indicated that the utilization of this approach has the potential to substantially reduce the screening interviews for identifying people with DS while maintaining a high sensitivity. These findings pave the way for improved early identification and intervention strategies for individuals experiencing depressive symptomatology.
Original language: English
Journal: Journal of Medical Internet Research Mental Health
Publication status: Accepted/In press - 17 Apr 2024

