Association Rule Mining for Risk Assessment in Epidemiology

Date

2016-08

Abstract

In epidemiology, a risk assessment measures the association between exposures and a health outcome. Risk characterization has traditionally been performed using statistical methods such as logistic regression, but such methods are not effective when working with highly correlated variables or when trying to assess synergistic effects between exposures. These limitations become evident in studies of asthma, a common chronic disease that affects 25 million people in the US. The prevalence of asthma is growing, and research is struggling to find the reason. Many factors have been associated with causing and triggering asthma, but their interactions, as well as which factor is most responsible for the rise in asthma, are still unclear. Outdoor air pollution is on the list of possible causes and triggers. Characterizing the connection between asthma and air pollution is not an easy task because of high collinearity between pollutant agents, possible synergistic effects, and the difficulty of controlling the exposure. The research community is currently encouraging the use of multi-pollutant models to yield better results. In this dissertation we propose: (i) a modified Apriori association rule mining method for identifying connections between exposures and risk variations, and (ii) a novel genetic algorithm (GA) designed to mine risk-based quantitative association rules. Both methods were tested on a group of synthetic datasets and on a real data collection of pediatric asthma cases and pollution levels in Houston. The results on the synthetic datasets show the advantages of applying our methods to augment traditional logistic regression, and help determine the best metrics to include in the GA fitness function (odds ratio, length, repetition, and redundancy). Tests on clinical data suggest the existence of a correlation between asthma and outdoor air pollutants, both alone and as a mixture.
The genetic algorithm improves the results of the Apriori-based method by recognizing what appear to be the most dangerous levels of exposure. Future work will help to improve aspects of the GA such as population initialization or rule selection. To date, the proposed methods represent a significant step in the direction of risk assessment based on association rule mining in epidemiological studies.
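
As a rough illustration of what scoring an exposure rule by odds ratio can look like, the sketch below computes the support and odds ratio of a rule of the form "all listed exposures present, therefore outcome" from a 2x2 contingency table. It is not the dissertation's implementation: the function name `rule_stats`, the field names `high_ozone`, `high_no2`, and `asthma`, the toy records, and the 0.5 continuity correction for empty cells are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, bool]


def rule_stats(records: List[Record],
               exposures: Sequence[str],
               outcome: str) -> Tuple[float, float]:
    """Return (support, odds ratio) for the rule `exposures => outcome`.

    A record counts as exposed only if every listed exposure is True.
    Hypothetical helper, written for illustration only.
    """
    a = b = c = d = 0.0  # cells of the 2x2 exposure/outcome table
    for rec in records:
        exposed = all(rec[name] for name in exposures)
        case = rec[outcome]
        if exposed and case:
            a += 1   # exposed, outcome present
        elif exposed:
            b += 1   # exposed, outcome absent
        elif case:
            c += 1   # unexposed, outcome present
        else:
            d += 1   # unexposed, outcome absent

    support = (a + b) / len(records)  # fraction of records matching the rule body

    # Assumption: add 0.5 to every cell when any cell is empty, so the
    # ratio stays finite (a common continuity correction).
    if 0.0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5

    odds_ratio = (a * d) / (b * c)
    return support, odds_ratio


# Synthetic toy records, invented for this example (not study data).
toy = (
    [{"high_ozone": True, "high_no2": True, "asthma": True}] * 30
    + [{"high_ozone": True, "high_no2": True, "asthma": False}] * 10
    + [{"high_ozone": True, "high_no2": False, "asthma": True}] * 20
    + [{"high_ozone": False, "high_no2": False, "asthma": False}] * 40
)

print(rule_stats(toy, ["high_ozone", "high_no2"], "asthma"))  # (0.4, 6.0)
```

In this toy table the combined exposure matches 40% of the records, and the odds of the outcome are six times higher among exposed records than among the rest. A rule miner along the lines described above could rank candidate rules by such a score, alongside other criteria such as rule length.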

Keywords

Association Rule Mining, Genetic algorithms, Epidemiology, Outdoor Air Pollution

Citation