Toward Improved Classification of Imbalanced Data
Almogahed, Bassam A. 1980-
An unprecedented amount of data is now available, and knowledge discovery has attracted growing attention in recent years as a result. However, many real-world datasets are imbalanced. Learning from imbalanced data poses major challenges and is recognized as needing significant research. The core problem is that learning algorithms perform poorly in the presence of underrepresented data and severely skewed class distributions: models trained on imbalanced datasets strongly favor the majority class and largely ignore the minority class. The approaches introduced to date offer both data-based and algorithmic solutions, but both types have been criticized for their lack of generalization, their tendency to discard information, and their susceptibility to over-fitting. The goal of this thesis is to develop algorithms that balance imbalanced datasets so that each classifier can reach optimal predictions. The specific objectives are to: (i) develop sampling methods for imbalanced data, (ii) develop a framework capable of determining which sampling method to use, (iii) evaluate the performance of these methods on a variety of imbalanced datasets, and (iv) develop a new machine learning risk-prediction framework for cardiovascular events. We propose a method for filtering over-sampled data that formulates the filtering problem as a non-cooperative game. The proposed algorithm requires no prior assumptions and selects representative synthetic instances while generating only a very small amount of noise. We also propose a technique that integrates under-sampling and semi-supervised learning (US-SSL) to tackle the imbalance problem.
Across three different classifiers, the proposed algorithm on average significantly outperforms all other sampling algorithms in 67% of cases and ranks second best in the remaining 33%. Finally, we propose a novel framework, based on the US-SSL algorithm, that selects the appropriate semi-supervised algorithm to balance and refine a given dataset and thereby establish a well-defined training set.
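
The bias toward the majority class described in the abstract can be illustrated with a small sketch. It is not taken from the thesis: the 95/5 class split, the always-predict-majority baseline, and the helper names (majority_baseline, accuracy, recall) are assumptions chosen only to show why plain accuracy can hide a model that ignores the minority class entirely.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label (a model that always predicts it)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of instances of class `positive` that were predicted correctly."""
    preds_for_class = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_class) / len(preds_for_class)


# Hypothetical toy data: 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5
y_pred = [majority_baseline(y_true)] * len(y_true)

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
balanced_acc = (recall(y_true, y_pred, positive=0) + minority_recall) / 2

print(acc, minority_recall, balanced_acc)  # prints 0.95 0.0 0.5
```

The baseline scores 95% accuracy while detecting no minority instances, which is why per-class recall or balanced accuracy is usually more informative than accuracy when comparing sampling methods on skewed data.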