Toward Improved Classification of Imbalanced Data

dc.contributor.advisor: Kakadiaris, Ioannis A.
dc.contributor.committeeMember: Shah, Shishir Kirit
dc.contributor.committeeMember: Eick, Christoph F.
dc.contributor.committeeMember: Vilalta, Ricardo
dc.contributor.committeeMember: Tsiamyrtzis, Panagiotis
dc.creator: Almogahed, Bassam A. 1980-
dc.date.accessioned: 2017-07-18T21:02:57Z
dc.date.available: 2017-07-18T21:02:57Z
dc.date.created: December 2014
dc.date.issued: 2014-12
dc.date.submitted: December 2014
dc.date.updated: 2017-07-18T21:02:57Z
dc.description.abstract: There is an unprecedented amount of data available. This has caused knowledge discovery to garner attention in recent years. However, many real-world datasets are imbalanced. Learning from imbalanced data poses major challenges and is recognized as needing significant research. The problem with imbalanced data is the performance of learning algorithms in the presence of underrepresented data and severely skewed class distributions. Models trained on imbalanced datasets strongly favor the majority class and largely ignore the minority class. Several approaches introduced to date present both data-based and algorithmic solutions. However, both types of approaches have been criticized for their lack of generalization, tendency to forfeit information, and likelihood of resulting in over-fitting difficulties. The goal of this thesis is to develop algorithms to balance imbalanced datasets to allow each classifier to reach optimal predictions. The specific objectives are to: (i) develop sampling methods for imbalanced data, (ii) develop a framework capable of determining which sampling method to use, (iii) evaluate performance of these methods on a variety of imbalanced datasets, and (iv) develop a new machine learning risk-prediction framework for cardiovascular events. We propose a method for filtering over-sampled data using non-cooperative game theory. It addresses the imbalanced data issue by formulating the problem as a non-cooperative game. The proposed algorithm does not require any prior assumptions and selects representative synthetic instances while generating only a very small amount of noise. We also propose a technique for addressing imbalanced data using semi-supervised learning. Our method integrates under-sampling and semi-supervised learning (US-SSL) to tackle the imbalance problem.
The proposed algorithm, on average, significantly outperforms all other sampling algorithms in 67% of cases, across three different classifiers, and ranks second best for the remaining 33% of cases. Finally, we propose a novel framework based on the US-SSL algorithm to select the appropriate semi-supervised algorithm to balance and refine a given dataset in order to establish a well-defined training set.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.citation: Portions of this document appear in: Almogahed, Bassam A., and Ioannis A. Kakadiaris. "NEATER: filtering of over-sampled data using non-cooperative game theory." Soft Computing 19, no. 11 (2015): 3301-3322. DOI: 10.1007/s00500-014-1484-5
dc.identifier.uri: http://hdl.handle.net/10657/1910
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. UH Libraries has secured permission to reproduce any and all previously published materials contained in the work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: Classification
dc.subject: Imbalanced data
dc.title: Toward Improved Classification of Imbalanced Data
dc.type.dcmi: text
dc.type.genre: Thesis
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy

Files

Original bundle

Name: ALMOGAHED-DISSERTATION-2014.pdf
Size: 2.12 MB
Format: Adobe Portable Document Format