Adversarial and Non-Adversarial Approaches for Imbalanced Data Classification

Date

2022-03-22

Abstract

Because imbalanced classification has a significant impact on many domains of computation, from security to medical science, any step toward addressing related challenges such as lost diversity, high false positive rates, or uncertainty is closely followed by the scientific community. With the surge of adversarial approaches such as GANs, non-adversarial methods such as the Synthetic Minority Over-Sampling Technique (SMOTE) are receiving less and less attention. In this dissertation, we not only propose novel adversarial and non-adversarial approaches but also investigate how the two families can meet each other. Beyond that, we propose a novel method, called Real-Fake Validation Loss (RFVL), to make the evaluation of adversarial approaches explainable. Data-driven approaches to imbalanced data classification suffer from two major problems: (i) lack of diversity and (ii) uncertainty. In this research, we propose a set of novel approaches to address these problems. First, we propose Cross-Concatenation, the first projection-based method to address the imbalanced data classification problem and the first projection method that can balance the sizes of the minority and majority classes. We prove that Cross-Concatenation can create larger margins with better class separation. Unlike SMOTE and its variations, Cross-Concatenation is not based on random procedures, so running it on fixed training and test data always yields the same performance results. This stability is one of the most important advantages of Cross-Concatenation over SMOTE. In addition, our experimental results show that Cross-Concatenation is competitive with SMOTE and its variants, the most popular over-sampling approaches, in terms of F1 score and AUC in the majority of test cases. Second, we introduce a new concept called virtual big data. Virtual big data is a high-dimensional version of the original training data, generated by concatenating c different original instances. This technique can increase the number of training instances from N to C(N, c). We prove that the curse of dimensionality of virtual big data can alleviate the vanishing generator gradient problem.
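The virtual big data construction described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `virtual_big_data` is hypothetical, and the sketch assumes that "c different original instances" means every unordered choice of c distinct rows, concatenated in index order, which is what yields the C(N, c) count stated above.

```python
from itertools import combinations
from math import comb

import numpy as np


def virtual_big_data(X, c):
    """Concatenate every unordered choice of c distinct rows of X.

    Hypothetical helper: assumes rows are joined in index order.
    Returns an array of shape (C(N, c), c * d) for X of shape (N, d).
    """
    X = np.asarray(X)
    n, d = X.shape
    if not 1 <= c <= n:
        raise ValueError("c must satisfy 1 <= c <= number of instances")
    # Each row of idx holds the indices of one combination of c instances.
    idx = np.array(list(combinations(range(n), c)))
    # X[idx] has shape (C(n, c), c, d); flatten the last two axes.
    return X[idx].reshape(len(idx), c * d)


X = np.arange(12, dtype=float).reshape(4, 3)  # N = 4 instances, d = 3 features
V = virtual_big_data(X, c=2)
print(V.shape)  # (6, 6): C(4, 2) = 6 virtual instances of dimension 2 * 3
```

The number of virtual instances grows combinatorially with N, so in practice one would likely sample a subset of combinations rather than enumerate all of them; that choice is outside what the abstract specifies.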

Keywords

Imbalanced Classification, Over-Sampling, Generative Adversarial Neural Networks

Citation

Portions of this document appear in: Mansourifar, Hadi, Lin Chen, and Weidong Shi. "Virtual big data for GAN based data augmentation." In 2019 IEEE International Conference on Big Data (Big Data), pp. 1478-1487. IEEE, 2019; and in: Mansourifar, Hadi, and Weidong Shi. "Cross-Concatenation: Tackling Uncertainty in Imbalanced Big Data Classification." In 2021 IEEE International Conference on Big Data (Big Data), pp. 867-875. IEEE, 2021; and in: Kasichainula, Keshav, Hadi Mansourifar, and Weidong Shi. "Poisoning Attacks via Generative Adversarial Text to Image Synthesis." In 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), pp. 158-165. IEEE, 2021.