Adversarial and Non-Adversarial Approaches for Imbalanced Data Classification
Abstract
Due to the significant impact of imbalanced classification in many domains of computation, from security to medical science, any step toward addressing its related challenges, such as lost diversity, high false-positive rates, or uncertainty, is closely monitored by the scientific community. With the surge of adversarial approaches such as GANs, non-adversarial methods like the Synthetic Minority Over-sampling Technique (SMOTE) receive less and less attention. In this dissertation, we not only propose novel adversarial and non-adversarial approaches, but also investigate how the two families can meet. Beyond that, we propose a novel method, called Real-Fake Validation Loss (RFVL), that makes the evaluation of adversarial approaches explainable. Data-driven approaches to imbalanced data classification suffer from two major problems: (i) lack of diversity and (ii) uncertainty. In this research, we propose a set of novel approaches to address these problems. First, we propose Cross-Concatenation, the first projection-based method for the imbalanced data classification problem, and the first projection method that can balance the sizes of both the minority and majority classes. We prove that Cross-Concatenation can create larger margins with better class separation. Unlike SMOTE and its variants, Cross-Concatenation is not based on random procedures; running it on fixed training and test data therefore always yields the same results. This stability is one of the most important advantages of Cross-Concatenation over SMOTE. Moreover, our experimental results show that Cross-Concatenation is competitive with SMOTE and its variants, the most popular over-sampling approaches, in terms of F1 score and AUC in the majority of test cases. Second, we introduce a new concept called virtual big data.
Virtual big data is a high-dimensional version of the original training data, generated by concatenating c different original instances. This technique increases the number of training samples from N to C(N, c). We prove that the curse of dimensionality of virtual big data can alleviate the vanishing generator gradient problem.
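The N-to-C(N, c) growth described above can be sketched as follows. This is a minimal illustration of the combinatorial construction only; the function name is hypothetical, and details such as instance ordering and label handling are not specified in the abstract:

```python
import numpy as np
from itertools import combinations
from math import comb

def virtual_big_data(X, c=2):
    """Concatenate every combination of c distinct rows of X.

    N original d-dimensional instances become C(N, c)
    virtual instances of dimension c * d.
    """
    return np.array([np.concatenate([X[i] for i in idx])
                     for idx in combinations(range(len(X)), c)])

X = np.arange(12).reshape(4, 3)      # N = 4 instances, d = 3 features
V = virtual_big_data(X, c=2)
assert V.shape == (comb(4, 2), 2 * 3)  # 6 virtual samples of dimension 6
```

The quadratic (for c = 2) growth in sample count, paired with the c-fold growth in dimensionality, is what the text refers to as the curse of dimensionality of virtual big data.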