Domain Adaptation using Deep Adversarial Models






In machine learning, a good model is one that generalizes from training data and makes accurate predictions when presented with previously unseen samples. Traditionally, data sets are assumed to lie within the same domain, with the same distribution underlying both the training and testing sets. In many real-world scenarios this assumption leads to very poor results, because data distributions are frequently similar but not identical. Sometimes, samples that belong to different domains unintentionally end up in the same data set, which poses a major problem: learning on such data violates fundamental machine learning principles and often results in poor predictions, or in a failure to converge to a solution at all. One way to address this problem is domain adaptation, a transfer learning technique that aims to mitigate the differences between data set distributions. In this manuscript, we present several novel methods that learn from multiple source domains and extract a domain-agnostic model to be applied to one or more target domains. Because text-based data sets are among those most frequently exhibiting distributional discrepancies, much of the following discussion involves topics from natural language processing, and specifically authorship attribution; however, our approach is not restricted to a specific field or data type, and a variety of tasks are explored. We investigate an unsupervised multiple-domain adaptation scenario with several labeled source data sets and unlabeled target data, as well as a semi-supervised scenario in which a few labeled target samples are available. We also briefly address domain generalization, where the algorithm has no access to target data, labeled or not. We ground this work in authorship problems, with experiments performed on two standard semantic data sets as well as a custom data set we created.
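The failure mode described above, where training and test distributions differ, can be illustrated with a toy sketch. The data, the shift, and the threshold classifier below are purely illustrative assumptions, not taken from the thesis:

```python
import numpy as np

# Toy illustration of domain shift: a classifier fit on one domain
# degrades sharply when the target domain's distribution is shifted.
rng = np.random.default_rng(0)

# Source domain: class 0 ~ N(0, 1), class 1 ~ N(4, 1)
src_x = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)])
src_y = np.array([0] * 500 + [1] * 500)

# Target domain: same task, but every feature shifted by +3
tgt_x = src_x + 3.0
tgt_y = src_y

# "Train": pick the midpoint between the source class means as the threshold
threshold = (src_x[src_y == 0].mean() + src_x[src_y == 1].mean()) / 2

src_acc = ((src_x > threshold) == src_y).mean()  # high on the source domain
tgt_acc = ((tgt_x > threshold) == tgt_y).mean()  # much lower on the target
```

The learned threshold is near-optimal for the source distribution but sits inside the target's class-0 cluster, so accuracy on the shifted domain collapses even though the underlying task is unchanged.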
Our approach works by mapping the data from the source or the target (and at times both) domains into a domain-invariant feature space in which samples are aligned according to their distribution yet maximally separated with respect to domain and, at times, class. To learn this embedding subspace, we created three algorithms and designed a number of deep adversarial models; in the learned subspace, the mapped domains are semantically aligned and maximally separated. To maximize model performance through high-risk, high-reward learning techniques, and to mitigate some of the known difficulties in training adversarial networks, we developed several improvements on existing loss functions as well as an intelligent regularization and early stopping approach. We validate our hypotheses on a multitude of standard linguistic tasks as well as one of our own making. As part of this effort, we mined and contributed an authorship data set that has been accepted for use as a standard language resource. We conducted an extensive set of experiments, showing the general applicability of our proposed ideas on linguistic tasks other than authorship by evaluating our method on a standard multi-class classification task. Finally, we demonstrated that our algorithm is not confined to language tasks by using it to transfer knowledge between images of hand-written digits stemming from different domains. Experimental results validated our ideas: the proposed methods beat traditional baselines in every task and are very competitive with the current state of the art, while significantly outperforming the competition on authorship problems.
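A common building block for adversarial domain-invariant feature learning is the gradient reversal layer of Ganin and Lempitsky's DANN. The sketch below shows only that generic mechanism, not the thesis's own models or loss functions, which differ in detail:

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; during backpropagation it multiplies
    the incoming gradient by -lam, so the feature extractor upstream is
    pushed to *maximize* the domain classifier's loss, i.e. to produce
    features in which the domains are indistinguishable."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam  # trade-off weight for the adversarial signal

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # features pass through unchanged

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output  # flipped, scaled gradient
```

A domain classifier reads features through this layer; minimizing its loss with ordinary gradient descent then trains the feature extractor adversarially, yielding the domain-aligned embedding subspace described above.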



Domain adaptation, generative adversarial networks, NLP, deep learning, authorship, machine learning