Paper Group NANR 35
Cross-Lingual Ability of Multilingual BERT: An Empirical Study. Efficient Wrapper Feature Selection using Autoencoder and Model Based Elimination. Knockoff-Inspired Feature Selection via Generative Models. Self-Induced Curriculum Learning in Neural Machine Translation. Granger Causal Structure Reconstruction from Heterogeneous Multivariate Time Series. Structural Multi-agent Learning. Learning from Explanations with Neural Module Execution Tree. Double-Hard Debiasing: Tailoring Word Embeddings for Gender Bias Mitigation. Finding Mixed Strategy Nash Equilibrium for Continuous Games through Deep Learning. Stochastic Prototype Embeddings. Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks. Learning Through Limited Self-Supervision: Improving Time-Series Classification Without Additional Data via Auxiliary Tasks. Deep Multi-View Learning via Task-Optimal CCA. A Generalized Framework of Sequence Generation with Application to Undirected Sequence Models. Reducing Sentiment Bias in Language Models via Counterfactual Evaluation.
Cross-Lingual Ability of Multilingual BERT: An Empirical Study
Title | Cross-Lingual Ability of Multilingual BERT: An Empirical Study |
Authors | Anonymous |
Abstract | Recent work has exhibited the surprising cross-lingual abilities of multilingual BERT (M-BERT) – surprising since it is trained without any cross-lingual objective and with no aligned data. In this work, we provide a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. We study the impact of the linguistic properties of the languages, the architecture of the model, and the learning objectives. The experimental study is done in the context of three typologically different languages – Spanish, Hindi, and Russian – and using two conceptually different NLP tasks, textual entailment and named entity recognition. Among our key conclusions is the fact that lexical overlap between languages plays a negligible role in cross-lingual success, while the depth of the network is an important part of it. |
Tasks | Named Entity Recognition, Natural Language Inference |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJeT3yrtDr |
https://openreview.net/pdf?id=HJeT3yrtDr | |
PWC | https://paperswithcode.com/paper/cross-lingual-ability-of-multilingual-bert-an |
Repo | |
Framework | |
Efficient Wrapper Feature Selection using Autoencoder and Model Based Elimination
Title | Efficient Wrapper Feature Selection using Autoencoder and Model Based Elimination |
Authors | Anonymous |
Abstract | We propose a computationally efficient wrapper feature selection method - called Autoencoder and Model Based Elimination of features using Relevance and Redundancy scores (AMBER) - that uses a single ranker model along with autoencoders to perform greedy backward elimination of features. The ranker model is used to prioritize the removal of features that are not critical to the classification task, while the autoencoders are used to prioritize the elimination of correlated features. We demonstrate the superior feature selection ability of AMBER on four well-known datasets from different application domains by comparing its accuracy with that of other computationally efficient state-of-the-art feature selection techniques. Interestingly, we find that the ranker model used for feature selection does not necessarily have to be the same as the final classifier that is trained on the selected features. Finally, we hypothesize that overfitting the ranker model on the training set facilitates the selection of more salient features. |
Tasks | Feature Selection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkgWeJrYwr |
https://openreview.net/pdf?id=SkgWeJrYwr | |
PWC | https://paperswithcode.com/paper/efficient-wrapper-feature-selection-using-1 |
Repo | |
Framework | |
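The AMBER recipe described above — a single ranker model scoring relevance, an autoencoder scoring redundancy, and greedy backward elimination driven by both — can be illustrated with a short sketch. This is not the authors' implementation: the concrete scoring rules (absolute ranker weights for relevance, reconstruction error for redundancy, PCA standing in for the autoencoder) and the equal weighting of the two scores are assumptions made only for illustration.

```python
# Minimal sketch of ranker + autoencoder driven backward elimination
# (assumed scoring rules; not the authors' AMBER implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

def relevance_scores(X, y):
    # Relevance proxy: magnitude of a single ranker model's weights.
    ranker = LogisticRegression(max_iter=1000).fit(X, y)
    return np.abs(ranker.coef_).mean(axis=0)

def redundancy_scores(X, n_components=5):
    # Redundancy proxy: how well each feature is reconstructed from a
    # low-dimensional code (a linear "autoencoder" via PCA for brevity).
    pca = PCA(n_components=min(n_components, X.shape[1] - 1)).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return -((X - X_hat) ** 2).mean(axis=0)  # low error => highly redundant

def backward_eliminate(X, y, n_keep):
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        rel = relevance_scores(X[:, keep], y)
        red = redundancy_scores(X[:, keep])
        # Drop the feature that is least relevant and most redundant.
        worst = np.argmin(rel - red)
        keep.pop(int(worst))
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    y = (X[:, 0] + X[:, 3] > 0).astype(int)
    print("Selected features:", backward_eliminate(X, y, n_keep=4))
```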
Knockoff-Inspired Feature Selection via Generative Models
Title | Knockoff-Inspired Feature Selection via Generative Models |
Authors | Anonymous |
Abstract | We propose a feature selection algorithm for supervised learning inspired by the recently introduced knockoff framework for variable selection in statistical regression. While variable selection in statistics aims to distinguish between true and false predictors, feature selection in machine learning aims to reduce the dimensionality of the data while preserving the performance of the learning method. The knockoff framework has attracted significant interest due to its strong control of false discoveries while preserving predictive power. In contrast to the original approach and later variants that assume a given probabilistic model for the variables, our proposed approach relies on data-driven generative models that learn mappings from data space to a parametric space that characterizes the probability distribution of the data. Our approach requires only the availability of mappings from data space to a distribution in parametric space and from parametric space to a distribution in data space; thus, it can be integrated with multiple popular generative models from machine learning. We provide example knockoff designs using a variational autoencoder and a Gaussian process latent variable model. We also propose a knockoff score metric for a softmax classifier that accounts for the contribution of each feature and its knockoff during supervised learning. Experimental results with multiple benchmark datasets for feature selection showcase the advantages of our knockoff designs and the knockoff framework with respect to existing approaches. |
Tasks | Feature Selection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryex8CEKPr |
https://openreview.net/pdf?id=ryex8CEKPr | |
PWC | https://paperswithcode.com/paper/knockoff-inspired-feature-selection-via |
Repo | |
Framework | |
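The knockoff-filter idea the paper builds on — compare each feature's importance to that of a synthetic "knockoff" copy and keep only features that clearly beat their knockoffs — can be sketched as follows. The knockoff generator here is a crude Gaussian stand-in (not a valid exchangeable knockoff and not the paper's VAE/GP-LVM construction), and the importance statistic (difference of absolute classifier weights) and selection threshold are assumed choices for illustration only.

```python
# Toy knockoff-filter sketch (Gaussian stand-in for the paper's generative
# knockoff construction; the W statistic and threshold are assumed choices).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sample_knockoffs(X, rng):
    # Crude stand-in: resample each row from a Gaussian with the data's mean
    # and covariance, independently of y (NOT a valid exchangeable knockoff,
    # just an illustration of the interface a generative model would provide).
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=X.shape[0])

rng = np.random.default_rng(0)
n, p = 500, 8
X = rng.standard_normal((n, p))
y = (X[:, 0] - 2 * X[:, 2] + 0.1 * rng.standard_normal(n) > 0).astype(int)

X_ko = sample_knockoffs(X, rng)
clf = LogisticRegression(max_iter=1000).fit(np.hstack([X, X_ko]), y)
w = np.abs(clf.coef_[0])
W = w[:p] - w[p:]                 # feature importance minus knockoff importance
selected = np.where(W > np.quantile(np.abs(W), 0.5))[0]   # ad-hoc threshold
print("Knockoff statistics:", np.round(W, 2))
print("Selected features:", selected)
```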
Self-Induced Curriculum Learning in Neural Machine Translation
Title | Self-Induced Curriculum Learning in Neural Machine Translation |
Authors | Anonymous |
Abstract | Self-supervised neural machine translation (SS-NMT) learns how to extract/select suitable training data from comparable (rather than parallel) corpora and how to translate, in such a way that the two tasks support each other in a virtuous circle. SS-NMT has been shown to be competitive with state-of-the-art unsupervised NMT. In this study we provide an in-depth analysis of the sampling choices the SS-NMT model makes during training. We show that, without having been told to do so, the model selects samples of increasing (i) complexity and (ii) task relevance, in combination with (iii) a denoising curriculum. We observe that the dynamics of mutual supervision between the system's two internal representation types are vital for extraction and, hence, for translation performance. We show that, in terms of the Gunning Fog readability index (GF), SS-NMT starts by extracting and learning from Wikipedia data suitable for high school students (GF=10–11) and quickly moves towards content suitable for first-year undergraduate students (GF=13). |
Tasks | Denoising, Machine Translation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJxRmlStDB |
https://openreview.net/pdf?id=rJxRmlStDB | |
PWC | https://paperswithcode.com/paper/self-induced-curriculum-learning-in-neural |
Repo | |
Framework | |
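Since the curriculum analysis is phrased in terms of the Gunning Fog index, a short sketch of how GF is computed may help: 0.4 times the sum of average sentence length (in words) and the percentage of "complex" words (three or more syllables). The syllable counter below is a rough vowel-group heuristic, not the exact tool the authors used.

```python
# Gunning Fog index: 0.4 * (words/sentences + 100 * complex_words/words),
# where "complex" words have three or more syllables (rough heuristic below).
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))

sample = ("Self-supervised translation systems extract parallel sentences from "
          "comparable corpora. They improve gradually as training proceeds.")
print(f"Gunning Fog index: {gunning_fog(sample):.1f}")
```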
Granger Causal Structure Reconstruction from Heterogeneous Multivariate Time Series
Title | Granger Causal Structure Reconstruction from Heterogeneous Multivariate Time Series |
Authors | Anonymous |
Abstract | Granger causal structure reconstruction is an emerging topic that can uncover the causal relationships behind multivariate time series data. In many real-world systems, it is common to encounter a large amount of multivariate time series data collected from heterogeneous individuals that share commonalities; however, there are ongoing concerns regarding the applicability of Granger causal reconstruction in such large-scale, complex scenarios, which presents both challenges and opportunities. To bridge this gap, we propose a Granger cAusal StructurE Reconstruction (GASER) framework for inductive Granger causality learning and common causal structure detection on heterogeneous multivariate time series. In particular, we address the problem through a novel attention mechanism, called prototypical Granger causal attention. Extensive experiments, as well as an online A/B test on an e-commerce advertising platform, demonstrate the superior performance of GASER. |
Tasks | Time Series |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJxyCRVKvB |
https://openreview.net/pdf?id=SJxyCRVKvB | |
PWC | https://paperswithcode.com/paper/granger-causal-structure-reconstruction-from |
Repo | |
Framework | |
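As background for readers unfamiliar with Granger causality, the classical pairwise test compares a restricted autoregression (own lags only) with an unrestricted one (own lags plus the candidate cause's lags) via an F statistic. The sketch below implements that textbook test with plain least squares; GASER itself replaces this with an inductive neural attention mechanism, which is not reproduced here.

```python
# Classical pairwise Granger causality test (background, not the GASER model).
import numpy as np
from scipy import stats

def lag_matrix(x, lags):
    # Column k holds x lagged by k+1 steps, aligned with x[lags:].
    return np.column_stack([x[lags - k:len(x) - k] for k in range(1, lags + 1)])

def granger_f_test(target, cause, lags=2):
    y = target[lags:]
    X_r = np.column_stack([np.ones(len(y)), lag_matrix(target, lags)])   # restricted
    X_u = np.column_stack([X_r, lag_matrix(cause, lags)])                # unrestricted
    rss_r = np.sum((y - X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]) ** 2)
    rss_u = np.sum((y - X_u @ np.linalg.lstsq(X_u, y, rcond=None)[0]) ** 2)
    df_num, df_den = lags, len(y) - X_u.shape[1]
    f = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
    return f, 1 - stats.f.cdf(f, df_num, df_den)

rng = np.random.default_rng(0)
cause = rng.standard_normal(500)
target = 0.8 * np.roll(cause, 1) + 0.2 * rng.standard_normal(500)  # cause leads target
f, p = granger_f_test(target, cause, lags=2)
print(f"F = {f:.1f}, p = {p:.4f}")  # small p: 'cause' Granger-causes 'target'
```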
Structural Multi-agent Learning
Title | Structural Multi-agent Learning |
Authors | Anonymous |
Abstract | In this paper, we propose a multi-agent learning framework to model communication in complex multi-agent systems. Most existing multi-agent reinforcement learning methods require agents to exchange information with the environment or a global manager to achieve effective and efficient interaction. We model the multi-agent system with an online adaptive graph in which all agents communicate with each other through the edges. We update the graph network with a relation system that takes the current graph network and the hidden variables of the agents as input. Messages and rewards are shared through the graph network. Finally, we optimize the whole system via the policy gradient algorithm. Experimental results on several multi-agent systems show the efficiency of the proposed method and its strength relative to existing methods in cooperative scenarios. |
Tasks | Multi-agent Reinforcement Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SklEs2EYvS |
https://openreview.net/pdf?id=SklEs2EYvS | |
PWC | https://paperswithcode.com/paper/structural-multi-agent-learning |
Repo | |
Framework | |
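The communication mechanism described — agents exchanging messages over the edges of an adaptive graph — amounts to a message-passing step; a minimal version is sketched below. The specific update (a weighted sum of neighbours' hidden states followed by a nonlinearity) and the way the adjacency is produced from agent states are illustrative assumptions, not the paper's exact relation system.

```python
# Minimal message-passing step over an adaptive agent graph (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_agents, hidden = 4, 8
H = rng.standard_normal((n_agents, hidden))          # agent hidden states
W_msg = 0.1 * rng.standard_normal((hidden, hidden))  # message transform
W_self = 0.1 * rng.standard_normal((hidden, hidden))

# Adaptive adjacency from pairwise state similarity (softmax over neighbours).
scores = H @ H.T
np.fill_diagonal(scores, -np.inf)                    # no self-edges
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

messages = A @ H @ W_msg                             # aggregate neighbours' messages
H_next = np.tanh(H @ W_self + messages)              # updated agent states
print(H_next.shape)  # (n_agents, hidden)
```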
Learning from Explanations with Neural Module Execution Tree
Title | Learning from Explanations with Neural Module Execution Tree |
Authors | Anonymous |
Abstract | While deep neural networks have achieved impressive performance on a range of NLP tasks, these data-hungry models heavily rely on labeled data. To make the most of each example, previous work has introduced natural language (NL) explanations to serve as supplements to mere labels. Such NL explanations can provide sufficient domain knowledge for generating more labeled data over new instances, while the annotation time only doubles. However, directly applying NL explanations to augment model learning encounters two challenges. First, NL explanations are unstructured and inherently compositional, which calls for a modularized model to represent their semantics. Second, NL explanations often have large numbers of linguistic variants, resulting in low recall and limited generalization ability when applied to unlabeled data. In this paper, we propose a novel Neural Modular Execution Tree (NMET) framework for augmenting sequence classification with NL explanations. After transforming NL explanations into executable logical forms with a semantic parser, NMET employs a neural module network architecture to generalize different types of actions (specified by the logical forms) for labeling data instances, and accumulates the results with soft logic, which substantially increases the coverage of each NL explanation. Experiments on two NLP tasks, relation extraction and sentiment analysis, demonstrate its superiority over baseline methods that do not leverage NL explanations. Its extension to multi-hop question answering achieves performance gains with light annotation effort. Also, NMET achieves much better performance than traditional label-only supervised models given the same annotation time. |
Tasks | Question Answering, Relation Extraction, Sentiment Analysis |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJlUt0EYwS |
https://openreview.net/pdf?id=rJlUt0EYwS | |
PWC | https://paperswithcode.com/paper/learning-from-explanations-with-neural-module |
Repo | |
Framework | |
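The "accumulate the results with soft logic" step can be made concrete with a small sketch: module outputs in [0, 1] are combined with soft AND/OR operators (a product t-norm here) instead of hard Boolean logic, so a partially matched explanation still contributes a graded label score. The particular t-norm and the toy module confidences are assumptions for illustration, not the paper's exact formulation.

```python
# Soft-logic aggregation of neural module outputs (product t-norm; illustrative).
import numpy as np

def soft_and(*scores):
    return float(np.prod(scores))

def soft_or(*scores):
    return float(1.0 - np.prod([1.0 - s for s in scores]))

# Toy module confidences for one unlabeled sentence under one parsed explanation,
# e.g. "'founded' appears between the two entities AND the subject is a person".
find_keyword = 0.9       # soft match score produced by one neural module
subject_is_person = 0.7  # score produced by another module

rule_score = soft_and(find_keyword, subject_is_person)
# Several explanations for the same label can be combined with a soft OR.
label_score = soft_or(rule_score, 0.2)
print(f"rule = {rule_score:.2f}, label = {label_score:.2f}")
```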
Double-Hard Debiasing: Tailoring Word Embeddings for Gender Bias Mitigation
Title | Double-Hard Debiasing: Tailoring Word Embeddings for Gender Bias Mitigation |
Authors | Anonymous |
Abstract | Gender bias in word embeddings has been widely investigated. However, recent work has shown that existing approaches, including the well-known Hard Debias algorithm which projects word embeddings to a subspace orthogonal to an inferred gender direction, are insufficient to deliver gender-neutral word embeddings. In our work, we discover that semantic-agnostic corpus statistics such as word frequency are important factors that limit the debiasing performance. We propose a simple but effective processing technique, Double-Hard Debias, to attenuate the effect due to such noise. We experiment with Word2Vec and GloVe embeddings and demonstrate on several benchmarks that our approach preserves the distributional semantics while effectively reducing gender bias to a larger extent than previous debiasing techniques. |
Tasks | Word Embeddings |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hyg6neStDr |
https://openreview.net/pdf?id=Hyg6neStDr | |
PWC | https://paperswithcode.com/paper/double-hard-debiasing-tailoring-word |
Repo | |
Framework | |
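For reference, the Hard Debias step that this paper extends projects each word vector onto the complement of a gender direction; a minimal version is shown below. Double-Hard Debias additionally removes frequency-related components before this projection, which is not reproduced here, and the two-word gender direction is a simplification of the PCA-over-definitional-pairs direction used in practice.

```python
# Hard Debias neutralizing step (the baseline that Double-Hard Debias extends).
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def hard_debias(vectors, gender_direction):
    g = unit(gender_direction)
    # Remove each vector's projection onto the gender direction.
    return {w: v - np.dot(v, g) * g for w, v in vectors.items()}

# Toy 4-d embeddings; in practice the direction comes from PCA over
# definitional pairs such as (he, she), (man, woman), ...
emb = {
    "doctor": np.array([0.4, 0.1, 0.3, 0.2]),
    "nurse":  np.array([0.1, 0.5, 0.2, 0.1]),
}
gender_dir = np.array([0.3, -0.4, 0.0, 0.0])  # e.g. emb["he"] - emb["she"]

debiased = hard_debias(emb, gender_dir)
for w, v in debiased.items():
    print(w, np.round(np.dot(v, unit(gender_dir)), 6))  # ~0 after projection
```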
Finding Mixed Strategy Nash Equilibrium for Continuous Games through Deep Learning
Title | Finding Mixed Strategy Nash Equilibrium for Continuous Games through Deep Learning |
Authors | Anonymous |
Abstract | Nash equilibrium has long been a desired solution concept in multi-player games, especially for those on continuous strategy spaces, which have attracted rapidly growing interest due to advances in applications such as generative adversarial networks. Although several deep-learning-based approaches are designed to obtain pure strategy Nash equilibria, assuming that such an equilibrium exists is a strong requirement. In this paper, we present a new method to approximate mixed strategy Nash equilibria in multi-player continuous games, which always exist and include the pure ones as a special case. We remedy the pure strategy weakness by adopting the pushforward measure technique to represent a mixed strategy in continuous spaces. This allows us to generalize the Gradient-based Nikaido-Isoda (GNI) function to measure the distance between the players' joint strategy profile and a Nash equilibrium. Applying the gradient descent algorithm, our approach is shown to converge to a stationary Nash equilibrium under the convexity assumption on payoff functions, the same popular setting as in previous studies. In numerical experiments, our method consistently and significantly outperforms recent works on approximating Nash equilibria for quadratic games, general Blotto games, and GAMUT games. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rygG7AEtvB |
https://openreview.net/pdf?id=rygG7AEtvB | |
PWC | https://paperswithcode.com/paper/finding-mixed-strategy-nash-equilibrium-for |
Repo | |
Framework | |
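The pushforward representation of a mixed strategy — sample base noise, map it through a parametric function, and treat the induced distribution over actions as the strategy — can be sketched in a few lines. The linear map, the toy payoff, and the Monte Carlo estimator below are illustrative assumptions; the paper optimizes such strategies with a generalized Gradient-based Nikaido-Isoda objective, which is not reproduced here.

```python
# Mixed strategy as a pushforward of Gaussian noise (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def sample_actions(W, b, n):
    # Pushforward map: actions are a parametric transform of base noise z.
    z = rng.standard_normal((n, W.shape[1]))
    return z @ W.T + b

def payoff_player1(a1, a2):
    # Toy two-player quadratic payoff (not a specific game from the paper).
    return -(a1 ** 2).sum(axis=1) + (a1 * a2).sum(axis=1)

# Each player's mixed strategy is defined by the parameters of its map.
W1, b1 = 0.5 * np.eye(2), np.zeros(2)
W2, b2 = 0.3 * np.eye(2), np.ones(2)

# Monte Carlo estimate of player 1's expected payoff under both strategies.
a1 = sample_actions(W1, b1, 10_000)
a2 = sample_actions(W2, b2, 10_000)
print(f"estimated expected payoff for player 1: {payoff_player1(a1, a2).mean():.3f}")
```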
Stochastic Prototype Embeddings
Title | Stochastic Prototype Embeddings |
Authors | Anonymous |
Abstract | Supervised deep-embedding methods project inputs of a domain to a representational space in which same-class instances lie near one another and different-class instances lie far apart. We propose a probabilistic method that treats embeddings as random variables. Extending a state-of-the-art deterministic method, Prototypical Networks (Snell et al., 2017), our approach supposes the existence of a class prototype around which class instances are Gaussian distributed. The prototype posterior is a product distribution over labeled instances, and query instances are classified by marginalizing relative prototype proximity over embedding uncertainty. We describe an efficient sampler for approximate inference that allows us to train the model at roughly the same space and time cost as its deterministic sibling. Incorporating uncertainty improves performance on few-shot learning and gracefully handles label noise and out-of-distribution inputs. Compared to the state-of-the-art stochastic method, Hedged Instance Embeddings (Oh et al., 2019), we achieve superior large- and open-set classification accuracy. Our method also aligns class-discriminating features with the axes of the embedding space, yielding an interpretable, disentangled representation. |
Tasks | Few-Shot Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rke2HRVYvH |
https://openreview.net/pdf?id=rke2HRVYvH | |
PWC | https://paperswithcode.com/paper/stochastic-prototype-embeddings-1 |
Repo | |
Framework | |
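The statement that "the prototype posterior is a product distribution over labeled instances" has a closed form when per-instance embeddings are Gaussians with diagonal covariance: precisions add and means are precision-weighted. The sketch below shows that computation; the flat prior and diagonal covariances are simplifying assumptions for illustration.

```python
# Prototype posterior as a product of diagonal Gaussians (illustrative sketch).
import numpy as np

def prototype_posterior(mus, variances):
    """mus, variances: arrays of shape (n_instances, dim)."""
    precisions = 1.0 / variances
    post_var = 1.0 / precisions.sum(axis=0)              # combined precision
    post_mu = post_var * (precisions * mus).sum(axis=0)  # precision-weighted mean
    return post_mu, post_var

# Three labeled support instances of one class, each an uncertain 2-d embedding.
mus = np.array([[1.0, 0.0], [1.2, 0.1], [0.9, -0.1]])
variances = np.array([[0.5, 0.5], [0.1, 0.1], [0.2, 0.2]])
mu, var = prototype_posterior(mus, variances)
print("prototype mean:", np.round(mu, 3), "variance:", np.round(var, 3))
```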
Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks
Title | Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks |
Authors | Anonymous |
Abstract | Handling missing data is one of the most fundamental problems in machine learning. Among many approaches, the simplest and most intuitive is zero imputation, which treats the value of a missing entry simply as zero. However, many studies have experimentally confirmed that zero imputation results in suboptimal performance when training neural networks. Yet, none of the existing work has explained what causes such performance degradation. In this paper, we introduce the variable sparsity problem (VSP), which describes a phenomenon where the output of a predictive model varies largely with the rate of missingness in the given input, and show that it adversely affects model performance. We first theoretically analyze this phenomenon and propose a simple yet effective technique to handle missingness, which we refer to as Sparsity Normalization (SN), that directly targets and resolves the VSP. We further experimentally validate SN on diverse benchmark datasets, showing that debiasing the effect of input-level sparsity improves performance and stabilizes the training of neural networks. |
Tasks | Imputation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BylsKkHYvH |
https://openreview.net/pdf?id=BylsKkHYvH | |
PWC | https://paperswithcode.com/paper/why-not-to-use-zero-imputation-correcting |
Repo | |
Framework | |
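A minimal sketch of the issue, and of a normalization in the spirit of the paper: with zero imputation, the pre-activation scale of a layer shrinks as more inputs are missing (the variable sparsity problem), so rescaling each example by the ratio of total features to observed features keeps that scale roughly constant. Reading Sparsity Normalization as this rescaling is my interpretation of the abstract; consult the paper for the exact definition.

```python
# Zero imputation vs. rescaled imputation (illustrative; the exact Sparsity
# Normalization rule is defined in the paper).
import numpy as np

rng = np.random.default_rng(0)
d = 100
x = rng.standard_normal(d)
w = rng.standard_normal(d)
full_activation = w @ x

for missing_rate in (0.0, 0.3, 0.6):
    zero_acts, norm_acts = [], []
    for _ in range(2000):
        observed = rng.random(d) >= missing_rate
        x_zero = np.where(observed, x, 0.0)           # zero imputation
        x_norm = x_zero * d / max(observed.sum(), 1)  # rescale by observed count
        zero_acts.append(w @ x_zero)
        norm_acts.append(w @ x_norm)
    # The zero-imputed activation shrinks with missingness; the rescaled one does not.
    print(f"missing={missing_rate:.1f}  full={full_activation:+.2f}  "
          f"zero-imputed mean={np.mean(zero_acts):+.2f}  "
          f"rescaled mean={np.mean(norm_acts):+.2f}")
```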
Learning Through Limited Self-Supervision: Improving Time-Series Classification Without Additional Data via Auxiliary Tasks
Title | Learning Through Limited Self-Supervision: Improving Time-Series Classification Without Additional Data via Auxiliary Tasks |
Authors | Anonymous |
Abstract | Self-supervision, in which a target task is improved without external supervision, has primarily been explored in settings that assume the availability of additional data. However, in many cases, particularly in healthcare, one may not have access to additional data (labeled or otherwise). In such settings, we hypothesize that self-supervision based solely on the structure of the data at-hand can help. We explore a novel self-supervision framework for time-series data, in which multiple auxiliary tasks (e.g., forecasting) are included to improve overall performance on a sequence-level target task without additional training data. We call this approach limited self-supervision, as we limit ourselves to only the data at-hand. We demonstrate the utility of limited self-supervision on three sequence-level classification tasks, two pertaining to real clinical data and one using synthetic data. Within this framework, we introduce novel forms of self-supervision and demonstrate their utility in improving performance on the target task. Our results indicate that limited self-supervision leads to a consistent improvement over a supervised baseline, across a range of domains. In particular, for the task of identifying atrial fibrillation from small amounts of electrocardiogram data, we observe a nearly 13% improvement in the area under the receiver operating characteristics curve (AUC-ROC) relative to the baseline (AUC-ROC=0.55 vs. AUC-ROC=0.62). Limited self-supervision applied to sequential data can aid in learning intermediate representations, making it particularly applicable in settings where data collection is difficult. |
Tasks | Time Series, Time Series Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJl5MeHKvB |
https://openreview.net/pdf?id=rJl5MeHKvB | |
PWC | https://paperswithcode.com/paper/learning-through-limited-self-supervision |
Repo | |
Framework | |
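The framework — one shared encoder, a sequence-level classification head, and auxiliary self-supervised heads such as forecasting, all trained on the same data — can be sketched with a small PyTorch model. The GRU encoder, the one-step forecasting auxiliary task, and the loss weighting below are illustrative assumptions, not the paper's exact architecture.

```python
# Sequence classifier with an auxiliary forecasting head (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSeqModel(nn.Module):
    def __init__(self, n_features, hidden, n_classes):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.classify = nn.Linear(hidden, n_classes)   # target task head
        self.forecast = nn.Linear(hidden, n_features)  # auxiliary head

    def forward(self, x):
        states, last = self.encoder(x)                 # states: (B, T, H)
        logits = self.classify(last.squeeze(0))        # sequence-level label
        next_step = self.forecast(states[:, :-1])      # predict x[:, 1:]
        return logits, next_step

model = MultiTaskSeqModel(n_features=3, hidden=16, n_classes=2)
x = torch.randn(8, 50, 3)                              # batch of ECG-like sequences
y = torch.randint(0, 2, (8,))

logits, next_step = model(x)
loss = F.cross_entropy(logits, y) + 0.5 * F.mse_loss(next_step, x[:, 1:])
loss.backward()
print(f"combined loss: {loss.item():.3f}")
```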
Deep Multi-View Learning via Task-Optimal CCA
Title | Deep Multi-View Learning via Task-Optimal CCA |
Authors | Anonymous |
Abstract | Canonical Correlation Analysis (CCA) is widely used for multimodal data analysis and, more recently, for discriminative tasks such as multi-view learning; however, it makes no use of class labels. Recent CCA methods have started to address this weakness but are limited in that they do not simultaneously optimize the CCA projection for discrimination and the CCA projection itself, or they are restricted to linear projections. We address these deficiencies by simultaneously optimizing a CCA-based and a task objective in an end-to-end manner. Together, these two objectives learn a non-linear CCA projection to a shared latent space that is highly correlated and discriminative. Our method shows a significant improvement over the previous state of the art (including deep supervised approaches) for cross-view classification (8.5% increase), regularization with a second view during training when only one view is available at test time (2.2-3.2%), and semi-supervised learning (15%) on real data. |
Tasks | Multi-View Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkgRW64twr |
https://openreview.net/pdf?id=SkgRW64twr | |
PWC | https://paperswithcode.com/paper/deep-multi-view-learning-via-task-optimal-cca-1 |
Repo | |
Framework | |
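The key idea — optimize a correlation objective between two projected views jointly with a task loss, end to end — can be shown as a single combined loss. The per-dimension Pearson correlation used here is a simplification of the full CCA objective (no whitening or orthogonality constraints), and the equal weighting of the two terms is an assumption.

```python
# Joint task + correlation objective over two views (simplified; not full CCA).
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_loss(z1, z2, eps=1e-8):
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    corr = (z1 * z2).mean(0)            # per-dimension Pearson correlation
    return 1.0 - corr.mean()            # minimized when the views are correlated

proj1 = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 5))
proj2 = nn.Sequential(nn.Linear(30, 10), nn.ReLU(), nn.Linear(10, 5))
classifier = nn.Linear(5, 3)            # task head on the shared latent space

x1, x2 = torch.randn(64, 20), torch.randn(64, 30)
y = torch.randint(0, 3, (64,))

z1, z2 = proj1(x1), proj2(x2)
loss = F.cross_entropy(classifier(z1), y) + correlation_loss(z1, z2)
loss.backward()
print(f"combined loss: {loss.item():.3f}")
```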
A Generalized Framework of Sequence Generation with Application to Undirected Sequence Models
Title | A Generalized Framework of Sequence Generation with Application to Undirected Sequence Models |
Authors | Anonymous |
Abstract | Undirected neural sequence models such as BERT (Devlin et al., 2019) have received renewed interest due to their success on discriminative natural language understanding tasks such as question-answering and natural language inference. The problem of generating sequences directly from these models has received relatively little attention, in part because generating from such models departs significantly from the conventional approach of monotonic generation in directed sequence models. We investigate this problem by first proposing a generalized model of sequence generation that unifies decoding in directed and undirected models. The proposed framework models the process of generation rather than a resulting sequence, and under this framework, we derive various neural sequence models as special cases, such as autoregressive, semi-autoregressive, and refinement-based non-autoregressive models. This unification enables us to adapt decoding algorithms originally developed for directed sequence models to undirected models. We demonstrate this by evaluating various decoding strategies for a cross-lingual masked translation model (Lample and Conneau, 2019). Our experiments show that generation from undirected sequence models, under our framework, is competitive with the state of the art on WMT’14 English-German translation. We also demonstrate that the proposed approach enables constant-time translation with similar performance to linear-time translation from the same model by rescoring hypotheses with an autoregressive model. |
Tasks | Natural Language Inference, Question Answering |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlbo6VtDH |
https://openreview.net/pdf?id=BJlbo6VtDH | |
PWC | https://paperswithcode.com/paper/a-generalized-framework-of-sequence-1 |
Repo | |
Framework | |
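One decoding strategy covered by such a framework is iterative mask-predict: start from a fully masked target, predict all positions, then repeatedly re-mask and re-predict the least confident ones. The sketch below shows only the control loop around a stand-in `predict` function (a hypothetical placeholder for a masked sequence model); it illustrates the schedule, not the paper's full generalized framework.

```python
# Mask-predict style iterative decoding loop (the `predict` model is a stub).
import numpy as np

VOCAB, MASK = 100, 0
rng = np.random.default_rng(0)

def predict(tokens):
    """Stand-in for an undirected (masked) sequence model: returns, for every
    position, a predicted token and a confidence in [0, 1]."""
    probs = rng.dirichlet(np.ones(VOCAB), size=len(tokens))
    return probs.argmax(axis=1), probs.max(axis=1)

def mask_predict(length, iterations=4):
    tokens = np.full(length, MASK)
    for it in range(iterations):
        preds, conf = predict(tokens)
        tokens = preds.copy()
        # Linearly decaying number of positions to re-mask each iteration.
        n_mask = int(length * (iterations - 1 - it) / iterations)
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = MASK  # re-mask least confident
    return tokens

print(mask_predict(length=8))
```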
Reducing Sentiment Bias in Language Models via Counterfactual Evaluation
Title | Reducing Sentiment Bias in Language Models via Counterfactual Evaluation |
Authors | Anonymous |
Abstract | Recent improvements in large-scale language models have driven progress on automatic generation of syntactically and semantically consistent text for many real-world applications. Many of these advances leverage the availability of large corpora. While training on such corpora encourages the model to understand long-range dependencies in text, it can also result in the models internalizing the social biases present in the corpora. This paper aims to quantify and reduce biases exhibited by language models. Given a conditioning context (e.g. a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g. country names, occupations, genders, etc.) in the conditioning context, a.k.a. counterfactual evaluation. We quantify these biases by adapting individual and group fairness metrics from the fair machine learning literature. Extensive evaluation on two different corpora (news articles and Wikipedia) shows that state-of-the-art Transformer-based language models exhibit biases learned from data. We propose embedding-similarity and sentiment-similarity regularization methods that improve both individual and group fairness metrics without sacrificing perplexity and semantic similarity—a positive step toward development and deployment of fairer language models for real-world applications. |
Tasks | Language Modelling, Semantic Similarity, Semantic Textual Similarity |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1l2IyrYPr |
https://openreview.net/pdf?id=S1l2IyrYPr | |
PWC | https://paperswithcode.com/paper/reducing-sentiment-bias-in-language-models |
Repo | |
Framework | |
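The counterfactual evaluation protocol can be sketched directly: instantiate the same prompt template with different values of a sensitive attribute, score sampled continuations with a sentiment classifier, and compare the resulting score distributions. The `generate` and `sentiment` functions below are hypothetical stubs standing in for a language model and a sentiment classifier, and the Wasserstein distance is one of several possible gap measures rather than the paper's prescribed metric.

```python
# Counterfactual sentiment-gap evaluation (model and classifier are stubs).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def generate(prompt, n=200):
    # Stand-in for sampling n continuations from a language model.
    return [f"{prompt} ... sample {i}" for i in range(n)]

def sentiment(text):
    # Stand-in for a sentiment classifier returning a score in [0, 1].
    return float(rng.beta(2, 2))

template = "My friend works as a {occupation} and yesterday"
scores = {}
for occupation in ("nurse", "engineer"):
    prompt = template.format(occupation=occupation)
    scores[occupation] = [sentiment(t) for t in generate(prompt)]

gap = wasserstein_distance(scores["nurse"], scores["engineer"])
print(f"counterfactual sentiment gap (Wasserstein): {gap:.3f}")
```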