Paper Group NANR 140
Learning Effective Exploration Strategies For Contextual Bandits. Efficient Content-Based Sparse Attention with Routing Transformers. On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective. EINS: Long Short-Term Memory with Extrapolated Input Network Simplification. Improving Neural Language Generation with Spectrum Con …
Learning Effective Exploration Strategies For Contextual Bandits
Title | Learning Effective Exploration Strategies For Contextual Bandits |
Authors | Anonymous |
Abstract | In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, MELEE, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. MELEE uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate on a natural contextual bandit problem derived from a learning-to-rank dataset as well as on hundreds of simulated contextual bandit problems derived from classification tasks. MELEE outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy. |
Tasks | Imitation Learning, Learning-To-Rank, Meta-Learning, Multi-Armed Bandits |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bkgk624KDB |
PDF | https://openreview.net/pdf?id=Bkgk624KDB |
PWC | https://paperswithcode.com/paper/learning-effective-exploration-strategies-for |
Repo | |
Framework | |
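The abstract above describes learning an exploration policy rather than hand-designing one. As a point of reference, here is a minimal, hypothetical sketch of such a contextual bandit loop in Python: per-arm reward models are updated only for the arm actually pulled, and an `exploration_policy` function stands in for the policy MELEE would learn by imitation on synthetic tasks (the epsilon-greedy placeholder, the linear reward models, and all names are illustrative assumptions, not the paper's algorithm).

```python
# Hypothetical sketch of a contextual bandit loop with a pluggable exploration policy.
import numpy as np

rng = np.random.default_rng(0)
n_arms, dim, horizon = 5, 10, 1000

# Per-arm linear reward models (ridge-regression style estimates).
A = [np.eye(dim) for _ in range(n_arms)]      # Gram matrices
b = [np.zeros(dim) for _ in range(n_arms)]    # reward-weighted context sums

def arm_estimates(x):
    """Predicted reward for each arm given context x."""
    return np.array([np.linalg.solve(A[k], b[k]) @ x for k in range(n_arms)])

def exploration_policy(estimates, t):
    """Placeholder for the learned exploration policy: simple epsilon-greedy
    with decaying epsilon. MELEE would instead map a feature vector
    (estimates, counts, time step, ...) to an arm choice."""
    if rng.random() < 1.0 / np.sqrt(t + 1):
        return int(rng.integers(n_arms))
    return int(np.argmax(estimates))

true_theta = rng.normal(size=(n_arms, dim))   # synthetic ground-truth rewards

for t in range(horizon):
    x = rng.normal(size=dim)                    # observe a context
    k = exploration_policy(arm_estimates(x), t) # choose an arm
    r = true_theta[k] @ x + 0.1 * rng.normal()  # bandit feedback: chosen arm only
    A[k] += np.outer(x, x)                      # update only that arm's model
    b[k] += r * x
```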
Efficient Content-Based Sparse Attention with Routing Transformers
Title | Efficient Content-Based Sparse Attention with Routing Transformers |
Authors | Anonymous |
Abstract | Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity have focused on attending to local sliding windows or to a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to O(n^{1.5}d) from O(n^2d) for sequence length n and hidden dimension d. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity) as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. Code will be open-sourced on acceptance. |
Tasks | Image Generation, Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1gjs6EtDr |
PDF | https://openreview.net/pdf?id=B1gjs6EtDr |
PWC | https://paperswithcode.com/paper/efficient-content-based-sparse-attention-with |
Repo | |
Framework | |
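To make the routing idea from the abstract concrete, here is a rough numpy sketch: queries and keys are assigned to clusters (a single nearest-centroid assignment to random centroids stands in for the paper's online k-means), and each query attends only to keys in its own cluster. With roughly sqrt(n) clusters of roughly sqrt(n) members each, the attention cost drops from O(n^2 d) toward O(n^{1.5} d). This is an illustrative sketch under those assumptions, not the Routing Transformer implementation.

```python
# Sketch of content-based sparse attention via cluster routing (illustrative only).
import numpy as np

def routed_attention(Q, K, V, n_clusters, rng):
    n, d = Q.shape
    centroids = rng.normal(size=(n_clusters, d))   # stand-in for learned/online centroids
    q_cluster = np.argmin(((Q[:, None] - centroids) ** 2).sum(-1), axis=1)
    k_cluster = np.argmin(((K[:, None] - centroids) ** 2).sum(-1), axis=1)

    out = np.zeros_like(V)
    for c in range(n_clusters):
        qi = np.where(q_cluster == c)[0]
        ki = np.where(k_cluster == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        scores = Q[qi] @ K[ki].T / np.sqrt(d)       # attention restricted to the cluster
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[qi] = weights @ V[ki]
    return out

rng = np.random.default_rng(0)
n, d = 256, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
y = routed_attention(Q, K, V, n_clusters=int(np.sqrt(n)), rng=rng)
```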
On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective
Title | On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective |
Authors | Anonymous |
Abstract | This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piecewise-linear activations. We use tropical geometry, a new development in the area of algebraic geometry, to provide a characterization of the decision boundaries of a simple neural network of the form (Affine, ReLU, Affine). Specifically, we show that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of the zonotopes are precise functions of the neural network parameters. We utilize this geometric characterization to shed new light on three tasks. In doing so, we propose a new tropical perspective on the lottery ticket hypothesis, where we examine the effect of different initializations on the tropical geometric representation of the decision boundaries. We also leverage this characterization to derive a new set of tropical regularizers, which deal directly with the decision boundaries of a network. We investigate the use of these regularizers in neural network pruning (removing network parameters that do not contribute to the tropical geometric representation of the decision boundaries) and in generating adversarial input attacks (with input perturbations that explicitly perturb the decision boundary geometry to change the network's prediction of the input). |
Tasks | Network Pruning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BylldnNFwS |
PDF | https://openreview.net/pdf?id=BylldnNFwS |
PWC | https://paperswithcode.com/paper/on-the-decision-boundaries-of-deep-neural |
Repo | |
Framework | |
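For readers unfamiliar with tropical geometry, the characterization in the abstract above rests on standard background facts, sketched here in LaTeX; the precise zonotope construction for the decision boundary is in the paper itself, and the decomposition of a ReLU network into a difference of tropical polynomials is stated here under the usual assumption of (suitably rescaled) integer weights.

```latex
% Tropical background: the semiring, tropical polynomials, and where an
% (Affine, ReLU, Affine) network fits in. Background only, not the paper's construction.
\[
  a \oplus b := \max(a, b), \qquad a \odot b := a + b
  \qquad \text{(tropical addition and multiplication)}
\]
\[
  p(\mathbf{x}) \;=\; \bigoplus_{i} c_i \odot \mathbf{x}^{\mathbf{a}_i}
  \;=\; \max_{i}\bigl(c_i + \mathbf{a}_i^{\top}\mathbf{x}\bigr)
  \qquad \text{(a tropical polynomial: a max of affine functions)}
\]
\[
  f(\mathbf{x}) \;=\; B\,\max\bigl(A\mathbf{x} + \mathbf{c}_1,\, \mathbf{0}\bigr) + \mathbf{c}_2
  \;=\; H(\mathbf{x}) - G(\mathbf{x}),
\]
% where, for suitably rescaled integer weights, $H$ and $G$ are tropical
% polynomials; the decision boundary $\{\mathbf{x} : f_1(\mathbf{x}) = f_2(\mathbf{x})\}$
% then lies inside a tropical hypersurface whose associated polytope is, per the
% abstract, the convex hull of two zonotopes with generators determined by $A$ and $B$.
```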
EINS: Long Short-Term Memory with Extrapolated Input Network Simplification
Title | EINS: Long Short-Term Memory with Extrapolated Input Network Simplification |
Authors | Anonymous |
Abstract | This paper contrasts the two canonical recurrent neural networks (RNNs), the long short-term memory (LSTM) and the gated recurrent unit (GRU), to propose our novel lightweight RNN, Extrapolated Input for Network Simplification (EINS). We treat LSTMs and GRUs as differential equations, and our analysis highlights several auxiliary components in the standard LSTM design that are secondary in importance. Guided by these insights, we present a design that abandons the LSTM redundancies, thereby introducing EINS. We test EINS against the LSTM over a carefully chosen range of tasks, from language modelling and medical data imputation-prediction, through a sentence-level variational autoencoder and image generation, to learning to learn to optimise another neural network. Despite having both a simpler design and fewer parameters, EINS performs comparably to, or better than, the LSTM on each task. |
Tasks | Image Generation, Imputation, Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1l5m6VFwr |
PDF | https://openreview.net/pdf?id=B1l5m6VFwr |
PWC | https://paperswithcode.com/paper/eins-long-short-term-memory-with-extrapolated |
Repo | |
Framework | |
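The abstract does not spell out the EINS equations, so as a reference point here is a minimal numpy sketch of the standard LSTM cell that the paper analyses and simplifies: three gates (input, forget, output) plus a candidate cell state. EINS, per the abstract, removes the components of this design that the authors' differential-equation analysis identifies as secondary; which components those are is not stated here, so only the baseline cell is shown.

```python
# Standard LSTM cell in numpy, shown as the baseline design the paper simplifies.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D) input weights, U: (4H, H) recurrent weights,
    b: (4H,) biases; blocks are ordered [input, forget, output, candidate]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate cell state
    c_new = f * c + i * g      # cell update
    h_new = o * np.tanh(c_new) # hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 8, 16
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # run a short toy sequence
    h, c = lstm_cell(x, h, c, W, U, b)
```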
Improving Neural Language Generation with Spectrum Control
Title | Improving Neural Language Generation with Spectrum Control |
Authors | Anonymous |
Abstract | Recent Transformer-based models such as Transformer-XL and BERT have achieved huge success on various natural language processing tasks. However, contextualized embeddings at the output layer of these powerful models tend to degenerate and occupy an anisotropic cone in the vector space, which is called the representation degeneration problem. In this paper, we propose a novel spectrum control approach to address this degeneration problem. The core idea of our method is to directly guide the spectrum of the output embedding matrix during training, imposing a slow-decaying singular value prior distribution through a reparameterization framework. We show that our proposed method encourages isotropy of the learned word representations while maintaining the modeling power of these contextual neural models. We further provide theoretical analysis of, and insight into, the benefit of modeling the singular value distribution. We demonstrate that our spectrum control method outperforms the state-of-the-art Transformer-XL model on language modeling, and various Transformer-based models on machine translation, on common benchmark datasets for these tasks. |
Tasks | Language Modelling, Machine Translation, Text Generation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByxY8CNtvr |
PDF | https://openreview.net/pdf?id=ByxY8CNtvr |
PWC | https://paperswithcode.com/paper/improving-neural-language-generation-with |
Repo | |
Framework | |
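As a hedged illustration of the reparameterization idea in the abstract (not the paper's exact construction), one can learn the factors of the output embedding and prescribe the singular values to follow a slowly decaying prior, so the matrix cannot collapse into a narrow anisotropic cone. The decay form `s_k ∝ (k+1)^{-alpha}`, the QR-based orthogonalisation, and all names below are assumptions made for the sketch.

```python
# Sketch: embedding matrix reparameterized with a prescribed singular-value decay.
import torch

vocab, dim = 1000, 64
rank = dim

# Assumed slow-decaying singular-value prior: s_k proportional to (k+1)^(-alpha).
alpha = 0.3
s = (torch.arange(rank, dtype=torch.float32) + 1.0) ** (-alpha)

U = torch.nn.Parameter(torch.randn(vocab, rank) * 0.02)
V = torch.nn.Parameter(torch.randn(dim, rank) * 0.02)

def embedding_matrix():
    # Orthogonalise the learned factors so that s really acts as the spectrum,
    # then scale by the prescribed singular values.
    Qu, _ = torch.linalg.qr(U)           # (vocab, rank), orthonormal columns
    Qv, _ = torch.linalg.qr(V)           # (dim, rank), orthonormal columns
    return Qu @ torch.diag(s) @ Qv.T     # (vocab, dim) with spectrum equal to s

W = embedding_matrix()
print(torch.linalg.svdvals(W)[:5])       # leading singular values follow the prior
```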
On the Variance of the Adaptive Learning Rate and Beyond
Title | On the Variance of the Adaptive Learning Rate and Beyond |
Authors | Anonymous |
Abstract | The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Pursuing the theory behind warmup, we identify a problem with the adaptive learning rate: its variance is problematically large in the early stages of training, and we presume that warmup works as a variance reduction technique. We provide both empirical and theoretical evidence to verify this hypothesis. We further propose Rectified Adam (RAdam), a novel variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the efficacy and robustness of RAdam. |
Tasks | Image Classification, Language Modelling, Machine Translation, Stochastic Optimization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkgz2aEKDr |
PDF | https://openreview.net/pdf?id=rkgz2aEKDr |
PWC | https://paperswithcode.com/paper/on-the-variance-of-the-adaptive-learning-rate-1 |
Repo | |
Framework | |
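Below is a numpy sketch of the RAdam update described above, following the published rectification rule as I understand it: when the estimated variance of the adaptive learning rate is not yet tractable (rho_t <= 4), the step falls back to a plain bias-corrected momentum update; otherwise the Adam step is scaled by a rectification factor r_t < 1 that shrinks early, high-variance updates. The toy quadratic objective is only a usage example; recent PyTorch releases also ship a built-in torch.optim.RAdam.

```python
# Single RAdam update step (sketch of the published algorithm).
import numpy as np

def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    m = beta1 * m + (1.0 - beta1) * grad           # first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2      # second moment
    m_hat = m / (1.0 - beta1 ** t)                 # bias-corrected momentum
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t > 4.0:                                # variance of adaptive lr is tractable
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                      / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        v_hat = np.sqrt(v / (1.0 - beta2 ** t))
        theta = theta - lr * r_t * m_hat / (v_hat + eps)
    else:                                          # early steps: un-adapted momentum update
        theta = theta - lr * m_hat
    return theta, m, v

# Usage on a toy quadratic objective f(theta) = ||theta||^2 / 2, so grad = theta.
theta = np.ones(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, v = radam_step(theta, grad=theta, m=m, v=v, t=t)
```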
Generalizing Natural Language Analysis through Span-relation Representations
Title | Generalizing Natural Language Analysis through Span-relation Representations |
Authors | Anonymous |
Abstract | A large number of natural language processing tasks exist to analyze the syntax, semantics, and information content of human language. These seemingly very different tasks are usually solved by specially designed architectures. In this paper, we provide the simple insight that a great variety of tasks can be represented in a single unified format consisting of labeled spans and relations between spans, and thus a single task-independent model can be used across different tasks. We perform extensive experiments to test this insight on 10 disparate tasks, spanning dependency parsing (syntax), semantic role labeling (semantics), relation extraction (information content), aspect-based sentiment analysis (sentiment), and many others, achieving performance comparable to state-of-the-art specialized models. We further demonstrate benefits in multi-task learning. We convert these datasets into a unified format to build a benchmark, which provides a holistic testbed for evaluating future models for generalized natural language analysis. |
Tasks | Aspect-Based Sentiment Analysis, Dependency Parsing, Multi-Task Learning, Relation Extraction, Semantic Role Labeling, Sentiment Analysis |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1x_K04YwS |
PDF | https://openreview.net/pdf?id=B1x_K04YwS |
PWC | https://paperswithcode.com/paper/generalizing-natural-language-analysis |
Repo | |
Framework | |
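To illustrate the unified format described in the abstract, here is a hypothetical Python sketch of a span-relation data structure: every task instance is a token sequence plus labeled spans and labeled relations between spans, so semantic role labeling and relation extraction (and similarly dependency parsing, coreference, aspect-based sentiment, etc.) share one representation. The class and label names are illustrative, not the paper's schema.

```python
# Hypothetical unified span-relation format: one structure for many NLP tasks.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    start: int          # token index (inclusive)
    end: int            # token index (exclusive)
    label: str          # e.g. "PRED", "ARG0", an entity type, an aspect term, ...

@dataclass
class Relation:
    head: int           # index into `spans`
    tail: int           # index into `spans`
    label: str          # e.g. a semantic role, a relation type, a dependency label

@dataclass
class Example:
    tokens: List[str]
    spans: List[Span] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# Semantic role labeling and relation extraction expressed as the same structure:
srl = Example(
    tokens="The cat chased the mouse".split(),
    spans=[Span(2, 3, "PRED"), Span(0, 2, "ARG0"), Span(3, 5, "ARG1")],
    relations=[Relation(0, 1, "ARG0"), Relation(0, 2, "ARG1")],
)
rel_ex = Example(
    tokens="Marie Curie was born in Warsaw".split(),
    spans=[Span(0, 2, "PERSON"), Span(5, 6, "LOCATION")],
    relations=[Relation(0, 1, "born_in")],
)
```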