April 1, 2020

Paper Group NANR 115

Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning

Title Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning
Authors Anonymous
Abstract Solving tasks with sparse rewards is one of the most important challenges in reinforcement learning. In the single-agent setting, this challenge has been addressed by introducing intrinsic rewards that motivate agents to explore unseen regions of their state spaces. Applying these techniques naively to the multi-agent setting results in agents exploring independently, without any coordination among themselves. We argue that learning in cooperative multi-agent settings can be accelerated and improved if agents coordinate with respect to what they have explored. In this paper we propose an approach for learning how to dynamically select between different types of intrinsic rewards which consider not just what an individual agent has explored, but all agents, such that the agents can coordinate their exploration and maximize extrinsic returns. Concretely, we formulate the approach as a hierarchical policy where a high-level controller selects among sets of policies trained on different types of intrinsic rewards and the low-level controllers learn the action policies of all agents under these specific rewards. We demonstrate the effectiveness of the proposed approach in a multi-agent gridworld domain with sparse rewards, and then show that our method scales up to more complex settings by evaluating on the VizDoom platform.
Tasks Multi-agent Reinforcement Learning
Published 2020-01-01
URL https://openreview.net/forum?id=rkltE0VKwH
PDF https://openreview.net/pdf?id=rkltE0VKwH
PWC https://paperswithcode.com/paper/coordinated-exploration-via-intrinsic-rewards-1
Repo
Framework

Provable Convergence and Global Optimality of Generative Adversarial Network

Title Provable Convergence and Global Optimality of Generative Adversarial Network
Authors Anonymous
Abstract Generative adversarial networks (GANs) train implicit generative models by solving minimax problems. Such minimax problems are known to be nonconvex-nonconcave, and the dynamics of first-order methods on them are not well understood. In this paper, we consider GANs of the integral probability metric (IPM) type, with the generator represented by an overparametrized neural network. When the discriminator is solved to approximate optimality in each iteration, we prove that stochastic gradient descent on a regularized IPM objective converges globally to a stationary point at a sublinear rate. Moreover, we prove that when the width of the generator network is sufficiently large and the discriminator function class has enough discriminative ability, the obtained stationary point corresponds to a generator that yields a distribution close to the distribution of the observed data in terms of total variation. To the best of our knowledge, we are the first to establish both the global convergence and the global optimality of training GANs when the generator is parametrized by a neural network.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1lnZlHYDS
PDF https://openreview.net/pdf?id=H1lnZlHYDS
PWC https://paperswithcode.com/paper/provable-convergence-and-global-optimality-of
Repo
Framework

Representation Learning with Multisets

Title Representation Learning with Multisets
Authors Anonymous
Abstract We study the problem of learning permutation invariant representations that can capture containment relations. We propose training a model on a novel task: predicting the size of the symmetric difference between pairs of multisets, sets which may contain multiple copies of the same object. With motivation from fuzzy set theory, we formulate both multiset representations and how to predict symmetric difference sizes given these representations. We model multiset elements as vectors on the standard simplex and multisets as the summations of such vectors, and we predict symmetric difference as the l1-distance between multiset representations. We demonstrate that our representations more effectively predict the sizes of symmetric differences than DeepSets-based approaches with unconstrained object representations. Furthermore, we demonstrate that the model learns meaningful representations, mapping objects of different classes to different standard basis vectors.
Tasks Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=H1eUz1rKPr
PDF https://openreview.net/pdf?id=H1eUz1rKPr
PWC https://paperswithcode.com/paper/representation-learning-with-multisets
Repo
Framework
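
The representation described in this abstract can be illustrated with a small sketch. It assumes the idealized case the abstract reports as the learned outcome, where each object class maps to a distinct standard basis vector; the class names and the two example multisets are made up for illustration, and nothing here is the authors' trained model. Under that assumption, the l1-distance between summed element vectors equals the exact symmetric difference size.

```python
from collections import Counter

def one_hot(index, dim):
    """Standard basis vector e_index, a vertex of the standard simplex."""
    return [1.0 if i == index else 0.0 for i in range(dim)]

def multiset_repr(element_vectors, dim):
    """Represent a multiset as the sum of its element vectors."""
    total = [0.0] * dim
    for vec in element_vectors:
        for i, value in enumerate(vec):
            total[i] += value
    return total

def l1_distance(a, b):
    """l1-distance between two representations of equal dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def symmetric_difference_size(a, b):
    """Exact multiset symmetric difference size, computed from counts."""
    count_a, count_b = Counter(a), Counter(b)
    return sum(abs(count_a[k] - count_b[k]) for k in set(count_a) | set(count_b))

# Hypothetical object classes and multisets (illustration only).
CLASSES = ["cat", "dog", "bird"]
A = ["cat", "cat", "dog"]
B = ["cat", "dog", "dog", "bird"]

dim = len(CLASSES)
rep_a = multiset_repr([one_hot(CLASSES.index(x), dim) for x in A], dim)
rep_b = multiset_repr([one_hot(CLASSES.index(x), dim) for x in B], dim)

predicted = l1_distance(rep_a, rep_b)
exact = symmetric_difference_size(A, B)
print(predicted, exact)
```

In a learned model the element vectors would be soft points on the simplex rather than exact one-hot vectors, so the l1-distance would only approximate the true size.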

Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks

Title Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks
Authors Anonymous
Abstract Stochastic gradient descent (SGD) has been the dominant optimization method for training deep neural networks due to its many desirable properties. One of the more remarkable and least understood qualities of SGD is that it generalizes relatively well on unseen data even when the neural network has millions of parameters. We hypothesize that in certain cases it is desirable to relax its intrinsic generalization properties and introduce an extension of SGD called deep gradient boosting (DGB). The key idea of DGB is that back-propagated gradients inferred using the chain rule can be viewed as pseudo-residual targets of a gradient boosting problem. Thus, at each layer of a neural network, the weight update is calculated by solving the corresponding boosting problem using a linear base learner. The resulting weight update formula can also be viewed as a normalization procedure of the data that arrives at each layer during the forward pass. When implemented as a separate input normalization layer (INN), the new architecture shows improved performance on image recognition tasks when compared to the same architecture without normalization layers. As opposed to batch normalization (BN), INN has no learnable parameters; however, it matches BN's performance on CIFAR10 and ImageNet classification tasks.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BkxzsT4Yvr
PDF https://openreview.net/pdf?id=BkxzsT4Yvr
PWC https://paperswithcode.com/paper/deep-gradient-boosting-layer-wise-input
Repo
Framework

Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization

Title Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization
Authors Anonymous
Abstract Vanilla RNNs with ReLU activation have a simple structure that is amenable to systematic dynamical systems analysis and interpretation, but they suffer from the exploding vs. vanishing gradients problem. Recent attempts to retain this simplicity while alleviating the gradient problem are based on proper initialization schemes or orthogonality/unitary constraints on the RNN's recurrence matrix, which, however, come with limitations to its expressive power with regard to dynamical systems phenomena like chaos or multi-stability. Here, we instead suggest a regularization scheme that pushes part of the RNN's latent subspace toward a line attractor configuration that enables long short-term memory and arbitrarily slow time scales. We show that our approach excels on a number of benchmarks like the sequential MNIST or multiplication problems, and enables reconstruction of dynamical systems which harbor widely different time scales.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rylZKTNYPr
PDF https://openreview.net/pdf?id=rylZKTNYPr
PWC https://paperswithcode.com/paper/inferring-dynamical-systems-with-long-range-1
Repo
Framework

SoftAdam: Unifying SGD and Adam for better stochastic gradient descent

Title SoftAdam: Unifying SGD and Adam for better stochastic gradient descent
Authors Anonymous
Abstract Stochastic gradient descent (SGD) and Adam are commonly used to optimize deep neural networks, but choosing one usually means making tradeoffs between speed, accuracy and stability. Here we present an intuition for why the tradeoffs exist as well as a method for unifying the two in a continuous way. This makes it possible to control the way models are trained in much greater detail. We show that for default parameters, the new algorithm equals or outperforms SGD and Adam across a range of models for image classification tasks and outperforms SGD for language modeling tasks.
Tasks Image Classification, Language Modelling
Published 2020-01-01
URL https://openreview.net/forum?id=Skgfr1rYDH
PDF https://openreview.net/pdf?id=Skgfr1rYDH
PWC https://paperswithcode.com/paper/softadam-unifying-sgd-and-adam-for-better
Repo
Framework

BERT-AL: BERT for Arbitrarily Long Document Understanding

Title BERT-AL: BERT for Arbitrarily Long Document Understanding
Authors Ruixuan Zhang, Zhuoyu Wei, Yu Shi, Yining Chen
Abstract Pretrained language models attract a lot of attention, and they take advantage of a two-stage training process: pretraining on a huge corpus and finetuning on specific tasks. Among them, BERT (Devlin et al., 2019) is a Transformer (Vaswani et al., 2017) based model and has been the state of the art for many kinds of Natural Language Processing (NLP) tasks. However, BERT cannot take text longer than the maximum length as input, since the maximum length is predefined during pretraining. When we apply BERT to long text tasks, e.g., document-level text summarization: 1) Truncating inputs to the maximum sequence length will decrease performance, since the model cannot capture long dependencies and global information spanning the whole document. 2) Extending the maximum length requires re-pretraining, which will cost a great deal of time and computing resources. What's even worse is that the computational complexity will increase quadratically with the length, which will result in an unacceptable training time. To resolve these problems, we propose to apply the Transformer to model only local dependencies and to capture long dependencies recurrently by inserting a multi-channel LSTM into each layer of BERT. The proposed model is named BERT-AL (BERT for Arbitrarily Long Document Understanding), and it can accept arbitrarily long input without re-pretraining from scratch. We demonstrate BERT-AL's effectiveness on text summarization by conducting experiments on the CNN/Daily Mail dataset. Furthermore, our method can be adapted to other Transformer based models, e.g., XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), for various NLP tasks with long text.
Tasks Text Summarization
Published 2020-01-01
URL https://openreview.net/forum?id=SklnVAEFDB
PDF https://openreview.net/pdf?id=SklnVAEFDB
PWC https://paperswithcode.com/paper/bert-al-bert-for-arbitrarily-long-document
Repo
Framework
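
The quadratic-cost argument in this abstract can be made concrete with a little arithmetic. The sketch below counts query-key pairs for full self-attention versus attention restricted to fixed-size local chunks. The document length of 4096 tokens and the chunk length of 512 are illustrative assumptions, and the function names are hypothetical. It does not model the multi-channel LSTM that BERT-AL uses to carry information between chunks.

```python
def full_attention_pairs(seq_len):
    """Query-key pairs scored by full self-attention over the whole input."""
    return seq_len * seq_len

def chunked_attention_pairs(seq_len, chunk_len):
    """Query-key pairs when attention is restricted to fixed-size chunks.

    The last chunk may be shorter than chunk_len.
    """
    full_chunks, remainder = divmod(seq_len, chunk_len)
    return full_chunks * chunk_len * chunk_len + remainder * remainder

seq_len = 4096    # assumed document length in tokens
chunk_len = 512   # assumed local window size

full = full_attention_pairs(seq_len)
chunked = chunked_attention_pairs(seq_len, chunk_len)
print(full, chunked, full / chunked)
```

With evenly divisible lengths the chunked cost is `seq_len * chunk_len`, so it grows linearly in the document length for a fixed chunk size, while the full cost grows quadratically.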

Learning Good Policies By Learning Good Perceptual Models

Title Learning Good Policies By Learning Good Perceptual Models
Authors Anonymous
Abstract Reinforcement learning (RL) has led to increasingly complex-looking behavior in recent years. However, such complexity can be misleading and can hide over-fitting. We find that visual representations may be a useful metric of complexity, and that they both correlate well with objective optimization and causally affect reward optimization. We then propose curious representation learning (CRL), which allows us to use better visual representation learning algorithms to correspondingly improve the visual representation in a policy through an intrinsic objective, both on simulated environments and in transfer to real images. Finally, we show that the better visual representations induced by CRL allow us to obtain better performance on Atari without any reward than other curiosity objectives.
Tasks Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=HkgYEyrFDr
PDF https://openreview.net/pdf?id=HkgYEyrFDr
PWC https://paperswithcode.com/paper/learning-good-policies-by-learning-good
Repo
Framework

The Implicit Bias of Depth: How Incremental Learning Drives Generalization

Title The Implicit Bias of Depth: How Incremental Learning Drives Generalization
Authors Anonymous
Abstract A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1lj0nNFwB
PDF https://openreview.net/pdf?id=H1lj0nNFwB
PWC https://paperswithcode.com/paper/the-implicit-bias-of-depth-how-incremental-1
Repo
Framework

TPO: TREE SEARCH POLICY OPTIMIZATION FOR CONTINUOUS ACTION SPACES

Title TPO: TREE SEARCH POLICY OPTIMIZATION FOR CONTINUOUS ACTION SPACES
Authors Amir Yazdanbakhsh, Ebrahim Songhori, Robert Ormandi, Anna Goldie, Azalia Mirhoseini
Abstract Monte Carlo Tree Search (MCTS) has achieved impressive results on a range of discrete environments, such as Go, Mario and Arcade games, but it has not yet fulfilled its true potential in continuous domains. In this work, we introduce TPO, a tree search based policy optimization method for continuous environments. TPO takes a hybrid approach to policy optimization. Building the MCTS tree in a continuous action space and updating the policy gradient using off-policy MCTS trajectories are non-trivial. To overcome these challenges, we propose limiting the tree search branching factor by drawing only a few action samples from the policy distribution, and we define a new loss function based on the trajectories' mean and standard deviations. Our approach led to some non-intuitive findings. MCTS training generally requires a large number of samples and simulations. However, we observed that bootstrapping tree search with a pre-trained policy allows us to achieve high quality results with a low MCTS branching factor and a small number of simulations. Without the proposed policy bootstrapping, continuous MCTS would require a much larger branching factor and simulation count, rendering it computationally prohibitively expensive. In our experiments, we use PPO as our baseline policy optimization algorithm. TPO significantly improves the policy on nearly all of our benchmarks. For example, in complex environments such as Humanoid, we achieve a 2.5× improvement over the baseline algorithm.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJew70NYvH
PDF https://openreview.net/pdf?id=HJew70NYvH
PWC https://paperswithcode.com/paper/tpo-tree-search-policy-optimization-for
Repo
Framework

GraphNVP: an Invertible Flow-based Model for Generating Molecular Graphs

Title GraphNVP: an Invertible Flow-based Model for Generating Molecular Graphs
Authors Anonymous
Abstract We propose GraphNVP, an invertible flow-based molecular graph generation model. Existing flow-based models only handle node attributes of a graph with invertible maps. In contrast, our model is the first invertible model for the whole graph: both the dequantized node attributes and the adjacency tensor are converted into latent vectors through two novel invertible flows. This decomposition yields exact likelihood maximization on graph-structured data. We decompose the generation of a graph into two steps: generation of (i) an adjacency tensor and (ii) node attributes. We empirically demonstrate that our model and the two-step generation efficiently generate valid molecular graphs with almost no duplicated molecules, although there are no domain-specific heuristics ingrained in the model. We also confirm that the sampling (generation) of graphs is faster in magnitude than other models in our implementation. In addition, we observe that the learned latent space can be used to generate molecules with desired chemical properties.
Tasks Graph Generation
Published 2020-01-01
URL https://openreview.net/forum?id=ryxQ6T4YwB
PDF https://openreview.net/pdf?id=ryxQ6T4YwB
PWC https://paperswithcode.com/paper/graphnvp-an-invertible-flow-based-model-for
Repo
Framework

Deep neuroethology of a virtual rodent

Title Deep neuroethology of a virtual rodent
Authors Anonymous
Abstract Parallel developments in neuroscience and deep learning have led to mutually productive exchanges, pushing our understanding of real and artificial neural networks in sensory and cognitive systems. However, this interaction between fields is less developed in the study of motor control. Existing experimental research and neural network models have been focused on the production of individual behaviors, yielding little insight into how intelligent systems can produce a rich and varied set of motor behaviors. In this work we develop a virtual rodent that learns to flexibly apply a broad motor repertoire, including righting, running, leaping and rearing, to solve multiple tasks in a simulated world. We analyze the artificial neural mechanisms underlying the virtual rodent’s motor capabilities using a neuroethological approach, where we characterize neural activity patterns relative to the rodent’s behavior and goals. We show that the rodent solves tasks by using a shared set of force patterns that are orchestrated into task-specific behaviors over longer timescales. Through methods familiar to neuroscientists, including representational similarity analysis, dimensionality reduction techniques, and targeted perturbations, we show that the networks produce these behaviors using at least two classes of behavioral representations, one that explicitly encodes behavioral kinematics in a task-invariant manner, and a second that encodes task-specific behavioral strategies. Overall, the virtual rat promises to facilitate grounded collaborations between deep reinforcement learning and motor neuroscience.
Tasks Dimensionality Reduction
Published 2020-01-01
URL https://openreview.net/forum?id=SyxrxR4KPS
PDF https://openreview.net/pdf?id=SyxrxR4KPS
PWC https://paperswithcode.com/paper/deep-neuroethology-of-a-virtual-rodent
Repo
Framework

Domain-Agnostic Few-Shot Classification by Learning Disparate Modulators

Title Domain-Agnostic Few-Shot Classification by Learning Disparate Modulators
Authors Yongseok Choi, Junyoung Park, Subin Yi, Dong-Yeon Cho
Abstract Although few-shot learning research has advanced rapidly with the help of meta-learning, its practical usefulness is still limited because most research has assumed that all meta-training and meta-testing examples come from a single domain. We propose a simple but effective approach to few-shot classification in which a task distribution spans multiple domains, including ones previously unseen during meta-training. The key idea is to build a pool of embedding models which have their own metric spaces and to learn to select the best one for a particular task through multi-domain meta-learning. This reduces task-specific adaptation over a complex task distribution to a simple selection problem, rather than modifying the model with a number of parameters at meta-testing time. Inspired by common multi-task learning techniques, we let all models in the pool share a base network and add a separate modulator to each model to refine the base network in its own way. This architecture allows the pool to maintain representational diversity and each model to have a domain-invariant representation as well. Experiments show that our selection scheme outperforms other few-shot classification algorithms when target tasks could come from many different domains. They also reveal that aggregating outputs from all constituent models is effective for tasks from unseen domains, showing the effectiveness of our framework.
Tasks Few-Shot Learning, Meta-Learning, Multi-Task Learning
Published 2020-01-01
URL https://openreview.net/forum?id=S1xjJpNYvB
PDF https://openreview.net/pdf?id=S1xjJpNYvB
PWC https://paperswithcode.com/paper/domain-agnostic-few-shot-classification-by
Repo
Framework

Alleviating Privacy Attacks via Causal Learning

Title Alleviating Privacy Attacks via Causal Learning
Authors Anonymous
Abstract Machine learning models, especially deep neural networks, have been shown to reveal membership information of inputs in the training data. Such membership inference attacks are a serious privacy concern; for example, patients providing medical records to build a model that detects HIV would not want their identity to be leaked. Further, we show that the attack accuracy amplifies when the model is used to predict samples that come from a different distribution than the training set, which is often the case in real world applications. Therefore, we propose the use of causal learning approaches where a model learns the causal relationship between the input features and the outcome. Causal models are known to be invariant to the training distribution and hence generalize well to shifts between samples from the same distribution and across different distributions. First, we prove that models learned using causal structure provide stronger differential privacy guarantees than associational models under reasonable assumptions. Next, we show that causal models trained on sufficiently large samples are robust to membership inference attacks across different distributions of datasets and those trained on smaller sample sizes always have lower attack accuracy than corresponding associational models. Finally, we confirm our theoretical claims with experimental evaluation on 4 datasets with moderately complex Bayesian networks. We observe that neural network-based associational models exhibit up to 80% attack accuracy under different test distributions and sample sizes, whereas causal models exhibit attack accuracy close to a random guess. Our results confirm the value of the generalizability of causal models in reducing susceptibility to privacy attacks.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Hyxfs1SYwH
PDF https://openreview.net/pdf?id=Hyxfs1SYwH
PWC https://paperswithcode.com/paper/alleviating-privacy-attacks-via-causal
Repo
Framework

Attention over Phrases

Title Attention over Phrases
Authors Anonymous
Abstract How should we represent the sentence ``That's the last straw for her''? The answer of self-attention is a weighted sum of the individual words, i.e. $$semantics=\alpha_1Emb(\text{That})+\alpha_2Emb(\text{'s})+\cdots+\alpha_nEmb(\text{her})$$. But the weighted sum of ``That's'', ``the'', ``last'', ``straw'' can hardly represent the semantics of the phrase. We argue that phrases play an important role in attention. If we combine some words into phrases, a more reasonable representation with compositions is $$semantics=\alpha_1Emb(\text{That's})+\alpha_2Emb(\text{the last straw})+\alpha_3Emb(\text{for})+\alpha_4Emb(\text{her})$$. While recent studies prefer to use the attention mechanism to represent natural language, few have noticed word compositions. In this paper, we study the problem of representing such compositional attention over phrases. We propose a new attention architecture called HyperTransformer. Besides representing the words of the sentence, we introduce hypernodes to represent the candidate phrases in attention. HyperTransformer has two phases. The first phase attends over all word/phrase pairs, which is similar to the standard Transformer. The second phase represents the inductive bias within each phrase. Specifically, we incorporate non-linear attention in the second phase. The non-linearity represents the semantic mutations in phrases. The experimental performance is greatly improved. On the WMT16 English-German translation task, BLEU increases from 20.90 (Transformer) to 34.61 (HyperTransformer).
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJeYalBKvr
PDF https://openreview.net/pdf?id=HJeYalBKvr
PWC https://paperswithcode.com/paper/attention-over-phrases
Repo
Framework
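
The two weighted-sum formulas in this abstract can be sketched in a few lines. The embeddings and attention scores below are made-up toy values, and the helper names are hypothetical; this only contrasts a word-level sum with a phrase-level sum and is not the HyperTransformer architecture, whose hypernodes and non-linear second phase are not specified here in enough detail to reproduce.

```python
import math

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Compute sum_i weights[i] * vectors[i], component by component."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]

# Toy 2-d embeddings (illustrative values only).
word_emb = {
    "That": [0.1, 0.3], "'s": [0.0, 0.1], "the": [0.0, 0.0],
    "last": [0.4, 0.2], "straw": [0.5, -0.1], "for": [0.1, 0.0],
    "her": [0.2, 0.6],
}
# A phrase gets its own embedding rather than a sum of its words.
phrase_emb = {
    "That's": [0.1, 0.4], "the last straw": [-0.7, 0.9],
    "for": [0.1, 0.0], "her": [0.2, 0.6],
}

word_units = ["That", "'s", "the", "last", "straw", "for", "her"]
phrase_units = ["That's", "the last straw", "for", "her"]

# Assumed raw scores; a real model would compute them from queries and keys.
word_weights = softmax([0.2, 0.0, 0.0, 0.5, 0.9, 0.1, 0.6])
phrase_weights = softmax([0.2, 1.2, 0.1, 0.6])

word_repr = weighted_sum(word_weights, [word_emb[u] for u in word_units])
phrase_repr = weighted_sum(phrase_weights, [phrase_emb[u] for u in phrase_units])
print(word_repr, phrase_repr)
```

Because ``the last straw'' has its own vector in the phrase-level sum, its idiomatic meaning need not lie in the span of the embeddings of ``the'', ``last'' and ``straw''.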