Paper Group NANR 115
Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning. Provable Convergence and Global Optimality of Generative Adversarial Network. Representation Learning with Multisets. Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks. Inferring Dynamical Systems with Long-Range Dependencies through Line …
Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning
Title | Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning |
Authors | Anonymous |
Abstract | Solving tasks with sparse rewards is one of the most important challenges in reinforcement learning. In the single-agent setting, this challenge has been addressed by introducing intrinsic rewards that motivate agents to explore unseen regions of their state spaces. Applying these techniques naively to the multi-agent setting results in agents exploring independently, without any coordination among themselves. We argue that learning in cooperative multi-agent settings can be accelerated and improved if agents coordinate with respect to what they have explored. In this paper we propose an approach for learning how to dynamically select between different types of intrinsic rewards which consider not just what an individual agent has explored, but all agents, such that the agents can coordinate their exploration and maximize extrinsic returns. Concretely, we formulate the approach as a hierarchical policy where a high-level controller selects among sets of policies trained on different types of intrinsic rewards and the low-level controllers learn the action policies of all agents under these specific rewards. We demonstrate the effectiveness of the proposed approach in a multi-agent gridworld domain with sparse rewards, and then show that our method scales up to more complex settings by evaluating on the VizDoom platform. |
Tasks | Multi-agent Reinforcement Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkltE0VKwH |
https://openreview.net/pdf?id=rkltE0VKwH | |
PWC | https://paperswithcode.com/paper/coordinated-exploration-via-intrinsic-rewards-1 |
Repo | |
Framework | |
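The distinction the abstract draws, between intrinsic rewards based on one agent's exploration and rewards that consider all agents, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the count-based novelty formula, the `VisitCounts` helper, and the two reward functions are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


class VisitCounts:
    """Per-agent visit counts over discrete states (hypothetical helper)."""

    def __init__(self, n_agents):
        self.counts = [defaultdict(int) for _ in range(n_agents)]

    def visit(self, agent, state):
        self.counts[agent][state] += 1

    def novelty(self, agent, state):
        # Assumed count-based novelty: 1.0 for an unvisited state,
        # decaying as the agent's own visit count grows.
        return 1.0 / math.sqrt(1 + self.counts[agent][state])


def independent_reward(vc, agent, state):
    # Considers only what this agent has explored.
    return vc.novelty(agent, state)


def shared_min_reward(vc, agent, state):
    # Considers all agents: a state is rewarding only if it is
    # novel to every agent (minimum over per-agent novelties).
    return min(vc.novelty(a, state) for a in range(len(vc.counts)))


vc = VisitCounts(n_agents=2)
vc.visit(1, (0, 0))  # agent 1 has already seen cell (0, 0)
print(independent_reward(vc, 0, (0, 0)))  # agent 0 still finds it novel
print(shared_min_reward(vc, 0, (0, 0)))   # lower, because agent 1 has been there
```

Under these assumptions, a high-level controller as described in the abstract would choose between policies trained on reward types such as these two.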
Provable Convergence and Global Optimality of Generative Adversarial Network
Title | Provable Convergence and Global Optimality of Generative Adversarial Network |
Authors | Anonymous |
Abstract | Generative adversarial networks (GANs) train implicit generative models by solving minimax problems. Such minimax problems are known to be nonconvex-nonconcave, and the dynamics of first-order methods on them are not well understood. In this paper, we consider GANs of the integral probability metric (IPM) type, with the generator represented by an overparametrized neural network. When the discriminator is solved to approximate optimality in each iteration, we prove that stochastic gradient descent on a regularized IPM objective converges globally to a stationary point at a sublinear rate. Moreover, we prove that when the width of the generator network is sufficiently large and the discriminator function class has enough discriminative ability, the obtained stationary point corresponds to a generator that yields a distribution close to the distribution of the observed data in terms of total variation. To the best of our knowledge, we are the first to establish both the global convergence and the global optimality of training GANs when the generator is parametrized by a neural network. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1lnZlHYDS |
https://openreview.net/pdf?id=H1lnZlHYDS | |
PWC | https://paperswithcode.com/paper/provable-convergence-and-global-optimality-of |
Repo | |
Framework | |
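For reference, the standard definition of an integral probability metric between two distributions $P$ and $Q$ over a function class $\mathcal{F}$ is given below. The symbols $\mathcal{F}$, $P_{\text{data}}$, $G_\theta$ and $P_{G_\theta}$ are notation chosen here for illustration, and the regularizer used in the paper is not specified in the abstract, so it is omitted.

$$d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right|$$

An IPM-type GAN then trains the generator $G_\theta$ by solving a minimax problem of the form

$$\min_{\theta} \; d_{\mathcal{F}}\left(P_{\text{data}}, P_{G_\theta}\right) = \min_{\theta} \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim P_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim P_{G_\theta}}[f(x)] \right|$$

where the inner supremum plays the role of the discriminator.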
Representation Learning with Multisets
Title | Representation Learning with Multisets |
Authors | Anonymous |
Abstract | We study the problem of learning permutation invariant representations that can capture containment relations. We propose training a model on a novel task: predicting the size of the symmetric difference between pairs of multisets, sets which may contain multiple copies of the same object. With motivation from fuzzy set theory, we formulate both multiset representations and how to predict symmetric difference sizes given these representations. We model multiset elements as vectors on the standard simplex and multisets as the summations of such vectors, and we predict symmetric difference as the l1-distance between multiset representations. We demonstrate that our representations more effectively predict the sizes of symmetric differences than DeepSets-based approaches with unconstrained object representations. Furthermore, we demonstrate that the model learns meaningful representations, mapping objects of different classes to different standard basis vectors. |
Tasks | Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1eUz1rKPr |
https://openreview.net/pdf?id=H1eUz1rKPr | |
PWC | https://paperswithcode.com/paper/representation-learning-with-multisets |
Repo | |
Framework | |
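The representation described in the abstract (elements as vectors on the standard simplex, multisets as their sums, symmetric difference predicted as the l1-distance) can be checked on a toy case. The sketch below assumes the idealized situation the abstract reports as the learned outcome, where each object class maps to a distinct standard basis vector; the fixed `vocab` mapping and all function names are hypothetical stand-ins for a trained model.

```python
from collections import Counter


def one_hot(index, dim):
    """Standard basis vector, a vertex of the standard simplex."""
    return [1.0 if i == index else 0.0 for i in range(dim)]


def embed_multiset(items, vocab):
    """Represent a multiset as the sum of its element vectors."""
    total = [0.0] * len(vocab)
    for item in items:
        vec = one_hot(vocab[item], len(vocab))
        total = [t + v for t, v in zip(total, vec)]
    return total


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def symmetric_difference_size(a, b):
    """Exact size of the multiset symmetric difference, for comparison."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())


vocab = {"cat": 0, "dog": 1, "bird": 2}  # assumed class-to-index mapping
a = ["cat", "cat", "dog"]
b = ["cat", "bird"]
predicted = l1_distance(embed_multiset(a, vocab), embed_multiset(b, vocab))
print(predicted, symmetric_difference_size(a, b))
```

With one-hot element vectors the l1-distance coincides with the exact symmetric difference size; with general simplex vectors it would only approximate it.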
Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks
Title | Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks |
Authors | Anonymous |
Abstract | Stochastic gradient descent (SGD) has been the dominant optimization method for training deep neural networks due to its many desirable properties. One of the more remarkable and least understood qualities of SGD is that it generalizes relatively well on unseen data even when the neural network has millions of parameters. We hypothesize that in certain cases it is desirable to relax its intrinsic generalization properties and introduce an extension of SGD called deep gradient boosting (DGB). The key idea of DGB is that back-propagated gradients inferred using the chain rule can be viewed as pseudo-residual targets of a gradient boosting problem. Thus, at each layer of a neural network, the weight update is calculated by solving the corresponding boosting problem using a linear base learner. The resulting weight update formula can also be viewed as a normalization procedure applied to the data that arrives at each layer during the forward pass. When implemented as a separate input normalization layer (INN), the new architecture shows improved performance on image recognition tasks compared to the same architecture without normalization layers. Unlike batch normalization (BN), INN has no learnable parameters; however, it matches BN's performance on CIFAR10 and ImageNet classification tasks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkxzsT4Yvr |
https://openreview.net/pdf?id=BkxzsT4Yvr | |
PWC | https://paperswithcode.com/paper/deep-gradient-boosting-layer-wise-input |
Repo | |
Framework | |
Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization
Title | Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization |
Authors | Anonymous |
Abstract | Vanilla RNNs with ReLU activation have a simple structure that is amenable to systematic dynamical systems analysis and interpretation, but they suffer from the exploding vs. vanishing gradients problem. Recent attempts to retain this simplicity while alleviating the gradient problem are based on proper initialization schemes or orthogonality/unitary constraints on the RNN’s recurrency matrix, which, however, come with limitations to its expressive power with regard to dynamical systems phenomena like chaos or multi-stability. Here, we instead suggest a regularization scheme that pushes part of the RNN’s latent subspace toward a line attractor configuration that enables long short-term memory and arbitrarily slow time scales. We show that our approach excels on a number of benchmarks like the sequential MNIST or multiplication problems, and enables reconstruction of dynamical systems which harbor widely different time scales. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rylZKTNYPr |
https://openreview.net/pdf?id=rylZKTNYPr | |
PWC | https://paperswithcode.com/paper/inferring-dynamical-systems-with-long-range-1 |
Repo | |
Framework | |
SoftAdam: Unifying SGD and Adam for better stochastic gradient descent
Title | SoftAdam: Unifying SGD and Adam for better stochastic gradient descent |
Authors | Anonymous |
Abstract | Stochastic gradient descent (SGD) and Adam are commonly used to optimize deep neural networks, but choosing one usually means making tradeoffs between speed, accuracy and stability. Here we present an intuition for why the tradeoffs exist as well as a method for unifying the two in a continuous way. This makes it possible to control the way models are trained in much greater detail. We show that for default parameters, the new algorithm equals or outperforms SGD and Adam across a range of models for image classification tasks and outperforms SGD for language modeling tasks. |
Tasks | Image Classification, Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Skgfr1rYDH |
https://openreview.net/pdf?id=Skgfr1rYDH | |
PWC | https://paperswithcode.com/paper/softadam-unifying-sgd-and-adam-for-better |
Repo | |
Framework | |
BERT-AL: BERT for Arbitrarily Long Document Understanding
Title | BERT-AL: BERT for Arbitrarily Long Document Understanding |
Authors | Ruixuan Zhang, Zhuoyu Wei, Yu Shi, Yining Chen |
Abstract | Pretrained language models attract a lot of attention, and they take advantage of a two-stage training process: pretraining on a huge corpus and finetuning on specific tasks. Among them, BERT (Devlin et al., 2019) is a Transformer (Vaswani et al., 2017) based model and has been the state of the art for many kinds of Natural Language Processing (NLP) tasks. However, BERT cannot take text longer than the maximum length as input, since the maximum length is predefined during pretraining. When we apply BERT to long text tasks, e.g., document-level text summarization: 1) Truncating inputs to the maximum sequence length will decrease performance, since the model cannot capture long dependencies and global information spanning the whole document. 2) Extending the maximum length requires re-pretraining, which will cost a great deal of time and computing resources. What’s even worse is that the computational complexity will increase quadratically with the length, which will result in an unacceptable training time. To resolve these problems, we propose to apply the Transformer to model only local dependencies and to capture long dependencies recurrently by inserting a multi-channel LSTM into each layer of BERT. The proposed model is named BERT-AL (BERT for Arbitrarily Long Document Understanding), and it can accept arbitrarily long input without re-pretraining from scratch. We demonstrate BERT-AL’s effectiveness on text summarization by conducting experiments on the CNN/Daily Mail dataset. Furthermore, our method can be adapted to other Transformer based models, e.g., XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), for various NLP tasks with long text. |
Tasks | Text Summarization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SklnVAEFDB |
https://openreview.net/pdf?id=SklnVAEFDB | |
PWC | https://paperswithcode.com/paper/bert-al-bert-for-arbitrarily-long-document |
Repo | |
Framework | |
Learning Good Policies By Learning Good Perceptual Models
Title | Learning Good Policies By Learning Good Perceptual Models |
Authors | Anonymous |
Abstract | Reinforcement learning (RL) has led to increasingly complex-looking behavior in recent years. However, such complexity can be misleading and can hide over-fitting. We find that visual representations may be a useful metric of complexity, and that they both correlate well with objective optimization and causally affect reward optimization. We then propose curious representation learning (CRL), which allows us to use better visual representation learning algorithms to correspondingly increase visual representation in the policy through an intrinsic objective, both on simulated environments and in transfer to real images. Finally, we show that the better visual representations induced by CRL allow us to obtain better performance on Atari without any reward than other curiosity objectives. |
Tasks | Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HkgYEyrFDr |
https://openreview.net/pdf?id=HkgYEyrFDr | |
PWC | https://paperswithcode.com/paper/learning-good-policies-by-learning-good |
Repo | |
Framework | |
The Implicit Bias of Depth: How Incremental Learning Drives Generalization
Title | The Implicit Bias of Depth: How Incremental Learning Drives Generalization |
Authors | Anonymous |
Abstract | A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1lj0nNFwB |
https://openreview.net/pdf?id=H1lj0nNFwB | |
PWC | https://paperswithcode.com/paper/the-implicit-bias-of-depth-how-incremental-1 |
Repo | |
Framework | |
TPO: TREE SEARCH POLICY OPTIMIZATION FOR CONTINUOUS ACTION SPACES
Title | TPO: TREE SEARCH POLICY OPTIMIZATION FOR CONTINUOUS ACTION SPACES |
Authors | Amir Yazdanbakhsh, Ebrahim Songhori, Robert Ormandi, Anna Goldie, Azalia Mirhoseini |
Abstract | Monte Carlo Tree Search (MCTS) has achieved impressive results on a range of discrete environments, such as Go, Mario and Arcade games, but it has not yet fulfilled its true potential in continuous domains. In this work, we introduce TPO, a tree search based policy optimization method for continuous environments. TPO takes a hybrid approach to policy optimization. Building the MCTS tree in a continuous action space and updating the policy gradient using off-policy MCTS trajectories are non-trivial. To overcome these challenges, we propose limiting the tree search branching factor by drawing only a few action samples from the policy distribution, and we define a new loss function based on the trajectories’ mean and standard deviations. Our approach led to some non-intuitive findings. MCTS training generally requires a large number of samples and simulations. However, we observed that bootstrapping tree search with a pre-trained policy allows us to achieve high quality results with a low MCTS branching factor and a small number of simulations. Without the proposed policy bootstrapping, continuous MCTS would require a much larger branching factor and simulation count, rendering it computationally prohibitive. In our experiments, we use PPO as our baseline policy optimization algorithm. TPO significantly improves the policy on nearly all of our benchmarks. For example, in complex environments such as Humanoid, we achieve a 2.5× improvement over the baseline algorithm. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJew70NYvH |
https://openreview.net/pdf?id=HJew70NYvH | |
PWC | https://paperswithcode.com/paper/tpo-tree-search-policy-optimization-for |
Repo | |
Framework | |
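The branching-factor idea in the abstract, expanding a tree node with only a few actions drawn from the policy distribution, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the diagonal Gaussian policy, the function name `sample_branch_actions`, and the example mean and standard deviation are all hypothetical.

```python
import random


def sample_branch_actions(mean, std, k, rng):
    """Draw k candidate continuous actions for one tree node.

    Assumes a diagonal Gaussian policy: each action dimension is
    sampled independently from N(mean[i], std[i]).
    """
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(k)]


rng = random.Random(0)          # seeded for reproducibility
policy_mean = [0.0, 0.5]        # hypothetical policy output for one state
policy_std = [0.1, 0.2]
children = sample_branch_actions(policy_mean, policy_std, k=3, rng=rng)
print(len(children))            # branching factor of this node
```

Keeping `k` small bounds the width of the tree at every node, which is what makes search in a continuous action space tractable in this sketch.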
GraphNVP: an Invertible Flow-based Model for Generating Molecular Graphs
Title | GraphNVP: an Invertible Flow-based Model for Generating Molecular Graphs |
Authors | Anonymous |
Abstract | We propose GraphNVP, an invertible flow-based molecular graph generation model. Existing flow-based models only handle node attributes of a graph with invertible maps. In contrast, our model is the first invertible model for the whole graph components: both the dequantized node attributes and the adjacency tensor are converted into latent vectors through two novel invertible flows. This decomposition yields the exact likelihood maximization on graph-structured data. We decompose the generation of a graph into two steps: generation of (i) an adjacency tensor and (ii) node attributes. We empirically demonstrate that our model and the two-step generation efficiently generate valid molecular graphs with almost no duplicated molecules, although there are no domain-specific heuristics ingrained in the model. We also confirm that the sampling (generation) of graphs is faster in magnitude than other models in our implementation. In addition, we observe that the learned latent space can be used to generate molecules with desired chemical properties. |
Tasks | Graph Generation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxQ6T4YwB |
https://openreview.net/pdf?id=ryxQ6T4YwB | |
PWC | https://paperswithcode.com/paper/graphnvp-an-invertible-flow-based-model-for |
Repo | |
Framework | |
Deep neuroethology of a virtual rodent
Title | Deep neuroethology of a virtual rodent |
Authors | Anonymous |
Abstract | Parallel developments in neuroscience and deep learning have led to mutually productive exchanges, pushing our understanding of real and artificial neural networks in sensory and cognitive systems. However, this interaction between fields is less developed in the study of motor control. Existing experimental research and neural network models have been focused on the production of individual behaviors, yielding little insight into how intelligent systems can produce a rich and varied set of motor behaviors. In this work we develop a virtual rodent that learns to flexibly apply a broad motor repertoire, including righting, running, leaping and rearing, to solve multiple tasks in a simulated world. We analyze the artificial neural mechanisms underlying the virtual rodent’s motor capabilities using a neuroethological approach, where we characterize neural activity patterns relative to the rodent’s behavior and goals. We show that the rodent solves tasks by using a shared set of force patterns that are orchestrated into task-specific behaviors over longer timescales. Through methods familiar to neuroscientists, including representational similarity analysis, dimensionality reduction techniques, and targeted perturbations, we show that the networks produce these behaviors using at least two classes of behavioral representations, one that explicitly encodes behavioral kinematics in a task-invariant manner, and a second that encodes task-specific behavioral strategies. Overall, the virtual rat promises to facilitate grounded collaborations between deep reinforcement learning and motor neuroscience. |
Tasks | Dimensionality Reduction |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SyxrxR4KPS |
https://openreview.net/pdf?id=SyxrxR4KPS | |
PWC | https://paperswithcode.com/paper/deep-neuroethology-of-a-virtual-rodent |
Repo | |
Framework | |
Domain-Agnostic Few-Shot Classification by Learning Disparate Modulators
Title | Domain-Agnostic Few-Shot Classification by Learning Disparate Modulators |
Authors | Yongseok Choi, Junyoung Park, Subin Yi, Dong-Yeon Cho |
Abstract | Although few-shot learning research has advanced rapidly with the help of meta-learning, its practical usefulness is still limited because most research assumes that all meta-training and meta-testing examples come from a single domain. We propose a simple but effective approach to few-shot classification in which the task distribution spans multiple domains, including ones unseen during meta-training. The key idea is to build a pool of embedding models which have their own metric spaces and to learn to select the best one for a particular task through multi-domain meta-learning. This reduces task-specific adaptation over a complex task distribution to a simple selection problem, rather than modifying the model with a number of parameters at meta-testing time. Inspired by common multi-task learning techniques, we let all models in the pool share a base network and add a separate modulator to each model to refine the base network in its own way. This architecture allows the pool to maintain representational diversity and each model to have a domain-invariant representation as well. Experiments show that our selection scheme outperforms other few-shot classification algorithms when target tasks could come from many different domains. They also reveal that aggregating outputs from all constituent models is effective for tasks from unseen domains, showing the effectiveness of our framework. |
Tasks | Few-Shot Learning, Meta-Learning, Multi-Task Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1xjJpNYvB |
https://openreview.net/pdf?id=S1xjJpNYvB | |
PWC | https://paperswithcode.com/paper/domain-agnostic-few-shot-classification-by |
Repo | |
Framework | |
Alleviating Privacy Attacks via Causal Learning
Title | Alleviating Privacy Attacks via Causal Learning |
Authors | Anonymous |
Abstract | Machine learning models, especially deep neural networks, have been shown to reveal membership information of inputs in the training data. Such membership inference attacks are a serious privacy concern; for example, patients providing medical records to build a model that detects HIV would not want their identity to be leaked. Further, we show that the attack accuracy amplifies when the model is used to predict samples that come from a different distribution than the training set, which is often the case in real-world applications. Therefore, we propose the use of causal learning approaches, where a model learns the causal relationship between the input features and the outcome. Causal models are known to be invariant to the training distribution and hence generalize well to shifts between samples from the same distribution and across different distributions. First, we prove that models learned using causal structure provide stronger differential privacy guarantees than associational models under reasonable assumptions. Next, we show that causal models trained on sufficiently large samples are robust to membership inference attacks across different distributions of datasets, and that those trained on smaller sample sizes always have lower attack accuracy than corresponding associational models. Finally, we confirm our theoretical claims with experimental evaluation on 4 datasets with moderately complex Bayesian networks. We observe that neural network-based associational models exhibit up to 80% attack accuracy under different test distributions and sample sizes, whereas causal models exhibit attack accuracy close to a random guess. Our results confirm the value of the generalizability of causal models in reducing susceptibility to privacy attacks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hyxfs1SYwH |
https://openreview.net/pdf?id=Hyxfs1SYwH | |
PWC | https://paperswithcode.com/paper/alleviating-privacy-attacks-via-causal |
Repo | |
Framework | |
Attention over Phrases
Title | Attention over Phrases |
Authors | Anonymous |
Abstract | How to represent the sentence "That's the last straw for her"? The answer of self-attention is a weighted sum of the individual words, i.e. $$semantics=\alpha_1Emb(\text{That})+\alpha_2Emb(\text{'s})+\cdots+\alpha_nEmb(\text{her})$$. But the weighted sum of "That's", "the", "last", "straw" can hardly represent the semantics of the phrase. We argue that phrases play an important role in attention. If we combine some words into phrases, a more reasonable representation with compositions is $$semantics=\alpha_1Emb(\text{That's})+\alpha_2Emb(\text{the last straw})+\alpha_3Emb(\text{for})+\alpha_4Emb(\text{her})$$. While recent studies prefer to use the attention mechanism to represent natural language, few have noticed word compositions. In this paper, we study the problem of representing such compositional attention over phrases. We propose a new attention architecture called HyperTransformer. Besides representing the words of the sentence, we introduce hypernodes to represent the candidate phrases in attention. HyperTransformer has two phases. The first phase is used to attend over all word/phrase pairs, which is similar to the standard Transformer. The second phase is used to represent the inductive bias within each phrase. Specifically, we incorporate non-linear attention in the second phase. The non-linearity represents the semantic mutations in phrases. The experimental performance has been greatly improved. In the WMT16 English-German translation task, the BLEU score increases from 20.90 (by Transformer) to 34.61 (by HyperTransformer). |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJeYalBKvr |
https://openreview.net/pdf?id=HJeYalBKvr | |
PWC | https://paperswithcode.com/paper/attention-over-phrases |
Repo | |
Framework | |
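The two weighted sums in the "Attention over Phrases" abstract differ only in the units being attended over: individual tokens in the first, and a mix of words and phrases in the second. The sketch below makes that concrete with toy numbers. The 2-dimensional embeddings, the attention scores, and the way the phrase embedding is obtained are all made up for illustration; the abstract does not say how HyperTransformer computes them.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights alpha_i that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, embeddings):
    """Compute sum_i alpha_i * Emb(unit_i), dimension by dimension."""
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]


# Hypothetical embeddings for the phrase-level units
# "That's", "the last straw", "for", "her".
phrase_units = [[0.1, 0.0], [0.9, 0.7], [0.0, 0.1], [0.2, 0.3]]
scores = [0.5, 2.0, 0.1, 0.8]   # made-up attention scores
alphas = softmax(scores)
semantics = weighted_sum(alphas, phrase_units)
print(alphas)
print(semantics)
```

Treating "the last straw" as one unit lets a single weight and a single embedding carry the idiomatic meaning, instead of spreading it across three word embeddings.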