April 1, 2020

Paper Group NANR 115

Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning. Provable Convergence and Global Optimality of Generative Adversarial Network. Representation Learning with Multisets. Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks. Inferring Dynamical Systems with Long-Range Dependencies through Line …

Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning


Title	Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning
Authors	Anonymous
Abstract	Solving tasks with sparse rewards is one of the most important challenges in reinforcement learning. In the single-agent setting, this challenge has been addressed by introducing intrinsic rewards that motivate agents to explore unseen regions of their state spaces. Applying these techniques naively to the multi-agent setting results in agents exploring independently, without any coordination among themselves. We argue that learning in cooperative multi-agent settings can be accelerated and improved if agents coordinate with respect to what they have explored. In this paper we propose an approach for learning how to dynamically select between different types of intrinsic rewards which consider not just what an individual agent has explored, but all agents, such that the agents can coordinate their exploration and maximize extrinsic returns. Concretely, we formulate the approach as a hierarchical policy where a high-level controller selects among sets of policies trained on different types of intrinsic rewards and the low-level controllers learn the action policies of all agents under these specific rewards. We demonstrate the effectiveness of the proposed approach in a multi-agent gridworld domain with sparse rewards, and then show that our method scales up to more complex settings by evaluating on the VizDoom platform.
Tasks	Multi-agent Reinforcement Learning
Published	2020-01-01
URL	https://openreview.net/forum?id=rkltE0VKwH
PDF	https://openreview.net/pdf?id=rkltE0VKwH
PWC	https://paperswithcode.com/paper/coordinated-exploration-via-intrinsic-rewards-1
Repo
Framework
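
The abstract above contrasts intrinsic rewards that look only at what one agent has explored with rewards that take all agents into account. The following is a minimal sketch of that distinction, not the authors' implementation: the count-based novelty measure, the `NoveltyTracker` class, and the two reward functions are all hypothetical names and choices made for illustration.

```python
import math
from collections import Counter


class NoveltyTracker:
    """Per-agent visit counts (hypothetical count-based novelty measure)."""

    def __init__(self):
        self.counts = Counter()

    def visit(self, state):
        self.counts[state] += 1

    def novelty(self, state):
        # Novelty decays as the agent revisits the same state.
        return 1.0 / math.sqrt(self.counts[state] + 1)


def independent_reward(trackers, agent, state):
    """Reward based only on what this agent has explored."""
    return trackers[agent].novelty(state)


def minimum_reward(trackers, agent, state):
    """Reward that is high only if the state is novel to every agent."""
    return min(t.novelty(state) for t in trackers)


# Two agents in a gridworld; agent 0 has already visited cell (0, 0) three times.
trackers = [NoveltyTracker(), NoveltyTracker()]
for _ in range(3):
    trackers[0].visit((0, 0))

# Agent 1 arrives at (0, 0) for the first time.
indep = independent_reward(trackers, agent=1, state=(0, 0))
coord = minimum_reward(trackers, agent=1, state=(0, 0))
print(indep, coord)
```

Under the independent reward, agent 1 is still rewarded for a cell its teammate has already covered; under the coordinated variant the reward is discounted. In the approach described in the abstract, a high-level controller would choose among policies trained on such reward types; that selection step is not sketched here.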

Provable Convergence and Global Optimality of Generative Adversarial Network


Title	Provable Convergence and Global Optimality of Generative Adversarial Network
Authors	Anonymous
Abstract	Generative adversarial networks (GANs) train implicit generative models by solving minimax problems. Such minimax problems are known to be nonconvex-nonconcave, and the dynamics of first-order methods on them are not well understood. In this paper, we consider GANs of the integral probability metric (IPM) type, with the generator represented by an overparametrized neural network. When the discriminator is solved to approximate optimality in each iteration, we prove that stochastic gradient descent on a regularized IPM objective converges globally to a stationary point at a sublinear rate. Moreover, we prove that when the width of the generator network is sufficiently large and the discriminator function class has enough discriminative ability, the obtained stationary point corresponds to a generator that yields a distribution close to the distribution of the observed data in terms of total variation. To the best of our knowledge, we are the first to establish both the global convergence and the global optimality of training GANs when the generator is parametrized by a neural network.
Tasks
Published	2020-01-01
URL	https://openreview.net/forum?id=H1lnZlHYDS
PDF	https://openreview.net/pdf?id=H1lnZlHYDS
PWC	https://paperswithcode.com/paper/provable-convergence-and-global-optimality-of
Repo
Framework

Representation Learning with Multisets


Title	Representation Learning with Multisets
Authors	Anonymous
Abstract	We study the problem of learning permutation invariant representations that can capture containment relations. We propose training a model on a novel task: predicting the size of the symmetric difference between pairs of multisets, sets which may contain multiple copies of the same object. With motivation from fuzzy set theory, we formulate both multiset representations and how to predict symmetric difference sizes given these representations. We model multiset elements as vectors on the standard simplex and multisets as the summations of such vectors, and we predict symmetric difference as the l1-distance between multiset representations. We demonstrate that our representations more effectively predict the sizes of symmetric differences than DeepSets-based approaches with unconstrained object representations. Furthermore, we demonstrate that the model learns meaningful representations, mapping objects of different classes to different standard basis vectors.
Tasks	Representation Learning
Published	2020-01-01
URL	https://openreview.net/forum?id=H1eUz1rKPr
PDF	https://openreview.net/pdf?id=H1eUz1rKPr
PWC	https://paperswithcode.com/paper/representation-learning-with-multisets
Repo
Framework
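
The abstract above represents multiset elements as vectors on the standard simplex, multisets as sums of those vectors, and the symmetric difference size as an l1-distance. A minimal sketch of the idealized case follows, assuming each object class maps exactly to its own standard basis vector (the outcome the abstract reports the model learning). The function names and the toy classes are illustrative assumptions; no learning is performed here.

```python
from collections import Counter


def embed(multiset, classes):
    """Sum of one-hot (standard basis) vectors, one per element copy."""
    counts = Counter(multiset)
    return [float(counts[c]) for c in classes]


def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def symmetric_difference_size(a, b):
    """Exact multiset symmetric difference size, computed from multiplicities."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[k] - cb[k]) for k in set(ca) | set(cb))


classes = ["cat", "dog", "bird"]   # hypothetical object classes
a = ["cat", "cat", "dog"]
b = ["cat", "dog", "dog", "bird"]

predicted = l1_distance(embed(a, classes), embed(b, classes))
exact = symmetric_difference_size(a, b)
print(predicted, exact)
```

With one-hot element vectors the l1-distance between the summed representations coincides with the true symmetric difference size, which is why this target can push learned embeddings toward distinct basis vectors per class.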

Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks


Title	Deep Gradient Boosting – Layer-wise Input Normalization of Neural Networks
Authors	Anonymous
Abstract	Stochastic gradient descent (SGD) has been the dominant optimization method for training deep neural networks due to its many desirable properties. One of the more remarkable and least understood qualities of SGD is that it generalizes relatively well on unseen data even when the neural network has millions of parameters. We hypothesize that in certain cases it is desirable to relax its intrinsic generalization properties and introduce an extension of SGD called deep gradient boosting (DGB). The key idea of DGB is that back-propagated gradients inferred using the chain rule can be viewed as pseudo-residual targets of a gradient boosting problem. Thus, at each layer of a neural network, the weight update is calculated by solving the corresponding boosting problem using a linear base learner. The resulting weight update formula can also be viewed as a normalization procedure for the data that arrives at each layer during the forward pass. When implemented as a separate input normalization layer (INN), the new architecture shows improved performance on image recognition tasks compared to the same architecture without normalization layers. As opposed to batch normalization (BN), INN has no learnable parameters; however, it matches BN's performance on CIFAR10 and ImageNet classification tasks.
Tasks
Published	2020-01-01
URL	https://openreview.net/forum?id=BkxzsT4Yvr
PDF	https://openreview.net/pdf?id=BkxzsT4Yvr
PWC	https://paperswithcode.com/paper/deep-gradient-boosting-layer-wise-input
Repo
Framework

Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization


Title	Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization
Authors	Anonymous
Abstract	Vanilla RNNs with ReLU activation have a simple structure that is amenable to systematic dynamical systems analysis and interpretation, but they suffer from the exploding vs. vanishing gradients problem. Recent attempts to retain this simplicity while alleviating the gradient problem are based on proper initialization schemes or orthogonality/unitary constraints on the RNN's recurrence matrix, which, however, come with limitations to its expressive power with regard to dynamical systems phenomena like chaos or multi-stability. Here, we instead suggest a regularization scheme that pushes part of the RNN's latent subspace toward a line attractor configuration that enables long short-term memory and arbitrarily slow time scales. We show that our approach excels on a number of benchmarks like the sequential MNIST or multiplication problems, and enables reconstruction of dynamical systems which harbor widely different time scales.
Tasks
Published	2020-01-01
URL	https://openreview.net/forum?id=rylZKTNYPr
PDF	https://openreview.net/pdf?id=rylZKTNYPr
PWC	https://paperswithcode.com/paper/inferring-dynamical-systems-with-long-range-1
Repo
Framework
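
To illustrate why a line attractor configuration gives long memory, here is a minimal sketch under stated assumptions: a ReLU RNN of the form z_t = ReLU(A z_{t-1} + h) without inputs, and a hypothetical quadratic penalty that pulls the rows of A belonging to the regularized units toward the identity and their biases toward zero. The exact model and regularizer used in the paper may differ; the names `penalty` and `run` are illustrative.

```python
def relu(x):
    return max(0.0, x)


def penalty(A, h, reg_units):
    """Hypothetical regularizer: zero when the chosen units form a line attractor.

    For each regularized unit i, row i of A is pulled toward the i-th row of
    the identity matrix and the bias h[i] is pulled toward zero.
    """
    total = 0.0
    for i in reg_units:
        for j in range(len(A)):
            target = 1.0 if i == j else 0.0
            total += (A[i][j] - target) ** 2
        total += h[i] ** 2
    return total


def run(A, h, z0, steps):
    """Iterate z_t = ReLU(A z_{t-1} + h) with no external input."""
    z = list(z0)
    for _ in range(steps):
        z = [relu(sum(A[i][j] * z[j] for j in range(len(z))) + h[i])
             for i in range(len(z))]
    return z


# Unit 0 sits exactly on the line attractor; unit 1 is an ordinary decaying unit.
A = [[1.0, 0.0],
     [0.0, 0.5]]
h = [0.0, 0.0]

final = run(A, h, z0=[2.0, 2.0], steps=10)
reg = penalty(A, h, reg_units=[0])
print(final, reg)
```

Unit 0 holds its initial value indefinitely, while unit 1 forgets it geometrically. Regularizing only a subset of units leaves the remaining units free to express other dynamics.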

SoftAdam: Unifying SGD and Adam for better stochastic gradient descent


Title	SoftAdam: Unifying SGD and Adam for better stochastic gradient descent
Authors	Anonymous
Abstract	Stochastic gradient descent (SGD) and Adam are commonly used to optimize deep neural networks, but choosing one usually means making tradeoffs between speed, accuracy and stability. Here we present an intuition for why the tradeoffs exist as well as a method for unifying the two in a continuous way. This makes it possible to control the way models are trained in much greater detail. We show that for default parameters, the new algorithm equals or outperforms SGD and Adam across a range of models for image classification tasks and outperforms SGD for language modeling tasks.
Tasks	Image Classification, Language Modelling
Published	2020-01-01
URL	https://openreview.net/forum?id=Skgfr1rYDH
PDF	https://openreview.net/pdf?id=Skgfr1rYDH
PWC	https://paperswithcode.com/paper/softadam-unifying-sgd-and-adam-for-better
Repo
Framework

BERT-AL: BERT for Arbitrarily Long Document Understanding


Title	BERT-AL: BERT for Arbitrarily Long Document Understanding
Authors	Ruixuan Zhang, Zhuoyu Wei, Yu Shi, Yining Chen
Abstract	Pretrained language models have attracted a lot of attention, and they take advantage of a two-stage training process: pretraining on a huge corpus and finetuning on specific tasks. Among them, BERT (Devlin et al., 2019) is a Transformer-based (Vaswani et al., 2017) model and has been the state of the art for many kinds of Natural Language Processing (NLP) tasks. However, BERT cannot take text longer than the maximum length as input, since the maximum length is predefined during pretraining. When we apply BERT to long-text tasks, e.g., document-level text summarization: 1) Truncating inputs to the maximum sequence length decreases performance, since the model cannot capture long-range dependencies and global information spanning the whole document. 2) Extending the maximum length requires re-pretraining, which costs a great deal of time and computing resources. Even worse, the computational complexity increases quadratically with the length, which results in unacceptable training time. To resolve these problems, we propose to apply the Transformer only to model local dependencies and to capture long-range dependencies recurrently by inserting a multi-channel LSTM into each layer of BERT. The proposed model is named BERT-AL (BERT for Arbitrarily Long Document Understanding), and it can accept arbitrarily long input without re-pretraining from scratch. We demonstrate BERT-AL's effectiveness on text summarization by conducting experiments on the CNN/Daily Mail dataset. Furthermore, our method can be adapted to other Transformer-based models, e.g., XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), for various NLP tasks with long text.
Tasks	Text Summarization
Published	2020-01-01
URL	https://openreview.net/forum?id=SklnVAEFDB
PDF	https://openreview.net/pdf?id=SklnVAEFDB
PWC	https://paperswithcode.com/paper/bert-al-bert-for-arbitrarily-long-document
Repo
Framework

Learning Good Policies By Learning Good Perceptual Models


Title	Learning Good Policies By Learning Good Perceptual Models
Authors	Anonymous
Abstract	Reinforcement learning (RL) has led to increasingly complex-looking behavior in recent years. However, such complexity can be misleading and can hide over-fitting. We find that visual representations may be a useful metric of complexity, and that they both correlate well with objective optimization and causally affect reward optimization. We then propose curious representation learning (CRL), which allows us to use better visual representation learning algorithms to correspondingly increase visual representation in the policy through an intrinsic objective, both on simulated environments and in transfer to real images. Finally, we show that the better visual representations induced by CRL allow us to obtain better performance on Atari without any reward than other curiosity objectives.
Tasks	Representation Learning
Published	2020-01-01
URL	https://openreview.net/forum?id=HkgYEyrFDr
PDF	https://openreview.net/pdf?id=HkgYEyrFDr
PWC	https://paperswithcode.com/paper/learning-good-policies-by-learning-good
Repo
Framework

The Implicit Bias of Depth: How Incremental Learning Drives Generalization


Title	The Implicit Bias of Depth: How Incremental Learning Drives Generalization
Authors	Anonymous
Abstract	A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning.
Tasks
Published	2020-01-01
URL	https://openreview.net/forum?id=H1lj0nNFwB
PDF	https://openreview.net/pdf?id=H1lj0nNFwB
PWC	https://paperswithcode.com/paper/the-implicit-bias-of-depth-how-incremental-1
Repo
Framework

TPO: Tree Search Policy Optimization for Continuous Action Spaces


Title	TPO: Tree Search Policy Optimization for Continuous Action Spaces
Authors	Amir Yazdanbakhsh, Ebrahim Songhori, Robert Ormandi, Anna Goldie, Azalia Mirhoseini
Abstract	Monte Carlo Tree Search (MCTS) has achieved impressive results on a range of discrete environments, such as Go, Mario and Arcade games, but it has not yet fulfilled its true potential in continuous domains. In this work, we introduce TPO, a tree search based policy optimization method for continuous environments. TPO takes a hybrid approach to policy optimization. Building the MCTS tree in a continuous action space and updating the policy gradient using off-policy MCTS trajectories are non-trivial. To overcome these challenges, we propose limiting the tree search branching factor by drawing only a few action samples from the policy distribution, and we define a new loss function based on the trajectories' mean and standard deviations. Our approach led to some non-intuitive findings. MCTS training generally requires a large number of samples and simulations. However, we observed that bootstrapping tree search with a pre-trained policy allows us to achieve high-quality results with a low MCTS branching factor and a small number of simulations. Without the proposed policy bootstrapping, continuous MCTS would require a much larger branching factor and simulation count, rendering it computationally prohibitive. In our experiments, we use PPO as our baseline policy optimization algorithm. TPO significantly improves the policy on nearly all of our benchmarks. For example, in complex environments such as Humanoid, we achieve a 2.5× improvement over the baseline algorithm.
Tasks
Published	2020-01-01
URL	https://openreview.net/forum?id=HJew70NYvH
PDF	https://openreview.net/pdf?id=HJew70NYvH
PWC	https://paperswithcode.com/paper/tpo-tree-search-policy-optimization-for
Repo
Framework

GraphNVP: an Invertible Flow-based Model for Generating Molecular Graphs


Title	GraphNVP: an Invertible Flow-based Model for Generating Molecular Graphs
Authors	Anonymous
Abstract	We propose GraphNVP, an invertible flow-based molecular graph generation model. Existing flow-based models only handle node attributes of a graph with invertible maps. In contrast, our model is the first invertible model for all components of the graph: both the dequantized node attributes and the adjacency tensor are converted into latent vectors through two novel invertible flows. This decomposition yields exact likelihood maximization on graph-structured data. We decompose the generation of a graph into two steps: generation of (i) an adjacency tensor and (ii) node attributes. We empirically demonstrate that our model and the two-step generation efficiently generate valid molecular graphs with almost no duplicated molecules, although there are no domain-specific heuristics ingrained in the model. We also confirm that the sampling (generation) of graphs is faster in magnitude than other models in our implementation. In addition, we observe that the learned latent space can be used to generate molecules with desired chemical properties.
Tasks	Graph Generation
Published	2020-01-01
URL	https://openreview.net/forum?id=ryxQ6T4YwB
PDF	https://openreview.net/pdf?id=ryxQ6T4YwB
PWC	https://paperswithcode.com/paper/graphnvp-an-invertible-flow-based-model-for
Repo
Framework

Deep neuroethology of a virtual rodent


Title	Deep neuroethology of a virtual rodent
Authors	Anonymous
Abstract	Parallel developments in neuroscience and deep learning have led to mutually productive exchanges, pushing our understanding of real and artificial neural networks in sensory and cognitive systems. However, this interaction between fields is less developed in the study of motor control. Existing experimental research and neural network models have been focused on the production of individual behaviors, yielding little insight into how intelligent systems can produce a rich and varied set of motor behaviors. In this work we develop a virtual rodent that learns to flexibly apply a broad motor repertoire, including righting, running, leaping and rearing, to solve multiple tasks in a simulated world. We analyze the artificial neural mechanisms underlying the virtual rodent’s motor capabilities using a neuroethological approach, where we characterize neural activity patterns relative to the rodent’s behavior and goals. We show that the rodent solves tasks by using a shared set of force patterns that are orchestrated into task-specific behaviors over longer timescales. Through methods familiar to neuroscientists, including representational similarity analysis, dimensionality reduction techniques, and targeted perturbations, we show that the networks produce these behaviors using at least two classes of behavioral representations, one that explicitly encodes behavioral kinematics in a task-invariant manner, and a second that encodes task-specific behavioral strategies. Overall, the virtual rat promises to facilitate grounded collaborations between deep reinforcement learning and motor neuroscience.
Tasks	Dimensionality Reduction
Published	2020-01-01
URL	https://openreview.net/forum?id=SyxrxR4KPS
PDF	https://openreview.net/pdf?id=SyxrxR4KPS
PWC	https://paperswithcode.com/paper/deep-neuroethology-of-a-virtual-rodent
Repo
Framework

Domain-Agnostic Few-Shot Classification by Learning Disparate Modulators


Title	Domain-Agnostic Few-Shot Classification by Learning Disparate Modulators
Authors	Yongseok Choi, Junyoung Park, Subin Yi, Dong-Yeon Cho
Abstract	Although few-shot learning research has advanced rapidly with the help of meta-learning, its practical usefulness is still limited because most studies assume that all meta-training and meta-testing examples come from a single domain. We propose a simple but effective approach to few-shot classification in which the task distribution spans multiple domains, including ones not seen during meta-training. The key idea is to build a pool of embedding models, each with its own metric space, and to learn to select the best one for a particular task through multi-domain meta-learning. This reduces task-specific adaptation over a complex task distribution to a simple selection problem, rather than modifying the model with a number of parameters at meta-testing time. Inspired by common multi-task learning techniques, we let all models in the pool share a base network and add a separate modulator to each model to refine the base network in its own way. This architecture allows the pool to maintain representational diversity and each model to have a domain-invariant representation as well. Experiments show that our selection scheme outperforms other few-shot classification algorithms when target tasks could come from many different domains. They also reveal that aggregating outputs from all constituent models is effective for tasks from unseen domains, showing the effectiveness of our framework.
Tasks	Few-Shot Learning, Meta-Learning, Multi-Task Learning
Published	2020-01-01
URL	https://openreview.net/forum?id=S1xjJpNYvB
PDF	https://openreview.net/pdf?id=S1xjJpNYvB
PWC	https://paperswithcode.com/paper/domain-agnostic-few-shot-classification-by
Repo
Framework

Alleviating Privacy Attacks via Causal Learning


Title	Alleviating Privacy Attacks via Causal Learning
Authors	Anonymous
Abstract	Machine learning models, especially deep neural networks, have been shown to reveal membership information of inputs in the training data. Such membership inference attacks are a serious privacy concern; for example, patients providing medical records to build a model that detects HIV would not want their identity to be leaked. Further, we show that the attack accuracy amplifies when the model is used to predict samples that come from a different distribution than the training set, which is often the case in real-world applications. Therefore, we propose the use of causal learning approaches, where a model learns the causal relationship between the input features and the outcome. Causal models are known to be invariant to the training distribution and hence generalize well to shifts between samples from the same distribution and across different distributions. First, we prove that models learned using causal structure provide stronger differential privacy guarantees than associational models under reasonable assumptions. Next, we show that causal models trained on sufficiently large samples are robust to membership inference attacks across different distributions of datasets, and those trained on smaller sample sizes always have lower attack accuracy than corresponding associational models. Finally, we confirm our theoretical claims with experimental evaluation on 4 datasets with moderately complex Bayesian networks. We observe that neural network-based associational models exhibit up to 80% attack accuracy under different test distributions and sample sizes, whereas causal models exhibit attack accuracy close to a random guess. Our results confirm the value of the generalizability of causal models in reducing susceptibility to privacy attacks.
Tasks
Published	2020-01-01
URL	https://openreview.net/forum?id=Hyxfs1SYwH
PDF	https://openreview.net/pdf?id=Hyxfs1SYwH
PWC	https://paperswithcode.com/paper/alleviating-privacy-attacks-via-causal
Repo
Framework

Attention over Phrases


Title	Attention over Phrases
Authors	Anonymous
Abstract	How should the sentence "That's the last straw for her" be represented? The answer of self-attention is a weighted sum of the individual words, i.e. $$semantics=\alpha_1Emb(\text{That})+\alpha_2Emb(\text{'s})+\cdots+\alpha_nEmb(\text{her})$$. But the weighted sum of "That's", "the", "last", "straw" can hardly represent the semantics of the phrase. We argue that phrases play an important role in attention. If we combine some words into phrases, a more reasonable representation with compositions is $$semantics=\alpha_1Emb(\text{That's})+\alpha_2Emb(\text{the last straw})+\alpha_3Emb(\text{for})+\alpha_4Emb(\text{her})$$. While recent studies prefer to use the attention mechanism to represent natural language, few have noticed word compositions. In this paper, we study the problem of representing such compositional attention over phrases, and we propose a new attention architecture called HyperTransformer. Besides representing the words of the sentence, we introduce hypernodes to represent the candidate phrases in attention. HyperTransformer has two phases. The first phase attends over all word/phrase pairs, similar to the standard Transformer. The second phase represents the inductive bias within each phrase. Specifically, we incorporate non-linear attention in the second phase. The non-linearity represents the semantic mutations in phrases. The experimental performance is greatly improved: on the WMT16 English-German translation task, BLEU increases from 20.90 (Transformer) to 34.61 (HyperTransformer).
Tasks
Published	2020-01-01
URL	https://openreview.net/forum?id=HJeYalBKvr
PDF	https://openreview.net/pdf?id=HJeYalBKvr
PWC	https://paperswithcode.com/paper/attention-over-phrases
Repo
Framework
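
The weighted-sum formulas in the abstract above can be made concrete with a small sketch. It computes a representation as a softmax-weighted sum of unit embeddings, once over single words and once over units that include the phrase "the last straw". This illustrates only the weighted sum; it is not the HyperTransformer architecture. The `toy_embedding` function and the attention scores are placeholder assumptions.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights alpha_i that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_embedding(unit):
    """Placeholder 2-d embedding (hypothetical): character count, space count."""
    return [float(len(unit)), float(unit.count(" "))]


def attend(units, scores):
    """Return the weights alpha and the vector sum_i alpha_i * Emb(unit_i)."""
    alphas = softmax(scores)
    vectors = [toy_embedding(u) for u in units]
    dim = len(vectors[0])
    summed = [sum(a * v[d] for a, v in zip(alphas, vectors)) for d in range(dim)]
    return alphas, summed


words = ["That", "'s", "the", "last", "straw", "for", "her"]
phrases = ["That's", "the last straw", "for", "her"]

# Word-level attention: every word is a separate unit (uniform scores here).
word_alphas, word_repr = attend(words, [0.0] * len(words))

# Phrase-level attention: the idiom is one unit and can get its own weight.
phrase_alphas, phrase_repr = attend(phrases, [0.0, 1.0, 0.0, 0.0])
print(word_repr, phrase_repr)
```

In the phrase-level case the idiom contributes a single embedding with a single weight, rather than being spread across three independently weighted word embeddings.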