April 1, 2020

2714 words 13 mins read

Paper Group NANR 100

Relational State-Space Model for Stochastic Multi-Object Systems. Variance Reduction With Sparse Gradients. The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget. Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models. Novelty Detection Via Blurring. Federated Learning with Matched Averagi …

Relational State-Space Model for Stochastic Multi-Object Systems

Title Relational State-Space Model for Stochastic Multi-Object Systems
Authors Anonymous
Abstract Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics are generally not easy, due to the inherent difficulty of understanding the complicated interactions and evolutions of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with the SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets.
Tasks Time Series
Published 2020-01-01
URL https://openreview.net/forum?id=B1lGU64tDr
PDF https://openreview.net/pdf?id=B1lGU64tDr
PWC https://paperswithcode.com/paper/relational-state-space-model-for-stochastic
Repo
Framework
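
To make the idea in the abstract concrete, here is a minimal sketch (not the authors' code) of one GNN-driven joint state transition: pairwise messages are computed over a relation graph, aggregated per object, and used to parameterize a Gaussian over each object's next latent state. The layer sizes, single-round message passing, and the adjacency matrix `adj` are illustrative assumptions; the hierarchical latents, normalizing flows, and contrastive objectives from the abstract are omitted.

```python
# Hedged sketch of an R-SSM-style state transition with a GNN (assumptions throughout).
import torch
import torch.nn as nn

class GNNTransition(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.out = nn.Linear(state_dim + hidden_dim, 2 * state_dim)  # mean and log-variance

    def forward(self, z, adj):
        # z: (N, state_dim) latent states of N objects; adj: (N, N) 0/1 relation matrix.
        N = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(N, N, -1),
                           z.unsqueeze(0).expand(N, N, -1)], dim=-1)
        messages = self.msg(pairs) * adj.unsqueeze(-1)          # mask out non-edges
        agg = messages.sum(dim=1)                               # aggregate incoming messages
        mean, logvar = self.out(torch.cat([z, agg], dim=-1)).chunk(2, dim=-1)
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # sample z_{t+1}

# Usage: roll the joint state of 5 interacting objects forward one step.
trans = GNNTransition(state_dim=8)
z_t = torch.randn(5, 8)
adj = (torch.rand(5, 5) < 0.3).float()
z_next = trans(z_t, adj)
```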

Variance Reduction With Sparse Gradients

Title Variance Reduction With Sparse Gradients
Authors Anonymous
Abstract Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (Johnson & Zhang, 2013) and SpiderBoost (Wang et al., 2018), require significantly more computational resources per update than SGD (Robbins & Monro, 1951). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator blending the top-K operator (Stich et al., 2018; Aji & Heafield, 2017) and the randomized coordinate descent operator. While the computational cost of computing the derivative of a model parameter is constant, we make the observation that the gains in variance reduction are proportional to the magnitude of the derivative. In this paper, we show that a sparse gradient based on the magnitude of past gradients reduces the computational cost of model updates without a significant loss in variance reduction. Theoretically, our algorithm is at least as good as the best available algorithm (e.g. SpiderBoost) under appropriate settings of parameters and can be much more efficient if our algorithm succeeds in capturing the sparsity of the gradients. Empirically, our algorithm consistently outperforms SpiderBoost using various models to solve various image classification tasks. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration.
Tasks Image Classification
Published 2020-01-01
URL https://openreview.net/forum?id=Syx1DkSYwB
PDF https://openreview.net/pdf?id=Syx1DkSYwB
PWC https://paperswithcode.com/paper/variance-reduction-with-sparse-gradients
Repo
Framework
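
As a rough illustration of the sparse gradient operator described above (not the paper's implementation), the sketch below keeps the coordinates with the largest remembered gradient magnitude plus a few random ones, and applies the operator to the small-batch correction term of an SVRG/SpiderBoost-style update. The memory vector, `k_top`, `k_rand`, and the unbiased rescaling of the random coordinates are assumptions.

```python
# Toy sketch of a top-K + random-coordinate sparsifier for gradients (assumptions throughout).
import torch

def sparsify(grad, memory, k_top=100, k_rand=20):
    """Keep the k_top coordinates with the largest remembered magnitude,
    plus k_rand uniformly random remaining coordinates (rescaled to stay unbiased)."""
    d = grad.numel()
    top_idx = memory.abs().topk(k_top).indices
    mask = torch.zeros(d, dtype=torch.bool)
    mask[top_idx] = True
    rest = (~mask).nonzero(as_tuple=True)[0]
    rand_idx = rest[torch.randperm(rest.numel())[:k_rand]]
    sparse = torch.zeros_like(grad)
    sparse[top_idx] = grad[top_idx]
    sparse[rand_idx] = grad[rand_idx] * (rest.numel() / k_rand)  # unbiased over the rest
    return sparse

# Usage inside an SVRG/SpiderBoost-style step: only the small-batch correction is sparsified.
d = 10_000
memory = torch.zeros(d)
full_grad = torch.randn(d)          # stale large-batch gradient (placeholder values)
small_grad = torch.randn(d)         # fresh small-batch gradient (placeholder values)
update = full_grad + sparsify(small_grad - full_grad, memory)
memory = 0.9 * memory + 0.1 * small_grad.abs()   # track past gradient magnitudes
```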

The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget

Title The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget
Authors Anirudh Goyal, Yoshua Bengio, Matthew Botvinick, Sergey Levine
Abstract In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information) and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a “privileged” input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, the compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which, for each example, estimates the value of the privileged information before seeing it, i.e., based only on the standard input, and then stochastically chooses whether to access the privileged input. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Hye1kTVFDS
PDF https://openreview.net/pdf?id=Hye1kTVFDS
PWC https://paperswithcode.com/paper/the-variational-bandwidth-bottleneck
Repo
Framework
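
A minimal sketch of the access decision as the abstract describes it: a gate network looks only at the standard input, emits an access probability, and the privileged input is used only when the sampled gate is open, otherwise a sample from a fixed prior stands in. All module names, dimensions, and the use of a standard Gaussian prior are assumptions, not the paper's implementation.

```python
# Hedged sketch of a bandwidth-limited access gate over a privileged input.
import torch
import torch.nn as nn

class BandwidthGate(nn.Module):
    def __init__(self, state_dim, priv_dim, code_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())
        self.encode = nn.Linear(priv_dim, code_dim)
        self.code_dim = code_dim

    def forward(self, state, privileged):
        p_access = self.gate(state)            # decided before looking at `privileged`
        access = torch.bernoulli(p_access)     # stochastic access decision per example
        # NOTE: torch.where evaluates both branches; a real implementation would
        # skip the privileged encoder entirely when the gate is closed.
        code = torch.where(access.bool(),
                           self.encode(privileged),                      # pay for the channel
                           torch.randn(state.size(0), self.code_dim))    # prior sample instead
        return code, p_access                  # p_access would also enter a bandwidth/KL cost

gate = BandwidthGate(state_dim=10, priv_dim=6, code_dim=4)
code, p = gate(torch.randn(32, 10), torch.randn(32, 6))
```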

Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Title Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models
Authors Anonymous
Abstract In natural language processing, it has been observed recently that generalization can be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when only a small number of training instances are available. In this paper, we introduce a new regularization technique, which we call “mixout”, motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy increase greatly when we use the proposed approach to regularize finetuning of BERT on downstream GLUE tasks.
Tasks Language Modelling
Published 2020-01-01
URL https://openreview.net/forum?id=HkgaETNtDB
PDF https://openreview.net/pdf?id=HkgaETNtDB
PWC https://paperswithcode.com/paper/mixout-effective-regularization-to-finetune-1
Repo
Framework
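
The core operation is simple enough to sketch. Below is a hedged reading of mixout for a single weight tensor: each parameter is swapped for its pretrained counterpart with probability p, followed by a dropout-style rescaling so that the expectation equals the current parameter. The tensor shapes and the choice of p are placeholders.

```python
# Hedged sketch of the mixout operation for one parameter tensor (assumptions noted above).
import torch

def mixout(current: torch.Tensor, pretrained: torch.Tensor, p: float) -> torch.Tensor:
    if p == 0.0:
        return current
    mask = torch.bernoulli(torch.full_like(current, p))
    mixed = mask * pretrained + (1.0 - mask) * current
    # rescale so that E[result] == current, mirroring inverted dropout
    return (mixed - p * pretrained) / (1.0 - p)

# Usage: apply mixout to one weight matrix during a finetuning forward pass.
w_pre = torch.randn(768, 768)                     # frozen pretrained weights
w_cur = w_pre + 0.01 * torch.randn_like(w_pre)    # weights being finetuned
w_used = mixout(w_cur, w_pre, p=0.7)
```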

Novelty Detection Via Blurring

Title Novelty Detection Via Blurring
Authors Anonymous
Abstract Conventional out-of-distribution (OOD) detection schemes based on variational autoencoders or Random Network Distillation (RND) are known to assign lower uncertainty to OOD data than to the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to blurred images. Based on this observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns a better representation of the target distribution than the baselines. Finally, SVD-RND combined with geometric transforms achieves near-perfect detection accuracy in the CelebA domain.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=ByeNra4FDB
PDF https://openreview.net/pdf?id=ByeNra4FDB
PWC https://paperswithcode.com/paper/novelty-detection-via-blurring
Repo
Framework
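
An illustrative sketch of the two ingredients named above: SVD-based blurring of an image (keeping only the top-k singular values) and a generic RND novelty score (prediction error against a frozen random target network). How exactly SVD-RND combines blurred images during training is only summarized in the abstract, so the networks, image size, and rank k here are assumptions.

```python
# Hedged sketch: SVD blur + a generic RND score (not the paper's exact training scheme).
import torch
import torch.nn as nn

def svd_blur(img: torch.Tensor, k: int) -> torch.Tensor:
    """Low-rank (blurred) reconstruction of an (H, W) grayscale image."""
    U, S, Vh = torch.linalg.svd(img, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

target = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128))     # frozen random target net
predictor = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128))  # trained predictor net
for p in target.parameters():
    p.requires_grad_(False)

def rnd_score(x):
    # large prediction error => unfamiliar (novel) input
    return ((predictor(x) - target(x)) ** 2).mean(dim=-1)

img = torch.rand(32, 32)
blurred = svd_blur(img, k=4)
score = rnd_score(blurred.unsqueeze(0))   # blurred variants are fed in during training
```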

Federated Learning with Matched Averaging

Title Federated Learning with Matched Averaging
Authors Anonymous
Abstract Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud. We propose the Federated Matched Averaging (FedMA) algorithm, designed for federated learning of modern neural network architectures, e.g., convolutional neural networks (CNNs) and LSTMs. FedMA constructs the shared global model in a layer-wise manner by matching and averaging hidden elements (i.e., channels for convolutional layers, hidden states for LSTMs, and neurons for fully connected layers) with similar feature extraction signatures. Our experiments indicate that FedMA outperforms popular state-of-the-art federated learning algorithms on deep CNN and LSTM architectures trained on real-world datasets, while improving communication efficiency.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BkluqlSFDS
PDF https://openreview.net/pdf?id=BkluqlSFDS
PWC https://paperswithcode.com/paper/federated-learning-with-matched-averaging
Repo
Framework
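
A simplified sketch of layer-wise matched averaging for one fully connected layer from two clients: client B's neurons are aligned to client A's with a cost-minimizing assignment before averaging. The Hungarian algorithm on Euclidean neuron distance stands in here for FedMA's probabilistic matching, so the cost function and two-client setting are assumptions for illustration only.

```python
# Hedged sketch of matched averaging for one layer (Hungarian matching as a stand-in).
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_average(w_a: np.ndarray, w_b: np.ndarray) -> np.ndarray:
    """w_a, w_b: (num_neurons, fan_in) weight matrices of the same layer from two clients."""
    cost = np.linalg.norm(w_a[:, None, :] - w_b[None, :, :], axis=-1)  # pairwise neuron distances
    rows, cols = linear_sum_assignment(cost)    # permutation aligning B's neurons to A's
    return 0.5 * (w_a[rows] + w_b[cols])        # average matched neurons into the global layer

w_client_a = np.random.randn(64, 128)
w_client_b = np.random.randn(64, 128)
w_global = matched_average(w_client_a, w_client_b)
```

Averaging without the matching step (as in plain FedAvg) would mix unrelated neurons whenever clients learn permuted features, which is the failure mode the layer-wise matching is meant to avoid.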

Deep Nonlinear Stochastic Optimal Control for Systems with Multiplicative Uncertainties

Title Deep Nonlinear Stochastic Optimal Control for Systems with Multiplicative Uncertainties
Authors Anonymous
Abstract We present a deep recurrent neural network architecture to solve a class of stochastic optimal control problems described by fully nonlinear Hamilton-Jacobi-Bellman (HJB) partial differential equations. Such PDEs arise when one considers stochastic dynamics characterized by uncertainties that are additive and control multiplicative. Stochastic models with these characteristics have been used in computational neuroscience, biology, finance, and aerospace systems, and provide a more accurate representation of actuation than models with additive uncertainty alone. Previous literature has established the inadequacy of the linear HJB theory and instead relies on a nonlinear Feynman-Kac lemma, resulting in a representation via second-order forward-backward stochastic differential equations (FBSDEs). However, the proposed solutions that use this representation suffer from compounding errors and computational complexity, leading to a lack of scalability. In this paper, we propose a deep learning based algorithm that leverages the second-order FBSDE representation and LSTM-based recurrent neural networks to solve such stochastic optimal control problems, overcome the problems faced by previous approaches, and scale well to high-dimensional systems. The resulting control algorithm is tested on nonlinear systems in robotics and biomechanics to demonstrate feasibility and superior performance relative to previous methods.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1gXzxHKvH
PDF https://openreview.net/pdf?id=H1gXzxHKvH
PWC https://paperswithcode.com/paper/deep-nonlinear-stochastic-optimal-control-for
Repo
Framework
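
For intuition about the system class named in the abstract, here is a toy Euler-Maruyama rollout of scalar dynamics driven by both additive noise and noise whose magnitude scales with the control (control-multiplicative). The LSTM policy and FBSDE-based training loop of the paper are omitted; the constants and the hand-written feedback law are arbitrary.

```python
# Toy simulation of additive + control-multiplicative noise (not the paper's solver).
import numpy as np

def rollout(policy, x0=1.0, dt=0.01, steps=200, sigma_add=0.1, sigma_mult=0.3):
    x, xs = x0, [x0]
    for _ in range(steps):
        u = policy(x)
        noise = np.sqrt(dt) * np.random.randn()
        # drift + additive noise + noise that scales with the applied control
        x = x + (-x + u) * dt + (sigma_add + sigma_mult * u) * noise
        xs.append(x)
    return np.array(xs)

traj = rollout(policy=lambda x: -0.5 * x)   # a hand-written stabilizing feedback, for illustration
```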

Solving single-objective tasks by preference multi-objective reinforcement learning

Title Solving single-objective tasks by preference multi-objective reinforcement learning
Authors Anonymous
Abstract Many single-objective tasks in the real world are inevitably related to, and influenced by, other objectives. We call such a task an objective-constrained task, which is inherently a multi-objective problem. Due to the conflict among different objectives, a trade-off is needed. A common compromise is to design a scalar reward function that encodes the relationship among these objectives using the prior knowledge of experts. However, such reward engineering is extremely cumbersome and can result in behaviors that optimize the designed reward function without actually satisfying our preferences. In this paper, we explicitly cast the objective-constrained task as preference multi-objective reinforcement learning, with the overall goal of finding a Pareto optimal policy. Combined with the Trajectory Preference Domination criterion we propose, a weight vector that reflects the agent’s preference for each objective can be learned. We analyze the feasibility of our algorithm theoretically and show experimentally that it outperforms approaches that rely on expert-designed reward functions.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJxV5yHYwB
PDF https://openreview.net/pdf?id=HJxV5yHYwB
PWC https://paperswithcode.com/paper/solving-single-objective-tasks-by-preference
Repo
Framework
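
A rough sketch, under stated assumptions, of the one concrete element in the abstract: learning a preference weight vector over objectives from pairwise trajectory comparisons. A Bradley-Terry-style comparison loss stands in here for the paper's Trajectory Preference Domination, which the abstract does not spell out; the number of objectives and the example returns are hypothetical.

```python
# Hedged sketch: fit a preference weight vector from trajectory comparisons.
import torch

w_logits = torch.zeros(3, requires_grad=True)      # 3 objectives (hypothetical)
opt = torch.optim.Adam([w_logits], lr=0.05)

def scalar_return(objective_returns):               # (num_objectives,) returns of one trajectory
    return (torch.softmax(w_logits, dim=0) * objective_returns).sum()

# each pair: (objective returns of the preferred trajectory, returns of the other one)
pairs = [(torch.tensor([5.0, 1.0, 0.0]), torch.tensor([2.0, 3.0, 0.0]))]
for preferred, other in pairs * 100:
    loss = -torch.log(torch.sigmoid(scalar_return(preferred) - scalar_return(other)))
    opt.zero_grad(); loss.backward(); opt.step()

weights = torch.softmax(w_logits, dim=0)            # learned preference over the objectives
```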

Robust training with ensemble consensus

Title Robust training with ensemble consensus
Authors Anonymous
Abstract Since deep neural networks are over-parameterized, they may memorize noisy examples. We address this memorization issue in the presence of annotation noise. Because deep neural networks cannot generalize to neighborhoods of the features acquired via memorization, noisy examples do not consistently incur small losses on the network under perturbation. Based on this, we propose a novel training method called Learning with Ensemble Consensus (LEC), whose goal is to prevent overfitting to noisy examples by eliminating those identified via the consensus of an ensemble of perturbed networks. One of the proposed LEC variants, LTEC, outperforms the current state-of-the-art methods on MNIST, CIFAR-10, and CIFAR-100 despite its efficient memory usage.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=ryxOUTVYDH
PDF https://openreview.net/pdf?id=ryxOUTVYDH
PWC https://paperswithcode.com/paper/robust-training-with-ensemble-consensus-1
Repo
Framework
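
A minimal sketch of consensus-based filtering as the abstract describes it: several perturbed copies of the network score each example, and only examples that every copy ranks as low-loss are kept for the next update. The perturbation (Gaussian weight noise), the number of copies, and the keep ratio are assumptions, not the paper's exact LEC/LTEC variants.

```python
# Hedged sketch of filtering noisy-label examples via ensemble consensus.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def consensus_keep_mask(model, x, y, num_copies=3, noise_std=0.01, keep_ratio=0.7):
    losses = []
    for _ in range(num_copies):
        perturbed = copy.deepcopy(model)
        with torch.no_grad():
            for p in perturbed.parameters():
                p.add_(noise_std * torch.randn_like(p))                  # perturb the weights
            losses.append(F.cross_entropy(perturbed(x), y, reduction="none"))
    losses = torch.stack(losses)                                         # (num_copies, batch)
    k = int(keep_ratio * x.size(0))
    # consensus: keep an example only if every perturbed copy ranks it among the low-loss ones
    return (losses.argsort(dim=1).argsort(dim=1) < k).all(dim=0)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
mask = consensus_keep_mask(model, x, y)
clean_x, clean_y = x[mask], y[mask]        # the gradient step is then taken only on these
```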

Residual Energy-Based Models for Text Generation

Title Residual Energy-Based Models for Text Generation
Authors Anonymous
Abstract Text generation is ubiquitous in many NLP tasks, from summarization to dialogue and machine translation. The dominant parametric approach is based on locally normalized models which predict one word at a time. While these models work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. To make training tractable, we first work in the residual of a pretrained locally normalized language model, and second, we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation.
Tasks Language Modelling, Machine Translation, Text Generation
Published 2020-01-01
URL https://openreview.net/forum?id=B1l4SgHKDH
PDF https://openreview.net/pdf?id=B1l4SgHKDH
PWC https://paperswithcode.com/paper/residual-energy-based-models-for-text
Repo
Framework
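
A schematic sketch of the residual formulation and its noise-contrastive training signal: the joint score of a sequence is the pretrained LM's log-probability minus a learned energy, and when the noise distribution is the LM itself, the data-vs-noise discriminant reduces to the negated energy alone. The feature dimensions, modules, and random placeholder batches below are assumptions, not the paper's architecture.

```python
# Hedged sketch of a residual energy score trained with binary NCE against LM samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualScorer(nn.Module):
    def __init__(self, feat_dim=16):
        super().__init__()
        self.energy = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, lm_logprob, seq_features):
        # residual modelling: log p(x) is proportional to log p_LM(x) - E(x)
        return lm_logprob - self.energy(seq_features).squeeze(-1)

scorer = ResidualScorer()
# placeholders for a batch of real sequences and LM-sampled (noise) sequences
real_lp, real_feat = torch.randn(8), torch.randn(8, 16)
noise_lp, noise_feat = torch.randn(8), torch.randn(8, 16)

# with the LM as the noise distribution, the classification logit is simply -E(x)
logits_real = scorer(real_lp, real_feat) - real_lp      # = -E(real sequences)
logits_noise = scorer(noise_lp, noise_feat) - noise_lp  # = -E(LM samples)
loss = F.softplus(-logits_real).mean() + F.softplus(logits_noise).mean()
loss.backward()
```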

Asymptotics of Wide Networks from Feynman Diagrams

Title Asymptotics of Wide Networks from Feynman Diagrams
Authors Anonymous
Abstract Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=S1gFvANKDS
PDF https://openreview.net/pdf?id=S1gFvANKDS
PWC https://paperswithcode.com/paper/asymptotics-of-wide-networks-from-feynman-1
Repo
Framework

Compositional Visual Generation with Energy Based Models

Title Compositional Visual Generation with Energy Based Models
Authors Anonymous
Abstract Humans are able both to learn quickly and to rapidly adapt their knowledge. One major component is the ability to incrementally combine many simple concepts to accelerate the learning process. We show that energy-based models are a promising class of models for exhibiting these properties, by directly combining probability distributions. This allows us to combine an arbitrary number of different distributions in a globally coherent manner. We show that this compositionality property allows us to define three basic operators, logical conjunction, disjunction, and negation, on different concepts to generate plausible naturalistic images. Furthermore, by applying these abilities, we show that we are able to extrapolate concept combinations, continually combine previously learned concepts, and infer concept properties in a compositional manner.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BygZARVFDH
PDF https://openreview.net/pdf?id=BygZARVFDH
PWC https://paperswithcode.com/paper/compositional-visual-generation-with-energy
Repo
Framework
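
A compact sketch of how independently trained energy functions could be combined, following the logical-operator view in the abstract: conjunction adds energies (a product of distributions), disjunction takes a soft minimum over energies (mixture-like), and negation flips the sign of an energy, typically used together with a conjunction. The exact temperatures/scalings and the toy per-sample energies are assumptions for illustration.

```python
# Hedged sketch of composing concept energies with logical operators.
import torch

def conj(*energies):            # "A and B": low energy only where all concepts hold
    return sum(energies)

def disj(*energies):            # "A or B": low energy where at least one concept holds
    return -torch.logsumexp(torch.stack([-e for e in energies]), dim=0)

def neg(energy, alpha=1.0):     # "not A": penalize a concept; combine with conj(...)
    return -alpha * energy

# toy per-sample energies from two separately trained concept models
e_smiling = torch.tensor([1.2, 0.3])
e_glasses = torch.tensor([0.4, 2.0])
e_smiling_and_glasses = conj(e_smiling, e_glasses)
e_smiling_or_glasses = disj(e_smiling, e_glasses)
e_smiling_not_glasses = conj(e_smiling, neg(e_glasses))
```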

Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models

Title Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models
Authors Anonymous
Abstract The contrastive divergence algorithm is a popular approach to training energy-based latent variable models, which has been widely used in many machine learning models such as restricted Boltzmann machines and deep belief nets. Despite its empirical success, the contrastive divergence algorithm is also known to have biases that severely affect its convergence. In this article we propose an unbiased version of the contrastive divergence algorithm that completely removes its bias in stochastic gradient methods, based on recent advances in unbiased Markov chain Monte Carlo methods. Rigorous theoretical analysis is developed to justify the proposed algorithm, and numerical experiments show that it significantly improves upon the existing method. Our findings suggest that the unbiased contrastive divergence algorithm is a promising approach to training general energy-based latent variable models.
Tasks Latent Variable Models
Published 2020-01-01
URL https://openreview.net/forum?id=r1eyceSYPr
PDF https://openreview.net/pdf?id=r1eyceSYPr
PWC https://paperswithcode.com/paper/unbiased-contrastive-divergence-algorithm-for
Repo
Framework
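
For context, here is a textbook CD-k gradient for a Bernoulli RBM: the negative phase is estimated from a Gibbs chain truncated after k steps, and that truncation is precisely the source of the bias the paper removes with unbiased (coupled-chain) MCMC estimators. This sketch shows plain CD-k only, not the proposed unbiased variant, and the sizes are placeholders.

```python
# Plain CD-k gradient for the weight matrix of a Bernoulli RBM (illustrates where the bias arises).
import torch

def cd_k_grad(W, b_v, b_h, v_data, k=1):
    def sample_h(v): return torch.bernoulli(torch.sigmoid(v @ W + b_h))
    def sample_v(h): return torch.bernoulli(torch.sigmoid(h @ W.t() + b_v))
    h_data = torch.sigmoid(v_data @ W + b_h)           # positive phase
    v = v_data
    for _ in range(k):                                  # truncated Gibbs chain -> biased negative phase
        v = sample_v(sample_h(v))
    h_model = torch.sigmoid(v @ W + b_h)
    return v_data.t() @ h_data / len(v_data) - v.t() @ h_model / len(v)

W = torch.randn(784, 64) * 0.01
b_v, b_h = torch.zeros(784), torch.zeros(64)
v_batch = torch.bernoulli(torch.rand(32, 784))
g = cd_k_grad(W, b_v, b_h, v_batch, k=1)   # ascend this to raise the data likelihood
```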

Energy-based models for atomic-resolution protein conformations

Title Energy-based models for atomic-resolution protein conformations
Authors Anonymous
Abstract We propose an energy-based model (EBM) of protein conformations that operates at the atomic scale. The model is trained solely on crystallized protein data. By contrast, existing approaches for scoring conformations use energy functions that incorporate knowledge of physical principles and features that are the complex product of several decades of research and tuning. To evaluate our model, we benchmark on the rotamer recovery task, a restricted problem setting used to evaluate energy functions for protein design. Our model achieves performance comparable to the Rosetta energy function, a state-of-the-art method widely used in protein structure prediction and design. An investigation of the model’s outputs and hidden representations finds that it captures physicochemical properties relevant to protein energy.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=S1e_9xrFvS
PDF https://openreview.net/pdf?id=S1e_9xrFvS
PWC https://paperswithcode.com/paper/energy-based-models-for-atomic-resolution
Repo
Framework
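
A schematic of how an atomic-level energy model would be used for the rotamer recovery evaluation mentioned above: score every candidate side-chain conformation in its native context and predict the lowest-energy one. The `energy_model`, the flat feature vectors, and the number of candidates are placeholders, not the paper's featurization or architecture.

```python
# Hedged sketch of rotamer recovery as lowest-energy candidate selection.
import torch
import torch.nn as nn

energy_model = nn.Sequential(nn.Linear(96, 128), nn.ReLU(), nn.Linear(128, 1))

def recover_rotamer(candidate_features: torch.Tensor) -> int:
    """candidate_features: (num_rotamers, feat_dim) features of each candidate conformation."""
    with torch.no_grad():
        energies = energy_model(candidate_features).squeeze(-1)
    return int(energies.argmin())            # predicted rotamer = lowest-energy candidate

candidates = torch.randn(30, 96)             # e.g. 30 discrete rotamer candidates for one residue
predicted = recover_rotamer(candidates)
```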

Causal Induction from Visual Observations for Goal Directed Tasks

Title Causal Induction from Visual Observations for Goal Directed Tasks
Authors Anonymous
Abstract Causal reasoning has been an indispensable capability for humans and other intelligent animals to interact with the physical world. In this work, we propose to endow an artificial agent with the capability of causal reasoning for completing goal-directed tasks. We develop learning-based approaches to inducing causal knowledge in the form of directed acyclic graphs, which can be used to contextualize a learned goal-conditional policy to perform tasks in novel environments with latent causal structures. We leverage attention mechanisms in our causal induction model and goal-conditional policy, enabling us to incrementally generate the causal graph from the agent’s visual observations and to selectively use the induced graph for determining actions. Our experiments show that our method effectively generalizes to completing new tasks in novel environments with previously unseen causal structures.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJl4g0NYvB
PDF https://openreview.net/pdf?id=BJl4g0NYvB
PWC https://paperswithcode.com/paper/causal-induction-from-visual-observations-for-1
Repo
Framework