Paper Group NANR 100
Relational State-Space Model for Stochastic Multi-Object Systems
Title | Relational State-Space Model for Stochastic Multi-Object Systems |
Authors | Anonymous |
Abstract | Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics is generally not easy, due to the inherent difficulty of understanding the complicated interactions and evolution of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with the SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets. |
Tasks | Time Series |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1lGU64tDr |
PDF | https://openreview.net/pdf?id=B1lGU64tDr |
PWC | https://paperswithcode.com/paper/relational-state-space-model-for-stochastic |
Repo | |
Framework | |
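The joint GNN-driven state transition described in the abstract above can be pictured with a small sketch. Below is a minimal NumPy illustration, not the paper's actual R-SSM: the message and update functions, the additive Gaussian noise, and all dimensions are assumptions made purely for illustration.

```python
import numpy as np

def gnn_transition(states, adjacency, w_msg, w_upd):
    """One relational state transition: each object's next latent state depends
    on its own state and on aggregated messages from its neighbors.
    states:    (n_objects, d) current latent states
    adjacency: (n_objects, n_objects) 0/1 interaction graph
    w_msg, w_upd: illustrative weight matrices of shapes (d, d) and (2*d, d)
    """
    messages = np.tanh(states @ w_msg)        # per-object message
    aggregated = adjacency @ messages         # sum messages from neighbors
    joint_input = np.concatenate([states, aggregated], axis=-1)
    mean_next = np.tanh(joint_input @ w_upd)  # mean of the next-state distribution
    # In a stochastic SSM the next state is sampled around this mean; simple
    # additive Gaussian noise stands in for the model's transition distribution.
    return mean_next + 0.1 * np.random.randn(*mean_next.shape)

# Toy usage: 3 interacting objects with 4-dimensional latent states.
rng = np.random.default_rng(0)
n, d = 3, 4
states = rng.normal(size=(n, d))
adjacency = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
w_msg = rng.normal(size=(d, d))
w_upd = rng.normal(size=(2 * d, d))
next_states = gnn_transition(states, adjacency, w_msg, w_upd)
print(next_states.shape)  # (3, 4)
```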
Variance Reduction With Sparse Gradients
Title | Variance Reduction With Sparse Gradients |
Authors | Anonymous |
Abstract | Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (Johnson & Zhang, 2013) and SpiderBoost (Wang et al., 2018), require significantly more computational resources per update than SGD (Robbins & Monro, 1951). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator blending the top-K operator (Stich et al., 2018; Aji & Heafield, 2017) and the randomized coordinate descent operator. While the computational cost of computing the derivative of a model parameter is constant, we observe that the gains in variance reduction are proportional to the magnitude of the derivative. In this paper, we show that a sparse gradient based on the magnitude of past gradients reduces the computational cost of model updates without a significant loss in variance reduction. Theoretically, our algorithm is at least as good as the best available algorithm (e.g. SpiderBoost) under appropriate settings of parameters and can be much more efficient if it succeeds in capturing the sparsity of the gradients. Empirically, our algorithm consistently outperforms SpiderBoost on a variety of models and image classification tasks. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration. |
Tasks | Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Syx1DkSYwB |
PDF | https://openreview.net/pdf?id=Syx1DkSYwB |
PWC | https://paperswithcode.com/paper/variance-reduction-with-sparse-gradients |
Repo | |
Framework | |
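The abstract above describes a sparse gradient operator that blends top-K selection, guided by the magnitude of past gradients, with randomly chosen coordinates. Below is a rough NumPy sketch of such an operator; the exact split between top-K and random coordinates and the exponential moving average used to track past magnitudes are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def sparse_gradient(grad, past_magnitude, k_top, k_rand, rng):
    """Keep the k_top coordinates with the largest historical magnitude plus
    k_rand uniformly random extra coordinates; zero out everything else."""
    d = grad.size
    top_idx = np.argsort(past_magnitude)[-k_top:]          # largest past magnitudes
    remaining = np.setdiff1d(np.arange(d), top_idx)
    rand_idx = rng.choice(remaining, size=k_rand, replace=False)
    keep = np.concatenate([top_idx, rand_idx])
    sparse = np.zeros_like(grad)
    sparse[keep] = grad[keep]
    return sparse

# Toy usage with a moving average of past gradient magnitudes.
rng = np.random.default_rng(1)
d = 20
past_magnitude = np.zeros(d)
for step in range(5):
    grad = rng.normal(size=d)
    sparse = sparse_gradient(grad, past_magnitude, k_top=4, k_rand=2, rng=rng)
    past_magnitude = 0.9 * past_magnitude + 0.1 * np.abs(grad)  # track history
    # The sparse gradient would then feed an SVRG/SpiderBoost-style update.
```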
The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget
Title | The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget |
Authors | Anirudh Goyal, Yoshua Bengio, Matthew Botvinick, Sergey Levine |
Abstract | In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information) and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a "privileged" input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, the compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which estimates, for each example, the value of the privileged information before seeing it, i.e., based only on the standard input, and then stochastically decides whether to access the privileged input. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hye1kTVFDS |
PDF | https://openreview.net/pdf?id=Hye1kTVFDS |
PWC | https://paperswithcode.com/paper/the-variational-bandwidth-bottleneck |
Repo | |
Framework | |
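A minimal sketch of the access decision described in the abstract above: the agent estimates, from the standard input alone, whether the privileged input is worth paying for, and only then stochastically accesses and compresses it. The linear value network, the encoder, and the Gaussian prior fallback below are placeholders, not the paper's architecture.

```python
import numpy as np

def bandwidth_bottleneck(state, privileged, value_net, encoder, rng):
    """Decide from the standard input alone whether the privileged input is
    worth accessing, then stochastically either access and encode it or fall
    back to an uninformative prior sample."""
    p_access = 1.0 / (1.0 + np.exp(-value_net(state)))  # estimated value of access
    if rng.random() < p_access:
        z = encoder(privileged)      # pay the cost: compress the privileged input
    else:
        z = rng.normal(size=4)       # prior sample: no access, zero information
    return z, p_access

# Toy usage with linear stand-ins for the value network and encoder.
rng = np.random.default_rng(2)
value_net = lambda s: float(np.sum(s) - 1.0)
encoder = lambda g: np.tanh(g[:4])
state, goal = rng.normal(size=6), rng.normal(size=8)
z, p = bandwidth_bottleneck(state, goal, value_net, encoder, rng)
```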
Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models
Title | Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models |
Authors | Anonymous |
Abstract | In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as “mixout”, motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE. |
Tasks | Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HkgaETNtDB |
PDF | https://openreview.net/pdf?id=HkgaETNtDB |
PWC | https://paperswithcode.com/paper/mixout-effective-regularization-to-finetune-1 |
Repo | |
Framework | |
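A small NumPy sketch of the mixout operation described in the abstract above: with probability p, each finetuned parameter is swapped back to its pretrained value, and the result is rescaled so that its expectation matches the current parameters, mirroring inverted dropout. This is only a sketch of the parameter-mixing step; how the paper applies it per layer during finetuning goes beyond this illustration.

```python
import numpy as np

def mixout(current, pretrained, p, rng):
    """Mixout: with probability p, replace each current parameter with the
    corresponding pretrained parameter (analogous to dropout replacing an
    activation with zero), then rescale so the expectation stays unchanged."""
    mask = rng.random(current.shape) < p
    mixed = np.where(mask, pretrained, current)
    # Rescaling so that E[output] == current, as in inverted dropout.
    return (mixed - p * pretrained) / (1.0 - p)

# Toy usage on a single weight matrix during finetuning.
rng = np.random.default_rng(3)
pretrained = rng.normal(size=(4, 4))
current = pretrained + 0.05 * rng.normal(size=(4, 4))
mixed = mixout(current, pretrained, p=0.7, rng=rng)
```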
Novelty Detection Via Blurring
Title | Novelty Detection Via Blurring |
Authors | Anonymous |
Abstract | Conventional out-of-distribution (OOD) detection schemes based on variational autoencoders or Random Network Distillation (RND) are known to assign lower uncertainty to OOD data than to the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to blurred images. Based on this observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns a better representation of the target distribution than the baselines. Finally, SVD-RND combined with geometric transforms achieves near-perfect detection accuracy in the CelebA domain. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByeNra4FDB |
PDF | https://openreview.net/pdf?id=ByeNra4FDB |
PWC | https://paperswithcode.com/paper/novelty-detection-via-blurring |
Repo | |
Framework | |
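The abstract above does not spell out how the blurred training images are produced; one simple reading consistent with the name SVD-RND is a truncated-SVD blur that discards small singular values. The sketch below shows that assumed blurring step only, not the RND detector itself.

```python
import numpy as np

def svd_blur(image, rank):
    """Blur a grayscale image by keeping only its top singular components;
    such blurred images would serve as additional OOD-like targets when
    training the RND predictor."""
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    s[rank:] = 0.0  # discard fine detail carried by the small singular values
    return u @ np.diag(s) @ vt

# Toy usage on a random 32x32 "image" at several blur levels.
rng = np.random.default_rng(4)
image = rng.random((32, 32))
blurred = [svd_blur(image, rank) for rank in (2, 4, 8)]
```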
Federated Learning with Matched Averaging
Title | Federated Learning with Matched Averaging |
Authors | Anonymous |
Abstract | Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud. We propose the Federated Matched Averaging (FedMA) algorithm, designed for federated learning of modern neural network architectures, e.g., convolutional neural networks (CNNs) and LSTMs. FedMA constructs the shared global model in a layer-wise manner by matching and averaging hidden elements (i.e., channels for convolution layers; hidden states for LSTMs; neurons for fully connected layers) with similar feature extraction signatures. Our experiments indicate that FedMA outperforms popular state-of-the-art federated learning algorithms on deep CNN and LSTM architectures trained on real-world datasets, while improving communication efficiency. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkluqlSFDS |
PDF | https://openreview.net/pdf?id=BkluqlSFDS |
PWC | https://paperswithcode.com/paper/federated-learning-with-matched-averaging |
Repo | |
Framework | |
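A toy sketch of the layer-wise matching-and-averaging idea described above: rows (neurons or channels) of two clients' weight matrices are aligned by similarity before averaging, so that permutation-equivalent hidden units are not averaged against each other. The greedy cosine-similarity matching used here is a deliberate simplification; FedMA itself solves the matching with a more principled assignment procedure.

```python
import numpy as np

def matched_average(layer_a, layer_b):
    """Align rows (neurons/channels) of two clients' weight matrices by cosine
    similarity, permute client B to match client A, then average."""
    norm_a = layer_a / np.linalg.norm(layer_a, axis=1, keepdims=True)
    norm_b = layer_b / np.linalg.norm(layer_b, axis=1, keepdims=True)
    similarity = norm_a @ norm_b.T
    permutation = np.full(layer_a.shape[0], -1)
    used = set()
    for i in np.argsort(-similarity.max(axis=1)):  # most confident rows first
        for j in np.argsort(-similarity[i]):
            if j not in used:
                permutation[i] = j
                used.add(j)
                break
    return 0.5 * (layer_a + layer_b[permutation])

# Toy usage: two clients hold the same 6-unit layer with rows permuted and noised.
rng = np.random.default_rng(5)
layer_a = rng.normal(size=(6, 10))
layer_b = layer_a[rng.permutation(6)] + 0.01 * rng.normal(size=(6, 10))
global_layer = matched_average(layer_a, layer_b)
```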
Deep Nonlinear Stochastic Optimal Control for Systems with Multiplicative Uncertainties
Title | Deep Nonlinear Stochastic Optimal Control for Systems with Multiplicative Uncertainties |
Authors | Anonymous |
Abstract | We present a deep recurrent neural network architecture to solve a class of stochastic optimal control problems described by fully nonlinear Hamilton-Jacobi-Bellman (HJB) partial differential equations. Such PDEs arise when one considers stochastic dynamics characterized by uncertainties that are additive and control-multiplicative. Stochastic models with these characteristics have been used in computational neuroscience, biology, finance, and aerospace systems, and provide a more accurate representation of actuation than models with additive uncertainty. Previous literature has established the inadequacy of the linear HJB theory and instead relies on a nonlinear Feynman-Kac lemma, resulting in a second-order forward-backward stochastic differential equation (FBSDE) representation. However, the proposed solutions that use this representation suffer from compounding errors and computational complexity, leading to a lack of scalability. In this paper, we propose a deep learning based algorithm that leverages the second-order FBSDE representation and LSTM-based recurrent neural networks to not only solve such stochastic optimal control problems but also overcome the problems faced by previous approaches and scale well to high-dimensional systems. The resulting control algorithm is tested on nonlinear systems in robotics and biomechanics to demonstrate feasibility and improved performance over previous methods. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1gXzxHKvH |
PDF | https://openreview.net/pdf?id=H1gXzxHKvH |
PWC | https://paperswithcode.com/paper/deep-nonlinear-stochastic-optimal-control-for |
Repo | |
Framework | |
Solving single-objective tasks by preference multi-objective reinforcement learning
Title | Solving single-objective tasks by preference multi-objective reinforcement learning |
Authors | Anonymous |
Abstract | Many single-objective tasks in the real world are inevitably related to, and influenced by, other objectives. We call such a task an objective-constrained task, which is inherently a multi-objective problem. Due to the conflict among different objectives, a trade-off is needed. A common compromise is to design a scalar reward function that encodes the relationship among these objectives using the prior knowledge of experts. However, such reward engineering is extremely cumbersome and can result in behaviors that optimize the reward function without actually satisfying our preferences. In this paper, we explicitly cast the objective-constrained task as preference multi-objective reinforcement learning, with the overall goal of finding a Pareto optimal policy. Combined with Trajectory Preference Domination, which we propose, a weight vector that reflects the agent's preference for each objective can be learned. We analyze the feasibility of our algorithm theoretically, and further show in experiments that it outperforms approaches whose reward functions are designed by experts. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJxV5yHYwB |
PDF | https://openreview.net/pdf?id=HJxV5yHYwB |
PWC | https://paperswithcode.com/paper/solving-single-objective-tasks-by-preference |
Repo | |
Framework | |
Robust training with ensemble consensus
Title | Robust training with ensemble consensus |
Authors | Anonymous |
Abstract | Since deep neural networks are over-parametrized, they may memorize noisy examples. We address this memorization issue in the presence of annotation noise. Starting from the fact that deep neural networks cannot generalize to neighborhoods of the features acquired via memorization, we find that noisy examples do not consistently incur small losses on the network in the presence of perturbation. Based on this, we propose a novel training method called Learning with Ensemble Consensus (LEC), whose goal is to prevent overfitting to noisy examples by eliminating the examples identified via the consensus of an ensemble of perturbed networks. One of the proposed LECs, LTEC, outperforms the current state-of-the-art methods on MNIST, CIFAR-10, and CIFAR-100, despite its efficient memory usage. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxOUTVYDH |
PDF | https://openreview.net/pdf?id=ryxOUTVYDH |
PWC | https://paperswithcode.com/paper/robust-training-with-ensemble-consensus-1 |
Repo | |
Framework | |
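A minimal sketch of the ensemble-consensus filtering idea described above: an example is kept for training only if every perturbed network in the ensemble ranks its loss among the smallest. The intersection-of-small-loss-sets criterion below is an assumed simplification of the paper's consensus rule.

```python
import numpy as np

def consensus_filter(losses, keep_fraction):
    """Keep only examples that an ensemble of perturbed networks agrees on:
    an example survives if every ensemble member ranks its loss within the
    smallest keep_fraction. losses has shape (n_networks, n_examples)."""
    n_keep = int(keep_fraction * losses.shape[1])
    small_loss_sets = [set(np.argsort(member)[:n_keep]) for member in losses]
    consensus = set.intersection(*small_loss_sets)
    return sorted(consensus)

# Toy usage: 3 perturbed networks, 10 examples, the last 3 are "noisy".
rng = np.random.default_rng(6)
clean = rng.uniform(0.0, 0.5, size=(3, 7))
noisy = rng.uniform(1.0, 2.0, size=(3, 3))
losses = np.concatenate([clean, noisy], axis=1)
kept = consensus_filter(losses, keep_fraction=0.7)  # indices 7-9 are dropped
```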
Residual Energy-Based Models for Text Generation
Title | Residual Energy-Based Models for Text Generation |
Authors | Anonymous |
Abstract | Text generation is ubiquitous in many NLP tasks, from summarization, to dialogue and machine translation. The dominant parametric approach is based on locally normalized models which predict one word at a time. While these work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. In order to make training tractable, we first work in the residual of a pretrained locally normalized language model and second we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation. |
Tasks | Language Modelling, Machine Translation, Text Generation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1l4SgHKDH |
PDF | https://openreview.net/pdf?id=B1l4SgHKDH |
PWC | https://paperswithcode.com/paper/residual-energy-based-models-for-text |
Repo | |
Framework | |
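The generation procedure mentioned at the end of the abstract, importance sampling from the base language model reweighted by the residual energy, can be sketched compactly. In the toy sketch below, "sequences" are plain integers and the base model and energy function are stand-ins; the real model samples text from a pretrained LM and scores it with a learned sequence-level energy.

```python
import numpy as np

def residual_ebm_sample(lm_sample, energy, n_candidates, rng):
    """Draw one sample from the residual model, p(x) proportional to
    p_LM(x) * exp(-E(x)), by proposing candidates from the base language model
    and resampling them with self-normalized importance weights exp(-E(x))."""
    candidates = [lm_sample(rng) for _ in range(n_candidates)]
    weights = np.array([np.exp(-energy(x)) for x in candidates])
    weights = weights / weights.sum()
    return candidates[rng.choice(n_candidates, p=weights)]

# Toy usage: "sequences" are integers 0..9, the base LM is uniform, and the
# residual energy prefers even values, so even samples should dominate.
rng = np.random.default_rng(7)
lm_sample = lambda r: int(r.integers(0, 10))
energy = lambda x: 0.0 if x % 2 == 0 else 2.0
samples = [residual_ebm_sample(lm_sample, energy, n_candidates=64, rng=rng)
           for _ in range(20)]
print(samples)
```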
Asymptotics of Wide Networks from Feynman Diagrams
Title | Asymptotics of Wide Networks from Feynman Diagrams |
Authors | Anonymous |
Abstract | Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1gFvANKDS |
PDF | https://openreview.net/pdf?id=S1gFvANKDS |
PWC | https://paperswithcode.com/paper/asymptotics-of-wide-networks-from-feynman-1 |
Repo | |
Framework | |
Compositional Visual Generation with Energy Based Models
Title | Compositional Visual Generation with Energy Based Models |
Authors | Anonymous |
Abstract | Humans are able to both learn quickly and rapidly adapt their knowledge. One major component is the ability to incrementally combine many simple concepts to accelerate the learning process. We show that energy based models are a promising class of models towards exhibiting these properties by directly combining probability distributions. This allows us to combine an arbitrary number of different distributions in a globally coherent manner. We show that this compositionality property allows us to define three basic operators on different concepts, namely logical conjunction, disjunction, and negation, to generate plausible naturalistic images. Furthermore, by applying these abilities, we show that we are able to extrapolate concept combinations, continually combine previously learned concepts, and infer concept properties in a compositional manner. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BygZARVFDH |
PDF | https://openreview.net/pdf?id=BygZARVFDH |
PWC | https://paperswithcode.com/paper/compositional-visual-generation-with-energy |
Repo | |
Framework | |
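The three operators named in the abstract have natural energy-space counterparts: conjunction adds energies, disjunction combines them with a log-sum-exp over negated energies, and negation inverts the energy landscape. The sketch below illustrates these combinations on toy 1-D energies; the temperature in the negation operator and the exact forms are assumptions based on standard EBM composition rather than a transcription of the paper.

```python
import numpy as np

def conjunction(energies):
    """AND: a sample must be likely under every concept, so energies add."""
    return sum(energies)

def disjunction(energies):
    """OR: a sample may satisfy any concept; log-sum-exp over negated energies,
    so low energy under any single concept gives low combined energy."""
    stacked = np.stack(energies)
    return -np.log(np.sum(np.exp(-stacked), axis=0))

def negation(energy, temperature=1.0):
    """NOT: invert the energy landscape (scaled by an assumed temperature)."""
    return -energy / temperature

# Toy usage: two 1-D "concepts" as quadratic energies over a grid of points.
x = np.linspace(-3, 3, 7)
e_near_minus_one = (x + 1.0) ** 2
e_near_plus_one = (x - 1.0) ** 2
e_and = conjunction([e_near_minus_one, e_near_plus_one])  # lowest near 0
e_or = disjunction([e_near_minus_one, e_near_plus_one])   # low near -1 and +1
e_not = negation(e_near_plus_one)
```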
Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models
Title | Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models |
Authors | Anonymous |
Abstract | The contrastive divergence algorithm is a popular approach to training energy-based latent variable models, which has been widely used in many machine learning models such as restricted Boltzmann machines and deep belief nets. Despite its empirical success, the contrastive divergence algorithm is also known to have biases that severely affect its convergence. In this article we propose an unbiased version of the contrastive divergence algorithm that completely removes its bias in stochastic gradient methods, based on recent advances in unbiased Markov chain Monte Carlo methods. Rigorous theoretical analysis is developed to justify the proposed algorithm, and numerical experiments show that it significantly improves on the existing method. Our findings suggest that the unbiased contrastive divergence algorithm is a promising approach to training general energy-based latent variable models. |
Tasks | Latent Variable Models |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1eyceSYPr |
PDF | https://openreview.net/pdf?id=r1eyceSYPr |
PWC | https://paperswithcode.com/paper/unbiased-contrastive-divergence-algorithm-for |
Repo | |
Framework | |
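The debiasing idea, removing the bias of a k-step negative phase with a pair of coupled chains, can be illustrated on a tiny finite-state chain. The sketch below uses a common-random-number coupling and the telescoping estimator from the unbiased MCMC literature; the Gibbs transitions and couplings used for actual energy-based latent variable models are considerably more involved, so treat this only as a schematic of the estimator's structure.

```python
import numpy as np

def unbiased_expectation(P, h, init, k, rng, max_steps=100_000):
    """Unbiased estimate of E_pi[h(X)] for a finite-state chain with transition
    matrix P, via two coupled chains and the telescoping estimator
        H = h(X_k) + sum_{t=k+1}^{tau-1} (h(X_t) - h(Y_{t-1})),
    where X runs one step ahead of Y and tau is their meeting time."""
    def step(state, u):
        return int(np.searchsorted(np.cumsum(P[state]), u))

    x = step(init(rng), rng.random())  # X_1 (one step ahead of Y)
    y = init(rng)                      # Y_0
    t, h_xk, correction = 1, None, 0.0
    while t < max_steps:
        if t == k:
            h_xk = h(x)                # the usual (biased) k-step term
        if t >= k and x == y:          # chains have met: all later terms vanish
            break
        if t > k:
            correction += h(x) - h(y)  # telescoping bias correction
        u = rng.random()               # common random number couples the chains
        x, y = step(x, u), step(y, u)
        t += 1
    return h_xk + correction

# Toy check: averaging many estimates approaches the stationary mean of h.
P = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]])
rng = np.random.default_rng(8)
init = lambda r: int(r.integers(0, 3))
h = lambda s: float(s)
print(np.mean([unbiased_expectation(P, h, init, k=5, rng=rng) for _ in range(5000)]))
```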
Energy-based models for atomic-resolution protein conformations
Title | Energy-based models for atomic-resolution protein conformations |
Authors | Anonymous |
Abstract | We propose an energy-based model (EBM) of protein conformations that operates at atomic scale. The model is trained solely on crystallized protein data. By contrast, existing approaches for scoring conformations use energy functions that incorporate knowledge of physical principles and features that are the complex product of several decades of research and tuning. To evaluate our model, we benchmark on the rotamer recovery task, a restricted problem setting used to evaluate energy functions for protein design. Our model achieves comparable performance to the Rosetta energy function, a state-of-the-art method widely used in protein structure prediction and design. An investigation of the model's outputs and hidden representations finds that it captures physicochemical properties relevant to protein energy. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1e_9xrFvS |
PDF | https://openreview.net/pdf?id=S1e_9xrFvS |
PWC | https://paperswithcode.com/paper/energy-based-models-for-atomic-resolution |
Repo | |
Framework | |
Causal Induction from Visual Observations for Goal Directed Tasks
Title | Causal Induction from Visual Observations for Goal Directed Tasks |
Authors | Anonymous |
Abstract | Causal reasoning has been an indispensable capability for humans and other intelligent animals to interact with the physical world. In this work, we propose to endow an artificial agent with the capability of causal reasoning for completing goal-directed tasks. We develop learning-based approaches to inducing causal knowledge in the form of directed acyclic graphs, which can be used to contextualize a learned goal-conditional policy to perform tasks in novel environments with latent causal structures. We leverage attention mechanisms in our causal induction model and goal-conditional policy, enabling us to incrementally generate the causal graph from the agent's visual observations and to selectively use the induced graph for determining actions. Our experiments show that our method effectively generalizes to completing new tasks in novel environments with previously unseen causal structures. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJl4g0NYvB |
PDF | https://openreview.net/pdf?id=BJl4g0NYvB |
PWC | https://paperswithcode.com/paper/causal-induction-from-visual-observations-for-1 |
Repo | |
Framework | |