Paper Group NANR 100
Relational State-Space Model for Stochastic Multi-Object Systems
Title | Relational State-Space Model for Stochastic Multi-Object Systems |
Authors | Anonymous |
Abstract | Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics is generally not easy, due to the inherent difficulty of understanding the complicated interactions and evolution of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with the SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets. |
Tasks | Time Series |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1lGU64tDr |
PDF | https://openreview.net/pdf?id=B1lGU64tDr |
PWC | https://paperswithcode.com/paper/relational-state-space-model-for-stochastic |
Repo | |
Framework | |
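The joint GNN-driven state transition described in the abstract above can be pictured with a small sketch. Below is a minimal NumPy illustration, not the paper's actual R-SSM: the message and update functions, the additive Gaussian noise, and all dimensions are assumptions made purely for illustration.

```python
import numpy as np

def gnn_transition(states, adjacency, w_msg, w_upd):
    """One relational state transition: each object's next latent state depends
    on its own state and on aggregated messages from its neighbors.
    states:    (n_objects, d) current latent states
    adjacency: (n_objects, n_objects) 0/1 interaction graph
    w_msg, w_upd: illustrative weight matrices of shapes (d, d) and (2*d, d)
    """
    messages = np.tanh(states @ w_msg)        # per-object message
    aggregated = adjacency @ messages         # sum messages from neighbors
    joint_input = np.concatenate([states, aggregated], axis=-1)
    mean_next = np.tanh(joint_input @ w_upd)  # mean of the next-state distribution
    # In a stochastic SSM the next state is sampled around this mean; simple
    # additive Gaussian noise stands in for the model's transition distribution.
    return mean_next + 0.1 * np.random.randn(*mean_next.shape)

# Toy usage: 3 interacting objects with 4-dimensional latent states.
rng = np.random.default_rng(0)
n, d = 3, 4
states = rng.normal(size=(n, d))
adjacency = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
w_msg = rng.normal(size=(d, d))
w_upd = rng.normal(size=(2 * d, d))
next_states = gnn_transition(states, adjacency, w_msg, w_upd)
print(next_states.shape)  # (3, 4)
```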
Variance Reduction With Sparse Gradients
Title | Variance Reduction With Sparse Gradients |
Authors | Anonymous |
Abstract | Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (Johnson & Zhang, 2013) and SpiderBoost (Wang et al., 2018), require significantly more computational resources per update than SGD (Robbins & Monro, 1951). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator blending the top-K operator (Stich et al., 2018; Aji & Heafield, 2017) and the randomized coordinate descent operator. While the computational cost of computing the derivative of a model parameter is constant, we observe that the gains in variance reduction are proportional to the magnitude of the derivative. In this paper, we show that a sparse gradient based on the magnitude of past gradients reduces the computational cost of model updates without a significant loss in variance reduction. Theoretically, our algorithm is at least as good as the best available algorithm (e.g. SpiderBoost) under appropriate settings of parameters and can be much more efficient if it succeeds in capturing the sparsity of the gradients. Empirically, our algorithm consistently outperforms SpiderBoost on a variety of models and image classification tasks. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration. |
Tasks | Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Syx1DkSYwB |
PDF | https://openreview.net/pdf?id=Syx1DkSYwB |
PWC | https://paperswithcode.com/paper/variance-reduction-with-sparse-gradients |
Repo | |
Framework | |
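The abstract above describes a sparse gradient operator that blends top-K selection, guided by the magnitude of past gradients, with randomly chosen coordinates. Below is a rough NumPy sketch of such an operator; the exact split between top-K and random coordinates and the exponential moving average used to track past magnitudes are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def sparse_gradient(grad, past_magnitude, k_top, k_rand, rng):
    """Keep the k_top coordinates with the largest historical magnitude plus
    k_rand uniformly random extra coordinates; zero out everything else."""
    d = grad.size
    top_idx = np.argsort(past_magnitude)[-k_top:]          # largest past magnitudes
    remaining = np.setdiff1d(np.arange(d), top_idx)
    rand_idx = rng.choice(remaining, size=k_rand, replace=False)
    keep = np.concatenate([top_idx, rand_idx])
    sparse = np.zeros_like(grad)
    sparse[keep] = grad[keep]
    return sparse

# Toy usage with a moving average of past gradient magnitudes.
rng = np.random.default_rng(1)
d = 20
past_magnitude = np.zeros(d)
for step in range(5):
    grad = rng.normal(size=d)
    sparse = sparse_gradient(grad, past_magnitude, k_top=4, k_rand=2, rng=rng)
    past_magnitude = 0.9 * past_magnitude + 0.1 * np.abs(grad)  # track history
    # The sparse gradient would then feed an SVRG/SpiderBoost-style update.
```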
The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget
Title | The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget |
Authors | Anirudh Goyal, Yoshua Bengio, Matthew Botvinick, Sergey Levine |
Abstract | In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information) and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a "privileged" input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, the compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which estimates, for each example, the value of the privileged information before seeing it, i.e., based only on the standard input, and then stochastically decides whether to access the privileged input. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hye1kTVFDS |
PDF | https://openreview.net/pdf?id=Hye1kTVFDS |
PWC | https://paperswithcode.com/paper/the-variational-bandwidth-bottleneck |
Repo | |
Framework | |
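A minimal sketch of the access decision described in the abstract above: the agent estimates, from the standard input alone, whether the privileged input is worth paying for, and only then stochastically accesses and compresses it. The linear value network, the encoder, and the Gaussian prior fallback below are placeholders, not the paper's architecture.

```python
import numpy as np

def bandwidth_bottleneck(state, privileged, value_net, encoder, rng):
    """Decide from the standard input alone whether the privileged input is
    worth accessing, then stochastically either access and encode it or fall
    back to an uninformative prior sample."""
    p_access = 1.0 / (1.0 + np.exp(-value_net(state)))  # estimated value of access
    if rng.random() < p_access:
        z = encoder(privileged)      # pay the cost: compress the privileged input
    else:
        z = rng.normal(size=4)       # prior sample: no access, zero information
    return z, p_access

# Toy usage with linear stand-ins for the value network and encoder.
rng = np.random.default_rng(2)
value_net = lambda s: float(np.sum(s) - 1.0)
encoder = lambda g: np.tanh(g[:4])
state, goal = rng.normal(size=6), rng.normal(size=8)
z, p = bandwidth_bottleneck(state, goal, value_net, encoder, rng)
```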
Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models
Title | Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models |
Authors | Anonymous |
Abstract | In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as “mixout”, motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE. |
Tasks | Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HkgaETNtDB |
PDF | https://openreview.net/pdf?id=HkgaETNtDB |
PWC | https://paperswithcode.com/paper/mixout-effective-regularization-to-finetune-1 |
Repo | |
Framework | |
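A small NumPy sketch of the mixout operation described in the abstract above: with probability p, each finetuned parameter is swapped back to its pretrained value, and the result is rescaled so that its expectation matches the current parameters, mirroring inverted dropout. This is only a sketch of the parameter-mixing step; how the paper applies it per layer during finetuning goes beyond this illustration.

```python
import numpy as np

def mixout(current, pretrained, p, rng):
    """Mixout: with probability p, replace each current parameter with the
    corresponding pretrained parameter (analogous to dropout replacing an
    activation with zero), then rescale so the expectation stays unchanged."""
    mask = rng.random(current.shape) < p
    mixed = np.where(mask, pretrained, current)
    # Rescaling so that E[output] == current, as in inverted dropout.
    return (mixed - p * pretrained) / (1.0 - p)

# Toy usage on a single weight matrix during finetuning.
rng = np.random.default_rng(3)
pretrained = rng.normal(size=(4, 4))
current = pretrained + 0.05 * rng.normal(size=(4, 4))
mixed = mixout(current, pretrained, p=0.7, rng=rng)
```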
Novelty Detection Via Blurring
Title | Novelty Detection Via Blurring |
Authors | Anonymous |
Abstract | Conventional out-of-distribution (OOD) detection schemes based on variational autoencoders or Random Network Distillation (RND) are known to assign lower uncertainty to OOD data than to the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to blurred images. Based on this observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns a better representation of the target distribution than the baselines. Finally, SVD-RND combined with geometric transforms achieves near-perfect detection accuracy in the CelebA domain. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByeNra4FDB |
PDF | https://openreview.net/pdf?id=ByeNra4FDB |
PWC | https://paperswithcode.com/paper/novelty-detection-via-blurring |
Repo | |
Framework | |
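The abstract above does not spell out how the blurred training images are produced; one simple reading consistent with the name SVD-RND is a truncated-SVD blur that discards small singular values. The sketch below shows that assumed blurring step only, not the RND detector itself.

```python
import numpy as np

def svd_blur(image, rank):
    """Blur a grayscale image by keeping only its top singular components;
    such blurred images would serve as additional OOD-like targets when
    training the RND predictor."""
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    s[rank:] = 0.0  # discard fine detail carried by the small singular values
    return u @ np.diag(s) @ vt

# Toy usage on a random 32x32 "image" at several blur levels.
rng = np.random.default_rng(4)
image = rng.random((32, 32))
blurred = [svd_blur(image, rank) for rank in (2, 4, 8)]
```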
Federated Learning with Matched Averaging
Title | Federated Learning with Matched Averaging |
Authors | Anonymous |
Abstract | Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud. We propose the Federated Matched Averaging (FedMA) algorithm, designed for federated learning of modern neural network architectures, e.g., convolutional neural networks (CNNs) and LSTMs. FedMA constructs the shared global model in a layer-wise manner by matching and averaging hidden elements (i.e., channels for convolution layers; hidden states for LSTMs; neurons for fully connected layers) with similar feature extraction signatures. Our experiments indicate that FedMA outperforms popular state-of-the-art federated learning algorithms on deep CNN and LSTM architectures trained on real-world datasets, while improving communication efficiency. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkluqlSFDS |
PDF | https://openreview.net/pdf?id=BkluqlSFDS |
PWC | https://paperswithcode.com/paper/federated-learning-with-matched-averaging |
Repo | |
Framework | |
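A toy sketch of the layer-wise matching-and-averaging idea described above: rows (neurons or channels) of two clients' weight matrices are aligned by similarity before averaging, so that permutation-equivalent hidden units are not averaged against each other. The greedy cosine-similarity matching used here is a deliberate simplification; FedMA itself solves the matching with a more principled assignment procedure.

```python
import numpy as np

def matched_average(layer_a, layer_b):
    """Align rows (neurons/channels) of two clients' weight matrices by cosine
    similarity, permute client B to match client A, then average."""
    norm_a = layer_a / np.linalg.norm(layer_a, axis=1, keepdims=True)
    norm_b = layer_b / np.linalg.norm(layer_b, axis=1, keepdims=True)
    similarity = norm_a @ norm_b.T
    permutation = np.full(layer_a.shape[0], -1)
    used = set()
    for i in np.argsort(-similarity.max(axis=1)):  # most confident rows first
        for j in np.argsort(-similarity[i]):
            if j not in used:
                permutation[i] = j
                used.add(j)
                break
    return 0.5 * (layer_a + layer_b[permutation])

# Toy usage: two clients hold the same 6-unit layer with rows permuted and noised.
rng = np.random.default_rng(5)
layer_a = rng.normal(size=(6, 10))
layer_b = layer_a[rng.permutation(6)] + 0.01 * rng.normal(size=(6, 10))
global_layer = matched_average(layer_a, layer_b)
```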
Deep Nonlinear Stochastic Optimal Control for Systems with Multiplicative Uncertainties
Title | Deep Nonlinear Stochastic Optimal Control for Systems with Multiplicative Uncertainties |
Authors | Anonymous |
Abstract | We present a deep recurrent neural network architecture to solve a class of stochastic optimal control problems described by fully nonlinear Hamilton-Jacobi-Bellman (HJB) partial differential equations. Such PDEs arise when one considers stochastic dynamics characterized by uncertainties that are additive and control-multiplicative. Stochastic models with these characteristics have been used in computational neuroscience, biology, finance, and aerospace systems, and provide a more accurate representation of actuation than models with additive uncertainty. Previous literature has established the inadequacy of the linear HJB theory and instead relies on a nonlinear Feynman-Kac lemma, resulting in a second-order forward-backward stochastic differential equation (FBSDE) representation. However, the proposed solutions that use this representation suffer from compounding errors and computational complexity, leading to a lack of scalability. In this paper, we propose a deep learning based algorithm that leverages the second-order FBSDE representation and LSTM-based recurrent neural networks to not only solve such stochastic optimal control problems but also overcome the problems faced by previous approaches and scale well to high-dimensional systems. The resulting control algorithm is tested on nonlinear systems in robotics and biomechanics to demonstrate feasibility and improved performance over previous methods. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1gXzxHKvH |
PDF | https://openreview.net/pdf?id=H1gXzxHKvH |
PWC | https://paperswithcode.com/paper/deep-nonlinear-stochastic-optimal-control-for |
Repo | |
Framework | |
Solving single-objective tasks by preference multi-objective reinforcement learning
Title | Solving single-objective tasks by preference multi-objective reinforcement learning |
Authors | Anonymous |
Abstract | Many single-objective tasks in the real world are inevitably related to, and influenced by, other objectives. We call such a task an objective-constrained task, which is inherently a multi-objective problem. Due to the conflict among different objectives, a trade-off is needed. A common compromise is to design a scalar reward function that encodes the relationship among these objectives using the prior knowledge of experts. However, such reward engineering is extremely cumbersome and can result in behaviors that optimize the reward function without actually satisfying our preferences. In this paper, we explicitly cast the objective-constrained task as preference multi-objective reinforcement learning, with the overall goal of finding a Pareto optimal policy. Combined with Trajectory Preference Domination, which we propose, a weight vector that reflects the agent's preference for each objective can be learned. We analyze the feasibility of our algorithm theoretically, and further show in experiments that it outperforms approaches whose reward functions are designed by experts. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJxV5yHYwB |
PDF | https://openreview.net/pdf?id=HJxV5yHYwB |
PWC | https://paperswithcode.com/paper/solving-single-objective-tasks-by-preference |
Repo | |
Framework | |
Robust training with ensemble consensus
Title | Robust training with ensemble consensus |
Authors | Anonymous |
Abstract | Since deep neural networks are over-parametrized, they may memorize noisy examples. We address this memorization issue in the presence of annotation noise. Starting from the fact that deep neural networks cannot generalize to neighborhoods of the features acquired via memorization, we find that noisy examples do not consistently incur small losses on the network in the presence of perturbation. Based on this, we propose a novel training method called Learning with Ensemble Consensus (LEC), whose goal is to prevent overfitting to noisy examples by eliminating the examples identified via the consensus of an ensemble of perturbed networks. One of the proposed LECs, LTEC, outperforms the current state-of-the-art methods on MNIST, CIFAR-10, and CIFAR-100, despite its efficient memory usage. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxOUTVYDH |
PDF | https://openreview.net/pdf?id=ryxOUTVYDH |
PWC | https://paperswithcode.com/paper/robust-training-with-ensemble-consensus-1 |
Repo | |
Framework | |
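A minimal sketch of the ensemble-consensus filtering idea described above: an example is kept for training only if every perturbed network in the ensemble ranks its loss among the smallest. The intersection-of-small-loss-sets criterion below is an assumed simplification of the paper's consensus rule.

```python
import numpy as np

def consensus_filter(losses, keep_fraction):
    """Keep only examples that an ensemble of perturbed networks agrees on:
    an example survives if every ensemble member ranks its loss within the
    smallest keep_fraction. losses has shape (n_networks, n_examples)."""
    n_keep = int(keep_fraction * losses.shape[1])
    small_loss_sets = [set(np.argsort(member)[:n_keep]) for member in losses]
    consensus = set.intersection(*small_loss_sets)
    return sorted(consensus)

# Toy usage: 3 perturbed networks, 10 examples, the last 3 are "noisy".
rng = np.random.default_rng(6)
clean = rng.uniform(0.0, 0.5, size=(3, 7))
noisy = rng.uniform(1.0, 2.0, size=(3, 3))
losses = np.concatenate([clean, noisy], axis=1)
kept = consensus_filter(losses, keep_fraction=0.7)  # indices 7-9 are dropped
```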
Residual Energy-Based Models for Text Generation
Title | Residual Energy-Based Models for Text Generation |
Authors | Anonymous |
Abstract | Text generation is ubiquitous in many NLP tasks, from summarization, to dialogue and machine translation. The dominant parametric approach is based on locally normalized models which predict one word at a time. While these work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. In order to make training tractable, we first work in the residual of a pretrained locally normalized language model and second we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation. |
Tasks | Language Modelling, Machine Translation, Text Generation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1l4SgHKDH |
PDF | https://openreview.net/pdf?id=B1l4SgHKDH |
PWC | https://paperswithcode.com/paper/residual-energy-based-models-for-text |
Repo | |
Framework | |
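The generation procedure mentioned at the end of the abstract, importance sampling from the base language model reweighted by the residual energy, can be sketched compactly. In the toy sketch below, "sequences" are plain integers and the base model and energy function are stand-ins; the real model samples text from a pretrained LM and scores it with a learned sequence-level energy.

```python
import numpy as np

def residual_ebm_sample(lm_sample, energy, n_candidates, rng):
    """Draw one sample from the residual model, p(x) proportional to
    p_LM(x) * exp(-E(x)), by proposing candidates from the base language model
    and resampling them with self-normalized importance weights exp(-E(x))."""
    candidates = [lm_sample(rng) for _ in range(n_candidates)]
    weights = np.array([np.exp(-energy(x)) for x in candidates])
    weights = weights / weights.sum()
    return candidates[rng.choice(n_candidates, p=weights)]

# Toy usage: "sequences" are integers 0..9, the base LM is uniform, and the
# residual energy prefers even values, so even samples should dominate.
rng = np.random.default_rng(7)
lm_sample = lambda r: int(r.integers(0, 10))
energy = lambda x: 0.0 if x % 2 == 0 else 2.0
samples = [residual_ebm_sample(lm_sample, energy, n_candidates=64, rng=rng)
           for _ in range(20)]
print(samples)
```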
Asymptotics of Wide Networks from Feynman Diagrams
Title | Asymptotics of Wide Networks from Feynman Diagrams |
Authors | Anonymous |
Abstract | Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1gFvANKDS |
PDF | https://openreview.net/pdf?id=S1gFvANKDS |
PWC | https://paperswithcode.com/paper/asymptotics-of-wide-networks-from-feynman-1 |
Repo | |
Framework | |
Compositional Visual Generation with Energy Based Models
Title | Compositional Visual Generation with Energy Based Models |
Authors | Anonymous |
Abstract | Humans are able to both learn quickly and rapidly adapt their knowledge. One major component is the ability to incrementally combine many simple concepts to accelerate the learning process. We show that energy based models are a promising class of models towards exhibiting these properties by directly combining probability distributions. This allows us to combine an arbitrary number of different distributions in a globally coherent manner. We show that this compositionality property allows us to define three basic operators on different concepts, namely logical conjunction, disjunction, and negation, to generate plausible naturalistic images. Furthermore, by applying these abilities, we show that we are able to extrapolate concept combinations, continually combine previously learned concepts, and infer concept properties in a compositional manner. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BygZARVFDH |
PDF | https://openreview.net/pdf?id=BygZARVFDH |
PWC | https://paperswithcode.com/paper/compositional-visual-generation-with-energy |
Repo | |
Framework | |
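The three operators named in the abstract have natural energy-space counterparts: conjunction adds energies, disjunction combines them with a log-sum-exp over negated energies, and negation inverts the energy landscape. The sketch below illustrates these combinations on toy 1-D energies; the temperature in the negation operator and the exact forms are assumptions based on standard EBM composition rather than a transcription of the paper.

```python
import numpy as np

def conjunction(energies):
    """AND: a sample must be likely under every concept, so energies add."""
    return sum(energies)

def disjunction(energies):
    """OR: a sample may satisfy any concept; log-sum-exp over negated energies,
    so low energy under any single concept gives low combined energy."""
    stacked = np.stack(energies)
    return -np.log(np.sum(np.exp(-stacked), axis=0))

def negation(energy, temperature=1.0):
    """NOT: invert the energy landscape (scaled by an assumed temperature)."""
    return -energy / temperature

# Toy usage: two 1-D "concepts" as quadratic energies over a grid of points.
x = np.linspace(-3, 3, 7)
e_near_minus_one = (x + 1.0) ** 2
e_near_plus_one = (x - 1.0) ** 2
e_and = conjunction([e_near_minus_one, e_near_plus_one])  # lowest near 0
e_or = disjunction([e_near_minus_one, e_near_plus_one])   # low near -1 and +1
e_not = negation(e_near_plus_one)
```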
Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models
Title | Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models |
Authors | Anonymous |
Abstract | The contrastive divergence algorithm is a popular approach to training energy-based latent variable models, which has been widely used in many machine learning models such as restricted Boltzmann machines and deep belief nets. Despite its empirical success, the contrastive divergence algorithm is also known to have biases that severely affect its convergence. In this article we propose an unbiased version of the contrastive divergence algorithm that completely removes its bias in stochastic gradient methods, based on recent advances in unbiased Markov chain Monte Carlo methods. Rigorous theoretical analysis is developed to justify the proposed algorithm, and numerical experiments show that it significantly improves on the existing method. Our findings suggest that the unbiased contrastive divergence algorithm is a promising approach to training general energy-based latent variable models. |
Tasks | Latent Variable Models |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1eyceSYPr |
PDF | https://openreview.net/pdf?id=r1eyceSYPr |
PWC | https://paperswithcode.com/paper/unbiased-contrastive-divergence-algorithm-for |
Repo | |
Framework | |
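The debiasing idea, removing the bias of a k-step negative phase with a pair of coupled chains, can be illustrated on a tiny finite-state chain. The sketch below uses a common-random-number coupling and the telescoping estimator from the unbiased MCMC literature; the Gibbs transitions and couplings used for actual energy-based latent variable models are considerably more involved, so treat this only as a schematic of the estimator's structure.

```python
import numpy as np

def unbiased_expectation(P, h, init, k, rng, max_steps=100_000):
    """Unbiased estimate of E_pi[h(X)] for a finite-state chain with transition
    matrix P, via two coupled chains and the telescoping estimator
        H = h(X_k) + sum_{t=k+1}^{tau-1} (h(X_t) - h(Y_{t-1})),
    where X runs one step ahead of Y and tau is their meeting time."""
    def step(state, u):
        return int(np.searchsorted(np.cumsum(P[state]), u))

    x = step(init(rng), rng.random())  # X_1 (one step ahead of Y)
    y = init(rng)                      # Y_0
    t, h_xk, correction = 1, None, 0.0
    while t < max_steps:
        if t == k:
            h_xk = h(x)                # the usual (biased) k-step term
        if t >= k and x == y:          # chains have met: all later terms vanish
            break
        if t > k:
            correction += h(x) - h(y)  # telescoping bias correction
        u = rng.random()               # common random number couples the chains
        x, y = step(x, u), step(y, u)
        t += 1
    return h_xk + correction

# Toy check: averaging many estimates approaches the stationary mean of h.
P = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]])
rng = np.random.default_rng(8)
init = lambda r: int(r.integers(0, 3))
h = lambda s: float(s)
print(np.mean([unbiased_expectation(P, h, init, k=5, rng=rng) for _ in range(5000)]))
```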
Energy-based models for atomic-resolution protein conformations
Title | Energy-based models for atomic-resolution protein conformations |
Authors | Anonymous |
Abstract | We propose an energy-based model (EBM) of protein conformations that operates at atomic scale. The model is trained solely on crystallized protein data. By contrast, existing approaches for scoring conformations use energy functions that incorporate knowledge of physical principles and features that are the complex product of several decades of research and tuning. To evaluate our model, we benchmark on the rotamer recovery task, a restricted problem setting used to evaluate energy functions for protein design. Our model achieves comparable performance to the Rosetta energy function, a state-of-the-art method widely used in protein structure prediction and design. An investigation of the model's outputs and hidden representations finds that it captures physicochemical properties relevant to protein energy. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1e_9xrFvS |
PDF | https://openreview.net/pdf?id=S1e_9xrFvS |
PWC | https://paperswithcode.com/paper/energy-based-models-for-atomic-resolution |
Repo | |
Framework | |
Causal Induction from Visual Observations for Goal Directed Tasks
Title | Causal Induction from Visual Observations for Goal Directed Tasks |
Authors | Anonymous |
Abstract | Causal reasoning has been an indispensable capability for humans and other intelligent animals to interact with the physical world. In this work, we propose to endow an artificial agent with the capability of causal reasoning for completing goal-directed tasks. We develop learning-based approaches to inducing causal knowledge in the form of directed acyclic graphs, which can be used to contextualize a learned goal-conditional policy to perform tasks in novel environments with latent causal structures. We leverage attention mechanisms in our causal induction model and goal-conditional policy, enabling us to incrementally generate the causal graph from the agent's visual observations and to selectively use the induced graph for determining actions. Our experiments show that our method effectively generalizes to completing new tasks in novel environments with previously unseen causal structures. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJl4g0NYvB |
PDF | https://openreview.net/pdf?id=BJl4g0NYvB |
PWC | https://paperswithcode.com/paper/causal-induction-from-visual-observations-for-1 |
Repo | |
Framework | |