Paper Group NANR 120
Papers in this group:
- Continual Learning via Principal Components Projection
- Semi-supervised semantic segmentation needs strong, high-dimensional perturbations
- Learning Function-Specific Word Representations
- Geometric Insights into the Convergence of Nonlinear TD Learning
- Dynamical System Embedding for Efficient Intrinsically Motivated Artificial Agents
- Certifying Distributional Robustness using Lipschitz Regularisation
- On the Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks
- Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
- Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity
- MEMO: A Deep Network for Flexible Combination of Episodic Memories
- Mem2Mem: Learning to Summarize Long Texts with Memory-to-Memory Transfer
- Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
- Value-Driven Hindsight Modelling
- Training Deep Neural Networks with Partially Adaptive Momentum
- Neural Non-additive Utility Aggregation
Continual Learning via Principal Components Projection
Title | Continual Learning via Principal Components Projection |
Authors | Anonymous |
Abstract | Continual learning in neural networks (NN) often suffers from catastrophic forgetting. That is, when learning a sequence of tasks on an NN, the learning of a new task will cause weight changes that may destroy the learned knowledge embedded in the weights for previous tasks. Without solving this problem, it is difficult to use an NN to perform continual or lifelong learning. Although researchers have attempted to solve the problem in many ways, it remains challenging. In this paper, we propose a new approach, called principal components projection (PCP). The idea is that in learning a new task, if we can ensure that the gradient updates will occur only in directions orthogonal to the input vectors of the previous tasks, then the weight updates for learning the new task will not affect the previous tasks. We propose to compute the principal components of the input vectors and use them to transform the input and to project the gradient updates for learning each new task. PCP does not need to store any sampled data from previous tasks or to generate pseudo data of previous tasks and use them to help learn a new task. Empirical evaluation shows that the proposed PCP method markedly outperforms the state-of-the-art baseline methods. |
Tasks | Continual Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkxlElBYDS |
PDF | https://openreview.net/pdf?id=SkxlElBYDS |
PWC | https://paperswithcode.com/paper/continual-learning-via-principal-components |
Repo | |
Framework | |
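As a rough illustration of the projection idea in the PCP abstract above, the sketch below (my own minimal version, not the authors' code; the function and variable names are assumptions) computes the principal directions of the inputs seen in previous tasks and removes the component of a new gradient that falls inside that subspace.

```python
import numpy as np

def principal_subspace(prev_inputs, var_threshold=0.99):
    """Top principal directions of the rows of `prev_inputs` (inputs from previous tasks)."""
    X = prev_inputs - prev_inputs.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(explained, var_threshold)) + 1
    return Vt[:k].T                      # (d, k) orthonormal basis of the retained subspace

def project_out(grad, basis):
    """Keep only the part of `grad` orthogonal to the previous tasks' input subspace."""
    return grad - basis @ (basis.T @ grad)

# Usage sketch for one layer: before applying a step for the new task,
#   basis = principal_subspace(previous_task_inputs)   # previous_task_inputs: (n, d)
#   safe_update = project_out(raw_gradient, basis)      # raw_gradient: (d,) or (d, m)
```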
Semi-supervised semantic segmentation needs strong, high-dimensional perturbations
Title | Semi-supervised semantic segmentation needs strong, high-dimensional perturbations |
Authors | Anonymous |
Abstract | Consistency regularization describes a class of approaches that have yielded groundbreaking results in semi-supervised classification problems. Prior work has established the cluster assumption, under which the data distribution consists of uniform class clusters of samples separated by low-density regions, as key to its success. We analyze the problem of semantic segmentation and find that the data distribution does not exhibit low-density regions separating classes, and offer this as an explanation for why semi-supervised segmentation is a challenging problem. We then identify the conditions that allow consistency regularization to work even without such low-density regions. This allows us to generalize the recently proposed CutMix augmentation technique to a powerful masked variant, CowMix, leading to a successful application of consistency regularization in the semi-supervised semantic segmentation setting and reaching state-of-the-art results on several standard datasets. |
Tasks | Semantic Segmentation, Semi-Supervised Semantic Segmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1eBoJStwr |
PDF | https://openreview.net/pdf?id=B1eBoJStwr |
PWC | https://paperswithcode.com/paper/semi-supervised-semantic-segmentation-needs |
Repo | |
Framework | |
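The abstract does not spell out how the CowMix masks are produced; a common way to obtain irregular "cow-spot" masks, used here purely as an illustrative assumption, is to threshold Gaussian-smoothed noise and then blend two unlabeled images (and, for the consistency target, their predictions) with the resulting mask.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def cow_mask(shape, sigma=16.0, p=0.5, rng=None):
    """Binary mask with irregular blobs covering roughly a fraction `p` of the pixels,
    obtained by thresholding Gaussian-filtered noise at the (1 - p) quantile."""
    rng = np.random.default_rng() if rng is None else rng
    noise = gaussian_filter(rng.standard_normal(shape), sigma)
    return (noise > np.quantile(noise, 1.0 - p)).astype(np.float32)

def mask_mix(image_a, image_b, mask):
    """Blend two (C, H, W) images with an (H, W) mask; the consistency loss compares the
    prediction on the mixed image with the same blend of the two predictions."""
    m = mask[None, ...]
    return m * image_a + (1.0 - m) * image_b
```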
Learning Function-Specific Word Representations
Title | Learning Function-Specific Word Representations |
Authors | Anonymous |
Abstract | We present a neural framework for learning associations between interrelated groups of words such as the ones found in Subject-Verb-Object (SVO) structures. Our model induces a joint function-specific word vector space, where vectors of e.g. plausible SVO compositions lie close together. The model retains information about word group membership even in the joint space, and can thereby effectively be applied to a number of tasks reasoning over the SVO structure. We show the robustness and versatility of the proposed framework by reporting state-of-the-art results on the tasks of estimating selectional preference (i.e., thematic fit) and event similarity. The results indicate that the combinations of representations learned with our task-independent model outperform task-specific architectures from prior work, while reducing the number of parameters by up to 95%. The proposed framework is versatile and holds promise to support learning function-specific representations beyond the SVO structures. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkgL7kBtDH |
PDF | https://openreview.net/pdf?id=BkgL7kBtDH |
PWC | https://paperswithcode.com/paper/learning-function-specific-word |
Repo | |
Framework | |
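Purely as a generic illustration of a "joint function-specific space" (the class name, encoders and scoring rule below are my assumptions, not the paper's architecture), one can map each SVO role through its own encoder into a shared space and score a triple by the proximity of the verb representation to the composed arguments; training with a ranking loss against corrupted triples would then pull plausible compositions together, which is the property the abstract describes.

```python
import torch
import torch.nn as nn

class SVOSpace(nn.Module):
    """Role-specific encoders into a shared space; higher scores mean more plausible triples."""
    def __init__(self, emb_dim, joint_dim):
        super().__init__()
        self.enc_subj = nn.Linear(emb_dim, joint_dim)
        self.enc_verb = nn.Linear(emb_dim, joint_dim)
        self.enc_obj = nn.Linear(emb_dim, joint_dim)

    def score(self, subj, verb, obj):
        # subj/verb/obj: (batch, emb_dim) pretrained word embeddings
        args = self.enc_subj(subj) + self.enc_obj(obj)
        return nn.functional.cosine_similarity(self.enc_verb(verb), args, dim=-1)
```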
Geometric Insights into the Convergence of Nonlinear TD Learning
Title | Geometric Insights into the Convergence of Nonlinear TD Learning |
Authors | Anonymous |
Abstract | While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJezGp4YPr |
PDF | https://openreview.net/pdf?id=SJezGp4YPr |
PWC | https://paperswithcode.com/paper/geometric-insights-into-the-convergence-of |
Repo | |
Framework | |
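For readers who want the object the abstract is analysing: the expected TD(0) dynamics in the small step-size limit are commonly written as the ODE below (standard background notation, not the paper's exact statement), where $\mu$ is the stationary distribution of the Markov chain, $\gamma$ the discount factor and $V_\theta$ the nonlinear function approximator.

```latex
\dot{\theta} \;=\; \mathbb{E}_{s \sim \mu,\; s' \sim P(\cdot \mid s)}
  \Big[ \big( r(s, s') + \gamma\, V_\theta(s') - V_\theta(s) \big)\, \nabla_\theta V_\theta(s) \Big]
```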
Dynamical System Embedding for Efficient Intrinsically Motivated Artificial Agents
Title | Dynamical System Embedding for Efficient Intrinsically Motivated Artificial Agents |
Authors | Anonymous |
Abstract | Mutual Information between agent Actions and environment States (MIAS) quantifies the influence of an agent on its environment. Recently, it was found that intrinsic motivation in artificial agents emerges from the maximization of MIAS. For example, empowerment is an information-theoretic approach to intrinsic motivation, which has been shown to solve a broad range of standard RL benchmark problems. The estimation of empowerment for arbitrary dynamics is a challenging problem because it relies on the estimation of MIAS. Existing approaches rely on sampling, which has formal limitations and can require exponentially many samples. In this work, we develop a novel approach for the estimation of empowerment in unknown arbitrary dynamics from visual stimulus only, without sampling for the estimation of MIAS. The core idea is to represent the relation between action sequences and future states by a stochastic dynamical system in latent space, which admits an efficient estimation of MIAS by the "Water-Filling" algorithm from information theory. We construct this embedding with deep neural networks trained on a novel objective function and demonstrate our approach by numerical simulations of non-linear continuous-time dynamical systems. We show that the designed embedding preserves information-theoretic properties of the original dynamics and enables us to solve the standard AI benchmark problems. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SyxIterYwS |
PDF | https://openreview.net/pdf?id=SyxIterYwS |
PWC | https://paperswithcode.com/paper/dynamical-system-embedding-for-efficient |
Repo | |
Framework | |
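The abstract leans on the classical water-filling solution from information theory; as background (this is the textbook algorithm for parallel Gaussian channels, not the paper's estimator), a minimal implementation of the power allocation looks like:

```python
import numpy as np

def water_filling(noise_levels, total_power, tol=1e-9):
    """Maximise sum_i log(1 + p_i / n_i) subject to sum_i p_i = total_power and p_i >= 0.
    The optimum is p_i = max(0, mu - n_i) for a water level mu, found here by bisection."""
    n = np.asarray(noise_levels, dtype=float)
    lo, hi = n.min(), n.max() + total_power
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - n, 0.0).sum() > total_power:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.5 * (lo + hi) - n, 0.0)

# e.g. water_filling([0.1, 0.5, 1.0], total_power=1.0) puts most power on the quietest channel
```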
Certifying Distributional Robustness using Lipschitz Regularisation
Title | Certifying Distributional Robustness using Lipschitz Regularisation |
Authors | Anonymous |
Abstract | Distributional robust risk (DRR) minimisation has arisen as a flexible and effective framework for machine learning. Approximate solutions based on dualisation have become particularly favorable in addressing the semi-infinite optimisation, and they also provide a certificate of robustness for the worst-case population loss. However, existing methods are restricted to either linear models or very small perturbations, and cannot find the globally optimal solution for restricted nonlinear models such as kernel methods. In this paper we resolve these limitations by upper bounding DRRs with an empirical risk regularised by the Lipschitz constant of the model, including deep neural networks and kernel methods. As an application, we show that this also provides a certificate for adversarial training, and that global solutions can be achieved on product kernel machines in polynomial time. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1eo9h4KPH |
PDF | https://openreview.net/pdf?id=H1eo9h4KPH |
PWC | https://paperswithcode.com/paper/certifying-distributional-robustness-using |
Repo | |
Framework | |
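One standard instance of the kind of bound the abstract describes (stated here for a 1-Wasserstein ball of radius $\epsilon$ around the data distribution $P$; my paraphrase, not the paper's theorem) is:

```latex
\sup_{Q \,:\, W_1(Q, P) \le \epsilon} \; \mathbb{E}_{Q}\big[\ell_\theta\big]
  \;\le\; \mathbb{E}_{P}\big[\ell_\theta\big] \;+\; \epsilon\, \mathrm{Lip}(\ell_\theta)
```

so controlling an upper bound on the Lipschitz constant of the loss (through the model) certifies the worst-case population loss inside the ball.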
On the Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks
Title | On the Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks |
Authors | Anonymous |
Abstract | Variational Bayesian Inference is a popular methodology for approximating posterior distributions in Bayesian neural networks. Recent work developing this class of methods has explored ever richer parameterizations of the approximate posterior in the hope of improving performance. In contrast, here we share a curious experimental finding that suggests instead restricting the variational distribution to a more compact parameterization. For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing the models' performance. What's more, we find that such factorized parameterizations are easier to train since they improve the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, resulting in faster convergence. |
Tasks | Bayesian Inference |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkxREeHKPS |
PDF | https://openreview.net/pdf?id=BkxREeHKPS |
PWC | https://paperswithcode.com/paper/on-the-parameterization-of-gaussian-mean |
Repo | |
Framework | |
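A minimal sketch of the compact parameterization the abstract argues for (the module name, rank and softplus link below are my assumptions): the per-weight standard deviations of a layer's Gaussian mean-field posterior are generated from a low-rank factorization instead of being stored entry by entry.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankStd(nn.Module):
    """Rank-r parameterization of a (d_out, d_in) matrix of posterior standard deviations."""
    def __init__(self, d_out, d_in, rank=2):
        super().__init__()
        self.U = nn.Parameter(0.01 * torch.randn(d_out, rank))
        self.V = nn.Parameter(0.01 * torch.randn(d_in, rank))

    def forward(self):
        return F.softplus(self.U @ self.V.t())   # positive std for every weight

# Reparameterized weight sample for the layer, given std = LowRankStd(d_out, d_in):
#   W = mu + std() * torch.randn(d_out, d_in)
```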
Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
Title | Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well |
Authors | Anonymous |
Abstract | We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel. The resulting models generalize as well as those trained with small mini-batches but are produced in a substantially shorter time. We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rygFWAEFwS |
PDF | https://openreview.net/pdf?id=rygFWAEFwS |
PWC | https://paperswithcode.com/paper/stochastic-weight-averaging-in-parallel-large |
Repo | |
Framework | |
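The averaging step of the SWAP recipe is simple enough to spell out; the sketch below is illustrative (the two training phases are only described in comments) and averages the parameter dictionaries produced by the independently trained copies.

```python
import copy

def average_weights(state_dicts):
    """Element-wise average of parameter dictionaries (name -> float array/tensor)."""
    avg = copy.deepcopy(state_dicts[0])
    for name in avg:
        for sd in state_dicts[1:]:
            avg[name] = avg[name] + sd[name]
        avg[name] = avg[name] / len(state_dicts)
    return avg

# Phase 1: train one model with large mini-batches to a quick approximate solution.
# Phase 2: fork it into several copies, continue each independently with small mini-batches,
#          then replace the weights with average_weights([...the trained copies...]).
```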
Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity
Title | Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity |
Authors | Anonymous |
Abstract | We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJgnXpVYwS |
PDF | https://openreview.net/pdf?id=BJgnXpVYwS |
PWC | https://paperswithcode.com/paper/why-gradient-clipping-accelerates-training-a |
Repo | |
Framework | |
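The two update rules analysed in the abstract can be written in a few lines; the sketch below uses generic names (`lr`, `threshold`) and global-norm clipping, which is an assumption about the exact variant rather than the paper's definition.

```python
import numpy as np

def clipped_gd_step(x, grad, lr, threshold):
    """Step of length lr*||grad|| when the gradient is small, capped at lr*threshold otherwise."""
    scale = min(1.0, threshold / (np.linalg.norm(grad) + 1e-12))
    return x - lr * scale * grad

def normalized_gd_step(x, grad, lr):
    """Fixed step length lr along the (unit-norm) gradient direction."""
    return x - lr * grad / (np.linalg.norm(grad) + 1e-12)
```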
MEMO: A Deep Network for Flexible Combination of Episodic Memories
Title | MEMO: A Deep Network for Flexible Combination of Episodic Memories |
Authors | Anonymous |
Abstract | Recent research developing neural network architectures with external memory has often used the benchmark bAbI question and answering dataset, which provides a challenging number of tasks requiring reasoning. Here we employed a classic associative inference task from the human neuroscience literature in order to more carefully probe the reasoning capacity of existing memory-augmented architectures. This task is thought to capture the essence of reasoning: the appreciation of distant relationships among elements distributed across multiple facts or memories. Surprisingly, we found that current architectures struggle to reason over long-distance associations. Similar results were obtained on a more complex task involving finding the shortest path between nodes in a graph. We therefore developed a novel architecture, MEMO, endowed with the capacity to reason over longer distances. This was accomplished with the addition of two novel components. First, it introduces a separation between memories/facts stored in external memory and the items that comprise these facts in external memory. Second, it makes use of an adaptive retrieval mechanism, allowing a variable number of 'memory hops' before the answer is produced. MEMO is capable of solving our novel reasoning tasks, as well as all 20 tasks in bAbI. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJxlc0EtDr |
PDF | https://openreview.net/pdf?id=rJxlc0EtDr |
PWC | https://paperswithcode.com/paper/memo-a-deep-network-for-flexible-combination |
Repo | |
Framework | |
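The "variable number of memory hops" can be pictured with a generic halting loop; the reader below is an illustrative stand-in (its structure, GRU update and 0.5 halting threshold are my assumptions, not the MEMO architecture).

```python
import torch
import torch.nn as nn

class AdaptiveHopReader(nn.Module):
    """Attend over memory repeatedly, updating the query, until a halting unit fires."""
    def __init__(self, dim, max_hops=8):
        super().__init__()
        self.max_hops = max_hops
        self.update = nn.GRUCell(dim, dim)
        self.halt = nn.Linear(dim, 1)

    def forward(self, query, memory):
        # query: (B, dim); memory: (B, N, dim) holding the stored facts
        halted = torch.zeros(query.size(0), dtype=torch.bool, device=query.device)
        for _ in range(self.max_hops):
            attn = torch.softmax(torch.einsum('bd,bnd->bn', query, memory), dim=-1)
            read = torch.einsum('bn,bnd->bd', attn, memory)
            query = self.update(read, query)
            halted |= torch.sigmoid(self.halt(query)).squeeze(-1) > 0.5
            if bool(halted.all()):       # every example in the batch has decided to stop
                break
        return query                     # fed to an answer head downstream
```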
Mem2Mem: Learning to Summarize Long Texts with Memory-to-Memory Transfer
Title | Mem2Mem: Learning to Summarize Long Texts with Memory-to-Memory Transfer |
Authors | Anonymous |
Abstract | We introduce the Mem2Mem mechanism, a conditional memory-to-memory mechanism that can be appended to general sequence-to-sequence frameworks, and demonstrate its effectiveness in improving long text neural abstractive summarization. Mem2Mem seamlessly transfers “memories” via readable/writable external memory modules that augment both the encoder and decoder. By enabling a memory transfer, Mem2Mem uses representations of highly salient input sentences and performs an implicit sentence extraction step. By allowing the decoder to read and write over encoded input memories, the models learn to store information about the input sequence while keeping track of what has been generated by the decoder. We evaluate Mem2Mem on abstractive text summarization and surpass the current state-of-the-art with less model capacity than competing models and with a full end-to-end training setup. To our knowledge, Mem2Mem is the first mechanism that can effectively use and update memory cells filled with different contextual information. |
Tasks | Abstractive Text Summarization, Text Summarization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1xTEgBKvB |
PDF | https://openreview.net/pdf?id=H1xTEgBKvB |
PWC | https://paperswithcode.com/paper/mem2mem-learning-to-summarize-long-texts-with |
Repo | |
Framework | |
Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
Title | Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness |
Authors | Anonymous |
Abstract | Previous work shows that adversarially robust generalization requires larger sample complexity, and the same dataset, e.g., CIFAR-10, which enables good standard accuracy may not suffice to train robust models. Since collecting new training data could be costly, we focus on better utilizing the given data by inducing the regions with high sample density in the feature space, which could lead to locally sufficient samples for robust learning. We first formally show that the softmax cross-entropy (SCE) loss and its variants convey inappropriate supervisory signals, which encourage the learned feature points to spread over the space sparsely in training. This inspires us to propose the Max-Mahalanobis center (MMC) loss to explicitly induce dense feature regions in order to benefit robustness. Namely, the MMC loss encourages the model to concentrate on learning ordered and compact representations, which gather around the preset optimal centers for different classes. We empirically demonstrate that applying the MMC loss can significantly improve robustness even under strong adaptive attacks, while keeping state-of-the-art accuracy on clean inputs with little extra computation compared to the SCE loss. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Byg9A24tvB |
PDF | https://openreview.net/pdf?id=Byg9A24tvB |
PWC | https://paperswithcode.com/paper/rethinking-softmax-cross-entropy-loss-for-1 |
Repo | |
Framework | |
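As a concrete picture of "preset optimal centers" and the loss that pulls features toward them: the sketch below uses a simple equidistant (regular-simplex) construction for the centers, an illustrative stand-in for the paper's Max-Mahalanobis centers, together with the squared-distance form of the loss.

```python
import numpy as np

def equidistant_centers(num_classes, feature_dim, radius=10.0):
    """num_classes points at equal mutual distance on a sphere of the given radius,
    embedded in feature_dim dimensions (requires feature_dim >= num_classes)."""
    assert feature_dim >= num_classes
    V = np.eye(num_classes) - 1.0 / num_classes        # centered simplex vertices
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    centers = np.zeros((num_classes, feature_dim))
    centers[:, :num_classes] = radius * V
    return centers

def center_loss(features, labels, centers):
    """Mean squared distance between each feature vector and its class center."""
    diff = features - centers[labels]
    return 0.5 * (diff**2).sum(axis=1).mean()
```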
Value-Driven Hindsight Modelling
Title | Value-Driven Hindsight Modelling |
Authors | Anonymous |
Abstract | Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors for value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but must contend with a potentially weak scalar signal (an estimate of the return). In this paper we develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end we determine which features of the future trajectory provide useful information to predict the associated return. This provides us with tractable prediction targets that are directly relevant for a task, and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games. |
Tasks | Atari Games, Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJxBa1HFvS |
PDF | https://openreview.net/pdf?id=rJxBa1HFvS |
PWC | https://paperswithcode.com/paper/value-driven-hindsight-modelling |
Repo | |
Framework | |
Training Deep Neural Networks with Partially Adaptive Momentum
Title | Training Deep Neural Networks with Partially Adaptive Momentum |
Authors | Anonymous |
Abstract | Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks, despite their fast convergence. How to close the generalization gap of adaptive gradient methods thus remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted". We design a new algorithm, called the Partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm maintains a convergence rate as fast as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners may pick up adaptive gradient methods once again for faster training of deep neural networks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HklWsREKwr |
PDF | https://openreview.net/pdf?id=HklWsREKwr |
PWC | https://paperswithcode.com/paper/training-deep-neural-networks-with-partially |
Repo | |
Framework | |
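The "partial adaptive parameter" can be read directly off the update rule; below is a per-step sketch (the Amsgrad-style running max and the default p value are assumptions chosen for illustration), in which p = 1/2 gives an Adam/Amsgrad-like step and p = 0 a momentum-SGD-like step.

```python
import numpy as np

def partially_adaptive_step(param, grad, state, lr=0.1,
                            beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """One update with exponent p applied to the second-moment denominator."""
    m, v, v_max = state
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment
    v_max = np.maximum(v_max, v)              # Amsgrad-style running max
    param = param - lr * m / (v_max**p + eps)
    return param, (m, v, v_max)

# state starts as three zero arrays shaped like `param` and is threaded through the steps
```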
Neural Non-additive Utility Aggregation
Title | Neural Non-additive Utility Aggregation |
Authors | Anonymous |
Abstract | Neural architectures for set regression problems aim to learn representations from which good predictions can be made. This strategy, however, ignores the fact that meaningful intermediate results might be helpful for performing well. We study two new architectures that explicitly model latent intermediate utilities and use non-additive utility aggregation to estimate the set utility based on the latent utilities. We evaluate the new architectures on visual and textual datasets, which have non-additive set utilities due to redundancy and synergy effects. We find that the new architectures perform substantially better in this setup. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SklgTkBKDr |
PDF | https://openreview.net/pdf?id=SklgTkBKDr |
PWC | https://paperswithcode.com/paper/neural-non-additive-utility-aggregation |
Repo | |
Framework | |