April 1, 2020

2943 words 14 mins read

Paper Group NANR 120

Continual Learning via Principal Components Projection. Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. Learning Function-Specific Word Representations. Geometric Insights into the Convergence of Nonlinear TD Learning. Dynamical System Embedding for Efficient Intrinsically Motivated Artificial Agents. Certifying …

Continual Learning via Principal Components Projection

Title Continual Learning via Principal Components Projection
Authors Anonymous
Abstract Continual learning in neural networks (NN) often suffers from catastrophic forgetting. That is, when learning a sequence of tasks on an NN, learning a new task causes weight changes that may destroy the learned knowledge embedded in the weights for previous tasks. Without solving this problem, it is difficult to use an NN for continual or lifelong learning. Although researchers have attempted to solve the problem in many ways, it remains challenging. In this paper, we propose a new approach, called principal components projection (PCP). The idea is that in learning a new task, if we can ensure that the gradient updates occur only in directions orthogonal to the input vectors of the previous tasks, then the weight updates for learning the new task will not affect the previous tasks. We propose to compute the principal components of the input vectors and use them to transform the input and to project the gradient updates for learning each new task. PCP does not need to store any sampled data from previous tasks or to generate pseudo data of previous tasks to help learn a new task. Empirical evaluation shows that the proposed method PCP markedly outperforms the state-of-the-art baseline methods.
Tasks Continual Learning
Published 2020-01-01
URL https://openreview.net/forum?id=SkxlElBYDS
PDF https://openreview.net/pdf?id=SkxlElBYDS
PWC https://paperswithcode.com/paper/continual-learning-via-principal-components
Repo
Framework
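
To make the projection idea above concrete, here is a minimal PyTorch sketch for a single linear layer, assuming `prev_inputs` holds layer inputs collected from previous tasks; the SVD-based construction, the choice of k, and the placeholder data are illustrative assumptions, not the paper's exact procedure (which also transforms the inputs).

```python
import torch

def orthogonal_projector(prev_inputs, k):
    """Projector onto the complement of the top-k principal directions of the
    previous tasks' layer inputs (n_samples x in_dim)."""
    _, _, Vt = torch.linalg.svd(prev_inputs, full_matrices=False)
    U = Vt[:k].T                                   # in_dim x k principal directions
    return torch.eye(prev_inputs.shape[1]) - U @ U.T

# Usage: project the gradient of a linear layer before the optimizer step, so the
# update barely disturbs responses to the previous tasks' inputs.
layer = torch.nn.Linear(64, 10)
P = orthogonal_projector(torch.randn(500, 64), k=20)   # placeholder for stored inputs

x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(layer(x), y)
loss.backward()
with torch.no_grad():
    # W has shape (out, in); each gradient row lives in the input space.
    layer.weight.grad.copy_(layer.weight.grad @ P)
torch.optim.SGD(layer.parameters(), lr=0.1).step()
```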

Semi-supervised semantic segmentation needs strong, high-dimensional perturbations

Title Semi-supervised semantic segmentation needs strong, high-dimensional perturbations
Authors Anonymous
Abstract Consistency regularization describes a class of approaches that have yielded groundbreaking results in semi-supervised classification problems. Prior work has established the cluster assumption, under which the data distribution consists of uniform class clusters of samples separated by low-density regions, as key to its success. We analyze the problem of semantic segmentation and find that its data distribution does not exhibit low-density regions separating classes, and we offer this as an explanation for why semi-supervised segmentation is a challenging problem. We then identify the conditions that allow consistency regularization to work even without such low-density regions. This allows us to generalize the recently proposed CutMix augmentation technique to a powerful masked variant, CowMix, leading to a successful application of consistency regularization in the semi-supervised semantic segmentation setting and reaching state-of-the-art results on several standard datasets.
Tasks Semantic Segmentation, Semi-Supervised Semantic Segmentation
Published 2020-01-01
URL https://openreview.net/forum?id=B1eBoJStwr
PDF https://openreview.net/pdf?id=B1eBoJStwr
PWC https://paperswithcode.com/paper/semi-supervised-semantic-segmentation-needs
Repo
Framework
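
A rough sketch of the mask-based consistency idea, assuming a frozen teacher and a trainable student network; the abstract does not specify how CowMix masks are generated, so the thresholded blurred-noise mask below is only an assumed stand-in.

```python
import torch
import torch.nn.functional as F

def random_soft_mask(batch, h, w, sigma=8.0, p=0.5):
    """Thresholded, Gaussian-blurred noise mask in {0,1} (assumed stand-in for CowMix masks)."""
    noise = torch.randn(batch, 1, h, w)
    k = int(2 * sigma) * 2 + 1
    coords = torch.arange(k) - k // 2
    g = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).view(1, 1, 1, k)
    noise = F.conv2d(noise, g, padding=(0, k // 2))                 # horizontal blur
    noise = F.conv2d(noise, g.transpose(2, 3), padding=(k // 2, 0))  # vertical blur
    thresh = noise.flatten(1).quantile(p, dim=1).view(-1, 1, 1, 1)
    return (noise > thresh).float()

def consistency_loss(student, teacher, xa, xb):
    """Mix two unlabeled images with a mask; the student's prediction on the mixed
    image should match the same mix of the teacher's predictions."""
    m = random_soft_mask(xa.size(0), xa.size(2), xa.size(3))
    with torch.no_grad():
        ta, tb = teacher(xa).softmax(1), teacher(xb).softmax(1)
    target = m * ta + (1 - m) * tb
    pred = student(m * xa + (1 - m) * xb).softmax(1)
    return ((pred - target) ** 2).mean()

# Toy 1x1-conv "segmentation heads" standing in for real networks.
student = torch.nn.Conv2d(3, 21, 1)
teacher = torch.nn.Conv2d(3, 21, 1)
xa, xb = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
loss = consistency_loss(student, teacher, xa, xb)
```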

Learning Function-Specific Word Representations

Title Learning Function-Specific Word Representations
Authors Anonymous
Abstract We present a neural framework for learning associations between interrelated groups of words such as the ones found in Subject-Verb-Object (SVO) structures. Our model induces a joint function-specific word vector space, where vectors of, e.g., plausible SVO compositions lie close together. The model retains information about word group membership even in the joint space and can thereby be applied effectively to a number of tasks that reason over the SVO structure. We show the robustness and versatility of the proposed framework by reporting state-of-the-art results on the tasks of estimating selectional preference (i.e., thematic fit) and event similarity. The results indicate that combinations of representations learned with our task-independent model outperform task-specific architectures from prior work, while reducing the number of parameters by up to 95%. The proposed framework is versatile and holds promise for learning function-specific representations beyond SVO structures.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BkgL7kBtDH
PDF https://openreview.net/pdf?id=BkgL7kBtDH
PWC https://paperswithcode.com/paper/learning-function-specific-word
Repo
Framework
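
One way to picture a joint function-specific SVO space is the hypothetical scorer below: separate embedding tables per grammatical role, a shared scoring MLP, and max-margin training against corrupted triples. This is a generic sketch, not the architecture or objective used in the paper.

```python
import torch
import torch.nn as nn

class SVOScorer(nn.Module):
    """Hypothetical sketch: role-specific ("function-specific") embeddings for
    subjects, verbs and objects, scored jointly by a small MLP."""
    def __init__(self, vocab, dim=100):
        super().__init__()
        self.subj = nn.Embedding(vocab, dim)
        self.verb = nn.Embedding(vocab, dim)
        self.obj = nn.Embedding(vocab, dim)
        self.score = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, s, v, o):
        z = torch.cat([self.subj(s), self.verb(v), self.obj(o)], dim=-1)
        return self.score(z).squeeze(-1)

# Max-margin training against randomly corrupted objects (a common choice,
# not necessarily the objective of the paper).
model = SVOScorer(vocab=10_000)
s, v, o = (torch.randint(0, 10_000, (32,)) for _ in range(3))
neg_o = torch.randint(0, 10_000, (32,))
loss = torch.clamp(1.0 - model(s, v, o) + model(s, v, neg_o), min=0).mean()
loss.backward()
```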

Geometric Insights into the Convergence of Nonlinear TD Learning

Title Geometric Insights into the Convergence of Nonlinear TD Learning
Authors Anonymous
Abstract While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SJezGp4YPr
PDF https://openreview.net/pdf?id=SJezGp4YPr
PWC https://paperswithcode.com/paper/geometric-insights-into-the-convergence-of
Repo
Framework
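
The object under study is the TD(0) update with a nonlinear (here ReLU-network) value approximator; the minimal semi-gradient implementation below, with synthetic placeholder transitions, is the finite-step-size counterpart of the expected dynamics the paper analyses.

```python
import torch
import torch.nn as nn

# TD(0) value estimation with a nonlinear (ReLU) approximator -- the update whose
# expected, step-size -> 0 dynamics the paper studies as an ODE.
value = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
gamma, lr = 0.99, 1e-3

def td0_step(s, r, s_next):
    target = r + gamma * value(s_next).detach()   # bootstrap target, no gradient (semi-gradient TD)
    delta = target - value(s)                     # TD error
    loss = 0.5 * (delta ** 2).mean()
    value.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in value.parameters():
            p -= lr * p.grad

# One batch of synthetic transitions (placeholders; a real agent samples from a Markov chain).
td0_step(torch.randn(32, 4), torch.randn(32, 1), torch.randn(32, 4))
```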

Dynamical System Embedding for Efficient Intrinsically Motivated Artificial Agents

Title Dynamical System Embedding for Efficient Intrinsically Motivated Artificial Agents
Authors Anonymous
Abstract Mutual Information between agent Actions and environment States (MIAS) quantifies the influence of an agent on its environment. Recently, it was found that intrinsic motivation in artificial agents emerges from the maximization of MIAS. For example, empowerment is an information-theoretic approach to intrinsic motivation which has been shown to solve a broad range of standard RL benchmark problems. The estimation of empowerment for arbitrary dynamics is a challenging problem because it relies on the estimation of MIAS. Existing approaches rely on sampling, which has formal limitations and can require exponentially many samples. In this work, we develop a novel approach for the estimation of empowerment in unknown arbitrary dynamics from visual stimulus only, without sampling for the estimation of MIAS. The core idea is to represent the relation between action sequences and future states by a stochastic dynamical system in latent space, which admits an efficient estimation of MIAS by the "Water-Filling" algorithm from information theory. We construct this embedding with deep neural networks trained on a novel objective function and demonstrate our approach by numerical simulations of non-linear continuous-time dynamical systems. We show that the designed embedding preserves information-theoretic properties of the original dynamics and enables us to solve standard AI benchmark problems.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SyxIterYwS
PDF https://openreview.net/pdf?id=SyxIterYwS
PWC https://paperswithcode.com/paper/dynamical-system-embedding-for-efficient
Repo
Framework
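
The abstract leans on the classic water-filling algorithm for the capacity of parallel Gaussian channels; a small NumPy version is sketched below (the latent dynamical-system embedding itself is omitted, and the bisection tolerance is an arbitrary choice).

```python
import numpy as np

def water_filling(noise, total_power, tol=1e-9):
    """Allocate `total_power` across parallel Gaussian channels with the given noise
    levels so that mutual information is maximised (classic water-filling)."""
    noise = np.asarray(noise, dtype=float)
    lo, hi = noise.min(), noise.max() + total_power
    while hi - lo > tol:                              # bisect on the water level mu
        mu = 0.5 * (lo + hi)
        power = np.clip(mu - noise, 0.0, None)
        lo, hi = (mu, hi) if power.sum() < total_power else (lo, mu)
    power = np.clip(0.5 * (lo + hi) - noise, 0.0, None)
    capacity = 0.5 * np.sum(np.log(1.0 + power / noise))   # in nats
    return power, capacity

power, cap = water_filling([0.1, 0.5, 1.0, 2.0], total_power=2.0)
```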

Certifying Distributional Robustness using Lipschitz Regularisation

Title Certifying Distributional Robustness using Lipschitz Regularisation
Authors Anonymous
Abstract Distributionally robust risk (DRR) minimisation has arisen as a flexible and effective framework for machine learning. Approximate solutions based on dualisation have become particularly favourable for addressing the underlying semi-infinite optimisation, and they also provide a certificate of robustness for the worst-case population loss. However, existing methods are restricted to either linear models or very small perturbations, and cannot find the globally optimal solution for restricted nonlinear models such as kernel methods. In this paper we resolve these limitations by upper bounding DRRs with an empirical risk regularised by the Lipschitz constant of the model, including deep neural networks and kernel methods. As an application, we show that this also provides a certificate for adversarial training, and that global solutions can be achieved on product kernel machines in polynomial time.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1eo9h4KPH
PDF https://openreview.net/pdf?id=H1eo9h4KPH
PWC https://paperswithcode.com/paper/certifying-distributional-robustness-using
Repo
Framework
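
A hedged sketch of the general recipe suggested by the abstract: minimise the empirical risk plus a penalty on an upper bound of the model's Lipschitz constant. For a feed-forward net with 1-Lipschitz activations, the product of layer spectral norms is such a bound; the penalty weight `lam` and the plain product bound are assumptions, not the paper's certified construction.

```python
import torch
import torch.nn as nn

def lipschitz_upper_bound(model):
    """Product of spectral norms of the linear layers: a standard upper bound on the
    Lipschitz constant of a feed-forward net with 1-Lipschitz activations."""
    bound = torch.tensor(1.0)
    for m in model.modules():
        if isinstance(m, nn.Linear):
            bound = bound * torch.linalg.matrix_norm(m.weight, ord=2)
    return bound

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(128, 20), torch.randn(128, 1)
lam = 0.1    # assumed trade-off weight (plays the role of the perturbation budget)
loss = nn.functional.mse_loss(model(x), y) + lam * lipschitz_upper_bound(model)
loss.backward()
```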

On the Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks

Title On the Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks
Authors Anonymous
Abstract Variational Bayesian inference is a popular methodology for approximating posterior distributions in Bayesian neural networks. Recent work developing this class of methods has explored ever richer parameterizations of the approximate posterior in the hope of improving performance. In contrast, here we share a curious experimental finding that suggests instead restricting the variational distribution to a more compact parameterization. For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing the models' performance. Moreover, we find that such factorized parameterizations are easier to train, since they improve the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, resulting in faster convergence.
Tasks Bayesian Inference
Published 2020-01-01
URL https://openreview.net/forum?id=BkxREeHKPS
PDF https://openreview.net/pdf?id=BkxREeHKPS
PWC https://paperswithcode.com/paper/on-the-parameterization-of-gaussian-mean
Repo
Framework
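
A minimal sketch of the compact parameterization the finding suggests: the (out x in) matrix of posterior standard deviations of a mean-field Gaussian layer is written as a low-rank product. The rank, the softplus link, and the initialisation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankGaussianLinear(nn.Module):
    """Mean-field Gaussian weight posterior whose (out x in) matrix of standard
    deviations is a rank-r factorization, reflecting the observed low-rank structure."""
    def __init__(self, in_f, out_f, rank=1):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        self.u = nn.Parameter(torch.randn(out_f, rank) * 0.1)
        self.v = nn.Parameter(torch.randn(rank, in_f) * 0.1)

    def forward(self, x):
        sigma = F.softplus(self.u @ self.v)              # positive stds from a low-rank product
        w = self.mu + sigma * torch.randn_like(sigma)    # reparameterization trick
        return x @ w.T

layer = LowRankGaussianLinear(50, 30, rank=2)
out = layer(torch.randn(8, 50))
```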

Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well

Title Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
Authors Anonymous
Abstract We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel. The resulting models generalize as well as those trained with small mini-batches but are produced in a substantially shorter time. We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer-vision datasets CIFAR10, CIFAR100, and ImageNet.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rygFWAEFwS
PDF https://openreview.net/pdf?id=rygFWAEFwS
PWC https://paperswithcode.com/paper/stochastic-weight-averaging-in-parallel-large
Repo
Framework
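
The averaging step at the heart of SWAP reduces to a parameter-wise mean over replicas; a minimal sketch follows, with the large-batch phase and the distributed-training plumbing omitted.

```python
import copy
import torch
import torch.nn as nn

def average_weights(models):
    """Parameter-wise average of several models with identical architecture, as in
    the SWAP refinement stage."""
    averaged = copy.deepcopy(models[0])
    state = averaged.state_dict()
    for key in state:
        state[key] = torch.stack([m.state_dict()[key].float() for m in models]).mean(0)
    averaged.load_state_dict(state)
    return averaged

# Stand-ins for replicas refined independently from the same large-batch solution.
workers = [nn.Linear(10, 2) for _ in range(4)]
swap_model = average_weights(workers)
```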

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

Title Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity
Authors Anonymous
Abstract We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJgnXpVYwS
PDF https://openreview.net/pdf?id=BJgnXpVYwS
PWC https://paperswithcode.com/paper/why-gradient-clipping-accelerates-training-a
Repo
Framework
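
The analysed update is ordinary clipped gradient descent; in PyTorch it is one extra line before the optimizer step, shown below with placeholder data and hyper-parameters.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
clip = 1.0    # clipping threshold (placeholder)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
# Clipped GD: the effective step size is min(lr, lr * clip / ||g||), the adaptive
# scaling whose faster convergence the paper proves under relaxed smoothness.
torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
opt.step()
```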

MEMO: A Deep Network for Flexible Combination of Episodic Memories

Title MEMO: A Deep Network for Flexible Combination of Episodic Memories
Authors Anonymous
Abstract Recent research developing neural network architectures with external memory has often used the benchmark bAbI question-answering dataset, which provides a challenging set of tasks requiring reasoning. Here we employed a classic associative inference task from the human neuroscience literature in order to more carefully probe the reasoning capacity of existing memory-augmented architectures. This task is thought to capture the essence of reasoning – the appreciation of distant relationships among elements distributed across multiple facts or memories. Surprisingly, we found that current architectures struggle to reason over long-distance associations. Similar results were obtained on a more complex task involving finding the shortest path between nodes in a path. We therefore developed a novel architecture, MEMO, endowed with the capacity to reason over longer distances. This was accomplished with the addition of two novel components. First, MEMO introduces a separation between memories/facts stored in external memory and the items that comprise these facts in external memory. Second, it makes use of an adaptive retrieval mechanism, allowing a variable number of ‘memory hops’ before the answer is produced. MEMO is capable of solving our novel reasoning tasks, as well as all 20 tasks in bAbI.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rJxlc0EtDr
PDF https://openreview.net/pdf?id=rJxlc0EtDr
PWC https://paperswithcode.com/paper/memo-a-deep-network-for-flexible-combination
Repo
Framework
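
A heavily hedged sketch of an adaptive-hop reader in the spirit of the second component described above: attention over external memory is repeated until a cumulative halting probability crosses a threshold. The GRU update, the halting rule, and the shapes are assumptions; the real MEMO architecture also separates stored facts from the items composing them.

```python
import torch
import torch.nn as nn

class AdaptiveHopReader(nn.Module):
    """Attend over an external memory repeatedly, halting once the cumulative halt
    probability passes a threshold (an ACT-style stand-in for adaptive retrieval)."""
    def __init__(self, dim, max_hops=8, threshold=0.99):
        super().__init__()
        self.query_update = nn.GRUCell(dim, dim)
        self.halt = nn.Linear(dim, 1)
        self.max_hops, self.threshold = max_hops, threshold

    def forward(self, query, memory):                 # query: (B, d), memory: (B, N, d)
        halted = torch.zeros(query.size(0))
        for _ in range(self.max_hops):
            attn = torch.softmax(torch.einsum('bd,bnd->bn', query, memory), dim=-1)
            read = torch.einsum('bn,bnd->bd', attn, memory)   # weighted memory read
            query = self.query_update(read, query)
            halted = halted + torch.sigmoid(self.halt(query)).squeeze(-1)
            if bool((halted > self.threshold).all()):
                break
        return query

reader = AdaptiveHopReader(dim=32)
answer_state = reader(torch.randn(4, 32), torch.randn(4, 10, 32))
```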

Mem2Mem: Learning to Summarize Long Texts with Memory-to-Memory Transfer

Title Mem2Mem: Learning to Summarize Long Texts with Memory-to-Memory Transfer
Authors Anonymous
Abstract We introduce the Mem2Mem mechanism, a conditional memory-to-memory mechanism that can be appended to general sequence-to-sequence frameworks, and demonstrate its effectiveness in improving long text neural abstractive summarization. Mem2Mem seamlessly transfers “memories” via readable/writable external memory modules that augment both the encoder and decoder. By enabling a memory transfer, Mem2Mem uses representations of highly salient input sentences and performs an implicit sentence extraction step. By allowing the decoder to read and write over encoded input memories, the models learn to store information about the input sequence while keeping track of what has been generated by the decoder. We evaluate Mem2Mem on abstractive text summarization and surpass the current state-of-the-art with less model capacity than competing models and with a full end-to-end training setup. To our knowledge, Mem2Mem is the first mechanism that can effectively use and update memory cells filled with different contextual information.
Tasks Abstractive Text Summarization, Text Summarization
Published 2020-01-01
URL https://openreview.net/forum?id=H1xTEgBKvB
PDF https://openreview.net/pdf?id=H1xTEgBKvB
PWC https://paperswithcode.com/paper/mem2mem-learning-to-summarize-long-texts-with
Repo
Framework
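
Reduced to its simplest reading, the memory transfer picks out salient encoder memory slots (e.g. sentence representations) and hands them to the decoder as its initial memory. The hypothetical top-k selection below is only an assumed stand-in for the readable/writable memory modules described above.

```python
import torch
import torch.nn as nn

class MemoryTransfer(nn.Module):
    """Hypothetical reduction of the Mem2Mem idea: score encoder memory slots, keep
    the top-k most salient ones, and hand them to the decoder as its initial memory."""
    def __init__(self, dim, k=8):
        super().__init__()
        self.saliency = nn.Linear(dim, 1)
        self.k = k

    def forward(self, encoder_memory):                       # (B, N, d)
        scores = self.saliency(encoder_memory).squeeze(-1)   # (B, N) saliency per slot
        top = scores.topk(self.k, dim=1).indices
        idx = top.unsqueeze(-1).expand(-1, -1, encoder_memory.size(-1))
        return encoder_memory.gather(1, idx)                 # (B, k, d) decoder memory

transfer = MemoryTransfer(dim=256)
decoder_memory = transfer(torch.randn(2, 40, 256))
```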

Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness

Title Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
Authors Anonymous
Abstract Previous work shows that adversarially robust generalization requires larger sample complexity, and that the same dataset (e.g., CIFAR-10) which enables good standard accuracy may not suffice to train robust models. Since collecting new training data could be costly, we focus on better utilizing the given data by inducing regions of high sample density in the feature space, which could provide locally sufficient samples for robust learning. We first formally show that the softmax cross-entropy (SCE) loss and its variants convey inappropriate supervisory signals, which encourage the learned feature points to spread sparsely over the space during training. This inspires us to propose the Max-Mahalanobis center (MMC) loss to explicitly induce dense feature regions that benefit robustness. Namely, the MMC loss encourages the model to concentrate on learning ordered and compact representations that gather around preset optimal centers for the different classes. We empirically demonstrate that applying the MMC loss can significantly improve robustness even under strong adaptive attacks, while keeping state-of-the-art accuracy on clean inputs with little extra computation compared to the SCE loss.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Byg9A24tvB
PDF https://openreview.net/pdf?id=Byg9A24tvB
PWC https://paperswithcode.com/paper/rethinking-softmax-cross-entropy-loss-for-1
Repo
Framework
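
The MMC loss itself is a squared distance between features and a preset, untrainable center for the true class. In the sketch below the centers are scaled one-hot vectors for simplicity; the paper instead constructs Max-Mahalanobis centers that maximise the minimal pairwise distance.

```python
import torch
import torch.nn as nn

class MMCLoss(nn.Module):
    """Squared distance between features and a fixed center for the true class.
    Scaled one-hot centers are a simple stand-in for Max-Mahalanobis centers."""
    def __init__(self, num_classes, feat_dim, scale=10.0):
        super().__init__()
        assert feat_dim >= num_classes
        centers = torch.zeros(num_classes, feat_dim)
        centers[:, :num_classes] = torch.eye(num_classes) * scale
        self.register_buffer('centers', centers)    # untrainable, preset centers

    def forward(self, features, labels):
        return 0.5 * ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

criterion = MMCLoss(num_classes=10, feat_dim=64)
loss = criterion(torch.randn(32, 64), torch.randint(0, 10, (32,)))
```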

Value-Driven Hindsight Modelling

Title Value-Driven Hindsight Modelling
Authors Anonymous
Abstract Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors for value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but have to compose with a potentially weak scalar signal (an estimate of the return). In this paper we develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end we determine which features of the future trajectory provide useful information to predict the associated return. This provides us with tractable prediction targets that are directly relevant for a task, and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.
Tasks Atari Games, Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=rJxBa1HFvS
PDF https://openreview.net/pdf?id=rJxBa1HFvS
PWC https://paperswithcode.com/paper/value-driven-hindsight-modelling
Repo
Framework
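
A hedged sketch of the training signals the abstract implies: a hindsight value head that sees features of the future trajectory, a model trained to predict those features from the present, and a deployable value head that consumes the prediction. All heads, shapes, and the use of plain MSE losses are assumptions.

```python
import torch
import torch.nn as nn

d_obs, d_feat = 16, 8
phi = nn.Linear(d_obs, d_feat)                # features of the *future* trajectory (hindsight)
v_hindsight = nn.Linear(d_obs + d_feat, 1)    # value head allowed to peek at phi(future)
model = nn.Linear(d_obs, d_feat)              # predicts phi(future) from the present state
v = nn.Linear(d_obs + d_feat, 1)              # deployable value head uses the predicted features

# Placeholder batch: current states, summaries of their future trajectories, and returns.
s, future, ret = torch.randn(32, d_obs), torch.randn(32, d_obs), torch.randn(32, 1)

feat = phi(future)
loss_hindsight = nn.functional.mse_loss(v_hindsight(torch.cat([s, feat], -1)), ret)
loss_model = nn.functional.mse_loss(model(s), feat.detach())
loss_value = nn.functional.mse_loss(v(torch.cat([s, model(s).detach()], -1)), ret)
(loss_hindsight + loss_model + loss_value).backward()
```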

Training Deep Neural Networks with Partially Adaptive Momentum

Title Training Deep Neural Networks with Partially Adaptive Momentum
Authors Anonymous
Abstract Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum when training deep neural networks, despite their fast convergence. How to close this generalization gap of adaptive gradient methods remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over-adapted". We design a new algorithm, the partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm maintains the fast convergence of Adam/Amsgrad while generalizing as well as SGD when training deep neural networks. These results suggest that practitioners may once again consider adaptive gradient methods for faster training of deep neural networks.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HklWsREKwr
PDF https://openreview.net/pdf?id=HklWsREKwr
PWC https://paperswithcode.com/paper/training-deep-neural-networks-with-partially
Repo
Framework
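
The core of the partially adaptive update is that the second-moment term enters the denominator with an exponent p in [0, 1/2], interpolating between SGD with momentum (p = 0) and Amsgrad (p = 1/2). A single-step sketch follows; bias correction is omitted and the hyper-parameters are placeholders.

```python
import torch

def padam_step(params, grads, state, lr=0.1, betas=(0.9, 0.999), p=0.125, eps=1e-8):
    """One partially adaptive momentum step: the denominator uses v_max ** p, so
    p = 0 recovers SGD with momentum and p = 0.5 recovers Amsgrad.
    (Bias correction is omitted for brevity.)"""
    for w, g in zip(params, grads):
        m, v, v_max = state.setdefault(w, (torch.zeros_like(w),) * 3)
        m = betas[0] * m + (1 - betas[0]) * g
        v = betas[1] * v + (1 - betas[1]) * g * g
        v_max = torch.maximum(v_max, v)          # Amsgrad-style running max
        state[w] = (m, v, v_max)
        with torch.no_grad():
            w -= lr * m / (v_max ** p + eps)

state = {}
w = torch.randn(10, requires_grad=True)
(w ** 2).sum().backward()
padam_step([w], [w.grad], state)
```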

Neural Non-additive Utility Aggregation

Title Neural Non-additive Utility Aggregation
Authors Anonymous
Abstract Neural architectures for set regression problems aim to learn representations from which good predictions can be made. This strategy, however, ignores the fact that meaningful intermediate results might be helpful for performing well. We study two new architectures that explicitly model latent intermediate utilities and use non-additive utility aggregation to estimate the set utility from the latent utilities. We evaluate the new architectures on visual and textual datasets, whose set utilities are non-additive due to redundancy and synergy effects. We find that the new architectures perform substantially better in this setup.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SklgTkBKDr
PDF https://openreview.net/pdf?id=SklgTkBKDr
PWC https://paperswithcode.com/paper/neural-non-additive-utility-aggregation
Repo
Framework
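
One hypothetical instantiation of the idea above: predict a latent utility per item, then aggregate the sorted utilities with a small recurrent network so that redundancy and synergy between items can be expressed, which a plain sum cannot. The concrete aggregator in the paper may differ.

```python
import torch
import torch.nn as nn

class NonAdditiveSetUtility(nn.Module):
    """Sketch: score each item to get a latent utility, sort the utilities, and let
    an LSTM aggregate them so that the set utility need not be a sum of item utilities."""
    def __init__(self, item_dim, hidden=32):
        super().__init__()
        self.item_utility = nn.Sequential(nn.Linear(item_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.aggregate = nn.LSTM(1, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, items):                      # (B, N, item_dim)
        u = self.item_utility(items)               # (B, N, 1) latent intermediate utilities
        u, _ = torch.sort(u, dim=1, descending=True)
        _, (h, _) = self.aggregate(u)
        return self.readout(h[-1]).squeeze(-1)     # (B,) set utility

model = NonAdditiveSetUtility(item_dim=10)
set_utility = model(torch.randn(4, 6, 10))
```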