Paper Group NANR 137
Unsupervised Meta-Learning for Reinforcement Learning. Discrete InfoMax Codes for Meta-Learning. VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning. Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model. Continual Deep Learning by Functional Regularisation of Memorable Past. Learning to Remember from a …
Unsupervised Meta-Learning for Reinforcement Learning
Title | Unsupervised Meta-Learning for Reinforcement Learning |
Authors | Anonymous |
Abstract | Meta-learning algorithms learn to acquire new tasks more quickly from past experience. In the context of reinforcement learning, meta-learning algorithms can acquire reinforcement learning procedures to solve new problems more efficiently by utilizing experience from prior tasks. The performance of meta-learning algorithms depends on the tasks available for meta-training: in the same way that supervised learning generalizes best to test points drawn from the same distribution as the training points, meta-learning methods generalize best to tasks from the same distribution as the meta-training tasks. In effect, meta-reinforcement learning offloads the design burden from algorithm design to task design. If we can automate the process of task design as well, we can devise a meta-learning algorithm that is truly automated. In this work, we take a step in this direction, proposing a family of unsupervised meta-learning algorithms for reinforcement learning. We motivate and describe a general recipe for unsupervised meta-reinforcement learning, and present an instantiation of this approach. Our conceptual and theoretical contributions consist of formulating the unsupervised meta-reinforcement learning problem and describing how task proposals based on mutual information can in principle be used to train optimal meta-learners. Our experimental results indicate that unsupervised meta-reinforcement learning effectively acquires accelerated reinforcement learning procedures without the need for manual task design and significantly exceeds the performance of learning from scratch. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1et1lrtwr |
https://openreview.net/pdf?id=S1et1lrtwr | |
PWC | https://paperswithcode.com/paper/unsupervised-meta-learning-for-reinforcement-2 |
Repo | |
Framework | |
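The recipe above proposes tasks via mutual information; a common instantiation scores a state for a latent "skill" z with a learned discriminator q(z|s) and uses that score as the task reward. A minimal sketch under that assumption (the `discriminator` network, uniform skill prior, and shapes are all illustrative, not the paper's exact construction):

```python
import torch
import torch.nn.functional as F

def mi_task_reward(discriminator, state, z, n_skills):
    """Reward for the proposed task indexed by skill z:
    r(s, z) = log q(z | s) - log p(z), with p(z) uniform over n_skills.
    The meta-learner is then meta-trained on the RL tasks these rewards define."""
    log_q = F.log_softmax(discriminator(state), dim=-1)   # log q(z | s)
    return log_q[..., z] - torch.log(torch.tensor(1.0 / n_skills))
```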
Discrete InfoMax Codes for Meta-Learning
Title | Discrete InfoMax Codes for Meta-Learning |
Authors | Anonymous |
Abstract | This paper analyzes how generalization works in meta-learning. Our core contribution is an information-theoretic generalization bound for meta-learning, which identifies the expressivity of the task-specific learner as the key factor that makes generalization to new datasets difficult. Taking inspiration from our bound, we present Discrete InfoMax Codes (DIMCO), a novel meta-learning model that trains a stochastic encoder to output discrete codes. Experiments show that DIMCO requires less memory and less time for similar performance to previous metric learning methods and that our method generalizes particularly well in a challenging small-data setting. |
Tasks | Meta-Learning, Metric Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Syx5eT4KDS |
https://openreview.net/pdf?id=Syx5eT4KDS | |
PWC | https://paperswithcode.com/paper/discrete-infomax-codes-for-meta-learning-1 |
Repo | |
Framework | |
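A rough sketch of the model family the abstract describes: an encoder that outputs one categorical distribution per discrete code, trained with an InfoMax-style objective I(code; label) estimated within an episode. The sizes, the factorised treatment of the codes, and the encoder itself are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DiscreteCodeEncoder(nn.Module):
    # hypothetical sizes: 64 codes, each over 8 discrete values
    def __init__(self, in_dim=512, n_codes=64, n_values=8):
        super().__init__()
        self.fc = nn.Linear(in_dim, n_codes * n_values)
        self.n_codes, self.n_values = n_codes, n_values

    def forward(self, x):
        logits = self.fc(x).view(-1, self.n_codes, self.n_values)
        return logits.softmax(dim=-1)              # p(c_j | x), one categorical per code

def infomax_loss(probs, labels):
    """Estimate -I(code; label) from one episode:
    I(c; y) = H(c) - H(c | y), with entropies taken over the batch-averaged
    (marginal) and per-class-averaged code distributions."""
    marginal = probs.mean(dim=0)                                       # (n_codes, n_values)
    h_marginal = -(marginal * marginal.clamp_min(1e-8).log()).sum()
    h_cond = 0.0
    for y in labels.unique():
        p_y = probs[labels == y].mean(dim=0)
        w = (labels == y).float().mean()
        h_cond = h_cond + w * -(p_y * p_y.clamp_min(1e-8).log()).sum()
    return -(h_marginal - h_cond)                                      # minimise -I(c; y)

# toy episode: 20 examples, 5 classes
enc = DiscreteCodeEncoder()
loss = infomax_loss(enc(torch.randn(20, 512)), torch.randint(0, 5, (20,)))
loss.backward()
```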
VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
Title | VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning |
Authors | Anonymous |
Abstract | Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but on the agent’s uncertainty about the environment. Computing a Bayes-optimal policy is however intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We also evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher return during training than existing methods. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hkl9JlBYvr |
https://openreview.net/pdf?id=Hkl9JlBYvr | |
PWC | https://paperswithcode.com/paper/varibad-a-very-good-method-for-bayes-adaptive |
Repo | |
Framework | |
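A minimal sketch of the two pieces the abstract describes: a recurrent inference network that maps the trajectory so far to a Gaussian belief over a latent task variable, and a policy conditioned on both the state and that belief. Architectures and dimensions are placeholders; the VAE-style objective used to train the encoder is not shown.

```python
import torch
import torch.nn as nn

class BeliefEncoder(nn.Module):
    """Illustrative recurrent encoder q(m | tau_{:t}) over a latent task variable m."""
    def __init__(self, obs_dim, act_dim, latent_dim=5, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, obs, act, rew):
        # obs: (B, T, obs_dim), act: (B, T, act_dim), rew: (B, T, 1)
        h, _ = self.gru(torch.cat([obs, act, rew], dim=-1))
        return self.mu(h), self.logvar(h)          # belief parameters at every timestep

class BeliefConditionedPolicy(nn.Module):
    """Policy pi(a | s, belief): acts on the state plus the belief parameters,
    so task uncertainty enters action selection directly."""
    def __init__(self, obs_dim, act_dim, latent_dim=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs, mu, logvar):
        return self.net(torch.cat([obs, mu, logvar], dim=-1))
```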
Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
Title | Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model |
Authors | Anonymous |
Abstract | Recent breakthroughs of large-scale pretrained language models have shown the effectiveness of self-training for natural language processing (NLP). In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method to capture knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task. Moreover, we propose a simple yet effective weakly supervised training objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model also achieves consistent improvements over BERT on four entity-related question answering datasets (average 2.7 F1 improvements on WebQuestions, TriviaQA, SearchQA and Quasar-T) and a standard fine-grained entity typing dataset (i.e., 5.7 accuracy gains on FIGER), establishing several new state-of-the-art results. |
Tasks | Entity Typing, Language Modelling, Question Answering |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlzm64tDH |
https://openreview.net/pdf?id=BJlzm64tDH | |
PWC | https://paperswithcode.com/paper/pretrained-encyclopedia-weakly-supervised |
Repo | |
Framework | |
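The abstract does not spell out the weakly supervised objective; one common reading of "explicitly forcing the model to incorporate knowledge about real-world entities" is an entity-replacement detection loss, where entity mentions are swapped for other entities and the model predicts, per mention, whether it was replaced. The sketch below is that assumed reading, written against any BERT-style encoder that returns per-token hidden states; names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class EntityReplacementHead(nn.Module):
    """Binary head predicting, for each entity mention, whether it was replaced."""
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder                 # any module returning (B, T, H) hidden states
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask, mention_spans):
        # mention_spans[b] is a list of (start, end, replaced) triples for example b
        h = self.encoder(input_ids, attention_mask)
        loss_fn = nn.BCEWithLogitsLoss()
        losses = []
        for b, spans in enumerate(mention_spans):
            for (start, end, replaced) in spans:
                mention = h[b, start:end].mean(dim=0)      # pooled mention representation
                logit = self.scorer(mention)
                target = torch.tensor([float(replaced)])
                losses.append(loss_fn(logit, target))
        return torch.stack(losses).mean()
```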
Continual Deep Learning by Functional Regularisation of Memorable Past
Title | Continual Deep Learning by Functional Regularisation of Memorable Past |
Authors | Anonymous |
Abstract | Continually learning new skills without forgetting old ones is an important quality for an intelligent system, yet most deep learning methods suffer from catastrophic forgetting of the past. Recent works have addressed this by regularising the network weights, but it is challenging to identify weights crucial to avoid forgetting. A better approach is to directly regularise the network outputs at past inputs, e.g., by using Gaussian processes (GPs), but this is usually computationally challenging. In this paper, we propose a scalable functional-regularisation approach where we regularise only over a few memorable past examples that are crucial to avoid forgetting. Our key idea is to use a GP formulation of deep networks, enabling us to both identify the memorable past and regularise over them. Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation methods are naturally combined with memory-based methods. |
Tasks | Gaussian Processes |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1xjgxBFPB |
https://openreview.net/pdf?id=r1xjgxBFPB | |
PWC | https://paperswithcode.com/paper/continual-deep-learning-by-functional |
Repo | |
Framework | |
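A simplified stand-in for the functional regulariser described above: store a small set of memorable past inputs together with the old model's predictions there, and penalise the current model for drifting on those points. The paper's GP formulation (which also identifies which points are memorable) is not reproduced; this only sketches the "regularise network outputs at past inputs" idea.

```python
import torch
import torch.nn.functional as F

def functional_regulariser(model, memory_x, memory_logits, tau=1.0):
    """KL penalty between the current predictive distribution and the stored
    one on the memorable past inputs."""
    cur = F.log_softmax(model(memory_x) / tau, dim=-1)
    old = F.softmax(memory_logits / tau, dim=-1)
    return F.kl_div(cur, old, reduction="batchmean")

# while training on task t:
#   loss = task_loss + lam * functional_regulariser(model, mem_x, mem_logits)
```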
Learning to Remember from a Multi-Task Teacher
Title | Learning to Remember from a Multi-Task Teacher |
Authors | Anonymous |
Abstract | Recent studies on catastrophic forgetting during sequential learning typically focus on fixing the accuracy of the predictions for a previously learned task. In this paper we argue that the outputs of neural networks are subject to rapid changes when learning a new data distribution, and networks that appear to “forget” everything still contain useful representations of previous tasks. We thus propose that, instead of enforcing the output accuracy to stay the same, we should aim to reduce the effect of catastrophic forgetting at the representation level, as the output layer can be quickly recovered later with a small number of examples. Towards this goal, we propose an experimental setup that measures the amount of representational forgetting, and develop a novel meta-learning algorithm to overcome this issue. The proposed meta-learner produces weight updates for a sequential learning network, mimicking a multi-task teacher network’s representation. We show that our meta-learner can improve its learned representations on new tasks, while maintaining a good representation for old tasks. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rke5R1SFwS |
https://openreview.net/pdf?id=rke5R1SFwS | |
PWC | https://paperswithcode.com/paper/learning-to-remember-from-a-multi-task |
Repo | |
Framework | |
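A crude illustration of "reducing forgetting at the representation level": penalise the sequential learner for drifting from a multi-task teacher's features. The paper's meta-learner generates the weight updates themselves rather than adding a loss term, so treat this only as a sketch of the matching target.

```python
import torch
import torch.nn.functional as F

def representation_matching_loss(student_feats, teacher_feats):
    """Cosine-distance matching between the sequential learner's features and
    the multi-task teacher's features on the same inputs."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```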
Independence-aware Advantage Estimation
Title | Independence-aware Advantage Estimation |
Authors | Anonymous |
Abstract | Most existing advantage function estimation methods in reinforcement learning suffer from high variance, which scales unfavorably with the time horizon. To address this challenge, we propose to identify an independence property between the current action and future states in the environment, which can then be leveraged to effectively reduce the variance of the advantage estimation. In particular, the recognized independence property can be naturally utilized to construct a novel importance sampling advantage estimator with close-to-zero variance even when the Monte-Carlo return signal yields a large variance. To further remove the risk of high variance introduced by the new estimator, we combine it with the existing Monte-Carlo estimator via a reward decomposition model learned by minimizing the estimation variance. Experiments demonstrate that our method achieves higher sample efficiency than existing advantage estimation methods in complex environments. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1eP504YDr |
https://openreview.net/pdf?id=B1eP504YDr | |
PWC | https://paperswithcode.com/paper/independence-aware-advantage-estimation |
Repo | |
Framework | |
Federated Adversarial Domain Adaptation
Title | Federated Adversarial Domain Adaptation |
Authors | Anonymous |
Abstract | Federated learning improves data privacy and efficiency in machine learning performed over networks of distributed devices, such as mobile phones, IoT and wearable devices. Yet models trained with federated learning can still fail to generalize to new devices due to the problem of domain shift. Domain shift occurs when the labeled data collected by source nodes statistically differs from the target node’s unlabeled data. In this work, we present a principled approach to the problem of federated domain adaptation, which aims to align the representations learned among the different nodes with the data distribution of the target node. Our approach extends adversarial adaptation techniques to the constraints of the federated setting. In addition, we devise a dynamic attention mechanism and leverage feature disentanglement to enhance knowledge transfer. Empirically, we perform extensive experiments on several image and text classification tasks and show promising results in the unsupervised federated domain adaptation setting. |
Tasks | Domain Adaptation, Text Classification, Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJezF3VYPB |
https://openreview.net/pdf?id=HJezF3VYPB | |
PWC | https://paperswithcode.com/paper/federated-adversarial-domain-adaptation-1 |
Repo | |
Framework | |
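A sketch of the adversarial-alignment building block that the abstract extends to the federated setting: a domain discriminator trained through a gradient-reversal layer, so that a node's features become indistinguishable from the target node's features. The dynamic attention and disentanglement components are not shown; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class DomainDiscriminator(nn.Module):
    """Predicts source vs. target from features; the reversed gradient pushes
    the feature extractor toward domain-invariant representations."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feats, lam=1.0):
        return self.net(grad_reverse(feats, lam))

# per-round sketch: each source node scores its own features (label 1) and the
# target node's features (label 0) with a BCE loss through this discriminator.
```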
Faster and Just As Accurate: A Simple Decomposition for Transformer Models
Title | Faster and Just As Accurate: A Simple Decomposition for Transformer Models |
Authors | Anonymous |
Abstract | Large pre-trained Transformers such as BERT have been tremendously effective for many NLP tasks. However, inference in these large-capacity models is prohibitively slow and expensive. Transformers are essentially a stack of self-attention layers which encode each input position using the entire input sequence as its context. However, we find that it may not be necessary to apply this expensive sequence-wide self-attention at all layers. Based on this observation, we propose a decomposition of a pre-trained Transformer that allows the lower layers to process segments of the input independently, enabling parallelism and caching. We show that the information loss due to this decomposition can be recovered in the upper layers with auxiliary supervision during fine-tuning. We evaluate the decomposition with pre-trained BERT models on five different paired-input tasks in question answering, sentence similarity, and natural language inference. Results show that the decomposition enables faster inference (up to 4x) and significant memory reduction (up to 70%) while retaining most (up to 99%) of the original performance. We will release the code. |
Tasks | Natural Language Inference, Question Answering |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1gKVeBtDH |
https://openreview.net/pdf?id=B1gKVeBtDH | |
PWC | https://paperswithcode.com/paper/faster-and-just-as-accurate-a-simple |
Repo | |
Framework | |
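A toy version of the decomposition: the lower layers encode each input segment on its own (so, for example, passage representations can be precomputed and cached), and only the upper layers attend across segments. The 9/3 layer split and the dimensions below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DecomposedEncoder(nn.Module):
    """Lower layers: segment-independent (parallelisable, cacheable).
    Upper layers: full cross-segment self-attention."""
    def __init__(self, d_model=256, nhead=4, n_lower=9, n_upper=3):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.lower = nn.TransformerEncoder(make_layer(), num_layers=n_lower)
        self.upper = nn.TransformerEncoder(make_layer(), num_layers=n_upper)

    def forward(self, seg_a, seg_b):
        # seg_a: (B, Ta, d), seg_b: (B, Tb, d); the lower stack never attends
        # across segments, so each segment's encoding can be reused.
        a = self.lower(seg_a)
        b = self.lower(seg_b)
        return self.upper(torch.cat([a, b], dim=1))
```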
Re-Examining Linear Embeddings for High-dimensional Bayesian Optimization
Title | Re-Examining Linear Embeddings for High-dimensional Bayesian Optimization |
Authors | Anonymous |
Abstract | Bayesian optimization (BO) is a popular approach to optimize resource-intensive black-box functions. A significant challenge in BO is to scale to high-dimensional parameter spaces while retaining sample efficiency. A solution considered in previous literature is to embed the high-dimensional parameter space into a lower-dimensional manifold, often a random linear embedding. In this paper, we identify several crucial issues and misconceptions about the use of linear embeddings for BO. We thoroughly study and analyze the consequences of using linear embeddings and show that some of the design choices in current approaches adversely impact their performance. Based on this new theoretical understanding we propose ALEBO, a new algorithm for high-dimensional BO via linear embeddings that outperforms state-of-the-art methods on a range of problems. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJgn3lBtwH |
https://openreview.net/pdf?id=SJgn3lBtwH | |
PWC | https://paperswithcode.com/paper/re-examining-linear-embeddings-for-high |
Repo | |
Framework | |
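For reference, the kind of random linear embedding the paper re-examines: the acquisition function is optimised in a low-dimensional space and candidates are mapped up through a random matrix B, with clipping to the box being one of the design choices the paper shows can hurt performance. ALEBO itself changes several of these details; the snippet below is only the baseline construction.

```python
import numpy as np

def make_linear_embedding(D, d, seed=0):
    """Random D x d embedding matrix for high-dimensional BO in a d-dim subspace."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((D, d))

def project_up(B, y, bounds=(-1.0, 1.0)):
    """Map a low-dimensional candidate y to the D-dimensional search box."""
    x = B @ y
    return np.clip(x, *bounds)   # clipping to the box: one of the re-examined choices

B = make_linear_embedding(D=1000, d=10)
y = np.random.uniform(-1, 1, size=10)
x = project_up(B, y)             # evaluate the black-box function at x
```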
Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming
Title | Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming |
Authors | Anonymous |
Abstract | The ability to detect objects regardless of image distortions or weather conditions is crucial for real-world applications of deep learning like autonomous driving. We here provide an easy-to-use benchmark to assess how object detection models perform when image quality degrades. The three resulting benchmark datasets, termed PASCAL-C, COCO-C and Cityscapes-C, contain a large variety of image corruptions. We show that a range of standard object detection models suffer a severe performance loss on corrupted images (down to 30-60% of the original performance). However, a simple data augmentation trick - stylizing the training images - leads to a substantial increase in robustness across corruption type, severity and dataset. We envision our comprehensive benchmark to track future progress towards building robust object detection models. Benchmark, code and data are available at: (hidden for double blind review) |
Tasks | Autonomous Driving, Data Augmentation, Object Detection, Robust Object Detection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryljMpNtwr |
https://openreview.net/pdf?id=ryljMpNtwr | |
PWC | https://paperswithcode.com/paper/benchmarking-robustness-in-object-detection-1 |
Repo | |
Framework | |
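The benchmark summarises robustness by averaging detection performance over corruption types and severities and relating it to clean performance; a hedged sketch of that bookkeeping (names are generic, not necessarily the paper's exact metric definitions):

```python
def corruption_robustness_summary(clean_ap, corrupted_ap):
    """clean_ap:     mAP on the uncorrupted test set
    corrupted_ap:    dict mapping (corruption, severity) -> mAP on that split
    Returns the mean performance under corruption and its value relative to the
    clean score, i.e. the kind of '30-60% of the original performance' number
    quoted in the abstract."""
    mean_pc = sum(corrupted_ap.values()) / len(corrupted_ap)
    return mean_pc, mean_pc / clean_ap

# example: a detector at 70.0 clean mAP averaging 30.1 mAP over corruptions
# retains a relative performance of roughly 0.43.
```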
Resizable Neural Networks
Title | Resizable Neural Networks |
Authors | Anonymous |
Abstract | In this paper, we present a deep convolutional neural network (CNN) which performs arbitrary resize operations on intermediate feature-map resolutions at stage level. Motivated by the weight-sharing mechanism in neural architecture search, where a super-network is trained and sub-networks inherit its weights, we construct a spatial super-network consisting of multiple sub-networks: each sub-network is a single-scale network with a unique spatial configuration, and the convolutional layers are shared across all sub-networks. Such networks, named Resizable Neural Networks, are equivalent to training infinitely many single-scale networks but incur no extra computational cost. Moreover, we present a training algorithm such that all sub-networks achieve better performance than their individually trained counterparts. On large-scale ImageNet classification, we demonstrate its effectiveness on various modern network architectures such as MobileNet, ShuffleNet, and ResNet. To go even further, we present three variants of resizable networks: 1) Resizable as Architecture Search (Resizable-NAS). On ImageNet, Resizable-NAS ResNet-50 attains 0.4% higher accuracy while being 44% smaller than the baseline model. 2) Resizable as Data Augmentation (Resizable-Aug). When we use resizable networks as a data augmentation technique, they obtain superior performance on ImageNet classification, outperforming AutoAugment by 1.2% with ResNet-50. 3) Adaptive Resizable Network (Resizable-Adapt). We introduce adaptive resizable networks as dynamic networks, which further improve performance at lower computational cost via data-dependent inference. |
Tasks | Data Augmentation, Neural Architecture Search |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJe_z1HFPr |
https://openreview.net/pdf?id=BJe_z1HFPr | |
PWC | https://paperswithcode.com/paper/resizable-neural-networks |
Repo | |
Framework | |
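A rough analogue of training the super-network's sub-networks together with shared weights: evaluate one weight-shared model at several resolutions per step and sum the losses. The paper resizes intermediate feature maps at stage level rather than the input, so this is only an approximation of the training loop; the model is assumed to handle variable spatial sizes (e.g. via global pooling).

```python
import torch.nn.functional as F

def resizable_training_step(model, x, y, scales=(1.0, 0.75, 0.5), opt=None):
    """One step of weight-shared multi-scale training: each scale plays the
    role of one sub-network, all sharing the same convolutional weights."""
    loss = 0.0
    for s in scales:
        xs = x if s == 1.0 else F.interpolate(x, scale_factor=s, mode="bilinear",
                                              align_corners=False)
        loss = loss + F.cross_entropy(model(xs), y)
    if opt is not None:
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss
```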
Role-Wise Data Augmentation for Knowledge Distillation
Title | Role-Wise Data Augmentation for Knowledge Distillation |
Authors | Jie Fu, Xue Geng, Bohan Zhuang, Xingdi Yuan, Adam Trischler, Jie Lin, Vijay Chandrasekhar, Chris Pal |
Abstract | Knowledge Distillation (KD) is a common method for transferring the “knowledge” learned by one machine learning model (the teacher) into another model (the student), where typically, the teacher has a greater capacity (e.g., more parameters or higher bit-widths). To our knowledge, existing methods overlook the fact that although the student absorbs extra knowledge from the teacher, both models share the same input data – and this data is the only medium by which the teacher’s knowledge can be demonstrated. Due to the difference in model capacities, the student may not benefit fully from the same data points on which the teacher is trained. On the other hand, a human teacher may demonstrate a piece of knowledge with individualized examples adapted to a particular student, for instance, in terms of her cultural background and interests. Inspired by this behavior, we design data augmentation agents with distinct roles to facilitate knowledge distillation. Our data augmentation agents generate distinct training data for the teacher and student, respectively. We focus specifically on KD when the teacher network has greater precision (bit-width) than the student network. We find empirically that specially tailored data points enable the teacher’s knowledge to be demonstrated more effectively to the student. We compare our approach with existing KD methods on training popular neural architectures and demonstrate that role-wise data augmentation improves the effectiveness of KD over strong prior approaches. The code for reproducing our results will be made publicly available. |
Tasks | Data Augmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJeidA4KvS |
https://openreview.net/pdf?id=rJeidA4KvS | |
PWC | https://paperswithcode.com/paper/role-wise-data-augmentation-for-knowledge |
Repo | |
Framework | |
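A sketch of the role-wise idea on top of a standard KD loss: the teacher and the student each receive data from their own augmentation agent. Here the agents are plain callables (e.g. two different augmentation policies), a simplification of the learned agents the paper trains.

```python
import torch
import torch.nn.functional as F

def role_wise_kd_loss(student, teacher, x, y, aug_s, aug_t, T=4.0, alpha=0.5):
    """Standard distillation loss, except the teacher sees aug_t(x) and the
    student sees aug_s(x) instead of the same batch."""
    with torch.no_grad():
        t_logits = teacher(aug_t(x))
    s_logits = student(aug_s(x))
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(s_logits, y)
    return alpha * soft + (1 - alpha) * hard
```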
Analytical Moment Regularizer for Training Robust Networks
Title | Analytical Moment Regularizer for Training Robust Networks |
Authors | Modar Alfadly, Adel Bibi, Muhammed Kocabas, Bernard Ghanem |
Abstract | Despite the impressive performance of deep neural networks (DNNs) on numerous learning tasks, they still exhibit uncouth behaviours. One puzzling behaviour is the subtle sensitive reaction of DNNs to various noise attacks. Such a nuisance has strengthened the line of research around developing and training noise-robust networks. In this work, we propose a new training regularizer that aims to minimize the probabilistic expected training loss of a DNN subject to a generic Gaussian input. We provide an efficient and simple approach to approximate such a regularizer for arbitrarily deep networks. This is done by leveraging the analytic expression of the output mean of a shallow neural network, avoiding the need for memory- and computation-expensive data augmentation. We conduct extensive experiments on LeNet and AlexNet on various datasets including MNIST, CIFAR10, and CIFAR100 to demonstrate the effectiveness of our proposed regularizer. In particular, we show that networks trained with the proposed regularizer benefit from a boost in robustness against Gaussian noise equivalent to performing 3-21 folds of noisy data augmentation. Moreover, we empirically show on several architectures and datasets that improving robustness against Gaussian noise, by using the new regularizer, can improve the overall robustness against 6 other types of attacks by two orders of magnitude. |
Tasks | Data Augmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1xDq2EFDH |
https://openreview.net/pdf?id=B1xDq2EFDH | |
PWC | https://paperswithcode.com/paper/analytical-moment-regularizer-for-training |
Repo | |
Framework | |
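The closed-form ingredient the abstract alludes to, the output mean of a shallow ReLU layer under a Gaussian input, can be written down directly: for z ~ N(m, s^2), E[ReLU(z)] = m Φ(m/s) + s φ(m/s). How the paper chains this into a regulariser for arbitrarily deep networks is not reproduced; the sketch only computes that shallow-layer mean.

```python
import torch

def expected_relu_affine(mu_x, sigma, W, b):
    """E[ReLU(W x + b)] for x ~ N(mu_x, sigma^2 I).

    Each pre-activation z_i = w_i . x + b_i is Gaussian with mean m_i and
    std s_i = sigma * ||w_i||, so E[ReLU(z_i)] = m_i * Phi(m_i/s_i) + s_i * phi(m_i/s_i)."""
    m = mu_x @ W.t() + b                               # (B, out)
    s = sigma * W.norm(dim=1).clamp_min(1e-12)         # (out,)
    r = m / s
    std_normal = torch.distributions.Normal(0.0, 1.0)
    return m * std_normal.cdf(r) + s * torch.exp(std_normal.log_prob(r))
```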
Learning to Learn via Gradient Component Corrections
Title | Learning to Learn via Gradient Component Corrections |
Authors | Anonymous |
Abstract | Gradient-based meta-learning algorithms require several steps of gradient descent to adapt to newly incoming tasks. This process becomes more costly as the number of samples increases. Moreover, the gradient updates suffer from several sources of noise, leading to degraded performance. In this work, we propose a meta-learning algorithm equipped with GradiEnt Component COrrections, a GECCO cell for short, which generates a multiplicative corrective low-rank matrix which (after vectorization) corrects the estimated gradients. GECCO contains a simple decoder-like network with learnable parameters, an attention module and a so-called context input parameter. The context parameter of GECCO is updated to generate a low-rank corrective term for the network gradients. As a result, meta-learning requires only a few gradient updates to absorb a new task (often, a single update is sufficient in the few-shot scenario). While previous approaches address this problem by altering the learning rates, factorising network parameters or directly learning feature corrections from features and/or gradients, GECCO is an off-the-shelf generator-like unit that performs element-wise gradient corrections without the need to ‘observe’ the features and/or the gradients directly. We show that our GECCO (i) accelerates learning, (ii) performs robust corrections of gradients corrupted by noise, and (iii) leads to notable improvements over existing gradient-based meta-learning algorithms. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1g7sxHKPr |
https://openreview.net/pdf?id=H1g7sxHKPr | |
PWC | https://paperswithcode.com/paper/learning-to-learn-via-gradient-component |
Repo | |
Framework | |
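A guess at the overall shape of such a cell: a small decoder maps a context vector to two factors whose outer product, vectorised, rescales the flattened gradient element-wise. The attention module and the way the context parameter is updated are omitted; every name and shape here is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GradientCorrector(nn.Module):
    """Generates a rank-1 multiplicative correction for a flattened gradient."""
    def __init__(self, ctx_dim, n_rows, n_cols, hidden=64):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_rows + n_cols))
        self.n_rows, self.n_cols = n_rows, n_cols

    def forward(self, context, grad_flat):
        # grad_flat must have n_rows * n_cols elements
        uv = self.decoder(context)
        u, v = uv[: self.n_rows], uv[self.n_rows:]
        correction = torch.outer(u, v).reshape(-1)     # low-rank corrective term
        return grad_flat * (1.0 + correction)          # element-wise corrected gradient
```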