Paper Group NANR 118
A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning
Title | A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning |
Authors | Anonymous |
Abstract | Despite the growing interest in continual learning, most contemporary work has studied a rather restricted setting where tasks are clearly distinguishable and task boundaries are known during training. However, if our goal is to develop an algorithm that learns as humans do, this setting is far from realistic, and it is essential to develop a methodology that works in a task-free manner. Meanwhile, among the several branches of continual learning, expansion-based methods have the advantage of eliminating catastrophic forgetting by allocating new resources to learn new data. In this work, we propose an expansion-based approach for task-free continual learning for the first time. Our model, named Continual Neural Dirichlet Process Mixture (CN-DPM), consists of a set of neural network experts that are each in charge of a subset of the data. CN-DPM expands the number of experts in a principled way under the Bayesian nonparametric framework. With extensive experiments, we show that our model successfully performs task-free continual learning for both discriminative and generative tasks such as image classification and image generation. |
Tasks | Continual Learning, Image Classification, Image Generation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJxSOJStPr |
https://openreview.net/pdf?id=SJxSOJStPr | |
PWC | https://paperswithcode.com/paper/a-neural-dirichlet-process-mixture-model-for |
Repo | |
Framework | |
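The expansion mechanism above follows the Chinese-restaurant-process view of a Dirichlet process mixture: an incoming point either joins an existing expert or, with prior weight set by a concentration parameter, opens a new one. Below is a minimal sketch of that decision rule under stated assumptions — the experts are 1-D Gaussians and `alpha` is a hypothetical concentration value, not the paper's CN-DPM architecture.

```python
# Toy CRP-style expert expansion on a task-free stream (a sketch, not CN-DPM).
import numpy as np
from scipy.stats import norm

alpha = 1.0        # concentration: propensity to open a new expert (assumed)
experts = []       # each expert: dict(mean, std, count)

def process(x):
    # responsibility of each existing expert, weighted by its data count
    scores = [e["count"] * norm.pdf(x, e["mean"], e["std"]) for e in experts]
    scores.append(alpha * norm.pdf(x, 0.0, 10.0))   # base measure: a new expert
    k = int(np.argmax(scores))
    if k == len(experts):                           # new expert wins: expand
        experts.append({"mean": x, "std": 1.0, "count": 1})
    else:                                           # otherwise update the winner
        e = experts[k]
        e["count"] += 1
        e["mean"] += (x - e["mean"]) / e["count"]   # running-mean update only

rng = np.random.default_rng(0)
for t in range(3):                                  # 3 "tasks", no boundaries given
    for x in rng.normal(5.0 * t, 1.0, size=200):
        process(x)
print(f"{len(experts)} experts after a task-free stream")
```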
Continual Density Ratio Estimation (CDRE): A new method for evaluating generative models in continual learning
Title | Continual Density Ratio Estimation (CDRE): A new method for evaluating generative models in continual learning |
Authors | Anonymous |
Abstract | We propose a new method, Continual Density Ratio Estimation (CDRE), which can estimate density ratios between a target distribution of real samples and a distribution of samples generated by a model while the model changes over time and the data of the target distribution is not available after a certain time point. This method perfectly fits the setting of continual learning, in which one model is supposed to learn different tasks sequentially and the most crucial restriction is that the model has no or very limited access to the data of previously learned tasks. Through CDRE, we can evaluate generative models in continual learning using f-divergences. To the best of our knowledge, there is no existing method that can evaluate generative models under the setting of continual learning without storing real samples from the target distribution. |
Tasks | Continual Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJemQJBKDr |
https://openreview.net/pdf?id=HJemQJBKDr | |
PWC | https://paperswithcode.com/paper/continual-density-ratio-estimation-cdre-a-new |
Repo | |
Framework | |
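One way to read this setting is as a telescoping product: the ratio between the real distribution and the current model, p/q_t, can be rebuilt as (p/q_1)(q_1/q_2)...(q_{t-1}/q_t), where only the first factor ever touches real data. The sketch below illustrates that chaining with the standard classifier trick for density ratio estimation; the logistic-regression estimator and Gaussian toy data are assumptions, not the paper's estimator.

```python
# Chained density-ratio estimation in the spirit of CDRE (a sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fit_log_ratio(num, den):
    """Fit log(p_num/p_den) via the classifier trick: log-odds of a logistic model."""
    X = np.concatenate([num, den]).reshape(-1, 1)
    y = np.concatenate([np.ones(len(num)), np.zeros(len(den))])
    clf = LogisticRegression().fit(X, y)
    return lambda x: clf.decision_function(np.asarray(x).reshape(-1, 1))

real = rng.normal(0.0, 1.0, 2000)          # target samples, only available at t=1
gens = [rng.normal(m, 1.0, 2000) for m in (0.5, 0.8, 1.2)]   # drifting generator

chain = [fit_log_ratio(real, gens[0])]     # p / q_1: uses real data exactly once
for t in range(1, len(gens)):
    chain.append(fit_log_ratio(gens[t - 1], gens[t]))  # q_{t-1}/q_t: no real data

x = gens[-1][:500]
log_r = sum(f(x) for f in chain)           # log p/q_T by telescoping
print("estimated KL(q_T || p) ~", -log_r.mean())
```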
Contrastive Representation Distillation
Title | Contrastive Representation Distillation |
Authors | Anonymous |
Abstract | Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher’s representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. When combined with knowledge distillation, our method sets a state of the art in many transfer tasks, sometimes even outperforming the teacher network. |
Tasks | Model Compression, Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkgpBJrtvS |
https://openreview.net/pdf?id=SkgpBJrtvS | |
PWC | https://paperswithcode.com/paper/contrastive-representation-distillation |
Repo | |
Framework | |
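The contrastive formulation can be made concrete with an InfoNCE-style loss in which the teacher's and student's embeddings of the same input form the positive pair and all other pairings in the batch act as negatives. This is a minimal sketch of that idea; the paper's actual CRD objective uses a memory buffer and a learned critic rather than this in-batch form.

```python
# In-batch InfoNCE between student and teacher embeddings (a sketch of the idea).
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_emb, teacher_emb, temperature=0.1):
    s = F.normalize(student_emb, dim=1)                # (N, d)
    t = F.normalize(teacher_emb, dim=1)                # (N, d)
    logits = s @ t.T / temperature                     # similarity of every pair
    labels = torch.arange(s.size(0), device=s.device)  # i-th student <-> i-th teacher
    return F.cross_entropy(logits, labels)             # pull positives, push negatives

s = torch.randn(32, 128, requires_grad=True)           # student embeddings (toy)
t = torch.randn(32, 128)                               # teacher embeddings (toy)
print(contrastive_distill_loss(s, t).item())
```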
Information-Theoretic Local Minima Characterization and Regularization
Title | Information-Theoretic Local Minima Characterization and Regularization |
Authors | Anonymous |
Abstract | Recent advances in deep learning theory have evoked the study of generalizability across different local minima of deep neural networks (DNNs). While current work has focused on either discovering properties of good local minima or developing regularization techniques to induce good local minima, no approach exists that can tackle both problems. We achieve these two goals in a unified manner. Specifically, based on the Fisher information we propose a metric that is both strongly indicative of the generalizability of local minima and effective as a practical regularizer. We provide theoretical analysis, including a generalization bound, and empirically demonstrate the success of our approach in both capturing and improving the generalizability of DNNs. Experiments are performed on CIFAR-10 and CIFAR-100 for various network architectures. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlXgkHYvS |
https://openreview.net/pdf?id=BJlXgkHYvS | |
PWC | https://paperswithcode.com/paper/information-theoretic-local-minima |
Repo | |
Framework | |
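A common way to turn a Fisher-information quantity into a practical regularizer is to penalize the trace of the empirical Fisher — the summed squared gradients of the loss — which requires differentiating through the gradient itself. The sketch below shows that mechanic in PyTorch; the paper's metric and penalty are derived differently, so treat this as an illustration of the general recipe.

```python
# Empirical-Fisher-trace penalty as a flatness regularizer (illustrative only).
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 3)
x, y = torch.randn(64, 10), torch.randint(0, 3, (64,))

loss = F.cross_entropy(model(x), y)
grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
fisher_trace = sum((g ** 2).sum() for g in grads)   # diagonal empirical Fisher, summed

total = loss + 0.1 * fisher_trace                   # penalise sharp/high-information minima
total.backward()                                    # second-order term flows via create_graph
```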
Efficient Training of Robust and Verifiable Neural Networks
Title | Efficient Training of Robust and Verifiable Neural Networks |
Authors | Anonymous |
Abstract | Recent works have developed several methods of defending neural networks against adversarial attacks with certified guarantees. We propose that many common certified defenses can be viewed under a unified framework of regularization. This unified framework provides a technique for comparing different certified defenses with respect to robust generalization. In addition, we develop a new regularizer that is both more efficient than existing certified defenses and can be used to train networks with higher certified accuracy. Our regularizer also extends to an L0 threat model and ensemble models. Through experiments on MNIST, CIFAR-10 and GTSRB, we demonstrate improvements in training speed and certified accuracy compared to state-of-the-art certified defenses. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hke_f0EYPH |
https://openreview.net/pdf?id=Hke_f0EYPH | |
PWC | https://paperswithcode.com/paper/efficient-training-of-robust-and-verifiable |
Repo | |
Framework | |
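Certified defenses of this kind typically propagate a worst-case input region through the network; interval bound propagation (IBP) is the simplest such ingredient, and the regularization view in the abstract applies to bounds like these. Below is a minimal IBP sketch for a two-layer ReLU network — not the paper's exact regularizer.

```python
# Interval bound propagation through linear + ReLU layers (a minimal sketch).
import torch

def ibp_linear(W, b, lo, hi):
    """Propagate an input box [lo, hi] through x -> Wx + b."""
    mid, rad = (lo + hi) / 2, (hi - lo) / 2
    new_mid = mid @ W.T + b
    new_rad = rad @ W.T.abs()          # box radius grows with |W|
    return new_mid - new_rad, new_mid + new_rad

W1, b1 = torch.randn(32, 10), torch.zeros(32)
W2, b2 = torch.randn(3, 32), torch.zeros(3)

x, eps = torch.randn(1, 10), 0.1       # certify an L-inf ball of radius eps
lo, hi = ibp_linear(W1, b1, x - eps, x + eps)
lo, hi = lo.clamp(min=0), hi.clamp(min=0)          # interval ReLU
lo, hi = ibp_linear(W2, b2, lo, hi)
print("certified output bounds:", lo, hi)          # robust if the true class's lower
                                                   # bound beats all other upper bounds
```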
Copy That! Editing Sequences by Copying Spans
Title | Copy That! Editing Sequences by Copying Spans |
Authors | Anonymous |
Abstract | Neural sequence-to-sequence models are finding increasing use in editing documents, for example in correcting a text document or repairing source code. In this paper, we argue that existing seq2seq models (with a facility to copy single tokens) are not a natural fit for such tasks, as they have to explicitly copy each unchanged token. We present an extension of seq2seq models capable of copying entire spans of the input to the output in one step, greatly reducing the number of decisions required during inference. This extension means that there are now many ways of generating the same output, which we handle by deriving a new training objective and a variant of beam search for inference that explicitly account for this ambiguity. In our experiments on a range of editing tasks over natural language and source code, we show that our new model consistently outperforms simpler baselines. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SklM1xStPB |
https://openreview.net/pdf?id=SklM1xStPB | |
PWC | https://paperswithcode.com/paper/copy-that-editing-sequences-by-copying-spans |
Repo | |
Framework | |
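The core difficulty the abstract names — many derivations of one output once span copies are allowed — is easy to see with a small dynamic program. The sketch below merely counts derivations (generate one token, or copy a matching source span of length ≥ 2); the paper instead marginalizes per-step probabilities over these same paths, but the counting version makes the ambiguity concrete.

```python
# Count the distinct generate/copy-span derivations of a target sequence.
def count_derivations(src, tgt):
    ways = [0] * (len(tgt) + 1)        # ways[j] = derivations producing tgt[:j]
    ways[0] = 1
    for j in range(len(tgt)):
        if ways[j] == 0:
            continue
        ways[j + 1] += ways[j]                      # generate tgt[j] token by token
        for i in range(len(src)):                   # or copy a matching source span
            k = 0
            while (i + k < len(src) and j + k < len(tgt)
                   and src[i + k] == tgt[j + k]):
                k += 1
                if k >= 2:                          # spans of length >= 2
                    ways[j + k] += ways[j]
    return ways[-1]

src = "the quick brown fox".split()
tgt = "the quick brown fox .".split()
print(count_derivations(src, tgt), "derivations")   # many paths to one output
```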
If MaxEnt RL is the Answer, What is the Question?
Title | If MaxEnt RL is the Answer, What is the Question? |
Authors | Anonymous |
Abstract | Experimentally, it has been observed that humans and animals often make decisions that do not maximize their expected utility, but rather choose outcomes randomly, with probability proportional to expected utility. Probability matching, as this strategy is called, is equivalent to maximum entropy reinforcement learning (MaxEnt RL). However, MaxEnt RL does not optimize expected utility. In this paper, we formally show that MaxEnt RL does optimally solve certain classes of control problems with variability in the reward function. In particular, we show (1) that MaxEnt RL can be used to solve a certain class of POMDPs, and (2) that MaxEnt RL is equivalent to a two-player game where an adversary chooses the reward function. These results suggest a deeper connection between MaxEnt RL, robust control, and POMDPs, and provide insight for the types of problems for which we might expect MaxEnt RL to produce effective solutions. Specifically, our results suggest that domains with uncertainty in the task goal may be especially well-suited for MaxEnt RL methods. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkxcZCNKDS |
https://openreview.net/pdf?id=SkxcZCNKDS | |
PWC | https://paperswithcode.com/paper/if-maxent-rl-is-the-answer-what-is-the-1 |
Repo | |
Framework | |
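MaxEnt RL replaces the hard max in the Bellman backup with a log-sum-exp, so the resulting optimal policy is a softmax over Q-values — which is exactly the probability-matching behavior the abstract starts from. A tabular soft value iteration on a toy two-state MDP, under assumed rewards and dynamics, looks like this:

```python
# Soft (MaxEnt) value iteration on a toy 2-state, 2-action MDP.
import numpy as np

n_s, n_a, gamma, alpha = 2, 2, 0.9, 1.0
R = np.array([[1.0, 0.5], [0.2, 0.8]])        # reward[s, a] (assumed values)
P = np.zeros((n_s, n_a, n_s))
P[:, 0, 0] = 1.0                              # action 0 leads to state 0
P[:, 1, 1] = 1.0                              # action 1 leads to state 1

V = np.zeros(n_s)
for _ in range(200):
    Q = R + gamma * P @ V                     # Q[s, a]
    V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))  # soft Bellman backup

pi = np.exp(Q / alpha)
pi /= pi.sum(axis=1, keepdims=True)           # optimal MaxEnt policy: softmax in Q
print("MaxEnt policy:\n", pi)
```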
Learning to Anneal and Prune Proximity Graphs for Similarity Search
Title | Learning to Anneal and Prune Proximity Graphs for Similarity Search |
Authors | Anonymous |
Abstract | This paper studies similarity search, which is a crucial enabler of many feature vector–based applications. The problem of similarity search has been extensively studied in the machine learning community. Recent advances in proximity graphs have achieved outstanding performance by exploiting the navigability of the underlying graph structure. In this work, we introduce the annealable proximity graph (APG) method to learn and reshape proximity graphs for efficient and effective similarity search. APG makes proximity graph edges annealable, so they can be effectively trained with a stochastic optimization algorithm. APG identifies the important edges that best preserve graph navigability and prunes inferior edges without drastically changing graph properties. Experimental results show that APG achieves state-of-the-art results, not only producing proximity graphs with fewer edges but also speeding up search time by 20–40% across different datasets, with almost no loss of accuracy. |
Tasks | Stochastic Optimization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJlXC3EtwB |
https://openreview.net/pdf?id=HJlXC3EtwB | |
PWC | https://paperswithcode.com/paper/learning-to-anneal-and-prune-proximity-graphs |
Repo | |
Framework | |
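The "annealable" edge idea can be sketched as one learnable logit per edge, squashed through a temperature-annealed sigmoid and trained against a sparsity penalty. In the toy below, a random "importance" mask stands in for the paper's navigability objective, so this is only an illustration of the gating-and-annealing mechanism, not APG itself.

```python
# Temperature-annealed edge gating with a sparsity penalty (toy illustration).
import torch

n = 50
adj = (torch.rand(n, n) < 0.2).float()               # toy proximity graph
importance = (torch.rand(n, n) < 0.5).float() * adj  # stand-in for navigability needs
logits = torch.zeros(n, n, requires_grad=True)       # one annealable score per edge

opt = torch.optim.Adam([logits], lr=0.1)
for step in range(300):
    T = max(1.0 - step / 300, 0.05)                  # anneal toward hard gates
    gates = torch.sigmoid(logits / T) * adj
    fidelity = (importance * (gates - adj) ** 2).sum()   # keep important edges
    loss = fidelity + 0.2 * gates.sum()                  # ...but prefer fewer edges
    opt.zero_grad()
    loss.backward()
    opt.step()

pruned = (torch.sigmoid(logits / 0.05) > 0.5).float() * adj
print(f"kept {int(pruned.sum())} of {int(adj.sum())} edges")
```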
On the Invertibility of Invertible Neural Networks
Title | On the Invertibility of Invertible Neural Networks |
Authors | Anonymous |
Abstract | Guarantees in deep learning are hard to achieve due to the interplay of flexible modeling schemes and complex tasks. Invertible neural networks (INNs), however, provide several mathematical guarantees by design, such as the ability to approximate non-linear diffeomorphisms. One less studied advantage of INNs is that they enable the design of bi-Lipschitz functions. This property has been used implicitly by various works to design generative models, save memory in gradient computation, regularize classifiers, and solve inverse problems. In this work, we study Lipschitz constants of invertible architectures in order to investigate guarantees on the stability of their inverse and forward mappings. Our analysis reveals that commonly used INN building blocks can easily become non-invertible, leading to questionable "exact" log-likelihood computations and training difficulties. We introduce a set of numerical analysis tools to diagnose non-invertibility in practice. Finally, based on our theoretical analysis, we show how to guarantee numerical invertibility for one of the most common INN architectures. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlVeyHFwH |
https://openreview.net/pdf?id=BJlVeyHFwH | |
PWC | https://paperswithcode.com/paper/on-the-invertibility-of-invertible-neural |
Repo | |
Framework | |
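The numerical diagnosis mentioned above can be as simple as running a block forward, then through its analytic inverse, and measuring the reconstruction error; for affine coupling layers, poorly conditioned scale terms make this blow up. A minimal sketch, with a hypothetical `scale_gain` knob standing in for whatever drives the scales large in practice:

```python
# Forward/inverse round-trip diagnostic for an affine coupling layer.
import torch

def coupling_forward(x, scale_gain):
    x1, x2 = x.chunk(2, dim=1)
    s = torch.tanh(x1) * scale_gain            # log-scale from the first half
    return torch.cat([x1, x2 * torch.exp(s)], dim=1)

def coupling_inverse(y, scale_gain):
    y1, y2 = y.chunk(2, dim=1)
    s = torch.tanh(y1) * scale_gain            # y1 == x1, so s is recomputable
    return torch.cat([y1, y2 * torch.exp(-s)], dim=1)

x = torch.randn(1000, 4)
for gain in (1.0, 50.0, 100.0):                # larger gain -> worse conditioning
    y = coupling_forward(x, gain)
    err = (coupling_inverse(y, gain) - x).abs().max().item()
    print(f"gain={gain:5.1f}  max reconstruction error={err:.3e}")
    # at gain=100 the scales overflow float32 and the "inverse" returns inf
```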
Hope For The Best But Prepare For The Worst: Cautious Adaptation In RL Agents
Title | Hope For The Best But Prepare For The Worst: Cautious Adaptation In RL Agents |
Authors | Anonymous |
Abstract | We study the problem of safe adaptation: given a model trained on a variety of past experiences for some task, can this model learn to perform that task in a new situation while avoiding catastrophic failure? This problem setting occurs frequently in real-world reinforcement learning scenarios such as a vehicle adapting to drive in a new city, or a robotic drone adapting a policy trained only in simulation. While learning without catastrophic failures is exceptionally difficult, prior experience can allow us to learn models that make this much easier. These models might not directly transfer to new settings, but can enable cautious adaptation that is substantially safer than naïve adaptation or learning from scratch. Building on this intuition, we propose risk-averse domain adaptation (RADA). RADA works in two steps: it first trains probabilistic model-based RL agents in a population of source domains to gain experience and capture epistemic uncertainty about the environment dynamics. Then, when dropped into a new environment, it employs a pessimistic exploration policy, selecting actions that have the best worst-case performance as forecasted by the probabilistic model. We show that this simple maximin policy accelerates domain adaptation in a safety-critical driving environment with varying vehicle sizes. We compare our approach against other approaches for adapting to new environments, including meta-reinforcement learning. |
Tasks | Domain Adaptation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkxA5lBFvH |
https://openreview.net/pdf?id=BkxA5lBFvH | |
PWC | https://paperswithcode.com/paper/hope-for-the-best-but-prepare-for-the-worst |
Repo | |
Framework | |
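The pessimistic exploration step reduces to a maximin rule: score each candidate action under every model in the ensemble and act on the worst case. A toy version with closed-form "models" (the ensemble members below are hypothetical reward functions, not learned dynamics models):

```python
# Maximin (best worst-case) action selection over an ensemble of models.
import numpy as np

rng = np.random.default_rng(0)
# each ensemble member prefers a slightly different action (epistemic spread)
ensemble = [lambda a, w=w: -(a - w) ** 2 for w in rng.normal(1.0, 0.3, 5)]
candidates = np.linspace(-2.0, 2.0, 41)

def cautious_action(actions, models):
    worst_case = [min(m(a) for m in models) for a in actions]  # pessimism over models
    return actions[int(np.argmax(worst_case))]                 # best worst case

print("maximin action:", cautious_action(candidates, ensemble))
```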
Corpus Based Amharic Sentiment Lexicon Generation
Title | Corpus Based Amharic Sentiment Lexicon Generation |
Authors | Girma Neshir, Andreas Rauber, and Solomon Atnafu |
Abstract | Sentiment classification is an active research area with several applications, including analysis of political opinions and classification of comments, movie reviews, news reviews, and product reviews. Rule-based sentiment classification requires sentiment lexicons. However, manual construction of a sentiment lexicon is time-consuming and costly for resource-limited languages. To bypass manual development time and cost, we build Amharic sentiment lexicons using a corpus-based approach, intended to capture sentiment terms specific to the Amharic language from an Amharic corpus. A small set of seed terms is manually prepared from three parts of speech: nouns, adjectives, and verbs. We develop algorithms for constructing Amharic sentiment lexicons automatically from an Amharic news corpus. The corpus-based approach relies on word co-occurrence distributional embeddings, including frequency-based embeddings (i.e., Positive Pointwise Mutual Information, PPMI). First, we build a word-context unigram frequency count matrix and transform it into a pointwise mutual information matrix. Using this matrix, we compute the cosine distance between the mean vector of the seed list and each word in the corpus vocabulary. Based on a threshold value, the words closest to the mean vector of the seed list are added to the lexicon. The mean vector of the updated seed list is then recomputed, and the process is repeated until the lexicon contains sufficient terms. Using PPMI with threshold values of 100 and 200, we obtain corpus-based Amharic sentiment lexicons of size 1811 and 3794, respectively, by expanding 519 seeds. Finally, the generated lexicons are evaluated. |
Tasks | Sentiment Analysis |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJxVT3EKDH |
https://openreview.net/pdf?id=BJxVT3EKDH | |
PWC | https://paperswithcode.com/paper/corpus-based-amharic-sentiment-lexicon |
Repo | |
Framework | |
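The expansion loop above — PPMI vectors, cosine similarity to the seed mean, add the closest words, update the mean, repeat — is compact enough to sketch end to end. The corpus, vocabulary, and stopping rule below are toy stand-ins for the paper's Amharic news corpus and thresholds:

```python
# PPMI-based seed expansion for lexicon construction (toy data, illustrative).
import numpy as np

def ppmi(counts):
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total
    pc = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (pw * pc))
    return np.maximum(pmi, 0.0)                      # keep positive PMI only

def expand(vocab, vecs, seeds, top_k=2, rounds=3):
    lexicon = set(seeds)
    for _ in range(rounds):
        mean = vecs[[vocab.index(w) for w in lexicon]].mean(axis=0)
        sims = vecs @ mean / (np.linalg.norm(vecs, axis=1)
                              * np.linalg.norm(mean) + 1e-9)
        order = np.argsort(-sims)                    # closest to the seed mean first
        new = [vocab[i] for i in order if vocab[i] not in lexicon][:top_k]
        if not new:
            break
        lexicon.update(new)                          # grow lexicon, then update mean
    return lexicon

vocab = ["good", "great", "fine", "bad", "awful", "table"]
counts = np.random.default_rng(0).poisson(3.0, (6, 20)).astype(float)
print(expand(vocab, ppmi(counts), seeds=["good"]))
```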
ROS-HPL: Robotic Object Search with Hierarchical Policy Learning and Intrinsic-Extrinsic Modeling
Title | ROS-HPL: Robotic Object Search with Hierarchical Policy Learning and Intrinsic-Extrinsic Modeling |
Authors | Anonymous |
Abstract | Despite significant progress in Robotic Object Search (ROS) in recent years with deep reinforcement learning based approaches, the sparsity issue in reward setting as well as the lack of interpretability of previous ROS approaches leave much to be desired. We present a novel policy learning approach for ROS, based on hierarchical and interpretable modeling with an intrinsic/extrinsic reward setting, to tackle these two challenges. More specifically, we train the low-level policy by deliberating between an action that achieves an immediate sub-goal and one that is better suited for achieving the final goal. We also introduce a new evaluation metric, namely the extrinsic reward, as a harmonic measure of the object search success rate and the average steps taken. Experiments conducted with multiple settings in the House3D environment show that the intelligent agent, trained with our model, can achieve better object search performance (higher success rate with lower average steps, measured by SPL: Success weighted by inverse Path Length). In addition, we conduct studies w.r.t. the parameter that controls the weighted overall reward from the intrinsic and extrinsic components. The results suggest it is critical to devise a proper trade-off strategy to perform the object search well. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BklxI0VtDB |
https://openreview.net/pdf?id=BklxI0VtDB | |
PWC | https://paperswithcode.com/paper/ros-hpl-robotic-object-search-with |
Repo | |
Framework | |
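The abstract describes the extrinsic reward as a harmonic measure of success rate and average steps; one plausible reading (an assumption here, since the exact formula is not given) is a harmonic mean of the success rate and an SPL-style inverse-steps efficiency term:

```python
# Hypothetical harmonic success/efficiency score in the spirit of SPL.
def harmonic_score(success_rate, avg_steps, min_steps=10):
    efficiency = min_steps / max(avg_steps, min_steps)   # 1.0 when path is optimal
    # harmonic mean: high only when BOTH success and efficiency are high
    return 2 * success_rate * efficiency / (success_rate + efficiency + 1e-9)

print(harmonic_score(0.8, 25))   # good success rate, mediocre efficiency
```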
Revisiting the Information Plane
Title | Revisiting the Information Plane |
Authors | Anonymous |
Abstract | There has recently been a heated debate (e.g. Schwartz-Ziv & Tishby (2017), Saxe et al. (2018), Noshad et al. (2018), Goldfeld et al. (2018)) about measuring the information flow in Deep Neural Networks using techniques from information theory. It is claimed that Deep Neural Networks in general have good generalization capabilities since they not only learn how to map from an input to an output but also how to compress information about the training data input (Schwartz-Ziv & Tishby, 2017). That is, they abstract the input information and strip away any unnecessary or over-specific information. If so, the message compression method Information Bottleneck (IB) could be used as a natural comparator for network performance, since this method gives an optimal information compression boundary. This claim was later both disputed and reaffirmed (e.g. Saxe et al. (2018), Achille et al. (2017), Noshad et al. (2018)), as the employed method of measuring mutual information does not actually measure information but rather the clustering of the internal layer representations (Goldfeld et al. (2018)). In this paper, we present a detailed explanation of the developments around the Information Plane (IP), a plot type that compares mutual information quantities to judge compression (Schwartz-Ziv & Tishby (2017)), when noise is retroactively added (using binning estimation). We also explain why different activation functions show different trajectories on the IP. Further, we look into the effect of clustering on the network loss through early and perfect stopping using the Information Plane, and show how clustering can be used to help network pruning. |
Tasks | Network Pruning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hyljn1SFwr |
https://openreview.net/pdf?id=Hyljn1SFwr | |
PWC | https://paperswithcode.com/paper/revisiting-the-information-plane |
Repo | |
Framework | |
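Most Information Plane plots rest on the binning estimator alluded to above: discretize the activations, then compute mutual information from discrete counts — which is why the bin width (the "retroactively added noise") shapes the apparent compression. A minimal version, on toy data with an arbitrary choice of 30 bins:

```python
# Binning estimator of I(X; T) between inputs and a hidden layer (a sketch).
import numpy as np

def mi_binned(x_ids, h, n_bins=30):
    """I(X; T) with X discrete (class ids) and T a binned activation matrix."""
    edges = np.linspace(h.min(), h.max(), n_bins + 1)
    t_ids = np.ravel_multi_index(
        np.stack([np.digitize(h[:, j], edges) for j in range(h.shape[1])]),
        dims=(n_bins + 2,) * h.shape[1], mode="clip")   # one id per binned vector

    def entropy(ids):
        p = np.unique(ids, return_counts=True)[1] / len(ids)
        return -(p * np.log2(p)).sum()

    joint = x_ids.astype(np.int64) * (t_ids.max() + 1) + t_ids
    return entropy(x_ids) + entropy(t_ids) - entropy(joint)

rng = np.random.default_rng(0)
x = rng.integers(0, 10, 5000)                             # 10 input classes
h = np.tanh(x[:, None] + rng.normal(0, 0.5, (5000, 2)))   # toy 2-unit hidden layer
print("I(X;T) ~", mi_binned(x, h), "bits")                # shrinks as n_bins shrinks
```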
One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation
Title | One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation |
Authors | Anonymous |
Abstract | Recent advances in the sparse neural network literature have made it possible to prune many large feed-forward and convolutional networks with only a small quantity of data. Yet, these same techniques often falter when applied to the problem of recovering sparse recurrent networks. These failures are quantitative: when pruned with recent techniques, RNNs typically obtain worse performance than they do under a simple random pruning scheme. The failures are also qualitative: the distribution of active weights in a pruned LSTM or GRU network tends to be concentrated in specific neurons and gates, rather than well dispersed across the entire architecture. We seek to rectify both the quantitative and qualitative issues with recurrent network pruning by introducing a new recurrent pruning objective derived from the spectrum of the recurrent Jacobian. Our objective is data efficient (requiring only 64 data points to prune the network), easy to implement, and produces 95% sparse GRUs that significantly improve on existing baselines. We evaluate on sequential MNIST, Billion Words, and Wikitext. |
Tasks | Network Pruning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1e9GCNKvH |
https://openreview.net/pdf?id=r1e9GCNKvH | |
PWC | https://paperswithcode.com/paper/one-shot-pruning-of-recurrent-neural-networks |
Repo | |
Framework | |
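A pruning objective "derived from the spectrum of the recurrent Jacobian" suggests scoring weights by how much they move the Jacobian's leading singular value. The sketch below does this for a vanilla tanh RNN, with saliency |w · ∂σ_max/∂w|; the paper targets GRU/LSTM Jacobians, so this is an illustration of the scoring idea only.

```python
# Jacobian-spectrum saliency for one-shot pruning of a vanilla RNN (a sketch).
import torch

d = 32
W = (torch.randn(d, d) * 0.5).requires_grad_()  # recurrent weight matrix
h = torch.randn(8, d)                           # a handful of data points suffices

pre = h @ W.T
jac_scale = (1 - torch.tanh(pre) ** 2).mean(0)  # mean tanh' per unit over the batch
J = jac_scale[:, None] * W                      # avg one-step Jacobian of h -> tanh(Wh)
sigma_max = torch.linalg.matrix_norm(J, ord=2)  # largest singular value

sigma_max.backward()
saliency = (W * W.grad).abs()                   # weights that move the spectrum most
threshold = saliency.flatten().kthvalue(int(0.95 * d * d)).values
mask = saliency >= threshold                    # keep top ~5% -> ~95% sparse
print(f"kept {mask.sum().item()} / {d * d} weights")
```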
Learning Structured Communication for Multi-agent Reinforcement Learning
Title | Learning Structured Communication for Multi-agent Reinforcement Learning |
Authors | Anonymous |
Abstract | Learning to cooperate is crucial for many practical large-scale multi-agent applications. In this work, we consider an important collaborative task, in which agents learn to efficiently communicate with each other under a multi-agent reinforcement learning (MARL) setting. Despite the fact that there have been a number of existing works along this line, achieving global cooperation at scale is still challenging. In particular, most of the existing algorithms suffer from issues such as scalability and high communication complexity, in the sense that when the agent population is large, it can be difficult to extract effective information for high-performance MARL. In contrast, the proposed algorithmic framework, termed Learning Structured Communication (LSC), is not only scalable but also achieves high-quality communication and efficient learning. The key idea is to allow the agents to dynamically learn a hierarchical communication structure, under which a graph neural network (GNN) is used to efficiently extract useful information to be exchanged between neighboring agents. A number of new techniques are proposed to tightly integrate communication structure learning, GNN optimization, and the MARL task. Extensive experiments are performed to demonstrate that the proposed LSC framework enjoys high communication efficiency, scalability, and global cooperation capability. |
Tasks | Multi-agent Reinforcement Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BklWt24tvH |
https://openreview.net/pdf?id=BklWt24tvH | |
PWC | https://paperswithcode.com/paper/learning-structured-communication-for-multi |
Repo | |
Framework | |
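Hierarchical communication of the kind described above can be pictured as two levels of aggregation: agents pool messages to a group leader, leaders exchange a global summary, and everything is broadcast back, cutting the pairwise message count. The fixed grouping and mean-pooling aggregators below are assumptions; the paper learns the hierarchy and uses GNNs for aggregation.

```python
# Two-level structured communication with fixed groups (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_agents, d = 12, 8
feats = rng.normal(size=(n_agents, d))
groups = np.repeat(np.arange(3), 4)            # 3 groups of 4 agents (fixed here)

# level 1: intra-group aggregation to each leader
leader_msgs = np.stack([feats[groups == g].mean(0) for g in range(3)])
# level 2: leaders exchange a global summary
global_msg = leader_msgs.mean(0)
# broadcast: each agent sees its features, its leader's message, and the summary
obs = np.concatenate(
    [feats, leader_msgs[groups], np.tile(global_msg, (n_agents, 1))], axis=1)
print("per-agent input after structured communication:", obs.shape)
```

Compared with all-to-all messaging, the per-round message count here scales with the number of agents plus the number of leaders rather than quadratically in the agent population.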