April 1, 2020

2970 words 14 mins read

Paper Group NANR 133

Winning the Lottery with Continuous Sparsification. Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula. Reweighted Proximal Pruning for Large-Scale Language Representation. Distilled embedding: non-linear embedding factorization using knowledge distillation. Soft Token Matching for Interpretable Low-Resource C …

Winning the Lottery with Continuous Sparsification

Title Winning the Lottery with Continuous Sparsification
Authors Anonymous
Abstract The Lottery Ticket Hypothesis from Frankle & Carbin (2019) conjectures that, for typically-sized neural networks, it is possible to find small sub-networks which train faster and yield superior performance compared to their original counterparts. The proposed algorithm to search for such sub-networks (winning tickets), Iterative Magnitude Pruning (IMP), consistently finds sub-networks with 90-95% fewer parameters which indeed train faster and better than the overparameterized models they were extracted from, creating potential applications to problems such as transfer learning. In this paper, we propose a new algorithm to search for winning tickets, Continuous Sparsification, which continuously removes parameters from a network during training, and learns the sub-network’s structure with gradient-based methods instead of relying on pruning strategies. We show empirically that our method is capable of finding tickets that outperform the ones learned by Iterative Magnitude Pruning, while at the same time providing up to 5 times faster search, when measured in number of training epochs.
Tasks Transfer Learning
Published 2020-01-01
URL https://openreview.net/forum?id=BJe4oxHYPB
PDF https://openreview.net/pdf?id=BJe4oxHYPB
PWC https://paperswithcode.com/paper/winning-the-lottery-with-continuous
Repo
Framework
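The gating mechanism behind Continuous Sparsification can be illustrated with a short PyTorch sketch. This is a minimal illustration of the idea in the abstract (a per-weight soft mask trained jointly with the weights by gradient descent and gradually hardened); names such as `beta`, `mask_logits`, and the penalty coefficient are our own assumptions, not the paper's notation.

```python
import torch
import torch.nn as nn

class SoftMaskedLinear(nn.Module):
    """Linear layer whose weights are gated by a learnable soft mask.

    Sketch of continuous sparsification: the mask is a sigmoid of
    trainable logits whose temperature (beta) is annealed during
    training, so the mask gradually approaches a binary sub-network
    instead of being pruned by weight magnitude.
    """
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.mask_logits = nn.Parameter(torch.zeros(out_features, in_features))
        self.beta = 1.0  # temperature, increased over the course of training

    def forward(self, x):
        mask = torch.sigmoid(self.beta * self.mask_logits)
        return nn.functional.linear(x, self.weight * mask, self.bias)

    def sparsity_penalty(self):
        # Pushes mask entries toward zero, removing parameters during training.
        return torch.sigmoid(self.beta * self.mask_logits).sum()
```

In training, one would add `sparsity_penalty()` times a small coefficient to the task loss and multiply `beta` by a constant greater than one every few epochs; at the end, the surviving mask defines the ticket.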

Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula

Title Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula
Authors Anonymous
Abstract Recently, there has been a growing interest in automatically exploring the neural network architecture design space with the goal of finding an architecture that improves performance (characterized as improved accuracy, speed of training, or resource requirements). However, our theoretical understanding of how model architecture affects performance or accuracy is limited. In this paper, we study the impact of model architecture on the speed of training in the context of gradient descent optimization. We model gradient descent as a first-order ODE and use the ODE’s coefficient matrix H to characterize the convergence rate. We introduce a simple analysis technique that enumerates H in terms of all possible “paths” in the network. We show that changes in model architecture parameters reflect as changes in the number of paths and the properties of each path, which jointly control the speed of convergence. We believe our analysis technique is useful in reasoning about more complex model architecture modifications.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1xw9n4Kwr
PDF https://openreview.net/pdf?id=B1xw9n4Kwr
PWC https://paperswithcode.com/paper/model-architecture-controls-gradient-descent
Repo
Framework
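To make the path-counting intuition concrete, the toy sketch below enumerates input-to-output paths in a fully connected network; the combinatorics (layer widths multiply) is elementary, and the quantity the paper actually tracks through the matrix H is a refinement of such counts weighted by per-path properties, so treat this only as an illustration.

```python
from math import prod

def num_paths(layer_widths):
    """Total number of distinct input-to-output paths in a fully
    connected network with the given layer widths.

    Each path picks one unit per layer, so counts multiply across
    layers; depth, width, and skip connections all change this count,
    which is the combinatorial object a path-based analysis reasons about.
    """
    return prod(layer_widths)

print(num_paths([784, 128, 64, 10]))  # deep and narrow
print(num_paths([784, 256, 10]))      # shallower but wider
```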

Reweighted Proximal Pruning for Large-Scale Language Representation

Title Reweighted Proximal Pruning for Large-Scale Language Representation
Authors Anonymous
Abstract Recently, pre-trained language representations such as BERT have flourished as the mainstay of the natural language understanding community. These pre-trained language representations can create state-of-the-art results on a wide range of downstream tasks. Along with continuous significant performance improvements, the size and complexity of these pre-trained neural models continue to increase rapidly. Is it possible to compress these large-scale language representation models? How will the pruned language representation affect the downstream multi-task transfer learning objectives? In this paper, we propose Reweighted Proximal Pruning (RPP), a new pruning method specifically designed for large-scale language representation models. Through experiments on SQuAD and the GLUE benchmark suite, we show that proximally pruned BERT keeps high accuracy for both the pre-training task and the downstream multiple fine-tuning tasks at high prune ratios. RPP provides a new perspective to help us analyze what large-scale language representations might learn. Additionally, RPP makes it possible to deploy a large state-of-the-art language representation model such as BERT on a series of distinct devices (e.g., online servers, mobile phones, and edge devices).
Tasks Transfer Learning
Published 2020-01-01
URL https://openreview.net/forum?id=r1gBOxSFwr
PDF https://openreview.net/pdf?id=r1gBOxSFwr
PWC https://paperswithcode.com/paper/reweighted-proximal-pruning-for-large-scale
Repo
Framework
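The core of a reweighted proximal update can be sketched as a gradient step followed by soft-thresholding whose threshold is scaled per weight by a reweighting factor. This is a generic illustration of reweighted-L1 proximal pruning, not the exact RPP update; the reweighting formula `1 / (|w| + eps)` and its schedule are assumptions on our part.

```python
import numpy as np

def reweighted_prox_step(w, grad, lr, lam, eps=1e-8):
    """One gradient step followed by a reweighted soft-thresholding
    (proximal) operator.

    Weights that are already small see a larger effective threshold,
    so they are driven exactly to zero, while large weights are barely
    shrunk -- the usual reweighted-L1 behaviour that drives pruning.
    """
    w = w - lr * grad                       # plain SGD step
    thresh = lr * lam / (np.abs(w) + eps)   # per-weight threshold
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

# Small entries are zeroed, large ones survive almost untouched.
w = np.array([0.001, -0.02, 0.5, -1.3])
print(reweighted_prox_step(w, grad=np.zeros_like(w), lr=0.1, lam=0.01))
```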

Distilled embedding: non-linear embedding factorization using knowledge distillation

Title Distilled embedding: non-linear embedding factorization using knowledge distillation
Authors Anonymous
Abstract Word-embeddings are a vital component of Natural Language Processing (NLP) systems and have been extensively researched. Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge-devices challenging due to memory limitations. Compressing embedding matrices without sacrificing model performance is essential for successful commercial edge deployment. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, we initialize the weights of our decomposition by learning to reconstruct the full word-embedding and then fine-tune on the downstream task employing knowledge distillation on the factorized embedding. We conduct extensive experimentation with various compression rates on machine translation, using different data-sets with a shared word-embedding matrix for both embedding and vocabulary projection matrices. We show that the proposed technique outperforms conventional low-rank matrix factorization, and other recently proposed word-embedding matrix compression methods.
Tasks Machine Translation, Word Embeddings
Published 2020-01-01
URL https://openreview.net/forum?id=Bkga90VKDB
PDF https://openreview.net/pdf?id=Bkga90VKDB
PWC https://paperswithcode.com/paper/distilled-embedding-non-linear-embedding-1
Repo
Framework
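A minimal sketch of the factorization described above: the full embedding matrix is replaced by a low-rank factorization with a non-linearity in between, first trained to reconstruct the original embeddings and then fine-tuned with distillation on the downstream task. The dimensions, the choice of ReLU, and the reconstruction objective below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Low-rank embedding with a non-linearity: E[i] is approximated
    by up(relu(low[i])), using far fewer parameters than a full table."""
    def __init__(self, vocab_size, embed_dim, rank):
        super().__init__()
        self.low = nn.Embedding(vocab_size, rank)  # vocab_size x rank
        self.up = nn.Linear(rank, embed_dim)       # rank -> embed_dim

    def forward(self, token_ids):
        return self.up(torch.relu(self.low(token_ids)))

def reconstruction_loss(factorized, full_embedding, token_ids):
    """Stage 1 (sketch): learn to reconstruct the teacher's full embeddings
    before fine-tuning with knowledge distillation on the downstream task."""
    target = full_embedding(token_ids).detach()
    return nn.functional.mse_loss(factorized(token_ids), target)
```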

Soft Token Matching for Interpretable Low-Resource Classification

Title Soft Token Matching for Interpretable Low-Resource Classification
Authors Anonymous
Abstract We propose a model to tackle classification tasks in the presence of very little training data. To this aim, we introduce a novel matching mechanism to focus on elements of the input by using vectors that represent semantically meaningful concepts for the task at hand. By leveraging highlighted portions of the training data, a simple, yet effective, error boosting technique guides the learning process. In practice, it increases the error associated with relevant parts of the input by a given factor. Results on text classification tasks confirm the benefits of the proposed approach in both balanced and unbalanced cases, making it of practical use when labeling new examples is expensive. In addition, the model is interpretable, as it allows for human inspection of the learned weights.
Tasks Text Classification
Published 2020-01-01
URL https://openreview.net/forum?id=SJlNnhVYDr
PDF https://openreview.net/pdf?id=SJlNnhVYDr
PWC https://paperswithcode.com/paper/soft-token-matching-for-interpretable-low
Repo
Framework
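The matching and error-boosting ideas can be sketched roughly as follows: each token embedding is scored against a small set of learned concept vectors, and the loss is amplified by a constant factor where human-highlighted evidence is involved. The pooling choice and the crude per-example form of the boosting (the paper applies it to relevant parts of the input) are our simplifications.

```python
import torch
import torch.nn as nn

class SoftTokenMatcher(nn.Module):
    """Scores tokens against learned concept vectors and pools the matches."""
    def __init__(self, embed_dim, num_concepts, num_classes):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(num_concepts, embed_dim))
        self.classifier = nn.Linear(num_concepts, num_classes)

    def forward(self, token_embs):            # (batch, seq_len, embed_dim)
        scores = torch.einsum('bse,ce->bsc', token_embs, self.concepts)
        pooled = scores.max(dim=1).values      # strongest match per concept
        return self.classifier(pooled)

def boosted_loss(logits, labels, has_highlight, boost=2.0):
    """Amplify the error on examples that carry highlighted evidence."""
    ce = nn.functional.cross_entropy(logits, labels, reduction='none')
    weights = 1.0 + (boost - 1.0) * has_highlight.float()
    return (weights * ce).mean()
```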

Data-Efficient Image Recognition with Contrastive Predictive Coding

Title Data-Efficient Image Recognition with Contrastive Predictive Coding
Authors Anonymous
Abstract Human observers can learn to recognize new categories of objects from a handful of examples, yet doing so with machine perception remains an open challenge. We hypothesize that data-efficient recognition is enabled by representations which make the variability in natural signals more predictable, as suggested by recent perceptual evidence. We therefore revisit and improve Contrastive Predictive Coding, a recently-proposed unsupervised learning framework, and arrive at a representation which enables generalization from small amounts of labeled data. When provided with only 1% of ImageNet labels (i.e. 13 per class), this model retains a strong classification performance, 73% Top-5 accuracy, outperforming supervised networks by 28% (a 65% relative improvement) and state-of-the-art semi-supervised methods by 14%. We also find this representation to serve as a useful substrate for object detection on the PASCAL-VOC 2007 dataset, approaching the performance of representations trained with a fully annotated ImageNet dataset.
Tasks Object Detection, Self-Supervised Image Classification, Semi-Supervised Image Classification
Published 2020-01-01
URL https://openreview.net/forum?id=rJerHlrYwH
PDF https://openreview.net/pdf?id=rJerHlrYwH
PWC https://paperswithcode.com/paper/data-efficient-image-recognition-with-1
Repo
Framework
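At the heart of Contrastive Predictive Coding is an InfoNCE-style contrastive loss: a context representation must identify the matching (positive) representation among in-batch negatives. The sketch below shows only the loss; the image-patch encoder and the context network that CPC wraps around it are omitted, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(context, targets, temperature=0.1):
    """InfoNCE loss for a batch of (context, target) pairs.

    context, targets: (batch, dim). Row i of targets is the positive
    for row i of context; every other row serves as a negative.
    """
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.t() / temperature          # (batch, batch)
    labels = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, labels)

# Usage sketch: context vectors predict representations of nearby patches.
loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```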

ADAPTING PRETRAINED LANGUAGE MODELS FOR LONG DOCUMENT CLASSIFICATION

Title ADAPTING PRETRAINED LANGUAGE MODELS FOR LONG DOCUMENT CLASSIFICATION
Authors Anonymous
Abstract Pretrained language models (LMs) have shown excellent results in achieving human-like performance on many language tasks. However, the most powerful LMs have one significant drawback: a fixed-size input. Because of this constraint, these LMs are unable to utilize the full input of long documents. In this paper, we introduce a new framework to handle documents of arbitrary lengths. We investigate adding a recurrent mechanism to extend the input size and utilizing attention to identify the most discriminating segment of the input. We perform extensive validating experiments on patent and arXiv datasets, both of which have long text. We demonstrate that our method significantly outperforms state-of-the-art results reported in recent literature.
Tasks Document Classification
Published 2020-01-01
URL https://openreview.net/forum?id=ryxW804FPH
PDF https://openreview.net/pdf?id=ryxW804FPH
PWC https://paperswithcode.com/paper/adapting-pretrained-language-models-for-long
Repo
Framework
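A rough sketch of the segment-then-attend recipe described above: split a long document into segments that fit the pretrained LM's input limit, encode each segment, and use learned attention weights to pool toward the most discriminating segment. The attention head and the assumption that the segment encoder returns one vector per segment are ours.

```python
import torch
import torch.nn as nn

class SegmentAttentionClassifier(nn.Module):
    """Pools per-segment encodings of a long document with attention."""
    def __init__(self, segment_encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = segment_encoder   # e.g. a pretrained LM returning one vector per segment
        self.attn = nn.Linear(hidden_dim, 1)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, segments):             # list of token-id tensors
        encoded = torch.stack([self.encoder(s) for s in segments])   # (S, H)
        weights = torch.softmax(self.attn(encoded).squeeze(-1), dim=0)
        pooled = (weights.unsqueeze(-1) * encoded).sum(dim=0)        # (H,)
        return self.head(pooled)
```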

MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis

Title MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis
Authors Anonymous
Abstract With a death toll of 5.4 million lives worldwide every year and a healthcare cost of more than 16 billion dollars in the USA alone, sepsis is one of the leading causes of hospital mortality and an increasing concern in the ageing western world. Recently, medical and technological advances have helped re-define the illness criteria of this disease, which is otherwise poorly understood by the medical community. Together with the rise of widely accessible Electronic Health Records, the advances in data mining and complex nonlinear algorithms are a promising avenue for the early detection of sepsis. This work contributes to the research effort in the field of automated sepsis detection with an open-access labelling of the medical MIMIC-III data set. Moreover, we propose MGP-AttTCN: a joint multitask Gaussian Process and attention-based deep learning model to predict the occurrence of sepsis early and in an interpretable manner. We show that our model outperforms the current state-of-the-art and present evidence that different labelling heuristics lead to discrepancies in task difficulty.
Tasks Interpretable Machine Learning
Published 2020-01-01
URL https://openreview.net/forum?id=rJgDb1SFwB
PDF https://openreview.net/pdf?id=rJgDb1SFwB
PWC https://paperswithcode.com/paper/mgp-atttcn-an-interpretable-machine-learning-1
Repo
Framework

Few-Shot Few-Shot Learning and the role of Spatial Attention

Title Few-Shot Few-Shot Learning and the role of Spatial Attention
Authors Anonymous
Abstract Few-shot learning is often motivated by the ability of humans to learn new tasks from few examples. However, standard few-shot classification benchmarks assume that the representation is learned on a limited amount of base class data, ignoring the amount of prior knowledge that a human may have accumulated before learning new tasks. At the same time, even if a powerful representation is available, it may happen in some domain that base class data are limited or non-existent. This motivates us to study a problem where the representation is obtained from a classifier pre-trained on a large-scale dataset of a different domain, assuming no access to its training process, while the base class data are limited to few examples per class and their role is to adapt the representation to the domain at hand rather than learn from scratch. We adapt the representation in two stages, namely on the few base class data if available and on the even fewer data of new tasks. In doing so, we obtain from the pre-trained classifier a spatial attention map that allows focusing on objects and suppressing background clutter. This is important in the new problem, because when base class data are few, the network cannot learn where to focus implicitly. We also show that a pre-trained network may be easily adapted to novel classes, without meta-learning.
Tasks Few-Shot Learning, Meta-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=H1l2mxHKvr
PDF https://openreview.net/pdf?id=H1l2mxHKvr
PWC https://paperswithcode.com/paper/few-shot-few-shot-learning-and-the-role-of
Repo
Framework
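The spatial-attention idea can be sketched by turning the pre-trained classifier's own weights into a saliency map over feature-map locations and reweighting features before pooling. This is a generic class-activation-style approximation under our own assumptions; the paper's exact construction may differ.

```python
import torch

def attend_and_pool(feature_map, classifier_weights):
    """Reweight spatial locations by a class-agnostic attention map.

    feature_map: (C, H, W) output of a pre-trained backbone.
    classifier_weights: (num_classes, C) from the pre-trained linear head.
    The attention map is the maximum class-activation response at each
    location, which suppresses background clutter before pooling.
    """
    cams = torch.einsum('kc,chw->khw', classifier_weights, feature_map)
    attn = torch.relu(cams.max(dim=0).values)
    attn = attn / (attn.sum() + 1e-8)
    # Attention-weighted pooling instead of plain global average pooling.
    return torch.einsum('chw,hw->c', feature_map, attn)

pooled = attend_and_pool(torch.randn(512, 7, 7), torch.randn(1000, 512))
```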

Meta-Learning with Network Pruning for Overfitting Reduction

Title Meta-Learning with Network Pruning for Overfitting Reduction
Authors Anonymous
Abstract Meta-learning has achieved great success in few-shot learning. However, existing meta-learning models have been shown to overfit on meta-training tasks when using deeper and wider convolutional neural networks. This means that we cannot improve the meta-generalization performance by merely deepening or widening the networks. To remedy this meta-overfitting, we propose in this paper a sparsity-constrained meta-learning approach to learn from meta-training tasks a subnetwork from which first-order optimization methods can quickly converge towards the optimal network in meta-testing tasks. Our theoretical analysis shows the benefit of sparsity for improving the generalization gap of the learned meta-initialization network. We have implemented our approach on top of the widely applied Reptile algorithm combined with varying network pruning routines including Dense-Sparse-Dense (DSD) and Iterative Hard Thresholding (IHT). Extensive experimental results on benchmark datasets with different over-parameterized deep networks demonstrate that our method can not only effectively ease meta-overfitting but also in many cases improve the meta-generalization performance when applied to few-shot classification tasks.
Tasks Few-Shot Learning, Meta-Learning, Network Pruning
Published 2020-01-01
URL https://openreview.net/forum?id=B1gcblSKwB
PDF https://openreview.net/pdf?id=B1gcblSKwB
PWC https://paperswithcode.com/paper/meta-learning-with-network-pruning-for
Repo
Framework
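A minimal sketch of sparsity-constrained Reptile: after each outer (meta) update, an iterative-hard-thresholding step keeps only the largest-magnitude entries of the meta-initialization. Task sampling, the inner loop, and the sparsity level are illustrative assumptions rather than the paper's exact configuration.

```python
import copy
import torch
import torch.nn.functional as F

def hard_threshold(params, keep_ratio=0.5):
    """IHT step: zero out all but the largest-magnitude fraction of each tensor."""
    for p in params:
        k = max(1, int(keep_ratio * p.numel()))
        thresh = p.abs().flatten().kthvalue(p.numel() - k + 1).values
        p.data.mul_((p.abs() >= thresh).float())

def reptile_step(model, task_batches, inner_opt_fn, meta_lr=0.1):
    """One Reptile outer step: adapt a copy on a sampled task, move the
    meta-parameters toward the adapted ones, then re-sparsify them."""
    fast = copy.deepcopy(model)
    opt = inner_opt_fn(fast.parameters())
    for x, y in task_batches:                 # a few inner-loop batches
        opt.zero_grad()
        F.cross_entropy(fast(x), y).backward()
        opt.step()
    with torch.no_grad():
        for p, q in zip(model.parameters(), fast.parameters()):
            p += meta_lr * (q - p)
    hard_threshold(model.parameters())        # keep the meta-initialization sparse
```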

Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts

Title Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts
Authors Anonymous
Abstract Informed and robust decision making in the face of uncertainty is critical for robots that perform physical tasks alongside people. We formulate this as a Bayesian Reinforcement Learning problem over latent Markov Decision Processes (MDPs). While Bayes-optimality is theoretically the gold standard, existing algorithms do not scale well to continuous state and action spaces. We propose a scalable solution that builds on the following insight: in the absence of uncertainty, each latent MDP is easier to solve. We split the challenge into two simpler components. First, we obtain an ensemble of clairvoyant experts and fuse their advice to compute a baseline policy. Second, we train a Bayesian residual policy to improve upon the ensemble’s recommendation and learn to reduce uncertainty. Our algorithm, Bayesian Residual Policy Optimization (BRPO), imports the scalability of policy gradient methods as well as the initialization from prior models. BRPO significantly improves the ensemble of experts and drastically outperforms existing adaptive RL methods.
Tasks Decision Making, Policy Gradient Methods
Published 2020-01-01
URL https://openreview.net/forum?id=B1grSREtDH
PDF https://openreview.net/pdf?id=B1grSREtDH
PWC https://paperswithcode.com/paper/bayesian-residual-policy-optimization
Repo
Framework
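The structure of BRPO can be sketched as: fuse an ensemble of clairvoyant experts (one per latent MDP, weighted by the current belief) into a baseline action, and let a learned residual policy correct it. The deterministic actions, the belief-weighted averaging, and the function signatures below are simplified assumptions meant only to show the composition.

```python
import numpy as np

def fused_expert_action(experts, belief, obs):
    """Belief-weighted average of per-latent-MDP expert actions."""
    actions = np.stack([expert(obs) for expert in experts])   # (K, act_dim)
    return belief @ actions                                    # (act_dim,)

def brpo_action(experts, belief, obs, residual_policy):
    """Ensemble baseline plus a learned residual correction."""
    baseline = fused_expert_action(experts, belief, obs)
    residual = residual_policy(obs, belief, baseline)          # trained with policy gradients
    return baseline + residual

# Two dummy experts, a belief over the two latent MDPs, and a zero residual.
experts = [lambda o: np.array([1.0, 0.0]), lambda o: np.array([0.0, 1.0])]
belief = np.array([0.7, 0.3])
print(brpo_action(experts, belief, obs=None,
                  residual_policy=lambda o, b, a: np.zeros_like(a)))
```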

SpectroBank: A filter-bank convolutional layer for CNN-based audio applications

Title SpectroBank: A filter-bank convolutional layer for CNN-based audio applications
Authors Anonymous
Abstract We propose and investigate the design of a new convolutional layer where kernels are parameterized functions. This layer aims at being the input layer of convolutional neural networks for audio applications. The kernels are defined as functions having a band-pass filter shape, with a limited number of trainable parameters. We show that networks having such an input layer can achieve state-of-the-art accuracy on several audio classification tasks. This approach, while reducing the number of weights to be trained along with network training time, enables larger kernel sizes, an advantage for audio applications. Furthermore, the learned filters bring additional interpretability and a better understanding of the data properties exploited by the network.
Tasks Audio Classification
Published 2020-01-01
URL https://openreview.net/forum?id=HyewT1BKvr
PDF https://openreview.net/pdf?id=HyewT1BKvr
PWC https://paperswithcode.com/paper/spectrobank-a-filter-bank-convolutional-layer
Repo
Framework
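One common way to realize such a layer (as in SincNet-style front ends) is to generate each kernel as a windowed difference of sinc functions, with only two trainable parameters per filter: a low cutoff and a bandwidth. The sketch below follows that recipe; the paper's exact parameterization of the band-pass shape may differ.

```python
import torch
import torch.nn as nn

class BandPassConv1d(nn.Module):
    """1D conv layer whose kernels are band-pass filters, each defined by
    two trainable parameters (low cutoff and bandwidth in Hz)."""
    def __init__(self, num_filters, kernel_size, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, num_filters))
        self.band_hz = nn.Parameter(torch.full((num_filters,), 100.0))

    def forward(self, x):                       # x: (batch, 1, time)
        n = torch.arange(self.kernel_size) - (self.kernel_size - 1) / 2
        window = torch.hamming_window(self.kernel_size)
        low = torch.abs(self.low_hz) / self.sample_rate
        high = low + torch.abs(self.band_hz) / self.sample_rate
        kernels = []
        for lo, hi in zip(low, high):
            # Difference of two windowed sinc low-pass filters = band-pass filter.
            band = torch.special.sinc(2 * hi * n) * 2 * hi - torch.special.sinc(2 * lo * n) * 2 * lo
            kernels.append(band * window)
        kernels = torch.stack(kernels).unsqueeze(1)   # (num_filters, 1, kernel_size)
        return nn.functional.conv1d(x, kernels, padding=self.kernel_size // 2)
```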

Variational Autoencoders for Opponent Modeling in Multi-Agent Systems

Title Variational Autoencoders for Opponent Modeling in Multi-Agent Systems
Authors Anonymous
Abstract Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment. In this work, we are interested in controlling one agent in a multi-agent system and successfully learning to interact with the other agents, which have fixed policies. Modeling the behavior of other agents (opponents) is essential for understanding the interactions of the agents in the system. By taking advantage of recent advances in unsupervised learning, we propose modeling opponents using variational autoencoders. Additionally, many existing methods in the literature assume that the opponent models have access to the opponent’s observations and actions during both training and execution. To eliminate this assumption, we propose a modification that attempts to identify the underlying opponent model using only local information of our agent, such as its observations, actions, and rewards. The experiments indicate that our opponent modeling methods achieve equal or greater episodic returns in reinforcement learning tasks against another modeling method.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BylNoaVYPS
PDF https://openreview.net/pdf?id=BylNoaVYPS
PWC https://paperswithcode.com/paper/variational-autoencoders-for-opponent
Repo
Framework
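A minimal sketch of the second variant described above: a VAE whose encoder sees only the controlled agent's local trajectory features (observations, actions, rewards) and whose latent code acts as an opponent embedding that can condition the policy. The flat trajectory feature vector, MLP sizes, and use of the latent mean at execution time are our simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpponentVAE(nn.Module):
    """Infers a latent opponent embedding from the agent's local trajectory."""
    def __init__(self, traj_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(traj_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, traj_dim))

    def forward(self, traj):                    # (batch, traj_dim)
        h = self.encoder(traj)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(z)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return recon, kl, mu    # mu can condition the RL policy at execution time

model = OpponentVAE(traj_dim=64, latent_dim=8)
traj = torch.randn(4, 64)
recon, kl, z = model(traj)
loss = F.mse_loss(recon, traj) + kl.mean()
```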

Multi-source Multi-view Transfer Learning in Neural Topic Modeling with Pretrained Topic and Word Embeddings

Title Multi-source Multi-view Transfer Learning in Neural Topic Modeling with Pretrained Topic and Word Embeddings
Authors Anonymous
Abstract Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address the data sparsity problem in short texts or small collections of documents. However, no prior work has employed (pretrained latent) topics in a transfer learning paradigm. In this paper, we propose a framework to perform transfer learning in neural topic modeling using (1) pretrained (latent) topics obtained from a large source corpus, and (2) pretrained word and topic embeddings jointly (i.e., multi-view), in order to improve topic quality and better deal with polysemy and data sparsity issues in a target corpus. In doing so, we first accumulate topics and word representations from one or many source corpora to build respective pools of pretrained topics (i.e., TopicPool) and word embeddings (i.e., WordPool). Then, we identify one or multiple relevant source domain(s) and take advantage of the corresponding topics and word features via the respective pools to guide meaningful learning in the sparse target domain. We quantify the quality of topic and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from the news and medical domains. We demonstrate state-of-the-art results on topic modeling with the proposed transfer learning approaches.
Tasks Information Retrieval, Transfer Learning, Word Embeddings
Published 2020-01-01
URL https://openreview.net/forum?id=ByxODxHYwB
PDF https://openreview.net/pdf?id=ByxODxHYwB
PWC https://paperswithcode.com/paper/multi-source-multi-view-transfer-learning-in
Repo
Framework
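One way to read the guidance from a TopicPool is as a regularizer that pulls each target-domain topic toward its most similar pretrained source topic. The sketch below shows only that regularizer; how it is weighted and combined with the neural topic model's own objective, and the analogous use of the WordPool, are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def topic_pool_guidance(target_topics, topic_pool):
    """Pull each target topic toward its nearest pretrained source topic.

    target_topics: (K_target, vocab) topic-word weights being learned.
    topic_pool:    (K_source, vocab) pretrained topics accumulated from source corpora.
    """
    t = F.normalize(target_topics, dim=-1)
    s = F.normalize(topic_pool, dim=-1)
    nearest = (t @ s.t()).argmax(dim=1)          # closest source topic per target topic
    return F.mse_loss(target_topics, topic_pool[nearest].detach())

penalty = topic_pool_guidance(torch.rand(20, 5000), torch.rand(50, 5000))
```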

Knowledge Transfer via Student-Teacher Collaboration

Title Knowledge Transfer via Student-Teacher Collaboration
Authors Anonymous
Abstract Despite their flourishing development across various fields, deep neural networks still face the burden of high computational and storage costs. One way to compress these heavy models is knowledge transfer (KT), in which a light student network is trained by absorbing the knowledge from a powerful teacher network. In this paper, we propose a novel knowledge transfer method which employs a Student-Teacher Collaboration (STC) network during the knowledge transfer process. This is done by connecting the front part of the student network to the back part of the teacher network to form the STC network. The back part of the teacher network takes the intermediate representation from the front part of the student network as input to make a prediction. The difference between the prediction from the collaboration network and the output tensor from the teacher network is incorporated into the loss during training. Through back-propagation, the teacher network provides guidance to the student network in the form of gradient signals. In this way, our method takes advantage of the knowledge from the entire teacher network, which instructs the student network throughout the learning process. Extensive experiments show that our STC method outperforms other KT methods that follow the conventional strategy.
Tasks Transfer Learning
Published 2020-01-01
URL https://openreview.net/forum?id=H1lVvgHKDr
PDF https://openreview.net/pdf?id=H1lVvgHKDr
PWC https://paperswithcode.com/paper/knowledge-transfer-via-student-teacher
Repo
Framework
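The collaboration network described above can be sketched as follows: the student's front part produces an intermediate representation, the teacher's back part turns it into a prediction, and the discrepancy between that prediction and the teacher's own output is added to the training loss so gradients flow back into the student. The optional adapter for matching feature shapes and the use of an MSE discrepancy are our assumptions.

```python
import torch
import torch.nn as nn

def stc_loss(student_front, teacher_front, teacher_back, x, adapter=None):
    """Student-Teacher Collaboration loss (sketch).

    The student's front part feeds the teacher's (frozen) back part, and the
    resulting prediction is compared against the full teacher's own output.
    """
    with torch.no_grad():
        teacher_out = teacher_back(teacher_front(x))   # teacher's own prediction
    student_feat = student_front(x)
    if adapter is not None:                            # align feature shapes if needed
        student_feat = adapter(student_feat)
    collab_out = teacher_back(student_feat)            # gradients reach the student front
    return nn.functional.mse_loss(collab_out, teacher_out)
```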