Paper Group NANR 133
Winning the Lottery with Continuous Sparsification. Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula. Reweighted Proximal Pruning for Large-Scale Language Representation. Distilled embedding: non-linear embedding factorization using knowledge distillation. Soft Token Matching for Interpretable Low-Resource C …
Winning the Lottery with Continuous Sparsification
Title | Winning the Lottery with Continuous Sparsification |
Authors | Anonymous |
Abstract | The Lottery Ticket Hypothesis from Frankle & Carbin (2019) conjectures that, for typically-sized neural networks, it is possible to find small sub-networks which train faster and yield performance superior to that of their original counterparts. The proposed algorithm to search for such sub-networks (winning tickets), Iterative Magnitude Pruning (IMP), consistently finds sub-networks with 90-95% fewer parameters which indeed train faster and better than the overparameterized models they were extracted from, creating potential applications to problems such as transfer learning. In this paper, we propose a new algorithm to search for winning tickets, Continuous Sparsification, which continuously removes parameters from a network during training and learns the sub-network's structure with gradient-based methods instead of relying on pruning strategies. We show empirically that our method is capable of finding tickets that outperform the ones learned by Iterative Magnitude Pruning, while providing up to 5 times faster search when measured in number of training epochs. |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJe4oxHYPB |
https://openreview.net/pdf?id=BJe4oxHYPB | |
PWC | https://paperswithcode.com/paper/winning-the-lottery-with-continuous |
Repo | |
Framework | |
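The abstract above does not spell out the gating mechanism, so the following is only a minimal sketch of what learning a sub-network's structure with gradient-based soft masks might look like in PyTorch: each weight is gated by a sigmoid of a trainable mask parameter, an L1 penalty pushes gates toward zero, and a temperature `beta` is annealed so the gates become increasingly binary. The class name `SoftMaskedLinear`, the temperature schedule, and the penalty weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMaskedLinear(nn.Module):
    """Linear layer whose weights are gated by a differentiable soft mask."""
    def __init__(self, in_features, out_features, mask_init=0.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Real-valued mask parameters; sigmoid(beta * m) approaches {0, 1} as beta grows.
        self.mask_param = nn.Parameter(torch.full((out_features, in_features), mask_init))

    def forward(self, x, beta=1.0):
        gate = torch.sigmoid(beta * self.mask_param)
        return F.linear(x, self.weight * gate, self.bias)

    def sparsity_penalty(self, beta=1.0):
        # L1 penalty on the gates pushes them toward zero (i.e., removes parameters).
        return torch.sigmoid(beta * self.mask_param).sum()

# Training loop sketch: anneal beta upward so the gates become increasingly binary.
layer = SoftMaskedLinear(784, 10)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
for epoch in range(5):
    beta = 1.0 + epoch            # temperature schedule (assumption)
    logits = layer(x, beta=beta)
    loss = F.cross_entropy(logits, y) + 1e-4 * layer.sparsity_penalty(beta)
    opt.zero_grad(); loss.backward(); opt.step()

# Final ticket: threshold the gates into a binary mask.
ticket_mask = (torch.sigmoid(beta * layer.mask_param) > 0.5).float()
```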
Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula
Title | Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula |
Authors | Anonymous |
Abstract | Recently, there has been a growing interest in automatically exploring the neural network architecture design space with the goal of finding an architecture that improves performance (characterized as improved accuracy, speed of training, or resource requirements). However, our theoretical understanding of how model architecture affects performance or accuracy is limited. In this paper, we study the impact of model architecture on the speed of training in the context of gradient descent optimization. We model gradient descent as a first-order ODE and use the ODE's coefficient matrix H to characterize the convergence rate. We introduce a simple analysis technique that enumerates H in terms of all possible "paths" in the network. We show that changes in model architecture parameters are reflected as changes in the number of paths and the properties of each path, which jointly control the speed of convergence. We believe our analysis technique is useful in reasoning about more complex model architecture modifications. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1xw9n4Kwr |
https://openreview.net/pdf?id=B1xw9n4Kwr | |
PWC | https://paperswithcode.com/paper/model-architecture-controls-gradient-descent |
Repo | |
Framework | |
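As a reference for the setup the abstract describes, here is the standard way gradient descent is written as a first-order ODE and how the coefficient matrix H controls the convergence rate; the paper's combinatorial enumeration of H over network paths is not reproduced here.

```latex
% Gradient flow (continuous-time gradient descent):
\dot{w}(t) = -\nabla_w L\big(w(t)\big)
% Linearising around a minimiser w^* with error e(t) = w(t) - w^*:
\dot{e}(t) \approx -H\,e(t), \qquad H = \nabla_w^2 L(w^*)
% so e(t) \approx e^{-Ht}\,e(0), and the speed of convergence is governed by
% the eigenvalues of the coefficient matrix H.
```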
Reweighted Proximal Pruning for Large-Scale Language Representation
Title | Reweighted Proximal Pruning for Large-Scale Language Representation |
Authors | Anonymous |
Abstract | Recently, pre-trained language representations such as BERT have flourished as the mainstay of the natural language understanding community. These pre-trained language representations can create state-of-the-art results on a wide range of downstream tasks. Along with continuous, significant performance improvements, the size and complexity of these pre-trained neural models continue to increase rapidly. Is it possible to compress these large-scale language representation models? How will the pruned language representation affect the downstream multi-task transfer learning objectives? In this paper, we propose Reweighted Proximal Pruning (RPP), a new pruning method specifically designed for large-scale language representation models. Through experiments on SQuAD and the GLUE benchmark suite, we show that proximally pruned BERT keeps high accuracy for both the pre-training task and the downstream multiple fine-tuning tasks at high prune ratios. RPP provides a new perspective to help us analyze what large-scale language representations might learn. Additionally, RPP makes it possible to deploy a large state-of-the-art language representation model such as BERT on a series of distinct devices (e.g., online servers, mobile phones, and edge devices). |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1gBOxSFwr |
https://openreview.net/pdf?id=r1gBOxSFwr | |
PWC | https://paperswithcode.com/paper/reweighted-proximal-pruning-for-large-scale |
Repo | |
Framework | |
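The abstract does not give the update rule, but reweighted L1 regularization combined with a proximal step is a standard recipe, so the sketch below shows one plausible reading: a gradient step on the task loss followed by soft-thresholding whose per-weight threshold is inversely proportional to the current weight magnitude. The function name and hyperparameters are assumptions, not the authors' RPP algorithm.

```python
import torch

def reweighted_prox_step(weight, grad, lr=1e-3, lam=1e-4, eps=1e-6):
    """One proximal-gradient step with a reweighted L1 penalty.

    A generic sketch: the reweighting 1 / (|w| + eps) penalises small weights
    more strongly, and the proximal operator of the (weighted) L1 norm is
    soft-thresholding, which drives those weights exactly to zero.
    """
    # Gradient step on the task loss.
    w = weight - lr * grad
    # Per-weight penalty strength (reweighted L1).
    threshold = lr * lam / (weight.abs() + eps)
    # Soft-thresholding = prox of the weighted L1 norm.
    return torch.sign(w) * torch.clamp(w.abs() - threshold, min=0.0)

# Usage: after computing gradients for a BERT-like parameter tensor `p`:
p = torch.randn(768, 768, requires_grad=True)
loss = (p ** 2).sum()          # stand-in for a fine-tuning loss
loss.backward()
with torch.no_grad():
    p.copy_(reweighted_prox_step(p, p.grad))
```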
Distilled embedding: non-linear embedding factorization using knowledge distillation
Title | Distilled embedding: non-linear embedding factorization using knowledge distillation |
Authors | Anonymous |
Abstract | Word embeddings are a vital component of Natural Language Processing (NLP) systems and have been extensively researched. Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge devices challenging due to memory limitations. Compressing embedding matrices without sacrificing model performance is essential for successful commercial edge deployment. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, we initialize the weights of our decomposition by learning to reconstruct the full word embedding, and then fine-tune on the downstream task employing knowledge distillation on the factorized embedding. We conduct extensive experiments with various compression rates on machine translation, using different datasets with a shared word-embedding matrix for both the embedding and vocabulary projection matrices. We show that the proposed technique outperforms conventional low-rank matrix factorization and other recently proposed word-embedding matrix compression methods. |
Tasks | Machine Translation, Word Embeddings |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bkga90VKDB |
https://openreview.net/pdf?id=Bkga90VKDB | |
PWC | https://paperswithcode.com/paper/distilled-embedding-non-linear-embedding-1 |
Repo | |
Framework | |
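A minimal sketch, under assumed shapes and names, of the two ideas the abstract states: a low-rank embedding factorization with a non-linearity in between, and an initialization stage that learns to reconstruct the full pretrained embedding matrix. The downstream distillation stage is only indicated in a comment.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Low-rank embedding with a non-linearity: lookup -> r-dim -> ReLU -> d-dim."""
    def __init__(self, vocab_size, embed_dim, rank):
        super().__init__()
        self.low_rank = nn.Embedding(vocab_size, rank)
        self.up_proj = nn.Linear(rank, embed_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, token_ids):
        return self.up_proj(self.act(self.low_rank(token_ids)))

# Stage 1 (sketch): initialise the factors by reconstructing a full embedding matrix.
vocab, dim, rank = 10000, 512, 64
teacher_embedding = torch.randn(vocab, dim)          # pretrained full matrix (stand-in)
student = FactorizedEmbedding(vocab, dim, rank)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
ids = torch.arange(vocab)
for _ in range(100):
    recon = student(ids)
    loss = nn.functional.mse_loss(recon, teacher_embedding)
    opt.zero_grad(); loss.backward(); opt.step()
# Stage 2 would fine-tune `student` inside the translation model with a
# distillation loss against the teacher's outputs.
```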
Soft Token Matching for Interpretable Low-Resource Classification
Title | Soft Token Matching for Interpretable Low-Resource Classification |
Authors | Anonymous |
Abstract | We propose a model to tackle classification tasks in the presence of very little training data. To this aim, we introduce a novel matching mechanism to focus on elements of the input by using vectors that represent semantically meaningful concepts for the task at hand. By leveraging highlighted portions of the training data, a simple yet effective error-boosting technique guides the learning process. In practice, it increases the error associated with relevant parts of the input by a given factor. Results on text classification tasks confirm the benefits of the proposed approach in both balanced and unbalanced cases, making it of practical use when labeling new examples is expensive. In addition, the model is interpretable, as it allows for human inspection of the learned weights. |
Tasks | Text Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJlNnhVYDr |
https://openreview.net/pdf?id=SJlNnhVYDr | |
PWC | https://paperswithcode.com/paper/soft-token-matching-for-interpretable-low |
Repo | |
Framework | |
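The abstract only outlines the matching and error-boosting ideas, so the snippet below is a speculative sketch of one possible reading: tokens are softly matched (cosine similarity) against learnable concept vectors, and examples whose highlighted spans are poorly matched have their loss scaled up by a boost factor. All names, shapes, and the coverage heuristic are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMatcher(nn.Module):
    """Tokens are softly matched against learnable concept vectors; the
    resulting similarity profile feeds a linear classifier."""
    def __init__(self, embed_dim, n_concepts, n_classes):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(n_concepts, embed_dim))
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, token_embeddings):            # (batch, seq, dim)
        sim = F.cosine_similarity(
            token_embeddings.unsqueeze(2),           # (batch, seq, 1, dim)
            self.concepts.view(1, 1, -1, token_embeddings.size(-1)),
            dim=-1)                                  # (batch, seq, n_concepts)
        features = sim.max(dim=1).values             # best match per concept
        return self.classifier(features), sim

def boosted_loss(logits, labels, sim, highlight_mask, boost=2.0):
    """Error boosting (sketch): scale the loss of examples whose highlighted
    tokens are poorly covered by the concept matching."""
    base = F.cross_entropy(logits, labels, reduction="none")
    coverage = (sim.max(-1).values * highlight_mask).sum(1) / highlight_mask.sum(1).clamp(min=1)
    weight = torch.where(coverage < 0.5, torch.full_like(base, boost), torch.ones_like(base))
    return (weight * base).mean()

model = SoftMatcher(embed_dim=300, n_concepts=5, n_classes=2)
logits, sim = model(torch.randn(4, 20, 300))
```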
Data-Efficient Image Recognition with Contrastive Predictive Coding
Title | Data-Efficient Image Recognition with Contrastive Predictive Coding |
Authors | Anonymous |
Abstract | Human observers can learn to recognize new categories of objects from a handful of examples, yet doing so with machine perception remains an open challenge. We hypothesize that data-efficient recognition is enabled by representations which make the variability in natural signals more predictable, as suggested by recent perceptual evidence. We therefore revisit and improve Contrastive Predictive Coding, a recently-proposed unsupervised learning framework, and arrive at a representation which enables generalization from small amounts of labeled data. When provided with only 1% of ImageNet labels (i.e. 13 per class), this model retains a strong classification performance, 73% Top-5 accuracy, outperforming supervised networks by 28% (a 65% relative improvement) and state-of-the-art semi-supervised methods by 14%. We also find this representation to serve as a useful substrate for object detection on the PASCAL-VOC 2007 dataset, approaching the performance of representations trained with a fully annotated ImageNet dataset. |
Tasks | Object Detection, Self-Supervised Image Classification, Semi-Supervised Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJerHlrYwH |
https://openreview.net/pdf?id=rJerHlrYwH | |
PWC | https://paperswithcode.com/paper/data-efficient-image-recognition-with-1 |
Repo | |
Framework | |
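Contrastive Predictive Coding is trained with the InfoNCE objective; the snippet below is a generic sketch of that loss (not this paper's exact patch-prediction setup): each predicted representation must identify its matching target among the other targets in the batch.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context, targets, temperature=0.1):
    """InfoNCE: each context vector must pick out its own target ("positive")
    among all targets in the batch ("negatives").

    context, targets: (batch, dim) predicted and actual patch representations.
    """
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.t() / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, labels)

# Usage: `context` would come from an autoregressive summary of patches seen so
# far, `targets` from an encoder of the patches to be predicted (not shown).
loss = info_nce_loss(torch.randn(64, 256), torch.randn(64, 256))
```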
ADAPTING PRETRAINED LANGUAGE MODELS FOR LONG DOCUMENT CLASSIFICATION
Title | ADAPTING PRETRAINED LANGUAGE MODELS FOR LONG DOCUMENT CLASSIFICATION |
Authors | Anonymous |
Abstract | Pretrained language models (LMs) have shown excellent results in achieving human-like performance on many language tasks. However, the most powerful LMs have one significant drawback: a fixed-size input. With this constraint, these LMs are unable to utilize the full input of long documents. In this paper, we introduce a new framework to handle documents of arbitrary length. We investigate adding a recurrent mechanism to extend the input size and utilizing attention to identify the most discriminating segment of the input. We perform extensive validating experiments on patent and arXiv datasets, both of which contain long text. We demonstrate that our method significantly outperforms state-of-the-art results reported in recent literature. |
Tasks | Document Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxW804FPH |
https://openreview.net/pdf?id=ryxW804FPH | |
PWC | https://paperswithcode.com/paper/adapting-pretrained-language-models-for-long |
Repo | |
Framework | |
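A rough sketch of the segment-then-attend recipe the abstract hints at, under assumed shapes: the document is split into fixed-size segments, each segment is encoded by a pretrained encoder, and attention over segment representations selects the most discriminating parts. The `ToyEncoder` stands in for a real pretrained LM, and the recurrent variant mentioned in the abstract is not shown.

```python
import torch
import torch.nn as nn

class SegmentAttentionClassifier(nn.Module):
    """Encode fixed-size segments with a pretrained LM, then attend over the
    segment representations to pick out the most discriminative ones."""
    def __init__(self, encoder, hidden_dim, n_classes):
        super().__init__()
        self.encoder = encoder                      # a BERT-like module (assumption)
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, segment_ids):                 # (batch, n_segments, seg_len)
        b, s, l = segment_ids.shape
        # Encode each segment independently; take its pooled representation.
        seg_repr = self.encoder(segment_ids.view(b * s, l)).view(b, s, -1)
        # Attention over segments identifies the most discriminating ones.
        weights = torch.softmax(self.attn(seg_repr).squeeze(-1), dim=1)
        doc_repr = (weights.unsqueeze(-1) * seg_repr).sum(dim=1)
        return self.classifier(doc_repr)

class ToyEncoder(nn.Module):
    """Stand-in for a pretrained LM: mean-pooled token embeddings."""
    def __init__(self, vocab=30522, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, ids):
        return self.emb(ids).mean(dim=1)

model = SegmentAttentionClassifier(ToyEncoder(), hidden_dim=128, n_classes=2)
logits = model(torch.randint(0, 30522, (4, 8, 64)))   # 4 docs, 8 segments of 64 tokens
```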
MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis
Title | MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis |
Authors | Anonymous |
Abstract | Claiming 5.4 million lives worldwide every year and incurring a healthcare cost of more than 16 billion dollars in the USA alone, sepsis is one of the leading causes of hospital mortality and an increasing concern in the ageing western world. Recently, medical and technological advances have helped re-define the illness criteria of this disease, which is otherwise poorly understood by the medical community. Together with the rise of widely accessible Electronic Health Records, the advances in data mining and complex nonlinear algorithms are a promising avenue for the early detection of sepsis. This work contributes to the research effort in the field of automated sepsis detection with an open-access labelling of the MIMIC-III medical data set. Moreover, we propose MGP-AttTCN: a joint multitask Gaussian Process and attention-based deep learning model to predict the occurrence of sepsis early and in an interpretable manner. We show that our model outperforms the current state of the art and present evidence that different labelling heuristics lead to discrepancies in task difficulty. |
Tasks | Interpretable Machine Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJgDb1SFwB |
https://openreview.net/pdf?id=rJgDb1SFwB | |
PWC | https://paperswithcode.com/paper/mgp-atttcn-an-interpretable-machine-learning-1 |
Repo | |
Framework | |
Few-Shot Few-Shot Learning and the role of Spatial Attention
Title | Few-Shot Few-Shot Learning and the role of Spatial Attention |
Authors | Anonymous |
Abstract | Few-shot learning is often motivated by the ability of humans to learn new tasks from few examples. However, standard few-shot classification benchmarks assume that the representation is learned on a limited amount of base class data, ignoring the amount of prior knowledge that a human may have accumulated before learning new tasks. At the same time, even if a powerful representation is available, base class data may be limited or non-existent in some domains. This motivates us to study a problem where the representation is obtained from a classifier pre-trained on a large-scale dataset of a different domain, assuming no access to its training process, while the base class data are limited to few examples per class and their role is to adapt the representation to the domain at hand rather than learn from scratch. We adapt the representation in two stages, namely on the few base class data if available and on the even fewer data of new tasks. In doing so, we obtain from the pre-trained classifier a spatial attention map that allows focusing on objects and suppressing background clutter. This is important in the new problem, because when base class data are few, the network cannot learn where to focus implicitly. We also show that a pre-trained network may be easily adapted to novel classes, without meta-learning. |
Tasks | Few-Shot Learning, Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1l2mxHKvr |
https://openreview.net/pdf?id=H1l2mxHKvr | |
PWC | https://paperswithcode.com/paper/few-shot-few-shot-learning-and-the-role-of |
Repo | |
Framework | |
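One simple way to derive a spatial attention map from a pre-trained classifier, offered here as an illustrative sketch rather than the paper's construction: use the backbone's final convolutional features, take the per-location activation energy as attention, and pool features with it to suppress background clutter. The torchvision ResNet-18 backbone is an assumption.

```python
import torch
import torchvision

# Backbone from a different domain; in practice, load large-scale pretrained weights.
backbone = torchvision.models.resnet18()
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

def attended_embedding(images):
    """Weight spatial locations by an attention map derived from the backbone's
    features, suppressing background clutter before pooling."""
    feats = feature_extractor(images)                 # (batch, C, H, W)
    attn = feats.pow(2).sum(dim=1, keepdim=True)      # activation energy per location
    attn = attn / attn.flatten(2).sum(-1).view(-1, 1, 1, 1).clamp(min=1e-8)
    return (feats * attn).flatten(2).sum(-1)          # (batch, C) attended embedding

with torch.no_grad():
    emb = attended_embedding(torch.randn(2, 3, 224, 224))   # (2, 512)
```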
Meta-Learning with Network Pruning for Overfitting Reduction
Title | Meta-Learning with Network Pruning for Overfitting Reduction |
Authors | Anonymous |
Abstract | Meta-learning has achieved great success in few-shot learning. However, existing meta-learning models have been shown to overfit on meta-training tasks when using deeper and wider convolutional neural networks. This means that we cannot improve the meta-generalization performance by merely deepening or widening the networks. To remedy this meta-overfitting, we propose in this paper a sparsity-constrained meta-learning approach that learns from meta-training tasks a sub-network from which first-order optimization methods can quickly converge towards the optimal network on meta-testing tasks. Our theoretical analysis shows the benefit of sparsity for improving the generalization gap of the learned meta-initialization network. We have implemented our approach on top of the widely applied Reptile algorithm, combined with various network pruning routines including Dense-Sparse-Dense (DSD) and Iterative Hard Thresholding (IHT). Extensive experimental results on benchmark datasets with different over-parameterized deep networks demonstrate that our method can not only effectively ease meta-overfitting but also, in many cases, improve the meta-generalization performance when applied to few-shot classification tasks. |
Tasks | Few-Shot Learning, Meta-Learning, Network Pruning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1gcblSKwB |
https://openreview.net/pdf?id=B1gcblSKwB | |
PWC | https://paperswithcode.com/paper/meta-learning-with-network-pruning-for |
Repo | |
Framework | |
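A compact sketch of the two ingredients named in the abstract, Reptile and Iterative Hard Thresholding, combined in the most direct way: run a Reptile outer step, and periodically hard-threshold the meta-initialization so only a sub-network survives. The pruning schedule and hyperparameters are assumptions, and the DSD variant is omitted.

```python
import copy
import torch

def iht_prune(params, keep_ratio=0.2):
    """Iterative Hard Thresholding: keep only the largest-magnitude entries."""
    for p in params:
        k = max(1, int(keep_ratio * p.numel()))
        threshold = p.abs().flatten().kthvalue(p.numel() - k + 1).values
        p.data.mul_((p.abs() >= threshold).float())

def reptile_step(meta_model, task_loader, inner_lr=0.01, meta_lr=0.1, inner_steps=5):
    """One Reptile outer step: adapt a copy on the task, then move the
    meta-parameters toward the adapted ones."""
    adapted = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = next(task_loader)
        loss = torch.nn.functional.cross_entropy(adapted(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        for p_meta, p_task in zip(meta_model.parameters(), adapted.parameters()):
            p_meta.add_(meta_lr * (p_task - p_meta))

# Sparsity-constrained meta-training (sketch): periodically hard-threshold the
# meta-initialisation so that only a sub-network is carried forward.
# for it, task in enumerate(task_stream):
#     reptile_step(meta_model, task)
#     if it % 100 == 0:
#         iht_prune(meta_model.parameters())
```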
Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts
Title | Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts |
Authors | Anonymous |
Abstract | Informed and robust decision making in the face of uncertainty is critical for robots that perform physical tasks alongside people. We formulate this as a Bayesian Reinforcement Learning problem over latent Markov Decision Processes (MDPs). While Bayes-optimality is theoretically the gold standard, existing algorithms do not scale well to continuous state and action spaces. We propose a scalable solution that builds on the following insight: in the absence of uncertainty, each latent MDP is easier to solve. We split the challenge into two simpler components. First, we obtain an ensemble of clairvoyant experts and fuse their advice to compute a baseline policy. Second, we train a Bayesian residual policy to improve upon the ensemble’s recommendation and learn to reduce uncertainty. Our algorithm, Bayesian Residual Policy Optimization (BRPO), imports the scalability of policy gradient methods as well as the initialization from prior models. BRPO significantly improves the ensemble of experts and drastically outperforms existing adaptive RL methods. |
Tasks | Decision Making, Policy Gradient Methods |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1grSREtDH |
https://openreview.net/pdf?id=B1grSREtDH | |
PWC | https://paperswithcode.com/paper/bayesian-residual-policy-optimization |
Repo | |
Framework | |
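A bare-bones sketch of the baseline-plus-residual structure the abstract describes: a belief-weighted ensemble of per-MDP experts produces a recommendation, and a small learned residual corrects it. In BRPO the residual network would be trained with a policy-gradient method; here only the forward pass is shown, and all shapes and expert callables are assumptions.

```python
import torch
import torch.nn as nn

class BayesianResidualPolicy(nn.Module):
    """Baseline action from a belief-weighted ensemble of per-MDP experts,
    plus a learned residual correction (a generic sketch of the recipe)."""
    def __init__(self, experts, obs_dim, act_dim):
        super().__init__()
        self.experts = experts                      # list of callables: obs -> action
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs, belief):                 # belief: (n_experts,) posterior over latent MDPs
        expert_actions = torch.stack([e(obs) for e in self.experts])   # (n_experts, act_dim)
        baseline = (belief.unsqueeze(-1) * expert_actions).sum(dim=0)  # fused recommendation
        correction = self.residual(torch.cat([obs, baseline], dim=-1))
        return baseline + correction                # residual improves the ensemble's advice

# Toy usage with two stand-in experts.
experts = [lambda o: torch.tanh(o[:3]), lambda o: -torch.tanh(o[:3])]
policy = BayesianResidualPolicy(experts, obs_dim=8, act_dim=3)
action = policy(torch.randn(8), torch.tensor([0.7, 0.3]))
```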
SpectroBank: A filter-bank convolutional layer for CNN-based audio applications
Title | SpectroBank: A filter-bank convolutional layer for CNN-based audio applications |
Authors | Anonymous |
Abstract | We propose and investigate the design of a new convolutional layer where kernels are parameterized functions. This layer aims at being the input layer of convolutional neural networks for audio applications. The kernels are defined as functions having a band-pass filter shape, with a limited number of trainable parameters. We show that networks having such an input layer can achieve state-of-the-art accuracy on several audio classification tasks. This approach, while reducing the number of weights to be trained along with network training time, enables larger kernel sizes, an advantage for audio applications. Furthermore, the learned filters bring additional interpretability and a better understanding of the data properties exploited by the network. |
Tasks | Audio Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HyewT1BKvr |
https://openreview.net/pdf?id=HyewT1BKvr | |
PWC | https://paperswithcode.com/paper/spectrobank-a-filter-bank-convolutional-layer |
Repo | |
Framework | |
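The abstract says the kernels are parameterized band-pass functions with few trainable parameters but does not give the parameterization, so the layer below follows the well-known sinc-based recipe (two trainable cut-off frequencies per filter, band-pass built as a difference of low-pass sinc filters) as an illustrative stand-in, not the paper's exact design.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BandPassConv1d(nn.Module):
    """Input convolutional layer whose kernels are band-pass filters defined by
    two trainable parameters each (low cut-off and bandwidth) instead of free weights."""
    def __init__(self, n_filters=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Initialise cut-offs roughly evenly over the spectrum (in Hz).
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        t = (torch.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
        self.register_buffer("t", t)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                               # x: (batch, 1, time)
        low = self.low_hz.abs()
        high = low + self.band_hz.abs()
        t = self.t.unsqueeze(0)                          # (1, kernel_size)
        safe_t = torch.where(t == 0, torch.ones_like(t), t)

        def sinc_lowpass(cutoff):                        # ideal low-pass at `cutoff` Hz
            lp = torch.sin(2 * math.pi * cutoff.unsqueeze(1) * safe_t) / (math.pi * safe_t)
            return torch.where(t == 0, 2 * cutoff.unsqueeze(1), lp)

        # Band-pass = difference of two low-pass filters, tapered by a window.
        kernels = (sinc_lowpass(high) - sinc_lowpass(low)) * self.window
        return F.conv1d(x, kernels.unsqueeze(1), padding=self.kernel_size // 2)

layer = BandPassConv1d()
features = layer(torch.randn(2, 1, 16000))               # (2, 40, 16000)
```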
Variational Autoencoders for Opponent Modeling in Multi-Agent Systems
Title | Variational Autoencoders for Opponent Modeling in Multi-Agent Systems |
Authors | Anonymous |
Abstract | Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment. In this work, we are interested in controlling one agent in a multi-agent system and successfully learning to interact with the other agents, which have fixed policies. Modeling the behavior of other agents (opponents) is essential to understanding the interactions of the agents in the system. By taking advantage of recent advances in unsupervised learning, we propose modeling opponents using variational autoencoders. Additionally, many existing methods in the literature assume that the opponent models have access to the opponent's observations and actions during both training and execution. To eliminate this assumption, we propose a modification that attempts to identify the underlying opponent model using only local information from our agent, such as its observations, actions, and rewards. The experiments indicate that our opponent modeling methods achieve equal or greater episodic returns in reinforcement learning tasks compared to another modeling method. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BylNoaVYPS |
https://openreview.net/pdf?id=BylNoaVYPS | |
PWC | https://paperswithcode.com/paper/variational-autoencoders-for-opponent |
Repo | |
Framework | |
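A rough sketch, with many simplifications, of a variational autoencoder over the controlled agent's own trajectories (observations, actions, rewards), which is the local-information setting the abstract proposes; the resulting latent code would then be fed to the reinforcement learning policy. The GRU encoder, the mean-step reconstruction target, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LocalTrajectoryVAE(nn.Module):
    """Encode the controlled agent's (obs, action, reward) trajectory into a
    latent opponent embedding, without access to the opponent's observations."""
    def __init__(self, step_dim, latent_dim=8, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(step_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, step_dim))

    def forward(self, traj):                       # traj: (batch, T, step_dim)
        _, h = self.encoder(traj)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        recon = self.decoder(z)                    # reconstruct a trajectory summary
        return recon, mu, logvar, z

def vae_loss(recon, traj, mu, logvar):
    rec = ((recon - traj.mean(dim=1)) ** 2).mean()     # reconstruct the mean step (simplification)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + 1e-3 * kl

# The latent z would be concatenated to the policy's observation input.
model = LocalTrajectoryVAE(step_dim=12)
traj = torch.randn(16, 20, 12)
recon, mu, logvar, z = model(traj)
loss = vae_loss(recon, traj, mu, logvar)
```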
Multi-source Multi-view Transfer Learning in Neural Topic Modeling with Pretrained Topic and Word Embeddings
Title | Multi-source Multi-view Transfer Learning in Neural Topic Modeling with Pretrained Topic and Word Embeddings |
Authors | Anonymous |
Abstract | Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address the data sparsity problem in short texts or small collections of documents. However, no prior work has employed (pretrained latent) topics in a transfer learning paradigm. In this paper, we propose a framework to perform transfer learning in neural topic modeling using (1) pretrained (latent) topics obtained from a large source corpus, and (2) pretrained word and topic embeddings jointly (i.e., multi-view), in order to improve topic quality and better deal with polysemy and data sparsity issues in a target corpus. In doing so, we first accumulate topics and word representations from one or many source corpora to build respective pools of pretrained topics (i.e., TopicPool) and word embeddings (i.e., WordPool). Then, we identify one or multiple relevant source domain(s) and take advantage of the corresponding topics and word features via the respective pools to guide meaningful learning in the sparse target domain. We quantify the quality of topic and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from news and medical domains. We demonstrate state-of-the-art results on topic modeling with the proposed transfer learning approaches. |
Tasks | Information Retrieval, Transfer Learning, Word Embeddings |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByxODxHYwB |
https://openreview.net/pdf?id=ByxODxHYwB | |
PWC | https://paperswithcode.com/paper/multi-source-multi-view-transfer-learning-in |
Repo | |
Framework | |
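The abstract describes guiding a sparse target corpus with pools of pretrained topics and embeddings but not the exact mechanism; the function below sketches one simple possibility, a regularizer that pulls each target topic toward its nearest pretrained topic in the TopicPool. It is an illustration only, not the authors' framework.

```python
import torch
import torch.nn.functional as F

def topic_guidance_loss(target_topics, topic_pool, strength=0.1):
    """Pull each target topic toward its most similar pretrained source topic.

    target_topics: (K_t, vocab) topic-word matrix being learned on the target corpus.
    topic_pool:    (K_s, vocab) pretrained topics accumulated from source corpora.
    """
    tgt = F.normalize(target_topics, dim=-1)
    src = F.normalize(topic_pool, dim=-1)
    sim = tgt @ src.t()                              # (K_t, K_s) cosine similarities
    nearest = sim.argmax(dim=1)                      # best-matching source topic per target topic
    return strength * ((target_topics - topic_pool[nearest]) ** 2).mean()

loss = topic_guidance_loss(torch.randn(20, 5000), torch.randn(200, 5000))
```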
Knowledge Transfer via Student-Teacher Collaboration
Title | Knowledge Transfer via Student-Teacher Collaboration |
Authors | Anonymous |
Abstract | Despite their flourishing development across various fields, deep neural networks still face the plight of high computational and storage costs. One way to compress these heavy models is knowledge transfer (KT), in which a light student network is trained by absorbing knowledge from a powerful teacher network. In this paper, we propose a novel knowledge transfer method that employs a Student-Teacher Collaboration (STC) network during the knowledge transfer process. This is done by connecting the front part of the student network to the back part of the teacher network to form the STC network. The back part of the teacher network takes the intermediate representation from the front part of the student network as input to make the prediction. The difference between the prediction from the collaboration network and the output tensor from the teacher network is incorporated into the loss during training. Through back-propagation, the teacher network provides guidance to the student network in the form of a gradient signal. In this way, our method takes advantage of the knowledge of the entire teacher network, which instructs the student network throughout the learning process. Extensive experiments show that our STC method outperforms other KT methods that use the conventional strategy. |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1lVvgHKDr |
https://openreview.net/pdf?id=H1lVvgHKDr | |
PWC | https://paperswithcode.com/paper/knowledge-transfer-via-student-teacher |
Repo | |
Framework | |
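A toy sketch of the collaboration network described in the abstract: the fixed teacher's back half runs on top of the student's front half, and the mismatch with the full teacher's output is penalized, so gradients flowing through the frozen teacher guide the student front. The layer sizes and the MSE loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy teacher and student split into front/back halves (assumed shapes).
teacher_front = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
teacher_back  = nn.Sequential(nn.Linear(64, 10))
student_front = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
student_back  = nn.Sequential(nn.Linear(64, 10))   # trained with the ordinary task loss (not shown)

for p in list(teacher_front.parameters()) + list(teacher_back.parameters()):
    p.requires_grad_(False)                         # teacher is fixed; it only provides guidance

def stc_loss(x):
    """Student-Teacher Collaboration: the teacher's back half runs on top of the
    student's front half, and the mismatch with the full teacher's output sends
    gradient guidance back into the student front."""
    with torch.no_grad():
        teacher_out = teacher_back(teacher_front(x))
    collab_out = teacher_back(student_front(x))     # gradients flow into student_front
    return F.mse_loss(collab_out, teacher_out)

x = torch.randn(8, 32)
opt = torch.optim.SGD(student_front.parameters(), lr=0.1)
loss = stc_loss(x)
opt.zero_grad(); loss.backward(); opt.step()
```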