Paper Group NANR 133
Winning the Lottery with Continuous Sparsification. Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula. Reweighted Proximal Pruning for Large-Scale Language Representation. Distilled embedding: non-linear embedding factorization using knowledge distillation. Soft Token Matching for Interpretable Low-Resource C …
Winning the Lottery with Continuous Sparsification
Title | Winning the Lottery with Continuous Sparsification |
Authors | Anonymous |
Abstract | The Lottery Ticket Hypothesis from Frankle & Carbin (2019) conjectures that, for typically-sized neural networks, it is possible to find small sub-networks which train faster and yield performance superior to that of their original counterparts. The proposed algorithm to search for such sub-networks (winning tickets), Iterative Magnitude Pruning (IMP), consistently finds sub-networks with 90-95% fewer parameters which indeed train faster and better than the overparameterized models they were extracted from, creating potential applications to problems such as transfer learning. In this paper, we propose a new algorithm to search for winning tickets, Continuous Sparsification, which continuously removes parameters from a network during training and learns the sub-network's structure with gradient-based methods instead of relying on pruning strategies. We show empirically that our method is capable of finding tickets that outperform the ones learned by Iterative Magnitude Pruning, while providing up to 5 times faster search when measured in number of training epochs. |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJe4oxHYPB |
https://openreview.net/pdf?id=BJe4oxHYPB | |
PWC | https://paperswithcode.com/paper/winning-the-lottery-with-continuous |
Repo | |
Framework | |
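The abstract above does not spell out the gating mechanism, so the following is only a minimal sketch of what learning a sub-network's structure with gradient-based soft masks might look like in PyTorch: each weight is gated by a sigmoid of a trainable mask parameter, an L1 penalty pushes gates toward zero, and a temperature `beta` is annealed so the gates become increasingly binary. The class name `SoftMaskedLinear`, the temperature schedule, and the penalty weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMaskedLinear(nn.Module):
    """Linear layer whose weights are gated by a differentiable soft mask."""
    def __init__(self, in_features, out_features, mask_init=0.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Real-valued mask parameters; sigmoid(beta * m) approaches {0, 1} as beta grows.
        self.mask_param = nn.Parameter(torch.full((out_features, in_features), mask_init))

    def forward(self, x, beta=1.0):
        gate = torch.sigmoid(beta * self.mask_param)
        return F.linear(x, self.weight * gate, self.bias)

    def sparsity_penalty(self, beta=1.0):
        # L1 penalty on the gates pushes them toward zero (i.e., removes parameters).
        return torch.sigmoid(beta * self.mask_param).sum()

# Training loop sketch: anneal beta upward so the gates become increasingly binary.
layer = SoftMaskedLinear(784, 10)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
for epoch in range(5):
    beta = 1.0 + epoch            # temperature schedule (assumption)
    logits = layer(x, beta=beta)
    loss = F.cross_entropy(logits, y) + 1e-4 * layer.sparsity_penalty(beta)
    opt.zero_grad(); loss.backward(); opt.step()

# Final ticket: threshold the gates into a binary mask.
ticket_mask = (torch.sigmoid(beta * layer.mask_param) > 0.5).float()
```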
Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula
Title | Model Architecture Controls Gradient Descent Dynamics: A Combinatorial Path-Based Formula |
Authors | Anonymous |
Abstract | Recently, there has been a growing interest in automatically exploring the neural network architecture design space with the goal of finding an architecture that improves performance (characterized as improved accuracy, speed of training, or resource requirements). However, our theoretical understanding of how model architecture affects performance or accuracy is limited. In this paper, we study the impact of model architecture on the speed of training in the context of gradient descent optimization. We model gradient descent as a first-order ODE and use the ODE's coefficient matrix H to characterize the convergence rate. We introduce a simple analysis technique that enumerates H in terms of all possible "paths" in the network. We show that changes in model architecture parameters are reflected as changes in the number of paths and the properties of each path, which jointly control the speed of convergence. We believe our analysis technique is useful in reasoning about more complex model architecture modifications. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1xw9n4Kwr |
https://openreview.net/pdf?id=B1xw9n4Kwr | |
PWC | https://paperswithcode.com/paper/model-architecture-controls-gradient-descent |
Repo | |
Framework | |
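As a reference for the setup the abstract describes, here is the standard way gradient descent is written as a first-order ODE and how the coefficient matrix H controls the convergence rate; the paper's combinatorial enumeration of H over network paths is not reproduced here.

```latex
% Gradient flow (continuous-time gradient descent):
\dot{w}(t) = -\nabla_w L\big(w(t)\big)
% Linearising around a minimiser w^* with error e(t) = w(t) - w^*:
\dot{e}(t) \approx -H\,e(t), \qquad H = \nabla_w^2 L(w^*)
% so e(t) \approx e^{-Ht}\,e(0), and the speed of convergence is governed by
% the eigenvalues of the coefficient matrix H.
```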
Reweighted Proximal Pruning for Large-Scale Language Representation
Title | Reweighted Proximal Pruning for Large-Scale Language Representation |
Authors | Anonymous |
Abstract | Recently, pre-trained language representations such as BERT have flourished as the mainstay of the natural language understanding community. These pre-trained language representations can create state-of-the-art results on a wide range of downstream tasks. Along with continuous, significant performance improvements, the size and complexity of these pre-trained neural models continue to increase rapidly. Is it possible to compress these large-scale language representation models? How will the pruned language representation affect the downstream multi-task transfer learning objectives? In this paper, we propose Reweighted Proximal Pruning (RPP), a new pruning method specifically designed for large-scale language representation models. Through experiments on SQuAD and the GLUE benchmark suite, we show that proximally pruned BERT keeps high accuracy for both the pre-training task and the downstream multiple fine-tuning tasks at high prune ratios. RPP provides a new perspective to help us analyze what large-scale language representations might learn. Additionally, RPP makes it possible to deploy a large state-of-the-art language representation model such as BERT on a series of distinct devices (e.g., online servers, mobile phones, and edge devices). |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1gBOxSFwr |
https://openreview.net/pdf?id=r1gBOxSFwr | |
PWC | https://paperswithcode.com/paper/reweighted-proximal-pruning-for-large-scale |
Repo | |
Framework | |
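The abstract does not give the update rule, but reweighted L1 regularization combined with a proximal step is a standard recipe, so the sketch below shows one plausible reading: a gradient step on the task loss followed by soft-thresholding whose per-weight threshold is inversely proportional to the current weight magnitude. The function name and hyperparameters are assumptions, not the authors' RPP algorithm.

```python
import torch

def reweighted_prox_step(weight, grad, lr=1e-3, lam=1e-4, eps=1e-6):
    """One proximal-gradient step with a reweighted L1 penalty.

    A generic sketch: the reweighting 1 / (|w| + eps) penalises small weights
    more strongly, and the proximal operator of the (weighted) L1 norm is
    soft-thresholding, which drives those weights exactly to zero.
    """
    # Gradient step on the task loss.
    w = weight - lr * grad
    # Per-weight penalty strength (reweighted L1).
    threshold = lr * lam / (weight.abs() + eps)
    # Soft-thresholding = prox of the weighted L1 norm.
    return torch.sign(w) * torch.clamp(w.abs() - threshold, min=0.0)

# Usage: after computing gradients for a BERT-like parameter tensor `p`:
p = torch.randn(768, 768, requires_grad=True)
loss = (p ** 2).sum()          # stand-in for a fine-tuning loss
loss.backward()
with torch.no_grad():
    p.copy_(reweighted_prox_step(p, p.grad))
```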
Distilled embedding: non-linear embedding factorization using knowledge distillation
Title | Distilled embedding: non-linear embedding factorization using knowledge distillation |
Authors | Anonymous |
Abstract | Word embeddings are a vital component of Natural Language Processing (NLP) systems and have been extensively researched. Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge devices challenging due to memory limitations. Compressing embedding matrices without sacrificing model performance is essential for successful commercial edge deployment. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, we initialize the weights of our decomposition by learning to reconstruct the full word embedding, and then fine-tune on the downstream task employing knowledge distillation on the factorized embedding. We conduct extensive experiments with various compression rates on machine translation, using different datasets with a shared word-embedding matrix for both the embedding and vocabulary projection matrices. We show that the proposed technique outperforms conventional low-rank matrix factorization and other recently proposed word-embedding matrix compression methods. |
Tasks | Machine Translation, Word Embeddings |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bkga90VKDB |
https://openreview.net/pdf?id=Bkga90VKDB | |
PWC | https://paperswithcode.com/paper/distilled-embedding-non-linear-embedding-1 |
Repo | |
Framework | |
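A minimal sketch, under assumed shapes and names, of the two ideas the abstract states: a low-rank embedding factorization with a non-linearity in between, and an initialization stage that learns to reconstruct the full pretrained embedding matrix. The downstream distillation stage is only indicated in a comment.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Low-rank embedding with a non-linearity: lookup -> r-dim -> ReLU -> d-dim."""
    def __init__(self, vocab_size, embed_dim, rank):
        super().__init__()
        self.low_rank = nn.Embedding(vocab_size, rank)
        self.up_proj = nn.Linear(rank, embed_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, token_ids):
        return self.up_proj(self.act(self.low_rank(token_ids)))

# Stage 1 (sketch): initialise the factors by reconstructing a full embedding matrix.
vocab, dim, rank = 10000, 512, 64
teacher_embedding = torch.randn(vocab, dim)          # pretrained full matrix (stand-in)
student = FactorizedEmbedding(vocab, dim, rank)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
ids = torch.arange(vocab)
for _ in range(100):
    recon = student(ids)
    loss = nn.functional.mse_loss(recon, teacher_embedding)
    opt.zero_grad(); loss.backward(); opt.step()
# Stage 2 would fine-tune `student` inside the translation model with a
# distillation loss against the teacher's outputs.
```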
Soft Token Matching for Interpretable Low-Resource Classification
Title | Soft Token Matching for Interpretable Low-Resource Classification |
Authors | Anonymous |
Abstract | We propose a model to tackle classification tasks in the presence of very little training data. To this aim, we introduce a novel matching mechanism to focus on elements of the input by using vectors that represent semantically meaningful concepts for the task at hand. By leveraging highlighted portions of the training data, a simple yet effective error-boosting technique guides the learning process. In practice, it increases the error associated with relevant parts of the input by a given factor. Results on text classification tasks confirm the benefits of the proposed approach in both balanced and unbalanced cases, making it of practical use when labeling new examples is expensive. In addition, the model is interpretable, as it allows for human inspection of the learned weights. |
Tasks | Text Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJlNnhVYDr |
https://openreview.net/pdf?id=SJlNnhVYDr | |
PWC | https://paperswithcode.com/paper/soft-token-matching-for-interpretable-low |
Repo | |
Framework | |
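The abstract only outlines the matching and error-boosting ideas, so the snippet below is a speculative sketch of one possible reading: tokens are softly matched (cosine similarity) against learnable concept vectors, and examples whose highlighted spans are poorly matched have their loss scaled up by a boost factor. All names, shapes, and the coverage heuristic are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMatcher(nn.Module):
    """Tokens are softly matched against learnable concept vectors; the
    resulting similarity profile feeds a linear classifier."""
    def __init__(self, embed_dim, n_concepts, n_classes):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(n_concepts, embed_dim))
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, token_embeddings):            # (batch, seq, dim)
        sim = F.cosine_similarity(
            token_embeddings.unsqueeze(2),           # (batch, seq, 1, dim)
            self.concepts.view(1, 1, -1, token_embeddings.size(-1)),
            dim=-1)                                  # (batch, seq, n_concepts)
        features = sim.max(dim=1).values             # best match per concept
        return self.classifier(features), sim

def boosted_loss(logits, labels, sim, highlight_mask, boost=2.0):
    """Error boosting (sketch): scale the loss of examples whose highlighted
    tokens are poorly covered by the concept matching."""
    base = F.cross_entropy(logits, labels, reduction="none")
    coverage = (sim.max(-1).values * highlight_mask).sum(1) / highlight_mask.sum(1).clamp(min=1)
    weight = torch.where(coverage < 0.5, torch.full_like(base, boost), torch.ones_like(base))
    return (weight * base).mean()

model = SoftMatcher(embed_dim=300, n_concepts=5, n_classes=2)
logits, sim = model(torch.randn(4, 20, 300))
```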
Data-Efficient Image Recognition with Contrastive Predictive Coding
Title | Data-Efficient Image Recognition with Contrastive Predictive Coding |
Authors | Anonymous |
Abstract | Human observers can learn to recognize new categories of objects from a handful of examples, yet doing so with machine perception remains an open challenge. We hypothesize that data-efficient recognition is enabled by representations which make the variability in natural signals more predictable, as suggested by recent perceptual evidence. We therefore revisit and improve Contrastive Predictive Coding, a recently-proposed unsupervised learning framework, and arrive at a representation which enables generalization from small amounts of labeled data. When provided with only 1% of ImageNet labels (i.e. 13 per class), this model retains a strong classification performance, 73% Top-5 accuracy, outperforming supervised networks by 28% (a 65% relative improvement) and state-of-the-art semi-supervised methods by 14%. We also find this representation to serve as a useful substrate for object detection on the PASCAL-VOC 2007 dataset, approaching the performance of representations trained with a fully annotated ImageNet dataset. |
Tasks | Object Detection, Self-Supervised Image Classification, Semi-Supervised Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJerHlrYwH |
https://openreview.net/pdf?id=rJerHlrYwH | |
PWC | https://paperswithcode.com/paper/data-efficient-image-recognition-with-1 |
Repo | |
Framework | |
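Contrastive Predictive Coding is trained with the InfoNCE objective; the snippet below is a generic sketch of that loss (not this paper's exact patch-prediction setup): each predicted representation must identify its matching target among the other targets in the batch.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context, targets, temperature=0.1):
    """InfoNCE: each context vector must pick out its own target ("positive")
    among all targets in the batch ("negatives").

    context, targets: (batch, dim) predicted and actual patch representations.
    """
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.t() / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, labels)

# Usage: `context` would come from an autoregressive summary of patches seen so
# far, `targets` from an encoder of the patches to be predicted (not shown).
loss = info_nce_loss(torch.randn(64, 256), torch.randn(64, 256))
```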
ADAPTING PRETRAINED LANGUAGE MODELS FOR LONG DOCUMENT CLASSIFICATION
Title | ADAPTING PRETRAINED LANGUAGE MODELS FOR LONG DOCUMENT CLASSIFICATION |
Authors | Anonymous |
Abstract | Pretrained language models (LMs) have shown excellent results in achieving human-like performance on many language tasks. However, the most powerful LMs have one significant drawback: a fixed-size input. With this constraint, these LMs are unable to utilize the full input of long documents. In this paper, we introduce a new framework to handle documents of arbitrary length. We investigate adding a recurrent mechanism to extend the input size and utilizing attention to identify the most discriminating segment of the input. We perform extensive validating experiments on patent and arXiv datasets, both of which contain long text. We demonstrate that our method significantly outperforms state-of-the-art results reported in recent literature. |
Tasks | Document Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxW804FPH |
https://openreview.net/pdf?id=ryxW804FPH | |
PWC | https://paperswithcode.com/paper/adapting-pretrained-language-models-for-long |
Repo | |
Framework | |
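A rough sketch of the segment-then-attend recipe the abstract hints at, under assumed shapes: the document is split into fixed-size segments, each segment is encoded by a pretrained encoder, and attention over segment representations selects the most discriminating parts. The `ToyEncoder` stands in for a real pretrained LM, and the recurrent variant mentioned in the abstract is not shown.

```python
import torch
import torch.nn as nn

class SegmentAttentionClassifier(nn.Module):
    """Encode fixed-size segments with a pretrained LM, then attend over the
    segment representations to pick out the most discriminative ones."""
    def __init__(self, encoder, hidden_dim, n_classes):
        super().__init__()
        self.encoder = encoder                      # a BERT-like module (assumption)
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, segment_ids):                 # (batch, n_segments, seg_len)
        b, s, l = segment_ids.shape
        # Encode each segment independently; take its pooled representation.
        seg_repr = self.encoder(segment_ids.view(b * s, l)).view(b, s, -1)
        # Attention over segments identifies the most discriminating ones.
        weights = torch.softmax(self.attn(seg_repr).squeeze(-1), dim=1)
        doc_repr = (weights.unsqueeze(-1) * seg_repr).sum(dim=1)
        return self.classifier(doc_repr)

class ToyEncoder(nn.Module):
    """Stand-in for a pretrained LM: mean-pooled token embeddings."""
    def __init__(self, vocab=30522, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, ids):
        return self.emb(ids).mean(dim=1)

model = SegmentAttentionClassifier(ToyEncoder(), hidden_dim=128, n_classes=2)
logits = model(torch.randint(0, 30522, (4, 8, 64)))   # 4 docs, 8 segments of 64 tokens
```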
MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis
Title | MGP-AttTCN: An Interpretable Machine Learning Model for the Prediction of Sepsis |
Authors | Anonymous |
Abstract | Claiming 5.4 million lives worldwide every year and incurring a healthcare cost of more than 16 billion dollars in the USA alone, sepsis is one of the leading causes of hospital mortality and an increasing concern in the ageing western world. Recently, medical and technological advances have helped re-define the illness criteria of this disease, which is otherwise poorly understood by the medical community. Together with the rise of widely accessible Electronic Health Records, the advances in data mining and complex nonlinear algorithms are a promising avenue for the early detection of sepsis. This work contributes to the research effort in the field of automated sepsis detection with an open-access labelling of the MIMIC-III medical data set. Moreover, we propose MGP-AttTCN: a joint multitask Gaussian Process and attention-based deep learning model to predict the occurrence of sepsis early and in an interpretable manner. We show that our model outperforms the current state of the art and present evidence that different labelling heuristics lead to discrepancies in task difficulty. |
Tasks | Interpretable Machine Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJgDb1SFwB |
https://openreview.net/pdf?id=rJgDb1SFwB | |
PWC | https://paperswithcode.com/paper/mgp-atttcn-an-interpretable-machine-learning-1 |
Repo | |
Framework | |
Few-Shot Few-Shot Learning and the role of Spatial Attention
Title | Few-Shot Few-Shot Learning and the role of Spatial Attention |
Authors | Anonymous |
Abstract | Few-shot learning is often motivated by the ability of humans to learn new tasks from few examples. However, standard few-shot classification benchmarks assume that the representation is learned on a limited amount of base class data, ignoring the amount of prior knowledge that a human may have accumulated before learning new tasks. At the same time, even if a powerful representation is available, base class data may be limited or non-existent in some domains. This motivates us to study a problem where the representation is obtained from a classifier pre-trained on a large-scale dataset of a different domain, assuming no access to its training process, while the base class data are limited to few examples per class and their role is to adapt the representation to the domain at hand rather than learn from scratch. We adapt the representation in two stages, namely on the few base class data if available and on the even fewer data of new tasks. In doing so, we obtain from the pre-trained classifier a spatial attention map that allows focusing on objects and suppressing background clutter. This is important in the new problem, because when base class data are few, the network cannot learn where to focus implicitly. We also show that a pre-trained network may be easily adapted to novel classes, without meta-learning. |
Tasks | Few-Shot Learning, Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1l2mxHKvr |
https://openreview.net/pdf?id=H1l2mxHKvr | |
PWC | https://paperswithcode.com/paper/few-shot-few-shot-learning-and-the-role-of |
Repo | |
Framework | |
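One simple way to derive a spatial attention map from a pre-trained classifier, offered here as an illustrative sketch rather than the paper's construction: use the backbone's final convolutional features, take the per-location activation energy as attention, and pool features with it to suppress background clutter. The torchvision ResNet-18 backbone is an assumption.

```python
import torch
import torchvision

# Backbone from a different domain; in practice, load large-scale pretrained weights.
backbone = torchvision.models.resnet18()
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

def attended_embedding(images):
    """Weight spatial locations by an attention map derived from the backbone's
    features, suppressing background clutter before pooling."""
    feats = feature_extractor(images)                 # (batch, C, H, W)
    attn = feats.pow(2).sum(dim=1, keepdim=True)      # activation energy per location
    attn = attn / attn.flatten(2).sum(-1).view(-1, 1, 1, 1).clamp(min=1e-8)
    return (feats * attn).flatten(2).sum(-1)          # (batch, C) attended embedding

with torch.no_grad():
    emb = attended_embedding(torch.randn(2, 3, 224, 224))   # (2, 512)
```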
Meta-Learning with Network Pruning for Overfitting Reduction
Title | Meta-Learning with Network Pruning for Overfitting Reduction |
Authors | Anonymous |
Abstract | Meta-learning has achieved great success in few-shot learning. However, existing meta-learning models have been shown to overfit on meta-training tasks when using deeper and wider convolutional neural networks. This means that we cannot improve the meta-generalization performance by merely deepening or widening the networks. To remedy this meta-overfitting, we propose in this paper a sparsity-constrained meta-learning approach that learns from meta-training tasks a sub-network from which first-order optimization methods can quickly converge towards the optimal network on meta-testing tasks. Our theoretical analysis shows the benefit of sparsity for improving the generalization gap of the learned meta-initialization network. We have implemented our approach on top of the widely applied Reptile algorithm, combined with various network pruning routines including Dense-Sparse-Dense (DSD) and Iterative Hard Thresholding (IHT). Extensive experimental results on benchmark datasets with different over-parameterized deep networks demonstrate that our method can not only effectively ease meta-overfitting but also, in many cases, improve the meta-generalization performance when applied to few-shot classification tasks. |
Tasks | Few-Shot Learning, Meta-Learning, Network Pruning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1gcblSKwB |
https://openreview.net/pdf?id=B1gcblSKwB | |
PWC | https://paperswithcode.com/paper/meta-learning-with-network-pruning-for |
Repo | |
Framework | |
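A compact sketch of the two ingredients named in the abstract, Reptile and Iterative Hard Thresholding, combined in the most direct way: run a Reptile outer step, and periodically hard-threshold the meta-initialization so only a sub-network survives. The pruning schedule and hyperparameters are assumptions, and the DSD variant is omitted.

```python
import copy
import torch

def iht_prune(params, keep_ratio=0.2):
    """Iterative Hard Thresholding: keep only the largest-magnitude entries."""
    for p in params:
        k = max(1, int(keep_ratio * p.numel()))
        threshold = p.abs().flatten().kthvalue(p.numel() - k + 1).values
        p.data.mul_((p.abs() >= threshold).float())

def reptile_step(meta_model, task_loader, inner_lr=0.01, meta_lr=0.1, inner_steps=5):
    """One Reptile outer step: adapt a copy on the task, then move the
    meta-parameters toward the adapted ones."""
    adapted = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = next(task_loader)
        loss = torch.nn.functional.cross_entropy(adapted(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        for p_meta, p_task in zip(meta_model.parameters(), adapted.parameters()):
            p_meta.add_(meta_lr * (p_task - p_meta))

# Sparsity-constrained meta-training (sketch): periodically hard-threshold the
# meta-initialisation so that only a sub-network is carried forward.
# for it, task in enumerate(task_stream):
#     reptile_step(meta_model, task)
#     if it % 100 == 0:
#         iht_prune(meta_model.parameters())
```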
Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts
Title | Bayesian Residual Policy Optimization: Scalable Bayesian Reinforcement Learning with Clairvoyant Experts |
Authors | Anonymous |
Abstract | Informed and robust decision making in the face of uncertainty is critical for robots that perform physical tasks alongside people. We formulate this as a Bayesian Reinforcement Learning problem over latent Markov Decision Processes (MDPs). While Bayes-optimality is theoretically the gold standard, existing algorithms do not scale well to continuous state and action spaces. We propose a scalable solution that builds on the following insight: in the absence of uncertainty, each latent MDP is easier to solve. We split the challenge into two simpler components. First, we obtain an ensemble of clairvoyant experts and fuse their advice to compute a baseline policy. Second, we train a Bayesian residual policy to improve upon the ensemble’s recommendation and learn to reduce uncertainty. Our algorithm, Bayesian Residual Policy Optimization (BRPO), imports the scalability of policy gradient methods as well as the initialization from prior models. BRPO significantly improves the ensemble of experts and drastically outperforms existing adaptive RL methods. |
Tasks | Decision Making, Policy Gradient Methods |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1grSREtDH |
https://openreview.net/pdf?id=B1grSREtDH | |
PWC | https://paperswithcode.com/paper/bayesian-residual-policy-optimization |
Repo | |
Framework | |
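A bare-bones sketch of the baseline-plus-residual structure the abstract describes: a belief-weighted ensemble of per-MDP experts produces a recommendation, and a small learned residual corrects it. In BRPO the residual network would be trained with a policy-gradient method; here only the forward pass is shown, and all shapes and expert callables are assumptions.

```python
import torch
import torch.nn as nn

class BayesianResidualPolicy(nn.Module):
    """Baseline action from a belief-weighted ensemble of per-MDP experts,
    plus a learned residual correction (a generic sketch of the recipe)."""
    def __init__(self, experts, obs_dim, act_dim):
        super().__init__()
        self.experts = experts                      # list of callables: obs -> action
        self.residual = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs, belief):                 # belief: (n_experts,) posterior over latent MDPs
        expert_actions = torch.stack([e(obs) for e in self.experts])   # (n_experts, act_dim)
        baseline = (belief.unsqueeze(-1) * expert_actions).sum(dim=0)  # fused recommendation
        correction = self.residual(torch.cat([obs, baseline], dim=-1))
        return baseline + correction                # residual improves the ensemble's advice

# Toy usage with two stand-in experts.
experts = [lambda o: torch.tanh(o[:3]), lambda o: -torch.tanh(o[:3])]
policy = BayesianResidualPolicy(experts, obs_dim=8, act_dim=3)
action = policy(torch.randn(8), torch.tensor([0.7, 0.3]))
```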
SpectroBank: A filter-bank convolutional layer for CNN-based audio applications
Title | SpectroBank: A filter-bank convolutional layer for CNN-based audio applications |
Authors | Anonymous |
Abstract | We propose and investigate the design of a new convolutional layer where kernels are parameterized functions. This layer aims at being the input layer of convolutional neural networks for audio applications. The kernels are defined as functions having a band-pass filter shape, with a limited number of trainable parameters. We show that networks having such an input layer can achieve state-of-the-art accuracy on several audio classification tasks. This approach, while reducing the number of weights to be trained along with network training time, enables larger kernel sizes, an advantage for audio applications. Furthermore, the learned filters bring additional interpretability and a better understanding of the data properties exploited by the network. |
Tasks | Audio Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HyewT1BKvr |
https://openreview.net/pdf?id=HyewT1BKvr | |
PWC | https://paperswithcode.com/paper/spectrobank-a-filter-bank-convolutional-layer |
Repo | |
Framework | |
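The abstract says the kernels are parameterized band-pass functions with few trainable parameters but does not give the parameterization, so the layer below follows the well-known sinc-based recipe (two trainable cut-off frequencies per filter, band-pass built as a difference of low-pass sinc filters) as an illustrative stand-in, not the paper's exact design.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BandPassConv1d(nn.Module):
    """Input convolutional layer whose kernels are band-pass filters defined by
    two trainable parameters each (low cut-off and bandwidth) instead of free weights."""
    def __init__(self, n_filters=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Initialise cut-offs roughly evenly over the spectrum (in Hz).
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        t = (torch.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
        self.register_buffer("t", t)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                               # x: (batch, 1, time)
        low = self.low_hz.abs()
        high = low + self.band_hz.abs()
        t = self.t.unsqueeze(0)                          # (1, kernel_size)
        safe_t = torch.where(t == 0, torch.ones_like(t), t)

        def sinc_lowpass(cutoff):                        # ideal low-pass at `cutoff` Hz
            lp = torch.sin(2 * math.pi * cutoff.unsqueeze(1) * safe_t) / (math.pi * safe_t)
            return torch.where(t == 0, 2 * cutoff.unsqueeze(1), lp)

        # Band-pass = difference of two low-pass filters, tapered by a window.
        kernels = (sinc_lowpass(high) - sinc_lowpass(low)) * self.window
        return F.conv1d(x, kernels.unsqueeze(1), padding=self.kernel_size // 2)

layer = BandPassConv1d()
features = layer(torch.randn(2, 1, 16000))               # (2, 40, 16000)
```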
Variational Autoencoders for Opponent Modeling in Multi-Agent Systems
Title | Variational Autoencoders for Opponent Modeling in Multi-Agent Systems |
Authors | Anonymous |
Abstract | Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment. In this work, we are interested in controlling one agent in a multi-agent system and successfully learning to interact with the other agents, which have fixed policies. Modeling the behavior of other agents (opponents) is essential to understanding the interactions of the agents in the system. By taking advantage of recent advances in unsupervised learning, we propose modeling opponents using variational autoencoders. Additionally, many existing methods in the literature assume that the opponent models have access to the opponent's observations and actions during both training and execution. To eliminate this assumption, we propose a modification that attempts to identify the underlying opponent model using only local information from our agent, such as its observations, actions, and rewards. The experiments indicate that our opponent modeling methods achieve equal or greater episodic returns in reinforcement learning tasks compared to another modeling method. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BylNoaVYPS |
https://openreview.net/pdf?id=BylNoaVYPS | |
PWC | https://paperswithcode.com/paper/variational-autoencoders-for-opponent |
Repo | |
Framework | |
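A rough sketch, with many simplifications, of a variational autoencoder over the controlled agent's own trajectories (observations, actions, rewards), which is the local-information setting the abstract proposes; the resulting latent code would then be fed to the reinforcement learning policy. The GRU encoder, the mean-step reconstruction target, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LocalTrajectoryVAE(nn.Module):
    """Encode the controlled agent's (obs, action, reward) trajectory into a
    latent opponent embedding, without access to the opponent's observations."""
    def __init__(self, step_dim, latent_dim=8, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(step_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, step_dim))

    def forward(self, traj):                       # traj: (batch, T, step_dim)
        _, h = self.encoder(traj)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        recon = self.decoder(z)                    # reconstruct a trajectory summary
        return recon, mu, logvar, z

def vae_loss(recon, traj, mu, logvar):
    rec = ((recon - traj.mean(dim=1)) ** 2).mean()     # reconstruct the mean step (simplification)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + 1e-3 * kl

# The latent z would be concatenated to the policy's observation input.
model = LocalTrajectoryVAE(step_dim=12)
traj = torch.randn(16, 20, 12)
recon, mu, logvar, z = model(traj)
loss = vae_loss(recon, traj, mu, logvar)
```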
Multi-source Multi-view Transfer Learning in Neural Topic Modeling with Pretrained Topic and Word Embeddings
Title | Multi-source Multi-view Transfer Learning in Neural Topic Modeling with Pretrained Topic and Word Embeddings |
Authors | Anonymous |
Abstract | Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address the data sparsity problem in short texts or small collections of documents. However, no prior work has employed (pretrained latent) topics in a transfer learning paradigm. In this paper, we propose a framework to perform transfer learning in neural topic modeling using (1) pretrained (latent) topics obtained from a large source corpus, and (2) pretrained word and topic embeddings jointly (i.e., multi-view), in order to improve topic quality and better deal with polysemy and data sparsity issues in a target corpus. In doing so, we first accumulate topics and word representations from one or many source corpora to build respective pools of pretrained topics (i.e., TopicPool) and word embeddings (i.e., WordPool). Then, we identify one or multiple relevant source domain(s) and take advantage of the corresponding topics and word features via the respective pools to guide meaningful learning in the sparse target domain. We quantify the quality of topic and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from news and medical domains. We demonstrate state-of-the-art results on topic modeling with the proposed transfer learning approaches. |
Tasks | Information Retrieval, Transfer Learning, Word Embeddings |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByxODxHYwB |
https://openreview.net/pdf?id=ByxODxHYwB | |
PWC | https://paperswithcode.com/paper/multi-source-multi-view-transfer-learning-in |
Repo | |
Framework | |
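The abstract describes guiding a sparse target corpus with pools of pretrained topics and embeddings but not the exact mechanism; the function below sketches one simple possibility, a regularizer that pulls each target topic toward its nearest pretrained topic in the TopicPool. It is an illustration only, not the authors' framework.

```python
import torch
import torch.nn.functional as F

def topic_guidance_loss(target_topics, topic_pool, strength=0.1):
    """Pull each target topic toward its most similar pretrained source topic.

    target_topics: (K_t, vocab) topic-word matrix being learned on the target corpus.
    topic_pool:    (K_s, vocab) pretrained topics accumulated from source corpora.
    """
    tgt = F.normalize(target_topics, dim=-1)
    src = F.normalize(topic_pool, dim=-1)
    sim = tgt @ src.t()                              # (K_t, K_s) cosine similarities
    nearest = sim.argmax(dim=1)                      # best-matching source topic per target topic
    return strength * ((target_topics - topic_pool[nearest]) ** 2).mean()

loss = topic_guidance_loss(torch.randn(20, 5000), torch.randn(200, 5000))
```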
Knowledge Transfer via Student-Teacher Collaboration
Title | Knowledge Transfer via Student-Teacher Collaboration |
Authors | Anonymous |
Abstract | Despite their flourishing development across various fields, deep neural networks still face the plight of high computational and storage costs. One way to compress these heavy models is knowledge transfer (KT), in which a light student network is trained by absorbing knowledge from a powerful teacher network. In this paper, we propose a novel knowledge transfer method that employs a Student-Teacher Collaboration (STC) network during the knowledge transfer process. This is done by connecting the front part of the student network to the back part of the teacher network to form the STC network. The back part of the teacher network takes the intermediate representation from the front part of the student network as input to make the prediction. The difference between the prediction from the collaboration network and the output tensor from the teacher network is incorporated into the loss during training. Through back-propagation, the teacher network provides guidance to the student network in the form of a gradient signal. In this way, our method takes advantage of the knowledge of the entire teacher network, which instructs the student network throughout the learning process. Extensive experiments show that our STC method outperforms other KT methods that use the conventional strategy. |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1lVvgHKDr |
https://openreview.net/pdf?id=H1lVvgHKDr | |
PWC | https://paperswithcode.com/paper/knowledge-transfer-via-student-teacher |
Repo | |
Framework | |
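A toy sketch of the collaboration network described in the abstract: the fixed teacher's back half runs on top of the student's front half, and the mismatch with the full teacher's output is penalized, so gradients flowing through the frozen teacher guide the student front. The layer sizes and the MSE loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy teacher and student split into front/back halves (assumed shapes).
teacher_front = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
teacher_back  = nn.Sequential(nn.Linear(64, 10))
student_front = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
student_back  = nn.Sequential(nn.Linear(64, 10))   # trained with the ordinary task loss (not shown)

for p in list(teacher_front.parameters()) + list(teacher_back.parameters()):
    p.requires_grad_(False)                         # teacher is fixed; it only provides guidance

def stc_loss(x):
    """Student-Teacher Collaboration: the teacher's back half runs on top of the
    student's front half, and the mismatch with the full teacher's output sends
    gradient guidance back into the student front."""
    with torch.no_grad():
        teacher_out = teacher_back(teacher_front(x))
    collab_out = teacher_back(student_front(x))     # gradients flow into student_front
    return F.mse_loss(collab_out, teacher_out)

x = torch.randn(8, 32)
opt = torch.optim.SGD(student_front.parameters(), lr=0.1)
loss = stc_loss(x)
opt.zero_grad(); loss.backward(); opt.step()
```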