Paper Group NANR 55
Semi-supervised Learning by Coaching
Title | Semi-supervised Learning by Coaching |
Authors | Anonymous |
Abstract | Recent semi-supervised learning (SSL) methods often have a teacher to train a student in order to propagate labels from labeled data to unlabeled data. We argue that a weakness of these methods is that the teacher does not learn from the student's mistakes during the course of the student's learning. To address this weakness, we introduce Coaching, a framework where a teacher generates pseudo labels for unlabeled data, from which a student learns, and the student's performance on labeled data is used as a reward to train the teacher via policy gradient. Our experiments show that Coaching significantly improves over state-of-the-art SSL baselines. For instance, on CIFAR-10, with only 4,000 labeled examples, a WideResNet-28-2 trained by Coaching achieves 96.11% accuracy, which is better than the 94.9% achieved by the same architecture trained with 45,000 labeled examples. On ImageNet with 10% labeled examples, Coaching trains a ResNet-50 to 72.94% top-1 accuracy, comfortably outperforming the existing state of the art by more than 4%. Coaching also scales successfully to the high-data regime with full ImageNet. Specifically, with an additional 9 million unlabeled images from OpenImages, Coaching trains a ResNet-50 to 82.34% top-1 accuracy, setting a new state of the art for the architecture on ImageNet without using extra labeled data. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJe04p4YDB |
PDF | https://openreview.net/pdf?id=rJe04p4YDB |
PWC | https://paperswithcode.com/paper/semi-supervised-learning-by-coaching |
Repo | |
Framework | |
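The coaching loop described in the abstract above pairs a pseudo-labeling teacher with a student whose performance on labeled data drives a policy-gradient update of the teacher. No code is listed here, so below is a minimal PyTorch sketch of one such update, assuming categorical pseudo labels, a REINFORCE estimator with a moving-average baseline, and illustrative names (`teacher`, `student`, `coaching_step`); it is not the authors' implementation.

```python
# Hedged sketch of a coaching-style teacher/student update (not the authors' code).
import torch
import torch.nn.functional as F

def coaching_step(teacher, student, t_opt, s_opt,
                  x_unlab, x_lab, y_lab, baseline=0.0):
    # 1) Teacher proposes pseudo labels for the unlabeled batch.
    t_logits = teacher(x_unlab)                       # (B, C)
    t_dist = torch.distributions.Categorical(logits=t_logits)
    pseudo_y = t_dist.sample()                        # sampled pseudo labels

    # 2) Student learns from the pseudo labels.
    s_opt.zero_grad()
    s_loss = F.cross_entropy(student(x_unlab), pseudo_y)
    s_loss.backward()
    s_opt.step()

    # 3) Student's performance on labeled data is the reward signal.
    with torch.no_grad():
        reward = -F.cross_entropy(student(x_lab), y_lab).item()

    # 4) REINFORCE update of the teacher with a moving-average baseline.
    t_opt.zero_grad()
    log_prob = t_dist.log_prob(pseudo_y).mean()
    (-(reward - baseline) * log_prob).backward()
    t_opt.step()

    # Update the baseline to reduce gradient variance.
    baseline = 0.9 * baseline + 0.1 * reward
    return baseline
```

In practice the reward would come from a held-out labeled batch and the baseline would be carried across steps, but the control flow above is the essential teacher-student-reward cycle.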
Min-Max Optimization without Gradients: Convergence and Applications to Adversarial ML
Title | Min-Max Optimization without Gradients: Convergence and Applications to Adversarial ML |
Authors | Anonymous |
Abstract | In this paper, we study the problem of constrained robust (min-max) optimization in a black-box setting, where the desired optimizer cannot access the gradients of the objective function but may query its values. We present a principled optimization framework, integrating a zeroth-order (ZO) gradient estimator with an alternating projected stochastic gradient descent-ascent method, where the former only requires a small number of function queries and the latter needs just a one-step descent/ascent update. We show that the proposed framework, referred to as ZO-Min-Max, has a sub-linear convergence rate under mild conditions and scales gracefully with problem size. From an application side, we explore a promising connection between black-box min-max optimization and black-box evasion and poisoning attacks in adversarial machine learning (ML). Our empirical evaluations on these use cases demonstrate the effectiveness of our approach and its scalability to dimensions that prohibit using recent black-box solvers. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rylkma4twr |
PDF | https://openreview.net/pdf?id=rylkma4twr |
PWC | https://paperswithcode.com/paper/min-max-optimization-without-gradients-1 |
Repo | |
Framework | |
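The two ingredients named in the abstract — a zeroth-order gradient estimator built from function queries and an alternating projected descent/ascent step — can be sketched in a few lines of NumPy. The two-point Gaussian-smoothing estimator, the box projection, and all names below are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, q=20, rng=None):
    """Two-point randomized gradient estimate of f at x using q random directions."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros(x.size)
    for _ in range(q):
        u = rng.standard_normal(x.size)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / q

def project_box(v, lo=-1.0, hi=1.0):
    return np.clip(v, lo, hi)

def zo_minmax_step(f, x, y, lr_x=0.01, lr_y=0.01):
    """One alternating projected step: descend in x, then ascend in y, using only queries of f."""
    gx = zo_gradient(lambda xx: f(xx, y), x)
    x = project_box(x - lr_x * gx)
    gy = zo_gradient(lambda yy: f(x, yy), y)
    y = project_box(y + lr_y * gy)
    return x, y

# Toy usage on a saddle objective f(x, y) = x^T A y + 0.5*||x||^2 - 0.5*||y||^2.
A = np.array([[1.0, 0.5], [0.0, 1.0]])
f = lambda x, y: x @ A @ y + 0.5 * x @ x - 0.5 * y @ y
x, y = np.ones(2), np.ones(2)
for _ in range(200):
    x, y = zo_minmax_step(f, x, y)
```

Each `zo_minmax_step` issues only 4·q function queries (q directions, two points each, for x and then y), which is the property that makes this style of framework usable in black-box attack settings.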
Good Semi-supervised VAE Requires Tighter Evidence Lower Bound
Title | Good Semi-supervised VAE Requires Tighter Evidence Lower Bound |
Authors | Anonymous |
Abstract | Semi-supervised learning approaches based on generative models currently face three challenges: (1) the two-stage training strategy is not robust; (2) good semi-supervised learning results and good generative performance cannot be obtained at the same time; (3) even at the expense of sacrificing generative performance, the semi-supervised classification results are still not satisfactory. To address these problems, we propose One-stage Semi-suPervised Optimal Transport VAE (OSPOT-VAE), a one-stage deep generative model that theoretically unifies the generation and classification losses in one ELBO framework and achieves a tighter ELBO by applying the optimal transport scheme to the distribution of latent variables. We show that with a tighter ELBO, our OSPOT-VAE surpasses the best semi-supervised generative models by a large margin across many benchmark datasets. For example, we reduce the error rate from 14.41% to 6.11% on CIFAR-10 with 4k labels and achieve state-of-the-art performance of 25.30% on CIFAR-100 with 10k labels. We also demonstrate that good generative models and semi-supervised results can be achieved simultaneously by OSPOT-VAE. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1ejj64YvS |
PDF | https://openreview.net/pdf?id=S1ejj64YvS |
PWC | https://paperswithcode.com/paper/good-semi-supervised-vae-requires-tighter |
Repo | |
Framework | |
Learning Latent Dynamics for Partially-Observed Chaotic Systems
Title | Learning Latent Dynamics for Partially-Observed Chaotic Systems |
Authors | Anonymous |
Abstract | This paper addresses the data-driven identification of latent representations of partially-observed dynamical systems, i.e. dynamical systems in which some components are never observed, with an emphasis on forecasting applications and long-term asymptotic patterns. Whereas state-of-the-art data-driven approaches rely on delay embeddings and linear decompositions of the underlying operators, we introduce a framework based on the data-driven identification of an augmented state-space model using a neural-network-based representation. For a given training dataset, it amounts to jointly reconstructing the latent states and learning an ODE (Ordinary Differential Equation) representation in this space. Through numerical experiments, we demonstrate the relevance of the proposed framework w.r.t. state-of-the-art approaches in terms of short-term forecasting errors and long-term behaviour. We further discuss how the proposed framework relates to Koopman operator theory and Takens' embedding theorem. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BygMreSYPB |
PDF | https://openreview.net/pdf?id=BygMreSYPB |
PWC | https://paperswithcode.com/paper/learning-latent-dynamics-for-partially-1 |
Repo | |
Framework | |
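The framework described above jointly infers the unobserved latent components and an ODE governing the augmented state. A minimal PyTorch sketch of that idea follows, assuming a fixed-step RK4 integrator, a learnable latent part of the initial condition, and placeholder data; layer sizes and names are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentODE(nn.Module):
    """Neural ODE dz/dt = f_theta(z) on an augmented state z = (observed, latent)."""
    def __init__(self, obs_dim=1, aug_dim=2, hidden=64):
        super().__init__()
        self.dim = obs_dim + aug_dim
        self.f = nn.Sequential(nn.Linear(self.dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, self.dim))

    def rk4_step(self, z, dt):
        k1 = self.f(z)
        k2 = self.f(z + 0.5 * dt * k1)
        k3 = self.f(z + 0.5 * dt * k2)
        k4 = self.f(z + dt * k3)
        return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

    def forecast(self, z0, steps, dt=0.01):
        zs, z = [z0], z0
        for _ in range(steps):
            z = self.rk4_step(z, dt)
            zs.append(z)
        return torch.stack(zs, dim=1)            # (batch, steps + 1, dim)

# Joint training: the unobserved part of the initial state is itself a
# learnable tensor, optimized together with the ODE parameters.
obs_dim, aug_dim = 1, 2
model = LatentODE(obs_dim, aug_dim)
x_obs = torch.randn(8, 51, obs_dim)              # placeholder observed trajectories
z0_latent = torch.zeros(8, aug_dim, requires_grad=True)
opt = torch.optim.Adam(list(model.parameters()) + [z0_latent], lr=1e-3)

for _ in range(100):
    z0 = torch.cat([x_obs[:, 0], z0_latent], dim=-1)
    pred = model.forecast(z0, steps=50)
    loss = ((pred[:, :, :obs_dim] - x_obs) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Replacing the placeholder `x_obs` with, e.g., a single observed coordinate of a Lorenz-63 trajectory turns this into the partially-observed chaotic setting the paper targets.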
Mixed Precision Training With 8-bit Floating Point
Title | Mixed Precision Training With 8-bit Floating Point |
Authors | Anonymous |
Abstract | Reduced precision computation is one of the key areas addressing the widening 'compute gap', driven by an exponential growth in deep learning applications. In recent years, deep neural network training has largely migrated to 16-bit precision, with significant gains in performance and energy efficiency. However, attempts to train DNNs at 8-bit precision have met with significant challenges because of the higher precision and dynamic range requirements of back-propagation. In this paper, we propose a method to train deep neural networks using 8-bit floating point representation for weights, activations, errors, and gradients. We demonstrate state-of-the-art accuracy across multiple datasets (ImageNet-1K, WMT16) and a broader set of workloads (ResNet-18/34/50, GNMT, and Transformer) than previously reported. We propose an enhanced loss scaling method to augment the reduced subnormal range of 8-bit floating point and improve error propagation. We also examine the impact of quantization noise on generalization, and propose a stochastic rounding technique to address gradient noise. As a result of applying all these techniques, we report slightly higher validation accuracy compared to the full-precision baseline. |
Tasks | Quantization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJe88xBKPr |
PDF | https://openreview.net/pdf?id=HJe88xBKPr |
PWC | https://paperswithcode.com/paper/mixed-precision-training-with-8-bit-floating-1 |
Repo | |
Framework | |
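The abstract highlights two training-side fixes, enhanced loss scaling and stochastic rounding, and both can be illustrated without committing to a specific FP8 format. The NumPy sketch below quantizes gradients to a uniform low-precision grid with unbiased stochastic rounding and shows where a loss-scale factor enters; the grid step, the scale of 1024, and the function names are illustrative assumptions, not the paper's exact format.

```python
import numpy as np

def stochastic_round(x, step):
    """Round x to a multiple of `step`, rounding up with probability equal to
    the fractional remainder, so the rounding is unbiased in expectation."""
    scaled = x / step
    floor = np.floor(scaled)
    prob_up = scaled - floor
    up = np.random.random_sample(x.shape) < prob_up
    return (floor + up) * step

def quantize_grad(grad, loss_scale=1024.0, step=2.0 ** -10):
    """Scale gradients up before quantization so small values do not underflow
    the reduced subnormal range, quantize stochastically, then unscale."""
    q = stochastic_round(grad * loss_scale, step)
    return q / loss_scale

# Tiny gradients below the grid step: round-to-nearest flushes them to zero,
# stochastic rounding (plus loss scaling) preserves their expected value.
g = np.full(100_000, 3e-7)
print(np.count_nonzero(np.round(g / 2.0 ** -10)))   # 0 survivors with nearest rounding
print(quantize_grad(g).mean())                      # close to 3e-7 on average
```

The contrast in the last two lines is the motivation: deterministic rounding silently discards small back-propagated errors, whereas the stochastic scheme keeps them unbiased.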
Pipelined Training with Stale Weights of Deep Convolutional Neural Networks
Title | Pipelined Training with Stale Weights of Deep Convolutional Neural Networks |
Authors | Anonymous |
Abstract | The growth in the complexity of Convolutional Neural Networks (CNNs) is increasing interest in partitioning a network across multiple accelerators during training and pipelining the backpropagation computations over the accelerators. Existing approaches avoid or limit the use of stale weights through techniques such as micro-batching or weight stashing. These techniques either underutilize accelerators or increase the memory footprint. We explore the impact of stale weights on statistical efficiency and performance in a pipelined backpropagation scheme that maximizes accelerator utilization and keeps memory overhead modest. We use 4 CNNs (LeNet-5, AlexNet, VGG and ResNet) and show that when pipelining is limited to early layers in a network, training with stale weights converges and results in models with inference accuracies comparable to those of non-pipelined training on the MNIST and CIFAR-10 datasets: a drop in accuracy of 0.4%, 4%, 0.83% and 1.45% for the 4 networks, respectively. However, when pipelining is deeper in the network, inference accuracies drop significantly. We propose combining pipelined and non-pipelined training in a hybrid scheme to address this drop. We demonstrate the implementation and performance of our pipelined backpropagation in PyTorch on 2 GPUs using ResNet, achieving speedups of up to 1.8X over a 1-GPU baseline, with a small drop in inference accuracy. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkgTR3VFvH |
PDF | https://openreview.net/pdf?id=SkgTR3VFvH |
PWC | https://paperswithcode.com/paper/pipelined-training-with-stale-weights-of-deep |
Repo | |
Framework | |
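The staleness effect studied above — gradients for the early, pipelined layers being applied after the weights have already moved on — can be simulated on a single device by buffering those gradients and applying them with a fixed delay. The sketch below does this for the first stage of a hypothetical two-stage split; the delay, the split point, and all names are illustrative assumptions, and it models only the delayed-update aspect of pipelining, not the cross-GPU scheduling of the paper's 2-GPU PyTorch implementation.

```python
from collections import deque
import torch
import torch.nn as nn

def train_with_stale_stage1(stage1, stage2, loader, delay=3, lr=0.01, epochs=1):
    """Simulate pipelined backprop: stage-1 gradients are applied `delay` steps late.
    `loader` is assumed to yield (inputs, labels) batches for the two-stage model."""
    opt1 = torch.optim.SGD(stage1.parameters(), lr=lr)
    opt2 = torch.optim.SGD(stage2.parameters(), lr=lr)
    pending = deque()                                  # buffered stage-1 gradients
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in loader:
            opt1.zero_grad(); opt2.zero_grad()
            loss = loss_fn(stage2(stage1(x)), y)
            loss.backward()
            opt2.step()                                # stage 2 updates immediately

            # Stage-1 gradients computed with the current weights are queued and
            # only applied `delay` iterations later, i.e. to already-stale weights.
            pending.append([p.grad.detach().clone() for p in stage1.parameters()])
            if len(pending) > delay:
                for p, g in zip(stage1.parameters(), pending.popleft()):
                    p.grad = g
                opt1.step()
```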
Contextual Inverse Reinforcement Learning
Title | Contextual Inverse Reinforcement Learning |
Authors | Anonymous |
Abstract | We consider the Inverse Reinforcement Learning problem in Contextual Markov Decision Processes. In this setting, the reward, which is unknown to the agent, is a function of a static parameter referred to as the context. There is also an “expert” who knows this mapping and acts according to the optimal policy for each context. The goal of the agent is to learn the expert's mapping by observing demonstrations. We define an optimization problem for finding this mapping and show that when it is linear, the problem is convex. We present and analyze the sample complexity of three algorithms for solving this problem: mirror descent, evolution strategies, and the ellipsoid method. We also extend the first two methods to work with general reward functions, e.g., deep neural networks, but without the theoretical guarantees. Finally, we compare the different techniques empirically in a driving simulation and a medical treatment regime. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1gqraNKwB |
PDF | https://openreview.net/pdf?id=S1gqraNKwB |
PWC | https://paperswithcode.com/paper/contextual-inverse-reinforcement-learning |
Repo | |
Framework | |
TabNet: Attentive Interpretable Tabular Learning
Title | TabNet: Attentive Interpretable Tabular Learning |
Authors | Anonymous |
Abstract | We propose TabNet, a novel high-performance, interpretable deep learning network for tabular data. TabNet uses a sequential attention mechanism that softly selects the features to reason from at each decision step and then aggregates the processed information to make a final prediction. By explicitly selecting sparse features, TabNet learns very efficiently, as the model capacity at each decision step is fully utilized for the most relevant features, resulting in a high-performance model. This sparsity also enables more interpretable decision making through the visualization of feature-selection masks. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of tabular datasets and yields interpretable feature attributions and insights into the global model behavior. |
Tasks | Decision Making, Feature Selection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BylRkAEKDH |
PDF | https://openreview.net/pdf?id=BylRkAEKDH |
PWC | https://paperswithcode.com/paper/tabnet-attentive-interpretable-tabular-1 |
Repo | |
Framework | |
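The sequential, sparse feature selection described in the abstract can be illustrated with a small NumPy sketch: at each decision step a sparsemax over feature scores yields a sparse mask, and a prior term discourages re-selecting features consumed at earlier steps. The fixed `scores` input, the relaxation factor `gamma`, and the function names are simplified assumptions; in TabNet the scores come from a learned attentive transformer.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of each row of z onto the probability simplex (a sparse softmax)."""
    z_sorted = -np.sort(-z, axis=-1)                    # sort each row descending
    k = np.arange(1, z.shape[-1] + 1)
    z_cum = np.cumsum(z_sorted, axis=-1) - 1.0
    support = (k * z_sorted) > z_cum                    # prefix of coordinates in the support
    k_star = support.sum(axis=-1, keepdims=True)
    tau = np.take_along_axis(z_cum, k_star - 1, axis=-1) / k_star
    return np.maximum(z - tau, 0.0)

def sequential_masks(scores, n_steps=3, gamma=1.3):
    """One sparse feature-selection mask per decision step.
    `scores` stands in for the attentive transformer's output, shape (batch, n_features)."""
    prior = np.ones_like(scores)
    masks = []
    for _ in range(n_steps):
        mask = sparsemax(prior * scores)                # sparse selection at this step
        prior = prior * (gamma - mask)                  # penalize features already used
        masks.append(mask)
    return masks

rng = np.random.default_rng(0)
for i, m in enumerate(sequential_masks(rng.normal(size=(2, 6)))):
    print(f"step {i}: {np.round(m, 2)}")
```

Because sparsemax projects onto the simplex, most mask entries are exactly zero, which is what makes the per-step feature attributions directly interpretable.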
Unsupervised Generative 3D Shape Learning from Natural Images
Title | Unsupervised Generative 3D Shape Learning from Natural Images |
Authors | Anonymous |
Abstract | In this paper we present, to the best of our knowledge, the first method to learn a generative model of 3D shapes from natural images in a fully unsupervised way. For example, we do not use any ground-truth 3D or 2D annotations, stereo video, or ego-motion during training. Our approach follows the general strategy of Generative Adversarial Networks, where an image generator network learns to create image samples that are realistic enough to fool a discriminator network into believing that they are natural images. In contrast, in our approach the image generation is split into two stages. In the first stage a generator network outputs 3D objects. In the second, a differentiable renderer produces an image of the 3D object from a random viewpoint. The key observation is that a realistic 3D object should yield a realistic rendering from any plausible viewpoint. Thus, by randomizing the choice of viewpoint, our proposed training forces the generator network to learn an interpretable 3D representation disentangled from the viewpoint. In this work, a 3D representation consists of a triangle mesh and a texture map used to color the triangle surface via UV-mapping. We provide an analysis of our learning approach, expose its ambiguities and show how to overcome them. Experimentally, we demonstrate that our method can learn realistic 3D shapes of faces using only the natural images of the FFHQ dataset. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJgKYlSKvr |
PDF | https://openreview.net/pdf?id=HJgKYlSKvr |
PWC | https://paperswithcode.com/paper/unsupervised-generative-3d-shape-learning |
Repo | |
Framework | |
Demonstration Actor Critic
Title | Demonstration Actor Critic |
Authors | Anonymous |
Abstract | We study the problem of Reinforcement Learning from Demonstrations (RLfD), where the learner is provided with both expert demonstrations and reinforcement signals from the environment. One approach leverages demonstration data in a supervised manner, which is simple and direct, but can only provide a supervision signal over the states seen in the demonstrations. Another approach uses demonstration data for reward shaping; it can provide guidance on how to act even in states that are not seen in the demonstrations. However, existing algorithms of the latter kind adopt a shaping reward that does not directly depend on the current policy, which forces them to treat demonstrated states the same as other states and prevents them from directly exploiting the supervision signal in the demonstration data. In this paper, we propose a novel objective function with a policy-dependent shaping reward, so as to get the best of both worlds. We present a convergence proof for policy iteration under the proposed objective in the tabular setting. We then develop a new practical algorithm, termed Demonstration Actor Critic (DAC). Experiments on a range of popular benchmark sparse-reward tasks show that DAC obtains a significant performance gain over five strong off-the-shelf baselines. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BklRFpVKPH |
PDF | https://openreview.net/pdf?id=BklRFpVKPH |
PWC | https://paperswithcode.com/paper/demonstration-actor-critic |
Repo | |
Framework | |
Split LBI for Deep Learning: Structural Sparsity via Differential Inclusion Paths
Title | Split LBI for Deep Learning: Structural Sparsity via Differential Inclusion Paths |
Authors | Anonymous |
Abstract | Over-parameterization is ubiquitous nowadays in training neural networks, benefiting both optimization, by helping to reach global optima, and generalization, by reducing prediction error. However, compact networks are desired in many real-world applications, and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling over-parameterized models into compact ones, we propose a new approach based on differential inclusions of inverse scale spaces, which generates a family of models from simple to complex by coupling gradient descent and mirror descent to explore model structural sparsity. It has a simple discretization, called the Split Linearized Bregman Iteration (SplitLBI), for which we establish a global convergence analysis in deep learning: from any initialization, the algorithmic iterations converge to a critical point of the empirical risk. Experimental evidence shows that SplitLBI may achieve state-of-the-art performance in large-scale training, e.g., on the ImageNet-2012 dataset, while with early stopping it unveils effective subnetwork architectures whose test accuracies after retraining are comparable to those of dense models, without pruning well-trained networks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkxUrTVKDH |
PDF | https://openreview.net/pdf?id=SkxUrTVKDH |
PWC | https://paperswithcode.com/paper/split-lbi-for-deep-learning-structural |
Repo | |
Framework | |
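The Split Linearized Bregman Iteration referenced above couples the model weights W with an auxiliary variable Gamma through a quadratic penalty and drives Gamma along an inverse-scale-space path via soft-thresholding, so sparse structure emerges early on the path. The sketch below applies the iteration to a plain least-squares problem so the update rule is visible in isolation; the step sizes, the penalty nu, and the soft-threshold form follow the commonly stated SplitLBI recipe and should be read as an assumption, not the authors' deep-learning code.

```python
import numpy as np

def soft_threshold(z, t=1.0):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def split_lbi(X, y, nu=1.0, kappa=10.0, alpha=None, n_iter=500):
    """Split Linearized Bregman Iteration on least squares:
    loss(W, Gamma) = 0.5*||y - X W||^2 / n + ||W - Gamma||^2 / (2*nu)."""
    n, d = X.shape
    alpha = alpha or nu / (2.0 * kappa)      # small step size keeps the iteration stable
    W = np.zeros(d)
    Gamma = np.zeros(d)                      # sparse path variable
    z = np.zeros(d)                          # its Bregman / sub-gradient variable
    path = []
    for _ in range(n_iter):
        grad_W = -X.T @ (y - X @ W) / n + (W - Gamma) / nu
        grad_G = (Gamma - W) / nu
        W = W - kappa * alpha * grad_W       # gradient descent on the weights
        z = z - alpha * grad_G               # linearized Bregman step
        Gamma = kappa * soft_threshold(z)    # shrinkage: coefficients enter the path
        path.append(Gamma.copy())            # roughly in order of magnitude
    return W, np.array(path)

# Toy run: a 3-sparse signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20); w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=200)
W, path = split_lbi(X, y)
print("active early on the path:", np.flatnonzero(path[30]))
print("active at the end:       ", np.flatnonzero(path[-1]))
```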
Localized Meta-Learning: A PAC-Bayes Analysis for Meta-Learning Beyond Global Prior
Title | Localized Meta-Learning: A PAC-Bayes Analysis for Meta-Learning Beyond Global Prior |
Authors | Anonymous |
Abstract | Meta-learning methods learn meta-knowledge from various training tasks and aim to facilitate the learning of new tasks under a task-similarity assumption. However, such meta-knowledge is often represented as a fixed distribution, which is too restrictive to capture task-specific information. In this work, we present a localized meta-learning framework based on PAC-Bayes theory. In particular, we propose an LCC-based prior predictor that allows the meta-learner to adaptively generate local meta-knowledge for a specific task. We further develop a practical algorithm with deep neural networks based on the bound. Empirical results on real-world datasets demonstrate the efficacy of the proposed method. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1gIwgSYwr |
PDF | https://openreview.net/pdf?id=r1gIwgSYwr |
PWC | https://paperswithcode.com/paper/localized-meta-learning-a-pac-bayes-analysis |
Repo | |
Framework | |
Robust Few-Shot Learning with Adversarially Queried Meta-Learners
Title | Robust Few-Shot Learning with Adversarially Queried Meta-Learners |
Authors | Anonymous |
Abstract | Previous work on adversarially robust neural networks requires large training sets and computationally expensive training procedures. On the other hand, few-shot learning methods are highly vulnerable to adversarial examples. The goal of our work is to produce networks which both perform well at few-shot tasks and are simultaneously robust to adversarial examples. We adapt adversarial training for meta-learning, we adapt robust architectural features to small networks for meta-learning, we test pre-processing defenses as an alternative to adversarial training for meta-learning, and we investigate the advantages of robust meta-learning over robust transfer-learning for few-shot tasks. This work provides a thorough analysis of adversarially robust methods in the context of meta-learning, and we lay the foundation for future work on defenses for few-shot tasks. |
Tasks | Few-Shot Learning, Meta-Learning, Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SyekweSFPr |
PDF | https://openreview.net/pdf?id=SyekweSFPr |
PWC | https://paperswithcode.com/paper/robust-few-shot-learning-with-adversarially |
Repo | |
Framework | |
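Adversarial training adapted to meta-learning, as described above, amounts to perturbing the query examples of each episode with an attack such as PGD before the outer-loop loss is computed, so the meta-learner is trained to produce adapted models that are robust at query time. The sketch below shows that episode structure around a generic adaptation routine (assumed here to return an ordinary module, i.e. a first-order adaptation); the L-infinity budget, inputs assumed to lie in [0, 1], and the `adapt` callable are illustrative assumptions, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, step=2 / 255, iters=7):
    """Standard L-infinity PGD on the query set; the model is held fixed."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarially_queried_episode(meta_model, adapt, x_sup, y_sup, x_qry, y_qry):
    """One meta-training episode: adapt on the clean support set, then evaluate
    on adversarially perturbed query examples."""
    adapted = adapt(meta_model, x_sup, y_sup)          # e.g. a few inner SGD steps
    x_qry_adv = pgd_attack(adapted, x_qry, y_qry)      # attack the adapted model
    # The caller backpropagates this robust outer-loop loss into the meta-parameters.
    return F.cross_entropy(adapted(x_qry_adv), y_qry)
```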
A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms
Title | A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms |
Authors | Anonymous |
Abstract | We propose to use a meta-learning objective that maximizes the speed of transfer on a modified distribution to learn how to modularize acquired knowledge. In particular, we focus on how to factor a joint distribution into appropriate conditionals, consistent with the causal directions. We explain when this can work, using the assumption that changes in distributions are localized (e.g., to one of the marginals, due to an intervention on one of the variables). We prove that under this assumption of localized changes in causal mechanisms, the correct causal graph will tend to have only a few of its parameters with non-zero gradient, i.e. the parameters that need to be adapted (those of the modified variables). We argue and observe experimentally that this leads to faster adaptation, and use this property to define a meta-learning surrogate score which, in addition to a continuous parametrization of graphs, favours correct causal graphs. Finally, motivated by the AI-agent point of view (e.g., a robot discovering its environment autonomously), we consider how the same objective can discover the causal variables themselves, as a transformation of observed low-level variables with no causal meaning. Experiments in the two-variable case validate the proposed ideas and theoretical results. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxWIgBFPS |
PDF | https://openreview.net/pdf?id=ryxWIgBFPS |
PWC | https://paperswithcode.com/paper/a-meta-transfer-objective-for-learning-to-1 |
Repo | |
Framework | |
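In the two-variable case the proposed score is easy to make concrete: fit one model per causal direction (A→B and B→A) on the training distribution, then measure how quickly each adapts, in accumulated log-likelihood over a few gradient steps, on data from an intervened distribution; the direction that adapts faster receives the larger structural belief. The sketch below does this with two small categorical models; the data-generating process, the number of adaptation steps, and the softmax over scores are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

N = 10                                              # categories for A and B

class Directed(torch.nn.Module):
    """Factorization P(cause) * P(effect | cause) with free logits."""
    def __init__(self):
        super().__init__()
        self.marg = torch.nn.Parameter(torch.zeros(N))
        self.cond = torch.nn.Parameter(torch.zeros(N, N))
    def log_prob(self, cause, effect):
        lp_c = F.log_softmax(self.marg, 0)[cause]
        lp_e = F.log_softmax(self.cond, 1)[cause, effect]
        return (lp_c + lp_e).mean()

def sample(p_a, p_b_given_a, n=1000):
    a = torch.multinomial(p_a, n, replacement=True)
    b = torch.multinomial(p_b_given_a[a], 1).squeeze(1)
    return a, b

def adaptation_score(model, cause, effect, steps=5, lr=0.1):
    """Accumulated log-likelihood while adapting on post-intervention data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total = 0.0
    for _ in range(steps):
        lp = model.log_prob(cause, effect)
        total += lp.item()
        opt.zero_grad(); (-lp).backward(); opt.step()
    return total

# Ground-truth mechanism A -> B; pre-train both directions on the original distribution.
torch.manual_seed(0)
p_a = torch.softmax(torch.randn(N), 0)
p_b_a = torch.softmax(torch.randn(N, N), 1)
a, b = sample(p_a, p_b_a, 5000)
model_ab, model_ba = Directed(), Directed()
for m, (c, e) in [(model_ab, (a, b)), (model_ba, (b, a))]:
    opt = torch.optim.SGD(m.parameters(), lr=1.0)
    for _ in range(200):
        loss = -m.log_prob(c, e)
        opt.zero_grad(); loss.backward(); opt.step()

p_a_new = torch.softmax(torch.randn(N), 0)           # intervention changes only P(A)
a2, b2 = sample(p_a_new, p_b_a, 200)
score_ab = adaptation_score(model_ab, a2, b2)
score_ba = adaptation_score(model_ba, b2, a2)
belief_ab = torch.softmax(torch.tensor([score_ab, score_ba]), 0)[0]
print(score_ab, score_ba, float(belief_ab))
```

Because the intervention changes only P(A), the A→B model needs to re-fit only its marginal, so its adaptation score tends to be higher — exactly the signal the meta-objective exploits.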
Few-Shot Regression via Learning Sparsifying Basis Functions
Title | Few-Shot Regression via Learning Sparsifying Basis Functions |
Authors | Anonymous |
Abstract | Recent few-shot learning algorithms have enabled models to adapt quickly to new tasks based on only a few training samples. Previous few-shot learning work has mainly focused on classification and reinforcement learning. In this paper, we propose a few-shot meta-learning system that focuses exclusively on regression tasks. Our model is based on the idea that the degrees of freedom of the unknown function can be significantly reduced if it is represented as a linear combination of a set of sparsifying basis functions, which enables a few labeled samples to approximate the function. We design a Basis Function Learner network to encode the basis functions for a task distribution, and a Weights Generator network to generate the weight vector for a novel task. We show that our model outperforms current state-of-the-art meta-learning methods on various regression tasks. |
Tasks | Few-Shot Learning, few-shot regression, Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJxDNxSFDH |
PDF | https://openreview.net/pdf?id=BJxDNxSFDH |
PWC | https://paperswithcode.com/paper/few-shot-regression-via-learning-sparsifying |
Repo | |
Framework | |
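The core idea above — represent the unknown function as a linear combination of basis functions so that a handful of labeled samples suffice to fit the mixture weights — can be sketched with a fixed random Fourier basis and a regularized least-squares fit on the support set. In the paper both the basis (Basis Function Learner) and the weights (Weights Generator) come from learned networks and the combination is encouraged to be sparse, so everything below is an illustrative stand-in, not the authors' model.

```python
import numpy as np

class BasisFunctionRegressor:
    """Few-shot regression with y(x) ~ sum_k w_k * phi_k(x).
    A random Fourier basis stands in for the learned Basis Function Learner;
    a ridge fit on the support set stands in for the Weights Generator."""
    def __init__(self, n_basis=64, length_scale=1.0, reg=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.omega = rng.normal(scale=1.0 / length_scale, size=n_basis)
        self.phase = rng.uniform(0, 2 * np.pi, size=n_basis)
        self.reg = reg
        self.w = None

    def features(self, x):
        return np.cos(np.outer(x, self.omega) + self.phase)   # (n, n_basis)

    def fit(self, x_support, y_support):
        Phi = self.features(x_support)
        A = Phi.T @ Phi + self.reg * np.eye(Phi.shape[1])
        self.w = np.linalg.solve(A, Phi.T @ y_support)         # task-specific weights
        return self

    def predict(self, x):
        return self.features(x) @ self.w

# 5-shot task: query predictions are anchored by five labeled support points.
x_sup = np.array([-4.0, -1.5, 0.0, 2.0, 3.5])
y_sup = np.sin(x_sup)
model = BasisFunctionRegressor().fit(x_sup, y_sup)
x_query = np.linspace(-5, 5, 9)
print(np.round(model.predict(x_query), 2))
print(np.round(np.sin(x_query), 2))
```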