April 1, 2020

2912 words 14 mins read

Paper Group NANR 88


Training binary neural networks with real-to-binary convolutions. FALCON: Fast and Lightweight Convolution for Compressing and Accelerating CNN. Ridge Regression: Structure, Cross-Validation, and Sketching. Confidence Scores Make Instance-dependent Label-noise Learning Possible. FreeLB: Enhanced Adversarial Training for Language Understanding. Mode …

Training binary neural networks with real-to-binary convolutions

Title Training binary neural networks with real-to-binary convolutions
Authors Anonymous
Abstract This paper shows how to train binary networks to within a few percentage points (~3-5%) of their full-precision counterpart with a negligible increase in computational cost. In particular, we first show how to build a strong baseline, which already achieves state-of-the-art accuracy, by combining recently proposed advances and carefully tuning the optimization procedure. Secondly, we show that significant additional accuracy gains can be obtained by minimizing the discrepancy between the output of the binary convolution and that of the corresponding real-valued convolution. We materialize this idea in two complementary ways: (1) with a loss function, during training, that matches the spatial attention maps computed at the output of the binary and real-valued convolutions, and (2) in a data-driven manner, by using the real-valued activations, which are available at inference time prior to binarization, to re-scale the activations right after the binary convolution. Finally, we show that, when putting all of our improvements together, the resulting model reduces the gap to its real-valued counterpart to less than 3% and 5% top-1 error on CIFAR-100 and ImageNet, respectively, when using a ResNet-18 architecture.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJg4NgBKvH
PDF https://openreview.net/pdf?id=BJg4NgBKvH
PWC https://paperswithcode.com/paper/training-binary-neural-networks-with-real-to
Repo
Framework
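A minimal PyTorch sketch of the attention-matching idea in point (1) of the abstract, assuming 4-D feature maps of shape (batch, channels, height, width); the exact normalisation and loss weighting used by the paper are not reproduced here.

```python
import torch.nn.functional as F

def attention_map(feat):
    # Collapse channels into a spatial attention map and L2-normalise it.
    att = feat.pow(2).mean(dim=1)              # (B, H, W)
    return F.normalize(att.flatten(1), dim=1)  # (B, H*W)

def attention_matching_loss(binary_out, real_out):
    # Penalise the discrepancy between the attention maps computed at the output of
    # the binary convolution and the (detached) real-valued convolution.
    diff = attention_map(binary_out) - attention_map(real_out).detach()
    return diff.pow(2).sum(dim=1).mean()
```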

FALCON: Fast and Lightweight Convolution for Compressing and Accelerating CNN

Title FALCON: Fast and Lightweight Convolution for Compressing and Accelerating CNN
Authors Anonymous
Abstract How can we efficiently compress Convolutional Neural Networks (CNN) while retaining their accuracy on classification tasks? A promising direction is depthwise separable convolution, which replaces a standard convolution with a depthwise convolution and a pointwise convolution. However, previous works based on depthwise separable convolution are limited since 1) they are mostly heuristic approaches without a precise understanding of their relation to standard convolution, and 2) their accuracies do not match that of standard convolution. In this paper, we propose FALCON, an accurate and lightweight method for compressing CNN. FALCON is derived by interpreting existing convolution methods based on depthwise separable convolution using EHP, our proposed mathematical formulation that approximates the standard convolution kernel. This interpretation leads to a generalized version, rank-k FALCON, which further improves accuracy while sacrificing a small amount of compression and computation reduction. In addition, we propose FALCON-branch by fitting FALCON into the previous state-of-the-art convolution unit ShuffleUnitV2, which gives even better accuracy. Experiments show that FALCON and FALCON-branch outperform 1) existing methods based on depthwise separable convolution and 2) standard CNN models, achieving up to 8x compression and 8x computation reduction while maintaining similar accuracy. We also demonstrate that rank-k FALCON provides even better accuracy than standard convolution in many cases, while using fewer parameters and floating-point operations.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BylXi3NKvS
PDF https://openreview.net/pdf?id=BylXi3NKvS
PWC https://paperswithcode.com/paper/falcon-fast-and-lightweight-convolution-for-1
Repo
Framework
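For context on the factorisation FALCON builds on, a minimal PyTorch sketch of a depthwise separable convolution (a depthwise convolution followed by a pointwise 1x1 convolution); the EHP formulation and the rank-k generalisation proposed by the paper are not reproduced here, and channel sizes are illustrative.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: a 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```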

Ridge Regression: Structure, Cross-Validation, and Sketching

Title Ridge Regression: Structure, Cross-Validation, and Sketching
Authors Anonymous
Abstract We study the following three fundamental problems about ridge regression: (1) what is the structure of the estimator? (2) how to correctly use cross-validation to choose the regularization parameter? and (3) how to accelerate computation without losing too much accuracy? We consider the three problems in a unified large-data linear model. We give a precise representation of ridge regression as a covariance matrix-dependent linear combination of the true parameter and the noise. We study the bias of $K$-fold cross-validation for choosing the regularization parameter, and propose a simple bias-correction. We analyze the accuracy of primal and dual sketching for ridge regression, showing they are surprisingly accurate. Our results are illustrated by simulations and by analyzing empirical data.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HklRwaEKwB
PDF https://openreview.net/pdf?id=HklRwaEKwB
PWC https://paperswithcode.com/paper/ridge-regression-structure-cross-validation
Repo
Framework
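A minimal NumPy sketch of the two objects the abstract studies, the ridge estimator and K-fold cross-validation over the regularization parameter; the paper's bias-correction and sketching analysis are not reproduced here.

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge estimator: (X^T X + lam I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    # Plain K-fold cross-validation estimate of the test error for a given lambda.
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = ridge(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errs)
```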

Confidence Scores Make Instance-dependent Label-noise Learning Possible

Title Confidence Scores Make Instance-dependent Label-noise Learning Possible
Authors Anonymous
Abstract Learning with noisy labels has drawn a lot of attention. Most recent works in this area only consider class-conditional noise, where the label noise is independent of the input features. This noise model may not be faithful to many real-world applications. A few pioneering works have instead studied instance-dependent noise, but these methods rely on strong assumptions about the noise model. To alleviate this issue, we introduce confidence-scored instance-dependent noise (CSIDN), where each instance-label pair is associated with a confidence score. The confidence scores are sufficient to estimate the noise functions of each instance with minimal assumptions. Moreover, such scores can be easily and cheaply derived during the construction of the dataset through crowdsourcing or automatic annotation. To handle CSIDN, we design a benchmark algorithm termed instance-level forward correction. Empirical results on synthetic and real-world datasets demonstrate the utility of our proposed method.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SyevDaVYwr
PDF https://openreview.net/pdf?id=SyevDaVYwr
PWC https://paperswithcode.com/paper/confidence-scores-make-instance-dependent
Repo
Framework
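The "instance-level forward correction" can be illustrated generically: if each instance carries its own noise-transition matrix (which the paper estimates from the confidence scores), the model's clean-class probabilities are pushed through that matrix before the loss is applied to the noisy label. A hedged PyTorch sketch, assuming the per-instance matrices are already available:

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_labels, transition):
    # logits: (B, C); noisy_labels: (B,); transition: (B, C, C), row-stochastic per instance,
    # with transition[i, j, k] = p(noisy label k | clean label j) for instance i.
    clean_probs = torch.softmax(logits, dim=1)
    noisy_probs = torch.bmm(clean_probs.unsqueeze(1), transition).squeeze(1)
    return F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_labels)
```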

FreeLB: Enhanced Adversarial Training for Language Understanding

Title FreeLB: Enhanced Adversarial Training for Language Understanding
Authors Anonymous
Abstract Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, which promotes higher robustness and invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test score of the BERT-base model from 78.3 to 79.4, and that of the RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art test accuracies of 85.39% and 67.32% on ARC-Easy and ARC-Challenge. Experiments on the CommonsenseQA benchmark further demonstrate that FreeLB generalizes and boosts the performance of the RoBERTa-large model on other tasks as well.
Tasks Word Embeddings
Published 2020-01-01
URL https://openreview.net/forum?id=BygzbyHFvB
PDF https://openreview.net/pdf?id=BygzbyHFvB
PWC https://paperswithcode.com/paper/freelb-enhanced-adversarial-training-for-1
Repo
Framework
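A hedged sketch of the general pattern the abstract describes: add a perturbation to the word embeddings, take a few ascent steps on it, and accumulate the model's gradients from every step. The `inputs_embeds`/`labels` interface (a callable returning a scalar loss) and all hyperparameters are assumptions, not FreeLB's exact recipe.

```python
import torch

def adversarial_embedding_step(model, embeds, labels, steps=3, adv_lr=0.1, eps=1.0):
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=embeds + delta, labels=labels) / steps  # assumed interface
        loss.backward()                 # accumulates gradients for both the parameters and delta
        with torch.no_grad():
            g = delta.grad
            delta += adv_lr * g / (g.norm() + 1e-12)                # ascent step on the perturbation
            delta *= min(1.0, eps / (delta.norm().item() + 1e-12))  # project onto an L2 ball
        delta.grad.zero_()
    # The caller then runs optimizer.step() on the accumulated parameter gradients.
```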

Modelling the influence of data structure on learning in neural networks

Title Modelling the influence of data structure on learning in neural networks
Authors Anonymous
Abstract The lack of crisp mathematical models that capture the structure of real-world data sets is a major obstacle to a detailed theoretical understanding of deep neural networks. Here, we first demonstrate the effect of structured data sets by experimentally comparing the dynamics and the performance of two-layer networks trained on two different data sets: (i) an unstructured synthetic data set containing random i.i.d. inputs, and (ii) a simple canonical data set such as MNIST images. Our analysis reveals two phenomena related to the dynamics of the networks and their ability to generalise that only appear when training on structured data sets. Second, we introduce a generative model for data sets, where high-dimensional inputs lie on a lower-dimensional manifold and have labels that depend only on their position within this manifold. We call it the hidden manifold model, and we experimentally demonstrate that training networks on data sets drawn from this model reproduces both phenomena seen during training on MNIST.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJlisySYPS
PDF https://openreview.net/pdf?id=BJlisySYPS
PWC https://paperswithcode.com/paper/modelling-the-influence-of-data-structure-on
Repo
Framework
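A hedged NumPy sketch of a hidden-manifold-style generator: inputs are a nonlinear function of low-dimensional latent coordinates, and labels depend only on those coordinates. The specific nonlinearity and label rule below are illustrative assumptions.

```python
import numpy as np

def hidden_manifold_data(n=1000, latent_dim=8, input_dim=784, seed=0):
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((n, latent_dim))           # latent coordinates on the manifold
    F = rng.standard_normal((latent_dim, input_dim))   # fixed projection to input space
    X = np.tanh(C @ F / np.sqrt(latent_dim))           # high-dimensional inputs on a nonlinear manifold
    w = rng.standard_normal(latent_dim)
    y = np.sign(C @ w)                                 # labels depend only on the latent position
    return X, y
```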

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Title Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
Authors Anonymous
Abstract Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive decides for itself whether it wishes to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision, and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.
Tasks Hierarchical Reinforcement Learning
Published 2020-01-01
URL https://openreview.net/forum?id=ryxgJTEYDr
PDF https://openreview.net/pdf?id=ryxgJTEYDr
PWC https://paperswithcode.com/paper/reinforcement-learning-with-competitive-1
Repo
Framework
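A hedged PyTorch sketch of the decentralized selection mechanism described in the abstract: each primitive encodes the state, its KL to a fixed prior measures how much information it requests, and the primitive requesting the most information acts. The architecture and the use of a diagonal Gaussian encoder are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Primitive(nn.Module):
    def __init__(self, state_dim, code_dim, action_dim):
        super().__init__()
        self.enc = nn.Linear(state_dim, 2 * code_dim)   # mean and log-variance of the state code
        self.policy = nn.Linear(code_dim, action_dim)

    def forward(self, state):
        mu, logvar = self.enc(state).chunk(2, dim=-1)
        # KL(q(z|s) || N(0, I)) measures how much state information this primitive requests.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.policy(z), kl

def select_primitive(primitives, state):
    actions, kls = zip(*[p(state) for p in primitives])
    winner = torch.stack(kls).argmax(dim=0)   # the primitive requesting the most information acts
    return winner, actions, kls
```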

Adversarially Robust Representations with Smooth Encoders

Title Adversarially Robust Representations with Smooth Encoders
Authors Anonymous
Abstract This paper studies the undesired phenomenon of over-sensitivity of representations learned by deep networks to semantically-irrelevant changes in data. We identify a cause for this shortcoming in the classical Variational Auto-encoder (VAE) objective, the evidence lower bound (ELBO). We show that the ELBO fails to control the behaviour of the encoder outside the support of the empirical data distribution, and that this behaviour of the VAE can lead to extreme errors in the learned representation. This is a key hurdle in the effective use of representations for data-efficient learning and transfer. To address this problem, we propose to augment the data with specifications that enforce insensitivity of the representation with respect to families of transformations. To incorporate these specifications, we propose a regularization method based on a selection mechanism that creates a fictive data point by explicitly perturbing an observed true data point. For certain choices of parameters, our formulation naturally leads to the minimization of the entropy-regularized Wasserstein distance between representations. We illustrate our approach on standard datasets and experimentally show that significant improvements in downstream adversarial accuracy can be achieved by learning robust representations completely in an unsupervised manner, without reference to a particular downstream task and without a costly supervised adversarial training procedure.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1gfFaEYDS
PDF https://openreview.net/pdf?id=H1gfFaEYDS
PWC https://paperswithcode.com/paper/adversarially-robust-representations-with
Repo
Framework
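At its simplest, the insensitivity specification can be illustrated by penalising the encoder whenever a perturbed ("fictive") version of a data point is mapped far from the original. The stand-in below is only illustrative; the paper's selection mechanism and entropy-regularized Wasserstein formulation are more involved.

```python
def insensitivity_penalty(encoder, x, transform):
    # Penalise representation changes under a semantically-irrelevant transformation.
    z, z_fictive = encoder(x), encoder(transform(x))
    return (z - z_fictive).pow(2).mean()
```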

Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base

Title Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base
Authors Anonymous
Abstract We describe a novel way of representing a symbolic knowledge base (KB) called a sparse-matrix reified KB. This representation enables neural modules that are fully differentiable, faithful to the original semantics of the KB, expressive enough to model multi-hop inferences, and scalable enough to use with realistically large KBs. The sparse-matrix reified KB can be distributed across multiple GPUs, can scale to tens of millions of entities and facts, and is orders of magnitude faster than naive sparse-matrix implementations. The reified KB enables very simple end-to-end architectures to obtain competitive performance on several benchmarks representing two families of tasks: KB completion, and learning semantic parsers from denotations.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJlguT4YPr
PDF https://openreview.net/pdf?id=BJlguT4YPr
PWC https://paperswithcode.com/paper/scalable-neural-methods-for-reasoning-with-a
Repo
Framework
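The core operation behind this style of neural KB reasoning can be sketched with ordinary sparse linear algebra: entities are (weighted) indicator vectors, a relation is a sparse adjacency matrix, and following a relation is a sparse matrix-vector product. The paper's reified representation packs all triples into a single set of sparse matrices; the snippet below only shows the simpler per-relation view.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_entities = 5
pairs = [(0, 1), (1, 2), (3, 4)]                 # (subject, object) pairs for one relation
rows, cols = zip(*pairs)
relation = csr_matrix((np.ones(len(pairs)), (rows, cols)), shape=(n_entities, n_entities))

x = np.zeros(n_entities)
x[0] = 1.0                                       # start from entity 0
hop1 = relation.T @ x                            # entities reachable in one hop
hop2 = relation.T @ hop1                         # multi-hop inference by repeated sparse products
```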

Curriculum Loss: Robust Learning and Generalization against Label Corruption

Title Curriculum Loss: Robust Learning and Generalization against Label Corruption
Authors Anonymous
Abstract Deep neural networks (DNNs) have great expressive power, which can even memorize samples with wrong labels. It is vitally important to revisit robustness and generalization of DNNs under label corruption. To this end, this paper studies the 0-1 loss, which has a monotonic relationship with the empirical adversarial (reweighted) risk (Hu et al. 2018). Although the 0-1 loss is robust to outliers, it is also difficult to optimize. To efficiently optimize the 0-1 loss while keeping its robustness properties, we propose a very simple and efficient loss, the curriculum loss (CL). Our CL is a tighter upper bound of the 0-1 loss than conventional summation-based surrogate losses. Moreover, CL can adaptively select samples for stagewise training. As a result, our loss can be viewed as a novel curriculum sample selection strategy that bridges curriculum learning and robust learning. Experimental results on noisy MNIST, CIFAR10 and CIFAR100 datasets validate the robustness of the proposed loss.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rkgt0REKwS
PDF https://openreview.net/pdf?id=rkgt0REKwS
PWC https://paperswithcode.com/paper/curriculum-loss-robust-learning-and-1
Repo
Framework
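A hedged sketch of the sample-selection flavour of the approach: compute per-sample surrogate losses and let only samples below a threshold contribute to the current update, so training proceeds stage by stage on examples the model currently fits. This illustrates the selection idea only; it is not the paper's exact curriculum loss or its 0-1 upper bound.

```python
import torch
import torch.nn.functional as F

def selected_loss(logits, labels, threshold=1.0):
    per_sample = F.cross_entropy(logits, labels, reduction='none')
    mask = (per_sample < threshold).float()      # keep samples the model currently fits well
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
```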

Improving Gradient Estimation in Evolutionary Strategies With Past Descent Directions

Title Improving Gradient Estimation in Evolutionary Strategies With Past Descent Directions
Authors Anonymous
Abstract We propose a novel method to optimally incorporate surrogate gradient information. Our approach, unlike previous work, needs no information about the quality of the surrogate gradients and is always guaranteed to find a descent direction that is better than the surrogate gradient. This allows us to iteratively use the previous gradient estimate as a surrogate gradient for the current search point. We theoretically prove that this yields fast convergence to the true gradient for linear functions, and we show under simplifying assumptions that it significantly improves gradient estimates for general functions. Finally, we evaluate our approach empirically on MNIST and reinforcement learning tasks and show that it considerably improves the gradient estimation of ES at no extra computational cost.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1lOUeSFvB
PDF https://openreview.net/pdf?id=H1lOUeSFvB
PWC https://paperswithcode.com/paper/improving-gradient-estimation-in-evolutionary
Repo
Framework
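One simple way to picture reusing a surrogate direction inside an ES-style estimator, under strong assumptions (a flat parameter vector, exact function evaluations): estimate directional derivatives along the normalised surrogate direction and along random directions kept orthogonal to it, then sum the components. The paper's optimal weighting of the two parts is not reproduced here.

```python
import numpy as np

def es_gradient_with_surrogate(f, theta, surrogate, n_random=8, sigma=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    u = surrogate / (np.linalg.norm(surrogate) + 1e-12)
    # Directional derivative along the surrogate direction (e.g. the previous estimate).
    grad = ((f(theta + sigma * u) - f(theta - sigma * u)) / (2 * sigma)) * u
    for _ in range(n_random):
        v = rng.standard_normal(theta.shape)
        v -= (v @ u) * u                          # keep the random direction orthogonal to the surrogate
        v /= np.linalg.norm(v) + 1e-12
        grad += ((f(theta + sigma * v) - f(theta - sigma * v)) / (2 * sigma)) * v
    return grad
```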

On the Global Convergence of Training Deep Linear ResNets

Title On the Global Convergence of Training Deep Linear ResNets
Authors Anonymous
Abstract We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks \citep{du2019width}, our condition on the neural network width is sharper by a factor of $O(\kappa L)$, where $\kappa$ denotes the condition number of the covariance matrix of the training data. In addition, for the first time we establish the global convergence of SGD for training deep linear ResNets and prove a linear convergence rate when the global minimum is $0$.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJxEhREKDH
PDF https://openreview.net/pdf?id=HJxEhREKDH
PWC https://paperswithcode.com/paper/on-the-global-convergence-of-training-deep
Repo
Framework
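A minimal NumPy sketch of the model class analysed: a deep linear residual network with fixed (untrained) input and output linear maps A and B and zero-initialised hidden weights, so that at initialisation the network computes B A x. Dimensions below are illustrative.

```python
import numpy as np

def deep_linear_resnet(x, A, B, residual_weights):
    h = A @ x
    for W in residual_weights:      # each hidden layer: h <- h + W h
        h = h + W @ h
    return B @ h

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, L = 10, 32, 1, 5
A = rng.standard_normal((d_hidden, d_in))                   # fixed input transformation
B = rng.standard_normal((d_out, d_hidden))                  # fixed output transformation
Ws = [np.zeros((d_hidden, d_hidden)) for _ in range(L)]     # zero initialization on all hidden weights
y = deep_linear_resnet(rng.standard_normal(d_in), A, B, Ws)
```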

Identity Crisis: Memorization and Generalization Under Extreme Overparameterization

Title Identity Crisis: Memorization and Generalization Under Extreme Overparameterization
Authors Anonymous
Abstract We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two forms: the constant function (memorization) and the identity function (generalization). We formally characterize generalization in single-layer FCNs and CNNs. We show empirically that different architectures exhibit strikingly different inductive biases. For example, CNNs of up to 10 layers are able to generalize from a single example, whereas FCNs cannot learn the identity function reliably from 60k examples. Deeper CNNs often fail, but nonetheless do astonishing work to memorize the training output: because CNN biases are location invariant, the model must progressively grow an output pattern from the image boundaries via the coordination of many layers. Our work helps to quantify and visualize the sensitivity of inductive biases to architectural choices such as depth, kernel width, and number of channels.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1l6y0VFPr
PDF https://openreview.net/pdf?id=B1l6y0VFPr
PWC https://paperswithcode.com/paper/identity-crisis-memorization-and-1
Repo
Framework
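A hedged PyTorch sketch of the single-example identity-mapping experiment: train a small network to reproduce one input, then probe it on a fresh input to see whether it learned the identity (generalisation) or something closer to a constant function (memorisation). The architecture and sizes are illustrative, not those used in the paper.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)                         # the single training example
net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x) - x) ** 2).mean()                 # reconstruct the single training input
    loss.backward()
    opt.step()

with torch.no_grad():
    probe = torch.randn(1, 1, 28, 28)                 # an input never seen during training
    print(((net(probe) - probe) ** 2).mean())         # small if the net learned the identity
```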

Conservative Uncertainty Estimation By Fitting Prior Networks

Title Conservative Uncertainty Estimation By Fitting Prior Networks
Authors Anonymous
Abstract Obtaining high-quality uncertainty estimates is essential for many applications of deep neural networks. In this paper, we theoretically justify a scheme for estimating uncertainties based on sampling from a prior distribution. Crucially, the uncertainty estimates are shown to be conservative in the sense that they never underestimate the posterior uncertainty obtained by a hypothetical Bayesian algorithm. We also show concentration, implying that the uncertainty estimates converge to zero as we get more data. Uncertainty estimates obtained from random priors can be adapted to any deep network architecture and trained using standard supervised learning pipelines. We provide an experimental evaluation of random priors on calibration and out-of-distribution detection for typical computer vision tasks, demonstrating that they outperform deep ensembles in practice.
Tasks Calibration, Out-of-Distribution Detection
Published 2020-01-01
URL https://openreview.net/forum?id=BJlahxHYDS
PDF https://openreview.net/pdf?id=BJlahxHYDS
PWC https://paperswithcode.com/paper/conservative-uncertainty-estimation-by
Repo
Framework
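A hedged PyTorch sketch in the spirit of the random-prior scheme the abstract describes: draw a fixed, randomly initialised prior network, fit a trainable predictor to its outputs on the training data, and use the residual error on a new input as the uncertainty estimate. Network sizes and training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

prior = make_net()
for p in prior.parameters():
    p.requires_grad_(False)                            # the randomly initialised prior stays fixed
predictor = make_net()
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

X_train = torch.randn(256, 2)
for _ in range(500):
    opt.zero_grad()
    loss = ((predictor(X_train) - prior(X_train)) ** 2).mean()
    loss.backward()
    opt.step()

def uncertainty(x):
    # Residual error is small near the training data and grows out of distribution.
    with torch.no_grad():
        return ((predictor(x) - prior(x)) ** 2).mean(dim=1)
```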

Fast Task Inference with Variational Intrinsic Successor Features

Title Fast Task Inference with Variational Intrinsic Successor Features
Authors Anonymous
Abstract It has been established that diverse behaviors spanning the controllable subspace of a Markov decision process can be trained by rewarding a policy for being distinguishable from other policies. However, one limitation of this formulation is the difficulty of generalizing beyond the finite set of behaviors being explicitly learned, as may be needed in subsequent tasks. Successor features provide an appealing solution to this generalization problem, but require defining the reward function as linear in some grounded feature space. In this paper, we show that these two techniques can be combined, and that each method solves the other’s primary limitation. To do so we introduce Variational Intrinsic Successor FeatuRes (VISR), a novel algorithm which learns controllable features that can be leveraged to provide enhanced generalization and fast task inference through the successor features framework. We empirically validate VISR on the full Atari suite, in a novel setup wherein the rewards are only exposed briefly after a long unsupervised phase. Achieving human-level performance on 12 games and beating all baselines, we believe VISR represents a step towards agents that rapidly learn from limited feedback.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJeAHkrYDS
PDF https://openreview.net/pdf?id=BJeAHkrYDS
PWC https://paperswithcode.com/paper/fast-task-inference-with-variational-1
Repo
Framework
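The fast-task-inference part of the abstract can be pictured with a small sketch: once rewards are briefly exposed, fit a task vector w so that rewards are approximately linear in the learned features, r(s) ≈ φ(s)·w, and hand w to the successor-features machinery. The regularized least-squares fit below is an illustrative assumption; learning the features themselves (the variational intrinsic part) is not shown.

```python
import numpy as np

def infer_task_vector(features, rewards, ridge=1e-3):
    # features: (N, d) array of φ(s); rewards: (N,) rewards observed in the brief exposure phase.
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + ridge * np.eye(d), features.T @ rewards)
```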