Paper Group NANR 91
Augmenting Self-attention with Persistent Memory
Title | Augmenting Self-attention with Persistent Memory |
Authors | Anonymous |
Abstract | Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role to the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character- and word-level language modeling benchmarks. |
Tasks | Language Modelling, Machine Translation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HklJdaNYPH |
PDF | https://openreview.net/pdf?id=HklJdaNYPH |
PWC | https://paperswithcode.com/paper/augmenting-self-attention-with-persistent-1 |
Repo | |
Framework | |
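To make the mechanism concrete, here is a minimal PyTorch sketch of self-attention augmented with persistent memory vectors, following the abstract's description; the class name, layer sizes, and single-head formulation are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PersistentMemoryAttention(nn.Module):
    """Self-attention whose keys and values are extended with learned
    'persistent' vectors shared across all inputs (a single-head sketch)."""
    def __init__(self, dim, n_persistent=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Learned input-independent vectors; per the paper's intuition they
        # take over the role of the feed-forward layer.
        self.mem_k = nn.Parameter(torch.randn(n_persistent, dim) / dim ** 0.5)
        self.mem_v = nn.Parameter(torch.randn(n_persistent, dim) / dim ** 0.5)

    def forward(self, x):                         # x: (batch, seq, dim)
        b, _, d = x.shape
        q = self.q(x)
        k = torch.cat([self.k(x), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v(x), self.mem_v.expand(b, -1, -1)], dim=1)
        att = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
        return att @ v                            # (batch, seq, dim)
```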
Ensemble Distribution Distillation
Title | Ensemble Distribution Distillation |
Authors | Anonymous |
Abstract | Ensembles of models often yield improvements in system performance. These ensemble approaches have also been empirically shown to yield robust measures of uncertainty, and are capable of distinguishing between different forms of uncertainty. However, ensembles come at a computational and memory cost which may be prohibitive for many applications. There has been significant work done on the distillation of an ensemble into a single model. Such approaches decrease computational cost and allow a single model to achieve an accuracy comparable to that of an ensemble. However, information about the diversity of the ensemble, which can yield estimates of different forms of uncertainty, is lost. This work considers the novel task of Ensemble Distribution Distillation (EnD^2) - distilling the distribution of the predictions from an ensemble, rather than just the average prediction, into a single model. EnD^2 enables a single model to retain both the improved classification performance of ensemble distillation as well as information about the diversity of the ensemble, which is useful for uncertainty estimation. A solution for EnD^2 based on Prior Networks, a class of models which allow a single neural network to explicitly model a distribution over output distributions, is proposed in this work. The properties of EnD^2 are investigated on both an artificial dataset, and on the CIFAR-10, CIFAR-100 and TinyImageNet datasets, where it is shown that EnD^2 can approach the classification performance of an ensemble, and outperforms both standard DNNs and Ensemble Distillation on the tasks of misclassification and out-of-distribution input detection. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BygSP6Vtvr |
PDF | https://openreview.net/pdf?id=BygSP6Vtvr |
PWC | https://paperswithcode.com/paper/ensemble-distribution-distillation-1 |
Repo | |
Framework | |
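The core of EnD^2 is fitting a Dirichlet over the probability simplex to the ensemble members' predictions rather than to their mean. Below is a hedged sketch of that per-batch objective; the function name and the log-parameterisation of the concentration parameters are illustrative assumptions.

```python
import torch

def end2_loss(log_alphas, ensemble_probs, eps=1e-8):
    """Negative log-likelihood of the ensemble members' predictive
    distributions under the student's Dirichlet (a sketch of EnD^2).
    log_alphas:     (batch, classes)           student network output
    ensemble_probs: (batch, members, classes)  teacher ensemble predictions
    """
    alphas = log_alphas.exp().unsqueeze(1)          # (batch, 1, classes)
    # Dirichlet log-density: log G(sum a) - sum log G(a) + sum (a-1) log p
    log_norm = torch.lgamma(alphas.sum(-1)) - torch.lgamma(alphas).sum(-1)
    log_lik = log_norm + ((alphas - 1) * (ensemble_probs + eps).log()).sum(-1)
    return -log_lik.mean()
```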
Automated curriculum generation through setter-solver interactions
Title | Automated curriculum generation through setter-solver interactions |
Authors | Anonymous |
Abstract | Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance. But in dynamic or sparsely rewarding environments, these correlations are often too small, or rewarding events are too infrequent, to make learning feasible. Human education instead relies on curricula (the breakdown of tasks into simpler, static challenges with dense rewards) to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time-consuming. This has led researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich, dynamic environments. Using a setter-solver paradigm, we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1e0Wp4KvH |
PDF | https://openreview.net/pdf?id=H1e0Wp4KvH |
PWC | https://paperswithcode.com/paper/automated-curriculum-generation-through |
Repo | |
Framework | |
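Since the abstract names three criteria for the setter (validity, feasibility, coverage) without detailing the losses, the following is only a schematic training step; `setter`, `solver`, `env`, and all three loss terms are hypothetical stand-ins indicating where each criterion enters.

```python
def setter_solver_step(setter, solver, env, desired_success=0.5):
    """One schematic setter-solver interaction (hypothetical interfaces)."""
    goal, log_prob = setter.sample_goal()        # setter proposes a goal
    trajectory = env.rollout(solver, goal)       # solver attempts it
    solved = float(trajectory.success)

    # Feasibility: prefer goals the solver achieves at a target rate.
    feasibility_loss = (solved - desired_success) ** 2
    # Validity: only reinforce goals that were achievable in principle.
    validity_loss = -log_prob * float(trajectory.goal_was_valid)
    # Coverage: keep the proposal distribution spread over goal space.
    coverage_loss = -setter.entropy()
    return feasibility_loss + validity_loss + coverage_loss
```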
PolyGAN: High-Order Polynomial Generators
Title | PolyGAN: High-Order Polynomial Generators |
Authors | Anonymous |
Abstract | Generative Adversarial Networks (GANs) have become the gold standard when it comes to learning generative models for high-dimensional distributions. Since their advent, numerous variations of GANs have been introduced in the literature, primarily focusing on utilization of novel loss functions, optimization/regularization strategies and network architectures. In this paper, we turn our attention to the generator and investigate the use of high-order polynomials as an alternative class of universal function approximators. Concretely, we propose PolyGAN, where we model the data generator by means of a high-order polynomial whose unknown parameters are naturally represented by high-order tensors. We introduce two tensor decompositions that significantly reduce the number of parameters and show how they can be efficiently implemented by hierarchical neural networks that only employ linear/convolutional blocks. We exhibit for the first time that by using our approach a GAN generator can approximate the data distribution without using any activation functions. Thorough experimental evaluation on both synthetic and real data (images and 3D point clouds) demonstrates the merits of PolyGAN against the state of the art. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bye30kSYDH |
PDF | https://openreview.net/pdf?id=Bye30kSYDH |
PWC | https://paperswithcode.com/paper/polygan-high-order-polynomial-generators-1 |
Repo | |
Framework | |
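One way a generator can be a polynomial of the noise while using only linear layers and Hadamard products (no activations) is the recursion sketched below; the sizes and the particular decomposition are assumptions, not necessarily the paper's.

```python
import torch
from torch import nn

class PolynomialGenerator(nn.Module):
    """Generator whose output is a degree-`order` polynomial of z, built
    solely from linear maps and elementwise products (a sketch)."""
    def __init__(self, z_dim=128, hidden=256, out_dim=784, order=4):
        super().__init__()
        self.inject = nn.ModuleList(
            nn.Linear(z_dim, hidden, bias=False) for _ in range(order))
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, z):
        x = self.inject[0](z)
        for layer in self.inject[1:]:
            x = layer(z) * x + x     # each step raises the polynomial degree
        return self.out(x)
```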
Mathematical Reasoning in Latent Space
Title | Mathematical Reasoning in Latent Space |
Authors | Anonymous |
Abstract | We design and conduct a simple experiment to study whether neural networks can perform several steps of approximate reasoning in a fixed dimensional latent space. The set of rewrites (i.e. transformations) that can be successfully performed on a statement represents essential semantic features of the statement. We can compress this information by embedding the formula in a vector space, such that the vector associated with a statement can be used to predict whether a statement can be rewritten by other theorems. Predicting the embedding of a formula generated by some rewrite rule is naturally viewed as approximate reasoning in the latent space. In order to measure the effectiveness of this reasoning, we perform approximate deduction sequences in the latent space and use the resulting embedding to inform the semantic features of the corresponding formal statement (which is obtained by performing the corresponding rewrite sequence using real formulas). Our experiments show that graph neural networks can make non-trivial predictions about the rewrite-success of statements, even when they propagate predicted latent representations for several steps. Since our corpus of mathematical formulas includes a wide variety of mathematical disciplines, this experiment is a strong indicator for the feasibility of deduction in latent space in general. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Ske31kBtPr |
PDF | https://openreview.net/pdf?id=Ske31kBtPr |
PWC | https://paperswithcode.com/paper/mathematical-reasoning-in-latent-space-1 |
Repo | |
Framework | |
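A minimal sketch of the experiment's two ingredients, under assumed names and sizes: a head that scores whether a rewrite applies to an embedded formula, and a head that predicts the embedding of the rewritten formula, iterated for several latent steps.

```python
import torch
from torch import nn

class LatentRewriter(nn.Module):
    """Approximate deduction in embedding space (a sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.success = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 1))
        self.next_emb = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, formula_emb, rule_emb, steps=3):
        x, scores = formula_emb, []
        for _ in range(steps):               # propagate predicted embeddings
            pair = torch.cat([x, rule_emb], dim=-1)
            scores.append(torch.sigmoid(self.success(pair)))
            x = self.next_emb(pair)          # never decode back to formulas
        return x, scores
```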
Neural Execution of Graph Algorithms
Title | Neural Execution of Graph Algorithms |
Authors | Anonymous |
Abstract | Graph Neural Networks (GNNs) are a powerful representational tool for solving problems on graph-structured inputs. In almost all cases so far, however, they have been applied to directly recovering a final solution from raw inputs, without explicit guidance on how to structure their problem-solving. Here, instead, we focus on learning in the space of algorithms: we train several state-of-the-art GNN architectures to imitate individual steps of classical graph algorithms, parallel (breadth-first search, Bellman-Ford) as well as sequential (Prim’s algorithm). As graph algorithms usually rely on making discrete decisions within neighbourhoods, we hypothesise that maximisation-based message passing neural networks are best-suited for such objectives, and validate this claim empirically. We also demonstrate how learning in the space of algorithms can yield new opportunities for positive transfer between tasks—showing how learning a shortest-path algorithm can be substantially improved when simultaneously learning a reachability algorithm. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkgKO0EtvS |
PDF | https://openreview.net/pdf?id=SkgKO0EtvS |
PWC | https://paperswithcode.com/paper/neural-execution-of-graph-algorithms-1 |
Repo | |
Framework | |
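The paper's hypothesis is that max aggregation suits the discrete per-neighbourhood decisions graph algorithms make. A sketch of one such message-passing step follows; `msg_fn` (mapping node features to equally sized messages) and the edge-list format are assumptions.

```python
import torch

def max_mpnn_step(h, edge_index, msg_fn):
    """One maximisation-based message-passing step (a sketch).
    h: (nodes, dim) node states; edge_index: (2, edges) source/target ids."""
    src, dst = edge_index
    messages = msg_fn(h[src])                       # one message per edge
    out = torch.full_like(h, float('-inf'))
    out = out.scatter_reduce(0, dst.unsqueeze(-1).expand_as(messages),
                             messages, reduce='amax', include_self=True)
    return torch.maximum(out, h)     # max-aggregate messages with own state
```

With `h` holding 0/1 reachability flags and `msg_fn` the identity, one application of this update is exactly one frontier expansion of breadth-first search.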
Deep Double Descent: Where Bigger Models and More Data Hurt
Title | Deep Double Descent: Where Bigger Models and More Data Hurt |
Authors | Anonymous |
Abstract | We show that a variety of modern deep learning tasks exhibit a “double-descent” phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity, and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1g5sA4twr |
PDF | https://openreview.net/pdf?id=B1g5sA4twr |
PWC | https://paperswithcode.com/paper/deep-double-descent-where-bigger-models-and |
Repo | |
Framework | |
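Reproducing the model-size curve only requires a sweep over capacity; the skeleton below shows its shape, with `make_model`, `train`, and `test_error` as hypothetical helpers rather than the paper's code.

```python
# Sweep capacity and watch test error trace out double descent:
# it worsens approaching the interpolation threshold, then improves again.
widths = [2 ** k for k in range(2, 10)]
errors = []
for w in widths:
    model = make_model(width=w)      # hypothetical model constructor
    train(model, epochs=100)         # sweeping epochs probes the same effect
    errors.append(test_error(model))
```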
The Early Phase of Neural Network Training
Title | The Early Phase of Neural Network Training |
Authors | Anonymous |
Abstract | Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state and its updates during these early iterations of training, and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are label-agnostic, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hkl1iRNFwS |
PDF | https://openreview.net/pdf?id=Hkl1iRNFwS |
PWC | https://paperswithcode.com/paper/the-early-phase-of-neural-network-training |
Repo | |
Framework | |
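The sign-preserving reinitialization probe the abstract refers to can be sketched in a few lines; redrawing magnitudes by shuffling the network's own weight magnitudes is one plausible choice of redraw, assumed here for illustration.

```python
import torch

def reinit_keep_signs(weights_at_k):
    """Keep each weight's sign from training iteration k, but redraw its
    magnitude (here: by shuffling the layer's own magnitudes). The paper
    reports early-phase networks are *not* robust to this perturbation."""
    out = {}
    for name, w in weights_at_k.items():
        perm = torch.randperm(w.numel())
        mags = w.abs().flatten()[perm].reshape(w.shape)  # shuffled magnitudes
        out[name] = torch.sign(w) * mags
    return out
```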
Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network
Title | Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network |
Authors | Anonymous |
Abstract | One of the biggest issues in deep learning theory is the generalization ability of networks with huge model size. Classical learning theory suggests that overparameterized models cause overfitting. However, practically used large deep models avoid overfitting, which is not well explained by the classical approaches. To resolve this issue, several attempts have been made. Among them, the compression-based bound is one of the promising approaches. However, a compression-based bound can be applied only to a compressed network, and it is not applicable to the non-compressed original network. In this paper, we give a unified framework that can convert compression-based bounds to those for the non-compressed original networks. The bound gives an even better rate than the one for the compressed network by improving the bias term. By establishing this unified framework, we can obtain a data-dependent generalization error bound which gives a tighter evaluation than data-independent ones. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByeGzlrKwH |
PDF | https://openreview.net/pdf?id=ByeGzlrKwH |
PWC | https://paperswithcode.com/paper/compression-based-bound-for-non-compressed |
Repo | |
Framework | |
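For orientation only, a generic compression-based bound has the schematic shape below; this is the standard template such results follow, not the paper's theorem, and the complexity and bias terms are left abstract.

```latex
% Schematic template of a compression-based bound: with probability
% at least 1 - \delta, the *compressed* network \hat{f} satisfies
\[
  R(\hat{f}) \;\le\; \widehat{R}(\hat{f})
  \;+\; O\!\left(\sqrt{\frac{\mathrm{comp}(\hat{f}) + \log(1/\delta)}{n}}\right),
\]
% and the paper's framework transfers such bounds to the original f,
% at the price of a bias term measuring how compressible f is:
\[
  R(f) \;\le\; \widehat{R}(\hat{f})
  \;+\; \underbrace{\Delta(f, \hat{f})}_{\text{bias}}
  \;+\; O\!\left(\sqrt{\frac{\mathrm{comp}(\hat{f}) + \log(1/\delta)}{n}}\right).
\]
```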
Rethinking the Hyperparameters for Fine-tuning
Title | Rethinking the Hyperparameters for Fine-tuning |
Authors | Anonymous |
Abstract | Fine-tuning from pre-trained ImageNet models has become the de-facto standard for various computer vision tasks. Current practices for fine-tuning typically involve selecting an ad-hoc choice of hyper-parameters and keeping them fixed to values normally used for training from scratch. This paper re-examines several common practices of setting hyper-parameters for fine-tuning. Our findings are based on extensive empirical evaluation of fine-tuning on various transfer learning benchmarks. (1) While prior works have thoroughly investigated learning rate and batch size, momentum for fine-tuning is a relatively unexplored parameter. We find that picking the right value for momentum is critical for fine-tuning performance and connect it with previous theoretical findings. (2) Optimal hyper-parameters for fine-tuning, in particular the effective learning rate, are not only dataset dependent but also sensitive to the similarity between the source domain and target domain. This is in contrast to hyper-parameters for training from scratch. (3) Reference-based regularization that keeps models close to the initial model does not necessarily apply for “dissimilar” datasets. Our findings challenge common practices of fine-tuning and encourage deep learning practitioners to rethink the hyper-parameters for fine-tuning. |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1g8VkHFPH |
PDF | https://openreview.net/pdf?id=B1g8VkHFPH |
PWC | https://paperswithcode.com/paper/rethinking-the-hyperparameters-for-fine |
Repo | |
Framework | |
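Finding (2) centers on the effective learning rate under heavy-ball momentum, which scales the step size by 1/(1 − m); the helper below illustrates that two hyper-parameter settings with equal effective learning rate are interchangeable to first order (a standard fact about SGD with momentum, used here only to ground the abstract's claim).

```python
def effective_lr(lr, momentum):
    """Steady-state step scale of SGD with heavy-ball momentum."""
    return lr / (1.0 - momentum)

# Different (lr, momentum) pairs, same effective learning rate of 0.1:
assert abs(effective_lr(0.01, 0.9) - effective_lr(0.05, 0.5)) < 1e-12
```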
Generating Semantic Adversarial Examples with Differentiable Rendering
Title | Generating Semantic Adversarial Examples with Differentiable Rendering |
Authors | Anonymous |
Abstract | Machine learning (ML) algorithms, especially deep neural networks, have demonstrated success in several domains. However, several types of attacks have raised concerns about deploying ML in safety-critical domains, such as autonomous driving and security. An attacker perturbs a data point slightly in the pixel space and causes the ML algorithm to misclassify (e.g. a perturbed stop sign is classified as a yield sign). These perturbed data points are called adversarial examples, and there are numerous algorithms in the literature for constructing adversarial examples and defending against them. In this paper we explore semantic adversarial examples (SAEs), where an attacker creates perturbations in the semantic space. For example, an attacker can change the background of the image to be cloudier to cause misclassification. We present an algorithm for constructing SAEs that uses recent advances in differentiable rendering and inverse graphics. |
Tasks | Autonomous Driving |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJlRF04YwB |
PDF | https://openreview.net/pdf?id=SJlRF04YwB |
PWC | https://paperswithcode.com/paper/generating-semantic-adversarial-examples-with-1 |
Repo | |
Framework | |
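Schematically, the attack is gradient ascent on semantic parameters through a differentiable renderer; in the sketch below, `render` (parameters to image) and `classifier` (image to logits) are hypothetical callables standing in for the paper's rendering and recognition pipeline.

```python
import torch
import torch.nn.functional as F

def semantic_attack(render, classifier, params, true_label,
                    steps=50, lr=0.05):
    """Ascend the classification loss w.r.t. semantic parameters
    (lighting, pose, background, ...) via the differentiable renderer."""
    theta = params.clone().requires_grad_(True)
    for _ in range(steps):
        image = render(theta)                   # differentiable rendering
        loss = F.cross_entropy(classifier(image), true_label)
        loss.backward()
        with torch.no_grad():
            theta += lr * theta.grad            # increase the loss
            theta.grad.zero_()
    return theta.detach()                       # semantic adversarial params
```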
Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference
Title | Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference |
Authors | Anonymous |
Abstract | Deep networks were recently suggested to face the odds between accuracy (on clean natural images) and robustness (on adversarially perturbed images) (Tsipras et al., 2019). Such a dilemma is shown to be rooted in the inherently higher sample complexity (Schmidt et al., 2018) and/or model capacity (Nakkiran, 2019) needed for learning a high-accuracy and robust classifier. In view of that, given a classification task, growing the model capacity appears to help draw a win-win between accuracy and robustness, yet at the expense of model size and latency, therefore posing challenges for resource-constrained applications. Is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins? This paper studies multi-exit networks associated with input-adaptive efficient inference, showing their strong promise in achieving a “sweet point” in co-optimizing model accuracy, robustness, and efficiency. Our proposed solution, dubbed Robust Dynamic Inference Networks (RDI-Nets), allows each input (either clean or adversarial) to adaptively choose one of the multiple output layers (early branches or the final one) to output its prediction. That multi-loss adaptivity adds new variations and flexibility to adversarial attacks and defenses, on which we present a systematic investigation. We show experimentally that by equipping existing backbones with such robust adaptive inference, the resulting RDI-Nets can achieve better accuracy and robustness, yet with over 30% computational savings, compared to the defended original models. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJgzzJHtDB |
PDF | https://openreview.net/pdf?id=rJgzzJHtDB |
PWC | https://paperswithcode.com/paper/triple-wins-boosting-accuracy-robustness-and |
Repo | |
Framework | |
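The inference-time half of RDI-Nets is easy to state: scan the exits in depth order and stop at the first sufficiently confident one. A sketch, assuming a backbone exposed as an iterator of per-exit logits for a single input:

```python
import torch

def adaptive_inference(exit_logits_iter, threshold=0.9):
    """Input-adaptive early-exit inference (a sketch). `exit_logits_iter`
    yields logits from each exit, shallowest first, for one input."""
    for logits in exit_logits_iter:
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:   # easy input: answer early, save FLOPs
            return pred
    return pred                        # hard input: fall through to final exit
```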
Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks
Title | Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks |
Authors | Anonymous |
Abstract | We study the generalization properties of deep convolutional neural networks for image denoising in the presence of varying noise levels. We provide extensive empirical evidence that current state-of-the-art architectures systematically overfit to the noise levels in the training set, performing very poorly at new noise levels. We show that strong generalization can be achieved through a simple architectural modification: removing all additive constants. The resulting “bias-free” networks attain state-of-the-art performance over a broad range of noise levels, even when trained over a limited range. They are also locally linear, which enables direct analysis with linear-algebraic tools. We show that the denoising map can be visualized locally as a filter that adapts to both image structure and noise level. In addition, our analysis reveals that deep networks implicitly perform a projection onto an adaptively-selected low-dimensional subspace, with dimensionality inversely proportional to noise level, that captures features of natural images. |
Tasks | Denoising, Image Denoising |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJlSmC4FPS |
PDF | https://openreview.net/pdf?id=HJlSmC4FPS |
PWC | https://paperswithcode.com/paper/robust-and-interpretable-blind-image-1 |
Repo | |
Framework | |
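The architectural change is a one-liner per layer: drop every additive constant. Below is a sketch of a DnCNN-style bias-free denoiser (depth and width are assumptions); because ReLU is positively homogeneous, the resulting network is locally linear in its input, which is what enables the paper's linear-algebraic analysis.

```python
from torch import nn

def bias_free_denoiser(channels=64, depth=8):
    """Grayscale image denoiser with no additive constants anywhere."""
    layers = [nn.Conv2d(1, channels, 3, padding=1, bias=False), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                   nn.ReLU()]
    layers += [nn.Conv2d(channels, 1, 3, padding=1, bias=False)]
    return nn.Sequential(*layers)
```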
Permutation Equivariant Models for Compositional Generalization in Language
Title | Permutation Equivariant Models for Compositional Generalization in Language |
Authors | Anonymous |
Abstract | Humans understand novel sentences by composing meanings and roles of core language components. In contrast, neural network models for natural language modeling fail when such compositional generalization is required. The main contribution of this paper is to hypothesize that language compositionality is a form of group-equivariance. Based on this hypothesis, we propose a set of tools for constructing equivariant sequence-to-sequence models. Across a variety of experiments on the SCAN tasks, we analyze the behavior of existing models under the lens of equivariance, and demonstrate that our equivariant architecture is able to achieve the type of compositional generalization required in human language understanding. |
Tasks | Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SylVNerFvr |
PDF | https://openreview.net/pdf?id=SylVNerFvr |
PWC | https://paperswithcode.com/paper/permutation-equivariant-models-for |
Repo | |
Framework | |
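The simplest layer with the hypothesized symmetry is one that commutes with permutations of a group dimension, e.g. a per-element map plus a pooled term; the block below is a generic equivariant building block in that spirit, not the paper's full sequence-to-sequence architecture.

```python
import torch
from torch import nn

class PermutationEquivariantLinear(nn.Module):
    """Satisfies f(Px) = P f(x) for any permutation P of the group axis:
    the per-element weight commutes with P and the mean-pool is invariant."""
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)
        self.w_pool = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (group, dim)
        return self.w_self(x) + self.w_pool(x.mean(dim=0, keepdim=True))
```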
Learning to Learn by Zeroth-Order Oracle
Title | Learning to Learn by Zeroth-Order Oracle |
Authors | Anonymous |
Abstract | In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to the zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates the gradient with a ZO gradient estimator and then produces a parameter update utilizing the knowledge of previous iterations. To reduce the high variance of the ZO gradient estimator, we further introduce another RNN to learn the Gaussian sampling rule and dynamically guide the query direction sampling. Our learned optimizer outperforms hand-designed algorithms in terms of convergence rate and final solution on both synthetic and practical ZO optimization tasks (in particular, the black-box adversarial attack task, which is one of the most widely used tasks of ZO optimization). We finally conduct extensive analytical experiments to demonstrate the effectiveness of our proposed optimizer. |
Tasks | Adversarial Attack |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxz8CVYDH |
PDF | https://openreview.net/pdf?id=ryxz8CVYDH |
PWC | https://paperswithcode.com/paper/learning-to-learn-by-zeroth-order-oracle-1 |
Repo | |
Framework | |
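The oracle the learned optimizer consumes is the standard two-point zeroth-order gradient estimator, sketched below; the learned RNN update rule and the learned Gaussian sampling rule are the paper's contributions and are not reproduced here.

```python
import torch

def zo_gradient(f, x, n_queries=20, mu=1e-2):
    """Two-point ZO gradient estimate: average (f(x + mu*u) - f(x)) / mu * u
    over random Gaussian directions u. `f` maps a tensor to a scalar."""
    fx = f(x)
    grad = torch.zeros_like(x)
    for _ in range(n_queries):
        u = torch.randn_like(x)
        grad += (f(x + mu * u) - fx) / mu * u   # finite difference along u
    return grad / n_queries
```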