April 1, 2020


Paper Group NANR 91

Augmenting Self-attention with Persistent Memory

Title Augmenting Self-attention with Persistent Memory
Authors Anonymous
Abstract Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that consists solely of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a role similar to that of the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character- and word-level language modeling benchmarks.
Tasks Language Modelling, Machine Translation
Published 2020-01-01
URL https://openreview.net/forum?id=HklJdaNYPH
PDF https://openreview.net/pdf?id=HklJdaNYPH
PWC https://paperswithcode.com/paper/augmenting-self-attention-with-persistent-1
Repo
Framework
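
Since the persistent vectors are simply extra key/value pairs shared across all positions, the idea is compact to express. Below is a minimal, hedged sketch in PyTorch; the class name, shapes, and single-head simplification are assumptions for illustration, not the authors' implementation.

```python
import torch


class PersistentSelfAttention(torch.nn.Module):
    def __init__(self, dim, n_persist):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        # Learned persistent key/value vectors shared across positions;
        # they take over the role of the feed-forward sub-layer.
        self.persist_k = torch.nn.Parameter(torch.randn(n_persist, dim))
        self.persist_v = torch.nn.Parameter(torch.randn(n_persist, dim))

    def forward(self, x):                                   # x: (batch, seq, dim)
        b = x.size(0)
        q = self.q(x)
        # Concatenate the persistent vectors to the context keys/values.
        k = torch.cat([self.k(x), self.persist_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v(x), self.persist_v.expand(b, -1, -1)], dim=1)
        att = torch.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return att @ v


layer = PersistentSelfAttention(dim=64, n_persist=16)
out = layer(torch.randn(2, 10, 64))                         # (2, 10, 64)
```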

Ensemble Distribution Distillation

Title Ensemble Distribution Distillation
Authors Anonymous
Abstract Ensembles of models often yield improvements in system performance. These ensemble approaches have also been empirically shown to yield robust measures of uncertainty, and are capable of distinguishing between different forms of uncertainty. However, ensembles come at a computational and memory cost which may be prohibitive for many applications. There has been significant work done on the distillation of an ensemble into a single model. Such approaches decrease computational cost and allow a single model to achieve an accuracy comparable to that of an ensemble. However, information about the diversity of the ensemble, which can yield estimates of different forms of uncertainty, is lost. This work considers the novel task of Ensemble Distribution Distillation (EnD^2) - distilling the distribution of the predictions from an ensemble, rather than just the average prediction, into a single model. EnD^2 enables a single model to retain both the improved classification performance of ensemble distillation as well as information about the diversity of the ensemble, which is useful for uncertainty estimation. A solution for EnD^2 based on Prior Networks, a class of models which allow a single neural network to explicitly model a distribution over output distributions, is proposed in this work. The properties of EnD^2 are investigated on both an artificial dataset, and on the CIFAR-10, CIFAR-100 and TinyImageNet datasets, where it is shown that EnD^2 can approach the classification performance of an ensemble, and outperforms both standard DNNs and Ensemble Distillation on the tasks of misclassification and out-of-distribution input detection.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BygSP6Vtvr
PDF https://openreview.net/pdf?id=BygSP6Vtvr
PWC https://paperswithcode.com/paper/ensemble-distribution-distillation-1
Repo
Framework
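
Under the Prior Network framing the abstract describes, the student predicts the concentration parameters of a Dirichlet over output distributions and is trained to maximize the likelihood of each ensemble member's prediction. A minimal sketch, assuming exp-parameterized concentrations; the function name and numerical details are assumptions, not the paper's exact objective.

```python
import torch


def end2_loss(student_logits, ensemble_probs):
    """student_logits: (batch, classes); ensemble_probs: (batch, members, classes)."""
    alpha = torch.exp(student_logits) + 1.0       # predicted Dirichlet concentrations
    probs = ensemble_probs.clamp_min(1e-6)
    probs = probs / probs.sum(-1, keepdim=True)   # keep each member on the simplex
    # Negative log-likelihood of every member's prediction under the Dirichlet.
    logp = torch.distributions.Dirichlet(alpha).log_prob(probs.transpose(0, 1))
    return -logp.mean()


student_logits = torch.randn(4, 10, requires_grad=True)
ensemble_probs = torch.softmax(torch.randn(4, 5, 10), dim=-1)   # 5 ensemble members
loss = end2_loss(student_logits, ensemble_probs)
```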

Automated curriculum generation through setter-solver interactions

Title Automated curriculum generation through setter-solver interactions
Authors Anonymous
Abstract Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance. But in dynamic or sparsely rewarding environments, these correlations are often too small, or rewarding events are too infrequent, to make learning feasible. Human education instead relies on curricula (the breakdown of tasks into simpler, static challenges with dense rewards) to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time consuming. This has led researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich, dynamic environments. Using a setter-solver paradigm, we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1e0Wp4KvH
PDF https://openreview.net/pdf?id=H1e0Wp4KvH
PWC https://paperswithcode.com/paper/automated-curriculum-generation-through
Repo
Framework
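
The setter-solver loop is easiest to see on a toy problem. The runnable sketch below keeps proposed goals near the frontier of the solver's competence, a stand-in for the feasibility signal the abstract mentions; the 1-D goal space and all constants are invented for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
skill = 0.1          # toy solver competence: it reaches goals within this radius
difficulty = 0.1     # setter's current goal-difficulty (std of proposed goals)

for episode in range(2000):
    goal = rng.normal(0.0, difficulty)    # setter proposes a goal
    achieved = abs(goal) < skill          # toy solver attempt
    skill += 0.001 * achieved             # solver improves when it succeeds
    # Feasibility-style feedback: raise difficulty when the solver succeeds,
    # lower it on failure, keeping goals near the frontier of its ability.
    difficulty = max(difficulty * (1.01 if achieved else 0.99), 1e-3)

print(f"solver radius {skill:.2f}, goal difficulty {difficulty:.2f}")
```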

PolyGAN: High-Order Polynomial Generators

Title PolyGAN: High-Order Polynomial Generators
Authors Anonymous
Abstract Generative Adversarial Networks (GANs) have become the gold standard when it comes to learning generative models for high-dimensional distributions. Since their advent, numerous variations of GANs have been introduced in the literature, primarily focusing on utilization of novel loss functions, optimization/regularization strategies and network architectures. In this paper, we turn our attention to the generator and investigate the use of high-order polynomials as an alternative class of universal function approximators. Concretely, we propose PolyGAN, where we model the data generator by means of a high-order polynomial whose unknown parameters are naturally represented by high-order tensors. We introduce two tensor decompositions that significantly reduce the number of parameters and show how they can be efficiently implemented by hierarchical neural networks that only employ linear/convolutional blocks. We exhibit for the first time that by using our approach a GAN generator can approximate the data distribution without using any activation functions. Thorough experimental evaluation on both synthetic and real data (images and 3D point clouds) demonstrates the merits of PolyGAN against the state of the art.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Bye30kSYDH
PDF https://openreview.net/pdf?id=Bye30kSYDH
PWC https://paperswithcode.com/paper/polygan-high-order-polynomial-generators-1
Repo
Framework
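
A hedged sketch of what an activation-free polynomial generator can look like, using a coupled CP-style recursion of Hadamard products of linear maps of the latent code; the exact tensor decompositions and sizes used in the paper may differ.

```python
import torch


class PolyGenerator(torch.nn.Module):
    def __init__(self, z_dim=64, hidden=128, out_dim=784, order=4):
        super().__init__()
        self.maps = torch.nn.ModuleList(
            [torch.nn.Linear(z_dim, hidden, bias=False) for _ in range(order)])
        self.out = torch.nn.Linear(hidden, out_dim)

    def forward(self, z):
        x = self.maps[0](z)
        for lin in self.maps[1:]:
            x = lin(z) * x + x          # Hadamard product raises the polynomial order
        return self.out(x)              # no activation function anywhere


samples = PolyGenerator()(torch.randn(16, 64))   # (16, 784)
```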

Mathematical Reasoning in Latent Space

Title Mathematical Reasoning in Latent Space
Authors Anonymous
Abstract We design and conduct a simple experiment to study whether neural networks can perform several steps of approximate reasoning in a fixed dimensional latent space. The set of rewrites (i.e. transformations) that can be successfully performed on a statement represents essential semantic features of the statement. We can compress this information by embedding the formula in a vector space, such that the vector associated with a statement can be used to predict whether a statement can be rewritten by other theorems. Predicting the embedding of a formula generated by some rewrite rule is naturally viewed as approximate reasoning in the latent space. In order to measure the effectiveness of this reasoning, we perform approximate deduction sequences in the latent space and use the resulting embedding to inform the semantic features of the corresponding formal statement (which is obtained by performing the corresponding rewrite sequence using real formulas). Our experiments show that graph neural networks can make non-trivial predictions about the rewrite-success of statements, even when they propagate predicted latent representations for several steps. Since our corpus of mathematical formulas includes a wide variety of mathematical disciplines, this experiment is a strong indicator for the feasibility of deduction in latent space in general.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Ske31kBtPr
PDF https://openreview.net/pdf?id=Ske31kBtPr
PWC https://paperswithcode.com/paper/mathematical-reasoning-in-latent-space-1
Repo
Framework
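
Schematically, the experiment chains a learned "rewrite step" in embedding space and then queries rewrite success from the propagated embedding. The runnable stand-in below replaces the paper's graph neural network encoder with a placeholder MLP; all module names and sizes are assumptions for illustration.

```python
import torch

dim, n_rewrites = 128, 50
encode = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                             torch.nn.Linear(dim, dim))   # placeholder for the GNN encoder
step = torch.nn.Linear(dim + n_rewrites, dim)    # predicts the embedding after a rewrite
success = torch.nn.Linear(dim + n_rewrites, 1)   # predicts whether a rewrite applies


def with_rewrite(e, rewrite_id):
    onehot = torch.nn.functional.one_hot(torch.tensor([rewrite_id]), n_rewrites).float()
    return torch.cat([e, onehot], dim=-1)


e = encode(torch.randn(1, dim))                  # embed the initial formula
for rewrite_id in [3, 17, 8]:                    # several steps purely in latent space
    e = step(with_rewrite(e, rewrite_id))
p = torch.sigmoid(success(with_rewrite(e, 25)))  # query rewrite-success at the end
```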

Neural Execution of Graph Algorithms

Title Neural Execution of Graph Algorithms
Authors Anonymous
Abstract Graph Neural Networks (GNNs) are a powerful representational tool for solving problems on graph-structured inputs. In almost all cases so far, however, they have been applied to directly recovering a final solution from raw inputs, without explicit guidance on how to structure their problem-solving. Here, instead, we focus on learning in the space of algorithms: we train several state-of-the-art GNN architectures to imitate individual steps of classical graph algorithms, parallel (breadth-first search, Bellman-Ford) as well as sequential (Prim’s algorithm). As graph algorithms usually rely on making discrete decisions within neighbourhoods, we hypothesise that maximisation-based message passing neural networks are best-suited for such objectives, and validate this claim empirically. We also demonstrate how learning in the space of algorithms can yield new opportunities for positive transfer between tasks—showing how learning a shortest-path algorithm can be substantially improved when simultaneously learning a reachability algorithm.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SkgKO0EtvS
PDF https://openreview.net/pdf?id=SkgKO0EtvS
PWC https://paperswithcode.com/paper/neural-execution-of-graph-algorithms-1
Repo
Framework
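
Step-wise supervision means the network imitates one step of the classical algorithm at a time. The runnable sketch below computes that per-step target for Bellman-Ford on a toy graph; the trained GNN itself is omitted, and the graph and helper names are illustrative.

```python
INF = float("inf")
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 3, 4.0), (3, 4, 1.0), (4, 2, 1.0)]


def bellman_ford_step(dist):
    """Relax every edge once; this is the supervision signal for one step."""
    new = list(dist)
    for u, v, w in edges:
        new[v] = min(new[v], dist[u] + w)   # min-aggregation over neighbours
    return new


dist = [0.0, INF, INF, INF, INF]            # distances from source node 0
target = bellman_ford_step(dist)            # [0.0, 1.0, inf, 4.0, inf]
# An MPNN whose aggregator is a max/min (rather than a sum) is then trained
# so that its predicted next state matches `target` at every step.
```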

Deep Double Descent: Where Bigger Models and More Data Hurt

Title Deep Double Descent: Where Bigger Models and More Data Hurt
Authors Anonymous
Abstract We show that a variety of modern deep learning tasks exhibit a “double-descent” phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity, and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1g5sA4twr
PDF https://openreview.net/pdf?id=B1g5sA4twr
PWC https://paperswithcode.com/paper/deep-double-descent-where-bigger-models-and
Repo
Framework
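
Model-wise double descent can be reproduced in a few lines with random-feature regression: test error typically rises as the feature count approaches the number of training points and falls again beyond it. A small runnable illustration, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 100, 1000
x_tr = rng.uniform(-1, 1, n_train)
x_te = rng.uniform(-1, 1, n_test)
y_tr = np.sin(np.pi * x_tr) + 0.1 * rng.standard_normal(n_train)
y_te = np.sin(np.pi * x_te)

for width in [10, 50, 90, 100, 110, 200, 1000]:
    freq = rng.standard_normal(width)                 # random feature frequencies
    phase = rng.uniform(0, 2 * np.pi, width)
    feats = lambda x: np.cos(np.outer(x, freq) + phase)
    # Minimum-norm least squares; the spike tends to appear near width == n_train.
    beta, *_ = np.linalg.lstsq(feats(x_tr), y_tr, rcond=None)
    err = np.mean((feats(x_te) @ beta - y_te) ** 2)
    print(f"width {width:5d}  test MSE {err:.3f}")
```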

The Early Phase of Neural Network Training

Title The Early Phase of Neural Network Training
Authors Anonymous
Abstract Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state and its updates during these early iterations of training, and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are label-agnostic, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Hkl1iRNFwS
PDF https://openreview.net/pdf?id=Hkl1iRNFwS
PWC https://paperswithcode.com/paper/the-early-phase-of-neural-network-training
Repo
Framework
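
One of the probes mentioned in the abstract, reinitializing with random weights while maintaining signs, is simple to state in code. A hedged sketch on a single weight tensor; the init scale and helper name are assumptions.

```python
import torch


def reinit_keep_signs(w_at_k, std=0.05):
    """Redraw weight magnitudes from the init distribution, keep learned signs."""
    fresh = torch.randn_like(w_at_k) * std
    return torch.sign(w_at_k) * fresh.abs()


w_after_100_iters = torch.randn(64, 32) * 0.05   # stand-in for early-training weights
w_probe = reinit_keep_signs(w_after_100_iters)
# The abstract reports networks are NOT robust to resuming from `w_probe`:
# signs alone do not capture what the first iterations learned.
```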

Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network

Title Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network
Authors Anonymous
Abstract One of the biggest issues in deep learning theory is the generalization ability of networks with huge model size. Classical learning theory suggests that overparameterized models cause overfitting. However, practically used large deep models avoid overfitting, which is not well explained by the classical approaches. To resolve this issue, several attempts have been made. Among them, the compression based bound is one of the promising approaches. However, the compression based bound can be applied only to a compressed network, and it is not applicable to the non-compressed original network. In this paper, we give a unified framework that can convert compression based bounds into bounds for the non-compressed original network. The bound gives an even better rate than the one for the compressed network by improving the bias term. By establishing the unified framework, we can obtain a data dependent generalization error bound which gives a tighter evaluation than data independent ones.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=ByeGzlrKwH
PDF https://openreview.net/pdf?id=ByeGzlrKwH
PWC https://paperswithcode.com/paper/compression-based-bound-for-non-compressed
Repo
Framework
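
Schematically, a compression-based bound controls the risk of the compressed network; the paper's contribution is a bound of the following general shape for the original network f, where the bias term measures the discrepancy to its compressed version. This is only the generic shape of such a bound, not the paper's precise statement: the complexity measure Comp(·) and discrepancy d(·,·) are defined in the paper, not here.

```latex
% Schematic only: \hat{f} is the compressed network, n the sample size.
R(f) \;\lesssim\; \widehat{R}_n(f)
     \;+\; \underbrace{d\big(f, \hat{f}\big)}_{\text{bias term}}
     \;+\; \sqrt{\frac{\mathrm{Comp}(\hat{f}) + \log(1/\delta)}{n}}
     \qquad \text{with probability } 1 - \delta .
```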

Rethinking the Hyperparameters for Fine-tuning

Title Rethinking the Hyperparameters for Fine-tuning
Authors Anonymous
Abstract Fine-tuning from pre-trained ImageNet models has become the de-facto standard for various computer vision tasks. Current practices for fine-tuning typically involve selecting an ad-hoc choice of hyper-parameters and keeping them fixed to values normally used for training from scratch. This paper re-examines several common practices of setting hyper-parameters for fine-tuning. Our findings are based on extensive empirical evaluation of fine-tuning on various transfer learning benchmarks. (1) While prior works have thoroughly investigated learning rate and batch size, momentum for fine-tuning is a relatively unexplored parameter. We find that picking the right value for momentum is critical for fine-tuning performance and connect it with previous theoretical findings. (2) Optimal hyper-parameters for fine-tuning, in particular the effective learning rate, are not only dataset dependent but also sensitive to the similarity between the source domain and target domain. This is in contrast to hyper-parameters for training from scratch. (3) Reference-based regularization that keeps models close to the initial model does not necessarily apply to “dissimilar” datasets. Our findings challenge common practices of fine-tuning and encourage deep learning practitioners to rethink the hyper-parameters for fine-tuning.
Tasks Transfer Learning
Published 2020-01-01
URL https://openreview.net/forum?id=B1g8VkHFPH
PDF https://openreview.net/pdf?id=B1g8VkHFPH
PWC https://paperswithcode.com/paper/rethinking-the-hyperparameters-for-fine
Repo
Framework
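
Findings (1) and (2) hinge on the interaction between momentum and learning rate. For SGD with heavy-ball momentum, the steady-state update scale is roughly lr / (1 - momentum), a common heuristic for the "effective learning rate"; the snippet below prints it across an illustrative grid (the values are not the paper's recommendations).

```python
import itertools


def effective_lr(lr, momentum):
    # With heavy-ball momentum, successive gradients are effectively scaled
    # by 1/(1 - momentum), so momentum and lr must be tuned jointly.
    return lr / (1.0 - momentum)


for lr, m in itertools.product([0.1, 0.01, 0.001], [0.0, 0.9, 0.99]):
    print(f"lr={lr:<6} momentum={m:<5} effective lr={effective_lr(lr, m):.3f}")
```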

Generating Semantic Adversarial Examples with Differentiable Rendering

Title Generating Semantic Adversarial Examples with Differentiable Rendering
Authors Anonymous
Abstract Machine learning (ML) algorithms, especially deep neural networks, have demonstrated success in several domains. However, several types of attacks have raised concerns about deploying ML in safety-critical domains, such as autonomous driving and security. An attacker perturbs a data point slightly in the pixel space and causes the ML algorithm to misclassify (e.g. a perturbed stop sign is classified as a yield sign). These perturbed data points are called adversarial examples, and there are numerous algorithms in the literature for constructing adversarial examples and defending against them. In this paper we explore semantic adversarial examples (SAEs), where an attacker creates perturbations in the semantic space. For example, an attacker can change the background of the image to be cloudier to cause misclassification. We present an algorithm for constructing SAEs that uses recent advances in differentiable rendering and inverse graphics.
Tasks Autonomous Driving
Published 2020-01-01
URL https://openreview.net/forum?id=SJlRF04YwB
PDF https://openreview.net/pdf?id=SJlRF04YwB
PWC https://paperswithcode.com/paper/generating-semantic-adversarial-examples-with-1
Repo
Framework
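
The attack reduces to gradient ascent on the classifier's loss with respect to semantic parameters, with gradients flowing through the renderer. In the runnable skeleton below both the renderer and the classifier are toy stand-ins invented so the loop executes; a real attack would substitute a differentiable renderer and a trained model.

```python
import torch

# Toy stand-ins: a "renderer" that maps a lighting parameter to an image,
# and an untrained linear classifier.
render = lambda light: torch.sigmoid(light) * torch.ones(1, 3, 8, 8)
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
true_label = torch.tensor([0])

light = torch.zeros(1, requires_grad=True)   # semantic parameter under attack
opt = torch.optim.Adam([light], lr=0.1)
for _ in range(50):
    opt.zero_grad()
    # Ascend the true-class loss; gradients flow through the renderer.
    loss = -torch.nn.functional.cross_entropy(classifier(render(light)), true_label)
    loss.backward()
    opt.step()
```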

Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference

Title Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference
Authors Anonymous
Abstract Deep networks have recently been suggested to face a trade-off between accuracy (on clean natural images) and robustness (on adversarially perturbed images) (Tsipras et al., 2019). Such a dilemma is shown to be rooted in the inherently higher sample complexity (Schmidt et al., 2018) and/or model capacity (Nakkiran, 2019) required for learning a high-accuracy and robust classifier. In view of that, given a classification task, growing the model capacity appears to help draw a win-win between accuracy and robustness, yet at the expense of model size and latency, therefore posing challenges for resource-constrained applications. Is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins? This paper studies multi-exit networks associated with input-adaptive efficient inference, showing their strong promise in achieving a “sweet point” in co-optimizing model accuracy, robustness, and efficiency. Our proposed solution, dubbed Robust Dynamic Inference Networks (RDI-Nets), allows each input (either clean or adversarial) to adaptively choose one of the multiple output layers (early branches or the final one) to output its prediction. That multi-loss adaptivity adds new variations and flexibility to adversarial attacks and defenses, on which we present a systematic investigation. We show experimentally that by equipping existing backbones with such robust adaptive inference, the resulting RDI-Nets can achieve better accuracy and robustness, yet with over 30% computational savings, compared to the defended original models.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rJgzzJHtDB
PDF https://openreview.net/pdf?id=rJgzzJHtDB
PWC https://paperswithcode.com/paper/triple-wins-boosting-accuracy-robustness-and
Repo
Framework
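
Input-adaptive inference with multiple exits can be sketched as a confidence-thresholded early return. This is a generic multi-exit skeleton, not the paper's RDI-Nets; the threshold, widths, and depth are placeholders.

```python
import torch


class MultiExitNet(torch.nn.Module):
    """Generic multi-exit skeleton with confidence-thresholded early exits."""

    def __init__(self, dim=32, n_classes=10, threshold=0.9):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            [torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
             for _ in range(3)])
        self.exits = torch.nn.ModuleList(
            [torch.nn.Linear(dim, n_classes) for _ in range(3)])
        self.threshold = threshold

    def forward(self, x):                         # shown for a single input
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            probs = torch.softmax(exit_head(x), dim=-1)
            if probs.max() >= self.threshold:     # confident enough: stop early,
                return probs                      # saving the remaining layers
        return probs                              # otherwise use the final exit


pred = MultiExitNet()(torch.randn(1, 32))
```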

Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks

Title Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks
Authors Anonymous
Abstract We study the generalization properties of deep convolutional neural networks for image denoising in the presence of varying noise levels. We provide extensive empirical evidence that current state-of-the-art architectures systematically overfit to the noise levels in the training set, performing very poorly at new noise levels. We show that strong generalization can be achieved through a simple architectural modification: removing all additive constants. The resulting “bias-free” networks attain state-of-the-art performance over a broad range of noise levels, even when trained over a limited range. They are also locally linear, which enables direct analysis with linear-algebraic tools. We show that the denoising map can be visualized locally as a filter that adapts to both image structure and noise level. In addition, our analysis reveals that deep networks implicitly perform a projection onto an adaptively-selected low-dimensional subspace, with dimensionality inversely proportional to noise level, that captures features of natural images.
Tasks Denoising, Image Denoising
Published 2020-01-01
URL https://openreview.net/forum?id=HJlSmC4FPS
PDF https://openreview.net/pdf?id=HJlSmC4FPS
PWC https://paperswithcode.com/paper/robust-and-interpretable-blind-image-1
Repo
Framework
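
The architectural modification is simple to state in code: remove every additive constant (the paper also strips the additive terms from normalization layers, omitted here). With no biases, the network is positively homogeneous, which is what enables the local linear analysis. A runnable sketch with illustrative depth and width:

```python
import torch


def bias_free_cnn(depth=5, width=64):
    layers = [torch.nn.Conv2d(1, width, 3, padding=1, bias=False), torch.nn.ReLU()]
    for _ in range(depth - 2):
        layers += [torch.nn.Conv2d(width, width, 3, padding=1, bias=False),
                   torch.nn.ReLU()]
    layers.append(torch.nn.Conv2d(width, 1, 3, padding=1, bias=False))
    return torch.nn.Sequential(*layers)


net = bias_free_cnn()
x = torch.randn(1, 1, 32, 32)
# With every additive constant removed, the network is positively homogeneous:
# net(a*x) == a*net(x) for a > 0, the property behind its local linearity.
assert torch.allclose(net(2 * x), 2 * net(x), atol=1e-4)
```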

Permutation Equivariant Models for Compositional Generalization in Language

Title Permutation Equivariant Models for Compositional Generalization in Language
Authors Anonymous
Abstract Humans understand novel sentences by composing meanings and roles of core language components. In contrast, neural network models for natural language modeling fail when such compositional generalization is required. The main contribution of this paper is to hypothesize that language compositionality is a form of group-equivariance. Based on this hypothesis, we propose a set of tools for constructing equivariant sequence-to-sequence models. Through a variety of experiments on the SCAN tasks, we analyze the behavior of existing models through the lens of equivariance, and demonstrate that our equivariant architecture is able to achieve the type of compositional generalization required in human language understanding.
Tasks Language Modelling
Published 2020-01-01
URL https://openreview.net/forum?id=SylVNerFvr
PDF https://openreview.net/pdf?id=SylVNerFvr
PWC https://paperswithcode.com/paper/permutation-equivariant-models-for
Repo
Framework
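
The hypothesis that compositionality is a form of equivariance can be phrased as a property test: permuting interchangeable words in the input should permute the output the same way. A toy, runnable checker; the SCAN-style vocabulary and model are invented for illustration.

```python
def is_equivariant(model, sentence, in_perm, out_perm):
    """Check model(perm(x)) == perm(model(x)) for a paired token permutation."""
    apply = lambda seq, p: [p.get(t, t) for t in seq]
    return apply(model(sentence), out_perm) == model(apply(sentence, in_perm))


# Toy SCAN-style model mapping command words to actions; it is equivariant
# under the paired swap jump<->run on inputs, JUMP<->RUN on outputs.
actions = {"jump": "JUMP", "run": "RUN"}
model = lambda seq: [actions[t] for t in seq if t in actions]
print(is_equivariant(model, ["jump", "twice"],
                     {"jump": "run", "run": "jump"},
                     {"JUMP": "RUN", "RUN": "JUMP"}))        # True
```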

Learning to Learn by Zeroth-Order Oracle

Title Learning to Learn by Zeroth-Order Oracle
Authors Anonymous
Abstract In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to the zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates the gradient with a ZO gradient estimator and then produces the parameter update using the knowledge of previous iterations. To reduce the high variance of the ZO gradient estimator, we further introduce another RNN to learn the Gaussian sampling rule and dynamically guide the query direction sampling. Our learned optimizer outperforms hand-designed algorithms in terms of convergence rate and final solution on both synthetic and practical ZO optimization tasks (in particular, the black-box adversarial attack task, which is one of the most widely used tasks of ZO optimization). We finally conduct extensive analytical experiments to demonstrate the effectiveness of our proposed optimizer.
Tasks Adversarial Attack
Published 2020-01-01
URL https://openreview.net/forum?id=ryxz8CVYDH
PDF https://openreview.net/pdf?id=ryxz8CVYDH
PWC https://paperswithcode.com/paper/learning-to-learn-by-zeroth-order-oracle-1
Repo
Framework
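
The primitive underneath the learned optimizer is the zeroth-order gradient estimate built from function queries along Gaussian directions. A runnable sketch of the estimator alone; the RNN optimizer and the learned Gaussian sampling rule from the paper are omitted, and the query budget is illustrative.

```python
import numpy as np


def zo_gradient(f, x, mu=0.01, n_queries=20, rng=np.random.default_rng(0)):
    """Two-point Gaussian-smoothing estimate of grad f(x) from queries only."""
    g = np.zeros_like(x)
    for _ in range(n_queries):
        u = rng.standard_normal(x.shape)          # query direction
        g += (f(x + mu * u) - f(x)) / mu * u      # forward-difference estimator
    return g / n_queries


f = lambda x: np.sum(x ** 2)                      # toy black-box objective
print(zo_gradient(f, np.ones(5)))                 # roughly 2*x = [2, 2, 2, 2, 2]
```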