April 1, 2020

3136 words 15 mins read

Paper Group NANR 95


GPU Memory Management for Deep Neural Networks Using Deep Q-Network

Title GPU Memory Management for Deep Neural Networks Using Deep Q-Network
Authors Anonymous
Abstract Deep neural networks use deeper and broader structures to achieve better performance and, consequently, use increasingly more GPU memory. However, limited GPU memory restricts many potential designs of neural networks. In this paper, we propose a reinforcement-learning-based variable swapping and recomputation algorithm that reduces memory cost without sacrificing model accuracy. Variable swapping transfers variables between CPU and GPU memory to reduce the number of variables stored in GPU memory. Recomputation trades time for space by discarding some feature maps during forward propagation and re-executing the forward functions to regenerate them when they are needed again. However, automatically deciding which variables to swap or recompute remains a challenging problem. To address this issue, we propose to use a deep Q-network (DQN) to make these plans. By combining variable swapping and recomputation, our results outperform several well-known benchmarks.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJxg7eHYvB
PDF https://openreview.net/pdf?id=BJxg7eHYvB
PWC https://paperswithcode.com/paper/gpu-memory-management-for-deep-neural
Repo
Framework
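
The paper's DQN scheduler is not public, but the recomputation primitive it plans over is easy to illustrate. Below is a minimal sketch using PyTorch's built-in gradient checkpointing, which discards intermediate feature maps in the forward pass and re-runs the forward functions during backpropagation; the model, sizes, and segment count are illustrative choices, not the paper's.

```python
# Minimal recomputation sketch (illustrative; not the paper's DQN scheduler).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
)
x = torch.randn(64, 1024, requires_grad=True)

# Only the boundaries of the 2 segments keep their activations in memory;
# everything in between is recomputed during the backward pass.
out = checkpoint_sequential(model, segments=2, input=x)
out.sum().backward()

# The complementary primitive, variable swapping, moves a tensor to host
# memory and back, e.g. t = t.to("cpu", non_blocking=True).
```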

On the “steerability” of generative adversarial networks

Title On the “steerability” of generative adversarial networks
Authors Anonymous
Abstract An open secret in contemporary machine learning is that many models work beautifully on standard benchmarks but fail to generalize outside the lab. This has been attributed to biased training data, which provide poor coverage over real world events. Generative models are no exception, but recent advances in generative adversarial networks (GANs) suggest otherwise – these models can now synthesize strikingly realistic and diverse images. Is generative modeling of photos a solved problem? We show that although current GANs can fit standard datasets very well, they still fall short of being comprehensive models of the visual manifold. In particular, we study their ability to fit simple transformations such as camera movements and color changes. We find that the models reflect the biases of the datasets on which they are trained (e.g., centered objects), but that they also exhibit some capacity for generalization: by “steering” in latent space, we can shift the distribution while still creating realistic images. We hypothesize that the degree of distributional shift is related to the breadth of the training data distribution. Thus, we conduct experiments to quantify the limits of GAN transformations and introduce techniques to mitigate the problem.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HylsTT4FvB
PDF https://openreview.net/pdf?id=HylsTT4FvB
PWC https://paperswithcode.com/paper/on-the-steerability-of-generative-adversarial-1
Repo
Framework
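
A minimal sketch of the latent-space "steering" described above: learn a single direction w such that G(z + αw) reproduces a pixel-space edit of G(z) (here, a horizontal shift). The stand-in generator, latent size, and optimizer settings are assumptions; a real experiment would steer a pretrained GAN.

```python
import torch

latent_dim = 128
net = torch.nn.Linear(latent_dim, 64 * 64)   # stand-in for a pretrained G
for p in net.parameters():
    p.requires_grad_(False)                  # the generator stays frozen
G = lambda z: net(z).view(-1, 64, 64)

w = torch.zeros(latent_dim, requires_grad=True)   # steering direction
opt = torch.optim.Adam([w], lr=1e-3)

def shift(img, step):
    # target edit: translate the image `step` pixels along its width
    return torch.roll(img, shifts=step, dims=-1)

for _ in range(1000):
    z = torch.randn(16, latent_dim)
    alpha = int(torch.randint(-8, 9, (1,)))       # random edit magnitude
    target = shift(G(z), alpha)                   # edited original samples
    loss = ((G(z + alpha * w) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```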

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Title ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Authors Anonymous
Abstract While masked language modeling (MLM) pre-training methods such as BERT produce excellent results on downstream NLP tasks, they require large amounts of compute to be effective. These approaches corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some input tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the model learns from all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where we match the performance of RoBERTa, the current state-of-the-art pre-trained transformer, while using less than 1/4 of the compute.
Tasks Language Modelling, Linguistic Acceptability, Natural Language Inference, Question Answering, Semantic Textual Similarity, Sentiment Analysis
Published 2020-01-01
URL https://openreview.net/forum?id=r1xMH1BtvB
PDF https://openreview.net/pdf?id=r1xMH1BtvB
PWC https://paperswithcode.com/paper/electra-pre-training-text-encoders-as
Repo
Framework
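
A hedged sketch of the replaced-token-detection objective described above (not the released ELECTRA code): a small generator fills in masked positions, and the discriminator labels every token of the corrupted sequence as original or replaced. The `generator` and `discriminator` callables, `MASK_ID`, and the loss weight are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103     # placeholder [MASK] token id (hypothetical vocabulary)
LAMBDA = 50.0     # weight on the discriminator loss (hyperparameter)

def electra_step(tokens, mask_positions, generator, discriminator):
    # tokens: (B, L) token ids; mask_positions: (B, L) boolean mask
    masked = tokens.masked_fill(mask_positions, MASK_ID)
    gen_logits = generator(masked)                       # (B, L, vocab)
    samples = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask_positions, samples, tokens)

    # discriminator target: 1 wherever the sample differs from the original
    is_replaced = (corrupted != tokens).float()
    disc_logits = discriminator(corrupted).squeeze(-1)   # (B, L)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # the generator itself is trained with an ordinary MLM loss
    mlm_loss = F.cross_entropy(gen_logits[mask_positions],
                               tokens[mask_positions])
    return mlm_loss + LAMBDA * disc_loss
```

Because the discriminator's loss is defined over all positions, not just the masked subset, every input token contributes a training signal, which is the sample-efficiency argument in the abstract.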

Adversarial Robustness Against the Union of Multiple Perturbation Models

Title Adversarial Robustness Against the Union of Multiple Perturbation Models
Authors Anonymous
Abstract Owing to the susceptibility of deep learning systems to adversarial attacks, there has been a great deal of work in developing (both empirically and certifiably) robust classifiers, but the vast majority has defended against single types of attacks. Recent work has looked at defending against multiple attacks, specifically on the MNIST dataset, yet this approach used a relatively complex architecture, claiming that standard adversarial training cannot be applied because it “overfits” to a particular norm. In this work, we show that it is indeed possible to adversarially train a robust model against a union of norm-bounded attacks by using a natural generalization of the standard PGD-based adversarial training procedure to multiple threat models. With this approach, we are able to train standard architectures that are robust against l_inf, l_2, and l_1 attacks, outperforming past approaches on the MNIST dataset and providing the first CIFAR10 network trained to be simultaneously robust against (l_inf, l_2, l_1) threat models, achieving adversarial accuracy rates of (47.6%, 64.3%, 53.4%) for perturbations with epsilon radii of (0.03, 0.5, 12).
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rklMnyBtPB
PDF https://openreview.net/pdf?id=rklMnyBtPB
PWC https://paperswithcode.com/paper/adversarial-robustness-against-the-union-of-1
Repo
Framework
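
One natural rendering of the generalization the abstract mentions, assuming a PGD attack per threat model is available: attack with each of the l_inf, l_2, and l_1 attacks and train on whichever perturbation maximizes the loss. This is a sketch of the worst-case-over-attacks strategy, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def union_adv_loss(model, x, y, attacks):
    # attacks: callables mapping (model, x, y) -> adversarial x, e.g. PGD
    # for l_inf (eps=0.03), l_2 (eps=0.5), and l_1 (eps=12) as in the abstract
    worst_loss, worst_x = None, x
    for attack in attacks:
        x_adv = attack(model, x, y)
        with torch.no_grad():
            loss = F.cross_entropy(model(x_adv), y)
        if worst_loss is None or loss > worst_loss:
            worst_loss, worst_x = loss, x_adv
    # train on the strongest perturbation found for this batch
    return F.cross_entropy(model(worst_x), y)
```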

Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel

Title Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel
Authors Anonymous
Abstract Neural Networks (NNs) have been extensively used for a wide spectrum of real-world regression tasks, where the goal is to predict a numerical outcome such as revenue, effectiveness, or a quantitative result. In many such tasks, the point prediction is not enough: the uncertainty (i.e. risk or confidence) of that prediction must also be estimated. Standard NNs, which are most often used in such tasks, do not provide uncertainty information. Existing approaches address this issue by combining Bayesian models with NNs, but these models are hard to implement, more expensive to train, and usually do not predict as accurately as standard NNs. In this paper, a new framework (RIO) is developed that makes it possible to estimate uncertainty in any pretrained standard NN. The behavior of the NN is captured by modeling its prediction residuals with a Gaussian Process, whose kernel includes both the NN’s input and its output. The framework is justified theoretically and evaluated in twelve real-world datasets, where it is found to (1) provide reliable estimates of uncertainty, (2) reduce the error of the point predictions, and (3) scale well to large datasets. Given that RIO can be applied to any standard NN without modifications to model architecture or training pipeline, it provides an important ingredient for building real-world NN applications.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rkxNh1Stvr
PDF https://openreview.net/pdf?id=rkxNh1Stvr
PWC https://paperswithcode.com/paper/quantifying-point-prediction-uncertainty-in
Repo
Framework
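
A compact sketch of the RIO recipe under stated assumptions: fit an exact GP to the pretrained network's residuals, with an "I/O kernel" formed by summing an RBF over the inputs and an RBF over the network's outputs. Length scales and noise are fixed here for brevity, whereas the paper tunes the GP hyperparameters.

```python
import numpy as np

def rbf(A, B, ls):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ls ** 2))

def rio_fit_predict(nn_predict, X_tr, y_tr, X_te,
                    ls_in=1.0, ls_out=1.0, noise=1e-2):
    f_tr = nn_predict(X_tr)           # pretrained NN predictions, shape (n,)
    f_te = nn_predict(X_te)
    r = y_tr - f_tr                   # residuals the GP will model

    # I/O kernel = RBF on the inputs + RBF on the NN outputs
    K = rbf(X_tr, X_tr, ls_in) + rbf(f_tr[:, None], f_tr[:, None], ls_out)
    K_s = rbf(X_te, X_tr, ls_in) + rbf(f_te[:, None], f_tr[:, None], ls_out)
    K_ss = rbf(X_te, X_te, ls_in) + rbf(f_te[:, None], f_te[:, None], ls_out)

    L = np.linalg.cholesky(K + noise * np.eye(len(X_tr)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    mean_resid = K_s @ alpha          # GP posterior mean of the residual
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss) - (v ** 2).sum(axis=0) + noise

    # corrected point prediction plus a predictive standard deviation
    return f_te + mean_resid, np.sqrt(var)
```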

Span Recovery for Deep Neural Networks with Applications to Input Obfuscation

Title Span Recovery for Deep Neural Networks with Applications to Input Obfuscation
Authors Anonymous
Abstract The tremendous success of deep neural networks has motivated the need to better understand their fundamental properties, but many of the theoretical results proposed so far hold only for shallow networks. In this paper, we study an important primitive for understanding the meaningful input space of a deep network: span recovery. For k < n, let A ∈ R^{k×n} be the innermost weight matrix of an arbitrary feedforward neural network M: R^n → R, so that M(x) can be written as M(x) = σ(Ax) for some network σ: R^k → R. The goal is then to recover the row span of A given only oracle access to the value of M(x). We show that if M is a multi-layered network with ReLU activation functions, then partial recovery is possible: namely, we can provably recover k/2 linearly independent vectors in the row span of A using poly(n) non-adaptive queries to M(x). Furthermore, if M has differentiable activation functions, we demonstrate that full span recovery is possible even when the output is first passed through a sign or 0/1 thresholding function; in this case our algorithm is adaptive. Empirically, we confirm that full span recovery is not always possible, but that failures occur only for unrealistically thin layers. For reasonably wide networks, we obtain full span recovery on both random networks and networks trained on MNIST data. Furthermore, we demonstrate the utility of span recovery as an attack by inducing neural networks to misclassify data obfuscated by controlled random noise as sensical inputs.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1guLAVFDB
PDF https://openreview.net/pdf?id=B1guLAVFDB
PWC https://paperswithcode.com/paper/span-recovery-for-deep-neural-networks-with
Repo
Framework
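
The following sketch is not the paper's algorithm, but it illustrates why span recovery is possible at all: for M(x) = σ(Ax), every input gradient has the form A^T v and therefore lies in the row span of A, so finite-difference gradients collected at random points span (part of) that row space.

```python
import numpy as np

def estimate_row_span(M, n, num_points=50, h=1e-5, tol=1e-6):
    # M: black-box network R^n -> R, queried only through its value
    grads = []
    for _ in range(num_points):
        x = np.random.randn(n)
        g = np.array([(M(x + h * e) - M(x - h * e)) / (2 * h)
                      for e in np.eye(n)])        # central differences
        grads.append(g)
    G = np.stack(grads)                            # (num_points, n)
    # numerical row space of the stacked gradients approximates span(A)
    _, s, Vt = np.linalg.svd(G)
    rank = int((s > tol * s[0]).sum())
    return Vt[:rank]                               # orthonormal basis
```

The paper's guarantees are stronger than this sketch: k/2 vectors from poly(n) non-adaptive queries for ReLU networks, and full recovery through thresholded outputs in the adaptive, differentiable case.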

Curvature Graph Network

Title Curvature Graph Network
Authors Anonymous
Abstract Graph-structured data is prevalent in many domains. Despite the widely celebrated success of deep neural networks, their power on graph-structured data is yet to be fully explored. We propose a novel network architecture that incorporates advanced graph structural features. In particular, we leverage discrete graph curvature, which measures how the neighborhoods of a pair of nodes are structurally related. The curvature of an edge (x, y) measures the distance required to travel from the neighbors of x to the neighbors of y, compared with the length of the edge (x, y) itself. It is a much more descriptive feature than previously used features that focus only on node-specific attributes or limited topological information such as degree. Our curvature graph convolution network outperforms the state of the art on various synthetic and real-world graphs, especially larger and denser ones.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BylEqnVFDB
PDF https://openreview.net/pdf?id=BylEqnVFDB
PWC https://paperswithcode.com/paper/curvature-graph-network
Repo
Framework
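
As a hedged illustration of the edge feature described above, the snippet below computes Forman curvature, a particularly simple discrete curvature for unweighted graphs. The paper builds on Ricci-style curvature, which compares neighborhood distributions more faithfully, but the architectural role (a structural per-edge feature) is the same.

```python
import networkx as nx

def forman_curvature(G):
    # 1-dimensional Forman curvature of an unweighted, undirected edge:
    # F(x, y) = 4 - deg(x) - deg(y)
    return {(x, y): 4 - G.degree(x) - G.degree(y) for x, y in G.edges()}

G = nx.karate_club_graph()
curv = forman_curvature(G)
# Negatively curved edges tend to be "bridges" between dense regions;
# the curvature values can then weight messages in a graph convolution.
```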

The Break-Even Point on the Optimization Trajectories of Deep Neural Networks

Title The Break-Even Point on the Optimization Trajectories of Deep Neural Networks
Authors Anonymous
Abstract Understanding the optimization trajectory is critical to understanding the training of deep neural networks. We show how the hyperparameters of stochastic gradient descent influence the covariance of the gradients (K) and the Hessian of the training loss (H) along this trajectory. Based on a theoretical model, we predict that using a high learning rate or a small batch size in the early phase of training leads SGD to regions of the parameter space with (1) reduced spectral norm of K and (2) improved conditioning of K and H. We show that the point on the trajectory after which these effects hold, which we refer to as the break-even point, is reached early in training. We demonstrate these effects empirically for a range of deep neural networks applied to multiple different tasks. Finally, we apply our analysis to networks with batch normalization (BN) layers and find that a high learning rate is necessary to achieve the loss-smoothing effects previously attributed to BN alone.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=r1g87C4KwB
PDF https://openreview.net/pdf?id=r1g87C4KwB
PWC https://paperswithcode.com/paper/the-break-even-point-on-the-optimization
Repo
Framework
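
One of the quantities tracked above, the spectral norm of the gradient covariance K, can be estimated without ever forming K explicitly. A monitoring sketch assuming a standard PyTorch training setup (a diagnostic utility, not the paper's code):

```python
import torch

def grad_cov_spectral_norm(model, loss_fn, batches, power_iters=20):
    # collect one flattened gradient per minibatch
    gs = []
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        gs.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    G = torch.stack(gs)                       # (num_batches, num_params)
    G = G - G.mean(dim=0, keepdim=True)       # center the gradients

    # power iteration on K = G^T G / B using only matrix-vector products
    v = torch.randn(G.shape[1]); v /= v.norm()
    lam = torch.tensor(0.0)
    for _ in range(power_iters):
        v = G.T @ (G @ v) / G.shape[0]
        lam = v.norm(); v = v / lam
    return lam.item()                         # estimated top eigenvalue of K
```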

Gradients as Features for Deep Representation Learning

Title Gradients as Features for Deep Representation Learning
Authors Anonymous
Abstract We address the challenging problem of deep representation learning: the efficient adaptation of a pre-trained deep network to different tasks. Specifically, we propose to explore gradient-based features. These features are gradients of the model parameters with respect to a task-specific loss given an input sample. Our key innovation is the design of a linear model that incorporates both the gradient features and the activations of the network. We show that our model provides a local linear approximation to an underlying deep model, and discuss important theoretical insights. Moreover, we present an efficient algorithm for training and inference with our model without computing the actual gradients. Our method is evaluated across a number of representation learning tasks on several datasets and with different network architectures. We demonstrate strong results in all settings, and our results are well aligned with our theoretical insights.
Tasks Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=BkeoaeHKDS
PDF https://openreview.net/pdf?id=BkeoaeHKDS
PWC https://paperswithcode.com/paper/gradients-as-features-for-deep-representation
Repo
Framework
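
A naive sketch of the gradient feature itself, assuming a pretrained model and a task-specific surrogate loss: the per-sample feature concatenates the network's output activations with the gradient of the loss with respect to a chosen parameter block. The paper's contribution includes an efficient algorithm that avoids computing these gradients explicitly; this sketch computes them directly for clarity.

```python
import torch

def gradient_feature(model, layer_params, x, proxy_loss):
    # x: a single input sample; layer_params: list of parameter tensors
    out = model(x.unsqueeze(0))
    loss = proxy_loss(out)                          # task-specific loss
    grads = torch.autograd.grad(loss, layer_params)
    g = torch.cat([g.flatten() for g in grads])
    # linear-model input: [activation; gradient] for this sample
    return torch.cat([out.detach().flatten(), g.detach()])
```

A linear classifier (or regressor) trained on these features then acts as the local linear approximation to the deep model discussed in the abstract.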

Putting Machine Translation in Context with the Noisy Channel Model

Title Putting Machine Translation in Context with the Noisy Channel Model
Authors Anonymous
Abstract We show that Bayes’ rule provides a compelling mechanism for controlling unconditional document language models, using the long-standing challenge of effectively leveraging document context in machine translation. In our formulation, we estimate the probability of a candidate translation as the product of the unconditional probability of the candidate output document and the “reverse translation probability” of translating the candidate output back into the input source language document, the so-called “noisy channel” decomposition. A particular advantage of our model is that it requires only parallel sentences to train, rather than parallel documents, which are not always available. Using a new beam search reranking approximation to solve the decoding problem, we find that document language models outperform language models that assume independence between sentences, and that using either a document or sentence language model outperforms comparable models that directly estimate the translation probability. We obtain the best published results on the NIST Chinese–English translation task, a standard task for evaluating document translation. Our model also outperforms the benchmark Transformer model by approximately 2.5 BLEU on the WMT19 Chinese–English translation task.
Tasks Language Modelling, Machine Translation
Published 2020-01-01
URL https://openreview.net/forum?id=B1x1MerYPB
PDF https://openreview.net/pdf?id=B1x1MerYPB
PWC https://paperswithcode.com/paper/putting-machine-translation-in-context-with
Repo
Framework
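
The reranking score implied by the decomposition above is easy to write down. A hedged sketch, with the interpolation weights as assumed hyperparameters: each candidate document translation y is scored by the reverse (channel) model log p(x | y) plus the unconditional document language model log p(y).

```python
def noisy_channel_score(x_sents, y_sents, channel_logprob, doc_lm_logprob,
                        lam=1.0, mu=1.0):
    # channel_logprob(x_i, y_i): log p(x_i | y_i) from a reverse
    # sentence-level translation model (needs only parallel sentences)
    # doc_lm_logprob(y_1..y_n): log p(y) from a document-level LM
    channel = sum(channel_logprob(xs, ys)
                  for xs, ys in zip(x_sents, y_sents))
    return lam * channel + mu * doc_lm_logprob(y_sents)

# n-best candidates from beam search are reranked by this score, which is
# the beam-search reranking approximation mentioned in the abstract.
```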

Behaviour Suite for Reinforcement Learning

Title Behaviour Suite for Reinforcement Learning
Authors Anonymous
Abstract This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully designed experiments that investigate core capabilities of reinforcement learning (RL) agents, with two objectives. First, to collect clear, informative, and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through performance on these shared benchmarks. To complement this effort, we open-source an accompanying library, which automates the evaluation and analysis of any agent on bsuite. This library facilitates reproducible and accessible research on the core issues in RL and, ultimately, the design of superior learning algorithms. Our code is in Python and easy to use within existing projects. We include examples with OpenAI Baselines and Dopamine, as well as new reference implementations. Going forward, we hope to incorporate more excellent experiments from the research community, and we commit to a periodic review of bsuite by a committee of prominent researchers.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rygf-kSYwH
PDF https://openreview.net/pdf?id=rygf-kSYwH
PWC https://paperswithcode.com/paper/behaviour-suite-for-reinforcement-learning-1
Repo
Framework

Semantic Hierarchy Emerges in the Deep Generative Representations for Scene Synthesis

Title Semantic Hierarchy Emerges in the Deep Generative Representations for Scene Synthesis
Authors Anonymous
Abstract Despite the success of Generative Adversarial Networks (GANs) in image synthesis, there is still limited understanding of what the networks learn in their deep generative representations and of how photo-realistic images can be composed from random noise. In this work, we show that a highly structured semantic hierarchy emerges from the generative representations as the variation factors for synthesizing scenes. By probing the layer-wise representations with a broad set of visual concepts at different abstraction levels, we are able to quantify the causality between the activations and the semantics occurring in the output image. Such a quantification identifies the human-understandable variation factors learned by GANs to compose scenes. The qualitative and quantitative results suggest that the generative representations learned by GANs are specialized to synthesize different hierarchical semantics: the early layers tend to determine the spatial layout and configuration, the middle layers control the categorical objects, and the later layers finally render the scene attributes and color scheme. Identifying such a set of manipulable latent semantics facilitates semantic scene manipulation.
Tasks Image Generation
Published 2020-01-01
URL https://openreview.net/forum?id=Syxp-1HtvB
PDF https://openreview.net/pdf?id=Syxp-1HtvB
PWC https://paperswithcode.com/paper/semantic-hierarchy-emerges-in-the-deep
Repo
Framework
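
A hypothetical sketch of the layer-wise probing idea: feed the same latent code through the generator but inject a semantic direction only at one depth, revealing which range of layers controls layout, objects, or attributes. Here `blocks` and `direction` are placeholders for a generator with per-layer latent inputs and a learned concept direction.

```python
import torch

@torch.no_grad()
def edit_at_layer(blocks, z, direction, layer_idx, strength=3.0):
    # blocks: list of generator stages, each mapping hidden -> hidden
    h = z
    for i, block in enumerate(blocks):
        if i == layer_idx:
            h = h + strength * direction   # perturb only at this depth
        h = block(h)
    return h                               # synthesized image

# Sweeping layer_idx from early to late stages should move the edit from
# spatial layout, to object category, to attributes such as color scheme.
```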

Can gradient clipping mitigate label noise?

Title Can gradient clipping mitigate label noise?
Authors Anonymous
Abstract Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum. This intuition has been made precise in a line of recent works, which show that suitable clipping can yield significantly faster convergence than vanilla gradient descent. In this paper, we propose a new lens for studying gradient clipping, namely, robustness: informally, one expects clipping to provide robustness to noise, since one does not overly trust any single sample. Surprisingly, we prove that for the common problem of label noise in classification, standard gradient clipping does not in general provide robustness. On the other hand, we show that a simple variant of gradient clipping is provably robust, and corresponds to suitably modifying the underlying loss function. This yields a simple, noise-robust alternative to the standard cross-entropy loss which performs well empirically.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rklB76EKPr
PDF https://openreview.net/pdf?id=rklB76EKPr
PWC https://paperswithcode.com/paper/can-gradient-clipping-mitigate-label-noise
Repo
Framework
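
A sketch of one loss-clipping variant in the spirit of the abstract, assuming a K-class classifier: cross-entropy is linearized once the probability assigned to the labeled class falls below 1/τ, which caps the per-sample gradient magnitude at τ. Standard gradient clipping, by contrast, rescales the aggregated gradient and loses this per-sample property. The constants keep the loss continuous at the switch point.

```python
import math
import torch

def clipped_cross_entropy(logits, targets, tau=10.0):
    p = torch.softmax(logits, dim=-1)
    p_y = p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # prob of label
    ce = -torch.log(p_y)                                    # standard CE
    linear = -tau * p_y + math.log(tau) + 1.0               # slope -tau
    # use CE while p_y > 1/tau; switch to its tangent line below that
    return torch.where(p_y <= 1.0 / tau, linear, ce).mean()
```

At p_y = 1/τ both branches equal log τ and share the slope -τ, so the modified loss is continuously differentiable while bounding how much any single (possibly mislabeled) example can move the weights.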

Comparing Fine-tuning and Rewinding in Neural Network Pruning

Title Comparing Fine-tuning and Rewinding in Neural Network Pruning
Authors Anonymous
Abstract Neural network pruning is a popular technique for reducing inference costs by removing connections, neurons, or other structure from the network. In the literature, pruning typically follows a standard procedure: train the network, remove unwanted structure (pruning), and train the resulting network further to recover accuracy (fine-tuning). In this paper, we explore an alternative to fine-tuning: rewinding. Rather than continuing to train the resultant pruned network (fine-tuning), rewind the remaining weights to their values from earlier in training, and re-train the resultant network for the remainder of the original training process. We find that this procedure, which repurposes the strategy for finding lottery tickets presented by Frankle et al. (2019), makes it possible to prune networks further than is possible with fine-tuning for a given target accuracy, provided that the weights are rewound to a suitable point in training. We also find that there are wide ranges of suitable rewind points that achieve higher accuracy than fine-tuning across all tested networks. Based on these results, we argue that practitioners should explore rewinding as an alternative to fine-tuning for neural network pruning.
Tasks Network Pruning
Published 2020-01-01
URL https://openreview.net/forum?id=S1gSj0NKvB
PDF https://openreview.net/pdf?id=S1gSj0NKvB
PWC https://paperswithcode.com/paper/comparing-fine-tuning-and-rewinding-in-neural
Repo
Framework
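
A minimal sketch of the rewinding procedure, with `train_one_epoch` and `prune_mask_fn` as assumed callables: snapshot the weights at the rewind point, train to completion, prune, restore the surviving weights to the snapshot, and re-train for the remainder of the original schedule.

```python
import copy
import torch

def apply_mask(model, mask):
    # keep pruned connections at exactly zero
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in mask:
                p.mul_(mask[name])

def prune_with_rewinding(model, train_one_epoch, prune_mask_fn,
                         rewind_epoch, total_epochs):
    snapshot = None
    for epoch in range(total_epochs):
        train_one_epoch(model, epoch)
        if epoch == rewind_epoch:
            snapshot = copy.deepcopy(model.state_dict())  # rewind point
    mask = prune_mask_fn(model)        # e.g. global magnitude pruning mask
    model.load_state_dict(snapshot)    # rewind the remaining weights
    apply_mask(model, mask)
    for epoch in range(rewind_epoch, total_epochs):   # re-train remainder
        train_one_epoch(model, epoch)
        apply_mask(model, mask)
    return model
```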

Lookahead: A Far-sighted Alternative of Magnitude-based Pruning

Title Lookahead: A Far-sighted Alternative of Magnitude-based Pruning
Authors Anonymous
Abstract Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants have shown state-of-the-art performance for pruning modern architectures. Based on the observation that magnitude-based pruning minimizes the Frobenius distortion of the linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single-layer optimization to a multi-layer optimization. Our experimental results demonstrate that the proposed method consistently outperforms magnitude-based pruning on various networks, including VGG and ResNet, particularly in the high-sparsity regime.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=ryl3ygHYDB
PDF https://openreview.net/pdf?id=ryl3ygHYDB
PWC https://paperswithcode.com/paper/lookahead-a-far-sighted-alternative-of
Repo
Framework
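
A simplified, hedged rendering of lookahead scoring for fully-connected layers: each weight is scored by its magnitude multiplied by the norms of the adjacent-layer weights it connects through, approximating the multi-layer Frobenius distortion the abstract describes, and the lowest-scoring weights are pruned. The indexing convention and the exact norm are assumptions of this sketch.

```python
import torch

def lookahead_scores(W_prev, W, W_next):
    # W: (out, in); W_prev feeds W's inputs; W_next consumes W's outputs
    in_norms = W_prev.norm(dim=1)      # one norm per input unit of W
    out_norms = W_next.norm(dim=0)     # one norm per output unit of W
    return W.abs() * out_norms[:, None] * in_norms[None, :]

def prune_smallest(W, scores, sparsity):
    k = max(1, int(sparsity * W.numel()))
    thresh = scores.flatten().kthvalue(k).values
    return W * (scores > thresh).float()   # zero the lowest-scoring weights
```

Plain magnitude pruning is recovered by scoring with W.abs() alone, which ignores how a weight's influence is amplified or attenuated by the neighboring layers.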