April 1, 2020

3046 words 15 mins read

Paper Group NANR 7

BETANAS: Balanced Training and selective drop for Neural Architecture Search. INTERPRETING CNN PREDICTION THROUGH LAYER - WISE SELECTED DISCERNIBLE NEURONS. Adversarial Interpolation Training: A Simple Approach for Improving Model Robustness. Robustness Verification for Transformers. Diversely Stale Parameters for Efficient Training of Deep Convolu …

BETANAS: Balanced Training and selective drop for Neural Architecture Search

Title BETANAS: Balanced Training and selective drop for Neural Architecture Search
Authors Anonymous
Abstract Automatic neural architecture search techniques have recently become increasingly important in machine learning. In particular, weight-sharing methods have shown remarkable potential for finding good network architectures with few computational resources. However, existing weight-sharing methods mainly suffer from limitations in their search strategies: these methods either uniformly train all network paths to convergence, which introduces conflicts between branches and wastes a large amount of computation on unpromising candidates, or selectively train branches at different frequencies, which leads to unfair evaluation and comparison among paths. To address these issues, we propose a novel neural architecture search method with a balanced training strategy to ensure fair comparisons and a selective drop mechanism to reduce conflicts among candidate paths. Experimental results show that our proposed method achieves a leading performance of 79.0% on ImageNet under mobile settings, outperforming other state-of-the-art methods in both accuracy and efficiency.
Tasks Neural Architecture Search
Published 2020-01-01
URL https://openreview.net/forum?id=HyeEIyBtvr
PDF https://openreview.net/pdf?id=HyeEIyBtvr
PWC https://paperswithcode.com/paper/betanas-balanced-training-and-selective-drop
Repo
Framework
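
Below is a minimal Python sketch of the two ideas named in the abstract (balanced training and selective drop) for a weight-sharing supernet. The bookkeeping class, sampling rule, and drop criterion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of balanced training + selective drop for weight-sharing NAS
# (illustrative only; not the BETANAS implementation).
class CandidatePool:
    def __init__(self, num_layers, num_ops):
        self.train_counts = [[0] * num_ops for _ in range(num_layers)]
        self.scores = [[0.0] * num_ops for _ in range(num_layers)]  # running validation score
        self.alive = [set(range(num_ops)) for _ in range(num_layers)]

    def sample_path(self):
        # Balanced training: pick the least-trained surviving candidate in each layer,
        # so all paths receive comparable training before they are compared.
        path = []
        for layer, alive in enumerate(self.alive):
            op = min(alive, key=lambda o: self.train_counts[layer][o])
            self.train_counts[layer][op] += 1
            path.append(op)
        return path

    def update_scores(self, path, val_score, momentum=0.9):
        for layer, op in enumerate(path):
            old = self.scores[layer][op]
            self.scores[layer][op] = momentum * old + (1 - momentum) * val_score

    def selective_drop(self, keep_ratio=0.75):
        # Selective drop: remove the weakest candidates so they no longer conflict
        # with promising paths or consume training budget.
        for layer, alive in enumerate(self.alive):
            ranked = sorted(alive, key=lambda o: self.scores[layer][o], reverse=True)
            keep = max(1, int(len(ranked) * keep_ratio))
            self.alive[layer] = set(ranked[:keep])
```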

INTERPRETING CNN PREDICTION THROUGH LAYER-WISE SELECTED DISCERNIBLE NEURONS

Title INTERPRETING CNN PREDICTION THROUGH LAYER-WISE SELECTED DISCERNIBLE NEURONS
Authors Anonymous
Abstract In recent years, researchers have been working on interpreting the inner workings of deep networks in pursuit of overcoming their opaqueness and the ‘black-box’ label attached to them. In this work, we present a new visual interpretation technique that identifies the discriminative image locations contributing most to a network’s prediction. We select the most contributing set of neurons per layer and engineer the forward pass operation to gradually reach the important locations of the input image. We explore the connectivity structure of each neuron and obtain support from the succeeding and preceding layers, along with its evidence from the current layer, to advocate for a neuron’s importance. While conducting this operation, we also assign priorities to the supports from neighboring layers, which, in practice, provides a reliable way of selecting the discriminative set of neurons for the target layer. We conduct both objective and subjective evaluations to examine the performance of our method in terms of the model’s faithfulness and human trust, and we demonstrate its efficacy over other existing methods.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HylYBlBYvB
PDF https://openreview.net/pdf?id=HylYBlBYvB
PWC https://paperswithcode.com/paper/interpreting-cnn-prediction-through-layer
Repo
Framework

Adversarial Interpolation Training: A Simple Approach for Improving Model Robustness

Title Adversarial Interpolation Training: A Simple Approach for Improving Model Robustness
Authors Anonymous
Abstract We propose a simple approach for adversarial training. The proposed approach utilizes an adversarial interpolation scheme for generating adversarial images and accompanying adversarial labels, which are then used in place of the original data for model training. The proposed approach is intuitive to understand, simple to implement, and achieves state-of-the-art performance. We evaluate it on a number of datasets including CIFAR10, CIFAR100 and SVHN. Extensive empirical comparisons with several state-of-the-art methods under different attacks verify the effectiveness of the proposed approach.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Syejj0NYvr
PDF https://openreview.net/pdf?id=Syejj0NYvr
PWC https://paperswithcode.com/paper/adversarial-interpolation-training-a-simple
Repo
Framework
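
A hedged PyTorch sketch of the adversarial-interpolation idea: each image is perturbed toward the features of a randomly paired partner image, and its label is interpolated away from the partner's label. The feature head, step sizes, and label rule below are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_interpolation_batch(model, x, y, num_classes,
                                    eps=8 / 255, step=2 / 255, steps=5, lam=0.5):
    """Hedged sketch: craft x_adv by pulling x toward a partner image in feature
    space, and soften the label away from the partner's class."""
    idx = torch.randperm(x.size(0), device=x.device)
    x_tilde, y_tilde = x[idx], y[idx]

    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        feat_adv = model.features(x_adv)                 # assumes the model exposes a .features() head
        feat_tgt = model.features(x_tilde).detach()
        loss = F.mse_loss(feat_adv, feat_tgt)            # distance to the partner's features
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() - step * grad.sign()      # descend: move toward the partner
        x_adv = (x + torch.clamp(x_adv - x, -eps, eps)).clamp(0.0, 1.0).detach()

    # Adversarial label: keep mass on the true class, push mass away from the partner class.
    y_true = F.one_hot(y, num_classes).float()
    y_partner = F.one_hot(y_tilde, num_classes).float()
    y_adv = y_true - lam * y_partner
    y_adv = y_adv - y_adv.min(dim=1, keepdim=True).values
    y_adv = y_adv / y_adv.sum(dim=1, keepdim=True)
    return x_adv, y_adv
```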

Robustness Verification for Transformers

Title Robustness Verification for Transformers
Authors Anonymous
Abstract Robustness verification that aims to formally certify the prediction behavior of neural networks has become an important tool for understanding the behavior of a given model and for obtaining safety guarantees. However, previous methods are usually limited to relatively simple neural networks. In this paper, we consider the robustness verification problem for Transformers. Transformers have very complicated self-attention layers that create many challenges for verification, including cross-nonlinearity and cross-position dependency that have not been solved in previous work. We resolve these key challenges and develop the first verification algorithm for Transformers. The certified robustness bounds computed by our method are significantly tighter than those by naive Interval Bound Propagation, and they also consistently reflect the importance of different words in sentiment analysis and thus are meaningful in practice.
Tasks Sentiment Analysis
Published 2020-01-01
URL https://openreview.net/forum?id=BJxwPJHFwS
PDF https://openreview.net/pdf?id=BJxwPJHFwS
PWC https://paperswithcode.com/paper/robustness-verification-for-transformers
Repo
Framework
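
The naive Interval Bound Propagation (IBP) baseline that the abstract compares against is simple to state. Below is a hedged NumPy sketch of IBP through an affine layer and a ReLU, and of how such bounds certify a prediction; the paper's actual algorithm for self-attention layers is considerably more involved.

```python
import numpy as np

def ibp_affine(center, radius, W, b):
    """Propagate the box {x : |x - center| <= radius} through y = W x + b."""
    return W @ center + b, np.abs(W) @ radius

def ibp_relu(center, radius):
    lower = np.maximum(center - radius, 0.0)
    upper = np.maximum(center + radius, 0.0)
    return (lower + upper) / 2.0, (upper - lower) / 2.0

def certified_margin(x, eps, W1, b1, W2, b2, true_class):
    """Certify a two-layer ReLU network on the l_inf ball of radius eps around x."""
    c, r = ibp_affine(x, np.full_like(x, eps), W1, b1)
    c, r = ibp_relu(c, r)
    c, r = ibp_affine(c, r, W2, b2)
    lower, upper = c - r, c + r
    # Worst-case margin: lower bound of the true logit minus the largest upper bound
    # among the other logits; a positive value certifies the prediction.
    return lower[true_class] - np.delete(upper, true_class).max()
```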

Diversely Stale Parameters for Efficient Training of Deep Convolutional Networks

Title Diversely Stale Parameters for Efficient Training of Deep Convolutional Networks
Authors Anonymous
Abstract The backpropagation algorithm is the most popular algorithm for training neural networks today. However, it suffers from forward locking, backward locking and update locking, especially when a neural network is so large that its layers are distributed across multiple devices. Existing solutions either handle only one of these locking problems or lead to severe accuracy loss or memory inefficiency. Moreover, none of them consider the straggler problem among devices. In this paper, we propose \textbf{Layer-wise Staleness} and a novel efficient training algorithm, \textbf{Diversely Stale Parameters} (DSP), which addresses all these challenges without loss of accuracy or memory issues. We also analyze the convergence of DSP with two popular gradient-based methods and prove that both are guaranteed to converge to critical points for non-convex problems. Finally, extensive experimental results on training deep convolutional neural networks demonstrate that our proposed DSP algorithm achieves significant training speedup with stronger robustness and better generalization than competing methods.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJgLlgBKvH
PDF https://openreview.net/pdf?id=HJgLlgBKvH
PWC https://paperswithcode.com/paper/diversely-stale-parameters-for-efficient-1
Repo
Framework

Towards Interpreting Deep Neural Networks via Understanding Layer Behaviors

Title Towards Interpreting Deep Neural Networks via Understanding Layer Behaviors
Authors Anonymous
Abstract Deep neural networks (DNNs) have achieved unprecedented practical success in many applications. However, how to interpret DNNs is still an open problem. In particular, how hidden layers behave is not clearly understood. In this paper, relying on a teacher-student paradigm, we seek to understand the layer behaviors of DNNs by monitoring both across-layer and single-layer distribution evolution toward some target distribution during training. Here, “across-layer” considers the layer behavior \emph{along the depth}, while “single-layer” considers a specific layer \emph{along training epochs}. Relying on optimal transport theory, we employ the Wasserstein distance ($W$-distance) to measure the divergence between a layer’s distribution and the target distribution. Theoretically, we prove that (i) the $W$-distance of layers to the target distribution tends to decrease along the depth; (ii) the $W$-distance of a specific layer to the target distribution tends to decrease along training iterations; (iii) however, a deep layer is not always better than a shallow layer for some samples. Moreover, our results help to analyze the stability of layer distributions and explain why auxiliary losses help the training of DNNs. Extensive experiments on real-world datasets justify our theoretical findings.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rkxMKerYwr
PDF https://openreview.net/pdf?id=rkxMKerYwr
PWC https://paperswithcode.com/paper/towards-interpreting-deep-neural-networks-via
Repo
Framework
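
The monitoring procedure described above can be illustrated with SciPy's one-dimensional Wasserstein distance as a stand-in for the paper's optimal-transport machinery; the target distribution and synthetic activations below are placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def layerwise_w_distances(activations_per_layer, target_samples):
    """activations_per_layer: one array of (flattened) activations per layer.
    target_samples: samples drawn from the target distribution.
    Claim (i) above predicts these distances tend to decrease with depth."""
    return [wasserstein_distance(np.ravel(a), np.ravel(target_samples))
            for a in activations_per_layer]

# Synthetic usage example: deeper "layers" drift toward a standard normal target.
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=5000)
layers = [rng.normal(3.0 / (depth + 1), 1.0, size=5000) for depth in range(5)]
print(layerwise_w_distances(layers, target))  # roughly decreasing along depth
```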

Plan2Vec: Unsupervised Representation Learning by Latent Plans

Title Plan2Vec: Unsupervised Representation Learning by Latent Plans
Authors Anonymous
Abstract Creating a useful representation of the world takes more than just rote memorization of individual data samples. This is because fundamentally, we use our internal representation to plan, to solve problems, and to navigate the world. For a representation to be amenable to planning, it is critical for it to embody some notion of optimality. A representation learning objective that explicitly considers some form of planning should generate representations which are more computationally valuable than those that memorize samples. In this paper, we introduce \textbf{Plan2Vec}, an unsupervised representation learning objective inspired by value-based reinforcement learning methods. By abstracting away low-level control with a learned local metric, we show that it is possible to learn plannable representations that inform long-range structures, entirely passively from high-dimensional sequential datasets without supervision. A latent space is learned by playing an “Imagined Planning Game” on the graph formed by the data points, using a local metric function trained contrastively from context. We show that the global metric on this learned embedding can be used to plan with O(1) complexity by linear interpolation. This exponential speed-up is critical for planning with a learned representation on any problem containing non-trivial global topology. We demonstrate the effectiveness of Plan2Vec on simulated toy tasks from both proprioceptive and image states, as well as two real-world image datasets, showing that Plan2Vec can effectively plan using learned representations. Additional results and videos can be found at \url{https://sites.google.com/view/plan2vec}.
Tasks Representation Learning, Unsupervised Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=Bye6weHFvB
PDF https://openreview.net/pdf?id=Bye6weHFvB
PWC https://paperswithcode.com/paper/plan2vec-unsupervised-representation-learning
Repo
Framework

Potential Flow Generator with $L_2$ Optimal Transport Regularity for Generative Models

Title Potential Flow Generator with $L_2$ Optimal Transport Regularity for Generative Models
Authors Anonymous
Abstract We propose a potential flow generator with $L_2$ optimal transport regularity, which can be easily integrated into a wide range of generative models including different versions of GANs and flow-based models. With only a slight augmentation of the original generator loss functions, our generator is not only a transport map from the input distribution to the target one, but also the one with minimum $L_2$ transport cost. We show the correctness and robustness of the potential flow generator in several 2D problems, and illustrate the concept of “proximity” due to the $L_2$ optimal transport regularity. Subsequently, we demonstrate the effectiveness of the potential flow generator in image translation tasks with unpaired training data from the MNIST dataset and the CelebA dataset.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SkexNpNFwS
PDF https://openreview.net/pdf?id=SkexNpNFwS
PWC https://paperswithcode.com/paper/potential-flow-generator-with-l_2-optimal-1
Repo
Framework
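
As a rough illustration of the "slight augmentation of the original generator loss" mentioned above, when the generator maps samples of the input distribution directly to the target space (as in the unpaired image-translation experiments), one can add the empirical squared $L_2$ transport cost to a standard GAN generator loss. The specific GAN loss and weight below are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def generator_loss_with_ot(generator, discriminator, x_input, ot_weight=0.1):
    """Hedged sketch: non-saturating GAN generator loss plus an empirical L2
    transport-cost penalty, so G moves each input sample as little as possible."""
    x_fake = generator(x_input)
    gan_loss = F.softplus(-discriminator(x_fake)).mean()
    dims = tuple(range(1, x_fake.dim()))
    transport_cost = ((x_fake - x_input) ** 2).sum(dim=dims).mean()
    return gan_loss + ot_weight * transport_cost
```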

Simple but effective techniques to reduce dataset biases

Title Simple but effective techniques to reduce dataset biases
Authors Anonymous
Abstract There have been several studies recently showing that strong natural language understanding (NLU) models are prone to relying on unwanted dataset biases without learning the underlying task, resulting in models which fail to generalize to out-of-domain datasets, and are likely to perform poorly in real-world scenarios. We propose several learning strategies to train neural models which are more robust to such biases and transfer better to out-of-domain datasets. We introduce an additional lightweight bias-only model which learns dataset biases and uses its prediction to adjust the loss of the base model to reduce the biases. In other words, our methods down-weight the importance of the biased examples, and focus training on hard examples, i.e. examples that cannot be correctly classified by only relying on biases. Our approaches are model agnostic and simple to implement. We experiment on large-scale natural language inference and fact verification datasets and their out-of-domain datasets and show that our debiased models significantly improve the robustness in all settings, including gaining 9.76 points on the FEVER symmetric evaluation dataset, 5.45 on the HANS dataset and 4.78 points on the SNLI hard set. These datasets are specifically designed to assess the robustness of models in the out-of-domain setting where typical biases in the training data do not exist in the evaluation set.
Tasks Natural Language Inference
Published 2020-01-01
URL https://openreview.net/forum?id=SJlCK1rYwB
PDF https://openreview.net/pdf?id=SJlCK1rYwB
PWC https://paperswithcode.com/paper/simple-but-effective-techniques-to-reduce-1
Repo
Framework
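
A hedged PyTorch sketch of the down-weighting strategy described above: a lightweight bias-only model (for NLI, typically a hypothesis-only classifier) scores each example, and the main model's cross-entropy loss is scaled down when the bias model already predicts the gold label confidently. The exact combination used in the paper (for example, product-of-experts variants) may differ.

```python
import torch
import torch.nn.functional as F

def debiased_loss(main_logits, bias_logits, labels):
    """Down-weight examples that the bias-only model already classifies correctly
    with high confidence, focusing training on examples that need the real task."""
    with torch.no_grad():
        bias_probs = F.softmax(bias_logits, dim=-1)
        p_gold = bias_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        weights = 1.0 - p_gold                # easy-for-the-bias examples get small weight
    per_example = F.cross_entropy(main_logits, labels, reduction="none")
    return (weights * per_example).sum() / weights.sum().clamp(min=1e-8)
```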

Distributionally Robust Neural Networks

Title Distributionally Robust Neural Networks
Authors Anonymous
Abstract Overparameterized neural networks trained to minimize average loss can be highly accurate on average on an i.i.d. test set, yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that do not hold at test time). Distributionally robust optimization (DRO) provides an approach for learning models that instead minimize worst-case training loss over a set of pre-defined groups. We find, however, that naively applying DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss will also already have vanishing worst-case training loss. Instead, the poor worst-case performance of these models arises from poor generalization on some groups. As a solution, we show that increased regularization—e.g., stronger-than-typical weight decay or early stopping—allows DRO models to achieve substantially higher worst-group accuracies, with 10% to 40% improvements over standard models on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is critical for worst-group performance in the overparameterized regime, even if it is not needed for average performance. Finally, we introduce and provide convergence guarantees for a stochastic optimizer for this group DRO setting, underpinning the empirical study above.
Tasks Natural Language Inference
Published 2020-01-01
URL https://openreview.net/forum?id=ryxGuJrFvS
PDF https://openreview.net/pdf?id=ryxGuJrFvS
PWC https://paperswithcode.com/paper/distributionally-robust-neural-networks
Repo
Framework
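
The stochastic optimizer for the group DRO objective mentioned at the end of the abstract can be sketched as an exponentiated-gradient update on per-group weights followed by a weighted loss; the step size and bookkeeping below are illustrative.

```python
import torch
import torch.nn.functional as F

class GroupDROLoss:
    """Hedged sketch of online group DRO: keep a weight per group, up-weight the
    groups with high loss, and train the model on the weighted group losses."""
    def __init__(self, num_groups, eta=0.01):
        self.q = torch.ones(num_groups) / num_groups
        self.eta = eta

    def __call__(self, logits, labels, group_ids):
        self.q = self.q.to(logits.device)
        per_example = F.cross_entropy(logits, labels, reduction="none")
        losses = []
        for g in range(self.q.numel()):
            mask = group_ids == g
            losses.append(per_example[mask].mean() if mask.any()
                          else torch.zeros((), device=logits.device))
        group_losses = torch.stack(losses)
        # Exponentiated-gradient ascent on the group weights (the "adversary" step).
        self.q = self.q * torch.exp(self.eta * group_losses.detach())
        self.q = self.q / self.q.sum()
        return (self.q * group_losses).sum()
```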

Adapting to Label Shift with Bias-Corrected Calibration

Title Adapting to Label Shift with Bias-Corrected Calibration
Authors Anonymous
Abstract Label shift refers to the phenomenon where the marginal probability p(y) of observing a particular class changes between the training and test distributions, while the conditional probability p(x|y) stays fixed. This is relevant in settings such as medical diagnosis, where a classifier trained to predict disease based on observed symptoms may need to be adapted to a different distribution where the baseline frequency of the disease is higher. Given estimates of p(y|x) from a predictive model, one can apply domain adaptation procedures including Expectation Maximization (EM) and Black-Box Shift Estimation (BBSE) to efficiently correct for the difference in class proportions between the training and test distributions. Unfortunately, modern neural networks typically fail to produce well-calibrated estimates of p(y|x), reducing the effectiveness of these approaches. In recent years, Temperature Scaling has emerged as an efficient approach to combat miscalibration. However, the effectiveness of Temperature Scaling in the context of adaptation to label shift has not been explored. In this work, we study the impact of various calibration approaches on shift estimates produced by EM or BBSE. In experiments with image classification and diabetic retinopathy detection, we find that calibration consistently tends to improve shift estimation. In particular, calibration approaches that include class-specific bias parameters are significantly better than approaches that lack class-specific bias parameters, suggesting that reducing systematic bias in the calibrated probabilities is especially important for domain adaptation.
Tasks Calibration, Diabetic Retinopathy Detection, Domain Adaptation, Image Classification, Medical Diagnosis
Published 2020-01-01
URL https://openreview.net/forum?id=rkx-wA4YPS
PDF https://openreview.net/pdf?id=rkx-wA4YPS
PWC https://paperswithcode.com/paper/adapting-to-label-shift-with-bias-corrected
Repo
Framework
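
The EM procedure for label-shift adaptation referenced above is short enough to sketch directly: it takes (ideally calibrated) predictions p(y|x) on the unlabeled test set together with the training class priors and re-estimates the test priors. Temperature scaling or the bias-corrected calibration studied in the paper would be applied to the model's logits before this step.

```python
import numpy as np

def em_label_shift(test_probs, train_priors, num_iters=100, tol=1e-8):
    """test_probs: (n, k) calibrated p(y|x) on the unlabeled test set.
    train_priors: (k,) class frequencies in the training set.
    Returns estimated test priors and the adapted posteriors."""
    test_priors = train_priors.astype(float).copy()
    adapted = test_probs
    for _ in range(num_iters):
        # E-step: reweight posteriors by the current prior ratio and renormalize.
        adapted = test_probs * (test_priors / train_priors)
        adapted = adapted / adapted.sum(axis=1, keepdims=True)
        # M-step: the new prior estimate is the mean adapted posterior.
        new_priors = adapted.mean(axis=0)
        if np.abs(new_priors - test_priors).max() < tol:
            test_priors = new_priors
            break
        test_priors = new_priors
    return test_priors, adapted
```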

ExpandNets: Linear Over-parameterization to Train Compact Convolutional Networks

Title ExpandNets: Linear Over-parameterization to Train Compact Convolutional Networks
Authors Anonymous
Abstract In this paper, we introduce a novel approach to training a given compact network. To this end, we build upon over-parameterization, which typically improves both optimization and generalization in neural network training, while being unnecessary at inference time. We propose to expand each linear layer of the compact network into multiple linear layers, without adding any nonlinearity. As such, the resulting expanded network can benefit from over-parameterization during training but can be compressed back to the compact one algebraically at inference. As evidenced by our experiments, this consistently outperforms training the compact network from scratch and knowledge distillation using a teacher. In this context, we introduce several expansion strategies, together with an initialization scheme, and demonstrate the benefits of our ExpandNets on several tasks, including image classification, object detection, and semantic segmentation.
Tasks Image Classification, Object Detection, Semantic Segmentation
Published 2020-01-01
URL https://openreview.net/forum?id=B1x3EgHtwB
PDF https://openreview.net/pdf?id=B1x3EgHtwB
PWC https://paperswithcode.com/paper/expandnets-linear-over-parameterization-to
Repo
Framework
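
The algebraic fact behind ExpandNets, that a stack of purely linear layers collapses back into a single linear layer at inference time, is easy to verify. The NumPy sketch below composes two expanded fully connected layers into one; the analogous identity for expanded convolutions is slightly more involved.

```python
import numpy as np

def collapse_linear(W1, b1, W2, b2):
    """Two linear layers y = W2 (W1 x + b1) + b2 collapse into one: y = W x + b."""
    return W2 @ W1, W2 @ b1 + b2

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 32, 4            # expand an 8->4 layer into 8->32->4 for training
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

W, b = collapse_linear(W1, b1, W2, b2)      # compact layer used at inference
x = rng.normal(size=d_in)
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b)   # identical outputs, fewer parameters
```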

Sensible adversarial learning

Title Sensible adversarial learning
Authors Anonymous
Abstract The trade-off between robustness and standard accuracy has been consistently reported in the machine learning literature. Although the problem has been widely studied to understand and explain this trade-off, no studies have shown the possibility of a solution that avoids it. In this paper, motivated by the fact that a high-dimensional distribution is poorly represented by limited data samples, we introduce sensible adversarial learning and demonstrate the synergistic effect between the pursuits of natural accuracy and robustness. Specifically, we define a sensible adversary which is useful for learning a defense model while keeping a high natural accuracy. We theoretically establish that the Bayes rule is the most robust multi-class classifier under the 0-1 loss in sensible adversarial learning. We propose a novel and efficient algorithm that trains a robust model with sensible adversarial examples, without a significant drop in natural accuracy. On CIFAR10, our model yields state-of-the-art results against various attacks with perturbations restricted to l∞ with ε = 8/255, e.g., a robust accuracy of 65.17% against PGD attacks together with a natural accuracy of 91.51%.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rJlf_RVKwr
PDF https://openreview.net/pdf?id=rJlf_RVKwr
PWC https://paperswithcode.com/paper/sensible-adversarial-learning
Repo
Framework
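
As a hedged illustration of what a "sensible adversary" might look like in code, the PGD-style sketch below stops perturbing an example once the model already misclassifies it, so training examples are not pushed far past the decision boundary. This early-stopping criterion is an assumption for illustration; the paper defines the sensible adversary and its theory more precisely.

```python
import torch
import torch.nn.functional as F

def sensible_pgd(model, x, y, eps=8 / 255, step=2 / 255, steps=10):
    """PGD variant that freezes an example as soon as it is misclassified, instead of
    continuing to push it further away from its class."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        with torch.no_grad():
            still_correct = model(x_adv).argmax(dim=1) == y   # only keep attacking these
        if not still_correct.any():
            break
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        mask = still_correct.float().view(-1, *([1] * (x.dim() - 1)))
        x_adv = x_adv.detach() + step * grad.sign() * mask
        x_adv = (x + torch.clamp(x_adv - x, -eps, eps)).clamp(0.0, 1.0).detach()
    return x_adv
```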

SPROUT: Self-Progressing Robust Training

Title SPROUT: Self-Progressing Robust Training
Authors Anonymous
Abstract Enhancing model robustness under new and even adversarial environments is a crucial milestone toward building trustworthy and reliable machine learning systems. Current robust training methods such as adversarial training explicitly specify an “attack” (e.g., $\ell_{\infty}$-norm bounded perturbation) to generate adversarial examples during model training in order to improve adversarial robustness. In this paper, we take a different perspective and propose a new framework SPROUT, self-progressing robust training. During model training, SPROUT progressively adjusts training label distribution via our proposed parametrized label smoothing technique, making training free of attack generation and more scalable. We also motivate SPROUT using a general formulation based on vicinity risk minimization, which includes many robust training methods as special cases. Compared with state-of-the-art adversarial training methods (PGD-$\ell_\infty$ and TRADES) under $\ell_{\infty}$-norm bounded attacks and various invariance tests, SPROUT consistently attains superior performance and is more scalable to large neural networks. Our results shed new light on scalable, effective and attack-independent robust training methods.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SyxGoJrtPr
PDF https://openreview.net/pdf?id=SyxGoJrtPr
PWC https://paperswithcode.com/paper/sprout-self-progressing-robust-training
Repo
Framework
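
A hedged sketch of the parametrized label smoothing idea described above: each class keeps its own smoothing distribution, which is progressively adjusted during training, and no adversarial attack is generated. The specific update rule below (nudging each class's distribution toward the model's running predictions) is an illustrative assumption, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

class ParametrizedLabelSmoothing:
    """Per-class smoothing distributions theta[c], progressively updated during training."""
    def __init__(self, num_classes, alpha=0.1, momentum=0.99):
        self.theta = torch.full((num_classes, num_classes), 1.0 / num_classes)
        self.alpha = alpha
        self.momentum = momentum

    def targets(self, labels):
        onehot = F.one_hot(labels, self.theta.size(0)).float()
        return (1 - self.alpha) * onehot + self.alpha * self.theta[labels]

    def update(self, logits, labels):
        # Illustrative progression: move each class's smoothing distribution toward
        # the model's average (detached) prediction on that class.
        probs = F.softmax(logits.detach(), dim=-1)
        for c in labels.unique():
            mean_pred = probs[labels == c].mean(dim=0)
            self.theta[c] = self.momentum * self.theta[c] + (1 - self.momentum) * mean_pred

def sprout_style_loss(logits, labels, smoother):
    targets = smoother.targets(labels)
    loss = F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")
    smoother.update(logits, labels)
    return loss
```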

Bias in word embeddings

Title Bias in word embeddings
Authors Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, Fabienne Marco
Abstract Word embeddings are a widely used set of natural language processing techniques that map words to vectors of real numbers. These vectors are used to improve the quality of generative and predictive models. Recent studies demonstrate that word embeddings contain and amplify biases present in data, such as stereotypes and prejudice. In this study, we provide a complete overview of bias in word embeddings. We develop a new technique for bias detection for gendered languages and use it to compare bias in embeddings trained on Wikipedia and on political social media data. We investigate bias diffusion and prove that existing biases are transferred to further machine learning models. We test two techniques for bias mitigation and show that the generally proposed methodology for debiasing models at the embeddings level is insufficient. Finally, we employ biased word embeddings and illustrate that they can be used for the detection of similar biases in new data. Given that word embeddings are widely used by commercial companies, we discuss the challenges and required actions towards fair algorithmic implementations and applications.
Tasks Word Embeddings
Published 2020-01-27
URL https://dl.acm.org/doi/abs/10.1145/3351095.3372843
PDF https://dl.acm.org/doi/pdf/10.1145/3351095.3372843
PWC https://paperswithcode.com/paper/bias-in-word-embeddings
Repo
Framework
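
As a concrete illustration of how bias in word embeddings is commonly quantified, the sketch below computes a WEAT-style association score between two target word sets and two attribute sets. WEAT is a standard test from this literature, not the gendered-language detection technique introduced in the paper, and the example word lists are placeholders.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def association(word, A, B, emb):
    """s(word, A, B): how much more the word associates with attribute set A than with B."""
    sim_a = np.mean([cosine(emb[word], emb[a]) for a in A])
    sim_b = np.mean([cosine(emb[word], emb[b]) for b in B])
    return sim_a - sim_b

def weat_effect_size(X, Y, A, B, emb):
    """Effect size of the differential association between target sets X, Y and attributes A, B."""
    x_assoc = np.array([association(x, A, B, emb) for x in X])
    y_assoc = np.array([association(y, A, B, emb) for y in Y])
    pooled = np.concatenate([x_assoc, y_assoc])
    return (x_assoc.mean() - y_assoc.mean()) / (pooled.std() + 1e-12)

# Usage sketch: emb maps words to vectors, e.g. loaded from pretrained embeddings.
# X, Y = ["engineer", "scientist"], ["nurse", "teacher"]
# A, B = ["he", "man"], ["she", "woman"]
# print(weat_effect_size(X, Y, A, B, emb))
```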