Paper Group NANR 106
Rethinking the Security of Skip Connections in ResNet-like Neural Networks. Fully Polynomial-Time Randomized Approximation Schemes for Global Optimization of High-Dimensional Folded Concave Penalized Generalized Linear Models. Batch-shaping for learning conditional channel gated networks. Instant Quantization of Neural Networks using Monte Carlo Me …
Rethinking the Security of Skip Connections in ResNet-like Neural Networks
Title | Rethinking the Security of Skip Connections in ResNet-like Neural Networks |
Authors | Anonymous |
Abstract | Skip connections are an essential component of current state-of-the-art deep neural networks (DNNs) such as ResNet, WideResNet, DenseNet, and ResNeXt. Despite their huge success in building deeper and more powerful DNNs, we identify a surprising \emph{security weakness} of skip connections in this paper. Use of skip connections \textit{allows easier generation of highly transferable adversarial examples}. Specifically, in ResNet-like (with skip connections) neural networks, gradients can backpropagate through either skip connections or residual connections. We find that using more of the gradient from the skip connections than from the residual connections, according to a decay factor, allows one to craft adversarial examples with high transferability. Our method is termed the \emph{Skip Gradient Method} (SGM). We conduct comprehensive transfer attacks against 10 state-of-the-art DNNs including ResNets, DenseNets, Inceptions, Inception-ResNet, Squeeze-and-Excitation Network (SENet) and robustly trained DNNs. We show that employing SGM on the gradient flow can greatly improve the transferability of crafted attacks in almost all cases. Furthermore, SGM can be easily combined with existing black-box attack techniques and obtains substantial improvements over state-of-the-art transferability methods. Our findings not only motivate new research into the architectural vulnerability of DNNs, but also open up further challenges for the design of secure DNN architectures. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlRs34Fvr |
https://openreview.net/pdf?id=BJlRs34Fvr | |
PWC | https://paperswithcode.com/paper/rethinking-the-security-of-skip-connections |
Repo | |
Framework | |
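The decay-factor mechanism above can be sketched as a small autograd hook: a minimal, illustrative PyTorch fragment that attenuates the gradient flowing through the residual branch of a block, so that the attack gradient relies more on the skip connection. The name `residual_fn` and the default `gamma` are assumptions for illustration, not the paper's code.

```python
import torch

class ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; scales the incoming gradient by gamma in the backward pass."""

    @staticmethod
    def forward(ctx, x, gamma):
        ctx.gamma = gamma
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return ctx.gamma * grad_output, None


def residual_block_with_sgm(x, residual_fn, gamma=0.5):
    # Skip branch: gradient passes through unchanged.
    # Residual branch: gradient is attenuated by the decay factor gamma,
    # so crafted perturbations lean more heavily on the skip connections.
    return x + ScaleGrad.apply(residual_fn(x), gamma)
```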
Fully Polynomial-Time Randomized Approximation Schemes for Global Optimization of High-Dimensional Folded Concave Penalized Generalized Linear Models
Title | Fully Polynomial-Time Randomized Approximation Schemes for Global Optimization of High-Dimensional Folded Concave Penalized Generalized Linear Models |
Authors | Anonymous |
Abstract | Global solutions to high-dimensional sparse estimation problems with a folded concave penalty (FCP) have been shown to be statistically desirable but are strongly NP-hard to compute, which implies the non-existence of a pseudo-polynomial-time global optimization scheme in the worst case. This paper shows that, with high probability, a global solution to the formulation for an FCP-based high-dimensional generalized linear model coincides with a stationary point characterized by the significant subspace second order necessary conditions (S$^3$ONC). Since the desired S$^3$ONC solution admits a fully polynomial-time approximation scheme (FPTAS), we have thus shown the existence of a fully polynomial-time randomized approximation scheme (FPRAS) for a strongly NP-hard problem. We further demonstrate two versions of the FPRAS for generating the desired S$^3$ONC solutions. One follows the paradigm of an interior point trust region algorithm and the other is the well-studied local linear approximation (LLA). Our analysis thus provides new techniques for global optimization of certain NP-hard problems and new insights on the effectiveness of LLA. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Byx0iAEYPH |
https://openreview.net/pdf?id=Byx0iAEYPH | |
PWC | https://paperswithcode.com/paper/fully-polynomial-time-randomized |
Repo | |
Framework | |
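For concreteness, here is a minimal NumPy sketch of the local linear approximation (LLA) loop for a SCAD-type folded concave penalty on least squares; the paper treats generalized linear models more broadly, and the SCAD derivative, the default `a = 3.7`, and the inner ISTA solver are standard choices assumed here rather than taken from the paper.

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty; serves as per-coordinate L1 weights in LLA."""
    b = np.abs(beta)
    return np.where(b <= lam, lam, np.maximum(a * lam - b, 0.0) / (a - 1.0))

def lla_scad_least_squares(X, y, lam, n_outer=3, n_inner=500):
    """LLA: repeatedly linearize the folded concave penalty at the current iterate
    and solve the resulting weighted-lasso subproblem (here with plain ISTA)."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_outer):
        w = scad_derivative(beta, lam)            # per-coordinate L1 weights
        for _ in range(n_inner):                  # inner weighted-lasso solve via ISTA
            grad = X.T @ (X @ beta - y) / n
            z = beta - step * grad
            beta = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)
    return beta
```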
Batch-shaping for learning conditional channel gated networks
Title | Batch-shaping for learning conditional channel gated networks |
Authors | Anonymous |
Abstract | We present a method that trains large capacity neural networks with significantly improved accuracy and lower dynamic computational cost. This is achieved by gating the deep-learning architecture at a fine-grained level. Individual convolutional maps are turned on/off conditionally on features in the network. To achieve this, we introduce a new residual block architecture that gates convolutional channels in a fine-grained manner. We also introduce a generally applicable tool, batch-shaping, that matches the marginal aggregate posteriors of features in a neural network to a pre-specified prior distribution. We use this novel technique to force gates to be more conditional on the data. We present results on the CIFAR-10 and ImageNet datasets for image classification, and on Cityscapes for semantic segmentation. Our results show that our method can slim down large architectures conditionally, such that the average computational cost on the data is on par with that of a smaller architecture, but with higher accuracy. In particular, on ImageNet, our ResNet50 and ResNet34 gated networks obtain 74.60% and 72.55% top-1 accuracy compared to the 69.76% accuracy of the baseline ResNet18 model, for similar complexity. We also show that the resulting networks automatically learn to use more features for difficult examples and fewer features for simple examples. |
Tasks | Image Classification, Semantic Segmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bke89JBtvB |
https://openreview.net/pdf?id=Bke89JBtvB | |
PWC | https://paperswithcode.com/paper/batch-shaping-for-learning-conditional |
Repo | |
Framework | |
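One plausible reading of the batch-shaping regularizer is a Cramér-von Mises-style statistic that pulls the within-batch empirical CDF of gate activations toward the CDF of a Beta prior. The sketch below follows that reading; the custom backward pass, the Beta parameters, and the exact statistic are assumptions rather than the paper's formulation, and CPU tensors are assumed.

```python
import torch
from scipy.stats import beta as beta_dist

class BetaCDF(torch.autograd.Function):
    """Beta CDF with a hand-written backward (PyTorch lacks a differentiable Beta CDF)."""

    @staticmethod
    def forward(ctx, x, a, b):
        ctx.save_for_backward(x)
        ctx.a, ctx.b = a, b
        return torch.as_tensor(beta_dist.cdf(x.detach().numpy(), a, b), dtype=x.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        pdf = torch.as_tensor(beta_dist.pdf(x.detach().numpy(), ctx.a, ctx.b), dtype=grad_out.dtype)
        return grad_out * pdf, None, None


def batch_shaping_loss(gate_probs, a=0.6, b=0.4):
    """Squared distance between the batch's empirical CDF and a Beta(a, b) prior CDF
    (illustrative parameters), evaluated at the sorted gate activations."""
    x, _ = torch.sort(gate_probs.flatten())
    n = x.numel()
    ecdf = (torch.arange(1, n + 1, dtype=x.dtype) - 0.5) / n
    return ((BetaCDF.apply(x, a, b) - ecdf) ** 2).mean()
```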
Instant Quantization of Neural Networks using Monte Carlo Methods
Title | Instant Quantization of Neural Networks using Monte Carlo Methods |
Authors | Anonymous |
Abstract | Low bit-width integer weights and activations are very important for efficient inference, especially with respect to lower power consumption. We propose to apply Monte Carlo methods and importance sampling to sparsify and quantize pre-trained neural networks without any retraining. We obtain sparse, low bit-width integer representations that approximate the full precision weights and activations. The precision, sparsity, and complexity are easily configurable by the amount of sampling performed. Our approach, called Monte Carlo Quantization (MCQ), is linear in both time and space, while the resulting quantized sparse networks show minimal accuracy loss compared to the original full-precision networks. Our method either outperforms or achieves results competitive with methods that do require additional training on a variety of challenging tasks. |
Tasks | Quantization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1e5NySKwH |
https://openreview.net/pdf?id=B1e5NySKwH | |
PWC | https://paperswithcode.com/paper/instant-quantization-of-neural-networks-using-1 |
Repo | |
Framework | |
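The sampling idea can be shown in a few lines: importance-sample weight indices in proportion to their magnitudes and use the hit counts as a sparse integer representation. This is a sketch of the general idea under assumed details (per-tensor sampling budget, a single rescaling factor), not the paper's exact procedure.

```python
import numpy as np

def mc_quantize(weights, samples_per_weight=1.0, seed=0):
    """Monte Carlo quantization sketch: sample indices with probability proportional
    to |w_i| and return signed hit counts as sparse, low bit-width integer weights."""
    rng = np.random.default_rng(seed)
    w = weights.ravel()
    probs = np.abs(w) / np.abs(w).sum()
    k = max(1, int(samples_per_weight * w.size))   # sampling budget controls precision/sparsity
    hits = np.bincount(rng.choice(w.size, size=k, p=probs), minlength=w.size)
    q_int = (np.sign(w) * hits).astype(int)        # mostly zeros; small integers elsewhere
    scale = np.abs(w).sum() / k                    # scale * q_int approximates w in expectation
    return q_int.reshape(weights.shape), scale
```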
The Local Elasticity of Neural Networks
Title | The Local Elasticity of Neural Networks |
Authors | Anonymous |
Abstract | This paper presents a phenomenon in neural networks that we refer to as local elasticity. Roughly speaking, a classifier is said to be locally elastic if its prediction at a feature vector x’ is not significantly perturbed, after the classifier is updated via stochastic gradient descent at a (labeled) feature vector x that is dissimilar to x’ in a certain sense. This phenomenon is shown to persist for neural networks with nonlinear activation functions through extensive simulations on synthetic datasets, whereas this is not the case for linear classifiers. In addition, we offer a geometric interpretation of local elasticity using the neural tangent kernel (Jacot et al., 2018). Building on top of local elasticity, we obtain pairwise similarity measures between feature vectors, which can be used for clustering in conjunction with K-means. The effectiveness of the clustering algorithm on the MNIST and CIFAR-10 datasets in turn confirms the hypothesis of local elasticity of neural networks on real-life data. Finally, we discuss implications of local elasticity to shed light on several intriguing aspects of deep neural networks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJxMYANtPH |
https://openreview.net/pdf?id=HJxMYANtPH | |
PWC | https://paperswithcode.com/paper/the-local-elasticity-of-neural-networks-1 |
Repo | |
Framework | |
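In the spirit of the similarity measure described above, the sketch below takes one SGD step on a labeled point (x, y) and reports the relative change of the model's prediction at x'; the step size and the particular relative-change formula are illustrative choices, not the paper's definition.

```python
import copy
import torch

def elasticity_effect(model, loss_fn, x, y, x_prime, lr=0.01):
    """Relative change of the prediction at x' after a single SGD update on (x, y)."""
    model = copy.deepcopy(model)                  # leave the original model untouched
    before = model(x_prime).detach()
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad                  # one plain SGD step on (x, y)
    after = model(x_prime).detach()
    return (torch.norm(after - before) / (torch.norm(before) + 1e-12)).item()
```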
Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning
Title | Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning |
Authors | Anonymous |
Abstract | In recent years we have seen fast progress on a number of benchmark problems in AI, with modern methods achieving near or super-human performance in Go, Poker and Dota. One common aspect of all of these challenges is that they are by design adversarial or, technically speaking, zero-sum. In contrast to these settings, success in the real world commonly requires humans to collaborate and communicate with others, in settings that are, at least partially, cooperative. In the last year, the card game Hanabi has been established as a new benchmark environment for AI to fill this gap. In particular, Hanabi is interesting to humans since it is entirely focused on theory of mind, i.e. the ability to effectively reason over the intentions, beliefs and point of view of other agents when observing their actions. Learning to be informative when observed by others is an interesting challenge for Reinforcement Learning (RL): fundamentally, RL requires agents to explore in order to discover good policies. However, when done naively, this randomness will inherently make their actions less informative to others during training. We present a new deep multi-agent RL method, the Simplified Action Decoder (SAD), which resolves this contradiction by exploiting the centralized training phase. During training, SAD allows agents to observe not only the (exploratory) action chosen by their teammates, but also their greedy action. By combining this simple intuition with an auxiliary task for state prediction and best practices for multi-agent learning, SAD establishes a new state of the art for 2-5 players on the self-play part of the Hanabi challenge. |
Tasks | Multi-agent Reinforcement Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1xm3RVtwB |
https://openreview.net/pdf?id=B1xm3RVtwB | |
PWC | https://paperswithcode.com/paper/simplified-action-decoder-for-deep-multi |
Repo | |
Framework | |
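The core trick, letting teammates observe the greedy action alongside the exploratory one during centralized training, fits in a few lines; this is a schematic fragment rather than the paper's training loop, and the observation-augmentation comment is only indicative.

```python
import random

def epsilon_greedy_with_greedy_info(q_values, epsilon):
    """Return both the exploratory action executed in the environment and the greedy
    action that teammates are additionally allowed to observe during training."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    exploratory = random.randrange(len(q_values)) if random.random() < epsilon else greedy
    return exploratory, greedy

# During centralized training, a teammate's input could then be augmented with both
# actions, e.g. obs_other = observation + [exploratory_action, greedy_action].
```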
Learning Nearly Decomposable Value Functions Via Communication Minimization
Title | Learning Nearly Decomposable Value Functions Via Communication Minimization |
Authors | Anonymous |
Abstract | Reinforcement learning encounters major challenges in multi-agent settings, such as scalability and non-stationarity. Recently, value function factorization learning has emerged as a promising way to address these challenges in collaborative multi-agent systems. However, existing methods have focused on learning fully decentralized value functions, which are not efficient for tasks requiring communication. To address this limitation, this paper presents a novel framework for learning nearly decomposable value functions with communication, with which agents act on their own most of the time but occasionally send messages to other agents for effective coordination. This framework hybridizes value function factorization learning and communication learning by introducing two information-theoretic regularizers. These regularizers maximize the mutual information between decentralized Q functions and communication messages while minimizing the entropy of messages between agents. We show how to optimize these regularizers in a way that is easily integrated with existing value function factorization methods such as QMIX. Finally, we demonstrate that, on the StarCraft unit micromanagement benchmark, our framework significantly outperforms baseline methods and allows cutting off more than 80% of communication without sacrificing performance. |
Tasks | Starcraft |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJx-3grYDB |
https://openreview.net/pdf?id=HJx-3grYDB | |
PWC | https://paperswithcode.com/paper/learning-nearly-decomposable-value-functions |
Repo | |
Framework | |
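To make the two regularizers concrete, the sketch below assumes diagonal-Gaussian messages: minimizing their differential entropy keeps communication compressible, while the expected log-likelihood of a learned decoder gives a standard variational lower bound (up to an entropy constant) on the mutual-information term. The Gaussian parameterization, the bound, and the coefficients are assumptions, not the paper's exact objective.

```python
import math
import torch

def gaussian_message_entropy(log_sigma):
    """Differential entropy of a diagonal-Gaussian message N(mu, diag(sigma^2));
    minimizing it is the entropy regularizer on messages."""
    return 0.5 * (math.log(2 * math.pi * math.e) + 2.0 * log_sigma).sum(dim=-1).mean()

def mi_lower_bound(decoder_log_prob):
    """Barber-Agakov-style variational bound: the expected log-likelihood of a learned
    decoder that reads the message (the additive entropy constant is omitted)."""
    return decoder_log_prob.mean()

def communication_regularizer(log_sigma, decoder_log_prob, beta_mi=1.0, beta_ent=0.1):
    # Written as a loss: maximize the MI bound, minimize the message entropy.
    return -beta_mi * mi_lower_bound(decoder_log_prob) + beta_ent * gaussian_message_entropy(log_sigma)
```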
Growing Up Together: Structured Exploration for Large Action Spaces
Title | Growing Up Together: Structured Exploration for Large Action Spaces |
Authors | Anonymous |
Abstract | Training good policies for large combinatorial action spaces is onerous and usually tackled with imitation learning, curriculum learning, or reward shaping. Each of these methods has requirements that can hinder their general application. Here, we study how growing the action space of the policy during training can structure exploration and lead to convergence without any external data (imitation), with less control over the environment (curriculum), and with minimal reward shaping. We evaluate this approach on a challenging end-to-end, full-game army control task in StarCraft: Brood War by training policies through self-play from scratch. We grow the spatial resolution and frequency of actions and achieve superior results compared to operating purely at finer resolutions. |
Tasks | Imitation Learning, Starcraft |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HylZ5grKvB |
https://openreview.net/pdf?id=HylZ5grKvB | |
PWC | https://paperswithcode.com/paper/growing-up-together-structured-exploration |
Repo | |
Framework | |
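The coarse-to-fine mechanism can be illustrated with a toy mapping from a small action grid onto map coordinates; growing the grid (and the action frequency) over training is the schedule the abstract describes. The function name, map size, and schedule below are made up for illustration.

```python
def coarse_to_fine_target(coarse_xy, grid_size, map_size=64):
    """Map an action chosen on a grid_size x grid_size grid to map coordinates; training
    starts with a small grid (few actions) and grows it toward full resolution."""
    cx, cy = coarse_xy
    cell = map_size / grid_size
    return int((cx + 0.5) * cell), int((cy + 0.5) * cell)

# Illustrative schedule: double the spatial resolution of actions at fixed milestones.
resolution_schedule = {0: 4, 100_000: 8, 300_000: 16, 600_000: 32}
```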
Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models
Title | Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models |
Authors | Anonymous |
Abstract | Likelihood-based generative models are a promising resource for detecting out-of-distribution (OOD) inputs, which could compromise the robustness or reliability of a machine learning system. However, likelihoods derived from such models have been shown to be problematic for detecting certain types of inputs that significantly differ from training data. In this paper, we posit that this problem is due to the excessive influence that input complexity has on generative models’ likelihoods. We report a set of experiments supporting this hypothesis, and use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood-ratio test akin to Bayesian model comparison. We find this score to perform comparably to, or even better than, existing OOD detection approaches across a wide range of datasets, models, and complexity estimates. |
Tasks | Out-of-Distribution Detection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SyxIWpVYvr |
https://openreview.net/pdf?id=SyxIWpVYvr | |
PWC | https://paperswithcode.com/paper/input-complexity-and-out-of-distribution-1 |
Repo | |
Framework | |
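The parameter-free score described above can be read as the model's negative log-likelihood in bits minus a generic complexity estimate in bits; the sketch below uses zlib as the compressor, which is only one of several complexity estimates one could plug in.

```python
import zlib
import numpy as np

def ood_score(nll_bits, image_uint8):
    """Complexity-corrected OOD score (sketch): model negative log-likelihood in bits
    minus the length, in bits, of a generic lossless compression of the raw input.
    Larger values suggest the input is more likely out-of-distribution."""
    complexity_bits = 8 * len(zlib.compress(np.ascontiguousarray(image_uint8).tobytes(), 9))
    return float(nll_bits) - complexity_bits
```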
Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation
Title | Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation |
Authors | Anonymous |
Abstract | Infinite horizon off-policy policy evaluation is a highly challenging task due to the excessively large variance of typical importance sampling (IS) estimators. Recently, Liu et al. (2018) proposed an approach that significantly reduces the variance of infinite-horizon off-policy evaluation by estimating the stationary density ratio, but at the cost of introducing potentially high risks due to the error in density ratio estimation. In this paper, we develop a bias-reduced augmentation of their method, which can take advantage of a learned value function to obtain higher accuracy. Our method is doubly robust in that the bias vanishes when either the density ratio or value function estimation is perfect. In general, when either of them is accurate, the bias can also be reduced. Both theoretical and empirical results show that our method yields significant advantages over previous methods. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1glGANtDr |
https://openreview.net/pdf?id=S1glGANtDr | |
PWC | https://paperswithcode.com/paper/doubly-robust-bias-reduction-in-infinite-1 |
Repo | |
Framework | |
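A common doubly robust template in this setting combines a learned stationary density ratio with a value-function control variate; the sketch below follows that generic template for the discounted case and is not necessarily the paper's exact estimator.

```python
import numpy as np

def doubly_robust_value(w, rho, r, q_sa, v_next, v_init, gamma=0.99):
    """Generic doubly robust off-policy value estimate:
    (1 - gamma) * E[V(s0)] + E[ w(s) * rho(a|s) * (r + gamma * V(s') - Q(s, a)) ],
    where w is the stationary density ratio and rho = pi(a|s) / mu(a|s).
    The bias vanishes when either w or the value estimates (Q, V) are exact."""
    correction = np.mean(w * rho * (r + gamma * v_next - q_sa))
    return (1.0 - gamma) * np.mean(v_init) + correction
```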
Unsupervised Progressive Learning and the STAM Architecture
Title | Unsupervised Progressive Learning and the STAM Architecture |
Authors | Anonymous |
Abstract | We first pose the Unsupervised Progressive Learning (UPL) problem: learning salient representations from a non-stationary stream of unlabeled data in which the number of object classes increases with time. If some limited labeled data is also available, those representations can be associated with specific classes, thus enabling classification tasks. To solve the UPL problem, we propose an architecture that involves an online clustering module, called Self-Taught Associative Memory (STAM). Layered hierarchies of STAM modules learn based on a combination of online clustering, novelty detection, forgetting outliers, and storing only prototypical representations rather than specific examples. The goal of this paper is to introduce the UPL problem, describe the STAM architecture, and evaluate the latter in the UPL context. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Skxw-REFwS |
https://openreview.net/pdf?id=Skxw-REFwS | |
PWC | https://paperswithcode.com/paper/unsupervised-progressive-learning-and-the |
Repo | |
Framework | |
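A single STAM-like module can be caricatured as online nearest-prototype clustering with a novelty test; the threshold, learning rate, and class interface below are illustrative, and the layered hierarchy, forgetting, and classification pieces are omitted.

```python
import numpy as np

class OnlineClusterModule:
    """Tiny online-clustering sketch: nearest-centroid assignment, running-average
    prototype updates, and novelty detection that spawns a new prototype when an
    input is far from all existing ones."""

    def __init__(self, dim, novelty_threshold, lr=0.05):
        self.centroids = np.empty((0, dim))
        self.threshold = novelty_threshold
        self.lr = lr

    def update(self, x):
        if len(self.centroids) == 0:
            self.centroids = x[None, :].copy()
            return 0
        d = np.linalg.norm(self.centroids - x, axis=1)
        j = int(d.argmin())
        if d[j] > self.threshold:                 # novelty: store a new prototype
            self.centroids = np.vstack([self.centroids, x])
            return len(self.centroids) - 1
        self.centroids[j] += self.lr * (x - self.centroids[j])   # refine the matched prototype
        return j
```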
Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments
Title | Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments |
Authors | Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen |
Abstract | We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well tuned SGD with momentum and Adam/AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has two times smaller memory footprint than Adam. |
Tasks | Image Classification, Language Modelling, Machine Translation, Speech Recognition |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJepq2VtDB |
https://openreview.net/pdf?id=BJepq2VtDB | |
PWC | https://paperswithcode.com/paper/training-deep-networks-with-stochastic |
Repo | |
Framework | |
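The layer-wise normalization and decoupled weight decay can be written as a single per-layer update; the sketch below follows the description in the abstract, with hyperparameter defaults that are illustrative rather than the paper's recommended values.

```python
import numpy as np

def novograd_step(w, g, m, v, lr=0.01, beta1=0.95, beta2=0.98, wd=0.001, eps=1e-8):
    """One NovoGrad-style update for a single layer: the second moment is a scalar
    per layer computed from the gradient norm, and weight decay is decoupled
    (added after the gradient is normalized)."""
    v = beta2 * v + (1.0 - beta2) * float(np.sum(g * g))   # layer-wise second moment
    m = beta1 * m + (g / (np.sqrt(v) + eps) + wd * w)      # normalized gradient + decoupled decay
    w = w - lr * m
    return w, m, v
```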
The divergences minimized by non-saturating GAN training
Title | The divergences minimized by non-saturating GAN training |
Authors | Anonymous |
Abstract | Interpreting generative adversarial network (GAN) training as approximate divergence minimization has been theoretically insightful, has spurred discussion, and has led to theoretically and practically interesting extensions such as f-GANs and Wasserstein GANs. For both classic GANs and f-GANs, there is an original variant of training and a “non-saturating” variant which uses an alternative form of generator gradient. The original variant is theoretically easier to study, but for GANs the alternative variant performs better in practice. The non-saturating scheme is often regarded as a simple modification to deal with optimization issues, but we show that in fact the non-saturating scheme for GANs is effectively optimizing a reverse KL-like f-divergence. We also develop a number of theoretical tools to help compare and classify f-divergences. We hope these results may help to clarify some of the theoretical discussion surrounding the divergence minimization view of GAN training. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BygY4grYDr |
https://openreview.net/pdf?id=BygY4grYDr | |
PWC | https://paperswithcode.com/paper/the-divergences-minimized-by-non-saturating |
Repo | |
Framework | |
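For reference, here are the two generator objectives the paper contrasts, written for a sigmoid discriminator given its raw logits on generated samples; the loss forms are the standard ones, and the reverse-KL-like behavior derived in the paper concerns the non-saturating variant.

```python
import torch.nn.functional as F

def generator_losses(d_logits_fake):
    """Classic GAN generator objectives, given discriminator logits l on fake samples."""
    # Original ("saturating") objective: minimize E[log(1 - D(G(z)))] = E[-softplus(l)].
    saturating = (-F.softplus(d_logits_fake)).mean()
    # Non-saturating variant: minimize -E[log D(G(z))] = E[softplus(-l)].
    non_saturating = F.softplus(-d_logits_fake).mean()
    return saturating, non_saturating
```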
Why do These Match? Explaining the Behavior of Image Similarity Models
Title | Why do These Match? Explaining the Behavior of Image Similarity Models |
Authors | Anonymous |
Abstract | Explaining a deep learning model can help users understand its behavior and allow researchers to discern its shortcomings. Recent work has primarily focused on explaining models for tasks like image classification or visual question answering. In this paper, we introduce an explanation approach for image similarity models, where a model’s output is a score measuring the similarity of two inputs rather than a classification. In this task, an explanation depends on both of the input images, so standard methods do not apply. We propose an explanation method that pairs a saliency map identifying important image regions with an attribute that best explains the match. We find that our explanations provide additional information not typically captured by saliency maps alone, and can also improve performance on the classic task of attribute recognition. Our approach’s ability to generalize is demonstrated on two datasets from diverse domains, Polyvore Outfits and Animals with Attributes 2. |
Tasks | Image Classification, Question Answering, Visual Question Answering |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1l_ZlrFvS |
https://openreview.net/pdf?id=S1l_ZlrFvS | |
PWC | https://paperswithcode.com/paper/why-do-these-match-explaining-the-behavior-of-1 |
Repo | |
Framework | |
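The saliency half of such an explanation can be approximated with a plain gradient map over the similarity score; the paper's actual saliency method and its attribute-selection step are not reproduced here, and `embed_fn` is a stand-in for the similarity model's embedding network.

```python
import torch
import torch.nn.functional as F

def similarity_saliency(embed_fn, img_a, img_b):
    """Gradient-based saliency for an image-similarity model: how strongly each pixel
    of img_a influences the cosine similarity between the two embeddings."""
    img_a = img_a.clone().requires_grad_(True)
    score = F.cosine_similarity(embed_fn(img_a).flatten(1), embed_fn(img_b).flatten(1)).sum()
    score.backward()
    return img_a.grad.abs().sum(dim=1)            # aggregate gradient magnitude over channels
```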
Generating Multi-Sentence Abstractive Summaries of Interleaved Texts
Title | Generating Multi-Sentence Abstractive Summaries of Interleaved Texts |
Authors | Anonymous |
Abstract | In multi-participant postings, as in online chat conversations, several conversations or topic threads may take place concurrently. This makes it difficult for readers reviewing the postings not only to follow the discussions but also to quickly identify their essence. A two-step process, disentanglement of interleaved posts followed by summarization of each thread, addresses the issue, but disentanglement errors are propagated to the summarization step, degrading the overall performance. To address this, we propose an end-to-end trainable encoder-decoder network for summarizing interleaved posts. The interleaved posts are encoded hierarchically, i.e., word-to-word (words in a post) followed by post-to-post (posts in a channel). The decoder also generates summaries hierarchically, thread-to-thread (generating thread representations) followed by word-to-word (generating summary words). Additionally, we propose a hierarchical attention mechanism for interleaved text. Overall, our end-to-end trainable hierarchical framework improves performance over a sequence-to-sequence framework by 8-10% on multiple synthetic interleaved-text datasets. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkgCv1HYvB |
https://openreview.net/pdf?id=BkgCv1HYvB | |
PWC | https://paperswithcode.com/paper/generating-multi-sentence-abstractive-1 |
Repo | |
Framework | |
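The hierarchical (word-to-word, then post-to-post) encoding can be sketched with two stacked GRUs; dimensions, padding, and the hierarchical decoder and attention are omitted, so this is only a schematic of the encoder described above.

```python
import torch
import torch.nn as nn

class HierarchicalPostEncoder(nn.Module):
    """Word-level GRU encodes each post; a post-level GRU encodes the channel."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.post_gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, channel_tokens):                 # (num_posts, max_words) token ids
        _, word_h = self.word_gru(self.embed(channel_tokens))   # word-to-word: one state per post
        post_states, channel_h = self.post_gru(word_h)          # post-to-post over the channel
        return post_states.squeeze(0), channel_h.squeeze()      # per-post states, channel summary
```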