Paper Group NANR 79
Three-Head Neural Network Architecture for AlphaZero Learning
Title | Three-Head Neural Network Architecture for AlphaZero Learning |
Authors | Anonymous |
Abstract | The search-based reinforcement learning algorithm AlphaZero has been used as a general method for mastering the two-player games Go, chess and Shogi. One crucial ingredient in AlphaZero (and its predecessor AlphaGo Zero) is the two-head network architecture that outputs two estimates (policy and value) for one input game state. The merit of such an architecture is that letting policy and value learning share the same representation substantially improved generalization of the neural net. A three-head network architecture has recently been proposed that can learn a third action-value head on the same fixed dataset as used for the two-head net. Moreover, using the action-value head in Monte Carlo tree search (MCTS) improved the search efficiency. However, the effectiveness of the three-head network has not been investigated in an AlphaZero-style learning paradigm. In this paper, using the game of Hex as a test domain, we conduct an empirical study of the three-head network architecture in AlphaZero learning. We show that the architecture is also advantageous in zero-style iterative learning. Specifically, we find that the three-head network can induce the following benefits: (1) learning can become faster as search takes advantage of the additional action-value head; (2) better prediction results than with the two-head architecture can be achieved when using additional action-value learning as an auxiliary task. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJxvH1BtDS, https://openreview.net/pdf?id=BJxvH1BtDS |
PWC | https://paperswithcode.com/paper/three-head-neural-network-architecture-for |
Repo | |
Framework | |
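The three-head idea in the abstract above can be illustrated with a minimal sketch: one shared representation feeding a policy head, a value head, and an action-value head. The layer sizes, the random initialisation, the tanh output ranges, and the class name `ThreeHeadNet` are assumptions made for illustration; the abstract does not specify the authors' actual network.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    """Hypothetical random initialisation, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class ThreeHeadNet:
    """Shared trunk with policy, value, and action-value heads."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.trunk = random_layer(rng, n_hidden, n_inputs)
        self.policy_head = random_layer(rng, n_actions, n_hidden)
        self.value_head = random_layer(rng, 1, n_hidden)
        self.action_value_head = random_layer(rng, n_actions, n_hidden)

    def forward(self, state):
        # All three heads read the same hidden representation.
        hidden = [max(0.0, h) for h in linear(*self.trunk, state)]
        policy = softmax(linear(*self.policy_head, hidden))
        value = math.tanh(linear(*self.value_head, hidden)[0])
        action_values = [math.tanh(q)
                         for q in linear(*self.action_value_head, hidden)]
        return policy, value, action_values


# Toy 3x3 board encoded as 9 numbers, with one occupied cell.
net = ThreeHeadNet(n_inputs=9, n_hidden=16, n_actions=9)
policy, value, action_values = net.forward([0.0] * 4 + [1.0] + [0.0] * 4)
```

In a search procedure, the per-action estimates in `action_values` could be used to evaluate child moves without a separate forward pass for each child, which is one plausible reading of how the extra head makes search more efficient.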
Invariance vs Robustness of Neural Networks
Title | Invariance vs Robustness of Neural Networks |
Authors | Anonymous |
Abstract | Neural networks achieve human-level accuracy on many standard datasets used in image classification. The next step is to achieve better generalization to natural (or non-adversarial) perturbations as well as known pixel-wise adversarial perturbations of inputs. Previous work has studied generalization to natural geometric transformations (e.g., rotations) as invariance, and generalization to adversarial perturbations as robustness. In this paper, we examine the interplay between invariance and robustness. We empirically study the following two cases: (a) change in adversarial robustness as we improve only the invariance using equivariant models and training augmentation, (b) change in invariance as we improve only the adversarial robustness using adversarial training. We observe that the rotation invariance of equivariant models (StdCNNs and GCNNs) improves by training augmentation with progressively larger rotations, but while doing so, their adversarial robustness does not improve, or worse, it can even drop significantly on datasets such as MNIST. As a plausible explanation for this phenomenon, we observe that the average perturbation distance of the test points to the decision boundary decreases as the model learns larger and larger rotations. On the other hand, we take adversarially trained LeNet and ResNet models which have good \ell_\infty adversarial robustness on MNIST and CIFAR-10, and observe that adversarially training them with progressively larger norms keeps their rotation invariance essentially unchanged. In fact, the difference between test accuracy on unrotated test data and on randomly rotated test data up to \theta, for all \theta in [0, 180], remains essentially unchanged after adversarial training. As a plausible explanation for the observed phenomenon, we show empirically that the principal components of adversarial perturbations and perturbations given by small rotations are nearly orthogonal. |
Tasks | Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJxp9kBFDS, https://openreview.net/pdf?id=HJxp9kBFDS |
PWC | https://paperswithcode.com/paper/invariance-vs-robustness-of-neural-networks |
Repo | |
Framework | |
Data Augmentation in Training CNNs: Injecting Noise to Images
Title | Data Augmentation in Training CNNs: Injecting Noise to Images |
Authors | Anonymous |
Abstract | Noise injection is a fundamental tool for data augmentation, and yet there is no widely accepted procedure to incorporate it with learning frameworks. This study analyzes the effects of adding or applying different noise models of varying magnitudes to Convolutional Neural Network (CNN) architectures. Noise models that are distributed with different density functions are given common magnitude levels via the Structural Similarity (SSIM) metric in order to create an appropriate ground for comparison. The basic results conform with most of the common notions in machine learning, and also introduce some novel heuristics and recommendations on noise injection. The new approaches will provide a better understanding of optimal learning procedures for image classification. |
Tasks | Data Augmentation, Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkeKtyHYPS, https://openreview.net/pdf?id=SkeKtyHYPS |
PWC | https://paperswithcode.com/paper/data-augmentation-in-training-cnns-injecting |
Repo | |
Framework | |
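The abstract above says that noise models with different densities are put on a common magnitude scale via SSIM. A minimal sketch of that idea follows. The two noise models, the single-window `global_ssim` (standard SSIM implementations average over local windows), the bisection routine `calibrate`, and the target value 0.8 are all assumptions chosen for illustration, not the authors' procedure.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Additive Gaussian noise, clipped back to the [0, 1] pixel range."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]


def add_salt_and_pepper(image, prob, rng):
    """Set each pixel to 0 or 1 with total probability `prob`."""
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            u = rng.random()
            new_row.append(0.0 if u < prob / 2 else 1.0 if u < prob else p)
        noisy.append(new_row)
    return noisy


def global_ssim(a, b, dynamic_range=1.0):
    """SSIM with the whole image treated as one window (a simplification)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2))


def calibrate(noise_fn, image, target_ssim, seed=0, iterations=40):
    """Bisect the noise parameter in [0, 1] until SSIM is near the target."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        # Re-seeding keeps the random draws fixed, so only the magnitude varies.
        noisy = noise_fn(image, mid, random.Random(seed))
        if global_ssim(image, noisy) >= target_ssim:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Hypothetical 32x32 test image: a horizontal grey-level ramp.
image = [[col / 31 for col in range(32)] for _ in range(32)]
sigma = calibrate(add_gaussian_noise, image, target_ssim=0.8)
prob = calibrate(add_salt_and_pepper, image, target_ssim=0.8)
ssim_gaussian = global_ssim(
    image, add_gaussian_noise(image, sigma, random.Random(0)))
ssim_salt_pepper = global_ssim(
    image, add_salt_and_pepper(image, prob, random.Random(0)))
```

After calibration, `sigma` and `prob` are different kinds of parameters but produce roughly the same SSIM against the clean image, so augmentation experiments using the two noise models can be compared at a matched distortion level.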
Decoupling Representation and Classifier for Long-Tailed Recognition
Title | Decoupling Representation and Classifier for Long-Tailed Recognition |
Authors | Anonymous |
Abstract | The long-tail distribution of the visual world poses great challenges for deep learning based classification models on how to handle the class imbalance problem. Existing solutions usually involve class-balancing strategies, e.g., by loss re-weighting, data re-sampling, or transfer learning from head- to tail-classes, but all of them adhere to the scheme of jointly learning representations and classifiers. In this work, we decouple the learning procedure into representation learning and classification, and systematically explore how different balancing strategies affect them for long-tailed recognition. The findings are surprising: (1) data imbalance might not be an issue in learning high-quality representations; (2) with representations learned with the simplest instance-balanced (natural) sampling, it is also possible to achieve strong long-tailed recognition ability at little to no cost by adjusting only the classifier. We conduct extensive experiments and set new state-of-the-art performance on common long-tailed benchmarks like ImageNet-LT, Places-LT and iNaturalist, showing that it is possible to outperform carefully designed losses, sampling strategies, even complex modules with memory, by using a straightforward approach that decouples representation and classification. |
Tasks | Representation Learning, Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1gRTCVFvB, https://openreview.net/pdf?id=r1gRTCVFvB |
PWC | https://paperswithcode.com/paper/decoupling-representation-and-classifier-for-1 |
Repo | |
Framework | |
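The abstract above reports that adjusting only the classifier, on top of representations learned with natural sampling, can be enough for long-tailed recognition. One possible adjustment of this kind is to rescale each class's weight vector by a power of its norm, so that head classes with large-norm weights no longer dominate the scores. The sketch below illustrates that option only; the abstract does not say which adjustments the authors use, and the weights, features, and the name `tau_normalize` are hypothetical.

```python
import math


def tau_normalize(class_weights, tau=1.0):
    """Rescale each class weight vector w to w / ||w||**tau."""
    rescaled = []
    for w in class_weights:
        norm = math.sqrt(sum(v * v for v in w))
        rescaled.append([v / norm ** tau for v in w])
    return rescaled


def predict(class_weights, features):
    """Index of the class with the highest linear score."""
    scores = [sum(w * f for w, f in zip(row, features))
              for row in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical two-class linear classifier on frozen 2-d features.
# Class 0 (head) has a much larger weight norm than class 1 (tail).
weights = [[3.0, 0.0], [0.3, 0.6]]
features = [1.0, 1.0]

before = predict(weights, features)                      # scores 3.0 vs 0.9
after = predict(tau_normalize(weights, 1.0), features)   # scores 1.0 vs ~1.34
```

With `tau=0` the weights are unchanged, and with `tau=1` every class vector has unit norm; intermediate values trade off between the two. The representation itself is never touched, which is the point of the decoupling.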
Relevant-features based Auxiliary Cells for Robust and Energy Efficient Deep Learning
Title | Relevant-features based Auxiliary Cells for Robust and Energy Efficient Deep Learning |
Authors | Anonymous |
Abstract | Deep neural networks are complex non-linear models used as predictive analytics tools and have demonstrated state-of-the-art performance on many classification tasks. However, they have no inherent capability to recognize when their predictions might go wrong. There have been several efforts in the recent past to detect natural errors, i.e., misclassified inputs, but these mechanisms pose additional energy requirements. To address this issue, we present a novel post-hoc framework to detect natural errors in an energy-efficient way. We achieve this by appending relevant-features-based linear classifiers per class, referred to as Relevant-features based Auxiliary Cells (RACs). The proposed technique makes use of the consensus between RACs appended at a few selected hidden layers to distinguish the correctly classified inputs from misclassified inputs. The combined confidence of RACs is utilized to determine if classification should terminate at an early stage. We demonstrate the effectiveness of our technique on various image classification datasets such as CIFAR10, CIFAR100 and Tiny-ImageNet. Our results show that for a VGG16 network trained on the CIFAR100 dataset, RACs can detect 46% of the misclassified examples along with a 12% reduction in energy compared to the baseline network, while 69% of the examples are correctly classified. |
Tasks | Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJgedkStDS, https://openreview.net/pdf?id=BJgedkStDS |
PWC | https://paperswithcode.com/paper/relevant-features-based-auxiliary-cells-for |
Repo | |
Framework | |
Representing Model Uncertainty of Neural Networks in Sparse Information Form
Title | Representing Model Uncertainty of Neural Networks in Sparse Information Form |
Authors | Anonymous |
Abstract | This paper addresses the problem of representing a system’s belief using multi-variate normal distributions (MND) where the underlying model is based on a deep neural network (DNN). The major challenge with DNNs is the computational complexity that is needed to obtain model uncertainty using MNDs. To achieve a scalable method, we propose a novel approach that expresses the parameter posterior in sparse information form. Our inference algorithm is based on a novel Laplace Approximation scheme, which involves a diagonal correction of the Kronecker-factored eigenbasis. As this makes the inversion of the information matrix intractable (an operation that is required for full Bayesian analysis), we devise a low-rank approximation of this eigenbasis and a memory-efficient sampling scheme. We provide both a theoretical analysis and an empirical evaluation on various benchmark data sets, showing the superiority of our approach over existing methods. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bkxd9JBYPH, https://openreview.net/pdf?id=Bkxd9JBYPH |
PWC | https://paperswithcode.com/paper/representing-model-uncertainty-of-neural |
Repo | |
Framework | |
Learning to Optimize via Dual space Preconditioning
Title | Learning to Optimize via Dual space Preconditioning |
Authors | Anonymous |
Abstract | Preconditioning a minimization algorithm improves its convergence and can lead to a minimizer in one iteration in some extreme cases. There is currently no analytical way of finding a suitable preconditioner. We present a general methodology for learning the preconditioner and show that it can lead to dramatic speed-ups over standard optimization techniques. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rklx-gSYPS, https://openreview.net/pdf?id=rklx-gSYPS |
PWC | https://paperswithcode.com/paper/learning-to-optimize-via-dual-space |
Repo | |
Framework | |
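The claim in the abstract above that preconditioning can reach a minimizer in one iteration is easy to see on a quadratic. The sketch below uses classical linear preconditioning with a hand-picked diagonal preconditioner on a toy problem; it is not the learned dual-space method the paper proposes, and the objective and step sizes are assumptions chosen for illustration.

```python
def gradient(a, b, x):
    """Gradient of f(x) = sum_i (0.5 * a_i * x_i**2 - b_i * x_i)."""
    return [ai * xi - bi for ai, bi, xi in zip(a, b, x)]


def descend(a, b, x, preconditioner, step, iterations):
    """Gradient descent with a diagonal preconditioner applied to the gradient."""
    for _ in range(iterations):
        g = gradient(a, b, x)
        x = [xi - step * pi * gi
             for xi, pi, gi in zip(x, preconditioner, g)]
    return x


# Ill-conditioned toy quadratic with curvatures 100 and 1; minimizer is [1, 1].
a = [100.0, 1.0]
b = [100.0, 1.0]
start = [0.0, 0.0]

# Plain gradient descent: the step must be about 1/100 to stay stable,
# so the low-curvature coordinate barely moves in one iteration.
plain = descend(a, b, start, preconditioner=[1.0, 1.0],
                step=0.01, iterations=1)

# Preconditioning with the inverse curvature rescales both coordinates,
# and a single unit step lands on the minimizer.
preconditioned = descend(a, b, start, preconditioner=[1 / 100.0, 1 / 1.0],
                         step=1.0, iterations=1)
```

Here the ideal preconditioner is known in closed form because the objective is a diagonal quadratic. For general objectives no such formula is available, which is the gap a learned preconditioner is meant to fill.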
Encoding Musical Style with Transformer Autoencoders
Title | Encoding Musical Style with Transformer Autoencoders |
Authors | Anonymous |
Abstract | We consider the problem of learning high-level controls over the global structure of sequence generation, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to combine this global embedding with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and melody. Empirically, we demonstrate the effectiveness of our method on a variety of music generation tasks on the MAESTRO dataset and an internal, 10,000+ hour dataset of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to relevant baselines. |
Tasks | Music Generation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hkg9HgBYwH, https://openreview.net/pdf?id=Hkg9HgBYwH |
PWC | https://paperswithcode.com/paper/encoding-musical-style-with-transformer |
Repo | |
Framework | |
Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks
Title | Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks |
Authors | Anonymous |
Abstract | Deep learning models are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on benign inputs. However, under the black-box setting, most existing adversaries often have a low transferability to attack other defense models. In this work, from the perspective of regarding the adversarial example generation as an optimization process, we propose two new methods to improve the transferability of adversarial examples, namely Nesterov Iterative Fast Gradient Sign Method (NI-FGSM) and Scale-Invariant attack Method (SIM). NI-FGSM aims to adapt Nesterov accelerated gradient into the iterative attacks so as to effectively look ahead and avoid the “missing” of the global maximum. SIM is based on our discovery of the scale-invariant property of deep learning models, which we leverage to optimize the adversarial perturbations over the scale copies of the input images so as to avoid “overfitting” on the white-box model being attacked and generate more transferable adversarial examples. NI-FGSM and SIM can be naturally integrated to build a robust gradient-based attack to generate more transferable adversarial examples against the defense models. Empirical results on the ImageNet dataset and the NIPS 2017 adversarial competition demonstrate that our attack methods exhibit higher transferability and achieve higher attack success rates than state-of-the-art gradient-based attacks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJlHwkBYDH, https://openreview.net/pdf?id=SJlHwkBYDH |
PWC | https://paperswithcode.com/paper/nesterov-accelerated-gradient-and-scale-1 |
Repo | |
Framework | |
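The two ingredients named in the abstract above, a Nesterov-style look-ahead inside an iterative sign-gradient attack and gradients averaged over scaled copies of the input, can be combined as in the sketch below. It is a toy illustration on a logistic model with an analytic input gradient. The L1 normalisation of the gradient, the momentum decay, the scale factors 1/2^i, and the step size `epsilon / steps` are assumptions that the abstract does not state.

```python
import math


def sign(v):
    """Sign of a number as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def loss_gradient(weights, x, label):
    """Input gradient of binary cross-entropy for a logistic model."""
    logit = sum(w * xi for w, xi in zip(weights, x))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return [(prob - label) * w for w in weights]


def scale_averaged_gradient(weights, x, label, n_scales):
    """Average the gradient of the loss over copies of x scaled by 1/2**i."""
    total = [0.0] * len(x)
    for i in range(n_scales):
        scale = 1.0 / 2 ** i
        g = loss_gradient(weights, [scale * xi for xi in x], label)
        # Chain rule: d/dx loss(scale * x) = scale * grad_loss(scale * x).
        total = [t + scale * gi for t, gi in zip(total, g)]
    return [t / n_scales for t in total]


def ni_fgsm(weights, x, label, epsilon, steps=10, decay=1.0, n_scales=3):
    """Iterative sign-gradient ascent with Nesterov look-ahead and momentum."""
    alpha = epsilon / steps
    adv = list(x)
    momentum = [0.0] * len(x)
    for _ in range(steps):
        # Look ahead along the accumulated direction before taking the gradient.
        look_ahead = [a + alpha * decay * m for a, m in zip(adv, momentum)]
        grad = scale_averaged_gradient(weights, look_ahead, label, n_scales)
        l1 = sum(abs(g) for g in grad) or 1.0
        momentum = [decay * m + g / l1 for m, g in zip(momentum, grad)]
        adv = [a + alpha * sign(m) for a, m in zip(adv, momentum)]
        # Project back into the L-infinity ball of radius epsilon around x.
        adv = [min(xi + epsilon, max(xi - epsilon, a))
               for xi, a in zip(x, adv)]
    return adv


# Hypothetical white-box model and input with true label 1.
weights = [2.0, -1.0, 0.5]
x = [0.5, 0.2, 0.1]
epsilon = 0.3
adversarial = ni_fgsm(weights, x, label=1.0, epsilon=epsilon)
```

On this linear toy model every gradient points along the same direction, so the look-ahead and scale averaging do not change the outcome; they matter for non-linear networks, where the abstract argues they help avoid poor maxima and overfitting to the attacked model.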
Adversarial Example Detection and Classification with Asymmetrical Adversarial Training
Title | Adversarial Example Detection and Classification with Asymmetrical Adversarial Training |
Authors | Anonymous |
Abstract | The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks has proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper, we consider the adversarial detection problem under the robust optimization framework. We partition the input space into subspaces and train adversarially robust subspace detectors using asymmetrical adversarial training (AAT). The integration of the classifier and detectors presents a detection mechanism that provides a performance guarantee against the adversary considered. We demonstrate that AAT promotes the learning of class-conditional distributions, which further gives rise to generative detection/classification approaches that are both robust and more interpretable. We provide comprehensive evaluations of the above methods, and demonstrate their competitive performances and compelling properties on adversarial detection and robust classification problems. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJeQEp4YDH, https://openreview.net/pdf?id=SJeQEp4YDH |
PWC | https://paperswithcode.com/paper/adversarial-example-detection-and |
Repo | |
Framework | |
Fast Neural Network Adaptation via Parameters Remapping
Title | Fast Neural Network Adaptation via Parameters Remapping |
Authors | Anonymous |
Abstract | Deep neural networks achieve remarkable performance in many computer vision tasks. However, for most semantic segmentation (seg) and object detection (det) tasks, the backbone of the network directly reuses the network manually designed for classification tasks. Utilizing a network pre-trained on ImageNet as the backbone has been a popular practice for seg/det challenges. However, because of the gap between different tasks, adapting the network directly to the target task could improve performance. Some recent neural architecture search (NAS) methods search for the backbone of seg/det networks. ImageNet pre-training of the search space representation or the searched network incurs a huge computational cost. In this paper, we propose a fast neural network adaptation method, FNA, which can efficiently adapt a network manually designed on ImageNet to new seg/det tasks. We adopt differentiable NAS to adapt the architecture of the network. We first expand the manually designed network to a super network which is the representation of the search space. Then we successively conduct the adaptation on the architecture level and the parameter level. Our designed parameter-remapping paradigm accelerates the adaptation process. Our experiments include both seg and det tasks. We conduct adaptation on the MobileNetV2 network. FNA improves performance over both manually designed and NAS-designed networks. The total computational cost of FNA is much less than that of many SOTA seg/det NAS methods: 1737x less than DPC, 6.8x less than Auto-DeepLab and 7.4x less than DetNAS. |
Tasks | Neural Architecture Search, Object Detection, Semantic Segmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rklTmyBKPH, https://openreview.net/pdf?id=rklTmyBKPH |
PWC | https://paperswithcode.com/paper/fast-neural-network-adaptation-via-parameters |
Repo | |
Framework | |
Understanding Generalization in Recurrent Neural Networks
Title | Understanding Generalization in Recurrent Neural Networks |
Authors | Anonymous |
Abstract | In this work, we develop the theory for analyzing the generalization performance of recurrent neural networks. We first present a new generalization bound for recurrent neural networks based on the matrix-1 norm and the Fisher-Rao norm. The definition of the Fisher-Rao norm relies on a structural lemma about the gradient of RNNs. This new generalization bound assumes that the covariance matrix of the input data is positive definite, which might limit its use in practice. To address this issue, we propose to add random noise to the input data and prove a generalization bound for training with random noise, which is an extension of the former one. Compared with existing results, our generalization bounds have no explicit dependency on the size of networks. We also discover that the Fisher-Rao norm for RNNs can be interpreted as a measure of gradient, and incorporating this gradient measure not only can tighten the bound, but also allows us to build a relationship between generalization and trainability. Based on the bound, we theoretically analyze the effect of the covariance of features on the generalization of RNNs and discuss how weight decay and gradient clipping during training can help improve generalization. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkgg6xBYDH, https://openreview.net/pdf?id=rkgg6xBYDH |
PWC | https://paperswithcode.com/paper/understanding-generalization-in-recurrent |
Repo | |
Framework | |
Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
Title | Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring |
Authors | Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, Jason Weston |
Abstract | The use of deep pre-trained transformers has led to remarkable progress in a number of applications (Devlin et al., 2018). For tasks that make pairwise comparisons between sequences, matching a given input with a corresponding label, two approaches are common: Cross-encoders performing full self-attention over the pair and Bi-encoders encoding the pair separately. The former often performs better, but is too slow for practical use. In this work, we develop a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features. We perform a detailed comparison of all three approaches, including what pre-training and fine-tuning strategies work best. We show our models achieve state-of-the-art results on four tasks; that Poly-encoders are faster than Cross-encoders and more accurate than Bi-encoders; and that the best results are obtained by pre-training on large datasets similar to the downstream tasks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkxgnnNFvH, https://openreview.net/pdf?id=SkxgnnNFvH |
PWC | https://paperswithcode.com/paper/poly-encoders-architectures-and-pre-training |
Repo | |
Framework | |
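The abstract above contrasts Cross-encoders and Bi-encoders with a Poly-encoder that learns global rather than token-level attention features. One reading of that description is sketched below: a small set of learned code vectors attends over the context's token outputs to form a few global features, and each candidate then attends over those features before a final dot-product score. The code vectors, the use of keys as values, and all numbers here are hypothetical, and the token outputs stand in for what a transformer encoder would produce.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, vectors):
    """Dot-product attention in which the keys double as the values."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(dim)]


def encode_context(token_outputs, codes):
    """One global feature per code; computed once per context."""
    return [attend(code, token_outputs) for code in codes]


def score(global_features, candidate):
    """Candidate-aware pooling of the global features, then a dot product."""
    context = attend(candidate, global_features)
    return dot(context, candidate)


# Hypothetical encoder outputs for a two-token context, and one zero code
# (a zero code attends uniformly, so its feature is the token average).
token_outputs = [[1.0, 0.0], [0.0, 1.0]]
codes = [[0.0, 0.0]]
global_features = encode_context(token_outputs, codes)

# The cached global features are reused across every candidate.
candidates = [[2.0, 0.0], [0.0, -1.0]]
scores = [score(global_features, c) for c in candidates]
```

The design point is that the per-candidate work touches only the handful of global features rather than every context token, which is why such a model can sit between a Bi-encoder and a Cross-encoder in cost.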
Minimizing FLOPs to Learn Efficient Sparse Representations
Title | Minimizing FLOPs to Learn Efficient Sparse Representations |
Authors | Anonymous |
Abstract | Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is, however, computationally challenging. Approximate methods based on learning compact representations have been widely explored for this problem, such as locality sensitive hashing, product quantization, and PCA. In this work, in contrast to learning compact representations, we propose to learn high-dimensional and sparse representations that have similar representational capacity to dense embeddings while being more efficient, because sparse matrix multiplication can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive with the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets. |
Tasks | Quantization, Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SygpC6Ntvr, https://openreview.net/pdf?id=SygpC6Ntvr |
PWC | https://paperswithcode.com/paper/minimizing-flops-to-learn-efficient-sparse |
Repo | |
Framework | |
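The key insight in the abstract above can be made concrete. A sparse dot product multiplies on dimension j only when both vectors are non-zero there, so if a fraction p_j of embeddings is active on dimension j, the expected multiplications per pair are roughly the sum of p_j squared. The sketch below computes that count and one continuous relaxation of it, replacing the activity fraction with the mean absolute activation. The relaxation shown, the independence assumption, and the function names are assumptions for illustration; the abstract does not give the exact regularizer.

```python
def expected_flops(embeddings):
    """Sum over dimensions of p_j**2, with p_j the fraction of active rows."""
    n = len(embeddings)
    total = 0.0
    for j in range(len(embeddings[0])):
        p = sum(1 for row in embeddings if row[j] != 0.0) / n
        total += p * p
    return total


def flops_regularizer(embeddings):
    """Differentiable surrogate: squared mean absolute activation per dimension."""
    n = len(embeddings)
    total = 0.0
    for j in range(len(embeddings[0])):
        mean_abs = sum(abs(row[j]) for row in embeddings) / n
        total += mean_abs * mean_abs
    return total


# Two hypothetical batches with the same sparsity (one non-zero per row).
# In the first, every row uses dimension 0; in the second, the non-zeros
# are spread uniformly across the four dimensions.
concentrated = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]
uniform = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]

flops_concentrated = expected_flops(concentrated)
flops_uniform = expected_flops(uniform)
penalty_concentrated = flops_regularizer(concentrated)
penalty_uniform = flops_regularizer(uniform)
```

Both batches are equally sparse, yet the uniform one costs a quarter as many expected multiplications. A plain L1 penalty would not distinguish them, whereas the squared-mean surrogate penalises the concentrated batch more, which pushes non-zeros to spread across dimensions.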
A Kolmogorov Complexity Approach to Generalization in Deep Learning
Title | A Kolmogorov Complexity Approach to Generalization in Deep Learning |
Authors | Anonymous |
Abstract | Deep artificial neural networks can achieve an extremely small difference between training and test accuracies on identically distributed training and test sets, which is a standard measure of generalization. However, the training and test sets may not be sufficiently representative of the empirical sample set, which consists of real-world input samples. When samples are drawn from an underrepresented or unrepresented subset during inference, the gap between the training and inference accuracies can be significant. To address this problem, we first reformulate a classification algorithm as a procedure for searching for a source code that maps input features to classes. We then derive a necessary and sufficient condition for generalization using a universal cognitive similarity metric, namely information distance, based on Kolmogorov complexity. Using this condition, we formulate an optimization problem to learn a more general classification function. To achieve this end, we extend the input features by concatenating encodings of them, and then train the classifier on the extended features. As an illustration of this idea, we focus on image classification, where we use channel codes on the input features as a systematic way to improve the degree to which the training and test sets are representative of the empirical sample set. To showcase our theoretical findings, considering that corrupted or perturbed input features belong to the empirical sample set, but typically not to the training and test sets, we demonstrate through extensive systematic experiments that, as a result of learning a more general classification function, a model trained on encoded input features is significantly more robust to common corruptions, e.g., Gaussian and shot noise, as well as adversarial perturbations, e.g., those found via projected gradient descent, than the model trained on uncoded input features. |
Tasks | Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bke7MANKvS, https://openreview.net/pdf?id=Bke7MANKvS |
PWC | https://paperswithcode.com/paper/a-kolmogorov-complexity-approach-to |
Repo | |
Framework | |