Paper Group NANR 5
Incorporating Perceptual Prior to Improve Model’s Adversarial Robustness
Title | Incorporating Perceptual Prior to Improve Model’s Adversarial Robustness |
Authors | Anonymous |
Abstract | Deep Neural Networks trained using human-annotated data are able to achieve human-like accuracy on many computer vision tasks such as classification, object recognition and segmentation. However, they are still far from being as robust as the human visual system. In this paper, we demonstrate that even models that are trained to be robust to random perturbations do not necessarily learn robust representations. We propose to address this by imposing a perception based prior on the learned representations to ensure that perceptually similar images have similar representations. We demonstrate that, although this training method does not use adversarial samples during training, it significantly improves the network’s robustness to single-step and multi-step adversarial attacks, thus validating our hypothesis that the network indeed learns more robust representations. Our proposed method provides a means of achieving adversarial robustness at no additional computational cost when compared to normal training. |
Tasks | Object Recognition |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1grayHYDH |
PDF | https://openreview.net/pdf?id=B1grayHYDH |
PWC | https://paperswithcode.com/paper/incorporating-perceptual-prior-to-improve |
Repo | |
Framework | |
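A minimal sketch of the kind of perception-based prior the abstract describes: alongside the usual cross-entropy, penalize the distance between the representations of an image and a perceptually similar copy. The perturbation choice and the `features`/`classifier` split are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def perceptual_prior_loss(model, x, y, noise_std=0.05, lam=1.0):
    """Cross-entropy plus a penalty that pulls together the representations of
    an image and a perceptually similar (slightly perturbed) copy.
    `model.features`, `model.classifier`, and the perturbation are illustrative
    assumptions, not the paper's exact choices."""
    x_sim = (x + noise_std * torch.randn_like(x)).clamp(0.0, 1.0)  # perceptually similar image
    z, z_sim = model.features(x), model.features(x_sim)            # learned representations
    logits = model.classifier(z)
    ce = F.cross_entropy(logits, y)
    prior = F.mse_loss(z, z_sim)                                   # representation-similarity prior
    return ce + lam * prior
```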
Neural Clustering Processes
Title | Neural Clustering Processes |
Authors | Anonymous |
Abstract | Mixture models, a basic building block in countless statistical models, involve latent random variables over discrete spaces, and existing posterior inference methods can be inaccurate and/or very slow. In this work we introduce a novel deep learning architecture for efficient amortized Bayesian inference over mixture models. While previous approaches to amortized clustering assumed a fixed or maximum number of mixture components and only amortized over the continuous parameters of each mixture component, our method amortizes over the local discrete labels of all the data points, and performs inference over an unbounded number of mixture components. The latter property makes our method natural for the challenging case of nonparametric Bayesian models, where the number of mixture components grows with the dataset. Our approach exploits the exchangeability of the generative models and is based on mapping distributed, permutation-invariant representations of discrete arrangements into varying-size multinomial conditional probabilities. The resulting algorithm parallelizes easily, yields iid samples from the approximate posteriors along with a normalized probability estimate of each sample (a quantity generally unavailable using Markov Chain Monte Carlo) and can easily be applied to both conjugate and non-conjugate models, as training only requires samples from the generative model. We also present an extension of the method to models of random communities (such as infinite relational or stochastic block models). As a scientific application, we present a novel approach to neural spike sorting for high-density multielectrode arrays. |
Tasks | Bayesian Inference |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxF80NYwS |
PDF | https://openreview.net/pdf?id=ryxF80NYwS |
PWC | https://paperswithcode.com/paper/neural-clustering-processes |
Repo | |
Framework | |
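The abstract describes amortizing over discrete labels by mapping permutation-invariant summaries of the current partial clustering into varying-size multinomial conditionals. A toy sketch of that sequential sampling loop is below; the `score_net` stand-in replaces the paper's learned encoders and is purely an assumption.

```python
import torch

def sample_clustering(points, score_net):
    """Sequentially assign each point to an existing cluster or a new one.
    `score_net` maps (point, cluster summary) pairs to unnormalized scores;
    summing member embeddings gives a permutation-invariant cluster summary.
    This is an illustrative sketch, not the paper's architecture."""
    labels, summaries = [], []
    for x in points:
        cand = summaries + [torch.zeros_like(x)]   # existing clusters + "new cluster" option
        scores = torch.stack([score_net(x, s) for s in cand])
        probs = torch.softmax(scores, dim=0)       # varying-size multinomial conditional
        k = torch.multinomial(probs, 1).item()
        if k == len(summaries):
            summaries.append(x.clone())            # open a new cluster
        else:
            summaries[k] += x                      # update permutation-invariant summary
        labels.append(k)
    return labels

# toy usage with a stand-in scoring function
score_net = lambda x, s: torch.nn.functional.cosine_similarity(x, s + 1e-6, dim=0)
pts = torch.randn(10, 8)
print(sample_clustering(pts, score_net))
```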
Bayesian Inference for Large Scale Image Classification
Title | Bayesian Inference for Large Scale Image Classification |
Authors | Anonymous |
Abstract | Bayesian inference promises to ground and improve the performance of deep neural networks. It promises to be robust to overfitting, to simplify the training procedure and the space of hyperparameters, and to provide a calibrated measure of uncertainty that can enhance decision making, agent exploration and prediction fairness. Markov Chain Monte Carlo (MCMC) methods enable Bayesian inference by generating samples from the posterior distribution over model parameters. Despite the theoretical advantages of Bayesian inference and the similarity between MCMC and optimization methods, the performance of sampling methods has so far lagged behind optimization methods for large scale deep learning tasks. We aim to fill this gap and introduce ATMC, an adaptive noise MCMC algorithm that estimates and is able to sample from the posterior of a neural network. ATMC dynamically adjusts the amount of momentum and noise applied to each parameter update in order to compensate for the use of stochastic gradients. We use a ResNet architecture without batch normalization to test ATMC on the Cifar10 benchmark and the large scale ImageNet benchmark and show that, despite the absence of batch normalization, ATMC outperforms a strong optimization baseline in terms of both classification accuracy and test log-likelihood. We show that ATMC is intrinsically robust to overfitting on the training data and that ATMC provides a better calibrated measure of uncertainty compared to the optimization baseline. |
Tasks | Bayesian Inference, Decision Making, Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rklFh34Kwr |
PDF | https://openreview.net/pdf?id=rklFh34Kwr |
PWC | https://paperswithcode.com/paper/bayesian-inference-for-large-scale-image-1 |
Repo | |
Framework | |
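The abstract does not give ATMC's exact update rule; as a hedged illustration of the SG-MCMC family it belongs to, here is a momentum-based stochastic-gradient MCMC step (SGHMC-style) with injected Gaussian noise. The fixed noise scale is exactly the part ATMC replaces with an adaptive, per-parameter estimate.

```python
import torch

def sgmcmc_step(params, momenta, grad_fn, lr=1e-4, friction=0.1):
    """One momentum-based SG-MCMC update (SGHMC-style). `grad_fn` returns
    stochastic gradients of the negative log posterior. ATMC adapts the
    momentum/noise per parameter; the noise scale here is fixed for illustration."""
    noise_scale = (2.0 * friction * lr) ** 0.5
    grads = grad_fn(params)
    for p, m, g in zip(params, momenta, grads):
        m.mul_(1.0 - friction).add_(-lr * g)         # friction + stochastic gradient
        m.add_(noise_scale * torch.randn_like(p))    # injected Gaussian noise
        p.add_(m)                                    # position (parameter) update
    return params, momenta

# toy usage: sample from a standard normal "posterior" over one parameter
params, momenta = [torch.zeros(1)], [torch.zeros(1)]
neg_log_post_grad = lambda ps: [ps[0]]               # gradient of 0.5 * theta^2
for _ in range(1000):
    params, momenta = sgmcmc_step(params, momenta, neg_log_post_grad, lr=1e-2)
print(params[0])
```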
Refining the variational posterior through iterative optimization
Title | Refining the variational posterior through iterative optimization |
Authors | Anonymous |
Abstract | Variational inference (VI) is a popular approach for approximate Bayesian inference that is particularly promising for highly parameterized models such as deep neural networks. A key challenge of variational inference is to approximate the posterior over model parameters with a distribution that is simpler and tractable yet sufficiently expressive. In this work, we propose a method for training highly flexible variational distributions by starting with a coarse approximation and iteratively refining it. Each refinement step makes cheap, local adjustments and only requires optimization of simple variational families. We demonstrate theoretically that our method always improves a bound on the approximation (the Evidence Lower BOund) and observe this empirically across a variety of benchmark tasks. In experiments, our method consistently outperforms recent variational inference methods for deep learning in terms of log-likelihood and the ELBO. We see that the gains are further amplified on larger scale models, significantly outperforming standard VI and deep ensembles on residual networks on CIFAR10. |
Tasks | Bayesian Inference |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkglZyHtvH |
PDF | https://openreview.net/pdf?id=rkglZyHtvH |
PWC | https://paperswithcode.com/paper/refining-the-variational-posterior-through |
Repo | |
Framework | |
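A small sketch of the coarse-to-fine idea on a toy 2-D target: fit a diagonal Gaussian by maximizing a Monte-Carlo ELBO, then repeatedly optimize only a cheap local adjustment of the mean while earlier refinements stay frozen. The additive-mean parameterization is an assumption for illustration, not the paper's refinement operator.

```python
import math
import torch

def elbo(mu, log_sigma, log_p, n=64):
    """Reparameterized Monte-Carlo ELBO for a diagonal Gaussian q = N(mu, sigma^2)."""
    z = mu + log_sigma.exp() * torch.randn(n, mu.shape[0])
    entropy = (log_sigma + 0.5 * math.log(2 * math.pi * math.e)).sum()
    return log_p(z).mean() + entropy

log_p = lambda z: -0.5 * ((z - 3.0) ** 2).sum(-1)    # unnormalized target: N(3, I)
mu0 = torch.zeros(2)                                 # coarse initial mean
log_sigma = torch.zeros(2, requires_grad=True)
deltas = []                                          # local adjustments, optimized one at a time
for step in range(3):
    delta = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([delta, log_sigma], lr=0.1)
    for _ in range(200):
        mu = mu0 + sum(deltas) + delta               # earlier refinements stay frozen
        loss = -elbo(mu, log_sigma, log_p)
        opt.zero_grad(); loss.backward(); opt.step()
    deltas.append(delta.detach())
    print(f"refinement {step}: ELBO = {-loss.item():.3f}")
```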
Scale-Equivariant Neural Networks with Decomposed Convolutional Filters
Title | Scale-Equivariant Neural Networks with Decomposed Convolutional Filters |
Authors | Anonymous |
Abstract | Encoding the input scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many vision tasks especially when dealing with multiscale input signals. We study, in this paper, a scale-equivariant CNN architecture with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve scale-equivariant representations. To reduce the model complexity and computational burden, we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation. Numerical experiments demonstrate that the proposed scale-equivariant neural network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size. |
Tasks | Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkgCJ64tDB |
PDF | https://openreview.net/pdf?id=rkgCJ64tDB |
PWC | https://paperswithcode.com/paper/scale-equivariant-neural-networks-with-1 |
Repo | |
Framework | |
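A simplified stand-in for a joint space-scale convolution: apply rescaled copies of the same filter so the output gains a scale axis. The bilinear filter rescaling below is an assumption; ScDCFNet instead expands filters in truncated separable bases to reduce cost and improve deformation robustness.

```python
import torch
import torch.nn.functional as F

def scale_equivariant_conv(x, weight, scales=(1.0, 1.5, 2.0)):
    """Apply the same filter at several rescalings so the output gains a scale
    axis: a simplified stand-in for the joint space-scale convolution."""
    outs = []
    for s in scales:
        k = int(round(weight.shape[-1] * s))
        k += (k + 1) % 2                              # keep kernel size odd so outputs align
        w = F.interpolate(weight, size=(k, k), mode="bilinear", align_corners=False)
        outs.append(F.conv2d(x, w, padding=k // 2))
    return torch.stack(outs, dim=2)                   # (batch, out_ch, scale, H, W)

x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)
print(scale_equivariant_conv(x, w).shape)
```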
Nonlinearities in activations substantially shape the loss surfaces of neural networks
Title | Nonlinearities in activations substantially shape the loss surfaces of neural networks |
Authors | Anonymous |
Abstract | Understanding the loss surfaces of neural networks is fundamentally important to understanding deep learning. This paper presents how the nonlinearities in activations substantially shape the loss surfaces of neural networks. We first prove that the loss surface of every neural network has infinitely many spurious local minima, which are defined as local minima with higher empirical risk than the global minima. Our result holds for any neural network with arbitrary depth and arbitrary piecewise linear activation functions (excluding linear functions) under most loss functions used in practice. This result demonstrates that nonlinear networks differ substantially from the well-studied linear neural networks. Essentially, the underlying assumptions for the above result are consistent with most practical circumstances, where the output layer is narrower than any hidden layer. We further prove a theorem that draws a big picture of the loss surfaces of nonlinear neural networks in the following respects. (1) Smooth and multilinear partition: the loss surface is partitioned into multiple smooth and multilinear open cells. (2) Local analogous convexity: within every cell, local minima are equally good; equivalently, they are all global minima within the cell. (3) Local minima valley: some local minima are concentrated into a valley in some cell, sharing the same empirical risk. (4) Linear collapse: when all activations are linear, the partitioned loss surface collapses to one single cell, which includes linear neural networks as a simplified case. The second result holds for one-hidden-layer networks for regression under convex loss, while all others apply to networks of arbitrary depth. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1x6BTEKwr |
PDF | https://openreview.net/pdf?id=B1x6BTEKwr |
PWC | https://paperswithcode.com/paper/nonlinearities-in-activations-substantially |
Repo | |
Framework | |
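One way to state the paper's central definition formally (phrased from the abstract; the paper's own notation may differ), with R_n denoting the empirical risk:

```latex
% \theta^* is a spurious local minimum of the empirical risk R_n if it is a
% local minimum whose risk exceeds the global infimum:
\exists\, \varepsilon > 0:\;
R_n(\theta^\ast) \le R_n(\theta)
\quad \forall\, \theta \ \text{with}\ \|\theta - \theta^\ast\| < \varepsilon,
\qquad\text{and}\qquad
R_n(\theta^\ast) > \inf_{\theta} R_n(\theta).
```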
Deep Coordination Graphs
Title | Deep Coordination Graphs |
Authors | Anonymous |
Abstract | This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. DCG strikes a flexible trade-off between representational capacity and generalization by factorizing the joint value function of all agents according to a coordination graph into payoffs between pairs of agents. The value can be maximized by local message passing along the graph, which allows training of the value function end-to-end with Q-learning. Payoff functions are approximated with deep neural networks and parameter sharing improves generalization over the state-action space. We show that DCG can solve challenging predator-prey tasks that are vulnerable to the relative overgeneralization pathology and in which all other known value factorization approaches fail. |
Tasks | Multi-agent Reinforcement Learning, Q-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HklRKpEKDr |
PDF | https://openreview.net/pdf?id=HklRKpEKDr |
PWC | https://paperswithcode.com/paper/deep-coordination-graphs-1 |
Repo | |
Framework | |
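The factorization stated in the abstract, on a toy coordination graph: the joint value is a sum of per-agent utilities and pairwise payoffs along the edges. DCG parameterizes these terms with neural networks and maximizes by local message passing; this sketch simply brute-forces the tiny joint action space.

```python
import itertools
import numpy as np

# Toy coordination graph: 3 agents, 2 actions each, edges (0,1) and (1,2).
rng = np.random.default_rng(0)
n_agents, n_actions, edges = 3, 2, [(0, 1), (1, 2)]
f_i = rng.normal(size=(n_agents, n_actions))                         # utilities f_i(a_i)
f_ij = {e: rng.normal(size=(n_actions, n_actions)) for e in edges}   # payoffs f_ij(a_i, a_j)

def q_joint(actions):
    """Joint value factorized into per-agent utilities and pairwise payoffs."""
    return sum(f_i[i, a] for i, a in enumerate(actions)) + \
           sum(f_ij[(i, j)][actions[i], actions[j]] for i, j in edges)

best = max(itertools.product(range(n_actions), repeat=n_agents), key=q_joint)
print("greedy joint action:", best, "value:", round(q_joint(best), 3))
```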
Learning Surrogate Losses
Title | Learning Surrogate Losses |
Authors | Anonymous |
Abstract | The minimization of loss functions is the heart and soul of Machine Learning. In this paper, we propose an off-the-shelf optimization approach that can seamlessly minimize virtually any non-differentiable and non-decomposable loss function (e.g. Misclassification Rate, AUC, F1, Jaccard Index, Matthews Correlation Coefficient, etc.). Our strategy learns smooth, relaxed versions of the true losses by approximating them through a surrogate neural network. The proposed loss networks are set-wise models which are invariant to the order of mini-batch instances. Ultimately, the surrogate losses are learned jointly with the prediction model via bilevel optimization. Empirical results on multiple datasets with diverse real-life loss functions compared with state-of-the-art baselines demonstrate the efficiency of learning surrogate losses. |
Tasks | bilevel optimization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkePHaVKwS |
PDF | https://openreview.net/pdf?id=BkePHaVKwS |
PWC | https://paperswithcode.com/paper/learning-surrogate-losses-1 |
Repo | |
Framework | |
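A minimal sketch of the surrogate-loss idea under stated assumptions: a set-wise surrogate network is fit to a non-differentiable batch metric (error rate here), and the classifier is then updated through the differentiable surrogate. The simple alternating updates stand in for the paper's bilevel optimization.

```python
import torch
import torch.nn as nn

true_metric = lambda logits, y: (logits.argmax(1) != y).float().mean()     # non-differentiable

model = nn.Linear(20, 2)
surrogate = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))   # per-example scores, mean-pooled (set-wise)
opt_m = torch.optim.Adam(model.parameters(), lr=1e-2)
opt_s = torch.optim.Adam(surrogate.parameters(), lr=1e-2)

x = torch.randn(512, 20)
y = (x[:, 0] > 0).long()
for step in range(200):
    logits = model(x)
    feats = torch.cat([logits, y.float().unsqueeze(1)], dim=1)      # (prediction, label) per example
    surr_value = surrogate(feats).mean()                            # permutation-invariant batch estimate
    # inner step: fit the surrogate to the true, non-differentiable metric
    loss_s = (surr_value - true_metric(logits, y)) ** 2
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    # outer step: update the classifier through the differentiable surrogate
    surr_value = surrogate(torch.cat([model(x), y.float().unsqueeze(1)], dim=1)).mean()
    opt_m.zero_grad(); surr_value.backward(); opt_m.step()
print("final error rate:", true_metric(model(x), y).item())
```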
Gaussian Conditional Random Fields for Classification
Title | Gaussian Conditional Random Fields for Classification |
Authors | Anonymous |
Abstract | In this paper, a Gaussian conditional random field model for structured binary classification (GCRFBC) is proposed. The model is applicable to classification problems with undirected graphs that are intractable for standard classification CRFs. The model representation of GCRFBC is extended with latent variables, which yields some appealing properties. Thanks to the GCRF latent structure, the model becomes tractable, efficient, and open to improvements previously applied to GCRF regression. Two different forms of the algorithm are presented: GCRFBCb (GCRFBC - Bayesian) and GCRFBCnb (GCRFBC - non-Bayesian). The extended method of local variational approximation of the sigmoid function is used for solving empirical Bayes in the GCRFBCb variant, whereas the MAP value of the latent variables is the basis for learning and inference in the GCRFBCnb variant. The inference in GCRFBCb is solved by Newton-Cotes formulas for one-dimensional integration. Both models are evaluated on synthetic and real-world data, and both achieve better prediction performance than relevant baselines. Advantages and disadvantages of the proposed models are discussed. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxC-kBYDS |
PDF | https://openreview.net/pdf?id=ryxC-kBYDS |
PWC | https://paperswithcode.com/paper/gaussian-conditional-random-fields-for-1 |
Repo | |
Framework | |
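The GCRFBCb variant builds on the local variational approximation of the sigmoid; the standard (Jaakkola-Jordan) bound underlying that approximation, with the variational parameter ξ tightened per data point, is:

```latex
% Local variational (Jaakkola--Jordan) lower bound on the logistic sigmoid,
% the standard form that the paper's empirical Bayes step extends:
\sigma(x) \;\ge\; \sigma(\xi)\,
  \exp\!\Big(\tfrac{x-\xi}{2} - \lambda(\xi)\,(x^2 - \xi^2)\Big),
\qquad
\lambda(\xi) = \frac{1}{2\xi}\Big(\sigma(\xi) - \tfrac{1}{2}\Big).
```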
Open-Set Domain Adaptation with Category-Agnostic Clusters
Title | Open-Set Domain Adaptation with Category-Agnostic Clusters |
Authors | Anonymous |
Abstract | Unsupervised domain adaptation has received significant attention in recent years. Most existing works tackle the closed-set scenario, assuming that the source and target domains share exactly the same categories. In practice, nevertheless, a target domain often contains samples of classes unseen in the source domain (i.e., unknown classes). The extension of domain adaptation from the closed-set to such an open-set situation is not trivial, since the target samples of unknown classes are not expected to align with the source. In this paper, we address this problem by augmenting the state-of-the-art domain adaptation technique, Self-Ensembling, with category-agnostic clusters in the target domain. Specifically, we present Self-Ensembling with Category-agnostic Clusters (SE-CC) — a novel architecture that steers domain adaptation with the additional guidance of category-agnostic clusters that are specific to the target domain. This clustering information provides domain-specific visual cues, facilitating the generalization of Self-Ensembling to both closed-set and open-set scenarios. Technically, clustering is first performed over all the unlabeled target samples to obtain the category-agnostic clusters, which reveal the underlying data space structure peculiar to the target domain. A clustering branch then ensures that the learnt representation preserves this underlying structure by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample. Furthermore, SE-CC enhances the learnt representation with mutual information maximization. Extensive experiments are conducted on the Office and VisDA datasets for both open-set and closed-set domain adaptation, and superior results are reported when comparing to state-of-the-art approaches. |
Tasks | Domain Adaptation, Unsupervised Domain Adaptation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bkgv71rtwr |
PDF | https://openreview.net/pdf?id=Bkgv71rtwr |
PWC | https://paperswithcode.com/paper/open-set-domain-adaptation-with-category |
Repo | |
Framework | |
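A hedged sketch of the clustering guidance described in the abstract: cluster all unlabeled target features (k-means is an assumed choice) and train the clustering branch to match the resulting assignments via cross-entropy. Everything beyond the abstract, including the exact matching loss, is an assumption.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_matching_loss(features, cluster_logits, n_clusters=10):
    """Cluster all target features (k-means, an assumed choice) and make the
    clustering branch's predicted assignment distribution match the inherent
    cluster assignment via cross-entropy."""
    with torch.no_grad():
        assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features.cpu().numpy())
        target = torch.as_tensor(assign, dtype=torch.long, device=cluster_logits.device)
    return F.cross_entropy(cluster_logits, target)

feats = torch.randn(256, 64)                         # unlabeled target-domain features
logits = torch.randn(256, 10, requires_grad=True)    # clustering-branch outputs
print(cluster_matching_loss(feats, logits).item())
```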
Distribution Matching Prototypical Network for Unsupervised Domain Adaptation
Title | Distribution Matching Prototypical Network for Unsupervised Domain Adaptation |
Authors | Anonymous |
Abstract | State-of-the-art Unsupervised Domain Adaptation (UDA) methods learn transferable features by minimizing the feature distribution discrepancy between the source and target domains. Different from these methods, which do not model the feature distributions explicitly, in this paper we explore explicit feature distribution modeling for UDA. In particular, we propose the Distribution Matching Prototypical Network (DMPN), which models the deep features from each domain as Gaussian mixture distributions. With explicit feature distribution modeling, we can easily measure the discrepancy between the two domains. In DMPN, we propose two new domain discrepancy losses with probabilistic interpretations. The first minimizes the distances between the corresponding Gaussian component means of the source and target data. The second minimizes the pseudo negative log-likelihood of generating the target features from the source feature distribution. To learn both discriminative and domain-invariant features, DMPN is trained by jointly minimizing the classification loss on the labeled source data and the domain discrepancy losses. Extensive experiments are conducted over two UDA tasks. Our approach outperforms state-of-the-art approaches by a large margin on the Digits image transfer task. More remarkably, DMPN obtains a mean accuracy of 81.4% on the VisDA 2017 dataset. A hyper-parameter sensitivity analysis shows that our approach is robust w.r.t. hyper-parameter changes. |
Tasks | Domain Adaptation, Unsupervised Domain Adaptation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1eX1yrKwB |
PDF | https://openreview.net/pdf?id=r1eX1yrKwB |
PWC | https://paperswithcode.com/paper/distribution-matching-prototypical-network |
Repo | |
Framework | |
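The two discrepancy losses from the abstract in a minimal form: (1) distances between corresponding Gaussian component means, and (2) the pseudo negative log-likelihood of target features under the source components. Unit covariances and uniform mixing weights are simplifying assumptions.

```python
import torch

def dmpn_discrepancy(src_means, tgt_means, tgt_feats):
    """Sketch of DMPN's two discrepancy losses: mean-matching between
    corresponding Gaussian components, and the pseudo negative log-likelihood
    of target features under the source components (unit covariance and
    uniform mixing assumed for illustration)."""
    mean_loss = ((src_means - tgt_means) ** 2).sum(dim=1).mean()
    d2 = torch.cdist(tgt_feats, src_means) ** 2              # (N, K) squared distances
    log_lik = torch.logsumexp(-0.5 * d2, dim=1) - torch.log(torch.tensor(float(src_means.shape[0])))
    nll_loss = -log_lik.mean()
    return mean_loss, nll_loss

K, D = 5, 64
src_mu, tgt_mu = torch.randn(K, D), torch.randn(K, D)
tgt_x = torch.randn(128, D)
print([v.item() for v in dmpn_discrepancy(src_mu, tgt_mu, tgt_x)])
```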
Enhanced Convolutional Neural Tangent Kernels
Title | Enhanced Convolutional Neural Tangent Kernels |
Authors | Anonymous |
Abstract | Recent research shows that, for training with l2 loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are trained. An exact algorithm to compute the CNTK (Arora et al., 2019) yielded the finding that the classification accuracy of the CNTK on CIFAR-10 is within 6-7% of that of the corresponding CNN architecture (the best figure being around 78%), which is interesting performance for a fixed kernel. Here we show how to significantly enhance the performance of these kernels using two ideas. (1) Modifying the kernel using a new operation called Local Average Pooling (LAP), which preserves efficient computability of the kernel and inherits the spirit of standard data augmentation using pixel shifts. Earlier papers were unable to incorporate naive data augmentation because of the quadratic training cost of kernel regression. This idea is inspired by Global Average Pooling (GAP), which, as we show for CNN-GP and CNTK, is equivalent to full translation data augmentation. (2) Representing the input image using a pre-processing technique proposed by Coates et al. (2011), which uses a single convolutional layer composed of random image patches. On CIFAR-10 the resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet (Krizhevsky et al., 2012). Note that this is the best such result we know of for a classifier that is not a trained neural network. Similar improvements are obtained for Fashion-MNIST. |
Tasks | Data Augmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkgNqkHFPr |
PDF | https://openreview.net/pdf?id=BkgNqkHFPr |
PWC | https://paperswithcode.com/paper/enhanced-convolutional-neural-tangent-kernels |
Repo | |
Framework | |
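The kernel modifications themselves are hard to sketch briefly, but the second ingredient, the Coates et al. (2011) preprocessing, is a single convolutional layer whose filters are random image patches drawn from the data. A simplified version (normalization and whitening details omitted) could look like:

```python
import torch
import torch.nn.functional as F

def random_patch_features(images, n_filters=64, patch=5, seed=0):
    """One convolutional layer whose filters are random image patches drawn
    from the data, a simplified sketch of the Coates et al. (2011) preprocessing."""
    g = torch.Generator().manual_seed(seed)
    n, c, h, w = images.shape
    idx = torch.randint(0, n, (n_filters,), generator=g)
    ys = torch.randint(0, h - patch + 1, (n_filters,), generator=g)
    xs = torch.randint(0, w - patch + 1, (n_filters,), generator=g)
    filters = torch.stack([images[i, :, y:y + patch, x:x + patch]
                           for i, y, x in zip(idx, ys, xs)])
    filters = (filters - filters.mean(dim=(1, 2, 3), keepdim=True)) / \
              (filters.std(dim=(1, 2, 3), keepdim=True) + 1e-6)      # per-patch normalization
    return F.relu(F.conv2d(images, filters, padding=patch // 2))

imgs = torch.rand(8, 3, 32, 32)
print(random_patch_features(imgs).shape)    # (8, 64, 32, 32)
```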
Wildly Unsupervised Domain Adaptation and Its Powerful and Efficient Solution
Title | Wildly Unsupervised Domain Adaptation and Its Powerful and Efficient Solution |
Authors | Anonymous |
Abstract | In unsupervised domain adaptation (UDA), classifiers for the target domain (TD) are trained with clean labeled data from the source domain (SD) and unlabeled data from the TD. However, in the wild, it is hard to acquire a large amount of perfectly clean labeled data in the SD given a limited budget. Hence, we consider a new, more realistic and more challenging problem setting, where classifiers have to be trained with noisy labeled data from the SD and unlabeled data from the TD—we name it wildly UDA (WUDA). We show that WUDA ruins all UDA methods if label noise in the SD is not taken care of, and to this end, we propose the Butterfly framework, a powerful and efficient solution to WUDA. Butterfly maintains four models (e.g., deep networks) simultaneously, where two take care of all adaptations (i.e., noisy-to-clean, labeled-to-unlabeled, and SD-to-TD-distributional) and the other two focus on classification in the TD. As a consequence, Butterfly possesses all the conceptually necessary components for solving WUDA. Experiments demonstrate that under WUDA, Butterfly significantly outperforms existing baseline methods. |
Tasks | Domain Adaptation, Unsupervised Domain Adaptation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkl2s34twS |
PDF | https://openreview.net/pdf?id=rkl2s34twS |
PWC | https://paperswithcode.com/paper/wildly-unsupervised-domain-adaptation-and-its |
Repo | |
Framework | |
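The abstract does not spell out how the four models interact. One widely used mechanism for the noisy-to-clean part, and only an assumption here, is co-teaching-style small-loss selection between peer networks:

```python
import torch
import torch.nn.functional as F

def small_loss_selection(logits_a, logits_b, y, keep_ratio=0.7):
    """Each of two peer networks selects the small-loss (likely clean) samples
    for the other. This co-teaching-style step is an assumed illustration of
    the noisy-to-clean adaptation, not Butterfly's exact procedure."""
    loss_a = F.cross_entropy(logits_a, y, reduction="none")
    loss_b = F.cross_entropy(logits_b, y, reduction="none")
    k = int(keep_ratio * y.shape[0])
    idx_for_b = loss_a.topk(k, largest=False).indices   # A picks clean-looking samples for B
    idx_for_a = loss_b.topk(k, largest=False).indices   # B picks clean-looking samples for A
    return idx_for_a, idx_for_b
```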
Learning vector representation of local content and matrix representation of local motion, with implications for V1
Title | Learning vector representation of local content and matrix representation of local motion, with implications for V1 |
Authors | Anonymous |
Abstract | This paper proposes a representational model for image pairs, such as consecutive video frames, that are related by local pixel displacements, in the hope that the model may shed light on motion perception in the primary visual cortex (V1). The model couples the following two components: (1) vector representations of the local contents of images, and (2) matrix representations of the local pixel displacements caused by the relative motions between the agent and the objects in the 3D scene. When the image frame undergoes changes due to local pixel displacements, the vectors are multiplied by the matrices that represent the local displacements. Our experiments show that our model can learn to infer local motions. Moreover, the model can learn Gabor-like filter pairs of quadrature phases. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SyeLGlHtPS |
PDF | https://openreview.net/pdf?id=SyeLGlHtPS |
PWC | https://paperswithcode.com/paper/learning-vector-representation-of-local |
Repo | |
Framework | |
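The core coupling from the abstract: the local content is a vector v, a local displacement dx acts on it as a learned matrix M(dx), and the next frame's vector is predicted as M(dx) v. The toy training setup below (random vectors, a discrete displacement set, a synthetic ground-truth transform) is purely illustrative.

```python
import torch

dim, displacements = 16, [(-1, 0), (1, 0), (0, -1), (0, 1)]
# one learnable matrix M(dx) per displacement, initialized near the identity
M = {d: torch.nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim)) for d in displacements}
opt = torch.optim.Adam(M.values(), lr=1e-2)

def true_transform(v, d):
    """Synthetic stand-in for the ground-truth effect of the displacement."""
    return torch.roll(v, shifts=displacements.index(d) + 1, dims=-1)

for step in range(500):
    d = displacements[step % 4]
    v = torch.randn(32, dim)                                   # content vectors at time t
    loss = ((v @ M[d].T - true_transform(v, d)) ** 2).mean()   # || M(dx) v - v' ||^2
    opt.zero_grad(); loss.backward(); opt.step()
print("final loss:", round(loss.item(), 4))
```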
Differentiable Bayesian Neural Network Inference for Data Streams
Title | Differentiable Bayesian Neural Network Inference for Data Streams |
Authors | Anonymous |
Abstract | While deep neural networks (NNs) do not provide the confidence of their predictions, a Bayesian neural network (BNN) can estimate the uncertainty of a prediction. However, BNNs have not been widely used in practice due to the computational cost of predictive inference. This prohibitive computational cost is a hindrance especially when processing stream data with low latency. To address this problem, we propose a novel model which approximates BNNs for data streams. Instead of generating a separate prediction for each data sample independently, this model estimates the increment of the prediction for a new data sample from the previous predictions. The computational cost of this model is almost the same as that of non-Bayesian deep NNs. Experiments including semantic segmentation on real-world data show that this model runs significantly faster than BNNs while estimating uncertainty comparable to that of BNNs. |
Tasks | Semantic Segmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJx7wlSYvB |
PDF | https://openreview.net/pdf?id=rJx7wlSYvB |
PWC | https://paperswithcode.com/paper/differentiable-bayesian-neural-network-1 |
Repo | |
Framework | |
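A heavily hedged skeleton of the incremental idea in the abstract: pay the full (approximate) Bayesian inference cost once, then let a small network predict the change of the prediction from the change of the input. All architectural details below are assumptions.

```python
import torch
import torch.nn as nn

class IncrementalPredictor(nn.Module):
    """Estimate the new prediction as previous prediction + a learned increment,
    instead of re-running expensive BNN inference on every stream sample.
    The increment network and its inputs are illustrative assumptions."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.increment_net = nn.Sequential(nn.Linear(in_dim + out_dim, 64),
                                           nn.ReLU(), nn.Linear(64, out_dim))
        self.prev_x, self.prev_pred = None, None

    def forward(self, x, full_inference):
        if self.prev_x is None:                      # first sample: pay the full BNN cost once
            pred = full_inference(x)
        else:                                        # later samples: cheap incremental update
            delta = self.increment_net(torch.cat([x - self.prev_x, self.prev_pred], dim=-1))
            pred = self.prev_pred + delta
        self.prev_x, self.prev_pred = x.detach(), pred.detach()
        return pred

model = IncrementalPredictor(10, 3)
mc_bnn = lambda x: torch.randn(x.shape[0], 3)        # placeholder for an expensive MC-sampled BNN
for t in range(5):
    print(model(torch.randn(1, 10), mc_bnn).shape)
```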