April 1, 2020

3094 words 15 mins read

Paper Group NANR 24

Statistical Adaptive Stochastic Optimization. Learning to Control PDEs with Differentiable Physics. NADS: Neural Architecture Distribution Search for Uncertainty Awareness. Neural Architecture Search by Learning Action Space for Monte Carlo Tree Search. Computation Reallocation for Object Detection. BANANAS: Bayesian Optimization with Neural Networ …

Statistical Adaptive Stochastic Optimization

Title Statistical Adaptive Stochastic Optimization
Authors Anonymous
Abstract We investigate statistical methods for automatically scheduling the learning rate (step size) in stochastic optimization. First, we consider a broad family of stochastic optimization methods with constant hyperparameters (including the learning rate and various forms of momentum) and derive a general necessary condition for the resulting dynamics to be stationary. Based on this condition, we develop a simple online statistical test to detect (non-)stationarity and use it to automatically drop the learning rate by a constant factor whenever stationarity is detected. Unlike in prior work, our stationarity condition and our statistical test apply to different algorithms without modification. Finally, we propose a smoothed stochastic line-search method that can be used to warm up the optimization process before the statistical test can be applied effectively. This removes the expensive trial and error of setting a good initial learning rate. The combined method is highly autonomous and attains state-of-the-art training and testing performance in our experiments on several deep learning tasks.
Tasks Stochastic Optimization
Published 2020-01-01
URL https://openreview.net/forum?id=B1gkpR4FDB
PDF https://openreview.net/pdf?id=B1gkpR4FDB
PWC https://paperswithcode.com/paper/statistical-adaptive-stochastic-optimization
Repo
Framework
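
The entry above describes dropping the learning rate whenever an online statistical test detects stationarity. The sketch below only illustrates that control flow on a toy least-squares problem; the t-test on recent loss differences is a generic stand-in for the paper's stationarity test, and all constants (window size, drop factor, p-value cutoff) are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 10))
b = A @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)
w = np.zeros(10)                                     # parameters of the toy problem

lr, drop_factor, window = 0.1, 0.5, 200
recent_deltas, prev_loss = [], None

for step in range(5000):
    i = rng.integers(0, len(b), size=32)             # sample a mini-batch
    grad = A[i].T @ (A[i] @ w - b[i]) / len(i)       # stochastic gradient
    w -= lr * grad
    loss = 0.5 * np.mean((A[i] @ w - b[i]) ** 2)
    if prev_loss is not None:
        recent_deltas.append(loss - prev_loss)
    prev_loss = loss
    if len(recent_deltas) >= window:
        # if the mean loss change is statistically indistinguishable from zero,
        # treat the dynamics as stationary and drop the learning rate
        res = stats.ttest_1samp(recent_deltas, popmean=0.0)
        if res.pvalue > 0.2:
            lr *= drop_factor
            print(f"step {step}: stationarity detected, lr -> {lr:.4f}")
        recent_deltas.clear()
```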

Learning to Control PDEs with Differentiable Physics

Title Learning to Control PDEs with Differentiable Physics
Authors Anonymous
Abstract Predicting outcomes and planning interactions with the physical world are long-standing goals for machine learning. A variety of such tasks involves continuous physical systems, which can be described by partial differential equations (PDEs) with many degrees of freedom. Existing methods that aim to control the dynamics of such systems are typically limited to relatively short time frames or a small number of interaction parameters. We show that by using a differentiable PDE solver in conjunction with a novel predictor-corrector scheme, we can train neural networks to understand and control complex nonlinear physical systems over long time frames. We demonstrate that our method successfully develops an understanding of complex physical systems and learns to control them for tasks involving multiple PDEs, including the incompressible Navier-Stokes equations.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HyeSin4FPB
PDF https://openreview.net/pdf?id=HyeSin4FPB
PWC https://paperswithcode.com/paper/learning-to-control-pdes-with-differentiable
Repo
Framework
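
The core mechanism in the abstract above is backpropagating through a differentiable PDE solver to obtain control signals. Below is a minimal sketch of that idea, assuming a 1D heat equation with an explicit finite-difference step and plain gradient descent on per-step forcing; the paper's predictor-corrector scheme and learned networks are not reproduced here.

```python
import torch

n, steps, dt, nu = 64, 50, 0.1, 0.1
target = torch.sin(torch.linspace(0.0, 3.14159, n))    # desired final state
control = torch.zeros(steps, n, requires_grad=True)    # per-step forcing to optimize

def heat_step(u, f):
    # explicit finite-difference step of u_t = nu * u_xx + f (periodic boundary)
    lap = torch.roll(u, 1) - 2 * u + torch.roll(u, -1)
    return u + dt * (nu * lap + f)

opt = torch.optim.Adam([control], lr=0.05)
for it in range(200):
    u = torch.zeros(n)
    for t in range(steps):
        u = heat_step(u, control[t])                    # differentiable rollout
    loss = torch.mean((u - target) ** 2)                # distance to target state
    opt.zero_grad()
    loss.backward()                                     # gradients flow through the solver
    opt.step()
print(float(loss))
```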

NADS: Neural Architecture Distribution Search for Uncertainty Awareness

Title NADS: Neural Architecture Distribution Search for Uncertainty Awareness
Authors Anonymous
Abstract Machine learning systems often encounter Out-of-Distribution (OoD) errors when dealing with testing data coming from a different distribution from the one used for training. With their growing use in critical applications, it becomes important to develop systems that are able to accurately quantify its predictive uncertainty and screen out these anomalous inputs. However, unlike standard learning tasks, there is currently no well established guiding principle for designing architectures that can accurately quantify uncertainty. Moreover, commonly used OoD detection approaches are prone to errors and even sometimes assign higher likelihoods to OoD samples. To address these problems, we first seek to identify guiding principles for designing uncertainty-aware architectures, by proposing Neural Architecture Distribution Search (NADS). Unlike standard neural architecture search methods which seek for a single best performing architecture, NADS searches for a distribution of architectures that perform well on a given task, allowing us to identify building blocks common among all uncertainty aware architectures. With this formulation, we are able to optimize a stochastic outlier detection objective and construct an ensemble of models to perform OoD detection. We perform multiple OoD detection experiments and observe that our NADS performs favorably compared to state-of-the-art OoD detection methods.
Tasks Neural Architecture Search, Outlier Detection
Published 2020-01-01
URL https://openreview.net/forum?id=rJeXDANKwr
PDF https://openreview.net/pdf?id=rJeXDANKwr
PWC https://paperswithcode.com/paper/nads-neural-architecture-distribution-search
Repo
Framework
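
As a rough illustration of searching over a distribution of architectures rather than a single one, the sketch below learns per-layer categorical op distributions with a simple score-function (REINFORCE) estimator against a placeholder reward; NADS's actual OoD-detection objective, op set, and optimization procedure are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, NUM_OPS, LR = 4, 3, 0.1
logits = np.zeros((NUM_LAYERS, NUM_OPS))          # parameters of the architecture distribution

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reward(arch):
    # placeholder reward: pretend op 2 is best in every layer
    return float(np.mean(arch == 2)) + 0.05 * rng.normal()

baseline = 0.0
for step in range(500):
    probs = softmax(logits)
    arch = np.array([rng.choice(NUM_OPS, p=probs[l]) for l in range(NUM_LAYERS)])
    r = reward(arch)
    baseline = 0.9 * baseline + 0.1 * r           # moving-average baseline
    for l in range(NUM_LAYERS):
        grad = -probs[l]
        grad[arch[l]] += 1.0                      # d log p(arch) / d logits
        logits[l] += LR * (r - baseline) * grad   # REINFORCE update

print(np.round(softmax(logits), 2))               # distribution concentrates on op 2
```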

Neural Architecture Search by Learning Action Space for Monte Carlo Tree Search

Title Neural Architecture Search by Learning Action Space for Monte Carlo Tree Search
Authors Anonymous
Abstract Neural Architecture Search (NAS) has emerged as a promising technique for automatic neural network design. However, existing NAS approaches often utilize a manually designed action space, which is not directly related to the performance metric to be optimized (e.g., accuracy). As a result, using a manually designed action space to perform NAS often leads to sample-inefficient exploration of architectures and thus can be sub-optimal. In order to improve sample efficiency, this paper proposes Latent Action Neural Architecture Search (LaNAS), which learns actions to recursively partition the search space into good or bad regions that contain networks with concentrated performance metrics, i.e., low variance. During the search phase, as different architecture search action sequences lead to regions of different performance, the search efficiency can be significantly improved by biasing towards the good regions. On the largest NAS dataset, NASBench-101, our experimental results demonstrate that LaNAS is 22x, 14.6x, 12.4x, 6.8x, and 16.5x more sample-efficient than Random Search, Regularized Evolution, Monte Carlo Tree Search, Neural Architecture Optimization, and Bayesian Optimization, respectively. When applied to the open domain, LaNAS achieves 98.0% accuracy on CIFAR-10 and 75.0% top-1 accuracy on ImageNet in only 803 samples, outperforming SOTA AmoebaNet with 33x fewer samples.
Tasks Neural Architecture Search
Published 2020-01-01
URL https://openreview.net/forum?id=SklR6aEtwH
PDF https://openreview.net/pdf?id=SklR6aEtwH
PWC https://paperswithcode.com/paper/neural-architecture-search-by-learning-action
Repo
Framework
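
The sketch below conveys the flavor of latent-action partitioning described above: a learned model recursively splits a pool of (architecture encoding, accuracy) samples into good and bad regions. The encodings and accuracies are synthetic, the split model is a plain ridge regression, and the MCTS search that exploits the resulting tree is omitted.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(512, 20)).astype(float)        # toy architecture encodings
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=512)    # toy "accuracy" signal

def partition(X, y, depth=0, max_depth=3):
    """Recursively split samples into good/bad regions using a learned linear model."""
    if depth == max_depth or len(y) < 32:
        return {"leaf": True, "mean_acc": float(y.mean()), "count": len(y)}
    model = Ridge().fit(X, y)                 # the "latent action": a learned split
    scores = model.predict(X)
    good = scores >= np.median(scores)
    return {
        "leaf": False,
        "good": partition(X[good], y[good], depth + 1, max_depth),
        "bad": partition(X[~good], y[~good], depth + 1, max_depth),
    }

tree = partition(X, y)
# leaves reached by always following "good" should have higher mean accuracy
print(tree["good"]["good"]["good"]["mean_acc"], tree["bad"]["bad"]["bad"]["mean_acc"])
```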

Computation Reallocation for Object Detection

Title Computation Reallocation for Object Detection
Authors Anonymous
Abstract The allocation of computation resources across different feature resolutions in the backbone is a crucial issue in object detection. However, the allocation pattern designed for classification is usually adopted directly for object detection, which proves to be sub-optimal. In order to reallocate the engaged computation resources in a more efficient way, we present CR-NAS (Computation Reallocation Neural Architecture Search), which can learn computation reallocation strategies on the target detection dataset. A two-level reallocation space is proposed for both stage and spatial reallocation. A novel hierarchical search procedure is adopted to cope with the complex search space. We apply CR-NAS to multiple backbones and achieve consistent improvements. Our CR-ResNet50 and CR-MobileNetV2 outperform the baselines by 1.9% and 1.7% COCO AP, respectively, without any additional computation budget. The models discovered by CR-NAS can be easily transferred to other datasets, e.g., PASCAL VOC, and other vision tasks, e.g., instance segmentation. CR-NAS can therefore be used as a plugin to improve the performance of various networks.
Tasks Instance Segmentation, Neural Architecture Search, Object Detection, Semantic Segmentation
Published 2020-01-01
URL https://openreview.net/forum?id=SkxLFaNKwB
PDF https://openreview.net/pdf?id=SkxLFaNKwB
PWC https://paperswithcode.com/paper/computation-reallocation-for-object-detection
Repo
Framework
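
As a small illustration of the stage-level reallocation space mentioned above, the sketch below enumerates ways of redistributing a fixed block budget across backbone stages; the spatial-level space and the hierarchical search procedure are not shown, and the stage count and budget are illustrative.

```python
from itertools import product

def stage_allocations(num_stages=4, total_blocks=16, min_blocks=1, max_blocks=8):
    """All ways to split `total_blocks` residual blocks across backbone stages."""
    return [alloc
            for alloc in product(range(min_blocks, max_blocks + 1), repeat=num_stages)
            if sum(alloc) == total_blocks]

candidates = stage_allocations()
# the ResNet-50-style baseline allocation (3, 4, 6, 3) is one point in this space
print(len(candidates), candidates[:3])
```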

BANANAS: Bayesian Optimization with Neural Networks for Neural Architecture Search

Title BANANAS: Bayesian Optimization with Neural Networks for Neural Architecture Search
Authors Anonymous
Abstract Neural Architecture Search (NAS) has seen an explosion of research in the past few years. A variety of methods have been proposed to perform NAS, including reinforcement learning, Bayesian optimization with a Gaussian process model, evolutionary search, and gradient descent. In this work, we design a NAS algorithm that performs Bayesian optimization using a neural network model. We develop a path-based encoding scheme to featurize the neural architectures that are used to train the neural network model. This strategy is particularly effective for encoding architectures in cell-based search spaces. After training on just 200 random neural architectures, we are able to predict the validation accuracy of a new architecture to within one percent of its true accuracy on average. This may be of independent interest beyond Bayesian neural architecture search. We test our algorithm on the NASBench dataset (Ying et al. 2019), and show that our algorithm significantly outperforms other NAS methods including evolutionary search, reinforcement learning, and AlphaX (Wang et al. 2019). Our algorithm is over 100x more efficient than random search, and 3.8x more efficient than the next-best algorithm. We also test our algorithm on the search space used in DARTS (Liu et al. 2018), and show that our algorithm is competitive with state-of-the-art NAS algorithms on this search space.
Tasks Neural Architecture Search
Published 2020-01-01
URL https://openreview.net/forum?id=B1lxV6NFPH
PDF https://openreview.net/pdf?id=B1lxV6NFPH
PWC https://paperswithcode.com/paper/bananas-bayesian-optimization-with-neural-1
Repo
Framework
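
Below is a minimal sketch of a path-based encoding for a cell DAG, in the spirit of the featurization described above: every possible input-to-output sequence of operations gets one binary feature, set to 1 if that path exists in the cell. The op set, maximum path length, and cell format are illustrative rather than the NASBench-101 specification.

```python
from itertools import product

OPS = ["conv3x3", "conv1x1", "maxpool"]
MAX_PATH_LEN = 3

# index every possible op sequence of length 1..MAX_PATH_LEN
ALL_PATHS = [p for L in range(1, MAX_PATH_LEN + 1) for p in product(OPS, repeat=L)]
PATH_INDEX = {p: i for i, p in enumerate(ALL_PATHS)}

def find_paths(adj, ops, node, prefix, paths):
    """Collect op sequences along every path from `node` to the output node."""
    if node == len(ops) - 1:                      # reached the output node
        paths.add(tuple(prefix))
        return
    for nxt in range(node + 1, len(ops)):
        if adj[node][nxt]:
            label = [] if nxt == len(ops) - 1 else [ops[nxt]]
            find_paths(adj, ops, nxt, prefix + label, paths)

def path_encode(adj, ops):
    paths = set()
    find_paths(adj, ops, 0, [], paths)
    vec = [0] * len(ALL_PATHS)
    for p in paths:
        if p in PATH_INDEX:
            vec[PATH_INDEX[p]] = 1
    return vec

# toy cell: input -> conv3x3 -> output, and input -> maxpool -> conv1x1 -> output
ops = ["input", "conv3x3", "maxpool", "conv1x1", "output"]
adj = [[0, 1, 1, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0]]
print(path_encode(adj, ops))   # binary feature vector fed to the surrogate network
```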

Noisy Collaboration in Knowledge Distillation

Title Noisy Collaboration in Knowledge Distillation
Authors Anonymous
Abstract Knowledge distillation is an effective model compression technique in which a smaller model is trained to mimic a larger pretrained model. However, in order to make these compact models suitable for real-world deployment, we need not only to reduce the performance gap but also to make them more robust to commonly occurring and adversarial perturbations. Noise permeates every level of the nervous system, from the perception of sensory signals to the generation of motor responses. We therefore believe that noise could be a crucial element in improving neural network training and addressing the apparently contradictory goals of improving both the generalization and the robustness of the model. Inspired by trial-to-trial variability in the brain, which can result from multiple noise sources, we introduce variability through noise at either the input level or the supervision signals. Our results show that noise can improve both the generalization and robustness of the model. “Fickle Teacher”, which uses dropout in the teacher model as a source of response variation, leads to significant generalization improvement. “Soft Randomization”, which matches the output distribution of the student model on a Gaussian-noised image to the output of the teacher on the original image, improves adversarial robustness manyfold compared to a student model trained with Gaussian noise. We further show the surprising effect of random label corruption on a model’s adversarial robustness. The study highlights the benefits of adding constructive noise in the knowledge distillation framework and hopes to inspire further work in the area.
Tasks Model Compression
Published 2020-01-01
URL https://openreview.net/forum?id=HkeJjeBFDB
PDF https://openreview.net/pdf?id=HkeJjeBFDB
PWC https://paperswithcode.com/paper/noisy-collaboration-in-knowledge-distillation
Repo
Framework
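
A minimal sketch of the “Soft Randomization” idea from the abstract above: the student sees a Gaussian-noised input while the teacher sees the clean input, and the student's output distribution is matched to the teacher's. The models, noise level, temperature, and loss weights are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def soft_randomization_loss(student, teacher, x, y, sigma=0.1, T=4.0, alpha=0.9):
    x_noisy = x + sigma * torch.randn_like(x)      # student input is perturbed
    with torch.no_grad():
        t_logits = teacher(x)                      # teacher sees the clean image
    s_logits = student(x_noisy)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T   # standard temperature-scaled KD term
    ce = F.cross_entropy(s_logits, y)              # hard-label term
    return alpha * kd + (1 - alpha) * ce

# usage: loss = soft_randomization_loss(student, teacher, images, labels)
```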

Superseding Model Scaling by Penalizing Dead Units and Points with Separation Constraints

Title Superseding Model Scaling by Penalizing Dead Units and Points with Separation Constraints
Authors Anonymous
Abstract In this article, we study a proposal that enables training extremely thin (4 or 8 neurons per layer) and relatively deep (more than 100 layers) feedforward networks without resorting to any architectural modifications such as residual or dense connections, data normalization, or model scaling. We accomplish this by alleviating two problems. The first is neurons whose output is zero for the entire dataset, which renders them useless; this problem is known to the academic community as dead neurons. The other is a less studied problem, dead points: data points that are mapped to zero during the forward pass of the network. As such, the gradient generated by those points is not propagated back past the layer where they die, and thus has no effect on the training process. In this work, we characterize both problems and propose a constraint formulation that, added to the standard loss function, solves them both. As an additional benefit, the proposed method allows initializing the network weights with constant or even zero values while still allowing the network to converge to reasonable results. We show very promising results on a toy dataset, MNIST, and CIFAR-10.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1eX_a4twH
PDF https://openreview.net/pdf?id=B1eX_a4twH
PWC https://paperswithcode.com/paper/superseding-model-scaling-by-penalizing-dead
Repo
Framework
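
The sketch below shows one way a penalty added to the loss could discourage dead units (never active on a batch) and dead points (samples mapped to all zeros); it is an illustrative hinge on pre-activations, not the separation-constraint formulation proposed in the paper.

```python
import torch

def liveness_penalty(pre_activations, margin=0.1):
    """pre_activations: (batch, units) tensor feeding a ReLU layer."""
    # a unit is "dead" if its pre-activation never exceeds 0 on the batch;
    # push its best-case pre-activation above a small margin
    unit_term = torch.relu(margin - pre_activations.max(dim=0).values).mean()
    # a point is "dead" if no unit is active for it; same idea per sample
    point_term = torch.relu(margin - pre_activations.max(dim=1).values).mean()
    return unit_term + point_term

# usage: total_loss = task_loss + lambda_pen * liveness_penalty(z)
# where z is the pre-activation tensor of a hidden layer and lambda_pen is a weight
```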

GQ-Net: Training Quantization-Friendly Deep Networks

Title GQ-Net: Training Quantization-Friendly Deep Networks
Authors Anonymous
Abstract Network quantization is a model compression and acceleration technique that has become essential to neural network deployment. Most quantization methods perform fine-tuning on a pretrained network, but this sometimes results in a large loss in accuracy compared to the original network. We introduce a new technique to train quantization-friendly networks, which can be directly converted to an accurate quantized network without the need for additional fine-tuning. Our technique allows quantizing the weights and activations of all network layers down to 4 bits, achieving high efficiency and facilitating deployment in practical settings. Compared to other fully quantized networks operating at 4 bits, we show substantial improvements in accuracy, for example 66.68% top-1 accuracy on ImageNet using ResNet-18, compared to the previous state-of-the-art accuracy of 61.52% (Louizos et al., 2019) and a full-precision reference accuracy of 69.76%. We performed a thorough set of experiments to test the efficacy of our method and also conducted ablation studies on different aspects of the method and techniques to improve training stability and accuracy. Our codebase and trained models are available on GitHub.
Tasks Model Compression, Quantization
Published 2020-01-01
URL https://openreview.net/forum?id=Hkx3ElHYwS
PDF https://openreview.net/pdf?id=Hkx3ElHYwS
PWC https://paperswithcode.com/paper/gq-net-training-quantization-friendly-deep
Repo
Framework
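
For context, the sketch below shows uniform fake quantization with a straight-through estimator, the generic building block behind training networks that remain accurate after conversion to low-bit inference; GQ-Net's specific training objective and 4-bit pipeline are not reproduced here.

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, bits=4):
        # uniform quantization to 2**bits levels over the tensor's value range
        qmax = 2 ** bits - 1
        lo = x.min()
        scale = (x.max() - lo).clamp(min=1e-8) / qmax
        q = torch.round((x - lo) / scale).clamp(0, qmax)
        return q * scale + lo                      # dequantized values, quantized grid

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                      # straight-through: pass gradients as-is

def quantize_weights(module, bits=4):
    """Return a fake-quantized view of a layer's weights for the forward pass."""
    return FakeQuantize.apply(module.weight, bits)

# usage (hypothetical layer): w_q = quantize_weights(some_conv_layer, bits=4)
```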

Self-Supervised GAN Compression

Title Self-Supervised GAN Compression
Authors Anonymous
Abstract Deep learning’s success has led to larger and larger models to handle more and more complex tasks; trained models can contain millions of parameters. These large models are compute- and memory-intensive, which makes it a challenge to deploy them under latency, throughput, and storage constraints. Some model compression methods have been successfully applied to image classification and detection or language models, but there has been very little work on compressing generative adversarial networks (GANs) performing complex tasks. In this paper, we show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods. We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator. We show that this framework performs compellingly at high degrees of sparsity, generalizes well to new tasks and models, and enables meaningful comparisons between different pruning granularities.
Tasks Image Classification, Model Compression
Published 2020-01-01
URL https://openreview.net/forum?id=Skl8EkSFDr
PDF https://openreview.net/pdf?id=Skl8EkSFDr
PWC https://paperswithcode.com/paper/self-supervised-gan-compression
Repo
Framework
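
A minimal sketch of the self-supervised compression idea described above: the pretrained discriminator, kept frozen, supervises a pruned generator, here combined with a distillation term toward the original generator's outputs. The model definitions, the particular adversarial loss form, and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def compressed_generator_loss(g_small, g_full, d_frozen, z, beta=1.0):
    """g_small: pruned generator being trained; g_full, d_frozen: pretrained and frozen."""
    fake_small = g_small(z)
    with torch.no_grad():
        fake_full = g_full(z)                            # teacher generator output
    # frozen discriminator scores the compressed generator's samples
    # (caller must have set requires_grad=False on d_frozen's parameters)
    adv = F.softplus(-d_frozen(fake_small)).mean()       # non-saturating generator loss
    distill = F.mse_loss(fake_small, fake_full)          # match the full generator
    return adv + beta * distill

# usage: loss = compressed_generator_loss(g_small, g_full, d_frozen, torch.randn(64, latent_dim))
```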

Blockwise Self-Attention for Long Document Understanding

Title Blockwise Self-Attention for Long Document Understanding
Authors Anonymous
Abstract We present BlockBERT, a lightweight and efficient BERT model designed to better model long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training time, which also enables attention heads to capture either short- or long-range contextual information. We conduct experiments on several benchmark question answering datasets with various paragraph lengths. Results show that BlockBERT uses 18.7-36.1% less memory and reduces training time by 12.0-25.1%, while having comparable and sometimes better prediction accuracy, compared to an advanced BERT-based model, RoBERTa.
Tasks Question Answering
Published 2020-01-01
URL https://openreview.net/forum?id=H1gpET4YDB
PDF https://openreview.net/pdf?id=H1gpET4YDB
PWC https://paperswithcode.com/paper/blockwise-self-attention-for-long-document-1
Repo
Framework
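
Below is a minimal sketch of a block-diagonal attention mask of the kind used to sparsify the attention matrix; BlockBERT additionally varies the block pattern across attention heads, which is not shown here, and the sequence and block sizes are illustrative.

```python
import torch

def block_diagonal_mask(seq_len, block_size):
    """Boolean (seq_len, seq_len) mask: True where attention is allowed."""
    blocks = torch.arange(seq_len) // block_size
    # position i may attend to position j only if they fall in the same block
    return blocks.unsqueeze(0) == blocks.unsqueeze(1)

mask = block_diagonal_mask(seq_len=512, block_size=128)
# usage with attention scores of shape (..., 512, 512):
# scores = scores.masked_fill(~mask, float("-inf"))
```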

Decoupling Weight Regularization from Batch Size for Model Compression

Title Decoupling Weight Regularization from Batch Size for Model Compression
Authors Anonymous
Abstract Conventionally, compression-aware training performs weight compression for every mini-batch to compute the impact of compression on the loss function. In this paper, in order to study when would be the right time to compress weights during optimization, we propose a new hyper-parameter called the Non-Regularization (NR) period, during which weights are not updated for regularization. We first investigate the influence of the NR period on regularization using weight decay and weight random noise insertion. Throughout various experiments, we show that stronger weight regularization demands a longer NR period (regardless of batch size) to best utilize regularization effects. From our empirical evidence, we argue that weight regularization at every mini-batch allows only small weight updates and limited regularization effects, such that there is a need to search for the right NR period and weight regularization strength to enhance model accuracy. Consequently, the NR period becomes especially crucial for model compression, where large weight updates are necessary to increase the compression ratio. Using various models, we show that simple weight updates to comply with compression formats, along with a long NR period, are enough to achieve a high compression ratio and model accuracy.
Tasks Model Compression
Published 2020-01-01
URL https://openreview.net/forum?id=BJlaG0VFDH
PDF https://openreview.net/pdf?id=BJlaG0VFDH
PWC https://paperswithcode.com/paper/decoupling-weight-regularization-from-batch
Repo
Framework
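
A minimal sketch of the Non-Regularization (NR) period idea, assuming decoupled weight decay as the regularizer: the plain gradient update runs every mini-batch, while the decay is applied only once per NR period. The optimizer setup, decay form, and period length are illustrative.

```python
import torch

def train_step(model, optimizer, loss_fn, batch, step, nr_period=16, wd=1e-4):
    x, y = batch
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # plain, unregularized update
    if step % nr_period == 0:                     # regularize only once per NR period
        with torch.no_grad():
            for p in model.parameters():
                p.mul_(1.0 - wd)                  # decoupled weight decay
    return loss.item()
```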

Test-Time Training for Out-of-Distribution Generalization

Title Test-Time Training for Out-of-Distribution Generalization
Authors Anonymous
Abstract We introduce a general approach, called test-time training, for improving the performance of predictive models when test and training data come from different distributions. Test-time training turns a single unlabeled test instance into a self-supervised learning problem, on which we update the model parameters before making a prediction on the test sample. We show that this simple idea leads to surprising improvements on diverse image classification benchmarks aimed at evaluating robustness to distribution shifts. Theoretical investigations on a convex model reveal helpful intuitions for when we can expect our approach to help.
Tasks Image Classification
Published 2020-01-01
URL https://openreview.net/forum?id=HyezmlBKwr
PDF https://openreview.net/pdf?id=HyezmlBKwr
PWC https://paperswithcode.com/paper/test-time-training-for-out-of-distribution
Repo
Framework
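
A minimal sketch of test-time training with rotation prediction as the self-supervised task (the task used in the paper's experiments): before predicting on a single test image, a few gradient steps update the shared feature extractor on that image's rotations. The model split, step count, and learning rate are assumptions, and the sketch assumes square images.

```python
import torch
import torch.nn.functional as F

def test_time_adapt_and_predict(encoder, cls_head, rot_head, x, steps=1, lr=1e-3):
    """x: a single test image of shape (C, H, W) with H == W."""
    opt = torch.optim.SGD(encoder.parameters(), lr=lr)
    # build the four rotated copies and their rotation labels
    rotations = torch.stack([torch.rot90(x, k, dims=(-2, -1)) for k in range(4)])
    labels = torch.arange(4)
    for _ in range(steps):
        loss = F.cross_entropy(rot_head(encoder(rotations)), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()                                    # update the shared feature extractor
    with torch.no_grad():
        return cls_head(encoder(x.unsqueeze(0)))      # prediction after adaptation
```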

Cancer homogeneity in single cell revealed by Bi-state model and Binary matrix factorization

Title Cancer homogeneity in single cell revealed by Bi-state model and Binary matrix factorization
Authors Anonymous
Abstract Single cell RNA sequencing (scRNAseq) technology enables quantifying gene expression profiles of individual cells within a cancer. Dimension reduction methods have been commonly used for cell clustering analysis and visualization of the data. However, current dimension reduction methods tend to overly eliminate the expression variations that correspond to less dominant characteristics, so that we fail to find the homogeneous properties of cancer development. In this paper, we propose a new clustering analysis method for scRNAseq data, namely BBSC, which binarizes the gene expression profile into on/off frequency changes and applies a Boolean matrix factorization. The low-rank representation of the expression matrix recovered by BBSC increases the resolution in identifying distinct cell types or functions. Application of BBSC to two cancer scRNAseq datasets successfully discovered both homogeneous and heterogeneous cancer cell clusters. Further findings showed potential in preventing cancer progression.
Tasks Dimensionality Reduction
Published 2020-01-01
URL https://openreview.net/forum?id=rygGnertwH
PDF https://openreview.net/pdf?id=rygGnertwH
PWC https://paperswithcode.com/paper/cancer-homogeneity-in-single-cell-revealed-by
Repo
Framework
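
As a rough illustration of the pipeline described above, the sketch below binarizes an expression matrix into on/off calls and approximates a Boolean low-rank factorization by thresholding an NMF; the thresholds, rank, and factorization algorithm are stand-ins, not the BBSC procedure.

```python
import numpy as np
from sklearn.decomposition import NMF

def boolean_factorize(expr, rank=5, on_threshold=1.0, bin_threshold=0.5):
    """expr: cells x genes expression matrix (non-negative)."""
    B = (expr > on_threshold).astype(float)        # on/off binarization
    nmf = NMF(n_components=rank, init="nndsvda", max_iter=500)
    W = nmf.fit_transform(B)
    H = nmf.components_
    # threshold the factors to obtain Boolean W, H and a Boolean reconstruction
    Wb, Hb = W > bin_threshold, H > bin_threshold
    recon = (Wb.astype(int) @ Hb.astype(int)) > 0  # Boolean product (OR of ANDs)
    return Wb, Hb, recon

# usage: Wb, Hb, recon = boolean_factorize(cells_by_genes_matrix)
```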

GResNet: Graph Residual Network for Reviving Deep GNNs from Suspended Animation

Title GResNet: Graph Residual Network for Reviving Deep GNNs from Suspended Animation
Authors Anonymous
Abstract Existing graph neural networks (GNNs) based on the spectral graph convolutional operator have been criticized for their performance degradation, which is especially common for models with deep architectures. In this paper, we further identify the suspended animation problem with existing GNNs: when the model depth reaches the suspended animation limit, the model no longer responds to the training data and becomes unlearnable. We analyze the causes of the suspended animation problem with existing GNNs and also report several peripheral factors that influence it. To resolve the problem, we introduce the GRESNET (Graph Residual Network) framework, which creates extensively connected highways to involve nodes’ raw features or intermediate representations throughout the graph for all the model layers. Different from other learning settings, the extensive connections in graph data render the existing simple residual learning methods ineffective. We prove the effectiveness of the introduced graph residual terms from the norm-preservation perspective, which helps avoid dramatic changes to the nodes’ representations between sequential layers. Detailed studies of the GRESNET framework for many existing GNNs, including GCN, GAT and LOOPYNET, are reported in the paper with extensive empirical experiments on real-world benchmark datasets.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rygHq6EFwr
PDF https://openreview.net/pdf?id=rygHq6EFwr
PWC https://paperswithcode.com/paper/gresnet-graph-residual-network-for-reviving
Repo
Framework
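
A minimal sketch of a graph residual connection in the spirit of the framework above: each layer adds a “highway” term carrying the nodes' raw input features to the usual graph-convolution output. The layer definition is illustrative and does not cover the residual variants analyzed in the paper.

```python
import torch
import torch.nn as nn

class RawResidualGCNLayer(nn.Module):
    def __init__(self, in_dim, raw_dim, out_dim):
        super().__init__()
        self.conv = nn.Linear(in_dim, out_dim)   # transform of aggregated neighbor features
        self.raw = nn.Linear(raw_dim, out_dim)   # highway carrying the raw node features

    def forward(self, a_norm, h, x_raw):
        # a_norm: normalized adjacency (N, N); h: current layer input; x_raw: original features
        return torch.relu(self.conv(a_norm @ h) + self.raw(x_raw))
```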