Paper Group NANR 52
Smoothness and Stability in GANs. Atomic Compression Networks. Lipschitz constant estimation for Neural Networks via sparse polynomial optimization. Evo-NAS: Evolutionary-Neural Hybrid Agent for Architecture Search. Data-dependent Gaussian Prior Objective for Language Generation. Dataset Distillation. Towards Understanding the Spectral Bias of Deep …
Smoothness and Stability in GANs
Title | Smoothness and Stability in GANs |
Authors | Anonymous |
Abstract | Generative adversarial networks, or GANs, commonly display unstable behavior during training. In this work, we develop a principled theoretical framework for understanding the stability of various types of GANs. In particular, we derive conditions that guarantee eventual stationarity of the generator when it is trained with gradient descent, conditions that must be satisfied by the divergence that is minimized by the GAN and the generator’s architecture. We find that existing GAN variants satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously. In the process, we explain and clarify the need for various existing GAN stabilization techniques, including Lipschitz constraints, gradient penalties, and smooth activation functions. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJeOekHKwr |
https://openreview.net/pdf?id=HJeOekHKwr | |
PWC | https://paperswithcode.com/paper/smoothness-and-stability-in-gans |
Repo | |
Framework | |
Atomic Compression Networks
Title | Atomic Compression Networks |
Authors | Anonymous |
Abstract | Compressed forms of deep neural networks are essential in deploying large-scale computational models on resource-constrained devices. Contrary to analogous domains where large-scale systems are built as a hierarchical repetition of small-scale units, the current practice in Machine Learning largely relies on models with non-repetitive components. In the spirit of molecular composition with repeating atoms, we advance the state-of-the-art in model compression by proposing Atomic Compression Networks (ACNs), a novel architecture that is constructed by recursive repetition of a small set of neurons. In other words, the same neurons with the same weights are stochastically re-positioned in subsequent layers of the network. Empirical evidence suggests that ACNs achieve compression rates of up to three orders of magnitude compared to fine-tuned fully-connected neural networks (88× to 1116× reduction) with only a fractional deterioration of classification accuracy (0.15% to 5.33%). Moreover, our method can yield sub-linear model complexities and permits learning deep ACNs with fewer parameters than a logistic regression with no decline in classification accuracy. |
Tasks | Model Compression |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1xO4xHFvB |
https://openreview.net/pdf?id=S1xO4xHFvB | |
PWC | https://paperswithcode.com/paper/atomic-compression-networks |
Repo | |
Framework | |
Lipschitz constant estimation for Neural Networks via sparse polynomial optimization
Title | Lipschitz constant estimation for Neural Networks via sparse polynomial optimization |
Authors | Anonymous |
Abstract | We introduce LiPopt, a polynomial optimization framework for computing increasingly tighter upper bounds on the Lipschitz constant of neural networks. The underlying optimization problems boil down to either linear (LP) or semidefinite (SDP) programming. We show how to use structural properties of the network, such as sparsity, to significantly reduce the complexity of computation. This is especially useful for convolutional as well as pruned neural networks. We conduct experiments on networks with random weights as well as networks trained on MNIST, showing that in the particular case of the $\ell_\infty$-Lipschitz constant, our approach yields superior estimates as compared to other baselines available in the literature. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJe4_xSFDB |
https://openreview.net/pdf?id=rJe4_xSFDB | |
PWC | https://paperswithcode.com/paper/lipschitz-constant-estimation-for-neural |
Repo | |
Framework | |
Evo-NAS: Evolutionary-Neural Hybrid Agent for Architecture Search
Title | Evo-NAS: Evolutionary-Neural Hybrid Agent for Architecture Search |
Authors | Anonymous |
Abstract | Neural Architecture Search has shown potential to automate the design of neural networks. Deep Reinforcement Learning based agents can learn complex architectural patterns, as well as explore a vast and compositional search space. On the other hand, evolutionary algorithms offer higher sample efficiency, which is critical for such a resource intensive application. In order to capture the best of both worlds, we propose a class of Evolutionary-Neural hybrid agents (Evo-NAS). We show that the Evo-NAS agent outperforms both neural and evolutionary agents when applied to architecture search for a suite of text and image classification benchmarks. On a high-complexity architecture search space for image classification, the Evo-NAS agent surpasses the accuracy achieved by commonly used agents with only 1/3 of the search cost. |
Tasks | Image Classification, Neural Architecture Search |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkeYdyHYPS |
https://openreview.net/pdf?id=BkeYdyHYPS | |
PWC | https://paperswithcode.com/paper/evo-nas-evolutionary-neural-hybrid-agent-for |
Repo | |
Framework | |
Data-dependent Gaussian Prior Objective for Language Generation
Title | Data-dependent Gaussian Prior Objective for Language Generation |
Authors | Anonymous |
Abstract | For typical sequence prediction problems like language generation, maximum likelihood estimation (MLE) has been commonly adopted as it encourages the predicted sequence most consistent with the ground-truth sequence to have the highest probability of occurring. However, MLE focuses on a once-for-all matching between the predicted sequence and the gold standard, consequently treating all incorrect predictions as being equally incorrect. We call such a drawback negative diversity ignorance in this paper. Treating all incorrect predictions as equal unfairly downplays the nuance of these sequences’ detailed token-wise structure. To counteract this, we augment the MLE loss by introducing an extra KL divergence term which is derived from comparing a data-dependent Gaussian prior and the detailed training prediction. The proposed data-dependent Gaussian prior objective (D2GPo) is defined over a prior topological order of tokens, poles apart from the data-independent Gaussian prior (L2 regularization) commonly adopted for smoothing the training of MLE. Experimental results show that the proposed method can effectively make use of more detailed prior in the data and significantly improve the performance of typical language generation tasks, including supervised and unsupervised machine translation, text summarization, storytelling, and image captioning. |
Tasks | L2 Regularization, Machine Translation, Text Generation, Text Summarization, Unsupervised Machine Translation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1efxTVYDr |
https://openreview.net/pdf?id=S1efxTVYDr | |
PWC | https://paperswithcode.com/paper/data-dependent-gaussian-prior-objective-for |
Repo | |
Framework | |
Dataset Distillation
Title | Dataset Distillation |
Authors | Anonymous |
Abstract | Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called dataset distillation: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress 60,000 MNIST training images into just 10 synthetic distilled images (one per class) and achieve close to the original performance, given a fixed network initialization. We evaluate our method in various initialization settings. Experiments on multiple datasets, MNIST, CIFAR10, PASCAL-VOC, and CUB-200, demonstrate the advantage of our approach compared to alternative methods. Finally, we include a real-world application of dataset distillation to the continual learning setting: we show that storing distilled images as episodic memory of previous tasks can alleviate forgetting more effectively than real images. |
Tasks | Continual Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryxO3gBtPB |
https://openreview.net/pdf?id=ryxO3gBtPB | |
PWC | https://paperswithcode.com/paper/dataset-distillation-1 |
Repo | |
Framework | |
Towards Understanding the Spectral Bias of Deep Learning
Title | Towards Understanding the Spectral Bias of Deep Learning |
Authors | Anonymous |
Abstract | An intriguing phenomenon observed during the training of neural networks is the spectral bias, where neural networks are biased towards learning less complex functions. The priority of learning functions with low complexity might be at the core of explaining the generalization ability of neural networks, and certain efforts have been made to provide a theoretical explanation for spectral bias. However, there are still no satisfying theoretical results justifying the existence of spectral bias. In this work, we give a comprehensive and rigorous explanation for spectral bias and relate it to the neural tangent kernel function proposed in recent work. We prove that the training process of neural networks can be decomposed along different directions defined by the eigenfunctions of the neural tangent kernel, where each direction has its own convergence rate and the rate is determined by the corresponding eigenvalue. We then provide a case study when the input data is uniformly distributed over the unit sphere, and show that lower degree spherical harmonics are easier for over-parameterized neural networks to learn. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJg15lrKvS |
https://openreview.net/pdf?id=BJg15lrKvS | |
PWC | https://paperswithcode.com/paper/towards-understanding-the-spectral-bias-of |
Repo | |
Framework | |
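The eigen-direction decomposition described in the abstract of "Towards Understanding the Spectral Bias of Deep Learning" can be illustrated with a small numerical sketch. This is not the paper's construction: it assumes plain gradient descent on kernel regression with squared loss, and it uses an exponential dot-product kernel as a stand-in for the neural tangent kernel. The names `kernel_gd_residuals`, `proj`, and `predicted` are hypothetical. Under those assumptions, the residual coefficient along eigenvector i shrinks by a factor (1 - lr * eigenvalue_i) at every step, so directions with larger eigenvalues converge faster.

```python
import numpy as np

def kernel_gd_residuals(K, y, lr, steps):
    """Gradient descent on kernel regression; returns the residual y - f_t after each step."""
    r = y.copy()                       # f_0 = 0, so the initial residual is y
    history = [r.copy()]
    for _ in range(steps):
        r = r - lr * (K @ r)           # residual dynamics: r_{t+1} = (I - lr * K) r_t
        history.append(r.copy())
    return np.array(history)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
K = np.exp(X @ X.T)                    # stand-in positive semidefinite kernel (assumption, not the NTK)
y = rng.normal(size=20)

eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
lr = 0.5 / eigvals.max()               # step size small enough for every direction to contract
steps = 50

hist = kernel_gd_residuals(K, y, lr, steps)
proj = hist @ eigvecs                  # proj[t, i] = <r_t, v_i>

# Closed-form prediction: each direction decays geometrically at its own rate.
rates = 1.0 - lr * eigvals
predicted = proj[0][None, :] * rates[None, :] ** np.arange(steps + 1)[:, None]
```

Comparing `proj` with `predicted` shows that the simulated residuals follow the per-eigenvalue geometric decay.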
Understanding and Stabilizing GANs’ Training Dynamics with Control Theory
Title | Understanding and Stabilizing GANs’ Training Dynamics with Control Theory |
Authors | Anonymous |
Abstract | Generative adversarial networks (GANs) have made significant progress on realistic image generation but often suffer from instability during the training process. Most previous analyses mainly focus on the equilibrium that GANs achieve, whereas a gap exists between such theoretical analyses and practical implementations, where it is the training dynamics that play a vital role in the convergence and stability of GANs. In this paper, we directly model the dynamics of GANs and adopt control theory to understand and stabilize them. Specifically, we interpret the training process of various GANs as certain types of dynamics in a unified perspective of control theory which enables us to model the stability and convergence easily. Borrowed from control theory, we adopt the widely-used negative feedback control to stabilize the training dynamics, which can be considered as an $L2$ regularization on the output of the discriminator. We empirically verify our method on both synthetic data and natural image datasets. The results demonstrate that our method can stabilize the training dynamics as well as converge better than baselines. |
Tasks | Image Generation, L2 Regularization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJe7h34YDS |
https://openreview.net/pdf?id=BJe7h34YDS | |
PWC | https://paperswithcode.com/paper/understanding-and-stabilizing-gans-training-1 |
Repo | |
Framework | |
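The abstract of "Understanding and Stabilizing GANs’ Training Dynamics with Control Theory" describes negative feedback as an L2 regularization on the discriminator output. A minimal sketch of what such a penalty could look like follows. The abstract gives neither the loss nor the coefficient, so the standard logistic discriminator loss, the function name `discriminator_loss_with_feedback`, and the weight `lam` are all assumptions made for illustration.

```python
import numpy as np

def discriminator_loss_with_feedback(d_real, d_fake, lam=0.1):
    """Logistic GAN discriminator loss on raw logits plus an L2 penalty on the outputs.

    d_real, d_fake: arrays of discriminator logits on real and generated samples.
    lam: hypothetical feedback coefficient (not taken from the paper).
    """
    # -log(sigmoid(d_real)) = log(1 + exp(-d_real)); -log(1 - sigmoid(d_fake)) = log(1 + exp(d_fake))
    base = np.mean(np.logaddexp(0.0, -d_real)) + np.mean(np.logaddexp(0.0, d_fake))
    # Penalizing squared outputs pulls the discriminator back toward zero.
    penalty = lam * (np.mean(d_real ** 2) + np.mean(d_fake ** 2))
    return base + penalty
```

With `lam=0.0` this reduces to the unregularized loss, which makes it easy to compare runs with and without the feedback term.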
Intensity-Free Learning of Temporal Point Processes
Title | Intensity-Free Learning of Temporal Point Processes |
Authors | Anonymous |
Abstract | Temporal point processes are the dominant paradigm for modeling sequences of events happening at irregular intervals. The standard way of learning in such models is by estimating the conditional intensity function. However, parameterizing the intensity function usually incurs several trade-offs. We show how to overcome the limitations of intensity-based approaches by directly modeling the conditional distribution of inter-event times. We draw on the literature on normalizing flows to design models that are flexible and efficient. We additionally propose a simple mixture model that matches the flexibility of flow-based models, but also permits sampling and computing moments in closed form. The proposed models achieve state-of-the-art performance in standard prediction tasks and are suitable for novel applications, such as learning sequence embeddings and imputing missing data. |
Tasks | Point Processes |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HygOjhEYDH |
https://openreview.net/pdf?id=HygOjhEYDH | |
PWC | https://paperswithcode.com/paper/intensity-free-learning-of-temporal-point-1 |
Repo | |
Framework | |
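The abstract of "Intensity-Free Learning of Temporal Point Processes" mentions a simple mixture model over inter-event times that permits sampling and computing moments in closed form. The sketch below shows how that can work, assuming log-normal mixture components. The abstract does not name the component family, so that choice, the function names, and the parameter values are illustrative assumptions.

```python
import numpy as np

def mixture_log_density(tau, weights, mus, sigmas):
    """Log-density of a log-normal mixture evaluated at positive inter-event times tau."""
    tau = np.asarray(tau, dtype=float)[..., None]
    log_comp = (np.log(weights)
                - np.log(tau * sigmas * np.sqrt(2.0 * np.pi))
                - (np.log(tau) - mus) ** 2 / (2.0 * sigmas ** 2))
    m = log_comp.max(axis=-1, keepdims=True)           # log-sum-exp for numerical stability
    return (m + np.log(np.exp(log_comp - m).sum(axis=-1, keepdims=True)))[..., 0]

def mixture_mean(weights, mus, sigmas):
    """Closed-form expected inter-event time: sum_k w_k * exp(mu_k + sigma_k^2 / 2)."""
    return float(np.sum(weights * np.exp(mus + 0.5 * sigmas ** 2)))

def mixture_sample(weights, mus, sigmas, size, rng):
    """Draw inter-event times: pick a component, then exponentiate a Gaussian draw."""
    k = rng.choice(len(weights), size=size, p=weights)
    return np.exp(mus[k] + sigmas[k] * rng.normal(size=size))

# Hypothetical parameters for a two-component mixture.
weights = np.array([0.3, 0.7])
mus = np.array([0.0, 1.0])
sigmas = np.array([0.5, 0.3])

rng = np.random.default_rng(0)
samples = mixture_sample(weights, mus, sigmas, size=200_000, rng=rng)
closed_form = mixture_mean(weights, mus, sigmas)
```

The empirical mean of `samples` should be close to `closed_form`; no intensity function has to be integrated to get either quantity.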
Long History Short-Term Memory for Long-Term Video Prediction
Title | Long History Short-Term Memory for Long-Term Video Prediction |
Authors | Anonymous |
Abstract | While video prediction approaches have advanced considerably in recent years, learning to predict the long-term future is challenging: an ambiguous future or error propagation over time yields blurry predictions. To address this challenge, existing algorithms rely on extra supervision (e.g., action or object pose), motion flow learning, or adversarial training. In this paper, we propose a new recurrent unit, Long History Short-Term Memory (LH-STM). LH-STM incorporates long history states into a recurrent unit to learn longer range dependencies. To capture spatio-temporal dynamics in videos, we combined LH-STM with the Context-aware Video Prediction model (ContextVP). Our experiments on the KTH human actions and BAIR robot pushing datasets demonstrate that our approach produces not only sharper near-future predictions, but also predictions that extend farther into the future compared to the state-of-the-art methods. |
Tasks | Video Prediction |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HklmoRVYvr |
https://openreview.net/pdf?id=HklmoRVYvr | |
PWC | https://paperswithcode.com/paper/long-history-short-term-memory-for-long-term |
Repo | |
Framework | |
GDP: Generalized Device Placement for Dataflow Graphs
Title | GDP: Generalized Device Placement for Dataflow Graphs |
Authors | Anonymous |
Abstract | Runtime and scalability of large neural networks can be significantly affected by the placement of operations in their dataflow graphs on suitable devices. With increasingly complex neural network architectures and heterogeneous device characteristics, finding a reasonable placement is extremely challenging even for domain experts. Most existing automated device placement approaches are impractical due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an efficient end-to-end method based on a scalable sequential attention mechanism over a graph neural network that is transferable to new graphs. On a diverse set of representative deep learning models, including Inception-v3, AmoebaNet, Transformer-XL, and WaveNet, our method on average achieves 16% improvement over human experts and 9.2% improvement over the prior art with 15 times faster convergence. To further reduce the computation cost, we pre-train the policy network on a set of dataflow graphs and use a superposition network to fine-tune it on each individual graph, achieving state-of-the-art performance on large hold-out graphs with over 50k nodes, such as an 8-layer GNMT. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkxW23NtPH |
https://openreview.net/pdf?id=SkxW23NtPH | |
PWC | https://paperswithcode.com/paper/gdp-generalized-device-placement-for-dataflow-1 |
Repo | |
Framework | |
Bayesian Variational Autoencoders for Unsupervised Out-of-Distribution Detection
Title | Bayesian Variational Autoencoders for Unsupervised Out-of-Distribution Detection |
Authors | Anonymous |
Abstract | Despite their successes, deep neural networks still make unreliable predictions when faced with test data drawn from a distribution different to that of the training data, constituting a major problem for AI safety. While this motivated a recent surge in interest in developing methods to detect such out-of-distribution (OoD) inputs, a robust solution is still lacking. We propose a new probabilistic, unsupervised approach to this problem based on a Bayesian variational autoencoder model, which estimates a full posterior distribution over the decoder parameters using stochastic gradient Markov chain Monte Carlo, instead of fitting a point estimate. We describe how information-theoretic measures based on this posterior can then be used to detect OoD data both in input space as well as in the model’s latent space. The effectiveness of our approach is empirically demonstrated. |
Tasks | Out-of-Distribution Detection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByggpyrFPS |
https://openreview.net/pdf?id=ByggpyrFPS | |
PWC | https://paperswithcode.com/paper/bayesian-variational-autoencoders-for |
Repo | |
Framework | |
Boosting Network: Learn by Growing Filters and Layers via SplitLBI
Title | Boosting Network: Learn by Growing Filters and Layers via SplitLBI |
Authors | Anonymous |
Abstract | Network structures are important to learning good representations for many tasks in the computer vision and machine learning communities. In previous works, these structures are either manually designed or searched by Neural Architecture Search (NAS), which requires either expert-level effort or prohibitive computational cost. In practice, it is desirable to efficiently and simultaneously learn both the structures and parameters of a network from arbitrary classes with budgeted computational cost. We identify it as a new learning paradigm, Boosting Network, where one starts from simple models and progressively delves into complex trained models. In this paper, by virtue of an iterative sparse regularization path, Split Linearized Bregman Iteration (SplitLBI), we propose a simple yet effective boosting network method that can simultaneously grow and train a network by progressively adding both convolutional filters and layers. Extensive experiments with VGG and ResNets validate the effectiveness of our proposed algorithms. |
Tasks | Neural Architecture Search |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SylwBpNKDr |
https://openreview.net/pdf?id=SylwBpNKDr | |
PWC | https://paperswithcode.com/paper/boosting-network-learn-by-growing-filters-and |
Repo | |
Framework | |
Decaying momentum helps neural network training
Title | Decaying momentum helps neural network training |
Authors | Anonymous |
Abstract | Momentum is a simple and popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam leads to significantly improved training, notably competitive with momentum SGD with learning rate decay, even in settings in which adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD rivals momentum SGD with learning rate decay, and in many cases leads to improved performance. Demon is trivial to implement and incurs limited extra computational overhead, compared to the vanilla counterparts. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJeA_aVtPB |
https://openreview.net/pdf?id=rJeA_aVtPB | |
PWC | https://paperswithcode.com/paper/decaying-momentum-helps-neural-network-1 |
Repo | |
Framework | |
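The abstract of "Decaying momentum helps neural network training" motivates the Demon rule by decaying the total contribution of a gradient to all future updates, but it does not state the schedule. The sketch below is one schedule consistent with that motivation. It assumes that with constant momentum beta a gradient's total future weight is the geometric sum beta / (1 - beta), and that this quantity is decayed linearly to zero over training. Solving for the per-step momentum gives the expression in `demon_beta`. The function names and the toy quadratic are hypothetical.

```python
def demon_beta(step, total_steps, beta_init=0.9):
    """Momentum at `step`, chosen so that beta / (1 - beta) decays linearly to zero."""
    frac = 1.0 - step / total_steps                      # remaining fraction of training
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)

def sgd_momentum_demon(grad_fn, x0, lr, total_steps, beta_init=0.9):
    """Heavy-ball SGD on a scalar parameter with the decaying momentum schedule above."""
    x, v = float(x0), 0.0
    for t in range(total_steps):
        beta = demon_beta(t, total_steps, beta_init)
        v = beta * v + grad_fn(x)                        # momentum buffer with decaying beta
        x = x - lr * v
    return x

# Toy usage: minimize (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = sgd_momentum_demon(lambda x: 2.0 * (x - 3.0), x0=0.0, lr=0.05, total_steps=200)
```

The schedule starts at `beta_init` and reaches zero at the final step, so the optimizer behaves like momentum SGD early on and like plain gradient descent at the end.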
Music Source Separation in the Waveform Domain
Title | Music Source Separation in the Waveform Domain |
Authors | Anonymous |
Abstract | Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. While end-to-end models that directly generate the waveform are state-of-the-art in many audio synthesis problems, the best multi-instrument source separation models generate masks on the magnitude spectrum and achieve performances far above current end-to-end, waveform-to-waveform models. We present an in-depth analysis of a new architecture, which we will refer to as Demucs, based on a (transposed) convolutional autoencoder, with a bidirectional LSTM at the bottleneck layer and skip-connections as in U-Networks (Ronneberger et al., 2015). Compared to the state-of-the-art waveform-to-waveform model, Wave-U-Net (Stoller et al., 2018), the main features of our approach in addition to the bi-LSTM are the use of transposed convolution layers instead of upsampling-convolution blocks, the use of gated linear units, exponentially growing the number of channels with depth and a new careful initialization of the weights. Results on the MusDB dataset show that our architecture achieves a signal-to-distortion ratio (SDR) nearly 2.2 points higher than the best waveform-to-waveform competitor (from 3.2 to 5.4 SDR). This makes our model match the state-of-the-art performances on this dataset, bridging the performance gap between models that operate on the spectrogram and end-to-end approaches. |
Tasks | Music Source Separation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJx7uJStPH |
https://openreview.net/pdf?id=HJx7uJStPH | |
PWC | https://paperswithcode.com/paper/music-source-separation-in-the-waveform |
Repo | |
Framework | |