Paper Group NANR 76
Hidden incentives for self-induced distributional shift
Title | Hidden incentives for self-induced distributional shift |
Authors | Anonymous |
Abstract | Decisions made by machine learning systems have increasing influence on the world. Yet it is common for machine learning algorithms to assume that no such influence exists. An example is the use of the i.i.d. assumption in online learning for applications such as content recommendation, where the (choice of) content displayed can change users’ perceptions and preferences, or even drive them away, causing a shift in the distribution of users. Generally speaking, it is possible for an algorithm to change the distribution of its own inputs. We introduce the term self-induced distributional shift (SIDS) to describe this phenomenon. A large body of work in reinforcement learning and causal machine learning aims to deal with distributional shift caused by deploying learning systems previously trained offline. Our goal is similar, but distinct: we point out that changes to the learning algorithm, such as the introduction of meta-learning, can reveal hidden incentives for distributional shift (HIDS), and aim to diagnose and prevent problems associated with hidden incentives. We design a simple environment as a “unit test” for HIDS, as well as a content recommendation environment which allows us to disentangle different types of SIDS. We demonstrate the potential for HIDS to cause unexpected or undesirable behavior in these environments, and propose and test a mitigation strategy. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJeFNlHtPS |
https://openreview.net/pdf?id=SJeFNlHtPS | |
PWC | https://paperswithcode.com/paper/hidden-incentives-for-self-induced |
Repo | |
Framework | |
Measuring Numerical Common Sense: Is A Word Embedding Approach Effective?
Title | Measuring Numerical Common Sense: Is A Word Embedding Approach Effective? |
Authors | Anonymous |
Abstract | Numerical common sense (e.g., “a person with a height of 2m is very tall”) is essential when deploying artificial intelligence (AI) systems in society. To predict ranges of small and large values for a given target noun and unit, previous studies have implemented a rule-based method that processes numeric values appearing in natural language by using template matching. To obtain numerical knowledge, textual data crawled from web pages are frequently used as the input to this method. Although this is an important task, few studies have addressed the availability of numerical common sense extracted from corresponding textual information. To this end, we first used a crowdsourcing service to obtain sufficient data for a subjective agreement on numerical common sense. Second, to examine whether common sense is attributed to current word embeddings, we examined the performance of a regressor trained on the obtained data. In comparison with humans, the performance of an automatic relevance determination regression model was good, particularly when the unit was yen (a maximum correlation coefficient of 0.57). Although the regression approaches with word embeddings do not predict values with high correlation coefficients, this word-embedding method could potentially contribute to constructing numerical common sense for AI deployment. |
Tasks | Common Sense Reasoning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1xbTlBKwB |
https://openreview.net/pdf?id=B1xbTlBKwB | |
PWC | https://paperswithcode.com/paper/measuring-numerical-common-sense-is-a-word |
Repo | |
Framework | |
Variational Autoencoders with Normalizing Flow Decoders
Title | Variational Autoencoders with Normalizing Flow Decoders |
Authors | Anonymous |
Abstract | Recently proposed normalizing flow models such as Glow (Kingma & Dhariwal, 2018) have been shown to be able to generate high quality, high dimensional images with relatively fast sampling speed. Due to the inherently restrictive design of their architecture, however, these models must be excessively deep in order to train effectively. In this paper we propose to combine the Glow model with an underlying variational autoencoder in order to counteract this issue. We demonstrate that our proposed model is competitive with Glow in terms of image quality while requiring far less time for training. Additionally, our model achieves a state-of-the-art FID score on CIFAR-10 for a likelihood-based model. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1eh30NFwB |
https://openreview.net/pdf?id=r1eh30NFwB | |
PWC | https://paperswithcode.com/paper/variational-autoencoders-with-normalizing |
Repo | |
Framework | |
A Non-asymptotic comparison of SVRG and SGD: tradeoffs between compute and speed
Title | A Non-asymptotic comparison of SVRG and SGD: tradeoffs between compute and speed |
Authors | Anonymous |
Abstract | Stochastic gradient descent (SGD), which trades off noisy gradient updates for computational efficiency, is the de facto optimization algorithm for solving large-scale machine learning problems. SGD can make rapid learning progress by performing updates using subsampled training data, but the noisy updates also lead to slow asymptotic convergence. Several variance reduction algorithms, such as SVRG, introduce control variates to obtain a lower variance gradient estimate and faster convergence. Despite their appealing asymptotic guarantees, SVRG-like algorithms have not been widely adopted in deep learning. The traditional asymptotic analysis in stochastic optimization provides limited insight into training deep learning models under a fixed number of epochs. In this paper, we present a non-asymptotic analysis of SVRG under a noisy least squares regression problem. Our primary focus is to compare the exact loss of SVRG to that of SGD at each iteration t. We show that the learning dynamics of our regression model closely match those of neural networks on MNIST and CIFAR-10 for both the underparameterized and the overparameterized models. Our analysis and experimental results suggest there is a trade-off between the computational cost and the convergence speed in underparameterized neural networks. SVRG outperforms SGD after a few epochs in this regime. However, SGD is shown to always outperform SVRG in the overparameterized regime. |
Tasks | Stochastic Optimization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HyleclHKvS |
https://openreview.net/pdf?id=HyleclHKvS | |
PWC | https://paperswithcode.com/paper/a-non-asymptotic-comparison-of-svrg-and-sgd |
Repo | |
Framework | |
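The SVRG abstract above describes control variates that lower the variance of the stochastic gradient, studied on a noisy least squares problem. The following is a minimal sketch of that idea under stated assumptions, not the authors' code: the helper names (`grad_i`, `full_grad`, `svrg_epoch`, `make_data`), the synthetic data, and the step size are all hypothetical choices made for illustration.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def grad_i(w, x, y):
    """Gradient of the single-example loss 0.5 * (w.x - y)**2."""
    err = dot(w, x) - y
    return [err * xj for xj in x]


def full_grad(w, data):
    """Gradient of the mean loss over the whole dataset."""
    total = [0.0] * len(w)
    for x, y in data:
        total = [t + g for t, g in zip(total, grad_i(w, x, y))]
    return [t / len(data) for t in total]


def loss(w, data):
    """Mean squared-error loss over the dataset."""
    return sum(0.5 * (dot(w, x) - y) ** 2 for x, y in data) / len(data)


def sgd_epoch(w, data, lr, steps, rng):
    """Plain SGD: one noisy single-example gradient per step."""
    for _ in range(steps):
        x, y = rng.choice(data)
        w = [wj - lr * gj for wj, gj in zip(w, grad_i(w, x, y))]
    return w


def svrg_gradient(w, w_snap, mu, x, y):
    """Control-variate estimate: g_i(w) - g_i(w_snap) + full gradient at w_snap."""
    g = grad_i(w, x, y)
    g_snap = grad_i(w_snap, x, y)
    return [a - b + m for a, b, m in zip(g, g_snap, mu)]


def svrg_epoch(w, data, lr, steps, rng):
    """One SVRG outer iteration: snapshot, full gradient, then inner steps."""
    w_snap = list(w)
    mu = full_grad(w_snap, data)  # the extra full pass is SVRG's compute cost
    for _ in range(steps):
        x, y = rng.choice(data)
        v = svrg_gradient(w, w_snap, mu, x, y)
        w = [wj - lr * vj for wj, vj in zip(w, v)]
    return w


def make_data(n, dim, noise, rng):
    """Synthetic noisy least squares data (assumed true weights: all ones)."""
    w_true = [1.0] * dim
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        data.append((x, dot(w_true, x) + noise * rng.gauss(0.0, 1.0)))
    return data


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_data(50, 3, 0.1, rng)
    w_sgd = w_svrg = [0.0, 0.0, 0.0]
    for epoch in range(5):
        w_sgd = sgd_epoch(w_sgd, data, 0.05, 50, rng)
        w_svrg = svrg_epoch(w_svrg, data, 0.05, 50, rng)
        print(epoch, loss(w_sgd, data), loss(w_svrg, data))
```

Each SVRG outer iteration pays for one full pass over the data plus two per-example gradients per inner step, which is the compute side of the trade-off the abstract refers to.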
Differentiable Architecture Compression
Title | Differentiable Architecture Compression |
Authors | Shashank Singh, Ashish Khetan, Zohar Karnin |
Abstract | In many learning situations, resources at inference time are significantly more constrained than resources at training time. This paper studies a general paradigm, called Differentiable ARchitecture Compression (DARC), that combines model compression and architecture search to learn models that are resource-efficient at inference time. Given a resource-intensive base architecture, DARC utilizes the training data to learn which sub-components can be replaced by cheaper alternatives. The high-level technique can be applied to any neural architecture, and we report experiments on state-of-the-art convolutional neural networks for image classification. For a WideResNet with 97.2% accuracy on CIFAR-10, we improve single-sample inference speed by 2.28X and memory footprint by 5.64X, with no accuracy loss. For a ResNet with 79.15% Top-1 accuracy on ImageNet, we improve batch inference speed by 1.29X and memory footprint by 3.57X with 1% accuracy loss. We also give theoretical Rademacher complexity bounds in simplified cases, showing how DARC avoids over-fitting despite over-parameterization. |
Tasks | Image Classification, Model Compression |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJgkj0NFwr |
https://openreview.net/pdf?id=HJgkj0NFwr | |
PWC | https://paperswithcode.com/paper/differentiable-architecture-compression |
Repo | |
Framework | |
Black Box Recursive Translations for Molecular Optimization
Title | Black Box Recursive Translations for Molecular Optimization |
Authors | Anonymous |
Abstract | Machine learning algorithms for generating molecular structures offer a promising new approach to drug discovery. We cast molecular optimization as a translation problem, where the goal is to map an input compound to a target compound with improved biochemical properties. Remarkably, we observe that when generated molecules are iteratively fed back into the translator, molecular compound attributes improve with each step. We show that this finding is invariant to the choice of translation model, making this a “black box” algorithm. We call this method Black Box Recursive Translation (BBRT), a new inference method for molecular property optimization. This simple, powerful technique operates strictly on the inputs and outputs of any translation model. We obtain new state-of-the-art results for molecular property optimization tasks using our simple drop-in replacement with well-known sequence and graph-based models. Our method provides a significant boost in performance relative to its non-recursive peers with just a simple “for” loop. Further, BBRT is highly interpretable, allowing users to map the evolution of newly discovered compounds from known starting points. |
Tasks | Drug Discovery |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJxok1BYPr |
https://openreview.net/pdf?id=rJxok1BYPr | |
PWC | https://paperswithcode.com/paper/black-box-recursive-translations-for |
Repo | |
Framework | |
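The BBRT abstract above describes feeding generated molecules back into a translator inside a simple “for” loop. A minimal sketch of that loop follows. It treats the translator as an opaque callable. The selection rule (keep the highest-scoring candidate at each step) and the toy `toy_translate` and `toy_score` functions are assumptions for illustration only and are not taken from the paper.

```python
from typing import Callable, List


def recursive_translate(
    seed: str,
    translate: Callable[[str], List[str]],
    score: Callable[[str], float],
    steps: int,
) -> List[str]:
    """Repeatedly feed the chosen output back in as the next input.

    Only the translator's inputs and outputs are used, so any model
    can be plugged in. Returns the full trajectory, starting at the seed.
    """
    trajectory = [seed]
    current = seed
    for _ in range(steps):
        candidates = translate(current)
        if not candidates:
            break  # the translator produced nothing; stop early
        # Assumed selection rule: keep the best-scoring candidate.
        current = max(candidates, key=score)
        trajectory.append(current)
    return trajectory


def toy_translate(molecule: str) -> List[str]:
    """Hypothetical stand-in for a trained translation model."""
    return [molecule + "C", molecule + "O"]


def toy_score(molecule: str) -> float:
    """Hypothetical property score: number of carbon characters."""
    return float(molecule.count("C"))
```

Returning the whole trajectory rather than only the final string mirrors the interpretability point in the abstract: each discovered compound can be traced back to its starting point.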
The Generalization-Stability Tradeoff in Neural Network Pruning
Title | The Generalization-Stability Tradeoff in Neural Network Pruning |
Authors | Anonymous |
Abstract | Pruning neural network parameters is often viewed as a means to compress models, but pruning has also been motivated by the desire to prevent overfitting. This motivation is particularly relevant given the perhaps surprising observation that a wide variety of pruning approaches increase test accuracy despite sometimes massive reductions in parameter counts. To better understand this phenomenon, we analyze the behavior of pruning over the course of training, finding that pruning’s effect on generalization relies more on the instability it generates (defined as the drops in test accuracy immediately following pruning) than on the final size of the pruned model. We demonstrate that even the pruning of unimportant parameters can lead to such instability, and show similarities between pruning and regularizing by injecting noise, suggesting a mechanism for pruning-based generalization improvements that is compatible with the strong generalization recently observed in over-parameterized networks. |
Tasks | Network Pruning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1eCk1StPH |
https://openreview.net/pdf?id=B1eCk1StPH | |
PWC | https://paperswithcode.com/paper/the-generalization-stability-tradeoff-in-1 |
Repo | |
Framework | |
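The pruning abstract above defines instability as the drop in test accuracy immediately following pruning. The sketch below makes that definition concrete. Magnitude pruning is used here only as one familiar pruning rule and is an assumption, as are the function names; the abstract covers a wide variety of pruning approaches.

```python
from typing import List


def magnitude_prune(weights: List[float], fraction: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_prune = set(order[:k])
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]


def instability(acc_before: float, acc_after: float) -> float:
    """Drop in test accuracy measured immediately after a pruning event."""
    return acc_before - acc_after
```

In a training loop, one would evaluate test accuracy just before and just after each pruning event and record `instability` for that event, separately from the final size of the pruned model.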
Robust Reinforcement Learning with Wasserstein Constraint
Title | Robust Reinforcement Learning with Wasserstein Constraint |
Authors | Anonymous |
Abstract | Robust Reinforcement Learning aims to find the optimal policy with some degree of robustness to environmental dynamics. Existing learning algorithms usually achieve robustness by disturbing the current state or simulated environmental parameters in a heuristic way, which lacks quantified robustness to the system dynamics (i.e. transition probability). To overcome this issue, we leverage Wasserstein distance to measure the disturbance to the reference transition probability. With Wasserstein distance, we are able to connect transition probability disturbance to the state disturbance, and reduce an infinite-dimensional optimization problem to a finite-dimensional risk-aware problem. Through the derived risk-aware optimal Bellman equation, we first show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm: the Wasserstein Robust Advantage Actor-Critic algorithm (WRA2C). The effectiveness of the proposed algorithm is verified in the Cart-Pole environment. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HkeeITEYDr |
https://openreview.net/pdf?id=HkeeITEYDr | |
PWC | https://paperswithcode.com/paper/robust-reinforcement-learning-with |
Repo | |
Framework | |
MODiR: Multi-Objective Dimensionality Reduction for Joint Data Visualisation
Title | MODiR: Multi-Objective Dimensionality Reduction for Joint Data Visualisation |
Authors | Tim Repke, Ralf Krestel |
Abstract | Many large text collections exhibit graph structures, either inherent to the content itself or encoded in the metadata of the individual documents. Example graphs extracted from document collections are co-author networks, citation networks, or named-entity-cooccurrence networks. Furthermore, social networks can be extracted from email corpora, tweets, or social media. When it comes to visualising these large corpora, either the textual content or the network graph is used. In this paper, we propose to incorporate both text and graph, to visualise not only the semantic information encoded in the documents’ content but also the relationships expressed by the inherent network structure. To this end, we introduce a novel algorithm based on multi-objective optimisation to jointly position embedded documents and graph nodes in a two-dimensional landscape. We illustrate the effectiveness of our approach with real-world datasets and show that we can capture the semantics of large document collections better than other visualisations based on either the content or the network information. |
Tasks | Dimensionality Reduction |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJlMkTNYvH |
https://openreview.net/pdf?id=HJlMkTNYvH | |
PWC | https://paperswithcode.com/paper/modir-multi-objective-dimensionality |
Repo | |
Framework | |
BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
Title | BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning |
Authors | Anonymous |
Abstract | The field of Deep Reinforcement Learning (DRL) has recently seen a surge in research in batch reinforcement learning, which aims for sample-efficient learning from a given data set without additional interactions with the environment. In the batch DRL setting, commonly employed off-policy DRL algorithms can perform poorly and sometimes even fail to learn altogether. In this paper we propose a new algorithm, Best-Action Imitation Learning (BAIL), which unlike many off-policy DRL algorithms does not involve maximizing Q functions over the action space. Striving for simplicity as well as performance, BAIL first selects from the batch the actions it believes to be high-performing actions for their corresponding states; it then uses those state-action pairs to train a policy network using imitation learning. Although BAIL is simple, we demonstrate that BAIL achieves state-of-the-art performance on the MuJoCo benchmark, typically outperforming Batch-Constrained deep Q-Learning (BCQ) by a wide margin. |
Tasks | Imitation Learning, Q-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlnmgrFvS |
https://openreview.net/pdf?id=BJlnmgrFvS | |
PWC | https://paperswithcode.com/paper/bail-best-action-imitation-learning-for-batch |
Repo | |
Framework | |
Short and Sparse Deconvolution — A Geometric Approach
Title | Short and Sparse Deconvolution — A Geometric Approach |
Authors | Anonymous |
Abstract | Short-and-sparse deconvolution (SaSD) is the problem of extracting localized, recurring motifs in signals with spatial or temporal structure. Variants of this problem arise in applications such as image deblurring, microscopy, neural spike sorting, and more. The problem is challenging in both theory and practice, as natural optimization formulations are nonconvex. Moreover, practical deconvolution problems involve smooth motifs (kernels) whose spectra decay rapidly, resulting in poor conditioning and numerical challenges. This paper is motivated by recent theoretical advances (Zhang et al., 2017; Kuo et al., 2019), which characterize the optimization landscape of a particular nonconvex formulation of SaSD. This is used to derive a provable algorithm which exactly solves certain non-practical instances of the SaSD problem. We leverage the key ideas from this theory (sphere constraints, data-driven initialization) to develop a practical algorithm, which performs well on data arising from a range of application areas. We highlight key additional challenges posed by the ill-conditioning of real SaSD problems, and suggest heuristics (acceleration, continuation, reweighting) to mitigate them. Experiments demonstrate the performance and generality of the proposed method. |
Tasks | Deblurring |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Byg5ZANtvH |
https://openreview.net/pdf?id=Byg5ZANtvH | |
PWC | https://paperswithcode.com/paper/short-and-sparse-deconvolution-a-geometric-1 |
Repo | |
Framework | |
Rethinking deep active learning: Using unlabeled data at model training
Title | Rethinking deep active learning: Using unlabeled data at model training |
Authors | Anonymous |
Abstract | Active learning typically focuses on training a model on few labeled examples alone, while unlabeled ones are only used for acquisition. In this work we depart from this setting by using both labeled and unlabeled data during model training across active learning cycles. We do so by using unsupervised feature learning at the beginning of the active learning pipeline and semi-supervised learning at every active learning cycle, on all available data. The former has not been investigated before in active learning, while the study of the latter in the context of deep learning is scarce and recent findings are not conclusive with respect to its benefit. Our idea is orthogonal to acquisition strategies by using more data, much like ensemble methods use more models. By systematically evaluating on a number of popular acquisition strategies and datasets, we find that the use of unlabeled data during model training brings a spectacular accuracy improvement in image classification, compared to the differences between acquisition strategies. We thus explore smaller label budgets, even one label per class. |
Tasks | Active Learning, Image Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJehllrtDS |
https://openreview.net/pdf?id=rJehllrtDS | |
PWC | https://paperswithcode.com/paper/rethinking-deep-active-learning-using |
Repo | |
Framework | |
Amortized Nesterov’s Momentum: Robust and Lightweight Momentum for Deep Learning
Title | Amortized Nesterov’s Momentum: Robust and Lightweight Momentum for Deep Learning |
Authors | Anonymous |
Abstract | Stochastic Gradient Descent (SGD) with Nesterov’s momentum is a widely used optimizer in deep learning, which is observed to have excellent generalization performance. However, due to the large stochasticity, SGD with Nesterov’s momentum is not robust, i.e., its performance may deviate significantly from the expectation. In this work, we propose Amortized Nesterov’s Momentum, a special variant of Nesterov’s momentum which has more robust iterates, faster convergence in the early stage and higher efficiency. Our experimental results show that this new momentum achieves similar (sometimes better) generalization performance with little-to-no tuning. In the convex case, we provide optimal convergence rates for our new methods and discuss how the theorems explain the empirical results. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1xJFREKvB |
https://openreview.net/pdf?id=S1xJFREKvB | |
PWC | https://paperswithcode.com/paper/amortized-nesterovs-momentum-robust-and |
Repo | |
Framework | |
Realism Index: Interpolation in Generative Models With Arbitrary Prior
Title | Realism Index: Interpolation in Generative Models With Arbitrary Prior |
Authors | Anonymous |
Abstract | In order to perform plausible interpolations in the latent space of a generative model, we need a measure that credibly reflects if a point in an interpolation is close to the data manifold being modelled, i.e. if it is convincing. In this paper, we introduce a realism index of a point, which can be constructed from an arbitrary prior density, or based on an FID score approach in case a prior is not available. We propose a numerically efficient algorithm that directly maximises the realism index of an interpolation which, as we theoretically prove, leads to a search of a geodesic with respect to the corresponding Riemann structure. We show that we obtain better interpolations than the classical linear ones, in particular when either the prior density is not convex shaped, or when the soap bubble effect appears. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SklVqa4YwH |
https://openreview.net/pdf?id=SklVqa4YwH | |
PWC | https://paperswithcode.com/paper/realism-index-interpolation-in-generative |
Repo | |
Framework | |
Observational Overfitting in Reinforcement Learning
Title | Observational Overfitting in Reinforcement Learning |
Authors | Anonymous |
Abstract | A major component of overfitting in model-free reinforcement learning (RL) involves the case where the agent may mistakenly correlate reward with certain spurious features from the observations generated by the Markov Decision Process (MDP). We provide a general framework for analyzing this scenario, which we use to design multiple synthetic benchmarks by modifying only the observation space of an MDP. When an agent overfits to different observation spaces even if the underlying MDP dynamics are fixed, we term this observational overfitting. Our experiments expose intriguing properties, especially with regard to implicit regularization, and also corroborate results from previous works in RL generalization and supervised learning (SL). |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJli2hNKDH |
https://openreview.net/pdf?id=HJli2hNKDH | |
PWC | https://paperswithcode.com/paper/observational-overfitting-in-reinforcement |
Repo | |
Framework | |