Paper Group NANR 85
Disentangling Factors of Variations Using Few Labels. Critical initialisation in continuous approximations of binary neural networks. Probabilistic Connection Importance Inference and Lossless Compression of Deep Neural Networks. Stochastic Gradient Descent with Biased but Consistent Gradient Estimators. The Sooner The Better: Investigating Structu …
Disentangling Factors of Variations Using Few Labels
Title | Disentangling Factors of Variations Using Few Labels |
Authors | Anonymous |
Abstract | Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. (2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow one to consistently learn disentangled representations. However, in many practical settings, one might have access to a limited amount of supervision, for example through manual labeling of (some) factors of variation in a few training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large-scale study, training over 52000 models under well-defined and reproducible experimental conditions. We observe that a small number of labeled examples (0.01–0.5% of the data set), with potentially imprecise and incomplete labels, is sufficient to perform model selection on state-of-the-art unsupervised models. Further, we investigate the benefit of incorporating supervision into the training process. Overall, we empirically validate that with little and imprecise supervision it is possible to reliably learn disentangled representations. |
Tasks | Model Selection, Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SygagpEKwB |
https://openreview.net/pdf?id=SygagpEKwB | |
PWC | https://paperswithcode.com/paper/disentangling-factors-of-variations-using-few |
Repo | |
Framework | |
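The recipe at the heart of this abstract is model selection: train many unsupervised models, then rank them with a disentanglement score computed from a handful of labelled factor values. Below is a minimal sketch of that workflow using a simple linear-predictability proxy score; the proxy metric, the toy encoders, and the names (`label_based_selection_score`, `model_a`, `model_b`) are illustrative assumptions, not the metrics or models used in the paper.

```python
import numpy as np

def label_based_selection_score(latents, factors):
    """Hypothetical proxy score: how well a linear map predicts each
    labelled factor of variation from the latent codes (higher is better)."""
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])  # add a bias column
    r2s = []
    for j in range(factors.shape[1]):
        y = factors[:, j].astype(float)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ss_res = np.sum((y - X @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2) + 1e-12
        r2s.append(1.0 - ss_res / ss_tot)
    return float(np.mean(r2s))

# Model selection: score every trained unsupervised model on the same small
# labelled subset and keep the best one.
rng = np.random.default_rng(0)
n_labelled, d, k = 100, 10, 3          # e.g. ~0.1% of a 100k-example dataset
factors = rng.integers(0, 5, size=(n_labelled, k))
encoders = {                            # stand-ins for trained encoders
    "model_a": lambda f: np.hstack([f, rng.normal(size=(len(f), d - k))]),
    "model_b": lambda f: rng.normal(size=(len(f), d)),
}
scores = {name: label_based_selection_score(enc(factors), factors)
          for name, enc in encoders.items()}
print("selected:", max(scores, key=scores.get), scores)
```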
Critical initialisation in continuous approximations of binary neural networks
Title | Critical initialisation in continuous approximations of binary neural networks |
Authors | Anonymous |
Abstract | The training of stochastic neural network models with binary ($\pm1$) weights and activations via continuous surrogate networks is investigated. We derive, using mean field theory, a set of scalar equations describing how input signals propagate through surrogate networks. The equations reveal that, depending on the choice of surrogate model, the networks may or may not exhibit an order-to-chaos transition, and the presence of depth scales that limit the maximum trainable depth. Specifically, in solving the equations for edge-of-chaos conditions, we show that surrogates derived using the Gaussian local reparameterisation trick have no critical initialisation, whereas deterministic surrogates based on analytic Gaussian integration do. The theory is applied to a range of binary neuron and weight design choices, such as different neuron noise models, allowing the categorisation of algorithms in terms of their behaviour at initialisation. Moreover, we predict theoretically, and confirm numerically, that common weight initialisation schemes used in standard continuous networks, when applied to the mean values of the stochastic binary weights, yield poor training performance. This study shows that, contrary to common intuition, the means of the stochastic binary weights should be initialised close to $\pm 1$ for deeper networks to be trainable. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rylmoxrFDH |
https://openreview.net/pdf?id=rylmoxrFDH | |
PWC | https://paperswithcode.com/paper/critical-initialisation-in-continuous |
Repo | |
Framework | |
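A quick numerical caricature of the abstract's final claim: if the means of the binary weights are initialised with a standard small-scale scheme, the forward signal dies within a few layers, whereas means pushed close to $\pm 1$ keep propagating at roughly constant norm. This uses a tanh relaxation of the sign activation and a generic $1/\sqrt{N}$ scaling as assumptions; it is not the paper's surrogate construction or its mean-field equations.

```python
import numpy as np

def signal_norm_vs_depth(weight_means, depth=30, width=512, beta=4.0, seed=0):
    """Propagate a random input through a deep surrogate network whose weights
    are the *means* of stochastic binary (+/-1) weights, recording the scaled
    pre-activation norm at each layer. tanh(beta*h) is a smooth stand-in for
    the sign activation."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    norms = []
    for _ in range(depth):
        W = weight_means(rng, width)                 # entries in [-1, 1]
        h = W @ np.tanh(beta * h) / np.sqrt(width)
        norms.append(np.linalg.norm(h) / np.sqrt(width))
    return norms

# A standard continuous-network init scale applied to the means: small values.
small_means = lambda rng, n: rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
# Means pushed close to +/-1, as the paper's analysis suggests for deep nets.
near_pm1 = lambda rng, n: np.sign(rng.normal(size=(n, n))) * 0.95

for name, init in [("small means", small_means), ("means near +/-1", near_pm1)]:
    norms = signal_norm_vs_depth(init)
    print(f"{name:16s}  layer 1: {norms[0]:.2e}   layer 30: {norms[-1]:.2e}")
```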
Probabilistic Connection Importance Inference and Lossless Compression of Deep Neural Networks
Title | Probabilistic Connection Importance Inference and Lossless Compression of Deep Neural Networks |
Authors | Anonymous |
Abstract | Deep neural networks (DNNs) can be huge in size, requiring a considerable amount of energy and computational resources to operate, which limits their applications in numerous scenarios. It is thus of interest to compress DNNs while maintaining their performance levels. We here propose a probabilistic importance inference approach for pruning DNNs. Specifically, we test the significance of a connection's relevance to the DNN's outputs using a nonparametric scoring test and keep only the significant connections. Experimental results show that the proposed approach achieves better lossless compression rates than existing techniques. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJgCF0VFwr |
https://openreview.net/pdf?id=HJgCF0VFwr | |
PWC | https://paperswithcode.com/paper/probabilistic-connection-importance-inference |
Repo | |
Framework | |
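The abstract specifies only that connection relevance is judged with a nonparametric significance test. The sketch below uses one concrete instantiation of that idea, a sign-flip permutation test on the per-example loss increase caused by removing a connection from a toy linear model; the paper's actual scoring statistic and test may differ.

```python
import numpy as np

def signflip_pvalue(d, n_perm=2000, rng=None):
    """One-sided sign-flip permutation test for H0: E[d] <= 0.
    d: per-example increase in loss when a connection is removed."""
    rng = rng or np.random.default_rng(0)
    obs = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)
    return (np.sum(null >= obs) + 1) / (n_perm + 1)

def prune_by_significance(W, X, y, alpha=0.05):
    """Keep only connections whose removal significantly increases the squared
    loss of a single linear unit f(x) = W @ x (a toy stand-in for a DNN layer)."""
    rng = np.random.default_rng(1)
    base_loss = (X @ W - y) ** 2                      # per-example loss
    keep = np.zeros_like(W, dtype=bool)
    for j in range(W.size):
        W_drop = W.copy()
        W_drop[j] = 0.0
        d = (X @ W_drop - y) ** 2 - base_loss         # loss increase per example
        keep[j] = signflip_pvalue(d, rng=rng) < alpha
    return keep

rng = np.random.default_rng(42)
n, p = 400, 20
X = rng.normal(size=(n, p))
W_true = np.zeros(p)
W_true[:5] = rng.normal(size=5) * 2.0
y = X @ W_true + 0.5 * rng.normal(size=n)
W_fit = np.linalg.lstsq(X, y, rcond=None)[0]          # "trained" weights
mask = prune_by_significance(W_fit, X, y)
print("kept connections:", np.where(mask)[0])          # mostly the first five
```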
Stochastic Gradient Descent with Biased but Consistent Gradient Estimators
Title | Stochastic Gradient Descent with Biased but Consistent Gradient Estimators |
Authors | Anonymous |
Abstract | Stochastic gradient descent (SGD), which dates back to the 1950s, is one of the most popular and effective approaches for performing stochastic optimization. Research on SGD resurged recently in machine learning for optimizing convex loss functions and training nonconvex deep neural networks. The theory assumes that one can easily compute an unbiased gradient estimator, which is usually the case due to the sample average nature of empirical risk minimization. There exist, however, many scenarios (e.g., graphs) where an unbiased estimator may be as expensive to compute as the full gradient because training examples are interconnected. Recently, Chen et al. (2018) proposed using a consistent gradient estimator as an economic alternative. Encouraged by empirical success, we show, in a general setting, that consistent estimators result in the same convergence behavior as do unbiased ones. Our analysis covers strongly convex, convex, and nonconvex objectives. We verify the results with illustrative experiments on synthetic and real-world data. This work opens several new research directions, including the development of more efficient SGD updates with consistent estimators and the design of efficient training algorithms for large-scale graphs. |
Tasks | Stochastic Optimization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rygMWT4twS |
https://openreview.net/pdf?id=rygMWT4twS | |
PWC | https://paperswithcode.com/paper/stochastic-gradient-descent-with-biased-but-1 |
Repo | |
Framework | |
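A minimal illustration of a biased-but-consistent gradient estimator plugged into SGD. The paper's motivating setting is graph training, where unbiased gradients are expensive because examples are interconnected; here the same phenomenon is produced by a toy objective with a nonlinearity outside an expectation, so the natural minibatch plug-in gradient is biased for finite batches but consistent as the batch grows.

```python
import numpy as np

# Toy objective F(w) = (E_x[w . x] - c)^2 with x ~ N(mu, I).  Because of the
# outer square, the natural minibatch plug-in gradient
#     g_B(w) = 2 * (mean_B(w . x) - c) * mean_B(x)
# multiplies two correlated sample means, so it is a *biased* estimator of
# grad F; it is nevertheless consistent, since the bias vanishes as the batch
# grows. SGD with this estimator still converges, with a bias that shrinks
# as the batch size increases.
rng = np.random.default_rng(0)
d, c = 5, 1.5
mu = rng.normal(size=d)                       # unknown E[x] in the real setting

def true_objective(w):
    return (w @ mu - c) ** 2

def plugin_gradient(w, batch):
    xb = mu + rng.normal(size=(batch, d))
    return 2.0 * (xb @ w - c).mean() * xb.mean(axis=0)

for batch in (2, 16, 128):
    w, lr = np.zeros(d), 0.05
    w_avg = np.zeros(d)
    for t in range(3000):
        w -= lr * plugin_gradient(w, batch)
        if t >= 2500:
            w_avg += w / 500                   # Polyak-style tail averaging
    print(f"batch={batch:4d}  F(w_avg)={true_objective(w_avg):.5f}  (optimum 0)")
```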
The Sooner The Better: Investigating Structure of Early Winning Lottery Tickets
Title | The Sooner The Better: Investigating Structure of Early Winning Lottery Tickets |
Authors | Anonymous |
Abstract | The recent success of the lottery ticket hypothesis by Frankle & Carbin (2018) suggests that small, sparsified neural networks can be trained as long as the network is initialized properly. Several follow-up discussions on the initialization of the sparsified model have discovered interesting characteristics, such as the necessity of rewinding (Frankle et al. (2019)), the importance of the sign of the initial weights (Zhou et al. (2019)), and the transferability of winning lottery tickets (Morcos et al. (2019)). In contrast, another essential aspect of the winning ticket, the structure of the sparsified model, has been little discussed. Unfortunately, to find the lottery ticket, all the prior work still relies on computationally expensive iterative pruning. In this work, we conduct an in-depth investigation of the structure of winning lottery tickets. Interestingly, we discover that there exist many lottery tickets that can achieve equally good accuracy well before the regular training schedule even finishes. We provide insights into the structure of these early winning tickets with supporting evidence: 1) under stochastic gradient descent optimization, a lottery ticket emerges when the weight magnitudes of a model saturate; 2) pruning before the saturation of a model causes a loss of capacity for learning complex patterns, resulting in accuracy degradation. We employ memorization capacity analysis to confirm this quantitatively, and further explain why gradual pruning can achieve better accuracy than one-shot pruning. Based on these insights, we discover early winning tickets for various ResNet architectures on both CIFAR10 and ImageNet, achieving state-of-the-art accuracy at a high pruning rate without expensive iterative pruning. In the case of ResNet50 on ImageNet, this yields a winning ticket with 75.02% Top-1 accuracy at an 80% pruning rate in only 22% of the total epochs required for iterative pruning. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlNs0VYPB |
https://openreview.net/pdf?id=BJlNs0VYPB | |
PWC | https://paperswithcode.com/paper/the-sooner-the-better-investigating-structure |
Repo | |
Framework | |
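A toy version of insight (1): watch the mean weight magnitude during training and apply a single one-shot magnitude pruning as soon as it saturates, then keep training under the fixed mask. The model (regularised logistic regression), the saturation test, and the thresholds are all placeholders; the paper works with ResNets and a more careful schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sparsity = 2000, 100, 0.8
X = rng.normal(size=(n, p))
w_true = np.where(rng.random(p) < 0.2, rng.normal(size=p) * 2.0, 0.0)
y = (X @ w_true + 0.3 * rng.normal(size=n) > 0).astype(float)

def grad(w, lam=5e-2):
    """Gradient of L2-regularised logistic regression."""
    z = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (z - y) / n + lam * w

w, mask, lr = np.zeros(p), np.ones(p), 0.5
mag_history, pruned_at = [], None
for epoch in range(400):
    w -= lr * grad(w)
    w *= mask
    mag_history.append(np.abs(w).mean())
    saturated = (epoch > 10 and
                 abs(mag_history[-1] - mag_history[-6]) / mag_history[-1] < 5e-3)
    if pruned_at is None and saturated:
        thresh = np.quantile(np.abs(w), sparsity)   # one-shot global magnitude prune
        mask = (np.abs(w) > thresh).astype(float)
        w *= mask
        pruned_at = epoch

acc = (((X @ w) > 0).astype(float) == y).mean()
print(f"pruned at epoch {pruned_at}, kept {int(mask.sum())}/{p} weights, train acc {acc:.3f}")
```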
Stable Rank Normalization for Improved Generalization in Neural Networks and GANs
Title | Stable Rank Normalization for Improved Generalization in Neural Networks and GANs |
Authors | Anonymous |
Abstract | Exciting new work on generalization bounds for neural networks (NNs) by Bartlett et al. (2017) and Neyshabur et al. (2018) closely depends on two parameter-dependent quantities: the Lipschitz constant upper bound and the stable rank (a softer version of rank). Even though these bounds typically have minimal practical utility, they raise the question of whether controlling such quantities together could improve the generalization behaviour of NNs in practice. To this end, we propose stable rank normalization (SRN), a novel, provably optimal, and computationally efficient weight-normalization scheme which minimizes the stable rank of a linear operator. Surprisingly, we find that SRN, despite being a non-convex problem, can be shown to have a unique optimal solution. We provide extensive analyses across a wide variety of NNs (DenseNet, WideResNet, ResNet, AlexNet, VGG), where applying SRN to their linear layers leads to improved classification accuracy, while simultaneously showing improvements in generalization, evaluated empirically using (a) shattering experiments (Zhang et al., 2016); and (b) three measures of sample complexity by Bartlett et al. (2017), Neyshabur et al. (2018), and Wei & Ma. Additionally, we show that, when applied to the discriminator of GANs, it improves Inception, FID, and neural divergence scores, while learning mappings with a low empirical Lipschitz constant. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1enKkrFDB |
https://openreview.net/pdf?id=H1enKkrFDB | |
PWC | https://paperswithcode.com/paper/stable-rank-normalization-for-improved-1 |
Repo | |
Framework | |
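Stable rank is $\|W\|_F^2 / \|W\|_2^2$. The sketch below shows one simple way to normalise a weight matrix to spectral norm 1 while capping its stable rank: keep the leading singular direction and uniformly shrink the rest of the spectrum. This mirrors the flavour of SRN but is not claimed to be the paper's provably optimal update.

```python
import numpy as np

def stable_rank(W):
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2       # ||W||_F^2 / ||W||_2^2

def srn(W, target):
    """Return a matrix with spectral norm 1 and stable rank <= target by
    keeping the leading singular direction and uniformly shrinking the rest
    of the spectrum (an illustrative scheme, not the published update)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = s / s[0]                             # spectral normalisation
    tail = (s[1:] ** 2).sum()
    if tail > target - 1.0:                  # current stable rank exceeds target
        s[1:] *= np.sqrt((target - 1.0) / tail)
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
W_srn = srn(W, target=10.0)
print(f"before: srank={stable_rank(W):6.1f}  spectral norm={np.linalg.svd(W, compute_uv=False)[0]:.2f}")
print(f"after : srank={stable_rank(W_srn):6.1f}  spectral norm={np.linalg.svd(W_srn, compute_uv=False)[0]:.2f}")
```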
Conditional Learning of Fair Representations
Title | Conditional Learning of Fair Representations |
Authors | Anonymous |
Abstract | We propose a novel algorithm for learning fair representations that can simultaneously mitigate two notions of disparity among different demographic subgroups. Two key components underpinning the design of our algorithm are balanced error rate and conditional alignment of representations. We show how these two components contribute to ensuring accuracy parity and equalized false-positive and false-negative rates across groups without impacting demographic parity. Furthermore, we demonstrate, both in theory and in two real-world experiments, that the proposed algorithm leads to a better utility-fairness trade-off on balanced datasets compared with existing algorithms for learning fair representations. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hkekl0NFPr |
https://openreview.net/pdf?id=Hkekl0NFPr | |
PWC | https://paperswithcode.com/paper/conditional-learning-of-fair-representations-1 |
Repo | |
Framework | |
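The two ingredients named in the abstract, balanced error rate and conditional alignment of representations, can be written as a single training objective. Below is a rough numpy sketch: a class-balanced cross-entropy surrogate for the balanced error rate plus a penalty that pulls the group-conditional representation means together within each label class. The mean-matching penalty is only a simple stand-in for the paper's alignment mechanism, and all shapes and names are assumptions.

```python
import numpy as np

def balanced_cross_entropy(probs, y):
    """Surrogate for the balanced error rate: average the per-class mean
    negative log-likelihood so each class counts equally."""
    eps = 1e-8
    nll = -(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))
    return np.mean([nll[y == c].mean() for c in (0, 1)])

def conditional_alignment_penalty(Z, y, a):
    """Within each label class, pull the group-conditional representation
    means together (a crude proxy for the paper's conditional alignment)."""
    pen = 0.0
    for c in (0, 1):
        mus = [Z[(y == c) & (a == g)].mean(axis=0) for g in (0, 1)]
        pen += np.sum((mus[0] - mus[1]) ** 2)
    return pen

def fair_objective(probs, Z, y, a, lam=1.0):
    return balanced_cross_entropy(probs, y) + lam * conditional_alignment_penalty(Z, y, a)

# Shapes: probs (n,), Z (n, d) representations, y (n,) labels, a (n,) group ids.
rng = np.random.default_rng(0)
n, d = 512, 16
y, a = rng.integers(0, 2, n), rng.integers(0, 2, n)
Z = rng.normal(size=(n, d)) + 0.5 * a[:, None]     # representations leak the group
probs = np.clip(0.5 + 0.1 * rng.normal(size=n), 0.01, 0.99)
print(f"objective: {fair_objective(probs, Z, y, a):.3f}")
```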
I love your chain mail! Making knights smile in a fantasy game world
Title | I love your chain mail! Making knights smile in a fantasy game world |
Authors | Anonymous |
Abstract | Dialogue research tends to distinguish between chit-chat and goal-oriented tasks. While the former is arguably more naturalistic and has a wider use of language, the latter has clearer metrics and a more straightforward learning signal. Humans effortlessly combine the two, and engage in chit-chat for example with the goal of exchanging information or eliciting a specific response. Here, we bridge the divide between these two domains in the setting of a rich multi-player text-based fantasy environment where agents and humans engage in both actions and dialogue. Specifically, we train a goal-oriented model with reinforcement learning via self-play against an imitation-learned chit-chat model with two new approaches: the policy either learns to pick a topic or learns to pick an utterance given the top-k utterances. We show that both models outperform a strong inverse model baseline and can converse naturally with their dialogue partner in order to achieve goals. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJxRrlBFwB |
https://openreview.net/pdf?id=BJxRrlBFwB | |
PWC | https://paperswithcode.com/paper/i-love-your-chain-mail-making-knights-smile |
Repo | |
Framework | |
Cross-Lingual Vision-Language Navigation
Title | Cross-Lingual Vision-Language Navigation |
Authors | Anonymous |
Abstract | Vision-Language Navigation (VLN) is the task where an agent is commanded to navigate in photo-realistic unknown environments with natural language instructions. Previous research on VLN is primarily conducted on the Room-to-Room (R2R) dataset with only English instructions. The ultimate goal of VLN, however, is to serve people speaking arbitrary languages. Towards multilingual VLN with numerous languages, we collect a cross-lingual R2R dataset, which extends the original benchmark with corresponding Chinese instructions. But it is time-consuming and expensive to collect large-scale human instructions for every existing language. Based on the newly introduced dataset, we propose a general cross-lingual VLN framework to enable instruction-following navigation for different languages. We first explore the possibility of building a cross-lingual agent when no training data of the target language is available. The cross-lingual agent is equipped with a meta-learner to aggregate cross-lingual representations and a visually grounded cross-lingual alignment module to align textual representations of different languages. Under the zero-shot learning scenario, our model shows competitive results even compared to a model trained with all target language instructions. In addition, we introduce an adversarial domain adaptation loss to improve the transfer ability of our model when given a certain amount of target language data. Our methods and dataset demonstrate the potential of building a cross-lingual agent to serve speakers of different languages. |
Tasks | Domain Adaptation, Vision-Language Navigation, Zero-Shot Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkeZO1BFDB |
https://openreview.net/pdf?id=rkeZO1BFDB | |
PWC | https://paperswithcode.com/paper/cross-lingual-vision-language-navigation |
Repo | |
Framework | |
Unifying Graph Convolutional Neural Networks and Label Propagation
Title | Unifying Graph Convolutional Neural Networks and Label Propagation |
Authors | Anonymous |
Abstract | Label Propagation (LPA) and Graph Convolutional Neural Networks (GCN) are both message passing algorithms on graphs. Both solve the task of node classification, but LPA propagates node label information across the edges of the graph, while GCN propagates and transforms node feature information. However, while the two are conceptually similar, the theoretical relationship between LPA and GCN has not yet been investigated. Here we study the relationship between LPA and GCN in terms of two aspects: (1) feature/label smoothing, where we analyze how the feature/label of one node is spread over its neighbors; and (2) feature/label influence, i.e., how much the initial feature/label of one node influences the final feature/label of another node. Based on our theoretical analysis, we propose an end-to-end model that unifies GCN and LPA for node classification. In our unified model, edge weights are learnable, and LPA serves as regularization to assist the GCN in learning proper edge weights that lead to improved classification performance. Our model can also be seen as learning attention weights based on node labels, which is more task-oriented than existing feature-based attention models. In a number of experiments on real-world graphs, our model shows superiority over state-of-the-art GCN-based methods in terms of node classification accuracy. |
Tasks | Node Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkgdYhVtvH |
https://openreview.net/pdf?id=rkgdYhVtvH | |
PWC | https://paperswithcode.com/paper/unifying-graph-convolutional-neural-networks |
Repo | |
Framework | |
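The unified objective described in the abstract can be sketched directly: a GCN branch propagates node features, an LPA branch propagates training labels through the same learnable edge weights, and the LPA loss acts as a regulariser on those weights. The code below computes such a combined loss forward-only on a random toy graph; the exact propagation rules and training details in the paper may differ, and in practice everything is differentiated end-to-end.

```python
import numpy as np

def normalize_adj(A_w):
    """Symmetrically normalise a weighted adjacency matrix (self-loops added)."""
    A_hat = A_w + np.eye(A_w.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def softmax(X):
    e = np.exp(X - X.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)

def gcn_lpa_loss(A_w, X, Y_onehot, train_mask, W1, W2, lam=1.0, lpa_steps=3):
    """Combined objective: GCN classification loss plus an LPA loss that
    regularises the learnable edge weights A_w."""
    A_hat = normalize_adj(A_w)
    # GCN branch: propagate and transform node *features*.
    H = np.maximum(A_hat @ X @ W1, 0.0)
    P_gcn = softmax(A_hat @ H @ W2)
    # LPA branch: propagate the training *labels* with the same edge weights.
    Y_prop = Y_onehot * train_mask[:, None]
    for _ in range(lpa_steps):
        Y_prop = A_hat @ Y_prop
    P_lpa = Y_prop / (Y_prop.sum(1, keepdims=True) + 1e-8)
    ce = lambda P: -np.mean(np.log(P[train_mask] + 1e-8)[Y_onehot[train_mask] == 1])
    return ce(P_gcn) + lam * ce(P_lpa)

# Tiny random graph just to exercise the function.
rng = np.random.default_rng(0)
n, f, k, h = 30, 8, 3, 16
A_w = rng.random((n, n)) * (rng.random((n, n)) < 0.1)
A_w = (A_w + A_w.T) / 2
X = rng.normal(size=(n, f))
Y_onehot = np.eye(k)[rng.integers(0, k, n)]
train_mask = rng.random(n) < 0.3
W1, W2 = rng.normal(size=(f, h)) * 0.1, rng.normal(size=(h, k)) * 0.1
print("loss:", gcn_lpa_loss(A_w, X, Y_onehot, train_mask, W1, W2))
```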
Guiding Program Synthesis by Learning to Generate Examples
Title | Guiding Program Synthesis by Learning to Generate Examples |
Authors | Anonymous |
Abstract | A key challenge of existing program synthesizers is ensuring that the synthesized program generalizes well. This can be difficult to achieve as the specification provided by the end user is often limited, containing as few as one or two input-output examples. In this paper we address this challenge via an iterative approach that finds ambiguities in the provided specification and learns to resolve these by generating additional input-output examples. The main insight is to reduce the problem of selecting which program generalizes well to the simpler task of deciding which output is correct. As a result, to train our probabilistic models, we can take advantage of the large amounts of data in the form of program outputs, which are often much easier to obtain than the corresponding ground-truth programs. |
Tasks | Program Synthesis |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJl07ySKvS |
https://openreview.net/pdf?id=BJl07ySKvS | |
PWC | https://paperswithcode.com/paper/guiding-program-synthesis-by-learning-to |
Repo | |
Framework | |
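The iterative loop from the abstract: keep the candidate programs consistent with the current examples, find an input where the survivors disagree, ask a model which output is correct, and add the resulting input-output pair as a new example. In the sketch below the candidate programs form a toy string-transformation DSL, and `output_scorer` is a hard-coded stand-in for the paper's learned probabilistic model over outputs.

```python
CANDIDATES = {
    "first_two_upper":  lambda s: s[:2].upper(),
    "upper_all":        lambda s: s.upper(),
    "upper_first_word": lambda s: s.split()[0].upper(),
}
user_examples = [("hi", "HI")]    # all candidates agree on the user's spec

def consistent(progs, examples):
    return {n: p for n, p in progs.items()
            if all(p(i) == o for i, o in examples)}

def output_scorer(inp, output):
    # Stand-in for a learned model of P(output is correct | inp).
    return 1.0 if output.isupper() and " " not in output and len(output) > 2 else 0.1

def disambiguate(progs, examples, probe_inputs, rounds=3):
    progs = consistent(progs, examples)
    for _ in range(rounds):
        if len(progs) <= 1:
            break
        # Find a probe input on which the surviving candidates disagree.
        probe = next((x for x in probe_inputs
                      if len({p(x) for p in progs.values()}) > 1), None)
        if probe is None:
            break
        outputs = {p(probe) for p in progs.values()}
        best = max(outputs, key=lambda o: output_scorer(probe, o))
        examples = examples + [(probe, best)]            # newly generated example
        progs = consistent(progs, examples)
    return progs, examples

progs, examples = disambiguate(CANDIDATES, user_examples,
                               probe_inputs=["hello world", "good morning"])
print("surviving programs:", list(progs))
print("examples used:", examples)
```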
Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data
Title | Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data |
Authors | Anonymous |
Abstract | This paper investigates the intriguing question of whether we can create learning algorithms that automatically generate training data, learning environments, and curricula in order to help AI agents rapidly learn. We show that such algorithms are possible via Generative Teaching Networks (GTNs), a general approach that is applicable to supervised, unsupervised, and reinforcement learning. GTNs are deep neural networks that generate data and/or training environments that a learner (e.g.\ a freshly initialized neural network) trains on before being tested on a target task. We then differentiate \emph{through the entire learning process} via meta-gradients to update the GTN parameters to improve performance on the target task. GTNs have the beneficial property that they can theoretically generate any type of data or training environment, making their potential impact large. This paper introduces GTNs, discusses their potential, and showcases that they can substantially accelerate learning. We also demonstrate a practical and exciting application of GTNs: accelerating the evaluation of candidate architectures for neural architecture search (NAS), which is rate-limited by such evaluations, enabling massive speed-ups in NAS. GTN-NAS improves the NAS state of the art, finding higher performing architectures when controlling for the search proposal mechanism. GTN-NAS also is competitive with the overall state of the art approaches, which achieve top performance while using orders of magnitude less computation than typical NAS methods. Overall, GTNs represent a first step toward the ambitious goal of algorithms that generate their own training data and, in doing so, open a variety of interesting new research questions and directions. |
Tasks | Neural Architecture Search |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJg_ECEKDr |
https://openreview.net/pdf?id=HJg_ECEKDr | |
PWC | https://paperswithcode.com/paper/generative-teaching-networks-accelerating |
Repo | |
Framework | |
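A heavily simplified picture of the GTN loop: synthetic training data is optimised so that a learner trained only on it performs well on the target task. Two deliberate simplifications versus the paper: the generator network is collapsed into a directly learned synthetic dataset, and the exact meta-gradient through the inner training is replaced by an evolution-strategies estimate. The closed-form ridge learner and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_real, n_syn = 8, 200, 16
w_target = rng.normal(size=d)
X_val = rng.normal(size=(n_real, d))
y_val = X_val @ w_target + 0.1 * rng.normal(size=n_real)     # the target task

def learner_fit(X_syn, y_syn, lam=1e-2):
    """Inner loop: a ridge-regression learner trained only on synthetic data."""
    return np.linalg.solve(X_syn.T @ X_syn + lam * np.eye(d), X_syn.T @ y_syn)

def meta_loss(theta):
    X_syn, y_syn = theta[:, :d], theta[:, d]
    w = learner_fit(X_syn, y_syn)
    return np.mean((X_val @ w - y_val) ** 2)                  # target-task error

theta = rng.normal(size=(n_syn, d + 1))          # learnable synthetic (x, y) pairs
sigma, lr, n_pert = 0.1, 0.05, 32
for step in range(301):
    eps = rng.normal(size=(n_pert, *theta.shape))
    losses = np.array([meta_loss(theta + sigma * e) for e in eps])
    grad_est = ((losses - losses.mean())[:, None, None] * eps).mean(0) / sigma
    theta -= lr * grad_est                        # evolution-strategies meta-step
    if step % 100 == 0:
        print(f"step {step:3d}  learner's target-task MSE: {meta_loss(theta):.4f}")
```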
Don’t Use Large Mini-batches, Use Local SGD
Title | Don’t Use Large Mini-batches, Use Local SGD |
Authors | Anonymous |
Abstract | Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose \emph{post-local} SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of \emph{local SGD} variants. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1eyO1BFPr |
https://openreview.net/pdf?id=B1eyO1BFPr | |
PWC | https://paperswithcode.com/paper/dont-use-large-mini-batches-use-local-sgd-1 |
Repo | |
Framework | |
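Post-local SGD as described in the abstract: run ordinary synchronous mini-batch SGD for an initial phase, then switch to local SGD, where each worker takes several local steps before the models are averaged. The single-process simulation below sketches that schedule on a least-squares problem; worker counts, switch point, and averaging period are arbitrary choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n_per_worker = 8, 20, 500
w_true = rng.normal(size=d)
shards = []
for _ in range(K):                                   # each worker's local data
    X = rng.normal(size=(n_per_worker, d))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_worker)
    shards.append((X, y))

def minibatch_grad(w, X, y, batch=32):
    idx = rng.integers(0, len(y), batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

def full_loss(w):
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in shards])

lr, t_switch, H, T = 0.02, 200, 8, 1000
w = np.zeros(d)
workers = [w.copy() for _ in range(K)]
for t in range(T):
    if t < t_switch:                                 # phase 1: synchronous SGD
        g = np.mean([minibatch_grad(w, X, y) for X, y in shards], axis=0)
        w -= lr * g
        workers = [w.copy() for _ in range(K)]
    else:                                            # phase 2: local SGD
        for k, (X, y) in enumerate(shards):
            workers[k] -= lr * minibatch_grad(workers[k], X, y)
        if (t - t_switch) % H == H - 1:              # periodic model averaging
            w = np.mean(workers, axis=0)
            workers = [w.copy() for _ in range(K)]
print(f"final loss {full_loss(np.mean(workers, axis=0)):.4f} (label-noise variance is 0.01)")
```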
Kernelized Wasserstein Natural Gradient
Title | Kernelized Wasserstein Natural Gradient |
Authors | Anonymous |
Abstract | Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging as it requires inverting a high-dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric, by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator for the gradient direction that can trade off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and empirically show the advantage of using such an estimator in classification tasks on \texttt{Cifar10} and \texttt{Cifar100}. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hklz71rYvS |
https://openreview.net/pdf?id=Hklz71rYvS | |
PWC | https://paperswithcode.com/paper/kernelized-wasserstein-natural-gradient |
Repo | |
Framework | |
Learning Compositional Koopman Operators for Model-Based Control
Title | Learning Compositional Koopman Operators for Model-Based Control |
Authors | Anonymous |
Abstract | Finding an embedding space for a linear approximation of a nonlinear dynamical system enables efficient system identification and control synthesis. The Koopman operator theory lays the foundation for identifying the nonlinear-to-linear coordinate transformations with data-driven methods. Recently, researchers have proposed to use deep neural networks as a more expressive class of basis functions for calculating the Koopman operators. These approaches, however, assume a fixed dimensional state space; they are therefore not applicable to scenarios with a variable number of objects. In this paper, we propose to learn compositional Koopman operators, using graph neural networks to encode the state into object-centric embeddings and using a block-wise linear transition matrix to regularize the shared structure across objects. The learned dynamics can quickly adapt to new environments of unknown physical parameters and produce control signals to achieve a specified goal. Our experiments on manipulating ropes and controlling soft robots show that the proposed method has better efficiency and generalization ability than existing baselines. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1ldzA4tPr |
https://openreview.net/pdf?id=H1ldzA4tPr | |
PWC | https://paperswithcode.com/paper/learning-compositional-koopman-operators-for-1 |
Repo | |
Framework | |
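The structural idea from the abstract, shared block-wise linear dynamics over object-centric states, can be shown with plain least squares: fit blocks $A$ and $B$ of $z_i' = A z_i + B\,\mathrm{mean}_j(z_j)$ on trajectories with one object count and roll them out with another. In the paper the object-centric embeddings come from graph neural networks and the Koopman operators are learned jointly; here the raw 2-D object states stand in for those embeddings purely for illustration.

```python
import numpy as np

def step(S, dt=0.05, k=2.0):
    """Ground-truth dynamics: each object is (position, velocity) and is pulled
    toward the centroid. S has shape (num_objects, 2)."""
    x, v = S[:, 0], S[:, 1]
    a = -k * (x - x.mean())
    return np.stack([x + dt * v, v + dt * a], axis=1)

def rollout(S0, T, f):
    traj = [S0]
    for _ in range(T):
        traj.append(f(traj[-1]))
    return np.stack(traj)

# Collect transitions from a 5-object system and fit shared blocks (A, B) of the
# block-wise linear model z_i' = A z_i + B mean_j(z_j).
rng = np.random.default_rng(0)
train = rollout(rng.normal(size=(5, 2)), 200, step)           # (201, 5, 2)
Z, Znext = train[:-1].reshape(-1, 2), train[1:].reshape(-1, 2)
Zmean = np.repeat(train[:-1].mean(axis=1), 5, axis=0)          # per-step centroid
Phi = np.hstack([Z, Zmean])                                    # (samples, 4)
AB = np.linalg.lstsq(Phi, Znext, rcond=None)[0].T              # (2, 4) -> [A | B]
A, B = AB[:, :2], AB[:, 2:]

# The same shared blocks apply to any number of objects.
S0 = rng.normal(size=(8, 2))
pred = rollout(S0, 50, lambda S: S @ A.T + S.mean(axis=0) @ B.T)
true = rollout(S0, 50, step)
print("max rollout error with 8 objects:", np.abs(pred - true).max())
```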