Paper Group NANR 85
Disentangling Factors of Variations Using Few Labels. Critical initialisation in continuous approximations of binary neural networks. Probabilistic Connection Importance Inference and Lossless Compression of Deep Neural Networks. Stochastic Gradient Descent with Biased but Consistent Gradient Estimators. The Sooner The Better: Investigating Structu …
Disentangling Factors of Variations Using Few Labels
Title | Disentangling Factors of Variations Using Few Labels |
Authors | Anonymous |
Abstract | Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. (2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow one to consistently learn disentangled representations. However, in many practical settings, one might have access to a limited amount of supervision, for example through manual labeling of (some) factors of variation in a few training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large-scale study, training over 52000 models under well-defined and reproducible experimental conditions. We observe that a small number of labeled examples (0.01–0.5% of the data set), with potentially imprecise and incomplete labels, is sufficient to perform model selection on state-of-the-art unsupervised models. Further, we investigate the benefit of incorporating supervision into the training process. Overall, we empirically validate that with little and imprecise supervision it is possible to reliably learn disentangled representations. |
Tasks | Model Selection, Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SygagpEKwB |
https://openreview.net/pdf?id=SygagpEKwB | |
PWC | https://paperswithcode.com/paper/disentangling-factors-of-variations-using-few |
Repo | |
Framework | |
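The recipe at the heart of this abstract is model selection: train many unsupervised models, then rank them with a disentanglement score computed from a handful of labelled factor values. Below is a minimal sketch of that workflow using a simple linear-predictability proxy score; the proxy metric, the toy encoders, and the names (`label_based_selection_score`, `model_a`, `model_b`) are illustrative assumptions, not the metrics or models used in the paper.

```python
import numpy as np

def label_based_selection_score(latents, factors):
    """Hypothetical proxy score: how well a linear map predicts each
    labelled factor of variation from the latent codes (higher is better)."""
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])  # add a bias column
    r2s = []
    for j in range(factors.shape[1]):
        y = factors[:, j].astype(float)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ss_res = np.sum((y - X @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2) + 1e-12
        r2s.append(1.0 - ss_res / ss_tot)
    return float(np.mean(r2s))

# Model selection: score every trained unsupervised model on the same small
# labelled subset and keep the best one.
rng = np.random.default_rng(0)
n_labelled, d, k = 100, 10, 3          # e.g. ~0.1% of a 100k-example dataset
factors = rng.integers(0, 5, size=(n_labelled, k))
encoders = {                            # stand-ins for trained encoders
    "model_a": lambda f: np.hstack([f, rng.normal(size=(len(f), d - k))]),
    "model_b": lambda f: rng.normal(size=(len(f), d)),
}
scores = {name: label_based_selection_score(enc(factors), factors)
          for name, enc in encoders.items()}
print("selected:", max(scores, key=scores.get), scores)
```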
Critical initialisation in continuous approximations of binary neural networks
Title | Critical initialisation in continuous approximations of binary neural networks |
Authors | Anonymous |
Abstract | The training of stochastic neural network models with binary ($\pm1$) weights and activations via continuous surrogate networks is investigated. We derive, using mean field theory, a set of scalar equations describing how input signals propagate through surrogate networks. The equations reveal that, depending on the choice of surrogate model, the networks may or may not exhibit an order-to-chaos transition, and the presence of depth scales that limit the maximum trainable depth. Specifically, in solving the equations for edge-of-chaos conditions, we show that surrogates derived using the Gaussian local reparameterisation trick have no critical initialisation, whereas deterministic surrogates based on analytic Gaussian integration do. The theory is applied to a range of binary neuron and weight design choices, such as different neuron noise models, allowing the categorisation of algorithms in terms of their behaviour at initialisation. Moreover, we predict theoretically, and confirm numerically, that common weight initialisation schemes used in standard continuous networks, when applied to the mean values of the stochastic binary weights, yield poor training performance. This study shows that, contrary to common intuition, the means of the stochastic binary weights should be initialised close to $\pm 1$ for deeper networks to be trainable. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rylmoxrFDH |
https://openreview.net/pdf?id=rylmoxrFDH | |
PWC | https://paperswithcode.com/paper/critical-initialisation-in-continuous |
Repo | |
Framework | |
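A quick numerical caricature of the abstract's final claim: if the means of the binary weights are initialised with a standard small-scale scheme, the forward signal dies within a few layers, whereas means pushed close to $\pm 1$ keep propagating at roughly constant norm. This uses a tanh relaxation of the sign activation and a generic $1/\sqrt{N}$ scaling as assumptions; it is not the paper's surrogate construction or its mean-field equations.

```python
import numpy as np

def signal_norm_vs_depth(weight_means, depth=30, width=512, beta=4.0, seed=0):
    """Propagate a random input through a deep surrogate network whose weights
    are the *means* of stochastic binary (+/-1) weights, recording the scaled
    pre-activation norm at each layer. tanh(beta*h) is a smooth stand-in for
    the sign activation."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    norms = []
    for _ in range(depth):
        W = weight_means(rng, width)                 # entries in [-1, 1]
        h = W @ np.tanh(beta * h) / np.sqrt(width)
        norms.append(np.linalg.norm(h) / np.sqrt(width))
    return norms

# A standard continuous-network init scale applied to the means: small values.
small_means = lambda rng, n: rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
# Means pushed close to +/-1, as the paper's analysis suggests for deep nets.
near_pm1 = lambda rng, n: np.sign(rng.normal(size=(n, n))) * 0.95

for name, init in [("small means", small_means), ("means near +/-1", near_pm1)]:
    norms = signal_norm_vs_depth(init)
    print(f"{name:16s}  layer 1: {norms[0]:.2e}   layer 30: {norms[-1]:.2e}")
```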
Probabilistic Connection Importance Inference and Lossless Compression of Deep Neural Networks
Title | Probabilistic Connection Importance Inference and Lossless Compression of Deep Neural Networks |
Authors | Anonymous |
Abstract | Deep neural networks (DNNs) can be huge in size, requiring a considerable amount of energy and computational resources to operate, which limits their applications in numerous scenarios. It is thus of interest to compress DNNs while maintaining their performance levels. We here propose a probabilistic importance inference approach for pruning DNNs. Specifically, we test the significance of a connection's relevance to the DNN's outputs using a nonparametric scoring test and keep only the significant connections. Experimental results show that the proposed approach achieves better lossless compression rates than existing techniques. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJgCF0VFwr |
https://openreview.net/pdf?id=HJgCF0VFwr | |
PWC | https://paperswithcode.com/paper/probabilistic-connection-importance-inference |
Repo | |
Framework | |
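The abstract specifies only that connection relevance is judged with a nonparametric significance test. The sketch below uses one concrete instantiation of that idea, a sign-flip permutation test on the per-example loss increase caused by removing a connection from a toy linear model; the paper's actual scoring statistic and test may differ.

```python
import numpy as np

def signflip_pvalue(d, n_perm=2000, rng=None):
    """One-sided sign-flip permutation test for H0: E[d] <= 0.
    d: per-example increase in loss when a connection is removed."""
    rng = rng or np.random.default_rng(0)
    obs = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)
    return (np.sum(null >= obs) + 1) / (n_perm + 1)

def prune_by_significance(W, X, y, alpha=0.05):
    """Keep only connections whose removal significantly increases the squared
    loss of a single linear unit f(x) = W @ x (a toy stand-in for a DNN layer)."""
    rng = np.random.default_rng(1)
    base_loss = (X @ W - y) ** 2                      # per-example loss
    keep = np.zeros_like(W, dtype=bool)
    for j in range(W.size):
        W_drop = W.copy()
        W_drop[j] = 0.0
        d = (X @ W_drop - y) ** 2 - base_loss         # loss increase per example
        keep[j] = signflip_pvalue(d, rng=rng) < alpha
    return keep

rng = np.random.default_rng(42)
n, p = 400, 20
X = rng.normal(size=(n, p))
W_true = np.zeros(p)
W_true[:5] = rng.normal(size=5) * 2.0
y = X @ W_true + 0.5 * rng.normal(size=n)
W_fit = np.linalg.lstsq(X, y, rcond=None)[0]          # "trained" weights
mask = prune_by_significance(W_fit, X, y)
print("kept connections:", np.where(mask)[0])          # mostly the first five
```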
Stochastic Gradient Descent with Biased but Consistent Gradient Estimators
Title | Stochastic Gradient Descent with Biased but Consistent Gradient Estimators |
Authors | Anonymous |
Abstract | Stochastic gradient descent (SGD), which dates back to the 1950s, is one of the most popular and effective approaches for performing stochastic optimization. Research on SGD resurged recently in machine learning for optimizing convex loss functions and training nonconvex deep neural networks. The theory assumes that one can easily compute an unbiased gradient estimator, which is usually the case due to the sample average nature of empirical risk minimization. There exist, however, many scenarios (e.g., graphs) where an unbiased estimator may be as expensive to compute as the full gradient because training examples are interconnected. Recently, Chen et al. (2018) proposed using a consistent gradient estimator as an economic alternative. Encouraged by empirical success, we show, in a general setting, that consistent estimators result in the same convergence behavior as do unbiased ones. Our analysis covers strongly convex, convex, and nonconvex objectives. We verify the results with illustrative experiments on synthetic and real-world data. This work opens several new research directions, including the development of more efficient SGD updates with consistent estimators and the design of efficient training algorithms for large-scale graphs. |
Tasks | Stochastic Optimization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rygMWT4twS |
https://openreview.net/pdf?id=rygMWT4twS | |
PWC | https://paperswithcode.com/paper/stochastic-gradient-descent-with-biased-but-1 |
Repo | |
Framework | |
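A minimal illustration of a biased-but-consistent gradient estimator plugged into SGD. The paper's motivating setting is graph training, where unbiased gradients are expensive because examples are interconnected; here the same phenomenon is produced by a toy objective with a nonlinearity outside an expectation, so the natural minibatch plug-in gradient is biased for finite batches but consistent as the batch grows.

```python
import numpy as np

# Toy objective F(w) = (E_x[w . x] - c)^2 with x ~ N(mu, I).  Because of the
# outer square, the natural minibatch plug-in gradient
#     g_B(w) = 2 * (mean_B(w . x) - c) * mean_B(x)
# multiplies two correlated sample means, so it is a *biased* estimator of
# grad F; it is nevertheless consistent, since the bias vanishes as the batch
# grows. SGD with this estimator still converges, with a bias that shrinks
# as the batch size increases.
rng = np.random.default_rng(0)
d, c = 5, 1.5
mu = rng.normal(size=d)                       # unknown E[x] in the real setting

def true_objective(w):
    return (w @ mu - c) ** 2

def plugin_gradient(w, batch):
    xb = mu + rng.normal(size=(batch, d))
    return 2.0 * (xb @ w - c).mean() * xb.mean(axis=0)

for batch in (2, 16, 128):
    w, lr = np.zeros(d), 0.05
    w_avg = np.zeros(d)
    for t in range(3000):
        w -= lr * plugin_gradient(w, batch)
        if t >= 2500:
            w_avg += w / 500                   # Polyak-style tail averaging
    print(f"batch={batch:4d}  F(w_avg)={true_objective(w_avg):.5f}  (optimum 0)")
```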
The Sooner The Better: Investigating Structure of Early Winning Lottery Tickets
Title | The Sooner The Better: Investigating Structure of Early Winning Lottery Tickets |
Authors | Anonymous |
Abstract | The recent success of the lottery ticket hypothesis by Frankle & Carbin (2018) suggests that small, sparsified neural networks can be trained as long as the network is initialized properly. Several follow-up discussions on the initialization of the sparsified model have discovered interesting characteristics, such as the necessity of rewinding (Frankle et al. (2019)), the importance of the sign of the initial weights (Zhou et al. (2019)), and the transferability of winning lottery tickets (Morcos et al. (2019)). In contrast, another essential aspect of the winning ticket, the structure of the sparsified model, has been little discussed. Unfortunately, to find the lottery ticket, all the prior work still relies on computationally expensive iterative pruning. In this work, we conduct an in-depth investigation of the structure of winning lottery tickets. Interestingly, we discover that there exist many lottery tickets that can achieve equally good accuracy well before the regular training schedule even finishes. We provide insights into the structure of these early winning tickets with supporting evidence: 1) under stochastic gradient descent optimization, a lottery ticket emerges when the weight magnitudes of a model saturate; 2) pruning before the saturation of a model causes a loss of capacity for learning complex patterns, resulting in accuracy degradation. We employ memorization capacity analysis to confirm this quantitatively, and further explain why gradual pruning can achieve better accuracy than one-shot pruning. Based on these insights, we discover early winning tickets for various ResNet architectures on both CIFAR10 and ImageNet, achieving state-of-the-art accuracy at a high pruning rate without expensive iterative pruning. In the case of ResNet50 on ImageNet, this yields a winning ticket with 75.02% Top-1 accuracy at an 80% pruning rate in only 22% of the total epochs required for iterative pruning. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlNs0VYPB |
https://openreview.net/pdf?id=BJlNs0VYPB | |
PWC | https://paperswithcode.com/paper/the-sooner-the-better-investigating-structure |
Repo | |
Framework | |
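A toy version of insight (1): watch the mean weight magnitude during training and apply a single one-shot magnitude pruning as soon as it saturates, then keep training under the fixed mask. The model (regularised logistic regression), the saturation test, and the thresholds are all placeholders; the paper works with ResNets and a more careful schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sparsity = 2000, 100, 0.8
X = rng.normal(size=(n, p))
w_true = np.where(rng.random(p) < 0.2, rng.normal(size=p) * 2.0, 0.0)
y = (X @ w_true + 0.3 * rng.normal(size=n) > 0).astype(float)

def grad(w, lam=5e-2):
    """Gradient of L2-regularised logistic regression."""
    z = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (z - y) / n + lam * w

w, mask, lr = np.zeros(p), np.ones(p), 0.5
mag_history, pruned_at = [], None
for epoch in range(400):
    w -= lr * grad(w)
    w *= mask
    mag_history.append(np.abs(w).mean())
    saturated = (epoch > 10 and
                 abs(mag_history[-1] - mag_history[-6]) / mag_history[-1] < 5e-3)
    if pruned_at is None and saturated:
        thresh = np.quantile(np.abs(w), sparsity)   # one-shot global magnitude prune
        mask = (np.abs(w) > thresh).astype(float)
        w *= mask
        pruned_at = epoch

acc = (((X @ w) > 0).astype(float) == y).mean()
print(f"pruned at epoch {pruned_at}, kept {int(mask.sum())}/{p} weights, train acc {acc:.3f}")
```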
Stable Rank Normalization for Improved Generalization in Neural Networks and GANs
Title | Stable Rank Normalization for Improved Generalization in Neural Networks and GANs |
Authors | Anonymous |
Abstract | Exciting new work on generalization bounds for neural networks (NNs) by Bartlett et al. (2017) and Neyshabur et al. (2018) closely depends on two parameter-dependent quantities: the Lipschitz constant upper bound and the stable rank (a softer version of rank). Even though these bounds typically have minimal practical utility, they raise the question of whether controlling such quantities together could improve the generalization behaviour of NNs in practice. To this end, we propose stable rank normalization (SRN), a novel, provably optimal, and computationally efficient weight-normalization scheme which minimizes the stable rank of a linear operator. Surprisingly, we find that SRN, despite being a non-convex problem, can be shown to have a unique optimal solution. We provide extensive analyses across a wide variety of NNs (DenseNet, WideResNet, ResNet, AlexNet, VGG), where applying SRN to their linear layers leads to improved classification accuracy, while simultaneously showing improvements in generalization, evaluated empirically using (a) shattering experiments (Zhang et al., 2016); and (b) three measures of sample complexity by Bartlett et al. (2017), Neyshabur et al. (2018), and Wei & Ma. Additionally, we show that, when applied to the discriminator of GANs, it improves Inception, FID, and neural divergence scores, while learning mappings with a low empirical Lipschitz constant. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1enKkrFDB |
https://openreview.net/pdf?id=H1enKkrFDB | |
PWC | https://paperswithcode.com/paper/stable-rank-normalization-for-improved-1 |
Repo | |
Framework | |
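Stable rank is $\|W\|_F^2 / \|W\|_2^2$. The sketch below shows one simple way to normalise a weight matrix to spectral norm 1 while capping its stable rank: keep the leading singular direction and uniformly shrink the rest of the spectrum. This mirrors the flavour of SRN but is not claimed to be the paper's provably optimal update.

```python
import numpy as np

def stable_rank(W):
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2       # ||W||_F^2 / ||W||_2^2

def srn(W, target):
    """Return a matrix with spectral norm 1 and stable rank <= target by
    keeping the leading singular direction and uniformly shrinking the rest
    of the spectrum (an illustrative scheme, not the published update)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = s / s[0]                             # spectral normalisation
    tail = (s[1:] ** 2).sum()
    if tail > target - 1.0:                  # current stable rank exceeds target
        s[1:] *= np.sqrt((target - 1.0) / tail)
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
W_srn = srn(W, target=10.0)
print(f"before: srank={stable_rank(W):6.1f}  spectral norm={np.linalg.svd(W, compute_uv=False)[0]:.2f}")
print(f"after : srank={stable_rank(W_srn):6.1f}  spectral norm={np.linalg.svd(W_srn, compute_uv=False)[0]:.2f}")
```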
Conditional Learning of Fair Representations
Title | Conditional Learning of Fair Representations |
Authors | Anonymous |
Abstract | We propose a novel algorithm for learning fair representations that can simultaneously mitigate two notions of disparity among different demographic subgroups. Two key components underpinning the design of our algorithm are balanced error rate and conditional alignment of representations. We show how these two components contribute to ensuring accuracy parity and equalized false-positive and false-negative rates across groups without impacting demographic parity. Furthermore, we demonstrate, both in theory and in two real-world experiments, that the proposed algorithm leads to a better utility-fairness trade-off on balanced datasets compared with existing algorithms for learning fair representations. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hkekl0NFPr |
https://openreview.net/pdf?id=Hkekl0NFPr | |
PWC | https://paperswithcode.com/paper/conditional-learning-of-fair-representations-1 |
Repo | |
Framework | |
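The two ingredients named in the abstract, balanced error rate and conditional alignment of representations, can be written as a single training objective. Below is a rough numpy sketch: a class-balanced cross-entropy surrogate for the balanced error rate plus a penalty that pulls the group-conditional representation means together within each label class. The mean-matching penalty is only a simple stand-in for the paper's alignment mechanism, and all shapes and names are assumptions.

```python
import numpy as np

def balanced_cross_entropy(probs, y):
    """Surrogate for the balanced error rate: average the per-class mean
    negative log-likelihood so each class counts equally."""
    eps = 1e-8
    nll = -(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))
    return np.mean([nll[y == c].mean() for c in (0, 1)])

def conditional_alignment_penalty(Z, y, a):
    """Within each label class, pull the group-conditional representation
    means together (a crude proxy for the paper's conditional alignment)."""
    pen = 0.0
    for c in (0, 1):
        mus = [Z[(y == c) & (a == g)].mean(axis=0) for g in (0, 1)]
        pen += np.sum((mus[0] - mus[1]) ** 2)
    return pen

def fair_objective(probs, Z, y, a, lam=1.0):
    return balanced_cross_entropy(probs, y) + lam * conditional_alignment_penalty(Z, y, a)

# Shapes: probs (n,), Z (n, d) representations, y (n,) labels, a (n,) group ids.
rng = np.random.default_rng(0)
n, d = 512, 16
y, a = rng.integers(0, 2, n), rng.integers(0, 2, n)
Z = rng.normal(size=(n, d)) + 0.5 * a[:, None]     # representations leak the group
probs = np.clip(0.5 + 0.1 * rng.normal(size=n), 0.01, 0.99)
print(f"objective: {fair_objective(probs, Z, y, a):.3f}")
```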
I love your chain mail! Making knights smile in a fantasy game world
Title | I love your chain mail! Making knights smile in a fantasy game world |
Authors | Anonymous |
Abstract | Dialogue research tends to distinguish between chit-chat and goal-oriented tasks. While the former is arguably more naturalistic and has a wider use of language, the latter has clearer metrics and a more straightforward learning signal. Humans effortlessly combine the two, and engage in chit-chat for example with the goal of exchanging information or eliciting a specific response. Here, we bridge the divide between these two domains in the setting of a rich multi-player text-based fantasy environment where agents and humans engage in both actions and dialogue. Specifically, we train a goal-oriented model with reinforcement learning via self-play against an imitation-learned chit-chat model with two new approaches: the policy either learns to pick a topic or learns to pick an utterance given the top-k utterances. We show that both models outperform a strong inverse model baseline and can converse naturally with their dialogue partner in order to achieve goals. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJxRrlBFwB |
https://openreview.net/pdf?id=BJxRrlBFwB | |
PWC | https://paperswithcode.com/paper/i-love-your-chain-mail-making-knights-smile |
Repo | |
Framework | |
Cross-Lingual Vision-Language Navigation
Title | Cross-Lingual Vision-Language Navigation |
Authors | Anonymous |
Abstract | Vision-Language Navigation (VLN) is the task where an agent is commanded to navigate in photo-realistic unknown environments with natural language instructions. Previous research on VLN is primarily conducted on the Room-to-Room (R2R) dataset with only English instructions. The ultimate goal of VLN, however, is to serve people speaking arbitrary languages. Towards multilingual VLN with numerous languages, we collect a cross-lingual R2R dataset, which extends the original benchmark with corresponding Chinese instructions. But it is time-consuming and expensive to collect large-scale human instructions for every existing language. Based on the newly introduced dataset, we propose a general cross-lingual VLN framework to enable instruction-following navigation for different languages. We first explore the possibility of building a cross-lingual agent when no training data of the target language is available. The cross-lingual agent is equipped with a meta-learner to aggregate cross-lingual representations and a visually grounded cross-lingual alignment module to align textual representations of different languages. Under the zero-shot learning scenario, our model shows competitive results even compared to a model trained with all target language instructions. In addition, we introduce an adversarial domain adaptation loss to improve the transfer ability of our model when given a certain amount of target language data. Our methods and dataset demonstrate the potential of building a cross-lingual agent to serve speakers of different languages. |
Tasks | Domain Adaptation, Vision-Language Navigation, Zero-Shot Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkeZO1BFDB |
https://openreview.net/pdf?id=rkeZO1BFDB | |
PWC | https://paperswithcode.com/paper/cross-lingual-vision-language-navigation |
Repo | |
Framework | |
Unifying Graph Convolutional Neural Networks and Label Propagation
Title | Unifying Graph Convolutional Neural Networks and Label Propagation |
Authors | Anonymous |
Abstract | Label Propagation (LPA) and Graph Convolutional Neural Networks (GCN) are both message passing algorithms on graphs. Both solve the task of node classification, but LPA propagates node label information across the edges of the graph, while GCN propagates and transforms node feature information. However, while the two are conceptually similar, the theoretical relationship between LPA and GCN has not yet been investigated. Here we study the relationship between LPA and GCN in terms of two aspects: (1) feature/label smoothing, where we analyze how the feature/label of one node is spread over its neighbors; and (2) feature/label influence, i.e., how much the initial feature/label of one node influences the final feature/label of another node. Based on our theoretical analysis, we propose an end-to-end model that unifies GCN and LPA for node classification. In our unified model, edge weights are learnable, and LPA serves as regularization to assist the GCN in learning proper edge weights that lead to improved classification performance. Our model can also be seen as learning attention weights based on node labels, which is more task-oriented than existing feature-based attention models. In a number of experiments on real-world graphs, our model shows superiority over state-of-the-art GCN-based methods in terms of node classification accuracy. |
Tasks | Node Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkgdYhVtvH |
https://openreview.net/pdf?id=rkgdYhVtvH | |
PWC | https://paperswithcode.com/paper/unifying-graph-convolutional-neural-networks |
Repo | |
Framework | |
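The unified objective described in the abstract can be sketched directly: a GCN branch propagates node features, an LPA branch propagates training labels through the same learnable edge weights, and the LPA loss acts as a regulariser on those weights. The code below computes such a combined loss forward-only on a random toy graph; the exact propagation rules and training details in the paper may differ, and in practice everything is differentiated end-to-end.

```python
import numpy as np

def normalize_adj(A_w):
    """Symmetrically normalise a weighted adjacency matrix (self-loops added)."""
    A_hat = A_w + np.eye(A_w.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def softmax(X):
    e = np.exp(X - X.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)

def gcn_lpa_loss(A_w, X, Y_onehot, train_mask, W1, W2, lam=1.0, lpa_steps=3):
    """Combined objective: GCN classification loss plus an LPA loss that
    regularises the learnable edge weights A_w."""
    A_hat = normalize_adj(A_w)
    # GCN branch: propagate and transform node *features*.
    H = np.maximum(A_hat @ X @ W1, 0.0)
    P_gcn = softmax(A_hat @ H @ W2)
    # LPA branch: propagate the training *labels* with the same edge weights.
    Y_prop = Y_onehot * train_mask[:, None]
    for _ in range(lpa_steps):
        Y_prop = A_hat @ Y_prop
    P_lpa = Y_prop / (Y_prop.sum(1, keepdims=True) + 1e-8)
    ce = lambda P: -np.mean(np.log(P[train_mask] + 1e-8)[Y_onehot[train_mask] == 1])
    return ce(P_gcn) + lam * ce(P_lpa)

# Tiny random graph just to exercise the function.
rng = np.random.default_rng(0)
n, f, k, h = 30, 8, 3, 16
A_w = rng.random((n, n)) * (rng.random((n, n)) < 0.1)
A_w = (A_w + A_w.T) / 2
X = rng.normal(size=(n, f))
Y_onehot = np.eye(k)[rng.integers(0, k, n)]
train_mask = rng.random(n) < 0.3
W1, W2 = rng.normal(size=(f, h)) * 0.1, rng.normal(size=(h, k)) * 0.1
print("loss:", gcn_lpa_loss(A_w, X, Y_onehot, train_mask, W1, W2))
```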
Guiding Program Synthesis by Learning to Generate Examples
Title | Guiding Program Synthesis by Learning to Generate Examples |
Authors | Anonymous |
Abstract | A key challenge of existing program synthesizers is ensuring that the synthesized program generalizes well. This can be difficult to achieve as the specification provided by the end user is often limited, containing as few as one or two input-output examples. In this paper we address this challenge via an iterative approach that finds ambiguities in the provided specification and learns to resolve these by generating additional input-output examples. The main insight is to reduce the problem of selecting which program generalizes well to the simpler task of deciding which output is correct. As a result, to train our probabilistic models, we can take advantage of the large amounts of data in the form of program outputs, which are often much easier to obtain than the corresponding ground-truth programs. |
Tasks | Program Synthesis |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJl07ySKvS |
https://openreview.net/pdf?id=BJl07ySKvS | |
PWC | https://paperswithcode.com/paper/guiding-program-synthesis-by-learning-to |
Repo | |
Framework | |
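The iterative loop from the abstract: keep the candidate programs consistent with the current examples, find an input where the survivors disagree, ask a model which output is correct, and add the resulting input-output pair as a new example. In the sketch below the candidate programs form a toy string-transformation DSL, and `output_scorer` is a hard-coded stand-in for the paper's learned probabilistic model over outputs.

```python
CANDIDATES = {
    "first_two_upper":  lambda s: s[:2].upper(),
    "upper_all":        lambda s: s.upper(),
    "upper_first_word": lambda s: s.split()[0].upper(),
}
user_examples = [("hi", "HI")]    # all candidates agree on the user's spec

def consistent(progs, examples):
    return {n: p for n, p in progs.items()
            if all(p(i) == o for i, o in examples)}

def output_scorer(inp, output):
    # Stand-in for a learned model of P(output is correct | inp).
    return 1.0 if output.isupper() and " " not in output and len(output) > 2 else 0.1

def disambiguate(progs, examples, probe_inputs, rounds=3):
    progs = consistent(progs, examples)
    for _ in range(rounds):
        if len(progs) <= 1:
            break
        # Find a probe input on which the surviving candidates disagree.
        probe = next((x for x in probe_inputs
                      if len({p(x) for p in progs.values()}) > 1), None)
        if probe is None:
            break
        outputs = {p(probe) for p in progs.values()}
        best = max(outputs, key=lambda o: output_scorer(probe, o))
        examples = examples + [(probe, best)]            # newly generated example
        progs = consistent(progs, examples)
    return progs, examples

progs, examples = disambiguate(CANDIDATES, user_examples,
                               probe_inputs=["hello world", "good morning"])
print("surviving programs:", list(progs))
print("examples used:", examples)
```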
Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data
Title | Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data |
Authors | Anonymous |
Abstract | This paper investigates the intriguing question of whether we can create learning algorithms that automatically generate training data, learning environments, and curricula in order to help AI agents rapidly learn. We show that such algorithms are possible via Generative Teaching Networks (GTNs), a general approach that is applicable to supervised, unsupervised, and reinforcement learning. GTNs are deep neural networks that generate data and/or training environments that a learner (e.g.\ a freshly initialized neural network) trains on before being tested on a target task. We then differentiate \emph{through the entire learning process} via meta-gradients to update the GTN parameters to improve performance on the target task. GTNs have the beneficial property that they can theoretically generate any type of data or training environment, making their potential impact large. This paper introduces GTNs, discusses their potential, and showcases that they can substantially accelerate learning. We also demonstrate a practical and exciting application of GTNs: accelerating the evaluation of candidate architectures for neural architecture search (NAS), which is rate-limited by such evaluations, enabling massive speed-ups in NAS. GTN-NAS improves the NAS state of the art, finding higher performing architectures when controlling for the search proposal mechanism. GTN-NAS also is competitive with the overall state of the art approaches, which achieve top performance while using orders of magnitude less computation than typical NAS methods. Overall, GTNs represent a first step toward the ambitious goal of algorithms that generate their own training data and, in doing so, open a variety of interesting new research questions and directions. |
Tasks | Neural Architecture Search |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJg_ECEKDr |
https://openreview.net/pdf?id=HJg_ECEKDr | |
PWC | https://paperswithcode.com/paper/generative-teaching-networks-accelerating |
Repo | |
Framework | |
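A heavily simplified picture of the GTN loop: synthetic training data is optimised so that a learner trained only on it performs well on the target task. Two deliberate simplifications versus the paper: the generator network is collapsed into a directly learned synthetic dataset, and the exact meta-gradient through the inner training is replaced by an evolution-strategies estimate. The closed-form ridge learner and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_real, n_syn = 8, 200, 16
w_target = rng.normal(size=d)
X_val = rng.normal(size=(n_real, d))
y_val = X_val @ w_target + 0.1 * rng.normal(size=n_real)     # the target task

def learner_fit(X_syn, y_syn, lam=1e-2):
    """Inner loop: a ridge-regression learner trained only on synthetic data."""
    return np.linalg.solve(X_syn.T @ X_syn + lam * np.eye(d), X_syn.T @ y_syn)

def meta_loss(theta):
    X_syn, y_syn = theta[:, :d], theta[:, d]
    w = learner_fit(X_syn, y_syn)
    return np.mean((X_val @ w - y_val) ** 2)                  # target-task error

theta = rng.normal(size=(n_syn, d + 1))          # learnable synthetic (x, y) pairs
sigma, lr, n_pert = 0.1, 0.05, 32
for step in range(301):
    eps = rng.normal(size=(n_pert, *theta.shape))
    losses = np.array([meta_loss(theta + sigma * e) for e in eps])
    grad_est = ((losses - losses.mean())[:, None, None] * eps).mean(0) / sigma
    theta -= lr * grad_est                        # evolution-strategies meta-step
    if step % 100 == 0:
        print(f"step {step:3d}  learner's target-task MSE: {meta_loss(theta):.4f}")
```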
Don’t Use Large Mini-batches, Use Local SGD
Title | Don’t Use Large Mini-batches, Use Local SGD |
Authors | Anonymous |
Abstract | Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose \emph{post-local} SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of \emph{local SGD} variants. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1eyO1BFPr |
https://openreview.net/pdf?id=B1eyO1BFPr | |
PWC | https://paperswithcode.com/paper/dont-use-large-mini-batches-use-local-sgd-1 |
Repo | |
Framework | |
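Post-local SGD as described in the abstract: run ordinary synchronous mini-batch SGD for an initial phase, then switch to local SGD, where each worker takes several local steps before the models are averaged. The single-process simulation below sketches that schedule on a least-squares problem; worker counts, switch point, and averaging period are arbitrary choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n_per_worker = 8, 20, 500
w_true = rng.normal(size=d)
shards = []
for _ in range(K):                                   # each worker's local data
    X = rng.normal(size=(n_per_worker, d))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_worker)
    shards.append((X, y))

def minibatch_grad(w, X, y, batch=32):
    idx = rng.integers(0, len(y), batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

def full_loss(w):
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in shards])

lr, t_switch, H, T = 0.02, 200, 8, 1000
w = np.zeros(d)
workers = [w.copy() for _ in range(K)]
for t in range(T):
    if t < t_switch:                                 # phase 1: synchronous SGD
        g = np.mean([minibatch_grad(w, X, y) for X, y in shards], axis=0)
        w -= lr * g
        workers = [w.copy() for _ in range(K)]
    else:                                            # phase 2: local SGD
        for k, (X, y) in enumerate(shards):
            workers[k] -= lr * minibatch_grad(workers[k], X, y)
        if (t - t_switch) % H == H - 1:              # periodic model averaging
            w = np.mean(workers, axis=0)
            workers = [w.copy() for _ in range(K)]
print(f"final loss {full_loss(np.mean(workers, axis=0)):.4f} (label-noise variance is 0.01)")
```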
Kernelized Wasserstein Natural Gradient
Title | Kernelized Wasserstein Natural Gradient |
Authors | Anonymous |
Abstract | Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging as it requires inverting a high-dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric, by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator for the gradient direction that can trade off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and empirically show the advantage of using such an estimator in classification tasks on \texttt{Cifar10} and \texttt{Cifar100}. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Hklz71rYvS |
https://openreview.net/pdf?id=Hklz71rYvS | |
PWC | https://paperswithcode.com/paper/kernelized-wasserstein-natural-gradient |
Repo | |
Framework | |
Learning Compositional Koopman Operators for Model-Based Control
Title | Learning Compositional Koopman Operators for Model-Based Control |
Authors | Anonymous |
Abstract | Finding an embedding space for a linear approximation of a nonlinear dynamical system enables efficient system identification and control synthesis. The Koopman operator theory lays the foundation for identifying the nonlinear-to-linear coordinate transformations with data-driven methods. Recently, researchers have proposed to use deep neural networks as a more expressive class of basis functions for calculating the Koopman operators. These approaches, however, assume a fixed dimensional state space; they are therefore not applicable to scenarios with a variable number of objects. In this paper, we propose to learn compositional Koopman operators, using graph neural networks to encode the state into object-centric embeddings and using a block-wise linear transition matrix to regularize the shared structure across objects. The learned dynamics can quickly adapt to new environments of unknown physical parameters and produce control signals to achieve a specified goal. Our experiments on manipulating ropes and controlling soft robots show that the proposed method has better efficiency and generalization ability than existing baselines. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1ldzA4tPr |
https://openreview.net/pdf?id=H1ldzA4tPr | |
PWC | https://paperswithcode.com/paper/learning-compositional-koopman-operators-for-1 |
Repo | |
Framework | |
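The structural idea from the abstract, shared block-wise linear dynamics over object-centric states, can be shown with plain least squares: fit blocks $A$ and $B$ of $z_i' = A z_i + B\,\mathrm{mean}_j(z_j)$ on trajectories with one object count and roll them out with another. In the paper the object-centric embeddings come from graph neural networks and the Koopman operators are learned jointly; here the raw 2-D object states stand in for those embeddings purely for illustration.

```python
import numpy as np

def step(S, dt=0.05, k=2.0):
    """Ground-truth dynamics: each object is (position, velocity) and is pulled
    toward the centroid. S has shape (num_objects, 2)."""
    x, v = S[:, 0], S[:, 1]
    a = -k * (x - x.mean())
    return np.stack([x + dt * v, v + dt * a], axis=1)

def rollout(S0, T, f):
    traj = [S0]
    for _ in range(T):
        traj.append(f(traj[-1]))
    return np.stack(traj)

# Collect transitions from a 5-object system and fit shared blocks (A, B) of the
# block-wise linear model z_i' = A z_i + B mean_j(z_j).
rng = np.random.default_rng(0)
train = rollout(rng.normal(size=(5, 2)), 200, step)           # (201, 5, 2)
Z, Znext = train[:-1].reshape(-1, 2), train[1:].reshape(-1, 2)
Zmean = np.repeat(train[:-1].mean(axis=1), 5, axis=0)          # per-step centroid
Phi = np.hstack([Z, Zmean])                                    # (samples, 4)
AB = np.linalg.lstsq(Phi, Znext, rcond=None)[0].T              # (2, 4) -> [A | B]
A, B = AB[:, :2], AB[:, 2:]

# The same shared blocks apply to any number of objects.
S0 = rng.normal(size=(8, 2))
pred = rollout(S0, 50, lambda S: S @ A.T + S.mean(axis=0) @ B.T)
true = rollout(S0, 50, step)
print("max rollout error with 8 objects:", np.abs(pred - true).max())
```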