Paper Group NANR 105
Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. Adversarial Video Generation on Complex Datasets. Information Geometry of Orthogonal Initializations and Training. Towards Hierarchical Importance Attribution: Explaining Compositional Semantics …
Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks
Title | Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks |
Authors | Anonymous |
Abstract | The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkgqN1SYvr |
https://openreview.net/pdf?id=rkgqN1SYvr | |
PWC | https://paperswithcode.com/paper/provable-benefit-of-orthogonal-initialization |
Repo | |
Framework | |
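To make the contrast concrete, here is a minimal, hedged sketch in PyTorch that trains the same deep linear network under orthogonal and iid Gaussian initializations; the depth, width, learning rate, and step count are illustrative choices, not the paper's experimental setup.

```python
# Minimal sketch: orthogonal vs. iid Gaussian initialization of a deep linear network.
# Depth, width, learning rate, and step count are illustrative, not taken from the paper.
import torch

def deep_linear(depth, width, init):
    layers = [torch.nn.Linear(width, width, bias=False) for _ in range(depth)]
    for layer in layers:
        if init == "orthogonal":
            torch.nn.init.orthogonal_(layer.weight)                  # weight drawn from the orthogonal group
        else:
            torch.nn.init.normal_(layer.weight, std=width ** -0.5)   # iid Gaussian, variance 1/width
    return torch.nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(256, 64)
y = torch.randn(256, 64)
for init in ("orthogonal", "gaussian"):
    net = deep_linear(depth=32, width=64, init=init)
    opt = torch.optim.SGD(net.parameters(), lr=1e-3)
    for step in range(200):
        loss = torch.nn.functional.mse_loss(net(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(init, "final loss:", loss.item())
```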
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Title | vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations |
Authors | Anonymous |
Abstract | We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition. |
Tasks | Speech Recognition |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rylwJxrYDS |
https://openreview.net/pdf?id=rylwJxrYDS | |
PWC | https://paperswithcode.com/paper/vq-wav2vec-self-supervised-learning-of |
Repo | |
Framework | |
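The abstract names Gumbel-softmax quantization as one of the two options for discretizing the dense features. Below is a minimal sketch of such a quantizer; the codebook size, feature dimension, and temperature are placeholders rather than the paper's configuration.

```python
# Minimal sketch of a Gumbel-softmax vector quantizer, one of the two quantization
# options mentioned in the abstract. Dimensions and temperature are placeholders.
import torch
import torch.nn.functional as F

class GumbelQuantizer(torch.nn.Module):
    def __init__(self, dim=512, codebook_size=320, temperature=2.0):
        super().__init__()
        self.logits_proj = torch.nn.Linear(dim, codebook_size)   # dense feature -> codebook logits
        self.codebook = torch.nn.Embedding(codebook_size, dim)   # discrete code -> dense vector
        self.temperature = temperature

    def forward(self, z):                        # z: (batch, time, dim) dense features
        logits = self.logits_proj(z)
        if self.training:
            # differentiable, approximately one-hot samples
            onehot = F.gumbel_softmax(logits, tau=self.temperature, hard=True)
        else:
            onehot = F.one_hot(logits.argmax(-1), logits.shape[-1]).float()
        quantized = onehot @ self.codebook.weight    # replace features with codebook entries
        indices = onehot.argmax(-1)                  # discrete ids usable by NLP-style models (e.g. BERT)
        return quantized, indices

quantizer = GumbelQuantizer()
q, ids = quantizer(torch.randn(2, 100, 512))
print(q.shape, ids.shape)
```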
Adversarial Video Generation on Complex Datasets
Title | Adversarial Video Generation on Complex Datasets |
Authors | Anonymous |
Abstract | Generative models of natural images have progressed towards high fidelity samples by the strong leveraging of scale. We attempt to carry this success to the field of video modeling by showing that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work. Our proposed model, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher resolution videos by leveraging a computationally efficient decomposition of its discriminator. We evaluate on the related tasks of video synthesis and video prediction, and achieve new state-of-the-art Fréchet Inception Distance for prediction for Kinetics-600, as well as state-of-the-art Inception Score for synthesis on the UCF-101 dataset, alongside establishing a strong baseline for synthesis on Kinetics-600. |
Tasks | Video Generation, Video Prediction |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Byx91R4twB |
https://openreview.net/pdf?id=Byx91R4twB | |
PWC | https://paperswithcode.com/paper/adversarial-video-generation-on-complex |
Repo | |
Framework | |
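The "Dual Video Discriminator" in the name refers to splitting the discriminator into a spatial critic over a few sampled full-resolution frames and a temporal critic over the spatially downsampled clip. The sketch below encodes only that data-flow split; the critic networks, frame count, and downsampling factor are placeholders, not the paper's architecture.

```python
# Sketch of a dual-discriminator split: a spatial critic on k sampled full-resolution
# frames plus a temporal critic on the spatially downsampled clip. Placeholder networks.
import torch
import torch.nn.functional as F

class DualVideoDiscriminator(torch.nn.Module):
    def __init__(self, channels=3, k_frames=8, downsample=2):
        super().__init__()
        self.k_frames = k_frames
        self.downsample = downsample
        self.spatial = torch.nn.Sequential(      # placeholder per-frame critic
            torch.nn.Conv2d(channels, 64, 4, 2, 1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(64, 1))
        self.temporal = torch.nn.Sequential(     # placeholder clip-level critic on downsampled video
            torch.nn.Conv3d(channels, 64, 4, 2, 1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool3d(1), torch.nn.Flatten(), torch.nn.Linear(64, 1))

    def forward(self, video):                    # video: (batch, channels, time, height, width)
        b, c, t, h, w = video.shape
        idx = torch.randperm(t)[: self.k_frames]                     # sample k frames for the spatial critic
        frames = video[:, :, idx].permute(0, 2, 1, 3, 4).reshape(b * len(idx), c, h, w)
        spatial_score = self.spatial(frames).view(b, -1).mean(dim=1)
        small = F.interpolate(video, scale_factor=(1, 1 / self.downsample, 1 / self.downsample))
        temporal_score = self.temporal(small).squeeze(-1)
        return spatial_score + temporal_score

disc = DualVideoDiscriminator()
print(disc(torch.randn(2, 3, 16, 64, 64)).shape)
```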
Information Geometry of Orthogonal Initializations and Training
Title | Information Geometry of Orthogonal Initializations and Training |
Authors | Anonymous |
Abstract | Recently mean field theory has been successfully used to analyze properties of wide, random neural networks. It gave rise to a prescriptive theory for initializing feed-forward neural networks with orthogonal weights, which ensures that both the forward propagated activations and the backpropagated gradients are near ℓ2 isometries and, as a consequence, training is orders of magnitude faster. Despite strong empirical performance, the mechanisms by which critical initializations confer an advantage in the optimization of deep neural networks are poorly understood. Here we show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness) as measured by the Fisher information matrix (FIM) and the spectral radius of the input-output Jacobian, which partially explains why more isometric networks can train much faster. Furthermore, given that orthogonal weights are necessary to ensure that gradient norms are approximately preserved at initialization, we experimentally investigate the benefits of maintaining orthogonality throughout training, and we conclude that manifold optimization of weights performs well regardless of the smoothness of the gradients. Moreover, we observe a surprising yet robust behavior of highly isometric initializations: even though such networks have a lower FIM condition number at initialization, and therefore by analogy to convex functions should be easier to optimize, experimentally they prove to be much harder to train with stochastic gradient descent. We propose an explanation for this phenomenon by exploiting connections between Fisher geometry and the recently introduced Neural Tangent Kernel. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkg1ngrFPr |
https://openreview.net/pdf?id=rkg1ngrFPr | |
PWC | https://paperswithcode.com/paper/information-geometry-of-orthogonal-1 |
Repo | |
Framework | |
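The abstract ties the FIM-measured curvature to the spectral radius of the input-output Jacobian. The sketch below does not compute the FIM; it only compares the Jacobian spectrum of a small tanh network under orthogonal versus iid Gaussian weights, with depth, width, and gain chosen purely for illustration.

```python
# Compare the input-output Jacobian spectrum of a deep tanh network under
# orthogonal vs. iid Gaussian initialization. Depth, width, and gain are illustrative.
import torch

def make_net(depth=20, width=128, init="orthogonal", gain=1.0):
    layers = []
    for _ in range(depth):
        lin = torch.nn.Linear(width, width, bias=False)
        if init == "orthogonal":
            torch.nn.init.orthogonal_(lin.weight, gain=gain)
        else:
            torch.nn.init.normal_(lin.weight, std=gain * width ** -0.5)
        layers += [lin, torch.nn.Tanh()]
    return torch.nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(128)
for init in ("orthogonal", "gaussian"):
    net = make_net(init=init)
    jac = torch.autograd.functional.jacobian(net, x)     # input-output Jacobian, (width, width)
    svals = torch.linalg.svdvals(jac)
    print(init, "largest singular value:", svals.max().item(),
          "smallest:", svals.min().item())
```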
Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models
Title | Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models |
Authors | Anonymous |
Abstract | Deep neural networks have achieved impressive performance in handling complicated semantics in natural language, while mostly treated as black boxes. To explain how the model handles compositional semantics of words and phrases, we study the hierarchical explanation problem. We highlight that the key challenge is to compute non-additive and context-independent importance for individual words and phrases. We show that some prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically, leading to inconsistent explanation quality in different models. In this paper, we propose a formal way to quantify the importance of each word or phrase to generate hierarchical explanations. We modify contextual decomposition algorithms according to our formulation, and propose a model-agnostic explanation algorithm with competitive performance. Human evaluation and automatic metrics evaluation on both LSTM models and fine-tuned BERT Transformer models on multiple datasets show that our algorithms robustly outperform prior works on hierarchical explanations. We show our algorithms help explain compositionality of semantics, extract classification rules, and improve human trust in models. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkxRRkSKwr |
https://openreview.net/pdf?id=BkxRRkSKwr | |
PWC | https://paperswithcode.com/paper/towards-hierarchical-importance-attribution |
Repo | |
Framework | |
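For readers unfamiliar with the setting, the snippet below shows a simple occlusion-style baseline for scoring phrase importance over a set of spans. It is explicitly not the paper's contextual-decomposition-based algorithm, only a minimal illustration of what assigning importance to phrases in a hierarchy means; the mask token and toy model are placeholders.

```python
# Occlusion-style phrase importance baseline (NOT the paper's algorithm), included only
# to make the hierarchical-attribution setting concrete. `model` is any callable that
# returns a class score for a token list; "[MASK]" is a placeholder mask token.
def phrase_importance(model, tokens, span, mask_token="[MASK]"):
    """Importance of tokens[span[0]:span[1]] = drop in score when the span is masked."""
    full_score = model(tokens)
    masked = tokens[: span[0]] + [mask_token] * (span[1] - span[0]) + tokens[span[1]:]
    return full_score - model(masked)

def hierarchical_importances(model, tokens, spans):
    """Score a set of nested spans (e.g. from a parse tree) to build a hierarchy."""
    return {span: phrase_importance(model, tokens, span) for span in spans}

# toy usage with a fake "model" that rewards the word "good"
toy_model = lambda toks: float(toks.count("good"))
print(hierarchical_importances(toy_model, ["a", "good", "movie"], [(0, 1), (1, 2), (1, 3)]))
```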
Graph Analysis and Graph Pooling in the Spatial Domain
Title | Graph Analysis and Graph Pooling in the Spatial Domain |
Authors | Anonymous |
Abstract | The spatial convolution layer widely used in Graph Neural Networks (GNNs) aggregates the feature vector of each node with the feature vectors of its neighboring nodes. The GNN is not aware of the locations of the nodes in the global structure of the graph, and when the local structures corresponding to different nodes are similar to each other, the convolution layer maps all those nodes to similar or identical feature vectors in the continuous feature space. Therefore, the GNN cannot distinguish two graphs if their difference is not in their local structures. In addition, when the nodes are not labeled/attributed, the convolution layers can fail to distinguish even different local structures. In this paper, we propose an effective solution to address this problem of GNNs. The proposed approach leverages a spatial representation of the graph which makes the neural network aware of the differences between the nodes and also their locations in the graph. The spatial representation, which is equivalent to a point-cloud representation of the graph, is obtained by a graph embedding method. Using the proposed approach, the local feature extractor of the GNN distinguishes similar local structures in different locations of the graph, and the GNN infers the topological structure of the graph from the spatial distribution of the locally extracted feature vectors. Moreover, the spatial representation is utilized to simplify the graph down-sampling problem. A new graph pooling method is proposed, and it is shown that the proposed pooling method achieves competitive or better results in comparison with the state-of-the-art methods. |
Tasks | Graph Embedding |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1eQeCEYwB |
https://openreview.net/pdf?id=r1eQeCEYwB | |
PWC | https://paperswithcode.com/paper/graph-analysis-and-graph-pooling-in-the-1 |
Repo | |
Framework | |
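A hedged sketch of the general recipe the abstract describes: assign each node coordinates via a graph embedding, then pool nodes by grouping them in that coordinate space. The Laplacian eigenmap and the coarse grid-cell grouping below are generic stand-ins, not the paper's embedding or pooling method.

```python
# Give nodes spatial coordinates via a Laplacian eigenmap, then pool by grouping
# nodes in embedding space. Both choices are generic stand-ins, not the paper's method.
import numpy as np

def laplacian_eigenmap(adj, dim=2):
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    vals, vecs = np.linalg.eigh(lap)
    return vecs[:, 1 : dim + 1]                  # skip the trivial constant eigenvector

def grid_pool(coords, features, cells=2):
    # assign nodes to coarse grid cells in embedding space and average features per cell
    mins, maxs = coords.min(0), coords.max(0)
    bins = np.clip(np.floor((coords - mins) / (maxs - mins + 1e-9) * cells), 0, cells - 1).astype(int)
    pooled = {}
    for key, feat in zip((tuple(b) for b in bins), features):
        pooled.setdefault(key, []).append(feat)
    return np.array([np.mean(v, axis=0) for v in pooled.values()])

adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
coords = laplacian_eigenmap(adj)
features = np.eye(4)
print(grid_pool(coords, features).shape)
```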
Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization
Title | Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization |
Authors | Anonymous |
Abstract | An open question in the Deep Learning community is why neural networks trained with Gradient Descent generalize well on real datasets even though they are capable of fitting random data. We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent that we call Coherent Gradients: Gradients from similar examples are similar and so the overall gradient is stronger in certain directions where these reinforce each other. Thus changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples when such similarity exists. We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive. It suggests a natural modification to gradient descent that can greatly reduce overfitting. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ryeFY0EFwS |
https://openreview.net/pdf?id=ryeFY0EFwS | |
PWC | https://paperswithcode.com/paper/coherent-gradients-an-approach-to |
Repo | |
Framework | |
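The Coherent Gradients hypothesis is about per-example gradients reinforcing each other. The sketch below measures one natural proxy for that, the average pairwise cosine similarity of per-example gradients, on a toy model and dataset; the measurement choice is ours, not necessarily the paper's.

```python
# Measure how strongly per-example gradients reinforce each other: average pairwise
# cosine similarity of per-example gradients for a toy model and toy data.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

per_example_grads = []
for xi, yi in zip(x, y):
    model.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    loss.backward()
    per_example_grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))

g = torch.stack(per_example_grads)                       # (num_examples, num_params)
g_normed = torch.nn.functional.normalize(g, dim=1)
sim = g_normed @ g_normed.T
off_diag = sim[~torch.eye(len(g), dtype=torch.bool)]
print("mean pairwise gradient cosine similarity:", off_diag.mean().item())
```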
Understanding Why Neural Networks Generalize Well Through GSNR of Parameters
Title | Understanding Why Neural Networks Generalize Well Through GSNR of Parameters |
Authors | Anonymous |
Abstract | As deep neural networks (DNNs) achieve tremendous success across many application domains, researchers have explored many aspects of why they generalize well. In this paper, we provide a novel perspective on this question using the gradient signal to noise ratio (GSNR) of parameters during the training process of DNNs. The GSNR of a parameter is simply defined as the ratio between its gradient’s squared mean and variance, over the data distribution. Based on several approximations, we establish a quantitative relationship between model parameters’ GSNR and the generalization gap. This relationship indicates that a larger GSNR during the training process leads to better generalization performance. Further, we show that, unlike that of shallow models (e.g. logistic regression, support vector machines), the gradient descent optimization dynamics of DNNs naturally produces large GSNR during training, which is probably the key to DNNs’ remarkable generalization ability. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HyevIJStwH |
https://openreview.net/pdf?id=HyevIJStwH | |
PWC | https://paperswithcode.com/paper/understanding-why-neural-networks-generalize |
Repo | |
Framework | |
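The abstract's definition can be written as GSNR(θ_j) = E[g_j]^2 / Var[g_j], with the mean and variance of the per-example gradient g_j taken over the data distribution. A minimal sketch that estimates this from per-example gradients of a toy model:

```python
# Estimate per-parameter GSNR = (mean of per-example gradient)^2 / (variance of per-example
# gradient), following the definition in the abstract. Model and data are toys.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

grads = []
for xi, yi in zip(x, y):
    model.zero_grad()
    loss = torch.nn.functional.mse_loss(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    loss.backward()
    grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))

g = torch.stack(grads)                                        # (num_examples, num_params)
gsnr = g.mean(0) ** 2 / (g.var(0, unbiased=False) + 1e-12)    # per-parameter GSNR
print("GSNR per parameter:", gsnr)
```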
Intrinsic Motivation for Encouraging Synergistic Behavior
Title | Intrinsic Motivation for Encouraging Synergistic Behavior |
Authors | Anonymous |
Abstract | We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not achieve individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJleNCNtDH |
https://openreview.net/pdf?id=SJleNCNtDH | |
PWC | https://paperswithcode.com/paper/intrinsic-motivation-for-encouraging |
Repo | |
Framework | |
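A sketch of the intrinsic reward the abstract describes: the mismatch between the observed joint effect and a composition of per-agent predicted effects. The single-agent models and the additive composition rule below are illustrative placeholders.

```python
# Reward joint actions whose effect deviates from the composed per-agent predictions.
# The per-agent models and the additive composition are placeholders.
import numpy as np

def intrinsic_reward(state, next_state, action_a, action_b, model_a, model_b):
    """model_a / model_b predict the next state if only that agent acted."""
    delta_a = model_a(state, action_a) - state        # predicted effect of agent A alone
    delta_b = model_b(state, action_b) - state        # predicted effect of agent B alone
    composed_prediction = state + delta_a + delta_b   # naive composition of individual effects
    return float(np.linalg.norm(next_state - composed_prediction))

# toy usage: two agents that each nudge one coordinate; the true joint effect is larger
model_a = lambda s, a: s + np.array([a, 0.0])
model_b = lambda s, a: s + np.array([0.0, a])
state = np.zeros(2)
joint_next_state = np.array([1.0, 1.0]) * 3.0         # synergy: the joint effect exceeds the sum
print(intrinsic_reward(state, joint_next_state, 1.0, 1.0, model_a, model_b))
```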
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Title | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations |
Authors | Anonymous |
Abstract | Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further increases in model size become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1eA7AEtvS |
https://openreview.net/pdf?id=H1eA7AEtvS | |
PWC | https://paperswithcode.com/paper/albert-a-lite-bert-for-self-supervised-1 |
Repo | |
Framework | |
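The two parameter-reduction techniques are, in the published version of this work, factorized embedding parameterization and cross-layer parameter sharing. A minimal sketch of both ideas follows; the vocabulary size, dimensions, and layer count are placeholders, not the paper's configuration.

```python
# Sketch of ALBERT's two parameter-reduction ideas: factorized embeddings and
# cross-layer parameter sharing. All sizes below are placeholders.
import torch

class TinyAlbertEncoder(torch.nn.Module):
    def __init__(self, vocab=30000, embed_dim=128, hidden=768, layers=12, heads=12):
        super().__init__()
        # factorized embeddings: vocab -> small E, then E -> hidden H
        # (O(V*E + E*H) parameters instead of O(V*H))
        self.embed = torch.nn.Embedding(vocab, embed_dim)
        self.embed_proj = torch.nn.Linear(embed_dim, hidden)
        # cross-layer sharing: a single transformer block reused `layers` times
        self.shared_block = torch.nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.num_layers = layers

    def forward(self, token_ids):
        h = self.embed_proj(self.embed(token_ids))
        for _ in range(self.num_layers):
            h = self.shared_block(h)             # same parameters at every depth
        return h

model = TinyAlbertEncoder()
print(sum(p.numel() for p in model.parameters()), "parameters")
print(model(torch.randint(0, 30000, (2, 16))).shape)
```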
SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference
Title | SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference |
Authors | Anonymous |
Abstract | We present a modern scalable reinforcement learning agent called SEED (Scalable, Efficient Deep-RL). By effectively utilizing modern accelerators, we show that it is not only possible to train on millions of frames per second but also to lower the cost of experiments compared to current methods. We achieve this with a simple architecture that features centralized inference and an optimized communication layer. SEED adopts two state-of-the-art distributed algorithms, IMPALA/V-trace (policy gradients) and R2D2 (Q-learning), and is evaluated on Atari-57, DeepMind Lab and Google Research Football. We improve the state of the art on Football and are able to reach state of the art on Atari-57 twice as fast in wall-clock time. For the scenarios we consider, a 40% to 80% cost reduction for running experiments is achieved. The implementation, along with experiments, is open-sourced so that results can be reproduced and novel ideas tried out. |
Tasks | Q-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkgvXlrKwH |
https://openreview.net/pdf?id=rkgvXlrKwH | |
PWC | https://paperswithcode.com/paper/seed-rl-scalable-and-efficient-deep-rl-with |
Repo | |
Framework | |
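A sketch of the centralized-inference data flow the abstract highlights: actors ship observations to a central process that runs one batched forward pass and returns actions. This single-process queue version only illustrates the pattern; it is not the paper's distributed implementation.

```python
# Illustrate centralized inference: actors enqueue observations, a central process runs
# one batched forward pass through the policy and returns actions. Placeholder policy.
import queue
import torch

policy = torch.nn.Linear(4, 2)                        # placeholder policy network
requests = queue.Queue()

def actor(actor_id, observation):
    requests.put((actor_id, observation))             # actors only ship observations

def central_inference(batch_size=4):
    ids, obs = [], []
    while len(ids) < batch_size and not requests.empty():
        i, o = requests.get()
        ids.append(i)
        obs.append(o)
    with torch.no_grad():
        actions = policy(torch.stack(obs)).argmax(dim=1)   # one batched forward pass
    return dict(zip(ids, actions.tolist()))

for i in range(4):
    actor(i, torch.randn(4))
print(central_inference())
```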
Adversarial Attacks on Copyright Detection Systems
Title | Adversarial Attacks on Copyright Detection Systems |
Authors | Anonymous |
Abstract | It is well-known that many machine learning models are susceptible to adversarial attacks, in which an attacker evades a classifier by making small perturbations to inputs. This paper discusses how industrial copyright detection tools, which serve a central role on the web, are susceptible to adversarial attacks. We discuss a range of copyright detection systems, and why they are particularly vulnerable to attacks. These vulnerabilities are especially apparent for neural network based systems. As proof of concept, we describe a well-known music identification method and implement this system in the form of a neural net. We then attack this system using simple gradient methods. Adversarial music created this way successfully fools industrial systems, including the AudioTag copyright detector and YouTube’s Content ID system. Our goal is to raise awareness of the threats posed by adversarial examples in this space and to highlight the importance of hardening copyright detection systems to attacks. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJlRWC4FDB |
https://openreview.net/pdf?id=SJlRWC4FDB | |
PWC | https://paperswithcode.com/paper/adversarial-attacks-on-copyright-detection-1 |
Repo | |
Framework | |
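A sketch of the attack pattern described: treat a differentiable stand-in for an audio fingerprinting model as a classifier and apply a gradient-sign perturbation to the waveform. The fingerprint network, song-ID labels, and epsilon are placeholders, not the paper's reimplemented music-identification system.

```python
# FGSM-style perturbation of a waveform against a placeholder differentiable
# fingerprinting model; everything here is a stand-in, not the paper's system.
import torch

fingerprinter = torch.nn.Sequential(                  # placeholder fingerprint "classifier"
    torch.nn.Conv1d(1, 16, 64, stride=16), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(), torch.nn.Linear(16, 10))

def evade(waveform, true_song_id, epsilon=1e-3):
    """One gradient-sign step that pushes the audio away from its correct fingerprint class."""
    x = waveform.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(
        fingerprinter(x), torch.tensor([true_song_id]))
    loss.backward()
    return (waveform + epsilon * x.grad.sign()).detach()   # small, nearly inaudible perturbation

audio = torch.randn(1, 1, 16000)                       # one second of placeholder audio
adversarial_audio = evade(audio, true_song_id=3)
print((adversarial_audio - audio).abs().max().item())
```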
Goal-Conditioned Video Prediction
Title | Goal-Conditioned Video Prediction |
Authors | Anonymous |
Abstract | Many processes can be concisely represented as a sequence of events leading from a starting state to an end state. Given raw ingredients and a finished cake, an experienced chef can surmise the recipe. Building upon this intuition, we propose a new class of visual generative models: goal-conditioned predictors (GCP). Prior work on video generation largely focuses on prediction models that only observe frames from the beginning of the video. GCP instead treats videos as start-goal transformations, making video generation easier by conditioning on the more informative context provided by the first and final frames. Not only do existing forward prediction approaches synthesize better and longer videos when modified to become goal-conditioned, but GCP models can also utilize structures that are not linear in time to accomplish hierarchical prediction. To this end, we study both auto-regressive GCP models and novel tree-structured GCP models that generate frames recursively, splitting the video iteratively into finer and finer segments delineated by subgoals. In experiments across simulated and real datasets, our GCP methods generate high-quality sequences over long horizons. Tree-structured GCPs are also substantially easier to parallelize than auto-regressive GCPs, making training and inference very efficient, and allowing the model to train on sequences that are thousands of frames in length. Finally, we demonstrate the utility of GCP approaches for imitation learning in the setting without access to expert actions. Videos are on the supplementary website: https://sites.google.com/view/video-gcp |
Tasks | Imitation Learning, Video Generation, Video Prediction |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1g79grKPr |
https://openreview.net/pdf?id=B1g79grKPr | |
PWC | https://paperswithcode.com/paper/goal-conditioned-video-prediction |
Repo | |
Framework | |
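A sketch of the tree-structured idea: given a start and a goal frame, predict an intermediate subgoal and recurse on each half, so the sequence is filled in hierarchically rather than left to right. The predictor network, frame representation, and recursion depth are placeholders.

```python
# Recursive subgoal prediction: (start, goal) -> middle frame, then recurse on each half.
# The predictor and frame representation are placeholders, not the paper's model.
import torch

predictor = torch.nn.Sequential(                       # placeholder: (start, goal) -> middle frame
    torch.nn.Linear(2 * 64, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64))

def tree_gcp(start, goal, depth):
    """Recursively fill in frames between `start` and `goal`; returns the full sequence."""
    if depth == 0:
        return [start, goal]
    middle = predictor(torch.cat([start, goal], dim=-1))
    left = tree_gcp(start, middle, depth - 1)
    right = tree_gcp(middle, goal, depth - 1)
    return left[:-1] + right                           # avoid duplicating the shared subgoal

start_frame, goal_frame = torch.randn(64), torch.randn(64)
sequence = tree_gcp(start_frame, goal_frame, depth=3)
print(len(sequence))                                   # 2**3 + 1 frames
```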
Toward Amortized Ranking-Critical Training For Collaborative Filtering
Title | Toward Amortized Ranking-Critical Training For Collaborative Filtering |
Authors | Anonymous |
Abstract | We investigate new methods for training collaborative filtering models based on actor-critic reinforcement learning, to more directly maximize ranking-based objective functions. Specifically, we train a critic network to approximate ranking-based metrics, and then update the actor network to directly optimize against the learned metrics. In contrast to traditional learning-to-rank methods that require re-running the optimization procedure for new lists, our critic-based method amortizes the scoring process with a neural network, and can directly provide the (approximate) ranking scores for new lists. We demonstrate the actor-critic’s ability to significantly improve the performance of a variety of prediction models, and achieve better or comparable performance to the state-of-the-art on three large-scale datasets. |
Tasks | Learning-To-Rank |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJxR7R4FvS |
https://openreview.net/pdf?id=HJxR7R4FvS | |
PWC | https://paperswithcode.com/paper/toward-amortized-ranking-critical-training |
Repo | |
Framework | |
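A sketch of the amortized ranking-critic pattern: a critic is fit to a ranking metric (DCG here, as a stand-in) computed on the actor's item scores, and the actor is then updated against the critic's differentiable output. The networks, metric, and single alternating update are illustrative only.

```python
# Critic learns to predict a ranking metric from item scores; actor is then updated to
# maximize the critic's differentiable surrogate. All components are toy placeholders.
import torch

actor = torch.nn.Linear(8, 20)        # user features -> scores over 20 items (placeholder)
critic = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def dcg(scores, relevance):
    order = scores.argsort(descending=True)
    gains = relevance[order]
    discounts = 1.0 / torch.log2(torch.arange(len(gains), dtype=torch.float) + 2.0)
    return (gains * discounts).sum()

user, relevance = torch.randn(8), torch.randint(0, 2, (20,)).float()
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

scores = actor(user)
# 1) fit the critic to the (non-differentiable) ranking metric
critic_loss = (critic(scores.detach()).squeeze() - dcg(scores.detach(), relevance)) ** 2
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
# 2) update the actor against the learned, differentiable surrogate
actor_loss = -critic(actor(user)).squeeze()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
print(critic_loss.item(), actor_loss.item())
```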
Improving SAT Solver Heuristics with Graph Networks and Reinforcement Learning
Title | Improving SAT Solver Heuristics with Graph Networks and Reinforcement Learning |
Authors | Anonymous |
Abstract | We present GQSAT, a branching heuristic in a Boolean SAT solver trained with value-based reinforcement learning (RL) using Graph Neural Networks for function approximation. Solvers using GQSAT are complete SAT solvers that either provide a satisfying assignment or a proof of unsatisfiability, which is required for many SAT applications. The branching heuristics commonly used in SAT solvers today suffer from bad decisions during their warm-up period, whereas GQSAT has been trained to examine the structure of the particular problem instance to make better decisions at the beginning of the search. Training GQSAT is data efficient and does not require elaborate dataset preparation or feature engineering. We train GQSAT on small SAT problems using RL interfacing with an existing SAT solver. We show that GQSAT is able to reduce the number of iterations required to solve SAT problems by 2-3X, and that it generalizes to unsatisfiable SAT instances, as well as to problems with 5X more variables than it was trained on. We also show that, to a lesser extent, it generalizes to SAT problems from different domains by evaluating it on graph coloring. Our experiments show that augmenting SAT solvers with agents trained with RL and graph neural networks can improve performance on the SAT search problem. |
Tasks | Feature Engineering |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1lCn64tvS |
https://openreview.net/pdf?id=B1lCn64tvS | |
PWC | https://paperswithcode.com/paper/improving-sat-solver-heuristics-with-graph-1 |
Repo | |
Framework | |
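As a small illustration of the kind of input a GNN-based branching heuristic consumes, the snippet below encodes a CNF formula as a signed variable-clause adjacency matrix and picks a variable from a toy score. Both the encoding and the score are placeholders, not GQSAT's graph representation or its learned value function.

```python
# Encode a CNF formula as a signed variable-clause graph and pick a branching variable
# from a toy occurrence-count score. Placeholder encoding and score, not GQSAT's.
import numpy as np

def cnf_to_graph(clauses, num_vars):
    """Bipartite adjacency: rows are variables, columns are clauses; +1/-1 for polarity."""
    adj = np.zeros((num_vars, len(clauses)))
    for j, clause in enumerate(clauses):
        for literal in clause:
            adj[abs(literal) - 1, j] = np.sign(literal)
    return adj

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
clauses = [[1, -2], [2, 3], [-1, -3]]
adj = cnf_to_graph(clauses, num_vars=3)
toy_scores = np.abs(adj).sum(axis=1)       # stand-in "branching score": variable occurrence counts
print("branch on variable", int(toy_scores.argmax()) + 1)
```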