April 1, 2020

3002 words 15 mins read

Paper Group NANR 105

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. Adversarial Video Generation on Complex Datasets. Information Geometry of Orthogonal Initializations and Training. Towards Hierarchical Importance Attribution: Explaining Compositional Semantics …

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

Title Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks
Authors Anonymous
Abstract The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rkgqN1SYvr
PDF https://openreview.net/pdf?id=rkgqN1SYvr
PWC https://paperswithcode.com/paper/provable-benefit-of-orthogonal-initialization
Repo
Framework
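
As a concrete illustration of the two schemes compared above, here is a NumPy sketch (arbitrary width and depth, not the authors' code): orthogonal initial weights keep the end-to-end linear map an isometry at any depth, while a product of iid Gaussian layers quickly distorts the singular value spectrum.

```python
import numpy as np

def gaussian_init(n, scale=1.0):
    """Standard iid Gaussian initialization, variance scaled by fan-in."""
    return np.random.randn(n, n) * scale / np.sqrt(n)

def orthogonal_init(n):
    """Draw a random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    a = np.random.randn(n, n)
    q, r = np.linalg.qr(a)
    # Fix column signs so the distribution is uniform over the orthogonal group.
    q *= np.sign(np.diag(r))
    return q

# With depth L, the product of Gaussian layers distorts the singular values,
# while a product of orthogonal layers stays an exact isometry.
n, depth = 64, 50
prod_gauss = np.linalg.multi_dot([gaussian_init(n) for _ in range(depth)])
prod_orth = np.linalg.multi_dot([orthogonal_init(n) for _ in range(depth)])
print("Gaussian product, top singular values:", np.linalg.svd(prod_gauss, compute_uv=False)[:3])
print("Orthogonal product, top singular values:", np.linalg.svd(prod_orth, compute_uv=False)[:3])
```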

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

Title vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Authors Anonymous
Abstract We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
Tasks Speech Recognition
Published 2020-01-01
URL https://openreview.net/forum?id=rylwJxrYDS
PDF https://openreview.net/pdf?id=rylwJxrYDS
PWC https://paperswithcode.com/paper/vq-wav2vec-self-supervised-learning-of
Repo
Framework
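
A minimal PyTorch sketch of the Gumbel-softmax quantization step described in the abstract; the codebook size, feature dimension, and function names here are illustrative rather than the released vq-wav2vec implementation.

```python
import torch
import torch.nn.functional as F

def gumbel_quantize(dense, codebook, tau=1.0, hard=True):
    """Quantize dense frame representations against a learned codebook.

    dense:    (batch, time, dim) continuous features from the encoder
    codebook: (num_codes, dim) learnable embedding vectors
    """
    logits = dense @ codebook.t()                         # (batch, time, num_codes)
    probs = F.gumbel_softmax(logits, tau=tau, hard=hard)  # one-hot in the forward pass
    quantized = probs @ codebook                          # straight-through gradients reach the logits
    codes = probs.argmax(dim=-1)                          # discrete indices usable by NLP-style models
    return quantized, codes

dense = torch.randn(2, 100, 256)
codebook = torch.nn.Parameter(torch.randn(320, 256))
quantized, codes = gumbel_quantize(dense, codebook)
print(quantized.shape, codes.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 100])
```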

Adversarial Video Generation on Complex Datasets

Title Adversarial Video Generation on Complex Datasets
Authors Anonymous
Abstract Generative models of natural images have progressed towards high fidelity samples by the strong leveraging of scale. We attempt to carry this success to the field of video modeling by showing that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work. Our proposed model, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher resolution videos by leveraging a computationally efficient decomposition of its discriminator. We evaluate on the related tasks of video synthesis and video prediction, and achieve new state-of-the-art Fréchet Inception Distance for prediction for Kinetics-600, as well as state-of-the-art Inception Score for synthesis on the UCF-101 dataset, alongside establishing a strong baseline for synthesis on Kinetics-600.
Tasks Video Generation, Video Prediction
Published 2020-01-01
URL https://openreview.net/forum?id=Byx91R4twB
PDF https://openreview.net/pdf?id=Byx91R4twB
PWC https://paperswithcode.com/paper/adversarial-video-generation-on-complex
Repo
Framework
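
As a rough sketch of the dual-discriminator decomposition described above (placeholder architectures, not DVD-GAN's networks): one discriminator scores a few randomly sampled full-resolution frames, and the other scores the whole clip at reduced spatial resolution, which is what keeps the cost manageable for longer, higher-resolution videos.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialD(nn.Module):
    """Scores k randomly sampled full-resolution frames independently."""
    def __init__(self, k=8):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        idx = torch.randint(t, (self.k,))
        frames = video[:, idx].flatten(0, 1)       # (B*k, 3, H, W)
        return self.net(frames).view(b, self.k).mean(dim=1)

class TemporalD(nn.Module):
    """Scores the entire clip after spatial downsampling."""
    def __init__(self, factor=2):
        super().__init__()
        self.factor = factor
        self.net = nn.Sequential(nn.Conv3d(3, 64, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, video):                      # video: (B, T, 3, H, W)
        v = video.permute(0, 2, 1, 3, 4)           # (B, 3, T, H, W) for Conv3d
        v = F.avg_pool3d(v, (1, self.factor, self.factor))
        return self.net(v).squeeze(-1)

video = torch.randn(2, 16, 3, 64, 64)
print(SpatialD()(video).shape, TemporalD()(video).shape)  # torch.Size([2]) torch.Size([2])
```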

Information Geometry of Orthogonal Initializations and Training

Title Information Geometry of Orthogonal Initializations and Training
Authors Anonymous
Abstract Recently, mean field theory has been successfully used to analyze properties of wide, random neural networks. It gave rise to a prescriptive theory for initializing feed-forward neural networks with orthogonal weights, which ensures that both the forward propagated activations and the backpropagated gradients are near ℓ2 isometries and, as a consequence, training is orders of magnitude faster. Despite strong empirical performance, the mechanisms by which critical initializations confer an advantage in the optimization of deep neural networks are poorly understood. Here we show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness) as measured by the Fisher information matrix (FIM) and the spectral radius of the input-output Jacobian, which partially explains why more isometric networks can train much faster. Furthermore, given that orthogonal weights are necessary to ensure that gradient norms are approximately preserved at initialization, we experimentally investigate the benefits of maintaining orthogonality throughout training, and we conclude that manifold optimization of weights performs well regardless of the smoothness of the gradients. Moreover, we observe a surprising yet robust behavior of highly isometric initializations: even though such networks have a lower FIM condition number at initialization, and therefore by analogy to convex functions should be easier to optimize, experimentally they prove to be much harder to train with stochastic gradient descent. We propose an explanation for this phenomenon by exploiting connections between Fisher geometry and the recently introduced Neural Tangent Kernel.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rkg1ngrFPr
PDF https://openreview.net/pdf?id=rkg1ngrFPr
PWC https://paperswithcode.com/paper/information-geometry-of-orthogonal-1
Repo
Framework
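
A small autograd sketch of the Jacobian quantity the abstract refers to, evaluated for a toy orthogonally initialized network; this illustrates the measurement itself, not the paper's analysis or architectures.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(128, 128), nn.Tanh(),
                    nn.Linear(128, 128), nn.Tanh(),
                    nn.Linear(128, 10))

# Orthogonal initialization of the weight matrices (dynamical-isometry style).
for m in net.modules():
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight)
        nn.init.zeros_(m.bias)

x = torch.randn(128)
J = torch.autograd.functional.jacobian(net, x)   # (10, 128) input-output Jacobian
largest_singular_value = torch.linalg.svdvals(J)[0]
print(float(largest_singular_value))             # the quantity linked to gradient smoothness via the FIM
```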

Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models

Title Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models
Authors Anonymous
Abstract Deep neural networks have achieved impressive performance in handling complicated semantics in natural language, while mostly treated as black boxes. To explain how the model handles compositional semantics of words and phrases, we study the hierarchical explanation problem. We highlight that the key challenge is to compute non-additive and context-independent importance for individual words and phrases. We show that some prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically, leading to inconsistent explanation quality in different models. In this paper, we propose a formal way to quantify the importance of each word or phrase to generate hierarchical explanations. We modify contextual decomposition algorithms according to our formulation, and propose a model-agnostic explanation algorithm with competitive performance. Human evaluation and automatic metrics evaluation on both LSTM models and fine-tuned BERT Transformer models on multiple datasets show that our algorithms robustly outperform prior works on hierarchical explanations. We show our algorithms help explain compositionality of semantics, extract classification rules, and improve human trust of models.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BkxRRkSKwr
PDF https://openreview.net/pdf?id=BkxRRkSKwr
PWC https://paperswithcode.com/paper/towards-hierarchical-importance-attribution
Repo
Framework
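
To make the non-additivity point concrete, here is a toy occlusion-style sketch (not the paper's contextual-decomposition-based algorithm): the importance of a phrase is generally not the sum of its words' importances, even for a simple mean-pooled classifier. The model and sizes below are made up.

```python
import torch
import torch.nn as nn

# Stand-in sentence classifier over mean-pooled token embeddings (not the paper's models).
vocab, dim = 1000, 16
embed = nn.Embedding(vocab, dim)
clf = nn.Linear(dim, 1)

def score(tokens, mask):
    """Model output with the masked positions removed from the input."""
    kept = tokens[mask]
    if len(kept) == 0:
        return clf(torch.zeros(dim)).squeeze()
    return clf(embed(kept).mean(dim=0)).squeeze()

tokens = torch.randint(vocab, (6,))
full = torch.ones(6, dtype=torch.bool)

def importance(positions):
    """Occlusion-style importance of a set of positions: drop them and measure the change."""
    mask = full.clone()
    mask[positions] = False
    return float(score(tokens, full) - score(tokens, mask))

phrase = importance([2, 3])                     # importance of the phrase spanning positions 2-3
additive = importance([2]) + importance([3])    # sum of the individual word importances
print(phrase, additive)                         # these generally differ: importance is non-additive
```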

GRAPH ANALYSIS AND GRAPH POOLING IN THE SPATIAL DOMAIN

Title GRAPH ANALYSIS AND GRAPH POOLING IN THE SPATIAL DOMAIN
Authors Anonymous
Abstract The spatial convolution layer which is widely used in the Graph Neural Networks (GNNs) aggregates the feature vector of each node with the feature vectors of its neighboring nodes. The GNN is not aware of the locations of the nodes in the global structure of the graph and when the local structures corresponding to different nodes are similar to each other, the convolution layer maps all those nodes to similar or same feature vectors in the continuous feature space. Therefore, the GNN cannot distinguish two graphs if their difference is not in their local structures. In addition, when the nodes are not labeled/attributed, the convolution layers can fail to distinguish even different local structures. In this paper, we propose an effective solution to address this problem of the GNNs. The proposed approach leverages a spatial representation of the graph which makes the neural network aware of the differences between the nodes and also their locations in the graph. The spatial representation, which is equivalent to a point-cloud representation of the graph, is obtained by a graph embedding method. Using the proposed approach, the local feature extractor of the GNN distinguishes similar local structures in different locations of the graph and the GNN infers the topological structure of the graph from the spatial distribution of the locally extracted feature vectors. Moreover, the spatial representation is utilized to simplify the graph down-sampling problem. A new graph pooling method is proposed, and it is shown to achieve competitive or better results in comparison with the state-of-the-art methods.
Tasks Graph Embedding
Published 2020-01-01
URL https://openreview.net/forum?id=r1eQeCEYwB
PDF https://openreview.net/pdf?id=r1eQeCEYwB
PWC https://paperswithcode.com/paper/graph-analysis-and-graph-pooling-in-the-1
Repo
Framework
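
A toy sketch of the underlying idea (a Laplacian eigenmap stands in for the paper's graph embedding method): node coordinates give the network a point-cloud view of the graph, so nodes with identical local structure but different global locations receive different inputs.

```python
import numpy as np

def laplacian_coordinates(adj, dim=2):
    """Embed nodes into R^dim using the smallest non-trivial Laplacian eigenvectors."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    eigvals, eigvecs = np.linalg.eigh(lap)
    return eigvecs[:, 1:dim + 1]          # skip the constant eigenvector

# Toy 6-node graph: two triangles joined by one edge.
adj = np.zeros((6, 6))
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
for i, j in edges:
    adj[i, j] = adj[j, i] = 1

coords = laplacian_coordinates(adj)             # (6, 2) "point cloud" for the graph
features = np.ones((6, 4))                      # unlabeled nodes get constant features
spatial_features = np.concatenate([features, coords], axis=1)
print(spatial_features.shape)                   # (6, 6): convolution layers can now tell similar
                                                # local structures apart by their location
```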

Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization

Title Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization
Authors Anonymous
Abstract An open question in the Deep Learning community is why neural networks trained with Gradient Descent generalize well on real datasets even though they are capable of fitting random data. We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent that we call Coherent Gradients: Gradients from similar examples are similar and so the overall gradient is stronger in certain directions where these reinforce each other. Thus changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples when such similarity exists. We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive. It suggests a natural modification to gradient descent that can greatly reduce overfitting.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=ryeFY0EFwS
PDF https://openreview.net/pdf?id=ryeFY0EFwS
PWC https://paperswithcode.com/paper/coherent-gradients-an-approach-to
Repo
Framework
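
A minimal sketch of the coherence intuition (not the paper's exact procedure): compute per-example gradients and measure how strongly they reinforce each other via their average pairwise cosine similarity.

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 2)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(8, 20), torch.randint(2, (8,))

# Per-example gradients of the same parameter tensor.
grads = []
for i in range(len(x)):
    model.zero_grad()
    loss_fn(model(x[i:i + 1]), y[i:i + 1]).backward()
    grads.append(model.weight.grad.flatten().clone())
grads = torch.stack(grads)                               # (8, num_params)

# "Coherence": how much per-example gradients point in the same directions.
normed = grads / grads.norm(dim=1, keepdim=True)
cos = normed @ normed.t()
coherence = (cos.sum() - cos.trace()) / (len(x) * (len(x) - 1))
print(float(coherence))   # near 1 when examples agree, near 0 when they do not
```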

Understanding Why Neural Networks Generalize Well Through GSNR of Parameters

Title Understanding Why Neural Networks Generalize Well Through GSNR of Parameters
Authors Anonymous
Abstract As deep neural networks (DNNs) achieve tremendous success across many application domains, researchers have explored many aspects of why they generalize well. In this paper, we provide a novel perspective on these issues using the gradient signal to noise ratio (GSNR) of parameters during the training process of DNNs. The GSNR of a parameter is simply defined as the ratio between its gradient’s squared mean and variance, over the data distribution. Based on several approximations, we establish a quantitative relationship between model parameters’ GSNR and the generalization gap. This relationship indicates that a larger GSNR during the training process leads to better generalization performance. Further, we show that, unlike shallow models (e.g. logistic regression, support vector machines), the gradient descent optimization dynamics of DNNs naturally produce large GSNR during training, which is probably the key to DNNs’ remarkable generalization ability.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HyevIJStwH
PDF https://openreview.net/pdf?id=HyevIJStwH
PWC https://paperswithcode.com/paper/understanding-why-neural-networks-generalize
Repo
Framework
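
The GSNR definition is simple enough to state directly in code: for each parameter, it is the squared mean of its per-example gradient divided by the variance over the data. The sketch below estimates it empirically for a toy model; it illustrates the definition only, not the paper's experiments.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
x, y = torch.randn(256, 10), torch.randn(256, 1)

# Collect per-example gradients of every parameter.
per_example = []
for i in range(len(x)):
    model.zero_grad()
    loss_fn(model(x[i:i + 1]), y[i:i + 1]).backward()
    per_example.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
g = torch.stack(per_example)                 # (num_examples, num_params)

# GSNR per parameter: squared mean of the gradient over the data, divided by its variance.
gsnr = g.mean(dim=0) ** 2 / g.var(dim=0)
print(gsnr.mean())                           # higher values are linked to a smaller generalization gap
```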

Intrinsic Motivation for Encouraging Synergistic Behavior

Title Intrinsic Motivation for Encouraging Synergistic Behavior
Authors Anonymous
Abstract We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not achieve individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SJleNCNtDH
PDF https://openreview.net/pdf?id=SJleNCNtDH
PWC https://paperswithcode.com/paper/intrinsic-motivation-for-encouraging
Repo
Framework
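
A schematic of the proposed intrinsic reward (the model and environment interface below are placeholders): the bonus for a joint action is the error made by composing single-agent effect predictions, so joint actions whose effect is well explained by the agents acting alone earn little reward.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 16, 4

# Predicts the effect of one agent acting alone on the current state.
f_single = nn.Linear(state_dim + action_dim, state_dim)

def composed_prediction(state, a1, a2):
    """Compose per-agent predictions: apply agent 1's predicted effect, then agent 2's."""
    s1 = f_single(torch.cat([state, a1], dim=-1))
    return f_single(torch.cat([s1, a2], dim=-1))

def intrinsic_reward(state, a1, a2, next_state):
    """Bonus is the error of the composed single-agent model on the true joint outcome."""
    with torch.no_grad():
        pred = composed_prediction(state, a1, a2)
    return (pred - next_state).pow(2).mean(dim=-1)

s, ns = torch.randn(32, state_dim), torch.randn(32, state_dim)
a1, a2 = torch.randn(32, action_dim), torch.randn(32, action_dim)
print(intrinsic_reward(s, a1, a2, ns).shape)   # (32,): one bonus per transition
```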

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Title ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Authors Anonymous
Abstract Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1eA7AEtvS
PDF https://openreview.net/pdf?id=H1eA7AEtvS
PWC https://paperswithcode.com/paper/albert-a-lite-bert-for-self-supervised-1
Repo
Framework
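
The two parameter-reduction techniques are factorized embedding parameterization and cross-layer parameter sharing; a minimal PyTorch sketch of both follows, with made-up sizes rather than the released ALBERT configuration.

```python
import torch
import torch.nn as nn

vocab, embed_dim, hidden, layers = 30000, 128, 768, 12

# Factorized embedding: a small vocab x E table projected up to the hidden size,
# instead of a full vocab x H table (30000*128 + 128*768 params vs 30000*768).
token_embed = nn.Embedding(vocab, embed_dim)
embed_proj = nn.Linear(embed_dim, hidden)

# Cross-layer parameter sharing: one transformer block reused at every depth.
shared_block = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)

def encode(token_ids):
    h = embed_proj(token_embed(token_ids))
    for _ in range(layers):
        h = shared_block(h)       # the same weights are applied 12 times
    return h

print(encode(torch.randint(vocab, (2, 16))).shape)   # torch.Size([2, 16, 768])
```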

SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference

Title SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference
Authors Anonymous
Abstract We present a modern scalable reinforcement learning agent called SEED (Scalable, Efficient Deep-RL). By effectively utilizing modern accelerators, we show that it is not only possible to train on millions of frames per second but also to lower the cost of experiments compared to current methods. We achieve this with a simple architecture that features centralized inference and an optimized communication layer. SEED adopts two state-of-the-art distributed algorithms, IMPALA/V-trace (policy gradients) and R2D2 (Q-learning), and is evaluated on Atari-57, DeepMind Lab and Google Research Football. We improve the state of the art on Football and are able to reach state of the art on Atari-57 twice as fast in wall-time. For the scenarios we consider, a 40% to 80% cost reduction for running experiments is achieved. The implementation along with experiments is open-sourced so results can be reproduced and novel ideas tried out.
Tasks Q-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=rkgvXlrKwH
PDF https://openreview.net/pdf?id=rkgvXlrKwH
PWC https://paperswithcode.com/paper/seed-rl-scalable-and-efficient-deep-rl-with
Repo
Framework

Adversarial Attacks on Copyright Detection Systems

Title Adversarial Attacks on Copyright Detection Systems
Authors Anonymous
Abstract It is well-known that many machine learning models are susceptible to adversarial attacks, in which an attacker evades a classifier by making small perturbations to inputs. This paper discusses how industrial copyright detection tools, which serve a central role on the web, are susceptible to adversarial attacks. We discuss a range of copyright detection systems, and why they are particularly vulnerable to attacks. These vulnerabilities are especially apparent for neural network based systems. As proof of concept, we describe a well-known music identification method and implement this system in the form of a neural net. We then attack this system using simple gradient methods. Adversarial music created this way successfully fools industrial systems, including the AudioTag copyright detector and YouTube’s Content ID system. Our goal is to raise awareness of the threats posed by adversarial examples in this space and to highlight the importance of hardening copyright detection systems to attacks.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SJlRWC4FDB
PDF https://openreview.net/pdf?id=SJlRWC4FDB
PWC https://paperswithcode.com/paper/adversarial-attacks-on-copyright-detection-1
Repo
Framework
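
A generic sketch of the attack family described above (a stand-in convolutional fingerprinter, not the paper's reimplementation of a music identification system): one signed-gradient step perturbs the waveform to push it away from the detector's match.

```python
import torch
import torch.nn as nn

# Stand-in for a differentiable fingerprint/identification network.
fingerprinter = nn.Sequential(nn.Conv1d(1, 16, 1024, stride=512), nn.ReLU(),
                              nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 100))
loss_fn = nn.CrossEntropyLoss()

def evade(waveform, true_song_id, epsilon=1e-3):
    """One FGSM-style step: perturb the audio to reduce the detector's confidence."""
    waveform = waveform.clone().requires_grad_(True)
    loss = loss_fn(fingerprinter(waveform), true_song_id)
    loss.backward()
    # Move *up* the loss so the detector no longer matches the true song.
    return (waveform + epsilon * waveform.grad.sign()).detach()

audio = torch.randn(1, 1, 16000)            # one second of mono audio at 16 kHz
adv_audio = evade(audio, torch.tensor([17]))
print((adv_audio - audio).abs().max())      # perturbation bounded by epsilon
```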

Goal-Conditioned Video Prediction

Title Goal-Conditioned Video Prediction
Authors Anonymous
Abstract Many processes can be concisely represented as a sequence of events leading from a starting state to an end state. Given raw ingredients and a finished cake, an experienced chef can surmise the recipe. Building upon this intuition, we propose a new class of visual generative models: goal-conditioned predictors (GCP). Prior work on video generation largely focuses on prediction models that only observe frames from the beginning of the video. GCP instead treats videos as start-goal transformations, making video generation easier by conditioning on the more informative context provided by the first and final frames. Not only do existing forward prediction approaches synthesize better and longer videos when modified to become goal-conditioned, but GCP models can also utilize structures that are not linear in time, to accomplish hierarchical prediction. To this end, we study both auto-regressive GCP models and novel tree-structured GCP models that generate frames recursively, splitting the video iteratively into finer and finer segments delineated by subgoals. In experiments across simulated and real datasets, our GCP methods generate high-quality sequences over long horizons. Tree-structured GCPs are also substantially easier to parallelize than auto-regressive GCPs, making training and inference very efficient, and allowing the model to train on sequences that are thousands of frames in length. Finally, we demonstrate the utility of GCP approaches for imitation learning in the setting without access to expert actions. Videos are on the supplementary website: https://sites.google.com/view/video-gcp
Tasks Imitation Learning, Video Generation, Video Prediction
Published 2020-01-01
URL https://openreview.net/forum?id=B1g79grKPr
PDF https://openreview.net/pdf?id=B1g79grKPr
PWC https://paperswithcode.com/paper/goal-conditioned-video-prediction
Repo
Framework
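
A schematic of the tree-structured variant (the predictor is a placeholder): given the first and final frames, a subgoal model predicts the midpoint frame, and recursion fills in finer and finer segments; every level of the tree can be generated in parallel.

```python
import torch
import torch.nn as nn

frame_dim = 64
# Placeholder subgoal predictor: given (start, goal), predict the frame halfway between them.
midpoint = nn.Sequential(nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim))

def gcp_tree(start, goal, depth):
    """Recursively split the video into finer segments delineated by subgoals."""
    if depth == 0:
        return [start]
    mid = midpoint(torch.cat([start, goal], dim=-1))
    return gcp_tree(start, mid, depth - 1) + gcp_tree(mid, goal, depth - 1)

start, goal = torch.randn(frame_dim), torch.randn(frame_dim)
frames = gcp_tree(start, goal, depth=4) + [goal]
print(len(frames))   # 17 frames generated from only the first and last observations
```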

Toward Amortized Ranking-Critical Training For Collaborative Filtering

Title Toward Amortized Ranking-Critical Training For Collaborative Filtering
Authors Anonymous
Abstract We investigate new methods for training collaborative filtering models based on actor-critic reinforcement learning, to more directly maximize ranking-based objective functions. Specifically, we train a critic network to approximate ranking-based metrics, and then update the actor network to directly optimize against the learned metrics. In contrast to traditional learning-to-rank methods that require re-running the optimization procedure for new lists, our critic-based method amortizes the scoring process with a neural network, and can directly provide the (approximate) ranking scores for new lists. We demonstrate the actor-critic’s ability to significantly improve the performance of a variety of prediction models, and achieve better or comparable performance to the state-of-the-art on three large-scale datasets.
Tasks Learning-To-Rank
Published 2020-01-01
URL https://openreview.net/forum?id=HJxR7R4FvS
PDF https://openreview.net/pdf?id=HJxR7R4FvS
PWC https://paperswithcode.com/paper/toward-amortized-ranking-critical-training
Repo
Framework
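
A compressed sketch of the actor-critic idea (illustrative shapes and networks): the critic is regressed onto a non-differentiable ranking metric such as NDCG, and the actor is then updated through the frozen critic's differentiable approximation of that metric.

```python
import torch
import torch.nn as nn

num_items = 100
actor = nn.Linear(num_items, num_items)    # maps a user's history vector to item scores
critic = nn.Sequential(nn.Linear(2 * num_items, 128), nn.ReLU(), nn.Linear(128, 1))

def ndcg(scores, relevance, k=10):
    """Non-differentiable ranking metric the critic learns to approximate."""
    order = scores.argsort(dim=-1, descending=True)[..., :k]
    gains = relevance.gather(-1, order)
    discounts = 1.0 / torch.log2(torch.arange(2, k + 2, dtype=torch.float))
    ideal = relevance.sort(dim=-1, descending=True).values[..., :k]
    return (gains * discounts).sum(-1) / ((ideal * discounts).sum(-1) + 1e-8)

history = torch.rand(32, num_items)
relevance = (torch.rand(32, num_items) > 0.95).float()

# 1) Critic step: regress the true metric from (scores, relevance).
scores = actor(history).detach()
critic_in = torch.cat([scores, relevance], dim=-1)
critic_loss = (critic(critic_in).squeeze(-1) - ndcg(scores, relevance)).pow(2).mean()

# 2) Actor step: maximize the critic's differentiable approximation of the metric.
actor_loss = -critic(torch.cat([actor(history), relevance], dim=-1)).mean()
print(float(critic_loss), float(actor_loss))
```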

Improving SAT Solver Heuristics with Graph Networks and Reinforcement Learning

Title Improving SAT Solver Heuristics with Graph Networks and Reinforcement Learning
Authors Anonymous
Abstract We present GQSAT, a branching heuristic in a Boolean SAT solver trained with value-based reinforcement learning (RL) using Graph Neural Networks for function approximation. Solvers using GQSAT are complete SAT solvers that either provide a satisfying assignment or a proof of unsatisfiability, which is required for many SAT applications. The branching heuristic commonly used in SAT solvers today suffers from bad decisions during its warm-up period, whereas GQSAT has been trained to examine the structure of the particular problem instance to make better decisions at the beginning of the search. Training GQSAT is data efficient and does not require elaborate dataset preparation or feature engineering. We train GQSAT on small SAT problems using RL interfacing with an existing SAT solver. We show that GQSAT is able to reduce the number of iterations required to solve SAT problems by 2-3X, and it generalizes to unsatisfiable SAT instances, as well as to problems with 5X more variables than it was trained on. We also show that, to a lesser extent, it generalizes to SAT problems from different domains by evaluating it on graph coloring. Our experiments show that augmenting SAT solvers with agents trained with RL and graph neural networks can improve performance on the SAT search problem.
Tasks Feature Engineering
Published 2020-01-01
URL https://openreview.net/forum?id=B1lCn64tvS
PDF https://openreview.net/pdf?id=B1lCn64tvS
PWC https://paperswithcode.com/paper/improving-sat-solver-heuristics-with-graph-1
Repo
Framework
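
A very rough sketch of what a learned branching heuristic of this kind looks like (not GQSAT itself): the clause-variable graph is processed by a couple of message-passing steps and a Q-value head scores each (variable, polarity) literal; the features, dimensions, and networks are placeholders.

```python
import torch
import torch.nn as nn

# Bipartite clause-variable graph for (x1 ∨ ¬x2) ∧ (x2 ∨ x3): rows = clauses, cols = variables.
adj = torch.tensor([[1., 1., 0.],
                    [0., 1., 1.]])

var_feat = torch.randn(3, 8) * 0.1     # per-variable features (placeholders)
clause_feat = torch.zeros(2, 8)        # per-clause features

msg_v2c = nn.Linear(8, 8)
msg_c2v = nn.Linear(8, 8)
q_head = nn.Linear(8, 2)               # one Q-value per polarity (branch on x or ¬x)

# Two rounds of message passing over the clause-variable graph.
for _ in range(2):
    clause_feat = torch.relu(msg_v2c(adj @ var_feat) + clause_feat)
    var_feat = torch.relu(msg_c2v(adj.t() @ clause_feat) + var_feat)

q_values = q_head(var_feat)                       # (num_vars, 2)
branch = q_values.flatten().argmax()
print("branch on literal", int(branch))           # index into (variable, polarity) pairs
```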