April 1, 2020

Paper Group NANR 2

InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization

Title InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization
Authors Anonymous
Abstract This paper studies learning the representations of whole graphs in both unsupervised and semi-supervised scenarios. Graph-level representations are critical in a variety of real-world applications such as predicting the properties of molecules and community analysis in social networks. Traditional graph-kernel-based methods are simple yet effective for obtaining fixed-length representations of graphs, but they suffer from poor generalization due to their hand-crafted designs. There are also recent methods based on language models (e.g., graph2vec), but they tend to consider only certain substructures (e.g., subtrees) as graph representatives. Inspired by recent progress in unsupervised representation learning, in this paper we propose a novel method called InfoGraph for learning graph-level representations. We maximize the mutual information between the graph-level representation and the representations of substructures of different scales (e.g., nodes, edges, triangles). By doing so, the graph-level representations encode aspects of the data that are shared across different scales of substructures. Furthermore, we propose InfoGraph*, an extension of InfoGraph for semi-supervised scenarios. InfoGraph* maximizes the mutual information between unsupervised graph representations learned by InfoGraph and the representations learned by existing supervised methods. As a result, the supervised encoder learns from unlabeled data while preserving the latent semantic space favored by the current supervised task. Experimental results on the tasks of graph classification and molecular property prediction show that InfoGraph is superior to state-of-the-art baselines and InfoGraph* can achieve performance competitive with state-of-the-art semi-supervised models.
Tasks Graph Classification, Molecular Property Prediction, Representation Learning, Unsupervised Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=r1lfF2NYvH
PDF https://openreview.net/pdf?id=r1lfF2NYvH
PWC https://paperswithcode.com/paper/infograph-unsupervised-and-semi-supervised-1
Repo
Framework
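
Below is a minimal, hedged sketch of an InfoGraph-style objective: maximize mutual information between node-level (local) and graph-level (global) embeddings with a binary discriminator, using another graph in the batch as the negative. The toy encoder, dense adjacency input, and discriminator are illustrative assumptions, not the paper's architecture.

```python
# Sketch of an InfoGraph-style mutual-information objective (illustrative, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGraphEncoder(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency matrix."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, adj, x):
        # adj: (N, N), x: (N, in_dim) for a single graph
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = F.relu(self.lin((adj @ x) / deg))      # node (local) embeddings
        g = h.sum(dim=0)                           # graph (global) embedding via sum pooling
        return h, g

def mi_loss(node_emb, graph_emb, neg_graph_emb, disc):
    """Binary-cross-entropy MI estimator: nodes vs. their own graph (positives)
    and vs. another graph in the batch (negatives)."""
    pos = disc(node_emb * graph_emb)               # (N, 1) scores for positive pairs
    neg = disc(node_emb * neg_graph_emb)           # (N, 1) scores for negative pairs
    return F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) + \
           F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg))

# Toy usage with two random graphs.
enc = TinyGraphEncoder(in_dim=8, hid_dim=16)
disc = nn.Linear(16, 1)                            # simple product-based discriminator
adj1, x1 = torch.randint(0, 2, (5, 5)).float(), torch.randn(5, 8)
adj2, x2 = torch.randint(0, 2, (6, 6)).float(), torch.randn(6, 8)
h1, g1 = enc(adj1, x1)
_, g2 = enc(adj2, x2)
loss = mi_loss(h1, g1, g2, disc)
loss.backward()
```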

VAENAS: Sampling Matters in Neural Architecture Search

Title VAENAS: Sampling Matters in Neural Architecture Search
Authors Anonymous
Abstract Neural Architecture Search (NAS) aims at automatically finding neural network architectures within an enormous designed search space. The search space usually contains billions of network architectures, which makes searching for the best-performing architecture extremely expensive. One-shot and gradient-based NAS approaches have recently been shown to achieve superior results on various computer vision tasks such as image recognition. With the weight-sharing mechanism, these methods lead to efficient model search. Despite their success, however, current sampling methods are either fixed or hand-crafted and thus ineffective. In this paper, we propose a learnable sampling module based on a variational auto-encoder (VAE) for neural architecture search (NAS), named VAENAS, which can be easily embedded into existing weight-sharing NAS frameworks, e.g., one-shot and gradient-based approaches, and significantly improve the quality of the search results. VAENAS generates a series of competitive results on CIFAR-10 and ImageNet in a NasNet-like search space. Moreover, combined with a one-shot approach, our method achieves a new state-of-the-art result for ImageNet classification models under 400M FLOPs, with 77.4% accuracy, in a ShuffleNet-like search space. Finally, we conduct a thorough analysis of VAENAS on the NAS-Bench-101 dataset, which demonstrates the effectiveness of our proposed method.
Tasks Neural Architecture Search
Published 2020-01-01
URL https://openreview.net/forum?id=S1xKYJSYwS
PDF https://openreview.net/pdf?id=S1xKYJSYwS
PWC https://paperswithcode.com/paper/vaenas-sampling-matters-in-neural
Repo
Framework
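
Below is a minimal sketch of a learnable sampler realized as a VAE over architecture encodings, in the spirit of the abstract above; the one-hot layer/op encoding, dimensions, and training loop are illustrative assumptions rather than the paper's exact module.

```python
# Sketch of a VAE-based architecture sampler (illustrative; not the paper's exact module).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LAYERS, NUM_OPS, LATENT = 8, 5, 16   # assumed encoding: 8 layers, 5 candidate ops each

class ArchVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(NUM_LAYERS * NUM_OPS, 2 * LATENT)   # outputs mean and log-variance
        self.dec = nn.Linear(LATENT, NUM_LAYERS * NUM_OPS)

    def forward(self, arch_onehot):
        mu, logvar = self.enc(arch_onehot).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization trick
        logits = self.dec(z).view(-1, NUM_LAYERS, NUM_OPS)
        return logits, mu, logvar

    def sample(self, n):
        """Draw n candidate architectures (one op index per layer) from the prior."""
        z = torch.randn(n, LATENT)
        logits = self.dec(z).view(n, NUM_LAYERS, NUM_OPS)
        return logits.argmax(dim=-1)

vae = ArchVAE()
archs = F.one_hot(torch.randint(0, NUM_OPS, (32, NUM_LAYERS)), NUM_OPS).float().view(32, -1)
logits, mu, logvar = vae(archs)
targets = archs.view(32, NUM_LAYERS, NUM_OPS).argmax(-1).view(-1)
recon = F.cross_entropy(logits.reshape(-1, NUM_OPS), targets)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
(recon + kl).backward()
print(vae.sample(4))   # 4 sampled architectures to evaluate in the weight-sharing supernet
```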

Contextual Temperature for Language Modeling

Title Contextual Temperature for Language Modeling
Authors Anonymous
Abstract Temperature scaling has been widely used to improve performance on NLP tasks that use a softmax decision layer. Current practices either assume a fixed temperature or a dynamically changing temperature that follows a fixed schedule. Little is known about the optimal temperature trajectory when it is allowed to change with the context. In this paper, we propose contextual temperature, a mechanism that allows the temperature of each vocabulary item to change over the context and to co-adapt with model parameters during training. Experimental results show that contextual temperature improves significantly over state-of-the-art language models. Our model, CT-MoS, achieves a perplexity of 55.31 on the Penn Treebank test set and 62.89 on the WikiText-2 test set. In-depth analysis shows that the temperature schedule varies dramatically across vocabulary items, and that the optimal temperature trajectory drops as the context becomes longer, suppressing uncertainty in language modeling. This evidence further justifies the need for contextual temperature and explains its performance advantage over fixed or scheduled temperatures.
Tasks Language Modelling
Published 2020-01-01
URL https://openreview.net/forum?id=H1x9004YPr
PDF https://openreview.net/pdf?id=H1x9004YPr
PWC https://paperswithcode.com/paper/contextual-temperature-for-language-modeling
Repo
Framework
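
A minimal sketch of the core mechanism described above: a temperature vector with one entry per vocabulary item is predicted from the context and divides the logits before the softmax, so temperatures co-adapt with the model during training. The softplus parameterization and linear heads are assumptions for illustration; the paper's CT-MoS model is more elaborate.

```python
# Sketch of contextual (per-token, per-step) temperature applied to language-model logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualTemperatureHead(nn.Module):
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.logit_head = nn.Linear(hidden_dim, vocab_size)
        self.temp_head = nn.Linear(hidden_dim, vocab_size)    # one temperature per vocab entry

    def forward(self, h):
        logits = self.logit_head(h)                           # (batch, vocab)
        temps = F.softplus(self.temp_head(h)) + 1e-3          # positive, context-dependent
        return F.log_softmax(logits / temps, dim=-1)          # temperature-scaled distribution

head = ContextualTemperatureHead(hidden_dim=256, vocab_size=10000)
h = torch.randn(4, 256)                                       # hidden states from any LM backbone
log_probs = head(h)
loss = F.nll_loss(log_probs, torch.randint(0, 10000, (4,)))
loss.backward()                                               # temperatures co-adapt with the model
```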

Keyframing the Future: Discovering Temporal Hierarchy with Keyframe-Inpainter Prediction

Title Keyframing the Future: Discovering Temporal Hierarchy with Keyframe-Inpainter Prediction
Authors Anonymous
Abstract To flexibly and efficiently reason about temporal sequences, we need abstract representations that compactly capture the important information in the sequence. One way to construct such representations is to focus on the important events in a sequence. In this paper, we propose a model that learns both to discover such key events (or keyframes) and to represent the sequence in terms of them. We do so using a hierarchical Keyframe-Inpainter (KeyIn) model that first generates keyframes and their temporal placement and then inpaints the sequences between keyframes. We propose a fully differentiable formulation for efficiently learning the keyframe placement. We show that KeyIn finds informative keyframes in several datasets with diverse dynamics. When evaluated on a planning task, KeyIn outperforms other recent proposals for learning hierarchical representations.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BklfR3EYDH
PDF https://openreview.net/pdf?id=BklfR3EYDH
PWC https://paperswithcode.com/paper/keyframing-the-future-discovering-temporal
Repo
Framework
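
One way to make keyframe placement differentiable, sketched below under stated assumptions: each predicted keyframe carries a softmax distribution over time steps, and it is matched against the expectation of the ground-truth frames under that distribution, so gradients reach both the keyframes and their placement. This is an illustration of the idea, not the KeyIn architecture.

```python
# Sketch of differentiable keyframe placement via soft (expected) temporal targets.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, K, D = 2, 20, 4, 32          # batch, sequence length, number of keyframes, frame dim

class KeyframePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.frame_head = nn.Linear(D, K * D)   # predicts K keyframes from a context vector
        self.time_head = nn.Linear(D, K * T)    # predicts placement logits over T steps

    def forward(self, context):
        frames = self.frame_head(context).view(-1, K, D)
        time_logits = self.time_head(context).view(-1, K, T)
        return frames, F.softmax(time_logits, dim=-1)

model = KeyframePredictor()
context = torch.randn(B, D)                     # e.g. an encoding of the observed prefix
gt_frames = torch.randn(B, T, D)                # ground-truth future frames

pred_frames, placement = model(context)         # placement: (B, K, T), rows sum to 1
soft_targets = torch.einsum('bkt,btd->bkd', placement, gt_frames)  # expected frame per keyframe
loss = F.mse_loss(pred_frames, soft_targets)    # gradients flow to both frames and placement
loss.backward()
```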

Flexible and Efficient Long-Range Planning Through Curious Exploration

Title Flexible and Efficient Long-Range Planning Through Curious Exploration
Authors Anonymous
Abstract Identifying algorithms that flexibly and efficiently discover temporally-extended multi-phase plans is an essential next step for the advancement of robotics and model-based reinforcement learning. The core problem of long-range planning is finding an efficient way to search through the tree of possible action sequences — which, if left unchecked, grows exponentially with the length of the plan. Existing non-learned planning solutions from the Task and Motion Planning (TAMP) literature rely on the existence of logical descriptions for the effects and preconditions for actions. This constraint allows TAMP methods to efficiently reduce the tree search problem but limits their ability to generalize to unseen and complex physical environments. In contrast, deep reinforcement learning (DRL) methods use flexible neural-network-based function approximators to discover policies that generalize naturally to unseen circumstances. However, DRL methods have had trouble dealing with the very sparse reward landscapes inherent to long-range multi-step planning situations. Here, we propose the Curious Sample Planner (CSP), which fuses elements of TAMP and DRL by using a curiosity-guided sampling strategy to learn to efficiently explore the tree of action effects. We show that CSP can efficiently discover interesting and complex temporally-extended plans for solving a wide range of physically realistic 3D tasks. In contrast, standard DRL and random sampling methods often fail to solve these tasks at all or do so only with a huge and highly variable number of training samples. We explore the use of a variety of curiosity metrics with CSP and analyze the types of solutions that CSP discovers. Finally, we show that CSP supports task transfer so that the exploration policies learned during experience with one task can help improve efficiency on related tasks.
Tasks Motion Planning
Published 2020-01-01
URL https://openreview.net/forum?id=r1xo9grKPr
PDF https://openreview.net/pdf?id=r1xo9grKPr
PWC https://paperswithcode.com/paper/flexible-and-efficient-long-range-planning
Repo
Framework
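
A hedged sketch of curiosity-guided tree expansion: candidate nodes are scored by the prediction error of a learned forward dynamics model, expansion is biased toward high-error (novel) states, and the model is trained on visited transitions so already-explored regions become unattractive. The environment stub and all names below are illustrative, not the CSP implementation.

```python
# Sketch of a curiosity-guided sample-based planner (illustrative, not the CSP code).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 8, 2

class ForwardModel(nn.Module):
    """Predicts the next state; its error serves as the curiosity score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, STATE_DIM))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def fake_env_step(s, a):
    """Placeholder dynamics; a real planner would call a physics simulator here."""
    return s + 0.1 * torch.randn_like(s)

model = ForwardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tree = [(torch.zeros(STATE_DIM), 1.0)]          # list of (state, curiosity score) nodes

for step in range(200):
    # Sample an expansion node with probability proportional to its curiosity score.
    scores = torch.tensor([c for _, c in tree])
    idx = torch.multinomial(scores / scores.sum(), 1).item()
    s, _ = tree[idx]
    a = torch.randn(ACTION_DIM)                 # random action proposal
    s_next = fake_env_step(s, a)

    # Curiosity = forward-model prediction error on the new transition.
    pred = model(s.unsqueeze(0), a.unsqueeze(0)).squeeze(0)
    curiosity = F.mse_loss(pred, s_next)
    tree.append((s_next, curiosity.item() + 1e-3))

    # Train the model on the visited transition so familiar regions become boring.
    opt.zero_grad(); curiosity.backward(); opt.step()
```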

CWAE-IRL: Formulating a supervised approach to Inverse Reinforcement Learning problem

Title CWAE-IRL: Formulating a supervised approach to Inverse Reinforcement Learning problem
Authors Anonymous
Abstract Inverse reinforcement learning (IRL) is used to infer the reward function from the actions of an expert running a Markov Decision Process (MDP). This work proposes a novel approach that uses variational inference to learn the reward function. With this technique, the intractable posterior distribution of the continuous latent variable (here, the reward function) is analytically approximated to stay as close as possible to the prior belief while reconstructing the future state conditioned on the current state and action. The reward function is derived using a well-known deep generative model, the Conditional Variational Auto-encoder (CVAE), with a Wasserstein loss function, hence referred to as Conditional Wasserstein Auto-encoder-IRL (CWAE-IRL), which can be viewed as a combination of backward and forward inference. This yields an efficient alternative to previous IRL approaches that requires no knowledge of the agent's system dynamics. Experimental results on standard benchmarks such as objectworld and pendulum show that the proposed algorithm can effectively learn the latent reward function in complex, high-dimensional environments.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rJlCXlBtwH
PDF https://openreview.net/pdf?id=rJlCXlBtwH
PWC https://paperswithcode.com/paper/cwae-irl-formulating-a-supervised-approach-to
Repo
Framework
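
A structural sketch of the conditional auto-encoder described above: an encoder infers a latent reward from (state, action, next state) and a decoder reconstructs the next state from (state, action, reward). A crude moment-matching penalty stands in for the Wasserstein regularizer here, so treat this only as an illustration of the architecture, not the paper's objective.

```python
# Sketch of a CWAE-IRL-style conditional auto-encoder over (state, action, next_state).
import torch
import torch.nn as nn
import torch.nn.functional as F

S_DIM, A_DIM, R_DIM = 4, 2, 1

encoder = nn.Sequential(nn.Linear(S_DIM + A_DIM + S_DIM, 64), nn.ReLU(), nn.Linear(64, R_DIM))
decoder = nn.Sequential(nn.Linear(S_DIM + A_DIM + R_DIM, 64), nn.ReLU(), nn.Linear(64, S_DIM))

s = torch.randn(128, S_DIM)          # states from expert demonstrations (toy data here)
a = torch.randn(128, A_DIM)
s_next = torch.randn(128, S_DIM)

r = encoder(torch.cat([s, a, s_next], dim=-1))               # inferred latent reward
s_next_hat = decoder(torch.cat([s, a, r], dim=-1))           # reconstruct the future state

recon = F.mse_loss(s_next_hat, s_next)
# Crude stand-in for the prior-matching term: match the first two moments of r to a
# standard normal prior. The actual paper uses a Wasserstein-style objective instead.
prior_match = r.mean().pow(2) + (r.var() - 1.0).pow(2)
loss = recon + 0.1 * prior_match
loss.backward()
```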

GATO: Gates Are Not the Only Option

Title GATO: Gates Are Not the Only Option
Authors Anonymous
Abstract Recurrent Neural Networks (RNNs) facilitate prediction and generation of structured temporal data such as text and sound. However, training RNNs is hard. Vanishing gradients cause difficulties for learning long-range dependencies. Hidden states can explode for long sequences and send unbounded gradients to model parameters, even when hidden-to-hidden Jacobians are bounded. Models like the LSTM and GRU use gates to bound their hidden state, but most choices of gating functions lead to saturating gradients that contribute to, rather than alleviate, vanishing gradients. Moreover, the performance of these models is not robust across random initializations. In this work, we specify desiderata for sequence models. We develop a model that satisfies them and is capable of learning long-term dependencies, called GATO. GATO is constructed so that part of its hidden state does not have vanishing gradients, regardless of sequence length. We study GATO on copying and arithmetic tasks with long dependencies and on modeling intensive care unit and language data. Training GATO is more stable across random seeds and learning rates than training GRUs and LSTMs. GATO solves these tasks using an order of magnitude fewer parameters.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BylfTySYvB
PDF https://openreview.net/pdf?id=BylfTySYvB
PWC https://paperswithcode.com/paper/gato-gates-are-not-the-only-option
Repo
Framework

SDGM: Sparse Bayesian Classifier Based on a Discriminative Gaussian Mixture Model

Title SDGM: Sparse Bayesian Classifier Based on a Discriminative Gaussian Mixture Model
Authors Anonymous
Abstract In probabilistic classification, a discriminative model based on a Gaussian mixture exhibits flexible fitting capability. Nevertheless, it is difficult to determine the number of components. We propose a sparse classifier based on a discriminative Gaussian mixture model (GMM), named the sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained by sparse Bayesian learning. This learning algorithm improves generalization by obtaining a sparse solution and automatically determines the number of components by removing redundant ones. The SDGM can be embedded into neural networks (NNs), such as convolutional NNs, and trained in an end-to-end manner. Experimental results indicate that the proposed method prevents overfitting by inducing sparsity. Furthermore, we demonstrate that the proposed method outperforms a fully connected layer with the softmax function in certain cases when used as the last layer of a deep NN.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=r1xapAEKwS
PDF https://openreview.net/pdf?id=r1xapAEKwS
PWC https://paperswithcode.com/paper/sdgm-sparse-bayesian-classifier-based-on-a-1
Repo
Framework
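
A minimal sketch of a discriminative Gaussian-mixture classification layer: per-class mixture components with learnable means, diagonal variances, and mixing weights, with class posteriors obtained from mixture log-likelihoods plus class priors. The sparse Bayesian training that prunes redundant components is not shown, and the dimensions are illustrative.

```python
# Sketch of a discriminative (diagonal) Gaussian-mixture classification layer.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeGMM(nn.Module):
    def __init__(self, in_dim, n_classes, n_components):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n_classes, n_components, in_dim))
        self.log_var = nn.Parameter(torch.zeros(n_classes, n_components, in_dim))
        self.mix_logit = nn.Parameter(torch.zeros(n_classes, n_components))
        self.class_logit = nn.Parameter(torch.zeros(n_classes))

    def forward(self, x):                                  # x: (batch, in_dim)
        x = x[:, None, None, :]                            # (batch, 1, 1, in_dim)
        log_norm = -0.5 * (self.log_var + math.log(2 * math.pi)
                           + (x - self.mu).pow(2) / self.log_var.exp()).sum(-1)  # (batch, C, M)
        log_mix = F.log_softmax(self.mix_logit, dim=-1)    # mixture weights per class
        log_lik = torch.logsumexp(log_norm + log_mix, dim=-1)                    # (batch, C)
        return log_lik + F.log_softmax(self.class_logit, dim=-1)  # unnormalized class posteriors

layer = DiscriminativeGMM(in_dim=32, n_classes=10, n_components=3)
feats = torch.randn(16, 32)                                # e.g. outputs of a CNN backbone
loss = F.cross_entropy(layer(feats), torch.randint(0, 10, (16,)))
loss.backward()
```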

On the Dynamics and Convergence of Weight Normalization for Training Neural Networks

Title On the Dynamics and Convergence of Weight Normalization for Training Neural Networks
Authors Anonymous
Abstract We present a proof of convergence for ReLU networks trained with weight normalization. In the analysis, we consider over-parameterized 2-layer ReLU networks initialized at random and trained with batch gradient descent and a fixed step size. The proof builds on recent theoretical works that bound the trajectory of parameters from their initialization and monitor the network predictions via the evolution of a "neural tangent kernel" (Jacot et al. 2018). We discover that training with weight normalization decomposes such a kernel via the so-called "length-direction decoupling". This in turn leads to two convergence regimes and rigorously explains the utility of WeightNorm. From the modified convergence analysis we make a few curious observations, including a natural form of "lazy training" where the direction of each weight vector remains stationary.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJggj3VKPH
PDF https://openreview.net/pdf?id=HJggj3VKPH
PWC https://paperswithcode.com/paper/on-the-dynamics-and-convergence-of-weight
Repo
Framework
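
For reference, weight normalization reparameterizes each weight vector into a length and a direction, w = g · v / ||v||, which is the "length-direction decoupling" the analysis exploits. A minimal PyTorch illustration of the analyzed setting (a 2-layer ReLU network trained by batch gradient descent with a fixed step size) follows; the toy data and widths are placeholders.

```python
# Minimal illustration of weight normalization w = g * v / ||v|| in a 2-layer ReLU network.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.utils.weight_norm(nn.Linear(10, 512)),    # reparameterizes the weight into (g, v)
    nn.ReLU(),
    nn.utils.weight_norm(nn.Linear(512, 1)),
)

x, y = torch.randn(64, 10), torch.randn(64, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)  # fixed step size, as in the analyzed setting
for _ in range(100):                             # batch gradient descent on the full toy batch
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
```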

Characterizing Missing Information in Deep Networks Using Backpropagated Gradients

Title Characterizing Missing Information in Deep Networks Using Backpropagated Gradients
Authors Gukyeong Kwon, Mohit Prabhushankar, Dogancan Temel, Ghassan AlRegib
Abstract Deep networks face challenges in ensuring robustness against inputs that cannot be effectively represented by information learned from the training data. We attribute this vulnerability to the limitations inherent in activation-based representations. To complement the learned information from activation-based representations, we propose utilizing a gradient-based representation that explicitly focuses on missing information. In addition, we propose a directional constraint on the gradients as a training objective to improve the characterization of missing information. To validate the effectiveness of the proposed approach, we compare the anomaly detection performance of gradient-based and activation-based representations. We show that the gradient-based representation outperforms the activation-based representation by 0.093 on CIFAR-10 and 0.361 on CURE-TSR in terms of AUROC averaged over all classes. Also, we propose an anomaly detection algorithm that uses the gradient-based representation, denoted GradCon, and validate its performance on three benchmark datasets. The proposed method outperforms the majority of state-of-the-art algorithms on the CIFAR-10, MNIST, and fMNIST datasets with average AUROCs of 0.664, 0.973, and 0.934, respectively.
Tasks Anomaly Detection
Published 2020-01-01
URL https://openreview.net/forum?id=SJxFWRVKDr
PDF https://openreview.net/pdf?id=SJxFWRVKDr
PWC https://paperswithcode.com/paper/characterizing-missing-information-in-deep
Repo
Framework
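
A hedged sketch of a gradient-based (GradCon-style) anomaly score: an autoencoder is trained with its reconstruction loss plus a term encouraging the current loss gradients to align with their running average, and at test time low cosine similarity between a sample's gradients and that average signals an anomaly. Which layer's gradients are used and the loss weighting are assumptions here.

```python
# Sketch of a gradient-alignment (GradCon-style) anomaly score on an autoencoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

ae = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 784))

def loss_grads(x):
    """Reconstruction loss and its gradients w.r.t. the decoder weights."""
    loss = F.mse_loss(ae(x), x)
    grads = torch.autograd.grad(loss, ae[2].weight, create_graph=True)[0]
    return loss, grads.flatten()

avg_grad = torch.zeros(784 * 128)                  # running average of training gradients
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

for step in range(200):                            # training on in-distribution data (toy here)
    x = torch.randn(32, 784)                       # stand-in for in-distribution images
    loss, g = loss_grads(x)
    align = F.cosine_similarity(g, avg_grad, dim=0)
    (loss - 0.1 * align).backward()                # directional constraint on the gradients
    opt.step(); opt.zero_grad()
    avg_grad = 0.9 * avg_grad + 0.1 * g.detach()

def anomaly_score(x):
    _, g = loss_grads(x)
    return 1.0 - F.cosine_similarity(g.detach(), avg_grad, dim=0)  # higher = more anomalous
```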

CAN ALTQ LEARN FASTER: EXPERIMENTS AND THEORY

Title CAN ALTQ LEARN FASTER: EXPERIMENTS AND THEORY
Authors Anonymous
Abstract Unlike the popular Deep Q-Network (DQN) learning, Alternating Q-learning (AltQ) does not fully fit a target Q-function at each iteration and is generally known to be unstable and inefficient. The limited applications of AltQ mostly rely on substantially altering the algorithm architecture to improve its performance. Although Adam appears to be a natural solution, its performance in AltQ has rarely been studied before. In this paper, we first provide a solid exploration of how well AltQ performs with Adam. We then take a further step to improve the implementation by adopting the technique of parameter restart. The proposed algorithms are tested on a batch of Atari 2600 games and exhibit superior performance to the DQN learning method. The convergence rate of a slightly modified version of the proposed algorithms is characterized under linear function approximation. To the best of our knowledge, this is the first theoretical study of Adam-type algorithms in Q-learning.
Tasks Atari Games, Q-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=H1eD7REtPr
PDF https://openreview.net/pdf?id=H1eD7REtPr
PWC https://paperswithcode.com/paper/can-altq-learn-faster-experiments-and-theory
Repo
Framework
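
A minimal sketch of alternating Q-learning driven by Adam, with a periodic "restart" implemented here as re-initializing the optimizer (and hence Adam's moment estimates). The abstract does not specify how the restart is performed, so that detail, along with the toy network and transitions, is an assumption.

```python
# Sketch of AltQ-style Q-learning with Adam and periodic optimizer restarts.
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # toy Q-network
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
GAMMA, RESTART_EVERY = 0.99, 1000

def q_update(s, a, r, s_next, done):
    """One alternating update: a single gradient step toward the bootstrapped target
    (no inner loop that fully fits a frozen target network, unlike DQN)."""
    with torch.no_grad():
        target = r + GAMMA * (1 - done) * q_net(s_next).max(dim=1).values
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for step in range(5000):
    # Toy transitions; a real agent would sample these from environment interaction.
    s, s_next = torch.randn(32, 4), torch.randn(32, 4)
    a, r, done = torch.randint(0, 2, (32,)), torch.randn(32), torch.zeros(32)
    q_update(s, a, r, s_next, done)
    if (step + 1) % RESTART_EVERY == 0:            # "restart": wipe Adam's moment estimates
        opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```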

Disentangling Trainability and Generalization in Deep Learning

Title Disentangling Trainability and Generalization in Deep Learning
Authors Anonymous
Abstract A fundamental goal in deep learning is the characterization of trainability and generalization of neural networks as a function of their architecture and hyperparameters. In this paper, we discuss these challenging issues in the context of wide neural networks at large depths, where we will see that the situation simplifies considerably. To do this, we leverage recent advances that have separately shown: (1) that in the wide-network limit, random networks before training are Gaussian processes governed by a kernel known as the Neural Network Gaussian Process (NNGP) kernel, (2) that at large depths the spectrum of the NNGP kernel simplifies considerably and becomes "weakly data-dependent", and (3) that gradient descent training of wide neural networks is described by a kernel called the Neural Tangent Kernel (NTK) that is related to the NNGP. Here we show that in the large-depth limit the spectrum of the NTK simplifies in much the same way as that of the NNGP kernel. By analyzing this spectrum, we arrive at a precise characterization of trainability and generalization across a range of architectures including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We find that there are large regions of hyperparameter space where networks will train but will fail to generalize, in contrast with several recent results. By comparing CNNs with and without global average pooling, we show that CNNs without average pooling have very nearly identical learning dynamics to FCNs, while CNNs with pooling contain a correction that alters their generalization performance. We perform a thorough empirical investigation of these theoretical results and find excellent agreement on real datasets.
Tasks Gaussian Processes
Published 2020-01-01
URL https://openreview.net/forum?id=Bkx1mxSKvB
PDF https://openreview.net/pdf?id=Bkx1mxSKvB
PWC https://paperswithcode.com/paper/disentangling-trainability-and-generalization
Repo
Framework
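
For reference, the two kernels the abstract leans on can be stated compactly; these are the standard definitions rather than results specific to this paper: the NNGP kernel is the output covariance of a randomly initialized network, and the NTK is the Gram matrix of the output's parameter gradients.

```latex
% NNGP kernel: covariance of the outputs of a randomly initialized network f_\theta.
\mathcal{K}(x, x') \;=\; \mathbb{E}_{\theta \sim \mathrm{init}}\big[ f_\theta(x)\, f_\theta(x') \big]

% Neural Tangent Kernel: inner product of parameter gradients of the outputs;
% in the infinite-width limit it stays (nearly) constant during gradient-descent training.
\Theta(x, x') \;=\; \nabla_\theta f_\theta(x)^{\top}\, \nabla_\theta f_\theta(x')
```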

On the Relationship between Self-Attention and Convolutional Layers

Title On the Relationship between Self-Attention and Convolutional Layers
Authors Anonymous
Abstract Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, that they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with a sufficient number of heads is at least as powerful as any convolutional layer. Our numerical experiments then show that the phenomenon also occurs in practice, corroborating our analysis. Our code is publicly available.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJlnC1rKPB
PDF https://openreview.net/pdf?id=HJlnC1rKPB
PWC https://paperswithcode.com/paper/on-the-relationship-between-self-attention
Repo
Framework
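
A compact sketch of the construction behind the claim above: with relative positional encodings, a head that puts all of its attention mass on a fixed shift from the query turns multi-head self-attention into a linear map over a fixed neighbourhood, i.e. a convolution. The notation below is schematic.

```latex
% If head h attends only to the pixel at relative shift \Delta_h from query position q,
% multi-head self-attention collapses to a linear map over a fixed set of shifts:
\mathrm{MHSA}(X)_{q} \;=\; \sum_{h=1}^{N_h} X_{q + \Delta_h}\, W^{(h)}_{\mathrm{val}}\, W^{(h)}_{\mathrm{out}}
% which is a convolution with kernel support \{\Delta_h\}_{h=1}^{N_h};
% with N_h = K^2 heads covering a K \times K grid of shifts, this matches a K \times K convolution.
```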

On Understanding Knowledge Graph Representation

Title On Understanding Knowledge Graph Representation
Authors Anonymous
Abstract Many methods have been developed to represent knowledge graph data, which implicitly exploit low-rank latent structure in the data to encode known information and enable unknown facts to be inferred. To predict whether a relationship holds between entities, their embeddings are typically compared in the latent space following a relation-specific mapping. Whilst link prediction has steadily improved, the latent structure, and hence why such models capture semantic information, remains unexplained. We build on recent theoretical interpretation of word embeddings as a basis to consider an explicit structure for representations of relations between entities. For identifiable relation types, we are able to predict properties and justify the relative performance of leading knowledge graph representation methods, including their often overlooked ability to make independent predictions.
Tasks Link Prediction, Word Embeddings
Published 2020-01-01
URL https://openreview.net/forum?id=SygcSlHFvS
PDF https://openreview.net/pdf?id=SygcSlHFvS
PWC https://paperswithcode.com/paper/on-understanding-knowledge-graph
Repo
Framework

Barcodes as summary of objective functions’ topology

Title Barcodes as summary of objective functions’ topology
Authors Anonymous
Abstract We apply canonical forms of gradient complexes (barcodes) to explore the loss surfaces of neural networks. We present an algorithm for calculating the barcodes of minima of an objective function. Our experiments confirm two principal observations: (1) the barcodes of minima are located in a small lower part of the range of values of the objective function, and (2) increasing the neural network's depth brings down the minima's barcodes. This has natural implications for neural network learning and the ability to generalize.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=S1gwC1StwS
PDF https://openreview.net/pdf?id=S1gwC1StwS
PWC https://paperswithcode.com/paper/barcodes-as-summary-of-objective-functions
Repo
Framework