April 1, 2020

3080 words 15 mins read

Paper Group NANR 134

Learning from Partially-Observed Multimodal Data with Variational Autoencoders

Title Learning from Partially-Observed Multimodal Data with Variational Autoencoders
Authors Anonymous
Abstract Learning from only partially-observed data for imputation has been an active research area. Despite promising progress on unimodal data imputation (e.g., image in-painting), models designed for multimodal data imputation are far from satisfactory. In this paper, we propose variational selective autoencoders (VSAE) for this task. Different from previous works, our proposed VSAE learns only from partially-observed data. The proposed VSAE is capable of learning the joint distribution of observed and unobserved modalities as well as the imputation mask, resulting in a unified model for various downstream tasks including data generation and imputation. Evaluation on both synthetic high-dimensional and challenging low-dimensional multimodal datasets shows significant improvement over state-of-the-art data imputation models.
Tasks Imputation
Published 2020-01-01
URL https://openreview.net/forum?id=rylT0AVtwH
PDF https://openreview.net/pdf?id=rylT0AVtwH
PWC https://paperswithcode.com/paper/learning-from-partially-observed-multimodal
Repo
Framework

Efficient meta reinforcement learning via meta goal generation

Title Efficient meta reinforcement learning via meta goal generation
Authors Anonymous
Abstract Meta reinforcement learning (meta-RL) is able to accelerate the acquisition of new tasks by learning from past experience. Current meta-RL methods usually learn to adapt to new tasks by directly optimizing the parameters of policies over primitive actions. However, for complex tasks which require sophisticated control strategies, it would be quite inefficient to directly learn such a meta-policy. Moreover, this problem can become more severe and even lead to failure in sparse reward settings, which are quite common in practice. To this end, we propose a new meta-RL algorithm called meta goal-generation for hierarchical RL (MGHRL), which leverages a hierarchical actor-critic framework. Instead of directly generating policies over primitive actions for new tasks, MGHRL learns to generate high-level meta strategies over subgoals given past experience, and leaves how to achieve the subgoals as independent RL subtasks. Our empirical results on several challenging simulated robotics environments show that our method enables more efficient and effective meta-learning from past experience and outperforms state-of-the-art meta-RL and hierarchical-RL methods in sparse reward settings.
Tasks Meta-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=rkgl51rKDB
PDF https://openreview.net/pdf?id=rkgl51rKDB
PWC https://paperswithcode.com/paper/efficient-meta-reinforcement-learning-via
Repo
Framework

ASYNCHRONOUS MULTI-AGENT GENERATIVE ADVERSARIAL IMITATION LEARNING

Title ASYNCHRONOUS MULTI-AGENT GENERATIVE ADVERSARIAL IMITATION LEARNING
Authors Anonymous
Abstract Imitation learning aims to inversely learn a policy from expert demonstrations, and has been extensively studied in the literature for both the single-agent setting with a Markov decision process (MDP) model and the multi-agent setting with a Markov game (MG) model. However, existing approaches for general multi-agent Markov games are not applicable to multi-agent extensive Markov games, where agents make asynchronous decisions following a certain order, rather than simultaneous decisions. We propose a novel framework for asynchronous multi-agent generative adversarial imitation learning (AMAGAIL) under general extensive Markov game settings, and the learned expert policies are proven to guarantee subgame perfect equilibrium (SPE), a more general and stronger equilibrium than Nash equilibrium (NE). The experimental results demonstrate that, compared to state-of-the-art baselines, our AMAGAIL model can better infer the policy of each expert agent using demonstration data collected from asynchronous decision-making scenarios (i.e., extensive Markov games).
Tasks Decision Making, Imitation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=Syx33erYwH
PDF https://openreview.net/pdf?id=Syx33erYwH
PWC https://paperswithcode.com/paper/asynchronous-multi-agent-generative
Repo
Framework

Dynamic Instance Hardness

Title Dynamic Instance Hardness
Authors Anonymous
Abstract We introduce dynamic instance hardness (DIH) to facilitate the training of machine learning models. DIH is a property of each training sample and is computed as the running mean of the sample’s instantaneous hardness as measured over the training history. We use DIH to evaluate how well a model retains knowledge about each training sample over time. We find that for deep neural nets (DNNs), the DIH of a sample in relatively early training stages reflects its DIH in later stages and as a result, DIH can be effectively used to reduce the set of training samples in future epochs. Specifically, during each epoch, only samples with high DIH are trained (since they are historically hard) while samples with low DIH can be safely ignored. DIH is updated each epoch only for the selected samples, so it does not require additional computation. Hence, using DIH during training leads to an appreciable speedup. Also, since the model is focused on the historically more challenging samples, resultant models are more accurate. The above, when formulated as an algorithm, can be seen as a form of curriculum learning, so we call our framework DIH curriculum learning (or DIHCL). The advantages of DIHCL, compared to other curriculum learning approaches, are: (1) DIHCL does not require additional inference steps over the data not selected by DIHCL in each epoch, (2) the dynamic instance hardness, compared to static instance hardness (e.g., instantaneous loss), is more stable as it integrates information over the entire training history up to the present time. Making certain mathematical assumptions, we formulate the problem of DIHCL as finding a curriculum that maximizes a multi-set function $f(\cdot)$, and derive an approximation bound for a DIH-produced curriculum relative to the optimal curriculum. 
Empirically, DIHCL-trained DNNs significantly outperform random mini-batch SGD and other recently developed curriculum learning methods in terms of efficiency, early-stage convergence, and final performance, and this is shown in training several state-of-the-art DNNs on 11 modern datasets.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1lXVJStwB
PDF https://openreview.net/pdf?id=H1lXVJStwB
PWC https://paperswithcode.com/paper/dynamic-instance-hardness
Repo
Framework
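The DIH bookkeeping described in the abstract (a running mean of per-sample hardness, used to pick which samples to train on each epoch) can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code; the decay rate `gamma`, the selection fraction `frac`, and the use of the loss as the instantaneous hardness measure are assumptions.

```python
import numpy as np

def dih_select(losses, dih, gamma=0.9, frac=0.5):
    """One epoch of a DIH-style curriculum step (illustrative sketch).

    `dih` holds each sample's running-mean hardness; we train only on the
    `frac` historically hardest samples and update DIH for those alone,
    so no extra inference over unselected samples is needed.
    """
    k = max(1, int(frac * len(dih)))
    selected = np.argsort(-dih)[:k]  # historically hardest samples first
    # exponential running mean of instantaneous hardness (here, the loss),
    # updated only for the samples actually trained this epoch
    dih[selected] = gamma * dih[selected] + (1 - gamma) * losses[selected]
    return selected, dih
```

Note that samples with low DIH are skipped entirely, which is where the claimed speedup over curricula requiring extra inference passes would come from.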

The Power of Semantic Similarity based Soft-Labeling for Generalized Zero-Shot Learning

Title The Power of Semantic Similarity based Soft-Labeling for Generalized Zero-Shot Learning
Authors Anonymous
Abstract Zero-Shot Learning (ZSL) is a classification task in which some classes, referred to as unseen classes, have no labeled training images. Instead, we only have side information (or descriptions) about the seen and unseen classes, often in the form of semantic or descriptive attributes. The lack of training images for a set of classes restricts the use of standard classification techniques and losses, including the popular cross-entropy loss. The key step in tackling the ZSL problem is bridging the visual and semantic spaces via a learned nonlinear embedding. A well-established approach is to obtain the semantic representation of the visual information and perform classification in the semantic space. In this paper, we propose a novel architecture that casts ZSL as a fully connected neural network with a cross-entropy loss, embedding the visual space into the semantic space. During training, in order to introduce unseen visual information to the network, we utilize soft labeling based on semantic similarities between seen and unseen classes. To the best of our knowledge, such similarity-based soft labeling has not been explored for cross-modal transfer and ZSL. We evaluate the proposed model on five benchmark datasets for zero-shot learning (AwA1, AwA2, aPY, SUN and CUB) and show that, despite its simplicity, our approach achieves state-of-the-art performance in the Generalized-ZSL setting on all of these datasets and outperforms the state of the art on some of them.
Tasks Semantic Similarity, Semantic Textual Similarity, Zero-Shot Learning
Published 2020-01-01
URL https://openreview.net/forum?id=B1lmSeHKwB
PDF https://openreview.net/pdf?id=B1lmSeHKwB
PWC https://paperswithcode.com/paper/the-power-of-semantic-similarity-based-soft
Repo
Framework
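The soft-labeling idea, as described, replaces a seen-class one-hot target with a distribution that leaks some probability mass onto unseen classes in proportion to attribute similarity. A hypothetical sketch (the mass `alpha` kept on the true class and the cosine-plus-softmax weighting are our assumptions, not details taken from the paper):

```python
import numpy as np

def soft_labels(attr_seen, attrs_unseen, alpha=0.8, temp=1.0):
    """Soft target for one seen-class sample: probability `alpha` on the
    true class, the remaining 1 - alpha spread over unseen classes by a
    softmax over cosine similarity of class attributes (hypothetical)."""
    sims = attrs_unseen @ attr_seen
    sims = sims / (np.linalg.norm(attrs_unseen, axis=1)
                   * np.linalg.norm(attr_seen) + 1e-12)
    w = np.exp(sims / temp)
    w /= w.sum()
    # [p(true seen class), p(unseen class 1), p(unseen class 2), ...]
    return np.concatenate(([alpha], (1 - alpha) * w))
```

Targets of this form can be fed directly to a standard cross-entropy loss, which is how unseen-class information would reach the network without any unseen-class images.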

Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness

Title Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness
Authors Anonymous
Abstract Adversarial examples are malicious inputs crafted to cause a model to misclassify them. In their most common instantiation, “perturbation-based” adversarial examples introduce changes to the input that leave its true label unchanged, yet result in a different model prediction. Conversely, “invariance-based” adversarial examples insert changes to the input that leave the model’s prediction unaffected despite the underlying input’s label having changed. So far, the relationship between these two notions of adversarial examples has not been studied; we close this gap. We demonstrate that solely achieving perturbation-based robustness is insufficient for complete adversarial robustness. Worse, we find that classifiers trained to be Lp-norm robust are more vulnerable to invariance-based adversarial examples than their undefended counterparts. We construct theoretical arguments and analytical examples to justify why this is the case. We then illustrate empirically that the consequences of excessive perturbation-robustness can be exploited to craft new attacks. Finally, we show how to attack a provably robust defense (certified on the MNIST test set to have at least 87% accuracy, with respect to the original test labels, under perturbations of L-infinity norm below epsilon=0.4) and reduce its accuracy (under this threat model, with respect to an ensemble of human labelers) to 60% with an automated attack, or to just 12% with human-crafted adversarial examples.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=r1eVX0EFvH
PDF https://openreview.net/pdf?id=r1eVX0EFvH
PWC https://paperswithcode.com/paper/exploiting-excessive-invariance-caused-by-1
Repo
Framework

Assessing Generalization in TD methods for Deep Reinforcement Learning

Title Assessing Generalization in TD methods for Deep Reinforcement Learning
Authors Anonymous
Abstract Current Deep Reinforcement Learning (DRL) methods can exhibit both data inefficiency and brittleness, which seem to indicate that they generalize poorly. In this work, we experimentally analyze this issue through the lens of memorization, and show that it can be observed directly during training. More precisely, we find that Deep Neural Networks (DNNs) trained with supervised tasks on trajectories capture temporal structure well, but DNNs trained with TD(0) methods struggle to do so, while using TD(lambda) targets leads to better generalization.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Hyl8yANFDB
PDF https://openreview.net/pdf?id=Hyl8yANFDB
PWC https://paperswithcode.com/paper/assessing-generalization-in-td-methods-for
Repo
Framework

Tensorized Embedding Layers for Efficient Model Compression

Title Tensorized Embedding Layers for Efficient Model Compression
Authors Anonymous
Abstract The embedding layers transforming input words into real vectors are key components of the deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous, which precludes their deployment in a limited-resource setting. We introduce a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model significantly at the cost of a negligible drop, or even a slight gain, in performance. We evaluate our method on a wide range of benchmarks in natural language processing and analyze the trade-off between performance and compression ratio for a wide range of architectures, from MLPs to LSTMs and Transformers.
Tasks Model Compression
Published 2020-01-01
URL https://openreview.net/forum?id=S1e4Q6EtDH
PDF https://openreview.net/pdf?id=S1e4Q6EtDH
PWC https://paperswithcode.com/paper/tensorized-embedding-layers-for-efficient-1
Repo
Framework
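A minimal two-core version of a tensorized embedding conveys the parameter saving: the vocabulary and embedding dimensions are each factorized, and a row of the implicit embedding matrix is reconstructed from small cores at lookup time. This is a toy sketch of the TT idea, not the paper's parametrization (which allows general TT-ranks and more than two cores):

```python
import numpy as np

rng = np.random.default_rng(0)
v1, v2, d1, d2, r = 10, 10, 4, 4, 3   # vocab 10*10=100, emb dim 4*4=16, rank 3
G1 = rng.normal(size=(v1, d1, r))     # first TT core
G2 = rng.normal(size=(v2, r, d2))     # second TT core

def tt_embed(idx):
    """Row `idx` of an implicit (100, 16) embedding matrix stored as two
    small cores: 240 parameters instead of 1600."""
    i1, i2 = divmod(idx, v2)                # factorize the row index
    return (G1[i1] @ G2[i2]).reshape(-1)    # (4,3) @ (3,4) -> flat (16,)
```

The compression ratio grows with vocabulary size, since core sizes scale with the factors of the vocabulary rather than the vocabulary itself.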

Fast Bilinear Matrix Normalization via Rank-1 Update

Title Fast Bilinear Matrix Normalization via Rank-1 Update
Authors Anonymous
Abstract Bilinear pooling has achieved an impressive improvement over classical average and max pooling in many computer vision tasks. Recent studies have discovered that matrix normalization is vital for improving the performance of bilinear pooling, since it effectively suppresses burstiness. Nevertheless, existing matrix normalization methods such as the matrix square root and matrix logarithm are based on singular value decomposition (SVD), which is not well supported on GPU platforms, limiting their efficiency in training and inference. To boost efficiency on GPUs, recent methods rely on the Newton-Schulz (NS) iteration, which approximates the matrix square root through several matrix-matrix multiplications. Although the NS iteration is well supported by GPUs, it has $\mathcal{O}(KD^3)$ computational complexity, where $D$ is the dimension of the local features and $K$ is the number of iterations, which is still costly. Moreover, the NS iteration is applicable only to the full bilinear matrix; a compact bilinear feature obtained from tensor sketching or random projection breaks the matrix structure and cannot be normalized by the NS iteration. To overcome these limitations, we propose rank-1 update normalization (RUN), which reduces the computational cost from $\mathcal{O}(KD^3)$ to $\mathcal{O}(KDN)$, where $N$ is the number of local features per image. More importantly, it supports normalization of compact bilinear features. The proposed RUN is also differentiable, so it can be plugged into a convolutional neural network as a layer to support end-to-end training. Comprehensive experiments on four public benchmarks show that, for full bilinear pooling, the proposed RUN achieves comparable accuracy with a $330\times$ speedup over the NS iteration. For compact bilinear pooling, our RUN achieves comparable accuracy with a $5400\times$ speedup over SVD-based normalization.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HJe1gaVtwS
PDF https://openreview.net/pdf?id=HJe1gaVtwS
PWC https://paperswithcode.com/paper/fast-bilinear-matrix-normalization-via-rank-1
Repo
Framework
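For context, the Newton-Schulz baseline that RUN accelerates approximates the matrix square root using only matrix-matrix products, which is why it maps well to GPUs despite its $\mathcal{O}(KD^3)$ cost. A standard sketch of the coupled iteration (this is the baseline, not the paper's RUN method):

```python
import numpy as np

def ns_sqrt(A, K=25):
    """Coupled Newton-Schulz iteration for the square root of a symmetric
    positive-definite matrix A, using only matrix-matrix products
    (K steps, O(K D^3) total)."""
    D = A.shape[0]
    norm = np.linalg.norm(A)        # pre-scale so the iteration converges
    Y, Z = A / norm, np.eye(D)
    for _ in range(K):
        T = 0.5 * (3.0 * np.eye(D) - Z @ Y)
        Y, Z = Y @ T, T @ Z         # Y -> sqrt(A/norm), Z -> its inverse
    return Y * np.sqrt(norm)
```

Each step is three $D \times D$ products, which is exactly the $KD^3$ term that RUN's $\mathcal{O}(KDN)$ rank-1 updates are designed to avoid.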

Robustness and/or Redundancy Emerge in Overparametrized Deep Neural Networks

Title Robustness and/or Redundancy Emerge in Overparametrized Deep Neural Networks
Authors Anonymous
Abstract Deep neural networks (DNNs) perform well on a variety of tasks despite the fact that most networks used in practice are vastly overparametrized and even capable of perfectly fitting randomly labeled data. Recent evidence suggests that developing “compressible” representations is key to adjusting the complexity of overparametrized networks to the task at hand and avoiding overfitting (Arora et al., 2018; Zhou et al., 2018). In this paper, we provide new empirical evidence supporting this hypothesis, identifying two independent mechanisms that emerge when the network’s width is increased: robustness (having units that can be removed without affecting accuracy) and redundancy (having units with similar activity). In a series of experiments with AlexNet, ResNet and Inception networks on the CIFAR-10 and ImageNet datasets, and also with shallow networks on synthetic data, we show that DNNs consistently increase either their robustness, their redundancy, or both at greater widths, across a comprehensive set of hyperparameters. These results suggest that networks in the deep learning regime adjust their effective capacity by developing either robustness or redundancy.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=S1xRbxHYDr
PDF https://openreview.net/pdf?id=S1xRbxHYDr
PWC https://paperswithcode.com/paper/robustness-andor-redundancy-emerge-in
Repo
Framework

GraphFlow: Exploiting Conversation Flow with Graph Neural Networks for Conversational Machine Comprehension

Title GraphFlow: Exploiting Conversation Flow with Graph Neural Networks for Conversational Machine Comprehension
Authors Anonymous
Abstract Conversational machine comprehension (MC) has proven significantly more challenging compared to traditional MC since it requires better utilization of conversation history. However, most existing approaches do not effectively capture conversation history and thus have trouble handling questions involving coreference or ellipsis. We propose a novel graph neural network (GNN) based model, namely GraphFlow, which captures conversational flow in the dialog. Specifically, we first propose a new approach to dynamically construct a question-aware context graph from passage text at each turn. We then present a novel flow mechanism to model the temporal dependencies in the sequence of context graphs. The proposed GraphFlow model shows superior performance compared to existing state-of-the-art methods. For instance, GraphFlow outperforms two recently proposed models on the CoQA benchmark dataset: FlowQA by 2.3% and SDNet by 0.7% on F1 score, respectively. In addition, visualization experiments show that our proposed model can better mimic the human reasoning process for conversational MC compared to existing models.
Tasks Reading Comprehension
Published 2020-01-01
URL https://openreview.net/forum?id=rkgi6JSYvB
PDF https://openreview.net/pdf?id=rkgi6JSYvB
PWC https://paperswithcode.com/paper/graphflow-exploiting-conversation-flow-with-1
Repo
Framework

Simultaneous Classification and Out-of-Distribution Detection Using Deep Neural Networks

Title Simultaneous Classification and Out-of-Distribution Detection Using Deep Neural Networks
Authors Anonymous
Abstract Deep neural networks have achieved great success in classification tasks in recent years. However, one major obstacle on the path towards artificial intelligence is the inability of neural networks to accurately detect samples from novel class distributions; therefore, most existing classification algorithms assume that all classes are known prior to the training stage. In this work, we propose a methodology for training a neural network that allows it to efficiently detect out-of-distribution (OOD) examples without compromising much of its classification accuracy on test examples from known classes. Based on the Outlier Exposure (OE) technique, we propose a novel loss function that achieves state-of-the-art results in out-of-distribution detection with OE on both image and text classification tasks. Additionally, the way this method is constructed makes it suitable for training any classification algorithm that is based on maximum likelihood methods.
Tasks Out-of-Distribution Detection
Published 2020-01-01
URL https://openreview.net/forum?id=Hyez1CVYvr
PDF https://openreview.net/pdf?id=Hyez1CVYvr
PWC https://paperswithcode.com/paper/simultaneous-classification-and-out-of
Repo
Framework

Towards neural networks that provably know when they don’t know

Title Towards neural networks that provably know when they don’t know
Authors Anonymous
Abstract It has recently been shown that ReLU networks produce arbitrarily over-confident predictions far away from the training data; thus, ReLU networks do not know when they don’t know. However, this is a highly important property in safety-critical applications. In the context of out-of-distribution (OOD) detection there have been a number of proposals to mitigate this problem, but none of them provides mathematical guarantees. In this paper we propose a new approach to OOD detection which overcomes both problems. Our approach can be used with ReLU networks and provides provably low-confidence predictions far away from the training data, as well as the first certificates for low-confidence predictions in a neighborhood of an out-distribution point. In experiments we show that state-of-the-art methods fail in this worst-case setting, whereas our model can guarantee its performance while retaining state-of-the-art OOD performance.
Tasks Out-of-Distribution Detection
Published 2020-01-01
URL https://openreview.net/forum?id=ByxGkySKwH
PDF https://openreview.net/pdf?id=ByxGkySKwH
PWC https://paperswithcode.com/paper/towards-neural-networks-that-provably-know-1
Repo
Framework

Deep Evidential Uncertainty

Title Deep Evidential Uncertainty
Authors Anonymous
Abstract Deterministic neural networks (NNs) are increasingly being deployed in safety critical domains, where calibrated, robust and efficient measures of uncertainty are crucial. While it is possible to train regression networks to output the parameters of a probability distribution by maximizing a Gaussian likelihood function, the resulting model remains oblivious to the underlying confidence of its predictions. In this paper, we propose a novel method for training deterministic NNs to not only estimate the desired target but also the associated evidence in support of that target. We accomplish this by placing evidential priors over our original Gaussian likelihood function and training our NN to infer the hyperparameters of our evidential distribution. We impose priors during training such that the model is penalized when its predicted evidence is not aligned with the correct output. Thus the model estimates not only the probabilistic mean and variance of our target but also the underlying uncertainty associated with each of those parameters. We observe that our evidential regression method learns well-calibrated measures of uncertainty on various benchmarks, scales to complex computer vision tasks, and is robust to adversarial input perturbations.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=S1eSoeSYwr
PDF https://openreview.net/pdf?id=S1eSoeSYwr
PWC https://paperswithcode.com/paper/deep-evidential-uncertainty
Repo
Framework
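Once a network head outputs the four hyperparameters of a Normal-Inverse-Gamma evidential distribution, the prediction and both kinds of uncertainty follow from its standard moments. A sketch using the usual NIG formulas (parameter names follow common evidential-regression notation and are an assumption, not taken from this paper's text):

```python
def nig_predict(gamma, nu, alpha, beta):
    """Prediction and uncertainties from Normal-Inverse-Gamma evidential
    parameters (gamma, nu, alpha, beta), assuming alpha > 1 and nu > 0."""
    prediction = gamma                       # E[mu]
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2], data noise
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu], model uncertainty
    return prediction, aleatoric, epistemic
```

The appeal is that a single deterministic forward pass yields both the aleatoric and the epistemic term, with no sampling or ensembling.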

Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

Title Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
Authors Anonymous
Abstract As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed on clusters to perform model fitting in parallel. Alistarh et al. (2017) describe two variants of data-parallel SGD that quantize and encode gradients to lessen communication costs. For the first variant, QSGD, they provide strong theoretical guarantees. For the second variant, which we call QSGDinf, they demonstrate impressive empirical gains for distributed training of large neural networks. Building on their work, we propose an alternative scheme for quantizing gradients and show that it yields stronger theoretical guarantees than exist for QSGD while matching the empirical performance of QSGDinf.
Tasks Quantization
Published 2020-01-01
URL https://openreview.net/forum?id=HyeJmlrFvH
PDF https://openreview.net/pdf?id=HyeJmlrFvH
PWC https://paperswithcode.com/paper/provably-communication-efficient-data
Repo
Framework
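The uniform stochastic quantizer of Alistarh et al. (2017), which this paper refines with nonuniform level placement, rounds each normalized coordinate up or down at random so that the quantized gradient stays unbiased. A sketch of that uniform baseline (the level count `s` is a free parameter; the nonuniform scheme proposed here would change where the levels sit, which this sketch does not show):

```python
import numpy as np

def quantize(g, s=4, rng=None):
    """QSGD-style uniform stochastic quantization of a gradient vector to
    s+1 magnitude levels per coordinate; unbiased by construction."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    pos = np.abs(g) / norm * s      # position of each magnitude in [0, s]
    lo = np.floor(pos)
    round_up = rng.random(g.shape) < (pos - lo)   # stochastic rounding
    return np.sign(g) * (lo + round_up) * norm / s
```

Only the norm, the signs, and the small integer levels need to be communicated, which is the source of the bandwidth saving in data-parallel SGD.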