April 1, 2020

3120 words 15 mins read

Paper Group NANR 21

Policy Optimization by Local Improvement through Search. Anchor & Transform: Learning Sparse Representations of Discrete Objects. Low Bias Gradient Estimates for Very Deep Boolean Stochastic Networks. Learning General and Reusable Features via Racecar-Training. Attention Forcing for Sequence-to-sequence Model Training. Stabilizing DARTS with Amende …

Policy Optimization by Local Improvement through Search

Title Policy Optimization by Local Improvement through Search
Authors Anonymous
Abstract Imitation learning has emerged as a powerful strategy for learning initial policies that can be refined with reinforcement learning techniques. Most strategies in imitation learning, however, rely on per-step supervision, either from expert demonstrations (behavioral cloning) or from interactive expert policy queries such as DAgger. These strategies differ in the state distribution at which the expert actions are collected: the former uses the state distribution of the expert, the latter the state distribution of the policy being trained. In both cases, however, the learning signal arises from the expert actions. On the other end of the spectrum, approaches rooted in Policy Iteration, such as Dual Policy Iteration, do not choose next-step actions based on an expert, but instead use planning or search over the policy to choose an action distribution to train towards. However, this can be computationally expensive, and can also end up training the policy on a state distribution that is far from the current policy’s induced distribution. In this paper, we propose an algorithm that finds a middle ground by using Monte Carlo Tree Search (MCTS) to perform local trajectory improvement over rollouts from the policy. We provide theoretical justification both for the proposed local trajectory search algorithm and for our use of MCTS as a local policy improvement operator. We also show empirically that our method (Policy Optimization by Local Improvement through Search, or POLISH) is much faster than methods that plan globally, speeding up training by a factor of up to 14 in wall-clock time. Furthermore, the resulting policy outperforms strong baselines in both reinforcement learning and imitation learning.
Tasks Imitation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=HyxgoyHtDB
PDF https://openreview.net/pdf?id=HyxgoyHtDB
PWC https://paperswithcode.com/paper/policy-optimization-by-local-improvement
Repo
Framework
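
A minimal sketch (not the authors' implementation) of the local-improvement loop the abstract describes: roll out the current policy, run a short search from each visited state to get an improved action distribution, and train the policy toward it. The toy chain environment, the one-step lookahead with Monte Carlo rollouts standing in for MCTS, and the update rate are all illustrative assumptions.

```python
import numpy as np

# Toy chain environment: states 0..N-1, actions {0: left, 1: right}, reward 1 at the last state.
# The lookahead below is a stand-in for MCTS as the local improvement operator.
N_STATES, N_ACTIONS, HORIZON = 10, 2, 20

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == N_STATES - 1)

def rollout_return(policy, s, depth=5):
    """Monte Carlo estimate of the return from state s under the current policy."""
    total = 0.0
    for _ in range(depth):
        a = np.random.choice(N_ACTIONS, p=policy[s])
        s, r = step(s, a)
        total += r
    return total

def local_improvement(policy, s, n_sims=8):
    """Search stand-in: score each action by simulated return, return a greedy target distribution."""
    q = [step(s, a)[1] + np.mean([rollout_return(policy, step(s, a)[0]) for _ in range(n_sims)])
         for a in range(N_ACTIONS)]
    target = np.zeros(N_ACTIONS)
    target[int(np.argmax(q))] = 1.0
    return target

policy = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)   # tabular stochastic policy
for it in range(50):
    s = 0
    for _ in range(HORIZON):                        # rollout on the *current* policy's distribution
        target = local_improvement(policy, s)       # local trajectory improvement at visited states
        policy[s] += 0.2 * (target - policy[s])     # move the policy toward the improved distribution
        policy[s] /= policy[s].sum()
        a = np.random.choice(N_ACTIONS, p=policy[s])
        s, _ = step(s, a)
print(policy.round(2))
```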

Anchor & Transform: Learning Sparse Representations of Discrete Objects

Title Anchor & Transform: Learning Sparse Representations of Discrete Objects
Authors Anonymous
Abstract Learning continuous representations of discrete objects such as text, users, and items lies at the heart of many applications including text and user modeling. Unfortunately, traditional methods that embed all objects do not scale to large vocabulary sizes and embedding dimensions. In this paper, we propose a general method, Anchor & Transform (ANT) that learns sparse representations of discrete objects by jointly learning a small set of anchor embeddings and a sparse transformation from anchor objects to all objects. ANT is scalable, flexible, end-to-end trainable, and allows the user to easily incorporate domain knowledge about object relationships (e.g. WordNet, co-occurrence, item clusters). ANT also recovers several task-specific baselines under certain structural assumptions on the anchors and transformation matrices. On text classification and language modeling benchmarks, ANT demonstrates stronger performance with fewer parameters as compared to existing vocabulary selection and embedding compression baselines.
Tasks Language Modelling, Text Classification
Published 2020-01-01
URL https://openreview.net/forum?id=H1epaJSYDS
PDF https://openreview.net/pdf?id=H1epaJSYDS
PWC https://paperswithcode.com/paper/anchor-transform-learning-sparse
Repo
Framework
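
A minimal sketch of the factorization the abstract describes: full embeddings are never stored, and each object's embedding is a sparse, non-negative combination of a small set of anchor embeddings. The shapes, the L1 penalty, and the stand-in task loss are illustrative assumptions, not the paper's exact objective.

```python
import torch

vocab_size, n_anchors, dim = 10000, 64, 128   # illustrative sizes

# Anchor & Transform factorization: E = T @ A, with T encouraged to be sparse.
A = torch.nn.Parameter(torch.randn(n_anchors, dim) * 0.01)        # anchor embeddings
T = torch.nn.Parameter(torch.rand(vocab_size, n_anchors) * 0.01)  # sparse transformation

def embed(token_ids):
    """Look up rows of T and mix anchors; only n_anchors full vectors are stored."""
    weights = torch.relu(T[token_ids])            # non-negativity as a simple sparsity-friendly choice
    return weights @ A                            # (batch, dim)

opt = torch.optim.Adam([A, T], lr=1e-2)
tokens = torch.randint(0, vocab_size, (32,))
targets = torch.randn(32, dim)                    # stand-in for a downstream task signal
for _ in range(100):
    loss = ((embed(tokens) - targets) ** 2).mean() + 1e-3 * T.abs().mean()  # task loss + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()
```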

Low Bias Gradient Estimates for Very Deep Boolean Stochastic Networks

Title Low Bias Gradient Estimates for Very Deep Boolean Stochastic Networks
Authors Anonymous
Abstract Stochastic neural networks with discrete random variables are an important class of models for their expressivity and interpretability. Since direct differentiation and backpropagation are not possible, Monte Carlo gradient estimation techniques have been widely employed for training such models. Efficient stochastic gradient estimators, such as Straight-Through and Gumbel-Softmax, work well for shallow models with one or two stochastic layers. Their performance, however, suffers with increasing model complexity. In this work we focus on stochastic networks with multiple layers of Boolean latent variables. To analyze such networks, we employ the framework of harmonic analysis for Boolean functions. We use it to derive an analytic formulation for the source of bias in the biased Straight-Through estimator. Based on this analysis we propose \emph{FouST}, a simple gradient estimation algorithm that relies on three bias reduction steps. Extensive experiments show that FouST performs favorably compared to state-of-the-art biased estimators, while being much faster than unbiased ones. To the best of our knowledge, FouST is the first gradient estimator able to train very deep stochastic neural networks, with up to 80 deterministic and 11 stochastic layers.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Bygadh4tDB
PDF https://openreview.net/pdf?id=Bygadh4tDB
PWC https://paperswithcode.com/paper/low-bias-gradient-estimates-for-very-deep
Repo
Framework
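
FouST's specific bias-reduction steps are not spelled out in the abstract, so the sketch below only shows the biased Straight-Through baseline it starts from: a hard Bernoulli sample in the forward pass and a sigmoid-derivative surrogate gradient in the backward pass, stacked over two Boolean stochastic layers. The layer sizes and surrogate choice are assumptions.

```python
import torch

class StraightThroughBernoulli(torch.autograd.Function):
    """Biased Straight-Through estimator for a Boolean stochastic unit."""
    @staticmethod
    def forward(ctx, logits):
        probs = torch.sigmoid(logits)
        ctx.save_for_backward(probs)
        return torch.bernoulli(probs)             # hard 0/1 sample in the forward pass
    @staticmethod
    def backward(ctx, grad_out):
        (probs,) = ctx.saved_tensors
        return grad_out * probs * (1 - probs)     # pretend the sample were sigmoid(logits)

# Two stochastic Boolean layers; deeper stacks are where the bias analysed in the paper accumulates.
x = torch.randn(16, 32)
w1 = torch.randn(32, 32, requires_grad=True)
w2 = torch.randn(32, 1, requires_grad=True)
h = StraightThroughBernoulli.apply(x @ w1)
loss = (StraightThroughBernoulli.apply(h @ w2) ** 2).mean()
loss.backward()
print(w1.grad.norm(), w2.grad.norm())
```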

Learning General and Reusable Features via Racecar-Training

Title Learning General and Reusable Features via Racecar-Training
Authors Anonymous
Abstract We propose a novel training approach for improving the learning of generalizing features in neural networks. We augment the network with a reverse pass which aims to reconstruct the full sequence of internal states of the network. Despite being a surprisingly simple change, we demonstrate that this forward-backward training approach, i.e. racecar training, leads to significantly more general features being extracted from a given data set. We show that when a network obtained in this way is subsequently trained for the original task, it outperforms baseline models trained in a regular fashion. This improved performance is visible for a wide range of learning tasks, from classification to regression and stylization. In addition, networks trained with our approach exhibit improved performance for task transfers. We additionally analyze the mutual information of our networks to explain the improved generalizing capabilities.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=H1gDaa4YwS
PDF https://openreview.net/pdf?id=H1gDaa4YwS
PWC https://paperswithcode.com/paper/learning-general-and-reusable-features-via
Repo
Framework
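
A minimal sketch of the forward-backward idea in the abstract: a reverse pass re-uses the forward layers' transposed weights to reconstruct the preceding activations, and the reconstruction terms are added to the task loss. The linear layers, the weight tying via transposes, and the 0.1 loss weight are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

w1 = torch.nn.Parameter(torch.randn(64, 32) * 0.1)   # layer 1 weights, shared by both passes
w2 = torch.nn.Parameter(torch.randn(32, 10) * 0.1)   # layer 2 weights

def forward_backward(x, y):
    # Forward pass (task): x -> h -> logits
    h = torch.tanh(x @ w1)
    logits = h @ w2
    task_loss = F.cross_entropy(logits, y)
    # Reverse pass (racecar-style): reconstruct internal states with the transposed weights
    h_rec = torch.tanh(logits @ w2.t())
    x_rec = h_rec @ w1.t()
    rec_loss = F.mse_loss(h_rec, h.detach()) + F.mse_loss(x_rec, x)
    return task_loss + 0.1 * rec_loss                 # 0.1 is an assumed weighting

opt = torch.optim.Adam([w1, w2], lr=1e-3)
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
for _ in range(10):
    loss = forward_backward(x, y)
    opt.zero_grad(); loss.backward(); opt.step()
```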

Attention Forcing for Sequence-to-sequence Model Training

Title Attention Forcing for Sequence-to-sequence Model Training
Authors Anonymous
Abstract Auto-regressive sequence-to-sequence models with attention mechanism have achieved state-of-the-art performance in many tasks such as machine translation and speech synthesis. These models can be difficult to train. The standard approach, teacher forcing, guides a model with reference output history during training. The problem is that the model is unlikely to recover from its mistakes during inference, where the reference output is replaced by generated output. Several approaches deal with this problem, largely by guiding the model with generated output history. To make training stable, these approaches often require a heuristic schedule or an auxiliary classifier. This paper introduces attention forcing, which guides the model with generated output history and reference attention. This approach can train the model to recover from its mistakes, in a stable fashion, without the need for a schedule or a classifier. In addition, it allows the model to generate output sequences aligned with the references, which can be important for cascaded systems like many speech synthesis systems. Experiments on speech synthesis show that attention forcing yields significant performance gain. Experiments on machine translation show that for tasks where various re-orderings of the output are valid, guiding the model with generated output history is challenging, while guiding the model with reference attention is beneficial.
Tasks Machine Translation, Speech Synthesis
Published 2020-01-01
URL https://openreview.net/forum?id=rJe5_CNtPB
PDF https://openreview.net/pdf?id=rJe5_CNtPB
PWC https://paperswithcode.com/paper/attention-forcing-for-sequence-to-sequence-1
Repo
Framework
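
A schematic sketch of the training signal described in the abstract: one decoder runs in teacher-forcing mode (reference output history) to produce reference attention, while a second decoder runs on its own generated history and is guided toward that reference attention, here modeled with a simple KL term as an assumption. The toy dot-product attention decoder and the 0.1 weighting are likewise assumptions.

```python
import torch
import torch.nn.functional as F

enc_dim, dec_dim, vocab, T = 16, 16, 20, 5
embed = torch.nn.Embedding(vocab, dec_dim)
cell = torch.nn.GRUCell(dec_dim + enc_dim, dec_dim)
out_proj = torch.nn.Linear(dec_dim, vocab)

def decode_step(h, prev_token, enc_states):
    """One attention-decoder step; returns next hidden state, output logits, attention weights."""
    scores = (enc_states @ h.unsqueeze(-1)).squeeze(-1)       # dot-product attention (assumption)
    attn = F.softmax(scores, dim=-1)
    context = (attn.unsqueeze(-1) * enc_states).sum(dim=1)
    h = cell(torch.cat([embed(prev_token), context], dim=-1), h)
    return h, out_proj(h), attn

enc_states = torch.randn(1, 7, enc_dim)                       # stand-in encoder output
refs = torch.randint(0, vocab, (1, T))                        # reference output sequence
h_tf, h_af = torch.zeros(1, dec_dim), torch.zeros(1, dec_dim)
prev_tf = prev_af = torch.zeros(1, dtype=torch.long)
loss = 0.0
for t in range(T):
    # Teacher-forced pass: reference history, produces the *reference attention*.
    h_tf, _, attn_ref = decode_step(h_tf, prev_tf, enc_states)
    # Attention-forced pass: generated history, guided toward the reference attention.
    h_af, logits, attn_gen = decode_step(h_af, prev_af, enc_states)
    loss = loss + F.cross_entropy(logits, refs[:, t]) + 0.1 * F.kl_div(
        attn_gen.log(), attn_ref.detach(), reduction="batchmean")
    prev_tf = refs[:, t]                                      # reference output history
    prev_af = logits.argmax(dim=-1)                           # generated output history
loss.backward()
```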

Stabilizing DARTS with Amended Gradient Estimation on Architectural Parameters

Title Stabilizing DARTS with Amended Gradient Estimation on Architectural Parameters
Authors Anonymous
Abstract Differentiable neural architecture search has been a popular methodology for exploring architectures in deep learning. Despite its great advantage in search efficiency, it often suffers from weak stability, which prevents it from being applied to a large search space or flexibly adjusted to different scenarios. This paper investigates DARTS, currently the most popular differentiable search algorithm, and points out an important source of instability, which lies in its approximation of the gradients of the architectural parameters. Under this approximation, the optimization can converge to a different point, which results in dramatic inaccuracy in the re-training process. Based on this analysis, we propose an amending term for computing architectural gradients that makes use of a direct property of the optimality of network parameter optimization. Our approach mathematically guarantees that the gradient estimation follows a roughly correct direction, which leads the search stage to converge on reasonable architectures. In practice, our algorithm is easily implemented and can be added to DARTS-based approaches efficiently. Experiments on CIFAR and ImageNet demonstrate that our approach enjoys accuracy gains and, more importantly, enables DARTS-based approaches to explore much larger search spaces that have not been studied before.
Tasks Neural Architecture Search
Published 2020-01-01
URL https://openreview.net/forum?id=BJlgt2EYwr
PDF https://openreview.net/pdf?id=BJlgt2EYwr
PWC https://paperswithcode.com/paper/stabilizing-darts-with-amended-gradient-1
Repo
Framework

Reinforcement Learning with Chromatic Networks

Title Reinforcement Learning with Chromatic Networks
Authors Anonymous
Abstract We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL tasks, we manage to learn colorings translating to effective policies parameterized by as few as 17 weight parameters, providing >90 % compression over vanilla policies and 6x compression over state-of-the-art compact policies based on Toeplitz matrices, while still maintaining good reward. We believe that our work is one of the first attempts to propose a rigorous approach to training structured neural network architectures for RL problems that are of interest especially in mobile robotics with limited storage and computational resources.
Tasks Neural Architecture Search
Published 2020-01-01
URL https://openreview.net/forum?id=S1gKkpNKwH
PDF https://openreview.net/pdf?id=S1gKkpNKwH
PWC https://paperswithcode.com/paper/reinforcement-learning-with-chromatic-1
Repo
Framework
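
A minimal sketch of the weight-sharing idea in the abstract: an edge-partitioning (coloring) maps every connection in a layer to one of a few shared weight classes, so a layer with hundreds of connections is parameterized by a handful of scalars. The fixed random coloring below is an illustrative assumption; the paper learns the partitioning with ENAS/ES.

```python
import numpy as np

in_dim, out_dim, n_colors = 32, 16, 8             # 512 connections share only 8 distinct weights

rng = np.random.default_rng(0)
coloring = rng.integers(0, n_colors, size=(in_dim, out_dim))   # edge -> color class (fixed here)
shared_weights = rng.normal(size=n_colors) * 0.1               # the only weight parameters

def layer(x):
    """Dense layer whose weight matrix is built by indexing shared weights with the coloring."""
    W = shared_weights[coloring]                  # (in_dim, out_dim) matrix from n_colors scalars
    return np.tanh(x @ W)

obs = rng.normal(size=in_dim)
print(layer(obs).shape)                           # (16,) policy features from 8 weight parameters
```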

MANAS: Multi-Agent Neural Architecture Search

Title MANAS: Multi-Agent Neural Architecture Search
Authors Anonymous
Abstract The Neural Architecture Search (NAS) problem is typically formulated as a graph search problem where the goal is to learn the optimal operations over edges in order to maximize a graph-level global objective. Due to the large architecture parameter space, efficiency is a key bottleneck preventing NAS from its practical use. In this paper, we address the issue by framing NAS as a multi-agent problem where agents control a subset of the network and coordinate to reach optimal architectures. We provide two distinct lightweight implementations, with reduced memory requirements ($1/8$th of state-of-the-art) and performance above that of much more computationally expensive methods. Theoretically, we demonstrate vanishing regrets of the form $\mathcal{O}(\sqrt{T})$, with $T$ being the total number of rounds. Finally, aware that random search is an (often ignored) effective baseline, we perform additional experiments on $3$ alternative datasets and $2$ network configurations, and achieve favorable results in comparison with this baseline and other competing methods.
Tasks Neural Architecture Search
Published 2020-01-01
URL https://openreview.net/forum?id=ryedqa4FwS
PDF https://openreview.net/pdf?id=ryedqa4FwS
PWC https://paperswithcode.com/paper/manas-multi-agent-neural-architecture-search-1
Repo
Framework
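
A minimal sketch of the multi-agent framing in the abstract: each edge of the architecture graph is an independent agent choosing among candidate operations, and all agents update from a shared reward, so no global architecture distribution has to be stored. The EXP3-style update and the random stand-in reward below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

ops = ["conv3x3", "conv5x5", "skip", "maxpool"]
n_edges, n_rounds = 14, 200
rng = np.random.default_rng(0)
weights = np.ones((n_edges, len(ops)))             # one lightweight agent (weight vector) per edge

def sample_architecture(weights):
    probs = weights / weights.sum(axis=1, keepdims=True)
    choices = [rng.choice(len(ops), p=p) for p in probs]
    return choices, probs

for t in range(n_rounds):
    choices, probs = sample_architecture(weights)
    reward = rng.random()                          # stand-in for validation accuracy of the sampled net
    for e, a in enumerate(choices):                # EXP3-style importance-weighted update per agent
        weights[e, a] *= np.exp(0.1 * reward / probs[e, a] / len(ops))
    weights /= weights.max(axis=1, keepdims=True)  # keep weights numerically bounded

print([ops[np.argmax(w)] for w in weights])        # most-preferred operation per edge
```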

Improving One-Shot NAS By Suppressing The Posterior Fading

Title Improving One-Shot NAS By Suppressing The Posterior Fading
Authors Anonymous
Abstract There is a growing interest in automated neural architecture search (NAS). To improve the efficiency of NAS, previous approaches adopt weight sharing to force all models to share the same set of weights. However, it has been observed that a model performing better with shared weights does not necessarily perform better when trained alone. In this paper, we analyse existing weight sharing one-shot NAS approaches from a Bayesian point of view and identify the posterior fading problem, which compromises the effectiveness of shared weights. To alleviate this problem, we present a practical approach to guide the parameter posterior towards its true distribution. Moreover, a hard latency constraint is introduced during the search so that the desired latency can be achieved. The resulting method, namely Posterior Convergent NAS (PC-NAS), achieves state-of-the-art performance under a standard GPU latency constraint on ImageNet. In our small search space, our model PC-NAS-S attains 76.8% top-1 accuracy, 2.1% higher than MobileNetV2 (1.4x) with the same latency. When applied to our large search space, PC-NAS-L achieves 78.1% top-1 accuracy within 11ms. The discovered architecture also transfers well to other computer vision applications such as object detection and person re-identification.
Tasks Neural Architecture Search, Object Detection, Person Re-Identification
Published 2020-01-01
URL https://openreview.net/forum?id=HJgJNCEKPr
PDF https://openreview.net/pdf?id=HJgJNCEKPr
PWC https://paperswithcode.com/paper/improving-one-shot-nas-by-suppressing-the-1
Repo
Framework

Graph Constrained Reinforcement Learning for Natural Language Action Spaces

Title Graph Constrained Reinforcement Learning for Natural Language Action Spaces
Authors Anonymous
Abstract Interactive Fiction games are text-based simulations in which an agent interacts with the world purely through natural language. They are ideal environments for studying how to extend reinforcement learning agents to meet the challenges of natural language understanding, partial observability, and action generation in combinatorially-large text-based action spaces. We present KG-A2C, an agent that builds a dynamic knowledge graph while exploring and generates actions using a template-based action space. We contend that the dual uses of the knowledge graph to reason about game state and to constrain natural language generation are the keys to scalable exploration of combinatorially large natural language actions. Results across a wide variety of IF games show that KG-A2C outperforms current IF agents despite the exponential increase in action space size.
Tasks Text Generation
Published 2020-01-01
URL https://openreview.net/forum?id=B1x6w0EtwH
PDF https://openreview.net/pdf?id=B1x6w0EtwH
PWC https://paperswithcode.com/paper/graph-constrained-reinforcement-learning-for
Repo
Framework
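
A minimal sketch of the action-space constraint described in the abstract: template actions are only admitted when their object slots can be filled with entities currently present in the agent's knowledge graph. The templates, graph contents, and slot-filling rule are illustrative assumptions; KG-A2C additionally scores candidate actions with a learned actor-critic policy.

```python
# Toy knowledge-graph-constrained action generation for an interactive-fiction agent.
knowledge_graph = {                                # (subject, relation, object) triples seen so far
    ("you", "has", "lamp"),
    ("kitchen", "contains", "door"),
    ("you", "in", "kitchen"),
}
entities = {s for s, _, _ in knowledge_graph} | {o for _, _, o in knowledge_graph}

templates = ["take {obj}", "open {obj}", "light {obj}", "go north"]

def admissible_actions(templates, entities):
    """Expand templates, keeping only actions whose slots are filled by known entities."""
    actions = []
    for t in templates:
        if "{obj}" not in t:
            actions.append(t)
            continue
        actions.extend(t.format(obj=e) for e in entities if e not in ("you", "kitchen"))
    return actions

print(admissible_actions(templates, entities))     # e.g. ['take lamp', 'take door', ..., 'go north']
```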

Gated Channel Transformation for Visual Recognition

Title Gated Channel Transformation for Visual Recognition
Authors Anonymous
Abstract In this work, we propose a generally applicable transformation unit for visual recognition with deep convolutional neural networks. This transformation explicitly models channel relationships with explainable control variables. These variables determine the neuron behaviors of competition or cooperation, and they are jointly optimized with the convolutional weights towards more accurate recognition. In Squeeze-and-Excitation (SE) Networks, the channel relationships are implicitly learned by fully connected layers, and the SE block is integrated at the block level. We instead introduce a channel normalization layer to reduce the number of parameters and the computational complexity. This lightweight layer incorporates a simple L2 normalization, making our transformation unit applicable at the operator level without much increase in additional parameters. Extensive experiments demonstrate the effectiveness of our unit with clear margins on many vision tasks, i.e., image classification on ImageNet, object detection and instance segmentation on COCO, and video classification on Kinetics.
Tasks Image Classification, Instance Segmentation, Object Detection, Semantic Segmentation, Video Classification
Published 2020-01-01
URL https://openreview.net/forum?id=SJxbu6VKDr
PDF https://openreview.net/pdf?id=SJxbu6VKDr
PWC https://paperswithcode.com/paper/gated-channel-transformation-for-visual-1
Repo
Framework
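
A minimal sketch of a gated channel transformation consistent with the abstract: a per-channel L2 embedding over spatial positions, a lightweight normalization across channels, and a learned gate applied to the input. The exact parameterization (alpha, gamma, beta and a 1+tanh gate) follows commonly cited descriptions of GCT but should be treated as an assumption here.

```python
import torch

class GCT(torch.nn.Module):
    """Gated channel transformation: L2 channel embedding -> channel normalization -> gating."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = torch.nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = torch.nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # Per-channel global embedding via an L2 norm over spatial positions.
        embedding = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        # Normalize the embedding across channels (the lightweight "channel normalization" layer).
        norm = self.gamma * embedding / embedding.pow(2).mean(dim=1, keepdim=True).add(self.eps).sqrt()
        # Competition/cooperation gate applied to the input feature map.
        return x * (1.0 + torch.tanh(norm + self.beta))

x = torch.randn(2, 64, 8, 8)
print(GCT(64)(x).shape)                             # torch.Size([2, 64, 8, 8])
```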

Emergence of functional and structural properties of the head direction system by optimization of recurrent neural networks

Title Emergence of functional and structural properties of the head direction system by optimization of recurrent neural networks
Authors Anonymous
Abstract Recent work suggests goal-driven training of neural networks can be used to model neural activity in the brain. While response properties of neurons in artificial neural networks bear similarities to those in the brain, the network architectures are often constrained to be different. Here we ask if a neural network can recover both neural representations and, if the architecture is unconstrained and optimized, also the anatomical properties of neural circuits. We demonstrate this in a system where the connectivity and the functional organization have been characterized, namely, the head direction circuit of the rodent and fruit fly. We trained recurrent neural networks (RNNs) to estimate head direction through integration of angular velocity. We found that the two distinct classes of neurons observed in the head direction system, the Ring neurons and the Shifter neurons, emerged naturally in artificial neural networks as a result of training. Furthermore, connectivity analysis and in-silico neurophysiology revealed structural and mechanistic similarities between artificial networks and the head direction system. Overall, our results show that optimization of RNNs in a goal-driven task can recapitulate the structure and function of biological circuits, suggesting that artificial neural networks can be used to study the brain at the level of both neural activity and anatomical organization.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HklSeREtPB
PDF https://openreview.net/pdf?id=HklSeREtPB
PWC https://paperswithcode.com/paper/emergence-of-functional-and-structural
Repo
Framework
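
A minimal sketch of the training setup the abstract describes: an RNN receives angular velocity and is trained to report head direction (as sin/cos) by integrating its input. The network size, loss, and synthetic trajectories are illustrative assumptions; the paper then analyses the connectivity of the trained network.

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=1, hidden_size=64, batch_first=True)
readout = torch.nn.Linear(64, 2)                    # predict (cos theta, sin theta)
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)

def make_batch(batch=32, steps=100, dt=0.1):
    """Random angular-velocity trajectories and the integrated head direction."""
    omega = torch.randn(batch, steps, 1) * 0.5
    theta = torch.cumsum(omega * dt, dim=1).squeeze(-1)
    return omega, torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)

for step in range(200):
    omega, target = make_batch()
    hidden, _ = rnn(omega)
    loss = ((readout(hidden) - target) ** 2).mean()  # path-integration objective
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```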

Adversarially robust transfer learning

Title Adversarially robust transfer learning
Authors Anonymous
Abstract Transfer learning, in which a network is trained on one task and re-purposed on another, is often used to produce neural network classifiers when data is scarce or full-scale training is too costly. When the goal is to produce a model that is not only accurate but also adversarially robust, data scarcity and computational limitations become even more cumbersome. We consider robust transfer learning, in which we transfer not only performance but also robustness from a source model to a target domain. We start by observing that robust networks contain robust feature extractors. By training classifiers on top of these feature extractors, we produce new models that inherit the robustness of their parent networks. We then consider the case of “fine tuning” a network by re-training end-to-end in the target domain. When using lifelong learning strategies, this process preserves the robustness of the source network while achieving high accuracy. By using such strategies, it is possible to produce accurate and robust models with little data, and without the cost of adversarial training. Additionally, we can improve the generalization of adversarially trained models, while maintaining their robustness.
Tasks Transfer Learning
Published 2020-01-01
URL https://openreview.net/forum?id=ryebG04YvB
PDF https://openreview.net/pdf?id=ryebG04YvB
PWC https://paperswithcode.com/paper/adversarially-robust-transfer-learning-1
Repo
Framework
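
A minimal sketch of the first strategy in the abstract: take a robust source model, freeze its feature extractor, and train only a new classifier head on the target task so the robustness of the features is inherited. Using torchvision's ResNet-50 with random weights as the stand-in source model is an assumption; the paper starts from an adversarially trained network.

```python
import torch
import torchvision

# Stand-in source model; the paper would load adversarially trained (robust) weights here.
source = torchvision.models.resnet50(weights=None)
for p in source.parameters():
    p.requires_grad = False                               # freeze the (robust) feature extractor
source.fc = torch.nn.Linear(source.fc.in_features, 10)    # new head for a 10-class target task

opt = torch.optim.SGD(source.fc.parameters(), lr=1e-2, momentum=0.9)
x, y = torch.randn(4, 3, 224, 224), torch.randint(0, 10, (4,))
loss = torch.nn.functional.cross_entropy(source(x), y)
loss.backward()                                            # gradients flow only into the new head
opt.step()
```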

Synthesizing Programmatic Policies that Inductively Generalize

Title Synthesizing Programmatic Policies that Inductively Generalize
Authors Anonymous
Abstract Deep reinforcement learning has successfully solved a number of challenging control tasks. However, learned policies typically have difficulty generalizing to novel environments. We propose an algorithm for learning programmatic state machine policies that can capture repeating behaviors. By doing so, they have the ability to generalize to instances requiring an arbitrary number of repetitions, a property we call inductive generalization. However, state machine policies are hard to learn since they consist of a combination of continuous and discrete structure. We propose a learning framework called adaptive teaching, which learns a state machine policy by imitating a teacher; in contrast to traditional imitation learning, our teacher adaptively updates itself based on the structure of the student. We show how our algorithm can be used to learn policies that inductively generalize to novel environments, whereas traditional neural network policies fail to do so.
Tasks Imitation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=S1l8oANFDH
PDF https://openreview.net/pdf?id=S1l8oANFDH
PWC https://paperswithcode.com/paper/synthesizing-programmatic-policies-that
Repo
Framework
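
A minimal sketch of what a programmatic state machine policy can look like: a few discrete modes, a guard that switches modes, and a per-mode controller, so a repeating behavior generalizes to any number of repetitions. The modes, guard, and controller below are illustrative assumptions; the paper learns these components via adaptive teaching.

```python
import numpy as np

# A two-mode state machine policy for a 1-D "bounce between walls" behaviour.
MODES = {"go_right": np.array([+1.0]), "go_left": np.array([-1.0])}   # per-mode controllers

def guard(mode, obs):
    """Discrete transition structure: switch mode when a wall is reached."""
    if mode == "go_right" and obs >= 1.0:
        return "go_left"
    if mode == "go_left" and obs <= -1.0:
        return "go_right"
    return mode

mode, pos = "go_right", 0.0
for t in range(40):                                  # generalizes to any number of repetitions
    mode = guard(mode, pos)
    pos += MODES[mode][0] * 0.1
print(round(pos, 2), mode)
```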

Fast Task Adaptation for Few-Shot Learning

Title Fast Task Adaptation for Few-Shot Learning
Authors Anonymous
Abstract Few-shot classification is a challenging task due to the scarcity of training examples for each class. The key lies in generalization of prior knowledge learned from large-scale base classes and fast adaptation of the classifier to novel classes. In this paper, we introduce a two-stage framework. In the first stage, we attempt to learn task-agnostic features on the base data with a novel Metric-Softmax loss. The Metric-Softmax loss is trained against the whole label set and learns more discriminative features than episodic training. Besides, the Metric-Softmax classifier can be applied to base and novel classes in a consistent manner, which is critical for the generalizability of the learned features. In the second stage, we design a task-adaptive transformation which adapts the classifier to each few-shot setting very quickly, within a few tuning epochs. Compared with the existing fine-tuning scheme, the scarce examples of novel classes are exploited more effectively. Experiments show that our approach outperforms the current state of the art by a large margin on the commonly used mini-ImageNet and CUB-200-2011 benchmarks.
Tasks Few-Shot Learning
Published 2020-01-01
URL https://openreview.net/forum?id=ByxhOyHYwH
PDF https://openreview.net/pdf?id=ByxhOyHYwH
PWC https://paperswithcode.com/paper/fast-task-adaptation-for-few-shot-learning
Repo
Framework
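
A minimal sketch of a Metric-Softmax-style loss consistent with the abstract: class scores are negative distances between embeddings and class centroids, fed into a standard softmax cross-entropy over the whole base label set. The use of Euclidean distance, learned centroids, and a temperature scale are assumptions.

```python
import torch
import torch.nn.functional as F

n_classes, dim = 100, 64
centroids = torch.nn.Parameter(torch.randn(n_classes, dim))   # one centroid per base class

def metric_softmax_loss(features, labels, scale=10.0):
    """Softmax cross-entropy over negative embedding-to-centroid distances."""
    dists = torch.cdist(features, centroids)        # (batch, n_classes) Euclidean distances
    return F.cross_entropy(-scale * dists, labels)  # closer centroid -> larger logit

features = F.normalize(torch.randn(16, dim), dim=-1)           # stand-in backbone embeddings
labels = torch.randint(0, n_classes, (16,))
loss = metric_softmax_loss(features, labels)
loss.backward()
print(centroids.grad.norm())
```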