April 1, 2020

2855 words 14 mins read

Paper Group NANR 96

Fantastic Generalization Measures and Where to Find Them. Convergence Behaviour of Some Gradient-Based Methods on Bilinear Zero-Sum Games. Dynamic Model Pruning with Feedback. Structured Object-Aware Physics Prediction for Video Modeling and Planning. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. Implementing Inductive bias …

Fantastic Generalization Measures and Where to Find Them

Title Fantastic Generalization Measures and Where to Find Them
Authors Anonymous
Abstract Generalization of deep networks has been intensely researched in recent years, resulting in a number of theoretical bounds and empirically motivated measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether these measures are truly useful in practice. We present the first large-scale study of generalization bounds and measures in deep networks. We train over two thousand CIFAR-10 networks with systematic changes in important hyper-parameters. We attempt to uncover potential causal relationships between each measure and generalization by using the rank correlation coefficient and modified forms of it. We analyze the results and show that some of the studied measures are very promising for further research.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SJgIPJBFvH
PDF https://openreview.net/pdf?id=SJgIPJBFvH
PWC https://paperswithcode.com/paper/fantastic-generalization-measures-and-where
Repo
Framework
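
The core analysis described above is a rank correlation between a candidate complexity measure and the observed generalization gap across a large population of trained models. Below is a minimal sketch of that step, assuming we already have per-model values for one measure and the gap; the data is synthetic and the paper's specific measures and modified correlation forms are not reproduced.

```python
# Sketch: rank-correlate a candidate complexity measure with the
# generalization gap across a population of trained models.
# Values below are synthetic; the paper's measures and the exact
# correlation variants (e.g. conditional forms) may differ.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

n_models = 200
# Hypothetical per-model quantities gathered after training.
measure = rng.uniform(0.0, 1.0, size=n_models)             # e.g. a norm-based measure
gen_gap = 0.5 * measure + 0.1 * rng.normal(size=n_models)  # train acc - test acc

tau, p_value = kendalltau(measure, gen_gap)
print(f"Kendall tau between measure and generalization gap: {tau:.3f} (p={p_value:.2g})")
```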

Convergence Behaviour of Some Gradient-Based Methods on Bilinear Zero-Sum Games

Title Convergence Behaviour of Some Gradient-Based Methods on Bilinear Zero-Sum Games
Authors Anonymous
Abstract Min-max formulations have attracted great attention in the ML community due to the rise of deep generative models and adversarial methods, and understanding the dynamics of (stochastic) gradient algorithms for solving such formulations has been a grand challenge. As a first step, we restrict to bilinear zero-sum games and give a systematic analysis of popular gradient updates, for both simultaneous and alternating versions. We provide exact conditions for their convergence and find the optimal parameter setup and convergence rates. In particular, our results offer formal evidence that alternating updates converge “better” than simultaneous ones.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SJlVY04FwH
PDF https://openreview.net/pdf?id=SJlVY04FwH
PWC https://paperswithcode.com/paper/convergence-behaviour-of-some-gradient-based-1
Repo
Framework
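
The simplest instance of the setting above is the scalar bilinear game min_x max_y xy. The following numpy sketch (my own illustration, not the authors' code) contrasts simultaneous and alternating gradient descent-ascent: the simultaneous iterates grow without bound, while the alternating iterates stay bounded, matching the paper's message that alternating updates behave better.

```python
# Sketch: simultaneous vs. alternating gradient descent-ascent on the
# scalar bilinear game min_x max_y x*y.  Synthetic illustration of the
# behaviour analysed in the paper, not the authors' code.
import numpy as np

eta = 0.1
steps = 200

# Simultaneous updates: both players use the current iterate.
x, y = 1.0, 1.0
for _ in range(steps):
    x, y = x - eta * y, y + eta * x
print(f"simultaneous: |(x, y)| = {np.hypot(x, y):.3e}")   # grows without bound

# Alternating updates: the second player sees the first player's new iterate.
x, y = 1.0, 1.0
for _ in range(steps):
    x = x - eta * y
    y = y + eta * x
print(f"alternating:  |(x, y)| = {np.hypot(x, y):.3e}")   # stays bounded
```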

Dynamic Model Pruning with Feedback

Title Dynamic Model Pruning with Feedback
Authors Anonymous
Abstract Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by (i) allowing dynamic allocation of the sparsity pattern and (ii) incorporating a feedback signal to reactivate prematurely pruned weights, we obtain a performant sparse model in a single training pass (retraining is not needed, but can further improve performance). We evaluate the method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models and, further, that their performance surpasses all previously proposed pruning schemes (that come without feedback mechanisms).
Tasks Model Compression
Published 2020-01-01
URL https://openreview.net/forum?id=SJem8lSFwB
PDF https://openreview.net/pdf?id=SJem8lSFwB
PWC https://paperswithcode.com/paper/dynamic-model-pruning-with-feedback
Repo
Framework
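
The mechanism described above keeps a dense copy of the weights, recomputes a sparsity mask dynamically, and routes the gradient back into the dense weights so that prematurely pruned entries can regain magnitude and be reactivated. A toy numpy sketch of that loop on linear regression follows; the sparsity level, schedule, and exact feedback signal are illustrative assumptions, not the paper's specification.

```python
# Sketch: dynamic magnitude pruning with feedback on a toy linear model.
# A dense weight vector is kept throughout; each step a fresh top-k mask is
# applied for the forward pass and the gradient updates the *dense* weights,
# so prematurely pruned entries can regain magnitude and be reactivated.
# The sparsity level and schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples, sparsity, lr = 50, 200, 0.8, 0.05

X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:10] = rng.normal(size=10)               # truly sparse target
y = X @ true_w + 0.01 * rng.normal(size=n_samples)

w_dense = rng.normal(scale=0.1, size=n_features)
k = int(round((1.0 - sparsity) * n_features))   # number of kept weights

for step in range(500):
    keep = np.argsort(np.abs(w_dense))[-k:]     # recompute mask every step
    mask = np.zeros(n_features)
    mask[keep] = 1.0
    w_sparse = w_dense * mask                   # pruned model used in the forward pass
    grad = X.T @ (X @ w_sparse - y) / n_samples
    w_dense -= lr * grad                        # feedback: update dense weights

print("kept indices:", np.sort(np.argsort(np.abs(w_dense))[-k:]))
```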

Structured Object-Aware Physics Prediction for Video Modeling and Planning

Title Structured Object-Aware Physics Prediction for Video Modeling and Planning
Authors Anonymous
Abstract When humans observe a physical system, they can easily locate components, understand their interactions, and anticipate future behavior, even in settings with complicated and previously unseen interactions. For computers, however, learning such models from videos in an unsupervised fashion is an unsolved research problem. In this paper, we present STOVE, a novel state-space model for videos, which explicitly reasons about objects and their positions, velocities, and interactions. It is constructed by combining an image model and a dynamics model in a compositional manner and improves on previous work by reusing the dynamics model for inference, accelerating and regularizing training. STOVE predicts videos with convincing physical behavior over hundreds of timesteps, outperforms previous unsupervised models, and even approaches the performance of supervised baselines. We further demonstrate the strength of our model as a simulator for sample-efficient model-based control, in a task with heavily interacting objects.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1e-kxSKDH
PDF https://openreview.net/pdf?id=B1e-kxSKDH
PWC https://paperswithcode.com/paper/structured-object-aware-physics-prediction
Repo
Framework
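
To make the interface concrete, the structural idea above is a latent state of object positions and velocities that a dynamics model advances and an image model renders. The sketch below hand-codes a toy bouncing-ball dynamics and decoder purely to illustrate that rollout structure; STOVE learns both components and performs inference over them, which is not shown here.

```python
# Sketch: the structural idea of an object-aware state-space rollout --
# a latent state of object positions/velocities advanced by a dynamics
# model and rendered by an image model.  This toy hand-coded bouncing-ball
# dynamics/decoder only illustrates the interface; STOVE *learns* both parts.
import numpy as np

def dynamics(state, dt=0.1):
    """state: (n_objects, 4) rows of [x, y, vx, vy]; bounce inside [0, 1]^2."""
    pos, vel = state[:, :2] + dt * state[:, 2:], state[:, 2:].copy()
    bounce = (pos < 0.0) | (pos > 1.0)
    vel[bounce] *= -1.0
    return np.concatenate([np.clip(pos, 0.0, 1.0), vel], axis=1)

def decode(state, size=32):
    """Render each object as a bright pixel in a size x size frame."""
    frame = np.zeros((size, size))
    ij = np.clip((state[:, :2] * (size - 1)).astype(int), 0, size - 1)
    frame[ij[:, 1], ij[:, 0]] = 1.0
    return frame

state = np.array([[0.2, 0.3, 0.05, 0.02], [0.7, 0.8, -0.03, 0.04]])
video = []
for _ in range(100):                      # roll out many steps
    state = dynamics(state)
    video.append(decode(state))
print(np.stack(video).shape)              # (100, 32, 32)
```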

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Title Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
Authors Anonymous
Abstract In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study gradient descent or gradient flow (i.e., gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Our results generalize the previous results for logistic regression with one-layer or multi-layer linear networks, and provide more quantitative convergence results with weaker assumptions than previous results for homogeneous smooth neural networks. We conduct several experiments to justify our theoretical findings on the MNIST and CIFAR-10 datasets. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SJeLIgBKPS
PDF https://openreview.net/pdf?id=SJeLIgBKPS
PWC https://paperswithcode.com/paper/gradient-descent-maximizes-the-margin-of-1
Repo
Framework
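
For an L-homogeneous network f(θ; x), the normalized margin tracked in the abstract above is min_i y_i f(θ; x_i) / ||θ||^L. The sketch below computes it for a two-layer, bias-free ReLU network (homogeneity order L = 2) on random data; the smoothed version used in the paper's analysis and the training loop are omitted, and all numbers are synthetic.

```python
# Sketch: normalized margin of a 2-homogeneous (bias-free, two-layer ReLU)
# network on a binary classification set.  The paper studies a smoothed
# version of this quantity during gradient descent; here we only compute
# the plain normalized margin for randomly drawn weights and data.
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 10, 32, 100
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)        # labels in {-1, +1}

W1 = rng.normal(scale=0.5, size=(h, d))
W2 = rng.normal(scale=0.5, size=(1, h))

def network(X):
    return (np.maximum(X @ W1.T, 0.0) @ W2.T).ravel()   # f(theta; x), no biases

margins = y * network(X)                    # per-example margins y_i f(x_i)
theta_norm = np.sqrt((W1 ** 2).sum() + (W2 ** 2).sum())
L = 2                                       # homogeneity order of this architecture
normalized_margin = margins.min() / theta_norm ** L
print(f"normalized margin: {normalized_margin:.4f}")
```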

Implementing Inductive bias for different navigation tasks through diverse RNN attractors

Title Implementing Inductive bias for different navigation tasks through diverse RNN attractors
Authors Anonymous
Abstract Navigation is crucial for animal behavior and is assumed to require an internal representation of the external environment, termed a cognitive map. The precise form of this representation is often considered to be a metric representation of space. An internal representation, however, is judged by its contribution to performance on a given task, and may thus vary between different types of navigation tasks. Here we train a recurrent neural network that controls an agent performing several navigation tasks in a simple environment. To focus on internal representations, we split learning into a task-agnostic pre-training stage that modifies internal connectivity and a task-specific Q-learning stage that controls the network’s output. We show that pre-training shapes the attractor landscape of the networks, leading to either a continuous attractor, discrete attractors, or a disordered state. These structures induce a bias in the Q-learning phase, leading to a performance pattern across the tasks that corresponds to metric and topological regularities. Our results show that, in recurrent networks, inductive bias takes the form of attractor landscapes, which can be shaped by pre-training and analyzed using dynamical systems methods. Furthermore, we demonstrate that non-metric representations are useful for navigation tasks.
Tasks Q-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=Byx4NkrtDS
PDF https://openreview.net/pdf?id=Byx4NkrtDS
PWC https://paperswithcode.com/paper/implementing-inductive-bias-for-different
Repo
Framework

ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring

Title ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring
Authors Anonymous
Abstract We improve the recently proposed MixMatch semi-supervised learning algorithm by introducing two new techniques: distribution alignment and augmentation anchoring. Distribution alignment encourages the marginal distribution of predictions on unlabeled data to be close to the marginal distribution of ground-truth labels. Augmentation anchoring feeds multiple strongly augmented versions of an input into the model and encourages each output to be close to the prediction for a weakly augmented version of the same input. To produce strong augmentations, we propose a variant of AutoAugment which learns the augmentation policy while the model is being trained. Our new algorithm, dubbed ReMixMatch, is significantly more data-efficient than prior work, requiring between 5 times and 16 times less data to reach the same accuracy. For example, on CIFAR-10 with 250 labeled examples we reach 93.73% accuracy (compared to MixMatch’s accuracy of 93.58% with 4000 examples) and a median accuracy of 84.92% with just four labels per class.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HklkeR4KPB
PDF https://openreview.net/pdf?id=HklkeR4KPB
PWC https://paperswithcode.com/paper/remixmatch-semi-supervised-learning-with
Repo
Framework
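
Distribution alignment, as described above, rescales each unlabeled prediction by the ratio of the labeled-data class marginal to a running average of the model's own marginal, then renormalizes. The sketch below shows just that step on one batch; the running-average window and the rest of the ReMixMatch pipeline (augmentation anchoring, MixUp, and so on) are omitted, and the inputs are synthetic.

```python
# Sketch: ReMixMatch-style distribution alignment for one batch of
# unlabeled predictions.  p_target is the labeled-data class marginal;
# running_marginal would normally be a running average of recent model
# predictions -- here both are illustrative values.
import numpy as np

def distribution_alignment(probs, p_target, running_marginal, eps=1e-6):
    """Rescale predictions toward the target class marginal and renormalize."""
    aligned = probs * (p_target / (running_marginal + eps))
    return aligned / aligned.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
batch = rng.dirichlet(alpha=np.ones(10), size=64)           # model predictions
p_target = np.full(10, 0.1)                                  # balanced CIFAR-10 marginal
running_marginal = batch.mean(axis=0)                        # stand-in for the running average

aligned = distribution_alignment(batch, p_target, running_marginal)
print(aligned.sum(axis=1)[:3])        # rows still sum to 1
print(aligned.mean(axis=0))           # marginal pulled toward p_target
```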

Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness

Title Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness
Authors Anonymous
Abstract Mode connectivity provides novel geometric insights on analyzing loss landscapes and enables building high-accuracy pathways between well-trained neural networks. In this work, we propose to employ mode connectivity in loss landscapes to study the adversarial robustness of deep neural networks, and provide novel methods for improving this robustness. Our experiments cover various types of adversarial attacks applied to different network architectures and datasets. When network models are tampered with by backdoor or error-injection attacks, our results demonstrate that the path connection learned using a limited amount of bona fide data can effectively mitigate adversarial effects while maintaining the original accuracy on clean data. Therefore, mode connectivity provides users with the power to repair backdoored or error-injected models. We also use mode connectivity to investigate the loss landscapes of regular and robust models against evasion attacks. Experiments show that there exists a barrier in adversarial robustness loss on the path connecting regular and adversarially-trained models. A high correlation is observed between the adversarial robustness loss and the largest eigenvalue of the input Hessian matrix, for which theoretical justifications are provided. Our results suggest that mode connectivity offers a holistic tool and practical means for evaluating and improving adversarial robustness.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SJgwzCEKwH
PDF https://openreview.net/pdf?id=SJgwzCEKwH
PWC https://paperswithcode.com/paper/bridging-mode-connectivity-in-loss-landscapes
Repo
Framework
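
Mode connectivity is commonly realized by learning a low-loss path, for example a quadratic Bezier curve, between the flattened parameters of two trained networks, with only the middle control point optimized. The sketch below shows just that path parameterization with random stand-in parameters; training the control point against a loss, and the robustness analysis built on top of it in the paper, are not shown.

```python
# Sketch: a quadratic Bezier path between two sets of flattened network
# parameters.  In mode-connectivity methods the middle control point
# theta_mid is trained so that every point on the path has low loss;
# here it is just initialized to the midpoint for illustration.
import numpy as np

rng = np.random.default_rng(0)
theta_a = rng.normal(size=1000)          # parameters of trained model A (flattened)
theta_b = rng.normal(size=1000)          # parameters of trained model B (flattened)
theta_mid = 0.5 * (theta_a + theta_b)    # trainable control point (untrained here)

def bezier_point(t):
    """Point on the path at t in [0, 1]; endpoints are the two trained models."""
    return (1 - t) ** 2 * theta_a + 2 * t * (1 - t) * theta_mid + t ** 2 * theta_b

# Evaluate (e.g. loss or adversarial robustness) at points along the path.
path = np.stack([bezier_point(t) for t in np.linspace(0.0, 1.0, 11)])
print(path.shape)                        # (11, 1000)
```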

Lipschitz Lifelong Reinforcement Learning

Title Lipschitz Lifelong Reinforcement Learning
Authors Anonymous
Abstract We consider the problem of reusing prior experience when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes and focus on the study and exploitation of the optimal value function’s Lipschitz continuity in the task space with respect to that metric. These theoretical results lead us to a value transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm that exploits continuity to accelerate learning. We illustrate the benefits of the method in Lifelong RL experiments.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BylalAEtvB
PDF https://openreview.net/pdf?id=BylalAEtvB
PWC https://paperswithcode.com/paper/lipschitz-lifelong-reinforcement-learning
Repo
Framework
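
The transfer idea above rests on a Lipschitz bound of the form |Q*_M(s,a) − Q*_M'(s,a)| ≤ K · d(M, M'), so the optimal values of an earlier task, shifted by the bound, give an admissible optimistic initialization for the new task. The sketch below shows only that initialization step; the MDP metric d, the constant K, the value bound, and the downstream PAC-MDP algorithm are all illustrative placeholders rather than the paper's definitions.

```python
# Sketch: Lipschitz-style value transfer between two tasks.  Given Q-values
# from a previously solved MDP and an (assumed known) upper bound on the
# task distance, tighten the default optimistic initialization for the new
# task.  The distance, constant, and value bound below are illustrative.
import numpy as np

n_states, n_actions = 20, 4
q_max = 10.0                               # trivial optimistic bound, e.g. r_max / (1 - gamma)
lipschitz_constant = 2.0                   # assumed Lipschitz constant of M -> Q*_M
task_distance = 0.5                        # assumed d(M_old, M_new) under the MDP metric

rng = np.random.default_rng(0)
q_old = rng.uniform(0.0, q_max, size=(n_states, n_actions))   # values from the old task

# Transferred upper bound: never looser than q_max, often much tighter.
q_upper = np.minimum(q_max, q_old + lipschitz_constant * task_distance)
q_init = q_upper                            # optimistic initialization for the new task
print(f"mean initial value: {q_init.mean():.2f} (vs. naive {q_max:.2f})")
```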

Empirical confidence estimates for classification by deep neural networks

Title Empirical confidence estimates for classification by deep neural networks
Authors Anonymous
Abstract How well can we estimate the probability that the classification predicted by a deep neural network is correct (or in the Top 5)? It is well known that the softmax values of the network are not estimates of the probabilities of class labels. However, there is a misconception that these values are not informative. We define the notion of implied loss and prove that if an uncertainty measure is an implied loss, then low uncertainty means a high probability of correct (or Top-k) classification on the test set. We demonstrate empirically that these values can be used to measure the confidence that the classification is correct. Our method is simple to use on existing networks: we propose confidence measures for Top-k classification that can be evaluated by binning values on the test set.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=Hke0oa4KwS
PDF https://openreview.net/pdf?id=Hke0oa4KwS
PWC https://paperswithcode.com/paper/empirical-confidence-estimates-for-1
Repo
Framework
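
The binning step mentioned at the end of the abstract can be sketched directly: sort test examples into bins of a chosen confidence score and report the empirical accuracy per bin, which then serves as the confidence estimate for new inputs landing in that bin. In the sketch below the score (a stand-in for a max-softmax value), the equal-width bins, and the synthetic data are assumptions; the paper defines its own measures.

```python
# Sketch: estimate P(correct) by binning a confidence score on a held-out
# test set.  The score used here (a stand-in for max softmax probability)
# and the equal-width bins are assumptions; the paper defines its own measures.
import numpy as np

def binned_accuracy(scores, correct, n_bins=10):
    """Empirical accuracy of predictions whose score falls in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            acc[b] = correct[in_bin].mean()
    return edges, acc

rng = np.random.default_rng(0)
scores = rng.uniform(size=5000)                      # stand-in confidence scores
correct = rng.uniform(size=5000) < scores            # synthetic: higher score, more often correct
edges, acc = binned_accuracy(scores, correct)
for lo, hi, a in zip(edges[:-1], edges[1:], acc):
    print(f"score in [{lo:.1f}, {hi:.1f}): empirical accuracy {a:.2f}")
```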

Model-based reinforcement learning for biological sequence design

Title Model-based reinforcement learning for biological sequence design
Authors Anonymous
Abstract The ability to design biological structures such as DNA or proteins would have considerable medical and industrial impact. Doing so presents a challenging black-box optimization problem characterized by the large-batch, low-round setting due to the need for labor-intensive wet lab evaluations. In response, we propose using reinforcement learning (RL) based on proximal policy optimization (PPO) for biological sequence design. RL provides a flexible framework for optimizing generative sequence models to achieve specific criteria, such as diversity among the high-quality sequences discovered. We propose a model-based variant of PPO, DyNA-PPO, to improve sample efficiency, where the policy for a new round is trained offline using a simulator fit on functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the simulator model is automatically selected at each round from a pool of diverse models of varying capacity. On the tasks of designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structure, we find that DyNA-PPO performs significantly better than existing methods in settings in which modeling is feasible, while still not performing worse in situations in which a reliable model cannot be learned.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=HklxbgBKvr
PDF https://openreview.net/pdf?id=HklxbgBKvr
PWC https://paperswithcode.com/paper/model-based-reinforcement-learning-for
Repo
Framework
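
The round structure described above, collect a batch of expensive measurements, refit a simulator on all data so far, then improve the sequence policy offline against the simulator, can be sketched as follows. To stay short and dependency-free, the "wet lab" oracle and least-squares simulator are toy stand-ins and the policy update is a cross-entropy-style step rather than PPO; only the outer loop mirrors the described method, so treat this as structure, not the paper's algorithm.

```python
# Sketch of the DyNA-PPO-style round structure: measure a batch, refit a
# simulator on all data so far, then improve the policy *offline* against
# the simulator.  The oracle, the least-squares simulator, and the
# cross-entropy-style policy update are toy stand-ins (the paper uses PPO
# and automatic model selection), so treat this as structure only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_tokens = 8, 4                       # e.g. short DNA sequences

def oracle(seqs):                              # stand-in for wet-lab measurement
    return (seqs == 2).sum(axis=1) + 0.1 * rng.normal(size=len(seqs))

def one_hot(seqs):
    return np.eye(n_tokens)[seqs].reshape(len(seqs), -1)

def sample(policy, n):
    return np.stack([[rng.choice(n_tokens, p=policy[i]) for i in range(seq_len)]
                     for _ in range(n)])

policy = np.full((seq_len, n_tokens), 1.0 / n_tokens)    # per-position categorical policy
X_all, y_all = np.zeros((0, seq_len * n_tokens)), np.zeros(0)

for rnd in range(5):                                     # low-round, large-batch setting
    seqs = sample(policy, 64)
    X_all = np.vstack([X_all, one_hot(seqs)])
    y_all = np.concatenate([y_all, oracle(seqs)])        # expensive measurements

    w, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)    # refit simulator on all data

    for _ in range(20):                                  # offline policy improvement
        cand = sample(policy, 256)
        scores = one_hot(cand) @ w                       # simulator, not the oracle
        elite = cand[np.argsort(scores)[-32:]]           # move policy toward top scorers
        for i in range(seq_len):
            counts = np.bincount(elite[:, i], minlength=n_tokens) + 1.0
            policy[i] = 0.7 * policy[i] + 0.3 * counts / counts.sum()

    print(f"round {rnd}: best measured value so far = {y_all.max():.2f}")
```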

Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware

Title Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware
Authors Anonymous
Abstract With the proliferation of specialized neural network processors that operate on low-precision integers, the performance of Deep Neural Network inference becomes increasingly dependent on the result of quantization. Despite plenty of prior work on the quantization of weights or activations for neural networks, there is still a wide gap between software quantizers and low-precision accelerator implementations, which degrades either the efficiency of the network or that of the hardware, for lack of software and hardware coordination at design time. In this paper, we propose a learned linear symmetric quantizer for integer neural network processors, which not only quantizes neural parameters and activations to low-bit integers but also accelerates hardware inference by using batch normalization fusion and low-precision accumulators (e.g., 16-bit) and multipliers (e.g., 4-bit). We use a unified way to quantize weights and activations, and the results outperform many previous approaches for various networks such as AlexNet, ResNet, and lightweight models like MobileNet, while remaining friendly to the accelerator architecture. In addition, we apply the method to object detection models and observe high performance and accuracy with YOLO-v2. Finally, we deploy the quantized models on our specialized integer-arithmetic-only DNN accelerator to show the effectiveness of the proposed quantizer. We show that even with linear symmetric quantization, the results can be better than asymmetric or non-linear methods in 4-bit networks. In evaluation, the proposed quantizer induces less than a 0.4% accuracy drop in ResNet18, ResNet34, and AlexNet when quantizing the whole network as required by the integer processors.
Tasks Object Detection, Quantization
Published 2020-01-01
URL https://openreview.net/forum?id=H1lBj2VFPS
PDF https://openreview.net/pdf?id=H1lBj2VFPS
PWC https://paperswithcode.com/paper/linear-symmetric-quantization-of-neural
Repo
Framework
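
Linear symmetric quantization maps real values to signed integers with a single per-tensor scale and no zero-point: q = clip(round(x / s), −(2^(b−1)−1), 2^(b−1)−1) and x̂ = s · q. The sketch below shows that quantize/dequantize pair with the simple max-abs scale; the learned quantization parameters, batch-norm fusion, and low-precision accumulators described in the abstract are built on top of this basic mapping and are not shown.

```python
# Sketch: per-tensor linear symmetric quantization to b-bit signed integers.
# The scale here is the simple max-abs choice; in the paper the quantization
# parameters are learned, and batch-norm fusion / low-precision accumulators
# are handled on top of this basic mapping.
import numpy as np

def quantize_symmetric(x, bits=4):
    q_max = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    scale = np.abs(x).max() / q_max
    q = np.clip(np.round(x / scale), -q_max, q_max).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(64, 64)).astype(np.float32)
q, s = quantize_symmetric(w, bits=4)
w_hat = dequantize(q, s)
print(f"unique integer levels: {np.unique(q).size}, "
      f"mean abs error: {np.abs(w - w_hat).mean():.5f}")
```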

BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations

Title BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations
Authors Anonymous
Abstract Binary Neural Networks (BNNs) have been gaining interest thanks to their reduced computing cost and memory footprint. However, BNNs suffer from performance degradation, mainly due to the gradient mismatch caused by binarizing activations. Previous works tried to address the gradient mismatch by reducing the discrepancy between the activation functions used in the forward and backward passes, which is an indirect measure. In this work, we introduce the coordinate discrete gradient (CDG) to better estimate the gradient mismatch. Analysis using the CDG indicates that using higher precision for activations is more effective than modifying the backward pass of the binary activation function. Based on this observation, we propose a new training scheme for binary activation networks, called BinaryDuo, in which two binary activations are coupled into a ternary activation during training. Experimental results show that BinaryDuo outperforms state-of-the-art BNNs on various benchmarks with the same number of parameters and computing cost.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=r1x0lxrFPS
PDF https://openreview.net/pdf?id=r1x0lxrFPS
PWC https://paperswithcode.com/paper/binaryduo-reducing-gradient-mismatch-in
Repo
Framework
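
The coupling described above rests on a simple identity: a ternary activation with levels {0, 1, 2} is exactly the sum of two binary step activations with shifted thresholds, so a network trained with the ternary activation can be decoupled back into binary-activation form. The sketch below only checks that identity numerically; the threshold values are illustrative, and the paper's actual training scheme (gradient estimators, scaling, decoupling of weights) is not shown.

```python
# Sketch: the coupling/decoupling identity behind BinaryDuo-style training --
# a {0, 1, 2}-valued ternary activation equals the sum of two binary step
# activations with shifted thresholds.  Threshold values are illustrative;
# the actual training scheme (gradient estimators, scaling) is not shown.
import numpy as np

rng = np.random.default_rng(0)

def binary_act(x, threshold):
    return (x > threshold).astype(np.float64)

def ternary_act(x):
    # levels {0, 1, 2} with thresholds at -0.5 and +0.5
    return np.clip(np.round(x + 1.0), 0.0, 2.0)

x = rng.normal(size=1000)
coupled = ternary_act(x)                                  # used during training
decoupled = binary_act(x, -0.5) + binary_act(x, 0.5)      # two binary units at inference
assert np.array_equal(coupled, decoupled)
print("identity holds on", x.size, "random inputs")
```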

Transformer-XH: Multi-hop question answering with eXtra Hop attention

Title Transformer-XH: Multi-hop question answering with eXtra Hop attention
Authors Anonymous
Abstract Transformers have obtained significant success modeling natural language as a sequence of text tokens. However, in many real-world scenarios, textual data inherently exhibits structures beyond a linear sequence, such as trees and graphs; an important example is multi-hop question answering, where the evidence required to answer a question is scattered across multiple related documents. This paper presents Transformer-XH, which uses eXtra Hop attention to enable the intrinsic modeling of structured texts in a fully data-driven way. Its new attention mechanism naturally “hops” across the connected text sequences in addition to attending over tokens within each sequence. Thus, Transformer-XH better answers multi-hop questions by propagating information between multiple documents, constructing global contextualized representations, and jointly reasoning over multiple pieces of evidence. This leads to a simpler multi-hop QA system that outperforms the previous state of the art on the HotpotQA FullWiki setting by large margins.
Tasks Question Answering
Published 2020-01-01
URL https://openreview.net/forum?id=r1eIiCNYwS
PDF https://openreview.net/pdf?id=r1eIiCNYwS
PWC https://paperswithcode.com/paper/transformer-xh-multi-hop-question-answering
Repo
Framework
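
The eXtra Hop attention described above adds, on top of standard in-sequence attention, a step in which an anchor representation of each text sequence (for example its first token) attends over the anchors of linked sequences. The sketch below shows that hop step alone, with an adjacency matrix standing in for the document graph; the shapes, the single-head formulation, and how the hop output is mixed back into token representations are my assumptions, not the paper's exact layer.

```python
# Sketch: an "extra hop" attention step over sequence-level anchor vectors.
# Each node (text sequence) attends over the anchors of its graph neighbours.
# Standard token-level attention inside each sequence, multi-head structure,
# and how the hop output is mixed back into the tokens are omitted/assumed.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 5, 16
anchors = rng.normal(size=(n_nodes, d))        # e.g. first-token states per sequence
adj = np.array([[1, 1, 0, 0, 0],               # document graph (self-loops included)
                [1, 1, 1, 0, 0],
                [0, 1, 1, 1, 0],
                [0, 0, 1, 1, 1],
                [0, 0, 0, 1, 1]], dtype=float)

Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
q, k, v = anchors @ Wq, anchors @ Wk, anchors @ Wv

scores = (q @ k.T) / np.sqrt(d)
scores = np.where(adj > 0, scores, -1e9)       # hop only along graph edges
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
hop_out = weights @ v                          # propagated evidence per sequence
print(hop_out.shape)                           # (n_nodes, d)
```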

Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards

Title Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards
Authors Anonymous
Abstract Imitation learning allows agents to learn complex behaviors from demonstrations. However, learning a complex vision-based task may require an impractical number of demonstrations. Meta-imitation learning is a promising approach towards enabling agents to learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. In the presence of task ambiguity or unobserved dynamics, demonstrations alone may not provide enough information; an agent must also try the task to successfully infer a policy. In this work, we propose a method that can learn to learn from both demonstrations and trial-and-error experience with sparse reward feedback. In comparison to meta-imitation, this approach enables the agent to effectively and efficiently improve itself autonomously beyond the demonstration data. In comparison to meta-reinforcement learning, we can scale to substantially broader distributions of tasks, as the demonstration reduces the burden of exploration. Our experiments show that our method significantly outperforms prior approaches on a set of challenging, vision-based control tasks.
Tasks Imitation Learning, Meta-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=SJg5J6NtDr
PDF https://openreview.net/pdf?id=SJg5J6NtDr
PWC https://paperswithcode.com/paper/watch-try-learn-meta-learning-from-1
Repo
Framework