April 3, 2020

# Paper Group AWR 36

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training. Iterate Averaging Helps: An Alternative Perspective in Deep Learning. Bio-Inspired Modality Fusion for Active Speaker Detection. Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications. SimLoss: Class Similarities in Cross Entropy …

#### Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

Title Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
Authors Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao
Abstract Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent called Prevalent. It learns more effectively in new tasks and generalizes better in a previously unseen environment. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state-of-the-art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and “Help, Anna!”, the proposed Prevalent leads to significant improvement over existing methods, achieving a new state of the art.
Published 2020-02-25
URL https://arxiv.org/abs/2002.10638v1
PDF https://arxiv.org/pdf/2002.10638v1.pdf
PWC https://paperswithcode.com/paper/towards-learning-a-generic-agent-for-vision
Repo https://github.com/weituo12321/PREVALENT
Framework none

#### Iterate Averaging Helps: An Alternative Perspective in Deep Learning

Title Iterate Averaging Helps: An Alternative Perspective in Deep Learning
Authors Diego Granziol, Xingchen Wan, Stephen Roberts
Abstract Iterate averaging has a rich history in optimisation, but has only very recently been popularised in deep learning. We investigate its effects in a deep learning context, and argue that previous explanations of its efficacy, which place a high importance on the local geometry (flatness vs sharpness) of final solutions, are not necessarily relevant. We instead argue that the robustness of iterate averaging towards the typically very high estimation noise in deep learning and the various regularisation effects that averaging exerts are the key reasons for the performance gain; indeed, this effect is made even more prominent by the over-parameterisation of modern networks. Inspired by this, we propose Gadam, which combines Adam with iterate averaging to address one of the key problems of adaptive optimisers: that they often generalise worse. Without compromising adaptivity and with minimal additional computational burden, we show that Gadam (and its variant GadamX) achieve a generalisation performance that is consistently superior to tuned SGD and is even on par or better compared to SGD with iterate averaging on various image classification (CIFAR 10/100 and ImageNet 32$\times$32) and language tasks (PTB).
Published 2020-03-02
URL https://arxiv.org/abs/2003.01247v1
PDF https://arxiv.org/pdf/2003.01247v1.pdf
PWC https://paperswithcode.com/paper/iterate-averaging-helps-an-alternative
Framework pytorch
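
The iterate averaging discussed in this abstract can be illustrated with a minimal sketch. It is not Gadam: it runs plain gradient descent on the toy objective f(w) = 0.5 * w**2, and the alternating `noise` term is an assumed, deterministic stand-in for gradient noise. The function names are hypothetical.

```python
def update_average(avg, x, n):
    """Return the mean of n + 1 values, given the mean `avg` of the first n and a new value x."""
    return avg + (x - avg) / (n + 1)


def sgd_with_iterate_averaging(steps=100, lr=0.5):
    """Minimise f(w) = 0.5 * w**2 with a noisy gradient and track the averaged iterate."""
    w, avg = 0.0, 0.0
    for t in range(steps):
        noise = 1.0 if t % 2 == 0 else -1.0  # deterministic stand-in for gradient noise
        grad = w + noise                      # the exact gradient of f is w
        w -= lr * grad
        avg = update_average(avg, w, t)       # running mean of all iterates so far
    return w, avg


last, averaged = sgd_with_iterate_averaging()
print(f"last iterate: {last:.4f}, averaged iterate: {averaged:.4f}")
```

In this toy setting the last iterate keeps oscillating around the optimum at 0, while the averaged iterate ends up much closer to it, which is the noise-robustness argument in miniature.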

#### Bio-Inspired Modality Fusion for Active Speaker Detection

Title Bio-Inspired Modality Fusion for Active Speaker Detection
Authors Gustavo Assunção, Nuno Gonçalves, Paulo Menezes
Abstract Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened enabling, for instance, the well known “cocktail party” and McGurk effects, i.e. speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, Neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.
Published 2020-02-28
URL https://arxiv.org/abs/2003.00063v1
PDF https://arxiv.org/pdf/2003.00063v1.pdf
PWC https://paperswithcode.com/paper/bio-inspired-modality-fusion-for-active
Repo https://github.com/gustavomiguelsa/SCF
Framework none

#### Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications

Title Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Authors Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Perona, Krzysztof Chalupka
Abstract Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available at github.com/bbrattoli/ZeroShotVideoClassification.
Published 2020-03-03
URL https://arxiv.org/abs/2003.01455v3
PDF https://arxiv.org/pdf/2003.01455v3.pdf
PWC https://paperswithcode.com/paper/rethinking-zero-shot-video-classification-end
Repo https://github.com/bbrattoli/ZeroShotVideoClassification
Framework pytorch

#### SimLoss: Class Similarities in Cross Entropy

Title SimLoss: Class Similarities in Cross Entropy
Authors Konstantin Kobs, Michael Steininger, Albin Zehe, Florian Lautenschlager, Andreas Hotho
Abstract One common loss function in neural network classification tasks is Categorical Cross Entropy (CCE), which punishes all misclassifications equally. However, classes often have an inherent structure. For instance, classifying an image of a rose as “violet” is better than as “truck”. We introduce SimLoss, a drop-in replacement for CCE that incorporates class similarities along with two techniques to construct such matrices from task-specific knowledge. We test SimLoss on Age Estimation and Image Classification and find that it brings significant improvements over CCE on several metrics. SimLoss therefore allows for explicit modeling of background knowledge by simply exchanging the loss function, while keeping the neural network architecture the same. Code and additional resources can be found at https://github.com/konstantinkobs/SimLoss.
Published 2020-03-06
URL https://arxiv.org/abs/2003.03182v1
PDF https://arxiv.org/pdf/2003.03182v1.pdf
PWC https://paperswithcode.com/paper/simloss-class-similarities-in-cross-entropy
Repo https://github.com/konstantinkobs/SimLoss
Framework pytorch
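
A minimal sketch of a similarity-aware cross entropy, as one reading of the idea in this abstract: the loss credits probability mass placed on classes similar to the target. The exact formulation in the paper may differ, and the ordinal matrix `r ** |i - j|` is an assumed example of how such a matrix could be built.

```python
import math


def similarity_matrix(num_classes, r):
    """Hypothetical ordinal similarity: S[i][j] = r ** |i - j|, with 0 <= r < 1."""
    return [[r ** abs(i - j) for j in range(num_classes)] for i in range(num_classes)]


def similarity_weighted_loss(probs, targets, sim):
    """Mean of -log(sum_c sim[target][c] * probs[c]) over the batch."""
    total = 0.0
    for p, y in zip(probs, targets):
        credited = sum(s * q for s, q in zip(sim[y], p))
        total -= math.log(credited)
    return total / len(probs)


probs = [[0.2, 0.5, 0.3]]  # predicted class probabilities for one sample
targets = [0]              # index of the true class
plain = similarity_weighted_loss(probs, targets, similarity_matrix(3, 0.0))
soft = similarity_weighted_loss(probs, targets, similarity_matrix(3, 0.5))
print(plain, soft)
```

With `r = 0.0` the matrix is the identity and the loss reduces to ordinary categorical cross entropy; with `r > 0` a prediction that lands on a neighbouring class is penalised less.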

#### The POLAR Framework: Polar Opposites Enable Interpretability of Pre-Trained Word Embeddings

Title The POLAR Framework: Polar Opposites Enable Interpretability of Pre-Trained Word Embeddings
Authors Binny Mathew, Sandipan Sikdar, Florian Lemmerich, Markus Strohmaier
Abstract We introduce POLAR - a framework that adds interpretability to pre-trained word embeddings via the adoption of semantic differentials. Semantic differentials are a psychometric construct for measuring the semantics of a word by analysing its position on a scale between two polar opposites (e.g., cold – hot, soft – hard). The core idea of our approach is to transform existing, pre-trained word embeddings via semantic differentials to a new “polar” space with interpretable dimensions defined by such polar opposites. Our framework also allows for selecting the most discriminative dimensions from a set of polar dimensions provided by an oracle, i.e., an external source. We demonstrate the effectiveness of our framework by deploying it to various downstream tasks, in which our interpretable word embeddings achieve a performance that is comparable to the original word embeddings. We also show that the interpretable dimensions selected by our framework align with human judgement. Together, these results demonstrate that interpretability can be added to word embeddings without compromising performance. Our work is relevant for researchers and engineers interested in interpreting pre-trained word embeddings.
Published 2020-01-27
URL https://arxiv.org/abs/2001.09876v2
PDF https://arxiv.org/pdf/2001.09876v2.pdf
PWC https://paperswithcode.com/paper/the-polar-framework-polar-opposites-enable
Repo https://github.com/Sandipan99/POLAR
Framework none
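
The core idea of mapping embeddings onto scales defined by polar opposites can be sketched as follows. This is a simplification that assumes a plain projection onto each pole-to-pole direction; the paper's actual transformation may differ. The 2-D embeddings are made up for illustration.

```python
import math


def polar_direction(left_vec, right_vec):
    """Unit vector pointing from the left pole (e.g. 'cold') to the right pole (e.g. 'hot')."""
    diff = [r - l for l, r in zip(left_vec, right_vec)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def to_polar_space(word_vec, directions):
    """One interpretable coordinate per pole pair: the projection onto its direction."""
    return [sum(w * d for w, d in zip(word_vec, direction)) for direction in directions]


# Made-up 2-D embeddings, for illustration only.
emb = {
    "cold": [-1.0, 0.0], "hot": [1.0, 0.0],
    "soft": [0.0, -1.0], "hard": [0.0, 1.0],
    "ice": [-0.8, 0.6],
}
directions = [
    polar_direction(emb["cold"], emb["hot"]),
    polar_direction(emb["soft"], emb["hard"]),
]
ice_polar = to_polar_space(emb["ice"], directions)
print(ice_polar)
```

Each coordinate of `ice_polar` can be read directly: a negative value on the first axis places "ice" nearer "cold", and a positive value on the second places it nearer "hard".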

#### Selecting Relevant Features from a Universal Representation for Few-shot Classification

Title Selecting Relevant Features from a Universal Representation for Few-shot Classification
Authors Nikita Dvornik, Cordelia Schmid, Julien Mairal
Abstract Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples. In this work, we propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches. First, we obtain a universal representation by training a set of semantically different feature extractors. Then, given a few-shot learning task, we use our universal feature bank to automatically select the most relevant representations. We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training, which leads to state-of-the-art results on MetaDataset and improved accuracy on mini-ImageNet.
Published 2020-03-20
URL https://arxiv.org/abs/2003.09338v1
PDF https://arxiv.org/pdf/2003.09338v1.pdf
PWC https://paperswithcode.com/paper/selecting-relevant-features-from-a-universal
Repo https://github.com/dvornikita/SUR
Framework none

#### Extreme Classification via Adversarial Softmax Approximation

Title Extreme Classification via Adversarial Softmax Approximation
Authors Robert Bamler, Stephan Mandt
Abstract Training a classifier over a large number of classes, known as ‘extreme classification’, has become a topic of major interest with applications in technology, science, and e-commerce. Traditional softmax regression induces a gradient cost proportional to the number of classes $C$, which often is prohibitively expensive. A popular scalable softmax approximation relies on uniform negative sampling, which suffers from slow convergence due to a poor signal-to-noise ratio. In this paper, we propose a simple training method for drastically enhancing the gradient signal by drawing negative samples from an adversarial model that mimics the data distribution. Our contributions are three-fold: (i) an adversarial sampling mechanism that produces negative samples at a cost only logarithmic in $C$, thus still resulting in cheap gradient updates; (ii) a mathematical proof that this adversarial sampling minimizes the gradient variance while any bias due to non-uniform sampling can be removed; (iii) experimental results on large scale data sets that show a reduction of the training time by an order of magnitude relative to several competitive baselines.
Published 2020-02-15
URL https://arxiv.org/abs/2002.06298v1
PDF https://arxiv.org/pdf/2002.06298v1.pdf
Framework tf

#### Semi-Supervised Neural Architecture Search

Title Semi-Supervised Neural Architecture Search
Authors Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, Tie-Yan Liu
Abstract Neural architecture search (NAS) relies on a good controller to generate better architectures or predict the accuracy of given architectures. However, training the controller requires both abundant and high-quality pairs of architectures and their accuracy, while it is costly to evaluate an architecture and obtain its accuracy. In this paper, we propose SemiNAS, a semi-supervised NAS approach that leverages numerous unlabeled architectures (without evaluation and thus nearly no cost) to improve the controller. Specifically, SemiNAS 1) trains an initial controller with a small set of architecture-accuracy data pairs; 2) uses the trained controller to predict the accuracy of a large number of architectures (without evaluation); and 3) adds the generated data pairs to the original data to further improve the controller. SemiNAS has two advantages: 1) It reduces the computational cost under the same accuracy guarantee. 2) It achieves higher accuracy under the same computational cost. On the NASBench-101 benchmark dataset, it discovers a top 0.01% architecture after evaluating roughly 300 architectures, with only 1/7 of the computational cost of regularized evolution and gradient-based methods. On ImageNet, it achieves a state-of-the-art top-1 error rate of 23.5% (under the mobile setting) using 4 GPU-days for search. We further apply it to the LJSpeech text-to-speech task, where it achieves a 97% intelligibility rate in the low-resource setting and a 15% test error rate in the robustness setting, improvements of 9% and 7% over the baseline respectively. Our code is available at https://github.com/renqianluo/SemiNAS.
Tasks Natural Language Transduction, Neural Architecture Search
Published 2020-02-24
URL https://arxiv.org/abs/2002.10389v2
PDF https://arxiv.org/pdf/2002.10389v2.pdf
PWC https://paperswithcode.com/paper/semi-supervised-neural-architecture-search
Repo https://github.com/renqianluo/SemiNAS
Framework pytorch
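
The three steps listed in this abstract follow a pseudo-labeling pattern, which the toy sketch below walks through. Everything in it is an assumption made for illustration: an "architecture" is encoded as a single number, and a least-squares line stands in for the controller's accuracy predictor.

```python
def fit_line(pairs):
    """Least-squares fit of accuracy ~ a * x + b; a toy stand-in for the controller."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def predict(model, x):
    a, b = model
    return a * x + b


def semi_supervised_round(labeled, unlabeled):
    model = fit_line(labeled)                             # 1) train on evaluated pairs
    pseudo = [(x, predict(model, x)) for x in unlabeled]  # 2) predict without evaluation
    augmented = labeled + pseudo
    return fit_line(augmented), augmented                 # 3) retrain on the union


labeled = [(1.0, 0.60), (2.0, 0.70), (3.0, 0.80)]  # (encoding, measured accuracy)
unlabeled = [1.5, 2.5, 4.0]                        # encodings that were never evaluated
model, augmented = semi_supervised_round(labeled, unlabeled)
print(len(augmented), predict(model, 4.0))
```

With this linear toy model the pseudo-labels lie exactly on the fitted line, so retraining leaves the predictor unchanged; the sketch shows only the data flow, and any gain depends on a richer predictor than the one assumed here.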

#### Watch your Up-Convolution: CNN Based Generative Deep Neural Networks are Failing to Reproduce Spectral Distributions

Title Watch your Up-Convolution: CNN Based Generative Deep Neural Networks are Failing to Reproduce Spectral Distributions
Authors Ricard Durall, Margret Keuper, Janis Keuper
Abstract Generative convolutional deep neural networks, e.g. popular GAN architectures, rely on convolution-based up-sampling methods to produce non-scalar outputs like images or video sequences. In this paper, we show that common up-sampling methods, i.e. up-convolution or transposed convolution, cause the inability of such models to reproduce spectral distributions of natural training data correctly. This effect is independent of the underlying architecture and we show that it can be used to easily detect generated data like deepfakes with up to 100% accuracy on public benchmarks. To overcome this drawback of current generative models, we propose to add a novel spectral regularization term to the training optimization objective. We show that this approach not only allows training spectrally consistent GANs that avoid high-frequency errors, but also that a correct approximation of the frequency spectrum has positive effects on the training stability and output quality of generative networks.
Published 2020-03-03
URL https://arxiv.org/abs/2003.01826v1
PDF https://arxiv.org/pdf/2003.01826v1.pdf
PWC https://paperswithcode.com/paper/watch-your-up-convolution-cnn-based
Repo https://github.com/cc-hpc-itwm/UpConv
Framework pytorch
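
A 1-D toy sketch of why zero-insertion up-sampling can distort a spectrum. It assumes the common description of a stride-2 transposed convolution as zero insertion followed by a convolution, and it shows only the zero-insertion step. It is not the paper's analysis or its regularizer.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def zero_insert_upsample(signal):
    """Double the length by inserting a zero after every sample."""
    out = []
    for x in signal:
        out.extend([x, 0.0])
    return out


smooth = [1.0, 1.0, 1.0, 1.0]  # constant signal: only frequency 0 is present
magnitudes = [abs(c) for c in dft(zero_insert_upsample(smooth))]
print([round(m, 6) for m in magnitudes])
```

The up-sampled signal carries as much energy at the highest frequency bin (index 4) as at frequency 0, even though the input was perfectly smooth. The convolution that follows has to suppress this replica, and any residue shows up as a high-frequency artifact.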

#### Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to Unobserved Confounding

Title Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to Unobserved Confounding
Authors Victor Veitch, Anisha Zaveri
Abstract It is a truth universally acknowledged that an observed association without known mechanism must be in want of a causal estimate. However, causal estimation from observational data often relies on the (untestable) assumption of ‘no unobserved confounding’. Violations of this assumption can induce bias in effect estimates. In principle, such bias could invalidate or reverse the conclusions of a study. However, in some cases, we might hope that the influence of unobserved confounders is weak relative to a ‘large’ estimated effect, so the qualitative conclusions are robust to bias from unobserved confounding. The purpose of this paper is to develop Austen plots, a sensitivity analysis tool to aid such judgments by making it easier to reason about potential bias induced by unobserved confounding. We formalize confounding strength in terms of how strongly the confounder influences treatment assignment and outcome. For a target level of bias, an Austen plot shows the minimum values of treatment and outcome influence required to induce that level of bias. Domain experts can then make subjective judgments about whether such strong confounders are plausible. To aid this judgment, the Austen plot additionally displays the estimated influence strength of (groups of) the observed covariates. Austen plots generalize the classic sensitivity analysis approach of Imbens [Imb03]. Critically, Austen plots allow any approach for modeling the observed data and producing the initial estimate. We illustrate the tool by assessing biases for several real causal inference problems, using a variety of machine learning approaches for the initial data analysis. Code is available at https://github.com/anishazaveri/austen_plots
Published 2020-03-03
URL https://arxiv.org/abs/2003.01747v1
PDF https://arxiv.org/pdf/2003.01747v1.pdf
PWC https://paperswithcode.com/paper/sense-and-sensitivity-analysis-simple-post
Repo https://github.com/anishazaveri/austen_plots
Framework none

#### Neural Bayes: A Generic Parameterization Method for Unsupervised Representation Learning

Title Neural Bayes: A Generic Parameterization Method for Unsupervised Representation Learning
Authors Devansh Arpit, Huan Wang, Caiming Xiong, Richard Socher, Yoshua Bengio
Abstract We introduce a parameterization method called Neural Bayes which allows computing statistical quantities that are in general difficult to compute and opens avenues for formulating new objectives for unsupervised representation learning. Specifically, given an observed random variable $\mathbf{x}$ and a latent discrete variable $z$, we can express $p(\mathbf{x} \mid z)$, $p(z \mid \mathbf{x})$ and $p(z)$ in closed form in terms of a sufficiently expressive function (e.g., a neural network) using our parameterization without restricting the class of these distributions. To demonstrate its usefulness, we develop two independent use cases for this parameterization: 1. Mutual Information Maximization (MIM): MIM has become a popular means for self-supervised representation learning. Neural Bayes allows us to compute mutual information between observed random variables $\mathbf{x}$ and latent discrete random variables $z$ in closed form. We use this for learning image representations and show its usefulness on downstream classification tasks. 2. Disjoint Manifold Labeling: Neural Bayes allows us to formulate an objective which can optimally label samples from disjoint manifolds present in the support of a continuous distribution. This can be seen as a specific form of clustering where each disjoint manifold in the support is a separate cluster. We design clustering tasks that obey this formulation and empirically show that the model optimally labels the disjoint manifolds. Our code is available at https://github.com/salesforce/NeuralBayes
Tasks Representation Learning, Unsupervised Representation Learning
Published 2020-02-20
URL https://arxiv.org/abs/2002.09046v1
PDF https://arxiv.org/pdf/2002.09046v1.pdf
PWC https://paperswithcode.com/paper/neural-bayes-a-generic-parameterization
Repo https://github.com/salesforce/NeuralBayes
Framework pytorch
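
A small sketch of how mutual information between inputs and a discrete latent can be estimated from a batch of posteriors $p(z \mid \mathbf{x})$, such as the softmax outputs of a network. It assumes $p(z)$ is estimated as the batch mean of the posteriors; whether this matches the paper's objective exactly is an assumption, and the function names are hypothetical.

```python
import math


def marginal(posteriors):
    """Estimate p(z = k) as the batch mean of p(z = k | x)."""
    n = len(posteriors)
    return [sum(row[k] for row in posteriors) / n for k in range(len(posteriors[0]))]


def mutual_information(posteriors):
    """Batch estimate of I(x; z) = E_x[sum_k p(k|x) * log(p(k|x) / p(k))], in nats."""
    prior = marginal(posteriors)
    total = 0.0
    for row in posteriors:
        # Terms with p(k|x) = 0 contribute nothing and are skipped to avoid log(0).
        total += sum(p * math.log(p / q) for p, q in zip(row, prior) if p > 0.0)
    return total / len(posteriors)


uninformative = [[0.5, 0.5], [0.5, 0.5]]  # the latent ignores the input
informative = [[1.0, 0.0], [0.0, 1.0]]    # the latent identifies the input
print(mutual_information(uninformative), mutual_information(informative))
```

The estimate is zero when every input receives the same posterior and reaches log 2 (in nats) when two equally frequent inputs are assigned to different latent values with certainty.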

#### Nonlinear classifiers for ranking problems based on kernelized SVM

Title Nonlinear classifiers for ranking problems based on kernelized SVM
Authors Václav Mácha, Lukáš Adam, Václav Šmídl
Abstract Many classification problems focus on maximizing the performance only on the samples with the highest relevance instead of all samples. As an example, we can mention ranking problems, accuracy at the top or search engines where only the top few queries matter. In our previous work, we derived a general framework including several classes of these linear classification problems. In this paper, we extend the framework to nonlinear classifiers. Utilizing a similarity to SVM, we dualize the problems, add kernels and propose a componentwise dual ascent method. This allows us to perform one iteration in less than 20 milliseconds on relatively large datasets such as FashionMNIST.
Published 2020-02-26
URL https://arxiv.org/abs/2002.11436v1
PDF https://arxiv.org/pdf/2002.11436v1.pdf
PWC https://paperswithcode.com/paper/nonlinear-classifiers-for-ranking-problems
Repo https://github.com/VaclavMacha/ClassificationOnTop_new.jl
Framework none

Abstract Neural networks that are based on unfolding of an iterative solver, such as LISTA (learned iterative soft threshold algorithm), are widely used due to their accelerated performance. Nevertheless, as opposed to non-learned solvers, these networks are trained on a certain dictionary, and therefore they are inapplicable for varying model scenarios. This work introduces an adaptive learned solver, termed Ada-LISTA, which receives pairs of signals and their corresponding dictionaries as inputs, and learns a universal architecture to serve them all. We prove that this scheme is guaranteed to solve sparse coding in linear rate for varying models, including dictionary perturbations and permutations. We also provide an extensive numerical study demonstrating its practical adaptation capabilities. Finally, we deploy Ada-LISTA to natural image inpainting, where the patch-masks vary spatially, thus requiring such an adaptation.
Published 2020-01-23
URL https://arxiv.org/abs/2001.08456v2
PDF https://arxiv.org/pdf/2001.08456v2.pdf
Framework pytorch
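
For context, a minimal sketch of classic (non-learned) ISTA, the iterative soft-thresholding solver that LISTA-style networks unfold. Ada-LISTA itself is not implemented here; the function names and the toy identity dictionary are assumptions for illustration.

```python
def soft_threshold(v, t):
    """Shrink v towards zero by t; values with |v| <= t become exactly zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, y, lam, step, iters=50):
    """Classic ISTA for min_x 0.5 * ||y - D x||^2 + lam * ||x||_1 (D is a list of rows).

    `step` should not exceed 1 / L, where L is the largest eigenvalue of D^T D.
    """
    rows, cols = len(D), len(D[0])
    x = [0.0] * cols
    for _ in range(iters):
        # Residual y - D x, then a gradient step on the quadratic term.
        residual = [y[i] - sum(D[i][j] * x[j] for j in range(cols)) for i in range(rows)]
        grad_step = [
            x[j] + step * sum(D[i][j] * residual[i] for i in range(rows))
            for j in range(cols)
        ]
        # Proximal step for the l1 penalty.
        x = [soft_threshold(v, step * lam) for v in grad_step]
    return x


identity = [[1.0, 0.0], [0.0, 1.0]]
code = ista(identity, y=[3.0, -0.5], lam=1.0, step=1.0)
print(code)
```

With an identity dictionary the solution is simply the soft-thresholded signal, so the small second entry is set to zero. Learned variants replace the fixed matrices and thresholds in this loop with trained parameters.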

#### Understanding and Enhancing Mixed Sample Data Augmentation

Title Understanding and Enhancing Mixed Sample Data Augmentation
Authors Ethan Harris, Antonia Marcu, Matthew Painter, Mahesan Niranjan, Adam Prügel-Bennett, Jonathon Hare
Abstract Mixed Sample Data Augmentation (MSDA) has received increasing attention in recent years, with many successful variants such as MixUp and CutMix. Following insight on the efficacy of CutMix in particular, we propose FMix, an MSDA that uses binary masks obtained by applying a threshold to low frequency images sampled from Fourier space. FMix improves performance over MixUp and CutMix for a number of state-of-the-art models across a range of data sets and problem settings. We go on to analyse MixUp, CutMix, and FMix from an information theoretic perspective, characterising learned models in terms of how they progressively compress the input with depth. Ultimately, our analyses allow us to decouple two complementary properties of augmentations, and present a unified framework for reasoning about MSDA. Code for all experiments is available at https://github.com/ecs-vlc/FMix.