Paper Group AWR 45
SelFlow: Self-Supervised Learning of Optical Flow. RUBi: Reducing Unimodal Biases in Visual Question Answering. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering. Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation. Universal Boosting Variational Inference. SummAE: Zero-Shot Abstractive …
SelFlow: Self-Supervised Learning of Optical Flow
Title | SelFlow: Self-Supervised Learning of Optical Flow |
Authors | Pengpeng Liu, Michael Lyu, Irwin King, Jia Xu |
Abstract | We present a self-supervised learning approach for optical flow. Our method distills reliable flow estimations from non-occluded pixels, and uses these predictions as ground truth to learn optical flow for hallucinated occlusions. We further design a simple CNN to utilize temporal information from multiple frames for better flow estimation. These two principles lead to an approach that yields the best performance for unsupervised optical flow learning on challenging benchmarks including MPI Sintel, KITTI 2012 and KITTI 2015. More notably, our self-supervised pre-trained model provides an excellent initialization for supervised fine-tuning. Our fine-tuned models achieve state-of-the-art results on all three datasets. At the time of writing, we achieve EPE=4.26 on the Sintel benchmark, outperforming all submitted methods. |
Tasks | Optical Flow Estimation |
Published | 2019-04-19 |
URL | http://arxiv.org/abs/1904.09117v1 |
PDF | http://arxiv.org/pdf/1904.09117v1.pdf |
PWC | https://paperswithcode.com/paper/selflow-self-supervised-learning-of-optical |
Repo | https://github.com/ppliuboy/SelFlow |
Framework | tf |
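The core self-supervision signal described in the abstract — reliable flow at non-occluded pixels used as pseudo ground truth for hallucinated occlusions — can be illustrated with a few lines of array code. This is a minimal sketch, not the authors' TensorFlow implementation; the flow fields, occlusion mask, and hallucination mask are assumed inputs, and the robust penalty is a common choice rather than the paper's exact loss.

```python
import numpy as np

def self_supervision_loss(teacher_flow, student_flow, noc_mask, hallucinated_mask,
                          eps=0.01, q=0.4):
    """Robust (Charbonnier-style) loss between student and frozen teacher flow,
    computed only where the teacher is reliable (non-occluded) but the student's
    input was artificially occluded. Flows: (H, W, 2); masks: (H, W)."""
    mask = noc_mask * hallucinated_mask          # pixels that supervise the student
    diff = student_flow - teacher_flow           # teacher flow acts as ground truth
    penalty = (diff ** 2 + eps ** 2) ** q
    return (penalty.sum(axis=-1) * mask).sum() / (mask.sum() + 1e-8)

# toy usage with random tensors
H, W = 64, 64
teacher = np.random.randn(H, W, 2).astype(np.float32)
student = teacher + 0.1 * np.random.randn(H, W, 2).astype(np.float32)
noc = (np.random.rand(H, W) > 0.2).astype(np.float32)    # non-occluded w.r.t. teacher
hall = (np.random.rand(H, W) > 0.7).astype(np.float32)   # regions noised in the student input
print(self_supervision_loss(teacher, student, noc, hall))
```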
RUBi: Reducing Unimodal Biases in Visual Question Answering
Title | RUBi: Reducing Unimodal Biases in Visual Question Answering |
Authors | Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, Devi Parikh |
Abstract | Visual Question Answering (VQA) is the task of answering questions about an image. Some VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image. It implicitly forces the VQA model to use the two input modalities instead of relying on statistical regularities between the question and the answer. We leverage a question-only model that captures the language biases by identifying when these unwanted regularities are used. It prevents the base VQA model from learning them by influencing its predictions. This leads to dynamically adjusting the loss in order to compensate for biases. We validate our contributions by surpassing the current state-of-the-art results on VQA-CP v2. This dataset is specifically designed to assess the robustness of VQA models when exposed to different question biases at test time than what was seen during training. Our code is available: github.com/cdancette/rubi.bootstrap.pytorch |
Tasks | Question Answering, Visual Question Answering |
Published | 2019-06-24 |
URL | https://arxiv.org/abs/1906.10169v2 |
PDF | https://arxiv.org/pdf/1906.10169v2.pdf |
PWC | https://paperswithcode.com/paper/rubi-reducing-unimodal-biases-in-visual |
Repo | https://github.com/cdancette/rubi.bootstrap.pytorch |
Framework | pytorch |
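The key mechanism in the abstract — a question-only branch that captures language bias and modulates the base VQA model's predictions — amounts to masking the logits and adjusting the loss. A rough sketch of that fusion, assuming precomputed logits; the linked repo contains the actual training details.

```python
import numpy as np

def softmax_ce(logits, target):
    """Cross-entropy of a single example against an integer class target."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def rubi_losses(vqa_logits, q_only_logits, target):
    """RUBi-style fusion: the question-only branch produces a sigmoid mask that
    scales the base VQA logits; examples answerable from the question alone get
    confident masks and therefore contribute smaller gradients to the base model.
    Returns (fused_loss, question_only_loss), both of which are optimized jointly."""
    mask = 1.0 / (1.0 + np.exp(-q_only_logits))   # sigmoid over answer logits
    fused_logits = vqa_logits * mask
    return softmax_ce(fused_logits, target), softmax_ce(q_only_logits, target)

vqa = np.random.randn(3000)       # logits over a hypothetical 3000-answer vocabulary
q_only = np.random.randn(3000)
print(rubi_losses(vqa, q_only, target=42))
```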
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
Title | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
Authors | Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, Dacheng Tao |
Abstract | Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, VideoQA datasets remain small in scale and are often automatically generated, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos. The dataset is available at https://github.com/MILVLG/activitynet-qa |
Tasks | Question Answering, Video Question Answering, Visual Question Answering |
Published | 2019-06-06 |
URL | https://arxiv.org/abs/1906.02467v1 |
PDF | https://arxiv.org/pdf/1906.02467v1.pdf |
PWC | https://paperswithcode.com/paper/activitynet-qa-a-dataset-for-understanding |
Repo | https://github.com/MILVLG/activitynet-qa |
Framework | none |
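Since this is a dataset paper, the natural code to pair with it is evaluation rather than modeling. Below is a tiny sketch of exact-match accuracy over QA records; the (video_id, question, answer, type) record layout is hypothetical — consult the linked repo for the actual file format.

```python
from collections import defaultdict

def exact_match_accuracy(records, predictions):
    """records: iterable of dicts with 'video_id', 'question', 'answer', 'type'.
    predictions: dict mapping (video_id, question) -> predicted answer string.
    Returns overall accuracy and per-question-type accuracy."""
    correct, total = 0, 0
    by_type = defaultdict(lambda: [0, 0])
    for r in records:
        pred = predictions.get((r["video_id"], r["question"]), "")
        hit = int(pred.strip().lower() == r["answer"].strip().lower())
        correct += hit
        total += 1
        by_type[r["type"]][0] += hit
        by_type[r["type"]][1] += 1
    return correct / max(total, 1), {t: c / n for t, (c, n) in by_type.items()}

records = [{"video_id": "v_000001", "question": "what is the person doing",
            "answer": "surfing", "type": "motion"}]
print(exact_match_accuracy(records, {("v_000001", "what is the person doing"): "surfing"}))
```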
Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation
Title | Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation |
Authors | Yifeng Jiang, Tom Van Wouwe, Friedl De Groote, C. Karen Liu |
Abstract | Using joint actuators to drive skeletal movements is a common practice in character animation, but the resultant torque patterns are often unnatural or infeasible for real humans to achieve. On the other hand, physiologically-based models explicitly simulate muscles and tendons and thus produce more human-like movements and torque patterns. This paper introduces a technique to transform an optimal control problem formulated in the muscle-actuation space to an equivalent problem in the joint-actuation space, such that the solutions to both problems have the same optimal value. By solving the equivalent problem in the joint-actuation space, we can generate human-like motions comparable to those generated by musculotendon models, while retaining the benefit of simple modeling and fast computation offered by joint-actuation models. Our method transforms constant bounds on muscle activations to nonlinear, state-dependent torque limits in the joint-actuation space. In addition, the metabolic energy function on muscle activations is transformed to a nonlinear function of joint torques, joint configuration and joint velocity. Our technique can also benefit policy optimization using deep reinforcement learning approaches, by providing a more anatomically realistic action space for the agent to explore during the learning process. We take advantage of the physiologically-based simulator, OpenSim, to provide training data for learning the torque limits and the metabolic energy function. Once trained, the same torque limits and the energy function can be applied to drastically different motor tasks formulated as either trajectory optimization or policy learning. Codebase: https://github.com/jyf588/lrle and https://github.com/jyf588/lrle-rl-examples |
Tasks | |
Published | 2019-04-30 |
URL | https://arxiv.org/abs/1904.13041v2 |
PDF | https://arxiv.org/pdf/1904.13041v2.pdf |
PWC | https://paperswithcode.com/paper/synthesis-of-biologically-realistic-human |
Repo | https://github.com/jyf588/lrle-rl-examples |
Framework | tf |
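The central transformation in the abstract — replacing constant muscle-activation bounds with state-dependent torque limits and a torque-based energy cost — can be sketched as a clamp applied inside an RL or trajectory-optimization step. The limit and energy functions below are crude stand-ins for the models the authors fit to OpenSim data, not the paper's actual fits.

```python
import numpy as np

def torque_limits(q, qdot):
    """Placeholder for a learned model mapping joint configuration and velocity
    to (lower, upper) torque bounds; the real limits are learned from OpenSim data."""
    upper = 80.0 * (1.0 - 0.3 * np.tanh(np.abs(qdot)))   # weaker at higher joint speed
    return -upper, upper

def metabolic_cost(tau, q, qdot):
    """Placeholder for the learned energy function of torque, configuration, velocity."""
    return np.sum(np.abs(tau * qdot)) + 0.01 * np.sum(tau ** 2)

def step_action(raw_action, q, qdot):
    """Clamp a policy's raw torque action to the state-dependent feasible range
    and report the associated energy penalty."""
    lo, hi = torque_limits(q, qdot)
    tau = np.clip(raw_action, lo, hi)
    return tau, metabolic_cost(tau, q, qdot)

q = np.zeros(10)
qdot = np.random.randn(10)
print(step_action(100 * np.random.randn(10), q, qdot))
```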
Universal Boosting Variational Inference
Title | Universal Boosting Variational Inference |
Authors | Trevor Campbell, Xinglong Li |
Abstract | Boosting variational inference (BVI) approximates an intractable probability density by iteratively building up a mixture of simple component distributions one at a time, using techniques from sparse convex optimization to provide both computational scalability and approximation error guarantees. But the guarantees have strong conditions that do not often hold in practice, resulting in degenerate component optimization problems; and we show that the ad-hoc regularization used to prevent degeneracy in practice can cause BVI to fail in unintuitive ways. We thus develop universal boosting variational inference (UBVI), a BVI scheme that exploits the simple geometry of probability densities under the Hellinger metric to prevent the degeneracy of other gradient-based BVI methods, avoid difficult joint optimizations of both component and weight, and simplify fully-corrective weight optimizations. We show that for any target density and any mixture component family, the output of UBVI converges to the best possible approximation in the mixture family, even when the mixture family is misspecified. We develop a scalable implementation based on exponential family mixture components and standard stochastic optimization techniques. Finally, we discuss statistical benefits of the Hellinger distance as a variational objective through bounds on posterior probability, moment, and importance sampling errors. Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations. |
Tasks | Stochastic Optimization |
Published | 2019-06-04 |
URL | https://arxiv.org/abs/1906.01235v2 |
PDF | https://arxiv.org/pdf/1906.01235v2.pdf |
PWC | https://paperswithcode.com/paper/universal-boosting-variational-inference |
Repo | https://github.com/trevorcampbell/ubvi |
Framework | none |
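To make the "one component at a time under the Hellinger metric" idea concrete, here is a toy 1-D demo that greedily grows a Gaussian mixture by grid search on the squared Hellinger distance to a fixed target. This is illustrative only: UBVI's actual updates operate on square-root densities with principled component and weight optimizations, so treat the repo as the reference implementation.

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def normal(mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

target = 0.6 * normal(-2.0, 0.7) + 0.4 * normal(3.0, 1.5)   # density to approximate

def hellinger2(p, q):
    """Squared Hellinger distance evaluated numerically on the grid."""
    return 1.0 - np.sum(np.sqrt(p * q)) * dx

mixture = np.zeros_like(x)
for step in range(4):
    best = None
    for mu in np.linspace(-6, 6, 25):            # candidate component means
        for s in (0.5, 1.0, 2.0):                # candidate scales
            for w in (0.25, 0.5, 0.75, 1.0):     # candidate weight of the new component
                cand = (1 - w) * mixture + w * normal(mu, s)
                d = hellinger2(target, cand)
                if best is None or d < best[0]:
                    best = (d, cand)
    mixture = best[1]
    print(f"components={step + 1}  squared Hellinger={best[0]:.4f}")
```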
SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders
Title | SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders |
Authors | Peter J. Liu, Yu-An Chung, Jie Ren |
Abstract | We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced. |
Tasks | Abstractive Text Summarization, Denoising, Text Summarization |
Published | 2019-10-02 |
URL | https://arxiv.org/abs/1910.00998v1 |
PDF | https://arxiv.org/pdf/1910.00998v1.pdf |
PWC | https://paperswithcode.com/paper/summae-zero-shot-abstractive-text |
Repo | https://github.com/DataScienceNigeria/SummAE-from-Google-Brain-and-MIT-CSAIL |
Framework | none |
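The architectural idea — one encoder/decoder pair whose latent space is shared by sentences and paragraphs, with summarization performed by decoding a single sentence from a paragraph's code — can be sketched compactly in PyTorch. This is a schematic, not the authors' model, which adds denoising, pre-training, and specific decoding choices.

```python
import torch
import torch.nn as nn

class SharedSpaceAE(nn.Module):
    """Auto-encoder with a single latent space for both sentences and paragraphs."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, tokens):                     # tokens: (batch, seq_len) ids
        _, h = self.encoder(self.embed(tokens))   # h: (1, batch, hid_dim) shared latent
        return h

    def decode_greedy(self, h, bos_id, max_len=30):
        """Greedily decode one short sequence (a 'sentence') from a latent code h;
        end-of-sequence handling is omitted for brevity."""
        tok = torch.full((h.size(1), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            o, h = self.decoder(self.embed(tok), h)
            tok = self.out(o[:, -1]).argmax(-1, keepdim=True)
            outputs.append(tok)
        return torch.cat(outputs, dim=1)

model = SharedSpaceAE(vocab_size=1000)
paragraph = torch.randint(4, 1000, (2, 60))       # two toy "paragraphs" of 60 token ids
summary_ids = model.decode_greedy(model.encode(paragraph), bos_id=1)
print(summary_ids.shape)                          # (2, 30): one decoded sequence each
```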
Answers Unite! Unsupervised Metrics for Reinforced Summarization Models
Title | Answers Unite! Unsupervised Metrics for Reinforced Summarization Models |
Authors | Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano |
Abstract | Abstractive summarization approaches based on Reinforcement Learning (RL) have recently been proposed to overcome classical likelihood maximization. RL makes it possible to consider complex, possibly non-differentiable metrics that globally assess the quality and relevance of the generated outputs. ROUGE, the most widely used summarization metric, is known to suffer from bias towards lexical similarity as well as from suboptimal accounting for fluency and readability of the generated abstracts. We thus explore and propose alternative evaluation measures: the reported human-evaluation analysis shows that the proposed metrics, based on Question Answering, compare favorably to ROUGE, with the additional property of not requiring reference summaries. Training an RL-based model on these metrics leads to improvements (in terms of both human and automated metrics) over current approaches that use ROUGE as a reward. |
Tasks | Abstractive Text Summarization, Question Answering |
Published | 2019-09-04 |
URL | https://arxiv.org/abs/1909.01610v1 |
PDF | https://arxiv.org/pdf/1909.01610v1.pdf |
PWC | https://paperswithcode.com/paper/answers-unite-unsupervised-metrics-for |
Repo | https://github.com/recitalAI/summa-qa |
Framework | pytorch |
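The metric family described here scores a summary by how well a QA model can recover answers to questions derived from the source, with no reference summary required. A minimal sketch using the Hugging Face question-answering pipeline; the question–answer pairs are assumed to have been generated from the source document already (the linked repo covers that step), and the scoring function here is a generic token F1 rather than the paper's exact formulation.

```python
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering")   # downloads a default extractive QA model

def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def qa_based_score(summary, source_qa_pairs):
    """Average F1 of QA-model answers read off the summary, against answers
    extracted from the source. Higher = the summary preserves more content."""
    scores = []
    for question, gold_answer in source_qa_pairs:
        pred = qa(question=question, context=summary)["answer"]
        scores.append(token_f1(pred, gold_answer))
    return sum(scores) / max(len(scores), 1)

pairs = [("Who announced the merger?", "Acme Corp"), ("When does it close?", "next quarter")]
print(qa_based_score("Acme Corp announced a merger expected to close next quarter.", pairs))
```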
FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things
Title | FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things |
Authors | Xiaying Wang, Michele Magno, Lukas Cavigelli, Luca Benini |
Abstract | The growing number of low-power smart devices in the Internet of Things is coupled with the concept of “Edge Computing”, that is, moving some of the intelligence, especially machine learning, towards the edge of the network. Enabling machine learning algorithms to run on resource-constrained hardware, typically on low-power smart devices, is challenging in terms of hardware (optimized and energy-efficient integrated circuits), algorithms, and firmware implementations. This paper presents FANN-on-MCU, an open-source toolkit built upon the Fast Artificial Neural Network (FANN) library to run lightweight and energy-efficient neural networks on microcontrollers based on both the ARM Cortex-M series and the novel RISC-V-based Parallel Ultra-Low-Power (PULP) platform. The toolkit takes multi-layer perceptrons trained with FANN and generates code targeted at execution on low-power microcontrollers either with a floating-point unit (i.e., ARM Cortex-M4F and M7F) or without one (i.e., ARM Cortex-M0 to M3 or PULP-based processors). This paper also provides an architectural performance evaluation of neural networks on the most popular ARM Cortex-M family and the parallel RISC-V processor called Mr. Wolf. The evaluation includes experimental results for three different applications using a self-sustainable wearable multi-sensor bracelet. Experimental results show a measured latency on the order of only a few microseconds and a power consumption of a few milliwatts, while keeping the memory requirements below the limitations of the targeted microcontrollers. In particular, the parallel implementation on the octa-core RISC-V platform reaches a speedup of 22x and a 69% reduction in energy consumption with respect to a single-core implementation on the Cortex-M4 for continuous real-time classification. |
Tasks | |
Published | 2019-11-08 |
URL | https://arxiv.org/abs/1911.03314v2 |
PDF | https://arxiv.org/pdf/1911.03314v2.pdf |
PWC | https://paperswithcode.com/paper/fann-on-mcu-an-open-source-toolkit-for-energy |
Repo | https://github.com/pulp-platform/fann-on-mcu |
Framework | none |
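On FPU-less Cortex-M cores, the MLP has to run in fixed-point arithmetic. The snippet below illustrates that kind of Q-format inference in Python — quantize weights and activations to integers, accumulate in a wider type, shift back down — as a mental model of what integer-only inference involves, not the toolkit's actual generated C code.

```python
import numpy as np

FRAC_BITS = 8                                   # Q8 fixed point: value = int / 2**8

def to_fixed(x):
    return np.round(x * (1 << FRAC_BITS)).astype(np.int32)

def relu(x):
    return np.maximum(x, 0)

def fixed_point_mlp(x_fix, layers):
    """layers: list of (W_fix, b_fix) arrays quantized with to_fixed().
    Accumulate in int64, then shift right to stay in Q8 after each matmul."""
    a = x_fix.astype(np.int64)
    for i, (W, b) in enumerate(layers):
        a = (a @ W.T.astype(np.int64)) >> FRAC_BITS    # rescale product back to Q8
        a = a + b
        if i < len(layers) - 1:
            a = relu(a)
    return a

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)
x = rng.normal(size=8)

float_out = W2 @ relu(W1 @ x + b1) + b2
fixed_out = fixed_point_mlp(to_fixed(x),
                            [(to_fixed(W1), to_fixed(b1)),
                             (to_fixed(W2), to_fixed(b2))]) / (1 << FRAC_BITS)
print(np.max(np.abs(float_out - fixed_out)))   # small quantization error
```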
Hierarchical Reinforcement Learning for Open-Domain Dialog
Title | Hierarchical Reinforcement Learning for Open-Domain Dialog |
Authors | Abdelrhman Saleh, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Rosalind Picard |
Abstract | Open-domain dialog generation is a challenging problem; maximum likelihood training can lead to repetitive outputs, models have difficulty tracking long-term conversational goals, and training on standard movie or online datasets may lead to the generation of inappropriate, biased, or offensive text. Reinforcement Learning (RL) is a powerful framework that could potentially address these issues, for example by allowing a dialog model to optimize for reducing toxicity and repetitiveness. However, previous approaches which apply RL to open-domain dialog generation do so at the word level, making it difficult for the model to learn proper credit assignment for long-term conversational rewards. In this paper, we propose a novel approach to hierarchical reinforcement learning, VHRL, which uses policy gradients to tune the utterance-level embedding of a variational sequence model. This hierarchical approach provides greater flexibility for learning long-term, conversational rewards. We use self-play and RL to optimize for a set of human-centered conversation metrics, and show that our approach provides significant improvements – in terms of both human evaluation and automatic metrics – over state-of-the-art dialog models, including Transformers. |
Tasks | Hierarchical Reinforcement Learning |
Published | 2019-09-17 |
URL | https://arxiv.org/abs/1909.07547v3 |
PDF | https://arxiv.org/pdf/1909.07547v3.pdf |
PWC | https://paperswithcode.com/paper/hierarchical-reinforcement-learning-for-open |
Repo | https://github.com/natashamjaques/neural_chat |
Framework | pytorch |
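The key shift described in the abstract is applying the policy gradient at the utterance level rather than the word level, so a whole generated utterance is the action that receives the conversational reward. Below is a bare-bones REINFORCE surrogate at that granularity, assuming the dialog model exposes per-utterance log-probabilities; the authors' VHRL additionally ties this update to the latent variables of a variational hierarchical model.

```python
import torch

def utterance_level_pg_loss(utterance_logprobs, rewards, baseline=None):
    """REINFORCE surrogate with one action per utterance.
    utterance_logprobs: (num_utterances,) log p(utterance | context), requires grad.
    rewards: (num_utterances,) conversation-level rewards assigned to each utterance."""
    if baseline is None:
        baseline = rewards.mean()                      # simple variance-reduction baseline
    advantage = (rewards - baseline).detach()
    return -(advantage * utterance_logprobs).sum()     # minimizing this ascends the reward

# toy usage: pretend the log-probs came from summing word log-probs within each utterance
logprobs = torch.tensor([-12.3, -9.8, -15.1], requires_grad=True)
rewards = torch.tensor([0.2, 0.9, 0.1])                # e.g. low-toxicity / low-repetition scores
loss = utterance_level_pg_loss(logprobs, rewards)
loss.backward()
print(logprobs.grad)
```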
Improved Embeddings with Easy Positive Triplet Mining
Title | Improved Embeddings with Easy Positive Triplet Mining |
Authors | Hong Xuan, Abby Stylianou, Robert Pless |
Abstract | Deep metric learning seeks to define an embedding where semantically similar images are embedded to nearby locations, and semantically dissimilar images are embedded to distant locations. Substantial work has focused on loss functions and strategies to learn these embeddings by pushing images from the same class as close together in the embedding space as possible. In this paper, we propose an alternative, loosened embedding strategy that requires only that the embedding function map each training image to the most similar examples from the same class, an approach we call “Easy Positive” mining. We provide a collection of experiments and visualizations highlighting that this Easy Positive mining leads to embeddings that are more flexible and generalize better to new unseen data. This simple mining strategy yields recall performance that exceeds that of state-of-the-art approaches (including those with complicated loss functions and ensemble methods) on image retrieval datasets including CUB, Stanford Online Products, In-Shop Clothes and Hotels-50K. |
Tasks | Image Retrieval, Metric Learning |
Published | 2019-04-08 |
URL | https://arxiv.org/abs/1904.04370v2 |
PDF | https://arxiv.org/pdf/1904.04370v2.pdf |
PWC | https://paperswithcode.com/paper/improved-embeddings-with-easy-positive |
Repo | https://github.com/littleredxh/EasyPositiveHardNegative |
Framework | pytorch |
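Easy-positive mining is simple to state in code: within a batch, pair each anchor with its most similar same-class example (the "easy" positive) rather than the hardest one, then apply an NCA-style softmax over the negatives. A small numpy sketch of that batch-mining step, not the authors' full training pipeline.

```python
import numpy as np

def easy_positive_nca_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, D), assumed L2-normalized; labels: (N,) ints.
    For each anchor, the positive is the *most similar* same-class example."""
    sim = embeddings @ embeddings.T                    # cosine similarities
    N = len(labels)
    losses = []
    for i in range(N):
        pos_mask = (labels == labels[i]) & (np.arange(N) != i)
        neg_mask = labels != labels[i]
        if not pos_mask.any() or not neg_mask.any():
            continue
        easy_pos = sim[i][pos_mask].max()              # easiest (closest) positive
        logits = np.concatenate(([easy_pos], sim[i][neg_mask])) / temperature
        logits -= logits.max()
        losses.append(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
emb = rng.normal(size=(32, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
labels = rng.integers(0, 8, size=32)
print(easy_positive_nca_loss(emb, labels))
```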
SG-Net: Syntax-Guided Machine Reading Comprehension
Title | SG-Net: Syntax-Guided Machine Reading Comprehension |
Authors | Zhuosheng Zhang, Yuwei Wu, Junru Zhou, Sufeng Duan, Hai Zhao, Rui Wang |
Abstract | For machine reading comprehension, the capacity to effectively model linguistic knowledge from detail-riddled and lengthy passages and to get rid of noise is essential for improving performance. Traditional attentive models attend to all words without explicit constraint, which results in inaccurate concentration on dispensable words. In this work, we propose using syntax to guide text modeling by incorporating explicit syntactic constraints into the attention mechanism for better linguistically motivated word representations. In detail, for the self-attention network (SAN) of a Transformer-based encoder, we introduce a syntactic dependency of interest (SDOI) design into the SAN to form an SDOI-SAN with syntax-guided self-attention. The syntax-guided network (SG-Net) is then composed of this extra SDOI-SAN and the SAN from the original Transformer encoder through a dual contextual architecture for better linguistically inspired representations. To verify its effectiveness, the proposed SG-Net is applied to the typical pre-trained language model BERT, which is based on a Transformer encoder. Extensive experiments on popular benchmarks including SQuAD 2.0 and RACE show that the proposed SG-Net design helps achieve substantial performance improvements over strong baselines. |
Tasks | Language Modelling, Machine Reading Comprehension, Question Answering, Reading Comprehension |
Published | 2019-08-14 |
URL | https://arxiv.org/abs/1908.05147v3 |
PDF | https://arxiv.org/pdf/1908.05147v3.pdf |
PWC | https://paperswithcode.com/paper/sg-net-syntax-guided-machine-reading |
Repo | https://github.com/cooelf/SG-Net |
Framework | pytorch |
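The SDOI idea restricts which tokens a word may attend to based on the dependency parse. Below is a minimal sketch of syntax-masked scaled dot-product attention, where the mask keeps each token plus its dependency ancestors; the actual SG-Net combines this SDOI-SAN with the vanilla SAN in a dual architecture rather than replacing it.

```python
import numpy as np

def sdoi_mask(heads):
    """heads[i] = index of token i's dependency head (-1 for the root).
    Token i may attend to itself and all of its ancestors in the parse tree."""
    n = len(heads)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = heads[j]
    return mask

def syntax_guided_attention(Q, K, V, mask):
    """Scaled dot-product attention with disallowed positions pushed to -inf."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# toy sentence of 5 tokens with a small dependency tree (token 2 is the root)
heads = [2, 0, -1, 2, 3]
X = np.random.randn(5, 16)
out = syntax_guided_attention(X, X, X, sdoi_mask(heads))
print(out.shape)   # (5, 16)
```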
Semantics-aware BERT for Language Understanding
Title | Semantics-aware BERT for Language Understanding |
Authors | Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, Xiang Zhou |
Abstract | The latest work on language representations carefully integrates contextualized features into language model training, which has enabled a series of successes, especially in various machine reading comprehension and natural language inference tasks. However, existing language representation models, including ELMo, GPT and BERT, only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information, which can provide rich semantics for language representation. To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling, and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor, requiring only light fine-tuning and no substantial task-specific modifications. Compared with BERT, SemBERT is as simple in concept but more powerful. It obtains new state-of-the-art results or substantially improves on existing results across ten reading comprehension and language inference tasks. |
Tasks | Language Modelling, Machine Reading Comprehension, Natural Language Inference, Question Answering, Reading Comprehension, Semantic Role Labeling, Word Embeddings |
Published | 2019-09-05 |
URL | https://arxiv.org/abs/1909.02209v3 |
PDF | https://arxiv.org/pdf/1909.02209v3.pdf |
PWC | https://paperswithcode.com/paper/semantics-aware-bert-for-language |
Repo | https://github.com/cooelf/SemBERT |
Framework | pytorch |
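Mechanically, the fusion described in the abstract attaches an embedding of the predicted semantic-role label sequence to each token's BERT representation before the task head. A compressed sketch of that fusion for a single SRL labeling per sentence; the real model aggregates multiple predicate-specific label sequences and handles subword alignment, which this omits.

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """Concatenate contextual token features with learned SRL-tag embeddings."""
    def __init__(self, hidden_dim=768, num_srl_tags=30, tag_dim=16):
        super().__init__()
        self.tag_embed = nn.Embedding(num_srl_tags, tag_dim)
        self.project = nn.Linear(hidden_dim + tag_dim, hidden_dim)

    def forward(self, token_feats, srl_tags):
        # token_feats: (batch, seq, hidden) from BERT; srl_tags: (batch, seq) tag ids
        fused = torch.cat([token_feats, self.tag_embed(srl_tags)], dim=-1)
        return torch.relu(self.project(fused))

fusion = SemanticFusion()
bert_out = torch.randn(2, 12, 768)        # stand-in for BERT's last hidden states
tags = torch.randint(0, 30, (2, 12))      # stand-in for predicted SRL label ids
print(fusion(bert_out, tags).shape)       # (2, 12, 768)
```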
3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents
Title | 3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents |
Authors | Ue-Hwan Kim, Jin-Man Park, Taek-Jin Song, Jong-Hwan Kim |
Abstract | Intelligent agents gather information and perceive semantics within their environments before taking on given tasks. The agents store the collected information in the form of environment models that compactly represent the surrounding environments. Without an efficient and effective environment model, however, the agents can only conduct limited tasks. Thus, such an environment model plays a crucial role in the autonomy of intelligent agents. We claim the following characteristics for a versatile environment model: accuracy, applicability, usability, and scalability. Although a number of researchers have attempted to develop models that represent environments precisely to a certain degree, these models lack broad applicability, intuitive usability, and satisfactory scalability. To tackle these limitations, we propose the 3-D scene graph as an environment model, along with a 3-D scene graph construction framework. The concise and widely used graph structure readily guarantees the usability as well as scalability of the 3-D scene graph. We demonstrate the accuracy and applicability of the 3-D scene graph by deploying it in practical applications. Moreover, we verify the performance of the proposed 3-D scene graph and framework by conducting a series of comprehensive experiments under various conditions. |
Tasks | graph construction |
Published | 2019-08-14 |
URL | https://arxiv.org/abs/1908.04929v1 |
PDF | https://arxiv.org/pdf/1908.04929v1.pdf |
PWC | https://paperswithcode.com/paper/3-d-scene-graph-a-sparse-and-semantic |
Repo | https://github.com/Uehwan/3-D-Scene-Graph |
Framework | pytorch |
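As a data structure, the 3-D scene graph in the abstract reduces to object nodes carrying class and 3-D pose plus relation edges between them. A skeletal Python representation, purely to illustrate the kind of sparse model the framework constructs; the field names are illustrative, not the repo's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObjectNode:
    node_id: int
    category: str                           # e.g. "chair"
    position: Tuple[float, float, float]    # 3-D centroid in the map frame
    attributes: Dict[str, str] = field(default_factory=dict)   # e.g. {"color": "red"}

@dataclass
class RelationEdge:
    subject_id: int
    object_id: int
    predicate: str                          # e.g. "on", "next_to"

@dataclass
class SceneGraph3D:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: List[RelationEdge] = field(default_factory=list)

    def add_object(self, node: ObjectNode):
        self.nodes[node.node_id] = node

    def add_relation(self, edge: RelationEdge):
        self.edges.append(edge)

    def neighbors(self, node_id: int) -> List[int]:
        return [e.object_id for e in self.edges if e.subject_id == node_id]

g = SceneGraph3D()
g.add_object(ObjectNode(0, "table", (1.0, 0.5, 0.0)))
g.add_object(ObjectNode(1, "cup", (1.0, 0.5, 0.4), {"color": "blue"}))
g.add_relation(RelationEdge(1, 0, "on"))
print(g.neighbors(1))   # [0]
```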
Domain Adaptation in Multi-Channel Autoencoder based Features for Robust Face Anti-Spoofing
Title | Domain Adaptation in Multi-Channel Autoencoder based Features for Robust Face Anti-Spoofing |
Authors | Olegs Nikisins, Anjith George, Sebastien Marcel |
Abstract | While the performance of face recognition systems has improved significantly in the last decade, they have proved to be highly vulnerable to presentation attacks (spoofing). Most research in the field of face presentation attack detection (PAD) has focused on boosting the performance of systems within a single database. Face PAD datasets are usually captured with RGB cameras, and have a very limited number of both bona-fide samples and presentation attack instruments. Training face PAD systems on such data leads to poor performance, even in the closed-set scenario, especially when sophisticated attacks are involved. We explore two paths to boost the performance of the face PAD system against challenging attacks. First, we use multi-channel (RGB, Depth and NIR) data, which is still easily accessible in a number of mass-production devices. Second, we develop a novel Autoencoders + MLP based face PAD algorithm. Moreover, instead of collecting more data for training the proposed deep architecture, a domain adaptation technique is proposed, transferring knowledge of facial appearance from the RGB to the multi-channel domain. We also demonstrate that learning the features of individual facial regions is more discriminative than learning features from an entire face. The proposed system is tested on a very recent publicly available multi-channel PAD database with a wide variety of presentation attacks. |
Tasks | Domain Adaptation, Face Anti-Spoofing, Face Presentation Attack Detection, Face Recognition |
Published | 2019-07-09 |
URL | https://arxiv.org/abs/1907.04048v1 |
PDF | https://arxiv.org/pdf/1907.04048v1.pdf |
PWC | https://paperswithcode.com/paper/domain-adaptation-in-multi-channel |
Repo | https://github.com/anjith2006/bob.paper.mcae.icb2019 |
Framework | none |
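The pipeline in the abstract — autoencoders trained per facial region, then an MLP classifier on their bottleneck features, with the encoders first pre-trained on RGB faces and adapted to multi-channel input — can be outlined in a few PyTorch modules. This is a structural sketch only; the channel counts, region handling, and training schedule are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class RegionAutoencoder(nn.Module):
    """Fully connected autoencoder for one flattened facial-region patch."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

class PadClassifier(nn.Module):
    """MLP on the concatenated bottleneck features of several region autoencoders."""
    def __init__(self, num_regions, latent_dim=64):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(num_regions * latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, latents):                         # latents: list of (batch, latent_dim)
        return self.head(torch.cat(latents, dim=-1))    # logit: bona fide vs. attack

# Domain adaptation idea: pre-train the autoencoders on abundant RGB faces, then
# fine-tune them on the scarcer multi-channel (RGB+Depth+NIR) patches before
# training the classifier head on the bottleneck features.
patch_dim = 3 * 32 * 32                                 # assumed flattened region patch size
aes = [RegionAutoencoder(patch_dim) for _ in range(4)]
clf = PadClassifier(num_regions=4)
patches = [torch.randn(8, patch_dim) for _ in range(4)]
logit = clf([ae(p)[1] for ae, p in zip(aes, patches)])
print(logit.shape)    # (8, 1)
```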
D-VAE: A Variational Autoencoder for Directed Acyclic Graphs
Title | D-VAE: A Variational Autoencoder for Directed Acyclic Graphs |
Authors | Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, Yixin Chen |
Abstract | Graph-structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose an asynchronous message passing scheme that allows encoding the computations on DAGs, rather than using existing simultaneous message passing schemes that encode only local graph structures. We demonstrate the effectiveness of our proposed D-VAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization. |
Tasks | Neural Architecture Search |
Published | 2019-04-24 |
URL | https://arxiv.org/abs/1904.11088v4 |
PDF | https://arxiv.org/pdf/1904.11088v4.pdf |
PWC | https://paperswithcode.com/paper/d-vae-a-variational-autoencoder-for-directed |
Repo | https://github.com/muhanzhang/D-VAE |
Framework | pytorch |
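The asynchronous message passing described in the abstract processes DAG nodes in topological order, updating each node's state from an aggregate of its predecessors' states with a recurrent cell, so the encoding mirrors the computation the DAG performs. A small sketch of that encoder step in PyTorch, using plain sum aggregation; the full D-VAE uses a gated aggregator and adds the variational head and a matching decoder.

```python
import torch
import torch.nn as nn

class DagEncoder(nn.Module):
    """Encode a DAG by asynchronous message passing in topological order."""
    def __init__(self, node_feat_dim, hidden_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(node_feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, node_feats, topo_order, predecessors):
        # node_feats: (num_nodes, node_feat_dim); topo_order: node ids in topological order
        # predecessors[v]: list of node ids with an edge into v
        h = [None] * len(node_feats)
        for v in topo_order:
            preds = predecessors[v]
            if preds:                                    # aggregate incoming states (sum here)
                agg = torch.stack([h[u] for u in preds]).sum(dim=0)
            else:
                agg = torch.zeros(self.hidden_dim)
            h[v] = self.cell(node_feats[v].unsqueeze(0), agg.unsqueeze(0)).squeeze(0)
        return h[topo_order[-1]]                         # state of the final (output) node

# toy DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
feats = torch.randn(4, 8)
enc = DagEncoder(node_feat_dim=8)
code = enc(feats, topo_order=[0, 1, 2, 3], predecessors={0: [], 1: [0], 2: [0], 3: [1, 2]})
print(code.shape)   # (64,)
```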