Paper Group AWR 45
SelFlow: Self-Supervised Learning of Optical Flow. RUBi: Reducing Unimodal Biases in Visual Question Answering. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering. Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation. Universal Boosting Variational Inference. SummAE: Zero-Shot Abstractive …
SelFlow: Self-Supervised Learning of Optical Flow
Title | SelFlow: Self-Supervised Learning of Optical Flow |
Authors | Pengpeng Liu, Michael Lyu, Irwin King, Jia Xu |
Abstract | We present a self-supervised learning approach for optical flow. Our method distills reliable flow estimations from non-occluded pixels, and uses these predictions as ground truth to learn optical flow for hallucinated occlusions. We further design a simple CNN to utilize temporal information from multiple frames for better flow estimation. These two principles lead to an approach that yields the best performance for unsupervised optical flow learning on challenging benchmarks including MPI Sintel, KITTI 2012 and KITTI 2015. More notably, our self-supervised pre-trained model provides an excellent initialization for supervised fine-tuning. Our fine-tuned models achieve state-of-the-art results on all three datasets. At the time of writing, we achieve EPE=4.26 on the Sintel benchmark, outperforming all submitted methods. |
Tasks | Optical Flow Estimation |
Published | 2019-04-19 |
URL | http://arxiv.org/abs/1904.09117v1 |
PDF | http://arxiv.org/pdf/1904.09117v1.pdf |
PWC | https://paperswithcode.com/paper/selflow-self-supervised-learning-of-optical |
Repo | https://github.com/ppliuboy/SelFlow |
Framework | tf |
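The core self-supervision signal described in the abstract — reliable flow at non-occluded pixels used as pseudo ground truth for hallucinated occlusions — can be illustrated with a few lines of array code. This is a minimal sketch, not the authors' TensorFlow implementation; the flow fields, occlusion mask, and hallucination mask are assumed inputs, and the robust penalty is a common choice rather than the paper's exact loss.

```python
import numpy as np

def self_supervision_loss(teacher_flow, student_flow, noc_mask, hallucinated_mask,
                          eps=0.01, q=0.4):
    """Robust (Charbonnier-style) loss between student and frozen teacher flow,
    computed only where the teacher is reliable (non-occluded) but the student's
    input was artificially occluded. Flows: (H, W, 2); masks: (H, W)."""
    mask = noc_mask * hallucinated_mask          # pixels that supervise the student
    diff = student_flow - teacher_flow           # teacher flow acts as ground truth
    penalty = (diff ** 2 + eps ** 2) ** q
    return (penalty.sum(axis=-1) * mask).sum() / (mask.sum() + 1e-8)

# toy usage with random tensors
H, W = 64, 64
teacher = np.random.randn(H, W, 2).astype(np.float32)
student = teacher + 0.1 * np.random.randn(H, W, 2).astype(np.float32)
noc = (np.random.rand(H, W) > 0.2).astype(np.float32)    # non-occluded w.r.t. teacher
hall = (np.random.rand(H, W) > 0.7).astype(np.float32)   # regions noised in the student input
print(self_supervision_loss(teacher, student, noc, hall))
```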
RUBi: Reducing Unimodal Biases in Visual Question Answering
Title | RUBi: Reducing Unimodal Biases in Visual Question Answering |
Authors | Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, Devi Parikh |
Abstract | Visual Question Answering (VQA) is the task of answering questions about an image. Some VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image. It implicitly forces the VQA model to use the two input modalities instead of relying on statistical regularities between the question and the answer. We leverage a question-only model that captures the language biases by identifying when these unwanted regularities are used. It prevents the base VQA model from learning them by influencing its predictions. This leads to dynamically adjusting the loss in order to compensate for biases. We validate our contributions by surpassing the current state-of-the-art results on VQA-CP v2. This dataset is specifically designed to assess the robustness of VQA models when exposed to different question biases at test time than what was seen during training. Our code is available: github.com/cdancette/rubi.bootstrap.pytorch |
Tasks | Question Answering, Visual Question Answering |
Published | 2019-06-24 |
URL | https://arxiv.org/abs/1906.10169v2 |
PDF | https://arxiv.org/pdf/1906.10169v2.pdf |
PWC | https://paperswithcode.com/paper/rubi-reducing-unimodal-biases-in-visual |
Repo | https://github.com/cdancette/rubi.bootstrap.pytorch |
Framework | pytorch |
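The key mechanism in the abstract — a question-only branch that captures language bias and modulates the base VQA model's predictions — amounts to masking the logits and adjusting the loss. A rough sketch of that fusion, assuming precomputed logits; the linked repo contains the actual training details.

```python
import numpy as np

def softmax_ce(logits, target):
    """Cross-entropy of a single example against an integer class target."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def rubi_losses(vqa_logits, q_only_logits, target):
    """RUBi-style fusion: the question-only branch produces a sigmoid mask that
    scales the base VQA logits; examples answerable from the question alone get
    confident masks and therefore contribute smaller gradients to the base model.
    Returns (fused_loss, question_only_loss), both of which are optimized jointly."""
    mask = 1.0 / (1.0 + np.exp(-q_only_logits))   # sigmoid over answer logits
    fused_logits = vqa_logits * mask
    return softmax_ce(fused_logits, target), softmax_ce(q_only_logits, target)

vqa = np.random.randn(3000)       # logits over a hypothetical 3000-answer vocabulary
q_only = np.random.randn(3000)
print(rubi_losses(vqa, q_only, target=42))
```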
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
Title | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
Authors | Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, Dacheng Tao |
Abstract | Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, VideoQA datasets remain small in scale and are often automatically generated, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos. The dataset is available at https://github.com/MILVLG/activitynet-qa |
Tasks | Question Answering, Video Question Answering, Visual Question Answering |
Published | 2019-06-06 |
URL | https://arxiv.org/abs/1906.02467v1 |
PDF | https://arxiv.org/pdf/1906.02467v1.pdf |
PWC | https://paperswithcode.com/paper/activitynet-qa-a-dataset-for-understanding |
Repo | https://github.com/MILVLG/activitynet-qa |
Framework | none |
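Since this is a dataset paper, the natural code to pair with it is evaluation rather than modeling. Below is a tiny sketch of exact-match accuracy over QA records; the (video_id, question, answer, type) record layout is hypothetical — consult the linked repo for the actual file format.

```python
from collections import defaultdict

def exact_match_accuracy(records, predictions):
    """records: iterable of dicts with 'video_id', 'question', 'answer', 'type'.
    predictions: dict mapping (video_id, question) -> predicted answer string.
    Returns overall accuracy and per-question-type accuracy."""
    correct, total = 0, 0
    by_type = defaultdict(lambda: [0, 0])
    for r in records:
        pred = predictions.get((r["video_id"], r["question"]), "")
        hit = int(pred.strip().lower() == r["answer"].strip().lower())
        correct += hit
        total += 1
        by_type[r["type"]][0] += hit
        by_type[r["type"]][1] += 1
    return correct / max(total, 1), {t: c / n for t, (c, n) in by_type.items()}

records = [{"video_id": "v_000001", "question": "what is the person doing",
            "answer": "surfing", "type": "motion"}]
print(exact_match_accuracy(records, {("v_000001", "what is the person doing"): "surfing"}))
```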
Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation
Title | Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation |
Authors | Yifeng Jiang, Tom Van Wouwe, Friedl De Groote, C. Karen Liu |
Abstract | Using joint actuators to drive skeletal movements is a common practice in character animation, but the resultant torque patterns are often unnatural or infeasible for real humans to achieve. On the other hand, physiologically-based models explicitly simulate muscles and tendons and thus produce more human-like movements and torque patterns. This paper introduces a technique to transform an optimal control problem formulated in the muscle-actuation space to an equivalent problem in the joint-actuation space, such that the solutions to both problems have the same optimal value. By solving the equivalent problem in the joint-actuation space, we can generate human-like motions comparable to those generated by musculotendon models, while retaining the benefit of simple modeling and fast computation offered by joint-actuation models. Our method transforms constant bounds on muscle activations to nonlinear, state-dependent torque limits in the joint-actuation space. In addition, the metabolic energy function on muscle activations is transformed to a nonlinear function of joint torques, joint configuration and joint velocity. Our technique can also benefit policy optimization using deep reinforcement learning approaches, by providing a more anatomically realistic action space for the agent to explore during the learning process. We take advantage of the physiologically-based simulator, OpenSim, to provide training data for learning the torque limits and the metabolic energy function. Once trained, the same torque limits and the energy function can be applied to drastically different motor tasks formulated as either trajectory optimization or policy learning. Codebase: https://github.com/jyf588/lrle and https://github.com/jyf588/lrle-rl-examples |
Tasks | |
Published | 2019-04-30 |
URL | https://arxiv.org/abs/1904.13041v2 |
PDF | https://arxiv.org/pdf/1904.13041v2.pdf |
PWC | https://paperswithcode.com/paper/synthesis-of-biologically-realistic-human |
Repo | https://github.com/jyf588/lrle-rl-examples |
Framework | tf |
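The central transformation in the abstract — replacing constant muscle-activation bounds with state-dependent torque limits and a torque-based energy cost — can be sketched as a clamp applied inside an RL or trajectory-optimization step. The limit and energy functions below are crude stand-ins for the models the authors fit to OpenSim data, not the paper's actual fits.

```python
import numpy as np

def torque_limits(q, qdot):
    """Placeholder for a learned model mapping joint configuration and velocity
    to (lower, upper) torque bounds; the real limits are learned from OpenSim data."""
    upper = 80.0 * (1.0 - 0.3 * np.tanh(np.abs(qdot)))   # weaker at higher joint speed
    return -upper, upper

def metabolic_cost(tau, q, qdot):
    """Placeholder for the learned energy function of torque, configuration, velocity."""
    return np.sum(np.abs(tau * qdot)) + 0.01 * np.sum(tau ** 2)

def step_action(raw_action, q, qdot):
    """Clamp a policy's raw torque action to the state-dependent feasible range
    and report the associated energy penalty."""
    lo, hi = torque_limits(q, qdot)
    tau = np.clip(raw_action, lo, hi)
    return tau, metabolic_cost(tau, q, qdot)

q = np.zeros(10)
qdot = np.random.randn(10)
print(step_action(100 * np.random.randn(10), q, qdot))
```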
Universal Boosting Variational Inference
Title | Universal Boosting Variational Inference |
Authors | Trevor Campbell, Xinglong Li |
Abstract | Boosting variational inference (BVI) approximates an intractable probability density by iteratively building up a mixture of simple component distributions one at a time, using techniques from sparse convex optimization to provide both computational scalability and approximation error guarantees. But the guarantees have strong conditions that do not often hold in practice, resulting in degenerate component optimization problems; and we show that the ad-hoc regularization used to prevent degeneracy in practice can cause BVI to fail in unintuitive ways. We thus develop universal boosting variational inference (UBVI), a BVI scheme that exploits the simple geometry of probability densities under the Hellinger metric to prevent the degeneracy of other gradient-based BVI methods, avoid difficult joint optimizations of both component and weight, and simplify fully-corrective weight optimizations. We show that for any target density and any mixture component family, the output of UBVI converges to the best possible approximation in the mixture family, even when the mixture family is misspecified. We develop a scalable implementation based on exponential family mixture components and standard stochastic optimization techniques. Finally, we discuss statistical benefits of the Hellinger distance as a variational objective through bounds on posterior probability, moment, and importance sampling errors. Experiments on multiple datasets and models show that UBVI provides reliable, accurate posterior approximations. |
Tasks | Stochastic Optimization |
Published | 2019-06-04 |
URL | https://arxiv.org/abs/1906.01235v2 |
PDF | https://arxiv.org/pdf/1906.01235v2.pdf |
PWC | https://paperswithcode.com/paper/universal-boosting-variational-inference |
Repo | https://github.com/trevorcampbell/ubvi |
Framework | none |
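To make the "one component at a time under the Hellinger metric" idea concrete, here is a toy 1-D demo that greedily grows a Gaussian mixture by grid search on the squared Hellinger distance to a fixed target. This is illustrative only: UBVI's actual updates operate on square-root densities with principled component and weight optimizations, so treat the repo as the reference implementation.

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def normal(mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

target = 0.6 * normal(-2.0, 0.7) + 0.4 * normal(3.0, 1.5)   # density to approximate

def hellinger2(p, q):
    """Squared Hellinger distance evaluated numerically on the grid."""
    return 1.0 - np.sum(np.sqrt(p * q)) * dx

mixture = np.zeros_like(x)
for step in range(4):
    best = None
    for mu in np.linspace(-6, 6, 25):            # candidate component means
        for s in (0.5, 1.0, 2.0):                # candidate scales
            for w in (0.25, 0.5, 0.75, 1.0):     # candidate weight of the new component
                cand = (1 - w) * mixture + w * normal(mu, s)
                d = hellinger2(target, cand)
                if best is None or d < best[0]:
                    best = (d, cand)
    mixture = best[1]
    print(f"components={step + 1}  squared Hellinger={best[0]:.4f}")
```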
SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders
Title | SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders |
Authors | Peter J. Liu, Yu-An Chung, Jie Ren |
Abstract | We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training, evaluation code, and best model weights are open-sourced. |
Tasks | Abstractive Text Summarization, Denoising, Text Summarization |
Published | 2019-10-02 |
URL | https://arxiv.org/abs/1910.00998v1 |
PDF | https://arxiv.org/pdf/1910.00998v1.pdf |
PWC | https://paperswithcode.com/paper/summae-zero-shot-abstractive-text |
Repo | https://github.com/DataScienceNigeria/SummAE-from-Google-Brain-and-MIT-CSAIL |
Framework | none |
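The architectural idea — one encoder/decoder pair whose latent space is shared by sentences and paragraphs, with summarization performed by decoding a single sentence from a paragraph's code — can be sketched compactly in PyTorch. This is a schematic, not the authors' model, which adds denoising, pre-training, and specific decoding choices.

```python
import torch
import torch.nn as nn

class SharedSpaceAE(nn.Module):
    """Auto-encoder with a single latent space for both sentences and paragraphs."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, tokens):                     # tokens: (batch, seq_len) ids
        _, h = self.encoder(self.embed(tokens))   # h: (1, batch, hid_dim) shared latent
        return h

    def decode_greedy(self, h, bos_id, max_len=30):
        """Greedily decode one short sequence (a 'sentence') from a latent code h;
        end-of-sequence handling is omitted for brevity."""
        tok = torch.full((h.size(1), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            o, h = self.decoder(self.embed(tok), h)
            tok = self.out(o[:, -1]).argmax(-1, keepdim=True)
            outputs.append(tok)
        return torch.cat(outputs, dim=1)

model = SharedSpaceAE(vocab_size=1000)
paragraph = torch.randint(4, 1000, (2, 60))       # two toy "paragraphs" of 60 token ids
summary_ids = model.decode_greedy(model.encode(paragraph), bos_id=1)
print(summary_ids.shape)                          # (2, 30): one decoded sequence each
```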
Answers Unite! Unsupervised Metrics for Reinforced Summarization Models
Title | Answers Unite! Unsupervised Metrics for Reinforced Summarization Models |
Authors | Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano |
Abstract | Abstractive summarization approaches based on Reinforcement Learning (RL) have recently been proposed to overcome classical likelihood maximization. RL makes it possible to consider complex, possibly non-differentiable metrics that globally assess the quality and relevance of the generated outputs. ROUGE, the most widely used summarization metric, is known to suffer from bias towards lexical similarity as well as from suboptimal accounting for fluency and readability of the generated abstracts. We thus explore and propose alternative evaluation measures: the reported human-evaluation analysis shows that the proposed metrics, based on Question Answering, compare favorably to ROUGE, with the additional property of not requiring reference summaries. Training an RL-based model on these metrics leads to improvements (in terms of both human and automated metrics) over current approaches that use ROUGE as a reward. |
Tasks | Abstractive Text Summarization, Question Answering |
Published | 2019-09-04 |
URL | https://arxiv.org/abs/1909.01610v1 |
PDF | https://arxiv.org/pdf/1909.01610v1.pdf |
PWC | https://paperswithcode.com/paper/answers-unite-unsupervised-metrics-for |
Repo | https://github.com/recitalAI/summa-qa |
Framework | pytorch |
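The metric family described here scores a summary by how well a QA model can recover answers to questions derived from the source, with no reference summary required. A minimal sketch using the Hugging Face question-answering pipeline; the question–answer pairs are assumed to have been generated from the source document already (the linked repo covers that step), and the scoring function here is a generic token F1 rather than the paper's exact formulation.

```python
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering")   # downloads a default extractive QA model

def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def qa_based_score(summary, source_qa_pairs):
    """Average F1 of QA-model answers read off the summary, against answers
    extracted from the source. Higher = the summary preserves more content."""
    scores = []
    for question, gold_answer in source_qa_pairs:
        pred = qa(question=question, context=summary)["answer"]
        scores.append(token_f1(pred, gold_answer))
    return sum(scores) / max(len(scores), 1)

pairs = [("Who announced the merger?", "Acme Corp"), ("When does it close?", "next quarter")]
print(qa_based_score("Acme Corp announced a merger expected to close next quarter.", pairs))
```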
FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things
Title | FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things |
Authors | Xiaying Wang, Michele Magno, Lukas Cavigelli, Luca Benini |
Abstract | The growing number of low-power smart devices in the Internet of Things is coupled with the concept of “Edge Computing”, that is, moving some of the intelligence, especially machine learning, towards the edge of the network. Enabling machine learning algorithms to run on resource-constrained hardware, typically on low-power smart devices, is challenging in terms of hardware (optimized and energy-efficient integrated circuits), algorithms, and firmware implementations. This paper presents FANN-on-MCU, an open-source toolkit built upon the Fast Artificial Neural Network (FANN) library to run lightweight and energy-efficient neural networks on microcontrollers based on both the ARM Cortex-M series and the novel RISC-V-based Parallel Ultra-Low-Power (PULP) platform. The toolkit takes multi-layer perceptrons trained with FANN and generates code targeted at execution on low-power microcontrollers either with a floating-point unit (i.e., ARM Cortex-M4F and M7F) or without one (i.e., ARM Cortex-M0 to M3 or PULP-based processors). This paper also provides an architectural performance evaluation of neural networks on the most popular ARM Cortex-M family and the parallel RISC-V processor called Mr. Wolf. The evaluation includes experimental results for three different applications using a self-sustainable wearable multi-sensor bracelet. Experimental results show a measured latency on the order of only a few microseconds and a power consumption of a few milliwatts, while keeping the memory requirements below the limitations of the targeted microcontrollers. In particular, the parallel implementation on the octa-core RISC-V platform reaches a speedup of 22x and a 69% reduction in energy consumption with respect to a single-core implementation on the Cortex-M4 for continuous real-time classification. |
Tasks | |
Published | 2019-11-08 |
URL | https://arxiv.org/abs/1911.03314v2 |
PDF | https://arxiv.org/pdf/1911.03314v2.pdf |
PWC | https://paperswithcode.com/paper/fann-on-mcu-an-open-source-toolkit-for-energy |
Repo | https://github.com/pulp-platform/fann-on-mcu |
Framework | none |
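On FPU-less Cortex-M cores, the MLP has to run in fixed-point arithmetic. The snippet below illustrates that kind of Q-format inference in Python — quantize weights and activations to integers, accumulate in a wider type, shift back down — as a mental model of what integer-only inference involves, not the toolkit's actual generated C code.

```python
import numpy as np

FRAC_BITS = 8                                   # Q8 fixed point: value = int / 2**8

def to_fixed(x):
    return np.round(x * (1 << FRAC_BITS)).astype(np.int32)

def relu(x):
    return np.maximum(x, 0)

def fixed_point_mlp(x_fix, layers):
    """layers: list of (W_fix, b_fix) arrays quantized with to_fixed().
    Accumulate in int64, then shift right to stay in Q8 after each matmul."""
    a = x_fix.astype(np.int64)
    for i, (W, b) in enumerate(layers):
        a = (a @ W.T.astype(np.int64)) >> FRAC_BITS    # rescale product back to Q8
        a = a + b
        if i < len(layers) - 1:
            a = relu(a)
    return a

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)
x = rng.normal(size=8)

float_out = W2 @ relu(W1 @ x + b1) + b2
fixed_out = fixed_point_mlp(to_fixed(x),
                            [(to_fixed(W1), to_fixed(b1)),
                             (to_fixed(W2), to_fixed(b2))]) / (1 << FRAC_BITS)
print(np.max(np.abs(float_out - fixed_out)))   # small quantization error
```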
Hierarchical Reinforcement Learning for Open-Domain Dialog
Title | Hierarchical Reinforcement Learning for Open-Domain Dialog |
Authors | Abdelrhman Saleh, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Rosalind Picard |
Abstract | Open-domain dialog generation is a challenging problem; maximum likelihood training can lead to repetitive outputs, models have difficulty tracking long-term conversational goals, and training on standard movie or online datasets may lead to the generation of inappropriate, biased, or offensive text. Reinforcement Learning (RL) is a powerful framework that could potentially address these issues, for example by allowing a dialog model to optimize for reducing toxicity and repetitiveness. However, previous approaches which apply RL to open-domain dialog generation do so at the word level, making it difficult for the model to learn proper credit assignment for long-term conversational rewards. In this paper, we propose a novel approach to hierarchical reinforcement learning, VHRL, which uses policy gradients to tune the utterance-level embedding of a variational sequence model. This hierarchical approach provides greater flexibility for learning long-term, conversational rewards. We use self-play and RL to optimize for a set of human-centered conversation metrics, and show that our approach provides significant improvements – in terms of both human evaluation and automatic metrics – over state-of-the-art dialog models, including Transformers. |
Tasks | Hierarchical Reinforcement Learning |
Published | 2019-09-17 |
URL | https://arxiv.org/abs/1909.07547v3 |
PDF | https://arxiv.org/pdf/1909.07547v3.pdf |
PWC | https://paperswithcode.com/paper/hierarchical-reinforcement-learning-for-open |
Repo | https://github.com/natashamjaques/neural_chat |
Framework | pytorch |
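The key shift described in the abstract is applying the policy gradient at the utterance level rather than the word level, so a whole generated utterance is the action that receives the conversational reward. Below is a bare-bones REINFORCE surrogate at that granularity, assuming the dialog model exposes per-utterance log-probabilities; the authors' VHRL additionally ties this update to the latent variables of a variational hierarchical model.

```python
import torch

def utterance_level_pg_loss(utterance_logprobs, rewards, baseline=None):
    """REINFORCE surrogate with one action per utterance.
    utterance_logprobs: (num_utterances,) log p(utterance | context), requires grad.
    rewards: (num_utterances,) conversation-level rewards assigned to each utterance."""
    if baseline is None:
        baseline = rewards.mean()                      # simple variance-reduction baseline
    advantage = (rewards - baseline).detach()
    return -(advantage * utterance_logprobs).sum()     # minimizing this ascends the reward

# toy usage: pretend the log-probs came from summing word log-probs within each utterance
logprobs = torch.tensor([-12.3, -9.8, -15.1], requires_grad=True)
rewards = torch.tensor([0.2, 0.9, 0.1])                # e.g. low-toxicity / low-repetition scores
loss = utterance_level_pg_loss(logprobs, rewards)
loss.backward()
print(logprobs.grad)
```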
Improved Embeddings with Easy Positive Triplet Mining
Title | Improved Embeddings with Easy Positive Triplet Mining |
Authors | Hong Xuan, Abby Stylianou, Robert Pless |
Abstract | Deep metric learning seeks to define an embedding where semantically similar images are embedded to nearby locations, and semantically dissimilar images are embedded to distant locations. Substantial work has focused on loss functions and strategies to learn these embeddings by pushing images from the same class as close together in the embedding space as possible. In this paper, we propose an alternative, loosened embedding strategy that requires only that the embedding function map each training image to the most similar examples from the same class, an approach we call “Easy Positive” mining. We provide a collection of experiments and visualizations highlighting that this Easy Positive mining leads to embeddings that are more flexible and generalize better to new unseen data. This simple mining strategy yields recall performance that exceeds that of state-of-the-art approaches (including those with complicated loss functions and ensemble methods) on image retrieval datasets including CUB, Stanford Online Products, In-Shop Clothes and Hotels-50K. |
Tasks | Image Retrieval, Metric Learning |
Published | 2019-04-08 |
URL | https://arxiv.org/abs/1904.04370v2 |
PDF | https://arxiv.org/pdf/1904.04370v2.pdf |
PWC | https://paperswithcode.com/paper/improved-embeddings-with-easy-positive |
Repo | https://github.com/littleredxh/EasyPositiveHardNegative |
Framework | pytorch |
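Easy-positive mining is simple to state in code: within a batch, pair each anchor with its most similar same-class example (the "easy" positive) rather than the hardest one, then apply an NCA-style softmax over the negatives. A small numpy sketch of that batch-mining step, not the authors' full training pipeline.

```python
import numpy as np

def easy_positive_nca_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, D), assumed L2-normalized; labels: (N,) ints.
    For each anchor, the positive is the *most similar* same-class example."""
    sim = embeddings @ embeddings.T                    # cosine similarities
    N = len(labels)
    losses = []
    for i in range(N):
        pos_mask = (labels == labels[i]) & (np.arange(N) != i)
        neg_mask = labels != labels[i]
        if not pos_mask.any() or not neg_mask.any():
            continue
        easy_pos = sim[i][pos_mask].max()              # easiest (closest) positive
        logits = np.concatenate(([easy_pos], sim[i][neg_mask])) / temperature
        logits -= logits.max()
        losses.append(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
emb = rng.normal(size=(32, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
labels = rng.integers(0, 8, size=32)
print(easy_positive_nca_loss(emb, labels))
```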
SG-Net: Syntax-Guided Machine Reading Comprehension
Title | SG-Net: Syntax-Guided Machine Reading Comprehension |
Authors | Zhuosheng Zhang, Yuwei Wu, Junru Zhou, Sufeng Duan, Hai Zhao, Rui Wang |
Abstract | For machine reading comprehension, the capacity to effectively model linguistic knowledge from detail-riddled and lengthy passages and to get rid of noise is essential for improving performance. Traditional attentive models attend to all words without explicit constraint, which results in inaccurate concentration on dispensable words. In this work, we propose using syntax to guide text modeling by incorporating explicit syntactic constraints into the attention mechanism for better linguistically motivated word representations. In detail, for the self-attention network (SAN) of a Transformer-based encoder, we introduce a syntactic dependency of interest (SDOI) design into the SAN to form an SDOI-SAN with syntax-guided self-attention. The syntax-guided network (SG-Net) is then composed of this extra SDOI-SAN and the SAN from the original Transformer encoder through a dual contextual architecture for better linguistically inspired representations. To verify its effectiveness, the proposed SG-Net is applied to the typical pre-trained language model BERT, which is based on a Transformer encoder. Extensive experiments on popular benchmarks including SQuAD 2.0 and RACE show that the proposed SG-Net design helps achieve substantial performance improvements over strong baselines. |
Tasks | Language Modelling, Machine Reading Comprehension, Question Answering, Reading Comprehension |
Published | 2019-08-14 |
URL | https://arxiv.org/abs/1908.05147v3 |
PDF | https://arxiv.org/pdf/1908.05147v3.pdf |
PWC | https://paperswithcode.com/paper/sg-net-syntax-guided-machine-reading |
Repo | https://github.com/cooelf/SG-Net |
Framework | pytorch |
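The SDOI idea restricts which tokens a word may attend to based on the dependency parse. Below is a minimal sketch of syntax-masked scaled dot-product attention, where the mask keeps each token plus its dependency ancestors; the actual SG-Net combines this SDOI-SAN with the vanilla SAN in a dual architecture rather than replacing it.

```python
import numpy as np

def sdoi_mask(heads):
    """heads[i] = index of token i's dependency head (-1 for the root).
    Token i may attend to itself and all of its ancestors in the parse tree."""
    n = len(heads)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = heads[j]
    return mask

def syntax_guided_attention(Q, K, V, mask):
    """Scaled dot-product attention with disallowed positions pushed to -inf."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# toy sentence of 5 tokens with a small dependency tree (token 2 is the root)
heads = [2, 0, -1, 2, 3]
X = np.random.randn(5, 16)
out = syntax_guided_attention(X, X, X, sdoi_mask(heads))
print(out.shape)   # (5, 16)
```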
Semantics-aware BERT for Language Understanding
Title | Semantics-aware BERT for Language Understanding |
Authors | Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, Xiang Zhou |
Abstract | The latest work on language representations carefully integrates contextualized features into language model training, which has enabled a series of successes, especially in various machine reading comprehension and natural language inference tasks. However, existing language representation models, including ELMo, GPT and BERT, only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information, which can provide rich semantics for language representation. To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling, and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor, requiring only light fine-tuning and no substantial task-specific modifications. Compared with BERT, SemBERT is as simple in concept but more powerful. It obtains new state-of-the-art results or substantially improves on existing results across ten reading comprehension and language inference tasks. |
Tasks | Language Modelling, Machine Reading Comprehension, Natural Language Inference, Question Answering, Reading Comprehension, Semantic Role Labeling, Word Embeddings |
Published | 2019-09-05 |
URL | https://arxiv.org/abs/1909.02209v3 |
PDF | https://arxiv.org/pdf/1909.02209v3.pdf |
PWC | https://paperswithcode.com/paper/semantics-aware-bert-for-language |
Repo | https://github.com/cooelf/SemBERT |
Framework | pytorch |
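Mechanically, the fusion described in the abstract attaches an embedding of the predicted semantic-role label sequence to each token's BERT representation before the task head. A compressed sketch of that fusion for a single SRL labeling per sentence; the real model aggregates multiple predicate-specific label sequences and handles subword alignment, which this omits.

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """Concatenate contextual token features with learned SRL-tag embeddings."""
    def __init__(self, hidden_dim=768, num_srl_tags=30, tag_dim=16):
        super().__init__()
        self.tag_embed = nn.Embedding(num_srl_tags, tag_dim)
        self.project = nn.Linear(hidden_dim + tag_dim, hidden_dim)

    def forward(self, token_feats, srl_tags):
        # token_feats: (batch, seq, hidden) from BERT; srl_tags: (batch, seq) tag ids
        fused = torch.cat([token_feats, self.tag_embed(srl_tags)], dim=-1)
        return torch.relu(self.project(fused))

fusion = SemanticFusion()
bert_out = torch.randn(2, 12, 768)        # stand-in for BERT's last hidden states
tags = torch.randint(0, 30, (2, 12))      # stand-in for predicted SRL label ids
print(fusion(bert_out, tags).shape)       # (2, 12, 768)
```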
3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents
Title | 3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents |
Authors | Ue-Hwan Kim, Jin-Man Park, Taek-Jin Song, Jong-Hwan Kim |
Abstract | Intelligent agents gather information and perceive semantics within their environments before taking on given tasks. The agents store the collected information in the form of environment models that compactly represent the surrounding environments. Without an efficient and effective environment model, however, the agents can only conduct limited tasks. Thus, such an environment model plays a crucial role in the autonomy of intelligent agents. We claim the following characteristics for a versatile environment model: accuracy, applicability, usability, and scalability. Although a number of researchers have attempted to develop models that represent environments precisely to a certain degree, these models lack broad applicability, intuitive usability, and satisfactory scalability. To tackle these limitations, we propose the 3-D scene graph as an environment model, along with a 3-D scene graph construction framework. The concise and widely used graph structure readily guarantees the usability as well as scalability of the 3-D scene graph. We demonstrate the accuracy and applicability of the 3-D scene graph by deploying it in practical applications. Moreover, we verify the performance of the proposed 3-D scene graph and framework by conducting a series of comprehensive experiments under various conditions. |
Tasks | graph construction |
Published | 2019-08-14 |
URL | https://arxiv.org/abs/1908.04929v1 |
PDF | https://arxiv.org/pdf/1908.04929v1.pdf |
PWC | https://paperswithcode.com/paper/3-d-scene-graph-a-sparse-and-semantic |
Repo | https://github.com/Uehwan/3-D-Scene-Graph |
Framework | pytorch |
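As a data structure, the 3-D scene graph in the abstract reduces to object nodes carrying class and 3-D pose plus relation edges between them. A skeletal Python representation, purely to illustrate the kind of sparse model the framework constructs; the field names are illustrative, not the repo's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObjectNode:
    node_id: int
    category: str                           # e.g. "chair"
    position: Tuple[float, float, float]    # 3-D centroid in the map frame
    attributes: Dict[str, str] = field(default_factory=dict)   # e.g. {"color": "red"}

@dataclass
class RelationEdge:
    subject_id: int
    object_id: int
    predicate: str                          # e.g. "on", "next_to"

@dataclass
class SceneGraph3D:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: List[RelationEdge] = field(default_factory=list)

    def add_object(self, node: ObjectNode):
        self.nodes[node.node_id] = node

    def add_relation(self, edge: RelationEdge):
        self.edges.append(edge)

    def neighbors(self, node_id: int) -> List[int]:
        return [e.object_id for e in self.edges if e.subject_id == node_id]

g = SceneGraph3D()
g.add_object(ObjectNode(0, "table", (1.0, 0.5, 0.0)))
g.add_object(ObjectNode(1, "cup", (1.0, 0.5, 0.4), {"color": "blue"}))
g.add_relation(RelationEdge(1, 0, "on"))
print(g.neighbors(1))   # [0]
```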
Domain Adaptation in Multi-Channel Autoencoder based Features for Robust Face Anti-Spoofing
Title | Domain Adaptation in Multi-Channel Autoencoder based Features for Robust Face Anti-Spoofing |
Authors | Olegs Nikisins, Anjith George, Sebastien Marcel |
Abstract | While the performance of face recognition systems has improved significantly in the last decade, they have proved to be highly vulnerable to presentation attacks (spoofing). Most research in the field of face presentation attack detection (PAD) has focused on boosting the performance of systems within a single database. Face PAD datasets are usually captured with RGB cameras, and have a very limited number of both bona-fide samples and presentation attack instruments. Training face PAD systems on such data leads to poor performance, even in the closed-set scenario, especially when sophisticated attacks are involved. We explore two paths to boost the performance of the face PAD system against challenging attacks. First, we use multi-channel (RGB, Depth and NIR) data, which is still easily accessible in a number of mass-production devices. Second, we develop a novel Autoencoders + MLP based face PAD algorithm. Moreover, instead of collecting more data for training the proposed deep architecture, a domain adaptation technique is proposed, transferring knowledge of facial appearance from the RGB to the multi-channel domain. We also demonstrate that learning the features of individual facial regions is more discriminative than learning features from an entire face. The proposed system is tested on a very recent publicly available multi-channel PAD database with a wide variety of presentation attacks. |
Tasks | Domain Adaptation, Face Anti-Spoofing, Face Presentation Attack Detection, Face Recognition |
Published | 2019-07-09 |
URL | https://arxiv.org/abs/1907.04048v1 |
PDF | https://arxiv.org/pdf/1907.04048v1.pdf |
PWC | https://paperswithcode.com/paper/domain-adaptation-in-multi-channel |
Repo | https://github.com/anjith2006/bob.paper.mcae.icb2019 |
Framework | none |
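The pipeline in the abstract — autoencoders trained per facial region, then an MLP classifier on their bottleneck features, with the encoders first pre-trained on RGB faces and adapted to multi-channel input — can be outlined in a few PyTorch modules. This is a structural sketch only; the channel counts, region handling, and training schedule are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class RegionAutoencoder(nn.Module):
    """Fully connected autoencoder for one flattened facial-region patch."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

class PadClassifier(nn.Module):
    """MLP on the concatenated bottleneck features of several region autoencoders."""
    def __init__(self, num_regions, latent_dim=64):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(num_regions * latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, latents):                         # latents: list of (batch, latent_dim)
        return self.head(torch.cat(latents, dim=-1))    # logit: bona fide vs. attack

# Domain adaptation idea: pre-train the autoencoders on abundant RGB faces, then
# fine-tune them on the scarcer multi-channel (RGB+Depth+NIR) patches before
# training the classifier head on the bottleneck features.
patch_dim = 3 * 32 * 32                                 # assumed flattened region patch size
aes = [RegionAutoencoder(patch_dim) for _ in range(4)]
clf = PadClassifier(num_regions=4)
patches = [torch.randn(8, patch_dim) for _ in range(4)]
logit = clf([ae(p)[1] for ae, p in zip(aes, patches)])
print(logit.shape)    # (8, 1)
```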
D-VAE: A Variational Autoencoder for Directed Acyclic Graphs
Title | D-VAE: A Variational Autoencoder for Directed Acyclic Graphs |
Authors | Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, Yixin Chen |
Abstract | Graph-structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose an asynchronous message passing scheme that allows encoding the computations on DAGs, rather than using existing simultaneous message passing schemes that encode only local graph structures. We demonstrate the effectiveness of our proposed D-VAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization. |
Tasks | Neural Architecture Search |
Published | 2019-04-24 |
URL | https://arxiv.org/abs/1904.11088v4 |
PDF | https://arxiv.org/pdf/1904.11088v4.pdf |
PWC | https://paperswithcode.com/paper/d-vae-a-variational-autoencoder-for-directed |
Repo | https://github.com/muhanzhang/D-VAE |
Framework | pytorch |
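The asynchronous message passing described in the abstract processes DAG nodes in topological order, updating each node's state from an aggregate of its predecessors' states with a recurrent cell, so the encoding mirrors the computation the DAG performs. A small sketch of that encoder step in PyTorch, using plain sum aggregation; the full D-VAE uses a gated aggregator and adds the variational head and a matching decoder.

```python
import torch
import torch.nn as nn

class DagEncoder(nn.Module):
    """Encode a DAG by asynchronous message passing in topological order."""
    def __init__(self, node_feat_dim, hidden_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(node_feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, node_feats, topo_order, predecessors):
        # node_feats: (num_nodes, node_feat_dim); topo_order: node ids in topological order
        # predecessors[v]: list of node ids with an edge into v
        h = [None] * len(node_feats)
        for v in topo_order:
            preds = predecessors[v]
            if preds:                                    # aggregate incoming states (sum here)
                agg = torch.stack([h[u] for u in preds]).sum(dim=0)
            else:
                agg = torch.zeros(self.hidden_dim)
            h[v] = self.cell(node_feats[v].unsqueeze(0), agg.unsqueeze(0)).squeeze(0)
        return h[topo_order[-1]]                         # state of the final (output) node

# toy DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
feats = torch.randn(4, 8)
enc = DagEncoder(node_feat_dim=8)
code = enc(feats, topo_order=[0, 1, 2, 3], predecessors={0: [], 1: [0], 2: [0], 3: [1, 2]})
print(code.shape)   # (64,)
```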