October 20, 2019

3276 words 16 mins read

Paper Group AWR 265

Learning to recognize touch gestures: recurrent vs. convolutional features and dynamic sampling

Title Learning to recognize touch gestures: recurrent vs. convolutional features and dynamic sampling
Authors Quentin Debard, Christian Wolf, Stéphane Canu, Julien Arné
Abstract We propose a fully automatic method for learning gestures on large touch devices in a potentially multi-user context. The goal is to learn general models capable of adapting to different gestures, user styles and hardware variations (e.g. device sizes, sampling frequencies and regularities). Based on deep neural networks, our method features a novel dynamic sampling and temporal normalization component, transforming variable-length gestures into fixed-length representations while preserving finger/surface contact transitions, that is, the topology of the signal. This sequential representation is then processed with a convolutional model capable, unlike recurrent networks, of learning hierarchical representations with different levels of abstraction. To demonstrate the value of the proposed method, we introduce a new touch gestures dataset with 6591 gestures performed by 27 people, which is, to the best of our knowledge, the first of its kind: a publicly available multi-touch gesture dataset for interaction. We also tested our method on a standard dataset of symbolic touch gesture recognition, the MMG dataset, outperforming the state of the art and reporting close to perfect performance.
Tasks Gesture Recognition
Published 2018-02-19
URL http://arxiv.org/abs/1802.09901v1
PDF http://arxiv.org/pdf/1802.09901v1.pdf
PWC https://paperswithcode.com/paper/learning-to-recognize-touch-gestures
Repo https://github.com/chriswegmann/drone_steering
Framework none
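
The dynamic sampling component described in the abstract lends itself to a compact illustration. Below is a minimal numpy sketch (my own, not the authors' code; `dynamic_sample` and its parameters are hypothetical) of resampling a variable-length trajectory to a fixed length while preserving finger/surface contact transitions:

```python
import numpy as np

def dynamic_sample(points, contact, target_len=64):
    """points: (T, 2) x/y positions; contact: (T,) boolean touch flags.

    Keep every index where the contact state flips (the topology of the
    signal), then spend the remaining budget on uniformly spaced samples.
    Padding to an exact length is omitted for brevity.
    """
    T = len(points)
    transitions = np.where(np.diff(contact.astype(int)) != 0)[0] + 1
    keep = set(transitions.tolist()) | {0, T - 1}
    budget = max(target_len - len(keep), 0)
    uniform = np.linspace(0, T - 1, num=budget, dtype=int)
    idx = np.array(sorted(keep | set(uniform.tolist())))[:target_len]
    return points[idx], contact[idx]

gesture = np.cumsum(np.random.randn(200, 2), axis=0)   # fake trajectory
touch = np.r_[np.ones(120, bool), np.zeros(80, bool)]  # one lift-off event
pts, con = dynamic_sample(gesture, touch)
```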

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

Title BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation
Authors Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, Nong Sang
Abstract Semantic segmentation requires both rich spatial information and a sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain a sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture strikes the right balance between speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset at a speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than existing methods with comparable performance.
Tasks Real-Time Semantic Segmentation, Semantic Segmentation
Published 2018-08-02
URL http://arxiv.org/abs/1808.00897v1
PDF http://arxiv.org/pdf/1808.00897v1.pdf
PWC https://paperswithcode.com/paper/bisenet-bilateral-segmentation-network-for
Repo https://github.com/Blaizzy/BiSeNet-Implementation
Framework tf
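
To make the bilateral idea concrete, here is a toy PyTorch skeleton of the two paths and their fusion (the official implementation linked above is TensorFlow; every module here is a simplified stand-in of my own, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class ToyBiSeNet(nn.Module):
    def __init__(self, n_classes=19):
        super().__init__()
        # Spatial Path: small stride (1/8 overall) preserves resolution.
        self.spatial = nn.Sequential(conv_bn_relu(3, 64, 2),
                                     conv_bn_relu(64, 128, 2),
                                     conv_bn_relu(128, 256, 2))
        # Context Path: fast downsampling (1/32) for a large receptive field.
        self.context = nn.Sequential(conv_bn_relu(3, 64, 4),
                                     conv_bn_relu(64, 128, 4),
                                     conv_bn_relu(128, 256, 2))
        self.fuse = conv_bn_relu(512, 256, 1)   # stands in for the FFM
        self.head = nn.Conv2d(256, n_classes, 1)

    def forward(self, x):
        s = self.spatial(x)
        c = F.interpolate(self.context(x), size=s.shape[2:],
                          mode='bilinear', align_corners=False)
        out = self.head(self.fuse(torch.cat([s, c], dim=1)))
        return F.interpolate(out, size=x.shape[2:], mode='bilinear',
                             align_corners=False)
```

The point of the split is that neither path alone has to pay for both resolution and receptive field; fusing them keeps inference fast.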

Learning Neural Parsers with Deterministic Differentiable Imitation Learning

Title Learning Neural Parsers with Deterministic Differentiable Imitation Learning
Authors Tanmay Shankar, Nicholas Rhinehart, Katharina Muelling, Kris M. Kitani
Abstract We explore the problem of learning to decompose spatial tasks into segments, as exemplified by the problem of a painting robot covering a large object. Inspired by the ability of classical decision tree algorithms to construct structured partitions of their input spaces, we formulate the problem of decomposing objects into segments as a parsing approach. Our key insight is that deriving a parse tree that decomposes the object into segments closely resembles a decision tree constructed by ID3, which can be built when ground truth is available. We learn to imitate an expert parsing oracle, such that our neural parser can generalize to parse natural images without ground truth. We introduce a novel deterministic policy gradient update, DRAG (i.e., DeteRministically AGgrevate), in the form of a deterministic actor-critic variant of AggreVaTeD, to train our neural parser. From another perspective, our approach is a variant of the Deterministic Policy Gradient suitable for the imitation learning setting. The deterministic policy representation offered by training our neural parser with DRAG allows it to outperform state-of-the-art imitation and reinforcement learning approaches.
Tasks Imitation Learning
Published 2018-06-20
URL http://arxiv.org/abs/1806.07822v2
PDF http://arxiv.org/pdf/1806.07822v2.pdf
PWC https://paperswithcode.com/paper/learning-neural-parsers-with-deterministic
Repo https://github.com/tanmayshankar/ParsingbyImitation
Framework none
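
As a rough rendering of the update the abstract sketches, the following PyTorch snippet shows a deterministic actor-critic imitation step: the critic regresses the oracle's cost-to-go, and the actor follows the deterministic policy gradient through the critic. This is my paraphrase of the general recipe, not the authors' DRAG code; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def deterministic_imitation_step(actor, critic, states, expert_cost_to_go,
                                 opt_actor, opt_critic):
    # Critic regression: fit the expert oracle's cost-to-go at visited states.
    q = critic(states, actor(states).detach())
    critic_loss = F.mse_loss(q, expert_cost_to_go)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Deterministic policy gradient: move the actor's actions toward
    # lower predicted cost, differentiating through the critic.
    actor_loss = critic(states, actor(states)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    return critic_loss.item(), actor_loss.item()
```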

Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning

Title Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning
Authors Valts Blukis, Nataly Brukhim, Andrew Bennett, Ross A. Knepper, Yoav Artzi
Abstract We introduce a method for following high-level navigation instructions by mapping directly from images, instructions and pose estimates to continuous low-level velocity commands for real-time control. The Grounded Semantic Mapping Network (GSMN) is a fully-differentiable neural network architecture that builds an explicit semantic map in the world reference frame by incorporating a pinhole camera projection model within the network. The information stored in the map is learned from experience, while the local-to-world transformation is computed explicitly. We train the model using DAggerFM, a modified variant of DAgger that trades tabular convergence guarantees for improved training speed and memory use. We test GSMN in virtual environments on a realistic quadcopter simulator and show that incorporating explicit mapping and grounding modules allows GSMN to outperform strong neural baselines and nearly reach the performance of an expert policy. Finally, we analyze the learned map representations and show that using an explicit map leads to an interpretable instruction-following model.
Tasks Imitation Learning
Published 2018-05-31
URL http://arxiv.org/abs/1806.00047v1
PDF http://arxiv.org/pdf/1806.00047v1.pdf
PWC https://paperswithcode.com/paper/following-high-level-navigation-instructions
Repo https://github.com/clic-lab/gsmn
Framework none
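
The pinhole projection that GSMN builds into the network reduces to a short geometric computation. Here is a numpy sketch of back-projecting a pixel onto the world ground plane given intrinsics K and pose (R, t); the function and constants are mine for illustration, not the GSMN API:

```python
import numpy as np

def pixel_to_ground(u, v, K, R, t, ground_z=0.0):
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    ray_world = R @ ray_cam                             # rotate into world
    s = (ground_z - t[2]) / ray_world[2]                # ray/plane scale
    return t + s * ray_world                            # world-frame point

K = np.array([[320, 0, 320], [0, 320, 240], [0, 0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])        # camera pitched straight down
t = np.array([0.0, 0.0, 5.0])         # flying 5 m above the ground plane
print(pixel_to_ground(400, 300, K, R, t))   # -> [1.25, -0.9375, 0.0]
```

Because every step is differentiable, gradients can flow from the map back into the image features, which is what lets the map contents be learned from experience.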

A Supervised Approach To The Interpretation Of Imperative To-Do Lists

Title A Supervised Approach To The Interpretation Of Imperative To-Do Lists
Authors Paul Landes, Barbara Di Eugenio
Abstract To-do lists are a popular medium for personal information management. As to-do tasks are increasingly tracked in electronic form with mobile and desktop organizers, the potential grows for software to support the corresponding tasks by means of intelligent agents. While there has been work in the area of personal assistants for to-do tasks, no prior work has focused on classifying user intention and information extraction as we do. We show that our methods perform well across two corpora that span sub-domains, one of which we released.
Tasks
Published 2018-06-20
URL http://arxiv.org/abs/1806.07999v1
PDF http://arxiv.org/pdf/1806.07999v1.pdf
PWC https://paperswithcode.com/paper/a-supervised-approach-to-the-interpretation
Repo https://github.com/plandes/todo-task
Framework none
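
The task shape here, classifying the intention behind an imperative to-do item, can be baselined in a few lines. A hedged sketch with scikit-learn (toy items and labels of my own invention, not the authors' corpora or method):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tasks = ["buy milk", "email the landlord", "call mom", "pay rent online"]
labels = ["shopping", "correspondence", "correspondence", "finance"]  # toy

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tasks, labels)
print(clf.predict(["buy stamps", "email the plumber"]))
```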

MIDV-500: A Dataset for Identity Documents Analysis and Recognition on Mobile Devices in Video Stream

Title MIDV-500: A Dataset for Identity Documents Analysis and Recognition on Mobile Devices in Video Stream
Authors Vladimir V. Arlazarov, Konstantin Bulatov, Timofey Chernov, Vladimir L. Arlazarov
Abstract A lot of research has been devoted to identity document analysis and recognition on mobile devices. However, no publicly available datasets designed for this particular problem currently exist. There are a few datasets which are useful for associated subtasks, but in order to facilitate a more comprehensive scientific and technical approach to identity document recognition, more specialized datasets are required. In this paper we present the Mobile Identity Document Video dataset (MIDV-500), consisting of 500 video clips of 50 different identity document types with ground truth that enables research on a wide scope of document analysis problems. While the main goal of this paper is to present the dataset, we also report, as a baseline, evaluation results on it for existing methods of face detection, text line recognition, and document field data extraction. Since an important feature of identity documents is their sensitivity, as they contain personal data, all source document images used in MIDV-500 are either in the public domain or distributed under public copyright licenses. (The dataset is available for download at ftp://smartengines.com/midv-500/.)
Tasks Face Detection
Published 2018-07-16
URL https://arxiv.org/abs/1807.05786v4
PDF https://arxiv.org/pdf/1807.05786v4.pdf
PWC https://paperswithcode.com/paper/midv-500-a-dataset-for-identity-documents
Repo https://github.com/AdivarekarBhumit/ID-Card-Segmentation
Framework tf

GaussianProcesses.jl: A Nonparametric Bayes package for the Julia Language

Title GaussianProcesses.jl: A Nonparametric Bayes package for the Julia Language
Authors Jamie Fairbrother, Christopher Nemeth, Maxime Rischard, Johanni Brea, Thomas Pinder
Abstract Gaussian processes are a class of flexible nonparametric Bayesian tools that are widely used across the sciences, and in industry, to model complex data sources. Key to applying Gaussian process models is the availability of well-developed open source software, which is available in many programming languages. In this paper, we present a tutorial of the GaussianProcesses.jl package that has been developed for the Julia programming language. GaussianProcesses.jl utilises the inherent computational benefits of the Julia language, including multiple dispatch and just-in-time compilation, to produce a fast, flexible and user-friendly Gaussian processes package. The package provides many mean and kernel functions with supporting inference tools to fit exact Gaussian process models, as well as a range of alternative likelihood functions to handle non-Gaussian data (e.g. binary classification models) and sparse approximations for scalable Gaussian processes. The package makes efficient use of existing Julia packages to provide users with a range of optimization and plotting tools.
Tasks Gaussian Processes
Published 2018-12-21
URL https://arxiv.org/abs/1812.09064v2
PDF https://arxiv.org/pdf/1812.09064v2.pdf
PWC https://paperswithcode.com/paper/gaussianprocessesjl-a-nonparametric-bayes
Repo https://github.com/UnofficialJuliaMirrorSnapshots/GaussianProcesses.jl-891a1506-143c-57d2-908e-e1f8e92e6de9
Framework none
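
The package itself is Julia, but the exact GP regression it implements is language-neutral. A minimal numpy rendering of the standard posterior equations (my sketch, under a squared-exponential kernel with hand-picked hyperparameters):

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell)**2)

x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.1 * np.random.randn(20)      # noisy observations
xs = np.linspace(0, 5, 100)                    # test inputs
sn2 = 0.1**2                                   # observation noise variance

K = rbf(x, x) + sn2 * np.eye(len(x))
Ks, Kss = rbf(x, xs), rbf(xs, xs)
mean = Ks.T @ np.linalg.solve(K, y)            # posterior mean
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)      # posterior covariance
```

GaussianProcesses.jl wraps the same computation behind its mean and kernel types, with the optimization and plotting tools the abstract mentions layered on top.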

Bridge the Gap Between VQA and Human Behavior on Omnidirectional Video: A Large-Scale Dataset and a Deep Learning Model

Title Bridge the Gap Between VQA and Human Behavior on Omnidirectional Video: A Large-Scale Dataset and a Deep Learning Model
Authors Chen Li, Mai Xu, Xinzhe Du, Zulin Wang
Abstract Omnidirectional video enables spherical stimuli with a $360^\circ \times 180^\circ$ viewing range. Meanwhile, only the viewport region of omnidirectional video can be seen by the observer through head movement (HM), and an even smaller region within the viewport can be clearly perceived through eye movement (EM). Thus, the subjective quality of omnidirectional video may be correlated with the HM and EM of human behavior. To bridge the gap between subjective quality and human behavior, this paper proposes a large-scale visual quality assessment (VQA) dataset of omnidirectional video, called VQA-OV, which collects 60 reference sequences and 540 impaired sequences. Our VQA-OV dataset provides not only the subjective quality scores of sequences but also the HM and EM data of subjects. By mining our dataset, we find that the subjective quality of omnidirectional video is indeed related to HM and EM. Hence, we develop a deep learning model, which embeds HM and EM, for objective VQA on omnidirectional video. Experimental results show that our model significantly improves the state-of-the-art performance of VQA on omnidirectional video.
Tasks Visual Question Answering
Published 2018-07-29
URL http://arxiv.org/abs/1807.10990v1
PDF http://arxiv.org/pdf/1807.10990v1.pdf
PWC https://paperswithcode.com/paper/bridge-the-gap-between-vqa-and-human-behavior
Repo https://github.com/Archer-Tatsu/VQA-ODV
Framework none

AgriColMap: Aerial-Ground Collaborative 3D Mapping for Precision Farming

Title AgriColMap: Aerial-Ground Collaborative 3D Mapping for Precision Farming
Authors Ciro Potena, Raghav Khanna, Juan Nieto, Roland Siegwart, Daniele Nardi, Alberto Pretto
Abstract The combination of the aerial survey capabilities of Unmanned Aerial Vehicles with the targeted intervention abilities of agricultural Unmanned Ground Vehicles can significantly improve the effectiveness of robotic systems applied to precision agriculture. In this context, building and updating a common map of the field is an essential but challenging task. The maps built using robots of different types show differences in size, resolution and scale, the associated geolocation data may be inaccurate and biased, and the repetitiveness of both visual appearance and geometric structures found in agricultural contexts renders classical map merging techniques ineffective. In this paper we propose AgriColMap, a novel map registration pipeline that leverages a grid-based multimodal environment representation which includes a vegetation index map and a Digital Surface Model. We cast the data association problem between maps built from UAVs and UGVs as a multimodal, large-displacement dense optical flow estimation. The dominant, coherent flows, selected using a voting scheme, are used as point-to-point correspondences to infer a preliminary non-rigid alignment between the maps. A final refinement is then performed, exploiting only meaningful parts of the registered maps. We evaluate our system using real-world data for three fields with different crop species. The results show that our method outperforms several state-of-the-art map registration and matching techniques by a large margin, and has a higher tolerance to large initial misalignments. With this paper, we release an implementation of the proposed approach along with the acquired datasets.
Tasks Optical Flow Estimation
Published 2018-09-30
URL http://arxiv.org/abs/1810.00457v2
PDF http://arxiv.org/pdf/1810.00457v2.pdf
PWC https://paperswithcode.com/paper/agricolmap-aerial-ground-collaborative-3d
Repo https://github.com/cirpote/AgriColMap
Framework none
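
The voting step over dense flow can be illustrated compactly: bin the displacement vectors and keep the cells that agree with the most-voted bin as point-to-point correspondences. This numpy sketch is my reading of the idea, not the released implementation; the bin size and shapes are arbitrary choices:

```python
import numpy as np

def dominant_flow_mask(flow, bin_size=2.0):
    """flow: (H, W, 2) displacement field between the two maps."""
    v = flow.reshape(-1, 2)
    bins = np.round(v / bin_size).astype(int)
    uniq, inv, counts = np.unique(bins, axis=0, return_inverse=True,
                                  return_counts=True)
    winner = counts.argmax()                  # most-voted displacement bin
    return (inv == winner).reshape(flow.shape[:2])

flow = 0.5 * np.random.randn(64, 64, 2)       # incoherent background flow
flow[8:56, 8:56] += np.array([10.0, -4.0])    # one coherent block of motion
print(dominant_flow_mask(flow).sum(), "cells vote for the dominant flow")
```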

Which Training Methods for GANs do actually Converge?

Title Which Training Methods for GANs do actually Converge?
Authors Lars Mescheder, Andreas Geiger, Sebastian Nowozin
Abstract Recent work has shown local convergence of GAN training for absolutely continuous data and generator distributions. In this paper, we show that the requirement of absolute continuity is necessary: we describe a simple yet prototypical counterexample showing that in the more realistic case of distributions that are not absolutely continuous, unregularized GAN training is not always convergent. Furthermore, we discuss regularization strategies that were recently proposed to stabilize GAN training. Our analysis shows that GAN training with instance noise or zero-centered gradient penalties converges. On the other hand, we show that Wasserstein-GANs and WGAN-GP with a finite number of discriminator updates per generator update do not always converge to the equilibrium point. We discuss these results, leading us to a new explanation for the stability problems of GAN training. Based on our analysis, we extend our convergence results to more general GANs and prove local convergence for simplified gradient penalties even if the generator and data distribution lie on lower dimensional manifolds. We find these penalties to work well in practice and use them to learn high-resolution generative image models for a variety of datasets with little hyperparameter tuning.
Tasks
Published 2018-01-13
URL http://arxiv.org/abs/1801.04406v4
PDF http://arxiv.org/pdf/1801.04406v4.pdf
PWC https://paperswithcode.com/paper/which-training-methods-for-gans-do-actually
Repo https://github.com/wittawatj/cadgan
Framework pytorch
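
The zero-centered gradient penalty whose convergence the paper establishes (often called R1 when placed on real data) is only a few lines in PyTorch. A standard rendering, not lifted from the linked repo:

```python
import torch

def r1_penalty(discriminator, real, gamma=10.0):
    real = real.detach().requires_grad_(True)
    scores = discriminator(real).sum()
    (grad,) = torch.autograd.grad(scores, real, create_graph=True)
    # Penalize the squared gradient norm of D at real data points.
    return (gamma / 2) * grad.pow(2).flatten(1).sum(1).mean()
```

In training, this term is simply added to the discriminator loss on real batches; `create_graph=True` is what makes the penalty itself differentiable for the discriminator update.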

Removing the Feature Correlation Effect of Multiplicative Noise

Title Removing the Feature Correlation Effect of Multiplicative Noise
Authors Zijun Zhang, Yining Zhang, Zongpeng Li
Abstract Multiplicative noise, including dropout, is widely used to regularize deep neural networks (DNNs), and is shown to be effective in a wide range of architectures and tasks. From an information perspective, we consider injecting multiplicative noise into a DNN as training the network to solve the task with noisy information pathways, which leads to the observation that multiplicative noise tends to increase the correlation between features, so as to increase the signal-to-noise ratio of information pathways. However, high feature correlation is undesirable, as it increases redundancy in representations. In this work, we propose non-correlating multiplicative noise (NCMN), which exploits batch normalization to remove the correlation effect in a simple yet effective way. We show that NCMN significantly improves the performance of standard multiplicative noise on image classification tasks, providing a better alternative to dropout for batch-normalized networks. Additionally, we present a unified view of NCMN and shake-shake regularization, which explains the performance gain of the latter.
Tasks Image Classification
Published 2018-09-19
URL http://arxiv.org/abs/1809.07023v1
PDF http://arxiv.org/pdf/1809.07023v1.pdf
PWC https://paperswithcode.com/paper/removing-the-feature-correlation-effect-of
Repo https://github.com/zj10/NCMN
Framework pytorch
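
As a loose sketch only of the ingredients involved — batch normalization followed by multiplicative Gaussian noise — consider the layer below. The actual NCMN construction in the paper is more specific about how the normalization removes the correlation effect, so treat this as context, not the method:

```python
import torch
import torch.nn as nn

class NoisyBNLayer(nn.Module):
    def __init__(self, cin, cout, sigma=0.5):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(cout)
        self.sigma = sigma

    def forward(self, x):
        h = self.bn(self.conv(x))
        if self.training:                        # noise at train time only
            h = h * (1 + self.sigma * torch.randn_like(h))
        return torch.relu(h)
```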

Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions

Title Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions
Authors Jing Peng, Anna Feldman, Ekaterina Vylomova
Abstract We describe an algorithm for automatic classification of idiomatic and literal expressions. Our starting point is that words in a given text segment, such as a paragraph, that are high-ranking representatives of a common topic of discussion are less likely to be part of an idiomatic expression. Our additional hypothesis is that the contexts in which idioms occur are typically more affective; we therefore incorporate a simple analysis of the intensity of the emotions expressed by the contexts. We investigate the bag-of-words topic representation of one to three paragraphs containing an expression that should be classified as idiomatic or literal (a target phrase). We extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Since idiomatic expressions exhibit the property of non-compositionality, we assume that they usually present different semantics than the words used in the local topic. We treat idioms as semantic outliers, and the identification of a semantic shift as outlier detection. Thus, this topic representation allows us to differentiate idioms from literals using local semantic contexts. Our results are encouraging.
Tasks Outlier Detection, Topic Models
Published 2018-02-27
URL http://arxiv.org/abs/1802.09961v1
PDF http://arxiv.org/pdf/1802.09961v1.pdf
PWC https://paperswithcode.com/paper/classifying-idiomatic-and-literal-expressions-2
Repo https://github.com/bondfeld/BNC_idioms
Framework none
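
For readers who want to reproduce the bag-of-words topic representation the method builds on, here is an illustrative scikit-learn LDA snippet (the authors cite Blei et al.'s LDA; the library choice and toy paragraphs are mine):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

paragraphs = ["he kicked the bucket and the funeral is on sunday",
              "she kicked the bucket of water across the farm yard",
              "the market crashed and investors lost their savings"]

bow = CountVectorizer(stop_words="english")
X = bow.fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))   # per-paragraph topic mixtures
```

An idiomatic use of "kicked the bucket" is then flagged when the target phrase sits poorly in its paragraph's dominant topic — the semantic-outlier view described above.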

Recurrent Flow-Guided Semantic Forecasting

Title Recurrent Flow-Guided Semantic Forecasting
Authors Adam M. Terwilliger, Garrick Brazil, Xiaoming Liu
Abstract Understanding the world around us and making decisions about the future is a critical component of human intelligence. As autonomous systems continue to develop, their ability to reason about the future will be key to their success. Semantic anticipation is a relatively under-explored area that autonomous vehicles could take advantage of (e.g., for forecasting pedestrian trajectories). Motivated by the need for real-time prediction in autonomous systems, we propose to decompose the challenging semantic forecasting task into two subtasks: current-frame segmentation and future optical flow prediction. Through this decomposition, we build an efficient, effective, low-overhead model with three main components: a flow prediction network, a feature-flow aggregation LSTM, and an end-to-end learnable warp layer. Our proposed method achieves state-of-the-art accuracy on short-term and moving-objects semantic forecasting while simultaneously reducing model parameters by up to 95% and increasing efficiency by greater than 40x.
Tasks Autonomous Vehicles, Optical Flow Estimation
Published 2018-09-21
URL http://arxiv.org/abs/1809.08318v2
PDF http://arxiv.org/pdf/1809.08318v2.pdf
PWC https://paperswithcode.com/paper/recurrent-flow-guided-semantic-forecasting
Repo https://github.com/adamtwig/segpred
Framework none
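
The end-to-end learnable warp layer is commonly realized with bilinear sampling. A PyTorch sketch of warping current-frame features with a predicted flow (my rendering of the generic technique, not the authors' code):

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """feat: (N, C, H, W) features; flow: (N, 2, H, W) flow in pixels."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)  # (2, H, W)
    tgt = grid.unsqueeze(0) + flow
    # Normalize to [-1, 1], x before y, as grid_sample expects.
    tgt_x = 2 * tgt[:, 0] / (w - 1) - 1
    tgt_y = 2 * tgt[:, 1] / (h - 1) - 1
    return F.grid_sample(feat, torch.stack((tgt_x, tgt_y), dim=-1),
                         align_corners=True)
```

Because `grid_sample` is differentiable in both its inputs, the flow prediction network upstream can be trained end to end from the forecasting loss.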

MAttNet: Modular Attention Network for Referring Expression Comprehension

Title MAttNet: Modular Attention Network for Referring Expression Comprehension
Authors Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg
Abstract In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.
Tasks
Published 2018-01-24
URL http://arxiv.org/abs/1801.08186v3
PDF http://arxiv.org/pdf/1801.08186v3.pdf
PWC https://paperswithcode.com/paper/mattnet-modular-attention-network-for
Repo https://github.com/lichengunc/MAttNet
Framework pytorch
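
The dynamic combination of module scores reduces to very little code. A toy sketch (shapes and names are mine, not MAttNet's actual interface): language features predict softmax weights over the subject/location/relationship modules, and the overall score is the weighted sum.

```python
import torch
import torch.nn as nn

class ModuleScoreCombiner(nn.Module):
    def __init__(self, lang_dim=512, n_modules=3):
        super().__init__()
        self.weight_fc = nn.Linear(lang_dim, n_modules)

    def forward(self, lang_feat, module_scores):
        """lang_feat: (N, lang_dim); module_scores: (N, 3) raw scores."""
        w = torch.softmax(self.weight_fc(lang_feat), dim=-1)
        return (w * module_scores).sum(dim=-1)   # (N,) overall score
```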

On the stability analysis of deep neural network representations of an optimal state-feedback

Title On the stability analysis of deep neural network representations of an optimal state-feedback
Authors Dario Izzo, Dharmesh Tailor, Thomas Vasileiou
Abstract Recent work has shown how the optimal state-feedback, obtained as the solution to the Hamilton-Jacobi-Bellman equations, can be approximated for several nonlinear, deterministic systems by deep neural networks. When imitation (supervised) learning is used to train the neural network on optimal state-action pairs, for instance as derived by applying Pontryagin’s theory of optimal processes, the resulting model is referred to here as the guidance and control network. In this work, we analyze the stability of nonlinear and deterministic systems controlled by such networks. We then propose a method utilising differential algebraic techniques and high-order Taylor maps to gain information on the stability of the neurocontrolled state trajectories. We exemplify the proposed methods in the case of the two-dimensional dynamics of a quadcopter controlled to reach the origin, and we study how different architectures of the guidance and control network affect the stability of the target equilibrium point and the stability margins under time delay. Moreover, we show how to study the robustness to initial conditions of a nominal trajectory, using a Taylor representation of the neurocontrolled neighbouring trajectories.
Tasks Imitation Learning
Published 2018-12-06
URL http://arxiv.org/abs/1812.02532v3
PDF http://arxiv.org/pdf/1812.02532v3.pdf
PWC https://paperswithcode.com/paper/on-the-stability-analysis-of-optimal-state
Repo https://github.com/darioizzo/neurostability
Framework none
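
A first-order stand-in for the analysis above (the paper goes further, with differential algebra and high-order Taylor maps, but the first step is the same): linearize the neurocontrolled closed loop at the equilibrium and inspect the eigenvalues of the Jacobian. The toy system and names below are mine:

```python
import torch

def closed_loop_eigs(dynamics, policy, x_eq):
    def f(x):
        return dynamics(x, policy(x))            # closed-loop vector field
    J = torch.autograd.functional.jacobian(f, x_eq)
    return torch.linalg.eigvals(J)    # all Re < 0 => locally stable

# Toy example: double integrator with a (here untrained, linear) controller.
A = torch.tensor([[0.0, 1.0], [0.0, 0.0]])
B = torch.tensor([[0.0], [1.0]])
policy = torch.nn.Linear(2, 1, bias=False)
dynamics = lambda x, u: A @ x + B @ u
print(closed_loop_eigs(dynamics, policy, torch.zeros(2)))
```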