July 30, 2019

3255 words 16 mins read

Paper Group AWR 73

Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning. Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks. Active Learning for Graph Embedding. Encouraging LSTMs to Anticipate Actions Very Early. Localizing Moments in Video with Natural Language. Globally-Optimal Inlier Set Maximisation fo …

Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

Title Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
Authors Daniel S. Brown, Scott Niekum
Abstract In the field of reinforcement learning there has been recent progress towards safety and high-confidence bounds on policy performance. However, to our knowledge, no practical methods exist for determining high-confidence policy performance bounds in the inverse reinforcement learning setting—where the true reward function is unknown and only samples of expert behavior are given. We propose a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the $\alpha$-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert’s unknown reward function. We evaluate our proposed bound on both a standard grid navigation task and a simulated driving task and achieve tighter and more accurate bounds than a feature count-based baseline. We also give examples of how our proposed bound can be utilized to perform risk-aware policy selection and risk-aware policy improvement. Because our proposed bound requires several orders of magnitude fewer demonstrations than existing high-confidence bounds, it is the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.
Tasks
Published 2017-07-03
URL http://arxiv.org/abs/1707.00724v5
PDF http://arxiv.org/pdf/1707.00724v5.pdf
PWC https://paperswithcode.com/paper/efficient-probabilistic-performance-bounds
Repo https://github.com/dsbrown1331/safe-imitation-learning
Framework none
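
Once posterior reward samples are available from Bayesian IRL, the bound itself reduces to an empirical quantile of the expected-return gap between the evaluation policy and the optimal policy under each sampled reward. A minimal sketch of that quantile step, assuming the per-sample policy-loss values have already been computed (names and the example data are illustrative, not the paper's exact procedure):

```python
import numpy as np

def alpha_worst_case_bound(policy_losses, alpha=0.95):
    """Empirical alpha-quantile of the policy-loss samples.

    policy_losses : array of V*(R_i) - V^eval(R_i), one value per reward
                    sample R_i drawn from the Bayesian IRL posterior.
    Returns an upper bound on the alpha-worst-case performance gap.
    """
    losses = np.asarray(policy_losses, dtype=float)
    return np.quantile(losses, alpha)

# Example: 2000 posterior samples of the expected-return gap.
rng = np.random.default_rng(0)
gaps = np.maximum(rng.normal(0.3, 0.2, size=2000), 0.0)
print(alpha_worst_case_bound(gaps, alpha=0.95))
```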

Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks

Title Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks
Authors Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, Shih-Fu Chang
Abstract We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test images and their classes are both unseen during training. SP-AEN aims to tackle the inherent problem — semantic loss — in the prevailing family of embedding-based ZSL, where some semantics would be discarded during training if they are non-discriminative for training classes, but could become critical for recognizing test classes. Specifically, SP-AEN prevents the semantic loss by introducing an independent visual-to-semantic space embedder which disentangles the semantic space into two subspaces for the two arguably conflicting objectives: classification and reconstruction. Through adversarial learning of the two subspaces, SP-AEN can transfer the semantics from the reconstructive subspace to the discriminative one, accomplishing improved zero-shot recognition of unseen classes. Compared with prior works, SP-AEN can not only improve classification but also generate photo-realistic images, demonstrating the effectiveness of semantic preservation. On four popular benchmarks: CUB, AWA, SUN and aPY, SP-AEN considerably outperforms other state-of-the-art methods by an absolute performance difference of 12.2%, 9.3%, 4.0%, and 3.6% in terms of harmonic mean values.
Tasks Zero-Shot Learning
Published 2017-12-05
URL http://arxiv.org/abs/1712.01928v2
PDF http://arxiv.org/pdf/1712.01928v2.pdf
PWC https://paperswithcode.com/paper/zero-shot-visual-recognition-using-semantics
Repo https://github.com/MARMOTatZJU/ZSLPR-TIANCHI
Framework pytorch
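
SP-AEN builds on the standard embedding-based ZSL recipe of mapping image features into a semantic (attribute) space and classifying by the nearest class embedding; the paper's contribution is the adversarially coupled reconstruction subspace that preserves non-discriminative semantics. A minimal sketch of that underlying classification step only (the adversarial part is omitted; all names and shapes are illustrative):

```python
import numpy as np

def zsl_classify(img_feats, W, class_attrs):
    """Embed image features into the semantic space and pick the
    nearest unseen-class attribute vector by cosine similarity."""
    sem = img_feats @ W                                   # (N, d_sem)
    sem /= np.linalg.norm(sem, axis=1, keepdims=True) + 1e-8
    attrs = class_attrs / (np.linalg.norm(class_attrs, axis=1, keepdims=True) + 1e-8)
    return np.argmax(sem @ attrs.T, axis=1)               # predicted class ids

rng = np.random.default_rng(0)
preds = zsl_classify(rng.normal(size=(5, 2048)),    # CNN features
                     rng.normal(size=(2048, 85)),   # learned visual-to-semantic map
                     rng.normal(size=(10, 85)))     # unseen-class attribute vectors
print(preds)
```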

Active Learning for Graph Embedding

Title Active Learning for Graph Embedding
Authors Hongyun Cai, Vincent W. Zheng, Kevin Chen-Chuan Chang
Abstract Graph embedding provides an efficient solution for graph analysis by converting the graph into a low-dimensional space which preserves the structure information. In contrast to the graph structure data, the i.i.d. node embedding can be processed efficiently in terms of both time and space. Current semi-supervised graph embedding algorithms assume the labelled nodes are given, which may not always be true in the real world. Since manually labelling all training data is impractical, how to select the subset of training data to label so as to maximize the graph analysis task performance is of great importance. This motivates our proposed active graph embedding (AGE) framework, in which we design a general active learning query strategy for any semi-supervised graph embedding algorithm. AGE selects the most informative nodes as the training labelled nodes based on the graphical information (i.e., node centrality) as well as the learnt node embedding (i.e., node classification uncertainty and node embedding representativeness). Different query criteria are combined with time-sensitive parameters which shift the focus from graph-based query criteria to embedding-based criteria as the learning progresses. Experiments have been conducted on three public data sets and the results verify the effectiveness of each component of our query strategy and the power of combining them using time-sensitive parameters. Our code is available online at: https://github.com/vwz/AGE.
Tasks Active Learning, Graph Embedding, Node Classification
Published 2017-05-15
URL http://arxiv.org/abs/1705.05085v1
PDF http://arxiv.org/pdf/1705.05085v1.pdf
PWC https://paperswithcode.com/paper/active-learning-for-graph-embedding
Repo https://github.com/vwz/AGE
Framework tf
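
The heart of the AGE query strategy is a score that blends graph-based and embedding-based criteria with a weight that drifts over training time. A rough sketch of that scoring and node selection, assuming the three per-node criteria have already been computed and normalised (the linear schedule is illustrative, not necessarily the paper's exact one):

```python
import numpy as np

def age_query_score(centrality, uncertainty, representativeness, t, T):
    """Combine the three AGE criteria with a time-sensitive weight.

    Early in training (small t) the graph-based centrality dominates;
    later the embedding-based uncertainty and representativeness take over.
    """
    gamma = t / float(T)        # 0 -> graph criterion, 1 -> embedding criteria
    return (1 - gamma) * centrality + gamma * 0.5 * (uncertainty + representativeness)

def select_node(scores, labelled_mask):
    """Pick the highest-scoring node that has not been labelled yet."""
    scores = np.where(labelled_mask, -np.inf, scores)
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
n = 100
scores = age_query_score(rng.random(n), rng.random(n), rng.random(n), t=5, T=20)
print(select_node(scores, labelled_mask=np.zeros(n, dtype=bool)))
```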

Encouraging LSTMs to Anticipate Actions Very Early

Title Encouraging LSTMs to Anticipate Actions Very Early
Authors Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, Lars Andersson
Abstract In contrast to the widely studied problem of recognizing an action given a complete sequence, action anticipation aims to identify the action from only partially available videos. It is therefore key to the success of computer vision applications that need to react as early as possible, such as autonomous navigation. In this paper, we propose a new action anticipation method that achieves high prediction accuracy even when only a very small fraction of a video sequence has been observed. To this end, we develop a multi-stage LSTM architecture that leverages context-aware and action-aware features, and introduce a novel loss function that encourages the model to predict the correct class as early as possible. Our experiments on standard benchmark datasets evidence the benefits of our approach: we outperform the state-of-the-art action anticipation methods for early prediction by a relative increase in accuracy of 22.0% on JHMDB-21, 14.0% on UT-Interaction and 49.9% on UCF-101.
Tasks Autonomous Navigation
Published 2017-03-21
URL http://arxiv.org/abs/1703.07023v3
PDF http://arxiv.org/pdf/1703.07023v3.pdf
PWC https://paperswithcode.com/paper/encouraging-lstms-to-anticipate-actions-very
Repo https://github.com/mangalutsav/Multi-Stage-LSTM-for-Action-Anticipation
Framework none
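
The key ingredient is a loss that penalises wrong predictions more heavily as more of the video is observed, which implicitly rewards correct predictions made early. A rough sketch of such a time-weighted cross-entropy over per-frame outputs (the weighting here is illustrative and simpler than the paper's exact formulation):

```python
import numpy as np

def anticipation_loss(probs, label):
    """Time-weighted cross-entropy over per-frame class probabilities.

    probs : (T, C) softmax outputs for each of the T observed frames.
    label : ground-truth class index.
    Frames seen later carry larger weight on the true class, so a model
    that becomes confident early pays a smaller total penalty.
    """
    T = probs.shape[0]
    t = np.arange(1, T + 1) / T                # fraction of the video observed
    true_p = np.clip(probs[:, label], 1e-8, 1.0)
    return float(np.mean(-t * np.log(true_p)))

# Example: 10 frames, 3 classes, model slowly converging to class 2.
probs = np.linspace([0.4, 0.4, 0.2], [0.05, 0.05, 0.9], num=10)
print(anticipation_loss(probs, label=2))
```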

Localizing Moments in Video with Natural Language

Title Localizing Moments in Video with Natural Language
Authors Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
Abstract We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.
Tasks
Published 2017-08-04
URL http://arxiv.org/abs/1708.01641v1
PDF http://arxiv.org/pdf/1708.01641v1.pdf
PWC https://paperswithcode.com/paper/localizing-moments-in-video-with-natural
Repo https://github.com/mrsalehi/ground-sentence-video
Framework pytorch
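
MCN localises a query by embedding candidate moments with a mix of local segment features, global video context, and temporal endpoint features, then ranking moments by their distance to the language embedding. A compact sketch of that ranking step, assuming all features are precomputed (shapes and the single projection matrix are illustrative):

```python
import numpy as np

def score_moments(query_emb, local_feats, global_feat, temporal_feats, W):
    """Score candidate moments: smaller squared distance = better match.

    local_feats    : (M, d) pooled features of each candidate segment
    global_feat    : (d,)   feature of the whole video (context)
    temporal_feats : (M, 2) normalised start/end of each segment
    W              : projection of the concatenated video features into
                     the query embedding space.
    """
    M = local_feats.shape[0]
    fused = np.concatenate(
        [local_feats, np.tile(global_feat, (M, 1)), temporal_feats], axis=1)
    video_emb = fused @ W                           # (M, d_query)
    d = np.sum((video_emb - query_emb) ** 2, axis=1)
    return int(np.argmin(d)), d                     # best moment index, all scores
```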

Globally-Optimal Inlier Set Maximisation for Simultaneous Camera Pose and Feature Correspondence

Title Globally-Optimal Inlier Set Maximisation for Simultaneous Camera Pose and Feature Correspondence
Authors Dylan Campbell, Lars Petersson, Laurent Kneip, Hongdong Li
Abstract Estimating the 6-DoF pose of a camera from a single image relative to a pre-computed 3D point-set is an important task for many computer vision applications. Perspective-n-Point (PnP) solvers are routinely used for camera pose estimation, provided that a good quality set of 2D-3D feature correspondences are known beforehand. However, finding optimal correspondences between 2D key-points and a 3D point-set is non-trivial, especially when only geometric (position) information is known. Existing approaches to the simultaneous pose and correspondence problem use local optimisation, and are therefore unlikely to find the optimal solution without a good pose initialisation, or introduce restrictive assumptions. Since a large proportion of outliers are common for this problem, we instead propose a globally-optimal inlier set cardinality maximisation approach which jointly estimates optimal camera pose and optimal correspondences. Our approach employs branch-and-bound to search the 6D space of camera poses, guaranteeing global optimality without requiring a pose prior. The geometry of SE(3) is used to find novel upper and lower bounds for the number of inliers and local optimisation is integrated to accelerate convergence. The evaluation empirically supports the optimality proof and shows that the method performs much more robustly than existing approaches, including on a large-scale outdoor data-set.
Tasks Pose Estimation
Published 2017-09-27
URL http://arxiv.org/abs/1709.09384v1
PDF http://arxiv.org/pdf/1709.09384v1.pdf
PWC https://paperswithcode.com/paper/globally-optimal-inlier-set-maximisation-for
Repo https://github.com/Awesome-Image-Registration-Organization/2D-3D-matching
Framework none
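
The objective being maximised is the cardinality of the inlier set for a candidate camera pose, without knowing the 2D-3D correspondences in advance; branch-and-bound then needs upper and lower bounds on this count over regions of SE(3). A sketch of the basic inlier count for a single pose, using an angular threshold between observed bearing vectors and transformed model points (all details are illustrative, not the paper's bounding functions):

```python
import numpy as np

def count_inliers(R, t, points_3d, bearings_2d, angle_thresh_rad):
    """For each 2D bearing, count it as an inlier if *some* transformed
    3D point lies within the angular threshold (correspondences unknown).

    R, t        : candidate rotation (3x3) and translation (3,)
    points_3d   : (M, 3) model points
    bearings_2d : (N, 3) unit bearing vectors of the 2D keypoints
    """
    p_cam = points_3d @ R.T + t                                   # points in camera frame
    dirs = p_cam / np.linalg.norm(p_cam, axis=1, keepdims=True)   # (M, 3)
    cosang = bearings_2d @ dirs.T                                 # (N, M) pairwise cosines
    best = np.arccos(np.clip(cosang.max(axis=1), -1.0, 1.0))      # best match per bearing
    return int(np.sum(best < angle_thresh_rad))
```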

Towards a Seamless Integration of Word Senses into Downstream NLP Applications

Title Towards a Seamless Integration of Word Senses into Downstream NLP Applications
Authors Mohammad Taher Pilehvar, Jose Camacho-Collados, Roberto Navigli, Nigel Collier
Abstract Lexical ambiguity can impede NLP systems from accurate understanding of semantics. Despite its potential benefits, the integration of sense-level information into NLP systems has remained understudied. By incorporating a novel disambiguation algorithm into a state-of-the-art classification model, we create a pipeline to integrate sense-level information into downstream NLP applications. We show that a simple disambiguation of the input text can lead to consistent performance improvement on multiple topic categorization and polarity detection datasets, particularly when the fine granularity of the underlying sense inventory is reduced and the document is sufficiently large. Our results also point to the need for sense representation research to focus more on in vivo evaluations which target the performance in downstream NLP applications rather than artificial benchmarks.
Tasks
Published 2017-10-18
URL http://arxiv.org/abs/1710.06632v1
PDF http://arxiv.org/pdf/1710.06632v1.pdf
PWC https://paperswithcode.com/paper/towards-a-seamless-integration-of-word-senses
Repo https://github.com/pilehvar/sensecnn
Framework none
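
The pipeline's first stage replaces ambiguous input words with sense identifiers before the text reaches a standard classifier. A toy overlap-based (Lesk-style) disambiguation step that sketches this preprocessing idea; the paper's own algorithm and sense inventory differ, and the inventory below is invented for illustration:

```python
def disambiguate(tokens, sense_inventory):
    """Replace each token with its best-matching sense id, if any.

    sense_inventory : {word: [(sense_id, gloss_tokens), ...]}
    The score is a simple gloss/context overlap (a Lesk-style heuristic).
    """
    context = set(tokens)
    out = []
    for tok in tokens:
        senses = sense_inventory.get(tok)
        if not senses:
            out.append(tok)
            continue
        best_id, _ = max(senses, key=lambda s: len(context & set(s[1])))
        out.append(best_id)
    return out

inv = {"bank": [("bank%finance", ["money", "deposit", "loan"]),
                ("bank%river", ["river", "water", "shore"])]}
print(disambiguate("deposit money at the bank".split(), inv))
```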

Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image

Title Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image
Authors Fangchang Ma, Sertac Karaman
Abstract We consider the problem of dense depth prediction from a sparse set of depth measurements and a single RGB image. Since depth estimation from monocular images alone is inherently ambiguous and unreliable, to attain a higher level of robustness and accuracy, we introduce additional sparse depth samples, which are either acquired with a low-resolution depth sensor or computed via visual Simultaneous Localization and Mapping (SLAM) algorithms. We propose the use of a single deep regression network to learn directly from the RGB-D raw data, and explore the impact of the number of depth samples on prediction accuracy. Our experiments show that, compared to using only RGB images, the addition of 100 spatially random depth samples reduces the prediction root-mean-square error by 50% on the NYU-Depth-v2 indoor dataset. It also boosts the percentage of reliable predictions from 59% to 92% on the KITTI dataset. We demonstrate two applications of the proposed algorithm: a plug-in module in SLAM to convert sparse maps to dense maps, and super-resolution for LiDARs. Software and video demonstration are publicly available.
Tasks Depth Estimation, Simultaneous Localization and Mapping, Super-Resolution
Published 2017-09-21
URL http://arxiv.org/abs/1709.07492v2
PDF http://arxiv.org/pdf/1709.07492v2.pdf
PWC https://paperswithcode.com/paper/sparse-to-dense-depth-prediction-from-sparse
Repo https://github.com/fangchangma/sparse-to-dense.pytorch
Framework pytorch
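
The network's input is simply the RGB image stacked with a sparse depth map obtained by keeping a small number of randomly chosen valid depth measurements. A small sketch of that input construction (function names are illustrative, not the repository's API):

```python
import numpy as np

def make_sparse_depth(depth, num_samples, rng=None):
    """Keep `num_samples` random valid depth pixels, zero out the rest."""
    rng = rng or np.random.default_rng()
    valid = np.argwhere(depth > 0)
    keep = valid[rng.choice(len(valid), size=min(num_samples, len(valid)),
                            replace=False)]
    sparse = np.zeros_like(depth)
    sparse[keep[:, 0], keep[:, 1]] = depth[keep[:, 0], keep[:, 1]]
    return sparse

def make_rgbd_input(rgb, depth, num_samples=100):
    """Stack RGB with the sparse depth channel: a 4-channel network input."""
    sparse = make_sparse_depth(depth, num_samples)
    return np.concatenate([rgb, sparse[..., None]], axis=-1)   # (H, W, 4)
```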

Basic concepts and tools for the Toki Pona minimal and constructed language: description of the language and main issues; analysis of the vocabulary; text synthesis and syntax highlighting; Wordnet synsets

Title Basic concepts and tools for the Toki Pona minimal and constructed language: description of the language and main issues; analysis of the vocabulary; text synthesis and syntax highlighting; Wordnet synsets
Authors Renato Fabbri
Abstract A minimal constructed language (conlang) is useful for experiments and comfortable for making tools. The Toki Pona (TP) conlang is minimal both in the vocabulary (with only 14 letters and 124 lemmas) and in the (about) 10 syntax rules. The language is useful in that it is an actively used and somewhat established minimal conlang with at least hundreds of fluent speakers. This article exposes current concepts and resources for TP, and makes available Python (and Vim) scripted routines for the analysis of the language, synthesis of texts, syntax highlighting schemes, and the achievement of a preliminary TP Wordnet. Focus is on the analysis of the basic vocabulary, as corpus analyses were found. The synthesis is based on sentence templates, relates to context by keeping track of used words, and renders larger texts by using a fixed number of phonemes (e.g. for poems) and number of sentences, words and letters (e.g. for paragraphs). Syntax highlighting reflects morphosyntactic classes given in the official dictionary and different solutions are described and implemented in the well-established Vim text editor. The tentative TP Wordnet is made available in three patterns of relations between synsets and word lemmas. In summary, this text holds potentially novel conceptualizations about, and tools and results in analyzing, synthesizing and syntax highlighting the TP language.
Tasks
Published 2017-12-26
URL http://arxiv.org/abs/1712.09359v3
PDF http://arxiv.org/pdf/1712.09359v3.pdf
PWC https://paperswithcode.com/paper/basic-concepts-and-tools-for-the-toki-pona
Repo https://github.com/ttm/tokipona
Framework none
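
The text synthesis described above is template-based and keeps track of which words have already been used. A toy sketch of template filling with word tracking; the templates and the tiny word list are illustrative, not the repository's actual data:

```python
import random

# A few Toki Pona words grouped roughly by role (illustrative subset).
WORDS = {"noun": ["jan", "moku", "tomo", "telo"],
         "verb": ["moku", "pali", "lukin"],
         "mod":  ["pona", "suli", "lili"]}
TEMPLATES = ["{noun} li {verb}", "{noun} li {verb} e {noun2}", "{noun} li {mod}"]

def synthesize(n_sentences, seed=0):
    """Fill random templates, preferring words not used yet (context tracking)."""
    rng, used, out = random.Random(seed), set(), []
    def pick(role):
        fresh = [w for w in WORDS[role] if w not in used] or WORDS[role]
        w = rng.choice(fresh)
        used.add(w)
        return w
    for _ in range(n_sentences):
        template = rng.choice(TEMPLATES)
        out.append(template.format(noun=pick("noun"), noun2=pick("noun"),
                                   verb=pick("verb"), mod=pick("mod")) + ".")
    return " ".join(out)

print(synthesize(3))
```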

Improving Context Aware Language Models

Title Improving Context Aware Language Models
Authors Aaron Jaech, Mari Ostendorf
Abstract Increased adaptability of RNN language models leads to improved predictions that benefit many applications. However, current methods do not take full advantage of the RNN structure. We show that the most widely-used approach to adaptation (concatenating the context with the word embedding at the input to the recurrent layer) is outperformed by a model with some low-cost improvements: adaptation of both the hidden and output layers, and a feature hashing bias term to capture context idiosyncrasies. Experiments on language modeling and classification tasks using three different corpora demonstrate the advantages of the proposed techniques.
Tasks Language Modelling
Published 2017-04-21
URL http://arxiv.org/abs/1704.06380v1
PDF http://arxiv.org/pdf/1704.06380v1.pdf
PWC https://paperswithcode.com/paper/improving-context-aware-language-models
Repo https://github.com/ajaech/calm
Framework tf
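
One of the proposed low-cost adaptations is a feature-hashing bias term added to the output layer to absorb context idiosyncrasies. A rough numpy sketch of that idea; the hash function, dimensions and feature strings are illustrative, not the paper's configuration:

```python
import numpy as np

def hashed_context_bias(context_features, vocab_size, num_buckets, weights):
    """Hash sparse context features into buckets and sum their learned
    per-vocabulary bias rows; the result is added to the softmax logits.

    context_features : list of strings, e.g. "user=42", "topic=sports"
    weights          : (num_buckets, vocab_size) learned bias table
    Note: Python's built-in hash varies across runs; a fixed hash function
    would be used in a real implementation.
    """
    bias = np.zeros(vocab_size)
    for f in context_features:
        bias += weights[hash(f) % num_buckets]
    return bias

vocab, buckets = 10000, 512
W = np.zeros((buckets, vocab))                      # learned during training
logits = np.random.randn(vocab) + hashed_context_bias(
    ["user=42", "topic=sports"], vocab, buckets, W)
```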

Learning to Infer Graphics Programs from Hand-Drawn Images

Title Learning to Infer Graphics Programs from Hand-Drawn Images
Authors Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Joshua B. Tenenbaum
Abstract We introduce a model that learns to convert simple hand drawings into graphics programs written in a subset of \LaTeX. The model combines techniques from deep learning and program synthesis. We learn a convolutional neural network that proposes plausible drawing primitives that explain an image. These drawing primitives are like a trace of the set of primitive commands issued by a graphics program. We learn a model that uses program synthesis techniques to recover a graphics program from that trace. These programs have constructs like variable bindings, iterative loops, or simple kinds of conditionals. With a graphics program in hand, we can correct errors made by the deep network, measure similarity between drawings by use of similar high-level geometric structures, and extrapolate drawings. Taken together these results are a step towards agents that induce useful, human-readable programs from perceptual input.
Tasks Program Synthesis
Published 2017-07-30
URL http://arxiv.org/abs/1707.09627v5
PDF http://arxiv.org/pdf/1707.09627v5.pdf
PWC https://paperswithcode.com/paper/learning-to-infer-graphics-programs-from-hand
Repo https://github.com/azarafrooz/LSTM-program-synthesis
Framework none
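
The intermediate representation is a trace of drawing primitives proposed by the CNN, which program synthesis then compresses into a program with loops and conditionals. A small sketch of such a trace and of checking a loop hypothesis against it; the primitive set and names are illustrative, and the real synthesizer searches over programs rather than verifying a single guess:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Line:
    start: Tuple[int, int]
    end: Tuple[int, int]

@dataclass(frozen=True)
class Circle:
    center: Tuple[int, int]
    radius: int

# A trace the network might propose for a hand drawing: three evenly
# spaced circles.  Program synthesis would compress it into a loop such as
# "for i in range(3): circle(center=(2 + 4*i, 2), radius=1)".
trace: List[object] = [Circle((2, 2), 1), Circle((6, 2), 1), Circle((10, 2), 1)]

def loop_hypothesis(i: int) -> Circle:
    """Candidate loop body; reproducing the trace validates the hypothesis."""
    return Circle((2 + 4 * i, 2), 1)

print(all(loop_hypothesis(i) == trace[i] for i in range(3)))   # True
```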

Sensitivity Analysis for Mirror-Stratifiable Convex Functions

Title Sensitivity Analysis for Mirror-Stratifiable Convex Functions
Authors Jalal Fadili, Jérôme Malick, Gabriel Peyré
Abstract This paper provides a set of sensitivity analysis and activity identification results for a class of convex functions with a strong geometric structure, that we coined “mirror-stratifiable”. These functions are such that there is a bijection between a primal and a dual stratification of the space into partitioning sets, called strata. This pairing is crucial to track the strata that are identifiable by solutions of parametrized optimization problems or by iterates of optimization algorithms. This class of functions encompasses all regularizers routinely used in signal and image processing, machine learning, and statistics. We show that this “mirror-stratifiable” structure enjoys a nice sensitivity theory, allowing us to study stability of solutions of optimization problems to small perturbations, as well as activity identification of first-order proximal splitting-type algorithms. Existing results in the literature typically assume that, under a non-degeneracy condition, the active set associated to a minimizer is stable to small perturbations and is identified in finite time by optimization schemes. In contrast, our results do not require any non-degeneracy assumption: in consequence, the optimal active set is not necessarily stable anymore, but we are able to track precisely the set of identifiable strata. We show that these results have crucial implications when solving challenging ill-posed inverse problems via regularization, a typical scenario where the non-degeneracy condition is not fulfilled. Our theoretical results, illustrated by numerical simulations, allow us to characterize the instability behaviour of the regularized solutions, by locating the set of all low-dimensional strata that can be potentially identified by these solutions.
Tasks
Published 2017-07-11
URL http://arxiv.org/abs/1707.03194v3
PDF http://arxiv.org/pdf/1707.03194v3.pdf
PWC https://paperswithcode.com/paper/sensitivity-analysis-for-mirror-stratifiable
Repo https://github.com/gpeyre/2017-SIOPT-stratification
Framework none
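
A concrete instance of the activity-identification phenomenon studied here is the lasso solved by a proximal splitting method, where the support of the iterates (the low-dimensional stratum they lie in) can be tracked along the iterations. A small ISTA sketch illustrating that tracking, not the paper's theory; problem sizes, step size and regularisation weight are illustrative:

```python
import numpy as np

def ista_lasso(A, b, lam, iters=500):
    """Proximal gradient (ISTA) for min 0.5*||Ax-b||^2 + lam*||x||_1,
    recording the support (active stratum) of each iterate."""
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    supports = []
    for _ in range(iters):
        g = A.T @ (A @ x - b)
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
        supports.append(tuple(np.nonzero(x)[0]))                # stratum of the iterate
    return x, supports

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 58]] = [1.0, -2.0, 1.5]
b = A @ x_true + 0.01 * rng.normal(size=40)
x_hat, supports = ista_lasso(A, b, lam=0.1)
print(supports[-1])     # the support typically stabilises after enough iterations
```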

FaceBoxes: A CPU Real-time Face Detector with High Accuracy

Title FaceBoxes: A CPU Real-time Face Detector with High Accuracy
Authors Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, Stan Z. Li
Abstract Although tremendous strides have been made in face detection, one of the remaining open challenges is to achieve real-time speed on the CPU as well as maintain high performance, since effective models for face detection tend to be computationally prohibitive. To address this challenge, we propose a novel face detector, named FaceBoxes, with superior performance on both speed and accuracy. Specifically, our method has a lightweight yet powerful network structure that consists of the Rapidly Digested Convolutional Layers (RDCL) and the Multiple Scale Convolutional Layers (MSCL). The RDCL is designed to enable FaceBoxes to achieve real-time speed on the CPU. The MSCL aims at enriching the receptive fields and discretizing anchors over different layers to handle faces of various scales. Besides, we propose a new anchor densification strategy to make different types of anchors have the same density on the image, which significantly improves the recall rate of small faces. As a consequence, the proposed detector runs at 20 FPS on a single CPU core and 125 FPS using a GPU for VGA-resolution images. Moreover, the speed of FaceBoxes is invariant to the number of faces. We comprehensively evaluate this method and present state-of-the-art detection performance on several face detection benchmark datasets, including the AFW, PASCAL face, and FDDB. Code is available at https://github.com/sfzhang15/FaceBoxes
Tasks Face Detection
Published 2017-08-17
URL http://arxiv.org/abs/1708.05234v4
PDF http://arxiv.org/pdf/1708.05234v4.pdf
PWC https://paperswithcode.com/paper/faceboxes-a-cpu-real-time-face-detector-with
Repo https://github.com/XiaXuehai/faceboxes
Framework pytorch
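
The anchor densification strategy tiles extra copies of a small anchor at sub-offsets within its stride so that small and large anchors end up with the same density on the image. A rough sketch of generating densified anchor centers for one feature-map cell (the specific densification factors used per anchor size in the paper are not reproduced here):

```python
import numpy as np

def densified_centers(cell_x, cell_y, stride, n):
    """Return n*n anchor centers inside one feature-map cell.

    n = 1 keeps the usual single center; larger n densifies small anchors
    so their density matches that of larger anchors on coarser layers.
    """
    offsets = (np.arange(n) + 0.5) / n * stride        # evenly spaced within the cell
    xs = cell_x * stride + offsets
    ys = cell_y * stride + offsets
    return [(x, y) for y in ys for x in xs]

print(densified_centers(0, 0, stride=32, n=1))   # 1 center
print(densified_centers(0, 0, stride=32, n=2))   # 4 centers for a small anchor
```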

Depth Super-Resolution Meets Uncalibrated Photometric Stereo

Title Depth Super-Resolution Meets Uncalibrated Photometric Stereo
Authors Songyou Peng, Bjoern Haefner, Yvain Quéau, Daniel Cremers
Abstract A novel depth super-resolution approach for RGB-D sensors is presented. It disambiguates depth super-resolution through high-resolution photometric clues and, symmetrically, it disambiguates uncalibrated photometric stereo through low-resolution depth cues. To this end, an RGB-D sequence is acquired from the same viewing angle, while illuminating the scene from various uncalibrated directions. This sequence is handled by a variational framework which fits high-resolution shape and reflectance, as well as lighting, to both the low-resolution depth measurements and the high-resolution RGB ones. The key novelty consists in a new PDE-based photometric stereo regularizer which implicitly ensures surface regularity. This makes it possible to carry out depth super-resolution in a purely data-driven manner, without the need for any ad-hoc prior or material calibration. Real-world experiments are carried out using an out-of-the-box RGB-D sensor and a hand-held LED light source.
Tasks Calibration, Super-Resolution
Published 2017-08-01
URL http://arxiv.org/abs/1708.00411v2
PDF http://arxiv.org/pdf/1708.00411v2.pdf
PWC https://paperswithcode.com/paper/depth-super-resolution-meets-uncalibrated
Repo https://github.com/pengsongyou/SRmeetsPS
Framework none
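
As background for the photometric clues the method exploits: under a Lambertian model with known lighting, per-pixel normals and albedo follow from a simple least-squares fit to the image stack. The sketch below shows only that classical calibrated step; the paper's contribution is the uncalibrated, PDE-regularised variational formulation, which is not reproduced here:

```python
import numpy as np

def lambertian_normals(intensities, light_dirs):
    """Least-squares photometric stereo for a stack of pixels.

    intensities : (K, P) observed brightness of P pixels under K lights
    light_dirs  : (K, 3) known (calibrated) lighting directions
    Returns unit normals (P, 3) and albedo (P,).
    """
    # Solve light_dirs @ m = intensities in the least-squares sense.
    m, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)   # (3, P)
    albedo = np.linalg.norm(m, axis=0)
    normals = (m / (albedo + 1e-8)).T
    return normals, albedo
```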

Unsupervised Learning of Depth and Ego-Motion from Video

Title Unsupervised Learning of Depth and Ego-Motion from Video
Authors Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe
Abstract We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. We achieve this by simultaneously training depth and camera pose estimation networks using the task of view synthesis as the supervisory signal. The networks are thus coupled via the view synthesis objective during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performing comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performing favorably with established SLAM systems under comparable input settings.
Tasks Depth And Camera Motion, Depth Estimation, Motion Estimation, Pose Estimation
Published 2017-04-25
URL http://arxiv.org/abs/1704.07813v2
PDF http://arxiv.org/pdf/1704.07813v2.pdf
PWC https://paperswithcode.com/paper/unsupervised-learning-of-depth-and-ego-motion-1
Repo https://github.com/ClementPinard/SfmLearner-Pytorch
Framework pytorch
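
The supervisory signal is view synthesis: target-frame pixels are reprojected into a source frame using the predicted depth and relative pose, and the photometric difference trains both networks. A compact numpy sketch of that reprojection and loss, using nearest-neighbour sampling instead of the differentiable bilinear sampler used in practice (shapes and names are illustrative):

```python
import numpy as np

def view_synthesis_loss(target, source, depth, K, T_t2s):
    """Photometric loss between the target image and the source image
    warped into the target view via predicted depth and relative pose.

    target, source : (H, W) grayscale images
    depth          : (H, W) predicted depth of the target frame
    K              : (3, 3) camera intrinsics
    T_t2s          : (4, 4) relative pose from target to source camera
    """
    H, W = target.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, HW)
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # back-project
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src_cam = (T_t2s @ cam_h)[:3]                                       # into source frame
    src_pix = K @ src_cam
    su = np.round(src_pix[0] / src_pix[2]).astype(int)
    sv = np.round(src_pix[1] / src_pix[2]).astype(int)
    valid = (su >= 0) & (su < W) & (sv >= 0) & (sv < H) & (src_pix[2] > 0)
    warped = np.zeros(H * W)
    warped[valid] = source[sv[valid], su[valid]]                        # NN sampling
    err = np.abs(warped - target.reshape(-1))
    return float(err[valid].mean())
```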