January 28, 2020

3077 words 15 mins read

Paper Group ANR 1009


Teaching Pretrained Models with Commonsense Reasoning: A Preliminary KB-Based Approach. Learning Effective Visual Relationship Detector on 1 GPU. Geometric Pose Affordance: 3D Human Pose with Scene Constraints. Vision-Based Lane-Changing Behavior Detection Using Deep Residual Neural Network. Silhouette-Net: 3D Hand Pose Estimation from Silhouettes. …

Teaching Pretrained Models with Commonsense Reasoning: A Preliminary KB-Based Approach

Title Teaching Pretrained Models with Commonsense Reasoning: A Preliminary KB-Based Approach
Authors Shiyang Li, Jianshu Chen, Dian Yu
Abstract Recently, pretrained language models (e.g., BERT) have achieved great success on many downstream natural language understanding tasks and exhibit a certain level of commonsense reasoning ability. However, their performance on commonsense tasks is still far from that of humans. As a preliminary attempt, we propose a simple yet effective method to teach pretrained models commonsense reasoning by leveraging the structured knowledge in ConceptNet, the largest commonsense knowledge base (KB). Specifically, the structured knowledge in the KB allows us to construct various logical forms and then generate multiple-choice questions that require commonsense logical reasoning. Experimental results demonstrate that, when refined on these training examples, the pretrained models consistently improve their performance on tasks that require commonsense reasoning, especially in the few-shot learning setting. In addition, we analyze which logical relations are most relevant to commonsense reasoning.
Tasks Few-Shot Learning
Published 2019-09-20
URL https://arxiv.org/abs/1909.09743v1
PDF https://arxiv.org/pdf/1909.09743v1.pdf
PWC https://paperswithcode.com/paper/190909743
Repo
Framework
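
No repository is listed for this entry, but the KB-to-question construction the abstract describes is straightforward to illustrate. Below is a minimal sketch, assuming a (head, relation, tail) triple format; the question templates and distractor-sampling scheme are illustrative assumptions, not the authors' exact procedure.

```python
import random

# Turn a ConceptNet-style triple into a multiple-choice question.
# Templates and distractor sampling are illustrative assumptions.
TEMPLATES = {
    "UsedFor": "What is a {head} used for?",
    "AtLocation": "Where would you find a {head}?",
}

def make_question(triple, tails_by_relation, num_distractors=3, seed=0):
    """Build one multiple-choice item from a (head, relation, tail) triple."""
    head, relation, tail = triple
    rng = random.Random(seed)
    # Distractors: tail entities of *other* triples with the same relation.
    pool = [t for t in tails_by_relation[relation] if t != tail]
    choices = rng.sample(pool, num_distractors) + [tail]
    rng.shuffle(choices)
    return {"question": TEMPLATES[relation].format(head=head),
            "choices": choices,
            "answer": choices.index(tail)}

triples = [("hammer", "UsedFor", "driving nails"),
           ("pen", "UsedFor", "writing"),
           ("broom", "UsedFor", "sweeping"),
           ("oven", "UsedFor", "baking"),
           ("towel", "UsedFor", "drying things")]
tails = {"UsedFor": [t for _, _, t in triples]}
print(make_question(triples[0], tails))
```

Each generated item can then be formatted as a standard multiple-choice example for refining the pretrained model.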

Learning Effective Visual Relationship Detector on 1 GPU

Title Learning Effective Visual Relationship Detector on 1 GPU
Authors Yichao Lu, Cheng Chang, Himanshu Rai, Guangwei Yu, Maksims Volkovs
Abstract We present our winning solution to the Open Images 2019 Visual Relationship challenge. This is the largest challenge of its kind to date, with nearly 9 million training images. The challenge task consists of detecting objects and identifying relationships between them in complex scenes. Our solution has three stages: first, an object detection model is fine-tuned for the challenge classes using a novel weight transfer approach. Then, spatio-semantic and visual relationship models are trained on candidate object pairs. Finally, features and model predictions are combined to generate the final relationship prediction. Throughout the challenge we focused on minimizing the hardware requirements of our architecture. Specifically, our weight transfer approach enables much faster optimization, allowing the entire architecture to be trained on a single GPU in under two days. In addition to efficient optimization, our approach also achieves superior accuracy, winning first place out of over 200 teams and outperforming the second-place team by over 5% on the held-out private leaderboard.
Tasks Object Detection
Published 2019-12-12
URL https://arxiv.org/abs/1912.06185v1
PDF https://arxiv.org/pdf/1912.06185v1.pdf
PWC https://paperswithcode.com/paper/learning-effective-visual-relationship
Repo
Framework
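
The abstract does not spell out the weight transfer approach, so the following is only one plausible reading, sketched as an assumption: initialize the classification head for the challenge classes from the weights of semantically related classes in a source detector, so fine-tuning starts near a good solution. The class mapping and dimensions here are hypothetical.

```python
import torch
import torch.nn as nn

def transfer_head(source_head: nn.Linear, mapping: dict, num_new: int) -> nn.Linear:
    """Copy rows of a source classification head into a new head.

    mapping: new_class_index -> source_class_index (hypothetical pairs).
    """
    new_head = nn.Linear(source_head.in_features, num_new)
    with torch.no_grad():
        for new_idx, src_idx in mapping.items():
            new_head.weight[new_idx] = source_head.weight[src_idx]
            new_head.bias[new_idx] = source_head.bias[src_idx]
    return new_head

source = nn.Linear(1024, 601)            # e.g. a detector head over 601 classes
head = transfer_head(source, {0: 12, 1: 45}, num_new=57)
print(head.weight.shape)                 # torch.Size([57, 1024])
```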

Geometric Pose Affordance: 3D Human Pose with Scene Constraints

Title Geometric Pose Affordance: 3D Human Pose with Scene Constraints
Authors Zhe Wang, Liyan Chen, Shaurya Rathore, Daeyun Shin, Charless Fowlkes
Abstract Full 3D estimation of human pose from a single image remains a challenging task despite many recent advances. In this paper, we explore the hypothesis that strong prior information about scene geometry can be used to improve pose estimation accuracy. To tackle this question empirically, we have assembled a novel Geometric Pose Affordance dataset, consisting of multi-view imagery of people interacting with a variety of rich 3D environments. We utilized a commercial motion capture system to collect gold-standard estimates of pose and construct accurate geometric 3D CAD models of the scene itself. To inject prior knowledge of scene constraints into existing frameworks for pose estimation from images, we introduce a novel, view-based representation of scene geometry, a multi-layer depth map, which employs multi-hit ray tracing to concisely encode multiple surface entry and exit points along each camera view ray. We propose two different mechanisms for integrating multi-layer depth information into pose estimation: first as encoded ray features used in lifting 2D pose to full 3D, and second as a differentiable loss that encourages learned models to favor geometrically consistent pose estimates. We show experimentally that these techniques can improve the accuracy of 3D pose estimates, particularly in the presence of occlusion and complex scene geometry.
Tasks 3D Human Pose Estimation, Motion Capture, Pose Estimation
Published 2019-05-19
URL https://arxiv.org/abs/1905.07718v1
PDF https://arxiv.org/pdf/1905.07718v1.pdf
PWC https://paperswithcode.com/paper/geometric-pose-affordance-3d-human-pose-with
Repo
Framework
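
The differentiable scene-constraint loss can be sketched from the abstract's description of the multi-layer depth map: each pixel stores alternating surface entry/exit depths along its view ray, and a predicted joint whose depth falls strictly inside an occupied interval penetrates scene geometry. The sketch below assumes that layout and a simple hinge penalty; it is not the paper's exact loss.

```python
import torch

def inside_geometry_loss(joint_uv, joint_depth, layers):
    """Penalize joints that penetrate scene surfaces.

    joint_uv: (J, 2) integer pixel coordinates; joint_depth: (J,);
    layers: (H, W, 2K) alternating entry/exit depths per pixel (assumed layout).
    """
    d = layers[joint_uv[:, 1], joint_uv[:, 0]]      # (J, 2K) depths at each joint's pixel
    entry, exit = d[:, 0::2], d[:, 1::2]            # (J, K) surface intervals
    z = joint_depth.unsqueeze(1)                    # (J, 1)
    # Positive only when z lies inside an (entry, exit) occupied interval.
    penetration = torch.minimum(z - entry, exit - z)
    return torch.clamp(penetration, min=0).sum()

layers = torch.tensor([[[1.0, 2.0, 4.0, 5.0]]])     # 1x1 image, two occupied intervals
uv = torch.tensor([[0, 0]])
print(inside_geometry_loss(uv, torch.tensor([1.5]), layers))  # inside -> tensor(0.5)
print(inside_geometry_loss(uv, torch.tensor([3.0]), layers))  # free space -> tensor(0.)
```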

Vision-Based Lane-Changing Behavior Detection Using Deep Residual Neural Network

Title Vision-Based Lane-Changing Behavior Detection Using Deep Residual Neural Network
Authors Zhensong Wei, Chao Wang, Peng Hao, Matthew Barth
Abstract Accurate lane localization and lane change detection are crucial in advanced driver assistance systems and autonomous driving systems for safer and more efficient trajectory planning. Conventional localization devices such as the Global Positioning System provide only road-level resolution for car navigation, which is insufficient for lane-level decision making. The state-of-the-art technique for lane localization uses Light Detection and Ranging (LiDAR) sensors to correct the global localization error and achieve centimeter-level accuracy, but real-time implementation and popularization of LiDAR are still limited by its computational burden and current cost. As a cost-effective alternative, vision-based lane change detection has been highly regarded for affordable autonomous vehicles to support lane-level localization. We develop a deep learning-based computer vision system that detects lane change behavior using images captured by a front-view camera mounted on the vehicle and data from the inertial measurement unit during highway driving. Testing results on real-world driving data show that the proposed method is robust, runs in real time, and achieves around 87% lane change detection accuracy. Compared to the average human reaction to visual stimuli, the proposed computer vision system works 9 times faster, making it capable of helping make life-saving decisions in time.
Tasks Autonomous Driving, Autonomous Vehicles, Decision Making
Published 2019-11-08
URL https://arxiv.org/abs/1911.03565v1
PDF https://arxiv.org/pdf/1911.03565v1.pdf
PWC https://paperswithcode.com/paper/vision-based-lane-changing-behavior-detection
Repo
Framework
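
As a rough sketch, not the paper's exact architecture, a vision-plus-IMU lane-change classifier can fuse a deep residual image embedding with IMU features; the label set (lane keeping / left change / right change), the ResNet-18 backbone, and the feature dimensions below are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LaneChangeNet(nn.Module):
    """Fuse a front-camera frame with IMU readings (illustrative sketch)."""
    def __init__(self, imu_dim=6, num_classes=3):
        super().__init__()
        backbone = models.resnet18(weights=None)   # deep residual backbone
        backbone.fc = nn.Identity()                # expose the 512-d embedding
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(512 + imu_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes))

    def forward(self, image, imu):
        return self.head(torch.cat([self.backbone(image), imu], dim=1))

model = LaneChangeNet()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 6))
print(logits.shape)                                # torch.Size([2, 3])
```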

Silhouette-Net: 3D Hand Pose Estimation from Silhouettes

Title Silhouette-Net: 3D Hand Pose Estimation from Silhouettes
Authors Kuo-Wei Lee, Shih-Hung Liu, Hwann-Tzong Chen, Koichi Ito
Abstract 3D hand pose estimation has received a lot of attention for its wide range of applications and has made great progress owing to the development of deep learning. Existing approaches mainly consider different input modalities and settings, such as monocular RGB, multi-view RGB, depth, or point cloud, to provide sufficient cues for resolving variations caused by self-occlusion and viewpoint change. In contrast, this work addresses the less-explored idea of using minimal information to estimate 3D hand poses. We present a new architecture that automatically learns guidance from implicit depth perception and resolves the ambiguity of hand pose through end-to-end training. The experimental results show that 3D hand poses can be accurately estimated from hand silhouettes alone, without using depth maps. Extensive evaluations on the 2017 Hands In the Million Challenge (HIM2017) benchmark dataset further demonstrate that our method achieves comparable or even better performance than recent depth-based approaches and sets the state of the art for estimating 3D hand poses from silhouettes.
Tasks Hand Pose Estimation, Pose Estimation
Published 2019-12-28
URL https://arxiv.org/abs/1912.12436v1
PDF https://arxiv.org/pdf/1912.12436v1.pdf
PWC https://paperswithcode.com/paper/silhouette-net-3d-hand-pose-estimation-from
Repo
Framework

Multi-Person 3D Human Pose Estimation from Monocular Images

Title Multi-Person 3D Human Pose Estimation from Monocular Images
Authors Rishabh Dabral, Nitesh B Gundavarapu, Rahul Mitra, Abhishek Sharma, Ganesh Ramakrishnan, Arjun Jain
Abstract Multi-person 3D human pose estimation from a single image is a challenging problem, especially for in-the-wild settings, due to the lack of 3D annotated data. We propose HG-RCNN, a Mask-RCNN-based network that also leverages the benefits of the Hourglass architecture for multi-person 3D human pose estimation. A two-staged approach is presented that first estimates the 2D keypoints in every region of interest (RoI) and then lifts the estimated keypoints to 3D. Finally, the estimated 3D poses are placed in camera coordinates using a weak-perspective projection assumption and joint optimization of focal length and root translations. The result is a simple and modular network for multi-person 3D human pose estimation that does not require any multi-person 3D pose dataset. Despite its simple formulation, HG-RCNN achieves state-of-the-art results on MuPoTS-3D while also approximating the 3D pose in the camera coordinate system.
Tasks 3D Human Pose Estimation, Pose Estimation
Published 2019-09-24
URL https://arxiv.org/abs/1909.10854v1
PDF https://arxiv.org/pdf/1909.10854v1.pdf
PWC https://paperswithcode.com/paper/multi-person-3d-human-pose-estimation-from
Repo
Framework
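
The camera-placement step can be viewed as a small optimization problem: jointly fit a shared focal length and a per-person root translation so that the weak-perspective projection of each root-relative 3D pose matches its 2D keypoints. The sketch below assumes centered pixel coordinates and illustrative initial values; it is a reading of the abstract, not the authors' code.

```python
import torch

def place_in_camera(pose3d, pose2d, steps=500, lr=0.01):
    """pose3d: (P, J, 3) root-relative poses; pose2d: (P, J, 2) centered pixels."""
    num_people = pose3d.shape[0]
    f = torch.tensor(1000.0, requires_grad=True)              # focal length in pixels
    t = torch.tensor([[0.0, 0.0, 5.0]]).repeat(num_people, 1).requires_grad_(True)
    opt = torch.optim.Adam([f, t], lr=lr)
    for _ in range(steps):
        cam = pose3d + t[:, None, :]                          # place in camera coords
        # Weak perspective: divide by each person's root depth, not per-joint depth.
        proj = f * cam[..., :2] / t[:, None, 2:3]
        loss = ((proj - pose2d) ** 2).mean()                  # reprojection error
        opt.zero_grad(); loss.backward(); opt.step()
    return f.detach(), t.detach()
```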

An End-to-end Framework for Unconstrained Monocular 3D Hand Pose Estimation

Title An End-to-end Framework for Unconstrained Monocular 3D Hand Pose Estimation
Authors Sanjeev Sharma, Shaoli Huang, Dacheng Tao
Abstract This work addresses the challenging problem of unconstrained 3D hand pose estimation using monocular RGB images. Most existing approaches assume that some prior knowledge of the hand (such as hand location and side information) is available for 3D hand pose estimation, which restricts their use in unconstrained environments. We therefore present an end-to-end framework that robustly predicts hand prior information and accurately infers 3D hand pose by learning ConvNet models while using only keypoint annotations. To achieve robustness, the proposed framework uses a novel keypoint-based method to simultaneously predict hand regions and side labels, unlike existing methods that suffer from background color confusion caused by segmentation- or detection-based techniques. Moreover, inspired by the biological structure of the human hand, we introduce two geometric constraints directly into the 3D coordinate prediction, which further improves performance under weakly-supervised training. Experimental results show that our proposed framework not only performs robustly in unconstrained settings but also outperforms state-of-the-art methods on standard benchmark datasets.
Tasks Hand Pose Estimation, Pose Estimation
Published 2019-11-28
URL https://arxiv.org/abs/1911.12501v1
PDF https://arxiv.org/pdf/1911.12501v1.pdf
PWC https://paperswithcode.com/paper/an-end-to-end-framework-for-unconstrained
Repo
Framework
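
The abstract mentions two geometric constraints on the 3D coordinate prediction but does not spell them out. As one hypothetical example of such a constraint, the sketch below penalizes deviations of predicted bone lengths from a canonical hand skeleton; the bone list and lengths are illustrative, not the paper's.

```python
import torch

BONES = [(0, 1), (1, 2), (2, 3)]             # one finger chain (hypothetical indexing)
CANONICAL = torch.tensor([4.0, 2.5, 2.0])    # canonical bone lengths in cm (assumed)

def bone_length_loss(joints3d):
    """joints3d: (B, J, 3) predicted 3D joint positions."""
    lengths = torch.stack(
        [(joints3d[:, a] - joints3d[:, b]).norm(dim=-1) for a, b in BONES],
        dim=1)                               # (B, num_bones)
    return ((lengths - CANONICAL) ** 2).mean()

print(bone_length_loss(torch.randn(2, 4, 3)))
```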

All Roads Lead to UD: Converting Stanford and Penn Parses to English Universal Dependencies with Multilayer Annotations

Title All Roads Lead to UD: Converting Stanford and Penn Parses to English Universal Dependencies with Multilayer Annotations
Authors Siyao Peng, Amir Zeldes
Abstract We describe and evaluate different approaches to the conversion of gold standard corpus data from Stanford Typed Dependencies (SD) and Penn-style constituent trees to the latest English Universal Dependencies representation (UD 2.2). Our results indicate that pure SD to UD conversion is highly accurate across multiple genres, resulting in around 1.5% errors, but can be improved further to fewer than 0.5% errors given access to annotations beyond the pure syntax tree, such as entity types and coreference resolution, which are necessary for correct generation of several UD relations. We show that constituent-based conversion using CoreNLP (with automatic NER) performs substantially worse in all genres, including when using gold constituent trees, primarily due to underspecification of phrasal grammatical functions.
Tasks Coreference Resolution
Published 2019-09-02
URL https://arxiv.org/abs/1909.00522v1
PDF https://arxiv.org/pdf/1909.00522v1.pdf
PWC https://paperswithcode.com/paper/all-roads-lead-to-ud-converting-stanford-and-1
Repo
Framework
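
The core of pure SD-to-UD conversion is rule-based relabeling (plus attachment changes). The toy mapping below, not the authors' converter, shows the flavor: most labels convert one-to-one, while relations that need information beyond the syntax tree are exactly where the paper brings in entity and coreference annotations.

```python
# A few representative Stanford Dependencies -> UD 2.2 label mappings.
SD_TO_UD = {
    "nsubjpass": "nsubj:pass",
    "auxpass": "aux:pass",
    "dobj": "obj",
    "poss": "nmod:poss",
    "prep": "case",   # in full conversion the attachment also moves to the noun
}

def convert_label(sd_label: str) -> str:
    """Relabel an SD relation, passing through labels unchanged in UD."""
    return SD_TO_UD.get(sd_label, sd_label)

print(convert_label("dobj"))    # obj
print(convert_label("nsubj"))   # nsubj (unchanged)
```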

Sentiment Analysis from Images of Natural Disasters

Title Sentiment Analysis from Images of Natural Disasters
Authors Syed Zohaib, Kashif Ahmad, Nicola Conci, Ala Al-Fuqaha
Abstract Social media have been widely exploited to detect and gather relevant information about opinions and events. However, the relevance of the information is very subjective and depends on the application and the end-users. In this article, we tackle a specific facet of social media data processing, namely the sentiment analysis of disaster-related images, by considering people's opinions, attitudes, feelings, and emotions. We analyze how visual sentiment analysis can improve the results for the end-users/beneficiaries in terms of mining information from social media. We also identify the challenges and related applications, which could help define a benchmark for future research efforts in visual sentiment analysis.
Tasks Sentiment Analysis
Published 2019-10-10
URL https://arxiv.org/abs/1910.04416v1
PDF https://arxiv.org/pdf/1910.04416v1.pdf
PWC https://paperswithcode.com/paper/sentiment-analysis-from-images-of-natural
Repo
Framework

Fast and Accurate 3D Hand Pose Estimation via Recurrent Neural Network for Capturing Hand Articulations

Title Fast and Accurate 3D Hand Pose Estimation via Recurrent Neural Network for Capturing Hand Articulations
Authors Cheol-hwan Yoo, Seo-won Ji, Yong-goo Shin, Seung-wook Kim, Sung-jea Ko
Abstract 3D hand pose estimation from a single depth image plays an important role in computer vision and human-computer interaction. Although recent hand pose estimation methods using convolutional neural networks (CNNs) have shown notable improvements in accuracy, most of them rely on a complex network structure without fully exploiting the articulated structure of the hand. A hand is an articulated object composed of six local parts: the palm and five independent fingers. Each finger consists of sequential joints whose constrained motion forms a kinematic chain. In this paper, we propose a hierarchically-structured convolutional recurrent neural network (HCRNN) with six branches that estimate the 3D positions of the palm and the five fingers independently. The palm position is predicted via fully-connected layers. Each finger's sequential joint positions are obtained using a recurrent neural network (RNN) to capture the spatial dependencies between adjacent joints. The output features of the palm and finger branches are then concatenated to estimate the global hand position. HCRNN takes the depth map directly as input, without time-consuming data conversions such as 3D voxels or point clouds. Experimental results on public datasets demonstrate that the proposed HCRNN not only outperforms most 2D CNN-based methods that use the depth image as input but also achieves results competitive with state-of-the-art 3D CNN-based methods, at a highly efficient running speed of 285 fps on a single GPU.
Tasks Hand Pose Estimation, Pose Estimation
Published 2019-11-18
URL https://arxiv.org/abs/1911.07424v2
PDF https://arxiv.org/pdf/1911.07424v2.pdf
PWC https://paperswithcode.com/paper/capturing-hand-articulations-using-recurrent
Repo
Framework
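
The branch layout described in the abstract can be condensed into a short sketch: one fully-connected branch regresses the palm, and each of five finger branches runs an RNN along that finger's kinematic chain. The shared feature encoder, hidden sizes, and the way features are fed to the RNN are assumptions.

```python
import torch
import torch.nn as nn

class HCRNNSketch(nn.Module):
    """Palm FC branch + five per-finger GRU branches (illustrative sketch)."""
    def __init__(self, feat_dim=512, joints_per_finger=3):
        super().__init__()
        self.palm = nn.Linear(feat_dim, 3)
        self.fingers = nn.ModuleList(
            nn.GRU(feat_dim, 64, batch_first=True) for _ in range(5))
        self.joint_out = nn.Linear(64, 3)
        self.steps = joints_per_finger

    def forward(self, feat):                       # feat: (B, feat_dim) from a CNN
        palm = self.palm(feat)                     # (B, 3)
        seq = feat.unsqueeze(1).repeat(1, self.steps, 1)
        fingers = [self.joint_out(gru(seq)[0])     # one joint per RNN step
                   for gru in self.fingers]
        return palm, torch.stack(fingers, dim=1)   # (B, 3), (B, 5, steps, 3)

palm, fingers = HCRNNSketch()(torch.randn(2, 512))
print(palm.shape, fingers.shape)   # torch.Size([2, 3]) torch.Size([2, 5, 3, 3])
```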

Machine Learning Methods Economists Should Know About

Title Machine Learning Methods Economists Should Know About
Authors Susan Athey, Guido Imbens
Abstract We discuss the relevance of the recent Machine Learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the machine learning literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, as well as matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics, methods that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, problems that include causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.
Tasks Causal Inference, Matrix Completion
Published 2019-03-24
URL http://arxiv.org/abs/1903.10075v1
PDF http://arxiv.org/pdf/1903.10075v1.pdf
PWC https://paperswithcode.com/paper/machine-learning-methods-economists-should
Repo
Framework

Provably Efficient Reinforcement Learning with Linear Function Approximation

Title Provably Efficient Reinforcement Learning with Linear Function Approximation
Authors Chi Jin, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan
Abstract Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a “simulator” or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)—a classical algorithm frequently studied in the linear setting—achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
Tasks
Published 2019-07-11
URL https://arxiv.org/abs/1907.05388v2
PDF https://arxiv.org/pdf/1907.05388v2.pdf
PWC https://paperswithcode.com/paper/provably-efficient-reinforcement-learning
Repo
Framework
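
The optimistic value estimate at the heart of the algorithm is concrete enough to sketch: a ridge-regression fit of the value targets plus an elliptical bonus $\beta\sqrt{\phi^\top\Lambda^{-1}\phi}$, truncated at $H$. The data below are synthetic placeholders.

```python
import numpy as np

d, H, beta, lam = 4, 10, 1.0, 1.0
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, d))        # features of visited (s, a) pairs
y = rng.standard_normal(100)               # targets r + max_a Q_{h+1}(s', a)
Lambda = Phi.T @ Phi + lam * np.eye(d)     # regularized Gram matrix
w = np.linalg.solve(Lambda, Phi.T @ y)     # least-squares value iteration step

def optimistic_q(phi):
    """Optimistic Q-value: linear estimate + exploration bonus, capped at H."""
    bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return min(phi @ w + bonus, H)

print(optimistic_q(rng.standard_normal(d)))
```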

Mask Mining for Improved Liver Lesion Segmentation

Title Mask Mining for Improved Liver Lesion Segmentation
Authors Karsten Roth, Jürgen Hesser, Tomasz Konopczyński
Abstract We propose a novel procedure to improve liver and lesion segmentation from CT scans for U-Net based models. Our method extends standard segmentation pipelines to focus on higher target recall or reduction of noisy false-positive predictions, boosting overall segmentation performance. To achieve this, we include segmentation errors in a new learning stage appended to the main training setup, allowing the model to find features that explain away previous errors. We evaluate this on semantically distinct architectures: cascaded two- and three-dimensional setups as well as combined learning setups for multitask segmentation. Using liver and lesion segmentation data from the Liver Tumor Segmentation challenge (LiTS), our method increases the Dice score by up to 2 points.
Tasks Lesion Segmentation
Published 2019-08-14
URL https://arxiv.org/abs/1908.05062v4
PDF https://arxiv.org/pdf/1908.05062v4.pdf
PWC https://paperswithcode.com/paper/boosting-liver-and-lesion-segmentation-from
Repo
Framework
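
The abstract's error-mining idea can be sketched simply: after a first training stage, derive false-positive and false-negative masks from the model's predictions and feed them back as extra input channels in the appended learning stage, so the network can learn features that explain away its earlier errors. The array layout is an assumption.

```python
import numpy as np

def error_masks(pred, gt):
    """pred, gt: binary (H, W) masks from the first training stage."""
    fp = ((pred == 1) & (gt == 0)).astype(np.float32)   # noisy false positives
    fn = ((pred == 0) & (gt == 1)).astype(np.float32)   # missed targets (recall)
    return fp, fn

def augment_input(ct_slice, pred, gt):
    fp, fn = error_masks(pred, gt)
    return np.stack([ct_slice, fp, fn])                 # (3, H, W) stage-two input

x = augment_input(np.random.randn(64, 64),
                  np.random.randint(0, 2, (64, 64)),
                  np.random.randint(0, 2, (64, 64)))
print(x.shape)   # (3, 64, 64)
```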

In-field grape berries counting for yield estimation using dilated CNNs

Title In-field grape berries counting for yield estimation using dilated CNNs
Authors L. Coviello, M. Cristoforetti, G. Jurman, C. Furlanello
Abstract Digital technologies have ignited a revolution in the agrifood domain known as precision agriculture. A central question for enabling precision agriculture at scale is whether accurate product quality control can be made available at minimal cost, leveraging existing technologies and agronomists' skills. As a contribution in this direction, we demonstrate a tool for accurate fruit yield estimation from smartphone cameras, built by adapting deep learning algorithms originally developed for crowd counting.
Tasks Crowd Counting
Published 2019-09-26
URL https://arxiv.org/abs/1909.12083v1
PDF https://arxiv.org/pdf/1909.12083v1.pdf
PWC https://paperswithcode.com/paper/in-field-grape-berries-counting-for-yield
Repo
Framework
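
Crowd-counting networks of this kind typically regress a density map whose integral is the count, with dilated convolutions enlarging the receptive field without losing resolution. The head below is a generic sketch in that spirit (channel sizes and depth are assumptions; the paper's exact architecture is not specified in the abstract).

```python
import torch
import torch.nn as nn

head = nn.Sequential(                       # dilated density-map head
    nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.ReLU(),
    nn.Conv2d(64, 32, 3, padding=2, dilation=2), nn.ReLU(),
    nn.Conv2d(32, 1, 1))                    # single-channel density map

features = torch.randn(1, 64, 48, 48)       # backbone features (assumed shape)
density = head(features)
print(density.shape, density.sum().item())  # count = integral of the density map
```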

Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting

Title Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting
Authors Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He, Alexander Hauptmann
Abstract Tremendous variation in the scale of people and head sizes is a critical problem for crowd counting. To improve the scale invariance of feature representations, recent works extensively employ convolutional neural networks with multi-column structures to handle different scales and resolutions. However, due to substantial redundant parameters in the columns, existing multi-column networks invariably exhibit almost the same scale features in different columns, which severely affects counting accuracy and leads to overfitting. In this paper, we attack this problem by proposing a novel Multi-column Mutual Learning (McML) strategy. It has two main innovations: 1) A statistical network is incorporated into the multi-column framework to estimate the mutual information between columns, which approximately indicates the scale correlation between features from different columns. By minimizing the mutual information, each column is guided to learn features at different image scales. 2) We devise a mutual learning scheme that alternately optimizes each column while keeping the other columns fixed on each mini-batch of training data. With such an asynchronous parameter-update process, each column is inclined to learn a feature representation different from the others, which efficiently reduces parameter redundancy and improves generalization. More remarkably, McML can be applied to all existing multi-column networks and is end-to-end trainable. Extensive experiments on four challenging benchmarks show that McML can significantly improve the original multi-column networks and outperforms other state-of-the-art approaches.
Tasks Crowd Counting
Published 2019-09-17
URL https://arxiv.org/abs/1909.07608v1
PDF https://arxiv.org/pdf/1909.07608v1.pdf
PWC https://paperswithcode.com/paper/improving-the-learning-of-multi-column
Repo
Framework
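
The alternating optimization in McML can be sketched schematically: on each mini-batch, one column is updated while the others stay fixed, with an extra term pushing the active column's features away from the rest. The negative feature distance used below is a simple stand-in for the paper's mutual-information estimate from the statistical network, and the counting head is a placeholder.

```python
import torch
import torch.nn as nn

columns = nn.ModuleList(nn.Conv2d(3, 8, 3, padding=1) for _ in range(3))
opts = [torch.optim.SGD(c.parameters(), lr=0.01) for c in columns]

def train_step(x, target_count, active, alpha=0.1):
    feats = [c(x) for c in columns]
    pred = torch.stack(feats).mean(0).sum()   # placeholder density/count head
    task_loss = (pred - target_count) ** 2
    # Stand-in for minimizing mutual information: push the active column's
    # features away from the frozen columns' features.
    others = [f.detach() for i, f in enumerate(feats) if i != active]
    separation = -sum(((feats[active] - o) ** 2).mean() for o in others)
    loss = task_loss + alpha * separation
    opts[active].zero_grad(); loss.backward(); opts[active].step()
    return loss.item()

for step in range(6):                         # alternate over the three columns
    train_step(torch.randn(1, 3, 32, 32), torch.tensor(50.0), active=step % 3)
```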