April 1, 2020

3107 words 15 mins read

Paper Group ANR 461

Joint Unsupervised Learning of Optical Flow and Egomotion with Bi-Level Optimization. Semantic Flow for Fast and Accurate Scene Parsing. Restricted Structural Random Matrix for Compressive Sensing. Feasibility of Video-based Sub-meter Localization on Resource-constrained Platforms. Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model. Predic …

Joint Unsupervised Learning of Optical Flow and Egomotion with Bi-Level Optimization

Title Joint Unsupervised Learning of Optical Flow and Egomotion with Bi-Level Optimization
Authors Shihao Jiang, Dylan Campbell, Miaomiao Liu, Stephen Gould, Richard Hartley
Abstract We address the problem of joint optical flow and camera motion estimation in rigid scenes by incorporating geometric constraints into an unsupervised deep learning framework. Unlike existing approaches which rely on brightness constancy and local smoothness for optical flow estimation, we exploit the global relationship between optical flow and camera motion using epipolar geometry. In particular, we formulate the prediction of optical flow and camera motion as a bi-level optimization problem, consisting of an upper-level problem to estimate the flow that conforms to the predicted camera motion, and a lower-level problem to estimate the camera motion given the predicted optical flow. We use implicit differentiation to enable back-propagation through the lower-level geometric optimization layer independent of its implementation, allowing end-to-end training of the network. With globally-enforced geometric constraints, we are able to improve the quality of the estimated optical flow in challenging scenarios and obtain better camera motion estimates compared to other unsupervised learning methods.
Tasks Motion Estimation, Optical Flow Estimation
Published 2020-02-26
URL https://arxiv.org/abs/2002.11826v1
PDF https://arxiv.org/pdf/2002.11826v1.pdf
PWC https://paperswithcode.com/paper/joint-unsupervised-learning-of-optical-flow-1
Repo
Framework
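
The paper's key mechanism is differentiating through the lower-level geometric optimization via the implicit function theorem rather than unrolling the solver. Below is a minimal sketch of that pattern (not the authors' code): a toy least-squares problem stands in for their camera-motion estimator, and the backward pass differentiates the optimality condition, so the forward solver can be any black box.

```python
import torch

class LowerLevelSolver(torch.autograd.Function):
    """Solves y* = argmin_y 0.5 * ||A y - x||^2 in the forward pass and
    back-propagates through the stationarity condition A^T (A y* - x) = 0."""

    @staticmethod
    def forward(ctx, A, x):
        # Any black-box solver could sit here; its internals are never differentiated.
        y_star = torch.linalg.solve(A.T @ A, A.T @ x)
        ctx.save_for_backward(A, x, y_star)
        return y_star

    @staticmethod
    def backward(ctx, grad_y):
        A, x, y_star = ctx.saved_tensors
        # Implicit function theorem: dy*/dx = (A^T A)^{-1} A^T,
        # so grad_x = A (A^T A)^{-1} grad_y -- one linear solve, no unrolling.
        H = A.T @ A
        v = torch.linalg.solve(H, grad_y)
        return None, A @ v

A = torch.randn(10, 3)
x = torch.randn(10, requires_grad=True)
y = LowerLevelSolver.apply(A, x)
y.sum().backward()       # gradients flow through the solver to the upstream input
print(x.grad.shape)      # torch.Size([10])
```

This is what makes the lower-level layer "independent of its implementation": only the optimality condition enters the backward pass.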

Semantic Flow for Fast and Accurate Scene Parsing

Title Semantic Flow for Fast and Accurate Scene Parsing
Authors Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Yunhai Tong
Abstract In this paper, we focus on effective methods for fast and accurate scene parsing. A common practice to improve the performance is to attain high resolution feature maps with strong semantic representation. Two widely used strategies, atrous convolutions and feature pyramid fusion, are either computation intensive or ineffective. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels and broadcast high-level features to high resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure exhibits superior performance over other real-time methods even on very light-weight backbone networks, such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. In particular, our network is the first to achieve 80.4% mIoU on Cityscapes with a frame rate of 26 FPS. The code will be available at \url{https://github.com/donnyyou/torchcv}.
Tasks Optical Flow Estimation, Scene Parsing
Published 2020-02-24
URL https://arxiv.org/abs/2002.10120v1
PDF https://arxiv.org/pdf/2002.10120v1.pdf
PWC https://paperswithcode.com/paper/semantic-flow-for-fast-and-accurate-scene
Repo
Framework
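
One plausible reading of the Flow Alignment Module, sketched below from the abstract alone (the module design and layer sizes are our assumptions, not the released code): predict a 2-channel "semantic flow" field from the concatenated adjacent pyramid levels, then warp the upsampled high-level features onto the high-resolution grid with it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlignmentModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.flow_head = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, low_feat, high_feat):
        # low_feat: high-resolution, low-level; high_feat: low-resolution, semantic
        h, w = low_feat.shape[-2:]
        high_up = F.interpolate(high_feat, size=(h, w), mode='bilinear',
                                align_corners=False)
        flow = self.flow_head(torch.cat([low_feat, high_up], dim=1))

        # Build a sampling grid offset by the predicted flow, normalized to [-1, 1].
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
        base = torch.stack((xs, ys), dim=-1).float().to(flow.device)
        grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)
        gx = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
        return F.grid_sample(high_up, torch.stack((gx, gy), dim=-1),
                             align_corners=False)

fam = FlowAlignmentModule(channels=64)
aligned = fam(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 32, 32))
print(aligned.shape)  # torch.Size([1, 64, 64, 64])
```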

Restricted Structural Random Matrix for Compressive Sensing

Title Restricted Structural Random Matrix for Compressive Sensing
Authors Thuong Nguyen Canh, Byeungwoo Jeon
Abstract Compressive sensing (CS) is well-known for its unique functionalities of sensing, compressing, and security (i.e., CS measurements are equally important). However, there is a tradeoff: improving sensing and compressing efficiency with prior signal information tends to favor particular measurements, thus decreasing security. This work aims to improve sensing and compressing efficiency without compromising security, via a novel sampling matrix named the Restricted Structural Random Matrix (RSRM). RSRM unifies the advantages of frame-based and block-based sensing together with a global smoothness prior (i.e., low-resolution signals are highly correlated). RSRM acquires compressive measurements by random projection (equally important) of multiple randomly sub-sampled signals, restricted to low-resolution signals (equal in energy), so that its observations are equally important. RSRM is proven to satisfy the Restricted Isometry Property and shows reconstruction performance comparable to recent state-of-the-art compressive sensing and deep learning-based methods.
Tasks Compressive Sensing
Published 2020-02-18
URL https://arxiv.org/abs/2002.07346v1
PDF https://arxiv.org/pdf/2002.07346v1.pdf
PWC https://paperswithcode.com/paper/restricted-structural-random-matrix-for
Repo
Framework
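
A loose numerical sketch of the sampling idea as we read it from the abstract: randomly sub-sample the signal into several "low-resolution" versions, then apply a shared, row-normalized random projection so every measurement carries (statistically) equal weight. This is our illustration only, not the paper's exact RSRM construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sub, m = 1024, 4, 64          # signal length, #sub-signals, measurements each

x = rng.standard_normal(n)

# A random permutation splits x into n_sub interleaved low-resolution signals.
perm = rng.permutation(n)
subsignals = x[perm].reshape(n_sub, n // n_sub)

# Shared Gaussian random projection; normalized rows make measurements equally weighted.
Phi = rng.standard_normal((m, n // n_sub))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)

y = subsignals @ Phi.T             # shape (n_sub, m): the compressive measurements
print(y.shape)
```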

Feasibility of Video-based Sub-meter Localization on Resource-constrained Platforms

Title Feasibility of Video-based Sub-meter Localization on Resource-constrained Platforms
Authors Abm Musa, Jakob Eriksson
Abstract While the satellite-based Global Positioning System (GPS) is adequate for some outdoor applications, many other applications are held back by its multi-meter positioning errors and poor indoor coverage. In this paper, we study the feasibility of real-time video-based localization on resource-constrained platforms. Before commencing a localization task, a video-based localization system downloads an offline model of a restricted target environment, such as a set of city streets, or an indoor shopping mall. The system is then able to localize the user within the model, using only video as input. To enable such a system to run on resource-constrained embedded systems or smartphones, we (a) propose techniques for efficiently building a 3D model of a surveyed path, through frame selection and efficient feature matching, (b) substantially reduce model size by multiple compression techniques, without sacrificing localization accuracy, (c) propose efficient and concurrent techniques for feature extraction and matching to enable online localization, (d) propose a method with interleaved feature matching and optical flow based tracking to reduce the feature extraction and matching time in online localization. Based on an extensive set of both indoor and outdoor videos, manually annotated with location ground truth, we demonstrate that sub-meter accuracy, at real-time rates, is achievable on smart-phone type platforms, despite challenging video conditions.
Tasks Optical Flow Estimation
Published 2020-02-19
URL https://arxiv.org/abs/2002.08039v1
PDF https://arxiv.org/pdf/2002.08039v1.pdf
PWC https://paperswithcode.com/paper/feasibility-of-video-based-sub-meter
Repo
Framework
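
Contribution (d) is a classic interleaving scheme, sketched below with OpenCV: run the expensive feature extraction only every K-th frame, and propagate keypoints with cheap pyramidal Lucas-Kanade optical flow in between. The video path and K are placeholders; matching against the downloaded 3D model would happen at the detection step in the full pipeline.

```python
import cv2
import numpy as np

K = 5
cap = cv2.VideoCapture("walkthrough.mp4")   # hypothetical input video
orb = cv2.ORB_create()
prev_gray, points = None, None

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if frame_idx % K == 0 or points is None or len(points) < 20:
        # Expensive step: fresh feature detection (and, in the real system,
        # matching against the offline 3D model).
        kps = orb.detect(gray, None)
        points = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)
    else:
        # Cheap step: track last frame's points with optical flow.
        points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
        points = points[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
    frame_idx += 1
```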

Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model

Title Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model
Authors Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Mohamed Eldesouki
Abstract Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and CWER of 2.2% and CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% and 4.3% for MSA and CA respectively. This highlights the effectiveness of feature engineering for such deep neural models.
Tasks Feature Engineering
Published 2020-02-04
URL https://arxiv.org/abs/2002.01207v1
PDF https://arxiv.org/pdf/2002.01207v1.pdf
PWC https://paperswithcode.com/paper/arabic-diacritic-recovery-using-a-feature
Repo
Framework
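
A schematic PyTorch model in the spirit of the abstract: a biLSTM tagger that concatenates embeddings of several per-token features and predicts a diacritic label per position. The feature inventory, vocabulary sizes, and dimensions are invented for illustration; the paper uses a much richer feature set.

```python
import torch
import torch.nn as nn

class DiacriticTagger(nn.Module):
    def __init__(self, char_vocab=60, feat_vocab=30, n_labels=15, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, dim)
        self.feat_emb = nn.Embedding(feat_vocab, dim)   # e.g. a POS or segment feature
        self.bilstm = nn.LSTM(dim * 2, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(dim * 2, n_labels)

    def forward(self, chars, feats):
        x = torch.cat([self.char_emb(chars), self.feat_emb(feats)], dim=-1)
        h, _ = self.bilstm(x)
        return self.out(h)          # (batch, seq_len, n_labels) diacritic scores

model = DiacriticTagger()
scores = model(torch.randint(0, 60, (2, 20)), torch.randint(0, 30, (2, 20)))
print(scores.shape)  # torch.Size([2, 20, 15])
```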

Prediction Confidence from Neighbors

Title Prediction Confidence from Neighbors
Authors Mark Philip Philipsen, Thomas Baltzer Moeslund
Abstract The inability of Machine Learning (ML) models to successfully extrapolate correct predictions from out-of-distribution (OoD) samples is a major hindrance to the application of ML in critical applications. Until the generalization ability of ML methods improves, it is necessary to keep humans in the loop. The need for human supervision can only be reduced if it is possible to determine a level of confidence in predictions, which can be used either to ask for human assistance or to abstain from making predictions. We show that feature space distance is a meaningful measure that can provide confidence in predictions. The distance between unseen samples and nearby training samples proves to be correlated with the prediction error on unseen samples. Depending on the acceptable degree of error, predictions can either be trusted or rejected based on the distance to training samples. This enables earlier and safer deployment of models in critical applications and is vital for deploying models under ever-changing conditions.
Tasks
Published 2020-03-31
URL https://arxiv.org/abs/2003.14047v1
PDF https://arxiv.org/pdf/2003.14047v1.pdf
PWC https://paperswithcode.com/paper/prediction-confidence-from-neighbors
Repo
Framework
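
The central idea reduces to a few lines, sketched here with scikit-learn: use distance in feature space to the nearest training samples as a confidence signal, and abstain when it exceeds a threshold. The feature dimensions and threshold below are arbitrary stand-ins.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((500, 32))   # features of training samples
test_feats = rng.standard_normal((10, 32))     # features of unseen samples

nn_index = NearestNeighbors(n_neighbors=5).fit(train_feats)
dists, _ = nn_index.kneighbors(test_feats)
confidence = dists.mean(axis=1)                # mean distance to 5 nearest neighbors

THRESHOLD = 7.0                                # tuned to the acceptable error level
for i, d in enumerate(confidence):
    verdict = "trust prediction" if d < THRESHOLD else "abstain / ask a human"
    print(f"sample {i}: mean NN distance {d:.2f} -> {verdict}")
```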

Robotic Table Tennis with Model-Free Reinforcement Learning

Title Robotic Table Tennis with Model-Free Reinforcement Learning
Authors Wenbo Gao, Laura Graesser, Krzysztof Choromanski, Xingyou Song, Nevena Lazic, Pannag Sanketi, Vikas Sindhwani, Navdeep Jaitly
Abstract We propose a model-free algorithm for learning efficient policies capable of returning table tennis balls by controlling robot joints at a rate of 100 Hz. We demonstrate that evolutionary search (ES) methods acting on CNN-based policy architectures for non-visual inputs and convolving across time learn compact controllers leading to smooth motions. Furthermore, we show that with appropriately tuned curriculum learning on the task and rewards, policies are capable of developing multi-modal styles, specifically forehand and backhand stroke, whilst achieving 80% return rate on a wide range of ball throws. We observe that multi-modality does not require any architectural priors, such as multi-head architectures or hierarchical policies.
Tasks
Published 2020-03-31
URL https://arxiv.org/abs/2003.14398v1
PDF https://arxiv.org/pdf/2003.14398v1.pdf
PWC https://paperswithcode.com/paper/robotic-table-tennis-with-model-free
Repo
Framework
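
For readers unfamiliar with ES as a policy optimizer, here is a toy version of the training loop: perturb the policy parameters with Gaussian noise, evaluate episode returns, and move the parameters along the return-weighted noise. The reward function is a stand-in; the paper's policies control robot joints at 100 Hz.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(params):
    # Placeholder reward with a peak at an arbitrary target parameter vector.
    return -np.sum((params - 1.0) ** 2)

theta = np.zeros(16)                    # flattened policy parameters
sigma, lr, pop = 0.1, 0.02, 50          # noise scale, step size, population size

for step in range(200):
    noise = rng.standard_normal((pop, theta.size))
    returns = np.array([episode_return(theta + sigma * e) for e in noise])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize
    theta += lr / (pop * sigma) * noise.T @ returns                # ES gradient step

print(f"final return: {episode_return(theta):.4f}")
```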

TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style

Title TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style
Authors Chaitanya Patel, Zhouyingcheng Liao, Gerard Pons-Moll
Abstract In this paper, we present TailorNet, a neural model which predicts clothing deformation in 3D as a function of three factors: pose, shape and style (garment geometry), while retaining wrinkle detail. This goes beyond prior models, which are either specific to one style and shape, or generalize to different shapes producing smooth results, despite being style specific. Our hypothesis is that (even non-linear) combinations of examples smooth out high frequency components such as fine-wrinkles, which makes learning the three factors jointly hard. At the heart of our technique is a decomposition of deformation into a high frequency and a low frequency component. While the low-frequency component is predicted from pose, shape and style parameters with an MLP, the high-frequency component is predicted with a mixture of shape-style specific pose models. The weights of the mixture are computed with a narrow bandwidth kernel to guarantee that only predictions with similar high-frequency patterns are combined. The style variation is obtained by computing, in a canonical pose, a subspace of deformation, which satisfies physical constraints such as inter-penetration, and draping on the body. TailorNet delivers 3D garments which retain the wrinkles from the physics based simulations (PBS) it is learned from, while running more than 1000 times faster. In contrast to PBS, TailorNet is easy to use and fully differentiable, which is crucial for computer vision algorithms. Several experiments demonstrate TailorNet produces more realistic results than prior work, and even generates temporally coherent deformations on sequences of the AMASS dataset, despite being trained on static poses from a different dataset. To stimulate further research in this direction, we will make a dataset consisting of 55800 frames, as well as our model publicly available at https://virtualhumans.mpi-inf.mpg.de/tailornet.
Tasks
Published 2020-03-10
URL https://arxiv.org/abs/2003.04583v2
PDF https://arxiv.org/pdf/2003.04583v2.pdf
PWC https://paperswithcode.com/paper/the-virtual-tailor-predicting-clothing-in-3d
Repo
Framework
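
Our schematic reading of the decomposition described above: an MLP predicts the low-frequency garment deformation from (pose, shape, style), while the high-frequency detail is a narrow-bandwidth kernel mixture of prototype-specific models. All dimensions, the number of experts, and the RBF bandwidth are invented for illustration and do not reflect TailorNet's actual architecture.

```python
import torch
import torch.nn as nn

class TailorNetSketch(nn.Module):
    def __init__(self, in_dim=85, n_verts=300, n_experts=4, bandwidth=0.1):
        super().__init__()
        self.low = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_verts * 3))
        # One high-frequency pose model per (shape, style) prototype.
        self.experts = nn.ModuleList(
            nn.Linear(in_dim, n_verts * 3) for _ in range(n_experts))
        self.prototypes = nn.Parameter(torch.randn(n_experts, in_dim))
        self.bandwidth = bandwidth

    def forward(self, z):
        low = self.low(z)
        # Narrow-bandwidth kernel: only experts with similar prototypes contribute,
        # so dissimilar high-frequency patterns are never averaged together.
        w = torch.softmax(-torch.cdist(z, self.prototypes) ** 2 / self.bandwidth,
                          dim=-1)
        high = sum(w[:, i:i + 1] * ex(z) for i, ex in enumerate(self.experts))
        return low + high           # flattened per-vertex displacements

net = TailorNetSketch()
print(net(torch.randn(2, 85)).shape)  # torch.Size([2, 900])
```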

Towards Intelligent Robotic Process Automation for BPMers

Title Towards Intelligent Robotic Process Automation for BPMers
Authors Simone Agostinelli, Andrea Marrella, Massimo Mecella
Abstract Robotic Process Automation (RPA) is a fast-emerging automation technology that sits between the fields of Business Process Management (BPM) and Artificial Intelligence (AI), and allows organizations to automate high-volume routines. RPA tools are able to capture the execution of such routines, previously performed by human users on the interface of a computer system, and then emulate their enactment in place of the user by means of a software robot. Nowadays, in the BPM domain, only simple, predictable business processes involving routine work can be automated by RPA tools, in situations where there is no room for interpretation, while more sophisticated work is still left to human experts. In this paper, starting from an in-depth experimentation with the RPA tools available on the market, we provide a classification framework to categorize them on the basis of some key dimensions. Based on this analysis, we then derive four research challenges and discuss prospective approaches necessary to inject intelligence into current RPA technology, in order to achieve more widespread adoption of RPA in the BPM domain.
Tasks
Published 2020-01-03
URL https://arxiv.org/abs/2001.00804v1
PDF https://arxiv.org/pdf/2001.00804v1.pdf
PWC https://paperswithcode.com/paper/towards-intelligent-robotic-process
Repo
Framework

Automated Discovery of Data Transformations for Robotic Process Automation

Title Automated Discovery of Data Transformations for Robotic Process Automation
Authors Volodymyr Leno, Marlon Dumas, Marcello La Rosa, Fabrizio Maria Maggi, Artem Polyvyanyy
Abstract Robotic Process Automation (RPA) is a technology for automating repetitive routines consisting of sequences of user interactions with one or more applications. In order to fully exploit the opportunities opened by RPA, companies need to discover which specific routines may be automated, and how. In this setting, this paper addresses the problem of analyzing User Interaction (UI) logs in order to discover routines where a user transfers data from one spreadsheet or (Web) form to another. The paper maps this problem to that of discovering data transformations by example - a problem for which several techniques are available. The paper shows that a naive application of a state-of-the-art technique for data transformation discovery is computationally inefficient. Accordingly, the paper proposes two optimizations that take advantage of the information in the UI log and the fact that data transfers across applications typically involve copying alphabetic and numeric tokens separately. The proposed approach and its optimizations are evaluated using UI logs that replicate a real-life repetitive data transfer routine.
Tasks
Published 2020-01-03
URL https://arxiv.org/abs/2001.01007v1
PDF https://arxiv.org/pdf/2001.01007v1.pdf
PWC https://paperswithcode.com/paper/automated-discovery-of-data-transformations
Repo
Framework
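
A minimal illustration of the second optimization described above: split each copied value into alphabetic and numeric token streams and discover a transformation per token class separately, which shrinks the search space for by-example synthesis. The example pairs and the "rules" checked here are ours, purely to show the token-splitting step.

```python
import re

def split_tokens(value):
    """Separate a cell value into alphabetic and numeric token streams."""
    alpha = re.findall(r"[A-Za-z]+", value)
    numeric = re.findall(r"[0-9]+", value)
    return alpha, numeric

# Source/target pairs as they might appear in a UI log of a copy-paste routine.
examples = [("John Smith, 42", "SMITH-42"), ("Ada Lovelace, 36", "LOVELACE-36")]

for src, dst in examples:
    a_src, n_src = split_tokens(src)
    a_dst, n_dst = split_tokens(dst)
    # Candidate rules found independently per class:
    # alphabetic -> last source token, upper-cased; numeric -> identity.
    assert a_dst == [a_src[-1].upper()] and n_dst == [n_src[-1]]
print("token-wise rules hold on all examples")
```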

Predicting Real-Time Locational Marginal Prices: A GAN-Based Video Prediction Approach

Title Predicting Real-Time Locational Marginal Prices: A GAN-Based Video Prediction Approach
Authors Zhongxia Zhang, Meng Wu
Abstract In this paper, we propose an unsupervised data-driven approach to predict real-time locational marginal prices (RTLMPs). The proposed approach is built upon a general data structure for organizing system-wide heterogeneous market data streams into the format of market data images and videos. Leveraging this general data structure, the system-wide RTLMP prediction problem is formulated as a video prediction problem. A video prediction model based on generative adversarial networks (GAN) is proposed to learn the spatio-temporal correlations among historical RTLMPs and predict system-wide RTLMPs for the next hour. An autoregressive moving average (ARMA) calibration method is adopted to improve the prediction accuracy. The proposed RTLMP prediction method takes public market data as inputs, without requiring any confidential information on system topology, model parameters, or market operating details. Case studies using public market data from ISO New England (ISO-NE) and Southwest Power Pool (SPP) demonstrate that the proposed method is able to learn spatio-temporal correlations among RTLMPs and perform accurate RTLMP prediction.
Tasks Calibration, Video Prediction
Published 2020-03-20
URL https://arxiv.org/abs/2003.09527v1
PDF https://arxiv.org/pdf/2003.09527v1.pdf
PWC https://paperswithcode.com/paper/predicting-real-time-locational-marginal
Repo
Framework
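
The ARMA calibration step mentioned above can be sketched with statsmodels: fit an ARMA model to the GAN predictor's recent residuals and add its one-step forecast as a correction. The synthetic series below stands in for real ISO-NE/SPP prices, and the ARMA order is our assumption.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
actual = 30 + np.cumsum(rng.standard_normal(200))    # stand-in RTLMP series
gan_pred = actual + rng.standard_normal(200) * 2.0   # stand-in GAN predictions

residuals = actual - gan_pred
arma = ARIMA(residuals, order=(2, 0, 1)).fit()       # d=0 -> an ARMA(2,1) model
correction = arma.forecast(steps=1)[0]

calibrated_next = gan_pred[-1] + correction          # calibrated next-hour price
print(f"correction {correction:+.3f} -> calibrated price {calibrated_next:.2f}")
```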

Attention-based Multi-modal Fusion Network for Semantic Scene Completion

Title Attention-based Multi-modal Fusion Network for Semantic Scene Completion
Authors Siqi Li, Changqing Zou, Yipeng Li, Xibin Zhao, Yue Gao
Abstract This paper presents an end-to-end 3D convolutional network named attention-based multi-modal fusion network (AMFNet) for the semantic scene completion (SSC) task of inferring the occupancy and semantic labels of a volumetric 3D scene from single-view RGB-D images. Compared with previous methods which use only the semantic features extracted from RGB-D images, the proposed AMFNet learns to perform effective 3D scene completion and semantic segmentation simultaneously by leveraging the experience of inferring 2D semantic segmentation from RGB-D images as well as the reliable depth cues in the spatial dimension. This is achieved by employing a multi-modal fusion architecture boosted from 2D semantic segmentation and a 3D semantic completion network empowered by residual attention blocks. We validate our method on both the synthetic SUNCG-RGBD dataset and the real NYUv2 dataset, and the results show that our method achieves gains of 2.5% and 2.6% on the two datasets, respectively, against the state-of-the-art method.
Tasks Semantic Segmentation
Published 2020-03-31
URL https://arxiv.org/abs/2003.13910v1
PDF https://arxiv.org/pdf/2003.13910v1.pdf
PWC https://paperswithcode.com/paper/attention-based-multi-modal-fusion-network
Repo
Framework
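
For intuition, here is a generic residual attention block of the kind the abstract references, in our simplified form (not necessarily the AMFNet design): a 3D convolutional branch gated by a sigmoid attention mask, wrapped in an identity skip connection.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1))
        self.attn = nn.Sequential(
            nn.Conv3d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        feat = self.body(x)
        return x + feat * self.attn(x)   # attention-gated residual

block = ResidualAttentionBlock3D(16)
print(block(torch.randn(1, 16, 8, 8, 8)).shape)  # torch.Size([1, 16, 8, 8, 8])
```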

Learning Compact Reward for Image Captioning

Title Learning Compact Reward for Image Captioning
Authors Nannan Li, Zhenzhong Chen
Abstract Adversarial learning has shown its advances in generating natural and diverse descriptions in image captioning. However, the learned reward of existing adversarial methods is vague and ill-defined due to the reward ambiguity problem. In this paper, we propose a refined Adversarial Inverse Reinforcement Learning (rAIRL) method to handle the reward ambiguity problem by disentangling reward for each word in a sentence, as well as achieve stable adversarial training by refining the loss function to shift the generator towards Nash equilibrium. In addition, we introduce a conditional term in the loss function to mitigate mode collapse and to increase the diversity of the generated descriptions. Our experiments on MS COCO and Flickr30K show that our method can learn compact reward for image captioning.
Tasks Image Captioning
Published 2020-03-24
URL https://arxiv.org/abs/2003.10925v1
PDF https://arxiv.org/pdf/2003.10925v1.pdf
PWC https://paperswithcode.com/paper/learning-compact-reward-for-image-captioning-1
Repo
Framework
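
A hedged sketch of "disentangling reward for each word": an AIRL-style discriminator scores every token position, so each word gets its own reward signal (r = log D - log(1 - D)) instead of one scalar per sentence. This illustrates the idea only; the architecture and sizes below are our inventions, not the paper's rAIRL model.

```python
import torch
import torch.nn as nn

class PerWordDiscriminator(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, captions):
        h, _ = self.rnn(self.emb(captions))
        return torch.sigmoid(self.score(h)).squeeze(-1)  # D(w_t) per word

disc = PerWordDiscriminator()
d = disc(torch.randint(0, 1000, (2, 12)))
per_word_reward = torch.log(d + 1e-8) - torch.log(1 - d + 1e-8)
print(per_word_reward.shape)  # torch.Size([2, 12]): one reward per word
```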

Evaluating Amharic Machine Translation

Title Evaluating Amharic Machine Translation
Authors Asmelash Teka Hadgu, Adam Beaudoin, Abel Aregawi
Abstract Machine translation (MT) systems are now able to provide very accurate results for high resource language pairs. However, for many low resource languages, MT is still under active research. In this paper, we develop and share a dataset to automatically evaluate the quality of MT systems for Amharic. We compare two commercially available MT systems that support translation of Amharic to and from English to assess the current state of MT for Amharic. The BLEU scores show that Amharic translation results are promising but still low. We hope that this dataset will be useful to the research community, both in academia and industry, as a benchmark to evaluate Amharic MT systems.
Tasks Machine Translation
Published 2020-03-31
URL https://arxiv.org/abs/2003.14386v1
PDF https://arxiv.org/pdf/2003.14386v1.pdf
PWC https://paperswithcode.com/paper/evaluating-amharic-machine-translation
Repo
Framework
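
For context, this is how a benchmark like this is typically consumed: score a system's output against references with BLEU. sacrebleu is a common choice; the sentences below are placeholders, not items from the released dataset.

```python
import sacrebleu

hypotheses = ["the weather is nice today"]
references = [["the weather is good today"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```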

On conditional versus marginal bias in multi-armed bandits

Title On conditional versus marginal bias in multi-armed bandits
Authors Jaehyeok Shin, Aaditya Ramdas, Alessandro Rinaldo
Abstract The bias of the sample means of the arms in multi-armed bandits is an important issue in adaptive data analysis that has recently received considerable attention in the literature. Existing results relate in precise ways the sign and magnitude of the bias to various sources of data adaptivity, but do not apply to the conditional inference setting in which the sample means are computed only if some specific conditions are satisfied. In this paper, we characterize the sign of the conditional bias of monotone functions of the rewards, including the sample mean. Our results hold for arbitrary conditioning events and leverage natural monotonicity properties of the data collection policy. We further demonstrate, through several examples from sequential testing and best arm identification, that the sign of the conditional and unconditional bias of the sample mean of an arm can be different, depending on the conditioning event. Our analysis offers new and interesting perspectives on the subtleties of assessing the bias in data adaptive settings.
Tasks Multi-Armed Bandits
Published 2020-02-19
URL https://arxiv.org/abs/2002.08422v1
PDF https://arxiv.org/pdf/2002.08422v1.pdf
PWC https://paperswithcode.com/paper/on-conditional-versus-marginal-bias-in-multi
Repo
Framework
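
A quick simulation of the phenomenon in the abstract: the sample mean of an arm, computed only when a selection condition fires, can be biased even when the unconditional mean is not. The setup (two Gaussian arms, condition "arm 0 looks best") is ours, chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
unconditional, conditional = [], []

for _ in range(20000):
    arm0 = rng.normal(0.0, 1.0, size=20)
    arm1 = rng.normal(0.0, 1.0, size=20)
    m0 = arm0.mean()
    unconditional.append(m0)
    if m0 > arm1.mean():            # conditioning event: arm 0 declared the best
        conditional.append(m0)

print(f"unconditional bias of arm-0 mean: {np.mean(unconditional):+.4f}")  # ~0
print(f"conditional bias (arm 0 selected): {np.mean(conditional):+.4f}")   # > 0
```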