April 2, 2020


Paper Group ANR 175

Multilingual Alignment of Contextual Word Representations. Spatio-Temporal Graph for Video Captioning with Knowledge Distillation. Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building. Spatio-Temporal Ranked-Attention Networks for Video Captioning. LSCP: Enhanced Large Scale Colloquial Persian Language Understanding. Ro …

Multilingual Alignment of Contextual Word Representations


Title	Multilingual Alignment of Contextual Word Representations
Authors	Steven Cao, Nikita Kitaev, Dan Klein
Abstract	We propose procedures for evaluating and strengthening contextual embedding alignment and show that they are useful in analyzing and improving multilingual BERT. In particular, after our proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching pseudo-fully-supervised translate-train models for Bulgarian and Greek. Further, to measure the degree of alignment, we introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer. Using this word retrieval task, we also analyze BERT and find that it exhibits systematic deficiencies, e.g. worse alignment for open-class parts-of-speech and word pairs written in different scripts, that are corrected by the alignment procedure. These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
Tasks
Published	2020-02-10
URL	https://arxiv.org/abs/2002.03518v2
PDF	https://arxiv.org/pdf/2002.03518v2.pdf
PWC	https://paperswithcode.com/paper/multilingual-alignment-of-contextual-word-1
Repo
Framework
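
The contextual word retrieval measure mentioned in the abstract can be pictured as a nearest-neighbor search: for each word vector in one language, find the closest vector in the other language and check whether it is the aligned word. The following is only a minimal sketch, assuming cosine similarity and hand-written toy vectors; the names `cosine`, `retrieve`, and `retrieval_accuracy` are hypothetical, and the paper's actual procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(source_vecs, target_vecs):
    """For each source vector, return the index of the most similar target vector."""
    return [
        max(range(len(target_vecs)), key=lambda j: cosine(s, target_vecs[j]))
        for s in source_vecs
    ]

def retrieval_accuracy(source_vecs, target_vecs):
    """Fraction of source vectors whose nearest target shares their row index.

    Assumption: row i of both lists holds the two sides of one aligned word pair.
    """
    predictions = retrieve(source_vecs, target_vecs)
    hits = sum(1 for i, j in enumerate(predictions) if i == j)
    return hits / len(source_vecs)

# Toy stand-ins for contextual embeddings of three aligned word pairs.
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
source = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
accuracy = retrieval_accuracy(source, target)
```

In this toy setup a higher accuracy simply means the two embedding spaces are better aligned, which is the kind of quantity the abstract relates to zero-shot transfer.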

Spatio-Temporal Graph for Video Captioning with Knowledge Distillation


Title	Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
Authors	Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles
Abstract	Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
Tasks	Video Captioning
Published	2020-03-31
URL	https://arxiv.org/abs/2003.13942v1
PDF	https://arxiv.org/pdf/2003.13942v1.pdf
PWC	https://paperswithcode.com/paper/spatio-temporal-graph-for-video-captioning
Repo
Framework

Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building


Title	Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building
Authors	Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Jun Guo
Abstract	Traditional video captioning requests a holistic description of the video, yet detailed descriptions of specific objects may not be available. Besides, most methods adopt frame-level inter-object features and ambiguous descriptions during training, which makes it difficult to learn the vision-language relationships. Without associating the transition trajectories, these image-based methods cannot understand the activities with visual features. We propose a novel task, named object-oriented video captioning, which focuses on understanding videos at the object level. We re-annotate the object-sentence pairs for more effective cross-modal learning. Thereafter, we design the video-based object-oriented video captioning (OVC)-Net to reliably analyze activities over time with only visual features and to stably capture the vision-language connections on small datasets. To demonstrate its effectiveness, we evaluate the method on the new dataset and compare it with state-of-the-art video captioning methods. The experimental results show that OVC-Net can precisely describe concurrent objects and their activities in detail.
Tasks	Video Captioning
Published	2020-03-08
URL	https://arxiv.org/abs/2003.03715v2
PDF	https://arxiv.org/pdf/2003.03715v2.pdf
PWC	https://paperswithcode.com/paper/object-oriented-video-captioning-with
Repo
Framework

Spatio-Temporal Ranked-Attention Networks for Video Captioning


Title	Spatio-Temporal Ranked-Attention Networks for Video Captioning
Authors	Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks
Abstract	Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame. We propose a novel LSTM-based temporal ranking function, which we call ranked attention, for the ST model to capture action dynamics. Our entire framework is trained end-to-end. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
Tasks	Video Captioning
Published	2020-01-17
URL	https://arxiv.org/abs/2001.06127v1
PDF	https://arxiv.org/pdf/2001.06127v1.pdf
PWC	https://paperswithcode.com/paper/spatio-temporal-ranked-attention-networks-for
Repo
Framework

LSCP: Enhanced Large Scale Colloquial Persian Language Understanding


Title	LSCP: Enhanced Large Scale Colloquial Persian Language Understanding
Authors	Hadi Abdi Khojasteh, Ebrahim Ansari, Mahdi Bohlouli
Abstract	Language recognition has been significantly advanced in recent years by means of modern machine learning methods such as deep learning and benchmarks with rich annotations. However, research is still limited in low-resource formal languages. This leaves a significant gap in describing colloquial language, especially for low-resource languages such as Persian. To address this gap, we propose a “Large Scale Colloquial Persian Dataset” (LSCP). LSCP is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects in human-level sentences, which are naturally captured from real-world sentences. We believe that further investigation and processing, as well as the application of novel algorithms and methods, can strengthen computerized understanding and processing of low-resource languages. The proposed corpus consists of 120M sentences derived from 27M tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations in five different languages.
Tasks
Published	2020-03-13
URL	https://arxiv.org/abs/2003.06499v1
PDF	https://arxiv.org/pdf/2003.06499v1.pdf
PWC	https://paperswithcode.com/paper/lscp-enhanced-large-scale-colloquial-persian
Repo
Framework

Robustness from Simple Classifiers


Title	Robustness from Simple Classifiers
Authors	Sharon Qian, Dimitris Kalimeris, Gal Kaplun, Yaron Singer
Abstract	Despite the vast success of Deep Neural Networks in numerous application domains, it has been shown that such models are not robust, i.e., they are vulnerable to small adversarial perturbations of the input. While extensive work has been done on why such perturbations occur or how to successfully defend against them, we still do not have a complete understanding of robustness. In this work, we investigate the connection between robustness and simplicity. We find that simpler classifiers, formed by reducing the number of output classes, are less susceptible to adversarial perturbations. Consequently, we demonstrate that decomposing a complex multiclass model into an aggregation of binary models enhances robustness. This behavior is consistent across different datasets and model architectures and can be combined with known defense techniques such as adversarial training. Moreover, we provide further evidence of a disconnect between standard and robust learning regimes. In particular, we show that elaborate label information can help standard accuracy but harm robustness.
Tasks
Published	2020-02-21
URL	https://arxiv.org/abs/2002.09422v1
PDF	https://arxiv.org/pdf/2002.09422v1.pdf
PWC	https://paperswithcode.com/paper/robustness-from-simple-classifiers
Repo
Framework
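
The idea of replacing one multiclass model with an aggregation of binary models can be sketched as follows. This is an illustration only, assuming a one-vs-rest decomposition with argmax aggregation; the abstract does not specify the aggregation rule, and the scorers, class centers, and the name `aggregate_binary` are hypothetical.

```python
def aggregate_binary(binary_scorers, x):
    """Return the class whose binary 'is it this class?' scorer is most confident."""
    scores = {label: scorer(x) for label, scorer in binary_scorers.items()}
    return max(scores, key=scores.get)

# Hypothetical 1-D problem: each class has a center, and its binary scorer
# returns a higher score the closer the input lies to that center.
centers = {"a": 0.0, "b": 5.0, "c": 10.0}
binary_scorers = {
    label: (lambda x, c=center: -abs(x - c)) for label, center in centers.items()
}

prediction = aggregate_binary(binary_scorers, 4.2)
```

Each scorer only has to separate its own class from everything else, which is the sense in which the individual models are simpler than the original multiclass classifier.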

DeepSIP: A System for Predicting Service Impact of Network Failure by Temporal Multimodal CNN


Title	DeepSIP: A System for Predicting Service Impact of Network Failure by Temporal Multimodal CNN
Authors	Yoichi Matsuo, Tatsuaki Kimura, Ken Nishimatsu
Abstract	When a failure occurs in a network, network operators need to recognize service impact, since service impact is essential information for handling failures. In this paper, we propose Deep learning based Service Impact Prediction (DeepSIP), a system to predict the time to recovery from the failure and the loss of traffic volume due to the failure in a network element using a temporal multimodal convolutional neural network (CNN). Since the time to recovery is useful information for a service level agreement (SLA) and the loss of traffic volume is directly related to the severity of the failures, we regard these as the service impact. The service impact is challenging to predict, since a network element does not explicitly contain any information about the service impact. Thus, we aim to predict the service impact from syslog messages and traffic volume by extracting hidden information about failures. To extract useful features for prediction from syslog messages and traffic volume, which are multimodal, strongly correlated, and temporally dependent, we use a temporal multimodal CNN. In experiments on a synthetic dataset, DeepSIP reduced prediction error by approximately 50% compared with other NN-based methods.
Tasks
Published	2020-03-24
URL	https://arxiv.org/abs/2003.10643v1
PDF	https://arxiv.org/pdf/2003.10643v1.pdf
PWC	https://paperswithcode.com/paper/deepsip-a-system-for-predicting-service
Repo
Framework

Multi-Drone based Single Object Tracking with Agent Sharing Network


Title	Multi-Drone based Single Object Tracking with Agent Sharing Network
Authors	Pengfei Zhu, Jiayu Zheng, Dawei Du, Longyin Wen, Yiming Sun, Qinghua Hu
Abstract	Drones equipped with cameras can dynamically track a target from the air with a broader view than static cameras or moving sensors on the ground. However, it is still challenging to accurately track the target using a single drone due to several factors such as appearance variations and severe occlusions. In this paper, we collect a new Multi-Drone single Object Tracking (MDOT) dataset that consists of 92 groups of video clips with 113,918 high resolution frames taken by two drones and 63 groups of video clips with 145,875 high resolution frames taken by three drones. Besides, two evaluation metrics are specially designed for multi-drone single object tracking, i.e. automatic fusion score (AFS) and ideal fusion score (IFS). Moreover, an agent sharing network (ASNet) is proposed, based on self-supervised template sharing and view-aware fusion of the target from multiple drones, which can improve the tracking accuracy significantly compared with single drone tracking. Extensive experiments on MDOT show that our ASNet significantly outperforms recent state-of-the-art trackers.
Tasks	Object Tracking
Published	2020-03-16
URL	https://arxiv.org/abs/2003.06994v1
PDF	https://arxiv.org/pdf/2003.06994v1.pdf
PWC	https://paperswithcode.com/paper/multi-drone-based-single-object-tracking-with
Repo
Framework

Diabetic Retinopathy detection by retinal image recognizing


Title	Diabetic Retinopathy detection by retinal image recognizing
Authors	Gilberto Luis De Conto Junior
Abstract	Many people around the world are affected by diabetes, which occurs as type 1 or type 2. Diabetes brings with it several complications, including diabetic retinopathy, a disease that can lead to irreversible damage to the patient’s vision if not treated correctly. The earlier it is detected, the better the chances that the patient will not lose vision. Methods for automating manual procedures are currently prominent, yet the diagnostic process for retinopathy is manual, with the physician analyzing the patient’s retina on a monitor. Image recognition can aid this detection by recognizing Diabetic Retinopathy patterns and comparing them with the patient’s retina during diagnosis. This method can also assist telemedicine, in which people without access to the exam can benefit from the diagnosis provided by the application. The application was developed using convolutional neural networks, which process digital images by analyzing each image pixel. Using VGG-16 as a pre-trained model as the basis of the application was very useful, and the final model accuracy was 82%.
Tasks	Diabetic Retinopathy Detection
Published	2020-01-14
URL	https://arxiv.org/abs/2001.05835v1
PDF	https://arxiv.org/pdf/2001.05835v1.pdf
PWC	https://paperswithcode.com/paper/diabetic-retinopathy-detection-by-retinal
Repo
Framework

Making Logic Learnable With Neural Networks


Title	Making Logic Learnable With Neural Networks
Authors	Tobias Brudermueller, Dennis L. Shung, Loren Laine, Adrian J. Stanley, Stig B. Laursen, Harry R. Dalton, Jeffrey Ngu, Michael Schultz, Johannes Stegmaier, Smita Krishnaswamy
Abstract	While neural networks are good at learning unspecified functions from training samples, they cannot be directly implemented in hardware and are often not interpretable or formally verifiable. On the other hand, logic circuits are implementable, verifiable, and interpretable but are not able to learn from training data in a generalizable way. We propose a novel logic learning pipeline that combines the advantages of neural networks and logic circuits. Our pipeline first trains a neural network on a classification task, and then translates this, first to random forests or look-up tables, and then to AND-Inverter logic. We show that our pipeline maintains greater accuracy than naive translations to logic, and minimizes the logic such that it is more interpretable and has decreased hardware cost. We show the utility of our pipeline on a network that is trained on biomedical data from patients presenting with gastrointestinal bleeding with the prediction task of determining if patients need immediate hospital-based intervention. This approach could be applied to patient care to provide risk stratification and guide clinical decision-making.
Tasks	Decision Making
Published	2020-02-10
URL	https://arxiv.org/abs/2002.03847v2
PDF	https://arxiv.org/pdf/2002.03847v2.pdf
PWC	https://paperswithcode.com/paper/making-logic-learnable-with-neural-networks
Repo
Framework
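
The pipeline described in the abstract (trained network, then look-up table, then AND-Inverter logic) can be illustrated on a tiny scale. The sketch below is not the authors' method: it assumes a hand-written thresholded unit as a stand-in for a trained network, enumerates its full truth table, and evaluates that table using only AND and NOT gates, with OR obtained through De Morgan's law. No logic minimization is attempted, and all names and weights are hypothetical.

```python
from itertools import product

def toy_network(bits):
    """Stand-in for a trained classifier: a thresholded weighted sum (hypothetical weights)."""
    weights, bias = (2.0, -1.0, 1.5), -1.0
    return int(sum(w * b for w, b in zip(weights, bits)) + bias > 0)

def build_lookup_table(fn, n_inputs):
    """Enumerate every binary input and record the function's output."""
    return {bits: fn(bits) for bits in product((0, 1), repeat=n_inputs)}

def AND(a, b):
    return a & b

def NOT(a):
    return 1 - a

def eval_and_inverter(table, bits):
    """Evaluate the table as an OR of minterms built only from AND and NOT."""
    none_match = 1  # running AND over NOT(minterm); OR follows by De Morgan
    for pattern, out in table.items():
        if out == 1:
            minterm = 1
            for p, b in zip(pattern, bits):
                minterm = AND(minterm, b if p else NOT(b))
            none_match = AND(none_match, NOT(minterm))
    return NOT(none_match)

table = build_lookup_table(toy_network, 3)
agrees = all(eval_and_inverter(table, bits) == toy_network(bits) for bits in table)
```

Exhaustive enumeration only works for a handful of inputs, since the table has 2^n rows; this is one reason an intermediate representation and a minimization step, as the abstract describes, matter in practice.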

Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation


Title	Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation
Authors	Yingjie Cai, Buyu Li, Zeyu Jiao, Hongsheng Li, Xingyu Zeng, Xiaogang Wang
Abstract	The monocular 3D object detection task aims to predict the 3D bounding boxes of objects based on monocular RGB images. Since location recovery in 3D space is quite difficult owing to the absence of depth information, this paper proposes a novel unified framework which decomposes the detection problem into a structured polygon prediction task and a depth recovery task. Different from the widely studied 2D bounding boxes, the proposed novel structured polygon in the 2D image consists of several projected surfaces of the target object. Compared to the widely-used 3D bounding box proposals, it is shown to be a better representation for 3D detection. In order to inversely project the predicted 2D structured polygon to a cuboid in the 3D physical world, the subsequent depth recovery task uses an object height prior to complete the inverse projection transformation with the given camera projection matrix. Moreover, a fine-grained 3D box refinement scheme is proposed to further rectify the 3D detection results. Experiments are conducted on the challenging KITTI benchmark, in which our method achieves state-of-the-art detection accuracy.
Tasks	3D Object Detection, Depth Estimation, Object Detection
Published	2020-02-05
URL	https://arxiv.org/abs/2002.01619v1
PDF	https://arxiv.org/pdf/2002.01619v1.pdf
PWC	https://paperswithcode.com/paper/monocular-3d-object-detection-with-decoupled
Repo
Framework
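
The role of a height prior in depth recovery can be seen from the standard pinhole camera relation: an object of real height H that spans h pixels under focal length f (in pixels) lies at depth roughly z = f * H / h. The sketch below illustrates only this textbook relation and the matching back-projection of a pixel to 3D; it is not the paper's implementation, and the function names and numeric values are hypothetical.

```python
def depth_from_height(focal_px, real_height_m, pixel_height):
    """Pinhole estimate of depth (meters) from a known object height."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height

def back_project(u, v, depth, focal_px, cx, cy):
    """Map pixel (u, v) at a given depth to camera-frame coordinates (x, y, z).

    Assumes a single focal length for both axes and principal point (cx, cy).
    """
    x = (u - cx) * depth / focal_px
    y = (v - cy) * depth / focal_px
    return (x, y, depth)

# Hypothetical numbers: a 1.5 m tall object spanning 50 px with f = 700 px.
z = depth_from_height(700.0, 1.5, 50.0)
point = back_project(650.0, 200.0, z, 700.0, 600.0, 180.0)
```

Because depth scales inversely with pixel height, small errors in the predicted 2D extent of distant objects translate into large depth errors, which helps explain why a refinement stage is useful.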

3D Object Detection on Point Clouds using Local Ground-aware and Adaptive Representation of scenes’ surface


Title	3D Object Detection on Point Clouds using Local Ground-aware and Adaptive Representation of scenes’ surface
Authors	Arun CS Kumar, Disha Ahuja, Ashwath Aithal
Abstract	A novel, adaptive, ground-aware, and cost-effective 3D Object Detection pipeline is proposed. The ground surface representation introduced in this paper, in comparison to its uni-planar counterparts (methods that model the surface of a whole 3D scene using a single plane), is far more accurate while being ~10x faster. The novelty of the ground representation lies both in the way in which the ground surface of the scene is represented in Lidar perception problems, as well as in the (cost-efficient) way in which it is computed. Furthermore, the proposed object detection pipeline builds on the traditional two-stage object detection models by incorporating the ability to dynamically reason about the surface of the scene, ultimately achieving a new state-of-the-art 3D object detection performance among the two-stage Lidar Object Detection pipelines.
Tasks	3D Object Detection, Object Detection
Published	2020-02-02
URL	https://arxiv.org/abs/2002.00336v1
PDF	https://arxiv.org/pdf/2002.00336v1.pdf
PWC	https://paperswithcode.com/paper/3d-object-detection-on-point-clouds-using
Repo
Framework

Discretization and Machine Learning Approximation of BSDEs with a Constraint on the Gains-Process


Title	Discretization and Machine Learning Approximation of BSDEs with a Constraint on the Gains-Process
Authors	Idris Kharroubi, Thomas Lim, Xavier Warin
Abstract	We study the approximation of backward stochastic differential equations (BSDEs for short) with a constraint on the gains process. We first discretize the constraint by applying a so-called facelift operator at times of a grid. We show that this discretely constrained BSDE converges to the continuously constrained one as the mesh of the grid converges to zero. We then focus on the approximation of the discretely constrained BSDE. For that we adopt a machine learning approach. We show that the facelift can be approximated by an optimization problem over a class of neural networks under constraints on the neural network and its derivative. We then derive an algorithm converging to the discretely constrained BSDE as the number of neurons goes to infinity. We end with numerical experiments. Mathematics Subject Classification (2010): 65C30, 65M75, 60H35, 93E20, 49L25.
Tasks
Published	2020-02-07
URL	https://arxiv.org/abs/2002.02675v1
PDF	https://arxiv.org/pdf/2002.02675v1.pdf
PWC	https://paperswithcode.com/paper/discretization-and-machine-learning
Repo
Framework

Leveraging Uncertainties for Deep Multi-modal Object Detection in Autonomous Driving


Title	Leveraging Uncertainties for Deep Multi-modal Object Detection in Autonomous Driving
Authors	Di Feng, Yifan Cao, Lars Rosenbaum, Fabian Timm, Klaus Dietmayer
Abstract	This work presents a probabilistic deep neural network that combines LiDAR point clouds and RGB camera images for robust, accurate 3D object detection. We explicitly model uncertainties in the classification and regression tasks, and leverage uncertainties to train the fusion network via a sampling mechanism. We validate our method on three datasets with challenging real-world driving scenarios. Experimental results show that the predicted uncertainties reflect complex environmental uncertainty like difficulties of a human expert to label objects. The results also show that our method consistently improves the Average Precision by up to 7% compared to the baseline method. When sensors are temporally misaligned, the sampling method improves the Average Precision by up to 20%, showing its high robustness against noisy sensor inputs.
Tasks	3D Object Detection, Autonomous Driving, Object Detection
Published	2020-02-01
URL	https://arxiv.org/abs/2002.00216v1
PDF	https://arxiv.org/pdf/2002.00216v1.pdf
PWC	https://paperswithcode.com/paper/leveraging-uncertainties-for-deep-multi-modal
Repo
Framework

Dynamic Impact for Ant Colony Optimization algorithm


Title	Dynamic Impact for Ant Colony Optimization algorithm
Authors	Jonas Skackauskas, Tatiana Kalganova, Ian Dear, Mani Janakram
Abstract	This paper proposes an extension to the Ant Colony Optimization (ACO) algorithm called Dynamic Impact. Dynamic Impact is designed to solve challenging optimization problems that have a nonlinear relationship between resource consumption and fitness in relation to other parts of the optimized solution. The proposed method is tested against a complex real-world Microchip Manufacturing Plant Production Floor Optimization (MMPPFO) problem, as well as the theoretical benchmark Multi-Dimensional Knapsack Problem (MKP). MMPPFO is a non-trivial optimization problem, because the solution fitness value depends on a collection of wafer-lots without prioritization of any individual wafer-lot. Using Dynamic Impact on single-objective optimization, the fitness value is improved by 33.2%. Furthermore, MKP benchmark instances of small complexity have been solved with a 100% success rate where a high degree of solution sparseness is observed, and large instances have shown the average gap improved by 4.26 times. The algorithm implementation demonstrated superior performance across small and large datasets and sparse optimization problems.
Tasks
Published	2020-02-10
URL	https://arxiv.org/abs/2002.04099v1
PDF	https://arxiv.org/pdf/2002.04099v1.pdf
PWC	https://paperswithcode.com/paper/dynamic-impact-for-ant-colony-optimization
Repo
Framework