Paper Group AWR 279
Papers in this group: Exploiting temporal consistency for real-time video depth estimation. Unsupervised Single Image Underwater Depth Estimation. Shape Robust Text Detection with Progressive Scale Expansion Network. Segmentation of Roots in Soil with U-Net. Multi-scale Cell Instance Segmentation with Keypoint Graph based Bounding Boxes. Human Pose Estimation for Real-World Crowded Scenarios. FlatteNet: A Simple Versatile Framework for Dense Pixelwise Prediction. Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network. ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records. High-Resolution Traffic Sensing with Autonomous Vehicles. Transformer-CNN: Fast and Reliable tool for QSAR. Video Classification with Channel-Separated Convolutional Networks. Publicly Available Clinical BERT Embeddings. Large Scale Holistic Video Understanding. Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding.
Exploiting temporal consistency for real-time video depth estimation
Title | Exploiting temporal consistency for real-time video depth estimation |
Authors | Haokui Zhang, Chunhua Shen, Ying Li, Yuanzhouhan Cao, Yu Liu, Youliang Yan |
Abstract | Accuracy of depth estimation from static images has been significantly improved recently by exploiting hierarchical features from deep convolutional neural networks (CNNs). Compared with static images, vast information exists among video frames and can be exploited to improve depth estimation performance. In this work, we focus on exploring temporal information from monocular videos for depth estimation. Specifically, we take advantage of convolutional long short-term memory (CLSTM) and propose a novel spatial-temporal CLSTM (ST-CLSTM) structure. Our ST-CLSTM structure can capture not only the spatial features but also the temporal correlations/consistency among consecutive video frames with a negligible increase in computational cost. Additionally, in order to maintain temporal consistency among the estimated depth frames, we apply a generative adversarial learning scheme and design a temporal consistency loss. The temporal consistency loss is combined with the spatial loss to update the model in an end-to-end fashion. By taking advantage of the temporal information, we build a video depth estimation framework that runs in real time and generates visually pleasant results. Moreover, our approach is flexible and can be generalized to most existing depth estimation frameworks. Code is available at: https://tinyurl.com/STCLSTM |
Tasks | Depth Estimation |
Published | 2019-08-10 |
URL | https://arxiv.org/abs/1908.03706v1 |
https://arxiv.org/pdf/1908.03706v1.pdf | |
PWC | https://paperswithcode.com/paper/exploiting-temporal-consistency-for-real-time |
Repo | https://github.com/Adelaide-AI-Group/ST-CLSTM |
Framework | pytorch |
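The paper combines a spatial depth loss with an adversarial temporal-consistency term. The sketch below illustrates that combination in PyTorch; the discriminator design, the loss weight `lambda_t` and the function names are assumptions for illustration, not the released ST-CLSTM code.

```python
# Sketch of the spatial + adversarial temporal-consistency objective described above.
# The discriminator architecture and weighting are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalDiscriminator(nn.Module):
    """Judges whether a short sequence of depth maps is temporally consistent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(16, 1, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, depth_seq):            # depth_seq: (B, 1, T, H, W)
        return torch.sigmoid(self.net(depth_seq)).flatten(1)

def generator_loss(pred_seq, gt_seq, disc, lambda_t=0.1):
    """Spatial L1 loss plus an adversarial temporal-consistency term."""
    spatial = nn.functional.l1_loss(pred_seq, gt_seq)
    adv = nn.functional.binary_cross_entropy(
        disc(pred_seq),
        torch.ones(pred_seq.size(0), 1, device=pred_seq.device))
    return spatial + lambda_t * adv
```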
Unsupervised Single Image Underwater Depth Estimation
Title | Unsupervised Single Image Underwater Depth Estimation |
Authors | Honey Gupta, Kaushik Mitra |
Abstract | Depth estimation from a single underwater image is one of the most challenging problems and is highly ill-posed. Due to the absence of large generalized underwater depth datasets and the difficulty in obtaining ground truth depth-maps, supervised learning techniques such as direct depth regression cannot be used. In this paper, we propose an unsupervised method for depth estimation from a single underwater image taken "in the wild" by using haze as a cue for depth. Our approach is based on indirect depth-map estimation where we learn the mapping functions between unpaired RGB-D terrestrial images and arbitrary underwater images to estimate the required depth-map. We propose a method which is based on the principles of cycle-consistent learning and uses dense-block based auto-encoders as generator networks. We evaluate and compare our method both quantitatively and qualitatively on various underwater images with diverse attenuation and scattering conditions and show that our method produces state-of-the-art results for unsupervised depth estimation from a single underwater image. |
Tasks | Depth Estimation |
Published | 2019-05-25 |
URL | https://arxiv.org/abs/1905.10595v2 |
https://arxiv.org/pdf/1905.10595v2.pdf | |
PWC | https://paperswithcode.com/paper/unsupervised-single-image-underwater-depth |
Repo | https://github.com/honeygupta/UW-Net |
Framework | tf |
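The method above rests on cycle-consistent learning between unpaired underwater images and terrestrial RGB-D images. Below is a minimal sketch of the cycle-consistency term, assuming two generator networks `G_uw2rgbd` and `G_rgbd2uw`; the names and the L1 weight are illustrative, not taken from the UW-Net code.

```python
# Minimal cycle-consistency sketch: images translated to the other domain and
# back should reconstruct the input. Generator names and weight are assumptions.
import torch.nn.functional as F

def cycle_consistency_loss(uw_img, rgbd_img, G_uw2rgbd, G_rgbd2uw, lam=10.0):
    # underwater -> terrestrial RGB-D -> underwater
    rec_uw = G_rgbd2uw(G_uw2rgbd(uw_img))
    # terrestrial RGB-D -> underwater -> terrestrial RGB-D
    rec_rgbd = G_uw2rgbd(G_rgbd2uw(rgbd_img))
    return lam * (F.l1_loss(rec_uw, uw_img) + F.l1_loss(rec_rgbd, rgbd_img))
```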
Shape Robust Text Detection with Progressive Scale Expansion Network
Title | Shape Robust Text Detection with Progressive Scale Expansion Network |
Authors | Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, Shuai Shao |
Abstract | Scene text detection has witnessed rapid progress, especially with the recent development of convolutional neural networks. However, there still exist two challenges that prevent these algorithms from being applied in industry. On the one hand, most state-of-the-art algorithms require quadrangular bounding boxes, which are inaccurate for locating text with arbitrary shapes. On the other hand, two text instances that are close to each other may lead to a false detection covering both instances. Traditionally, segmentation-based approaches can relieve the first problem but usually fail to solve the second challenge. To address these two challenges, in this paper we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates kernels of different scales for each text instance and gradually expands the minimal-scale kernel to the text instance with its complete shape. Because there are large geometrical margins among the minimal-scale kernels, our method effectively splits close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curved texts, PSENet achieves an F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-the-art algorithms by 6.6%. The code will be released in the future. |
Tasks | Scene Text Detection |
Published | 2019-03-28 |
URL | https://arxiv.org/abs/1903.12473v2 |
https://arxiv.org/pdf/1903.12473v2.pdf | |
PWC | https://paperswithcode.com/paper/shape-robust-text-detection-with-progressive-1 |
Repo | https://github.com/whai362/PSENet |
Framework | tf |
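The core of PSENet is the progressive scale expansion step: labels seeded from the smallest kernel are grown breadth-first into each successively larger kernel. The following is a simplified re-implementation of that expansion for illustration, not the authors' optimized code.

```python
# Simplified progressive scale expansion: seed labels from the minimal kernel,
# then grow them BFS-style into each larger kernel mask.
import numpy as np
from collections import deque
from scipy.ndimage import label as cc_label

def progressive_scale_expansion(kernels):
    """kernels: list of binary (H, W) masks, ordered smallest -> largest."""
    labels, _ = cc_label(kernels[0])          # connected components of minimal kernel
    for kernel in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < labels.shape[0] and 0 <= nx < labels.shape[1]
                        and kernel[ny, nx] and labels[ny, nx] == 0):
                    labels[ny, nx] = labels[y, x]   # first label to arrive wins
                    queue.append((ny, nx))
    return labels
```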
Segmentation of Roots in Soil with U-Net
Title | Segmentation of Roots in Soil with U-Net |
Authors | Abraham George Smith, Jens Petersen, Raghavendra Selvan, Camilla Ruø Rasmussen |
Abstract | Plant root research can provide a way to attain stress-tolerant crops that produce greater yield in a diverse array of conditions. Phenotyping roots in soil is often challenging due to the roots being difficult to access and the use of time-consuming manual methods. Rhizotrons allow visual inspection of root growth through transparent surfaces. Agronomists currently manually label photographs of roots obtained from rhizotrons using a line-intersect method to obtain root length density and rooting depth measurements, which are essential for their experiments. We investigate the effectiveness of an automated image segmentation method based on the U-Net Convolutional Neural Network (CNN) architecture to enable such measurements. We design a dataset of 50 annotated Chicory (Cichorium intybus L.) root images which we use to train, validate and test the system, and compare against a baseline built using the Frangi vesselness filter. We obtain metrics using manual annotations and line-intersect counts. Our results on the held-out data show our proposed automated segmentation system to be a viable solution for detecting and quantifying roots. We evaluate our system using 867 images for which we have obtained line-intersect counts, attaining a Spearman rank correlation of 0.9748 and an $r^2$ of 0.9217. We also achieve an $F_1$ of 0.7 when comparing the automated segmentation to the manual annotations, with our automated segmentation system producing segmentations of higher quality than the manual annotations for large portions of the image. |
Tasks | Semantic Segmentation |
Published | 2019-02-28 |
URL | http://arxiv.org/abs/1902.11050v2 |
http://arxiv.org/pdf/1902.11050v2.pdf | |
PWC | https://paperswithcode.com/paper/segmentation-of-roots-in-soil-with-u-net |
Repo | https://github.com/Abe404/segmentation_of_roots_in_soil_with_unet |
Framework | pytorch |
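The reported numbers are a Spearman rank correlation and $r^2$ against line-intersect counts plus an $F_1$ against manual annotations. The sketch below shows how such an evaluation could be computed, assuming predicted probability masks and a 0.5 threshold; the linear calibration before $r^2$ is an assumption, not necessarily the paper's protocol.

```python
# Illustrative evaluation: Spearman rank correlation and r^2 of predicted root-pixel
# counts vs. line-intersect counts, plus pixelwise F1 against manual annotations.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score, r2_score

def evaluate(pred_masks, line_intersect_counts, gt_masks, threshold=0.5):
    pred_counts = np.array([(m > threshold).sum() for m in pred_masks], dtype=float)
    rho, _ = spearmanr(pred_counts, line_intersect_counts)
    # counts live on different scales, so fit a simple linear calibration before r^2
    a, b = np.polyfit(pred_counts, line_intersect_counts, 1)
    r2 = r2_score(line_intersect_counts, a * pred_counts + b)
    f1 = f1_score(
        np.concatenate([g.ravel().astype(int) for g in gt_masks]),
        np.concatenate([(m > threshold).ravel().astype(int) for m in pred_masks]))
    return {"spearman": rho, "r2": r2, "f1": f1}
```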
Multi-scale Cell Instance Segmentation with Keypoint Graph based Bounding Boxes
Title | Multi-scale Cell Instance Segmentation with Keypoint Graph based Bounding Boxes |
Authors | Jingru Yi, Pengxiang Wu, Qiaoying Huang, Hui Qu, Bo Liu, Daniel J. Hoeppner, Dimitris N. Metaxas |
Abstract | Most existing methods handle cell instance segmentation problems directly without relying on additional detection boxes. These methods generally fail to separate touching cells due to the lack of global understanding of the objects. In contrast, box-based instance segmentation solves this problem by combining object detection with segmentation. However, existing methods typically utilize anchor box-based detectors, which leads to inferior instance segmentation performance due to the class imbalance issue. In this paper, we propose a new box-based cell instance segmentation method. In particular, we first detect the five pre-defined points of a cell via keypoint detection. Then we group these points according to a keypoint graph and subsequently extract the bounding box for each cell. Finally, cell segmentation is performed on feature maps within the bounding boxes. We validate our method on two cell datasets with distinct object shapes, and empirically demonstrate the superiority of our method compared to other instance segmentation techniques. Code is available at: https://github.com/yijingru/KG_Instance_Segmentation. |
Tasks | Cell Segmentation, Instance Segmentation, Object Detection, Semantic Segmentation |
Published | 2019-07-22 |
URL | https://arxiv.org/abs/1907.09140v2 |
https://arxiv.org/pdf/1907.09140v2.pdf | |
PWC | https://paperswithcode.com/paper/multi-scale-cell-instance-segmentation-with |
Repo | https://github.com/yijingru/KG_Instance_Segmentation |
Framework | pytorch |
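The pipeline detects five pre-defined keypoints per cell, groups them via a keypoint graph, and derives a bounding box from each group. A minimal sketch of the final box-extraction step for one grouped cell follows; the padding ratio is an illustrative assumption.

```python
# Turn one cell's grouped keypoints into a (slightly padded) bounding box.
import numpy as np

def keypoints_to_box(keypoints, pad_ratio=0.1):
    """keypoints: (5, 2) array of (x, y) points belonging to one cell."""
    pts = np.asarray(keypoints, dtype=float)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    pad_x, pad_y = pad_ratio * (x2 - x1), pad_ratio * (y2 - y1)
    return x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y
```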
Human Pose Estimation for Real-World Crowded Scenarios
Title | Human Pose Estimation for Real-World Crowded Scenarios |
Authors | Thomas Golda, Tobias Kalb, Arne Schumann, Jürgen Beyerer |
Abstract | Human pose estimation has recently made significant progress with the adoption of deep convolutional neural networks. Its many applications have attracted tremendous interest in recent years. However, many practical applications require pose estimation for human crowds, which is still a rarely addressed problem. In this work, we explore methods to optimize pose estimation for human crowds, focusing on challenges introduced by dense crowds, such as occlusions, people in close proximity to each other, and partial visibility of people. In order to address these challenges, we evaluate three aspects of a pose detection approach: i) a data augmentation method to introduce robustness to occlusions, ii) the explicit detection of occluded body parts, and iii) the use of synthetically generated datasets. The first approach to improve the accuracy in crowded scenarios is to generate occlusions at training time using person and object cutouts from the object recognition dataset COCO (Common Objects in Context). Furthermore, the synthetically generated dataset JTA (Joint Track Auto) is evaluated for use in real-world crowd applications. In order to overcome the transfer gap of JTA, originating from its low pose variety and less dense crowds, an extension dataset is created to ease its use in real-world applications. Additionally, the occlusion flags provided with JTA are utilized to train a model which explicitly distinguishes between occluded and visible body parts in two distinct branches. The combination of the proposed additions to the baseline method helps to improve the overall accuracy by 4.7% AP and thereby provides comparable results to current state-of-the-art approaches on the respective dataset. |
Tasks | Data Augmentation, Object Recognition, Pose Estimation |
Published | 2019-07-16 |
URL | https://arxiv.org/abs/1907.06922v1 |
https://arxiv.org/pdf/1907.06922v1.pdf | |
PWC | https://paperswithcode.com/paper/human-pose-estimation-for-real-world-crowded |
Repo | https://github.com/thomasgolda/Human-Pose-Estimation-for-Real-World-Crowded-Scenarios |
Framework | none |
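The first proposed addition is occlusion augmentation with person and object cutouts from COCO pasted onto training images. A minimal sketch of such an augmentation follows, assuming RGBA cutout patches are already available; the scale range and placement policy are illustrative assumptions.

```python
# Paste a randomly scaled cutout (RGBA patch) at a random location of the image
# to simulate occlusion at training time.
import random
from PIL import Image

def paste_random_cutout(image, cutouts, scale=(0.2, 0.5)):
    """image: PIL RGB image; cutouts: list of RGBA PIL patches with transparency."""
    patch = random.choice(cutouts).copy()
    s = random.uniform(*scale)
    patch = patch.resize((max(1, int(patch.width * s)), max(1, int(patch.height * s))))
    x = random.randint(0, max(0, image.width - patch.width))
    y = random.randint(0, max(0, image.height - patch.height))
    out = image.copy()
    out.paste(patch, (x, y), mask=patch)   # alpha channel masks the paste
    return out
```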
FlatteNet: A Simple Versatile Framework for Dense Pixelwise Prediction
Title | FlatteNet: A Simple Versatile Framework for Dense Pixelwise Prediction |
Authors | Xin Cai, Yi-Fei Pu |
Abstract | In this paper, we focus on devising a versatile framework for dense pixelwise prediction, whose goal is to assign a discrete or continuous label to each pixel of an image. It is well known that the reduced feature resolution due to repeated subsampling operations poses a serious challenge to Fully Convolutional Network (FCN) based models. In contrast to commonly used strategies, such as dilated convolution and encoder-decoder structures, we introduce the Flattening Module to produce high-resolution predictions without either removing any subsampling operations or building a complicated decoder module. In addition, the Flattening Module is lightweight and can be easily combined with any existing FCN, allowing the model builder to trade off among model size, computational cost and accuracy by simply choosing different backbone networks. We empirically demonstrate the effectiveness of the proposed Flattening Module through competitive results in human pose estimation on MPII, semantic segmentation on PASCAL-Context and object detection on PASCAL VOC. We hope that the proposed approach can serve as a simple and strong alternative to the current dominant dense pixelwise prediction frameworks. |
Tasks | Object Detection, Pose Estimation, Semantic Segmentation |
Published | 2019-09-22 |
URL | https://arxiv.org/abs/1909.09961v3 |
https://arxiv.org/pdf/1909.09961v3.pdf | |
PWC | https://paperswithcode.com/paper/190909961 |
Repo | https://github.com/TotalVariation/Flattenet |
Framework | pytorch |
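The abstract does not detail the Flattening Module's internals; the sketch below shows one lightweight way to obtain full-resolution predictions from subsampled backbone features by mapping channels into space (pixel-shuffle style). It illustrates the idea of avoiding a heavy decoder, not the paper's exact module.

```python
# Lightweight full-resolution prediction head: a 1x1 projection followed by a
# channel-to-space rearrangement. Illustrative, not the published module.
import torch.nn as nn

class FlatteningHead(nn.Module):
    def __init__(self, in_channels, num_classes, stride=16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, num_classes * stride * stride, kernel_size=1)
        self.up = nn.PixelShuffle(stride)   # (B, C*s*s, H, W) -> (B, C, H*s, W*s)

    def forward(self, features):
        return self.up(self.proj(features))
```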
Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network
Title | Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network |
Authors | Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, Wei Liu |
Abstract | In this paper, we propose to guide video caption generation with Part-of-Speech (POS) information, based on a gated fusion of multiple representations of the input video. We construct a novel gated fusion network, with a particularly designed cross-gating (CG) block, to effectively encode and fuse different types of representations, e.g., the motion and content features of an input video. A POS sequence generator relies on this fused representation to predict the global syntactic structure, which is thereafter leveraged to guide caption generation and control the syntax of the generated sentence. Specifically, a gating strategy is proposed to dynamically and adaptively incorporate the global syntactic POS information into the decoder for generating each word. Experimental results on two benchmark datasets, namely MSR-VTT and MSVD, demonstrate that the proposed model can well exploit complementary information from multiple representations, resulting in improved performance. Moreover, the generated global POS information can well capture the global syntactic structure of the sentence, and thus be exploited to control the syntactic structure of the description. Such POS information not only boosts the video captioning performance but also improves the diversity of the generated captions. Our code is at: https://github.com/vsislab/Controllable_XGating. |
Tasks | Video Captioning |
Published | 2019-08-27 |
URL | https://arxiv.org/abs/1908.10072v1 |
https://arxiv.org/pdf/1908.10072v1.pdf | |
PWC | https://paperswithcode.com/paper/controllable-video-captioning-with-pos |
Repo | https://github.com/vsislab/Controllable_XGating |
Framework | pytorch |
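The gated fusion network relies on a cross-gating (CG) block that fuses motion and content features. Below is a minimal sketch of one plausible cross-gating layer, where each modality is re-weighted by a gate computed from the other; the layer sizes and the final sum fusion are assumptions, not the released implementation.

```python
# One plausible cross-gating fusion of motion and content feature vectors.
import torch
import torch.nn as nn

class CrossGating(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate_m = nn.Linear(dim, dim)   # gate for motion, driven by content
        self.gate_c = nn.Linear(dim, dim)   # gate for content, driven by motion

    def forward(self, motion, content):     # both: (B, dim)
        gated_motion = motion * torch.sigmoid(self.gate_m(content))
        gated_content = content * torch.sigmoid(self.gate_c(motion))
        return gated_motion + gated_content
```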
ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records
Title | ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records |
Authors | Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki |
Abstract | We propose a Historical Document Reading Challenge on Large Chinese Structured Family Records, in short ICDAR2019 HDRC CHINESE. The objective of the proposed competition is to recognize and analyze the layout, and finally to detect and recognize the text lines and characters of a large historical document collection containing more than 20,000 pages, kindly provided by FamilySearch. |
Tasks | |
Published | 2019-03-08 |
URL | https://arxiv.org/abs/1903.03341v3 |
https://arxiv.org/pdf/1903.03341v3.pdf | |
PWC | https://paperswithcode.com/paper/icdar-2019-historical-document-reading |
Repo | https://github.com/DIVA-DIA/DIVA_Layout_Analysis_Evaluator |
Framework | none |
High-Resolution Traffic Sensing with Autonomous Vehicles
Title | High-Resolution Traffic Sensing with Autonomous Vehicles |
Authors | Wei Ma, Sean Qian |
Abstract | The last decades have witnessed the breakthrough of autonomous vehicles (AVs), and the perception capabilities of AVs have been dramatically improved. Various sensors installed on AVs, including, but not limited to, LiDAR, radar, camera and stereovision, will be collecting massive data and perceiving the surrounding traffic states continuously. In fact, a fleet of AVs can serve as floating (or probe) sensors, which can be utilized to infer traffic information while cruising around the roadway networks. In contrast, conventional traffic sensing methods rely on fixed traffic sensors such as loop detectors, cameras and microwave vehicle detectors. Due to the high cost of conventional traffic sensors, traffic state data are usually obtained in a low-frequency and sparse manner. In view of this, this paper leverages the rich data collected through AVs to propose a high-resolution traffic sensing framework. The proposed framework estimates the fundamental traffic state variables, namely flow, density and speed, in high spatio-temporal resolution, and it is developed under different levels of AV perception capabilities and low AV market penetration rates. The Next Generation Simulation (NGSIM) data is adopted to examine the accuracy and robustness of the proposed framework. Experimental results show that the proposed estimation framework achieves high accuracy even with a low AV market penetration rate. Sensitivity analyses regarding AV penetration rate, sensor configuration, and perception accuracy are also conducted. This study will help policymakers and private sectors (e.g., Uber, Waymo) to understand the value of AVs, especially the value of the massive data collected by AVs, in traffic operation and management. |
Tasks | Autonomous Vehicles |
Published | 2019-10-06 |
URL | https://arxiv.org/abs/1910.02376v1 |
https://arxiv.org/pdf/1910.02376v1.pdf | |
PWC | https://paperswithcode.com/paper/high-resolution-traffic-sensing-with |
Repo | https://github.com/Lemma1/NGSIM-interface |
Framework | none |
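The estimated quantities (flow, density, speed) are tied by the fundamental relation q = k · v. A tiny worked check with made-up illustrative numbers:

```python
# Fundamental traffic relation: flow = density * speed (illustrative numbers only).
density = 30.0          # vehicles per km per lane
speed = 60.0            # km per hour
flow = density * speed  # = 1800 vehicles per hour per lane
print(flow)
```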
Transformer-CNN: Fast and Reliable tool for QSAR
Title | Transformer-CNN: Fast and Reliable tool for QSAR |
Authors | Pavel Karpov, Guillaume Godin, Igor V. Tetko |
Abstract | We present SMILES embeddings derived from the internal encoder state of a Transformer [1] model trained to canonicalize SMILES as a Seq2Seq problem. Using a CharNN [2] architecture on top of the embeddings results in higher-quality, interpretable QSAR/QSPR models on diverse benchmark datasets, including regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, and thus the prognosis is based on an internal consensus. Because both the augmentation and transfer learning are based on embeddings, the method provides good results for small datasets. We discuss the reasons for such effectiveness and draft future directions for the development of the method. The source code and the embeddings needed to train a QSAR model are available at https://github.com/bigchem/transformer-cnn. The repository also has a standalone program for QSAR prognosis which calculates individual atom contributions, thus interpreting the model's results. The OCHEM [3] environment (https://ochem.eu) hosts an online implementation of the proposed method. |
Tasks | Transfer Learning |
Published | 2019-10-21 |
URL | https://arxiv.org/abs/1911.06603v3 |
https://arxiv.org/pdf/1911.06603v3.pdf | |
PWC | https://paperswithcode.com/paper/transformer-cnn-fast-and-reliable-tool-for |
Repo | https://github.com/bigchem/transformer-cnn |
Framework | tf |
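Training and inference rely on SMILES augmentation, i.e., enumerating randomized non-canonical SMILES of the same molecule and averaging the predictions over them. A small sketch using RDKit follows; `predict_fn` is a placeholder for a trained QSAR model, not part of the released code.

```python
# SMILES augmentation with RDKit: enumerate randomized SMILES and average
# predictions over them (the "internal consensus" mentioned above).
from rdkit import Chem

def randomized_smiles(smiles, n=10):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)]

def augmented_prediction(smiles, predict_fn, n=10):
    variants = randomized_smiles(smiles, n)
    preds = [predict_fn(s) for s in variants]
    return sum(preds) / len(preds)
```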
Video Classification with Channel-Separated Convolutional Networks
Title | Video Classification with Channel-Separated Convolutional Networks |
Authors | Du Tran, Heng Wang, Lorenzo Torresani, Matt Feiszli |
Abstract | Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture – Channel-Separated Convolutional Network (CSN) – which is simple, efficient, yet accurate. On Sports1M, Kinetics, and Something-Something, our CSNs are comparable with or better than the state-of-the-art while being 2-3 times more efficient. |
Tasks | Action Classification, Action Recognition In Videos, Image Classification, Video Classification |
Published | 2019-04-04 |
URL | https://arxiv.org/abs/1904.02811v4 |
https://arxiv.org/pdf/1904.02811v4.pdf | |
PWC | https://paperswithcode.com/paper/video-classification-with-channel-separated |
Repo | https://github.com/facebookresearch/VMZ |
Framework | caffe2 |
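A channel-separated convolution factorizes a 3D convolution into a 1x1x1 convolution for channel interactions and a depthwise 3x3x3 convolution for spatiotemporal interactions. A minimal PyTorch sketch of such a block follows; the exact block layout used in CSN is an assumption here.

```python
# Channel-separated 3D convolution: pointwise (channel mixing) followed by a
# depthwise spatiotemporal convolution.
import torch.nn as nn

class ChannelSeparatedConv3d(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False)
        self.depthwise = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=out_channels, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                   # x: (B, C, T, H, W)
        return self.act(self.bn(self.depthwise(self.pointwise(x))))
```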
Publicly Available Clinical BERT Embeddings
Title | Publicly Available Clinical BERT Embeddings |
Authors | Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, Matthew B. A. McDermott |
Abstract | Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on three common clinical NLP tasks as compared to nonspecific embeddings. We find that these domain-specific models are not as performant on two clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non-de-identified task text. |
Tasks | |
Published | 2019-04-06 |
URL | https://arxiv.org/abs/1904.03323v3 |
https://arxiv.org/pdf/1904.03323v3.pdf | |
PWC | https://paperswithcode.com/paper/publicly-available-clinical-bert-embeddings |
Repo | https://github.com/ManasRMohanty/DS5500-capstone |
Framework | none |
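The released clinical BERT weights can be used to embed clinical text via the Hugging Face `transformers` library. A short sketch follows; the hub identifier below is the one commonly associated with these weights, so confirm the exact model ID in the linked repository.

```python
# Embed a clinical sentence with publicly released clinical BERT weights.
# The hub ID is an assumption; verify it against the paper's repository.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Patient presents with chest pain.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]   # [CLS] token embedding
```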
Large Scale Holistic Video Understanding
Title | Large Scale Holistic Video Understanding |
Authors | Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jurgen Gall, Rainer Stiefelhagen, Luc Van Gool |
Abstract | Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale "Holistic Video Understanding Dataset" (HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approximately 572k videos in total with 9 million annotations for the training, validation and test sets, spanning over 3457 labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes and concepts, which naturally captures real-world scenarios. Further, we introduce a new spatio-temporal deep neural network architecture called "Holistic Appearance and Temporal Network" (HATNet) that builds on fusing 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet focuses on the multi-label and multi-task learning problem and is trained in an end-to-end manner. The experiments show that HATNet trained on HVU outperforms current state-of-the-art methods on challenging human action datasets: HMDB51, UCF101, and Kinetics. The dataset and code will be made publicly available. |
Tasks | Action Classification, Action Recognition In Videos, Multi-Task Learning, Temporal Action Localization, Video Recognition, Video Understanding |
Published | 2019-04-25 |
URL | https://arxiv.org/abs/1904.11451v2 |
https://arxiv.org/pdf/1904.11451v2.pdf | |
PWC | https://paperswithcode.com/paper/holistic-large-scale-video-understanding |
Repo | https://github.com/holistic-video-understanding/Mini-HVU |
Framework | none |
Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding
Title | Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding |
Authors | Shahin Khobahi, Mojtaba Soltanalian |
Abstract | Parameterized mathematical models play a central role in understanding and design of complex information systems. However, they often cannot take into account the intricate interactions innate to such systems. In contrast, purely data-driven approaches do not need explicit mathematical models for data generation and have a wider applicability, at the cost of interpretability. In this paper, we consider the design of a one-bit compressive variational autoencoder and propose a novel hybrid model-based and data-driven methodology that allows us not only to design the sensing matrix and the quantization thresholds for one-bit data acquisition, but also to learn the latent parameters of iterative optimization algorithms specifically designed for the problem of one-bit sparse signal recovery. In addition, the proposed method has the ability to adaptively learn the proper quantization thresholds, paving the way for amplitude recovery in one-bit compressive sensing. Our results demonstrate a significant improvement compared to state-of-the-art model-based algorithms. |
Tasks | Compressive Sensing, Quantization |
Published | 2019-11-27 |
URL | https://arxiv.org/abs/1911.12410v1 |
https://arxiv.org/pdf/1911.12410v1.pdf | |
PWC | https://paperswithcode.com/paper/model-aware-deep-architectures-for-one-bit |
Repo | https://github.com/skhobahi/deep1bitVAE |
Framework | none |
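The underlying measurement model keeps only the sign of each thresholded linear measurement of a sparse signal. A small NumPy sketch of generating such one-bit observations follows; the dimensions, thresholds and noise level are illustrative assumptions.

```python
# One-bit compressive sensing measurement model: y = sign(A x + noise - tau).
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 128, 64, 5                      # signal length, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse signal
A = rng.standard_normal((m, n)) / np.sqrt(m)                  # sensing matrix
tau = np.zeros(m)                                             # quantization thresholds
noise = 0.01 * rng.standard_normal(m)
y = np.sign(A @ x + noise - tau)          # one-bit observations in {-1, +1}
```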