Paper Group AWR 279
Papers in this group: Exploiting temporal consistency for real-time video depth estimation. Unsupervised Single Image Underwater Depth Estimation. Shape Robust Text Detection with Progressive Scale Expansion Network. Segmentation of Roots in Soil with U-Net. Multi-scale Cell Instance Segmentation with Keypoint Graph based Bounding Boxes. Human Pose Estimation for Real-World Crowded Scenarios. FlatteNet: A Simple Versatile Framework for Dense Pixelwise Prediction. Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network. ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records. High-Resolution Traffic Sensing with Autonomous Vehicles. Transformer-CNN: Fast and Reliable tool for QSAR. Video Classification with Channel-Separated Convolutional Networks. Publicly Available Clinical BERT Embeddings. Large Scale Holistic Video Understanding. Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding.
Exploiting temporal consistency for real-time video depth estimation
Title | Exploiting temporal consistency for real-time video depth estimation |
Authors | Haokui Zhang, Chunhua Shen, Ying Li, Yuanzhouhan Cao, Yu Liu, Youliang Yan |
Abstract | Accuracy of depth estimation from static images has been significantly improved recently by exploiting hierarchical features from deep convolutional neural networks (CNNs). Compared with static images, vast information exists among video frames and can be exploited to improve depth estimation performance. In this work, we focus on exploring temporal information from monocular videos for depth estimation. Specifically, we take advantage of convolutional long short-term memory (CLSTM) and propose a novel spatial-temporal CLSTM (ST-CLSTM) structure. Our ST-CLSTM structure can capture not only the spatial features but also the temporal correlations/consistency among consecutive video frames with a negligible increase in computational cost. Additionally, in order to maintain temporal consistency among the estimated depth frames, we apply a generative adversarial learning scheme and design a temporal consistency loss. The temporal consistency loss is combined with the spatial loss to update the model in an end-to-end fashion. By taking advantage of the temporal information, we build a video depth estimation framework that runs in real time and generates visually pleasant results. Moreover, our approach is flexible and can be generalized to most existing depth estimation frameworks. Code is available at: https://tinyurl.com/STCLSTM |
Tasks | Depth Estimation |
Published | 2019-08-10 |
URL | https://arxiv.org/abs/1908.03706v1 |
https://arxiv.org/pdf/1908.03706v1.pdf | |
PWC | https://paperswithcode.com/paper/exploiting-temporal-consistency-for-real-time |
Repo | https://github.com/Adelaide-AI-Group/ST-CLSTM |
Framework | pytorch |
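The paper combines a spatial depth loss with an adversarial temporal-consistency term. The sketch below illustrates that combination in PyTorch; the discriminator design, the loss weight `lambda_t` and the function names are assumptions for illustration, not the released ST-CLSTM code.

```python
# Sketch of the spatial + adversarial temporal-consistency objective described above.
# The discriminator architecture and weighting are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalDiscriminator(nn.Module):
    """Judges whether a short sequence of depth maps is temporally consistent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(16, 1, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, depth_seq):            # depth_seq: (B, 1, T, H, W)
        return torch.sigmoid(self.net(depth_seq)).flatten(1)

def generator_loss(pred_seq, gt_seq, disc, lambda_t=0.1):
    """Spatial L1 loss plus an adversarial temporal-consistency term."""
    spatial = nn.functional.l1_loss(pred_seq, gt_seq)
    adv = nn.functional.binary_cross_entropy(
        disc(pred_seq),
        torch.ones(pred_seq.size(0), 1, device=pred_seq.device))
    return spatial + lambda_t * adv
```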
Unsupervised Single Image Underwater Depth Estimation
Title | Unsupervised Single Image Underwater Depth Estimation |
Authors | Honey Gupta, Kaushik Mitra |
Abstract | Depth estimation from a single underwater image is one of the most challenging problems and is highly ill-posed. Due to the absence of large generalized underwater depth datasets and the difficulty in obtaining ground truth depth-maps, supervised learning techniques such as direct depth regression cannot be used. In this paper, we propose an unsupervised method for depth estimation from a single underwater image taken "in the wild" by using haze as a cue for depth. Our approach is based on indirect depth-map estimation where we learn the mapping functions between unpaired RGB-D terrestrial images and arbitrary underwater images to estimate the required depth-map. We propose a method which is based on the principles of cycle-consistent learning and uses dense-block based auto-encoders as generator networks. We evaluate and compare our method both quantitatively and qualitatively on various underwater images with diverse attenuation and scattering conditions and show that our method produces state-of-the-art results for unsupervised depth estimation from a single underwater image. |
Tasks | Depth Estimation |
Published | 2019-05-25 |
URL | https://arxiv.org/abs/1905.10595v2 |
https://arxiv.org/pdf/1905.10595v2.pdf | |
PWC | https://paperswithcode.com/paper/unsupervised-single-image-underwater-depth |
Repo | https://github.com/honeygupta/UW-Net |
Framework | tf |
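The method above rests on cycle-consistent learning between unpaired underwater images and terrestrial RGB-D images. Below is a minimal sketch of the cycle-consistency term, assuming two generator networks `G_uw2rgbd` and `G_rgbd2uw`; the names and the L1 weight are illustrative, not taken from the UW-Net code.

```python
# Minimal cycle-consistency sketch: images translated to the other domain and
# back should reconstruct the input. Generator names and weight are assumptions.
import torch.nn.functional as F

def cycle_consistency_loss(uw_img, rgbd_img, G_uw2rgbd, G_rgbd2uw, lam=10.0):
    # underwater -> terrestrial RGB-D -> underwater
    rec_uw = G_rgbd2uw(G_uw2rgbd(uw_img))
    # terrestrial RGB-D -> underwater -> terrestrial RGB-D
    rec_rgbd = G_uw2rgbd(G_rgbd2uw(rgbd_img))
    return lam * (F.l1_loss(rec_uw, uw_img) + F.l1_loss(rec_rgbd, rgbd_img))
```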
Shape Robust Text Detection with Progressive Scale Expansion Network
Title | Shape Robust Text Detection with Progressive Scale Expansion Network |
Authors | Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, Shuai Shao |
Abstract | Scene text detection has witnessed rapid progress, especially with the recent development of convolutional neural networks. However, there still exist two challenges that prevent these algorithms from being applied in industry. On the one hand, most state-of-the-art algorithms require quadrangular bounding boxes, which are inaccurate for locating text with arbitrary shapes. On the other hand, two text instances that are close to each other may lead to a false detection covering both instances. Traditionally, segmentation-based approaches can relieve the first problem but usually fail to solve the second challenge. To address these two challenges, in this paper we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates kernels of different scales for each text instance and gradually expands the minimal-scale kernel to the text instance with its complete shape. Because there are large geometrical margins among the minimal-scale kernels, our method effectively splits close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curved texts, PSENet achieves an F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-the-art algorithms by 6.6%. The code will be released in the future. |
Tasks | Scene Text Detection |
Published | 2019-03-28 |
URL | https://arxiv.org/abs/1903.12473v2 |
https://arxiv.org/pdf/1903.12473v2.pdf | |
PWC | https://paperswithcode.com/paper/shape-robust-text-detection-with-progressive-1 |
Repo | https://github.com/whai362/PSENet |
Framework | tf |
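The core of PSENet is the progressive scale expansion step: labels seeded from the smallest kernel are grown breadth-first into each successively larger kernel. The following is a simplified re-implementation of that expansion for illustration, not the authors' optimized code.

```python
# Simplified progressive scale expansion: seed labels from the minimal kernel,
# then grow them BFS-style into each larger kernel mask.
import numpy as np
from collections import deque
from scipy.ndimage import label as cc_label

def progressive_scale_expansion(kernels):
    """kernels: list of binary (H, W) masks, ordered smallest -> largest."""
    labels, _ = cc_label(kernels[0])          # connected components of minimal kernel
    for kernel in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < labels.shape[0] and 0 <= nx < labels.shape[1]
                        and kernel[ny, nx] and labels[ny, nx] == 0):
                    labels[ny, nx] = labels[y, x]   # first label to arrive wins
                    queue.append((ny, nx))
    return labels
```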
Segmentation of Roots in Soil with U-Net
Title | Segmentation of Roots in Soil with U-Net |
Authors | Abraham George Smith, Jens Petersen, Raghavendra Selvan, Camilla Ruø Rasmussen |
Abstract | Plant root research can provide a way to attain stress-tolerant crops that produce greater yield in a diverse array of conditions. Phenotyping roots in soil is often challenging due to the roots being difficult to access and the use of time-consuming manual methods. Rhizotrons allow visual inspection of root growth through transparent surfaces. Agronomists currently manually label photographs of roots obtained from rhizotrons using a line-intersect method to obtain root length density and rooting depth measurements, which are essential for their experiments. We investigate the effectiveness of an automated image segmentation method based on the U-Net Convolutional Neural Network (CNN) architecture to enable such measurements. We design a dataset of 50 annotated Chicory (Cichorium intybus L.) root images which we use to train, validate and test the system, and compare against a baseline built using the Frangi vesselness filter. We obtain metrics using manual annotations and line-intersect counts. Our results on the held-out data show our proposed automated segmentation system to be a viable solution for detecting and quantifying roots. We evaluate our system using 867 images for which we have obtained line-intersect counts, attaining a Spearman rank correlation of 0.9748 and an $r^2$ of 0.9217. We also achieve an $F_1$ of 0.7 when comparing the automated segmentation to the manual annotations, with our automated segmentation system producing segmentations of higher quality than the manual annotations for large portions of the image. |
Tasks | Semantic Segmentation |
Published | 2019-02-28 |
URL | http://arxiv.org/abs/1902.11050v2 |
http://arxiv.org/pdf/1902.11050v2.pdf | |
PWC | https://paperswithcode.com/paper/segmentation-of-roots-in-soil-with-u-net |
Repo | https://github.com/Abe404/segmentation_of_roots_in_soil_with_unet |
Framework | pytorch |
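The reported numbers are a Spearman rank correlation and $r^2$ against line-intersect counts plus an $F_1$ against manual annotations. The sketch below shows how such an evaluation could be computed, assuming predicted probability masks and a 0.5 threshold; the linear calibration before $r^2$ is an assumption, not necessarily the paper's protocol.

```python
# Illustrative evaluation: Spearman rank correlation and r^2 of predicted root-pixel
# counts vs. line-intersect counts, plus pixelwise F1 against manual annotations.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score, r2_score

def evaluate(pred_masks, line_intersect_counts, gt_masks, threshold=0.5):
    pred_counts = np.array([(m > threshold).sum() for m in pred_masks], dtype=float)
    rho, _ = spearmanr(pred_counts, line_intersect_counts)
    # counts live on different scales, so fit a simple linear calibration before r^2
    a, b = np.polyfit(pred_counts, line_intersect_counts, 1)
    r2 = r2_score(line_intersect_counts, a * pred_counts + b)
    f1 = f1_score(
        np.concatenate([g.ravel().astype(int) for g in gt_masks]),
        np.concatenate([(m > threshold).ravel().astype(int) for m in pred_masks]))
    return {"spearman": rho, "r2": r2, "f1": f1}
```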
Multi-scale Cell Instance Segmentation with Keypoint Graph based Bounding Boxes
Title | Multi-scale Cell Instance Segmentation with Keypoint Graph based Bounding Boxes |
Authors | Jingru Yi, Pengxiang Wu, Qiaoying Huang, Hui Qu, Bo Liu, Daniel J. Hoeppner, Dimitris N. Metaxas |
Abstract | Most existing methods handle cell instance segmentation problems directly without relying on additional detection boxes. These methods generally fail to separate touching cells due to the lack of global understanding of the objects. In contrast, box-based instance segmentation solves this problem by combining object detection with segmentation. However, existing methods typically utilize anchor box-based detectors, which leads to inferior instance segmentation performance due to the class imbalance issue. In this paper, we propose a new box-based cell instance segmentation method. In particular, we first detect the five pre-defined points of a cell via keypoint detection. Then we group these points according to a keypoint graph and subsequently extract the bounding box for each cell. Finally, cell segmentation is performed on feature maps within the bounding boxes. We validate our method on two cell datasets with distinct object shapes, and empirically demonstrate the superiority of our method compared to other instance segmentation techniques. Code is available at: https://github.com/yijingru/KG_Instance_Segmentation. |
Tasks | Cell Segmentation, Instance Segmentation, Object Detection, Semantic Segmentation |
Published | 2019-07-22 |
URL | https://arxiv.org/abs/1907.09140v2 |
https://arxiv.org/pdf/1907.09140v2.pdf | |
PWC | https://paperswithcode.com/paper/multi-scale-cell-instance-segmentation-with |
Repo | https://github.com/yijingru/KG_Instance_Segmentation |
Framework | pytorch |
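The pipeline detects five pre-defined keypoints per cell, groups them via a keypoint graph, and derives a bounding box from each group. A minimal sketch of the final box-extraction step for one grouped cell follows; the padding ratio is an illustrative assumption.

```python
# Turn one cell's grouped keypoints into a (slightly padded) bounding box.
import numpy as np

def keypoints_to_box(keypoints, pad_ratio=0.1):
    """keypoints: (5, 2) array of (x, y) points belonging to one cell."""
    pts = np.asarray(keypoints, dtype=float)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    pad_x, pad_y = pad_ratio * (x2 - x1), pad_ratio * (y2 - y1)
    return x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y
```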
Human Pose Estimation for Real-World Crowded Scenarios
Title | Human Pose Estimation for Real-World Crowded Scenarios |
Authors | Thomas Golda, Tobias Kalb, Arne Schumann, Jürgen Beyerer |
Abstract | Human pose estimation has recently made significant progress with the adoption of deep convolutional neural networks. Its many applications have attracted tremendous interest in recent years. However, many practical applications require pose estimation for human crowds, which is still a rarely addressed problem. In this work, we explore methods to optimize pose estimation for human crowds, focusing on challenges introduced by dense crowds, such as occlusions, people in close proximity to each other, and partial visibility of people. In order to address these challenges, we evaluate three aspects of a pose detection approach: i) a data augmentation method to introduce robustness to occlusions, ii) the explicit detection of occluded body parts, and iii) the use of synthetically generated datasets. The first approach to improve the accuracy in crowded scenarios is to generate occlusions at training time using person and object cutouts from the object recognition dataset COCO (Common Objects in Context). Furthermore, the synthetically generated dataset JTA (Joint Track Auto) is evaluated for use in real-world crowd applications. In order to overcome the transfer gap of JTA, originating from its low pose variety and less dense crowds, an extension dataset is created to ease its use in real-world applications. Additionally, the occlusion flags provided with JTA are utilized to train a model which explicitly distinguishes between occluded and visible body parts in two distinct branches. The combination of the proposed additions to the baseline method helps to improve the overall accuracy by 4.7% AP and thereby provides comparable results to current state-of-the-art approaches on the respective dataset. |
Tasks | Data Augmentation, Object Recognition, Pose Estimation |
Published | 2019-07-16 |
URL | https://arxiv.org/abs/1907.06922v1 |
https://arxiv.org/pdf/1907.06922v1.pdf | |
PWC | https://paperswithcode.com/paper/human-pose-estimation-for-real-world-crowded |
Repo | https://github.com/thomasgolda/Human-Pose-Estimation-for-Real-World-Crowded-Scenarios |
Framework | none |
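The first proposed addition is occlusion augmentation with person and object cutouts from COCO pasted onto training images. A minimal sketch of such an augmentation follows, assuming RGBA cutout patches are already available; the scale range and placement policy are illustrative assumptions.

```python
# Paste a randomly scaled cutout (RGBA patch) at a random location of the image
# to simulate occlusion at training time.
import random
from PIL import Image

def paste_random_cutout(image, cutouts, scale=(0.2, 0.5)):
    """image: PIL RGB image; cutouts: list of RGBA PIL patches with transparency."""
    patch = random.choice(cutouts).copy()
    s = random.uniform(*scale)
    patch = patch.resize((max(1, int(patch.width * s)), max(1, int(patch.height * s))))
    x = random.randint(0, max(0, image.width - patch.width))
    y = random.randint(0, max(0, image.height - patch.height))
    out = image.copy()
    out.paste(patch, (x, y), mask=patch)   # alpha channel masks the paste
    return out
```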
FlatteNet: A Simple Versatile Framework for Dense Pixelwise Prediction
Title | FlatteNet: A Simple Versatile Framework for Dense Pixelwise Prediction |
Authors | Xin Cai, Yi-Fei Pu |
Abstract | In this paper, we focus on devising a versatile framework for dense pixelwise prediction, whose goal is to assign a discrete or continuous label to each pixel of an image. It is well known that the reduced feature resolution due to repeated subsampling operations poses a serious challenge to Fully Convolutional Network (FCN) based models. In contrast to commonly used strategies, such as dilated convolution and encoder-decoder structures, we introduce the Flattening Module to produce high-resolution predictions without either removing any subsampling operations or building a complicated decoder module. In addition, the Flattening Module is lightweight and can be easily combined with any existing FCN, allowing the model builder to trade off among model size, computational cost and accuracy by simply choosing different backbone networks. We empirically demonstrate the effectiveness of the proposed Flattening Module through competitive results in human pose estimation on MPII, semantic segmentation on PASCAL-Context and object detection on PASCAL VOC. We hope that the proposed approach can serve as a simple and strong alternative to the current dominant dense pixelwise prediction frameworks. |
Tasks | Object Detection, Pose Estimation, Semantic Segmentation |
Published | 2019-09-22 |
URL | https://arxiv.org/abs/1909.09961v3 |
https://arxiv.org/pdf/1909.09961v3.pdf | |
PWC | https://paperswithcode.com/paper/190909961 |
Repo | https://github.com/TotalVariation/Flattenet |
Framework | pytorch |
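The abstract does not detail the Flattening Module's internals; the sketch below shows one lightweight way to obtain full-resolution predictions from subsampled backbone features by mapping channels into space (pixel-shuffle style). It illustrates the idea of avoiding a heavy decoder, not the paper's exact module.

```python
# Lightweight full-resolution prediction head: a 1x1 projection followed by a
# channel-to-space rearrangement. Illustrative, not the published module.
import torch.nn as nn

class FlatteningHead(nn.Module):
    def __init__(self, in_channels, num_classes, stride=16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, num_classes * stride * stride, kernel_size=1)
        self.up = nn.PixelShuffle(stride)   # (B, C*s*s, H, W) -> (B, C, H*s, W*s)

    def forward(self, features):
        return self.up(self.proj(features))
```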
Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network
Title | Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network |
Authors | Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, Wei Liu |
Abstract | In this paper, we propose to guide video caption generation with Part-of-Speech (POS) information, based on a gated fusion of multiple representations of the input video. We construct a novel gated fusion network, with a particularly designed cross-gating (CG) block, to effectively encode and fuse different types of representations, e.g., the motion and content features of an input video. A POS sequence generator relies on this fused representation to predict the global syntactic structure, which is thereafter leveraged to guide caption generation and control the syntax of the generated sentence. Specifically, a gating strategy is proposed to dynamically and adaptively incorporate the global syntactic POS information into the decoder for generating each word. Experimental results on two benchmark datasets, namely MSR-VTT and MSVD, demonstrate that the proposed model can well exploit complementary information from multiple representations, resulting in improved performance. Moreover, the generated global POS information can well capture the global syntactic structure of the sentence, and thus be exploited to control the syntactic structure of the description. Such POS information not only boosts the video captioning performance but also improves the diversity of the generated captions. Our code is at: https://github.com/vsislab/Controllable_XGating. |
Tasks | Video Captioning |
Published | 2019-08-27 |
URL | https://arxiv.org/abs/1908.10072v1 |
https://arxiv.org/pdf/1908.10072v1.pdf | |
PWC | https://paperswithcode.com/paper/controllable-video-captioning-with-pos |
Repo | https://github.com/vsislab/Controllable_XGating |
Framework | pytorch |
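The gated fusion network relies on a cross-gating (CG) block that fuses motion and content features. Below is a minimal sketch of one plausible cross-gating layer, where each modality is re-weighted by a gate computed from the other; the layer sizes and the final sum fusion are assumptions, not the released implementation.

```python
# One plausible cross-gating fusion of motion and content feature vectors.
import torch
import torch.nn as nn

class CrossGating(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate_m = nn.Linear(dim, dim)   # gate for motion, driven by content
        self.gate_c = nn.Linear(dim, dim)   # gate for content, driven by motion

    def forward(self, motion, content):     # both: (B, dim)
        gated_motion = motion * torch.sigmoid(self.gate_m(content))
        gated_content = content * torch.sigmoid(self.gate_c(motion))
        return gated_motion + gated_content
```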
ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records
Title | ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records |
Authors | Rajkumar Saini, Derek Dobson, Jon Morrey, Marcus Liwicki, Foteini Simistira Liwicki |
Abstract | We propose a Historical Document Reading Challenge on Large Chinese Structured Family Records, in short ICDAR2019 HDRC CHINESE. The objective of the proposed competition is to recognize and analyze the layout, and finally to detect and recognize the text lines and characters of a large historical document collection containing more than 20,000 pages, kindly provided by FamilySearch. |
Tasks | |
Published | 2019-03-08 |
URL | https://arxiv.org/abs/1903.03341v3 |
https://arxiv.org/pdf/1903.03341v3.pdf | |
PWC | https://paperswithcode.com/paper/icdar-2019-historical-document-reading |
Repo | https://github.com/DIVA-DIA/DIVA_Layout_Analysis_Evaluator |
Framework | none |
High-Resolution Traffic Sensing with Autonomous Vehicles
Title | High-Resolution Traffic Sensing with Autonomous Vehicles |
Authors | Wei Ma, Sean Qian |
Abstract | The last decades have witnessed the breakthrough of autonomous vehicles (AVs), and the perception capabilities of AVs have been dramatically improved. Various sensors installed on AVs, including, but not limited to, LiDAR, radar, camera and stereovision, will be collecting massive data and perceiving the surrounding traffic states continuously. In fact, a fleet of AVs can serve as floating (or probe) sensors, which can be utilized to infer traffic information while cruising around the roadway networks. In contrast, conventional traffic sensing methods rely on fixed traffic sensors such as loop detectors, cameras and microwave vehicle detectors. Due to the high cost of conventional traffic sensors, traffic state data are usually obtained in a low-frequency and sparse manner. In view of this, this paper leverages the rich data collected through AVs to propose a high-resolution traffic sensing framework. The proposed framework estimates the fundamental traffic state variables, namely flow, density and speed, in high spatio-temporal resolution, and it is developed under different levels of AV perception capabilities and low AV market penetration rates. The Next Generation Simulation (NGSIM) data is adopted to examine the accuracy and robustness of the proposed framework. Experimental results show that the proposed estimation framework achieves high accuracy even with a low AV market penetration rate. Sensitivity analyses regarding AV penetration rate, sensor configuration, and perception accuracy are also conducted. This study will help policymakers and private sectors (e.g., Uber, Waymo) to understand the value of AVs, especially the value of the massive data collected by AVs, in traffic operation and management. |
Tasks | Autonomous Vehicles |
Published | 2019-10-06 |
URL | https://arxiv.org/abs/1910.02376v1 |
https://arxiv.org/pdf/1910.02376v1.pdf | |
PWC | https://paperswithcode.com/paper/high-resolution-traffic-sensing-with |
Repo | https://github.com/Lemma1/NGSIM-interface |
Framework | none |
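The estimated quantities (flow, density, speed) are tied by the fundamental relation q = k · v. A tiny worked check with made-up illustrative numbers:

```python
# Fundamental traffic relation: flow = density * speed (illustrative numbers only).
density = 30.0          # vehicles per km per lane
speed = 60.0            # km per hour
flow = density * speed  # = 1800 vehicles per hour per lane
print(flow)
```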
Transformer-CNN: Fast and Reliable tool for QSAR
Title | Transformer-CNN: Fast and Reliable tool for QSAR |
Authors | Pavel Karpov, Guillaume Godin, Igor V. Tetko |
Abstract | We present SMILES embeddings derived from the internal encoder state of a Transformer [1] model trained to canonicalize SMILES as a Seq2Seq problem. Using a CharNN [2] architecture on top of the embeddings results in higher-quality, interpretable QSAR/QSPR models on diverse benchmark datasets, including regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, and thus the prognosis is based on an internal consensus. Because both the augmentation and transfer learning are based on embeddings, the method provides good results for small datasets. We discuss the reasons for such effectiveness and draft future directions for the development of the method. The source code and the embeddings needed to train a QSAR model are available at https://github.com/bigchem/transformer-cnn. The repository also has a standalone program for QSAR prognosis which calculates individual atom contributions, thus interpreting the model's results. The OCHEM [3] environment (https://ochem.eu) hosts an online implementation of the proposed method. |
Tasks | Transfer Learning |
Published | 2019-10-21 |
URL | https://arxiv.org/abs/1911.06603v3 |
https://arxiv.org/pdf/1911.06603v3.pdf | |
PWC | https://paperswithcode.com/paper/transformer-cnn-fast-and-reliable-tool-for |
Repo | https://github.com/bigchem/transformer-cnn |
Framework | tf |
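Training and inference rely on SMILES augmentation, i.e., enumerating randomized non-canonical SMILES of the same molecule and averaging the predictions over them. A small sketch using RDKit follows; `predict_fn` is a placeholder for a trained QSAR model, not part of the released code.

```python
# SMILES augmentation with RDKit: enumerate randomized SMILES and average
# predictions over them (the "internal consensus" mentioned above).
from rdkit import Chem

def randomized_smiles(smiles, n=10):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)]

def augmented_prediction(smiles, predict_fn, n=10):
    variants = randomized_smiles(smiles, n)
    preds = [predict_fn(s) for s in variants]
    return sum(preds) / len(preds)
```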
Video Classification with Channel-Separated Convolutional Networks
Title | Video Classification with Channel-Separated Convolutional Networks |
Authors | Du Tran, Heng Wang, Lorenzo Torresani, Matt Feiszli |
Abstract | Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture – Channel-Separated Convolutional Network (CSN) – which is simple, efficient, yet accurate. On Sports1M, Kinetics, and Something-Something, our CSNs are comparable with or better than the state-of-the-art while being 2-3 times more efficient. |
Tasks | Action Classification, Action Recognition In Videos, Image Classification, Video Classification |
Published | 2019-04-04 |
URL | https://arxiv.org/abs/1904.02811v4 |
https://arxiv.org/pdf/1904.02811v4.pdf | |
PWC | https://paperswithcode.com/paper/video-classification-with-channel-separated |
Repo | https://github.com/facebookresearch/VMZ |
Framework | caffe2 |
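A channel-separated convolution factorizes a 3D convolution into a 1x1x1 convolution for channel interactions and a depthwise 3x3x3 convolution for spatiotemporal interactions. A minimal PyTorch sketch of such a block follows; the exact block layout used in CSN is an assumption here.

```python
# Channel-separated 3D convolution: pointwise (channel mixing) followed by a
# depthwise spatiotemporal convolution.
import torch.nn as nn

class ChannelSeparatedConv3d(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False)
        self.depthwise = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=out_channels, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                   # x: (B, C, T, H, W)
        return self.act(self.bn(self.depthwise(self.pointwise(x))))
```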
Publicly Available Clinical BERT Embeddings
Title | Publicly Available Clinical BERT Embeddings |
Authors | Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, Matthew B. A. McDermott |
Abstract | Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on three common clinical NLP tasks as compared to nonspecific embeddings. We find that these domain-specific models are not as performant on two clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non-de-identified task text. |
Tasks | |
Published | 2019-04-06 |
URL | https://arxiv.org/abs/1904.03323v3 |
https://arxiv.org/pdf/1904.03323v3.pdf | |
PWC | https://paperswithcode.com/paper/publicly-available-clinical-bert-embeddings |
Repo | https://github.com/ManasRMohanty/DS5500-capstone |
Framework | none |
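The released clinical BERT weights can be used to embed clinical text via the Hugging Face `transformers` library. A short sketch follows; the hub identifier below is the one commonly associated with these weights, so confirm the exact model ID in the linked repository.

```python
# Embed a clinical sentence with publicly released clinical BERT weights.
# The hub ID is an assumption; verify it against the paper's repository.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Patient presents with chest pain.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]   # [CLS] token embedding
```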
Large Scale Holistic Video Understanding
Title | Large Scale Holistic Video Understanding |
Authors | Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jurgen Gall, Rainer Stiefelhagen, Luc Van Gool |
Abstract | Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale "Holistic Video Understanding Dataset" (HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approximately 572k videos in total with 9 million annotations for the training, validation and test sets, spanning over 3457 labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes and concepts, which naturally captures real-world scenarios. Further, we introduce a new spatio-temporal deep neural network architecture called "Holistic Appearance and Temporal Network" (HATNet) that builds on fusing 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet focuses on the multi-label and multi-task learning problem and is trained in an end-to-end manner. The experiments show that HATNet trained on HVU outperforms current state-of-the-art methods on challenging human action datasets: HMDB51, UCF101, and Kinetics. The dataset and code will be made publicly available. |
Tasks | Action Classification, Action Recognition In Videos, Multi-Task Learning, Temporal Action Localization, Video Recognition, Video Understanding |
Published | 2019-04-25 |
URL | https://arxiv.org/abs/1904.11451v2 |
https://arxiv.org/pdf/1904.11451v2.pdf | |
PWC | https://paperswithcode.com/paper/holistic-large-scale-video-understanding |
Repo | https://github.com/holistic-video-understanding/Mini-HVU |
Framework | none |
Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding
Title | Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding |
Authors | Shahin Khobahi, Mojtaba Soltanalian |
Abstract | Parameterized mathematical models play a central role in understanding and design of complex information systems. However, they often cannot take into account the intricate interactions innate to such systems. In contrast, purely data-driven approaches do not need explicit mathematical models for data generation and have a wider applicability, at the cost of interpretability. In this paper, we consider the design of a one-bit compressive variational autoencoder and propose a novel hybrid model-based and data-driven methodology that allows us not only to design the sensing matrix and the quantization thresholds for one-bit data acquisition, but also to learn the latent parameters of iterative optimization algorithms specifically designed for the problem of one-bit sparse signal recovery. In addition, the proposed method has the ability to adaptively learn the proper quantization thresholds, paving the way for amplitude recovery in one-bit compressive sensing. Our results demonstrate a significant improvement compared to state-of-the-art model-based algorithms. |
Tasks | Compressive Sensing, Quantization |
Published | 2019-11-27 |
URL | https://arxiv.org/abs/1911.12410v1 |
https://arxiv.org/pdf/1911.12410v1.pdf | |
PWC | https://paperswithcode.com/paper/model-aware-deep-architectures-for-one-bit |
Repo | https://github.com/skhobahi/deep1bitVAE |
Framework | none |
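The underlying measurement model keeps only the sign of each thresholded linear measurement of a sparse signal. A small NumPy sketch of generating such one-bit observations follows; the dimensions, thresholds and noise level are illustrative assumptions.

```python
# One-bit compressive sensing measurement model: y = sign(A x + noise - tau).
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 128, 64, 5                      # signal length, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse signal
A = rng.standard_normal((m, n)) / np.sqrt(m)                  # sensing matrix
tau = np.zeros(m)                                             # quantization thresholds
noise = 0.01 * rng.standard_normal(m)
y = np.sign(A @ x + noise - tau)          # one-bit observations in {-1, +1}
```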