January 27, 2020

3209 words 16 mins read

Paper Group ANR 1232

Leaf segmentation through the classification of edges. DRFN: Deep Recurrent Fusion Network for Single-Image Super-Resolution with Large Factors. A Fast Free-viewpoint Video Synthesis Algorithm for Sports Scenes. Towards Annotating and Creating Sub-Sentence Summary Highlights. MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition. L …

Leaf segmentation through the classification of edges

Title Leaf segmentation through the classification of edges
Authors Jonathan Bell, Hannah M. Dee
Abstract We present an approach to leaf level segmentation of images of Arabidopsis thaliana plants based upon detected edges. We introduce a novel approach to edge classification, which forms an important part of a method to both count the leaves and establish the leaf area of a growing plant from images obtained in a high-throughput phenotyping system. Our technique uses a relatively shallow convolutional neural network to classify image edges as background, plant edge, leaf-on-leaf edge or internal leaf noise. The edges themselves were found using the Canny edge detector and the classified edges can be used with simple image processing techniques to generate a region-based segmentation in which the leaves are distinct. This approach is strong at distinguishing occluding pairs of leaves where one leaf is largely hidden, a situation which has proved troublesome for plant image analysis systems in the past. In addition, we introduce the publicly available plant image dataset that was used for this work.
Tasks
Published 2019-04-05
URL http://arxiv.org/abs/1904.03124v1
PDF http://arxiv.org/pdf/1904.03124v1.pdf
PWC https://paperswithcode.com/paper/leaf-segmentation-through-the-classification
Repo
Framework
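
The pipeline above (Canny edges classified into four classes by a shallow CNN) could be sketched roughly as below. The patch size, network layout, and the `classify_edges` helper are assumptions for illustration, not the authors' implementation.

```python
# Sketch: classify Canny edge pixels into 4 classes with a shallow CNN.
# Hypothetical architecture and patch size; not the authors' code.
import cv2
import numpy as np
import torch
import torch.nn as nn

NUM_CLASSES = 4  # background, plant edge, leaf-on-leaf edge, internal leaf noise
PATCH = 32       # assumed patch size around each edge pixel

class EdgePatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * (PATCH // 4) ** 2, NUM_CLASSES)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def classify_edges(image_bgr, model):
    """Return (edge_coords, predicted_class) for every Canny edge pixel."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    ys, xs = np.nonzero(edges)
    half = PATCH // 2
    padded = cv2.copyMakeBorder(image_bgr, half, half, half, half,
                                cv2.BORDER_REFLECT)
    patches = [padded[y:y + PATCH, x:x + PATCH] for y, x in zip(ys, xs)]
    batch = torch.from_numpy(np.stack(patches)).permute(0, 3, 1, 2).float() / 255
    with torch.no_grad():
        preds = model(batch).argmax(dim=1)
    return list(zip(ys, xs)), preds
```

The classified edges would then feed simple image-processing steps (as described in the abstract) to produce the region-based, per-leaf segmentation.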

DRFN: Deep Recurrent Fusion Network for Single-Image Super-Resolution with Large Factors

Title DRFN: Deep Recurrent Fusion Network for Single-Image Super-Resolution with Large Factors
Authors Xin Yang, Haiyang Mei, Jiqing Zhang, Ke Xu, Baocai Yin, Qiang Zhang, Xiaopeng Wei
Abstract Recently, single-image super-resolution has made great progress owing to the development of deep convolutional neural networks (CNNs). The vast majority of CNN-based models use a pre-defined upsampling operator, such as bicubic interpolation, to upscale input low-resolution images to the desired size and learn a non-linear mapping between the interpolated image and the ground-truth high-resolution (HR) image. However, interpolation processing can lead to visual artifacts as details are over-smoothed, particularly when the super-resolution factor is high. In this paper, we propose a Deep Recurrent Fusion Network (DRFN), which utilizes transposed convolution instead of bicubic interpolation for upsampling and integrates different-level features extracted from recurrent residual blocks to reconstruct the final HR images. We adopt a deep recurrence learning strategy and thus have a larger receptive field, which is conducive to reconstructing an image more accurately. Furthermore, we show that the multi-level fusion structure is suitable for dealing with image super-resolution problems. Extensive benchmark evaluations demonstrate that the proposed DRFN performs better than most current deep learning methods in terms of accuracy and visual effects, especially for large scale factors, while using fewer parameters.
Tasks Image Super-Resolution, Super-Resolution
Published 2019-08-23
URL https://arxiv.org/abs/1908.08837v1
PDF https://arxiv.org/pdf/1908.08837v1.pdf
PWC https://paperswithcode.com/paper/drfn-deep-recurrent-fusion-network-for-single
Repo
Framework
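
A minimal sketch of the two ingredients named in the abstract, transposed-convolution upsampling and fusion of features from a recurrently applied residual block. Channel widths, recursion depth, and the x4 factor are illustrative assumptions rather than the published DRFN.

```python
# Sketch: transposed-convolution upsampling + fusion of recurrent residual features.
# Channel widths, recursion depth, and the x4 factor are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class DRFNSketch(nn.Module):
    def __init__(self, ch=64, recursions=3, scale=4):
        super().__init__()
        self.entry = nn.Conv2d(3, ch, 3, padding=1)
        # transposed convolution instead of bicubic interpolation
        self.up = nn.ConvTranspose2d(ch, ch, kernel_size=scale, stride=scale)
        self.block = ResidualBlock(ch)       # shared weights, applied recurrently
        self.recursions = recursions
        # fuse the different-level features gathered at each recursion
        self.fuse = nn.Conv2d(ch * recursions, ch, 1)
        self.exit = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, lr):
        x = self.up(self.entry(lr))
        levels = []
        for _ in range(self.recursions):
            x = self.block(x)                # same block reused at each recursion
            levels.append(x)
        return self.exit(self.fuse(torch.cat(levels, dim=1)))

sr = DRFNSketch()(torch.randn(1, 3, 32, 32))   # -> (1, 3, 128, 128)
```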

A Fast Free-viewpoint Video Synthesis Algorithm for Sports Scenes

Title A Fast Free-viewpoint Video Synthesis Algorithm for Sports Scenes
Authors Jun Chen, Ryosuke Watanabe, Keisuke Nonaka, Tomoaki Konno, Hiroshi Sankoh, Sei Naito
Abstract In this paper, we report on a parallel free-viewpoint video synthesis algorithm that can efficiently reconstruct a high-quality 3D scene representation of sports scenes. The proposed method focuses on scenes captured by multiple synchronized cameras with wide baselines. The following strategies are introduced to accelerate the production of a free-viewpoint video while taking visual quality into account: (1) a sparse point cloud is reconstructed using a volumetric visual hull approach, and an exact 3D ROI is found for each object using an efficient connected-components labeling algorithm; the reconstruction of a dense point cloud is then accelerated by applying the visual hull only within the ROIs; (2) an accurate polyhedral surface mesh is built by estimating the exact intersections between grid cells and the visual hull; (3) the appearance of the reconstructed representation is reproduced in a view-dependent manner that renders the non-occluded and occluded regions with the nearest camera and its neighboring cameras, respectively. Results on volleyball and judo sequences demonstrate the effectiveness of our method in terms of both execution time and visual quality.
Tasks
Published 2019-03-28
URL https://arxiv.org/abs/1903.11785v2
PDF https://arxiv.org/pdf/1903.11785v2.pdf
PWC https://paperswithcode.com/paper/a-fast-free-viewpoint-video-synthesis
Repo
Framework
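
Step (1) of the pipeline, the volumetric visual hull, might look roughly like the sketch below: a voxel grid is carved against multi-view silhouettes and the surviving voxels form the sparse point cloud. The camera matrices, grid bounds, and resolution are placeholders, not the paper's parallel implementation.

```python
# Sketch: sparse visual hull by carving a voxel grid against multi-view silhouettes.
# Camera projection matrices, grid bounds, and resolution are placeholder assumptions.
import numpy as np

def visual_hull(silhouettes, projections, bounds, res=64):
    """silhouettes: list of HxW boolean masks; projections: list of 3x4 matrices."""
    xs = [np.linspace(lo, hi, res) for lo, hi in bounds]
    X, Y, Z = np.meshgrid(*xs, indexing="ij")
    pts = np.stack([X, Y, Z, np.ones_like(X)], axis=-1).reshape(-1, 4)  # homogeneous
    occupied = np.ones(len(pts), dtype=bool)
    for sil, P in zip(silhouettes, projections):
        uvw = pts @ P.T                      # project voxel centers into this view
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        keep = np.zeros(len(pts), dtype=bool)
        keep[inside] = sil[v[inside], u[inside]]
        occupied &= keep                     # a voxel survives only if inside every silhouette
    return pts[occupied, :3]                 # sparse point cloud; 3D ROIs follow from
                                             # connected-component labeling of this set
```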

Towards Annotating and Creating Sub-Sentence Summary Highlights

Title Towards Annotating and Creating Sub-Sentence Summary Highlights
Authors Kristjan Arumae, Parminder Bhatia, Fei Liu
Abstract Highlighting is a powerful tool to pick out important content and emphasize it. Creating summary highlights at the sub-sentence level is particularly desirable, because sub-sentences are more concise than whole sentences. They are also better suited than individual words and phrases, which can lead to disfluent, fragmented summaries. In this paper we seek to generate summary highlights by annotating summary-worthy sub-sentences and teaching classifiers to do the same. We frame the task as jointly selecting important sentences and identifying a single most informative textual unit from each sentence. This formulation dramatically reduces the task complexity involved in sentence compression. Our study provides new benchmarks and baselines for generating highlights at the sub-sentence level.
Tasks Sentence Compression
Published 2019-10-17
URL https://arxiv.org/abs/1910.07659v1
PDF https://arxiv.org/pdf/1910.07659v1.pdf
PWC https://paperswithcode.com/paper/towards-annotating-and-creating-sub-sentence
Repo
Framework
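
A toy sketch of the joint formulation (pick summary-worthy sentences, then keep the single most informative unit inside each). The two scorers are placeholder callables standing in for the trained classifiers.

```python
# Sketch: jointly select summary-worthy sentences and one sub-sentence unit per sentence.
# The two scorers are placeholders; the paper trains neural classifiers for both roles.
from typing import Callable, List, Tuple

def highlight(sentences: List[str],
              units: List[List[str]],            # candidate sub-sentence units per sentence
              sent_score: Callable[[str], float],
              unit_score: Callable[[str, str], float],
              k: int = 3) -> List[Tuple[str, str]]:
    # 1) pick the k most summary-worthy sentences
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sent_score(sentences[i]), reverse=True)[:k]
    # 2) inside each selected sentence, keep the single most informative unit
    return [(sentences[i],
             max(units[i], key=lambda u: unit_score(sentences[i], u)))
            for i in ranked]

# toy usage with trivial length-based scorers
sents = ["The model improves accuracy on two benchmarks.", "We thank the reviewers."]
cands = [["improves accuracy", "on two benchmarks"], ["thank the reviewers"]]
print(highlight(sents, cands, sent_score=len, unit_score=lambda s, u: len(u), k=1))
```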

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Title MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition
Authors Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
Abstract Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It comprises: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopt a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model can achieve more than 60% WER reduction compared to the single-channel system with high quality enhanced signals (SI-SDR = 23.1 dB) obtained by the above separation function.
Tasks Speech Recognition, Speech Separation
Published 2019-10-15
URL https://arxiv.org/abs/1910.06522v2
PDF https://arxiv.org/pdf/1910.06522v2.pdf
PWC https://paperswithcode.com/paper/mimo-speech-end-to-end-multi-channel-multi
Repo
Framework
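
The masking-network plus beamformer front end can be illustrated with classical mask-based MVDR beamforming, as sketched below. The masks would come from the monaural masking network; this NumPy version is a stand-in for the paper's fully neural, ASR-trained implementation.

```python
# Sketch: mask-based MVDR beamforming in the style of the masking-network + neural-beamformer
# front end. The time-frequency masks are inputs here; the paper estimates them with a network.
import numpy as np

def mvdr_weights(Y, speech_mask, noise_mask, ref_ch=0):
    """Y: (channels, frames) complex STFT for one frequency bin; masks: (frames,)."""
    phi_s = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()   # speech spatial covariance
    phi_n = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()     # noise spatial covariance
    num = np.linalg.solve(phi_n, phi_s)                          # phi_n^{-1} phi_s
    return num[:, ref_ch] / np.trace(num)                        # MVDR filter for the reference channel

def beamform(Y, speech_mask, noise_mask):
    """Apply per-frequency MVDR. Y: (freq, channels, frames) complex STFT; masks: (freq, frames)."""
    out = np.empty(Y.shape[::2], dtype=complex)                  # (freq, frames)
    for f in range(Y.shape[0]):
        w = mvdr_weights(Y[f], speech_mask[f], noise_mask[f])
        out[f] = w.conj() @ Y[f]
    return out                                                   # enhanced single-source STFT
```

In the multi-speaker setting, one such beamformer output would be produced per speaker and fed to the multi-output recognition model.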

Large Area 3D Human Pose Detection Via Stereo Reconstruction in Panoramic Cameras

Title Large Area 3D Human Pose Detection Via Stereo Reconstruction in Panoramic Cameras
Authors Christoph Heindl, Thomas Pönitz, Andreas Pichler, Josef Scharinger
Abstract We propose a novel 3D human pose detector using two panoramic cameras. We show that transforming fisheye perspectives to rectilinear views allows a direct application of two-dimensional deep-learning pose estimation methods, without the explicit need for a costly re-training step to compensate for fisheye image distortions. By utilizing panoramic cameras, our method is capable of accurately estimating human poses over a large field of view. This renders our method suitable for ergonomic analyses and other pose based assessments.
Tasks Pose Estimation
Published 2019-07-01
URL https://arxiv.org/abs/1907.00534v1
PDF https://arxiv.org/pdf/1907.00534v1.pdf
PWC https://paperswithcode.com/paper/large-area-3d-human-pose-detection-via-stereo
Repo
Framework
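
A small sketch of the rectification idea: remap a fisheye view to a rectilinear (pinhole) view so a pretrained 2D pose estimator can run unchanged, then triangulate keypoints across the two views. OpenCV's fisheye model and the placeholder intrinsics below are assumptions, not the authors' panoramic camera model.

```python
# Sketch: fisheye -> rectilinear remapping so an off-the-shelf 2D pose estimator applies directly.
# Intrinsics K and distortion D are placeholders; real values come from camera calibration.
import cv2
import numpy as np

K = np.array([[300.0, 0.0, 640.0],     # placeholder pinhole-equivalent intrinsics
              [0.0, 300.0, 480.0],
              [0.0, 0.0, 1.0]])
D = np.zeros((4, 1))                   # placeholder fisheye distortion coefficients

def fisheye_to_rectilinear(img, K, D, out_size=(640, 480)):
    # build the remapping tables once, then warp the fisheye frame to a pinhole view
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, out_size, cv2.CV_16SC2)
    return cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)

# rect = fisheye_to_rectilinear(frame, K, D)
# keypoints_2d = pose_estimator(rect)   # any pretrained 2D pose network, no re-training needed
# 3D poses then follow from triangulating keypoints across the two rectified views
```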

Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events

Title Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
Authors Wim Boes, Hugo Van hamme
Abstract We tackle the task of environmental event classification by drawing inspiration from the transformer neural network architecture used in machine translation. We modify this attention-based feedforward structure so that the resulting model can use audio as well as video to compute sound event predictions. We perform extensive experiments with these adapted transformers on an audiovisual data set, obtained by appending relevant visual information to an existing large-scale weakly labeled audio collection. The employed multi-label data contains clip-level annotation indicating the presence or absence of 17 classes of environmental sounds, and does not include temporal information. We show that the proposed modified transformers strongly improve upon previously introduced models and in fact achieve state-of-the-art results. We also make a compelling case for devoting more attention to research in multimodal audiovisual classification by proving the usefulness of visual information for the task at hand, namely audio event recognition. In addition, we visualize internal attention patterns of the audiovisual transformers and in doing so demonstrate their potential for performing multimodal synchronization.
Tasks Machine Translation
Published 2019-12-02
URL https://arxiv.org/abs/1912.02615v1
PDF https://arxiv.org/pdf/1912.02615v1.pdf
PWC https://paperswithcode.com/paper/audiovisual-transformer-architectures-for
Repo
Framework
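
A compact sketch of an audiovisual transformer for clip-level multi-label tagging of the 17 sound classes. The feature dimensions, concatenation-based fusion, and mean pooling are assumptions, not the exact architecture of the paper.

```python
# Sketch: audiovisual transformer for clip-level (weakly labeled) multi-label tagging.
# Feature dims, fusion-by-concatenation, and mean pooling are illustrative assumptions.
import torch
import torch.nn as nn

class AVTransformerSketch(nn.Module):
    def __init__(self, d_audio=128, d_video=512, d_model=256, n_classes=17):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, audio_feats, video_feats):
        # concatenate the two modalities along the time axis so self-attention can
        # relate audio frames to video frames (a simple form of synchronization)
        tokens = torch.cat([self.proj_a(audio_feats), self.proj_v(video_feats)], dim=1)
        enc = self.encoder(tokens)
        return torch.sigmoid(self.head(enc.mean(dim=1)))   # clip-level class probabilities

probs = AVTransformerSketch()(torch.randn(2, 100, 128), torch.randn(2, 25, 512))  # (2, 17)
```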

Online Budgeted Learning for Classifier Induction

Title Online Budgeted Learning for Classifier Induction
Authors Eran Fainman, Bracha Shapira, Lior Rokach, Yisroel Mirsky
Abstract In real-world machine learning applications, there is a cost associated with sampling of different features. Budgeted learning can be used to select which feature-values to acquire from each instance in a dataset, such that the best model is induced under a given constraint. However, this approach is not possible in the domain of online learning since one may not retroactively acquire feature-values from past instances. In online learning, the challenge is to find the optimum set of features to be acquired from each instance upon arrival from a data stream. In this paper we introduce the issue of online budgeted learning and describe a general framework for addressing this challenge. We propose two types of feature value acquisition policies based on the multi-armed bandit problem: random and adaptive. Adaptive policies perform online adjustments according to new information coming from a data stream, while random policies are not sensitive to the information that arrives from the data stream. Our comparative study on five real-world datasets indicates that adaptive policies outperform random policies for most budget limitations and datasets. Furthermore, we found that in some cases adaptive policies achieve near-optimal results.
Tasks
Published 2019-03-13
URL http://arxiv.org/abs/1903.05382v1
PDF http://arxiv.org/pdf/1903.05382v1.pdf
PWC https://paperswithcode.com/paper/online-budgeted-learning-for-classifier
Repo
Framework
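
A toy sketch of an adaptive (bandit-style) acquisition policy under a per-instance budget, with an epsilon-greedy rule standing in for the paper's policies. The feature costs, the reward proxy, and epsilon are illustrative assumptions.

```python
# Sketch: epsilon-greedy adaptive acquisition of feature values under a per-instance budget.
# Arms = features; the reward proxy and epsilon are illustrative, not the paper's exact policies.
import numpy as np

rng = np.random.default_rng(0)

def choose_features(costs, value_estimates, budget, epsilon=0.1):
    """Pick which feature values to buy for one arriving instance."""
    chosen, spent = [], 0.0
    remaining = list(range(len(costs)))
    while remaining:
        affordable = [f for f in remaining if spent + costs[f] <= budget]
        if not affordable:
            break
        if rng.random() < epsilon:                      # explore an affordable feature
            f = int(rng.choice(affordable))
        else:                                           # exploit current value-per-cost estimates
            f = max(affordable, key=lambda i: value_estimates[i] / costs[i])
        chosen.append(f)
        spent += costs[f]
        remaining.remove(f)
    return chosen

def update_estimates(value_estimates, counts, feature, reward):
    """After the label arrives, credit the acquired feature with the observed reward
    (e.g., the change in the online model's accuracy)."""
    counts[feature] += 1
    value_estimates[feature] += (reward - value_estimates[feature]) / counts[feature]
```

A random policy corresponds to always taking the "explore" branch; the adaptive policy differs only in using the running value estimates updated from the data stream.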

DAVID: Dual-Attentional Video Deblurring

Title DAVID: Dual-Attentional Video Deblurring
Authors Junru Wu, Xiang Yu, Ding Liu, Manmohan Chandraker, Zhangyang Wang
Abstract Blind video deblurring restores sharp frames from a blurry sequence without any prior. It is a challenging task because the blur due to camera shake, object movement and defocusing is heterogeneous in both temporal and spatial dimensions. Traditional methods train on datasets synthesized with a single level of blur, and thus do not generalize well across levels of blurriness. To address this challenge, we propose a dual attention mechanism to dynamically aggregate temporal cues for deblurring with an end-to-end trainable network structure. Specifically, an internal attention module adaptively selects the optimal temporal scales for restoring the sharp center frame. An external attention module adaptively aggregates and refines multiple sharp frame estimates, from several internal attention modules designed for different blur levels. To train and evaluate on more diverse blur severity levels, we propose a Challenging DVD dataset generated from the raw DVD video set by pooling frames with different temporal windows. Our framework achieves consistently better performance on this more challenging dataset while obtaining strongly competitive results on the original DVD benchmark. Extensive ablative studies and qualitative visualizations further demonstrate the advantage of our method in handling real video blur.
Tasks Deblurring
Published 2019-12-07
URL https://arxiv.org/abs/1912.03445v1
PDF https://arxiv.org/pdf/1912.03445v1.pdf
PWC https://paperswithcode.com/paper/david-dual-attentional-video-deblurring
Repo
Framework
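
The two attention stages could be sketched as below: an internal module weights restorations computed at different temporal scales around the center frame, and an external module weights the estimates produced by several internal modules tuned to different blur levels. The shapes and layer widths are assumptions, not the published network.

```python
# Sketch: internal attention over temporal scales + external attention over sharp-frame estimates.
# Shapes and widths are illustrative assumptions.
import torch
import torch.nn as nn

class InternalAttention(nn.Module):
    """Fuse restorations computed with different temporal window sizes."""
    def __init__(self, ch=3):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, 3, padding=1)
    def forward(self, scale_estimates):           # (B, S, C, H, W), S temporal scales
        b, s, c, h, w = scale_estimates.shape
        logits = self.score(scale_estimates.view(b * s, c, h, w)).view(b, s, 1, h, w)
        weights = torch.softmax(logits, dim=1)    # per-pixel weight over scales
        return (weights * scale_estimates).sum(dim=1)

class ExternalAttention(nn.Module):
    """Aggregate estimates from internal modules designed for different blur levels."""
    def __init__(self, ch=3):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, 3, padding=1)
    def forward(self, estimates):                 # (B, M, C, H, W), M internal modules
        b, m, c, h, w = estimates.shape
        logits = self.score(estimates.view(b * m, c, h, w)).view(b, m, 1, h, w)
        return (torch.softmax(logits, dim=1) * estimates).sum(dim=1)

frame = ExternalAttention()(torch.randn(2, 3, 3, 64, 64))   # fused sharp center frame
```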

A Clinical Approach to Training Effective Data Scientists

Title A Clinical Approach to Training Effective Data Scientists
Authors Kit T Rodolfa, Adolfo De Unanue, Matt Gee, Rayid Ghani
Abstract Like medicine, psychology, or education, data science is fundamentally an applied discipline, with most students who receive advanced degrees in the field going on to work on practical problems. Unlike these disciplines, however, data science education remains heavily focused on theory and methods, and practical coursework typically revolves around cleaned or simplified data sets that have little analog in professional applications. We believe that the environment in which new data scientists are trained should more accurately reflect that in which they will eventually practice, and we propose here a data science master’s degree program that takes inspiration from the residency model used in medicine. Students in the suggested program would spend three years working on a practical problem with an industry, government, or nonprofit partner, supplemented with coursework in data science methods and theory. We also discuss how this program can be implemented in shorter formats to augment existing professional master’s programs in different disciplines. This approach to learning by doing is designed to fill gaps in our current approach to data science education and ensure that students develop the skills they need to practice data science in a professional context and under the many constraints imposed by that context.
Tasks
Published 2019-05-15
URL https://arxiv.org/abs/1905.06875v1
PDF https://arxiv.org/pdf/1905.06875v1.pdf
PWC https://paperswithcode.com/paper/a-clinical-approach-to-training-effective
Repo
Framework

Fast Image Caption Generation with Position Alignment

Title Fast Image Caption Generation with Position Alignment
Authors Zheng-cong Fei
Abstract Recent neural network models for image captioning usually employ an encoder-decoder architecture, where the decoder decodes the sequence recursively. However, such autoregressive decoding may result in sequential error accumulation and slow generation, which limit applications in practice. Non-autoregressive (NA) decoding has been proposed to address these issues but suffers from language-quality problems due to indirect modeling of the target distribution. To that end, we propose an improved NA prediction framework to accelerate image captioning. Our decoding part consists of a position alignment that orders the words describing the content detected in the given image, and a fine non-autoregressive decoder that generates fluent descriptions. Furthermore, we introduce an inference strategy that regards position information as a latent variable to guide further sentence generation. Experimental results on public datasets show that our proposed model achieves better performance than general NA captioning models, while achieving performance comparable to autoregressive image captioning models with a significant speedup.
Tasks Image Captioning
Published 2019-12-13
URL https://arxiv.org/abs/1912.06365v1
PDF https://arxiv.org/pdf/1912.06365v1.pdf
PWC https://paperswithcode.com/paper/fast-image-caption-generation-with-position
Repo
Framework
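
A rough sketch of non-autoregressive caption generation with a position-alignment step: detected-content embeddings are softly ordered over output slots, then all tokens are predicted in parallel. The soft alignment and the tiny decoder below are stand-ins for the paper's modules, not its actual architecture.

```python
# Sketch: non-autoregressive captioning with a position-alignment step.
# The soft slot alignment and the small parallel decoder are illustrative stand-ins.
import torch
import torch.nn as nn

class NACaptionerSketch(nn.Module):
    def __init__(self, vocab=1000, d=256, max_len=16):
        super().__init__()
        self.pos = nn.Embedding(max_len, d)          # one embedding per output position
        self.align = nn.Linear(d, d)                 # scores word-vs-slot compatibility
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d, vocab)
        self.max_len = max_len

    def forward(self, word_feats):                   # (B, N, d) detected-content embeddings
        b, _, _ = word_feats.shape
        slots = self.pos.weight.unsqueeze(0).expand(b, -1, -1)
        # position alignment: softly order the detected words over output positions
        attn = torch.softmax(slots @ self.align(word_feats).transpose(1, 2), dim=-1)
        ordered = attn @ word_feats                  # (B, max_len, d)
        # all tokens are then predicted in parallel (non-autoregressive)
        return self.out(self.decoder(ordered + slots)).argmax(dim=-1)

caption_ids = NACaptionerSketch()(torch.randn(2, 5, 256))   # (2, 16) token ids
```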

Better Understanding Hierarchical Visual Relationship for Image Caption

Title Better Understanding Hierarchical Visual Relationship for Image Caption
Authors Zheng-cong Fei
Abstract The Convolutional Neural Network (CNN) has been the dominant image feature extractor in computer vision for years. However, it fails to capture the relationships between images/objects and their hierarchical interactions, which can be helpful for representing and describing an image. In this paper, we propose a new design for image captioning under a general encoder-decoder framework. It takes into account the hierarchical interactions between different abstraction levels of visual information in the images and their bounding boxes. Specifically, we present a CNN plus Graph Convolutional Network (GCN) architecture that integrates both semantic and spatial visual relationships into the image encoder. The representations of regions in an image and the connections between them are refined by leveraging the graph structure through the GCN. With the learned multi-level features, our model capitalizes on a Transformer-based decoder for description generation. We conduct experiments on the COCO image captioning dataset. Evaluations show that our proposed model outperforms previous state-of-the-art models on the image captioning task, with better performance on all evaluation metrics.
Tasks Image Captioning
Published 2019-12-04
URL https://arxiv.org/abs/1912.01881v1
PDF https://arxiv.org/pdf/1912.01881v1.pdf
PWC https://paperswithcode.com/paper/better-understanding-hierarchical-visual
Repo
Framework
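
One GCN layer refining CNN region features under separate semantic and spatial relation graphs might look like the sketch below. The dimensions and the random stand-in adjacency matrices are assumptions for illustration.

```python
# Sketch: one GCN layer refining region features under semantic and spatial relation graphs.
# Dimensions and graph construction are illustrative assumptions, not the paper's encoder.
import torch
import torch.nn as nn

class RelationGCNLayer(nn.Module):
    def __init__(self, d=2048, out=512):
        super().__init__()
        self.w_sem = nn.Linear(d, out)     # transform for semantic-relation neighbours
        self.w_spa = nn.Linear(d, out)     # transform for spatial-relation neighbours
        self.w_self = nn.Linear(d, out)    # self-connection

    def forward(self, regions, adj_sem, adj_spa):
        # regions: (B, R, d); adj_*: (B, R, R) row-normalized adjacency matrices
        h = (self.w_self(regions)
             + adj_sem @ self.w_sem(regions)
             + adj_spa @ self.w_spa(regions))
        return torch.relu(h)               # refined region features for the decoder

regions = torch.randn(2, 36, 2048)
adj = torch.softmax(torch.randn(2, 36, 36), dim=-1)     # stand-in normalized graphs
refined = RelationGCNLayer()(regions, adj, adj)          # (2, 36, 512)
```

The refined region features would then be attended over by a Transformer-based caption decoder.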

Context in Neural Machine Translation: A Review of Models and Evaluations

Title Context in Neural Machine Translation: A Review of Models and Evaluations
Authors Andrei Popescu-Belis
Abstract This review paper discusses how context has been used in neural machine translation (NMT) in the past two years (2017-2018). Starting with a brief retrospective on the rapid evolution of NMT models, the paper reviews studies that evaluate NMT output from various perspectives, with emphasis on those analyzing limitations of the translation of contextual phenomena. In a subsequent version, the paper will present the main methods that were proposed to leverage context for improving translation quality, and distinguish methods that aim to improve the translation of specific phenomena from those that consider a wider unstructured context.
Tasks Machine Translation
Published 2019-01-25
URL http://arxiv.org/abs/1901.09115v1
PDF http://arxiv.org/pdf/1901.09115v1.pdf
PWC https://paperswithcode.com/paper/context-in-neural-machine-translation-a
Repo
Framework

Microscopy Image Restoration with Deep Wiener-Kolmogorov filters

Title Microscopy Image Restoration with Deep Wiener-Kolmogorov filters
Authors Valeriya Pronina, Filippos Kokkinos, Dmitry V. Dylov, Stamatios Lefkimmiatis
Abstract Microscopy is a powerful visualization tool in biology, enabling the study of cells, tissues, and fundamental biological processes. Yet the observed images of objects at the micro-scale suffer from two major inherent distortions: blur caused by the diffraction of light, and background noise caused by imperfections of the imaging detectors. The latter is especially severe in fluorescence and confocal microscopes, which are known for operating at low photon counts with Poisson noise statistics. Restoration of such images is usually accomplished by image deconvolution, with the nature of the noise statistics taken into account, and by solving an optimization problem given some prior information about the underlying data (i.e., regularization). In this work, we propose a unifying framework of algorithms for Poisson image deblurring and denoising. The algorithms are based on deep learning techniques for the design of learnable regularizers paired with an appropriate optimization scheme. Our extensive experiments show that the proposed approach achieves superior image reconstruction quality and beats solutions that rely on deep learning or on optimization schemes alone. Moreover, several implementations of the proposed framework demonstrate competitive performance at low computational complexity, which is of high importance for real-time imaging applications.
Tasks Deblurring, Denoising, Image Deconvolution, Image Reconstruction, Image Restoration
Published 2019-11-25
URL https://arxiv.org/abs/1911.10989v2
PDF https://arxiv.org/pdf/1911.10989v2.pdf
PWC https://paperswithcode.com/paper/microscopy-image-restoration-with-deep-wiener
Repo
Framework
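
For context, a classical Fourier-domain Wiener deconvolution step of the kind the learned regularizers build on is sketched below. The constant weight lam stands in for the learnable, network-parameterized regularizer of the paper, and this Gaussian-noise formulation ignores the Poisson statistics the full method handles.

```python
# Sketch: Fourier-domain Wiener/Tikhonov deconvolution with a fixed regularization weight.
# lam is a stand-in for the paper's learnable regularizer; Poisson noise handling is omitted.
import numpy as np

def wiener_deconvolve(blurred, psf, lam=1e-2):
    """blurred: observed image; psf: blur kernel (zero-padded to the image size)."""
    H = np.fft.fft2(psf, s=blurred.shape)          # transfer function of the blur
    Y = np.fft.fft2(blurred)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + lam)    # regularized inverse filter
    return np.real(np.fft.ifft2(X))

# toy usage: blur a random "image" with a box PSF, then restore it
img = np.random.rand(64, 64)
psf = np.ones((5, 5)) / 25.0
blurred = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(psf, s=img.shape)))
restored = wiener_deconvolve(blurred, psf)
```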

Exposing and Correcting the Gender Bias in Image Captioning Datasets and Models

Title Exposing and Correcting the Gender Bias in Image Captioning Datasets and Models
Authors Shruti Bhargava, David Forsyth
Abstract The task of image captioning implicitly involves gender identification. However, due to the gender bias in data, gender identification by an image captioning model suffers. Also, the gender-activity bias, owing to the word-by-word prediction, influences other words in the caption prediction, resulting in the well-known problem of label bias. In this work, we investigate gender bias in the COCO captioning dataset and show that it arises not only from the statistical distribution of genders with contexts but also from flawed annotation by the human annotators. We look at the issues created by this bias in the trained models. We propose a technique to remove the bias by splitting the task into two subtasks: gender-neutral image captioning and gender classification. By this decoupling, the gender-context influence can be eliminated. We train a gender-neutral image captioning model, which gives comparable results to a gendered model even when evaluated against a dataset that possesses a similar bias as the training data. Interestingly, the predictions of this model on images with no humans are also visibly different from those of a model trained on gendered captions. We train gender classifiers using the available bounding-box and mask-based annotations for the person in the image. This allows us to remove the context and focus on the person to predict the gender. By substituting the genders into the gender-neutral captions, we obtain the final gendered predictions. Our predictions achieve performance similar to a model trained with gender, and at the same time are devoid of gender bias. Finally, our main result is that on an anti-stereotypical dataset, our model outperforms a popular image captioning model that is trained with gender.
Tasks Image Captioning
Published 2019-12-02
URL https://arxiv.org/abs/1912.00578v1
PDF https://arxiv.org/pdf/1912.00578v1.pdf
PWC https://paperswithcode.com/paper/exposing-and-correcting-the-gender-bias-in
Repo
Framework
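
The two-stage decoupling (a gender-neutral captioner plus a person-crop gender classifier whose prediction is substituted back) can be sketched as below. Both models are placeholder callables and the word lists are illustrative, not the paper's vocabulary.

```python
# Sketch: decouple captioning from gender prediction, then substitute the gender back.
# The captioner and classifier are placeholder callables; word lists are illustrative only.
from typing import Callable

GENDERED_TO_NEUTRAL = {"man": "person", "woman": "person", "he": "they", "she": "they"}
FILL = {"male": "man", "female": "woman"}

def neutralize(caption: str) -> str:
    """Training-time step: strip gendered words so the captioner never learns gender-context bias."""
    return " ".join(GENDERED_TO_NEUTRAL.get(w, w) for w in caption.split())

def gendered_caption(image, person_box,
                     neutral_captioner: Callable, gender_classifier: Callable) -> str:
    neutral = neutral_captioner(image)                  # e.g. "a person riding a bike"
    gender = gender_classifier(image, person_box)       # "male" or "female", from the person crop only
    return neutral.replace("person", FILL[gender], 1)   # substitute the gender back into the caption
```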