October 20, 2019

3444 words 17 mins read

Paper Group AWR 318

BTS-DSN: Deeply Supervised Neural Network with Short Connections for Retinal Vessel Segmentation. cGANs with Projection Discriminator. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. Deep convolutional neural networks for segmenting 3D in vivo multiphoton images of vasculature in Alzheimer disease mouse models. Towards Two- …

BTS-DSN: Deeply Supervised Neural Network with Short Connections for Retinal Vessel Segmentation

Title BTS-DSN: Deeply Supervised Neural Network with Short Connections for Retinal Vessel Segmentation
Authors Song Guo, Kai Wang, Hong Kang, Yujun Zhang, Yingqi Gao, Tao Li
Abstract Background and Objective: The condition of the retinal vessels of the human eye is an important factor in the diagnosis of ophthalmological diseases. Vessel segmentation in fundus images is a challenging task due to complex vessel structure, the presence of similar structures such as microaneurysms and hemorrhages, micro-vessels only one to several pixels wide, and the requirement for fine-grained results. Methods: In this paper, we present a multi-scale deeply supervised network with short connections (BTS-DSN) for vessel segmentation. We use short connections to transfer semantic information between side-output layers. Bottom-top short connections pass low-level semantic information to high levels to refine the high-level side-outputs, and top-bottom short connections pass structural information to low levels to reduce noise in the low-level side-outputs. In addition, we employ cross-training to show that our model is suitable for real-world fundus images. Results: The proposed BTS-DSN has been verified on the DRIVE, STARE and CHASE_DB1 datasets, and showed competitive performance against other state-of-the-art methods. Specifically, with patch-level input, the network achieved 0.7891/0.8212 sensitivity, 0.9804/0.9843 specificity, 0.9806/0.9859 AUC, and 0.8249/0.8421 F1-score on DRIVE and STARE, respectively. Moreover, our model performs better than other methods in cross-training experiments. Conclusions: BTS-DSN achieves competitive performance on the vessel segmentation task on three public datasets and is well suited to vessel segmentation. The source code of our method is available at https://github.com/guomugong/BTS-DSN.
Tasks Retinal Vessel Segmentation
Published 2018-03-11
URL https://arxiv.org/abs/1803.03963v2
PDF https://arxiv.org/pdf/1803.03963v2.pdf
PWC https://paperswithcode.com/paper/deeply-supervised-neural-network-with-short
Repo https://github.com/guomugong/BTS-DSN
Framework none
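
To make the short-connection idea concrete, here is a minimal PyTorch sketch of two side outputs exchanging information in both directions before fusion. It is illustrative only, not the paper's exact VGG-based topology; all module and parameter names are assumptions.

```python
# Minimal sketch of deep supervision with short connections between side
# outputs (illustrative; not the published BTS-DSN architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageDSN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two backbone stages at different resolutions (stand-ins for VGG blocks).
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.MaxPool2d(2),
                                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # 1x1 convs producing single-channel side-output maps.
        self.side1 = nn.Conv2d(32, 1, 1)
        self.side2 = nn.Conv2d(64, 1, 1)
        self.fuse = nn.Conv2d(2, 1, 1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        s1 = self.side1(f1)                                   # shallow side output
        s2 = F.interpolate(self.side2(f2), size=s1.shape[2:],
                           mode='bilinear', align_corners=False)
        s2_refined = s2 + s1    # bottom-top: low-level detail refines the deep output
        s1_denoised = s1 + s2   # top-bottom: deep structure suppresses low-level noise
        fused = self.fuse(torch.cat([s1_denoised, s2_refined], dim=1))
        # Deep supervision: a BCE loss would be applied to each returned map.
        return s1_denoised, s2_refined, fused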

cGANs with Projection Discriminator

Title cGANs with Projection Discriminator
Authors Takeru Miyato, Masanori Koyama
Abstract We propose a novel, projection-based way to incorporate conditional information into the discriminator of GANs that respects the role of the conditional information in the underlying probabilistic model. This approach is in contrast with most frameworks of conditional GANs used in applications today, which use the conditional information by concatenating the (embedded) conditional vector to the feature vectors. With this modification, we were able to significantly improve the quality of class-conditional image generation on the ILSVRC2012 (ImageNet) 1000-class image dataset from the current state-of-the-art result, and we achieved this with a single pair of a discriminator and a generator. We were also able to extend the application to super-resolution and succeeded in producing highly discriminative super-resolution images. This new structure also enabled high-quality category transformation based on parametric functional transformation of conditional batch normalization layers in the generator.
Tasks Conditional Image Generation, Image Generation, Super-Resolution
Published 2018-02-15
URL http://arxiv.org/abs/1802.05637v2
PDF http://arxiv.org/pdf/1802.05637v2.pdf
PWC https://paperswithcode.com/paper/cgans-with-projection-discriminator
Repo https://github.com/DanielLongo/AdversarialTrain
Framework pytorch
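
The projection form replaces concatenation with an inner product between a class embedding and the discriminator features, D(x, y) = psi(phi(x)) + y^T V phi(x). Below is a hedged PyTorch sketch of just the discriminator head; the shared feature extractor phi and the GAN loss are assumed to exist elsewhere, and all names are illustrative.

```python
# Sketch of a projection-style conditional discriminator head:
# D(x, y) = psi(phi(x)) + <embed(y), phi(x)>.
import torch
import torch.nn as nn

class ProjectionDiscriminatorHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.psi = nn.Linear(feat_dim, 1)                  # unconditional term
        self.embed = nn.Embedding(num_classes, feat_dim)   # class embedding matrix V

    def forward(self, phi_x, y):
        # phi_x: (batch, feat_dim) features from a shared discriminator body
        # y: (batch,) integer class labels
        out = self.psi(phi_x).squeeze(1)
        out = out + (self.embed(y) * phi_x).sum(dim=1)     # projection term y^T V phi(x)
        return out   # real-valued logit fed to a GAN loss (e.g., hinge loss)
```

The design choice this illustrates: the class label only modulates the score through a single inner product, rather than being concatenated into every feature vector.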

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

Title Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding
Authors Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang
Abstract We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content are performed with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at each level. The best level is chosen to be compared with text content to maximize the pertinence scores of ground-truth image-sentence pairs. Experiments conducted on three publicly available datasets show significant performance gains (20%-60% relative) over the state-of-the-art in phrase localization and set a new performance record on those datasets. We provide a detailed ablation study to show the contribution of each element of our approach and release our code on GitHub.
Tasks Language Modelling, Phrase Grounding, Sentence Embeddings
Published 2018-11-28
URL https://arxiv.org/abs/1811.11683v2
PDF https://arxiv.org/pdf/1811.11683v2.pdf
PWC https://paperswithcode.com/paper/multi-level-multimodal-common-semantic-space
Repo https://github.com/hassanhub/MultiGrounding
Framework tf
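
A rough sketch of one level of such a common space: visual feature-map locations and a text embedding are passed through separate non-linear mappings, compared with cosine similarity, and attended. This is a simplification under assumed dimensions and names, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceLevel(nn.Module):
    """One level of a common semantic space: non-linear mappings for visual
    features and text embeddings, plus cosine-similarity attention."""
    def __init__(self, vis_dim, txt_dim, common_dim):
        super().__init__()
        self.map_vis = nn.Sequential(nn.Linear(vis_dim, common_dim), nn.ReLU(),
                                     nn.Linear(common_dim, common_dim))
        self.map_txt = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.ReLU(),
                                     nn.Linear(common_dim, common_dim))

    def forward(self, vis_feats, txt_feat):
        # vis_feats: (regions, vis_dim) flattened feature-map locations
        # txt_feat:  (txt_dim,) a phrase or sentence embedding
        v = F.normalize(self.map_vis(vis_feats), dim=-1)
        t = F.normalize(self.map_txt(txt_feat), dim=-1)
        sims = v @ t                        # cosine similarity per location
        attn = torch.softmax(sims, dim=0)   # attention over spatial locations
        attended = (attn.unsqueeze(-1) * v).sum(dim=0)
        score = torch.dot(attended, t)      # pertinence score for this level
        return score, attn
```

In the paper's setting, several such levels are computed and the best-scoring one is used for the image-sentence pertinence objective.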

Deep convolutional neural networks for segmenting 3D in vivo multiphoton images of vasculature in Alzheimer disease mouse models

Title Deep convolutional neural networks for segmenting 3D in vivo multiphoton images of vasculature in Alzheimer disease mouse models
Authors Mohammad Haft-Javaherian, Linjing Fang, Victorine Muse, Chris B. Schaffer, Nozomi Nishimura, Mert R. Sabuncu
Abstract The health and function of tissue rely on its vasculature network to provide reliable blood perfusion. Volumetric imaging approaches, such as multiphoton microscopy, are able to generate detailed 3D images of blood vessels that could contribute to our understanding of the role of vascular structure in normal physiology and in disease mechanisms. The segmentation of vessels, a core image analysis problem, is a bottleneck that has prevented the systematic comparison of 3D vascular architecture across experimental populations. We explored the use of convolutional neural networks to segment 3D vessels within volumetric in vivo images acquired by multiphoton microscopy. We evaluated different network architectures and machine learning techniques in the context of this segmentation problem. We show that our optimized convolutional neural network architecture, which we call DeepVess, yielded a segmentation accuracy that was better than both the current state-of-the-art and a trained human annotator, while also being orders of magnitude faster. To explore the effects of aging and Alzheimer’s disease on capillaries, we applied DeepVess to 3D images of cortical blood vessels in young and old mouse models of Alzheimer’s disease and wild type littermates. We found little difference in the distribution of capillary diameter or tortuosity between these groups, but did note a decrease in the number of longer capillary segments ($>75\mu m$) in aged animals as compared to young, in both wild type and Alzheimer’s disease mouse models.
Tasks
Published 2018-01-03
URL http://arxiv.org/abs/1801.00880v4
PDF http://arxiv.org/pdf/1801.00880v4.pdf
PWC https://paperswithcode.com/paper/deep-convolutional-neural-networks-for-7
Repo https://github.com/cornellneuronex/DeepVess-1
Framework tf
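
For readers unfamiliar with volumetric CNNs, the sketch below shows the general shape of a 3D convolutional segmenter operating on image patches. It is a generic stand-in, not the published DeepVess architecture; layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Generic 3D CNN producing a per-voxel foreground/background score for a
# volumetric patch (illustrative only; not the published DeepVess network).
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(32, 1, kernel_size=1),       # per-voxel logit
)

patch = torch.randn(1, 1, 32, 64, 64)      # (batch, channel, depth, height, width)
logits = model(patch)                      # same spatial size; trained with a BCE loss
print(logits.shape)                        # torch.Size([1, 1, 32, 64, 64])
```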

Towards Two-Dimensional Sequence to Sequence Model in Neural Machine Translation

Title Towards Two-Dimensional Sequence to Sequence Model in Neural Machine Translation
Authors Parnia Bahar, Christopher Brix, Hermann Ney
Abstract This work investigates an alternative model for neural machine translation (NMT) and proposes a novel architecture, where we employ a multi-dimensional long short-term memory (MDLSTM) for translation modeling. In the state-of-the-art methods, source and target sentences are treated as one-dimensional sequences over time, while we view translation as a two-dimensional (2D) mapping using an MDLSTM layer to define the correspondence between source and target words. We extend beyond the current sequence to sequence backbone NMT models to a 2D structure in which the source and target sentences are aligned with each other in a 2D grid. Our proposed topology shows consistent improvements over attention-based sequence to sequence model on two WMT 2017 tasks, German$\leftrightarrow$English.
Tasks Machine Translation
Published 2018-10-09
URL http://arxiv.org/abs/1810.03975v1
PDF http://arxiv.org/pdf/1810.03975v1.pdf
PWC https://paperswithcode.com/paper/towards-two-dimensional-sequence-to-sequence
Repo https://github.com/FlorianPfisterer/2D-LSTM-Seq2Seq
Framework pytorch
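
The core of the 2D formulation is a recurrence over a (target position, source position) grid in which each cell receives the states of its left and lower neighbours. The sketch below uses a standard LSTMCell and simply sums the two predecessor states; a true MDLSTM combines both predecessor cell states inside the gate equations, so treat this as a simplified illustration with assumed names and sizes.

```python
import torch
import torch.nn as nn

def grid_lstm_states(src_emb, tgt_emb, cell):
    """Minimal 2D recurrence over a (target j, source i) grid. State (j, i)
    receives the states from (j, i-1) and (j-1, i); here the predecessor
    states are summed, which simplifies the MDLSTM gating of the paper."""
    J, I, d = tgt_emb.size(0), src_emb.size(0), cell.hidden_size
    h = [[None] * I for _ in range(J)]
    c = [[None] * I for _ in range(J)]
    zero = torch.zeros(1, d)
    for j in range(J):
        for i in range(I):
            h_left, c_left = (h[j][i-1], c[j][i-1]) if i > 0 else (zero, zero)
            h_down, c_down = (h[j-1][i], c[j-1][i]) if j > 0 else (zero, zero)
            x = torch.cat([src_emb[i], tgt_emb[j]]).unsqueeze(0)
            h[j][i], c[j][i] = cell(x, (h_left + h_down, c_left + c_down))
    return h   # h[j][-1] can feed a softmax over the target vocabulary

emb = 8
cell = nn.LSTMCell(2 * emb, 16)
states = grid_lstm_states(torch.randn(5, emb), torch.randn(7, emb), cell)
```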

Towards High Performance Video Object Detection for Mobiles

Title Towards High Performance Video Object Detection for Mobiles
Authors Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen Wei, Lu Yuan
Abstract Despite the recent success of video object detection on desktop GPUs, its architecture is still far too heavy for mobiles. It is also unclear whether the key principles of sparse feature propagation and multi-frame feature aggregation apply at very limited computational resources. In this paper, we present a lightweight network architecture for video object detection on mobiles. A lightweight image object detector is applied on sparse key frames. A very small network, Light Flow, is designed for establishing correspondence across frames. A flow-guided GRU module is designed to effectively aggregate features on key frames. For non-key frames, sparse feature propagation is performed. The whole network can be trained end-to-end. The proposed system achieves a 60.2% mAP score at a speed of 25.6 fps on mobiles (e.g., HuaWei Mate 8).
Tasks Object Detection, Video Object Detection
Published 2018-04-16
URL http://arxiv.org/abs/1804.05830v1
PDF http://arxiv.org/pdf/1804.05830v1.pdf
PWC https://paperswithcode.com/paper/towards-high-performance-video-object
Repo https://github.com/stanlee321/LightFlow-TensorFlow
Framework tf
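
Sparse feature propagation amounts to warping key-frame features to the current frame along the estimated flow. Below is a hedged PyTorch sketch of bilinear feature warping with `grid_sample`; the flow tensor layout (channel 0 = x offset, channel 1 = y offset, in pixels) is an assumption of this sketch, and `indexing="ij"` requires a reasonably recent PyTorch.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp key-frame features to the current frame along estimated flow.
    feat: (N, C, H, W); flow: (N, 2, H, W) pixel offsets (x in channel 0,
    y in channel 1 -- an assumed layout for this sketch)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=feat.dtype),
                            torch.arange(w, dtype=feat.dtype), indexing="ij")
    x_new = xs.unsqueeze(0) + flow[:, 0]           # where to sample from
    y_new = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as required by grid_sample (order: x, then y).
    grid = torch.stack([2 * x_new / (w - 1) - 1,
                        2 * y_new / (h - 1) - 1], dim=-1)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

feat = torch.randn(1, 64, 32, 32)
flow = torch.zeros(1, 2, 32, 32)        # zero flow reproduces the input features
assert torch.allclose(warp_features(feat, flow), feat, atol=1e-5)
```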

Visual Coreference Resolution in Visual Dialog using Neural Module Networks

Title Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Authors Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
Abstract Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., 'it'), as the dialog agent must first link it to a previous coreference (e.g., 'boat'), and only then can rely on the visual grounding of the coreference 'boat' to reason about the pronoun 'it'. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question; and not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog by introducing two novel modules - Refer and Exclude - that perform explicit, grounded, coreference resolution at a finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-wise complex dataset, by achieving near perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches, and is more interpretable, grounded, and consistent qualitatively.
Tasks Coreference Resolution, Visual Dialog, Visual Question Answering
Published 2018-09-06
URL http://arxiv.org/abs/1809.01816v1
PDF http://arxiv.org/pdf/1809.01816v1.pdf
PWC https://paperswithcode.com/paper/visual-coreference-resolution-in-visual
Repo https://github.com/facebookresearch/corefnmn
Framework tf

DOSED: a deep learning approach to detect multiple sleep micro-events in EEG signal

Title DOSED: a deep learning approach to detect multiple sleep micro-events in EEG signal
Authors Stanislas Chambon, Valentin Thorey, Pierrick J. Arnal, Emmanuel Mignot, Alexandre Gramfort
Abstract Background: Electroencephalography (EEG) monitors brain activity during sleep and is used to identify sleep disorders. In sleep medicine, clinicians interpret raw EEG signals in so-called sleep stages, which are assigned by experts to every 30s window of signal. For diagnosis, they also rely on shorter prototypical micro-architecture events which exhibit variable durations and shapes, such as spindles, K-complexes or arousals. Annotating such events is traditionally performed by a trained sleep expert, making the process time-consuming, tedious and subject to inter-scorer variability. To automate this procedure, various methods have been developed, yet these are event-specific and rely on the extraction of hand-crafted features. New method: We propose a novel deep learning architecture called Dreem One Shot Event Detector (DOSED). DOSED jointly predicts locations, durations and types of events in EEG time series. The proposed approach, applied here on sleep-related micro-architecture events, is inspired by object detectors developed for computer vision such as YOLO and SSD. It relies on a convolutional neural network that builds a feature representation from raw EEG signals, as well as two modules performing localization and classification respectively. Results and comparison with other methods: The proposed approach is tested on 4 datasets and 3 types of events (spindles, K-complexes, arousals) and compared to the current state-of-the-art detection algorithms. Conclusions: Results demonstrate the versatility of this new approach and improved performance compared to the current state-of-the-art detection methods.
Tasks EEG, K-complex detection, Sleep apnea detection, Sleep Arousal Detection, Sleep Micro-event detection, Sleep Quality, Spindle Detection, Time Series
Published 2018-12-07
URL http://arxiv.org/abs/1812.04079v1
PDF http://arxiv.org/pdf/1812.04079v1.pdf
PWC https://paperswithcode.com/paper/dosed-a-deep-learning-approach-to-detect
Repo https://github.com/Dreem-Organization/dosed
Framework pytorch
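
Structurally, an SSD-style 1D event detector has a convolutional backbone over the raw signal and, per feature-map position, a localization head (center/duration offsets for each default event) and a classification head. The sketch below illustrates this shape with made-up sizes; it is not the released DOSED code.

```python
import torch
import torch.nn as nn

class TinyEventDetector(nn.Module):
    """SSD/DOSED-style 1D detector sketch: conv backbone over raw EEG, then
    per-default-event localization and classification heads (sizes illustrative)."""
    def __init__(self, n_channels=1, n_classes=3, defaults_per_position=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(n_channels, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, 7, stride=2, padding=3), nn.ReLU(),
        )
        k = defaults_per_position
        self.loc_head = nn.Conv1d(32, k * 2, 3, padding=1)                # (center, duration) offsets
        self.cls_head = nn.Conv1d(32, k * (n_classes + 1), 3, padding=1)  # +1 for background

    def forward(self, x):                       # x: (batch, channels, samples)
        feats = self.backbone(x)
        loc = self.loc_head(feats)              # regressed offsets per default event
        cls = self.cls_head(feats)              # class scores per default event
        return loc, cls

model = TinyEventDetector()
loc, cls = model(torch.randn(4, 1, 3000))       # e.g. 30 s of EEG at 100 Hz
```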

Relational Deep Reinforcement Learning

Title Relational Deep Reinforcement Learning
Authors Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, Peter Battaglia
Abstract We introduce an approach for deep reinforcement learning (RL) that improves upon the efficiency, generalization capacity, and interpretability of conventional approaches through structured perception and relational reasoning. It uses self-attention to iteratively reason about the relations between entities in a scene and to guide a model-free policy. Our results show that in a novel navigation and planning task called Box-World, our agent finds interpretable solutions that improve upon baselines in terms of sample complexity, ability to generalize to more complex scenes than experienced during training, and overall performance. In the StarCraft II Learning Environment, our agent achieves state-of-the-art performance on six mini-games – surpassing human grandmaster performance on four. By considering architectural inductive biases, our work opens new directions for overcoming important, but stubborn, challenges in deep RL.
Tasks Relational Reasoning, Starcraft, Starcraft II
Published 2018-06-05
URL http://arxiv.org/abs/1806.01830v2
PDF http://arxiv.org/pdf/1806.01830v2.pdf
PWC https://paperswithcode.com/paper/relational-deep-reinforcement-learning
Repo https://github.com/nathangrinsztajn/Box-World
Framework none
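
The relational core is self-attention over a set of entity vectors (e.g., CNN feature-map positions with coordinates appended), iterated and then pooled for the policy and value heads. A simplified PyTorch block is sketched below; it is not DeepMind's implementation and the residual/MLP arrangement is an assumption.

```python
import torch
import torch.nn as nn

class RelationalBlock(nn.Module):
    """Self-attention over entity vectors (simplified relational block)."""
    def __init__(self, dim, heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, entities):                # (batch, n_entities, dim)
        attended, _ = self.attn(entities, entities, entities)
        return self.norm(entities + self.mlp(attended))    # residual update

# Entities could be feature-map positions with (x, y) coordinates appended.
block = RelationalBlock(dim=32)
updated = block(torch.randn(8, 49, 32))          # e.g. a 7x7 feature map -> 49 entities
# The updated entities are pooled and fed to the policy/value heads.
```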

Transformation Networks for Target-Oriented Sentiment Classification

Title Transformation Networks for Target-Oriented Sentiment Classification
Authors Xin Li, Lidong Bing, Wai Lam, Bei Shi
Abstract Target-oriented sentiment classification aims at classifying sentiment polarities over individual opinion targets in a sentence. RNNs with attention seem a good fit for the characteristics of this task, and indeed they achieve the state-of-the-art performance. After re-examining the drawbacks of the attention mechanism and the obstacles that prevent CNNs from performing well in this classification task, we propose a new model to overcome these issues. Instead of attention, our model employs a CNN layer to extract salient features from the transformed word representations originating from a bi-directional RNN layer. Between the two layers, we propose a component to generate target-specific representations of words in the sentence, while incorporating a mechanism for preserving the original contextual information from the RNN layer. Experiments show that our model achieves a new state-of-the-art performance on a few benchmarks.
Tasks Aspect-Based Sentiment Analysis
Published 2018-05-03
URL http://arxiv.org/abs/1805.01086v1
PDF http://arxiv.org/pdf/1805.01086v1.pdf
PWC https://paperswithcode.com/paper/transformation-networks-for-target-oriented
Repo https://github.com/lixin4ever/TNet
Framework none
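
The target-specific component can be pictured as: each word representation attends over the target words, is transformed jointly with the attended target vector, and the original representation is added back so that context is preserved ("lossless forwarding"). The sketch below is a simplified rendering of that idea with assumed shapes, not the released TNet code.

```python
import torch
import torch.nn as nn

class TargetSpecificTransform(nn.Module):
    """Sketch of a target-specific transformation with lossless forwarding:
    attend over target words, transform, then add the original word vector
    back to preserve contextual information (simplified from the paper)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, words, target):
        # words: (n_words, dim) from a BiLSTM; target: (n_target_words, dim)
        attn = torch.softmax(words @ target.t(), dim=-1)        # (n_words, n_target)
        tgt_per_word = attn @ target                            # attended target vectors
        transformed = torch.tanh(self.proj(torch.cat([words, tgt_per_word], dim=-1)))
        return transformed + words                              # lossless forwarding

# Downstream, a 1D CNN over the transformed sequence extracts the salient
# features fed to the sentiment classifier.
layer = TargetSpecificTransform(dim=64)
out = layer(torch.randn(12, 64), torch.randn(2, 64))
```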

Towards an efficient deep learning model for musical onset detection

Title Towards an efficient deep learning model for musical onset detection
Authors Rong Gong, Xavier Serra
Abstract In this paper, we propose an efficient and reproducible deep learning model for musical onset detection (MOD). We first review the state-of-the-art deep learning models for MOD and identify their shortcomings and challenges: (i) the lack of hyper-parameter tuning details, (ii) the non-availability of code for training models on other datasets, and (iii) ignoring the network capability when comparing different architectures. Taking the above issues into account, we experiment with seven deep learning architectures. The most efficient one achieves performance equivalent to our implementation of the state-of-the-art architecture, while having only 28.3% of its total number of trainable parameters. Our experiments are conducted on two different datasets: one consists mainly of instrumental music excerpts, and the other, developed by ourselves, includes only solo singing voice excerpts. Further, inter-dataset transfer learning experiments are conducted. The results show that a model pre-trained on one dataset fails to detect onsets on the other, which underlines the importance of providing the implementation code so that the model can be re-trained for a different dataset. Datasets, code and a Jupyter notebook running on Google Colab are publicly available to make this research understandable and easy to reproduce.
Tasks Transfer Learning
Published 2018-06-18
URL http://arxiv.org/abs/1806.06773v2
PDF http://arxiv.org/pdf/1806.06773v2.pdf
PWC https://paperswithcode.com/paper/180606773
Repo https://github.com/YingjingLu/Music_Onset
Framework none

Recurrent Predictive State Policy Networks

Title Recurrent Predictive State Policy Networks
Authors Ahmed Hefny, Zita Marinho, Wen Sun, Siddhartha Srinivasa, Geoffrey Gordon
Abstract We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions, to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz and Gordon, 2004; Sun et al., 2016) by modeling the predictive state: a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2018) to initialize the recursive filter. Predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behaviour. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks under partial observability on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks perform well compared with memory-preserving networks such as GRUs, as well as finite memory models, being the overall best performing method.
Tasks
Published 2018-03-05
URL http://arxiv.org/abs/1803.01489v1
PDF http://arxiv.org/pdf/1803.01489v1.pdf
PWC https://paperswithcode.com/paper/recurrent-predictive-state-policy-networks
Repo https://github.com/ahefnycmu/rpsp
Framework none

DocFace+: ID Document to Selfie Matching

Title DocFace+: ID Document to Selfie Matching
Authors Yichun Shi, Anil K. Jain
Abstract Numerous activities in our daily life require us to verify who we are by showing our ID documents containing face images, such as passports and driver licenses, to human operators. However, this process is slow, labor intensive and unreliable. As such, an automated system for matching ID document photos to live face images (selfies) in real time and with high accuracy is required. In this paper, we propose DocFace+ to meet this objective. We first show that gradient-based optimization methods converge slowly (due to the underfitting of classifier weights) when many classes have very few samples, a characteristic of existing ID-selfie datasets. To overcome this shortcoming, we propose a method, called dynamic weight imprinting (DWI), to update the classifier weights, which allows faster convergence and more generalizable representations. Next, a pair of sibling networks with partially shared parameters are trained to learn a unified face representation with domain-specific parameters. Cross-validation on an ID-selfie dataset shows that while a publicly available general face matcher (SphereFace) only achieves a True Accept Rate (TAR) of 59.29±1.55% at a False Accept Rate (FAR) of 0.1% on the problem, DocFace+ improves the TAR to 97.51±0.40%.
Tasks
Published 2018-09-15
URL http://arxiv.org/abs/1809.05620v2
PDF http://arxiv.org/pdf/1809.05620v2.pdf
PWC https://paperswithcode.com/paper/docface-id-document-to-selfie-matching
Repo https://github.com/seasonSH/DocFace
Framework tf
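
Weight imprinting sidesteps slow gradient updates of the classifier weights by setting each class weight from the features of that class's samples. The sketch below shows the basic mechanics of such an update; the exact averaging scheme across training steps used in DocFace+ is omitted, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dynamic_weight_imprinting(weights, features, labels):
    """Imprinting-style update: replace each class's classifier weight with the
    normalized mean embedding of that class's samples in the current batch
    (a simplification of the paper's DWI; use .data / no_grad for an nn.Parameter)."""
    feats = F.normalize(features, dim=1)
    for c in labels.unique():
        proto = feats[labels == c].mean(dim=0)
        weights[c] = F.normalize(proto, dim=0)
    return weights

# weights: (num_classes, emb_dim) classifier matrix; features from the face encoder
weights = torch.randn(1000, 128)
weights = dynamic_weight_imprinting(weights, torch.randn(32, 128),
                                    torch.randint(0, 1000, (32,)))
```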

On the Intrinsic Dimensionality of Image Representations

Title On the Intrinsic Dimensionality of Image Representations
Authors Sixue Gong, Vishnu Naresh Boddeti, Anil K. Jain
Abstract This paper addresses the following questions pertaining to the intrinsic dimensionality of any given image representation: (i) estimate its intrinsic dimensionality, (ii) develop a deep neural network based non-linear mapping, dubbed DeepMDS, that transforms the ambient representation to the minimal intrinsic space, and (iii) validate the veracity of the mapping through image matching in the intrinsic space. Experiments on benchmark image datasets (LFW, IJB-C and ImageNet-100) reveal that the intrinsic dimensionality of deep neural network representations is significantly lower than the dimensionality of the ambient features. For instance, SphereFace’s 512-dim face representation and ResNet’s 512-dim image representation have an intrinsic dimensionality of 16 and 19 respectively. Further, the DeepMDS mapping is able to obtain a representation of significantly lower dimensionality while maintaining discriminative ability to a large extent, 59.75% TAR @ 0.1% FAR in 16-dim vs 71.26% TAR in 512-dim on IJB-C and a Top-1 accuracy of 77.0% at 19-dim vs 83.4% at 512-dim on ImageNet-100.
Tasks
Published 2018-03-26
URL http://arxiv.org/abs/1803.09672v2
PDF http://arxiv.org/pdf/1803.09672v2.pdf
PWC https://paperswithcode.com/paper/on-the-intrinsic-dimensionality-of-face
Repo https://github.com/ansuini/IntrinsicDimDeep
Framework pytorch
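
The DeepMDS mapping is trained with a multidimensional-scaling style objective: pairwise distances in the low-dimensional space should match those in the ambient space. A minimal sketch of that loss and a small projection network follows; the paper's staged training across intermediate dimensions is omitted, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# Small non-linear projection from an ambient (512-dim) to an intrinsic (16-dim) space.
projector = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 16))

def mds_loss(ambient, projected):
    """MDS-style objective: preserve pairwise distances between the ambient
    and projected representations (simplified from the DeepMDS setup)."""
    d_amb = torch.cdist(ambient, ambient)
    d_proj = torch.cdist(projected, projected)
    return ((d_amb - d_proj) ** 2).mean()

feats = torch.randn(64, 512)             # a batch of ambient face/image features
loss = mds_loss(feats, projector(feats))
loss.backward()
```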

Learning to Adapt Structured Output Space for Semantic Segmentation

Title Learning to Adapt Structured Output Space for Semantic Segmentation
Authors Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, Manmohan Chandraker
Abstract Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor intensive, developing algorithms that can adapt source ground truth labels to the target domain is of great interest. In this paper, we propose an adversarial learning method for domain adaptation in the context of semantic segmentation. Considering semantic segmentations as structured outputs that contain spatial similarities between the source and target domains, we adopt adversarial learning in the output space. To further enhance the adapted model, we construct a multi-level adversarial network to effectively perform output space domain adaptation at different feature levels. Extensive experiments and an ablation study are conducted under various domain adaptation settings, including synthetic-to-real and cross-city scenarios. We show that the proposed method performs favorably against the state-of-the-art methods in terms of accuracy and visual quality.
Tasks Domain Adaptation, Image-to-Image Translation, Semantic Segmentation, Synthetic-to-Real Translation
Published 2018-02-28
URL https://arxiv.org/abs/1802.10349v2
PDF https://arxiv.org/pdf/1802.10349v2.pdf
PWC https://paperswithcode.com/paper/learning-to-adapt-structured-output-space-for
Repo https://github.com/lym29/DASeg
Framework pytorch
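
Output-space adaptation means the discriminator looks at softmax segmentation maps rather than intermediate features, and the segmentation network is trained to make its target-domain outputs indistinguishable from source-domain outputs. The single-level sketch below shows how the three losses fit together; `seg_net` and `disc` are hypothetical modules, and the multi-level variant of the paper is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCEWithLogitsLoss()

def adaptation_losses(seg_net, disc, src_img, src_label, tgt_img):
    """Output-space adversarial adaptation sketch (single level). seg_net returns
    per-pixel class logits; disc maps a softmax segmentation map to real/fake logits."""
    src_logits = seg_net(src_img)
    seg_loss = F.cross_entropy(src_logits, src_label)        # supervised on source

    tgt_prob = torch.softmax(seg_net(tgt_img), dim=1)
    d_tgt = disc(tgt_prob)
    # Adversarial term: push target outputs to look like source outputs.
    adv_loss = bce(d_tgt, torch.ones_like(d_tgt))

    # Discriminator term: source outputs labeled 1, target outputs labeled 0.
    d_src = disc(torch.softmax(src_logits, dim=1).detach())
    disc_loss = bce(d_src, torch.ones_like(d_src)) + \
                bce(disc(tgt_prob.detach()), torch.zeros_like(d_tgt))
    return seg_loss, adv_loss, disc_loss
```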