October 21, 2019

3271 words 16 mins read

Paper Group AWR 14

Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing. A Fusion Approach for Multi-Frame Optical Flow Estimation. SpotTune: Transfer Learning through Adaptive Fine-tuning. Automatic Instrument Segmentation in Robot-Assisted Surgery Using Deep Learning. Counterfactually Fair Prediction U …

Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing

Title Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing
Authors Jian Zhao, Jianshu Li, Yu Cheng, Li Zhou, Terence Sim, Shuicheng Yan, Jiashi Feng
Abstract Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily at visually understanding humans in crowded scenes, a capability required by applications such as group behavior analysis, person re-identification and autonomous driving. To this end, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which has recently been defined as the multi-human parsing task. In this paper, we present a new large-scale database, “Multi-Human Parsing (MHP)”, for algorithm development and evaluation, and advance the state-of-the-art in understanding humans in crowded scenes. MHP contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels, involving 2-26 persons per image, captured in real-world scenes with various viewpoints, poses, occlusions, interactions and backgrounds. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive future research on multi-human parsing.
Tasks Autonomous Driving, Human Parsing, Instance Segmentation, Multi-Human Parsing, Person Re-Identification, Saliency Prediction, Semantic Segmentation
Published 2018-04-10
URL http://arxiv.org/abs/1804.03287v3
PDF http://arxiv.org/pdf/1804.03287v3.pdf
PWC https://paperswithcode.com/paper/understanding-humans-in-crowded-scenes-deep
Repo https://github.com/ZhaoJ9014/Multi-Human-Parsing
Framework tf
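
The nesting the abstract describes, with each sub-net conditioning on the previous one's output, can be made concrete with a minimal sketch. Everything below is my reading of the abstract, not the authors' TensorFlow code: the single-conv stages, channel counts, and embedding dimension are placeholders, and the GAN discriminators attached to each stage are omitted.

```python
import torch
import torch.nn as nn

class NestedParser(nn.Module):
    """Sketch of NAN's nested structure: saliency -> instance-agnostic
    parsing -> instance-aware clustering, each stage consuming the
    image plus all earlier outputs. Backbones are placeholders."""
    def __init__(self, n_classes=58, emb_dim=8):
        super().__init__()
        self.saliency = nn.Conv2d(3, 1, 3, padding=1)
        self.parsing = nn.Conv2d(3 + 1, n_classes, 3, padding=1)
        self.clustering = nn.Conv2d(3 + 1 + n_classes, emb_dim, 3, padding=1)

    def forward(self, img):
        sal = torch.sigmoid(self.saliency(img))               # saliency map
        par = self.parsing(torch.cat([img, sal], 1))          # part logits
        emb = self.clustering(torch.cat([img, sal, par], 1))  # instance embedding
        return sal, par, emb
```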

A Fusion Approach for Multi-Frame Optical Flow Estimation

Title A Fusion Approach for Multi-Frame Optical Flow Estimation
Authors Zhile Ren, Orazio Gallo, Deqing Sun, Ming-Hsuan Yang, Erik B. Sudderth, Jan Kautz
Abstract To date, top-performing optical flow estimation methods only take pairs of consecutive frames into account. While elegant and appealing, the idea of using more than two frames has not yet produced state-of-the-art results. We present a simple, yet effective fusion approach for multi-frame optical flow that benefits from longer-term temporal cues. Our method first warps the optical flow from previous frames to the current frame, thereby yielding multiple plausible estimates. It then fuses the complementary information carried by these estimates into a new optical flow field. At the time of writing, our method ranks first among published results on the MPI Sintel and KITTI 2015 benchmarks. Our models will be available on https://github.com/NVlabs/PWC-Net.
Tasks Optical Flow Estimation
Published 2018-10-23
URL http://arxiv.org/abs/1810.10066v2
PDF http://arxiv.org/pdf/1810.10066v2.pdf
PWC https://paperswithcode.com/paper/a-fusion-approach-for-multi-frame-optical
Repo https://github.com/NVlabs/PWC-Net
Framework pytorch
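
The core warping step the abstract mentions is straightforward to sketch with PyTorch's grid_sample. This is a hedged illustration, not the released PWC-Net code: the final averaging is only a placeholder for the fusion network the paper learns, and occlusion handling is ignored.

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp tensor x (N,C,H,W) by flow (N,2,H,W) in pixels."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(x.device)   # (2,H,W)
    coords = grid.unsqueeze(0) + flow                          # sample coords
    cx = 2 * coords[:, 0] / (w - 1) - 1                        # normalize to [-1,1]
    cy = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(x, torch.stack([cx, cy], dim=-1), align_corners=True)

def fuse(flow_prev, flow_cur):
    # propagate the previous frame's flow to the current frame as an
    # extra candidate, then combine (the paper fuses with a network)
    candidate = warp(flow_prev, flow_cur)
    return 0.5 * (candidate + flow_cur)
```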

SpotTune: Transfer Learning through Adaptive Fine-tuning

Title SpotTune: Transfer Learning through Adaptive Fine-tuning
Authors Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, Rogerio Feris
Abstract Transfer learning, which allows a source task to affect the inductive bias of the target task, is widely used in computer vision. The typical way of conducting transfer learning with deep neural networks is to fine-tune a model pre-trained on the source task using data from the target task. In this paper, we propose an adaptive fine-tuning approach, called SpotTune, which finds the optimal fine-tuning strategy per instance for the target data. In SpotTune, given an image from the target task, a policy network is used to make routing decisions on whether to pass the image through the fine-tuned layers or the pre-trained layers. We conduct extensive experiments to demonstrate the effectiveness of the proposed approach. Our method outperforms the traditional fine-tuning approach on 12 out of 14 standard datasets. We also compare SpotTune with other state-of-the-art fine-tuning strategies, showing superior performance. On the Visual Decathlon datasets, our method achieves the highest score across the board without bells and whistles.
Tasks Transfer Learning
Published 2018-11-21
URL http://arxiv.org/abs/1811.08737v1
PDF http://arxiv.org/pdf/1811.08737v1.pdf
PWC https://paperswithcode.com/paper/spottune-transfer-learning-through-adaptive
Repo https://github.com/gyhui14/spottune
Framework pytorch
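
The per-instance routing the abstract describes can be sketched as a wrapper around one network block. This is an assumption-laden illustration rather than the released code: the hard routing decision is relaxed with Gumbel-softmax, and the policy here looks at the block's input features, whereas the paper's policy network sees the image.

```python
import torch
import torch.nn as nn

class SpotTuneLayer(nn.Module):
    """Route each instance through a frozen pre-trained block or its
    fine-tuned copy, keeping the decision differentiable."""
    def __init__(self, frozen_block, tuned_block, policy):
        super().__init__()
        self.frozen = frozen_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)           # pre-trained weights stay fixed
        self.tuned = tuned_block              # fine-tuned copy
        self.policy = policy                  # pooled features -> 2 logits

    def forward(self, x):
        logits = self.policy(x.mean(dim=(2, 3)))                   # (N, 2)
        d = nn.functional.gumbel_softmax(logits, hard=True)[:, 1]  # 0/1 route
        d = d.view(-1, 1, 1, 1)
        return d * self.tuned(x) + (1 - d) * self.frozen(x)
```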

Automatic Instrument Segmentation in Robot-Assisted Surgery Using Deep Learning

Title Automatic Instrument Segmentation in Robot-Assisted Surgery Using Deep Learning
Authors Alexey Shvets, Alexander Rakhlin, Alexandr A. Kalinin, Vladimir Iglovikov
Abstract Semantic segmentation of robotic instruments is an important problem for robot-assisted surgery. One of the main challenges is to correctly detect an instrument’s position for tracking and pose estimation in the vicinity of surgical scenes. Accurate pixel-wise instrument segmentation is needed to address this challenge. In this paper we describe our winning solution for the MICCAI 2017 Endoscopic Vision SubChallenge: Robotic Instrument Segmentation. Our approach demonstrates an improvement over the state-of-the-art results using several novel deep neural network architectures. We address the binary segmentation problem, where every pixel in an image from the surgery video feed is labeled as instrument or background. In addition, we solve a multi-class segmentation problem, where we distinguish different instruments or different parts of an instrument from the background. In this setting, our approach outperforms other methods in every task subcategory for automatic instrument segmentation, thereby providing a state-of-the-art solution for this problem. The source code for our solution is made publicly available at https://github.com/ternaus/robot-surgery-segmentation.
Tasks Pose Estimation, Semantic Segmentation
Published 2018-03-03
URL http://arxiv.org/abs/1803.01207v2
PDF http://arxiv.org/pdf/1803.01207v2.pdf
PWC https://paperswithcode.com/paper/automatic-instrument-segmentation-in-robot
Repo https://github.com/ternaus/robot-surgery-segmentation
Framework pytorch
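
For binary instrument segmentation, work in this line commonly trains with binary cross-entropy combined with a log soft-Jaccard term. The sketch below is my approximation of such a loss; the weight w and the exact formulation in the authors' code may differ.

```python
import torch

def bce_log_jaccard(logits, target, w=0.3, eps=1e-7):
    """BCE minus a weighted log of the soft Jaccard index.
    logits, target: tensors of the same shape, target in {0, 1}."""
    prob = torch.sigmoid(logits)
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)
    inter = (prob * target).sum()
    union = prob.sum() + target.sum() - inter
    jaccard = (inter + eps) / (union + eps)   # soft IoU in (0, 1]
    return bce - w * torch.log(jaccard)
```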

Counterfactually Fair Prediction Using Multiple Causal Models

Title Counterfactually Fair Prediction Using Multiple Causal Models
Authors Fabio Massimo Zennaro, Magdalena Ivanovska
Abstract In this paper we study the problem of making predictions using multiple structural causal models defined by different agents, under the constraint that the prediction satisfies the criterion of counterfactual fairness. Relying on the frameworks of causality, fairness and opinion pooling, we build upon and extend previous work focusing on the qualitative aggregation of causal Bayesian networks and causal models. In order to complement previous qualitative results, we devise a method based on Monte Carlo simulations. This method enables a decision-maker to aggregate the outputs of the causal models provided by different experts while guaranteeing the counterfactual fairness of the result. We demonstrate our approach on a simple, yet illustrative, toy case study.
Tasks
Published 2018-10-01
URL http://arxiv.org/abs/1810.00694v1
PDF http://arxiv.org/pdf/1810.00694v1.pdf
PWC https://paperswithcode.com/paper/counterfactually-fair-prediction-using
Repo https://github.com/FMZennaro/Fair-Pooling-Causal-Models
Framework tf
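
A Monte Carlo aggregation of the kind the abstract outlines might look like the sketch below. The `sample_counterfactual` interface, the pooling weights, and the averaging over counterfactual protected values are all hypothetical names and assumptions on my part, not the authors' API.

```python
import numpy as np

def pooled_counterfactual_prediction(models, x, protected_values, n=1000,
                                     weights=None):
    """Linear opinion pool over expert causal models. Each model is
    assumed (hypothetically) to expose sample_counterfactual(x, a, rng)
    returning a predicted outcome under protected attribute value a."""
    rng = np.random.default_rng(0)
    weights = weights or [1.0 / len(models)] * len(models)
    pooled = 0.0
    for m, w in zip(models, weights):
        # average over counterfactual protected values, so the pooled
        # prediction cannot depend on the individual's actual attribute
        samples = [m.sample_counterfactual(x, a, rng)
                   for a in protected_values for _ in range(n)]
        pooled += w * np.mean(samples)
    return pooled
```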

Mapping Images to Psychological Similarity Spaces Using Neural Networks

Title Mapping Images to Psychological Similarity Spaces Using Neural Networks
Authors Lucas Bechberger, Elektra Kypridemou
Abstract The cognitive framework of conceptual spaces bridges the gap between symbolic and subsymbolic AI by proposing an intermediate conceptual layer where knowledge is represented geometrically. There are two main approaches for obtaining the dimensions of this conceptual similarity space: using similarity ratings from psychological experiments and using machine learning techniques. In this paper, we propose a combination of both approaches by using psychologically derived similarity ratings to constrain the machine learning process. This way, a mapping from stimuli to conceptual spaces can be learned that is both supported by psychological data and allows generalization to unseen stimuli. The results of a first feasibility study support our proposed approach.
Tasks
Published 2018-04-20
URL https://arxiv.org/abs/1804.07758v2
PDF https://arxiv.org/pdf/1804.07758v2.pdf
PWC https://paperswithcode.com/paper/mapping-images-to-psychological-similarity
Repo https://github.com/lbechberger/LearningPsychologicalSpaces
Framework tf
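
The mapping itself can be as simple as a regularized linear regression from pre-trained CNN features to coordinates of the psychological similarity space. The sketch below assumes ridge regression; the paper's exact regression variant and feature extractor may differ.

```python
import numpy as np

def fit_mapping(features, space_coords, lam=1.0):
    """Ridge regression from image features (n, d) to similarity-space
    coordinates (n, k), e.g., obtained via MDS on similarity ratings."""
    d = features.shape[1]
    A = features.T @ features + lam * np.eye(d)
    W = np.linalg.solve(A, features.T @ space_coords)  # (d, k) weights
    return W

def map_image(features, W):
    """Predict positions in the similarity space for unseen stimuli."""
    return features @ W
```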

PyramidBox: A Context-assisted Single Shot Face Detector

Title PyramidBox: A Context-assisted Single Shot Face Detector
Authors Xu Tang, Daniel K. Du, Zeqiang He, Jingtuo Liu
Abstract Face detection has been well studied for many years and one of the remaining challenges is to detect small, blurred and partially occluded faces in uncontrolled environments. This paper proposes a novel context-assisted single shot face detector, named PyramidBox, to handle the hard face detection problem. Observing the importance of context, we improve the utilization of contextual information in the following three aspects. First, we design a novel context anchor to supervise high-level contextual feature learning by a semi-supervised method, which we call PyramidAnchors. Second, we propose the Low-level Feature Pyramid Network to combine high-level context semantic features and low-level facial features, which also allows PyramidBox to predict faces of all scales in a single shot. Third, we introduce a context-sensitive structure to increase the capacity of the prediction network and improve the final accuracy of the output. In addition, we use a Data-anchor-sampling method to augment the training samples across different scales, which increases the diversity of training data for smaller faces. By exploiting the value of context, PyramidBox achieves superior performance over the state-of-the-art on two common face detection benchmarks, FDDB and WIDER FACE. Our code is available in PaddlePaddle at https://github.com/PaddlePaddle/models/tree/develop/fluid/face_detection.
Tasks Face Detection
Published 2018-03-21
URL http://arxiv.org/abs/1803.07737v2
PDF http://arxiv.org/pdf/1803.07737v2.pdf
PWC https://paperswithcode.com/paper/pyramidbox-a-context-assisted-single-shot
Repo https://github.com/EricZgw/PyramidBox
Framework tf
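
Of the three ingredients, Data-anchor-sampling is the easiest to sketch: resize the image so that a randomly chosen face lands on a (usually smaller) anchor scale. The anchor list and the sampling rule below are my paraphrase of the idea, not the PaddlePaddle implementation.

```python
import random

def data_anchor_scale(face_size, anchors=(16, 32, 64, 128, 256, 512)):
    """Return a resize factor mapping a randomly selected face of size
    face_size (pixels) onto a randomly chosen anchor scale no larger
    than one step above its nearest anchor."""
    nearest = min(range(len(anchors)), key=lambda i: abs(anchors[i] - face_size))
    target = random.choice(anchors[:min(nearest + 2, len(anchors))])
    return target / face_size
```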

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

Title Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
Authors Daniel Stoller, Sebastian Ewert, Simon Dixon
Abstract Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
Tasks Music Source Separation
Published 2018-06-08
URL http://arxiv.org/abs/1806.03185v1
PDF http://arxiv.org/pdf/1806.03185v1.pdf
PWC https://paperswithcode.com/paper/wave-u-net-a-multi-scale-neural-network-for
Repo https://github.com/ShichengChen/WaveNetSeparateAudio
Framework pytorch
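
The source-additivity output layer mentioned in the abstract is easy to make concrete: predict K-1 sources directly and define the last as the mixture minus their sum, so the estimates sum to the input by construction. The sketch below is a hedged PyTorch rendering; channel counts are assumptions.

```python
import torch
import torch.nn as nn

class DifferenceOutput(nn.Module):
    """Output layer enforcing source additivity for K sources."""
    def __init__(self, channels, n_sources):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv1d(channels, 1, kernel_size=1) for _ in range(n_sources - 1))

    def forward(self, feats, mixture):
        # feats: (N, channels, T) decoder features; mixture: (N, 1, T)
        srcs = [torch.tanh(h(feats)) for h in self.heads]
        srcs.append(mixture - sum(srcs))   # last source closes the sum
        return torch.stack(srcs, dim=1)    # (N, K, 1, T)
```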

Deep Reinforcement Learning For Sequence to Sequence Models

Title Deep Reinforcement Learning For Sequence to Sequence Models
Authors Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, Chandan K. Reddy
Abstract In recent times, sequence-to-sequence (seq2seq) models have gained a lot of popularity and provide state-of-the-art performance in a wide variety of tasks such as machine translation, headline generation, text summarization, speech to text conversion, and image caption generation. The underlying framework for all these models is usually a deep neural network comprising an encoder and a decoder. Although simple encoder-decoder models produce competitive results, many researchers have proposed additional improvements over these sequence-to-sequence models, e.g., using an attention-based model over the input, pointer-generation models, and self-attention models. However, such seq2seq models suffer from two common problems: 1) exposure bias and 2) inconsistency between training and test measurements. Recently, a completely novel point of view has emerged in addressing these two problems in seq2seq models, leveraging methods from reinforcement learning (RL). In this survey, we consider seq2seq problems from the RL point of view and provide a formulation combining the power of RL methods in decision-making with sequence-to-sequence models that enables remembering long-term memories. We present some of the most recent frameworks that combine concepts from RL and deep neural networks and explain how these two areas could benefit from each other in solving complex seq2seq tasks. Our work aims to provide insights into some of the problems that inherently arise with current approaches and how we can address them with better RL models. We also provide the source code for implementing most of the RL models discussed in this paper to support the complex task of abstractive text summarization.
Tasks Abstractive Text Summarization, Decision Making, Machine Translation, Text Summarization
Published 2018-05-24
URL http://arxiv.org/abs/1805.09461v4
PDF http://arxiv.org/pdf/1805.09461v4.pdf
PWC https://paperswithcode.com/paper/deep-reinforcement-learning-for-sequence-to
Repo https://github.com/theamrzaki/text_summurization_abstractive_methods
Framework tf
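
A representative recipe from this family is REINFORCE with a self-critical baseline: the reward of a greedily decoded sequence serves as the baseline for sampled sequences. A minimal sketch of that loss, assuming per-sequence scalar rewards such as ROUGE:

```python
import torch

def self_critical_loss(log_probs, sample_reward, greedy_reward):
    """log_probs: (N, T) log-probabilities of the sampled tokens;
    sample_reward, greedy_reward: (N,) rewards of the sampled and
    greedily decoded sequences (e.g., ROUGE scores)."""
    advantage = (sample_reward - greedy_reward).detach()        # baseline-corrected
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
```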

ALIGNet: Partial-Shape Agnostic Alignment via Unsupervised Learning

Title ALIGNet: Partial-Shape Agnostic Alignment via Unsupervised Learning
Authors Rana Hanocka, Noa Fish, Zhenhua Wang, Raja Giryes, Shachar Fleishman, Daniel Cohen-Or
Abstract The process of aligning a pair of shapes is a fundamental operation in computer graphics. Traditional approaches rely heavily on matching corresponding points or features to guide the alignment, a paradigm that falters when significant shape portions are missing. These techniques generally do not incorporate prior knowledge about expected shape characteristics, which can help compensate for any misleading cues left by inaccuracies exhibited in the input shapes. We present an approach based on a deep neural network, leveraging shape datasets to learn a shape-aware prior for source-to-target alignment that is robust to shape incompleteness. In the absence of ground truth alignments for supervision, we train a network on the task of shape alignment using incomplete shapes generated from full shapes for self-supervision. Our network, called ALIGNet, is trained to warp complete source shapes to incomplete targets, as if the target shapes were complete, thus essentially rendering the alignment partial-shape agnostic. We aim for the network to develop specialized expertise over the common characteristics of the shapes in each dataset, thereby achieving a higher-level understanding of the expected shape space to which a local approach would be oblivious. We constrain ALIGNet through an anisotropic total variation identity regularization to promote piecewise smooth deformation fields, facilitating both partial-shape agnosticism and post-deformation applications. We demonstrate that ALIGNet learns to align geometrically distinct shapes, and is able to infer plausible mappings even when the target shape is significantly incomplete. We show that our network learns the common expected characteristics of shape collections, without over-fitting or memorization, enabling it to produce plausible deformations on unseen data during test time.
Tasks
Published 2018-04-23
URL http://arxiv.org/abs/1804.08497v2
PDF http://arxiv.org/pdf/1804.08497v2.pdf
PWC https://paperswithcode.com/paper/alignet-partial-shape-agnostic-alignment-via
Repo https://github.com/ranahanocka/ALIGNet
Framework pytorch
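
The anisotropic total variation regularizer mentioned in the abstract penalizes absolute finite differences of the deformation field along each axis, encouraging piecewise smooth warps. A minimal sketch (the identity-offset bookkeeping of the full ALIGNet loss is omitted):

```python
import torch

def anisotropic_tv(field):
    """Anisotropic total variation of a 2D deformation field
    (N, 2, H, W): sum of absolute differences along height and width."""
    dh = (field[:, :, 1:, :] - field[:, :, :-1, :]).abs().sum()
    dw = (field[:, :, :, 1:] - field[:, :, :, :-1]).abs().sum()
    return dh + dw
```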

Deep learning as a tool for neural data analysis: speech classification and cross-frequency coupling in human sensorimotor cortex

Title Deep learning as a tool for neural data analysis: speech classification and cross-frequency coupling in human sensorimotor cortex
Authors Jesse A. Livezey, Kristofer E. Bouchard, Edward F. Chang
Abstract A fundamental challenge in neuroscience is to understand what structure in the world is represented in spatially distributed patterns of neural activity from multiple single-trial measurements. This is often accomplished by learning simple, linear transformations between neural features and features of the sensory stimuli or motor task. While successful in some early sensory processing areas, linear mappings are unlikely to be ideal tools for elucidating nonlinear, hierarchical representations of higher-order brain areas during complex tasks, such as the production of speech by humans. Here, we apply deep networks to predict produced speech syllables from cortical surface electric potentials recorded from human sensorimotor cortex. We found that deep networks had higher decoding prediction accuracy compared to baseline models, and also exhibited greater improvements in accuracy with increasing dataset size. We further demonstrate that the deep networks’ confusions revealed hierarchical latent structure in the neural data, which recapitulated the underlying articulatory nature of speech motor control. Finally, we used deep networks to compare task-relevant information in different neural frequency bands, and found that the high-gamma band contains the vast majority of information relevant for the speech prediction task, with little-to-no additional contribution from lower frequencies. Together, these results demonstrate the utility of deep networks as a data analysis tool for neuroscience.
Tasks
Published 2018-03-26
URL http://arxiv.org/abs/1803.09807v1
PDF http://arxiv.org/pdf/1803.09807v1.pdf
PWC https://paperswithcode.com/paper/deep-learning-as-a-tool-for-neural-data
Repo https://github.com/BouchardLab/deprecated_process_ecog
Framework none
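
As a rough illustration of the decoding setup, a small fully connected classifier over per-trial features conveys the idea. Layer sizes, dropout, and the feature layout below are illustrative, not the paper's architecture.

```python
import torch.nn as nn

def make_decoder(n_features, n_syllables, hidden=256):
    """Map flattened trial features (e.g., high-gamma amplitude across
    electrodes and time bins) to syllable class logits."""
    return nn.Sequential(
        nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(hidden, n_syllables))
```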

APRIL: Interactively Learning to Summarise by Combining Active Preference Learning and Reinforcement Learning

Title APRIL: Interactively Learning to Summarise by Combining Active Preference Learning and Reinforcement Learning
Authors Yang Gao, Christian M. Meyer, Iryna Gurevych
Abstract We propose a method to perform automatic document summarisation without using reference summaries. Instead, our method interactively learns from users’ preferences. The merit of preference-based interactive summarisation is that preferences are easier for users to provide than reference summaries. Existing preference-based interactive learning methods suffer from high sample complexity, i.e. they need to interact with the oracle for many rounds in order to converge. In this work, we propose a new objective function, which enables us to leverage active learning, preference learning and reinforcement learning techniques in order to reduce the sample complexity. Both simulation and real-user experiments suggest that our method significantly advances the state of the art. Our source code is freely available at https://github.com/UKPLab/emnlp2018-april.
Tasks Active Learning
Published 2018-08-29
URL http://arxiv.org/abs/1808.09658v1
PDF http://arxiv.org/pdf/1808.09658v1.pdf
PWC https://paperswithcode.com/paper/april-interactively-learning-to-summarise-by
Repo https://github.com/UKPLab/emnlp2018-april
Framework none
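
One building block of preference-based interactive learning is fitting a reward or ranking function from pairwise preferences, typically with a Bradley-Terry-style logistic loss. A minimal sketch (the paper's full objective additionally covers active querying and the RL summariser):

```python
import numpy as np

def preference_loss(score_a, score_b, pref_a):
    """Logistic loss on the score difference of two candidate summaries.
    pref_a = 1 if the user preferred summary A, else 0."""
    p_a = 1.0 / (1.0 + np.exp(-(score_a - score_b)))  # P(A preferred)
    return -(pref_a * np.log(p_a) + (1 - pref_a) * np.log(1.0 - p_a))
```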

Urdu Word Segmentation using Conditional Random Fields (CRFs)

Title Urdu Word Segmentation using Conditional Random Fields (CRFs)
Authors Haris Bin Zia, Agha Ali Raza, Awais Athar
Abstract State-of-the-art Natural Language Processing algorithms rely heavily on efficient word segmentation. Urdu is amongst the languages for which word segmentation is a complex task, as it exhibits space omission as well as space insertion issues. This is partly due to the Arabic script which, although cursive in nature, consists of characters that have inherent joining and non-joining attributes regardless of word boundary. This paper presents a word segmentation system for Urdu which uses a Conditional Random Field sequence modeler with orthographic, linguistic and morphological features. Our proposed model automatically learns to predict white space as word boundary as well as Zero Width Non-Joiner (ZWNJ) as sub-word boundary. Using a manually annotated corpus, our model achieves an F1 score of 0.97 for word boundary identification and 0.85 for sub-word boundary identification. We have made our code and corpus publicly available to make our results reproducible.
Tasks
Published 2018-06-14
URL http://arxiv.org/abs/1806.05432v1
PDF http://arxiv.org/pdf/1806.05432v1.pdf
PWC https://paperswithcode.com/paper/urdu-word-segmentation-using-conditional
Repo https://github.com/harisbinzia/Urdu-Word-Segmentation
Framework none
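
A CRF word segmenter of this kind reduces to per-character boundary tagging with hand-crafted features. The tag set and feature template below are my illustrative guesses, not the paper's exact configuration:

```python
# Character-level features for CRF boundary tagging, assuming tags like
# B (word boundary), Z (ZWNJ sub-word boundary), I (inside).
def char_features(text, i):
    c = text[i]
    return {
        "char": c,
        "is_space": c.isspace(),
        "is_zwnj": c == "\u200c",                      # Zero Width Non-Joiner
        "prev": text[i - 1] if i > 0 else "<s>",
        "next": text[i + 1] if i < len(text) - 1 else "</s>",
    }

# With sklearn-crfsuite, training would then look roughly like:
#   crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
#   crf.fit([[char_features(s, i) for i in range(len(s))] for s in sents], tags)
```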

Mask Editor : an Image Annotation Tool for Image Segmentation Tasks

Title Mask Editor : an Image Annotation Tool for Image Segmentation Tasks
Authors Chuanhai Zhang, Kurt Loken, Zhiyu Chen, Zhiyong Xiao, Gary Kunkel
Abstract Deep convolutional neural networks (DCNNs) are the state-of-the-art method for image segmentation, one of the key challenging computer vision tasks. However, DCNNs require many training images with corresponding image masks to achieve good segmentation results. Image annotation software that is easy to use and allows fast image mask generation is therefore in great demand. To the best of our knowledge, all existing image annotation software supports only drawing bounding polygons, bounding boxes, or bounding ellipses to mark target objects. These tools are inefficient when targeting objects that have irregular shapes (e.g., defects in fabric images or tire images). In this paper we present an easy-to-use image annotation tool called Mask Editor for image mask generation. Mask Editor allows drawing any bounding curve to mark objects, improving efficiency for objects with irregular shapes. Mask Editor also supports drawing bounding polygons, bounding boxes, and bounding ellipses, as well as painting, erasing, super-pixel marking, image cropping, multi-class masks, mask loading, and mask modification.
Tasks Image Cropping, Semantic Segmentation
Published 2018-09-17
URL http://arxiv.org/abs/1809.06461v1
PDF http://arxiv.org/pdf/1809.06461v1.pdf
PWC https://paperswithcode.com/paper/mask-editor-an-image-annotation-tool-for
Repo https://github.com/Chuanhai/Mask-Editor
Framework none

CoQA: A Conversational Question Answering Challenge

Title CoQA: A Conversational Question Answering Challenge
Authors Siva Reddy, Danqi Chen, Christopher D. Manning
Abstract Humans gather information by engaging in conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong conversational and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating there is ample room for improvement. We launch CoQA as a challenge to the community at http://stanfordnlp.github.io/coqa/
Tasks Question Answering, Reading Comprehension
Published 2018-08-21
URL http://arxiv.org/abs/1808.07042v2
PDF http://arxiv.org/pdf/1808.07042v2.pdf
PWC https://paperswithcode.com/paper/coqa-a-conversational-question-answering
Repo https://github.com/JepsonWong/NLPCorpus
Framework none
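
The F1 score quoted above is token-overlap F1 in the SQuAD tradition. A minimal sketch (CoQA's official evaluator adds answer normalization and multi-reference handling omitted here):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred, ref = prediction.split(), gold.split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```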