October 21, 2019

2929 words 14 mins read

Paper Group AWR 147


Document Informed Neural Autoregressive Topic Models

Title Document Informed Neural Autoregressive Topic Models
Authors Pankaj Gupta, Florian Buettner, Hinrich Schütze
Abstract Context information around words helps in determining their actual meaning, for example “networks” used in contexts of artificial neural networks or biological neuron networks. Generative topic models infer topic-word distributions, taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. This results in an improved performance in terms of generalization, interpretability and applicability. We apply our modeling approach to seven data sets from various domains and demonstrate that our approach consistently outperforms state-of-the-art generative topic models. With the learned representations, we show on average a gain of 9.6% (0.57 vs. 0.52) in precision at retrieval fraction 0.02 and 7.2% (0.582 vs. 0.543) in F1 for text categorization.
Tasks Language Modelling, Text Categorization, Topic Models
Published 2018-08-11
URL http://arxiv.org/abs/1808.03793v1
PDF http://arxiv.org/pdf/1808.03793v1.pdf
PWC https://paperswithcode.com/paper/document-informed-neural-autoregressive-topic-1
Repo https://github.com/pgcool/iDocNADE
Framework none
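The key mechanism in the abstract above is conditioning each word's topic representation on both the preceding and following words, rather than only the preceding ones. Below is a minimal NumPy sketch of such a bidirectional autoregressive document likelihood; the tanh encoder, sizes, and parameter names are illustrative assumptions, not the authors' exact parameterization.

```python
import numpy as np

# Toy iDocNADE-style bidirectional autoregressive encoder (illustrative only).
# Each position i gets a hidden state built from the words before i (forward pass)
# and the words after i (backward pass); the word probability combines both.
V, H = 1000, 50                            # vocabulary size, hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(H, V))    # shared word embedding matrix
U = rng.normal(scale=0.01, size=(V, H))    # output weights
b, c = np.zeros(V), np.zeros(H)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def doc_log_likelihood(doc):
    """doc: list of word ids. Returns sum of log p(w_i | context around i)."""
    fwd = np.cumsum(np.concatenate([np.zeros((1, H)), W[:, doc].T]), axis=0)        # sums of embeddings before i
    bwd = np.cumsum(np.concatenate([np.zeros((1, H)), W[:, doc[::-1]].T]), axis=0)[::-1]  # sums after i
    ll = 0.0
    for i, w in enumerate(doc):
        h = np.tanh(c + fwd[i] + bwd[i + 1])    # context excludes word i itself
        ll += np.log(softmax(b + U @ h)[w])
    return ll

print(doc_log_likelihood([3, 17, 42, 7]))
```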

Pose-Driven Deep Models for Person Re-Identification

Title Pose-Driven Deep Models for Person Re-Identification
Authors Andreas Eberle
Abstract Person re-identification (re-id) is the task of recognizing and matching persons at different locations recorded by cameras with non-overlapping views. One of the main challenges of re-id is the large variance in person poses and camera angles since neither of them can be influenced by the re-id system. In this work, an effective approach to integrate coarse camera view information as well as fine-grained pose information into a convolutional neural network (CNN) model for learning discriminative re-id embeddings is introduced. In most recent work pose information is either explicitly modeled within the re-id system or explicitly used for pre-processing, for example by pose-normalizing person images. In contrast, the proposed approach shows that a direct use of camera view as well as the detected body joint locations into a standard CNN can be used to significantly improve the robustness of learned re-id embeddings. On four challenging surveillance and video re-id datasets significant improvements over the current state of the art have been achieved. Furthermore, a novel reordering of the MARS dataset, called X-MARS is introduced to allow cross-validation of models trained for single-image re-id on tracklet data.
Tasks Person Re-Identification
Published 2018-03-23
URL http://arxiv.org/abs/1803.08709v1
PDF http://arxiv.org/pdf/1803.08709v1.pdf
PWC https://paperswithcode.com/paper/pose-driven-deep-models-for-person-re
Repo https://github.com/andreas-eberle/x-mars
Framework none
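The abstract's central claim is that camera view and detected body-joint locations can be fed directly into a standard CNN to improve the learned embedding. A hedged PyTorch sketch follows, assuming a recent torchvision and fusing a flattened pose/view vector with the pooled backbone features before the embedding layer; this fusion point is one plausible reading, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PoseViewReID(nn.Module):
    """Illustrative re-id embedding net: image backbone + pose keypoints + camera view."""
    def __init__(self, num_joints=14, num_views=8, embed_dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # globally pooled features
        self.view_emb = nn.Embedding(num_views, 16)
        # pose input: (x, y, confidence) per joint, flattened
        self.fuse = nn.Linear(2048 + num_joints * 3 + 16, embed_dim)

    def forward(self, image, joints, view_id):
        feat = self.backbone(image).flatten(1)          # (B, 2048)
        pose = joints.flatten(1)                        # (B, num_joints * 3)
        view = self.view_emb(view_id)                   # (B, 16)
        return nn.functional.normalize(self.fuse(torch.cat([feat, pose, view], dim=1)), dim=1)

model = PoseViewReID()
emb = model(torch.randn(2, 3, 256, 128), torch.rand(2, 14, 3), torch.tensor([0, 3]))
print(emb.shape)  # torch.Size([2, 256])
```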

Recursive Visual Attention in Visual Dialog

Title Recursive Visual Attention in Visual Dialog
Authors Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen
Abstract Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image. It typically needs to address two major problems: (1) How to answer visually-grounded questions, which is the core challenge in visual question answering (VQA); (2) How to infer the co-reference between questions and the dialog history. An example of visual co-reference is: pronouns (e.g., “they”) in the question (e.g., “Are they on or off?”) are linked with nouns (e.g., “lamps”) appearing in the dialog history (e.g., “How many lamps are there?”) and the object grounded in the image. In this work, to resolve the visual co-reference for visual dialog, we propose a novel attention mechanism called Recursive Visual Attention (RvA). Specifically, our dialog agent browses the dialog history until the agent has sufficient confidence in the visual co-reference resolution, and refines the visual attention recursively. The quantitative and qualitative experimental results on the large-scale VisDial v0.9 and v1.0 datasets demonstrate that the proposed RvA not only outperforms the state-of-the-art methods, but also achieves reasonable recursion and interpretable attention maps without additional annotations. The code is available at https://github.com/yuleiniu/rva.
Tasks Question Answering, Visual Dialog, Visual Question Answering
Published 2018-12-06
URL http://arxiv.org/abs/1812.02664v2
PDF http://arxiv.org/pdf/1812.02664v2.pdf
PWC https://paperswithcode.com/paper/recursive-visual-attention-in-visual-dialog
Repo https://github.com/yuleiniu/rva
Framework pytorch
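The recursion described in the abstract is: if the current question cannot be grounded on its own, fall back to earlier dialog rounds, then refine the inherited attention with the current question. A schematic sketch of that control flow is below; the confidence, attention, and refinement functions are placeholders, not the published model.

```python
def recursive_visual_attention(t, questions, image_feats, ground_confidence,
                               visual_attention, refine):
    """Schematic RvA-style recursion (placeholder functions, not the paper's model).

    t: index of the current dialog round.
    ground_confidence(q) -> float in [0, 1]: can this question be grounded alone?
    visual_attention(q, feats): attention for a self-contained question.
    refine(prev_att, q, feats): refine an attention map inherited from history.
    """
    q = questions[t]
    if t == 0 or ground_confidence(q) > 0.5:
        # Self-contained question (e.g. "How many lamps are there?"): attend directly.
        return visual_attention(q, image_feats)
    # Otherwise (e.g. "Are they on or off?"): recurse into the dialog history,
    # then refine the inherited attention with the current question.
    prev_att = recursive_visual_attention(t - 1, questions, image_feats,
                                          ground_confidence, visual_attention, refine)
    return refine(prev_att, q, image_feats)

att = recursive_visual_attention(
    1, ["How many lamps are there?", "Are they on or off?"],
    image_feats=None,
    ground_confidence=lambda q: 0.9 if "lamps" in q else 0.1,
    visual_attention=lambda q, f: f"attention({q})",
    refine=lambda a, q, f: f"refine({a}, {q})")
print(att)  # refine(attention(How many lamps are there?), Are they on or off?)
```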

Spatial Deep Learning for Wireless Scheduling

Title Spatial Deep Learning for Wireless Scheduling
Authors Wei Cui, Kaiming Shen, Wei Yu
Abstract The optimal scheduling of interfering links in a dense wireless network with full frequency reuse is a challenging task. The traditional method involves first estimating all the interfering channel strengths then optimizing the scheduling based on the model. This model-based method is however resource intensive and computationally hard, because channel estimation is expensive in dense networks; further, finding even a locally optimal solution of the resulting optimization problem may be computationally complex. This paper shows that by using a deep learning approach, it is possible to bypass channel estimation and to schedule links efficiently based solely on the geographic locations of transmitters and receivers for networks in which the channels are largely functions of distance dependent path-losses. This is accomplished by unsupervised training over randomly deployed networks, and by using a novel neural network architecture that takes the geographic spatial convolutions of the interfering or interfered neighboring nodes as input over multiple feedback stages to learn the optimum solution. The resulting neural network gives near-optimal performance for sum-rate maximization and is capable of generalizing to larger deployment areas and to deployments of different link densities. Moreover, to provide fairness, this paper proposes a novel scheduling approach that utilizes the sum-rate optimal scheduling algorithm over judiciously chosen subsets of links for maximizing a proportional fairness objective over the network. The proposed approach shows highly competitive and generalizable network utility maximization results.
Tasks
Published 2018-08-04
URL http://arxiv.org/abs/1808.01486v2
PDF http://arxiv.org/pdf/1808.01486v2.pdf
PWC https://paperswithcode.com/paper/spatial-deep-learning-for-wireless-scheduling
Repo https://github.com/willtop/Spatial_DeepLearning_Wireless_Scheduling
Framework tf
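The input described in the abstract is purely geographic: spatial convolutions over the density of nearby interfering transmitters and interfered receivers, with no channel estimation. A rough NumPy sketch of assembling such a local density-grid input is below; grid size, cell resolution, and centring choice are assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def density_grid(points, center, grid_size=31, cell_m=5.0):
    """Count neighbouring nodes per cell in a local grid centred on one link.

    points: (N, 2) array of neighbour coordinates in metres.
    center: (2,) coordinate the grid is centred on (e.g. this link's receiver).
    Returns a (grid_size, grid_size) occupancy map, the kind of spatial input
    the paper's convolutional stages operate on (sizes here are assumptions).
    """
    half = grid_size // 2
    idx = np.floor((points - center) / cell_m).astype(int) + half
    grid = np.zeros((grid_size, grid_size))
    inside = (idx >= 0).all(axis=1) & (idx < grid_size).all(axis=1)
    np.add.at(grid, (idx[inside, 0], idx[inside, 1]), 1.0)
    return grid

rng = np.random.default_rng(1)
tx = rng.uniform(0, 500, size=(50, 2))       # 50 transmitters in a 500 m x 500 m area
rx = tx + rng.uniform(2, 65, size=(50, 2))   # receivers a short distance away
link = 0
interferer_map = density_grid(np.delete(tx, link, axis=0), center=rx[link])
print(interferer_map.shape, interferer_map.sum())
```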

Layer-compensated Pruning for Resource-constrained Convolutional Neural Networks

Title Layer-compensated Pruning for Resource-constrained Convolutional Neural Networks
Authors Ting-Wu Chin, Cha Zhang, Diana Marculescu
Abstract Resource-efficient convolutional neural networks enable not only the intelligence on edge devices but also opportunities in system-level optimization such as scheduling. In this work, we aim to improve the performance of resource-constrained filter pruning by merging two sub-problems commonly considered, i.e., (i) how many filters to prune for each layer and (ii) which filters to prune given a per-layer pruning budget, into a global filter ranking problem. Our framework entails a novel algorithm, dubbed layer-compensated pruning, where meta-learning is involved to determine better solutions. We show empirically that the proposed algorithm is superior to prior art in both effectiveness and efficiency. Specifically, we reduce the accuracy gap between the pruned and original networks from 0.9% to 0.7% with an 8x reduction in the time needed for meta-learning, i.e., from 1 hour down to 7 minutes. We demonstrate the effectiveness of our algorithm using ResNet and MobileNetV2 networks on the CIFAR-10, ImageNet, and Bird-200 datasets.
Tasks Meta-Learning
Published 2018-10-01
URL http://arxiv.org/abs/1810.00518v2
PDF http://arxiv.org/pdf/1810.00518v2.pdf
PWC https://paperswithcode.com/paper/layer-compensated-pruning-for-resource
Repo https://github.com/cmu-enyac/LeGR
Framework pytorch
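The abstract merges "how many filters per layer" and "which filters" into one global ranking, with per-layer compensation terms that correct a cheap saliency measure. A toy NumPy sketch of that ranking step is below, assuming L1-norm saliency and already-known compensation values; the meta-learning loop that actually estimates the offsets is omitted.

```python
import numpy as np

def layer_compensated_ranking(filter_norms, layer_offsets, n_prune):
    """Globally rank filters by (saliency + per-layer compensation) and pick the weakest.

    filter_norms: list of 1-D arrays, one per layer (e.g. L1 norms of each filter).
    layer_offsets: one compensation term per layer (assumed given here; in the
                   paper these are found by a meta-learning procedure).
    Returns a list of (layer, filter_index) pairs to prune.
    """
    scored = []
    for layer, (norms, offset) in enumerate(zip(filter_norms, layer_offsets)):
        for idx, s in enumerate(norms):
            scored.append((s + offset, layer, idx))
    scored.sort()                                  # weakest compensated scores first
    return [(layer, idx) for _, layer, idx in scored[:n_prune]]

rng = np.random.default_rng(0)
norms = [rng.exponential(1.0, size=n) for n in (16, 32, 64)]
offsets = [0.0, 0.4, -0.2]                         # illustrative compensation values
print(layer_compensated_ranking(norms, offsets, n_prune=10)[:5])
```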

ODSQA: Open-domain Spoken Question Answering Dataset

Title ODSQA: Open-domain Spoken Question Answering Dataset
Authors Chia-Hsuan Lee, Shang-Ming Wang, Huan-Cheng Chang, Hung-Yi Lee
Abstract Reading comprehension by machine has been widely studied, but machine comprehension of spoken content is still a less investigated problem. In this paper, we release the Open-Domain Spoken Question Answering Dataset (ODSQA) with more than three thousand questions. To the best of our knowledge, this is the largest real SQA dataset. On this dataset, we found that ASR errors have a catastrophic impact on SQA. To mitigate the effect of ASR errors, subword units are used, which brings consistent improvements across all the models. We further found that data augmentation on text-based QA training examples can improve SQA.
Tasks Data Augmentation, Question Answering, Reading Comprehension
Published 2018-08-07
URL http://arxiv.org/abs/1808.02280v1
PDF http://arxiv.org/pdf/1808.02280v1.pdf
PWC https://paperswithcode.com/paper/odsqa-open-domain-spoken-question-answering
Repo https://github.com/chiahsuan156/ODSQA
Framework none
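The intuition behind the subword-unit mitigation above is that an ASR substitution usually corrupts only part of a word, so subword-level representations still overlap with the reference text. The toy illustration below uses character n-grams purely as a stand-in for learned subword units to show that overlap effect; it is not the paper's tokenizer.

```python
def char_ngrams(text, n=3):
    """Character n-grams as a stand-in for learned subword units."""
    t = f"<{text.replace(' ', '_')}>"
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

reference  = "open domain question answering"
asr_output = "open the main question and swing"   # simulated recognition errors

# Word-level match is poor, but many subword units survive the ASR errors.
ref_words, asr_words = set(reference.split()), set(asr_output.split())
print("word overlap   :", round(jaccard(ref_words, asr_words), 2))
print("subword overlap:", round(jaccard(char_ngrams(reference), char_ngrams(asr_output)), 2))
```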

Variational Learning on Aggregate Outputs with Gaussian Processes

Title Variational Learning on Aggregate Outputs with Gaussian Processes
Authors Ho Chung Leon Law, Dino Sejdinovic, Ewan Cameron, Tim CD Lucas, Seth Flaxman, Katherine Battle, Kenji Fukumizu
Abstract While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global mapping of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the fine-scale spatial modelling of malaria incidence, with over 1 million observations.
Tasks Gaussian Processes
Published 2018-05-22
URL http://arxiv.org/abs/1805.08463v1
PDF http://arxiv.org/pdf/1805.08463v1.pdf
PWC https://paperswithcode.com/paper/variational-learning-on-aggregate-outputs
Repo https://github.com/hcllaw/VBAgg
Framework tf
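The setting in the abstract is that covariates are observed at a fine level (e.g. pixels) while the count outputs exist only at an aggregate level (e.g. administrative regions), with the region count modelled as Poisson with a rate that sums the latent per-pixel rates. A small NumPy sketch of that aggregated-Poisson likelihood is below; the GP prior and the variational bounds themselves are not reproduced, and the population weighting is an assumption drawn from the malaria use case.

```python
import numpy as np

def aggregated_poisson_loglik(counts, pixel_rates, regions, populations):
    """log p(y_a) with y_a ~ Poisson( sum_{i in region a} pop_i * rate_i ).

    counts: observed count per region (the only level at which outputs exist).
    pixel_rates: latent per-pixel incidence rates (in the paper, a transformed GP).
    regions: region id per pixel; populations: per-pixel population weights.
    """
    ll = 0.0
    for a, y in enumerate(counts):
        mask = regions == a
        lam = np.sum(populations[mask] * pixel_rates[mask])
        ll += y * np.log(lam) - lam - np.sum(np.log(np.arange(1, y + 1)))  # log Poisson pmf
    return ll

rng = np.random.default_rng(0)
n_pixels, n_regions = 200, 5
regions = rng.integers(0, n_regions, size=n_pixels)
pop = rng.uniform(50, 500, size=n_pixels)
rates = rng.gamma(2.0, 1e-3, size=n_pixels)
counts = np.array([rng.poisson((pop[regions == a] * rates[regions == a]).sum())
                   for a in range(n_regions)])
print(aggregated_poisson_loglik(counts, rates, regions, pop))
```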

A Style-Based Generator Architecture for Generative Adversarial Networks

Title A Style-Based Generator Architecture for Generative Adversarial Networks
Authors Tero Karras, Samuli Laine, Timo Aila
Abstract We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
Tasks
Published 2018-12-12
URL http://arxiv.org/abs/1812.04948v3
PDF http://arxiv.org/pdf/1812.04948v3.pdf
PWC https://paperswithcode.com/paper/a-style-based-generator-architecture-for
Repo https://github.com/mokeam/StatueStyleGAN
Framework tf
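Two ingredients from the abstract do most of the work: a mapping network that turns the latent code into an intermediate style vector, and adaptive instance normalisation (AdaIN) that lets that style set per-channel statistics at each synthesis layer, with per-pixel noise for stochastic detail. A compressed PyTorch sketch follows; layer counts, sizes, and the noise scale are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance norm: the style vector w sets per-channel scale and bias."""
    def __init__(self, channels, w_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.style = nn.Linear(w_dim, channels * 2)

    def forward(self, x, w):
        scale, bias = self.style(w).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[:, :, None, None]) + bias[:, :, None, None]

class TinyStyleGenerator(nn.Module):
    def __init__(self, z_dim=64, w_dim=64, channels=32):
        super().__init__()
        self.mapping = nn.Sequential(nn.Linear(z_dim, w_dim), nn.LeakyReLU(0.2),
                                     nn.Linear(w_dim, w_dim), nn.LeakyReLU(0.2))
        self.const = nn.Parameter(torch.randn(1, channels, 4, 4))   # learned constant input
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.adain = AdaIN(channels, w_dim)
        self.to_rgb = nn.Conv2d(channels, 3, 1)

    def forward(self, z):
        w = self.mapping(z)                                   # intermediate latent (style) space
        x = self.const.expand(z.shape[0], -1, -1, -1)
        x = x + 0.1 * torch.randn_like(x)                     # per-pixel stochastic noise
        x = torch.relu(self.adain(self.conv(x), w))
        return self.to_rgb(x)

img = TinyStyleGenerator()(torch.randn(2, 64))
print(img.shape)   # torch.Size([2, 3, 4, 4])
```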

Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks

Title Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks
Authors Yuan Li, Yuanjie Yu, Zefeng Li, Yangkun Lin, Meifang Xu, Jiwei Li, Xi Zhou
Abstract Recently, semantic segmentation and general object detection frameworks have been widely adopted by scene text detection tasks. However, both of them alone have obvious shortcomings in practice. In this paper, we propose a novel end-to-end trainable deep neural network framework, named Pixel-Anchor, which combines semantic segmentation and SSD in one network by feature sharing and an anchor-level attention mechanism to detect oriented scene text. To deal with scene text which has large variances in size and aspect ratio, we combine FPN and ASPP operation as our encoder-decoder structure in the semantic segmentation part, and propose a novel Adaptive Predictor Layer in the SSD. Pixel-Anchor detects scene text in a single network forward pass; no complex post-processing is involved other than an efficient fusion non-maximum suppression. We have benchmarked the proposed Pixel-Anchor on the public datasets. Pixel-Anchor outperforms the competing methods in terms of text localization accuracy and run speed; more specifically, on the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.8768 at 10 FPS for 960 x 1728 resolution images.
Tasks Object Detection, Semantic Segmentation
Published 2018-11-19
URL http://arxiv.org/abs/1811.07432v1
PDF http://arxiv.org/pdf/1811.07432v1.pdf
PWC https://paperswithcode.com/paper/pixel-anchor-a-fast-oriented-scene-text
Repo https://github.com/HannaRiver/Pixel-Anchor
Framework none

Parsing Geometry Using Structure-Aware Shape Templates

Title Parsing Geometry Using Structure-Aware Shape Templates
Authors Vignesh Ganapathi-Subramanian, Olga Diamanti, Soeren Pirk, Chengcheng Tang, Matthias Niessner, Leonidas J. Guibas
Abstract Real-life man-made objects often exhibit strong and easily-identifiable structure, as a direct result of their design or their intended functionality. Structure typically appears in the form of individual parts and their arrangement. Knowing about object structure can be an important cue for object recognition and scene understanding - a key goal for various AR and robotics applications. However, commodity RGB-D sensors used in these scenarios only produce raw, unorganized point clouds, without structural information about the captured scene. Moreover, the generated data is commonly partial and susceptible to artifacts and noise, which makes inferring the structure of scanned objects challenging. In this paper, we organize large shape collections into parameterized shape templates to capture the underlying structure of the objects. The templates allow us to transfer the structural information onto new objects and incomplete scans. We employ a deep neural network that matches the partial scan with one of the shape templates, then match and fit it to complete and detailed models from the collection. This allows us to faithfully label its parts and to guide the reconstruction of the scanned object. We showcase the effectiveness of our method by comparing it to other state-of-the-art approaches.
Tasks Object Recognition, Scene Understanding
Published 2018-08-03
URL http://arxiv.org/abs/1808.01337v2
PDF http://arxiv.org/pdf/1808.01337v2.pdf
PWC https://paperswithcode.com/paper/parsing-geometry-using-structure-aware-shape
Repo https://github.com/vigansub/StructureAwareShapeTemplates
Framework none

Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution

Title Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution
Authors Yichen Zhou, Giles Hooker
Abstract This paper examines a novel gradient boosting framework for regression. We regularize gradient boosted trees by introducing subsampling and employ a modified shrinkage algorithm so that at every boosting stage the estimate is given by an average of trees. The resulting algorithm, titled Boulevard, is shown to converge as the number of trees grows. We also demonstrate a central limit theorem for this limit, allowing a characterization of uncertainty for predictions. A simulation study and real world examples provide support for both the predictive accuracy of the model and its limiting behavior.
Tasks
Published 2018-06-26
URL https://arxiv.org/abs/1806.09762v3
PDF https://arxiv.org/pdf/1806.09762v3.pdf
PWC https://paperswithcode.com/paper/boulevard-regularized-stochastic-gradient
Repo https://github.com/siriuz42/boulevard
Framework none
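The abstract's modification of standard gradient boosting is twofold: each tree is fit on a subsample, and the running estimate is a shrunken average of trees rather than a shrunken sum. A hedged sketch with scikit-learn regression trees is below; the update rule f_m = ((m-1)/m) f_{m-1} + (lambda/m) t_m is my reading of "an average of trees" with shrinkage lambda and should be checked against the paper, not taken as the reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boulevard_fit(X, y, n_trees=200, lam=0.8, subsample=0.6, depth=3, seed=0):
    """Boosted trees whose running estimate is a shrunken *average* of trees
    (sketch of the abstract's description; not the authors' reference code)."""
    rng = np.random.default_rng(seed)
    trees, f = [], np.zeros(len(y))
    for m in range(1, n_trees + 1):
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        t = DecisionTreeRegressor(max_depth=depth, random_state=seed + m)
        t.fit(X[idx], (y - f)[idx])                     # fit residuals on a subsample
        f = ((m - 1) / m) * f + (lam / m) * t.predict(X)  # averaging update with shrinkage
        trees.append(t)
    return lambda Xn: (lam / len(trees)) * sum(t.predict(Xn) for t in trees)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
model = boulevard_fit(X, y)
print("train MSE:", np.mean((model(X) - y) ** 2))
```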

Detecting Multi-Oriented Text with Corner-based Region Proposals

Title Detecting Multi-Oriented Text with Corner-based Region Proposals
Authors Linjie Deng, Yanxiang Gong, Yi Lin, Jingwen Shuai, Xiaoguang Tu, Yuefei Zhang, Zheng Ma, Mei Xie
Abstract Previous approaches for scene text detection usually rely on manually defined sliding windows. This work presents an intuitive two-stage region-based method to detect multi-oriented text without any prior knowledge regarding the textual shape. In the first stage, we estimate the possible locations of text instances by detecting and linking corners instead of shifting a set of default anchors. The quadrilateral proposals are geometry adaptive, which allows our method to cope with various text aspect ratios and orientations. In the second stage, we design a new pooling layer named Dual-RoI Pooling which embeds data augmentation inside the region-wise subnetwork for more robust classification and regression over these proposals. Experimental results on public benchmarks confirm that the proposed method is capable of achieving comparable performance with state-of-the-art methods. The code is publicly available at https://github.com/xhzdeng/crpn
Tasks Data Augmentation, Scene Text Detection
Published 2018-04-08
URL https://arxiv.org/abs/1804.02690v2
PDF https://arxiv.org/pdf/1804.02690v2.pdf
PWC https://paperswithcode.com/paper/detecting-multi-oriented-text-with-corner
Repo https://github.com/xhzdeng/crpn
Framework none
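The first stage described in the abstract detects corners and links them into quadrilateral proposals instead of sliding default anchors. The toy snippet below illustrates only the linking idea, assuming each detected corner already carries a type (top-left, top-right, bottom-right, bottom-left); the geometric-consistency rule is a hypothetical stand-in for the paper's learned scoring.

```python
import itertools

def link_corners(corners, width_tol=0.3):
    """corners: dict of corner type -> list of (x, y) detections.

    Toy linking rule (not the paper's): enumerate corner combinations and keep
    those that are geometrically ordered with consistent top/bottom widths.
    """
    quads = []
    for tl, tr, br, bl in itertools.product(corners["tl"], corners["tr"],
                                            corners["br"], corners["bl"]):
        ordered = tl[0] < tr[0] and bl[0] < br[0] and tl[1] < bl[1] and tr[1] < br[1]
        top_w, bot_w = tr[0] - tl[0], br[0] - bl[0]
        consistent = abs(top_w - bot_w) <= width_tol * max(top_w, bot_w)
        if ordered and consistent:
            quads.append((tl, tr, br, bl))
    return quads

detections = {"tl": [(10, 20)], "tr": [(110, 18), (300, 5)],
              "br": [(112, 60)], "bl": [(12, 58)]}
print(link_corners(detections))   # only the geometrically consistent quadrilateral survives
```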

Inferring Concept Prerequisite Relations from Online Educational Resources

Title Inferring Concept Prerequisite Relations from Online Educational Resources
Authors Sudeshna Roy, Meghana Madhyastha, Sheril Lawrence, Vaibhav Rajan
Abstract The Internet has rich and rapidly increasing sources of high-quality educational content. Inferring prerequisite relations between educational concepts is required for modern large-scale online educational technology applications such as personalized recommendations and automatic curriculum creation. We present PREREQ, a new supervised learning method for inferring concept prerequisite relations. PREREQ is designed using latent representations of concepts obtained from the Pairwise Latent Dirichlet Allocation model, and a neural network based on the Siamese network architecture. PREREQ can learn unknown concept prerequisites from course prerequisites and labeled concept prerequisite data. It outperforms state-of-the-art approaches on benchmark datasets and can effectively learn from very little training data. PREREQ can also use unlabeled video playlists, a steadily growing source of training data, to learn concept prerequisites, thus obviating the need for manual annotation of course prerequisites.
Tasks
Published 2018-11-30
URL http://arxiv.org/abs/1811.12640v2
PDF http://arxiv.org/pdf/1811.12640v2.pdf
PWC https://paperswithcode.com/paper/inferring-concept-prerequisite-relations-from
Repo https://github.com/suderoy/PREREQ-IAAI-19
Framework none
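As summarised above, PREREQ feeds latent concept vectors (from Pairwise LDA) through a Siamese-style network to decide whether concept A is a prerequisite of concept B. A hedged PyTorch sketch of that second stage follows, assuming the concept vectors are already available; layer sizes and the ordered-concatenation pairing are assumptions, and the Pairwise-LDA step is not shown.

```python
import torch
import torch.nn as nn

class PrereqSiamese(nn.Module):
    """Siamese-style scorer: p(concept_a is a prerequisite of concept_b)."""
    def __init__(self, concept_dim=100, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(concept_dim, hidden), nn.ReLU())  # shared branch
        # Prerequisite relations are directed, so the pair representation keeps the order.
        self.classifier = nn.Sequential(nn.Linear(hidden * 2, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

    def forward(self, concept_a, concept_b):
        ha, hb = self.encoder(concept_a), self.encoder(concept_b)
        return torch.sigmoid(self.classifier(torch.cat([ha, hb], dim=1))).squeeze(1)

model = PrereqSiamese()
a = torch.rand(4, 100)   # stand-ins for Pairwise-LDA topic vectors of concept A
b = torch.rand(4, 100)   # ... of concept B
print(model(a, b))       # probability that A is a prerequisite of B, per pair
```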

FPETS : Fully Parallel End-to-End Text-to-Speech System

Title FPETS : Fully Parallel End-to-End Text-to-Speech System
Authors Dabiao Ma, Zhiba Su, Wenxuan Wang, Yuhao Lu
Abstract An end-to-end text-to-speech (TTS) system can greatly improve the quality of synthesised speech, but it usually suffers from high time latency due to its auto-regressive structure, and the synthesised speech may also suffer from error modes such as repeated words, mispronunciations, and skipped words. In this paper, we propose a novel non-autoregressive, fully parallel end-to-end TTS system (FPETS). It utilizes a new alignment model and the recently proposed U-shape convolutional structure, UFANS. Different from RNNs, UFANS can capture long-term information in a fully parallel manner. A trainable position encoding and a two-step training strategy are used for learning better alignments. Experimental results show FPETS utilizes the power of parallel computation and reaches a significant speed-up in inference compared with state-of-the-art end-to-end TTS systems. More specifically, FPETS is 600X faster than Tacotron2, 50X faster than DCTTS and 10X faster than Deep Voice 3. FPETS can also generate audio with equal or better quality and fewer errors compared with other systems. As far as we know, FPETS is the first end-to-end TTS system which is fully parallel.
Tasks
Published 2018-12-12
URL https://arxiv.org/abs/1812.05710v5
PDF https://arxiv.org/pdf/1812.05710v5.pdf
PWC https://paperswithcode.com/paper/fpuas-fully-parallel-ufans-based-end-to-end
Repo https://github.com/TuringAILab/End2End_training_English
Framework mxnet

Visual Entailment Task for Visually-Grounded Language Learning

Title Visual Entailment Task for Visually-Grounded Language Learning
Authors Ning Xie, Farley Lai, Derek Doran, Asim Kadav
Abstract We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) in that the premise is defined by an image rather than a natural language sentence. A novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA based models perform.
Tasks Natural Language Inference, Question Answering, Visual Question Answering
Published 2018-11-26
URL http://arxiv.org/abs/1811.10582v2
PDF http://arxiv.org/pdf/1811.10582v2.pdf
PWC https://paperswithcode.com/paper/visual-entailment-task-for-visually-grounded
Repo https://github.com/necla-ml/SNLI-VE
Framework none