October 17, 2019

Paper Group ANR 919

Deep context: end-to-end contextual speech recognition. MTBI Identification From Diffusion MR Images Using Bag of Adversarial Visual Features. Conceptual Analysis of Hypertext. Convolutional Neural Networks for Epileptic Seizure Prediction. Face Detection in the Operating Room: Comparison of State-of-the-art Methods and a Self-supervised Approach. …

Deep context: end-to-end contextual speech recognition


Title	Deep context: end-to-end contextual speech recognition
Authors	Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao
Abstract	In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow-fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components. Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
Tasks	Speech Recognition
Published	2018-08-07
URL	http://arxiv.org/abs/1808.02480v1
PDF	http://arxiv.org/pdf/1808.02480v1.pdf
PWC	https://paperswithcode.com/paper/deep-context-end-to-end-contextual-speech
Repo
Framework
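
The "68% relative WER" figure in the abstract above is a relative improvement over the baseline, not an absolute difference in error rate. A minimal sketch of that arithmetic follows; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new system over a baseline, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline at 20.0% WER and a new system at 6.4% WER.
print(f"{relative_wer_reduction(20.0, 6.4):.0%}")
```

With these made-up inputs the absolute gap is 13.6 WER points, while the relative reduction is 68%.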

MTBI Identification From Diffusion MR Images Using Bag of Adversarial Visual Features


Title	MTBI Identification From Diffusion MR Images Using Bag of Adversarial Visual Features
Authors	Shervin Minaee, Yao Wang, Alp Aygar, Sohae Chung, Xiuyuan Wang, Yvonne W. Lui, Els Fieremans, Steven Flanagan, Joseph Rath
Abstract	In this work, we propose bag of adversarial features (BAF) for identifying mild traumatic brain injury (MTBI) patients from their diffusion magnetic resonance images (MRI) (obtained within one month of injury) by incorporating unsupervised feature learning techniques. MTBI is a growing public health problem with an estimated incidence of over 1.7 million people annually in the US. Diagnosis is based on clinical history and symptoms, and accurate, concrete measures of injury are lacking. Unlike most previous works, which use hand-crafted features extracted from different parts of the brain for MTBI classification, we employ feature learning algorithms to learn a more discriminative representation for this task. A major challenge in this field thus far is the relatively small number of subjects available for training. This makes it difficult to use an end-to-end convolutional neural network to directly classify a subject from MR images. To overcome this challenge, we first apply an adversarial auto-encoder (with convolutional structure) to learn patch-level features from overlapping image patches extracted from different brain regions. We then aggregate these features through a bag-of-words approach. We perform an extensive experimental study on a dataset of 227 subjects (including 109 MTBI patients, and 118 age and sex matched healthy controls), and compare the bag-of-deep-features with several previous approaches. Our experimental results show that the BAF significantly outperforms earlier works relying on the mean values of MR metrics in selected brain regions.
Tasks
Published	2018-06-27
URL	http://arxiv.org/abs/1806.10419v1
PDF	http://arxiv.org/pdf/1806.10419v1.pdf
PWC	https://paperswithcode.com/paper/mtbi-identification-from-diffusion-mr-images
Repo
Framework
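
The aggregation step in the abstract above (patch-level features pooled through a bag-of-words approach) can be sketched as a histogram over a codebook. This is only an illustration under stated assumptions: the codebook is taken as given (for instance from k-means, which the abstract does not specify), the function names are hypothetical, and the toy 2-D vectors stand in for learned auto-encoder features.

```python
from collections import Counter


def nearest_codeword(feature, codebook):
    """Index of the codebook vector closest to `feature` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def bag_of_features(patch_features, codebook):
    """Normalized histogram of codeword assignments for one subject's patches."""
    if not patch_features:
        raise ValueError("need at least one patch feature")
    counts = Counter(nearest_codeword(f, codebook) for f in patch_features)
    n = len(patch_features)
    return [counts[k] / n for k in range(len(codebook))]


# Toy data (assumed, for illustration): three codewords, four patch features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [0.2, 1.9]]
histogram = bag_of_features(patches, codebook)
print(histogram)
```

The resulting fixed-length histogram is what a downstream classifier would consume, regardless of how many patches each subject contributes.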

Conceptual Analysis of Hypertext


Title	Conceptual Analysis of Hypertext
Authors	Robert E. Kent, Christian Neuss
Abstract	In this chapter tools and techniques from the mathematical theory of formal concept analysis are applied to hypertext systems in general, and the World Wide Web in particular. Various processes for the conceptual structuring of hypertext are discussed: summarization, conceptual scaling, and the creation of conceptual links. Well-known interchange formats for summarizing networked information resources as resource meta-information are reviewed, and two new interchange formats originating from formal concept analysis are advocated. Also reviewed is conceptual scaling, which provides a principled approach to the faceted analysis techniques in library science classification. The important notion of conceptual linkage is introduced as a generalization of a hyperlink. The automatic hyperization of the content of legacy data is described, and the composite conceptual structuring with hypertext linkage is defined. For the conceptual empowerment of the Web user, a new technique called conceptual browsing is advocated. Conceptual browsing, which browses over conceptual links, is dual mode (extensional versus intensional) and dual scope (global versus local).
Tasks
Published	2018-10-16
URL	http://arxiv.org/abs/1810.07232v1
PDF	http://arxiv.org/pdf/1810.07232v1.pdf
PWC	https://paperswithcode.com/paper/conceptual-analysis-of-hypertext
Repo
Framework

Convolutional Neural Networks for Epileptic Seizure Prediction


Title	Convolutional Neural Networks for Epileptic Seizure Prediction
Authors	Matthias Eberlein, Raphael Hildebrand, Ronald Tetzlaff, Nico Hoffmann, Levin Kuhlmann, Benjamin Brinkmann, Jens Müller
Abstract	Epilepsy is the most common neurological disorder and an accurate forecast of seizures would help to overcome the patient’s uncertainty and helplessness. In this contribution, we present and discuss a novel methodology for the classification of intracranial electroencephalography (iEEG) for seizure prediction. Contrary to previous approaches, we categorically refrain from an extraction of hand-crafted features and use a convolutional neural network (CNN) topology instead for both the determination of suitable signal characteristics and the binary classification of preictal and interictal segments. Three different models have been evaluated on public datasets with long-term recordings from four dogs and three patients. Overall, our findings demonstrate the general applicability. In this work we discuss the strengths and limitations of our methodology.
Tasks	Seizure prediction
Published	2018-11-02
URL	http://arxiv.org/abs/1811.00915v2
PDF	http://arxiv.org/pdf/1811.00915v2.pdf
PWC	https://paperswithcode.com/paper/convolutional-neural-networks-for-epileptic
Repo
Framework

Face Detection in the Operating Room: Comparison of State-of-the-art Methods and a Self-supervised Approach


Title	Face Detection in the Operating Room: Comparison of State-of-the-art Methods and a Self-supervised Approach
Authors	Thibaut Issenhuth, Vinkle Srivastav, Afshin Gangi, Nicolas Padoy
Abstract	Purpose: Face detection is a needed component for the automatic analysis and assistance of human activities during surgical procedures. Efficient face detection algorithms can indeed help to detect and identify the persons present in the room, and also be used to automatically anonymize the data. However, current algorithms trained on natural images do not generalize well to the operating room (OR) images. In this work, we provide a comparison of state-of-the-art face detectors on OR data and also present an approach to train a face detector for the OR by exploiting non-annotated OR images. Methods: We propose a comparison of 6 state-of-the-art face detectors on clinical data using Multi-View Operating Room Faces (MVOR-Faces), a dataset of operating room images capturing real surgical activities. We then propose to use self-supervision, a domain adaptation method, for the task of face detection in the OR. The approach makes use of non-annotated images to fine-tune a state-of-the-art detector for the OR without using any human supervision. Results: The results show that the best model, namely the tiny face detector, yields an average precision of 0.536 at Intersection over Union (IoU) of 0.5. Our self-supervised model using non-annotated clinical data outperforms this result by 9.2%. Conclusion: We present the first comparison of state-of-the-art face detectors on operating room images and show that results can be significantly improved by using self-supervision on non-annotated data.
Tasks	Domain Adaptation, Face Detection
Published	2018-11-29
URL	http://arxiv.org/abs/1811.12296v2
PDF	http://arxiv.org/pdf/1811.12296v2.pdf
PWC	https://paperswithcode.com/paper/face-detection-in-the-operating-room
Repo
Framework
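
The abstract above reports average precision at an Intersection over Union (IoU) of 0.5. As a reminder of what that threshold means, here is a minimal sketch of IoU for two axis-aligned boxes; the `(x_min, y_min, x_max, y_max)` box convention and the function names are assumptions for illustration, not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred_box, gt_box, threshold=0.5):
    """A detection counts as matching the ground truth when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold


# Two 2x2 boxes that overlap on half their width.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
```

In this toy case the overlap area is 2 and the union is 6, so the pair would not count as a match at the 0.5 threshold.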

DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor


Title	DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
Authors	Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, Yebin Liu
Abstract	We propose DoubleFusion, a new real-time system that combines volumetric dynamic reconstruction with data-driven template fitting to simultaneously reconstruct detailed geometry, non-rigid motion and the inner human body shape from a single depth camera. One of the key contributions of this method is a double layer representation consisting of a complete parametric body shape inside, and a gradually fused outer surface layer. A pre-defined node graph on the body surface parameterizes the non-rigid deformations near the body, and a free-form dynamically changing graph parameterizes the outer surface layer far from the body, which allows more general reconstruction. We further propose a joint motion tracking method based on the double layer representation to enable robust and fast motion tracking performance. Moreover, the inner body shape is optimized online and forced to fit inside the outer surface layer. Overall, our method enables increasingly denoised, detailed and complete surface reconstructions, fast motion tracking performance and plausible inner body shape reconstruction in real-time. In particular, experiments show improved fast motion tracking and loop closure performance on more challenging scenarios.
Tasks
Published	2018-04-17
URL	http://arxiv.org/abs/1804.06023v1
PDF	http://arxiv.org/pdf/1804.06023v1.pdf
PWC	https://paperswithcode.com/paper/doublefusion-real-time-capture-of-human
Repo
Framework

Linguistic Characteristics of Censorable Language on SinaWeibo


Title	Linguistic Characteristics of Censorable Language on SinaWeibo
Authors	Kei Yin Ng, Anna Feldman, Jing Peng, Chris Leberknight
Abstract	This paper investigates censorship from a linguistic perspective. We collect a corpus of censored and uncensored posts on a number of topics and build a classifier that predicts censorship decisions independent of discussion topics. Our investigation reveals that the strongest linguistic indicator of censored content in our corpus is its readability.
Tasks
Published	2018-07-10
URL	http://arxiv.org/abs/1807.03654v1
PDF	http://arxiv.org/pdf/1807.03654v1.pdf
PWC	https://paperswithcode.com/paper/linguistic-characteristics-of-censorable
Repo
Framework

Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision


Title	Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision
Authors	Wanxin Tian, Zixuan Wang, Haifeng Shen, Weihong Deng, Yiping Meng, Binghui Chen, Xiubao Zhang, Yuan Zhao, Xiehe Huang
Abstract	The performance of face detectors has been largely improved with the development of convolutional neural networks. However, it remains challenging for face detectors to detect tiny, occluded or blurry faces. Besides, most face detectors can’t locate a face’s position precisely and can’t achieve high Intersection-over-Union (IoU) scores. We assume that the underlying problems are inadequate use of supervision information and an imbalance between semantics and details at all levels of feature maps in a CNN, even with Feature Pyramid Networks (FPN). In this paper, we present a novel single-shot face detection network, named DF$^2$S$^2$ (Detection with Feature Fusion and Segmentation Supervision), which introduces a more effective feature fusion pyramid and a more efficient segmentation branch on ResNet-50 to handle the mentioned problems. Specifically, inspired by FPN and SENet, we apply semantic information from higher-level feature maps as contextual cues to augment low-level feature maps via a spatial and channel-wise attention style, preventing details from being covered by too much semantics and making semantics and details complement each other. We further propose a semantic segmentation branch to best utilize detection supervision information while applying an attention mechanism in a self-supervised manner. The segmentation branch is supervised by weak segmentation ground-truth (no extra annotation is required) in a hierarchical manner and is discarded at inference time, so it does not compromise inference speed. We evaluate our model on the WIDER FACE dataset and achieve state-of-the-art results.
Tasks	Face Detection, Semantic Segmentation
Published	2018-11-20
URL	http://arxiv.org/abs/1811.08557v3
PDF	http://arxiv.org/pdf/1811.08557v3.pdf
PWC	https://paperswithcode.com/paper/learning-better-features-for-face-detection
Repo
Framework

Are All Languages Equally Hard to Language-Model?


Title	Are All Languages Equally Hard to Language-Model?
Authors	Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, Brian Roark
Abstract	For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both $n$-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.
Tasks	Language Modelling
Published	2018-06-10
URL	https://arxiv.org/abs/1806.03743v2
PDF	https://arxiv.org/pdf/1806.03743v2.pdf
PWC	https://paperswithcode.com/paper/are-all-languages-equally-hard-to-language
Repo
Framework

Learning the Base Distribution in Implicit Generative Models


Title	Learning the Base Distribution in Implicit Generative Models
Authors	Cem Subakan, Oluwasanmi Koyejo, Paris Smaragdis
Abstract	Popular generative model learning methods such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) force the latent representation to follow simple distributions such as an isotropic Gaussian. In this paper, we argue that learning a complicated distribution over the latent space of an auto-encoder enables more accurate modeling of complicated data distributions. Based on this observation, we propose a two-stage optimization procedure which maximizes an approximate implicit density model. We experimentally verify that our method outperforms GANs and VAEs on two image datasets (MNIST, CELEB-A). We also show that our approach is amenable to learning generative models for sequential data, by learning to generate speech and music.
Tasks
Published	2018-03-12
URL	http://arxiv.org/abs/1803.04357v2
PDF	http://arxiv.org/pdf/1803.04357v2.pdf
PWC	https://paperswithcode.com/paper/learning-the-base-distribution-in-implicit
Repo
Framework

Exploring End-to-End Techniques for Low-Resource Speech Recognition


Title	Exploring End-to-End Techniques for Low-Resource Speech Recognition
Authors	Vladimir Bataev, Maxim Korenevsky, Ivan Medennikov, Alexander Zatvornitskiy
Abstract	In this work we present a simple grapheme-based system for low-resource speech recognition using Babel data for Turkish spontaneous speech (80 hours). We have investigated the performance of different neural network architectures, including fully-convolutional, recurrent and ResNet with GRU. Different features and normalization techniques are compared as well. We also propose a CTC-loss modification using segmentation during training, which leads to improvement while decoding with a small beam size. Our best model achieved a word error rate of 45.8%, which is, to our knowledge, the best reported result for end-to-end systems using in-domain data for this task.
Tasks	Speech Recognition
Published	2018-07-02
URL	http://arxiv.org/abs/1807.00868v1
PDF	http://arxiv.org/pdf/1807.00868v1.pdf
PWC	https://paperswithcode.com/paper/exploring-end-to-end-techniques-for-low
Repo
Framework
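
The word error rate quoted in the abstract above is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch under that standard definition follows; the whitespace tokenization and the example sentences are assumptions for illustration, and the paper's scoring pipeline may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, this quantity can exceed 1.0 when the hypothesis is much longer than the reference.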

Scalable inference of topic evolution via models for latent geometric structures


Title	Scalable inference of topic evolution via models for latent geometric structures
Authors	Mikhail Yurochkin, Zhiwei Fan, Aritra Guha, Paraschos Koutris, XuanLong Nguyen
Abstract	We develop new models and algorithms for learning the temporal dynamics of the topic polytopes and related geometric objects that arise in topic model based inference. Our model is nonparametric Bayesian and the corresponding inference algorithm is able to discover new topics as time progresses. By exploiting the connection between the modeling of topic polytope evolution, the Beta-Bernoulli process and the Hungarian matching algorithm, our method is shown to be several orders of magnitude faster than existing topic modeling approaches, as demonstrated by experiments working with several million documents in under two dozen minutes.
Tasks
Published	2018-09-24
URL	https://arxiv.org/abs/1809.08738v3
PDF	https://arxiv.org/pdf/1809.08738v3.pdf
PWC	https://paperswithcode.com/paper/streaming-dynamic-and-distributed-inference
Repo
Framework

Deep Person Re-identification for Probabilistic Data Association in Multiple Pedestrian Tracking


Title	Deep Person Re-identification for Probabilistic Data Association in Multiple Pedestrian Tracking
Authors	Brian H. Wang, Yan Wang, Kilian Q. Weinberger, Mark Campbell
Abstract	We present a data association method for vision-based multiple pedestrian tracking, using deep convolutional features to distinguish between different people based on their appearances. These re-identification (re-ID) features are learned such that they are invariant to transformations such as rotation, translation, and changes in the background, allowing consistent identification of a pedestrian moving through a scene. We incorporate re-ID features into a general data association likelihood model for multiple person tracking, experimentally validate this model by using it to perform tracking in two evaluation video sequences, and examine the performance improvements gained as compared to several baseline approaches. Our results demonstrate that using deep person re-ID for data association greatly improves tracking robustness to challenges such as occlusions and path crossings.
Tasks	Person Re-Identification
Published	2018-10-19
URL	http://arxiv.org/abs/1810.08565v1
PDF	http://arxiv.org/pdf/1810.08565v1.pdf
PWC	https://paperswithcode.com/paper/deep-person-re-identification-for
Repo
Framework

Singularity, Misspecification, and the Convergence Rate of EM


Title	Singularity, Misspecification, and the Convergence Rate of EM
Authors	Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Michael I. Jordan, Martin J. Wainwright, Bin Yu
Abstract	A line of recent work has characterized the behavior of the EM algorithm in favorable settings in which the population likelihood is locally strongly concave around its maximizing argument. Examples include suitably separated Gaussian mixture models and mixtures of linear regressions. We consider instead over-fitted settings in which the likelihood need not be strongly concave, or, equivalently, when the Fisher information matrix might be singular. In such settings, it is known that a global maximum of the MLE based on $n$ samples can have a non-standard $n^{-1/4}$ rate of convergence. How does the EM algorithm behave in such settings? Focusing on the simple setting of a two-component mixture fit to a multivariate Gaussian distribution, we study the behavior of the EM algorithm both when the mixture weights are different (unbalanced case), and are equal (balanced case). Our analysis reveals a sharp distinction between these cases: in the former, the EM algorithm converges geometrically to a point at Euclidean distance $O((d/n)^{1/2})$ from the true parameter, whereas in the latter case, the convergence rate is exponentially slower, and the fixed point has a much lower $O((d/n)^{1/4})$ accuracy. The slower convergence in the balanced over-fitted case arises from the singularity of the Fisher information matrix. Analysis of this singular case requires the introduction of some novel analysis techniques, in particular we make use of a careful form of localization in the associated empirical process, and develop a recursive argument to progressively sharpen the statistical rate.
Tasks
Published	2018-10-01
URL	http://arxiv.org/abs/1810.00828v1
PDF	http://arxiv.org/pdf/1810.00828v1.pdf
PWC	https://paperswithcode.com/paper/singularity-misspecification-and-the
Repo
Framework
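
To make the balanced over-fitted setting in the abstract above concrete, here is a toy sketch in one dimension. It assumes a simplified model that the abstract does not spell out: data drawn from a standard normal, fitted with the symmetric mixture 0.5·N(−θ, 1) + 0.5·N(θ, 1) with known unit variance, for which the EM update reduces to θ ← mean(x · tanh(θx)). The function names, sample size, seed, and starting point are all hypothetical.

```python
import math
import random


def em_update(theta, xs):
    """One EM step for the balanced mixture 0.5*N(-theta, 1) + 0.5*N(theta, 1).

    The E-step weight of the +theta component is sigmoid(2*theta*x); plugging
    it into the M-step gives the closed form mean(x * tanh(theta * x)).
    """
    return sum(x * math.tanh(theta * x) for x in xs) / len(xs)


def run_em(xs, theta0=1.0, steps=200):
    """Iterate the EM update and return the full trajectory of theta."""
    trajectory = [theta0]
    for _ in range(steps):
        trajectory.append(em_update(trajectory[-1], xs))
    return trajectory


# Over-fitted case: the data come from a single Gaussian (true theta = 0).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
trajectory = run_em(xs)
print(trajectory[1], trajectory[10], trajectory[-1])
```

Printing successive iterates gives a feel for how quickly (or slowly) the estimate moves toward its fixed point in this toy setup; it is not a reproduction of the paper's multivariate analysis.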

A mixed signal architecture for convolutional neural networks


Title	A mixed signal architecture for convolutional neural networks
Authors	Qiuwen Lou, Chenyun Pan, John McGuiness, Andras Horvath, Azad Naeemi, Michael Niemier, X. Sharon Hu
Abstract	Deep neural network (DNN) accelerators with improved energy and delay are desirable for meeting the requirements of hardware targeted for IoT and edge computing systems. Convolutional neural networks (CoNNs) belong to one of the most popular types of DNN architectures. This paper presents the design and evaluation of an accelerator for CoNNs. The system-level architecture is based on mixed-signal, cellular neural networks (CeNNs). Specifically, we present (i) the implementation of different layers, including convolution, ReLU, and pooling, in a CoNN using CeNN, (ii) modified CoNN structures with CeNN-friendly layers to reduce computational overheads typically associated with a CoNN, (iii) a mixed-signal CeNN architecture that performs CoNN computations in the analog and mixed signal domain, and (iv) design space exploration that identifies what CeNN-based algorithm and architectural features fare best compared to existing algorithms and architectures when evaluated over common datasets – MNIST and CIFAR-10. Notably, the proposed approach can lead to 8.7$\times$ improvements in energy-delay product (EDP) per digit classification for the MNIST dataset at iso-accuracy when compared with the state-of-the-art DNN engine, while our approach could offer 4.3$\times$ improvements in EDP when compared to other network implementations for the CIFAR-10 dataset.
Tasks
Published	2018-10-30
URL	https://arxiv.org/abs/1811.02636v4
PDF	https://arxiv.org/pdf/1811.02636v4.pdf
PWC	https://paperswithcode.com/paper/a-mixed-signal-architecture-for-convolutional
Repo
Framework