October 18, 2019

3322 words 16 mins read

Paper Group ANR 426


Deep Audio-Visual Speech Recognition

Title Deep Audio-Visual Speech Recognition
Authors Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
Abstract The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss and the other using a sequence-to-sequence loss; both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
Tasks Audio-Visual Speech Recognition, Speech Recognition, Visual Speech Recognition
Published 2018-09-06
URL http://arxiv.org/abs/1809.02108v2
PDF http://arxiv.org/pdf/1809.02108v2.pdf
PWC https://paperswithcode.com/paper/deep-audio-visual-speech-recognition
Repo
Framework
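
A minimal sketch of the two loss functions the paper compares, written in PyTorch (an assumption; the tensor shapes, vocabulary size, and stand-in visual features below are illustrative, not the authors' configuration):

```python
import torch
import torch.nn as nn

vocab_size, T, B = 40, 75, 2            # characters, video frames, batch (toy values)
features = torch.randn(T, B, 512)       # stand-in for the visual frontend's output

# (1) CTC: predict a per-frame character distribution; the loss marginalizes
# over all monotonic alignments between frames and the target transcript.
ctc_head = nn.Linear(512, vocab_size + 1)           # +1 for the CTC blank symbol
log_probs = ctc_head(features).log_softmax(-1)      # (T, B, vocab_size + 1)
targets = torch.randint(1, vocab_size, (B, 20))
loss_ctc = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((B,), T),
    target_lengths=torch.full((B,), 20))

# (2) sequence-to-sequence: an autoregressive decoder attends over the
# features and is trained with per-token cross-entropy on the transcript.
decoder_logits = torch.randn(B, 20, vocab_size)     # stand-in decoder output
loss_s2s = nn.CrossEntropyLoss()(
    decoder_logits.reshape(-1, vocab_size), targets.reshape(-1))
```

The practical difference: CTC assumes conditional independence across frames and decodes monotonically, while the sequence-to-sequence model conditions each output token on the previously emitted ones.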

VoroTop: Voronoi Cell Topology Visualization and Analysis Toolkit

Title VoroTop: Voronoi Cell Topology Visualization and Analysis Toolkit
Authors Emanuel A. Lazar
Abstract This paper introduces a new open-source software program called VoroTop, which uses Voronoi topology to analyze local structure in atomic systems. Strengths of this approach include its abilities to analyze high-temperature systems and to characterize complex structure such as grain boundaries. This approach enables the automated analysis of systems and mechanisms previously not possible.
Tasks
Published 2018-04-11
URL http://arxiv.org/abs/1804.04221v1
PDF http://arxiv.org/pdf/1804.04221v1.pdf
PWC https://paperswithcode.com/paper/vorotop-voronoi-cell-topology-visualization
Repo
Framework
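
To illustrate the idea behind Voronoi-topology analysis (this uses SciPy rather than VoroTop itself, and a crude face count rather than the full cell topology the toolkit classifies), a sketch:

```python
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
points = rng.random((200, 3))           # stand-in atomic positions

vor = Voronoi(points)

# Each row of ridge_points names the two input points sharing a Voronoi
# face, so counting occurrences yields the face count of every cell -- a
# combinatorial feature that is insensitive to small thermal displacements.
faces = np.zeros(len(points), dtype=int)
for p, q in vor.ridge_points:
    faces[p] += 1
    faces[q] += 1

print(np.bincount(faces))               # distribution of cell face counts
```

Combinatorial descriptors like these are what make the approach robust at high temperature: thermal noise perturbs cell geometry continuously but changes cell topology only rarely.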

Evolving Chaos: Identifying New Attractors of the Generalised Lorenz Family

Title Evolving Chaos: Identifying New Attractors of the Generalised Lorenz Family
Authors Indranil Pan, Saptarshi Das
Abstract In a recent paper, we presented an intelligent evolutionary search technique using genetic programming (GP) for finding new analytical expressions of nonlinear dynamical systems that, like the classical Lorenz attractor, exhibit chaotic behaviour in the phase space. In this paper, we extend our previous findings to explore yet another gallery of new chaotic attractors derived from the original Lorenz system of equations. Compared to the previous exploration with sinusoidal-type transcendental nonlinearity, here we focus only on cross-product and higher-power nonlinearities in the three state equations. We report over 150 different structures of chaotic attractors, along with one set of parameter values for each, their phase space dynamics, and their Largest Lyapunov Exponents (LLE). The expressions of these new Lorenz-like nonlinear dynamical systems have been automatically evolved through multi-gene genetic programming (MGGP). In the past two decades, there have been many claims of designing new chaotic attractors as incremental extensions of the Lorenz family. We provide here a large family of chaotic systems whose structures closely resemble the original Lorenz system but whose phase space dynamics are drastically different. This advances the state of the art in discovering new chaotic systems, which can find application in many real-world problems, and may serve as an archival reference for future work on new chaotic system discovery.
Tasks
Published 2018-01-28
URL http://arxiv.org/abs/1803.00052v1
PDF http://arxiv.org/pdf/1803.00052v1.pdf
PWC https://paperswithcode.com/paper/evolving-chaos-identifying-new-attractors-of
Repo
Framework
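
For context, a sketch (standard SciPy, not the paper's MGGP pipeline) that integrates the classical Lorenz system and estimates the largest Lyapunov exponent by tracking the divergence of two nearby trajectories, the same diagnostic reported for each evolved attractor:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_eval = np.linspace(0, 50, 5000)
kw = dict(t_eval=t_eval, rtol=1e-9, atol=1e-12)   # tight tolerances: the
a = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0], **kw).y          # trajectories
b = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0 + 1e-8], **kw).y   # start 1e-8 apart

# While the separation is small it grows as d(t) ~ d(0) * exp(LLE * t),
# so a line fit to log d(t) estimates the largest Lyapunov exponent.
d = np.linalg.norm(a - b, axis=0)
grow = slice(100, 1500)                 # window before the separation saturates
lle = np.polyfit(t_eval[grow], np.log(d[grow]), 1)[0]
print(f"estimated LLE ~ {lle:.2f}")     # ~0.9 for these classic parameters
```

A positive LLE is the signature of chaos used to accept or reject candidate systems evolved by the search.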

An Orchestrated Empirical Study on Deep Learning Frameworks and Platforms

Title An Orchestrated Empirical Study on Deep Learning Frameworks and Platforms
Authors Qianyu Guo, Xiaofei Xie, Lei Ma, Qiang Hu, Ruitao Feng, Li Li, Yang Liu, Jianjun Zhao, Xiaohong Li
Abstract Deep learning (DL) has recently achieved tremendous success in a variety of cutting-edge applications, e.g., image recognition, speech and natural language processing, and autonomous driving. Besides the available big data and hardware evolution, DL frameworks and platforms play a key role in catalyzing the research, development, and deployment of DL intelligent solutions. However, differences in the computation paradigm, architecture design, and implementation of existing DL frameworks and platforms bring challenges for DL software development, deployment, maintenance, and migration. To date, there has been no comprehensive study of how current diverse DL frameworks and platforms influence the DL software development process. In this paper, we take a first step towards investigating how existing state-of-the-art DL frameworks (i.e., TensorFlow, Theano, and Torch) and platforms (i.e., server/desktop, web, and mobile) support DL software development activities. We perform an in-depth comparative evaluation on metrics such as learning accuracy, DL model size, robustness, and performance, across state-of-the-art DL frameworks and platforms using two popular datasets, MNIST and CIFAR-10. Our study reveals that existing DL frameworks still suffer from compatibility issues, which become even more severe across different platforms. We pinpoint current challenges and opportunities towards developing high-quality and compatible DL systems. To spur further investigation in this direction and address urgent industrial demand for intelligent solutions, we make our assembled toolchain and datasets publicly available.
Tasks Autonomous Driving
Published 2018-11-13
URL http://arxiv.org/abs/1811.05187v1
PDF http://arxiv.org/pdf/1811.05187v1.pdf
PWC https://paperswithcode.com/paper/an-orchestrated-empirical-study-on-deep
Repo
Framework
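
A sketch of the kind of per-framework measurement such a study involves: train one small model, then record test accuracy, on-disk model size, and batch inference latency. Shown for TensorFlow/Keras only; the paper's actual models, metrics, and harness differ:

```python
import os, time
import tensorflow as tf

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr / 255.0, x_te / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=1, verbose=0)

_, acc = model.evaluate(x_te, y_te, verbose=0)      # learning accuracy
model.save("mnist.keras")                           # needs TF >= 2.12
size = os.path.getsize("mnist.keras")               # model size on disk
t0 = time.time(); model.predict(x_te, verbose=0)    # inference performance
print(acc, size, time.time() - t0)
```

Repeating the same measurements under another framework, or after converting the model for a web or mobile runtime, is what surfaces the compatibility gaps the paper reports.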

Cartesian Neural Network Constitutive Models for Data-driven Elasticity Imaging

Title Cartesian Neural Network Constitutive Models for Data-driven Elasticity Imaging
Authors Cameron Hoerig, Jamshid Ghaboussi, Michael F. Insana
Abstract Elasticity images map biomechanical properties of soft tissues to aid in the detection and diagnosis of pathological states. In particular, quasi-static ultrasonic (US) elastography techniques use force-displacement measurements acquired during an US scan to parameterize the spatio-temporal stress-strain behavior. Current methods use a model-based inverse approach to estimate the parameters associated with a chosen constitutive model. However, model-based methods rely on simplifying assumptions of tissue biomechanical properties, often limiting elastography to imaging one or two linear-elastic parameters. We previously described a data-driven method for building neural network constitutive models (NNCMs) that learn stress-strain relationships from force-displacement data. Using measurements acquired on gelatin phantoms, we demonstrated the ability of NNCMs to characterize linear-elastic mechanical properties without an initial model assumption and thus circumvent the mathematical constraints typically encountered in classic model-based approaches to the inverse problem. While successful, we were required to use a priori knowledge of the internal object shape to define the spatial distribution of regions exhibiting different material properties. Here, we introduce Cartesian neural network constitutive models (CaNNCMs) that are capable of using data to model both linear-elastic mechanical properties and their distribution in space. We demonstrate the ability of CaNNCMs to capture arbitrary material property distributions using stress-strain data from simulated phantoms. Furthermore, we show that a trained CaNNCM can be used to reconstruct a Young’s modulus image. CaNNCMs are an important step toward data-driven modeling and imaging the complex mechanical properties of soft tissues.
Tasks
Published 2018-09-11
URL http://arxiv.org/abs/1809.04121v1
PDF http://arxiv.org/pdf/1809.04121v1.pdf
PWC https://paperswithcode.com/paper/cartesian-neural-network-constitutive-models
Repo
Framework
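
A hypothetical sketch (PyTorch; not the authors' architecture, which is trained inside a finite-element loop) of the core idea: the network receives Cartesian coordinates alongside strain, so the spatial distribution of material properties is learned implicitly rather than prescribed:

```python
import torch
import torch.nn as nn

class CaNNCMSketch(nn.Module):
    """Maps (x, y) position plus a 2D strain vector to a 2D stress vector."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 3, hidden), nn.Tanh(),   # (x, y) + (exx, eyy, exy)
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 3),                  # (sxx, syy, sxy)
        )

    def forward(self, xy, strain):
        return self.net(torch.cat([xy, strain], dim=-1))

model = CaNNCMSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
xy, strain, stress = torch.rand(256, 2), torch.randn(256, 3), torch.randn(256, 3)

opt.zero_grad()
loss = nn.functional.mse_loss(model(xy, strain), stress)  # fit stress-strain data
loss.backward()
opt.step()
```

Because position is an input, querying the trained network on a grid of (x, y) points with a fixed probe strain yields a map of effective stiffness, which is how a Young's modulus image can be reconstructed.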

Skilled Experience Catalogue: A Skill-Balancing Mechanism for Non-Player Characters using Reinforcement Learning

Title Skilled Experience Catalogue: A Skill-Balancing Mechanism for Non-Player Characters using Reinforcement Learning
Authors Frank G. Glavin, Michael G. Madden
Abstract In this paper, we introduce a skill-balancing mechanism for adversarial non-player characters (NPCs), called Skilled Experience Catalogue (SEC). The objective of this mechanism is to approximately match the skill level of an NPC to an opponent in real-time. We test the technique in the context of a First-Person Shooter (FPS) game. Specifically, the technique adjusts a reinforcement learning NPC’s proficiency with a weapon based on its current performance against an opponent. Firstly, a catalogue of experience, in the form of stored learning policies, is built up by playing a series of training games. Once the NPC has been sufficiently trained, the catalogue acts as a timeline of experience with incremental knowledge milestones in the form of stored learning policies. If the NPC is performing poorly, it can jump to a later stage in the learning timeline to be equipped with more informed decision-making. Likewise, if it is performing significantly better than the opponent, it will jump to an earlier stage. The NPC continues to learn in real-time using reinforcement learning but its policy is adjusted, as required, by loading the most suitable milestones for the current circumstances.
Tasks Decision Making
Published 2018-06-20
URL http://arxiv.org/abs/1806.07637v1
PDF http://arxiv.org/pdf/1806.07637v1.pdf
PWC https://paperswithcode.com/paper/skilled-experience-catalogue-a-skill
Repo
Framework
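
The milestone-jumping rule lends itself to a short sketch (the thresholds, scoring window, and policy representation below are illustrative assumptions, not values from the paper):

```python
def select_milestone(policies, current_idx, score_margin, low=-5, high=5):
    """Pick which stored policy the NPC should load next.

    policies     -- stored learning policies, ordered weakest to strongest
    current_idx  -- index of the policy currently in use
    score_margin -- NPC score minus opponent score over a recent window
    """
    if score_margin < low and current_idx < len(policies) - 1:
        return current_idx + 1   # losing: jump to a later, stronger milestone
    if score_margin > high and current_idx > 0:
        return current_idx - 1   # dominating: fall back to an earlier one
    return current_idx           # evenly matched: keep the current policy

# e.g. an NPC at milestone 3, down 7 points, moves up to milestone 4:
assert select_milestone(list(range(10)), 3, score_margin=-7) == 4
```

Reinforcement learning continues in real time on top of whichever policy is loaded, so the catalogue only sets the starting competence, not a fixed behaviour.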

Weighted Nonlocal Total Variation in Image Processing

Title Weighted Nonlocal Total Variation in Image Processing
Authors Haohan Li, Zuoqiang Shi, Xiaoping Wang
Abstract In this paper, a novel weighted nonlocal total variation (WNTV) method is proposed. Compared to classical nonlocal total variation methods, our method modifies the energy functional by introducing a weight that balances the labeled and unlabeled sets. With extensive numerical examples in semi-supervised clustering, image inpainting, and image colorization, we demonstrate that WNTV provides an effective and efficient method for many image processing and machine learning problems.
Tasks Colorization, Image Inpainting
Published 2018-01-31
URL http://arxiv.org/abs/1801.10441v1
PDF http://arxiv.org/pdf/1801.10441v1.pdf
PWC https://paperswithcode.com/paper/weighted-nonlocal-total-variation-in-image
Repo
Framework
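
A numpy sketch of a weighted nonlocal TV energy in the spirit described above (the weight mu on labeled nodes and the toy graph construction are illustrative; the paper's exact functional may differ in detail):

```python
import numpy as np

def wntv_energy(u, w, labeled, mu=10.0, eps=1e-12):
    """u: (n,) node values; w: (n, n) symmetric nonlocal similarity weights;
    labeled: boolean mask of nodes whose values are known."""
    diff2 = (u[:, None] - u[None, :]) ** 2
    grad = np.sqrt((w * diff2).sum(axis=1) + eps)   # nonlocal gradient norm per node
    scale = np.where(labeled, mu, 1.0)              # rebalance labeled vs unlabeled
    return (scale * grad).sum()

rng = np.random.default_rng(0)
u = rng.random(50)
w = rng.random((50, 50)); w = (w + w.T) / 2         # toy symmetric weights
labeled = np.zeros(50, dtype=bool); labeled[:5] = True
print(wntv_energy(u, w, labeled))
```

Minimizing such an energy over the unlabeled values, with the labeled ones fixed, propagates information from the labeled set across the nonlocal graph, which is what drives the inpainting, colorization, and clustering applications.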

Human-level Performance On Automatic Head Biometrics In Fetal Ultrasound Using Fully Convolutional Neural Networks

Title Human-level Performance On Automatic Head Biometrics In Fetal Ultrasound Using Fully Convolutional Neural Networks
Authors Matthew Sinclair, Christian F. Baumgartner, Jacqueline Matthew, Wenjia Bai, Juan Cerrolaza Martinez, Yuanwei Li, Sandra Smith, Caroline L. Knight, Bernhard Kainz, Jo Hajnal, Andrew P. King, Daniel Rueckert
Abstract Measurement of head biometrics from fetal ultrasonography images is of key importance in monitoring the healthy development of fetuses. However, the accurate measurement of relevant anatomical structures is subject to large inter-observer variability in the clinic. To address this issue, an automated method utilizing Fully Convolutional Networks (FCN) is proposed to determine measurements of fetal head circumference (HC) and biparietal diameter (BPD). An FCN was trained on approximately 2000 2D ultrasound images of the head with annotations provided by 45 different sonographers during routine screening examinations to perform semantic segmentation of the head. An ellipse is fitted to the resulting segmentation contours to mimic the annotation typically produced by a sonographer. The model’s performance was compared with inter-observer variability, where two experts manually annotated 100 test images. Mean absolute model-expert error was slightly better than inter-observer error for HC (1.99mm vs 2.16mm), and comparable for BPD (0.61mm vs 0.59mm), as well as Dice coefficient (0.980 vs 0.980). Our results demonstrate that the model performs at a level similar to a human expert, and learns to produce accurate predictions from a large dataset annotated by many sonographers. Additionally, measurements are generated in near real-time at 15fps on a GPU, which could speed up clinical workflow for both skilled and trainee sonographers.
Tasks Semantic Segmentation
Published 2018-04-24
URL http://arxiv.org/abs/1804.09102v1
PDF http://arxiv.org/pdf/1804.09102v1.pdf
PWC https://paperswithcode.com/paper/human-level-performance-on-automatic-head
Repo
Framework
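
The ellipse-fitting and biometric step can be sketched with OpenCV (the synthetic mask and the pixel-to-mm calibration below are placeholders for the FCN output and the scanner's calibration):

```python
import numpy as np
import cv2

# Stand-in for the FCN's binary head segmentation:
mask = np.zeros((300, 400), np.uint8)
cv2.ellipse(mask, ((200, 150), (220, 160), 30), 255, -1)

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnt = max(contours, key=cv2.contourArea)
(cx, cy), axes, angle = cv2.fitEllipse(cnt)   # axes are full lengths in pixels

mm_per_px = 0.2                               # assumed calibration factor
a, b = max(axes) / 2 * mm_per_px, min(axes) / 2 * mm_per_px
bpd = 2 * b                                   # biparietal diameter ~ minor axis

# Head circumference via Ramanujan's approximation of the ellipse perimeter:
h = ((a - b) / (a + b)) ** 2
hc = np.pi * (a + b) * (1 + 3 * h / (10 + np.sqrt(4 - 3 * h)))
print(f"HC ~ {hc:.1f} mm, BPD ~ {bpd:.1f} mm")
```

Fitting an ellipse rather than measuring the raw contour mirrors clinical practice, where sonographers annotate the head with an ellipse tool, which is why the paper compares against such annotations.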

Confounding variables can degrade generalization performance of radiological deep learning models

Title Confounding variables can degrade generalization performance of radiological deep learning models
Authors John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, Eric K. Oermann
Abstract Early results in using convolutional neural networks (CNNs) on x-rays to diagnose disease have been promising, but it has not yet been shown that models trained on x-rays from one hospital or one group of hospitals will work equally well at different hospitals. Before these tools are used for computer-aided diagnosis in real-world clinical settings, we must verify their ability to generalize across a variety of hospital systems. A cross-sectional design was used to train and evaluate pneumonia screening CNNs on 158,323 chest x-rays from NIH (n=112,120 from 30,805 patients), Mount Sinai (42,396 from 12,904 patients), and Indiana (n=3,807 from 3,683 patients). In 3 / 5 natural comparisons, performance on chest x-rays from outside hospitals was significantly lower than on held-out x-rays from the original hospital systems. CNNs were able to detect where an x-ray was acquired (hospital system, hospital department) with extremely high accuracy and calibrate predictions accordingly. The performance of CNNs in diagnosing diseases on x-rays may reflect not only their ability to identify disease-specific imaging findings on x-rays, but also their ability to exploit confounding information. Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.
Tasks
Published 2018-07-02
URL http://arxiv.org/abs/1807.00431v2
PDF http://arxiv.org/pdf/1807.00431v2.pdf
PWC https://paperswithcode.com/paper/confounding-variables-can-degrade
Repo
Framework
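
The failure mode is easy to reproduce in miniature: train where a site-specific artifact correlates with the label, then test at a site where it does not. A synthetic sklearn sketch (not the paper's data or models):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, confounded):
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 20))
    X[:, 0] += 0.8 * y                  # genuine disease signal (both sites)
    if confounded:
        X[:, 1] += 1.5 * y              # scanner/department artifact tied to y
    return X, y

X_a, y_a = make_site(4000, confounded=True)    # training hospital system
X_b, y_b = make_site(4000, confounded=False)   # outside hospital system

clf = LogisticRegression(max_iter=1000).fit(X_a[:3000], y_a[:3000])
print("internal AUC:",
      roc_auc_score(y_a[3000:], clf.predict_proba(X_a[3000:])[:, 1]))
print("external AUC:",
      roc_auc_score(y_b, clf.predict_proba(X_b)[:, 1]))
```

The model leans on the artifact feature, so internal performance overstates what transfers, which is the pattern the study observed in 3 of 5 comparisons.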

Privacy-preserving Prediction

Title Privacy-preserving Prediction
Authors Cynthia Dwork, Vitaly Feldman
Abstract Ensuring differential privacy of models learned from sensitive user data is an important goal that has been studied extensively in recent years. It is now known that for some basic learning problems, especially those involving high-dimensional data, producing an accurate private model requires much more data than learning without privacy. At the same time, in many applications it is not necessary to expose the model itself. Instead users may be allowed to query the prediction model on their inputs only through an appropriate interface. Here we formulate the problem of ensuring privacy of individual predictions and investigate the overheads required to achieve it in several standard models of classification and regression. We first describe a simple baseline approach based on training several models on disjoint subsets of data and using standard private aggregation techniques to predict. We show that this approach has nearly optimal sample complexity for (realizable) PAC learning of any class of Boolean functions. At the same time, without strong assumptions on the data distribution, the aggregation step introduces a substantial overhead. We demonstrate that this overhead can be avoided for the well-studied class of thresholds on a line and for a number of standard settings of convex regression. The analysis of our algorithm for learning thresholds relies crucially on strong generalization guarantees that we establish for all differentially private prediction algorithms.
Tasks
Published 2018-03-27
URL http://arxiv.org/abs/1803.10266v2
PDF http://arxiv.org/pdf/1803.10266v2.pdf
PWC https://paperswithcode.com/paper/privacy-preserving-prediction
Repo
Framework
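
The baseline admits a compact sketch: train k models on disjoint shards and answer each prediction query with a noisy-argmax vote, here a Gumbel-noise implementation of the exponential mechanism (the base learner, shard count, and privacy accounting below are illustrative and simplified):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

k, eps = 15, 1.0
models = [DecisionTreeClassifier(max_depth=4).fit(Xi, yi)
          for Xi, yi in zip(np.array_split(X, k), np.array_split(y, k))]

def private_predict(x):
    # Changing one training example changes at most one model, hence one
    # vote, so the vote counts have sensitivity 1 per prediction query.
    votes = np.bincount([m.predict(x[None])[0] for m in models], minlength=2)
    noisy = votes + rng.gumbel(scale=2.0 / eps, size=2)  # exponential mechanism
    return int(np.argmax(noisy))

print(private_predict(rng.normal(size=10)))
```

The paper's point is that this aggregation step, while simple and nearly sample-optimal for realizable PAC learning, carries an overhead that can be avoided for thresholds on a line and several convex regression settings.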

Zero-shot keyword spotting for visual speech recognition in-the-wild

Title Zero-shot keyword spotting for visual speech recognition in-the-wild
Authors Themos Stafylakis, Georgios Tzimiropoulos
Abstract Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior works on KWS, which try to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), and drastically improves over other recently proposed ASR-free KWS methods.
Tasks Keyword Spotting, Speech Recognition, Visual Speech Recognition
Published 2018-07-23
URL http://arxiv.org/abs/1807.08469v2
PDF http://arxiv.org/pdf/1807.08469v2.pdf
PWC https://paperswithcode.com/paper/zero-shot-keyword-spotting-for-visual-speech
Repo
Framework
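
Component (b), the grapheme-to-phoneme encoder-decoder, in miniature (PyTorch; alphabet sizes and dimensions are illustrative, and the visual frontend and correlation RNNs are omitted):

```python
import torch
import torch.nn as nn

class G2P(nn.Module):
    def __init__(self, n_graphemes=30, n_phonemes=45, d=128):
        super().__init__()
        self.emb_g = nn.Embedding(n_graphemes, d)
        self.emb_p = nn.Embedding(n_phonemes, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, n_phonemes)

    def forward(self, graphemes, phonemes_in):
        _, h = self.encoder(self.emb_g(graphemes))    # summarize the spelling
        dec, _ = self.decoder(self.emb_p(phonemes_in), h)
        return self.out(dec)                          # per-step phoneme logits

model = G2P()
logits = model(torch.randint(0, 30, (2, 8)),          # a batch of 2 words
               torch.randint(0, 45, (2, 6)))          # teacher-forced phonemes
print(logits.shape)                                   # torch.Size([2, 6, 45])
```

Because the keyword representation is derived from pronunciation rather than spelling alone, a query word never seen in training still maps into the same phoneme space the visual branch was trained against, which is what makes the zero-shot setting workable.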

A Robust Background Initialization Algorithm with Superpixel Motion Detection

Title A Robust Background Initialization Algorithm with Superpixel Motion Detection
Authors Zhe Xu, Biao Min, Ray C. C. Cheung
Abstract Scene background initialization allows the recovery of a clear image without foreground objects from a video sequence, which is generally the first step in many computer vision and video processing applications. The process may be strongly affected by challenges such as illumination changes, foreground clutter, intermittent movement, etc. In this paper, a robust background initialization approach based on superpixel motion detection is proposed. Both spatial and temporal characteristics of frames are adopted to effectively eliminate foreground objects. A subsequence with stable illumination conditions is first selected for background estimation. Images are segmented into superpixels to preserve spatial texture information, and foreground objects are eliminated by a superpixel motion filtering process. A low-complexity density-based clustering is then performed to generate reliable background candidates for final background determination. The approach has been evaluated on the SBMnet dataset, where it achieves performance superior or comparable to other state-of-the-art works, with faster processing speed. Moreover, in the complex and dynamic categories, the algorithm produces the best results, demonstrating robustness against very challenging scenarios.
Tasks Motion Detection
Published 2018-05-17
URL http://arxiv.org/abs/1805.06737v1
PDF http://arxiv.org/pdf/1805.06737v1.pdf
PWC https://paperswithcode.com/paper/a-robust-background-initialization-algorithm
Repo
Framework
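
The superpixel motion test can be sketched with scikit-image (the segment count and motion threshold are illustrative assumptions, and the illumination-stable subsequence selection and density-based clustering stages are omitted):

```python
import numpy as np
from skimage.segmentation import slic

def static_superpixels(frame, prev_frame, n_segments=300, thresh=0.02):
    """Return a mask of pixels whose superpixel shows no significant motion."""
    labels = slic(frame, n_segments=n_segments)      # spatial grouping
    diff = np.abs(frame - prev_frame).mean(axis=-1)  # per-pixel temporal change
    keep = np.zeros(diff.shape, dtype=bool)
    for lab in np.unique(labels):
        region = labels == lab
        if diff[region].mean() < thresh:             # stable region -> background
            keep[region] = True
    return keep

rng = np.random.default_rng(0)
f0 = rng.random((120, 160, 3))
f1 = np.clip(f0 + rng.normal(0, 0.005, f0.shape), 0, 1)  # near-static next frame
print(static_superpixels(f1, f0).mean())   # fraction kept as background
```

Deciding motion per superpixel rather than per pixel is what preserves texture boundaries and suppresses the speckle that plagues pixel-wise frame differencing.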

Large-Scale Visual Speech Recognition

Title Large-Scale Visual Speech Recognition
Authors Brendan Shillingford, Yannis Assael, Matthew W. Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett, Marie Mulville, Ben Coppin, Ben Laurie, Andrew Senior, Nando de Freitas
Abstract This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on other lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.
Tasks Lipreading, Speech Recognition, Visual Speech Recognition
Published 2018-07-13
URL http://arxiv.org/abs/1807.05162v3
PDF http://arxiv.org/pdf/1807.05162v3.pdf
PWC https://paperswithcode.com/paper/large-scale-visual-speech-recognition
Repo
Framework
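
Word error rate, the metric all the numbers above are quoted in, is the word-level edit distance between hypothesis and reference divided by the reference length; a standard dynamic-programming implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ~ 0.33
```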

Paired 3D Model Generation with Conditional Generative Adversarial Networks

Title Paired 3D Model Generation with Conditional Generative Adversarial Networks
Authors Cihan Öngün, Alptekin Temizel
Abstract Generative Adversarial Networks (GANs) have been shown to be successful at generating new and realistic samples, including 3D object models. Conditional GAN, a variant of GANs, allows samples to be generated under given conditions. However, the objects generated for each condition differ, so it does not allow generation of the same object under different conditions. In this paper, we first adapt conditional GAN, which was originally designed for 2D image generation, to the problem of generating 3D models in different rotations. We then propose a new approach to guide the network to generate the same 3D sample in different and controllable rotation angles (sample pairs). Unlike previous studies, the proposed method does not require modification of the standard conditional GAN architecture and can be integrated into the training step of any conditional GAN. Experimental results and visual comparison of 3D models show that the proposed method is successful at generating model pairs in different conditions.
Tasks Image Generation
Published 2018-08-09
URL http://arxiv.org/abs/1808.03082v2
PDF http://arxiv.org/pdf/1808.03082v2.pdf
PWC https://paperswithcode.com/paper/paired-3d-model-generation-with-conditional
Repo
Framework
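
A toy conditional generator conditioned on a discrete rotation angle (the shapes, embedding size, and MLP body are illustrative; the paper works with voxelized 3D models and a standard conditional GAN):

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, z_dim=64, n_rotations=8, vox=16):
        super().__init__()
        self.cond = nn.Embedding(n_rotations, 16)    # rotation-angle condition
        self.net = nn.Sequential(
            nn.Linear(z_dim + 16, 256), nn.ReLU(),
            nn.Linear(256, vox ** 3), nn.Sigmoid(),  # voxel occupancies
        )
        self.vox = vox

    def forward(self, z, rotation):
        h = torch.cat([z, self.cond(rotation)], dim=-1)
        return self.net(h).view(-1, self.vox, self.vox, self.vox)

g = CondGenerator()
z = torch.randn(4, 64)
# The pairing idea: reuse the *same* latent z under two rotation conditions,
# so the two outputs should depict one object seen at two angles.
pair = g(torch.cat([z, z]), torch.tensor([0, 0, 0, 0, 2, 2, 2, 2]))
print(pair.shape)   # torch.Size([8, 16, 16, 16])
```

The paper's contribution is the training procedure that makes such same-z, different-condition pairs consistent without altering the conditional GAN architecture itself.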

Variational Recurrent Neural Machine Translation

Title Variational Recurrent Neural Machine Translation
Authors Jinsong Su, Shan Wu, Deyi Xiong, Yaojie Lu, Xianpei Han, Biao Zhang
Abstract Partially inspired by successful applications of variational recurrent neural networks, we propose a novel variational recurrent neural machine translation (VRNMT) model in this paper. Different from the variational NMT, VRNMT introduces a series of latent random variables to model the translation procedure of a sentence in a generative way, instead of a single latent variable. Specifically, the latent random variables are included into the hidden states of the NMT decoder with elements from the variational autoencoder. In this way, these variables are recurrently generated, which enables them to further capture strong and complex dependencies among the output translations at different timesteps. In order to deal with the challenges in performing efficient posterior inference and large-scale training during the incorporation of latent variables, we build a neural posterior approximator, and equip it with a reparameterization technique to estimate the variational lower bound. Experiments on Chinese-English and English-German translation tasks demonstrate that the proposed model achieves significant improvements over both the conventional and variational NMT models.
Tasks Machine Translation
Published 2018-01-16
URL http://arxiv.org/abs/1801.05119v1
PDF http://arxiv.org/pdf/1801.05119v1.pdf
PWC https://paperswithcode.com/paper/variational-recurrent-neural-machine
Repo
Framework
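
The two ingredients the last sentences describe, reparameterized sampling and the closed-form KL term of the variational lower bound, in a PyTorch sketch (dimensions are illustrative; the full model wires these into the NMT decoder's recurrent state):

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization: z = mu + sigma * eps keeps sampling differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(-1)

# The neural posterior approximator would produce mu and logvar from the
# decoder state and target-side context; zeros are stand-ins here.
mu = torch.zeros(2, 32, requires_grad=True)
logvar = torch.zeros(2, 32)
z = sample_latent(mu, logvar)             # fed into the decoder hidden state
kl = kl_to_standard_normal(mu, logvar)    # penalty term in the lower bound
print(z.shape, kl)                        # torch.Size([2, 32]), zeros here
```

The variational lower bound is the reconstruction log-likelihood minus this KL term, estimated per timestep since the latent variables are generated recurrently.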