Paper Group ANR 482
Decoding visemes: improving machine lipreading. Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization. A Convolution Tree with Deconvolution Branches: Exploiting Geometric Relationships for Single Shot Keypoint Detection. End-To-End Face Detection and Recognition. Make Your Bone Great Again : A study on Osteoporosis Classification …
Decoding visemes: improving machine lipreading
Title | Decoding visemes: improving machine lipreading |
Authors | Helen L Bear |
Abstract | Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing & computer vision. Current challenges fall into two groups: the content of the video, such as rate of speech; or the parameters of the video recording, e.g. video resolution. We show that HD video is not needed to successfully lipread with a computer. The term “viseme” is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter. Because there are more phonemes than visemes, maps between the units show a many-to-one relationship. Many maps have been presented; we compare these, and our results show Lee’s is best. We propose a new method for constructing speaker-dependent phoneme-to-viseme maps and compare these to Lee’s. Our results show the sensitivity of phoneme clustering, and we use this new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers need training on test subjects to achieve accuracy; thus machine lipreading is highly speaker-dependent. Conversely, speaker independence means robust classification of non-training speakers. We investigate the dependence of phoneme-to-viseme maps between speakers and show there is not high variability in the visemes themselves, but there is high variability in the trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes, and that the optimal size, which varies by speaker, ranges from 11 to 35. Finally we decode from visemes back to phonemes and into words. Our novel approach uses the optimal-range visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy. |
Tasks | Lipreading, Speech Recognition |
Published | 2017-10-03 |
URL | http://arxiv.org/abs/1710.01288v1 |
http://arxiv.org/pdf/1710.01288v1.pdf | |
PWC | https://paperswithcode.com/paper/decoding-visemes-improving-machine-lipreading |
Repo | |
Framework | |
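A toy sketch of the many-to-one phoneme-to-viseme mapping described in the abstract above. The groupings below are invented for illustration; they are not Lee's map or the paper's learned speaker-dependent maps.

```python
# Illustrative many-to-one phoneme-to-viseme map (toy groupings, not the
# paper's actual maps): visually indistinguishable phonemes share a viseme.
TOY_P2V = {
    "p": "V1", "b": "V1", "m": "V1",   # bilabials look alike on the lips
    "f": "V2", "v": "V2",              # labiodentals
    "t": "V3", "d": "V3", "s": "V3",   # alveolars
    "aa": "V4", "ae": "V4",            # open vowels
}

def phonemes_to_visemes(phonemes, p2v, default="V_sp"):
    """Translate a phoneme transcription into viseme classifier labels."""
    return [p2v.get(p, default) for p in phonemes]

# 'p', 'b', and 'm' all land on the same viseme class:
print(phonemes_to_visemes(["p", "ae", "b", "m"], TOY_P2V))
# ['V1', 'V4', 'V1', 'V1']
```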
Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization
Title | Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization |
Authors | Viacheslav Khomenko, Oleg Shyshkov, Olga Radyvonenko, Kostiantyn Bokhan |
Abstract | An efficient algorithm for recurrent neural network training is presented. The approach increases the training speed for tasks where the length of the input sequence may vary significantly. The proposed approach is based on optimal batch bucketing by input sequence length and on data parallelization across multiple graphical processing units. The baseline training performance without sequence bucketing is compared with the proposed solution for different numbers of buckets. An example is given for the online handwriting recognition task using an LSTM recurrent neural network. The evaluation is performed in terms of wall clock time, number of epochs, and validation loss value. |
Tasks | |
Published | 2017-08-18 |
URL | http://arxiv.org/abs/1708.05604v1 |
http://arxiv.org/pdf/1708.05604v1.pdf | |
PWC | https://paperswithcode.com/paper/accelerating-recurrent-neural-network |
Repo | |
Framework | |
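A minimal sketch of the sequence-bucketing idea, assuming zero-padded mini-batches and made-up bucket boundaries; the paper's multi-GPU data parallelization is not shown.

```python
import random
from collections import defaultdict

def bucket_batches(sequences, boundaries, batch_size):
    """Group variable-length sequences into length buckets, then batch within
    each bucket so padding (to the batch's local maximum) stays minimal."""
    buckets = defaultdict(list)
    for seq in sequences:
        # first boundary that fits; the longest bucket catches the rest
        key = next((b for b in boundaries if len(seq) <= b), boundaries[-1])
        buckets[key].append(seq)
    batches = []
    for bucket in buckets.values():
        random.shuffle(bucket)  # reshuffle within each bucket per epoch
        for i in range(0, len(bucket), batch_size):
            batch = bucket[i:i + batch_size]
            maxlen = max(len(s) for s in batch)
            batches.append([s + [0] * (maxlen - len(s)) for s in batch])
    random.shuffle(batches)  # shuffle batch order across buckets
    return batches

seqs = [[1] * n for n in (3, 5, 7, 20, 22, 40)]
for b in bucket_batches(seqs, boundaries=(8, 24, 64), batch_size=2):
    print(len(b), "sequences padded to length", len(b[0]))
```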
A Convolution Tree with Deconvolution Branches: Exploiting Geometric Relationships for Single Shot Keypoint Detection
Title | A Convolution Tree with Deconvolution Branches: Exploiting Geometric Relationships for Single Shot Keypoint Detection |
Authors | Amit Kumar, Rama Chellappa |
Abstract | Recently, Deep Convolutional Neural Networks (DCNNs) have been applied to the task of face alignment and have shown potential for learning improved feature representations. Although deeper layers can capture abstract concepts like pose, it is difficult to capture the geometric relationships among the keypoints in DCNNs. In this paper, we propose a novel convolution-deconvolution network for facial keypoint detection. Our model predicts the 2D locations of the keypoints and their individual visibility along with 3D head pose, while exploiting the spatial relationships among different keypoints. Different from existing approaches to modeling these relationships, we propose learnable transform functions which capture the relationships between keypoints at the feature level. However, due to extensive variations in pose, not all of these relationships act at once, and hence we propose a pose-based routing function which implicitly models the active relationships. Both the transform functions and the routing function are implemented through convolutions in a multi-task framework. Our approach presents a single-shot keypoint detection method, making it different from many existing cascade regression-based methods. We also show that learning these relationships significantly improves the accuracy of keypoint detection for in-the-wild face images from challenging datasets such as AFW and AFLW. |
Tasks | Face Alignment, Keypoint Detection |
Published | 2017-04-06 |
URL | http://arxiv.org/abs/1704.01880v1 |
http://arxiv.org/pdf/1704.01880v1.pdf | |
PWC | https://paperswithcode.com/paper/a-convolution-tree-with-deconvolution |
Repo | |
Framework | |
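A rough PyTorch sketch of one "learnable transform function" gated by a pose-based routing function, with invented layer sizes; the paper's convolution tree and full routing scheme are more elaborate.

```python
import torch
import torch.nn as nn

class RoutedKeypointTransform(nn.Module):
    """Toy version of a pairwise keypoint relationship: a convolutional
    transform maps keypoint i's feature map toward keypoint j's, and a
    pose-dependent gate decides how strongly the relationship is active."""
    def __init__(self, channels, num_poses=3):
        super().__init__()
        self.transform = nn.Conv2d(channels, channels, 3, padding=1)
        self.routing = nn.Linear(num_poses, 1)  # gate from pose estimate

    def forward(self, feat_i, pose_logits):
        gate = torch.sigmoid(self.routing(pose_logits))   # (B, 1)
        msg = self.transform(feat_i)                      # (B, C, H, W)
        return gate.view(-1, 1, 1, 1) * msg               # gated message

feat = torch.randn(2, 16, 32, 32)   # features for one keypoint
pose = torch.randn(2, 3)            # coarse head-pose logits
out = RoutedKeypointTransform(16)(feat, pose)
print(out.shape)  # torch.Size([2, 16, 32, 32])
```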
End-To-End Face Detection and Recognition
Title | End-To-End Face Detection and Recognition |
Authors | Liying Chi, Hongxin Zhang, Mingxiu Chen |
Abstract | Many face detection and recognition methods have been proposed over the past decades, with impressive results. A common face recognition pipeline consists of: 1) face detection, 2) face alignment, 3) feature extraction, and 4) similarity calculation, which are separate and independent of each other. These separate face-analysis stages introduce redundant computation and make the model hard to train end-to-end. In this paper, we propose a novel end-to-end trainable convolutional network framework for face detection and recognition, in which a geometric transformation matrix is directly learned to align the faces, instead of predicting the facial landmarks. In the training stage, our single CNN model is supervised only by face bounding boxes and personal identities, which are publicly available from the WIDER FACE \cite{Yang2016} dataset and the CASIA-WebFace \cite{Yi2014} dataset. Tested on the Face Detection Dataset and Benchmark (FDDB) \cite{Jain2010} dataset and the Labeled Faces in the Wild (LFW) \cite{Huang2007} dataset, we achieve 89.24% recall for the face detection task and 98.63% verification accuracy for the face recognition task simultaneously, which are comparable to state-of-the-art results. |
Tasks | Face Alignment, Face Detection, Face Recognition |
Published | 2017-03-31 |
URL | http://arxiv.org/abs/1703.10818v1 |
http://arxiv.org/pdf/1703.10818v1.pdf | |
PWC | https://paperswithcode.com/paper/end-to-end-face-detection-and-recognition |
Repo | |
Framework | |
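The learned-alignment idea above (a geometric transformation matrix instead of landmark prediction) resembles a spatial transformer. A hedged PyTorch sketch with hypothetical dimensions, not the authors' exact network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedAligner(nn.Module):
    """Predict a 2x3 affine matrix from face features and warp the face crop
    with it -- a spatial-transformer-style stand-in for landmark alignment."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.head = nn.Linear(feat_dim, 6)
        # initialize to the identity transform so training starts stably
        nn.init.zeros_(self.head.weight)
        self.head.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, face, feat):
        theta = self.head(feat).view(-1, 2, 3)            # (B, 2, 3)
        grid = F.affine_grid(theta, face.size(), align_corners=False)
        return F.grid_sample(face, grid, align_corners=False)

faces = torch.randn(4, 3, 112, 112)   # detected face crops
feats = torch.randn(4, 64)            # shared CNN features
aligned = LearnedAligner()(faces, feats)
print(aligned.shape)  # torch.Size([4, 3, 112, 112])
```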
Make Your Bone Great Again : A study on Osteoporosis Classification
Title | Make Your Bone Great Again : A study on Osteoporosis Classification |
Authors | Rahul Paul, Saeed Alahamri, Sulav Malla, Ghulam Jilani Quadri |
Abstract | Osteoporosis can be identified by looking at 2D x-ray images of the bone. The high degree of similarity between images of a healthy bone and a diseased one makes classification a challenge. A good bone texture characterization technique is essential for identifying osteoporosis cases. Standard texture feature extraction techniques such as Local Binary Patterns (LBP) and the Gray Level Co-occurrence Matrix (GLCM) have been used for this purpose. In this paper, we draw a comparison between deep features extracted from a convolutional neural network and these traditional features. Our results show that deep features have more discriminative power, as classifiers trained on them consistently outperform those trained on traditional features. |
Tasks | |
Published | 2017-07-17 |
URL | http://arxiv.org/abs/1707.05385v1 |
http://arxiv.org/pdf/1707.05385v1.pdf | |
PWC | https://paperswithcode.com/paper/make-your-bone-great-again-a-study-on |
Repo | |
Framework | |
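A sketch of the traditional LBP and GLCM texture features the paper compares against deep features, assuming scikit-image >= 0.19 naming (graycomatrix/graycoprops) and a random stand-in patch instead of an x-ray crop:

```python
import numpy as np
from skimage.feature import local_binary_pattern, graycomatrix, graycoprops

# Toy bone-texture patch; real inputs would be x-ray crops.
patch = (np.random.rand(64, 64) * 255).astype(np.uint8)

# LBP histogram: a classic hand-crafted texture descriptor.
lbp = local_binary_pattern(patch, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

# GLCM contrast/homogeneity: second-order texture statistics.
glcm = graycomatrix(patch, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)
features = np.r_[lbp_hist,
                 graycoprops(glcm, "contrast").ravel(),
                 graycoprops(glcm, "homogeneity").ravel()]
print(features.shape)  # hand-crafted feature vector for a classifier
```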
Gromov-Hausdorff limit of Wasserstein spaces on point clouds
Title | Gromov-Hausdorff limit of Wasserstein spaces on point clouds |
Authors | Nicolas Garcia Trillos |
Abstract | We consider a point cloud $X_n := \{ x_1, \dots, x_n \}$ uniformly distributed on the flat torus $\mathbb{T}^d := \mathbb{R}^d / \mathbb{Z}^d$, and construct a geometric graph on the cloud by connecting points that are within distance $\varepsilon$ of each other. We let $\mathcal{P}(X_n)$ be the space of probability measures on $X_n$ and endow it with a discrete Wasserstein distance $W_n$ as introduced independently by Chow et al., Maas, and Mielke for general finite Markov chains. We show that as long as $\varepsilon = \varepsilon_n$ decays towards zero slower than an explicit rate depending on the level of uniformity of $X_n$, then the space $(\mathcal{P}(X_n), W_n)$ converges in the Gromov-Hausdorff sense towards the space of probability measures on $\mathbb{T}^d$ endowed with the Wasserstein distance. The analysis presented in this paper is a first step in the study of stability of evolution equations defined over random point clouds as the number of points grows to infinity. |
Tasks | |
Published | 2017-02-11 |
URL | https://arxiv.org/abs/1702.03464v3 |
https://arxiv.org/pdf/1702.03464v3.pdf | |
PWC | https://paperswithcode.com/paper/gromov-hausdorff-limit-of-wasserstein-spaces |
Repo | |
Framework | |
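A small NumPy sketch of the abstract's graph construction: sample points on the flat torus and connect pairs within distance $\varepsilon$ under the wrap-around metric. The Wasserstein analysis itself is not reproduced.

```python
import numpy as np

def torus_epsilon_graph(n=200, d=2, eps=0.15, seed=0):
    """Sample a point cloud on the flat torus [0,1)^d and connect pairs whose
    toroidal distance is below eps, as in the abstract's geometric graph."""
    rng = np.random.default_rng(seed)
    x = rng.random((n, d))
    diff = np.abs(x[:, None, :] - x[None, :, :])
    diff = np.minimum(diff, 1.0 - diff)        # wrap-around metric on T^d
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist < eps) & ~np.eye(n, dtype=bool)
    return x, adj

x, adj = torus_epsilon_graph()
print(adj.sum() // 2, "edges")
```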
Travel time tomography with adaptive dictionaries
Title | Travel time tomography with adaptive dictionaries |
Authors | Michael Bianco, Peter Gerstoft |
Abstract | We develop a 2D travel time tomography method which regularizes the inversion by modeling groups of slowness pixels from discrete slowness maps, called patches, as sparse linear combinations of atoms from a dictionary. We propose to use dictionary learning during the inversion to adapt dictionaries to specific slowness maps. This patch regularization, called the local model, is integrated into the overall slowness map, called the global model. The local model considers small-scale variations using a sparsity constraint, and the global model considers larger-scale features constrained using $\ell_2$ regularization. This strategy, realized in a locally sparse travel time tomography (LST) approach, enables simultaneous modeling of smooth and discontinuous slowness features. This is in contrast to conventional tomography methods, which constrain models to be exclusively smooth or discontinuous. We develop a $\textit{maximum a posteriori}$ formulation for LST and exploit the sparsity of slowness patches using dictionary learning. The LST approach compares favorably with smoothness and total variation regularization methods on densely but irregularly sampled synthetic slowness maps. |
Tasks | Dictionary Learning |
Published | 2017-12-16 |
URL | http://arxiv.org/abs/1712.08655v3 |
http://arxiv.org/pdf/1712.08655v3.pdf | |
PWC | https://paperswithcode.com/paper/travel-time-tomography-with-adaptive |
Repo | |
Framework | |
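A hedged sketch of the patch-level "local model" using off-the-shelf dictionary learning, assuming a recent scikit-learn and random stand-in patches; the paper learns its dictionaries inside the tomographic inversion rather than on raw data like this.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy stand-in for slowness-map patches: each row is a flattened 8x8 patch.
rng = np.random.default_rng(0)
patches = rng.normal(size=(500, 64))

# Learn an overcomplete dictionary; each patch is approximated as a sparse
# combination of atoms, mirroring the paper's local-model regularization.
dico = DictionaryLearning(n_components=96, transform_algorithm="omp",
                          transform_n_nonzero_coefs=5, max_iter=10,
                          random_state=0)
codes = dico.fit_transform(patches)
recon = codes @ dico.components_           # sparse reconstruction of patches
print(recon.shape, (codes != 0).sum(axis=1).mean())  # ~5 active atoms/patch
```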
GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB
Title | GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB |
Authors | Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, Christian Theobalt |
Abstract | We address the highly challenging problem of real-time 3D hand tracking based on a monocular RGB-only sequence. Our tracking method combines a convolutional neural network with a kinematic 3D hand model, such that it generalizes well to unseen data, is robust to occlusions and varying camera viewpoints, and leads to anatomically plausible as well as temporally smooth hand motions. For training our CNN we propose a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network. To be more specific, we use a neural network that translates synthetic images to “real” images, such that the so-generated images follow the same statistical distribution as real-world hand images. For training this translation network we combine an adversarial loss and a cycle-consistency loss with a geometric consistency loss in order to preserve geometric properties (such as hand pose) during translation. We demonstrate that our hand tracking system outperforms the current state-of-the-art on challenging RGB-only footage. |
Tasks | Image-to-Image Translation |
Published | 2017-12-04 |
URL | http://arxiv.org/abs/1712.01057v1 |
http://arxiv.org/pdf/1712.01057v1.pdf | |
PWC | https://paperswithcode.com/paper/ganerated-hands-for-real-time-3d-hand |
Repo | |
Framework | |
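A toy sketch of the loss mix the abstract describes for the synthetic-to-real translation network: adversarial + cycle-consistency + geometric-consistency terms. All networks, sizes, and weights below are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder one-layer "networks"; the paper's translators, discriminator,
# and geometry network are deep CNNs trained jointly.
G_s2r = nn.Conv2d(3, 3, 3, padding=1)   # synthetic -> "real" translator
G_r2s = nn.Conv2d(3, 3, 3, padding=1)   # "real" -> synthetic translator
D = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.AdaptiveAvgPool2d(1))
geo_net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                        nn.Linear(3, 21 * 2))  # 21 hypothetical 2D joints

synth = torch.randn(2, 3, 64, 64)        # synthetic hand images
pose = torch.randn(2, 21 * 2)            # their known ground-truth poses

fake = G_s2r(synth)
logits = D(fake).flatten(1)
adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
cyc = F.l1_loss(G_r2s(fake), synth)      # cycle-consistency loss
geo = F.l1_loss(geo_net(fake), pose)     # geometric consistency: pose survives
loss = adv + 10.0 * cyc + 1.0 * geo      # illustrative weights
print(float(loss))
```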
Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control
Title | Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control |
Authors | Shangtong Zhang, Osmar R. Zaiane |
Abstract | Reinforcement Learning and Evolution Strategies are two major approaches to addressing complicated control problems. Both are strong contenders and have their own devotee communities. Both groups have been very active in developing new advances in their own domain and devising, in recent years, leading-edge techniques to address complex continuous control tasks. Here, in the context of Deep Reinforcement Learning, we formulate a parallelized version of the Proximal Policy Optimization method and a Deep Deterministic Policy Gradient method. Moreover, we conduct a thorough comparison between the state-of-the-art techniques in both camps for continuous control: evolutionary methods and Deep Reinforcement Learning methods. The results show there is no consistent winner. |
Tasks | Continuous Control |
Published | 2017-11-30 |
URL | http://arxiv.org/abs/1712.00006v2 |
http://arxiv.org/pdf/1712.00006v2.pdf | |
PWC | https://paperswithcode.com/paper/comparing-deep-reinforcement-learning-and |
Repo | |
Framework | |
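On the evolutionary side of the comparison, a minimal (OpenAI-style) evolution strategy on a toy objective; the PPO/DDPG side and the paper's actual benchmarks are omitted.

```python
import numpy as np

def evolution_strategy(f, theta, sigma=0.1, lr=0.02, pop=50, iters=200):
    """Basic evolution strategy: estimate a search gradient from the fitness
    of Gaussian perturbations of the parameters, then step along it."""
    rng = np.random.default_rng(0)
    for _ in range(iters):
        eps = rng.normal(size=(pop, theta.size))        # perturbations
        fitness = np.array([f(theta + sigma * e) for e in eps])
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        theta = theta + lr / (pop * sigma) * eps.T @ fitness
    return theta

# Toy "continuous control" objective: maximize -||theta - target||^2.
target = np.array([1.0, -2.0, 0.5])
f = lambda th: -np.sum((th - target) ** 2)
print(evolution_strategy(f, np.zeros(3)))  # approaches the target
```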
Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation
Title | Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation |
Authors | Haoshu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, Song-Chun Zhu |
Abstract | In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes 2D poses as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of Bi-directional RNNs (BRNNs) on top to explicitly incorporate a set of knowledge regarding human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. In learning, we develop a pose sample simulator to augment training samples in virtual camera views, which further improves our model's generalizability. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol working in a cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty under such a setting, while our method handles such challenges well. |
Tasks | 3D Human Pose Estimation, 3D Pose Estimation, Pose Estimation |
Published | 2017-10-17 |
URL | http://arxiv.org/abs/1710.06513v6 |
http://arxiv.org/pdf/1710.06513v6.pdf | |
PWC | https://paperswithcode.com/paper/learning-pose-grammar-to-encode-human-body |
Repo | |
Framework | |
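A hedged PyTorch sketch of one "grammar" as a bidirectional RNN run along an ordered joint chain; the joint indices and feature sizes are invented, and the paper's full hierarchy of grammars is not reproduced.

```python
import torch
import torch.nn as nn

class PoseGrammarBlock(nn.Module):
    """Toy stand-in for one grammar: a bidirectional RNN run along an ordered
    chain of joints (e.g. a kinematic chain), refining per-joint features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.brnn = nn.GRU(feat_dim, feat_dim // 2, bidirectional=True,
                           batch_first=True)

    def forward(self, joint_feats, chain):
        # reorder features along the chain, run the BRNN, scatter back
        out, _ = self.brnn(joint_feats[:, chain])
        refined = joint_feats.clone()
        refined[:, chain] = out
        return refined

feats = torch.randn(2, 16, 64)    # 16 joints, 64-d features each
right_arm = [8, 9, 10, 11]        # illustrative joint indices for one chain
print(PoseGrammarBlock()(feats, right_arm).shape)  # torch.Size([2, 16, 64])
```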
End-to-end Global to Local CNN Learning for Hand Pose Recovery in Depth Data
Title | End-to-end Global to Local CNN Learning for Hand Pose Recovery in Depth Data |
Authors | Meysam Madadi, Sergio Escalera, Xavier Baro, Jordi Gonzalez |
Abstract | Despite recent advances in 3D pose estimation of human hands, especially thanks to the advent of CNNs and depth cameras, this task is still far from being solved. This is mainly due to the highly non-linear dynamics of fingers, which make hand model training a challenging task. In this paper, we exploit a novel hierarchical tree-like structured CNN, in which branches are trained to become specialized in predefined subsets of hand joints, called local poses. We further fuse local pose features, extracted from the hierarchical CNN branches, to learn higher-order dependencies among joints in the final pose by end-to-end training. The loss function is also defined to incorporate appearance and physical constraints on feasible hand motion and deformation. Finally, we introduce a non-rigid data augmentation approach to increase the amount of training depth data. Experimental results suggest that feeding a tree-shaped CNN, specialized in local poses, into a fusion network for modeling joint correlations and dependencies helps to increase the precision of the final estimations, outperforming state-of-the-art results on the NYU and SyntheticHand datasets. |
Tasks | 3D Pose Estimation, Data Augmentation, Pose Estimation |
Published | 2017-05-26 |
URL | http://arxiv.org/abs/1705.09606v2 |
http://arxiv.org/pdf/1705.09606v2.pdf | |
PWC | https://paperswithcode.com/paper/end-to-end-global-to-local-cnn-learning-for |
Repo | |
Framework | |
Beyond opening up the black box: Investigating the role of algorithmic systems in Wikipedian organizational culture
Title | Beyond opening up the black box: Investigating the role of algorithmic systems in Wikipedian organizational culture |
Authors | R. Stuart Geiger |
Abstract | Scholars and practitioners across domains are increasingly concerned with algorithmic transparency and opacity, interrogating the values and assumptions embedded in automated, black-boxed systems, particularly in user-generated content platforms. I report from an ethnography of infrastructure in Wikipedia to discuss an often understudied aspect of this topic: the local, contextual, learned expertise involved in participating in a highly automated socio-technical environment. Today, the organizational culture of Wikipedia is deeply intertwined with various data-driven algorithmic systems, which Wikipedians rely on to help manage and govern the “anyone can edit” encyclopedia at a massive scale. These bots, scripts, tools, plugins, and dashboards make Wikipedia more efficient for those who know how to work with them, but like all organizational culture, newcomers must learn them if they want to fully participate. I illustrate how cultural and organizational expertise is enacted around algorithmic agents by discussing two autoethnographic vignettes, which relate my personal experience as a Wikipedia veteran. I present thick descriptions of how governance and gatekeeping practices are articulated through and in alignment with these automated infrastructures. Over the past 15 years, Wikipedian veterans and administrators have made specific decisions to support administrative and editorial workflows with automation in particular ways and not others. I use these cases of Wikipedia’s bot-supported bureaucracy to discuss several issues in the fields of critical algorithms studies, critical data studies, and fairness, accountability, and transparency in machine learning – most principally arguing that scholarship and practice must go beyond trying to “open up the black box” of such systems and also examine sociocultural processes like newcomer socialization. |
Tasks | |
Published | 2017-09-26 |
URL | http://arxiv.org/abs/1709.09093v2 |
http://arxiv.org/pdf/1709.09093v2.pdf | |
PWC | https://paperswithcode.com/paper/beyond-opening-up-the-black-box-investigating |
Repo | |
Framework | |
Exploring the Combination Rules of D Numbers From a Perspective of Conflict Redistribution
Title | Exploring the Combination Rules of D Numbers From a Perspective of Conflict Redistribution |
Authors | Xinyang Deng, Wen Jiang |
Abstract | Dempster-Shafer theory of evidence is widely applied to uncertainty modelling and knowledge reasoning because of its advantages in dealing with uncertain information. But some conditions or requirements, such as the exclusiveness hypothesis and the completeness constraint, limit the development and application of that theory to a large extent. To overcome these shortcomings and enhance its capability of representing uncertainty, a novel model, called D numbers, has been proposed recently. However, many key issues, for example how to implement the combination of D numbers, remain unsolved. In this paper, we explore the combination of D numbers from a perspective of conflict redistribution, and propose two combination rules, suitable for different situations, for the fusion of two D numbers. The proposed combination rules reduce to the classical Dempster’s rule in Dempster-Shafer theory under certain conditions. Numerical examples and a discussion of the proposed rules are also given in the paper. |
Tasks | |
Published | 2017-03-15 |
URL | http://arxiv.org/abs/1703.04862v1 |
http://arxiv.org/pdf/1703.04862v1.pdf | |
PWC | https://paperswithcode.com/paper/exploring-the-combination-rules-of-d-numbers |
Repo | |
Framework | |
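Since the proposed rules reduce to Dempster's rule under certain conditions, a compact sketch of that classical baseline (mass functions over frozenset focal elements; the conflict mass K is redistributed by normalization). The paper's D-number rules themselves are not reproduced.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Classical Dempster's rule: multiply masses of all focal-element pairs,
    assign each product to the pair's intersection, and renormalize by the
    non-conflicting mass (1 - K)."""
    combined, conflict = {}, 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + pa * pb
        else:
            conflict += pa * pb   # empty intersection: conflicting evidence
    if conflict >= 1.0:
        raise ValueError("total conflict: masses cannot be combined")
    return {a: p / (1.0 - conflict) for a, p in combined.items()}

A, B = frozenset("a"), frozenset("b")
m1 = {A: 0.7, A | B: 0.3}
m2 = {B: 0.6, A | B: 0.4}
print(dempster_combine(m1, m2))
```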
Visual gesture variability between talkers in continuous visual speech
Title | Visual gesture variability between talkers in continuous visual speech |
Authors | Helen L Bear |
Abstract | The recent adoption of deep learning methods in machine lipreading research gives us two options for improving system performance: either we develop end-to-end systems holistically, or we experiment to further our understanding of the visual speech signal. The latter option is more difficult, but this knowledge would enable researchers both to improve systems and to apply the new knowledge to other domains such as speech therapy. One challenge in lipreading systems is the correct labeling of the classifiers. These labels map an estimated function between visemes on the lips and the phonemes uttered. Here we ask whether such maps are speaker-dependent. Prior work investigated isolated word recognition from speaker-dependent (SD) visemes; we extend this to continuous speech. Benchmarked against SD results and isolated-word performance, we test with RMAV dataset speakers and observe that with continuous speech, the trajectory between visemes has a greater negative effect on speaker differentiation. |
Tasks | Lipreading |
Published | 2017-10-03 |
URL | http://arxiv.org/abs/1710.01297v1 |
http://arxiv.org/pdf/1710.01297v1.pdf | |
PWC | https://paperswithcode.com/paper/visual-gesture-variability-between-talkers-in |
Repo | |
Framework | |
Bootstrapped synthetic likelihood
Title | Bootstrapped synthetic likelihood |
Authors | Richard G. Everitt |
Abstract | Approximate Bayesian computation (ABC) and synthetic likelihood (SL) techniques have enabled the use of Bayesian inference for models that may be simulated, but for which the likelihood cannot be evaluated pointwise at values of an unknown parameter $\theta$. The main idea in ABC and SL is to, for different values of $\theta$ (usually chosen using a Monte Carlo algorithm), build estimates of the likelihood based on simulations from the model conditional on $\theta$. The quality of these estimates determines the efficiency of an ABC/SL algorithm. In standard ABC/SL, the only means to improve an estimated likelihood at $\theta$ is to simulate more times from the model conditional on $\theta$, which is infeasible in cases where the simulator is computationally expensive. In this paper we describe how to use bootstrapping as a means for improving SL estimates whilst using fewer simulations from the model, and also investigate its use in ABC. Further, we investigate the use of the bag of little bootstraps as a means for applying this approach to large datasets, yielding Monte Carlo algorithms that accurately approximate posterior distributions whilst only simulating subsamples of the full data. Examples of the approach applied to i.i.d., temporal and spatial data are given. |
Tasks | Bayesian Inference |
Published | 2017-11-15 |
URL | http://arxiv.org/abs/1711.05825v2 |
http://arxiv.org/pdf/1711.05825v2.pdf | |
PWC | https://paperswithcode.com/paper/bootstrapped-synthetic-likelihood |
Repo | |
Framework | |
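A minimal sketch of the basic synthetic likelihood estimator the paper builds on, with a toy simulator and summary statistics; the bootstrapped and bag-of-little-bootstraps variants are the paper's contribution and are not reproduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def synthetic_loglik(theta, simulate, summarize, s_obs, n_sims=100, seed=0):
    """Basic synthetic likelihood: simulate n_sims datasets at theta, fit a
    Gaussian to their summary statistics, and evaluate the observed summaries
    under it. Bootstrapping the simulations to sharpen this estimate with
    fewer model runs (the paper's idea) is omitted."""
    rng = np.random.default_rng(seed)
    S = np.array([summarize(simulate(theta, rng)) for _ in range(n_sims)])
    mu, cov = S.mean(axis=0), np.cov(S, rowvar=False)
    return multivariate_normal.logpdf(s_obs, mean=mu, cov=cov)

# Toy model: i.i.d. normal data; summaries are (mean, log sd).
simulate = lambda th, rng: rng.normal(th, 1.0, size=200)
summarize = lambda y: np.array([y.mean(), np.log(y.std())])
s_obs = summarize(np.random.default_rng(1).normal(0.3, 1.0, size=200))
for th in (0.0, 0.3, 1.0):
    print(th, synthetic_loglik(th, simulate, summarize, s_obs))
# the estimated log-likelihood peaks near the data-generating theta = 0.3
```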