January 31, 2020

3393 words · 16 min read

Paper Group ANR 63


NIESR: Nuisance Invariant End-to-end Speech Recognition

Title NIESR: Nuisance Invariant End-to-end Speech Recognition
Authors I-Hung Hsu, Ayush Jaiswal, Premkumar Natarajan
Abstract Deep neural network models for speech recognition have achieved great success recently, but they can learn incorrect associations between the target and nuisance factors of speech (e.g., speaker identities, background noise, etc.), which can lead to overfitting. Several methods have been proposed to tackle this problem, but they incorporate additional information about nuisance factors during training to develop invariant models. However, enumerating all possible nuisance factors in speech data and collecting their annotations is difficult and expensive. We present a robust training scheme for end-to-end speech recognition that adopts an unsupervised adversarial invariance induction framework to separate essential factors for speech recognition from nuisances without using any supplementary labels besides the transcriptions. Experiments show that the speech recognition model trained with the proposed training scheme achieves relative improvements of 5.48% on WSJ0, 6.16% on CHiME3, and 6.61% on the TIMIT dataset over the base model. Additionally, the proposed method achieves a relative improvement of 14.44% on the combined WSJ0+CHiME3 dataset.
Tasks End-To-End Speech Recognition, Speech Recognition
Published 2019-07-07
URL https://arxiv.org/abs/1907.03233v1
PDF https://arxiv.org/pdf/1907.03233v1.pdf
PWC https://paperswithcode.com/paper/niesr-nuisance-invariant-end-to-end-speech
Repo
Framework
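The adversarial invariance scheme described in the NIESR abstract above splits the encoding into a task branch and a nuisance branch and trains adversarial predictors to keep the two independent. A minimal PyTorch sketch of that split and the disentanglement loss follows; the module types, sizes, and names are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitEncoder(nn.Module):
    """Encode speech features into a task split e1 and a nuisance split e2."""
    def __init__(self, feat_dim=80, hid=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hid, batch_first=True)
        self.to_e1 = nn.Linear(hid, hid)  # factors the recognizer may use
        self.to_e2 = nn.Linear(hid, hid)  # nuisances (speaker, noise, ...)

    def forward(self, x):                 # x: (B, T, feat_dim)
        h, _ = self.rnn(x)
        return self.to_e1(h), self.to_e2(h)

# Adversaries try to predict each split from the other. They minimize this
# loss; the encoder maximizes it, pushing e1 and e2 toward independence.
dis_1to2 = nn.Linear(256, 256)
dis_2to1 = nn.Linear(256, 256)

def disentanglement_loss(e1, e2):
    return F.mse_loss(dis_1to2(e1), e2) + F.mse_loss(dis_2to1(e2), e1)
```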

Accelerated CNN Training Through Gradient Approximation

Title Accelerated CNN Training Through Gradient Approximation
Authors Ziheng Wang, Sree Harsha Nelaturu
Abstract Training deep convolutional neural networks such as VGG and ResNet by gradient descent is an expensive exercise requiring specialized hardware such as GPUs. Recent works have examined the possibility of approximating the gradient computation while maintaining the same convergence properties. While promising, the approximations only work on relatively small datasets such as MNIST. They also fail to achieve real wall-clock speedups due to the lack of efficient GPU implementations of the proposed approximation methods. In this work, we explore three alternative methods to approximate gradients, with an efficient GPU kernel implementation for one of them. We achieve wall-clock speedups of upwards of 7% with ResNet-20 and VGG-19 on the CIFAR-10 dataset, with minimal loss in validation accuracy.
Tasks
Published 2019-08-15
URL https://arxiv.org/abs/1908.05460v1
PDF https://arxiv.org/pdf/1908.05460v1.pdf
PWC https://paperswithcode.com/paper/accelerated-cnn-training-through-gradient
Repo
Framework
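The abstract does not spell out the three approximation methods, so the sketch below shows one generic way to cheapen a convolution's backward pass: compute the weight gradient on a random quarter of the batch and rescale it. This is an assumed stand-in for the idea of trading gradient fidelity for wall-clock time, not the paper's approximations or its GPU kernel.

```python
import torch
import torch.nn.functional as F

class SubsampledConvGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return F.conv2d(x, weight, padding=1)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # Exact gradient w.r.t. the input, so backprop continues normally.
        grad_x = torch.nn.grad.conv2d_input(x.shape, weight, grad_out, padding=1)
        # Approximate weight gradient from a random 25% of the batch, rescaled.
        n = max(1, x.shape[0] // 4)
        idx = torch.randperm(x.shape[0], device=x.device)[:n]
        grad_w = torch.nn.grad.conv2d_weight(
            x[idx], weight.shape, grad_out[idx], padding=1) * (x.shape[0] / n)
        return grad_x, grad_w
```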

SliderGAN: Synthesizing Expressive Face Images by Sliding 3D Blendshape Parameters

Title SliderGAN: Synthesizing Expressive Face Images by Sliding 3D Blendshape Parameters
Authors Evangelos Ververas, Stefanos Zafeiriou
Abstract Image-to-image (i2i) translation is the dense regression problem of learning how to transform an input image into an output using aligned image pairs. Remarkable progress has been made in i2i translation with the advent of Deep Convolutional Neural Networks (DCNNs), particularly using the learning paradigm of Generative Adversarial Networks (GANs). In the absence of paired images, i2i translation is tackled with one or multiple domain transformations (e.g., CycleGAN, StarGAN, etc.). In this paper, we study a new problem: image-to-image translation under a set of continuous parameters that correspond to a model describing a physical process. In particular, we propose SliderGAN, which transforms an input face image into a new one according to the continuous values of a statistical blendshape model of facial motion. We show that it is possible to edit a facial image according to expression and speech blendshapes, using sliders that control the continuous values of the blendshape model. This provides much more flexibility in various tasks, including but not limited to face editing, expression transfer and face neutralisation, compared to models based on discrete expressions or action units.
Tasks Image-to-Image Translation
Published 2019-08-26
URL https://arxiv.org/abs/1908.09638v1
PDF https://arxiv.org/pdf/1908.09638v1.pdf
PWC https://paperswithcode.com/paper/slidergan-synthesizing-expressive-face-images
Repo
Framework
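A common way to condition a generator on continuous parameters, matching the slider idea above, is to tile the blendshape vector spatially and concatenate it with the image channels. The sketch below assumes that tiling scheme; the two-layer stack is a placeholder, not SliderGAN's actual generator.

```python
import torch
import torch.nn as nn

class ContinuousCondGenerator(nn.Module):
    def __init__(self, n_params=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + n_params, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh())

    def forward(self, img, params):
        # img: (B, 3, H, W); params: (B, n_params) continuous slider values
        b, _, h, w = img.shape
        p = params.view(b, -1, 1, 1).expand(-1, -1, h, w)  # tile spatially
        return self.net(torch.cat([img, p], dim=1))
```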

End-to-End Speech Recognition with High-Frame-Rate Features Extraction

Title End-to-End Speech Recognition with High-Frame-Rate Features Extraction
Authors Cong-Thanh Do
Abstract State-of-the-art end-to-end automatic speech recognition (ASR) extracts acoustic features from the input speech signal every 10 ms, which corresponds to a frame rate of 100 frames/second. In this report, we investigate the use of high-frame-rate features extraction in end-to-end ASR. High frame rates of 200 and 400 frames/second are used in the features extraction and provide additional information for end-to-end ASR. The effectiveness of high-frame-rate features extraction is evaluated independently and in combination with speed-perturbation-based data augmentation. Experiments performed on two speech corpora, Wall Street Journal (WSJ) and CHiME-5, show that using high-frame-rate features extraction yields improved performance for end-to-end ASR, both independently and in combination with speed perturbation. On the WSJ corpus, the relative reductions of word error rate (WER) yielded by high-frame-rate features extraction independently and in combination with speed perturbation are up to 21.3% and 24.1%, respectively. On the CHiME-5 corpus, the corresponding relative WER reductions are up to 2.8% and 7.9%, respectively, on the test data recorded by microphone arrays, and up to 11.8% and 21.2%, respectively, on the test data recorded by binaural microphones.
Tasks Data Augmentation, End-To-End Speech Recognition, Speech Recognition
Published 2019-07-03
URL https://arxiv.org/abs/1907.01957v2
PDF https://arxiv.org/pdf/1907.01957v2.pdf
PWC https://paperswithcode.com/paper/end-to-end-speech-recognition-with-high-frame
Repo
Framework
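The frame rate is set entirely by the hop between analysis windows, so moving from 100 to 200 or 400 frames/second is a one-parameter change in feature extraction. A sketch using torchaudio (an assumed toolchain; the report does not name one), at a 16 kHz sampling rate where a 10 ms hop equals 160 samples:

```python
import torchaudio

SR = 16000  # assumed sampling rate: 160 samples = 10 ms
for fps, hop in [(100, 160), (200, 80), (400, 40)]:
    fbank = torchaudio.transforms.MelSpectrogram(
        sample_rate=SR, n_fft=400, hop_length=hop, n_mels=80)
    # feats = fbank(waveform)  # -> roughly fps frames per second of audio
```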

Improving Performance of End-to-End ASR on Numeric Sequences

Title Improving Performance of End-to-End ASR on Numeric Sequences
Authors Cal Peyser, Hao Zhang, Tara N. Sainath, Zelin Wu
Abstract Recognizing written-domain numeric utterances (e.g., "I need $1.25.") can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken-domain utterances (e.g., "I need one dollar and twenty five cents."), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low-memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the acoustic model (AM), pronunciation model (PM) and language model (LM) of a conventional system into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken-domain training more difficult. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional numeric training data, as well as using a small-footprint neural network to perform spoken-to-written domain denorming, yields improvement in several numeric classes. In the case of the longest numeric sequences, we see a reduction of WER by up to a factor of 8.
Tasks End-To-End Speech Recognition, Speech Recognition
Published 2019-07-01
URL https://arxiv.org/abs/1907.01372v1
PDF https://arxiv.org/pdf/1907.01372v1.pdf
PWC https://paperswithcode.com/paper/improving-performance-of-end-to-end-asr-on
Repo
Framework
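To make spoken-to-written denorming concrete, here is a toy, hand-written mapping for the abstract's own example. The paper learns this kind of mapping with a small-footprint neural network rather than rules or a large FST; everything below is purely didactic.

```python
UNITS = {"one": 1, "two": 2, "five": 5}   # toy vocabulary only
TENS = {"twenty": 20, "thirty": 30}

def toy_denorm(spoken: str) -> str:
    # "one dollar and twenty five cents" -> "$1.25"
    w = spoken.split()
    dollars = UNITS[w[0]]
    cents = TENS[w[3]] + UNITS[w[4]]
    return f"${dollars}.{cents:02d}"

assert toy_denorm("one dollar and twenty five cents") == "$1.25"
```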

Linearized two-layers neural networks in high dimension

Title Linearized two-layers neural networks in high dimension
Authors Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari
Abstract We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples ${(y_i,{\boldsymbol x}_i)}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layer neural networks around a random initialization: the random features model of Rahimi and Recht (RF), and the neural tangent kernel model of Jacot, Gabriel, and Hongler (NT). Both of these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample-size-limited regime, in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N\le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples satisfies $d^{\ell + \delta} \le n \le d^{\ell +1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.
Tasks
Published 2019-04-27
URL https://arxiv.org/abs/1904.12191v3
PDF https://arxiv.org/pdf/1904.12191v3.pdf
PWC https://paperswithcode.com/paper/linearized-two-layers-neural-networks-in-high
Repo
Framework
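For reference, here is the standard explicit form of the two linearized classes (the paper's exact normalization may differ): with first-layer weights ${\boldsymbol w}_i$ drawn at random and fixed, RF trains the scalar coefficients $a_i$, while NT trains the tangent-direction coefficients ${\boldsymbol a}_i$:

$$
f_{\mathrm{RF}}({\boldsymbol x}) = \sum_{i=1}^{N} a_i \, \sigma({\boldsymbol w}_i^{\top}{\boldsymbol x}),
\qquad
f_{\mathrm{NT}}({\boldsymbol x}) = \sum_{i=1}^{N} {\boldsymbol a}_i^{\top}{\boldsymbol x} \; \sigma'({\boldsymbol w}_i^{\top}{\boldsymbol x}).
$$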

End-to-End ASR for Code-switched Hindi-English Speech

Title End-to-End ASR for Code-switched Hindi-English Speech
Authors Brij Mohan Lal Srivastava, Basil Abraham, Sunayana Sitaram, Rupesh Mehta, Preethi Jyothi
Abstract End-to-end (E2E) models have been explored for large speech corpora and have been found to match or outperform traditional pipeline-based systems in some languages. However, most prior work on end-to-end models uses speech corpora exceeding hundreds or thousands of hours. In this study, we explore end-to-end models for code-switched Hindi-English speech with less than 50 hours of data. We utilize two specific measures to improve network performance in the low-resource setting, namely multi-task learning (MTL) and balancing the corpus to deal with the inherent class-imbalance problem, i.e., the skewed frequency distribution over graphemes. We compare the results of the proposed approaches with traditional, cascaded ASR systems. While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and corpus balancing.
Tasks End-To-End Speech Recognition, Multi-Task Learning
Published 2019-06-22
URL https://arxiv.org/abs/1906.09426v1
PDF https://arxiv.org/pdf/1906.09426v1.pdf
PWC https://paperswithcode.com/paper/end-to-end-asr-for-code-switched-hindi
Repo
Framework
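The abstract does not state which tasks the MTL setup combines, so the sketch below shows the hybrid CTC/attention objective that is the usual multi-task recipe for low-resource E2E ASR; the loss weight and tensor shapes are assumptions.

```python
import torch.nn.functional as F

def mtl_loss(ctc_logits, attn_logits, targets, input_lens, target_lens, lam=0.3):
    # ctc_logits: (B, T, V); attn_logits: (B, S, V); targets: (B, S)
    ctc = F.ctc_loss(ctc_logits.log_softmax(-1).transpose(0, 1),
                     targets, input_lens, target_lens, blank=0)
    att = F.cross_entropy(attn_logits.flatten(0, 1), targets.flatten())
    return lam * ctc + (1.0 - lam) * att
```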

Multi-Stream End-to-End Speech Recognition

Title Multi-Stream End-to-End Speech Recognition
Authors Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Shinji Watanabe, Takaaki Hori, Hynek Hermansky
Abstract Attention-based methods and the Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR, with parallel streams represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, a Hierarchical Attention Network (HAN) is introduced to steer the decoder toward the most informative encoders. A separate CTC network is assigned to each stream to force monotonic alignments. Two representative frameworks are proposed and discussed: the Multi-Encoder Multi-Resolution (MEM-Res) framework and the Multi-Encoder Multi-Array (MEM-Array) framework. In the MEM-Res framework, two heterogeneous encoders with different architectures, temporal resolutions and separate CTC networks work in parallel to extract complementary information from the same acoustics. Experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate (WER) reductions of 18.0-32.1% and a best WER of 3.6% on the WSJ eval92 test set. The MEM-Array framework aims at improving far-field ASR robustness using multiple microphone arrays, each handled by a separate encoder. Compared with the best single-array results, the proposed framework achieves relative WER reductions of 3.7% and 9.7% on the AMI and DIRHA multi-array corpora, respectively, and also outperforms conventional fusion strategies.
Tasks End-To-End Speech Recognition, Speech Recognition
Published 2019-06-17
URL https://arxiv.org/abs/1906.08041v2
PDF https://arxiv.org/pdf/1906.08041v2.pdf
PWC https://paperswithcode.com/paper/multi-stream-end-to-end-speech-recognition
Repo
Framework
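A minimal sketch of the stream-level (hierarchical) attention idea: a learned score re-weights each encoder's context vector before the decoder consumes the fused result. Dimensions and module names are placeholders, not the paper's exact HAN.

```python
import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Stream-level attention over per-encoder context vectors."""
    def __init__(self, d=256):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, contexts):
        # contexts: list of per-stream context vectors, each (B, d)
        c = torch.stack(contexts, dim=1)         # (B, n_streams, d)
        w = torch.softmax(self.score(c), dim=1)  # which stream to trust, per step
        return (w * c).sum(dim=1)                # fused context, (B, d)
```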

Binary Sine Cosine Algorithms for Feature Selection from Medical Data

Title Binary Sine Cosine Algorithms for Feature Selection from Medical Data
Authors Shokooh Taghian, Mohammad H. Nadimi-Shahraki
Abstract A well-constructed classification model depends heavily on the input feature subset from a dataset, which may contain redundant, irrelevant, or noisy features. This challenge can be worse when dealing with medical datasets. The main aim of feature selection as a pre-processing task is to eliminate these features and select the most effective ones. In the literature, metaheuristic algorithms have shown successful performance in finding optimal feature subsets. In this paper, two binary metaheuristic algorithms, the S-shaped binary Sine Cosine Algorithm (SBSCA) and the V-shaped binary Sine Cosine Algorithm (VBSCA), are proposed for feature selection from medical data. In these algorithms, the search space remains continuous, while a binary position vector is generated for each solution by two transfer functions, S-shaped and V-shaped. The proposed algorithms are compared with four recent binary optimization algorithms on five medical datasets from the UCI repository. The experimental results confirm that both binary SCA variants enhance classification accuracy on these medical datasets compared to the four other algorithms.
Tasks Feature Selection
Published 2019-11-15
URL https://arxiv.org/abs/1911.07805v1
PDF https://arxiv.org/pdf/1911.07805v1.pdf
PWC https://paperswithcode.com/paper/binary-sine-cosine-algorithms-for-feature
Repo
Framework
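The S-shaped and V-shaped transfer functions follow the usual binary-metaheuristic convention: they map a continuous position to a probability, which is then thresholded (S-shaped) or used as a bit-flip probability (V-shaped). A sketch under that convention:

```python
import numpy as np

def s_shaped(x):
    return 1.0 / (1.0 + np.exp(-x))  # sigmoid: position -> P(bit = 1)

def v_shaped(x):
    return np.abs(np.tanh(x))        # position -> P(flip the bit)

def binarize(position, prev_bits, kind="S", rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    r = rng.random(position.shape)
    if kind == "S":                   # SBSCA-style thresholding
        return (r < s_shaped(position)).astype(int)
    flip = r < v_shaped(position)     # VBSCA-style bit flipping
    return np.where(flip, 1 - prev_bits, prev_bits)
```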

Predicting Indian stock market using the psycho-linguistic features of financial news

Title Predicting Indian stock market using the psycho-linguistic features of financial news
Authors B. Shravan Kumar, Vadlamani Ravi, Rishabh Miglani
Abstract Financial forecasting using news articles is an emerging field. In this paper, we propose hybrid intelligent models for stock market prediction using psycho-linguistic variables (LIWC and TAALES) extracted from news articles as predictor variables. For prediction, we employed various intelligent techniques such as the Multilayer Perceptron (MLP), Group Method of Data Handling (GMDH), General Regression Neural Network (GRNN), Random Forest (RF), Quantile Regression Random Forest (QRRF), Classification And Regression Tree (CART) and Support Vector Regression (SVR). We experimented on data for the stocks of 12 companies listed on the Bombay Stock Exchange (BSE). We employed chi-squared and maximum relevance and minimum redundancy (MRMR) feature selection techniques on the psycho-linguistic features obtained from the news articles. After extensive experimentation, using the Diebold-Mariano test, we conclude that GMDH and GRNN are statistically the best techniques, in that order, with respect to the MAPE and NRMSE values.
Tasks Feature Selection, Stock Market Prediction
Published 2019-11-07
URL https://arxiv.org/abs/1911.06193v1
PDF https://arxiv.org/pdf/1911.06193v1.pdf
PWC https://paperswithcode.com/paper/predicting-indian-stock-market-using-the
Repo
Framework
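The chi-squared selection step can be reproduced with scikit-learn as below; MRMR needs a separate package (e.g., pymrmr) and is omitted. `X` (the LIWC/TAALES feature matrix), `y` (the target labels), and the `k=50` cutoff are placeholder assumptions.

```python
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=50)      # keep the 50 highest-scoring features
# X_sel = selector.fit_transform(X, y)  # chi2 requires non-negative features
```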

Skin cancer detection based on deep learning and entropy to detect outlier samples

Title Skin cancer detection based on deep learning and entropy to detect outlier samples
Authors Andre G. C. Pacheco, Abder-Rahman Ali, Thomas Trappenberg
Abstract We describe the methods that achieved the 3rd and 4th places in tasks 1 and 2, respectively, at the ISIC challenge 2019. The goal of this challenge is to provide a diagnosis of skin cancer from images and meta-data. There are nine classes in the dataset; however, one of them is an outlier class that is not present in it. To tackle the challenge, we apply an ensemble of classifiers comprising 13 convolutional neural networks (CNNs), we develop two approaches to handle the outlier class, and we propose a straightforward method to use the meta-data along with the images. Throughout this report, we detail each methodology and its parameters to make our work easy to replicate. The results obtained are in line with previous challenges, and the approaches for detecting the outlier class and exploiting the meta-data appear to work properly.
Tasks
Published 2019-09-10
URL https://arxiv.org/abs/1909.04525v2
PDF https://arxiv.org/pdf/1909.04525v2.pdf
PWC https://paperswithcode.com/paper/skin-cancer-detection-based-on-deep-learning
Repo
Framework
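One natural reading of "entropy to detect outlier samples" is to flag predictions whose softmax output is spread evenly over the known classes. The sketch below implements that reading; the threshold is a tunable assumption, and the paper's exact rule may differ.

```python
import numpy as np

def entropy_outlier(probs, threshold=1.5):
    # probs: (n_samples, n_known_classes) softmax outputs of the ensemble
    h = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # Shannon entropy (nats)
    return h > threshold  # True -> route the sample to the outlier class
```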

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Title Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
Authors Wei Fang, Yu-An Chung, James Glass
Abstract Modern text-to-speech (TTS) systems are able to generate audio that sounds almost as natural as human speech. However, the bar for developing high-quality TTS systems remains high, since a sizable set of studio-quality <text, audio> pairs is usually required. Compared to the commercial data used to develop state-of-the-art systems, publicly available data are usually worse in terms of both quality and size. Audio generated by TTS systems trained on publicly available data tends not only to sound less natural, but also to exhibit more background noise. In this work, we aim to lower TTS systems' reliance on high-quality data by providing them with the textual knowledge extracted by deep pre-trained language models during training. In particular, we investigate the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS system consisting of an encoder and an attention-based decoder. BERT representations learned from large amounts of unlabeled text data are shown to contain very rich semantic and syntactic information about the input text, and have the potential to be leveraged by a TTS system to compensate for the lack of high-quality data. We incorporate BERT as a parallel branch to the Tacotron-2 encoder with its own attention head. An input text is simultaneously passed into BERT and the Tacotron-2 encoder. The representations extracted by the two branches are concatenated and then fed to the decoder. As a preliminary study, although we have not found that incorporating BERT into Tacotron-2 generates more natural or cleaner speech at a human-perceivable level, we observe improvements in other aspects: the model is significantly better at knowing when to stop decoding, so that there is much less babbling at the end of the synthesized audio, and it converges faster during training.
Tasks Speech Synthesis, Transfer Learning
Published 2019-06-17
URL https://arxiv.org/abs/1906.07307v1
PDF https://arxiv.org/pdf/1906.07307v1.pdf
PWC https://paperswithcode.com/paper/towards-transfer-learning-for-end-to-end
Repo
Framework
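A sketch of the parallel-branch idea: BERT representations of the input text are projected into the encoder space and concatenated with the Tacotron-2 encoder outputs to form the memory the decoder attends over. `taco_encoder` and `proj` are assumed stand-ins, and the paper additionally gives the BERT branch its own attention head, which is omitted here.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
proj = nn.Linear(768, 512)  # map BERT features into the (assumed) encoder space

def dual_branch_encode(text, taco_encoder):
    ids = tokenizer(text, return_tensors="pt")
    bert_seq = bert(**ids).last_hidden_state  # (1, T_bert, 768)
    taco_seq = taco_encoder(text)             # (1, T_taco, 512), assumed
    # Concatenate the two branch outputs along time; the decoder attends
    # over this joint memory (in the paper, via per-branch attention heads).
    return torch.cat([proj(bert_seq), taco_seq], dim=1)
```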

Fast Data-Driven Simulation of Cherenkov Detectors Using Generative Adversarial Networks

Title Fast Data-Driven Simulation of Cherenkov Detectors Using Generative Adversarial Networks
Authors Artem Maevskiy, Denis Derkach, Nikita Kazeev, Andrey Ustyuzhanin, Maksim Artemev, Lucio Anderlini
Abstract The increasing luminosities of future Large Hadron Collider runs and the next generation of collider experiments will require an unprecedented amount of simulated events to be produced. Such large-scale productions are extremely demanding in terms of computing resources. Thus, new approaches to event generation and simulation of detector responses are needed. In LHCb, the accurate simulation of Cherenkov detectors takes a sizeable fraction of CPU time. An alternative approach is described here, in which one generates high-level reconstructed observables using a generative neural network to bypass low-level details. This network is trained to reproduce the particle-species likelihood function values based on the track kinematic parameters and detector occupancy. The fast simulation is trained using real data samples collected by LHCb during Run 2. We demonstrate that this approach provides high-fidelity results.
Tasks
Published 2019-05-28
URL https://arxiv.org/abs/1905.11825v2
PDF https://arxiv.org/pdf/1905.11825v2.pdf
PWC https://paperswithcode.com/paper/fast-data-driven-simulation-of-cherenkov
Repo
Framework
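The high-level structure is a conditional generator mapping track kinematics and occupancy (plus noise) straight to particle-ID likelihood values. The sketch below assumes three kinematic inputs, one occupancy input, and five likelihood outputs; all sizes are placeholders, not the paper's network.

```python
import torch
import torch.nn as nn

N_KIN, N_OCC, N_NOISE, N_PID = 3, 1, 64, 5  # assumed sizes

gen = nn.Sequential(
    nn.Linear(N_KIN + N_OCC + N_NOISE, 128), nn.ReLU(),
    nn.Linear(128, N_PID))

def fast_sim(kinematics, occupancy):
    # kinematics: (B, 3), e.g. momentum, pseudorapidity, track multiplicity
    z = torch.randn(kinematics.shape[0], N_NOISE)
    return gen(torch.cat([kinematics, occupancy, z], dim=1))  # PID likelihoods
```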

Mask-guided Style Transfer Network for Purifying Real Images

Title Mask-guided Style Transfer Network for Purifying Real Images
Authors Tongtong Zhao, Yuxiao Yan, Jinjia Peng, Huibing Wang, Xianping Fu
Abstract Recently, progress in learning-by-synthesis has produced training models based on synthetic images, which can effectively reduce the cost of human and material resources. However, due to the different distribution of synthetic images compared with real images, the desired performance cannot be achieved. To solve this problem, previous methods learned a model to improve the realism of the synthetic images. Different from previous methods, this paper tries to purify real images by extracting discriminative and robust features to convert outdoor real images into indoor synthetic images. In this paper, we first introduce segmentation masks to construct RGB-mask pairs as inputs, then we design a mask-guided style transfer network to learn style features separately from the attention and background (bkgd) regions and learn content features from the full and attention regions. Moreover, we propose a novel region-level task-guided loss to constrain the features learned from style and content. Experiments were performed using mixed (qualitative and quantitative) methods to demonstrate the possibility of purifying real images in complex directions. We evaluate the proposed method on various public datasets, including LPW, COCO and MPIIGaze. Experimental results show that the proposed method is effective and achieves state-of-the-art results.
Tasks Style Transfer
Published 2019-03-19
URL http://arxiv.org/abs/1903.08152v1
PDF http://arxiv.org/pdf/1903.08152v1.pdf
PWC https://paperswithcode.com/paper/mask-guided-style-transfer-network-for
Repo
Framework
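A sketch of the RGB-mask pairing described above: the segmentation mask is stacked with the image as network input, and it also splits the image into the attention and background regions from which style features are learned separately. Shapes and the masking scheme are assumptions.

```python
import torch

def make_rgb_mask_input(img, mask):
    # img: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W), 1 = attention region
    net_input = torch.cat([img, mask], dim=1)  # the RGB-mask pair, (B, 4, H, W)
    attn_region = img * mask                   # source of attention-style features
    bkgd_region = img * (1 - mask)             # source of background-style features
    return net_input, attn_region, bkgd_region
```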

Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

Title Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Authors Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby
Abstract Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.
Tasks Latent Variable Models, Speech Synthesis, Style Transfer
Published 2019-06-08
URL https://arxiv.org/abs/1906.03402v3
PDF https://arxiv.org/pdf/1906.03402v3.pdf
PWC https://paperswithcode.com/paper/effective-use-of-variational-embedding
Repo
Framework
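Constraining embedding capacity with an upper bound on KL is commonly implemented with a Lagrange multiplier on the excess KL. The sketch below shows that generic construction; the target `capacity` value and the update scheme for the multiplier are assumptions, not the exact Capacitron recipe.

```python
import torch

def capacity_objective(recon_loss, kl, lagrange_lambda, capacity=50.0):
    """Reconstruction loss plus a Lagrangian penalty on KL above `capacity` nats.

    `lagrange_lambda` is a non-negative multiplier, typically updated by
    dual ascent so that E[KL] settles near the capacity target, e.g.
    lambda_raw += eta * (kl.detach() - capacity), then softplus(lambda_raw).
    """
    return recon_loss + lagrange_lambda * (kl - capacity)
```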