Paper Group ANR 621
What does Attention in Neural Machine Translation Pay Attention to?
Title | What does Attention in Neural Machine Translation Pay Attention to? |
Authors | Hamidreza Ghader, Christof Monz |
Abstract | Attention in neural machine translation provides the possibility to encode relevant parts of the source sentence at each translation step. As a result, attention is considered to be an alignment model as well. However, there is no work that specifically studies attention and provides analysis of what is being learned by attention models. Thus, the question remains how attention is similar to or different from traditional alignment. In this paper, we provide a detailed analysis of attention and compare it to traditional alignment. We answer the question of whether attention is only capable of modelling translational equivalence or whether it captures more information. We show that attention is different from alignment in some cases and captures useful information other than alignments. |
Tasks | Machine Translation |
Published | 2017-10-09 |
URL | http://arxiv.org/abs/1710.03348v1 |
http://arxiv.org/pdf/1710.03348v1.pdf | |
PWC | https://paperswithcode.com/paper/what-does-attention-in-neural-machine |
Repo | |
Framework | |
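To make the contrast between soft attention and hard alignment concrete, here is a minimal Python sketch. It turns hypothetical attention scores for one target word into a distribution, reads off the most-attended source position as a hard alignment, and uses entropy to measure how spread out the attention is. The scores, the function names, and the choice of entropy as a concentration measure are illustrative assumptions, not the authors' code or necessarily their metrics.

```python
import math

def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_alignment(attention):
    """Index of the source position that receives the most attention."""
    return max(range(len(attention)), key=lambda i: attention[i])

def attention_entropy(attention):
    """Shannon entropy (in nats); low values mean concentrated attention."""
    return -sum(p * math.log(p) for p in attention if p > 0.0)

# Hypothetical scores of one target word against four source words.
scores = [0.1, 2.5, 0.3, 0.2]
attention = softmax(scores)
aligned_pos = hard_alignment(attention)
entropy = attention_entropy(attention)
```

A peaked distribution behaves like a traditional one-to-one alignment, whereas a high-entropy distribution spreads weight over several source words. That second case is where attention may carry information beyond alignment.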
Matching neural paths: transfer from recognition to correspondence search
Title | Matching neural paths: transfer from recognition to correspondence search |
Authors | Nikolay Savinov, Lubor Ladicky, Marc Pollefeys |
Abstract | Many machine learning tasks require finding per-part correspondences between objects. In this work we focus on low-level correspondences - a highly ambiguous matching problem. We propose to use a hierarchical semantic representation of the objects, coming from a convolutional neural network, to solve this ambiguity. Training it for low-level correspondence prediction directly might not be an option in some domains where the ground-truth correspondences are hard to obtain. We show how transfer from recognition can be used to avoid such training. Our idea is to mark parts as “matching” if their features are close to each other at all the levels of convolutional feature hierarchy (neural paths). Although the overall number of such paths is exponential in the number of layers, we propose a polynomial algorithm for aggregating all of them in a single backward pass. The empirical validation is done on the task of stereo correspondence and demonstrates that we achieve competitive results among the methods which do not use labeled target domain data. |
Tasks | |
Published | 2017-05-19 |
URL | http://arxiv.org/abs/1705.08272v3 |
http://arxiv.org/pdf/1705.08272v3.pdf | |
PWC | https://paperswithcode.com/paper/matching-neural-paths-transfer-from |
Repo | |
Framework | |
Linearly constrained Gaussian processes
Title | Linearly constrained Gaussian processes |
Authors | Carl Jidling, Niklas Wahlström, Adrian Wills, Thomas B. Schön |
Abstract | We consider a modification of the covariance function in Gaussian processes to correctly account for known linear constraints. By modelling the target function as a transformation of an underlying function, the constraints are explicitly incorporated in the model such that they are guaranteed to be fulfilled by any sample drawn or prediction made. We also propose a constructive procedure for designing the transformation operator and illustrate the result on both simulated and real-data examples. |
Tasks | Gaussian Processes |
Published | 2017-03-02 |
URL | http://arxiv.org/abs/1703.00787v2 |
http://arxiv.org/pdf/1703.00787v2.pdf | |
PWC | https://paperswithcode.com/paper/linearly-constrained-gaussian-processes |
Repo | |
Framework | |
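The construction described in the abstract can be written out as follows. This is a sketch under assumed notation: the symbols $\mathcal{F}_x$ (the known linear constraint operator), $\mathcal{G}_x$ (the transformation operator), $g$ (the underlying function), and $k_g$ (its covariance function) are not named in the abstract. The divergence-free example is a standard illustration, not necessarily one of the paper's experiments.

```latex
\begin{align}
  f(x) &= \mathcal{G}_x\, g(x),
    \qquad g \sim \mathcal{GP}\bigl(0,\; k_g(x, x')\bigr), \\
  \mathcal{F}_x f(x) &= \mathcal{F}_x \mathcal{G}_x\, g(x) = 0
    \qquad \text{whenever } \mathcal{F}_x \mathcal{G}_x = 0, \\
  f &\sim \mathcal{GP}\bigl(0,\; \mathcal{G}_x\, k_g(x, x')\, \mathcal{G}_{x'}^{\top}\bigr).
\end{align}

% Illustration: a divergence-free field in two dimensions.
\begin{equation}
  \mathcal{F}_x = \begin{bmatrix} \dfrac{\partial}{\partial x_1} & \dfrac{\partial}{\partial x_2} \end{bmatrix},
  \qquad
  \mathcal{G}_x = \begin{bmatrix} -\dfrac{\partial}{\partial x_2} \\[2ex] \dfrac{\partial}{\partial x_1} \end{bmatrix},
  \qquad
  \mathcal{F}_x \mathcal{G}_x
    = -\frac{\partial^2}{\partial x_1 \partial x_2}
      + \frac{\partial^2}{\partial x_2 \partial x_1} = 0 .
\end{equation}
```

Because the constraint is built into the covariance function, every sample and every posterior prediction of $f$ satisfies it by construction.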
Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition
Title | Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition |
Authors | Chris Donahue, Bo Li, Rohit Prabhavalkar |
Abstract | We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems. Prior work demonstrates that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however this technique was not justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing, we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR). By appending the GAN-enhanced features to the noisy inputs and retraining, we achieve a 7% WER improvement relative to the MTR system. |
Tasks | Robust Speech Recognition, Speech Enhancement, Speech Recognition |
Published | 2017-11-15 |
URL | http://arxiv.org/abs/1711.05747v2 |
http://arxiv.org/pdf/1711.05747v2.pdf | |
PWC | https://paperswithcode.com/paper/exploring-speech-enhancement-with-generative |
Repo | |
Framework | |
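Word error rate (WER) improvements can be reported as absolute or relative figures, and the abstract above reports a relative one. The following Python sketch shows the difference. The baseline and improved WER values are hypothetical numbers chosen for illustration and do not come from the paper.

```python
def absolute_wer_reduction(baseline_wer, new_wer):
    """Difference in WER, in percentage points."""
    return baseline_wer - new_wer

def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical values: a baseline at 20.0% WER and a retrained system at 18.6%.
baseline = 20.0
improved = 18.6
abs_gain = absolute_wer_reduction(baseline, improved)   # about 1.4 points
rel_gain = relative_wer_reduction(baseline, improved)   # about 0.07, i.e. 7% relative
```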
Robust Speech Recognition Using Generative Adversarial Networks
Title | Robust Speech Recognition Using Generative Adversarial Networks |
Authors | Anuroop Sriram, Heewoo Jun, Yashesh Gaur, Sanjeev Satheesh |
Abstract | This paper describes a general, scalable, end-to-end framework that uses the generative adversarial network (GAN) objective to enable robust speech recognition. Encoders trained with the proposed approach enjoy improved invariance by learning to map noisy audio to the same embedding space as that of clean audio. Unlike previous methods, the new framework does not rely on domain expertise or simplifying assumptions as are often needed in signal processing, and directly encourages robustness in a data-driven way. We show the new approach improves simulated far-field speech recognition of vanilla sequence-to-sequence models without specialized front-ends or preprocessing. |
Tasks | Robust Speech Recognition, Speech Recognition |
Published | 2017-11-05 |
URL | http://arxiv.org/abs/1711.01567v1 |
http://arxiv.org/pdf/1711.01567v1.pdf | |
PWC | https://paperswithcode.com/paper/robust-speech-recognition-using-generative |
Repo | |
Framework | |
A Hybrid Approach with Multi-channel I-Vectors and Convolutional Neural Networks for Acoustic Scene Classification
Title | A Hybrid Approach with Multi-channel I-Vectors and Convolutional Neural Networks for Acoustic Scene Classification |
Authors | Hamid Eghbal-zadeh, Bernhard Lehner, Matthias Dorfer, Gerhard Widmer |
Abstract | In Acoustic Scene Classification (ASC), two major approaches have been followed. While one utilizes engineered features such as mel-frequency cepstral coefficients (MFCCs), the other uses learned features that are the outcome of an optimization algorithm. I-vectors are the result of a modeling technique that usually takes engineered features as input. It has been shown that standard MFCCs extracted from monaural audio signals lead to i-vectors that exhibit poor performance, especially on indoor acoustic scenes. At the same time, Convolutional Neural Networks (CNNs) are well known for their ability to learn features by optimizing their filters. They have been applied to ASC and have shown promising results. In this paper, we first propose a novel multi-channel i-vector extraction and scoring scheme for ASC, improving their performance on indoor and outdoor scenes. Second, we propose a CNN architecture that achieves promising ASC results. Further, we show that i-vectors and CNNs capture complementary information from acoustic scenes. Finally, we propose a hybrid system for ASC using multi-channel i-vectors and CNNs by utilizing a score fusion technique. Using our method, we participated in the ASC task of the DCASE-2016 challenge. Our hybrid approach achieved 1st rank among 49 submissions, substantially improving the previous state of the art. |
Tasks | Acoustic Scene Classification, Scene Classification |
Published | 2017-06-20 |
URL | http://arxiv.org/abs/1706.06525v1 |
http://arxiv.org/pdf/1706.06525v1.pdf | |
PWC | https://paperswithcode.com/paper/a-hybrid-approach-with-multi-channel-i |
Repo | |
Framework | |
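The abstract mentions score fusion without giving details, so the Python sketch below shows one common late-fusion scheme as an assumption: z-normalize the per-class scores of each system, then take a weighted sum. The function names, the `weight` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
import statistics

def z_normalize(scores):
    """Shift and scale scores to zero mean and unit variance."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0.0:
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]

def fuse_scores(ivector_scores, cnn_scores, weight=0.5):
    """Weighted sum of normalized per-class scores from two systems."""
    if len(ivector_scores) != len(cnn_scores):
        raise ValueError("both systems must score the same classes")
    a = z_normalize(ivector_scores)
    b = z_normalize(cnn_scores)
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]

def predict(fused_scores):
    """Index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=lambda i: fused_scores[i])

# Hypothetical per-class scores for three scene classes.
ivector_scores = [1.0, 3.0, 2.0]
cnn_scores = [0.2, 0.5, 0.9]
fused = fuse_scores(ivector_scores, cnn_scores, weight=0.5)
```

Normalizing first keeps one system from dominating merely because its scores live on a larger scale. In practice the weight would be tuned on held-out data.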
Near-optimal Sample Complexity Bounds for Robust Learning of Gaussians Mixtures via Compression Schemes
Title | Near-optimal Sample Complexity Bounds for Robust Learning of Gaussians Mixtures via Compression Schemes |
Authors | Hassan Ashtiani, Shai Ben-David, Nick Harvey, Christopher Liaw, Abbas Mehrabian, Yaniv Plan |
Abstract | We prove that $\tilde{\Theta}(k d^2 / \varepsilon^2)$ samples are necessary and sufficient for learning a mixture of $k$ Gaussians in $\mathbb{R}^d$, up to error $\varepsilon$ in total variation distance. This improves both the known upper bounds and lower bounds for this problem. For mixtures of axis-aligned Gaussians, we show that $\tilde{O}(k d / \varepsilon^2)$ samples suffice, matching a known lower bound. Moreover, these results hold in the agnostic-learning/robust-estimation setting as well, where the target distribution is only approximately a mixture of Gaussians. The upper bound is shown using a novel technique for distribution learning based on a notion of "compression." Any class of distributions that allows such a compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in $\mathbb{R}^d$ admits a small-sized compression scheme. |
Tasks | |
Published | 2017-10-14 |
URL | https://arxiv.org/abs/1710.05209v4 |
https://arxiv.org/pdf/1710.05209v4.pdf | |
PWC | https://paperswithcode.com/paper/near-optimal-sample-complexity-bounds-for |
Repo | |
Framework | |
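To get a feel for how the two bounds scale, the following Python sketch evaluates their leading-order terms. It deliberately drops the constants and logarithmic factors hidden by the tilde notation, so the outputs indicate growth rates only and are not actual sample counts. The function names are hypothetical.

```python
def general_mixture_scaling(k, d, eps):
    """Leading-order term k * d^2 / eps^2 for general Gaussian mixtures."""
    return k * d ** 2 / eps ** 2

def axis_aligned_mixture_scaling(k, d, eps):
    """Leading-order term k * d / eps^2 for axis-aligned Gaussian mixtures."""
    return k * d / eps ** 2

# Illustrative setting: 5 components in 10 dimensions with error 0.1.
k, d, eps = 5, 10, 0.1
general = general_mixture_scaling(k, d, eps)
axis_aligned = axis_aligned_mixture_scaling(k, d, eps)
ratio = general / axis_aligned  # the two terms differ by a factor of d
```

Halving $\varepsilon$ quadruples both terms, while the gap between the general and the axis-aligned case grows linearly with the dimension $d$.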
DNN Filter Bank Cepstral Coefficients for Spoofing Detection
Title | DNN Filter Bank Cepstral Coefficients for Spoofing Detection |
Authors | Hong Yu, Zheng-Hua Tan, Zhanyu Ma, Jun Guo |
Abstract | With the development of speech synthesis techniques, automatic speaker verification systems face the serious challenge of spoofing attacks. In order to improve the reliability of speaker verification systems, we develop a new filter bank based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech. The deep neural network filter bank is automatically generated by training a filter bank neural network (FBNN) using natural and synthetic speech. By adding restrictions on the training rules, the learned weight matrix of FBNN is band-limited and sorted by frequency, similar to the normal filter bank. Unlike the manually designed filter bank, the learned filter bank has different filter shapes in different channels, which can capture the differences between natural and synthetic speech more effectively. The experimental results on the ASVspoof 2015 database show that the Gaussian mixture model maximum-likelihood (GMM-ML) classifier trained by the new feature performs better than the state-of-the-art linear frequency cepstral coefficients (LFCC) based classifier, especially on detecting unknown attacks. |
Tasks | Speaker Verification, Speech Synthesis |
Published | 2017-02-13 |
URL | http://arxiv.org/abs/1702.03791v1 |
http://arxiv.org/pdf/1702.03791v1.pdf | |
PWC | https://paperswithcode.com/paper/dnn-filter-bank-cepstral-coefficients-for |
Repo | |
Framework | |
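Filter-bank cepstral features are generally computed by applying a filter bank to a power spectrum, taking logarithms, and decorrelating with a discrete cosine transform (DCT). The Python sketch below shows that generic pipeline. The rectangular filter bank and the spectrum values are hypothetical stand-ins; in the paper the filter shapes are learned by the FBNN, which is not reproduced here.

```python
import math

def filter_bank_energies(power_spectrum, filter_bank):
    """Apply each filter (one row of weights) to the power spectrum."""
    return [sum(w * p for w, p in zip(filt, power_spectrum))
            for filt in filter_bank]

def cepstral_coefficients(energies, num_coeffs, floor=1e-10):
    """Unnormalized DCT-II of the log filter-bank energies."""
    log_e = [math.log(max(e, floor)) for e in energies]  # floor avoids log(0)
    m = len(log_e)
    return [sum(le * math.cos(math.pi * n * (j + 0.5) / m)
                for j, le in enumerate(log_e))
            for n in range(num_coeffs)]

# Hypothetical 6-bin power spectrum and 3 non-overlapping rectangular filters.
power_spectrum = [1.0, 2.0, 4.0, 2.0, 1.0, 0.5]
filter_bank = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
energies = filter_bank_energies(power_spectrum, filter_bank)
coeffs = cepstral_coefficients(energies, num_coeffs=3)
```

Replacing the rows of `filter_bank` with learned, band-limited weights would be the step that distinguishes a learned filter bank from a hand-designed one.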
Unsupervised Domain Adaptation for Robust Speech Recognition via Variational Autoencoder-Based Data Augmentation
Title | Unsupervised Domain Adaptation for Robust Speech Recognition via Variational Autoencoder-Based Data Augmentation |
Authors | Wei-Ning Hsu, Yu Zhang, James Glass |
Abstract | Domain mismatch between training and testing can lead to significant degradation in performance in many machine learning scenarios. Unfortunately, this is not a rare situation for automatic speech recognition deployments in real-world applications. Research on robust speech recognition can be regarded as trying to overcome this domain mismatch issue. In this paper, we address the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are presented, but word transcripts are only available for the source domain speech. We present novel augmentation-based methods that transform speech in a way that does not change the transcripts. Specifically, we first train a variational autoencoder on both source and target domain data (without supervision) to learn a latent representation of speech. We then transform nuisance attributes of speech that are irrelevant to recognition by modifying the latent representations, in order to augment labeled training data with additional data whose distribution is more similar to the target domain. The proposed method is evaluated on the CHiME-4 dataset and reduces the absolute word error rate (WER) by as much as 35% compared to the non-adapted baseline. |
Tasks | Data Augmentation, Domain Adaptation, Robust Speech Recognition, Speech Recognition, Unsupervised Domain Adaptation |
Published | 2017-07-19 |
URL | http://arxiv.org/abs/1707.06265v2 |
http://arxiv.org/pdf/1707.06265v2.pdf | |
PWC | https://paperswithcode.com/paper/unsupervised-domain-adaptation-for-robust |
Repo | |
Framework | |
Nonparametric Neural Networks
Title | Nonparametric Neural Networks |
Authors | George Philipp, Jaime G. Carbonell |
Abstract | Automatically determining the optimal size of a neural network for a given task without prior information currently requires an expensive global search and training many networks from scratch. In this paper, we address the problem of automatically finding a good network size during a single training cycle. We introduce nonparametric neural networks, a non-probabilistic framework for conducting optimization over all possible network sizes and prove its soundness when network growth is limited via an L_p penalty. We train networks under this framework by continuously adding new units while eliminating redundant units via an L_2 penalty. We employ a novel optimization algorithm, which we term adaptive radial-angular gradient descent or AdaRad, and obtain promising results. |
Tasks | |
Published | 2017-12-14 |
URL | http://arxiv.org/abs/1712.05440v1 |
http://arxiv.org/pdf/1712.05440v1.pdf | |
PWC | https://paperswithcode.com/paper/nonparametric-neural-networks |
Repo | |
Framework | |
An Empirical Study of Discriminative Sequence Labeling Models for Vietnamese Text Processing
Title | An Empirical Study of Discriminative Sequence Labeling Models for Vietnamese Text Processing |
Authors | Phuong Le-Hong, Minh Pham Quang Nhat, Thai-Hoang Pham, Tuan-Anh Tran, Dang-Minh Nguyen |
Abstract | This paper presents an empirical study of two widely-used sequence prediction models, Conditional Random Fields (CRFs) and Long Short-Term Memory Networks (LSTMs), on two fundamental tasks for Vietnamese text processing, namely part-of-speech tagging and named entity recognition. We show that a strong lower bound for labeling accuracy can be obtained by relying only on simple word-based features with minimal hand-crafted feature engineering: performance scores of 90.65% and 86.03% on the standard test sets for the two tasks, respectively. In particular, we demonstrate empirically the surprising efficiency of word embeddings in both tasks and with both models. We point out that the state-of-the-art LSTM model does not always significantly outperform the traditional CRF model, especially on moderate-sized data sets. Finally, we give some suggestions and discussions for efficient use of sequence labeling models in practical applications. |
Tasks | Feature Engineering, Named Entity Recognition, Part-Of-Speech Tagging, Word Embeddings |
Published | 2017-08-30 |
URL | http://arxiv.org/abs/1708.09163v1 |
http://arxiv.org/pdf/1708.09163v1.pdf | |
PWC | https://paperswithcode.com/paper/an-empirical-study-of-discriminative-sequence |
Repo | |
Framework | |
Channel-Recurrent Autoencoding for Image Modeling
Title | Channel-Recurrent Autoencoding for Image Modeling |
Authors | Wenling Shang, Kihyuk Sohn, Yuandong Tian |
Abstract | Despite recent successes in synthesizing faces and bedrooms, existing generative models struggle to capture more complex image types, potentially due to the oversimplification of their latent space constructions. To tackle this issue, building on Variational Autoencoders (VAEs), we integrate recurrent connections across channels to both inference and generation steps, allowing the high-level features to be captured in global-to-local, coarse-to-fine manners. Combined with adversarial loss, our channel-recurrent VAE-GAN (crVAE-GAN) outperforms VAE-GAN in generating a diverse spectrum of high resolution images while maintaining the same level of computational efficacy. Our model produces interpretable and expressive latent representations to benefit downstream tasks such as image completion. Moreover, we propose two novel regularizations, namely the KL objective weighting scheme over time steps and mutual information maximization between transformed latent variables and the outputs, to enhance the training. |
Tasks | |
Published | 2017-06-12 |
URL | http://arxiv.org/abs/1706.03729v2 |
http://arxiv.org/pdf/1706.03729v2.pdf | |
PWC | https://paperswithcode.com/paper/channel-recurrent-autoencoding-for-image |
Repo | |
Framework | |
Smarnet: Teaching Machines to Read and Comprehend Like Human
Title | Smarnet: Teaching Machines to Read and Comprehend Like Human |
Authors | Zheqian Chen, Rongqin Yang, Bin Cao, Zhou Zhao, Deng Cai, Xiaofei He |
Abstract | Machine Comprehension (MC) is a challenging task in the Natural Language Processing field, which aims to guide the machine to comprehend a passage and answer a given question. Many existing approaches to the MC task suffer from bottlenecks such as insufficient lexical understanding, complex question-passage interaction, and incorrect answer extraction. In this paper, we address these problems from the viewpoint of how humans deal with reading tests in a scientific way. Specifically, we first propose a novel lexical gating mechanism to dynamically combine word and character representations. We then guide the machine to read in an interactive way with an attention mechanism and a memory network. Finally, we add a checking layer to refine the answer as a safeguard. Extensive experiments on two popular datasets, SQuAD and TriviaQA, show that our method performs considerably better than most state-of-the-art solutions at the time of submission. |
Tasks | Question Answering, Reading Comprehension |
Published | 2017-10-08 |
URL | http://arxiv.org/abs/1710.02772v1 |
http://arxiv.org/pdf/1710.02772v1.pdf | |
PWC | https://paperswithcode.com/paper/smarnet-teaching-machines-to-read-and |
Repo | |
Framework | |
Two-view 3D Reconstruction for Food Volume Estimation
Title | Two-view 3D Reconstruction for Food Volume Estimation |
Authors | Joachim Dehais, Marios Anthimopoulos, Sergey Shevchik, Stavroula Mougiakakou |
Abstract | The increasing prevalence of diet-related chronic diseases, coupled with the ineffectiveness of traditional diet management methods, has resulted in a need for novel tools to accurately and automatically assess meals. Recently, computer vision based systems that use meal images to assess their content have been proposed. Food portion estimation is the most difficult task for individuals assessing their meals and it is also the least studied area. The present paper proposes a three-stage system to calculate portion sizes using two images of a dish acquired by mobile devices. The first stage consists of understanding the configuration of the different views, after which a dense 3D model is built from the two images; finally, this 3D model serves to extract the volume of the different items. The system was extensively tested on 77 real dishes of known volume, and achieved an average error of less than 10% in 5.5 seconds per dish. The proposed pipeline is computationally tractable and requires no user input, making it a viable option for fully automated dietary assessment. |
Tasks | 3D Reconstruction |
Published | 2017-01-12 |
URL | http://arxiv.org/abs/1701.03330v1 |
http://arxiv.org/pdf/1701.03330v1.pdf | |
PWC | https://paperswithcode.com/paper/two-view-3d-reconstruction-for-food-volume |
Repo | |
Framework | |
Collaborative Descriptors: Convolutional Maps for Preprocessing
Title | Collaborative Descriptors: Convolutional Maps for Preprocessing |
Authors | Hirokatsu Kataoka, Kaori Abe, Akio Nakamura, Yutaka Satoh |
Abstract | The paper presents a novel concept for collaborative descriptors between deeply learned and hand-crafted features. To achieve this concept, we apply convolutional maps for pre-processing; that is, the convolutional maps are used as input to the hand-crafted features. We recorded an increase in the performance rate of +17.06% (multi-class object recognition) and +24.71% (car detection) from grayscale input to convolutional maps. Although the framework is straightforward, the concept should be inherited for an improved representation. |
Tasks | Object Recognition |
Published | 2017-05-10 |
URL | http://arxiv.org/abs/1705.03595v1 |
http://arxiv.org/pdf/1705.03595v1.pdf | |
PWC | https://paperswithcode.com/paper/collaborative-descriptors-convolutional-maps |
Repo | |
Framework | |