October 19, 2019

Paper Group ANR 405

Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts

Title Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts
Authors Pankaj Gupta, Bernt Andrassy, Hinrich Schütze
Abstract The goal of our industrial ticketing system is to retrieve a relevant solution for an input query by matching it against historical tickets stored in a knowledge base. A query comprises a subject and a description, while a historical ticket consists of a subject, a description and a solution. To retrieve a relevant solution, we use a textual similarity paradigm to learn similarity between the query and historical tickets. The task is challenging due to significant term mismatch in the query and ticket pairs of asymmetric lengths, where the subject is a short text but the description and solution are multi-sentence texts. We present a novel Replicated Siamese LSTM model to learn similarity in asymmetric text pairs, which gives 22% and 7% gains (Accuracy@10) on the retrieval task over unsupervised and supervised baselines, respectively. We also show that topic and distributed semantic features for short and long texts improve both similarity learning and retrieval.
Published 2018-07-08
URL http://arxiv.org/abs/1807.02854v1
PDF http://arxiv.org/pdf/1807.02854v1.pdf
PWC https://paperswithcode.com/paper/replicated-siamese-lstm-in-ticketing-system
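
The Accuracy@10 figure quoted in the abstract above is a ranking metric. The following is a minimal sketch of one common way to compute Accuracy@k, assuming a query counts as a hit when at least one relevant ticket appears among its top-k retrieved results. That definition, the function name `accuracy_at_k` and the ticket identifiers are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence, Set


def accuracy_at_k(
    ranked: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant item in the top-k results.

    ranked[i]   -- retrieved ticket ids for query i, best match first (hypothetical ids)
    relevant[i] -- set of ticket ids judged relevant for query i
    """
    if len(ranked) != len(relevant):
        raise ValueError("ranked and relevant must have the same length")
    if not ranked:
        return 0.0
    hits = sum(
        1
        for results, gold in zip(ranked, relevant)
        if any(ticket in gold for ticket in results[:k])
    )
    return hits / len(ranked)


if __name__ == "__main__":
    ranked = [["t3", "t7"], ["t1", "t2"]]
    relevant = [{"t7"}, {"t9"}]
    print(accuracy_at_k(ranked, relevant, k=2))
```

In this toy call only the first query has a relevant ticket in its top 2, so half of the queries count as hits.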

Adversarial Training for Probabilistic Spiking Neural Networks

Title Adversarial Training for Probabilistic Spiking Neural Networks
Authors Alireza Bagheri, Osvaldo Simeone, Bipin Rajendran
Abstract Classifiers trained using conventional empirical risk minimization or maximum likelihood methods are known to suffer dramatic performance degradations when tested over examples adversarially selected based on knowledge of the classifier’s decision rule. Due to the prominence of Artificial Neural Networks (ANNs) as classifiers, their sensitivity to adversarial examples, as well as robust training schemes, have been recently the subject of intense investigation. In this paper, for the first time, the sensitivity of spiking neural networks (SNNs), or third-generation neural networks, to adversarial examples is studied. The study considers rate and time encoding, as well as rate and first-to-spike decoding. Furthermore, a robust training mechanism is proposed that is demonstrated to enhance the performance of SNNs under white-box attacks.
Published 2018-02-22
URL http://arxiv.org/abs/1802.08567v2
PDF http://arxiv.org/pdf/1802.08567v2.pdf
PWC https://paperswithcode.com/paper/adversarial-training-for-probabilistic
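
The abstract above mentions rate encoding together with rate and first-to-spike decoding. The sketch below illustrates those three ideas in isolation, assuming binary spike trains stored as lists of 0/1 values and input intensities normalized to [0, 1]. It does not reproduce the paper's probabilistic neuron model, attack or training scheme, and all function names are hypothetical.

```python
import random
from typing import List, Optional


def rate_encode(intensity: float, num_steps: int, rng: random.Random) -> List[int]:
    """Bernoulli rate encoding: spike at each step with probability `intensity`."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [1 if rng.random() < intensity else 0 for _ in range(num_steps)]


def rate_decode(output_spikes: List[List[int]]) -> int:
    """Pick the output neuron (class) that emitted the most spikes."""
    counts = [sum(train) for train in output_spikes]
    return max(range(len(counts)), key=counts.__getitem__)


def first_to_spike_decode(output_spikes: List[List[int]]) -> Optional[int]:
    """Pick the output neuron that spikes earliest; None if nobody spikes."""
    best_class: Optional[int] = None
    best_time: Optional[int] = None
    for idx, train in enumerate(output_spikes):
        for t, spike in enumerate(train):
            if spike:
                if best_time is None or t < best_time:
                    best_class, best_time = idx, t
                break  # only the first spike of each neuron matters
    return best_class


if __name__ == "__main__":
    rng = random.Random(0)
    print(rate_encode(0.8, 10, rng))
    outputs = [[0, 1, 0, 0], [0, 0, 1, 1]]
    print(rate_decode(outputs), first_to_spike_decode(outputs))
```

The two decoders can disagree, as in the usage example: neuron 0 spikes first, while neuron 1 spikes more often.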

DeepSPINE: Automated Lumbar Vertebral Segmentation, Disc-level Designation, and Spinal Stenosis Grading Using Deep Learning

Title DeepSPINE: Automated Lumbar Vertebral Segmentation, Disc-level Designation, and Spinal Stenosis Grading Using Deep Learning
Authors Jen-Tang Lu, Stefano Pedemonte, Bernardo Bizzo, Sean Doyle, Katherine P. Andriole, Mark H. Michalski, R. Gilberto Gonzalez, Stuart R. Pomerantz
Abstract The high prevalence of spinal stenosis results in a large volume of MRI imaging, yet interpretation can be time-consuming with high inter-reader variability even among the most specialized radiologists. In this paper, we develop an efficient methodology to leverage the subject-matter-expertise stored in large-scale archival reporting and image data for a deep-learning approach to fully-automated lumbar spinal stenosis grading. Specifically, we introduce three major contributions: (1) a natural-language-processing scheme to extract level-by-level ground-truth labels from free-text radiology reports for the various types and grades of spinal stenosis, (2) accurate vertebral segmentation and disc-level localization using a U-Net architecture combined with a spine-curve fitting method, and (3) a multi-input, multi-task, and multi-class convolutional neural network to perform central canal and foraminal stenosis grading on both axial and sagittal imaging series inputs with the extracted report-derived labels applied to corresponding imaging level segments. This study uses a large dataset of 22796 disc-levels extracted from 4075 patients. We achieve state-of-the-art performance on lumbar spinal stenosis classification and expect the technique will increase both radiology workflow efficiency and the perceived value of radiology reports for referring clinicians and patients.
Published 2018-07-26
URL http://arxiv.org/abs/1807.10215v1
PDF http://arxiv.org/pdf/1807.10215v1.pdf
PWC https://paperswithcode.com/paper/deepspine-automated-lumbar-vertebral

Speaker-independent raw waveform model for glottal excitation

Title Speaker-independent raw waveform model for glottal excitation
Authors Lauri Juvela, Vassilis Tsiaras, Bajibabu Bollepalli, Manu Airaksinen, Junichi Yamagishi, Paavo Alku
Abstract Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to more effectively train a speaker-independent waveform generator with limited resources. We present a multi-speaker ‘GlotNet’ vocoder, which utilizes a WaveNet to generate glottal excitation waveforms, which are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model performs favourably to a direct WaveNet vocoder trained with the same model architecture and data.
Tasks Speech Synthesis, Text-To-Speech Synthesis, Voice Conversion
Published 2018-04-25
URL http://arxiv.org/abs/1804.09593v1
PDF http://arxiv.org/pdf/1804.09593v1.pdf
PWC https://paperswithcode.com/paper/speaker-independent-raw-waveform-model-for
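
The source-filter idea in the abstract above, where an excitation signal drives a vocal tract filter, can be sketched with a plain all-pole recursion. This is only an illustration under stated assumptions: the all-pole (linear-prediction style) filter form, the function name `all_pole_filter` and the toy coefficients are not taken from the paper, and in the described system the excitation would come from a WaveNet rather than a hand-written list.

```python
from typing import List, Sequence


def all_pole_filter(excitation: Sequence[float], a: Sequence[float]) -> List[float]:
    """Filter an excitation signal through an all-pole filter.

    Implements s[n] = e[n] - sum_{k=1..p} a[k-1] * s[n-k] with zero initial state,
    where `a` holds the (assumed) vocal tract filter coefficients a_1..a_p.
    """
    out: List[float] = []
    for n, e in enumerate(excitation):
        acc = e
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                acc -= coeff * out[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    # A unit impulse as a stand-in for one excitation pulse.
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(all_pole_filter(impulse, a=[-0.5]))
```

With a single coefficient of -0.5, the impulse response decays by half at every sample, which makes the recursion easy to check by hand.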

Applying Cooperative Machine Learning to Speed Up the Annotation of Social Signals in Large Multi-modal Corpora

Title Applying Cooperative Machine Learning to Speed Up the Annotation of Social Signals in Large Multi-modal Corpora
Authors Johannes Wagner, Tobias Baur, Yue Zhang, Michel F. Valstar, Björn Schuller, Elisabeth André
Abstract Scientific disciplines, such as Behavioural Psychology, Anthropology and recently Social Signal Processing are concerned with the systematic exploration of human behaviour. A typical work-flow includes the manual annotation (also called coding) of social signals in multi-modal corpora of considerable size. For the involved annotators this defines an exhausting and time-consuming task. In the article at hand we present a novel method and also provide the tools to speed up the coding procedure. To this end, we suggest and evaluate the use of Cooperative Machine Learning (CML) techniques to reduce manual labelling efforts by combining the power of computational capabilities and human intelligence. The proposed CML strategy starts with a small number of labelled instances and concentrates on predicting local parts first. Afterwards, a session-independent classification model is created to finish the remaining parts of the database. Confidence values are computed to guide the manual inspection and correction of the predictions. To bring the proposed approach into application we introduce NOVA - an open-source tool for collaborative and machine-aided annotations. In particular, it gives labellers immediate access to CML strategies and directly provides visual feedback on the results. Our experiments show that the proposed method has the potential to significantly reduce human labelling efforts.
Published 2018-02-07
URL http://arxiv.org/abs/1802.02565v1
PDF http://arxiv.org/pdf/1802.02565v1.pdf
PWC https://paperswithcode.com/paper/applying-cooperative-machine-learning-to

Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

Title Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder
Authors Kei Akuzawa, Yusuke Iwasawa, Yutaka Matsuo
Abstract Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, as they lack the ability to model global characteristics of speech (such as speaker individualities or speaking styles), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this paper, we propose to combine VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). This approach, unlike traditional autoregressive SS systems, uses the VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show that the VAE helps VoiceLoop to generate higher quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generating process.
Tasks Speech Synthesis
Published 2018-04-06
URL http://arxiv.org/abs/1804.02135v3
PDF http://arxiv.org/pdf/1804.02135v3.pdf
PWC https://paperswithcode.com/paper/expressive-speech-synthesis-via-modeling

Machine Speech Chain with One-shot Speaker Adaptation

Title Machine Speech Chain with One-shot Speaker Adaptation
Authors Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Abstract In previous work, we developed a closed-loop speech chain model based on deep learning, in which the architecture enabled the automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components to mutually improve their performance. This was accomplished by the two parts teaching each other using both labeled and unlabeled data. This approach could significantly improve model performance within a single-speaker speech dataset, but only a slight increase could be gained in multi-speaker tasks. Furthermore, the model is still unable to handle unseen speakers. In this paper, we present a new speech chain mechanism by integrating a speaker recognition model inside the loop. We also propose extending the capability of TTS to handle unseen speakers by implementing one-shot speaker adaptation. This enables TTS to mimic voice characteristics from one speaker to another with only a one-shot speaker sample, even from a text without any speaker information. In the speech chain loop mechanism, ASR also benefits from the ability to further learn an arbitrary speaker’s characteristics from the generated speech waveform, resulting in a significant improvement in the recognition rate.
Tasks Speaker Recognition, Speech Recognition, Speech Synthesis, Text-To-Speech Synthesis
Published 2018-03-28
URL http://arxiv.org/abs/1803.10525v1
PDF http://arxiv.org/pdf/1803.10525v1.pdf
PWC https://paperswithcode.com/paper/machine-speech-chain-with-one-shot-speaker

Sharp Analysis of a Simple Model for Random Forests

Title Sharp Analysis of a Simple Model for Random Forests
Authors Jason M. Klusowski
Abstract Random forests have become an important tool for improving accuracy in regression problems since their popularization by [Breiman, 2001] and others. In this paper, we revisit a random forest model originally proposed by [Breiman, 2004] and later studied by [Biau, 2012], where a feature is selected at random and the split occurs at the midpoint of the box containing the chosen feature. If the Lipschitz regression function is sparse and only depends on a small, unknown subset of $S$ out of $d$ features, we show that, given access to $n$ observations, this random forest model outputs a predictor that has a mean-squared prediction error $O((n(\sqrt{\log n})^{S-1})^{-\frac{1}{S\log2+1}})$. This positively answers an outstanding question of [Biau, 2012] about whether the rate of convergence therein could be improved. The second part of this article shows that the aforementioned prediction error cannot generally be improved, which we accomplish by characterizing the variance and by showing that the bias is tight for any linear model with nonzero parameter vector. As a striking consequence of our analysis, we show the variance of this forest is similar in form to the best-case variance lower bound of [Lin and Jeon, 2006] among all random forest models with nonadaptive splitting schemes (i.e., where the split protocol is independent of the training data).
Published 2018-05-07
URL https://arxiv.org/abs/1805.02587v6
PDF https://arxiv.org/pdf/1805.02587v6.pdf
PWC https://paperswithcode.com/paper/complete-analysis-of-a-random-forest-model
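
The forest model described in the abstract above (a feature chosen at random, with the split placed at the midpoint of the current box along that feature) can be sketched in a few lines. This is a minimal illustration, assuming inputs in the unit cube, a fixed tree depth, uniform feature selection and a fallback to the global mean for empty leaves. The class names and those choices are assumptions, and the sketch makes no attempt to reproduce the paper's analysis or rates.

```python
import random
from typing import Dict, List, Sequence, Tuple

Path = Tuple[int, ...]


class CenteredTree:
    """One tree on [0, 1]^d whose splits never look at the training data."""

    def __init__(self, d: int, depth: int, rng: random.Random) -> None:
        self.d, self.depth, self.rng = d, depth, rng
        self.features: Dict[Path, int] = {}  # node path -> randomly chosen feature
        self.leaf_sum: Dict[Path, float] = {}
        self.leaf_count: Dict[Path, int] = {}
        self.default = 0.0  # assumed fallback for empty leaves

    def _leaf(self, x: Sequence[float]) -> Path:
        lo, hi = [0.0] * self.d, [1.0] * self.d
        path: Path = ()
        for _ in range(self.depth):
            if path not in self.features:  # drawn lazily, independent of x's label
                self.features[path] = self.rng.randrange(self.d)
            j = self.features[path]
            mid = 0.5 * (lo[j] + hi[j])  # midpoint of the box along feature j
            if x[j] < mid:
                hi[j], path = mid, path + (0,)
            else:
                lo[j], path = mid, path + (1,)
        return path

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None:
        self.default = sum(y) / len(y)
        for xi, yi in zip(X, y):
            leaf = self._leaf(xi)
            self.leaf_sum[leaf] = self.leaf_sum.get(leaf, 0.0) + yi
            self.leaf_count[leaf] = self.leaf_count.get(leaf, 0) + 1

    def predict(self, x: Sequence[float]) -> float:
        leaf = self._leaf(x)
        n = self.leaf_count.get(leaf, 0)
        return self.leaf_sum[leaf] / n if n else self.default


class CenteredForest:
    """Average of independently randomized centered trees."""

    def __init__(self, d: int, depth: int, n_trees: int, seed: int = 0) -> None:
        master = random.Random(seed)
        self.trees: List[CenteredTree] = [
            CenteredTree(d, depth, random.Random(master.getrandbits(32)))
            for _ in range(n_trees)
        ]

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None:
        for tree in self.trees:
            tree.fit(X, y)

    def predict(self, x: Sequence[float]) -> float:
        return sum(t.predict(x) for t in self.trees) / len(self.trees)
```

Because the splits are data-independent, each tree only needs per-leaf sums and counts of the responses. The prediction for a query point is the average response of the training points that share its leaf, averaged again over trees.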

Deep Feed-forward Sequential Memory Networks for Speech Synthesis

Title Deep Feed-forward Sequential Memory Networks for Speech Synthesis
Authors Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei, Zhijie Yan
Abstract The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially the naturalness in prosody. However, the model complexity and inference cost of BLSTM prevent its use in many runtime applications. Meanwhile, Deep Feed-forward Sequential Memory Networks (DFSMN) have consistently outperformed BLSTM in both word error rate (WER) and runtime computation cost in speech recognition tasks. Since speech synthesis, like speech recognition, also requires modeling long-term dependencies, in this paper we investigate the Deep-FSMN (DFSMN) in speech synthesis. Both objective and subjective experiments show that, compared with the BLSTM TTS method, the DFSMN system can generate synthesized speech with comparable speech quality while drastically reducing model complexity and speech generation time.
Tasks Speech Recognition, Speech Synthesis
Published 2018-02-26
URL http://arxiv.org/abs/1802.09194v1
PDF http://arxiv.org/pdf/1802.09194v1.pdf
PWC https://paperswithcode.com/paper/deep-feed-forward-sequential-memory-networks

Regret vs. Bandwidth Trade-off for Recommendation Systems

Title Regret vs. Bandwidth Trade-off for Recommendation Systems
Authors Linqi Song, Christina Fragouli, Devavrat Shah
Abstract We consider recommendation systems that need to operate under wireless bandwidth constraints, measured as number of broadcast transmissions, and demonstrate a (tight for some instances) tradeoff between regret and bandwidth for two scenarios: the case of multi-armed bandit with context, and the case where there is a latent structure in the message space that we can exploit to reduce the learning phase.
Tasks Recommendation Systems
Published 2018-10-15
URL http://arxiv.org/abs/1810.06313v1
PDF http://arxiv.org/pdf/1810.06313v1.pdf
PWC https://paperswithcode.com/paper/regret-vs-bandwidth-trade-off-for

Localizing Moments in Video with Temporal Language

Title Localizing Moments in Video with Temporal Language
Authors Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
Abstract Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in text. We propose a new model that explicitly reasons about different temporal segments in a video, and show that temporal context is important for localizing phrases which include temporal language. To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language), which allows for controlled studies on temporal language, and a human language dataset, which consists of temporal sentences annotated by humans (TEMPO - Human Language).
Tasks Video Understanding
Published 2018-09-05
URL http://arxiv.org/abs/1809.01337v1
PDF http://arxiv.org/pdf/1809.01337v1.pdf
PWC https://paperswithcode.com/paper/localizing-moments-in-video-with-temporal

Equalizing Financial Impact in Supervised Learning

Title Equalizing Financial Impact in Supervised Learning
Authors Govind Ramnarayan
Abstract Notions of “fair classification” that have arisen in computer science generally revolve around equalizing certain statistics across protected groups. This approach has been criticized as ignoring societal issues, including how errors can hurt certain groups disproportionately. We pose a modification of one of the fairness criteria from Hardt, Price, and Srebro [NIPS, 2016] that makes a small step towards addressing this issue in the case of financial decisions like giving loans. We call this new notion “equalized financial impact.”
Published 2018-06-24
URL http://arxiv.org/abs/1806.09211v1
PDF http://arxiv.org/pdf/1806.09211v1.pdf
PWC https://paperswithcode.com/paper/equalizing-financial-impact-in-supervised

Stylistic Variation in Social Media Part-of-Speech Tagging

Title Stylistic Variation in Social Media Part-of-Speech Tagging
Authors Murali Raghu Babu Balusu, Taha Merghani, Jacob Eisenstein
Abstract Social media features substantial stylistic variation, raising new challenges for syntactic analysis of online writing. However, this variation is often aligned with author attributes such as age, gender, and geography, as well as more readily-available social network metadata. In this paper, we report new evidence on the link between language and social networks in the task of part-of-speech tagging. We find that tagger error rates are correlated with network structure, with high accuracy in some parts of the network, and lower accuracy elsewhere. As a result, tagger accuracy depends on training from a balanced sample of the network, rather than training on texts from a narrow subcommunity. We also describe our attempts to add robustness to stylistic variation, by building a mixture-of-experts model in which each expert is associated with a region of the social network. While prior work found that similar approaches yield performance improvements in sentiment analysis and entity linking, we were unable to obtain performance improvements in part-of-speech tagging, despite strong evidence for the link between part-of-speech error rates and social network structure.
Tasks Entity Linking, Part-Of-Speech Tagging, Sentiment Analysis
Published 2018-04-19
URL http://arxiv.org/abs/1804.07331v1
PDF http://arxiv.org/pdf/1804.07331v1.pdf
PWC https://paperswithcode.com/paper/stylistic-variation-in-social-media-part-of

Tools and resources for Romanian text-to-speech and speech-to-text applications

Title Tools and resources for Romanian text-to-speech and speech-to-text applications
Authors Tiberiu Boros, Stefan Daniel Dumitrescu, Vasile Pais
Abstract In this paper we introduce a set of resources and tools aimed at providing support for natural language processing, text-to-speech synthesis and speech recognition for Romanian. While the tools are general purpose and can be used for any language (we successfully trained our system for more than 50 languages and participated in the Universal Dependencies Shared Task), the resources are only relevant for Romanian language processing.
Tasks Speech Recognition, Speech Synthesis, Text-To-Speech Synthesis
Published 2018-02-15
URL http://arxiv.org/abs/1802.05583v1
PDF http://arxiv.org/pdf/1802.05583v1.pdf
PWC https://paperswithcode.com/paper/tools-and-resources-for-romanian-text-to

Built-in Vulnerabilities to Imperceptible Adversarial Perturbations

Title Built-in Vulnerabilities to Imperceptible Adversarial Perturbations
Authors Thomas Tanay, Jerone T. A. Andrews, Lewis D. Griffin
Abstract Designing models that are robust to small adversarial perturbations of their inputs has proven remarkably difficult. In this work we show that the reverse problem—making models more vulnerable—is surprisingly easy. After presenting some proofs of concept on MNIST, we introduce a generic tilting attack that injects vulnerabilities into the linear layers of pre-trained networks by increasing their sensitivity to components of low variance in the training data without affecting their performance on test data. We illustrate this attack on a multilayer perceptron trained on SVHN and use it to design a stand-alone adversarial module which we call a steganogram decoder. Finally, we show on CIFAR-10 that a poisoning attack with a poisoning rate as low as 0.1% can induce vulnerabilities to chosen imperceptible backdoor signals in state-of-the-art networks. Beyond their practical implications, these different results shed new light on the nature of the adversarial example phenomenon.
Published 2018-06-19
URL https://arxiv.org/abs/1806.07409v2
PDF https://arxiv.org/pdf/1806.07409v2.pdf
PWC https://paperswithcode.com/paper/built-in-vulnerabilities-to-imperceptible