Paper Group ANR 405
Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts
Title | Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts |
Authors | Pankaj Gupta, Bernt Andrassy, Hinrich Schütze |
Abstract | The goal of our industrial ticketing system is to retrieve a relevant solution for an input query by matching it against historical tickets stored in a knowledge base. A query comprises a subject and a description, while a historical ticket consists of a subject, a description, and a solution. To retrieve a relevant solution, we use a textual similarity paradigm to learn similarity between the query and historical tickets. The task is challenging due to significant term mismatch in query-ticket pairs of asymmetric lengths, where the subject is a short text but the description and solution are multi-sentence texts. We present a novel Replicated Siamese LSTM model to learn similarity in asymmetric text pairs, which yields gains of 22% and 7% (Accuracy@10) on the retrieval task over unsupervised and supervised baselines, respectively. We also show that topic and distributed semantic features for short and long texts improve both similarity learning and retrieval. |
Tasks | |
Published | 2018-07-08 |
URL | http://arxiv.org/abs/1807.02854v1 |
PDF | http://arxiv.org/pdf/1807.02854v1.pdf |
PWC | https://paperswithcode.com/paper/replicated-siamese-lstm-in-ticketing-system |
Repo | |
Framework | |
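The core of this approach is a Siamese encoder: one LSTM shared across both sides of a text pair, with similarity taken as a function of the distance between the two final hidden states. Below is a minimal single-channel sketch in PyTorch using the Manhattan-distance similarity of the MaLSTM family that this paper replicates across subject/description/solution channels; all dimensions and names are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    """One shared LSTM encoder applied to both sides of a text pair."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        _, (h, _) = self.lstm(self.embed(token_ids))
        return h[-1]                      # final hidden state as text vector

    def forward(self, query_ids, ticket_ids):
        hq, ht = self.encode(query_ids), self.encode(ticket_ids)
        # Manhattan similarity in (0, 1], as in the MaLSTM similarity function
        return torch.exp(-torch.sum(torch.abs(hq - ht), dim=1))

model = SiameseLSTM(vocab_size=10000)
q = torch.randint(0, 10000, (2, 12))      # short, subject-like texts
t = torch.randint(0, 10000, (2, 80))      # longer description/solution texts
print(model(q, t))                        # pairwise similarity scores
```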
Adversarial Training for Probabilistic Spiking Neural Networks
Title | Adversarial Training for Probabilistic Spiking Neural Networks |
Authors | Alireza Bagheri, Osvaldo Simeone, Bipin Rajendran |
Abstract | Classifiers trained using conventional empirical risk minimization or maximum likelihood methods are known to suffer dramatic performance degradations when tested on examples adversarially selected based on knowledge of the classifier's decision rule. Due to the prominence of Artificial Neural Networks (ANNs) as classifiers, their sensitivity to adversarial examples, as well as robust training schemes, have recently been the subject of intense investigation. In this paper, for the first time, the sensitivity of spiking neural networks (SNNs), or third-generation neural networks, to adversarial examples is studied. The study considers rate and time encoding, as well as rate and first-to-spike decoding. Furthermore, a robust training mechanism is proposed that is demonstrated to enhance the performance of SNNs under white-box attacks. |
Tasks | |
Published | 2018-02-22 |
URL | http://arxiv.org/abs/1802.08567v2 |
PDF | http://arxiv.org/pdf/1802.08567v2.pdf |
PWC | https://paperswithcode.com/paper/adversarial-training-for-probabilistic |
Repo | |
Framework | |
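Robust training schemes of this kind fit the model on adversarially perturbed inputs. A minimal sketch of white-box adversarial training with an FGSM-style perturbation follows; the spiking dynamics, rate/time encoding, and the paper's specific attack are abstracted away, and the linear model here is a hypothetical stand-in.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in classifier; the paper studies probabilistic SNNs with
# rate/time encoding, which this sketch does not model.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def fgsm(x, y, eps=0.1):
    """White-box FGSM perturbation: one signed-gradient step on the input."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def train_step(x, y):
    # Robust training: fit the model on adversarially perturbed examples.
    x_adv = fgsm(x, y)
    opt.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()

x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))
print(train_step(x, y))
```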
DeepSPINE: Automated Lumbar Vertebral Segmentation, Disc-level Designation, and Spinal Stenosis Grading Using Deep Learning
Title | DeepSPINE: Automated Lumbar Vertebral Segmentation, Disc-level Designation, and Spinal Stenosis Grading Using Deep Learning |
Authors | Jen-Tang Lu, Stefano Pedemonte, Bernardo Bizzo, Sean Doyle, Katherine P. Andriole, Mark H. Michalski, R. Gilberto Gonzalez, Stuart R. Pomerantz |
Abstract | The high prevalence of spinal stenosis results in a large volume of MRI imaging, yet interpretation can be time-consuming, with high inter-reader variability even among the most specialized radiologists. In this paper, we develop an efficient methodology to leverage the subject-matter expertise stored in large-scale archival reporting and image data for a deep-learning approach to fully automated lumbar spinal stenosis grading. Specifically, we introduce three major contributions: (1) a natural-language-processing scheme to extract level-by-level ground-truth labels from free-text radiology reports for the various types and grades of spinal stenosis, (2) accurate vertebral segmentation and disc-level localization using a U-Net architecture combined with a spine-curve fitting method, and (3) a multi-input, multi-task, and multi-class convolutional neural network that performs central canal and foraminal stenosis grading on both axial and sagittal imaging series inputs, with the extracted report-derived labels applied to the corresponding imaging level segments. This study uses a large dataset of 22,796 disc levels extracted from 4,075 patients. We achieve state-of-the-art performance on lumbar spinal stenosis classification and expect the technique will increase both radiology workflow efficiency and the perceived value of radiology reports for referring clinicians and patients. |
Tasks | |
Published | 2018-07-26 |
URL | http://arxiv.org/abs/1807.10215v1 |
PDF | http://arxiv.org/pdf/1807.10215v1.pdf |
PWC | https://paperswithcode.com/paper/deepspine-automated-lumbar-vertebral |
Repo | |
Framework | |
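Contribution (3) is a multi-task network in which separate grading heads share one trunk. Below is a toy PyTorch sketch of that head structure, with made-up dimensions and without the paper's multi-input (axial plus sagittal) pathways.

```python
import torch
import torch.nn as nn

class MultiTaskGrader(nn.Module):
    """Shared CNN trunk with separate heads for central-canal and foraminal
    stenosis grades; a toy stand-in for the paper's multi-input network."""
    def __init__(self, n_grades=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.canal_head = nn.Linear(16, n_grades)
        self.foraminal_head = nn.Linear(16, n_grades)

    def forward(self, x):
        z = self.trunk(x)
        return self.canal_head(z), self.foraminal_head(z)

net = MultiTaskGrader()
axial = torch.rand(2, 1, 64, 64)          # one disc-level image patch per row
canal_logits, foraminal_logits = net(axial)
loss = (nn.functional.cross_entropy(canal_logits, torch.tensor([0, 2])) +
        nn.functional.cross_entropy(foraminal_logits, torch.tensor([1, 3])))
print(loss.item())                        # multi-task losses are simply summed
```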
Speaker-independent raw waveform model for glottal excitation
Title | Speaker-independent raw waveform model for glottal excitation |
Authors | Lauri Juvela, Vassilis Tsiaras, Bajibabu Bollepalli, Manu Airaksinen, Junichi Yamagishi, Paavo Alku |
Abstract | Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to more effectively train a speaker-independent waveform generator with limited resources. We present a multi-speaker ‘GlotNet’ vocoder, which uses a WaveNet to generate glottal excitation waveforms that are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model compares favourably with a direct WaveNet vocoder trained with the same model architecture and data. |
Tasks | Speech Synthesis, Text-To-Speech Synthesis, Voice Conversion |
Published | 2018-04-25 |
URL | http://arxiv.org/abs/1804.09593v1 |
PDF | http://arxiv.org/pdf/1804.09593v1.pdf |
PWC | https://paperswithcode.com/paper/speaker-independent-raw-waveform-model-for |
Repo | |
Framework | |
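The underlying source-filter idea: a generated glottal excitation drives a vocal tract filter to produce speech. A minimal numpy/scipy sketch follows, with white noise standing in for the WaveNet-generated excitation and hand-picked all-pole coefficients standing in for the filter that GlotNet derives from acoustic features.

```python
import numpy as np
from scipy.signal import lfilter

# Source-filter synthesis: a glottal excitation (white noise standing in for
# a WaveNet-generated excitation) drives an all-pole vocal tract filter.
fs = 16000
excitation = np.random.randn(fs)             # 1 s of stand-in excitation

# Illustrative stable LPC (all-pole) coefficients; in GlotNet these come from
# the acoustic features the model is conditioned on.
a = np.array([1.0, -1.3, 0.8])               # denominator of 1 / A(z)
speech = lfilter([1.0], a, excitation)
print(speech.shape)
```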
Applying Cooperative Machine Learning to Speed Up the Annotation of Social Signals in Large Multi-modal Corpora
Title | Applying Cooperative Machine Learning to Speed Up the Annotation of Social Signals in Large Multi-modal Corpora |
Authors | Johannes Wagner, Tobias Baur, Yue Zhang, Michel F. Valstar, Björn Schuller, Elisabeth André |
Abstract | Scientific disciplines such as Behavioural Psychology, Anthropology and, more recently, Social Signal Processing are concerned with the systematic exploration of human behaviour. A typical workflow includes the manual annotation (also called coding) of social signals in multi-modal corpora of considerable size. For the involved annotators this is an exhausting and time-consuming task. In this article we present a novel method, and provide the tools, to speed up the coding procedure. To this end, we suggest and evaluate the use of Cooperative Machine Learning (CML) techniques to reduce manual labelling efforts by combining computational capabilities with human intelligence. The proposed CML strategy starts with a small number of labelled instances and concentrates on predicting local parts first. Afterwards, a session-independent classification model is created to finish the remaining parts of the database. Confidence values are computed to guide the manual inspection and correction of the predictions. To bring the proposed approach into application we introduce NOVA, an open-source tool for collaborative and machine-aided annotations. In particular, it gives labellers immediate access to CML strategies and directly provides visual feedback on the results. Our experiments show that the proposed method has the potential to significantly reduce human labelling efforts. |
Tasks | |
Published | 2018-02-07 |
URL | http://arxiv.org/abs/1802.02565v1 |
PDF | http://arxiv.org/pdf/1802.02565v1.pdf |
PWC | https://paperswithcode.com/paper/applying-cooperative-machine-learning-to |
Repo | |
Framework | |
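The CML loop can be summarized as: train on a small labelled seed, predict the rest, and use confidence to decide what a human must still inspect. A toy scikit-learn sketch of that confidence-gated loop; the classifier, threshold, and synthetic data are illustrative, not NOVA's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy cooperative-labelling loop: train on a small seed set, predict the rest,
# and route only low-confidence predictions back to the human annotator.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

seed = 50                                    # small manually labelled portion
clf = LogisticRegression().fit(X[:seed], y[:seed])

proba = clf.predict_proba(X[seed:])
confidence = proba.max(axis=1)
needs_review = confidence < 0.8              # hand these back to the annotator
print(f"{needs_review.sum()} of {len(confidence)} predictions flagged for review")
```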
Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder
Title | Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder |
Authors | Kei Akuzawa, Yusuke Iwasawa, Yutaka Matsuo |
Abstract | Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, because they lack the ability to model global characteristics of speech (such as speaker individuality or speaking style), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive remains an open issue. In this paper, we propose combining VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). Unlike traditional autoregressive SS systems, this approach uses the VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show that the VAE helps VoiceLoop generate higher-quality speech and control the expressions in its synthesized speech by incorporating global characteristics into the speech-generating process. |
Tasks | Speech Synthesis |
Published | 2018-04-06 |
URL | http://arxiv.org/abs/1804.02135v3 |
PDF | http://arxiv.org/pdf/1804.02135v3.pdf |
PWC | https://paperswithcode.com/paper/expressive-speech-synthesis-via-modeling |
Repo | |
Framework | |
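The key mechanism is a VAE that encodes utterance-level characteristics into a global latent variable, which then conditions the autoregressive decoder. A minimal sketch of such a global-latent encoder with the standard reparameterization and KL term; dimensions are made up and VoiceLoop itself is omitted.

```python
import torch
import torch.nn as nn

class GlobalStyleVAE(nn.Module):
    """Encodes an utterance-level feature vector into a global latent z;
    a toy stand-in for the VAE that conditions the autoregressive decoder."""
    def __init__(self, feat_dim=80, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * z_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()

vae = GlobalStyleVAE()
utterance_feats = torch.rand(4, 80)       # e.g. averaged acoustic features
z, kl = vae(utterance_feats)
print(z.shape, kl.item())                 # z would condition VoiceLoop's decoder
```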
Machine Speech Chain with One-shot Speaker Adaptation
Title | Machine Speech Chain with One-shot Speaker Adaptation |
Authors | Andros Tjandra, Sakriani Sakti, Satoshi Nakamura |
Abstract | In previous work, we developed a closed-loop speech chain model based on deep learning, in which the architecture enabled the automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components to mutually improve their performance. This was accomplished by the two parts teaching each other using both labeled and unlabeled data. This approach could significantly improve model performance within a single-speaker speech dataset, but only a slight increase could be gained in multi-speaker tasks. Furthermore, the model is still unable to handle unseen speakers. In this paper, we present a new speech chain mechanism by integrating a speaker recognition model inside the loop. We also propose extending the capability of TTS to handle unseen speakers by implementing one-shot speaker adaptation. This enables TTS to mimic voice characteristics from one speaker to another with only a one-shot speaker sample, even from a text without any speaker information. In the speech chain loop mechanism, ASR also benefits from the ability to further learn an arbitrary speaker’s characteristics from the generated speech waveform, resulting in a significant improvement in the recognition rate. |
Tasks | Speaker Recognition, Speech Recognition, Speech Synthesis, Text-To-Speech Synthesis |
Published | 2018-03-28 |
URL | http://arxiv.org/abs/1803.10525v1 |
PDF | http://arxiv.org/pdf/1803.10525v1.pdf |
PWC | https://paperswithcode.com/paper/machine-speech-chain-with-one-shot-speaker |
Repo | |
Framework | |
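The speech-chain loop trains ASR and TTS on each other's outputs over unpaired data. The schematic below captures the two half-loops with hypothetical `asr`, `tts`, and `spk_embed` components; it is structured pseudocode under assumed interfaces, not the paper's implementation.

```python
# Schematic of one speech-chain training step; asr, tts and spk_embed are
# hypothetical callables with assumed methods, not the paper's actual models.
def speech_chain_step(unpaired_speech, unpaired_text, asr, tts, spk_embed):
    # Speech-only data: ASR transcribes, TTS reconstructs the waveform and is
    # trained to match the original audio (speech -> text -> speech).
    hypo_text = asr.transcribe(unpaired_speech)
    recon = tts.synthesize(hypo_text, spk_embed(unpaired_speech))
    tts.update(loss=tts.reconstruction_loss(recon, unpaired_speech))

    # Text-only data: TTS synthesizes speech (conditioned on a one-shot
    # speaker embedding), and ASR is trained to recover the original text
    # (text -> speech -> text).
    synth = tts.synthesize(unpaired_text, spk_embed(unpaired_speech))
    asr.update(loss=asr.ctc_loss(synth, unpaired_text))
```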
Sharp Analysis of a Simple Model for Random Forests
Title | Sharp Analysis of a Simple Model for Random Forests |
Authors | Jason M. Klusowski |
Abstract | Random forests have become an important tool for improving accuracy in regression problems since their popularization by [Breiman, 2001] and others. In this paper, we revisit a random forest model originally proposed by [Breiman, 2004] and later studied by [Biau, 2012], where a feature is selected at random and the split occurs at the midpoint of the box containing the chosen feature. If the Lipschitz regression function is sparse and only depends on a small, unknown subset of $S$ out of $d$ features, we show that, given access to $n$ observations, this random forest model outputs a predictor that has a mean-squared prediction error $O((n(\sqrt{\log n})^{S-1})^{-\frac{1}{S\log2+1}})$. This positively answers an outstanding question of [Biau, 2012] about whether the rate of convergence therein could be improved. The second part of this article shows that the aforementioned prediction error cannot generally be improved, which we accomplish by characterizing the variance and by showing that the bias is tight for any linear model with nonzero parameter vector. As a striking consequence of our analysis, we show the variance of this forest is similar in form to the best-case variance lower bound of [Lin and Jeon, 2006] among all random forest models with nonadaptive splitting schemes (i.e., where the split protocol is independent of the training data). |
Tasks | |
Published | 2018-05-07 |
URL | https://arxiv.org/abs/1805.02587v6 |
PDF | https://arxiv.org/pdf/1805.02587v6.pdf |
PWC | https://paperswithcode.com/paper/complete-analysis-of-a-random-forest-model |
Repo | |
Framework | |
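The model under analysis is simple to state: each tree picks a feature uniformly at random at each node and splits the current cell at its midpoint along that feature; the forest averages over trees. A toy implementation, assuming data in $[0,1]^d$; depths and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def midpoint_tree_predict(X, y, x, lo, hi, depth):
    """Breiman (2004)-style tree: choose a feature at random, split the
    current cell at its midpoint, and predict the mean of training points
    left in the leaf containing x."""
    idx = np.all((X >= lo) & (X <= hi), axis=1)
    if depth == 0 or idx.sum() <= 1:
        return y[idx].mean() if idx.any() else 0.0
    j = rng.integers(X.shape[1])
    mid = 0.5 * (lo[j] + hi[j])
    lo, hi = lo.copy(), hi.copy()
    if x[j] <= mid:
        hi[j] = mid
    else:
        lo[j] = mid
    return midpoint_tree_predict(X, y, x, lo, hi, depth - 1)

d, n = 5, 2000
X = rng.random((n, d))
y = X[:, 0] + 0.1 * rng.normal(size=n)       # sparse target: one active feature
x0 = rng.random(d)
preds = [midpoint_tree_predict(X, y, x0, np.zeros(d), np.ones(d), depth=6)
         for _ in range(50)]                 # forest = average over random trees
print(np.mean(preds), x0[0])
```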
Deep Feed-forward Sequential Memory Networks for Speech Synthesis
Title | Deep Feed-forward Sequential Memory Networks for Speech Synthesis |
Authors | Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei, Zhijie Yan |
Abstract | The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially naturalness in prosody. However, the model complexity and inference cost of BLSTM prevent its use in many runtime applications. Meanwhile, Deep Feed-forward Sequential Memory Networks (DFSMN) have consistently outperformed BLSTM in both word error rate (WER) and runtime computation cost in speech recognition tasks. Since speech synthesis, like speech recognition, requires modeling long-term dependencies, in this paper we investigate the Deep-FSMN (DFSMN) for speech synthesis. Both objective and subjective experiments show that, compared with the BLSTM TTS method, the DFSMN system can generate synthesized speech of comparable quality while drastically reducing model complexity and speech generation time. |
Tasks | Speech Recognition, Speech Synthesis |
Published | 2018-02-26 |
URL | http://arxiv.org/abs/1802.09194v1 |
PDF | http://arxiv.org/pdf/1802.09194v1.pdf |
PWC | https://paperswithcode.com/paper/deep-feed-forward-sequential-memory-networks |
Repo | |
Framework | |
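An FSMN augments a feed-forward layer with a learnable FIR "memory" over neighbouring hidden states, which is what lets it stand in for a recurrent layer at lower cost. A minimal sketch of one Deep-FSMN-style block, using a depthwise 1-D convolution as the memory; the paper's exact filter orders, strides, and skip topology differ.

```python
import torch
import torch.nn as nn

class FSMNBlock(nn.Module):
    """Feed-forward layer plus a learnable FIR 'memory' over past/future
    hidden states (a depthwise 1-D convolution), with a skip connection as
    in Deep-FSMN. A sketch, not the paper's configuration."""
    def __init__(self, dim=128, lookback=10, lookahead=10):
        super().__init__()
        self.ff = nn.Linear(dim, dim)
        self.memory = nn.Conv1d(dim, dim, kernel_size=lookback + lookahead + 1,
                                padding=(lookback + lookahead) // 2,
                                groups=dim, bias=False)

    def forward(self, x):                      # x: (batch, time, dim)
        h = torch.relu(self.ff(x))
        m = self.memory(h.transpose(1, 2)).transpose(1, 2)
        return x + m                           # skip connection to next block

block = FSMNBlock()
frames = torch.rand(2, 100, 128)               # acoustic feature frames
print(block(frames).shape)                     # (2, 100, 128)
```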
Regret vs. Bandwidth Trade-off for Recommendation Systems
Title | Regret vs. Bandwidth Trade-off for Recommendation Systems |
Authors | Linqi Song, Christina Fragouli, Devavrat Shah |
Abstract | We consider recommendation systems that need to operate under wireless bandwidth constraints, measured as the number of broadcast transmissions, and demonstrate a trade-off (tight for some instances) between regret and bandwidth in two scenarios: the case of a multi-armed bandit with context, and the case where there is a latent structure in the message space that we can exploit to shorten the learning phase. |
Tasks | Recommendation Systems |
Published | 2018-10-15 |
URL | http://arxiv.org/abs/1810.06313v1 |
PDF | http://arxiv.org/pdf/1810.06313v1.pdf |
PWC | https://paperswithcode.com/paper/regret-vs-bandwidth-trade-off-for |
Repo | |
Framework | |
Localizing Moments in Video with Temporal Language
Title | Localizing Moments in Video with Temporal Language |
Authors | Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell |
Abstract | Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in text. We propose a new model that explicitly reasons about different temporal segments in a video, and show that temporal context is important for localizing phrases which include temporal language. To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language), which allows for controlled studies on temporal language, and a human language dataset consisting of temporal sentences annotated by humans (TEMPO - Human Language). |
Tasks | Video Understanding |
Published | 2018-09-05 |
URL | http://arxiv.org/abs/1809.01337v1 |
PDF | http://arxiv.org/pdf/1809.01337v1.pdf |
PWC | https://paperswithcode.com/paper/localizing-moments-in-video-with-temporal |
Repo | |
Framework | |
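At its simplest, moment localization reduces to scoring candidate temporal segments against the query. A toy sketch of that scoring step with random features and a plain dot-product scorer; the paper's model additionally conditions on temporal-context features, which are omitted here.

```python
import torch

# Toy moment localizer: score each candidate temporal segment of a video
# against a language-query embedding and return the best-scoring span.
video_feats = torch.rand(10, 512)            # 10 clip-level features
query_emb = torch.rand(512)                  # sentence embedding of the query

candidates = [(s, e) for s in range(10) for e in range(s + 1, 11)]
scores = torch.stack([video_feats[s:e].mean(0) @ query_emb
                      for s, e in candidates])
best = candidates[scores.argmax().item()]
print(f"predicted moment: clips {best[0]}..{best[1] - 1}")
```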
Equalizing Financial Impact in Supervised Learning
Title | Equalizing Financial Impact in Supervised Learning |
Authors | Govind Ramnarayan |
Abstract | Notions of “fair classification” that have arisen in computer science generally revolve around equalizing certain statistics across protected groups. This approach has been criticized as ignoring societal issues, including how errors can hurt certain groups disproportionately. We pose a modification of one of the fairness criteria from Hardt, Price, and Srebro [NIPS, 2016] that makes a small step towards addressing this issue in the case of financial decisions like giving loans. We call this new notion “equalized financial impact.” |
Tasks | |
Published | 2018-06-24 |
URL | http://arxiv.org/abs/1806.09211v1 |
PDF | http://arxiv.org/pdf/1806.09211v1.pdf |
PWC | https://paperswithcode.com/paper/equalizing-financial-impact-in-supervised |
Repo | |
Framework | |
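In the spirit of the group-fairness statistics being modified here, the sketch below compares an impact-weighted error rate across two protected groups on synthetic loan decisions. The cost weights and data are illustrative assumptions; the paper's precise "equalized financial impact" criterion should be taken from the paper itself.

```python
import numpy as np

# Toy group-fairness check: compare an impact-weighted error rate across two
# protected groups. Costs below are illustrative, not the paper's definition.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 1000)            # 1 = would repay the loan
y_pred = rng.integers(0, 2, 1000)            # 1 = loan granted
group = rng.integers(0, 2, 1000)

cost_fn, cost_fp = 3.0, 1.0                  # denying a good borrower hurts more
for g in (0, 1):
    m = group == g
    fn = np.mean((y_true[m] == 1) & (y_pred[m] == 0))
    fp = np.mean((y_true[m] == 0) & (y_pred[m] == 1))
    print(f"group {g}: weighted impact = {cost_fn * fn + cost_fp * fp:.3f}")
```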
Stylistic Variation in Social Media Part-of-Speech Tagging
Title | Stylistic Variation in Social Media Part-of-Speech Tagging |
Authors | Murali Raghu Babu Balusu, Taha Merghani, Jacob Eisenstein |
Abstract | Social media features substantial stylistic variation, raising new challenges for syntactic analysis of online writing. However, this variation is often aligned with author attributes such as age, gender, and geography, as well as more readily-available social network metadata. In this paper, we report new evidence on the link between language and social networks in the task of part-of-speech tagging. We find that tagger error rates are correlated with network structure, with high accuracy in some parts of the network, and lower accuracy elsewhere. As a result, tagger accuracy depends on training from a balanced sample of the network, rather than training on texts from a narrow subcommunity. We also describe our attempts to add robustness to stylistic variation, by building a mixture-of-experts model in which each expert is associated with a region of the social network. While prior work found that similar approaches yield performance improvements in sentiment analysis and entity linking, we were unable to obtain performance improvements in part-of-speech tagging, despite strong evidence for the link between part-of-speech error rates and social network structure. |
Tasks | Entity Linking, Part-Of-Speech Tagging, Sentiment Analysis |
Published | 2018-04-19 |
URL | http://arxiv.org/abs/1804.07331v1 |
PDF | http://arxiv.org/pdf/1804.07331v1.pdf |
PWC | https://paperswithcode.com/paper/stylistic-variation-in-social-media-part-of |
Repo | |
Framework | |
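The mixture-of-experts idea here: train one tagger per region of the social network and gate between them per author. A schematic PyTorch sketch with made-up dimensions; in the paper, experts and gating are tied to the actual network structure rather than a learned embedding.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    """Each expert is a tagger for one region of the social network; a gate
    (here driven by an author/node embedding) mixes their predictions."""
    def __init__(self, feat_dim=64, n_tags=17, n_experts=4, node_dim=16):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, n_tags) for _ in range(n_experts)])
        self.gate = nn.Linear(node_dim, n_experts)

    def forward(self, token_feats, node_emb):
        weights = torch.softmax(self.gate(node_emb), dim=-1)       # (B, E)
        logits = torch.stack([e(token_feats) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)         # (B, tags)

moe = MixtureOfExperts()
tags = moe(torch.rand(8, 64), torch.rand(8, 16))   # one token per row
print(tags.shape)                                  # (8, 17)
```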
Tools and resources for Romanian text-to-speech and speech-to-text applications
Title | Tools and resources for Romanian text-to-speech and speech-to-text applications |
Authors | Tiberiu Boros, Stefan Daniel Dumitrescu, Vasile Pais |
Abstract | In this paper we introduce a set of resources and tools aimed at providing support for natural language processing, text-to-speech synthesis and speech recognition for Romanian. While the tools are general purpose and can be used for any language (we successfully trained our system for more than 50 languages and participated in the Universal Dependencies Shared Task), the resources are only relevant for Romanian language processing. |
Tasks | Speech Recognition, Speech Synthesis, Text-To-Speech Synthesis |
Published | 2018-02-15 |
URL | http://arxiv.org/abs/1802.05583v1 |
PDF | http://arxiv.org/pdf/1802.05583v1.pdf |
PWC | https://paperswithcode.com/paper/tools-and-resources-for-romanian-text-to |
Repo | |
Framework | |
Built-in Vulnerabilities to Imperceptible Adversarial Perturbations
Title | Built-in Vulnerabilities to Imperceptible Adversarial Perturbations |
Authors | Thomas Tanay, Jerone T. A. Andrews, Lewis D. Griffin |
Abstract | Designing models that are robust to small adversarial perturbations of their inputs has proven remarkably difficult. In this work we show that the reverse problem—making models more vulnerable—is surprisingly easy. After presenting some proofs of concept on MNIST, we introduce a generic tilting attack that injects vulnerabilities into the linear layers of pre-trained networks by increasing their sensitivity to components of low variance in the training data without affecting their performance on test data. We illustrate this attack on a multilayer perceptron trained on SVHN and use it to design a stand-alone adversarial module which we call a steganogram decoder. Finally, we show on CIFAR-10 that a poisoning attack with a poisoning rate as low as 0.1% can induce vulnerabilities to chosen imperceptible backdoor signals in state-of-the-art networks. Beyond their practical implications, these different results shed new light on the nature of the adversarial example phenomenon. |
Tasks | |
Published | 2018-06-19 |
URL | https://arxiv.org/abs/1806.07409v2 |
PDF | https://arxiv.org/pdf/1806.07409v2.pdf |
PWC | https://paperswithcode.com/paper/built-in-vulnerabilities-to-imperceptible |
Repo | |
Framework | |
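The tilting attack injects sensitivity along a low-variance direction of the training data, so clean performance is preserved while a small, well-chosen perturbation produces a large response. A numpy sketch of the rank-one idea on a single linear layer; the scale constant and synthetic data are illustrative assumptions.

```python
import numpy as np

# Sketch of rank-one "tilting": make a linear layer sensitive to a direction
# of low variance in the training data without disturbing typical inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)) * np.linspace(2.0, 0.01, 20)   # anisotropic data
W = rng.normal(size=(10, 20))                                  # "pre-trained" layer

# Direction of least variance in the training data: the last right-singular
# vector of the centred data matrix.
v = np.linalg.svd(X - X.mean(0), full_matrices=False)[2][-1]
u = rng.normal(size=10)

W_tilted = W + 50.0 * np.outer(u, v)          # inject sensitivity along v

delta = 0.01 * v                              # tiny, near-imperceptible nudge
print(np.linalg.norm(W @ delta), np.linalg.norm(W_tilted @ delta))
```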