April 2, 2020

3302 words 16 mins read

Paper Group ANR 343

Defense against adversarial attacks on spoofing countermeasures of ASV. Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances. Comparison of user models based on GMM-UBM and i-vectors for speech, handwriting, and gait assessment of Parkinson’s disease patients. A Speaker Verification Backend for Improved Calibration Performance across Varying Conditions …

Defense against adversarial attacks on spoofing countermeasures of ASV

Title Defense against adversarial attacks on spoofing countermeasures of ASV
Authors Haibin Wu, Songxiang Liu, Helen Meng, Hung-yi Lee
Abstract Various cutting-edge countermeasure methods for automatic speaker verification (ASV) with considerable anti-spoofing performance were proposed in the ASVspoof 2019 challenge. However, previous work has shown that countermeasure models are vulnerable to adversarial examples indistinguishable from natural data. A good countermeasure model should not only be robust against spoofing audio, including synthetic, converted, and replayed audio, but should also counteract examples deliberately generated by malicious adversaries. In this work, we introduce a passive defense method, spatial smoothing, and a proactive defense method, adversarial training, to mitigate the vulnerability of ASV spoofing countermeasure models to adversarial examples. This paper is among the first to use defense methods to improve the robustness of ASV spoofing countermeasure models under adversarial attacks. The experimental results show that both defense methods help spoofing countermeasure models counter adversarial examples.
Tasks Speaker Verification
Published 2020-03-06
URL https://arxiv.org/abs/2003.03065v1
PDF https://arxiv.org/pdf/2003.03065v1.pdf
PWC https://paperswithcode.com/paper/defense-against-adversarial-attacks-on
Repo
Framework
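
The passive defense here, spatial smoothing, amounts to running a local filter (e.g., a median filter) over the input features so that small adversarial perturbations are washed out before the countermeasure model sees them. Below is a minimal sketch of that idea with an assumed 3x3 window and a stand-in log-spectrogram; the abstract does not specify the paper's exact filter settings.

```python
# Minimal sketch of spatial smoothing as a passive defense: a median
# filter applied to a (possibly adversarial) log-spectrogram before it
# reaches the countermeasure model. The 3x3 window is an assumption,
# not a value taken from the paper.
import numpy as np
from scipy.ndimage import median_filter

def smooth_spectrogram(spec: np.ndarray, window: int = 3) -> np.ndarray:
    """Apply a square median filter over the (freq, time) plane."""
    return median_filter(spec, size=window)

# Usage: denoise a perturbed spectrogram before scoring.
spec = np.random.randn(257, 400)        # stand-in log-spectrogram
defended = smooth_spectrogram(spec)     # feed this to the CM model
```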

Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Title Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances
Authors Aleksei Gusev, Vladimir Volokhov, Tseren Andzhukaev, Sergey Novoselov, Galina Lavrentyeva, Marina Volkova, Alice Gazizullina, Andrey Shulipa, Artem Gorlanov, Anastasia Avdeeva, Artem Ivanov, Alexander Kozlov, Timur Pekhovsky, Yuri Matveev
Abstract Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions, according to the results obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From a practical point of view, given the increased interest in virtual assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time Delay Neural Network) and ResNet (Residual Neural Network) blocks. We experimented with state-of-the-art embedding extractors and their training procedures. The obtained results confirm that ResNet architectures outperform the standard x-vector approach in terms of speaker verification quality for both long-duration and short-duration utterances. We also investigate the impact of the speech activity detector, different scoring models, and adaptation and score normalization techniques. The experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.
Tasks Speaker Recognition, Speaker Verification
Published 2020-02-14
URL https://arxiv.org/abs/2002.06033v1
PDF https://arxiv.org/pdf/2002.06033v1.pdf
PWC https://paperswithcode.com/paper/deep-speaker-embeddings-for-far-field-speaker
Repo
Framework
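
One standard score-normalization technique of the kind the abstract mentions is symmetric normalization (S-norm) of cosine scores against an impostor cohort. A minimal sketch under assumed embedding dimensions, with a random stand-in cohort:

```python
# Sketch of cosine scoring with symmetric score normalization (S-norm).
# In practice the cohort is a set of embeddings from held-out impostor
# speakers; here it is random data for illustration only.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def s_norm(enroll, test, cohort):
    raw = cosine(enroll, test)
    e_scores = np.array([cosine(enroll, c) for c in cohort])
    t_scores = np.array([cosine(test, c) for c in cohort])
    return 0.5 * ((raw - e_scores.mean()) / e_scores.std()
                  + (raw - t_scores.mean()) / t_scores.std())

rng = np.random.default_rng(0)
enroll, test = rng.normal(size=256), rng.normal(size=256)
cohort = rng.normal(size=(200, 256))
print(s_norm(enroll, test, cohort))   # normalized verification score
```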

Comparison of user models based on GMM-UBM and i-vectors for speech, handwriting, and gait assessment of Parkinson’s disease patients

Title Comparison of user models based on GMM-UBM and i-vectors for speech, handwriting, and gait assessment of Parkinson’s disease patients
Authors J. C. Vasquez-Correa, T. Bocklet, J. R. Orozco-Arroyave, E. Nöth
Abstract Parkinson’s disease is a neurodegenerative disorder characterized by the presence of different motor impairments. Information from speech, handwriting, and gait signals has been considered to evaluate the neurological state of the patients. At the same time, user models based on Gaussian mixture models - universal background models (GMM-UBM) and i-vectors are considered the state of the art in biometric applications like speaker verification because they are able to model specific speaker traits. This study introduces the use of GMM-UBM and i-vectors to evaluate the neurological state of Parkinson’s patients using information from speech, handwriting, and gait. The results show the importance of different feature sets from each type of signal in the assessment of the neurological state of the patients.
Tasks Speaker Verification
Published 2020-02-13
URL https://arxiv.org/abs/2002.05412v1
PDF https://arxiv.org/pdf/2002.05412v1.pdf
PWC https://paperswithcode.com/paper/comparison-of-user-models-based-on-gmm-ubm
Repo
Framework
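
A GMM-UBM user model fits a universal background model on pooled data from many subjects, adapts it to each user, and scores new samples by comparing likelihoods under the adapted model and the UBM. A rough scikit-learn sketch using mean-only MAP adaptation on synthetic features; the relevance factor, component count, and feature dimensionality are assumptions, and feature extraction (MFCCs, kinematic features, etc.) is out of scope:

```python
# Sketch of a GMM-UBM user model with mean-only MAP adaptation and
# average log-likelihood-ratio scoring.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0):
    """Mean-only MAP adaptation (relevance factor r is an assumed default)."""
    post = ubm.predict_proba(X)                     # (N, K) responsibilities
    n_k = post.sum(axis=0)                          # soft counts per component
    ex = (post.T @ X) / np.maximum(n_k[:, None], 1e-8)
    alpha = (n_k / (n_k + r))[:, None]
    user = GaussianMixture(n_components=ubm.n_components,
                           covariance_type=ubm.covariance_type)
    # copy UBM parameters, then shift the means toward the user's data
    user.weights_, user.covariances_ = ubm.weights_, ubm.covariances_
    user.precisions_cholesky_ = ubm.precisions_cholesky_
    user.means_ = alpha * ex + (1 - alpha) * ubm.means_
    return user

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(
    rng.normal(size=(5000, 13)))                    # pooled background data
user = map_adapt_means(ubm, rng.normal(0.3, 1.0, size=(300, 13)))
X_test = rng.normal(0.3, 1.0, size=(100, 13))
llr = user.score(X_test) - ubm.score(X_test)        # avg log-likelihood ratio
print(llr)
```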

A Speaker Verification Backend for Improved Calibration Performance across Varying Conditions

Title A Speaker Verification Backend for Improved Calibration Performance across Varying Conditions
Authors Luciana Ferrer, Mitchell McLaren
Abstract In recent work, we presented a discriminative backend for speaker verification that achieved good out-of-the-box calibration performance on most tested conditions containing varying levels of mismatch to the training conditions. This backend mimics the standard PLDA-based backend process used in most current speaker verification systems, including the calibration stage. All parameters of the backend are jointly trained to optimize the binary cross-entropy for the speaker verification task. Calibration robustness is achieved by making the parameters of the calibration stage a function of vectors representing the conditions of the signal, which are extracted using a model trained to predict condition labels. In this work, we propose a simplified version of this backend in which the vectors used to compute the calibration parameters are estimated within the backend itself, without the need for a condition prediction model. We show that this simplified method provides performance similar to the previously proposed method while being simpler to implement and imposing fewer requirements on the training data. Further, we analyze different aspects of the method, including the effect of initialization, the nature of the vectors used to compute the calibration parameters, and the effect that the random seed and the number of training epochs have on performance. We also compare the proposed method with the trial-based calibration (TBC) method that, to our knowledge, was the state of the art for achieving good calibration across varying conditions. We show that the proposed method outperforms TBC while also being several orders of magnitude faster to run, comparable in speed to the standard PLDA baseline.
Tasks Calibration, Speaker Verification
Published 2020-02-05
URL https://arxiv.org/abs/2002.03802v1
PDF https://arxiv.org/pdf/2002.03802v1.pdf
PWC https://paperswithcode.com/paper/a-speaker-verification-backend-for-improved
Repo
Framework
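
The core mechanism is making the calibration parameters (a score scale and offset) functions of a per-trial condition vector and training everything with the binary cross-entropy over target/impostor labels. A hedged PyTorch sketch of that idea; the dimensions, data, and linear form of the condition mapping are placeholders, not the paper's exact setup:

```python
# Sketch of condition-dependent score calibration trained with BCE.
import torch
import torch.nn as nn

class ConditionalCalibration(nn.Module):
    def __init__(self, cond_dim: int):
        super().__init__()
        self.alpha = nn.Linear(cond_dim, 1)   # score scale from conditions
        self.beta = nn.Linear(cond_dim, 1)    # score offset from conditions

    def forward(self, score, cond):
        # calibrated logit = alpha(cond) * raw_score + beta(cond)
        return self.alpha(cond).squeeze(-1) * score + self.beta(cond).squeeze(-1)

model = ConditionalCalibration(cond_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

scores = torch.randn(512)                  # raw backend scores
conds = torch.randn(512, 8)                # per-trial condition vectors
labels = torch.randint(0, 2, (512,)).float()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(scores, conds), labels)
    loss.backward()
    opt.step()
```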

Masked cross self-attention encoding for deep speaker embedding

Title Masked cross self-attention encoding for deep speaker embedding
Authors Soonshin Seo, Daniel Jun Rim, Junseok Oh, Ji-Hwan Kim
Abstract In general, speaker verification tasks require the extraction of a speaker embedding from a deep neural network. Because a speaker embedding may contain additional information, such as noise, besides speaker information, its variability needs to be controlled. Our previous model used multiple pooling layers based on shortcut connections to amplify speaker information by deepening the dimension; however, the problem of variability remained. In this paper, we propose a masked cross self-attention encoding (MCSAE) for deep speaker embedding. This method controls the variability of the speaker embedding by letting the masked outputs of the multiple pooling layers attend to one another. The output of the MCSAE is used to construct the deep speaker embedding. Experimental results on the VoxCeleb dataset demonstrate that the proposed approach improves performance compared with previous state-of-the-art models.
Tasks Speaker Verification
Published 2020-01-28
URL https://arxiv.org/abs/2001.10817v1
PDF https://arxiv.org/pdf/2001.10817v1.pdf
PWC https://paperswithcode.com/paper/masked-cross-self-attention-encoding-for-deep
Repo
Framework
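
The abstract does not spell out the MCSAE architecture, so the sketch below shows only the generic self-attentive pooling building block that such encoders extend: frame-level features are weighted by learned attention (optionally masked to ignore padded frames) and summed into a fixed-size utterance embedding.

```python
# Generic self-attentive pooling for speaker embeddings; an illustrative
# building block, not the MCSAE itself.
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x, mask=None):
        # x: (batch, frames, feat_dim); mask: (batch, frames) booleans
        w = self.attn(x).squeeze(-1)                 # (batch, frames)
        if mask is not None:
            w = w.masked_fill(~mask, float("-inf"))  # ignore padded frames
        w = torch.softmax(w, dim=-1)
        return (w.unsqueeze(-1) * x).sum(dim=1)      # (batch, feat_dim)

pool = SelfAttentivePooling(feat_dim=512)
emb = pool(torch.randn(4, 200, 512))                 # utterance embeddings
```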

Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

Title Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting
Authors Pongpisit Thanasutives, Ken-ichi Fukui, Masayuki Numao, Boonserm Kijsirikul
Abstract In this paper, we propose two modified neural network architectures, based on SFANet and SegNet respectively, for accurate and efficient crowd counting. Inspired by SFANet, the first model, called M-SFANet, attaches two novel multi-scale-aware modules, ASSP and CAN. The encoder of M-SFANet is enhanced with ASSP, which contains parallel atrous convolutions with different sampling rates and is hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage a contextual module, CAN, which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on the SFANet decoder structure, the M-SFANet decoder has dual paths, for density map generation and attention map generation. The second model, called M-SegNet, simply replaces the bilinear upsampling used in SFANet with the max unpooling originally from SegNet, yielding a faster model with competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module, so as not to increase complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We also conduct extensive experiments on four crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that could outperform some state-of-the-art crowd counting methods.
Tasks Crowd Counting
Published 2020-03-12
URL https://arxiv.org/abs/2003.05586v3
PDF https://arxiv.org/pdf/2003.05586v3.pdf
PWC https://paperswithcode.com/paper/encoder-decoder-based-convolutional-neural
Repo
Framework
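
The key property of the ASSP module is running parallel atrous (dilated) convolutions with different sampling rates over the same feature map, so each branch sees a different receptive field and the fused output carries multi-scale context. A minimal PyTorch sketch; the channel counts and dilation rates are assumptions, not the paper's configuration:

```python
# Sketch of the multi-scale idea behind ASSP-style modules.
import torch
import torch.nn as nn

class AtrousPyramid(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # each branch sees a different receptive field -> multi-scale context
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 256, 32, 32)    # encoder feature map
out = AtrousPyramid(256, 64)(feat)    # (1, 64, 32, 32)
```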

Hydrological time series forecasting using simple combinations: Big data testing and investigations on one-year ahead river flow predictability

Title Hydrological time series forecasting using simple combinations: Big data testing and investigations on one-year ahead river flow predictability
Authors Georgia Papacharalampous, Hristos Tyralis
Abstract Delivering useful hydrological forecasts is critical for urban and agricultural water management, hydropower generation, flood protection and management, drought mitigation and alleviation, and river basin planning and management, among others. In this work, we present and appraise a new methodology for hydrological time series forecasting based on simple combinations. The appraisal uses a big dataset consisting of 90-year-long mean annual river flow time series from approximately 600 stations. Covering large parts of North America and Europe, these stations represent various climate and catchment characteristics, and thus can collectively support benchmarking. Five individual forecasting methods and 26 variants of the introduced methodology are applied to each time series in one-step-ahead forecasting mode. The individual methods are the last-observation benchmark, simple exponential smoothing, complex exponential smoothing, automatic autoregressive fractionally integrated moving average (ARFIMA), and Facebook’s Prophet, while the 26 variants are defined by all the possible combinations (of two, three, four, or five) of the five aforementioned methods. The findings have both practical and theoretical implications. The simple methodology of the study is identified as well-performing in the long run. Our large-scale results are additionally exploited to find an interpretable relationship between predictive performance and temporal dependence in the river flow time series, and to examine one-year-ahead river flow predictability.
Tasks Time Series, Time Series Forecasting
Published 2020-01-02
URL https://arxiv.org/abs/2001.00811v1
PDF https://arxiv.org/pdf/2001.00811v1.pdf
PWC https://paperswithcode.com/paper/hydrological-time-series-forecasting-using
Repo
Framework
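
The combination scheme itself is deliberately simple: run several individual forecasting methods and average their one-step-ahead forecasts. A toy sketch with two of the five methods (the last-observation benchmark and simple exponential smoothing) on a synthetic 90-year series; combinations of three, four, or five methods work the same way:

```python
# Sketch of a simple forecast combination: the mean of the individual
# methods' one-step-ahead forecasts. The smoothing parameter and the
# synthetic series are assumptions for illustration.
import numpy as np

def naive_last(series):
    return series[-1]                      # last-observation benchmark

def ses(series, alpha=0.2):
    """Simple exponential smoothing, one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def combine(series, methods):
    return np.mean([m(series) for m in methods])

flow = np.random.default_rng(0).gamma(2.0, 50.0, size=90)  # 90-year series
print(combine(flow, [naive_last, ses]))
```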

A Review of Computational Approaches for Evaluation of Rehabilitation Exercises

Title A Review of Computational Approaches for Evaluation of Rehabilitation Exercises
Authors Yalin Liao, Aleksandar Vakanski, Min Xian, David Paul, Russell Baker
Abstract Recent advances in data analytics and computer-aided diagnostics stimulate the vision of patient-centric precision healthcare, where treatment plans are customized based on the health records and needs of every patient. In physical rehabilitation, the progress in machine learning and the advent of affordable and reliable motion capture sensors have been conducive to the development of approaches for automated assessment of patient performance and progress toward functional recovery. The presented study reviews computational approaches for evaluating patient performance in rehabilitation programs using motion capture systems. Such approaches will play an important role in supplementing traditional rehabilitation assessment performed by trained clinicians, and in assisting patients participating in home-based rehabilitation. The reviewed computational methods for exercise evaluation are grouped into three main categories: discrete movement score, rule-based, and template-based approaches. The review places an emphasis on the application of machine learning methods for movement evaluation in rehabilitation. Related work in the literature on data representation, feature engineering, movement segmentation, and scoring functions is presented. The study also reviews existing sensors for capturing rehabilitation movements and provides an informative listing of pertinent benchmark datasets. The significance of this paper is in being the first to provide a comprehensive review of computational methods for evaluation of patient performance in rehabilitation programs.
Tasks Feature Engineering, Motion Capture
Published 2020-02-29
URL https://arxiv.org/abs/2003.08767v2
PDF https://arxiv.org/pdf/2003.08767v2.pdf
PWC https://paperswithcode.com/paper/a-review-of-computational-approaches-for
Repo
Framework
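
As one concrete instance of the template-based category, the sketch below computes a dynamic time warping (DTW) distance between a patient's joint-angle trajectory and a reference template, where a smaller distance suggests execution closer to the template. This illustrates the category generically and is not taken from any specific reviewed paper.

```python
# Illustrative DTW distance for template-based movement evaluation.
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.sin(np.linspace(0, np.pi, 100))        # reference repetition
attempt = np.sin(np.linspace(0, np.pi, 80)) + 0.05   # patient repetition
print(dtw(template, attempt))                        # smaller = closer
```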

A Single RGB Camera Based Gait Analysis with a Mobile Tele-Robot for Healthcare

Title A Single RGB Camera Based Gait Analysis with a Mobile Tele-Robot for Healthcare
Authors Ziyang Wang
Abstract With the increasing awareness of high-quality life, there is a growing need for health monitoring devices running robust algorithms in home environments. Health monitoring technologies enable real-time analysis of users’ health status, offering long-term healthcare support and reducing hospitalization time. The purpose of this work is twofold. On the software side, we focus on the analysis of gait, which is widely adopted for joint correction and for assessing lower-limb and spinal problems. On the hardware side, we design a novel marker-less gait analysis device using a low-cost RGB camera mounted on a mobile tele-robot. As gait analysis with a single camera is much more challenging than previous work using multiple cameras, an RGB-D camera, or wearable sensors, we propose using vision-based human pose estimation approaches. More specifically, based on the output of two state-of-the-art human pose estimation models (Openpose and VNect), we devise measurements for four bespoke gait parameters: inversion/eversion, dorsiflexion/plantarflexion, and ankle and foot progression angles. We thereby classify walking patterns into normal, supination, pronation, and limp. We also illustrate how to run the proposed machine learning models in low-resource environments such as a single entry-level CPU. Experiments show that our single-RGB-camera method achieves performance competitive with state-of-the-art methods based on depth cameras or multi-camera motion capture systems, at a lower hardware cost.
Tasks Motion Capture, Pose Estimation
Published 2020-02-11
URL https://arxiv.org/abs/2002.04700v4
PDF https://arxiv.org/pdf/2002.04700v4.pdf
PWC https://paperswithcode.com/paper/a-single-rgb-camera-based-gait-analysis-with
Repo
Framework
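
A gait parameter such as dorsiflexion/plantarflexion can be approximated from 2D keypoints as the angle between the shank (knee to ankle) and foot (ankle to toe) vectors. A small sketch with placeholder coordinates standing in for Openpose/VNect outputs; the exact measurement definitions in the paper may differ:

```python
# Sketch of deriving a joint angle from 2D pose keypoints.
import numpy as np

def joint_angle(p_knee, p_ankle, p_toe):
    shank = np.asarray(p_knee) - np.asarray(p_ankle)
    foot = np.asarray(p_toe) - np.asarray(p_ankle)
    cos = shank @ foot / (np.linalg.norm(shank) * np.linalg.norm(foot))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# e.g. compare per-frame angles against normal ranges over a gait cycle
print(joint_angle((320, 200), (325, 300), (360, 310)))
```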

Depth-Based Selective Blurring in Stereo Images Using Accelerated Framework

Title Depth-Based Selective Blurring in Stereo Images Using Accelerated Framework
Authors Subhayan Mukherjee, Ram Mohana Reddy Guddeti
Abstract We propose a hybrid method for stereo disparity estimation that combines block-based and region-based stereo matching approaches. It generates dense depth maps from disparity measurements of only 18% of the image pixels (left or right). The methodology involves segmenting pixel lightness values using a fast K-Means implementation, refining segment boundaries using morphological filtering and connected-components analysis, and then determining the boundaries’ disparities using the sum of absolute differences (SAD) cost function. Complete disparity maps are reconstructed from the boundary disparities. We consider an application of our method to depth-based selective blurring of non-interest regions of stereo images, using Gaussian blur to de-focus users’ non-interest regions. Experiments on the Middlebury dataset demonstrate that our method outperforms traditional disparity estimation approaches using SAD and normalized cross-correlation by up to 33.6% and some recent methods by up to 6.1%. Further, our method is highly parallelizable on a CPU and GPU framework based on the Java Thread Pool and APARAPI, with a speed-up of 5.8× for 250 stereo video frames (4,096 x 2,304).
Tasks Disparity Estimation, Stereo Matching
Published 2020-01-21
URL https://arxiv.org/abs/2001.07809v1
PDF https://arxiv.org/pdf/2001.07809v1.pdf
PWC https://paperswithcode.com/paper/depth-based-selective-blurring-in-stereo
Repo
Framework
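
The SAD cost at the heart of the boundary-disparity step slides a block along the corresponding scanline of the other image and keeps the offset with the lowest sum of absolute differences. A minimal sketch on a synthetic pair; the block size and search range are assumptions:

```python
# Sketch of SAD block matching for one boundary pixel's disparity.
import numpy as np

def sad_disparity(left, right, y, x, block=5, max_disp=64):
    h = block // 2
    ref = left[y - h:y + h + 1, x - h:x + h + 1].astype(np.float32)
    best, best_d = np.inf, 0
    for d in range(min(max_disp, x - h) + 1):
        cand = right[y - h:y + h + 1,
                     x - d - h:x - d + h + 1].astype(np.float32)
        cost = np.abs(ref - cand).sum()     # sum of absolute differences
        if cost < best:
            best, best_d = cost, d
    return best_d

rng = np.random.default_rng(0)
L = rng.integers(0, 255, (100, 100)).astype(np.uint8)
R = np.roll(L, -7, axis=1)                  # synthetic 7-pixel disparity
print(sad_disparity(L, R, y=50, x=60))      # expect 7
```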

Two Applications of Deep Learning in the Physical Layer of Communication Systems

Title Two Applications of Deep Learning in the Physical Layer of Communication Systems
Authors Emil Björnson, Pontus Giselsson
Abstract Deep learning has proved itself to be a powerful tool for developing data-driven signal processing algorithms for challenging engineering problems. By learning the key features and characteristics of the input signals, instead of requiring a human to first identify and model them, learned algorithms can beat many man-made algorithms. In particular, deep neural networks are capable of learning the complicated features in nature-made signals, such as photos and audio recordings, and of using them for classification and decision making. The situation is rather different in communication systems, where the information signals are man-made, the propagation channels are relatively easy to model, and we know how to operate close to the Shannon capacity limits. Does this mean that there is no role for deep learning in the development of future communication systems?
Tasks Decision Making
Published 2020-01-10
URL https://arxiv.org/abs/2001.03350v1
PDF https://arxiv.org/pdf/2001.03350v1.pdf
PWC https://paperswithcode.com/paper/two-applications-of-deep-learning-in-the
Repo
Framework

A Principled Approach to Learning Stochastic Representations for Privacy in Deep Neural Inference

Title A Principled Approach to Learning Stochastic Representations for Privacy in Deep Neural Inference
Authors Fatemehsadat Mireshghallah, Mohammadkazem Taram, Ali Jalali, Ahmed Taha Elthakeb, Dean Tullsen, Hadi Esmaeilzadeh
Abstract INFerence-as-a-Service (INFaaS) in the cloud has enabled the prevalent use of Deep Neural Networks (DNNs) in home automation, targeted advertising, machine vision, etc. The cloud receives the inference request as a raw input containing a rich set of private information that can be misused or leaked, possibly inadvertently. This prevalent setting can compromise the privacy of users during the inference phase. This paper sets out to provide a principled approach, dubbed Cloak, that finds optimal stochastic perturbations to obfuscate private data before it is sent to the cloud. To this end, Cloak reduces the information content of the transmitted data while conserving the essential pieces that enable the request to be serviced accurately. The key idea is formulating the discovery of this stochasticity as an offline gradient-based optimization problem that reformulates a pre-trained DNN (with optimized known weights) as an analytical function of the stochastic perturbations. Using the Laplace distribution as a parametric model for the stochastic perturbations, Cloak learns the optimal parameters using gradient descent and Monte Carlo sampling. This set of optimized Laplace distributions further guarantees that the injected stochasticity satisfies the ε-differential privacy criterion. Experimental evaluations with real-world datasets show that, on average, the injected stochasticity can reduce the information content in the input data by 80.07%, while incurring a 7.12% accuracy loss.
Tasks
Published 2020-03-26
URL https://arxiv.org/abs/2003.12154v1
PDF https://arxiv.org/pdf/2003.12154v1.pdf
PWC https://paperswithcode.com/paper/a-principled-approach-to-learning-stochastic
Repo
Framework
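
The optimization can be pictured as follows: Laplace noise with learnable per-feature scales is added to the input of a frozen pre-trained model, and gradient descent (with Monte Carlo sampling of the noise via the reparameterization trick) trades noise magnitude against task loss. A hedged PyTorch sketch with a toy model; the trade-off weight and architecture are assumptions, not Cloak's actual configuration:

```python
# Sketch of learning Laplace perturbation scales against a frozen model.
import torch
import torch.nn as nn

frozen = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
for p in frozen.parameters():
    p.requires_grad_(False)               # pre-trained weights stay fixed

log_b = nn.Parameter(torch.zeros(784))    # learnable Laplace scales
opt = torch.optim.Adam([log_b], lr=1e-2)
x = torch.randn(64, 784)                  # stand-in private inputs
y = torch.randint(0, 10, (64,))
lam = 0.01                                # assumed privacy/utility weight

for _ in range(200):
    opt.zero_grad()
    b = log_b.exp()
    noise = torch.distributions.Laplace(0.0, b).rsample((x.size(0),))
    # task loss keeps accuracy; -lam * sum(log b) rewards larger noise
    loss = (nn.functional.cross_entropy(frozen(x + noise), y)
            - lam * log_b.sum())
    loss.backward()                       # Monte Carlo gradient estimate
    opt.step()
```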

SirenLess: reveal the intention behind news

Title SirenLess: reveal the intention behind news
Authors Xumeng Chen, Leo Yu-Ho Lo, Huamin Qu
Abstract News articles tend to be increasingly misleading nowadays, preventing readers from making objective judgments about certain events. While some machine learning approaches have been proposed to detect misleading news, most of them are black boxes that provide limited help for humans in decision making. In this paper, we present SirenLess, a visual analytics system for misleading news detection based on linguistic features. The system features an article explorer, a novel interactive tool that integrates news metadata and linguistic features to reveal the semantic structure of news articles and facilitate textual analysis. We use SirenLess to analyze 18 news articles from different sources and summarize several helpful patterns for misleading news detection. A user study with journalism professionals and university students confirms the usefulness and effectiveness of our system.
Tasks Decision Making
Published 2020-01-08
URL https://arxiv.org/abs/2001.02731v1
PDF https://arxiv.org/pdf/2001.02731v1.pdf
PWC https://paperswithcode.com/paper/sirenless-reveal-the-intention-behind-news
Repo
Framework

ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Title ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Authors Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti
Abstract In this paper, we introduce a new vision-language pre-trained model, ImageBERT, for image-text joint embedding. Our model is Transformer-based: it takes different modalities as input and models the relationships between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image-Text Matching (ITM). To further enhance pre-training quality, we collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from the Web. We first pre-train the model on this dataset, then conduct a second-stage pre-training on Conceptual Captions and SBU Captions. Our experiments show that this multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks, achieving new state-of-the-art results on both the MSCOCO and Flickr30k datasets.
Tasks Image Retrieval, Language Modelling, Object Classification, Text Matching
Published 2020-01-22
URL https://arxiv.org/abs/2001.07966v2
PDF https://arxiv.org/pdf/2001.07966v2.pdf
PWC https://paperswithcode.com/paper/imagebert-cross-modal-pre-training-with-large
Repo
Framework
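
Of the four pre-training tasks, Masked Language Modeling is the most familiar: mask a fraction of the input tokens and train the model to recover them from context. A minimal sketch with a tiny stand-in transformer (not the ImageBERT architecture); the other three objectives add analogous losses over image regions and image-text pairs:

```python
# Sketch of the MLM objective: predict the original ids of masked tokens.
import torch
import torch.nn as nn

VOCAB, MASK_ID, D = 30522, 103, 256       # assumed BERT-like vocab/mask id
embed = nn.Embedding(VOCAB, D)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), 2)
head = nn.Linear(D, VOCAB)

tokens = torch.randint(1000, VOCAB, (8, 32))      # (batch, seq) stand-in text
mask = torch.rand(tokens.shape) < 0.15            # 15% masking rate
inp = tokens.masked_fill(mask, MASK_ID)

logits = head(encoder(embed(inp)))                # (batch, seq, vocab)
loss = nn.functional.cross_entropy(               # loss on masked slots only
    logits[mask], tokens[mask])
loss.backward()
```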

Efficient Programmable Random Variate Generation Accelerator from Sensor Noise

Title Efficient Programmable Random Variate Generation Accelerator from Sensor Noise
Authors James Timothy Meech, Phillip Stanley-Marbell
Abstract We introduce a method for non-uniform random number generation based on sampling a physical process in a controlled environment. We demonstrate one proof-of-concept implementation of the method that reduces the error of Monte Carlo integration of a univariate Gaussian by 1068 times while doubling the speed of the Monte Carlo simulation. We show that the supply voltage and temperature of the physical process must be controlled to prevent the mean and standard deviation of the random number generator from drifting.
Tasks
Published 2020-01-10
URL https://arxiv.org/abs/2001.05400v1
PDF https://arxiv.org/pdf/2001.05400v1.pdf
PWC https://paperswithcode.com/paper/efficient-programmable-random-variate
Repo
Framework
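
The reported error reduction concerns Monte Carlo integration of a univariate Gaussian. The sketch below reproduces the evaluation idea in NumPy, with an inverse-CDF transform of a uniform source standing in for the hardware generator; the actual paper draws its variates from a physical process:

```python
# Sketch of the evaluation: estimate P(-1 < X < 1) for X ~ N(0, 1) by
# Monte Carlo and measure the integration error of the sample source.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000
truth = norm.cdf(1) - norm.cdf(-1)            # exact value of the integral

# Gaussian variates via inverse-CDF of a uniform source
samples = norm.ppf(rng.uniform(size=n))
estimate = np.mean((samples > -1) & (samples < 1))
print(abs(estimate - truth))                  # Monte Carlo integration error
```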