April 2, 2020

# Paper Group ANR 305


#### Intrinsic Dimension Estimation via Nearest Constrained Subspace Classifier

Title Intrinsic Dimension Estimation via Nearest Constrained Subspace Classifier
Authors Liang Liao, Stephen John Maybank
Abstract We consider the problems of classification and intrinsic dimension estimation on image data. A new subspace based classifier is proposed for supervised classification or intrinsic dimension estimation. The distribution of the data in each class is modeled by a union of a finite number of affine subspaces of the feature space. The affine subspaces have a common dimension, which is assumed to be much less than the dimension of the feature space. The subspaces are found using regression based on the L0-norm. The proposed method is a generalisation of the classical NN (Nearest Neighbor) and NFL (Nearest Feature Line) classifiers and has a close relationship to the NS (Nearest Subspace) classifier. The proposed classifier with an accurately estimated dimension parameter generally outperforms its competitors in terms of classification accuracy. We also propose a fast version of the classifier using a neighborhood representation to reduce its computational complexity. Experiments on publicly available datasets corroborate these claims.
Published 2020-02-08
URL https://arxiv.org/abs/2002.03228v1
PDF https://arxiv.org/pdf/2002.03228v1.pdf
PWC https://paperswithcode.com/paper/intrinsic-dimension-estimation-via-nearest
Repo
Framework
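
The abstract describes the proposed classifier as a generalisation of NN and NFL. As a rough illustration of that family (not the authors' constrained-subspace method), the sketch below implements a plain Nearest Feature Line rule: a query is assigned to the class owning the closest line through any pair of its training points. The function names and the toy data layout (a dict from label to points) are assumptions made for this example.

```python
import math
from itertools import combinations


def distance_to_line(q, a, b):
    """Euclidean distance from point q to the line through points a and b."""
    direction = [bi - ai for ai, bi in zip(a, b)]
    offset = [qi - ai for ai, qi in zip(a, q)]
    length_sq = sum(d * d for d in direction)
    if length_sq == 0.0:
        # Degenerate line (a == b): fall back to the point-to-point distance.
        return math.dist(q, a)
    # Position of the orthogonal projection of q along the line.
    t = sum(d * o for d, o in zip(direction, offset)) / length_sq
    projection = [ai + t * d for ai, d in zip(a, direction)]
    return math.dist(q, projection)


def nearest_feature_line(q, train):
    """Return (label, distance) of the closest feature line.

    train maps each class label to a list of points; every pair of points
    in a class defines one candidate line (a 1-dimensional affine subspace).
    Classes with fewer than two points contribute no lines in this sketch.
    """
    best_label, best_dist = None, math.inf
    for label, points in train.items():
        for a, b in combinations(points, 2):
            dist = distance_to_line(q, a, b)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist
```

Replacing the line by a higher-dimensional affine subspace would require a least-squares projection instead of the scalar projection used here.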

#### Federated Over-the-Air Subspace Learning from Incomplete Data

Title Federated Over-the-Air Subspace Learning from Incomplete Data
Authors Praneeth Narayanamurthy, Namrata Vaswani, Aditya Ramamoorthy
Abstract Federated learning refers to a distributed learning scenario in which users/nodes keep their data private but only share intermediate locally computed iterates with the master node. The master, in turn, shares a global aggregate of these iterates with all the nodes at each iteration. In this work, we consider a wireless federated learning scenario where the nodes communicate to and from the master node via a wireless channel. Current and upcoming technologies such as 5G (and beyond) will operate mostly in a non-orthogonal multiple access (NOMA) mode where transmissions from the users occupy the same bandwidth and interfere at the access point. These technologies naturally lend themselves to an “over-the-air” superposition whereby information received from the user nodes can be directly summed at the master node. However, over-the-air aggregation also means that the channel noise can corrupt the algorithm iterates at the time of aggregation at the master. This iteration noise introduces a novel set of challenges that have not been previously studied in the literature. It needs to be treated differently from the well-studied setting of noise or corruption in the dataset itself. In this work, we first study the subspace learning problem in a federated over-the-air setting. Subspace learning involves computing the subspace spanned by the top $r$ singular vectors of a given matrix. We develop a federated over-the-air version of the power method (FedPM) and show that its iterates converge as long as (i) the channel noise is very small compared to the $r$-th singular value of the matrix; and (ii) the ratio between its $(r+1)$-th and $r$-th singular value is smaller than a constant less than one. The second important contribution of this work is to show how over-the-air FedPM can be used to obtain a provably accurate federated solution for subspace tracking in the presence of missing data.
Published 2020-02-28
URL https://arxiv.org/abs/2002.12873v1
PDF https://arxiv.org/pdf/2002.12873v1.pdf
PWC https://paperswithcode.com/paper/federated-over-the-air-subspace-learning-from
Repo
Framework
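
To make the idea of iteration noise concrete, here is a minimal sketch of a power method in which each node contributes a local product and the master receives the sum corrupted by additive Gaussian noise. It covers only the rank-one case ($r = 1$) and assumes each node holds a list of sample rows; the function names, the noise model, and the default parameters are illustrative assumptions, not the FedPM algorithm as specified in the paper.

```python
import math
import random


def local_product(rows, u):
    """Compute (X^T X) u for one node, where X is given as a list of sample rows."""
    out = [0.0] * len(u)
    for x in rows:
        s = sum(xi * ui for xi, ui in zip(x, u))
        for j, xj in enumerate(x):
            out[j] += s * xj
    return out


def noisy_federated_power_method(node_data, dim, iters=50, noise_std=1e-3, seed=0):
    """Estimate the top eigenvector of sum_k X_k^T X_k from noisy aggregates."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in u))
    u = [v / norm for v in u]
    for _ in range(iters):
        # Each node computes its product locally; raw data never leaves the node.
        local = [local_product(rows, u) for rows in node_data]
        # Over-the-air aggregation: the sum arrives with channel noise added.
        agg = [sum(vals) + rng.gauss(0.0, noise_std) for vals in zip(*local)]
        norm = math.sqrt(sum(v * v for v in agg))
        u = [v / norm for v in agg]
    return u
```

In this toy setting the estimate stays close to the true direction as long as `noise_std` is small relative to the top eigenvalue, which mirrors condition (i) in the abstract. For $r > 1$ the normalisation step would be replaced by an orthonormalisation such as QR.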

#### Computationally Efficient NER Taggers with Combined Embeddings and Constrained Decoding

Title Computationally Efficient NER Taggers with Combined Embeddings and Constrained Decoding
Authors Brian Lester, Daniel Pressel, Amy Hemmeter, Sagnik Ray Choudhury
Abstract Current state-of-the-art models in Named Entity Recognition (NER) are neural models with a Conditional Random Field (CRF) as the final network layer, and pre-trained “contextual embeddings”. The CRF layer is used to facilitate global coherence between labels, and the contextual embeddings provide a better representation of words in context. However, both of these improvements come at a high computational cost. In this work, we explore two simple techniques that substantially improve NER performance over a strong baseline with negligible cost. First, we use multiple pre-trained embeddings as word representations via concatenation. Second, we constrain the tagger, trained using a cross-entropy loss, during decoding to eliminate illegal transitions. While training a tagger on CoNLL 2003 we find a 786% speed-up over a contextual embeddings-based tagger without sacrificing strong performance. We also show that the concatenation technique works across multiple tasks and datasets. We analyze aspects of similarity and coverage between pre-trained embeddings and the dynamics of tag co-occurrence to explain why these techniques work. We provide an open source implementation of our tagger using these techniques in three popular deep learning frameworks: TensorFlow, PyTorch, and DyNet.
Published 2020-01-05
URL https://arxiv.org/abs/2001.01167v1
PDF https://arxiv.org/pdf/2001.01167v1.pdf
PWC https://paperswithcode.com/paper/computationally-efficient-ner-taggers-with
Repo
Framework
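
The second technique, constraining decoding to eliminate illegal transitions, can be sketched as a Viterbi search with a hard mask and no learned transition scores. The example assumes the IOB2 tagging scheme (an `I-X` tag may only follow `B-X` or `I-X`) and per-token log scores supplied as dictionaries; the function names are hypothetical and this is not the authors' released implementation.

```python
import math


def is_legal(prev, curr):
    """IOB2 rule: an I-X tag may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in ("B-" + entity, "I-" + entity)
    return True


def constrained_decode(scores, tags):
    """Best tag path under hard transition constraints.

    scores is a list with one dict per token, mapping tag -> log score
    from a tagger trained with a per-token cross-entropy loss.
    The tag set must contain B-X for every I-X it contains.
    """
    # The sentence start behaves like a preceding "O" tag.
    best = {t: scores[0][t] if is_legal("O", t) else -math.inf for t in tags}
    backpointers = []
    for step in scores[1:]:
        new_best, pointers = {}, {}
        for curr in tags:
            # Only legal predecessors are considered: this is the mask.
            score, prev = max((best[p], p) for p in tags if is_legal(p, curr))
            new_best[curr] = score + step[curr]
            pointers[curr] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the highest-scoring final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

A purely greedy per-token argmax over the same scores can emit sequences such as `O, I-PER`, which the mask rules out.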

#### Approximate Cross-validation: Guarantees for Model Assessment and Selection

Title Approximate Cross-validation: Guarantees for Model Assessment and Selection
Authors Ashia Wilson, Maximilian Kasy, Lester Mackey
Abstract Cross-validation (CV) is a popular approach for assessing and selecting predictive models. However, when the number of folds is large, CV suffers from a need to repeatedly refit a learning procedure on a large number of training datasets. Recent work in empirical risk minimization (ERM) approximates the expensive refitting with a single Newton step warm-started from the full training set optimizer. While this can greatly reduce runtime, several open questions remain including whether these approximations lead to faithful model selection and whether they are suitable for non-smooth objectives. We address these questions with three main contributions: (i) we provide uniform non-asymptotic, deterministic model assessment guarantees for approximate CV; (ii) we show that (roughly) the same conditions also guarantee model selection performance comparable to CV; (iii) we provide a proximal Newton extension of the approximate CV framework for non-smooth prediction problems and develop improved assessment guarantees for problems such as l1-regularized ERM.
Published 2020-03-02
URL https://arxiv.org/abs/2003.00617v1
PDF https://arxiv.org/pdf/2003.00617v1.pdf
PWC https://paperswithcode.com/paper/approximate-cross-validation-guarantees-for
Repo
Framework
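
The Newton-step approximation mentioned in the abstract can be illustrated on a deliberately simple case: one-dimensional ridge regression with objective $\tfrac{1}{2}\sum_i (y_i - x_i w)^2 + \tfrac{\lambda}{2} w^2$. Because this objective is quadratic, a single Newton step from the full-data optimizer recovers each leave-one-out solution exactly, which makes the sketch easy to check; for general losses the step is only an approximation. The function names and the scalar setting are assumptions for this example.

```python
def ridge_fit(xs, ys, lam):
    """Exact minimizer of 0.5 * sum((y - x*w)^2) + 0.5 * lam * w^2 for scalar w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def newton_leave_one_out(xs, ys, lam):
    """Leave-one-out estimates from one Newton step warm-started at the full fit."""
    w_full = ridge_fit(xs, ys, lam)
    hessian_full = sum(x * x for x in xs) + lam
    estimates = []
    for x, y in zip(xs, ys):
        # The full gradient vanishes at w_full, so the leave-one-out gradient
        # is minus the held-out point's own gradient contribution.
        grad_loo = x * (y - x * w_full)
        # Removing the point also removes its curvature x^2.
        hess_loo = hessian_full - x * x
        estimates.append(w_full - grad_loo / hess_loo)
    return estimates


def exact_leave_one_out(xs, ys, lam):
    """Reference: refit from scratch with each point held out in turn."""
    return [
        ridge_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        for i in range(len(xs))
    ]
```

The refit loop costs one full fit per held-out point, whereas the Newton version reuses a single fit and a single Hessian.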

#### Impact of ImageNet Model Selection on Domain Adaptation

Title Impact of ImageNet Model Selection on Domain Adaptation
Authors Youshan Zhang, Brian D. Davison
Abstract Deep neural networks are widely used in image classification problems. However, little work addresses how features from different deep neural networks affect the domain adaptation problem. Existing methods often extract deep features from one ImageNet model, without exploring other neural networks. In this paper, we investigate how different ImageNet models affect transfer accuracy on domain adaptation problems. We extract features from sixteen distinct pre-trained ImageNet models and examine the performance of twelve benchmarking methods when using the features. Extensive experimental results show that a higher accuracy ImageNet model produces better features, and leads to higher accuracy on domain adaptation problems (with a correlation coefficient of up to 0.95). We also examine the architecture of each neural network to find the best layer for feature extraction. Together, performance from our features exceeds that of the state-of-the-art in three benchmark datasets.
Published 2020-02-06
URL https://arxiv.org/abs/2002.02559v1
PDF https://arxiv.org/pdf/2002.02559v1.pdf
PWC https://paperswithcode.com/paper/impact-of-imagenet-model-selection-on-domain
Repo
Framework

#### Gaussian Process Policy Optimization

Title Gaussian Process Policy Optimization
Authors Ashish Rao, Bidipta Sarkar, Tejas Narayanan
Abstract We propose a novel actor-critic, model-free reinforcement learning algorithm which employs a Bayesian method of parameter space exploration to solve environments. A Gaussian process is used to learn the expected return of a policy given the policy’s parameters. The system is trained by updating the parameters using gradient descent on a new surrogate loss function consisting of the Proximal Policy Optimization ‘Clipped’ loss function and a bonus term representing the expected improvement acquisition function given by the Gaussian process. This new method is shown to be comparable to and at times empirically outperform current algorithms on environments that simulate robotic locomotion using the MuJoCo physics engine.
Published 2020-03-02
URL https://arxiv.org/abs/2003.01074v1
PDF https://arxiv.org/pdf/2003.01074v1.pdf
PWC https://paperswithcode.com/paper/gaussian-process-policy-optimization
Repo
Framework
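
The two ingredients named in the abstract both have standard closed forms, sketched below on scalars: the PPO clipped objective $\min(\rho A, \operatorname{clip}(\rho, 1-\epsilon, 1+\epsilon) A)$ and the expected improvement of a Gaussian posterior with mean $\mu$ and standard deviation $\sigma$ over the best value seen so far. How the paper weights and differentiates the bonus is not stated in the abstract, so the `bonus_weight` parameter and the additive combination in `surrogate_loss` are assumptions made for illustration.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def expected_improvement(mean, std, best_so_far):
    """Expected improvement of a Gaussian prediction over best_so_far."""
    if std <= 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


def surrogate_loss(ratios, advantages, gp_mean, gp_std, best_return, bonus_weight=0.1):
    """Negated clipped objective minus a weighted expected-improvement bonus.

    gp_mean and gp_std stand in for a Gaussian process prediction of the
    return at the current policy parameters (hypothetical inputs here).
    """
    ppo = sum(clipped_surrogate(r, a) for r, a in zip(ratios, advantages)) / len(ratios)
    return -ppo - bonus_weight * expected_improvement(gp_mean, gp_std, best_return)
```

In an actual training loop the Gaussian process prediction would depend on the policy parameters, so the bonus would contribute its own gradient; the scalar version only shows how the terms combine.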

#### Stagewise Enlargement of Batch Size for SGD-based Learning

Title Stagewise Enlargement of Batch Size for SGD-based Learning
Authors Shen-Yi Zhao, Yin-Peng Xie, Wu-Jun Li
Abstract Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily lead to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between initialization and optimum of the model parameter. Then based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
Published 2020-02-26
URL https://arxiv.org/abs/2002.11601v2
PDF https://arxiv.org/pdf/2002.11601v2.pdf
PWC https://paperswithcode.com/paper/stagewise-enlargement-of-batch-size-for-sgd
Repo
Framework
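
A geometric batch-size schedule of the kind the abstract describes can be sketched in a few lines. The example assumes that every stage processes the same number of training samples, which is a simplification chosen here so the effect on the update count is easy to see; the stage lengths, growth factor, and function names are hypothetical rather than taken from the paper.

```python
import math


def geometric_batch_sizes(initial_batch, growth, num_stages):
    """Batch size for each stage, multiplied by `growth` from one stage to the next."""
    return [initial_batch * growth ** stage for stage in range(num_stages)]


def updates_per_stage(samples_per_stage, batch_sizes):
    """Number of parameter updates in each stage for a fixed sample budget."""
    return [math.ceil(samples_per_stage / b) for b in batch_sizes]
```

With an initial batch size of 32, a growth factor of 2, four stages, and 1024 samples per stage, the schedule is 32, 64, 128, 256 and the stages need 32, 16, 8, and 4 updates (60 in total), compared with 128 updates if the batch size stayed at 32 throughout.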

#### Mitigating Query-Flooding Parameter Duplication Attack on Regression Models with High-Dimensional Gaussian Mechanism

Title Mitigating Query-Flooding Parameter Duplication Attack on Regression Models with High-Dimensional Gaussian Mechanism
Authors Xiaoguang Li, Hui Li, Haonan Yan, Zelei Cheng, Wenhai Sun, Hui Zhu
Abstract Public intelligent services enabled by machine learning algorithms are vulnerable to model extraction attacks that can steal confidential information of the learning models through public queries. Differential privacy (DP) has been considered a promising technique to mitigate this attack. However, we find that the vulnerability persists when regression models are being protected by current DP solutions. We show that the adversary can launch a query-flooding parameter duplication (QPD) attack to infer the model information by repeated queries. To defend against the QPD attack on logistic and linear regression models, we propose a novel High-Dimensional Gaussian (HDG) mechanism to prevent unauthorized information disclosure without interrupting the intended services. In contrast to prior work, the proposed HDG mechanism will dynamically generate the privacy budget and random noise for different queries and their results to enhance the obfuscation. Besides, for the first time, HDG enables an optimal privacy budget allocation that automatically determines the minimum amount of noise to be added per user-desired privacy level on each dimension. We comprehensively evaluate the performance of HDG using real-world datasets and show that HDG effectively mitigates the QPD attack while satisfying the privacy requirements. We also plan to open-source the relevant code to the community for further research.
Published 2020-02-06
URL https://arxiv.org/abs/2002.02061v1
PDF https://arxiv.org/pdf/2002.02061v1.pdf
PWC https://paperswithcode.com/paper/mitigating-query-flooding-parameter
Repo
Framework

#### Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

Title Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning
Authors Sanna Wager, Aparna Khare, Minhua Wu, Kenichi Kumatani, Shiva Sundaram
Abstract In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word-error rate (WER) reduction of about 27.3% was achieved when using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using L2 loss. We find that pre-training improves the word error rate by 10.7% when compared to a multi-channel model directly initialized with a beamformer and mel-filter bank coefficients for the front end. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.
Published 2020-02-01
URL https://arxiv.org/abs/2002.00125v1
PDF https://arxiv.org/pdf/2002.00125v1.pdf
PWC https://paperswithcode.com/paper/fully-learnable-front-end-for-multi-channel
Repo
Framework

#### Dialogue-based simulation for cultural awareness training

Title Dialogue-based simulation for cultural awareness training
Authors Sodiq Adewole, Erfaneh Gharavi, Benjamin Shpringer, Martin Bolger, Vaibhav Sharma, Sung Ming Yang, Donald E. Brown
Abstract Existing simulations designed for cultural and interpersonal skill training rely on pre-defined responses with a menu option selection interface. Using a multiple-choice interface and restricting trainees’ responses may limit the trainees’ ability to apply the lessons in real life situations. These systems also use a simplistic evaluation model, where trainees’ selected options are marked as either correct or incorrect. This model may not capture sufficient information that could drive an adaptive feedback mechanism to improve trainees’ cultural awareness. This paper describes the design of a dialogue-based simulation for cultural awareness training. The simulation is built around a disaster management scenario involving a joint coalition between the US and the Chinese armies. Trainees were able to engage in realistic dialogue with the Chinese agent. Their responses, at different points, are evaluated by different multi-label classification models. Based on training on our dataset, the models score the trainees’ responses for cultural awareness in the Chinese culture. Trainees also get feedback on the cultural appropriateness of their responses. The results of this work showed the following: i) A feature-based evaluation model improves the design, modeling and computation of dialogue-based training simulation systems; ii) Output from current automatic speech recognition (ASR) systems gave end results comparable to those from manual transcription; iii) A multi-label classification model trained as a cultural expert gave results which were comparable with scores assigned by human annotators.
Published 2020-02-01
URL https://arxiv.org/abs/2002.00223v1
PDF https://arxiv.org/pdf/2002.00223v1.pdf
PWC https://paperswithcode.com/paper/dialogue-based-simulation-for-cultural
Repo
Framework

#### Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions

Title Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions
Authors Vasudha Kowtha, Vikramjit Mitra, Chris Bartels, Erik Marchi, Sue Booker, William Caruso, Sachin Kajarekar, Devang Naik
Abstract Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity. While modern speech technologies rely heavily on speech recognition and natural language understanding for speech content understanding, the investigation of vocal expression is increasingly gaining attention. Key considerations for building robust emotion models include characterizing and improving the extent to which a model, given its training data distribution, is able to generalize to unseen data conditions. This work investigated a long short-term memory (LSTM) network and a time convolution LSTM (TC-LSTM) to detect primitive emotion attributes such as valence, arousal, and dominance, from speech. It was observed that training with multiple datasets and using robust features improved the concordance correlation coefficient (CCC) for valence by 30% with respect to the baseline system. Additionally, this work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech, and results indicated that arousal, followed by dominance, was a better detector of such emotions.
Published 2020-01-31
URL https://arxiv.org/abs/2002.01323v1
PDF https://arxiv.org/pdf/2002.01323v1.pdf
PWC https://paperswithcode.com/paper/detecting-emotion-primitives-from-speech-and
Repo
Framework
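
The concordance correlation coefficient reported above has a standard definition, $\mathrm{CCC} = \dfrac{2\,\mathrm{cov}(x, y)}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$, which penalises both low correlation and systematic offset between predictions and labels. The sketch below computes it with population (divide by $n$) statistics; the function name is an assumption and the paper's exact evaluation code is not shown in the abstract.

```python
def concordance_correlation(pred, target):
    """Concordance correlation coefficient using population statistics.

    Undefined (division by zero) when both sequences are constant and equal.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

Unlike Pearson correlation, a constant shift lowers the score: predictions `[2, 3, 4]` against labels `[1, 2, 3]` are perfectly correlated but yield a CCC of 4/7.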

#### Continuous speech separation: dataset and analysis

Title Continuous speech separation: dataset and analysis
Authors Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Jinyu Li
Abstract This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies on speech separation use pre-segmented signals of artificially mixed speech utterances which are mostly *fully* overlapped, and the algorithms are evaluated based on signal-to-distortion ratio or similar performance metrics. However, in natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components. In addition, the signal-based metrics have very weak correlations with automatic speech recognition (ASR) accuracy. We think that not only does this make it hard to assess the practical relevance of the tested algorithms, it also hinders researchers from developing systems that can be readily applied to real scenarios. In this paper, we define continuous speech separation (CSS) as a task of generating a set of non-overlapped speech signals from a *continuous* audio stream that contains multiple utterances that are *partially* overlapped by a varying degree. A new real recorded dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones. A Kaldi-based ASR evaluation protocol is also established by using a well-trained multi-conditional acoustic model. By using this dataset, several aspects of a recently proposed speaker-independent CSS algorithm are investigated. The dataset and evaluation scripts are available to facilitate the research in this direction.
Published 2020-01-30
URL https://arxiv.org/abs/2001.11482v1
PDF https://arxiv.org/pdf/2001.11482v1.pdf
PWC https://paperswithcode.com/paper/continuous-speech-separation-dataset-and
Repo
Framework

#### Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Title Audio-Visual Decision Fusion for WFST-based and seq2seq Models
Authors Rohith Aralikatti, Sharad Roy, Abhinav Thanda, Dilip Kumar Margam, Pujitha Appan Kandala, Tanay Sharma, Shankar M Venkatesan
Abstract Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER). In such cases, information from the visual modality comprising the speaker’s lip movements can help improve the performance. In this work, we propose novel methods to fuse information from audio and visual modalities at inference time. This enables us to train the acoustic and visual models independently. First, we train separate RNN-HMM based acoustic and visual models. A common WFST generated by taking a special union of the HMM components is used for decoding using a modified Viterbi algorithm. Second, we train separate seq2seq acoustic and visual models. The decoding step is performed simultaneously for both modalities using shallow fusion while maintaining a common hypothesis beam. We also present results for a novel seq2seq fusion without the weighting parameter. We present results at varying SNR and show that our methods give significant improvements over acoustic-only WER.
Published 2020-01-29
URL https://arxiv.org/abs/2001.10832v1
PDF https://arxiv.org/pdf/2001.10832v1.pdf
PWC https://paperswithcode.com/paper/audio-visual-decision-fusion-for-wfst-based
Repo
Framework
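
Shallow fusion over a common hypothesis beam can be sketched as one beam-expansion step in which each candidate token is scored by the audio log-probability plus a weighted visual log-probability. The two model callables, the `weight` parameter, and the data layout are hypothetical stand-ins for independently trained seq2seq decoders; the weight-free variant mentioned in the abstract is not modelled here.

```python
import math


def fused_beam_step(beam, audio_logp, visual_logp, weight, beam_size):
    """Expand every hypothesis by one token using fused scores.

    beam is a list of (token_tuple, score) pairs. audio_logp and visual_logp
    are callables mapping a token prefix to a dict token -> log-probability
    over the same vocabulary.
    """
    candidates = []
    for prefix, score in beam:
        audio = audio_logp(prefix)
        visual = visual_logp(prefix)
        for token, a_lp in audio.items():
            # Shallow fusion: combine the two models at the score level only.
            fused = a_lp + weight * visual[token]
            candidates.append((prefix + (token,), score + fused))
    # Both modalities share one beam, pruned on the fused score.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```

Because fusion happens only at decoding time, either model can be retrained or swapped without touching the other, which is the independence the abstract highlights.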

#### Joint Contextual Modeling for ASR Correction and Language Understanding

Title Joint Contextual Modeling for ASR Correction and Language Understanding
Authors Yue Weng, Sai Sumanth Miryala, Chandra Khatri, Runze Wang, Huaixiu Zheng, Piero Molino, Mahdi Namazifar, Alexandros Papangelis, Hugh Williams, Franziska Bell, Gokhan Tur
Abstract The quality of automatic speech recognition (ASR) is critical to Dialogue Systems as ASR errors propagate to and directly impact downstream tasks such as language understanding (LU). In this paper, we propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with LU to improve the performance of both tasks simultaneously. To measure the effectiveness of this approach we used a public benchmark, the 2nd Dialogue State Tracking Challenge (DSTC2) corpus. As a baseline approach, we trained task-specific Statistical Language Models (SLM) and fine-tuned a state-of-the-art Generative Pre-Training (GPT) Language Model to re-rank the n-best ASR hypotheses, followed by a model to identify the dialog act and slots. i) We further trained ranker models using GPT and Hierarchical CNN-RNN models with discriminatory losses to detect the best output given n-best hypotheses. We extended these ranker models to first select the best ASR output and then identify the dialogue act and slots in an end-to-end fashion. ii) We also proposed a novel joint ASR error correction and LU model, a word confusion pointer network (WCN-Ptr) with multi-head self-attention on top, which consumes the word confusions populated from the n-best. We show that the error rates of off-the-shelf ASR and the following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
Tasks Dialogue State Tracking, Language Modelling, Speech Recognition
Published 2020-01-28
URL https://arxiv.org/abs/2002.00750v1
PDF https://arxiv.org/pdf/2002.00750v1.pdf
PWC https://paperswithcode.com/paper/joint-contextual-modeling-for-asr-correction
Repo
Framework

#### Deep Line Art Video Colorization with a Few References

Title Deep Line Art Video Colorization with a Few References
Authors Min Shi, Jia-Qi Zhang, Shu-Yu Chen, Lin Gao, Yu-Kun Lai, Fang-Lue Zhang
Abstract Coloring line art images based on the colors of reference images is an important stage in animation production, which is time-consuming and tedious. In this paper, we propose a deep architecture to automatically color line art videos with the same color style as the given reference images. Our framework consists of a color transform network and a temporal constraint network. The color transform network takes the target line art images as well as the line art and color images of one or more reference images as input, and generates corresponding target color images. To cope with larger differences between the target line art image and reference color images, our architecture utilizes non-local similarity matching to determine the region correspondences between the target image and the reference images, which are used to transform the local color information from the references to the target. To ensure global color style consistency, we further incorporate Adaptive Instance Normalization (AdaIN) with the transformation parameters obtained from a style embedding vector that describes the global color style of the references, extracted by an embedder. The temporal constraint network takes the reference images and the target image together in chronological order, and learns the spatiotemporal features through 3D convolution to ensure the temporal consistency of the target image and the reference image. Our model can achieve even better coloring results by fine-tuning the parameters with only a small number of samples when dealing with an animation of a new style. To evaluate our method, we build a line art coloring dataset. Experiments show that our method achieves the best performance on line art video coloring compared to the state-of-the-art methods and other baselines.