April 2, 2020


Paper Group ANR 335


Improving Uyghur ASR systems with decoders using morpheme-based language models. A Comparative Study of Machine Learning Models for Predicting the State of Reactive Mixing. XGPT: Cross-modal Generative Pre-Training for Image Captioning. Meta-Embeddings Based On Self-Attention. Fine-Grained Instance-Level Sketch-Based Video Retrieval. Video Cloze Pr …

Improving Uyghur ASR systems with decoders using morpheme-based language models

Title Improving Uyghur ASR systems with decoders using morpheme-based language models
Authors Zicheng Qiu, Wei Jiang, Turghunjan Mamut
Abstract Uyghur is a minority language, and its resources for Automatic Speech Recognition (ASR) research are always insufficient. THUYG-20 is currently the only open-sourced dataset of Uyghur speech. State-of-the-art results on its clean and noiseless speech test task haven’t been updated since the first release, which shows a big gap in the development of ASR between mainstream languages and Uyghur. In this paper, we try to bridge the gap by ultimately optimizing the ASR systems, and by developing a morpheme-based decoder, MLDG-Decoder (Morpheme Lattice Dynamically Generating Decoder for Uyghur DNN-HMM systems), which has long been missing. We have open-sourced the decoder. The MLDG-Decoder employs an algorithm, named “on-the-fly composition with FEBABOS”, to allow the back-off states and transitions to play the role of a relay station in on-the-fly composition. The algorithm empowers the dynamically generated graph to constrain the morpheme sequences in the lattices as effectively as the static and fully composed graph does when a 4-gram morpheme-based Language Model (LM) is used. We have trained deeper and wider neural network acoustic models, and experimented with three kinds of decoding schemes. The experimental results show that decoding based on the static and fully composed graph reduces the state-of-the-art Word Error Rate (WER) on the clean and noiseless speech test task in THUYG-20 to 14.24%. The MLDG-Decoder reduces the WER to 14.54% while keeping the memory consumption reasonable. Based on the open-sourced MLDG-Decoder, readers can easily reproduce the experimental results in this paper.
Tasks Language Modelling, Speech Recognition
Published 2020-03-03
URL https://arxiv.org/abs/2003.01509v2
PDF https://arxiv.org/pdf/2003.01509v2.pdf
PWC https://paperswithcode.com/paper/improving-uyghur-asr-systems-with-decoders
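The abstract above reports its results as Word Error Rate (WER). As a reminder of what that metric measures, here is a minimal sketch that computes WER as the word-level edit distance divided by the reference length. The function name and the whitespace tokenization are assumptions made for illustration; this is not the paper's scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.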

A Comparative Study of Machine Learning Models for Predicting the State of Reactive Mixing

Title A Comparative Study of Machine Learning Models for Predicting the State of Reactive Mixing
Authors B. Ahmmed, M. K. Mudunuru, S. Karra, S. C. James, V. V. Vesselinov
Abstract Accurate predictions of reactive mixing are critical for many Earth and environmental science problems. To investigate mixing dynamics over time under different scenarios, a high-fidelity, finite-element-based numerical model is built to solve the fast, irreversible bimolecular reaction-diffusion equations to simulate a range of reactive-mixing scenarios. A total of 2,315 simulations are performed using different sets of model input parameters comprising various spatial scales of vortex structures in the velocity field, time-scales associated with velocity oscillations, the perturbation parameter for the vortex-based velocity, anisotropic dispersion contrast, and molecular diffusion. Outputs comprise concentration profiles of the reactants and products. The inputs and outputs of these simulations are concatenated into feature and label matrices, respectively, to train 20 different machine learning (ML) emulators to approximate system behavior. The 20 ML emulators, based on linear methods, Bayesian methods, ensemble learning methods, and a multilayer perceptron (MLP), are compared against one another. The ML emulators are specifically trained to classify the state of mixing and predict three quantities of interest (QoIs) characterizing species production, decay, and degree of mixing. Linear classifiers and regressors fail to reproduce the QoIs; however, ensemble methods (classifiers and regressors) and the MLP accurately classify the state of reactive mixing and predict the QoIs. Among ensemble methods, random forest and decision-tree-based AdaBoost faithfully predict the QoIs. At run time, trained ML emulators are $\approx10^5$ times faster than the high-fidelity numerical simulations. The speed and accuracy of the ensemble and MLP models facilitate uncertainty quantification, which usually requires thousands of model runs, to estimate the uncertainty bounds on the QoIs.
Published 2020-02-24
URL https://arxiv.org/abs/2002.11511v1
PDF https://arxiv.org/pdf/2002.11511v1.pdf
PWC https://paperswithcode.com/paper/a-comparative-study-of-machine-learning-1

XGPT: Cross-modal Generative Pre-Training for Image Captioning

Title XGPT: Cross-modal Generative Pre-Training for Image Captioning
Authors Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou
Abstract While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.
Tasks Data Augmentation, Denoising, Image Captioning, Image Retrieval, Language Modelling, Visual Question Answering
Published 2020-03-03
URL https://arxiv.org/abs/2003.01473v2
PDF https://arxiv.org/pdf/2003.01473v2.pdf
PWC https://paperswithcode.com/paper/xgpt-cross-modal-generative-pre-training-for

Meta-Embeddings Based On Self-Attention

Title Meta-Embeddings Based On Self-Attention
Authors Qichen Li, Xiaoke Jiang, Jun Xia, Jian Li
Abstract Creating meta-embeddings for better performance in language modelling has received attention lately, and methods based on concatenation or merely calculating the arithmetic mean of several separately trained embeddings to form meta-embeddings have been shown to be beneficial. In this paper, we devise a new meta-embedding model based on the self-attention mechanism, namely the Duo. With less than 0.4M parameters, the Duo mechanism achieves state-of-the-art accuracy in text classification tasks such as 20NG. Additionally, we propose a new meta-embedding sequence-to-sequence model for machine translation, which, to the best of our knowledge, is the first machine translation model based on more than one word-embedding. Furthermore, it has turned out that our model outperforms the Transformer, achieving not only better results but also faster convergence on recognized benchmarks, such as the WMT 2014 English-to-French translation task.
Tasks Language Modelling, Machine Translation, Text Classification
Published 2020-03-03
URL https://arxiv.org/abs/2003.01371v1
PDF https://arxiv.org/pdf/2003.01371v1.pdf
PWC https://paperswithcode.com/paper/meta-embeddings-based-on-self-attention
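The abstract above names two baseline ways of building meta-embeddings: concatenation and the arithmetic mean of separately trained embeddings. The sketch below illustrates only those two baselines on plain Python lists. The function names and toy vectors are hypothetical, and the self-attention-based Duo model itself is not reproduced here.

```python
def concat_meta_embedding(vectors):
    """Concatenate the source embeddings of one word into a single longer vector."""
    merged = []
    for vector in vectors:
        merged.extend(vector)
    return merged


def mean_meta_embedding(vectors):
    """Average the source embeddings of one word; all sources must share a dimension."""
    dim = len(vectors[0])
    if any(len(vector) != dim for vector in vectors):
        raise ValueError("averaging requires embeddings of equal dimension")
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    # Two toy source embeddings for the same word (made-up values).
    sources = [[1.0, 2.0], [3.0, 6.0]]
    print(concat_meta_embedding(sources))
    print(mean_meta_embedding(sources))
```

Concatenation keeps every source dimension at the cost of a wider vector, while averaging keeps the width fixed but requires the sources to share a dimension.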

Fine-Grained Instance-Level Sketch-Based Video Retrieval

Title Fine-Grained Instance-Level Sketch-Based Video Retrieval
Authors Peng Xu, Kun Liu, Tao Xiang, Timothy M. Hospedales, Zhanyu Ma, Jun Guo, Yi-Zhe Song
Abstract Existing sketch-analysis work studies sketches depicting static objects or scenes. In this work, we propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR), where a sketch sequence is used as a query to retrieve a specific target video instance. Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level. We contribute the first FG-SBVR dataset with rich annotations. We then introduce a novel multi-stream multi-modality deep network to perform FG-SBVR under both strong and weakly supervised settings. The key component of the network is a relation module, designed to prevent model over-fitting given scarce training data. We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
Tasks Cross-Modal Retrieval, Image Retrieval, Video Retrieval
Published 2020-02-21
URL https://arxiv.org/abs/2002.09461v1
PDF https://arxiv.org/pdf/2002.09461v1.pdf
PWC https://paperswithcode.com/paper/fine-grained-instance-level-sketch-based

Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

Title Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
Authors Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, Weiping Wang
Abstract We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatial-temporal representations. VCP first generates “blanks” by withholding video clips and then creates “options” by applying spatio-temporal operations on the withheld clips. Finally, it fills the blanks with “options” and learns representations by predicting the categories of operations applied on the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in a uniform and interpretable manner. With VCP, we train spatial-temporal representation models (3D-CNNs) and apply such models on action recognition and video retrieval tasks. Experiments on commonly used benchmarks show that the trained models outperform the state-of-the-art self-supervised models with significant margins.
Tasks Representation Learning, Video Retrieval
Published 2020-01-02
URL https://arxiv.org/abs/2001.00294v1
PDF https://arxiv.org/pdf/2001.00294v1.pdf
PWC https://paperswithcode.com/paper/video-cloze-procedure-for-self-supervised
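The three steps described in the abstract above (withhold a clip, create an option by applying an operation, predict which operation was applied) can be sketched on toy data. In this sketch a clip is just a list of frame indices, and the operation set is hypothetical: the abstract only says "spatio-temporal operations" without listing them.

```python
import random

# Hypothetical operation set, chosen only for illustration.
OPERATIONS = {
    "identity": lambda clip: list(clip),
    "reverse": lambda clip: list(reversed(clip)),
    "rotate_left": lambda clip: clip[1:] + clip[:1],
}


def make_cloze_sample(clips, rng):
    """Blank out one clip, fill it with a transformed option, and return the label.

    Returns (filled_clips, blank_index, label); a model would be trained to
    predict `label` from `filled_clips`.
    """
    blank_index = rng.randrange(len(clips))
    label = rng.choice(sorted(OPERATIONS))
    option = OPERATIONS[label](clips[blank_index])
    filled = [list(clip) for clip in clips]
    filled[blank_index] = option
    return filled, blank_index, label


if __name__ == "__main__":
    rng = random.Random(0)
    video = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # three clips of three frames each
    print(make_cloze_sample(video, rng))
```

The supervision signal comes for free: the label is known because the operation was applied by the data pipeline itself.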

Automating Discovery of Dominance in Synchronous Computer-Mediated Communication

Title Automating Discovery of Dominance in Synchronous Computer-Mediated Communication
Authors Jim Samuel, Richard Holowczak, Raquel Benbunan-Fich, Ilan Levine
Abstract With the advent of electronic interaction, dominance (or the assertion of control over others) has acquired new dimensions. This study investigates the dynamics and characteristics of dominance in virtual interaction by analyzing electronic chat transcripts of groups solving a hidden profile task. We investigate computer-mediated communication behavior patterns that demonstrate dominance and identify a number of relevant variables. These indicators are calculated with automatic and manual coding of text transcripts. A comparison of both sets of variables indicates that automatic text analysis methods yield conclusions similar to those of manual coding. These findings encourage further research in text analysis methods in general, and in the study of virtual team dominance in particular.
Published 2020-02-24
URL https://arxiv.org/abs/2002.10582v1
PDF https://arxiv.org/pdf/2002.10582v1.pdf
PWC https://paperswithcode.com/paper/automating-discovery-of-dominance-in

Channel Attention with Embedding Gaussian Process: A Probabilistic Methodology

Title Channel Attention with Embedding Gaussian Process: A Probabilistic Methodology
Authors Jiyang Xie, Dongliang Chang, Zhanyu Ma, Guoqiang Zhang, Jun Guo
Abstract Channel attention mechanisms, as key components of some modern convolutional neural network (CNN) architectures, have been commonly used in many visual tasks for effective performance improvement. They are able to reinforce the informative channels and suppress the useless channels of feature maps obtained by CNNs. Recently, different attention modules have been proposed, which are implemented in various ways. However, they are mainly based on convolution and pooling operations and lack intuitive and reasonable insights into the principles that they are based on. Moreover, the ways in which they improve the performance of the CNNs are not clear either. In this paper, we propose a Gaussian process embedded channel attention (GPCA) module and interpret the channel attention intuitively and reasonably in a probabilistic way. The GPCA module is able to model the correlations among channels, which are assumed to be beta-distributed variables with a Gaussian process prior. As the beta distribution cannot be tractably integrated into the end-to-end training of the CNNs, we utilize an appropriate approximation of the beta distribution so that the distributional assumption is easy to implement. In this case, the proposed GPCA module can be integrated into the end-to-end training of the CNNs. Experimental results demonstrate that the proposed GPCA module can improve the accuracies of image classification on four widely used datasets.
Tasks Image Classification
Published 2020-03-10
URL https://arxiv.org/abs/2003.04575v1
PDF https://arxiv.org/pdf/2003.04575v1.pdf
PWC https://paperswithcode.com/paper/channel-attention-with-embedding-gaussian

A Deep Learning Framework for Simulation and Defect Prediction Applied in Microelectronics

Title A Deep Learning Framework for Simulation and Defect Prediction Applied in Microelectronics
Authors Nikolaos Dimitriou, Lampros Leontaris, Thanasis Vafeiadis, Dimosthenis Ioannidis, Tracy Wotherspoon, Gregory Tinker, Dimitrios Tzovaras
Abstract The prediction of upcoming events in industrial processes has been a long-standing research goal since it enables optimization of manufacturing parameters, planning of equipment maintenance and, more importantly, prediction and eventually prevention of defects. While existing approaches have accomplished substantial progress, they are mostly limited to processing of one-dimensional signals or require parameter tuning to model environmental parameters. In this paper, we propose an alternative approach based on deep neural networks that simulates changes in the 3D structure of a monitored object in a batch based on previous 3D measurements. In particular, we propose an architecture based on 3D Convolutional Neural Networks (3DCNN) in order to model the geometric variations in manufacturing parameters and predict upcoming events related to sub-optimal performance. We validate our framework on a microelectronics use-case using the recently published PCB scans dataset where we simulate changes in the shape and volume of glue deposited on a Liquid Crystal Polymer (LCP) substrate before the attachment of integrated circuits (IC). Experimental evaluation examines the impact of different choices in the cost function during training and shows that the proposed method can be efficiently used for defect prediction.
Published 2020-02-25
URL https://arxiv.org/abs/2002.10986v1
PDF https://arxiv.org/pdf/2002.10986v1.pdf
PWC https://paperswithcode.com/paper/a-deep-learning-framework-for-simulation-and

Multimodal Shape Completion via Conditional Generative Adversarial Networks

Title Multimodal Shape Completion via Conditional Generative Adversarial Networks
Authors Rundi Wu, Xuelin Chen, Yixin Zhuang, Baoquan Chen
Abstract Several deep learning methods have been proposed for completing partial data from shape acquisition setups, i.e., filling the regions that were missing in the shape. These methods, however, only complete the partial shape with a single output, ignoring the ambiguity when reasoning about the missing geometry. Hence, we pose a multi-modal shape completion problem, in which we seek to complete the partial shape with multiple outputs by learning a one-to-many mapping. We develop the first multimodal shape completion method that completes the partial shape via conditional generative modeling, without requiring paired training data. Our approach distills the ambiguity by conditioning the completion on a learned multimodal distribution of possible results. We extensively evaluate the approach on several datasets that contain varying forms of shape incompleteness, and compare it against several baseline methods and variants of our method qualitatively and quantitatively, demonstrating the merit of our method in completing partial shapes with both diversity and quality.
Published 2020-03-17
URL https://arxiv.org/abs/2003.07717v2
PDF https://arxiv.org/pdf/2003.07717v2.pdf
PWC https://paperswithcode.com/paper/multimodal-shape-completion-via-conditional

Resource-Aware Network Topology Management Framework

Title Resource-Aware Network Topology Management Framework
Authors Aaqif Afzaal Abbasi, Shahab Shamshirband, Mohammed A. A. Al-qaness, Almas Abbasi, Nashat T. AL-Jallad, Amir Mosavi
Abstract Cloud infrastructure provides computing services where computing resources can be adjusted on demand. However, the adoption of cloud infrastructures brings concerns such as reliance on the service provider network, reliability, and compliance with service level agreements (SLAs). Software-defined networking (SDN) is a networking concept that suggests the segregation of a network data plane from the control plane. This concept improves networking behavior. In this paper, we present an SDN-enabled resource-aware topology framework. The proposed framework employs SLA compliance, a Path Computation Element (PCE), and fair load sharing to achieve better topology features. We also present an evaluation, showcasing the potential of our framework.
Published 2020-02-26
URL https://arxiv.org/abs/2003.00860v1
PDF https://arxiv.org/pdf/2003.00860v1.pdf
PWC https://paperswithcode.com/paper/resource-aware-network-topology-management

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Title Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Authors Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu
Abstract Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach for this problem is to learn a joint embedding space to measure cross-modal similarities. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. To be specific, the model disentangles texts into a hierarchical semantic graph with three levels (events, actions, and entities) and relationships across levels. Attention-based graph reasoning is utilized to generate hierarchical textual embeddings, which can guide the learning of diverse and hierarchical video representations. The HGR model aggregates matchings from different video-text levels to capture both global and local details. Experimental results on three video-text datasets demonstrate the advantages of our model. Such hierarchical decomposition also enables better generalization across datasets and improves the ability to distinguish fine-grained semantic differences.
Tasks Cross-Modal Retrieval, Text Matching
Published 2020-03-01
URL https://arxiv.org/abs/2003.00392v1
PDF https://arxiv.org/pdf/2003.00392v1.pdf
PWC https://paperswithcode.com/paper/fine-grained-video-text-retrieval-with
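The abstract above says that matchings from different video-text levels are aggregated into one score. The sketch below shows one simple way such an aggregation could look, assuming (this is an assumption, not the paper's stated method) that each level is scored by cosine similarity in a joint space and that the levels are averaged with equal weight. The level names follow the abstract; the function names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def aggregate_similarity(video_levels, text_levels):
    """Average per-level cosine similarities; both inputs map level name -> embedding."""
    if video_levels.keys() != text_levels.keys():
        raise ValueError("both sides must provide the same levels")
    sims = [cosine(video_levels[level], text_levels[level]) for level in video_levels]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    video = {"events": [1.0, 0.0], "actions": [0.0, 1.0], "entities": [1.0, 0.0]}
    text = {"events": [1.0, 0.0], "actions": [1.0, 0.0], "entities": [1.0, 0.0]}
    print(aggregate_similarity(video, text))
```

In this toy case the pair agrees at the event and entity levels but not at the action level, so the aggregated score is lower than a single global similarity computed on matching event embeddings would suggest.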

Unsupervised Domain Adaptation via Discriminative Manifold Embedding and Alignment

Title Unsupervised Domain Adaptation via Discriminative Manifold Embedding and Alignment
Authors You-Wei Luo, Chuan-Xian Ren, Pengfei Ge, Ke-Kun Huang, Yu-Feng Yu
Abstract Unsupervised domain adaptation is effective in leveraging the rich information from the source domain for the unsupervised target domain. Though deep learning and adversarial strategies have made important breakthroughs in the adaptability of features, there are two issues to be further explored. First, the hard-assigned pseudo labels on the target domain put the intrinsic data structure at risk. Second, the batch-wise training manner in deep learning limits the description of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability consistently. As to the first problem, this method establishes a probabilistic discriminant criterion on the target domain via soft labels. Further, this criterion is extended to a global approximation scheme for the second issue; such approximation is also memory-saving. The manifold metric alignment is exploited to be compatible with the embedding space. A theoretical error bound is derived to facilitate the alignment. Extensive experiments have been conducted to investigate the proposal, and the results of the comparison study demonstrate the superiority of the consistent manifold learning framework.
Tasks Domain Adaptation, Unsupervised Domain Adaptation
Published 2020-02-20
URL https://arxiv.org/abs/2002.08675v2
PDF https://arxiv.org/pdf/2002.08675v2.pdf
PWC https://paperswithcode.com/paper/unsupervised-domain-adaptation-via-3

Kernel Quantization for Efficient Network Compression

Title Kernel Quantization for Efficient Network Compression
Authors Zhongzhi Yu, Yemin Shi, Tiejun Huang, Yizhou Yu
Abstract This paper presents a novel network compression framework, Kernel Quantization (KQ), which aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss. Unlike existing methods struggling with weight bit-length, KQ has the potential to improve the compression ratio by considering the convolution kernel as the quantization unit. Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels. Instead of representing each weight parameter with a low-bit index, we learn a kernel codebook and replace all kernels in the convolution layer with corresponding low-bit indexes. Thus, KQ can represent the weight tensor in the convolution layer with low-bit indexes and a kernel codebook with limited size, which enables KQ to achieve a significant compression ratio. Then, we conduct a 6-bit parameter quantization on the kernel codebook to further reduce redundancy. Extensive experiments on the ImageNet classification task show that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer and achieves the state-of-the-art compression ratio with little accuracy loss.
Tasks Quantization
Published 2020-03-11
URL https://arxiv.org/abs/2003.05148v1
PDF https://arxiv.org/pdf/2003.05148v1.pdf
PWC https://paperswithcode.com/paper/kernel-quantization-for-efficient-network
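The storage argument in the abstract above (one low-bit index per kernel plus a small codebook whose entries are themselves quantized) can be made concrete with a little arithmetic. The sketch below assumes a given, fixed codebook and nearest-neighbor assignment. The paper learns its codebook, so this is only an illustration of the accounting, and the layer sizes and function names are hypothetical.

```python
import math


def nearest_codeword(kernel, codebook):
    """Index of the codebook kernel closest to `kernel` in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(kernel, codebook[k]))


def average_bits_per_weight(num_kernels, kernel_size, codebook_size, codeword_bits):
    """Bits per weight when each kernel stores one index and the codebook is shared.

    total bits = num_kernels * ceil(log2(codebook_size))
               + codebook_size * kernel_size * codeword_bits
    """
    index_bits = math.ceil(math.log2(codebook_size))
    total_bits = num_kernels * index_bits + codebook_size * kernel_size * codeword_bits
    return total_bits / (num_kernels * kernel_size)


if __name__ == "__main__":
    # Hypothetical layer: 256 x 256 kernels of size 3 x 3, a 256-entry codebook,
    # and 6-bit codebook weights (the 6-bit figure is the one named in the abstract).
    print(average_bits_per_weight(256 * 256, 9, 256, 6))
    # Flattened toy kernels: assign a kernel to its closest codebook entry.
    print(nearest_codeword([1.0, 1.0], [[0.0, 0.0], [1.0, 2.0], [5.0, 5.0]]))
```

The index cost is amortized over all weights in a kernel, which is why the average can fall below the bit width of any single stored value once the layer has many more kernels than codebook entries.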

Ternary Compression for Communication-Efficient Federated Learning

Title Ternary Compression for Communication-Efficient Federated Learning
Authors Jinjin Xu, Wenli Du, Ran Cheng, Wangli He, Yaochu Jin
Abstract Learning over massive data stored in different locations is essential in many real-world applications. However, sharing data is full of challenges due to the increasing demands of privacy and security with the growing use of smart mobile devices and IoT devices. Federated learning provides a potential solution to privacy-preserving and secure machine learning, by means of jointly training a global model without uploading data distributed on multiple devices to a central server. However, most existing work on federated learning adopts machine learning models with full-precision weights, and almost all these models contain a large number of redundant parameters that do not need to be transmitted to the server, incurring excessive communication costs. To address this issue, we propose a federated trained ternary quantization (FTTQ) algorithm, which optimizes the quantized networks on the clients through a self-learning quantization factor. A convergence proof of the quantization factor and the unbiasedness of FTTQ is given. In addition, we propose a ternary federated averaging protocol (T-FedAvg) to reduce the upstream and downstream communication of federated learning systems. Empirical experiments are conducted to train widely used deep learning models on publicly available datasets, and our results demonstrate the effectiveness of FTTQ and T-FedAvg compared with the canonical federated learning algorithms in reducing communication costs and maintaining the learning performance.
Tasks Quantization
Published 2020-03-07
URL https://arxiv.org/abs/2003.03564v1
PDF https://arxiv.org/pdf/2003.03564v1.pdf
PWC https://paperswithcode.com/paper/ternary-compression-for-communication
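To make the idea of ternary weights in the abstract above concrete, the sketch below maps a list of full-precision weights to codes in {-1, 0, +1} plus one shared scaling factor. It uses a fixed magnitude threshold, which is a simple heuristic assumed here for illustration. It is not FTTQ, which according to the abstract learns its quantization factor during training. The function name and the `threshold_ratio` default are hypothetical.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Return (codes, alpha) so that each weight is approximated by alpha * code.

    Weights whose magnitude is at most threshold_ratio * mean(|w|) become 0;
    the rest become +1 or -1. alpha is the mean magnitude of the kept weights.
    """
    if not weights:
        return [], 0.0
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, code in zip(weights, codes) if code != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return codes, alpha


if __name__ == "__main__":
    codes, alpha = ternarize([0.9, -0.8, 0.05, -0.1, 1.0])
    print(codes, alpha)
```

Each code fits in two bits, so a client that sends codes plus one scaling factor per layer transmits far less than one that sends 32-bit floating-point weights.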