January 29, 2020

Paper Group ANR 610

Effective and Efficient Indexing in Cross-Modal Hashing-Based Datasets. Multi-Object Portion Tracking in 4D Fluorescence Microscopy Imagery with Deep Feature Maps. Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients. The Second DIHARD Diarization Challenge: Dataset, task, and baselines. Some Considerations and a Be …

Effective and Efficient Indexing in Cross-Modal Hashing-Based Datasets

Title Effective and Efficient Indexing in Cross-Modal Hashing-Based Datasets
Authors Sarawut Markchit, Chih-Yi Chiu
Abstract To overcome the barrier of storage and computation, the hashing technique has recently been widely used for nearest neighbor search in multimedia retrieval applications. In particular, cross-modal retrieval, which searches across different modalities, has become an active but challenging problem. Although dozens of cross-modal hashing algorithms have been proposed to yield compact binary codes, exhaustive search is impractical for real-time purposes, and Hamming distance computation suffers from inaccurate results. In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes in cross-modal retrieval. The proposed hash code indexing scheme exploits a few binary bits of the hash code as the index code. We construct an inverted index table based on index codes and train a neural network to improve the indexing accuracy and efficiency. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated by three cross-modal hashing methods. Results show the proposed method effectively boosts the performance of these hash methods.
Tasks Cross-Modal Retrieval
Published 2019-04-30
URL https://arxiv.org/abs/1904.13325v2
PDF https://arxiv.org/pdf/1904.13325v2.pdf
PWC https://paperswithcode.com/paper/effective-and-efficient-indexing-in-cross
Repo
Framework
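
As a rough illustration of the indexing idea in this abstract, the sketch below uses the leading bits of each hash code as an index code, builds an inverted table, and ranks only the items in the matching bucket by Hamming distance. The function names, the choice of leading bits, and the toy codes are assumptions made for illustration; the paper's probability-based scheme and its neural network are not reproduced here.

```python
from collections import defaultdict


def index_code(code: int, n_bits: int, index_bits: int) -> int:
    """Take the leading `index_bits` bits of an `n_bits`-wide hash code."""
    return code >> (n_bits - index_bits)


def build_inverted_index(codes, n_bits, index_bits):
    """Map each index code to the list of item ids stored under it."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[index_code(code, n_bits, index_bits)].append(item_id)
    return table


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded codes."""
    return bin(a ^ b).count("1")


def search(query, codes, table, n_bits, index_bits, top_k=2):
    """Rank only the items in the query's bucket instead of the whole set."""
    candidates = table.get(index_code(query, n_bits, index_bits), [])
    ranked = sorted(candidates, key=lambda i: hamming(query, codes[i]))
    return ranked[:top_k]


# Toy 8-bit codes (hypothetical data, not from the paper).
codes = [0b10110010, 0b10111111, 0b01000001, 0b10110011]
table = build_inverted_index(codes, n_bits=8, index_bits=3)
result = search(0b10110110, codes, table, n_bits=8, index_bits=3)
```

Only the three items sharing the query's 3-bit index code are compared, which is the saving an inverted index offers over exhaustive Hamming search.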

Multi-Object Portion Tracking in 4D Fluorescence Microscopy Imagery with Deep Feature Maps

Title Multi-Object Portion Tracking in 4D Fluorescence Microscopy Imagery with Deep Feature Maps
Authors Yang Jiao, Mo Weng, Mei Yang
Abstract 3D fluorescence microscopy of living organisms has increasingly become an essential and powerful tool in biomedical research and diagnosis. An exploding amount of imaging data has been collected, whereas efficient and effective computational tools to extract information from them are still lagging behind. This is largely due to the challenges in analyzing biological data. Interesting biological structures are not only small, but are often morphologically irregular and highly dynamic. Although tracking cells in live organisms has been studied for years, existing tracking methods for cells are not effective in tracking subcellular structures, such as protein complexes, which feature in continuous morphological changes including split and merge, in addition to fast migration and complex motion. In this paper, we first define the problem of multi-object portion tracking to model the protein object tracking process. A multi-object tracking method with portion matching is proposed based on 3D segmentation results. The proposed method distills deep feature maps from deep networks, then recognizes and matches object portions using an extended search. Experimental results confirm that the proposed method achieves 2.96% higher on consistent tracking accuracy and 35.48% higher on event identification accuracy than the state-of-art methods.
Tasks Multi-Object Tracking, Object Tracking
Published 2019-11-26
URL https://arxiv.org/abs/1911.11808v1
PDF https://arxiv.org/pdf/1911.11808v1.pdf
PWC https://paperswithcode.com/paper/multi-object-portion-tracking-in-4d
Repo
Framework

Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients

Title Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients
Authors Nursadul Mamun, Soheil Khorram, John H. L. Hansen
Abstract Attempts to develop speech enhancement algorithms with improved speech intelligibility for cochlear implant (CI) users have met with limited success. To improve speech enhancement methods for CI users, we propose to perform speech enhancement in a cochlear filter-bank feature space, a feature-set specifically designed for CI users based on CI auditory stimuli. We leverage a convolutional neural network (CNN) to extract both stationary and non-stationary components of environmental acoustics and speech. We propose three CNN architectures: (1) vanilla CNN that directly generates the enhanced signal; (2) spectral-subtraction-style CNN (SS-CNN) that first predicts noise and then generates the enhanced signal by subtracting noise from the noisy signal; (3) Wiener-style CNN (Wiener-CNN) that generates an optimal mask for suppressing noise. An important problem of the proposed networks is that they introduce considerable delays, which limits their real-time application for CI users. To address this, this study also considers causal variations of these networks. Our experiments show that the proposed networks (both causal and non-causal forms) achieve significant improvement over existing baseline systems. We also found that causal Wiener-CNN outperforms other networks, and leads to the best overall envelope coefficient measure (ECM). The proposed algorithms represent a viable option for implementation on the CCi-MOBILE research platform as a pre-processor for CI users in naturalistic environments.
Tasks Speech Enhancement
Published 2019-07-03
URL https://arxiv.org/abs/1907.02526v1
PDF https://arxiv.org/pdf/1907.02526v1.pdf
PWC https://paperswithcode.com/paper/convolutional-neural-network-based-speech
Repo
Framework
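
The difference between the spectral-subtraction-style and Wiener-style outputs described above can be sketched on magnitude features. In this minimal example the noise estimate is simply given as an array, whereas in the paper it would come from a CNN; the function names, the zero floor, and the toy values are assumptions.

```python
import numpy as np


def spectral_subtraction(noisy_mag, noise_mag, floor=0.0):
    """Subtract the estimated noise magnitude and clip at a floor."""
    return np.maximum(noisy_mag - noise_mag, floor)


def wiener_mask(noisy_mag, noise_mag, eps=1e-8):
    """Wiener-style gain in [0, 1] from estimated speech and noise power."""
    noise_pow = noise_mag ** 2
    speech_pow = np.maximum(noisy_mag ** 2 - noise_pow, 0.0)
    return speech_pow / (speech_pow + noise_pow + eps)


# Toy magnitudes (hypothetical): rows are frames, columns are channels.
noisy = np.array([[2.0, 1.0], [0.5, 3.0]])
noise = np.array([[1.0, 1.0], [1.0, 1.0]])

ss_out = spectral_subtraction(noisy, noise)
mask = wiener_mask(noisy, noise)
wiener_out = mask * noisy
```

The subtraction variant removes a fixed amount from each bin, while the mask variant scales each bin by a gain between 0 and 1.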

The Second DIHARD Diarization Challenge: Dataset, task, and baselines

Title The Second DIHARD Diarization Challenge: Dataset, task, and baselines
Authors Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, Mark Liberman
Abstract This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.
Tasks Action Detection, Activity Detection, Language Acquisition, Speaker Diarization, Speech Enhancement
Published 2019-06-18
URL https://arxiv.org/abs/1906.07839v1
PDF https://arxiv.org/pdf/1906.07839v1.pdf
PWC https://paperswithcode.com/paper/the-second-dihard-diarization-challenge
Repo
Framework
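
Some Considerations and a Benchmark Related to the CNF Property of the Koczy-Hirota Fuzzy Rule Interpolation

Title Some Considerations and a Benchmark Related to the CNF Property of the Koczy-Hirota Fuzzy Rule Interpolation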
Authors Maen Alzubi, Szilveszter Kovacs
Abstract The goal of this paper is twofold: first, to highlight some basic problematic properties of the KH Fuzzy Rule Interpolation through examples; second, to set up a brief benchmark set of examples that is suitable for testing other Fuzzy Rule Interpolation (FRI) methods against these ill conditions. Fuzzy Rule Interpolation methods were originally proposed to handle the situation of missing fuzzy rules (sparse rule-bases) and to reduce the decision complexity. Fuzzy Rule Interpolation is an important technique for implementing inference with sparse fuzzy rule-bases. Even if a given observation has no overlap with the antecedent of any rule from the rule-base, FRI may still yield a conclusion. The first FRI method was the “Linear Interpolation” proposed by Koczy and Hirota, which was later renamed “KH Fuzzy Interpolation” by its followers. Several conditions and criteria have been suggested for unifying the common requirements that FRI methods have to satisfy. One of the most common is the demand for a convex and normal fuzzy (CNF) conclusion if all the rule antecedents and consequents are CNF sets. The KH FRI cannot fulfill this condition. This paper focuses on the conditions where the KH FRI fails the demand for a CNF conclusion. By setting up some CNF rule examples, the paper also defines a benchmark on which other FRI methods can be tested to see whether they produce a CNF conclusion where the KH FRI fails.
Tasks
Published 2019-11-12
URL https://arxiv.org/abs/1911.05041v1
PDF https://arxiv.org/pdf/1911.05041v1.pdf
PWC https://paperswithcode.com/paper/some-considerations-and-a-benchmark-related
Repo
Framework
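
To make the CNF failure concrete, the sketch below interpolates a single alpha-cut with the inverse-distance form commonly given for KH interpolation, applied separately to the lower and upper endpoints. The formula is stated here as an assumption, since the abstract does not spell it out, and the interval values are hypothetical rather than taken from the paper's benchmark.

```python
def kh_endpoint(obs, a1, a2, b1, b2):
    """Inverse-distance weighted interpolation of one alpha-cut endpoint.

    Assumes the observation endpoint differs from both antecedent endpoints,
    otherwise a distance would be zero.
    """
    w1 = 1.0 / abs(obs - a1)
    w2 = 1.0 / abs(obs - a2)
    return (w1 * b1 + w2 * b2) / (w1 + w2)


def kh_alpha_cut(obs_cut, a1_cut, a2_cut, b1_cut, b2_cut):
    """Interpolate the lower and upper endpoints of the conclusion separately."""
    lower = kh_endpoint(obs_cut[0], a1_cut[0], a2_cut[0], b1_cut[0], b2_cut[0])
    upper = kh_endpoint(obs_cut[1], a1_cut[1], a2_cut[1], b1_cut[1], b2_cut[1])
    return lower, upper


# Hypothetical alpha-cuts: a crisp observation between two rule antecedents.
lower, upper = kh_alpha_cut(
    obs_cut=(4.0, 4.0),
    a1_cut=(0.0, 3.0), a2_cut=(6.0, 7.0),
    b1_cut=(0.0, 1.0), b2_cut=(5.0, 6.0),
)
is_valid_interval = lower <= upper
```

Here the lower endpoint comes out larger than the upper one, so the alpha-cut is not a valid interval and the conclusion cannot be a convex and normal fuzzy set.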

Increasing Compactness Of Deep Learning Based Speech Enhancement Models With Parameter Pruning And Quantization Techniques

Title Increasing Compactness Of Deep Learning Based Speech Enhancement Models With Parameter Pruning And Quantization Techniques
Authors Jyun-Yi Wu, Cheng Yu, Szu-Wei Fu, Chih-Ting Liu, Shao-Yi Chien, Yu Tsao
Abstract Most recent studies on deep learning based speech enhancement (SE) focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in real scenarios. In this study, we propose a novel parameter pruning (PP) technique, which removes redundant channels in a neural network. In addition, a parameter quantization (PQ) technique was applied to reduce the size of a neural network by representing weights with fewer cluster centroids. Because the techniques are derived based on different concepts, the PP and PQ can be integrated to provide even more compact SE models. The experimental results show that the PP and PQ techniques produce a compacted SE model with a size of only 10.03% compared to that of the original model, resulting in minor performance losses of 1.43% (from 0.70 to 0.69) for STOI and 3.24% (from 1.85 to 1.79) for PESQ. The promising results suggest that the PP and PQ techniques can be used in a SE system in devices with limited storage and computation resources.
Tasks Denoising, Quantization, Speech Enhancement
Published 2019-05-31
URL https://arxiv.org/abs/1906.01078v2
PDF https://arxiv.org/pdf/1906.01078v2.pdf
PWC https://paperswithcode.com/paper/increasing-compactness-of-deep-learning-based
Repo
Framework
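
The parameter quantization idea, representing weights with a small number of cluster centroids, can be sketched with a plain one-dimensional k-means. This is a minimal illustration under assumed settings (linearly spaced initial centroids, 4 clusters, random weights); it does not reproduce the paper's procedure, and channel pruning is not shown.

```python
import numpy as np


def quantize_weights(weights, n_clusters=4, n_iter=20):
    """Cluster weights in 1-D and return (centroids, per-weight cluster ids)."""
    flat = weights.ravel()
    # Assumed initialization: centroids spread evenly over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            members = flat[assign == k]
            if members.size:
                centroids[k] = members.mean()
    assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids, assign.reshape(weights.shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))          # stand-in for one layer's weights
centroids, codes = quantize_weights(w, n_clusters=4)
w_quantized = centroids[codes]       # every weight replaced by its centroid
```

With 4 centroids each weight can be stored as a 2-bit index plus a shared table of 4 floats, which is where the size reduction comes from.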

Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

Title Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect
Authors Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Abstract When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation with acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of -5 dB. Regarding speech intelligibility, we find a general tendency of the benefit in training the systems with Lombard speech.
Tasks Speech Enhancement
Published 2019-05-29
URL https://arxiv.org/abs/1905.12605v1
PDF https://arxiv.org/pdf/1905.12605v1.pdf
PWC https://paperswithcode.com/paper/deep-learning-based-audio-visual-speech
Repo
Framework
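
The abstract refers to speech recorded in quiet with noise added artificially, and to a signal-to-noise ratio of -5 dB. A minimal sketch of mixing at a target SNR follows; the sine wave standing in for speech, the white noise, and the function name are assumptions for illustration only.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`."""
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    target_noise_pow = speech_pow / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_pow / noise_pow)
    return speech + scaled_noise, scaled_noise


rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # stand-in signal
noise = rng.normal(size=16000)

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=-5.0)
achieved_snr = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```

At -5 dB the noise carries roughly three times the power of the speech, which is why the abstract describes such conditions as acoustically challenging.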

Dimension reduction as an optimization problem over a set of generalized functions

Title Dimension reduction as an optimization problem over a set of generalized functions
Authors Rustem Takhanov
Abstract Classical dimension reduction problem can be loosely formulated as a problem of finding a $k$-dimensional affine subspace of ${\mathbb R}^n$ onto which data points ${\mathbf x}_1,\cdots, {\mathbf x}_N$ can be projected without loss of valuable information. We reformulate this problem in the language of tempered distributions, i.e. as a problem of approximating an empirical probability density function $p_{\rm{emp}}({\mathbf x}) = \frac{1}{N} \sum_{i=1}^N \delta^n ({\mathbf x} - {\mathbf x}_i)$, where $\delta^n$ is an $n$-dimensional Dirac delta function, by another tempered distribution $q({\mathbf x})$ whose density is supported in some $k$-dimensional subspace. Thus, our problem is reduced to the minimization of a certain loss function $I(q)$ measuring the distance from $q$ to $p_{\rm{emp}}$ over a pertinent set of generalized functions, denoted $\mathcal{G}_k$. Another classical problem of data analysis is the sufficient dimension reduction problem. We show that it can be reduced to the following problem: given a function $f: {\mathbb R}^n\rightarrow {\mathbb R}$ and a probability density function $p({\mathbf x})$, find a function of the form $g({\mathbf w}^T_1{\mathbf x}, \cdots, {\mathbf w}^T_k{\mathbf x})$ that minimizes the loss ${\mathbb E}_{{\mathbf x}\sim p} |f({\mathbf x})-g({\mathbf w}^T_1{\mathbf x}, \cdots, {\mathbf w}^T_k{\mathbf x})|^2$. We first show that search spaces of the latter two problems are in one-to-one correspondence which is defined by the Fourier transform. We introduce a nonnegative penalty function $R(f)$ and a set of ordinary functions $\Omega_\epsilon = \{f \mid R(f)\leq \epsilon\}$ in such a way that $\Omega_\epsilon$ 'approximates' the space $\mathcal{G}_k$ when $\epsilon \rightarrow 0$. Then we present an algorithm for minimization of $I(f)+\lambda R(f)$, based on the idea of two-step iterative computation.
Tasks Dimensionality Reduction
Published 2019-03-12
URL http://arxiv.org/abs/1903.05083v1
PDF http://arxiv.org/pdf/1903.05083v1.pdf
PWC https://paperswithcode.com/paper/dimension-reduction-as-an-optimization
Repo
Framework

Alignment-Free Cross-Sensor Fingerprint Matching based on the Co-Occurrence of Ridge Orientations and Gabor-HoG Descriptor

Title Alignment-Free Cross-Sensor Fingerprint Matching based on the Co-Occurrence of Ridge Orientations and Gabor-HoG Descriptor
Authors Helala AlShehri, Muhammad Hussain, Hatim AboAlSamh, Qazi Emad-ul-Haq, Aqil M. Azmi
Abstract The existing automatic fingerprint verification methods are designed to work under the assumption that the same sensor is installed for enrollment and authentication (regular matching). There is a remarkable decrease in efficiency when one type of contact-based sensor is employed for enrollment and another type of contact-based sensor is used for authentication (the cross-matching or fingerprint sensor interoperability problem). The ridge orientation patterns in a fingerprint are invariant to sensor type. Based on this observation, we propose a robust fingerprint descriptor called the co-occurrence of ridge orientations (Co-Ror), which encodes the spatial distribution of ridge orientations. Employing this descriptor, we introduce an efficient automatic fingerprint verification method for the cross-matching problem. Further, to enhance the robustness of the method, we incorporate scale-based ridge orientation information through the Gabor-HoG descriptor. The two descriptors are fused with canonical correlation analysis (CCA), and the matching score between two fingerprints is calculated using city-block distance. The proposed method is alignment-free and can handle the matching process without the need for a registration step. Intensive experiments on two benchmark databases (FingerPass and MOLF) show the effectiveness of the method and reveal its significant enhancement over state-of-the-art methods such as VeriFinger (a commercial SDK), minutia cylinder-code (MCC), MCC with scale, and the thin-plate spline (TPS) model. The proposed research will help security agencies, service providers and law-enforcement departments to overcome the interoperability problem of contact sensors of different technology and interaction types.
Tasks
Published 2019-04-30
URL http://arxiv.org/abs/1905.03699v1
PDF http://arxiv.org/pdf/1905.03699v1.pdf
PWC https://paperswithcode.com/paper/190503699
Repo
Framework

Going Beneath the Surface: Evaluating Image Captioning for Grammaticality, Truthfulness and Diversity

Title Going Beneath the Surface: Evaluating Image Captioning for Grammaticality, Truthfulness and Diversity
Authors Huiyuan Xie, Tom Sherborne, Alexander Kuhnle, Ann Copestake
Abstract Image captioning as a multimodal task has drawn much interest in recent years. However, evaluation for this task remains a challenging problem. Existing evaluation metrics focus on surface similarity between a candidate caption and a set of reference captions, and do not check the actual relation between a caption and the underlying visual content. We introduce a new diagnostic evaluation framework for the task of image captioning, with the goal of directly assessing models for grammaticality, truthfulness and diversity (GTD) of generated captions. We demonstrate the potential of our evaluation framework by evaluating existing image captioning models on a wide ranging set of synthetic datasets that we construct for diagnostic evaluation. We empirically show how the GTD evaluation framework, in combination with diagnostic datasets, can provide insights into model capabilities and limitations to supplement standard evaluations.
Tasks Image Captioning
Published 2019-12-19
URL https://arxiv.org/abs/1912.08960v1
PDF https://arxiv.org/pdf/1912.08960v1.pdf
PWC https://paperswithcode.com/paper/going-beneath-the-surface-evaluating-image
Repo
Framework

Predicting Motion of Vulnerable Road Users using High-Definition Maps and Efficient ConvNets

Title Predicting Motion of Vulnerable Road Users using High-Definition Maps and Efficient ConvNets
Authors Fang-Chieh Chou, Tsung-Han Lin, Henggang Cui, Vladan Radosavljevic, Thi Nguyen, Tzu-Kuo Huang, Matthew Niedoba, Jeff Schneider, Nemanja Djuric
Abstract Following detection and tracking of traffic actors, prediction of their future motion is the next critical component of a self-driving vehicle (SDV) technology, allowing the SDV to operate safely and efficiently in its environment. This is particularly important when it comes to vulnerable road users (VRUs), such as pedestrians and bicyclists. These actors need to be handled with special care due to an increased risk of injury, as well as the fact that their behavior is less predictable than that of motorized actors. To address this issue, in this paper we present a deep learning-based method for predicting VRU movement, where we rasterize high-definition maps and actor’s surroundings into bird’s-eye view image used as an input to deep convolutional networks. In addition, we propose a fast architecture suitable for real-time inference, and present a detailed ablation study of various rasterization choices. The results strongly indicate benefits of using the proposed approach for motion prediction of VRUs, both in terms of accuracy and latency.
Tasks Motion Prediction
Published 2019-06-20
URL https://arxiv.org/abs/1906.08469v1
PDF https://arxiv.org/pdf/1906.08469v1.pdf
PWC https://paperswithcode.com/paper/predicting-motion-of-vulnerable-road-users
Repo
Framework

Universal Sound Separation

Title Universal Sound Separation
Authors Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, John R. Hershey
Abstract Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Tasks Speech Enhancement, Speech Separation
Published 2019-05-08
URL https://arxiv.org/abs/1905.03330v2
PDF https://arxiv.org/pdf/1905.03330v2.pdf
PWC https://paperswithcode.com/paper/190503330
Repo
Framework
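
The results above are reported as improvements in scale-invariant signal-to-distortion ratio (SI-SDR). A minimal sketch of the metric follows, using the common definition that projects the estimate onto the reference; removing the mean first and the small `eps` term are assumptions of this sketch, and the signals are synthetic.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between a 1-D estimate and its reference."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


rng = np.random.default_rng(0)
reference = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 1000))
noise = rng.normal(size=1000)

clean_ish = reference + 0.05 * noise
noisy_ish = reference + 0.5 * noise
```

Rescaling an estimate leaves the score unchanged, so a separation model cannot improve SI-SDR simply by changing output gain.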

Sequence to Sequence with Attention for Influenza Prevalence Prediction using Google Trends

Title Sequence to Sequence with Attention for Influenza Prevalence Prediction using Google Trends
Authors Kenjiro Kondo, Akihiko Ishikawa, Masashi Kimura
Abstract Early prediction of the prevalence of influenza reduces its impact. Various studies have been conducted to predict the number of influenza-infected people. However, these studies are not highly accurate, especially for the distant future, such as more than one month ahead. To deal with this problem, we investigate the sequence to sequence (Seq2Seq) with attention model using Google Trends data to assess and predict the number of influenza-infected people over the course of multiple weeks. Google Trends data help to compensate for the dark figures including the statistics and improve the prediction accuracy. We demonstrate that the attention mechanism is highly effective in improving prediction accuracy and achieves state-of-the-art results, with a Pearson correlation and root-mean-square error of 0.996 and 0.67, respectively. However, the prediction accuracy at the peak of an influenza epidemic is not sufficient, and further investigation is needed to overcome this problem.
Tasks
Published 2019-07-03
URL https://arxiv.org/abs/1907.02786v1
PDF https://arxiv.org/pdf/1907.02786v1.pdf
PWC https://paperswithcode.com/paper/sequence-to-sequence-with-attention-for
Repo
Framework
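
The two reported metrics, Pearson correlation and root-mean-square error, measure different things, which the following sketch illustrates. The arrays are toy values chosen for illustration and are not data from the paper.

```python
import numpy as np


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D arrays."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float(np.dot(yt, yp) / np.sqrt(np.dot(yt, yt) * np.dot(yp, yp)))


def rmse(y_true, y_pred):
    """Root-mean-square error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy series (hypothetical): predictions are offset from the truth by 0.5.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 3.5, 4.5])

r = pearson(y_true, y_pred)
err = rmse(y_true, y_pred)
```

A constant offset leaves the correlation at 1 while the RMSE stays at 0.5, so a high correlation alone does not guarantee accurate case counts.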

Resonator Circuits for factoring high-dimensional vectors

Title Resonator Circuits for factoring high-dimensional vectors
Authors Spencer J. Kent, E. Paxon Frady, Friedrich T. Sommer, Bruno A. Olshausen
Abstract We describe a type of neural network, called a Resonator Circuit, that factors high-dimensional vectors. Given a composite vector formed by the Hadamard product of several other vectors drawn from a discrete set, a Resonator Circuit can efficiently decompose the composite into these factors. This paper focuses on the case of “bipolar” vectors whose elements are $\pm1$ and characterizes the solution quality, stability properties, and speed of Resonator Circuits in comparison to several benchmark optimization methods including Alternating Least Squares, Iterative Soft Thresholding, and Multiplicative Weights. We find that Resonator Circuits substantially outperform these alternative methods by leveraging a combination of powerful nonlinear dynamics and “searching in superposition”, by which we mean that estimates of the correct solution are, at any given time, formed from a weighted superposition of all possible solutions. The considered alternative methods also search in superposition, but the dynamics of Resonator Circuits allow them to strike a more natural balance between exploring the solution space and exploiting local information to drive the network toward probable solutions. Resonator Circuits can be conceptualized as a set of interconnected Hopfield Networks, and this leads to some interesting analysis. In particular, while a Hopfield Network descends an energy function and is guaranteed to converge, a Resonator Circuit is not. However, there exists a high-fidelity regime where Resonator Circuits almost always do converge, and they can solve the factorization problem extremely well. As factorization is central to many aspects of perception and cognition, we believe that Resonator Circuits may bring us a step closer to understanding how this computationally difficult problem is efficiently solved by neural circuits in brains.
Tasks
Published 2019-06-19
URL https://arxiv.org/abs/1906.11684v2
PDF https://arxiv.org/pdf/1906.11684v2.pdf
PWC https://paperswithcode.com/paper/resonator-circuits-for-factoring-high
Repo
Framework
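
The factorization problem in this abstract can be illustrated with bipolar vectors and the Hadamard (elementwise) product. The sketch below shows that multiplying by a known factor undoes it, and solves a small instance by exhaustive search. The exhaustive search is a baseline swapped in for illustration, not the Resonator Circuit dynamics; dimensions, codebook sizes, and names are assumptions.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
dim, n_candidates = 256, 5

# Three codebooks of random bipolar (+1/-1) vectors, one per factor.
codebooks = [rng.choice([-1, 1], size=(n_candidates, dim)) for _ in range(3)]

true_idx = (2, 0, 4)
composite = np.prod([cb[i] for cb, i in zip(codebooks, true_idx)], axis=0)

# Each bipolar vector is its own inverse under the Hadamard product,
# so multiplying by two known factors recovers the third exactly.
recovered = composite * codebooks[1][true_idx[1]] * codebooks[2][true_idx[2]]


def brute_force_factor(composite, codebooks):
    """Try every combination of codebook entries until one matches."""
    for idx in itertools.product(*(range(len(cb)) for cb in codebooks)):
        candidate = np.prod([cb[i] for cb, i in zip(codebooks, idx)], axis=0)
        if np.array_equal(candidate, composite):
            return idx
    return None


found = brute_force_factor(composite, codebooks)
```

The search visits up to 5 x 5 x 5 = 125 combinations here, and that count grows multiplicatively with codebook size and number of factors, which is the cost the abstract's "searching in superposition" is meant to avoid.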

PISEP^2: Pseudo Image Sequence Evolution based 3D Pose Prediction

Title PISEP^2: Pseudo Image Sequence Evolution based 3D Pose Prediction
Authors Xiaoli Liu, Jianqin Yin, Huaping Liu, Yilong Yin
Abstract Pose prediction is to predict future poses given a window of previous poses. In this paper, we propose a new problem that predicts poses using 3D joint coordinate sequences. Different from the traditional pose prediction based on Mocap frames, this problem is convenient to use in real applications due to its simple sensors to capture data. We also present a new framework, PISEP^2 (Pseudo Image Sequence Evolution based 3D Pose Prediction), to address this new problem. Specifically, a skeletal representation is proposed by transforming the joint coordinate sequence into an image sequence, which can model the different correlations of different joints. With this image based skeletal representation, we model the pose prediction as the evolution of image sequence. Moreover, a novel inference network is proposed to predict all future poses in one step by decoupling the decoders in a non-recursive manner. Compared with the recursive sequence to sequence model, we can improve the computational efficiency and avoid error accumulation significantly. Extensive experiments are carried out on two benchmark datasets (e.g. G3D and FNTU). The proposed method achieves the state-of-the-art performance on both datasets, which demonstrates the effectiveness of our proposed method.
Tasks Pose Prediction
Published 2019-09-04
URL https://arxiv.org/abs/1909.01818v1
PDF https://arxiv.org/pdf/1909.01818v1.pdf
PWC https://paperswithcode.com/paper/pisep2-pseudo-image-sequence-evolution-based
Repo
Framework