January 25, 2020


Paper Group ANR 1745

Positively Scale-Invariant Flatness of ReLU Neural Networks. Neural Embedding Allocation: Distributed Representations of Topic Models. Spacetime Graph Optimization for Video Object Segmentation. Key Instance Selection for Unsupervised Video Object Segmentation. On Modeling ASR Word Confidence. Transformer-Transducer: End-to-End Speech Recognition w …

Positively Scale-Invariant Flatness of ReLU Neural Networks


Title	Positively Scale-Invariant Flatness of ReLU Neural Networks
Authors	Mingyang Yi, Qi Meng, Wei Chen, Zhi-ming Ma, Tie-Yan Liu
Abstract	It was empirically confirmed by Keskar et al. \cite{SharpMinima} that flatter minima generalize better. However, for the popular ReLU network, a sharp minimum can also generalize well \cite{SharpMinimacan}. This conclusion demonstrates that the existing definitions of flatness fail to account for the complex geometry of ReLU neural networks because they cannot capture the Positively Scale-Invariant (PSI) property of ReLU networks. In this paper, we formalize the problem that the PSI property causes for existing definitions of flatness and propose a new description of flatness, PSI-flatness. PSI-flatness is defined on the values of basis paths \cite{GSGD} instead of weights. Values of basis paths have been shown to be PSI-variables and can sufficiently represent ReLU neural networks, which ensures the PSI property of PSI-flatness. Then we study the relation between PSI-flatness and generalization theoretically and empirically. First, we formulate a generalization bound based on PSI-flatness, which shows the generalization error decreasing with the ratio between the largest basis path value and the smallest basis path value. That is to say, a minimum with balanced values of basis paths is more likely to be flatter and generalize better. Finally, we visualize the PSI-flatness of the loss surface around two learned models, which indicates that the minimum with smaller PSI-flatness can indeed generalize better.
Tasks
Published	2019-03-06
URL	http://arxiv.org/abs/1903.02237v1
PDF	http://arxiv.org/pdf/1903.02237v1.pdf
PWC	https://paperswithcode.com/paper/positively-scale-invariant-flatness-of-relu
Repo
Framework
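
The positive scale invariance that the abstract refers to can be illustrated with a minimal sketch. It assumes a bias-free two-layer ReLU network; the helper names (forward, rescale_hidden_unit, path_value) and all numbers are hypothetical, and path_value (the product of the weights along one input-to-output path) is only an illustrative stand-in for the basis-path values used in the paper.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def forward(W1, W2, x):
    """Bias-free two-layer ReLU network: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def rescale_hidden_unit(W1, W2, j, c):
    """Scale hidden unit j's incoming weights by c > 0 and its outgoing weights by 1/c."""
    W1s = [[w * c for w in row] if i == j else list(row) for i, row in enumerate(W1)]
    W2s = [[w / c if k == j else w for k, w in enumerate(row)] for row in W2]
    return W1s, W2s

def path_value(W1, W2, i, j, k=0):
    """Product of the weights along the path input i -> hidden j -> output k."""
    return W2[k][j] * W1[j][i]

# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]
W2 = [[2.0, -1.0, 0.5]]
x = [2.0, 0.5]

W1s, W2s = rescale_hidden_unit(W1, W2, j=1, c=10.0)
y = forward(W1, W2, x)
ys = forward(W1s, W2s, x)

# The weights differ, but the network output and the path value agree
# up to floating-point rounding.
print(y, ys)
print(path_value(W1, W2, 0, 1), path_value(W1s, W2s, 0, 1))
```

Because the two weight settings compute the same function, any flatness measure defined directly on the weights can assign them different values, which is the mismatch the abstract describes.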

Neural Embedding Allocation: Distributed Representations of Topic Models


Title	Neural Embedding Allocation: Distributed Representations of Topic Models
Authors	Kamrun Naher Keya, Yannis Papanikolaou, James R. Foulds
Abstract	Word embedding models such as the skip-gram learn vector representations of words’ semantic relationships, and document embedding models learn similar representations for documents. On the other hand, topic models provide latent representations of the documents’ topical themes. To get the benefits of these representations simultaneously, we propose a unifying algorithm, called neural embedding allocation (NEA), which deconstructs topic models into interpretable vector-space embeddings of words, topics, documents, authors, and so on, by learning neural embeddings to mimic the topic models. We showcase NEA’s effectiveness and generality on LDA, author-topic models and the recently proposed mixed membership skip gram topic model and achieve better performance with the embeddings compared to several state-of-the-art models. Furthermore, we demonstrate that using NEA to smooth out the topics improves coherence scores over the original topic models when the number of topics is large.
Tasks	Document Embedding, Topic Models
Published	2019-09-10
URL	https://arxiv.org/abs/1909.04702v1
PDF	https://arxiv.org/pdf/1909.04702v1.pdf
PWC	https://paperswithcode.com/paper/neural-embedding-allocation-distributed
Repo
Framework

Spacetime Graph Optimization for Video Object Segmentation


Title	Spacetime Graph Optimization for Video Object Segmentation
Authors	Emanuela Haller, Adina Magda Florea, Marius Leordeanu
Abstract	We address the challenging task of foreground object discovery and segmentation in video. We introduce an efficient solution, suitable for both unsupervised and supervised scenarios, based on a spacetime graph representation of the video sequence. We ensure a fine grained representation with one-to-one correspondences between graph nodes and video pixels. We formulate the task as a spectral clustering problem by exploiting the spatio-temporal consistency between the scene elements in terms of motion and appearance. Graph nodes that belong to the main object of interest should form a strong cluster, as they are linked through long range optical flow chains and have similar motion and appearance features along those chains. On one hand, the optimization problem aims to maximize the segmentation clustering score based on the motion structure through space and time. On the other hand, the segmentation should be consistent with respect to node features. Our approach leads to a graph formulation in which the segmentation solution becomes the principal eigenvector of a novel Feature-Motion matrix. While the actual matrix is not computed explicitly, the proposed algorithm efficiently computes, in a few iteration steps, the principal eigenvector that captures the segmentation of the main object in the video. The proposed algorithm, GO-VOS, produces a global optimum solution and, consequently, it does not depend on initialization. In practice, GO-VOS achieves state of the art results on three challenging datasets used in current literature: DAVIS, SegTrack and YouTube-Objects.
Tasks	Optical Flow Estimation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published	2019-07-07
URL	https://arxiv.org/abs/1907.03326v2
PDF	https://arxiv.org/pdf/1907.03326v2.pdf
PWC	https://paperswithcode.com/paper/spacetime-graph-optimization-for-video-object
Repo
Framework
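
The idea of obtaining a principal eigenvector in a few iteration steps without forming the matrix explicitly can be sketched with generic power iteration. This is not the GO-VOS algorithm or its Feature-Motion matrix: the toy graph, the edge weights, and the function names below are assumptions chosen only to show a strongly connected cluster dominating the leading eigenvector.

```python
import math

def make_matvec(n, edges):
    """Return a function computing A @ x for a symmetric weighted graph
    given as (i, j, w) edges, without ever building the matrix A."""
    def matvec(x):
        y = [0.0] * n
        for i, j, w in edges:
            y[i] += w * x[j]
            y[j] += w * x[i]
        return y
    return matvec

def power_iteration(matvec, n, num_iters=100):
    """Approximate the principal eigenvector and eigenvalue by repeated
    multiplication and normalization."""
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(num_iters):
        y = matvec(x)
        norm = math.sqrt(sum(a * a for a in y))
        x = [a / norm for a in y]
    # Rayleigh quotient of the unit vector x estimates the eigenvalue.
    lam = sum(a * b for a, b in zip(x, matvec(x)))
    return x, lam

# Toy graph: nodes 0-2 form a strong cluster, node 3 is weakly attached.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (0, 3, 0.1), (1, 3, 0.1), (2, 3, 0.1)]
v, lam = power_iteration(make_matvec(4, edges), 4)
print(v, lam)
```

In this toy case the entries of v for nodes 0 to 2 are much larger than the entry for node 3, so thresholding v separates the cluster from the weakly linked node.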

Key Instance Selection for Unsupervised Video Object Segmentation


Title	Key Instance Selection for Unsupervised Video Object Segmentation
Authors	Donghyeon Cho, Sungeun Hong, Sungil Kang, Jiwon Kim
Abstract	This paper proposes key instance selection based on video saliency covering objectness and dynamics for unsupervised video object segmentation (UVOS). Our method takes frames sequentially and extracts object proposals with corresponding masks for each frame. We link objects according to their similarity until the M-th frame and then assign them unique IDs (i.e., instances). The similarity measure takes into account multiple properties such as the ReID descriptor, expected trajectory, and semantic co-segmentation result. After the M-th frame, we select K IDs based on video saliency and frequency of appearance; then only these key IDs are tracked through the remaining frames. Thanks to these technical contributions, our results are ranked third on the leaderboard of the UVOS DAVIS challenge.
Tasks	Semantic Segmentation, Unsupervised Video Object Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published	2019-06-18
URL	https://arxiv.org/abs/1906.07851v2
PDF	https://arxiv.org/pdf/1906.07851v2.pdf
PWC	https://paperswithcode.com/paper/key-instance-selection-for-unsupervised-video
Repo
Framework

On Modeling ASR Word Confidence


Title	On Modeling ASR Word Confidence
Authors	Woojay Jeon, Maxwell Jordan, Mahesh Krishnamoorthy
Abstract	We present a new method for computing ASR word confidences that effectively mitigates the effect of ASR errors for diverse downstream applications, improves the word error rate of the 1-best result, and allows better comparison of scores across different models. We propose 1) a new method for modeling word confidence using a Heterogeneous Word Confusion Network (HWCN) that addresses some key flaws in conventional Word Confusion Networks, and 2) a new score calibration method for facilitating direct comparison of scores from different models. Using a bidirectional lattice recurrent neural network to compute the confidence scores of each word in the HWCN, we show that the word sequence with the best overall confidence is more accurate than the default 1-best result of the recognizer, and that the calibration method can substantially improve the reliability of recognizer combination.
Tasks	Calibration
Published	2019-07-22
URL	https://arxiv.org/abs/1907.09636v3
PDF	https://arxiv.org/pdf/1907.09636v3.pdf
PWC	https://paperswithcode.com/paper/on-modeling-asr-word-confidence
Repo
Framework

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention


Title	Transformer-Transducer: End-to-End Speech Recognition with Self-Attention
Authors	Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer
Abstract	We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference, and 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieves word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is the input sequence length.
Tasks	End-To-End Speech Recognition, Speech Recognition
Published	2019-10-28
URL	https://arxiv.org/abs/1910.12977v1
PDF	https://arxiv.org/pdf/1910.12977v1.pdf
PWC	https://paperswithcode.com/paper/transformer-transducer-end-to-end-speech
Repo
Framework
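
One way to picture truncated self-attention is as a banded attention mask. The sketch below is an assumption-laden illustration, not the paper's implementation: the parameter names left and right (the number of past and future frames each position may attend to) and the window sizes are hypothetical.

```python
def truncated_attention_mask(T, left, right):
    """mask[t][s] is True when frame t may attend to frame s,
    i.e. when s lies in the window [t - left, t + right]."""
    return [[(t - left) <= s <= (t + right) for s in range(T)]
            for t in range(T)]

# Hypothetical window: 2 past frames and 1 future frame per position.
mask = truncated_attention_mask(T=6, left=2, right=1)
allowed = sum(sum(row) for row in mask)

for row in mask:
    print("".join("x" if m else "." for m in row))
print(allowed)
```

Each row has at most left + right + 1 allowed positions, so the number of attended pairs grows linearly in T instead of quadratically, and a bounded right context is what makes streaming possible.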

Detecting GAN generated errors


Title	Detecting GAN generated errors
Authors	Xiru Zhu, Fengdi Che, Tianzi Yang, Tzuyang Yu, David Meger, Gregory Dudek
Abstract	Despite an impressive performance from the latest GAN for generating hyper-realistic images, GAN discriminators have difficulty evaluating the quality of an individual generated sample. This is because the task of evaluating the quality of a generated image differs from deciding if an image is real or fake. A generated image could be perfect except in a single area but still be detected as fake. Instead, we propose a novel approach for detecting where errors occur within a generated image. By collaging real images with generated images, we compute for each pixel, whether it belongs to the real distribution or generated distribution. Furthermore, we leverage attention to model long-range dependency; this allows detection of errors which are reasonable locally but not holistically. For evaluation, we show that our error detection can act as a quality metric for an individual image, unlike FID and IS. We leverage Improved Wasserstein, BigGAN, and StyleGAN to show a ranking based on our metric correlates impressively with FID scores. Our work opens the door for better understanding of GAN and the ability to select the best samples from a GAN model.
Tasks
Published	2019-12-02
URL	https://arxiv.org/abs/1912.00527v1
PDF	https://arxiv.org/pdf/1912.00527v1.pdf
PWC	https://paperswithcode.com/paper/detecting-gan-generated-errors
Repo
Framework

Emotion Recognition in Low-Resource Settings: An Evaluation of Automatic Feature Selection Methods


Title	Emotion Recognition in Low-Resource Settings: An Evaluation of Automatic Feature Selection Methods
Authors	Fasih Haider, Senja Pollak, Pierre Albert, Saturnino Luz
Abstract	Research in automatic emotion recognition has seldom addressed the issue of computational resource utilization. With the advent of ambient technology, which employs a variety of low-power, resource constrained devices, this issue is increasingly gaining interest. This is especially the case in the context of health and elderly care technologies, where interventions aim at maintaining the user’s independence as unobtrusively as possible. In this context, efforts are being made to model human social signals such as emotions, which can aid health monitoring. This paper focuses on emotion recognition from speech data. In order to minimize the system’s memory and computational needs, a minimum number of features should be extracted for use in machine learning models. A number of feature set reduction methods exist which seek to find minimal sets of relevant features. We evaluate three different state of the art feature selection methods: Infinite Latent Feature Selection (ILFS), ReliefF and Fisher (generalized Fisher score), and compare them to our recently proposed feature selection method named ‘Active Feature Selection’ (AFS). The evaluation is performed on three emotion recognition data sets (EmoDB, SAVEE and EMOVO) using two standard speech feature sets (i.e. eGeMAPs and emobase). The results show that similar or better accuracy can be achieved using subsets of features substantially smaller than the entire feature set. A machine learning model trained on a smaller feature set will reduce the memory and computational resources of an emotion recognition system, which can lower the barriers to the use of health monitoring technology.
Tasks	Emotion Recognition, Feature Selection
Published	2019-08-28
URL	https://arxiv.org/abs/1908.10623v1
PDF	https://arxiv.org/pdf/1908.10623v1.pdf
PWC	https://paperswithcode.com/paper/emotion-recognition-in-low-resource-settings
Repo
Framework

End-to-End Speech Recognition: A review for the French Language


Title	End-to-End Speech Recognition: A review for the French Language
Authors	Florian Boyer, Jean-Luc Rouas
Abstract	Recently, end-to-end ASR based either on sequence-to-sequence networks or on the CTC objective function gained a lot of interest from the community, achieving competitive results over traditional systems using robust but complex pipelines. One of the main features of end-to-end systems, in addition to the ability to free themselves from extra linguistic resources such as dictionaries or language models, is the capacity to model acoustic units such as characters, subwords or directly words; opening up the capacity to directly translate speech with different representations or levels of knowledge depending on the target language. In this paper we propose a review of the existing end-to-end ASR approaches for the French language. We compare results to conventional state-of-the-art ASR systems and discuss which units are more suited to model the French language.
Tasks	End-To-End Speech Recognition, Speech Recognition
Published	2019-10-18
URL	https://arxiv.org/abs/1910.08502v2
PDF	https://arxiv.org/pdf/1910.08502v2.pdf
PWC	https://paperswithcode.com/paper/end-to-end-speech-recognition-a-review-for
Repo
Framework

The Bayesian Prophet: A Low-Regret Framework for Online Decision Making


Title	The Bayesian Prophet: A Low-Regret Framework for Online Decision Making
Authors	Alberto Vera, Siddhartha Banerjee
Abstract	We develop a new framework for designing online policies given access to an oracle providing statistical information about an offline benchmark. Having access to such prediction oracles enables simple and natural Bayesian selection policies, and raises the question as to how these policies perform in different settings. Our work makes two important contributions towards this question: First, we develop a general technique we call compensated coupling which can be used to derive bounds on the expected regret (i.e., additive loss with respect to a benchmark) for any online policy and offline benchmark. Second, using this technique, we show that a natural greedy policy, which we call the Bayes Selector, has constant expected regret (i.e., independent of the number of arrivals and resource levels) for a large class of problems we refer to as Online Allocation with finite types, which includes widely-studied Online Packing and Online Matching problems. Our results generalize and simplify several existing results for Online Packing and Online Matching, and suggest a promising pathway for obtaining oracle-driven policies for other online decision-making settings.
Tasks	Decision Making
Published	2019-01-15
URL	https://arxiv.org/abs/1901.05028v2
PDF	https://arxiv.org/pdf/1901.05028v2.pdf
PWC	https://paperswithcode.com/paper/the-bayesian-prophet-a-low-regret-framework
Repo
Framework

UPC: Learning Universal Physical Camouflage Attacks on Object Detectors


Title	UPC: Learning Universal Physical Camouflage Attacks on Object Detectors
Authors	Lifeng Huang, Chengying Gao, Yuyin Zhou, Changqing Zou, Cihang Xie, Alan Yuille, Ning Liu
Abstract	In this paper, we study physical adversarial attacks on object detectors in the wild. Prior works on this matter mostly craft instance-dependent perturbations only for rigid and planar objects. To this end, we propose to learn an adversarial pattern to effectively attack all instances belonging to the same object category (e.g., person, car), referred to as Universal Physical Camouflage Attack (UPC). Concretely, UPC crafts camouflage by jointly fooling the region proposal network, as well as misleading the classifier and the regressor to output errors. In order to make UPC effective for articulated non-rigid or non-planar objects, we introduce a set of transformations for the generated camouflage patterns to mimic their deformable properties. We additionally impose an optimization constraint to make generated patterns look natural to human observers. To fairly evaluate the effectiveness of different physical-world attacks on object detectors, we present the first standardized virtual database, AttackScenes, which simulates the real 3D world in a controllable and reproducible environment. Extensive experiments suggest the superiority of our proposed UPC compared with existing physical adversarial attackers not only in virtual environments (AttackScenes), but also in real-world physical environments. Codes, models, and demos are publicly available at https://mesunhlf.github.io/index_physical.html.
Tasks
Published	2019-09-10
URL	https://arxiv.org/abs/1909.04326v1
PDF	https://arxiv.org/pdf/1909.04326v1.pdf
PWC	https://paperswithcode.com/paper/upc-learning-universal-physical-camouflage
Repo
Framework

Cryptocurrency Price Prediction and Trading Strategies Using Support Vector Machines


Title	Cryptocurrency Price Prediction and Trading Strategies Using Support Vector Machines
Authors	David Zhao, Alessandro Rinaldo, Christopher Brookins
Abstract	Few assets in financial history have been as notoriously volatile as cryptocurrencies. While the long term outlook for this asset class remains unclear, we are successful in making short term price predictions for several major crypto assets. Using historical data from July 2015 to November 2019, we develop a large number of technical indicators to capture patterns in the cryptocurrency market. We then test various classification methods to forecast short-term future price movements based on these indicators. On both PPV and NPV metrics, our classifiers do well in identifying up and down market moves over the next 1 hour. Beyond evaluating classification accuracy, we also develop a strategy for translating 1-hour-ahead class predictions into trading decisions, along with a backtester that simulates trading in a realistic environment. We find that support vector machines yield the most profitable trading strategies, which outperform the market on average for Bitcoin, Ethereum and Litecoin over the past 22 months, since January 2018.
Tasks
Published	2019-11-26
URL	https://arxiv.org/abs/1911.11819v2
PDF	https://arxiv.org/pdf/1911.11819v2.pdf
PWC	https://paperswithcode.com/paper/cryptocurrency-price-prediction-and-trading
Repo
Framework
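
The PPV and NPV metrics mentioned in the abstract are the standard positive and negative predictive values. A minimal sketch follows, assuming labels are encoded as 1 for an up move and 0 for a down move; the encoding and the sample predictions are made up for illustration and are not data from the paper.

```python
def ppv_npv(y_true, y_pred):
    """Positive predictive value TP / (TP + FP) and
    negative predictive value TN / (TN + FN) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

# Illustrative labels only: 1 = price went up over the next hour, 0 = down.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
ppv, npv = ppv_npv(y_true, y_pred)
print(ppv, npv)
```

PPV answers "when the classifier predicts an up move, how often is it right", and NPV answers the same question for predicted down moves, which is why both matter when predictions are turned into trades.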

Harmonized Multimodal Learning with Gaussian Process Latent Variable Models


Title	Harmonized Multimodal Learning with Gaussian Process Latent Variable Models
Authors	Guoli Song, Shuhui Wang, Qingming Huang, Qi Tian
Abstract	Multimodal learning aims to discover the relationship between multiple modalities. It has become an important research topic due to extensive multimodal applications such as cross-modal retrieval. This paper attempts to address the modality heterogeneity problem based on Gaussian process latent variable models (GPLVMs) to represent multimodal data in a common space. Previous multimodal GPLVM extensions generally adopt individual learning schemes on latent representations and kernel hyperparameters, which ignore their intrinsic relationship. To exploit strong complementarity among different modalities and GPLVM components, we develop a novel learning scheme called Harmonization, where latent model parameters are jointly learned from each other. Beyond the correlation fitting or intra-modal structure preservation paradigms widely used in existing studies, the harmonization is derived in a model-driven manner to encourage the agreement between modality-specific GP kernels and the similarity of latent representations. We present a range of multimodal learning models by incorporating the harmonization mechanism into several representative GPLVM-based approaches. Experimental results on four benchmark datasets show that the proposed models outperform the strong baselines for cross-modal retrieval tasks, and that the harmonized multimodal learning method is superior in discovering semantically consistent latent representation.
Tasks	Cross-Modal Retrieval, Latent Variable Models
Published	2019-08-14
URL	https://arxiv.org/abs/1908.04979v1
PDF	https://arxiv.org/pdf/1908.04979v1.pdf
PWC	https://paperswithcode.com/paper/harmonized-multimodal-learning-with-gaussian
Repo
Framework

End-to-End Speech Translation with Knowledge Distillation


Title	End-to-End Speech Translation with Knowledge Distillation
Authors	Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, Chengqing Zong
Abstract	End-to-end speech translation (ST), which directly translates from source language speech into target language text, has attracted intensive attention in recent years. Compared to conventional pipeline systems, end-to-end ST models have the advantages of lower latency, smaller model size and less error propagation. However, the combination of speech recognition and text translation in one model is more difficult than each of these two tasks. In this paper, we propose a knowledge distillation approach to improve the ST model by transferring knowledge from a text translation model. Specifically, we first train a text translation model, regarded as a teacher model, and then the ST model is trained to learn output probabilities from the teacher model through knowledge distillation. Experiments on the English-French Augmented LibriSpeech and English-Chinese TED corpora show that end-to-end ST is possible to implement on both similar and dissimilar language pairs. In addition, with the instruction of the teacher model, the end-to-end ST model can gain significant improvements of over 3.5 BLEU points.
Tasks	Speech Recognition
Published	2019-04-17
URL	http://arxiv.org/abs/1904.08075v1
PDF	http://arxiv.org/pdf/1904.08075v1.pdf
PWC	https://paperswithcode.com/paper/end-to-end-speech-translation-with-knowledge
Repo
Framework
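
Learning output probabilities from a teacher is commonly expressed as a cross-entropy between the teacher's and the student's token distributions. The sketch below shows that generic form for a single output position; the function names, the three-word vocabulary, and the logits are hypothetical, and the paper's exact loss (for example, how it is combined with the ground-truth loss) is not specified in the abstract.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def distillation_loss(student_logits, teacher_probs):
    """Cross-entropy between the teacher distribution and the student
    distribution at one output position."""
    log_p = log_softmax(student_logits)
    return -sum(q * lp for q, lp in zip(teacher_probs, log_p))

# Hypothetical teacher distribution over a 3-word vocabulary.
teacher_probs = [math.exp(lp) for lp in log_softmax([2.0, 0.5, -1.0])]

loss_match = distillation_loss([2.0, 0.5, -1.0], teacher_probs)
loss_off = distillation_loss([-1.0, 0.5, 2.0], teacher_probs)
print(loss_match, loss_off)
```

The loss is smallest when the student reproduces the teacher's distribution, so minimizing it over all output positions pushes the speech model toward the text translation model's predictions.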

Path-planning microswimmers can swim efficiently in turbulent flows


Title	Path-planning microswimmers can swim efficiently in turbulent flows
Authors	Jaya Kumar Alageshan, Akhilesh Kumar Verma, Jérémie Bec, Rahul Pandit
Abstract	We develop an adversarial-reinforcement learning scheme for microswimmers in statistically homogeneous and isotropic turbulent fluid flows, in both two (2D) and three dimensions (3D). We show that this scheme allows microswimmers to find non-trivial paths, which enable them to reach a target on average in less time than a naïve microswimmer, which tries, at any instant of time and at a given position in space, to swim in the direction of the target. We use pseudospectral direct numerical simulations (DNSs) of the 2D and 3D (incompressible) Navier-Stokes equations to obtain the turbulent flows. We then introduce passive microswimmers that try to swim along a given direction in these flows; the microswimmers do not affect the flow, but they are advected by it. Two non-dimensional control parameters play important roles in our learning scheme: (a) the ratio $\tilde{V}_s$ of the microswimmer’s bare velocity $V_s$ and the root-mean-square (rms) velocity $u_{rms}$ of the turbulent fluid; and (b) the product $\tilde{B}$ of the microswimmer-response time $B$ and the rms vorticity $\omega_{rms}$ of the fluid. We show that, in a substantial part of the $\tilde{V}_s-\tilde{B}$ plane, the average time required for the microswimmers to reach the target, by using our adversarial-learning scheme, eventually reduces below the average time taken by microswimmers that follow the naïve strategy.
Tasks
Published	2019-10-03
URL	https://arxiv.org/abs/1910.01728v1
PDF	https://arxiv.org/pdf/1910.01728v1.pdf
PWC	https://paperswithcode.com/paper/path-planning-microswimmers-can-swim
Repo
Framework
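
Written out, the two non-dimensional control parameters described in the abstract take the following form. This is a direct transcription of the verbal definitions, reading "the ratio of $V_s$ and $u_{rms}$" as $V_s$ divided by $u_{rms}$; that reading is an assumption.

```latex
\tilde{V}_s = \frac{V_s}{u_{rms}}, \qquad \tilde{B} = B \, \omega_{rms}
```

Under that reading, $\tilde{V}_s$ compares the swimmer's own speed with the typical flow speed, and $\tilde{B}$ compares its response time with the typical rotation time scale of the flow.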