January 29, 2020

3228 words 16 mins read

Paper Group ANR 622

Two-Stream FCNs to Balance Content and Style for Style Transfer. Data Augmentation for Deep Transfer Learning. Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers. Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emoti …

Two-Stream FCNs to Balance Content and Style for Style Transfer

Title Two-Stream FCNs to Balance Content and Style for Style Transfer
Authors Duc Minh Vo, Akihiro Sugimoto
Abstract Style transfer renders given image contents in given styles, and it plays an important role in both fundamental computer vision research and industrial applications. Following the success of deep learning based approaches, this problem has recently attracted renewed attention, but it remains difficult because of the trade-off between preserving content and rendering style faithfully. In this paper, we propose end-to-end two-stream Fully Convolutional Networks (FCNs) that aim to balance the contributions of the content and the style in rendered images. Our proposed network consists of encoder and decoder parts. The encoder part uses one FCN for content and one FCN for style; the two FCNs exchange feature injections and are trained independently, one to preserve the semantic content and the other to learn a faithful style representation. The semantic content feature and the style representation feature are then adaptively concatenated and fed into the decoder to generate style-transferred (stylized) images. To train our proposed network, we employ a loss network, the pre-trained VGG-16, to compute a content loss and a style loss, both of which are used for the feature injection as well as the feature concatenation. Our intensive experiments show that our proposed model generates stylized images that are better balanced between content and style than those of state-of-the-art methods, and that it is also fast.
Tasks Style Transfer
Published 2019-11-19
URL https://arxiv.org/abs/1911.08079v1
PDF https://arxiv.org/pdf/1911.08079v1.pdf
PWC https://paperswithcode.com/paper/two-stream-fcns-to-balance-content-and-style
Repo
Framework
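
To make the architecture concrete, here is a minimal PyTorch sketch of the two-stream idea: separate content and style encoders, a single feature-injection point, adaptive concatenation, and a small decoder. The layer sizes, the injection form, and the learned balance scalar are illustrative assumptions, not the paper's exact network.

```python
# Editor's sketch (not the authors' code): a minimal two-stream
# encoder-decoder in PyTorch; the paper's exact layer layout differs.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class TwoStreamFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.content_enc = nn.Sequential(conv_block(3, 64), conv_block(64, 128))
        self.style_enc = nn.Sequential(conv_block(3, 64), conv_block(64, 128))
        # learnable scalar balancing the two streams ("adaptive" concatenation)
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, content, style):
        fc = self.content_enc(content)
        fs = self.style_enc(style)
        fc = fc + self.alpha * fs            # feature injection (assumed form)
        z = torch.cat([(1 - self.alpha) * fc, self.alpha * fs], dim=1)
        return self.decoder(z)

out = TwoStreamFCN()(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```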

Data Augmentation for Deep Transfer Learning

Title Data Augmentation for Deep Transfer Learning
Authors Cameron R. Wolfe, Keld T. Lundgaard
Abstract Current approaches to deep learning are beginning to rely heavily on transfer learning as an effective method for reducing overfitting, improving model performance, and quickly learning new tasks. Similarly, such pre-trained models are often used to create embedding representations for various types of data, such as text and images, which can then be fed as input into separate, downstream models. However, in cases where such transfer learning models perform poorly (i.e., for data outside the training distribution), one must resort to fine-tuning such models, or even retraining them completely. Currently, no form of data augmentation has been proposed that can be applied directly to embedding inputs to improve downstream model performance. In this work, we introduce four new types of data augmentation that are generally applicable to embedding inputs, making them useful in both Natural Language Processing (NLP) and Computer Vision (CV) applications. For models trained on downstream tasks with such embedding inputs, these augmentation methods improve the models’ AUC score from 0.9582 to 0.9812 and significantly increase their ability to identify classes of data not seen during training.
Tasks Data Augmentation, Transfer Learning
Published 2019-11-28
URL https://arxiv.org/abs/1912.00772v1
PDF https://arxiv.org/pdf/1912.00772v1.pdf
PWC https://paperswithcode.com/paper/data-augmentation-for-deep-transfer-learning
Repo
Framework
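
The paper's four augmentation methods are not spelled out in the abstract, so the sketch below shows four generic embedding-space augmentations (noise, scaling, interpolation, masking) purely to illustrate the kind of transformation that can be applied directly to embedding inputs.

```python
# Editor's sketch: four generic embedding-space augmentations. These are
# illustrative stand-ins, not the paper's four methods.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(e, sigma=0.01):          # jitter each dimension
    return e + rng.normal(0.0, sigma, e.shape)

def scale(e, lo=0.9, hi=1.1):          # random global rescaling
    return e * rng.uniform(lo, hi)

def interpolate(e1, e2, lam=0.8):      # mixup-style blend of two embeddings
    return lam * e1 + (1 - lam) * e2

def mask(e, p=0.1):                    # zero out a random subset of dims
    return e * (rng.random(e.shape) > p)

emb = rng.normal(size=(4, 768))        # e.g., a batch of text/image embeddings
augmented = np.stack([add_noise(emb[0]), scale(emb[1]),
                      interpolate(emb[2], emb[3]), mask(emb[3])])
print(augmented.shape)  # (4, 768)
```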

Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers

Title Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers
Authors Andreas Gerken, Michael Spranger
Abstract This paper presents a novel model-free Reinforcement Learning algorithm for learning behavior in continuous action, state, and goal spaces. The algorithm approximates optimal value functions using non-parametric estimators and efficiently learns to reach multiple arbitrary goals in deterministic and nondeterministic environments. To improve generalization in the goal space, we propose a novel sample augmentation technique. Using these methods, robots learn faster and obtain better controllers overall. We benchmark the proposed algorithms in simulation and on a real-world voltage-controlled robot that learns to maneuver in a non-observable Cartesian task space.
Tasks
Published 2019-08-27
URL https://arxiv.org/abs/1908.10255v1
PDF https://arxiv.org/pdf/1908.10255v1.pdf
PWC https://paperswithcode.com/paper/continuous-value-iteration-cvi-reinforcement
Repo
Framework
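
The abstract does not detail IER, so the following sketch shows a goal-relabeling scheme in the same spirit (akin to hindsight experience replay): each stored transition is duplicated with imagined goals drawn from visited states. The sparse reward and the goal-sampling rule are assumptions, not the paper's exact algorithm.

```python
# Editor's sketch of goal-space sample augmentation in the spirit of IER:
# relabel stored transitions with imagined goals so one real rollout
# provides training signal for many goals.
import random

def reward(state, goal, tol=0.05):
    """Sparse goal-reaching reward (assumed form)."""
    return 0.0 if abs(state - goal) < tol else -1.0

def imaginary_replay(episode, n_imagined=4):
    """episode: list of (state, action, next_state, goal) tuples."""
    augmented = []
    for (s, a, s2, g) in episode:
        augmented.append((s, a, s2, g, reward(s2, g)))
        for _ in range(n_imagined):
            g_img = random.choice(episode)[2]   # imagined goal: a visited state
            augmented.append((s, a, s2, g_img, reward(s2, g_img)))
    return augmented

episode = [(0.0, 0.1, 0.1, 1.0), (0.1, 0.2, 0.3, 1.0)]
print(len(imaginary_replay(episode)))  # 10 transitions from 2 real steps
```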

Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition

Title Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition
Authors Shruti Gupta, Md. Shah Fahad, Akshay Deepak
Abstract Convolutional neural networks (CNNs) are widely used for speech emotion recognition (SER). In such cases, the short-time Fourier transform (STFT) spectrogram is the most popular choice for representing speech, which is fed as input to the CNN. However, the uncertainty principle of the short-time Fourier transform prevents it from capturing high time and frequency resolution simultaneously. On the other hand, the recently proposed single frequency filtering (SFF) spectrogram promises to be a better alternative because it captures both time and frequency resolution simultaneously. In this work, we explore the SFF spectrogram as an alternative representation of speech for SER. We modify the SFF spectrogram by taking the average of the amplitudes of all the samples between two successive glottal closure instant (GCI) locations. Since the duration between two successive GCI locations gives the pitch, we call the modified SFF spectrogram the pitch-synchronous SFF spectrogram. The GCI locations were detected using the zero-frequency filtering approach. The proposed pitch-synchronous SFF spectrogram produced accuracy values of 63.95% (unweighted) and 70.4% (weighted) on the IEMOCAP dataset. These correspond to improvements of +7.35% (unweighted) and +4.3% (weighted) over the state-of-the-art result on the STFT spectrogram using a CNN. Notably, the proposed method recognized 22.7% of the happy emotion samples correctly, whereas this number was 0% for the state-of-the-art method. These results also promise a much wider use of the proposed pitch-synchronous SFF spectrogram in other speech-based applications.
Tasks Emotion Recognition, Speech Emotion Recognition
Published 2019-08-07
URL https://arxiv.org/abs/1908.03054v1
PDF https://arxiv.org/pdf/1908.03054v1.pdf
PWC https://paperswithcode.com/paper/pitch-synchronous-single-frequency-filtering
Repo
Framework
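
Below is a rough numpy sketch of SFF followed by pitch-synchronous averaging. GCI locations are taken as given (the paper obtains them via zero-frequency filtering), and the pole radius r = 0.995 is a common SFF choice rather than the authors' confirmed setting.

```python
# Editor's sketch of single frequency filtering (SFF) plus
# pitch-synchronous averaging between glottal closure instants.
import numpy as np

def sff_spectrogram(x, fs, freqs, r=0.995):
    n = np.arange(len(x))
    spec = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        shifted = x * np.exp(-2j * np.pi * f * n / fs)  # shift f down to DC
        y = np.zeros(len(x), dtype=complex)
        for i in range(1, len(x)):                      # single-pole filter
            y[i] = r * y[i - 1] + shifted[i]
        spec[k] = np.abs(y)                             # amplitude envelope
    return spec

def pitch_synchronous(spec, gcis):
    # average amplitudes between successive glottal closure instants
    return np.stack([spec[:, a:b].mean(axis=1)
                     for a, b in zip(gcis[:-1], gcis[1:])], axis=1)

fs = 8000
x = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)
spec = sff_spectrogram(x, fs, freqs=np.arange(100, 1000, 100))
print(pitch_synchronous(spec, gcis=[0, 80, 160, 240]).shape)  # (9, 3)
```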

Learning GANs and Ensembles Using Discrepancy

Title Learning GANs and Ensembles Using Discrepancy
Authors Ben Adlam, Corinna Cortes, Mehryar Mohri, Ningshan Zhang
Abstract Generative adversarial networks (GANs) generate data based on minimizing a divergence between two distributions. The choice of that divergence is therefore critical. We argue that the divergence must take into account the hypothesis set and the loss function used in a subsequent learning task, where the data generated by a GAN is used for training. Taking that structural information into account is also important for deriving generalization guarantees. Thus, we propose to use the discrepancy measure, which was originally introduced for the closely related problem of domain adaptation and which precisely takes into account the hypothesis set and the loss function. We show that discrepancy admits favorable properties for training GANs and prove explicit generalization guarantees. We present efficient algorithms using discrepancy for two tasks: training a GAN directly, namely DGAN, and mixing previously trained generative models, namely EDGAN. Our experiments on toy examples and several benchmark datasets show that DGAN is competitive with other GANs and that EDGAN outperforms existing GAN ensembles, such as AdaGAN.
Tasks Domain Adaptation
Published 2019-10-20
URL https://arxiv.org/abs/1910.08965v2
PDF https://arxiv.org/pdf/1910.08965v2.pdf
PWC https://paperswithcode.com/paper/learning-gans-and-ensembles-using-discrepancy
Repo
Framework
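
To illustrate the quantity being optimized, the sketch below estimates the discrepancy between two sample sets for a tiny finite hypothesis set under squared loss: the largest gap, over hypothesis pairs, between their expected losses on the two distributions. DGAN/EDGAN optimize this with neural hypothesis classes; the finite class here is purely illustrative.

```python
# Editor's sketch of the discrepancy measure for a small finite hypothesis
# set with squared loss, estimated from samples.
import numpy as np

rng = np.random.default_rng(1)
P = rng.normal(0.0, 1.0, 500)           # "real" samples
Q = rng.normal(0.3, 1.2, 500)           # "generated" samples

hypotheses = [lambda x, a=a: a * x for a in np.linspace(-1, 1, 9)]

def discrepancy(P, Q, H):
    worst = 0.0
    for h in H:
        for h2 in H:
            gap = abs(np.mean((h(P) - h2(P)) ** 2) -
                      np.mean((h(Q) - h2(Q)) ** 2))
            worst = max(worst, gap)
    return worst

print(f"disc(P, Q) = {discrepancy(P, Q, hypotheses):.4f}")
```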

Complex-valued neural networks for machine learning on non-stationary physical data

Title Complex-valued neural networks for machine learning on non-stationary physical data
Authors Jesper Sören Dramsch, Mikael Lüthje, Anders Nymark Christensen
Abstract Deep learning has become an area of interest in most scientific areas, including the physical sciences. Modern networks apply real-valued transformations to the data. In particular, convolutions in convolutional neural networks discard phase information entirely. Many deterministic signals, such as seismic data or electrical signals, contain significant information in the phase of the signal. We explore complex-valued deep convolutional networks to leverage non-linear feature maps. Seismic data commonly has a low-cut filter applied to attenuate noise from ocean waves and similar long-wavelength contributions. Discarding the phase information leads to low-frequency aliasing analogous to the Nyquist-Shannon theorem for high frequencies. In non-stationary data, the phase content can stabilize training and improve the generalizability of neural networks. While it has been shown that phase content can be restored in deep neural networks, we show how including phase information in feature maps improves both training and inference from deterministic physical data. Furthermore, we show that complex-valued networks with fewer parameters outperform larger real-valued networks.
Tasks
Published 2019-05-29
URL https://arxiv.org/abs/1905.12321v2
PDF https://arxiv.org/pdf/1905.12321v2.pdf
PWC https://paperswithcode.com/paper/complex-valued-neural-networks-for-machine
Repo
Framework
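
A complex-valued convolution is commonly built from two real convolutions via (a + ib)(w_r + i·w_i) = (a·w_r − b·w_i) + i(a·w_i + b·w_r); the sketch below shows that standard construction in PyTorch. The paper's full networks add complex activations and deeper stacks.

```python
# Editor's sketch: a complex-valued 2-D convolution built from two real
# convolutions, applied to separate real and imaginary feature maps.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.real = nn.Conv2d(cin, cout, k, padding=k // 2)
        self.imag = nn.Conv2d(cin, cout, k, padding=k // 2)

    def forward(self, x_re, x_im):
        out_re = self.real(x_re) - self.imag(x_im)
        out_im = self.imag(x_re) + self.real(x_im)
        return out_re, out_im

x_re, x_im = torch.randn(1, 1, 32, 32), torch.randn(1, 1, 32, 32)
y_re, y_im = ComplexConv2d(1, 8)(x_re, x_im)
print(y_re.shape, y_im.shape)  # both torch.Size([1, 8, 32, 32])
```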

Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions

Title Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions
Authors Suraj Tripathi, Abhay Kumar, Abhiram Ramesh, Chirag Singh, Promod Yenigalla
Abstract This paper proposes a speech emotion recognition method based on speech features and speech transcriptions (text). Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCC) help retain emotion-related low-level characteristics in speech whereas text helps capture semantic meaning, both of which help in different aspects of emotion detection. We experimented with several Deep Neural Network (DNN) architectures, which take in different combinations of speech features and text as inputs. The proposed network architectures achieve higher accuracies when compared to state-of-the-art methods on a benchmark dataset. The combined MFCC-Text Convolutional Neural Network (CNN) model proved to be the most accurate in recognizing emotions in IEMOCAP data.
Tasks Emotion Recognition, Speech Emotion Recognition
Published 2019-06-11
URL https://arxiv.org/abs/1906.05681v1
PDF https://arxiv.org/pdf/1906.05681v1.pdf
PWC https://paperswithcode.com/paper/deep-learning-based-emotion-recognition
Repo
Framework
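
A minimal sketch of a combined MFCC+text model follows: one CNN branch per modality, with features concatenated before the classifier. Branch widths, pooling choices, and the four-class head are assumptions, not the paper's exact architecture.

```python
# Editor's sketch of a two-branch MFCC+text emotion classifier in PyTorch.
import torch
import torch.nn as nn

class MFCCTextCNN(nn.Module):
    def __init__(self, vocab=5000, emb=128, n_emotions=4):
        super().__init__()
        self.audio = nn.Sequential(                  # MFCC branch: (B,1,T,40)
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.embed = nn.Embedding(vocab, emb)
        self.text = nn.Sequential(                   # text branch: (B,emb,L)
            nn.Conv1d(emb, 32, 5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten())
        self.head = nn.Linear(16 + 32, n_emotions)

    def forward(self, mfcc, tokens):
        a = self.audio(mfcc)
        t = self.text(self.embed(tokens).transpose(1, 2))
        return self.head(torch.cat([a, t], dim=1))

logits = MFCCTextCNN()(torch.randn(2, 1, 100, 40),
                       torch.randint(0, 5000, (2, 30)))
print(logits.shape)  # torch.Size([2, 4])
```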

Detection of Collision-Prone Vehicle Behavior at Intersections using Siamese Interaction LSTM

Title Detection of Collision-Prone Vehicle Behavior at Intersections using Siamese Interaction LSTM
Authors Debaditya Roy, Tetsuhiro Ishizaka, Krishna Mohan C., Atsushi Fukuda
Abstract As a large proportion of road accidents occur at intersections, monitoring the traffic safety of intersections is important. Existing approaches are designed to investigate accidents in lane-based traffic. However, such approaches are not suitable in a lane-less mixed-traffic environment where vehicles often ply very close to each other. Hence, we propose an approach called Siamese Interaction Long Short-Term Memory network (SILSTM) to detect collision-prone vehicle behavior. The SILSTM network learns the interaction trajectory of a vehicle, which describes the interactions of a vehicle with its neighbors at an intersection. Among the hundreds of interactions for every vehicle, there may be only a few that are unsafe; hence, a temporal attention layer is used in the SILSTM network. Furthermore, comparing interaction trajectories requires labeling the trajectories as either unsafe or safe, but such a distinction is highly subjective, especially in lane-less traffic. Hence, in this work, we compute the characteristics of interaction trajectories involved in accidents using the collision energy model. The interaction trajectories that match accident characteristics are labeled as unsafe, while the rest are considered safe. Finally, no existing dataset allows us to monitor a particular intersection for a long duration. Therefore, we introduce the SkyEye dataset, which contains 1 hour of continuous aerial footage from each of four chosen intersections in the city of Ahmedabad, India. A detailed evaluation of SILSTM on the SkyEye dataset shows that unsafe (collision-prone) interaction trajectories can be effectively detected at different intersections.
Tasks
Published 2019-12-10
URL https://arxiv.org/abs/1912.04801v1
PDF https://arxiv.org/pdf/1912.04801v1.pdf
PWC https://paperswithcode.com/paper/detection-of-collision-prone-vehicle-behavior
Repo
Framework
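
The sketch below shows the Siamese-with-temporal-attention skeleton: a shared LSTM branch attention-pools each interaction trajectory into an embedding, and embedding distance scores behavioral similarity. The feature dimension and the distance head are placeholders; the paper defines the interaction features and the unsafe/safe labeling precisely.

```python
# Editor's sketch of a Siamese LSTM with temporal attention over a pair of
# interaction trajectories; shared weights make the two branches "Siamese".
import torch
import torch.nn as nn

class SILSTMBranch(nn.Module):
    def __init__(self, feat_dim=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)

    def forward(self, traj):                     # traj: (B, T, feat_dim)
        h, _ = self.lstm(traj)                   # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)   # temporal attention weights
        return (w * h).sum(dim=1)                # attention-pooled embedding

branch = SILSTMBranch()
e1 = branch(torch.randn(4, 20, 8))
e2 = branch(torch.randn(4, 20, 8))
distance = torch.norm(e1 - e2, dim=1)            # small = similar behavior
print(distance.shape)  # torch.Size([4])
```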

Training Neural Machine Translation To Apply Terminology Constraints

Title Training Neural Machine Translation To Apply Terminology Constraints
Authors Georgiana Dinu, Prashant Mathur, Marcello Federico, Yaser Al-Onaizan
Abstract This paper proposes a novel method to inject custom terminology into neural machine translation at run time. Previous works have mainly proposed modifications to the decoding algorithm in order to constrain the output to include run-time-provided target terms. While effective, these constrained decoding methods add significant computational overhead to the inference step and, as we show in this paper, can be brittle when tested in realistic conditions. In this paper we approach the problem by training a neural MT system to learn how to use custom terminology when provided with the input. Comparative experiments show that our method is not only more effective than a state-of-the-art implementation of constrained decoding, but is also as fast as constraint-free decoding.
Tasks Machine Translation
Published 2019-06-03
URL https://arxiv.org/abs/1906.01105v2
PDF https://arxiv.org/pdf/1906.01105v2.pdf
PWC https://paperswithcode.com/paper/training-neural-machine-translation-to-apply
Repo
Framework
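
One way to train such a system, sketched below, is to splice the required target terms inline into the source at training time so the model learns to copy them through. The tag tokens and formatting here are assumptions in the spirit of the method, not the paper's exact annotation scheme.

```python
# Editor's sketch of train-time terminology injection: target terms are
# spliced inline into the source so the model learns to reproduce them.
def annotate(source_tokens, terminology):
    """terminology: dict mapping a source term to its required target term."""
    out = []
    for tok in source_tokens:
        out.append(tok)
        if tok in terminology:
            # mark the span and append the constrained translation
            out += ["<term>", terminology[tok], "</term>"]
    return out

src = "the patient shows acute symptoms".split()
print(" ".join(annotate(src, {"acute": "akut"})))
# the patient shows acute <term> akut </term> symptoms
```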

Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG)

Title Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG)
Authors John Gideon, Melvin G McInnis, Emily Mower Provost
Abstract Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown to be successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have trouble converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier-to-train “meet in the middle” approach. The model iteratively moves the representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which extends the proposed method to more than two datasets simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
Tasks Domain Generalization, Emotion Recognition, Speech Emotion Recognition
Published 2019-03-28
URL https://arxiv.org/abs/1903.12094v2
PDF https://arxiv.org/pdf/1903.12094v2.pdf
PWC https://paperswithcode.com/paper/barking-up-the-right-tree-improving-cross
Repo
Framework
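
The "meet in the middle" idea can be sketched as follows: a critic learns to tell which corpus a representation came from, and the encoder is then trained to place both corpora at the critic's decision midpoint rather than to fool it outright. The losses, sizes, and alternation schedule below are assumptions; ADDoG's full training procedure is more involved.

```python
# Editor's sketch of a "meet in the middle" adversarial step, not ADDoG's
# actual training code.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(40, 16), nn.ReLU())
critic = nn.Sequential(nn.Linear(16, 1))
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(enc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

xa, xb = torch.randn(32, 40), torch.randn(32, 40) + 0.5   # two corpora
for _ in range(5):
    # critic step: tell corpus A (label 0) from corpus B (label 1)
    za, zb = enc(xa).detach(), enc(xb).detach()
    loss_c = bce(critic(za), torch.zeros(32, 1)) + \
             bce(critic(zb), torch.ones(32, 1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # encoder step: push both corpora toward the 0.5 midpoint
    mid = torch.full((32, 1), 0.5)
    loss_e = bce(critic(enc(xa)), mid) + bce(critic(enc(xb)), mid)
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()

print(f"final critic loss: {loss_c.item():.3f}")
```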

ORSIm Detector: A Novel Object Detection Framework in Optical Remote Sensing Imagery Using Spatial-Frequency Channel Features

Title ORSIm Detector: A Novel Object Detection Framework in Optical Remote Sensing Imagery Using Spatial-Frequency Channel Features
Authors Xin Wu, Danfeng Hong, Jiaojiao Tian, Jocelyn Chanussot, Wei Li, Ran Tao
Abstract With the rapid development of spaceborne imaging techniques, object detection in optical remote sensing imagery has drawn much attention in recent decades. While many advanced works have been developed with powerful learning algorithms, incomplete feature representation still cannot meet the demand for effectively and efficiently handling image deformations, particularly object scaling and rotation. To this end, we propose a novel object detection framework, called the optical remote sensing imagery detector (ORSIm detector), integrating diverse channel feature extraction, feature learning, fast image pyramid matching, and a boosting strategy. The ORSIm detector adopts a novel spatial-frequency channel feature (SFCF) that jointly considers rotation-invariant channel features constructed in the frequency domain and the original spatial channel features (e.g., color channel, gradient magnitude). Subsequently, we refine the SFCF using a learning-based strategy in order to obtain high-level, semantically meaningful features. In the test phase, we achieve fast, coarsely-scaled channel computation by mathematically estimating a scaling factor in the image domain. Extensive experiments on two different airborne datasets demonstrate the superiority and effectiveness of our method in comparison with previous state-of-the-art methods.
Tasks Object Detection
Published 2019-01-23
URL http://arxiv.org/abs/1901.07925v2
PDF http://arxiv.org/pdf/1901.07925v2.pdf
PWC https://paperswithcode.com/paper/orsim-detector-a-novel-object-detection
Repo
Framework
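
To give the flavor of SFCF, the sketch below computes a spatial channel (gradient magnitude) and a frequency-domain, rotation-invariant descriptor: the DFT magnitude of a gradient-orientation histogram, which is unchanged under cyclic shifts and hence under in-plane rotations. This mirrors the idea, not the paper's exact construction.

```python
# Editor's sketch of spatial plus frequency-domain channel features on a
# grayscale patch; illustrative only.
import numpy as np

def sfcf(patch, n_bins=16):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                       # spatial channel
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    hist, _ = np.histogram(ang, bins=n_bins,
                           range=(0, 2 * np.pi), weights=mag)
    freq = np.abs(np.fft.rfft(hist))             # rotation-invariant channel
    return mag, freq

patch = np.random.default_rng(0).random((32, 32))
mag, freq = sfcf(patch)
print(mag.shape, freq.shape)  # (32, 32) (9,)
```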

Image Recognition of Tea Leaf Diseases Based on Convolutional Neural Network

Title Image Recognition of Tea Leaf Diseases Based on Convolutional Neural Network
Authors Xiaoxiao Sun, Shaomin Mu, Yongyu Xu, Zhihao Cao, Tingting Su
Abstract In order to identify and prevent tea leaf diseases effectively, a convolutional neural network (CNN) was used for image recognition of diseased tea leaves. First, image segmentation and data augmentation were used to preprocess the images, which were then fed into the network for training. Second, to reach higher recognition accuracy, the learning rate and number of iterations were tuned, and dropout was added to counter over-fitting. Finally, the experimental results show that the recognition accuracy of the CNN is 93.75%, compared with 89.36% for an SVM and 87.69% for a BP neural network. The CNN-based recognition algorithm therefore classifies better and can effectively improve the recognition efficiency for tea leaf diseases.
Tasks Semantic Segmentation
Published 2019-01-09
URL http://arxiv.org/abs/1901.02694v1
PDF http://arxiv.org/pdf/1901.02694v1.pdf
PWC https://paperswithcode.com/paper/image-recognition-of-tea-leaf-diseases-based
Repo
Framework
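
A small CNN of the kind the paper tunes might look like the sketch below; the filter counts, input size, and seven-class output are placeholders rather than the authors' reported network.

```python
# Editor's sketch of a compact CNN classifier with dropout against
# over-fitting, as the abstract describes.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
    nn.Dropout(0.5),                    # added to curb over-fitting
    nn.Linear(128, 7))                  # e.g., 7 tea-leaf disease classes

logits = model(torch.randn(8, 3, 64, 64))
print(logits.shape)  # torch.Size([8, 7])
```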

Sparse Bayesian Learning Approach for Discrete Signal Reconstruction

Title Sparse Bayesian Learning Approach for Discrete Signal Reconstruction
Authors Jisheng Dai, An Liu, Hing Cheung So
Abstract This study addresses the problem of discrete signal reconstruction from the perspective of sparse Bayesian learning (SBL). Generally, it is intractable to perform Bayesian inference with the ideal discretization prior under the SBL framework. To overcome this challenge, we introduce a novel discretization-enforcing prior to exploit the knowledge of the discrete nature of the signal of interest. By integrating the discretization-enforcing prior into the SBL framework and applying the variational Bayesian inference (VBI) methodology, we devise an alternating update algorithm to jointly characterize the finite-alphabet feature and reconstruct the unknown signal. When the measurement matrix is i.i.d. Gaussian per component, we further embed generalized approximate message passing (GAMP) into the VBI-based method, so as to directly adopt the ideal prior and significantly reduce the computational burden. Simulation results demonstrate substantial performance improvement of the two proposed methods over existing schemes. Moreover, the GAMP-based variant outperforms the VBI-based method when the measurement matrix is i.i.d. Gaussian, but it fails to work for non-i.i.d. Gaussian matrices.
Tasks Bayesian Inference
Published 2019-06-01
URL https://arxiv.org/abs/1906.00309v1
PDF https://arxiv.org/pdf/1906.00309v1.pdf
PWC https://paperswithcode.com/paper/190600309
Repo
Framework
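
For orientation, the sketch below runs the standard SBL/EM baseline that the paper builds on: a posterior update of the coefficients, then an EM update of the per-coefficient prior variances. The discretization-enforcing prior, VBI, and GAMP variants are the paper's contributions and are not reproduced here.

```python
# Editor's sketch of plain sparse Bayesian learning (SBL) for y = Ax + n.
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 80
A = rng.normal(size=(n, m)) / np.sqrt(n)
x_true = np.zeros(m)
x_true[rng.choice(m, 5, replace=False)] = rng.choice([-1, 1], 5)
y = A @ x_true + 0.01 * rng.normal(size=n)

gamma = np.ones(m)              # per-coefficient prior variances
sigma2 = 1e-4                   # known noise variance (assumed)
for _ in range(100):
    # posterior mean/covariance of x given current hyperparameters
    Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(1.0 / gamma))
    mu = Sigma @ A.T @ y / sigma2
    # EM update of the variances (floored for numerical stability)
    gamma = np.maximum(mu ** 2 + np.diag(Sigma), 1e-12)

print("recovered support:", np.flatnonzero(np.abs(mu) > 0.5))
print("true support:     ", np.flatnonzero(x_true))
```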

A spelling correction model for end-to-end speech recognition

Title A spelling correction model for end-to-end speech recognition
Authors Jinxi Guo, Tara N. Sainath, Ron J. Weiss
Abstract Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component of the end-to-end model is trained only on transcribed audio-text pairs, which leads to performance degradation, especially on rare words. While a variety of work has looked at incorporating an external LM trained on text-only data into the end-to-end framework, none of it has taken into account the characteristic error distribution of the model. In this paper, we propose a novel approach to utilizing text-only data by training a spelling correction (SC) model to explicitly correct those errors. On the LibriSpeech dataset, we demonstrate that the proposed model yields an 18.6% relative improvement in WER over the baseline model when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list using an external LM.
Tasks End-To-End Speech Recognition, Language Modelling, Speech Recognition, Spelling Correction
Published 2019-02-19
URL http://arxiv.org/abs/1902.07178v1
PDF http://arxiv.org/pdf/1902.07178v1.pdf
PWC https://paperswithcode.com/paper/a-spelling-correction-model-for-end-to-end
Repo
Framework
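
One way to obtain training pairs that reflect the recognizer's real error distribution, sketched below, is to synthesize speech for text-only data and decode it with the baseline ASR; the resulting (hypothesis, reference) pairs then train the SC model. `tts` and `asr` are hypothetical stand-ins for those systems.

```python
# Editor's sketch of building spelling-correction (SC) training pairs from
# text-only data; the TTS and ASR systems are placeholders.
def make_sc_pairs(text_corpus, tts, asr):
    pairs = []
    for reference in text_corpus:
        audio = tts(reference)                  # synthesize speech for text
        hypothesis = asr(audio)                 # decode with the baseline ASR
        pairs.append((hypothesis, reference))   # SC input -> SC target
    return pairs

# toy stand-ins so the sketch runs end to end
pairs = make_sc_pairs(["the quick brown fox"],
                      tts=lambda t: b"<audio>",
                      asr=lambda a: "the quik brown fox")
print(pairs)  # [('the quik brown fox', 'the quick brown fox')]
```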

The University of Sydney’s Machine Translation System for WMT19

Title The University of Sydney’s Machine Translation System for WMT19
Authors Liang Ding, Dacheng Tao
Abstract This paper describes the University of Sydney’s submission to the WMT 2019 shared news translation task. We participated in the Finnish→English direction and obtained the best BLEU score (33.0) among all participants. Our system is based on self-attentional Transformer networks, into which we integrated the most recent effective strategies from academic research (e.g., BPE, back-translation, multi-feature data selection, data augmentation, greedy model ensembling, reranking, ConMBR system combination, and post-processing). Furthermore, we propose a novel augmentation method, Cycle Translation, and a data mixture strategy, Big/Small parallel construction, to fully exploit the synthetic corpus. Extensive experiments show that adding the above techniques yields continuous improvements in BLEU, and our best result outperforms the baseline (a Transformer ensemble trained on the original parallel corpus) by approximately 5.3 BLEU, achieving state-of-the-art performance.
Tasks Data Augmentation, Machine Translation
Published 2019-06-30
URL https://arxiv.org/abs/1907.00494v1
PDF https://arxiv.org/pdf/1907.00494v1.pdf
PWC https://paperswithcode.com/paper/the-university-of-sydneys-machine-translation
Repo
Framework
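
The Cycle Translation idea can be sketched as routing sentences through forward and backward systems to synthesize new parallel pairs. `translate_fi_en` and `translate_en_fi` below are hypothetical stand-ins for trained NMT models, and the paper's exact corpus-construction details differ.

```python
# Editor's sketch of a cycle-translation loop for synthetic parallel data.
def cycle_translate(fi_sentences, translate_fi_en, translate_en_fi):
    pairs = []
    for fi in fi_sentences:
        en = translate_fi_en(fi)          # forward pass: Finnish -> English
        fi_cycled = translate_en_fi(en)   # cycle back: English -> Finnish
        pairs.append((fi_cycled, en))     # synthetic parallel pair
    return pairs

# toy stand-ins so the sketch runs end to end
demo = cycle_translate(["hyvää huomenta"],
                       translate_fi_en=lambda s: "good morning",
                       translate_en_fi=lambda s: "hyvaa huomenta")
print(demo)  # [('hyvaa huomenta', 'good morning')]
```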