Paper Group ANR 622
Two-Stream FCNs to Balance Content and Style for Style Transfer. Data Augmentation for Deep Transfer Learning. Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers. Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition. …
Two-Stream FCNs to Balance Content and Style for Style Transfer
Title | Two-Stream FCNs to Balance Content and Style for Style Transfer |
Authors | Duc Minh Vo, Akihiro Sugimoto |
Abstract | Style transfer renders a given image’s content in a given style, and it plays an important role in both fundamental computer vision research and industrial applications. Following the success of deep learning based approaches, this problem has recently attracted renewed attention, but it remains difficult because of the trade-off between preserving content and rendering style faithfully. In this paper, we propose end-to-end two-stream Fully Convolutional Networks (FCNs) that aim to balance the contributions of the content and the style in rendered images. Our proposed network consists of encoder and decoder parts. The encoder part uses one FCN for content and one FCN for style; the two FCNs have feature injections and are trained independently, one to preserve the semantic content and the other to learn a faithful style representation. The semantic content feature and the style representation feature are then concatenated adaptively and fed into the decoder to generate style-transferred (stylized) images. To train the proposed network, we employ a loss network, the pre-trained VGG-16, to compute the content loss and the style loss, both of which are also used for the feature injection and the feature concatenation. Our extensive experiments show that our model generates stylized images whose content and style are more balanced than those of state-of-the-art methods, while also being fast. |
Tasks | Style Transfer |
Published | 2019-11-19 |
URL | https://arxiv.org/abs/1911.08079v1 |
PDF | https://arxiv.org/pdf/1911.08079v1.pdf |
PWC | https://paperswithcode.com/paper/two-stream-fcns-to-balance-content-and-style |
Repo | |
Framework | |
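For intuition, here is a minimal PyTorch sketch of the two-stream encoder-decoder idea described in the abstract: two independent convolutional streams encode content and style, and their features are weighted and concatenated before decoding. The layer sizes, the learnable balancing weight `alpha`, and the omission of the feature-injection and VGG-16 loss machinery are all simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # 3x3 conv + instance norm + ReLU, downsampling by stride 2
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=2, padding=1),
        nn.InstanceNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class TwoStreamFCN(nn.Module):
    """Two encoder streams (content, style) feeding one decoder."""
    def __init__(self):
        super().__init__()
        self.content_enc = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.style_enc = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        # learnable scalar balancing content vs. style at the concatenation
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, content_img, style_img):
        fc = self.content_enc(content_img)
        fs = self.style_enc(style_img)
        # adaptive concatenation: weight each stream before fusing
        fused = torch.cat([self.alpha * fc, (1 - self.alpha) * fs], dim=1)
        return self.decoder(fused)

stylized = TwoStreamFCN()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(stylized.shape)  # torch.Size([1, 3, 64, 64])
```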
Data Augmentation for Deep Transfer Learning
Title | Data Augmentation for Deep Transfer Learning |
Authors | Cameron R. Wolfe, Keld T. Lundgaard |
Abstract | Current approaches to deep learning are beginning to rely heavily on transfer learning as an effective method for reducing overfitting, improving model performance, and quickly learning new tasks. Similarly, such pre-trained models are often used to create embedding representations for various types of data, such as text and images, which can then be fed as input into separate, downstream models. However, in cases where such transfer learning models perform poorly (i.e., for data outside of the training distribution), one must resort to fine-tuning such models, or even retraining them completely. Currently, no form of data augmentation has been proposed that can be applied directly to embedding inputs to improve downstream model performance. In this work, we introduce four new types of data augmentation that are generally applicable to embedding inputs, thus making them useful in both Natural Language Processing (NLP) and Computer Vision (CV) applications. For models trained on downstream tasks with such embedding inputs, these augmentation methods are shown to improve the AUC score of the models from a score of 0.9582 to 0.9812 and significantly increase the model’s ability to identify classes of data that are not seen during training. |
Tasks | Data Augmentation, Transfer Learning |
Published | 2019-11-28 |
URL | https://arxiv.org/abs/1912.00772v1 |
PDF | https://arxiv.org/pdf/1912.00772v1.pdf |
PWC | https://paperswithcode.com/paper/data-augmentation-for-deep-transfer-learning |
Repo | |
Framework | |
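The abstract does not name the four augmentation types, so the sketch below only illustrates the general idea of augmenting embedding inputs directly, using three generic, hypothetical perturbations (Gaussian noise, random masking, and mixup-style interpolation):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(emb, sigma=0.01):
    # perturb the embedding with isotropic Gaussian noise
    return emb + rng.normal(0.0, sigma, size=emb.shape)

def random_mask(emb, p=0.1):
    # zero out a random subset of embedding dimensions
    mask = rng.random(emb.shape) >= p
    return emb * mask

def mixup(emb_a, emb_b, alpha=0.2):
    # interpolate between two embeddings (labels would be mixed the same way)
    lam = rng.beta(alpha, alpha)
    return lam * emb_a + (1 - lam) * emb_b

emb = rng.normal(size=(4, 768))        # e.g. a batch of text/image embeddings
augmented = random_mask(add_noise(emb))
mixed = mixup(emb[0], emb[1])
print(augmented.shape, mixed.shape)    # (4, 768) (768,)
```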
Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers
Title | Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for learning multi-goal, continuous action and state space controllers |
Authors | Andreas Gerken, Michael Spranger |
Abstract | This paper presents a novel model-free Reinforcement Learning algorithm for learning behavior in continuous action, state, and goal spaces. The algorithm approximates optimal value functions using non-parametric estimators. It is able to efficiently learn to reach multiple arbitrary goals in deterministic and nondeterministic environments. To improve generalization in the goal space, we propose a novel sample augmentation technique. Using these methods, robots learn faster and overall better controllers. We benchmark the proposed algorithms using simulation and a real-world voltage-controlled robot that learns to maneuver in a non-observable Cartesian task space. |
Tasks | |
Published | 2019-08-27 |
URL | https://arxiv.org/abs/1908.10255v1 |
PDF | https://arxiv.org/pdf/1908.10255v1.pdf |
PWC | https://paperswithcode.com/paper/continuous-value-iteration-cvi-reinforcement |
Repo | |
Framework | |
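The paper's exact IER scheme is not spelled out in the abstract; the sketch below shows a closely related, hindsight-style goal-relabeling augmentation for a multi-goal replay buffer. The `goal_sampler` and `reward_fn` hooks and the reward rule are assumptions for illustration:

```python
import random

def relabel_with_imagined_goals(transition, k, goal_sampler, reward_fn):
    """Augment one (s, a, s', g, r) transition with k imagined goals.

    goal_sampler draws plausible goals (e.g. states visited later in the
    episode); reward_fn scores reaching a goal. Both are assumptions here.
    """
    s, a, s_next, g, r = transition
    augmented = [transition]                  # keep the original goal g
    for _ in range(k):
        g_im = goal_sampler()                 # imagined goal
        r_im = reward_fn(s_next, g_im)        # reward as if g_im were the target
        augmented.append((s, a, s_next, g_im, r_im))
    return augmented

# toy 1-D example: a goal counts as reached if the state is within 0.1 of it
trans = (0.0, +1.0, 0.9, 2.0, 0.0)
aug = relabel_with_imagined_goals(
    trans, k=2,
    goal_sampler=lambda: random.uniform(0.0, 1.0),
    reward_fn=lambda s, g: 1.0 if abs(s - g) < 0.1 else 0.0,
)
print(len(aug))  # 3: the original transition plus two relabeled copies
```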
Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition
Title | Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition |
Authors | Shruti Gupta, Md. Shah Fahad, Akshay Deepak |
Abstract | Convolutional neural networks (CNN) are widely used for speech emotion recognition (SER). In such cases, the short-time Fourier transform (STFT) spectrogram is the most popular choice for representing speech, which is fed as input to the CNN. However, the uncertainty principle of the short-time Fourier transform prevents it from capturing good time and frequency resolution simultaneously. On the other hand, the recently proposed single frequency filtering (SFF) spectrogram promises to be a better alternative because it captures both time and frequency resolution simultaneously. In this work, we explore the SFF spectrogram as an alternative representation of speech for SER. We have modified the SFF spectrogram by taking the average of the amplitudes of all the samples between two successive glottal closure instant (GCI) locations. Since the duration between two successive GCI locations gives the pitch period, we name the modified SFF spectrogram the pitch-synchronous SFF spectrogram. The GCI locations were detected using the zero-frequency filtering approach. The proposed pitch-synchronous SFF spectrogram produced accuracy values of 63.95% (unweighted) and 70.4% (weighted) on the IEMOCAP dataset. These correspond to improvements of +7.35% (unweighted) and +4.3% (weighted) over the state-of-the-art result on the STFT spectrogram using a CNN. Notably, the proposed method recognized 22.7% of the happy emotion samples correctly, whereas this number was 0% for the state-of-the-art results. These results also promise a much wider use of the proposed pitch-synchronous SFF spectrogram for other speech-based applications. |
Tasks | Emotion Recognition, Speech Emotion Recognition |
Published | 2019-08-07 |
URL | https://arxiv.org/abs/1908.03054v1 |
PDF | https://arxiv.org/pdf/1908.03054v1.pdf |
PWC | https://paperswithcode.com/paper/pitch-synchronous-single-frequency-filtering |
Repo | |
Framework | |
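A sketch of the averaging step the abstract describes: given an SFF amplitude envelope and detected GCI locations, each pitch period is collapsed into one averaged spectral vector. The array shapes and the toy GCI spacing are assumptions; the SFF and zero-frequency-filtering computations themselves are omitted.

```python
import numpy as np

def pitch_synchronous_average(sff, gci):
    """Average SFF amplitudes between successive GCI locations.

    sff: (num_samples, num_freq_bins) single-frequency-filtering envelope;
    gci: sorted sample indices of glottal closure instants. Only the
    averaging step from the abstract is shown.
    """
    frames = []
    for start, end in zip(gci[:-1], gci[1:]):
        frames.append(sff[start:end].mean(axis=0))  # one frame per pitch period
    return np.stack(frames)

sff = np.abs(np.random.default_rng(0).normal(size=(16000, 40)))
gci = np.arange(0, 16001, 160)  # ~100 Hz pitch at 16 kHz sampling (toy values)
spec = pitch_synchronous_average(sff, gci)
print(spec.shape)  # (100, 40): one spectral vector per pitch period
```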
Learning GANs and Ensembles Using Discrepancy
Title | Learning GANs and Ensembles Using Discrepancy |
Authors | Ben Adlam, Corinna Cortes, Mehryar Mohri, Ningshan Zhang |
Abstract | Generative adversarial networks (GANs) generate data based on minimizing a divergence between two distributions. The choice of that divergence is therefore critical. We argue that the divergence must take into account the hypothesis set and the loss function used in a subsequent learning task, where the data generated by a GAN serves for training. Taking that structural information into account is also important to derive generalization guarantees. Thus, we propose to use the discrepancy measure, which was originally introduced for the closely related problem of domain adaptation and which precisely takes into account the hypothesis set and the loss function. We show that discrepancy admits favorable properties for training GANs and prove explicit generalization guarantees. We present efficient algorithms using discrepancy for two tasks: training a GAN directly, namely DGAN, and mixing previously trained generative models, namely EDGAN. Our experiments on toy examples and several benchmark datasets show that DGAN is competitive with other GANs and that EDGAN outperforms existing GAN ensembles, such as AdaGAN. |
Tasks | Domain Adaptation |
Published | 2019-10-20 |
URL | https://arxiv.org/abs/1910.08965v2 |
PDF | https://arxiv.org/pdf/1910.08965v2.pdf |
PWC | https://paperswithcode.com/paper/learning-gans-and-ensembles-using-discrepancy |
Repo | |
Framework | |
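As a toy illustration of the EDGAN-style mixing of previously trained generators, the sketch below samples from a weighted mixture; selecting the weights by discrepancy minimization, which is the paper's actual contribution, is omitted, and the weights here are arbitrary:

```python
import numpy as np

def sample_ensemble(generators, weights, n, rng):
    """Draw n samples from a weighted mixture of pretrained generators.

    Each generator is a callable returning `count` samples; the mixture
    weights would be chosen by discrepancy minimization in EDGAN, but are
    fixed by hand in this sketch.
    """
    weights = np.asarray(weights) / np.sum(weights)
    counts = rng.multinomial(n, weights)           # samples per generator
    samples = [g(c) for g, c in zip(generators, counts)]
    return np.concatenate(samples)

rng = np.random.default_rng(0)
# toy "generators": each returns samples from a different Gaussian
gens = [lambda c, m=m: rng.normal(m, 1.0, size=c) for m in (-2.0, 0.0, 2.0)]
x = sample_ensemble(gens, [0.2, 0.5, 0.3], 1000, rng)
print(x.shape, round(float(x.mean()), 2))
```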
Complex-valued neural networks for machine learning on non-stationary physical data
Title | Complex-valued neural networks for machine learning on non-stationary physical data |
Authors | Jesper Sören Dramsch, Mikael Lüthje, Anders Nymark Christensen |
Abstract | Deep learning has become an area of interest in most scientific areas, including the physical sciences. Modern networks apply real-valued transformations to the data. In particular, convolutions in convolutional neural networks discard phase information entirely. However, many deterministic signals, such as seismic data or electrical signals, contain significant information in the phase of the signal. We explore complex-valued deep convolutional networks to leverage non-linear feature maps. Seismic data commonly has a low-cut filter applied to attenuate noise from ocean waves and similar long-wavelength contributions. Discarding the phase information leads to low-frequency aliasing, analogous to the Nyquist-Shannon theorem for high frequencies. In non-stationary data, the phase content can stabilize training and improve the generalizability of neural networks. While it has been shown that phase content can be restored in deep neural networks, we show how including phase information in feature maps improves both training and inference from deterministic physical data. Furthermore, we show that a complex-valued network with fewer parameters outperforms larger real-valued networks. |
Tasks | |
Published | 2019-05-29 |
URL | https://arxiv.org/abs/1905.12321v2 |
PDF | https://arxiv.org/pdf/1905.12321v2.pdf |
PWC | https://paperswithcode.com/paper/complex-valued-neural-networks-for-machine |
Repo | |
Framework | |
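A common way to realize complex-valued convolutions with standard real-valued layers is to expand the complex product into four real convolutions. The sketch below shows this construction, in the spirit of prior deep complex network work; the channel counts and kernel size are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class ComplexConv1d(nn.Module):
    """Complex convolution built from two real convolutions:
    (W_r + i W_i) * (x_r + i x_i)
      = (W_r*x_r - W_i*x_i) + i (W_r*x_i + W_i*x_r)
    so the phase of the input is propagated rather than discarded."""
    def __init__(self, cin, cout, k):
        super().__init__()
        self.conv_r = nn.Conv1d(cin, cout, k, padding=k // 2)
        self.conv_i = nn.Conv1d(cin, cout, k, padding=k // 2)

    def forward(self, x_r, x_i):
        real = self.conv_r(x_r) - self.conv_i(x_i)
        imag = self.conv_r(x_i) + self.conv_i(x_r)
        return real, imag

# a toy analytic signal: real part and (here random) quadrature part
xr, xi = torch.randn(1, 1, 128), torch.randn(1, 1, 128)
yr, yi = ComplexConv1d(1, 8, 5)(xr, xi)
print(yr.shape, yi.shape)  # torch.Size([1, 8, 128]) twice
```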
Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions
Title | Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions |
Authors | Suraj Tripathi, Abhay Kumar, Abhiram Ramesh, Chirag Singh, Promod Yenigalla |
Abstract | This paper proposes a speech emotion recognition method based on speech features and speech transcriptions (text). Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCC) help retain emotion-related low-level characteristics in speech whereas text helps capture semantic meaning, both of which help in different aspects of emotion detection. We experimented with several Deep Neural Network (DNN) architectures, which take in different combinations of speech features and text as inputs. The proposed network architectures achieve higher accuracies when compared to state-of-the-art methods on a benchmark dataset. The combined MFCC-Text Convolutional Neural Network (CNN) model proved to be the most accurate in recognizing emotions in IEMOCAP data. |
Tasks | Emotion Recognition, Speech Emotion Recognition |
Published | 2019-06-11 |
URL | https://arxiv.org/abs/1906.05681v1 |
PDF | https://arxiv.org/pdf/1906.05681v1.pdf |
PWC | https://paperswithcode.com/paper/deep-learning-based-emotion-recognition |
Repo | |
Framework | |
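A minimal two-branch sketch of the combined MFCC-Text CNN idea: one convolutional branch over MFCC frames, one over word embeddings of the transcription, fused before the emotion classifier. All sizes, the vocabulary, and the fusion scheme are assumptions:

```python
import torch
import torch.nn as nn

class MFCCTextCNN(nn.Module):
    """Two-branch sketch: a CNN over MFCC features and a CNN over word
    embeddings, fused by concatenation before classification."""
    def __init__(self, n_mfcc=40, vocab=10000, emb=128, n_emotions=4):
        super().__init__()
        self.audio = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, 5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.embed = nn.Embedding(vocab, emb)
        self.text = nn.Sequential(
            nn.Conv1d(emb, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Linear(128, n_emotions)

    def forward(self, mfcc, tokens):
        a = self.audio(mfcc).squeeze(-1)                               # (B, 64)
        t = self.text(self.embed(tokens).transpose(1, 2)).squeeze(-1)  # (B, 64)
        return self.head(torch.cat([a, t], dim=1))                     # logits

logits = MFCCTextCNN()(torch.randn(2, 40, 300), torch.randint(0, 10000, (2, 50)))
print(logits.shape)  # torch.Size([2, 4])
```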
Detection of Collision-Prone Vehicle Behavior at Intersections using Siamese Interaction LSTM
Title | Detection of Collision-Prone Vehicle Behavior at Intersections using Siamese Interaction LSTM |
Authors | Debaditya Roy, Tetsuhiro Ishizaka, Krishna Mohan C., Atsushi Fukuda |
Abstract | As a large proportion of road accidents occur at intersections, monitoring the traffic safety of intersections is important. Existing approaches are designed to investigate accidents in lane-based traffic. However, such approaches are not suitable in a lane-less mixed-traffic environment where vehicles often ply very close to each other. Hence, we propose an approach called Siamese Interaction Long Short-Term Memory network (SILSTM) to detect collision-prone vehicle behavior. The SILSTM network learns the interaction trajectory of a vehicle, which describes the interactions of a vehicle with its neighbors at an intersection. Among the hundreds of interactions for every vehicle, only some may be unsafe; hence, a temporal attention layer is used in the SILSTM network. Furthermore, comparing interaction trajectories requires labeling the trajectories as either unsafe or safe, but such a distinction is highly subjective, especially in lane-less traffic. Hence, in this work, we compute the characteristics of interaction trajectories involved in accidents using the collision energy model. The interaction trajectories that match accident characteristics are labeled as unsafe, while the rest are considered safe. Finally, since no existing dataset allows us to monitor a particular intersection for a long duration, we introduce the SkyEye dataset, which contains 1 hour of continuous aerial footage from each of the 4 chosen intersections in the city of Ahmedabad in India. A detailed evaluation of SILSTM on the SkyEye dataset shows that unsafe (collision-prone) interaction trajectories can be effectively detected at different intersections. |
Tasks | |
Published | 2019-12-10 |
URL | https://arxiv.org/abs/1912.04801v1 |
PDF | https://arxiv.org/pdf/1912.04801v1.pdf |
PWC | https://paperswithcode.com/paper/detection-of-collision-prone-vehicle-behavior |
Repo | |
Framework | |
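A compact sketch of the SILSTM idea as the abstract describes it: an LSTM encodes an interaction trajectory, temporal attention pools the per-step outputs (so the few unsafe interactions can dominate), and two trajectories are compared by embedding distance. The feature and hidden dimensions are assumptions:

```python
import torch
import torch.nn as nn

class SiameseInteractionLSTM(nn.Module):
    """Encode an interaction trajectory with an LSTM, pool the per-step
    outputs with temporal attention, and compare two trajectories by
    embedding distance."""
    def __init__(self, feat=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)

    def encode(self, traj):                     # traj: (B, T, feat)
        h, _ = self.lstm(traj)                  # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attend to the unsafe steps
        return (w * h).sum(dim=1)               # (B, hidden)

    def forward(self, traj_a, traj_b):
        # small distance => behaviourally similar trajectories
        return torch.norm(self.encode(traj_a) - self.encode(traj_b), dim=1)

net = SiameseInteractionLSTM()
d = net(torch.randn(4, 20, 8), torch.randn(4, 20, 8))
print(d.shape)  # torch.Size([4])
```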
Training Neural Machine Translation To Apply Terminology Constraints
Title | Training Neural Machine Translation To Apply Terminology Constraints |
Authors | Georgiana Dinu, Prashant Mathur, Marcello Federico, Yaser Al-Onaizan |
Abstract | This paper proposes a novel method to inject custom terminology into neural machine translation at run time. Previous works have mainly proposed modifications to the decoding algorithm in order to constrain the output to include run-time-provided target terms. While effective, these constrained decoding methods add significant computational overhead to the inference step and, as we show in this paper, can be brittle when tested in realistic conditions. In this paper we approach the problem by training a neural MT system to learn how to use custom terminology when provided with the input. Comparative experiments show that our method is not only more effective than a state-of-the-art implementation of constrained decoding, but is also as fast as constraint-free decoding. |
Tasks | Machine Translation |
Published | 2019-06-03 |
URL | https://arxiv.org/abs/1906.01105v2 |
PDF | https://arxiv.org/pdf/1906.01105v2.pdf |
PWC | https://paperswithcode.com/paper/training-neural-machine-translation-to-apply |
Repo | |
Framework | |
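A simplified sketch of the train-time idea: instead of constraining the decoder, the training data is preprocessed so that run-time target terms appear inline next to their source terms, and the model learns to copy them. The paper additionally marks injected terms with source factors, which this sketch omits:

```python
def inline_terminology(source_tokens, term_dict):
    """Annotate a source sentence with run-time target terms so the model
    can learn to copy them into the output (simplified: no source factors)."""
    out = []
    for tok in source_tokens:
        out.append(tok)
        if tok in term_dict:            # append the desired target term
            out.append(term_dict[tok])
    return out

src = "der Bericht wurde gestern veröffentlicht".split()
terms = {"Bericht": "report"}           # custom terminology, provided at run time
print(" ".join(inline_terminology(src, terms)))
# der Bericht report wurde gestern veröffentlicht
```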
Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG)
Title | Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG) |
Authors | John Gideon, Melvin G McInnis, Emily Mower Provost |
Abstract | Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier-to-train “meet in the middle” approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which is able to extend the proposed method to more than two datasets, simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings. |
Tasks | Domain Generalization, Emotion Recognition, Speech Emotion Recognition |
Published | 2019-03-28 |
URL | https://arxiv.org/abs/1903.12094v2 |
PDF | https://arxiv.org/pdf/1903.12094v2.pdf |
PWC | https://paperswithcode.com/paper/barking-up-the-right-tree-improving-cross |
Repo | |
Framework | |
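A minimal sketch of the “meet in the middle” training signal: a critic learns to tell the two corpora apart, while the encoder is updated so the critic outputs 0.5 for both, pulling the two representations toward each other. The architectures, losses, and update schedule are assumptions:

```python
import torch
import torch.nn as nn

# encoder maps input features to a shared representation; the critic tries
# to predict which corpus a representation came from
enc = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))
critic = nn.Sequential(nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(enc.parameters(), lr=1e-3)

xa, xb = torch.randn(16, 40), torch.randn(16, 40)  # batches from two corpora
for _ in range(3):  # alternate critic / encoder updates
    # 1) critic learns to tell corpus A (label 0) from corpus B (label 1)
    loss_c = bce(critic(enc(xa).detach()), torch.zeros(16, 1)) + \
             bce(critic(enc(xb).detach()), torch.ones(16, 1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # 2) encoder moves each corpus toward the middle (critic output 0.5)
    half = torch.full((16, 1), 0.5)
    loss_e = bce(critic(enc(xa)), half) + bce(critic(enc(xb)), half)
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()
```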
ORSIm Detector: A Novel Object Detection Framework in Optical Remote Sensing Imagery Using Spatial-Frequency Channel Features
Title | ORSIm Detector: A Novel Object Detection Framework in Optical Remote Sensing Imagery Using Spatial-Frequency Channel Features |
Authors | Xin Wu, Danfeng Hong, Jiaojiao Tian, Jocelyn Chanussot, Wei Li, Ran Tao |
Abstract | With the rapid development of spaceborne imaging techniques, object detection in optical remote sensing imagery has drawn much attention in recent decades. While many advanced works have been developed with powerful learning algorithms, incomplete feature representation still cannot meet the demand for effectively and efficiently handling image deformations, particularly object scaling and rotation. To this end, we propose a novel object detection framework, called the optical remote sensing imagery detector (ORSIm detector), integrating diverse channel feature extraction, feature learning, fast image pyramid matching, and a boosting strategy. The ORSIm detector adopts a novel spatial-frequency channel feature (SFCF) that jointly considers rotation-invariant channel features constructed in the frequency domain and the original spatial channel features (e.g., color channel, gradient magnitude). Subsequently, we refine the SFCF using a learning-based strategy in order to obtain high-level, semantically meaningful features. In the test phase, we achieve fast, coarsely-scaled channel computation by mathematically estimating a scaling factor in the image domain. Extensive experiments on two different airborne datasets demonstrate the superiority and effectiveness of the proposed framework in comparison with previous state-of-the-art methods. |
Tasks | Object Detection |
Published | 2019-01-23 |
URL | http://arxiv.org/abs/1901.07925v2 |
PDF | http://arxiv.org/pdf/1901.07925v2.pdf |
PWC | https://paperswithcode.com/paper/orsim-detector-a-novel-object-detection |
Repo | |
Framework | |
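The fast, coarsely-scaled channel computation mentioned in the abstract is in the spirit of fast feature pyramids: rather than recomputing a channel feature at every scale, resample it once and correct its magnitude with an estimated power law. The sketch below illustrates that idea; the exponent value and nearest-neighbour resampling are assumptions:

```python
import numpy as np

def approx_channel_at_scale(channel, s, lam):
    """Approximate a channel feature at scale s by resampling the original
    and correcting its magnitude by a power law s**(-lam); lam would be
    estimated from data, the value used below is a placeholder."""
    h, w = channel.shape
    hs, ws = int(h * s), int(w * s)
    rows = (np.arange(hs) / s).astype(int)
    cols = (np.arange(ws) / s).astype(int)
    resampled = channel[np.ix_(rows, cols)]   # nearest-neighbour resampling
    return resampled * s ** (-lam)

grad_mag = np.random.default_rng(0).random((64, 64))  # toy gradient-magnitude channel
half_scale = approx_channel_at_scale(grad_mag, 0.5, lam=0.1)
print(half_scale.shape)  # (32, 32)
```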
Image Recognition of Tea Leaf Diseases Based on Convolutional Neural Network
Title | Image Recognition of Tea Leaf Diseases Based on Convolutional Neural Network |
Authors | Xiaoxiao Sun, Shaomin Mu, Yongyu Xu, Zhihao Cao, Tingting Su |
Abstract | In order to identify and prevent tea leaf diseases effectively, a convolutional neural network (CNN) was used for image recognition of diseased tea leaves. First, image segmentation and data augmentation were used to preprocess the images, which were then fed into the network for training. Second, to reach a higher recognition accuracy, the learning rate and number of iterations were tuned, and dropout was added to counter over-fitting. Finally, the experimental results show that the recognition accuracy of the CNN is 93.75%, while the accuracies of an SVM and a BP neural network are 89.36% and 87.69%, respectively. The CNN-based recognition algorithm therefore classifies better and can effectively improve the recognition of tea leaf diseases. |
Tasks | Semantic Segmentation |
Published | 2019-01-09 |
URL | http://arxiv.org/abs/1901.02694v1 |
PDF | http://arxiv.org/pdf/1901.02694v1.pdf |
PWC | https://paperswithcode.com/paper/image-recognition-of-tea-leaf-diseases-based |
Repo | |
Framework | |
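A minimal sketch of a small CNN classifier of the kind the abstract describes, with dropout added against over-fitting; the depth, channel counts, input size, and the number of disease classes are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(0.5),                 # regularization noted in the abstract
    nn.Linear(32 * 16 * 16, 7),      # e.g. 7 tea-leaf disease classes (assumed)
)
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 7])
```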
Sparse Bayesian Learning Approach for Discrete Signal Reconstruction
Title | Sparse Bayesian Learning Approach for Discrete Signal Reconstruction |
Authors | Jisheng Dai, An Liu, Hing Cheung So |
Abstract | This study addresses the problem of discrete signal reconstruction from the perspective of sparse Bayesian learning (SBL). Generally, it is intractable to perform the Bayesian inference with the ideal discretization prior under the SBL framework. To overcome this challenge, we introduce a novel discretization enforcing prior to exploit the knowledge of the discrete nature of the signal-of-interest. By integrating the discretization enforcing prior into the SBL framework and applying the variational Bayesian inference (VBI) methodology, we devise an alternating update algorithm to jointly characterize the finite alphabet feature and reconstruct the unknown signal. When the measurement matrix is i.i.d. Gaussian per component, we further embed the generalized approximate message passing (GAMP) into the VBI-based method, so as to directly adopt the ideal prior and significantly reduce the computational burden. Simulation results demonstrate substantial performance improvement of the two proposed methods over existing schemes. Moreover, the GAMP-based variant outperforms the VBI-based method with an i.i.d. Gaussian measurement matrix, but it fails to work for non-i.i.d. Gaussian matrices. |
Tasks | Bayesian Inference |
Published | 2019-06-01 |
URL | https://arxiv.org/abs/1906.00309v1 |
PDF | https://arxiv.org/pdf/1906.00309v1.pdf |
PWC | https://paperswithcode.com/paper/190600309 |
Repo | |
Framework | |
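For context on the problem setting (not the paper's SBL/VBI or GAMP algorithms), here is a toy baseline that alternates a least-squares data fit with projection onto the finite alphabet; the alphabet, step size, and problem sizes are arbitrary:

```python
import numpy as np

def discrete_ls_reconstruction(A, y, alphabet, iters=50):
    """Toy discrete-signal reconstruction: alternate a gradient step on the
    least-squares data fit with projection onto the finite alphabet. This is
    NOT the paper's algorithm, only a sketch of the task it solves."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]
    for _ in range(iters):
        # project each entry to the nearest alphabet symbol
        x_proj = alphabet[np.argmin(np.abs(x[:, None] - alphabet[None, :]), axis=1)]
        # gradient step toward the measurements
        x = x_proj - 0.1 * A.T @ (A @ x_proj - y)
    return alphabet[np.argmin(np.abs(x[:, None] - alphabet[None, :]), axis=1)]

rng = np.random.default_rng(0)
alphabet = np.array([-1.0, 1.0])                 # e.g. BPSK symbols
x_true = rng.choice(alphabet, size=32)
A = rng.normal(size=(48, 32)) / np.sqrt(48)      # i.i.d. Gaussian measurements
x_hat = discrete_ls_reconstruction(A, A @ x_true, alphabet)
print((x_hat == x_true).mean())                  # symbol recovery rate
```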
A spelling correction model for end-to-end speech recognition
Title | A spelling correction model for end-to-end speech recognition |
Authors | Jinxi Guo, Tara N. Sainath, Ron J. Weiss |
Abstract | Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component of the end-to-end model is only trained on transcribed audio-text pairs, which leads to performance degradation, especially on rare words. While a variety of work has looked at incorporating an external LM trained on text-only data into the end-to-end framework, none of it has taken into account the characteristic error distribution of the model. In this paper, we propose a novel approach to utilizing text-only data by training a spelling correction (SC) model to explicitly correct those errors. On the LibriSpeech dataset, we demonstrate that the proposed model results in an 18.6% relative improvement in WER over the baseline model when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list using an external LM. |
Tasks | End-To-End Speech Recognition, Language Modelling, Speech Recognition, Spelling Correction |
Published | 2019-02-19 |
URL | http://arxiv.org/abs/1902.07178v1 |
PDF | http://arxiv.org/pdf/1902.07178v1.pdf |
PWC | https://paperswithcode.com/paper/a-spelling-correction-model-for-end-to-end |
Repo | |
Framework | |
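The second result in the abstract combines the model's score with an external LM when rescoring an expanded n-best list. Below is a minimal sketch of such log-linear rescoring, with a toy stand-in for the LM and an assumed interpolation weight:

```python
def rescore_nbest(nbest, lm_score, lam=0.3):
    """Pick the hypothesis maximizing model_log_prob + lam * LM log score.
    nbest holds (hypothesis, model_log_prob) pairs; lam is an interpolation
    weight whose value here is an assumption."""
    return max(nbest, key=lambda h: h[1] + lam * lm_score(h[0]))

nbest = [("the cat sat", -1.2), ("the cat sad", -1.1)]
# toy LM: favours the well-formed hypothesis (stand-in for a real text-only LM)
lm = lambda s: 0.0 if s.endswith("sat") else -2.0
print(rescore_nbest(nbest, lm))  # ('the cat sat', -1.2)
```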
The University of Sydney’s Machine Translation System for WMT19
Title | The University of Sydney’s Machine Translation System for WMT19 |
Authors | Liang Ding, Dacheng Tao |
Abstract | This paper describes the University of Sydney’s submission to the WMT 2019 shared news translation task. We participated in the Finnish→English direction and achieved the best BLEU score (33.0) among all participants. Our system is based on self-attentional Transformer networks, into which we integrated the most recent effective strategies from academic research (e.g., BPE, back translation, multi-features data selection, data augmentation, greedy model ensemble, reranking, ConMBR system combination, and post-processing). Furthermore, we propose a novel augmentation method, Cycle Translation, and a data mixture strategy, Big/Small parallel construction, to fully exploit the synthetic corpus. Extensive experiments show that adding the above techniques yields steady improvements in BLEU, and the best result outperforms the baseline (a Transformer ensemble model trained on the original parallel corpus) by approximately 5.3 BLEU, achieving state-of-the-art performance. |
Tasks | Data Augmentation, Machine Translation |
Published | 2019-06-30 |
URL | https://arxiv.org/abs/1907.00494v1 |
PDF | https://arxiv.org/pdf/1907.00494v1.pdf |
PWC | https://paperswithcode.com/paper/the-university-of-sydneys-machine-translation |
Repo | |
Framework | |
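Among the listed strategies, back translation is easy to sketch: a reverse English→Finnish model converts monolingual English text into synthetic parallel data for the Finnish→English system. The `en_fi_translate` hook below is a placeholder, not a real API:

```python
def back_translate(mono_en, en_fi_translate):
    """Turn monolingual English sentences into synthetic (Finnish, English)
    training pairs by translating them with a reverse model."""
    synthetic = []
    for en in mono_en:
        fi = en_fi_translate(en)         # synthetic Finnish source
        synthetic.append((fi, en))       # paired with the genuine English
    return synthetic

corpus = back_translate(["The report was published yesterday."],
                        lambda s: "<synthetic Finnish for: %s>" % s)
print(corpus[0])
```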