Paper Group AWR 41
Progressive Learning and Disentanglement of Hierarchical Representations
Title | Progressive Learning and Disentanglement of Hierarchical Representations |
Authors | Zhiyuan Li, Jaideep Vitthal Murkute, Prashnna Kumar Gyawali, Linwei Wang |
Abstract | Learning rich representation from data is an important task for deep generative models such as variational auto-encoder (VAE). However, by extracting high-level abstractions in the bottom-up inference process, the goal of preserving all factors of variations for top-down generation is compromised. Motivated by the concept of “starting small”, we present a strategy to progressively learn independent hierarchical representations from high- to low-levels of abstractions. The model starts with learning the most abstract representation, and then progressively grows the network architecture to introduce new representations at different levels of abstraction. We quantitatively demonstrate the ability of the presented model to improve disentanglement in comparison to existing works on two benchmark data sets using three disentanglement metrics, including a new metric we propose to complement the previously presented metric of mutual information gap. We further present both qualitative and quantitative evidence on how the progression of learning improves the disentangling of hierarchical representations. By drawing on the respective advantages of hierarchical representation learning and progressive learning, this is to our knowledge the first attempt to improve disentanglement by progressively growing the capacity of VAE to learn hierarchical representations. |
Tasks | Representation Learning |
Published | 2020-02-24 |
URL | https://arxiv.org/abs/2002.10549v1 |
PDF | https://arxiv.org/pdf/2002.10549v1.pdf |
PWC | https://paperswithcode.com/paper/progressive-learning-and-disentanglement-of-1 |
Repo | https://github.com/Zhiyuan1991/proVLAE |
Framework | pytorch |
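To make the progressive strategy concrete, below is a minimal, hedged sketch (not the released proVLAE code) of a two-level VAE in which the most abstract latent is trained from the start and a lower-level latent is faded in later. The two-level hierarchy, layer sizes, and fade-in schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=8):
        super().__init__()
        self.enc1 = nn.Linear(x_dim, h_dim)            # low-level features
        self.enc2 = nn.Linear(h_dim, h_dim)            # high-level features
        self.mu2 = nn.Linear(h_dim, z_dim)             # most abstract latent (learned first)
        self.lv2 = nn.Linear(h_dim, z_dim)
        self.mu1 = nn.Linear(h_dim, z_dim)             # less abstract latent (introduced later)
        self.lv1 = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(2 * z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    @staticmethod
    def reparam(mu, logvar):
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x, alpha):
        h1 = F.relu(self.enc1(x))
        h2 = F.relu(self.enc2(h1))
        mu2, lv2 = self.mu2(h2), self.lv2(h2)
        mu1, lv1 = self.mu1(h1), self.lv1(h1)
        z2 = self.reparam(mu2, lv2)
        z1 = alpha * self.reparam(mu1, lv1)            # lower-level latent is faded in progressively
        return self.dec(torch.cat([z2, z1], dim=-1)), (mu2, lv2), (mu1, lv1)

def kl(mu, logvar):
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()

model = TwoLevelVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    x = torch.rand(64, 784)                            # placeholder data
    alpha = min(1.0, max(0.0, (step - 300) / 200))     # introduce z1 only after z2 has matured
    x_hat, (mu2, lv2), (mu1, lv1) = model(x, alpha)
    loss = F.mse_loss(x_hat, x) + kl(mu2, lv2) + alpha * kl(mu1, lv1)
    opt.zero_grad(); loss.backward(); opt.step()
```

The only "progressive" ingredient here is the ramping coefficient alpha: a new, less abstract representation enters the model only after the more abstract one has had time to stabilize.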
Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos
Title | Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos |
Authors | Aysegul Dundar, Kevin J. Shih, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro |
Abstract | Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foreground object of interest. However, reconstructing the entire image forces the model to allocate landmarks to modeling the background. This work explores the effects of factorizing the reconstruction task into separate foreground and background reconstructions, conditioning only the foreground reconstruction on the unsupervised landmarks. Our experiments demonstrate that the proposed factorization results in landmarks that are focused on the foreground object of interest. Furthermore, the rendered background quality is also improved, as the background rendering pipeline no longer requires the ill-suited landmarks to model its pose and appearance. We demonstrate this improvement in the context of the video-prediction task. |
Tasks | Video Prediction |
Published | 2020-01-26 |
URL | https://arxiv.org/abs/2001.09518v1 |
PDF | https://arxiv.org/pdf/2001.09518v1.pdf |
PWC | https://paperswithcode.com/paper/unsupervised-disentanglement-of-pose-1 |
Repo | https://github.com/NVIDIA/UnsupervisedLandmarkLearning |
Framework | pytorch |
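A tiny, hedged illustration of the compositing at the heart of this factorization: the foreground render (conditioned on the unsupervised landmarks) and the background render (not conditioned on them) are blended with a soft mask before the reconstruction loss. The tensors below are placeholders standing in for network outputs.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 4, 3, 64, 64
img = torch.rand(B, C, H, W)                 # input frame (placeholder data)
fg  = torch.rand(B, C, H, W)                 # foreground render, conditioned on landmarks + appearance
bg  = torch.rand(B, C, H, W)                 # background render, NOT conditioned on landmarks
mask_logits = torch.randn(B, 1, H, W)        # predicted from the pose (landmark) stream

mask = torch.sigmoid(mask_logits)            # soft foreground mask
recon = mask * fg + (1.0 - mask) * bg        # composite the two reconstructions
loss = F.l1_loss(recon, img)                 # in the full model, only the fg branch receives landmark gradients
```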
Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction
Title | Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction |
Authors | Vincent Le Guen, Nicolas Thome |
Abstract | Leveraging physical knowledge described by partial differential equations (PDEs) is an appealing way to improve unsupervised video prediction methods. Since physics is too restrictive for describing the full visual content of generic videos, we introduce PhyDNet, a two-branch deep architecture, which explicitly disentangles PDE dynamics from unknown complementary information. A second contribution is to propose a new recurrent physical cell (PhyCell), inspired by data assimilation techniques, for performing PDE-constrained prediction in latent space. Extensive experiments conducted on four diverse datasets show the ability of PhyDNet to outperform state-of-the-art methods. Ablation studies also highlight the significant gains brought by both disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet presents interesting features for dealing with missing data and long-term forecasting. |
Tasks | Video Prediction |
Published | 2020-03-03 |
URL | https://arxiv.org/abs/2003.01460v2 |
PDF | https://arxiv.org/pdf/2003.01460v2.pdf |
PWC | https://paperswithcode.com/paper/disentangling-physical-dynamics-from-unknown |
Repo | https://github.com/vincent-leguen/PhyDNet |
Framework | none |
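The following is a rough sketch, under stated assumptions, of the two-branch idea: one recurrent branch plays the role of the PDE-constrained PhyCell, a second branch absorbs the unknown complementary factors, and their latent contributions are summed before decoding each frame. The simple convolutional cells below are stand-ins, not the released PhyDNet modules.

```python
import torch
import torch.nn as nn

class TwoBranchPredictor(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.ch = ch
        self.encoder = nn.Conv2d(1, ch, 3, padding=1)
        self.phys_cell = nn.Conv2d(ch, ch, 3, padding=1)        # stand-in for PhyCell (PDE-constrained update)
        self.resid_cell = nn.Conv2d(2 * ch, ch, 3, padding=1)   # stand-in for a ConvLSTM over unknown factors
        self.decoder = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, frames):                                  # frames: (B, T, 1, H, W)
        B, T, _, H, W = frames.shape
        h_p = torch.zeros(B, self.ch, H, W)
        h_r = torch.zeros(B, self.ch, H, W)
        preds = []
        for t in range(T):
            e = torch.relu(self.encoder(frames[:, t]))
            h_p = torch.tanh(self.phys_cell(h_p + e))                       # physics-like latent update
            h_r = torch.tanh(self.resid_cell(torch.cat([h_r, e], dim=1)))   # whatever physics cannot explain
            preds.append(self.decoder(h_p + h_r))                           # the two disentangled latents are summed
        return torch.stack(preds, dim=1)

video = torch.rand(2, 5, 1, 32, 32)                 # placeholder (batch, time, channel, H, W)
print(TwoBranchPredictor()(video).shape)            # torch.Size([2, 5, 1, 32, 32])
```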
Learning Robust Representations via Multi-View Information Bottleneck
Title | Learning Robust Representations via Multi-View Information Bottleneck |
Authors | Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, Zeynep Akata |
Abstract | The information bottleneck principle provides an information-theoretic method for representation learning, by training an encoder to retain all information which is relevant for predicting the label while minimizing the amount of other, excess information in the representation. The original formulation, however, requires labeled data to identify the superfluous information. In this work, we extend this ability to the multi-view unsupervised setting, where two views of the same underlying entity are provided but the label is unknown. This enables us to identify superfluous information as that not shared by both views. A theoretical analysis leads to the definition of a new multi-view model that produces state-of-the-art results on the Sketchy dataset and label-limited versions of the MIR-Flickr dataset. We also extend our theory to the single-view setting by taking advantage of standard data augmentation techniques, empirically showing better generalization capabilities when compared to common unsupervised approaches for representation learning. |
Tasks | Data Augmentation, Representation Learning |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.07017v2 |
PDF | https://arxiv.org/pdf/2002.07017v2.pdf |
PWC | https://paperswithcode.com/paper/learning-robust-representations-via-multi-1 |
Repo | https://github.com/mfederici/Multi-View-Information-Bottleneck |
Framework | pytorch |
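Below is a hedged sketch of a multi-view bottleneck-style objective: an InfoNCE term encourages the two latents to share information, while a symmetrized KL between the two Gaussian posteriors penalizes view-specific (superfluous) information. The encoder sizes, the InfoNCE estimator, and the weight beta are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    def __init__(self, x_dim=128, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def gaussian_skl(mu1, lv1, mu2, lv2):
    """Symmetrized KL between two diagonal Gaussians (penalizes view-specific information)."""
    kl12 = 0.5 * ((lv2 - lv1) + (lv1.exp() + (mu1 - mu2) ** 2) / lv2.exp() - 1).sum(-1)
    kl21 = 0.5 * ((lv1 - lv2) + (lv2.exp() + (mu1 - mu2) ** 2) / lv1.exp() - 1).sum(-1)
    return (kl12 + kl21).mean()

def info_nce(z1, z2, temperature=0.1):
    """Simple InfoNCE lower bound on I(z1; z2) over a batch (minimizing this loss maximizes the bound)."""
    logits = F.normalize(z1) @ F.normalize(z2).t() / temperature
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

enc1, enc2 = GaussianEncoder(), GaussianEncoder()
beta = 1e-2
v1, v2 = torch.rand(64, 128), torch.rand(64, 128)     # two views of the same underlying entities
mu1, lv1 = enc1(v1)
mu2, lv2 = enc2(v2)
z1 = mu1 + torch.randn_like(mu1) * (0.5 * lv1).exp()
z2 = mu2 + torch.randn_like(mu2) * (0.5 * lv2).exp()
loss = info_nce(z1, z2) + beta * gaussian_skl(mu1, lv1, mu2, lv2)
```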
An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
Title | An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos |
Authors | Sicheng Zhao, Yunsheng Ma, Yang Gu, Jufeng Yang, Tengfei Xing, Pengfei Xu, Runbo Hu, Hua Chai, Kurt Keutzer |
Abstract | Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e., polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet. |
Tasks | Emotion Recognition, Video Emotion Recognition |
Published | 2020-02-12 |
URL | https://arxiv.org/abs/2003.00832v1 |
PDF | https://arxiv.org/pdf/2003.00832v1.pdf |
PWC | https://paperswithcode.com/paper/an-end-to-end-visual-audio-attention-network |
Repo | https://github.com/maysonma/VAANet |
Framework | none |
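A minimal sketch of a polarity-consistent cross-entropy in the spirit of the loss described above: standard cross-entropy is up-weighted whenever the predicted emotion falls in the wrong polarity group. The emotion-to-polarity mapping and the penalty weight are hypothetical, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# hypothetical mapping: 8 emotion classes -> polarity (1 = positive, 0 = negative)
POLARITY = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])

def polarity_consistent_ce(logits, targets, lam=0.5):
    ce = F.cross_entropy(logits, targets, reduction="none")
    pred = logits.argmax(dim=1)
    mismatch = (POLARITY[pred] != POLARITY[targets]).float()   # 1 if the predicted polarity is wrong
    return (ce * (1.0 + lam * mismatch)).mean()                # penalize polarity-inconsistent errors more

logits = torch.randn(16, 8)                 # scores for 8 emotion classes (placeholder)
targets = torch.randint(0, 8, (16,))
loss = polarity_consistent_ce(logits, targets)
```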
A Survey of Deep Learning Techniques for Neural Machine Translation
Title | A Survey of Deep Learning Techniques for Neural Machine Translation |
Authors | Shuoheng Yang, Yuxin Wang, Xiaowen Chu |
Abstract | In recent years, natural language processing (NLP) has advanced greatly with deep learning techniques. In the sub-field of machine translation, a new approach named Neural Machine Translation (NMT) has emerged and attracted massive attention from both academia and industry. However, despite the significant number of studies proposed in the past several years, little work has investigated the development process of this new technology trend. This literature survey traces back the origin and principal development timeline of NMT, investigates the important branches, categorizes different research orientations, and discusses some future research trends in this field. |
Tasks | Machine Translation |
Published | 2020-02-18 |
URL | https://arxiv.org/abs/2002.07526v1 |
PDF | https://arxiv.org/pdf/2002.07526v1.pdf |
PWC | https://paperswithcode.com/paper/a-survey-of-deep-learning-techniques-for-2 |
Repo | https://github.com/SFFAI-AIKT/AIKT-Natural_Language_Processing |
Framework | none |
Overfitting in adversarially robust deep learning
Title | Overfitting in adversarially robust deep learning |
Authors | Leslie Rice, Eric Wong, J. Zico Kolter |
Abstract | It is common practice in deep learning to use overparameterized networks and train for as long as possible; there are numerous studies that show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier. In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models ($\ell_\infty$ and $\ell_2$). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at https://github.com/locuslab/robust_overfitting. |
Tasks | Data Augmentation |
Published | 2020-02-26 |
URL | https://arxiv.org/abs/2002.11569v2 |
PDF | https://arxiv.org/pdf/2002.11569v2.pdf |
PWC | https://paperswithcode.com/paper/overfitting-in-adversarially-robust-deep |
Repo | https://github.com/locuslab/robust_overfitting |
Framework | pytorch |
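The practical takeaway above (early stopping matches most recent algorithmic gains) can be sketched as follows: after each epoch of PGD adversarial training, measure robust accuracy on a held-out split and keep the best checkpoint. The model, data loaders, and PGD hyperparameters are placeholders, not the authors' exact setup.

```python
import copy
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # simple L-infinity PGD; clamping inputs to [0, 1] is omitted for brevity
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        delta.data = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach()

def train_with_robust_early_stopping(model, train_loader, val_loader, epochs, opt):
    best_acc, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                       # adversarial training step
            x_adv = pgd_attack(model, x, y)
            opt.zero_grad()
            F.cross_entropy(model(x_adv), y).backward()
            opt.step()
        model.eval()
        correct = total = 0
        for x, y in val_loader:                         # robust accuracy on a held-out split
            x_adv = pgd_attack(model, x, y)
            correct += (model(x_adv).argmax(1) == y).sum().item()
            total += y.numel()
        if correct / total > best_acc:                  # early stopping: keep the best robust checkpoint
            best_acc, best_state = correct / total, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_acc
```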
Improving Generalization by Controlling Label-Noise Information in Neural Network Weights
Title | Improving Generalization by Controlling Label-Noise Information in Neural Network Weights |
Authors | Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan |
Abstract | In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. Standard regularization techniques such as dropout, weight decay or data augmentation sometimes help, but do not prevent this behavior. If one considers neural network weights as random variables that depend on the data and stochasticity of training, the amount of memorized information can be quantified with the Shannon mutual information between weights and the vector of all training labels given inputs, $I(w : \mathbf{y} \mid \mathbf{x})$. We show that for any training algorithm, low values of this term correspond to reduction in memorization of label-noise and better generalization bounds. To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. We illustrate the effectiveness of our approach on versions of MNIST, CIFAR-10, and CIFAR-100 corrupted with various noise models, and on a large-scale dataset Clothing1M that has noisy labels. |
Tasks | Data Augmentation |
Published | 2020-02-19 |
URL | https://arxiv.org/abs/2002.07933v1 |
PDF | https://arxiv.org/pdf/2002.07933v1.pdf |
PWC | https://paperswithcode.com/paper/improving-generalization-by-controlling-label |
Repo | https://github.com/hrayrhar/limit-label-memorization |
Framework | pytorch |
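Below is a very rough, hedged sketch of the gradient-prediction idea: rather than letting noisy labels flow into the final-layer gradients, an auxiliary network predicts a soft target from the input alone and the classifier is updated with the resulting predicted gradient. How the auxiliary network itself is trained, and all architectures, are omitted assumptions; see the paper and repository for the actual algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
classifier = nn.Linear(128, 10)                            # final layer whose gradient we control
aux_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))  # predicts soft targets without seeing labels

x = torch.rand(32, 1, 28, 28)                              # placeholder batch
feats = feature_net(x)
logits = classifier(feats)

q = F.softmax(aux_net(x), dim=1)                           # label-free soft target
pred_grad = F.softmax(logits, dim=1) - q                   # stands in for softmax(logits) - onehot(y)

# Backpropagate the predicted gradient through the network; no label information enters the weights.
logits.backward(gradient=pred_grad.detach() / x.size(0))
```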
Addressing the confounds of accompaniments in singer identification
Title | Addressing the confounds of accompaniments in singer identification |
Authors | Tsung-Han Hsieh, Kai-Hsiang Cheng, Zhe-Cheng Fan, Yu-Ching Yang, Yi-Hsuan Yang |
Abstract | Identifying singers is an important task with many applications. However, the task remains challenging due to many issues. One major issue is related to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal related features from the instrumental part of the songs, if a singer only sings in certain musical contexts (e.g., genres). The model therefore cannot generalize well when the singer sings in unseen contexts. In this paper, we attempt to address this issue. Specifically, we employ open-unmix, an open-source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music. We then investigate two means to train a singer identification model: by learning from the separated vocal only, or from an augmented set of data where we “shuffle-and-remix” the separated vocal tracks and instrumental tracks of different songs to artificially make the singers sing in different contexts. We also incorporate melodic features learned from the vocal melody contour for better performance. Evaluation results on a benchmark dataset called artist20 show that this data augmentation method greatly improves the accuracy of singer identification. |
Tasks | Data Augmentation |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.06817v1 |
PDF | https://arxiv.org/pdf/2002.06817v1.pdf |
PWC | https://paperswithcode.com/paper/addressing-the-confounds-of-accompaniments-in |
Repo | https://github.com/bill317996/Singer-identification-in-artist20 |
Framework | pytorch |
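A small, hedged sketch of the “shuffle-and-remix” augmentation: vocal and instrumental tracks (separated beforehand, e.g. with a source-separation tool such as open-unmix) are re-paired across songs so that each singer is heard over many different accompaniments while the singer label stays with the vocal track. The array shapes and gains are assumptions.

```python
import numpy as np

def shuffle_and_remix(vocals, accompaniments, rng=None):
    """vocals, accompaniments: arrays of shape (n_songs, n_samples) of separated tracks."""
    rng = rng if rng is not None else np.random.default_rng()
    perm = rng.permutation(len(accompaniments))    # pair each vocal with another song's backing track
    return vocals + accompaniments[perm]

vocals = np.random.randn(8, 16000).astype(np.float32)   # placeholder separated vocal tracks
accomp = np.random.randn(8, 16000).astype(np.float32)   # placeholder separated instrumental tracks
augmented = shuffle_and_remix(vocals, accomp)            # singer labels stay with the vocal tracks
```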
Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs
Title | Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs |
Authors | Sangwoo Mo, Minsu Cho, Jinwoo Shin |
Abstract | Generative adversarial networks (GANs) have shown outstanding performance on a wide range of problems in computer vision, graphics, and machine learning, but often require large amounts of training data and heavy computational resources. To tackle this issue, several methods introduce a transfer learning technique in GAN training. They, however, are either prone to overfitting or limited to learning small distribution shifts. In this paper, we show that simple fine-tuning of GANs with frozen lower layers of the discriminator performs surprisingly well. This simple baseline, FreezeD, significantly outperforms previous techniques used in both unconditional and conditional GANs. We demonstrate the consistent effect using the StyleGAN and SNGAN-projection architectures on several datasets: Animal Face, Anime Face, Oxford Flower, CUB-200-2011, and Caltech-256. The code and results are available at https://github.com/sangwoomo/FreezeD. |
Tasks | Image Generation, Transfer Learning |
Published | 2020-02-25 |
URL | https://arxiv.org/abs/2002.10964v2 |
PDF | https://arxiv.org/pdf/2002.10964v2.pdf |
PWC | https://paperswithcode.com/paper/freeze-discriminator-a-simple-baseline-for |
Repo | https://github.com/sangwoomo/freezeD |
Framework | pytorch |
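A minimal sketch of the FreezeD recipe described above: when fine-tuning a pretrained GAN on a new dataset, freeze the lower (feature-extracting) layers of the discriminator and update only its upper layers together with the generator. The toy discriminator below and the cut-off index are illustrative stand-ins, not the StyleGAN/SNGAN setups used in the paper.

```python
import torch
import torch.nn as nn

discriminator = nn.Sequential(                    # stand-in for a pretrained discriminator (64x64 inputs)
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.Linear(256 * 8 * 8, 1),
)

freeze_until = 4                                  # freeze the first two conv blocks
for i, module in enumerate(discriminator):
    for p in module.parameters():
        p.requires_grad = i >= freeze_until       # lower layers frozen, upper layers fine-tuned

d_opt = torch.optim.Adam(
    [p for p in discriminator.parameters() if p.requires_grad], lr=2e-4, betas=(0.0, 0.99)
)
scores = discriminator(torch.rand(4, 3, 64, 64))  # shape (4, 1); only unfrozen layers receive updates
```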
QTIP: Quick simulation-based adaptation of Traffic model per Incident Parameters
Title | QTIP: Quick simulation-based adaptation of Traffic model per Incident Parameters |
Authors | Inon Peled, Raghuveer Kamalakar, Carlos Lima Azevedo, Francisco C. Pereira |
Abstract | Current data-driven traffic prediction models are usually trained with large datasets, e.g. several months of speeds and flows. Such models provide a very good fit for ordinary road conditions, but often fail just when they are most needed: when traffic suffers a sudden and significant disruption, such as a road incident. In this work, we describe QTIP: a simulation-based framework for quasi-instantaneous adaptation of prediction models upon traffic disruption. In a nutshell, QTIP performs real-time simulations of the affected road for multiple scenarios, analyzes the results, and suggests a change to an ordinary prediction model accordingly. QTIP constructs the simulated scenarios per properties of the incident, as conveyed by immediate distress signals from affected vehicles. Such real-time signals are provided by In-Vehicle Monitor Systems, which are becoming increasingly prevalent worldwide. We experiment with QTIP in a case study of a Danish motorway, and the results show that QTIP can improve traffic prediction in the first critical minutes of road incidents. |
Tasks | Traffic Prediction |
Published | 2020-03-09 |
URL | https://arxiv.org/abs/2003.04109v1 |
PDF | https://arxiv.org/pdf/2003.04109v1.pdf |
PWC | https://paperswithcode.com/paper/qtip-quick-simulation-based-adaptation-of |
Repo | https://github.com/inon-peled/qtip_code_pub |
Framework | none |
KFNet: Learning Temporal Camera Relocalization using Kalman Filtering
Title | KFNet: Learning Temporal Camera Relocalization using Kalman Filtering |
Authors | Lei Zhou, Zixin Luo, Tianwei Shen, Jiahui Zhang, Mingmin Zhen, Yao Yao, Tian Fang, Long Quan |
Abstract | Temporal camera relocalization estimates the pose with respect to each video frame in sequence, as opposed to one-shot relocalization which focuses on a still image. Even though the time dependency has been taken into account, current temporal relocalization methods still generally underperform the state-of-the-art one-shot approaches in terms of accuracy. In this work, we improve the temporal relocalization method by using a network architecture that incorporates Kalman filtering (KFNet) for online camera relocalization. In particular, KFNet extends the scene coordinate regression problem to the time domain in order to recursively establish 2D and 3D correspondences for pose determination. The network architecture design and the loss formulation are based on Kalman filtering in the context of Bayesian learning. Extensive experiments on multiple relocalization benchmarks demonstrate the high accuracy of KFNet, which ranks at the top of both one-shot and temporal relocalization approaches. Our code is released at https://github.com/zlthinker/KFNet. |
Tasks | Camera Relocalization |
Published | 2020-03-24 |
URL | https://arxiv.org/abs/2003.10629v1 |
PDF | https://arxiv.org/pdf/2003.10629v1.pdf |
PWC | https://paperswithcode.com/paper/kfnet-learning-temporal-camera-relocalization |
Repo | https://github.com/zlthinker/KFNet |
Framework | tf |
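To illustrate the predict/correct recursion that KFNet builds on, here is a deliberately simplified scalar Kalman filter; KFNet learns these steps with networks over per-pixel scene coordinates, so this is only a conceptual sketch, not the method itself, and the noise variances are arbitrary.

```python
import numpy as np

def kalman_1d(measurements, meas_var, process_var):
    mean, var = measurements[0], meas_var        # initialize from the first observation
    estimates = [mean]
    for z in measurements[1:]:
        var = var + process_var                  # prediction step (state assumed locally constant)
        gain = var / (var + meas_var)            # Kalman gain: how much to trust the new measurement
        mean = mean + gain * (z - mean)          # correction step
        var = (1.0 - gain) * var
        estimates.append(mean)
    return np.array(estimates)

noisy = 2.0 + 0.3 * np.random.randn(50)          # placeholder noisy per-frame estimates
smoothed = kalman_1d(noisy, meas_var=0.09, process_var=0.01)
```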
Point-Based Methods for Model Checking in Partially Observable Markov Decision Processes
Title | Point-Based Methods for Model Checking in Partially Observable Markov Decision Processes |
Authors | Maxime Bouton, Jana Tumova, Mykel J. Kochenderfer |
Abstract | Autonomous systems are often required to operate in partially observable environments. They must reliably execute a specified objective even with incomplete information about the state of the environment. We propose a methodology to synthesize policies that satisfy a linear temporal logic formula in a partially observable Markov decision process (POMDP). By formulating a planning problem, we show how to use point-based value iteration methods to efficiently approximate the maximum probability of satisfying a desired logical formula and compute the associated belief state policy. We demonstrate that our method scales to large POMDP domains and provides strong bounds on the performance of the resulting policy. |
Tasks | |
Published | 2020-01-11 |
URL | https://arxiv.org/abs/2001.03809v1 |
PDF | https://arxiv.org/pdf/2001.03809v1.pdf |
PWC | https://paperswithcode.com/paper/point-based-methods-for-model-checking-in |
Repo | https://github.com/sisl/POMDPModelChecking.jl |
Framework | none |
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework
Title | The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework |
Authors | Chandan K. A. Reddy, Ebrahim Beyrami, Harishchandra Dubey, Vishak Gopal, Roger Cheng, Ross Cutler, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke |
Abstract | The INTERSPEECH 2020 Deep Noise Suppression Challenge is intended to promote collaborative research in real-time single-channel speech enhancement aimed at maximizing the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluating noise suppression methods is to use objective metrics on a test set obtained by splitting the original dataset. Many publications report reasonable performance on the synthetic test set drawn from the same distribution as that of the training set. However, the model performance often degrades significantly on real recordings. Also, most of the conventional objective metrics do not correlate well with subjective tests, and lab subjective tests are not scalable for a large test set. In this challenge, we open-source a large clean speech and noise corpus for training noise suppression models, together with a test set representative of real-world scenarios, consisting of both synthetic and real recordings. We also open-source an online subjective test framework based on ITU-T P.808 for researchers to quickly test their developments. The winners of this challenge will be selected based on subjective evaluation on a representative test set using the P.808 framework. |
Tasks | Speech Enhancement |
Published | 2020-01-23 |
URL | https://arxiv.org/abs/2001.08662v1 |
PDF | https://arxiv.org/pdf/2001.08662v1.pdf |
PWC | https://paperswithcode.com/paper/the-interspeech-2020-deep-noise-suppression |
Repo | https://github.com/microsoft/DNS-Challenge |
Framework | none |
Discontinuous Constituent Parsing with Pointer Networks
Title | Discontinuous Constituent Parsing with Pointer Networks |
Authors | Daniel Fernández-González, Carlos Gómez-Rodríguez |
Abstract | One of the most complex syntactic representations used in computational linguistics and NLP is the discontinuous constituent tree, crucial for representing all grammatical phenomena of languages such as German. Recent advances in dependency parsing have shown that Pointer Networks excel in efficiently parsing syntactic relations between words in a sentence. This kind of sequence-to-sequence model achieves outstanding accuracy in building non-projective dependency trees, but its potential has not yet been proven on a more difficult task. We propose a novel neural network architecture that, by means of Pointer Networks, is able to generate the most accurate discontinuous constituent representations to date, even without the need for Part-of-Speech tagging information. To do so, we internally model discontinuous constituent structures as augmented non-projective dependency structures. The proposed approach achieves state-of-the-art results on the two widely-used NEGRA and TIGER benchmarks, outperforming previous work by a wide margin. |
Tasks | Dependency Parsing, Part-Of-Speech Tagging |
Published | 2020-02-05 |
URL | https://arxiv.org/abs/2002.01824v1 |
PDF | https://arxiv.org/pdf/2002.01824v1.pdf |
PWC | https://paperswithcode.com/paper/discontinuous-constituent-parsing-with |
Repo | https://github.com/danifg/DiscoPointer |
Framework | pytorch |
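A hedged sketch of the pointer mechanism underlying such a parser: for each word, a decoder state attends over the encoder states of the sentence and “points” to another position (e.g. its head in the augmented non-projective structure). The dimensions and the additive scoring function are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Pointer(nn.Module):
    def __init__(self, enc_dim=128, dec_dim=128, att_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, att_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (B, T, enc_dim), dec_state: (B, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state).unsqueeze(1)))
        return scores.squeeze(-1)                      # (B, T): one score per candidate position

enc = torch.randn(2, 7, 128)                           # encoder states for a 7-word sentence (placeholder)
dec = torch.randn(2, 128)                              # decoder state for the current word (placeholder)
pointer = Pointer()
probs = pointer(enc, dec).softmax(dim=-1)              # distribution over positions to point at
head = probs.argmax(dim=-1)                            # predicted position for each batch element
```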