Paper Group AWR 41
Progressive Learning and Disentanglement of Hierarchical Representations
Title | Progressive Learning and Disentanglement of Hierarchical Representations |
Authors | Zhiyuan Li, Jaideep Vitthal Murkute, Prashnna Kumar Gyawali, Linwei Wang |
Abstract | Learning rich representation from data is an important task for deep generative models such as variational auto-encoder (VAE). However, by extracting high-level abstractions in the bottom-up inference process, the goal of preserving all factors of variations for top-down generation is compromised. Motivated by the concept of “starting small”, we present a strategy to progressively learn independent hierarchical representations from high- to low-levels of abstractions. The model starts with learning the most abstract representation, and then progressively grows the network architecture to introduce new representations at different levels of abstraction. We quantitatively demonstrate the ability of the presented model to improve disentanglement in comparison to existing works on two benchmark data sets using three disentanglement metrics, including a new metric we propose to complement the previously presented metric of mutual information gap. We further present both qualitative and quantitative evidence on how the progression of learning improves the disentangling of hierarchical representations. By drawing on the respective advantages of hierarchical representation learning and progressive learning, this is to our knowledge the first attempt to improve disentanglement by progressively growing the capacity of VAE to learn hierarchical representations. |
Tasks | Representation Learning |
Published | 2020-02-24 |
URL | https://arxiv.org/abs/2002.10549v1 |
PDF | https://arxiv.org/pdf/2002.10549v1.pdf |
PWC | https://paperswithcode.com/paper/progressive-learning-and-disentanglement-of-1 |
Repo | https://github.com/Zhiyuan1991/proVLAE |
Framework | pytorch |
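To make the progressive strategy concrete, below is a minimal, hedged sketch (not the released proVLAE code) of a two-level VAE in which the most abstract latent is trained from the start and a lower-level latent is faded in later. The two-level hierarchy, layer sizes, and fade-in schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=8):
        super().__init__()
        self.enc1 = nn.Linear(x_dim, h_dim)            # low-level features
        self.enc2 = nn.Linear(h_dim, h_dim)            # high-level features
        self.mu2 = nn.Linear(h_dim, z_dim)             # most abstract latent (learned first)
        self.lv2 = nn.Linear(h_dim, z_dim)
        self.mu1 = nn.Linear(h_dim, z_dim)             # less abstract latent (introduced later)
        self.lv1 = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(2 * z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    @staticmethod
    def reparam(mu, logvar):
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x, alpha):
        h1 = F.relu(self.enc1(x))
        h2 = F.relu(self.enc2(h1))
        mu2, lv2 = self.mu2(h2), self.lv2(h2)
        mu1, lv1 = self.mu1(h1), self.lv1(h1)
        z2 = self.reparam(mu2, lv2)
        z1 = alpha * self.reparam(mu1, lv1)            # lower-level latent is faded in progressively
        return self.dec(torch.cat([z2, z1], dim=-1)), (mu2, lv2), (mu1, lv1)

def kl(mu, logvar):
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()

model = TwoLevelVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    x = torch.rand(64, 784)                            # placeholder data
    alpha = min(1.0, max(0.0, (step - 300) / 200))     # introduce z1 only after z2 has matured
    x_hat, (mu2, lv2), (mu1, lv1) = model(x, alpha)
    loss = F.mse_loss(x_hat, x) + kl(mu2, lv2) + alpha * kl(mu1, lv1)
    opt.zero_grad(); loss.backward(); opt.step()
```

The only "progressive" ingredient here is the ramping coefficient alpha: a new, less abstract representation enters the model only after the more abstract one has had time to stabilize.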
Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos
Title | Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos |
Authors | Aysegul Dundar, Kevin J. Shih, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro |
Abstract | Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foreground object of interest. However, reconstructing the entire image forces the model to allocate landmarks to modeling the background. This work explores the effects of factorizing the reconstruction task into separate foreground and background reconstructions, conditioning only the foreground reconstruction on the unsupervised landmarks. Our experiments demonstrate that the proposed factorization results in landmarks that are focused on the foreground object of interest. Furthermore, the rendered background quality is also improved, as the background rendering pipeline no longer requires the ill-suited landmarks to model its pose and appearance. We demonstrate this improvement in the context of the video-prediction task. |
Tasks | Video Prediction |
Published | 2020-01-26 |
URL | https://arxiv.org/abs/2001.09518v1 |
PDF | https://arxiv.org/pdf/2001.09518v1.pdf |
PWC | https://paperswithcode.com/paper/unsupervised-disentanglement-of-pose-1 |
Repo | https://github.com/NVIDIA/UnsupervisedLandmarkLearning |
Framework | pytorch |
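A tiny, hedged illustration of the compositing at the heart of this factorization: the foreground render (conditioned on the unsupervised landmarks) and the background render (not conditioned on them) are blended with a soft mask before the reconstruction loss. The tensors below are placeholders standing in for network outputs.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 4, 3, 64, 64
img = torch.rand(B, C, H, W)                 # input frame (placeholder data)
fg  = torch.rand(B, C, H, W)                 # foreground render, conditioned on landmarks + appearance
bg  = torch.rand(B, C, H, W)                 # background render, NOT conditioned on landmarks
mask_logits = torch.randn(B, 1, H, W)        # predicted from the pose (landmark) stream

mask = torch.sigmoid(mask_logits)            # soft foreground mask
recon = mask * fg + (1.0 - mask) * bg        # composite the two reconstructions
loss = F.l1_loss(recon, img)                 # in the full model, only the fg branch receives landmark gradients
```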
Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction
Title | Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction |
Authors | Vincent Le Guen, Nicolas Thome |
Abstract | Leveraging physical knowledge described by partial differential equations (PDEs) is an appealing way to improve unsupervised video prediction methods. Since physics is too restrictive for describing the full visual content of generic videos, we introduce PhyDNet, a two-branch deep architecture, which explicitly disentangles PDE dynamics from unknown complementary information. A second contribution is to propose a new recurrent physical cell (PhyCell), inspired by data assimilation techniques, for performing PDE-constrained prediction in latent space. Extensive experiments conducted on four diverse datasets show the ability of PhyDNet to outperform state-of-the-art methods. Ablation studies also highlight the significant gains brought by both disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet presents interesting features for dealing with missing data and long-term forecasting. |
Tasks | Video Prediction |
Published | 2020-03-03 |
URL | https://arxiv.org/abs/2003.01460v2 |
PDF | https://arxiv.org/pdf/2003.01460v2.pdf |
PWC | https://paperswithcode.com/paper/disentangling-physical-dynamics-from-unknown |
Repo | https://github.com/vincent-leguen/PhyDNet |
Framework | none |
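The following is a rough sketch, under stated assumptions, of the two-branch idea: one recurrent branch plays the role of the PDE-constrained PhyCell, a second branch absorbs the unknown complementary factors, and their latent contributions are summed before decoding each frame. The simple convolutional cells below are stand-ins, not the released PhyDNet modules.

```python
import torch
import torch.nn as nn

class TwoBranchPredictor(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.ch = ch
        self.encoder = nn.Conv2d(1, ch, 3, padding=1)
        self.phys_cell = nn.Conv2d(ch, ch, 3, padding=1)        # stand-in for PhyCell (PDE-constrained update)
        self.resid_cell = nn.Conv2d(2 * ch, ch, 3, padding=1)   # stand-in for a ConvLSTM over unknown factors
        self.decoder = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, frames):                                  # frames: (B, T, 1, H, W)
        B, T, _, H, W = frames.shape
        h_p = torch.zeros(B, self.ch, H, W)
        h_r = torch.zeros(B, self.ch, H, W)
        preds = []
        for t in range(T):
            e = torch.relu(self.encoder(frames[:, t]))
            h_p = torch.tanh(self.phys_cell(h_p + e))                       # physics-like latent update
            h_r = torch.tanh(self.resid_cell(torch.cat([h_r, e], dim=1)))   # whatever physics cannot explain
            preds.append(self.decoder(h_p + h_r))                           # the two disentangled latents are summed
        return torch.stack(preds, dim=1)

video = torch.rand(2, 5, 1, 32, 32)                 # placeholder (batch, time, channel, H, W)
print(TwoBranchPredictor()(video).shape)            # torch.Size([2, 5, 1, 32, 32])
```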
Learning Robust Representations via Multi-View Information Bottleneck
Title | Learning Robust Representations via Multi-View Information Bottleneck |
Authors | Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, Zeynep Akata |
Abstract | The information bottleneck principle provides an information-theoretic method for representation learning, by training an encoder to retain all information which is relevant for predicting the label while minimizing the amount of other, excess information in the representation. The original formulation, however, requires labeled data to identify the superfluous information. In this work, we extend this ability to the multi-view unsupervised setting, where two views of the same underlying entity are provided but the label is unknown. This enables us to identify superfluous information as that not shared by both views. A theoretical analysis leads to the definition of a new multi-view model that produces state-of-the-art results on the Sketchy dataset and label-limited versions of the MIR-Flickr dataset. We also extend our theory to the single-view setting by taking advantage of standard data augmentation techniques, empirically showing better generalization capabilities when compared to common unsupervised approaches for representation learning. |
Tasks | Data Augmentation, Representation Learning |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.07017v2 |
PDF | https://arxiv.org/pdf/2002.07017v2.pdf |
PWC | https://paperswithcode.com/paper/learning-robust-representations-via-multi-1 |
Repo | https://github.com/mfederici/Multi-View-Information-Bottleneck |
Framework | pytorch |
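Below is a hedged sketch of a multi-view bottleneck-style objective: an InfoNCE term encourages the two latents to share information, while a symmetrized KL between the two Gaussian posteriors penalizes view-specific (superfluous) information. The encoder sizes, the InfoNCE estimator, and the weight beta are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    def __init__(self, x_dim=128, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def gaussian_skl(mu1, lv1, mu2, lv2):
    """Symmetrized KL between two diagonal Gaussians (penalizes view-specific information)."""
    kl12 = 0.5 * ((lv2 - lv1) + (lv1.exp() + (mu1 - mu2) ** 2) / lv2.exp() - 1).sum(-1)
    kl21 = 0.5 * ((lv1 - lv2) + (lv2.exp() + (mu1 - mu2) ** 2) / lv1.exp() - 1).sum(-1)
    return (kl12 + kl21).mean()

def info_nce(z1, z2, temperature=0.1):
    """Simple InfoNCE lower bound on I(z1; z2) over a batch (minimizing this loss maximizes the bound)."""
    logits = F.normalize(z1) @ F.normalize(z2).t() / temperature
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

enc1, enc2 = GaussianEncoder(), GaussianEncoder()
beta = 1e-2
v1, v2 = torch.rand(64, 128), torch.rand(64, 128)     # two views of the same underlying entities
mu1, lv1 = enc1(v1)
mu2, lv2 = enc2(v2)
z1 = mu1 + torch.randn_like(mu1) * (0.5 * lv1).exp()
z2 = mu2 + torch.randn_like(mu2) * (0.5 * lv2).exp()
loss = info_nce(z1, z2) + beta * gaussian_skl(mu1, lv1, mu2, lv2)
```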
An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
Title | An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos |
Authors | Sicheng Zhao, Yunsheng Ma, Yang Gu, Jufeng Yang, Tengfei Xing, Pengfei Xu, Runbo Hu, Hua Chai, Kurt Keutzer |
Abstract | Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e., polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet. |
Tasks | Emotion Recognition, Video Emotion Recognition |
Published | 2020-02-12 |
URL | https://arxiv.org/abs/2003.00832v1 |
PDF | https://arxiv.org/pdf/2003.00832v1.pdf |
PWC | https://paperswithcode.com/paper/an-end-to-end-visual-audio-attention-network |
Repo | https://github.com/maysonma/VAANet |
Framework | none |
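A minimal sketch of a polarity-consistent cross-entropy in the spirit of the loss described above: standard cross-entropy is up-weighted whenever the predicted emotion falls in the wrong polarity group. The emotion-to-polarity mapping and the penalty weight are hypothetical, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# hypothetical mapping: 8 emotion classes -> polarity (1 = positive, 0 = negative)
POLARITY = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])

def polarity_consistent_ce(logits, targets, lam=0.5):
    ce = F.cross_entropy(logits, targets, reduction="none")
    pred = logits.argmax(dim=1)
    mismatch = (POLARITY[pred] != POLARITY[targets]).float()   # 1 if the predicted polarity is wrong
    return (ce * (1.0 + lam * mismatch)).mean()                # penalize polarity-inconsistent errors more

logits = torch.randn(16, 8)                 # scores for 8 emotion classes (placeholder)
targets = torch.randint(0, 8, (16,))
loss = polarity_consistent_ce(logits, targets)
```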
A Survey of Deep Learning Techniques for Neural Machine Translation
Title | A Survey of Deep Learning Techniques for Neural Machine Translation |
Authors | Shuoheng Yang, Yuxin Wang, Xiaowen Chu |
Abstract | In recent years, natural language processing (NLP) has advanced greatly with deep learning techniques. In the sub-field of machine translation, a new approach named Neural Machine Translation (NMT) has emerged and attracted massive attention from both academia and industry. However, despite the significant number of studies proposed in the past several years, little work has investigated the development process of this new technology trend. This literature survey traces back the origin and principal development timeline of NMT, investigates the important branches, categorizes different research orientations, and discusses some future research trends in this field. |
Tasks | Machine Translation |
Published | 2020-02-18 |
URL | https://arxiv.org/abs/2002.07526v1 |
PDF | https://arxiv.org/pdf/2002.07526v1.pdf |
PWC | https://paperswithcode.com/paper/a-survey-of-deep-learning-techniques-for-2 |
Repo | https://github.com/SFFAI-AIKT/AIKT-Natural_Language_Processing |
Framework | none |
Overfitting in adversarially robust deep learning
Title | Overfitting in adversarially robust deep learning |
Authors | Leslie Rice, Eric Wong, J. Zico Kolter |
Abstract | It is common practice in deep learning to use overparameterized networks and train for as long as possible; there are numerous studies that show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier. In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models ($\ell_\infty$ and $\ell_2$). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at https://github.com/locuslab/robust_overfitting. |
Tasks | Data Augmentation |
Published | 2020-02-26 |
URL | https://arxiv.org/abs/2002.11569v2 |
PDF | https://arxiv.org/pdf/2002.11569v2.pdf |
PWC | https://paperswithcode.com/paper/overfitting-in-adversarially-robust-deep |
Repo | https://github.com/locuslab/robust_overfitting |
Framework | pytorch |
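The practical takeaway above (early stopping matches most recent algorithmic gains) can be sketched as follows: after each epoch of PGD adversarial training, measure robust accuracy on a held-out split and keep the best checkpoint. The model, data loaders, and PGD hyperparameters are placeholders, not the authors' exact setup.

```python
import copy
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # simple L-infinity PGD; clamping inputs to [0, 1] is omitted for brevity
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        delta.data = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach()

def train_with_robust_early_stopping(model, train_loader, val_loader, epochs, opt):
    best_acc, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                       # adversarial training step
            x_adv = pgd_attack(model, x, y)
            opt.zero_grad()
            F.cross_entropy(model(x_adv), y).backward()
            opt.step()
        model.eval()
        correct = total = 0
        for x, y in val_loader:                         # robust accuracy on a held-out split
            x_adv = pgd_attack(model, x, y)
            correct += (model(x_adv).argmax(1) == y).sum().item()
            total += y.numel()
        if correct / total > best_acc:                  # early stopping: keep the best robust checkpoint
            best_acc, best_state = correct / total, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_acc
```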
Improving Generalization by Controlling Label-Noise Information in Neural Network Weights
Title | Improving Generalization by Controlling Label-Noise Information in Neural Network Weights |
Authors | Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan |
Abstract | In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. Standard regularization techniques such as dropout, weight decay or data augmentation sometimes help, but do not prevent this behavior. If one considers neural network weights as random variables that depend on the data and stochasticity of training, the amount of memorized information can be quantified with the Shannon mutual information between weights and the vector of all training labels given inputs, $I(w : \mathbf{y} \mid \mathbf{x})$. We show that for any training algorithm, low values of this term correspond to reduction in memorization of label-noise and better generalization bounds. To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. We illustrate the effectiveness of our approach on versions of MNIST, CIFAR-10, and CIFAR-100 corrupted with various noise models, and on a large-scale dataset Clothing1M that has noisy labels. |
Tasks | Data Augmentation |
Published | 2020-02-19 |
URL | https://arxiv.org/abs/2002.07933v1 |
PDF | https://arxiv.org/pdf/2002.07933v1.pdf |
PWC | https://paperswithcode.com/paper/improving-generalization-by-controlling-label |
Repo | https://github.com/hrayrhar/limit-label-memorization |
Framework | pytorch |
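Below is a very rough, hedged sketch of the gradient-prediction idea: rather than letting noisy labels flow into the final-layer gradients, an auxiliary network predicts a soft target from the input alone and the classifier is updated with the resulting predicted gradient. How the auxiliary network itself is trained, and all architectures, are omitted assumptions; see the paper and repository for the actual algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
classifier = nn.Linear(128, 10)                            # final layer whose gradient we control
aux_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))  # predicts soft targets without seeing labels

x = torch.rand(32, 1, 28, 28)                              # placeholder batch
feats = feature_net(x)
logits = classifier(feats)

q = F.softmax(aux_net(x), dim=1)                           # label-free soft target
pred_grad = F.softmax(logits, dim=1) - q                   # stands in for softmax(logits) - onehot(y)

# Backpropagate the predicted gradient through the network; no label information enters the weights.
logits.backward(gradient=pred_grad.detach() / x.size(0))
```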
Addressing the confounds of accompaniments in singer identification
Title | Addressing the confounds of accompaniments in singer identification |
Authors | Tsung-Han Hsieh, Kai-Hsiang Cheng, Zhe-Cheng Fan, Yu-Ching Yang, Yi-Hsuan Yang |
Abstract | Identifying singers is an important task with many applications. However, the task remains challenging due to many issues. One major issue is related to the confounding factors from the background instrumental music that is mixed with the vocals in music production. A singer identification model may learn to extract non-vocal related features from the instrumental part of the songs, if a singer only sings in certain musical contexts (e.g., genres). The model therefore cannot generalize well when the singer sings in unseen contexts. In this paper, we attempt to address this issue. Specifically, we employ open-unmix, an open-source tool with state-of-the-art performance in source separation, to separate the vocal and instrumental tracks of music. We then investigate two means to train a singer identification model: by learning from the separated vocal only, or from an augmented set of data where we “shuffle-and-remix” the separated vocal tracks and instrumental tracks of different songs to artificially make the singers sing in different contexts. We also incorporate melodic features learned from the vocal melody contour for better performance. Evaluation results on a benchmark dataset called artist20 show that this data augmentation method greatly improves the accuracy of singer identification. |
Tasks | Data Augmentation |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.06817v1 |
PDF | https://arxiv.org/pdf/2002.06817v1.pdf |
PWC | https://paperswithcode.com/paper/addressing-the-confounds-of-accompaniments-in |
Repo | https://github.com/bill317996/Singer-identification-in-artist20 |
Framework | pytorch |
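A small, hedged sketch of the “shuffle-and-remix” augmentation: vocal and instrumental tracks (separated beforehand, e.g. with a source-separation tool such as open-unmix) are re-paired across songs so that each singer is heard over many different accompaniments while the singer label stays with the vocal track. The array shapes and gains are assumptions.

```python
import numpy as np

def shuffle_and_remix(vocals, accompaniments, rng=None):
    """vocals, accompaniments: arrays of shape (n_songs, n_samples) of separated tracks."""
    rng = rng if rng is not None else np.random.default_rng()
    perm = rng.permutation(len(accompaniments))    # pair each vocal with another song's backing track
    return vocals + accompaniments[perm]

vocals = np.random.randn(8, 16000).astype(np.float32)   # placeholder separated vocal tracks
accomp = np.random.randn(8, 16000).astype(np.float32)   # placeholder separated instrumental tracks
augmented = shuffle_and_remix(vocals, accomp)            # singer labels stay with the vocal tracks
```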
Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs
Title | Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs |
Authors | Sangwoo Mo, Minsu Cho, Jinwoo Shin |
Abstract | Generative adversarial networks (GANs) have shown outstanding performance on a wide range of problems in computer vision, graphics, and machine learning, but often require large amounts of training data and heavy computational resources. To tackle this issue, several methods introduce a transfer learning technique in GAN training. They, however, are either prone to overfitting or limited to learning small distribution shifts. In this paper, we show that simple fine-tuning of GANs with frozen lower layers of the discriminator performs surprisingly well. This simple baseline, FreezeD, significantly outperforms previous techniques used in both unconditional and conditional GANs. We demonstrate the consistent effect using the StyleGAN and SNGAN-projection architectures on several datasets: Animal Face, Anime Face, Oxford Flower, CUB-200-2011, and Caltech-256. The code and results are available at https://github.com/sangwoomo/FreezeD. |
Tasks | Image Generation, Transfer Learning |
Published | 2020-02-25 |
URL | https://arxiv.org/abs/2002.10964v2 |
PDF | https://arxiv.org/pdf/2002.10964v2.pdf |
PWC | https://paperswithcode.com/paper/freeze-discriminator-a-simple-baseline-for |
Repo | https://github.com/sangwoomo/freezeD |
Framework | pytorch |
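A minimal sketch of the FreezeD recipe described above: when fine-tuning a pretrained GAN on a new dataset, freeze the lower (feature-extracting) layers of the discriminator and update only its upper layers together with the generator. The toy discriminator below and the cut-off index are illustrative stand-ins, not the StyleGAN/SNGAN setups used in the paper.

```python
import torch
import torch.nn as nn

discriminator = nn.Sequential(                    # stand-in for a pretrained discriminator (64x64 inputs)
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.Linear(256 * 8 * 8, 1),
)

freeze_until = 4                                  # freeze the first two conv blocks
for i, module in enumerate(discriminator):
    for p in module.parameters():
        p.requires_grad = i >= freeze_until       # lower layers frozen, upper layers fine-tuned

d_opt = torch.optim.Adam(
    [p for p in discriminator.parameters() if p.requires_grad], lr=2e-4, betas=(0.0, 0.99)
)
scores = discriminator(torch.rand(4, 3, 64, 64))  # shape (4, 1); only unfrozen layers receive updates
```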
QTIP: Quick simulation-based adaptation of Traffic model per Incident Parameters
Title | QTIP: Quick simulation-based adaptation of Traffic model per Incident Parameters |
Authors | Inon Peled, Raghuveer Kamalakar, Carlos Lima Azevedo, Francisco C. Pereira |
Abstract | Current data-driven traffic prediction models are usually trained with large datasets, e.g. several months of speeds and flows. Such models provide a very good fit for ordinary road conditions, but often fail just when they are most needed: when traffic suffers a sudden and significant disruption, such as a road incident. In this work, we describe QTIP: a simulation-based framework for quasi-instantaneous adaptation of prediction models upon traffic disruption. In a nutshell, QTIP performs real-time simulations of the affected road for multiple scenarios, analyzes the results, and suggests a change to an ordinary prediction model accordingly. QTIP constructs the simulated scenarios per properties of the incident, as conveyed by immediate distress signals from affected vehicles. Such real-time signals are provided by In-Vehicle Monitor Systems, which are becoming increasingly prevalent worldwide. We experiment with QTIP in a case study of a Danish motorway, and the results show that QTIP can improve traffic prediction in the first critical minutes of road incidents. |
Tasks | Traffic Prediction |
Published | 2020-03-09 |
URL | https://arxiv.org/abs/2003.04109v1 |
PDF | https://arxiv.org/pdf/2003.04109v1.pdf |
PWC | https://paperswithcode.com/paper/qtip-quick-simulation-based-adaptation-of |
Repo | https://github.com/inon-peled/qtip_code_pub |
Framework | none |
KFNet: Learning Temporal Camera Relocalization using Kalman Filtering
Title | KFNet: Learning Temporal Camera Relocalization using Kalman Filtering |
Authors | Lei Zhou, Zixin Luo, Tianwei Shen, Jiahui Zhang, Mingmin Zhen, Yao Yao, Tian Fang, Long Quan |
Abstract | Temporal camera relocalization estimates the pose with respect to each video frame in sequence, as opposed to one-shot relocalization which focuses on a still image. Even though the time dependency has been taken into account, current temporal relocalization methods still generally underperform the state-of-the-art one-shot approaches in terms of accuracy. In this work, we improve the temporal relocalization method by using a network architecture that incorporates Kalman filtering (KFNet) for online camera relocalization. In particular, KFNet extends the scene coordinate regression problem to the time domain in order to recursively establish 2D and 3D correspondences for pose determination. The network architecture design and the loss formulation are based on Kalman filtering in the context of Bayesian learning. Extensive experiments on multiple relocalization benchmarks demonstrate the high accuracy of KFNet, which ranks at the top of both one-shot and temporal relocalization approaches. Our code is released at https://github.com/zlthinker/KFNet. |
Tasks | Camera Relocalization |
Published | 2020-03-24 |
URL | https://arxiv.org/abs/2003.10629v1 |
PDF | https://arxiv.org/pdf/2003.10629v1.pdf |
PWC | https://paperswithcode.com/paper/kfnet-learning-temporal-camera-relocalization |
Repo | https://github.com/zlthinker/KFNet |
Framework | tf |
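To illustrate the predict/correct recursion that KFNet builds on, here is a deliberately simplified scalar Kalman filter; KFNet learns these steps with networks over per-pixel scene coordinates, so this is only a conceptual sketch, not the method itself, and the noise variances are arbitrary.

```python
import numpy as np

def kalman_1d(measurements, meas_var, process_var):
    mean, var = measurements[0], meas_var        # initialize from the first observation
    estimates = [mean]
    for z in measurements[1:]:
        var = var + process_var                  # prediction step (state assumed locally constant)
        gain = var / (var + meas_var)            # Kalman gain: how much to trust the new measurement
        mean = mean + gain * (z - mean)          # correction step
        var = (1.0 - gain) * var
        estimates.append(mean)
    return np.array(estimates)

noisy = 2.0 + 0.3 * np.random.randn(50)          # placeholder noisy per-frame estimates
smoothed = kalman_1d(noisy, meas_var=0.09, process_var=0.01)
```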
Point-Based Methods for Model Checking in Partially Observable Markov Decision Processes
Title | Point-Based Methods for Model Checking in Partially Observable Markov Decision Processes |
Authors | Maxime Bouton, Jana Tumova, Mykel J. Kochenderfer |
Abstract | Autonomous systems are often required to operate in partially observable environments. They must reliably execute a specified objective even with incomplete information about the state of the environment. We propose a methodology to synthesize policies that satisfy a linear temporal logic formula in a partially observable Markov decision process (POMDP). By formulating a planning problem, we show how to use point-based value iteration methods to efficiently approximate the maximum probability of satisfying a desired logical formula and compute the associated belief state policy. We demonstrate that our method scales to large POMDP domains and provides strong bounds on the performance of the resulting policy. |
Tasks | |
Published | 2020-01-11 |
URL | https://arxiv.org/abs/2001.03809v1 |
PDF | https://arxiv.org/pdf/2001.03809v1.pdf |
PWC | https://paperswithcode.com/paper/point-based-methods-for-model-checking-in |
Repo | https://github.com/sisl/POMDPModelChecking.jl |
Framework | none |
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework
Title | The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework |
Authors | Chandan K. A. Reddy, Ebrahim Beyrami, Harishchandra Dubey, Vishak Gopal, Roger Cheng, Ross Cutler, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke |
Abstract | The INTERSPEECH 2020 Deep Noise Suppression Challenge is intended to promote collaborative research in real-time single-channel speech enhancement aimed at maximizing the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluating noise suppression methods is to use objective metrics on a test set obtained by splitting the original dataset. Many publications report reasonable performance on the synthetic test set drawn from the same distribution as that of the training set. However, the model performance often degrades significantly on real recordings. Also, most of the conventional objective metrics do not correlate well with subjective tests, and lab subjective tests are not scalable for a large test set. In this challenge, we open-source a large clean speech and noise corpus for training noise suppression models, together with a test set representative of real-world scenarios, consisting of both synthetic and real recordings. We also open-source an online subjective test framework based on ITU-T P.808 for researchers to quickly test their developments. The winners of this challenge will be selected based on subjective evaluation on a representative test set using the P.808 framework. |
Tasks | Speech Enhancement |
Published | 2020-01-23 |
URL | https://arxiv.org/abs/2001.08662v1 |
PDF | https://arxiv.org/pdf/2001.08662v1.pdf |
PWC | https://paperswithcode.com/paper/the-interspeech-2020-deep-noise-suppression |
Repo | https://github.com/microsoft/DNS-Challenge |
Framework | none |
Discontinuous Constituent Parsing with Pointer Networks
Title | Discontinuous Constituent Parsing with Pointer Networks |
Authors | Daniel Fernández-González, Carlos Gómez-Rodríguez |
Abstract | One of the most complex syntactic representations used in computational linguistics and NLP is the discontinuous constituent tree, crucial for representing all grammatical phenomena of languages such as German. Recent advances in dependency parsing have shown that Pointer Networks excel in efficiently parsing syntactic relations between words in a sentence. This kind of sequence-to-sequence model achieves outstanding accuracy in building non-projective dependency trees, but its potential has not yet been proven on a more difficult task. We propose a novel neural network architecture that, by means of Pointer Networks, is able to generate the most accurate discontinuous constituent representations to date, even without the need for Part-of-Speech tagging information. To do so, we internally model discontinuous constituent structures as augmented non-projective dependency structures. The proposed approach achieves state-of-the-art results on the two widely-used NEGRA and TIGER benchmarks, outperforming previous work by a wide margin. |
Tasks | Dependency Parsing, Part-Of-Speech Tagging |
Published | 2020-02-05 |
URL | https://arxiv.org/abs/2002.01824v1 |
PDF | https://arxiv.org/pdf/2002.01824v1.pdf |
PWC | https://paperswithcode.com/paper/discontinuous-constituent-parsing-with |
Repo | https://github.com/danifg/DiscoPointer |
Framework | pytorch |
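A hedged sketch of the pointer mechanism underlying such a parser: for each word, a decoder state attends over the encoder states of the sentence and “points” to another position (e.g. its head in the augmented non-projective structure). The dimensions and the additive scoring function are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Pointer(nn.Module):
    def __init__(self, enc_dim=128, dec_dim=128, att_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, att_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (B, T, enc_dim), dec_state: (B, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state).unsqueeze(1)))
        return scores.squeeze(-1)                      # (B, T): one score per candidate position

enc = torch.randn(2, 7, 128)                           # encoder states for a 7-word sentence (placeholder)
dec = torch.randn(2, 128)                              # decoder state for the current word (placeholder)
pointer = Pointer()
probs = pointer(enc, dec).softmax(dim=-1)              # distribution over positions to point at
head = probs.argmax(dim=-1)                            # predicted position for each batch element
```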