Paper Group AWR 336
Stochastic Video Generation with a Learned Prior
Title | Stochastic Video Generation with a Learned Prior |
Authors | Emily Denton, Rob Fergus |
Abstract | Generating video frames that accurately predict future world states is challenging. Existing approaches either fail to capture the full distribution of outcomes, or yield blurry generations, or both. In this paper we introduce an unsupervised video generation model that learns a prior model of uncertainty in a given environment. Video frames are generated by drawing samples from this prior and combining them with a deterministic estimate of the future frame. The approach is simple and easily trained end-to-end on a variety of datasets. Sample generations are both varied and sharp, even many frames into the future, and compare favorably to those from existing approaches. |
Tasks | Video Generation |
Published | 2018-02-21 |
URL | http://arxiv.org/abs/1802.07687v2 |
http://arxiv.org/pdf/1802.07687v2.pdf | |
PWC | https://paperswithcode.com/paper/stochastic-video-generation-with-a-learned |
Repo | https://github.com/edenton/svg |
Framework | pytorch |
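The core mechanism, sampling a per-step latent from a learned prior and combining it with a deterministic encoding of the past frame, can be sketched compactly. Below is a minimal illustration with assumed dimensions (`h_dim`, `z_dim`) and a stand-in linear predictor, not the authors' exact SVG-LP architecture:

```python
# Minimal sketch of a learned-prior video model: an LSTM prior predicts a
# per-step latent distribution; a sample from it is concatenated with a
# deterministic frame encoding to predict the next frame's encoding.
import torch
import torch.nn as nn

class LearnedPrior(nn.Module):
    def __init__(self, h_dim=64, z_dim=16):
        super().__init__()
        self.rnn = nn.LSTMCell(h_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, h_frame, state):
        h, c = self.rnn(h_frame, state)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
        return z, (h, c)

# Toy usage: a random vector stands in for a CNN encoding of the past frame.
prior = LearnedPrior()
predictor = nn.Linear(64 + 16, 64)    # stands in for the frame-decoder network
h_frame = torch.randn(1, 64)
state = (torch.zeros(1, 64), torch.zeros(1, 64))
z, state = prior(h_frame, state)
h_next = predictor(torch.cat([h_frame, z], dim=1))
```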
Using link and content over time for embedding generation in Dynamic Attributed Networks
Title | Using link and content over time for embedding generation in Dynamic Attributed Networks |
Authors | Ana Paula Appel, Renato L. F. Cunha, Charu C. Aggarwal, Marcela Megumi Terakado |
Abstract | In this work, we consider the problem of combining link, content and temporal analysis for community detection and prediction in evolving networks. Such temporal and content-rich networks occur in many real-life settings, such as bibliographic networks and question answering forums. Most of the work in the literature (that uses both content and structure) deals with static snapshots of networks and does not reflect the dynamic changes occurring over multiple snapshots. Incorporating dynamic changes in the communities into the analysis can also provide useful insights about the changes in the network, such as the migration of authors across communities. In this work, we propose Chimera, a shared factorization model that can simultaneously account for graph links, content, and temporal analysis. This approach works by extracting the latent semantic structure of the network in multidimensional form, but in a way that takes into account the temporal continuity of these embeddings. Such an approach simplifies temporal analysis of the underlying network by using the embedding as a surrogate. A consequence of this simplification is that it is also possible to use this temporal sequence of embeddings to predict future communities. We present experimental results illustrating the effectiveness of the approach. |
Tasks | Community Detection, Question Answering |
Published | 2018-07-17 |
URL | https://arxiv.org/abs/1807.06560v2 |
https://arxiv.org/pdf/1807.06560v2.pdf | |
PWC | https://paperswithcode.com/paper/temporally-evolving-community-detection-and |
Repo | https://github.com/renatolfc/chimera-stf |
Framework | tf |
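As a rough illustration of a shared factorization with temporal continuity, the NumPy sketch below jointly factorizes toy adjacency and content snapshots with a penalty tying consecutive embeddings together. The objective, step sizes, and dimensions are our assumptions, not Chimera's actual model:

```python
# Shared factorization sketch: per-snapshot embeddings U[t] reconstruct both the
# adjacency A[t] (via U U^T) and the content C[t] (via U V^T), with a smoothness
# penalty ||U[t] - U[t-1]||^2 enforcing temporal continuity.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, T, lr, lam = 30, 10, 5, 4, 0.005, 0.5
A = []
for _ in range(T):
    M = rng.random((n, n))
    A.append((M + M.T) / 2)                     # symmetric toy adjacency
C = [rng.random((n, d)) for _ in range(T)]      # toy node content
U = [rng.standard_normal((n, k)) * 0.1 for _ in range(T)]
V = rng.standard_normal((d, k)) * 0.1           # shared content factor

for _ in range(300):
    for t in range(T):
        R_a = A[t] - U[t] @ U[t].T              # link residual
        R_c = C[t] - U[t] @ V.T                 # content residual
        grad_U = -4 * R_a @ U[t] - 2 * R_c @ V
        if t > 0:
            grad_U += 2 * lam * (U[t] - U[t - 1])   # temporal-continuity penalty
        U[t] -= lr * grad_U
        V -= lr * (-2 * R_c.T @ U[t])

print(np.linalg.norm(A[0] - U[0] @ U[0].T))     # link reconstruction error shrinks
```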
A Large-Scale Study on Regularization and Normalization in GANs
Title | A Large-Scale Study on Regularization and Normalization in GANs |
Authors | Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, Sylvain Gelly |
Abstract | Generative adversarial networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion. While they were successfully applied to many problems, training a GAN is a notoriously challenging task and requires a significant amount of hyperparameter tuning, neural architecture engineering, and a non-trivial number of “tricks”. The success in many practical applications coupled with the lack of a measure to quantify the failure modes of GANs resulted in a plethora of proposed losses, regularization and normalization schemes, as well as neural architectures. In this work we take a sober view of the current state of GANs from a practical perspective. We discuss and evaluate common pitfalls and reproducibility issues, open-source our code on GitHub, and provide pre-trained models on TensorFlow Hub. |
Tasks | |
Published | 2018-07-12 |
URL | https://arxiv.org/abs/1807.04720v3 |
https://arxiv.org/pdf/1807.04720v3.pdf | |
PWC | https://paperswithcode.com/paper/the-gan-landscape-losses-architectures |
Repo | https://github.com/w510056105/DeepLearning |
Framework | tf |
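For a concrete example of the kind of regularization scheme such studies compare, here is a minimal sketch of the WGAN-GP gradient penalty (our illustration, not the paper's code; it assumes 4-D image batches):

```python
# Gradient penalty: penalize the discriminator's gradient norm (pulled toward 1)
# on random interpolations between real and fake samples.
import torch

def gradient_penalty(discriminator, real, fake):
    eps = torch.rand(real.size(0), 1, 1, 1)              # per-sample mixing weight
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = discriminator(x_hat)
    grads, = torch.autograd.grad(d_out.sum(), x_hat, create_graph=True)
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

# Toy usage with a stand-in discriminator on 8x8 RGB images.
disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))
gp = gradient_penalty(disc, torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8))
print(gp)
```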
Pre-training on high-resource speech recognition improves low-resource speech-to-text translation
Title | Pre-training on high-resource speech recognition improves low-resource speech-to-text translation |
Authors | Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater |
Abstract | We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU. |
Tasks | Speech Recognition |
Published | 2018-09-05 |
URL | http://arxiv.org/abs/1809.01431v2 |
http://arxiv.org/pdf/1809.01431v2.pdf | |
PWC | https://paperswithcode.com/paper/pre-training-on-high-resource-speech |
Repo | https://github.com/0xSameer/ast |
Framework | none |
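The transfer recipe itself is straightforward to sketch. The snippet below shows the warm-starting pattern with assumed layer sizes and plain LSTMs standing in for the paper's encoder-decoder models:

```python
# Pre-train an encoder-decoder on high-resource ASR, then warm-start the
# encoder for ST and fine-tune everything on the low-resource ST data.
import copy
import torch.nn as nn

encoder = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)      # acoustic encoder
asr_decoder = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)

# ... pre-train (encoder, asr_decoder) on the high-resource ASR corpus ...
pretrained_encoder = copy.deepcopy(encoder.state_dict())

# Fine-tune for ST: warm-start the encoder, pair it with a fresh decoder.
st_encoder = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)
st_encoder.load_state_dict(pretrained_encoder)
st_decoder = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
# All parameters, including the warm-started encoder, are then trained on ST.
```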
SPLATNet: Sparse Lattice Networks for Point Cloud Processing
Title | SPLATNet: Sparse Lattice Networks for Point Cloud Processing |
Authors | Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, Jan Kautz |
Abstract | We present a network architecture for processing point clouds that directly operates on a collection of points represented as a sparse set of samples in a high-dimensional lattice. Naively applying convolutions on this lattice scales poorly, both in terms of memory and computational cost, as the size of the lattice increases. Instead, our network uses sparse bilateral convolutional layers as building blocks. These layers maintain efficiency by using indexing structures to apply convolutions only on occupied parts of the lattice, and allow flexible specifications of the lattice structure enabling hierarchical and spatially-aware feature learning, as well as joint 2D-3D reasoning. Both point-based and image-based representations can be easily incorporated in a network with such layers and the resulting model can be trained in an end-to-end manner. We present results on 3D segmentation tasks where our approach outperforms existing state-of-the-art techniques. |
Tasks | 3D Part Segmentation, Semantic Segmentation |
Published | 2018-02-22 |
URL | http://arxiv.org/abs/1802.08275v4 |
http://arxiv.org/pdf/1802.08275v4.pdf | |
PWC | https://paperswithcode.com/paper/splatnet-sparse-lattice-networks-for-point |
Repo | https://github.com/IsaacRe/splatnet |
Framework | caffe2 |
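The splat, convolve-on-occupied-cells, slice pipeline can be illustrated in one dimension. The sketch below is a heavy simplification (a 1-D integer lattice rather than SPLATNet's high-dimensional permutohedral lattice), but it shows why convolving only occupied cells keeps the cost proportional to the number of points:

```python
# Splat -> sparse convolution -> slice, on a 1-D integer lattice.
import numpy as np

points = np.array([0.2, 0.25, 1.9, 2.1, 5.0])    # 1-D point coordinates
feats = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

cells = np.round(points).astype(int)             # splat: nearest lattice cell
occupied = np.unique(cells)
lattice = {c: feats[cells == c].sum() for c in occupied}

# Convolve only over occupied cells (kernel [1, 2, 1] on the sparse lattice).
conv = {c: sum(w * lattice.get(c + o, 0.0)
               for o, w in zip((-1, 0, 1), (1.0, 2.0, 1.0)))
        for c in occupied}

sliced = np.array([conv[c] for c in cells])      # slice: back to the points
print(sliced)                                    # [ 6.  6. 14. 14. 10.]
```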
MAC: Mining Activity Concepts for Language-based Temporal Localization
Title | MAC: Mining Activity Concepts for Language-based Temporal Localization |
Authors | Runzhou Ge, Jiyang Gao, Kan Chen, Ram Nevatia |
Abstract | We address the problem of language-based temporal localization in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging as the language-based queries not only have no pre-defined activity list but also may contain complex descriptions. Previous methods address the problem by considering features from video sliding windows and language queries and learning a subspace to encode their correlation, an approach that ignores rich semantic cues about activities in videos and queries. We propose to mine activity concepts from both video and language modalities by applying the actionness-score-enhanced Activity Concepts based Localizer (ACL). Specifically, the novel ACL encodes the semantic concepts from verb-object pairs in language queries and leverages activity classifiers’ prediction scores to encode visual concepts. In addition, ACL can also regress sliding windows as localization results. Experiments show that ACL significantly outperforms state-of-the-art methods under the widely used metric, with more than a 5% increase on both the Charades-STA and TACoS datasets. |
Tasks | Language-Based Temporal Localization, Temporal Localization |
Published | 2018-11-21 |
URL | http://arxiv.org/abs/1811.08925v1 |
http://arxiv.org/pdf/1811.08925v1.pdf | |
PWC | https://paperswithcode.com/paper/mac-mining-activity-concepts-for-language |
Repo | https://github.com/runzhouge/MAC |
Framework | tf |
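A toy sketch of the scoring idea, simplified from the paper: match each window's activity-classifier concept scores against the query's verb-object concept and weight by actionness. The concept names and scores below are invented for illustration:

```python
# Score sliding windows by (visual concept match with the query's verb-object
# concept) x (actionness of the window), then pick the best window.
import numpy as np

concepts = ["pour_water", "open_door", "cut_vegetable"]
query_concept = np.array([0.0, 0.0, 1.0])        # "cut the vegetable" -> cut_vegetable

windows = np.array([[0.7, 0.2, 0.1],             # per-window classifier scores
                    [0.1, 0.1, 0.8],
                    [0.3, 0.4, 0.3]])
actionness = np.array([0.9, 0.8, 0.2])           # how "action-like" each window is

scores = actionness * (windows @ query_concept)  # concept match x actionness
best = int(np.argmax(scores))
print(f"best window: {best}, score: {scores[best]:.2f}")
```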
Deep convolutional recurrent autoencoders for learning low-dimensional feature dynamics of fluid systems
Title | Deep convolutional recurrent autoencoders for learning low-dimensional feature dynamics of fluid systems |
Authors | Francisco J. Gonzalez, Maciej Balajewicz |
Abstract | Model reduction of high-dimensional dynamical systems alleviates computational burdens faced in various tasks from design optimization to model predictive control. One popular model reduction approach is based on projecting the governing equations onto a subspace spanned by basis functions obtained from the compression of a dataset of solution snapshots. However, this method is intrusive since the projection requires access to the system operators. Further, some systems may require special treatment of nonlinearities to ensure computational efficiency or additional modeling to preserve stability. In this work we propose a deep learning-based strategy for nonlinear model reduction that is inspired by projection-based model reduction where the idea is to identify some optimal low-dimensional representation and evolve it in time. Our approach constructs a modular model consisting of a deep convolutional autoencoder and a modified LSTM network. The deep convolutional autoencoder returns a low-dimensional representation in terms of coordinates on some expressive nonlinear data-supporting manifold. The dynamics on this manifold are then modeled by the modified LSTM network in a computationally efficient manner. An offline unsupervised training strategy that exploits the model modularity is also developed. We demonstrate our model on three illustrative examples, each highlighting the model’s performance in prediction tasks for fluid systems with large parameter variations and its stability in long-term prediction. |
Tasks | |
Published | 2018-08-03 |
URL | http://arxiv.org/abs/1808.01346v2 |
http://arxiv.org/pdf/1808.01346v2.pdf | |
PWC | https://paperswithcode.com/paper/deep-convolutional-recurrent-autoencoders-for |
Repo | https://github.com/panchgonzalez/nmor |
Framework | tf |
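The modular design is easy to sketch: a convolutional autoencoder compresses each snapshot, and an LSTM evolves the resulting code in time. Layer sizes below are assumptions, not the paper's architecture:

```python
# Convolutional autoencoder for per-snapshot compression plus an LSTM that
# models the dynamics of the low-dimensional codes.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Conv2d(1, 8, 4, stride=2, padding=1), nn.ReLU(),
                    nn.Flatten(), nn.Linear(8 * 16 * 16, 32))
dec = nn.Sequential(nn.Linear(32, 8 * 16 * 16), nn.ReLU(),
                    nn.Unflatten(1, (8, 16, 16)),
                    nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1))
latent_rnn = nn.LSTM(input_size=32, hidden_size=32, batch_first=True)

snapshots = torch.randn(1, 10, 1, 32, 32)             # (batch, time, c, h, w)
codes = torch.stack([enc(snapshots[:, t]) for t in range(10)], dim=1)
evolved, _ = latent_rnn(codes)                        # model the latent dynamics
recon = dec(evolved[:, -1])                           # decode the last latent state
print(recon.shape)                                    # torch.Size([1, 1, 32, 32])
```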
Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection
Title | Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection |
Authors | Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, Emil C. Lupu |
Abstract | Machine learning has become an important component of many systems and applications including computer vision, spam filtering, malware and network intrusion detection, among others. Despite the capabilities of machine learning algorithms to extract valuable information from data and produce accurate predictions, it has been shown that these algorithms are vulnerable to attacks. Data poisoning is one of the most relevant security threats against machine learning systems, where attackers can subvert the learning process by injecting malicious samples in the training data. Recent work in adversarial machine learning has shown that the so-called optimal attack strategies can successfully poison linear classifiers, degrading the performance of the system dramatically after compromising a small fraction of the training dataset. In this paper we propose a defence mechanism to mitigate the effect of these optimal poisoning attacks based on outlier detection. We show empirically that the adversarial examples generated by these attack strategies are quite different from genuine points, as no detectability constraints are considered to craft the attack. Hence, they can be detected with an appropriate pre-filtering of the training dataset. |
Tasks | Anomaly Detection, data poisoning, Intrusion Detection, Network Intrusion Detection, Outlier Detection |
Published | 2018-02-08 |
URL | http://arxiv.org/abs/1802.03041v1 |
http://arxiv.org/pdf/1802.03041v1.pdf | |
PWC | https://paperswithcode.com/paper/detection-of-adversarial-training-examples-in |
Repo | https://github.com/lmunoz-gonzalez/Poisoning-Attacks-with-Back-gradient-Optimization |
Framework | none |
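A minimal sketch of outlier-based pre-filtering in this spirit (our illustration; the paper evaluates specific outlier-detection rules): flag training points whose mean distance to their k nearest same-class neighbours exceeds a percentile cutoff.

```python
# Distance-based pre-filtering of a (possibly poisoned) training set.
import numpy as np

def prefilter(X, y, k=5, pct=95):
    keep = np.ones(len(X), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        D = np.linalg.norm(X[idx, None] - X[None, idx], axis=-1)
        knn = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)   # skip self-distance
        keep[idx[knn > np.percentile(knn, pct)]] = False    # flag far-out points
    return keep

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])   # one injected outlier
y = np.zeros(101, dtype=int)
print(prefilter(X, y).sum())   # keeps ~96 of 101 points; the outlier is dropped
```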
Learnable Pooling Methods for Video Classification
Title | Learnable Pooling Methods for Video Classification |
Authors | Sebastian Kmiec, Juhan Bae, Ruijian An |
Abstract | We introduce modifications to state-of-the-art approaches to aggregating local video descriptors by using attention mechanisms and function approximations. Rather than using ensembles of existing architectures, we provide insight into creating new architectures. We demonstrate our solutions in “The 2nd YouTube-8M Video Understanding Challenge”, using frame-level video and audio descriptors. We obtain testing accuracy similar to the state of the art, while meeting budget constraints, and touch upon strategies to improve the state of the art. Model implementations are available at https://github.com/pomonam/LearnablePoolingMethods. |
Tasks | Video Classification, Video Understanding |
Published | 2018-10-01 |
URL | http://arxiv.org/abs/1810.00530v1 |
http://arxiv.org/pdf/1810.00530v1.pdf | |
PWC | https://paperswithcode.com/paper/learnable-pooling-methods-for-video |
Repo | https://github.com/pomonam/LearnablePoolingMethods |
Framework | tf |
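One of the aggregation styles in this family, simple attention pooling over frame descriptors, can be sketched as follows (assumed sizes; not the authors' exact modules):

```python
# Attention pooling: learn per-frame weights and take a weighted sum of
# frame-level descriptors to produce one video-level descriptor.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                     # (batch, time, dim)
        w = torch.softmax(self.score(frames), dim=1)
        return (w * frames).sum(dim=1)             # (batch, dim)

pool = AttentionPool()
video = torch.randn(2, 300, 1024)                  # 300 frame descriptors
print(pool(video).shape)                           # torch.Size([2, 1024])
```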
Gaussian Process Prior Variational Autoencoders
Title | Gaussian Process Prior Variational Autoencoders |
Authors | Francesco Paolo Casale, Adrian V Dalca, Luca Saglietti, Jennifer Listgarten, Nicolo Fusi |
Abstract | Variational autoencoders (VAEs) are a powerful and widely-used class of models to learn complex data distributions in an unsupervised fashion. One important limitation of VAEs is the prior assumption that latent sample representations are independent and identically distributed. However, for many important datasets, such as time-series of images, this assumption is too strong: accounting for covariances between samples, such as those in time, can yield a more appropriate model specification and improve performance in downstream tasks. In this work, we introduce a new model, the Gaussian Process (GP) Prior Variational Autoencoder (GPPVAE), to specifically address this issue. The GPPVAE aims to combine the power of VAEs with the ability to model correlations afforded by GP priors. To achieve efficient inference in this new class of models, we leverage structure in the covariance matrix, and introduce a new stochastic backpropagation strategy that allows for computing stochastic gradients in a distributed and low-memory fashion. We show that our method outperforms conditional VAEs (CVAEs) and an adaptation of standard VAEs in two image data applications. |
Tasks | Time Series |
Published | 2018-10-28 |
URL | http://arxiv.org/abs/1810.11738v2 |
http://arxiv.org/pdf/1810.11738v2.pdf | |
PWC | https://paperswithcode.com/paper/gaussian-process-prior-variational |
Repo | https://github.com/fpcasale/GPPVAE |
Framework | pytorch |
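The shift from an i.i.d. prior to a GP prior can be illustrated directly: score a sequence of latent codes under a GP with an RBF kernel over time, one independent GP per latent dimension. This toy sketch omits GPPVAE's structured-covariance and low-memory inference machinery:

```python
# GP prior over a latent time series: z_1..z_T are scored under
# N(0, K) per latent dimension instead of i.i.d. N(0, I).
import torch

T, z_dim = 20, 4
t = torch.arange(T, dtype=torch.float32).unsqueeze(1)
K = torch.exp(-0.5 * (t - t.T) ** 2 / 4.0) + 1e-3 * torch.eye(T)  # RBF + jitter

z = torch.randn(T, z_dim)                      # encoder outputs for T frames
dist = torch.distributions.MultivariateNormal(torch.zeros(T), K)
gp_log_prior = dist.log_prob(z.T).sum()        # one GP per latent dimension
print(gp_log_prior)
```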
On the insufficiency of existing momentum schemes for Stochastic Optimization
Title | On the insufficiency of existing momentum schemes for Stochastic Optimization |
Authors | Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, Sham M. Kakade |
Abstract | Momentum based stochastic gradient methods such as heavy ball (HB) and Nesterov’s accelerated gradient descent (NAG) method are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, “fast gradient” methods have provable improvements over gradient descent only for the deterministic case, where the gradients are exact. In the stochastic case, the popular explanation for their wide applicability is that these fast gradient methods partially mimic their exact-gradient counterparts, resulting in some practical gain. This work provides a counterpoint to this belief by proving that there exist simple problem instances where these methods cannot outperform SGD despite the best setting of their parameters. These negative problem instances are, in an informal sense, generic; they do not look like carefully constructed pathological instances. These results suggest (along with empirical evidence) that HB or NAG’s practical performance gains are a by-product of mini-batching. Furthermore, this work provides a viable (and provable) alternative, which, on the same set of problem instances, significantly improves over HB, NAG, and SGD’s performance. This algorithm, referred to as Accelerated Stochastic Gradient Descent (ASGD), is a simple-to-implement stochastic algorithm, based on a relatively less popular variant of Nesterov’s Acceleration. Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD. |
Tasks | Stochastic Optimization |
Published | 2018-03-15 |
URL | http://arxiv.org/abs/1803.05591v2 |
http://arxiv.org/pdf/1803.05591v2.pdf | |
PWC | https://paperswithcode.com/paper/on-the-insufficiency-of-existing-momentum |
Repo | https://github.com/COMP6248-Reproducability-Challenge/Insufficiency-momentum-schemes-for-Stochastic-Optimization |
Framework | none |
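For reference, the two updates under discussion, heavy ball and Nesterov's accelerated gradient, sketched on a 1-D quadratic (our toy example; the paper's negative instances are more subtle stochastic problems):

```python
# Heavy ball vs. Nesterov's accelerated gradient on f(w) = 0.5 * w^2.
grad = lambda w: w            # gradient of 0.5 * w^2
lr, beta = 0.1, 0.9
w_hb, v_hb = 5.0, 0.0         # HB: v <- beta*v - lr*grad(w); w <- w + v
w_nag, v_nag = 5.0, 0.0       # NAG: gradient evaluated at the look-ahead point

for _ in range(50):
    v_hb = beta * v_hb - lr * grad(w_hb)
    w_hb += v_hb
    v_nag = beta * v_nag - lr * grad(w_nag + beta * v_nag)
    w_nag += v_nag

print(f"HB: {w_hb:.4f}  NAG: {w_nag:.4f}")   # both spiral in toward 0
```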
Joint Event Detection and Description in Continuous Video Streams
Title | Joint Event Detection and Description in Continuous Video Streams |
Authors | Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, Kate Saenko |
Abstract | Dense video captioning is a fine-grained video understanding task that involves two sub-problems: localizing distinct events in a long video stream, and generating captions for the localized events. We propose the Joint Event Detection and Description Network (JEDDi-Net), which solves the dense video captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and generates their captions. Proposal features are extracted within each proposal segment through 3D Segment-of-Interest pooling from shared video feature encoding. In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by standard metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset. |
Tasks | Dense Video Captioning, Video Captioning, Video Understanding |
Published | 2018-02-28 |
URL | http://arxiv.org/abs/1802.10250v3 |
http://arxiv.org/pdf/1802.10250v3.pdf | |
PWC | https://paperswithcode.com/paper/joint-event-detection-and-description-in |
Repo | https://github.com/VisionLearningGroup/JEDDi-Net |
Framework | none |
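The temporal pooling step can be illustrated simply: given a shared feature encoding, each variable-length proposal is pooled to a fixed-size feature. The sketch below is a 1-D max-pooling simplification of the paper's 3-D Segment-of-Interest pooling:

```python
# Pool the shared video encoding over each proposal's [start, end) span
# to obtain one fixed-size feature per variable-length proposal.
import torch

features = torch.randn(512, 100)                 # (channels, encoded time steps)
proposals = [(0, 20), (15, 60), (70, 100)]       # variable-length segments

pooled = torch.stack([features[:, s:e].max(dim=1).values for s, e in proposals])
print(pooled.shape)                              # torch.Size([3, 512])
```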
Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge
Title | Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge |
Authors | Tianqi Liu, Bo Liu |
Abstract | This paper presents our 7th place solution to the second YouTube-8M video understanding competition, which challenges participants to build a constrained-size model to classify millions of YouTube videos into thousands of classes. Our final model consists of four single models aggregated into one TensorFlow graph. For each single model, we use the same network architecture as in the winning solution of the first YouTube-8M video understanding competition, namely Gated NetVLAD. We train the single models separately in TensorFlow’s default float32 precision, then replace the weights with float16 precision and ensemble them in the evaluation and inference stages, achieving a 48.5% compression rate without loss of precision. Our best model achieved 88.324% GAP on the private leaderboard. The code is publicly available at https://github.com/boliu61/youtube-8m |
Tasks | Video Understanding |
Published | 2018-08-21 |
URL | http://arxiv.org/abs/1808.06739v3 |
http://arxiv.org/pdf/1808.06739v3.pdf | |
PWC | https://paperswithcode.com/paper/constrained-size-tensorflow-models-for |
Repo | https://github.com/boliu61/youtube-8m |
Framework | tf |
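The compression trick is simple to demonstrate: store trained float32 weights as float16, roughly halving storage, and cast back at inference. The sketch below uses random stand-in weights:

```python
# float16 weight compression: halve storage, with a small, bounded cast error.
import numpy as np

w32 = np.random.randn(1000, 1000).astype(np.float32)   # stand-in trained weights
w16 = w32.astype(np.float16)                           # compressed storage

print(f"size: {w32.nbytes / 1e6:.1f} MB -> {w16.nbytes / 1e6:.1f} MB")
print(f"max abs cast-back error: {np.abs(w32 - w16.astype(np.float32)).max():.2e}")
```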
Discretely Relaxing Continuous Variables for tractable Variational Inference
Title | Discretely Relaxing Continuous Variables for tractable Variational Inference |
Authors | Trefor W. Evans, Prasanth B. Nair |
Abstract | We explore a new research direction in Bayesian variational inference with discrete latent variable priors where we exploit Kronecker matrix algebra for efficient and exact computations of the evidence lower bound (ELBO). The proposed “DIRECT” approach has several advantages over its predecessors: (i) it can exactly compute ELBO gradients (i.e. unbiased, zero-variance gradient estimates), eliminating the need for high-variance stochastic gradient estimators and enabling the use of quasi-Newton optimization methods; (ii) its training complexity is independent of the number of training points, permitting inference on large datasets; and (iii) its posterior samples consist of sparse and low-precision quantized integers which permit fast inference on hardware-limited devices. In addition, our DIRECT models can exactly compute statistical moments of the parameterized predictive posterior without relying on Monte Carlo sampling. The DIRECT approach is not practical for all likelihoods; however, we identify a popular model structure which is practical, and demonstrate accurate inference using latent variables discretized as extremely low-precision 4-bit quantized integers. While the ELBO computations considered in the numerical studies require over $10^{2352}$ log-likelihood evaluations, we train on datasets with over two-million points in just seconds. |
Tasks | |
Published | 2018-09-12 |
URL | http://arxiv.org/abs/1809.04279v3 |
http://arxiv.org/pdf/1809.04279v3.pdf | |
PWC | https://paperswithcode.com/paper/discretely-relaxing-continuous-variables-for |
Repo | https://github.com/treforevans/direct |
Framework | tf |
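The kind of Kronecker identity that makes such computations tractable is (A ⊗ B) vec(X) = vec(B X A^T), which avoids ever forming the Kronecker product. A NumPy check of the identity (our illustration; the paper's usage is more involved):

```python
# Kronecker "vec trick": multiply by A ⊗ B without materializing it.
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
X = rng.standard_normal((4, 3))

direct = np.kron(A, B) @ X.reshape(-1, order="F")   # forms the 12x12 product
fast = (B @ X @ A.T).reshape(-1, order="F")         # never forms A ⊗ B
print(np.allclose(direct, fast))                    # True
```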
Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages
Title | Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages |
Authors | Yuxi Li, Jiuwei Li, Weiyao Lin, Jianguo Li |
Abstract | Object detection has made great progress in the past few years along with the development of deep learning. However, most current object detection methods are resource-hungry, which hinders their wide deployment to many resource-restricted usages such as always-on devices, battery-powered low-end devices, etc. This paper considers the resource and accuracy trade-off for resource-restricted usages during designing the whole object detection framework. Based on the deeply supervised object detection (DSOD) framework, we propose Tiny-DSOD, dedicated to resource-restricted usages. Tiny-DSOD introduces two innovative and ultra-efficient architecture blocks: a depthwise dense block (DDB) based backbone and a depthwise feature-pyramid-network (D-FPN) based front-end. We conduct extensive experiments on three famous benchmarks (PASCAL VOC 2007, KITTI, and COCO), and compare Tiny-DSOD to state-of-the-art ultra-efficient object detection solutions such as Tiny-YOLO, MobileNet-SSD (v1 & v2), SqueezeDet, Pelee, etc. Results show that Tiny-DSOD outperforms these solutions in all three metrics (parameter size, FLOPs, accuracy) in each comparison. For instance, Tiny-DSOD achieves 72.1% mAP with only 0.95M parameters and 1.06B FLOPs, which is by far the state-of-the-art result with such a low resource requirement. |
Tasks | Object Detection |
Published | 2018-07-29 |
URL | http://arxiv.org/abs/1807.11013v1 |
http://arxiv.org/pdf/1807.11013v1.pdf | |
PWC | https://paperswithcode.com/paper/tiny-dsod-lightweight-object-detection-for |
Repo | https://github.com/lyxok1/Tiny-DSOD |
Framework | none |
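A rough sketch of a depthwise dense block in the spirit of the paper's DDB (sizes and exact layer order are our assumptions): a depthwise 3x3 followed by a pointwise 1x1 whose output is concatenated with the block input, DenseNet-style:

```python
# Depthwise dense block sketch: depthwise 3x3 + pointwise 1x1, with the
# output concatenated onto the input for dense connectivity.
import torch
import torch.nn as nn

class DDB(nn.Module):
    def __init__(self, in_ch, growth):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, growth, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.pointwise(self.depthwise(x)))
        return torch.cat([x, out], dim=1)     # dense connectivity

x = torch.randn(1, 32, 64, 64)
print(DDB(32, 16)(x).shape)                   # torch.Size([1, 48, 64, 64])
```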