July 29, 2019


Paper Group AWR 111

Natural Langevin Dynamics for Neural Networks. Residual Features and Unified Prediction Network for Single Stage Detection. DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents. Improving Video Generation for Multi-functional Applications. DeepCache: Principled Cache for Mobile Deep Vision. Where to put the Image in an Image …

Natural Langevin Dynamics for Neural Networks

Title Natural Langevin Dynamics for Neural Networks
Authors Gaétan Marceau-Caron, Yann Ollivier
Abstract One way to avoid overfitting in machine learning is to use model parameters distributed according to a Bayesian posterior given the data, rather than the maximum likelihood estimator. Stochastic gradient Langevin dynamics (SGLD) is one algorithm to approximate such Bayesian posteriors for large models and datasets. SGLD is a standard stochastic gradient descent to which is added a controlled amount of noise, specifically scaled so that the parameter converges in law to the posterior distribution [WT11, TTV16]. The posterior predictive distribution can be approximated by an ensemble of samples from the trajectory. Choice of the variance of the noise is known to impact the practical behavior of SGLD: for instance, noise should be smaller for sensitive parameter directions. Theoretically, it has been suggested to use the inverse Fisher information matrix of the model as the variance of the noise, since it is also the variance of the Bayesian posterior [PT13, AKW12, GC11]. But the Fisher matrix is costly to compute for large-dimensional models. Here we use the easily computed Fisher matrix approximations for deep neural networks from [MO16, Oll15]. The resulting natural Langevin dynamics combines the advantages of Amari’s natural gradient descent and Fisher-preconditioned Langevin dynamics for large neural networks. Small-scale experiments on MNIST show that Fisher matrix preconditioning brings SGLD close to dropout as a regularizing technique.
Tasks
Published 2017-12-04
URL http://arxiv.org/abs/1712.01076v1
PDF http://arxiv.org/pdf/1712.01076v1.pdf
PWC https://paperswithcode.com/paper/natural-langevin-dynamics-for-neural-networks
Repo https://github.com/gmarceaucaron/natural-langevin-dynamics-for-neural-networks
Framework none
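
The update rule described in the abstract is compact enough to sketch. Below is a minimal, hedged illustration of Fisher-preconditioned SGLD using a crude RMSProp-style diagonal Fisher estimate as a stand-in for the richer approximations of [MO16, Oll15]; the function name and hyperparameters are illustrative, not the authors' implementation.

```python
import numpy as np

def natural_sgld_step(theta, grad, sq_grad_ema, lr=1e-3, eps=1e-8, rng=np.random):
    """One preconditioned SGLD update: scale the gradient by an approximate
    inverse diagonal Fisher and inject Gaussian noise with matching
    covariance, so the iterates approximately sample the Bayesian posterior."""
    fisher_diag = sq_grad_ema + eps      # crude diagonal Fisher estimate
    precond = 1.0 / fisher_diag          # inverse-Fisher preconditioner
    noise = rng.normal(size=theta.shape) * np.sqrt(lr * precond)
    return theta - 0.5 * lr * precond * grad + noise

# Usage idea: keep sq_grad_ema as an exponential moving average of grad**2
# (RMSProp-style), and average predictions over the trajectory of theta to
# approximate the posterior predictive distribution.
```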

Residual Features and Unified Prediction Network for Single Stage Detection

Title Residual Features and Unified Prediction Network for Single Stage Detection
Authors Kyoungmin Lee, Jaeseok Choi, Jisoo Jeong, Nojun Kwak
Abstract Recently, a lot of single stage detectors using multi-scale features have been actively proposed. They are much faster than two stage detectors that use region proposal networks (RPN) without much degradation in the detection performances. However, the feature maps in the lower layers close to the input which are responsible for detecting small objects in a single stage detector have a problem of insufficient representation power because they are too shallow. There is also a structural contradiction that the feature maps have to deliver low-level information to next layers as well as contain high-level abstraction for prediction. In this paper, we propose a method to enrich the representation power of feature maps using Resblock and deconvolution layers. In addition, a unified prediction module is applied to generalize output results and boost earlier layers’ representation power for prediction. The proposed method enables more precise prediction, which achieved higher score than SSD on PASCAL VOC and MS COCO. In addition, it maintains the advantage of fast computation of a single stage detector, which requires much less computation than other detectors with similar performance. Code is available at https://github.com/kmlee-snu/run
Tasks
Published 2017-07-17
URL http://arxiv.org/abs/1707.05031v4
PDF http://arxiv.org/pdf/1707.05031v4.pdf
PWC https://paperswithcode.com/paper/residual-features-and-unified-prediction
Repo https://github.com/kmlee-snu/run
Framework caffe2
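
A hedged sketch of the feature-enrichment idea from the abstract: a shallow SSD-style feature map is strengthened with a residual block and fused with a deconvolved deeper map, after which a shared ("unified") prediction head can run on every scale. This is not the authors' code; the channel sizes and the assumption that the deep map has half the spatial resolution of the shallow one are illustrative.

```python
import torch
import torch.nn as nn

class EnrichedFeature(nn.Module):
    def __init__(self, shallow_ch=512, deep_ch=1024, out_ch=512):
        super().__init__()
        self.resblock = nn.Sequential(
            nn.Conv2d(shallow_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv2d(shallow_ch, out_ch, 1)
        # deconvolution brings high-level context back to the shallow resolution
        # (assumes the deep map is at half the shallow map's spatial size)
        self.deconv = nn.ConvTranspose2d(deep_ch, out_ch, 2, stride=2)

    def forward(self, shallow, deep):
        res = torch.relu(self.resblock(shallow) + self.skip(shallow))
        return torch.relu(res + self.deconv(deep))

# A shared prediction head would then run on every enriched feature map.
```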

DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents

Title DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents
Authors Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, Manmohan Chandraker
Abstract We introduce a Deep Stochastic IOC RNN Encoder-decoder framework, DESIRE, for the task of future predictions of multiple interacting agents in dynamic scenes. DESIRE effectively predicts future locations of objects in multiple scenes by 1) accounting for the multi-modal nature of the future prediction (i.e., given the same context, future may vary), 2) foreseeing the potential future outcomes and making a strategic prediction based on that, and 3) reasoning not only from the past motion history, but also from the scene context as well as the interactions among the agents. DESIRE achieves these in a single end-to-end trainable neural network model, while being computationally efficient. The model first obtains a diverse set of hypothetical future prediction samples employing a conditional variational autoencoder, which are ranked and refined by the following RNN scoring-regression module. Samples are scored by accounting for accumulated future rewards, which enables better long-term strategic decisions similar to IOC frameworks. An RNN scene context fusion module jointly captures past motion histories, the semantic scene context and interactions among multiple agents. A feedback mechanism iterates over the ranking and refinement to further boost the prediction accuracy. We evaluate our model on two publicly available datasets: KITTI and Stanford Drone Dataset. Our experiments show that the proposed model significantly improves the prediction accuracy compared to other baseline methods.
Tasks Future prediction, Trajectory Prediction
Published 2017-04-14
URL http://arxiv.org/abs/1704.04394v1
PDF http://arxiv.org/pdf/1704.04394v1.pdf
PWC https://paperswithcode.com/paper/desire-distant-future-prediction-in-dynamic
Repo https://github.com/yadrimz/DESIRE
Framework tf
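
A toy, hedged sketch of the first stage of the pipeline described in the abstract: a conditional VAE that emits a diverse set of hypothetical 2D future trajectories from an encoded past track. It is an illustrative simplification, not the DESIRE model; the layer sizes, the conditional-prior head, and the module name are assumptions, and the RNN scoring/refinement and scene-fusion modules are omitted.

```python
import torch
import torch.nn as nn

class TrajectoryCVAE(nn.Module):
    def __init__(self, future_len=12, hidden=64, z_dim=16):
        super().__init__()
        self.past_enc = nn.GRU(2, hidden, batch_first=True)   # encode (x, y) history
        self.prior_head = nn.Linear(hidden, 2 * z_dim)         # mu, logvar
        self.decoder = nn.Sequential(
            nn.Linear(hidden + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, future_len * 2),
        )
        self.future_len = future_len

    def sample(self, past, n_samples=20):
        # past: (batch, past_len, 2) -> n_samples diverse futures per agent
        _, h = self.past_enc(past)
        h = h.squeeze(0)                                        # (batch, hidden)
        mu, logvar = self.prior_head(h).chunk(2, dim=-1)
        futures = []
        for _ in range(n_samples):
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            out = self.decoder(torch.cat([h, z], dim=-1))
            futures.append(out.view(-1, self.future_len, 2))
        return torch.stack(futures, dim=1)   # (batch, n_samples, future_len, 2)

# DESIRE would then rank and refine these samples with an RNN scoring module
# that also accounts for scene context and neighbouring agents.
```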

Improving Video Generation for Multi-functional Applications

Title Improving Video Generation for Multi-functional Applications
Authors Bernhard Kratzwald, Zhiwu Huang, Danda Pani Paudel, Dinesh Acharya, Luc Van Gool
Abstract In this paper, we aim to improve the state-of-the-art video generative adversarial networks (GANs) with a view towards multi-functional applications. Our improved video GAN model does not separate foreground from background nor dynamic from static patterns, but learns to generate the entire video clip conjointly. Our model can thus be trained to generate - and learn from - a broad set of videos with no restriction. This is achieved by designing a robust one-stream video generation architecture with an extension of the state-of-the-art Wasserstein GAN framework that allows for better convergence. The experimental results show that our improved video GAN model outperforms state-of-the-art video generative models on multiple challenging datasets. Furthermore, we demonstrate the superiority of our model by successfully extending it to three challenging problems: video colorization, video inpainting, and future prediction. To the best of our knowledge, this is the first work using GANs to colorize and inpaint video clips.
Tasks Colorization, Future prediction, Video Generation, Video Inpainting
Published 2017-11-30
URL http://arxiv.org/abs/1711.11453v2
PDF http://arxiv.org/pdf/1711.11453v2.pdf
PWC https://paperswithcode.com/paper/improving-video-generation-for-multi
Repo https://github.com/ishandutta2007/Video-Generation-Landscape
Framework none
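
The abstract mentions an extension of the Wasserstein GAN framework for better convergence; one common way to stabilize a WGAN critic is a gradient penalty, sketched below for 5D video tensors (batch, channels, time, height, width). This is a generic, hedged illustration, not the authors' exact objective.

```python
import torch

def wgan_gp_penalty(critic, real, fake, lam=10.0):
    """Gradient penalty on random interpolates between real and fake clips."""
    eps = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(scores.sum(), interp, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Critic loss: critic(fake).mean() - critic(real).mean() + wgan_gp_penalty(...)
# Generator loss: -critic(fake).mean()
```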

DeepCache: Principled Cache for Mobile Deep Vision

Title DeepCache: Principled Cache for Mobile Deep Vision
Authors Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, Xuanzhe Liu
Abstract We present DeepCache, a principled cache design for deep learning inference in continuous mobile vision. DeepCache benefits model execution efficiency by exploiting temporal locality in input video streams. It addresses a key challenge raised by mobile vision: the cache must operate under video scene variation, while trading off among cacheability, overhead, and loss in model accuracy. At the input of a model, DeepCache discovers video temporal locality by exploiting the video’s internal structure, for which it borrows proven heuristics from video compression; into the model, DeepCache propagates regions of reusable results by exploiting the model’s internal structure. Notably, DeepCache eschews applying video heuristics to model internals which are not pixels but high-dimensional, difficult-to-interpret data. Our implementation of DeepCache works with unmodified deep learning models, requires zero developer’s manual effort, and is therefore immediately deployable on off-the-shelf mobile devices. Our experiments show that DeepCache saves inference execution time by 18% on average and up to 47%. DeepCache reduces system energy consumption by 20% on average.
Tasks Video Compression
Published 2017-12-01
URL https://arxiv.org/abs/1712.01670v5
PDF https://arxiv.org/pdf/1712.01670v5.pdf
PWC https://paperswithcode.com/paper/deepcache-principled-cache-for-mobile-deep
Repo https://github.com/xumengwei/DeepCache
Framework none
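
A hedged sketch of the cache-lookup idea: block matching, a standard video-compression heuristic, finds regions of the current frame that closely match the cached previous frame, so their downstream feature computations can be reused. Block size, search radius, and threshold below are illustrative; DeepCache's actual matching and in-model propagation are more sophisticated.

```python
import numpy as np

def reusable_blocks(prev_frame, cur_frame, block=16, search=4, thresh=8.0):
    """Return (y, x, dy, dx) for blocks of cur_frame well matched in prev_frame."""
    H, W = cur_frame.shape[:2]
    matches = []
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            cur = cur_frame[y:y + block, x:x + block].astype(np.float32)
            best = (np.inf, 0, 0)
            for dy in range(-search, search + 1):       # small motion search window
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= H - block and 0 <= xx <= W - block:
                        ref = prev_frame[yy:yy + block, xx:xx + block].astype(np.float32)
                        err = np.abs(cur - ref).mean()   # mean absolute difference
                        if err < best[0]:
                            best = (err, dy, dx)
            if best[0] < thresh:                         # close enough to reuse
                matches.append((y, x, best[1], best[2]))
    return matches
```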

Where to put the Image in an Image Caption Generator

Title Where to put the Image in an Image Caption Generator
Authors Marc Tanti, Albert Gatt, Kenneth P. Camilleri
Abstract When a recurrent neural network language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN (conditioning the language model by ‘injecting’ image features) or in a layer following the RNN (conditioning the language model by ‘merging’ image features). While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper we empirically show that it is not especially detrimental to performance whether one architecture is used or another. The merge architecture does have practical advantages, as conditioning by merging allows the RNN’s hidden state vector to shrink in size by up to four times. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN as that yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.
Tasks Language Modelling
Published 2017-03-27
URL http://arxiv.org/abs/1703.09137v2
PDF http://arxiv.org/pdf/1703.09137v2.pdf
PWC https://paperswithcode.com/paper/where-to-put-the-image-in-an-image-caption
Repo https://github.com/shan1322/Neural-Style-Captioning
Framework none
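
A hedged sketch contrasting the two conditioning options discussed in the abstract, shown here as a small "merge" captioner in which the RNN models only the word sequence and the image is fused in a later layer (which is why the hidden state can stay small). Dimensions and module names are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MergeCaptioner(nn.Module):
    """Merge architecture: the RNN is a pure language model; image features
    join the RNN output in a later layer."""
    def __init__(self, vocab, img_dim=2048, emb=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, img_feat, tokens):
        h_lang, _ = self.rnn(self.embed(tokens))              # (B, T, hidden)
        h_img = self.img_proj(img_feat).unsqueeze(1).expand_as(h_lang)
        return self.out(torch.cat([h_lang, h_img], dim=-1))   # per-step word logits

# An "inject" variant would instead feed img_proj(img_feat) into the RNN
# itself, e.g. concatenated to every word embedding or as the initial hidden
# state, which forces a larger hidden state to carry both modalities.
```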

A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

Title A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Authors P. Godard, G. Adda, M. Adda-Decker, J. Benjumea, L. Besacier, J. Cooper-Leavitt, G-N. Kouarata, L. Lamel, H. Maynard, M. Mueller, A. Rialland, S. Stueker, F. Yvon, M. Zanon-Boito
Abstract Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world’s languages do not have such resources or stable orthography. Systems constructed under these almost zero resource conditions are not only promising for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
Tasks
Published 2017-10-10
URL http://arxiv.org/abs/1710.03501v3
PDF http://arxiv.org/pdf/1710.03501v3.pdf
PWC https://paperswithcode.com/paper/a-very-low-resource-language-speech-corpus
Repo https://github.com/mzboito/mmboshi
Framework none

Data-driven Advice for Applying Machine Learning to Bioinformatics Problems

Title Data-driven Advice for Applying Machine Learning to Bioinformatics Problems
Authors Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik, Jason H. Moore
Abstract As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.
Tasks Model Selection
Published 2017-08-08
URL http://arxiv.org/abs/1708.05070v2
PDF http://arxiv.org/pdf/1708.05070v2.pdf
PWC https://paperswithcode.com/paper/data-driven-advice-for-applying-machine
Repo https://github.com/rhiever/sklearn-benchmarks
Framework none
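
The study's protocol of tuning each algorithm and comparing cross-validated scores across datasets can be illustrated with a small scikit-learn loop. The grids, the single example dataset, and the metric below are assumptions for illustration, not the paper's exact benchmark (the linked sklearn-benchmarks repository holds the real one).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "gradient_boosting": (GradientBoostingClassifier(), {"n_estimators": [100, 500]}),
    "random_forest": (RandomForestClassifier(), {"n_estimators": [100, 500]}),
    "logistic_regression": (LogisticRegression(max_iter=5000), {"C": [0.1, 1.0, 10.0]}),
}
for name, (model, grid) in candidates.items():
    # inner CV tunes hyperparameters, outer CV estimates performance
    tuned = GridSearchCV(model, grid, cv=5, scoring="balanced_accuracy")
    scores = cross_val_score(tuned, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```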

Single Shot Temporal Action Detection

Title Single Shot Temporal Action Detection
Authors Tianwei Lin, Xu Zhao, Zheng Shou
Abstract Temporal action detection is a very important yet challenging problem, since videos in real applications are usually long, untrimmed and contain multiple action instances. This problem requires not only recognizing action categories but also detecting start time and end time of each action instance. Many state-of-the-art methods adopt the “detection by classification” framework: first do proposal, and then classify proposals. The main drawback of this framework is that the boundaries of action instance proposals have been fixed during the classification step. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers to skip the proposal generation step via directly detecting action instances in untrimmed video. In pursuit of designing a particular SSAD network that can work effectively for temporal action detection, we empirically search for the best network architecture of SSAD due to lacking existing models that can be directly adopted. Moreover, we investigate input feature types and fusion strategies to further improve detection accuracy. We conduct extensive experiments on two challenging datasets: THUMOS 2014 and MEXaction2. When setting the Intersection-over-Union threshold to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems by increasing mAP from 19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.
Tasks Action Detection
Published 2017-10-17
URL http://arxiv.org/abs/1710.06236v1
PDF http://arxiv.org/pdf/1710.06236v1.pdf
PWC https://paperswithcode.com/paper/single-shot-temporal-action-detection
Repo https://github.com/hypjudy/Decouple-SSAD
Framework tf
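
A hedged sketch, in the spirit of SSAD, of a 1D-convolutional anchor head: temporal convolutions shrink the snippet-feature sequence, and each remaining time step predicts class scores, an overlap score, and center/width offsets for a few anchors. Layer widths, strides, and the number of anchors are illustrative assumptions, not the searched architecture.

```python
import torch
import torch.nn as nn

class TemporalAnchorHead(nn.Module):
    def __init__(self, in_ch=512, num_classes=20, anchors_per_step=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(in_ch, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # per anchor: class scores + overlap score + (center, width) offsets
        out_per_anchor = num_classes + 1 + 2
        self.pred = nn.Conv1d(256, anchors_per_step * out_per_anchor, 3, padding=1)
        self.apa, self.opa = anchors_per_step, out_per_anchor

    def forward(self, snippet_features):        # (B, in_ch, T)
        f = self.backbone(snippet_features)     # (B, 256, ~T/4)
        p = self.pred(f)                        # (B, apa * opa, ~T/4)
        B, _, t = p.shape
        return p.view(B, self.apa, self.opa, t)

# Usage: snippet_features would be appearance/motion features extracted per
# video snippet; decoding the offsets against the anchor grid yields
# (start, end, class) action proposals for NMS.
```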

Diversifying Support Vector Machines for Boosting using Kernel Perturbation: Applications to Class Imbalance and Small Disjuncts

Title Diversifying Support Vector Machines for Boosting using Kernel Perturbation: Applications to Class Imbalance and Small Disjuncts
Authors Shounak Datta, Sayak Nag, Sankha Subhra Mullick, Swagatam Das
Abstract The diversification (generating slightly varying separating discriminators) of Support Vector Machines (SVMs) for boosting has proven to be a challenge due to the strong learning nature of SVMs. Based on the insight that perturbing the SVM kernel may help in diversifying SVMs, we propose two kernel perturbation based boosting schemes where the kernel is modified in each round so as to increase the resolution of the kernel-induced Riemannian metric in the vicinity of the datapoints misclassified in the previous round. We propose a method for identifying the disjuncts in a dataset, dispelling the dependence on rule-based learning methods for identifying the disjuncts. We also present a new performance measure called Geometric Small Disjunct Index (GSDI) to quantify the performance on small disjuncts for balanced as well as class imbalanced datasets. Experimental comparison with a variety of state-of-the-art algorithms is carried out using the best classifiers of each type selected by a new approach inspired by multi-criteria decision making. The proposed method is found to outperform the contending state-of-the-art methods on different datasets (ranging from mildly imbalanced to highly imbalanced and characterized by varying number of disjuncts) in terms of three different performance indices (including the proposed GSDI).
Tasks Decision Making
Published 2017-12-22
URL http://arxiv.org/abs/1712.08493v1
PDF http://arxiv.org/pdf/1712.08493v1.pdf
PWC https://paperswithcode.com/paper/diversifying-support-vector-machines-for
Repo https://github.com/Shounak-D/Kernel-Perturbation-based-Boosting-of-SVMs
Framework none
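
One concrete way to "increase the resolution of the kernel-induced Riemannian metric" near previously misclassified points is a conformal kernel transformation, sketched below for a single boosting round with an RBF kernel. The exact form of the conformal factor and the boosting bookkeeping here are illustrative assumptions, not the paper's two schemes.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def conformal_factor(X, misclassified, tau=1.0):
    """Larger near previously misclassified points, close to 1 elsewhere."""
    if len(misclassified) == 0:
        return np.ones(len(X))
    d2 = ((X[:, None, :] - misclassified[None, :, :]) ** 2).sum(axis=-1)
    return 1.0 + np.exp(-d2 / (2 * tau ** 2)).sum(axis=1)

def boosting_round(X_train, y_train, prev_miss, gamma=0.5):
    """Fit one SVM on a conformally perturbed RBF kernel and return the new
    set of misclassified training points for the next round."""
    c = conformal_factor(X_train, prev_miss)
    K = c[:, None] * c[None, :] * rbf_kernel(X_train, X_train, gamma=gamma)
    svm = SVC(kernel="precomputed").fit(K, y_train)
    miss = X_train[svm.predict(K) != y_train]
    return svm, miss

# First round: prev_miss = np.empty((0, X_train.shape[1])). The final ensemble
# would combine the per-round SVMs, e.g. by weighted voting.
```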

Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints

Title Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints
Authors Alexander Richard, Hilde Kuehne, Juergen Gall
Abstract Action detection and temporal segmentation of actions in videos are topics of increasing interest. While fully supervised systems have gained much attention lately, full annotation of each action within the video is costly and impractical for large amounts of video data. Thus, weakly supervised action detection and temporal segmentation methods are of great importance. While most works in this area assume an ordered sequence of occurring actions to be given, our approach only uses a set of actions. Such action sets provide much less supervision since neither action ordering nor the number of action occurrences are known. In exchange, they can be easily obtained, for instance, from meta-tags, while ordered sequences still require human annotation. We introduce a system that automatically learns to temporally segment and label actions in a video, where the only supervision used is action sets. An evaluation on three datasets shows that our method still achieves good results although the amount of supervision is significantly smaller than for other related methods.
Tasks Action Detection, action segmentation
Published 2017-06-02
URL http://arxiv.org/abs/1706.00699v2
PDF http://arxiv.org/pdf/1706.00699v2.pdf
PWC https://paperswithcode.com/paper/action-sets-weakly-supervised-action
Repo https://github.com/alexanderrichard/action-sets
Framework pytorch

Incremental Tube Construction for Human Action Detection

Title Incremental Tube Construction for Human Action Detection
Authors Harkirat Singh Behl, Michael Sapienza, Gurkirt Singh, Suman Saha, Fabio Cuzzolin, Philip H. S. Torr
Abstract Current state-of-the-art action detection systems are tailored for offline batch-processing applications. However, for online applications like human-robot interaction, current systems fall short, either because they only detect one action per video, or because they assume that the entire video is available ahead of time. In this work, we introduce a real-time and online joint-labelling and association algorithm for action detection that can incrementally construct space-time action tubes on the most challenging action videos in which different action categories occur concurrently. In contrast to previous methods, we solve the detection-window association and action labelling problems jointly in a single pass. We demonstrate superior online association accuracy and speed (2.2ms per frame) as compared to the current state-of-the-art offline systems. We further demonstrate that the entire action detection pipeline can easily be made to work effectively in real-time using our action tube construction algorithm.
Tasks Action Detection
Published 2017-04-05
URL http://arxiv.org/abs/1704.01358v2
PDF http://arxiv.org/pdf/1704.01358v2.pdf
PWC https://paperswithcode.com/paper/incremental-tube-construction-for-human
Repo https://github.com/harkiratbehl/OJLA
Framework none
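
A hedged sketch of the single-pass, online flavour of tube construction: at each new frame, live tubes greedily absorb the best-overlapping detection of the same class, and leftovers start new tubes. The actual method solves the association and labelling problems jointly in one pass; the function and field names below are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def extend_tubes(tubes, detections, min_iou=0.3):
    """Single online pass: each live tube takes its best-matching unassigned
    detection of the same class; unmatched detections start new tubes."""
    used = set()
    for tube in tubes:
        best, best_ov = None, min_iou
        for i, (box, label) in enumerate(detections):
            if i in used or label != tube["label"]:
                continue
            ov = iou(tube["boxes"][-1], box)
            if ov > best_ov:
                best, best_ov = i, ov
        if best is not None:
            used.add(best)
            tube["boxes"].append(detections[best][0])
    for i, (box, label) in enumerate(detections):
        if i not in used:
            tubes.append({"label": label, "boxes": [box]})
    return tubes
```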

Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network

Title Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network
Authors Yuxin Peng, Jinwei Qi, Yuxin Yuan
Abstract Nowadays, cross-modal retrieval plays an indispensable role to flexibly find information across different modalities of data. Effectively measuring the similarity between different modalities of data is the key of cross-modal retrieval. Different modalities such as image and text have imbalanced and complementary relationships, which contain unequal amount of information when describing the same semantics. For example, images often contain more details that cannot be demonstrated by textual descriptions and vice versa. Existing works based on Deep Neural Network (DNN) mostly construct one common space for different modalities to find the latent alignments between them, which lose their exclusive modality-specific characteristics. Different from the existing works, we propose modality-specific cross-modal similarity measurement (MCSM) approach by constructing independent semantic space for each modality, which adopts end-to-end framework to directly generate modality-specific cross-modal similarity without explicit common representation. For each semantic space, modality-specific characteristics within one modality are fully exploited by recurrent attention network, while the data of another modality is projected into this space with attention based joint embedding to utilize the learned attention weights for guiding the fine-grained cross-modal correlation learning, which can capture the imbalanced and complementary relationships between different modalities. Finally, the complementarity between the semantic spaces for different modalities is explored by adaptive fusion of the modality-specific cross-modal similarities to perform cross-modal retrieval. Experiments on the widely-used Wikipedia and Pascal Sentence datasets as well as our constructed large-scale XMediaNet dataset verify the effectiveness of our proposed approach, outperforming 9 state-of-the-art methods.
Tasks Cross-Modal Retrieval
Published 2017-08-16
URL http://arxiv.org/abs/1708.04776v1
PDF http://arxiv.org/pdf/1708.04776v1.pdf
PWC https://paperswithcode.com/paper/modality-specific-cross-modal-similarity
Repo https://github.com/PKU-ICST-MIPL/MCSM_TIP2018
Framework torch
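
A hedged sketch of one half of the measurement: an image-specific semantic space built with a recurrent encoder and attention over image regions, into which the text feature is projected so similarity can be computed directly, with no shared common space. Dimensions and the attention form are assumptions; the full MCSM adds a symmetric text-specific space and adaptively fuses the two similarities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageSpecificSimilarity(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, space=512):
        super().__init__()
        self.img_enc = nn.GRU(img_dim, space, batch_first=True)  # recurrent pass over regions
        self.attn = nn.Linear(space, 1)                           # region attention weights
        self.txt_proj = nn.Linear(txt_dim, space)                 # project text into image space

    def forward(self, img_regions, txt_feat):   # (B, R, img_dim), (B, txt_dim)
        h, _ = self.img_enc(img_regions)                          # (B, R, space)
        w = F.softmax(self.attn(h), dim=1)                        # attention over regions
        img_emb = (w * h).sum(dim=1)                              # attended image embedding
        txt_emb = self.txt_proj(txt_feat)
        return F.cosine_similarity(img_emb, txt_emb, dim=-1)      # modality-specific similarity
```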

Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning

Title Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning
Authors Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Shin Ishii
Abstract We propose a new regularization method based on virtual adversarial loss: a new measure of local smoothness of the conditional label distribution given input. Virtual adversarial loss is defined as the robustness of the conditional label distribution around each input data point against local perturbation. Unlike adversarial training, our method defines the adversarial direction without label information and is hence applicable to semi-supervised learning. Because the directions in which we smooth the model are only “virtually” adversarial, we call our method virtual adversarial training (VAT). The computational cost of VAT is relatively low. For neural networks, the approximated gradient of virtual adversarial loss can be computed with no more than two pairs of forward- and back-propagations. In our experiments, we applied VAT to supervised and semi-supervised learning tasks on multiple benchmark datasets. With a simple enhancement of the algorithm based on the entropy minimization principle, our VAT achieves state-of-the-art performance for semi-supervised learning tasks on SVHN and CIFAR-10.
Tasks Semi-Supervised Image Classification
Published 2017-04-13
URL http://arxiv.org/abs/1704.03976v2
PDF http://arxiv.org/pdf/1704.03976v2.pdf
PWC https://paperswithcode.com/paper/virtual-adversarial-training-a-regularization
Repo https://github.com/takerum/vat_tf
Framework tf
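
The VAT loss is simple enough to sketch: a single power-iteration step estimates the virtually adversarial direction (no labels needed, roughly one extra forward/backward pass), and the loss penalizes the KL divergence between predictions at the input and at the perturbed input. The hyperparameters below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=8.0):
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)                 # current predictions (treated as constant)
    d = torch.randn_like(x)                            # random initial direction
    d = xi * d / d.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1)))
    d.requires_grad_(True)
    p_hat = F.log_softmax(model(x + d), dim=1)
    adv_dist = F.kl_div(p_hat, p, reduction="batchmean")
    grad, = torch.autograd.grad(adv_dist, d)           # one power-iteration step
    r_adv = eps * grad / grad.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1)))
    p_adv = F.log_softmax(model(x + r_adv.detach()), dim=1)
    return F.kl_div(p_adv, p, reduction="batchmean")   # local distributional smoothness

# Total loss: supervised cross-entropy on labelled data + alpha * vat_loss on
# all (labelled and unlabelled) data, optionally plus an entropy-minimisation term.
```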

Design and Analysis of the NIPS 2016 Review Process

Title Design and Analysis of the NIPS 2016 Review Process
Authors Nihar B. Shah, Behzad Tabibian, Krikamol Muandet, Isabelle Guyon, Ulrike von Luxburg
Abstract Neural Information Processing Systems (NIPS) is a top-tier annual conference in machine learning. The 2016 edition of the conference comprised more than 2,400 paper submissions, 3,000 reviewers, and 8,000 attendees. This represents a growth of nearly 40% in terms of submissions, 96% in terms of reviewers, and over 100% in terms of attendees as compared to the previous year. The massive scale as well as rapid growth of the conference calls for a thorough quality assessment of the peer-review process and novel means of improvement. In this paper, we analyze several aspects of the data collected during the review process, including an experiment investigating the efficacy of collecting ordinal rankings from reviewers. Our goal is to check the soundness of the review process, and provide insights that may be useful in the design of the review process of subsequent conferences.
Tasks
Published 2017-08-31
URL http://arxiv.org/abs/1708.09794v2
PDF http://arxiv.org/pdf/1708.09794v2.pdf
PWC https://paperswithcode.com/paper/design-and-analysis-of-the-nips-2016-review
Repo https://github.com/btabibian/conference-analysis
Framework none