October 21, 2019

Paper Group AWR 125

Let’s Dance: Learning From Online Dance Videos

Title Let’s Dance: Learning From Online Dance Videos
Authors Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, Irfan Essa
Abstract In recent years, deep neural network approaches have naturally extended to the video domain, in their simplest case by aggregating per-frame classifications as a baseline for action recognition. A majority of the work in this area extends from the imaging domain, leading to visual-feature-heavy approaches on temporal data. To address this issue, we introduce “Let’s Dance”, a 1,000-video dataset (and growing) comprising 10 visually overlapping dance categories that require motion for their classification. We stress the importance of human motion as a key distinguisher in our work given that, as we show, visual information is not sufficient to classify motion-heavy categories. We compare our dataset’s performance using image-based techniques with UCF-101 and demonstrate this inherent difficulty. We present a comparison of numerous state-of-the-art techniques on our dataset using three different representations (video, optical flow and multi-person pose data) in order to analyze these approaches. We discuss the motion parameterization of each of them and their value in learning to categorize online dance videos. Lastly, we release this dataset (and its three representations) for the research community to use.
Tasks Optical Flow Estimation, Temporal Action Localization
Published 2018-01-23
URL http://arxiv.org/abs/1801.07388v1
PDF http://arxiv.org/pdf/1801.07388v1.pdf
PWC https://paperswithcode.com/paper/lets-dance-learning-from-online-dance-videos
Repo https://github.com/xrenaa/Human-Motion-Analysis-with-Deep-Metric-Learning
Framework pytorch
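
The simplest baseline mentioned in the abstract, aggregating per-frame classifications, is easy to make concrete. A minimal PyTorch sketch of that baseline, assuming an image backbone and the 10 dance classes from the abstract (the ResNet-18 backbone and average pooling are illustrative choices, not the authors' exact setup):

```python
import torch
import torch.nn as nn
from torchvision import models

class PerFrameAveragingBaseline(nn.Module):
    """Classify each frame independently, then average the per-frame logits."""
    def __init__(self, num_classes=10):
        super().__init__()
        backbone = models.resnet18(weights=None)     # any image CNN would do here
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.frame_classifier = backbone

    def forward(self, clip):                         # clip: (batch, frames, 3, H, W)
        b, t, c, h, w = clip.shape
        logits = self.frame_classifier(clip.reshape(b * t, c, h, w))
        return logits.reshape(b, t, -1).mean(dim=1)  # temporal average pooling

# toy usage: a batch of 2 clips with 8 frames each
model = PerFrameAveragingBaseline()
print(model(torch.randn(2, 8, 3, 224, 224)).shape)   # torch.Size([2, 10])
```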

Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding

Title Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Authors Chenxu Luo, Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, Ram Nevatia, Alan Yuille
Abstract Learning to estimate 3D geometry in a single frame and optical flow from consecutive frames by watching unlabeled videos via deep convolutional networks has made significant progress recently. Current state-of-the-art (SoTA) methods treat the two tasks independently. One typical assumption of existing depth estimation methods is that the scene contains no independently moving objects, while object motion could be easily modeled using optical flow. In this paper, we propose to address the two tasks as a whole, i.e. to jointly understand per-pixel 3D geometry and motion. This eliminates the need for a static-scene assumption and enforces the inherent geometrical consistency during the learning process, yielding significantly improved results for both tasks. We call our method “Every Pixel Counts++” or “EPC++”. Specifically, during training, given two consecutive frames from a video, we adopt three parallel networks to predict the camera motion (MotionNet), dense depth map (DepthNet), and per-pixel optical flow between the two frames (OptFlowNet), respectively. The three types of information are fed into a holistic 3D motion parser (HMP), and the per-pixel 3D motion of both the rigid background and moving objects is disentangled and recovered. Comprehensive experiments were conducted on datasets with different scenes, including driving scenarios (KITTI 2012 and KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D) and synthetic animation (MPI Sintel dataset). Performance on the five tasks of depth estimation, optical flow estimation, odometry, moving object segmentation and scene flow estimation shows that our approach outperforms other SoTA methods. Code will be available at: https://github.com/chenxuluo/EPC.
Tasks Depth Estimation, Optical Flow Estimation, Scene Flow Estimation, Semantic Segmentation
Published 2018-10-14
URL https://arxiv.org/abs/1810.06125v2
PDF https://arxiv.org/pdf/1810.06125v2.pdf
PWC https://paperswithcode.com/paper/every-pixel-counts-joint-learning-of-geometry
Repo https://github.com/chenxuluo/EPC
Framework tf
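
The geometric core described in the abstract, the holistic 3D motion parser, separates the flow explained by camera motion from the residual motion of independently moving objects. A rough NumPy sketch of that decomposition, assuming a pinhole intrinsic matrix K and a 4x4 camera pose T; the function names and the simple subtraction are illustrative, not the paper's exact parser:

```python
import numpy as np

def rigid_flow(depth, K, T):
    """Flow induced by camera motion alone: back-project, transform, re-project."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)      # 3D points in the camera frame
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    proj = K @ (T @ cam_h)[:3]                               # re-project after the motion
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)
    return (uv - pix[:2]).reshape(2, h, w)

def object_flow(full_flow, depth, K, T):
    """Residual flow attributed to independently moving objects."""
    return full_flow - rigid_flow(depth, K, T)

# toy usage with identity camera motion: the residual equals the observed flow
K = np.array([[500., 0., 64.], [0., 500., 48.], [0., 0., 1.]])
T = np.eye(4)
depth = np.ones((96, 128)) * 5.0
flow = np.random.randn(2, 96, 128) * 0.1
print(np.allclose(object_flow(flow, depth, K, T), flow))     # True
```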

Movie Recommendation System using Sentiment Analysis from Microblogging Data

Title Movie Recommendation System using Sentiment Analysis from Microblogging Data
Authors Sudhanshu Kumar, Shirsendu Sukanta Halder, Kanjar De, Partha Pratim Roy
Abstract Recommendation systems are important intelligent systems that play a vital role in providing selective information to users. Traditional approaches in recommendation systems include collaborative filtering and content-based filtering. However, these approaches have certain limitations, such as the need for prior user history and habits to perform the recommendation task. In order to reduce the effect of such dependencies, this paper proposes a hybrid recommendation system which combines collaborative filtering and content-based filtering with sentiment analysis of movie tweets. The movie tweets are collected from microblogging websites to understand current trends and user responses to a movie. Experiments conducted on a public database produce promising results.
Tasks Recommendation Systems, Sentiment Analysis
Published 2018-11-27
URL https://arxiv.org/abs/1811.10804v1
PDF https://arxiv.org/pdf/1811.10804v1.pdf
PWC https://paperswithcode.com/paper/movie-recommendation-system-using-sentiment
Repo https://github.com/3ZadeSSG/ContentBased-Movie-Recommendation-using-Sentiment-Analysis
Framework none
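
A minimal sketch of the hybrid scoring idea: blend a collaborative-filtering score, a content-based score, and an aggregate tweet-sentiment score into one ranking value. The weights and the assumption that per-tweet polarities in [-1, 1] come from some external sentiment analyzer are illustrative choices, not the authors' exact pipeline:

```python
import numpy as np

def hybrid_score(cf_score, content_score, tweet_polarities,
                 w_cf=0.5, w_content=0.3, w_sent=0.2):
    """Blend collaborative, content-based and sentiment signals into one score.

    cf_score, content_score : predicted ratings rescaled to [0, 1]
    tweet_polarities        : per-tweet sentiment polarities in [-1, 1]
                              (from any sentiment analyzer, e.g. VADER)
    """
    sentiment = (np.mean(tweet_polarities) + 1.0) / 2.0   # map [-1, 1] -> [0, 1]
    return w_cf * cf_score + w_content * content_score + w_sent * sentiment

# toy usage: a movie with good CF/content scores and mildly positive tweets
print(hybrid_score(0.8, 0.7, [0.4, 0.1, -0.2, 0.6]))      # ~0.73
```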

An Introductory Survey on Attention Mechanisms in NLP Problems

Title An Introductory Survey on Attention Mechanisms in NLP Problems
Authors Dichao Hu
Abstract First derived from human intuition and later adapted to machine translation for automatic token alignment, the attention mechanism, a simple method for encoding sequence data based on the importance score assigned to each element, has been widely applied to, and has attained significant improvements in, various natural language processing tasks, including sentiment classification, text summarization, question answering, dependency parsing, etc. In this paper, we survey recent works and give an introductory summary of the attention mechanism in different NLP problems, aiming to provide our readers with basic knowledge of this widely used method, discuss its different variants for different tasks, explore its association with other techniques in machine learning, and examine methods for evaluating its performance.
Tasks Dependency Parsing, Machine Translation, Question Answering, Sentiment Analysis, Text Summarization
Published 2018-11-12
URL http://arxiv.org/abs/1811.05544v1
PDF http://arxiv.org/pdf/1811.05544v1.pdf
PWC https://paperswithcode.com/paper/an-introductory-survey-on-attention
Repo https://github.com/wangyiyan3318/Attention-Mechanism
Framework tf
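
As a concrete reference point for the mechanism being surveyed, here is standard scaled dot-product attention in NumPy (one common variant among those the survey covers; additive and other forms differ only in how the scores are computed):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; outputs are importance-weighted sums of values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_queries, n_keys) importance scores
    weights = softmax(scores, axis=-1)     # normalized attention distribution
    return weights @ V, weights

# toy usage: 2 queries over a sequence of 4 elements with 8-dim features
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))           # (2, 8) [1. 1.]
```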

Near-lossless Binarization of Word Embeddings

Title Near-lossless Binarization of Word Embeddings
Authors Julien Tissier, Christophe Gravier, Amaury Habrard
Abstract Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performance. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of memory and computation, which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows the original vectors to be reconstructed from the binary ones. Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of ~2% in accuracy while the vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
Tasks Semantic Similarity, Semantic Textual Similarity, Sentiment Analysis, Text Classification, Word Embeddings
Published 2018-03-24
URL http://arxiv.org/abs/1803.09065v3
PDF http://arxiv.org/pdf/1803.09065v3.pdf
PWC https://paperswithcode.com/paper/near-lossless-binarization-of-word-embeddings
Repo https://github.com/tca19/near-lossless-binarization
Framework none
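
The payoff described in the abstract is that binary codes make top-k similarity search cheap, because Hamming distance replaces cosine similarity. A small NumPy sketch of that use; the codes here come from a random sign projection purely for illustration, whereas the paper learns the transformation with an autoencoder:

```python
import numpy as np

def binarize(embeddings, n_bits=256, seed=0):
    """Project real-valued embeddings and keep only the sign (1 bit per dimension)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(embeddings.shape[1], n_bits))   # the paper learns W instead
    return embeddings @ W > 0                            # boolean code, n_bits per word

def hamming_topk(query_code, codes, k=5):
    """Nearest words by Hamming distance (bitwise disagreement count)."""
    dists = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(dists)[:k]

# toy usage: 10k fake 300-d word vectors compressed to 256-bit codes
vecs = np.random.default_rng(1).normal(size=(10_000, 300))
codes = binarize(vecs)
print(codes.shape, hamming_topk(codes[42], codes))       # (10000, 256) [42 ...]
```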

Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks

Title Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks
Authors Yau-Shian Wang, Hung-Yi Lee
Abstract Auto-encoders compress input data into a latent-space representation and reconstruct the original data from the representation. This latent representation is not easily interpreted by humans. In this paper, we propose training an auto-encoder that encodes input text into human-readable sentences, and unpaired abstractive summarization is thereby achieved. The auto-encoder is composed of a generator and a reconstructor. The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output. To make the generator output human-readable, a discriminator restricts the output of the generator to resemble human-written sentences. By taking the generator output as the summary of the input text, abstractive summarization is achieved without document-summary pairs as training data. Promising results are shown on both English and Chinese corpora.
Tasks Abstractive Text Summarization
Published 2018-10-05
URL http://arxiv.org/abs/1810.02851v1
PDF http://arxiv.org/pdf/1810.02851v1.pdf
PWC https://paperswithcode.com/paper/learning-to-encode-text-as-human-readable
Repo https://github.com/yaushian/Unparalleled-Text-Summarization-using-GAN
Framework tf
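
A rough skeleton of the generator/reconstructor/discriminator wiring described in the abstract. To keep the sketch end-to-end differentiable it uses a Gumbel-softmax relaxation for the discrete summary tokens, which is a simplification of the paper's training procedure; all sizes and module names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, E, H, SUM_LEN = 5000, 64, 128, 12        # vocab, embedding, hidden sizes; summary length

embed   = nn.Embedding(V, E)
enc_gru = nn.GRU(E, H, batch_first=True)    # generator: encode the document ...
gen_out = nn.Linear(H, V * SUM_LEN)         # ... and emit logits for a short summary
rec_enc = nn.GRU(E, H, batch_first=True)    # reconstructor: read the summary ...
rec_dec = nn.GRU(E, H, batch_first=True)    # ... and regenerate the original text
rec_out = nn.Linear(H, V)
dis_gru = nn.GRU(E, H, batch_first=True)    # discriminator: does the summary look
dis_out = nn.Linear(H, 1)                   # like a human-written sentence?

def generator_loss(doc_tokens):             # doc_tokens: (batch, doc_len) ints
    b = doc_tokens.size(0)
    _, h = enc_gru(embed(doc_tokens))
    summary_logits = gen_out(h[-1]).view(b, SUM_LEN, V)
    soft_summary = F.gumbel_softmax(summary_logits, tau=1.0, hard=True)
    summary_emb = soft_summary @ embed.weight          # differentiable "embedding lookup"

    _, sh = rec_enc(summary_emb)                       # reconstruct the input from the summary
    dec_in, dec_tgt = doc_tokens[:, :-1], doc_tokens[:, 1:]
    dec_h, _ = rec_dec(embed(dec_in), sh)
    rec_loss = F.cross_entropy(rec_out(dec_h).reshape(-1, V), dec_tgt.reshape(-1))

    _, dh = dis_gru(summary_emb)                       # generator wants "looks human-written"
    adv_loss = F.binary_cross_entropy_with_logits(dis_out(dh[-1]), torch.ones(b, 1))
    return rec_loss + 0.1 * adv_loss                   # weighting is an arbitrary choice

print(float(generator_loss(torch.randint(0, V, (2, 30)))))
```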

Correlation Flow: Robust Optical Flow Using Kernel Cross-Correlators

Title Correlation Flow: Robust Optical Flow Using Kernel Cross-Correlators
Authors Chen Wang, Tete Ji, Thien-Minh Nguyen, Lihua Xie
Abstract Robust velocity and position estimation is crucial for autonomous robot navigation. Optical-flow-based methods for autonomous navigation have been receiving increasing attention in tandem with the development of micro unmanned aerial vehicles. This paper proposes a kernel cross-correlator (KCC) based algorithm to determine optical flow using a monocular camera, named correlation flow (CF). Correlation flow is able to provide reliable and accurate velocity estimation and is robust to motion blur. In addition, it can also estimate the altitude velocity and yaw rate, which are not available from traditional methods. Autonomous flight tests on a quadcopter show that correlation flow can provide robust trajectory estimation with very low processing power. The source code is released based on the ROS framework.
Tasks Autonomous Navigation, Optical Flow Estimation, Robot Navigation
Published 2018-02-20
URL http://arxiv.org/abs/1802.07078v2
PDF http://arxiv.org/pdf/1802.07078v2.pdf
PWC https://paperswithcode.com/paper/correlation-flow-robust-optical-flow-using
Repo https://github.com/wang-chen/KCC
Framework none
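
Correlation-based displacement estimation is the core idea: the shift that maximizes the cross-correlation between two frames gives the dominant image motion. A NumPy sketch using plain FFT cross-correlation; the paper's kernel cross-correlator generalizes this idea and, per the abstract, also recovers altitude velocity and yaw rate, which is not reproduced here:

```python
import numpy as np

def correlation_shift(frame_a, frame_b):
    """Estimate the (dy, dx) translation of frame_a relative to frame_b via cross-correlation."""
    Fa, Fb = np.fft.fft2(frame_a), np.fft.fft2(frame_b)
    corr = np.fft.ifft2(Fa * np.conj(Fb)).real        # correlation surface
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = frame_a.shape                              # wrap indices to signed shifts
    return (dy if dy <= h // 2 else dy - h,
            dx if dx <= w // 2 else dx - w)

# toy usage: shift a random frame by (3, -5) pixels and recover the shift
rng = np.random.default_rng(0)
frame = rng.normal(size=(96, 128))
moved = np.roll(frame, shift=(3, -5), axis=(0, 1))
print(correlation_shift(moved, frame))                # (3, -5)
```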

SurfConv: Bridging 3D and 2D Convolution for RGBD Images

Title SurfConv: Bridging 3D and 2D Convolution for RGBD Images
Authors Hang Chu, Wei-Chiu Ma, Kaustav Kundu, Raquel Urtasun, Sanja Fidler
Abstract We tackle the problem of using 3D information in convolutional neural networks for down-stream recognition tasks. Using depth as an additional channel alongside the RGB input has the scale-variance problem present in image-convolution-based approaches. On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space; the occupied region consists of only the surface visible to the sensor. Instead, we propose SurfConv, which “slides” compact 2D filters along the visible 3D surface. SurfConv is formulated as a simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth Discretization (D4) scheme. We demonstrate the effectiveness of our method on indoor and outdoor 3D semantic segmentation datasets. Our method achieves state-of-the-art performance with less than 30% of the parameters used by 3D-convolution-based approaches.
Tasks 3D Semantic Segmentation, Semantic Segmentation
Published 2018-12-04
URL http://arxiv.org/abs/1812.01519v1
PDF http://arxiv.org/pdf/1812.01519v1.pdf
PWC https://paperswithcode.com/paper/surfconv-bridging-3d-and-2d-convolution-for
Repo https://github.com/chuhang/SurfConv
Framework pytorch
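
A condensed sketch of the depth-aware multi-scale idea described in the abstract: discretize depth into a few ranges, process each range with the same 2D filters but at a range-dependent image scale, then merge. The bin edges, scales and merge rule below are illustrative, not the paper's D4 scheme:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareMultiScaleConv(nn.Module):
    """Shared 2D filters applied at a coarser scale for nearer (larger-looking) surfaces."""
    def __init__(self, in_ch, out_ch,
                 depth_edges=(0.0, 2.0, 5.0, float("inf")), scales=(0.25, 0.5, 1.0)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)   # shared weights
        self.edges, self.scales = depth_edges, scales

    def forward(self, rgb, depth):                     # rgb: (B,C,H,W), depth: (B,1,H,W)
        out = rgb.new_zeros(rgb.size(0), self.conv.out_channels, *rgb.shape[2:])
        for (lo, hi), s in zip(zip(self.edges[:-1], self.edges[1:]), self.scales):
            mask = ((depth >= lo) & (depth < hi)).float()
            x = F.interpolate(rgb * mask, scale_factor=s, mode="bilinear",
                              align_corners=False, recompute_scale_factor=False)
            y = F.interpolate(self.conv(x), size=rgb.shape[2:], mode="bilinear",
                              align_corners=False)
            out = out + y * mask                       # keep each response in its depth band
        return out

layer = DepthAwareMultiScaleConv(3, 16)
feat = layer(torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64) * 6)
print(feat.shape)                                      # torch.Size([2, 16, 64, 64])
```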

Learning to Guide Decoding for Image Captioning

Title Learning to Guide Decoding for Image Captioning
Authors Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, Wei Liu
Abstract Recently, much progress has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called a guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at each time step. The guiding network can be plugged into the current encoder-decoder framework and trained in an end-to-end manner. Hence, the guiding vector can be adaptively learned according to the signal from the decoder, enabling it to embed information from both the image and the language. Additionally, discriminative supervision can be employed to further improve the quality of guidance. The advantages of our proposed approach are verified by experiments carried out on the MS COCO dataset.
Tasks Image Captioning
Published 2018-04-03
URL http://arxiv.org/abs/1804.00887v1
PDF http://arxiv.org/pdf/1804.00887v1.pdf
PWC https://paperswithcode.com/paper/learning-to-guide-decoding-for-image
Repo https://github.com/HaleyPei/Learning-to-Guide-Decoding-for-Image-Captioning
Framework pytorch
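
The mechanism in the abstract is simple to express: a guiding network maps image features to a guidance vector, and that vector is combined with the word embedding at every decoding step. A minimal PyTorch sketch of such a decoder (sizes, module names and the zero-initialized LSTM state are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GuidedDecoder(nn.Module):
    def __init__(self, vocab=5000, img_dim=2048, emb=256, hid=512, guide=128):
        super().__init__()
        self.guiding_net = nn.Sequential(nn.Linear(img_dim, guide), nn.Tanh())
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTMCell(emb + guide, hid)    # decoder input = [word ; guidance]
        self.out = nn.Linear(hid, vocab)

    def forward(self, img_feat, tokens):             # img_feat: (B, img_dim), tokens: (B, T)
        g = self.guiding_net(img_feat)               # guidance vector, reused every step
        h = c = img_feat.new_zeros(tokens.size(0), self.out.in_features)
        logits = []
        for t in range(tokens.size(1)):
            x = torch.cat([self.embed(tokens[:, t]), g], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (B, T, vocab)

dec = GuidedDecoder()
print(dec(torch.randn(2, 2048), torch.randint(0, 5000, (2, 7))).shape)  # (2, 7, 5000)
```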

Fine-grained visual recognition with salient feature detection

Title Fine-grained visual recognition with salient feature detection
Authors Hui Feng, Shanshan Wang, Shuzhi Sam Ge
Abstract Computer-vision-based fine-grained recognition has received great attention in recent years. Existing works focus on discriminative part localization and feature learning. In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many salient parts of the object as possible. Then, we determine the classification probability that can be obtained by using the separate parts for object classification. Finally, by extracting efficient features from each part, combining them, and feeding the result to a classifier for recognition, an improved accuracy over state-of-the-art algorithms is obtained on the CUB200-2011 bird dataset.
Tasks Fine-Grained Visual Recognition, Object Classification
Published 2018-08-12
URL http://arxiv.org/abs/1808.03935v2
PDF http://arxiv.org/pdf/1808.03935v2.pdf
PWC https://paperswithcode.com/paper/fine-grained-visual-recognition-with-salient
Repo https://github.com/wuyun8210/partdetection
Framework none
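
The recognition stage described in the abstract reduces to: pool features from each detected salient part, concatenate them, and classify. A small PyTorch sketch of that stage, assuming part boxes are already provided by the salient-part detector (the backbone, pooling and sizes are placeholders; 200 classes matches CUB200-2011):

```python
import torch
import torch.nn as nn
from torchvision import models
from torchvision.ops import roi_align

class PartConcatClassifier(nn.Module):
    """Pool features from each salient part, concatenate, and classify."""
    def __init__(self, num_parts=4, num_classes=200):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 512, H/32, W/32)
        self.classifier = nn.Linear(512 * num_parts, num_classes)

    def forward(self, images, part_boxes):        # part_boxes: list of (num_parts, 4) tensors
        fmap = self.backbone(images)
        pooled = roi_align(fmap, part_boxes, output_size=1, spatial_scale=1 / 32)
        return self.classifier(pooled.flatten(1).reshape(images.size(0), -1))

model = PartConcatClassifier()
imgs = torch.randn(2, 3, 224, 224)
boxes = [torch.tensor([[10., 10., 100., 100.]] * 4) for _ in range(2)]
print(model(imgs, boxes).shape)                   # torch.Size([2, 200])
```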

Evolutionary Generative Adversarial Networks

Title Evolutionary Generative Adversarial Networks
Authors Chaoyue Wang, Chang Xu, Xin Yao, Dacheng Tao
Abstract Generative adversarial networks (GANs) have been effective for learning generative models of real-world data. However, existing GANs (GAN and its variants) tend to suffer from training problems such as instability and mode collapse. In this paper, we propose a novel GAN framework called evolutionary generative adversarial networks (E-GAN) for stable GAN training and improved generative performance. Unlike existing GANs, which employ a pre-defined adversarial objective function to alternately train a generator and a discriminator, we utilize different adversarial training objectives as mutation operations and evolve a population of generators to adapt to the environment (i.e., the discriminator). We also utilize an evaluation mechanism to measure the quality and diversity of generated samples, such that only well-performing generator(s) are preserved and used for further training. In this way, E-GAN overcomes the limitations of an individual adversarial training objective and always preserves the best offspring, contributing to the progress and success of GANs. Experiments on several datasets demonstrate that E-GAN achieves convincing generative performance and reduces the training problems inherent in existing GANs.
Tasks
Published 2018-03-01
URL http://arxiv.org/abs/1803.00657v1
PDF http://arxiv.org/pdf/1803.00657v1.pdf
PWC https://paperswithcode.com/paper/evolutionary-generative-adversarial-networks
Repo https://github.com/WANG-Chaoyue/EvolutionaryGAN
Framework pytorch
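
The evolutionary loop in the abstract can be summarized as: clone the current generator, apply one gradient step under each of several adversarial objectives (the "mutations"), score each offspring with a discriminator-based fitness, and keep the best. A compressed PyTorch sketch of that selection loop; the three mutation objectives are common GAN generator losses, the fitness here is only a quality term (no diversity term), and all architectures are toy placeholders:

```python
import copy
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))     # toy generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))     # toy discriminator

def minimax(d):   return torch.log(1 - torch.sigmoid(d) + 1e-8).mean()
def heuristic(d): return -torch.log(torch.sigmoid(d) + 1e-8).mean()
def least_sq(d):  return ((torch.sigmoid(d) - 1) ** 2).mean()
mutations = [minimax, heuristic, least_sq]

def evolve_one_generation(G, D, lr=1e-3, batch=64):
    offspring, fitness = [], []
    for mutate in mutations:
        child = copy.deepcopy(G)                        # clone the current parent
        opt = torch.optim.Adam(child.parameters(), lr=lr)
        mutate(D(child(torch.randn(batch, 8)))).backward()   # one step per objective
        opt.step()
        with torch.no_grad():                           # fitness: quality term only here
            quality = torch.sigmoid(D(child(torch.randn(batch, 8)))).mean()
        offspring.append(child)
        fitness.append(quality.item())
    return offspring[fitness.index(max(fitness))]       # survival of the fittest offspring

G = evolve_one_generation(G, D)                         # survivor becomes the next parent
print(sum(p.numel() for p in G.parameters()), "generator parameters kept")
```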

A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification

Title A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification
Authors Eduardo Fonseca, Rong Gong, Xavier Serra
Abstract In the past, acoustic scene classification systems have been based on hand-crafted audio features that are input to a classifier. Nowadays, the common trend is to adopt data-driven techniques, e.g., deep learning, where audio representations are learned from data. In this paper, we propose a system that consists of a simple fusion of two methods of the aforementioned types: a deep learning approach, where log-scaled mel-spectrograms are input to a convolutional neural network, and a feature engineering approach, where a collection of hand-crafted features is input to a gradient boosting machine. We first show that both methods provide complementary information to some extent. Then, we use a simple late fusion strategy to combine both methods. We report the classification accuracy of each method individually and of the combined system on the TUT Acoustic Scenes 2017 dataset. The proposed fused system outperforms each of the individual methods and attains a classification accuracy of 72.8% on the evaluation set, improving the baseline system by 11.8%.
Tasks Acoustic Scene Classification, Feature Engineering, Scene Classification
Published 2018-06-19
URL http://arxiv.org/abs/1806.07506v2
PDF http://arxiv.org/pdf/1806.07506v2.pdf
PWC https://paperswithcode.com/paper/a-simple-fusion-of-deep-and-shallow-learning
Repo https://github.com/edufonseca/icassp19
Framework tf
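
The late fusion itself is just a weighted average of class probabilities from the two models. A sketch assuming the CNN has already produced per-clip probabilities and the shallow model is scikit-learn's GradientBoostingClassifier; the toy data, 15-class setup and the equal weighting are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# toy stand-ins: hand-crafted features + labels for the shallow model, and
# per-clip class probabilities that would come from the mel-spectrogram CNN
rng = np.random.default_rng(0)
X_feat = rng.normal(size=(200, 40))
y = np.arange(200) % 15                               # toy 15-class scene labels
cnn_probs = rng.dirichlet(np.ones(15), size=200)      # placeholder CNN outputs

gbm = GradientBoostingClassifier(n_estimators=50).fit(X_feat, y)
gbm_probs = gbm.predict_proba(X_feat)

w = 0.5                                               # late-fusion weight (illustrative)
fused = w * cnn_probs + (1 - w) * gbm_probs
print(fused.argmax(axis=1)[:10])                      # fused scene predictions
```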

Spherical Latent Spaces for Stable Variational Autoencoders

Title Spherical Latent Spaces for Stable Variational Autoencoders
Authors Jiacheng Xu, Greg Durrett
Abstract A hallmark of variational autoencoders (VAEs) for text processing is their combination of powerful encoder-decoder models, such as LSTMs, with simple latent distributions, typically multivariate Gaussians. These models pose a difficult optimization problem: there is an especially bad local optimum where the variational posterior always equals the prior and the model does not use the latent variable at all, a kind of “collapse” which is encouraged by the KL divergence term of the objective. In this work, we experiment with another choice of latent distribution, namely the von Mises-Fisher (vMF) distribution, which places mass on the surface of the unit hypersphere. With this choice of prior and posterior, the KL divergence term now only depends on the variance of the vMF distribution, giving us the ability to treat it as a fixed hyperparameter. We show that doing so not only averts the KL collapse, but consistently gives better likelihoods than Gaussians across a range of modeling conditions, including recurrent language modeling and bag-of-words document modeling. An analysis of the properties of our vMF representations shows that they learn richer and more nuanced structures in their latent representations than their Gaussian counterparts.
Tasks Language Modelling
Published 2018-08-31
URL http://arxiv.org/abs/1808.10805v2
PDF http://arxiv.org/pdf/1808.10805v2.pdf
PWC https://paperswithcode.com/paper/spherical-latent-spaces-for-stable
Repo https://github.com/jiacheng-xu/vmf_vae_nlp
Framework pytorch
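
The practical consequence highlighted in the abstract is that, with the concentration parameter kappa fixed, the KL term no longer depends on the encoder output, so it contributes only a constant and the reconstruction term alone drives learning. A minimal sketch of that loss structure; for simplicity the latent is taken to be the normalized encoder mean, whereas the actual model draws a reparameterized sample from vMF(mu, kappa), which is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, E, H, Z = 5000, 64, 256, 32           # vocab, embedding, hidden, latent sizes

embed = nn.Embedding(V, E)
encoder = nn.GRU(E, H, batch_first=True)
to_mu = nn.Linear(H, Z)
decoder_init = nn.Linear(Z, H)
decoder = nn.GRU(E, H, batch_first=True)
out = nn.Linear(H, V)
KL_CONST = 1.0                           # depends only on (kappa, Z): a fixed number, no gradient

def loss(tokens):                        # tokens: (batch, seq_len)
    _, h = encoder(embed(tokens))
    mu = F.normalize(to_mu(h[-1]), dim=-1)   # direction on the unit hypersphere
    z = mu                                   # simplification: mean direction as the latent
    h0 = torch.tanh(decoder_init(z)).unsqueeze(0)
    dec_h, _ = decoder(embed(tokens[:, :-1]), h0)
    recon = F.cross_entropy(out(dec_h).reshape(-1, V), tokens[:, 1:].reshape(-1))
    return recon + KL_CONST              # the KL adds a constant: it does not shape the encoder

print(float(loss(torch.randint(0, V, (4, 20)))))
```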

DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder

Title DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder
Authors Xiaodong Gu, Kyunghyun Cho, Jung-Woo Ha, Sunghun Kim
Abstract Variational autoencoders (VAEs) have shown promise in data-driven conversation modeling. However, most VAE conversation models match the approximate posterior distribution over the latent variables to a simple prior such as the standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., unimodal) scope. In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling. Unlike VAEs that impose a simple distribution over the latent variables, DialogWAE models the distribution of data by training a GAN within the latent variable space. Specifically, our model samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise using neural networks and minimizes the Wasserstein distance between the two distributions. We further develop a Gaussian mixture prior network to enrich the latent space. Experiments on two popular datasets show that DialogWAE outperforms state-of-the-art approaches in generating more coherent, informative and diverse responses.
Tasks
Published 2018-05-31
URL http://arxiv.org/abs/1805.12352v2
PDF http://arxiv.org/pdf/1805.12352v2.pdf
PWC https://paperswithcode.com/paper/dialogwae-multimodal-response-generation-with
Repo https://github.com/guxd/DialogWAE
Framework pytorch
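
A compressed sketch of the latent-space GAN described in the abstract: a prior network and a recognition (posterior) network each transform context-dependent noise into latent codes, and a critic estimates the Wasserstein distance between the two sets of codes. Utterance encoders, the Gaussian-mixture prior and the gradient penalty are omitted; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn

C, R, Z, N = 128, 128, 64, 32      # context, response, latent, noise dims

prior_net = nn.Sequential(nn.Linear(C + N, 128), nn.ReLU(), nn.Linear(128, Z))
post_net  = nn.Sequential(nn.Linear(C + R + N, 128), nn.ReLU(), nn.Linear(128, Z))
critic    = nn.Sequential(nn.Linear(C + Z, 128), nn.ReLU(), nn.Linear(128, 1))

def wasserstein_losses(ctx, resp):
    """Critic separates prior/posterior codes; the two nets try to close the gap."""
    noise = torch.randn(ctx.size(0), N)
    z_prior = prior_net(torch.cat([ctx, noise], dim=-1))
    z_post  = post_net(torch.cat([ctx, resp, torch.randn_like(noise)], dim=-1))
    d_prior = critic(torch.cat([ctx, z_prior], dim=-1)).mean()
    d_post  = critic(torch.cat([ctx, z_post], dim=-1)).mean()
    critic_loss = d_prior - d_post     # critic pushes the two apart (plus a penalty in practice)
    gen_loss = d_post - d_prior        # prior/posterior nets pull them together
    return critic_loss, gen_loss

c_loss, g_loss = wasserstein_losses(torch.randn(8, C), torch.randn(8, R))
print(float(c_loss), float(g_loss))
```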

Grounded Video Description

Title Grounded Video Description
Authors Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach
Abstract Video description is one of the most challenging problems in vision and language understanding due to the large variability on both the video and language sides. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data and, importantly, evaluating how grounded or “true” such models are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our dataset, but also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate that our generated sentences are better grounded in the video.
Tasks Video Description
Published 2018-12-17
URL https://arxiv.org/abs/1812.06587v2
PDF https://arxiv.org/pdf/1812.06587v2.pdf
PWC https://paperswithcode.com/paper/grounded-video-description
Repo https://github.com/facebookresearch/grounded-video-description
Framework pytorch
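
One way box annotations can supervise a description model, as the abstract suggests, is through region attention: when the decoder emits a grounded noun phrase, its attention over region features can be supervised to point at the annotated box. A generic sketch of that grounding loss (a common formulation, not necessarily the exact objective used in the paper; sizes and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

R_DIM, H_DIM = 1024, 512                      # region feature and decoder hidden sizes
attn = nn.ModuleDict({"region": nn.Linear(R_DIM, H_DIM), "score": nn.Linear(H_DIM, 1)})

def region_attention(decoder_state, region_feats):
    """Additive attention of the decoder state over detected region features."""
    keys = torch.tanh(attn["region"](region_feats) + decoder_state.unsqueeze(1))
    logits = attn["score"](keys).squeeze(-1)             # (batch, num_regions)
    context = (F.softmax(logits, -1).unsqueeze(-1) * region_feats).sum(1)
    return logits, context

def grounding_loss(attn_logits, gt_region_idx):
    """Supervise attention to select the annotated box for the current noun phrase."""
    return F.cross_entropy(attn_logits, gt_region_idx)

# toy usage: a batch of 2 decoding steps with 20 candidate regions each
logits, context = region_attention(torch.randn(2, H_DIM), torch.randn(2, 20, R_DIM))
print(context.shape, float(grounding_loss(logits, torch.tensor([3, 11]))))
```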