Paper Group AWR 125
Let’s Dance: Learning From Online Dance Videos. Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding. Movie Recommendation System using Sentiment Analysis from Microblogging Data. An Introductory Survey on Attention Mechanisms in NLP Problems. Near-lossless Binarization of Word Embeddings. Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks. Correlation Flow: Robust Optical Flow Using Kernel Cross-Correlators. SurfConv: Bridging 3D and 2D Convolution for RGBD Images. Learning to Guide Decoding for Image Captioning. Fine-grained visual recognition with salient feature detection. Evolutionary Generative Adversarial Networks. A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification. Spherical Latent Spaces for Stable Variational Autoencoders. DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder. Grounded Video Description.
Let’s Dance: Learning From Online Dance Videos
Title | Let’s Dance: Learning From Online Dance Videos |
Authors | Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, Irfan Essa |
Abstract | In recent years, deep neural network approaches have naturally extended to the video domain, in their simplest case by aggregating per-frame classifications as a baseline for action recognition. A majority of the work in this area extends from the imaging domain, leading to visual-feature-heavy approaches on temporal data. To address this issue we introduce “Let’s Dance”, a 1,000-video dataset (and growing) comprised of 10 visually overlapping dance categories that require motion for their classification. We stress the importance of human motion as a key distinguisher, given that, as we show in this work, visual information is not sufficient to classify motion-heavy categories. We compare our dataset’s performance using imaging techniques with UCF-101 and demonstrate this inherent difficulty. We present a comparison of numerous state-of-the-art techniques on our dataset using three different representations (video, optical flow and multi-person pose data) in order to analyze these approaches. We discuss the motion parameterization of each of them and their value in learning to categorize online dance videos. Lastly, we release this dataset (and its three representations) for the research community to use. |
Tasks | Optical Flow Estimation, Temporal Action Localization |
Published | 2018-01-23 |
URL | http://arxiv.org/abs/1801.07388v1 |
PDF | http://arxiv.org/pdf/1801.07388v1.pdf |
PWC | https://paperswithcode.com/paper/lets-dance-learning-from-online-dance-videos |
Repo | https://github.com/xrenaa/Human-Motion-Analysis-with-Deep-Metric-Learning |
Framework | pytorch |
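As a rough illustration of the per-frame baseline the abstract mentions (aggregating frame-level classifications into a video-level label), here is a minimal PyTorch sketch. The ResNet-18 backbone, the 10-class output and the mean pooling over time are illustrative assumptions, not the authors' configuration.

```python
# A minimal per-frame classification baseline: classify every frame with an
# image CNN and average the logits over time to get a video-level prediction.
import torch
import torch.nn as nn
import torchvision.models as models

class PerFrameBaseline(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any image CNN would do here
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, clip):
        # clip: (batch, frames, 3, H, W)
        b, t, c, h, w = clip.shape
        logits = self.backbone(clip.reshape(b * t, c, h, w))   # per-frame scores
        return logits.reshape(b, t, -1).mean(dim=1)            # average over time

model = PerFrameBaseline(num_classes=10)
video = torch.randn(2, 16, 3, 224, 224)      # two toy clips of 16 frames
print(model(video).shape)                     # torch.Size([2, 10])
```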
Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Title | Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding |
Authors | Chenxu Luo, Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, Ram Nevatia, Alan Yuille |
Abstract | Learning to estimate 3D geometry in a single frame and optical flow from consecutive frames by watching unlabeled videos via deep convolutional networks has made significant progress recently. Current state-of-the-art (SoTA) methods treat the two tasks independently. One typical assumption of existing depth estimation methods is that the scenes contain no independently moving objects, while object motion could be easily modeled using optical flow. In this paper, we propose to address the two tasks as a whole, i.e. to jointly understand per-pixel 3D geometry and motion. This eliminates the need for the static-scene assumption and enforces the inherent geometrical consistency during the learning process, yielding significantly improved results for both tasks. We call our method “Every Pixel Counts++” or “EPC++”. Specifically, during training, given two consecutive frames from a video, we adopt three parallel networks to predict the camera motion (MotionNet), dense depth map (DepthNet), and per-pixel optical flow between the two frames (OptFlowNet), respectively. The three types of information are fed into a holistic 3D motion parser (HMP), and the per-pixel 3D motion of both the rigid background and moving objects is disentangled and recovered. Comprehensive experiments were conducted on datasets with different scenes, including driving scenarios (KITTI 2012 and KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D) and synthetic animation (MPI Sintel dataset). Performance on the five tasks of depth estimation, optical flow estimation, odometry, moving object segmentation and scene flow estimation shows that our approach outperforms other SoTA methods. Code will be available at: https://github.com/chenxuluo/EPC. |
Tasks | Depth Estimation, Optical Flow Estimation, Scene Flow Estimation, Semantic Segmentation |
Published | 2018-10-14 |
URL | https://arxiv.org/abs/1810.06125v2 |
PDF | https://arxiv.org/pdf/1810.06125v2.pdf |
PWC | https://paperswithcode.com/paper/every-pixel-counts-joint-learning-of-geometry |
Repo | https://github.com/chenxuluo/EPC |
Framework | tf |
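The geometric consistency EPC++ exploits rests on standard rigid-scene geometry: pixels backprojected with the predicted depth, moved by the predicted camera motion, and reprojected define the flow of the static background. Below is a minimal NumPy sketch of that computation, with toy intrinsics, depth and pose standing in for the network outputs.

```python
# Rigid-background flow from depth + camera motion: backproject, transform,
# reproject, and subtract the original pixel grid.
import numpy as np

def rigid_flow(depth, K, R, t):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                # backproject
    cam2 = R @ cam + t[:, None]                                        # camera motion
    proj = K @ cam2
    proj = proj[:2] / proj[2:]                                         # reproject
    return (proj - pix[:2]).T.reshape(h, w, 2)                         # induced flow

K = np.array([[500.0, 0, 64], [0, 500.0, 48], [0, 0, 1]])   # toy intrinsics
depth = np.full((96, 128), 10.0)                            # flat scene 10 m away
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])                 # small sideways shift
print(rigid_flow(depth, K, R, t)[48, 64])                   # ~[5., 0.] pixels
```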
Movie Recommendation System using Sentiment Analysis from Microblogging Data
Title | Movie Recommendation System using Sentiment Analysis from Microblogging Data |
Authors | Sudhanshu Kumar, Shirsendu Sukanta Halder, Kanjar De, Partha Pratim Roy |
Abstract | Recommendation systems are important intelligent systems that play a vital role in providing selective information to users. Traditional approaches in recommendation systems include collaborative filtering and content-based filtering. However, these approaches have certain limitations, such as the need for prior user history and habits to perform the recommendation task. To reduce the effect of such dependencies, this paper proposes a hybrid recommendation system which combines collaborative filtering and content-based filtering with sentiment analysis of movie tweets. The movie tweets have been collected from microblogging websites to understand the current trends and user response to a movie. Experiments conducted on a public database produce promising results. |
Tasks | Recommendation Systems, Sentiment Analysis |
Published | 2018-11-27 |
URL | https://arxiv.org/abs/1811.10804v1 |
PDF | https://arxiv.org/pdf/1811.10804v1.pdf |
PWC | https://paperswithcode.com/paper/movie-recommendation-system-using-sentiment |
Repo | https://github.com/3ZadeSSG/ContentBased-Movie-Recommendation-using-Sentiment-Analysis |
Framework | none |
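A hybrid of the kind described can be thought of as a weighted blend of three per-movie scores. The sketch below is a deliberately simple illustration; the weights, score ranges and movie names are assumptions, not the paper's model.

```python
# Weighted hybrid recommendation score: blend collaborative-filtering and
# content-based scores with a tweet-sentiment score for each movie.
def hybrid_score(cf_score, content_score, sentiment_score,
                 w_cf=0.5, w_content=0.3, w_sent=0.2):
    """All inputs are assumed to be normalised to [0, 1]."""
    return w_cf * cf_score + w_content * content_score + w_sent * sentiment_score

movies = {
    "Movie A": hybrid_score(0.8, 0.6, 0.9),
    "Movie B": hybrid_score(0.7, 0.9, 0.4),
}
for title, score in sorted(movies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score:.2f}")            # ranked recommendations
```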
An Introductory Survey on Attention Mechanisms in NLP Problems
Title | An Introductory Survey on Attention Mechanisms in NLP Problems |
Authors | Dichao Hu |
Abstract | First derived from human intuition and later adapted to machine translation for automatic token alignment, the attention mechanism is a simple method for encoding sequence data based on the importance score assigned to each element. It has been widely applied to, and has attained significant improvements in, various tasks in natural language processing, including sentiment classification, text summarization, question answering, dependency parsing, etc. In this paper, we survey recent works and provide an introductory summary of the attention mechanism in different NLP problems, aiming to give our readers basic knowledge of this widely used method, discuss its different variants for different tasks, explore its association with other techniques in machine learning, and examine methods for evaluating its performance. |
Tasks | Dependency Parsing, Machine Translation, Question Answering, Sentiment Analysis, Text Summarization |
Published | 2018-11-12 |
URL | http://arxiv.org/abs/1811.05544v1 |
PDF | http://arxiv.org/pdf/1811.05544v1.pdf |
PWC | https://paperswithcode.com/paper/an-introductory-survey-on-attention |
Repo | https://github.com/wangyiyan3318/Attention-Mechanism |
Framework | tf |
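For readers new to the topic, the core operation the survey covers can be written in a few lines: score each element against a query, softmax the scores, and return the weighted sum. A minimal NumPy sketch with toy shapes follows.

```python
# Basic (scaled dot-product) attention over a sequence of encoder states.
import numpy as np

def attention(query, keys, values):
    scores = keys @ query / np.sqrt(query.shape[-1])      # importance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over elements
    return weights @ values, weights                      # context vector, alignment

rng = np.random.default_rng(0)
keys = values = rng.normal(size=(5, 8))    # 5 encoder states of dimension 8
query = rng.normal(size=8)                 # current decoder state
context, weights = attention(query, keys, values)
print(weights.round(2), context.shape)     # alignment weights and (8,) context
```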
Near-lossless Binarization of Word Embeddings
Title | Near-lossless Binarization of Word Embeddings |
Authors | Julien Tissier, Christophe Gravier, Amaury Habrard |
Abstract | Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performance. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of memory and calculations, which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows the original vectors to be reconstructed from the binary ones. Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of ~2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors. |
Tasks | Semantic Similarity, Semantic Textual Similarity, Sentiment Analysis, Text Classification, Word Embeddings |
Published | 2018-03-24 |
URL | http://arxiv.org/abs/1803.09065v3 |
PDF | http://arxiv.org/pdf/1803.09065v3.pdf |
PWC | https://paperswithcode.com/paper/near-lossless-binarization-of-word-embeddings |
Repo | https://github.com/tca19/near-lossless-binarization |
Framework | none |
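The speed claim comes from the fact that binary codes can be compared with XOR and popcount. The sketch below illustrates Hamming-distance top-k search over packed 256-bit codes; the sign-of-random-projection binarizer is a stand-in for illustration, not the paper's autoencoder.

```python
# Pack 256-bit codes into bytes and rank candidates by Hamming distance.
import numpy as np

def binarize(embeddings, n_bits=256, seed=0):
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(embeddings.shape[1], n_bits))
    bits = (embeddings @ proj > 0).astype(np.uint8)        # one bit per projection
    return np.packbits(bits, axis=1)                       # 256 bits -> 32 bytes

def hamming_topk(codes, query_code, k=5):
    dist = np.unpackbits(codes ^ query_code, axis=1).sum(axis=1)   # popcount
    return np.argsort(dist)[:k]

vocab = np.random.default_rng(1).normal(size=(10_000, 300))   # toy "embeddings"
codes = binarize(vocab)
print(hamming_topk(codes, binarize(vocab[:1])))               # nearest codes to word 0
```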
Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks
Title | Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks |
Authors | Yau-Shian Wang, Hung-Yi Lee |
Abstract | Auto-encoders compress input data into a latent-space representation and reconstruct the original data from the representation. This latent representation is not easily interpreted by humans. In this paper, we propose training an auto-encoder that encodes input text into human-readable sentences, and unpaired abstractive summarization is thereby achieved. The auto-encoder is composed of a generator and a reconstructor. The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output. To make the generator output human-readable, a discriminator restricts the output of the generator to resemble human-written sentences. By taking the generator output as the summary of the input text, abstractive summarization is achieved without document-summary pairs as training data. Promising results are shown on both English and Chinese corpora. |
Tasks | Abstractive Text Summarization |
Published | 2018-10-05 |
URL | http://arxiv.org/abs/1810.02851v1 |
PDF | http://arxiv.org/pdf/1810.02851v1.pdf |
PWC | https://paperswithcode.com/paper/learning-to-encode-text-as-human-readable |
Repo | https://github.com/yaushian/Unparalleled-Text-Summarization-using-GAN |
Framework | tf |
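A structural sketch of the three components (generator, reconstructor, discriminator) may help; the GRU sizes, fixed output lengths, greedy decoding and mean-pooled discriminator input below are assumptions made for brevity, not the paper's architecture.

```python
# Generator compresses text to a shorter word sequence; reconstructor tries to
# recover the input from it; a discriminator would keep the summary readable.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder reused for both the generator and the reconstructor."""
    def __init__(self, vocab=5000, dim=128, out_len=12):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)
        self.out_len = out_len

    def forward(self, tokens):
        _, h = self.encoder(self.embed(tokens))                    # encode input text
        inp = torch.zeros(tokens.size(0), self.out_len, self.embed.embedding_dim)
        dec, _ = self.decoder(inp, h)                              # unroll a fixed length
        return self.out(dec)                                       # per-position logits

generator = Seq2Seq(out_len=12)        # long text -> short word sequence ("summary")
reconstructor = Seq2Seq(out_len=40)    # short word sequence -> original text
discriminator = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

text = torch.randint(0, 5000, (4, 40))
summary_tokens = generator(text).argmax(-1)          # greedy choice, for the sketch only
recon_logits = reconstructor(summary_tokens)
recon_loss = nn.functional.cross_entropy(recon_logits.transpose(1, 2), text)
# The discriminator scores the summary; here it reads mean-pooled embeddings of
# the generated tokens (an assumption for brevity, not the paper's setup).
readability = discriminator(generator.embed(summary_tokens).mean(dim=1))
print(summary_tokens.shape, recon_loss.item(), readability.shape)
```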
Correlation Flow: Robust Optical Flow Using Kernel Cross-Correlators
Title | Correlation Flow: Robust Optical Flow Using Kernel Cross-Correlators |
Authors | Chen Wang, Tete Ji, Thien-Minh Nguyen, Lihua Xie |
Abstract | Robust velocity and position estimation is crucial for autonomous robot navigation. Optical flow based methods for autonomous navigation have been receiving increasing attention in tandem with the development of micro unmanned aerial vehicles. This paper proposes a kernel cross-correlator (KCC) based algorithm to determine optical flow using a monocular camera, which is named correlation flow (CF). Correlation flow is able to provide reliable and accurate velocity estimation and is robust to motion blur. In addition, it can also estimate the altitude velocity and yaw rate, which are not available from traditional methods. Autonomous flight tests on a quadcopter show that correlation flow can provide robust trajectory estimation with very low processing power. The source code is released based on the ROS framework. |
Tasks | Autonomous Navigation, Optical Flow Estimation, Robot Navigation |
Published | 2018-02-20 |
URL | http://arxiv.org/abs/1802.07078v2 |
PDF | http://arxiv.org/pdf/1802.07078v2.pdf |
PWC | https://paperswithcode.com/paper/correlation-flow-robust-optical-flow-using |
Repo | https://github.com/wang-chen/KCC |
Framework | none |
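Correlation-based flow boils down to finding the peak of a cross-correlation surface between consecutive frames. The sketch below uses a plain FFT cross-correlation to recover a global image translation; it is not the paper's kernel cross-correlator, only the underlying idea.

```python
# Estimate the dominant image translation from the peak of the circular
# cross-correlation computed in the Fourier domain.
import numpy as np

def correlation_shift(frame_a, frame_b):
    F_a, F_b = np.fft.fft2(frame_a), np.fft.fft2(frame_b)
    corr = np.fft.ifft2(F_a * np.conj(F_b)).real          # circular cross-correlation
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = frame_a.shape
    return ((dy + h // 2) % h - h // 2, (dx + w // 2) % w - w // 2)  # signed shift

rng = np.random.default_rng(0)
frame = rng.normal(size=(64, 64))
shifted = np.roll(frame, shift=(3, -5), axis=(0, 1))      # simulate camera motion
print(correlation_shift(shifted, frame))                   # (3, -5)
```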
SurfConv: Bridging 3D and 2D Convolution for RGBD Images
Title | SurfConv: Bridging 3D and 2D Convolution for RGBD Images |
Authors | Hang Chu, Wei-Chiu Ma, Kaustav Kundu, Raquel Urtasun, Sanja Fidler |
Abstract | We tackle the problem of using 3D information in convolutional neural networks for downstream recognition tasks. Using depth as an additional channel alongside the RGB input suffers from the scale-variance problem present in image-convolution-based approaches. On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space, since the occupied portion consists only of the surface visible to the sensor. Instead, we propose SurfConv, which “slides” compact 2D filters along the visible 3D surface. SurfConv is formulated as a simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth Discretization (D4) scheme. We demonstrate the effectiveness of our method on indoor and outdoor 3D semantic segmentation datasets. Our method achieves state-of-the-art performance with less than 30% of the parameters used by 3D convolution-based approaches. |
Tasks | 3D Semantic Segmentation, Semantic Segmentation |
Published | 2018-12-04 |
URL | http://arxiv.org/abs/1812.01519v1 |
PDF | http://arxiv.org/pdf/1812.01519v1.pdf |
PWC | https://paperswithcode.com/paper/surfconv-bridging-3d-and-2d-convolution-for |
Repo | https://github.com/chuhang/SurfConv |
Framework | pytorch |
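One way to picture a depth-aware multi-scale 2D convolution is to bin pixels by depth and convolve each bin at a resolution matched to its depth with one shared filter bank. The sketch below uses uniform depth bins and a simple scale schedule as stand-ins for the paper's data-driven D4 discretization.

```python
# Depth-aware multi-scale 2D convolution (illustrative): one shared Conv2d is
# applied per depth level, each at its own resolution, and responses are merged.
import torch
import torch.nn as nn
import torch.nn.functional as F

def depth_aware_conv(rgb, depth, conv, n_levels=3):
    # rgb: (B, 3, H, W), depth: (B, 1, H, W) in metres
    edges = torch.linspace(float(depth.min()), float(depth.max()) + 1e-6, n_levels + 1)
    out = torch.zeros(rgb.size(0), conv.out_channels, *rgb.shape[2:])
    for i in range(n_levels):
        mask = ((depth >= edges[i]) & (depth < edges[i + 1])).float()
        scale = 1.0 / (n_levels - i)            # nearer pixels -> coarser resolution
        x = F.interpolate(rgb * mask, scale_factor=scale, mode="bilinear",
                          align_corners=False)
        y = F.interpolate(conv(x), size=rgb.shape[2:], mode="bilinear",
                          align_corners=False)
        out = out + y * mask                    # paste responses back per level
    return out

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)    # shared compact 2D filters
rgb, depth = torch.randn(1, 3, 96, 96), torch.rand(1, 1, 96, 96) * 10
print(depth_aware_conv(rgb, depth, conv).shape)       # torch.Size([1, 16, 96, 96])
```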
Learning to Guide Decoding for Image Captioning
Title | Learning to Guide Decoding for Image Captioning |
Authors | Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, Wei Liu |
Abstract | Recently, much progress has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called a guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at each time step. The guiding network can be plugged into the current encoder-decoder framework and trained in an end-to-end manner. Hence, the guiding vector can be adaptively learned according to the signal from the decoder, enabling it to embed information from both image and language. Additionally, discriminative supervision can be employed to further improve the quality of guidance. The advantages of our proposed approach are verified by experiments carried out on the MS COCO dataset. |
Tasks | Image Captioning |
Published | 2018-04-03 |
URL | http://arxiv.org/abs/1804.00887v1 |
PDF | http://arxiv.org/pdf/1804.00887v1.pdf |
PWC | https://paperswithcode.com/paper/learning-to-guide-decoding-for-image |
Repo | https://github.com/HaleyPei/Learning-to-Guide-Decoding-for-Image-Captioning |
Framework | pytorch |
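The guiding-network idea can be sketched as a vector computed once from the image features and concatenated with the word embedding at every decoding step. The layer sizes and the LSTM cell below are illustrative assumptions, not the paper's exact model.

```python
# Guided decoding: a small guiding network maps image features to a vector that
# is appended to the decoder input at each time step.
import torch
import torch.nn as nn

class GuidedDecoder(nn.Module):
    def __init__(self, vocab=10000, img_dim=2048, guide_dim=128, hid=256):
        super().__init__()
        self.guide = nn.Sequential(nn.Linear(img_dim, guide_dim), nn.ReLU())
        self.embed = nn.Embedding(vocab, hid)
        self.rnn = nn.LSTMCell(hid + guide_dim, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, img_feat, captions):
        g = self.guide(img_feat)                      # guiding vector, fixed per image
        h = c = torch.zeros(captions.size(0), self.rnn.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            x = torch.cat([self.embed(captions[:, t]), g], dim=1)  # word + guidance
            h, c = self.rnn(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)

decoder = GuidedDecoder()
img_feat = torch.randn(2, 2048)                        # CNN-encoded images
captions = torch.randint(0, 10000, (2, 12))            # teacher-forced tokens
print(decoder(img_feat, captions).shape)               # torch.Size([2, 12, 10000])
```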
Fine-grained visual recognition with salient feature detection
Title | Fine-grained visual recognition with salient feature detection |
Authors | Hui Feng, Shanshan Wang, Shuzhi Sam Ge |
Abstract | Computer vision based fine-grained recognition has received great attention in recent years. Existing works focus on discriminative part localization and feature learning. In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many salient parts of the object as possible. Then, we determine the classification probability that can be obtained by using each separate part for object classification. Finally, by extracting efficient features from each part, combining them, and feeding them to a classifier for recognition, we obtain an accuracy on the CUB200-2011 bird dataset that improves over state-of-the-art algorithms. |
Tasks | Fine-Grained Visual Recognition, Object Classification |
Published | 2018-08-12 |
URL | http://arxiv.org/abs/1808.03935v2 |
PDF | http://arxiv.org/pdf/1808.03935v2.pdf |
PWC | https://paperswithcode.com/paper/fine-grained-visual-recognition-with-salient |
Repo | https://github.com/wuyun8210/partdetection |
Framework | none |
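The final combine-and-classify step reduces to concatenating per-part feature vectors and feeding the result to a classifier. The toy sketch below uses random features and a nearest-centroid classifier as stand-ins for the paper's part detector and CNN features.

```python
# Concatenate per-part features into one vector per image, then classify.
import numpy as np

rng = np.random.default_rng(0)
n_parts, feat_dim, n_classes = 4, 64, 5

def combine_parts(part_features):
    return np.concatenate(part_features)              # one vector per image

centroids = rng.normal(size=(n_classes, n_parts * feat_dim))   # toy class models

def classify(part_features):
    x = combine_parts(part_features)
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

sample_parts = [rng.normal(size=feat_dim) for _ in range(n_parts)]  # e.g. head, wing
print(classify(sample_parts))                          # predicted class index
```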
Evolutionary Generative Adversarial Networks
Title | Evolutionary Generative Adversarial Networks |
Authors | Chaoyue Wang, Chang Xu, Xin Yao, Dacheng Tao |
Abstract | Generative adversarial networks (GANs) have been effective for learning generative models for real-world data. However, existing GANs (GAN and its variants) tend to suffer from training problems such as instability and mode collapse. In this paper, we propose a novel GAN framework called evolutionary generative adversarial networks (E-GAN) for stable GAN training and improved generative performance. Unlike existing GANs, which employ a pre-defined adversarial objective function to alternately train a generator and a discriminator, we utilize different adversarial training objectives as mutation operations and evolve a population of generators to adapt to the environment (i.e., the discriminator). We also utilize an evaluation mechanism to measure the quality and diversity of generated samples, such that only well-performing generator(s) are preserved and used for further training. In this way, E-GAN overcomes the limitations of an individual adversarial training objective and always preserves the best offspring, contributing to the progress and success of GANs. Experiments on several datasets demonstrate that E-GAN achieves convincing generative performance and reduces the training problems inherent in existing GANs. |
Tasks | |
Published | 2018-03-01 |
URL | http://arxiv.org/abs/1803.00657v1 |
PDF | http://arxiv.org/pdf/1803.00657v1.pdf |
PWC | https://paperswithcode.com/paper/evolutionary-generative-adversarial-networks |
Repo | https://github.com/WANG-Chaoyue/EvolutionaryGAN |
Framework | pytorch |
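The training loop structure can be sketched as: mutate a parent generator with one update under each of several adversarial objectives, score the children against the discriminator, and keep the fittest. The tiny MLPs, toy data and quality-only fitness below are simplifications (the paper's fitness also includes a diversity term).

```python
# Evolutionary step: three adversarial objectives act as mutation operators,
# and a fitness score decides which child generator survives.
import copy
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # toy generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # toy discriminator

MUTATIONS = {   # generator losses for the three mutation operators
    "minimax":       lambda s: torch.log(1 - torch.sigmoid(s) + 1e-8).mean(),
    "heuristic":     lambda s: -torch.log(torch.sigmoid(s) + 1e-8).mean(),
    "least-squares": lambda s: ((torch.sigmoid(s) - 1) ** 2).mean(),
}

def evolve(parent):
    scored = []
    for name, loss_fn in MUTATIONS.items():
        child = copy.deepcopy(parent)
        opt = torch.optim.Adam(child.parameters(), lr=1e-3)
        loss = loss_fn(D(child(torch.randn(64, 8))))    # one mutation update
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                           # quality-only fitness here
            fit = torch.sigmoid(D(child(torch.randn(256, 8)))).mean().item()
        scored.append((fit, name, child))
    fit, name, best = max(scored, key=lambda c: c[0])   # keep the best offspring
    return best, name

G, picked = evolve(G)
print("surviving generator came from mutation:", picked)
```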
A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification
Title | A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification |
Authors | Eduardo Fonseca, Rong Gong, Xavier Serra |
Abstract | In the past, Acoustic Scene Classification systems have been based on hand crafting audio features that are input to a classifier. Nowadays, the common trend is to adopt data driven techniques, e.g., deep learning, where audio representations are learned from data. In this paper, we propose a system that consists of a simple fusion of two methods of the aforementioned types: a deep learning approach where log-scaled mel-spectrograms are input to a convolutional neural network, and a feature engineering approach, where a collection of hand-crafted features is input to a gradient boosting machine. We first show that both methods provide complementary information to some extent. Then, we use a simple late fusion strategy to combine both methods. We report classification accuracy of each method individually and the combined system on the TUT Acoustic Scenes 2017 dataset. The proposed fused system outperforms each of the individual methods and attains a classification accuracy of 72.8% on the evaluation set, improving the baseline system by 11.8%. |
Tasks | Acoustic Scene Classification, Feature Engineering, Scene Classification |
Published | 2018-06-19 |
URL | http://arxiv.org/abs/1806.07506v2 |
PDF | http://arxiv.org/pdf/1806.07506v2.pdf |
PWC | https://paperswithcode.com/paper/a-simple-fusion-of-deep-and-shallow-learning |
Repo | https://github.com/edufonseca/icassp19 |
Framework | tf |
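The late-fusion step itself is a weighted average of the two systems' class posteriors. The sketch below uses made-up class probabilities and an equal weight purely for illustration.

```python
# Late fusion: average the class posteriors of the deep and shallow systems.
import numpy as np

classes = ["beach", "bus", "cafe", "car"]
p_cnn = np.array([0.10, 0.55, 0.25, 0.10])     # CNN on log-mel spectrograms
p_gbm = np.array([0.05, 0.35, 0.45, 0.15])     # GBM on hand-crafted features

def late_fusion(p_a, p_b, w=0.5):
    fused = w * p_a + (1 - w) * p_b
    return fused / fused.sum()

fused = late_fusion(p_cnn, p_gbm)
print(classes[int(np.argmax(fused))], fused.round(3))   # 'bus', fused posterior
```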
Spherical Latent Spaces for Stable Variational Autoencoders
Title | Spherical Latent Spaces for Stable Variational Autoencoders |
Authors | Jiacheng Xu, Greg Durrett |
Abstract | A hallmark of variational autoencoders (VAEs) for text processing is their combination of powerful encoder-decoder models, such as LSTMs, with simple latent distributions, typically multivariate Gaussians. These models pose a difficult optimization problem: there is an especially bad local optimum where the variational posterior always equals the prior and the model does not use the latent variable at all, a kind of “collapse” which is encouraged by the KL divergence term of the objective. In this work, we experiment with another choice of latent distribution, namely the von Mises-Fisher (vMF) distribution, which places mass on the surface of the unit hypersphere. With this choice of prior and posterior, the KL divergence term now only depends on the variance of the vMF distribution, giving us the ability to treat it as a fixed hyperparameter. We show that doing so not only averts the KL collapse, but consistently gives better likelihoods than Gaussians across a range of modeling conditions, including recurrent language modeling and bag-of-words document modeling. An analysis of the properties of our vMF representations shows that they learn richer and more nuanced structures in their latent representations than their Gaussian counterparts. |
Tasks | Language Modelling |
Published | 2018-08-31 |
URL | http://arxiv.org/abs/1808.10805v2 |
PDF | http://arxiv.org/pdf/1808.10805v2.pdf |
PWC | https://paperswithcode.com/paper/spherical-latent-spaces-for-stable |
Repo | https://github.com/jiacheng-xu/vmf_vae_nlp |
Framework | pytorch |
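The structural point, that a fixed vMF concentration makes the KL term a constant with respect to the encoder, can be conveyed with a much-simplified spherical autoencoder. The noise-and-renormalize sampler below is not a proper vMF sampler; it only illustrates the shape of the model under that assumption.

```python
# Simplified spherical latent autoencoder: encoder output is normalised onto the
# unit hypersphere, and with a fixed concentration the KL term is a constant.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphericalAE(nn.Module):
    def __init__(self, vocab=2000, dim=64, kappa=80.0):
        super().__init__()
        self.enc = nn.Linear(vocab, dim)
        self.dec = nn.Linear(dim, vocab)
        self.kappa = kappa                         # fixed hyperparameter -> constant KL

    def forward(self, bow):
        mu = F.normalize(self.enc(bow), dim=-1)    # mean direction on the hypersphere
        z = F.normalize(mu + torch.randn_like(mu) / self.kappa ** 0.5, dim=-1)
        return self.dec(z)

model = SphericalAE()
bow = torch.rand(8, 2000)                          # toy bag-of-words documents
recon = model(bow)
loss = F.mse_loss(recon, bow)                      # + constant KL (no collapse incentive)
print(recon.shape, loss.item())
```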
DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder
Title | DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder |
Authors | Xiaodong Gu, Kyunghyun Cho, Jung-Woo Ha, Sunghun Kim |
Abstract | Variational autoencoders (VAEs) have shown promise in data-driven conversation modeling. However, most VAE conversation models match the approximate posterior distribution over the latent variables to a simple prior such as the standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., unimodal) scope. In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling. Unlike VAEs that impose a simple distribution over the latent variables, DialogWAE models the distribution of data by training a GAN within the latent variable space. Specifically, our model samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise using neural networks and minimizes the Wasserstein distance between the two distributions. We further develop a Gaussian mixture prior network to enrich the latent space. Experiments on two popular datasets show that DialogWAE outperforms the state-of-the-art approaches in generating more coherent, informative and diverse responses. |
Tasks | |
Published | 2018-05-31 |
URL | http://arxiv.org/abs/1805.12352v2 |
PDF | http://arxiv.org/pdf/1805.12352v2.pdf |
PWC | https://paperswithcode.com/paper/dialogwae-multimodal-response-generation-with |
Repo | https://github.com/guxd/DialogWAE |
Framework | pytorch |
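The latent-space GAN can be sketched as two networks that transform context-dependent noise into prior and posterior samples, plus a critic that estimates the Wasserstein distance between them. Dimensions are illustrative, and the Gaussian-mixture prior network and gradient penalty are omitted.

```python
# Latent-space GAN skeleton: prior and posterior latents are produced by
# transforming context-conditioned noise; a critic estimates their distance.
import torch
import torch.nn as nn

ctx_dim, resp_dim, noise_dim, z_dim = 128, 128, 32, 64

prior_net = nn.Sequential(nn.Linear(ctx_dim + noise_dim, 128), nn.ReLU(),
                          nn.Linear(128, z_dim))
post_net = nn.Sequential(nn.Linear(ctx_dim + resp_dim + noise_dim, 128), nn.ReLU(),
                         nn.Linear(128, z_dim))
critic = nn.Sequential(nn.Linear(ctx_dim + z_dim, 128), nn.ReLU(),
                       nn.Linear(128, 1))

ctx, resp = torch.randn(16, ctx_dim), torch.randn(16, resp_dim)   # encoded dialogue
z_prior = prior_net(torch.cat([ctx, torch.randn(16, noise_dim)], dim=1))
z_post = post_net(torch.cat([ctx, resp, torch.randn(16, noise_dim)], dim=1))

# The critic (maximised) approximates the Wasserstein distance between the two
# latent distributions; the sampling networks are trained to shrink it.
w_estimate = (critic(torch.cat([ctx, z_post], dim=1)).mean()
              - critic(torch.cat([ctx, z_prior], dim=1)).mean())
print(z_prior.shape, w_estimate.item())
```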
Grounded Video Description
Title | Grounded Video Description |
Authors | Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach |
Abstract | Video description is one of the most challenging problems in vision and language understanding due to the large variability on both the video and the language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data and, importantly, evaluating how grounded or “true” such models are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our dataset, but also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate that our generated sentences are better grounded in the video. |
Tasks | Video Description |
Published | 2018-12-17 |
URL | https://arxiv.org/abs/1812.06587v2 |
PDF | https://arxiv.org/pdf/1812.06587v2.pdf |
PWC | https://paperswithcode.com/paper/grounded-video-description |
Repo | https://github.com/facebookresearch/grounded-video-description |
Framework | pytorch |