Paper Group AWR 144
The “something something” video database for learning and evaluating visual common sense. IQA: Visual Question Answering in Interactive Environments. Estimating the Success of Unsupervised Image to Image Translation. Near-optimal sample complexity for convex tensor completion. Machine Learning of Linear Differential Equations using Gaussian Processes …
The “something something” video database for learning and evaluating visual common sense
Title | The “something something” video database for learning and evaluating visual common sense |
Authors | Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, Roland Memisevic |
Abstract | Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language, like humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects about actions and scenes. In this work, we describe our ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption-templates. We also describe the challenges in crowd-sourcing this data at scale. |
Tasks | Action Recognition In Videos, Common Sense Reasoning, Object Classification, Video Prediction |
Published | 2017-06-13 |
URL | http://arxiv.org/abs/1706.04261v2 |
PDF | http://arxiv.org/pdf/1706.04261v2.pdf |
PWC | https://paperswithcode.com/paper/the-something-something-video-database-for |
Repo | https://github.com/caspillaga/Conv3DSelfAttention |
Framework | pytorch |
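The abstract above defines each of the 174 classes as a caption-template with object placeholders. A minimal sketch of how such template classes can be instantiated into per-video captions; the template strings and loader interface here are illustrative assumptions, not the official dataset tools.

```python
# Minimal sketch (not the official loader): instantiating caption-template classes
# into per-video captions. Template strings here are illustrative placeholders.

TEMPLATES = {
    0: "Putting [something] onto [something]",
    1: "Pushing [something] from left to right",
    2: "Pretending to pick [something] up",
}

def instantiate(template_id, placeholders):
    """Fill the [something] slots of a class template with the annotated objects."""
    caption = TEMPLATES[template_id]
    for obj in placeholders:
        caption = caption.replace("[something]", obj, 1)
    return caption

# One video record: the class id is the supervision target for action recognition,
# while the filled caption ties the label to concrete objects.
print(instantiate(0, ["a mug", "a book"]))  # -> "Putting a mug onto a book"
```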
IQA: Visual Question Answering in Interactive Environments
Title | IQA: Visual Question Answering in Interactive Environments |
Authors | Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi |
Abstract | We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: “Are there any apples in the fridge?” The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects (code and dataset available at https://github.com/danielgordon10/thor-iqa-cvpr-2018). IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98 |
Tasks | Visual Question Answering |
Published | 2017-12-09 |
URL | http://arxiv.org/abs/1712.03316v3 |
PDF | http://arxiv.org/pdf/1712.03316v3.pdf |
PWC | https://paperswithcode.com/paper/iqa-visual-question-answering-in-interactive |
Repo | https://github.com/danielgordon10/thor-iqa-cvpr-2018 |
Framework | tf |
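The abstract above motivates HIMN with a factorized set of controllers operating at multiple levels of temporal abstraction. The sketch below illustrates that control pattern only: a high-level planner repeatedly selects a sub-controller, which then runs at a finer temporal scale until it terminates. The sub-controller names and the environment interface are assumptions for illustration, not the paper's API.

```python
# Hedged sketch of a hierarchical control loop in the spirit of HIMN. The planner and
# sub-controllers are random/placeholder stand-ins; env.step is a hypothetical interface.

import random

SUBTASKS = ["navigate", "open", "scan", "answer"]

def planner(question, memory):
    """High-level policy: choose the next subtask given the question and memory
    (random stand-in for the learned planner)."""
    return random.choice(SUBTASKS)

def run_subtask(name, env, memory, max_steps=20):
    """Low-level controller: executes primitive actions until it reports termination."""
    for _ in range(max_steps):
        obs, done = env.step(name)          # hypothetical environment interface
        memory.append((name, obs))
        if done:
            break
    return name == "answer"

def episode(question, env):
    memory = []
    for _ in range(10):                      # bound on the number of subtasks
        if run_subtask(planner(question, memory), env, memory):
            return memory
    return memory
```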
Estimating the Success of Unsupervised Image to Image Translation
Title | Estimating the Success of Unsupervised Image to Image Translation |
Authors | Sagie Benaim, Tomer Galanti, Lior Wolf |
Abstract | While in supervised learning, the validation error is an unbiased estimator of the generalization (test) error and complexity-based generalization bounds are abundant, no such bounds exist for learning a mapping in an unsupervised way. As a result, when training GANs and specifically when using GANs for learning to map between domains in a completely unsupervised way, one is forced to select the hyperparameters and the stopping epoch by subjectively examining multiple options. We propose a novel bound for predicting the success of unsupervised cross domain mapping methods, which is motivated by the recently proposed Simplicity Principle. The bound can be applied either in expectation, for comparing hyperparameters and for selecting a stopping criterion, or per sample, in order to predict the success of a specific cross-domain translation. The utility of the bound is demonstrated in an extensive set of experiments employing multiple recent algorithms. Our code is available at https://github.com/sagiebenaim/gan_bound. |
Tasks | Image-to-Image Translation, Unsupervised Image-To-Image Translation |
Published | 2017-12-21 |
URL | http://arxiv.org/abs/1712.07886v2 |
PDF | http://arxiv.org/pdf/1712.07886v2.pdf |
PWC | https://paperswithcode.com/paper/estimating-the-success-of-unsupervised-image |
Repo | https://github.com/sagiebenaim/gan_bound |
Framework | pytorch |
Near-optimal sample complexity for convex tensor completion
Title | Near-optimal sample complexity for convex tensor completion |
Authors | Navid Ghadermarzy, Yaniv Plan, Özgür Yılmaz |
Abstract | We analyze low rank tensor completion (TC) using noisy measurements of a subset of the tensor. Assuming a rank-$r$, order-$d$, $N \times N \times \cdots \times N$ tensor where $r=O(1)$, the best sampling complexity that was achieved is $O(N^{\frac{d}{2}})$, which is obtained by solving a tensor nuclear-norm minimization problem. However, this bound is significantly larger than the number of free variables in a low rank tensor which is $O(dN)$. In this paper, we show that by using an atomic-norm whose atoms are rank-$1$ sign tensors, one can obtain a sample complexity of $O(dN)$. Moreover, we generalize the matrix max-norm definition to tensors, which results in a max-quasi-norm (max-qnorm) whose unit ball has small Rademacher complexity. We prove that solving a constrained least squares estimation using either the convex atomic-norm or the nonconvex max-qnorm results in optimal sample complexity for the problem of low-rank tensor completion. Furthermore, we show that these bounds are nearly minimax rate-optimal. We also provide promising numerical results for max-qnorm constrained tensor completion, showing improved recovery results compared to matricization and alternating least squares. |
Tasks | |
Published | 2017-11-14 |
URL | http://arxiv.org/abs/1711.04965v1 |
PDF | http://arxiv.org/pdf/1711.04965v1.pdf |
PWC | https://paperswithcode.com/paper/near-optimal-sample-complexity-for-convex |
Repo | https://github.com/navidghadermarzy/TensorCompletion_1bit_noisy |
Framework | none |
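The estimator the abstract describes is a least-squares fit to the observed entries under a max-qnorm (or, for the convex variant, atomic-norm) ball constraint. A compact way to write it, with notation paraphrased from the abstract rather than copied from the paper:

```latex
% Max-qnorm-constrained least squares over the observed index set \Omega of a noisy
% rank-r, order-d, N x ... x N tensor Y (notation paraphrased from the abstract):
\hat{T} \;=\; \operatorname*{arg\,min}_{\|X\|_{\max} \le R}\;
\sum_{\omega \in \Omega} \bigl( X_{\omega} - Y_{\omega} \bigr)^{2},
\qquad |\Omega| = O(dN) \ \text{samples suffice when } r = O(1).
```

Replacing the max-qnorm ball with the atomic-norm ball (atoms being rank-1 sign tensors) gives the convex program with the same sample-complexity scaling stated in the abstract.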
Machine Learning of Linear Differential Equations using Gaussian Processes
Title | Machine Learning of Linear Differential Equations using Gaussian Processes |
Authors | Maziar Raissi, George Em. Karniadakis |
Abstract | This work leverages recent advances in probabilistic machine learning to discover conservation laws expressed by parametric linear equations. Such equations involve, but are not limited to, ordinary and partial differential, integro-differential, and fractional order operators. Here, Gaussian process priors are modified according to the particular form of such operators and are employed to infer parameters of the linear equations from scarce and possibly noisy observations. Such observations may come from experiments or “black-box” computer simulations. |
Tasks | Gaussian Processes |
Published | 2017-01-10 |
URL | http://arxiv.org/abs/1701.02440v1 |
PDF | http://arxiv.org/pdf/1701.02440v1.pdf |
PWC | https://paperswithcode.com/paper/machine-learning-of-linear-differential |
Repo | https://github.com/Slowpuncher24/mlhiphy_v2 |
Framework | none |
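The abstract says the GP prior is modified according to the form of the linear operator so that the operator's parameters can be inferred from scarce, noisy observations. The sketch below shows that construction on a deliberately simplified toy problem of my own choosing (operator f = α du/dx, RBF kernel): because differentiation is linear, (u, f) are jointly Gaussian with cross-covariances obtained by differentiating the kernel, and α is recovered by maximizing the joint log marginal likelihood. This is an illustration of the general idea, not the paper's code.

```python
# Simplified sketch (toy setup, not the paper's code): infer the parameter alpha of
# the linear operator  f = L_alpha u = alpha * du/dx  from noisy samples of u and f,
# by placing a GP prior on u and maximizing the joint marginal likelihood.

import numpy as np
from scipy.optimize import minimize

def k(x, xp, l):                      # RBF kernel k(x, x')
    r = x[:, None] - xp[None, :]
    return np.exp(-0.5 * r**2 / l**2)

def k_du_dxp(x, xp, l):               # d/dx' k(x, x')
    r = x[:, None] - xp[None, :]
    return (r / l**2) * np.exp(-0.5 * r**2 / l**2)

def k_d2(x, xp, l):                   # d^2/(dx dx') k(x, x')
    r = x[:, None] - xp[None, :]
    return (1.0 / l**2 - r**2 / l**4) * np.exp(-0.5 * r**2 / l**2)

def neg_log_marginal(params, xu, yu, xf, yf, noise=1e-4):
    log_l, alpha = params
    l = np.exp(log_l)
    # Joint covariance of [u(xu), f(xf)] under u ~ GP(0, k), f = alpha * du/dx.
    Kuu = k(xu, xu, l)
    Kuf = alpha * k_du_dxp(xu, xf, l)
    Kff = alpha**2 * k_d2(xf, xf, l)
    K = np.block([[Kuu, Kuf], [Kuf.T, Kff]]) + noise * np.eye(len(xu) + len(xf))
    y = np.concatenate([yu, yf])
    sign, logdet = np.linalg.slogdet(K)
    return 0.5 * (y @ np.linalg.solve(K, y) + logdet)

# Toy data: u(x) = sin(x), so f = alpha_true * cos(x).
rng = np.random.default_rng(0)
alpha_true = 2.5
xu = np.linspace(0, 6, 15);  yu = np.sin(xu) + 0.01 * rng.standard_normal(15)
xf = np.linspace(0, 6, 15);  yf = alpha_true * np.cos(xf) + 0.01 * rng.standard_normal(15)

res = minimize(neg_log_marginal, x0=[0.0, 1.0], args=(xu, yu, xf, yf))
print("estimated alpha:", res.x[1])   # should land close to 2.5
```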
Joint Embedding of Graphs
Title | Joint Embedding of Graphs |
Authors | Shangsi Wang, Jesús Arroyo, Joshua T. Vogelstein, Carey E. Priebe |
Abstract | Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric matrices and projects adjacency matrices of graphs into this subspace. The projection coefficients can be treated as features of the graphs, while the embedding components can represent vertex features. We also propose a random graph model for multiple graphs that generalizes other classical models for graphs. We show through theory and numerical experiments that under the model, the joint embedding method produces estimates of parameters with small errors. Via simulation experiments, we demonstrate that the joint embedding method produces features which lead to state of the art performance in classifying graphs. Applying the joint embedding method to human brain graphs, we find it extracts interpretable features with good prediction accuracy in different tasks. |
Tasks | Dimensionality Reduction |
Published | 2017-03-10 |
URL | https://arxiv.org/abs/1703.03862v4 |
PDF | https://arxiv.org/pdf/1703.03862v4.pdf |
PWC | https://paperswithcode.com/paper/joint-embedding-of-graphs |
Repo | https://github.com/shangsiwang/Joint-Embedding |
Framework | none |
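The abstract describes identifying a subspace spanned by rank-one symmetric matrices and projecting each adjacency matrix onto it, with the projection coefficients serving as graph features. The joint objective implied by that description can be written compactly as below; the notation is a paraphrase of the abstract, not the paper's own.

```latex
% m graphs with adjacency matrices A_1, ..., A_m; d rank-one symmetric components
% h_k h_k^T shared across graphs; per-graph coefficients lambda_{ik} act as features.
\min_{\{h_k\},\,\{\lambda_{ik}\}}\;
\sum_{i=1}^{m} \Bigl\| A_i \;-\; \sum_{k=1}^{d} \lambda_{ik}\, h_k h_k^{\top} \Bigr\|_F^{2}
```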
Excitation Backprop for RNNs
Title | Excitation Backprop for RNNs |
Authors | Sarah Adel Bargal, Andrea Zunino, Donghyun Kim, Jianming Zhang, Vittorio Murino, Stan Sclaroff |
Abstract | Deep models are state-of-the-art for many vision tasks including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has been studied in spatial visual content, giving more insight into model predictions for images. However, such studies are relatively lacking for models of spatiotemporal visual content - videos. In this work, we devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep model’s classification/captioning output using the model’s internal representation. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond with a specific action, or phrase from a caption, without explicitly optimizing/training for these tasks. |
Tasks | Temporal Action Localization, Video Captioning |
Published | 2017-11-18 |
URL | http://arxiv.org/abs/1711.06778v3 |
PDF | http://arxiv.org/pdf/1711.06778v3.pdf |
PWC | https://paperswithcode.com/paper/excitation-backprop-for-rnns |
Repo | https://github.com/sbargal/Caffe-ExcitationBP-RNNs |
Framework | none |
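The top-down saliency in the abstract builds on the excitation backprop rule (Zhang et al.), which this work extends from spatial to spatiotemporal models: winning probabilities are passed backward through a layer in proportion to the positive weights times the child activations. Below is a numpy sketch of that rule for a single fully connected layer, as a simplified illustration; the paper applies the same idea through recurrent units as well.

```python
# Sketch of the excitation backprop rule for one fully connected layer, following the
# probabilistic winner-take-all formulation this paper builds on.
# a: lower-layer activations (non-negative), shape (n_in,)
# W: weights mapping lower to upper layer, shape (n_out, n_in)
# p_out: marginal winning probabilities of the upper neurons, shape (n_out,)

import numpy as np

def excitation_backprop_fc(p_out, W, a, eps=1e-12):
    Wp = np.clip(W, 0.0, None)            # keep only excitatory (positive) connections
    contrib = Wp * a[None, :]             # each child's excitation of each parent
    contrib /= contrib.sum(axis=1, keepdims=True) + eps   # normalize per parent
    return contrib.T @ p_out              # redistribute parent probability to children

# Tiny example: relevance over 4 input units given one-hot "evidence" at the output.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
a = np.abs(rng.standard_normal(4))
p = np.array([1.0, 0.0, 0.0])
print(excitation_backprop_fc(p, W, a))    # sums to ~1, giving per-unit relevance
```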
Action Tubelet Detector for Spatio-Temporal Action Localization
Title | Action Tubelet Detector for Spatio-Temporal Action Localization |
Authors | Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid |
Abstract | Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating at the frame level. We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. The same way state-of-the-art object detectors rely on anchor boxes, our ACT-detector is based on anchor cuboids. We build upon the SSD framework. Convolutional features are extracted for each frame, while scores and regressions are based on the temporal stacking of these features, thus exploiting information from a sequence. Our experimental results show that leveraging sequences of frames significantly improves detection performance over using individual frames. The gain of our tubelet detector can be explained by both more accurate scores and more precise localization. Our ACT-detector outperforms the state-of-the-art methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds. |
Tasks | Action Localization, Spatio-Temporal Action Localization, Temporal Action Localization |
Published | 2017-05-04 |
URL | http://arxiv.org/abs/1705.01861v3 |
PDF | http://arxiv.org/pdf/1705.01861v3.pdf |
PWC | https://paperswithcode.com/paper/action-tubelet-detector-for-spatio-temporal |
Repo | https://github.com/qingzhiwu/pytorch-act-detector |
Framework | pytorch |
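The abstract describes per-frame convolutional features stacked over a sequence of K frames, from which anchor-cuboid scores and per-frame box regressions are predicted. A minimal PyTorch sketch of such a tubelet head follows; the tiny backbone, anchor count, and layer sizes are placeholders, not the paper's SSD configuration.

```python
# Minimal PyTorch sketch of a tubelet prediction head in the spirit of the ACT-detector:
# per-frame features are stacked across K frames, then one head scores each anchor
# cuboid and another regresses K boxes (one per frame). Backbone and anchors are
# placeholders, not the paper's SSD setup.

import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    def __init__(self, K=6, num_classes=25, num_anchors=4, feat=64):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in per-frame feature extractor
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        stacked = K * feat
        self.cls = nn.Conv2d(stacked, num_anchors * (num_classes + 1), 3, padding=1)
        self.reg = nn.Conv2d(stacked, num_anchors * 4 * K, 3, padding=1)

    def forward(self, clip):                      # clip: (B, K, 3, H, W)
        B, K, C, H, W = clip.shape
        feats = self.backbone(clip.flatten(0, 1))          # (B*K, feat, h, w)
        feats = feats.unflatten(0, (B, K)).flatten(1, 2)    # stack along channels
        return self.cls(feats), self.reg(feats)   # cuboid scores, per-frame box offsets

scores, boxes = TubeletHead()(torch.randn(2, 6, 3, 128, 128))
print(scores.shape, boxes.shape)
```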
Use of Deep Learning in Modern Recommendation System: A Summary of Recent Works
Title | Use of Deep Learning in Modern Recommendation System: A Summary of Recent Works |
Authors | Ayush Singhal, Pradeep Sinha, Rakesh Pant |
Abstract | With the exponential increase in the amount of digital information on the internet, online shops, online music, video and image libraries, search engines and recommendation systems have become the most convenient ways to find relevant information within a short time. In recent times, advances in deep learning have gained significant attention in the fields of speech recognition, image processing and natural language processing. Meanwhile, several recent studies have shown the utility of deep learning in the areas of recommendation systems and information retrieval as well. In this short review, we cover the recent advances made in the field of recommendation using various variants of deep learning technology. We organize the review in three parts: collaborative systems, content-based systems and hybrid systems. The review also discusses the contribution of deep-learning-integrated recommendation systems in several application domains. The review concludes with a discussion of the impact of deep learning on recommendation systems in various domains and of whether deep learning has shown any significant improvement over conventional recommendation systems. Finally, we outline possible future research directions based on the current state of deep learning use in recommendation systems. |
Tasks | Information Retrieval, Recommendation Systems, Speech Recognition |
Published | 2017-12-20 |
URL | http://arxiv.org/abs/1712.07525v1 |
PDF | http://arxiv.org/pdf/1712.07525v1.pdf |
PWC | https://paperswithcode.com/paper/use-of-deep-learning-in-modern-recommendation |
Repo | https://github.com/anuragreddygv323/Important-stuff |
Framework | tf |
Deep word embeddings for visual speech recognition
Title | Deep word embeddings for visual speech recognition |
Authors | Themos Stafylakis, Georgios Tzimiropoulos |
Abstract | In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and illumination. The system is comprised of a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs and is trained on the Lipreading in-the-wild database. We first show that the proposed architecture goes beyond state-of-the-art on closed-set word identification, by attaining 11.92% error rate on a vocabulary of 500 words. We then examine the capacity of the embeddings in modelling words unseen during training. We deploy Probabilistic Linear Discriminant Analysis (PLDA) to model the embeddings and perform low-shot learning experiments on words unseen during training. The experiments demonstrate that word-level visual speech recognition is feasible even in cases where the target words are not included in the training set. |
Tasks | Lipreading, Speech Recognition, Visual Speech Recognition, Word Embeddings |
Published | 2017-10-30 |
URL | http://arxiv.org/abs/1710.11201v1 |
PDF | http://arxiv.org/pdf/1710.11201v1.pdf |
PWC | https://paperswithcode.com/paper/deep-word-embeddings-for-visual-speech |
Repo | https://github.com/tstafylakis/Lipreading-ResNet |
Framework | pytorch |
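The abstract lists the architecture's components: a spatiotemporal convolutional front end, a residual network, and bidirectional LSTMs, trained for closed-set word classification, with the resulting vectors used as word embeddings. A condensed PyTorch sketch of that architecture family is below; the 2D trunk is a small stand-in for the paper's ResNet and all layer sizes are assumptions.

```python
# Condensed PyTorch sketch of the architecture family described in the abstract:
# a 3D spatiotemporal conv front end, a per-frame 2D trunk (small stand-in for the
# paper's ResNet), and a bidirectional LSTM over time. The penultimate vector serves
# as the word embedding; all layer sizes here are assumptions.

import torch
import torch.nn as nn

class VisualWordEmbedder(nn.Module):
    def __init__(self, num_words=500, feat=128, hidden=256):
        super().__init__()
        self.front = nn.Sequential(                     # spatiotemporal front end
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        self.trunk = nn.Sequential(                     # stand-in for the ResNet trunk
            nn.Conv2d(32, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_words)

    def forward(self, mouths):                          # (B, 1, T, H, W) grayscale clips
        x = self.front(mouths)                          # (B, 32, T, h, w)
        B, C, T, H, W = x.shape
        x = x.permute(0, 2, 1, 3, 4).flatten(0, 1)      # (B*T, 32, h, w)
        x = self.trunk(x).flatten(1)                    # (B*T, feat)
        x = x.unflatten(0, (B, T))                      # (B, T, feat)
        out, _ = self.lstm(x)
        embedding = out.mean(dim=1)                     # (B, 2*hidden) word embedding
        return self.fc(embedding), embedding

logits, emb = VisualWordEmbedder()(torch.randn(2, 1, 29, 88, 88))
print(logits.shape, emb.shape)
```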
Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations
Title | Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations |
Authors | H. V. Koops, W. B. de Haas, J. Bransen, A. Volk |
Abstract | The increasing accuracy of automatic chord estimation systems, the availability of vast amounts of heterogeneous reference annotations, and insights from annotator subjectivity research make chord label personalization increasingly important. Nevertheless, automatic chord estimation systems are historically exclusively trained and evaluated on a single reference annotation. We introduce a first approach to automatic chord label personalization by modeling subjectivity through deep learning of a harmonic interval-based chord label representation. After integrating these representations from multiple annotators, we can accurately personalize chord labels for individual annotators from a single model and the annotators’ chord label vocabulary. Furthermore, we show that chord personalization using multiple reference annotations outperforms using a single reference annotation. |
Tasks | |
Published | 2017-06-29 |
URL | http://arxiv.org/abs/1706.09552v1 |
PDF | http://arxiv.org/pdf/1706.09552v1.pdf |
PWC | https://paperswithcode.com/paper/chord-label-personalization-through-deep |
Repo | https://github.com/hvkoops/chordlabelpersonalization |
Framework | tf |
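The abstract centers on a harmonic interval-based chord label representation that can be integrated across annotators. Below is a toy sketch of one plausible encoding in that spirit (root one-hot plus a 12-dimensional interval vector above the root); the exact representation and chord vocabulary used in the paper may differ.

```python
# Toy sketch of a harmonic interval-based chord encoding: a chord label becomes its
# root pitch class (one-hot) plus a 12-d binary vector of intervals in semitones
# above the root. One plausible encoding in the spirit of the abstract, not
# necessarily the paper's exact representation.

import numpy as np

PITCH_CLASSES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                 "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
QUALITY_INTERVALS = {"maj": [0, 4, 7], "min": [0, 3, 7], "7": [0, 4, 7, 10]}

def encode(label):
    root, quality = label.split(":")
    root_vec = np.zeros(12); root_vec[PITCH_CLASSES[root]] = 1.0
    intervals = np.zeros(12); intervals[QUALITY_INTERVALS[quality]] = 1.0
    return np.concatenate([root_vec, intervals])   # 24-d: root one-hot + intervals

# Averaging encodings from multiple annotators yields a soft, integrated target that a
# network can be trained on, from which per-annotator labels can later be decoded.
annotations = ["C:maj", "C:maj", "A:min"]
print(np.mean([encode(a) for a in annotations], axis=0))
```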
Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art
Title | Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art |
Authors | Yuan Gao, Brij Mohan Lal Srivastava, James Salsman |
Abstract | We use automatic speech recognition to assess spoken English learner pronunciation based on the authentic intelligibility of the learners’ spoken responses determined from support vector machine (SVM) classifier or deep learning neural network model predictions of transcription correctness. Using numeric features produced by PocketSphinx alignment mode and many recognition passes searching for the substitution and deletion of each expected phoneme and insertion of unexpected phonemes in sequence, the SVM models achieve 82 percent agreement with the accuracy of Amazon Mechanical Turk crowdworker transcriptions, up from 75 percent reported by multiple independent researchers. Using such features with SVM classifier probability prediction models can help computer-aided pronunciation teaching (CAPT) systems provide intelligibility remediation. |
Tasks | Speech Recognition |
Published | 2017-09-06 |
URL | http://arxiv.org/abs/1709.01713v3 |
PDF | http://arxiv.org/pdf/1709.01713v3.pdf |
PWC | https://paperswithcode.com/paper/spoken-english-intelligibility-remediation |
Repo | https://github.com/jsalsman/featex |
Framework | none |
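The abstract describes SVM classifiers with probability outputs trained on PocketSphinx alignment features to predict transcription correctness. A minimal scikit-learn sketch of that classification step follows; the feature vectors and labels are random placeholders standing in for the alignment/phoneme-substitution features.

```python
# Minimal scikit-learn sketch of the classification step described in the abstract:
# an SVM with probability outputs predicts whether a crowdworker transcription of an
# utterance would be correct, from per-utterance alignment features. The features and
# labels below are random placeholders for the PocketSphinx-derived ones.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 40))              # placeholder alignment/phoneme features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # placeholder "intelligible" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]           # probability of an intelligible response
print("accuracy:", clf.score(X_te, y_te))
```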
Sequential Dialogue Context Modeling for Spoken Language Understanding
Title | Sequential Dialogue Context Modeling for Spoken Language Understanding |
Authors | Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, Larry Heck |
Abstract | Spoken Language Understanding (SLU) is a key component of goal oriented dialogue systems that would parse user utterances into semantic frame representations. Traditionally SLU does not utilize the dialogue history beyond the previous system turn and contextual ambiguities are resolved by the downstream components. In this paper, we explore novel approaches for modeling dialogue context in a recurrent neural network (RNN) based language understanding system. We propose the Sequential Dialogue Encoder Network, that allows encoding context from the dialogue history in chronological order. We compare the performance of our proposed architecture with two context models, one that uses just the previous turn context and another that encodes dialogue context in a memory network, but loses the order of utterances in the dialogue history. Experiments with a multi-domain dialogue dataset demonstrate that the proposed architecture results in reduced semantic frame error rates. |
Tasks | Goal-Oriented Dialogue Systems, Spoken Language Understanding |
Published | 2017-05-08 |
URL | http://arxiv.org/abs/1705.03455v3 |
PDF | http://arxiv.org/pdf/1705.03455v3.pdf |
PWC | https://paperswithcode.com/paper/sequential-dialogue-context-modeling-for |
Repo | https://github.com/sunbopds/SDEN-Pytorch-master |
Framework | pytorch |
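The abstract contrasts an order-preserving dialogue-history encoder with a memory network that discards utterance order. A compact PyTorch sketch of the order-preserving idea: encode each past utterance into a vector, run a GRU over those vectors in chronological order, and combine the final state with the current-turn encoding. Embedding sizes and the word-level encoder are assumptions, not the paper's exact architecture.

```python
# Compact PyTorch sketch of an order-preserving dialogue context encoder in the spirit
# of the Sequential Dialogue Encoder Network. Sizes and the word-level encoder are
# assumptions for illustration.

import torch
import torch.nn as nn

class SequentialContextEncoder(nn.Module):
    def __init__(self, vocab=5000, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.utt_enc = nn.GRU(emb, hid, batch_first=True)      # word-level encoder
        self.ctx_enc = nn.GRU(hid, hid, batch_first=True)      # turn-level, in order
        self.out = nn.Linear(2 * hid, hid)

    def encode_utterance(self, tokens):                        # tokens: (B, T) ids
        _, h = self.utt_enc(self.embed(tokens))
        return h[-1]                                            # (B, hid)

    def forward(self, history, current):
        # history: list of (B, T_i) past utterances in chronological order
        turns = torch.stack([self.encode_utterance(u) for u in history], dim=1)
        _, h_ctx = self.ctx_enc(turns)                          # order-aware context
        h_cur = self.encode_utterance(current)
        return self.out(torch.cat([h_ctx[-1], h_cur], dim=-1))  # features for frame tagging

enc = SequentialContextEncoder()
history = [torch.randint(0, 5000, (2, 7)) for _ in range(3)]
current = torch.randint(0, 5000, (2, 9))
print(enc(history, current).shape)   # (2, 128)
```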
Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions
Title | Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions |
Authors | Oscar Li, Hao Liu, Chaofan Chen, Cynthia Rudin |
Abstract | Deep neural networks are widely used for classification. These deep models often suffer from a lack of interpretability – they are particularly difficult to understand because of their non-linear nature. As a result, neural networks are often treated as “black box” models, and in the past, have been trained purely to optimize the accuracy of predictions. In this work, we create a novel network architecture for deep learning that naturally explains its own reasoning for each prediction. This architecture contains an autoencoder and a special prototype layer, where each unit of that layer stores a weight vector that resembles an encoded training input. The encoder of the autoencoder allows us to do comparisons within the latent space, while the decoder allows us to visualize the learned prototypes. The training objective has four terms: an accuracy term, a term that encourages every prototype to be similar to at least one encoded input, a term that encourages every encoded input to be close to at least one prototype, and a term that encourages faithful reconstruction by the autoencoder. The distances computed in the prototype layer are used as part of the classification process. Since the prototypes are learned during training, the learned network naturally comes with explanations for each prediction, and the explanations are loyal to what the network actually computes. |
Tasks | |
Published | 2017-10-13 |
URL | http://arxiv.org/abs/1710.04806v2 |
PDF | http://arxiv.org/pdf/1710.04806v2.pdf |
PWC | https://paperswithcode.com/paper/deep-learning-for-case-based-reasoning |
Repo | https://github.com/OscarcarLi/PrototypeDL |
Framework | tf |
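The abstract spells out the architecture (autoencoder plus prototype layer, with prototype distances feeding the classifier) and the four training terms. A condensed PyTorch sketch of that objective follows; layer sizes and loss weights are placeholders, not the paper's settings.

```python
# Condensed PyTorch sketch of the four-term objective described in the abstract:
# cross-entropy accuracy term, autoencoder reconstruction term, and the two
# prototype/encoding proximity terms. Layer sizes and loss weights are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeNet(nn.Module):
    def __init__(self, in_dim=784, latent=32, n_protos=15, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, in_dim))
        self.prototypes = nn.Parameter(torch.randn(n_protos, latent))
        self.classifier = nn.Linear(n_protos, n_classes)   # weights on prototype distances

    def forward(self, x):
        z = self.encoder(x)
        dists = torch.cdist(z, self.prototypes) ** 2        # (B, n_protos)
        return self.classifier(dists), self.decoder(z), z, dists

def loss(model, x, y, lam=(0.05, 0.05, 0.05)):
    logits, recon, z, dists = model(x)
    ce = F.cross_entropy(logits, y)                          # accuracy term
    rec = F.mse_loss(recon, x)                               # reconstruction term
    r1 = dists.min(dim=0).values.mean()   # each prototype near some encoded input
    r2 = dists.min(dim=1).values.mean()   # each encoded input near some prototype
    return ce + lam[0] * rec + lam[1] * r1 + lam[2] * r2

model = PrototypeNet()
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
print(loss(model, x, y).item())
```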
Real-time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs
Title | Real-time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs |
Authors | Andres Milioto, Philipp Lottes, Cyrill Stachniss |
Abstract | Precision farming robots, which aim to reduce the amount of herbicide that needs to be applied in the fields, must be able to identify crops and weeds in real time in order to trigger weeding actions. In this paper, we address the problem of CNN-based semantic segmentation of crop fields, separating sugar beet plants, weeds, and background solely from RGB data. We propose a CNN that exploits existing vegetation indexes and provides a classification in real time. Furthermore, it can be effectively re-trained for previously unseen fields with a comparably small amount of training data. We implemented and thoroughly evaluated our system on a real agricultural robot operating in different fields in Germany and Switzerland. The results show that our system generalizes well, can operate at around 20 Hz, and is suitable for online operation in the fields. |
Tasks | Real-Time Semantic Segmentation, Semantic Segmentation |
Published | 2017-09-20 |
URL | http://arxiv.org/abs/1709.06764v2 |
PDF | http://arxiv.org/pdf/1709.06764v2.pdf |
PWC | https://paperswithcode.com/paper/real-time-semantic-segmentation-of-crop-and |
Repo | https://github.com/PRBonn/bonnet |
Framework | tf |
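The abstract says the CNN exploits existing vegetation indexes computed from RGB. A small sketch of that idea: compute one such index and feed it to the network as an extra input channel alongside RGB. Excess Green (ExG) is used here only as an example index, and the tiny encoder-decoder is a stand-in, not the paper's architecture.

```python
# Small sketch of feeding a vegetation index alongside RGB to a segmentation CNN,
# in the spirit of the abstract. Excess Green (ExG) is an example index; the exact
# indexes and architecture of the paper's network are not reproduced here.

import torch
import torch.nn as nn

def excess_green(rgb):                       # rgb: (B, 3, H, W) in [0, 1]
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return (2 * g - r - b).unsqueeze(1)      # (B, 1, H, W) vegetation index channel

class TinySegNet(nn.Module):
    """Stand-in encoder-decoder taking RGB + index channels; 3 classes:
    background, crop (sugar beet), weed."""
    def __init__(self, in_ch=4, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        x = torch.cat([rgb, excess_green(rgb)], dim=1)   # append the index channel
        return self.net(x)                                # (B, 3, H, W) class logits

logits = TinySegNet()(torch.rand(1, 3, 256, 256))
print(logits.shape)
```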