Paper Group AWR 144
The “something something” video database for learning and evaluating visual common sense. IQA: Visual Question Answering in Interactive Environments. Estimating the Success of Unsupervised Image to Image Translation. Near-optimal sample complexity for convex tensor completion. Machine Learning of Linear Differential Equations using Gaussian Processes …
The “something something” video database for learning and evaluating visual common sense
Title | The “something something” video database for learning and evaluating visual common sense |
Authors | Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, Roland Memisevic |
Abstract | Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language, like humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects about actions and scenes. In this work, we describe our ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption-templates. We also describe the challenges in crowd-sourcing this data at scale. |
Tasks | Action Recognition In Videos, Common Sense Reasoning, Object Classification, Video Prediction |
Published | 2017-06-13 |
URL | http://arxiv.org/abs/1706.04261v2 |
PDF | http://arxiv.org/pdf/1706.04261v2.pdf |
PWC | https://paperswithcode.com/paper/the-something-something-video-database-for |
Repo | https://github.com/caspillaga/Conv3DSelfAttention |
Framework | pytorch |
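The abstract above defines each of the 174 classes as a caption-template with object placeholders. A minimal sketch of how such template classes can be instantiated into per-video captions; the template strings and loader interface here are illustrative assumptions, not the official dataset tools.

```python
# Minimal sketch (not the official loader): instantiating caption-template classes
# into per-video captions. Template strings here are illustrative placeholders.

TEMPLATES = {
    0: "Putting [something] onto [something]",
    1: "Pushing [something] from left to right",
    2: "Pretending to pick [something] up",
}

def instantiate(template_id, placeholders):
    """Fill the [something] slots of a class template with the annotated objects."""
    caption = TEMPLATES[template_id]
    for obj in placeholders:
        caption = caption.replace("[something]", obj, 1)
    return caption

# One video record: the class id is the supervision target for action recognition,
# while the filled caption ties the label to concrete objects.
print(instantiate(0, ["a mug", "a book"]))  # -> "Putting a mug onto a book"
```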
IQA: Visual Question Answering in Interactive Environments
Title | IQA: Visual Question Answering in Interactive Environments |
Authors | Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi |
Abstract | We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: “Are there any apples in the fridge?” The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects (code and dataset available at https://github.com/danielgordon10/thor-iqa-cvpr-2018). IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98 |
Tasks | Visual Question Answering |
Published | 2017-12-09 |
URL | http://arxiv.org/abs/1712.03316v3 |
PDF | http://arxiv.org/pdf/1712.03316v3.pdf |
PWC | https://paperswithcode.com/paper/iqa-visual-question-answering-in-interactive |
Repo | https://github.com/danielgordon10/thor-iqa-cvpr-2018 |
Framework | tf |
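The abstract above motivates HIMN with a factorized set of controllers operating at multiple levels of temporal abstraction. The sketch below illustrates that control pattern only: a high-level planner repeatedly selects a sub-controller, which then runs at a finer temporal scale until it terminates. The sub-controller names and the environment interface are assumptions for illustration, not the paper's API.

```python
# Hedged sketch of a hierarchical control loop in the spirit of HIMN. The planner and
# sub-controllers are random/placeholder stand-ins; env.step is a hypothetical interface.

import random

SUBTASKS = ["navigate", "open", "scan", "answer"]

def planner(question, memory):
    """High-level policy: choose the next subtask given the question and memory
    (random stand-in for the learned planner)."""
    return random.choice(SUBTASKS)

def run_subtask(name, env, memory, max_steps=20):
    """Low-level controller: executes primitive actions until it reports termination."""
    for _ in range(max_steps):
        obs, done = env.step(name)          # hypothetical environment interface
        memory.append((name, obs))
        if done:
            break
    return name == "answer"

def episode(question, env):
    memory = []
    for _ in range(10):                      # bound on the number of subtasks
        if run_subtask(planner(question, memory), env, memory):
            return memory
    return memory
```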
Estimating the Success of Unsupervised Image to Image Translation
Title | Estimating the Success of Unsupervised Image to Image Translation |
Authors | Sagie Benaim, Tomer Galanti, Lior Wolf |
Abstract | While in supervised learning, the validation error is an unbiased estimator of the generalization (test) error and complexity-based generalization bounds are abundant, no such bounds exist for learning a mapping in an unsupervised way. As a result, when training GANs and specifically when using GANs for learning to map between domains in a completely unsupervised way, one is forced to select the hyperparameters and the stopping epoch by subjectively examining multiple options. We propose a novel bound for predicting the success of unsupervised cross domain mapping methods, which is motivated by the recently proposed Simplicity Principle. The bound can be applied either in expectation, for comparing hyperparameters and for selecting a stopping criterion, or per sample, in order to predict the success of a specific cross-domain translation. The utility of the bound is demonstrated in an extensive set of experiments employing multiple recent algorithms. Our code is available at https://github.com/sagiebenaim/gan_bound. |
Tasks | Image-to-Image Translation, Unsupervised Image-To-Image Translation |
Published | 2017-12-21 |
URL | http://arxiv.org/abs/1712.07886v2 |
PDF | http://arxiv.org/pdf/1712.07886v2.pdf |
PWC | https://paperswithcode.com/paper/estimating-the-success-of-unsupervised-image |
Repo | https://github.com/sagiebenaim/gan_bound |
Framework | pytorch |
Near-optimal sample complexity for convex tensor completion
Title | Near-optimal sample complexity for convex tensor completion |
Authors | Navid Ghadermarzy, Yaniv Plan, Özgür Yılmaz |
Abstract | We analyze low rank tensor completion (TC) using noisy measurements of a subset of the tensor. Assuming a rank-$r$, order-$d$, $N \times N \times \cdots \times N$ tensor where $r=O(1)$, the best sampling complexity that was achieved is $O(N^{\frac{d}{2}})$, which is obtained by solving a tensor nuclear-norm minimization problem. However, this bound is significantly larger than the number of free variables in a low rank tensor which is $O(dN)$. In this paper, we show that by using an atomic-norm whose atoms are rank-$1$ sign tensors, one can obtain a sample complexity of $O(dN)$. Moreover, we generalize the matrix max-norm definition to tensors, which results in a max-quasi-norm (max-qnorm) whose unit ball has small Rademacher complexity. We prove that solving a constrained least squares estimation using either the convex atomic-norm or the nonconvex max-qnorm results in optimal sample complexity for the problem of low-rank tensor completion. Furthermore, we show that these bounds are nearly minimax rate-optimal. We also provide promising numerical results for max-qnorm constrained tensor completion, showing improved recovery results compared to matricization and alternating least squares. |
Tasks | |
Published | 2017-11-14 |
URL | http://arxiv.org/abs/1711.04965v1 |
PDF | http://arxiv.org/pdf/1711.04965v1.pdf |
PWC | https://paperswithcode.com/paper/near-optimal-sample-complexity-for-convex |
Repo | https://github.com/navidghadermarzy/TensorCompletion_1bit_noisy |
Framework | none |
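The estimator the abstract describes is a least-squares fit to the observed entries under a max-qnorm (or, for the convex variant, atomic-norm) ball constraint. A compact way to write it, with notation paraphrased from the abstract rather than copied from the paper:

```latex
% Max-qnorm-constrained least squares over the observed index set \Omega of a noisy
% rank-r, order-d, N x ... x N tensor Y (notation paraphrased from the abstract):
\hat{T} \;=\; \operatorname*{arg\,min}_{\|X\|_{\max} \le R}\;
\sum_{\omega \in \Omega} \bigl( X_{\omega} - Y_{\omega} \bigr)^{2},
\qquad |\Omega| = O(dN) \ \text{samples suffice when } r = O(1).
```

Replacing the max-qnorm ball with the atomic-norm ball (atoms being rank-1 sign tensors) gives the convex program with the same sample-complexity scaling stated in the abstract.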
Machine Learning of Linear Differential Equations using Gaussian Processes
Title | Machine Learning of Linear Differential Equations using Gaussian Processes |
Authors | Maziar Raissi, George Em. Karniadakis |
Abstract | This work leverages recent advances in probabilistic machine learning to discover conservation laws expressed by parametric linear equations. Such equations involve, but are not limited to, ordinary and partial differential, integro-differential, and fractional order operators. Here, Gaussian process priors are modified according to the particular form of such operators and are employed to infer parameters of the linear equations from scarce and possibly noisy observations. Such observations may come from experiments or “black-box” computer simulations. |
Tasks | Gaussian Processes |
Published | 2017-01-10 |
URL | http://arxiv.org/abs/1701.02440v1 |
PDF | http://arxiv.org/pdf/1701.02440v1.pdf |
PWC | https://paperswithcode.com/paper/machine-learning-of-linear-differential |
Repo | https://github.com/Slowpuncher24/mlhiphy_v2 |
Framework | none |
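The abstract says the GP prior is modified according to the form of the linear operator so that the operator's parameters can be inferred from scarce, noisy observations. The sketch below shows that construction on a deliberately simplified toy problem of my own choosing (operator f = α du/dx, RBF kernel): because differentiation is linear, (u, f) are jointly Gaussian with cross-covariances obtained by differentiating the kernel, and α is recovered by maximizing the joint log marginal likelihood. This is an illustration of the general idea, not the paper's code.

```python
# Simplified sketch (toy setup, not the paper's code): infer the parameter alpha of
# the linear operator  f = L_alpha u = alpha * du/dx  from noisy samples of u and f,
# by placing a GP prior on u and maximizing the joint marginal likelihood.

import numpy as np
from scipy.optimize import minimize

def k(x, xp, l):                      # RBF kernel k(x, x')
    r = x[:, None] - xp[None, :]
    return np.exp(-0.5 * r**2 / l**2)

def k_du_dxp(x, xp, l):               # d/dx' k(x, x')
    r = x[:, None] - xp[None, :]
    return (r / l**2) * np.exp(-0.5 * r**2 / l**2)

def k_d2(x, xp, l):                   # d^2/(dx dx') k(x, x')
    r = x[:, None] - xp[None, :]
    return (1.0 / l**2 - r**2 / l**4) * np.exp(-0.5 * r**2 / l**2)

def neg_log_marginal(params, xu, yu, xf, yf, noise=1e-4):
    log_l, alpha = params
    l = np.exp(log_l)
    # Joint covariance of [u(xu), f(xf)] under u ~ GP(0, k), f = alpha * du/dx.
    Kuu = k(xu, xu, l)
    Kuf = alpha * k_du_dxp(xu, xf, l)
    Kff = alpha**2 * k_d2(xf, xf, l)
    K = np.block([[Kuu, Kuf], [Kuf.T, Kff]]) + noise * np.eye(len(xu) + len(xf))
    y = np.concatenate([yu, yf])
    sign, logdet = np.linalg.slogdet(K)
    return 0.5 * (y @ np.linalg.solve(K, y) + logdet)

# Toy data: u(x) = sin(x), so f = alpha_true * cos(x).
rng = np.random.default_rng(0)
alpha_true = 2.5
xu = np.linspace(0, 6, 15);  yu = np.sin(xu) + 0.01 * rng.standard_normal(15)
xf = np.linspace(0, 6, 15);  yf = alpha_true * np.cos(xf) + 0.01 * rng.standard_normal(15)

res = minimize(neg_log_marginal, x0=[0.0, 1.0], args=(xu, yu, xf, yf))
print("estimated alpha:", res.x[1])   # should land close to 2.5
```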
Joint Embedding of Graphs
Title | Joint Embedding of Graphs |
Authors | Shangsi Wang, Jesús Arroyo, Joshua T. Vogelstein, Carey E. Priebe |
Abstract | Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric matrices and projects adjacency matrices of graphs into this subspace. The projection coefficients can be treated as features of the graphs, while the embedding components can represent vertex features. We also propose a random graph model for multiple graphs that generalizes other classical models for graphs. We show through theory and numerical experiments that under the model, the joint embedding method produces estimates of parameters with small errors. Via simulation experiments, we demonstrate that the joint embedding method produces features which lead to state of the art performance in classifying graphs. Applying the joint embedding method to human brain graphs, we find it extracts interpretable features with good prediction accuracy in different tasks. |
Tasks | Dimensionality Reduction |
Published | 2017-03-10 |
URL | https://arxiv.org/abs/1703.03862v4 |
PDF | https://arxiv.org/pdf/1703.03862v4.pdf |
PWC | https://paperswithcode.com/paper/joint-embedding-of-graphs |
Repo | https://github.com/shangsiwang/Joint-Embedding |
Framework | none |
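The abstract describes identifying a subspace spanned by rank-one symmetric matrices and projecting each adjacency matrix onto it, with the projection coefficients serving as graph features. The joint objective implied by that description can be written compactly as below; the notation is a paraphrase of the abstract, not the paper's own.

```latex
% m graphs with adjacency matrices A_1, ..., A_m; d rank-one symmetric components
% h_k h_k^T shared across graphs; per-graph coefficients lambda_{ik} act as features.
\min_{\{h_k\},\,\{\lambda_{ik}\}}\;
\sum_{i=1}^{m} \Bigl\| A_i \;-\; \sum_{k=1}^{d} \lambda_{ik}\, h_k h_k^{\top} \Bigr\|_F^{2}
```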
Excitation Backprop for RNNs
Title | Excitation Backprop for RNNs |
Authors | Sarah Adel Bargal, Andrea Zunino, Donghyun Kim, Jianming Zhang, Vittorio Murino, Stan Sclaroff |
Abstract | Deep models are state-of-the-art for many vision tasks including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has been studied in spatial visual content, giving more insight into model predictions for images. However, such studies are relatively lacking for models of spatiotemporal visual content - videos. In this work, we devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep model’s classification/captioning output using the model’s internal representation. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond with a specific action, or phrase from a caption, without explicitly optimizing/training for these tasks. |
Tasks | Temporal Action Localization, Video Captioning |
Published | 2017-11-18 |
URL | http://arxiv.org/abs/1711.06778v3 |
PDF | http://arxiv.org/pdf/1711.06778v3.pdf |
PWC | https://paperswithcode.com/paper/excitation-backprop-for-rnns |
Repo | https://github.com/sbargal/Caffe-ExcitationBP-RNNs |
Framework | none |
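The top-down saliency in the abstract builds on the excitation backprop rule (Zhang et al.), which this work extends from spatial to spatiotemporal models: winning probabilities are passed backward through a layer in proportion to the positive weights times the child activations. Below is a numpy sketch of that rule for a single fully connected layer, as a simplified illustration; the paper applies the same idea through recurrent units as well.

```python
# Sketch of the excitation backprop rule for one fully connected layer, following the
# probabilistic winner-take-all formulation this paper builds on.
# a: lower-layer activations (non-negative), shape (n_in,)
# W: weights mapping lower to upper layer, shape (n_out, n_in)
# p_out: marginal winning probabilities of the upper neurons, shape (n_out,)

import numpy as np

def excitation_backprop_fc(p_out, W, a, eps=1e-12):
    Wp = np.clip(W, 0.0, None)            # keep only excitatory (positive) connections
    contrib = Wp * a[None, :]             # each child's excitation of each parent
    contrib /= contrib.sum(axis=1, keepdims=True) + eps   # normalize per parent
    return contrib.T @ p_out              # redistribute parent probability to children

# Tiny example: relevance over 4 input units given one-hot "evidence" at the output.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
a = np.abs(rng.standard_normal(4))
p = np.array([1.0, 0.0, 0.0])
print(excitation_backprop_fc(p, W, a))    # sums to ~1, giving per-unit relevance
```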
Action Tubelet Detector for Spatio-Temporal Action Localization
Title | Action Tubelet Detector for Spatio-Temporal Action Localization |
Authors | Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid |
Abstract | Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating at the frame level. We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. The same way state-of-the-art object detectors rely on anchor boxes, our ACT-detector is based on anchor cuboids. We build upon the SSD framework. Convolutional features are extracted for each frame, while scores and regressions are based on the temporal stacking of these features, thus exploiting information from a sequence. Our experimental results show that leveraging sequences of frames significantly improves detection performance over using individual frames. The gain of our tubelet detector can be explained by both more accurate scores and more precise localization. Our ACT-detector outperforms the state-of-the-art methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds. |
Tasks | Action Localization, Spatio-Temporal Action Localization, Temporal Action Localization |
Published | 2017-05-04 |
URL | http://arxiv.org/abs/1705.01861v3 |
PDF | http://arxiv.org/pdf/1705.01861v3.pdf |
PWC | https://paperswithcode.com/paper/action-tubelet-detector-for-spatio-temporal |
Repo | https://github.com/qingzhiwu/pytorch-act-detector |
Framework | pytorch |
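The abstract describes per-frame convolutional features stacked over a sequence of K frames, from which anchor-cuboid scores and per-frame box regressions are predicted. A minimal PyTorch sketch of such a tubelet head follows; the tiny backbone, anchor count, and layer sizes are placeholders, not the paper's SSD configuration.

```python
# Minimal PyTorch sketch of a tubelet prediction head in the spirit of the ACT-detector:
# per-frame features are stacked across K frames, then one head scores each anchor
# cuboid and another regresses K boxes (one per frame). Backbone and anchors are
# placeholders, not the paper's SSD setup.

import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    def __init__(self, K=6, num_classes=25, num_anchors=4, feat=64):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in per-frame feature extractor
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        stacked = K * feat
        self.cls = nn.Conv2d(stacked, num_anchors * (num_classes + 1), 3, padding=1)
        self.reg = nn.Conv2d(stacked, num_anchors * 4 * K, 3, padding=1)

    def forward(self, clip):                      # clip: (B, K, 3, H, W)
        B, K, C, H, W = clip.shape
        feats = self.backbone(clip.flatten(0, 1))          # (B*K, feat, h, w)
        feats = feats.unflatten(0, (B, K)).flatten(1, 2)    # stack along channels
        return self.cls(feats), self.reg(feats)   # cuboid scores, per-frame box offsets

scores, boxes = TubeletHead()(torch.randn(2, 6, 3, 128, 128))
print(scores.shape, boxes.shape)
```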
Use of Deep Learning in Modern Recommendation System: A Summary of Recent Works
Title | Use of Deep Learning in Modern Recommendation System: A Summary of Recent Works |
Authors | Ayush Singhal, Pradeep Sinha, Rakesh Pant |
Abstract | With the exponential increase in the amount of digital information on the internet, online shops, online music, video and image libraries, search engines and recommendation systems have become the most convenient ways to find relevant information within a short time. In recent times, advances in deep learning have gained significant attention in the fields of speech recognition, image processing and natural language processing. Meanwhile, several recent studies have shown the utility of deep learning in the areas of recommendation systems and information retrieval as well. In this short review, we cover the recent advances made in the field of recommendation using various variants of deep learning technology. We organize the review in three parts: collaborative systems, content-based systems and hybrid systems. The review also discusses the contribution of deep-learning-integrated recommendation systems in several application domains. The review concludes with a discussion of the impact of deep learning on recommendation systems in various domains and of whether deep learning has shown any significant improvement over conventional recommendation systems. Finally, we outline possible future research directions based on the current state of deep learning use in recommendation systems. |
Tasks | Information Retrieval, Recommendation Systems, Speech Recognition |
Published | 2017-12-20 |
URL | http://arxiv.org/abs/1712.07525v1 |
PDF | http://arxiv.org/pdf/1712.07525v1.pdf |
PWC | https://paperswithcode.com/paper/use-of-deep-learning-in-modern-recommendation |
Repo | https://github.com/anuragreddygv323/Important-stuff |
Framework | tf |
Deep word embeddings for visual speech recognition
Title | Deep word embeddings for visual speech recognition |
Authors | Themos Stafylakis, Georgios Tzimiropoulos |
Abstract | In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and illumination. The system is comprised of a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs and is trained on the Lipreading in-the-wild database. We first show that the proposed architecture goes beyond state-of-the-art on closed-set word identification, by attaining 11.92% error rate on a vocabulary of 500 words. We then examine the capacity of the embeddings in modelling words unseen during training. We deploy Probabilistic Linear Discriminant Analysis (PLDA) to model the embeddings and perform low-shot learning experiments on words unseen during training. The experiments demonstrate that word-level visual speech recognition is feasible even in cases where the target words are not included in the training set. |
Tasks | Lipreading, Speech Recognition, Visual Speech Recognition, Word Embeddings |
Published | 2017-10-30 |
URL | http://arxiv.org/abs/1710.11201v1 |
PDF | http://arxiv.org/pdf/1710.11201v1.pdf |
PWC | https://paperswithcode.com/paper/deep-word-embeddings-for-visual-speech |
Repo | https://github.com/tstafylakis/Lipreading-ResNet |
Framework | pytorch |
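The abstract lists the architecture's components: a spatiotemporal convolutional front end, a residual network, and bidirectional LSTMs, trained for closed-set word classification, with the resulting vectors used as word embeddings. A condensed PyTorch sketch of that architecture family is below; the 2D trunk is a small stand-in for the paper's ResNet and all layer sizes are assumptions.

```python
# Condensed PyTorch sketch of the architecture family described in the abstract:
# a 3D spatiotemporal conv front end, a per-frame 2D trunk (small stand-in for the
# paper's ResNet), and a bidirectional LSTM over time. The penultimate vector serves
# as the word embedding; all layer sizes here are assumptions.

import torch
import torch.nn as nn

class VisualWordEmbedder(nn.Module):
    def __init__(self, num_words=500, feat=128, hidden=256):
        super().__init__()
        self.front = nn.Sequential(                     # spatiotemporal front end
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        self.trunk = nn.Sequential(                     # stand-in for the ResNet trunk
            nn.Conv2d(32, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_words)

    def forward(self, mouths):                          # (B, 1, T, H, W) grayscale clips
        x = self.front(mouths)                          # (B, 32, T, h, w)
        B, C, T, H, W = x.shape
        x = x.permute(0, 2, 1, 3, 4).flatten(0, 1)      # (B*T, 32, h, w)
        x = self.trunk(x).flatten(1)                    # (B*T, feat)
        x = x.unflatten(0, (B, T))                      # (B, T, feat)
        out, _ = self.lstm(x)
        embedding = out.mean(dim=1)                     # (B, 2*hidden) word embedding
        return self.fc(embedding), embedding

logits, emb = VisualWordEmbedder()(torch.randn(2, 1, 29, 88, 88))
print(logits.shape, emb.shape)
```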
Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations
Title | Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations |
Authors | H. V. Koops, W. B. de Haas, J. Bransen, A. Volk |
Abstract | The increasing accuracy of automatic chord estimation systems, the availability of vast amounts of heterogeneous reference annotations, and insights from annotator subjectivity research make chord label personalization increasingly important. Nevertheless, automatic chord estimation systems are historically exclusively trained and evaluated on a single reference annotation. We introduce a first approach to automatic chord label personalization by modeling subjectivity through deep learning of a harmonic interval-based chord label representation. After integrating these representations from multiple annotators, we can accurately personalize chord labels for individual annotators from a single model and the annotators’ chord label vocabulary. Furthermore, we show that chord personalization using multiple reference annotations outperforms using a single reference annotation. |
Tasks | |
Published | 2017-06-29 |
URL | http://arxiv.org/abs/1706.09552v1 |
PDF | http://arxiv.org/pdf/1706.09552v1.pdf |
PWC | https://paperswithcode.com/paper/chord-label-personalization-through-deep |
Repo | https://github.com/hvkoops/chordlabelpersonalization |
Framework | tf |
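The abstract centers on a harmonic interval-based chord label representation that can be integrated across annotators. Below is a toy sketch of one plausible encoding in that spirit (root one-hot plus a 12-dimensional interval vector above the root); the exact representation and chord vocabulary used in the paper may differ.

```python
# Toy sketch of a harmonic interval-based chord encoding: a chord label becomes its
# root pitch class (one-hot) plus a 12-d binary vector of intervals in semitones
# above the root. One plausible encoding in the spirit of the abstract, not
# necessarily the paper's exact representation.

import numpy as np

PITCH_CLASSES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                 "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
QUALITY_INTERVALS = {"maj": [0, 4, 7], "min": [0, 3, 7], "7": [0, 4, 7, 10]}

def encode(label):
    root, quality = label.split(":")
    root_vec = np.zeros(12); root_vec[PITCH_CLASSES[root]] = 1.0
    intervals = np.zeros(12); intervals[QUALITY_INTERVALS[quality]] = 1.0
    return np.concatenate([root_vec, intervals])   # 24-d: root one-hot + intervals

# Averaging encodings from multiple annotators yields a soft, integrated target that a
# network can be trained on, from which per-annotator labels can later be decoded.
annotations = ["C:maj", "C:maj", "A:min"]
print(np.mean([encode(a) for a in annotations], axis=0))
```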
Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art
Title | Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art |
Authors | Yuan Gao, Brij Mohan Lal Srivastava, James Salsman |
Abstract | We use automatic speech recognition to assess spoken English learner pronunciation based on the authentic intelligibility of the learners’ spoken responses determined from support vector machine (SVM) classifier or deep learning neural network model predictions of transcription correctness. Using numeric features produced by PocketSphinx alignment mode and many recognition passes searching for the substitution and deletion of each expected phoneme and insertion of unexpected phonemes in sequence, the SVM models achieve 82 percent agreement with the accuracy of Amazon Mechanical Turk crowdworker transcriptions, up from 75 percent reported by multiple independent researchers. Using such features with SVM classifier probability prediction models can help computer-aided pronunciation teaching (CAPT) systems provide intelligibility remediation. |
Tasks | Speech Recognition |
Published | 2017-09-06 |
URL | http://arxiv.org/abs/1709.01713v3 |
PDF | http://arxiv.org/pdf/1709.01713v3.pdf |
PWC | https://paperswithcode.com/paper/spoken-english-intelligibility-remediation |
Repo | https://github.com/jsalsman/featex |
Framework | none |
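The abstract describes SVM classifiers with probability outputs trained on PocketSphinx alignment features to predict transcription correctness. A minimal scikit-learn sketch of that classification step follows; the feature vectors and labels are random placeholders standing in for the alignment/phoneme-substitution features.

```python
# Minimal scikit-learn sketch of the classification step described in the abstract:
# an SVM with probability outputs predicts whether a crowdworker transcription of an
# utterance would be correct, from per-utterance alignment features. The features and
# labels below are random placeholders for the PocketSphinx-derived ones.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 40))              # placeholder alignment/phoneme features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # placeholder "intelligible" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]           # probability of an intelligible response
print("accuracy:", clf.score(X_te, y_te))
```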
Sequential Dialogue Context Modeling for Spoken Language Understanding
Title | Sequential Dialogue Context Modeling for Spoken Language Understanding |
Authors | Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, Larry Heck |
Abstract | Spoken Language Understanding (SLU) is a key component of goal oriented dialogue systems that would parse user utterances into semantic frame representations. Traditionally SLU does not utilize the dialogue history beyond the previous system turn and contextual ambiguities are resolved by the downstream components. In this paper, we explore novel approaches for modeling dialogue context in a recurrent neural network (RNN) based language understanding system. We propose the Sequential Dialogue Encoder Network, that allows encoding context from the dialogue history in chronological order. We compare the performance of our proposed architecture with two context models, one that uses just the previous turn context and another that encodes dialogue context in a memory network, but loses the order of utterances in the dialogue history. Experiments with a multi-domain dialogue dataset demonstrate that the proposed architecture results in reduced semantic frame error rates. |
Tasks | Goal-Oriented Dialogue Systems, Spoken Language Understanding |
Published | 2017-05-08 |
URL | http://arxiv.org/abs/1705.03455v3 |
PDF | http://arxiv.org/pdf/1705.03455v3.pdf |
PWC | https://paperswithcode.com/paper/sequential-dialogue-context-modeling-for |
Repo | https://github.com/sunbopds/SDEN-Pytorch-master |
Framework | pytorch |
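The abstract contrasts an order-preserving dialogue-history encoder with a memory network that discards utterance order. A compact PyTorch sketch of the order-preserving idea: encode each past utterance into a vector, run a GRU over those vectors in chronological order, and combine the final state with the current-turn encoding. Embedding sizes and the word-level encoder are assumptions, not the paper's exact architecture.

```python
# Compact PyTorch sketch of an order-preserving dialogue context encoder in the spirit
# of the Sequential Dialogue Encoder Network. Sizes and the word-level encoder are
# assumptions for illustration.

import torch
import torch.nn as nn

class SequentialContextEncoder(nn.Module):
    def __init__(self, vocab=5000, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.utt_enc = nn.GRU(emb, hid, batch_first=True)      # word-level encoder
        self.ctx_enc = nn.GRU(hid, hid, batch_first=True)      # turn-level, in order
        self.out = nn.Linear(2 * hid, hid)

    def encode_utterance(self, tokens):                        # tokens: (B, T) ids
        _, h = self.utt_enc(self.embed(tokens))
        return h[-1]                                            # (B, hid)

    def forward(self, history, current):
        # history: list of (B, T_i) past utterances in chronological order
        turns = torch.stack([self.encode_utterance(u) for u in history], dim=1)
        _, h_ctx = self.ctx_enc(turns)                          # order-aware context
        h_cur = self.encode_utterance(current)
        return self.out(torch.cat([h_ctx[-1], h_cur], dim=-1))  # features for frame tagging

enc = SequentialContextEncoder()
history = [torch.randint(0, 5000, (2, 7)) for _ in range(3)]
current = torch.randint(0, 5000, (2, 9))
print(enc(history, current).shape)   # (2, 128)
```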
Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions
Title | Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions |
Authors | Oscar Li, Hao Liu, Chaofan Chen, Cynthia Rudin |
Abstract | Deep neural networks are widely used for classification. These deep models often suffer from a lack of interpretability – they are particularly difficult to understand because of their non-linear nature. As a result, neural networks are often treated as “black box” models, and in the past, have been trained purely to optimize the accuracy of predictions. In this work, we create a novel network architecture for deep learning that naturally explains its own reasoning for each prediction. This architecture contains an autoencoder and a special prototype layer, where each unit of that layer stores a weight vector that resembles an encoded training input. The encoder of the autoencoder allows us to do comparisons within the latent space, while the decoder allows us to visualize the learned prototypes. The training objective has four terms: an accuracy term, a term that encourages every prototype to be similar to at least one encoded input, a term that encourages every encoded input to be close to at least one prototype, and a term that encourages faithful reconstruction by the autoencoder. The distances computed in the prototype layer are used as part of the classification process. Since the prototypes are learned during training, the learned network naturally comes with explanations for each prediction, and the explanations are loyal to what the network actually computes. |
Tasks | |
Published | 2017-10-13 |
URL | http://arxiv.org/abs/1710.04806v2 |
PDF | http://arxiv.org/pdf/1710.04806v2.pdf |
PWC | https://paperswithcode.com/paper/deep-learning-for-case-based-reasoning |
Repo | https://github.com/OscarcarLi/PrototypeDL |
Framework | tf |
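The abstract spells out the architecture (autoencoder plus prototype layer, with prototype distances feeding the classifier) and the four training terms. A condensed PyTorch sketch of that objective follows; layer sizes and loss weights are placeholders, not the paper's settings.

```python
# Condensed PyTorch sketch of the four-term objective described in the abstract:
# cross-entropy accuracy term, autoencoder reconstruction term, and the two
# prototype/encoding proximity terms. Layer sizes and loss weights are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeNet(nn.Module):
    def __init__(self, in_dim=784, latent=32, n_protos=15, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, in_dim))
        self.prototypes = nn.Parameter(torch.randn(n_protos, latent))
        self.classifier = nn.Linear(n_protos, n_classes)   # weights on prototype distances

    def forward(self, x):
        z = self.encoder(x)
        dists = torch.cdist(z, self.prototypes) ** 2        # (B, n_protos)
        return self.classifier(dists), self.decoder(z), z, dists

def loss(model, x, y, lam=(0.05, 0.05, 0.05)):
    logits, recon, z, dists = model(x)
    ce = F.cross_entropy(logits, y)                          # accuracy term
    rec = F.mse_loss(recon, x)                               # reconstruction term
    r1 = dists.min(dim=0).values.mean()   # each prototype near some encoded input
    r2 = dists.min(dim=1).values.mean()   # each encoded input near some prototype
    return ce + lam[0] * rec + lam[1] * r1 + lam[2] * r2

model = PrototypeNet()
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
print(loss(model, x, y).item())
```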
Real-time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs
Title | Real-time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs |
Authors | Andres Milioto, Philipp Lottes, Cyrill Stachniss |
Abstract | Precision farming robots, which aim to reduce the amount of herbicide that needs to be applied in the fields, must be able to identify crops and weeds in real time in order to trigger weeding actions. In this paper, we address the problem of CNN-based semantic segmentation of crop fields, separating sugar beet plants, weeds, and background solely from RGB data. We propose a CNN that exploits existing vegetation indexes and provides a classification in real time. Furthermore, it can be effectively re-trained for previously unseen fields with a comparably small amount of training data. We implemented and thoroughly evaluated our system on a real agricultural robot operating in different fields in Germany and Switzerland. The results show that our system generalizes well, can operate at around 20 Hz, and is suitable for online operation in the fields. |
Tasks | Real-Time Semantic Segmentation, Semantic Segmentation |
Published | 2017-09-20 |
URL | http://arxiv.org/abs/1709.06764v2 |
PDF | http://arxiv.org/pdf/1709.06764v2.pdf |
PWC | https://paperswithcode.com/paper/real-time-semantic-segmentation-of-crop-and |
Repo | https://github.com/PRBonn/bonnet |
Framework | tf |
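The abstract says the CNN exploits existing vegetation indexes computed from RGB. A small sketch of that idea: compute one such index and feed it to the network as an extra input channel alongside RGB. Excess Green (ExG) is used here only as an example index, and the tiny encoder-decoder is a stand-in, not the paper's architecture.

```python
# Small sketch of feeding a vegetation index alongside RGB to a segmentation CNN,
# in the spirit of the abstract. Excess Green (ExG) is an example index; the exact
# indexes and architecture of the paper's network are not reproduced here.

import torch
import torch.nn as nn

def excess_green(rgb):                       # rgb: (B, 3, H, W) in [0, 1]
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return (2 * g - r - b).unsqueeze(1)      # (B, 1, H, W) vegetation index channel

class TinySegNet(nn.Module):
    """Stand-in encoder-decoder taking RGB + index channels; 3 classes:
    background, crop (sugar beet), weed."""
    def __init__(self, in_ch=4, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        x = torch.cat([rgb, excess_green(rgb)], dim=1)   # append the index channel
        return self.net(x)                                # (B, 3, H, W) class logits

logits = TinySegNet()(torch.rand(1, 3, 256, 256))
print(logits.shape)
```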