Paper Group ANR 912
Automatic Prediction of Building Age from Photographs. Learning to Grasp from a Single Demonstration. Seeing Tree Structure from Vibration. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. ViS-HuD: Using Visual Saliency to Improve Human Detection with Convolutional Neural Networks. Optimized Skeleton-based Action Recognition via Sparsified Graph Regression. Marrying Universal Dependencies and Universal Morphology. Scene Graph Parsing as Dependency Parsing. Important Attribute Identification in Knowledge Graph. Collective Online Learning of Gaussian Processes in Massive Multi-Agent Systems. Deep Net Triage: Analyzing the Importance of Network Layers via Structural Compression. Siamese Capsule Networks. DPW-SDNet: Dual Pixel-Wavelet Domain Deep CNNs for Soft Decoding of JPEG-Compressed Images. Refining Source Representations with Relation Networks for Neural Machine Translation. Unsupervised Multi-modal Neural Machine Translation.
Automatic Prediction of Building Age from Photographs
Title | Automatic Prediction of Building Age from Photographs |
Authors | Matthias Zeppelzauer, Miroslav Despotovic, Muntaha Sakeena, David Koch, Mario Döller |
Abstract | We present a first method for the automated age estimation of buildings from unconstrained photographs. To this end, we propose a two-stage approach that first learns characteristic visual patterns for different building epochs at patch level and then globally aggregates patch-level age estimates over the building. We compile evaluation datasets from different sources and perform a detailed evaluation of our approach, its sensitivity to parameters, and the capabilities of the employed deep networks to learn characteristic visual age-related patterns. Results show that our approach is able to estimate building age at a surprisingly high level that even outperforms human evaluators and thereby sets a new performance baseline. This work represents a first step towards the automated assessment of building parameters for automated price prediction. |
Tasks | Age Estimation |
Published | 2018-04-06 |
URL | http://arxiv.org/abs/1804.02205v2 |
PDF | http://arxiv.org/pdf/1804.02205v2.pdf |
PWC | https://paperswithcode.com/paper/automatic-prediction-of-building-age-from |
Repo | |
Framework | |
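The two-stage idea above (patch-level classification of building epochs, then global aggregation over the building) can be illustrated with a minimal sketch. The epoch labels and the probability-averaging rule below are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch (not the authors' code): patch-level epoch estimates are
# aggregated into a single building-level age prediction by averaging the
# per-patch class probabilities of an assumed patch classifier.
import numpy as np

EPOCHS = ["pre-1900", "1900-1945", "1946-1975", "1976-2000", "post-2000"]  # assumed label set

def aggregate_patch_predictions(patch_probs: np.ndarray) -> str:
    """patch_probs: (num_patches, num_epochs) softmax outputs of a patch classifier."""
    building_probs = patch_probs.mean(axis=0)      # global aggregation over patches
    return EPOCHS[int(building_probs.argmax())]    # most likely construction epoch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_patch_probs = rng.dirichlet(np.ones(len(EPOCHS)), size=32)  # stand-in for CNN outputs
    print(aggregate_patch_predictions(fake_patch_probs))
```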
Learning to Grasp from a Single Demonstration
Title | Learning to Grasp from a Single Demonstration |
Authors | Pieter Van Molle, Tim Verbelen, Elias De Coninck, Cedric De Boom, Pieter Simoens, Bart Dhoedt |
Abstract | Learning-based approaches for robotic grasping using visual sensors typically require collecting a large dataset, either manually labeled or gathered through many trials and errors of a robotic manipulator in the real or simulated world. We propose a simpler learning-from-demonstration approach that is able to detect the object to grasp from merely a single demonstration using a convolutional neural network we call GraspNet. In order to increase robustness and decrease the training time even further, we leverage data from previous demonstrations to quickly fine-tune a GraspNet for each new demonstration. We present some preliminary results on a grasping experiment with the Franka Panda cobot for which we can train a GraspNet with only hundreds of training iterations. |
Tasks | Robotic Grasping |
Published | 2018-06-09 |
URL | http://arxiv.org/abs/1806.03486v1 |
PDF | http://arxiv.org/pdf/1806.03486v1.pdf |
PWC | https://paperswithcode.com/paper/learning-to-grasp-from-a-single-demonstration |
Repo | |
Framework | |
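A minimal sketch of the single-demonstration fine-tuning idea, under the assumption that the demonstrated grasp is encoded as a target heatmap and the backbone is a small fully convolutional stand-in (not the paper's GraspNet architecture):

```python
# Illustrative sketch (assumptions, not the paper's GraspNet): fine-tune a small
# fully convolutional network on a single demonstration image whose grasp point
# is given as a target heatmap, for a few hundred iterations.
import torch
import torch.nn as nn

net = nn.Sequential(                     # stand-in for a pretrained GraspNet backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),                 # per-pixel grasp score
)

demo_image = torch.rand(1, 3, 64, 64)    # the single demonstration frame
target = torch.zeros(1, 1, 64, 64)
target[0, 0, 30:34, 30:34] = 1.0         # demonstrated grasp location as a heatmap

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(300):                  # "only hundreds of train iterations"
    opt.zero_grad()
    loss = loss_fn(net(demo_image), target)
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```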
Seeing Tree Structure from Vibration
Title | Seeing Tree Structure from Vibration |
Authors | Tianfan Xue, Jiajun Wu, Zhoutong Zhang, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman |
Abstract | Humans recognize object structure from both their appearance and motion; often, motion helps to resolve ambiguities in object structure that arise when we observe object appearance only. There are particular scenarios, however, where neither appearance nor spatial-temporal motion signals are informative: occluding twigs may look connected and have almost identical movements, though they belong to different, possibly disconnected branches. We propose to tackle this problem through spectrum analysis of motion signals, because vibrations of disconnected branches, though visually similar, often have distinctive natural frequencies. We propose a novel formulation of tree structure based on a physics-based link model, and validate its effectiveness by theoretical analysis, numerical simulation, and empirical experiments. With this formulation, we use nonparametric Bayesian inference to reconstruct tree structure from both spectral vibration signals and appearance cues. Our model performs well in recognizing hierarchical tree structure from real-world videos of trees and vessels. |
Tasks | Bayesian Inference |
Published | 2018-09-13 |
URL | http://arxiv.org/abs/1809.05067v1 |
PDF | http://arxiv.org/pdf/1809.05067v1.pdf |
PWC | https://paperswithcode.com/paper/seeing-tree-structure-from-vibration |
Repo | |
Framework | |
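A minimal sketch of the spectral cue the abstract describes: disconnected branches tend to vibrate at distinct natural frequencies, so the dominant frequencies of two tracked trajectories hint at whether they share a branch. The synthetic signals and the 0.5 Hz threshold are assumptions for illustration.

```python
# Illustrative sketch (assumption, not the authors' model): compare the dominant
# vibration frequency of two tracked point trajectories as a connectivity cue.
import numpy as np

def dominant_frequency(trajectory: np.ndarray, fps: float) -> float:
    """trajectory: 1-D displacement signal of a tracked point over time."""
    spectrum = np.abs(np.fft.rfft(trajectory - trajectory.mean()))
    freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / fps)
    return float(freqs[spectrum.argmax()])

t = np.arange(0, 4, 1 / 30)                       # 4 s of video at 30 fps
branch_a = np.sin(2 * np.pi * 2.0 * t)            # twig oscillating at ~2 Hz
branch_b = np.sin(2 * np.pi * 3.5 * t)            # visually similar twig at ~3.5 Hz
same = abs(dominant_frequency(branch_a, 30) - dominant_frequency(branch_b, 30)) < 0.5
print("likely same branch:", same)
```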
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining
Title | Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining |
Authors | Yundong Zhang, Juan Carlos Niebles, Alvaro Soto |
Abstract | A key aspect of interpretable VQA models is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve higher correlation with manually annotated groundings, while achieving state-of-the-art VQA accuracy. |
Tasks | Question Answering, Visual Question Answering |
Published | 2018-08-01 |
URL | http://arxiv.org/abs/1808.00265v1 |
PDF | http://arxiv.org/pdf/1808.00265v1.pdf |
PWC | https://paperswithcode.com/paper/interpretable-visual-question-answering-by-1 |
Repo | |
Framework | |
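One way to use mined groundings, sketched below under the assumption that the supervision enters as a KL-divergence term between the model's region attention and a soft grounding mask (the paper's exact loss may differ):

```python
# Illustrative sketch (assumption): supervise a VQA attention map with a "mined"
# grounding mask by adding a KL-divergence term to the usual answer loss.
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn_logits, mined_mask):
    """attn_logits: (B, R) unnormalised attention over R image regions.
    mined_mask:  (B, R) soft grounding mask derived from region descriptions."""
    log_attn = F.log_softmax(attn_logits, dim=1)
    target = mined_mask / mined_mask.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return F.kl_div(log_attn, target, reduction="batchmean")

attn = torch.randn(4, 36, requires_grad=True)   # e.g. 36 bottom-up regions
mask = torch.zeros(4, 36)
mask[:, :3] = 1.0                               # regions matched to the answer phrase
print(attention_supervision_loss(attn, mask).item())
```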
ViS-HuD: Using Visual Saliency to Improve Human Detection with Convolutional Neural Networks
Title | ViS-HuD: Using Visual Saliency to Improve Human Detection with Convolutional Neural Networks |
Authors | Vandit Gajjar, Yash Khandhediya, Ayesha Gurnani, Viraj Mavani, Mehul S. Raval |
Abstract | The paper presents a technique to improve human detection in still images using deep learning. Our novel method, ViS-HuD, computes a visual saliency map from the image; the input image is then multiplied by the map and the product is fed to a Convolutional Neural Network (CNN) which detects humans in the image. The visual saliency map is generated using ML-Net and human detection is carried out using DetectNet; ML-Net is pre-trained on SALICON for visual saliency detection, while DetectNet is pre-trained on the ImageNet database for image classification. The CNNs of ViS-HuD were trained on two challenging databases - Penn Fudan and TUD-Brussels Benchmark. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on the Penn Fudan dataset with 91.4% human detection accuracy, and an average miss rate of 53% on the TUD-Brussels benchmark. |
Tasks | Human Detection, Image Classification, Saliency Detection |
Published | 2018-02-21 |
URL | http://arxiv.org/abs/1803.01687v3 |
PDF | http://arxiv.org/pdf/1803.01687v3.pdf |
PWC | https://paperswithcode.com/paper/vis-hud-using-visual-saliency-to-improve |
Repo | |
Framework | |
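The core preprocessing step described above is easy to illustrate: modulate the input image by a saliency map before handing it to the detector. In the sketch below, random arrays stand in for the actual ML-Net saliency prediction and the DetectNet input.

```python
# Illustrative sketch (assumption): the ViS-HuD preprocessing step, i.e.
# multiplying the input image by a visual saliency map before detection.
import numpy as np

def apply_saliency(image: np.ndarray, saliency: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) uint8; saliency: (H, W) map in [0, 1] (e.g. from ML-Net)."""
    modulated = image.astype(np.float32) * saliency[..., None]
    return modulated.astype(np.uint8)      # fed to the human detector (e.g. DetectNet)

img = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
sal = np.random.rand(240, 320)             # stand-in for an ML-Net saliency prediction
print(apply_saliency(img, sal).shape)
```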
Optimized Skeleton-based Action Recognition via Sparsified Graph Regression
Title | Optimized Skeleton-based Action Recognition via Sparsified Graph Regression |
Authors | Xiang Gao, Wei Hu, Jiaxiang Tang, Jiaying Liu, Zongming Guo |
Abstract | With the prevalence of accessible depth sensors, dynamic human body skeletons have attracted much attention as a robust modality for action recognition. Previous methods model skeletons based on RNN or CNN, which has limited expressive power for irregular skeleton joints. While graph convolutional networks (GCN) have been proposed to address irregular graph-structured data, the fundamental graph construction remains challenging. In this paper, we represent skeletons naturally on graphs, and propose a graph regression based GCN (GR-GCN) for skeleton-based action recognition, aiming to capture the spatio-temporal variation in the data. As the graph representation is crucial to graph convolution, we first propose graph regression to statistically learn the underlying graph from multiple observations. In particular, we provide spatio-temporal modeling of skeletons and pose an optimization problem on the graph structure over consecutive frames, which enforces the sparsity of the underlying graph for efficient representation. The optimized graph not only connects each joint to its neighboring joints in the same frame strongly or weakly, but also links with relevant joints in the previous and subsequent frames. We then feed the optimized graph into the GCN along with the coordinates of the skeleton sequence for feature learning, where we deploy high-order and fast Chebyshev approximation of spectral graph convolution. Further, we provide analysis of the variation characterization by the Chebyshev approximation. Experimental results validate the effectiveness of the proposed graph regression and show that the proposed GR-GCN achieves the state-of-the-art performance on the widely used NTU RGB+D, UT-Kinect and SYSU 3D datasets. |
Tasks | graph construction, Graph Regression, Skeleton Based Action Recognition, Temporal Action Localization |
Published | 2018-11-29 |
URL | http://arxiv.org/abs/1811.12013v2 |
PDF | http://arxiv.org/pdf/1811.12013v2.pdf |
PWC | https://paperswithcode.com/paper/generalized-graph-convolutional-networks-for |
Repo | |
Framework | |
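A minimal sketch of the Chebyshev-approximated spectral graph convolution the abstract mentions, applied to skeleton joint features with a given (here random) adjacency. The symmetric normalisation and the assumption lambda_max ≈ 2 follow common practice rather than the paper's exact formulation.

```python
# Illustrative sketch (assumption): an order-K Chebyshev spectral graph
# convolution on skeleton joint features, given a learned adjacency matrix.
import numpy as np

def chebyshev_gcn_layer(X, A, weights):
    """X: (N, F_in) joint features; A: (N, N) learned adjacency;
    weights: list of K (F_in, F_out) matrices, one per Chebyshev order."""
    N = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg, np.inf) ** -0.5
    L = np.eye(N) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # normalised Laplacian
    L_tilde = L - np.eye(N)            # rescaled Laplacian, assuming lambda_max ~ 2
    Tx = [X, L_tilde @ X]              # Chebyshev recurrence: T_0 X = X, T_1 X = L~ X
    for _ in range(2, len(weights)):
        Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])
    return sum(T @ W for T, W in zip(Tx, weights))

N, F_in, F_out, K = 25, 3, 16, 3       # e.g. 25 NTU RGB+D joints with 3-D coordinates
A = (np.random.rand(N, N) > 0.8).astype(float)
A = (A + A.T) / 2                      # symmetric stand-in for the optimized graph
W = [np.random.randn(F_in, F_out) * 0.1 for _ in range(K)]
print(chebyshev_gcn_layer(np.random.randn(N, F_in), A, W).shape)
```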
Marrying Universal Dependencies and Universal Morphology
Title | Marrying Universal Dependencies and Universal Morphology |
Authors | Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky |
Abstract | The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects. |
Tasks | |
Published | 2018-10-15 |
URL | http://arxiv.org/abs/1810.06743v1 |
PDF | http://arxiv.org/pdf/1810.06743v1.pdf |
PWC | https://paperswithcode.com/paper/marrying-universal-dependencies-and-universal |
Repo | |
Framework | |
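A tiny sketch of the kind of deterministic feature mapping and lookup-based recall evaluation described above; the mapping entries and the miniature lexicon are illustrative assumptions, not the paper's full resources.

```python
# Illustrative sketch (assumption): map UD morphological features to UniMorph
# tags with a deterministic table, then measure recall by lookup in a
# UniMorph-style lexicon.
UD_TO_UNIMORPH = {            # hypothetical subset of the mapping
    "Number=Sing": "SG", "Number=Plur": "PL",
    "Tense=Past": "PST", "Tense=Pres": "PRS",
    "VerbForm=Fin": "FIN", "Mood=Ind": "IND",
}

def convert(ud_feats: str) -> frozenset:
    return frozenset(UD_TO_UNIMORPH[f] for f in ud_feats.split("|") if f in UD_TO_UNIMORPH)

unimorph_lexicon = {("walked", frozenset({"PST", "FIN", "IND"})),
                    ("walks", frozenset({"PRS", "SG", "FIN", "IND"}))}
tokens = [("walked", "Mood=Ind|Tense=Past|VerbForm=Fin"),
          ("walks", "Mood=Ind|Number=Sing|Tense=Pres|VerbForm=Fin")]
hits = sum((form, convert(feats)) in unimorph_lexicon for form, feats in tokens)
print("recall:", hits / len(tokens))
```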
Scene Graph Parsing as Dependency Parsing
Title | Scene Graph Parsing as Dependency Parsing |
Authors | Yu-Siang Wang, Chenxi Liu, Xiaohui Zeng, Alan Yuille |
Abstract | In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation that considers objects together with their attributes and relations: this representation has been proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connects to dependency parses. Together with a careful redesign of label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing the best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications. |
Tasks | Dependency Parsing, Image Retrieval, Knowledge Graphs |
Published | 2018-03-25 |
URL | http://arxiv.org/abs/1803.09189v1 |
PDF | http://arxiv.org/pdf/1803.09189v1.pdf |
PWC | https://paperswithcode.com/paper/scene-graph-parsing-as-dependency-parsing |
Repo | |
Framework | |
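As a rough illustration of the F-score similarity metric mentioned above, the sketch below scores a predicted scene graph against a ground-truth graph by exact overlap of object/attribute/relation tuples; the paper's evaluation uses a more forgiving graph-matching scheme, so this is only a simplified stand-in.

```python
# Illustrative sketch (assumption): F-score between two scene graphs, each
# flattened into a set of object, attribute and relation tuples.
def scene_graph_f1(pred: set, gold: set) -> float:
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("man",), ("horse",), ("man", "young"), ("man", "riding", "horse")}
pred = {("man",), ("horse",), ("man", "feeding", "horse")}
print(round(scene_graph_f1(pred, gold), 3))
```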
Important Attribute Identification in Knowledge Graph
Title | Important Attribute Identification in Knowledge Graph |
Authors | Shengjie Sun, Dong Yang, Hongchun Zhang, Yanxu Chen, Chao Wei, Xiaonan Meng, Yi Hu |
Abstract | The knowledge graph (KG), composed of entities with their descriptions and attributes and the relationships between entities, is finding more and more application scenarios in various natural language processing tasks. In a typical knowledge graph like Wikidata, entities usually have a large number of attributes, but it is difficult to know which ones are important. The importance of attributes can be a valuable piece of information in various applications spanning from information retrieval to natural language generation. In this paper, we propose a general method of using external user-generated text data to evaluate the relative importance of an entity's attributes. To be more specific, we use word/sub-word embedding techniques to match the external textual data back to entities' attribute names and values and rank the attributes by their matching cohesiveness. To the best of our knowledge, this is the first work applying vector-based semantic matching to important attribute identification, and our method outperforms previous traditional methods. We also apply the outcome of the detected important attributes to a language generation task; compared with previously generated text, the new method generates much more customized and informative messages. |
Tasks | Information Retrieval, Text Generation |
Published | 2018-10-12 |
URL | http://arxiv.org/abs/1810.05320v1 |
PDF | http://arxiv.org/pdf/1810.05320v1.pdf |
PWC | https://paperswithcode.com/paper/important-attribute-identification-in |
Repo | |
Framework | |
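A toy sketch of the embedding-based matching idea: attributes are ranked by how strongly tokens from external user-generated text match the attribute name or its values in a shared vector space. The tiny hand-crafted vectors below stand in for trained word/sub-word embeddings, and cosine similarity stands in for the paper's matching-cohesiveness score.

```python
# Illustrative sketch (assumption): rank an entity's attributes by how well
# external review text matches them in an embedding space.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Tiny hand-crafted vectors standing in for trained word/sub-word embeddings.
embed = {
    "battery": np.array([1.0, 0.0, 0.0]), "charge":  np.array([0.9, 0.1, 0.0]),
    "screen":  np.array([0.0, 1.0, 0.0]), "display": np.array([0.1, 0.9, 0.0]),
    "camera":  np.array([0.0, 0.0, 1.0]), "photo":   np.array([0.0, 0.1, 0.9]),
}
attributes = {"battery": ["charge"], "screen": ["display"], "camera": ["photo"]}
review_tokens = ["charge", "charge", "display"]   # external user-generated text

# Score each attribute by the mean best match between review tokens and the
# attribute's name or values.
scores = {
    attr: float(np.mean([max(cosine(embed[t], embed[v]) for v in [attr] + vals)
                         for t in review_tokens]))
    for attr, vals in attributes.items()
}
print(sorted(scores, key=scores.get, reverse=True))   # e.g. ['battery', 'screen', 'camera']
```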
Collective Online Learning of Gaussian Processes in Massive Multi-Agent Systems
Title | Collective Online Learning of Gaussian Processes in Massive Multi-Agent Systems |
Authors | Trong Nghia Hoang, Quang Minh Hoang, Kian Hsiang Low, Jonathan How |
Abstract | Distributed machine learning (ML) is a modern computation paradigm that divides its workload into independent tasks that can be simultaneously achieved by multiple machines (i.e., agents) for better scalability. However, a typical distributed system is usually implemented with a central server that collects data statistics from multiple independent machines operating on different subsets of data to build a global analytic model. This centralized communication architecture exposes a single choke point for operational failure and places severe bottlenecks on the server's communication and computation capacities, as it has to process a growing volume of communication from a crowd of learning agents. To mitigate these bottlenecks, this paper introduces a novel Collective Online Learning Gaussian Process framework for massive distributed systems that allows each agent to build its local model, which can be exchanged and combined efficiently with others via peer-to-peer communication to converge on a global model of higher quality. Finally, our empirical results consistently demonstrate the efficiency of our framework on both synthetic and real-world datasets. |
Tasks | Gaussian Processes |
Published | 2018-05-23 |
URL | http://arxiv.org/abs/1805.09266v2 |
PDF | http://arxiv.org/pdf/1805.09266v2.pdf |
PWC | https://paperswithcode.com/paper/collective-online-learning-of-gaussian |
Repo | |
Framework | |
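As a rough illustration of combining per-agent models, the sketch below fuses local Gaussian process predictions by precision-weighted averaging (a simple product-of-experts rule); this is an assumed stand-in, not the paper's actual fusion scheme.

```python
# Illustrative sketch (assumption): combine per-agent GP predictive means and
# variances into a global estimate via precision-weighted averaging.
import numpy as np

def fuse(means, variances):
    """means, variances: (num_agents, num_test) local GP predictive moments."""
    precisions = 1.0 / np.asarray(variances)
    var = 1.0 / precisions.sum(axis=0)
    mean = var * (precisions * np.asarray(means)).sum(axis=0)
    return mean, var

local_means = [[0.9, 1.8], [1.1, 2.2], [1.0, 2.0]]   # three agents, two test points
local_vars = [[0.20, 0.50], [0.10, 0.40], [0.30, 0.30]]
print(fuse(local_means, local_vars))
```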
Deep Net Triage: Analyzing the Importance of Network Layers via Structural Compression
Title | Deep Net Triage: Analyzing the Importance of Network Layers via Structural Compression |
Authors | Theodore S. Nowak, Jason J. Corso |
Abstract | Despite their prevalence, deep networks are poorly understood. This is due, at least in part, to their highly parameterized nature. As such, while certain structures have been found to work better than others, the significance of a model’s unique structure, or the importance of a given layer, and how these translate to overall accuracy, remains unclear. In this paper, we analyze these properties of deep neural networks via a process we term deep net triage. Like medical triage—the assessment of the importance of various wounds—we assess the importance of layers in a neural network, or as we call it, their criticality. We do this by applying structural compression, whereby we reduce a block of layers to a single layer. After compressing a set of layers, we apply a combination of initialization and training schemes, and look at network accuracy, convergence, and the layer’s learned filters to assess the criticality of the layer. We apply this analysis across four data sets of varying complexity. We find that the accuracy of the model does not depend on which layer was compressed; that accuracy can be recovered or exceeded after compression by fine-tuning across the entire model; and, lastly, that Knowledge Distillation can be used to hasten convergence of a compressed network, but constrains the accuracy attainable to that of the base model. |
Tasks | |
Published | 2018-01-15 |
URL | http://arxiv.org/abs/1801.04651v2 |
PDF | http://arxiv.org/pdf/1801.04651v2.pdf |
PWC | https://paperswithcode.com/paper/deep-net-triage-analyzing-the-importance-of |
Repo | |
Framework | |
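A minimal sketch of structural compression as described above: a block of layers in a toy CNN is replaced by a single layer of matching input/output width, after which the compressed model would be re-initialized or fine-tuned to probe the block's criticality. The architecture below is an illustrative stand-in, not one of the paper's models.

```python
# Illustrative sketch (assumption): replace a block of layers with a single
# conv layer (plus activation) of the same in/out width.
import torch
import torch.nn as nn

base = nn.Sequential(                                  # toy stand-in for a trained network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),        # block to compress (indices 2-5)
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

def structurally_compress(model: nn.Sequential, start: int, end: int) -> nn.Sequential:
    """Replace layers [start, end) with a single conv layer of matching width."""
    replacement = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
    layers = list(model[:start]) + [replacement] + list(model[end:])
    return nn.Sequential(*layers)

compressed = structurally_compress(base, 2, 6)
x = torch.rand(1, 3, 32, 32)
print(base(x).shape, compressed(x).shape)   # both (1, 10); fine-tune `compressed` next
```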
Siamese Capsule Networks
Title | Siamese Capsule Networks |
Authors | James O'Neill |
Abstract | Capsule Networks have shown encouraging results on \textit{de facto} benchmark computer vision datasets such as MNIST, CIFAR and smallNORB. However, they have yet to be tested on tasks where (1) the entities detected inherently have more complex internal representations, (2) there are very few instances per class to learn from, and (3) point-wise classification is not suitable. Hence, this paper carries out experiments on face verification in both controlled and uncontrolled settings that together address these points. In doing so we introduce \textit{Siamese Capsule Networks}, a new variant that can be used for pairwise learning tasks. The model is trained using contrastive loss with $\ell_2$-normalized capsule encoded pose features. We find that \textit{Siamese Capsule Networks} perform well against strong baselines on both pairwise learning datasets, yielding best results in the few-shot learning setting where image pairs in the test set contain unseen subjects. |
Tasks | Face Verification, Few-Shot Learning |
Published | 2018-05-18 |
URL | http://arxiv.org/abs/1805.07242v1 |
PDF | http://arxiv.org/pdf/1805.07242v1.pdf |
PWC | https://paperswithcode.com/paper/siamese-capsule-networks |
Repo | |
Framework | |
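A minimal sketch of the training objective described above: a contrastive loss over L2-normalized embedding pairs, with generic encoder outputs standing in for the capsule-encoded pose features.

```python
# Illustrative sketch (assumption): contrastive loss for pairwise verification
# on L2-normalised embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same: torch.Tensor, margin: float = 1.0):
    """z1, z2: (B, D) embeddings; same: (B,) 1 if the pair shows the same subject."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    d = (z1 - z2).norm(dim=1)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

z1, z2 = torch.randn(8, 64), torch.randn(8, 64)   # stand-ins for capsule pose features
labels = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(z1, z2, labels).item())
```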
DPW-SDNet: Dual Pixel-Wavelet Domain Deep CNNs for Soft Decoding of JPEG-Compressed Images
Title | DPW-SDNet: Dual Pixel-Wavelet Domain Deep CNNs for Soft Decoding of JPEG-Compressed Images |
Authors | Honggang Chen, Xiaohai He, Linbo Qing, Shuhua Xiong, Truong Q. Nguyen |
Abstract | JPEG is one of the widely used lossy compression methods. JPEG-compressed images usually suffer from compression artifacts including blocking and blurring, especially at low bit-rates. Soft decoding is an effective solution to improve the quality of compressed images without changing codec or introducing extra coding bits. Inspired by the excellent performance of the deep convolutional neural networks (CNNs) on both low-level and high-level computer vision problems, we develop a dual pixel-wavelet domain deep CNNs-based soft decoding network for JPEG-compressed images, namely DPW-SDNet. The pixel domain deep network takes the four downsampled versions of the compressed image to form a 4-channel input and outputs a pixel domain prediction, while the wavelet domain deep network uses the 1-level discrete wavelet transformation (DWT) coefficients to form a 4-channel input to produce a DWT domain prediction. The pixel domain and wavelet domain estimates are combined to generate the final soft decoded result. Experimental results demonstrate the superiority of the proposed DPW-SDNet over several state-of-the-art compression artifacts reduction algorithms. |
Tasks | |
Published | 2018-05-27 |
URL | http://arxiv.org/abs/1805.10558v1 |
PDF | http://arxiv.org/pdf/1805.10558v1.pdf |
PWC | https://paperswithcode.com/paper/dpw-sdnet-dual-pixel-wavelet-domain-deep-cnns |
Repo | |
Framework | |
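A sketch of the two 4-channel inputs described above, assuming the four downsampled versions correspond to a polyphase (space-to-depth) decomposition and using PyWavelets for the 1-level DWT; the soft-decoding CNNs themselves are omitted.

```python
# Illustrative sketch (assumption): build the 4-channel pixel-domain and
# wavelet-domain inputs of a DPW-SDNet-style soft decoder.
import numpy as np
import pywt  # PyWavelets

def pixel_branch_input(img: np.ndarray) -> np.ndarray:
    """img: (H, W) grayscale; returns (4, H/2, W/2) polyphase downsampled stack."""
    return np.stack([img[0::2, 0::2], img[0::2, 1::2],
                     img[1::2, 0::2], img[1::2, 1::2]])

def wavelet_branch_input(img: np.ndarray) -> np.ndarray:
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")      # 1-level DWT coefficients
    return np.stack([cA, cH, cV, cD])

img = np.random.rand(64, 64).astype(np.float32)    # stand-in for a JPEG-decoded image
print(pixel_branch_input(img).shape, wavelet_branch_input(img).shape)
```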
Refining Source Representations with Relation Networks for Neural Machine Translation
Title | Refining Source Representations with Relation Networks for Neural Machine Translation |
Authors | Wen Zhang, Jiawei Hu, Yang Feng, Qun Liu |
Abstract | Although neural machine translation with the encoder-decoder framework has achieved great success recently, it still suffers from two drawbacks: forgetting distant information, an inherent disadvantage of the recurrent neural network structure, and disregarding the relationships between source words during the encoding step. In practice, however, this information and these relationships are often useful at the current step. We aim to solve these problems by introducing relation networks to learn better representations of the source. The relation networks strengthen the memorization capability of the recurrent neural network by associating source words with each other, which also helps retain their relationships. Then the source representations and all the relations are fed into the attention component together while decoding, with the main encoder-decoder framework unchanged. Experiments on several datasets show that our method can improve the translation performance significantly over the conventional encoder-decoder model and even outperform an approach involving supervised syntactic knowledge. |
Tasks | Machine Translation |
Published | 2018-05-25 |
URL | http://arxiv.org/abs/1805.11154v2 |
PDF | http://arxiv.org/pdf/1805.11154v2.pdf |
PWC | https://paperswithcode.com/paper/refining-source-representations-with-relation-1 |
Repo | |
Framework | |
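A minimal sketch of a relation layer over source representations: each encoder state is refined by aggregating pairwise combinations with every other state before being handed to the attention component. The specific pairwise MLP and residual mean-aggregation below are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch (assumption): refine encoder states with a pairwise
# relation module before attention.
import torch
import torch.nn as nn

class RelationLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.pair = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        """src: (T, D) encoder states; returns (T, D) relation-refined states."""
        T, D = src.shape
        pairs = torch.cat([src.unsqueeze(1).expand(T, T, D),
                           src.unsqueeze(0).expand(T, T, D)], dim=-1)
        return src + self.pair(pairs).mean(dim=1)   # residual aggregation over partners

states = torch.randn(7, 128)                        # 7 source words, 128-d encoder states
print(RelationLayer(128)(states).shape)
```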
Unsupervised Multi-modal Neural Machine Translation
Title | Unsupervised Multi-modal Neural Machine Translation |
Authors | Yuanhang Su, Kai Fan, Nguyen Bach, C. -C. Jay Kuo, Fei Huang |
Abstract | Unsupervised neural machine translation (UNMT) has recently achieved remarkable results with only large monolingual corpora in each language. However, the uncertainty of associating target with source sentences makes UNMT theoretically an ill-posed problem. This work investigates the possibility of utilizing images for disambiguation to improve the performance of UNMT. Our assumption is intuitively based on the invariant property of images, i.e., the description of the same visual content by different languages should be approximately similar. We propose an unsupervised multi-modal machine translation (UMNMT) framework based on the language translation cycle consistency loss conditional on the image, aiming to learn the bidirectional multi-modal translation simultaneously. Through alternating training between multi-modal and uni-modal data, our inference model can translate with or without the image. On the widely used Multi30K dataset, the experimental results of our approach are significantly better than those of the text-only UNMT on the 2016 test dataset. |
Tasks | Machine Translation |
Published | 2018-11-28 |
URL | https://arxiv.org/abs/1811.11365v2 |
PDF | https://arxiv.org/pdf/1811.11365v2.pdf |
PWC | https://paperswithcode.com/paper/unsupervised-multi-modal-neural-machine |
Repo | |
Framework | |