Paper Group ANR 912
Automatic Prediction of Building Age from Photographs. Learning to Grasp from a Single Demonstration. Seeing Tree Structure from Vibration. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. ViS-HuD: Using Visual Saliency to Improve Human Detection with Convolutional Neural Networks. Optimized Skeleton-based Action Recognition via Sparsified Graph Regression. Marrying Universal Dependencies and Universal Morphology. Scene Graph Parsing as Dependency Parsing. Important Attribute Identification in Knowledge Graph. Collective Online Learning of Gaussian Processes in Massive Multi-Agent Systems. Deep Net Triage: Analyzing the Importance of Network Layers via Structural Compression. Siamese Capsule Networks. DPW-SDNet: Dual Pixel-Wavelet Domain Deep CNNs for Soft Decoding of JPEG-Compressed Images. Refining Source Representations with Relation Networks for Neural Machine Translation. Unsupervised Multi-modal Neural Machine Translation.
Automatic Prediction of Building Age from Photographs
Title | Automatic Prediction of Building Age from Photographs |
Authors | Matthias Zeppelzauer, Miroslav Despotovic, Muntaha Sakeena, David Koch, Mario Döller |
Abstract | We present a first method for the automated age estimation of buildings from unconstrained photographs. To this end, we propose a two-stage approach that first learns characteristic visual patterns for different building epochs at patch level and then globally aggregates patch-level age estimates over the building. We compile evaluation datasets from different sources and perform a detailed evaluation of our approach, its sensitivity to parameters, and the capabilities of the employed deep networks to learn characteristic visual age-related patterns. Results show that our approach is able to estimate building age at a surprisingly high level that even outperforms human evaluators and thereby sets a new performance baseline. This work represents a first step towards the automated assessment of building parameters for automated price prediction. |
Tasks | Age Estimation |
Published | 2018-04-06 |
URL | http://arxiv.org/abs/1804.02205v2 |
PDF | http://arxiv.org/pdf/1804.02205v2.pdf |
PWC | https://paperswithcode.com/paper/automatic-prediction-of-building-age-from |
Repo | |
Framework | |
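The two-stage idea above (patch-level classification of building epochs, then global aggregation over the building) can be illustrated with a minimal sketch. The epoch labels and the probability-averaging rule below are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch (not the authors' code): patch-level epoch estimates are
# aggregated into a single building-level age prediction by averaging the
# per-patch class probabilities of an assumed patch classifier.
import numpy as np

EPOCHS = ["pre-1900", "1900-1945", "1946-1975", "1976-2000", "post-2000"]  # assumed label set

def aggregate_patch_predictions(patch_probs: np.ndarray) -> str:
    """patch_probs: (num_patches, num_epochs) softmax outputs of a patch classifier."""
    building_probs = patch_probs.mean(axis=0)      # global aggregation over patches
    return EPOCHS[int(building_probs.argmax())]    # most likely construction epoch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_patch_probs = rng.dirichlet(np.ones(len(EPOCHS)), size=32)  # stand-in for CNN outputs
    print(aggregate_patch_predictions(fake_patch_probs))
```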
Learning to Grasp from a Single Demonstration
Title | Learning to Grasp from a Single Demonstration |
Authors | Pieter Van Molle, Tim Verbelen, Elias De Coninck, Cedric De Boom, Pieter Simoens, Bart Dhoedt |
Abstract | Learning-based approaches for robotic grasping using visual sensors typically require collecting a large dataset, either manually labeled or gathered through many trials and errors of a robotic manipulator in the real or simulated world. We propose a simpler learning-from-demonstration approach that is able to detect the object to grasp from merely a single demonstration using a convolutional neural network we call GraspNet. In order to increase robustness and decrease the training time even further, we leverage data from previous demonstrations to quickly fine-tune a GraspNet for each new demonstration. We present some preliminary results on a grasping experiment with the Franka Panda cobot for which we can train a GraspNet with only hundreds of training iterations. |
Tasks | Robotic Grasping |
Published | 2018-06-09 |
URL | http://arxiv.org/abs/1806.03486v1 |
PDF | http://arxiv.org/pdf/1806.03486v1.pdf |
PWC | https://paperswithcode.com/paper/learning-to-grasp-from-a-single-demonstration |
Repo | |
Framework | |
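A minimal sketch of the single-demonstration fine-tuning idea, under the assumption that the demonstrated grasp is encoded as a target heatmap and the backbone is a small fully convolutional stand-in (not the paper's GraspNet architecture):

```python
# Illustrative sketch (assumptions, not the paper's GraspNet): fine-tune a small
# fully convolutional network on a single demonstration image whose grasp point
# is given as a target heatmap, for a few hundred iterations.
import torch
import torch.nn as nn

net = nn.Sequential(                     # stand-in for a pretrained GraspNet backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),                 # per-pixel grasp score
)

demo_image = torch.rand(1, 3, 64, 64)    # the single demonstration frame
target = torch.zeros(1, 1, 64, 64)
target[0, 0, 30:34, 30:34] = 1.0         # demonstrated grasp location as a heatmap

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(300):                  # "only hundreds of train iterations"
    opt.zero_grad()
    loss = loss_fn(net(demo_image), target)
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```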
Seeing Tree Structure from Vibration
Title | Seeing Tree Structure from Vibration |
Authors | Tianfan Xue, Jiajun Wu, Zhoutong Zhang, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman |
Abstract | Humans recognize object structure from both their appearance and motion; often, motion helps to resolve ambiguities in object structure that arise when we observe object appearance only. There are particular scenarios, however, where neither appearance nor spatial-temporal motion signals are informative: occluding twigs may look connected and have almost identical movements, though they belong to different, possibly disconnected branches. We propose to tackle this problem through spectrum analysis of motion signals, because vibrations of disconnected branches, though visually similar, often have distinctive natural frequencies. We propose a novel formulation of tree structure based on a physics-based link model, and validate its effectiveness by theoretical analysis, numerical simulation, and empirical experiments. With this formulation, we use nonparametric Bayesian inference to reconstruct tree structure from both spectral vibration signals and appearance cues. Our model performs well in recognizing hierarchical tree structure from real-world videos of trees and vessels. |
Tasks | Bayesian Inference |
Published | 2018-09-13 |
URL | http://arxiv.org/abs/1809.05067v1 |
PDF | http://arxiv.org/pdf/1809.05067v1.pdf |
PWC | https://paperswithcode.com/paper/seeing-tree-structure-from-vibration |
Repo | |
Framework | |
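A minimal sketch of the spectral cue the abstract describes: disconnected branches tend to vibrate at distinct natural frequencies, so the dominant frequencies of two tracked trajectories hint at whether they share a branch. The synthetic signals and the 0.5 Hz threshold are assumptions for illustration.

```python
# Illustrative sketch (assumption, not the authors' model): compare the dominant
# vibration frequency of two tracked point trajectories as a connectivity cue.
import numpy as np

def dominant_frequency(trajectory: np.ndarray, fps: float) -> float:
    """trajectory: 1-D displacement signal of a tracked point over time."""
    spectrum = np.abs(np.fft.rfft(trajectory - trajectory.mean()))
    freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / fps)
    return float(freqs[spectrum.argmax()])

t = np.arange(0, 4, 1 / 30)                       # 4 s of video at 30 fps
branch_a = np.sin(2 * np.pi * 2.0 * t)            # twig oscillating at ~2 Hz
branch_b = np.sin(2 * np.pi * 3.5 * t)            # visually similar twig at ~3.5 Hz
same = abs(dominant_frequency(branch_a, 30) - dominant_frequency(branch_b, 30)) < 0.5
print("likely same branch:", same)
```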
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining
Title | Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining |
Authors | Yundong Zhang, Juan Carlos Niebles, Alvaro Soto |
Abstract | A key aspect of interpretable VQA models is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve higher correlation with manually annotated groundings, while achieving state-of-the-art VQA accuracy. |
Tasks | Question Answering, Visual Question Answering |
Published | 2018-08-01 |
URL | http://arxiv.org/abs/1808.00265v1 |
PDF | http://arxiv.org/pdf/1808.00265v1.pdf |
PWC | https://paperswithcode.com/paper/interpretable-visual-question-answering-by-1 |
Repo | |
Framework | |
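One way to use mined groundings, sketched below under the assumption that the supervision enters as a KL-divergence term between the model's region attention and a soft grounding mask (the paper's exact loss may differ):

```python
# Illustrative sketch (assumption): supervise a VQA attention map with a "mined"
# grounding mask by adding a KL-divergence term to the usual answer loss.
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn_logits, mined_mask):
    """attn_logits: (B, R) unnormalised attention over R image regions.
    mined_mask:  (B, R) soft grounding mask derived from region descriptions."""
    log_attn = F.log_softmax(attn_logits, dim=1)
    target = mined_mask / mined_mask.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return F.kl_div(log_attn, target, reduction="batchmean")

attn = torch.randn(4, 36, requires_grad=True)   # e.g. 36 bottom-up regions
mask = torch.zeros(4, 36)
mask[:, :3] = 1.0                               # regions matched to the answer phrase
print(attention_supervision_loss(attn, mask).item())
```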
ViS-HuD: Using Visual Saliency to Improve Human Detection with Convolutional Neural Networks
Title | ViS-HuD: Using Visual Saliency to Improve Human Detection with Convolutional Neural Networks |
Authors | Vandit Gajjar, Yash Khandhediya, Ayesha Gurnani, Viraj Mavani, Mehul S. Raval |
Abstract | The paper presents a technique to improve human detection in still images using deep learning. Our novel method, ViS-HuD, computes a visual saliency map from the image; the input image is then multiplied by the map and the product is fed to a Convolutional Neural Network (CNN) which detects humans in the image. The visual saliency map is generated using ML-Net and human detection is carried out using DetectNet; ML-Net is pre-trained on SALICON for visual saliency detection, while DetectNet is pre-trained on the ImageNet database for image classification. The CNNs of ViS-HuD were trained on two challenging databases - Penn Fudan and TUD-Brussels Benchmark. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on the Penn Fudan dataset with 91.4% human detection accuracy, and an average miss rate of 53% on the TUD-Brussels benchmark. |
Tasks | Human Detection, Image Classification, Saliency Detection |
Published | 2018-02-21 |
URL | http://arxiv.org/abs/1803.01687v3 |
PDF | http://arxiv.org/pdf/1803.01687v3.pdf |
PWC | https://paperswithcode.com/paper/vis-hud-using-visual-saliency-to-improve |
Repo | |
Framework | |
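The core preprocessing step described above is easy to illustrate: modulate the input image by a saliency map before handing it to the detector. In the sketch below, random arrays stand in for the actual ML-Net saliency prediction and the DetectNet input.

```python
# Illustrative sketch (assumption): the ViS-HuD preprocessing step, i.e.
# multiplying the input image by a visual saliency map before detection.
import numpy as np

def apply_saliency(image: np.ndarray, saliency: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) uint8; saliency: (H, W) map in [0, 1] (e.g. from ML-Net)."""
    modulated = image.astype(np.float32) * saliency[..., None]
    return modulated.astype(np.uint8)      # fed to the human detector (e.g. DetectNet)

img = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
sal = np.random.rand(240, 320)             # stand-in for an ML-Net saliency prediction
print(apply_saliency(img, sal).shape)
```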
Optimized Skeleton-based Action Recognition via Sparsified Graph Regression
Title | Optimized Skeleton-based Action Recognition via Sparsified Graph Regression |
Authors | Xiang Gao, Wei Hu, Jiaxiang Tang, Jiaying Liu, Zongming Guo |
Abstract | With the prevalence of accessible depth sensors, dynamic human body skeletons have attracted much attention as a robust modality for action recognition. Previous methods model skeletons based on RNN or CNN, which has limited expressive power for irregular skeleton joints. While graph convolutional networks (GCN) have been proposed to address irregular graph-structured data, the fundamental graph construction remains challenging. In this paper, we represent skeletons naturally on graphs, and propose a graph regression based GCN (GR-GCN) for skeleton-based action recognition, aiming to capture the spatio-temporal variation in the data. As the graph representation is crucial to graph convolution, we first propose graph regression to statistically learn the underlying graph from multiple observations. In particular, we provide spatio-temporal modeling of skeletons and pose an optimization problem on the graph structure over consecutive frames, which enforces the sparsity of the underlying graph for efficient representation. The optimized graph not only connects each joint to its neighboring joints in the same frame strongly or weakly, but also links with relevant joints in the previous and subsequent frames. We then feed the optimized graph into the GCN along with the coordinates of the skeleton sequence for feature learning, where we deploy high-order and fast Chebyshev approximation of spectral graph convolution. Further, we provide analysis of the variation characterization by the Chebyshev approximation. Experimental results validate the effectiveness of the proposed graph regression and show that the proposed GR-GCN achieves the state-of-the-art performance on the widely used NTU RGB+D, UT-Kinect and SYSU 3D datasets. |
Tasks | graph construction, Graph Regression, Skeleton Based Action Recognition, Temporal Action Localization |
Published | 2018-11-29 |
URL | http://arxiv.org/abs/1811.12013v2 |
PDF | http://arxiv.org/pdf/1811.12013v2.pdf |
PWC | https://paperswithcode.com/paper/generalized-graph-convolutional-networks-for |
Repo | |
Framework | |
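A minimal sketch of the Chebyshev-approximated spectral graph convolution the abstract mentions, applied to skeleton joint features with a given (here random) adjacency. The symmetric normalisation and the assumption lambda_max ≈ 2 follow common practice rather than the paper's exact formulation.

```python
# Illustrative sketch (assumption): an order-K Chebyshev spectral graph
# convolution on skeleton joint features, given a learned adjacency matrix.
import numpy as np

def chebyshev_gcn_layer(X, A, weights):
    """X: (N, F_in) joint features; A: (N, N) learned adjacency;
    weights: list of K (F_in, F_out) matrices, one per Chebyshev order."""
    N = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg, np.inf) ** -0.5
    L = np.eye(N) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # normalised Laplacian
    L_tilde = L - np.eye(N)            # rescaled Laplacian, assuming lambda_max ~ 2
    Tx = [X, L_tilde @ X]              # Chebyshev recurrence: T_0 X = X, T_1 X = L~ X
    for _ in range(2, len(weights)):
        Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])
    return sum(T @ W for T, W in zip(Tx, weights))

N, F_in, F_out, K = 25, 3, 16, 3       # e.g. 25 NTU RGB+D joints with 3-D coordinates
A = (np.random.rand(N, N) > 0.8).astype(float)
A = (A + A.T) / 2                      # symmetric stand-in for the optimized graph
W = [np.random.randn(F_in, F_out) * 0.1 for _ in range(K)]
print(chebyshev_gcn_layer(np.random.randn(N, F_in), A, W).shape)
```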
Marrying Universal Dependencies and Universal Morphology
Title | Marrying Universal Dependencies and Universal Morphology |
Authors | Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky |
Abstract | The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects. |
Tasks | |
Published | 2018-10-15 |
URL | http://arxiv.org/abs/1810.06743v1 |
PDF | http://arxiv.org/pdf/1810.06743v1.pdf |
PWC | https://paperswithcode.com/paper/marrying-universal-dependencies-and-universal |
Repo | |
Framework | |
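A tiny sketch of the kind of deterministic feature mapping and lookup-based recall evaluation described above; the mapping entries and the miniature lexicon are illustrative assumptions, not the paper's full resources.

```python
# Illustrative sketch (assumption): map UD morphological features to UniMorph
# tags with a deterministic table, then measure recall by lookup in a
# UniMorph-style lexicon.
UD_TO_UNIMORPH = {            # hypothetical subset of the mapping
    "Number=Sing": "SG", "Number=Plur": "PL",
    "Tense=Past": "PST", "Tense=Pres": "PRS",
    "VerbForm=Fin": "FIN", "Mood=Ind": "IND",
}

def convert(ud_feats: str) -> frozenset:
    return frozenset(UD_TO_UNIMORPH[f] for f in ud_feats.split("|") if f in UD_TO_UNIMORPH)

unimorph_lexicon = {("walked", frozenset({"PST", "FIN", "IND"})),
                    ("walks", frozenset({"PRS", "SG", "FIN", "IND"}))}
tokens = [("walked", "Mood=Ind|Tense=Past|VerbForm=Fin"),
          ("walks", "Mood=Ind|Number=Sing|Tense=Pres|VerbForm=Fin")]
hits = sum((form, convert(feats)) in unimorph_lexicon for form, feats in tokens)
print("recall:", hits / len(tokens))
```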
Scene Graph Parsing as Dependency Parsing
Title | Scene Graph Parsing as Dependency Parsing |
Authors | Yu-Siang Wang, Chenxi Liu, Xiaohui Zeng, Alan Yuille |
Abstract | In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation that considers objects together with their attributes and relations: this representation has been proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connects to dependency parses. Together with a careful redesign of label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing the best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications. |
Tasks | Dependency Parsing, Image Retrieval, Knowledge Graphs |
Published | 2018-03-25 |
URL | http://arxiv.org/abs/1803.09189v1 |
PDF | http://arxiv.org/pdf/1803.09189v1.pdf |
PWC | https://paperswithcode.com/paper/scene-graph-parsing-as-dependency-parsing |
Repo | |
Framework | |
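As a rough illustration of the F-score similarity metric mentioned above, the sketch below scores a predicted scene graph against a ground-truth graph by exact overlap of object/attribute/relation tuples; the paper's evaluation uses a more forgiving graph-matching scheme, so this is only a simplified stand-in.

```python
# Illustrative sketch (assumption): F-score between two scene graphs, each
# flattened into a set of object, attribute and relation tuples.
def scene_graph_f1(pred: set, gold: set) -> float:
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("man",), ("horse",), ("man", "young"), ("man", "riding", "horse")}
pred = {("man",), ("horse",), ("man", "feeding", "horse")}
print(round(scene_graph_f1(pred, gold), 3))
```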
Important Attribute Identification in Knowledge Graph
Title | Important Attribute Identification in Knowledge Graph |
Authors | Shengjie Sun, Dong Yang, Hongchun Zhang, Yanxu Chen, Chao Wei, Xiaonan Meng, Yi Hu |
Abstract | The knowledge graph (KG), composed of entities with their descriptions and attributes and the relationships between entities, is finding more and more application scenarios in various natural language processing tasks. In a typical knowledge graph like Wikidata, entities usually have a large number of attributes, but it is difficult to know which ones are important. The importance of attributes can be a valuable piece of information in various applications spanning from information retrieval to natural language generation. In this paper, we propose a general method of using external user-generated text data to evaluate the relative importance of an entity's attributes. To be more specific, we use word/sub-word embedding techniques to match the external textual data back to entities' attribute names and values and rank the attributes by their matching cohesiveness. To the best of our knowledge, this is the first work applying vector-based semantic matching to important attribute identification, and our method outperforms previous traditional methods. We also apply the outcome of the detected important attributes to a language generation task; compared with previously generated text, the new method generates much more customized and informative messages. |
Tasks | Information Retrieval, Text Generation |
Published | 2018-10-12 |
URL | http://arxiv.org/abs/1810.05320v1 |
PDF | http://arxiv.org/pdf/1810.05320v1.pdf |
PWC | https://paperswithcode.com/paper/important-attribute-identification-in |
Repo | |
Framework | |
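A toy sketch of the embedding-based matching idea: attributes are ranked by how strongly tokens from external user-generated text match the attribute name or its values in a shared vector space. The tiny hand-crafted vectors below stand in for trained word/sub-word embeddings, and cosine similarity stands in for the paper's matching-cohesiveness score.

```python
# Illustrative sketch (assumption): rank an entity's attributes by how well
# external review text matches them in an embedding space.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Tiny hand-crafted vectors standing in for trained word/sub-word embeddings.
embed = {
    "battery": np.array([1.0, 0.0, 0.0]), "charge":  np.array([0.9, 0.1, 0.0]),
    "screen":  np.array([0.0, 1.0, 0.0]), "display": np.array([0.1, 0.9, 0.0]),
    "camera":  np.array([0.0, 0.0, 1.0]), "photo":   np.array([0.0, 0.1, 0.9]),
}
attributes = {"battery": ["charge"], "screen": ["display"], "camera": ["photo"]}
review_tokens = ["charge", "charge", "display"]   # external user-generated text

# Score each attribute by the mean best match between review tokens and the
# attribute's name or values.
scores = {
    attr: float(np.mean([max(cosine(embed[t], embed[v]) for v in [attr] + vals)
                         for t in review_tokens]))
    for attr, vals in attributes.items()
}
print(sorted(scores, key=scores.get, reverse=True))   # e.g. ['battery', 'screen', 'camera']
```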
Collective Online Learning of Gaussian Processes in Massive Multi-Agent Systems
Title | Collective Online Learning of Gaussian Processes in Massive Multi-Agent Systems |
Authors | Trong Nghia Hoang, Quang Minh Hoang, Kian Hsiang Low, Jonathan How |
Abstract | Distributed machine learning (ML) is a modern computation paradigm that divides its workload into independent tasks that can be simultaneously achieved by multiple machines (i.e., agents) for better scalability. However, a typical distributed system is usually implemented with a central server that collects data statistics from multiple independent machines operating on different subsets of data to build a global analytic model. This centralized communication architecture exposes a single choke point for operational failure and places severe bottlenecks on the server's communication and computation capacities, as it has to process a growing volume of communication from a crowd of learning agents. To mitigate these bottlenecks, this paper introduces a novel Collective Online Learning Gaussian Process framework for massive distributed systems that allows each agent to build its local model, which can be exchanged and combined efficiently with others via peer-to-peer communication to converge on a global model of higher quality. Finally, our empirical results consistently demonstrate the efficiency of our framework on both synthetic and real-world datasets. |
Tasks | Gaussian Processes |
Published | 2018-05-23 |
URL | http://arxiv.org/abs/1805.09266v2 |
PDF | http://arxiv.org/pdf/1805.09266v2.pdf |
PWC | https://paperswithcode.com/paper/collective-online-learning-of-gaussian |
Repo | |
Framework | |
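As a rough illustration of combining per-agent models, the sketch below fuses local Gaussian process predictions by precision-weighted averaging (a simple product-of-experts rule); this is an assumed stand-in, not the paper's actual fusion scheme.

```python
# Illustrative sketch (assumption): combine per-agent GP predictive means and
# variances into a global estimate via precision-weighted averaging.
import numpy as np

def fuse(means, variances):
    """means, variances: (num_agents, num_test) local GP predictive moments."""
    precisions = 1.0 / np.asarray(variances)
    var = 1.0 / precisions.sum(axis=0)
    mean = var * (precisions * np.asarray(means)).sum(axis=0)
    return mean, var

local_means = [[0.9, 1.8], [1.1, 2.2], [1.0, 2.0]]   # three agents, two test points
local_vars = [[0.20, 0.50], [0.10, 0.40], [0.30, 0.30]]
print(fuse(local_means, local_vars))
```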
Deep Net Triage: Analyzing the Importance of Network Layers via Structural Compression
Title | Deep Net Triage: Analyzing the Importance of Network Layers via Structural Compression |
Authors | Theodore S. Nowak, Jason J. Corso |
Abstract | Despite their prevalence, deep networks are poorly understood. This is due, at least in part, to their highly parameterized nature. As such, while certain structures have been found to work better than others, the significance of a model’s unique structure, or the importance of a given layer, and how these translate to overall accuracy, remains unclear. In this paper, we analyze these properties of deep neural networks via a process we term deep net triage. Like medical triage—the assessment of the importance of various wounds—we assess the importance of layers in a neural network, or as we call it, their criticality. We do this by applying structural compression, whereby we reduce a block of layers to a single layer. After compressing a set of layers, we apply a combination of initialization and training schemes, and look at network accuracy, convergence, and the layer’s learned filters to assess the criticality of the layer. We apply this analysis across four data sets of varying complexity. We find that the accuracy of the model does not depend on which layer was compressed; that accuracy can be recovered or exceeded after compression by fine-tuning across the entire model; and, lastly, that Knowledge Distillation can be used to hasten convergence of a compressed network, but constrains the accuracy attainable to that of the base model. |
Tasks | |
Published | 2018-01-15 |
URL | http://arxiv.org/abs/1801.04651v2 |
PDF | http://arxiv.org/pdf/1801.04651v2.pdf |
PWC | https://paperswithcode.com/paper/deep-net-triage-analyzing-the-importance-of |
Repo | |
Framework | |
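A minimal sketch of structural compression as described above: a block of layers in a toy CNN is replaced by a single layer of matching input/output width, after which the compressed model would be re-initialized or fine-tuned to probe the block's criticality. The architecture below is an illustrative stand-in, not one of the paper's models.

```python
# Illustrative sketch (assumption): replace a block of layers with a single
# conv layer (plus activation) of the same in/out width.
import torch
import torch.nn as nn

base = nn.Sequential(                                  # toy stand-in for a trained network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),        # block to compress (indices 2-5)
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

def structurally_compress(model: nn.Sequential, start: int, end: int) -> nn.Sequential:
    """Replace layers [start, end) with a single conv layer of matching width."""
    replacement = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
    layers = list(model[:start]) + [replacement] + list(model[end:])
    return nn.Sequential(*layers)

compressed = structurally_compress(base, 2, 6)
x = torch.rand(1, 3, 32, 32)
print(base(x).shape, compressed(x).shape)   # both (1, 10); fine-tune `compressed` next
```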
Siamese Capsule Networks
Title | Siamese Capsule Networks |
Authors | James O'Neill |
Abstract | Capsule Networks have shown encouraging results on \textit{de facto} benchmark computer vision datasets such as MNIST, CIFAR and smallNORB. However, they have yet to be tested on tasks where (1) the entities detected inherently have more complex internal representations, (2) there are very few instances per class to learn from, and (3) point-wise classification is not suitable. Hence, this paper carries out experiments on face verification in both controlled and uncontrolled settings that together address these points. In doing so we introduce \textit{Siamese Capsule Networks}, a new variant that can be used for pairwise learning tasks. The model is trained using contrastive loss with $\ell_2$-normalized capsule encoded pose features. We find that \textit{Siamese Capsule Networks} perform well against strong baselines on both pairwise learning datasets, yielding best results in the few-shot learning setting where image pairs in the test set contain unseen subjects. |
Tasks | Face Verification, Few-Shot Learning |
Published | 2018-05-18 |
URL | http://arxiv.org/abs/1805.07242v1 |
PDF | http://arxiv.org/pdf/1805.07242v1.pdf |
PWC | https://paperswithcode.com/paper/siamese-capsule-networks |
Repo | |
Framework | |
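A minimal sketch of the training objective described above: a contrastive loss over L2-normalized embedding pairs, with generic encoder outputs standing in for the capsule-encoded pose features.

```python
# Illustrative sketch (assumption): contrastive loss for pairwise verification
# on L2-normalised embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same: torch.Tensor, margin: float = 1.0):
    """z1, z2: (B, D) embeddings; same: (B,) 1 if the pair shows the same subject."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    d = (z1 - z2).norm(dim=1)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

z1, z2 = torch.randn(8, 64), torch.randn(8, 64)   # stand-ins for capsule pose features
labels = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(z1, z2, labels).item())
```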
DPW-SDNet: Dual Pixel-Wavelet Domain Deep CNNs for Soft Decoding of JPEG-Compressed Images
Title | DPW-SDNet: Dual Pixel-Wavelet Domain Deep CNNs for Soft Decoding of JPEG-Compressed Images |
Authors | Honggang Chen, Xiaohai He, Linbo Qing, Shuhua Xiong, Truong Q. Nguyen |
Abstract | JPEG is one of the widely used lossy compression methods. JPEG-compressed images usually suffer from compression artifacts including blocking and blurring, especially at low bit-rates. Soft decoding is an effective solution to improve the quality of compressed images without changing codec or introducing extra coding bits. Inspired by the excellent performance of the deep convolutional neural networks (CNNs) on both low-level and high-level computer vision problems, we develop a dual pixel-wavelet domain deep CNNs-based soft decoding network for JPEG-compressed images, namely DPW-SDNet. The pixel domain deep network takes the four downsampled versions of the compressed image to form a 4-channel input and outputs a pixel domain prediction, while the wavelet domain deep network uses the 1-level discrete wavelet transformation (DWT) coefficients to form a 4-channel input to produce a DWT domain prediction. The pixel domain and wavelet domain estimates are combined to generate the final soft decoded result. Experimental results demonstrate the superiority of the proposed DPW-SDNet over several state-of-the-art compression artifacts reduction algorithms. |
Tasks | |
Published | 2018-05-27 |
URL | http://arxiv.org/abs/1805.10558v1 |
PDF | http://arxiv.org/pdf/1805.10558v1.pdf |
PWC | https://paperswithcode.com/paper/dpw-sdnet-dual-pixel-wavelet-domain-deep-cnns |
Repo | |
Framework | |
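A sketch of the two 4-channel inputs described above, assuming the four downsampled versions correspond to a polyphase (space-to-depth) decomposition and using PyWavelets for the 1-level DWT; the soft-decoding CNNs themselves are omitted.

```python
# Illustrative sketch (assumption): build the 4-channel pixel-domain and
# wavelet-domain inputs of a DPW-SDNet-style soft decoder.
import numpy as np
import pywt  # PyWavelets

def pixel_branch_input(img: np.ndarray) -> np.ndarray:
    """img: (H, W) grayscale; returns (4, H/2, W/2) polyphase downsampled stack."""
    return np.stack([img[0::2, 0::2], img[0::2, 1::2],
                     img[1::2, 0::2], img[1::2, 1::2]])

def wavelet_branch_input(img: np.ndarray) -> np.ndarray:
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")      # 1-level DWT coefficients
    return np.stack([cA, cH, cV, cD])

img = np.random.rand(64, 64).astype(np.float32)    # stand-in for a JPEG-decoded image
print(pixel_branch_input(img).shape, wavelet_branch_input(img).shape)
```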
Refining Source Representations with Relation Networks for Neural Machine Translation
Title | Refining Source Representations with Relation Networks for Neural Machine Translation |
Authors | Wen Zhang, Jiawei Hu, Yang Feng, Qun Liu |
Abstract | Although neural machine translation with the encoder-decoder framework has achieved great success recently, it still suffers from two drawbacks: forgetting distant information, an inherent disadvantage of the recurrent neural network structure, and disregarding the relationships between source words during the encoding step. In practice, however, this information and these relationships are often useful at the current step. We aim to solve these problems by introducing relation networks to learn better representations of the source. The relation networks strengthen the memorization capability of the recurrent neural network by associating source words with each other, which also helps retain their relationships. Then the source representations and all the relations are fed into the attention component together while decoding, with the main encoder-decoder framework unchanged. Experiments on several datasets show that our method can improve the translation performance significantly over the conventional encoder-decoder model and even outperform an approach involving supervised syntactic knowledge. |
Tasks | Machine Translation |
Published | 2018-05-25 |
URL | http://arxiv.org/abs/1805.11154v2 |
PDF | http://arxiv.org/pdf/1805.11154v2.pdf |
PWC | https://paperswithcode.com/paper/refining-source-representations-with-relation-1 |
Repo | |
Framework | |
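A minimal sketch of a relation layer over source representations: each encoder state is refined by aggregating pairwise combinations with every other state before being handed to the attention component. The specific pairwise MLP and residual mean-aggregation below are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch (assumption): refine encoder states with a pairwise
# relation module before attention.
import torch
import torch.nn as nn

class RelationLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.pair = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        """src: (T, D) encoder states; returns (T, D) relation-refined states."""
        T, D = src.shape
        pairs = torch.cat([src.unsqueeze(1).expand(T, T, D),
                           src.unsqueeze(0).expand(T, T, D)], dim=-1)
        return src + self.pair(pairs).mean(dim=1)   # residual aggregation over partners

states = torch.randn(7, 128)                        # 7 source words, 128-d encoder states
print(RelationLayer(128)(states).shape)
```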
Unsupervised Multi-modal Neural Machine Translation
Title | Unsupervised Multi-modal Neural Machine Translation |
Authors | Yuanhang Su, Kai Fan, Nguyen Bach, C. -C. Jay Kuo, Fei Huang |
Abstract | Unsupervised neural machine translation (UNMT) has recently achieved remarkable results with only large monolingual corpora in each language. However, the uncertainty of associating target with source sentences makes UNMT theoretically an ill-posed problem. This work investigates the possibility of utilizing images for disambiguation to improve the performance of UNMT. Our assumption is intuitively based on the invariant property of images, i.e., the description of the same visual content by different languages should be approximately similar. We propose an unsupervised multi-modal machine translation (UMNMT) framework based on the language translation cycle consistency loss conditional on the image, aiming to learn the bidirectional multi-modal translation simultaneously. Through alternating training between multi-modal and uni-modal data, our inference model can translate with or without the image. On the widely used Multi30K dataset, the experimental results of our approach are significantly better than those of the text-only UNMT on the 2016 test dataset. |
Tasks | Machine Translation |
Published | 2018-11-28 |
URL | https://arxiv.org/abs/1811.11365v2 |
PDF | https://arxiv.org/pdf/1811.11365v2.pdf |
PWC | https://paperswithcode.com/paper/unsupervised-multi-modal-neural-machine |
Repo | |
Framework | |