Paper Group AWR 325
Reverse Attention for Salient Object Detection. Predicting Twitter User Socioeconomic Attributes with Network and Language Information. DeepIM: Deep Iterative Matching for 6D Pose Estimation. Debugging Neural Machine Translations. CapsuleGAN: Generative Adversarial Capsule Network. PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation. A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task. DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation. Estimating 6D Pose From Localizing Designated Surface Keypoints. Efficient Neural Audio Synthesis. Deep Part Induction from Articulated Object Pairs. DVAE++: Discrete Variational Autoencoders with Overlapping Transformations. Learning Factorized Multimodal Representations. HOUDINI: Lifelong Learning as Program Synthesis. Simultaneous Coherent Structure Coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity.
Reverse Attention for Salient Object Detection
Title | Reverse Attention for Salient Object Detection |
Authors | Shuhan Chen, Xiuli Tan, Ben Wang, Xuelong Hu |
Abstract | Benefiting from the rapid development of deep learning techniques, salient object detection has achieved remarkable progress recently. However, there still exist two major challenges that hinder its application in embedded devices: low-resolution output and heavy model weight. To this end, this paper presents an accurate yet compact deep network for efficient salient object detection. More specifically, given a coarse saliency prediction in the deepest layer, we first employ residual learning to learn side-output residual features for saliency refinement, which can be achieved with very limited convolutional parameters while maintaining accuracy. Second, we further propose reverse attention to guide such side-output residual learning in a top-down manner. By erasing the currently predicted salient regions from side-output features, the network can eventually explore the missing object parts and details, which results in high-resolution and accurate predictions. Experiments on six benchmark datasets demonstrate that the proposed approach compares favorably against state-of-the-art methods, with advantages in terms of simplicity, efficiency (45 FPS) and model size (81 MB). |
Tasks | Object Detection, Saliency Prediction, Salient Object Detection |
Published | 2018-07-26 |
URL | http://arxiv.org/abs/1807.09940v2 |
http://arxiv.org/pdf/1807.09940v2.pdf | |
PWC | https://paperswithcode.com/paper/reverse-attention-for-salient-object |
Repo | https://github.com/lhaof/fast-salient-object-detection |
Framework | none |
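The key operation described in the abstract, erasing the currently predicted salient regions from side-output features, reduces to gating the features with an inverted sigmoid of the coarser prediction. Below is a minimal PyTorch-style sketch of one refinement step; the module name, channel sizes, and residual head are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ReverseAttentionBlock(nn.Module):
    """One side-output refinement step: erase currently predicted salient
    regions so the residual branch focuses on missed parts and details."""
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, 3, padding=1),
        )

    def forward(self, side_feat, coarse_pred):
        # Upsample the coarse saliency logits to the side feature's resolution.
        coarse_up = nn.functional.interpolate(
            coarse_pred, size=side_feat.shape[2:], mode="bilinear",
            align_corners=False)
        # Reverse attention: 1 - sigmoid(prediction) highlights what is missing.
        rev_attn = 1.0 - torch.sigmoid(coarse_up)
        # A residual learned from the "erased" features refines the prediction.
        return coarse_up + self.residual(side_feat * rev_attn)

feat = torch.randn(1, 256, 64, 64)   # a side-output feature map
coarse = torch.randn(1, 1, 16, 16)   # coarse saliency logits from a deeper layer
print(ReverseAttentionBlock(256)(feat, coarse).shape)  # torch.Size([1, 1, 64, 64])
```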
Predicting Twitter User Socioeconomic Attributes with Network and Language Information
Title | Predicting Twitter User Socioeconomic Attributes with Network and Language Information |
Authors | Nikolaos Aletras, Benjamin Paul Chamberlain |
Abstract | Inferring socioeconomic attributes of social media users such as occupation and income is an important problem in computational social science. Automated inference of such characteristics has applications in personalised recommender systems, targeted computational advertising and online political campaigning. While previous work has shown that language features can reliably predict socioeconomic attributes on Twitter, employing information coming from users’ social networks has not yet been explored for such complex user characteristics. In this paper, we describe a method for predicting the occupational class and the income of Twitter users given information extracted from their extended networks by learning a low-dimensional vector representation of users, i.e. graph embeddings. We use this representation to train predictive models for occupational class and income. Results on two publicly available datasets show that our method consistently outperforms the state-of-the-art methods in both tasks. We also obtain further significant improvements when we combine graph embeddings with textual features, demonstrating that social network and language information are complementary. |
Tasks | Recommendation Systems |
Published | 2018-04-11 |
URL | http://arxiv.org/abs/1804.04095v1 |
http://arxiv.org/pdf/1804.04095v1.pdf | |
PWC | https://paperswithcode.com/paper/predicting-twitter-user-socioeconomic |
Repo | https://github.com/melifluos/income-prediction |
Framework | none |
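The pipeline in the abstract is straightforward to sketch: learn (or load) a low-dimensional graph embedding per user from the extended follower network, optionally concatenate it with textual features, and train a standard classifier. The snippet below uses random placeholders for both feature sets, so scores are chance-level; it only illustrates the network-only / text-only / combined comparison the paper reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_users = 200
graph_emb = rng.normal(size=(n_users, 64))   # stand-in for DeepWalk/node2vec-style vectors
text_feat = rng.normal(size=(n_users, 100))  # stand-in for aggregated tweet-text features
occupation = rng.integers(0, 9, size=n_users)  # placeholder occupational class labels

# Compare network-only, text-only, and combined predictors.
for name, X in [("graph", graph_emb), ("text", text_feat),
                ("combined", np.hstack([graph_emb, text_feat]))]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, occupation, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```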
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Title | DeepIM: Deep Iterative Matching for 6D Pose Estimation |
Authors | Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, Dieter Fox |
Abstract | Estimating the 6D pose of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the observed image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using an untangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects. |
Tasks | 6D Pose Estimation, 6D Pose Estimation using RGB, Pose Estimation |
Published | 2018-03-31 |
URL | https://arxiv.org/abs/1804.00175v4 |
https://arxiv.org/pdf/1804.00175v4.pdf | |
PWC | https://paperswithcode.com/paper/deepim-deep-iterative-matching-for-6d-pose |
Repo | https://github.com/liyi14/mx-DeepIM |
Framework | mxnet |
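The "untangled" representation decouples rotation, predicted relative to the object and independent of its translation, from translation, predicted as image-plane offsets plus a log-scale depth change. The sketch below shows how such an update might compose with the current estimate; it follows the abstract's description, but the paper's exact conventions may differ.

```python
import numpy as np

def apply_untangled_update(R, t, dR, v, fx, fy):
    """Compose a DeepIM-style relative pose update with the current estimate.
    dR is a rotation about the object centre; v = (vx, vy, vz) gives
    image-plane offsets and a log-scale depth change (assumed conventions)."""
    vx, vy, vz = v
    tz_new = t[2] / np.exp(vz)
    tx_new = (vx / fx + t[0] / t[2]) * tz_new
    ty_new = (vy / fy + t[1] / t[2]) * tz_new
    return dR @ R, np.array([tx_new, ty_new, tz_new])

# An identity update leaves the pose unchanged.
R, t = np.eye(3), np.array([0.1, -0.05, 1.0])
R2, t2 = apply_untangled_update(R, t, np.eye(3), (0.0, 0.0, 0.0), fx=572.4, fy=573.6)
print(np.allclose(R, R2), np.allclose(t, t2))  # True True
```

In the full method this update sits inside a loop: render the object at the current pose, feed the rendered and observed images to the network, apply the predicted update, and repeat.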
Debugging Neural Machine Translations
Title | Debugging Neural Machine Translations |
Authors | Matīss Rikters |
Abstract | In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. The purpose of the tool is to help researchers and developers find weak and faulty example translations that their NMT systems produce without the need for reference translations. Our tool also includes an option to directly compare translation outputs from two different NMT engines or experiments. In addition, we present a demo website of our tool with examples of good and bad translations: http://attention.lielakeda.lv |
Tasks | Machine Translation |
Published | 2018-08-08 |
URL | http://arxiv.org/abs/1808.02733v1 |
http://arxiv.org/pdf/1808.02733v1.pdf | |
PWC | https://paperswithcode.com/paper/debugging-neural-machine-translations |
Repo | https://github.com/M4t1ss/SoftAlignments |
Framework | tf |
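Attention-based confidence can be sketched without reference translations: well-behaved attention covers each source token roughly once and is not overly dispersed across the source sentence. The formulas below are illustrative stand-ins, not necessarily the exact metrics implemented in the tool.

```python
import numpy as np

def attention_confidence(attn):
    """Score a translation from its attention matrix (target x source):
    penalise source tokens whose total attention deviates from 1 (coverage)
    and target tokens whose attention is widely dispersed (entropy)."""
    coverage = attn.sum(axis=0)                      # attention mass per source token
    coverage_penalty = np.log1p((1.0 - coverage) ** 2).sum()
    rows = attn / attn.sum(axis=1, keepdims=True)
    dispersion_penalty = -(rows * np.log(rows + 1e-12)).sum(axis=1).mean()
    return -(coverage_penalty + dispersion_penalty)  # higher = more confident

sharp = np.eye(4) * 0.97 + 0.01     # near-diagonal attention (a "good" translation)
flat = np.full((4, 4), 0.25)        # completely dispersed attention (suspicious)
print(attention_confidence(sharp) > attention_confidence(flat))  # True
```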
CapsuleGAN: Generative Adversarial Capsule Network
Title | CapsuleGAN: Generative Adversarial Capsule Network |
Authors | Ayush Jaiswal, Wael AbdAlmageed, Yue Wu, Premkumar Natarajan |
Abstract | We present Generative Adversarial Capsule Network (CapsuleGAN), a framework that uses capsule networks (CapsNets) instead of the standard convolutional neural networks (CNNs) as discriminators within the generative adversarial network (GAN) setting, while modeling image data. We provide guidelines for designing CapsNet discriminators and the updated GAN objective function, which incorporates the CapsNet margin loss, for training CapsuleGAN models. We show that CapsuleGAN outperforms convolutional-GAN at modeling image data distribution on MNIST and CIFAR-10 datasets, evaluated on the generative adversarial metric and at semi-supervised image classification. |
Tasks | Image Classification, Semi-Supervised Image Classification |
Published | 2018-02-17 |
URL | http://arxiv.org/abs/1802.06167v7 |
http://arxiv.org/pdf/1802.06167v7.pdf | |
PWC | https://paperswithcode.com/paper/capsulegan-generative-adversarial-capsule |
Repo | https://github.com/CPUFronz/CapsVoxGAN |
Framework | pytorch |
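The main change relative to a standard GAN discriminator is the loss: the CapsNet discriminator outputs capsule vectors whose lengths encode class presence, trained with the margin loss of Sabour et al. (2017) rather than binary cross-entropy. A minimal sketch, with the capsule network itself abstracted away to its output lengths:

```python
import torch

def capsule_margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """CapsNet margin loss: push output capsule lengths above m_pos for
    positive labels and below m_neg for negative ones."""
    pos = targets * torch.clamp(m_pos - v_lengths, min=0) ** 2
    neg = lam * (1 - targets) * torch.clamp(v_lengths - m_neg, min=0) ** 2
    return (pos + neg).mean()

# Train the capsule discriminator to emit a long "real" capsule for real
# images and a short one for generated images (labels 1 and 0 respectively).
real_len = torch.tensor([0.95, 0.88])   # capsule lengths on a real batch
fake_len = torch.tensor([0.12, 0.30])   # capsule lengths on a generated batch
d_loss = capsule_margin_loss(real_len, torch.ones(2)) + \
         capsule_margin_loss(fake_len, torch.zeros(2))
print(d_loss)
```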
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
Title | PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation |
Authors | Sida Peng, Yuan Liu, Qixing Huang, Hujun Bao, Xiaowei Zhou |
Abstract | This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/. |
Tasks | 6D Pose Estimation using RGB, Pose Estimation |
Published | 2018-12-31 |
URL | http://arxiv.org/abs/1812.11788v1 |
http://arxiv.org/pdf/1812.11788v1.pdf | |
PWC | https://paperswithcode.com/paper/pvnet-pixel-wise-voting-network-for-6dof-pose |
Repo | https://github.com/zju3dv/pvnet |
Framework | pytorch |
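The voting step can be sketched in a few lines: each pixel casts a unit vector toward a keypoint, pairs of pixels generate hypotheses by ray intersection, and RANSAC keeps the hypothesis most pixels agree with. This toy version omits the network and the keypoint covariance estimate that PVNet passes on to the uncertainty-aware PnP solver.

```python
import numpy as np

def ransac_keypoint_vote(pixels, dirs, n_hyp=128, thresh=0.99, rng=None):
    """PVNet-style voting: hypothesise keypoints by intersecting pairs of
    pixel rays, score each hypothesis by how many pixels point at it."""
    rng = rng or np.random.default_rng(0)
    best, best_inliers = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the two rays p_i + s*d_i and p_j + t*d_j.
        A = np.stack([dirs[i], -dirs[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-8:
            continue
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        hyp = pixels[i] + s * dirs[i]
        # A pixel is an inlier if its vector points at the hypothesis.
        to_hyp = hyp - pixels
        to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-12
        inliers = ((to_hyp * dirs).sum(axis=1) > thresh).sum()
        if inliers > best_inliers:
            best, best_inliers = hyp, inliers
    return best, best_inliers

# Toy example: every pixel points exactly at the true keypoint (10, 20).
pix = np.random.default_rng(1).uniform(0, 64, size=(200, 2))
vec = np.array([10.0, 20.0]) - pix
vec /= np.linalg.norm(vec, axis=1, keepdims=True)
print(ransac_keypoint_vote(pix, vec))  # hypothesis close to (10, 20)
```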
A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task
Title | A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task |
Authors | Josef Michalek, Jan Vanek |
Abstract | In this survey paper, we have evaluated several recent deep neural network (DNN) architectures on a TIMIT phone recognition task. We chose the TIMIT corpus due to its popularity and broad availability in the community. It also simulates a low-resource scenario, which is helpful for minor languages. We also prefer the phone recognition task because it is much more sensitive to acoustic model quality than a large vocabulary continuous speech recognition (LVCSR) task. In recent years, many published DNN papers have reported results on TIMIT. However, the reported phone error rates (PERs) were often much higher than the PER of a simple feed-forward (FF) DNN. That was the main motivation of this paper: to provide baseline DNNs, together with open-source scripts, so that future papers can easily replicate the baseline results with the lowest possible PERs. To the best of our knowledge, the best PER achieved in this survey is better than the best PER published to date. |
Tasks | Large Vocabulary Continuous Speech Recognition, Speech Recognition |
Published | 2018-06-19 |
URL | http://arxiv.org/abs/1806.07974v1 |
http://arxiv.org/pdf/1806.07974v1.pdf | |
PWC | https://paperswithcode.com/paper/a-survey-of-recent-dnn-architectures-on-the |
Repo | https://github.com/OrcusCZ/NNAcousticModeling |
Framework | none |
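The metric every system in the survey is compared on, the phone error rate, is just Levenshtein distance over phone sequences normalised by reference length (usually computed after mapping TIMIT's phones to the reduced 39-phone set). A self-contained sketch:

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance between reference and hypothesised phone
    sequences, normalised by the reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1][-1] / len(ref)

ref = "sil dh ax k ae t sil".split()
hyp = "sil dh ax k ah t sil".split()
print(f"PER = {phone_error_rate(ref, hyp):.3f}")  # 1 substitution / 7 phones = 0.143
```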
DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation
Title | DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation |
Authors | Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, Nicolas Courty |
Abstract | In computer vision, one is often confronted with problems of domain shifts, which occur when one applies a classifier trained on a source dataset to target data sharing similar characteristics (e.g. same classes), but also different latent data structures (e.g. different acquisition conditions). In such a situation, the model will perform poorly on the new data, since the classifier is specialized to recognize visual cues specific to the source domain. In this work we explore a solution, named DeepJDOT, to tackle this problem: through a measure of discrepancy on joint deep representations/labels based on optimal transport, we not only learn new data representations aligned between the source and target domain, but also simultaneously preserve the discriminative information used by the classifier. We applied DeepJDOT to a series of visual recognition tasks, where it compares favorably against state-of-the-art deep domain adaptation methods. |
Tasks | Domain Adaptation, Unsupervised Domain Adaptation |
Published | 2018-03-27 |
URL | http://arxiv.org/abs/1803.10081v3 |
http://arxiv.org/pdf/1803.10081v3.pdf | |
PWC | https://paperswithcode.com/paper/deepjdot-deep-joint-distribution-optimal |
Repo | https://github.com/bbdamodaran/deepJDOT |
Framework | tf |
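The heart of the method is an optimal-transport coupling whose ground cost mixes embedding distance with a label-prediction loss; training alternates between solving for the coupling and taking SGD steps on the network. A sketch of the coupling step using the POT library; the weights alpha and lam are illustrative:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def deepjdot_coupling(g_src, g_tgt, y_src, f_tgt, alpha=0.001, lam=1.0):
    """One DeepJDOT coupling step: the transport cost mixes the distance
    between deep embeddings with the loss of predicting source labels
    from target class probabilities."""
    # Squared Euclidean distance between source and target embeddings.
    feat_cost = ((g_src[:, None, :] - g_tgt[None, :, :]) ** 2).sum(-1)
    # Cross-entropy between one-hot source labels and target predictions.
    label_cost = -(y_src[:, None, :] * np.log(f_tgt[None, :, :] + 1e-12)).sum(-1)
    M = alpha * feat_cost + lam * label_cost
    a = np.full(len(g_src), 1 / len(g_src))
    b = np.full(len(g_tgt), 1 / len(g_tgt))
    return ot.emd(a, b, M)  # in training, alternate this with SGD on the network

rng = np.random.default_rng(0)
gs, gt = rng.normal(size=(8, 16)), rng.normal(size=(10, 16))
ys = np.eye(3)[rng.integers(0, 3, 8)]        # one-hot source labels
ft = rng.dirichlet(np.ones(3), size=10)      # target class probabilities
print(deepjdot_coupling(gs, gt, ys, ft).shape)  # (8, 10) coupling matrix
```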
Estimating 6D Pose From Localizing Designated Surface Keypoints
Title | Estimating 6D Pose From Localizing Designated Surface Keypoints |
Authors | Zelin Zhao, Gao Peng, Haoyu Wang, Hao-Shu Fang, Chengkun Li, Cewu Lu |
Abstract | In this paper, we present an accurate yet efficient solution for 6D pose estimation from an RGB image. The core of our approach is that we first designate a set of surface points on the target object model as keypoints and then train a keypoint detector (KPD) to localize them. Finally, a PnP algorithm can recover the 6D pose according to the 2D-3D relationship of keypoints. Different from recent state-of-the-art CNN-based approaches that rely on a time-consuming post-processing procedure, our method can achieve competitive accuracy without any refinement after pose prediction. Meanwhile, we obtain a 30% relative improvement in terms of ADD accuracy among methods without using refinement. Moreover, we succeed in handling heavy occlusion by selecting the most confident keypoints to recover the 6D pose. For the sake of reproducibility, we will make our code and models publicly available soon. |
Tasks | 6D Pose Estimation, 6D Pose Estimation using RGB, Pose Estimation, Pose Prediction |
Published | 2018-12-04 |
URL | http://arxiv.org/abs/1812.01387v1 |
http://arxiv.org/pdf/1812.01387v1.pdf | |
PWC | https://paperswithcode.com/paper/estimating-6d-pose-from-localizing-designated |
Repo | https://github.com/why2011btv/6d_pose_estimation |
Framework | pytorch |
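Given the detected 2D keypoints and their designated 3D counterparts on the object model, the final step is a standard PnP solve, e.g. with OpenCV. All coordinates and intrinsics below are made-up illustrative numbers:

```python
import numpy as np
import cv2

# 2D-3D correspondences: designated surface keypoints on the model and their
# detected image locations (illustrative values, not real detections).
object_pts = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1],
                       [0.1, 0.1, 0], [0.1, 0, 0.1]], dtype=np.float64)
image_pts = np.array([[320, 240], [400, 238], [318, 170], [322, 260],
                      [398, 168], [401, 258]], dtype=np.float64)
K = np.array([[572.4, 0, 325.3],
              [0, 573.6, 242.0],
              [0, 0, 1.0]])        # camera intrinsics

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)          # rotation matrix + translation = the 6D pose
print(ok, tvec.ravel())
```

To handle occlusion as the abstract describes, one would keep only the most confident detections before solving, or use `cv2.solvePnPRansac` to reject outlier keypoints.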
Efficient Neural Audio Synthesis
Title | Efficient Neural Audio Synthesis |
Authors | Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu |
Abstract | Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency. |
Tasks | Speech Synthesis, Text-To-Speech Synthesis |
Published | 2018-02-23 |
URL | http://arxiv.org/abs/1802.08435v2 |
http://arxiv.org/pdf/1802.08435v2.pdf | |
PWC | https://paperswithcode.com/paper/efficient-neural-audio-synthesis |
Repo | https://github.com/CorentinJ/Real-Time-Voice-Cloning |
Framework | tf |
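The dual softmax layer mentioned in the abstract rests on a simple factorisation: a 16-bit sample splits exactly into an 8-bit coarse part and an 8-bit fine part, so two 256-way softmaxes replace one intractable 65,536-way one (the network predicts the coarse part first, then the fine part conditioned on it). A sketch of the round trip:

```python
import numpy as np

def split_coarse_fine(samples_16bit):
    """Factor each 16-bit sample into two 8-bit parts (coarse, fine)."""
    u = samples_16bit.astype(np.int32) + 2 ** 15   # signed -> unsigned [0, 65535]
    return u // 256, u % 256

def join_coarse_fine(coarse, fine):
    """Reassemble the original signed 16-bit sample."""
    return (coarse * 256 + fine - 2 ** 15).astype(np.int16)

x = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
c, f = split_coarse_fine(x)
assert np.array_equal(join_coarse_fine(c, f), x)   # lossless round trip
print(c, f)
```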
Deep Part Induction from Articulated Object Pairs
Title | Deep Part Induction from Articulated Object Pairs |
Authors | Li Yi, Haibin Huang, Difan Liu, Evangelos Kalogerakis, Hao Su, Leonidas Guibas |
Abstract | Object functionality is often expressed through part articulation – as when the two rigid parts of a pair of scissors pivot against each other to perform the cutting function. Such articulations are often similar across objects within the same functional category. In this paper, we explore how the observation of different articulation states provides evidence for part structure and motion of 3D objects. Our method takes as input a pair of unsegmented shapes representing two different articulation states of two functionally related objects, and induces their common parts along with their underlying rigid motion. This is a challenging setting, as we assume no prior shape structure, no prior shape category information, no consistent shape orientation, the articulation states may belong to objects of different geometry, plus we allow inputs to be noisy and partial scans, or point clouds lifted from RGB images. Our method learns a neural network architecture with three modules that respectively propose correspondences, estimate 3D deformation flows, and perform segmentation. To achieve optimal performance, our architecture alternates between correspondence, deformation flow, and segmentation prediction iteratively in an ICP-like fashion. Our results demonstrate that our method significantly outperforms state-of-the-art techniques in the task of discovering articulated parts of objects. In addition, our part induction is object-class agnostic and successfully generalizes to new and unseen objects. |
Tasks | |
Published | 2018-09-19 |
URL | http://arxiv.org/abs/1809.07417v1 |
http://arxiv.org/pdf/1809.07417v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-part-induction-from-articulated-object |
Repo | https://github.com/ericyi/articulated-part-induction |
Framework | tf |
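The ICP-like alternation can be imitated with non-learned stand-ins to see its shape: propose correspondences, derive a deformation flow from them, segment points whose flow is consistent, and repeat. In the paper each step is a neural module and the alternation handles large motions that a single nearest-neighbour pass cannot; the toy below only works because the articulation is small.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

def alternate_corr_flow_seg(pc_a, pc_b, n_parts=2, n_iters=3):
    """Crude stand-in for the paper's alternation: nearest neighbours play
    the correspondence module, matched displacements the flow module, and
    k-means on flow vectors the segmentation module."""
    flow = np.zeros_like(pc_a)
    labels = None
    for _ in range(n_iters):
        _, idx = cKDTree(pc_b).query(pc_a + flow)                         # correspondence
        flow = pc_b[idx] - pc_a                                           # deformation flow
        labels = KMeans(n_clusters=n_parts, n_init=10).fit_predict(flow)  # segmentation
    return flow, labels

# Two articulation states: one half of the points shifted rigidly.
rng = np.random.default_rng(0)
pc_a = rng.uniform(-1, 1, size=(200, 3))
pc_b = pc_a.copy()
pc_b[:100] += np.array([0.05, 0.0, 0.0])   # the "moving part"
flow, labels = alternate_corr_flow_seg(pc_a, pc_b)
print(np.bincount(labels))                 # roughly a 100 / 100 split into two parts
```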
DVAE++: Discrete Variational Autoencoders with Overlapping Transformations
Title | DVAE++: Discrete Variational Autoencoders with Overlapping Transformations |
Authors | Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, Evgeny Andriyash |
Abstract | Training of discrete latent variable models remains challenging because passing gradient information through discrete units is difficult. We propose a new class of smoothing transformations based on a mixture of two overlapping distributions, and show that the proposed transformation can be used for training binary latent models with either directed or undirected priors. We derive a new variational bound to efficiently train with Boltzmann machine priors. Using this bound, we develop DVAE++, a generative model with a global discrete prior and a hierarchy of convolutional continuous variables. Experiments on several benchmarks show that overlapping transformations outperform other recent continuous relaxations of discrete latent variables including Gumbel-Softmax (Maddison et al., 2016; Jang et al., 2016), and discrete variational autoencoders (Rolfe 2016). |
Tasks | Latent Variable Models |
Published | 2018-02-14 |
URL | http://arxiv.org/abs/1802.04920v2 |
http://arxiv.org/pdf/1802.04920v2.pdf | |
PWC | https://paperswithcode.com/paper/dvae-discrete-variational-autoencoders-with-1 |
Repo | https://github.com/QuadrantAI/dvae |
Framework | tf |
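The overlapping transformation replaces a binary z with a continuous zeta in [0, 1] drawn from one of two exponential densities that share support, so gradient information can pass through the relaxation. The sketch below shows only the conditional samplers; DVAE++ itself samples from the mixture through an analytic inverse CDF so that reparameterised gradients reach the posterior probabilities.

```python
import numpy as np

def sample_overlapping(z, beta=8.0, rng=None):
    """Smooth a binary variable with overlapping exponentials: r(zeta|z=0)
    decays from 0, r(zeta|z=1) rises toward 1, and both are supported on
    all of [0, 1]. Sampling is by inverse CDF (illustrative formulas)."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(size=np.shape(z))
    norm = 1.0 - np.exp(-beta)
    zeta0 = -np.log(1.0 - u * norm) / beta                 # inverse CDF given z = 0
    zeta1 = 1.0 + np.log(u * norm + np.exp(-beta)) / beta  # inverse CDF given z = 1
    return np.where(z == 1, zeta1, zeta0)

z = np.array([0, 0, 1, 1, 1])
print(sample_overlapping(z))  # values near 0 for z=0, near 1 for z=1, never exactly binary
```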
Learning Factorized Multimodal Representations
Title | Learning Factorized Multimodal Representations |
Authors | Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, Ruslan Salakhutdinov |
Abstract | Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning. |
Tasks | Representation Learning |
Published | 2018-06-16 |
URL | https://arxiv.org/abs/1806.06176v3 |
https://arxiv.org/pdf/1806.06176v3.pdf | |
PWC | https://paperswithcode.com/paper/learning-factorized-multimodal |
Repo | https://github.com/pliang279/factorized |
Framework | pytorch |
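The factorisation in the abstract maps naturally onto a pair of encoders per modality plus a shared fusion. The skeleton below is a hedged illustration: layer sizes, sum-fusion, and the single prediction head are invented, and the paper's inference machinery is omitted.

```python
import torch
import torch.nn as nn

class FactorizedMultimodal(nn.Module):
    """Skeleton of the factorisation: a shared discriminative factor fused
    across modalities, plus one modality-specific generative factor each."""
    def __init__(self, dims=(300, 74), d_shared=32, d_private=16):
        super().__init__()
        self.enc_shared = nn.ModuleList([nn.Linear(d, d_shared) for d in dims])
        self.enc_private = nn.ModuleList([nn.Linear(d, d_private) for d in dims])
        self.dec = nn.ModuleList([nn.Linear(d_shared + d_private, d) for d in dims])
        self.clf = nn.Linear(d_shared, 1)   # e.g. a sentiment prediction head

    def forward(self, xs):
        # Shared discriminative factor: fuse the per-modality projections.
        shared = torch.stack([e(x) for e, x in zip(self.enc_shared, xs)]).sum(0)
        privates = [e(x) for e, x in zip(self.enc_private, xs)]
        # Generative path: reconstruct each modality from shared + its own factor.
        recons = [d(torch.cat([shared, p], -1)) for d, p in zip(self.dec, privates)]
        return self.clf(shared), recons

text = torch.randn(4, 300)   # e.g. word-embedding features
audio = torch.randn(4, 74)   # e.g. acoustic features
pred, recons = FactorizedMultimodal()([text, audio])
print(pred.shape, [r.shape for r in recons])
```

Training would add reconstruction losses on `recons` to the prediction loss, giving the joint generative-discriminative objective; a missing modality can then be reconstructed from the remaining factors.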
HOUDINI: Lifelong Learning as Program Synthesis
Title | HOUDINI: Lifelong Learning as Program Synthesis |
Authors | Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, Swarat Chaudhuri |
Abstract | We present a neurosymbolic framework for the lifelong learning of algorithmic tasks that mix perception and procedural reasoning. Reusing high-level concepts across domains and learning complex procedures are key challenges in lifelong learning. We show that a program synthesis approach that combines gradient descent with combinatorial search over programs can be a more effective response to these challenges than purely neural methods. Our framework, called HOUDINI, represents neural networks as strongly typed, differentiable functional programs that use symbolic higher-order combinators to compose a library of neural functions. Our learning algorithm consists of: (1) a symbolic program synthesizer that performs a type-directed search over parameterized programs, and decides on the library functions to reuse, and the architectures to combine them, while learning a sequence of tasks; and (2) a neural module that trains these programs using stochastic gradient descent. We evaluate HOUDINI on three benchmarks that combine perception with the algorithmic tasks of counting, summing, and shortest-path computation. Our experiments show that HOUDINI transfers high-level concepts more effectively than traditional transfer learning and progressive neural networks, and that the typed representation of networks significantly accelerates the search. |
Tasks | Program Synthesis, Transfer Learning |
Published | 2018-03-31 |
URL | http://arxiv.org/abs/1804.00218v2 |
http://arxiv.org/pdf/1804.00218v2.pdf | |
PWC | https://paperswithcode.com/paper/houdini-lifelong-learning-as-program |
Repo | https://github.com/capergroup/houdini |
Framework | pytorch |
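The symbolic half of HOUDINI, type-directed search over compositions of library functions, can be caricatured in a few lines. Everything here is invented for illustration: in HOUDINI the library entries are strongly typed differentiable modules, the combinators are higher-order (map, fold, compose), and a neural phase trains each candidate program with SGD.

```python
# Toy typed library: (name, input type, output type). "map(recognise_digit)"
# stands for a neural module lifted over lists by a higher-order combinator.
LIBRARY = [
    ("map(recognise_digit)", "list<image>", "list<digit>"),
    ("count_even",           "list<digit>", "int"),
    ("regress_speed",        "image",       "speed"),
]

def chains(lib, in_type, out_type, depth=3):
    """Enumerate chains of library functions whose types compose from
    in_type to out_type: the symbolic half of the synthesizer."""
    if depth == 0:
        return
    for name, t_in, t_out in lib:
        if t_in != in_type:
            continue
        if t_out == out_type:
            yield [name]
        for rest in chains(lib, t_out, out_type, depth - 1):
            yield [name, *rest]

# Counting even digits in a list of images = map a digit recogniser, then count.
print(list(chains(LIBRARY, "list<image>", "int")))
```

Type information is what keeps this search tractable: only compositions whose types line up are ever enumerated, which is the acceleration the abstract attributes to the typed representation.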
Simultaneous Coherent Structure Coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity
Title | Simultaneous Coherent Structure Coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity |
Authors | Brooke E. Husic, Kristy L. Schlueter-Kuck, John O. Dabiri |
Abstract | The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we obtain a set of orthogonal coordinates along which dissimilarity in the dataset is maximized from a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields where existing analysis tools require ad hoc assumptions on the data structure, lack the interpretability of the present method, or in which the underlying processes are less accessible, such as genomics and neuroscience. |
Tasks | |
Published | 2018-07-12 |
URL | http://arxiv.org/abs/1807.04427v3 |
http://arxiv.org/pdf/1807.04427v3.pdf | |
PWC | https://paperswithcode.com/paper/simultaneous-coherent-structure-coloring |
Repo | https://github.com/brookehus/sCSC |
Framework | none |
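The bifurcation step can be sketched directly from the abstract: build a graph whose edge weights are the pairwise dissimilarities, solve a generalized eigenvalue problem for the coordinate that maximises dissimilarity-weighted separation, and split at its median. Details are simplified relative to the paper:

```python
import numpy as np
from scipy.linalg import eigh

def csc_split(D):
    """One sCSC-style bifurcation: the top generalized eigenvector places
    the most dissimilar points at opposite ends of the coordinate, so a
    median split separates them into two clusters."""
    deg = np.diag(D.sum(axis=1))
    L = deg - D                      # Laplacian of the dissimilarity graph
    vals, vecs = eigh(L, deg)        # generalized problem L x = lambda deg x
    x = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return x > np.median(x)          # binary split; recurse for the full tree

# Two noisy groups whose cross-group dissimilarity is large.
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.1, 50), rng.normal(1, 0.1, 50)])
D = np.abs(pts[:, None] - pts[None, :])
labels = csc_split(D)
print(labels[:50].mean(), labels[50:].mean())  # ~0 and ~1 (or flipped): groups separate
```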