Paper Group AWR 325
Reverse Attention for Salient Object Detection. Predicting Twitter User Socioeconomic Attributes with Network and Language Information. DeepIM: Deep Iterative Matching for 6D Pose Estimation. Debugging Neural Machine Translations. CapsuleGAN: Generative Adversarial Capsule Network. PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation. A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task. DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation. Estimating 6D Pose From Localizing Designated Surface Keypoints. Efficient Neural Audio Synthesis. Deep Part Induction from Articulated Object Pairs. DVAE++: Discrete Variational Autoencoders with Overlapping Transformations. Learning Factorized Multimodal Representations. HOUDINI: Lifelong Learning as Program Synthesis. Simultaneous Coherent Structure Coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity.
Reverse Attention for Salient Object Detection
Title | Reverse Attention for Salient Object Detection |
Authors | Shuhan Chen, Xiuli Tan, Ben Wang, Xuelong Hu |
Abstract | Benefiting from the rapid development of deep learning techniques, salient object detection has achieved remarkable progress recently. However, there still exist two major challenges that hinder its application in embedded devices: low-resolution output and heavy model weight. To this end, this paper presents an accurate yet compact deep network for efficient salient object detection. More specifically, given a coarse saliency prediction in the deepest layer, we first employ residual learning to learn side-output residual features for saliency refinement, which can be achieved with very limited convolutional parameters while maintaining accuracy. Second, we further propose reverse attention to guide such side-output residual learning in a top-down manner. By erasing the currently predicted salient regions from side-output features, the network can eventually explore the missing object parts and details, which results in high-resolution and accurate predictions. Experiments on six benchmark datasets demonstrate that the proposed approach compares favorably against state-of-the-art methods, with advantages in terms of simplicity, efficiency (45 FPS) and model size (81 MB). |
Tasks | Object Detection, Saliency Prediction, Salient Object Detection |
Published | 2018-07-26 |
URL | http://arxiv.org/abs/1807.09940v2 |
http://arxiv.org/pdf/1807.09940v2.pdf | |
PWC | https://paperswithcode.com/paper/reverse-attention-for-salient-object |
Repo | https://github.com/lhaof/fast-salient-object-detection |
Framework | none |
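The key operation described in the abstract, erasing the currently predicted salient regions from side-output features, reduces to gating the features with an inverted sigmoid of the coarser prediction. Below is a minimal PyTorch-style sketch of one refinement step; the module name, channel sizes, and residual head are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ReverseAttentionBlock(nn.Module):
    """One side-output refinement step: erase currently predicted salient
    regions so the residual branch focuses on missed parts and details."""
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, 3, padding=1),
        )

    def forward(self, side_feat, coarse_pred):
        # Upsample the coarse saliency logits to the side feature's resolution.
        coarse_up = nn.functional.interpolate(
            coarse_pred, size=side_feat.shape[2:], mode="bilinear",
            align_corners=False)
        # Reverse attention: 1 - sigmoid(prediction) highlights what is missing.
        rev_attn = 1.0 - torch.sigmoid(coarse_up)
        # A residual learned from the "erased" features refines the prediction.
        return coarse_up + self.residual(side_feat * rev_attn)

feat = torch.randn(1, 256, 64, 64)   # a side-output feature map
coarse = torch.randn(1, 1, 16, 16)   # coarse saliency logits from a deeper layer
print(ReverseAttentionBlock(256)(feat, coarse).shape)  # torch.Size([1, 1, 64, 64])
```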
Predicting Twitter User Socioeconomic Attributes with Network and Language Information
Title | Predicting Twitter User Socioeconomic Attributes with Network and Language Information |
Authors | Nikolaos Aletras, Benjamin Paul Chamberlain |
Abstract | Inferring socioeconomic attributes of social media users such as occupation and income is an important problem in computational social science. Automated inference of such characteristics has applications in personalised recommender systems, targeted computational advertising and online political campaigning. While previous work has shown that language features can reliably predict socioeconomic attributes on Twitter, employing information coming from users’ social networks has not yet been explored for such complex user characteristics. In this paper, we describe a method for predicting the occupational class and the income of Twitter users given information extracted from their extended networks by learning a low-dimensional vector representation of users, i.e. graph embeddings. We use this representation to train predictive models for occupational class and income. Results on two publicly available datasets show that our method consistently outperforms the state-of-the-art methods in both tasks. We also obtain further significant improvements when we combine graph embeddings with textual features, demonstrating that social network and language information are complementary. |
Tasks | Recommendation Systems |
Published | 2018-04-11 |
URL | http://arxiv.org/abs/1804.04095v1 |
http://arxiv.org/pdf/1804.04095v1.pdf | |
PWC | https://paperswithcode.com/paper/predicting-twitter-user-socioeconomic |
Repo | https://github.com/melifluos/income-prediction |
Framework | none |
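The pipeline in the abstract is straightforward to sketch: learn (or load) a low-dimensional graph embedding per user from the extended follower network, optionally concatenate it with textual features, and train a standard classifier. The snippet below uses random placeholders for both feature sets, so scores are chance-level; it only illustrates the network-only / text-only / combined comparison the paper reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_users = 200
graph_emb = rng.normal(size=(n_users, 64))   # stand-in for DeepWalk/node2vec-style vectors
text_feat = rng.normal(size=(n_users, 100))  # stand-in for aggregated tweet-text features
occupation = rng.integers(0, 9, size=n_users)  # placeholder occupational class labels

# Compare network-only, text-only, and combined predictors.
for name, X in [("graph", graph_emb), ("text", text_feat),
                ("combined", np.hstack([graph_emb, text_feat]))]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, occupation, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```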
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Title | DeepIM: Deep Iterative Matching for 6D Pose Estimation |
Authors | Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, Dieter Fox |
Abstract | Estimating the 6D pose of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the observed image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using an untangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects. |
Tasks | 6D Pose Estimation, 6D Pose Estimation using RGB, Pose Estimation |
Published | 2018-03-31 |
URL | https://arxiv.org/abs/1804.00175v4 |
https://arxiv.org/pdf/1804.00175v4.pdf | |
PWC | https://paperswithcode.com/paper/deepim-deep-iterative-matching-for-6d-pose |
Repo | https://github.com/liyi14/mx-DeepIM |
Framework | mxnet |
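The "untangled" representation decouples rotation, predicted relative to the object and independent of its translation, from translation, predicted as image-plane offsets plus a log-scale depth change. The sketch below shows how such an update might compose with the current estimate; it follows the abstract's description, but the paper's exact conventions may differ.

```python
import numpy as np

def apply_untangled_update(R, t, dR, v, fx, fy):
    """Compose a DeepIM-style relative pose update with the current estimate.
    dR is a rotation about the object centre; v = (vx, vy, vz) gives
    image-plane offsets and a log-scale depth change (assumed conventions)."""
    vx, vy, vz = v
    tz_new = t[2] / np.exp(vz)
    tx_new = (vx / fx + t[0] / t[2]) * tz_new
    ty_new = (vy / fy + t[1] / t[2]) * tz_new
    return dR @ R, np.array([tx_new, ty_new, tz_new])

# An identity update leaves the pose unchanged.
R, t = np.eye(3), np.array([0.1, -0.05, 1.0])
R2, t2 = apply_untangled_update(R, t, np.eye(3), (0.0, 0.0, 0.0), fx=572.4, fy=573.6)
print(np.allclose(R, R2), np.allclose(t, t2))  # True True
```

In the full method this update sits inside a loop: render the object at the current pose, feed the rendered and observed images to the network, apply the predicted update, and repeat.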
Debugging Neural Machine Translations
Title | Debugging Neural Machine Translations |
Authors | Matīss Rikters |
Abstract | In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. The purpose of the tool is to help researchers and developers find weak and faulty example translations that their NMT systems produce without the need for reference translations. Our tool also includes an option to directly compare translation outputs from two different NMT engines or experiments. In addition, we present a demo website of our tool with examples of good and bad translations: http://attention.lielakeda.lv |
Tasks | Machine Translation |
Published | 2018-08-08 |
URL | http://arxiv.org/abs/1808.02733v1 |
http://arxiv.org/pdf/1808.02733v1.pdf | |
PWC | https://paperswithcode.com/paper/debugging-neural-machine-translations |
Repo | https://github.com/M4t1ss/SoftAlignments |
Framework | tf |
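Attention-based confidence can be sketched without reference translations: well-behaved attention covers each source token roughly once and is not overly dispersed across the source sentence. The formulas below are illustrative stand-ins, not necessarily the exact metrics implemented in the tool.

```python
import numpy as np

def attention_confidence(attn):
    """Score a translation from its attention matrix (target x source):
    penalise source tokens whose total attention deviates from 1 (coverage)
    and target tokens whose attention is widely dispersed (entropy)."""
    coverage = attn.sum(axis=0)                      # attention mass per source token
    coverage_penalty = np.log1p((1.0 - coverage) ** 2).sum()
    rows = attn / attn.sum(axis=1, keepdims=True)
    dispersion_penalty = -(rows * np.log(rows + 1e-12)).sum(axis=1).mean()
    return -(coverage_penalty + dispersion_penalty)  # higher = more confident

sharp = np.eye(4) * 0.97 + 0.01     # near-diagonal attention (a "good" translation)
flat = np.full((4, 4), 0.25)        # completely dispersed attention (suspicious)
print(attention_confidence(sharp) > attention_confidence(flat))  # True
```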
CapsuleGAN: Generative Adversarial Capsule Network
Title | CapsuleGAN: Generative Adversarial Capsule Network |
Authors | Ayush Jaiswal, Wael AbdAlmageed, Yue Wu, Premkumar Natarajan |
Abstract | We present Generative Adversarial Capsule Network (CapsuleGAN), a framework that uses capsule networks (CapsNets) instead of the standard convolutional neural networks (CNNs) as discriminators within the generative adversarial network (GAN) setting, while modeling image data. We provide guidelines for designing CapsNet discriminators and the updated GAN objective function, which incorporates the CapsNet margin loss, for training CapsuleGAN models. We show that CapsuleGAN outperforms convolutional-GAN at modeling image data distribution on MNIST and CIFAR-10 datasets, evaluated on the generative adversarial metric and at semi-supervised image classification. |
Tasks | Image Classification, Semi-Supervised Image Classification |
Published | 2018-02-17 |
URL | http://arxiv.org/abs/1802.06167v7 |
http://arxiv.org/pdf/1802.06167v7.pdf | |
PWC | https://paperswithcode.com/paper/capsulegan-generative-adversarial-capsule |
Repo | https://github.com/CPUFronz/CapsVoxGAN |
Framework | pytorch |
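The main change relative to a standard GAN discriminator is the loss: the CapsNet discriminator outputs capsule vectors whose lengths encode class presence, trained with the margin loss of Sabour et al. (2017) rather than binary cross-entropy. A minimal sketch, with the capsule network itself abstracted away to its output lengths:

```python
import torch

def capsule_margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """CapsNet margin loss: push output capsule lengths above m_pos for
    positive labels and below m_neg for negative ones."""
    pos = targets * torch.clamp(m_pos - v_lengths, min=0) ** 2
    neg = lam * (1 - targets) * torch.clamp(v_lengths - m_neg, min=0) ** 2
    return (pos + neg).mean()

# Train the capsule discriminator to emit a long "real" capsule for real
# images and a short one for generated images (labels 1 and 0 respectively).
real_len = torch.tensor([0.95, 0.88])   # capsule lengths on a real batch
fake_len = torch.tensor([0.12, 0.30])   # capsule lengths on a generated batch
d_loss = capsule_margin_loss(real_len, torch.ones(2)) + \
         capsule_margin_loss(fake_len, torch.zeros(2))
print(d_loss)
```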
PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation
Title | PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation |
Authors | Sida Peng, Yuan Liu, Qixing Huang, Hujun Bao, Xiaowei Zhou |
Abstract | This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/. |
Tasks | 6D Pose Estimation using RGB, Pose Estimation |
Published | 2018-12-31 |
URL | http://arxiv.org/abs/1812.11788v1 |
http://arxiv.org/pdf/1812.11788v1.pdf | |
PWC | https://paperswithcode.com/paper/pvnet-pixel-wise-voting-network-for-6dof-pose |
Repo | https://github.com/zju3dv/pvnet |
Framework | pytorch |
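The voting step can be sketched in a few lines: each pixel casts a unit vector toward a keypoint, pairs of pixels generate hypotheses by ray intersection, and RANSAC keeps the hypothesis most pixels agree with. This toy version omits the network and the keypoint covariance estimate that PVNet passes on to the uncertainty-aware PnP solver.

```python
import numpy as np

def ransac_keypoint_vote(pixels, dirs, n_hyp=128, thresh=0.99, rng=None):
    """PVNet-style voting: hypothesise keypoints by intersecting pairs of
    pixel rays, score each hypothesis by how many pixels point at it."""
    rng = rng or np.random.default_rng(0)
    best, best_inliers = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the two rays p_i + s*d_i and p_j + t*d_j.
        A = np.stack([dirs[i], -dirs[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-8:
            continue
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        hyp = pixels[i] + s * dirs[i]
        # A pixel is an inlier if its vector points at the hypothesis.
        to_hyp = hyp - pixels
        to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-12
        inliers = ((to_hyp * dirs).sum(axis=1) > thresh).sum()
        if inliers > best_inliers:
            best, best_inliers = hyp, inliers
    return best, best_inliers

# Toy example: every pixel points exactly at the true keypoint (10, 20).
pix = np.random.default_rng(1).uniform(0, 64, size=(200, 2))
vec = np.array([10.0, 20.0]) - pix
vec /= np.linalg.norm(vec, axis=1, keepdims=True)
print(ransac_keypoint_vote(pix, vec))  # hypothesis close to (10, 20)
```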
A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task
Title | A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task |
Authors | Josef Michalek, Jan Vanek |
Abstract | In this survey paper, we have evaluated several recent deep neural network (DNN) architectures on a TIMIT phone recognition task. We chose the TIMIT corpus due to its popularity and broad availability in the community. It also simulates a low-resource scenario, which is helpful for minor languages. We also prefer the phone recognition task because it is much more sensitive to acoustic model quality than a large vocabulary continuous speech recognition (LVCSR) task. In recent years, many published DNN papers have reported results on TIMIT. However, the reported phone error rates (PERs) were often much higher than the PER of a simple feed-forward (FF) DNN. That was the main motivation of this paper: to provide baseline DNNs, together with open-source scripts, so that future papers can easily replicate the baseline results with the lowest possible PERs. To the best of our knowledge, the best PER achieved in this survey is better than the best PER published to date. |
Tasks | Large Vocabulary Continuous Speech Recognition, Speech Recognition |
Published | 2018-06-19 |
URL | http://arxiv.org/abs/1806.07974v1 |
http://arxiv.org/pdf/1806.07974v1.pdf | |
PWC | https://paperswithcode.com/paper/a-survey-of-recent-dnn-architectures-on-the |
Repo | https://github.com/OrcusCZ/NNAcousticModeling |
Framework | none |
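The metric every system in the survey is compared on, the phone error rate, is just Levenshtein distance over phone sequences normalised by reference length (usually computed after mapping TIMIT's phones to the reduced 39-phone set). A self-contained sketch:

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance between reference and hypothesised phone
    sequences, normalised by the reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[-1][-1] / len(ref)

ref = "sil dh ax k ae t sil".split()
hyp = "sil dh ax k ah t sil".split()
print(f"PER = {phone_error_rate(ref, hyp):.3f}")  # 1 substitution / 7 phones = 0.143
```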
DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation
Title | DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation |
Authors | Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, Nicolas Courty |
Abstract | In computer vision, one is often confronted with problems of domain shifts, which occur when one applies a classifier trained on a source dataset to target data sharing similar characteristics (e.g. same classes), but also different latent data structures (e.g. different acquisition conditions). In such a situation, the model will perform poorly on the new data, since the classifier is specialized to recognize visual cues specific to the source domain. In this work we explore a solution, named DeepJDOT, to tackle this problem: through a measure of discrepancy on joint deep representations/labels based on optimal transport, we not only learn new data representations aligned between the source and target domain, but also simultaneously preserve the discriminative information used by the classifier. We applied DeepJDOT to a series of visual recognition tasks, where it compares favorably against state-of-the-art deep domain adaptation methods. |
Tasks | Domain Adaptation, Unsupervised Domain Adaptation |
Published | 2018-03-27 |
URL | http://arxiv.org/abs/1803.10081v3 |
http://arxiv.org/pdf/1803.10081v3.pdf | |
PWC | https://paperswithcode.com/paper/deepjdot-deep-joint-distribution-optimal |
Repo | https://github.com/bbdamodaran/deepJDOT |
Framework | tf |
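The heart of the method is an optimal-transport coupling whose ground cost mixes embedding distance with a label-prediction loss; training alternates between solving for the coupling and taking SGD steps on the network. A sketch of the coupling step using the POT library; the weights alpha and lam are illustrative:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def deepjdot_coupling(g_src, g_tgt, y_src, f_tgt, alpha=0.001, lam=1.0):
    """One DeepJDOT coupling step: the transport cost mixes the distance
    between deep embeddings with the loss of predicting source labels
    from target class probabilities."""
    # Squared Euclidean distance between source and target embeddings.
    feat_cost = ((g_src[:, None, :] - g_tgt[None, :, :]) ** 2).sum(-1)
    # Cross-entropy between one-hot source labels and target predictions.
    label_cost = -(y_src[:, None, :] * np.log(f_tgt[None, :, :] + 1e-12)).sum(-1)
    M = alpha * feat_cost + lam * label_cost
    a = np.full(len(g_src), 1 / len(g_src))
    b = np.full(len(g_tgt), 1 / len(g_tgt))
    return ot.emd(a, b, M)  # in training, alternate this with SGD on the network

rng = np.random.default_rng(0)
gs, gt = rng.normal(size=(8, 16)), rng.normal(size=(10, 16))
ys = np.eye(3)[rng.integers(0, 3, 8)]        # one-hot source labels
ft = rng.dirichlet(np.ones(3), size=10)      # target class probabilities
print(deepjdot_coupling(gs, gt, ys, ft).shape)  # (8, 10) coupling matrix
```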
Estimating 6D Pose From Localizing Designated Surface Keypoints
Title | Estimating 6D Pose From Localizing Designated Surface Keypoints |
Authors | Zelin Zhao, Gao Peng, Haoyu Wang, Hao-Shu Fang, Chengkun Li, Cewu Lu |
Abstract | In this paper, we present an accurate yet efficient solution for 6D pose estimation from an RGB image. The core of our approach is that we first designate a set of surface points on the target object model as keypoints and then train a keypoint detector (KPD) to localize them. Finally, a PnP algorithm can recover the 6D pose according to the 2D-3D relationship of keypoints. Different from recent state-of-the-art CNN-based approaches that rely on a time-consuming post-processing procedure, our method can achieve competitive accuracy without any refinement after pose prediction. Meanwhile, we obtain a 30% relative improvement in terms of ADD accuracy among methods without using refinement. Moreover, we succeed in handling heavy occlusion by selecting the most confident keypoints to recover the 6D pose. For the sake of reproducibility, we will make our code and models publicly available soon. |
Tasks | 6D Pose Estimation, 6D Pose Estimation using RGB, Pose Estimation, Pose Prediction |
Published | 2018-12-04 |
URL | http://arxiv.org/abs/1812.01387v1 |
http://arxiv.org/pdf/1812.01387v1.pdf | |
PWC | https://paperswithcode.com/paper/estimating-6d-pose-from-localizing-designated |
Repo | https://github.com/why2011btv/6d_pose_estimation |
Framework | pytorch |
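Given the detected 2D keypoints and their designated 3D counterparts on the object model, the final step is a standard PnP solve, e.g. with OpenCV. All coordinates and intrinsics below are made-up illustrative numbers:

```python
import numpy as np
import cv2

# 2D-3D correspondences: designated surface keypoints on the model and their
# detected image locations (illustrative values, not real detections).
object_pts = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1],
                       [0.1, 0.1, 0], [0.1, 0, 0.1]], dtype=np.float64)
image_pts = np.array([[320, 240], [400, 238], [318, 170], [322, 260],
                      [398, 168], [401, 258]], dtype=np.float64)
K = np.array([[572.4, 0, 325.3],
              [0, 573.6, 242.0],
              [0, 0, 1.0]])        # camera intrinsics

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)          # rotation matrix + translation = the 6D pose
print(ok, tvec.ravel())
```

To handle occlusion as the abstract describes, one would keep only the most confident detections before solving, or use `cv2.solvePnPRansac` to reject outlier keypoints.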
Efficient Neural Audio Synthesis
Title | Efficient Neural Audio Synthesis |
Authors | Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu |
Abstract | Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency. |
Tasks | Speech Synthesis, Text-To-Speech Synthesis |
Published | 2018-02-23 |
URL | http://arxiv.org/abs/1802.08435v2 |
http://arxiv.org/pdf/1802.08435v2.pdf | |
PWC | https://paperswithcode.com/paper/efficient-neural-audio-synthesis |
Repo | https://github.com/CorentinJ/Real-Time-Voice-Cloning |
Framework | tf |
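The dual softmax layer mentioned in the abstract rests on a simple factorisation: a 16-bit sample splits exactly into an 8-bit coarse part and an 8-bit fine part, so two 256-way softmaxes replace one intractable 65,536-way one (the network predicts the coarse part first, then the fine part conditioned on it). A sketch of the round trip:

```python
import numpy as np

def split_coarse_fine(samples_16bit):
    """Factor each 16-bit sample into two 8-bit parts (coarse, fine)."""
    u = samples_16bit.astype(np.int32) + 2 ** 15   # signed -> unsigned [0, 65535]
    return u // 256, u % 256

def join_coarse_fine(coarse, fine):
    """Reassemble the original signed 16-bit sample."""
    return (coarse * 256 + fine - 2 ** 15).astype(np.int16)

x = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
c, f = split_coarse_fine(x)
assert np.array_equal(join_coarse_fine(c, f), x)   # lossless round trip
print(c, f)
```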
Deep Part Induction from Articulated Object Pairs
Title | Deep Part Induction from Articulated Object Pairs |
Authors | Li Yi, Haibin Huang, Difan Liu, Evangelos Kalogerakis, Hao Su, Leonidas Guibas |
Abstract | Object functionality is often expressed through part articulation – as when the two rigid parts of a pair of scissors pivot against each other to perform the cutting function. Such articulations are often similar across objects within the same functional category. In this paper, we explore how the observation of different articulation states provides evidence for part structure and motion of 3D objects. Our method takes as input a pair of unsegmented shapes representing two different articulation states of two functionally related objects, and induces their common parts along with their underlying rigid motion. This is a challenging setting, as we assume no prior shape structure, no prior shape category information, no consistent shape orientation, the articulation states may belong to objects of different geometry, plus we allow inputs to be noisy and partial scans, or point clouds lifted from RGB images. Our method learns a neural network architecture with three modules that respectively propose correspondences, estimate 3D deformation flows, and perform segmentation. To achieve optimal performance, our architecture alternates between correspondence, deformation flow, and segmentation prediction iteratively in an ICP-like fashion. Our results demonstrate that our method significantly outperforms state-of-the-art techniques in the task of discovering articulated parts of objects. In addition, our part induction is object-class agnostic and successfully generalizes to new and unseen objects. |
Tasks | |
Published | 2018-09-19 |
URL | http://arxiv.org/abs/1809.07417v1 |
http://arxiv.org/pdf/1809.07417v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-part-induction-from-articulated-object |
Repo | https://github.com/ericyi/articulated-part-induction |
Framework | tf |
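The ICP-like alternation can be imitated with non-learned stand-ins to see its shape: propose correspondences, derive a deformation flow from them, segment points whose flow is consistent, and repeat. In the paper each step is a neural module and the alternation handles large motions that a single nearest-neighbour pass cannot; the toy below only works because the articulation is small.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

def alternate_corr_flow_seg(pc_a, pc_b, n_parts=2, n_iters=3):
    """Crude stand-in for the paper's alternation: nearest neighbours play
    the correspondence module, matched displacements the flow module, and
    k-means on flow vectors the segmentation module."""
    flow = np.zeros_like(pc_a)
    labels = None
    for _ in range(n_iters):
        _, idx = cKDTree(pc_b).query(pc_a + flow)                         # correspondence
        flow = pc_b[idx] - pc_a                                           # deformation flow
        labels = KMeans(n_clusters=n_parts, n_init=10).fit_predict(flow)  # segmentation
    return flow, labels

# Two articulation states: one half of the points shifted rigidly.
rng = np.random.default_rng(0)
pc_a = rng.uniform(-1, 1, size=(200, 3))
pc_b = pc_a.copy()
pc_b[:100] += np.array([0.05, 0.0, 0.0])   # the "moving part"
flow, labels = alternate_corr_flow_seg(pc_a, pc_b)
print(np.bincount(labels))                 # roughly a 100 / 100 split into two parts
```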
DVAE++: Discrete Variational Autoencoders with Overlapping Transformations
Title | DVAE++: Discrete Variational Autoencoders with Overlapping Transformations |
Authors | Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, Evgeny Andriyash |
Abstract | Training of discrete latent variable models remains challenging because passing gradient information through discrete units is difficult. We propose a new class of smoothing transformations based on a mixture of two overlapping distributions, and show that the proposed transformation can be used for training binary latent models with either directed or undirected priors. We derive a new variational bound to efficiently train with Boltzmann machine priors. Using this bound, we develop DVAE++, a generative model with a global discrete prior and a hierarchy of convolutional continuous variables. Experiments on several benchmarks show that overlapping transformations outperform other recent continuous relaxations of discrete latent variables including Gumbel-Softmax (Maddison et al., 2016; Jang et al., 2016), and discrete variational autoencoders (Rolfe 2016). |
Tasks | Latent Variable Models |
Published | 2018-02-14 |
URL | http://arxiv.org/abs/1802.04920v2 |
http://arxiv.org/pdf/1802.04920v2.pdf | |
PWC | https://paperswithcode.com/paper/dvae-discrete-variational-autoencoders-with-1 |
Repo | https://github.com/QuadrantAI/dvae |
Framework | tf |
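The overlapping transformation replaces a binary z with a continuous zeta in [0, 1] drawn from one of two exponential densities that share support, so gradient information can pass through the relaxation. The sketch below shows only the conditional samplers; DVAE++ itself samples from the mixture through an analytic inverse CDF so that reparameterised gradients reach the posterior probabilities.

```python
import numpy as np

def sample_overlapping(z, beta=8.0, rng=None):
    """Smooth a binary variable with overlapping exponentials: r(zeta|z=0)
    decays from 0, r(zeta|z=1) rises toward 1, and both are supported on
    all of [0, 1]. Sampling is by inverse CDF (illustrative formulas)."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(size=np.shape(z))
    norm = 1.0 - np.exp(-beta)
    zeta0 = -np.log(1.0 - u * norm) / beta                 # inverse CDF given z = 0
    zeta1 = 1.0 + np.log(u * norm + np.exp(-beta)) / beta  # inverse CDF given z = 1
    return np.where(z == 1, zeta1, zeta0)

z = np.array([0, 0, 1, 1, 1])
print(sample_overlapping(z))  # values near 0 for z=0, near 1 for z=1, never exactly binary
```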
Learning Factorized Multimodal Representations
Title | Learning Factorized Multimodal Representations |
Authors | Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, Ruslan Salakhutdinov |
Abstract | Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning. |
Tasks | Representation Learning |
Published | 2018-06-16 |
URL | https://arxiv.org/abs/1806.06176v3 |
https://arxiv.org/pdf/1806.06176v3.pdf | |
PWC | https://paperswithcode.com/paper/learning-factorized-multimodal |
Repo | https://github.com/pliang279/factorized |
Framework | pytorch |
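The factorisation in the abstract maps naturally onto a pair of encoders per modality plus a shared fusion. The skeleton below is a hedged illustration: layer sizes, sum-fusion, and the single prediction head are invented, and the paper's inference machinery is omitted.

```python
import torch
import torch.nn as nn

class FactorizedMultimodal(nn.Module):
    """Skeleton of the factorisation: a shared discriminative factor fused
    across modalities, plus one modality-specific generative factor each."""
    def __init__(self, dims=(300, 74), d_shared=32, d_private=16):
        super().__init__()
        self.enc_shared = nn.ModuleList([nn.Linear(d, d_shared) for d in dims])
        self.enc_private = nn.ModuleList([nn.Linear(d, d_private) for d in dims])
        self.dec = nn.ModuleList([nn.Linear(d_shared + d_private, d) for d in dims])
        self.clf = nn.Linear(d_shared, 1)   # e.g. a sentiment prediction head

    def forward(self, xs):
        # Shared discriminative factor: fuse the per-modality projections.
        shared = torch.stack([e(x) for e, x in zip(self.enc_shared, xs)]).sum(0)
        privates = [e(x) for e, x in zip(self.enc_private, xs)]
        # Generative path: reconstruct each modality from shared + its own factor.
        recons = [d(torch.cat([shared, p], -1)) for d, p in zip(self.dec, privates)]
        return self.clf(shared), recons

text = torch.randn(4, 300)   # e.g. word-embedding features
audio = torch.randn(4, 74)   # e.g. acoustic features
pred, recons = FactorizedMultimodal()([text, audio])
print(pred.shape, [r.shape for r in recons])
```

Training would add reconstruction losses on `recons` to the prediction loss, giving the joint generative-discriminative objective; a missing modality can then be reconstructed from the remaining factors.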
HOUDINI: Lifelong Learning as Program Synthesis
Title | HOUDINI: Lifelong Learning as Program Synthesis |
Authors | Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, Swarat Chaudhuri |
Abstract | We present a neurosymbolic framework for the lifelong learning of algorithmic tasks that mix perception and procedural reasoning. Reusing high-level concepts across domains and learning complex procedures are key challenges in lifelong learning. We show that a program synthesis approach that combines gradient descent with combinatorial search over programs can be a more effective response to these challenges than purely neural methods. Our framework, called HOUDINI, represents neural networks as strongly typed, differentiable functional programs that use symbolic higher-order combinators to compose a library of neural functions. Our learning algorithm consists of: (1) a symbolic program synthesizer that performs a type-directed search over parameterized programs, and decides on the library functions to reuse, and the architectures to combine them, while learning a sequence of tasks; and (2) a neural module that trains these programs using stochastic gradient descent. We evaluate HOUDINI on three benchmarks that combine perception with the algorithmic tasks of counting, summing, and shortest-path computation. Our experiments show that HOUDINI transfers high-level concepts more effectively than traditional transfer learning and progressive neural networks, and that the typed representation of networks significantly accelerates the search. |
Tasks | Program Synthesis, Transfer Learning |
Published | 2018-03-31 |
URL | http://arxiv.org/abs/1804.00218v2 |
http://arxiv.org/pdf/1804.00218v2.pdf | |
PWC | https://paperswithcode.com/paper/houdini-lifelong-learning-as-program |
Repo | https://github.com/capergroup/houdini |
Framework | pytorch |
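The symbolic half of HOUDINI, type-directed search over compositions of library functions, can be caricatured in a few lines. Everything here is invented for illustration: in HOUDINI the library entries are strongly typed differentiable modules, the combinators are higher-order (map, fold, compose), and a neural phase trains each candidate program with SGD.

```python
# Toy typed library: (name, input type, output type). "map(recognise_digit)"
# stands for a neural module lifted over lists by a higher-order combinator.
LIBRARY = [
    ("map(recognise_digit)", "list<image>", "list<digit>"),
    ("count_even",           "list<digit>", "int"),
    ("regress_speed",        "image",       "speed"),
]

def chains(lib, in_type, out_type, depth=3):
    """Enumerate chains of library functions whose types compose from
    in_type to out_type: the symbolic half of the synthesizer."""
    if depth == 0:
        return
    for name, t_in, t_out in lib:
        if t_in != in_type:
            continue
        if t_out == out_type:
            yield [name]
        for rest in chains(lib, t_out, out_type, depth - 1):
            yield [name, *rest]

# Counting even digits in a list of images = map a digit recogniser, then count.
print(list(chains(LIBRARY, "list<image>", "int")))
```

Type information is what keeps this search tractable: only compositions whose types line up are ever enumerated, which is the acceleration the abstract attributes to the typed representation.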
Simultaneous Coherent Structure Coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity
Title | Simultaneous Coherent Structure Coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity |
Authors | Brooke E. Husic, Kristy L. Schlueter-Kuck, John O. Dabiri |
Abstract | The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we obtain a set of orthogonal coordinates along which dissimilarity in the dataset is maximized from a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields where existing analysis tools require ad hoc assumptions on the data structure, lack the interpretability of the present method, or in which the underlying processes are less accessible, such as genomics and neuroscience. |
Tasks | |
Published | 2018-07-12 |
URL | http://arxiv.org/abs/1807.04427v3 |
http://arxiv.org/pdf/1807.04427v3.pdf | |
PWC | https://paperswithcode.com/paper/simultaneous-coherent-structure-coloring |
Repo | https://github.com/brookehus/sCSC |
Framework | none |
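The bifurcation step can be sketched directly from the abstract: build a graph whose edge weights are the pairwise dissimilarities, solve a generalized eigenvalue problem for the coordinate that maximises dissimilarity-weighted separation, and split at its median. Details are simplified relative to the paper:

```python
import numpy as np
from scipy.linalg import eigh

def csc_split(D):
    """One sCSC-style bifurcation: the top generalized eigenvector places
    the most dissimilar points at opposite ends of the coordinate, so a
    median split separates them into two clusters."""
    deg = np.diag(D.sum(axis=1))
    L = deg - D                      # Laplacian of the dissimilarity graph
    vals, vecs = eigh(L, deg)        # generalized problem L x = lambda deg x
    x = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return x > np.median(x)          # binary split; recurse for the full tree

# Two noisy groups whose cross-group dissimilarity is large.
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.1, 50), rng.normal(1, 0.1, 50)])
D = np.abs(pts[:, None] - pts[None, :])
labels = csc_split(D)
print(labels[:50].mean(), labels[50:].mean())  # ~0 and ~1 (or flipped): groups separate
```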