January 27, 2020

2950 words 14 mins read

Paper Group ANR 1095

Recent Advances in Deep Learning for Object Detection. Vision-to-Language Tasks Based on Attributes and Attention Mechanism. HA-CCN: Hierarchical Attention-based Crowd Counting Network. Learn to Compress CSI and Allocate Resources in Vehicular Networks. Multimodal Transformer with Multi-View Visual Representation for Image Captioning. Dialogue Act …

Recent Advances in Deep Learning for Object Detection

Title Recent Advances in Deep Learning for Object Detection
Authors Xiongwei Wu, Doyen Sahoo, Steven C. H. Hoi
Abstract Object detection is a fundamental visual recognition problem in computer vision and has been widely studied in the past decades. Visual object detection aims to find objects of certain target classes with precise localization in a given image and assign each object instance a corresponding class label. Due to the tremendous successes of deep learning based image classification, object detection techniques using deep learning have been actively studied in recent years. In this paper, we give a comprehensive survey of recent advances in visual object detection with deep learning. By reviewing a large body of recent related work in literature, we systematically analyze the existing object detection frameworks and organize the survey into three major parts: (i) detection components, (ii) learning strategies, and (iii) applications & benchmarks. In the survey, we cover a variety of factors affecting the detection performance in detail, such as detector architectures, feature learning, proposal generation, sampling strategies, etc. Finally, we discuss several future directions to facilitate and spur future research for visual object detection with deep learning. Keywords: Object Detection, Deep Learning, Deep Convolutional Neural Networks
Tasks Image Classification, Object Detection
Published 2019-08-10
URL https://arxiv.org/abs/1908.03673v1
PDF https://arxiv.org/pdf/1908.03673v1.pdf
PWC https://paperswithcode.com/paper/recent-advances-in-deep-learning-for-object
Repo
Framework
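
The survey above organizes detectors by their components: proposal generation, feature learning, sampling strategies, and post-processing. As a concrete anchor for one such component, here is a minimal sketch of greedy non-maximum suppression (NMS), the standard post-processing step used by virtually every detector the survey covers. This is textbook NMS, not code from the paper.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]               # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top-scoring box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]     # drop heavily overlapping boxes
    return keep
```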

Vision-to-Language Tasks Based on Attributes and Attention Mechanism

Title Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Authors Xuelong Li, Aihong Yuan, Xiaoqiang Lu
Abstract Vision-to-language tasks aim to integrate computer vision and natural language processing, and have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences, but they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper exploits text-guided attention and semantic-guided attention (SA) to find more correlated spatial information and reduce the semantic gap between vision and language. Our method includes two attention networks: a text-guided attention network that selects text-related regions, and an SA network that highlights concept-related regions and region-related concepts. Finally, all of this information is incorporated to generate captions or answers. Image captioning and visual question answering experiments have been carried out, and the results demonstrate the excellent performance of the proposed approach.
Tasks Image Captioning, Question Answering, Visual Question Answering
Published 2019-05-29
URL https://arxiv.org/abs/1905.12243v1
PDF https://arxiv.org/pdf/1905.12243v1.pdf
PWC https://paperswithcode.com/paper/vision-to-language-tasks-based-on-attributes
Repo
Framework
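
To make the two-level attention described above more concrete, here is a minimal sketch of additive (Bahdanau-style) attention over region features driven by a guide vector, which could be a text or concept embedding. The parameter names `W_r`, `W_g`, and `w` are illustrative; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def guided_attention(regions, guide, W_r, W_g, w):
    """regions: (N, K, D) region features; guide: (N, D) guiding embedding.
    Additive attention: score_k = w^T tanh(W_r r_k + W_g g)."""
    proj = torch.tanh(regions @ W_r + (guide @ W_g).unsqueeze(1))  # (N, K, H)
    scores = proj @ w                                              # (N, K)
    alpha = F.softmax(scores, dim=1)                               # attention weights
    attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)          # (N, D)
    return attended, alpha

# Usage with illustrative sizes: D, H = 2048, 512
# W_r, W_g, w = torch.randn(D, H), torch.randn(D, H), torch.randn(H)
```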

HA-CCN: Hierarchical Attention-based Crowd Counting Network

Title HA-CCN: Hierarchical Attention-based Crowd Counting Network
Authors Vishwanath A. Sindagi, Vishal M. Patel
Abstract Single-image crowd counting has recently received increased attention, but many leading methods remain far from optimal, especially in highly congested scenes. In this paper, we present the Hierarchical Attention-based Crowd Counting Network (HA-CCN), which employs attention mechanisms at various levels to selectively enhance the features of the network. The proposed method, based on the VGG16 network, consists of a spatial attention module (SAM) and a set of global attention modules (GAM). SAM enhances low-level features by infusing spatial segmentation information, whereas GAM enhances channel-wise information in the higher-level layers. The proposed method is a single-step training framework, is simple to implement, and achieves state-of-the-art results on different datasets. Furthermore, we extend the counting network with a novel set-up that adapts it to different scenes and datasets via weak supervision using image-level labels. This set-up reduces the burden of acquiring labour-intensive point-wise annotations for new datasets while improving cross-dataset performance.
Tasks Crowd Counting
Published 2019-07-24
URL https://arxiv.org/abs/1907.10255v1
PDF https://arxiv.org/pdf/1907.10255v1.pdf
PWC https://paperswithcode.com/paper/ha-ccn-hierarchical-attention-based-crowd
Repo
Framework
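
HA-CCN's global attention modules reweight feature channels in the higher layers. The sketch below implements squeeze-and-excitation-style channel attention, which captures the spirit of a GAM; the layer sizes and exact design here are illustrative, not taken from the paper.

```python
import torch.nn as nn

class GlobalAttentionModule(nn.Module):
    """Channel-wise attention in the spirit of HA-CCN's GAM (SE-style sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                        # per-channel gate in [0, 1]
        )

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, _, _ = x.shape
        gate = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * gate                          # reweight channels
```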

Learn to Compress CSI and Allocate Resources in Vehicular Networks

Title Learn to Compress CSI and Allocate Resources in Vehicular Networks
Authors Liang Wang, Hao Ye, Le Liang, Geoffrey Ye Li
Abstract Resource allocation has a direct and profound impact on the performance of vehicle-to-everything (V2X) networks. In this paper, we develop a hybrid architecture consisting of centralized decision making and distributed resource sharing (the C-Decision scheme) to maximize the long-term sum rate of all vehicles. To reduce the network signaling overhead, each vehicle uses a deep neural network to compress its observed information that is thereafter fed back to the centralized decision making unit. The centralized decision unit employs a deep Q-network to allocate resources and then sends the decision results to all vehicles. We further adopt a quantization layer for each vehicle that learns to quantize the continuous feedback. In addition, we devise a mechanism to balance the transmission of vehicle-to-vehicle (V2V) links and vehicle-to-infrastructure (V2I) links. To further facilitate distributed spectrum sharing, we also propose a distributed decision making and spectrum sharing architecture (the D-Decision scheme) for each V2V link. Through extensive simulation results, we demonstrate that the proposed C-Decision and D-Decision schemes can both achieve near-optimal performance and are robust to feedback interval variations, input noise, and feedback noise.
Tasks Decision Making, Quantization
Published 2019-08-12
URL https://arxiv.org/abs/1908.04685v1
PDF https://arxiv.org/pdf/1908.04685v1.pdf
PWC https://paperswithcode.com/paper/learn-to-compress-csi-and-allocate-resources
Repo
Framework
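
A key piece of the C-Decision scheme is a quantization layer that learns to quantize each vehicle's continuous feedback. One common way to make such a layer trainable is sign quantization with a straight-through gradient estimator, sketched below; the paper's exact quantization scheme may differ.

```python
import torch

class BinaryQuantize(torch.autograd.Function):
    """Sign quantization with a straight-through gradient: the forward pass
    discretizes the feedback, the backward pass passes gradients unchanged."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)        # quantize each feedback value to {-1, +1}

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output          # straight-through estimator

# Usage inside a vehicle's encoder (hypothetical): z = BinaryQuantize.apply(encoder(obs))
```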

Multimodal Transformer with Multi-View Visual Representation for Image Captioning

Title Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Authors Jun Yu, Jing Li, Zhou Yu, Qingming Huang
Abstract Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models adopt an encoder-decoder framework: a convolutional neural network (CNN)-based image encoder extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder generates the output caption words from the visual features via an attention mechanism. Despite the success of existing studies, current methods model only the co-attention that characterizes inter-modal interactions while neglecting the self-attention that characterizes intra-modal interactions. Inspired by the success of the Transformer model in machine translation, we extend it here to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach on the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms previous state-of-the-art methods. With an ensemble of seven models, our solution ranked first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
Tasks Image Captioning, Machine Translation
Published 2019-05-20
URL https://arxiv.org/abs/1905.07841v1
PDF https://arxiv.org/pdf/1905.07841v1.pdf
PWC https://paperswithcode.com/paper/multimodal-transformer-with-multi-view-visual
Repo
Framework
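
To make the "unified attention block" idea concrete, here is a minimal PyTorch sketch of a decoder block that combines intra-modal self-attention over caption tokens with inter-modal attention over image-region features. Dimensions and layer choices are illustrative, not the paper's.

```python
import torch.nn as nn

class UnifiedAttentionBlock(nn.Module):
    """One decoder block: self-attention over words, then attention to regions."""
    def __init__(self, d_model=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, heads)
        self.cross_attn = nn.MultiheadAttention(d_model, heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, words, regions):   # words: (T, N, D), regions: (K, N, D)
        w = self.norm1(words + self.self_attn(words, words, words)[0])  # intra-modal
        w = self.norm2(w + self.cross_attn(w, regions, regions)[0])     # inter-modal
        return self.norm3(w + self.ffn(w))
```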

Dialogue Act Classification in Group Chats with DAG-LSTMs

Title Dialogue Act Classification in Group Chats with DAG-LSTMs
Authors Ozan İrsoy, Rakesh Gosangi, Haimin Zhang, Mu-Hsin Wei, Peter Lund, Duccio Pappadopulo, Brendan Fahy, Neophytos Nephytou, Camilo Ortiz
Abstract Dialogue act (DA) classification has been studied for the past two decades and has several key applications, such as workflow automation and conversation analytics. To address this problem, researchers have used various traditional machine learning models and, more recently, deep neural network models such as hierarchical convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. In this paper, we introduce a new model architecture, the directed-acyclic-graph LSTM (DAG-LSTM), for DA classification. A DAG-LSTM exploits the turn-taking structure naturally present in a multi-party conversation and encodes this relation in its model structure. Using the STAC corpus, we show that the proposed method performs roughly 0.8% better in accuracy and 1.2% better in macro-F1 score than existing methods. The proposed method is generic and not limited to conversation applications.
Tasks Dialogue Act Classification
Published 2019-08-02
URL https://arxiv.org/abs/1908.01821v1
PDF https://arxiv.org/pdf/1908.01821v1.pdf
PWC https://paperswithcode.com/paper/dialogue-act-classification-in-group-chats
Repo
Framework
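
The core idea is a recurrent cell whose state for an utterance is computed from all of its predecessor utterances in the conversation DAG (for example, the previous message in the channel and the previous message by the same speaker). Below is a child-sum-style sketch of such a cell; for simplicity it shares one forget gate across predecessors, whereas the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class DAGLSTMCell(nn.Module):
    """An utterance's state aggregates the states of all its DAG predecessors."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(input_size, 4 * hidden_size)
        self.U = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        self.hidden_size = hidden_size

    def forward(self, x, pred_h, pred_c):  # pred_h, pred_c: lists of (H,) tensors
        h_sum = torch.stack(pred_h).sum(0) if pred_h else x.new_zeros(self.hidden_size)
        i, f, o, g = (self.W(x) + self.U(h_sum)).chunk(4)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = i * g + sum(f * c_k for c_k in pred_c)  # simplified: shared forget gate
        h = o * torch.tanh(c)
        return h, c
```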

Issues with post-hoc counterfactual explanations: a discussion

Title Issues with post-hoc counterfactual explanations: a discussion
Authors Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Marcin Detyniecki
Abstract Counterfactual post-hoc interpretability approaches have proven to be useful tools for generating explanations of the predictions of a trained black-box classifier. However, the assumptions they make about the data and the classifier render them unreliable in many contexts. In this paper, we discuss three desirable properties and approaches to quantify them: proximity, connectedness, and stability. In addition, we illustrate that post-hoc counterfactual approaches risk failing to satisfy these properties.
Tasks
Published 2019-06-11
URL https://arxiv.org/abs/1906.04774v1
PDF https://arxiv.org/pdf/1906.04774v1.pdf
PWC https://paperswithcode.com/paper/issues-with-post-hoc-counterfactual
Repo
Framework
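
Two of the three properties can be quantified quite simply. The sketch below gives one plausible reading of proximity and stability; the paper's formal criteria (especially for connectedness) are more involved, and `explainer` is a hypothetical callable mapping an instance to a counterfactual.

```python
import numpy as np

def proximity(x, x_cf):
    """Proximity: how close the counterfactual stays to the original instance
    (L2 here; other norms are common)."""
    return np.linalg.norm(x - x_cf)

def stability(explainer, x, n_runs=10):
    """Stability: do repeated runs of a (possibly stochastic) counterfactual
    generator return similar explanations for the same input?"""
    cfs = np.stack([explainer(x) for _ in range(n_runs)])
    return cfs.std(axis=0).mean()   # low spread = more stable
```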

Fast and Efficient Model for Real-Time Tiger Detection In The Wild

Title Fast and Efficient Model for Real-Time Tiger Detection In The Wild
Authors Orest Kupyn, Dmitry Pranchuk
Abstract The highest-accuracy object detectors to date are based either on a two-stage approach such as Fast R-CNN or on one-stage detectors such as RetinaNet or SSD with deep and complex backbones. In this paper we present TigerNet, a simple yet efficient FPN-based network architecture for Amur tiger detection in the wild. The model has 600k parameters, requires 0.071 GFLOPs per image, and can run on edge devices (smart cameras) in near real time. In addition, we introduce a two-stage semi-supervised learning approach via pseudo-labelling to distill knowledge from larger networks. For the ATRW-ICCV 2019 tiger detection sub-challenge, our approach shows superior performance compared to other methods based on the public leaderboard score.
Tasks
Published 2019-09-03
URL https://arxiv.org/abs/1909.01122v1
PDF https://arxiv.org/pdf/1909.01122v1.pdf
PWC https://paperswithcode.com/paper/fast-and-efficient-model-for-real-time-tiger
Repo
Framework
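
The pseudo-labelling stage can be pictured as: run a larger teacher detector over unlabeled frames, keep its confident boxes as training targets, and train the small student on the union of real and pseudo annotations. A sketch with hypothetical interfaces; the teacher here is assumed to return torchvision-style `boxes`/`labels`/`scores` dictionaries.

```python
import torch

def pseudo_label(teacher, unlabeled_loader, conf_thresh=0.7):
    """Stage one of a two-stage pseudo-labelling scheme: harvest confident
    teacher detections as training targets for the small student."""
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for images in unlabeled_loader:
            for img, det in zip(images, teacher(images)):
                keep = det["scores"] > conf_thresh   # confident detections only
                pseudo.append((img, det["boxes"][keep], det["labels"][keep]))
    return pseudo   # stage two: train the student on real + pseudo annotations
```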

Adaptive Online Learning for Gradient-Based Optimizers

Title Adaptive Online Learning for Gradient-Based Optimizers
Authors Saeed Masoudian, Ali Arabzadeh, Mahdi Jafari Siavoshani, Milad Jalal, Alireza Amouzad
Abstract As application demands for online convex optimization accelerate, the need for new methods that simultaneously cover a large class of convex functions and achieve the lowest possible regret is growing rapidly. Known online optimization methods usually perform well only in specific settings, and their performance depends heavily on the geometry of the decision space and the cost functions. In practice, however, such geometric information is often unavailable, making it unclear which algorithm to choose. To address this issue, adaptive methods have been proposed that focus on adaptively learning parameters such as the step size, Lipschitz constant, and strong-convexity coefficient, or on specific parametric families such as quadratic regularizers. In this work, we generalize these methods and propose a framework that competes with the best algorithm in a family of expert algorithms. Our framework includes many well-known adaptive methods, including MetaGrad, MetaGrad+C, and Ader. We also introduce a second algorithm that computationally outperforms our first algorithm at the cost of at most a constant-factor increase in regret. Finally, as a representative application of our proposed algorithm, we study the problem of learning the best regularizer from a family of regularizers for Online Mirror Descent. Empirically, we support our theoretical findings on the problem of learning the best regularizer on the simplex and the $l_2$-ball in a multiclass learning problem.
Tasks
Published 2019-06-01
URL https://arxiv.org/abs/1906.00290v1
PDF https://arxiv.org/pdf/1906.00290v1.pdf
PWC https://paperswithcode.com/paper/190600290
Repo
Framework
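
The "compete with the best expert algorithm" framework is, at its core, multiplicative-weights aggregation over a family of base online learners. A minimal sketch, with hypothetical `predict`/`update` interfaces for the experts and without the paper's refinements:

```python
import numpy as np

def aggregate_experts(experts, loss_at, T, eta=0.5):
    """Multiplicative-weights meta-algorithm over expert online learners
    (cf. MetaGrad, Ader). `loss_at(t)` returns the convex loss of round t."""
    w = np.ones(len(experts))
    for t in range(T):
        preds = [e.predict() for e in experts]          # each expert proposes a point
        p = w / w.sum()
        x_t = sum(pi * xi for pi, xi in zip(p, preds))  # play the weighted combination
        loss = loss_at(t)                               # loss revealed after playing x_t
        for i, e in enumerate(experts):
            w[i] *= np.exp(-eta * loss(preds[i]))       # downweight costly experts
            e.update(loss)                              # each expert runs its own update
    return w                                            # final expert weights
```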

Distributed Function Minimization in Apache Spark

Title Distributed Function Minimization in Apache Spark
Authors Andrea Schioppa
Abstract We report on an open-source implementation of distributed function minimization on top of Apache Spark using gradient and quasi-Newton methods. We showcase it with an application to Optimal Transport and some scalability tests on classification and regression problems.
Tasks
Published 2019-09-17
URL https://arxiv.org/abs/1909.07922v1
PDF https://arxiv.org/pdf/1909.07922v1.pdf
PWC https://paperswithcode.com/paper/distributed-function-minimization-in-apache
Repo
Framework
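
The basic pattern behind such an implementation: each Spark partition computes a partial gradient of the objective over its shard of the data, and the driver aggregates the partials to take a step. Here is a minimal PySpark sketch for a toy least-squares objective; it illustrates the pattern, not the paper's library.

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dist-min-sketch").getOrCreate()
sc = spark.sparkContext

# Toy objective f(w) = sum_i (x_i . w - y_i)^2 over a distributed dataset.
data = sc.parallelize(
    [(np.random.randn(5), float(np.random.randn())) for _ in range(10_000)]
).cache()
n = data.count()

w = np.zeros(5)
for _ in range(100):
    # each partition computes partial gradients; the driver sums them
    grad = data.map(lambda xy: 2.0 * (xy[0].dot(w) - xy[1]) * xy[0]) \
               .reduce(lambda a, b: a + b)
    w -= 0.01 * grad / n
```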

Interpreting Adversarial Examples by Activation Promotion and Suppression

Title Interpreting Adversarial Examples by Activation Promotion and Suppression
Authors Kaidi Xu, Sijia Liu, Gaoyuan Zhang, Mengshu Sun, Pu Zhao, Quanfu Fan, Chuang Gan, Xue Lin
Abstract It is widely known that convolutional neural networks (CNNs) are vulnerable to adversarial examples: images with imperceptible perturbations crafted to fool classifiers. However, interpretability of these perturbations is less explored in the literature. This work aims to better understand the roles of adversarial perturbations and provide visual explanations from pixel, image and network perspectives. We show that adversaries have a promotion-suppression effect (PSE) on neurons’ activations and can be primarily categorized into three types: i) suppression-dominated perturbations that mainly reduce the classification score of the true label, ii) promotion-dominated perturbations that focus on boosting the confidence of the target label, and iii) balanced perturbations that play a dual role in suppression and promotion. We also provide image-level interpretability of adversarial examples. This links PSE of pixel-level perturbations to class-specific discriminative image regions localized by class activation mapping (Zhou et al. 2016). Further, we examine the adversarial effect through network dissection (Bau et al. 2017), which offers concept-level interpretability of hidden units. We show that there exists a tight connection between the units’ sensitivity to adversarial attacks and their interpretability on semantic concepts. Lastly, we provide some new insights from our interpretation to improve the adversarial robustness of networks.
Tasks
Published 2019-04-03
URL https://arxiv.org/abs/1904.02057v2
PDF https://arxiv.org/pdf/1904.02057v2.pdf
PWC https://paperswithcode.com/paper/interpreting-adversarial-examples-by
Repo
Framework
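
At the coarsest (image) level, the promotion-suppression decomposition can be read directly off the class scores: how much the adversary suppresses the true label versus how much it promotes the target label. A simplified sketch follows; the paper's analysis additionally works at the level of individual neuron activations.

```python
import torch

def promotion_suppression(model, x, x_adv, y_true, y_target):
    """Decompose an adversary's effect on class scores, in the spirit of PSE."""
    with torch.no_grad():
        clean, adv = model(x), model(x_adv)              # class scores (N, C)
    suppression = clean[:, y_true] - adv[:, y_true]      # drop in the true label
    promotion = adv[:, y_target] - clean[:, y_target]    # rise in the target label
    return suppression, promotion  # dominance of one term categorizes the attack
```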

What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes

Title What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes
Authors Alexey Romanov, Maria De-Arteaga, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, Anna Rumshisky, Adam Tauman Kalai
Abstract There is a growing body of work that proposes methods for mitigating bias in machine learning systems. These methods typically rely on access to protected attributes such as race, gender, or age. However, this raises two significant challenges: (1) protected attributes may not be available or it may not be legal to use them, and (2) it is often desirable to simultaneously consider multiple protected attributes, as well as their intersections. In the context of mitigating bias in occupation classification, we propose a method for discouraging correlation between the predicted probability of an individual’s true occupation and a word embedding of their name. This method leverages the societal biases that are encoded in word embeddings, eliminating the need for access to protected attributes. Crucially, it only requires access to individuals’ names at training time and not at deployment time. We evaluate two variations of our proposed method using a large-scale dataset of online biographies. We find that both variations simultaneously reduce race and gender biases, with almost no reduction in the classifier’s overall true positive rate.
Tasks Word Embeddings
Published 2019-04-10
URL http://arxiv.org/abs/1904.05233v1
PDF http://arxiv.org/pdf/1904.05233v1.pdf
PWC https://paperswithcode.com/paper/whats-in-a-name-reducing-bias-in-bios-without
Repo
Framework
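
The proposed penalty discourages correlation between the predicted probability of an individual's true occupation and a word embedding of their name. Below is a sketch of one way to implement such a covariance penalty, to be added to the usual classification loss; the constants and exact variant differ between the paper's two methods.

```python
import torch

def name_correlation_penalty(p_true, name_emb):
    """p_true: (N,) predicted probability of each individual's true occupation;
    name_emb: (N, D) word embedding of each individual's name."""
    p_c = p_true - p_true.mean()
    e_c = name_emb - name_emb.mean(dim=0, keepdim=True)
    cov = (p_c.unsqueeze(1) * e_c).mean(dim=0)  # covariance with each embedding dim
    return cov.pow(2).sum()                     # add to the cross-entropy loss
```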

dpVAEs: Fixing Sample Generation for Regularized VAEs

Title dpVAEs: Fixing Sample Generation for Regularized VAEs
Authors Riddhish Bhalodia, Iain Lee, Shireen Elhabian
Abstract Unsupervised representation learning via generative modeling is a staple of many computer vision applications in the absence of labeled data. Variational Autoencoders (VAEs) are powerful generative models that learn representations useful for data generation. However, due to inherent challenges in the training objective, VAEs fail to learn representations amenable to downstream tasks. Regularization-based methods that attempt to improve the representation learning aspect of VAEs come at a price: poor sample generation. In this paper, we explore this representation-generation trade-off for regularized VAEs and introduce a new family of priors, namely decoupled priors, or dpVAEs, that decouple the representation space from the generation space. This decoupling enables the use of VAE regularizers on the representation space without impacting the distribution used for sample generation, thereby reaping the representation learning benefits of the regularizations without sacrificing sample generation. dpVAE leverages invertible networks to learn a bijective mapping from an arbitrarily complex representation distribution to a simple, tractable, generative distribution. Decoupled priors can be adapted to state-of-the-art VAE regularizers without additional hyperparameter tuning. We showcase the use of dpVAEs with different regularizers. Experiments on MNIST, SVHN, and CelebA demonstrate, quantitatively and qualitatively, that dpVAE fixes sample generation for regularized VAEs.
Tasks Representation Learning, Unsupervised Representation Learning
Published 2019-11-24
URL https://arxiv.org/abs/1911.10506v2
PDF https://arxiv.org/pdf/1911.10506v2.pdf
PWC https://paperswithcode.com/paper/dpvaes-fixing-sample-generation-for
Repo
Framework
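
The bijective mapping at the heart of dpVAEs can be built from standard invertible layers. Below is a single RealNVP-style affine coupling layer; stacking several such layers gives a flow from a complex representation distribution to a simple generative one. The flow architecture here is illustrative, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Invertible coupling layer: half the dimensions pass through unchanged
    and parameterize an affine transform of the other half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, z):                  # representation -> generation space
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, z2 * log_s.exp() + t], dim=1)

    def inverse(self, u):                  # generation -> representation space
        u1, u2 = u[:, :self.half], u[:, self.half:]
        log_s, t = self.net(u1).chunk(2, dim=1)
        return torch.cat([u1, (u2 - t) * (-log_s).exp()], dim=1)
```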

Towards Intelligent Interactive Theatre: Drama Management as a way of Handling Performance

Title Towards Intelligent Interactive Theatre: Drama Management as a way of Handling Performance
Authors Nic Velissaris, Jessica Rivera-Villicana
Abstract In this paper, we present a new modality for intelligent interactive narratives within the theatre domain. We discuss the possibilities of using an intelligent agent that serves as a drama manager and as an actor that plays a character within the live theatre experience. We pose a set of research challenges that arise from our analysis towards the implementation of such an agent, as well as potential methodologies as a starting point to bridge the gaps between current literature and the proposed modality.
Tasks
Published 2019-09-23
URL https://arxiv.org/abs/1909.10371v1
PDF https://arxiv.org/pdf/1909.10371v1.pdf
PWC https://paperswithcode.com/paper/190910371
Repo
Framework

Learning Body Shape and Pose from Dense Correspondences

Title Learning Body Shape and Pose from Dense Correspondences
Authors Yusuke Yoshiyasu, Lucas Gamez
Abstract In this paper, we address the problem of learning 3D human pose and body shape from a 2D image dataset, without having to use a 3D dataset of body shapes and poses. The idea is to use dense correspondences between image points and a body surface, which can be annotated on in-the-wild 2D images, and to extract and aggregate 3D information from them. To do so, we propose a training strategy called "deform-and-learn", in which we alternate between deformable surface registration and training of deep convolutional neural networks (ConvNets). Unlike previous approaches, our method requires neither 3D pose annotations from a motion capture (MoCap) system nor human intervention to validate 3D pose annotations.
Tasks Motion Capture
Published 2019-07-27
URL https://arxiv.org/abs/1907.11955v1
PDF https://arxiv.org/pdf/1907.11955v1.pdf
PWC https://paperswithcode.com/paper/learning-body-shape-and-pose-from-dense
Repo
Framework
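
The "deform-and-learn" alternation reads as a simple loop: registration produces pseudo ground truth, and the network is trained on it. A sketch with hypothetical `register` and `net.fit` interfaces, not the authors' code:

```python
def deform_and_learn(images, correspondences, body_model, net, rounds=5):
    """Alternate (1) deformable registration of the body model to annotated
    dense image-to-surface correspondences, yielding pseudo ground-truth
    shape/pose, and (2) supervised training of the ConvNet on those targets."""
    targets = [None] * len(images)
    for _ in range(rounds):
        targets = [register(body_model, corr, init=t)    # step 1: registration
                   for corr, t in zip(correspondences, targets)]
        net.fit(images, targets)                         # step 2: train ConvNet
    return net
```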