Paper Group ANR 1095
Recent Advances in Deep Learning for Object Detection. Vision-to-Language Tasks Based on Attributes and Attention Mechanism. HA-CCN: Hierarchical Attention-based Crowd Counting Network. Learn to Compress CSI and Allocate Resources in Vehicular Networks. Multimodal Transformer with Multi-View Visual Representation for Image Captioning. Dialogue Act …
Recent Advances in Deep Learning for Object Detection
Title | Recent Advances in Deep Learning for Object Detection |
Authors | Xiongwei Wu, Doyen Sahoo, Steven C. H. Hoi |
Abstract | Object detection is a fundamental visual recognition problem in computer vision and has been widely studied in the past decades. Visual object detection aims to find objects of certain target classes with precise localization in a given image and assign each object instance a corresponding class label. Due to the tremendous successes of deep learning based image classification, object detection techniques using deep learning have been actively studied in recent years. In this paper, we give a comprehensive survey of recent advances in visual object detection with deep learning. By reviewing a large body of recent related work in literature, we systematically analyze the existing object detection frameworks and organize the survey into three major parts: (i) detection components, (ii) learning strategies, and (iii) applications & benchmarks. In the survey, we cover a variety of factors affecting the detection performance in detail, such as detector architectures, feature learning, proposal generation, sampling strategies, etc. Finally, we discuss several future directions to facilitate and spur future research for visual object detection with deep learning. Keywords: Object Detection, Deep Learning, Deep Convolutional Neural Networks |
Tasks | Image Classification, Object Detection |
Published | 2019-08-10 |
URL | https://arxiv.org/abs/1908.03673v1 |
https://arxiv.org/pdf/1908.03673v1.pdf | |
PWC | https://paperswithcode.com/paper/recent-advances-in-deep-learning-for-object |
Repo | |
Framework | |
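The survey above catalogues detector components rather than a single algorithm, but a quick, hedged illustration of the two-stage, FPN-backed detectors it discusses can be run with torchvision. This is purely an editorial sketch, not code from the paper; the image path `example.jpg` and the 0.5 score threshold are placeholder assumptions.

```python
# Minimal sketch: running a pretrained two-stage detector (Faster R-CNN with an FPN
# backbone) via torchvision. Illustrative only; not the survey's own code.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# `pretrained=True` is the older torchvision API; newer releases prefer `weights=...`.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("example.jpg").convert("RGB"))   # placeholder path
with torch.no_grad():
    outputs = model([image])                                  # one dict per input image

keep = outputs[0]["scores"] > 0.5                             # filter low-confidence boxes
print(outputs[0]["boxes"][keep], outputs[0]["labels"][keep])
```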
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Title | Vision-to-Language Tasks Based on Attributes and Attention Mechanism |
Authors | Xuelong Li, Aihong Yuan, Xiaoqiang Lu |
Abstract | Vision-to-language tasks aim to integrate computer vision and natural language processing, and they have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences, but they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper exploits text-guided attention and semantic-guided attention (SA) to find the most correlated spatial information and reduce the semantic gap between vision and language. Our method includes two levels of attention networks. One is the text-guided attention network, which is used to select the text-related regions. The other is the SA network, which is used to highlight the concept-related regions and the region-related concepts. Finally, all of this information is incorporated to generate captions or answers. In practice, image captioning and visual question answering experiments have been carried out, and the experimental results show the excellent performance of the proposed approach. |
Tasks | Image Captioning, Question Answering, Visual Question Answering |
Published | 2019-05-29 |
URL | https://arxiv.org/abs/1905.12243v1 |
https://arxiv.org/pdf/1905.12243v1.pdf | |
PWC | https://paperswithcode.com/paper/vision-to-language-tasks-based-on-attributes |
Repo | |
Framework | |
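As a rough illustration of the text-guided attention described in the abstract above, the sketch below scores each image-region feature against a text encoding and pools the regions by the resulting weights. It is an editorial approximation, not the authors' code; the dimensions (2048-d regions, 512-d text) and the additive scoring function are assumptions.

```python
# Sketch of a text-guided attention step over region features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedAttention(nn.Module):
    def __init__(self, region_dim, text_dim, hidden_dim):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)
        self.proj_t = nn.Linear(text_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, text):
        # regions: (B, N, region_dim) features of N image regions
        # text:    (B, text_dim) encoding of the text (e.g. partial caption or question)
        joint = torch.tanh(self.proj_v(regions) + self.proj_t(text).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (B, N)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)     # (B, region_dim)
        return attended, weights

att = TextGuidedAttention(region_dim=2048, text_dim=512, hidden_dim=512)
pooled, weights = att(torch.randn(2, 36, 2048), torch.randn(2, 512))
```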
HA-CCN: Hierarchical Attention-based Crowd Counting Network
Title | HA-CCN: Hierarchical Attention-based Crowd Counting Network |
Authors | Vishwanath A. Sindagi, Vishal M. Patel |
Abstract | Single image-based crowd counting has recently witnessed increased focus, but many leading methods are far from optimal, especially in highly congested scenes. In this paper, we present the Hierarchical Attention-based Crowd Counting Network (HA-CCN), which employs attention mechanisms at various levels to selectively enhance the features of the network. The proposed method, which is based on the VGG16 network, consists of a spatial attention module (SAM) and a set of global attention modules (GAM). SAM enhances low-level features in the network by infusing spatial segmentation information, whereas the GAMs focus on enhancing channel-wise information in the higher-level layers. The proposed method is a single-step training framework, is simple to implement, and achieves state-of-the-art results on different datasets. Furthermore, we extend the proposed counting network by introducing a novel set-up to adapt the network to different scenes and datasets via weak supervision using image-level labels. This new set-up reduces the burden of acquiring labour-intensive point-wise annotations for new datasets while improving cross-dataset performance. |
Tasks | Crowd Counting |
Published | 2019-07-24 |
URL | https://arxiv.org/abs/1907.10255v1 |
https://arxiv.org/pdf/1907.10255v1.pdf | |
PWC | https://paperswithcode.com/paper/ha-ccn-hierarchical-attention-based-crowd |
Repo | |
Framework | |
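The abstract above combines a spatial attention module (SAM) on low-level features with channel-wise global attention modules (GAM) on higher-level features. The sketch below shows minimal versions of both ideas; the layer sizes and the squeeze-excitation-style global attention are assumptions, not HA-CCN's exact configuration.

```python
# Minimal spatial and channel-wise (global) attention modules.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        mask = torch.sigmoid(self.conv(x))      # (B, 1, H, W) spatial attention map
        return x * mask

class GlobalAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        squeezed = x.mean(dim=(2, 3))           # global average pooling -> (B, C)
        weights = self.fc(squeezed).unsqueeze(-1).unsqueeze(-1)
        return x * weights                      # channel-wise reweighting
```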
Learn to Compress CSI and Allocate Resources in Vehicular Networks
Title | Learn to Compress CSI and Allocate Resources in Vehicular Networks |
Authors | Liang Wang, Hao Ye, Le Liang, Geoffrey Ye Li |
Abstract | Resource allocation has a direct and profound impact on the performance of vehicle-to-everything (V2X) networks. In this paper, we develop a hybrid architecture consisting of centralized decision making and distributed resource sharing (the C-Decision scheme) to maximize the long-term sum rate of all vehicles. To reduce the network signaling overhead, each vehicle uses a deep neural network to compress its observed information that is thereafter fed back to the centralized decision making unit. The centralized decision unit employs a deep Q-network to allocate resources and then sends the decision results to all vehicles. We further adopt a quantization layer for each vehicle that learns to quantize the continuous feedback. In addition, we devise a mechanism to balance the transmission of vehicle-to-vehicle (V2V) links and vehicle-to-infrastructure (V2I) links. To further facilitate distributed spectrum sharing, we also propose a distributed decision making and spectrum sharing architecture (the D-Decision scheme) for each V2V link. Through extensive simulation results, we demonstrate that the proposed C-Decision and D-Decision schemes can both achieve near-optimal performance and are robust to feedback interval variations, input noise, and feedback noise. |
Tasks | Decision Making, Quantization |
Published | 2019-08-12 |
URL | https://arxiv.org/abs/1908.04685v1 |
https://arxiv.org/pdf/1908.04685v1.pdf | |
PWC | https://paperswithcode.com/paper/learn-to-compress-csi-and-allocate-resources |
Repo | |
Framework | |
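A central piece of the scheme above is that each vehicle compresses its observation with a deep network and quantizes the continuous feedback. The sketch below shows one common way to train such a quantization layer, using a straight-through estimator so gradients flow through the rounding; the network sizes and the 4-level quantizer are assumptions, not the paper's settings.

```python
# Per-vehicle observation compression with a learned quantization layer.
import torch
import torch.nn as nn

class QuantizedEncoder(nn.Module):
    def __init__(self, obs_dim, feedback_dim, levels=4):
        super().__init__()
        self.levels = levels
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, feedback_dim), nn.Sigmoid())   # outputs in [0, 1]

    def forward(self, obs):
        z = self.net(obs)
        # Quantize to `levels` discrete values; straight-through gradient through rounding.
        zq = torch.round(z * (self.levels - 1)) / (self.levels - 1)
        return z + (zq - z).detach()

enc = QuantizedEncoder(obs_dim=32, feedback_dim=8)
feedback = enc(torch.randn(5, 32))   # discrete-valued feedback sent to the central unit
```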
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Title | Multimodal Transformer with Multi-View Visual Representation for Image Captioning |
Authors | Jun Yu, Jing Li, Zhou Yu, Qingming Huang |
Abstract | Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranked first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing. |
Tasks | Image Captioning, Machine Translation |
Published | 2019-05-20 |
URL | https://arxiv.org/abs/1905.07841v1 |
https://arxiv.org/pdf/1905.07841v1.pdf | |
PWC | https://paperswithcode.com/paper/multimodal-transformer-with-multi-view-visual |
Repo | |
Framework | |
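To make the "unified attention block" above concrete, the sketch below runs multi-head attention over the concatenation of word tokens and image-region tokens, so self-attention (intra-modal) and co-attention (inter-modal) happen in a single pass. This is an editorial approximation of the idea, not the paper's exact block; the dimensions, `batch_first=True` (PyTorch >= 1.9), and the feed-forward sizes are assumptions.

```python
# Unified attention over concatenated word and image-region tokens.
import torch
import torch.nn as nn

class UnifiedAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, words, regions):
        # words: (B, Lw, dim) caption tokens; regions: (B, Lr, dim) visual tokens
        tokens = torch.cat([words, regions], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)   # intra- and inter-modal in one pass
        tokens = self.norm1(tokens + attended)
        tokens = self.norm2(tokens + self.ffn(tokens))
        return tokens[:, :words.size(1)], tokens[:, words.size(1):]

block = UnifiedAttentionBlock()
w_out, r_out = block(torch.randn(2, 12, 512), torch.randn(2, 36, 512))
```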
Dialogue Act Classification in Group Chats with DAG-LSTMs
Title | Dialogue Act Classification in Group Chats with DAG-LSTMs |
Authors | Ozan İrsoy, Rakesh Gosangi, Haimin Zhang, Mu-Hsin Wei, Peter Lund, Duccio Pappadopulo, Brendan Fahy, Neophytos Nephytou, Camilo Ortiz |
Abstract | Dialogue act (DA) classification has been studied for the past two decades and has several key applications such as workflow automation and conversation analytics. To address this problem, researchers have used various traditional machine learning models and, more recently, deep neural network models such as hierarchical convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. In this paper, we introduce a new model architecture, the directed-acyclic-graph LSTM (DAG-LSTM), for DA classification. A DAG-LSTM exploits the turn-taking structure naturally present in a multi-party conversation and encodes this relation in its model structure. Using the STAC corpus, we show that the proposed method performs roughly 0.8% better in accuracy and 1.2% better in macro-F1 score than existing methods. The proposed method is generic and not limited to conversation applications. |
Tasks | Dialogue Act Classification |
Published | 2019-08-02 |
URL | https://arxiv.org/abs/1908.01821v1 |
https://arxiv.org/pdf/1908.01821v1.pdf | |
PWC | https://paperswithcode.com/paper/dialogue-act-classification-in-group-chats |
Repo | |
Framework | |
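A DAG-LSTM propagates state along the conversation graph rather than a simple chain. The sketch below illustrates the idea by aggregating the hidden and cell states of each utterance's predecessors before applying a standard `nn.LSTMCell`; summation is one simple aggregation choice assumed here, and the exact gating of the paper's cell is not reproduced.

```python
# Toy DAG-structured recurrence over utterances in a conversation.
import torch
import torch.nn as nn

class SimpleDAGLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def forward(self, features, predecessors):
        # features: (T, input_size); predecessors[t] is a list of earlier indices t' < t
        h, c = [None] * len(features), [None] * len(features)
        for t, x in enumerate(features):
            if predecessors[t]:
                h_prev = torch.stack([h[p] for p in predecessors[t]]).sum(0)
                c_prev = torch.stack([c[p] for p in predecessors[t]]).sum(0)
            else:
                h_prev = x.new_zeros(1, self.hidden_size)
                c_prev = x.new_zeros(1, self.hidden_size)
            h[t], c[t] = self.cell(x.unsqueeze(0), (h_prev, c_prev))
        return torch.cat(h, dim=0)   # per-utterance states for DA classification

model = SimpleDAGLSTM(input_size=16, hidden_size=32)
states = model(torch.randn(4, 16), [[], [0], [0, 1], [2]])
```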
Issues with post-hoc counterfactual explanations: a discussion
Title | Issues with post-hoc counterfactual explanations: a discussion |
Authors | Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Marcin Detyniecki |
Abstract | Counterfactual post-hoc interpretability approaches have proven to be useful tools for generating explanations of the predictions of a trained black-box classifier. However, the assumptions they make about the data and the classifier make them unreliable in many contexts. In this paper, we discuss three desirable properties and approaches to quantify them: proximity, connectedness and stability. In addition, we illustrate that post-hoc counterfactual approaches risk failing to satisfy these properties. |
Tasks | |
Published | 2019-06-11 |
URL | https://arxiv.org/abs/1906.04774v1 |
https://arxiv.org/pdf/1906.04774v1.pdf | |
PWC | https://paperswithcode.com/paper/issues-with-post-hoc-counterfactual |
Repo | |
Framework | |
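The paper above argues for quantifying proximity, connectedness and stability of counterfactual explanations. The small numpy sketch below shows illustrative (editorial) choices for two of them: an L2 proximity score and a naive stability ratio comparing the spread of counterfactuals to the spread of the instances they explain.

```python
# Illustrative proximity and stability measures for counterfactual explanations.
import numpy as np

def proximity(x, x_cf):
    """L2 distance between an instance and its counterfactual (lower is better)."""
    return np.linalg.norm(x - x_cf)

def stability(xs, counterfactuals):
    """Compare counterfactual spread to instance spread for a set of similar instances."""
    spread_x = np.mean([np.linalg.norm(a - b) for a in xs for b in xs])
    spread_cf = np.mean([np.linalg.norm(a - b) for a in counterfactuals for b in counterfactuals])
    return spread_cf / (spread_x + 1e-12)   # large ratios suggest unstable explanations

x, x_cf = np.array([0.2, 0.7]), np.array([0.5, 0.4])
print(proximity(x, x_cf))
print(stability([x, x + 0.01], [x_cf, x_cf + 0.3]))
```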
Fast and Efficient Model for Real-Time Tiger Detection In The Wild
Title | Fast and Efficient Model for Real-Time Tiger Detection In The Wild |
Authors | Orest Kupyn, Dmitry Pranchuk |
Abstract | The highest-accuracy object detectors to date are based either on a two-stage approach such as Fast R-CNN or on one-stage detectors such as RetinaNet or SSD with deep and complex backbones. In this paper we present TigerNet, a simple yet efficient FPN-based network architecture for Amur tiger detection in the wild. The model has 600k parameters, requires 0.071 GFLOPs per image and can run on edge devices (smart cameras) in near real time. In addition, we introduce a two-stage semi-supervised learning approach via pseudo-labelling to distill knowledge from larger networks. For the ATRW-ICCV 2019 tiger detection sub-challenge, based on the public leaderboard score, our approach shows superior performance in comparison to other methods. |
Tasks | |
Published | 2019-09-03 |
URL | https://arxiv.org/abs/1909.01122v1 |
https://arxiv.org/pdf/1909.01122v1.pdf | |
PWC | https://paperswithcode.com/paper/fast-and-efficient-model-for-real-time-tiger |
Repo | |
Framework | |
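TigerNet is described as an FPN-based architecture kept small enough for edge devices. The sketch below shows the core FPN mechanics only: 1x1 lateral connections plus a top-down pathway with upsampling; the channel counts are assumptions and the paper's 600k-parameter configuration is not reproduced.

```python
# Core feature pyramid network mechanics: lateral connections + top-down pathway.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                    # feats: list of maps, coarsest last
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):   # top-down: add upsampled coarser map
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]

fpn = TinyFPN()
outs = fpn([torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20)])
```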
Adaptive Online Learning for Gradient-Based Optimizers
Title | Adaptive Online Learning for Gradient-Based Optimizers |
Authors | Saeed Masoudian, Ali Arabzadeh, Mahdi Jafari Siavoshani, Milad Jalal, Alireza Amouzad |
Abstract | As application demands for online convex optimization accelerate, the need for new methods that simultaneously cover a large class of convex functions and incur the lowest possible regret is growing rapidly. Known online optimization methods usually perform well only in specific settings, and their performance depends heavily on the geometry of the decision space and the cost functions. In practice, however, the lack of such geometric information makes it difficult to choose the appropriate algorithm. To address this issue, some adaptive methods have been proposed that focus on adaptively learning parameters such as the step size, Lipschitz constant, and strong-convexity coefficient, or on specific parametric families such as quadratic regularizers. In this work, we generalize these methods and propose a framework that competes with the best algorithm in a family of expert algorithms. Our framework includes many of the well-known adaptive methods, including MetaGrad, MetaGrad+C, and Ader. We also introduce a second algorithm that computationally outperforms our first algorithm at the cost of at most a constant-factor increase in regret. Finally, as a representative application of our proposed algorithm, we study the problem of learning the best regularizer from a family of regularizers for Online Mirror Descent. Empirically, we support our theoretical findings on the problem of learning the best regularizer on the simplex and the $l_2$-ball in a multiclass learning problem. |
Tasks | |
Published | 2019-06-01 |
URL | https://arxiv.org/abs/1906.00290v1 |
https://arxiv.org/pdf/1906.00290v1.pdf | |
PWC | https://paperswithcode.com/paper/190600290 |
Repo | |
Framework | |
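The framework above competes with the best algorithm in a family of expert algorithms. The numpy sketch below shows the generic mechanism in its simplest form: run the experts in parallel and combine them with exponentially weighted averaging, downweighting experts that incur high loss. The learning rate and squared loss are illustrative assumptions, and the paper's MetaGrad-style surrogate losses are not reproduced.

```python
# Exponentially weighted combination of expert algorithms (Hedge-style meta-learner).
import numpy as np

def hedge_over_experts(expert_predictions, losses, eta=0.5):
    """expert_predictions: (T, K) predictions of K experts over T rounds.
    losses: (T, K) loss incurred by each expert at each round.
    Returns the meta-algorithm's combined prediction at each round."""
    T, K = expert_predictions.shape
    weights = np.ones(K) / K
    combined = np.zeros(T)
    for t in range(T):
        combined[t] = weights @ expert_predictions[t]   # weighted expert prediction
        weights *= np.exp(-eta * losses[t])              # downweight poorly performing experts
        weights /= weights.sum()
    return combined

preds = np.random.rand(100, 3)
losses = (preds - 0.5) ** 2                              # toy per-expert losses
print(hedge_over_experts(preds, losses)[:5])
```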
Distributed Function Minimization in Apache Spark
Title | Distributed Function Minimization in Apache Spark |
Authors | Andrea Schioppa |
Abstract | We report on an open-source implementation for distributed function minimization on top of Apache Spark using gradient and quasi-Newton methods. We showcase it with an application to Optimal Transport and some scalability tests on classification and regression problems. |
Tasks | |
Published | 2019-09-17 |
URL | https://arxiv.org/abs/1909.07922v1 |
https://arxiv.org/pdf/1909.07922v1.pdf | |
PWC | https://paperswithcode.com/paper/distributed-function-minimization-in-apache |
Repo | |
Framework | |
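The paper above implements distributed minimization on Spark with gradient and quasi-Newton methods. The PySpark sketch below shows the basic pattern for the gradient case: partial gradients are computed per partition with `map` and summed with `reduce`, and the driver takes the descent step. The least-squares objective, toy data and step size are editorial assumptions.

```python
# Distributed gradient descent on a toy least-squares problem with PySpark.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="distributed-gradient-sketch")
rng = np.random.default_rng(0)
points = [(rng.standard_normal(3), float(rng.standard_normal())) for _ in range(10000)]
data = sc.parallelize(points, numSlices=8).cache()
n = data.count()

w = np.zeros(3)
for _ in range(50):
    # Each partition computes partial gradients; reduce sums them across the cluster.
    grad = data.map(lambda p: (p[0] @ w - p[1]) * p[0]).reduce(lambda a, b: a + b)
    w -= 0.1 * grad / n                        # descent step taken on the driver

print(w)
sc.stop()
```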
Interpreting Adversarial Examples by Activation Promotion and Suppression
Title | Interpreting Adversarial Examples by Activation Promotion and Suppression |
Authors | Kaidi Xu, Sijia Liu, Gaoyuan Zhang, Mengshu Sun, Pu Zhao, Quanfu Fan, Chuang Gan, Xue Lin |
Abstract | It is widely known that convolutional neural networks (CNNs) are vulnerable to adversarial examples: images with imperceptible perturbations crafted to fool classifiers. However, interpretability of these perturbations is less explored in the literature. This work aims to better understand the roles of adversarial perturbations and provide visual explanations from pixel, image and network perspectives. We show that adversaries have a promotion-suppression effect (PSE) on neurons’ activations and can be primarily categorized into three types: i) suppression-dominated perturbations that mainly reduce the classification score of the true label, ii) promotion-dominated perturbations that focus on boosting the confidence of the target label, and iii) balanced perturbations that play a dual role in suppression and promotion. We also provide image-level interpretability of adversarial examples. This links PSE of pixel-level perturbations to class-specific discriminative image regions localized by class activation mapping (Zhou et al. 2016). Further, we examine the adversarial effect through network dissection (Bau et al. 2017), which offers concept-level interpretability of hidden units. We show that there exists a tight connection between the units’ sensitivity to adversarial attacks and their interpretability on semantic concepts. Lastly, we provide some new insights from our interpretation to improve the adversarial robustness of networks. |
Tasks | |
Published | 2019-04-03 |
URL | https://arxiv.org/abs/1904.02057v2 |
https://arxiv.org/pdf/1904.02057v2.pdf | |
PWC | https://paperswithcode.com/paper/interpreting-adversarial-examples-by |
Repo | |
Framework | |
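One way to observe the promotion-suppression effect described above is to capture a layer's activations on a clean input and its adversarial counterpart and compare them per unit. The sketch below uses a forward hook on a torchvision ResNet-18 for this; the model, the layer choice, and the random stand-in images (a real attack such as PGD would be used in practice) are assumptions.

```python
# Compare per-channel activations on clean vs. perturbed inputs via a forward hook.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
captured = {}

def hook(module, inputs, output):
    captured["act"] = output.detach()

handle = model.layer4.register_forward_hook(hook)

clean = torch.randn(1, 3, 224, 224)                       # stand-in for a real image
adversarial = clean + 0.01 * torch.sign(torch.randn_like(clean))   # stand-in perturbation

with torch.no_grad():
    model(clean); act_clean = captured["act"]
    model(adversarial); act_adv = captured["act"]

delta = (act_adv - act_clean).mean(dim=(0, 2, 3))          # per-channel activation change
print("promoted units:", (delta > 0).sum().item(),
      "suppressed units:", (delta < 0).sum().item())
handle.remove()
```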
What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes
Title | What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes |
Authors | Alexey Romanov, Maria De-Arteaga, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, Anna Rumshisky, Adam Tauman Kalai |
Abstract | There is a growing body of work that proposes methods for mitigating bias in machine learning systems. These methods typically rely on access to protected attributes such as race, gender, or age. However, this raises two significant challenges: (1) protected attributes may not be available or it may not be legal to use them, and (2) it is often desirable to simultaneously consider multiple protected attributes, as well as their intersections. In the context of mitigating bias in occupation classification, we propose a method for discouraging correlation between the predicted probability of an individual’s true occupation and a word embedding of their name. This method leverages the societal biases that are encoded in word embeddings, eliminating the need for access to protected attributes. Crucially, it only requires access to individuals’ names at training time and not at deployment time. We evaluate two variations of our proposed method using a large-scale dataset of online biographies. We find that both variations simultaneously reduce race and gender biases, with almost no reduction in the classifier’s overall true positive rate. |
Tasks | Word Embeddings |
Published | 2019-04-10 |
URL | http://arxiv.org/abs/1904.05233v1 |
http://arxiv.org/pdf/1904.05233v1.pdf | |
PWC | https://paperswithcode.com/paper/whats-in-a-name-reducing-bias-in-bios-without |
Repo | |
Framework | |
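The method above discourages correlation between the predicted probability of the true occupation and a word embedding of the person's name. The sketch below shows one plausible reading of that objective as a covariance penalty added to the cross-entropy loss; the penalty weight, the embedding dimensionality, and the exact form of the penalty are assumptions rather than the paper's published formulation.

```python
# Classification loss plus a covariance penalty against name embeddings.
import torch
import torch.nn.functional as F

def debiased_loss(logits, labels, name_embeddings, lam=1.0):
    # logits: (B, C), labels: (B,), name_embeddings: (B, D)
    ce = F.cross_entropy(logits, labels)
    p_true = F.softmax(logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)  # (B,)
    p_centered = p_true - p_true.mean()
    e_centered = name_embeddings - name_embeddings.mean(dim=0, keepdim=True)
    cov = (p_centered.unsqueeze(1) * e_centered).mean(dim=0)    # (D,) covariance vector
    return ce + lam * cov.norm()

loss = debiased_loss(torch.randn(16, 10), torch.randint(0, 10, (16,)), torch.randn(16, 50))
```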
dpVAEs: Fixing Sample Generation for Regularized VAEs
Title | dpVAEs: Fixing Sample Generation for Regularized VAEs |
Authors | Riddhish Bhalodia, Iain Lee, Shireen Elhabian |
Abstract | Unsupervised representation learning via generative modeling is a staple to many computer vision applications in the absence of labeled data. Variational Autoencoders (VAEs) are powerful generative models that learn representations useful for data generation. However, due to inherent challenges in the training objective, VAEs fail to learn useful representations amenable for downstream tasks. Regularization-based methods that attempt to improve the representation learning aspect of VAEs come at a price: poor sample generation. In this paper, we explore this representation-generation trade-off for regularized VAEs and introduce a new family of priors, namely decoupled priors, or dpVAEs, that decouple the representation space from the generation space. This decoupling enables the use of VAE regularizers on the representation space without impacting the distribution used for sample generation, and thereby reaping the representation learning benefits of the regularizations without sacrificing the sample generation. dpVAE leverages invertible networks to learn a bijective mapping from an arbitrarily complex representation distribution to a simple, tractable, generative distribution. Decoupled priors can be adapted to the state-of-the-art VAE regularizers without additional hyperparameter tuning. We showcase the use of dpVAEs with different regularizers. Experiments on MNIST, SVHN, and CelebA demonstrate, quantitatively and qualitatively, that dpVAE fixes sample generation for regularized VAEs. |
Tasks | Representation Learning, Unsupervised Representation Learning |
Published | 2019-11-24 |
URL | https://arxiv.org/abs/1911.10506v2 |
https://arxiv.org/pdf/1911.10506v2.pdf | |
PWC | https://paperswithcode.com/paper/dpvaes-fixing-sample-generation-for |
Repo | |
Framework | |
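The decoupled prior in dpVAEs uses an invertible network to connect an expressive representation space to a simple generation space. The sketch below shows a single affine coupling layer as a stand-in for that invertible map, with a forward direction (and its log-determinant term) and an inverse used for sampling; it is an illustration of the mechanism, not the paper's architecture.

```python
# One affine coupling layer: an invertible map between representation and generation spaces.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, z):                      # representation -> generation space
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, z2 * torch.exp(log_s) + t], dim=1), log_s.sum(dim=1)

    def inverse(self, u):                      # sample in generation space, map back
        u1, u2 = u[:, :self.half], u[:, self.half:]
        log_s, t = self.net(u1).chunk(2, dim=1)
        return torch.cat([u1, (u2 - t) * torch.exp(-log_s)], dim=1)

flow = AffineCoupling(dim=8)
z_gen, log_det = flow(torch.randn(4, 8))       # log_det contributes to the prior density
z_rep = flow.inverse(torch.randn(4, 8))        # generate by sampling a standard normal
```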
Towards Intelligent Interactive Theatre: Drama Management as a way of Handling Performance
Title | Towards Intelligent Interactive Theatre: Drama Management as a way of Handling Performance |
Authors | Nic Velissaris, Jessica Rivera-Villicana |
Abstract | In this paper, we present a new modality for intelligent interactive narratives within the theatre domain. We discuss the possibilities of using an intelligent agent that serves as a drama manager and as an actor that plays a character within the live theatre experience. We pose a set of research challenges that arise from our analysis towards the implementation of such an agent, as well as potential methodologies as a starting point to bridge the gaps between current literature and the proposed modality. |
Tasks | |
Published | 2019-09-23 |
URL | https://arxiv.org/abs/1909.10371v1 |
https://arxiv.org/pdf/1909.10371v1.pdf | |
PWC | https://paperswithcode.com/paper/190910371 |
Repo | |
Framework | |
Learning Body Shape and Pose from Dense Correspondences
Title | Learning Body Shape and Pose from Dense Correspondences |
Authors | Yusuke Yoshiyasu, Lucas Gamez |
Abstract | In this paper, we address the problem of learning 3D human pose and body shape from a 2D image dataset, without having to use a 3D dataset of body shape and pose. The idea is to use dense correspondences between image points and a body surface, which can be annotated on in-the-wild 2D images, and to extract and aggregate 3D information from them. To do so, we propose a training strategy called "deform-and-learn", where we alternate between deformable surface registration and training of deep convolutional neural networks (ConvNets). Unlike previous approaches, our method does not require 3D pose annotations from a motion capture (MoCap) system or human intervention to validate 3D pose annotations. |
Tasks | Motion Capture |
Published | 2019-07-27 |
URL | https://arxiv.org/abs/1907.11955v1 |
https://arxiv.org/pdf/1907.11955v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-body-shape-and-pose-from-dense |
Repo | |
Framework | |
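The "deform-and-learn" strategy above alternates between deformable surface registration and ConvNet training. The heavily simplified sketch below keeps only that alternation: step (1) fits pseudo 3D targets to annotated 2D correspondences under a toy orthographic reprojection error, and step (2) trains a small stand-in network on those targets. Every component here (the MLP, the registration objective, the data shapes) is a toy assumption for illustration.

```python
# Toy alternation between "registration" (fit 3D targets to 2D correspondences) and learning.
import torch
import torch.nn as nn

def register(pseudo3d, corr2d, steps=100, lr=0.05):
    """Step 1: deform pseudo 3D points so their orthographic projection matches the
    annotated 2D correspondences (toy reprojection error)."""
    pts = pseudo3d.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([pts], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((pts[:, :, :2] - corr2d) ** 2).mean()
        loss.backward()
        opt.step()
    return pts.detach()

features = torch.randn(16, 128)                 # stand-in image features
corr2d = torch.randn(16, 10, 2)                 # annotated 2D correspondences
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 30))  # predicts 10 3D points
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for outer in range(3):                          # alternate the two steps
    pseudo3d = net(features).view(16, 10, 3)
    targets = register(pseudo3d, corr2d)        # (1) registration provides 3D targets
    for _ in range(50):                         # (2) train the network on those targets
        opt.zero_grad()
        loss = ((net(features).view(16, 10, 3) - targets) ** 2).mean()
        loss.backward()
        opt.step()
```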