January 31, 2020

3038 words 15 mins read

Paper Group AWR 381

Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network. Temporal Reasoning via Audio Question Answering. DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition. Mimetics: Towards Understanding Human Actions Out of Context. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Affect-bas …

Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network

Title Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network
Authors Tao Hu, Lichao Huang, Xianming Liu, Han Shen
Abstract More powerful feature representations derived from deep neural networks benefit visual tracking algorithms widely. However, the lack of exploitation of temporal information prevents tracking algorithms from adapting to appearance changes or resisting drift. This paper proposes a correlation filter based tracking method which aggregates historical features in a spatial-aligned and scale-aware paradigm. The features of historical frames are sampled and aggregated into the search frame according to a pixel-level alignment module based on deformable convolutions. In addition, we also use a feature pyramid structure to handle motion estimation at different scales, and address the different demands on feature granularity between tracking losses and deformation offset learning. With this design, the tracker, named Spatial-Aware Temporal Aggregation Network (SATA), is able to assemble appearances and motion contexts of various scales over a time period, resulting in better performance than relying on a single static image. Our tracker achieves leading performance on OTB2013, OTB2015, VOT2015, VOT2016 and LaSOT, and operates at a real-time speed of 26 FPS, which indicates our method is effective and practical. Our code will be made publicly available at https://github.com/ecart18/SATA.
Tasks Motion Estimation, Real-Time Visual Tracking, Visual Tracking
Published 2019-08-02
URL https://arxiv.org/abs/1908.00692v1
PDF https://arxiv.org/pdf/1908.00692v1.pdf
PWC https://paperswithcode.com/paper/real-time-visual-tracking-using-spatial-aware
Repo https://github.com/ecart18/SATA
Framework pytorch
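
The core mechanism in the abstract above is warping features of past frames onto the current search frame with a deformable convolution whose offsets are predicted from both frames. The sketch below illustrates that idea with torchvision's DeformConv2d; the layer sizes, the offset predictor, and the simple averaging aggregation are assumptions for illustration, not the authors' exact SATA architecture.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TemporalAlignAggregate(nn.Module):
    """Align each past-frame feature map to the current (search) frame with a
    deformable conv whose offsets are predicted from both frames, then average
    the aligned features.  Channel count and kernel size are illustrative."""
    def __init__(self, channels=256):
        super().__init__()
        # 3x3 kernel -> 2 coordinates per sampling location = 18 offset channels
        self.offset_pred = nn.Conv2d(2 * channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.align = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, current_feat, past_feats):
        aligned = [current_feat]
        for past in past_feats:
            offsets = self.offset_pred(torch.cat([current_feat, past], dim=1))
            aligned.append(self.align(past, offsets))
        return torch.stack(aligned, dim=0).mean(dim=0)

feat_t = torch.randn(1, 256, 31, 31)                      # search-frame features
history = [torch.randn(1, 256, 31, 31) for _ in range(3)]  # past-frame features
print(TemporalAlignAggregate()(feat_t, history).shape)     # torch.Size([1, 256, 31, 31])
```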

Temporal Reasoning via Audio Question Answering

Title Temporal Reasoning via Audio Question Answering
Authors Haytham M. Fayek, Justin Johnson
Abstract Multimodal question answering tasks can be used as proxy tasks to study systems that can perceive and reason about the world. Answering questions about different types of input modalities stresses different aspects of reasoning such as visual reasoning, reading comprehension, story understanding, or navigation. In this paper, we use the task of Audio Question Answering (AQA) to study the temporal reasoning abilities of machine learning models. To this end, we introduce the Diagnostic Audio Question Answering (DAQA) dataset comprising audio sequences of natural sound events and programmatically generated questions and answers that probe various aspects of temporal reasoning. We adapt several recent state-of-the-art methods for visual question answering to the AQA task, and use DAQA to demonstrate that they perform poorly on questions that require in-depth temporal reasoning. Finally, we propose a new model, Multiple Auxiliary Controllers for Linear Modulation (MALiMo), that extends the recent Feature-wise Linear Modulation (FiLM) model and significantly improves its temporal reasoning capabilities. We envisage DAQA fostering research on AQA and temporal reasoning, and MALiMo as a step towards models for AQA.
Tasks Question Answering, Reading Comprehension, Visual Question Answering, Visual Reasoning
Published 2019-11-21
URL https://arxiv.org/abs/1911.09655v1
PDF https://arxiv.org/pdf/1911.09655v1.pdf
PWC https://paperswithcode.com/paper/temporal-reasoning-via-audio-question
Repo https://github.com/facebookresearch/daqa
Framework none
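
MALiMo builds on FiLM, in which a controller network maps the question encoding to per-channel scale and shift parameters that modulate convolutional features. Below is a minimal FiLM-style block as a point of reference; the layer sizes are placeholders, and MALiMo's additional auxiliary controllers (conditioned on the audio itself) are not shown.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """One FiLM-modulated residual block: a controller maps the conditioning
    vector (e.g. a question embedding) to per-channel (gamma, beta) that scale
    and shift the convolutional features.  Sizes are illustrative."""
    def __init__(self, channels=128, cond_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.controller = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        gamma, beta = self.controller(cond).chunk(2, dim=-1)
        h = self.bn(self.conv(x))
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return x + torch.relu(h)

audio_feats = torch.randn(4, 128, 16, 64)   # e.g. conv features of a spectrogram
question_emb = torch.randn(4, 256)           # e.g. an RNN encoding of the question
print(FiLMBlock()(audio_feats, question_emb).shape)  # torch.Size([4, 128, 16, 64])
```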

DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition

Title DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition
Authors Nuno C. Garcia, Sarah Adel Bargal, Vitaly Ablavsky, Pietro Morerio, Vittorio Murino, Stan Sclaroff
Abstract In this work, we address the problem of learning an ensemble of specialist networks using multimodal data, while considering the realistic and challenging scenario of possible missing modalities at test time. Our goal is to leverage the complementary information of multiple modalities to the benefit of the ensemble and each individual network. We introduce a novel Distillation Multiple Choice Learning framework for multimodal data, where different modality networks learn in a cooperative setting from scratch, strengthening one another. The modality networks learned using our method achieve significantly higher accuracy than if trained separately, due to the guidance of other modalities. We evaluate this approach on three video action recognition benchmark datasets. We obtain state-of-the-art results in comparison to other approaches that work with missing modalities at test time.
Tasks Temporal Action Localization
Published 2019-12-23
URL https://arxiv.org/abs/1912.10982v1
PDF https://arxiv.org/pdf/1912.10982v1.pdf
PWC https://paperswithcode.com/paper/dmcl-distillation-multiple-choice-learning
Repo https://github.com/ncgarcia/DMCL
Framework tf
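
One plausible reading of combining multiple choice learning with distillation, as described above, is: for each training example, the modality network with the lowest loss acts as the specialist, and the other networks are additionally pulled toward its softened predictions. The loss below is a hedged sketch of that idea, not necessarily the authors' exact formulation; the temperature and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def dmcl_loss(logits_per_modality, labels, temperature=2.0, distill_weight=1.0):
    """Per example, the modality with the lowest cross-entropy is the
    specialist; the others receive a KL distillation term toward its
    softened predictions in addition to their own cross-entropy."""
    ce = torch.stack([F.cross_entropy(l, labels, reduction="none")
                      for l in logits_per_modality])            # (M, B)
    best = ce.argmin(dim=0)                                      # specialist per example
    logits = torch.stack(logits_per_modality)                    # (M, B, C)
    teacher = logits[best, torch.arange(labels.size(0))].detach()  # (B, C)

    loss = ce.mean()
    soft_t = F.softmax(teacher / temperature, dim=-1)
    for m, student in enumerate(logits_per_modality):
        mask = best != m
        if not mask.any():
            continue
        log_p = F.log_softmax(student / temperature, dim=-1)
        kl = F.kl_div(log_p, soft_t, reduction="none").sum(-1)
        loss = loss + distill_weight * kl[mask].mean()
    return loss

rgb, depth, flow = (torch.randn(8, 10, requires_grad=True) for _ in range(3))
labels = torch.randint(0, 10, (8,))
print(dmcl_loss([rgb, depth, flow], labels))
```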

Mimetics: Towards Understanding Human Actions Out of Context

Title Mimetics: Towards Understanding Human Actions Out of Context
Authors Philippe Weinzaepfel, Grégory Rogez
Abstract Recent methods for video action recognition have reached outstanding performances on existing benchmarks. However, they tend to leverage context such as scenes or objects instead of focusing on understanding the human action itself. For instance, a tennis court leads to the prediction "playing tennis" irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best example of out-of-context actions is mimes, which people can typically recognize despite the absence of relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in the absence of context. We therefore introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting the lack of true understanding of the human actions. Body language, captured by human pose and motion, is a meaningful cue to recognize out-of-context actions. We thus evaluate several pose-based baselines, either based on explicit 2D or 3D pose estimates, or on transferring pose features to the action recognition problem. This last method, less prone to inherent pose estimation noise, performs better than the other pose-based baselines, suggesting that an explicit pose representation might not be optimal for real-world action recognition.
Tasks Pose Estimation, Temporal Action Localization
Published 2019-12-16
URL https://arxiv.org/abs/1912.07249v1
PDF https://arxiv.org/pdf/1912.07249v1.pdf
PWC https://paperswithcode.com/paper/mimetics-towards-understanding-human-actions
Repo https://github.com/vt-vl-lab/reading_group
Framework none

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Title EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Authors Mingxing Tan, Quoc V. Le
Abstract Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
Tasks Fine-Grained Image Classification, Image Classification, Neural Architecture Search, Transfer Learning
Published 2019-05-28
URL https://arxiv.org/abs/1905.11946v3
PDF https://arxiv.org/pdf/1905.11946v3.pdf
PWC https://paperswithcode.com/paper/efficientnet-rethinking-model-scaling-for
Repo https://github.com/hsandmann/espm.ml.2019.1
Framework none
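
The "compound coefficient" in the abstract scales depth, width and input resolution jointly from a single exponent phi. The sketch below illustrates that rule; the base multipliers are the ones reported in the paper (alpha=1.2, beta=1.1, gamma=1.15, chosen so that alpha * beta^2 * gamma^2 is approximately 2), while the base network dimensions are made-up placeholders.

```python
import math

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # paper-reported multipliers; alpha*beta^2*gamma^2 ~= 2

def compound_scale(base_depth, base_width, base_resolution, phi):
    """Scale all three dimensions with one coefficient phi, as in EfficientNet."""
    depth = base_depth * (ALPHA ** phi)          # number of layers
    width = base_width * (BETA ** phi)           # number of channels
    resolution = base_resolution * (GAMMA ** phi)
    return math.ceil(depth), math.ceil(width), int(round(resolution))

for phi in range(4):
    print(phi, compound_scale(base_depth=18, base_width=32, base_resolution=224, phi=phi))
```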

Affect-based Intrinsic Rewards for Exploration and Learning

Title Affect-based Intrinsic Rewards for Exploration and Learning
Authors Dean Zadok, Daniel McDuff, Ashish Kapoor
Abstract Positive affect has been linked to increased interest, curiosity and satisfaction in human learning. In reinforcement learning, extrinsic rewards are often sparse and difficult to define; intrinsically motivated learning can help address these challenges. We argue that positive affect is an important intrinsic reward that effectively helps drive exploration that is useful in gathering experiences. We present a novel approach leveraging a task-independent intrinsic reward function trained on spontaneous smile behavior that captures positive affect. To evaluate our approach we trained several downstream computer vision tasks on data collected with our policy and several baseline methods. We show that the policy based on intrinsic affective rewards successfully increases the duration of episodes, the area explored and reduces collisions. The result is faster learning for several downstream computer vision tasks.
Tasks
Published 2019-12-01
URL https://arxiv.org/abs/1912.00403v6
PDF https://arxiv.org/pdf/1912.00403v6.pdf
PWC https://paperswithcode.com/paper/affect-based-intrinsic-rewards-for-learning
Repo https://github.com/microsoft/affectbased
Framework tf
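
The reward-shaping idea above reduces to adding a task-independent positive-affect score to the (often sparse) extrinsic reward. The snippet below is a minimal sketch under that reading; the affect model, its 0-to-1 output range, and the weighting factor are assumptions, not the authors' implementation.

```python
import numpy as np

def shaped_reward(extrinsic_reward, observation, affect_model, weight=0.1):
    """Combine the environment reward with an intrinsic positive-affect score
    produced by a pretrained (here: stand-in) affect model."""
    affect_score = float(affect_model(observation))   # assumed to lie in [0, 1]
    return extrinsic_reward + weight * affect_score

dummy_affect_model = lambda obs: 1.0 / (1.0 + np.exp(-obs.mean()))  # placeholder model
obs = np.random.randn(64, 64, 3)
print(shaped_reward(extrinsic_reward=0.0, observation=obs, affect_model=dummy_affect_model))
```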

Deep Learning-Based Semantic Segmentation of Microscale Objects

Title Deep Learning-Based Semantic Segmentation of Microscale Objects
Authors Ekta U. Samani, Wei Guo, Ashis G. Banerjee
Abstract Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this paper, we present a deep learning model for semantic segmentation of the images representing such environments. Our model successfully performs segmentation with a high mean Intersection Over Union score of 0.91.
Tasks Semantic Segmentation
Published 2019-07-03
URL https://arxiv.org/abs/1907.03576v1
PDF https://arxiv.org/pdf/1907.03576v1.pdf
PWC https://paperswithcode.com/paper/deep-learning-based-semantic-segmentation-of
Repo https://github.com/ektas0330/cell-segmentation
Framework tf
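
The headline number in the abstract is a mean Intersection over Union of 0.91. For reference, the snippet below computes the standard mean IoU metric over classes present in the ground truth; it is the generic metric, not the authors' evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes present in the ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        if target_c.sum() == 0:                 # class absent from ground truth
            continue
        inter = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=(256, 256))
target = np.random.randint(0, 3, size=(256, 256))
print(mean_iou(pred, target, num_classes=3))
```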

Perturbative GAN: GAN with Perturbation Layers

Title Perturbative GAN: GAN with Perturbation Layers
Authors Yuma Kishi, Tsutomu Ikegami, Shin-ichi O’uchi, Ryousei Takano, Wakana Nogami, Tomohiro Kudoh
Abstract Perturbative GAN, which replaces the convolution layers of existing convolutional GANs (DCGAN, WGAN-GP, BIGGAN, etc.) with perturbation layers that add a fixed noise mask, is proposed. Compared with the convolutional GANs, the number of parameters to be trained is smaller, the convergence of training is faster, the inception score of generated images is higher, and the overall training cost is reduced. Algorithmic generation of the noise masks is also proposed, with which the training, as well as the generation, can be boosted with hardware acceleration. Perturbative GAN is evaluated using conventional datasets (CIFAR10, LSUN, ImageNet), both in the case when a perturbation layer is adopted only for the generator and when it is introduced to both the generator and discriminator.
Tasks
Published 2019-02-05
URL http://arxiv.org/abs/1902.01514v1
PDF http://arxiv.org/pdf/1902.01514v1.pdf
PWC https://paperswithcode.com/paper/perturbative-gan-gan-with-perturbation-layers
Repo https://github.com/obake2ai/Obake-GAN
Framework pytorch
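
A perturbation layer, in the spirit of the abstract above, swaps a KxK convolution for fixed additive noise masks followed by a nonlinearity and a learned 1x1 channel mixing. The sketch below follows that general recipe; the noise scale, activation and layer sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PerturbationLayer(nn.Module):
    """Replace a KxK convolution: add fixed (non-trainable) noise masks to the
    input feature maps, apply a nonlinearity, then mix channels with a learned
    1x1 convolution.  Sizes and noise scale are illustrative."""
    def __init__(self, in_channels, out_channels, height, width, noise_scale=0.1):
        super().__init__()
        noise = noise_scale * torch.randn(1, in_channels, height, width)
        self.register_buffer("noise", noise)   # fixed mask, never updated by SGD
        self.mix = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.mix(torch.relu(x + self.noise))

layer = PerturbationLayer(in_channels=64, out_channels=128, height=16, width=16)
print(layer(torch.randn(8, 64, 16, 16)).shape)  # torch.Size([8, 128, 16, 16])
```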

Pre-training of Graph Augmented Transformers for Medication Recommendation

Title Pre-training of Graph Augmented Transformers for Medication Recommendation
Authors Junyuan Shang, Tengfei Ma, Cao Xiao, Jimeng Sun
Abstract Medication recommendation is an important healthcare application. It is commonly formulated as a temporal prediction task. Hence, most existing works only utilize longitudinal electronic health records (EHRs) from a small number of patients with multiple visits, ignoring a large number of patients with a single visit (selection bias). Moreover, important hierarchical knowledge such as the diagnosis hierarchy is not leveraged in the representation learning process. To address these challenges, we propose G-BERT, a new model to combine the power of Graph Neural Networks (GNNs) and BERT (Bidirectional Encoder Representations from Transformers) for medical code representation and medication recommendation. We use GNNs to represent the internal hierarchical structures of medical codes. Then we integrate the GNN representation into a transformer-based visit encoder and pre-train it on EHR data from patients with only a single visit. The pre-trained visit encoder and representation are then fine-tuned for downstream predictive tasks on longitudinal EHRs from patients with multiple visits. G-BERT is the first to bring the language model pre-training schema into the healthcare domain, and it achieves state-of-the-art performance on the medication recommendation task.
Tasks Language Modelling, Representation Learning
Published 2019-06-02
URL https://arxiv.org/abs/1906.00346v2
PDF https://arxiv.org/pdf/1906.00346v2.pdf
PWC https://paperswithcode.com/paper/190600346
Repo https://github.com/jshang123/G-Bert
Framework pytorch
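
The architectural idea above is: embed medical codes with a GNN over the code ontology, then encode the set of codes in a visit with a transformer. The toy sketch below shows that pipeline with a single hand-rolled graph aggregation step and a stock transformer encoder; the adjacency matrix, dimensions and pooling are made-up placeholders and do not reproduce G-BERT's actual layers or pre-training objectives.

```python
import torch
import torch.nn as nn

class CodeGraphEmbedding(nn.Module):
    """Toy ontology-aware code embedding: each code's embedding is averaged
    with those of its neighbours (e.g. ancestors in the diagnosis hierarchy)."""
    def __init__(self, num_codes, dim, adjacency):
        super().__init__()
        self.emb = nn.Embedding(num_codes, dim)
        adj = adjacency + torch.eye(num_codes)            # add self-loops
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))
        self.proj = nn.Linear(dim, dim)

    def forward(self, code_ids):
        all_codes = self.proj(self.adj @ self.emb.weight)  # one aggregation step
        return all_codes[code_ids]                          # (batch, codes_per_visit, dim)

num_codes, dim = 100, 64
adjacency = (torch.rand(num_codes, num_codes) < 0.05).float()  # placeholder ontology
code_embed = CodeGraphEmbedding(num_codes, dim, adjacency)
visit_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)

codes_in_visit = torch.randint(0, num_codes, (8, 10))   # 8 visits, 10 codes each
visit_repr = visit_encoder(code_embed(codes_in_visit)).mean(dim=1)
print(visit_repr.shape)  # torch.Size([8, 64])
```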

Policy Consolidation for Continual Reinforcement Learning

Title Policy Consolidation for Continual Reinforcement Learning
Authors Christos Kaplanis, Murray Shanahan, Claudia Clopath
Abstract We propose a method for tackling catastrophic forgetting in deep reinforcement learning that is agnostic to the timescale of changes in the distribution of experiences, does not require knowledge of task boundaries, and can adapt in continuously changing environments. In our policy consolidation model, the policy network interacts with a cascade of hidden networks that simultaneously remember the agent’s policy at a range of timescales and regularise the current policy by its own history, thereby improving its ability to learn without forgetting. We find that the model improves continual learning relative to baselines on a number of continuous control tasks in single-task, alternating two-task, and multi-agent competitive self-play settings.
Tasks Continual Learning, Continuous Control
Published 2019-02-01
URL https://arxiv.org/abs/1902.00255v2
PDF https://arxiv.org/pdf/1902.00255v2.pdf
PWC https://paperswithcode.com/paper/policy-consolidation-for-continual
Repo https://github.com/ChristosKap/policy_consolidation
Framework tf
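
The cascade described above couples the behavioural policy to a chain of hidden policies that change on progressively slower timescales, so the current policy is regularised by its own history. The loss below is a heavily simplified, one-directional sketch of such a KL coupling over categorical policy logits; the weighting scheme and the detachment of the slower policy are assumptions, not the paper's exact bidirectional formulation.

```python
import torch
import torch.nn.functional as F

def consolidation_loss(policy_logits, omega=4.0, beta=1.0):
    """KL coupling between adjacent policies in the chain (index 0 is the
    behavioural policy); deeper policies act as slower-moving memories."""
    loss = 0.0
    for k in range(len(policy_logits) - 1):
        log_p_fast = F.log_softmax(policy_logits[k], dim=-1)
        p_slow = F.softmax(policy_logits[k + 1], dim=-1).detach()
        # pull the faster policy toward its slower neighbour, more strongly deeper in the chain
        loss = loss + beta * (omega ** k) * F.kl_div(log_p_fast, p_slow, reduction="batchmean")
    return loss

chain = [torch.randn(32, 6, requires_grad=True) for _ in range(4)]  # 4 policies, 6 actions
print(consolidation_loss(chain))
```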

VITON-GAN: Virtual Try-on Image Generator Trained with Adversarial Loss

Title VITON-GAN: Virtual Try-on Image Generator Trained with Adversarial Loss
Authors Shion Honda
Abstract Generating a virtual try-on image from in-shop clothing images and a model person’s snapshot is a challenging task because the human body and clothes have high flexibility in their shapes. In this paper, we develop a Virtual Try-on Generative Adversarial Network (VITON-GAN) that generates virtual try-on images using images of in-shop clothing and a model person. This method enhances the quality of the generated image when occlusion is present in a model person’s image (e.g., arms crossed in front of the clothes) by adding an adversarial mechanism in the training pipeline.
Tasks
Published 2019-11-12
URL https://arxiv.org/abs/1911.07926v1
PDF https://arxiv.org/pdf/1911.07926v1.pdf
PWC https://paperswithcode.com/paper/viton-gan-virtual-try-on-image-generator
Repo https://github.com/shionhonda/viton-gan
Framework pytorch

On the Use of Emojis to Train Emotion Classifiers

Title On the Use of Emojis to Train Emotion Classifiers
Authors Wegdan Hussien, Mahmoud Al-Ayyoub, Yahya Tashtoush, Mohammed Al-Kabi
Abstract Nowadays, the automatic detection of emotions is employed by many applications in different fields like security informatics, e-learning, humor detection, targeted advertising, etc. Many of these applications focus on social media and treat this problem as a classification problem, which requires preparing training data. The typical method of annotating the training data by human experts is considered time-consuming, labor-intensive and sometimes prone to error. Moreover, such an approach is not easily extensible to new domains/languages since such extensions require annotating new training data. In this study, we propose a distantly supervised learning approach where the training sentences are automatically annotated based on the emojis they contain. Such training data would be very cheap to produce compared with manually created training data, and thus much larger training data can be easily obtained. On the other hand, this training data would naturally have lower quality as it may contain some errors in the annotation. Nonetheless, we experimentally show that training classifiers on cheap, large and possibly erroneous data annotated using this approach leads to more accurate results compared with training the same classifiers on the more expensive, much smaller and error-free manually annotated training data. Our experiments are conducted on an in-house dataset of emotional Arabic tweets and the classifiers we consider are Support Vector Machine (SVM), Multinomial Naive Bayes (MNB) and Random Forest (RF). In addition to experimenting with single classifiers, we also consider using an ensemble of classifiers. The results show that using automatically annotated training data (that is only one order of magnitude larger than the manually annotated one) gives better results in almost all settings considered.
Tasks Humor Detection
Published 2019-02-24
URL http://arxiv.org/abs/1902.08906v2
PDF http://arxiv.org/pdf/1902.08906v2.pdf
PWC https://paperswithcode.com/paper/on-the-use-of-emojis-to-train-emotion
Repo https://github.com/malayyoub/emojis-to-train-emotion-classifiers
Framework none
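
A small end-to-end sketch of the distant-supervision idea above: tweets are labelled automatically from the emojis they contain, the emojis are stripped from the text, and a standard classifier (an SVM, one of the classifiers the paper evaluates) is trained on the resulting noisy data. The emoji-to-emotion mapping and the English example tweets are made up for illustration; the paper's experiments use Arabic tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

EMOJI_TO_EMOTION = {"😂": "joy", "😢": "sadness", "😡": "anger", "😱": "fear"}  # assumed mapping

def distant_label(tweets):
    """Automatically annotate tweets from their emojis; drop ambiguous ones."""
    texts, labels = [], []
    for tweet in tweets:
        hits = {emo for ch, emo in EMOJI_TO_EMOTION.items() if ch in tweet}
        if len(hits) != 1:                      # skip emoji-free or mixed-emotion tweets
            continue
        for ch in EMOJI_TO_EMOTION:
            tweet = tweet.replace(ch, " ")       # remove the emoji from the input text
        texts.append(tweet)
        labels.append(hits.pop())
    return texts, labels

raw = ["what a great day 😂", "so unfair 😡", "missing home 😢", "that film 😱", "no emoji here"]
texts, labels = distant_label(raw)
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["such a great game"]))
```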

Multi-level Wavelet Convolutional Neural Networks

Title Multi-level Wavelet Convolutional Neural Networks
Authors Pengju Liu, Hongzhi Zhang, Wei Lian, Wangmeng Zuo
Abstract In computer vision, convolutional networks (CNNs) often adopt pooling to enlarge the receptive field, which has the advantage of low computational complexity. However, pooling can cause information loss and is thus detrimental to further operations such as feature extraction and analysis. Recently, dilated filtering has been proposed to trade off between receptive field size and efficiency, but the accompanying gridding effect can cause sparse sampling of input images with checkerboard patterns. To address this problem, in this paper we propose a novel multi-level wavelet CNN (MWCNN) model to achieve a better trade-off between receptive field size and computational efficiency. The core idea is to embed the wavelet transform into the CNN architecture to reduce the resolution of feature maps while at the same time increasing the receptive field. Specifically, MWCNN for image restoration is based on the U-Net architecture, and the inverse wavelet transform (IWT) is deployed to reconstruct the high resolution (HR) feature maps. The proposed MWCNN can also be viewed as an improvement of dilated filtering and a generalization of average pooling, and can be applied not only to image restoration tasks, but also to any CNNs requiring a pooling operation. The experimental results demonstrate the effectiveness of the proposed MWCNN for tasks such as image denoising, single image super-resolution, JPEG image artifacts removal and object classification.
Tasks Denoising, Image Denoising, Image Restoration, Image Super-Resolution, Object Classification, Super-Resolution
Published 2019-07-06
URL https://arxiv.org/abs/1907.03128v1
PDF https://arxiv.org/pdf/1907.03128v1.pdf
PWC https://paperswithcode.com/paper/multi-level-wavelet-convolutional-neural
Repo https://github.com/lpj-github-io/MWCNNv2
Framework pytorch
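
The pooling substitute in the abstract is the 2D Haar wavelet transform: spatial size halves, channel count quadruples, and the operation is exactly invertible, so no information is lost. Below is a minimal sketch of one DWT/IWT level; the 1/2 normalisation is a convention chosen here and may differ from the released code.

```python
import torch

def haar_dwt(x):
    """One level of the 2D Haar transform used in place of pooling:
    (N, C, H, W) -> (N, 4C, H/2, W/2), exactly invertible."""
    a = x[:, :, 0::2, 0::2]   # even rows, even cols
    b = x[:, :, 1::2, 0::2]   # odd rows,  even cols
    c = x[:, :, 0::2, 1::2]   # even rows, odd cols
    d = x[:, :, 1::2, 1::2]   # odd rows,  odd cols
    ll, hl = (a + b + c + d) / 2, (-a - b + c + d) / 2
    lh, hh = (-a + b - c + d) / 2, (a - b - c + d) / 2
    return torch.cat([ll, hl, lh, hh], dim=1)

def haar_idwt(y):
    """Inverse transform: recovers the original-resolution feature maps."""
    ll, hl, lh, hh = torch.chunk(y, 4, dim=1)
    a, b = (ll - hl - lh + hh) / 2, (ll - hl + lh - hh) / 2
    c, d = (ll + hl - lh - hh) / 2, (ll + hl + lh + hh) / 2
    n, ch, h, w = a.shape
    x = a.new_zeros(n, ch, 2 * h, 2 * w)
    x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2] = a, b
    x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2] = c, d
    return x

x = torch.randn(1, 16, 32, 32)
print(haar_dwt(x).shape, torch.allclose(haar_idwt(haar_dwt(x)), x, atol=1e-5))
```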

NLH: A Blind Pixel-level Non-local Method for Real-world Image Denoising

Title NLH: A Blind Pixel-level Non-local Method for Real-world Image Denoising
Authors Yingkun Hou, Jun Xu, Mingxia Liu, Guanghai Liu, Li Liu, Fan Zhu, Ling Shao
Abstract Non-local self-similarity (NSS) is a powerful prior of natural images for image denoising. Most existing denoising methods employ similar patches, which is a patch-level NSS prior. In this paper, we take one step forward by introducing a pixel-level NSS prior, i.e., searching for similar pixels across a non-local region. This is motivated by the fact that finding closely similar pixels is more feasible than finding similar patches in natural images, which can be used to enhance image denoising performance. With the introduced pixel-level NSS prior, we propose an accurate noise level estimation method, and then develop a blind image denoising method based on the lifting Haar transform and Wiener filtering techniques. Experiments on benchmark datasets demonstrate that the proposed method achieves much better performance than previous non-deep methods, and is still competitive with existing state-of-the-art deep learning based methods on real-world image denoising. The code is publicly available at https://github.com/njusthyk1972/NLH.
Tasks Denoising, Image Denoising
Published 2019-06-17
URL https://arxiv.org/abs/1906.06834v6
PDF https://arxiv.org/pdf/1906.06834v6.pdf
PWC https://paperswithcode.com/paper/nlh-a-blind-pixel-level-non-local-method-for
Repo https://github.com/njusthyk1972/NLH
Framework none
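
The pixel-level NSS prior amounts to grouping, for each reference pixel, the most similar pixels found inside a non-local search window. The snippet below sketches only that grouping step; the window size, the intensity-difference similarity, and the number of neighbours are illustrative assumptions, and the subsequent lifting Haar transform and Wiener filtering stages are omitted.

```python
import numpy as np

def similar_pixels(image, row, col, window=15, k=32):
    """Return the k pixels in a (2*window+1)^2 non-local neighbourhood whose
    intensities are closest to the reference pixel, plus their coordinates."""
    h, w = image.shape
    r0, r1 = max(0, row - window), min(h, row + window + 1)
    c0, c1 = max(0, col - window), min(w, col + window + 1)
    patch = image[r0:r1, c0:c1]
    diff = np.abs(patch - image[row, col])
    flat = np.argsort(diff, axis=None)[:k]          # indices of the k closest pixels
    rows, cols = np.unravel_index(flat, patch.shape)
    return patch[rows, cols], (rows + r0, cols + c0)

noisy = np.random.rand(64, 64).astype(np.float32)
values, coords = similar_pixels(noisy, row=32, col=32)
print(values.shape, float(values.mean()))   # e.g. a crude estimate of the clean pixel
```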

EGG: a toolkit for research on Emergence of lanGuage in Games

Title EGG: a toolkit for research on Emergence of lanGuage in Games
Authors Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni
Abstract There is renewed interest in simulating language emergence among deep neural agents that communicate to jointly solve a task, spurred by the practical aim to develop language-enabled interactive AIs, as well as by theoretical questions about the evolution of human language. However, optimizing deep architectures connected by a discrete communication channel (such as that in which language emerges) is technically challenging. We introduce EGG, a toolkit that greatly simplifies the implementation of emergent-language communication games. EGG’s modular design provides a set of building blocks that the user can combine to create new games, easily navigating the optimization and architecture space. We hope that the tool will lower the technical barrier, and encourage researchers from various backgrounds to do original work in this exciting area.
Tasks
Published 2019-07-01
URL https://arxiv.org/abs/1907.00852v2
PDF https://arxiv.org/pdf/1907.00852v2.pdf
PWC https://paperswithcode.com/paper/egg-a-toolkit-for-research-on-emergence-of
Repo https://github.com/facebookresearch/EGG
Framework pytorch
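
The technical difficulty EGG targets is optimizing deep agents connected by a discrete communication channel. The toy referential game below illustrates that problem in plain PyTorch using a Gumbel-Softmax relaxation of the discrete symbol; it deliberately does not use EGG's own API (see the repo for that), and all sizes and hyperparameters are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, N_OBJECTS, HIDDEN = 10, 8, 32
sender = nn.Sequential(nn.Linear(N_OBJECTS, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, VOCAB))
receiver = nn.Sequential(nn.Linear(VOCAB, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, N_OBJECTS))
opt = torch.optim.Adam(list(sender.parameters()) + list(receiver.parameters()), lr=1e-2)

for step in range(500):
    targets = torch.randint(0, N_OBJECTS, (64,))
    inputs = F.one_hot(targets, N_OBJECTS).float()
    # Gumbel-Softmax makes the discrete one-symbol message differentiable during training
    message = F.gumbel_softmax(sender(inputs), tau=1.0, hard=True)
    logits = receiver(message)
    loss = F.cross_entropy(logits, targets)
    opt.zero_grad(); loss.backward(); opt.step()

print("final accuracy:", (logits.argmax(-1) == targets).float().mean().item())
```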