Paper Group AWR 381
Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network
Title | Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network |
Authors | Tao Hu, Lichao Huang, Xianming Liu, Han Shen |
Abstract | More powerful feature representations derived from deep neural networks benefit visual tracking algorithms widely. However, the lack of exploitation of temporal information prevents tracking algorithms from adapting to appearance changes or resisting drift. This paper proposes a correlation-filter-based tracking method which aggregates historical features in a spatially aligned and scale-aware paradigm. The features of historical frames are sampled and aggregated onto the search frame according to a pixel-level alignment module based on deformable convolutions. In addition, we use a feature pyramid structure to handle motion estimation at different scales, and address the different demands on feature granularity between tracking losses and deformation offset learning. By this design, the tracker, named Spatial-Aware Temporal Aggregation Network (SATA), is able to assemble appearances and motion contexts of various scales over a time period, resulting in better performance compared to a single static image. Our tracker achieves leading performance on OTB2013, OTB2015, VOT2015, VOT2016 and LaSOT, and operates at a real-time speed of 26 FPS, which indicates our method is effective and practical. Our code will be made publicly available at https://github.com/ecart18/SATA. |
Tasks | Motion Estimation, Real-Time Visual Tracking, Visual Tracking |
Published | 2019-08-02 |
URL | https://arxiv.org/abs/1908.00692v1 |
https://arxiv.org/pdf/1908.00692v1.pdf | |
PWC | https://paperswithcode.com/paper/real-time-visual-tracking-using-spatial-aware |
Repo | https://github.com/ecart18/SATA |
Framework | pytorch |
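The alignment step described in the abstract lends itself to a small illustration. Below is a minimal sketch, not the authors' code, of how a historical feature map can be warped onto the current search frame with a deformable convolution whose offsets are predicted from both feature maps; the module name `TemporalAlign`, the channel sizes, and the simple mean aggregation are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TemporalAlign(nn.Module):
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Two offsets (x, y) per kernel location, predicted from both feature maps.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.align = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, hist_feat, curr_feat):
        offsets = self.offset_pred(torch.cat([hist_feat, curr_feat], dim=1))
        return self.align(hist_feat, offsets)   # historical features warped to the current frame

# Aggregate several aligned historical frames with the current one (simple mean).
align = TemporalAlign()
curr = torch.randn(1, 256, 31, 31)
history = [torch.randn(1, 256, 31, 31) for _ in range(3)]
aggregated = torch.stack([align(h, curr) for h in history] + [curr]).mean(dim=0)
```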
Temporal Reasoning via Audio Question Answering
Title | Temporal Reasoning via Audio Question Answering |
Authors | Haytham M. Fayek, Justin Johnson |
Abstract | Multimodal question answering tasks can be used as proxy tasks to study systems that can perceive and reason about the world. Answering questions about different types of input modalities stresses different aspects of reasoning such as visual reasoning, reading comprehension, story understanding, or navigation. In this paper, we use the task of Audio Question Answering (AQA) to study the temporal reasoning abilities of machine learning models. To this end, we introduce the Diagnostic Audio Question Answering (DAQA) dataset comprising audio sequences of natural sound events and programmatically generated questions and answers that probe various aspects of temporal reasoning. We adapt several recent state-of-the-art methods for visual question answering to the AQA task, and use DAQA to demonstrate that they perform poorly on questions that require in-depth temporal reasoning. Finally, we propose a new model, Multiple Auxiliary Controllers for Linear Modulation (MALiMo), that extends the recent Feature-wise Linear Modulation (FiLM) model and significantly improves its temporal reasoning capabilities. We envisage DAQA to foster research on AQA and temporal reasoning, and MALiMo to be a step towards models for AQA. |
Tasks | Question Answering, Reading Comprehension, Visual Question Answering, Visual Reasoning |
Published | 2019-11-21 |
URL | https://arxiv.org/abs/1911.09655v1 |
https://arxiv.org/pdf/1911.09655v1.pdf | |
PWC | https://paperswithcode.com/paper/temporal-reasoning-via-audio-question |
Repo | https://github.com/facebookresearch/daqa |
Framework | none |
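Since MALiMo builds on Feature-wise Linear Modulation (FiLM), a compact FiLM block helps make the mechanism concrete. The sketch below shows only the core idea: a controller predicts per-channel scale and shift from a conditioning vector (here, a question embedding) and applies them to convolutional features. The single linear controller and all shapes are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, feat_channels=128, cond_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)
        # Controller maps the conditioning vector to per-channel gamma and beta.
        self.controller = nn.Linear(cond_dim, 2 * feat_channels)

    def forward(self, x, cond):
        gamma, beta = self.controller(cond).chunk(2, dim=-1)
        h = self.conv(x)
        # Broadcast the (B, C) modulation over the spatial dimensions.
        h = gamma[..., None, None] * h + beta[..., None, None]
        return torch.relu(h) + x   # residual connection, as in FiLM

audio_feat = torch.randn(4, 128, 16, 64)   # e.g. conv features of a spectrogram
question_emb = torch.randn(4, 256)          # e.g. an RNN encoding of the question
out = FiLMBlock()(audio_feat, question_emb)
```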
DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition
Title | DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition |
Authors | Nuno C. Garcia, Sarah Adel Bargal, Vitaly Ablavsky, Pietro Morerio, Vittorio Murino, Stan Sclaroff |
Abstract | In this work, we address the problem of learning an ensemble of specialist networks using multimodal data, while considering the realistic and challenging scenario of possible missing modalities at test time. Our goal is to leverage the complementary information of multiple modalities to the benefit of the ensemble and each individual network. We introduce a novel Distillation Multiple Choice Learning framework for multimodal data, where different modality networks learn in a cooperative setting from scratch, strengthening one another. The modality networks learned using our method achieve significantly higher accuracy than if trained separately, due to the guidance of other modalities. We evaluate this approach on three video action recognition benchmark datasets. We obtain state-of-the-art results in comparison to other approaches that work with missing modalities at test time. |
Tasks | Temporal Action Localization |
Published | 2019-12-23 |
URL | https://arxiv.org/abs/1912.10982v1 |
https://arxiv.org/pdf/1912.10982v1.pdf | |
PWC | https://paperswithcode.com/paper/dmcl-distillation-multiple-choice-learning |
Repo | https://github.com/ncgarcia/DMCL |
Framework | tf |
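Reading the abstract, the training objective can be summarized as a per-sample "winner teaches the rest" rule. The following is a hedged sketch under that interpretation, not the paper's exact loss: the modality network with the lowest cross-entropy on a sample is trained on the label, while the other networks are distilled towards its softened prediction. The function name `dmcl_loss`, the temperature, and the equal loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def dmcl_loss(logits_per_modality, labels, temperature=2.0):
    """logits_per_modality: list of (B, num_classes) tensors, one per modality network."""
    ce = torch.stack([F.cross_entropy(l, labels, reduction='none')
                      for l in logits_per_modality])            # (M, B)
    winner = ce.argmin(dim=0)                                    # (B,) best modality per sample
    total = 0.0
    for m, logits in enumerate(logits_per_modality):
        is_winner = (winner == m).float()
        # Winner: supervised loss on the ground-truth label.
        total = total + (is_winner * ce[m]).mean()
        # Losers: KL distillation from the winning network's softened prediction.
        teacher = torch.stack([logits_per_modality[int(w)][i] for i, w in enumerate(winner)])
        kl = F.kl_div(F.log_softmax(logits / temperature, dim=-1),
                      F.softmax(teacher.detach() / temperature, dim=-1),
                      reduction='none').sum(-1)
        total = total + ((1 - is_winner) * kl).mean()
    return total
```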
Mimetics: Towards Understanding Human Actions Out of Context
Title | Mimetics: Towards Understanding Human Actions Out of Context |
Authors | Philippe Weinzaepfel, Grégory Rogez |
Abstract | Recent methods for video action recognition have reached outstanding performances on existing benchmarks. However, they tend to leverage context such as scenes or objects instead of focusing on understanding the human action itself. For instance, a tennis field leads to the prediction "playing tennis" irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best examples of out-of-context actions are mimes, which people can typically recognize despite missing relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in the absence of context. We therefore introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting the lack of true understanding of the human actions. Body language, captured by human pose and motion, is a meaningful cue to recognize out-of-context actions. We thus evaluate several pose-based baselines, either based on explicit 2D or 3D pose estimates, or on transferring pose features to the action recognition problem. This last method, less prone to inherent pose estimation noise, performs better than the other pose-based baselines, suggesting that an explicit pose representation might not be optimal for real-world action recognition. |
Tasks | Pose Estimation, Temporal Action Localization |
Published | 2019-12-16 |
URL | https://arxiv.org/abs/1912.07249v1 |
https://arxiv.org/pdf/1912.07249v1.pdf | |
PWC | https://paperswithcode.com/paper/mimetics-towards-understanding-human-actions |
Repo | https://github.com/vt-vl-lab/reading_group |
Framework | none |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Title | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks |
Authors | Mingxing Tan, Quoc V. Le |
Abstract | Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. |
Tasks | Fine-Grained Image Classification, Image Classification, Neural Architecture Search, Transfer Learning |
Published | 2019-05-28 |
URL | https://arxiv.org/abs/1905.11946v3 |
https://arxiv.org/pdf/1905.11946v3.pdf | |
PWC | https://paperswithcode.com/paper/efficientnet-rethinking-model-scaling-for |
Repo | https://github.com/hsandmann/espm.ml.2019.1 |
Framework | none |
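The compound scaling rule is simple enough to state directly: network depth, width, and input resolution are scaled by alpha^phi, beta^phi, and gamma^phi for a single user-chosen coefficient phi, with the constants constrained so that alpha * beta^2 * gamma^2 is approximately 2. The snippet below uses the constants reported in the paper (alpha=1.2, beta=1.1, gamma=1.15); note that the published EfficientNet-B1 to B7 configurations round and tune these values further, so this is only the idealized rule.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15, base_resolution=224):
    depth_mult = alpha ** phi                             # number of layers grows as alpha^phi
    width_mult = beta ** phi                              # channel count grows as beta^phi
    resolution = round(base_resolution * gamma ** phi)    # input size grows as gamma^phi
    return depth_mult, width_mult, resolution

for phi in range(0, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {r}")
```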
Affect-based Intrinsic Rewards for Exploration and Learning
Title | Affect-based Intrinsic Rewards for Exploration and Learning |
Authors | Dean Zadok, Daniel McDuff, Ashish Kapoor |
Abstract | Positive affect has been linked to increased interest, curiosity and satisfaction in human learning. In reinforcement learning, extrinsic rewards are often sparse and difficult to define; intrinsically motivated learning can help address these challenges. We argue that positive affect is an important intrinsic reward that effectively helps drive exploration that is useful in gathering experiences. We present a novel approach leveraging a task-independent intrinsic reward function trained on spontaneous smile behavior that captures positive affect. To evaluate our approach, we trained several downstream computer vision tasks on data collected with our policy and several baseline methods. We show that the policy based on intrinsic affective rewards successfully increases the duration of episodes and the area explored, and reduces collisions. The impact is an increased speed of learning for several downstream computer vision tasks. |
Tasks | |
Published | 2019-12-01 |
URL | https://arxiv.org/abs/1912.00403v6 |
https://arxiv.org/pdf/1912.00403v6.pdf | |
PWC | https://paperswithcode.com/paper/affect-based-intrinsic-rewards-for-learning |
Repo | https://github.com/microsoft/affectbased |
Framework | tf |
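The reward shaping described in the abstract boils down to adding a learned, task-independent affect score to the environment reward. Below is a minimal sketch of that combination; the tiny stand-in `affect_model`, the observation shape, and the 0.1 weighting are assumptions for illustration only.

```python
import torch
import torch.nn as nn

affect_model = nn.Sequential(  # stand-in for a network trained on spontaneous smile responses
    nn.Flatten(), nn.Linear(84 * 84, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def shaped_reward(extrinsic_reward, observation, weight=0.1):
    with torch.no_grad():
        intrinsic = affect_model(observation.unsqueeze(0)).item()  # affect score in [0, 1]
    return extrinsic_reward + weight * intrinsic

obs = torch.rand(84, 84)   # toy grayscale observation
print(shaped_reward(extrinsic_reward=0.0, observation=obs))
```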
Deep Learning-Based Semantic Segmentation of Microscale Objects
Title | Deep Learning-Based Semantic Segmentation of Microscale Objects |
Authors | Ekta U. Samani, Wei Guo, Ashis G. Banerjee |
Abstract | Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this paper, we present a deep learning model for semantic segmentation of the images representing such environments. Our model successfully performs segmentation with a high mean Intersection Over Union score of 0.91. |
Tasks | Semantic Segmentation |
Published | 2019-07-03 |
URL | https://arxiv.org/abs/1907.03576v1 |
https://arxiv.org/pdf/1907.03576v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-learning-based-semantic-segmentation-of |
Repo | https://github.com/ektas0330/cell-segmentation |
Framework | tf |
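For readers unfamiliar with the reported metric, the mean Intersection over Union (mIoU) of 0.91 follows the standard per-class definition; the snippet below computes it from predicted and ground-truth label maps (this is the generic metric, not code from the paper).

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                       # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=(256, 256))
target = np.random.randint(0, 3, size=(256, 256))
print(mean_iou(pred, target, num_classes=3))
```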
Perturbative GAN: GAN with Perturbation Layers
Title | Perturbative GAN: GAN with Perturbation Layers |
Authors | Yuma Kishi, Tsutomu Ikegami, Shin-ichi O’uchi, Ryousei Takano, Wakana Nogami, Tomohiro Kudoh |
Abstract | Perturbative GAN, which replaces the convolution layers of existing convolutional GANs (DCGAN, WGAN-GP, BIGGAN, etc.) with perturbation layers that add a fixed noise mask, is proposed. Compared with the convolutional GANs, the number of parameters to be trained is smaller, the convergence of training is faster, the inception score of generated images is higher, and the overall training cost is reduced. Algorithmic generation of the noise masks is also proposed, with which the training, as well as the generation, can be boosted with hardware acceleration. Perturbative GAN is evaluated using conventional datasets (CIFAR10, LSUN, ImageNet), both when a perturbation layer is adopted only for the Generator and when it is introduced to both the Generator and the Discriminator. |
Tasks | |
Published | 2019-02-05 |
URL | http://arxiv.org/abs/1902.01514v1 |
http://arxiv.org/pdf/1902.01514v1.pdf | |
PWC | https://paperswithcode.com/paper/perturbative-gan-gan-with-perturbation-layers |
Repo | https://github.com/obake2ai/Obake-GAN |
Framework | pytorch |
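A perturbation layer, as described in the abstract (and in the earlier Perturbative Neural Networks work it draws on), replaces a learned spatial convolution with a fixed additive noise mask, a nonlinearity, and a learned 1x1 channel-mixing convolution. The sketch below follows that recipe; the Gaussian mask statistics and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class PerturbationLayer(nn.Module):
    def __init__(self, in_channels, out_channels, height, width, noise_std=0.1):
        super().__init__()
        # Fixed (non-trainable) additive noise mask, one per input channel.
        self.register_buffer("mask", noise_std * torch.randn(1, in_channels, height, width))
        self.mix = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # learned 1x1 mixing

    def forward(self, x):
        return self.mix(torch.relu(x + self.mask))

layer = PerturbationLayer(in_channels=64, out_channels=128, height=16, width=16)
out = layer(torch.randn(8, 64, 16, 16))
```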
Pre-training of Graph Augmented Transformers for Medication Recommendation
Title | Pre-training of Graph Augmented Transformers for Medication Recommendation |
Authors | Junyuan Shang, Tengfei Ma, Cao Xiao, Jimeng Sun |
Abstract | Medication recommendation is an important healthcare application. It is commonly formulated as a temporal prediction task. Hence, most existing works only utilize longitudinal electronic health records (EHRs) from a small number of patients with multiple visits, ignoring a large number of patients with a single visit (selection bias). Moreover, important hierarchical knowledge such as the diagnosis hierarchy is not leveraged in the representation learning process. To address these challenges, we propose G-BERT, a new model to combine the power of Graph Neural Networks (GNNs) and BERT (Bidirectional Encoder Representations from Transformers) for medical code representation and medication recommendation. We use GNNs to represent the internal hierarchical structures of medical codes. Then we integrate the GNN representation into a transformer-based visit encoder and pre-train it on EHR data from patients with only a single visit. The pre-trained visit encoder and representation are then fine-tuned for downstream predictive tasks on longitudinal EHRs from patients with multiple visits. G-BERT is the first to bring the language model pre-training schema into the healthcare domain, and it achieves state-of-the-art performance on the medication recommendation task. |
Tasks | Language Modelling, Representation Learning |
Published | 2019-06-02 |
URL | https://arxiv.org/abs/1906.00346v2 |
https://arxiv.org/pdf/1906.00346v2.pdf | |
PWC | https://paperswithcode.com/paper/190600346 |
Repo | https://github.com/jshang123/G-Bert |
Framework | pytorch |
Policy Consolidation for Continual Reinforcement Learning
Title | Policy Consolidation for Continual Reinforcement Learning |
Authors | Christos Kaplanis, Murray Shanahan, Claudia Clopath |
Abstract | We propose a method for tackling catastrophic forgetting in deep reinforcement learning that is agnostic to the timescale of changes in the distribution of experiences, does not require knowledge of task boundaries, and can adapt in continuously changing environments. In our policy consolidation model, the policy network interacts with a cascade of hidden networks that simultaneously remember the agent’s policy at a range of timescales and regularise the current policy by its own history, thereby improving its ability to learn without forgetting. We find that the model improves continual learning relative to baselines on a number of continuous control tasks in single-task, alternating two-task, and multi-agent competitive self-play settings. |
Tasks | Continual Learning, Continuous Control |
Published | 2019-02-01 |
URL | https://arxiv.org/abs/1902.00255v2 |
https://arxiv.org/pdf/1902.00255v2.pdf | |
PWC | https://paperswithcode.com/paper/policy-consolidation-for-continual |
Repo | https://github.com/ChristosKap/policy_consolidation |
Framework | tf |
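The cascade described in the abstract can be summarized as a chain of KL penalties that pull each policy towards its slower neighbour. The sketch below is one hedged reading of that idea, not the paper's exact objective; the one-sided KL, the coefficient `beta`, and the geometric growth factor `omega` are assumptions.

```python
import torch
import torch.nn.functional as F

def consolidation_penalty(cascade_logits, beta=1.0, omega=4.0):
    """cascade_logits: list of (B, num_actions) logits, from the visible policy
    (index 0) down to the slowest hidden policy (last index)."""
    penalty = 0.0
    for k in range(len(cascade_logits) - 1):
        fast = F.log_softmax(cascade_logits[k], dim=-1)
        slow = F.softmax(cascade_logits[k + 1], dim=-1)
        # Each policy is regularised towards the next, slower one; deeper links get
        # larger coefficients so that older behaviour changes more slowly.
        penalty = penalty + beta * (omega ** k) * F.kl_div(fast, slow, reduction='batchmean')
    return penalty

cascade = [torch.randn(8, 6) for _ in range(4)]   # toy cascade of four policies
print(consolidation_penalty(cascade))
```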
VITON-GAN: Virtual Try-on Image Generator Trained with Adversarial Loss
Title | VITON-GAN: Virtual Try-on Image Generator Trained with Adversarial Loss |
Authors | Shion Honda |
Abstract | Generating a virtual try-on image from in-shop clothing images and a model person’s snapshot is a challenging task because the human body and clothes have high flexibility in their shapes. In this paper, we develop a Virtual Try-on Generative Adversarial Network (VITON-GAN) that generates virtual try-on images using images of in-shop clothing and a model person. This method enhances the quality of the generated image when occlusion is present in a model person’s image (e.g., arms crossed in front of the clothes) by adding an adversarial mechanism in the training pipeline. |
Tasks | |
Published | 2019-11-12 |
URL | https://arxiv.org/abs/1911.07926v1 |
https://arxiv.org/pdf/1911.07926v1.pdf | |
PWC | https://paperswithcode.com/paper/viton-gan-virtual-try-on-image-generator |
Repo | https://github.com/shionhonda/viton-gan |
Framework | pytorch |
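The "adversarial mechanism" mentioned in the abstract amounts to training the try-on generator against a discriminator in addition to a reconstruction objective. The sketch below shows one standard way to add such a term; the toy discriminator, the L1 reconstruction loss, the BCE adversarial loss, and the 0.1 weighting are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 48, 1))   # toy discriminator

def generator_loss(fake_tryon, target_image, adv_weight=0.1):
    recon = F.l1_loss(fake_tryon, target_image)                  # pixel reconstruction term
    logits = disc(fake_tryon)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return recon + adv_weight * adv                              # adversarial term added on top

def discriminator_loss(real_image, fake_tryon):
    real_logits, fake_logits = disc(real_image), disc(fake_tryon.detach())
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

fake, real = torch.rand(2, 3, 64, 48), torch.rand(2, 3, 64, 48)
print(generator_loss(fake, real), discriminator_loss(real, fake))
```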
On the Use of Emojis to Train Emotion Classifiers
Title | On the Use of Emojis to Train Emotion Classifiers |
Authors | Wegdan Hussien, Mahmoud Al-Ayyoub, Yahya Tashtoush, Mohammed Al-Kabi |
Abstract | Nowadays, the automatic detection of emotions is employed by many applications in different fields like security informatics, e-learning, humor detection, targeted advertising, etc. Many of these applications focus on social media and treat this problem as a classification problem, which requires preparing training data. The typical method of annotating the training data by human experts is considered time-consuming, labor-intensive and sometimes prone to error. Moreover, such an approach is not easily extensible to new domains/languages, since such extensions require annotating new training data. In this study, we propose a distant supervised learning approach where the training sentences are automatically annotated based on the emojis they contain. Such training data would be very cheap to produce compared with manually created training data; thus, much larger training data can be easily obtained. On the other hand, this training data would naturally have lower quality, as it may contain some errors in the annotation. Nonetheless, we experimentally show that training classifiers on cheap, large and possibly erroneous data annotated using this approach leads to more accurate results compared with training the same classifiers on the more expensive, much smaller and error-free manually annotated training data. Our experiments are conducted on an in-house dataset of emotional Arabic tweets and the classifiers we consider are: Support Vector Machine (SVM), Multinomial Naive Bayes (MNB) and Random Forest (RF). In addition to experimenting with single classifiers, we also consider using an ensemble of classifiers. The results show that using automatically annotated training data (that is only one order of magnitude larger than the manually annotated one) gives better results in almost all settings considered. |
Tasks | Humor Detection |
Published | 2019-02-24 |
URL | http://arxiv.org/abs/1902.08906v2 |
http://arxiv.org/pdf/1902.08906v2.pdf | |
PWC | https://paperswithcode.com/paper/on-the-use-of-emojis-to-train-emotion |
Repo | https://github.com/malayyoub/emojis-to-train-emotion-classifiers |
Framework | none |
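The distant-supervision recipe in the abstract is straightforward to prototype: map emojis to emotion labels, label tweets automatically, and train standard classifiers on text features. The toy sketch below does exactly that with scikit-learn; the emoji-to-emotion mapping and the four-tweet corpus are made-up stand-ins for the paper's Arabic tweet dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

EMOJI_TO_EMOTION = {"😊": "joy", "😢": "sadness", "😡": "anger"}  # assumed mapping

def distant_label(tweet):
    for emoji, emotion in EMOJI_TO_EMOTION.items():
        if emoji in tweet:
            return emotion
    return None   # tweets without a mapped emoji are dropped

corpus = ["great day 😊", "missing home 😢", "traffic again 😡", "lovely weather 😊"]
labelled = [(t, distant_label(t)) for t in corpus if distant_label(t)]
texts, labels = zip(*labelled)

for clf in (LinearSVC(), MultinomialNB(), RandomForestClassifier()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["sunny morning 😊"]))
```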
Multi-level Wavelet Convolutional Neural Networks
Title | Multi-level Wavelet Convolutional Neural Networks |
Authors | Pengju Liu, Hongzhi Zhang, Wei Lian, Wangmeng Zuo |
Abstract | In computer vision, convolutional networks (CNNs) often adopt pooling to enlarge the receptive field, which has the advantage of low computational complexity. However, pooling can cause information loss and is thus detrimental to further operations such as feature extraction and analysis. Recently, the dilated filter has been proposed to trade off between receptive field size and efficiency, but the accompanying gridding effect can cause a sparse sampling of input images with checkerboard patterns. To address this problem, in this paper, we propose a novel multi-level wavelet CNN (MWCNN) model to achieve a better trade-off between receptive field size and computational efficiency. The core idea is to embed the wavelet transform into the CNN architecture to reduce the resolution of feature maps while, at the same time, increasing the receptive field. Specifically, MWCNN for image restoration is based on the U-Net architecture, and the inverse wavelet transform (IWT) is deployed to reconstruct the high-resolution (HR) feature maps. The proposed MWCNN can also be viewed as an improvement of the dilated filter and a generalization of average pooling, and can be applied not only to image restoration tasks, but also to any CNNs requiring a pooling operation. The experimental results demonstrate the effectiveness of the proposed MWCNN for tasks such as image denoising, single image super-resolution, JPEG image artifacts removal and object classification. |
Tasks | Denoising, Image Denoising, Image Restoration, Image Super-Resolution, Object Classification, Super-Resolution |
Published | 2019-07-06 |
URL | https://arxiv.org/abs/1907.03128v1 |
https://arxiv.org/pdf/1907.03128v1.pdf | |
PWC | https://paperswithcode.com/paper/multi-level-wavelet-convolutional-neural |
Repo | https://github.com/lpj-github-io/MWCNNv2 |
Framework | pytorch |
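The central operation, a discrete wavelet transform in place of pooling and its inverse in place of upsampling, can be written in a few lines for the one-level Haar case: resolution halves while the channel count quadruples, and the transform is exactly invertible. The sketch below follows the standard DWT-as-downsampling formulation and is not the authors' code.

```python
import torch

def dwt_haar(x):
    # Split into the four polyphase components of a stride-2 grid.
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 1::2, 0::2]
    c = x[:, :, 0::2, 1::2]
    d = x[:, :, 1::2, 1::2]
    ll, lh = a + b + c + d, -a - b + c + d
    hl, hh = -a + b - c + d, a - b - c + d
    return torch.cat([ll, lh, hl, hh], dim=1) / 2   # (B, 4C, H/2, W/2)

def iwt_haar(y):
    c = y.shape[1] // 4
    ll, lh, hl, hh = y[:, :c], y[:, c:2*c], y[:, 2*c:3*c], y[:, 3*c:]
    b, ch, h, w = ll.shape
    x = torch.zeros(b, ch, h * 2, w * 2, dtype=y.dtype, device=y.device)
    x[:, :, 0::2, 0::2] = (ll - lh - hl + hh) / 2
    x[:, :, 1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[:, :, 0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[:, :, 1::2, 1::2] = (ll + lh + hl + hh) / 2
    return x

x = torch.randn(1, 16, 32, 32)
assert torch.allclose(iwt_haar(dwt_haar(x)), x, atol=1e-5)   # exact inverse up to float error
```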
NLH: A Blind Pixel-level Non-local Method for Real-world Image Denoising
Title | NLH: A Blind Pixel-level Non-local Method for Real-world Image Denoising |
Authors | Yingkun Hou, Jun Xu, Mingxia Liu, Guanghai Liu, Li Liu, Fan Zhu, Ling Shao |
Abstract | Non-local self-similarity (NSS) is a powerful prior of natural images for image denoising. Most existing denoising methods employ similar patches, which is a patch-level NSS prior. In this paper, we take one step forward by introducing a pixel-level NSS prior, i.e., searching similar pixels across a non-local region. This is motivated by the fact that finding closely similar pixels is more feasible than finding similar patches in natural images, which can be used to enhance image denoising performance. With the introduced pixel-level NSS prior, we propose an accurate noise level estimation method, and then develop a blind image denoising method based on the lifting Haar transform and Wiener filtering techniques. Experiments on benchmark datasets demonstrate that the proposed method achieves much better performance than previous non-deep methods, and is still competitive with existing state-of-the-art deep learning based methods on real-world image denoising. The code is publicly available at https://github.com/njusthyk1972/NLH. |
Tasks | Denoising, Image Denoising |
Published | 2019-06-17 |
URL | https://arxiv.org/abs/1906.06834v6 |
https://arxiv.org/pdf/1906.06834v6.pdf | |
PWC | https://paperswithcode.com/paper/nlh-a-blind-pixel-level-non-local-method-for |
Repo | https://github.com/njusthyk1972/NLH |
Framework | none |
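The pixel-level non-local prior can be illustrated with a brute-force search: for a reference pixel, gather the K most similar pixels from a surrounding window by comparing small neighbourhoods around each candidate. The sketch below stops at building such a pixel group; the window, patch and K values are assumptions, and the paper's lifting Haar transform and Wiener filtering stages are not reproduced here.

```python
import numpy as np

def most_similar_pixels(image, row, col, window=15, patch=3, k=16):
    half_w, half_p = window // 2, patch // 2
    ref = image[row - half_p: row + half_p + 1, col - half_p: col + half_p + 1]
    candidates = []
    for r in range(row - half_w, row + half_w + 1):
        for c in range(col - half_w, col + half_w + 1):
            cand = image[r - half_p: r + half_p + 1, c - half_p: c + half_p + 1]
            dist = np.sum((cand - ref) ** 2)   # compare the pixels' local neighbourhoods
            candidates.append((dist, image[r, c]))
    candidates.sort(key=lambda t: t[0])
    return np.array([value for _, value in candidates[:k]])

img = np.random.rand(64, 64)
group = most_similar_pixels(img, row=32, col=32)   # pixel group to be jointly denoised
```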
EGG: a toolkit for research on Emergence of lanGuage in Games
Title | EGG: a toolkit for research on Emergence of lanGuage in Games |
Authors | Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni |
Abstract | There is renewed interest in simulating language emergence among deep neural agents that communicate to jointly solve a task, spurred by the practical aim to develop language-enabled interactive AIs, as well as by theoretical questions about the evolution of human language. However, optimizing deep architectures connected by a discrete communication channel (such as that in which language emerges) is technically challenging. We introduce EGG, a toolkit that greatly simplifies the implementation of emergent-language communication games. EGG’s modular design provides a set of building blocks that the user can combine to create new games, easily navigating the optimization and architecture space. We hope that the tool will lower the technical barrier, and encourage researchers from various backgrounds to do original work in this exciting area. |
Tasks | |
Published | 2019-07-01 |
URL | https://arxiv.org/abs/1907.00852v2 |
https://arxiv.org/pdf/1907.00852v2.pdf | |
PWC | https://paperswithcode.com/paper/egg-a-toolkit-for-research-on-emergence-of |
Repo | https://github.com/facebookresearch/EGG |
Framework | pytorch |
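EGG itself supplies the game wrappers, agents and optimization machinery, so rather than guess at its API, the toy below illustrates only the technical difficulty the abstract highlights: getting gradients through a discrete communication channel, here via a straight-through Gumbel-softmax message passed from a sender to a receiver. All shapes and the reconstruction objective are assumptions, not EGG code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 16
sender = nn.Linear(10, vocab_size)      # maps an input "concept" to message logits
receiver = nn.Linear(vocab_size, 10)    # maps the received message back to a guess

x = torch.randn(8, 10)
message = F.gumbel_softmax(sender(x), tau=1.0, hard=True)   # discrete one-hot, still differentiable
guess = receiver(message)
loss = F.mse_loss(guess, x)             # a toy reconstruction game
loss.backward()                          # gradients flow through the discrete channel
```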