Paper Group AWR 381
Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network
Title | Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network |
Authors | Tao Hu, Lichao Huang, Xianming Liu, Han Shen |
Abstract | More powerful feature representations derived from deep neural networks benefit visual tracking algorithms widely. However, the lack of exploitation of temporal information prevents tracking algorithms from adapting to appearance changes or resisting drift. This paper proposes a correlation-filter-based tracking method which aggregates historical features in a spatially aligned and scale-aware paradigm. The features of historical frames are sampled and aggregated onto the search frame according to a pixel-level alignment module based on deformable convolutions. In addition, we use a feature pyramid structure to handle motion estimation at different scales, and address the different demands on feature granularity between tracking losses and deformation offset learning. By this design, the tracker, named Spatial-Aware Temporal Aggregation Network (SATA), is able to assemble appearances and motion contexts of various scales over a time period, resulting in better performance compared to a single static image. Our tracker achieves leading performance on OTB2013, OTB2015, VOT2015, VOT2016 and LaSOT, and operates at a real-time speed of 26 FPS, which indicates our method is effective and practical. Our code will be made publicly available at https://github.com/ecart18/SATA. |
Tasks | Motion Estimation, Real-Time Visual Tracking, Visual Tracking |
Published | 2019-08-02 |
URL | https://arxiv.org/abs/1908.00692v1 |
https://arxiv.org/pdf/1908.00692v1.pdf | |
PWC | https://paperswithcode.com/paper/real-time-visual-tracking-using-spatial-aware |
Repo | https://github.com/ecart18/SATA |
Framework | pytorch |
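The alignment step described in the abstract lends itself to a small illustration. Below is a minimal sketch, not the authors' code, of how a historical feature map can be warped onto the current search frame with a deformable convolution whose offsets are predicted from both feature maps; the module name `TemporalAlign`, the channel sizes, and the simple mean aggregation are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TemporalAlign(nn.Module):
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Two offsets (x, y) per kernel location, predicted from both feature maps.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.align = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, hist_feat, curr_feat):
        offsets = self.offset_pred(torch.cat([hist_feat, curr_feat], dim=1))
        return self.align(hist_feat, offsets)   # historical features warped to the current frame

# Aggregate several aligned historical frames with the current one (simple mean).
align = TemporalAlign()
curr = torch.randn(1, 256, 31, 31)
history = [torch.randn(1, 256, 31, 31) for _ in range(3)]
aggregated = torch.stack([align(h, curr) for h in history] + [curr]).mean(dim=0)
```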
Temporal Reasoning via Audio Question Answering
Title | Temporal Reasoning via Audio Question Answering |
Authors | Haytham M. Fayek, Justin Johnson |
Abstract | Multimodal question answering tasks can be used as proxy tasks to study systems that can perceive and reason about the world. Answering questions about different types of input modalities stresses different aspects of reasoning such as visual reasoning, reading comprehension, story understanding, or navigation. In this paper, we use the task of Audio Question Answering (AQA) to study the temporal reasoning abilities of machine learning models. To this end, we introduce the Diagnostic Audio Question Answering (DAQA) dataset comprising audio sequences of natural sound events and programmatically generated questions and answers that probe various aspects of temporal reasoning. We adapt several recent state-of-the-art methods for visual question answering to the AQA task, and use DAQA to demonstrate that they perform poorly on questions that require in-depth temporal reasoning. Finally, we propose a new model, Multiple Auxiliary Controllers for Linear Modulation (MALiMo), that extends the recent Feature-wise Linear Modulation (FiLM) model and significantly improves its temporal reasoning capabilities. We envisage DAQA to foster research on AQA and temporal reasoning, and MALiMo to be a step towards models for AQA. |
Tasks | Question Answering, Reading Comprehension, Visual Question Answering, Visual Reasoning |
Published | 2019-11-21 |
URL | https://arxiv.org/abs/1911.09655v1 |
https://arxiv.org/pdf/1911.09655v1.pdf | |
PWC | https://paperswithcode.com/paper/temporal-reasoning-via-audio-question |
Repo | https://github.com/facebookresearch/daqa |
Framework | none |
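Since MALiMo builds on Feature-wise Linear Modulation (FiLM), a compact FiLM block helps make the mechanism concrete. The sketch below shows only the core idea: a controller predicts per-channel scale and shift from a conditioning vector (here, a question embedding) and applies them to convolutional features. The single linear controller and all shapes are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, feat_channels=128, cond_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)
        # Controller maps the conditioning vector to per-channel gamma and beta.
        self.controller = nn.Linear(cond_dim, 2 * feat_channels)

    def forward(self, x, cond):
        gamma, beta = self.controller(cond).chunk(2, dim=-1)
        h = self.conv(x)
        # Broadcast the (B, C) modulation over the spatial dimensions.
        h = gamma[..., None, None] * h + beta[..., None, None]
        return torch.relu(h) + x   # residual connection, as in FiLM

audio_feat = torch.randn(4, 128, 16, 64)   # e.g. conv features of a spectrogram
question_emb = torch.randn(4, 256)          # e.g. an RNN encoding of the question
out = FiLMBlock()(audio_feat, question_emb)
```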
DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition
Title | DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition |
Authors | Nuno C. Garcia, Sarah Adel Bargal, Vitaly Ablavsky, Pietro Morerio, Vittorio Murino, Stan Sclaroff |
Abstract | In this work, we address the problem of learning an ensemble of specialist networks using multimodal data, while considering the realistic and challenging scenario of possible missing modalities at test time. Our goal is to leverage the complementary information of multiple modalities to the benefit of the ensemble and each individual network. We introduce a novel Distillation Multiple Choice Learning framework for multimodal data, where different modality networks learn in a cooperative setting from scratch, strengthening one another. The modality networks learned using our method achieve significantly higher accuracy than if trained separately, due to the guidance of other modalities. We evaluate this approach on three video action recognition benchmark datasets. We obtain state-of-the-art results in comparison to other approaches that work with missing modalities at test time. |
Tasks | Temporal Action Localization |
Published | 2019-12-23 |
URL | https://arxiv.org/abs/1912.10982v1 |
https://arxiv.org/pdf/1912.10982v1.pdf | |
PWC | https://paperswithcode.com/paper/dmcl-distillation-multiple-choice-learning |
Repo | https://github.com/ncgarcia/DMCL |
Framework | tf |
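Reading the abstract, the training objective can be summarized as a per-sample "winner teaches the rest" rule. The following is a hedged sketch under that interpretation, not the paper's exact loss: the modality network with the lowest cross-entropy on a sample is trained on the label, while the other networks are distilled towards its softened prediction. The function name `dmcl_loss`, the temperature, and the equal loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def dmcl_loss(logits_per_modality, labels, temperature=2.0):
    """logits_per_modality: list of (B, num_classes) tensors, one per modality network."""
    ce = torch.stack([F.cross_entropy(l, labels, reduction='none')
                      for l in logits_per_modality])            # (M, B)
    winner = ce.argmin(dim=0)                                    # (B,) best modality per sample
    total = 0.0
    for m, logits in enumerate(logits_per_modality):
        is_winner = (winner == m).float()
        # Winner: supervised loss on the ground-truth label.
        total = total + (is_winner * ce[m]).mean()
        # Losers: KL distillation from the winning network's softened prediction.
        teacher = torch.stack([logits_per_modality[int(w)][i] for i, w in enumerate(winner)])
        kl = F.kl_div(F.log_softmax(logits / temperature, dim=-1),
                      F.softmax(teacher.detach() / temperature, dim=-1),
                      reduction='none').sum(-1)
        total = total + ((1 - is_winner) * kl).mean()
    return total
```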
Mimetics: Towards Understanding Human Actions Out of Context
Title | Mimetics: Towards Understanding Human Actions Out of Context |
Authors | Philippe Weinzaepfel, Grégory Rogez |
Abstract | Recent methods for video action recognition have reached outstanding performances on existing benchmarks. However, they tend to leverage context such as scenes or objects instead of focusing on understanding the human action itself. For instance, a tennis field leads to the prediction "playing tennis" irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best examples of out-of-context actions are mimes, which people can typically recognize despite missing relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in the absence of context. We therefore introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting the lack of true understanding of the human actions. Body language, captured by human pose and motion, is a meaningful cue to recognize out-of-context actions. We thus evaluate several pose-based baselines, either based on explicit 2D or 3D pose estimates, or on transferring pose features to the action recognition problem. This last method, less prone to inherent pose estimation noise, performs better than the other pose-based baselines, suggesting that an explicit pose representation might not be optimal for real-world action recognition. |
Tasks | Pose Estimation, Temporal Action Localization |
Published | 2019-12-16 |
URL | https://arxiv.org/abs/1912.07249v1 |
https://arxiv.org/pdf/1912.07249v1.pdf | |
PWC | https://paperswithcode.com/paper/mimetics-towards-understanding-human-actions |
Repo | https://github.com/vt-vl-lab/reading_group |
Framework | none |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Title | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks |
Authors | Mingxing Tan, Quoc V. Le |
Abstract | Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. |
Tasks | Fine-Grained Image Classification, Image Classification, Neural Architecture Search, Transfer Learning |
Published | 2019-05-28 |
URL | https://arxiv.org/abs/1905.11946v3 |
https://arxiv.org/pdf/1905.11946v3.pdf | |
PWC | https://paperswithcode.com/paper/efficientnet-rethinking-model-scaling-for |
Repo | https://github.com/hsandmann/espm.ml.2019.1 |
Framework | none |
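The compound scaling rule is simple enough to state directly: network depth, width, and input resolution are scaled by alpha^phi, beta^phi, and gamma^phi for a single user-chosen coefficient phi, with the constants constrained so that alpha * beta^2 * gamma^2 is approximately 2. The snippet below uses the constants reported in the paper (alpha=1.2, beta=1.1, gamma=1.15); note that the published EfficientNet-B1 to B7 configurations round and tune these values further, so this is only the idealized rule.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15, base_resolution=224):
    depth_mult = alpha ** phi                             # number of layers grows as alpha^phi
    width_mult = beta ** phi                              # channel count grows as beta^phi
    resolution = round(base_resolution * gamma ** phi)    # input size grows as gamma^phi
    return depth_mult, width_mult, resolution

for phi in range(0, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {r}")
```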
Affect-based Intrinsic Rewards for Exploration and Learning
Title | Affect-based Intrinsic Rewards for Exploration and Learning |
Authors | Dean Zadok, Daniel McDuff, Ashish Kapoor |
Abstract | Positive affect has been linked to increased interest, curiosity and satisfaction in human learning. In reinforcement learning, extrinsic rewards are often sparse and difficult to define; intrinsically motivated learning can help address these challenges. We argue that positive affect is an important intrinsic reward that effectively helps drive exploration that is useful in gathering experiences. We present a novel approach leveraging a task-independent intrinsic reward function trained on spontaneous smile behavior that captures positive affect. To evaluate our approach, we trained several downstream computer vision tasks on data collected with our policy and several baseline methods. We show that the policy based on intrinsic affective rewards successfully increases the duration of episodes and the area explored, and reduces collisions. The impact is an increased speed of learning for several downstream computer vision tasks. |
Tasks | |
Published | 2019-12-01 |
URL | https://arxiv.org/abs/1912.00403v6 |
https://arxiv.org/pdf/1912.00403v6.pdf | |
PWC | https://paperswithcode.com/paper/affect-based-intrinsic-rewards-for-learning |
Repo | https://github.com/microsoft/affectbased |
Framework | tf |
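The reward shaping described in the abstract boils down to adding a learned, task-independent affect score to the environment reward. Below is a minimal sketch of that combination; the tiny stand-in `affect_model`, the observation shape, and the 0.1 weighting are assumptions for illustration only.

```python
import torch
import torch.nn as nn

affect_model = nn.Sequential(  # stand-in for a network trained on spontaneous smile responses
    nn.Flatten(), nn.Linear(84 * 84, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def shaped_reward(extrinsic_reward, observation, weight=0.1):
    with torch.no_grad():
        intrinsic = affect_model(observation.unsqueeze(0)).item()  # affect score in [0, 1]
    return extrinsic_reward + weight * intrinsic

obs = torch.rand(84, 84)   # toy grayscale observation
print(shaped_reward(extrinsic_reward=0.0, observation=obs))
```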
Deep Learning-Based Semantic Segmentation of Microscale Objects
Title | Deep Learning-Based Semantic Segmentation of Microscale Objects |
Authors | Ekta U. Samani, Wei Guo, Ashis G. Banerjee |
Abstract | Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this paper, we present a deep learning model for semantic segmentation of the images representing such environments. Our model successfully performs segmentation with a high mean Intersection Over Union score of 0.91. |
Tasks | Semantic Segmentation |
Published | 2019-07-03 |
URL | https://arxiv.org/abs/1907.03576v1 |
https://arxiv.org/pdf/1907.03576v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-learning-based-semantic-segmentation-of |
Repo | https://github.com/ektas0330/cell-segmentation |
Framework | tf |
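For readers unfamiliar with the reported metric, the mean Intersection over Union (mIoU) of 0.91 follows the standard per-class definition; the snippet below computes it from predicted and ground-truth label maps (this is the generic metric, not code from the paper).

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                       # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=(256, 256))
target = np.random.randint(0, 3, size=(256, 256))
print(mean_iou(pred, target, num_classes=3))
```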
Perturbative GAN: GAN with Perturbation Layers
Title | Perturbative GAN: GAN with Perturbation Layers |
Authors | Yuma Kishi, Tsutomu Ikegami, Shin-ichi O’uchi, Ryousei Takano, Wakana Nogami, Tomohiro Kudoh |
Abstract | Perturbative GAN, which replaces the convolution layers of existing convolutional GANs (DCGAN, WGAN-GP, BIGGAN, etc.) with perturbation layers that add a fixed noise mask, is proposed. Compared with the convolutional GANs, the number of parameters to be trained is smaller, the convergence of training is faster, the inception score of generated images is higher, and the overall training cost is reduced. Algorithmic generation of the noise masks is also proposed, with which the training, as well as the generation, can be boosted with hardware acceleration. Perturbative GAN is evaluated using conventional datasets (CIFAR10, LSUN, ImageNet), both when a perturbation layer is adopted only for the Generator and when it is introduced to both the Generator and the Discriminator. |
Tasks | |
Published | 2019-02-05 |
URL | http://arxiv.org/abs/1902.01514v1 |
http://arxiv.org/pdf/1902.01514v1.pdf | |
PWC | https://paperswithcode.com/paper/perturbative-gan-gan-with-perturbation-layers |
Repo | https://github.com/obake2ai/Obake-GAN |
Framework | pytorch |
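A perturbation layer, as described in the abstract (and in the earlier Perturbative Neural Networks work it draws on), replaces a learned spatial convolution with a fixed additive noise mask, a nonlinearity, and a learned 1x1 channel-mixing convolution. The sketch below follows that recipe; the Gaussian mask statistics and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class PerturbationLayer(nn.Module):
    def __init__(self, in_channels, out_channels, height, width, noise_std=0.1):
        super().__init__()
        # Fixed (non-trainable) additive noise mask, one per input channel.
        self.register_buffer("mask", noise_std * torch.randn(1, in_channels, height, width))
        self.mix = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # learned 1x1 mixing

    def forward(self, x):
        return self.mix(torch.relu(x + self.mask))

layer = PerturbationLayer(in_channels=64, out_channels=128, height=16, width=16)
out = layer(torch.randn(8, 64, 16, 16))
```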
Pre-training of Graph Augmented Transformers for Medication Recommendation
Title | Pre-training of Graph Augmented Transformers for Medication Recommendation |
Authors | Junyuan Shang, Tengfei Ma, Cao Xiao, Jimeng Sun |
Abstract | Medication recommendation is an important healthcare application. It is commonly formulated as a temporal prediction task. Hence, most existing works only utilize longitudinal electronic health records (EHRs) from a small number of patients with multiple visits, ignoring a large number of patients with a single visit (selection bias). Moreover, important hierarchical knowledge such as the diagnosis hierarchy is not leveraged in the representation learning process. To address these challenges, we propose G-BERT, a new model to combine the power of Graph Neural Networks (GNNs) and BERT (Bidirectional Encoder Representations from Transformers) for medical code representation and medication recommendation. We use GNNs to represent the internal hierarchical structures of medical codes. Then we integrate the GNN representation into a transformer-based visit encoder and pre-train it on EHR data from patients with only a single visit. The pre-trained visit encoder and representation are then fine-tuned for downstream predictive tasks on longitudinal EHRs from patients with multiple visits. G-BERT is the first to bring the language model pre-training schema into the healthcare domain, and it achieves state-of-the-art performance on the medication recommendation task. |
Tasks | Language Modelling, Representation Learning |
Published | 2019-06-02 |
URL | https://arxiv.org/abs/1906.00346v2 |
https://arxiv.org/pdf/1906.00346v2.pdf | |
PWC | https://paperswithcode.com/paper/190600346 |
Repo | https://github.com/jshang123/G-Bert |
Framework | pytorch |
Policy Consolidation for Continual Reinforcement Learning
Title | Policy Consolidation for Continual Reinforcement Learning |
Authors | Christos Kaplanis, Murray Shanahan, Claudia Clopath |
Abstract | We propose a method for tackling catastrophic forgetting in deep reinforcement learning that is agnostic to the timescale of changes in the distribution of experiences, does not require knowledge of task boundaries, and can adapt in continuously changing environments. In our policy consolidation model, the policy network interacts with a cascade of hidden networks that simultaneously remember the agent’s policy at a range of timescales and regularise the current policy by its own history, thereby improving its ability to learn without forgetting. We find that the model improves continual learning relative to baselines on a number of continuous control tasks in single-task, alternating two-task, and multi-agent competitive self-play settings. |
Tasks | Continual Learning, Continuous Control |
Published | 2019-02-01 |
URL | https://arxiv.org/abs/1902.00255v2 |
https://arxiv.org/pdf/1902.00255v2.pdf | |
PWC | https://paperswithcode.com/paper/policy-consolidation-for-continual |
Repo | https://github.com/ChristosKap/policy_consolidation |
Framework | tf |
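The cascade described in the abstract can be summarized as a chain of KL penalties that pull each policy towards its slower neighbour. The sketch below is one hedged reading of that idea, not the paper's exact objective; the one-sided KL, the coefficient `beta`, and the geometric growth factor `omega` are assumptions.

```python
import torch
import torch.nn.functional as F

def consolidation_penalty(cascade_logits, beta=1.0, omega=4.0):
    """cascade_logits: list of (B, num_actions) logits, from the visible policy
    (index 0) down to the slowest hidden policy (last index)."""
    penalty = 0.0
    for k in range(len(cascade_logits) - 1):
        fast = F.log_softmax(cascade_logits[k], dim=-1)
        slow = F.softmax(cascade_logits[k + 1], dim=-1)
        # Each policy is regularised towards the next, slower one; deeper links get
        # larger coefficients so that older behaviour changes more slowly.
        penalty = penalty + beta * (omega ** k) * F.kl_div(fast, slow, reduction='batchmean')
    return penalty

cascade = [torch.randn(8, 6) for _ in range(4)]   # toy cascade of four policies
print(consolidation_penalty(cascade))
```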
VITON-GAN: Virtual Try-on Image Generator Trained with Adversarial Loss
Title | VITON-GAN: Virtual Try-on Image Generator Trained with Adversarial Loss |
Authors | Shion Honda |
Abstract | Generating a virtual try-on image from in-shop clothing images and a model person’s snapshot is a challenging task because the human body and clothes have high flexibility in their shapes. In this paper, we develop a Virtual Try-on Generative Adversarial Network (VITON-GAN) that generates virtual try-on images using images of in-shop clothing and a model person. This method enhances the quality of the generated image when occlusion is present in a model person’s image (e.g., arms crossed in front of the clothes) by adding an adversarial mechanism in the training pipeline. |
Tasks | |
Published | 2019-11-12 |
URL | https://arxiv.org/abs/1911.07926v1 |
https://arxiv.org/pdf/1911.07926v1.pdf | |
PWC | https://paperswithcode.com/paper/viton-gan-virtual-try-on-image-generator |
Repo | https://github.com/shionhonda/viton-gan |
Framework | pytorch |
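The "adversarial mechanism" mentioned in the abstract amounts to training the try-on generator against a discriminator in addition to a reconstruction objective. The sketch below shows one standard way to add such a term; the toy discriminator, the L1 reconstruction loss, the BCE adversarial loss, and the 0.1 weighting are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 48, 1))   # toy discriminator

def generator_loss(fake_tryon, target_image, adv_weight=0.1):
    recon = F.l1_loss(fake_tryon, target_image)                  # pixel reconstruction term
    logits = disc(fake_tryon)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return recon + adv_weight * adv                              # adversarial term added on top

def discriminator_loss(real_image, fake_tryon):
    real_logits, fake_logits = disc(real_image), disc(fake_tryon.detach())
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

fake, real = torch.rand(2, 3, 64, 48), torch.rand(2, 3, 64, 48)
print(generator_loss(fake, real), discriminator_loss(real, fake))
```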
On the Use of Emojis to Train Emotion Classifiers
Title | On the Use of Emojis to Train Emotion Classifiers |
Authors | Wegdan Hussien, Mahmoud Al-Ayyoub, Yahya Tashtoush, Mohammed Al-Kabi |
Abstract | Nowadays, the automatic detection of emotions is employed by many applications in different fields like security informatics, e-learning, humor detection, targeted advertising, etc. Many of these applications focus on social media and treat this problem as a classification problem, which requires preparing training data. The typical method of annotating the training data by human experts is considered time-consuming, labor-intensive and sometimes prone to error. Moreover, such an approach is not easily extensible to new domains/languages, since such extensions require annotating new training data. In this study, we propose a distant supervised learning approach where the training sentences are automatically annotated based on the emojis they contain. Such training data would be very cheap to produce compared with manually created training data; thus, much larger training data can be easily obtained. On the other hand, this training data would naturally have lower quality, as it may contain some errors in the annotation. Nonetheless, we experimentally show that training classifiers on cheap, large and possibly erroneous data annotated using this approach leads to more accurate results compared with training the same classifiers on the more expensive, much smaller and error-free manually annotated training data. Our experiments are conducted on an in-house dataset of emotional Arabic tweets and the classifiers we consider are: Support Vector Machine (SVM), Multinomial Naive Bayes (MNB) and Random Forest (RF). In addition to experimenting with single classifiers, we also consider using an ensemble of classifiers. The results show that using automatically annotated training data (that is only one order of magnitude larger than the manually annotated one) gives better results in almost all settings considered. |
Tasks | Humor Detection |
Published | 2019-02-24 |
URL | http://arxiv.org/abs/1902.08906v2 |
http://arxiv.org/pdf/1902.08906v2.pdf | |
PWC | https://paperswithcode.com/paper/on-the-use-of-emojis-to-train-emotion |
Repo | https://github.com/malayyoub/emojis-to-train-emotion-classifiers |
Framework | none |
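The distant-supervision recipe in the abstract is straightforward to prototype: map emojis to emotion labels, label tweets automatically, and train standard classifiers on text features. The toy sketch below does exactly that with scikit-learn; the emoji-to-emotion mapping and the four-tweet corpus are made-up stand-ins for the paper's Arabic tweet dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

EMOJI_TO_EMOTION = {"😊": "joy", "😢": "sadness", "😡": "anger"}  # assumed mapping

def distant_label(tweet):
    for emoji, emotion in EMOJI_TO_EMOTION.items():
        if emoji in tweet:
            return emotion
    return None   # tweets without a mapped emoji are dropped

corpus = ["great day 😊", "missing home 😢", "traffic again 😡", "lovely weather 😊"]
labelled = [(t, distant_label(t)) for t in corpus if distant_label(t)]
texts, labels = zip(*labelled)

for clf in (LinearSVC(), MultinomialNB(), RandomForestClassifier()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["sunny morning 😊"]))
```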
Multi-level Wavelet Convolutional Neural Networks
Title | Multi-level Wavelet Convolutional Neural Networks |
Authors | Pengju Liu, Hongzhi Zhang, Wei Lian, Wangmeng Zuo |
Abstract | In computer vision, convolutional networks (CNNs) often adopt pooling to enlarge the receptive field, which has the advantage of low computational complexity. However, pooling can cause information loss and is thus detrimental to further operations such as feature extraction and analysis. Recently, the dilated filter has been proposed to trade off between receptive field size and efficiency, but the accompanying gridding effect can cause a sparse sampling of input images with checkerboard patterns. To address this problem, in this paper, we propose a novel multi-level wavelet CNN (MWCNN) model to achieve a better trade-off between receptive field size and computational efficiency. The core idea is to embed the wavelet transform into the CNN architecture to reduce the resolution of feature maps while, at the same time, increasing the receptive field. Specifically, MWCNN for image restoration is based on the U-Net architecture, and the inverse wavelet transform (IWT) is deployed to reconstruct the high-resolution (HR) feature maps. The proposed MWCNN can also be viewed as an improvement of the dilated filter and a generalization of average pooling, and can be applied not only to image restoration tasks, but also to any CNNs requiring a pooling operation. The experimental results demonstrate the effectiveness of the proposed MWCNN for tasks such as image denoising, single image super-resolution, JPEG image artifacts removal and object classification. |
Tasks | Denoising, Image Denoising, Image Restoration, Image Super-Resolution, Object Classification, Super-Resolution |
Published | 2019-07-06 |
URL | https://arxiv.org/abs/1907.03128v1 |
https://arxiv.org/pdf/1907.03128v1.pdf | |
PWC | https://paperswithcode.com/paper/multi-level-wavelet-convolutional-neural |
Repo | https://github.com/lpj-github-io/MWCNNv2 |
Framework | pytorch |
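The central operation, a discrete wavelet transform in place of pooling and its inverse in place of upsampling, can be written in a few lines for the one-level Haar case: resolution halves while the channel count quadruples, and the transform is exactly invertible. The sketch below follows the standard DWT-as-downsampling formulation and is not the authors' code.

```python
import torch

def dwt_haar(x):
    # Split into the four polyphase components of a stride-2 grid.
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 1::2, 0::2]
    c = x[:, :, 0::2, 1::2]
    d = x[:, :, 1::2, 1::2]
    ll, lh = a + b + c + d, -a - b + c + d
    hl, hh = -a + b - c + d, a - b - c + d
    return torch.cat([ll, lh, hl, hh], dim=1) / 2   # (B, 4C, H/2, W/2)

def iwt_haar(y):
    c = y.shape[1] // 4
    ll, lh, hl, hh = y[:, :c], y[:, c:2*c], y[:, 2*c:3*c], y[:, 3*c:]
    b, ch, h, w = ll.shape
    x = torch.zeros(b, ch, h * 2, w * 2, dtype=y.dtype, device=y.device)
    x[:, :, 0::2, 0::2] = (ll - lh - hl + hh) / 2
    x[:, :, 1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[:, :, 0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[:, :, 1::2, 1::2] = (ll + lh + hl + hh) / 2
    return x

x = torch.randn(1, 16, 32, 32)
assert torch.allclose(iwt_haar(dwt_haar(x)), x, atol=1e-5)   # exact inverse up to float error
```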
NLH: A Blind Pixel-level Non-local Method for Real-world Image Denoising
Title | NLH: A Blind Pixel-level Non-local Method for Real-world Image Denoising |
Authors | Yingkun Hou, Jun Xu, Mingxia Liu, Guanghai Liu, Li Liu, Fan Zhu, Ling Shao |
Abstract | Non-local self-similarity (NSS) is a powerful prior of natural images for image denoising. Most existing denoising methods employ similar patches, which is a patch-level NSS prior. In this paper, we take one step forward by introducing a pixel-level NSS prior, i.e., searching similar pixels across a non-local region. This is motivated by the fact that finding closely similar pixels is more feasible than finding similar patches in natural images, which can be used to enhance image denoising performance. With the introduced pixel-level NSS prior, we propose an accurate noise level estimation method, and then develop a blind image denoising method based on the lifting Haar transform and Wiener filtering techniques. Experiments on benchmark datasets demonstrate that the proposed method achieves much better performance than previous non-deep methods, and is still competitive with existing state-of-the-art deep learning based methods on real-world image denoising. The code is publicly available at https://github.com/njusthyk1972/NLH. |
Tasks | Denoising, Image Denoising |
Published | 2019-06-17 |
URL | https://arxiv.org/abs/1906.06834v6 |
https://arxiv.org/pdf/1906.06834v6.pdf | |
PWC | https://paperswithcode.com/paper/nlh-a-blind-pixel-level-non-local-method-for |
Repo | https://github.com/njusthyk1972/NLH |
Framework | none |
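The pixel-level non-local prior can be illustrated with a brute-force search: for a reference pixel, gather the K most similar pixels from a surrounding window by comparing small neighbourhoods around each candidate. The sketch below stops at building such a pixel group; the window, patch and K values are assumptions, and the paper's lifting Haar transform and Wiener filtering stages are not reproduced here.

```python
import numpy as np

def most_similar_pixels(image, row, col, window=15, patch=3, k=16):
    half_w, half_p = window // 2, patch // 2
    ref = image[row - half_p: row + half_p + 1, col - half_p: col + half_p + 1]
    candidates = []
    for r in range(row - half_w, row + half_w + 1):
        for c in range(col - half_w, col + half_w + 1):
            cand = image[r - half_p: r + half_p + 1, c - half_p: c + half_p + 1]
            dist = np.sum((cand - ref) ** 2)   # compare the pixels' local neighbourhoods
            candidates.append((dist, image[r, c]))
    candidates.sort(key=lambda t: t[0])
    return np.array([value for _, value in candidates[:k]])

img = np.random.rand(64, 64)
group = most_similar_pixels(img, row=32, col=32)   # pixel group to be jointly denoised
```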
EGG: a toolkit for research on Emergence of lanGuage in Games
Title | EGG: a toolkit for research on Emergence of lanGuage in Games |
Authors | Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni |
Abstract | There is renewed interest in simulating language emergence among deep neural agents that communicate to jointly solve a task, spurred by the practical aim to develop language-enabled interactive AIs, as well as by theoretical questions about the evolution of human language. However, optimizing deep architectures connected by a discrete communication channel (such as that in which language emerges) is technically challenging. We introduce EGG, a toolkit that greatly simplifies the implementation of emergent-language communication games. EGG’s modular design provides a set of building blocks that the user can combine to create new games, easily navigating the optimization and architecture space. We hope that the tool will lower the technical barrier, and encourage researchers from various backgrounds to do original work in this exciting area. |
Tasks | |
Published | 2019-07-01 |
URL | https://arxiv.org/abs/1907.00852v2 |
https://arxiv.org/pdf/1907.00852v2.pdf | |
PWC | https://paperswithcode.com/paper/egg-a-toolkit-for-research-on-emergence-of |
Repo | https://github.com/facebookresearch/EGG |
Framework | pytorch |
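EGG itself supplies the game wrappers, agents and optimization machinery, so rather than guess at its API, the toy below illustrates only the technical difficulty the abstract highlights: getting gradients through a discrete communication channel, here via a straight-through Gumbel-softmax message passed from a sender to a receiver. All shapes and the reconstruction objective are assumptions, not EGG code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 16
sender = nn.Linear(10, vocab_size)      # maps an input "concept" to message logits
receiver = nn.Linear(vocab_size, 10)    # maps the received message back to a guess

x = torch.randn(8, 10)
message = F.gumbel_softmax(sender(x), tau=1.0, hard=True)   # discrete one-hot, still differentiable
guess = receiver(message)
loss = F.mse_loss(guess, x)             # a toy reconstruction game
loss.backward()                          # gradients flow through the discrete channel
```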