April 3, 2020

3291 words 16 mins read

Paper Group AWR 69

TResNet: High Performance GPU-Dedicated Architecture. Scene Text Recognition via Transformer. Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment. Lookahead: a Far-Sighted Alternative of Magnitude-based Pruning. Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation. Adversarial System Variant Approximation to Quan …

TResNet: High Performance GPU-Dedicated Architecture

Title TResNet: High Performance GPU-Dedicated Architecture
Authors Tal Ridnik, Hussam Lawen, Asaf Noy, Itamar Friedman
Abstract Many deep learning models, developed in recent years, reach higher ImageNet accuracy than ResNet50, with a lower or comparable FLOPs count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering a better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks’ accuracy, while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieves better accuracy and efficiency than previous ConvNets. Using a TResNet model with similar GPU throughput to ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive datasets such as Stanford Cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). Implementation is available at: https://github.com/mrT23/TResNet
Tasks Fine-Grained Image Classification, Image Classification
Published 2020-03-30
URL https://arxiv.org/abs/2003.13630v1
PDF https://arxiv.org/pdf/2003.13630v1.pdf
PWC https://paperswithcode.com/paper/tresnet-high-performance-gpu-dedicated
Repo https://github.com/mrT23/TResNet
Framework pytorch
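
The abstract’s central point is that FLOPs are a poor proxy for actual GPU throughput. As a rough illustration (not the authors’ benchmarking code), a minimal PyTorch sketch for measuring inference images/sec could look like this; batch size, input resolution and iteration counts are arbitrary assumptions.

```python
# Hypothetical throughput check: measure ImageNet-size inference images/sec on
# GPU, the metric the abstract argues matters more than FLOPs.
import time
import torch
import torchvision

def gpu_throughput(model, batch_size=64, iters=50, warmup=10):
    device = torch.device("cuda")
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels / cuDNN autotuning
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)  # images per second

if torch.cuda.is_available():
    print("resnet50: %.0f img/s" % gpu_throughput(torchvision.models.resnet50()))
```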

Scene Text Recognition via Transformer

Title Scene Text Recognition via Transformer
Authors Xinjie Feng, Hongxun Yao, Yuankai Qi, Jun Zhang, Shengping Zhang
Abstract Scene text recognition with arbitrary shape is very challenging due to large variations in text shapes, fonts, colors, backgrounds, etc. Most state-of-the-art algorithms rectify the input image into a normalized image and then treat recognition as a sequence prediction task. The bottleneck of such methods is the rectification step, which introduces errors under perspective distortion. In this paper, we find that rectification is completely unnecessary; all we need is spatial attention. We therefore propose a simple but extremely effective scene text recognition method based on the transformer [50]. Different from previous transformer-based models [56,34], which only use the transformer decoder to decode convolutional attention, the proposed method feeds convolutional feature maps as word embeddings into the transformer. In this way, our method is able to make full use of the transformer’s powerful attention mechanism. Extensive experimental results show that the proposed method outperforms state-of-the-art methods by a large margin on both regular and irregular text datasets. On the challenging CUTE dataset, where the previous state-of-the-art accuracy is 89.6%, our method achieves 99.3%. We will release our source code and believe that our method will serve as a new baseline for scene text recognition with arbitrary shapes.
Tasks Scene Text Recognition
Published 2020-03-18
URL https://arxiv.org/abs/2003.08077v2
PDF https://arxiv.org/pdf/2003.08077v2.pdf
PWC https://paperswithcode.com/paper/scene-text-recognition-via-transformer
Repo https://github.com/fengxinjie/Transformer-OCR
Framework pytorch
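
A minimal sketch of the idea in the abstract: flatten CNN feature maps and feed them to a transformer as the source sequence. The backbone, dimensions and vocabulary size below are placeholders, and positional encodings plus the causal target mask are omitted for brevity; this is not the released model.

```python
import torch
import torch.nn as nn
import torchvision

class ConvTransformerOCR(nn.Module):
    def __init__(self, vocab_size=100, d_model=256):
        super().__init__()
        resnet = torchvision.models.resnet34()
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # conv features only
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)            # map to d_model channels
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3, num_decoder_layers=3)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        f = self.proj(self.backbone(images))               # (N, d_model, H', W')
        src = f.flatten(2).permute(2, 0, 1)                # (H'*W', N, d_model) sequence
        tgt = self.tgt_embed(tgt_tokens).permute(1, 0, 2)  # (T, N, d_model)
        return self.out(self.transformer(src, tgt))        # (T, N, vocab_size)

logits = ConvTransformerOCR()(torch.randn(2, 3, 32, 128),
                              torch.zeros(2, 5, dtype=torch.long))
```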

Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment

Title Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment
Authors Zirui Zhao, Yijun Mao, Yan Ding, Pengju Ren, Nanning Zheng
Abstract Semantic SLAM is an important field in autonomous driving and intelligent agents; it can enable robots to carry out high-level navigation tasks, obtain simple cognition or reasoning ability, and achieve language-based human-robot interaction. In this paper, we build a system that creates a semantic 3D map for large-scale environments by combining the 3D point cloud from ORB-SLAM with semantic segmentation information from the convolutional neural network model PSPNet-101. In addition, we build a new dataset for the KITTI sequences, which contains GPS information and labels of landmarks from Google Maps for the streets covered by the sequences. Moreover, we present a way to associate real-world landmarks with the point cloud map and build a topological map on top of the semantic map.
Tasks Autonomous Driving, Semantic Segmentation
Published 2020-01-04
URL https://arxiv.org/abs/2001.01028v1
PDF https://arxiv.org/pdf/2001.01028v1.pdf
PWC https://paperswithcode.com/paper/visual-semantic-slam-with-landmarks-for-large
Repo https://github.com/1989Ryan/Semantic_SLAM
Framework tf
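
Illustrative only (not the authors’ pipeline): one simple way to attach semantic labels to SLAM map points is to look up the segmentation class at each pixel where a point was observed and take a majority vote across observations. The data layout below is an assumption.

```python
import numpy as np
from collections import Counter, defaultdict

def label_map_points(observations, seg_maps):
    """observations: list of (point_id, frame_id, u, v) pixel observations.
    seg_maps: dict frame_id -> HxW array of class ids (e.g. PSPNet output)."""
    votes = defaultdict(list)
    for pid, fid, u, v in observations:
        votes[pid].append(int(seg_maps[fid][v, u]))
    # majority vote over all observations of each map point
    return {pid: Counter(v).most_common(1)[0][0] for pid, v in votes.items()}

seg = {0: np.zeros((4, 4), dtype=int)}
print(label_map_points([(7, 0, 1, 2), (7, 0, 3, 3)], seg))  # {7: 0}
```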

Lookahead: a Far-Sighted Alternative of Magnitude-based Pruning

Title Lookahead: a Far-Sighted Alternative of Magnitude-based Pruning
Authors Sejun Park, Jaeho Lee, Sangwoo Mo, Jinwoo Shin
Abstract Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants have demonstrated remarkable performance for pruning modern architectures. Based on the observation that magnitude-based pruning in fact minimizes the Frobenius distortion of the linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single-layer optimization to a multi-layer optimization. Our experimental results demonstrate that the proposed method consistently outperforms magnitude-based pruning on various networks, including VGG and ResNet, particularly in the high-sparsity regime. Code is available at https://github.com/alinlab/lookahead_pruning.
Tasks
Published 2020-02-12
URL https://arxiv.org/abs/2002.04809v1
PDF https://arxiv.org/pdf/2002.04809v1.pdf
PWC https://paperswithcode.com/paper/lookahead-a-far-sighted-alternative-of-1
Repo https://github.com/alinlab/lookahead_pruning
Framework pytorch
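
A sketch of the multi-layer (“lookahead”) scoring idea for fully-connected layers, contrasted with plain magnitude scores. The exact formulation is in the paper; the specific score below (weight magnitude times the norms of the previous-layer row and next-layer column it connects to) is an assumption of this sketch.

```python
import numpy as np

def magnitude_scores(W):
    return np.abs(W)

def lookahead_scores(W_prev, W, W_next):
    # W: (out, in); W_prev: (in, in_prev); W_next: (out_next, out)
    in_norms = np.linalg.norm(W_prev, axis=1)    # norm of each previous-layer row
    out_norms = np.linalg.norm(W_next, axis=0)   # norm of each next-layer column
    return np.abs(W) * out_norms[:, None] * in_norms[None, :]

def prune(W, scores, sparsity=0.9):
    k = int(W.size * sparsity)
    thresh = np.sort(scores, axis=None)[k]
    return np.where(scores > thresh, W, 0.0)     # zero out the lowest-scored weights

rng = np.random.default_rng(0)
Wp, W, Wn = rng.normal(size=(8, 4)), rng.normal(size=(16, 8)), rng.normal(size=(4, 16))
print((prune(W, lookahead_scores(Wp, W, Wn)) != 0).mean())  # roughly 10% of weights kept
```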

Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation

Title Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation
Authors István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe
Abstract Heatmap representations have formed the basis of 2D human pose estimation systems for many years, but their generalizations for 3D pose have only recently been considered. This includes 2.5D volumetric heatmaps, whose X and Y axes correspond to image space and the Z axis to metric depth around the subject. To obtain metric-scale predictions, these methods must include a separate, explicit post-processing step to resolve scale ambiguity. Further, they cannot encode body joint positions outside of the image boundaries, leading to incomplete pose estimates in case of image truncation. We address these limitations by proposing metric-scale truncation-robust (MeTRo) volumetric heatmaps, whose dimensions are defined in metric 3D space near the subject, instead of being aligned with image space. We train a fully-convolutional network to estimate such heatmaps from monocular RGB in an end-to-end manner. This reinterpretation of the heatmap dimensions allows us to estimate complete metric-scale poses without test-time knowledge of the focal length or person distance and without relying on anthropometric heuristics in post-processing. Furthermore, as the image space is decoupled from the heatmap space, the network can learn to reason about joints beyond the image boundary. Using ResNet-50 without any additional learned layers, we obtain state-of-the-art results on the Human3.6M and MPI-INF-3DHP benchmarks. As our method is simple and fast, it can become a useful component for real-time top-down multi-person pose estimation systems. We make our code publicly available to facilitate further research (see https://vision.rwth-aachen.de/metro-pose3d).
Tasks 3D Human Pose Estimation, Multi-Person Pose Estimation, Pose Estimation
Published 2020-03-05
URL https://arxiv.org/abs/2003.02953v1
PDF https://arxiv.org/pdf/2003.02953v1.pdf
PWC https://paperswithcode.com/paper/metric-scale-truncation-robust-heatmaps-for
Repo https://github.com/isarandi/metro-pose3d
Framework tf
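
A minimal sketch of reading out metric-scale joint coordinates from a volumetric heatmap whose axes span a fixed metric cube around the subject, via a soft-argmax expectation. The cube extent below is an arbitrary assumption, not the paper’s value.

```python
import torch

def soft_argmax_metric(heatmaps, cube_extent_m=2.2):
    """heatmaps: (J, D, H, W) unnormalized scores for J joints."""
    J, D, H, W = heatmaps.shape
    probs = heatmaps.reshape(J, -1).softmax(dim=1).reshape(J, D, H, W)
    # coordinate grids in metres, centered on the cube
    zs = torch.linspace(-cube_extent_m / 2, cube_extent_m / 2, D)
    ys = torch.linspace(-cube_extent_m / 2, cube_extent_m / 2, H)
    xs = torch.linspace(-cube_extent_m / 2, cube_extent_m / 2, W)
    z = (probs.sum(dim=(2, 3)) * zs).sum(dim=1)   # expectation along each axis
    y = (probs.sum(dim=(1, 3)) * ys).sum(dim=1)
    x = (probs.sum(dim=(1, 2)) * xs).sum(dim=1)
    return torch.stack([x, y, z], dim=1)          # (J, 3) joint positions in metres

print(soft_argmax_metric(torch.randn(17, 8, 8, 8)).shape)  # torch.Size([17, 3])
```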

Adversarial System Variant Approximation to Quantify Process Model Generalization

Title Adversarial System Variant Approximation to Quantify Process Model Generalization
Authors Julian Theis, Houshang Darabi
Abstract In process mining, process models are extracted from event logs using process discovery algorithms and are commonly assessed using multiple quality dimensions. While the metrics that measure the relationship of an extracted process model to its event log are well-studied, quantifying the level by which a process model can describe the unobserved behavior of its underlying system falls short in the literature. In this paper, a novel deep learning-based methodology called Adversarial System Variant Approximation (AVATAR) is proposed to overcome this issue. Sequence Generative Adversarial Networks are trained on the variants contained in an event log with the intention to approximate the underlying variant distribution of the system behavior. Unobserved realistic variants are sampled either directly from the Sequence Generative Adversarial Network or by leveraging the Metropolis-Hastings algorithm. The degree by which a process model relates to its underlying unknown system behavior is then quantified based on the realistic observed and estimated unobserved variants using established process model quality metrics. Significant performance improvements in revealing realistic unobserved variants are demonstrated in a controlled experiment on 15 ground truth systems. Additionally, the proposed methodology is experimentally tested and evaluated to quantify the generalization of 60 discovered process models with respect to their systems.
Tasks
Published 2020-03-26
URL https://arxiv.org/abs/2003.12168v1
PDF https://arxiv.org/pdf/2003.12168v1.pdf
PWC https://paperswithcode.com/paper/adversarial-system-variant-approximation-to
Repo https://github.com/ProminentLab/AVATAR
Framework none
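
The abstract mentions sampling unobserved variants either directly from the Sequence GAN or via Metropolis-Hastings. The sketch below shows one generic way to use a discriminator score in an MH acceptance step (an MH-GAN-style rule, assumed here purely for illustration; the paper’s exact procedure may differ).

```python
import random

def mh_sample_variants(generate, discriminate, steps=1000):
    """generate(): propose a variant (e.g. an activity sequence).
    discriminate(x): probability in (0, 1) that x is a real system variant."""
    accepted = []
    current = generate()
    for _ in range(steps):
        proposal = generate()
        d_cur, d_new = discriminate(current), discriminate(proposal)
        # density-ratio estimate p_real / p_gen = D / (1 - D)
        ratio = (d_new / (1 - d_new)) / (d_cur / (1 - d_cur))
        if random.random() < min(1.0, ratio):
            current = proposal
        accepted.append(tuple(current))
    return set(accepted)

variants = mh_sample_variants(
    generate=lambda: random.choice([("a", "b", "c"), ("a", "c", "b"), ("a", "b")]),
    discriminate=lambda x: 0.9 if x == ("a", "b", "c") else 0.4,
)
print(variants)
```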

Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models

Title Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models
Authors Xiao Zhang, Jinghui Chen, Quanquan Gu, David Evans
Abstract Starting with Gilmer et al. (2018), several works have demonstrated the inevitability of adversarial examples based on different assumptions about the underlying input probability space. It remains unclear, however, whether these results apply to natural image distributions. In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). Building upon the state-of-the-art conditional generative models, we study the intrinsic robustness of two common image benchmarks under $\ell_2$ perturbations, and show the existence of a large gap between the robustness limits implied by our theory and the adversarial robustness achieved by current state-of-the-art robust models. Code for all our experiments is available at https://github.com/xiaozhanguva/Intrinsic-Rob.
Tasks
Published 2020-03-01
URL https://arxiv.org/abs/2003.00378v1
PDF https://arxiv.org/pdf/2003.00378v1.pdf
PWC https://paperswithcode.com/paper/understanding-the-intrinsic-robustness-of
Repo https://github.com/xiaozhanguva/Intrinsic-Rob
Framework pytorch

Replacing Mobile Camera ISP with a Single Deep Learning Model

Title Replacing Mobile Camera ISP with a Single Deep Learning Model
Authors Andrey Ignatov, Luc Van Gool, Radu Timofte
Abstract As the popularity of mobile photography grows constantly, considerable effort is being invested in building complex hand-crafted camera ISP solutions. In this work, we demonstrate that even the most sophisticated ISP pipelines can be replaced with a single end-to-end deep learning model trained without any prior knowledge about the sensor and optics used in a particular device. For this, we present PyNET, a novel pyramidal CNN architecture designed for fine-grained image restoration that implicitly learns to perform all ISP steps such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The model is trained to convert RAW Bayer data obtained directly from the mobile camera sensor into photos captured with a professional high-end DSLR camera, making the solution independent of any particular mobile ISP implementation. To validate the proposed approach on real data, we collected a large-scale dataset consisting of 10 thousand full-resolution RAW-RGB image pairs captured in the wild with a Huawei P20 cameraphone (12.3 MP Sony Exmor IMX380 sensor) and a Canon 5D Mark IV DSLR. The experiments demonstrate that the proposed solution can reach the level of the embedded P20 ISP pipeline, which, unlike our approach, combines data from two (RGB + B/W) camera sensors. The dataset, pre-trained models and code used in this paper are available on the project website.
Tasks Demosaicking, Denoising, Image Restoration
Published 2020-02-13
URL https://arxiv.org/abs/2002.05509v1
PDF https://arxiv.org/pdf/2002.05509v1.pdf
PWC https://paperswithcode.com/paper/replacing-mobile-camera-isp-with-a-single
Repo https://github.com/aiff22/pynet
Framework tf
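
Illustrative preprocessing only (the Bayer pattern and sizes below are assumptions): RAW-to-RGB models of this kind typically pack the Bayer mosaic into a 4-channel half-resolution tensor before feeding it to the network.

```python
import numpy as np

def pack_bayer_rggb(raw):
    """raw: (H, W) Bayer mosaic with an assumed RGGB pattern -> (H/2, W/2, 4)."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return np.stack([r, g1, g2, b], axis=-1)

# synthetic 10-bit RAW frame, normalized to [0, 1]
raw = np.random.randint(0, 1024, size=(3000, 4000)).astype(np.float32) / 1023.0
print(pack_bayer_rggb(raw).shape)  # (1500, 2000, 4)
```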

Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning

Title Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning
Authors Simone Parisi, Davide Tateo, Maximilian Hensel, Carlo D’Eramo, Jan Peters, Joni Pajarinen
Abstract Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it will likely stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore
Tasks
Published 2020-01-01
URL https://arxiv.org/abs/2001.00119v1
PDF https://arxiv.org/pdf/2001.00119v1.pdf
PWC https://paperswithcode.com/paper/long-term-visitation-value-for-deep
Repo https://github.com/sparisi/visit-value-explore
Framework none
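
A tabular toy sketch of the decoupling idea in the abstract: alongside the reward Q-function, learn a separate “visitation value” W from an intrinsic signal derived from visit counts, and use it to pick exploration actions. The hyperparameters, the count-based bonus and the action rule below are assumptions, not the paper’s algorithm.

```python
import numpy as np

class VisitationAgent:
    def __init__(self, n_states, n_actions, gamma=0.99, lr=0.1, beta=1.0):
        self.Q = np.zeros((n_states, n_actions))   # extrinsic return estimate
        self.W = np.zeros((n_states, n_actions))   # long-term visitation value
        self.N = np.zeros((n_states, n_actions))   # visit counts
        self.gamma, self.lr, self.beta = gamma, lr, beta

    def act(self, s):
        # exploitation value plus exploration value, both far-sighted via bootstrapping
        return int(np.argmax(self.Q[s] + self.beta * self.W[s]))

    def update(self, s, a, r, s2):
        self.N[s, a] += 1
        bonus = 1.0 / np.sqrt(self.N[s, a])        # intrinsic, count-based signal
        self.Q[s, a] += self.lr * (r + self.gamma * self.Q[s2].max() - self.Q[s, a])
        self.W[s, a] += self.lr * (bonus + self.gamma * self.W[s2].max() - self.W[s, a])

agent = VisitationAgent(n_states=5, n_actions=2)
agent.update(0, 1, 0.0, 1)
print(agent.act(0))
```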

Simulating Lexical Semantic Change from Sense-Annotated Data

Title Simulating Lexical Semantic Change from Sense-Annotated Data
Authors Dominik Schlechtweg, Sabine Schulte im Walde
Abstract We present a novel procedure to simulate lexical semantic change from synchronic sense-annotated data, and demonstrate its usefulness for assessing lexical semantic change detection models. The induced dataset represents a stronger correspondence to empirically observed lexical semantic change than previous synthetic datasets, because it exploits the intimate relationship between synchronic polysemy and diachronic change. We publish the data and provide the first large-scale evaluation gold standard for LSC detection models.
Tasks
Published 2020-01-09
URL https://arxiv.org/abs/2001.03216v1
PDF https://arxiv.org/pdf/2001.03216v1.pdf
PWC https://paperswithcode.com/paper/simulating-lexical-semantic-change-from-sense
Repo https://github.com/Garrafao/LSCDetection
Framework none
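
A toy sketch (not the authors’ procedure) of the basic idea: take synchronic sense-annotated usages of a word and split them into two synthetic time periods with different sense distributions, simulating a semantic change. The two-sense setup and the probabilities are assumptions.

```python
import random

def simulate_change(usages, p_sense1_t1=0.9, p_sense1_t2=0.1, n_per_period=100):
    """usages: dict sense_id -> list of example sentences for one target word."""
    def sample_period(p_sense1, n):
        return [random.choice(usages[1] if random.random() < p_sense1 else usages[2])
                for _ in range(n)]
    # period 1 is dominated by sense 1, period 2 by sense 2
    return sample_period(p_sense1_t1, n_per_period), sample_period(p_sense1_t2, n_per_period)

usages = {1: ["the bank of the river"], 2: ["the bank approved the loan"]}
t1, t2 = simulate_change(usages)
print(sum("river" in s for s in t1), sum("river" in s for s in t2))  # roughly 90 vs 10
```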

Automatically Discovering and Learning New Visual Categories with Ranking Statistics

Title Automatically Discovering and Learning New Visual Categories with Ranking Statistics
Authors Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman
Abstract We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using only the labelled data introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model’s knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin.
Tasks
Published 2020-02-13
URL https://arxiv.org/abs/2002.05714v1
PDF https://arxiv.org/pdf/2002.05714v1.pdf
PWC https://paperswithcode.com/paper/automatically-discovering-and-learning-new-1
Repo https://github.com/k-han/AutoNovel
Framework pytorch
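
A sketch of the rank-statistics idea as described in the abstract (k and the pairwise rule below are assumptions of this sketch): two unlabelled images receive a positive pairwise pseudo-label when the top-k dimensions of their feature vectors coincide.

```python
import torch

def rank_stat_pseudo_labels(features, k=5):
    """features: (N, D) embeddings. Returns an (N, N) 0/1 pairwise label matrix."""
    topk = features.topk(k, dim=1).indices                      # (N, k) ranked dimensions
    sets = [set(row.tolist()) for row in topk]
    n = features.size(0)
    labels = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            labels[i, j] = float(sets[i] == sets[j])            # 1 if same top-k dimensions
    return labels

print(rank_stat_pseudo_labels(torch.randn(4, 128)))
```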

Identifying the Development and Application of Artificial Intelligence in Scientific Text

Title Identifying the Development and Application of Artificial Intelligence in Scientific Text
Authors James Dunham, Jennifer Melot, Dewey Murdick
Abstract We describe a strategy for identifying the universe of research publications relating to the application and development of artificial intelligence. The approach leverages arXiv’s corpus of scientific preprints, in which authors choose subject tags for their papers from a set defined by editors. We compose from these subjects a functional definition of AI-relevance with intuitive components, by learning the subject definitions from paper metadata, and then inferring the arXiv-subject labels of papers in Web of Science. We find predictive classification $F_1$ scores between .59 and .86 for AI-relevant subject models. For an all-subjects model, we see precision of .83 and recall of .85. We evaluate the out-of-domain performance of our classifiers against other sources of subject information and results from other methods. We find that for the high-level fields of study represented on arXiv, a supervised solution can generalize for inference in other corpora. This offers a method for identifying AI-relevant publications that updates at the pace of research output, without reliance on subject-matter experts for query development or labeling.
Tasks
Published 2020-02-17
URL https://arxiv.org/abs/2002.07143v1
PDF https://arxiv.org/pdf/2002.07143v1.pdf
PWC https://paperswithcode.com/paper/identifying-the-development-and-application
Repo https://github.com/georgetown-cset/ai-relevant-papers
Framework pytorch
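
A deliberately simple stand-in for the workflow the abstract describes (TF-IDF plus logistic regression; the paper’s actual classifiers may differ): learn an AI-relevance label from arXiv paper text, then score out-of-domain records such as Web of Science entries.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy training set: arXiv-style texts with subject tags collapsed to AI-relevance
arxiv_texts = ["we train a convolutional network for image classification",
               "we prove bounds on the mixing time of a markov chain"]
is_ai_subject = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(arxiv_texts, is_ai_subject)

wos_record = ["a deep learning approach to object detection in aerial imagery"]
print(clf.predict_proba(wos_record)[0, 1])   # estimated probability of AI-relevance
```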

PONAS: Progressive One-shot Neural Architecture Search for Very Efficient Deployment

Title PONAS: Progressive One-shot Neural Architecture Search for Very Efficient Deployment
Authors Sian-Yao Huang, Wei-Ta Chu
Abstract We achieve very efficient deep learning model deployment by designing neural network architectures that fit different hardware constraints. Given a constraint, most neural architecture search (NAS) methods either sample a set of sub-networks according to a pre-trained accuracy predictor, or adopt an evolutionary algorithm to evolve specialized networks from a supernet. Both approaches are time consuming. Our key idea for very efficient deployment is to construct, while searching the architecture space, a table that stores the validation accuracy of all candidate blocks at all layers. For a stricter hardware constraint, the architecture of a specialized network can then be determined very efficiently from this table by picking the candidate blocks that yield the least accuracy loss. To accomplish this idea, we propose Progressive One-shot Neural Architecture Search (PONAS), which combines advantages of progressive NAS and one-shot methods. In PONAS, we propose a two-stage training scheme, consisting of a meta-training stage and a fine-tuning stage, to make the search process efficient and stable. During search, we evaluate candidate blocks in different layers and construct the accuracy table used at deployment time. Comprehensive experiments verify that PONAS is extremely flexible and is able to find the architecture of a specialized network in around 10 seconds. On ImageNet classification, 75.2% top-1 accuracy can be obtained, which is comparable with the state of the art.
Tasks Neural Architecture Search
Published 2020-03-11
URL https://arxiv.org/abs/2003.05112v1
PDF https://arxiv.org/pdf/2003.05112v1.pdf
PWC https://paperswithcode.com/paper/ponas-progressive-one-shot-neural
Repo https://github.com/eric8607242/PONAS
Framework pytorch
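
A sketch of the deployment-time lookup the abstract describes: given a per-layer table of candidate blocks with an accuracy proxy and a latency, pick for each layer the candidate that loses the least accuracy while meeting a hardware budget. The greedy scheme below is an illustrative assumption, not the paper’s exact procedure.

```python
def specialize(table, latency_budget):
    """table: list over layers; each entry is a list of (acc_proxy, latency) per block."""
    # start from the most accurate block in every layer
    choice = [max(range(len(layer)), key=lambda b: layer[b][0]) for layer in table]

    def total_latency():
        return sum(table[l][b][1] for l, b in enumerate(choice))

    while total_latency() > latency_budget:
        # swap in the faster block whose replacement costs the least accuracy
        best = None
        for l, layer in enumerate(table):
            for b, (acc, lat) in enumerate(layer):
                if lat < table[l][choice[l]][1]:
                    drop = table[l][choice[l]][0] - acc
                    if best is None or drop < best[0]:
                        best = (drop, l, b)
        if best is None:
            break                      # budget not reachable with this table
        _, l, b = best
        choice[l] = b
    return choice

table = [[(0.75, 9), (0.74, 5)], [(0.72, 8), (0.70, 4)]]
print(specialize(table, latency_budget=10))   # e.g. [1, 1]
```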

Learning Group Structure and Disentangled Representations of Dynamical Environments

Title Learning Group Structure and Disentangled Representations of Dynamical Environments
Authors Robin Quessard, Thomas D. Barrett, William R. Clements
Abstract Discovering the underlying structure of a dynamical environment involves learning representations that are interpretable and disentangled, which is a challenging task. In physics, interpretable representations of our universe and its underlying dynamics are formulated in terms of representations of groups of symmetry transformations. We propose a physics-inspired method, built upon the theory of group representation, that learns a representation of an environment structured around the transformations that generate its evolution. Experimentally, we learn the structure of explicitly symmetric environments without supervision while ensuring the interpretability of the representations. We show that the learned representations allow for accurate long-horizon predictions and further demonstrate a correlation between the quality of predictions and disentanglement in the latent space.
Tasks
Published 2020-02-17
URL https://arxiv.org/abs/2002.06991v1
PDF https://arxiv.org/pdf/2002.06991v1.pdf
PWC https://paperswithcode.com/paper/learning-group-structure-and-disentangled
Repo https://github.com/IndustAI/learning-group-structure
Framework pytorch
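
A sketch of the representation-theoretic idea in the abstract (the parametrization below is an assumption of this sketch): assign each discrete action a learnable latent rotation, built from per-block angles, and apply it to the current latent state to predict the next one.

```python
import torch
import torch.nn as nn

class ActionRotations(nn.Module):
    def __init__(self, n_actions, latent_dim=4):
        super().__init__()
        assert latent_dim % 2 == 0
        # one rotation angle per 2D block, per action
        self.angles = nn.Parameter(torch.zeros(n_actions, latent_dim // 2))

    def matrix(self, action):
        blocks = []
        for theta in self.angles[action]:
            c, s = torch.cos(theta), torch.sin(theta)
            blocks.append(torch.stack([torch.stack([c, -s]), torch.stack([s, c])]))
        return torch.block_diag(*blocks)          # orthogonal by construction

    def forward(self, z, action):
        return self.matrix(action) @ z            # predicted next latent state

model = ActionRotations(n_actions=4, latent_dim=4)
z_next = model(torch.randn(4), action=2)
print(z_next.shape)   # torch.Size([4])
```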

An Implicit Attention Mechanism for Deep Learning Pedestrian Re-identification Frameworks

Title An Implicit Attention Mechanism for Deep Learning Pedestrian Re-identification Frameworks
Authors Ehsan Yaghoubi, Diana Borza, Aruna Kumar, Hugo Proença
Abstract Attention is defined as the preparedness for the mental selection of certain aspects in a physical environment. In the computer vision domain, this mechanism is of great interest, as it helps to identify the segments of an image/video that are critical for a specific decision. This paper introduces an ‘implicit’ attentional mechanism for deep learning frameworks that simultaneously provides: 1) mask-free and 2) foreground-focused samples for the inference phase. The main idea is to generate synthetic data composed of interleaved segments from the original learning set, while using class information only from specific segments. During the learning phase, the newly generated samples feed the network, with their labels kept exclusively consistent with the identity from which the region-of-interest was cropped. Hence, as the model receives images of each identity with inconsistent unwanted areas, it naturally pays most attention to the label-consistent regions, which we observed to be equivalent to learning an effective receptive field. During the test phase, samples are provided without any mask, and the network naturally disregards the detrimental information, which explains the observed improvements in performance. As a proof of concept, we consider the challenging problem of pedestrian re-identification and compare the effectiveness of our solution to state-of-the-art techniques on the well-known Richly Annotated Pedestrian (RAP) dataset. The code is available at https://github.com/Ehsan-Yaghoubi/reid-strong-baseline.
Tasks
Published 2020-01-30
URL https://arxiv.org/abs/2001.11267v2
PDF https://arxiv.org/pdf/2001.11267v2.pdf
PWC https://paperswithcode.com/paper/an-implicit-attention-mechanism-for-deep
Repo https://github.com/Ehsan-Yaghoubi/reid-strong-baseline
Framework pytorch
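
A toy sketch of the data-synthesis idea in the abstract (the stripe layout and sizes are assumptions): interleave segments from two training images while keeping only the identity label of the image that supplies the region-of-interest segments.

```python
import numpy as np

def interleave_segments(roi_img, other_img, roi_label, n_strips=8):
    """Both images: (H, W, 3). Even horizontal strips come from the ROI image,
    odd strips from the other identity; the label stays with the ROI identity."""
    mixed = roi_img.copy()
    strip_h = roi_img.shape[0] // n_strips
    for k in range(1, n_strips, 2):
        mixed[k * strip_h:(k + 1) * strip_h] = other_img[k * strip_h:(k + 1) * strip_h]
    return mixed, roi_label        # label kept consistent with the ROI identity only

a = np.zeros((256, 128, 3), dtype=np.uint8)
b = np.full((256, 128, 3), 255, dtype=np.uint8)
img, label = interleave_segments(a, b, roi_label=17)
print(img.mean(), label)   # roughly half white, label 17
```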