Paper Group AWR 69
TResNet: High Performance GPU-Dedicated Architecture
Title | TResNet: High Performance GPU-Dedicated Architecture |
Authors | Tal Ridnik, Hussam Lawen, Asaf Noy, Itamar Friedman |
Abstract | Many deep learning models developed in recent years reach higher ImageNet accuracy than ResNet50, with a lower or comparable FLOPs count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering a better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks’ accuracy, while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieves better accuracy and efficiency than previous ConvNets. Using a TResNet model with GPU throughput similar to ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive datasets such as Stanford Cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). Implementation is available at: https://github.com/mrT23/TResNet |
Tasks | Fine-Grained Image Classification, Image Classification |
Published | 2020-03-30 |
URL | https://arxiv.org/abs/2003.13630v1 |
https://arxiv.org/pdf/2003.13630v1.pdf | |
PWC | https://paperswithcode.com/paper/tresnet-high-performance-gpu-dedicated |
Repo | https://github.com/mrT23/TResNet |
Framework | pytorch |
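The FLOPs-versus-throughput gap discussed in the abstract is easy to check empirically. Below is a minimal, hedged sketch that measures actual GPU inference throughput (images per second) for a torchvision ResNet50 as a stand-in; it is not the TResNet benchmarking code, and the batch size and iteration counts are arbitrary choices.

```python
# Minimal GPU throughput benchmark (images/sec); requires a CUDA-capable GPU.
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval().cuda()
batch = torch.randn(64, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()

elapsed = time.time() - start
print(f"throughput: {iters * batch.size(0) / elapsed:.1f} img/s")
```

Running the same loop on a FLOPs-optimized competitor often reports fewer images per second despite the lower FLOPs count, which is the trade-off TResNet targets.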
Scene Text Recognition via Transformer
Title | Scene Text Recognition via Transformer |
Authors | Xinjie Feng, Hongxun Yao, Yuankai Qi, Jun Zhang, Shengping Zhang |
Abstract | Scene text recognition with arbitrary shape is very challenging due to large variations in text shapes, fonts, colors, backgrounds, etc. Most state-of-the-art algorithms rectify the input image into a normalized image and then treat recognition as a sequence prediction task. The bottleneck of such methods is the rectification step, which introduces errors due to perspective distortion. In this paper, we find that rectification is completely unnecessary; all we need is spatial attention. We therefore propose a simple but extremely effective scene text recognition method based on the transformer [50]. Different from previous transformer-based models [56,34], which use only the decoder of the transformer to decode the convolutional attention, the proposed method feeds convolutional feature maps as word embeddings into the transformer. In this way, our method makes full use of the transformer’s powerful attention mechanism. Extensive experimental results show that the proposed method outperforms state-of-the-art methods by a large margin on both regular and irregular text datasets. On CUTE, one of the most challenging datasets, where the state-of-the-art accuracy is 89.6%, our method achieves 99.3%. We will release our source code and believe that our method will serve as a new baseline for scene text recognition with arbitrary shapes. |
Tasks | Scene Text Recognition |
Published | 2020-03-18 |
URL | https://arxiv.org/abs/2003.08077v2 |
https://arxiv.org/pdf/2003.08077v2.pdf | |
PWC | https://paperswithcode.com/paper/scene-text-recognition-via-transformer |
Repo | https://github.com/fengxinjie/Transformer-OCR |
Framework | pytorch |
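As a rough illustration of feeding convolutional feature maps into a transformer as word embeddings, here is a hedged PyTorch sketch. The tiny backbone, model sizes, and the omission of positional encodings and masking are simplifications for brevity, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ConvTransformerOCR(nn.Module):
    def __init__(self, vocab_size=100, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(        # toy CNN; the paper uses a much deeper backbone
            nn.Conv2d(3, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, images, target_chars):
        feat = self.backbone(images)           # (B, C, H, W)
        src = feat.flatten(2).transpose(1, 2)  # (B, H*W, C): feature-map "tokens"
        tgt = self.char_embed(target_chars)    # (B, T, C)
        out = self.transformer(src, tgt)       # (B, T, C)
        return self.classifier(out)            # (B, T, vocab)

logits = ConvTransformerOCR()(torch.randn(2, 3, 32, 128), torch.zeros(2, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 10, 100])
```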
Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment
Title | Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment |
Authors | Zirui Zhao, Yijun Mao, Yan Ding, Pengju Ren, Nanning Zheng |
Abstract | Semantic SLAM is an important field in autonomous driving and intelligent agents, as it enables robots to perform high-level navigation tasks, obtain simple cognition or reasoning ability, and achieve language-based human-robot interaction. In this paper, we build a system that creates a semantic 3D map for large-scale environments by combining the 3D point cloud from ORB-SLAM with semantic segmentation information from the convolutional neural network model PSPNet-101. In addition, we build a new dataset for the KITTI sequences, which contains GPS information and labels of landmarks from Google Maps for the streets covered by the sequences. Moreover, we present a way to associate real-world landmarks with the point cloud map and build a topological map based on the semantic map. |
Tasks | Autonomous Driving, Semantic Segmentation |
Published | 2020-01-04 |
URL | https://arxiv.org/abs/2001.01028v1 |
https://arxiv.org/pdf/2001.01028v1.pdf | |
PWC | https://paperswithcode.com/paper/visual-semantic-slam-with-landmarks-for-large |
Repo | https://github.com/1989Ryan/Semantic_SLAM |
Framework | tf |
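The fusion of geometry and semantics described above amounts to projecting each map point into the camera frame and reading the class id from the segmentation mask. The sketch below illustrates that step only; the intrinsics, mask size, and class ids are placeholder values, not the paper's configuration.

```python
import numpy as np

def label_points(points_cam, seg_mask, fx, fy, cx, cy):
    """points_cam: (N, 3) points in camera coordinates; seg_mask: (H, W) of class ids."""
    H, W = seg_mask.shape
    z = points_cam[:, 2]
    u = (fx * points_cam[:, 0] / z + cx).astype(int)
    v = (fy * points_cam[:, 1] / z + cy).astype(int)
    labels = np.full(len(points_cam), -1)                 # -1 = not visible in this frame
    visible = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[visible] = seg_mask[v[visible], u[visible]]
    return labels

pts = np.random.randn(1000, 3) + np.array([0.0, 0.0, 5.0])   # toy point cloud in front of the camera
mask = np.random.randint(0, 19, size=(370, 1226))            # placeholder segmentation output
print(label_points(pts, mask, fx=718.9, fy=718.9, cx=607.2, cy=185.2)[:10])
```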
Lookahead: a Far-Sighted Alternative of Magnitude-based Pruning
Title | Lookahead: a Far-Sighted Alternative of Magnitude-based Pruning |
Authors | Sejun Park, Jaeho Lee, Sangwoo Mo, Jinwoo Shin |
Abstract | Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants have demonstrated remarkable performance for pruning modern architectures. Based on the observation that magnitude-based pruning in fact minimizes the Frobenius distortion of a linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single-layer optimization to a multi-layer optimization. Our experimental results demonstrate that the proposed method consistently outperforms magnitude-based pruning on various networks, including VGG and ResNet, particularly in the high-sparsity regime. See https://github.com/alinlab/lookahead_pruning for code. |
Tasks | |
Published | 2020-02-12 |
URL | https://arxiv.org/abs/2002.04809v1 |
https://arxiv.org/pdf/2002.04809v1.pdf | |
PWC | https://paperswithcode.com/paper/lookahead-a-far-sighted-alternative-of-1 |
Repo | https://github.com/alinlab/lookahead_pruning |
Framework | pytorch |
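The multi-layer extension can be pictured with three consecutive linear layers: score each weight of the middle layer by its own magnitude multiplied by the norms of the weights it connects to in the neighbouring layers, and prune the lowest scores. The sketch below is a simplified reading of that idea, not the authors' implementation.

```python
import torch

def lookahead_scores(A, W, B):
    """A: (d1, d0) previous layer, W: (d2, d1) layer being pruned, B: (d3, d2) next layer."""
    in_norm = A.norm(dim=1)             # (d1,) norm of weights feeding each input unit of W
    out_norm = B.norm(dim=0)            # (d2,) norm of weights reading each output unit of W
    return W.abs() * out_norm[:, None] * in_norm[None, :]

def prune(W, scores, sparsity=0.9):
    k = int(sparsity * W.numel())
    threshold = scores.flatten().kthvalue(k).values
    return W * (scores > threshold)     # zero out the lowest-scoring weights

A, W, B = torch.randn(64, 32), torch.randn(128, 64), torch.randn(10, 128)
W_pruned = prune(W, lookahead_scores(A, W, B))
print((W_pruned == 0).float().mean())   # ~0.9
```

Plain magnitude-based pruning corresponds to dropping the two norm factors and ranking by `W.abs()` alone.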
Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation
Title | Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation |
Authors | István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe |
Abstract | Heatmap representations have formed the basis of 2D human pose estimation systems for many years, but their generalizations for 3D pose have only recently been considered. This includes 2.5D volumetric heatmaps, whose X and Y axes correspond to image space and the Z axis to metric depth around the subject. To obtain metric-scale predictions, these methods must include a separate, explicit post-processing step to resolve scale ambiguity. Further, they cannot encode body joint positions outside of the image boundaries, leading to incomplete pose estimates in case of image truncation. We address these limitations by proposing metric-scale truncation-robust (MeTRo) volumetric heatmaps, whose dimensions are defined in metric 3D space near the subject, instead of being aligned with image space. We train a fully-convolutional network to estimate such heatmaps from monocular RGB in an end-to-end manner. This reinterpretation of the heatmap dimensions allows us to estimate complete metric-scale poses without test-time knowledge of the focal length or person distance and without relying on anthropometric heuristics in post-processing. Furthermore, as the image space is decoupled from the heatmap space, the network can learn to reason about joints beyond the image boundary. Using ResNet-50 without any additional learned layers, we obtain state-of-the-art results on the Human3.6M and MPI-INF-3DHP benchmarks. As our method is simple and fast, it can become a useful component for real-time top-down multi-person pose estimation systems. We make our code publicly available to facilitate further research (see https://vision.rwth-aachen.de/metro-pose3d). |
Tasks | 3D Human Pose Estimation, Multi-Person Pose Estimation, Pose Estimation |
Published | 2020-03-05 |
URL | https://arxiv.org/abs/2003.02953v1 |
https://arxiv.org/pdf/2003.02953v1.pdf | |
PWC | https://paperswithcode.com/paper/metric-scale-truncation-robust-heatmaps-for |
Repo | https://github.com/isarandi/metro-pose3d |
Framework | tf |
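The key readout step, recovering metric joint coordinates from a heatmap whose axes live in metric space around the subject rather than in image space, can be sketched as a soft-argmax over a metric grid. The grid extent and resolution below are illustrative assumptions.

```python
import torch

def metric_soft_argmax(heatmap, extent_m=2.0):
    """heatmap: (J, D, H, W) raw scores; returns (J, 3) joint coordinates in metres."""
    J, D, H, W = heatmap.shape
    probs = heatmap.flatten(1).softmax(dim=1).view(J, D, H, W)
    # metric grid centred on the subject, spanning [-extent/2, extent/2] along each axis
    zs = torch.linspace(-extent_m / 2, extent_m / 2, D)
    ys = torch.linspace(-extent_m / 2, extent_m / 2, H)
    xs = torch.linspace(-extent_m / 2, extent_m / 2, W)
    x = (probs.sum(dim=(1, 2)) * xs).sum(dim=1)   # expectation along each metric axis
    y = (probs.sum(dim=(1, 3)) * ys).sum(dim=1)
    z = (probs.sum(dim=(2, 3)) * zs).sum(dim=1)
    return torch.stack([x, y, z], dim=1)

print(metric_soft_argmax(torch.randn(17, 16, 16, 16)).shape)  # torch.Size([17, 3])
```

Because the grid is metric rather than pixel-aligned, coordinates outside the image crop remain representable, which is what makes the approach truncation-robust.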
Adversarial System Variant Approximation to Quantify Process Model Generalization
Title | Adversarial System Variant Approximation to Quantify Process Model Generalization |
Authors | Julian Theis, Houshang Darabi |
Abstract | In process mining, process models are extracted from event logs using process discovery algorithms and are commonly assessed using multiple quality dimensions. While the metrics that measure the relationship of an extracted process model to its event log are well-studied, quantifying the extent to which a process model can describe the unobserved behavior of its underlying system has received little attention in the literature. In this paper, a novel deep learning-based methodology called Adversarial System Variant Approximation (AVATAR) is proposed to overcome this issue. Sequence Generative Adversarial Networks are trained on the variants contained in an event log with the intention to approximate the underlying variant distribution of the system behavior. Unobserved realistic variants are sampled either directly from the Sequence Generative Adversarial Network or by leveraging the Metropolis-Hastings algorithm. The degree to which a process model relates to its underlying unknown system behavior is then quantified based on the realistic observed and estimated unobserved variants using established process model quality metrics. Significant performance improvements in revealing realistic unobserved variants are demonstrated in a controlled experiment on 15 ground truth systems. Additionally, the proposed methodology is experimentally tested and evaluated to quantify the generalization of 60 discovered process models with respect to their systems. |
Tasks | |
Published | 2020-03-26 |
URL | https://arxiv.org/abs/2003.12168v1 |
https://arxiv.org/pdf/2003.12168v1.pdf | |
PWC | https://paperswithcode.com/paper/adversarial-system-variant-approximation-to |
Repo | https://github.com/ProminentLab/AVATAR |
Framework | none |
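For the Metropolis-Hastings sampling step mentioned above, a generic sketch is given below: propose a local, symmetric mutation of the current variant and accept it with the usual likelihood ratio. The alphabet, mutation rule, and `log_likelihood` scorer are hypothetical stand-ins, not AVATAR's SeqGAN components.

```python
import math
import random

ALPHABET = "ABCDE"                                   # toy activity labels

def mutate(variant):
    """Symmetric proposal: resample one position of the variant uniformly at random."""
    i = random.randrange(len(variant))
    return variant[:i] + (random.choice(ALPHABET),) + variant[i + 1:]

def log_likelihood(variant):
    # placeholder for a sequence model's log-probability of the variant
    return sum(-0.5 if a == b else -2.0 for a, b in zip(variant, variant[1:]))

def metropolis_hastings(start, n_steps=5000):
    current, variants = start, set()
    for _ in range(n_steps):
        proposal = mutate(current)
        log_ratio = log_likelihood(proposal) - log_likelihood(current)
        if random.random() < math.exp(min(0.0, log_ratio)):
            current = proposal                       # accept with probability min(1, ratio)
        variants.add(current)
    return variants                                  # distinct sampled variants

print(len(metropolis_hastings(tuple("ABCAB"))))
```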
Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models
Title | Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models |
Authors | Xiao Zhang, Jinghui Chen, Quanquan Gu, David Evans |
Abstract | Starting with Gilmer et al. (2018), several works have demonstrated the inevitability of adversarial examples based on different assumptions about the underlying input probability space. It remains unclear, however, whether these results apply to natural image distributions. In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). Building upon the state-of-the-art conditional generative models, we study the intrinsic robustness of two common image benchmarks under $\ell_2$ perturbations, and show the existence of a large gap between the robustness limits implied by our theory and the adversarial robustness achieved by current state-of-the-art robust models. Code for all our experiments is available at https://github.com/xiaozhanguva/Intrinsic-Rob. |
Tasks | |
Published | 2020-03-01 |
URL | https://arxiv.org/abs/2003.00378v1 |
https://arxiv.org/pdf/2003.00378v1.pdf | |
PWC | https://paperswithcode.com/paper/understanding-the-intrinsic-robustness-of |
Repo | https://github.com/xiaozhanguva/Intrinsic-Rob |
Framework | pytorch |
Replacing Mobile Camera ISP with a Single Deep Learning Model
Title | Replacing Mobile Camera ISP with a Single Deep Learning Model |
Authors | Andrey Ignatov, Luc Van Gool, Radu Timofte |
Abstract | As the popularity of mobile photography grows, considerable effort is being invested in building complex hand-crafted camera ISP solutions. In this work, we demonstrate that even the most sophisticated ISP pipelines can be replaced with a single end-to-end deep learning model trained without any prior knowledge about the sensor and optics used in a particular device. For this, we present PyNET, a novel pyramidal CNN architecture designed for fine-grained image restoration that implicitly learns to perform all ISP steps such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The model is trained to convert RAW Bayer data obtained directly from the mobile camera sensor into photos captured with a professional high-end DSLR camera, making the solution independent of any particular mobile ISP implementation. To validate the proposed approach on real data, we collected a large-scale dataset consisting of 10 thousand full-resolution RAW-RGB image pairs captured in the wild with the Huawei P20 cameraphone (12.3 MP Sony Exmor IMX380 sensor) and a Canon 5D Mark IV DSLR. The experiments demonstrate that the proposed solution can reach the level of the embedded P20 ISP pipeline, which, unlike our approach, combines the data from two (RGB + B/W) camera sensors. The dataset, pre-trained models and code used in this paper are available on the project website. |
Tasks | Demosaicking, Denoising, Image Restoration |
Published | 2020-02-13 |
URL | https://arxiv.org/abs/2002.05509v1 |
https://arxiv.org/pdf/2002.05509v1.pdf | |
PWC | https://paperswithcode.com/paper/replacing-mobile-camera-isp-with-a-single |
Repo | https://github.com/aiff22/pynet |
Framework | tf |
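Before a CNN can learn the RAW-to-DSLR mapping described above, the Bayer mosaic is typically packed into a multi-channel, half-resolution tensor. The sketch below assumes an RGGB layout, which is an assumption for illustration; actual sensor layouts and the PyNET input pipeline may differ.

```python
import numpy as np

def pack_bayer_rggb(raw):
    """raw: (H, W) Bayer frame with even H and W -> (H/2, W/2, 4) packed channels."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return np.stack([r, g1, g2, b], axis=-1)

raw = np.random.randint(0, 1023, size=(2976, 3968), dtype=np.uint16)  # placeholder full-resolution frame
print(pack_bayer_rggb(raw).shape)   # (1488, 1984, 4)
```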
Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning
Title | Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning |
Authors | Simone Parisi, Davide Tateo, Maximilian Hensel, Carlo D’Eramo, Jan Peters, Joni Pajarinen |
Abstract | Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on feedback via extrinsic rewards to train the agent, and in situations where such feedback occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it is likely to stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore |
Tasks | |
Published | 2020-01-01 |
URL | https://arxiv.org/abs/2001.00119v1 |
https://arxiv.org/pdf/2001.00119v1.pdf | |
PWC | https://paperswithcode.com/paper/long-term-visitation-value-for-deep |
Repo | https://github.com/sparisi/visit-value-explore |
Framework | none |
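The decoupling of exploration and exploitation can be pictured in the tabular case: keep the usual Q-function for extrinsic reward and a separate exploration-value function learned from a visitation-based bonus, then act greedily on their combination. The update rules and the 1/sqrt(count) bonus below are a simplified illustration, not the authors' exact algorithm.

```python
import numpy as np

n_states, n_actions = 20, 4
Q  = np.zeros((n_states, n_actions))      # exploitation value (extrinsic reward)
Wv = np.zeros((n_states, n_actions))      # exploration value (visitation-based)
counts = np.ones((n_states, n_actions))   # visitation counts
alpha, gamma, beta = 0.1, 0.95, 1.0

def select_action(s):
    return int(np.argmax(Q[s] + beta * Wv[s]))   # exploration handled by Wv, not by epsilon-greedy

def update(s, a, r, s_next):
    counts[s, a] += 1
    bonus = 1.0 / np.sqrt(counts[s, a])          # long-term novelty signal
    Q[s, a]  += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    Wv[s, a] += alpha * (bonus + gamma * Wv[s_next].max() - Wv[s, a])

update(s=0, a=select_action(0), r=0.0, s_next=1)
print(Q[0], Wv[0])
```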
Simulating Lexical Semantic Change from Sense-Annotated Data
Title | Simulating Lexical Semantic Change from Sense-Annotated Data |
Authors | Dominik Schlechtweg, Sabine Schulte im Walde |
Abstract | We present a novel procedure to simulate lexical semantic change from synchronic sense-annotated data, and demonstrate its usefulness for assessing lexical semantic change detection models. The induced dataset represents a stronger correspondence to empirically observed lexical semantic change than previous synthetic datasets, because it exploits the intimate relationship between synchronic polysemy and diachronic change. We publish the data and provide the first large-scale evaluation gold standard for LSC detection models. |
Tasks | |
Published | 2020-01-09 |
URL | https://arxiv.org/abs/2001.03216v1 |
https://arxiv.org/pdf/2001.03216v1.pdf | |
PWC | https://paperswithcode.com/paper/simulating-lexical-semantic-change-from-sense |
Repo | https://github.com/Garrafao/LSCDetection |
Framework | none |
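The simulation procedure can be pictured as sampling usages of a sense-annotated word with different sense frequencies in two synthetic time periods, so that the dominant sense shifts between them. The sense names and mixing weights below are made up for illustration, not the paper's induction protocol.

```python
import random

usages = {
    "sense_1": ["usage of sense 1 ..."] * 100,   # placeholder sense-annotated contexts
    "sense_2": ["usage of sense 2 ..."] * 100,
}

def sample_period(sense_weights, n=200):
    senses = random.choices(list(sense_weights), weights=list(sense_weights.values()), k=n)
    return [random.choice(usages[s]) for s in senses]

corpus_t1 = sample_period({"sense_1": 0.9, "sense_2": 0.1})   # sense_1 dominates earlier
corpus_t2 = sample_period({"sense_1": 0.2, "sense_2": 0.8})   # sense_2 takes over later
print(len(corpus_t1), len(corpus_t2))
```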
Automatically Discovering and Learning New Visual Categories with Ranking Statistics
Title | Automatically Discovering and Learning New Visual Categories with Ranking Statistics |
Authors | Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman |
Abstract | We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model’s knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin. |
Tasks | |
Published | 2020-02-13 |
URL | https://arxiv.org/abs/2002.05714v1 |
https://arxiv.org/pdf/2002.05714v1.pdf | |
PWC | https://paperswithcode.com/paper/automatically-discovering-and-learning-new-1 |
Repo | https://github.com/k-han/AutoNovel |
Framework | pytorch |
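The rank-statistics idea in point (2) can be sketched as a pairing rule: two unlabelled samples are given the same pseudo-label when the top-k ranked dimensions of their feature vectors coincide. The value of k and the feature dimensionality below are illustrative choices.

```python
import torch

def same_class_pseudo_label(feat_a, feat_b, k=5):
    """feat_*: (B, D) feature batches; returns a (B,) boolean pairwise pseudo-label."""
    top_a = feat_a.topk(k, dim=1).indices
    top_b = feat_b.topk(k, dim=1).indices
    # compare the top-k index sets (order-insensitive)
    return (top_a.sort(dim=1).values == top_b.sort(dim=1).values).all(dim=1)

a, b = torch.randn(8, 512), torch.randn(8, 512)
print(same_class_pseudo_label(a, b))   # mostly False for random features
```

These pairwise pseudo-labels then drive the clustering objective on the unlabelled subset.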
Identifying the Development and Application of Artificial Intelligence in Scientific Text
Title | Identifying the Development and Application of Artificial Intelligence in Scientific Text |
Authors | James Dunham, Jennifer Melot, Dewey Murdick |
Abstract | We describe a strategy for identifying the universe of research publications relating to the application and development of artificial intelligence. The approach leverages arXiv’s corpus of scientific preprints, in which authors choose subject tags for their papers from a set defined by editors. We compose from these subjects a functional definition of AI-relevance with intuitive components, by learning the subject definitions from paper metadata, and then inferring the arXiv-subject labels of papers in Web of Science. We find predictive classification $F_1$ scores between .59 and .86 for AI-relevant subject models. For an all-subjects model, we see precision of .83 and recall of .85. We evaluate the out-of-domain performance of our classifiers against other sources of subject information and results from other methods. We find that for the high-level fields of study represented on arXiv, a supervised solution can generalize for inference in other corpora. This offers a method for identifying AI-relevant publications that updates at the pace of research output, without reliance on subject-matter experts for query development or labeling. |
Tasks | |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.07143v1 |
https://arxiv.org/pdf/2002.07143v1.pdf | |
PWC | https://paperswithcode.com/paper/identifying-the-development-and-application |
Repo | https://github.com/georgetown-cset/ai-relevant-papers |
Framework | pytorch |
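As a generic stand-in for the setup described above, learning arXiv subject labels from paper text and then applying the classifier to another corpus, here is a hedged sketch. The model choice (TF-IDF plus logistic regression) and the toy examples are assumptions, not the authors' pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["deep reinforcement learning for robotic control",
               "photometric survey of galaxy clusters"]
train_labels = ["cs.LG", "astro-ph"]       # arXiv subject tags as supervision

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# inference on records from a different corpus (Web of Science style metadata)
print(clf.predict(["convolutional networks for image classification"]))
```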
PONAS: Progressive One-shot Neural Architecture Search for Very Efficient Deployment
Title | PONAS: Progressive One-shot Neural Architecture Search for Very Efficient Deployment |
Authors | Sian-Yao Huang, Wei-Ta Chu |
Abstract | We achieve very efficient deep learning model deployment by designing neural network architectures that fit different hardware constraints. Given a constraint, most neural architecture search (NAS) methods either sample a set of sub-networks according to a pre-trained accuracy predictor, or adopt an evolutionary algorithm to evolve specialized networks from the supernet. Both approaches are time-consuming. Our key idea for very efficient deployment is to construct, while searching the architecture space, a table that stores the validation accuracy of all candidate blocks at all layers. For a stricter hardware constraint, the architecture of a specialized network can then be determined very efficiently from this table by picking the candidate blocks that yield the least accuracy loss. To realize this idea, we propose Progressive One-shot Neural Architecture Search (PONAS), which combines the advantages of progressive NAS and one-shot methods. In PONAS, we propose a two-stage training scheme, consisting of a meta-training stage and a fine-tuning stage, to make the search process efficient and stable. During search, we evaluate candidate blocks in different layers and construct the accuracy table that is later used for deployment. Comprehensive experiments verify that PONAS is extremely flexible and is able to find the architecture of a specialized network in around 10 seconds. On ImageNet classification, 75.2% top-1 accuracy can be obtained, which is comparable with the state of the art. |
Tasks | Neural Architecture Search |
Published | 2020-03-11 |
URL | https://arxiv.org/abs/2003.05112v1 |
https://arxiv.org/pdf/2003.05112v1.pdf | |
PWC | https://paperswithcode.com/paper/ponas-progressive-one-shot-neural |
Repo | https://github.com/eric8607242/PONAS |
Framework | pytorch |
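The deployment-time lookup can be sketched as follows: start from the most accurate candidate block in every layer, then, while the hardware budget is exceeded, downgrade the layer whose cheaper block costs the least accuracy according to the table. The table values and the greedy rule below are illustrative, not the paper's exact procedure.

```python
# table[layer][block] = (accuracy_proxy, latency_ms) -- placeholder numbers
table = [
    {"mbconv_e6": (0.72, 3.0), "mbconv_e3": (0.71, 2.0), "skip": (0.65, 0.5)},
    {"mbconv_e6": (0.74, 4.0), "mbconv_e3": (0.72, 2.5), "skip": (0.66, 0.6)},
    {"mbconv_e6": (0.75, 5.0), "mbconv_e3": (0.74, 3.0), "skip": (0.70, 0.8)},
]

def search(table, latency_budget):
    choice = [max(layer, key=lambda b, l=layer: l[b][0]) for layer in table]  # best block per layer
    latency = lambda: sum(table[i][b][1] for i, b in enumerate(choice))
    while latency() > latency_budget:
        best = None                                  # (accuracy loss, layer index, block)
        for i, layer in enumerate(table):
            cur_acc, cur_lat = layer[choice[i]]
            for block, (acc, lat) in layer.items():
                if lat < cur_lat and (best is None or cur_acc - acc < best[0]):
                    best = (cur_acc - acc, i, block)
        if best is None:
            break                                    # budget unreachable
        choice[best[1]] = best[2]
    return choice, latency()

print(search(table, latency_budget=7.0))
```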
Learning Group Structure and Disentangled Representations of Dynamical Environments
Title | Learning Group Structure and Disentangled Representations of Dynamical Environments |
Authors | Robin Quessard, Thomas D. Barrett, William R. Clements |
Abstract | Discovering the underlying structure of a dynamical environment involves learning representations that are interpretable and disentangled, which is a challenging task. In physics, interpretable representations of our universe and its underlying dynamics are formulated in terms of representations of groups of symmetry transformations. We propose a physics-inspired method, built upon the theory of group representation, that learns a representation of an environment structured around the transformations that generate its evolution. Experimentally, we learn the structure of explicitly symmetric environments without supervision while ensuring the interpretability of the representations. We show that the learned representations allow for accurate long-horizon predictions and further demonstrate a correlation between the quality of predictions and disentanglement in the latent space. |
Tasks | |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.06991v1 |
https://arxiv.org/pdf/2002.06991v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-group-structure-and-disentangled |
Repo | https://github.com/IndustAI/learning-group-structure |
Framework | pytorch |
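The core idea, representing each environment transformation as a learned group element acting on the latent space, can be sketched with a single action modelled as a 2D rotation whose angle is fit by gradient descent. The toy environment and loss below are assumptions for illustration.

```python
import torch

def rotation(t):
    """2x2 rotation matrix for an angle tensor of shape (1,)."""
    return torch.stack([torch.cat([t.cos(), -t.sin()]), torch.cat([t.sin(), t.cos()])])

true_angle = torch.tensor([0.3])                  # the environment's hidden transformation
z = torch.randn(128, 2)                           # latent states
z_next = z @ rotation(true_angle).T               # observed effect of the action

theta = torch.zeros(1, requires_grad=True)        # learned group parameter for this action
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(300):
    loss = ((z @ rotation(theta).T - z_next) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(theta.item())                                # converges to roughly 0.3
```

Composing several such learned generators recovers the group structure of the environment's dynamics, which is what the paper exploits for long-horizon prediction.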
An Implicit Attention Mechanism for Deep Learning Pedestrian Re-identification Frameworks
Title | An Implicit Attention Mechanism for Deep Learning Pedestrian Re-identification Frameworks |
Authors | Ehsan Yaghoubi, Diana Borza, Aruna Kumar, Hugo Proença |
Abstract | Attention is defined as the preparedness for the mental selection of certain aspects in a physical environment. In the computer vision domain, this mechanism is of great interest, as it helps to define the segments of an image/video that are critical for obtaining a specific decision. This paper introduces an ‘implicit’ attentional mechanism for deep learning frameworks that simultaneously provides 1) mask-free and 2) foreground-focused samples for the inference phase. The main idea is to generate synthetic data composed of interleaved segments from the original learning set, while using class information only from specific segments. During the learning phase, the newly generated samples feed the network, keeping their label exclusively consistent with the identity from which the region of interest was cropped. Hence, as the model receives images of each identity with inconsistent unwanted areas, it naturally pays the most attention to the label-consistent regions, which we observed to be equivalent to learning an effective receptive field. During the test phase, samples are provided without any mask, and the network naturally disregards the detrimental information, which explains the observed improvements in performance. As a proof of concept, we consider the challenging problem of pedestrian re-identification and compare the effectiveness of our solution to state-of-the-art techniques on the well-known Richly Annotated Pedestrian (RAP) dataset. The code is available at https://github.com/Ehsan-Yaghoubi/reid-strong-baseline. |
Tasks | |
Published | 2020-01-30 |
URL | https://arxiv.org/abs/2001.11267v2 |
https://arxiv.org/pdf/2001.11267v2.pdf | |
PWC | https://paperswithcode.com/paper/an-implicit-attention-mechanism-for-deep |
Repo | https://github.com/Ehsan-Yaghoubi/reid-strong-baseline |
Framework | pytorch |
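The data-generation step described above can be sketched as composing a training image from interleaved vertical stripes of two different identities while keeping only the label of the identity that supplies the region-of-interest stripes. The stripe width and alternation pattern are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def interleave(img_roi, img_other, label_roi, stripe=16):
    """img_*: (H, W, 3) arrays of equal shape; returns (mixed_image, label)."""
    mixed = img_other.copy()
    for x in range(0, img_roi.shape[1], 2 * stripe):     # every other stripe comes from the ROI identity
        mixed[:, x:x + stripe] = img_roi[:, x:x + stripe]
    return mixed, label_roi                               # label stays with the ROI identity only

a = np.random.randint(0, 255, (256, 128, 3), dtype=np.uint8)   # pedestrian crop, identity 7
b = np.random.randint(0, 255, (256, 128, 3), dtype=np.uint8)   # pedestrian crop, identity 23
mixed, y = interleave(a, b, label_roi=7)
print(mixed.shape, y)   # (256, 128, 3) 7
```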