Paper Group AWR 69
TResNet: High Performance GPU-Dedicated Architecture
Title | TResNet: High Performance GPU-Dedicated Architecture |
Authors | Tal Ridnik, Hussam Lawen, Asaf Noy, Itamar Friedman |
Abstract | Many deep learning models developed in recent years reach higher ImageNet accuracy than ResNet50, with a lower or comparable FLOPs count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering a better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks’ accuracy, while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieves better accuracy and efficiency than previous ConvNets. Using a TResNet model with GPU throughput similar to ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive datasets such as Stanford Cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). Implementation is available at: https://github.com/mrT23/TResNet |
Tasks | Fine-Grained Image Classification, Image Classification |
Published | 2020-03-30 |
URL | https://arxiv.org/abs/2003.13630v1 |
https://arxiv.org/pdf/2003.13630v1.pdf | |
PWC | https://paperswithcode.com/paper/tresnet-high-performance-gpu-dedicated |
Repo | https://github.com/mrT23/TResNet |
Framework | pytorch |
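The FLOPs-versus-throughput gap discussed in the abstract is easy to check empirically. Below is a minimal, hedged sketch that measures actual GPU inference throughput (images per second) for a torchvision ResNet50 as a stand-in; it is not the TResNet benchmarking code, and the batch size and iteration counts are arbitrary choices.

```python
# Minimal GPU throughput benchmark (images/sec); requires a CUDA-capable GPU.
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval().cuda()
batch = torch.randn(64, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()

elapsed = time.time() - start
print(f"throughput: {iters * batch.size(0) / elapsed:.1f} img/s")
```

Running the same loop on a FLOPs-optimized competitor often reports fewer images per second despite the lower FLOPs count, which is the trade-off TResNet targets.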
Scene Text Recognition via Transformer
Title | Scene Text Recognition via Transformer |
Authors | Xinjie Feng, Hongxun Yao, Yuankai Qi, Jun Zhang, Shengping Zhang |
Abstract | Scene text recognition with arbitrary shape is very challenging due to large variations in text shapes, fonts, colors, backgrounds, etc. Most state-of-the-art algorithms rectify the input image into a normalized image and then treat recognition as a sequence prediction task. The bottleneck of such methods is the rectification step, which introduces errors due to perspective distortion. In this paper, we find that rectification is completely unnecessary; all we need is spatial attention. We therefore propose a simple but extremely effective scene text recognition method based on the transformer [50]. Different from previous transformer-based models [56,34], which use only the decoder of the transformer to decode the convolutional attention, the proposed method feeds convolutional feature maps as word embeddings into the transformer. In this way, our method makes full use of the transformer’s powerful attention mechanism. Extensive experimental results show that the proposed method outperforms state-of-the-art methods by a large margin on both regular and irregular text datasets. On CUTE, one of the most challenging datasets, where the state-of-the-art accuracy is 89.6%, our method achieves 99.3%. We will release our source code and believe that our method will serve as a new baseline for scene text recognition with arbitrary shapes. |
Tasks | Scene Text Recognition |
Published | 2020-03-18 |
URL | https://arxiv.org/abs/2003.08077v2 |
https://arxiv.org/pdf/2003.08077v2.pdf | |
PWC | https://paperswithcode.com/paper/scene-text-recognition-via-transformer |
Repo | https://github.com/fengxinjie/Transformer-OCR |
Framework | pytorch |
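As a rough illustration of feeding convolutional feature maps into a transformer as word embeddings, here is a hedged PyTorch sketch. The tiny backbone, model sizes, and the omission of positional encodings and masking are simplifications for brevity, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ConvTransformerOCR(nn.Module):
    def __init__(self, vocab_size=100, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(        # toy CNN; the paper uses a much deeper backbone
            nn.Conv2d(3, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, images, target_chars):
        feat = self.backbone(images)           # (B, C, H, W)
        src = feat.flatten(2).transpose(1, 2)  # (B, H*W, C): feature-map "tokens"
        tgt = self.char_embed(target_chars)    # (B, T, C)
        out = self.transformer(src, tgt)       # (B, T, C)
        return self.classifier(out)            # (B, T, vocab)

logits = ConvTransformerOCR()(torch.randn(2, 3, 32, 128), torch.zeros(2, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 10, 100])
```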
Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment
Title | Visual Semantic SLAM with Landmarks for Large-Scale Outdoor Environment |
Authors | Zirui Zhao, Yijun Mao, Yan Ding, Pengju Ren, Nanning Zheng |
Abstract | Semantic SLAM is an important field in autonomous driving and intelligent agents, as it enables robots to perform high-level navigation tasks, obtain simple cognition or reasoning ability, and achieve language-based human-robot interaction. In this paper, we build a system that creates a semantic 3D map for large-scale environments by combining the 3D point cloud from ORB-SLAM with semantic segmentation information from the convolutional neural network model PSPNet-101. In addition, we build a new dataset for the KITTI sequences, which contains GPS information and labels of landmarks from Google Maps for the streets covered by the sequences. Moreover, we present a way to associate real-world landmarks with the point cloud map and build a topological map based on the semantic map. |
Tasks | Autonomous Driving, Semantic Segmentation |
Published | 2020-01-04 |
URL | https://arxiv.org/abs/2001.01028v1 |
https://arxiv.org/pdf/2001.01028v1.pdf | |
PWC | https://paperswithcode.com/paper/visual-semantic-slam-with-landmarks-for-large |
Repo | https://github.com/1989Ryan/Semantic_SLAM |
Framework | tf |
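The fusion of geometry and semantics described above amounts to projecting each map point into the camera frame and reading the class id from the segmentation mask. The sketch below illustrates that step only; the intrinsics, mask size, and class ids are placeholder values, not the paper's configuration.

```python
import numpy as np

def label_points(points_cam, seg_mask, fx, fy, cx, cy):
    """points_cam: (N, 3) points in camera coordinates; seg_mask: (H, W) of class ids."""
    H, W = seg_mask.shape
    z = points_cam[:, 2]
    u = (fx * points_cam[:, 0] / z + cx).astype(int)
    v = (fy * points_cam[:, 1] / z + cy).astype(int)
    labels = np.full(len(points_cam), -1)                 # -1 = not visible in this frame
    visible = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[visible] = seg_mask[v[visible], u[visible]]
    return labels

pts = np.random.randn(1000, 3) + np.array([0.0, 0.0, 5.0])   # toy point cloud in front of the camera
mask = np.random.randint(0, 19, size=(370, 1226))            # placeholder segmentation output
print(label_points(pts, mask, fx=718.9, fy=718.9, cx=607.2, cy=185.2)[:10])
```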
Lookahead: a Far-Sighted Alternative of Magnitude-based Pruning
Title | Lookahead: a Far-Sighted Alternative of Magnitude-based Pruning |
Authors | Sejun Park, Jaeho Lee, Sangwoo Mo, Jinwoo Shin |
Abstract | Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants have demonstrated remarkable performance for pruning modern architectures. Based on the observation that magnitude-based pruning in fact minimizes the Frobenius distortion of a linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single-layer optimization to a multi-layer optimization. Our experimental results demonstrate that the proposed method consistently outperforms magnitude-based pruning on various networks, including VGG and ResNet, particularly in the high-sparsity regime. See https://github.com/alinlab/lookahead_pruning for code. |
Tasks | |
Published | 2020-02-12 |
URL | https://arxiv.org/abs/2002.04809v1 |
https://arxiv.org/pdf/2002.04809v1.pdf | |
PWC | https://paperswithcode.com/paper/lookahead-a-far-sighted-alternative-of-1 |
Repo | https://github.com/alinlab/lookahead_pruning |
Framework | pytorch |
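The multi-layer extension can be pictured with three consecutive linear layers: score each weight of the middle layer by its own magnitude multiplied by the norms of the weights it connects to in the neighbouring layers, and prune the lowest scores. The sketch below is a simplified reading of that idea, not the authors' implementation.

```python
import torch

def lookahead_scores(A, W, B):
    """A: (d1, d0) previous layer, W: (d2, d1) layer being pruned, B: (d3, d2) next layer."""
    in_norm = A.norm(dim=1)             # (d1,) norm of weights feeding each input unit of W
    out_norm = B.norm(dim=0)            # (d2,) norm of weights reading each output unit of W
    return W.abs() * out_norm[:, None] * in_norm[None, :]

def prune(W, scores, sparsity=0.9):
    k = int(sparsity * W.numel())
    threshold = scores.flatten().kthvalue(k).values
    return W * (scores > threshold)     # zero out the lowest-scoring weights

A, W, B = torch.randn(64, 32), torch.randn(128, 64), torch.randn(10, 128)
W_pruned = prune(W, lookahead_scores(A, W, B))
print((W_pruned == 0).float().mean())   # ~0.9
```

Plain magnitude-based pruning corresponds to dropping the two norm factors and ranking by `W.abs()` alone.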
Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation
Title | Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation |
Authors | István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe |
Abstract | Heatmap representations have formed the basis of 2D human pose estimation systems for many years, but their generalizations for 3D pose have only recently been considered. This includes 2.5D volumetric heatmaps, whose X and Y axes correspond to image space and the Z axis to metric depth around the subject. To obtain metric-scale predictions, these methods must include a separate, explicit post-processing step to resolve scale ambiguity. Further, they cannot encode body joint positions outside of the image boundaries, leading to incomplete pose estimates in case of image truncation. We address these limitations by proposing metric-scale truncation-robust (MeTRo) volumetric heatmaps, whose dimensions are defined in metric 3D space near the subject, instead of being aligned with image space. We train a fully-convolutional network to estimate such heatmaps from monocular RGB in an end-to-end manner. This reinterpretation of the heatmap dimensions allows us to estimate complete metric-scale poses without test-time knowledge of the focal length or person distance and without relying on anthropometric heuristics in post-processing. Furthermore, as the image space is decoupled from the heatmap space, the network can learn to reason about joints beyond the image boundary. Using ResNet-50 without any additional learned layers, we obtain state-of-the-art results on the Human3.6M and MPI-INF-3DHP benchmarks. As our method is simple and fast, it can become a useful component for real-time top-down multi-person pose estimation systems. We make our code publicly available to facilitate further research (see https://vision.rwth-aachen.de/metro-pose3d). |
Tasks | 3D Human Pose Estimation, Multi-Person Pose Estimation, Pose Estimation |
Published | 2020-03-05 |
URL | https://arxiv.org/abs/2003.02953v1 |
https://arxiv.org/pdf/2003.02953v1.pdf | |
PWC | https://paperswithcode.com/paper/metric-scale-truncation-robust-heatmaps-for |
Repo | https://github.com/isarandi/metro-pose3d |
Framework | tf |
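The key readout step, recovering metric joint coordinates from a heatmap whose axes live in metric space around the subject rather than in image space, can be sketched as a soft-argmax over a metric grid. The grid extent and resolution below are illustrative assumptions.

```python
import torch

def metric_soft_argmax(heatmap, extent_m=2.0):
    """heatmap: (J, D, H, W) raw scores; returns (J, 3) joint coordinates in metres."""
    J, D, H, W = heatmap.shape
    probs = heatmap.flatten(1).softmax(dim=1).view(J, D, H, W)
    # metric grid centred on the subject, spanning [-extent/2, extent/2] along each axis
    zs = torch.linspace(-extent_m / 2, extent_m / 2, D)
    ys = torch.linspace(-extent_m / 2, extent_m / 2, H)
    xs = torch.linspace(-extent_m / 2, extent_m / 2, W)
    x = (probs.sum(dim=(1, 2)) * xs).sum(dim=1)   # expectation along each metric axis
    y = (probs.sum(dim=(1, 3)) * ys).sum(dim=1)
    z = (probs.sum(dim=(2, 3)) * zs).sum(dim=1)
    return torch.stack([x, y, z], dim=1)

print(metric_soft_argmax(torch.randn(17, 16, 16, 16)).shape)  # torch.Size([17, 3])
```

Because the grid is metric rather than pixel-aligned, coordinates outside the image crop remain representable, which is what makes the approach truncation-robust.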
Adversarial System Variant Approximation to Quantify Process Model Generalization
Title | Adversarial System Variant Approximation to Quantify Process Model Generalization |
Authors | Julian Theis, Houshang Darabi |
Abstract | In process mining, process models are extracted from event logs using process discovery algorithms and are commonly assessed using multiple quality dimensions. While the metrics that measure the relationship of an extracted process model to its event log are well-studied, quantifying the extent to which a process model can describe the unobserved behavior of its underlying system has received little attention in the literature. In this paper, a novel deep learning-based methodology called Adversarial System Variant Approximation (AVATAR) is proposed to overcome this issue. Sequence Generative Adversarial Networks are trained on the variants contained in an event log with the intention to approximate the underlying variant distribution of the system behavior. Unobserved realistic variants are sampled either directly from the Sequence Generative Adversarial Network or by leveraging the Metropolis-Hastings algorithm. The degree to which a process model relates to its underlying unknown system behavior is then quantified based on the realistic observed and estimated unobserved variants using established process model quality metrics. Significant performance improvements in revealing realistic unobserved variants are demonstrated in a controlled experiment on 15 ground truth systems. Additionally, the proposed methodology is experimentally tested and evaluated to quantify the generalization of 60 discovered process models with respect to their systems. |
Tasks | |
Published | 2020-03-26 |
URL | https://arxiv.org/abs/2003.12168v1 |
https://arxiv.org/pdf/2003.12168v1.pdf | |
PWC | https://paperswithcode.com/paper/adversarial-system-variant-approximation-to |
Repo | https://github.com/ProminentLab/AVATAR |
Framework | none |
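For the Metropolis-Hastings sampling step mentioned above, a generic sketch is given below: propose a local, symmetric mutation of the current variant and accept it with the usual likelihood ratio. The alphabet, mutation rule, and `log_likelihood` scorer are hypothetical stand-ins, not AVATAR's SeqGAN components.

```python
import math
import random

ALPHABET = "ABCDE"                                   # toy activity labels

def mutate(variant):
    """Symmetric proposal: resample one position of the variant uniformly at random."""
    i = random.randrange(len(variant))
    return variant[:i] + (random.choice(ALPHABET),) + variant[i + 1:]

def log_likelihood(variant):
    # placeholder for a sequence model's log-probability of the variant
    return sum(-0.5 if a == b else -2.0 for a, b in zip(variant, variant[1:]))

def metropolis_hastings(start, n_steps=5000):
    current, variants = start, set()
    for _ in range(n_steps):
        proposal = mutate(current)
        log_ratio = log_likelihood(proposal) - log_likelihood(current)
        if random.random() < math.exp(min(0.0, log_ratio)):
            current = proposal                       # accept with probability min(1, ratio)
        variants.add(current)
    return variants                                  # distinct sampled variants

print(len(metropolis_hastings(tuple("ABCAB"))))
```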
Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models
Title | Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models |
Authors | Xiao Zhang, Jinghui Chen, Quanquan Gu, David Evans |
Abstract | Starting with Gilmer et al. (2018), several works have demonstrated the inevitability of adversarial examples based on different assumptions about the underlying input probability space. It remains unclear, however, whether these results apply to natural image distributions. In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). Building upon the state-of-the-art conditional generative models, we study the intrinsic robustness of two common image benchmarks under $\ell_2$ perturbations, and show the existence of a large gap between the robustness limits implied by our theory and the adversarial robustness achieved by current state-of-the-art robust models. Code for all our experiments is available at https://github.com/xiaozhanguva/Intrinsic-Rob. |
Tasks | |
Published | 2020-03-01 |
URL | https://arxiv.org/abs/2003.00378v1 |
https://arxiv.org/pdf/2003.00378v1.pdf | |
PWC | https://paperswithcode.com/paper/understanding-the-intrinsic-robustness-of |
Repo | https://github.com/xiaozhanguva/Intrinsic-Rob |
Framework | pytorch |
Replacing Mobile Camera ISP with a Single Deep Learning Model
Title | Replacing Mobile Camera ISP with a Single Deep Learning Model |
Authors | Andrey Ignatov, Luc Van Gool, Radu Timofte |
Abstract | As the popularity of mobile photography grows, considerable effort is being invested in building complex hand-crafted camera ISP solutions. In this work, we demonstrate that even the most sophisticated ISP pipelines can be replaced with a single end-to-end deep learning model trained without any prior knowledge about the sensor and optics used in a particular device. For this, we present PyNET, a novel pyramidal CNN architecture designed for fine-grained image restoration that implicitly learns to perform all ISP steps such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The model is trained to convert RAW Bayer data obtained directly from the mobile camera sensor into photos captured with a professional high-end DSLR camera, making the solution independent of any particular mobile ISP implementation. To validate the proposed approach on real data, we collected a large-scale dataset consisting of 10 thousand full-resolution RAW-RGB image pairs captured in the wild with the Huawei P20 cameraphone (12.3 MP Sony Exmor IMX380 sensor) and a Canon 5D Mark IV DSLR. The experiments demonstrate that the proposed solution can reach the level of the embedded P20 ISP pipeline, which, unlike our approach, combines the data from two (RGB + B/W) camera sensors. The dataset, pre-trained models and code used in this paper are available on the project website. |
Tasks | Demosaicking, Denoising, Image Restoration |
Published | 2020-02-13 |
URL | https://arxiv.org/abs/2002.05509v1 |
https://arxiv.org/pdf/2002.05509v1.pdf | |
PWC | https://paperswithcode.com/paper/replacing-mobile-camera-isp-with-a-single |
Repo | https://github.com/aiff22/pynet |
Framework | tf |
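Before a CNN can learn the RAW-to-DSLR mapping described above, the Bayer mosaic is typically packed into a multi-channel, half-resolution tensor. The sketch below assumes an RGGB layout, which is an assumption for illustration; actual sensor layouts and the PyNET input pipeline may differ.

```python
import numpy as np

def pack_bayer_rggb(raw):
    """raw: (H, W) Bayer frame with even H and W -> (H/2, W/2, 4) packed channels."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return np.stack([r, g1, g2, b], axis=-1)

raw = np.random.randint(0, 1023, size=(2976, 3968), dtype=np.uint16)  # placeholder full-resolution frame
print(pack_bayer_rggb(raw).shape)   # (1488, 1984, 4)
```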
Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning
Title | Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning |
Authors | Simone Parisi, Davide Tateo, Maximilian Hensel, Carlo D’Eramo, Jan Peters, Joni Pajarinen |
Abstract | Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on feedback via extrinsic rewards to train the agent, and in situations where such feedback occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it is likely to stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore |
Tasks | |
Published | 2020-01-01 |
URL | https://arxiv.org/abs/2001.00119v1 |
https://arxiv.org/pdf/2001.00119v1.pdf | |
PWC | https://paperswithcode.com/paper/long-term-visitation-value-for-deep |
Repo | https://github.com/sparisi/visit-value-explore |
Framework | none |
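The decoupling of exploration and exploitation can be pictured in the tabular case: keep the usual Q-function for extrinsic reward and a separate exploration-value function learned from a visitation-based bonus, then act greedily on their combination. The update rules and the 1/sqrt(count) bonus below are a simplified illustration, not the authors' exact algorithm.

```python
import numpy as np

n_states, n_actions = 20, 4
Q  = np.zeros((n_states, n_actions))      # exploitation value (extrinsic reward)
Wv = np.zeros((n_states, n_actions))      # exploration value (visitation-based)
counts = np.ones((n_states, n_actions))   # visitation counts
alpha, gamma, beta = 0.1, 0.95, 1.0

def select_action(s):
    return int(np.argmax(Q[s] + beta * Wv[s]))   # exploration handled by Wv, not by epsilon-greedy

def update(s, a, r, s_next):
    counts[s, a] += 1
    bonus = 1.0 / np.sqrt(counts[s, a])          # long-term novelty signal
    Q[s, a]  += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    Wv[s, a] += alpha * (bonus + gamma * Wv[s_next].max() - Wv[s, a])

update(s=0, a=select_action(0), r=0.0, s_next=1)
print(Q[0], Wv[0])
```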
Simulating Lexical Semantic Change from Sense-Annotated Data
Title | Simulating Lexical Semantic Change from Sense-Annotated Data |
Authors | Dominik Schlechtweg, Sabine Schulte im Walde |
Abstract | We present a novel procedure to simulate lexical semantic change from synchronic sense-annotated data, and demonstrate its usefulness for assessing lexical semantic change detection models. The induced dataset represents a stronger correspondence to empirically observed lexical semantic change than previous synthetic datasets, because it exploits the intimate relationship between synchronic polysemy and diachronic change. We publish the data and provide the first large-scale evaluation gold standard for LSC detection models. |
Tasks | |
Published | 2020-01-09 |
URL | https://arxiv.org/abs/2001.03216v1 |
https://arxiv.org/pdf/2001.03216v1.pdf | |
PWC | https://paperswithcode.com/paper/simulating-lexical-semantic-change-from-sense |
Repo | https://github.com/Garrafao/LSCDetection |
Framework | none |
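The simulation procedure can be pictured as sampling usages of a sense-annotated word with different sense frequencies in two synthetic time periods, so that the dominant sense shifts between them. The sense names and mixing weights below are made up for illustration, not the paper's induction protocol.

```python
import random

usages = {
    "sense_1": ["usage of sense 1 ..."] * 100,   # placeholder sense-annotated contexts
    "sense_2": ["usage of sense 2 ..."] * 100,
}

def sample_period(sense_weights, n=200):
    senses = random.choices(list(sense_weights), weights=list(sense_weights.values()), k=n)
    return [random.choice(usages[s]) for s in senses]

corpus_t1 = sample_period({"sense_1": 0.9, "sense_2": 0.1})   # sense_1 dominates earlier
corpus_t2 = sample_period({"sense_1": 0.2, "sense_2": 0.8})   # sense_2 takes over later
print(len(corpus_t1), len(corpus_t2))
```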
Automatically Discovering and Learning New Visual Categories with Ranking Statistics
Title | Automatically Discovering and Learning New Visual Categories with Ranking Statistics |
Authors | Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman |
Abstract | We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model’s knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin. |
Tasks | |
Published | 2020-02-13 |
URL | https://arxiv.org/abs/2002.05714v1 |
https://arxiv.org/pdf/2002.05714v1.pdf | |
PWC | https://paperswithcode.com/paper/automatically-discovering-and-learning-new-1 |
Repo | https://github.com/k-han/AutoNovel |
Framework | pytorch |
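The rank-statistics idea in point (2) can be sketched as a pairing rule: two unlabelled samples are given the same pseudo-label when the top-k ranked dimensions of their feature vectors coincide. The value of k and the feature dimensionality below are illustrative choices.

```python
import torch

def same_class_pseudo_label(feat_a, feat_b, k=5):
    """feat_*: (B, D) feature batches; returns a (B,) boolean pairwise pseudo-label."""
    top_a = feat_a.topk(k, dim=1).indices
    top_b = feat_b.topk(k, dim=1).indices
    # compare the top-k index sets (order-insensitive)
    return (top_a.sort(dim=1).values == top_b.sort(dim=1).values).all(dim=1)

a, b = torch.randn(8, 512), torch.randn(8, 512)
print(same_class_pseudo_label(a, b))   # mostly False for random features
```

These pairwise pseudo-labels then drive the clustering objective on the unlabelled subset.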
Identifying the Development and Application of Artificial Intelligence in Scientific Text
Title | Identifying the Development and Application of Artificial Intelligence in Scientific Text |
Authors | James Dunham, Jennifer Melot, Dewey Murdick |
Abstract | We describe a strategy for identifying the universe of research publications relating to the application and development of artificial intelligence. The approach leverages arXiv’s corpus of scientific preprints, in which authors choose subject tags for their papers from a set defined by editors. We compose from these subjects a functional definition of AI-relevance with intuitive components, by learning the subject definitions from paper metadata, and then inferring the arXiv-subject labels of papers in Web of Science. We find predictive classification $F_1$ scores between .59 and .86 for AI-relevant subject models. For an all-subjects model, we see precision of .83 and recall of .85. We evaluate the out-of-domain performance of our classifiers against other sources of subject information and results from other methods. We find that for the high-level fields of study represented on arXiv, a supervised solution can generalize for inference in other corpora. This offers a method for identifying AI-relevant publications that updates at the pace of research output, without reliance on subject-matter experts for query development or labeling. |
Tasks | |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.07143v1 |
https://arxiv.org/pdf/2002.07143v1.pdf | |
PWC | https://paperswithcode.com/paper/identifying-the-development-and-application |
Repo | https://github.com/georgetown-cset/ai-relevant-papers |
Framework | pytorch |
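As a generic stand-in for the setup described above, learning arXiv subject labels from paper text and then applying the classifier to another corpus, here is a hedged sketch. The model choice (TF-IDF plus logistic regression) and the toy examples are assumptions, not the authors' pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["deep reinforcement learning for robotic control",
               "photometric survey of galaxy clusters"]
train_labels = ["cs.LG", "astro-ph"]       # arXiv subject tags as supervision

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# inference on records from a different corpus (Web of Science style metadata)
print(clf.predict(["convolutional networks for image classification"]))
```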
PONAS: Progressive One-shot Neural Architecture Search for Very Efficient Deployment
Title | PONAS: Progressive One-shot Neural Architecture Search for Very Efficient Deployment |
Authors | Sian-Yao Huang, Wei-Ta Chu |
Abstract | We achieve very efficient deep learning model deployment by designing neural network architectures that fit different hardware constraints. Given a constraint, most neural architecture search (NAS) methods either sample a set of sub-networks according to a pre-trained accuracy predictor, or adopt an evolutionary algorithm to evolve specialized networks from the supernet. Both approaches are time-consuming. Our key idea for very efficient deployment is to construct, while searching the architecture space, a table that stores the validation accuracy of all candidate blocks at all layers. For a stricter hardware constraint, the architecture of a specialized network can then be determined very efficiently from this table by picking the candidate blocks that yield the least accuracy loss. To realize this idea, we propose Progressive One-shot Neural Architecture Search (PONAS), which combines the advantages of progressive NAS and one-shot methods. In PONAS, we propose a two-stage training scheme, consisting of a meta-training stage and a fine-tuning stage, to make the search process efficient and stable. During search, we evaluate candidate blocks in different layers and construct the accuracy table that is later used for deployment. Comprehensive experiments verify that PONAS is extremely flexible and is able to find the architecture of a specialized network in around 10 seconds. On ImageNet classification, 75.2% top-1 accuracy can be obtained, which is comparable with the state of the art. |
Tasks | Neural Architecture Search |
Published | 2020-03-11 |
URL | https://arxiv.org/abs/2003.05112v1 |
https://arxiv.org/pdf/2003.05112v1.pdf | |
PWC | https://paperswithcode.com/paper/ponas-progressive-one-shot-neural |
Repo | https://github.com/eric8607242/PONAS |
Framework | pytorch |
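The deployment-time lookup can be sketched as follows: start from the most accurate candidate block in every layer, then, while the hardware budget is exceeded, downgrade the layer whose cheaper block costs the least accuracy according to the table. The table values and the greedy rule below are illustrative, not the paper's exact procedure.

```python
# table[layer][block] = (accuracy_proxy, latency_ms) -- placeholder numbers
table = [
    {"mbconv_e6": (0.72, 3.0), "mbconv_e3": (0.71, 2.0), "skip": (0.65, 0.5)},
    {"mbconv_e6": (0.74, 4.0), "mbconv_e3": (0.72, 2.5), "skip": (0.66, 0.6)},
    {"mbconv_e6": (0.75, 5.0), "mbconv_e3": (0.74, 3.0), "skip": (0.70, 0.8)},
]

def search(table, latency_budget):
    choice = [max(layer, key=lambda b, l=layer: l[b][0]) for layer in table]  # best block per layer
    latency = lambda: sum(table[i][b][1] for i, b in enumerate(choice))
    while latency() > latency_budget:
        best = None                                  # (accuracy loss, layer index, block)
        for i, layer in enumerate(table):
            cur_acc, cur_lat = layer[choice[i]]
            for block, (acc, lat) in layer.items():
                if lat < cur_lat and (best is None or cur_acc - acc < best[0]):
                    best = (cur_acc - acc, i, block)
        if best is None:
            break                                    # budget unreachable
        choice[best[1]] = best[2]
    return choice, latency()

print(search(table, latency_budget=7.0))
```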
Learning Group Structure and Disentangled Representations of Dynamical Environments
Title | Learning Group Structure and Disentangled Representations of Dynamical Environments |
Authors | Robin Quessard, Thomas D. Barrett, William R. Clements |
Abstract | Discovering the underlying structure of a dynamical environment involves learning representations that are interpretable and disentangled, which is a challenging task. In physics, interpretable representations of our universe and its underlying dynamics are formulated in terms of representations of groups of symmetry transformations. We propose a physics-inspired method, built upon the theory of group representation, that learns a representation of an environment structured around the transformations that generate its evolution. Experimentally, we learn the structure of explicitly symmetric environments without supervision while ensuring the interpretability of the representations. We show that the learned representations allow for accurate long-horizon predictions and further demonstrate a correlation between the quality of predictions and disentanglement in the latent space. |
Tasks | |
Published | 2020-02-17 |
URL | https://arxiv.org/abs/2002.06991v1 |
https://arxiv.org/pdf/2002.06991v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-group-structure-and-disentangled |
Repo | https://github.com/IndustAI/learning-group-structure |
Framework | pytorch |
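The core idea, representing each environment transformation as a learned group element acting on the latent space, can be sketched with a single action modelled as a 2D rotation whose angle is fit by gradient descent. The toy environment and loss below are assumptions for illustration.

```python
import torch

def rotation(t):
    """2x2 rotation matrix for an angle tensor of shape (1,)."""
    return torch.stack([torch.cat([t.cos(), -t.sin()]), torch.cat([t.sin(), t.cos()])])

true_angle = torch.tensor([0.3])                  # the environment's hidden transformation
z = torch.randn(128, 2)                           # latent states
z_next = z @ rotation(true_angle).T               # observed effect of the action

theta = torch.zeros(1, requires_grad=True)        # learned group parameter for this action
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(300):
    loss = ((z @ rotation(theta).T - z_next) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(theta.item())                                # converges to roughly 0.3
```

Composing several such learned generators recovers the group structure of the environment's dynamics, which is what the paper exploits for long-horizon prediction.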
An Implicit Attention Mechanism for Deep Learning Pedestrian Re-identification Frameworks
Title | An Implicit Attention Mechanism for Deep Learning Pedestrian Re-identification Frameworks |
Authors | Ehsan Yaghoubi, Diana Borza, Aruna Kumar, Hugo Proença |
Abstract | Attention is defined as the preparedness for the mental selection of certain aspects in a physical environment. In the computer vision domain, this mechanism is of great interest, as it helps to define the segments of an image/video that are critical for obtaining a specific decision. This paper introduces an ‘implicit’ attentional mechanism for deep learning frameworks that simultaneously provides 1) mask-free and 2) foreground-focused samples for the inference phase. The main idea is to generate synthetic data composed of interleaved segments from the original learning set, while using class information only from specific segments. During the learning phase, the newly generated samples feed the network, keeping their label exclusively consistent with the identity from which the region of interest was cropped. Hence, as the model receives images of each identity with inconsistent unwanted areas, it naturally pays the most attention to the label-consistent regions, which we observed to be equivalent to learning an effective receptive field. During the test phase, samples are provided without any mask, and the network naturally disregards the detrimental information, which explains the observed improvements in performance. As a proof of concept, we consider the challenging problem of pedestrian re-identification and compare the effectiveness of our solution to state-of-the-art techniques on the well-known Richly Annotated Pedestrian (RAP) dataset. The code is available at https://github.com/Ehsan-Yaghoubi/reid-strong-baseline. |
Tasks | |
Published | 2020-01-30 |
URL | https://arxiv.org/abs/2001.11267v2 |
https://arxiv.org/pdf/2001.11267v2.pdf | |
PWC | https://paperswithcode.com/paper/an-implicit-attention-mechanism-for-deep |
Repo | https://github.com/Ehsan-Yaghoubi/reid-strong-baseline |
Framework | pytorch |
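The data-generation step described above can be sketched as composing a training image from interleaved vertical stripes of two different identities while keeping only the label of the identity that supplies the region-of-interest stripes. The stripe width and alternation pattern are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def interleave(img_roi, img_other, label_roi, stripe=16):
    """img_*: (H, W, 3) arrays of equal shape; returns (mixed_image, label)."""
    mixed = img_other.copy()
    for x in range(0, img_roi.shape[1], 2 * stripe):     # every other stripe comes from the ROI identity
        mixed[:, x:x + stripe] = img_roi[:, x:x + stripe]
    return mixed, label_roi                               # label stays with the ROI identity only

a = np.random.randint(0, 255, (256, 128, 3), dtype=np.uint8)   # pedestrian crop, identity 7
b = np.random.randint(0, 255, (256, 128, 3), dtype=np.uint8)   # pedestrian crop, identity 23
mixed, y = interleave(a, b, label_roi=7)
print(mixed.shape, y)   # (256, 128, 3) 7
```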