Paper Group AWR 53
Unsupervised Deep Learning by Neighbourhood Discovery. Online Knowledge Distillation with Diverse Peers. Deep High-Resolution Representation Learning for Visual Recognition. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. Multi-stage Deep Classifier Cascades for Open World Recognition. SOLO: Segmenting Objects by Locations. Generating Question Relevant Captions to Aid Visual Question Answering. Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth. Budgeted Reinforcement Learning in Continuous State Space. Dr.VOT : Measuring Positive and Negative Voice Onset Time in the Wild. MUSEFood: Multi-sensor-based Food Volume Estimation on Smartphones. Learning Lightweight Lane Detection CNNs by Self Attention Distillation. BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition. No Training Required: Exploring Random Encoders for Sentence Classification. A Fair Comparison of Graph Neural Networks for Graph Classification.
Unsupervised Deep Learning by Neighbourhood Discovery
Title | Unsupervised Deep Learning by Neighbourhood Discovery |
Authors | Jiabo Huang, Qi Dong, Shaogang Gong, Xiatian Zhu |
Abstract | Deep convolutional neural networks (CNNs) have demonstrated remarkable success in computer vision by learning strong visual feature representations under supervision. However, training CNNs relies heavily on the availability of exhaustive training data annotations, significantly limiting their deployment and scalability in many application scenarios. In this work, we introduce a generic unsupervised deep learning approach to training deep models without the need for any manual label supervision. Specifically, we progressively discover sample-anchored/centred neighbourhoods to reason about and learn the underlying class decision boundaries iteratively and cumulatively. Every single neighbourhood is specially formulated so that all of its member samples share the same unseen class label with high probability, which facilitates the extraction of class-discriminative feature representations during training. Experiments on image classification show the performance advantages of the proposed method over state-of-the-art unsupervised learning models on six benchmarks covering both coarse-grained and fine-grained object image categorisation. |
Tasks | Image Classification |
Published | 2019-04-25 |
URL | https://arxiv.org/abs/1904.11567v3 |
https://arxiv.org/pdf/1904.11567v3.pdf | |
PWC | https://paperswithcode.com/paper/unsupervised-deep-learning-by-neighbourhood |
Repo | https://github.com/raymond-sci/AND |
Framework | pytorch |
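The neighbourhood idea can be illustrated with a minimal sketch; the function name, batch-level softmax, and temperature below are my own simplifications, since the paper builds neighbourhoods over a memory bank and grows them over curriculum rounds:

```python
# Hedged sketch: treat each sample's nearest neighbour in feature space as sharing
# its (unknown) class label and pull the pair together with an N-way softmax.
import torch
import torch.nn.functional as F

def neighbourhood_loss(features, temperature=0.1):
    """features: (N, D) embeddings of an unlabelled batch."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature               # (N, N) scaled cosine similarities
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float('-inf'))           # exclude trivial self-matches
    neighbour = sim.argmax(dim=1)                              # index of each sample's 1-NN
    return F.cross_entropy(sim, neighbour)                     # the neighbour acts as the positive
```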
Online Knowledge Distillation with Diverse Peers
Title | Online Knowledge Distillation with Diverse Peers |
Authors | Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, Chun Chen |
Abstract | Distillation is an effective knowledge-transfer technique that uses predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained high-capacity teacher, however, is not always available. Recently proposed online variants use the aggregated intermediate predictions of multiple student models as targets to train each student model. Although group-derived targets give a good recipe for teacher-free distillation, group members are homogenized quickly with simple aggregation functions, leading to early saturated solutions. In this work, we propose Online Knowledge Distillation with Diverse Peers (OKDDip), which performs two-level distillation during training with multiple auxiliary peers and one group leader. In the first-level distillation, each auxiliary peer holds an individual set of aggregation weights generated with an attention-based mechanism to derive its own targets from the predictions of the other auxiliary peers. Learning from distinct target distributions helps to boost peer diversity, which is key to effective group-based distillation. The second-level distillation transfers the knowledge in the ensemble of auxiliary peers further to the group leader, i.e., the model used for inference. Experimental results show that the proposed framework consistently gives better performance than state-of-the-art approaches without sacrificing training or inference complexity, demonstrating the effectiveness of the proposed two-level distillation framework. |
Tasks | Transfer Learning |
Published | 2019-12-01 |
URL | https://arxiv.org/abs/1912.00350v2 |
https://arxiv.org/pdf/1912.00350v2.pdf | |
PWC | https://paperswithcode.com/paper/online-knowledge-distillation-with-diverse |
Repo | https://github.com/DefangChen/OKDDip-AAAI2020 |
Framework | pytorch |
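A hedged sketch of the two distillation levels follows; the tensor shapes, temperature, and attention projections are assumptions on my part, and the linked repository remains the reference implementation:

```python
# First level: each auxiliary peer distils from an attention-weighted mixture of the
# other peers' predictions. Second level: the peer ensemble is distilled into the leader.
import torch
import torch.nn.functional as F

def peer_targets(logits, queries, keys, T=3.0):
    """logits: (P, B, C) peer predictions; queries/keys: (P, B, H) per-peer projections."""
    att = torch.einsum('pbh,qbh->bpq', queries, keys).softmax(dim=-1)   # (B, P, P)
    probs = F.softmax(logits / T, dim=-1)                               # (P, B, C)
    return torch.einsum('bpq,qbc->pbc', att, probs).detach()            # per-peer soft targets

def okddip_loss(logits, queries, keys, leader_logits, labels, T=3.0):
    ce = F.cross_entropy(logits.flatten(0, 1), labels.repeat(logits.size(0)))
    kd1 = F.kl_div(F.log_softmax(logits / T, dim=-1),
                   peer_targets(logits, queries, keys, T), reduction='batchmean') * T * T
    ensemble = F.softmax(logits / T, dim=-1).mean(dim=0).detach()       # second-level target
    kd2 = F.kl_div(F.log_softmax(leader_logits / T, dim=-1), ensemble,
                   reduction='batchmean') * T * T
    return ce + F.cross_entropy(leader_logits, labels) + kd1 + kd2
```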
Deep High-Resolution Representation Learning for Visual Recognition
Title | Deep High-Resolution Representation Learning for Visual Recognition |
Authors | Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao |
Abstract | High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork formed by connecting high-to-low resolution convolutions *in series* (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named the High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) the high-to-low resolution convolution streams are connected *in parallel*; (ii) information is repeatedly exchanged across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the code is available at https://github.com/HRNet. |
Tasks | Instance Segmentation, Object Detection, Pose Estimation, Representation Learning, Semantic Segmentation |
Published | 2019-08-20 |
URL | https://arxiv.org/abs/1908.07919v2 |
https://arxiv.org/pdf/1908.07919v2.pdf | |
PWC | https://paperswithcode.com/paper/190807919 |
Repo | https://github.com/HRNet/HRNet-Facial-Landmark-Detection |
Framework | pytorch |
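The parallel-streams-with-exchange idea can be sketched as a toy two-resolution unit; the channel counts, fusion scheme, and module name are my own and do not mirror the official HRNet code:

```python
# Toy HRNet-style exchange unit: both resolutions are kept through the block and each
# stream receives the other one, resampled to its own resolution, before summation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        self.high = nn.Conv2d(c_high, c_high, 3, padding=1)        # high-resolution stream
        self.low = nn.Conv2d(c_low, c_low, 3, padding=1)           # low-resolution stream
        self.high_to_low = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)
        self.low_to_high = nn.Conv2d(c_low, c_high, 1)

    def forward(self, x_high, x_low):
        h, l = self.high(x_high), self.low(x_low)
        up = F.interpolate(self.low_to_high(l), size=h.shape[-2:], mode='bilinear', align_corners=False)
        return F.relu(h + up), F.relu(l + self.high_to_low(h))     # fused outputs, resolutions preserved
```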
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free
Title | RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free |
Authors | Cheng-Yang Fu, Mykhailo Shvets, Alexander C. Berg |
Abstract | Recently, two-stage detectors have surged ahead of single-shot detectors in the accuracy-vs-speed trade-off. Nevertheless, single-shot detectors are immensely popular in embedded vision applications. This paper brings single-shot detectors up to the same level as current two-stage techniques. We do this by improving training for the state-of-the-art single-shot detector, RetinaNet, in three ways: integrating instance mask prediction for the first time, making the loss function adaptive and more stable, and including additional hard examples in training. We call the resulting augmented network RetinaMask. The detection component of RetinaMask has the same computational cost as the original RetinaNet, but is more accurate. COCO test-dev results are up to 41.4 mAP for RetinaMask-101 vs 39.1 mAP for RetinaNet-101, while the runtime is the same during evaluation. Adding Group Normalization increases the performance of RetinaMask-101 to 41.7 mAP. Code is at: https://github.com/chengyangfu/retinamask |
Tasks | Object Detection |
Published | 2019-01-10 |
URL | http://arxiv.org/abs/1901.03353v1 |
http://arxiv.org/pdf/1901.03353v1.pdf | |
PWC | https://paperswithcode.com/paper/retinamask-learning-to-predict-masks-improves |
Repo | https://github.com/chencq1234/maskrcnn_facebook |
Framework | pytorch |
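A rough sketch of the mask branch under my own simplifications (a single FPN level, torchvision's roi_align, and made-up channel counts); the detection branch is assumed to be an unchanged RetinaNet:

```python
# Mask head run on RoI-aligned features of the predicted boxes; mask supervision is
# only an auxiliary training signal, so detection-time cost is unchanged.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class MaskHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU()]
        self.convs = nn.Sequential(*layers)
        self.up = nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2)
        self.predict = nn.Conv2d(in_ch, num_classes, 1)

    def forward(self, fpn_feature, boxes):
        """fpn_feature: (B, in_ch, H, W); boxes: list of (N_i, 4) tensors in image coordinates."""
        rois = roi_align(fpn_feature, boxes, output_size=(14, 14), spatial_scale=1 / 8)
        return self.predict(torch.relu(self.up(self.convs(rois))))   # (sum N_i, num_classes, 28, 28)
```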
Multi-stage Deep Classifier Cascades for Open World Recognition
Title | Multi-stage Deep Classifier Cascades for Open World Recognition |
Authors | Xiaojie Guo, Amir Alipour-Fanid, Lingfei Wu, Hemant Purohit, Xiang Chen, Kai Zeng, Liang Zhao |
Abstract | At present, object recognition studies are mostly conducted in a closed lab setting, with the classes encountered at test time typically also present during training. However, real-world problems are far more challenging because: i) new classes unseen in the training phase can appear at prediction time; ii) discriminative features need to evolve when new classes emerge in real time; and iii) instances in new classes may not follow the “independent and identically distributed” (iid) assumption. Most existing work only aims to detect the unknown classes and is incapable of continuing to learn newer classes. Although a few methods consider both detecting and including new classes, all are based on predefined handcrafted features that cannot evolve and are out of date for characterizing emerging classes. To address the above challenges, we propose a novel, generic end-to-end framework consisting of a dynamic cascade of classifiers that incrementally learn their dynamic and inherent features. The proposed method injects dynamic elements into the system by detecting instances from unknown classes, while at the same time incrementally updating the model to include the new classes. The resulting cascade tree grows by adding a new leaf-node classifier once a new class is detected, and the discriminative features are updated via an end-to-end learning strategy. Experiments on two real-world datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. |
Tasks | Object Recognition |
Published | 2019-08-26 |
URL | https://arxiv.org/abs/1908.09931v1 |
https://arxiv.org/pdf/1908.09931v1.pdf | |
PWC | https://paperswithcode.com/paper/multi-stage-deep-classifier-cascades-for-open |
Repo | https://github.com/xguo7/MDCC-for-open-world-recognition |
Framework | none |
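The cascade-growing behaviour can be schematised as below; this is only the control flow implied by the abstract, and the threshold, buffering, and leaf-training callback are placeholders of mine rather than the paper's end-to-end learning strategy:

```python
# Schematic open-world cascade: stages claim inputs they are confident about; inputs
# rejected by every stage are buffered, and a caller-supplied trainer turns a full
# buffer into a new leaf classifier appended to the cascade.
import torch
import torch.nn.functional as F

class ClassifierCascade:
    def __init__(self, root, train_new_leaf, threshold=0.7, buffer_size=100):
        self.stages = [root]                   # each stage: nn.Module returning class logits
        self.train_new_leaf = train_new_leaf   # callable: buffered inputs -> new stage module
        self.threshold = threshold
        self.buffer, self.buffer_size = [], buffer_size

    @torch.no_grad()
    def predict(self, x):                      # x: a single instance, shape (1, ...)
        for stage_id, stage in enumerate(self.stages):
            conf, label = F.softmax(stage(x), dim=-1).max(dim=-1)
            if conf.item() >= self.threshold:
                return stage_id, int(label)
        self.buffer.append(x)                  # unknown to every stage so far
        if len(self.buffer) >= self.buffer_size:
            self.stages.append(self.train_new_leaf(torch.cat(self.buffer)))
            self.buffer = []
        return None, -1                        # -1 signals "unknown"
```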
SOLO: Segmenting Objects by Locations
Title | SOLO: Segmenting Objects by Locations |
Authors | Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, Lei Li |
Abstract | We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the ‘detect-then-segment’ strategy used by Mask R-CNN, or predict category masks first and then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of “instance categories”, which assigns categories to each pixel within an instance according to the instance’s location and size, thus nicely converting instance mask segmentation into a classification-solvable problem. Now instance segmentation is decomposed into two classification tasks. We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving on-par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation. |
Tasks | Instance Segmentation, Semantic Segmentation |
Published | 2019-12-10 |
URL | https://arxiv.org/abs/1912.04488v2 |
https://arxiv.org/pdf/1912.04488v2.pdf | |
PWC | https://paperswithcode.com/paper/solo-segmenting-objects-by-locations |
Repo | https://github.com/aim-uofa/AdelaiDet |
Framework | pytorch |
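A compact head illustrating the instance-category idea; the grid size, channel counts, and single-level design are assumptions of mine rather than the AdelaiDet implementation:

```python
# SOLO-style head: an S x S grid of cells, each predicting a class score and a
# full-resolution mask for the instance whose centre falls into that cell.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SOLOHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80, grid=40):
        super().__init__()
        self.grid = grid
        self.cate = nn.Conv2d(in_ch, num_classes, 3, padding=1)   # category branch
        self.mask = nn.Conv2d(in_ch, grid * grid, 1)              # one mask channel per grid cell

    def forward(self, feat):
        cate_feat = F.interpolate(feat, size=(self.grid, self.grid), mode='bilinear', align_corners=False)
        cate_scores = self.cate(cate_feat)                        # (B, num_classes, S, S)
        masks = self.mask(feat)                                   # (B, S*S, H, W)
        return cate_scores, masks
```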
Generating Question Relevant Captions to Aid Visual Question Answering
Title | Generating Question Relevant Captions to Aid Visual Question Answering |
Authors | Jialin Wu, Zeyuan Hu, Raymond J. Mooney |
Abstract | Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions with an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g. 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions. |
Tasks | Image Captioning, Question Answering, Visual Question Answering |
Published | 2019-06-03 |
URL | https://arxiv.org/abs/1906.00513v3 |
https://arxiv.org/pdf/1906.00513v3.pdf | |
PWC | https://paperswithcode.com/paper/190600513 |
Repo | https://github.com/jialinwu17/joint_vqa_and_caption |
Framework | pytorch |
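One possible reading of the online gradient-based selection is sketched below; the gradient-agreement criterion and all names here are my interpretation of the abstract, not the released code:

```python
# Keep captions whose captioning-loss gradient on the shared image-question features
# points in a similar direction to the VQA answer-loss gradient.
import torch
import torch.nn.functional as F

def select_relevant_captions(shared_feat, vqa_loss, caption_losses, keep=1):
    """shared_feat: joint features with requires_grad=True; caption_losses: one scalar per candidate."""
    g_vqa = torch.autograd.grad(vqa_loss, shared_feat, retain_graph=True)[0].flatten()
    scores = []
    for cap_loss in caption_losses:
        g_cap = torch.autograd.grad(cap_loss, shared_feat, retain_graph=True)[0].flatten()
        scores.append(F.cosine_similarity(g_vqa, g_cap, dim=0))
    return torch.stack(scores).topk(keep).indices    # indices of the most question-relevant captions
```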
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth
Title | Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth |
Authors | Davy Neven, Bert De Brabandere, Marc Proesmans, Luc Van Gool |
Abstract | Current state-of-the-art instance segmentation methods are not suited for real-time applications like autonomous driving, which require fast execution times at high accuracy. Although the currently dominant proposal-based methods have high accuracy, they are slow and generate masks at a fixed and low resolution. Proposal-free methods, by contrast, can generate masks at high resolution and are often faster, but fail to reach the same accuracy as the proposal-based methods. In this work we propose a new clustering loss function for proposal-free instance segmentation. The loss function pulls the spatial embeddings of pixels belonging to the same instance together and jointly learns an instance-specific clustering bandwidth, maximizing the intersection-over-union of the resulting instance mask. When combined with a fast architecture, the network can perform instance segmentation in real-time while maintaining a high accuracy. We evaluate our method on the challenging Cityscapes benchmark and achieve top results (5% improvement over Mask R-CNN) at more than 10 fps on 2MP images. Code will be available at https://github.com/davyneven/SpatialEmbeddings . |
Tasks | Autonomous Driving, Instance Segmentation, Semantic Segmentation |
Published | 2019-06-26 |
URL | https://arxiv.org/abs/1906.11109v2 |
https://arxiv.org/pdf/1906.11109v2.pdf | |
PWC | https://paperswithcode.com/paper/instance-segmentation-by-jointly-optimizing-1 |
Repo | https://github.com/davyneven/SpatialEmbeddings |
Framework | pytorch |
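The clustering loss can be sketched for a single instance as below; the paper optimises a Lovász hinge on the resulting soft mask, and binary cross-entropy is substituted here only for brevity:

```python
# Pixels predict an offset towards their instance centre plus a per-pixel bandwidth;
# the learned, instance-specific sigma turns distances into a soft mask that is
# compared against the ground-truth instance mask.
import torch
import torch.nn.functional as F

def instance_loss(coords, offsets, log_sigma, instance_mask):
    """coords, offsets: (2, H, W); log_sigma: (1, H, W); instance_mask: (H, W) bool for one instance."""
    emb = coords + offsets                                     # spatial embedding of every pixel
    centre = emb[:, instance_mask].mean(dim=1)                 # (2,) instance centre in embedding space
    sigma = log_sigma[:, instance_mask].mean().exp()           # learned clustering bandwidth
    dist2 = ((emb - centre.view(2, 1, 1)) ** 2).sum(dim=0)     # (H, W) squared distances to the centre
    prob = torch.exp(-dist2 / (2 * sigma ** 2))                # soft mask: high inside the cluster
    return F.binary_cross_entropy(prob.clamp(1e-6, 1 - 1e-6), instance_mask.float())
```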
Budgeted Reinforcement Learning in Continuous State Space
Title | Budgeted Reinforcement Learning in Continuous State Space |
Authors | Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin |
Abstract | A Budgeted Markov Decision Process (BMDP) is an extension of a Markov Decision Process to critical applications requiring safety constraints. It relies on a notion of risk implemented in the shape of a cost signal constrained to lie below an adjustable threshold. So far, BMDPs could only be solved in the case of finite state spaces with known dynamics. This work extends the state of the art to environments with continuous state spaces and unknown dynamics. We show that the solution to a BMDP is a fixed point of a novel Budgeted Bellman Optimality operator. This observation allows us to introduce natural extensions of Deep Reinforcement Learning algorithms to address large-scale BMDPs. We validate our approach on two simulated applications: spoken dialogue and autonomous driving. |
Tasks | Autonomous Driving |
Published | 2019-03-03 |
URL | https://arxiv.org/abs/1903.01004v3 |
https://arxiv.org/pdf/1903.01004v3.pdf | |
PWC | https://paperswithcode.com/paper/scaling-up-budgeted-reinforcement-learning |
Repo | https://github.com/eleurent/rl-agents |
Framework | pytorch |
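Only the acting intuition is sketched here; the paper's Budgeted Bellman Optimality operator mixes actions to meet the budget exactly, which this hard, greedy version deliberately ignores:

```python
# Two value heads per state-action: Qr (expected return) and Qc (expected cost).
# Act greedily on Qr among the actions whose expected cost fits the remaining budget.
import torch

def budgeted_greedy(qr, qc, budget):
    """qr, qc: (A,) value estimates for one state; budget: remaining cost budget (scalar)."""
    feasible = qc <= budget
    if not feasible.any():
        return int(qc.argmin())                       # nothing feasible: minimise expected cost
    return int(qr.masked_fill(~feasible, float('-inf')).argmax())
```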
Dr.VOT : Measuring Positive and Negative Voice Onset Time in the Wild
Title | Dr.VOT : Measuring Positive and Negative Voice Onset Time in the Wild |
Authors | Yosi Shrem, Matthew Goldrick, Joseph Keshet |
Abstract | Voice Onset Time (VOT), a key measurement of speech for basic research and applied medical studies, is the time between the onset of a stop burst and the onset of voicing. When the voicing onset precedes burst onset the VOT is negative; if voicing onset follows the burst, it is positive. In this work, we present a deep-learning model for accurate and reliable measurement of VOT in naturalistic speech. The proposed system addresses two critical issues: it can measure positive and negative VOT equally well, and it is trained to be robust to variation across annotations. Our approach is based on the structured prediction framework, where the feature functions are defined to be RNNs. These learn to capture segmental variation in the signal. Results suggest that our method substantially improves over the current state-of-the-art. In contrast to previous work, our Deep and Robust VOT annotator, Dr.VOT, can successfully estimate negative VOTs while maintaining state-of-the-art performance on positive VOTs. This high level of performance generalizes to new corpora without further retraining. Index Terms: structured prediction, multi-task learning, adversarial training, recurrent neural networks, sequence segmentation. |
Tasks | Multi-Task Learning, Structured Prediction |
Published | 2019-10-27 |
URL | https://arxiv.org/abs/1910.13255v1 |
https://arxiv.org/pdf/1910.13255v1.pdf | |
PWC | https://paperswithcode.com/paper/drvot-measuring-positive-and-negative-voice |
Repo | https://github.com/MLSpeech/Dr.VOT |
Framework | none |
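For illustration only, a much-reduced version of the structured scoring is given below; the real system uses task-specific feature functions, multi-task and adversarial training, and does not enumerate candidate pairs naively:

```python
# Score every candidate (voicing onset, burst onset) pair with RNN states at the two
# boundaries; the signed gap of the best pair is the VOT (negative when voicing leads).
import torch
import torch.nn as nn

class VOTScorer(nn.Module):
    def __init__(self, n_feats=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_feats, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(4 * hidden, 1)        # concatenated states at the two boundaries

    def forward(self, frames):                       # frames: (1, T, n_feats) acoustic features
        h, _ = self.rnn(frames)                      # (1, T, 2 * hidden)
        T = h.size(1)
        pairs = [(v, b) for v in range(T) for b in range(T) if v != b]
        feats = torch.stack([torch.cat([h[0, v], h[0, b]]) for v, b in pairs])
        best = int(self.score(feats).squeeze(-1).argmax())
        v, b = pairs[best]
        return v - b                                 # VOT in frames: negative if voicing precedes the burst
```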
MUSEFood: Multi-sensor-based Food Volume Estimation on Smartphones
Title | MUSEFood: Multi-sensor-based Food Volume Estimation on Smartphones |
Authors | Junyi Gao, Weihao Tan, Liantao Ma, Yasha Wang, Wen Tang |
Abstract | Research has shown that diet recording can help people increase awareness of food intake and improve nutrition management, and thereby maintain a healthier life. Recently, researchers have been working on smartphone-based diet recording methods and applications that help users accomplish two tasks: record what they eat and how much they eat. Although the former task has made great progress through adopting image recognition technology, it is still a challenge to estimate the volume of foods accurately and conveniently. In this paper, we propose a novel method, named MUSEFood, for food volume estimation. MUSEFood uses the camera to capture photos of the food, but unlike existing volume measurement methods, MUSEFood requires neither training images with volume information nor placing a reference object of known size while taking photos. In addition, considering the impact of different containers on the contour shape of foods, MUSEFood uses a multi-task learning framework to improve the accuracy of food segmentation, and uses a differential model applicable to various containers to further reduce the negative impact of container differences on volume estimation accuracy. Furthermore, MUSEFood uses the microphone and the speaker to accurately measure the vertical distance from the camera to the food in a noisy environment, thus scaling the size of food in the image to its actual size. The experiments on real foods indicate that MUSEFood outperforms state-of-the-art approaches, and greatly improves the speed of food volume estimation. |
Tasks | Multi-Task Learning |
Published | 2019-03-18 |
URL | https://arxiv.org/abs/1903.07437v3 |
https://arxiv.org/pdf/1903.07437v3.pdf | |
PWC | https://paperswithcode.com/paper/musefood-multi-sensor-based-food-volume |
Repo | https://github.com/MUSEFood/MUSEFood |
Framework | tf |
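The acoustic ranging step can be sketched as a simple time-of-flight estimate; the chirp design, peak picking, and constants below are my own stand-ins for whatever the app actually uses:

```python
# Emit a chirp, record it, and convert the delay between the direct speaker-to-mic
# path and the first reflection into a one-way camera-to-food distance; that distance
# then fixes the pixel-to-centimetre scale of the segmented food contour.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s at roughly 20 degrees C

def distance_from_echo(emitted, recorded, sample_rate):
    corr = np.correlate(recorded, emitted, mode='valid')        # match the chirp in the recording
    direct = int(np.argmax(corr))                               # direct speaker-to-mic arrival
    echo = direct + 1 + int(np.argmax(corr[direct + 1:]))       # strongest later arrival (naive pick)
    delay = (echo - direct) / sample_rate
    return SPEED_OF_SOUND * delay / 2.0                         # round trip -> one-way distance (m)
```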
Learning Lightweight Lane Detection CNNs by Self Attention Distillation
Title | Learning Lightweight Lane Detection CNNs by Self Attention Distillation |
Authors | Yuenan Hou, Zheng Ma, Chunxiao Liu, Chen Change Loy |
Abstract | Training deep models for lane detection is challenging due to the very subtle and sparse supervisory signals inherent in lane annotations. Without learning from much richer context, these models often fail in challenging scenarios, e.g., severe occlusion, ambiguous lanes, and poor lighting conditions. In this paper, we present a novel knowledge distillation approach, i.e., Self Attention Distillation (SAD), which allows a model to learn from itself and gain substantial improvement without any additional supervision or labels. Specifically, we observe that attention maps extracted from a model trained to a reasonable level already encode rich contextual information. This valuable contextual information can be used as a form of ‘free’ supervision for further representation learning, by performing top-down and layer-wise attention distillation within the network itself. SAD can be easily incorporated into any feedforward convolutional neural network (CNN) and does not increase the inference time. We validate SAD on three popular lane detection benchmarks (TuSimple, CULane and BDD100K) using lightweight models such as ENet, ResNet-18 and ResNet-34. The lightest model, ENet-SAD, performs comparably to or even surpasses existing algorithms. Notably, ENet-SAD has 20× fewer parameters and runs 10× faster than the state-of-the-art SCNN, while still achieving compelling performance on all benchmarks. Our code is available at https://github.com/cardwing/Codes-for-Lane-Detection. |
Tasks | Lane Detection, Representation Learning |
Published | 2019-08-02 |
URL | https://arxiv.org/abs/1908.00821v1 |
https://arxiv.org/pdf/1908.00821v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-lightweight-lane-detection-cnns-by |
Repo | https://github.com/cardwing/Codes-for-Lane-Detection |
Framework | tf |
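The layer-wise attention distillation admits a short sketch; the activation-based attention map (channel-wise squared mean) follows the usual attention-transfer formulation, while the layer pairing and loss weighting are left out:

```python
# Use the attention map of a deeper block as a 'free' target for the block before it;
# no labels or external teacher are involved, and inference is unchanged.
import torch
import torch.nn.functional as F

def attention_map(feat):
    """feat: (B, C, H, W) -> (B, H*W) channel-aggregated, L2-normalised spatial attention."""
    amap = feat.pow(2).mean(dim=1)                               # (B, H, W)
    return F.normalize(amap.flatten(1), dim=1)

def sad_loss(shallow_feat, deep_feat):
    deep = F.interpolate(deep_feat, size=shallow_feat.shape[-2:], mode='bilinear', align_corners=False)
    return F.mse_loss(attention_map(shallow_feat), attention_map(deep).detach())
```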
BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition
Title | BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition |
Authors | Boyan Zhou, Quan Cui, Xiu-Shen Wei, Zhao-Min Chen |
Abstract | Our work focuses on tackling the challenging but natural visual recognition task of long-tailed data distribution (i.e., a few classes occupy most of the data, while most classes have only a few samples). In the literature, class re-balancing strategies (e.g., re-weighting and re-sampling) are the prominent and effective methods proposed to alleviate the extreme imbalance of long-tailed problems. In this paper, we first show that these re-balancing methods achieve satisfactory recognition accuracy because they significantly promote the classifier learning of deep networks. However, at the same time, they unexpectedly damage the representative ability of the learned deep features to some extent. Therefore, we propose a unified Bilateral-Branch Network (BBN) to take care of both representation learning and classifier learning simultaneously, where each branch performs its own duty separately. In particular, our BBN model is further equipped with a novel cumulative learning strategy, which is designed to first learn the universal patterns and then gradually pay attention to the tail data. Extensive experiments on four benchmark datasets, including the large-scale iNaturalist ones, show that the proposed BBN significantly outperforms state-of-the-art methods. Furthermore, validation experiments demonstrate both our preliminary discovery and the effectiveness of the tailored designs in BBN for long-tailed problems. Our method won first place in the iNaturalist 2019 large-scale species classification competition, and our code is open source and available at https://github.com/Megvii-Nanjing/BBN. |
Tasks | Representation Learning |
Published | 2019-12-05 |
URL | https://arxiv.org/abs/1912.02413v4 |
https://arxiv.org/pdf/1912.02413v4.pdf | |
PWC | https://paperswithcode.com/paper/bbn-bilateral-branch-network-with-cumulative |
Repo | https://github.com/Megvii-Nanjing/BBN |
Framework | pytorch |
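The cumulative learning strategy reduces to a scheduled mixing of the two branches; the parabolic decay below is one plausible schedule and the variable names are mine:

```python
# Conventional branch (uniform sampling) and re-balanced branch (reversed sampling)
# share a mixed logit; the weight alpha decays over training, shifting the focus
# from universal (head) patterns to the tail classes.
import torch
import torch.nn.functional as F

def bbn_loss(logits_conv, logits_rebal, y_conv, y_rebal, epoch, max_epoch):
    alpha = 1.0 - (epoch / max_epoch) ** 2                       # assumed decay schedule
    mixed = alpha * logits_conv + (1.0 - alpha) * logits_rebal
    return alpha * F.cross_entropy(mixed, y_conv) + (1.0 - alpha) * F.cross_entropy(mixed, y_rebal)
```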
No Training Required: Exploring Random Encoders for Sentence Classification
Title | No Training Required: Exploring Random Encoders for Sentence Classification |
Authors | John Wieting, Douwe Kiela |
Abstract | We explore various methods for computing sentence representations from pre-trained word embeddings without any training, i.e., using nothing but random parameterizations. Our aim is to put sentence embeddings on more solid footing by 1) looking at how much modern sentence embeddings gain over random methods—as it turns out, surprisingly little; and by 2) providing the field with more appropriate baselines going forward—which are, as it turns out, quite strong. We also make important observations about proper experimental protocol for sentence classification evaluation, together with recommendations for future research. |
Tasks | Sentence Classification, Sentence Embeddings, Word Embeddings |
Published | 2019-01-29 |
URL | http://arxiv.org/abs/1901.10444v1 |
http://arxiv.org/pdf/1901.10444v1.pdf | |
PWC | https://paperswithcode.com/paper/no-training-required-exploring-random |
Repo | https://github.com/facebookresearch/randsent |
Framework | pytorch |
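One of the random baselines, a bag of random embedding projections, is easy to sketch; the dimensions and pooling choice here are illustrative:

```python
# Pre-trained word vectors are passed through a fixed, randomly initialised projection
# and max-pooled over the sentence; nothing in the encoder is ever trained.
import torch
import torch.nn as nn

class RandomProjectionEncoder(nn.Module):
    def __init__(self, emb_dim=300, out_dim=4096):
        super().__init__()
        self.proj = nn.Linear(emb_dim, out_dim, bias=False)
        for p in self.parameters():
            p.requires_grad_(False)              # random weights stay frozen

    def forward(self, word_vectors):             # (B, T, emb_dim) pre-trained embeddings
        return torch.relu(self.proj(word_vectors)).max(dim=1).values
```

A classifier trained on top of these pooled representations is then the only learned component, which is what makes the comparison with trained sentence encoders informative.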
A Fair Comparison of Graph Neural Networks for Graph Classification
Title | A Fair Comparison of Graph Neural Networks for Graph Classification |
Authors | Federico Errica, Marco Podda, Davide Bacciu, Alessio Micheli |
Abstract | Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about the lack of both in scientific publications, in an effort to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigour and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided in order to compare fairly with the state of the art. To counter this troubling trend, we ran more than 47,000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field, by providing a much-needed grounding for rigorous evaluations of graph classification models. |
Tasks | Graph Classification, Graph Representation Learning, Representation Learning |
Published | 2019-12-20 |
URL | https://arxiv.org/abs/1912.09893v2 |
https://arxiv.org/pdf/1912.09893v2.pdf | |
PWC | https://paperswithcode.com/paper/a-fair-comparison-of-graph-neural-networks-1 |
Repo | https://github.com/diningphil/gnn-comparison |
Framework | pytorch |
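A structure-agnostic baseline in the spirit of the paper can be sketched quickly; the paper's actual baselines and its model selection and assessment protocol are separate from this and more elaborate:

```python
# Sum node features per graph while ignoring edges entirely, then classify with an MLP;
# a GNN that cannot beat this is not exploiting structural information.
import torch
import torch.nn as nn

class StructureAgnosticBaseline(nn.Module):
    def __init__(self, in_dim, hidden=128, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, node_features, graph_index):
        """node_features: (N, in_dim) over all nodes in the batch; graph_index: (N,) graph id per node."""
        num_graphs = int(graph_index.max()) + 1
        pooled = torch.zeros(num_graphs, node_features.size(1), device=node_features.device)
        pooled.index_add_(0, graph_index, node_features)    # per-graph sum pooling, edges unused
        return self.mlp(pooled)
```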