January 31, 2020

3463 words 17 mins read

Paper Group ANR 177

A HVS-inspired Attention to Improve Loss Metrics for CNN-based Perception-Oriented Super-Resolution. Privacy-Preserving Action Recognition using Coded Aperture Videos. Low-Rank Pairwise Alignment Bilinear Network For Few-Shot Fine-Grained Image Classification. Qini-based Uplift Regression. Interaction is necessary for distributed learning with priv …

A HVS-inspired Attention to Improve Loss Metrics for CNN-based Perception-Oriented Super-Resolution

Title A HVS-inspired Attention to Improve Loss Metrics for CNN-based Perception-Oriented Super-Resolution
Authors Taimoor Tariq, Juan Luis Gonzalez, Munchurl Kim
Abstract Deep Convolutional Neural Network (CNN) features have been demonstrated to be effective perceptual quality features. The perceptual loss, based on feature maps of pre-trained CNNs, has proven to be remarkably effective for CNN-based perceptual image restoration problems. In this work, taking inspiration from the Human Visual System (HVS) and visual perception, we propose a spatial attention mechanism based on the dependency of human contrast sensitivity on spatial frequency. We identify regions in input images, based on the underlying spatial frequency, which are not generally well reconstructed during Super-Resolution but are most important in terms of visual sensitivity. Based on this prior, we design a spatial attention map that is applied to feature maps in the perceptual loss and its variants, helping them to identify regions that are of more perceptual importance. The results demonstrate that our technique improves the ability of the perceptual loss and contextual loss to deliver more natural images in CNN-based super-resolution.
Tasks Image Restoration, Super-Resolution
Published 2019-03-30
URL https://arxiv.org/abs/1904.00205v2
PDF https://arxiv.org/pdf/1904.00205v2.pdf
PWC https://paperswithcode.com/paper/a-hvs-inspired-attention-map-to-improve-cnn
Repo
Framework
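To make the idea concrete, here is a minimal, hedged sketch of a frequency-dependent spatial attention map applied to a feature-space loss. The difference-of-Gaussians band-pass proxy, the normalization, and all tensor shapes are illustrative assumptions, not the authors' exact HVS-based formulation; random tensors stand in for VGG feature maps.

```python
# Sketch only: a contrast-sensitivity-style spatial attention map weighting a
# perceptual (feature-space) loss. The band-pass construction is an assumption.
import torch
import torch.nn.functional as F

def bandpass_attention(luma, sigma_fine=1.0, sigma_coarse=4.0):
    """Difference-of-Gaussians response as a proxy for mid-frequency content."""
    def gauss_kernel(sigma, radius):
        x = torch.arange(-radius, radius + 1, dtype=torch.float32)
        g = torch.exp(-(x ** 2) / (2 * sigma ** 2))
        return (g / g.sum()).view(1, 1, -1)
    def blur(img, sigma):
        radius = int(3 * sigma)
        k = gauss_kernel(sigma, radius)
        img = F.conv2d(img, k.unsqueeze(2), padding=(0, radius))   # horizontal pass
        img = F.conv2d(img, k.unsqueeze(3), padding=(radius, 0))   # vertical pass
        return img
    dog = (blur(luma, sigma_fine) - blur(luma, sigma_coarse)).abs()
    return dog / (dog.amax(dim=(-2, -1), keepdim=True) + 1e-8)    # in [0, 1]

def attentive_perceptual_loss(feat_sr, feat_hr, attention):
    """Weight per-pixel feature differences by the (resized) attention map."""
    att = F.interpolate(attention, size=feat_sr.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (att * (feat_sr - feat_hr) ** 2).mean()

# toy usage with random tensors standing in for VGG feature maps
luma = torch.rand(1, 1, 128, 128)       # luminance of the HR target
feat_sr = torch.rand(1, 64, 32, 32)     # features of the super-resolved image
feat_hr = torch.rand(1, 64, 32, 32)     # features of the ground truth
print(float(attentive_perceptual_loss(feat_sr, feat_hr, bandpass_attention(luma))))
```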

Privacy-Preserving Action Recognition using Coded Aperture Videos

Title Privacy-Preserving Action Recognition using Coded Aperture Videos
Authors Zihao W. Wang, Vibhav Vineet, Francesco Pittaluga, Sudipta Sinha, Oliver Cossairt, Sing Bing Kang
Abstract The risk of unauthorized remote access of streaming video from networked cameras underlines the need for stronger privacy safeguards. We propose a lens-free coded aperture camera system for human action recognition that is privacy-preserving. While coded aperture systems exist, we believe ours is the first system designed for action recognition without the need for image restoration as an intermediate step. Action recognition is done using a deep network that takes as input non-invertible motion features between pairs of frames, computed using phase correlation and the log-polar transformation. Phase correlation encodes translation, while the log-polar transformation encodes in-plane rotation and scaling. We show that the translation features are independent of the coded aperture design, as long as its spectral response within the bandwidth has no zeros. Stacking motion features computed on frames at multiple different strides in the video can improve accuracy. Preliminary results on simulated data based on a subset of the UCF and NTU datasets are promising. We also describe our prototype lens-free coded aperture camera system; results for real captured videos are mixed.
Tasks Image Restoration, Temporal Action Localization
Published 2019-02-25
URL http://arxiv.org/abs/1902.09085v2
PDF http://arxiv.org/pdf/1902.09085v2.pdf
PWC https://paperswithcode.com/paper/privacy-preserving-action-recognition-using
Repo
Framework
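A small sketch of the phase-correlation step that produces the translation features described above; the log-polar stage, the coded-aperture optics, and the recognition network are omitted, and the function names are illustrative.

```python
# Sketch only: phase correlation between two frames; the peak of the
# cross-power-spectrum surface encodes their relative translation.
import numpy as np

def phase_correlation(frame_a, frame_b, eps=1e-8):
    """Return the normalized cross-power spectrum surface between two frames."""
    Fa, Fb = np.fft.fft2(frame_a), np.fft.fft2(frame_b)
    cross = Fa * np.conj(Fb)
    surface = np.fft.ifft2(cross / (np.abs(cross) + eps)).real
    return np.fft.fftshift(surface)

def estimate_shift(frame_a, frame_b):
    surf = phase_correlation(frame_a, frame_b)
    peak = np.unravel_index(np.argmax(surf), surf.shape)
    return np.array(peak) - np.array(surf.shape) // 2

# toy usage: a random image shifted by (5, -3) pixels
rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = np.roll(a, shift=(5, -3), axis=(0, 1))
print(estimate_shift(b, a))  # expect approximately [ 5 -3]
```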

Low-Rank Pairwise Alignment Bilinear Network For Few-Shot Fine-Grained Image Classification

Title Low-Rank Pairwise Alignment Bilinear Network For Few-Shot Fine-Grained Image Classification
Authors Huaxi Huang, Junjie Zhang, Jian Zhang, Jingsong Xu, Qiang Wu
Abstract Deep neural networks have demonstrated advanced abilities on various visual classification tasks, which heavily rely on large-scale training samples with annotated ground-truth. However, it is unrealistic to always require such annotation in real-world applications. Recently, Few-Shot learning (FS), as an attempt to address the shortage of training samples, has made significant progress in generic classification tasks. Nonetheless, it is still challenging for current FS models to distinguish the subtle differences between fine-grained categories given limited training data. To fill this classification gap, in this paper we address the Few-Shot Fine-Grained (FSFG) classification problem, which focuses on tackling fine-grained classification under the challenging few-shot learning setting. A novel low-rank pairwise bilinear pooling operation is proposed to capture the nuanced differences between the support and query images for learning an effective distance metric. Moreover, a feature alignment layer is designed to match the support image features with query ones before the comparison. We name the proposed model the Low-Rank Pairwise Alignment Bilinear Network (LRPABN), which is trained in an end-to-end fashion. Comprehensive experimental results on four widely used fine-grained classification datasets demonstrate that our LRPABN model achieves superior performance compared to state-of-the-art methods.
Tasks Few-Shot Learning, Fine-Grained Image Classification, Image Classification
Published 2019-08-04
URL https://arxiv.org/abs/1908.01313v2
PDF https://arxiv.org/pdf/1908.01313v2.pdf
PWC https://paperswithcode.com/paper/low-rank-pairwise-alignment-bilinear-network
Repo
Framework
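Below is a hedged sketch of a low-rank pairwise bilinear comparator between support and query features, in the spirit of LRPABN. The feature dimension, rank, and scoring head are assumptions, and the feature alignment layer is not shown.

```python
# Sketch only: low-rank bilinear interaction between support prototypes and
# query features producing per-class relation logits.
import torch
import torch.nn as nn

class LowRankPairwiseBilinear(nn.Module):
    def __init__(self, feat_dim=512, rank=64):
        super().__init__()
        self.U = nn.Linear(feat_dim, rank, bias=False)  # projects support features
        self.V = nn.Linear(feat_dim, rank, bias=False)  # projects query features
        self.score = nn.Linear(rank, 1)                 # relation score head

    def forward(self, support, query):
        # support: (n_way, d) class prototypes, query: (n_query, d)
        s = self.U(support).unsqueeze(0)   # (1, n_way, rank)
        q = self.V(query).unsqueeze(1)     # (n_query, 1, rank)
        pairwise = s * q                   # low-rank bilinear interaction
        return self.score(pairwise).squeeze(-1)  # (n_query, n_way) logits

# toy 5-way episode with 512-d features
net = LowRankPairwiseBilinear()
logits = net(torch.randn(5, 512), torch.randn(15, 512))
print(logits.shape)  # torch.Size([15, 5])
```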

Qini-based Uplift Regression

Title Qini-based Uplift Regression
Authors Mouloud Belbahri, Alejandro Murua, Olivier Gandouet, Vahid Partovi Nia
Abstract Uplift models provide a solution to the problem of isolating the marketing effect of a campaign. For customer churn reduction, uplift models are used to identify the customers who are likely to respond positively to a retention activity only if targeted, and to avoid wasting resources on customers that are very likely to switch to another company. We introduce a Qini-based uplift regression model to analyze a large insurance company’s retention marketing campaign. Our approach is based on logistic regression models. We show that a Qini-optimized uplift model acts as a regularizing factor for uplift, much as a penalized likelihood model does for regression. This results in interpretable, parsimonious models with few relevant explanatory variables. Our results show that performing Qini-based variable selection significantly improves the uplift models’ performance.
Tasks
Published 2019-11-28
URL https://arxiv.org/abs/1911.12474v1
PDF https://arxiv.org/pdf/1911.12474v1.pdf
PWC https://paperswithcode.com/paper/qini-based-uplift-regression
Repo
Framework
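As a rough illustration of the quantity the model is optimized toward, here is a sketch that computes a Qini curve and a Qini-style coefficient from predicted uplift scores. The Qini-optimized logistic uplift regression itself is not reproduced, and the variable names and toy data are illustrative.

```python
# Sketch only: Qini curve and average gain over random targeting.
import numpy as np

def qini_curve(uplift_score, outcome, treated):
    """Qini values at each targeting depth, customers ranked by descending uplift."""
    order = np.argsort(-uplift_score)
    y, t = outcome[order], treated[order]
    cum_resp_t = np.cumsum(y * t)         # cumulative responders among the treated
    cum_resp_c = np.cumsum(y * (1 - t))   # cumulative responders among the controls
    cum_n_t, cum_n_c = np.cumsum(t), np.cumsum(1 - t)
    ratio = np.divide(cum_n_t, cum_n_c, out=np.zeros(len(t)), where=cum_n_c > 0)
    return cum_resp_t - cum_resp_c * ratio  # incremental responders gained

def qini_coefficient(uplift_score, outcome, treated):
    """Average gain of the Qini curve over the random-targeting diagonal."""
    q = qini_curve(uplift_score, outcome, treated)
    random_policy = np.linspace(q[-1] / len(q), q[-1], len(q))
    return float((q - random_policy).mean())

# toy data: 1000 customers, binary outcome, random 50/50 treatment assignment
rng = np.random.default_rng(1)
treated = rng.integers(0, 2, 1000)
score = rng.random(1000)
outcome = rng.binomial(1, 0.1 + 0.1 * treated * score)  # uplift grows with score
print(round(qini_coefficient(score, outcome, treated), 3))
```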

Interaction is necessary for distributed learning with privacy or communication constraints

Title Interaction is necessary for distributed learning with privacy or communication constraints
Authors Yuval Dagan, Vitaly Feldman
Abstract Local differential privacy (LDP) is a model where users send privatized data to an untrusted central server whose goal is to solve some data analysis task. In the non-interactive version of this model the protocol consists of a single round in which the server sends requests to all users and then receives their responses. This version is deployed in industry due to its practical advantages and has attracted significant research interest. Our main result is an exponential lower bound on the number of samples necessary to solve the standard task of learning a large-margin linear separator in the non-interactive LDP model. Via a standard reduction, this lower bound implies an exponential lower bound for stochastic convex optimization and, specifically, for learning linear models with a convex, Lipschitz and smooth loss. These results answer the questions posed in \citep{SmithTU17,DanielyF18}. Our lower bound relies on a new technique for constructing pairs of distributions with nearly matching moments but whose supports can be nearly separated by a large-margin hyperplane. These lower bounds also hold in the model where communication from each user is limited and follow from a lower bound on learning using non-adaptive \emph{statistical queries}.
Tasks
Published 2019-11-11
URL https://arxiv.org/abs/1911.04014v1
PDF https://arxiv.org/pdf/1911.04014v1.pdf
PWC https://paperswithcode.com/paper/interaction-is-necessary-for-distributed
Repo
Framework
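For readers unfamiliar with the setting, the sketch below illustrates the non-interactive local-DP model the lower bound applies to: a single round in which each user privatizes one bit via randomized response and the server aggregates. It only illustrates the protocol model; the paper's contribution is the sample-complexity lower bound for learning in this model, which is not reproduced here.

```python
# Sketch only: one-round (non-interactive) local DP via randomized response.
import numpy as np

def randomized_response(bit, epsilon, rng):
    """Report the true bit with prob e^eps/(e^eps+1), else flip it (eps-LDP)."""
    p_true = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p_true else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased server-side estimate of the true mean from privatized bits."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
true_bits = rng.binomial(1, 0.3, size=50_000)
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
print(round(debiased_mean(reports, epsilon=1.0), 3))  # close to 0.3
```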

No-Reference Video Quality Assessment using Multi-Level Spatially Pooled Features

Title No-Reference Video Quality Assessment using Multi-Level Spatially Pooled Features
Authors Franz Götz-Hahn, Vlad Hosu, Hanhe Lin, Dietmar Saupe
Abstract Video Quality Assessment (VQA) methods have been designed with a focus on particular degradation types, usually artificially induced on a small set of reference videos. Hence, most traditional VQA methods under-perform in-the-wild. Deep learning approaches have had limited success due to the small size and diversity of existing VQA datasets, either artificial or authentically distorted. We introduce a new in-the-wild VQA dataset that is substantially larger and more diverse: FlickrVid-150k. It consists of a coarsely annotated set of 153,841 videos having 5 quality ratings each, and 1600 videos with a minimum of 89 ratings each. Additionally, we propose new efficient VQA approaches (MLSP-VQA) relying on multi-level spatially pooled deep features (MLSP). They are extremely well suited for training at scale, compared to deep transfer learning approaches. Our best method, MLSP-VQA-FF, improves the Spearman Rank-order Correlation Coefficient (SRCC) performance metric on the standard KonVid-1k in-the-wild benchmark dataset to 0.83, surpassing the best existing deep-learning model (0.8 SRCC) and hand-crafted feature-based method (0.78 SRCC). We further investigate how alternative approaches perform under different levels of label noise and dataset size, showing that MLSP-VQA-FF is the overall best method. Finally, we show that MLSP-VQA-FF trained on FlickrVid-150k sets the new state-of-the-art for cross-test performance on KonVid-1k and LIVE-Qualcomm with 0.79 and 0.58 SRCC, respectively, showing excellent generalization.
Tasks Transfer Learning, Video Quality Assessment, Visual Question Answering
Published 2019-12-17
URL https://arxiv.org/abs/1912.07966v1
PDF https://arxiv.org/pdf/1912.07966v1.pdf
PWC https://paperswithcode.com/paper/no-reference-video-quality-assessment-using
Repo
Framework
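A hedged sketch of the multi-level spatially pooled (MLSP) feature idea: globally pool activations from several depths of a CNN and concatenate them into one frame descriptor. A tiny stand-in backbone replaces Inception-ResNet-v2, so all channel counts are assumptions.

```python
# Sketch only: multi-level spatial pooling of CNN activations.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        return [f1, f2, f3]   # activations at multiple levels

def mlsp_features(feature_maps):
    pooled = [f.mean(dim=(-2, -1)) for f in feature_maps]  # global average pool
    return torch.cat(pooled, dim=1)                        # (batch, sum of channels)

frames = torch.rand(8, 3, 224, 224)            # 8 frames of a video
features = mlsp_features(TinyBackbone()(frames))
print(features.shape)                          # torch.Size([8, 112])
```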

Differentiating Features for Scene Segmentation Based on Dedicated Attention Mechanisms

Title Differentiating Features for Scene Segmentation Based on Dedicated Attention Mechanisms
Authors Zhiqiang Xiong, Zhicheng Wang, Zhaohui Yu, Xi Gu
Abstract Semantic segmentation is a challenge in scene parsing. It requires both context information and rich spatial information. In this paper, we differentiate features for scene segmentation based on dedicated attention mechanisms (DF-DAM), and two attention modules are proposed to optimize the high-level and low-level features in the encoder, respectively. Specifically, we use the high-level and low-level features of ResNet as the sources of context information and spatial information, respectively, and optimize them with an attention fusion module and a 2D position attention module. For the attention fusion module, we adopt dual channel weights to selectively adjust the channel maps of the two highest-stage features of ResNet and fuse them to obtain context information. For the 2D position attention module, we use the context information obtained by the attention fusion module to assist the selection of the lowest-stage features of ResNet as supplementary spatial information. Finally, the two sets of information obtained by the two modules are simply fused to obtain the prediction. We evaluate our approach on the Cityscapes and PASCAL VOC 2012 datasets. In particular, our architecture contains no complicated or redundant processing modules, which greatly reduces its complexity, and we achieve 82.3% Mean IoU on the PASCAL VOC 2012 test set without pre-training on MS-COCO.
Tasks Scene Parsing, Scene Segmentation, Semantic Segmentation
Published 2019-11-19
URL https://arxiv.org/abs/1911.08149v1
PDF https://arxiv.org/pdf/1911.08149v1.pdf
PWC https://paperswithcode.com/paper/differentiating-features-for-scene
Repo
Framework
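A rough sketch of a dual-channel-weight attention fusion in the spirit of the paper's attention fusion module: per-channel weights rescale the two highest-stage features before they are fused. The gating, channel counts, and upsampling choices are assumptions, and the 2D position attention module is not shown.

```python
# Sketch only: channel-weighted fusion of two high-level feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, c_high=2048, c_low=1024, c_out=512):
        super().__init__()
        self.gate_high = nn.Sequential(nn.Linear(c_high, c_high), nn.Sigmoid())
        self.gate_low = nn.Sequential(nn.Linear(c_low, c_low), nn.Sigmoid())
        self.proj = nn.Conv2d(c_high + c_low, c_out, 1)

    def forward(self, feat_high, feat_low):
        # channel weights computed from globally pooled descriptors
        w_high = self.gate_high(feat_high.mean(dim=(-2, -1)))[..., None, None]
        w_low = self.gate_low(feat_low.mean(dim=(-2, -1)))[..., None, None]
        feat_high = F.interpolate(feat_high * w_high, size=feat_low.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return self.proj(torch.cat([feat_high, feat_low * w_low], dim=1))

fuse = AttentionFusion()
context = fuse(torch.rand(1, 2048, 16, 16), torch.rand(1, 1024, 32, 32))
print(context.shape)  # torch.Size([1, 512, 32, 32])
```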

Deep Bayesian Active Learning for Multiple Correct Outputs

Title Deep Bayesian Active Learning for Multiple Correct Outputs
Authors Khaled Jedoui, Ranjay Krishna, Michael Bernstein, Li Fei-Fei
Abstract Typical active learning strategies are designed for tasks, such as classification, with the assumption that the output space is mutually exclusive. The assumption that these tasks always have exactly one correct answer has resulted in the creation of numerous uncertainty-based measurements, such as entropy and least confidence, which operate over a model’s outputs. Unfortunately, many real-world vision tasks, like visual question answering and image captioning, have multiple correct answers, causing these measurements to overestimate uncertainty and sometimes perform worse than a random sampling baseline. In this paper, we propose a new paradigm that estimates uncertainty in the model’s internal hidden space instead of the model’s output space. We specifically study a manifestation of this problem for visual question answer generation (VQA), where the aim is not to classify the correct answer but to produce a natural language answer, given an image and a question. Our method overcomes the paraphrastic nature of language. It requires a semantic space that structures the model’s output concepts and that enables the usage of techniques like dropout-based Bayesian uncertainty. We build a visual-semantic space that embeds paraphrases close together for any existing VQA model. We empirically show state-of-the-art active learning results on the task of VQA on two datasets, being 5 times more cost-efficient on Visual Genome and 3 times more cost-efficient on VQA 2.0.
Tasks Active Learning, Image Captioning, Question Answering, Visual Question Answering
Published 2019-12-02
URL https://arxiv.org/abs/1912.01119v2
PDF https://arxiv.org/pdf/1912.01119v2.pdf
PWC https://paperswithcode.com/paper/deep-bayesian-active-learning-for-multiple
Repo
Framework
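A minimal sketch of the core idea: estimate uncertainty with MC dropout in a hidden/semantic space rather than over output tokens. The embedding network below is a toy stand-in; the paper builds a visual-semantic paraphrase space on top of a full VQA model.

```python
# Sketch only: MC-dropout uncertainty in an embedding space for active learning.
import torch
import torch.nn as nn

embedder = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                         nn.Dropout(p=0.5), nn.Linear(128, 64))

def hidden_space_uncertainty(x, n_samples=20):
    embedder.train()  # keep dropout stochastic at "test" time
    with torch.no_grad():
        samples = torch.stack([embedder(x) for _ in range(n_samples)])  # (T, B, d)
    return samples.var(dim=0).sum(dim=-1)  # per-example variance in embedding space

pool = torch.rand(100, 256)              # unlabeled candidate pool
scores = hidden_space_uncertainty(pool)
query_idx = scores.topk(10).indices      # pick the most uncertain examples to label
print(query_idx.tolist())
```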

Boundary-Aware Feature Propagation for Scene Segmentation

Title Boundary-Aware Feature Propagation for Scene Segmentation
Authors Henghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat Thalmann, Gang Wang
Abstract In this work, we address the challenging issue of scene segmentation. To increase the feature similarity of the same object while keeping the feature discrimination of different objects, we explore propagating information throughout the image under the control of objects’ boundaries. To this end, we first propose to learn the boundary as an additional semantic class to enable the network to be aware of the boundary layout. Then, we propose unidirectional acyclic graphs (UAGs) to model the function of undirected cyclic graphs (UCGs), which structure the image by building pixel-by-pixel graph connections, in an efficient and effective way. Furthermore, we propose a boundary-aware feature propagation (BFP) module to harvest and propagate the local features within their regions isolated by the learned boundaries in the UAG-structured image. The proposed BFP is capable of splitting the feature propagation into a set of semantic groups by building strong connections within the same segment region but weak connections between different segment regions. Without bells and whistles, our approach achieves new state-of-the-art segmentation performance on three challenging semantic segmentation datasets, i.e., PASCAL-Context, CamVid, and Cityscapes.
Tasks Scene Segmentation, Semantic Segmentation
Published 2019-08-31
URL https://arxiv.org/abs/1909.00179v1
PDF https://arxiv.org/pdf/1909.00179v1.pdf
PWC https://paperswithcode.com/paper/boundary-aware-feature-propagation-for-scene
Repo
Framework
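A 1-D toy sketch of boundary-gated feature propagation: features flow to the next pixel only where the predicted boundary probability is low, which weakens connections across segment borders. The actual BFP module propagates over UAGs in several scan directions; the gating and decay used here are assumptions.

```python
# Sketch only: left-to-right feature propagation gated by boundary probability.
import torch

def propagate_left_to_right(features, boundary_prob, decay=0.9):
    # features: (B, C, H, W), boundary_prob: (B, 1, H, W) in [0, 1]
    out = features.clone()
    gate = decay * (1.0 - boundary_prob)        # low gate at likely boundaries
    for x in range(1, features.shape[-1]):
        out[..., x] = features[..., x] + gate[..., x] * out[..., x - 1]
    return out

feats = torch.rand(2, 64, 32, 32)
boundaries = torch.rand(2, 1, 32, 32)
print(propagate_left_to_right(feats, boundaries).shape)  # torch.Size([2, 64, 32, 32])
```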

Inductive Transfer for Neural Architecture Optimization

Title Inductive Transfer for Neural Architecture Optimization
Authors Martin Wistuba, Tejaswini Pedapati
Abstract The recent advent of automated neural network architecture search has led to several methods that outperform state-of-the-art human-designed architectures. However, these approaches are computationally expensive, in extreme cases consuming GPU years. We propose two novel methods which aim to expedite this optimization problem by transferring knowledge acquired from previous tasks to new ones. First, we propose a novel neural architecture selection method which employs this knowledge to identify strong and weak characteristics of neural architectures across datasets. Thus, these characteristics do not need to be rediscovered in every search, a major weakness of current state-of-the-art searches. Second, we propose a method for learning curve extrapolation to determine if a training process can be terminated early. In contrast to existing work, we propose to learn from learning curves of architectures trained on other datasets to improve the prediction accuracy for novel datasets. On five different image classification benchmarks, we empirically demonstrate that both of our orthogonal contributions independently lead to an acceleration, without any significant loss in accuracy.
Tasks Image Classification, Neural Architecture Search
Published 2019-03-08
URL http://arxiv.org/abs/1903.03536v1
PDF http://arxiv.org/pdf/1903.03536v1.pdf
PWC https://paperswithcode.com/paper/inductive-transfer-for-neural-architecture
Repo
Framework
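A hedged sketch of the learning-curve-extrapolation idea for early termination: fit a simple power law to the observed partial accuracy curve, extrapolate to the final epoch, and stop if the prediction stays below the incumbent. The paper additionally learns from curves of architectures trained on other datasets; the parametric form and threshold rule here are assumptions.

```python
# Sketch only: power-law extrapolation of a partial learning curve.
import numpy as np
from scipy.optimize import curve_fit

def power_law(epoch, a, b, c):
    return a - b * np.power(epoch, -c)

def should_terminate(partial_curve, final_epoch, best_so_far, margin=0.0):
    epochs = np.arange(1, len(partial_curve) + 1)
    params, _ = curve_fit(power_law, epochs, partial_curve,
                          p0=(partial_curve[-1], 0.5, 0.5), maxfev=5000)
    predicted_final = power_law(final_epoch, *params)
    return predicted_final + margin < best_so_far, predicted_final

# toy partial curve of validation accuracy over the first 10 epochs
observed = 0.8 - 0.4 * np.arange(1, 11) ** -0.7
stop, pred = should_terminate(observed, final_epoch=100, best_so_far=0.85)
print(stop, round(pred, 3))  # terminate early: the predicted final accuracy lags
```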

Online Path Generation and Navigation for Swarms of UAVs

Title Online Path Generation and Navigation for Swarms of UAVs
Authors Adnan Ashraf, Amin Majd, Elena Troubitsyna
Abstract With the growing popularity of Unmanned Aerial Vehicles (UAVs) for consumer applications, the number of accidents involving UAVs is also increasing rapidly. Therefore, motion safety of UAVs has become a prime concern for UAV operators. For a swarm of UAVs, a safe operation cannot be guaranteed without preventing the UAVs from colliding with one another and with static and dynamically appearing, moving obstacles in the flying zone. In this paper, we present an online, collision-free path generation and navigation system for swarms of UAVs. The proposed system uses geographical locations of the UAVs and of the successfully detected, static and moving obstacles to predict and avoid: (1) UAV-to-UAV collisions, (2) UAV-to-static-obstacle collisions, and (3) UAV-to-moving-obstacle collisions. Our collision prediction approach leverages efficient runtime monitoring and Complex Event Processing (CEP) to make timely predictions. A distinctive feature of the proposed system is its ability to foresee potential collisions and proactively find the best ways to avoid predicted collisions in order to ensure safety of the entire swarm. We also present a simulation-based implementation of the proposed system along with an experimental evaluation involving a series of experiments and compare our results with the results of four existing approaches. The results show that the proposed system successfully predicts and avoids all three kinds of collisions in an online manner. Moreover, it generates safe and efficient UAV routes, efficiently scales to large-sized problem instances, and is suitable for cluttered flying zones and for scenarios involving high risks of UAV collisions.
Tasks
Published 2019-12-19
URL https://arxiv.org/abs/1912.09288v1
PDF https://arxiv.org/pdf/1912.09288v1.pdf
PWC https://paperswithcode.com/paper/online-path-generation-and-navigation-for
Repo
Framework
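A sketch of the geometric core of UAV-to-UAV collision prediction: from current positions and velocities, compute each pair's time and distance of closest approach and flag pairs that come within a safety radius. The full system also handles static and moving obstacles and relies on runtime monitoring and CEP; the constant-velocity assumption and thresholds here are illustrative.

```python
# Sketch only: pairwise closest-approach prediction for a small UAV swarm.
import numpy as np
from itertools import combinations

def closest_approach(p1, v1, p2, v2):
    """Time >= 0 and distance at which two constant-velocity UAVs are closest."""
    dp, dv = p1 - p2, v1 - v2
    t = 0.0 if np.dot(dv, dv) < 1e-9 else max(0.0, -np.dot(dp, dv) / np.dot(dv, dv))
    return t, float(np.linalg.norm(dp + t * dv))

def predict_collisions(positions, velocities, safety_radius=5.0, horizon=30.0):
    alerts = []
    for i, j in combinations(range(len(positions)), 2):
        t, d = closest_approach(positions[i], velocities[i],
                                positions[j], velocities[j])
        if d < safety_radius and t <= horizon:
            alerts.append((i, j, t, d))   # who, when, and how close
    return alerts

positions = np.array([[0.0, 0, 10], [100, 0, 10], [0, 50, 12]])
velocities = np.array([[5.0, 0, 0], [-5, 0, 0], [0, -1, 0]])
print(predict_collisions(positions, velocities))  # UAVs 0 and 1 converge head-on
```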

3D Local Features for Direct Pairwise Registration

Title 3D Local Features for Direct Pairwise Registration
Authors Haowen Deng, Tolga Birdal, Slobodan Ilic
Abstract We present a novel, data-driven approach for solving the problem of registration of two point cloud scans. Our approach is direct in the sense that a single pair of corresponding local patches already provides the necessary transformation cue for the global registration. To achieve that, we first endow the state-of-the-art PPF-FoldNet auto-encoder (AE) with a pose-variant sibling, where the discrepancy between the two leads to pose-specific descriptors. Based upon this, we introduce RelativeNet, a relative pose estimation network to assign correspondence-specific orientations to the keypoints, eliminating any local reference frame computations. Finally, we devise a simple yet effective hypothesize-and-verify algorithm to quickly use the predictions and align two point sets. Our extensive quantitative and qualitative experiments suggest that our approach outperforms the state of the art on challenging real datasets for pairwise registration and that augmenting the keypoints with local pose information leads to better generalization and a dramatic speed-up.
Tasks Pose Estimation
Published 2019-04-08
URL http://arxiv.org/abs/1904.04281v1
PDF http://arxiv.org/pdf/1904.04281v1.pdf
PWC https://paperswithcode.com/paper/3d-local-features-for-direct-pairwise
Repo
Framework
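A sketch of a hypothesize-and-verify step for rigid registration: a transform is hypothesized (here via the Kabsch/Procrustes solution over correspondences) and verified by counting aligned points. The paper instead derives the hypothesis from a single correspondence plus its predicted relative pose; this plain-correspondence version is only illustrative.

```python
# Sketch only: rigid transform from correspondences plus inlier verification.
import numpy as np

def rigid_from_correspondences(src, dst):
    """Least-squares rotation R and translation t with dst ~ R @ src + t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(dst_c.T @ src_c)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def inlier_count(src, dst, R, t, tol=0.05):
    return int(np.sum(np.linalg.norm((src @ R.T + t) - dst, axis=1) < tol))

rng = np.random.default_rng(0)
src = rng.random((200, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([0.3, -0.1, 0.2])
R, t = rigid_from_correspondences(src, dst)
print(inlier_count(src, dst, R, t))  # 200 when the hypothesis is verified
```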

End-to-End Video Captioning

Title End-to-End Video Captioning
Authors Silvio Olivastri, Gurkirt Singh, Fabio Cuzzolin
Abstract Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while the decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train encoder and decoder separately. CNNs are pretrained on object and/or action recognition tasks and used to encode video-level features. The decoder is then optimised on such static features to generate the video’s description. This disjoint setup is arguably sub-optimal for input (video) to output (description) mapping. In this work, we propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders – then, the entire network is trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End (EtENet) Networks on the Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT) benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
Tasks Machine Translation, Temporal Action Localization, Text Generation, Video Captioning, Video Description
Published 2019-04-04
URL https://arxiv.org/abs/1904.02628v2
PDF https://arxiv.org/pdf/1904.02628v2.pdf
PWC https://paperswithcode.com/paper/an-end-to-end-baseline-for-video-captioning
Repo
Framework
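A toy sketch of the end-to-end idea: the visual encoder and language decoder sit in one computation graph, so the captioning loss backpropagates into the CNN instead of training the decoder on frozen video features. The tiny modules stand in for GoogLeNet / Inception-ResNet-v2 and the soft-attention LSTM; shapes and vocabulary size are assumptions.

```python
# Sketch only: joint encoder-decoder training for video captioning.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(16, dim))

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))   # encode every frame
        return feats.view(b, t, -1).mean(dim=1)  # mean-pool over time

class TinyDecoder(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, video_feat, tokens):
        h0 = video_feat.unsqueeze(0)             # condition the LSTM state on video
        out, _ = self.lstm(self.embed(tokens), (h0, torch.zeros_like(h0)))
        return self.out(out)

encoder, decoder = TinyEncoder(), TinyDecoder()
frames = torch.rand(2, 8, 3, 64, 64)
tokens = torch.randint(0, 1000, (2, 12))
logits = decoder(encoder(frames), tokens)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1000),
                                   tokens[:, 1:].reshape(-1))
loss.backward()                                  # gradients reach the CNN too
print(encoder.cnn[0].weight.grad is not None)    # True: trained end-to-end
```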

Manipulating a Learning Defender and Ways to Counteract

Title Manipulating a Learning Defender and Ways to Counteract
Authors Jiarui Gan, Qingyu Guo, Long Tran-Thanh, Bo An, Michael Wooldridge
Abstract In Stackelberg security games where information about the attacker’s payoffs is uncertain, algorithms have been proposed to learn the optimal defender commitment by interacting with the attacker and observing their best responses. In this paper, we show, however, that these algorithms can be easily manipulated if the attacker responds untruthfully. As a key finding, attacker manipulation normally leads to the defender learning a maximin strategy, which effectively renders the learning attempt meaningless, since computing a maximin strategy requires no additional information about the other player at all. We then apply a game-theoretic framework at a higher level to counteract such manipulation, in which the defender commits to a policy that specifies her strategy commitment according to the learned information. We provide a polynomial-time algorithm to compute the optimal such policy and, in addition, a heuristic approach that applies even when the attacker’s payoff space is infinite or completely unknown. Empirical evaluation shows that our approaches can improve the defender’s utility significantly as compared to the situation when attacker manipulation is ignored.
Tasks
Published 2019-05-28
URL https://arxiv.org/abs/1905.11759v2
PDF https://arxiv.org/pdf/1905.11759v2.pdf
PWC https://paperswithcode.com/paper/manipulating-a-learning-defender-and-ways-to
Repo
Framework
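To illustrate the maximin strategy that, per the abstract, manipulated learning tends to collapse to, here is a sketch that computes it by linear programming from the defender's own payoff matrix alone, needing no information about the attacker. The toy payoff matrix is illustrative.

```python
# Sketch only: maximin mixed strategy for the defender via an LP.
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(defender_payoff):
    n, m = defender_payoff.shape
    # variables: x_1..x_n (mixed strategy), v (guaranteed value); maximize v
    c = np.concatenate([np.zeros(n), [-1.0]])
    # for every attacker response j: v - sum_i x_i * U[i, j] <= 0
    A_ub = np.hstack([-defender_payoff.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n], -res.fun

# toy 2x2 security game payoffs for the defender
U = np.array([[ 1.0, -2.0],
              [-1.0,  0.5]])
strategy, value = maximin_strategy(U)
print(np.round(strategy, 3), round(value, 3))  # mix [1/3, 2/3], value -1/3
```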

Deep Multivariate Mixture of Gaussians for Object Detection under Occlusion

Title Deep Multivariate Mixture of Gaussians for Object Detection under Occlusion
Authors Yihui He, Jianren Wang
Abstract In this paper, we consider the problem of detecting objects under occlusion. Most object detectors formulate bounding box regression as a unimodal task (i.e., regressing a single set of bounding box coordinates independently). However, we observe that the bounding box borders of an occluded object can have multiple plausible configurations. Also, the occluded bounding box borders have correlations with the visible ones. Motivated by these two observations, we propose a deep multivariate mixture of Gaussians model for bounding box regression under occlusion. The mixture components potentially learn different configurations of an occluded part, and the covariances between variates help to learn the relationship between the occluded parts and the visible ones. Quantitatively, our model improves the AP of the baselines by 3.9% and 1.2% on CrowdHuman and MS-COCO respectively with almost no computational or memory overhead. Qualitatively, our model enjoys explainability, since we can interpret the resulting bounding boxes via the covariance matrices and the mixture components.
Tasks Object Detection
Published 2019-11-24
URL https://arxiv.org/abs/1911.10614v1
PDF https://arxiv.org/pdf/1911.10614v1.pdf
PWC https://paperswithcode.com/paper/deep-multivariate-mixture-of-gaussians-for
Repo
Framework
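A hedged sketch of a mixture-of-Gaussians negative log-likelihood for bounding box regression. For brevity it uses diagonal covariances per component; the paper models full covariances across the four box coordinates so that occluded borders can correlate with visible ones.

```python
# Sketch only: mixture-of-Gaussians NLL over box offsets (diagonal covariances).
import math
import torch
import torch.nn.functional as F

def mog_box_nll(pred, target):
    """pred: (B, K, 9) = [mixture logit, 4 means, 4 log-variances] per component.
    target: (B, 4) ground-truth box offsets."""
    log_pi = F.log_softmax(pred[..., 0], dim=-1)       # (B, K) mixture weights
    mean, log_var = pred[..., 1:5], pred[..., 5:9]     # (B, K, 4) each
    diff = target.unsqueeze(1) - mean
    # log N(target | mean, diag(exp(log_var))) summed over the 4 coordinates
    log_prob = -0.5 * (log_var + diff ** 2 / log_var.exp()
                       + math.log(2 * math.pi)).sum(dim=-1)
    return -(torch.logsumexp(log_pi + log_prob, dim=-1)).mean()

# toy batch: 8 boxes, 4 mixture components predicted by the detection head
pred = torch.randn(8, 4, 9, requires_grad=True)
target = torch.randn(8, 4)
loss = mog_box_nll(pred, target)
loss.backward()
print(float(loss))
```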