Paper Group AWR 218
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Title | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks |
Authors | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee |
Abstract | We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks – visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval – by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models – achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability. |
Tasks | Image Retrieval, Question Answering, Visual Commonsense Reasoning, Visual Question Answering |
Published | 2019-08-06 |
URL | https://arxiv.org/abs/1908.02265v1 |
PDF | https://arxiv.org/pdf/1908.02265v1.pdf
PWC | https://paperswithcode.com/paper/vilbert-pretraining-task-agnostic |
Repo | https://github.com/vmurahari3/visdial-bert |
Framework | pytorch |
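The co-attentional transformer layer the abstract describes exchanges keys and values between the two streams: vision queries language, and language queries vision. A minimal PyTorch sketch of that mechanism (dimensions, module names, and the residual wiring are illustrative assumptions, not the authors' implementation):

```python
# Hedged sketch of a ViLBERT-style co-attention block: each stream attends
# with its own queries over the *other* stream's keys and values.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        vis_out, _ = self.vis_attn(query=vis, key=txt, value=txt)  # V attends L
        txt_out, _ = self.txt_attn(query=txt, key=vis, value=vis)  # L attends V
        return vis + vis_out, txt + txt_out                        # residuals

vis = torch.randn(2, 36, 768)   # e.g. 36 image-region features
txt = torch.randn(2, 20, 768)   # e.g. 20 token embeddings
v, t = CoAttentionLayer()(vis, txt)
```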
Gated2Depth: Real-time Dense Lidar from Gated Images
Title | Gated2Depth: Real-time Dense Lidar from Gated Images |
Authors | Tobias Gruber, Frank Julca-Aguilar, Mario Bijelic, Werner Ritter, Klaus Dietmayer, Felix Heide |
Abstract | We present an imaging framework which converts three images from a gated camera into high-resolution depth maps with depth accuracy comparable to pulsed lidar measurements. Existing scanning lidar systems achieve low spatial resolution at large ranges due to mechanically-limited angular sampling rates, restricting scene understanding tasks to close-range clusters with dense sampling. Moreover, today's pulsed lidar scanners suffer from high cost, power consumption, and large form-factors, and they fail in the presence of strong backscatter. We depart from point scanning and demonstrate that it is possible to turn a low-cost CMOS gated imager into a dense depth camera with at least 80 m range, by learning depth from three gated images. The proposed architecture exploits semantic context across gated slices, and is trained on a synthetic discriminator loss without the need of dense depth labels. The proposed replacement for scanning lidar systems is real-time, handles backscatter and provides dense depth at long ranges. We validate our approach in simulation and on real-world data acquired over 4,000 km of driving in northern Europe. Data and code are available at https://github.com/gruberto/Gated2Depth. |
Tasks | Scene Understanding |
Published | 2019-02-13 |
URL | https://arxiv.org/abs/1902.04997v3 |
PDF | https://arxiv.org/pdf/1902.04997v3.pdf
PWC | https://paperswithcode.com/paper/gated2depth-real-time-dense-lidar-from-gated |
Repo | https://github.com/gruberto/Gated2Depth |
Framework | tf |
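The core mapping is image-to-image: three gated slices stacked as channels in, one dense depth map out. A hedged toy sketch of that interface only; the paper's actual multi-scale architecture and synthetic discriminator loss are not reproduced here:

```python
# Toy stand-in for the gated-slices-to-depth mapping (illustrative layer
# sizes; not the Gated2Depth architecture or its training losses).
import torch
import torch.nn as nn

class GatedToDepth(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),   # 3 gated slices in
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),              # depth per pixel out
        )

    def forward(self, gated_slices):                     # (B, 3, H, W)
        return self.net(gated_slices)                    # (B, 1, H, W)

depth = GatedToDepth()(torch.randn(1, 3, 256, 512))
```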
Quantum-inspired canonical correlation analysis for exponentially large dimensional data
Title | Quantum-inspired canonical correlation analysis for exponentially large dimensional data |
Authors | Naoko Koide-Majima, Kei Majima |
Abstract | Canonical correlation analysis (CCA) is a technique to find statistical dependencies between a pair of multivariate data. However, its application to high dimensional data is limited due to the resulting time complexity. While the conventional CCA algorithm requires polynomial time, we have developed an algorithm that approximates CCA with computational time proportional to the logarithm of the input dimensionality using quantum-inspired computation. The computational efficiency and approximation performance of the proposed quantum-inspired CCA (qiCCA) algorithm are experimentally demonstrated. Furthermore, the fast computation of qiCCA allows us to directly apply CCA even after nonlinearly mapping raw input data into very high dimensional spaces. Experiments performed using a benchmark dataset demonstrated that, by mapping the raw input data into the high dimensional spaces with second-order monomials, the proposed qiCCA extracted more correlations than linear CCA and was comparable to deep CCA and kernel CCA. These results suggest that qiCCA is highly useful and that quantum-inspired computation has the potential to unlock a new field in which exponentially large dimensional data can be analyzed. |
Tasks | |
Published | 2019-07-07 |
URL | https://arxiv.org/abs/1907.03236v1 |
PDF | https://arxiv.org/pdf/1907.03236v1.pdf
PWC | https://paperswithcode.com/paper/quantum-inspired-canonical-correlation |
Repo | https://github.com/nkmjm/qiML |
Framework | none |
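For context, the quantity qiCCA approximates is the set of canonical correlations: the singular values of the whitened cross-covariance matrix. A classical NumPy baseline (this is the conventional polynomial-time algorithm, not the quantum-inspired one):

```python
# Classical linear CCA: canonical correlations are the singular values of
# Cxx^{-1/2} Cxy Cyy^{-1/2}. This is the polynomial-time baseline that qiCCA
# approximates in time logarithmic in the input dimensionality.
import numpy as np

def cca_correlations(X, Y, reg=1e-6):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening: Wx Cxx Wx^T = I
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))                     # shared 2-D latent signal
X = Z @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))
Y = Z @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))
print(cca_correlations(X, Y)[:2])                 # two correlations near 1
```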
Deep Transfer Learning for Multiple Class Novelty Detection
Title | Deep Transfer Learning for Multiple Class Novelty Detection |
Authors | Pramuditha Perera, Vishal M. Patel |
Abstract | We propose a transfer learning-based solution for the problem of multiple class novelty detection. In particular, we propose an end-to-end deep-learning based approach in which we investigate how the knowledge contained in an external, out-of-distributional dataset can be used to improve the performance of a deep network for visual novelty detection. Our solution differs from the standard deep classification networks on two accounts. First, we use a novel loss function, membership loss, in addition to the classical cross-entropy loss for training networks. Second, we use the knowledge from the external dataset more effectively to learn globally negative filters, filters that respond to generic objects outside the known class set. We show that thresholding the maximal activation of the proposed network can be used to identify novel objects effectively. Extensive experiments on four publicly available novelty detection datasets show that the proposed method achieves significant improvements over the state-of-the-art methods. |
Tasks | Transfer Learning |
Published | 2019-03-06 |
URL | http://arxiv.org/abs/1903.02196v1 |
PDF | http://arxiv.org/pdf/1903.02196v1.pdf
PWC | https://paperswithcode.com/paper/deep-transfer-learning-for-multiple-class |
Repo | https://github.com/PramuPerera/TransferLearningNovelty |
Framework | caffe2 |
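The detection rule stated in the abstract, thresholding the maximal activation of the network, is easy to make concrete. A hedged sketch (the threshold and logits are illustrative; the membership loss used during training is not shown):

```python
# Novelty decision by thresholding the maximal activation, as the abstract
# describes; threshold selection is left to a validation procedure.
import torch

def is_novel(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    # logits: (B, num_known_classes) activations for a batch of inputs
    max_activation, _ = logits.max(dim=1)
    return max_activation < threshold   # True => outside the known class set

logits = torch.tensor([[4.2, 0.3, 1.1],
                       [0.4, 0.6, 0.5]])
print(is_novel(logits, threshold=2.0))  # tensor([False,  True])
```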
Rotation Invariant Convolutions for 3D Point Clouds Deep Learning
Title | Rotation Invariant Convolutions for 3D Point Clouds Deep Learning |
Authors | Zhiyuan Zhang, Binh-Son Hua, David W. Rosen, Sai-Kit Yeung |
Abstract | Recent progress in 3D deep learning has shown that it is possible to design special convolution operators to consume point cloud data. However, a typical drawback is that rotation invariance is often not guaranteed, resulting in networks being trained with data augmented with rotations. In this paper, we introduce a novel convolution operator for point clouds that achieves rotation invariance. Our core idea is to use low-level rotation invariant geometric features such as distances and angles to design a convolution operator for point cloud learning. The well-known point ordering problem is also addressed by a binning approach seamlessly built into the convolution. This convolution operator then serves as the basic building block of a neural network that is robust to point clouds under 6DoF transformations such as translation and rotation. Our experiments show that our method performs with high accuracy in common scene understanding tasks such as object classification and segmentation. Compared to previous works, most importantly, our method is able to generalize and achieve consistent results across different scenarios in which training and testing can contain arbitrary rotations. |
Tasks | Object Classification, Scene Understanding |
Published | 2019-08-17 |
URL | https://arxiv.org/abs/1908.06297v1 |
PDF | https://arxiv.org/pdf/1908.06297v1.pdf
PWC | https://paperswithcode.com/paper/rotation-invariant-convolutions-for-3d-point |
Repo | https://github.com/hkust-vgd/riconv |
Framework | tf |
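The low-level rotation-invariant features the abstract builds on, distances and angles, can be computed per neighborhood point as below. This sketch only illustrates the invariance; the binning step and the full convolution operator are not reproduced:

```python
# Distances and angles survive rigid rotations unchanged, unlike raw xyz
# coordinates. The reference-point choice here (query point, patch centroid)
# is an illustrative assumption.
import torch

def ri_features(points, ref):
    # points: (N, 3) neighborhood points; ref: (3,) reference point
    d = points - ref
    dist = d.norm(dim=1)                           # ||p - ref||
    c = points - points.mean(dim=0)                # offsets from the centroid
    cosang = (d * c).sum(dim=1) / (dist * c.norm(dim=1) + 1e-8)
    return torch.stack([dist, cosang], dim=1)      # (N, 2), rotation invariant

pts = torch.randn(16, 3)
feats = ri_features(pts, pts[0])   # identical for any rigid rotation of pts
```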
Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy
Title | Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy |
Authors | Hunter Gabbard, Chris Messenger, Ik Siong Heng, Francesco Tonolini, Roderick Murray-Smith |
Abstract | Gravitational wave (GW) detection is now commonplace and as the sensitivity of the global network of GW detectors improves, we will observe $\mathcal{O}(100)$s of transient GW events per year. The current methods used to estimate their source parameters employ optimally sensitive but computationally costly Bayesian inference approaches where typical analyses have taken between 6 hours and 5 days. For binary neutron star and neutron star black hole systems, prompt counterpart electromagnetic (EM) signatures are expected on timescales of 1 second – 1 minute, and the current fastest method for alerting EM follow-up observers can provide estimates in $\mathcal{O}(1)$ minute on a limited range of key source parameters. Here we show that a conditional variational autoencoder pre-trained on binary black hole signals can return Bayesian posterior probability estimates. The training procedure need only be performed once for a given prior parameter space and the resulting trained machine can then generate samples describing the posterior distribution $\sim 6$ orders of magnitude faster than existing techniques. |
Tasks | Bayesian Inference |
Published | 2019-09-13 |
URL | https://arxiv.org/abs/1909.06296v2 |
PDF | https://arxiv.org/pdf/1909.06296v2.pdf
PWC | https://paperswithcode.com/paper/bayesian-parameter-estimation-using |
Repo | https://github.com/hagabbar/VItamin |
Framework | tf |
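The quoted speed-up comes from the inference step: once the conditional VAE is trained, posterior samples are just decoded draws from the latent prior, conditioned on the observed strain. A hedged sketch (layer sizes, the 4-parameter output, and the strain embedding are illustrative assumptions):

```python
# Posterior sampling with a trained conditional VAE decoder: no MCMC, only
# forward passes. Architecture and dimensions are illustrative.
import torch
import torch.nn as nn

class CVAEDecoder(nn.Module):
    def __init__(self, latent_dim=8, cond_dim=256, param_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, param_dim),             # e.g. masses, distance, ...
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

def sample_posterior(decoder, strain_embedding, n_samples=1000, latent_dim=8):
    z = torch.randn(n_samples, latent_dim)         # draws from the latent prior
    cond = strain_embedding.expand(n_samples, -1)  # condition on observed data
    with torch.no_grad():
        return decoder(z, cond)                    # (n_samples, param_dim)

samples = sample_posterior(CVAEDecoder(), torch.randn(1, 256))
```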
3D Manhattan Room Layout Reconstruction from a Single 360 Image
Title | 3D Manhattan Room Layout Reconstruction from a Single 360 Image |
Authors | Chuhang Zou, Jheng-Wei Su, Chi-Han Peng, Alex Colburn, Qi Shan, Peter Wonka, Hung-Kuo Chu, Derek Hoiem |
Abstract | Recent approaches for predicting layouts from 360 panoramas produce excellent results. These approaches build on a common framework consisting of three steps: a pre-processing step based on edge-based alignment, prediction of layout elements, and a post-processing step by fitting a 3D layout to the layout elements. Until now, it has been difficult to compare the methods due to multiple different design decisions, such as the encoding network (e.g. SegNet or ResNet), type of elements predicted (e.g. corners, wall/floor boundaries, or semantic segmentation), or method of fitting the 3D layout. To address this challenge, we summarize and describe the common framework, the variants, and the impact of the design decisions. For a complete evaluation, we also propose extended annotations for the Matterport3D dataset, and introduce two depth-based evaluation metrics. |
Tasks | Semantic Segmentation |
Published | 2019-10-09 |
URL | https://arxiv.org/abs/1910.04099v2 |
PDF | https://arxiv.org/pdf/1910.04099v2.pdf
PWC | https://paperswithcode.com/paper/3d-manhattan-room-layout-reconstruction-from |
Repo | https://github.com/zouchuhang/LayoutNetv2 |
Framework | pytorch |
Deep Video Deblurring: The Devil is in the Details
Title | Deep Video Deblurring: The Devil is in the Details |
Authors | Jochen Gast, Stefan Roth |
Abstract | Video deblurring for hand-held cameras is a challenging task, since the underlying blur is caused by both camera shake and object motion. State-of-the-art deep networks exploit temporal information from neighboring frames, either by means of spatio-temporal transformers or by recurrent architectures. In contrast to these involved models, we found that a simple baseline CNN can perform astonishingly well when particular care is taken w.r.t. the details of the model and training procedure. To that end, we conduct a comprehensive study regarding these crucial details, uncovering extreme differences in quantitative and qualitative performance. Exploiting these details allows us to boost the architecture and training procedure of a simple baseline CNN by a staggering 3.15 dB, such that it becomes highly competitive w.r.t. cutting-edge networks. This raises the question of whether the reported accuracy difference between models is always due to technical contributions or also subject to such orthogonal, but crucial, details. |
Tasks | Deblurring |
Published | 2019-09-26 |
URL | https://arxiv.org/abs/1909.12196v1 |
PDF | https://arxiv.org/pdf/1909.12196v1.pdf
PWC | https://paperswithcode.com/paper/deep-video-deblurring-the-devil-is-in-the |
Repo | https://github.com/visinf/deblur-devil |
Framework | pytorch |
Designing and Interpreting Probes with Control Tasks
Title | Designing and Interpreting Probes with Control Tasks |
Authors | John Hewitt, Percy Liang |
Abstract | Probes, supervised models trained to predict properties (like parts-of-speech) from representations (like ELMo), have achieved high accuracy on a range of linguistic tasks. But does this mean that the representations encode linguistic structure or just that the probe has learned the linguistic task? In this paper, we propose control tasks, which associate word types with random outputs, to complement linguistic tasks. By construction, these tasks can only be learned by the probe itself. So a good probe (one that reflects the representation) should be selective, achieving high linguistic task accuracy and low control task accuracy. The selectivity of a probe puts linguistic task accuracy in context with the probe's capacity to memorize from word types. We construct control tasks for English part-of-speech tagging and dependency edge prediction, and show that popular probes on ELMo representations are not selective. We also find that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective. Finally, we find that while probes on the first layer of ELMo yield slightly better part-of-speech tagging accuracy than the second, probes on the second layer are substantially more selective, which raises the question of which layer better represents parts-of-speech. |
Tasks | Part-Of-Speech Tagging |
Published | 2019-09-08 |
URL | https://arxiv.org/abs/1909.03368v1 |
PDF | https://arxiv.org/pdf/1909.03368v1.pdf
PWC | https://paperswithcode.com/paper/designing-and-interpreting-probes-with |
Repo | https://github.com/i-machine-think/diagnnose |
Framework | pytorch |
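A control task, as defined in the abstract, assigns each word type a fixed random label, so high accuracy can come only from the probe memorizing word identities. A minimal sketch, with selectivity computed as the gap the paper describes (the 45-label default matches the Penn Treebank POS tag set):

```python
# Control task: a fixed random label per word *type*. By construction, the
# representation carries no signal for it; only the probe can learn it.
import random

def make_control_task(vocab, num_labels=45, seed=0):
    rng = random.Random(seed)
    return {word: rng.randrange(num_labels) for word in vocab}

def selectivity(linguistic_acc, control_acc):
    # High linguistic accuracy plus low control accuracy => a selective probe.
    return linguistic_acc - control_acc

control = make_control_task({"the", "cat", "sat", "on", "mat"})
```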
Deep Transfer Learning Based Downlink Channel Prediction for FDD Massive MIMO Systems
Title | Deep Transfer Learning Based Downlink Channel Prediction for FDD Massive MIMO Systems |
Authors | Yuwen Yang, Feifei Gao, Zhimeng Zhong, Bo Ai, Ahmed Alkhateeb |
Abstract | Artificial intelligence (AI) based downlink channel state information (CSI) prediction for frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems has attracted growing attention recently. However, existing works focus on downlink CSI prediction for users in a given environment and are hard to adapt to users in new environments, especially when labeled data is limited. To address this issue, we formulate the downlink channel prediction as a deep transfer learning (DTL) problem, where each learning task aims to predict the downlink CSI from the uplink CSI for one single environment. Specifically, we develop the direct-transfer algorithm based on the fully-connected neural network architecture, where the network is trained on the data from all previous environments in the manner of classical deep learning and is then fine-tuned for new environments. To further improve the transfer efficiency, we propose the meta-learning algorithm that trains the network by alternating inner-task and across-task updates and then adapts to a new environment with a small amount of labeled data. Simulation results show that the direct-transfer algorithm achieves better performance than the deep learning algorithm, which implies that transfer learning benefits the downlink channel prediction in new environments. Moreover, the meta-learning algorithm significantly outperforms the direct-transfer algorithm in terms of both prediction accuracy and stability, which validates its effectiveness and superiority. |
Tasks | Meta-Learning, Transfer Learning |
Published | 2019-12-27 |
URL | https://arxiv.org/abs/1912.12265v2 |
PDF | https://arxiv.org/pdf/1912.12265v2.pdf
PWC | https://paperswithcode.com/paper/deep-transfer-learning-based-downlink-channel |
Repo | https://github.com/yangyuwenyang/Codes-for-Deep-Transfer-Learning-Based-Downlink-Channel-Prediction-for-FDD-Massive-MIMO-Systems |
Framework | tf |
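A hedged sketch of the alternating inner-task / across-task updates the abstract describes, written as a first-order (Reptile-style) simplification rather than the paper's exact procedure; each task pairs uplink CSI inputs with downlink CSI targets for one environment:

```python
# First-order meta-learning sketch: adapt per environment (inner-task
# updates), then move the shared initialization toward the adapted weights
# (across-task update). Assumes a plain feed-forward uplink->downlink model.
import copy
import torch

def meta_update(model, env_batches, inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    loss_fn = torch.nn.MSELoss()
    init = copy.deepcopy(model.state_dict())
    delta = {k: torch.zeros_like(v) for k, v in init.items()}
    for uplink, downlink in env_batches:          # one task per environment
        model.load_state_dict(init)
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):              # inner-task updates
            opt.zero_grad()
            loss_fn(model(uplink), downlink).backward()
            opt.step()
        for k, v in model.state_dict().items():
            delta[k] += v - init[k]
    model.load_state_dict(                        # across-task update
        {k: init[k] + meta_lr * delta[k] / len(env_batches) for k in init})
```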
Virtual Mixup Training for Unsupervised Domain Adaptation
Title | Virtual Mixup Training for Unsupervised Domain Adaptation |
Authors | Xudong Mao, Yun Ma, Zhenguo Yang, Yangbin Chen, Qing Li |
Abstract | We study the problem of unsupervised domain adaptation which aims to adapt models trained on a labeled source domain to a completely unlabeled target domain. Recently, the cluster assumption has been applied to unsupervised domain adaptation and achieved strong performance. One critical factor in successful training under the cluster assumption is to impose the locally-Lipschitz constraint on the model. Existing methods only impose the locally-Lipschitz constraint around the training points while missing other areas, such as the points in-between training data. In this paper, we address this issue by encouraging the model to behave linearly in-between training points. We propose a new regularization method called Virtual Mixup Training (VMT), which is able to incorporate the locally-Lipschitz constraint into the areas in-between training data. Unlike the traditional mixup model, our method constructs the combination samples without using the label information, allowing it to be applied to unsupervised domain adaptation. The proposed method is generic and can be combined with most existing models such as the recent state-of-the-art model called VADA. Extensive experiments demonstrate that VMT significantly improves the performance of VADA on six domain adaptation benchmark datasets. For the challenging task of adapting MNIST to SVHN, VMT can improve the accuracy of VADA by over 30%. Code is available at \url{https://github.com/xudonmao/VMT}. |
Tasks | Domain Adaptation, Unsupervised Domain Adaptation |
Published | 2019-05-10 |
URL | https://arxiv.org/abs/1905.04215v4 |
PDF | https://arxiv.org/pdf/1905.04215v4.pdf
PWC | https://paperswithcode.com/paper/virtual-mixup-training-for-unsupervised |
Repo | https://github.com/xudonmao/VMT |
Framework | tf |
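The key construction, mixup without labels, mixes two unlabeled inputs and penalizes the model for deviating from the same mixture of its own predictions, which imposes the locally linear behavior in-between training points. A hedged sketch (the divergence choice and Beta mixing follow standard mixup conventions, not necessarily the paper's exact formulation):

```python
# Virtual Mixup: the model's own softmax outputs stand in for the missing
# labels of the unlabeled target-domain samples.
import torch
import torch.nn.functional as F

def vmt_loss(model, x1, x2, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x_mix = lam * x1 + (1 - lam) * x2
    with torch.no_grad():                         # virtual (model-made) labels
        p_mix = (lam * F.softmax(model(x1), dim=1)
                 + (1 - lam) * F.softmax(model(x2), dim=1))
    return F.kl_div(F.log_softmax(model(x_mix), dim=1), p_mix,
                    reduction="batchmean")
```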
AdaGraph: Unifying Predictive and Continuous Domain Adaptation through Graphs
Title | AdaGraph: Unifying Predictive and Continuous Domain Adaptation through Graphs |
Authors | Massimiliano Mancini, Samuel Rota Bulò, Barbara Caputo, Elisa Ricci |
Abstract | The ability to categorize is a cornerstone of visual intelligence, and a key functionality for artificial, autonomous visual machines. This problem will never be solved without algorithms able to adapt and generalize across visual domains. Within the context of domain adaptation and generalization, this paper focuses on the predictive domain adaptation scenario, namely the case where no target data are available and the system has to learn to generalize from annotated source images plus unlabeled samples with associated metadata from auxiliary domains. Our contribution is the first deep architecture that tackles predictive domain adaptation, able to leverage the information brought by the auxiliary domains through a graph. Moreover, we present a simple yet effective strategy that allows us to take advantage of the incoming target data at test time, in a continuous domain adaptation scenario. Experiments on three benchmark databases support the value of our approach. |
Tasks | Domain Adaptation |
Published | 2019-03-17 |
URL | https://arxiv.org/abs/1903.07062v3 |
PDF | https://arxiv.org/pdf/1903.07062v3.pdf
PWC | https://paperswithcode.com/paper/adagraph-unifying-predictive-and |
Repo | https://github.com/mancinimassimiliano/adagraph |
Framework | pytorch |
Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks
Title | Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks |
Authors | Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger Grosse, Jörn-Henrik Jacobsen |
Abstract | Lipschitz constraints under L2 norm on deep neural networks are useful for provable adversarial robustness bounds, stable training, and Wasserstein distance estimation. While heuristic approaches such as the gradient penalty have seen much practical success, it is challenging to achieve similar practical performance while provably enforcing a Lipschitz constraint. In principle, one can design Lipschitz constrained architectures using the composition property of Lipschitz functions, but Anil et al. recently identified a key obstacle to this approach: gradient norm attenuation. They showed how to circumvent this problem in the case of fully connected networks by designing each layer to be gradient norm preserving. We extend their approach to train scalable, expressive, provably Lipschitz convolutional networks. In particular, we present the Block Convolution Orthogonal Parameterization (BCOP), an expressive parameterization of orthogonal convolution operations. We show that even though the space of orthogonal convolutions is disconnected, the largest connected component of BCOP with 2n channels can represent arbitrary BCOP convolutions over n channels. Our BCOP parameterization allows us to train large convolutional networks with provable Lipschitz bounds. Empirically, we find that it is competitive with existing approaches to provable adversarial robustness and Wasserstein distance estimation. |
Tasks | |
Published | 2019-11-03 |
URL | https://arxiv.org/abs/1911.00937v2 |
PDF | https://arxiv.org/pdf/1911.00937v2.pdf
PWC | https://paperswithcode.com/paper/preventing-gradient-attenuation-in-lipschitz |
Repo | https://github.com/ColinQiyangLi/LConvNet |
Framework | pytorch |
An Inertial Newton Algorithm for Deep Learning
Title | An Inertial Newton Algorithm for Deep Learning |
Authors | Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels |
Abstract | We devise a learning algorithm for possibly nonsmooth deep neural networks featuring inertia and Newtonian directional intelligence only by means of a back-propagation oracle. Our algorithm, called INDIAN, has an appealing mechanical interpretation, making the role of its two hyperparameters transparent. An elementary phase space lifting allows both for its implementation and its theoretical study under very general assumptions. We handle in particular a stochastic version of our method (which encompasses usual mini-batch approaches) for nonsmooth activation functions (such as ReLU). Our algorithm shows high efficiency and reaches state of the art on image classification problems. |
Tasks | Image Classification |
Published | 2019-05-29 |
URL | https://arxiv.org/abs/1905.12278v3 |
PDF | https://arxiv.org/pdf/1905.12278v3.pdf
PWC | https://paperswithcode.com/paper/an-inertial-newton-algorithm-for-deep |
Repo | https://github.com/camcastera/Indian-for-DeepLearning |
Framework | tf |
SpatialNLI: A Spatial Domain Natural Language Interface to Databases Using Spatial Comprehension
Title | SpatialNLI: A Spatial Domain Natural Language Interface to Databases Using Spatial Comprehension |
Authors | Jingjing Li, Wenlu Wang, Wei-Shinn Ku, Yingtao Tian, Haixun Wang |
Abstract | A natural language interface (NLI) to databases is an interface that translates a natural language question to a structured query that is executable by database management systems (DBMS). However, an NLI that is trained in the general domain is hard to apply in the spatial domain due to the idiosyncrasy and expressiveness of the spatial questions. Inspired by the machine comprehension model, we propose a spatial comprehension model that is able to recognize the meaning of spatial entities based on the semantics of the context. The spatial semantics learned from the spatial comprehension model is then injected to the natural language question to ease the burden of capturing the spatial-specific semantics. With our spatial comprehension model and information injection, our NLI for the spatial domain, named SpatialNLI, is able to capture the semantic structure of the question and translate it to the corresponding syntax of an executable query accurately. We also experimentally ascertain that SpatialNLI outperforms state-of-the-art methods. |
Tasks | Reading Comprehension |
Published | 2019-08-28 |
URL | https://arxiv.org/abs/1908.10917v1 |
PDF | https://arxiv.org/pdf/1908.10917v1.pdf
PWC | https://paperswithcode.com/paper/spatialnli-a-spatial-domain-natural-language |
Repo | https://github.com/VV123/SpatialNLI |
Framework | tf |