Paper Group AWR 218
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Title | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks |
Authors | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee |
Abstract | We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks – visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval – by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models – achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability. |
Tasks | Image Retrieval, Question Answering, Visual Commonsense Reasoning, Visual Question Answering |
Published | 2019-08-06 |
URL | https://arxiv.org/abs/1908.02265v1 |
PDF | https://arxiv.org/pdf/1908.02265v1.pdf
PWC | https://paperswithcode.com/paper/vilbert-pretraining-task-agnostic |
Repo | https://github.com/vmurahari3/visdial-bert |
Framework | pytorch |
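The co-attentional transformer layer the abstract describes exchanges keys and values between the two streams: vision queries language, and language queries vision. A minimal PyTorch sketch of that mechanism (dimensions, module names, and the residual wiring are illustrative assumptions, not the authors' implementation):

```python
# Hedged sketch of a ViLBERT-style co-attention block: each stream attends
# with its own queries over the *other* stream's keys and values.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        vis_out, _ = self.vis_attn(query=vis, key=txt, value=txt)  # V attends L
        txt_out, _ = self.txt_attn(query=txt, key=vis, value=vis)  # L attends V
        return vis + vis_out, txt + txt_out                        # residuals

vis = torch.randn(2, 36, 768)   # e.g. 36 image-region features
txt = torch.randn(2, 20, 768)   # e.g. 20 token embeddings
v, t = CoAttentionLayer()(vis, txt)
```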
Gated2Depth: Real-time Dense Lidar from Gated Images
Title | Gated2Depth: Real-time Dense Lidar from Gated Images |
Authors | Tobias Gruber, Frank Julca-Aguilar, Mario Bijelic, Werner Ritter, Klaus Dietmayer, Felix Heide |
Abstract | We present an imaging framework which converts three images from a gated camera into high-resolution depth maps with depth accuracy comparable to pulsed lidar measurements. Existing scanning lidar systems achieve low spatial resolution at large ranges due to mechanically-limited angular sampling rates, restricting scene understanding tasks to close-range clusters with dense sampling. Moreover, today's pulsed lidar scanners suffer from high cost, power consumption, and large form-factors, and they fail in the presence of strong backscatter. We depart from point scanning and demonstrate that it is possible to turn a low-cost CMOS gated imager into a dense depth camera with at least 80 m range, by learning depth from three gated images. The proposed architecture exploits semantic context across gated slices, and is trained on a synthetic discriminator loss without the need of dense depth labels. The proposed replacement for scanning lidar systems is real-time, handles backscatter and provides dense depth at long ranges. We validate our approach in simulation and on real-world data acquired over 4,000 km of driving in northern Europe. Data and code are available at https://github.com/gruberto/Gated2Depth. |
Tasks | Scene Understanding |
Published | 2019-02-13 |
URL | https://arxiv.org/abs/1902.04997v3 |
PDF | https://arxiv.org/pdf/1902.04997v3.pdf
PWC | https://paperswithcode.com/paper/gated2depth-real-time-dense-lidar-from-gated |
Repo | https://github.com/gruberto/Gated2Depth |
Framework | tf |
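The core mapping is image-to-image: three gated slices stacked as channels in, one dense depth map out. A hedged toy sketch of that interface only; the paper's actual multi-scale architecture and synthetic discriminator loss are not reproduced here:

```python
# Toy stand-in for the gated-slices-to-depth mapping (illustrative layer
# sizes; not the Gated2Depth architecture or its training losses).
import torch
import torch.nn as nn

class GatedToDepth(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),   # 3 gated slices in
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),              # depth per pixel out
        )

    def forward(self, gated_slices):                     # (B, 3, H, W)
        return self.net(gated_slices)                    # (B, 1, H, W)

depth = GatedToDepth()(torch.randn(1, 3, 256, 512))
```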
Quantum-inspired canonical correlation analysis for exponentially large dimensional data
Title | Quantum-inspired canonical correlation analysis for exponentially large dimensional data |
Authors | Naoko Koide-Majima, Kei Majima |
Abstract | Canonical correlation analysis (CCA) is a technique to find statistical dependencies between a pair of multivariate data. However, its application to high dimensional data is limited due to the resulting time complexity. While the conventional CCA algorithm requires polynomial time, we have developed an algorithm that approximates CCA with computational time proportional to the logarithm of the input dimensionality using quantum-inspired computation. The computational efficiency and approximation performance of the proposed quantum-inspired CCA (qiCCA) algorithm are experimentally demonstrated. Furthermore, the fast computation of qiCCA allows us to directly apply CCA even after nonlinearly mapping raw input data into very high dimensional spaces. Experiments performed using a benchmark dataset demonstrated that, by mapping the raw input data into the high dimensional spaces with second-order monomials, the proposed qiCCA extracted more correlations than linear CCA and was comparable to deep CCA and kernel CCA. These results suggest that qiCCA is highly useful and that quantum-inspired computation has the potential to unlock a new field in which exponentially large dimensional data can be analyzed. |
Tasks | |
Published | 2019-07-07 |
URL | https://arxiv.org/abs/1907.03236v1 |
PDF | https://arxiv.org/pdf/1907.03236v1.pdf
PWC | https://paperswithcode.com/paper/quantum-inspired-canonical-correlation |
Repo | https://github.com/nkmjm/qiML |
Framework | none |
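For context, the quantity qiCCA approximates is the set of canonical correlations: the singular values of the whitened cross-covariance matrix. A classical NumPy baseline (this is the conventional polynomial-time algorithm, not the quantum-inspired one):

```python
# Classical linear CCA: canonical correlations are the singular values of
# Cxx^{-1/2} Cxy Cyy^{-1/2}. This is the polynomial-time baseline that qiCCA
# approximates in time logarithmic in the input dimensionality.
import numpy as np

def cca_correlations(X, Y, reg=1e-6):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening: Wx Cxx Wx^T = I
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))                     # shared 2-D latent signal
X = Z @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))
Y = Z @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))
print(cca_correlations(X, Y)[:2])                 # two correlations near 1
```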
Deep Transfer Learning for Multiple Class Novelty Detection
Title | Deep Transfer Learning for Multiple Class Novelty Detection |
Authors | Pramuditha Perera, Vishal M. Patel |
Abstract | We propose a transfer learning-based solution for the problem of multiple class novelty detection. In particular, we propose an end-to-end deep-learning based approach in which we investigate how the knowledge contained in an external, out-of-distributional dataset can be used to improve the performance of a deep network for visual novelty detection. Our solution differs from the standard deep classification networks on two accounts. First, we use a novel loss function, membership loss, in addition to the classical cross-entropy loss for training networks. Second, we use the knowledge from the external dataset more effectively to learn globally negative filters, filters that respond to generic objects outside the known class set. We show that thresholding the maximal activation of the proposed network can be used to identify novel objects effectively. Extensive experiments on four publicly available novelty detection datasets show that the proposed method achieves significant improvements over the state-of-the-art methods. |
Tasks | Transfer Learning |
Published | 2019-03-06 |
URL | http://arxiv.org/abs/1903.02196v1 |
PDF | http://arxiv.org/pdf/1903.02196v1.pdf
PWC | https://paperswithcode.com/paper/deep-transfer-learning-for-multiple-class |
Repo | https://github.com/PramuPerera/TransferLearningNovelty |
Framework | caffe2 |
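The detection rule stated in the abstract, thresholding the maximal activation of the network, is easy to make concrete. A hedged sketch (the threshold and logits are illustrative; the membership loss used during training is not shown):

```python
# Novelty decision by thresholding the maximal activation, as the abstract
# describes; threshold selection is left to a validation procedure.
import torch

def is_novel(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    # logits: (B, num_known_classes) activations for a batch of inputs
    max_activation, _ = logits.max(dim=1)
    return max_activation < threshold   # True => outside the known class set

logits = torch.tensor([[4.2, 0.3, 1.1],
                       [0.4, 0.6, 0.5]])
print(is_novel(logits, threshold=2.0))  # tensor([False,  True])
```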
Rotation Invariant Convolutions for 3D Point Clouds Deep Learning
Title | Rotation Invariant Convolutions for 3D Point Clouds Deep Learning |
Authors | Zhiyuan Zhang, Binh-Son Hua, David W. Rosen, Sai-Kit Yeung |
Abstract | Recent progress in 3D deep learning has shown that it is possible to design special convolution operators to consume point cloud data. However, a typical drawback is that rotation invariance is often not guaranteed, resulting in networks being trained with data augmented with rotations. In this paper, we introduce a novel convolution operator for point clouds that achieves rotation invariance. Our core idea is to use low-level rotation invariant geometric features such as distances and angles to design a convolution operator for point cloud learning. The well-known point ordering problem is also addressed by a binning approach seamlessly built into the convolution. This convolution operator then serves as the basic building block of a neural network that is robust to point clouds under 6DoF transformations such as translation and rotation. Our experiments show that our method performs with high accuracy in common scene understanding tasks such as object classification and segmentation. Compared to previous works, most importantly, our method is able to generalize and achieve consistent results across different scenarios in which training and testing can contain arbitrary rotations. |
Tasks | Object Classification, Scene Understanding |
Published | 2019-08-17 |
URL | https://arxiv.org/abs/1908.06297v1 |
PDF | https://arxiv.org/pdf/1908.06297v1.pdf
PWC | https://paperswithcode.com/paper/rotation-invariant-convolutions-for-3d-point |
Repo | https://github.com/hkust-vgd/riconv |
Framework | tf |
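The low-level rotation-invariant features the abstract builds on, distances and angles, can be computed per neighborhood point as below. This sketch only illustrates the invariance; the binning step and the full convolution operator are not reproduced:

```python
# Distances and angles survive rigid rotations unchanged, unlike raw xyz
# coordinates. The reference-point choice here (query point, patch centroid)
# is an illustrative assumption.
import torch

def ri_features(points, ref):
    # points: (N, 3) neighborhood points; ref: (3,) reference point
    d = points - ref
    dist = d.norm(dim=1)                           # ||p - ref||
    c = points - points.mean(dim=0)                # offsets from the centroid
    cosang = (d * c).sum(dim=1) / (dist * c.norm(dim=1) + 1e-8)
    return torch.stack([dist, cosang], dim=1)      # (N, 2), rotation invariant

pts = torch.randn(16, 3)
feats = ri_features(pts, pts[0])   # identical for any rigid rotation of pts
```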
Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy
Title | Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy |
Authors | Hunter Gabbard, Chris Messenger, Ik Siong Heng, Francesco Tonolini, Roderick Murray-Smith |
Abstract | Gravitational wave (GW) detection is now commonplace and as the sensitivity of the global network of GW detectors improves, we will observe $\mathcal{O}(100)$s of transient GW events per year. The current methods used to estimate their source parameters employ optimally sensitive but computationally costly Bayesian inference approaches where typical analyses have taken between 6 hours and 5 days. For binary neutron star and neutron star black hole systems, prompt counterpart electromagnetic (EM) signatures are expected on timescales of 1 second – 1 minute, and the current fastest method for alerting EM follow-up observers can provide estimates in $\mathcal{O}(1)$ minute on a limited range of key source parameters. Here we show that a conditional variational autoencoder pre-trained on binary black hole signals can return Bayesian posterior probability estimates. The training procedure need only be performed once for a given prior parameter space and the resulting trained machine can then generate samples describing the posterior distribution $\sim 6$ orders of magnitude faster than existing techniques. |
Tasks | Bayesian Inference |
Published | 2019-09-13 |
URL | https://arxiv.org/abs/1909.06296v2 |
PDF | https://arxiv.org/pdf/1909.06296v2.pdf
PWC | https://paperswithcode.com/paper/bayesian-parameter-estimation-using |
Repo | https://github.com/hagabbar/VItamin |
Framework | tf |
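The quoted speed-up comes from the inference step: once the conditional VAE is trained, posterior samples are just decoded draws from the latent prior, conditioned on the observed strain. A hedged sketch (layer sizes, the 4-parameter output, and the strain embedding are illustrative assumptions):

```python
# Posterior sampling with a trained conditional VAE decoder: no MCMC, only
# forward passes. Architecture and dimensions are illustrative.
import torch
import torch.nn as nn

class CVAEDecoder(nn.Module):
    def __init__(self, latent_dim=8, cond_dim=256, param_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, param_dim),             # e.g. masses, distance, ...
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

def sample_posterior(decoder, strain_embedding, n_samples=1000, latent_dim=8):
    z = torch.randn(n_samples, latent_dim)         # draws from the latent prior
    cond = strain_embedding.expand(n_samples, -1)  # condition on observed data
    with torch.no_grad():
        return decoder(z, cond)                    # (n_samples, param_dim)

samples = sample_posterior(CVAEDecoder(), torch.randn(1, 256))
```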
3D Manhattan Room Layout Reconstruction from a Single 360 Image
Title | 3D Manhattan Room Layout Reconstruction from a Single 360 Image |
Authors | Chuhang Zou, Jheng-Wei Su, Chi-Han Peng, Alex Colburn, Qi Shan, Peter Wonka, Hung-Kuo Chu, Derek Hoiem |
Abstract | Recent approaches for predicting layouts from 360 panoramas produce excellent results. These approaches build on a common framework consisting of three steps: a pre-processing step based on edge-based alignment, prediction of layout elements, and a post-processing step by fitting a 3D layout to the layout elements. Until now, it has been difficult to compare the methods due to multiple different design decisions, such as the encoding network (e.g. SegNet or ResNet), type of elements predicted (e.g. corners, wall/floor boundaries, or semantic segmentation), or method of fitting the 3D layout. To address this challenge, we summarize and describe the common framework, the variants, and the impact of the design decisions. For a complete evaluation, we also propose extended annotations for the Matterport3D dataset, and introduce two depth-based evaluation metrics. |
Tasks | Semantic Segmentation |
Published | 2019-10-09 |
URL | https://arxiv.org/abs/1910.04099v2 |
PDF | https://arxiv.org/pdf/1910.04099v2.pdf
PWC | https://paperswithcode.com/paper/3d-manhattan-room-layout-reconstruction-from |
Repo | https://github.com/zouchuhang/LayoutNetv2 |
Framework | pytorch |
Deep Video Deblurring: The Devil is in the Details
Title | Deep Video Deblurring: The Devil is in the Details |
Authors | Jochen Gast, Stefan Roth |
Abstract | Video deblurring for hand-held cameras is a challenging task, since the underlying blur is caused by both camera shake and object motion. State-of-the-art deep networks exploit temporal information from neighboring frames, either by means of spatio-temporal transformers or by recurrent architectures. In contrast to these involved models, we found that a simple baseline CNN can perform astonishingly well when particular care is taken w.r.t. the details of the model and training procedure. To that end, we conduct a comprehensive study regarding these crucial details, uncovering extreme differences in quantitative and qualitative performance. Exploiting these details allows us to boost the architecture and training procedure of a simple baseline CNN by a staggering 3.15 dB, such that it becomes highly competitive w.r.t. cutting-edge networks. This raises the question of whether the reported accuracy difference between models is always due to technical contributions or also subject to such orthogonal, but crucial, details. |
Tasks | Deblurring |
Published | 2019-09-26 |
URL | https://arxiv.org/abs/1909.12196v1 |
PDF | https://arxiv.org/pdf/1909.12196v1.pdf
PWC | https://paperswithcode.com/paper/deep-video-deblurring-the-devil-is-in-the |
Repo | https://github.com/visinf/deblur-devil |
Framework | pytorch |
Designing and Interpreting Probes with Control Tasks
Title | Designing and Interpreting Probes with Control Tasks |
Authors | John Hewitt, Percy Liang |
Abstract | Probes, supervised models trained to predict properties (like parts-of-speech) from representations (like ELMo), have achieved high accuracy on a range of linguistic tasks. But does this mean that the representations encode linguistic structure or just that the probe has learned the linguistic task? In this paper, we propose control tasks, which associate word types with random outputs, to complement linguistic tasks. By construction, these tasks can only be learned by the probe itself. So a good probe (one that reflects the representation) should be selective, achieving high linguistic task accuracy and low control task accuracy. The selectivity of a probe puts linguistic task accuracy in context with the probe's capacity to memorize from word types. We construct control tasks for English part-of-speech tagging and dependency edge prediction, and show that popular probes on ELMo representations are not selective. We also find that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective. Finally, we find that while probes on the first layer of ELMo yield slightly better part-of-speech tagging accuracy than the second, probes on the second layer are substantially more selective, which raises the question of which layer better represents parts-of-speech. |
Tasks | Part-Of-Speech Tagging |
Published | 2019-09-08 |
URL | https://arxiv.org/abs/1909.03368v1 |
PDF | https://arxiv.org/pdf/1909.03368v1.pdf
PWC | https://paperswithcode.com/paper/designing-and-interpreting-probes-with |
Repo | https://github.com/i-machine-think/diagnnose |
Framework | pytorch |
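A control task, as defined in the abstract, assigns each word type a fixed random label, so high accuracy can come only from the probe memorizing word identities. A minimal sketch, with selectivity computed as the gap the paper describes (the 45-label default matches the Penn Treebank POS tag set):

```python
# Control task: a fixed random label per word *type*. By construction, the
# representation carries no signal for it; only the probe can learn it.
import random

def make_control_task(vocab, num_labels=45, seed=0):
    rng = random.Random(seed)
    return {word: rng.randrange(num_labels) for word in vocab}

def selectivity(linguistic_acc, control_acc):
    # High linguistic accuracy plus low control accuracy => a selective probe.
    return linguistic_acc - control_acc

control = make_control_task({"the", "cat", "sat", "on", "mat"})
```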
Deep Transfer Learning Based Downlink Channel Prediction for FDD Massive MIMO Systems
Title | Deep Transfer Learning Based Downlink Channel Prediction for FDD Massive MIMO Systems |
Authors | Yuwen Yang, Feifei Gao, Zhimeng Zhong, Bo Ai, Ahmed Alkhateeb |
Abstract | Artificial intelligence (AI) based downlink channel state information (CSI) prediction for frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems has attracted growing attention recently. However, existing works focus on downlink CSI prediction for users in a given environment and are hard to adapt to users in new environments, especially when labeled data is limited. To address this issue, we formulate the downlink channel prediction as a deep transfer learning (DTL) problem, where each learning task aims to predict the downlink CSI from the uplink CSI for one single environment. Specifically, we develop the direct-transfer algorithm based on the fully-connected neural network architecture, where the network is trained on the data from all previous environments in the manner of classical deep learning and is then fine-tuned for new environments. To further improve the transfer efficiency, we propose the meta-learning algorithm that trains the network by alternating inner-task and across-task updates and then adapts to a new environment with a small amount of labeled data. Simulation results show that the direct-transfer algorithm achieves better performance than the deep learning algorithm, which implies that transfer learning benefits the downlink channel prediction in new environments. Moreover, the meta-learning algorithm significantly outperforms the direct-transfer algorithm in terms of both prediction accuracy and stability, which validates its effectiveness and superiority. |
Tasks | Meta-Learning, Transfer Learning |
Published | 2019-12-27 |
URL | https://arxiv.org/abs/1912.12265v2 |
PDF | https://arxiv.org/pdf/1912.12265v2.pdf
PWC | https://paperswithcode.com/paper/deep-transfer-learning-based-downlink-channel |
Repo | https://github.com/yangyuwenyang/Codes-for-Deep-Transfer-Learning-Based-Downlink-Channel-Prediction-for-FDD-Massive-MIMO-Systems |
Framework | tf |
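A hedged sketch of the alternating inner-task / across-task updates the abstract describes, written as a first-order (Reptile-style) simplification rather than the paper's exact procedure; each task pairs uplink CSI inputs with downlink CSI targets for one environment:

```python
# First-order meta-learning sketch: adapt per environment (inner-task
# updates), then move the shared initialization toward the adapted weights
# (across-task update). Assumes a plain feed-forward uplink->downlink model.
import copy
import torch

def meta_update(model, env_batches, inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    loss_fn = torch.nn.MSELoss()
    init = copy.deepcopy(model.state_dict())
    delta = {k: torch.zeros_like(v) for k, v in init.items()}
    for uplink, downlink in env_batches:          # one task per environment
        model.load_state_dict(init)
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):              # inner-task updates
            opt.zero_grad()
            loss_fn(model(uplink), downlink).backward()
            opt.step()
        for k, v in model.state_dict().items():
            delta[k] += v - init[k]
    model.load_state_dict(                        # across-task update
        {k: init[k] + meta_lr * delta[k] / len(env_batches) for k in init})
```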
Virtual Mixup Training for Unsupervised Domain Adaptation
Title | Virtual Mixup Training for Unsupervised Domain Adaptation |
Authors | Xudong Mao, Yun Ma, Zhenguo Yang, Yangbin Chen, Qing Li |
Abstract | We study the problem of unsupervised domain adaptation which aims to adapt models trained on a labeled source domain to a completely unlabeled target domain. Recently, the cluster assumption has been applied to unsupervised domain adaptation and achieved strong performance. One critical factor in successful training under the cluster assumption is to impose the locally-Lipschitz constraint on the model. Existing methods only impose the locally-Lipschitz constraint around the training points while missing other areas, such as the points in-between training data. In this paper, we address this issue by encouraging the model to behave linearly in-between training points. We propose a new regularization method called Virtual Mixup Training (VMT), which is able to incorporate the locally-Lipschitz constraint into the areas in-between training data. Unlike the traditional mixup model, our method constructs the combination samples without using the label information, allowing it to be applied to unsupervised domain adaptation. The proposed method is generic and can be combined with most existing models such as the recent state-of-the-art model called VADA. Extensive experiments demonstrate that VMT significantly improves the performance of VADA on six domain adaptation benchmark datasets. For the challenging task of adapting MNIST to SVHN, VMT can improve the accuracy of VADA by over 30%. Code is available at \url{https://github.com/xudonmao/VMT}. |
Tasks | Domain Adaptation, Unsupervised Domain Adaptation |
Published | 2019-05-10 |
URL | https://arxiv.org/abs/1905.04215v4 |
PDF | https://arxiv.org/pdf/1905.04215v4.pdf
PWC | https://paperswithcode.com/paper/virtual-mixup-training-for-unsupervised |
Repo | https://github.com/xudonmao/VMT |
Framework | tf |
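The key construction, mixup without labels, mixes two unlabeled inputs and penalizes the model for deviating from the same mixture of its own predictions, which imposes the locally linear behavior in-between training points. A hedged sketch (the divergence choice and Beta mixing follow standard mixup conventions, not necessarily the paper's exact formulation):

```python
# Virtual Mixup: the model's own softmax outputs stand in for the missing
# labels of the unlabeled target-domain samples.
import torch
import torch.nn.functional as F

def vmt_loss(model, x1, x2, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x_mix = lam * x1 + (1 - lam) * x2
    with torch.no_grad():                         # virtual (model-made) labels
        p_mix = (lam * F.softmax(model(x1), dim=1)
                 + (1 - lam) * F.softmax(model(x2), dim=1))
    return F.kl_div(F.log_softmax(model(x_mix), dim=1), p_mix,
                    reduction="batchmean")
```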
AdaGraph: Unifying Predictive and Continuous Domain Adaptation through Graphs
Title | AdaGraph: Unifying Predictive and Continuous Domain Adaptation through Graphs |
Authors | Massimiliano Mancini, Samuel Rota Bulò, Barbara Caputo, Elisa Ricci |
Abstract | The ability to categorize is a cornerstone of visual intelligence, and a key functionality for artificial, autonomous visual machines. This problem will never be solved without algorithms able to adapt and generalize across visual domains. Within the context of domain adaptation and generalization, this paper focuses on the predictive domain adaptation scenario, namely the case where no target data are available and the system has to learn to generalize from annotated source images plus unlabeled samples with associated metadata from auxiliary domains. Our contribution is the first deep architecture that tackles predictive domain adaptation, able to leverage the information brought by the auxiliary domains through a graph. Moreover, we present a simple yet effective strategy that allows us to take advantage of the incoming target data at test time, in a continuous domain adaptation scenario. Experiments on three benchmark databases support the value of our approach. |
Tasks | Domain Adaptation |
Published | 2019-03-17 |
URL | https://arxiv.org/abs/1903.07062v3 |
PDF | https://arxiv.org/pdf/1903.07062v3.pdf
PWC | https://paperswithcode.com/paper/adagraph-unifying-predictive-and |
Repo | https://github.com/mancinimassimiliano/adagraph |
Framework | pytorch |
Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks
Title | Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks |
Authors | Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger Grosse, Jörn-Henrik Jacobsen |
Abstract | Lipschitz constraints under L2 norm on deep neural networks are useful for provable adversarial robustness bounds, stable training, and Wasserstein distance estimation. While heuristic approaches such as the gradient penalty have seen much practical success, it is challenging to achieve similar practical performance while provably enforcing a Lipschitz constraint. In principle, one can design Lipschitz constrained architectures using the composition property of Lipschitz functions, but Anil et al. recently identified a key obstacle to this approach: gradient norm attenuation. They showed how to circumvent this problem in the case of fully connected networks by designing each layer to be gradient norm preserving. We extend their approach to train scalable, expressive, provably Lipschitz convolutional networks. In particular, we present the Block Convolution Orthogonal Parameterization (BCOP), an expressive parameterization of orthogonal convolution operations. We show that even though the space of orthogonal convolutions is disconnected, the largest connected component of BCOP with 2n channels can represent arbitrary BCOP convolutions over n channels. Our BCOP parameterization allows us to train large convolutional networks with provable Lipschitz bounds. Empirically, we find that it is competitive with existing approaches to provable adversarial robustness and Wasserstein distance estimation. |
Tasks | |
Published | 2019-11-03 |
URL | https://arxiv.org/abs/1911.00937v2 |
PDF | https://arxiv.org/pdf/1911.00937v2.pdf
PWC | https://paperswithcode.com/paper/preventing-gradient-attenuation-in-lipschitz |
Repo | https://github.com/ColinQiyangLi/LConvNet |
Framework | pytorch |
An Inertial Newton Algorithm for Deep Learning
Title | An Inertial Newton Algorithm for Deep Learning |
Authors | Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels |
Abstract | We devise a learning algorithm for possibly nonsmooth deep neural networks featuring inertia and Newtonian directional intelligence only by means of a back-propagation oracle. Our algorithm, called INDIAN, has an appealing mechanical interpretation, making the role of its two hyperparameters transparent. An elementary phase space lifting allows both for its implementation and its theoretical study under very general assumptions. We handle in particular a stochastic version of our method (which encompasses usual mini-batch approaches) for nonsmooth activation functions (such as ReLU). Our algorithm shows high efficiency and reaches state of the art on image classification problems. |
Tasks | Image Classification |
Published | 2019-05-29 |
URL | https://arxiv.org/abs/1905.12278v3 |
PDF | https://arxiv.org/pdf/1905.12278v3.pdf
PWC | https://paperswithcode.com/paper/an-inertial-newton-algorithm-for-deep |
Repo | https://github.com/camcastera/Indian-for-DeepLearning |
Framework | tf |
SpatialNLI: A Spatial Domain Natural Language Interface to Databases Using Spatial Comprehension
Title | SpatialNLI: A Spatial Domain Natural Language Interface to Databases Using Spatial Comprehension |
Authors | Jingjing Li, Wenlu Wang, Wei-Shinn Ku, Yingtao Tian, Haixun Wang |
Abstract | A natural language interface (NLI) to databases is an interface that translates a natural language question to a structured query that is executable by database management systems (DBMS). However, an NLI that is trained in the general domain is hard to apply in the spatial domain due to the idiosyncrasy and expressiveness of the spatial questions. Inspired by the machine comprehension model, we propose a spatial comprehension model that is able to recognize the meaning of spatial entities based on the semantics of the context. The spatial semantics learned from the spatial comprehension model is then injected to the natural language question to ease the burden of capturing the spatial-specific semantics. With our spatial comprehension model and information injection, our NLI for the spatial domain, named SpatialNLI, is able to capture the semantic structure of the question and translate it to the corresponding syntax of an executable query accurately. We also experimentally ascertain that SpatialNLI outperforms state-of-the-art methods. |
Tasks | Reading Comprehension |
Published | 2019-08-28 |
URL | https://arxiv.org/abs/1908.10917v1 |
PDF | https://arxiv.org/pdf/1908.10917v1.pdf
PWC | https://paperswithcode.com/paper/spatialnli-a-spatial-domain-natural-language |
Repo | https://github.com/VV123/SpatialNLI |
Framework | tf |