Paper Group AWR 234
Auto-Encoding Variational Neural Machine Translation. Unsupervised Discovery of Object Landmarks as Structural Representations. Efficient Formal Safety Analysis of Neural Networks. Extended Isolation Forest. Self-supervised learning of a facial attribute embedding from video. Improved Fusion of Visual and Language Representations by Dense Symmetric …
Auto-Encoding Variational Neural Machine Translation
Title | Auto-Encoding Variational Neural Machine Translation |
Authors | Bryan Eikema, Wilker Aziz |
Abstract | We present a deep generative model of bilingual sentence pairs for machine translation. The model generates source and target sentences jointly from a shared latent representation and is parameterised by neural networks. We perform efficient training using amortised variational inference and reparameterised gradients. Additionally, we discuss the statistical implications of joint modelling and propose an efficient approximation to maximum a posteriori decoding for fast test-time predictions. We demonstrate the effectiveness of our model in three machine translation scenarios: in-domain training, mixed-domain training, and learning from a mix of gold-standard and synthetic data. Our experiments show consistently that our joint formulation outperforms conditional modelling (i.e. standard neural machine translation) in all such scenarios. |
Tasks | Machine Translation |
Published | 2018-07-27 |
URL | https://arxiv.org/abs/1807.10564v4 |
https://arxiv.org/pdf/1807.10564v4.pdf | |
PWC | https://paperswithcode.com/paper/auto-encoding-variational-neural-machine |
Repo | https://github.com/Roxot/AEVNMT |
Framework | tf |
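Below is a minimal sketch of the joint training objective the abstract describes: a shared latent variable generates both source and target, and the ELBO combines both reconstruction terms with a KL penalty, optimised with reparameterised gradients. The module names (`inference_net`, `src_decoder`, `tgt_decoder`) are placeholders, not the AEVNMT API.

```python
import torch

def elbo(inference_net, src_decoder, tgt_decoder, x, y):
    mu, log_var = inference_net(x)                 # q(z|x) parameters
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)           # reparameterised sample
    log_px = src_decoder.log_prob(x, z)            # log p(x|z), source reconstruction
    log_py = tgt_decoder.log_prob(y, x, z)         # log p(y|x,z), translation term
    kl = 0.5 * torch.sum(mu ** 2 + std ** 2 - 1.0 - log_var, dim=-1)
    return (log_px + log_py - kl).mean()           # maximise this (or minimise its negation)
```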
Unsupervised Discovery of Object Landmarks as Structural Representations
Title | Unsupervised Discovery of Object Landmarks as Structural Representations |
Authors | Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee |
Abstract | Deep neural networks can model images with rich latent representations, but they cannot naturally conceptualize structures of object categories in a human-perceptible way. This paper addresses the problem of learning object structures in an image modeling process without supervision. We propose an autoencoding formulation to discover landmarks as explicit structural representations. The encoding module outputs landmark coordinates, whose validity is ensured by constraints that reflect the necessary properties for landmarks. The decoding module takes the landmarks as a part of the learnable input representations in an end-to-end differentiable framework. Our discovered landmarks are semantically meaningful and more predictive of manually annotated landmarks than those discovered by previous methods. The coordinates of our landmarks are also complementary features to pretrained deep-neural-network representations in recognizing visual attributes. In addition, the proposed method naturally creates an unsupervised, perceptible interface to manipulate object shapes and decode images with controllable structures. The project webpage is at http://ytzhang.net/projects/lmdis-rep |
Tasks | Unsupervised Facial Landmark Detection |
Published | 2018-04-12 |
URL | http://arxiv.org/abs/1804.04412v1 |
http://arxiv.org/pdf/1804.04412v1.pdf | |
PWC | https://paperswithcode.com/paper/unsupervised-discovery-of-object-landmarks-as |
Repo | https://github.com/YutingZhang/lmdis-rep |
Framework | tf |
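A common way to make "the encoder outputs landmark coordinates" differentiable is a soft-argmax over per-landmark heatmaps; the sketch below shows that mechanism under the assumption of one landmark per channel, while the paper's additional validity constraints on the landmarks are omitted.

```python
import torch
import torch.nn.functional as F

def heatmaps_to_landmarks(heatmaps):
    # heatmaps: (batch, num_landmarks, H, W) raw detector scores
    b, k, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    y = (probs.sum(dim=3) * ys).sum(dim=2)   # expected row coordinate
    x = (probs.sum(dim=2) * xs).sum(dim=2)   # expected column coordinate
    return torch.stack([x, y], dim=-1)       # (batch, num_landmarks, 2)
```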
Efficient Formal Safety Analysis of Neural Networks
Title | Efficient Formal Safety Analysis of Neural Networks |
Authors | Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, Suman Jana |
Abstract | Neural networks are increasingly deployed in real-world safety-critical domains such as autonomous driving, aircraft collision avoidance, and malware detection. However, these networks have been shown to often mispredict on inputs with minor adversarial or even accidental perturbations. Consequences of such errors can be disastrous and even potentially fatal as shown by the recent Tesla autopilot crash. Thus, there is an urgent need for formal analysis systems that can rigorously check neural networks for violations of different safety properties such as robustness against adversarial perturbations within a certain $L$-norm of a given image. An effective safety analysis system for a neural network must be able to either ensure that a safety property is satisfied by the network or find a counterexample, i.e., an input for which the network will violate the property. Unfortunately, most existing techniques for performing such analysis struggle to scale beyond very small networks, and those that can scale to larger networks suffer from high false positive rates and cannot produce concrete counterexamples in the case of a property violation. In this paper, we present a new efficient approach for rigorously checking different safety properties of neural networks that significantly outperforms existing approaches by multiple orders of magnitude. Our approach can check different safety properties and find concrete counterexamples for networks that are 10$\times$ larger than the ones supported by existing analysis techniques. We believe that our approach to estimating tight output bounds of a network for a given input range can also help improve the explainability of neural networks and guide the training process of more robust neural networks. |
Tasks | Adversarial Attack, Adversarial Defense, Autonomous Driving, Malware Detection |
Published | 2018-09-19 |
URL | http://arxiv.org/abs/1809.08098v3 |
http://arxiv.org/pdf/1809.08098v3.pdf | |
PWC | https://paperswithcode.com/paper/efficient-formal-safety-analysis-of-neural |
Repo | https://github.com/tcwangshiqi-columbia/Interval-Attack |
Framework | tf |
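As a rough illustration of "estimating tight output bounds of a network for a given input range", the sketch below propagates interval bounds through affine and ReLU layers. This is plain interval arithmetic, not the paper's symbolic analysis, and will generally give looser bounds.

```python
import numpy as np

def interval_bounds(weights, biases, lo, up):
    """Bound the outputs of a fully connected ReLU network over the input box [lo, up]."""
    n_layers = len(weights)
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        lo, up = W_pos @ lo + W_neg @ up + b, W_pos @ up + W_neg @ lo + b
        if i < n_layers - 1:                     # ReLU on hidden layers only
            lo, up = np.maximum(lo, 0), np.maximum(up, 0)
    return lo, up
```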
Extended Isolation Forest
Title | Extended Isolation Forest |
Authors | Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner |
Abstract | We present an extension to the model-free anomaly detection algorithm Isolation Forest. This extension, named Extended Isolation Forest (EIF), resolves issues with the assignment of anomaly scores to given data points. We motivate the problem using heat maps of anomaly scores, which suffer from artifacts generated by the branching criterion of the binary trees. We explain this problem in detail and visually demonstrate the mechanism by which it occurs. We then propose two different approaches for improving the situation. The first is to randomly transform the data before the creation of each tree, which averages out the bias. The second, and preferred, approach is to slice the data with hyperplanes of random slopes. This approach remedies the artifacts seen in the anomaly score heat maps. We show that the robustness of the algorithm is much improved by this method by examining the variance of scores of data points distributed along constant level sets. We report AUROC and AUPRC for our synthetic datasets, along with real-world benchmark datasets. We find no appreciable difference in the rate of convergence or in computation time between the standard Isolation Forest and EIF. |
Tasks | Anomaly Detection |
Published | 2018-11-06 |
URL | https://arxiv.org/abs/1811.02141v2 |
https://arxiv.org/pdf/1811.02141v2.pdf | |
PWC | https://paperswithcode.com/paper/extended-isolation-forest |
Repo | https://github.com/fyumoto/EIF |
Framework | none |
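The "hyperplanes with random slopes" idea can be pictured with a single node-splitting function: draw a random normal vector and a random intercept inside the data range, and route points to the left or right child depending on which side of the hyperplane they fall. The sketch below illustrates that rule; it is not the released EIF implementation.

```python
import numpy as np

def extended_split(X, rng):
    # draw a hyperplane with a random slope and a random intercept inside the data range
    normal = rng.normal(size=X.shape[1])
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))
    mask = (X - intercept) @ normal <= 0.0
    return X[mask], X[~mask]                       # data routed to the left / right child

rng = np.random.default_rng(0)
left, right = extended_split(rng.normal(size=(100, 2)), rng)
```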
Self-supervised learning of a facial attribute embedding from video
Title | Self-supervised learning of a facial attribute embedding from video |
Authors | Olivia Wiles, A. Sophia Koepke, Andrew Zisserman |
Abstract | We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. We are comparable or superior to state-of-the-art self-supervised methods on these tasks and approach the performance of supervised methods. |
Tasks | Unsupervised Facial Landmark Detection |
Published | 2018-08-21 |
URL | http://arxiv.org/abs/1808.06882v1 |
http://arxiv.org/pdf/1808.06882v1.pdf | |
PWC | https://paperswithcode.com/paper/self-supervised-learning-of-a-facial |
Repo | https://github.com/oawiles/FAb-Net |
Framework | pytorch |
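The confidence/attention-mask idea from the first contribution can be sketched as a softmax-weighted combination of per-frame embeddings. The sketch below is a loose reading of the abstract, with `embed` and `confidence` as placeholder modules rather than FAb-Net's actual layers.

```python
import torch

def combine_source_frames(embed, confidence, source_frames):
    # source_frames: (batch, num_frames, C, H, W)
    b, n = source_frames.shape[:2]
    frames = source_frames.flatten(0, 1)
    codes = embed(frames).view(b, n, -1)           # per-frame embeddings
    conf = confidence(frames).view(b, n, 1)        # per-frame confidence scores
    weights = torch.softmax(conf, dim=1)
    return (weights * codes).sum(dim=1)            # confidence-weighted face embedding
```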
Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering
Title | Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering |
Authors | Duy-Kien Nguyen, Takayuki Okatani |
Abstract | A key to visual question answering (VQA) lies in how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities helps boost the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation, demonstrating how the proposed attention mechanism generates reasonable attention maps on images and questions, which lead to correct answer prediction. |
Tasks | Visual Question Answering |
Published | 2018-04-03 |
URL | http://arxiv.org/abs/1804.00775v2 |
http://arxiv.org/pdf/1804.00775v2.pdf | |
PWC | https://paperswithcode.com/paper/improved-fusion-of-visual-and-language |
Repo | https://github.com/cvlab-tohoku/Dense-CoAttention-Network |
Framework | pytorch |
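A bare-bones rendering of the dense, symmetric attention described in the abstract: one affinity matrix between question words and image regions, normalised in both directions so that each word attends to regions and each region attends to words. This is a paraphrase of the idea, not the authors' exact layer.

```python
import torch

def dense_co_attention(V, Q):
    # V: (batch, num_regions, d) image features; Q: (batch, num_words, d) word features
    d = V.shape[-1]
    affinity = Q @ V.transpose(1, 2) / d ** 0.5                # (batch, num_words, num_regions)
    attended_regions = torch.softmax(affinity, dim=2) @ V                 # image context per word
    attended_words = torch.softmax(affinity, dim=1).transpose(1, 2) @ Q   # word context per region
    return attended_words, attended_regions
```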
A First Look at Deep Learning Apps on Smartphones
Title | A First Look at Deep Learning Apps on Smartphones |
Authors | Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, Xuanzhe Liu |
Abstract | We are at the dawn of a deep learning explosion for smartphones. To bridge the gap between research and practice, we present the first empirical study of the 16,500 most popular Android apps, demystifying how smartphone apps exploit deep learning in the wild. To this end, we build a new static tool that dissects apps and analyzes their deep learning functions. Our study answers three questions: which apps are the early adopters of deep learning, what do they use deep learning for, and what do their deep learning models look like. Our study has strong implications for app developers, smartphone vendors, and deep learning R&D. On the one hand, our findings paint a promising picture of deep learning for smartphones, showing the prosperity of mobile deep learning frameworks as well as of apps building their cores atop deep learning. On the other hand, our findings call for optimization of the deep learning models deployed on smartphones, protection of these models, and validation of research ideas on these models. |
Tasks | |
Published | 2018-11-08 |
URL | https://arxiv.org/abs/1812.05448v3 |
https://arxiv.org/pdf/1812.05448v3.pdf | |
PWC | https://paperswithcode.com/paper/a-first-look-at-deep-learning-apps-on |
Repo | https://github.com/xumengwei/MobileDL |
Framework | tf |
Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform
Title | Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform |
Authors | Xintao Wang, Ke Yu, Chao Dong, Chen Change Loy |
Abstract | Although convolutional neural networks (CNNs) have recently demonstrated high-quality reconstruction for single-image super-resolution (SR), recovering natural and realistic texture remains a challenging problem. In this paper, we show that it is possible to recover textures faithful to semantic classes. In particular, we only need to modulate features of a few intermediate layers in a single network conditioned on semantic segmentation probability maps. This is made possible through a novel Spatial Feature Transform (SFT) layer that generates affine transformation parameters for spatially varying feature modulation. SFT layers can be trained end-to-end together with the SR network using the same loss function. During testing, the network accepts an input image of arbitrary size and generates a high-resolution image with just a single forward pass conditioned on the categorical priors. Our final results show that an SR network equipped with SFT can generate more realistic and visually pleasing textures in comparison to state-of-the-art SRGAN and EnhanceNet. |
Tasks | Image Super-Resolution, Semantic Segmentation, Super-Resolution |
Published | 2018-04-09 |
URL | http://arxiv.org/abs/1804.02815v1 |
http://arxiv.org/pdf/1804.02815v1.pdf | |
PWC | https://paperswithcode.com/paper/recovering-realistic-texture-in-image-super |
Repo | https://github.com/xinntao/BasicSR |
Framework | pytorch |
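The SFT layer itself is easy to picture from the abstract: a small conditioning network maps segmentation probability maps to per-position scale and shift parameters that modulate intermediate features. The sketch below uses illustrative layer sizes, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    def __init__(self, cond_channels, feat_channels, hidden=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(cond_channels, hidden, 1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 1)   # per-position scale
        self.beta = nn.Conv2d(hidden, feat_channels, 1)    # per-position shift

    def forward(self, features, seg_probs):
        h = self.shared(seg_probs)
        return features * self.gamma(h) + self.beta(h)     # spatial affine modulation
```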
From Recognition to Cognition: Visual Commonsense Reasoning
Title | From Recognition to Cognition: Visual Commonsense Reasoning |
Authors | Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi |
Abstract | Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work. |
Tasks | Visual Commonsense Reasoning |
Published | 2018-11-27 |
URL | http://arxiv.org/abs/1811.10830v2 |
http://arxiv.org/pdf/1811.10830v2.pdf | |
PWC | https://paperswithcode.com/paper/from-recognition-to-cognition-visual |
Repo | https://github.com/TheShadow29/visual-commonsense-pytorch |
Framework | pytorch |
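The Adversarial Matching step can be viewed as an assignment problem: reuse other questions' correct answers as distractors, preferring candidates that a relevance model scores high for the query but a similarity model scores low against the true answer. The sketch below shows that framing with placeholder `relevance` and `similarity` matrices; it is a reading of the abstract, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_match(relevance, similarity, lam=1.0):
    # relevance[i, j]: how well answer j fits question i (higher = more relevant);
    # similarity[i, j]: how close answer j is to question i's correct answer.
    cost = -(relevance - lam * similarity)
    np.fill_diagonal(cost, 1e9)                    # never reuse a question's own answer
    _, cols = linear_sum_assignment(cost)
    return cols                                    # cols[i] = distractor chosen for question i
```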
Understand Functionality and Dimensionality of Vector Embeddings: the Distributional Hypothesis, the Pairwise Inner Product Loss and Its Bias-Variance Trade-off
Title | Understand Functionality and Dimensionality of Vector Embeddings: the Distributional Hypothesis, the Pairwise Inner Product Loss and Its Bias-Variance Trade-off |
Authors | Zi Yin |
Abstract | Vector embedding is a foundational building block of many deep learning models, especially in natural language processing. In this paper, we present a theoretical framework for understanding the effect of dimensionality on vector embeddings. We observe that the distributional hypothesis, a governing principle of statistical semantics, requires a natural unitary-invariance for vector embeddings. Motivated by the unitary-invariance observation, we propose the Pairwise Inner Product (PIP) loss, a unitary-invariant metric on the similarity between two embeddings. We demonstrate that the PIP loss captures the difference in functionality between embeddings, and that the PIP loss is tightly connected with two basic properties of vector embeddings, namely similarity and compositionality. By formulating the embedding training process as matrix factorization with noise, we reveal a fundamental bias-variance trade-off between the signal spectrum and noise power in the dimensionality selection process. This bias-variance trade-off sheds light on many empirical observations which have not been thoroughly explained, for example the existence of an optimal dimensionality. Moreover, we discover two new results about vector embeddings, namely their robustness against over-parametrization and their forward stability. The bias-variance trade-off of the PIP loss explicitly answers the fundamental open problem of dimensionality selection for vector embeddings. |
Tasks | |
Published | 2018-03-01 |
URL | http://arxiv.org/abs/1803.00502v4 |
http://arxiv.org/pdf/1803.00502v4.pdf | |
PWC | https://paperswithcode.com/paper/understand-functionality-and-dimensionality |
Repo | https://github.com/aaaasssddf/PIP-experiments |
Framework | tf |
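The PIP loss between two embedding matrices E1 and E2 (rows indexed by tokens) is usually written as the Frobenius distance between their pairwise-inner-product matrices, which makes it invariant to unitary transformations of either embedding. A one-line sketch:

```python
import numpy as np

def pip_loss(E1, E2):
    # E1, E2: (vocab, d1) and (vocab, d2) embedding matrices over the same vocabulary
    return np.linalg.norm(E1 @ E1.T - E2 @ E2.T, ord="fro")
```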
Sparsely Grouped Multi-task Generative Adversarial Networks for Facial Attribute Manipulation
Title | Sparsely Grouped Multi-task Generative Adversarial Networks for Facial Attribute Manipulation |
Authors | Jichao Zhang, Yezhi Shu, Songhua Xu, Gongze Cao, Fan Zhong, Xueying Qin |
Abstract | Recently, Image-to-Image Translation (IIT) has achieved great progress in image style transfer and semantic context manipulation for images. However, existing approaches require exhaustively labelled training data, which is labor-intensive, difficult to scale up, and hard to adapt to a new domain. To overcome this key limitation, we propose Sparsely Grouped Generative Adversarial Networks (SG-GAN) as a novel approach that can translate images in sparsely grouped datasets where only a few training samples are labelled. Using a one-input multi-output architecture, SG-GAN is well-suited for tackling multi-task learning and sparsely grouped learning tasks. The new model is able to translate images among multiple groups using only a single trained model. To experimentally validate the advantages of the new model, we apply the proposed method to tackle a series of attribute manipulation tasks for facial images as a case study. Experimental results show that SG-GAN can achieve results comparable to state-of-the-art methods on adequately labelled datasets while attaining superior image translation quality on sparsely grouped datasets. |
Tasks | Image-to-Image Translation, Multi-Task Learning, Style Transfer |
Published | 2018-05-19 |
URL | http://arxiv.org/abs/1805.07509v6 |
http://arxiv.org/pdf/1805.07509v6.pdf | |
PWC | https://paperswithcode.com/paper/sparsely-grouped-multi-task-generative |
Repo | https://github.com/zhangqianhui/SGGAN-tensorflow |
Framework | tf |
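One way to picture "sparsely grouped" training is that the group/attribute classification loss only sees the few labelled samples, while every sample contributes to the adversarial realism term. The sketch below encodes that reading of the abstract and is not the SG-GAN objective verbatim.

```python
import torch
import torch.nn.functional as F

def sparsely_grouped_loss(real_logits, cls_logits, labels, is_labelled):
    # adversarial term uses every (translated) sample
    adv = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    # classification term only uses the few labelled samples
    cls = F.cross_entropy(cls_logits, labels, reduction="none")
    cls = (cls * is_labelled.float()).sum() / is_labelled.float().sum().clamp(min=1)
    return adv + cls
```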
Bi-Real Net: Binarizing Deep Network Towards Real-Network Performance
Title | Bi-Real Net: Binarizing Deep Network Towards Real-Network Performance |
Authors | Zechun Liu, Wenhan Luo, Baoyuan Wu, Xin Yang, Wei Liu, Kwang-Ting Cheng |
Abstract | In this paper, we study 1-bit convolutional neural networks (CNNs), in which both the weights and activations are binary. While efficient, the lack of representational capability and the training difficulty impede 1-bit CNNs from performing as well as real-valued networks. We propose Bi-Real net with a novel training algorithm to tackle these two challenges. To enhance the representational capability, we propagate the real-valued activations generated by each 1-bit convolution via a parameter-free shortcut. To address the training difficulty, we propose a training algorithm using a tighter approximation to the derivative of the sign function, a magnitude-aware gradient for weight updating, a better initialization method, and a two-step scheme for training a deep network. Experiments on ImageNet show that an 18-layer Bi-Real net with the proposed training algorithm achieves 56.4% top-1 classification accuracy, which is 10% higher than the state of the art (e.g., XNOR-Net) with greater memory saving and lower computational cost. Bi-Real net is also the first to scale up 1-bit CNNs to an ultra-deep network with 152 layers, and achieves 64.5% top-1 accuracy on ImageNet. A 50-layer Bi-Real net shows comparable performance to a real-valued network on the depth estimation task with only a 0.3% accuracy gap. |
Tasks | Depth Estimation |
Published | 2018-11-04 |
URL | https://arxiv.org/abs/1811.01335v2 |
https://arxiv.org/pdf/1811.01335v2.pdf | |
PWC | https://paperswithcode.com/paper/bi-real-net-binarizing-deep-network-towards |
Repo | https://github.com/liuzechun/Bi-Real-net |
Framework | pytorch |
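The "tighter approximation to the derivative of the sign function" can be sketched as a custom autograd function: the forward pass binarises with sign, and the backward pass uses the derivative of a clipped quadratic instead of the identity. The constants below follow the commonly cited ApproxSign form and may differ from the released code.

```python
import torch

class ApproxSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad = torch.where(x < 0, 2 + 2 * x, 2 - 2 * x)   # derivative of the quadratic approximation
        grad = grad * (x.abs() <= 1).float()              # zero gradient outside [-1, 1]
        return grad_output * grad
```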
Analysis of Minimax Error Rate for Crowdsourcing and Its Application to Worker Clustering Model
Title | Analysis of Minimax Error Rate for Crowdsourcing and Its Application to Worker Clustering Model |
Authors | Hideaki Imamura, Issei Sato, Masashi Sugiyama |
Abstract | While crowdsourcing has become an important means to label data, there is great interest in estimating the ground truth from unreliable labels produced by crowdworkers. The Dawid and Skene (DS) model is one of the most well-known models in the study of crowdsourcing. Despite its practical popularity, theoretical error analysis for the DS model has been conducted only under restrictive assumptions on class priors, confusion matrices, or the number of labels each worker provides. In this paper, we derive a minimax error rate under a more practical setting for a broader class of crowdsourcing models including the DS model as a special case. We further propose the worker clustering model, which is more practical than the DS model under real crowdsourcing settings. The wide applicability of our theoretical analysis allows us to immediately investigate the behavior of this proposed model, which cannot be analyzed by existing studies. Experimental results showed that there is a strong similarity between the lower bound of the minimax error rate derived by our theoretical analysis and the empirical error of the estimated value. |
Tasks | |
Published | 2018-02-13 |
URL | http://arxiv.org/abs/1802.04551v2 |
http://arxiv.org/pdf/1802.04551v2.pdf | |
PWC | https://paperswithcode.com/paper/analysis-of-minimax-error-rate-for |
Repo | https://github.com/HideakiImamura/MinimaxErrorRate |
Framework | none |
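For context, here is a compact EM sketch of the classical Dawid-Skene model that the analysis covers as a special case (the proposed worker-clustering model is not shown). `labels[i, j]` is worker j's label for item i, with -1 marking a missing label.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    n_items, n_workers = labels.shape
    # initialise the posterior over true labels with a smoothed majority vote
    post = np.zeros((n_items, n_classes))
    for k in range(n_classes):
        post[:, k] = (labels == k).sum(axis=1)
    post += 1e-6
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class prior and per-worker confusion matrices
        prior = post.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for j in range(n_workers):
            seen = labels[:, j] >= 0
            for k in range(n_classes):
                conf[j, :, k] += post[seen][labels[seen, j] == k].sum(axis=0)
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over true labels given prior and confusion matrices
        log_post = np.log(prior)[None, :].repeat(n_items, axis=0)
        for j in range(n_workers):
            seen = labels[:, j] >= 0
            log_post[seen] += np.log(conf[j, :, labels[seen, j]]).T
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post  # post[i, k] = estimated P(true label of item i = k)
```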
Learning unknown ODE models with Gaussian processes
Title | Learning unknown ODE models with Gaussian processes |
Authors | Markus Heinonen, Cagatay Yildiz, Henrik Mannerström, Jukka Intosalmi, Harri Lähdesmäki |
Abstract | In conventional ODE modelling, the coefficients of an equation driving the system state forward in time are estimated. However, for many complex systems it is practically impossible to determine the equations or interactions governing the underlying dynamics. In these settings, a parametric ODE model cannot be formulated. Here, we overcome this issue by introducing a novel paradigm of nonparametric ODE modelling that can learn the underlying dynamics of arbitrary continuous-time systems without prior knowledge. We propose to learn non-linear, unknown differential functions from state observations using Gaussian process vector fields within the exact ODE formalism. We demonstrate the model’s capabilities to infer dynamics from sparse data and to simulate the system forward into the future. |
Tasks | Gaussian Processes |
Published | 2018-03-12 |
URL | http://arxiv.org/abs/1803.04303v1 |
http://arxiv.org/pdf/1803.04303v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-unknown-ode-models-with-gaussian |
Repo | https://github.com/cagatayyildiz/npde |
Framework | tf |
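To illustrate the "Gaussian process as vector field" idea, the sketch below fits a GP to crude finite-difference derivative estimates and then integrates the learned field forward. This gradient-matching shortcut stands in for the paper's exact ODE treatment and is only meant to convey the modelling idea.

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_gp_vector_field(t, X):
    # t: (N,) observation times, X: (N, D) observed states
    dXdt = np.gradient(X, t, axis=0)              # crude derivative estimates
    gps = [GaussianProcessRegressor(kernel=RBF()).fit(X, dXdt[:, d])
           for d in range(X.shape[1])]
    def f(_, x):                                  # learned ODE right-hand side
        return np.array([gp.predict(x[None, :])[0] for gp in gps])
    return f

# illustrative usage: simulate forward from the first observed state
# f = fit_gp_vector_field(t, X); sol = solve_ivp(f, (t[0], t[-1]), X[0])
```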
Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning
Title | Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning |
Authors | Nicolas Papernot, Patrick McDaniel |
Abstract | Deep neural networks (DNNs) enable innovative applications of machine learning like image recognition, machine translation, or malware detection. However, deep learning is often criticized for its lack of robustness in adversarial settings (e.g., vulnerability to adversarial inputs) and general inability to rationalize its predictions. In this work, we exploit the structure of deep learning to enable new learning-based inference and decision strategies that achieve desirable properties such as robustness and interpretability. We take a first step in this direction and introduce the Deep k-Nearest Neighbors (DkNN). This hybrid classifier combines the k-nearest neighbors algorithm with representations of the data learned by each layer of the DNN: a test input is compared to its neighboring training points according to the distance that separates them in the representations. We show that the labels of these neighboring points afford confidence estimates for inputs outside the model’s training manifold, including malicious inputs like adversarial examples, and thereby provide protection against inputs that are outside the model’s understanding. This is because the nearest neighbors can be used to estimate the nonconformity of, i.e., the lack of support for, a prediction in the training data. The neighbors also constitute human-interpretable explanations of predictions. We evaluate the DkNN algorithm on several datasets, and show that the confidence estimates accurately identify inputs outside the model’s understanding, and that the explanations provided by nearest neighbors are intuitive and useful in understanding model failures. |
Tasks | Machine Translation, Malware Detection |
Published | 2018-03-13 |
URL | http://arxiv.org/abs/1803.04765v1 |
http://arxiv.org/pdf/1803.04765v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-k-nearest-neighbors-towards-confident |
Repo | https://github.com/rodgzilla/machine_learning_deep_knn |
Framework | none |
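The DkNN inference loop described in the abstract can be simplified to: collect the labels of the k nearest training points in every layer's representation space, score a candidate class by how many neighbours disagree with it, and calibrate that nonconformity into a p-value against held-out calibration scores. The sketch below is that simplified rendering, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_layer_indices(layer_reprs, k=75):
    # layer_reprs: list of (n_train, d_layer) arrays, one per DNN layer
    return [NearestNeighbors(n_neighbors=k).fit(R) for R in layer_reprs]

def dknn_p_values(indices, train_labels, test_layer_reprs, calib_scores, n_classes):
    # calib_scores: nonconformity scores of a held-out calibration set (for their true labels)
    neigh_labels = [train_labels[nn.kneighbors(R, return_distance=False)]
                    for nn, R in zip(indices, test_layer_reprs)]
    # nonconformity of candidate class c = neighbours (over all layers) that disagree with c
    scores = np.stack([sum((nl != c).sum(axis=1) for nl in neigh_labels)
                       for c in range(n_classes)], axis=1)        # (n_test, n_classes)
    # empirical p-value: fraction of calibration scores at least as nonconforming
    return (calib_scores[None, None, :] >= scores[:, :, None]).mean(axis=2)
```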