May 7, 2019

3156 words 15 mins read

Paper Group AWR 63

An Analysis of Deep Neural Network Models for Practical Applications. Column Networks for Collective Classification. Visualizing Large-scale and High-dimensional Data. Sentence Similarity Learning by Lexical Decomposition and Composition. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. A Convolutional Encoder Model for Neural …

An Analysis of Deep Neural Network Models for Practical Applications

Title An Analysis of Deep Neural Network Models for Practical Applications
Authors Alfredo Canziani, Adam Paszke, Eugenio Culurciello
Abstract Since the emergence of Deep Neural Networks (DNNs) as a prominent technique in the field of computer vision, the ImageNet classification challenge has played a major role in advancing the state-of-the-art. While accuracy figures have steadily increased, the resource utilisation of winning models has not been properly taken into account. In this work, we present a comprehensive analysis of important metrics in practical applications: accuracy, memory footprint, parameters, operations count, inference time and power consumption. Key findings are: (1) power consumption is independent of batch size and architecture; (2) accuracy and inference time are in a hyperbolic relationship; (3) energy constraint is an upper bound on the maximum achievable accuracy and model complexity; (4) the number of operations is a reliable estimate of the inference time. We believe our analysis provides a compelling set of information that helps design and engineer efficient DNNs.
Tasks
Published 2016-05-24
URL http://arxiv.org/abs/1605.07678v4
PDF http://arxiv.org/pdf/1605.07678v4.pdf
PWC https://paperswithcode.com/paper/an-analysis-of-deep-neural-network-models-for
Repo https://github.com/CentaurusM/Utils
Framework none
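
The paper's headline findings are empirical measurements. A minimal sketch of how such numbers can be gathered, assuming PyTorch and torchvision are available; the model choice and batch size here are illustrative, not the paper's setup:

```python
# Parameter count and wall-clock inference time per image (a sketch,
# not the authors' benchmarking code).
import time
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # any architecture works here
n_params = sum(p.numel() for p in model.parameters())

x = torch.randn(16, 3, 224, 224)  # batch of 16 ImageNet-sized inputs
with torch.no_grad():
    model(x)  # warm-up pass
    t0 = time.perf_counter()
    for _ in range(10):
        model(x)
    dt = (time.perf_counter() - t0) / 10  # mean seconds per batch

print(f"{n_params / 1e6:.1f}M parameters, {dt / 16 * 1e3:.2f} ms per image")
```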

Column Networks for Collective Classification

Title Column Networks for Collective Classification
Authors Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh
Abstract Relational learning deals with data that are characterized by relational structures. An important task is collective classification, which is to jointly classify networked objects. While it holds great promise of producing better accuracy than non-collective classifiers, collective classification is computationally challenging and has not leveraged the recent breakthroughs of deep learning. We present Column Network (CLN), a novel deep learning model for collective classification in multi-relational domains. CLN has many desirable theoretical properties: (i) it encodes multi-relations between any two instances; (ii) it is deep and compact, allowing complex functions to be approximated at the network level with a small set of free parameters; (iii) local and relational features are learned simultaneously; (iv) long-range, higher-order dependencies between instances are supported naturally; and (v) crucially, learning and inference are efficient, linear in the size of the network and the number of relations. We evaluate CLN on multiple real-world applications: (a) delay prediction in software projects, (b) PubMed Diabetes publication classification and (c) film genre classification. In all applications, CLN demonstrates higher accuracy than state-of-the-art rivals.
Tasks Relational Reasoning
Published 2016-09-15
URL http://arxiv.org/abs/1609.04508v2
PDF http://arxiv.org/pdf/1609.04508v2.pdf
PWC https://paperswithcode.com/paper/column-networks-for-collective-classification
Repo https://github.com/trangptm/Column_networks
Framework none
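
As a rough illustration of property (i), here is a hedged sketch of a single CLN-style layer: each node combines its own hidden state with mean-pooled neighbour states, using one weight matrix per relation type. The normalisation and non-linearity here are assumptions, not the authors' code:

```python
import numpy as np

def cln_layer(H, adj_per_relation, W, V, b):
    """H: (n_nodes, d) hidden states; adj_per_relation: list of (n, n) 0/1
    matrices, one per relation; W: (d, d); V: list of (d, d); b: (d,)."""
    out = H @ W + b
    for A, V_r in zip(adj_per_relation, V):
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # avoid divide-by-zero
        out += (A @ H) / deg @ V_r                         # mean over neighbours
    return np.maximum(out, 0)  # ReLU non-linearity

rng = np.random.default_rng(0)
n, d, R = 5, 8, 2                 # toy sizes: 5 nodes, 8 dims, 2 relations
H = rng.normal(size=(n, d))
adj = [rng.integers(0, 2, size=(n, n)) for _ in range(R)]
H1 = cln_layer(H, adj, rng.normal(size=(d, d)),
               [rng.normal(size=(d, d)) for _ in range(R)], np.zeros(d))
```

Stacking such layers is what gives the model its "deep and compact" character: depth adds expressiveness while each layer reuses the same small parameter set across all nodes.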

Visualizing Large-scale and High-dimensional Data

Title Visualizing Large-scale and High-dimensional Data
Authors Jian Tang, Jingzhou Liu, Ming Zhang, Qiaozhu Mei
Abstract We study the problem of visualizing large-scale and high-dimensional data in a low-dimensional (typically 2D or 3D) space. Much success has been reported recently by techniques that first compute a similarity structure of the data points and then project them into a low-dimensional space with the structure preserved. These two steps suffer from considerable computational costs, preventing state-of-the-art methods such as t-SNE from scaling to large-scale and high-dimensional data (e.g., millions of data points and hundreds of dimensions). We propose LargeVis, a technique that first constructs an accurately approximated K-nearest neighbor graph from the data and then lays out the graph in the low-dimensional space. Compared to t-SNE, LargeVis significantly reduces the computational cost of the graph construction step and employs a principled probabilistic model for the visualization step, whose objective can be effectively optimized through asynchronous stochastic gradient descent with linear time complexity. The whole procedure thus easily scales to millions of high-dimensional data points. Experimental results on real-world data sets demonstrate that LargeVis outperforms state-of-the-art methods in both efficiency and effectiveness. The hyper-parameters of LargeVis are also much more stable across different data sets.
Tasks Graph Construction
Published 2016-02-01
URL http://arxiv.org/abs/1602.00370v2
PDF http://arxiv.org/pdf/1602.00370v2.pdf
PWC https://paperswithcode.com/paper/visualizing-large-scale-and-high-dimensional
Repo https://github.com/jlmelville/uwot
Framework none
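
A hedged sketch of the visualization step's per-edge update, assuming the edge probability p(i, j) = 1 / (1 + ||y_i - y_j||^2) from the paper's probabilistic model. The KNN graph is replaced by a toy ring here, and the asynchronous aspect is omitted:

```python
import numpy as np

def sgd_step(Y, edges, lr=0.1, n_neg=5, rng=np.random.default_rng(0)):
    for i, j in edges:
        d = Y[i] - Y[j]
        g = 2 * d / (1 + d @ d)        # gradient of -log p(edge) w.r.t. Y[i]
        Y[i] -= lr * g                 # pull the two endpoints together
        Y[j] += lr * g
        for k in rng.integers(0, len(Y), n_neg):  # repel from negative samples
            if k == i or k == j:
                continue
            d = Y[i] - Y[k]
            w = d @ d + 1e-8
            Y[i] += lr * 2 * d / (w * (1 + w))    # gradient of log(1 - p(i, k))
    return Y

Y = np.random.default_rng(1).normal(size=(100, 2))
edges = [(i, (i + 1) % 100) for i in range(100)]  # toy ring stands in for KNN graph
Y = sgd_step(Y, edges)
```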

Sentence Similarity Learning by Lexical Decomposition and Composition

Title Sentence Similarity Learning by Lexical Decomposition and Composition
Authors Zhiguo Wang, Haitao Mi, Abraham Ittycheriah
Abstract Most conventional sentence similarity methods focus only on the similar parts of two input sentences and simply ignore the dissimilar parts, which often carry useful cues about the sentences' semantics. In this work, we propose a model that takes into account both similarities and dissimilarities by decomposing and composing lexical semantics over sentences. The model represents each word as a vector and calculates a semantic matching vector for each word based on all words in the other sentence. Each word vector is then decomposed into a similar component and a dissimilar component based on the semantic matching vector. After this, a two-channel CNN model is employed to capture features by composing the similar and dissimilar components. Finally, a similarity score is estimated over the composed feature vectors. Experimental results show that our model achieves state-of-the-art performance on the answer sentence selection task and a comparable result on the paraphrase identification task.
Tasks Paraphrase Identification, Question Answering
Published 2016-02-23
URL http://arxiv.org/abs/1602.07019v2
PDF http://arxiv.org/pdf/1602.07019v2.pdf
PWC https://paperswithcode.com/paper/sentence-similarity-learning-by-lexical
Repo https://github.com/Leputa/CIKM-AnalytiCup-2018
Framework tf
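
A hedged sketch of the decomposition step in NumPy: each word vector is split into a component aligned with its best match in the other sentence ("similar") and the orthogonal remainder ("dissimilar"). The "max" matching and orthogonal decomposition used here are one of several variants the paper explores; the two-channel CNN composition stage is omitted:

```python
import numpy as np

def decompose(S, T):
    """S: (m, d) word vectors of one sentence; T: (n, d) of the other.
    Returns (similar, dissimilar) components of S, each (m, d)."""
    sims = S @ T.T / (np.linalg.norm(S, axis=1, keepdims=True)
                      * np.linalg.norm(T, axis=1) + 1e-8)  # cosine similarities
    match = T[sims.argmax(axis=1)]       # best-matching word ("max" matching)
    coef = (S * match).sum(1, keepdims=True) / (
        (match * match).sum(1, keepdims=True) + 1e-8)
    similar = coef * match               # projection of each word onto its match
    return similar, S - similar          # dissimilar = orthogonal remainder

S = np.random.default_rng(0).normal(size=(4, 50))   # 4-word sentence
T = np.random.default_rng(1).normal(size=(6, 50))   # 6-word sentence
sim, dis = decompose(S, T)
```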

MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition

Title MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition
Authors Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, Jianfeng Gao
Abstract In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base. More specifically, we propose a benchmark task to recognize one million celebrities from their face images, using all the face images of each individual that can be collected on the web as training data. The rich information provided by the knowledge base helps to conduct disambiguation, improves recognition accuracy, and contributes to various real-world applications, such as image captioning and news video analysis. Associated with this task, we design and provide a concrete measurement set, an evaluation protocol, and training data. We also present our experimental setup in detail and report promising baseline results. Our benchmark task could lead to one of the largest classification problems in computer vision. To the best of our knowledge, our training dataset, which contains 10M images in version 1, is the largest publicly available one in the world.
Tasks Face Recognition, Image Captioning
Published 2016-07-27
URL http://arxiv.org/abs/1607.08221v1
PDF http://arxiv.org/pdf/1607.08221v1.pdf
PWC https://paperswithcode.com/paper/ms-celeb-1m-a-dataset-and-benchmark-for-large
Repo https://github.com/deepinsight/insightface
Framework mxnet
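
The benchmark's headline metric is recognition coverage at a fixed high precision, with models allowed to abstain. A hedged sketch of that style of evaluation (the exact protocol and thresholds are in the paper; the entity keys below are made up):

```python
# Coverage-at-precision evaluation sketch: a prediction of None means the
# model abstains on that image.
def coverage_at_precision(preds, labels, target_precision=0.95):
    answered = [(p, y) for p, y in zip(preds, labels) if p is not None]
    if not answered:
        return 0.0
    precision = sum(p == y for p, y in answered) / len(answered)
    coverage = len(answered) / len(labels)
    return coverage if precision >= target_precision else 0.0

preds = ["m.123", None, "m.456", "m.789"]    # hypothetical entity keys
labels = ["m.123", "m.000", "m.456", "m.789"]
print(coverage_at_precision(preds, labels))  # 0.75 (precision 1.0 on answered)
```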

A Convolutional Encoder Model for Neural Machine Translation

Title A Convolutional Encoder Model for Neural Machine Translation
Authors Jonas Gehring, Michael Auli, David Grangier, Yann N. Dauphin
Abstract The prevalent approach to neural machine translation relies on bi-directional LSTMs to encode the source sentence. In this paper we present a faster and simpler architecture based on a succession of convolutional layers. This allows the entire source sentence to be encoded simultaneously, unlike recurrent networks, whose computation is constrained by temporal dependencies. On WMT’16 English-Romanian translation we achieve accuracy competitive with the state of the art, and we outperform several recently published results on the WMT’15 English-German task. Our models obtain almost the same accuracy as a very deep LSTM setup on WMT’14 English-French translation. Our convolutional encoder speeds up CPU decoding by more than a factor of two at the same or higher accuracy than a strong bi-directional LSTM baseline.
Tasks Machine Translation
Published 2016-11-07
URL http://arxiv.org/abs/1611.02344v3
PDF http://arxiv.org/pdf/1611.02344v3.pdf
PWC https://paperswithcode.com/paper/a-convolutional-encoder-model-for-neural
Repo https://github.com/facebookresearch/fairseq
Framework torch
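
A minimal sketch of the core architectural idea, not fairseq itself: a stack of length-preserving 1-D convolutions with residual connections encodes every source position in parallel. Hyper-parameters are illustrative:

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, vocab, d=256, layers=5, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.convs = nn.ModuleList(
            nn.Conv1d(d, d, kernel, padding=kernel // 2) for _ in range(layers))

    def forward(self, tokens):                   # tokens: (batch, src_len)
        x = self.embed(tokens).transpose(1, 2)   # -> (batch, d, src_len)
        for conv in self.convs:
            x = torch.tanh(conv(x)) + x          # residual connection
        return x.transpose(1, 2)                 # (batch, src_len, d)

enc = ConvEncoder(vocab=1000)
out = enc(torch.randint(0, 1000, (2, 7)))  # all positions encoded at once
```

Because no state is carried across positions, the whole source is processed in one pass, which is where the CPU decoding speed-up comes from.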

Hierarchical Deep Temporal Models for Group Activity Recognition

Title Hierarchical Deep Temporal Models for Group Activity Recognition
Authors Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, Greg Mori
Abstract In this paper we present an approach for classifying the activity performed by a group of people in a video sequence. This problem of group activity recognition can be addressed by examining individual person actions and their relations. Temporal dynamics exist both at the level of individual person actions as well as at the level of group activity. Given a video sequence as input, methods can be developed to capture these dynamics at both person-level and group-level detail. We build a deep model to capture these dynamics based on LSTM (long short-term memory) models. In order to model both person-level and group-level dynamics, we present a 2-stage deep temporal model for the group activity recognition problem. In our approach, one LSTM model is designed to represent action dynamics of individual people in a video sequence and another LSTM model is designed to aggregate person-level information for group activity recognition. We collected a new dataset consisting of volleyball videos labeled with individual and group activities in order to evaluate our method. Experimental results on this new Volleyball Dataset and the standard benchmark Collective Activity Dataset demonstrate the efficacy of the proposed models.
Tasks Activity Recognition, Group Activity Recognition
Published 2016-07-09
URL http://arxiv.org/abs/1607.02643v1
PDF http://arxiv.org/pdf/1607.02643v1.pdf
PWC https://paperswithcode.com/paper/hierarchical-deep-temporal-models-for-group
Repo https://github.com/mostafa-saad/deep-activity-rec
Framework none
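
A hedged sketch of the 2-stage temporal model's shape flow: a person-level LSTM encodes each tracked player, hidden states are pooled across players per frame, and a group-level LSTM produces the activity logits. The CNN feature extractor and the choice of max-pooling here are assumptions:

```python
import torch
import torch.nn as nn

class TwoStageLSTM(nn.Module):
    def __init__(self, feat=128, hid=64, n_classes=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat, hid, batch_first=True)
        self.group_lstm = nn.LSTM(hid, hid, batch_first=True)
        self.cls = nn.Linear(hid, n_classes)

    def forward(self, x):                    # x: (batch, people, time, feat)
        b, p, t, f = x.shape
        h, _ = self.person_lstm(x.reshape(b * p, t, f))
        h = h.reshape(b, p, t, -1).max(dim=1).values   # pool over people
        g, _ = self.group_lstm(h)
        return self.cls(g[:, -1])            # logits from final time step

model = TwoStageLSTM()
logits = model(torch.randn(2, 6, 10, 128))  # 2 clips, 6 players, 10 frames
```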

Synthesizing the preferred inputs for neurons in neural networks via deep generator networks

Title Synthesizing the preferred inputs for neurons in neural networks via deep generator networks
Authors Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, Jeff Clune
Abstract Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is fascinating basic science in its own right - similar to why we study the human brain - and will enable researchers to further improve DNNs. One path to understanding how a neural network functions internally is to study what each of its neurons has learned to detect. One such method is called activation maximization (AM), which synthesizes an input (e.g. an image) that highly activates a neuron. Here we dramatically improve the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN). The algorithm (1) generates qualitatively state-of-the-art synthetic images that look almost real, (2) reveals the features learned by each neuron in an interpretable way, (3) generalizes well to new datasets and somewhat well to different network architectures without requiring the prior to be relearned, and (4) can be considered as a high-quality generative method (in this case, by generating novel, creative, interesting, recognizable images).
Tasks
Published 2016-05-30
URL http://arxiv.org/abs/1605.09304v5
PDF http://arxiv.org/pdf/1605.09304v5.pdf
PWC https://paperswithcode.com/paper/synthesizing-the-preferred-inputs-for-neurons
Repo https://github.com/Evolving-AI-Lab/synthesizing
Framework caffe2
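
A minimal sketch of DGN-based activation maximization under stated assumptions: `generator` and `classifier` below are tiny placeholders for the pretrained networks the paper uses, and the regularizer is illustrative. The key idea the sketch shows is optimizing the latent code rather than raw pixels:

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(100, 3 * 64 * 64), nn.Tanh())     # placeholder G
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # placeholder DNN

z = torch.zeros(1, 100, requires_grad=True)   # latent code being optimized
opt = torch.optim.SGD([z], lr=1.0)
target_neuron = 3

for _ in range(200):
    img = generator(z)                     # image constrained to G's output space
    act = classifier(img)[0, target_neuron]
    loss = -act + 1e-3 * z.norm() ** 2     # maximize activation, keep z small
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the image is always a decoder output, the optimization stays on the manifold of natural-looking images, which is what makes the results "look almost real" compared with pixel-space AM.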

Associative Embedding: End-to-End Learning for Joint Detection and Grouping

Title Associative Embedding: End-to-End Learning for Joint Detection and Grouping
Authors Alejandro Newell, Zhiao Huang, Jia Deng
Abstract We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping. A number of computer vision problems can be framed in this manner, including multi-person pose estimation, instance segmentation, and multi-object tracking. The grouping of detections is usually achieved with multi-stage pipelines; instead, we propose an approach that teaches a network to simultaneously output detections and group assignments. This technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions. We show how to apply this method to both multi-person pose estimation and instance segmentation, and report state-of-the-art performance for multi-person pose on the MPII and MS-COCO datasets.
Tasks Instance Segmentation, Keypoint Detection, Multi-Object Tracking, Multi-Person Pose Estimation, Object Tracking, Pose Estimation, Semantic Segmentation
Published 2016-11-16
URL http://arxiv.org/abs/1611.05424v2
PDF http://arxiv.org/pdf/1611.05424v2.pdf
PWC https://paperswithcode.com/paper/associative-embedding-end-to-end-learning-for
Repo https://github.com/stevehjc/pose-ae-demo-tf
Framework tf
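
A hedged sketch of the grouping ("associative embedding") loss with scalar tags: embeddings of detections in the same group are pulled toward their group mean, and group means are pushed apart. The exact push penalty here is a simplification of the paper's:

```python
import torch

def ae_loss(tags, groups):
    """tags: (n,) predicted embedding per detection; groups: (n,) group ids."""
    means = {g: tags[groups == g].mean() for g in groups.unique().tolist()}
    pull = sum(((tags[groups == g] - m) ** 2).mean() for g, m in means.items())
    push = 0.0
    ms = list(means.values())
    for i in range(len(ms)):
        for j in range(i + 1, len(ms)):
            push = push + torch.exp(-(ms[i] - ms[j]) ** 2)  # penalize close means
    return pull / len(ms) + push

tags = torch.tensor([0.1, 0.2, 1.9, 2.1])   # two well-separated groups
groups = torch.tensor([0, 0, 1, 1])
print(ae_loss(tags, groups))
```

At inference, detections are grouped by simply clustering tags that land near each other, which is what lets one network emit detections and groupings jointly.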

Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image

Title Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image
Authors Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, Michael J. Black
Abstract We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.
Tasks
Published 2016-07-27
URL http://arxiv.org/abs/1607.08128v1
PDF http://arxiv.org/pdf/1607.08128v1.pdf
PWC https://paperswithcode.com/paper/keep-it-smpl-automatic-estimation-of-3d-human
Repo https://github.com/Jtoo/fitting_human_smpl_model
Framework tf
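
A toy sketch of the fitting objective, not SMPL or SMPLify itself: `joints_3d` is a stand-in for the SMPL joint regressor, and the projection is orthographic rather than the paper's perspective camera. What remains is the core loop of minimizing 2-D reprojection error over pose parameters:

```python
import torch

def joints_3d(theta):          # placeholder for the SMPL joint regressor
    return theta.view(-1, 3)

detected_2d = torch.rand(12, 2)                 # e.g. DeepCut keypoints
theta = torch.zeros(12 * 3, requires_grad=True) # stand-in pose parameters
opt = torch.optim.Adam([theta], lr=0.05)

for _ in range(300):
    proj = joints_3d(theta)[:, :2]              # orthographic projection
    loss = ((proj - detected_2d) ** 2).sum()    # reprojection error
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The real objective adds priors over pose and shape plus the interpenetration term, which is what keeps the fit plausible when the 2-D evidence is ambiguous.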

Applying Deep Learning to Basketball Trajectories

Title Applying Deep Learning to Basketball Trajectories
Authors Rajiv Shah, Rob Romijnders
Abstract One of the emerging trends in sports analytics is the growing use of player and ball tracking data. A parallel development is deep learning predictive approaches that use vast quantities of data with less reliance on feature engineering. This paper applies recurrent neural networks in the form of sequence modeling to predict whether a three-point shot is successful. The models are capable of learning the trajectory of a basketball without any knowledge of physics. For comparison, a baseline static machine learning model with a full set of features, such as angle and velocity, in addition to the positional data, is also tested. Using a dataset of over 20,000 three-pointers from NBA SportVu data, the models based simply on sequential positional data outperform a static, feature-rich machine learning model in predicting whether a three-point shot is successful. This suggests deep learning models may offer an improvement over traditional feature-based machine learning methods for tracking data.
Tasks Feature Engineering
Published 2016-08-12
URL http://arxiv.org/abs/1608.03793v2
PDF http://arxiv.org/pdf/1608.03793v2.pdf
PWC https://paperswithcode.com/paper/applying-deep-learning-to-basketball
Repo https://github.com/RobRomijnders/RNN_basketball
Framework tf
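
A minimal sketch of the sequence model, with architecture details assumed: a GRU reads the tracked (x, y, z) ball positions over time and outputs a make/miss logit:

```python
import torch
import torch.nn as nn

class ShotRNN(nn.Module):
    def __init__(self, hid=32):
        super().__init__()
        self.rnn = nn.GRU(3, hid, batch_first=True)  # input: (x, y, z) per step
        self.out = nn.Linear(hid, 1)

    def forward(self, traj):            # traj: (batch, time, 3)
        _, h = self.rnn(traj)
        return self.out(h[-1])          # logit for P(shot is successful)

model = ShotRNN()
logit = model(torch.randn(4, 30, 3))   # 4 shots, 30 tracked time steps each
prob = torch.sigmoid(logit)
```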

Colorful Image Colorization

Title Colorful Image Colorization
Authors Richard Zhang, Phillip Isola, Alexei A. Efros
Abstract Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a “colorization Turing test,” asking human participants to choose between a generated and ground truth color image. Our method successfully fools humans on 32% of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. This approach results in state-of-the-art performance on several feature learning benchmarks.
Tasks Colorization
Published 2016-03-28
URL http://arxiv.org/abs/1603.08511v5
PDF http://arxiv.org/pdf/1603.08511v5.pdf
PWC https://paperswithcode.com/paper/colorful-image-colorization
Repo https://github.com/Epiphqny/Colorization
Framework pytorch
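
A hedged sketch of the training objective described in the abstract: per-pixel classification over quantized ab color bins, with the cross-entropy reweighted by smoothed inverse bin frequency (class rebalancing). The paper uses 313 in-gamut bins and a smoothing weight of 0.5; the frequencies below are random stand-ins:

```python
import torch
import torch.nn.functional as F

def rebalanced_ce(logits, target_bins, bin_freq, lam=0.5):
    """logits: (b, 313, h, w); target_bins: (b, h, w) bin indices;
    bin_freq: (313,) empirical ab-bin frequencies (sums to 1)."""
    w = 1.0 / ((1 - lam) * bin_freq + lam / len(bin_freq))  # smoothed inverse freq
    w = w / (w * bin_freq).sum()                            # normalize: E[w] = 1
    return F.cross_entropy(logits, target_bins, weight=w)

logits = torch.randn(2, 313, 8, 8)          # network output per pixel
targets = torch.randint(0, 313, (2, 8, 8))  # quantized ab bin of each pixel
freq = torch.rand(313)
freq = freq / freq.sum()
print(rebalanced_ce(logits, targets, freq))
```

Upweighting rare, saturated bins is what pushes the model away from the desaturated averages that a plain regression loss produces.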

Guided Open Vocabulary Image Captioning with Constrained Beam Search

Title Guided Open Vocabulary Image Captioning with Constrained Beam Search
Authors Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould
Abstract Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real-world applications dealing with images in the wild. We address this problem using a flexible approach that enables existing deep captioning architectures to take advantage of image taggers at test time, without re-training. Our method uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words. Using this approach we achieve state-of-the-art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning). Perhaps surprisingly, our results significantly outperform approaches that incorporate the same tag predictions into the learning algorithm. We also show that we can significantly improve the quality of generated ImageNet captions by leveraging ground-truth labels.
Tasks Image Captioning, Word Embeddings
Published 2016-12-02
URL http://arxiv.org/abs/1612.00576v2
PDF http://arxiv.org/pdf/1612.00576v2.pdf
PWC https://paperswithcode.com/paper/guided-open-vocabulary-image-captioning-with
Repo https://github.com/nocaps-org/updown-baseline
Framework pytorch
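
A simplified sketch of constrained beam search for a single forced tag word: hypotheses are kept in two banks (constraint unmet / constraint met), and only finished hypotheses that include the tag may be returned. The real method handles multiple constraints with a finite-state machine, and `step_logprobs` below is a random stand-in for the captioning model:

```python
import heapq, math, random

def step_logprobs(prefix, vocab):              # random stand-in for the decoder
    random.seed(hash(tuple(prefix)) % 10**6)
    return {w: math.log(random.random() + 1e-9) for w in vocab}

def constrained_beam(vocab, forced, eos, width=3, max_len=6):
    banks = {False: [(0.0, [])], True: []}     # met-constraint flag -> beam
    for _ in range(max_len):
        new = {False: [], True: []}
        for met, beam in banks.items():
            for score, seq in beam:
                if seq and seq[-1] == eos:     # finished: carry over unchanged
                    new[met].append((score, seq))
                    continue
                for w, lp in step_logprobs(seq, vocab).items():
                    new[met or w == forced].append((score + lp, seq + [w]))
        banks = {k: heapq.nlargest(width, v) for k, v in new.items()}
    done = [h for h in banks[True] if h[1] and h[1][-1] == eos]
    return max(done or banks[True] or [(float("-inf"), [])])

print(constrained_beam(["a", "dog", "runs", "</s>"], "dog", "</s>"))
```

Because the constraint is enforced purely at decoding time, any pretrained captioner can be steered by a tagger without re-training, which is the point of the paper.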

Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots

Title Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots
Authors Yu Wu, Wei Wu, Chen Xing, Ming Zhou, Zhoujun Li
Abstract We study response selection for multi-turn conversation in retrieval-based chatbots. Existing work either concatenates utterances in the context or matches a response only with a highly abstract context vector, which may lose relationships among utterances or important contextual information. We propose a sequential matching network (SMN) to address both problems. SMN first matches a response with each utterance in the context on multiple levels of granularity, and distills important matching information from each pair into a vector with convolution and pooling operations. The vectors are then accumulated in chronological order by a recurrent neural network (RNN) that models relationships among utterances. The final matching score is calculated from the hidden states of the RNN. An empirical study on two public data sets shows that SMN can significantly outperform state-of-the-art methods for response selection in multi-turn conversation.
Tasks Conversational Response Selection
Published 2016-12-06
URL http://arxiv.org/abs/1612.01627v2
PDF http://arxiv.org/pdf/1612.01627v2.pdf
PWC https://paperswithcode.com/paper/sequential-matching-network-a-new
Repo https://github.com/yangliuy/NeuralResponseRanking
Framework tf
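
A hedged sketch of SMN's shape flow: each (utterance, response) pair yields a word-word similarity matrix, a small CNN with pooling distills each matrix to a matching vector, and a GRU accumulates the vectors in chronological order into a score. Only one of the paper's granularity channels is sketched, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class SMNSketch(nn.Module):
    def __init__(self, d=50, hid=32):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(),
                                  nn.AdaptiveMaxPool2d(4), nn.Flatten())
        self.proj = nn.Linear(8 * 4 * 4, hid)
        self.gru = nn.GRU(hid, hid, batch_first=True)
        self.score = nn.Linear(hid, 1)

    def forward(self, utterances, response):
        # utterances: (turns, u_len, d) word vectors; response: (r_len, d)
        sims = torch.stack([u @ response.T for u in utterances])  # word-word match
        v = self.proj(self.conv(sims.unsqueeze(1)))               # one vec per turn
        _, h = self.gru(v.unsqueeze(0))                           # accumulate turns
        return self.score(h[-1])                                  # matching score

model = SMNSketch()
print(model(torch.randn(3, 10, 50), torch.randn(12, 50)))  # 3-turn context
```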

VoxResNet: Deep Voxelwise Residual Networks for Volumetric Brain Segmentation

Title VoxResNet: Deep Voxelwise Residual Networks for Volumetric Brain Segmentation
Authors Hao Chen, Qi Dou, Lequan Yu, Pheng-Ann Heng
Abstract Recently, deep residual learning with residual units for training very deep neural networks has advanced the state-of-the-art performance on 2D image recognition tasks, e.g., object detection and segmentation. However, how to fully leverage contextual representations for recognition tasks from volumetric data has not been well studied, especially in the field of medical image computing, where a majority of image modalities are volumetric. In this paper we explore deep residual learning for the task of volumetric brain segmentation. Our work makes at least two main contributions. First, we propose a deep voxelwise residual network, referred to as VoxResNet, which borrows the spirit of deep residual learning from 2D image recognition tasks and extends it into a 3D variant for handling volumetric data. Second, an auto-context version of VoxResNet is proposed by seamlessly integrating low-level image appearance features, implicit shape information and high-level context for further improving volumetric segmentation performance. Extensive experiments on a challenging benchmark for brain segmentation from magnetic resonance (MR) images corroborate the efficacy of our method in dealing with volumetric data. We believe this work unravels the potential of 3D deep learning to advance recognition performance on volumetric image segmentation.
Tasks Brain Segmentation, Object Detection, Semantic Segmentation
Published 2016-08-21
URL http://arxiv.org/abs/1608.05895v1
PDF http://arxiv.org/pdf/1608.05895v1.pdf
PWC https://paperswithcode.com/paper/voxresnet-deep-voxelwise-residual-networks
Repo https://github.com/mediteamC/teamC
Framework pytorch
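
A minimal sketch of the building block implied by the name: a pre-activation residual unit built from Conv3d/BatchNorm3d so it consumes volumetric MR data directly. Channel counts are illustrative:

```python
import torch
import torch.nn as nn

class VoxRes(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(ch), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(ch), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, kernel_size=3, padding=1))

    def forward(self, x):
        return x + self.body(x)            # identity skip over the 3-D convs

block = VoxRes()
vol = torch.randn(1, 32, 16, 64, 64)       # (batch, ch, depth, height, width)
print(block(vol).shape)                    # shape is preserved by the block
```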