May 7, 2019

3456 words 17 mins read

Paper Group AWR 7

DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification. Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data. Distributed and parallel time series feature extraction for industrial big data applications. A Riemannian Framework for Statistical Analysis of Topological Persistence Diagrams. Programming P …

DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification

Title DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification
Authors Rohit Babbar, Bernhard Schölkopf
Abstract Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. Datasets in extreme classification follow a power-law distribution, i.e. a large fraction of labels have very few positive instances. Most state-of-the-art approaches for extreme multi-label classification attempt to capture correlation among labels by embedding the label matrix into a low-dimensional linear sub-space. However, in the presence of power-law distributed, extremely large and diverse label spaces, structural assumptions such as low rank can be easily violated. In this work, we present DiSMEC, a large-scale distributed framework for learning one-versus-rest linear classifiers coupled with explicit capacity control to limit model size. Unlike most state-of-the-art methods, DiSMEC does not make any low-rank assumption on the label matrix. Using a double layer of parallelization, DiSMEC can learn classifiers for datasets consisting of hundreds of thousands of labels within a few hours. The explicit capacity control mechanism filters out spurious parameters, keeping the model compact without losing prediction accuracy. We conduct extensive empirical evaluation on publicly available real-world datasets consisting of up to 670,000 labels. We compare DiSMEC with recent state-of-the-art approaches, including SLEEC, a leading approach for learning sparse local embeddings, and FastXML, a tree-based approach optimizing a ranking-based loss function. On some of the datasets, DiSMEC significantly boosts prediction accuracies - 10% better compared to SLEEC and 15% better compared to FastXML, in absolute terms.
Tasks Extreme Multi-Label Classification, Multi-Label Classification, Multi-Label Learning
Published 2016-09-08
URL http://arxiv.org/abs/1609.02521v1
PDF http://arxiv.org/pdf/1609.02521v1.pdf
PWC https://paperswithcode.com/paper/dismec-distributed-sparse-machines-for
Repo https://github.com/Refefer/fastxml
Framework none
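
The abstract boils down to per-label linear classifiers trained in parallel, with small weights pruned away to keep the model sparse. Below is a minimal sketch of that idea using scikit-learn and joblib; the solver, the pruning threshold of 0.01, and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Minimal one-vs-rest sketch with explicit capacity control (weight pruning).
# Illustrative only: the solver choice and the 0.01 threshold are assumptions.
import numpy as np
from joblib import Parallel, delayed
from scipy.sparse import csr_matrix, vstack
from sklearn.svm import LinearSVC

def train_one_label(X, y_col, prune_threshold=0.01):
    """Train a binary linear classifier for one label and prune tiny weights."""
    clf = LinearSVC(C=1.0, max_iter=2000)
    clf.fit(X, y_col)
    w = clf.coef_.ravel()
    w[np.abs(w) < prune_threshold] = 0.0      # explicit capacity control
    return csr_matrix(w)                      # store the pruned weights sparsely

def train_one_vs_rest(X, Y, n_jobs=2):
    """Y is an (n_samples, n_labels) 0/1 matrix; one classifier per label."""
    rows = Parallel(n_jobs=n_jobs)(
        delayed(train_one_label)(X, Y[:, j]) for j in range(Y.shape[1])
    )
    return vstack(rows)                       # (n_labels, n_features) sparse weights

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
Y = (rng.random((200, 5)) > 0.9).astype(int)
W = train_one_vs_rest(X, Y)
scores = X @ W.toarray().T                    # rank labels by score at test time
```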

Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

Title Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Authors Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, Kunal Talwar
Abstract Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information. To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data: Private Aggregation of Teacher Ensembles (PATE). The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as “teachers” for a “student” model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student’s privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student’s training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings. Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs. We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning.
Tasks Transfer Learning
Published 2016-10-18
URL http://arxiv.org/abs/1610.05755v4
PDF http://arxiv.org/pdf/1610.05755v4.pdf
PWC https://paperswithcode.com/paper/semi-supervised-knowledge-transfer-for-deep
Repo https://github.com/styluna7/60-days-of-Udacity
Framework pytorch
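
The core aggregation step PATE describes, noisy voting among teachers, is easy to sketch: count teacher votes per class, add Laplace noise, and return the argmax as the student's training label. The noise scale below is an illustrative value, not the paper's setting, and the differential-privacy accounting is omitted.

```python
# Hedged sketch of PATE-style noisy aggregation of teacher votes.
# The Laplace scale (1/gamma) is illustrative, not the paper's setting.
import numpy as np

def noisy_aggregate(teacher_preds, n_classes, gamma=0.05, rng=None):
    """teacher_preds: array of shape (n_teachers,) with each teacher's predicted class.
    Returns the label the student is trained on: argmax of noisy vote counts."""
    rng = rng if rng is not None else np.random.default_rng()
    counts = np.bincount(teacher_preds, minlength=n_classes).astype(float)
    counts += rng.laplace(loc=0.0, scale=1.0 / gamma, size=n_classes)  # privacy noise
    return int(np.argmax(counts))

# toy usage: 250 teachers voting on one unlabeled student query
rng = np.random.default_rng(0)
votes = rng.integers(0, 10, size=250)
student_label = noisy_aggregate(votes, n_classes=10, rng=rng)
```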

Distributed and parallel time series feature extraction for industrial big data applications

Title Distributed and parallel time series feature extraction for industrial big data applications
Authors Maximilian Christ, Andreas W. Kempa-Liehr, Michael Feindt
Abstract The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm for time series, which filters the available features at an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has low computational complexity, allows one to start on a problem with only limited domain knowledge, can be trivially parallelized, is highly scalable, and is based on well-studied non-parametric hypothesis tests. We benchmark the proposed algorithm on all binary classification problems of the UCR time series classification archive, as well as time series from a production line optimization project and simulated stochastic processes with an underlying qualitative change of dynamics.
Tasks Feature Importance, Feature Selection, Time Series, Time Series Classification
Published 2016-10-25
URL http://arxiv.org/abs/1610.07717v3
PDF http://arxiv.org/pdf/1610.07717v3.pdf
PWC https://paperswithcode.com/paper/distributed-and-parallel-time-series-feature
Repo https://github.com/blue-yonder/tsfresh
Framework none
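
A minimal sketch of the filtered feature extraction idea: compute simple summary features per series, test each feature's relevance to a binary target with a non-parametric test, and keep only features passing a Benjamini-Hochberg filter that controls the expected fraction of irrelevant selections. The feature set and the choice of the Mann-Whitney U test are simplifications of what the tsfresh library linked above actually implements.

```python
# Hedged sketch of filtered feature extraction for binary targets.
import numpy as np
from scipy.stats import mannwhitneyu

def extract_features(series_list):
    """Map each 1-D time series to a few simple summary features."""
    feats = []
    for s in series_list:
        s = np.asarray(s, dtype=float)
        feats.append([s.mean(), s.std(), s.min(), s.max(),
                      np.abs(np.diff(s)).mean()])
    return np.array(feats)

def select_relevant(features, y, fdr=0.05):
    """Benjamini-Hochberg filter: keep features whose distribution differs
    between the two classes while controlling the expected false discovery rate."""
    pvals = np.array([
        mannwhitneyu(features[y == 0, j], features[y == 1, j]).pvalue
        for j in range(features.shape[1])
    ])
    order = np.argsort(pvals)
    thresh = fdr * np.arange(1, len(pvals) + 1) / len(pvals)
    passed = pvals[order] <= thresh
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    keep = np.sort(order[:k])
    return features[:, keep], keep

# toy usage: class-1 series have a higher mean
rng = np.random.default_rng(0)
series = [rng.normal(loc=i % 2, size=100) for i in range(60)]
y = np.array([i % 2 for i in range(60)])
X = extract_features(series)
X_selected, kept_indices = select_relevant(X, y)
```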

A Riemannian Framework for Statistical Analysis of Topological Persistence Diagrams

Title A Riemannian Framework for Statistical Analysis of Topological Persistence Diagrams
Authors Rushil Anirudh, Vinay Venkataraman, Karthikeyan Natesan Ramamurthy, Pavan Turaga
Abstract Topological data analysis is becoming a popular way to study high dimensional feature spaces without any contextual clues or assumptions. This paper concerns itself with one popular topological feature, which is the number of $d$-dimensional holes in the dataset, also known as the Betti-$d$ number. The persistence of the Betti numbers over various scales is encoded into a persistence diagram (PD), which indicates the birth and death times of these holes as scale varies. A common way to compare PDs is by a point-to-point matching, which is given by the $n$-Wasserstein metric. However, a big drawback of this approach is the need to solve correspondence between points before computing the distance; for $n$ points, the complexity grows according to $\mathcal{O}(n^3)$. Instead, we propose to use an entirely new framework built on Riemannian geometry that models PDs as 2D probability density functions represented in the square-root framework on a Hilbert sphere. The resulting space is much more intuitive, with closed-form expressions for common operations. The distance metric is 1) correspondence-free and also 2) independent of the number of points in the dataset. The complexity of computing distance between PDs now grows according to $\mathcal{O}(K^2)$, for a $K \times K$ discretization of $[0,1]^2$. This also enables the use of existing machinery in differential geometry for statistical analysis of PDs, such as computing the mean, geodesics, classification, etc. We report competitive results with the Wasserstein metric, at a much lower computational load, indicating the favorable properties of the proposed approach.
Tasks Topological Data Analysis
Published 2016-05-28
URL http://arxiv.org/abs/1605.08912v1
PDF http://arxiv.org/pdf/1605.08912v1.pdf
PWC https://paperswithcode.com/paper/a-riemannian-framework-for-statistical
Repo https://github.com/rushilanirudh/pdsphere
Framework none
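
A hedged sketch of the square-root-density view of persistence diagrams: rasterize a PD into a K x K density over [0,1]^2, take its square root so it lies on the unit Hilbert sphere, and measure distance as the arc length (arccos of the inner product), which costs O(K^2) per pair. The Gaussian bandwidth and grid size are illustrative choices, not the paper's settings.

```python
# Hedged sketch: persistence diagrams as square-root densities on a Hilbert sphere.
import numpy as np

def pd_to_sqrt_density(points, K=64, sigma=0.02):
    """points: (n, 2) birth/death pairs in [0, 1]^2."""
    xs = np.linspace(0.0, 1.0, K)
    gx, gy = np.meshgrid(xs, xs, indexing="ij")
    density = np.zeros((K, K))
    for b, d in points:
        density += np.exp(-((gx - b) ** 2 + (gy - d) ** 2) / (2 * sigma ** 2))
    density /= density.sum() + 1e-12          # normalize to a probability mass
    return np.sqrt(density)                   # square-root representation

def sphere_distance(p, q):
    """Geodesic (arc-length) distance between two square-root densities; O(K^2)."""
    inner = np.sum(p * q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return float(np.arccos(np.clip(inner, -1.0, 1.0)))

# toy usage with two small diagrams
pd1 = np.array([[0.1, 0.4], [0.2, 0.9]])
pd2 = np.array([[0.15, 0.45], [0.5, 0.6]])
d = sphere_distance(pd_to_sqrt_density(pd1), pd_to_sqrt_density(pd2))
```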

Programming Patterns in Dataflow Matrix Machines and Generalized Recurrent Neural Nets

Title Programming Patterns in Dataflow Matrix Machines and Generalized Recurrent Neural Nets
Authors Michael Bukatin, Steve Matthews, Andrey Radul
Abstract Dataflow matrix machines arise naturally in the context of synchronous dataflow programming with linear streams. They can be viewed as a rather powerful generalization of recurrent neural networks. Similarly to recurrent neural networks, large classes of dataflow matrix machines are described by matrices of numbers, and therefore dataflow matrix machines can be synthesized by computing their matrices. At the same time, the evidence is fairly strong that dataflow matrix machines have sufficient expressive power to be a convenient general-purpose programming platform. Because of the network nature of this platform, programming patterns often correspond to patterns of connectivity in the generalized recurrent neural networks understood as programs. This paper explores a variety of such programming patterns.
Tasks
Published 2016-06-30
URL http://arxiv.org/abs/1606.09470v2
PDF http://arxiv.org/pdf/1606.09470v2.pdf
PWC https://paperswithcode.com/paper/programming-patterns-in-dataflow-matrix
Repo https://github.com/anhinga/fluid
Framework none
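
A toy, heavily simplified reading of "described by matrices of numbers": a weight matrix mixes the previous outputs of all neurons into each neuron's input, and each neuron then applies its own stream transform. This is an illustrative sketch of a generalized recurrent network, not the authors' formalism.

```python
# Toy generalized-RNN sketch: the matrix defines the network, the transforms
# define the neuron kinds. Illustrative only, not the paper's formalism.
import numpy as np

def step(W, outputs, transforms):
    """One update: a linear move (the matrix mixes all outputs into the inputs)
    followed by each neuron applying its own transform to its input."""
    inputs = W @ outputs
    return np.array([f(x) for f, x in zip(transforms, inputs)])

# three "neuron kinds": an accumulator-like identity, tanh, and a constant source
transforms = [lambda x: x, np.tanh, lambda x: 1.0]
W = np.array([[1.0, 0.5, 0.0],
              [0.3, 0.0, 0.2],
              [0.0, 0.0, 0.0]])
outputs = np.zeros(3)
for _ in range(10):
    outputs = step(W, outputs, transforms)
```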

Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection

Title Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection
Authors Vered Shwartz, Enrico Santus, Dominik Schlechtweg
Abstract The fundamental role of hypernymy in NLP has motivated the development of many methods for the automatic identification of this relation, most of which rely on word distribution. We investigate an extensive number of such unsupervised measures, using several distributional semantic models that differ by context type and feature weighting. We analyze the performance of the different methods based on their linguistic motivation. Comparison to the state-of-the-art supervised methods shows that while supervised methods generally outperform the unsupervised ones, the former are sensitive to the distribution of training instances, hurting their reliability. Being based on general linguistic hypotheses and independent from training data, unsupervised measures are more robust, and therefore are still useful artillery for hypernymy detection.
Tasks Hypernym Discovery
Published 2016-12-14
URL http://arxiv.org/abs/1612.04460v2
PDF http://arxiv.org/pdf/1612.04460v2.pdf
PWC https://paperswithcode.com/paper/hypernyms-under-siege-linguistically
Repo https://github.com/vered1986/UnsupervisedHypernymy
Framework none
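
One commonly used unsupervised measure of the kind the paper surveys is distributional inclusion, e.g. Weeds precision: a hypernym is expected to cover most of the contexts of its hyponyms. The sketch below is a single illustrative measure over toy context vectors, not necessarily the best-performing one in the paper.

```python
# One illustrative unsupervised inclusion measure (Weeds precision) over
# distributional context vectors; the paper compares many such measures.
import numpy as np

def weeds_precision(narrow, broad):
    """How much of the weight of `narrow`'s contexts is covered by `broad`'s contexts.
    High values suggest `broad` is a plausible hypernym of `narrow`."""
    narrow = np.asarray(narrow, dtype=float)
    broad = np.asarray(broad, dtype=float)
    included = narrow[(narrow > 0) & (broad > 0)].sum()
    return included / (narrow.sum() + 1e-12)

# toy PPMI-like context vectors over 6 contexts
cat    = np.array([3.0, 2.0, 0.0, 1.0, 0.0, 0.0])
animal = np.array([2.0, 1.0, 4.0, 2.0, 3.0, 1.0])
print(weeds_precision(cat, animal))   # close to 1: 'animal' covers 'cat's contexts
print(weeds_precision(animal, cat))   # lower: 'cat' does not cover 'animal's contexts
```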

Can Active Memory Replace Attention?

Title Can Active Memory Replace Attention?
Authors Łukasz Kaiser, Samy Bengio
Abstract Several mechanisms to focus the attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech recognition, generative models, and learning algorithmic tasks, but it has probably had the largest impact on neural machine translation. Recently, similar improvements have been obtained using alternative mechanisms that do not focus on a single part of a memory but operate on all of it in parallel, in a uniform way. Such a mechanism, which we call active memory, has improved over attention in algorithmic tasks, image processing, and generative modelling. So far, however, active memory has not improved over attention for most natural language processing tasks, in particular for machine translation. We analyze this shortcoming in this paper and propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings the most benefit and where attention can be a better choice.
Tasks Image Captioning, Machine Translation
Published 2016-10-27
URL http://arxiv.org/abs/1610.08613v2
PDF http://arxiv.org/pdf/1610.08613v2.pdf
PWC https://paperswithcode.com/paper/can-active-memory-replace-attention
Repo https://github.com/tensorflow/models/tree/master/research/neural_gpu
Framework tf
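
A toy contrast between the two mechanisms the abstract compares: attention produces one focused, convex combination of memory slots, while an active-memory step applies the same operation (here a small 1-D convolution) to every slot in parallel. Shapes and operations are illustrative assumptions.

```python
# Toy contrast: attention (focused read) vs active memory (parallel update).
import numpy as np

def attention_read(memory, query):
    """memory: (n_slots, d); returns one convex combination of slots."""
    scores = memory @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory                      # single focused read, shape (d,)

def active_memory_step(memory, kernel):
    """Apply the same 1-D convolution to every slot position in parallel."""
    n, d = memory.shape
    out = np.zeros_like(memory)
    half = len(kernel) // 2
    for i in range(n):
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                out[i] += w * memory[j]
        out[i] = np.tanh(out[i])
    return out                                   # whole memory updated, shape (n, d)

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 4))
read = attention_read(memory, query=rng.normal(size=4))
memory = active_memory_step(memory, kernel=[0.25, 0.5, 0.25])
```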

Sequential Voting Promotes Collective Discovery in Social Recommendation Systems

Title Sequential Voting Promotes Collective Discovery in Social Recommendation Systems
Authors L. Elisa Celis, Peter M. Krafft, Nathan Kobe
Abstract One goal of online social recommendation systems is to harness the wisdom of crowds in order to identify high quality content. Yet the sequential voting mechanisms that are commonly used by these systems are at odds with existing theoretical and empirical literature on optimal aggregation. This literature suggests that sequential voting will promote herding—the tendency for individuals to copy the decisions of others around them—and hence lead to suboptimal content recommendation. Is there a problem with our practice, or a problem with our theory? Previous attempts at answering this question have been limited by a lack of objective measurements of content quality. Quality is typically defined endogenously as the popularity of content in absence of social influence. The flaw of this metric is its presupposition that the preferences of the crowd are aligned with underlying quality. Domains in which content quality can be defined exogenously and measured objectively are thus needed in order to better assess the design choices of social recommendation systems. In this work, we look to the domain of education, where content quality can be measured via how well students are able to learn from the material presented to them. Through a behavioral experiment involving a simulated massive open online course (MOOC) run on Amazon Mechanical Turk, we show that sequential voting systems can surface better content than systems that elicit independent votes.
Tasks Recommendation Systems
Published 2016-03-14
URL http://arxiv.org/abs/1603.04466v1
PDF http://arxiv.org/pdf/1603.04466v1.pdf
PWC https://paperswithcode.com/paper/sequential-voting-promotes-collective
Repo https://github.com/pkrafft/Sequential-Voting-Promotes-Collective-Discovery-in-Social-Recommendation-Systems
Framework none

LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling

Title LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling
Authors Zhen Li, Yukang Gan, Xiaodan Liang, Yizhou Yu, Hui Cheng, Liang Lin
Abstract Semantic labeling of RGB-D scenes is crucial to many intelligent applications including perceptual robotics. It generates pixelwise and fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses this problem by i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model that captures and fuses contextual information from multiple channels of photometric and depth data, and ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specifically, contexts in photometric and depth channels are, respectively, captured by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer is set up to integrate the contexts along the vertical direction from different channels, and to perform bi-directional propagation of the fused vertical contexts along the horizontal direction to obtain true 2D global contexts. Finally, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model sets a new state of the art, i.e., 48.1% and 49.4% average class accuracy over 37 categories (2.2% and 5.4% improvement) on the large-scale SUNRGBD dataset and the NYUDv2 dataset, respectively.
Tasks Scene Labeling
Published 2016-04-18
URL http://arxiv.org/abs/1604.05000v3
PDF http://arxiv.org/pdf/1604.05000v3.pdf
PWC https://paperswithcode.com/paper/lstm-cf-unifying-context-modeling-and-fusion
Repo https://github.com/icemansina/LSTM-CF
Framework none
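
A hedged PyTorch sketch of the architecture as described in the abstract: per-modality convolutional features, a vertical LSTM per modality, a memorized fusion LSTM over the concatenated vertical contexts, a bi-directional horizontal LSTM for 2D global context, and concatenation with the photometric convolutional features before per-pixel classification. Layer sizes are illustrative, not the paper's.

```python
# Hedged sketch of the LSTM-CF idea; layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMCFSketch(nn.Module):
    def __init__(self, feat=32, hidden=32, n_classes=37):
        super().__init__()
        self.rgb_conv = nn.Sequential(nn.Conv2d(3, feat, 3, padding=1), nn.ReLU())
        self.dep_conv = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
        self.vert_rgb = nn.LSTM(feat, hidden, batch_first=True)
        self.vert_dep = nn.LSTM(feat, hidden, batch_first=True)
        self.fuse_vert = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.horiz = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Conv2d(2 * hidden + feat, n_classes, 1)

    def _run_vertical(self, lstm, fmap):
        # fmap: (B, C, H, W) -> a sequence over H for every column
        b, c, h, w = fmap.shape
        seq = fmap.permute(0, 3, 2, 1).reshape(b * w, h, c)
        out, _ = lstm(seq)
        return out.reshape(b, w, h, -1).permute(0, 3, 2, 1)   # (B, hidden, H, W)

    def forward(self, rgb, depth):
        frgb, fdep = self.rgb_conv(rgb), self.dep_conv(depth)
        vr = self._run_vertical(self.vert_rgb, frgb)           # vertical RGB context
        vd = self._run_vertical(self.vert_dep, fdep)           # vertical depth context
        fused = self._run_vertical(self.fuse_vert, torch.cat([vr, vd], dim=1))
        b, c, h, w = fused.shape                                # horizontal bi-directional pass
        seq = fused.permute(0, 2, 3, 1).reshape(b * h, w, c)
        ctx, _ = self.horiz(seq)
        ctx = ctx.reshape(b, h, w, -1).permute(0, 3, 1, 2)      # (B, 2*hidden, H, W)
        return self.classify(torch.cat([ctx, frgb], dim=1))     # per-pixel class scores

# toy usage
model = LSTMCFSketch()
scores = model(torch.randn(1, 3, 16, 16), torch.randn(1, 1, 16, 16))
```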

Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model

Title Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model
Authors Aleksandr Chuklin, Maarten de Rijke
Abstract Modern search engine result pages often provide immediate value to users and organize information in such a way that it is easy to navigate. The core ranking function contributes to this and so do result snippets, smart organization of result blocks and extensive use of one-box answers or side panels. While they are useful to the user and help search engines to stand out, such features present two big challenges for evaluation. First, the presence of such elements on a search engine result page (SERP) may lead to the absence of clicks, which is, however, not related to dissatisfaction, so-called “good abandonments.” Second, the non-linear layout and visual difference of SERP items may lead to non-trivial patterns of user attention, which is not captured by existing evaluation metrics. In this paper we propose a model of user behavior on a SERP that jointly captures click behavior, user attention and satisfaction, the CAS model, and demonstrate that it gives more accurate predictions of user actions and self-reported satisfaction than existing models based on clicks alone. We use the CAS model to build a novel evaluation metric that can be applied to non-linear SERP layouts and that can account for the utility that users obtain directly on a SERP. We demonstrate that this metric shows better agreement with user-reported satisfaction than conventional evaluation metrics.
Tasks
Published 2016-09-02
URL https://arxiv.org/abs/1609.00552v1
PDF https://arxiv.org/pdf/1609.00552v1.pdf
PWC https://paperswithcode.com/paper/incorporating-clicks-attention-and
Repo https://github.com/varepsilon/cas-eval
Framework none

PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents

Title PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents
Authors Sebastian Sudholt, Gernot A. Fink
Abstract In recent years, deep convolutional neural networks have achieved state-of-the-art performance in various computer vision tasks such as classification, detection, or segmentation. Due to their outstanding performance, CNNs are increasingly used in the field of document image analysis as well. In this work, we present a CNN architecture that is trained with the recently proposed PHOC representation. We show empirically that our CNN architecture is able to outperform state-of-the-art results on various word spotting benchmarks while exhibiting short training and test times.
Tasks Word Spotting In Handwritten Documents
Published 2016-04-01
URL http://arxiv.org/abs/1604.00187v3
PDF http://arxiv.org/pdf/1604.00187v3.pdf
PWC https://paperswithcode.com/paper/phocnet-a-deep-convolutional-neural-network
Repo https://github.com/pinakinathc/phocnet_keras
Framework tf
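
For reference, a sketch of building a PHOC (pyramidal histogram of characters) target vector, the representation the network is trained to predict: at each pyramid level the word is split into regions, and a character fires in a region if its normalized span overlaps that region by at least half. The alphabet and pyramid levels below are illustrative choices, not necessarily the paper's exact configuration.

```python
# Hedged sketch of a PHOC vector; alphabet and levels are illustrative.
import numpy as np

def phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz", levels=(1, 2, 3, 4)):
    word = word.lower()
    n = len(word)
    index = {c: i for i, c in enumerate(alphabet)}
    vec = []
    for level in levels:
        for region in range(level):
            r_lo, r_hi = region / level, (region + 1) / level
            bits = np.zeros(len(alphabet))
            for pos, ch in enumerate(word):
                if ch not in index:
                    continue
                c_lo, c_hi = pos / n, (pos + 1) / n
                overlap = min(r_hi, c_hi) - max(r_lo, c_lo)
                if overlap / (c_hi - c_lo) >= 0.5:     # occupancy criterion
                    bits[index[ch]] = 1.0
            vec.append(bits)
    return np.concatenate(vec)

v = phoc("hello")     # length = 26 * (1 + 2 + 3 + 4) = 260
```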

Coupling Adaptive Batch Sizes with Learning Rates

Title Coupling Adaptive Batch Sizes with Learning Rates
Authors Lukas Balles, Javier Romero, Philipp Hennig
Abstract Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence is thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On popular image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.
Tasks Image Classification, Stochastic Optimization
Published 2016-12-15
URL http://arxiv.org/abs/1612.05086v2
PDF http://arxiv.org/pdf/1612.05086v2.pdf
PWC https://paperswithcode.com/paper/coupling-adaptive-batch-sizes-with-learning
Repo https://github.com/ProbabilisticNumerics/cabs
Framework tf
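
My reading of the coupling rule in the abstract, as a toy numpy loop: after each step, set the next batch size proportional to the learning rate times the estimated gradient variance divided by the current objective value. The estimators, constants, and clipping bounds are simplifications, not the paper's exact rule.

```python
# Hedged sketch of adapting the batch size from gradient variance and loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

w = np.zeros(20)
lr, batch = 0.05, 32
for step in range(200):
    idx = rng.integers(0, len(X), size=batch)
    xb, yb = X[idx], y[idx]
    residual = xb @ w - yb
    per_example_grads = xb * residual[:, None]           # (batch, 20)
    w -= lr * per_example_grads.mean(axis=0)             # SGD step on 0.5 * MSE
    # adapt the batch size for the next step
    loss = 0.5 * np.mean(residual ** 2)
    grad_var = per_example_grads.var(axis=0).sum()       # trace of gradient covariance
    batch = int(np.clip(lr * grad_var / (loss + 1e-12), 8, 2048))
```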

Learning Deep Representations of Fine-grained Visual Descriptions

Title Learning Deep Representations of Fine-grained Visual Descriptions
Authors Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee
Abstract State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories. Despite good performance, attributes have limitations: (1) finer-grained recognition requires commensurately more attributes, and (2) attributes do not provide a natural language interface. We propose to overcome these limitations by training neural language models from scratch; i.e. without pre-training and only consuming words and characters. Our proposed models train end-to-end to align with the fine-grained and category-specific content of images. Natural language provides a flexible and compact way of encoding only the salient visual aspects for distinguishing categories. By training on raw text, our model can do inference on raw text as well, providing humans a familiar mode both for annotation and retrieval. Our model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state-of-the-art for zero-shot classification on the Caltech UCSD Birds 200-2011 dataset.
Tasks Image Retrieval, Zero-Shot Learning
Published 2016-05-17
URL http://arxiv.org/abs/1605.05395v1
PDF http://arxiv.org/pdf/1605.05395v1.pdf
PWC https://paperswithcode.com/paper/learning-deep-representations-of-fine-grained
Repo https://github.com/rafiahmed40/stack-adverserial-network
Framework tf
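
A minimal sketch of the joint embedding formulation: a compatibility score between image and text embeddings, a hinge ranking loss that pushes the matching class description above mismatched ones, and zero-shot classification as the argmax of compatibility with unseen-class descriptions. The random vectors below stand in for the outputs of the paper's image CNN and character-level text encoders.

```python
# Hedged sketch of compatibility-based joint embedding and zero-shot prediction.
import numpy as np

def compatibility(img_emb, txt_emb):
    return img_emb @ txt_emb

def hinge_ranking_loss(img_embs, txt_embs, labels, margin=0.1):
    """Penalize mismatched class texts that score within `margin` of the true one."""
    loss = 0.0
    for img, lab in zip(img_embs, labels):
        pos = compatibility(img, txt_embs[lab])
        for j, txt in enumerate(txt_embs):
            if j != lab:
                loss += max(0.0, margin + compatibility(img, txt) - pos)
    return loss / len(img_embs)

def zero_shot_classify(img_emb, class_txt_embs):
    return int(np.argmax([compatibility(img_emb, t) for t in class_txt_embs]))

# toy usage with stand-in "embeddings" for 3 classes
rng = np.random.default_rng(0)
txt_embs = rng.normal(size=(3, 16))
img_embs = txt_embs[[0, 1, 2, 1]] + 0.1 * rng.normal(size=(4, 16))
print(hinge_ranking_loss(img_embs, txt_embs, labels=[0, 1, 2, 1]))
print(zero_shot_classify(img_embs[3], txt_embs))   # should recover class 1
```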

Learning to Compose Words into Sentences with Reinforcement Learning

Title Learning to Compose Words into Sentences with Reinforcement Learning
Authors Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, Wang Ling
Abstract We use reinforcement learning to learn tree-structured neural networks for computing representations of natural language sentences. In contrast with prior work on tree-structured models in which the trees are either provided as input or predicted using supervision from explicit treebank annotations, the tree structures in this work are optimized to improve performance on a downstream task. Experiments demonstrate the benefit of learning task-specific composition orders, outperforming both sequential encoders and recursive encoders based on treebank annotations. We analyze the induced trees and show that while they discover some linguistically intuitive structures (e.g., noun phrases, simple verb phrases), they are different than conventional English syntactic structures.
Tasks
Published 2016-11-28
URL http://arxiv.org/abs/1611.09100v1
PDF http://arxiv.org/pdf/1611.09100v1.pdf
PWC https://paperswithcode.com/paper/learning-to-compose-words-into-sentences-with
Repo https://github.com/rintukutum/rapid-ct-RL
Framework none
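
A sketch of the shift-reduce composition the paper optimizes: word vectors are shifted onto a stack and adjacent pairs are reduced into parent vectors, so the action sequence determines the tree structure. Here the actions are sampled uniformly over valid choices and the composition function is a stand-in; in the paper, a learned policy over these actions is trained with reinforcement learning against downstream task reward.

```python
# Hedged sketch of shift-reduce sentence composition; policy and composition
# function are stand-ins, not the paper's trained components.
import numpy as np

def compose(left, right, W):
    """Stand-in composition of two child vectors into a parent vector."""
    return np.tanh(W @ np.concatenate([left, right]))

def shift_reduce_encode(word_vecs, W, rng):
    buffer = list(word_vecs)[::-1]          # pop() yields the next word
    stack = []
    while buffer or len(stack) > 1:
        can_shift = bool(buffer)
        can_reduce = len(stack) >= 2
        if can_shift and (not can_reduce or rng.random() < 0.5):
            stack.append(buffer.pop())                      # SHIFT
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(compose(left, right, W))           # REDUCE
    return stack[0]                          # tree-structured sentence vector

d = 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, 2 * d))
words = [rng.normal(size=d) for _ in range(5)]
sentence_vec = shift_reduce_encode(words, W, rng)
```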

Using Filter Banks in Convolutional Neural Networks for Texture Classification

Title Using Filter Banks in Convolutional Neural Networks for Texture Classification
Authors Vincent Andrearczyk, Paul F. Whelan
Abstract Deep learning has established many new state-of-the-art solutions in the last decade in areas such as object, scene and speech recognition. In particular, the Convolutional Neural Network (CNN) is a category of deep learning that obtains excellent results in object detection and recognition tasks. Its architecture is indeed well suited to object analysis, learning and classifying complex (deep) features that represent parts of an object or the object itself. However, some of its features are very similar to texture analysis methods. CNN layers can be thought of as filter banks whose complexity increases with depth. Filter banks are powerful tools for extracting texture features and have been widely used in texture analysis. In this paper we develop a simple network architecture named Texture CNN (T-CNN) which explores this observation. It is built on the idea that the overall shape information extracted by the fully connected layers of a classic CNN is of minor importance in texture analysis. Therefore, we pool an energy measure from the last convolution layer and connect it to a fully connected layer. We show that our approach can improve the performance of a network while greatly reducing memory usage and computation.
Tasks Object Detection, Speech Recognition, Texture Classification
Published 2016-01-12
URL http://arxiv.org/abs/1601.02919v5
PDF http://arxiv.org/pdf/1601.02919v5.pdf
PWC https://paperswithcode.com/paper/using-filter-banks-in-convolutional-neural
Repo https://github.com/v-andrearczyk/caffe-TCNN
Framework none
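
The key design choice translates almost directly into code: drop the shape-oriented fully connected pipeline of a classic CNN and instead pool an energy measure from the last convolution layer (here a global average per feature map) into a fully connected classifier. A hedged PyTorch sketch with illustrative layer sizes:

```python
# Hedged sketch of the Texture-CNN idea; layer sizes are illustrative.
import torch
import torch.nn as nn

class TextureCNNSketch(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        fmap = self.features(x)                 # (B, 64, H, W) filter-bank responses
        energy = fmap.mean(dim=(2, 3))          # one pooled energy value per feature map
        return self.classifier(energy)

logits = TextureCNNSketch()(torch.randn(2, 3, 64, 64))
```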