Paper Group AWR 7
DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification. Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data. Distributed and parallel time series feature extraction for industrial big data applications. A Riemannian Framework for Statistical Analysis of Topological Persistence Diagrams. Programming Patterns in Dataflow Matrix Machines and Generalized Recurrent Neural Nets. Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection. Can Active Memory Replace Attention? Sequential Voting Promotes Collective Discovery in Social Recommendation Systems. LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling. Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model. PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents. Coupling Adaptive Batch Sizes with Learning Rates. Learning Deep Representations of Fine-grained Visual Descriptions. Learning to Compose Words into Sentences with Reinforcement Learning. Using Filter Banks in Convolutional Neural Networks for Texture Classification.
DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification
Title | DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification |
Authors | Rohit Babbar, Bernhard Schölkopf |
Abstract | Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. Datasets in extreme classification exhibit a power-law label distribution, i.e. a large fraction of labels have very few positive instances in the data distribution. Most state-of-the-art approaches for extreme multi-label classification attempt to capture correlation among labels by embedding the label matrix into a low-dimensional linear sub-space. However, in the presence of power-law distributed, extremely large and diverse label spaces, structural assumptions such as low rank can be easily violated. In this work, we present DiSMEC, a large-scale distributed framework for learning one-versus-rest linear classifiers coupled with explicit capacity control of the model size. Unlike most state-of-the-art methods, DiSMEC does not make any low-rank assumptions on the label matrix. Using a double layer of parallelization, DiSMEC can learn classifiers for datasets consisting of hundreds of thousands of labels within a few hours. The explicit capacity control mechanism filters out spurious parameters, which keeps the model compact in size without losing prediction accuracy. We conduct extensive empirical evaluation on publicly available real-world datasets consisting of up to 670,000 labels. We compare DiSMEC with recent state-of-the-art approaches, including SLEEC, a leading approach for learning sparse local embeddings, and FastXML, a tree-based approach optimizing a ranking-based loss function. On some of the datasets, DiSMEC can significantly boost prediction accuracies - 10% better compared to SLEEC and 15% better compared to FastXML, in absolute terms. |
Tasks | Extreme Multi-Label Classification, Multi-Label Classification, Multi-Label Learning |
Published | 2016-09-08 |
URL | http://arxiv.org/abs/1609.02521v1 |
http://arxiv.org/pdf/1609.02521v1.pdf | |
PWC | https://paperswithcode.com/paper/dismec-distributed-sparse-machines-for |
Repo | https://github.com/Refefer/fastxml |
Framework | none |
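The core recipe above (independent one-versus-rest linear classifiers plus explicit pruning of near-zero weights) can be illustrated on a single machine. This is only a hedged sketch with scikit-learn, not the authors' distributed implementation; the toy data, the LinearSVC solver and the pruning threshold `delta` are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy data: n samples, d features, L labels, each sample carrying a couple of labels.
rng = np.random.default_rng(0)
n, d, L = 400, 50, 40
X = rng.standard_normal((n, d))
Y = np.zeros((n, L), dtype=int)
for i in range(n):
    Y[i, i % L] = 1                       # guarantees every label has positive instances
    Y[i, rng.integers(L)] = 1

# One-versus-rest linear classifiers; DiSMEC distributes this loop over label blocks
# and cores, which is what makes it feasible for hundreds of thousands of labels.
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000), n_jobs=-1)
ovr.fit(X, Y)

# Explicit capacity control: prune near-zero weights so the stored model stays compact.
delta = 0.01                              # illustrative pruning threshold (an assumption)
W = np.vstack([est.coef_ for est in ovr.estimators_])
W[np.abs(W) < delta] = 0.0
W = sparse.csr_matrix(W)
print(f"kept {W.nnz} of {W.shape[0] * W.shape[1]} weights")
```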
Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Title | Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data |
Authors | Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, Kunal Talwar |
Abstract | Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information. To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data: Private Aggregation of Teacher Ensembles (PATE). The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as “teachers” for a “student” model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student’s privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student’s training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings. Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs. We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning. |
Tasks | Transfer Learning |
Published | 2016-10-18 |
URL | http://arxiv.org/abs/1610.05755v4 |
http://arxiv.org/pdf/1610.05755v4.pdf | |
PWC | https://paperswithcode.com/paper/semi-supervised-knowledge-transfer-for-deep |
Repo | https://github.com/styluna7/60-days-of-Udacity |
Framework | pytorch |
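A minimal sketch of the noisy teacher-vote aggregation that PATE uses to label student queries, assuming the teacher models are already trained on disjoint data partitions. The Laplace noise scale `gamma` and the toy votes are illustrative; the paper's differential-privacy accounting is not reproduced here.

```python
import numpy as np

def noisy_aggregate(teacher_predictions, num_classes, gamma=0.05, rng=None):
    """Aggregate teacher votes with Laplace noise (the noisy-max step of PATE).

    teacher_predictions: array of shape (num_teachers,), each teacher's predicted
    class for one student query. Smaller gamma means more noise / more privacy.
    """
    rng = rng or np.random.default_rng()
    votes = np.bincount(teacher_predictions, minlength=num_classes).astype(float)
    votes += rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(votes))

# Example: 250 teachers voting on one unlabeled student query (10 classes).
rng = np.random.default_rng(0)
preds = rng.integers(0, 10, size=250)
print(noisy_aggregate(preds, num_classes=10, gamma=0.05, rng=rng))
```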
Distributed and parallel time series feature extraction for industrial big data applications
Title | Distributed and parallel time series feature extraction for industrial big data applications |
Authors | Maximilian Christ, Andreas W. Kempa-Liehr, Michael Feindt |
Abstract | The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm for time series, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows one to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable, and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive as well as time series from a production line optimization project and simulated stochastic processes with underlying qualitative change of dynamics. |
Tasks | Feature Importance, Feature Selection, Time Series, Time Series Classification |
Published | 2016-10-25 |
URL | http://arxiv.org/abs/1610.07717v3 |
http://arxiv.org/pdf/1610.07717v3.pdf | |
PWC | https://paperswithcode.com/paper/distributed-and-parallel-time-series-feature |
Repo | https://github.com/blue-yonder/tsfresh |
Framework | none |
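The repository above is tsfresh, which implements this extract-then-filter pipeline. A small usage sketch, assuming a current tsfresh release with the `extract_relevant_features` helper; with a toy dataset this small, the hypothesis-test filter may keep few or no features.

```python
import pandas as pd
from tsfresh import extract_relevant_features

# Long-format time series: one row per (id, time) measurement.
ts = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":  [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 1.5, 2.5, 3.5, 3.2, 2.1, 0.9],
})
# One binary target per time series id.
y = pd.Series([0, 1, 0, 1], index=[1, 2, 3, 4])

# Extracts a large battery of features per series, then keeps only those whose
# hypothesis test against the target survives the FDR-controlled relevance filter.
X = extract_relevant_features(ts, y, column_id="id", column_sort="time")
print(X.shape)
```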
A Riemannian Framework for Statistical Analysis of Topological Persistence Diagrams
Title | A Riemannian Framework for Statistical Analysis of Topological Persistence Diagrams |
Authors | Rushil Anirudh, Vinay Venkataraman, Karthikeyan Natesan Ramamurthy, Pavan Turaga |
Abstract | Topological data analysis is becoming a popular way to study high dimensional feature spaces without any contextual clues or assumptions. This paper concerns itself with one popular topological feature, which is the number of $d$-dimensional holes in the dataset, also known as the Betti-$d$ number. The persistence of the Betti numbers over various scales is encoded into a persistence diagram (PD), which indicates the birth and death times of these holes as scale varies. A common way to compare PDs is by a point-to-point matching, which is given by the $n$-Wasserstein metric. However, a big drawback of this approach is the need to solve correspondence between points before computing the distance; for $n$ points, the complexity grows according to $\mathcal{O}(n^3)$. Instead, we propose to use an entirely new framework built on Riemannian geometry, that models PDs as 2D probability density functions that are represented in the square-root framework on a Hilbert Sphere. The resulting space is much more intuitive with closed form expressions for common operations. The distance metric is 1) correspondence-free and also 2) independent of the number of points in the dataset. The complexity of computing distance between PDs now grows according to $\mathcal{O}(K^2)$, for a $K \times K$ discretization of $[0,1]^2$. This also enables the use of existing machinery in differential geometry towards statistical analysis of PDs such as computing the mean, geodesics, classification etc. We report competitive results with the Wasserstein metric, at a much lower computational load, indicating the favorable properties of the proposed approach. |
Tasks | Topological Data Analysis |
Published | 2016-05-28 |
URL | http://arxiv.org/abs/1605.08912v1 |
http://arxiv.org/pdf/1605.08912v1.pdf | |
PWC | https://paperswithcode.com/paper/a-riemannian-framework-for-statistical |
Repo | https://github.com/rushilanirudh/pdsphere |
Framework | none |
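A hedged sketch of the square-root-density idea: rasterize each persistence diagram into a probability density on a $K \times K$ grid over $[0,1]^2$, take its square root (a point on the Hilbert sphere), and use the arc-length distance, which costs $\mathcal{O}(K^2)$ regardless of how many points the diagrams contain. The Gaussian rasterization and its bandwidth `sigma` are assumptions, not the paper's exact construction.

```python
import numpy as np

def pd_to_density(points, K=32, sigma=0.02):
    """Rasterize a persistence diagram (birth, death pairs in [0,1]^2) into a
    K x K probability density via a sum of Gaussian bumps, then normalize."""
    grid = (np.arange(K) + 0.5) / K
    gx, gy = np.meshgrid(grid, grid, indexing="ij")
    density = np.zeros((K, K))
    for b, d in points:
        density += np.exp(-((gx - b) ** 2 + (gy - d) ** 2) / (2 * sigma ** 2))
    return density / (density.sum() + 1e-12)

def sphere_distance(p, q):
    """Geodesic distance between two PDs in the square-root representation:
    arccos of the inner product of sqrt-densities (points on the Hilbert sphere).
    Cost is O(K^2), independent of the number of points in either diagram."""
    inner = np.sum(np.sqrt(p) * np.sqrt(q))
    return float(np.arccos(np.clip(inner, -1.0, 1.0)))

pd1 = [(0.1, 0.4), (0.2, 0.8)]
pd2 = [(0.15, 0.45), (0.6, 0.9)]
print(sphere_distance(pd_to_density(pd1), pd_to_density(pd2)))
```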
Programming Patterns in Dataflow Matrix Machines and Generalized Recurrent Neural Nets
Title | Programming Patterns in Dataflow Matrix Machines and Generalized Recurrent Neural Nets |
Authors | Michael Bukatin, Steve Matthews, Andrey Radul |
Abstract | Dataflow matrix machines arise naturally in the context of synchronous dataflow programming with linear streams. They can be viewed as a rather powerful generalization of recurrent neural networks. Similarly to recurrent neural networks, large classes of dataflow matrix machines are described by matrices of numbers, and therefore dataflow matrix machines can be synthesized by computing their matrices. At the same time, the evidence is fairly strong that dataflow matrix machines have sufficient expressive power to be a convenient general-purpose programming platform. Because of the network nature of this platform, programming patterns often correspond to patterns of connectivity in the generalized recurrent neural networks understood as programs. This paper explores a variety of such programming patterns. |
Tasks | |
Published | 2016-06-30 |
URL | http://arxiv.org/abs/1606.09470v2 |
http://arxiv.org/pdf/1606.09470v2.pdf | |
PWC | https://paperswithcode.com/paper/programming-patterns-in-dataflow-matrix |
Repo | https://github.com/anhinga/fluid |
Framework | none |
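A hedged, minimal illustration of the "network described by a matrix" idea, restricted to streams of single numbers: each neuron applies a fixed transform to its input, and a connectivity matrix recombines all neuron outputs into the next inputs. The particular transforms and matrix below are made up; real dataflow matrix machines work with much richer linear streams.

```python
import numpy as np

# Illustrative neuron transforms: a conventional RNN-style neuron, an identity
# neuron (which, wired back to itself with weight 1, acts as an accumulator),
# and a ReLU neuron. These choices are assumptions, not taken from the paper.
transforms = [np.tanh, lambda x: x, lambda x: np.maximum(x, 0.0)]

def two_stroke_cycle(W, inputs):
    # "Down movement": every neuron transforms its current input into an output.
    outputs = np.array([f(x) for f, x in zip(transforms, inputs)])
    # "Up movement": the connectivity matrix recombines outputs into new inputs.
    return W @ outputs

W = np.array([[0.5, 0.1, 0.0],
              [1.0, 1.0, 0.0],    # row 2: accumulator fed by neuron 1 and by itself
              [0.0, 0.3, 0.2]])
x = np.array([0.1, 0.0, 0.2])
for _ in range(3):
    x = two_stroke_cycle(W, x)
print(x)
```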
Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection
Title | Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection |
Authors | Vered Shwartz, Enrico Santus, Dominik Schlechtweg |
Abstract | The fundamental role of hypernymy in NLP has motivated the development of many methods for the automatic identification of this relation, most of which rely on word distribution. We investigate an extensive number of such unsupervised measures, using several distributional semantic models that differ by context type and feature weighting. We analyze the performance of the different methods based on their linguistic motivation. Comparison to the state-of-the-art supervised methods shows that while supervised methods generally outperform the unsupervised ones, the former are sensitive to the distribution of training instances, hurting their reliability. Being based on general linguistic hypotheses and independent from training data, unsupervised measures are more robust, and therefore are still useful artillery for hypernymy detection. |
Tasks | Hypernym Discovery |
Published | 2016-12-14 |
URL | http://arxiv.org/abs/1612.04460v2 |
http://arxiv.org/pdf/1612.04460v2.pdf | |
PWC | https://paperswithcode.com/paper/hypernyms-under-siege-linguistically |
Repo | https://github.com/vered1986/UnsupervisedHypernymy |
Framework | none |
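One example of the kind of unsupervised, linguistically motivated measure compared in the paper is distributional inclusion, e.g. Weeds Precision: how much of a word's weighted context vector is covered by the candidate hypernym's contexts. A small sketch with made-up PPMI-style weights, for illustration only; the paper evaluates many such measures over several distributional spaces.

```python
def weeds_prec(narrow_vec, broad_vec):
    """Weeds Precision: the proportion of the (weighted) contexts of the candidate
    hyponym that also occur with the candidate hypernym.

    Both arguments are dicts mapping context features to association weights
    (e.g. PPMI values)."""
    total = sum(narrow_vec.values())
    if total == 0:
        return 0.0
    shared = sum(w for ctx, w in narrow_vec.items() if ctx in broad_vec)
    return shared / total

# Toy PPMI-weighted context vectors (values are invented for illustration).
cat    = {"purr": 2.1, "fur": 1.4, "pet": 0.9}
animal = {"fur": 1.0, "pet": 1.2, "wild": 0.8, "purr": 0.3}
print(weeds_prec(cat, animal))   # a high value suggests "animal" may be a hypernym of "cat"
```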
Can Active Memory Replace Attention?
Title | Can Active Memory Replace Attention? |
Authors | Łukasz Kaiser, Samy Bengio |
Abstract | Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech recognition, generative models, and learning algorithmic tasks, but it has probably had the largest impact on neural machine translation. Recently, similar improvements have been obtained using alternative mechanisms that do not focus on a single part of a memory but operate on all of it in parallel, in a uniform way. Such a mechanism, which we call active memory, improved over attention in algorithmic tasks, image processing, and in generative modelling. So far, however, active memory has not improved over attention for most natural language processing tasks, in particular for machine translation. We analyze this shortcoming in this paper and propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings most benefits and where attention can be a better choice. |
Tasks | Image Captioning, Machine Translation |
Published | 2016-10-27 |
URL | http://arxiv.org/abs/1610.08613v2 |
http://arxiv.org/pdf/1610.08613v2.pdf | |
PWC | https://paperswithcode.com/paper/can-active-memory-replace-attention |
Repo | https://github.com/tensorflow/models/tree/master/research/neural_gpu |
Framework | tf |
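A hedged sketch of the distinction drawn in the abstract, not of the paper's extended active-memory model: attention summarizes memory through a softmax focused on (mostly) one position, whereas an active-memory step transforms every position in parallel and uniformly (here a simple 1-D convolution stands in for that parallel operator).

```python
import numpy as np

def attention_read(memory, query):
    """Attention: a softmax over positions focuses the read on (mostly) one slot."""
    scores = memory @ query                          # (positions,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory                          # a single summarized vector

def active_memory_step(memory, kernel):
    """Active memory: every position is transformed in parallel, in a uniform way
    (a 1-D convolution over positions, applied to each feature channel)."""
    out = np.empty_like(memory)
    for c in range(memory.shape[1]):
        out[:, c] = np.convolve(memory[:, c], kernel, mode="same")
    return out                                       # the whole memory, updated

memory = np.random.default_rng(0).standard_normal((6, 4))   # 6 positions, 4 channels
print(attention_read(memory, query=np.ones(4)).shape)                        # (4,)
print(active_memory_step(memory, kernel=np.array([0.25, 0.5, 0.25])).shape)  # (6, 4)
```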
Sequential Voting Promotes Collective Discovery in Social Recommendation Systems
Title | Sequential Voting Promotes Collective Discovery in Social Recommendation Systems |
Authors | L. Elisa Celis, Peter M. Krafft, Nathan Kobe |
Abstract | One goal of online social recommendation systems is to harness the wisdom of crowds in order to identify high quality content. Yet the sequential voting mechanisms that are commonly used by these systems are at odds with existing theoretical and empirical literature on optimal aggregation. This literature suggests that sequential voting will promote herding—the tendency for individuals to copy the decisions of others around them—and hence lead to suboptimal content recommendation. Is there a problem with our practice, or a problem with our theory? Previous attempts at answering this question have been limited by a lack of objective measurements of content quality. Quality is typically defined endogenously as the popularity of content in absence of social influence. The flaw of this metric is its presupposition that the preferences of the crowd are aligned with underlying quality. Domains in which content quality can be defined exogenously and measured objectively are thus needed in order to better assess the design choices of social recommendation systems. In this work, we look to the domain of education, where content quality can be measured via how well students are able to learn from the material presented to them. Through a behavioral experiment involving a simulated massive open online course (MOOC) run on Amazon Mechanical Turk, we show that sequential voting systems can surface better content than systems that elicit independent votes. |
Tasks | Recommendation Systems |
Published | 2016-03-14 |
URL | http://arxiv.org/abs/1603.04466v1 |
http://arxiv.org/pdf/1603.04466v1.pdf | |
PWC | https://paperswithcode.com/paper/sequential-voting-promotes-collective |
Repo | https://github.com/pkrafft/Sequential-Voting-Promotes-Collective-Discovery-in-Social-Recommendation-Systems |
Framework | none |
LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling
Title | LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling |
Authors | Zhen Li, Yukang Gan, Xiaodan Liang, Yizhou Yu, Hui Cheng, Liang Lin |
Abstract | Semantic labeling of RGB-D scenes is crucial to many intelligent applications including perceptual robotics. It generates pixelwise and fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses this problem by i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) Model that captures and fuses contextual information from multiple channels of photometric and depth data, and ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specifically, contexts in photometric and depth channels are, respectively, captured by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer is set up to integrate the contexts along the vertical direction from different channels, and perform bi-directional propagation of the fused vertical contexts along the horizontal direction to obtain true 2D global contexts. At last, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art, i.e., 48.1% and 49.4% average class accuracy over 37 categories (2.2% and 5.4% improvement) on the large-scale SUNRGBD dataset and the NYUDv2 dataset, respectively. |
Tasks | Scene Labeling |
Published | 2016-04-18 |
URL | http://arxiv.org/abs/1604.05000v3 |
http://arxiv.org/pdf/1604.05000v3.pdf | |
PWC | https://paperswithcode.com/paper/lstm-cf-unifying-context-modeling-and-fusion |
Repo | https://github.com/icemansina/LSTM-CF |
Framework | none |
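A hedged PyTorch sketch of the context-fusion idea described above: run an LSTM vertically (down each column) over the RGB and depth feature maps separately, concatenate the two vertical contexts, and propagate them bi-directionally along the horizontal direction. It omits the convolutional front-ends and other details of the full LSTM-CF model; channel and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class VerticalContextLSTM(nn.Module):
    """Run an LSTM down every column of a feature map to capture vertical context."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)    # one vertical sequence per column
        out, _ = self.lstm(cols)                             # (B*W, H, hidden)
        return out.reshape(b, w, h, -1).permute(0, 3, 2, 1)  # (B, hidden, H, W)

class SimpleLSTMCF(nn.Module):
    """Fuse vertical RGB and depth contexts, then propagate bi-directionally along rows."""
    def __init__(self, rgb_ch, depth_ch, hidden):
        super().__init__()
        self.rgb_ctx = VerticalContextLSTM(rgb_ch, hidden)
        self.depth_ctx = VerticalContextLSTM(depth_ch, hidden)
        self.fuse = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, rgb_feat, depth_feat):                 # both: (B, C, H, W)
        ctx = torch.cat([self.rgb_ctx(rgb_feat), self.depth_ctx(depth_feat)], dim=1)
        b, c, h, w = ctx.shape
        rows = ctx.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one horizontal sequence per row
        fused, _ = self.fuse(rows)                           # bi-directional horizontal pass
        return fused.reshape(b, h, w, -1).permute(0, 3, 1, 2)

model = SimpleLSTMCF(rgb_ch=64, depth_ch=32, hidden=48)
out = model(torch.randn(2, 64, 16, 20), torch.randn(2, 32, 16, 20))
print(out.shape)                                             # torch.Size([2, 96, 16, 20])
```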
Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model
Title | Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model |
Authors | Aleksandr Chuklin, Maarten de Rijke |
Abstract | Modern search engine result pages often provide immediate value to users and organize information in such a way that it is easy to navigate. The core ranking function contributes to this and so do result snippets, smart organization of result blocks and extensive use of one-box answers or side panels. While they are useful to the user and help search engines to stand out, such features present two big challenges for evaluation. First, the presence of such elements on a search engine result page (SERP) may lead to the absence of clicks, which is, however, not related to dissatisfaction, so-called “good abandonments.” Second, the non-linear layout and visual difference of SERP items may lead to non-trivial patterns of user attention, which is not captured by existing evaluation metrics. In this paper we propose a model of user behavior on a SERP that jointly captures click behavior, user attention and satisfaction, the CAS model, and demonstrate that it gives more accurate predictions of user actions and self-reported satisfaction than existing models based on clicks alone. We use the CAS model to build a novel evaluation metric that can be applied to non-linear SERP layouts and that can account for the utility that users obtain directly on a SERP. We demonstrate that this metric shows better agreement with user-reported satisfaction than conventional evaluation metrics. |
Tasks | |
Published | 2016-09-02 |
URL | https://arxiv.org/abs/1609.00552v1 |
https://arxiv.org/pdf/1609.00552v1.pdf | |
PWC | https://paperswithcode.com/paper/incorporating-clicks-attention-and |
Repo | https://github.com/varepsilon/cas-eval |
Framework | none |
PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents
Title | PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents |
Authors | Sebastian Sudholt, Gernot A. Fink |
Abstract | In recent years, deep convolutional neural networks have achieved state of the art performance in various computer vision tasks such as classification, detection or segmentation. Due to their outstanding performance, CNNs are more and more used in the field of document image analysis as well. In this work, we present a CNN architecture that is trained with the recently proposed PHOC representation. We show empirically that our CNN architecture is able to outperform state of the art results for various word spotting benchmarks while exhibiting short training and test times. |
Tasks | Word Spotting In Handwritten Documents |
Published | 2016-04-01 |
URL | http://arxiv.org/abs/1604.00187v3 |
http://arxiv.org/pdf/1604.00187v3.pdf | |
PWC | https://paperswithcode.com/paper/phocnet-a-deep-convolutional-neural-network |
Repo | https://github.com/pinakinathc/phocnet_keras |
Framework | tf |
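PHOCNet regresses the PHOC (Pyramidal Histogram Of Characters) attribute vector of a word image. Below is a sketch of how a PHOC target vector is commonly built, simplified to lowercase letters and unigram levels only; the exact level set and alphabet used by the paper may differ.

```python
import numpy as np

def build_phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz", levels=(2, 3, 4, 5)):
    """Pyramidal Histogram Of Characters: at each pyramid level the word is split
    into equal regions, and a character is marked present in a region if at least
    half of the character's normalized extent falls inside that region.
    (Simplified: lowercase letters only, no bigram levels.)"""
    char_index = {c: i for i, c in enumerate(alphabet)}
    phoc = np.zeros(len(alphabet) * sum(levels))
    n = len(word)
    offset = 0
    for level in levels:
        for pos, ch in enumerate(word):
            if ch not in char_index:
                continue
            c_start, c_end = pos / n, (pos + 1) / n            # character occupancy
            for region in range(level):
                r_start, r_end = region / level, (region + 1) / level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                if overlap / (c_end - c_start) >= 0.5:
                    phoc[offset + region * len(alphabet) + char_index[ch]] = 1.0
        offset += level * len(alphabet)
    return phoc

print(build_phoc("hello").shape)   # (364,) = 26 * (2 + 3 + 4 + 5)
```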
Coupling Adaptive Batch Sizes with Learning Rates
Title | Coupling Adaptive Batch Sizes with Learning Rates |
Authors | Lukas Balles, Javier Romero, Philipp Hennig |
Abstract | Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence are thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On popular image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available. |
Tasks | Image Classification, Stochastic Optimization |
Published | 2016-12-15 |
URL | http://arxiv.org/abs/1612.05086v2 |
http://arxiv.org/pdf/1612.05086v2.pdf | |
PWC | https://paperswithcode.com/paper/coupling-adaptive-batch-sizes-with-learning |
Repo | https://github.com/ProbabilisticNumerics/cabs |
Framework | tf |
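A hedged reading of the coupling rule sketched in the abstract: pick the batch size so that the gradient-variance contribution scales with the learning rate and inversely with the current objective value. The exact estimator, constants and update schedule of the paper's method are not reproduced; the clipping bounds below are assumptions.

```python
import numpy as np

def cabs_batch_size(per_example_grads, loss_value, lr, m_min=16, m_max=4096):
    """Sketch of a CABS-style batch-size rule coupling batch size to the learning
    rate and to the ratio of gradient variance to the current loss value.

    per_example_grads: array of shape (m, d), one gradient per example in the batch.
    """
    # Sum over parameters of the per-example gradient variance (estimated on this batch).
    variance_sum = per_example_grads.var(axis=0, ddof=1).sum()
    m_new = lr * variance_sum / max(loss_value, 1e-12)
    return int(np.clip(round(m_new), m_min, m_max))

# Toy usage with made-up numbers.
rng = np.random.default_rng(0)
grads = rng.standard_normal((128, 1000)) * 0.5       # 128 examples, 1000 parameters
print(cabs_batch_size(grads, loss_value=0.7, lr=0.1))
```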
Learning Deep Representations of Fine-grained Visual Descriptions
Title | Learning Deep Representations of Fine-grained Visual Descriptions |
Authors | Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee |
Abstract | State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories. Despite good performance, attributes have limitations: (1) finer-grained recognition requires commensurately more attributes, and (2) attributes do not provide a natural language interface. We propose to overcome these limitations by training neural language models from scratch; i.e. without pre-training and only consuming words and characters. Our proposed models train end-to-end to align with the fine-grained and category-specific content of images. Natural language provides a flexible and compact way of encoding only the salient visual aspects for distinguishing categories. By training on raw text, our model can do inference on raw text as well, providing humans a familiar mode both for annotation and retrieval. Our model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state-of-the-art for zero-shot classification on the Caltech UCSD Birds 200-2011 dataset. |
Tasks | Image Retrieval, Zero-Shot Learning |
Published | 2016-05-17 |
URL | http://arxiv.org/abs/1605.05395v1 |
http://arxiv.org/pdf/1605.05395v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-deep-representations-of-fine-grained |
Repo | https://github.com/rafiahmed40/stack-adverserial-network |
Framework | tf |
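A hedged sketch of a symmetric joint-embedding ranking loss of the kind the abstract describes: matched image/text pairs should out-score mismatched pairs in both directions. It operates directly on precomputed embedding matrices and is not the paper's exact objective or its text encoders.

```python
import numpy as np

def symmetric_joint_embedding_loss(img_emb, txt_emb, margin=1.0):
    """Margin ranking loss over a joint image/text embedding. Shapes: (n, d) each,
    with row i of img_emb matched to row i of txt_emb."""
    scores = img_emb @ txt_emb.T                      # compatibility F(v, t)
    n = scores.shape[0]
    correct = np.diag(scores)
    # Rank texts against each image, and images against each text.
    loss_img = np.maximum(0.0, margin + scores - correct[:, None])   # fix image, vary text
    loss_txt = np.maximum(0.0, margin + scores - correct[None, :])   # fix text, vary image
    np.fill_diagonal(loss_img, 0.0)
    np.fill_diagonal(loss_txt, 0.0)
    return (loss_img.sum() + loss_txt.sum()) / n

rng = np.random.default_rng(0)
print(symmetric_joint_embedding_loss(rng.standard_normal((8, 16)),
                                     rng.standard_normal((8, 16))))
```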
Learning to Compose Words into Sentences with Reinforcement Learning
Title | Learning to Compose Words into Sentences with Reinforcement Learning |
Authors | Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, Wang Ling |
Abstract | We use reinforcement learning to learn tree-structured neural networks for computing representations of natural language sentences. In contrast with prior work on tree-structured models in which the trees are either provided as input or predicted using supervision from explicit treebank annotations, the tree structures in this work are optimized to improve performance on a downstream task. Experiments demonstrate the benefit of learning task-specific composition orders, outperforming both sequential encoders and recursive encoders based on treebank annotations. We analyze the induced trees and show that while they discover some linguistically intuitive structures (e.g., noun phrases, simple verb phrases), they are different from conventional English syntactic structures. |
Tasks | |
Published | 2016-11-28 |
URL | http://arxiv.org/abs/1611.09100v1 |
http://arxiv.org/pdf/1611.09100v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-to-compose-words-into-sentences-with |
Repo | https://github.com/rintukutum/rapid-ct-RL |
Framework | none |
Using Filter Banks in Convolutional Neural Networks for Texture Classification
Title | Using Filter Banks in Convolutional Neural Networks for Texture Classification |
Authors | Vincent Andrearczyk, Paul F. Whelan |
Abstract | Deep learning has established many new state of the art solutions in the last decade in areas such as object, scene and speech recognition. In particular, the Convolutional Neural Network (CNN) is a category of deep learning which obtains excellent results in object detection and recognition tasks. Its architecture is indeed well suited to object analysis by learning and classifying complex (deep) features that represent parts of an object or the object itself. However, some of its features are very similar to texture analysis methods. CNN layers can be thought of as filter banks of complexity increasing with the depth. Filter banks are powerful tools to extract texture features and have been widely used in texture analysis. In this paper we develop a simple network architecture named Texture CNN (T-CNN) which explores this observation. It is built on the idea that the overall shape information extracted by the fully connected layers of a classic CNN is of minor importance in texture analysis. Therefore, we pool an energy measure from the last convolution layer which we connect to a fully connected layer. We show that our approach can improve the performance of a network while greatly reducing the memory usage and computation. |
Tasks | Object Detection, Speech Recognition, Texture Classification |
Published | 2016-01-12 |
URL | http://arxiv.org/abs/1601.02919v5 |
http://arxiv.org/pdf/1601.02919v5.pdf | |
PWC | https://paperswithcode.com/paper/using-filter-banks-in-convolutional-neural |
Repo | https://github.com/v-andrearczyk/caffe-TCNN |
Framework | none |
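A minimal PyTorch sketch of the pooling idea described above: instead of feeding flattened spatial activations to the classifier, pool an orderless energy measure (here, the global average of each feature map of the last convolution layer) and classify from that. The layer sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TextureCNN(nn.Module):
    """Texture-oriented CNN: classify from pooled per-channel energies of the last
    convolution layer, discarding overall shape information."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)      # average "energy" of each feature map
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)   # (B, 128), independent of image size
        return self.classifier(x)

print(TextureCNN()(torch.randn(2, 3, 64, 64)).shape)   # torch.Size([2, 10])
```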