Paper Group ANR 669
Dynamic Model Selection for Prediction Under a Budget
Title | Dynamic Model Selection for Prediction Under a Budget |
Authors | Feng Nan, Venkatesh Saligrama |
Abstract | We present a dynamic model selection approach for resource-constrained prediction. Given an input instance at test time, a gating function identifies a prediction model for the input among a collection of models. Our objective is to minimize overall average cost without sacrificing accuracy. We learn gating and prediction models on fully labeled training data by means of a bottom-up strategy. Our novel bottom-up method is a recursive scheme whereby a high-accuracy complex model is first trained; low-complexity gating and prediction models are then learned to adaptively approximate the high-accuracy model in regions where low-cost models are capable of making highly accurate predictions. We pose an empirical loss minimization problem with cost constraints to jointly train gating and prediction models. On a number of benchmark datasets our method outperforms the state of the art, achieving higher accuracy at the same cost. |
Tasks | Model Selection |
Published | 2017-04-25 |
URL | http://arxiv.org/abs/1704.07505v1 |
http://arxiv.org/pdf/1704.07505v1.pdf | |
PWC | https://paperswithcode.com/paper/dynamic-model-selection-for-prediction-under |
Repo | |
Framework | |
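The core routing mechanism is easy to state in code. Below is a minimal sketch of the gating idea, with hypothetical names and toy models standing in for the paper's learned gating and prediction functions; it is not the authors' implementation.

```python
import numpy as np

def predict_with_budget(x, gate, cheap_model, complex_model, threshold=0.5):
    """Route an input to the cheap model unless the gate flags it as hard.

    `gate` estimates how likely the cheap model is to err on x; above
    `threshold` we pay for the complex model instead.
    """
    if gate(x) < threshold:
        return cheap_model(x), "cheap"
    return complex_model(x), "complex"

# Toy usage: 1-d inputs, the cheap model is only reliable near the origin.
gate = lambda x: float(abs(x) > 1.0)        # hard if far from the origin
cheap_model = lambda x: np.sign(x)          # fast, crude
complex_model = lambda x: np.tanh(3 * x)    # slow, accurate

for x in [0.2, -0.7, 2.5]:
    print(x, predict_with_budget(x, gate, cheap_model, complex_model))
```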
Video Question Answering via Attribute-Augmented Attention Network Learning
Title | Video Question Answering via Attribute-Augmented Attention Network Learning |
Authors | Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, Yueting Zhuang |
Abstract | Video question answering is a challenging problem in visual information retrieval, which aims to provide the answer according to the referenced video content and the question. However, existing visual question answering approaches mainly tackle static image questions, which may be ineffective for video question answering because they insufficiently model the temporal dynamics of video content. In this paper, we study the problem of video question answering by modeling its temporal dynamics with a frame-level attention mechanism. We propose an attribute-augmented attention network learning framework that enables joint frame-level attribute detection and unified video representation learning for video question answering. We then incorporate a multi-step reasoning process into our proposed attention network to further improve performance. We construct a large-scale video question answering dataset and conduct experiments on both multiple-choice and open-ended video question answering tasks to show the effectiveness of the proposed method. |
Tasks | Information Retrieval, Question Answering, Representation Learning, Video Question Answering, Visual Question Answering |
Published | 2017-07-20 |
URL | http://arxiv.org/abs/1707.06355v1 |
http://arxiv.org/pdf/1707.06355v1.pdf | |
PWC | https://paperswithcode.com/paper/video-question-answering-via-attribute |
Repo | |
Framework | |
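To make the frame-level attention concrete, here is a minimal numpy sketch of attending over frame features conditioned on a question embedding. It omits the attribute augmentation and multi-step reasoning of the actual framework; all names and the bilinear scoring form are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def frame_attention(frames, question, W):
    """Attend over frame features conditioned on a question embedding.

    frames: (T, d) per-frame features, question: (d,), W: (d, d) bilinear
    scoring matrix. Returns the attention-pooled video representation.
    """
    scores = frames @ W @ question          # (T,) relevance of each frame
    alpha = softmax(scores)                 # frame-level attention weights
    return alpha @ frames, alpha            # weighted sum over frames

rng = np.random.default_rng(0)
T, d = 8, 16
video_vec, alpha = frame_attention(rng.normal(size=(T, d)),
                                   rng.normal(size=d),
                                   rng.normal(size=(d, d)) / np.sqrt(d))
print(alpha.round(3), video_vec.shape)
```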
Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks
Title | Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks |
Authors | Zheng Xu, Yen-Chang Hsu, Jiawei Huang |
Abstract | There is increasing interest in accelerating neural networks for real-time applications. We study the student-teacher strategy, in which a small and fast student network is trained with auxiliary information learned from a large and accurate teacher network. We propose to use conditional adversarial networks to learn the loss function for transferring knowledge from teacher to student. The proposed method is particularly effective for relatively small student networks. Moreover, experimental results show the effect of network size when modern networks are used as students. We empirically study the trade-off between inference time and classification accuracy, and provide suggestions on choosing a proper student network. |
Tasks | |
Published | 2017-09-02 |
URL | http://arxiv.org/abs/1709.00513v2 |
http://arxiv.org/pdf/1709.00513v2.pdf | |
PWC | https://paperswithcode.com/paper/training-shallow-and-thin-networks-for |
Repo | |
Framework | |
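For context, the sketch below computes the standard hand-crafted distillation loss (KL divergence to the teacher's softened logits plus hard-label cross-entropy) that student-teacher training typically uses; the paper's contribution is to replace this fixed term with a loss learned by a conditional adversarial discriminator, which is not reproduced here.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Hand-crafted KD loss: KL to the softened teacher plus hard-label CE.

    The paper replaces the fixed KL term with a loss *learned* by a
    conditional adversarial network; this is the baseline it starts from.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * (T ** 2) * kl + (1 - alpha) * ce)

rng = np.random.default_rng(0)
print(distillation_loss(rng.normal(size=(4, 10)),
                        rng.normal(size=(4, 10)),
                        np.array([1, 3, 5, 7])))
```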
Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and $k$-means
Title | Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and $k$-means |
Authors | Dennis Forster, Jörg Lücke |
Abstract | One iteration of standard $k$-means (i.e., Lloyd’s algorithm) or standard EM for Gaussian mixture models (GMMs) scales linearly with the number of clusters $C$, data points $N$, and data dimensionality $D$. In this study, we explore whether one iteration of $k$-means or EM for GMMs can scale sublinearly with $C$ at run time while still effectively improving the clustering objective. The tool we apply for complexity reduction is variational EM, which is typically used to make training of generative models with exponentially many hidden states tractable. Here, we apply novel theoretical results on truncated variational EM to make tractable clustering algorithms more efficient. The basic idea is to use a partial variational E-step which reduces the linear complexity of $\mathcal{O}(NCD)$ required for a full E-step to a sublinear complexity. Our main observation is that the linear dependency on $C$ can be reduced to a dependency on a much smaller parameter $G$, which relates to cluster neighborhood relations. We focus on two versions of partial variational EM for clustering: variational GMM, scaling with $\mathcal{O}(NG^2D)$, and variational $k$-means, scaling with $\mathcal{O}(NGD)$ per iteration. Empirical results show that these algorithms still require numbers of iterations comparable to $k$-means to improve the clustering objective to the same values. For data with many clusters, we consequently observe reductions in net computational demand of between two and three orders of magnitude. More generally, our results provide substantial empirical evidence that clustering can scale sublinearly with $C$. |
Tasks | |
Published | 2017-11-09 |
URL | http://arxiv.org/abs/1711.03431v2 |
http://arxiv.org/pdf/1711.03431v2.pdf | |
PWC | https://paperswithcode.com/paper/can-clustering-scale-sublinearly-with-its |
Repo | |
Framework | |
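A minimal numpy sketch of the $\mathcal{O}(NGD)$ idea, under the assumption that each point only re-scores the $G$ clusters neighboring its previous cluster. The paper's full algorithm maintains truncated variational distributions and refreshes the neighborhood structure; this sketch fixes the neighborhoods once for simplicity.

```python
import numpy as np

def variational_kmeans_step(X, mu, assign, neighbors):
    """One truncated E/M step: each point only scores the G clusters that
    neighbor its previous cluster, giving O(NGD) instead of O(NCD).

    neighbors: (C, G) index array, row c = the G clusters nearest cluster c
    (including c itself), e.g. rebuilt occasionally from C x C distances.
    """
    cand = neighbors[assign]                           # (N, G) candidates
    d = ((X[:, None, :] - mu[cand]) ** 2).sum(-1)      # (N, G) distances
    assign = cand[np.arange(len(X)), d.argmin(1)]      # truncated E-step
    for c in np.unique(assign):                        # standard M-step
        mu[c] = X[assign == c].mean(0)
    return mu, assign

rng = np.random.default_rng(0)
N, C, G, D = 1000, 20, 5, 2
X = rng.normal(size=(N, D)) + rng.integers(0, 10, size=(N, 1))
mu = X[rng.choice(N, C, replace=False)].copy()
assign = rng.integers(0, C, size=N)
cc = ((mu[:, None] - mu[None]) ** 2).sum(-1)
neighbors = cc.argsort(1)[:, :G]                       # G nearest clusters
for _ in range(10):
    mu, assign = variational_kmeans_step(X, mu, assign, neighbors)
```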
Estimating the Operating Characteristics of Ensemble Methods
Title | Estimating the Operating Characteristics of Ensemble Methods |
Authors | Anthony Gamst, Jay-Calvin Reyes, Alden Walker |
Abstract | In this paper we present a technique for using the bootstrap to estimate the operating characteristics and their variability for certain types of ensemble methods. Bootstrapping a model can require a huge amount of work if the training data set is large. Fortunately, in many cases the technique lets us determine the effect of infinite resampling without actually refitting a single model. We apply the technique to the study of meta-parameter selection for random forests. We demonstrate that alternatives to bootstrap aggregation and to considering $\sqrt{d}$ features to split each node, where $d$ is the number of features, can produce improvements in predictive accuracy. |
Tasks | |
Published | 2017-10-24 |
URL | http://arxiv.org/abs/1710.08952v1 |
http://arxiv.org/pdf/1710.08952v1.pdf | |
PWC | https://paperswithcode.com/paper/estimating-the-operating-characteristics-of |
Repo | |
Framework | |
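A quick illustration of the meta-parameter question the paper studies, using scikit-learn's random forest: comparing the default $\sqrt{d}$ feature-subsampling rule against alternatives. Note this uses plain cross-validation as a naive stand-in; the paper's bootstrap technique estimates such operating characteristics without refitting models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Compare the default sqrt(d) feature-subsampling rule against alternatives.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           random_state=0)
for max_features in ["sqrt", 0.25, 0.5, None]:
    rf = RandomForestClassifier(n_estimators=200, max_features=max_features,
                                random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={max_features}: accuracy {score:.3f}")
```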
Dependency Injection for Programming by Optimization
Title | Dependency Injection for Programming by Optimization |
Authors | Zoltan A. Kocsis, Jerry Swan |
Abstract | Programming by Optimization tools perform automatic software configuration according to a specification supplied by the software developer. Developers specify design spaces for program components, and the onerous task of determining which configuration best suits a given use case is handled by automated analysis tools and optimization heuristics. However, in current approaches to Programming by Optimization, design space specification and exploration rely on external configuration algorithms, executable wrappers, and fragile, preprocessed programming language extensions. Here we show that the architectural pattern of Dependency Injection provides a superior alternative to the traditional Programming by Optimization pipeline. We demonstrate that configuration tools based on Dependency Injection fit naturally into the software development process, while requiring less overhead than current wrapper-based mechanisms. Furthermore, the structural correspondence between Dependency Injection and context-free grammars yields a new class of evolutionary metaheuristics for automated algorithm configuration. We found that the new heuristics significantly outperform existing configuration algorithms on many problems of interest (in one case by two orders of magnitude). We anticipate that these developments will make Programming by Optimization immediately applicable to a large number of enterprise software projects. |
Tasks | |
Published | 2017-07-13 |
URL | http://arxiv.org/abs/1707.04016v1 |
http://arxiv.org/pdf/1707.04016v1.pdf | |
PWC | https://paperswithcode.com/paper/dependency-injection-for-programming-by |
Repo | |
Framework | |
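A toy sketch of the pattern: design choices enter a component through its constructor (Dependency Injection), so a configurator can explore the design space without wrappers or language extensions. All names are hypothetical, and the paper uses evolutionary search over grammar-structured spaces rather than the random search shown here.

```python
import random
import time

class Sorter:
    """A component whose design choices are injected, not hard-coded."""
    def __init__(self, pivot_rule, cutoff):
        self.pivot_rule, self.cutoff = pivot_rule, cutoff
    def sort(self, xs):
        if len(xs) <= self.cutoff:          # small inputs: fall back
            return sorted(xs)
        p = self.pivot_rule(xs)
        return (self.sort([x for x in xs if x < p])
                + [x for x in xs if x == p]
                + self.sort([x for x in xs if x > p]))

DESIGN_SPACE = {                            # the PbO-style design space
    "pivot_rule": [lambda xs: xs[0],
                   lambda xs: xs[len(xs) // 2],
                   lambda xs: random.choice(xs)],
    "cutoff": [4, 16, 64],
}

def benchmark(sorter, trials=20, n=2000):
    data = [random.sample(range(n * 10), n) for _ in range(trials)]
    t0 = time.perf_counter()
    for xs in data:
        sorter.sort(xs)
    return time.perf_counter() - t0

configs = [{k: random.choice(v) for k, v in DESIGN_SPACE.items()}
           for _ in range(9)]
best = min(configs, key=lambda cfg: benchmark(Sorter(**cfg)))
print("best cutoff:", best["cutoff"])
```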
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
Title | Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes |
Authors | Lei Wu, Zhanxing Zhu, Weinan E |
Abstract | It is widely observed that deep learning models with learned parameters generalize well, even with many more model parameters than training samples. We systematically investigate the underlying reasons why deep neural networks often generalize well, and reveal the difference between minima (with the same training error) that generalize well and those that don’t. We show that it is the characteristics of the loss landscape that explain the good generalization capability. For the loss landscape of deep networks, the volume of the basin of attraction of good minima dominates over that of poor minima, which ensures that optimization methods with random initialization converge to good minima. We theoretically justify our findings by analyzing 2-layer neural networks, and show that the low-complexity solutions have a small norm of the Hessian matrix with respect to the model parameters. For deeper networks, extensive numerical evidence supports our arguments. |
Tasks | |
Published | 2017-06-30 |
URL | http://arxiv.org/abs/1706.10239v2 |
http://arxiv.org/pdf/1706.10239v2.pdf | |
PWC | https://paperswithcode.com/paper/towards-understanding-generalization-of-deep |
Repo | |
Framework | |
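As a rough illustration of the flat-versus-sharp-minima argument, the sketch below estimates the "sharpness" of a 2-layer network's parameters as the average loss increase under small random weight perturbations. This perturbation proxy is an assumption of the sketch; the paper works with the Hessian norm directly.

```python
import numpy as np

def loss(W1, W2, X, y):
    """Squared loss of a 2-layer tanh network."""
    return np.mean((np.tanh(X @ W1) @ W2 - y) ** 2)

def sharpness(W1, W2, X, y, sigma=0.01, trials=100, rng=None):
    """Average loss increase under Gaussian weight perturbations: a crude
    proxy for the Hessian-based flatness the paper analyzes."""
    rng = rng or np.random.default_rng(0)
    base = loss(W1, W2, X, y)
    bumps = [loss(W1 + sigma * rng.normal(size=W1.shape),
                  W2 + sigma * rng.normal(size=W2.shape), X, y)
             for _ in range(trials)]
    return np.mean(bumps) - base

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
W1, W2 = rng.normal(size=(10, 32)) * 0.1, rng.normal(size=(32, 1)) * 0.1
y = np.tanh(X @ W1) @ W2 + 0.01 * rng.normal(size=(200, 1))
print("sharpness:", sharpness(W1, W2, X, y))
```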
Spectral Algorithms for Computing Fair Support Vector Machines
Title | Spectral Algorithms for Computing Fair Support Vector Machines |
Authors | Matt Olfat, Anil Aswani |
Abstract | Classifiers and rating scores are prone to implicitly codifying biases against protected classes (e.g., age, gender, or race) that may be present in the training data. It is therefore important to understand how to design classifiers and scores that prevent discrimination in predictions. This paper develops computationally tractable algorithms for designing accurate but fair support vector machines (SVMs). Our approach imposes a constraint on the covariance matrices conditioned on each protected class, which leads to a nonconvex quadratic constraint in the SVM formulation. We develop iterative algorithms to compute fair linear and kernel SVMs, which solve a sequence of relaxations constructed using a spectral decomposition of the nonconvex constraint. Their effectiveness in achieving high prediction accuracy while ensuring fairness is shown through numerical experiments on several data sets. |
Tasks | |
Published | 2017-10-16 |
URL | http://arxiv.org/abs/1710.05895v1 |
http://arxiv.org/pdf/1710.05895v1.pdf | |
PWC | https://paperswithcode.com/paper/spectral-algorithms-for-computing-fair |
Repo | |
Framework | |
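For intuition only, here is a heavily simplified numpy sketch: a linear SVM trained by subgradient descent with a first-moment fairness penalty on the gap between group mean scores. The paper instead constrains class-conditional covariances and solves spectral relaxations of the resulting nonconvex constraint; none of that machinery is reproduced here, and all names are hypothetical.

```python
import numpy as np

def fair_linear_svm(X, y, z, lam=0.1, mu=1.0, lr=0.01, epochs=200, seed=0):
    """Subgradient descent on hinge loss + L2 + a first-moment fairness
    penalty |mean score(group 0) - mean score(group 1)|.

    z: binary protected attribute. A simplified stand-in for the paper's
    covariance-constrained formulation.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1]) * 0.01
    for _ in range(epochs):
        margins = y * (X @ w)
        g_hinge = -(y[:, None] * X)[margins < 1].sum(0) / len(X)
        gap = X[z == 0].mean(0) - X[z == 1].mean(0)
        g_fair = mu * np.sign(gap @ w) * gap
        w -= lr * (g_hinge + lam * w + g_fair)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
z = rng.integers(0, 2, 300)
y = np.sign(X[:, 0] + 0.5 * z + 0.1 * rng.normal(size=300))
w = fair_linear_svm(X, y, z)
print("group score gap:", abs((X[z == 0] @ w).mean() - (X[z == 1] @ w).mean()))
```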
On the convergence of mirror descent beyond stochastic convex programming
Title | On the convergence of mirror descent beyond stochastic convex programming |
Authors | Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Stephen Boyd, Peter Glynn |
Abstract | In this paper, we examine the convergence of mirror descent in a class of stochastic optimization problems that are not necessarily convex (or even quasi-convex), and which we call variationally coherent. Since the standard technique of “ergodic averaging” offers no tangible benefits beyond convex programming, we focus directly on the algorithm’s last generated sample (its “last iterate”), and we show that it converges with probability $1$ if the underlying problem is coherent. We further consider a localized version of variational coherence which ensures local convergence of stochastic mirror descent (SMD) with high probability. These results contribute to the landscape of non-convex stochastic optimization by showing that (quasi-)convexity is not essential for convergence to a global minimum: rather, variational coherence, a much weaker requirement, suffices. Finally, building on the above, we reveal an interesting insight regarding the convergence speed of SMD: in problems with sharp minima (such as generic linear programs or concave minimization problems), SMD reaches a minimum point in a finite number of steps (a.s.), even in the presence of persistent gradient noise. This result is to be contrasted with existing black-box convergence rate estimates that are only asymptotic. |
Tasks | Stochastic Optimization |
Published | 2017-06-18 |
URL | http://arxiv.org/abs/1706.05681v2 |
http://arxiv.org/pdf/1706.05681v2.pdf | |
PWC | https://paperswithcode.com/paper/on-the-convergence-of-mirror-descent-beyond |
Repo | |
Framework | |
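The sharp-minimum result is easy to visualize with the classic entropic mirror map on the simplex (exponentiated gradient), a standard instance of SMD. In the toy linear program below, the last iterate concentrates on the optimal vertex despite persistent gradient noise; the problem and step size are illustrative choices, not the paper's experiments.

```python
import numpy as np

def smd_simplex(grad, x0, steps=500, lr=0.1, noise=0.1, seed=0):
    """Stochastic mirror descent with the entropic mirror map on the simplex
    (exponentiated gradient): x <- x * exp(-lr * noisy grad), renormalized."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        g = grad(x) + noise * rng.normal(size=x.shape)  # persistent noise
        x = x * np.exp(-lr * g)
        x /= x.sum()
    return x

# Sharp linear program over the simplex: min c @ x, solved at a vertex.
c = np.array([3.0, 1.0, 2.0])
x = smd_simplex(lambda x: c, np.ones(3) / 3)
print(x.round(4))   # last iterate concentrates on the cheapest vertex
```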
Contradiction-Centricity: A Uniform Model for Formation of Swarm Intelligence and its Simulations
Title | Contradiction-Centricity: A Uniform Model for Formation of Swarm Intelligence and its Simulations |
Authors | Wenpin Jiao |
Abstract | It is a grand challenge to model the emergence of swarm intelligence, and many principles and models have been proposed. However, existing models do not capture the nature of swarm intelligence and are not generic enough to describe various types of emergence phenomena. In this work, we propose a contradiction-centric model for the emergence of swarm intelligence, in which individuals’ contradictions dominate their appearances while they are associated and interacting to update their contradictions. This model hypothesizes that 1) the emergence of swarm intelligence is rooted in the development of individuals’ contradictions and in the interactions among associated individuals, and 2) swarm intelligence is essentially a combinative reflection of the configurations of contradictions inside individuals and the distributions of contradictions among individuals. To verify the feasibility of the model, we simulate four types of swarm intelligence. As the simulations show, our model is truly generic: it can describe the emergence of a variety of swarm intelligence, and it is also simple enough to be applied to demonstrate the emergence of swarm intelligence without complicated computations. |
Tasks | |
Published | 2017-12-12 |
URL | http://arxiv.org/abs/1712.04182v1 |
http://arxiv.org/pdf/1712.04182v1.pdf | |
PWC | https://paperswithcode.com/paper/contradiction-centricity-a-uniform-model-for |
Repo | |
Framework | |
Fidelity-Weighted Learning
Title | Fidelity-Weighted Learning |
Authors | Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, Bernhard Schölkopf |
Abstract | Training deep neural networks requires many training samples, but in practice training labels are expensive to obtain and may be of varying quality, as some may come from trusted expert labelers while others come from heuristics or other sources of weak supervision such as crowd-sourcing. This creates a fundamental quality-versus-quantity trade-off in the learning process. Do we learn from the small amount of high-quality data or the potentially large amount of weakly-labeled data? We argue that if the learner could somehow know and take the label quality into account when learning the data representation, we could get the best of both worlds. To this end, we propose “fidelity-weighted learning” (FWL), a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. FWL modulates the parameter updates to a student network (trained on the task we care about) on a per-sample basis according to the posterior confidence of its label quality, as estimated by a teacher (who has access to the high-quality labels). Both student and teacher are learned from the data. We evaluate FWL on two tasks in information retrieval and natural language processing, where we outperform state-of-the-art alternative semi-supervised methods, indicating that our approach makes better use of strong and weak labels and leads to better task-dependent data representations. |
Tasks | Ad-Hoc Information Retrieval, Information Retrieval |
Published | 2017-11-08 |
URL | http://arxiv.org/abs/1711.02799v2 |
http://arxiv.org/pdf/1711.02799v2.pdf | |
PWC | https://paperswithcode.com/paper/fidelity-weighted-learning |
Repo | |
Framework | |
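The core update is simple to sketch: scale each weak sample's gradient contribution by the teacher's confidence in its label. In this toy numpy version the teacher confidences are an oracle stand-in; in the paper the teacher is itself learned from the small high-quality set.

```python
import numpy as np

def fwl_student_step(w, X, y_weak, confidence, lr=0.1):
    """One fidelity-weighted update: each weak label's gradient contribution
    is scaled by the teacher's confidence in that label (in [0, 1])."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))             # student predictions
    grad = X.T @ (confidence * (p - y_weak)) / len(X)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
w_true = rng.normal(size=8)
y_true = (X @ w_true > 0).astype(float)
flip = rng.random(500) < 0.3                       # 30% noisy weak labels
y_weak = np.where(flip, 1 - y_true, y_true)
confidence = np.where(flip, 0.2, 0.9)              # oracle teacher stand-in
w = np.zeros(8)
for _ in range(300):
    w = fwl_student_step(w, X, y_weak, confidence)
print("agreement with clean labels:", ((X @ w > 0) == (y_true > 0.5)).mean())
```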
Feature Agglomeration Networks for Single Stage Face Detection
Title | Feature Agglomeration Networks for Single Stage Face Detection |
Authors | Jialiang Zhang, Xiongwei Wu, Jianke Zhu, Steven C. H. Hoi |
Abstract | Recent years have witnessed promising results in face detection using deep learning. Despite remarkable progress, face detection in the wild remains an open research challenge, especially when detecting faces at vastly different scales and characteristics. In this paper, we propose a simple yet effective framework of “Feature Agglomeration Networks” (FANet) to build a new single-stage face detector, which not only achieves state-of-the-art performance but also runs efficiently. Inspired by Feature Pyramid Networks (FPN), the key idea of our framework is to exploit the inherent multi-scale features of a single convolutional neural network by aggregating higher-level semantic feature maps of different scales as contextual cues to augment lower-level feature maps, in a hierarchical agglomeration manner, at marginal extra computational cost. We further propose a Hierarchical Loss to effectively train the FANet model. We evaluate the proposed FANet detector on several public face detection benchmarks, including the PASCAL Face, FDDB, and WIDER FACE datasets, achieving state-of-the-art results. Our detector runs in real time for VGA-resolution images on a GPU. |
Tasks | Face Detection |
Published | 2017-12-03 |
URL | http://arxiv.org/abs/1712.00721v2 |
http://arxiv.org/pdf/1712.00721v2.pdf | |
PWC | https://paperswithcode.com/paper/feature-agglomeration-networks-for-single |
Repo | |
Framework | |
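A minimal PyTorch sketch of the FPN-style agglomeration step described above: a higher-level semantic map is upsampled and fused into a lower-level map through lateral 1x1 convolutions. Layer sizes and nearest-neighbor upsampling are assumptions; the actual FANet hierarchy and Hierarchical Loss are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAgglomeration(nn.Module):
    """FPN-style sketch of the agglomeration idea: upsample a higher-level
    (semantically stronger) map and fuse it into the lower-level map."""
    def __init__(self, c_low, c_high, c_out=256):
        super().__init__()
        self.lat_low = nn.Conv2d(c_low, c_out, 1)    # lateral 1x1 convs
        self.lat_high = nn.Conv2d(c_high, c_out, 1)
        self.smooth = nn.Conv2d(c_out, c_out, 3, padding=1)
    def forward(self, f_low, f_high):
        up = F.interpolate(self.lat_high(f_high), size=f_low.shape[-2:],
                           mode="nearest")
        return self.smooth(self.lat_low(f_low) + up)

agg = TopDownAgglomeration(c_low=256, c_high=512)
f_low, f_high = torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
print(agg(f_low, f_high).shape)   # torch.Size([1, 256, 40, 40])
```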
Efficient Asymmetric Co-Tracking using Uncertainty Sampling
Title | Efficient Asymmetric Co-Tracking using Uncertainty Sampling |
Authors | Kourosh Meshgi, Maryam Sadat Mirzaei, Shigeyuki Oba, Shin Ishii |
Abstract | Adaptive tracking-by-detection approaches are popular for tracking arbitrary objects. They treat the tracking problem as a classification task and use online learning techniques to update the object model. However, these approaches depend heavily on the efficiency and effectiveness of their detectors. Evaluating a massive number of samples for each frame (e.g., obtained by a sliding window) forces the detector to trade accuracy for speed, and misclassification of borderline samples by the detector introduces accumulating errors in tracking. In this study, we propose a co-tracking framework based on the efficient cooperation of two detectors: a rapid adaptive exemplar-based detector and a more sophisticated but slower detector with a long-term memory. The sample labeling and co-learning of the detectors are conducted by an uncertainty sampling unit, which improves the speed and accuracy of the system. We also introduce a budgeting mechanism that prevents unbounded growth in the number of examples in the first detector, maintaining its rapid response. Experiments demonstrate the efficiency and effectiveness of the proposed tracker against its baselines and its superior performance against state-of-the-art trackers on various benchmark videos. |
Tasks | |
Published | 2017-03-31 |
URL | http://arxiv.org/abs/1704.00083v1 |
http://arxiv.org/pdf/1704.00083v1.pdf | |
PWC | https://paperswithcode.com/paper/efficient-asymmetric-co-tracking-using |
Repo | |
Framework | |
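Schematically, the uncertainty-sampling loop looks like the following, with stub detectors standing in for the exemplar-based and long-term models; all classes, thresholds, and frame fields here are hypothetical illustrations, not the paper's implementation.

```python
import random

class StubDetector:
    """Stand-in detector: a real tracker would localize the target and learn
    online. `detect` returns (box, confidence); `update` absorbs new labels."""
    def __init__(self, spread):
        self.spread = spread
    def detect(self, frame):
        conf = random.uniform(0.5 - self.spread, 0.5 + self.spread)
        return frame["box"], conf
    def update(self, frame, box):
        pass                                  # online model update goes here

def co_track(frames, fast, slow, low=0.4, high=0.6):
    """Query the slower long-term detector only when the fast detector's
    confidence falls inside the uncertainty band; co-label both detectors."""
    for frame in frames:
        box, conf = fast.detect(frame)
        if low <= conf <= high:               # uncertain: defer and re-label
            box, _ = slow.detect(frame)
            fast.update(frame, box)           # teach the fast detector
        slow.update(frame, box)
        yield box

frames = [{"t": t, "box": (t, t, 20, 40)} for t in range(5)]
for box in co_track(frames, StubDetector(0.3), StubDetector(0.05)):
    print(box)
```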
Deep Reinforcement Learning for Inquiry Dialog Policies with Logical Formula Embeddings
Title | Deep Reinforcement Learning for Inquiry Dialog Policies with Logical Formula Embeddings |
Authors | Takuya Hiraoka, Masaaki Tsuchida, Yotaro Watanabe |
Abstract | This paper is the first attempt to learn the policy of an inquiry dialog system (IDS) using deep reinforcement learning (DRL). Most IDS frameworks represent dialog states and dialog acts with logical formulae. To make learning inquiry dialog policies more effective, we introduce a logical formula embedding framework based on a recursive neural network. Experiments evaluating the effects of 1) DRL and 2) the logical formula embedding framework show that the combination of the two is as effective as, or better than, existing rule-based methods for inquiry dialog policies. |
Tasks | |
Published | 2017-08-02 |
URL | http://arxiv.org/abs/1708.00667v1 |
http://arxiv.org/pdf/1708.00667v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-reinforcement-learning-for-inquiry |
Repo | |
Framework | |
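A minimal sketch of embedding a logical formula with a recursive (tree-structured) composition, assuming formulas are given as nested (operator, left, right) tuples; the actual network's architecture and training are not reproduced, and all names are illustrative.

```python
import numpy as np

def embed(formula, vocab, W):
    """Recursively embed a logical formula given as a nested tuple
    (op, left, right) or an atom string, composing child embeddings with a
    shared weight matrix, as in a recursive (tree) neural network."""
    if isinstance(formula, str):                       # atom: lookup
        return vocab[formula]
    op, left, right = formula
    children = np.concatenate([vocab[op],
                               embed(left, vocab, W),
                               embed(right, vocab, W)])
    return np.tanh(W @ children)                       # compose

rng = np.random.default_rng(0)
d = 8
vocab = {s: rng.normal(size=d) for s in ["p", "q", "r", "and", "or"]}
W = rng.normal(size=(d, 3 * d)) / np.sqrt(3 * d)
vec = embed(("and", "p", ("or", "q", "r")), vocab, W)
print(vec.shape)
```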
Rotations and Interpretability of Word Embeddings: the Case of the Russian Language
Title | Rotations and Interpretability of Word Embeddings: the Case of the Russian Language |
Authors | Alexey Zobnin |
Abstract | Consider a continuous word embedding model. Usually, the cosines between word vectors are used as a measure of word similarity. These cosines do not change under orthogonal transformations of the embedding space. We demonstrate that, using certain canonical orthogonal transformations obtained from SVD, it is possible both to make some components more meaningful and to make the components more stable under re-learning. We study the interpretability of components for publicly available models for the Russian language (RusVectores, fastText, RDT). |
Tasks | Word Embeddings |
Published | 2017-07-14 |
URL | http://arxiv.org/abs/1707.04662v1 |
http://arxiv.org/pdf/1707.04662v1.pdf | |
PWC | https://paperswithcode.com/paper/rotations-and-interpretability-of-word |
Repo | |
Framework | |
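The central observation is checkable in a few lines of numpy: rotating all word vectors onto the right singular vectors of the embedding matrix aligns the axes with canonical directions while leaving all pairwise cosines, and hence word similarities, unchanged. The random matrix below stands in for a trained embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 50))               # rows = word vectors

# Canonical orthogonal transform from SVD: rotate onto right singular vectors.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
E_rot = E @ Vt.T                              # equals U * S; axes "canonical"

# Cosines between word vectors are invariant under this rotation.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(cos(E[0], E[1]), cos(E_rot[0], E_rot[1])))   # True
```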