Paper Group ANR 282
Fitting ReLUs via SGD and Quantized SGD. Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. Switchable Normalization for Learning-to-Normalize Deep Representation. Meta Learning for Task-Driven Video Summarization. Are deep ResNets provably better than linear predictors? Reference Network for Neural …
Fitting ReLUs via SGD and Quantized SGD
Title | Fitting ReLUs via SGD and Quantized SGD |
Authors | Seyed Mohammadreza Mousavi Kalan, Mahdi Soltanolkotabi, A. Salman Avestimehr |
Abstract | In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks consisting of a single Rectified Linear Unit (ReLU). These functions are of the form $\mathbf{x}\rightarrow \max(0,\langle\mathbf{w},\mathbf{x}\rangle)$ with $\mathbf{w}\in\mathbb{R}^d$ denoting the weight vector. We focus on a planted model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent, when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we focus on a parallel implementation where in each iteration the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or all other processors. To reduce the communication cost in this setting we utilize a Quantized Stochastic Gradient Scheme (QSGD) where the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our findings via various numerical experiments, including distributed implementations over Amazon EC2. |
Tasks | |
Published | 2019-01-19 |
URL | http://arxiv.org/abs/1901.06587v2 |
http://arxiv.org/pdf/1901.06587v2.pdf | |
PWC | https://paperswithcode.com/paper/fitting-relus-via-sgd-and-quantized-sgd |
Repo | |
Framework | |
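Since no code is linked above, here is a minimal sketch of the setup as we read the abstract: mini-batch SGD on a planted single-ReLU model, with an optional QSGD-style unbiased quantizer applied to each gradient. Dimensions, step size, initialization, and the quantizer's exact form are our assumptions, not the authors'.

```python
# A minimal sketch (not the authors' code): mini-batch SGD on a planted
# single-ReLU model, with an optional QSGD-style unbiased gradient quantizer.
import numpy as np

rng = np.random.default_rng(0)
d, n, batch, lr = 20, 2000, 64, 0.5

w_star = rng.normal(size=d)                    # planted weight vector
X = rng.normal(size=(n, d))                    # i.i.d. Gaussian inputs
y = np.maximum(0.0, X @ w_star)                # labels from the planted model

def quantize(g, levels=4):
    """Unbiased stochastic quantization in the spirit of QSGD (assumed form)."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    scaled = np.abs(g) / norm * levels
    lower = np.floor(scaled)
    prob = scaled - lower
    q = lower + (rng.random(g.shape) < prob)   # randomized rounding, E[q] = scaled
    return np.sign(g) * q * norm / levels

w = rng.normal(size=d) * 0.1                   # rough stand-in for the paper's init
for step in range(500):
    idx = rng.choice(n, batch, replace=False)
    pred = np.maximum(0.0, X[idx] @ w)
    # (sub)gradient of 0.5 * mean (pred - y)^2 w.r.t. w
    grad = X[idx].T @ ((pred - y[idx]) * (X[idx] @ w > 0)) / batch
    w -= lr * quantize(grad)                   # drop quantize() for plain SGD

print("distance to planted model:", np.linalg.norm(w - w_star))
```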
Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions
Title | Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions |
Authors | Prashanth L. A., Krishna Jagannathan, Ravi Kumar Kolla |
Abstract | Conditional Value-at-Risk (CVaR) is a widely used risk metric in applications such as finance. We derive concentration bounds for CVaR estimates, considering separately the cases of light-tailed and heavy-tailed distributions. In the light-tailed case, we use a classical CVaR estimator based on the empirical distribution constructed from the samples. For heavy-tailed random variables, we assume a mild 'bounded moment' condition, and derive a concentration bound for a truncation-based estimator. Notably, our concentration bounds enjoy an exponential decay in the sample size, for heavy-tailed as well as light-tailed distributions. To demonstrate the applicability of our concentration results, we consider a CVaR optimization problem in a multi-armed bandit setting. Specifically, we address the best CVaR-arm identification problem under a fixed budget. We modify the well-known successive rejects algorithm to incorporate a CVaR-based criterion. Using the CVaR concentration result, we derive an upper bound on the probability of incorrect identification by the proposed algorithm. |
Tasks | Multi-Armed Bandits |
Published | 2019-01-04 |
URL | https://arxiv.org/abs/1901.00997v2 |
https://arxiv.org/pdf/1901.00997v2.pdf | |
PWC | https://paperswithcode.com/paper/risk-aware-multi-armed-bandits-using |
Repo | |
Framework | |
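For illustration, a minimal sketch of the two estimators the abstract contrasts: the classical empirical-distribution CVaR estimator for light tails, and a truncation-based variant for heavy tails. The quantile convention and the truncation threshold `tau` are our assumptions.

```python
# A minimal sketch of empirical CVaR estimation (loss convention: CVaR_alpha
# is the mean of the alpha-upper tail), plus a truncation-based variant.
import numpy as np

def cvar_empirical(samples, alpha=0.95):
    """CVaR_alpha from the empirical distribution of the samples."""
    x = np.sort(samples)
    var = np.quantile(x, alpha)                 # empirical VaR_alpha
    return x[x >= var].mean()                   # mean of the upper tail

def cvar_truncated(samples, alpha=0.95, tau=100.0):
    """Truncate at tau before estimating, as truncation-based estimators do
    for heavy-tailed distributions (tau is chosen by the analyst)."""
    return cvar_empirical(np.clip(samples, None, tau), alpha)

rng = np.random.default_rng(0)
light = rng.normal(size=10_000)                 # light-tailed example
heavy = rng.pareto(2.5, size=10_000)            # heavy-tailed, finite 2nd moment
print(cvar_empirical(light), cvar_truncated(heavy))
```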
Switchable Normalization for Learning-to-Normalize Deep Representation
Title | Switchable Normalization for Learning-to-Normalize Deep Representation |
Authors | Ping Luo, Ruimao Zhang, Jiamin Ren, Zhanglin Peng, Jingyu Li |
Abstract | We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances): a channel, a layer, and a minibatch. SN switches between them by learning their importance weights in an end-to-end manner. It has several good properties. First, it adapts to various network architectures and tasks. Second, it is robust to a wide range of batch sizes, maintaining high performance even when a small minibatch is used (e.g., 2 images/GPU). Third, SN has no sensitive hyper-parameter, unlike group normalization, which searches over the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, Cityscapes, ADE20K, MegaFace, and Kinetics. Analyses of SN are also presented to answer the following three questions: (a) Is it useful to allow each normalization layer to select its own normalizer? (b) What impacts the choice of normalizers? (c) Do different tasks and datasets prefer different normalizers? We hope SN will help ease the use and understanding of normalization techniques in deep learning. The code of SN has been released at https://github.com/switchablenorms. |
Tasks | |
Published | 2019-07-22 |
URL | https://arxiv.org/abs/1907.10473v1 |
https://arxiv.org/pdf/1907.10473v1.pdf | |
PWC | https://paperswithcode.com/paper/switchable-normalization-for-learning-to |
Repo | |
Framework | |
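A minimal sketch of the SN idea as described: compute instance-, layer-, and batch-wise statistics and blend them with softmax-normalized importance weights. This is a simplified reading (it omits, e.g., BN running statistics at inference), not the released implementation.

```python
# A minimal sketch (assumed simplification of SN): blend instance-, layer-,
# and batch-norm statistics with learned softmax importance weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.mean_w = nn.Parameter(torch.zeros(3))   # importance logits: IN, LN, BN
        self.var_w = nn.Parameter(torch.zeros(3))

    def forward(self, x):                            # x: (N, C, H, W)
        mu_in = x.mean((2, 3), keepdim=True)         # instance-norm statistics
        var_in = x.var((2, 3), keepdim=True, unbiased=False)
        mu_ln = x.mean((1, 2, 3), keepdim=True)      # layer-norm statistics
        var_ln = x.var((1, 2, 3), keepdim=True, unbiased=False)
        mu_bn = x.mean((0, 2, 3), keepdim=True)      # batch-norm statistics
        var_bn = x.var((0, 2, 3), keepdim=True, unbiased=False)
        mw = F.softmax(self.mean_w, dim=0)           # learned switching weights
        vw = F.softmax(self.var_w, dim=0)
        mu = mw[0] * mu_in + mw[1] * mu_ln + mw[2] * mu_bn
        var = vw[0] * var_in + vw[1] * var_ln + vw[2] * var_bn
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta

y = SwitchableNorm2d(8)(torch.randn(4, 8, 16, 16))
```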
Meta Learning for Task-Driven Video Summarization
Title | Meta Learning for Task-Driven Video Summarization |
Authors | Xuelong Li, Hongli Li, Yongsheng Dong |
Abstract | Existing video summarization approaches mainly concentrate on the sequential or structural characteristics of video data. However, they do not pay enough attention to the video summarization task itself. In this paper, we propose a meta learning method for task-driven video summarization, denoted MetaL-TDVS, to explicitly explore the video summarization mechanism across the processes of summarizing different videos. In particular, MetaL-TDVS aims to excavate the latent mechanism for summarizing video by reformulating video summarization as a meta learning problem, thereby promoting the generalization ability of the trained model. MetaL-TDVS regards summarizing each video as a single task, so as to make better use of the experience and knowledge learned from summarizing other videos when summarizing new ones. Furthermore, MetaL-TDVS updates models via a two-fold back propagation, which forces the model optimized on one video to obtain high accuracy on another video in every training step. Extensive experiments on benchmark datasets demonstrate the superiority and better generalization ability of MetaL-TDVS over several state-of-the-art methods. |
Tasks | Meta-Learning, Video Summarization |
Published | 2019-07-29 |
URL | https://arxiv.org/abs/1907.12342v1 |
https://arxiv.org/pdf/1907.12342v1.pdf | |
PWC | https://paperswithcode.com/paper/meta-learning-for-task-driven-video |
Repo | |
Framework | |
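The "two-fold back propagation" reads like a MAML-style inner/outer loop; a minimal sketch under that assumption follows, with a toy linear frame scorer standing in for the summarization model.

```python
# A minimal MAML-style sketch of "two-fold back propagation" as we read it:
# adapt on one video, then optimize so the adapted model fits another video.
# The linear scorer and toy "videos" are placeholders, not the paper's model.
import torch

torch.manual_seed(0)
# toy "videos": per-frame features with per-frame importance targets
video_a = (torch.randn(30, 16), torch.rand(30))
video_b = (torch.randn(30, 16), torch.rand(30))

w = torch.randn(16, 1, requires_grad=True)          # placeholder frame scorer

def score_loss(weights, video):
    feats, target = video
    pred = torch.sigmoid(feats @ weights).squeeze(-1)
    return torch.mean((pred - target) ** 2)

meta_opt = torch.optim.SGD([w], lr=0.1)
for step in range(100):
    inner_loss = score_loss(w, video_a)                          # fold 1
    (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
    w_adapted = w - 0.01 * g                                     # inner update
    outer_loss = score_loss(w_adapted, video_b)                  # fold 2
    meta_opt.zero_grad()
    outer_loss.backward()            # gradient flows through the inner update
    meta_opt.step()
```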
Are deep ResNets provably better than linear predictors?
Title | Are deep ResNets provably better than linear predictors? |
Authors | Chulhee Yun, Suvrit Sra, Ali Jadbabaie |
Abstract | Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), rather than deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start with two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representations obtained from the residual block outputs of a 2-block ResNet do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows, under simple geometric conditions, that any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the “near-identity regions” of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity. |
Tasks | |
Published | 2019-07-09 |
URL | https://arxiv.org/abs/1907.03922v2 |
https://arxiv.org/pdf/1907.03922v2.pdf | |
PWC | https://paperswithcode.com/paper/are-deep-resnets-provably-better-than-linear |
Repo | |
Framework | |
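For orientation, a minimal sketch of the architecture under study as we read it: residual blocks x → x + V·relu(Ux) followed by a linear output layer. Setting all block weights to zero makes every block the identity, so the network contains every linear predictor, which is why the best linear predictor is the natural baseline.

```python
# A minimal sketch (our reading, not the paper's code) of an L-block ResNet:
# x -> x + V_l relu(U_l x), then a linear output layer. With U_l = V_l = 0
# the network reduces exactly to the linear predictor x -> w_out @ x.
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 3
U = [rng.normal(size=(d, d)) * 0.01 for _ in range(L)]  # small "near-identity" init
V = [rng.normal(size=(d, d)) * 0.01 for _ in range(L)]
w_out = rng.normal(size=d)

def resnet(x):
    for Ul, Vl in zip(U, V):
        x = x + Vl @ np.maximum(0.0, Ul @ x)  # residual block with skip-connection
        # note: representations need not improve monotonically block by block
    return w_out @ x

print(resnet(rng.normal(size=d)))
```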
Reference Network for Neural Machine Translation
Title | Reference Network for Neural Machine Translation |
Authors | Han Fu, Chenghao Liu, Jianling Sun |
Abstract | Neural Machine Translation (NMT) has achieved notable success in recent years. Such a framework usually generates translations in isolation. In contrast, human translators often refer to reference data, either rephrasing intricate sentence fragments with common terms in the source language, or accessing the golden translation directly. In this paper, we propose a Reference Network to incorporate the referring process into the translation decoding of NMT. To construct a \emph{reference book}, an intuitive way is to store the detailed translation history in extra memory, which is computationally expensive. Instead, we employ Local Coordinates Coding (LCC) to obtain global context vectors containing monolingual and bilingual contextual information for NMT decoding. Experimental results on Chinese-English and English-German tasks demonstrate that our proposed model is effective in improving translation quality at a lightweight computation cost. |
Tasks | Machine Translation |
Published | 2019-08-23 |
URL | https://arxiv.org/abs/1908.09920v1 |
https://arxiv.org/pdf/1908.09920v1.pdf | |
PWC | https://paperswithcode.com/paper/reference-network-for-neural-machine-1 |
Repo | |
Framework | |
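A minimal sketch of the Local Coordinates Coding step as we understand it: encode a context vector by its weights on a few nearest anchor points, so a small codebook can stand in for a full translation history. The anchor set and weighting scheme here are illustrative, not the paper's exact formulation.

```python
# A minimal, assumed sketch of LCC-style encoding: represent a hidden context
# vector as a convex combination of its k nearest anchors from a codebook.
import numpy as np

rng = np.random.default_rng(0)
anchors = rng.normal(size=(64, 32))           # codebook of anchor points

def lcc_encode(h, k=4):
    dists = np.linalg.norm(anchors - h, axis=1)
    idx = np.argsort(dists)[:k]               # k nearest anchors
    w = np.exp(-dists[idx])
    w /= w.sum()                              # local coordinates (sum to 1)
    return idx, w

def lcc_decode(idx, w):
    return w @ anchors[idx]                   # reconstructed context vector

h = rng.normal(size=32)
idx, w = lcc_encode(h)
print(np.linalg.norm(h - lcc_decode(idx, w)))  # reconstruction error
```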
On the Computational Complexity of Multi-Agent Pathfinding on Directed Graphs
Title | On the Computational Complexity of Multi-Agent Pathfinding on Directed Graphs |
Authors | Bernhard Nebel |
Abstract | The determination of the computational complexity of multi-agent pathfinding on directed graphs has been an open problem for many years. For undirected graphs, solvability can be decided in polynomial time, as was already shown in the eighties. Further, it has recently been shown that a special case on directed graphs is solvable in polynomial time. In this paper, we show that the problem is NP-hard in the general case. In addition, some upper bounds are proven. |
Tasks | |
Published | 2019-11-11 |
URL | https://arxiv.org/abs/1911.04871v1 |
https://arxiv.org/pdf/1911.04871v1.pdf | |
PWC | https://paperswithcode.com/paper/on-the-computational-complexity-of-multi |
Repo | |
Framework | |
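For concreteness, a minimal sketch of the decision problem: brute-force BFS over joint agent configurations on a directed graph, here with the common "move into an empty vertex" rule (our assumption about the movement model). This search is exponential in the number of agents, which is consistent with the NP-hardness shown in the paper.

```python
# A minimal sketch of the decision problem: given a directed graph and agents
# with start/goal vertices, is there a move sequence that reaches all goals?
# Brute-force BFS over joint configurations (one agent moves per step into an
# empty vertex); feasible only for toy instances.
from collections import deque

arcs = {1: [2], 2: [3], 3: [1, 4], 4: [3]}    # a toy directed graph
start, goal = (1, 2), (2, 1)                  # two agents swapping positions

def solvable(start, goal):
    seen, frontier = {start}, deque([start])
    while frontier:
        conf = frontier.popleft()
        if conf == goal:
            return True
        for i, v in enumerate(conf):
            for u in arcs.get(v, []):
                if u not in conf:             # target vertex must be empty
                    nxt = conf[:i] + (u,) + conf[i + 1:]
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append(nxt)
    return False

print(solvable(start, goal))                  # True for this toy instance
```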
A Performance Comparison of Loss Functions for Deep Face Recognition
Title | A Performance Comparison of Loss Functions for Deep Face Recognition |
Authors | Yash Srivastava, Vaishnav Murali, Shiv Ram Dubey |
Abstract | Face recognition is one of the most widely publicized features in devices today and hence represents an important problem that should be studied with the utmost priority. As per recent trends, Convolutional Neural Network (CNN) based approaches are highly successful in many tasks of computer vision, including face recognition. The loss function is used on top of the CNN to judge the goodness of the network. In this paper, we present a performance comparison of different loss functions, such as Cross-Entropy, Angular Softmax, Additive-Margin Softmax, ArcFace, and Marginal Loss, for face recognition. The experiments are conducted with two CNN architectures, namely ResNet and MobileNet. Two widely used face datasets, namely CASIA-WebFace and MS-Celeb-1M, are used for training, and the benchmark Labeled Faces in the Wild (LFW) dataset is used for testing. |
Tasks | Face Recognition |
Published | 2019-01-01 |
URL | https://arxiv.org/abs/1901.05903v2 |
https://arxiv.org/pdf/1901.05903v2.pdf | |
PWC | https://paperswithcode.com/paper/a-performance-comparison-of-loss-functions |
Repo | |
Framework | |
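As an example of the losses compared, a minimal sketch of Additive-Margin Softmax in its commonly cited form: cosine logits with a margin m subtracted from the target class, scaled by s. The hyper-parameter values are illustrative, not the paper's settings.

```python
# A minimal sketch of Additive-Margin Softmax (assumed standard form):
# cross-entropy over s * (cos(theta) - m * one_hot(labels)).
import torch
import torch.nn.functional as F

def am_softmax_loss(features, weight, labels, s=30.0, m=0.35):
    feats = F.normalize(features, dim=1)      # unit-norm embeddings
    w = F.normalize(weight, dim=1)            # unit-norm class centers
    cos = feats @ w.t()                       # cosine-similarity logits
    margin = torch.zeros_like(cos)
    margin.scatter_(1, labels.unsqueeze(1), m)  # margin on target class only
    return F.cross_entropy(s * (cos - margin), labels)

feats = torch.randn(8, 128)                   # embeddings from a CNN backbone
centers = torch.randn(10, 128, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = am_softmax_loss(feats, centers, labels)
```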
Classification of Cognitive Load and Expertise for Adaptive Simulation using Deep Multitask Learning
Title | Classification of Cognitive Load and Expertise for Adaptive Simulation using Deep Multitask Learning |
Authors | Pritam Sarkar, Kyle Ross, Aaron J. Ruberto, Dirk Rodenburg, Paul Hungler, Ali Etemad |
Abstract | Simulations are a pedagogical means of enabling a risk-free way for healthcare practitioners to learn, maintain, or enhance their knowledge and skills. Such simulations should provide an optimum amount of cognitive load to the learner and be tailored to their levels of expertise. However, most current simulations are a one-size-fits-all tool used to train different learners regardless of their existing skills, expertise, and ability to handle cognitive load. To address this problem, we propose an end-to-end framework for a trauma simulation that actively classifies a participant’s level of cognitive load and expertise for the development of a dynamically adaptive simulation. To facilitate this solution, trauma simulations were developed for the collection of electrocardiogram (ECG) signals of both novice and expert practitioners. A multitask deep neural network was developed to utilize this data and classify high and low cognitive load, as well as expert and novice participants. A leave-one-subject-out (LOSO) validation was used to evaluate the effectiveness of our model, achieving an accuracy of 89.4% and 96.6% for classification of cognitive load and expertise, respectively. |
Tasks | |
Published | 2019-07-31 |
URL | https://arxiv.org/abs/1908.00385v1 |
https://arxiv.org/pdf/1908.00385v1.pdf | |
PWC | https://paperswithcode.com/paper/classification-of-cognitive-load-and |
Repo | |
Framework | |
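A minimal sketch of the multitask setup as described: a shared encoder over ECG-derived features with two classification heads, one for cognitive load and one for expertise. Layer sizes and the feature representation are placeholders, not the paper's architecture.

```python
# A minimal sketch of a multitask network: shared trunk, two task heads.
import torch
import torch.nn as nn

class MultitaskECG(nn.Module):
    def __init__(self, in_dim=64, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(          # shared ECG feature encoder
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.load_head = nn.Linear(hidden, 2)     # high vs. low cognitive load
        self.expert_head = nn.Linear(hidden, 2)   # expert vs. novice

    def forward(self, x):
        z = self.shared(x)
        return self.load_head(z), self.expert_head(z)

model = MultitaskECG()
load_logits, expert_logits = model(torch.randn(16, 64))
# training would sum the two cross-entropy losses, one per task
```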
Categorization of Program Regions for Agile Compilation using Machine Learning and Hardware Support
Title | Categorization of Program Regions for Agile Compilation using Machine Learning and Hardware Support |
Authors | Sanket Tavarageri |
Abstract | A compiler processes the code written in a high-level language and produces machine-executable code. Compiler writers often face the challenge of keeping compilation times reasonable, because aggressive optimization passes that can potentially yield high performance are often expensive in terms of running time and memory footprint. Consequently, compiler designers arrive at a compromise: they either simplify the optimization algorithm, which may decrease the performance of the produced code, or restrict the optimization to a subset of the overall input program, in which case large parts of the input application go un-optimized. The problem we address in this paper is that of keeping compilation times reasonable while at the same time optimizing the input program to the fullest extent possible, so that the performance of the produced code matches the performance obtained when all the aggressive optimization passes are applied over the entire input program. |
Tasks | |
Published | 2019-05-29 |
URL | https://arxiv.org/abs/1905.12292v1 |
https://arxiv.org/pdf/1905.12292v1.pdf | |
PWC | https://paperswithcode.com/paper/categorization-of-program-regions-for-agile |
Repo | |
Framework | |
Multi-type Resource Allocation with Partial Preferences
Title | Multi-type Resource Allocation with Partial Preferences |
Authors | Haibin Wang, Sujoy Sikdar, Xiaoxi Guo, Lirong Xia, Yongzhi Cao, Hanpin Wang |
Abstract | We propose multi-type probabilistic serial (MPS) and multi-type random priority (MRP) as extensions of the well-known PS and RP mechanisms to the multi-type resource allocation problem (MTRA) with partial preferences. In our setting, there are multiple types of divisible items, and a group of agents who have partial order preferences over bundles consisting of one item of each type. We show that for the unrestricted domain of partial order preferences, no mechanism satisfies both sd-efficiency and sd-envy-freeness. Notwithstanding this impossibility result, our main message is positive: when agents’ preferences are represented by acyclic CP-nets, MPS satisfies sd-efficiency, sd-envy-freeness, ordinal fairness, and upper invariance, while MRP satisfies ex-post-efficiency, sd-strategy-proofness, and upper invariance, recovering the properties of PS and RP. |
Tasks | |
Published | 2019-06-13 |
URL | https://arxiv.org/abs/1906.06836v2 |
https://arxiv.org/pdf/1906.06836v2.pdf | |
PWC | https://paperswithcode.com/paper/multi-type-resource-allocation-with-partial |
Repo | |
Framework | |
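For background, a minimal sketch of the classic single-type probabilistic serial "eating" algorithm that MPS extends: agents simultaneously eat their favourite remaining item at unit speed, and the shares eaten are the assignment probabilities. The multi-type, partial-preference extension is not reproduced here.

```python
# A minimal sketch of the classic (single-type) probabilistic serial
# mechanism; assumes #items == #agents so supply lasts until time t = 1.
def probabilistic_serial(prefs, items):
    supply = {i: 1.0 for i in items}
    shares = {a: {i: 0.0 for i in items} for a in prefs}
    t = 0.0
    while t < 1.0 - 1e-12:
        # each agent eats their most-preferred item with remaining supply
        eating = {a: next(i for i in prefs[a] if supply[i] > 1e-12) for a in prefs}
        eaters = {}
        for a, i in eating.items():
            eaters.setdefault(i, []).append(a)
        # advance time until some item is exhausted (or the process ends)
        dt = min(1.0 - t, min(supply[i] / len(ag) for i, ag in eaters.items()))
        for i, ag in eaters.items():
            for a in ag:
                shares[a][i] += dt            # shares eaten = probabilities
            supply[i] -= dt * len(ag)
        t += dt
    return shares

print(probabilistic_serial({"a1": ["x", "y"], "a2": ["x", "y"]}, ["x", "y"]))
```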
Response to “Visual Dialogue without Vision or Dialogue” (Massiceti et al., 2018)
Title | Response to “Visual Dialogue without Vision or Dialogue” (Massiceti et al., 2018) |
Authors | Abhishek Das, Devi Parikh, Dhruv Batra |
Abstract | In a recent workshop paper, Massiceti et al. presented a baseline model and subsequent critique of Visual Dialog (Das et al., CVPR 2017) that raises what we believe to be unfounded concerns about the dataset and evaluation. This article intends to rebut the critique and clarify potential confusions for practitioners and future participants in the Visual Dialog challenge. |
Tasks | Visual Dialog |
Published | 2019-01-16 |
URL | http://arxiv.org/abs/1901.05531v1 |
http://arxiv.org/pdf/1901.05531v1.pdf | |
PWC | https://paperswithcode.com/paper/response-to-visual-dialogue-without-vision-or |
Repo | |
Framework | |
Word-level Lexical Normalisation using Context-Dependent Embeddings
Title | Word-level Lexical Normalisation using Context-Dependent Embeddings |
Authors | Michael Stewart, Wei Liu, Rachel Cardell-Oliver |
Abstract | Lexical normalisation (LN) is the process of correcting each word in a dataset to its canonical form so that it may be more easily and more accurately analysed. Most lexical normalisation systems operate at the character level, while word-level models are seldom used. Recent language models offer solutions to the drawbacks of word-level LN models, yet, to the best of our knowledge, no research has investigated their effectiveness on LN. In this paper we introduce a word-level GRU-based LN model and investigate the effectiveness of recent embedding techniques on word-level LN. Our results show that our GRU-based word-level model produces better results than character-level models, and outperforms existing deep learning-based LN techniques on Twitter data. We also find that randomly-initialised embeddings are capable of outperforming pre-trained embedding models in certain scenarios. Finally, we release a substantial lexical normalisation dataset to the community. |
Tasks | |
Published | 2019-11-13 |
URL | https://arxiv.org/abs/1911.06172v1 |
https://arxiv.org/pdf/1911.06172v1.pdf | |
PWC | https://paperswithcode.com/paper/word-level-lexical-normalisation-using |
Repo | |
Framework | |
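A minimal sketch of a word-level GRU normaliser consistent with the setup described: embed tokens, run a bidirectional GRU over the sequence, and classify each token into a vocabulary of canonical forms. Sizes and vocabularies are illustrative.

```python
# A minimal sketch (not the authors' model): per-token classification into
# canonical forms via a bidirectional GRU over word embeddings.
import torch
import torch.nn as nn

class WordLevelNormaliser(nn.Module):
    def __init__(self, vocab=5000, canon=5000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, canon)   # per-token canonical form

    def forward(self, tokens):                    # (batch, seq_len) word ids
        h, _ = self.gru(self.embed(tokens))
        return self.out(h)                        # (batch, seq_len, canon)

logits = WordLevelNormaliser()(torch.randint(0, 5000, (4, 12)))
```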
Iterative Algorithm for Discrete Structure Recovery
Title | Iterative Algorithm for Discrete Structure Recovery |
Authors | Chao Gao, Anderson Y. Zhang |
Abstract | We propose a general modeling and algorithmic framework for discrete structure recovery that can be applied to a wide range of problems. Under this framework, we are able to study the recovery of clustering labels, ranks of players, and signs of regression coefficients from a unified perspective. A simple iterative algorithm is proposed for discrete structure recovery, which generalizes methods including Lloyd’s algorithm and the iterative feature matching algorithm. A linear convergence result for the proposed algorithm is established in this paper under appropriate abstract conditions on stochastic errors and initialization. We illustrate our general theory by applying it to three representative problems: clustering in the Gaussian mixture model, approximate ranking, and sign recovery in compressed sensing, and show that the minimax rate is achieved in each case. |
Tasks | |
Published | 2019-11-04 |
URL | https://arxiv.org/abs/1911.01018v1 |
https://arxiv.org/pdf/1911.01018v1.pdf | |
PWC | https://paperswithcode.com/paper/iterative-algorithm-for-discrete-structure |
Repo | |
Framework | |
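To make the framework concrete, here is its best-known instance, Lloyd's algorithm for clustering in a Gaussian mixture: alternate between re-labelling points (the discrete structure) and re-estimating centers, starting from a rough initialization. The data and iteration budget are illustrative.

```python
# A minimal sketch of the iterative-refinement flavour the framework
# generalizes, instantiated as Lloyd's algorithm on a 2-component mixture.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])

centers = X[rng.choice(len(X), 2, replace=False)]   # rough initialization
for _ in range(20):
    labels = np.argmin(                             # refine the discrete labels
        np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
```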
Using Bursty Announcements for Early Detection of BGP Routing Anomalies
Title | Using Bursty Announcements for Early Detection of BGP Routing Anomalies |
Authors | Pablo Moriano, Raquel Hill, L. Jean Camp |
Abstract | Despite the robust structure of the Internet, it is still susceptible to disruptive routing updates that prevent network traffic from reaching its destination. In this work, we propose a method for early detection of large-scale disruptions based on the analysis of bursty BGP announcements. We hypothesize that the occurrence of large-scale disruptions is preceded by bursty announcements. Our method is grounded in an analysis of changes in the inter-arrival times of announcements. BGP announcements that are associated with disruptive updates tend to occur in groups of relatively high frequency, followed by periods of infrequent activity. To test our hypothesis, we quantify the burstiness of inter-arrival times around the dates and times of three large-scale incidents: the Indosat hijacking event in April 2014, the Telecom Malaysia leak in June 2015, and the Bharti Airtel Ltd. hijack in November 2015. We show that we can detect these events several hours prior to when they were originally detected. We propose an algorithm that leverages the burstiness of disruptive updates to provide early detection of large-scale malicious incidents using local collector data. We describe limitations, open challenges, and how this method can be used for large-scale routing anomaly detection. |
Tasks | Anomaly Detection |
Published | 2019-05-14 |
URL | https://arxiv.org/abs/1905.05835v1 |
https://arxiv.org/pdf/1905.05835v1.pdf | |
PWC | https://paperswithcode.com/paper/using-bursty-announcements-for-early |
Repo | |
Framework | |
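A minimal sketch of one way to quantify burstiness from announcement timestamps, using the common coefficient B = (sigma - mu) / (sigma + mu) over inter-arrival times (B near 1 is bursty, near 0 is Poisson-like, near -1 is regular). This is a standard measure; the paper's exact statistic may differ.

```python
# A minimal sketch: burstiness of inter-arrival times from timestamps,
# using the common coefficient B = (sigma - mu) / (sigma + mu).
import numpy as np

def burstiness(timestamps):
    iat = np.diff(np.sort(timestamps))       # inter-arrival times
    mu, sigma = iat.mean(), iat.std()
    return (sigma - mu) / (sigma + mu)

rng = np.random.default_rng(0)
poisson_like = np.cumsum(rng.exponential(1.0, 500))   # B near 0
bursty = np.cumsum(rng.pareto(1.5, 500))              # heavy-tailed gaps, B > 0
print(burstiness(poisson_like), burstiness(bursty))
```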