January 26, 2020

2877 words 14 mins read

Paper Group ANR 1531

Scalable Source Code Similarity Detection in Large Code Repositories. It Runs in the Family: Searching for Similar Names using Digitized Family Trees. Where is the Information in a Deep Neural Network?. Automatic Discovery of Privacy-Utility Pareto Fronts. Surface Type Classification for Autonomous Robot Indoor Navigation. A Group-Theoretic Framewo …

Scalable Source Code Similarity Detection in Large Code Repositories


Title	Scalable Source Code Similarity Detection in Large Code Repositories
Authors	F Alomari, M Harbi
Abstract	Source code similarity are increasingly used in application development to identify clones, isolate bugs, and find copy-rights violations. Similar code fragments can be very problematic due to the fact that errors in the original code must be fixed in every copy. Other maintenance changes, such as extensions or patches, must be applied multiple times. Furthermore, the diversity of coding styles and flexibility of modern languages makes it difficult and cost ineffective to manually inspect large code repositories. Therefore, detection is only feasible by automatic techniques. We present an efficient and scalable approach for similar code fragment identification based on source code control flow graphs fingerprinting. The source code is processed to generate control flow graphs that are then hashed to create a unique fingerprint of the code capturing semantics as well as syntax similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation supports the validity of our approach and show its effectiveness and efficiency in comparison with other solutions.
Tasks
Published	2019-07-26
URL	https://arxiv.org/abs/1907.11817v1
PDF	https://arxiv.org/pdf/1907.11817v1.pdf
PWC	https://paperswithcode.com/paper/scalable-source-code-similarity-detection-in
Repo
Framework

It Runs in the Family: Searching for Similar Names using Digitized Family Trees


Title	It Runs in the Family: Searching for Similar Names using Digitized Family Trees
Authors	Aviad Elyashar, Rami Puzis, Michael Fire
Abstract	Searching for a person’s name is a common online activity. However, web search engines suffer from low numbers of accurate results to a query containing names. Underlying these poor results are the multiple legitimate spelling variations for a given name, as opposed to regular text with a single way to be spelled correctly. Today, most of the techniques used to suggest related names in an online search are based on pattern matching and phonetic encoding. However, they frequently lead to poor performance. Here, we propose a revolutionary approach to tackle the problem of similar name suggestions. Our novel algorithm utilizes historical data collected from genealogy websites, along with graph algorithms. In contrast to previous approaches, we propose a general method that suggests similar names based on the construction and analysis of digitized ancestral family trees. Similar names are extracted from the graph using generic ordering functions that outperform other algorithms that suggest names based on a single dimension, which limits their performance. Utilizing a large-scale online genealogy dataset with over 17 million profiles and more than 200,000 unique first names, we constructed a vast name-based graph. Using this graph along with 7,399 labeled given names with their true synonyms, we evaluated our proposed approach. The results showed that our approach gave superior performance in terms of accuracy, F1, and precision compared to other algorithms, including phonetic and string similarity algorithms. We propose our algorithm as a useful and valuable tool for suggesting similar names based on constructing a name-based graph using family trees.
Tasks
Published	2019-12-09
URL	https://arxiv.org/abs/1912.04003v2
PDF	https://arxiv.org/pdf/1912.04003v2.pdf
PWC	https://paperswithcode.com/paper/it-runs-in-the-family-searching-for-similar
Repo
Framework

Where is the Information in a Deep Neural Network?


Title	Where is the Information in a Deep Neural Network?
Authors	Alessandro Achille, Stefano Soatto
Abstract	Whatever information a Deep Neural Network has gleaned from past data is encoded in its weights. How this information affects the response of the network to future data is largely an open question. In fact, even how to define and measure information in a network is still not settled. We introduce the notion of Information in the Weights as the optimal trade-off between accuracy of the network and complexity of the weights, relative to a prior. Depending on the prior, the definition reduces to known information measures such as Shannon Mutual Information and Fisher Information, but affords added flexibility that enables us to relate it to generalization, via the PAC-Bayes bound, and to invariance. This relation hinges not only on the architecture of the model, but surprisingly on how it is trained. We then introduce a notion of effective information in the activations, which are deterministic functions of future inputs, resolving inconsistencies in prior work. We relate this to the Information in the Weights, and use this result to show that models of low complexity not only generalize better, but are bound to learn invariant representations of future inputs.
Tasks
Published	2019-05-29
URL	https://arxiv.org/abs/1905.12213v2
PDF	https://arxiv.org/pdf/1905.12213v2.pdf
PWC	https://paperswithcode.com/paper/where-is-the-information-in-a-deep-neural
Repo
Framework

Automatic Discovery of Privacy-Utility Pareto Fronts


Title	Automatic Discovery of Privacy-Utility Pareto Fronts
Authors	Brendan Avent, Javier Gonzalez, Tom Diethe, Andrei Paleyes, Borja Balle
Abstract	Differential privacy is a mathematical framework for privacy-preserving data analysis. Changing the hyperparameters of a differentially private algorithm allows one to trade off privacy and utility in a principled way. Quantifying this trade-off in advance is essential to decision-makers tasked with deciding how much privacy can be provided in a particular application while maintaining acceptable utility. Analytical utility guarantees offer a rigorous tool to reason about this trade-off, but are generally only available for relatively simple problems. For more complex tasks, such as training neural networks under differential privacy, the utility achieved by a given algorithm can only be measured empirically. This paper presents a Bayesian optimization methodology for efficiently characterizing the privacy–utility trade-off of any differentially private algorithm using only empirical measurements of its utility. The versatility of our method is illustrated on a number of machine learning tasks involving multiple models, optimizers, and datasets.
Tasks
Published	2019-05-26
URL	https://arxiv.org/abs/1905.10862v3
PDF	https://arxiv.org/pdf/1905.10862v3.pdf
PWC	https://paperswithcode.com/paper/automatic-discovery-of-privacy-utility-pareto
Repo
Framework


Title	Surface Type Classification for Autonomous Robot Indoor Navigation
Authors	Francesco Lomio, Erjon Skenderi, Damoon Mohamadi, Jussi Collin, Reza Ghabcheloo, Heikki Huttunen
Abstract	In this work we describe the preparation of a time series dataset of inertial measurements for determining the surface type under a wheeled robot. The data consists of over 7600 labeled time series samples, with the corresponding surface type annotation. This data was used in two public competitions with over 1500 participant in total. Additionally, we describe the performance of state-of-art deep learning models for time series classification, as well as propose a baseline model based on an ensemble of machine learning methods. The baseline achieves an accuracy of over 68% with our nine-category dataset.
Tasks	Time Series, Time Series Classification
Published	2019-05-01
URL	http://arxiv.org/abs/1905.00252v1
PDF	http://arxiv.org/pdf/1905.00252v1.pdf
PWC	https://paperswithcode.com/paper/surface-type-classification-for-autonomous
Repo
Framework

A Group-Theoretic Framework for Data Augmentation


Title	A Group-Theoretic Framework for Data Augmentation
Authors	Shuxiao Chen, Edgar Dobriban, Jane H Lee
Abstract	Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant. We prove that it leads to variance reduction. We study empirical risk minimization, and the examples of exponential families, linear regression, and certain two-layer neural networks. We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).
Tasks	Data Augmentation, Image Classification
Published	2019-07-25
URL	https://arxiv.org/abs/1907.10905v3
PDF	https://arxiv.org/pdf/1907.10905v3.pdf
PWC	https://paperswithcode.com/paper/invariance-reduces-variance-understanding
Repo
Framework

Ensembling Strategies for Answering Natural Questions


Title	Ensembling Strategies for Answering Natural Questions
Authors	Anthony Ferritto, Lin Pan, Rishav Chakravarti, Salim Roukos, Radu Florian, J. William Murdock, Avirup Sil
Abstract	Many of the top question answering systems today utilize ensembling to improve their performance on tasks such as the Stanford Question Answering Dataset (SQuAD) and Natural Questions (NQ) challenges. Unfortunately most of these systems do not publish their ensembling strategies used in their leaderboard submissions. In this work, we investigate a number of ensembling techniques and demonstrate a strategy which improves our F1 score for short answers on the dev set for NQ by 2.3 F1 points over our single model (which outperforms the previous SOTA by 1.9 F1 points).
Tasks	Question Answering
Published	2019-10-30
URL	https://arxiv.org/abs/1911.00337v3
PDF	https://arxiv.org/pdf/1911.00337v3.pdf
PWC	https://paperswithcode.com/paper/ensembling-strategies-for-answering-natural
Repo
Framework

Limits of Private Learning with Access to Public Data


Title	Limits of Private Learning with Access to Public Data
Authors	Noga Alon, Raef Bassily, Shay Moran
Abstract	We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. This setting interpolates between private learning (where all examples are private) and classical learning (where all examples are public). We study the limits of learning in this setting in terms of private and public sample complexities. We show that any hypothesis class of VC-dimension $d$ can be agnostically learned up to an excess error of $\alpha$ using only (roughly) $d/\alpha$ public examples and $d/\alpha^2$ private labeled examples. This result holds even when the public examples are unlabeled. This gives a quadratic improvement over the standard $d/\alpha^2$ upper bound on the public sample complexity (where private examples can be ignored altogether if the public examples are labeled). Furthermore, we give a nearly matching lower bound, which we prove via a generic reduction from this setting to the one of private learning without public data.
Tasks
Published	2019-10-25
URL	https://arxiv.org/abs/1910.11519v1
PDF	https://arxiv.org/pdf/1910.11519v1.pdf
PWC	https://paperswithcode.com/paper/limits-of-private-learning-with-access-to
Repo
Framework

Model Selection With Graphical Neighbour Information


Title	Model Selection With Graphical Neighbour Information
Authors	Robert O’Shea
Abstract	Accurate model selection is a fundamental requirement for statistical analysis. In many real-world applications of graphical modelling, correct model structure identification is the ultimate objective. Standard model validation procedures such as information theoretic scores and cross validation have demonstrated poor performance in the high dimensional setting. Specialised methods such as EBIC, StARS and RIC have been developed for the explicit purpose of high-dimensional Gaussian graphical model selection. We present a novel model score criterion, Graphical Neighbour Information. This method demonstrates oracle performance in high-dimensional model selection, outperforming the current state-of-the-art in our simulations. The Graphical Neighbour Information criterion has the additional advantage of efficient, closed-form computability, sparing the costly inference of multiple models on data subsamples. We provide a theoretical analysis of the method and benchmark simulations versus the current state of the art.
Tasks	Model Selection
Published	2019-08-27
URL	https://arxiv.org/abs/1908.10243v1
PDF	https://arxiv.org/pdf/1908.10243v1.pdf
PWC	https://paperswithcode.com/paper/model-selection-with-graphical-neighbour
Repo
Framework

Explaining versus Describing Human Decisions. Hilbert Space Structures in Decision Theory


Title	Explaining versus Describing Human Decisions. Hilbert Space Structures in Decision Theory
Authors	Sandro Sozzo
Abstract	Despite the impressive success of quantum structures to model long-standing human judgement and decision puzzles, the {\it quantum cognition research programme} still faces challenges about its explanatory power. Indeed, quantum models introduce new parameters, which may fit empirical data without necessarily explaining them. Also, one wonders whether more general non-classical structures are better equipped to model cognitive phenomena. In this paper, we provide a {\it realistic-operational foundation of decision processes} using a known decision-making puzzle, the {\it Ellsberg paradox}, as a case study. Then, we elaborate a novel representation of the Ellsberg decision situation applying standard quantum correspondence rules which map realistic-operational entities into quantum mathematical terms. This result opens the way towards an independent, foundational rather than phenomenological, motivation for a general use of quantum Hilbert space structures in human cognition.
Tasks	Decision Making
Published	2019-04-04
URL	http://arxiv.org/abs/1904.02523v1
PDF	http://arxiv.org/pdf/1904.02523v1.pdf
PWC	https://paperswithcode.com/paper/explaining-versus-describing-human-decisions
Repo
Framework

The problem with DDPG: understanding failures in deterministic environments with sparse rewards


Title	The problem with DDPG: understanding failures in deterministic environments with sparse rewards
Authors	Guillaume Matheron, Nicolas Perrin, Olivier Sigaud
Abstract	In environments with continuous state and action spaces, state-of-the-art actor-critic reinforcement learning algorithms can solve very complex problems, yet can also fail in environments that seem trivial, but the reason for such failures is still poorly understood. In this paper, we contribute a formal explanation of these failures in the particular case of sparse reward and deterministic environments. First, using a very elementary control problem, we illustrate that the learning process can get stuck into a fixed point corresponding to a poor solution. Then, generalizing from the studied example, we provide a detailed analysis of the underlying mechanisms which results in a new understanding of one of the convergence regimes of these algorithms. The resulting perspective casts a new light on already existing solutions to the issues we have highlighted, and suggests other potential approaches.
Tasks
Published	2019-11-26
URL	https://arxiv.org/abs/1911.11679v1
PDF	https://arxiv.org/pdf/1911.11679v1.pdf
PWC	https://paperswithcode.com/paper/the-problem-with-ddpg-understanding-failures-1
Repo
Framework

Task-Aware Feature Generation for Zero-Shot Compositional Learning


Title	Task-Aware Feature Generation for Zero-Shot Compositional Learning
Authors	Xin Wang, Fisher Yu, Trevor Darrell, Joseph E. Gonzalez
Abstract	Visual concepts (e.g., red apple, big elephant) are often semantically compositional and each element of the compositions can be reused to construct novel concepts (e.g., red elephant). Compositional feature synthesis, which generates image feature distributions exploiting the semantic compositionality, is a promising approach to sample-efficient model generalization. In this work, we propose a task-aware feature generation (TFG) framework for compositional learning, which generates features of novel visual concepts by transferring knowledge from previously seen concepts. These synthetic features are then used to train a classifier to recognize novel concepts in a zero-shot manner. Our novel TFG design injects task-conditioned noise layer-by-layer, producing task-relevant variation at each level. We find the proposed generator design improves classification accuracy and sample efficiency. Our model establishes a new state of the art on three zero-shot compositional learning (ZSCL) benchmarks, outperforming the previous discriminative models by a large margin. Our model improves the performance of the prior arts by over 2x in the generalized ZSCL setting.
Tasks	Zero-Shot Learning
Published	2019-06-11
URL	https://arxiv.org/abs/1906.04854v2
PDF	https://arxiv.org/pdf/1906.04854v2.pdf
PWC	https://paperswithcode.com/paper/task-aware-deep-sampling-for-feature
Repo
Framework

CELNet: Evidence Localization for Pathology Images using Weakly Supervised Learning


Title	CELNet: Evidence Localization for Pathology Images using Weakly Supervised Learning
Authors	Yongxiang Huang, Albert C. S. Chung
Abstract	Despite deep convolutional neural networks boost the performance of image classification and segmentation in digital pathology analysis, they are usually weak in interpretability for clinical applications or require heavy annotations to achieve object localization. To overcome this problem, we propose a weakly supervised learning-based approach that can effectively learn to localize the discriminative evidence for a diagnostic label from weakly labeled training data. Experimental results show that our proposed method can reliably pinpoint the location of cancerous evidence supporting the decision of interest, while still achieving a competitive performance on glimpse-level and slide-level histopathologic cancer detection tasks.
Tasks	Image Classification, Object Localization
Published	2019-09-16
URL	https://arxiv.org/abs/1909.07097v1
PDF	https://arxiv.org/pdf/1909.07097v1.pdf
PWC	https://paperswithcode.com/paper/celnet-evidence-localization-for-pathology
Repo
Framework

Assessing the Impact of Blood Pressure on Cardiac Function Using Interpretable Biomarkers and Variational Autoencoders


Title	Assessing the Impact of Blood Pressure on Cardiac Function Using Interpretable Biomarkers and Variational Autoencoders
Authors	Esther Puyol-Antón, Bram Ruijsink, James R. Clough, Ilkay Oksuz, Daniel Rueckert, Reza Razavi, Andrew P. King
Abstract	Maintaining good cardiac function for as long as possible is a major concern for healthcare systems worldwide and there is much interest in learning more about the impact of different risk factors on cardiac health. The aim of this study is to analyze the impact of systolic blood pressure (SBP) on cardiac function while preserving the interpretability of the model using known clinical biomarkers in a large cohort of the UK Biobank population. We propose a novel framework that combines deep learning based estimation of interpretable clinical biomarkers from cardiac cine MR data with a variational autoencoder (VAE). The VAE architecture integrates a regression loss in the latent space, which enables the progression of cardiac health with SBP to be learnt. Results on 3,600 subjects from the UK Biobank show that the proposed model allows us to gain important insight into the deterioration of cardiac function with increasing SBP, identify key interpretable factors involved in this process, and lastly exploit the model to understand patterns of positive and adverse adaptation of cardiac function.
Tasks
Published	2019-08-13
URL	https://arxiv.org/abs/1908.04538v1
PDF	https://arxiv.org/pdf/1908.04538v1.pdf
PWC	https://paperswithcode.com/paper/assessing-the-impact-of-blood-pressure-on
Repo
Framework

FingerNet: Pushing The Limits of Fingerprint Recognition Using Convolutional Neural Network


Title	FingerNet: Pushing The Limits of Fingerprint Recognition Using Convolutional Neural Network
Authors	Shervin Minaee, Elham Azimi, Amirali Abdolrashidi
Abstract	Fingerprint recognition has been utilized for cellphone authentication, airport security and beyond. Many different features and algorithms have been proposed to improve fingerprint recognition. In this paper, we propose an end-to-end deep learning framework for fingerprint recognition using convolutional neural networks (CNNs) which can jointly learn the feature representation and perform recognition. We train our model on a large-scale fingerprint recognition dataset, and improve over previous approaches in terms of accuracy. Our proposed model is able to achieve a very high recognition accuracy on a well-known fingerprint dataset. We believe this framework can be widely used for biometrics recognition tasks, making more scalable and accurate systems possible. We have also used a visualization technique to highlight the important areas in an input fingerprint image, that mostly impact the recognition results.
Tasks
Published	2019-07-28
URL	https://arxiv.org/abs/1907.12956v1
PDF	https://arxiv.org/pdf/1907.12956v1.pdf
PWC	https://paperswithcode.com/paper/fingernet-pushing-the-limits-of-fingerprint
Repo
Framework