Paper Group ANR 1531
Scalable Source Code Similarity Detection in Large Code Repositories. It Runs in the Family: Searching for Similar Names using Digitized Family Trees. Where is the Information in a Deep Neural Network?. Automatic Discovery of Privacy-Utility Pareto Fronts. Surface Type Classification for Autonomous Robot Indoor Navigation. A Group-Theoretic Framewo …
Scalable Source Code Similarity Detection in Large Code Repositories
Title | Scalable Source Code Similarity Detection in Large Code Repositories |
Authors | F Alomari, M Harbi |
Abstract | Source code similarity are increasingly used in application development to identify clones, isolate bugs, and find copy-rights violations. Similar code fragments can be very problematic due to the fact that errors in the original code must be fixed in every copy. Other maintenance changes, such as extensions or patches, must be applied multiple times. Furthermore, the diversity of coding styles and flexibility of modern languages makes it difficult and cost ineffective to manually inspect large code repositories. Therefore, detection is only feasible by automatic techniques. We present an efficient and scalable approach for similar code fragment identification based on source code control flow graphs fingerprinting. The source code is processed to generate control flow graphs that are then hashed to create a unique fingerprint of the code capturing semantics as well as syntax similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation supports the validity of our approach and show its effectiveness and efficiency in comparison with other solutions. |
Tasks | |
Published | 2019-07-26 |
URL | https://arxiv.org/abs/1907.11817v1 |
https://arxiv.org/pdf/1907.11817v1.pdf | |
PWC | https://paperswithcode.com/paper/scalable-source-code-similarity-detection-in |
Repo | |
Framework | |
It Runs in the Family: Searching for Similar Names using Digitized Family Trees
Title | It Runs in the Family: Searching for Similar Names using Digitized Family Trees |
Authors | Aviad Elyashar, Rami Puzis, Michael Fire |
Abstract | Searching for a person’s name is a common online activity. However, web search engines suffer from low numbers of accurate results to a query containing names. Underlying these poor results are the multiple legitimate spelling variations for a given name, as opposed to regular text with a single way to be spelled correctly. Today, most of the techniques used to suggest related names in an online search are based on pattern matching and phonetic encoding. However, they frequently lead to poor performance. Here, we propose a revolutionary approach to tackle the problem of similar name suggestions. Our novel algorithm utilizes historical data collected from genealogy websites, along with graph algorithms. In contrast to previous approaches, we propose a general method that suggests similar names based on the construction and analysis of digitized ancestral family trees. Similar names are extracted from the graph using generic ordering functions that outperform other algorithms that suggest names based on a single dimension, which limits their performance. Utilizing a large-scale online genealogy dataset with over 17 million profiles and more than 200,000 unique first names, we constructed a vast name-based graph. Using this graph along with 7,399 labeled given names with their true synonyms, we evaluated our proposed approach. The results showed that our approach gave superior performance in terms of accuracy, F1, and precision compared to other algorithms, including phonetic and string similarity algorithms. We propose our algorithm as a useful and valuable tool for suggesting similar names based on constructing a name-based graph using family trees. |
Tasks | |
Published | 2019-12-09 |
URL | https://arxiv.org/abs/1912.04003v2 |
https://arxiv.org/pdf/1912.04003v2.pdf | |
PWC | https://paperswithcode.com/paper/it-runs-in-the-family-searching-for-similar |
Repo | |
Framework | |
Where is the Information in a Deep Neural Network?
Title | Where is the Information in a Deep Neural Network? |
Authors | Alessandro Achille, Stefano Soatto |
Abstract | Whatever information a Deep Neural Network has gleaned from past data is encoded in its weights. How this information affects the response of the network to future data is largely an open question. In fact, even how to define and measure information in a network is still not settled. We introduce the notion of Information in the Weights as the optimal trade-off between accuracy of the network and complexity of the weights, relative to a prior. Depending on the prior, the definition reduces to known information measures such as Shannon Mutual Information and Fisher Information, but affords added flexibility that enables us to relate it to generalization, via the PAC-Bayes bound, and to invariance. This relation hinges not only on the architecture of the model, but surprisingly on how it is trained. We then introduce a notion of effective information in the activations, which are deterministic functions of future inputs, resolving inconsistencies in prior work. We relate this to the Information in the Weights, and use this result to show that models of low complexity not only generalize better, but are bound to learn invariant representations of future inputs. |
Tasks | |
Published | 2019-05-29 |
URL | https://arxiv.org/abs/1905.12213v2 |
https://arxiv.org/pdf/1905.12213v2.pdf | |
PWC | https://paperswithcode.com/paper/where-is-the-information-in-a-deep-neural |
Repo | |
Framework | |
Automatic Discovery of Privacy-Utility Pareto Fronts
Title | Automatic Discovery of Privacy-Utility Pareto Fronts |
Authors | Brendan Avent, Javier Gonzalez, Tom Diethe, Andrei Paleyes, Borja Balle |
Abstract | Differential privacy is a mathematical framework for privacy-preserving data analysis. Changing the hyperparameters of a differentially private algorithm allows one to trade off privacy and utility in a principled way. Quantifying this trade-off in advance is essential to decision-makers tasked with deciding how much privacy can be provided in a particular application while maintaining acceptable utility. Analytical utility guarantees offer a rigorous tool to reason about this trade-off, but are generally only available for relatively simple problems. For more complex tasks, such as training neural networks under differential privacy, the utility achieved by a given algorithm can only be measured empirically. This paper presents a Bayesian optimization methodology for efficiently characterizing the privacy–utility trade-off of any differentially private algorithm using only empirical measurements of its utility. The versatility of our method is illustrated on a number of machine learning tasks involving multiple models, optimizers, and datasets. |
Tasks | |
Published | 2019-05-26 |
URL | https://arxiv.org/abs/1905.10862v3 |
https://arxiv.org/pdf/1905.10862v3.pdf | |
PWC | https://paperswithcode.com/paper/automatic-discovery-of-privacy-utility-pareto |
Repo | |
Framework | |
Surface Type Classification for Autonomous Robot Indoor Navigation
Title | Surface Type Classification for Autonomous Robot Indoor Navigation |
Authors | Francesco Lomio, Erjon Skenderi, Damoon Mohamadi, Jussi Collin, Reza Ghabcheloo, Heikki Huttunen |
Abstract | In this work we describe the preparation of a time series dataset of inertial measurements for determining the surface type under a wheeled robot. The data consists of over 7600 labeled time series samples, with the corresponding surface type annotation. This data was used in two public competitions with over 1500 participant in total. Additionally, we describe the performance of state-of-art deep learning models for time series classification, as well as propose a baseline model based on an ensemble of machine learning methods. The baseline achieves an accuracy of over 68% with our nine-category dataset. |
Tasks | Time Series, Time Series Classification |
Published | 2019-05-01 |
URL | http://arxiv.org/abs/1905.00252v1 |
http://arxiv.org/pdf/1905.00252v1.pdf | |
PWC | https://paperswithcode.com/paper/surface-type-classification-for-autonomous |
Repo | |
Framework | |
A Group-Theoretic Framework for Data Augmentation
Title | A Group-Theoretic Framework for Data Augmentation |
Authors | Shuxiao Chen, Edgar Dobriban, Jane H Lee |
Abstract | Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant. We prove that it leads to variance reduction. We study empirical risk minimization, and the examples of exponential families, linear regression, and certain two-layer neural networks. We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM). |
Tasks | Data Augmentation, Image Classification |
Published | 2019-07-25 |
URL | https://arxiv.org/abs/1907.10905v3 |
https://arxiv.org/pdf/1907.10905v3.pdf | |
PWC | https://paperswithcode.com/paper/invariance-reduces-variance-understanding |
Repo | |
Framework | |
Ensembling Strategies for Answering Natural Questions
Title | Ensembling Strategies for Answering Natural Questions |
Authors | Anthony Ferritto, Lin Pan, Rishav Chakravarti, Salim Roukos, Radu Florian, J. William Murdock, Avirup Sil |
Abstract | Many of the top question answering systems today utilize ensembling to improve their performance on tasks such as the Stanford Question Answering Dataset (SQuAD) and Natural Questions (NQ) challenges. Unfortunately most of these systems do not publish their ensembling strategies used in their leaderboard submissions. In this work, we investigate a number of ensembling techniques and demonstrate a strategy which improves our F1 score for short answers on the dev set for NQ by 2.3 F1 points over our single model (which outperforms the previous SOTA by 1.9 F1 points). |
Tasks | Question Answering |
Published | 2019-10-30 |
URL | https://arxiv.org/abs/1911.00337v3 |
https://arxiv.org/pdf/1911.00337v3.pdf | |
PWC | https://paperswithcode.com/paper/ensembling-strategies-for-answering-natural |
Repo | |
Framework | |
Limits of Private Learning with Access to Public Data
Title | Limits of Private Learning with Access to Public Data |
Authors | Noga Alon, Raef Bassily, Shay Moran |
Abstract | We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. This setting interpolates between private learning (where all examples are private) and classical learning (where all examples are public). We study the limits of learning in this setting in terms of private and public sample complexities. We show that any hypothesis class of VC-dimension $d$ can be agnostically learned up to an excess error of $\alpha$ using only (roughly) $d/\alpha$ public examples and $d/\alpha^2$ private labeled examples. This result holds even when the public examples are unlabeled. This gives a quadratic improvement over the standard $d/\alpha^2$ upper bound on the public sample complexity (where private examples can be ignored altogether if the public examples are labeled). Furthermore, we give a nearly matching lower bound, which we prove via a generic reduction from this setting to the one of private learning without public data. |
Tasks | |
Published | 2019-10-25 |
URL | https://arxiv.org/abs/1910.11519v1 |
https://arxiv.org/pdf/1910.11519v1.pdf | |
PWC | https://paperswithcode.com/paper/limits-of-private-learning-with-access-to |
Repo | |
Framework | |
Model Selection With Graphical Neighbour Information
Title | Model Selection With Graphical Neighbour Information |
Authors | Robert O’Shea |
Abstract | Accurate model selection is a fundamental requirement for statistical analysis. In many real-world applications of graphical modelling, correct model structure identification is the ultimate objective. Standard model validation procedures such as information theoretic scores and cross validation have demonstrated poor performance in the high dimensional setting. Specialised methods such as EBIC, StARS and RIC have been developed for the explicit purpose of high-dimensional Gaussian graphical model selection. We present a novel model score criterion, Graphical Neighbour Information. This method demonstrates oracle performance in high-dimensional model selection, outperforming the current state-of-the-art in our simulations. The Graphical Neighbour Information criterion has the additional advantage of efficient, closed-form computability, sparing the costly inference of multiple models on data subsamples. We provide a theoretical analysis of the method and benchmark simulations versus the current state of the art. |
Tasks | Model Selection |
Published | 2019-08-27 |
URL | https://arxiv.org/abs/1908.10243v1 |
https://arxiv.org/pdf/1908.10243v1.pdf | |
PWC | https://paperswithcode.com/paper/model-selection-with-graphical-neighbour |
Repo | |
Framework | |
Explaining versus Describing Human Decisions. Hilbert Space Structures in Decision Theory
Title | Explaining versus Describing Human Decisions. Hilbert Space Structures in Decision Theory |
Authors | Sandro Sozzo |
Abstract | Despite the impressive success of quantum structures to model long-standing human judgement and decision puzzles, the {\it quantum cognition research programme} still faces challenges about its explanatory power. Indeed, quantum models introduce new parameters, which may fit empirical data without necessarily explaining them. Also, one wonders whether more general non-classical structures are better equipped to model cognitive phenomena. In this paper, we provide a {\it realistic-operational foundation of decision processes} using a known decision-making puzzle, the {\it Ellsberg paradox}, as a case study. Then, we elaborate a novel representation of the Ellsberg decision situation applying standard quantum correspondence rules which map realistic-operational entities into quantum mathematical terms. This result opens the way towards an independent, foundational rather than phenomenological, motivation for a general use of quantum Hilbert space structures in human cognition. |
Tasks | Decision Making |
Published | 2019-04-04 |
URL | http://arxiv.org/abs/1904.02523v1 |
http://arxiv.org/pdf/1904.02523v1.pdf | |
PWC | https://paperswithcode.com/paper/explaining-versus-describing-human-decisions |
Repo | |
Framework | |
The problem with DDPG: understanding failures in deterministic environments with sparse rewards
Title | The problem with DDPG: understanding failures in deterministic environments with sparse rewards |
Authors | Guillaume Matheron, Nicolas Perrin, Olivier Sigaud |
Abstract | In environments with continuous state and action spaces, state-of-the-art actor-critic reinforcement learning algorithms can solve very complex problems, yet can also fail in environments that seem trivial, but the reason for such failures is still poorly understood. In this paper, we contribute a formal explanation of these failures in the particular case of sparse reward and deterministic environments. First, using a very elementary control problem, we illustrate that the learning process can get stuck into a fixed point corresponding to a poor solution. Then, generalizing from the studied example, we provide a detailed analysis of the underlying mechanisms which results in a new understanding of one of the convergence regimes of these algorithms. The resulting perspective casts a new light on already existing solutions to the issues we have highlighted, and suggests other potential approaches. |
Tasks | |
Published | 2019-11-26 |
URL | https://arxiv.org/abs/1911.11679v1 |
https://arxiv.org/pdf/1911.11679v1.pdf | |
PWC | https://paperswithcode.com/paper/the-problem-with-ddpg-understanding-failures-1 |
Repo | |
Framework | |
Task-Aware Feature Generation for Zero-Shot Compositional Learning
Title | Task-Aware Feature Generation for Zero-Shot Compositional Learning |
Authors | Xin Wang, Fisher Yu, Trevor Darrell, Joseph E. Gonzalez |
Abstract | Visual concepts (e.g., red apple, big elephant) are often semantically compositional and each element of the compositions can be reused to construct novel concepts (e.g., red elephant). Compositional feature synthesis, which generates image feature distributions exploiting the semantic compositionality, is a promising approach to sample-efficient model generalization. In this work, we propose a task-aware feature generation (TFG) framework for compositional learning, which generates features of novel visual concepts by transferring knowledge from previously seen concepts. These synthetic features are then used to train a classifier to recognize novel concepts in a zero-shot manner. Our novel TFG design injects task-conditioned noise layer-by-layer, producing task-relevant variation at each level. We find the proposed generator design improves classification accuracy and sample efficiency. Our model establishes a new state of the art on three zero-shot compositional learning (ZSCL) benchmarks, outperforming the previous discriminative models by a large margin. Our model improves the performance of the prior arts by over 2x in the generalized ZSCL setting. |
Tasks | Zero-Shot Learning |
Published | 2019-06-11 |
URL | https://arxiv.org/abs/1906.04854v2 |
https://arxiv.org/pdf/1906.04854v2.pdf | |
PWC | https://paperswithcode.com/paper/task-aware-deep-sampling-for-feature |
Repo | |
Framework | |
CELNet: Evidence Localization for Pathology Images using Weakly Supervised Learning
Title | CELNet: Evidence Localization for Pathology Images using Weakly Supervised Learning |
Authors | Yongxiang Huang, Albert C. S. Chung |
Abstract | Despite deep convolutional neural networks boost the performance of image classification and segmentation in digital pathology analysis, they are usually weak in interpretability for clinical applications or require heavy annotations to achieve object localization. To overcome this problem, we propose a weakly supervised learning-based approach that can effectively learn to localize the discriminative evidence for a diagnostic label from weakly labeled training data. Experimental results show that our proposed method can reliably pinpoint the location of cancerous evidence supporting the decision of interest, while still achieving a competitive performance on glimpse-level and slide-level histopathologic cancer detection tasks. |
Tasks | Image Classification, Object Localization |
Published | 2019-09-16 |
URL | https://arxiv.org/abs/1909.07097v1 |
https://arxiv.org/pdf/1909.07097v1.pdf | |
PWC | https://paperswithcode.com/paper/celnet-evidence-localization-for-pathology |
Repo | |
Framework | |
Assessing the Impact of Blood Pressure on Cardiac Function Using Interpretable Biomarkers and Variational Autoencoders
Title | Assessing the Impact of Blood Pressure on Cardiac Function Using Interpretable Biomarkers and Variational Autoencoders |
Authors | Esther Puyol-Antón, Bram Ruijsink, James R. Clough, Ilkay Oksuz, Daniel Rueckert, Reza Razavi, Andrew P. King |
Abstract | Maintaining good cardiac function for as long as possible is a major concern for healthcare systems worldwide and there is much interest in learning more about the impact of different risk factors on cardiac health. The aim of this study is to analyze the impact of systolic blood pressure (SBP) on cardiac function while preserving the interpretability of the model using known clinical biomarkers in a large cohort of the UK Biobank population. We propose a novel framework that combines deep learning based estimation of interpretable clinical biomarkers from cardiac cine MR data with a variational autoencoder (VAE). The VAE architecture integrates a regression loss in the latent space, which enables the progression of cardiac health with SBP to be learnt. Results on 3,600 subjects from the UK Biobank show that the proposed model allows us to gain important insight into the deterioration of cardiac function with increasing SBP, identify key interpretable factors involved in this process, and lastly exploit the model to understand patterns of positive and adverse adaptation of cardiac function. |
Tasks | |
Published | 2019-08-13 |
URL | https://arxiv.org/abs/1908.04538v1 |
https://arxiv.org/pdf/1908.04538v1.pdf | |
PWC | https://paperswithcode.com/paper/assessing-the-impact-of-blood-pressure-on |
Repo | |
Framework | |
FingerNet: Pushing The Limits of Fingerprint Recognition Using Convolutional Neural Network
Title | FingerNet: Pushing The Limits of Fingerprint Recognition Using Convolutional Neural Network |
Authors | Shervin Minaee, Elham Azimi, Amirali Abdolrashidi |
Abstract | Fingerprint recognition has been utilized for cellphone authentication, airport security and beyond. Many different features and algorithms have been proposed to improve fingerprint recognition. In this paper, we propose an end-to-end deep learning framework for fingerprint recognition using convolutional neural networks (CNNs) which can jointly learn the feature representation and perform recognition. We train our model on a large-scale fingerprint recognition dataset, and improve over previous approaches in terms of accuracy. Our proposed model is able to achieve a very high recognition accuracy on a well-known fingerprint dataset. We believe this framework can be widely used for biometrics recognition tasks, making more scalable and accurate systems possible. We have also used a visualization technique to highlight the important areas in an input fingerprint image, that mostly impact the recognition results. |
Tasks | |
Published | 2019-07-28 |
URL | https://arxiv.org/abs/1907.12956v1 |
https://arxiv.org/pdf/1907.12956v1.pdf | |
PWC | https://paperswithcode.com/paper/fingernet-pushing-the-limits-of-fingerprint |
Repo | |
Framework | |