January 28, 2020


Paper Group ANR 814

Scaling Up Collaborative Filtering Data Sets through Randomized Fractal Expansions. Data Exploration and Validation on dense knowledge graphs for biomedical research. Adaptive Artificial Intelligent Q&A Platform. Making Good on LSTMs’ Unfulfilled Promise. An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model. CommentsRadar …

Scaling Up Collaborative Filtering Data Sets through Randomized Fractal Expansions


Title	Scaling Up Collaborative Filtering Data Sets through Randomized Fractal Expansions
Authors	Francois Belletti, Karthik Lakshmanan, Walid Krichene, Nicolas Mayoraz, Yi-Fan Chen, John Anderson, Taylor Robie, Tayo Oguntebi, Dan Shirron, Amit Bleiwess
Abstract	Recommender system research suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap, we propose to generate large-scale user/item interaction data sets by expanding pre-existing public data sets. Our key contribution is a technique that expands user/item incidence matrices to large numbers of rows (users), columns (items), and non-zero values (interactions). The proposed method adapts Kronecker Graph Theory to preserve key higher order statistical properties such as the fat-tailed distribution of user engagements, item popularity, and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which in turn can be employed reliably to benchmark recommender systems and the systems employed to train them. We further apply our stochastic expansion algorithm to the binarized MovieLens 20M data set, which comprises 20M interactions between 27K movies and 138K users. The resulting expanded data set has 1.2B ratings, 2.2M users, and 855K items, which can be scaled up or down.
Tasks	Recommendation Systems
Published	2019-04-08
URL	http://arxiv.org/abs/1905.09874v1
PDF	http://arxiv.org/pdf/1905.09874v1.pdf
PWC	https://paperswithcode.com/paper/190509874
Repo
Framework
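
As a rough illustration of the Kronecker idea mentioned in the abstract above, the sketch below expands a tiny binary user/item matrix by taking its Kronecker product with a small seed matrix and then randomly thinning the result. This is not the authors' algorithm: the seed matrix, the `keep_prob` parameter, and the function names are assumptions made purely for illustration.

```python
import random


def kronecker(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a_val * b[i][j] for a_val in a_row for j in range(cols_b)]
        for a_row in a
        for i in range(rows_b)
    ]


def expand(interactions, seed, keep_prob=1.0, rng=None):
    """Expand a 0/1 interaction matrix, keeping each non-zero with keep_prob."""
    rng = rng or random.Random(0)
    expanded = kronecker(seed, interactions)
    return [
        [1 if v and rng.random() < keep_prob else 0 for v in row]
        for row in expanded
    ]


# Hypothetical toy data: 2 users x 3 items, expanded by a 2 x 2 seed.
small = [[1, 0, 1],
         [0, 1, 0]]
seed = [[1, 1],
        [0, 1]]
big = expand(small, seed)
print(len(big), "users x", len(big[0]), "items")
```

With `keep_prob=1.0` the number of interactions is the product of the non-zero counts of the two factors; lowering it gives a crude stand-in for a stochastic expansion.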

Data Exploration and Validation on dense knowledge graphs for biomedical research


Title	Data Exploration and Validation on dense knowledge graphs for biomedical research
Authors	Jens Dörpinghaus, Alexander Apke, Vanessa Lage-Rupprecht, Andreas Stefan
Abstract	Here we present a holistic approach for data exploration on dense knowledge graphs as a novel approach with a proof-of-concept in biomedical research. Knowledge graphs are increasingly becoming a vital factor in knowledge mining and discovery as they connect data using technologies from the semantic web. In this paper we extend a basic knowledge graph extracted from biomedical literature by context data like named entities and relations obtained by text mining and other linked data sources like ontologies and databases. We will present an overview of this novel network. The aim of this work was to extend this current knowledge with approaches from graph theory. This method will build the foundation for quality control, validation of hypotheses, detection of missing data and time series analysis of biomedical knowledge in general. In this context we tried to apply multiple-valued decision diagrams to these questions. In addition, this knowledge representation of linked data can be used as a FAIR approach to answer semantic questions. This paper sheds new light on dense and very large knowledge graphs and the importance of a graph-theoretic understanding of these networks.
Tasks	Knowledge Graphs, Time Series, Time Series Analysis
Published	2019-12-08
URL	https://arxiv.org/abs/1912.06194v1
PDF	https://arxiv.org/pdf/1912.06194v1.pdf
PWC	https://paperswithcode.com/paper/data-exploration-and-validation-on-dense
Repo
Framework

Adaptive Artificial Intelligent Q&A Platform


Title	Adaptive Artificial Intelligent Q&A Platform
Authors	M. R. Akram, C. P. Singhabahu, M. S. M. Saad, P. Deleepa, Anupiya Nugaliyadde, Yashas Mallawarachchi
Abstract	The paper presents an approach to building a question and answer system that can process the information in a large dataset and lets the user gain knowledge from this dataset by asking questions in natural language. The research covers four dimensions: Corpus Preprocessing, Question Preprocessing, Deep Neural Network for Answer Extraction, and Answer Generation. The system is capable of understanding the question and responds to the user’s query in natural language as well. The goal is to make the user feel as if they were interacting with a person rather than a machine.
Tasks
Published	2019-01-19
URL	http://arxiv.org/abs/1902.02162v1
PDF	http://arxiv.org/pdf/1902.02162v1.pdf
PWC	https://paperswithcode.com/paper/adaptive-artificial-intelligent-qa-platform
Repo
Framework

Making Good on LSTMs’ Unfulfilled Promise


Title	Making Good on LSTMs’ Unfulfilled Promise
Authors	Daniel Philps, Artur d’Avila Garcez, Tillman Weyde
Abstract	LSTMs promise much to financial time-series analysis, temporal and cross-sectional inference, but we find that they do not deliver in a real-world financial management task. We examine an alternative called Continual Learning (CL), a memory-augmented approach, which can provide transparent explanations, i.e. which memory did what and when. This work has implications for many financial applications including credit, time-varying fairness in decision making and more. We make three important new observations. Firstly, as well as being more explainable, time-series CL approaches outperform LSTMs as well as a simple sliding window learner using feed-forward neural networks (FFNN). Secondly, we show that CL based on a sliding window learner (FFNN) is more effective than CL based on a sequential learner (LSTM). Thirdly, we examine how real-world, time-series noise impacts several similarity approaches used in CL memory addressing. We provide these insights using an approach called Continual Learning Augmentation (CLA) tested on a complex real-world problem, emerging market equities investment decision making. CLA provides a test-bed as it can be based on different types of time-series learners, allowing testing of LSTM and FFNN learners side by side. CLA is also used to test several distance approaches used in a memory recall-gate: Euclidean distance (ED), dynamic time warping (DTW), auto-encoders (AE) and a novel hybrid approach, warp-AE. We find that ED under-performs DTW and AE but warp-AE shows the best overall performance in a real-world financial task.
Tasks	Continual Learning, Decision Making, Time Series, Time Series Analysis
Published	2019-11-11
URL	https://arxiv.org/abs/1911.04489v4
PDF	https://arxiv.org/pdf/1911.04489v4.pdf
PWC	https://paperswithcode.com/paper/making-good-on-lstms-unfulfilled-promise
Repo
Framework
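
To make the contrast between two of the distance measures named in the abstract above concrete, here is a minimal sketch of Euclidean distance and classic dynamic time warping (DTW) on toy sequences. It only illustrates the two textbook measures; it does not reproduce the paper's memory recall-gate, and the toy series and the absolute-difference local cost are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic DTW cost using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Hypothetical toy series: y is x shifted by one time step.
x = [0, 0, 1, 2, 1, 0]
y = [0, 1, 2, 1, 0, 0]
print("euclidean:", euclidean(x, y))
print("dtw:", dtw(x, y))
```

Because DTW may stretch the time axis, the shifted copy aligns at zero cost, whereas the pointwise Euclidean distance penalises the shift.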

An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model


Title	An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model
Authors	Oluwatobi Olabiyi, Anish Khazane, Alan Salimov, Erik T. Mueller
Abstract	In this paper, we extend the persona-based sequence-to-sequence (Seq2Seq) neural network conversation model to a multi-turn dialogue scenario by modifying the state-of-the-art hredGAN architecture to simultaneously capture utterance attributes such as speaker identity, dialogue topic, speaker sentiments and so on. The proposed system, phredGAN has a persona-based HRED generator (PHRED) and a conditional discriminator. We also explore two approaches to accomplish the conditional discriminator: (1) phredGAN_a, a system that passes the attribute representation as an additional input into a traditional adversarial discriminator, and (2) phredGAN_d, a dual discriminator system which in addition to the adversarial discriminator, collaboratively predicts the attribute(s) that generated the input utterance. To demonstrate the superior performance of phredGAN over the persona Seq2Seq model, we experiment with two conversational datasets, the Ubuntu Dialogue Corpus (UDC) and TV series transcripts from the Big Bang Theory and Friends. Performance comparison is made with respect to a variety of quantitative measures as well as crowd-sourced human evaluation. We also explore the trade-offs from using either variant of phredGAN on datasets with many but weak attribute modalities (such as with Big Bang Theory and Friends) and ones with few but strong attribute modalities (customer-agent interactions in Ubuntu dataset).
Tasks
Published	2019-04-29
URL	https://arxiv.org/abs/1905.01992v2
PDF	https://arxiv.org/pdf/1905.01992v2.pdf
PWC	https://paperswithcode.com/paper/190501992
Repo
Framework

CommentsRadar: Dive into Unique Data on All Comments on the Web


Title	CommentsRadar: Dive into Unique Data on All Comments on the Web
Authors	Sergey Nikolenko, Elena Tutubalina, Zulfat Miftahutdinov, Eugene Beloded
Abstract	We introduce an entity-centric search engine CommentsRadar that pairs entity queries with articles and user opinions covering a wide range of topics from top commented sites. The engine aggregates articles and comments for these articles, extracts named entities, links them together and with knowledge base entries, performs sentiment analysis, and aggregates the results, aiming to mine for temporal trends and other insights. In this work, we present the general engine, discuss the models used for all steps of this pipeline, and introduce several case studies that discover important insights from online commenting data.
Tasks
Published	2019-08-16
URL	https://arxiv.org/abs/1908.07069v1
PDF	https://arxiv.org/pdf/1908.07069v1.pdf
PWC	https://paperswithcode.com/paper/commentsradar-dive-into-unique-data-on-all
Repo
Framework

HUSE: Hierarchical Universal Semantic Embeddings


Title	HUSE: Hierarchical Universal Semantic Embeddings
Authors	Pradyumna Narayana, Aniket Pednekar, Abishek Krishnamoorthy, Kazoo Sone, Sugato Basu
Abstract	There is a recent surge of interest in cross-modal representation learning corresponding to images and text. The main challenge lies in mapping images and text to a shared latent space where the embeddings corresponding to a similar semantic concept lie closer to each other than the embeddings corresponding to different semantic concepts, irrespective of the modality. Ranking losses are commonly used to create such a shared latent space; however, they do not impose any constraints on inter-class relationships, so neighboring clusters may be completely unrelated. The works in the domain of visual semantic embeddings address this problem by first constructing a semantic embedding space based on some external knowledge and projecting image embeddings onto this fixed semantic embedding space. These works are confined to the image domain, and constraining the embeddings to a fixed space adds a burden on learning. This paper proposes a novel method, HUSE, to learn cross-modal representation with semantic information. HUSE learns a shared latent space where the distance between any two universal embeddings is similar to the distance between their corresponding class embeddings in the semantic embedding space. HUSE also uses a classification objective with a shared classification layer to make sure that the image and text embeddings are in the same shared latent space. Experiments on UPMC Food-101 show our method outperforms previous state-of-the-art on retrieval, hierarchical precision and classification results.
Tasks	Representation Learning
Published	2019-11-14
URL	https://arxiv.org/abs/1911.05978v1
PDF	https://arxiv.org/pdf/1911.05978v1.pdf
PWC	https://paperswithcode.com/paper/huse-hierarchical-universal-semantic
Repo
Framework
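
The distance-matching idea in the abstract above (distances between universal embeddings should resemble distances between the corresponding class embeddings) can be sketched as a simple pairwise penalty. The choice of cosine distance, the squared-difference penalty, and all names below are assumptions for illustration, not the loss defined in the paper.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distance_matching_loss(embeddings, class_embeddings):
    """Mean squared gap between embedding distances and class distances."""
    pairs = list(combinations(range(len(embeddings)), 2))
    total = 0.0
    for i, j in pairs:
        d_universal = cosine_distance(embeddings[i], embeddings[j])
        d_semantic = cosine_distance(class_embeddings[i], class_embeddings[j])
        total += (d_universal - d_semantic) ** 2
    return total / len(pairs)


# Hypothetical toy vectors: two samples whose classes coincide semantically
# but whose learned embeddings are orthogonal.
learned = [[1.0, 0.0], [0.0, 1.0]]
semantic = [[1.0, 0.0], [1.0, 0.0]]
print(distance_matching_loss(learned, semantic))
```

The penalty vanishes when the learned space reproduces the pairwise geometry of the semantic space, which is the property the abstract describes.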

Two Case Studies of Experience Prototyping Machine Learning Systems in the Wild


Title	Two Case Studies of Experience Prototyping Machine Learning Systems in the Wild
Authors	Qian Yang
Abstract	Throughout the course of my Ph.D., I have been designing the user experience (UX) of various machine learning (ML) systems. In this workshop, I share two projects as case studies in which people engage with ML in much more complicated and nuanced ways than the technical HCML work might assume. The first case study describes how cardiology teams in three hospitals used a clinical decision-support system that helps them decide whether and when to implant an artificial heart in a heart failure patient. I demonstrate that physicians cannot draw on their decision-making experience by seeing only patient data on paper. They are also confused by some fundamental premises upon which ML operates. For example, physicians asked: Are ML predictions made based on clinicians’ best efforts? Is it ethical to make decisions based on previous patients’ collective outcomes? In the second case study, my collaborators and I designed an intelligent text editor, with the goal of improving authors’ writing experience with NLP (Natural Language Processing) technologies. We prototyped a number of generative functionalities where the system provides phrase-or-sentence-level writing suggestions upon user request. When writing with the prototype, however, authors shared that they need to “see where the sentence is going two paragraphs later” in order to decide whether the suggestion aligns with their writing; some even considered adopting machine suggestions as plagiarism, which therefore “is simply wrong”. By sharing these unexpected and intriguing responses from these real-world ML users, I hope to start a discussion about such previously-unknown complexities and nuances of – as the workshop proposal states – “putting ML at the service of people in a way that is accessible, useful, and trustworthy to all”.
Tasks	Decision Making
Published	2019-10-21
URL	https://arxiv.org/abs/1910.09137v1
PDF	https://arxiv.org/pdf/1910.09137v1.pdf
PWC	https://paperswithcode.com/paper/two-case-studies-of-experience-prototyping
Repo
Framework

Legal document retrieval across languages: topic hierarchies based on synsets


Title	Legal document retrieval across languages: topic hierarchies based on synsets
Authors	Carlos Badenes-Olmedo, Jose-Luis Redondo-Garcia, Oscar Corcho
Abstract	Cross-lingual annotations of legislative texts enable us to explore major themes covered in multilingual legal data and are a key facilitator of semantic similarity when searching for similar documents. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space, which limits the number of scenarios where this technique can be used. In this work, we provide an unsupervised document similarity algorithm based on hierarchies of multi-lingual concepts to describe topics across languages. The algorithm does not require parallel or comparable corpora, or any other type of translation resource. Experiments performed on the English, Spanish, French and Portuguese editions of JRC-Acquis corpora reveal promising results on classifying and sorting documents by similar content.
Tasks	Semantic Similarity, Semantic Textual Similarity, Topic Models
Published	2019-11-28
URL	https://arxiv.org/abs/1911.12637v1
PDF	https://arxiv.org/pdf/1911.12637v1.pdf
PWC	https://paperswithcode.com/paper/legal-document-retrieval-across-languages
Repo
Framework

Topical Phrase Extraction from Clinical Reports by Incorporating both Local and Global Context


Title	Topical Phrase Extraction from Clinical Reports by Incorporating both Local and Global Context
Authors	Gabriele Pergola, Yulan He, David Lowe
Abstract	Making sense of words often requires simultaneously examining the surrounding context of a term as well as the global themes characterizing the overall corpus. Several topic models have already exploited word embeddings to recognize local context; however, this local context has been only weakly combined with the global context during topic inference. This paper proposes to extract topical phrases by corroborating the word embedding information with the global context detected by Latent Semantic Analysis, and then combining them by means of the Pólya urn model. To highlight the effectiveness of this combined approach, the model was assessed on clinical reports, a challenging scenario characterized by technical jargon and limited available word statistics. Results show it outperforms the state-of-the-art approaches in terms of both topic coherence and computational cost.
Tasks	Topic Models, Word Embeddings
Published	2019-11-22
URL	https://arxiv.org/abs/1911.10180v1
PDF	https://arxiv.org/pdf/1911.10180v1.pdf
PWC	https://paperswithcode.com/paper/topical-phrase-extraction-from-clinical
Repo
Framework

Deep Adversarial Learning in Intrusion Detection: A Data Augmentation Enhanced Framework


Title	Deep Adversarial Learning in Intrusion Detection: A Data Augmentation Enhanced Framework
Authors	He Zhang, Xingrui Yu, Peng Ren, Chunbo Luo, Geyong Min
Abstract	Intrusion detection systems (IDSs) play an important role in identifying malicious attacks and threats in networking systems. As fundamental tools of IDSs, learning based classification methods have been widely employed. When it comes to detecting network intrusions in small sample sizes (e.g., emerging intrusions), the limited number and imbalanced proportion of training samples usually cause significant challenges in training supervised and semi-supervised classifiers. In this paper, we propose a general network intrusion detection framework to address the challenges of both data scarcity and data imbalance. The novelty of the proposed framework focuses on incorporating deep adversarial learning with statistical learning and exploiting learning based data augmentation. Given a small set of network intrusion samples, it first derives a Poisson-Gamma joint probabilistic generative model to generate synthesised intrusion data using Monte Carlo methods. Those synthesised data are then augmented by deep generative neural networks through adversarial learning. Finally, it adopts the augmented intrusion data to train supervised models for detecting network intrusions. Comprehensive experimental validations on KDD Cup 99 dataset show that the proposed framework outperforms the existing learning based IDSs in terms of improved accuracy, precision, recall, and F1-score.
Tasks	Data Augmentation, Intrusion Detection, Network Intrusion Detection
Published	2019-01-23
URL	http://arxiv.org/abs/1901.07949v3
PDF	http://arxiv.org/pdf/1901.07949v3.pdf
PWC	https://paperswithcode.com/paper/deep-adversarial-learning-in-intrusion
Repo
Framework
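
The abstract above mentions a Poisson-Gamma generative model sampled with Monte Carlo methods. The sketch below shows one generic way such a model can produce synthetic count features: draw a rate from a Gamma posterior, then draw a count from a Poisson with that rate. The conjugate Gamma(1, 1) prior, the single count feature, and every name here are assumptions; the paper's actual model is not specified in this listing.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates only."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def synthesize_counts(observed, n_samples, seed=0,
                      prior_shape=1.0, prior_rate=1.0):
    """Monte Carlo draws from a Gamma-Poisson posterior predictive."""
    rng = random.Random(seed)
    # Conjugate update: Gamma(shape + sum(x), rate + n).
    shape = prior_shape + sum(observed)
    rate = prior_rate + len(observed)
    samples = []
    for _ in range(n_samples):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lam = rng.gammavariate(shape, 1.0 / rate)
        samples.append(sample_poisson(lam, rng))
    return samples


# Hypothetical observed counts for one feature of a rare intrusion class.
observed = [2, 3, 1, 4]
synthetic = synthesize_counts(observed, n_samples=5)
print(synthetic)
```

In the framework described by the abstract, samples of this kind would then be refined by adversarially trained generative networks, a step this sketch does not attempt.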

Visus: An Interactive System for Automatic Machine Learning Model Building and Curation


Title	Visus: An Interactive System for Automatic Machine Learning Model Building and Curation
Authors	Aécio Santos, Sonia Castelo, Cristian Felix, Jorge Piazentin Ono, Bowen Yu, Sungsoo Hong, Cláudio T. Silva, Enrico Bertini, Juliana Freire
Abstract	While the demand for machine learning (ML) applications is booming, there is a scarcity of data scientists capable of building such models. Automatic machine learning (AutoML) approaches have been proposed that help with this problem by synthesizing end-to-end ML data processing pipelines. However, these follow a best-effort approach and a user in the loop is necessary to curate and refine the derived pipelines. Since domain experts often have little or no expertise in machine learning, easy-to-use interactive interfaces that guide them throughout the model building process are necessary. In this paper, we present Visus, a system designed to support the model building process and curation of ML data processing pipelines generated by AutoML systems. We describe the framework used to ground our design choices and a usage scenario enabled by Visus. Finally, we discuss the feedback received in user testing sessions with domain experts.
Tasks	AutoML
Published	2019-07-05
URL	https://arxiv.org/abs/1907.02889v1
PDF	https://arxiv.org/pdf/1907.02889v1.pdf
PWC	https://paperswithcode.com/paper/visus-an-interactive-system-for-automatic
Repo
Framework

Borrow from Anywhere: Pseudo Multi-modal Object Detection in Thermal Imagery


Title	Borrow from Anywhere: Pseudo Multi-modal Object Detection in Thermal Imagery
Authors	Chaitanya Devaguptapu, Ninad Akolekar, Manuj M Sharma, Vineeth N Balasubramanian
Abstract	Can we improve detection in the thermal domain by borrowing features from rich domains like visual RGB? In this paper, we propose a pseudo-multimodal object detector trained on natural image domain data to help improve the performance of object detection in thermal images. We assume access to a large-scale dataset in the visual RGB domain and a relatively smaller dataset (in terms of instances) in the thermal domain, as is common today. We propose the use of well-known image-to-image translation frameworks to generate pseudo-RGB equivalents of a given thermal image and then use a multi-modal architecture for object detection in the thermal image. We show that our framework outperforms existing benchmarks without the explicit need for paired training examples from the two domains. We also show that our framework can learn with less data from the thermal domain.
Tasks	Image-to-Image Translation, Object Detection
Published	2019-05-21
URL	https://arxiv.org/abs/1905.08789v1
PDF	https://arxiv.org/pdf/1905.08789v1.pdf
PWC	https://paperswithcode.com/paper/borrow-from-anywhere-pseudo-multi-modal
Repo
Framework

Building Automated Survey Coders via Interactive Machine Learning


Title	Building Automated Survey Coders via Interactive Machine Learning
Authors	Andrea Esuli, Alejandro Moreo, Fabrizio Sebastiani
Abstract	Software systems trained via machine learning to automatically classify open-ended answers (a.k.a. verbatims) are by now a reality. Still, their adoption in the survey coding industry has been less widespread than it might have been. Among the factors that have hindered a more massive takeup of this technology are the effort involved in manually coding a sufficient amount of training data, the fact that small studies do not seem to justify this effort, and the fact that the process needs to be repeated anew when brand new coding tasks arise. In this paper we will argue for an approach to building verbatim classifiers that we will call “Interactive Learning”, and that addresses all the above problems. We will show that, for the same amount of training effort, interactive learning delivers much better coding accuracy than standard “non-interactive” learning. This is especially true when the amount of data we are willing to manually code is small, which makes this approach attractive also for small-scale studies. Interactive learning also lends itself to reusing previously trained classifiers for dealing with new (albeit related) coding tasks. Interactive learning also integrates better in the daily workflow of the survey specialist, and delivers a better user experience overall.
Tasks
Published	2019-03-28
URL	http://arxiv.org/abs/1903.12110v1
PDF	http://arxiv.org/pdf/1903.12110v1.pdf
PWC	https://paperswithcode.com/paper/building-automated-survey-coders-via
Repo
Framework

Two-stage Best-scored Random Forest for Large-scale Regression


Title	Two-stage Best-scored Random Forest for Large-scale Regression
Authors	Hanyuan Hang, Yingyi Chen, Johan A. K. Suykens
Abstract	We propose a novel method designed for large-scale regression problems, namely the two-stage best-scored random forest (TBRF). “Best-scored” means to select one regression tree with the best empirical performance out of a certain number of purely random regression tree candidates, and “two-stage” means to divide the original random tree splitting procedure into two: In stage one, the feature space is partitioned into non-overlapping cells; in stage two, child trees grow separately on these cells. The strengths of this algorithm can be summarized as follows: First of all, the pure randomness in TBRF leads to the almost optimal learning rates, and also makes ensemble learning possible, which resolves the boundary discontinuities long plaguing the existing algorithms. Secondly, the two-stage procedure paves the way for parallel computing, leading to computational efficiency. Last but not least, TBRF can serve as an inclusive framework where different mainstream regression strategies such as linear predictor and least squares support vector machines (LS-SVMs) can also be incorporated as value assignment approaches on leaves of the child trees, depending on the characteristics of the underlying data sets. Numerical assessments on comparisons with other state-of-the-art methods on several large-scale real data sets validate the promising prediction accuracy and high computational efficiency of our algorithm.
Tasks
Published	2019-05-09
URL	https://arxiv.org/abs/1905.03438v1
PDF	https://arxiv.org/pdf/1905.03438v1.pdf
PWC	https://paperswithcode.com/paper/190503438
Repo
Framework
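
The "best-scored" selection described in the abstract above can be illustrated with a one-dimensional toy: generate several purely random partitions of the input range, fit a piecewise-constant predictor on each, and keep the candidate with the lowest empirical error. This sketch omits the two-stage splitting and the ensemble, and its data, parameter values, and function names are assumptions rather than the authors' procedure.

```python
import bisect
import random


def fit_random_partition(xs, ys, n_cuts, rng):
    """Purely random cut points; each cell predicts its mean target."""
    cuts = sorted(rng.uniform(min(xs), max(xs)) for _ in range(n_cuts))
    sums = [0.0] * (n_cuts + 1)
    counts = [0] * (n_cuts + 1)
    for x, y in zip(xs, ys):
        cell = bisect.bisect_right(cuts, x)
        sums[cell] += y
        counts[cell] += 1
    overall = sum(ys) / len(ys)  # fallback value for empty cells
    values = [s / c if c else overall for s, c in zip(sums, counts)]
    return cuts, values


def predict(model, x):
    cuts, values = model
    return values[bisect.bisect_right(cuts, x)]


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def best_scored(xs, ys, n_candidates=20, n_cuts=4, seed=0):
    """Keep the random candidate with the lowest empirical squared error."""
    rng = random.Random(seed)
    candidates = [
        fit_random_partition(xs, ys, n_cuts, rng) for _ in range(n_candidates)
    ]
    return min(candidates, key=lambda m: mse(m, xs, ys))


# Hypothetical toy data: a step function on [0, 1].
xs = [i / 20 for i in range(21)]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]
model = best_scored(xs, ys)
print("cuts:", model[0])
print("empirical mse:", mse(model, xs, ys))
```

Since each candidate is built independently of the others, the candidates could be fitted in parallel, which loosely mirrors the parallelism argument made in the abstract.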