July 26, 2019

2361 words 12 mins read

Paper Group NANR 76



Marine Variable Linker: Exploring Relations between Changing Variables in Marine Science Literature

Title Marine Variable Linker: Exploring Relations between Changing Variables in Marine Science Literature
Authors Erwin Marsi, Pinar Øztürk, Murat V. Ardelan
Abstract We report on a demonstration system for text mining of literature in marine science and related disciplines. It automatically extracts variables ("CO2") involved in events of change/increase/decrease ("increasing CO2"), as well as co-occurrence and causal relations among these events ("increasing CO2 causes a decrease in pH in seawater"), resulting in a big knowledge graph. A web-based graphical user interface targeted at marine scientists facilitates searching, browsing and visualising events and their relations in an interactive way.
Tasks
Published 2017-04-01
URL https://www.aclweb.org/anthology/E17-3023/
PDF https://www.aclweb.org/anthology/E17-3023
PWC https://paperswithcode.com/paper/marine-variable-linker-exploring-relations
Repo
Framework
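
The extraction targets and relation types in this abstract suggest a natural graph representation. Below is a minimal, stdlib-only sketch (not the authors' code) of how change events and the causal relation from the example sentence might be stored and traversed; the event ids and tuple layout are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of change events and their
# relations stored as a small knowledge graph.

from collections import defaultdict

# Each event pairs a direction of change with a variable mention.
events = [
    ("e1", "increase", "CO2"),
    ("e2", "decrease", "pH in seawater"),
]

# Relations connect event ids, e.g. a causal link extracted from
# "increasing CO2 causes a decrease in pH in seawater".
relations = [("e1", "causes", "e2")]

# Index outgoing relations per event for simple browsing/search.
graph = defaultdict(list)
for src, rel, dst in relations:
    graph[src].append((rel, dst))

event_by_id = {eid: (direction, var) for eid, direction, var in events}

for src, edges in graph.items():
    for rel, dst in edges:
        s_dir, s_var = event_by_id[src]
        d_dir, d_var = event_by_id[dst]
        print(f"{s_dir}({s_var}) --{rel}--> {d_dir}({d_var})")
        # increase(CO2) --causes--> decrease(pH in seawater)
```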

Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds

Title Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds
Authors Mahmoud El-Haj, Paul Rayson, Scott Piao, Stephen Wattam
Abstract Creating high-quality wide-coverage multilingual semantic lexicons to support knowledge-based approaches is a challenging, time-consuming manual task. This has traditionally been performed by linguistic experts: a slow and expensive process. We present an experiment in which we adapt and evaluate crowdsourcing methods employing native speakers to generate a list of coarse-grained senses under a common multilingual semantic taxonomy for sets of words in six languages. 451 non-experts (including 427 Mechanical Turk workers) and 15 expert participants semantically annotated 250 words manually for Arabic, Chinese, English, Italian, Portuguese and Urdu lexicons. In order to avoid erroneous (spam) crowdsourced results, we used a novel task-specific two-phase filtering process where users were asked to identify synonyms in the target language, and remove erroneous senses.
Tasks Named Entity Recognition, Sentiment Analysis, Word Sense Disambiguation
Published 2017-04-01
URL https://www.aclweb.org/anthology/W17-1908/
PDF https://www.aclweb.org/anthology/W17-1908
PWC https://paperswithcode.com/paper/creating-and-validating-multilingual-semantic
Repo
Framework
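
To make the two-phase filtering idea concrete, here is a hedged Python sketch: workers must first pass a synonym-identification check before their sense annotations are kept. The function names, threshold and data layout are illustrative assumptions, not the paper's actual procedure.

```python
# Hedged sketch of a two-phase quality filter in the spirit of the abstract;
# names and thresholds are assumptions, not the paper's procedure.

def passes_synonym_check(answers, gold, min_correct=2):
    """Phase 1: keep workers who identify enough known synonyms."""
    correct = sum(1 for word, choice in answers.items() if gold.get(word) == choice)
    return correct >= min_correct

def filter_annotations(workers, gold_synonyms):
    """Phase 2: keep sense annotations only from workers who passed phase 1."""
    kept = []
    for worker in workers:
        if passes_synonym_check(worker["synonym_answers"], gold_synonyms):
            kept.extend(worker["sense_annotations"])
    return kept

gold = {"big": "large", "fast": "quick"}
workers = [
    {"synonym_answers": {"big": "large", "fast": "quick"},
     "sense_annotations": [("bank", "financial institution")]},
    {"synonym_answers": {"big": "small", "fast": "slow"},  # likely spam
     "sense_annotations": [("bank", "river bank")]},
]
print(filter_annotations(workers, gold))  # only the first worker's labels survive
```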

From Segmentation to Analyses: a Probabilistic Model for Unsupervised Morphology Induction

Title From Segmentation to Analyses: a Probabilistic Model for Unsupervised Morphology Induction
Authors Toms Bergmanis, Sharon Goldwater
Abstract A major motivation for unsupervised morphological analysis is to reduce the sparse data problem in under-resourced languages. Most previous work focuses on segmenting surface forms into their constituent morphs (taking: tak +ing), but surface form segmentation does not solve the sparse data problem as the analyses of take and taking are not connected to each other. We present a system that adapts the MorphoChains system (Narasimhan et al., 2015) to provide morphological analyses that aim to abstract over spelling differences in functionally similar morphs. This results in analyses that are not compelled to use all the orthographic material of a word (stopping: stop +ing) or limited to only that material (acidified: acid +ify +ed). On average across six typologically varied languages our system has a similar or better F-score on EMMA (a measure of underlying morpheme accuracy) than three strong baselines; moreover, the total number of distinct morphemes identified by our system is on average 12.8% lower than for Morfessor (Virpioja et al., 2013), a state-of-the-art surface segmentation system.
Tasks Morphological Analysis
Published 2017-04-01
URL https://www.aclweb.org/anthology/E17-1032/
PDF https://www.aclweb.org/anthology/E17-1032
PWC https://paperswithcode.com/paper/from-segmentation-to-analyses-a-probabilistic
Repo
Framework
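
The contrast between surface segmentation and the canonical analyses described in the abstract can be shown with a small, assumed data layout (this is not the paper's representation): canonical analyses let take and taking share a morpheme even though their surface strings differ.

```python
# Illustrative sketch of surface segmentation vs. canonical analyses.

surface_segmentation = {
    "take":      ["take"],
    "taking":    ["tak", "ing"],           # 'tak' != 'take', forms stay unconnected
    "stopping":  ["stopp", "ing"],
    "acidified": ["acid", "ified"],
}

canonical_analysis = {
    "take":      ["take"],
    "taking":    ["take", "+ing"],         # abstracts over the dropped 'e'
    "stopping":  ["stop", "+ing"],         # not forced to use the doubled 'p'
    "acidified": ["acid", "+ify", "+ed"],  # may add material absent on the surface
}

def connected(analyses, w1, w2):
    """Two words are linked if their analyses share a morpheme."""
    return bool(set(analyses[w1]) & set(analyses[w2]))

print(connected(surface_segmentation, "take", "taking"))  # False
print(connected(canonical_analysis, "take", "taking"))    # True
```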

Unsupervised Learning of Morphology with Graph Sampling

Title Unsupervised Learning of Morphology with Graph Sampling
Authors Maciej Sumalvico
Abstract We introduce a language-independent, graph-based probabilistic model of morphology, which uses transformation rules operating on whole words instead of the traditional morphological segmentation. The morphological analysis of a set of words is expressed through a graph having words as vertices and structural relationships between words as edges. We define a probability distribution over such graphs and develop a sampler based on the Metropolis-Hastings algorithm. The sampling is applied in order to determine the strength of morphological relationships between words, filter out accidental similarities and reduce the set of rules necessary to explain the data. The model is evaluated on the task of finding pairs of morphologically similar words, as well as generating new words. The results are compared to a state-of-the-art segmentation-based approach.
Tasks Morphological Analysis
Published 2017-09-01
URL https://www.aclweb.org/anthology/R17-1093/
PDF https://doi.org/10.26615/978-954-452-049-6_093
PWC https://paperswithcode.com/paper/unsupervised-learning-of-morphology-with
Repo
Framework
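
As a rough illustration of the sampling idea, the sketch below runs a Metropolis-Hastings loop over a tiny set of candidate word-pair edges. The scoring function is a crude placeholder standing in for the paper's probability model; only the accept/reject mechanics mirror the description in the abstract.

```python
# Toy Metropolis-Hastings sampler over morphological edges between words.
# The score function is a placeholder assumption, not the paper's model.

import math
import random
from os.path import commonprefix

random.seed(0)

candidate_edges = [("walk", "walking"), ("walk", "walked"), ("car", "card")]

# Known productive suffixes, standing in for what the real model would learn.
GOOD_SUFFIXES = {"ing", "ed", "s"}

def log_score(active):
    """Toy unnormalised log-probability: reward edges whose residual string
    is a known suffix, penalise every other edge."""
    total = 0.0
    for a, b in active:
        stem = commonprefix([a, b])
        suffix = (a + b).replace(stem, "", 2)  # leftover characters
        total += 2.0 if suffix in GOOD_SUFFIXES else -2.0
    return total

state = set()
for _ in range(2000):
    edge = random.choice(candidate_edges)        # propose toggling one edge
    proposal = state ^ {edge}                    # symmetric proposal
    log_ratio = log_score(proposal) - log_score(state)
    if random.random() < math.exp(min(0.0, log_ratio)):  # MH acceptance
        state = proposal

print(sorted(state))  # the 'walk' edges are favoured; ('car', 'card') usually is not
```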

A study of N-gram and Embedding Representations for Native Language Identification

Title A study of N-gram and Embedding Representations for Native Language Identification
Authors Sowmya Vajjala, Sagnik Banerjee
Abstract We report on our experiments with N-gram and embedding based feature representations for Native Language Identification (NLI) as a part of the NLI Shared Task 2017 (team name: NLI-ISU). Our best performing system on the test set for written essays had a macro F1 of 0.8264 and was based on word uni, bi and trigram features. We explored n-grams covering word, character, POS and word-POS mixed representations for this task. For embedding based feature representations, we employed both word and document embeddings. We had a relatively poor performance with all embedding representations compared to n-grams, which could be because embeddings capture semantic similarities whereas L1 differences are more stylistic in nature.
Tasks Feature Engineering, Language Acquisition, Language Identification, Native Language Identification
Published 2017-09-01
URL https://www.aclweb.org/anthology/W17-5026/
PDF https://www.aclweb.org/anthology/W17-5026
PWC https://paperswithcode.com/paper/a-study-of-n-gram-and-embedding
Repo
Framework
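
A hedged sketch of the kind of word uni/bi/trigram system the abstract describes, using scikit-learn; the toy essays, L1 labels and classifier choice are assumptions, not the authors' actual setup.

```python
# Not the authors' exact system: a word 1-3 gram TF-IDF pipeline for NLI.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for essays labelled with the writer's native language (L1).
essays = ["I am agree with this statement .",
          "This essay discuss the advantages .",
          "I strongly agree with the given topic ."]
labels = ["ES", "ZH", "HI"]  # illustrative L1 codes, not real data

model = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 3)),  # uni/bi/trigrams
    LogisticRegression(max_iter=1000),
)
model.fit(essays, labels)
print(model.predict(["I am agree with the topic ."]))
```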

Practical Neural Machine Translation

Title Practical Neural Machine Translation
Authors Rico Sennrich, Barry Haddow
Abstract Neural Machine Translation (NMT) has achieved new breakthroughs in machine translation in recent years. It has dominated recent shared translation tasks in machine translation research, and is also being quickly adopted in industry. The technical differences between NMT and the previously dominant phrase-based statistical approach require that practitioners learn new best practices for building MT systems, ranging from different hardware requirements, new techniques for handling rare words and monolingual data, to new opportunities in continued learning and domain adaptation. This tutorial is aimed at researchers and users of machine translation interested in working with NMT. The tutorial will cover a basic theoretical introduction to NMT, discuss the components of state-of-the-art systems, and provide practical advice for building NMT systems.
Tasks Machine Translation
Published 2017-04-01
URL https://www.aclweb.org/anthology/E17-5002/
PDF https://www.aclweb.org/anthology/E17-5002
PWC https://paperswithcode.com/paper/practical-neural-machine-translation
Repo
Framework

Length, Interchangeability, and External Knowledge: Observations from Predicting Argument Convincingness

Title Length, Interchangeability, and External Knowledge: Observations from Predicting Argument Convincingness
Authors Peter Potash, Robin Bhattacharya, Anna Rumshisky
Abstract In this work, we provide insight into three key aspects related to predicting argument convincingness. First, we explicitly display the power that text length possesses for predicting convincingness in an unsupervised setting. Second, we show that a bag-of-words embedding model posts state-of-the-art results on a dataset of arguments annotated for convincingness, outperforming an SVM with numerous hand-crafted features as well as recurrent neural network models that attempt to capture semantic composition. Finally, we assess the feasibility of integrating external knowledge when predicting convincingness, as arguments are often more convincing when they contain abundant information and facts. We finish by analyzing the correlations between the various models we propose.
Tasks Semantic Composition
Published 2017-11-01
URL https://www.aclweb.org/anthology/I17-1035/
PDF https://www.aclweb.org/anthology/I17-1035
PWC https://paperswithcode.com/paper/length-interchangeability-and-external
Repo
Framework
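
The unsupervised length result lends itself to a very small sketch: given a pair of arguments, prefer the longer one. The example pair below is invented for illustration.

```python
# Hedged sketch of the unsupervised length heuristic highlighted in the abstract.

def predict_more_convincing(arg_a, arg_b):
    """Return 'a' or 'b' purely by token count."""
    return "a" if len(arg_a.split()) >= len(arg_b.split()) else "b"

pairs = [
    ("School uniforms are good.",
     "School uniforms reduce peer pressure, cut morning decision time, "
     "and level visible income differences between students."),
]
for a, b in pairs:
    print(predict_more_convincing(a, b))  # 'b' -- the longer argument wins
```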

Clustering Billions of Reads for DNA Data Storage

Title Clustering Billions of Reads for DNA Data Storage
Authors Cyrus Rashtchian, Konstantin Makarychev, Miklos Racz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, Karin Strauss
Abstract Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA data storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.
Tasks
Published 2017-12-01
URL http://papers.nips.cc/paper/6928-clustering-billions-of-reads-for-dna-data-storage
PDF http://papers.nips.cc/paper/6928-clustering-billions-of-reads-for-dna-data-storage.pdf
PWC https://paperswithcode.com/paper/clustering-billions-of-reads-for-dna-data
Repo
Framework
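
The paper's algorithm is distributed and comes with accuracy guarantees; the sketch below is only a single-machine, greedy stand-in that shows the underlying task of grouping reads by edit distance under the well-separated-clusters assumption.

```python
# Not the paper's distributed algorithm: a greedy single-machine sketch of
# clustering reads by edit distance, assuming well-separated clusters.

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def greedy_cluster(reads, radius=2):
    """Assign each read to the first cluster whose representative is within
    `radius` edits; otherwise open a new cluster."""
    reps, clusters = [], []
    for r in reads:
        for rep, members in zip(reps, clusters):
            if edit_distance(r, rep) <= radius:
                members.append(r)
                break
        else:
            reps.append(r)
            clusters.append([r])
    return clusters

reads = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG"]
print(greedy_cluster(reads))  # two clusters of two reads each
```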

Streaming Text Analytics for Real-Time Event Recognition

Title Streaming Text Analytics for Real-Time Event Recognition
Authors Philippe Thomas, Johannes Kirschnick, Leonhard Hennig, Renlong Ai, Sven Schmeier, Holmer Hemsen, Feiyu Xu, Hans Uszkoreit
Abstract A huge body of continuously growing written knowledge is available on the web in the form of social media posts, RSS feeds, and news articles. Real-time information extraction from such high velocity, high volume text streams requires scalable, distributed natural language processing pipelines. We introduce such a system for fine-grained event recognition within the big data framework Flink, and demonstrate its capabilities for extracting and geo-locating mobility- and industry-related events from heterogeneous text sources. Performance analyses conducted on several large datasets show that our system achieves high throughput and maintains low latency, which is crucial when events need to be detected and acted upon in real-time. We also present promising experimental results for the event extraction component of our system, which recognizes a novel set of event types. The demo system is available at \url{http://dfki.de/sd4m-sta-demo/}.
Tasks Entity Linking, Named Entity Recognition, Relation Extraction
Published 2017-09-01
URL https://www.aclweb.org/anthology/R17-1096/
PDF https://doi.org/10.26615/978-954-452-049-6_096
PWC https://paperswithcode.com/paper/streaming-text-analytics-for-real-time-event
Repo
Framework
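
The demo system runs on Flink; as a language-agnostic illustration of the staged-pipeline idea (ingest, location tagging, event classification), here is a plain-Python generator sketch with trivial rule-based stages. The gazetteer and trigger words are made-up assumptions, not the system's resources.

```python
# Plain-Python sketch of a staged streaming pipeline; not the Flink system.

KNOWN_PLACES = {"Berlin", "Hamburg"}            # toy gazetteer (assumption)
MOBILITY_CUES = {"strike", "closure", "delay"}  # toy trigger words (assumption)

def ingest(messages):
    for msg in messages:                        # stands in for a text stream
        yield {"text": msg}

def tag_locations(stream):
    for doc in stream:
        doc["locations"] = [w.strip(".,") for w in doc["text"].split()
                            if w.strip(".,") in KNOWN_PLACES]
        yield doc

def classify_events(stream):
    for doc in stream:
        doc["event"] = ("mobility"
                        if any(cue in doc["text"].lower() for cue in MOBILITY_CUES)
                        else None)
        yield doc

messages = ["Train strike announced in Berlin.", "Nice weather today."]
pipeline = classify_events(tag_locations(ingest(messages)))
for doc in pipeline:
    if doc["event"]:
        print(doc["event"], doc["locations"])   # mobility ['Berlin']
```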

Improving Viterbi is Hard: Better Runtimes Imply Faster Clique Algorithms

Title Improving Viterbi is Hard: Better Runtimes Imply Faster Clique Algorithms
Authors Arturs Backurs, Christos Tzamos
Abstract The classic algorithm of Viterbi computes the most likely path in a Hidden Markov Model (HMM) that results in a given sequence of observations. It runs in time $O(Tn^2)$ given a sequence of T observations from an HMM with n states. Despite significant interest in the problem and prolonged effort by different communities, no known algorithm achieves more than a polylogarithmic speedup. In this paper, we explain this difficulty by providing matching conditional lower bounds. Our lower bounds are based on the assumptions that the best known algorithms for the All-Pairs Shortest Paths problem (APSP) and for the Max-Weight k-Clique problem in edge-weighted graphs are essentially tight. Finally, using a recent algorithm by Green Larsen and Williams for online Boolean matrix-vector multiplication, we get a $2^{\Omega(\sqrt{\log n})}$ speedup for the Viterbi algorithm when there are few distinct transition probabilities in the HMM.
Tasks
Published 2017-08-01
URL https://icml.cc/Conferences/2017/Schedule?showEvent=728
PDF http://proceedings.mlr.press/v70/backurs17a/backurs17a.pdf
PWC https://paperswithcode.com/paper/improving-viterbi-is-hard-better-runtimes
Repo
Framework
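
For reference, this is the classic $O(Tn^2)$ Viterbi recursion whose runtime the paper's lower bounds address, in a straightforward log-space implementation with a toy HMM (not tied to the paper's experiments).

```python
# Standard log-space Viterbi: for each of T steps, each of n target states
# scans all n predecessors, giving the O(T n^2) runtime discussed above.

import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the most likely state sequence for `obs`.

    log_init[s], log_trans[s][t], log_emit[s][o] are log-probabilities."""
    V = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for t in states:                                       # n target states ...
            best_s = max(states, key=lambda s: V[-1][s] + log_trans[s][t])
            scores[t] = V[-1][best_s] + log_trans[best_s][t] + log_emit[t][o]
            ptr[t] = best_s                                    # ... times n predecessors
        V.append(scores)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["Rain", "Sun"]
log_init = {"Rain": math.log(0.6), "Sun": math.log(0.4)}
log_trans = {"Rain": {"Rain": math.log(0.7), "Sun": math.log(0.3)},
             "Sun":  {"Rain": math.log(0.4), "Sun": math.log(0.6)}}
log_emit = {"Rain": {"walk": math.log(0.1), "umbrella": math.log(0.9)},
            "Sun":  {"walk": math.log(0.8), "umbrella": math.log(0.2)}}
print(viterbi(["umbrella", "walk", "walk"], states, log_init, log_trans, log_emit))
# ['Rain', 'Sun', 'Sun']
```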

Ambiguss, a game for building a Sense Annotated Corpus for French

Title Ambiguss, a game for building a Sense Annotated Corpus for French
Authors Mathieu Lafourcade, Nathalie Le Brun
Abstract
Tasks Common Sense Reasoning
Published 2017-01-01
URL https://www.aclweb.org/anthology/W17-6920/
PDF https://www.aclweb.org/anthology/W17-6920
PWC https://paperswithcode.com/paper/ambiguss-a-game-for-building-a-sense
Repo
Framework

On-line Dialogue Policy Learning with Companion Teaching

Title On-line Dialogue Policy Learning with Companion Teaching
Authors Lu Chen, Runzhe Yang, Cheng Chang, Zihao Ye, Xiang Zhou, Kai Yu
Abstract On-line dialogue policy learning is the key to building evolvable conversational agents in real-world scenarios. A poor initial policy can easily lead to bad user experience and consequently fail to attract sufficient users for policy training. A novel framework, companion teaching, is proposed to include a human teacher in the dialogue policy training loop to address the cold start problem. Here, the dialogue policy is trained using not only the user's reward, but also the teacher's example action as well as an estimated immediate reward at turn level. Simulation experiments showed that, with a small number of human teaching dialogues, the proposed approach can effectively improve user experience at the beginning and smoothly lead to good performance with more user interaction data.
Tasks Dialogue Management
Published 2017-04-01
URL https://www.aclweb.org/anthology/E17-2032/
PDF https://www.aclweb.org/anthology/E17-2032
PWC https://paperswithcode.com/paper/on-line-dialogue-policy-learning-with
Repo
Framework
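
A hedged sketch of the turn-level signal mixing described in the abstract: the update combines the user's reward, the teacher's estimated immediate reward, and an imitation bonus when the learner matches the teacher's example action. The tabular update rule and weights are assumptions, not the paper's algorithm.

```python
# Toy turn-level update mixing user reward, teacher reward and imitation;
# the rule and constants are assumptions for illustration only.

from collections import defaultdict

Q = defaultdict(float)           # Q[(state, action)]
ALPHA, IMITATION_BONUS = 0.1, 1.0

def companion_update(state, action, user_reward, teacher_reward, teacher_action):
    target = user_reward
    if teacher_reward is not None:         # teacher's estimated immediate reward
        target += teacher_reward
    if teacher_action is not None and action == teacher_action:
        target += IMITATION_BONUS          # encourage the demonstrated action
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# One simulated turn: no user reward yet, teacher demonstrated 'request_area'.
companion_update("greet", "request_area",
                 user_reward=0.0, teacher_reward=0.5, teacher_action="request_area")
print(Q[("greet", "request_area")])  # ~0.15
```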

Acquiring Predicate Paraphrases from News Tweets

Title Acquiring Predicate Paraphrases from News Tweets
Authors Vered Shwartz, Gabriel Stanovsky, Ido Dagan
Abstract We present a simple method for ever-growing extraction of predicate paraphrases from news headlines in Twitter. Analysis of the output of ten weeks of collection shows that the accuracy of paraphrases with different support levels is estimated between 60-86%. We also demonstrate that our resource is to a large extent complementary to existing resources, providing many novel paraphrases. Our resource is publicly available, continuously expanding based on daily news.
Tasks Natural Language Inference, Question Answering
Published 2017-08-01
URL https://www.aclweb.org/anthology/S17-1019/
PDF https://www.aclweb.org/anthology/S17-1019
PWC https://paperswithcode.com/paper/acquiring-predicate-paraphrases-from-news
Repo
Framework
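
A hedged sketch of the acquisition idea: headlines from the same day that mention the same argument pair propose their predicates as paraphrase candidates, with support counted per predicate pair. The proposition tuples below are invented stand-ins for extracted headlines.

```python
# Toy paraphrase acquisition: same day + same argument pair -> candidate pair.

from collections import defaultdict
from itertools import combinations

# (day, argument_0, predicate, argument_1) tuples, as if extracted from headlines.
headline_props = [
    ("2017-06-01", "senate", "approves", "bill"),
    ("2017-06-01", "senate", "passes",   "bill"),
    ("2017-06-01", "mayor",  "visits",   "school"),
]

by_key = defaultdict(set)
for day, a0, pred, a1 in headline_props:
    by_key[(day, a0, a1)].add(pred)       # same day, same argument pair

paraphrases = defaultdict(int)            # support counts per predicate pair
for preds in by_key.values():
    for p1, p2 in combinations(sorted(preds), 2):
        paraphrases[(p1, p2)] += 1

print(dict(paraphrases))  # {('approves', 'passes'): 1}
```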

Biasing Attention-Based Recurrent Neural Networks Using External Alignment Information

Title Biasing Attention-Based Recurrent Neural Networks Using External Alignment Information
Authors Tamer Alkhouli, Hermann Ney
Abstract
Tasks Machine Translation
Published 2017-09-01
URL https://www.aclweb.org/anthology/W17-4711/
PDF https://www.aclweb.org/anthology/W17-4711
PWC https://paperswithcode.com/paper/biasing-attention-based-recurrent-neural
Repo
Framework

Universal Joint Morph-Syntactic Processing: The Open University of Israel’s Submission to The CoNLL 2017 Shared Task

Title Universal Joint Morph-Syntactic Processing: The Open University of Israel’s Submission to The CoNLL 2017 Shared Task
Authors Amir More, Reut Tsarfaty
Abstract We present the Open University's submission to the CoNLL 2017 Shared Task on multilingual parsing from raw text to Universal Dependencies. The core of our system is a joint morphological disambiguator and syntactic parser which accepts morphologically analyzed surface tokens as input and returns morphologically disambiguated dependency trees as output. Our parser requires a lattice as input, so we generate morphological analyses of surface tokens using a data-driven morphological analyzer that derives its lexicon from the UD training corpora, and we rely on UDPipe for sentence segmentation and surface-level tokenization. We report that our official macro-average LAS is 56.56. Although our model is not as performant as many others, it does not make use of neural networks; therefore we do not rely on word embeddings or any other data source other than the corpora themselves. In addition, we show the utility of a lexicon-backed morphological analyzer for the MRL Modern Hebrew. We use our results on Modern Hebrew to argue that the UD community should define a UD-compatible standard for access to lexical resources, which we argue is crucial for MRLs and low-resource languages in particular.
Tasks Tokenization, Word Embeddings
Published 2017-08-01
URL https://www.aclweb.org/anthology/K17-3027/
PDF https://www.aclweb.org/anthology/K17-3027
PWC https://paperswithcode.com/paper/universal-joint-morph-syntactic-processing
Repo
Framework
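
To illustrate the lattice input the parser consumes, here is an assumed representation (not the system's): each surface token carries several candidate morphological analyses, and the joint model must choose one per token. The greedy per-token choice below is only a stand-in for the actual joint disambiguation, and the analyses and scores are invented.

```python
# Assumed lattice layout: per-token lists of candidate analyses with scores.

lattice = [
    ("hksh", [  # an ambiguous transliterated surface token (invented example)
        {"analysis": "h+ksh", "pos": "DET+NOUN",  "score": 0.6},
        {"analysis": "hksh",  "pos": "VERB",      "score": 0.4},
    ]),
    ("qth", [
        {"analysis": "qth",   "pos": "ADJ",       "score": 0.7},
        {"analysis": "q+th",  "pos": "NOUN+PRON", "score": 0.3},
    ]),
]

# Trivial stand-in for joint disambiguation: greedily take the best-scoring
# analysis per token (the real system scores whole dependency trees).
disambiguated = [(token, max(options, key=lambda o: o["score"]))
                 for token, options in lattice]
for token, choice in disambiguated:
    print(token, "->", choice["analysis"], choice["pos"])
```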