May 5, 2019


Paper Group NANR 120


Ensemble Learning for Multi-Source Neural Machine Translation

Title Ensemble Learning for Multi-Source Neural Machine Translation
Authors Ekaterina Garmash, Christof Monz
Abstract In this paper we describe and evaluate methods to perform ensemble prediction in neural machine translation (NMT). We compare two methods of ensemble set induction: sampling parameter initializations for an NMT system, which is a relatively established method in NMT (Sutskever et al., 2014), and NMT systems translating from different source languages into the same target language, i.e., multi-source ensembles, a method recently introduced by Firat et al. (2016). We are motivated by the observation that for different language pairs systems make different types of mistakes. We propose several methods with different degrees of parameterization to combine individual predictions of NMT systems so that they mutually compensate for each other's mistakes and improve overall performance. We find that the biggest improvements can be obtained from a context-dependent weighting scheme for multi-source ensembles. This result offers stronger support for the linguistic motivation of using multi-source ensembles than previous approaches. Evaluation is carried out for German and French into English translation. The best multi-source ensemble method achieves an improvement of up to 2.2 BLEU points over the strongest single-source ensemble baseline, and a 2 BLEU improvement over a multi-source ensemble baseline.
Tasks Machine Translation
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-1133/
PDF https://www.aclweb.org/anthology/C16-1133
PWC https://paperswithcode.com/paper/ensemble-learning-for-multi-source-neural
Repo
Framework
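
A minimal sketch of the kind of context-dependent weighting the abstract describes: each component system produces a next-token distribution, and a small gating function turns the current decoding context into per-model mixture weights. The gating parameterization, vocabulary size, and model interface below are illustrative assumptions, not the paper's implementation.

```python
# Sketch: context-dependent weighting of per-model next-token distributions,
# in the spirit of the multi-source ensembles described above. The gating
# network and model interface are illustrative assumptions, not the paper's code.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ensemble_step(model_probs, context, gate_weights):
    """Combine next-token distributions from several NMT systems.

    model_probs:  (n_models, vocab_size) array of per-model P(y_t | ...)
    context:      (d,) decoder-state feature vector for the current step
    gate_weights: (n_models, d) parameters of a linear gating function
    """
    # Context-dependent mixture weights: one scalar per model, per time step.
    alphas = softmax(gate_weights @ context)   # (n_models,)
    # Convex combination of the models' distributions stays a distribution.
    return alphas @ model_probs                # (vocab_size,)

# Toy usage: 3 models, vocabulary of 5, 4-dimensional context.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=3)      # each row sums to 1
mixed = ensemble_step(probs, rng.normal(size=4), rng.normal(size=(3, 4)))
print(mixed, mixed.sum())                      # still sums to 1
```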

Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation

Title Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation
Authors Emmanuel Abbe, Colin Sandon
Abstract The stochastic block model (SBM) has long been studied in machine learning and network science as a canonical model for clustering and community detection. In recent years, new developments have demonstrated the presence of threshold phenomena for this model, which have set new challenges for algorithms. For the detection problem in symmetric SBMs, Decelle et al. conjectured that the so-called Kesten-Stigum (KS) threshold can be achieved efficiently. This was proved for two communities, but remained open for three or more communities. We prove this conjecture here, obtaining a more general result that applies to arbitrary SBMs with linear size communities. The developed algorithm is a linearized acyclic belief propagation (ABP) algorithm, which mitigates the effects of cycles while provably achieving the KS threshold in $O(n \ln n)$ time. This extends prior methods by achieving the KS threshold universally while reducing or preserving the computational complexity. ABP is also connected to a power iteration method on a generalized nonbacktracking operator, formalizing the spectral-message passing interplay described in Krzakala et al., and extending results from Bordenave et al.
Tasks Community Detection
Published 2016-12-01
URL http://papers.nips.cc/paper/6365-achieving-the-ks-threshold-in-the-general-stochastic-block-model-with-linearized-acyclic-belief-propagation
PDF http://papers.nips.cc/paper/6365-achieving-the-ks-threshold-in-the-general-stochastic-block-model-with-linearized-acyclic-belief-propagation.pdf
PWC https://paperswithcode.com/paper/achieving-the-ks-threshold-in-the-general
Repo
Framework
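
The spectral side of the abstract's message-passing connection can be illustrated with plain power iteration on the standard nonbacktracking matrix of a small graph. The paper's operator is a generalization of this, so the sketch below only shows the textbook version.

```python
# Sketch: power iteration on the (standard) nonbacktracking matrix B of a
# graph, illustrating the spectral side of the message-passing connection.
# The paper uses a *generalized* operator; this is the plain textbook B.
import numpy as np

def nonbacktracking_matrix(edges):
    """B is indexed by directed edges: B[(u,v),(v,w)] = 1 iff w != u."""
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    index = {e: i for i, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for (u, v) in directed:
        for (v2, w) in directed:
            if v2 == v and w != u:
                B[index[(u, v)], index[(v2, w)]] = 1.0
    return B, directed

def power_iteration(B, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=B.shape[0])
    for _ in range(iters):
        x = B @ x
        x /= np.linalg.norm(x)
    return x   # approximates the leading eigenvector of B

# Toy graph: two triangles joined by a bridge (two loose "communities").
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
B, directed = nonbacktracking_matrix(edges)
vec = power_iteration(B)
# Aggregate edge scores back onto vertices: sum over incoming edges.
scores = {}
for (u, v), val in zip(directed, vec):
    scores[v] = scores.get(v, 0.0) + val
print(scores)
```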

CSIRO Data61 at the WNUT Geo Shared Task

Title CSIRO Data61 at the WNUT Geo Shared Task
Authors Gaya Jayasinghe, Brian Jin, James Mchugh, Bella Robinson, Stephen Wan
Abstract In this paper, we describe CSIRO Data61's participation in the Geolocation shared task at the Workshop for Noisy User-generated Text. Our approach was to use ensemble methods to capitalise on four component methods: heuristics based on metadata, a label propagation method, timezone text classifiers, and an information retrieval approach. The ensembles we explored examined both the role of language technologies in geolocation prediction and the use of hard voting and cascading ensemble methods. Based on the accuracy of city-level predictions, our systems were the best performing submissions at this year's shared task. Furthermore, when estimating the latitude and longitude of a user, our median error distance was accurate to within 30 kilometers.
Tasks Information Retrieval
Published 2016-12-01
URL https://www.aclweb.org/anthology/W16-3929/
PDF https://www.aclweb.org/anthology/W16-3929
PWC https://paperswithcode.com/paper/csiro-data61-at-the-wnut-geo-shared-task
Repo
Framework
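
A toy sketch of the two ensemble styles named in the abstract, hard voting and cascading, over city-level predictions. The component predictors and the confidence threshold are made-up stand-ins for the paper's four methods.

```python
# Sketch: hard voting vs. a cascade over component geolocators, with mocked
# component outputs standing in for the paper's four methods.
from collections import Counter

def hard_vote(predictions):
    """Majority vote over component predictions (ties broken by count order)."""
    return Counter(predictions).most_common(1)[0][0]

def cascade(components, tweet):
    """Try components in (assumed) decreasing-precision order; return the
    first answer that is confident enough."""
    for predict in components:
        city, confidence = predict(tweet)
        if city is not None and confidence >= 0.5:   # threshold is illustrative
            return city
    return None

# Hypothetical component predictors, e.g. metadata heuristics first, then a
# lower-precision fallback such as a timezone text classifier.
components = [
    lambda t: ("sydney-au", 0.9) if "sydney" in t else (None, 0.0),
    lambda t: ("melbourne-au", 0.6),
]
print(hard_vote(["sydney-au", "melbourne-au", "sydney-au"]))   # sydney-au
print(cascade(components, "lovely day in sydney"))             # sydney-au
```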

SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis

Title SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis
Authors Ina Roesiger
Abstract This paper presents SciCorp, a corpus of full-text English scientific papers of two disciplines, genetics and computational linguistics. The corpus comprises co-reference and bridging information as well as information status labels. Since SciCorp is annotated with both labels and the respective co-referent and bridging links, we believe it is a valuable resource for NLP researchers working on scientific articles or on applications such as co-reference resolution, bridging resolution or information status classification. The corpus has been reliably annotated by independent human coders with moderate inter-annotator agreement (average kappa = 0.71). In total, we have annotated 14 full papers containing 61,045 tokens and marked 8,708 definite noun phrases. The paper describes in detail the annotation scheme as well as the resulting corpus. The corpus is available for download in two different formats: in an offset-based format and for the co-reference annotations in the widely-used, tabular CoNLL-2012 format.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1275/
PDF https://www.aclweb.org/anthology/L16-1275
PWC https://paperswithcode.com/paper/scicorp-a-corpus-of-english-scientific
Repo
Framework
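
For readers unfamiliar with the agreement figure quoted above, here is a small sketch of Cohen's kappa for two coders. The toy labels are invented; the paper's 0.71 is an average over its own annotation layers.

```python
# Sketch: Cohen's kappa for two coders over categorical labels, the kind of
# agreement statistic behind the reported average kappa of 0.71.
from collections import Counter

def cohens_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)     # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy information-status labels from two hypothetical coders.
coder1 = ["new", "given", "given", "bridging", "new", "given"]
coder2 = ["new", "given", "new",   "bridging", "new", "given"]
print(round(cohens_kappa(coder1, coder2), 3))
```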

An Analysis of the Ability of Statistical Language Models to Capture the Structural Properties of Language

Title An Analysis of the Ability of Statistical Language Models to Capture the Structural Properties of Language
Authors Aneiss Ghodsi, John DeNero
Abstract
Tasks Language Modelling, Text Generation
Published 2016-09-01
URL https://www.aclweb.org/anthology/W16-6637/
PDF https://www.aclweb.org/anthology/W16-6637
PWC https://paperswithcode.com/paper/an-analysis-of-the-ability-of-statistical
Repo
Framework

Mining the Spoken Wikipedia for Speech Data and Beyond

Title Mining the Spoken Wikipedia for Speech Data and Beyond
Authors Arne Köhn, Florian Stegen, Timo Baumann
Abstract We present a corpus of time-aligned spoken data of Wikipedia articles, as well as the pipeline that allows such corpora to be generated for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages, hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, for which we align 27h in full sentences and 157h with some missing words. Results are publicly available.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1735/
PDF https://www.aclweb.org/anthology/L16-1735
PWC https://paperswithcode.com/paper/mining-the-spoken-wikipedia-for-speech-data
Repo
Framework
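
A sketch of the bookkeeping that could split aligned audio into "full sentences" versus "sentences with some missing words", mirroring the hour counts quoted above. The per-word alignment records are an assumed data structure, not the pipeline's actual format.

```python
# Sketch: tallying aligned hours into fully aligned sentences vs. sentences
# with some unaligned words. Each sentence is a list of (word, start, end)
# records, with end=None for words the aligner could not place; this record
# format is an assumption for illustration.
def tally(sentences):
    full, partial = 0.0, 0.0
    for words in sentences:
        start = min(s for _, s, _ in words)
        end = max(e for _, _, e in words if e is not None)
        duration = end - start
        if all(e is not None for _, _, e in words):
            full += duration
        else:
            partial += duration
    return full / 3600, partial / 3600   # hours

sents = [
    [("the", 0.0, 0.3), ("cat", 0.3, 0.7)],     # fully aligned sentence
    [("sat", 1.0, 1.4), ("down", 1.4, None)],   # one word missing
]
print(tally(sents))
```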

TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation

Title TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation
Authors Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin, Béatrice Daille
Abstract Keyphrase extraction is the task of finding phrases that represent the most important topics developed in a document. The output keyphrases of automatic keyphrase extraction methods for test documents are typically evaluated by comparing them to manually assigned reference keyphrases. Each output keyphrase is considered correct if it matches one of the reference keyphrases. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective, and evaluating by exact matching underestimates performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, the manual evaluations can be used to validate new evaluation measures. Indeed, an evaluation measure that is highly correlated with the manual evaluation is appropriate for the evaluation of automatic keyphrase extraction methods.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1304/
PDF https://www.aclweb.org/anthology/L16-1304
PWC https://paperswithcode.com/paper/termith-eval-a-french-standard-based-resource
Repo
Framework
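
A minimal sketch of the exact-match evaluation the abstract argues underestimates performance, together with the kind of correlation check against human scores that the resource enables. Spearman's rho and the toy numbers are my choices, not the paper's protocol.

```python
# Sketch: exact-match keyphrase scoring, plus a rank-correlation check of an
# automatic measure against human judgements.
from scipy.stats import spearmanr

def exact_match_f1(system, reference):
    sys_set = {k.lower() for k in system}
    ref_set = {k.lower() for k in reference}
    tp = len(sys_set & ref_set)
    p = tp / len(sys_set) if sys_set else 0.0
    r = tp / len(ref_set) if ref_set else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

print(exact_match_f1(["keyphrase extraction", "topic"],
                     ["keyphrase extraction", "evaluation"]))

# Validate a measure against manual scores: high rank correlation means the
# automatic measure is a reasonable proxy for human evaluation.
automatic = [0.33, 0.50, 0.10, 0.80]   # measure's score per document (toy)
manual    = [2.0,  3.0,  1.0,  4.0]    # human score per document (toy)
rho, _ = spearmanr(automatic, manual)
print(rho)                              # 1.0 for this toy example
```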

A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza

Title A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza
Authors Xabier Sarasola, Eva Navas, David Tavarez, Daniel Erro, Ibon Saratxaga, Inma Hernaez
Abstract This paper describes the characteristics and structure of a Basque singing voice database of bertsolaritza. Bertsolaritza is a popular singing style from the Basque Country, sung exclusively in Basque, that is improvised and performed a cappella. The database is designed to be used in statistical singing voice synthesis for the bertsolaritza style. Starting from the recordings and transcriptions of numerous singers, diarization and phoneme alignment experiments have been carried out to extract the singing voice from the recordings and create phoneme alignments. These labelling processes were performed with standard speech processing techniques, and the results show that these techniques can be used for this specific singing style.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1120/
PDF https://www.aclweb.org/anthology/L16-1120
PWC https://paperswithcode.com/paper/a-singing-voice-database-in-basque-for
Repo
Framework

Analyzing Linguistic Knowledge in Sequential Model of Sentence

Title Analyzing Linguistic Knowledge in Sequential Model of Sentence
Authors Peng Qian, Xipeng Qiu, Xuanjing Huang
Abstract
Tasks Language Modelling, Text Generation
Published 2016-11-01
URL https://www.aclweb.org/anthology/D16-1079/
PDF https://www.aclweb.org/anthology/D16-1079
PWC https://paperswithcode.com/paper/analyzing-linguistic-knowledge-in-sequential
Repo
Framework

Identification of Drug-Related Medical Conditions in Social Media

Title Identification of Drug-Related Medical Conditions in Social Media
Authors François Morlane-Hondère, Cyril Grouin, Pierre Zweigenbaum
Abstract Monitoring social media has been shown to be an interesting approach for the early detection of drug adverse effects. In this paper, we describe a system which extracts medical entities in French drug reviews written by users. We focus on the identification of medical conditions, which is based on the concept of post-coordination: we first extract minimal medical-related entities (pain, stomach), then we combine them to identify complex ones (It was the worst [pain I ever felt in my stomach]). These two steps are respectively performed by two classifiers, the first based on Conditional Random Fields and the second on Support Vector Machines. The overall results of the minimal entity classifier are the following: P=0.926; R=0.849; F1=0.886. A thorough analysis of the feature set shows that, when combined with word lemmas, clusters generated by word2vec are the most valuable features. When trained on the output of the first classifier, the second classifier's performance is the following: P=0.683; R=0.956; F1=0.797. The addition of post-processing rules did not add any significant global improvement but was found to modify the precision/recall ratio.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1320/
PDF https://www.aclweb.org/anthology/L16-1320
PWC https://paperswithcode.com/paper/identification-of-drug-related-medical
Repo
Framework
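
A hedged sketch of the post-coordination step only: given minimal entities from a first-stage tagger (mocked here), a second classifier decides whether a pair of entities forms one complex medical condition. The features, training pairs, and use of scikit-learn are illustrative assumptions; the paper uses CRFs for stage one and SVMs for stage two.

```python
# Sketch of post-coordination: an SVM over candidate entity pairs decides
# whether they combine into one complex condition. Toy features and data.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def pair_features(e1, e2, tokens):
    return {
        "type_pair": e1["type"] + "+" + e2["type"],
        "token_distance": abs(e2["start"] - e1["end"]),
        "has_prep_between": any(t in ("in", "of") for t in tokens[e1["end"]:e2["start"]]),
    }

# Toy training pairs: (features, combines-into-one-condition?)
train = [
    ({"type_pair": "symptom+anatomy", "token_distance": 4, "has_prep_between": True}, 1),
    ({"type_pair": "symptom+anatomy", "token_distance": 12, "has_prep_between": False}, 0),
    ({"type_pair": "drug+anatomy", "token_distance": 2, "has_prep_between": False}, 0),
    ({"type_pair": "symptom+anatomy", "token_distance": 3, "has_prep_between": True}, 1),
]
vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
clf = LinearSVC().fit(X, [y for _, y in train])

# "It was the worst [pain I ever felt in my stomach]"
tokens = "it was the worst pain i ever felt in my stomach".split()
pain = {"type": "symptom", "start": 4, "end": 5}
stomach = {"type": "anatomy", "start": 10, "end": 11}
x = vec.transform([pair_features(pain, stomach, tokens)])
print(clf.predict(x))   # expected: [1] -> combine into one complex condition
```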

Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus

Title Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus
Authors SoHyun Park, Afsaneh Fazly, Annie Lee, Brandon Seibel, Wenjie Zi, Paul Cook
Abstract In this paper we consider the problem of out-of-vocabulary term classification in web forum text from the automotive domain. We develop a set of nine domain- and application-specific categories for out-of-vocabulary terms. We then propose a supervised approach to classify out-of-vocabulary terms according to these categories, drawing on features based on word embeddings, and linguistic knowledge of common properties of out-of-vocabulary terms. We show that the features based on word embeddings are particularly informative for this task. The categories that we predict could serve as a preliminary, automatically-generated source of lexical knowledge about out-of-vocabulary terms. Furthermore, we show that this approach can be adapted to give a semi-automated method for identifying out-of-vocabulary terms of a particular category, automotive named entities, that is of particular interest to us.
Tasks Word Embeddings
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1474/
PDF https://www.aclweb.org/anthology/L16-1474
PWC https://paperswithcode.com/paper/classifying-out-of-vocabulary-terms-in-a
Repo
Framework
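
A small sketch of the embedding-feature classification set-up the abstract describes. The category subset, terms, and random stand-in embeddings are invented for illustration.

```python
# Sketch: classifying OOV terms into domain categories using word-embedding
# features, the signal the abstract found most informative. Embeddings and
# labels here are random stand-ins for real vectors and annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression

CATEGORIES = ["named-entity", "abbreviation", "misspelling"]  # 3 of the 9

def embed(term):
    # Stand-in for looking up a 50-d embedding of an OOV term (e.g. from
    # embeddings trained on the forum corpus itself). Deterministic per term
    # within a run, so train and test lookups agree.
    rng_t = np.random.default_rng(abs(hash(term)) % (2**32))
    return rng_t.normal(size=50)

terms = ["vtec", "atm", "sparkplugg", "camry", "abs", "exaust"]
labels = ["named-entity", "abbreviation", "misspelling",
          "named-entity", "abbreviation", "misspelling"]
X = np.stack([embed(t) for t in terms])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(np.stack([embed("vtec")])))   # recovers its training label
```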

IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis

Title IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis
Authors Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, Chris Biemann
Abstract
Tasks Aspect-Based Sentiment Analysis, Opinion Mining, Sentiment Analysis
Published 2016-06-01
URL https://www.aclweb.org/anthology/S16-1174/
PDF https://www.aclweb.org/anthology/S16-1174
PWC https://paperswithcode.com/paper/iit-tuda-at-semeval-2016-task-5-beyond
Repo
Framework

Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource

Title Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Authors Anaïs Tack, Thomas François, Anne-Laure Ligozat, Cédrick Fairon
Abstract This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in the learning of French as a foreign language. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use this data to automatically address the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions are able to capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate in general (87.4% to 92.3%), they do not yet seem to cover the learners' lack of knowledge very well.
Tasks Lexical Simplification
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1035/
PDF https://www.aclweb.org/anthology/L16-1035
PWC https://paperswithcode.com/paper/evaluating-lexical-simplification-and
Repo
Framework
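
A sketch of one plausible reading of the abstract's derivation: assign each word the first CEFR level at which its FLELex-style frequency is nonzero, then count the words of a text expected to be known at a learner's level. The tiny lexicon and the derivation rule are assumptions, not necessarily the paper's exact method.

```python
# Sketch: one difficulty level per word from per-level frequencies, then a
# known-word count for a given learner level. Toy lexicon, toy numbers.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

lexicon = {   # word -> frequency per CEFR level (invented values)
    "chat":      [9.1, 8.0, 6.2, 5.0, 4.1, 3.3],
    "manger":    [7.5, 7.0, 6.0, 4.8, 4.0, 3.0],
    "toutefois": [0.0, 0.0, 2.1, 3.5, 4.0, 4.2],
}

def difficulty(word):
    # First level at which the word appears with nonzero frequency.
    freqs = lexicon[word]
    return next(LEVELS[i] for i, f in enumerate(freqs) if f > 0)

def known_words(text_words, learner_level):
    cutoff = LEVELS.index(learner_level)
    return [w for w in text_words if w in lexicon
            and LEVELS.index(difficulty(w)) <= cutoff]

print(difficulty("toutefois"))                              # B1
print(known_words(["chat", "toutefois", "manger"], "A2"))   # chat, manger
```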

Event Coreference Resolution with Multi-Pass Sieves

Title Event Coreference Resolution with Multi-Pass Sieves
Authors Jing Lu, Vincent Ng
Abstract Multi-pass sieve approaches have been successfully applied to entity coreference resolution and many other tasks in natural language processing (NLP), owing in part to the ease of designing high-precision rules for these tasks. However, the same is not true for event coreference resolution: typically lying towards the end of the standard information extraction pipeline, an event coreference resolver assumes as input the noisy outputs of its upstream components such as the trigger identification component and the entity coreference resolution component. The difficulty in designing high-precision rules makes it challenging to successfully apply a multi-pass sieve approach to event coreference resolution. In this paper, we investigate this challenge, proposing the first multi-pass sieve approach to event coreference resolution. When evaluated on the version of the KBP 2015 corpus available to the participants of EN Task 2 (Event Nugget Detection and Coreference), our approach achieves an Avg F-score of 40.32%, outperforming the best participating system by 0.67% in Avg F-score.
Tasks Coreference Resolution
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1631/
PDF https://www.aclweb.org/anthology/L16-1631
PWC https://paperswithcode.com/paper/event-coreference-resolution-with-multi-pass
Repo
Framework
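
The multi-pass sieve control flow itself is easy to sketch: sieves run in (assumed) decreasing order of precision, each may merge event clusters, and later sieves see the merges made by earlier ones. The two example sieves below are invented stand-ins for the paper's hand-built rules.

```python
# Sketch: generic multi-pass sieve loop over event mentions. Clusters start
# as singletons; each sieve merges clusters containing a compatible pair.
def sieve_exact_trigger(m1, m2):
    return m1["trigger"] == m2["trigger"]

def sieve_same_type_and_arg(m1, m2):
    return m1["type"] == m2["type"] and m1["args"] & m2["args"]

def multi_pass(mentions, sieves):
    clusters = [{i} for i in range(len(mentions))]   # start with singletons
    for sieve in sieves:                             # high -> low precision
        merged = True
        while merged:
            merged = False
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    if any(sieve(mentions[i], mentions[j])
                           for i in clusters[a] for j in clusters[b]):
                        clusters[a] |= clusters.pop(b)
                        merged = True
                        break
                if merged:
                    break
    return clusters

mentions = [
    {"trigger": "attack", "type": "conflict", "args": {"Baghdad"}},
    {"trigger": "attack", "type": "conflict", "args": {"Mosul"}},
    {"trigger": "strike", "type": "conflict", "args": {"Baghdad"}},
]
print(multi_pass(mentions, [sieve_exact_trigger, sieve_same_type_and_arg]))
```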

Merging Data Resources for Inflectional and Derivational Morphology in Czech

Title Merging Data Resources for Inflectional and Derivational Morphology in Czech
Authors Zdeněk Žabokrtský, Magda Ševčíková, Milan Straka, Jonáš Vidra, Adéla Limburská
Abstract The paper deals with merging two complementary resources of morphological data previously existing for Czech, namely the inflectional dictionary MorfFlex CZ and the recently developed lexical network DeriNet. The MorfFlex CZ dictionary has been used by a morphological analyzer capable of analyzing/generating several million Czech word forms according to the rules of Czech inflection. The DeriNet network contains several hundred thousand Czech lemmas interconnected with links corresponding to derivational relations (relations between base words and words derived from them). After summarizing basic characteristics of both resources, the process of merging is described, focusing on both rather technical aspects (growth of the data, measuring the quality of newly added derivational relations) and linguistic issues (treating lexical homonymy and vowel/consonant alternations). The resulting resource contains 970 thousand lemmas connected with 715 thousand derivational relations and is publicly available on the web under the CC-BY-NC-SA license. The data were incorporated in the MorphoDiTa library version 2.0 (which provides morphological analysis, generation, tagging and lemmatization for Czech) and can be browsed and searched by two web tools (DeriNet Viewer and DeriNet Search tool).
Tasks Lemmatization, Morphological Analysis
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1208/
PDF https://www.aclweb.org/anthology/L16-1208
PWC https://paperswithcode.com/paper/merging-data-resources-for-inflectional-and
Repo
Framework
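
A toy sketch of the lemma-based merging described above, with a guard for lexical homonymy by matching on part of speech as well as the lemma string. The data structures are simplified inventions, not the MorfFlex CZ or DeriNet formats.

```python
# Sketch: merging an inflectional dictionary with a derivational network by
# matching (lemma, POS) keys, so homonymous lemma strings stay distinct.
inflectional = {   # (lemma, POS) -> inflected word forms
    ("učit", "V"): ["učit", "učím", "učíš"],
    ("učitel", "N"): ["učitel", "učitele", "učitelům"],
}
derivational = [   # ((base lemma, POS), (derived lemma, POS))
    (("učit", "V"), ("učitel", "N")),
]

def merge(inflectional, derivational):
    merged = {key: {"forms": forms, "derived_from": None}
              for key, forms in inflectional.items()}
    added = 0
    for base, derived in derivational:
        # Homonymy guard: match lemma *and* POS, not the string alone.
        if base in merged and derived in merged:
            merged[derived]["derived_from"] = base
            added += 1
    return merged, added

merged, n = merge(inflectional, derivational)
print(n, merged[("učitel", "N")]["derived_from"])   # 1 ('učit', 'V')
```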