Paper Group NANR 120
Ensemble Learning for Multi-Source Neural Machine Translation
Title | Ensemble Learning for Multi-Source Neural Machine Translation |
Authors | Ekaterina Garmash, Christof Monz |
Abstract | In this paper we describe and evaluate methods to perform ensemble prediction in neural machine translation (NMT). We compare two methods of ensemble set induction: sampling parameter initializations for an NMT system, which is a relatively established method in NMT (Sutskever et al., 2014), and NMT systems translating from different source languages into the same target language, i.e., multi-source ensembles, a method recently introduced by Firat et al. (2016). We are motivated by the observation that systems for different language pairs make different types of mistakes. We propose several methods with different degrees of parameterization to combine individual predictions of NMT systems so that they mutually compensate for each other's mistakes and improve overall performance. We find that the biggest improvements can be obtained from a context-dependent weighting scheme for multi-source ensembles. This result offers stronger support for the linguistic motivation of using multi-source ensembles than previous approaches. Evaluation is carried out for German and French into English translation. The best multi-source ensemble method achieves an improvement of up to 2.2 BLEU points over the strongest single-source ensemble baseline, and a 2 BLEU improvement over a multi-source ensemble baseline. |
Tasks | Machine Translation |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1133/ |
PWC | https://paperswithcode.com/paper/ensemble-learning-for-multi-source-neural |
Repo | |
Framework | |
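The combination schemes compared in the paper all reduce, at each decoding step, to mixing the per-system distributions over the target vocabulary. Below is a minimal sketch of that mixture, assuming a shared target vocabulary across systems; the gating parameterization in the final comment is a hypothetical illustration, not the paper's exact model.

```python
import numpy as np

def ensemble_step(system_probs, weights):
    """Mix per-system next-token distributions over a shared target
    vocabulary; context-dependent schemes recompute `weights` at
    every decoding step."""
    system_probs = np.asarray(system_probs)    # (n_systems, vocab_size)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # keep the mixture a distribution
    return weights @ system_probs              # (vocab_size,)

# Uniform mixture, the standard NMT ensemble baseline:
p_de = np.array([0.6, 0.3, 0.1])  # German->English system
p_fr = np.array([0.2, 0.7, 0.1])  # French->English system
print(ensemble_step([p_de, p_fr], [0.5, 0.5]))  # [0.4 0.5 0.1]

# A context-dependent scheme would instead derive the weights from the
# current decoder state, e.g. weights = softmax(W @ decoder_state)
# (hypothetical parameterization).
```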
Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation
Title | Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation |
Authors | Emmanuel Abbe, Colin Sandon |
Abstract | The stochastic block model (SBM) has long been studied in machine learning and network science as a canonical model for clustering and community detection. In recent years, new developments have demonstrated the presence of threshold phenomena for this model, which have set new challenges for algorithms. For the detection problem in symmetric SBMs, Decelle et al. conjectured that the so-called Kesten-Stigum (KS) threshold can be achieved efficiently. This was proved for two communities, but remained open for three or more communities. We prove this conjecture here, obtaining a more general result that applies to arbitrary SBMs with linear size communities. The developed algorithm is a linearized acyclic belief propagation (ABP) algorithm, which mitigates the effects of cycles while provably achieving the KS threshold in $O(n \ln n)$ time. This extends prior methods by achieving the KS threshold universally while reducing or preserving the computational complexity. ABP is also connected to a power iteration method on a generalized nonbacktracking operator, formalizing the spectral-message passing interplay described in Krzakala et al., and extending results from Bordenave et al. |
Tasks | Community Detection |
Published | 2016-12-01 |
URL | http://papers.nips.cc/paper/6365-achieving-the-ks-threshold-in-the-general-stochastic-block-model-with-linearized-acyclic-belief-propagation |
PDF | http://papers.nips.cc/paper/6365-achieving-the-ks-threshold-in-the-general-stochastic-block-model-with-linearized-acyclic-belief-propagation.pdf |
PWC | https://paperswithcode.com/paper/achieving-the-ks-threshold-in-the-general |
Repo | |
Framework | |
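The abstract connects ABP to power iteration on a generalized nonbacktracking operator. The sketch below builds the plain nonbacktracking matrix of a toy graph and power-iterates on it; this is a dense illustration of the operator only, not the paper's $O(n \ln n)$ algorithm, and actual detection would use eigenvectors beyond the leading one.

```python
import numpy as np

def nonbacktracking_matrix(edges):
    """B is indexed by directed edges: B[(u,v),(v,w)] = 1 iff w != u,
    so a walk may continue but never immediately backtrack."""
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    index = {e: i for i, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for (u, v) in directed:
        for (x, w) in directed:
            if x == v and w != u:
                B[index[(u, v)], index[(x, w)]] = 1.0
    return B, directed

def power_iteration(B, iters=200, seed=0):
    """Leading-eigenvector power iteration on the edge space."""
    x = np.random.default_rng(seed).standard_normal(B.shape[0])
    for _ in range(iters):
        x = B @ x
        x /= np.linalg.norm(x)
    return x

# Two triangles joined by the edge (2, 3):
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
B, directed = nonbacktracking_matrix(edges)
v = power_iteration(B)
# Fold edge values back onto nodes to obtain a per-vertex statistic:
node_score = {}
for (u, w), val in zip(directed, v):
    node_score[w] = node_score.get(w, 0.0) + val
print(node_score)
```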
CSIRO Data61 at the WNUT Geo Shared Task
Title | CSIRO Data61 at the WNUT Geo Shared Task |
Authors | Gaya Jayasinghe, Brian Jin, James Mchugh, Bella Robinson, Stephen Wan |
Abstract | In this paper, we describe CSIRO Data61's participation in the Geolocation shared task at the Workshop for Noisy User-generated Text. Our approach was to use ensemble methods to capitalise on four component methods: heuristics based on metadata, a label propagation method, timezone text classifiers, and an information retrieval approach. The ensembles we explored focused on examining the role of language technologies in geolocation prediction and also on the use of hard voting and cascading ensemble methods. Based on the accuracy of city-level predictions, our systems were the best performing submissions at this year's shared task. Furthermore, when estimating the latitude and longitude of a user, our median error distance was accurate to within 30 kilometers. |
Tasks | Information Retrieval |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-3929/ |
PWC | https://paperswithcode.com/paper/csiro-data61-at-the-wnut-geo-shared-task |
Repo | |
Framework | |
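Of the ensemble strategies the abstract names, hard voting and cascading are simple enough to sketch. The component predictors below are hypothetical stand-ins for the four methods mentioned (metadata heuristics, label propagation, timezone text classifiers, information retrieval), not their implementations.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over component city predictions; abstentions
    (None) are ignored, ties go to the earliest-counted label."""
    counts = Counter(p for p in predictions if p is not None)
    return counts.most_common(1)[0][0] if counts else None

def cascade(components, user):
    """Try components in priority order and fall through whenever one
    abstains, so high-precision heuristics answer first."""
    for predict in components:
        label = predict(user)
        if label is not None:
            return label
    return None

# Hypothetical components:
metadata = lambda u: u.get("declared_city")  # abstains with None
timezone = lambda u: "sydney"                # always answers
print(cascade([metadata, timezone], {"declared_city": None}))  # sydney
print(hard_vote(["sydney", "melbourne", "sydney", None]))      # sydney
```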
SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis
Title | SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis |
Authors | Ina Roesiger |
Abstract | This paper presents SciCorp, a corpus of full-text English scientific papers of two disciplines, genetics and computational linguistics. The corpus comprises co-reference and bridging information as well as information status labels. Since SciCorp is annotated with both labels and the respective co-referent and bridging links, we believe it is a valuable resource for NLP researchers working on scientific articles or on applications such as co-reference resolution, bridging resolution or information status classification. The corpus has been reliably annotated by independent human coders with moderate inter-annotator agreement (average kappa = 0.71). In total, we have annotated 14 full papers containing 61,045 tokens and marked 8,708 definite noun phrases. The paper describes in detail the annotation scheme as well as the resulting corpus. The corpus is available for download in two different formats: in an offset-based format and for the co-reference annotations in the widely-used, tabular CoNLL-2012 format. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1275/ |
PWC | https://paperswithcode.com/paper/scicorp-a-corpus-of-english-scientific |
Repo | |
Framework | |
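The reported agreement (average kappa = 0.71) is a chance-corrected statistic. For readers unfamiliar with it, here is a minimal pairwise Cohen's kappa, assuming two coders labelling the same items; the labels below are illustrative information-status values, not SciCorp's exact inventory.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items:
    observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n ** 2  # chance
    return (p_o - p_e) / (1 - p_e)

# Two coders labelling the information status of five mentions:
coder1 = ["new", "given", "given", "bridging", "new"]
coder2 = ["new", "given", "new", "bridging", "new"]
print(cohens_kappa(coder1, coder2))  # 0.6875 for this toy pair
```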
An Analysis of the Ability of Statistical Language Models to Capture the Structural Properties of Language
Title | An Analysis of the Ability of Statistical Language Models to Capture the Structural Properties of Language |
Authors | Aneiss Ghodsi, John DeNero |
Abstract | |
Tasks | Language Modelling, Text Generation |
Published | 2016-09-01 |
URL | https://www.aclweb.org/anthology/W16-6637/ |
PWC | https://paperswithcode.com/paper/an-analysis-of-the-ability-of-statistical |
Repo | |
Framework | |
Mining the Spoken Wikipedia for Speech Data and Beyond
Title | Mining the Spoken Wikipedia for Speech Data and Beyond |
Authors | Arne Köhn, Florian Stegen, Timo Baumann |
Abstract | We present a corpus of time-aligned spoken data of Wikipedia articles, as well as the pipeline that allows such corpora to be generated for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages, and hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, for which we align 27h in full sentences and 157h with some missing words. Results are publicly available. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1735/ |
PWC | https://paperswithcode.com/paper/mining-the-spoken-wikipedia-for-speech-data |
Repo | |
Framework | |
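The corpus statistics distinguish sentences aligned in full from sentences with some missing words. A minimal sketch of that partition, assuming each sentence arrives as word-level alignment tuples (the data layout is an assumption, not the corpus format):

```python
def split_by_alignment(sentences):
    """Partition sentences into fully aligned (every word has a time
    span) and partially aligned (some, but not all, words do);
    entirely unaligned sentences are dropped."""
    full, partial = [], []
    for words in sentences:          # words: [(token, start, end), ...]
        aligned = [start is not None for _, start, _ in words]
        if all(aligned):
            full.append(words)
        elif any(aligned):
            partial.append(words)
    return full, partial

sents = [[("the", 0.0, 0.2), ("cat", 0.2, 0.5)],
         [("sat", None, None), ("down", 0.9, 1.2)]]
print([len(x) for x in split_by_alignment(sents)])  # [1, 1]
```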
TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation
Title | TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation |
Authors | Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin, Béatrice Daille |
Abstract | Keyphrase extraction is the task of finding phrases that represent the important content of a document. The main aim of keyphrase extraction is to propose textual units that represent the most important topics developed in a document. The output keyphrases of automatic keyphrase extraction methods for test documents are typically evaluated by comparing them to manually assigned reference keyphrases. Each output keyphrase is considered correct if it matches one of the reference keyphrases. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective and evaluating by exact matching underestimates the performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, the manual evaluations can be used to validate new evaluation measures. Indeed, an evaluation measure that is highly correlated to the manual evaluation is appropriate for the evaluation of automatic keyphrase extraction methods. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1304/ |
PWC | https://paperswithcode.com/paper/termith-eval-a-french-standard-based-resource |
Repo | |
Framework | |
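The exact-matching evaluation the abstract argues against is easy to make concrete. A minimal sketch, assuming lowercased string equality as the matching criterion:

```python
def exact_match_prf(predicted, reference):
    """Exact-match evaluation of extracted keyphrases: an output
    phrase counts as correct only if it literally equals a reference
    phrase, the scheme the paper argues underestimates performance."""
    pred = {p.lower() for p in predicted}
    ref = {r.lower() for r in reference}
    tp = len(pred & ref)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(exact_match_prf(
    ["keyphrase extraction", "evaluation", "topic model"],
    ["keyphrase extraction", "evaluation measure"]))
# (0.333..., 0.5, 0.4): 'evaluation' misses 'evaluation measure' despite
# naming the same topic, which is exactly the underestimation at issue.
```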
A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza
Title | A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza |
Authors | Xabier Sarasola, Eva Navas, David Tavarez, Daniel Erro, Ibon Saratxaga, Inma Hernaez |
Abstract | This paper describes the characteristics and structure of a Basque singing voice database of bertsolaritza. Bertsolaritza is a popular singing style from the Basque Country, sung exclusively in Basque, that is improvised and performed a cappella. The database is designed to be used in statistical singing voice synthesis for the bertsolaritza style. Starting from the recordings and transcriptions of numerous singers, diarization and phoneme alignment experiments have been carried out to extract the singing voice from the recordings and create phoneme alignments. These labelling processes have been performed using standard speech processing techniques, and the results show that such techniques can be used for this specific singing style. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1120/ |
PWC | https://paperswithcode.com/paper/a-singing-voice-database-in-basque-for |
Repo | |
Framework | |
Analyzing Linguistic Knowledge in Sequential Model of Sentence
Title | Analyzing Linguistic Knowledge in Sequential Model of Sentence |
Authors | Peng Qian, Xipeng Qiu, Xuanjing Huang |
Abstract | |
Tasks | Language Modelling, Text Generation |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1079/ |
PWC | https://paperswithcode.com/paper/analyzing-linguistic-knowledge-in-sequential |
Repo | |
Framework | |
Identification of Drug-Related Medical Conditions in Social Media
Title | Identification of Drug-Related Medical Conditions in Social Media |
Authors | François Morlane-Hondère, Cyril Grouin, Pierre Zweigenbaum |
Abstract | Monitoring social media has been shown to be an interesting approach for the early detection of drug adverse effects. In this paper, we describe a system which extracts medical entities in French drug reviews written by users. We focus on the identification of medical conditions, which is based on the concept of post-coordination: we first extract minimal medical-related entities (pain, stomach), then we combine them to identify complex ones (It was the worst [pain I ever felt in my stomach]). These two steps are performed by two classifiers, the first based on Conditional Random Fields and the second on Support Vector Machines. The overall results of the minimal entity classifier are the following: P=0.926; R=0.849; F1=0.886. A thorough analysis of the feature set shows that, when combined with word lemmas, clusters generated by word2vec are the most valuable features. When trained on the output of the first classifier, the second classifier's performances are the following: P=0.683; R=0.956; F1=0.797. The addition of post-processing rules did not add any significant global improvement but was found to modify the precision/recall ratio. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1320/ |
PWC | https://paperswithcode.com/paper/identification-of-drug-related-medical |
Repo | |
Framework | |
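The abstract describes a two-step pipeline: a CRF extracts minimal entities, then an SVM decides which of them combine into complex medical conditions. The sketch below illustrates only the second step with scikit-learn's LinearSVC; the feature names, data shapes, and toy labels are assumptions, not the paper's feature set.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def pair_features(span_a, span_b, tokens):
    """Features for a candidate combination of two minimal entities
    (e.g. 'pain' + 'stomach'); feature names are illustrative."""
    dist = abs(span_a[0] - span_b[0])
    return {
        "token_a": tokens[span_a[0]],
        "token_b": tokens[span_b[0]],
        "distance": dist,
        "nearby": dist <= 5,
    }

# Training pairs labelled 1 if both minimal entities belong to one
# complex medical condition, else 0 (labels here are toy examples):
toks = ["it", "was", "pain", "i", "felt", "in", "my", "stomach"]
X_dicts = [pair_features((2, 3), (7, 8), toks),   # pain + stomach
           pair_features((0, 1), (7, 8), toks)]   # it + stomach
y = [1, 0]
vec = DictVectorizer()
clf = LinearSVC().fit(vec.fit_transform(X_dicts), y)
print(clf.predict(vec.transform([pair_features((2, 3), (7, 8), toks)])))
```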
Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus
Title | Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus |
Authors | SoHyun Park, Afsaneh Fazly, Annie Lee, Brandon Seibel, Wenjie Zi, Paul Cook |
Abstract | In this paper we consider the problem of out-of-vocabulary term classification in web forum text from the automotive domain. We develop a set of nine domain- and application-specific categories for out-of-vocabulary terms. We then propose a supervised approach to classify out-of-vocabulary terms according to these categories, drawing on features based on word embeddings, and linguistic knowledge of common properties of out-of-vocabulary terms. We show that the features based on word embeddings are particularly informative for this task. The categories that we predict could serve as a preliminary, automatically-generated source of lexical knowledge about out-of-vocabulary terms. Furthermore, we show that this approach can be adapted to give a semi-automated method for identifying out-of-vocabulary terms of a particular category, automotive named entities, that is of particular interest to us. |
Tasks | Word Embeddings |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1474/ |
PWC | https://paperswithcode.com/paper/classifying-out-of-vocabulary-terms-in-a |
Repo | |
Framework | |
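The classifier draws on word-embedding features plus linguistic properties of OOV terms. A minimal sketch of that feature construction and a toy training run; the category labels and surface cues here are illustrative assumptions, not the paper's nine-category inventory.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def oov_features(term, embeddings, dim=50):
    """Embedding-based features for an OOV term: its vector if one was
    induced from the forum corpus, plus shallow surface cues (the cue
    set here is an assumption)."""
    vec = embeddings.get(term, np.zeros(dim))
    surface = np.array(
        [len(term), term[0].isupper(), any(c.isdigit() for c in term)],
        dtype=float)
    return np.concatenate([vec, surface])

# Toy training data with illustrative category names:
emb = {"crv": np.random.randn(50), "turbo'd": np.random.randn(50)}
X = np.stack([oov_features(t, emb) for t in ["crv", "turbo'd"]])
y = ["named-entity", "nonstandard-form"]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```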
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis
Title | IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis |
Authors | Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, Chris Biemann |
Abstract | |
Tasks | Aspect-Based Sentiment Analysis, Opinion Mining, Sentiment Analysis |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1174/ |
PWC | https://paperswithcode.com/paper/iit-tuda-at-semeval-2016-task-5-beyond |
Repo | |
Framework | |
Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Title | Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource |
Authors | Anaïs Tack, Thomas François, Anne-Laure Ligozat, Cédrick Fairon |
Abstract | This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in French as a foreign language learning. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use this data to automatically address the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions are able to capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate in general (87.4% to 92.3%), they do not yet seem to cover the learners' lack of knowledge very well. |
Tasks | Lexical Simplification |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1035/ |
PWC | https://paperswithcode.com/paper/evaluating-lexical-simplification-and |
Repo | |
Framework | |
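A minimal sketch of the two uses described above, under one plausible reading of the abstract: a word's single difficulty level is the first CEFR level at which FLELex records a nonzero frequency, and a learner at a given level is predicted to know exactly the words at or below it. Both simplifications are assumptions, not the paper's exact derivation.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
RANK = {lev: i for i, lev in enumerate(LEVELS)}

def first_level(word, flelex):
    """Single difficulty level for a word: the first CEFR level with a
    nonzero frequency in the (toy) FLELex distribution."""
    freqs = flelex.get(word, {})
    for lev in LEVELS:
        if freqs.get(lev, 0) > 0:
            return lev
    return None  # word not in the resource

def n_known_words(tokens, flelex, learner_level):
    """Predicted count of word types a learner at `learner_level`
    knows: those assigned a level at or below the learner's."""
    known = 0
    for tok in set(tokens):
        lev = first_level(tok, flelex)
        if lev is not None and RANK[lev] <= RANK[learner_level]:
            known += 1
    return known

flelex = {"chat": {"A1": 12.3}, "toutefois": {"B2": 4.1}}
print(n_known_words(["chat", "toutefois", "chat"], flelex, "A2"))  # 1
```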
Event Coreference Resolution with Multi-Pass Sieves
Title | Event Coreference Resolution with Multi-Pass Sieves |
Authors | Jing Lu, Vincent Ng |
Abstract | Multi-pass sieve approaches have been successfully applied to entity coreference resolution and many other tasks in natural language processing (NLP), owing in part to the ease of designing high-precision rules for these tasks. However, the same is not true for event coreference resolution: typically lying towards the end of the standard information extraction pipeline, an event coreference resolver assumes as input the noisy outputs of its upstream components such as the trigger identification component and the entity coreference resolution component. The difficulty in designing high-precision rules makes it challenging to successfully apply a multi-pass sieve approach to event coreference resolution. In this paper, we investigate this challenge, proposing the first multi-pass sieve approach to event coreference resolution. When evaluated on the version of the KBP 2015 corpus available to the participants of EN Task 2 (Event Nugget Detection and Coreference), our approach achieves an Avg F-score of 40.32%, outperforming the best participating system by 0.67% in Avg F-score. |
Tasks | Coreference Resolution |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1631/ |
PWC | https://paperswithcode.com/paper/event-coreference-resolution-with-multi-pass |
Repo | |
Framework | |
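The multi-pass sieve architecture itself is compact: sieves run from highest to lowest precision over the clusters built so far, so early confident merges constrain later, recall-oriented ones. A minimal sketch; the `same_trigger_sieve` rule is an illustrative example, not one of the paper's actual sieves.

```python
def multi_pass_sieve(mentions, sieves):
    """Apply sieves in decreasing order of precision; each sieve sees
    the clusters produced so far and may merge them."""
    clusters = [[m] for m in mentions]   # start from singletons
    for sieve in sieves:
        clusters = sieve(clusters)
    return clusters

def same_trigger_sieve(clusters):
    """Illustrative high-precision sieve: merge event clusters whose
    mentions share an identical trigger word."""
    by_trigger = {}
    for cluster in clusters:
        trigger = cluster[0][1]          # mention = (mention_id, trigger)
        by_trigger.setdefault(trigger, []).extend(cluster)
    return list(by_trigger.values())

mentions = [("m1", "attack"), ("m2", "attack"), ("m3", "explosion")]
print(multi_pass_sieve(mentions, [same_trigger_sieve]))
# [[('m1', 'attack'), ('m2', 'attack')], [('m3', 'explosion')]]
```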
Merging Data Resources for Inflectional and Derivational Morphology in Czech
Title | Merging Data Resources for Inflectional and Derivational Morphology in Czech |
Authors | Zdeněk Žabokrtský, Magda Ševčíková, Milan Straka, Jonáš Vidra, Adéla Limburská |
Abstract | The paper deals with merging two complementary resources of morphological data previously existing for Czech, namely the inflectional dictionary MorfFlex CZ and the recently developed lexical network DeriNet. The MorfFlex CZ dictionary has been used by a morphological analyzer capable of analyzing/generating several million Czech word forms according to the rules of Czech inflection. The DeriNet network contains several hundred thousand Czech lemmas interconnected with links corresponding to derivational relations (relations between base words and words derived from them). After summarizing basic characteristics of both resources, the process of merging is described, focusing on both rather technical aspects (growth of the data, measuring the quality of newly added derivational relations) and linguistic issues (treating lexical homonymy and vowel/consonant alternations). The resulting resource contains 970 thousand lemmas connected with 715 thousand derivational relations and is publicly available on the web under the CC-BY-NC-SA license. The data were incorporated in the MorphoDiTa library version 2.0 (which provides morphological analysis, generation, tagging and lemmatization for Czech) and can be browsed and searched by two web tools (DeriNet Viewer and DeriNet Search tool). |
Tasks | Lemmatization, Morphological Analysis |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1208/ |
PWC | https://paperswithcode.com/paper/merging-data-resources-for-inflectional-and |
Repo | |
Framework | |
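At its core, the merge attaches DeriNet's derivational links to MorfFlex CZ lemma entries while keeping homonymous lemmas apart. A heavily simplified sketch, assuming numeric-suffixed lemma ids as the disambiguation device; the real resources' id scheme and merge procedure are more involved.

```python
def merge_resources(morfflex_lemmas, derinet_edges):
    """Attach derivational links (base -> derived) to lemma entries;
    homonyms stay distinct because links are keyed by lemma id,
    not surface form."""
    lexicon = {lemma_id: {"lemma": lemma, "derived_from": None}
               for lemma_id, lemma in morfflex_lemmas}
    for base_id, derived_id in derinet_edges:
        if base_id in lexicon and derived_id in lexicon:
            lexicon[derived_id]["derived_from"] = base_id
    return lexicon

# Toy chain: ucit 'teach' -> ucitel 'teacher' -> ucitelka 'female teacher'
lemmas = [("ucit-1", "ucit"), ("ucitel-1", "ucitel"), ("ucitelka-1", "ucitelka")]
edges = [("ucit-1", "ucitel-1"), ("ucitel-1", "ucitelka-1")]
merged = merge_resources(lemmas, edges)
print(merged["ucitelka-1"]["derived_from"])  # ucitel-1
```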