Paper Group NANR 120
Ensemble Learning for Multi-Source Neural Machine Translation
Title | Ensemble Learning for Multi-Source Neural Machine Translation |
Authors | Ekaterina Garmash, Christof Monz |
Abstract | In this paper we describe and evaluate methods to perform ensemble prediction in neural machine translation (NMT). We compare two methods of ensemble set induction: sampling parameter initializations for an NMT system, which is a relatively established method in NMT (Sutskever et al., 2014), and NMT systems translating from different source languages into the same target language, i.e., multi-source ensembles, a method recently introduced by Firat et al. (2016). We are motivated by the observation that systems for different language pairs make different types of mistakes. We propose several methods with different degrees of parameterization to combine individual predictions of NMT systems so that they mutually compensate for each other's mistakes and improve overall performance. We find that the biggest improvements can be obtained from a context-dependent weighting scheme for multi-source ensembles. This result offers stronger support for the linguistic motivation of using multi-source ensembles than previous approaches. Evaluation is carried out for German and French into English translation. The best multi-source ensemble method achieves an improvement of up to 2.2 BLEU points over the strongest single-source ensemble baseline, and a 2 BLEU improvement over a multi-source ensemble baseline. |
Tasks | Machine Translation |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1133/ |
PWC | https://paperswithcode.com/paper/ensemble-learning-for-multi-source-neural |
Repo | |
Framework | |
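The combination schemes compared in the paper all reduce, at each decoding step, to mixing the per-system distributions over the target vocabulary. Below is a minimal sketch of that mixture, assuming a shared target vocabulary across systems; the gating parameterization in the final comment is a hypothetical illustration, not the paper's exact model.

```python
import numpy as np

def ensemble_step(system_probs, weights):
    """Mix per-system next-token distributions over a shared target
    vocabulary; context-dependent schemes recompute `weights` at
    every decoding step."""
    system_probs = np.asarray(system_probs)    # (n_systems, vocab_size)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # keep the mixture a distribution
    return weights @ system_probs              # (vocab_size,)

# Uniform mixture, the standard NMT ensemble baseline:
p_de = np.array([0.6, 0.3, 0.1])  # German->English system
p_fr = np.array([0.2, 0.7, 0.1])  # French->English system
print(ensemble_step([p_de, p_fr], [0.5, 0.5]))  # [0.4 0.5 0.1]

# A context-dependent scheme would instead derive the weights from the
# current decoder state, e.g. weights = softmax(W @ decoder_state)
# (hypothetical parameterization).
```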
Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation
Title | Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation |
Authors | Emmanuel Abbe, Colin Sandon |
Abstract | The stochastic block model (SBM) has long been studied in machine learning and network science as a canonical model for clustering and community detection. In recent years, new developments have demonstrated the presence of threshold phenomena for this model, which have set new challenges for algorithms. For the detection problem in symmetric SBMs, Decelle et al. conjectured that the so-called Kesten-Stigum (KS) threshold can be achieved efficiently. This was proved for two communities, but remained open for three or more communities. We prove this conjecture here, obtaining a more general result that applies to arbitrary SBMs with linear size communities. The developed algorithm is a linearized acyclic belief propagation (ABP) algorithm, which mitigates the effects of cycles while provably achieving the KS threshold in $O(n \ln n)$ time. This extends prior methods by achieving the KS threshold universally while reducing or preserving the computational complexity. ABP is also connected to a power iteration method on a generalized nonbacktracking operator, formalizing the spectral-message passing interplay described in Krzakala et al., and extending results from Bordenave et al. |
Tasks | Community Detection |
Published | 2016-12-01 |
URL | http://papers.nips.cc/paper/6365-achieving-the-ks-threshold-in-the-general-stochastic-block-model-with-linearized-acyclic-belief-propagation |
PDF | http://papers.nips.cc/paper/6365-achieving-the-ks-threshold-in-the-general-stochastic-block-model-with-linearized-acyclic-belief-propagation.pdf |
PWC | https://paperswithcode.com/paper/achieving-the-ks-threshold-in-the-general |
Repo | |
Framework | |
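The abstract connects ABP to power iteration on a generalized nonbacktracking operator. The sketch below builds the plain nonbacktracking matrix of a toy graph and power-iterates on it; this is a dense illustration of the operator only, not the paper's $O(n \ln n)$ algorithm, and actual detection would use eigenvectors beyond the leading one.

```python
import numpy as np

def nonbacktracking_matrix(edges):
    """B is indexed by directed edges: B[(u,v),(v,w)] = 1 iff w != u,
    so a walk may continue but never immediately backtrack."""
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    index = {e: i for i, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for (u, v) in directed:
        for (x, w) in directed:
            if x == v and w != u:
                B[index[(u, v)], index[(x, w)]] = 1.0
    return B, directed

def power_iteration(B, iters=200, seed=0):
    """Leading-eigenvector power iteration on the edge space."""
    x = np.random.default_rng(seed).standard_normal(B.shape[0])
    for _ in range(iters):
        x = B @ x
        x /= np.linalg.norm(x)
    return x

# Two triangles joined by the edge (2, 3):
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
B, directed = nonbacktracking_matrix(edges)
v = power_iteration(B)
# Fold edge values back onto nodes to obtain a per-vertex statistic:
node_score = {}
for (u, w), val in zip(directed, v):
    node_score[w] = node_score.get(w, 0.0) + val
print(node_score)
```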
CSIRO Data61 at the WNUT Geo Shared Task
Title | CSIRO Data61 at the WNUT Geo Shared Task |
Authors | Gaya Jayasinghe, Brian Jin, James Mchugh, Bella Robinson, Stephen Wan |
Abstract | In this paper, we describe CSIRO Data61's participation in the Geolocation shared task at the Workshop for Noisy User-generated Text. Our approach was to use ensemble methods to capitalise on four component methods: heuristics based on metadata, a label propagation method, timezone text classifiers, and an information retrieval approach. The ensembles we explored focused on examining the role of language technologies in geolocation prediction and also on the use of hard voting and cascading ensemble methods. Based on the accuracy of city-level predictions, our systems were the best performing submissions at this year's shared task. Furthermore, when estimating the latitude and longitude of a user, our median error distance was accurate to within 30 kilometers. |
Tasks | Information Retrieval |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-3929/ |
PWC | https://paperswithcode.com/paper/csiro-data61-at-the-wnut-geo-shared-task |
Repo | |
Framework | |
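Of the ensemble strategies the abstract names, hard voting and cascading are simple enough to sketch. The component predictors below are hypothetical stand-ins for the four methods mentioned (metadata heuristics, label propagation, timezone text classifiers, information retrieval), not their implementations.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over component city predictions; abstentions
    (None) are ignored, ties go to the earliest-counted label."""
    counts = Counter(p for p in predictions if p is not None)
    return counts.most_common(1)[0][0] if counts else None

def cascade(components, user):
    """Try components in priority order and fall through whenever one
    abstains, so high-precision heuristics answer first."""
    for predict in components:
        label = predict(user)
        if label is not None:
            return label
    return None

# Hypothetical components:
metadata = lambda u: u.get("declared_city")  # abstains with None
timezone = lambda u: "sydney"                # always answers
print(cascade([metadata, timezone], {"declared_city": None}))  # sydney
print(hard_vote(["sydney", "melbourne", "sydney", None]))      # sydney
```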
SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis
Title | SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis |
Authors | Ina Roesiger |
Abstract | This paper presents SciCorp, a corpus of full-text English scientific papers of two disciplines, genetics and computational linguistics. The corpus comprises co-reference and bridging information as well as information status labels. Since SciCorp is annotated with both labels and the respective co-referent and bridging links, we believe it is a valuable resource for NLP researchers working on scientific articles or on applications such as co-reference resolution, bridging resolution or information status classification. The corpus has been reliably annotated by independent human coders with moderate inter-annotator agreement (average kappa = 0.71). In total, we have annotated 14 full papers containing 61,045 tokens and marked 8,708 definite noun phrases. The paper describes in detail the annotation scheme as well as the resulting corpus. The corpus is available for download in two different formats: in an offset-based format and for the co-reference annotations in the widely-used, tabular CoNLL-2012 format. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1275/ |
PWC | https://paperswithcode.com/paper/scicorp-a-corpus-of-english-scientific |
Repo | |
Framework | |
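The reported agreement (average kappa = 0.71) is a chance-corrected statistic. For readers unfamiliar with it, here is a minimal pairwise Cohen's kappa, assuming two coders labelling the same items; the labels below are illustrative information-status values, not SciCorp's exact inventory.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items:
    observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n ** 2  # chance
    return (p_o - p_e) / (1 - p_e)

# Two coders labelling the information status of five mentions:
coder1 = ["new", "given", "given", "bridging", "new"]
coder2 = ["new", "given", "new", "bridging", "new"]
print(cohens_kappa(coder1, coder2))  # 0.6875 for this toy pair
```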
An Analysis of the Ability of Statistical Language Models to Capture the Structural Properties of Language
Title | An Analysis of the Ability of Statistical Language Models to Capture the Structural Properties of Language |
Authors | Aneiss Ghodsi, John DeNero |
Abstract | |
Tasks | Language Modelling, Text Generation |
Published | 2016-09-01 |
URL | https://www.aclweb.org/anthology/W16-6637/ |
PWC | https://paperswithcode.com/paper/an-analysis-of-the-ability-of-statistical |
Repo | |
Framework | |
Mining the Spoken Wikipedia for Speech Data and Beyond
Title | Mining the Spoken Wikipedia for Speech Data and Beyond |
Authors | Arne Köhn, Florian Stegen, Timo Baumann |
Abstract | We present a corpus of time-aligned spoken data of Wikipedia articles, as well as the pipeline that allows such corpora to be generated for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages, and hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, for which we align 27h in full sentences and 157h with some missing words. Results are publicly available. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1735/ |
PWC | https://paperswithcode.com/paper/mining-the-spoken-wikipedia-for-speech-data |
Repo | |
Framework | |
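The corpus statistics distinguish sentences aligned in full from sentences with some missing words. A minimal sketch of that partition, assuming each sentence arrives as word-level alignment tuples (the data layout is an assumption, not the corpus format):

```python
def split_by_alignment(sentences):
    """Partition sentences into fully aligned (every word has a time
    span) and partially aligned (some, but not all, words do);
    entirely unaligned sentences are dropped."""
    full, partial = [], []
    for words in sentences:          # words: [(token, start, end), ...]
        aligned = [start is not None for _, start, _ in words]
        if all(aligned):
            full.append(words)
        elif any(aligned):
            partial.append(words)
    return full, partial

sents = [[("the", 0.0, 0.2), ("cat", 0.2, 0.5)],
         [("sat", None, None), ("down", 0.9, 1.2)]]
print([len(x) for x in split_by_alignment(sents)])  # [1, 1]
```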
TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation
Title | TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation |
Authors | Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin, Béatrice Daille |
Abstract | Keyphrase extraction is the task of finding phrases that represent the important content of a document. The main aim of keyphrase extraction is to propose textual units that represent the most important topics developed in a document. The output keyphrases of automatic keyphrase extraction methods for test documents are typically evaluated by comparing them to manually assigned reference keyphrases. Each output keyphrase is considered correct if it matches one of the reference keyphrases. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective and evaluating by exact matching underestimates the performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, the manual evaluations can be used to validate new evaluation measures. Indeed, an evaluation measure that is highly correlated to the manual evaluation is appropriate for the evaluation of automatic keyphrase extraction methods. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1304/ |
PWC | https://paperswithcode.com/paper/termith-eval-a-french-standard-based-resource |
Repo | |
Framework | |
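The exact-matching evaluation the abstract argues against is easy to make concrete. A minimal sketch, assuming lowercased string equality as the matching criterion:

```python
def exact_match_prf(predicted, reference):
    """Exact-match evaluation of extracted keyphrases: an output
    phrase counts as correct only if it literally equals a reference
    phrase, the scheme the paper argues underestimates performance."""
    pred = {p.lower() for p in predicted}
    ref = {r.lower() for r in reference}
    tp = len(pred & ref)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(exact_match_prf(
    ["keyphrase extraction", "evaluation", "topic model"],
    ["keyphrase extraction", "evaluation measure"]))
# (0.333..., 0.5, 0.4): 'evaluation' misses 'evaluation measure' despite
# naming the same topic, which is exactly the underestimation at issue.
```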
A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza
Title | A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza |
Authors | Xabier Sarasola, Eva Navas, David Tavarez, Daniel Erro, Ibon Saratxaga, Inma Hernaez |
Abstract | This paper describes the characteristics and structure of a Basque singing voice database of bertsolaritza. Bertsolaritza is a popular singing style from the Basque Country, sung exclusively in Basque, that is improvised and performed a cappella. The database is designed to be used in statistical singing voice synthesis for the bertsolaritza style. Starting from the recordings and transcriptions of numerous singers, diarization and phoneme alignment experiments have been carried out to extract the singing voice from the recordings and create phoneme alignments. These labelling processes have been performed using standard speech processing techniques, and the results show that such techniques can be used for this specific singing style. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1120/ |
PWC | https://paperswithcode.com/paper/a-singing-voice-database-in-basque-for |
Repo | |
Framework | |
Analyzing Linguistic Knowledge in Sequential Model of Sentence
Title | Analyzing Linguistic Knowledge in Sequential Model of Sentence |
Authors | Peng Qian, Xipeng Qiu, Xuanjing Huang |
Abstract | |
Tasks | Language Modelling, Text Generation |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1079/ |
PWC | https://paperswithcode.com/paper/analyzing-linguistic-knowledge-in-sequential |
Repo | |
Framework | |
Identification of Drug-Related Medical Conditions in Social Media
Title | Identification of Drug-Related Medical Conditions in Social Media |
Authors | François Morlane-Hondère, Cyril Grouin, Pierre Zweigenbaum |
Abstract | Monitoring social media has been shown to be an interesting approach for the early detection of drug adverse effects. In this paper, we describe a system which extracts medical entities in French drug reviews written by users. We focus on the identification of medical conditions, which is based on the concept of post-coordination: we first extract minimal medical-related entities (pain, stomach), then we combine them to identify complex ones (It was the worst [pain I ever felt in my stomach]). These two steps are performed by two classifiers, the first based on Conditional Random Fields and the second on Support Vector Machines. The overall results of the minimal entity classifier are the following: P=0.926; R=0.849; F1=0.886. A thorough analysis of the feature set shows that, when combined with word lemmas, clusters generated by word2vec are the most valuable features. When trained on the output of the first classifier, the second classifier's performances are the following: P=0.683; R=0.956; F1=0.797. The addition of post-processing rules did not add any significant global improvement but was found to modify the precision/recall ratio. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1320/ |
PWC | https://paperswithcode.com/paper/identification-of-drug-related-medical |
Repo | |
Framework | |
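The abstract describes a two-step pipeline: a CRF extracts minimal entities, then an SVM decides which of them combine into complex medical conditions. The sketch below illustrates only the second step with scikit-learn's LinearSVC; the feature names, data shapes, and toy labels are assumptions, not the paper's feature set.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def pair_features(span_a, span_b, tokens):
    """Features for a candidate combination of two minimal entities
    (e.g. 'pain' + 'stomach'); feature names are illustrative."""
    dist = abs(span_a[0] - span_b[0])
    return {
        "token_a": tokens[span_a[0]],
        "token_b": tokens[span_b[0]],
        "distance": dist,
        "nearby": dist <= 5,
    }

# Training pairs labelled 1 if both minimal entities belong to one
# complex medical condition, else 0 (labels here are toy examples):
toks = ["it", "was", "pain", "i", "felt", "in", "my", "stomach"]
X_dicts = [pair_features((2, 3), (7, 8), toks),   # pain + stomach
           pair_features((0, 1), (7, 8), toks)]   # it + stomach
y = [1, 0]
vec = DictVectorizer()
clf = LinearSVC().fit(vec.fit_transform(X_dicts), y)
print(clf.predict(vec.transform([pair_features((2, 3), (7, 8), toks)])))
```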
Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus
Title | Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus |
Authors | SoHyun Park, Afsaneh Fazly, Annie Lee, Brandon Seibel, Wenjie Zi, Paul Cook |
Abstract | In this paper we consider the problem of out-of-vocabulary term classification in web forum text from the automotive domain. We develop a set of nine domain- and application-specific categories for out-of-vocabulary terms. We then propose a supervised approach to classify out-of-vocabulary terms according to these categories, drawing on features based on word embeddings, and linguistic knowledge of common properties of out-of-vocabulary terms. We show that the features based on word embeddings are particularly informative for this task. The categories that we predict could serve as a preliminary, automatically-generated source of lexical knowledge about out-of-vocabulary terms. Furthermore, we show that this approach can be adapted to give a semi-automated method for identifying out-of-vocabulary terms of a particular category, automotive named entities, that is of particular interest to us. |
Tasks | Word Embeddings |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1474/ |
PWC | https://paperswithcode.com/paper/classifying-out-of-vocabulary-terms-in-a |
Repo | |
Framework | |
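The classifier draws on word-embedding features plus linguistic properties of OOV terms. A minimal sketch of that feature construction and a toy training run; the category labels and surface cues here are illustrative assumptions, not the paper's nine-category inventory.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def oov_features(term, embeddings, dim=50):
    """Embedding-based features for an OOV term: its vector if one was
    induced from the forum corpus, plus shallow surface cues (the cue
    set here is an assumption)."""
    vec = embeddings.get(term, np.zeros(dim))
    surface = np.array(
        [len(term), term[0].isupper(), any(c.isdigit() for c in term)],
        dtype=float)
    return np.concatenate([vec, surface])

# Toy training data with illustrative category names:
emb = {"crv": np.random.randn(50), "turbo'd": np.random.randn(50)}
X = np.stack([oov_features(t, emb) for t in ["crv", "turbo'd"]])
y = ["named-entity", "nonstandard-form"]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```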
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis
Title | IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis |
Authors | Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, Chris Biemann |
Abstract | |
Tasks | Aspect-Based Sentiment Analysis, Opinion Mining, Sentiment Analysis |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1174/ |
PWC | https://paperswithcode.com/paper/iit-tuda-at-semeval-2016-task-5-beyond |
Repo | |
Framework | |
Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Title | Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource |
Authors | Anaïs Tack, Thomas François, Anne-Laure Ligozat, Cédrick Fairon |
Abstract | This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in French as a foreign language learning. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use this data to automatically address the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions are able to capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate in general (87.4% to 92.3%), they do not yet seem to cover the learners' lack of knowledge very well. |
Tasks | Lexical Simplification |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1035/ |
PWC | https://paperswithcode.com/paper/evaluating-lexical-simplification-and |
Repo | |
Framework | |
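A minimal sketch of the two uses described above, under one plausible reading of the abstract: a word's single difficulty level is the first CEFR level at which FLELex records a nonzero frequency, and a learner at a given level is predicted to know exactly the words at or below it. Both simplifications are assumptions, not the paper's exact derivation.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
RANK = {lev: i for i, lev in enumerate(LEVELS)}

def first_level(word, flelex):
    """Single difficulty level for a word: the first CEFR level with a
    nonzero frequency in the (toy) FLELex distribution."""
    freqs = flelex.get(word, {})
    for lev in LEVELS:
        if freqs.get(lev, 0) > 0:
            return lev
    return None  # word not in the resource

def n_known_words(tokens, flelex, learner_level):
    """Predicted count of word types a learner at `learner_level`
    knows: those assigned a level at or below the learner's."""
    known = 0
    for tok in set(tokens):
        lev = first_level(tok, flelex)
        if lev is not None and RANK[lev] <= RANK[learner_level]:
            known += 1
    return known

flelex = {"chat": {"A1": 12.3}, "toutefois": {"B2": 4.1}}
print(n_known_words(["chat", "toutefois", "chat"], flelex, "A2"))  # 1
```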
Event Coreference Resolution with Multi-Pass Sieves
Title | Event Coreference Resolution with Multi-Pass Sieves |
Authors | Jing Lu, Vincent Ng |
Abstract | Multi-pass sieve approaches have been successfully applied to entity coreference resolution and many other tasks in natural language processing (NLP), owing in part to the ease of designing high-precision rules for these tasks. However, the same is not true for event coreference resolution: typically lying towards the end of the standard information extraction pipeline, an event coreference resolver assumes as input the noisy outputs of its upstream components such as the trigger identification component and the entity coreference resolution component. The difficulty in designing high-precision rules makes it challenging to successfully apply a multi-pass sieve approach to event coreference resolution. In this paper, we investigate this challenge, proposing the first multi-pass sieve approach to event coreference resolution. When evaluated on the version of the KBP 2015 corpus available to the participants of EN Task 2 (Event Nugget Detection and Coreference), our approach achieves an Avg F-score of 40.32%, outperforming the best participating system by 0.67% in Avg F-score. |
Tasks | Coreference Resolution |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1631/ |
PWC | https://paperswithcode.com/paper/event-coreference-resolution-with-multi-pass |
Repo | |
Framework | |
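The multi-pass sieve architecture itself is compact: sieves run from highest to lowest precision over the clusters built so far, so early confident merges constrain later, recall-oriented ones. A minimal sketch; the `same_trigger_sieve` rule is an illustrative example, not one of the paper's actual sieves.

```python
def multi_pass_sieve(mentions, sieves):
    """Apply sieves in decreasing order of precision; each sieve sees
    the clusters produced so far and may merge them."""
    clusters = [[m] for m in mentions]   # start from singletons
    for sieve in sieves:
        clusters = sieve(clusters)
    return clusters

def same_trigger_sieve(clusters):
    """Illustrative high-precision sieve: merge event clusters whose
    mentions share an identical trigger word."""
    by_trigger = {}
    for cluster in clusters:
        trigger = cluster[0][1]          # mention = (mention_id, trigger)
        by_trigger.setdefault(trigger, []).extend(cluster)
    return list(by_trigger.values())

mentions = [("m1", "attack"), ("m2", "attack"), ("m3", "explosion")]
print(multi_pass_sieve(mentions, [same_trigger_sieve]))
# [[('m1', 'attack'), ('m2', 'attack')], [('m3', 'explosion')]]
```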
Merging Data Resources for Inflectional and Derivational Morphology in Czech
Title | Merging Data Resources for Inflectional and Derivational Morphology in Czech |
Authors | Zdeněk Žabokrtský, Magda Ševčíková, Milan Straka, Jonáš Vidra, Adéla Limburská |
Abstract | The paper deals with merging two complementary resources of morphological data previously existing for Czech, namely the inflectional dictionary MorfFlex CZ and the recently developed lexical network DeriNet. The MorfFlex CZ dictionary has been used by a morphological analyzer capable of analyzing/generating several million Czech word forms according to the rules of Czech inflection. The DeriNet network contains several hundred thousand Czech lemmas interconnected with links corresponding to derivational relations (relations between base words and words derived from them). After summarizing basic characteristics of both resources, the process of merging is described, focusing on both rather technical aspects (growth of the data, measuring the quality of newly added derivational relations) and linguistic issues (treating lexical homonymy and vowel/consonant alternations). The resulting resource contains 970 thousand lemmas connected with 715 thousand derivational relations and is publicly available on the web under the CC-BY-NC-SA license. The data were incorporated in the MorphoDiTa library version 2.0 (which provides morphological analysis, generation, tagging and lemmatization for Czech) and can be browsed and searched by two web tools (DeriNet Viewer and DeriNet Search tool). |
Tasks | Lemmatization, Morphological Analysis |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1208/ |
PWC | https://paperswithcode.com/paper/merging-data-resources-for-inflectional-and |
Repo | |
Framework | |
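At its core, the merge attaches DeriNet's derivational links to MorfFlex CZ lemma entries while keeping homonymous lemmas apart. A heavily simplified sketch, assuming numeric-suffixed lemma ids as the disambiguation device; the real resources' id scheme and merge procedure are more involved.

```python
def merge_resources(morfflex_lemmas, derinet_edges):
    """Attach derivational links (base -> derived) to lemma entries;
    homonyms stay distinct because links are keyed by lemma id,
    not surface form."""
    lexicon = {lemma_id: {"lemma": lemma, "derived_from": None}
               for lemma_id, lemma in morfflex_lemmas}
    for base_id, derived_id in derinet_edges:
        if base_id in lexicon and derived_id in lexicon:
            lexicon[derived_id]["derived_from"] = base_id
    return lexicon

# Toy chain: ucit 'teach' -> ucitel 'teacher' -> ucitelka 'female teacher'
lemmas = [("ucit-1", "ucit"), ("ucitel-1", "ucitel"), ("ucitelka-1", "ucitelka")]
edges = [("ucit-1", "ucitel-1"), ("ucitel-1", "ucitelka-1")]
merged = merge_resources(lemmas, edges)
print(merged["ucitelka-1"]["derived_from"])  # ucitel-1
```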