May 5, 2019

Paper Group NANR 53

Machine Translation Evaluation for Arabic using Morphologically-enriched Embeddings


Title	Machine Translation Evaluation for Arabic using Morphologically-enriched Embeddings
Authors	Francisco Guzmán, Houda Bouamor, Ramy Baly, Nizar Habash
Abstract	Evaluation of machine translation (MT) into morphologically rich languages (MRL) has not been well studied despite posing many challenges. In this paper, we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve MT evaluation into an MRL. Specifically, we report on Arabic, a language with complex and rich morphology. Our results show that using a neural-network model with different input representations produces results that clearly outperform the state-of-the-art for MT evaluation into Arabic, by almost over 75% increase in correlation with human judgments on pairwise MT evaluation quality task. More importantly, we demonstrate the usefulness of morpho-syntactic representations to model sentence similarity for MT evaluation and address complex linguistic phenomena of Arabic.
Tasks	Community Question Answering, Machine Translation, Morphological Analysis, Morphological Inflection, Question Answering, Word Embeddings
Published	2016-12-01
URL	https://www.aclweb.org/anthology/C16-1132/
PDF	https://www.aclweb.org/anthology/C16-1132
PWC	https://paperswithcode.com/paper/machine-translation-evaluation-for-arabic
Repo
Framework

The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine


Title	The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine
Authors	Mariana Neves, Antonio Jimeno Yepes, Aurélie Névéol
Abstract	The biomedical scientific literature is a rich source of information not only in the English language, for which it is more abundant, but also in other languages, such as Portuguese, Spanish and French. We present the first freely available parallel corpus of scientific publications for the biomedical domain. Documents from the "Biological Sciences" and "Health Sciences" categories were retrieved from the Scielo database and parallel titles and abstracts are available for the following language pairs: Portuguese/English (about 86,000 documents in total), Spanish/English (about 95,000 documents) and French/English (about 2,000 documents). Additionally, monolingual data was also collected for all four languages. Sentences in the parallel corpus were automatically aligned and a manual analysis of 200 documents by native experts found that a minimum of 79% of sentences were correctly aligned in all language pairs. We demonstrate the utility of the corpus by running baseline machine translation experiments. We show that for all language pairs, a statistical machine translation system trained on the parallel corpora achieves performance that rivals or exceeds the state of the art in the biomedical domain. Furthermore, the corpora are currently being used in the biomedical task in the First Conference on Machine Translation (WMT'16).
Tasks	Machine Translation
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1470/
PDF	https://www.aclweb.org/anthology/L16-1470
PWC	https://paperswithcode.com/paper/the-scielo-corpus-a-parallel-corpus-of
Repo
Framework

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair


Title	Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair
Authors	Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas, Filip Klubička
Abstract	This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain ".hr" and the Slovene top-level domain ".si", and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1471/
PDF	https://www.aclweb.org/anthology/L16-1471
PWC	https://paperswithcode.com/paper/producing-monolingual-and-parallel-web
Repo
Framework

A Multi-domain Corpus of Swedish Word Sense Annotation


Title	A Multi-domain Corpus of Swedish Word Sense Annotation
Authors	Richard Johansson, Yvonne Adesam, Gerlof Bouma, Karin Hedberg
Abstract	We describe the word sense annotation layer in Eukalyptus, a freely available five-domain corpus of contemporary Swedish with several annotation layers. The annotation uses the SALDO lexicon to define the sense inventory, and allows word sense annotation of compound segments and multiword units. We give an overview of the new annotation tool developed for this project, and finally present an analysis of the inter-annotator agreement between two annotators.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1482/
PDF	https://www.aclweb.org/anthology/L16-1482
PWC	https://paperswithcode.com/paper/a-multi-domain-corpus-of-swedish-word-sense
Repo
Framework

A Post-editing Interface for Immediate Adaptation in Statistical Machine Translation


Title	A Post-editing Interface for Immediate Adaptation in Statistical Machine Translation
Authors	Patrick Simianer, Sariya Karimova, Stefan Riezler
Abstract	Adaptive machine translation (MT) systems are a promising approach for improving the effectiveness of computer-aided translation (CAT) environments. There is, however, virtually only theoretical work that examines how such a system could be implemented. We present an open source post-editing interface for adaptive statistical MT, which has in-depth monitoring capabilities and excellent expandability, and can facilitate practical studies. To this end, we designed text-based and graphical post-editing interfaces. The graphical interface offers means for displaying and editing a rich view of the MT output. Our translation systems may learn from post-edits using several weight, language model and novel translation model adaptation techniques, in part by exploiting the output of the graphical interface. In a user study we show that using the proposed interface and adaptation methods, reductions in technical effort and time can be achieved.
Tasks	Domain Adaptation, Language Modelling, Machine Translation
Published	2016-12-01
URL	https://www.aclweb.org/anthology/C16-2004/
PDF	https://www.aclweb.org/anthology/C16-2004
PWC	https://paperswithcode.com/paper/a-post-editing-interface-for-immediate
Repo
Framework

New Developments in the LRE Map


Title	New Developments in the LRE Map
Authors	Vladimir Popescu, Lin Liu, Riccardo Del Gratta, Khalid Choukri, Nicoletta Calzolari
Abstract	In this paper we describe the new developments brought to LRE Map, especially in terms of the user interface of the Web application, of the searching of the information therein, and of the data model updates.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1716/
PDF	https://www.aclweb.org/anthology/L16-1716
PWC	https://paperswithcode.com/paper/new-developments-in-the-lre-map
Repo
Framework

Can Tweets Predict TV Ratings?


Title	Can Tweets Predict TV Ratings?
Authors	Bridget Sommerdijk, Eric Sanders, Antal van den Bosch
Abstract	We set out to investigate whether TV ratings and mentions of TV programmes on the Twitter social media platform are correlated. If such a correlation exists, Twitter may be used as an alternative source for estimating viewer popularity. Moreover, the Twitter-based rating estimates may be generated during the programme, or even before. We count the occurrences of programme-specific hashtags in an archive of Dutch tweets of eleven popular TV shows broadcast in the Netherlands in one season, and perform correlation tests. Overall we find a strong correlation of 0.82; the correlation remains strong, 0.79, if tweets are counted a half hour before broadcast time. However, the two most popular TV shows account for most of the positive effect; if we leave out the single and second most popular TV shows, the correlation drops to being moderate to weak. Also, within a TV show, correlations between ratings and tweet counts are mostly weak, while correlations between TV ratings of the previous and next shows are strong. In absence of information on previous shows, Twitter-based counts may be a viable alternative to classic estimation methods for TV ratings. Estimates are more reliable with more popular TV shows.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1473/
PDF	https://www.aclweb.org/anthology/L16-1473
PWC	https://paperswithcode.com/paper/can-tweets-predict-tv-ratings
Repo
Framework
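
The correlation test described in the abstract above can be illustrated with a minimal sketch. It assumes a Pearson correlation (the abstract does not say which test was used), and the tweet counts and ratings below are invented placeholder values, not data from the paper. Comparing the coefficient with and without the most popular shows mirrors the abstract's observation that a few popular programmes can dominate the result.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-show figures (NOT from the paper): hashtag counts and ratings.
tweet_counts = [120, 340, 90, 1500, 60, 2200, 150]
ratings = [0.8, 1.1, 0.6, 2.9, 0.5, 3.4, 0.4]

r_all = pearson(tweet_counts, ratings)

# Drop the two shows with the highest ratings and recompute.
ranked = sorted(zip(ratings, tweet_counts), reverse=True)[2:]
r_rest = pearson([count for _, count in ranked], [rating for rating, _ in ranked])

print(f"all shows: r = {r_all:.2f}")
print(f"without the two most popular: r = {r_rest:.2f}")
```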

Managing Linguistic and Terminological Variation in a Medical Dialogue System


Title	Managing Linguistic and Terminological Variation in a Medical Dialogue System
Authors	Leonardo Campillos Llanos, Dhouha Bouamor, Pierre Zweigenbaum, Sophie Rosset
Abstract	We introduce a dialogue task between a virtual patient and a doctor where the dialogue system, playing the patient part in a simulated consultation, must reconcile a specialized level, to understand what the doctor says, and a lay level, to output realistic patient-language utterances. This increases the challenges in the analysis and generation phases of the dialogue. This paper proposes methods to manage linguistic and terminological variation in that situation and illustrates how they help produce realistic dialogues. Our system makes use of lexical resources for processing synonyms, inflectional and derivational variants, or pronoun/verb agreement. In addition, specialized knowledge is used for processing medical roots and affixes, ontological relations and concept mapping, and for generating lay variants of terms according to the patient's non-expert discourse. We also report the results of a first evaluation carried out by 11 users interacting with the system. We evaluated the non-contextual analysis module, which supports the Spoken Language Understanding step. The annotation of task domain entities obtained 91.8% of Precision, 82.5% of Recall, 86.9% of F-measure, 19.0% of Slot Error Rate, and 32.9% of Sentence Error Rate.
Tasks	Spoken Language Understanding
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1505/
PDF	https://www.aclweb.org/anthology/L16-1505
PWC	https://paperswithcode.com/paper/managing-linguistic-and-terminological
Repo
Framework
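
As a quick consistency check on the figures reported in the abstract above, the F-measure can be recomputed from the stated precision and recall. This sketch assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not name the variant, but the reported numbers agree with it.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the balanced harmonic mean (F1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
precision = 0.918
recall = 0.825

f1 = f_measure(precision, recall)
print(f"F1 = {f1 * 100:.1f}%")
```

Rounded to one decimal place, the result matches the 86.9% given in the abstract.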

A Verbal and Gestural Corpus of Story Retellings to an Expressive Embodied Virtual Character


Title	A Verbal and Gestural Corpus of Story Retellings to an Expressive Embodied Virtual Character
Authors	Jackson Tolins, Kris Liu, Michael Neff, Marilyn Walker, Jean Fox Tree
Abstract	We present a corpus of 44 human-agent verbal and gestural story retellings designed to explore whether humans would gesturally entrain to an embodied intelligent virtual agent. We used a novel data collection method where an agent presented story components in installments, which the human would then retell to the agent. At the end of the installments, the human would then retell the embodied animated agent the story as a whole. This method was designed to allow us to observe whether changes in the agent's gestural behavior would result in human gestural changes. The agent modified its gestures over the course of the story, by starting out the first installment with gestural behaviors designed to manifest extraversion, and slowly modifying gestures to express introversion over time, or the reverse. The corpus contains the verbal and gestural transcripts of the human story retellings. The gestures were coded for type, handedness, temporal structure, spatial extent, and the degree to which the participants' gestures match those produced by the agent. The corpus illustrates the variation in expressive behaviors produced by users interacting with embodied virtual characters, and the degree to which their gestures were influenced by the agent's dynamic changes in personality-based expressive style.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1552/
PDF	https://www.aclweb.org/anthology/L16-1552
PWC	https://paperswithcode.com/paper/a-verbal-and-gestural-corpus-of-story
Repo
Framework

The CASS Technique for Evaluating the Performance of Argument Mining


Title	The CASS Technique for Evaluating the Performance of Argument Mining
Authors	Rory Duthie, John Lawrence, Katarzyna Budzynska, Chris Reed
Abstract
Tasks	Argument Mining
Published	2016-08-01
URL	https://www.aclweb.org/anthology/W16-2805/
PDF	https://www.aclweb.org/anthology/W16-2805
PWC	https://paperswithcode.com/paper/the-cass-technique-for-evaluating-the
Repo
Framework

Argumentation: Content, Structure, and Relationship with Essay Quality


Title	Argumentation: Content, Structure, and Relationship with Essay Quality
Authors	Beata Beigman Klebanov, Christian Stab, Jill Burstein, Yi Song, Binod Gyawali, Iryna Gurevych
Abstract
Tasks	Argument Mining
Published	2016-08-01
URL	https://www.aclweb.org/anthology/W16-2808/
PDF	https://www.aclweb.org/anthology/W16-2808
PWC	https://paperswithcode.com/paper/argumentation-content-structure-and
Repo
Framework

Extracting Case Law Sentences for Argumentation about the Meaning of Statutory Terms


Title	Extracting Case Law Sentences for Argumentation about the Meaning of Statutory Terms
Authors	Jaromír Šavelka, Kevin D. Ashley
Abstract
Tasks	Argument Mining
Published	2016-08-01
URL	https://www.aclweb.org/anthology/W16-2806/
PDF	https://www.aclweb.org/anthology/W16-2806
PWC	https://paperswithcode.com/paper/extracting-case-law-sentences-for
Repo
Framework

Government Domain Named Entity Recognition for South African Languages


Title	Government Domain Named Entity Recognition for South African Languages
Authors	Roald Eiselen
Abstract	This paper describes the named entity language resources developed as part of a development project for the South African languages. The development efforts focused on creating protocols and annotated data sets with at least 15,000 annotated named entity tokens for ten of the official South African languages. The description of the protocols and annotated data sets provide an overview of the problems encountered during the annotation of the data sets. Based on these annotated data sets, CRF named entity recognition systems are developed that leverage existing linguistic resources. The newly created named entity recognisers are evaluated, with F-scores of between 0.64 and 0.77, and error analysis is performed to identify possible avenues for improving the quality of the systems.
Tasks	Named Entity Recognition
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1533/
PDF	https://www.aclweb.org/anthology/L16-1533
PWC	https://paperswithcode.com/paper/government-domain-named-entity-recognition
Repo
Framework

Wikification for Scriptio Continua


Title	Wikification for Scriptio Continua
Authors	Yugo Murawaki, Shinsuke Mori
Abstract	The fact that Japanese employs scriptio continua, or a writing system without spaces, complicates the first step of an NLP pipeline. Word segmentation is widely used in Japanese language processing, and lexical knowledge is crucial for reliable identification of words in text. Although external lexical resources like Wikipedia are potentially useful, segmentation mismatch prevents them from being straightforwardly incorporated into the word segmentation task. If we intentionally violate segmentation standards with the direct incorporation, quantitative evaluation will be no longer feasible. To address this problem, we propose to define a separate task that directly links given texts to an external resource, that is, wikification in the case of Wikipedia. By doing so, we can circumvent segmentation mismatch that may not necessarily be important for downstream applications. As the first step to realize the idea, we design the task of Japanese wikification and construct wikification corpora. We annotated subsets of the Balanced Corpus of Contemporary Written Japanese plus Twitter short messages. We also implement a simple wikifier and investigate its performance on these corpora.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1214/
PDF	https://www.aclweb.org/anthology/L16-1214
PWC	https://paperswithcode.com/paper/wikification-for-scriptio-continua
Repo
Framework

Constructing and Evaluating Controlled Bilingual Terminologies


Title	Constructing and Evaluating Controlled Bilingual Terminologies
Authors	Rei Miyata, Kyo Kageura
Abstract	This paper presents the construction and evaluation of Japanese and English controlled bilingual terminologies that are particularly intended for controlled authoring and machine translation with special reference to the Japanese municipal domain. Our terminologies are constructed by extracting terms from municipal website texts, and the term variations are controlled by defining preferred and proscribed terms for both the source Japanese and the target English. To assess the coverage of the terms/concepts in the municipal domain and validate the quality of the control, we employ a quantitative extrapolation method that estimates the potential vocabulary size. Using Large-Number-of-Rare-Event (LNRE) modelling, we compare two parameters: (1) uncontrolled and controlled and (2) Japanese and English. The results show that our terminologies currently cover about 45–65% of the terms and 50–65% of the concepts in the municipal domain, and are well controlled. The detailed analysis of growth patterns of terminologies also provides insight into the extent to which we can enlarge the terminologies within the realistic range.
Tasks	Machine Translation
Published	2016-12-01
URL	https://www.aclweb.org/anthology/W16-4710/
PDF	https://www.aclweb.org/anthology/W16-4710
PWC	https://paperswithcode.com/paper/constructing-and-evaluating-controlled
Repo
Framework