May 5, 2019

1964 words 10 mins read

Paper Group NANR 32

Cross-Lingual Question Answering Using Common Semantic Space

Title Cross-Lingual Question Answering Using Common Semantic Space
Authors Amir Pouran Ben Veyseh
Abstract
Tasks Entity Linking, Keyword Extraction, Question Answering, Semantic Parsing, Semantic Textual Similarity
Published 2016-06-01
URL https://www.aclweb.org/anthology/W16-1403/
PDF https://www.aclweb.org/anthology/W16-1403
PWC https://paperswithcode.com/paper/cross-lingual-question-answering-using-common
Repo
Framework

CORILSE: a Spanish Sign Language Repository for Linguistic Analysis

Title CORILSE: a Spanish Sign Language Repository for Linguistic Analysis
Authors María del Carmen Cabeza-Pereiro, José Mª Garcia-Miguel, Carmen García Mateo, José Luis Alba Castro
Abstract CORILSE is a computerized corpus of Spanish Sign Language (Lengua de Signos Española, LSE). It consists of a set of recordings from different discourse genres by Galician signers living in the city of Vigo. In this paper we describe its annotation system, developed on the basis of pre-existing ones (mostly the model of the Auslan corpus). This includes primary annotation of id-glosses for manual signs, annotation of the non-manual component, and secondary annotation of grammatical categories and relations, because this corpus is being built for grammatical analysis, in particular of argument structures in LSE. So far the annotation has mostly been done by hand, which is a slow and time-consuming task. The need to speed up this process has led us to develop automatic or semi-automatic tools for manual and facial recognition. Finally, we also present the web repository that will make the corpus available to different types of users and will allow its exploitation for research purposes and other applications (e.g. teaching of LSE or design of tasks for signed language assessment).
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1223/
PDF https://www.aclweb.org/anthology/L16-1223
PWC https://paperswithcode.com/paper/corilse-a-spanish-sign-language-repository
Repo
Framework

Measuring Lexical Quality of a Historical Finnish Newspaper Collection ― Analysis of Garbled OCR Data with Basic Language Technology Tools and Means

Title Measuring Lexical Quality of a Historical Finnish Newspaper Collection ― Analysis of Garbled OCR Data with Basic Language Technology Tools and Means
Authors Kimmo Kettunen, Tuula Pääkkönen
Abstract The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001). This collection contains approximately 1.95 million pages in Finnish and Swedish; the Finnish part of the collection consists of about 2.39 billion words. The National Library's Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of this material is also freely downloadable from the Language Bank of Finland, provided by the FIN-CLARIN consortium. The collection can also be accessed through the Korp environment, which was developed by Språkbanken at the University of Gothenburg and extended by the FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield-style information retrieval test collection has been produced from a small part of the Digi newspaper material at the University of Tampere (Järvelin et al., 2015). The quality of the OCRed collections is an important topic in digital humanities, as it affects the general usability and searchability of collections. There is no single available method to assess the quality of large collections, but different methods can be used to approximate it. This paper discusses different corpus-analysis-style ways to approximate the overall lexical quality of the Finnish part of the Digi collection.
Tasks Information Retrieval, Optical Character Recognition
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1152/
PDF https://www.aclweb.org/anthology/L16-1152
PWC https://paperswithcode.com/paper/measuring-lexical-quality-of-a-historical
Repo
Framework
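
A crude way to approximate the lexical quality discussed in the abstract above is to measure what share of tokens a dictionary or morphological analyzer recognizes: heavily garbled OCR text yields a low recognition rate. The snippet below is a minimal sketch of that idea, not the paper's actual tooling; the word list and file names are placeholder assumptions.

```python
import re

def recognition_rate(text: str, lexicon: set) -> float:
    """Share of alphabetic tokens found in a reference word-form list.

    A rough proxy for lexical quality: garbled OCR tokens rarely
    match any real word form.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    recognized = sum(1 for t in tokens if t in lexicon)
    return recognized / len(tokens)

# Hypothetical inputs: a Finnish word-form list and one OCRed page.
lexicon = set(open("finnish_wordforms.txt", encoding="utf-8").read().split())
page = open("ocr_page.txt", encoding="utf-8").read()
print(f"recognition rate: {recognition_rate(page, lexicon):.1%}")
```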

Phonotactic Modeling of Extremely Low Resource Languages

Title Phonotactic Modeling of Extremely Low Resource Languages
Authors Andrei Shcherbakov, Ekaterina Vylomova, Nick Thieberger
Abstract
Tasks Language Modelling
Published 2016-12-01
URL https://www.aclweb.org/anthology/U16-1009/
PDF https://www.aclweb.org/anthology/U16-1009
PWC https://paperswithcode.com/paper/phonotactic-modeling-of-extremely-low
Repo
Framework

Using Relevant Public Posts to Enhance News Article Summarization

Title Using Relevant Public Posts to Enhance News Article Summarization
Authors Chen Li, Zhongyu Wei, Yang Liu, Yang Jin, Fei Huang
Abstract A news article summary usually consists of 2-3 key sentences that reflect the gist of that news article. In this paper we explore using the public posts that follow a news article to improve automatic summary generation for the article. We propose different approaches to incorporate information from public posts: using frequency information from the posts to re-estimate bigram weights in the ILP-based summarization model and to re-weight a dependency tree edge's importance for sentence compression, directly selecting sentences from posts as the final summary, and finally a strategy to combine the summarization results generated from news articles and posts. Our experiments on data collected from Facebook show that relevant public posts provide useful information and can be effectively leveraged to improve news article summarization results.
Tasks Sentence Compression
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-1054/
PDF https://www.aclweb.org/anthology/C16-1054
PWC https://paperswithcode.com/paper/using-relevant-public-posts-to-enhance-news
Repo
Framework
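
One concrete idea in the abstract above is to re-estimate bigram (concept) weights for the ILP summarizer using frequencies observed in reader posts. The sketch below shows one plausible interpolation of article and post frequencies with a hypothetical mixing weight `lam`; it is an illustration, not the authors' exact formulation.

```python
from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

def reweighted_bigrams(article_tokens, post_tokens, lam=0.7):
    """Interpolate relative bigram frequencies from an article and its posts.

    lam near 1.0 trusts the article alone; lower values let bigrams that
    readers repeat often gain weight. A stand-in for the concept weights
    fed to an ILP sentence-selection model.
    """
    a, p = bigram_counts(article_tokens), bigram_counts(post_tokens)
    a_total, p_total = sum(a.values()) or 1, sum(p.values()) or 1
    return {bg: lam * a[bg] / a_total + (1 - lam) * p[bg] / p_total
            for bg in set(a) | set(p)}

# Toy usage with made-up token lists.
weights = reweighted_bigrams("the storm hit the coast".split(),
                             "the storm was terrible and the coast flooded".split())
print(sorted(weights.items(), key=lambda kv: -kv[1])[:3])
```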

Using SMT for OCR Error Correction of Historical Texts

Title Using SMT for OCR Error Correction of Historical Texts
Authors Haithem Afli, Zhengwei Qiu, Andy Way, Páraic Sheridan
Abstract A trend to digitize historical paper-based archives has emerged in recent years, with the advent of digital optical scanners. A lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, various kinds of errors appear in OCR system output, and Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement compared to the initial baseline.
Tasks Language Modelling, Machine Translation, Optical Character Recognition
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1153/
PDF https://www.aclweb.org/anthology/L16-1153
PWC https://paperswithcode.com/paper/using-smt-for-ocr-error-correction-of
Repo
Framework
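
The method named in the abstract above treats OCR post-correction as translation from noisy OCR output into clean text. Training any such SMT system starts from a line-aligned parallel corpus; the sketch below shows only that data-preparation step, with invented example lines and placeholder file names, not the paper's actual pipeline.

```python
# Line-aligned (noisy, clean) pairs in the plain source/target file format
# that phrase-based SMT toolkits such as Moses conventionally consume.
pairs = [
    ("Tbe qnick brown f0x", "The quick brown fox"),
    ("jumps ovcr the lazy d0g", "jumps over the lazy dog"),
]

with open("train.ocr", "w", encoding="utf-8") as src, \
     open("train.clean", "w", encoding="utf-8") as tgt:
    for noisy, clean in pairs:
        src.write(noisy.strip() + "\n")   # source side: raw OCR output
        tgt.write(clean.strip() + "\n")   # target side: manually corrected text
```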

Crowdsourcing Ontology Lexicons

Title Crowdsourcing Ontology Lexicons
Authors Bettina Lanser, Christina Unger, Philipp Cimiano
Abstract In order to make the growing amount of conceptual knowledge available through ontologies and datasets accessible to humans, NLP applications need access to information on how this knowledge can be verbalized in natural language. One way to provide this kind of information is ontology lexicons, which, apart from the actual verbalizations in a given target language, can provide further rich linguistic information about them. Compiling such lexicons manually is a very time-consuming task and requires expertise both in Semantic Web technologies and lexicon engineering, as well as a very good knowledge of the target language at hand. In this paper we present an alternative approach to generating ontology lexicons by means of crowdsourcing: We use CrowdFlower to generate a small Japanese ontology lexicon for ten exemplary ontology elements from the DBpedia ontology according to a two-stage workflow, the main underlying idea of which is to turn the task of generating lexicon entries into a translation task; the starting point of this translation task is a manually created English lexicon for DBpedia. Comparison of the results to a manually created Japanese lexicon shows that the presented workflow is a viable option if an English seed lexicon is already available.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1554/
PDF https://www.aclweb.org/anthology/L16-1554
PWC https://paperswithcode.com/paper/crowdsourcing-ontology-lexicons
Repo
Framework
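
An ontology lexicon, as described in the abstract above, attaches natural-language verbalizations and linguistic metadata to ontology elements. The record below is a purely hypothetical illustration of what one crowdsourced entry might carry; the field names are assumptions, not the lexicon schema the authors actually use.

```python
from dataclasses import dataclass

@dataclass
class LexiconEntry:
    ontology_iri: str   # the DBpedia element being verbalized
    language: str       # target language of the verbalization
    lemma: str          # canonical written form
    pos: str            # part of speech of the verbalization

entry = LexiconEntry(
    ontology_iri="http://dbpedia.org/ontology/spouse",
    language="ja",
    lemma="配偶者",
    pos="noun",
)
print(entry)
```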

Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus

Title Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus
Authors Simon Clematide, Lenz Furrer, Martin Volk
Abstract Crowdsourcing approaches for post-correction of OCR (Optical Character Recognition) output have been successfully applied to several historic text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized yearbooks of the Swiss Alpine Club (SAC) from the 19th century. This multilingual heritage corpus consists of Alpine texts mainly written in German and French, all typeset in Antiqua font. Finding and engaging volunteers to correct large amounts of pages into high-quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous efforts to keep the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 months, achieving an OCR gold standard with a systematically evaluated accuracy of 99.7% on the word level. The crowdsourced OCR gold standard and the corresponding original OCR recognition results from Abby FineReader 7 for each page are available as a resource. Additionally, the scanned images (300 dpi) of all pages are included in order to facilitate tests with other OCR software.
Tasks Optical Character Recognition
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1155/
PDF https://www.aclweb.org/anthology/L16-1155
PWC https://paperswithcode.com/paper/crowdsourcing-an-ocr-gold-standard-for-a
Repo
Framework
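
The 99.7% figure in the abstract above is word-level accuracy of the OCR output measured against the crowd-corrected gold standard. A minimal way to compute such a number, assuming whitespace tokenization and a standard edit-distance alignment, is sketched below; the real evaluation setup may differ.

```python
def word_accuracy(ocr: str, gold: str) -> float:
    """1 minus word error rate, via Levenshtein distance over tokens."""
    h, r = ocr.split(), gold.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                 # extra OCR word
                           cur[j - 1] + 1,              # missing word
                           prev[j - 1] + (hw != rw)))   # substitution or match
        prev = cur
    return 1.0 - prev[-1] / max(len(r), 1)

# Toy page pair; prints 66.7% (one of three words is wrong).
print(f"{word_accuracy('Der Bcrg ruft', 'Der Berg ruft'):.1%}")
```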

A semantic-affective compositional approach for the affective labelling of adjective-noun and noun-noun pairs

Title A semantic-affective compositional approach for the affective labelling of adjective-noun and noun-noun pairs
Authors Elisavet Palogiannidi, Elias Iosif, Polychronis Koutsakis, Alexandros Potamianos
Abstract
Tasks Semantic Textual Similarity, Sentiment Analysis
Published 2016-06-01
URL https://www.aclweb.org/anthology/W16-0424/
PDF https://www.aclweb.org/anthology/W16-0424
PWC https://paperswithcode.com/paper/a-semantic-affective-compositional-approach
Repo
Framework

A Finite-state Morphological Analyser for Tuvan

Title A Finite-state Morphological Analyser for Tuvan
Authors Francis Tyers, Aziyana Bayyr-ool, Aelita Salchak, Jonathan Washington
Abstract This paper describes the development of free/open-source finite-state morphological transducers for Tuvan, a Turkic language spoken in and around the Tuvan Republic in Russia. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST); we use the lexc formalism for modelling the morphotactics and the twol formalism for modelling morphophonological alternations. We present a novel description of the morphological combinatorics of pseudo-derivational morphemes in Tuvan. An evaluation is presented which shows that the transducer has reasonable coverage (around 93%) on freely available corpora of the language, and high precision (over 99%) on a manually verified test set.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1407/
PDF https://www.aclweb.org/anthology/L16-1407
PWC https://paperswithcode.com/paper/a-finite-state-morphological-analyser-for
Repo
Framework
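
Coverage in the abstract above is the share of corpus tokens for which the transducer returns at least one analysis. The function below computes that share for any `analyze` callable; the stub lookup and corpus path are hypothetical stand-ins for an actual HFST transducer lookup.

```python
def coverage(tokens, analyze):
    """Share of tokens that receive at least one morphological analysis."""
    total = analyzed = 0
    for tok in tokens:
        total += 1
        if analyze(tok):          # non-empty list of analyses
            analyzed += 1
    return analyzed / total if total else 0.0

# Hypothetical stand-in for a transducer lookup (e.g. wrapping hfst-lookup).
def lookup(token):
    return ["<stub>+N+Sg"] if token.isalpha() else []

corpus_tokens = open("tuvan_corpus.txt", encoding="utf-8").read().split()
print(f"naive coverage: {coverage(corpus_tokens, lookup):.1%}")
```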

The Physics of Text: Ontological Realism in Information Extraction

Title The Physics of Text: Ontological Realism in Information Extraction
Authors Stuart Russell, Ole Torp Lassen, Justin Uang, Wei Wang
Abstract
Tasks Common Sense Reasoning, Open Information Extraction, Probabilistic Programming, Question Answering
Published 2016-06-01
URL https://www.aclweb.org/anthology/W16-1310/
PDF https://www.aclweb.org/anthology/W16-1310
PWC https://paperswithcode.com/paper/the-physics-of-text-ontological-realism-in
Repo
Framework

Mining Knowledge in Storytelling Systems for Narrative Generation

Title Mining Knowledge in Storytelling Systems for Narrative Generation
Authors Eugenio Concepción, Pablo Gervás, Gonzalo Méndez
Abstract
Tasks Text Generation
Published 2016-09-01
URL https://www.aclweb.org/anthology/W16-5507/
PDF https://www.aclweb.org/anthology/W16-5507
PWC https://paperswithcode.com/paper/mining-knowledge-in-storytelling-systems-for
Repo
Framework

FOLK-Gold ― A Gold Standard for Part-of-Speech-Tagging of Spoken German

Title FOLK-Gold ― A Gold Standard for Part-of-Speech-Tagging of Spoken German
Authors Swantje Westpfahl, Thomas Schmidt
Abstract In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers ― transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags ― all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart-Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.
Tasks Lemmatization, Part-Of-Speech Tagging
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1237/
PDF https://www.aclweb.org/anthology/L16-1237
PWC https://paperswithcode.com/paper/folk-gold-a-gold-standard-for-part-of-speech
Repo
Framework
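
The four annotation layers listed in the abstract above (transcription, normalization, lemmatization, POS) can be pictured as parallel values attached to each token. The record below is a purely illustrative sketch with an invented example token and STTS-style tag; it does not reproduce the FOLK data format.

```python
from dataclasses import dataclass

@dataclass
class FolkToken:
    transcription: str   # modified orthography, as transcribed
    normalization: str   # standard orthography
    lemma: str
    pos: str             # extended STTS tag

tok = FolkToken(transcription="hab", normalization="habe", lemma="haben", pos="VAFIN")
print(tok)
```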

Incremental Generation of Visually Grounded Language in Situated Dialogue (demonstration system)

Title Incremental Generation of Visually Grounded Language in Situated Dialogue (demonstration system)
Authors Yanchao Yu, Arash Eshghi, Oliver Lemon
Abstract
Tasks Text Generation
Published 2016-09-01
URL https://www.aclweb.org/anthology/W16-6619/
PDF https://www.aclweb.org/anthology/W16-6619
PWC https://paperswithcode.com/paper/incremental-generation-of-visually-grounded
Repo
Framework

A framework for the automatic inference of stochastic turn-taking styles

Title A framework for the automatic inference of stochastic turn-taking styles
Authors Kornel Laskowski
Abstract
Tasks Speaker Diarization
Published 2016-09-01
URL https://www.aclweb.org/anthology/W16-3624/
PDF https://www.aclweb.org/anthology/W16-3624
PWC https://paperswithcode.com/paper/a-framework-for-the-automatic-inference-of
Repo
Framework