May 5, 2019

Paper Group NANR 49

A Comparison of Weak Supervision methods for Knowledge Base Construction. Predictability of Distributional Semantics in Derivational Word Formation. CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation. Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks. Facilitating Metadata Interoperability in CLA …

A Comparison of Weak Supervision methods for Knowledge Base Construction


Title	A Comparison of Weak Supervision methods for Knowledge Base Construction
Authors	Ameet Soni, Dileep Viswanathan, Niranjan Pachaiyappan, Sriraam Natarajan
Abstract
Tasks	Knowledge Base Population, Relation Extraction
Published	2016-06-01
URL	https://www.aclweb.org/anthology/W16-1318/
PDF	https://www.aclweb.org/anthology/W16-1318
PWC	https://paperswithcode.com/paper/a-comparison-of-weak-supervision-methods-for
Repo
Framework

Predictability of Distributional Semantics in Derivational Word Formation


Title	Predictability of Distributional Semantics in Derivational Word Formation
Authors	Sebastian Padó, Aurélie Herbelot, Max Kisselew, Jan Šnajder
Abstract	Compositional distributional semantic models (CDSMs) have successfully been applied to the task of predicting the meaning of a range of linguistic constructions. Their performance on the semi-compositional word formation process of (morphological) derivation, however, has been extremely variable, with no large-scale empirical investigation to date. This paper fills that gap, performing an analysis of CDSM predictions on a large dataset (over 30,000 German derivationally related word pairs). We use linear regression models to analyze CDSM performance and obtain insights into the linguistic factors that influence how predictable the distributional context of a derived word is going to be. We identify various such factors, notably part of speech, argument structure, and semantic regularity.
Tasks	Machine Translation, Sentiment Analysis, Word Embeddings
Published	2016-12-01
URL	https://www.aclweb.org/anthology/C16-1122/
PDF	https://www.aclweb.org/anthology/C16-1122
PWC	https://paperswithcode.com/paper/predictability-of-distributional-semantics-in
Repo
Framework

CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation


Title	CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation
Authors	Anas Shahrour, Salam Khalifa, Dima Taji, Nizar Habash
Abstract	In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features. CamelParser uses a state-of-the-art morphological disambiguator and improves its results using syntactically driven features. The system offers a number of output formats that include basic dependency with morphological features, two tree visualization modes, and traditional Arabic grammatical analysis.
Tasks	Dependency Parsing, Morphological Analysis, Transliteration
Published	2016-12-01
URL	https://www.aclweb.org/anthology/C16-2048/
PDF	https://www.aclweb.org/anthology/C16-2048
PWC	https://paperswithcode.com/paper/camelparser-a-system-for-arabic-syntactic
Repo
Framework

Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks


Title	Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks
Authors	Juhani Luotolahti, Jenna Kanerva, Filip Ginter
Abstract
Tasks	Language Modelling, Machine Translation
Published	2016-08-01
URL	https://www.aclweb.org/anthology/W16-2353/
PDF	https://www.aclweb.org/anthology/W16-2353
PWC	https://paperswithcode.com/paper/cross-lingual-pronoun-prediction-with-deep
Repo
Framework

Facilitating Metadata Interoperability in CLARIN-DK


Title	Facilitating Metadata Interoperability in CLARIN-DK
Authors	Lene Offersgaard, Dorte Haltrup Hansen
Abstract	The issue for CLARIN archives at the metadata level is to facilitate the user's possibility to describe their data, even with their own standard, and at the same time make these metadata meaningful for a variety of users with a variety of resource types, and ensure that the metadata are useful for search across all resources both at the national and at the European level. We see that different people from different research communities fill in the metadata in different ways even though the metadata was defined and documented. This has impacted when the metadata are harvested and displayed in different environments. A loss of information is at stake. In this paper we view the challenges of ensuring metadata interoperability through examples of propagation of metadata values from the CLARIN-DK archive to the VLO. We see that the CLARIN Community in many ways support interoperability, but argue that agreeing upon standards, making clear definitions of the semantics of the metadata and their content is inevitable for the interoperability to work successfully. The key points are clear and freely available definitions, accessible documentation and easily usable facilities and guidelines for the metadata creators.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1398/
PDF	https://www.aclweb.org/anthology/L16-1398
PWC	https://paperswithcode.com/paper/facilitating-metadata-interoperability-in
Repo
Framework

The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language


Title	The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Authors	Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Ștefan Daniel Dumitrescu, Tiberiu Boroș
Abstract	The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country's most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component 500 million tokens are targeted and for the oral one 300 hours of recordings. The choice of texts is done according to their functional style, domain and subdomain, also with an eye to the international practice. A metadata file (following the CMDI model) is associated to each text file. Collected texts are cleaned and transformed in a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launching. The corpus will be freely available for searching. Users will be able to download the results of their searches and those original files when not against stipulations in the protocols we have with text providers.
Tasks	Lemmatization, Part-Of-Speech Tagging, Tokenization
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1399/
PDF	https://www.aclweb.org/anthology/L16-1399
PWC	https://paperswithcode.com/paper/the-ipr-cleared-corpus-of-contemporary
Repo
Framework

The Public License Selector: Making Open Licensing Easier


Title	The Public License Selector: Making Open Licensing Easier
Authors	Pawel Kamocki, Pavel Straňák, Michal Sedlák
Abstract	Researchers in Natural Language Processing rely on availability of data and software, ideally under open licenses, but little is done to actively encourage it. In fact, the current Copyright framework grants exclusive rights to authors to copy their works, make them available to the public and make derivative works (such as annotated language corpora). Moreover, in the EU databases are protected against unauthorized extraction and re-utilization of their contents. Therefore, proper public licensing plays a crucial role in providing access to research data. A public license is a license that grants certain rights not to one particular user, but to the general public (everybody). Our article presents a tool that we developed and whose purpose is to assist the user in the licensing process. As software and data should be licensed under different licenses, the tool is composed of two separate parts: Data and Software. The underlying logic as well as elements of the graphic interface are presented below.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1402/
PDF	https://www.aclweb.org/anthology/L16-1402
PWC	https://paperswithcode.com/paper/the-public-license-selector-making-open
Repo
Framework

NLP Infrastructure for the Lithuanian Language


Title	NLP Infrastructure for the Lithuanian Language
Authors	Daiva Vitkutė-Adžgauskienė, Andrius Utka, Darius Amilevičius, Tomas Krilavičius
Abstract	The Information System for Syntactic and Semantic Analysis of the Lithuanian language (lith. Lietuvių kalbos sintaksinės ir semantinės analizės informacinė sistema, LKSSAIS) is the first infrastructure for the Lithuanian language combining Lithuanian language tools and resources for diverse linguistic research and applications tasks. It provides access to the basic as well as advanced natural language processing tools and resources, including tools for corpus creation and management, text preprocessing and annotation, ontology building, named entity recognition, morphosyntactic and semantic analysis, sentiment analysis, etc. It is an important platform for researchers and developers in the field of natural language technology.
Tasks	Named Entity Recognition, Sentiment Analysis
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1403/
PDF	https://www.aclweb.org/anthology/L16-1403
PWC	https://paperswithcode.com/paper/nlp-infrastructure-for-the-lithuanian
Repo
Framework

CodE Alltag: A German-Language E-Mail Corpus


Title	CodE Alltag: A German-Language E-Mail Corpus
Authors	Ulrike Krieg-Holz, Christian Schuschnig, Franz Matthies, Benjamin Redling, Udo Hahn
Abstract	We introduce CODE ALLTAG, a text corpus composed of German-language e-mails. It is divided into two partitions: the first of these portions, CODE ALLTAG_XL, consists of a bulk-size collection drawn from an openly accessible e-mail archive (roughly 1.5M e-mails), whereas the second portion, CODE ALLTAG_S+d, is much smaller in size (less than thousand e-mails), yet excels with demographic data from each author of an e-mail. CODE ALLTAG, thus, currently constitutes the largest E-Mail corpus ever built. In this paper, we describe, for both parts, the solicitation process for gathering e-mails, present descriptive statistical properties of the corpus, and, for CODE ALLTAG_S+d, reveal a compilation of demographic features of the donors of e-mails.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1404/
PDF	https://www.aclweb.org/anthology/L16-1404
PWC	https://paperswithcode.com/paper/code-alltag-a-german-language-e-mail-corpus
Repo
Framework

Rapid Development of Morphological Analyzers for Typologically Diverse Languages


Title	Rapid Development of Morphological Analyzers for Typologically Diverse Languages
Authors	Seth Kulick, Ann Bies
Abstract	The Low Resource Language research conducted under DARPA's Broad Operational Language Translation (BOLT) program required the rapid creation of text corpora of typologically diverse languages (Turkish, Hausa, and Uzbek) which were annotated with morphological information, along with other types of annotation. Since the output of morphological analyzers is a significant aid to morphological annotation, we developed a morphological analyzer for each language in order to support the annotation task, and also as a deliverable by itself. Our framework for analyzer creation results in tables similar to those used in the successful SAMA analyzer for Arabic, but with a more abstract linguistic level, from which the tables are derived. A lexicon was developed from available resources for integration with the analyzer, and given the speed of development and uncertain coverage of the lexicon, we assumed that the analyzer would necessarily be lacking in some coverage for the project annotation. Our analyzer framework was therefore focused on rapid implementation of the key structures of the language, together with accepting "wildcard" solutions as possible analyses for a word with an unknown stem, building upon our similar experiences with morphological annotation with Modern Standard Arabic and Egyptian Arabic.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1405/
PDF	https://www.aclweb.org/anthology/L16-1405
PWC	https://paperswithcode.com/paper/rapid-development-of-morphological-analyzers
Repo
Framework

Evaluating word embeddings with fMRI and eye-tracking


Title	Evaluating word embeddings with fMRI and eye-tracking
Authors	Anders Søgaard
Abstract
Tasks	Eye Tracking, Word Embeddings
Published	2016-08-01
URL	https://www.aclweb.org/anthology/W16-2521/
PDF	https://www.aclweb.org/anthology/W16-2521
PWC	https://paperswithcode.com/paper/evaluating-word-embeddings-with-fmri-and-eye
Repo
Framework

A Finite-State Morphological Analyser for Sindhi


Title	A Finite-State Morphological Analyser for Sindhi
Authors	Raveesh Motlani, Francis Tyers, Dipti Sharma
Abstract	Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, machine translation, etc. In this paper, we present our work on the development of a free/open-source finite-state morphological analyser for Sindhi. We have used Apertium's lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia corpus and achieved a reasonable coverage of 81% and a precision of over 97%.
Tasks	Information Retrieval, Machine Translation, Morphological Analysis, Part-Of-Speech Tagging
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1409/
PDF	https://www.aclweb.org/anthology/L16-1409
PWC	https://paperswithcode.com/paper/a-finite-state-morphological-analyser-for-1
Repo
Framework

Morphological Analysis of Sahidic Coptic for Automatic Glossing


Title	Morphological Analysis of Sahidic Coptic for Automatic Glossing
Authors	Daniel Smith, Mans Hulden
Abstract	We report on the implementation of a morphological analyzer for the Sahidic dialect of Coptic, a now extinct Afro-Asiatic language. The system is developed in the finite-state paradigm. The main purpose of the project is to provide a method by which scholars and linguists can semi-automatically gloss extant texts written in Sahidic. Since a complete lexicon containing all attested forms in different manuscripts requires significant expertise in Coptic spanning almost 1,000 years, we have equipped the analyzer with a core lexicon and extended it with a "guesser" ability to capture out-of-vocabulary items in any inflection. We also suggest an ASCII transliteration for the language. A brief evaluation is provided.
Tasks	Morphological Analysis, Transliteration
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1411/
PDF	https://www.aclweb.org/anthology/L16-1411
PWC	https://paperswithcode.com/paper/morphological-analysis-of-sahidic-coptic-for
Repo
Framework

Summarising News Stories for Children


Title	Summarising News Stories for Children
Authors	Iain Macdonald, Advaith Siddharthan
Abstract
Tasks	Text Generation, Text Simplification
Published	2016-09-01
URL	https://www.aclweb.org/anthology/W16-6601/
PDF	https://www.aclweb.org/anthology/W16-6601
PWC	https://paperswithcode.com/paper/summarising-news-stories-for-children
Repo
Framework

Exploring Stylistic Variation with Age and Income on Twitter


Title	Exploring Stylistic Variation with Age and Income on Twitter
Authors	Lucie Flekova, Daniel Preoţiuc-Pietro, Lyle Ungar
Abstract
Tasks	Text Simplification
Published	2016-08-01
URL	https://www.aclweb.org/anthology/P16-2051/
PDF	https://www.aclweb.org/anthology/P16-2051
PWC	https://paperswithcode.com/paper/exploring-stylistic-variation-with-age-and
Repo
Framework