May 5, 2019

Paper Group NANR 49

A Comparison of Weak Supervision methods for Knowledge Base Construction

Title A Comparison of Weak Supervision methods for Knowledge Base Construction
Authors Ameet Soni, Dileep Viswanathan, Niranjan Pachaiyappan, Sriraam Natarajan
Abstract
Tasks Knowledge Base Population, Relation Extraction
Published 2016-06-01
URL https://www.aclweb.org/anthology/W16-1318/
PDF https://www.aclweb.org/anthology/W16-1318
PWC https://paperswithcode.com/paper/a-comparison-of-weak-supervision-methods-for
Repo
Framework

Predictability of Distributional Semantics in Derivational Word Formation

Title Predictability of Distributional Semantics in Derivational Word Formation
Authors Sebastian Padó, Aurélie Herbelot, Max Kisselew, Jan Šnajder
Abstract Compositional distributional semantic models (CDSMs) have successfully been applied to the task of predicting the meaning of a range of linguistic constructions. Their performance on the semi-compositional word formation process of (morphological) derivation, however, has been extremely variable, with no large-scale empirical investigation to date. This paper fills that gap, performing an analysis of CDSM predictions on a large dataset (over 30,000 German derivationally related word pairs). We use linear regression models to analyze CDSM performance and obtain insights into the linguistic factors that influence how predictable the distributional context of a derived word is going to be. We identify various such factors, notably part of speech, argument structure, and semantic regularity.
Tasks Machine Translation, Sentiment Analysis, Word Embeddings
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-1122/
PDF https://www.aclweb.org/anthology/C16-1122
PWC https://paperswithcode.com/paper/predictability-of-distributional-semantics-in
Repo
Framework

CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation

Title CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation
Authors Anas Shahrour, Salam Khalifa, Dima Taji, Nizar Habash
Abstract In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features. CamelParser uses a state-of-the-art morphological disambiguator and improves its results using syntactically driven features. The system offers a number of output formats that include basic dependency with morphological features, two tree visualization modes, and traditional Arabic grammatical analysis.
Tasks Dependency Parsing, Morphological Analysis, Transliteration
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-2048/
PDF https://www.aclweb.org/anthology/C16-2048
PWC https://paperswithcode.com/paper/camelparser-a-system-for-arabic-syntactic
Repo
Framework

Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks

Title Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks
Authors Juhani Luotolahti, Jenna Kanerva, Filip Ginter
Abstract
Tasks Language Modelling, Machine Translation
Published 2016-08-01
URL https://www.aclweb.org/anthology/W16-2353/
PDF https://www.aclweb.org/anthology/W16-2353
PWC https://paperswithcode.com/paper/cross-lingual-pronoun-prediction-with-deep
Repo
Framework

Facilitating Metadata Interoperability in CLARIN-DK

Title Facilitating Metadata Interoperability in CLARIN-DK
Authors Lene Offersgaard, Dorte Haltrup Hansen
Abstract The issue for CLARIN archives at the metadata level is to make it easy for users to describe their data, even with their own standard, while at the same time making these metadata meaningful for a variety of users with a variety of resource types, and ensuring that the metadata are useful for search across all resources at both the national and the European level. We see that people from different research communities fill in the metadata in different ways, even though the metadata were defined and documented. This has an impact when the metadata are harvested and displayed in different environments: information may be lost. In this paper we view the challenges of ensuring metadata interoperability through examples of propagation of metadata values from the CLARIN-DK archive to the VLO. We see that the CLARIN community in many ways supports interoperability, but argue that agreeing upon standards and clearly defining the semantics of the metadata and their content are essential for interoperability to work successfully. The key points are clear and freely available definitions, accessible documentation, and easily usable facilities and guidelines for the metadata creators.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1398/
PDF https://www.aclweb.org/anthology/L16-1398
PWC https://paperswithcode.com/paper/facilitating-metadata-interoperability-in
Repo
Framework

The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language

Title The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Authors Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Ștefan Daniel Dumitrescu, Tiberiu Boroș
Abstract The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country's most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component, 500 million tokens are targeted, and for the oral one, 300 hours of recordings. Texts are chosen according to their functional style, domain and subdomain, also with an eye to international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and transformed into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launch. The corpus will be freely available for searching. Users will be able to download the results of their searches, as well as the original files when this does not conflict with the stipulations in the protocols we have with text providers.
Tasks Lemmatization, Part-Of-Speech Tagging, Tokenization
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1399/
PDF https://www.aclweb.org/anthology/L16-1399
PWC https://paperswithcode.com/paper/the-ipr-cleared-corpus-of-contemporary
Repo
Framework

The Public License Selector: Making Open Licensing Easier

Title The Public License Selector: Making Open Licensing Easier
Authors Pawel Kamocki, Pavel Straňák, Michal Sedlák
Abstract Researchers in Natural Language Processing rely on the availability of data and software, ideally under open licenses, but little is done to actively encourage it. In fact, the current copyright framework grants authors exclusive rights to copy their works, make them available to the public and make derivative works (such as annotated language corpora). Moreover, in the EU, databases are protected against unauthorized extraction and re-utilization of their contents. Therefore, proper public licensing plays a crucial role in providing access to research data. A public license is a license that grants certain rights not to one particular user, but to the general public (everybody). Our article presents a tool that we developed to assist the user in the licensing process. As software and data should be licensed under different licenses, the tool is composed of two separate parts: Data and Software. The underlying logic as well as elements of the graphic interface are presented below.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1402/
PDF https://www.aclweb.org/anthology/L16-1402
PWC https://paperswithcode.com/paper/the-public-license-selector-making-open
Repo
Framework

NLP Infrastructure for the Lithuanian Language

Title NLP Infrastructure for the Lithuanian Language
Authors Daiva Vitkutė-Adžgauskienė, Andrius Utka, Darius Amilevičius, Tomas Krilavičius
Abstract The Information System for Syntactic and Semantic Analysis of the Lithuanian language (lith. Lietuvių kalbos sintaksinės ir semantinės analizės informacinė sistema, LKSSAIS) is the first infrastructure for the Lithuanian language combining Lithuanian language tools and resources for diverse linguistic research and application tasks. It provides access to basic as well as advanced natural language processing tools and resources, including tools for corpus creation and management, text preprocessing and annotation, ontology building, named entity recognition, morphosyntactic and semantic analysis, sentiment analysis, etc. It is an important platform for researchers and developers in the field of natural language technology.
Tasks Named Entity Recognition, Sentiment Analysis
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1403/
PDF https://www.aclweb.org/anthology/L16-1403
PWC https://paperswithcode.com/paper/nlp-infrastructure-for-the-lithuanian
Repo
Framework

CodE Alltag: A German-Language E-Mail Corpus

Title CodE Alltag: A German-Language E-Mail Corpus
Authors Ulrike Krieg-Holz, Christian Schuschnig, Franz Matthies, Benjamin Redling, Udo Hahn
Abstract We introduce CODE ALLTAG, a text corpus composed of German-language e-mails. It is divided into two partitions: the first, CODE ALLTAG_XL, consists of a bulk-size collection drawn from an openly accessible e-mail archive (roughly 1.5M e-mails), whereas the second, CODE ALLTAG_S+d, is much smaller in size (fewer than a thousand e-mails), yet excels with demographic data from each author of an e-mail. CODE ALLTAG, thus, currently constitutes the largest e-mail corpus ever built. In this paper, we describe, for both parts, the solicitation process for gathering e-mails, present descriptive statistical properties of the corpus, and, for CODE ALLTAG_S+d, reveal a compilation of demographic features of the donors of e-mails.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1404/
PDF https://www.aclweb.org/anthology/L16-1404
PWC https://paperswithcode.com/paper/code-alltag-a-german-language-e-mail-corpus
Repo
Framework

Rapid Development of Morphological Analyzers for Typologically Diverse Languages

Title Rapid Development of Morphological Analyzers for Typologically Diverse Languages
Authors Seth Kulick, Ann Bies
Abstract The Low Resource Language research conducted under DARPA's Broad Operational Language Translation (BOLT) program required the rapid creation of text corpora of typologically diverse languages (Turkish, Hausa, and Uzbek) which were annotated with morphological information, along with other types of annotation. Since the output of morphological analyzers is a significant aid to morphological annotation, we developed a morphological analyzer for each language in order to support the annotation task, and also as a deliverable by itself. Our framework for analyzer creation results in tables similar to those used in the successful SAMA analyzer for Arabic, but with a more abstract linguistic level, from which the tables are derived. A lexicon was developed from available resources for integration with the analyzer, and given the speed of development and uncertain coverage of the lexicon, we assumed that the analyzer would necessarily be lacking in some coverage for the project annotation. Our analyzer framework was therefore focused on rapid implementation of the key structures of the language, together with accepting "wildcard" solutions as possible analyses for a word with an unknown stem, building upon our similar experiences with morphological annotation with Modern Standard Arabic and Egyptian Arabic.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1405/
PDF https://www.aclweb.org/anthology/L16-1405
PWC https://paperswithcode.com/paper/rapid-development-of-morphological-analyzers
Repo
Framework

Evaluating word embeddings with fMRI and eye-tracking

Title Evaluating word embeddings with fMRI and eye-tracking
Authors Anders Søgaard
Abstract
Tasks Eye Tracking, Word Embeddings
Published 2016-08-01
URL https://www.aclweb.org/anthology/W16-2521/
PDF https://www.aclweb.org/anthology/W16-2521
PWC https://paperswithcode.com/paper/evaluating-word-embeddings-with-fmri-and-eye
Repo
Framework

A Finite-State Morphological Analyser for Sindhi

Title A Finite-State Morphological Analyser for Sindhi
Authors Raveesh Motlani, Francis Tyers, Dipti Sharma
Abstract Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, machine translation, etc. In this paper, we present our work on the development of a free/open-source finite-state morphological analyser for Sindhi. We have used Apertium's lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia corpus and achieved a reasonable coverage of 81% and a precision of over 97%.
Tasks Information Retrieval, Machine Translation, Morphological Analysis, Part-Of-Speech Tagging
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1409/
PDF https://www.aclweb.org/anthology/L16-1409
PWC https://paperswithcode.com/paper/a-finite-state-morphological-analyser-for-1
Repo
Framework
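
The paradigm-based approach described in the abstract above can be illustrated with a minimal sketch. This is not the authors' system and does not use lttoolbox: the toy English lexicon, the paradigm name, the tag strings and the function names are all hypothetical, and coverage is assumed here to mean the share of tokens that receive at least one analysis.

```python
# Hypothetical toy data, not taken from the paper or from any Sindhi resource.
# A paradigm maps each suffix to the morphological tags of the resulting form.
PARADIGMS = {
    "noun_regular": {"": "n.sg", "s": "n.pl"},
}

# The lexicon assigns each stem (lemma) to one paradigm.
LEXICON = {"book": "noun_regular", "cat": "noun_regular"}


def generate(stem):
    """Expand one stem into all surface forms defined by its paradigm."""
    paradigm = PARADIGMS[LEXICON[stem]]
    return {stem + suffix: f"{stem}<{tags}>" for suffix, tags in paradigm.items()}


def build_analyser(lexicon):
    """Build a lookup table from surface form to its list of analyses."""
    table = {}
    for stem in lexicon:
        for form, analysis in generate(stem).items():
            table.setdefault(form, []).append(analysis)
    return table


def coverage(tokens, table):
    """Fraction of tokens with at least one analysis (assumed definition)."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in table) / len(tokens)


ANALYSER = build_analyser(LEXICON)
```

Looking up `ANALYSER["books"]` returns the analyses for that form, and `coverage(tokens, ANALYSER)` gives the proportion of corpus tokens the table recognises. A real finite-state toolkit compiles such paradigms into a transducer rather than a dictionary.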

Morphological Analysis of Sahidic Coptic for Automatic Glossing

Title Morphological Analysis of Sahidic Coptic for Automatic Glossing
Authors Daniel Smith, Mans Hulden
Abstract We report on the implementation of a morphological analyzer for the Sahidic dialect of Coptic, a now-extinct Afro-Asiatic language. The system is developed in the finite-state paradigm. The main purpose of the project is to provide a method by which scholars and linguists can semi-automatically gloss extant texts written in Sahidic. Since a complete lexicon containing all attested forms in different manuscripts requires significant expertise in Coptic spanning almost 1,000 years, we have equipped the analyzer with a core lexicon and extended it with a "guesser" ability to capture out-of-vocabulary items in any inflection. We also suggest an ASCII transliteration for the language. A brief evaluation is provided.
Tasks Morphological Analysis, Transliteration
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1411/
PDF https://www.aclweb.org/anthology/L16-1411
PWC https://paperswithcode.com/paper/morphological-analysis-of-sahidic-coptic-for
Repo
Framework

Summarising News Stories for Children

Title Summarising News Stories for Children
Authors Iain Macdonald, Advaith Siddharthan
Abstract
Tasks Text Generation, Text Simplification
Published 2016-09-01
URL https://www.aclweb.org/anthology/W16-6601/
PDF https://www.aclweb.org/anthology/W16-6601
PWC https://paperswithcode.com/paper/summarising-news-stories-for-children
Repo
Framework

Exploring Stylistic Variation with Age and Income on Twitter

Title Exploring Stylistic Variation with Age and Income on Twitter
Authors Lucie Flekova, Daniel Preoţiuc-Pietro, Lyle Ungar
Abstract
Tasks Text Simplification
Published 2016-08-01
URL https://www.aclweb.org/anthology/P16-2051/
PDF https://www.aclweb.org/anthology/P16-2051
PWC https://paperswithcode.com/paper/exploring-stylistic-variation-with-age-and
Repo
Framework