Paper Group NANR 49
A Comparison of Weak Supervision methods for Knowledge Base Construction. Predictability of Distributional Semantics in Derivational Word Formation. CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation. Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks. Facilitating Metadata Interoperability in CLA …
A Comparison of Weak Supervision methods for Knowledge Base Construction
Title | A Comparison of Weak Supervision methods for Knowledge Base Construction |
Authors | Ameet Soni, Dileep Viswanathan, Niranjan Pachaiyappan, Sriraam Natarajan |
Abstract | |
Tasks | Knowledge Base Population, Relation Extraction |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/W16-1318/ |
PWC | https://paperswithcode.com/paper/a-comparison-of-weak-supervision-methods-for |
Repo | |
Framework | |
Predictability of Distributional Semantics in Derivational Word Formation
Title | Predictability of Distributional Semantics in Derivational Word Formation |
Authors | Sebastian Padó, Aurélie Herbelot, Max Kisselew, Jan Šnajder |
Abstract | Compositional distributional semantic models (CDSMs) have successfully been applied to the task of predicting the meaning of a range of linguistic constructions. Their performance on semi-compositional word formation process of (morphological) derivation, however, has been extremely variable, with no large-scale empirical investigation to date. This paper fills that gap, performing an analysis of CDSM predictions on a large dataset (over 30,000 German derivationally related word pairs). We use linear regression models to analyze CDSM performance and obtain insights into the linguistic factors that influence how predictable the distributional context of a derived word is going to be. We identify various such factors, notably part of speech, argument structure, and semantic regularity. |
Tasks | Machine Translation, Sentiment Analysis, Word Embeddings |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1122/ |
PWC | https://paperswithcode.com/paper/predictability-of-distributional-semantics-in |
Repo | |
Framework | |
CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation
Title | CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation |
Authors | Anas Shahrour, Salam Khalifa, Dima Taji, Nizar Habash |
Abstract | In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features. CamelParser uses a state-of-the-art morphological disambiguator and improves its results using syntactically driven features. The system offers a number of output formats that include basic dependency with morphological features, two tree visualization modes, and traditional Arabic grammatical analysis. |
Tasks | Dependency Parsing, Morphological Analysis, Transliteration |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-2048/ |
PWC | https://paperswithcode.com/paper/camelparser-a-system-for-arabic-syntactic |
Repo | |
Framework | |
Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks
Title | Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks |
Authors | Juhani Luotolahti, Jenna Kanerva, Filip Ginter |
Abstract | |
Tasks | Language Modelling, Machine Translation |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/W16-2353/ |
PWC | https://paperswithcode.com/paper/cross-lingual-pronoun-prediction-with-deep |
Repo | |
Framework | |
Facilitating Metadata Interoperability in CLARIN-DK
Title | Facilitating Metadata Interoperability in CLARIN-DK |
Authors | Lene Offersgaard, Dorte Haltrup Hansen |
Abstract | The issue for CLARIN archives at the metadata level is to facilitate the user's possibility to describe their data, even with their own standard, and at the same time make these metadata meaningful for a variety of users with a variety of resource types, and ensure that the metadata are useful for search across all resources both at the national and at the European level. We see that different people from different research communities fill in the metadata in different ways even though the metadata was defined and documented. This has impacted when the metadata are harvested and displayed in different environments. A loss of information is at stake. In this paper we view the challenges of ensuring metadata interoperability through examples of propagation of metadata values from the CLARIN-DK archive to the VLO. We see that the CLARIN Community in many ways support interoperability, but argue that agreeing upon standards, making clear definitions of the semantics of the metadata and their content is inevitable for the interoperability to work successfully. The key points are clear and freely available definitions, accessible documentation and easily usable facilities and guidelines for the metadata creators. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1398/ |
PWC | https://paperswithcode.com/paper/facilitating-metadata-interoperability-in |
Repo | |
Framework | |
The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Title | The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language |
Authors | Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Ștefan Daniel Dumitrescu, Tiberiu Boroș |
Abstract | The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country's most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component 500 million tokens are targeted and for the oral one 300 hours of recordings. The choice of texts is done according to their functional style, domain and subdomain, also with an eye to the international practice. A metadata file (following the CMDI model) is associated to each text file. Collected texts are cleaned and transformed in a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launching. The corpus will be freely available for searching. Users will be able to download the results of their searches and those original files when not against stipulations in the protocols we have with text providers. |
Tasks | Lemmatization, Part-Of-Speech Tagging, Tokenization |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1399/ |
PWC | https://paperswithcode.com/paper/the-ipr-cleared-corpus-of-contemporary |
Repo | |
Framework | |
The Public License Selector: Making Open Licensing Easier
Title | The Public License Selector: Making Open Licensing Easier |
Authors | Pawel Kamocki, Pavel Straňák, Michal Sedlák |
Abstract | Researchers in Natural Language Processing rely on availability of data and software, ideally under open licenses, but little is done to actively encourage it. In fact, the current Copyright framework grants exclusive rights to authors to copy their works, make them available to the public and make derivative works (such as annotated language corpora). Moreover, in the EU databases are protected against unauthorized extraction and re-utilization of their contents. Therefore, proper public licensing plays a crucial role in providing access to research data. A public license is a license that grants certain rights not to one particular user, but to the general public (everybody). Our article presents a tool that we developed and whose purpose is to assist the user in the licensing process. As software and data should be licensed under different licenses, the tool is composed of two separate parts: Data and Software. The underlying logic as well as elements of the graphic interface are presented below. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1402/ |
PWC | https://paperswithcode.com/paper/the-public-license-selector-making-open |
Repo | |
Framework | |
NLP Infrastructure for the Lithuanian Language
Title | NLP Infrastructure for the Lithuanian Language |
Authors | Daiva Vitkutė-Adžgauskienė, Andrius Utka, Darius Amilevičius, Tomas Krilavičius |
Abstract | The Information System for Syntactic and Semantic Analysis of the Lithuanian language (lith. Lietuvių kalbos sintaksinės ir semantinės analizės informacinė sistema, LKSSAIS) is the first infrastructure for the Lithuanian language combining Lithuanian language tools and resources for diverse linguistic research and applications tasks. It provides access to the basic as well as advanced natural language processing tools and resources, including tools for corpus creation and management, text preprocessing and annotation, ontology building, named entity recognition, morphosyntactic and semantic analysis, sentiment analysis, etc. It is an important platform for researchers and developers in the field of natural language technology. |
Tasks | Named Entity Recognition, Sentiment Analysis |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1403/ |
PWC | https://paperswithcode.com/paper/nlp-infrastructure-for-the-lithuanian |
Repo | |
Framework | |
CodE Alltag: A German-Language E-Mail Corpus
Title | CodE Alltag: A German-Language E-Mail Corpus |
Authors | Ulrike Krieg-Holz, Christian Schuschnig, Franz Matthies, Benjamin Redling, Udo Hahn |
Abstract | We introduce CODE ALLTAG, a text corpus composed of German-language e-mails. It is divided into two partitions: the first of these portions, CODE ALLTAG_XL, consists of a bulk-size collection drawn from an openly accessible e-mail archive (roughly 1.5M e-mails), whereas the second portion, CODE ALLTAG_S+d, is much smaller in size (less than thousand e-mails), yet excels with demographic data from each author of an e-mail. CODE ALLTAG, thus, currently constitutes the largest E-Mail corpus ever built. In this paper, we describe, for both parts, the solicitation process for gathering e-mails, present descriptive statistical properties of the corpus, and, for CODE ALLTAG_S+d, reveal a compilation of demographic features of the donors of e-mails. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1404/ |
PWC | https://paperswithcode.com/paper/code-alltag-a-german-language-e-mail-corpus |
Repo | |
Framework | |
Rapid Development of Morphological Analyzers for Typologically Diverse Languages
Title | Rapid Development of Morphological Analyzers for Typologically Diverse Languages |
Authors | Seth Kulick, Ann Bies |
Abstract | The Low Resource Language research conducted under DARPA's Broad Operational Language Translation (BOLT) program required the rapid creation of text corpora of typologically diverse languages (Turkish, Hausa, and Uzbek) which were annotated with morphological information, along with other types of annotation. Since the output of morphological analyzers is a significant aid to morphological annotation, we developed a morphological analyzer for each language in order to support the annotation task, and also as a deliverable by itself. Our framework for analyzer creation results in tables similar to those used in the successful SAMA analyzer for Arabic, but with a more abstract linguistic level, from which the tables are derived. A lexicon was developed from available resources for integration with the analyzer, and given the speed of development and uncertain coverage of the lexicon, we assumed that the analyzer would necessarily be lacking in some coverage for the project annotation. Our analyzer framework was therefore focused on rapid implementation of the key structures of the language, together with accepting "wildcard" solutions as possible analyses for a word with an unknown stem, building upon our similar experiences with morphological annotation with Modern Standard Arabic and Egyptian Arabic. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1405/ |
PWC | https://paperswithcode.com/paper/rapid-development-of-morphological-analyzers |
Repo | |
Framework | |
Evaluating word embeddings with fMRI and eye-tracking
Title | Evaluating word embeddings with fMRI and eye-tracking |
Authors | Anders Søgaard |
Abstract | |
Tasks | Eye Tracking, Word Embeddings |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/W16-2521/ |
PWC | https://paperswithcode.com/paper/evaluating-word-embeddings-with-fmri-and-eye |
Repo | |
Framework | |
A Finite-State Morphological Analyser for Sindhi
Title | A Finite-State Morphological Analyser for Sindhi |
Authors | Raveesh Motlani, Francis Tyers, Dipti Sharma |
Abstract | Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, machine translation, etc. In this paper, we present our work on the development of free/open-source finite-state morphological analyser for Sindhi. We have used Apertium's lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia corpus and achieved a reasonable coverage of 81% and a precision of over 97%. |
Tasks | Information Retrieval, Machine Translation, Morphological Analysis, Part-Of-Speech Tagging |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1409/ |
PWC | https://paperswithcode.com/paper/a-finite-state-morphological-analyser-for-1 |
Repo | |
Framework | |
Morphological Analysis of Sahidic Coptic for Automatic Glossing
Title | Morphological Analysis of Sahidic Coptic for Automatic Glossing |
Authors | Daniel Smith, Mans Hulden |
Abstract | We report on the implementation of a morphological analyzer for the Sahidic dialect of Coptic, a now extinct Afro-Asiatic language. The system is developed in the finite-state paradigm. The main purpose of the project is to provide a method by which scholars and linguists can semi-automatically gloss extant texts written in Sahidic. Since a complete lexicon containing all attested forms in different manuscripts requires significant expertise in Coptic spanning almost 1,000 years, we have equipped the analyzer with a core lexicon and extended it with a "guesser" ability to capture out-of-vocabulary items in any inflection. We also suggest an ASCII transliteration for the language. A brief evaluation is provided. |
Tasks | Morphological Analysis, Transliteration |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1411/ |
PWC | https://paperswithcode.com/paper/morphological-analysis-of-sahidic-coptic-for |
Repo | |
Framework | |
Summarising News Stories for Children
Title | Summarising News Stories for Children |
Authors | Iain Macdonald, Advaith Siddharthan |
Abstract | |
Tasks | Text Generation, Text Simplification |
Published | 2016-09-01 |
URL | https://www.aclweb.org/anthology/W16-6601/ |
PWC | https://paperswithcode.com/paper/summarising-news-stories-for-children |
Repo | |
Framework | |
Exploring Stylistic Variation with Age and Income on Twitter
Title | Exploring Stylistic Variation with Age and Income on Twitter |
Authors | Lucie Flekova, Daniel Preoţiuc-Pietro, Lyle Ungar |
Abstract | |
Tasks | Text Simplification |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-2051/ |
PWC | https://paperswithcode.com/paper/exploring-stylistic-variation-with-age-and |
Repo | |
Framework | |