May 5, 2019


Paper Group NANR 58


Understanding Satirical Articles Using Common-Sense. A Study of Suggestions in Opinionated Texts and their Automatic Detection. The United Nations Parallel Corpus v1.0. Multilingual Code-switching Identification via LSTM Recurrent Neural Networks. Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions. NorGramBank: A 'Deep' Tr …

Understanding Satirical Articles Using Common-Sense

Title Understanding Satirical Articles Using Common-Sense
Authors Dan Goldwasser, Xiao Zhang
Abstract Automatic satire detection is a subtle text classification task, for machines and at times, even for humans. In this paper we argue that satire detection should be approached using common-sense inferences, rather than traditional text classification methods. We present a highly structured latent variable model capturing the required inferences. The model abstracts over the specific entities appearing in the articles, grouping them into generalized categories, thus allowing the model to adapt to previously unseen situations.
Tasks Common Sense Reasoning, Text Classification
Published 2016-01-01
URL https://www.aclweb.org/anthology/Q16-1038/
PDF https://www.aclweb.org/anthology/Q16-1038
PWC https://paperswithcode.com/paper/understanding-satirical-articles-using-common
Repo
Framework

A Study of Suggestions in Opinionated Texts and their Automatic Detection

Title A Study of Suggestions in Opinionated Texts and their Automatic Detection
Authors Sapna Negi, Kartik Asooja, Shubham Mehrotra, Paul Buitelaar
Abstract
Tasks Opinion Mining, Sentence Classification, Sentiment Analysis
Published 2016-08-01
URL https://www.aclweb.org/anthology/S16-2022/
PDF https://www.aclweb.org/anthology/S16-2022
PWC https://paperswithcode.com/paper/a-study-of-suggestions-in-opinionated-texts
Repo
Framework

The United Nations Parallel Corpus v1.0

Title The United Nations Parallel Corpus v1.0
Authors Michał Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen
Abstract This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal license. Apart from the pairwise aligned documents, a fully aligned subcorpus for the six official UN languages is distributed. We provide baseline BLEU scores of our Moses-based SMT systems trained with the full data of language pairs involving English and for all possible translation directions of the six-way subcorpus.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1561/
PDF https://www.aclweb.org/anthology/L16-1561
PWC https://paperswithcode.com/paper/the-united-nations-parallel-corpus-v10
Repo
Framework
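The paper reports baseline BLEU scores for Moses-based SMT systems trained on the corpus. As a minimal sketch of how such a corpus-level BLEU score is computed, the following uses NLTK's implementation rather than the Moses/mteval scorer the authors used, and the toy tokenized sentences are invented for illustration:

```python
# Corpus-level BLEU over (reference, hypothesis) pairs. This is a sketch
# with NLTK's scorer, not the mteval/Moses tooling used in the paper;
# the sentences below are made up for illustration.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["the", "assembly", "adopted", "the", "resolution"]],
    [["the", "council", "met", "in", "new", "york"]],
]
hypotheses = [
    ["the", "assembly", "adopted", "the", "resolution"],
    ["the", "council", "gathered", "in", "new", "york"],
]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```

Note that each hypothesis may have several references (hence the extra nesting), and BLEU is computed over the whole corpus, not averaged per sentence.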

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks

Title Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Authors Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, Thamar Solorio
Abstract
Tasks Language Identification
Published 2016-11-01
URL https://www.aclweb.org/anthology/W16-5806/
PDF https://www.aclweb.org/anthology/W16-5806
PWC https://paperswithcode.com/paper/multilingual-code-switching-identification
Repo
Framework

Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions

Title Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Authors Liesbeth Augustinus, Vincent Vandeghinste, Tom Vanallemeersch
Abstract We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks, based on the monolingual GrETEL environment. We provide online access to the Europarl parallel treebank for Dutch and English, allowing users to query the treebank using either an XPath expression or an example sentence in order to look for similar constructions. We provide automatic alignments between the nodes. By combining example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1564/
PDF https://www.aclweb.org/anthology/L16-1564
PWC https://paperswithcode.com/paper/poly-gretel-cross-lingual-example-based
Repo
Framework
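The abstract describes querying a parallel treebank with XPath expressions derived from example sentences. As a rough sketch of what such a structural query looks like, the following runs an XPath-style query over a hand-made, Alpino-style toy tree (not real Europarl data), using the XPath subset in Python's standard library:

```python
# Sketch of the kind of structural query GrETEL derives from an example
# phrase, run over a toy Alpino-style tree. The XML is hand-made for
# illustration, not real Europarl treebank data.
import xml.etree.ElementTree as ET

tree_xml = """
<alpino_ds>
  <node cat="top">
    <node rel="su" pt="n" word="users"/>
    <node rel="hd" pt="ww" word="query"/>
    <node rel="obj1" cat="np">
      <node rel="det" pt="lid" word="the"/>
      <node rel="hd" pt="n" word="treebank"/>
    </node>
  </node>
</alpino_ds>
"""

root = ET.fromstring(tree_xml)
matches = []
# Find NP nodes whose head child is a noun -- roughly the structural
# query an example phrase like "the treebank" expands to.
for np in root.iterfind(".//node[@cat='np']"):
    if any(c.get("rel") == "hd" and c.get("pt") == "n" for c in np):
        matches.append(" ".join(n.get("word") for n in np.iter("node")
                                if n.get("word")))
print(matches)
```

In Poly-GrETEL the matched source-language node can then be followed through the node alignments to its target-language counterpart, which is what removes the need to know both treebanks' structures.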

NorGramBank: A 'Deep' Treebank for Norwegian

Title NorGramBank: A 'Deep' Treebank for Norwegian
Authors Helge Dyvik, Paul Meurer, Victoria Rosén, Koenraad De Smedt, Petter Haugereid, Gyri Smørdal Losnegaard, Gunn Inger Lyse, Martha Thunes
Abstract We present NorGramBank, a treebank for Norwegian with highly detailed LFG analyses. It is one of many treebanks made available through the INESS treebanking infrastructure. NorGramBank was constructed as a parsebank, i.e. by automatically parsing a corpus, using the wide coverage grammar NorGram. One part consisting of 350,000 words has been manually disambiguated using computer-generated discriminants. A larger part of 50 M words has been stochastically disambiguated. The treebank is dynamic: by global reparsing at certain intervals it is kept compatible with the latest versions of the grammar and the lexicon, which are continually further developed in interaction with the annotators. A powerful query language, INESS Search, has been developed for search across formalisms in the INESS treebanks, including LFG c- and f-structures. Evaluation shows that the grammar provides about 85% of randomly selected sentences with good analyses. Agreement among the annotators responsible for manual disambiguation is satisfactory, but also suggests desirable simplifications of the grammar.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1565/
PDF https://www.aclweb.org/anthology/L16-1565
PWC https://paperswithcode.com/paper/norgrambank-a-deep-treebank-for-norwegian
Repo
Framework

Launch and Iterate: Reducing Prediction Churn

Title Launch and Iterate: Reducing Prediction Churn
Authors Mahdi Milani Fard, Quentin Cormier, Kevin Canini, Maya Gupta
Abstract Practical applications of machine learning often involve successive training iterations with changes to features and training examples. Ideally, changes in the output of any new model should only be improvements (wins) over the previous iteration, but in practice the predictions may change neutrally for many examples, resulting in extra net-zero wins and losses, referred to as unnecessary churn. These changes in the predictions are problematic for usability for some applications, and make it harder and more expensive to measure if a change is statistically significant positive. In this paper, we formulate the problem and present a stabilization operator to regularize a classifier towards a previous classifier. We use a Markov chain Monte Carlo stabilization operator to produce a model with more consistent predictions without adversely affecting accuracy. We investigate the properties of the proposal with theoretical analysis. Experiments on benchmark datasets for different classification algorithms demonstrate the method and the resulting reduction in churn.
Tasks
Published 2016-12-01
URL http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn
PDF http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn.pdf
PWC https://paperswithcode.com/paper/launch-and-iterate-reducing-prediction-churn
Repo
Framework
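The core idea of the abstract, regularizing a new classifier toward a previous one so that retraining changes fewer predictions, can be sketched in a few lines. The following is one simple instance of that idea (an anchor penalty on a logistic regression), not the paper's actual Markov chain Monte Carlo stabilization operator; the data and the lambda value are invented for illustration:

```python
# Sketch: train a new classifier with a penalty pulling its predicted
# probabilities toward those of a previously launched model, reducing
# prediction churn. This is a simplified stand-in for the paper's
# stabilization operator; data and hyperparameters are made up.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, anchor=None, lam=0.0, steps=500, lr=0.1):
    """Logistic regression by gradient descent. If `anchor` holds the
    old model's probabilities, add a term pulling predictions toward
    them (the stabilization penalty)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)
        if anchor is not None:
            grad += lam * X.T @ (p - anchor) / len(y)
        w -= lr * grad
    return w

def churn(p_new, p_old):
    """Fraction of examples whose predicted label flipped."""
    return float(np.mean((p_new > 0.5) != (p_old > 0.5)))

w_old = train(X[:100], y[:100])              # previous launch, less data
p_old = sigmoid(X @ w_old)
w_free = train(X, y)                         # retrained, unconstrained
w_stab = train(X, y, anchor=p_old, lam=1.0)  # regularized toward old

c_free = churn(sigmoid(X @ w_free), p_old)
c_stab = churn(sigmoid(X @ w_stab), p_old)
print("churn unconstrained:", c_free)
print("churn stabilized:  ", c_stab)
```

The design point mirrored from the paper is that churn is measured against the previous model's decisions, and the penalty trades a little freedom to move the decision boundary for fewer flipped predictions.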

NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity

Title NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity
Authors Kolawole Adebayo, Luigi Di Caro, Guido Boella
Abstract
Tasks Machine Translation, Semantic Textual Similarity, Text Summarization
Published 2016-06-01
URL https://www.aclweb.org/anthology/S16-1111/
PDF https://www.aclweb.org/anthology/S16-1111
PWC https://paperswithcode.com/paper/normas-at-semeval-2016-task-1-semsim-a-multi
Repo
Framework

Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts

Title Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts
Authors Anne-Kathrin Schumann, Stefan Fischer
Abstract The specialised lexicon belongs to the most prominent attributes of specialised writing: Terms function as semantically dense encodings of specialised concepts, which, in the absence of terms, would require lengthy explanations and descriptions. In this paper, we argue that terms are the result of diachronic processes on both the semantic and the morpho-syntactic level. Very little is known about these processes. We therefore present a corpus annotation project aiming at revealing how terms are coined and how they evolve to fit their function as semantically and morpho-syntactically dense encodings of specialised knowledge. The scope of this paper is two-fold: Firstly, we outline our methodology for annotating terminology in a diachronic corpus of scientific publications. Moreover, we provide a detailed analysis of our annotation results and suggest methods for improving the accuracy of annotations in a setting as difficult as ours. Secondly, we present results of a pilot study based on the annotated terms. The results suggest that terms in older texts are linguistically relatively simple units that are hard to distinguish from the lexicon of general language. We believe that this supports our hypothesis that terminology undergoes diachronic processes of densification and specialisation.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1568/
PDF https://www.aclweb.org/anthology/L16-1568
PWC https://paperswithcode.com/paper/compasses-magnets-water-microscopes
Repo
Framework

KorAP Architecture – Diving in the Deep Sea of Corpus Data

Title KorAP Architecture – Diving in the Deep Sea of Corpus Data
Authors Nils Diewald, Michael Hanl, Eliza Margaretha, Joachim Bingel, Marc Kupietz, Piotr Bański, Andreas Witt
Abstract KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP's design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: An architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1569/
PDF https://www.aclweb.org/anthology/L16-1569
PWC https://paperswithcode.com/paper/korap-architecture-a-diving-in-the-deep-sea
Repo
Framework

Multi-lingual Dependency Parsing Evaluation: a Large-scale Analysis of Word Order Properties using Artificial Data

Title Multi-lingual Dependency Parsing Evaluation: a Large-scale Analysis of Word Order Properties using Artificial Data
Authors Kristina Gulordava, Paola Merlo
Abstract The growing work in multi-lingual parsing faces the challenge of fair comparative evaluation and performance analysis across languages and their treebanks. The difficulty lies in teasing apart the properties of treebanks, such as their size or average sentence length, from those of the annotation scheme, and from the linguistic properties of languages. We propose a method to evaluate the effects of word order of a language on dependency parsing performance, while controlling for confounding treebank properties. The method uses artificially-generated treebanks that are minimal permutations of actual treebanks with respect to two word order properties: word order variation and dependency lengths. Based on these artificial data on twelve languages, we show that longer dependencies and higher word order variability degrade parsing performance. Our method also extends to minimal pairs of individual sentences, leading to a finer-grained understanding of parsing errors.
Tasks Dependency Parsing
Published 2016-01-01
URL https://www.aclweb.org/anthology/Q16-1025/
PDF https://www.aclweb.org/anthology/Q16-1025
PWC https://paperswithcode.com/paper/multi-lingual-dependency-parsing-evaluation-a
Repo
Framework
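One of the two word-order properties the paper permutes, dependency length, can be measured directly from a treebank's head annotations. A minimal sketch on one hand-made CoNLL-style sentence (the heads array is invented for illustration):

```python
# Measure per-arc dependency lengths from CoNLL-style head indices.
# heads[i] is the 1-based head of token i+1; 0 marks the root.
# The example sentence and its heads are made up for illustration.
def dependency_lengths(heads):
    """Distance |dependent - head| for every non-root arc."""
    return [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]

# "the cat chased a mouse":
# the->cat, cat->chased (root), a->mouse, mouse->chased
heads = [2, 3, 0, 5, 3]
lengths = dependency_lengths(heads)
print(lengths)                        # [1, 1, 1, 2]
print(sum(lengths) / len(lengths))    # 1.25
```

Averaging this quantity over a treebank gives the dependency-length statistic whose increase the paper links to degraded parsing performance; the artificial treebanks are permutations that raise or lower it while keeping everything else fixed.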

The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts

Title The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts
Authors Amy Isard
Abstract Using the Methodius Natural Language Generation (NLG) System, we have created a corpus which consists of a collection of generated texts which describe ancient Greek artefacts. Each text is linked to two representations created as part of the NLG process. The first is a content plan, which uses rhetorical relations to describe the high-level discourse structure of the text, and the second is a logical form describing the syntactic structure, which is sent to the OpenCCG surface realization module to produce the final text output. In recent work, White and Howcroft (2015) have used the SPaRKy restaurant corpus, which contains a similar combination of texts and representations, for their research on the induction of rules for the combination of clauses. In the first instance this corpus will be used to test their algorithms on an additional domain, and extend their work to include the learning of referring expression generation rules. As far as we know, the SPaRKy restaurant corpus is the only existing corpus of this type, and we hope that the creation of this new corpus in a different domain will provide a useful resource to the Natural Language Generation community.
Tasks Text Generation
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1273/
PDF https://www.aclweb.org/anthology/L16-1273
PWC https://paperswithcode.com/paper/the-methodius-corpus-of-rhetorical-discourse
Repo
Framework

Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings

Title Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings
Authors Shoma Yamaki, Hiroyuki Shinnou, Kanako Komiya, Minoru Sasaki
Abstract
Tasks Word Embeddings, Word Sense Disambiguation
Published 2016-10-01
URL https://www.aclweb.org/anthology/Y16-2009/
PDF https://www.aclweb.org/anthology/Y16-2009
PWC https://paperswithcode.com/paper/supervised-word-sense-disambiguation-with
Repo
Framework

The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors

Title The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors
Authors Tilia Ellendorff, Simon Foster, Fabio Rinaldi
Abstract We present the first version of a corpus annotated for psychiatric disorders and their etiological factors. The paper describes the choice of text, annotated entities and events/relations as well as the annotation scheme and procedure applied. The corpus is featuring a selection of focus psychiatric disorders including depressive disorder, anxiety disorder, obsessive-compulsive disorder, phobic disorders and panic disorder. Etiological factors for these focus disorders are widespread and include genetic, physiological, sociological and environmental factors among others. Etiological events, including annotated evidence text, represent the interactions between their focus disorders and their etiological factors. Additionally to these core events, symptomatic and treatment events have been annotated. The current version of the corpus includes 175 scientific abstracts. All entities and events/relations have been manually annotated by domain experts and scores of inter-annotator agreement are presented. The aim of the corpus is to provide a first gold standard to support the development of biomedical text mining applications for the specific area of mental disorders which belong to the main contributors to the contemporary burden of disease.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1590/
PDF https://www.aclweb.org/anthology/L16-1590
PWC https://paperswithcode.com/paper/the-psymine-corpus-a-corpus-annotated-with
Repo
Framework

Guidelines and Framework for a Large Scale Arabic Diacritized Corpus

Title Guidelines and Framework for a Large Scale Arabic Diacritized Corpus
Authors Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, Kemal Oflazer
Abstract This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1577/
PDF https://www.aclweb.org/anthology/L16-1577
PWC https://paperswithcode.com/paper/guidelines-and-framework-for-a-large-scale
Repo
Framework