Paper Group NANR 58
Understanding Satirical Articles Using Common-Sense. A Study of Suggestions in Opinionated Texts and their Automatic Detection. The United Nations Parallel Corpus v1.0. Multilingual Code-switching Identification via LSTM Recurrent Neural Networks. Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions. NorGramBank: A `Deep’ Tr …
Understanding Satirical Articles Using Common-Sense
Title | Understanding Satirical Articles Using Common-Sense |
Authors | Dan Goldwasser, Xiao Zhang |
Abstract | Automatic satire detection is a subtle text classification task, for machines and at times, even for humans. In this paper we argue that satire detection should be approached using common-sense inferences, rather than traditional text classification methods. We present a highly structured latent variable model capturing the required inferences. The model abstracts over the specific entities appearing in the articles, grouping them into generalized categories, thus allowing the model to adapt to previously unseen situations. |
Tasks | Common Sense Reasoning, Text Classification |
Published | 2016-01-01 |
URL | https://www.aclweb.org/anthology/Q16-1038/ |
PWC | https://paperswithcode.com/paper/understanding-satirical-articles-using-common |
Repo | |
Framework | |
A Study of Suggestions in Opinionated Texts and their Automatic Detection
Title | A Study of Suggestions in Opinionated Texts and their Automatic Detection |
Authors | Sapna Negi, Kartik Asooja, Shubham Mehrotra, Paul Buitelaar |
Abstract | |
Tasks | Opinion Mining, Sentence Classification, Sentiment Analysis |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/S16-2022/ |
PWC | https://paperswithcode.com/paper/a-study-of-suggestions-in-opinionated-texts |
Repo | |
Framework | |
The United Nations Parallel Corpus v1.0
Title | The United Nations Parallel Corpus v1.0 |
Authors | Michał Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen |
Abstract | This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal license. Apart from the pairwise aligned documents, a fully aligned subcorpus for the six official UN languages is distributed. We provide baseline BLEU scores of our Moses-based SMT systems trained with the full data of language pairs involving English and for all possible translation directions of the six-way subcorpus. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1561/ |
PWC | https://paperswithcode.com/paper/the-united-nations-parallel-corpus-v10 |
Repo | |
Framework | |
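The abstract above reports baseline BLEU scores for Moses-based SMT systems trained on the corpus. As a point of reference, the minimal sketch below scores a pair of hypothetical system outputs against reference translations with sacrebleu; the sentences are invented for illustration, and this is not the paper's Moses pipeline or evaluation setup.

```python
# Toy BLEU scoring with sacrebleu (pip install sacrebleu); the sentences are
# made up and stand in for system output and UN reference translations.
import sacrebleu

hypotheses = [
    "The General Assembly adopted the resolution without a vote.",
    "The committee will meet again next year.",
]
references = [
    "The General Assembly adopted the resolution without a vote.",
    "The Committee will reconvene next year.",
]

# corpus_bleu takes the hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```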
Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Title | Multilingual Code-switching Identification via LSTM Recurrent Neural Networks |
Authors | Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, Thamar Solorio |
Abstract | |
Tasks | Language Identification |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/W16-5806/ |
PWC | https://paperswithcode.com/paper/multilingual-code-switching-identification |
Repo | |
Framework | |
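No abstract is included above, but the title points to LSTM recurrent networks for code-switching identification, i.e. labelling which language each token belongs to. The sketch below is a generic BiLSTM token tagger in PyTorch; the plain word embeddings, layer sizes, and three-way label set are assumptions made for illustration, not the architecture reported in the paper.

```python
# Generic sketch of a token-level BiLSTM tagger for language identification
# in code-switched text; hyperparameters here are illustrative only.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size: int, num_labels: int,
                 emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> per-token label scores
        embedded = self.emb(token_ids)
        hidden_states, _ = self.lstm(embedded)
        return self.out(hidden_states)

# Toy usage: 2 sentences of 5 tokens, 3 labels (e.g. lang1 / lang2 / other).
model = BiLSTMTagger(vocab_size=1000, num_labels=3)
scores = model(torch.randint(0, 1000, (2, 5)))
print(scores.shape)  # torch.Size([2, 5, 3])
```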
Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Title | Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions |
Authors | Liesbeth Augustinus, Vincent Vandeghinste, Tom Vanallemeersch |
Abstract | We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks, based on the monolingual GrETEL environment. We provide online access to the Europarl parallel treebank for Dutch and English, allowing users to query the treebank using either an XPath expression or an example sentence in order to look for similar constructions. We provide automatic alignments between the nodes. By combining example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1564/ |
PWC | https://paperswithcode.com/paper/poly-gretel-cross-lingual-example-based |
Repo | |
Framework | |
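Poly-GrETEL, as described above, lets users query parallel treebanks with XPath expressions or example sentences. The sketch below runs an XPath-style query over a toy Alpino-like XML fragment; the element and attribute names are illustrative rather than GrETEL's actual data model, and Python's ElementTree supports only a subset of XPath, so part of the filtering is done in plain Python.

```python
# Toy XPath-style querying over an Alpino-like treebank encoding.
import xml.etree.ElementTree as ET

toy_treebank = """
<treebank>
  <alpino_ds id="s1">
    <node cat="top">
      <node cat="pp" rel="mod">
        <node rel="hd" pt="vz" word="in"/>
        <node rel="obj1" pt="n" word="Europa"/>
      </node>
    </node>
  </alpino_ds>
</treebank>
"""

root = ET.fromstring(toy_treebank)

# Find prepositional phrases, then check for a prepositional head in Python.
for pp in root.iterfind(".//node[@cat='pp']"):
    heads = [n for n in pp if n.get("rel") == "hd" and n.get("pt") == "vz"]
    if heads:
        words = " ".join(n.get("word") for n in pp.iter() if n.get("word"))
        print("match:", words)   # -> match: in Europa
```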
NorGramBank: A ‘Deep’ Treebank for Norwegian
Title | NorGramBank: A ‘Deep’ Treebank for Norwegian |
Authors | Helge Dyvik, Paul Meurer, Victoria Rosén, Koenraad De Smedt, Petter Haugereid, Gyri Smørdal Losnegaard, Gunn Inger Lyse, Martha Thunes |
Abstract | We present NorGramBank, a treebank for Norwegian with highly detailed LFG analyses. It is one of many treebanks made available through the INESS treebanking infrastructure. NorGramBank was constructed as a parsebank, i.e. by automatically parsing a corpus, using the wide coverage grammar NorGram. One part consisting of 350,000 words has been manually disambiguated using computer-generated discriminants. A larger part of 50 M words has been stochastically disambiguated. The treebank is dynamic: by global reparsing at certain intervals it is kept compatible with the latest versions of the grammar and the lexicon, which are continually further developed in interaction with the annotators. A powerful query language, INESS Search, has been developed for search across formalisms in the INESS treebanks, including LFG c- and f-structures. Evaluation shows that the grammar provides about 85% of randomly selected sentences with good analyses. Agreement among the annotators responsible for manual disambiguation is satisfactory, but also suggests desirable simplifications of the grammar. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1565/ |
PWC | https://paperswithcode.com/paper/norgrambank-a-deep-treebank-for-norwegian |
Repo | |
Framework | |
Launch and Iterate: Reducing Prediction Churn
Title | Launch and Iterate: Reducing Prediction Churn |
Authors | Mahdi Milani Fard, Quentin Cormier, Kevin Canini, Maya Gupta |
Abstract | Practical applications of machine learning often involve successive training iterations with changes to features and training examples. Ideally, changes in the output of any new model should only be improvements (wins) over the previous iteration, but in practice the predictions may change neutrally for many examples, resulting in extra net-zero wins and losses, referred to as unnecessary churn. These changes in the predictions are problematic for usability for some applications, and make it harder and more expensive to measure if a change is statistically significant positive. In this paper, we formulate the problem and present a stabilization operator to regularize a classifier towards a previous classifier. We use a Markov chain Monte Carlo stabilization operator to produce a model with more consistent predictions without adversely affecting accuracy. We investigate the properties of the proposal with theoretical analysis. Experiments on benchmark datasets for different classification algorithms demonstrate the method and the resulting reduction in churn. |
Tasks | |
Published | 2016-12-01 |
URL | http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn |
PDF | http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn.pdf |
PWC | https://paperswithcode.com/paper/launch-and-iterate-reducing-prediction-churn |
Repo | |
Framework | |
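The abstract above defines unnecessary churn as net-zero prediction changes between successive model versions and proposes regularizing a new classifier towards its predecessor. The sketch below shows the churn metric and a simple anchor-style blending of labels with the previous model's predictions; it illustrates the general idea only and is not the paper's Markov chain Monte Carlo stabilization operator, and the blending weight alpha is a hypothetical knob.

```python
# Minimal sketch: measuring churn and anchoring training targets to the
# previous model's predictions (illustration only, not the paper's operator).
import numpy as np

def churn(old_preds: np.ndarray, new_preds: np.ndarray) -> float:
    """Fraction of examples on which the new model disagrees with the old one."""
    return float(np.mean(old_preds != new_preds))

def anchored_targets(labels: np.ndarray, old_preds: np.ndarray,
                     alpha: float = 0.3) -> np.ndarray:
    """Blend true labels with the previous model's predictions so training is
    pulled toward agreement with the old model (alpha is illustrative)."""
    return (1.0 - alpha) * labels + alpha * old_preds

old = np.array([1, 0, 1, 1, 0])
new = np.array([1, 1, 1, 0, 0])
print("churn:", churn(old, new))  # 0.4: two of five predictions flipped
print(anchored_targets(np.array([1., 0., 1., 1., 0.]), old.astype(float)))
```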
NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity
Title | NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity |
Authors | Kolawole Adebayo, Luigi Di Caro, Guido Boella |
Abstract | |
Tasks | Machine Translation, Semantic Textual Similarity, Text Summarization |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1111/ |
PWC | https://paperswithcode.com/paper/normas-at-semeval-2016-task-1-semsim-a-multi |
Repo | |
Framework | |
Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts
Title | Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts |
Authors | Anne-Kathrin Schumann, Stefan Fischer |
Abstract | The specialised lexicon belongs to the most prominent attributes of specialised writing: Terms function as semantically dense encodings of specialised concepts, which, in the absence of terms, would require lengthy explanations and descriptions. In this paper, we argue that terms are the result of diachronic processes on both the semantic and the morpho-syntactic level. Very little is known about these processes. We therefore present a corpus annotation project aiming at revealing how terms are coined and how they evolve to fit their function as semantically and morpho-syntactically dense encodings of specialised knowledge. The scope of this paper is two-fold: Firstly, we outline our methodology for annotating terminology in a diachronic corpus of scientific publications. Moreover, we provide a detailed analysis of our annotation results and suggest methods for improving the accuracy of annotations in a setting as difficult as ours. Secondly, we present results of a pilot study based on the annotated terms. The results suggest that terms in older texts are linguistically relatively simple units that are hard to distinguish from the lexicon of general language. We believe that this supports our hypothesis that terminology undergoes diachronic processes of densification and specialisation. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1568/ |
PWC | https://paperswithcode.com/paper/compasses-magnets-water-microscopes |
Repo | |
Framework | |
KorAP Architecture ― Diving in the Deep Sea of Corpus Data
Title | KorAP Architecture ― Diving in the Deep Sea of Corpus Data |
Authors | Nils Diewald, Michael Hanl, Eliza Margaretha, Joachim Bingel, Marc Kupietz, Piotr Bański, Andreas Witt |
Abstract | KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP's design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: An architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1569/ |
PWC | https://paperswithcode.com/paper/korap-architecture-a-diving-in-the-deep-sea |
Repo | |
Framework | |
Multi-lingual Dependency Parsing Evaluation: a Large-scale Analysis of Word Order Properties using Artificial Data
Title | Multi-lingual Dependency Parsing Evaluation: a Large-scale Analysis of Word Order Properties using Artificial Data |
Authors | Kristina Gulordava, Paola Merlo |
Abstract | The growing work in multi-lingual parsing faces the challenge of fair comparative evaluation and performance analysis across languages and their treebanks. The difficulty lies in teasing apart the properties of treebanks, such as their size or average sentence length, from those of the annotation scheme, and from the linguistic properties of languages. We propose a method to evaluate the effects of word order of a language on dependency parsing performance, while controlling for confounding treebank properties. The method uses artificially-generated treebanks that are minimal permutations of actual treebanks with respect to two word order properties: word order variation and dependency lengths. Based on these artificial data on twelve languages, we show that longer dependencies and higher word order variability degrade parsing performance. Our method also extends to minimal pairs of individual sentences, leading to a finer-grained understanding of parsing errors. |
Tasks | Dependency Parsing |
Published | 2016-01-01 |
URL | https://www.aclweb.org/anthology/Q16-1025/ |
PWC | https://paperswithcode.com/paper/multi-lingual-dependency-parsing-evaluation-a |
Repo | |
Framework | |
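The study above relates parsing performance to word order variation and dependency lengths, measured on minimally permuted treebanks. The sketch below computes per-token dependency lengths (absolute distance between a token and its head) from a toy CoNLL-U-style parse; the sentence is invented and the paper's permutation machinery is not reproduced.

```python
# Dependency lengths from a toy CoNLL-U-style parse (ID in column 1,
# HEAD in column 7); the root attachment is skipped.
conllu = """\
1\tThe\t_\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsat\t_\tVERB\t_\t_\t0\troot\t_\t_
4\ton\t_\tADP\t_\t_\t6\tcase\t_\t_
5\tthe\t_\tDET\t_\t_\t6\tdet\t_\t_
6\tmat\t_\tNOUN\t_\t_\t3\tobl\t_\t_
"""

def dependency_lengths(sentence: str) -> list[int]:
    lengths = []
    for line in sentence.strip().splitlines():
        cols = line.split("\t")
        idx, head = int(cols[0]), int(cols[6])
        if head != 0:  # skip the artificial root attachment
            lengths.append(abs(idx - head))
    return lengths

lengths = dependency_lengths(conllu)
print(lengths, "mean =", sum(lengths) / len(lengths))  # [1, 1, 2, 1, 3] mean = 1.6
```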
The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts
Title | The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts |
Authors | Amy Isard |
Abstract | Using the Methodius Natural Language Generation (NLG) System, we have created a corpus which consists of a collection of generated texts which describe ancient Greek artefacts. Each text is linked to two representations created as part of the NLG process. The first is a content plan, which uses rhetorical relations to describe the high-level discourse structure of the text, and the second is a logical form describing the syntactic structure, which is sent to the OpenCCG surface realization module to produce the final text output. In recent work, White and Howcroft (2015) have used the SPaRKy restaurant corpus, which contains similar combination of texts and representations, for their research on the induction of rules for the combination of clauses. In the first instance this corpus will be used to test their algorithms on an additional domain, and extend their work to include the learning of referring expression generation rules. As far as we know, the SPaRKy restaurant corpus is the only existing corpus of this type, and we hope that the creation of this new corpus in a different domain will provide a useful resource to the Natural Language Generation community. |
Tasks | Text Generation |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1273/ |
PWC | https://paperswithcode.com/paper/the-methodius-corpus-of-rhetorical-discourse |
Repo | |
Framework | |
Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings
Title | Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings |
Authors | Shoma Yamaki, Hiroyuki Shinnou, Kanako Komiya, Minoru Sasaki |
Abstract | |
Tasks | Word Embeddings, Word Sense Disambiguation |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/Y16-2009/ |
PWC | https://paperswithcode.com/paper/supervised-word-sense-disambiguation-with |
Repo | |
Framework | |
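No abstract is included above, but the title points to supervised word sense disambiguation driven by sentence similarities computed from context word embeddings. The sketch below is a generic illustration of that idea: contexts are represented by averaged word vectors and a test occurrence receives the sense of its most similar labelled training context. The toy vectors are random and the 1-nearest-neighbour rule is an assumption, so the printed prediction is arbitrary; this is not the authors' actual setup.

```python
# Generic sketch: sense assignment by cosine similarity of averaged context
# embeddings. The vectors are random placeholders, so the output is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       ["deposit", "money", "river", "water", "bank", "the", "near"]}

def sentence_vector(tokens):
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sense-labelled training contexts for the target word "bank".
train = [(["deposit", "money", "bank"], "bank/finance"),
         (["river", "water", "bank"], "bank/geo")]
test = ["the", "bank", "near", "the", "water"]

test_vec = sentence_vector(test)
best = max(train, key=lambda ex: cosine(sentence_vector(ex[0]), test_vec))
print("predicted sense:", best[1])
```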
The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors
Title | The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors |
Authors | Tilia Ellendorff, Simon Foster, Fabio Rinaldi |
Abstract | We present the first version of a corpus annotated for psychiatric disorders and their etiological factors. The paper describes the choice of text, annotated entities and events/relations as well as the annotation scheme and procedure applied. The corpus is featuring a selection of focus psychiatric disorders including depressive disorder, anxiety disorder, obsessive-compulsive disorder, phobic disorders and panic disorder. Etiological factors for these focus disorders are widespread and include genetic, physiological, sociological and environmental factors among others. Etiological events, including annotated evidence text, represent the interactions between their focus disorders and their etiological factors. Additionally to these core events, symptomatic and treatment events have been annotated. The current version of the corpus includes 175 scientific abstracts. All entities and events/relations have been manually annotated by domain experts and scores of inter-annotator agreement are presented. The aim of the corpus is to provide a first gold standard to support the development of biomedical text mining applications for the specific area of mental disorders which belong to the main contributors to the contemporary burden of disease. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1590/ |
PWC | https://paperswithcode.com/paper/the-psymine-corpus-a-corpus-annotated-with |
Repo | |
Framework | |
Guidelines and Framework for a Large Scale Arabic Diacritized Corpus
Title | Guidelines and Framework for a Large Scale Arabic Diacritized Corpus |
Authors | Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, Kemal Oflazer |
Abstract | This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1577/ |
PWC | https://paperswithcode.com/paper/guidelines-and-framework-for-a-large-scale |
Repo | |
Framework | |