Paper Group NANR 58
Understanding Satirical Articles Using Common-Sense. A Study of Suggestions in Opinionated Texts and their Automatic Detection. The United Nations Parallel Corpus v1.0. Multilingual Code-switching Identification via LSTM Recurrent Neural Networks. Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions. NorGramBank: A `Deep’ Tr …
Understanding Satirical Articles Using Common-Sense
Title | Understanding Satirical Articles Using Common-Sense |
Authors | Dan Goldwasser, Xiao Zhang |
Abstract | Automatic satire detection is a subtle text classification task, for machines and at times, even for humans. In this paper we argue that satire detection should be approached using common-sense inferences, rather than traditional text classification methods. We present a highly structured latent variable model capturing the required inferences. The model abstracts over the specific entities appearing in the articles, grouping them into generalized categories, thus allowing the model to adapt to previously unseen situations. |
Tasks | Common Sense Reasoning, Text Classification |
Published | 2016-01-01 |
URL | https://www.aclweb.org/anthology/Q16-1038/ |
PWC | https://paperswithcode.com/paper/understanding-satirical-articles-using-common |
Repo | |
Framework | |
A Study of Suggestions in Opinionated Texts and their Automatic Detection
Title | A Study of Suggestions in Opinionated Texts and their Automatic Detection |
Authors | Sapna Negi, Kartik Asooja, Shubham Mehrotra, Paul Buitelaar |
Abstract | |
Tasks | Opinion Mining, Sentence Classification, Sentiment Analysis |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/S16-2022/ |
PWC | https://paperswithcode.com/paper/a-study-of-suggestions-in-opinionated-texts |
Repo | |
Framework | |
The United Nations Parallel Corpus v1.0
Title | The United Nations Parallel Corpus v1.0 |
Authors | Michał Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen |
Abstract | This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal license. Apart from the pairwise aligned documents, a fully aligned subcorpus for the six official UN languages is distributed. We provide baseline BLEU scores of our Moses-based SMT systems trained with the full data of language pairs involving English and for all possible translation directions of the six-way subcorpus. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1561/ |
PWC | https://paperswithcode.com/paper/the-united-nations-parallel-corpus-v10 |
Repo | |
Framework | |
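The abstract above reports baseline BLEU scores for Moses-based SMT systems trained on the corpus. As a point of reference, the minimal sketch below scores a pair of hypothetical system outputs against reference translations with sacrebleu; the sentences are invented for illustration, and this is not the paper's Moses pipeline or evaluation setup.

```python
# Toy BLEU scoring with sacrebleu (pip install sacrebleu); the sentences are
# made up and stand in for system output and UN reference translations.
import sacrebleu

hypotheses = [
    "The General Assembly adopted the resolution without a vote.",
    "The committee will meet again next year.",
]
references = [
    "The General Assembly adopted the resolution without a vote.",
    "The Committee will reconvene next year.",
]

# corpus_bleu takes the hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```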
Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Title | Multilingual Code-switching Identification via LSTM Recurrent Neural Networks |
Authors | Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, Thamar Solorio |
Abstract | |
Tasks | Language Identification |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/W16-5806/ |
PWC | https://paperswithcode.com/paper/multilingual-code-switching-identification |
Repo | |
Framework | |
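No abstract is included above, but the title points to LSTM recurrent networks for code-switching identification, i.e. labelling which language each token belongs to. The sketch below is a generic BiLSTM token tagger in PyTorch; the plain word embeddings, layer sizes, and three-way label set are assumptions made for illustration, not the architecture reported in the paper.

```python
# Generic sketch of a token-level BiLSTM tagger for language identification
# in code-switched text; hyperparameters here are illustrative only.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size: int, num_labels: int,
                 emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> per-token label scores
        embedded = self.emb(token_ids)
        hidden_states, _ = self.lstm(embedded)
        return self.out(hidden_states)

# Toy usage: 2 sentences of 5 tokens, 3 labels (e.g. lang1 / lang2 / other).
model = BiLSTMTagger(vocab_size=1000, num_labels=3)
scores = model(torch.randint(0, 1000, (2, 5)))
print(scores.shape)  # torch.Size([2, 5, 3])
```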
Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Title | Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions |
Authors | Liesbeth Augustinus, Vincent Vandeghinste, Tom Vanallemeersch |
Abstract | We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks, based on the monolingual GrETEL environment. We provide online access to the Europarl parallel treebank for Dutch and English, allowing users to query the treebank using either an XPath expression or an example sentence in order to look for similar constructions. We provide automatic alignments between the nodes. By combining example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1564/ |
PWC | https://paperswithcode.com/paper/poly-gretel-cross-lingual-example-based |
Repo | |
Framework | |
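Poly-GrETEL, as described above, lets users query parallel treebanks with XPath expressions or example sentences. The sketch below runs an XPath-style query over a toy Alpino-like XML fragment; the element and attribute names are illustrative rather than GrETEL's actual data model, and Python's ElementTree supports only a subset of XPath, so part of the filtering is done in plain Python.

```python
# Toy XPath-style querying over an Alpino-like treebank encoding.
import xml.etree.ElementTree as ET

toy_treebank = """
<treebank>
  <alpino_ds id="s1">
    <node cat="top">
      <node cat="pp" rel="mod">
        <node rel="hd" pt="vz" word="in"/>
        <node rel="obj1" pt="n" word="Europa"/>
      </node>
    </node>
  </alpino_ds>
</treebank>
"""

root = ET.fromstring(toy_treebank)

# Find prepositional phrases, then check for a prepositional head in Python.
for pp in root.iterfind(".//node[@cat='pp']"):
    heads = [n for n in pp if n.get("rel") == "hd" and n.get("pt") == "vz"]
    if heads:
        words = " ".join(n.get("word") for n in pp.iter() if n.get("word"))
        print("match:", words)   # -> match: in Europa
```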
NorGramBank: A ‘Deep’ Treebank for Norwegian
Title | NorGramBank: A ‘Deep’ Treebank for Norwegian |
Authors | Helge Dyvik, Paul Meurer, Victoria Rosén, Koenraad De Smedt, Petter Haugereid, Gyri Smørdal Losnegaard, Gunn Inger Lyse, Martha Thunes |
Abstract | We present NorGramBank, a treebank for Norwegian with highly detailed LFG analyses. It is one of many treebanks made available through the INESS treebanking infrastructure. NorGramBank was constructed as a parsebank, i.e. by automatically parsing a corpus, using the wide coverage grammar NorGram. One part consisting of 350,000 words has been manually disambiguated using computer-generated discriminants. A larger part of 50 M words has been stochastically disambiguated. The treebank is dynamic: by global reparsing at certain intervals it is kept compatible with the latest versions of the grammar and the lexicon, which are continually further developed in interaction with the annotators. A powerful query language, INESS Search, has been developed for search across formalisms in the INESS treebanks, including LFG c- and f-structures. Evaluation shows that the grammar provides about 85% of randomly selected sentences with good analyses. Agreement among the annotators responsible for manual disambiguation is satisfactory, but also suggests desirable simplifications of the grammar. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1565/ |
PWC | https://paperswithcode.com/paper/norgrambank-a-deep-treebank-for-norwegian |
Repo | |
Framework | |
Launch and Iterate: Reducing Prediction Churn
Title | Launch and Iterate: Reducing Prediction Churn |
Authors | Mahdi Milani Fard, Quentin Cormier, Kevin Canini, Maya Gupta |
Abstract | Practical applications of machine learning often involve successive training iterations with changes to features and training examples. Ideally, changes in the output of any new model should only be improvements (wins) over the previous iteration, but in practice the predictions may change neutrally for many examples, resulting in extra net-zero wins and losses, referred to as unnecessary churn. These changes in the predictions are problematic for usability for some applications, and make it harder and more expensive to measure if a change is statistically significant positive. In this paper, we formulate the problem and present a stabilization operator to regularize a classifier towards a previous classifier. We use a Markov chain Monte Carlo stabilization operator to produce a model with more consistent predictions without adversely affecting accuracy. We investigate the properties of the proposal with theoretical analysis. Experiments on benchmark datasets for different classification algorithms demonstrate the method and the resulting reduction in churn. |
Tasks | |
Published | 2016-12-01 |
URL | http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn |
PDF | http://papers.nips.cc/paper/6053-launch-and-iterate-reducing-prediction-churn.pdf |
PWC | https://paperswithcode.com/paper/launch-and-iterate-reducing-prediction-churn |
Repo | |
Framework | |
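The abstract above defines unnecessary churn as net-zero prediction changes between successive model versions and proposes regularizing a new classifier towards its predecessor. The sketch below shows the churn metric and a simple anchor-style blending of labels with the previous model's predictions; it illustrates the general idea only and is not the paper's Markov chain Monte Carlo stabilization operator, and the blending weight alpha is a hypothetical knob.

```python
# Minimal sketch: measuring churn and anchoring training targets to the
# previous model's predictions (illustration only, not the paper's operator).
import numpy as np

def churn(old_preds: np.ndarray, new_preds: np.ndarray) -> float:
    """Fraction of examples on which the new model disagrees with the old one."""
    return float(np.mean(old_preds != new_preds))

def anchored_targets(labels: np.ndarray, old_preds: np.ndarray,
                     alpha: float = 0.3) -> np.ndarray:
    """Blend true labels with the previous model's predictions so training is
    pulled toward agreement with the old model (alpha is illustrative)."""
    return (1.0 - alpha) * labels + alpha * old_preds

old = np.array([1, 0, 1, 1, 0])
new = np.array([1, 1, 1, 0, 0])
print("churn:", churn(old, new))  # 0.4: two of five predictions flipped
print(anchored_targets(np.array([1., 0., 1., 1., 0.]), old.astype(float)))
```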
NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity
Title | NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity |
Authors | Kolawole Adebayo, Luigi Di Caro, Guido Boella |
Abstract | |
Tasks | Machine Translation, Semantic Textual Similarity, Text Summarization |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1111/ |
PWC | https://paperswithcode.com/paper/normas-at-semeval-2016-task-1-semsim-a-multi |
Repo | |
Framework | |
Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts
Title | Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts |
Authors | Anne-Kathrin Schumann, Stefan Fischer |
Abstract | The specialised lexicon belongs to the most prominent attributes of specialised writing: Terms function as semantically dense encodings of specialised concepts, which, in the absence of terms, would require lengthy explanations and descriptions. In this paper, we argue that terms are the result of diachronic processes on both the semantic and the morpho-syntactic level. Very little is known about these processes. We therefore present a corpus annotation project aiming at revealing how terms are coined and how they evolve to fit their function as semantically and morpho-syntactically dense encodings of specialised knowledge. The scope of this paper is two-fold: Firstly, we outline our methodology for annotating terminology in a diachronic corpus of scientific publications. Moreover, we provide a detailed analysis of our annotation results and suggest methods for improving the accuracy of annotations in a setting as difficult as ours. Secondly, we present results of a pilot study based on the annotated terms. The results suggest that terms in older texts are linguistically relatively simple units that are hard to distinguish from the lexicon of general language. We believe that this supports our hypothesis that terminology undergoes diachronic processes of densification and specialisation. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1568/ |
PWC | https://paperswithcode.com/paper/compasses-magnets-water-microscopes |
Repo | |
Framework | |
KorAP Architecture ― Diving in the Deep Sea of Corpus Data
Title | KorAP Architecture ― Diving in the Deep Sea of Corpus Data |
Authors | Nils Diewald, Michael Hanl, Eliza Margaretha, Joachim Bingel, Marc Kupietz, Piotr Bański, Andreas Witt |
Abstract | KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP's design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: An architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1569/ |
PWC | https://paperswithcode.com/paper/korap-architecture-a-diving-in-the-deep-sea |
Repo | |
Framework | |
Multi-lingual Dependency Parsing Evaluation: a Large-scale Analysis of Word Order Properties using Artificial Data
Title | Multi-lingual Dependency Parsing Evaluation: a Large-scale Analysis of Word Order Properties using Artificial Data |
Authors | Kristina Gulordava, Paola Merlo |
Abstract | The growing work in multi-lingual parsing faces the challenge of fair comparative evaluation and performance analysis across languages and their treebanks. The difficulty lies in teasing apart the properties of treebanks, such as their size or average sentence length, from those of the annotation scheme, and from the linguistic properties of languages. We propose a method to evaluate the effects of word order of a language on dependency parsing performance, while controlling for confounding treebank properties. The method uses artificially-generated treebanks that are minimal permutations of actual treebanks with respect to two word order properties: word order variation and dependency lengths. Based on these artificial data on twelve languages, we show that longer dependencies and higher word order variability degrade parsing performance. Our method also extends to minimal pairs of individual sentences, leading to a finer-grained understanding of parsing errors. |
Tasks | Dependency Parsing |
Published | 2016-01-01 |
URL | https://www.aclweb.org/anthology/Q16-1025/ |
PWC | https://paperswithcode.com/paper/multi-lingual-dependency-parsing-evaluation-a |
Repo | |
Framework | |
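The study above relates parsing performance to word order variation and dependency lengths, measured on minimally permuted treebanks. The sketch below computes per-token dependency lengths (absolute distance between a token and its head) from a toy CoNLL-U-style parse; the sentence is invented and the paper's permutation machinery is not reproduced.

```python
# Dependency lengths from a toy CoNLL-U-style parse (ID in column 1,
# HEAD in column 7); the root attachment is skipped.
conllu = """\
1\tThe\t_\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsat\t_\tVERB\t_\t_\t0\troot\t_\t_
4\ton\t_\tADP\t_\t_\t6\tcase\t_\t_
5\tthe\t_\tDET\t_\t_\t6\tdet\t_\t_
6\tmat\t_\tNOUN\t_\t_\t3\tobl\t_\t_
"""

def dependency_lengths(sentence: str) -> list[int]:
    lengths = []
    for line in sentence.strip().splitlines():
        cols = line.split("\t")
        idx, head = int(cols[0]), int(cols[6])
        if head != 0:  # skip the artificial root attachment
            lengths.append(abs(idx - head))
    return lengths

lengths = dependency_lengths(conllu)
print(lengths, "mean =", sum(lengths) / len(lengths))  # [1, 1, 2, 1, 3] mean = 1.6
```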
The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts
Title | The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts |
Authors | Amy Isard |
Abstract | Using the Methodius Natural Language Generation (NLG) System, we have created a corpus which consists of a collection of generated texts which describe ancient Greek artefacts. Each text is linked to two representations created as part of the NLG process. The first is a content plan, which uses rhetorical relations to describe the high-level discourse structure of the text, and the second is a logical form describing the syntactic structure, which is sent to the OpenCCG surface realization module to produce the final text output. In recent work, White and Howcroft (2015) have used the SPaRKy restaurant corpus, which contains similar combination of texts and representations, for their research on the induction of rules for the combination of clauses. In the first instance this corpus will be used to test their algorithms on an additional domain, and extend their work to include the learning of referring expression generation rules. As far as we know, the SPaRKy restaurant corpus is the only existing corpus of this type, and we hope that the creation of this new corpus in a different domain will provide a useful resource to the Natural Language Generation community. |
Tasks | Text Generation |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1273/ |
PWC | https://paperswithcode.com/paper/the-methodius-corpus-of-rhetorical-discourse |
Repo | |
Framework | |
Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings
Title | Supervised Word Sense Disambiguation with Sentences Similarities from Context Word Embeddings |
Authors | Shoma Yamaki, Hiroyuki Shinnou, Kanako Komiya, Minoru Sasaki |
Abstract | |
Tasks | Word Embeddings, Word Sense Disambiguation |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/Y16-2009/ |
PWC | https://paperswithcode.com/paper/supervised-word-sense-disambiguation-with |
Repo | |
Framework | |
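No abstract is included above, but the title points to supervised word sense disambiguation driven by sentence similarities computed from context word embeddings. The sketch below is a generic illustration of that idea: contexts are represented by averaged word vectors and a test occurrence receives the sense of its most similar labelled training context. The toy vectors are random and the 1-nearest-neighbour rule is an assumption, so the printed prediction is arbitrary; this is not the authors' actual setup.

```python
# Generic sketch: sense assignment by cosine similarity of averaged context
# embeddings. The vectors are random placeholders, so the output is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       ["deposit", "money", "river", "water", "bank", "the", "near"]}

def sentence_vector(tokens):
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sense-labelled training contexts for the target word "bank".
train = [(["deposit", "money", "bank"], "bank/finance"),
         (["river", "water", "bank"], "bank/geo")]
test = ["the", "bank", "near", "the", "water"]

test_vec = sentence_vector(test)
best = max(train, key=lambda ex: cosine(sentence_vector(ex[0]), test_vec))
print("predicted sense:", best[1])
```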
The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors
Title | The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors |
Authors | Tilia Ellendorff, Simon Foster, Fabio Rinaldi |
Abstract | We present the first version of a corpus annotated for psychiatric disorders and their etiological factors. The paper describes the choice of text, annotated entities and events/relations as well as the annotation scheme and procedure applied. The corpus is featuring a selection of focus psychiatric disorders including depressive disorder, anxiety disorder, obsessive-compulsive disorder, phobic disorders and panic disorder. Etiological factors for these focus disorders are widespread and include genetic, physiological, sociological and environmental factors among others. Etiological events, including annotated evidence text, represent the interactions between their focus disorders and their etiological factors. Additionally to these core events, symptomatic and treatment events have been annotated. The current version of the corpus includes 175 scientific abstracts. All entities and events/relations have been manually annotated by domain experts and scores of inter-annotator agreement are presented. The aim of the corpus is to provide a first gold standard to support the development of biomedical text mining applications for the specific area of mental disorders which belong to the main contributors to the contemporary burden of disease. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1590/ |
PWC | https://paperswithcode.com/paper/the-psymine-corpus-a-corpus-annotated-with |
Repo | |
Framework | |
Guidelines and Framework for a Large Scale Arabic Diacritized Corpus
Title | Guidelines and Framework for a Large Scale Arabic Diacritized Corpus |
Authors | Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, Kemal Oflazer |
Abstract | This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1577/ |
PWC | https://paperswithcode.com/paper/guidelines-and-framework-for-a-large-scale |
Repo | |
Framework | |