Paper Group NANR 4
Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style
Title | Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style |
Authors | Ben Verhoeven, Iza Škrjanec, Senja Pollak |
Abstract | To our knowledge, we present the results of the first gender classification experiments on Slovene text. Inspired by the TwiSty corpus and experiments (Verhoeven et al., 2016), we employed the Janes corpus (Erjavec et al., 2016) and its gender annotations to perform gender classification experiments on Twitter text, comparing a token-based and a lemma-based approach. We find that the token-based approach (92.6% accuracy), containing gender markings related to the author, outperforms the lemma-based approach by about 5%. Especially in the lemmatized version, we also observe stylistic and content-based differences in writing between men (e.g. more profane language, numerals and beer mentions) and women (e.g. more pronouns, emoticons and character flooding). Many of our findings corroborate previous research on other languages. |
Tasks | Lemmatization |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1418/ |
PWC | https://paperswithcode.com/paper/gender-profiling-for-slovene-twitter |
Repo | |
Framework | |
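The contrast between token-based and lemma-based features described in the abstract above can be illustrated with a minimal sketch. It is not the authors' pipeline: the two example sentences and the tiny `TOY_LEMMAS` lookup table are hypothetical stand-ins for the Janes corpus and a real Slovene lemmatizer. The sketch relies only on the fact that Slovene past participles agree in gender with the subject (e.g. `bil` vs. `bila`, both forms of `biti`).

```python
# Toy illustration: lemmatization removes author gender markings.
# TOY_LEMMAS is a hypothetical lookup table, not a real lemmatizer.
TOY_LEMMAS = {
    "sem": "biti",
    "bil": "biti",     # masculine singular participle
    "bila": "biti",    # feminine singular participle
    "videl": "videti",
    "videla": "videti",
}


def token_features(text):
    """Return the set of lowercased whitespace tokens."""
    return set(text.lower().split())


def lemma_features(text):
    """Map each token to its lemma; unknown tokens are kept unchanged."""
    return {TOY_LEMMAS.get(tok, tok) for tok in text.lower().split()}


male_tweet = "Včeraj sem bil doma"     # "Yesterday I was at home" (male author)
female_tweet = "Včeraj sem bila doma"  # same sentence, female author

# Symmetric difference: features that appear in only one of the two texts.
token_diff = token_features(male_tweet) ^ token_features(female_tweet)
lemma_diff = lemma_features(male_tweet) ^ lemma_features(female_tweet)

print(sorted(token_diff))  # the gender-marked participles
print(sorted(lemma_diff))  # nothing left once the tokens are lemmatized
```

In this toy case the only features that separate the two texts are the gender-marked participles, which is consistent with the abstract's observation that the token-based representation carries gender markings that the lemma-based one lacks.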
Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin
Title | Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin |
Authors | Géraldine Walther, Benoît Sagot |
Abstract | In this paper, we present ongoing work for developing language resources and basic NLP tools for an undocumented variety of Romansh, in the context of a language documentation and language acquisition project. Our tools are meant to improve the speed and reliability of corpus annotations for noisy data involving large amounts of code-switching, occurrences of child-speech and orthographic noise. Being able to increase the efficiency of language resource development for language documentation and acquisition research also constitutes a step towards solving the data sparsity issues with which researchers have been struggling. |
Tasks | Language Acquisition, Spelling Correction |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/W17-2212/ |
PWC | https://paperswithcode.com/paper/speeding-up-corpus-development-for-linguistic |
Repo | |
Framework | |
HHU at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Data using Machine Learning Methods
Title | HHU at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Data using Machine Learning Methods |
Authors | Tobias Cabanski, Julia Romberg, Stefan Conrad |
Abstract | In this paper, a system for solving SemEval-2017 Task 5 is presented. This task is divided into two tracks in which the sentiment of microblog messages and news headlines has to be predicted. Since two submissions were allowed, two different machine learning methods were developed to solve this task: a support vector machine approach and a recurrent neural network approach. To feed data into these approaches, different feature extraction methods are used, mainly word representations and lexica. The best submissions for both tracks are provided by the recurrent neural network, which achieves an F1-score of 0.729 in track 1 and 0.702 in track 2. |
Tasks | Feature Selection, Sentiment Analysis |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/S17-2141/ |
PWC | https://paperswithcode.com/paper/hhu-at-semeval-2017-task-5-fine-grained |
Repo | |
Framework | |
INF-UFRGS at SemEval-2017 Task 5: A Supervised Identification of Sentiment Score in Tweets and Headlines
Title | INF-UFRGS at SemEval-2017 Task 5: A Supervised Identification of Sentiment Score in Tweets and Headlines |
Authors | Tiago Zini, Karin Becker, Marcelo Dias |
Abstract | This paper describes a supervised solution for detecting the polarity scores of tweets or news headlines in the financial domain, submitted to the SemEval 2017 Fine-Grained Sentiment Analysis on Financial Microblogs and News Task. The premise is that it is possible to understand market reaction to a company stock by measuring the positive/negative sentiment contained in financial tweets and news headlines, where polarity is measured on a continuous scale ranging from -1.0 (very bearish) to 1.0 (very bullish). Our system receives as input the textual content of tweets or news headlines, together with their ids, the stock cashtag or name of the target company, and the gold-standard polarity score for the training dataset. Our solution extracts features from these text instances using n-grams, hashtags, sentiment scores calculated by external APIs and other features to train a regression model capable of predicting the continuous score of these sentiments with precision. |
Tasks | Opinion Mining, Sentiment Analysis |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/S17-2142/ |
PWC | https://paperswithcode.com/paper/inf-ufrgs-at-semeval-2017-task-5-a-supervised |
Repo | |
Framework | |
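As a rough illustration of the setup in the abstract above (n-gram features feeding a regression model that outputs a score between -1.0 and 1.0), here is a minimal sketch in plain Python. It is not the authors' system: the four training headlines, their scores, the hyperparameters and the choice of stochastic gradient descent on squared error are all hypothetical, and the hashtag and external-API features mentioned in the abstract are left out.

```python
# Sketch: regress a sentiment score in [-1.0, 1.0] from word n-gram counts.
# The training examples and hyperparameters below are hypothetical.
from collections import Counter


def ngram_counts(text, n_max=2):
    """Count word n-grams (n = 1..n_max) in a lowercased text."""
    words = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats[" ".join(words[i:i + n])] += 1
    return feats


def predict(weights, feats):
    """Linear score, clipped to the task's [-1.0, 1.0] range."""
    raw = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(-1.0, min(1.0, raw))


def train(examples, epochs=200, lr=0.05):
    """Fit weights with plain stochastic gradient descent on squared error."""
    weights = {}
    for _ in range(epochs):
        for text, gold in examples:
            feats = ngram_counts(text)
            raw = sum(weights.get(f, 0.0) * v for f, v in feats.items())
            err = raw - gold
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) - lr * err * v
    return weights


toy_train = [
    ("$AAA shares surge after strong earnings", 0.8),
    ("$BBB stock plunges on weak sales", -0.7),
    ("$CCC shares surge on upbeat forecast", 0.6),
    ("$DDD stock plunges after profit warning", -0.8),
]
model = train(toy_train)
bullish = predict(model, ngram_counts("shares surge"))
bearish = predict(model, ngram_counts("stock plunges"))
print(bullish, bearish)
```

Clipping the linear output keeps every prediction inside the scale defined by the task, whatever the learned weights are.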
Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains
Title | Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains |
Authors | Changliang Li, Xiuying Wang |
Abstract | A corpus is a valuable resource for information retrieval and data-driven natural language processing systems, especially for spoken dialogue research in specific domains. However, there are few non-English corpora, particularly in Chinese. Spoken by the nation with the largest population in the world, Chinese has become increasingly prevalent and popular among millions of people worldwide. In this paper, we build a large-scale and high-quality Chinese corpus, called CSDC (Chinese Spoken Dialogue Corpus). It contains five domains and more than 140 thousand dialogues in all. Unlike other corpora, each sentence in this corpus is additionally annotated with slot information. To the best of our knowledge, this is the largest Chinese spoken dialogue corpus, as well as the first one with slot information. With this corpus, we propose a method and conduct a well-designed experiment. Indicative results are reported at the end. |
Tasks | Information Retrieval |
Published | 2017-11-01 |
URL | https://www.aclweb.org/anthology/I17-2054/ |
PWC | https://paperswithcode.com/paper/building-large-chinese-corpus-for-spoken |
Repo | |
Framework | |
Measuring Topic Coherence through Optimal Word Buckets
Title | Measuring Topic Coherence through Optimal Word Buckets |
Authors | Nitin Ramrakhiyani, Sachin Pawar, Swapnil Hingmire, Girish Palshikar |
Abstract | Measuring topic quality is essential for scoring the learned topics and for their subsequent use in information retrieval and text classification. To measure the quality of Latent Dirichlet Allocation (LDA) based topics learned from text, we propose a novel approach based on grouping topic words into buckets (TBuckets). A single large bucket signifies a single coherent theme, in turn indicating high topic coherence. TBuckets uses word embeddings of topic words and employs singular value decomposition (SVD) and Integer Linear Programming based optimization to create coherent word buckets. TBuckets outperforms the state-of-the-art techniques when evaluated on 3 publicly available datasets and on another one proposed in this paper. |
Tasks | Information Retrieval, Text Classification, Topic Models, Word Embeddings |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/E17-2070/ |
PWC | https://paperswithcode.com/paper/measuring-topic-coherence-through-optimal |
Repo | |
Framework | |
PP Attachment: Where do We Stand?
Title | PP Attachment: Where do We Stand? |
Authors | Daniël de Kok, Jianqiang Ma, Corina Dima, Erhard Hinrichs |
Abstract | Prepositional phrase (PP) attachment is a well-known challenge in parsing. In this paper, we combine the insights of different works, namely: (1) treating PP attachment as a classification task with an arbitrary number of attachment candidates; (2) using auxiliary distributions to augment the data beyond the hand-annotated training set; (3) using topological fields to get information about the distribution of PP attachment throughout clauses; and (4) using state-of-the-art techniques such as word embeddings and neural networks. We show that jointly using these techniques leads to substantial improvements. We also conduct a qualitative analysis to gauge where the ceiling of the task is in a realistic setup. |
Tasks | Word Embeddings |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/E17-2050/ |
PWC | https://paperswithcode.com/paper/pp-attachment-where-do-we-stand |
Repo | |
Framework | |
Exploiting Argument Information to Improve Event Detection via Supervised Attention Mechanisms
Title | Exploiting Argument Information to Improve Event Detection via Supervised Attention Mechanisms |
Authors | Shulin Liu, Yubo Chen, Kang Liu, Jun Zhao |
Abstract | This paper tackles the task of event detection (ED), which involves identifying and categorizing events. We argue that arguments provide significant clues for this task, but they are either completely ignored or exploited only indirectly in existing detection approaches. In this work, we propose to exploit argument information explicitly for ED via supervised attention mechanisms. Specifically, we systematically investigate the proposed model under the supervision of different attention strategies. Experimental results show that our approach advances the state of the art and achieves the best F1 score on the ACE 2005 dataset. |
Tasks | |
Published | 2017-07-01 |
URL | https://www.aclweb.org/anthology/P17-1164/ |
PWC | https://paperswithcode.com/paper/exploiting-argument-information-to-improve |
Repo | |
Framework | |
Deep Learning in Lexical Analysis and Parsing
Title | Deep Learning in Lexical Analysis and Parsing |
Authors | Wanxiang Che, Yue Zhang |
Abstract | Neural networks, also known by the fancier name deep learning, can overcome the above "feature engineering" problem. In theory, they can use non-linear activation functions and multiple layers to automatically find useful features. Novel network structures, such as convolutional or recurrent ones, help to reduce the difficulty further. These deep learning models have been successfully used for lexical analysis and parsing. In this tutorial, we will review each line of work, contrasting it with traditional statistical methods and organizing the material in a consistent order. |
Tasks | Dependency Parsing, Feature Engineering, Lexical Analysis, Part-Of-Speech Tagging, Structured Prediction |
Published | 2017-11-01 |
URL | https://www.aclweb.org/anthology/I17-5001/ |
PWC | https://paperswithcode.com/paper/deep-learning-in-lexical-analysis-and-parsing |
Repo | |
Framework | |
UIT-DANGNT-CLNLP at SemEval-2017 Task 9: Building Scientific Concept Fixing Patterns for Improving CAMR
Title | UIT-DANGNT-CLNLP at SemEval-2017 Task 9: Building Scientific Concept Fixing Patterns for Improving CAMR |
Authors | Khoa Nguyen, Dang Nguyen |
Abstract | This paper describes the improvements that we have applied to the CAMR baseline parser (Wang et al., 2016) at Task 8 of SemEval-2016. Our objective is to make CAMR more accurate when parsing sentences from scientific articles, especially articles in the biology domain. To achieve this goal, we built two wrapper layers for CAMR. The first layer, which covers the input data, normalizes the input sentences and adds necessary information to them so that the dependency parser and the aligner better handle reference citations, scientific figures, formulas, etc. The second layer, which covers the output data, modifies and standardizes the output based on a list of scientific concept fixing patterns. This helps CAMR better handle biological concepts that are not in the training dataset. After applying our approach, CAMR scored 0.65 F-score on the test set of the Biomedical training data and 0.61 F-score on the official blind test dataset. |
Tasks | |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/S17-2156/ |
PWC | https://paperswithcode.com/paper/uit-dangnt-clnlp-at-semeval-2017-task-9 |
Repo | |
Framework | |
The Projector: An Interactive Annotation Projection Visualization Tool
Title | The Projector: An Interactive Annotation Projection Visualization Tool |
Authors | Alan Akbik, Roland Vollgraf |
Abstract | Previous works proposed annotation projection in parallel corpora to inexpensively generate treebanks or propbanks for new languages. In this approach, linguistic annotation is automatically transferred from a resource-rich source language (SL) to translations in a target language (TL). However, annotation projection may be adversely affected by translational divergences between specific language pairs. For this reason, previous work often required careful qualitative analysis of projectability of specific annotation in order to define strategies to address quality and coverage issues. In this demonstration, we present THE PROJECTOR, an interactive GUI designed to assist researchers in such analysis: it allows users to execute and visually inspect annotation projection in a range of different settings. We give an overview of the GUI, discuss use cases and illustrate how the tool can facilitate discussions with the research community. |
Tasks | |
Published | 2017-09-01 |
URL | https://www.aclweb.org/anthology/D17-2008/ |
PWC | https://paperswithcode.com/paper/the-projector-an-interactive-annotation |
Repo | |
Framework | |
FORGe at SemEval-2017 Task 9: Deep sentence generation based on a sequence of graph transducers
Title | FORGe at SemEval-2017 Task 9: Deep sentence generation based on a sequence of graph transducers |
Authors | Simon Mille, Roberto Carlini, Alicia Burga, Leo Wanner |
Abstract | We present the contribution of Universitat Pompeu Fabra's NLP group to the SemEval Task 9.2 (AMR-to-English Generation). The proposed generation pipeline comprises: (i) a series of rule-based graph-transducers for the syntacticization of the input graphs and the resolution of morphological agreements, and (ii) an off-the-shelf statistical linearization component. |
Tasks | |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/S17-2158/ |
PWC | https://paperswithcode.com/paper/forge-at-semeval-2017-task-9-deep-sentence |
Repo | |
Framework | |
Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing
Title | Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing |
Authors | Mika Koistinen, Kimmo Kettunen, Tuula Pääkkönen |
Abstract | |
Tasks | Boundary Detection, Information Retrieval, Machine Translation, Named Entity Recognition, Optical Character Recognition, Tokenization |
Published | 2017-05-01 |
URL | https://www.aclweb.org/anthology/W17-0238/ |
PWC | https://paperswithcode.com/paper/improving-optical-character-recognition-of |
Repo | |
Framework | |
MEANT 2.0: Accurate semantic MT evaluation for any output language
Title | MEANT 2.0: Accurate semantic MT evaluation for any output language |
Authors | Chi-kiu Lo |
Abstract | |
Tasks | Machine Translation, Semantic Role Labeling, Word Embeddings |
Published | 2017-09-01 |
URL | https://www.aclweb.org/anthology/W17-4767/ |
PWC | https://paperswithcode.com/paper/meant-20-accurate-semantic-mt-evaluation-for |
Repo | |
Framework | |
RIGOTRIO at SemEval-2017 Task 9: Combining Machine Learning and Grammar Engineering for AMR Parsing and Generation
Title | RIGOTRIO at SemEval-2017 Task 9: Combining Machine Learning and Grammar Engineering for AMR Parsing and Generation |
Authors | Normunds Gruzitis, Didzis Gosko, Guntis Barzdins |
Abstract | By addressing both text-to-AMR parsing and AMR-to-text generation, SemEval-2017 Task 9 established AMR as a powerful semantic interlingua. We strengthen the interlingual aspect of AMR by applying the multilingual Grammatical Framework (GF) for AMR-to-text generation. Our current rule-based GF approach completely covered only 12.3% of the test AMRs; therefore, we combined it with the state-of-the-art JAMR Generator to see whether the combination increases or decreases the overall performance. The combined system achieved an automatic BLEU score of 18.82 and a human TrueSkill score of 107.2, to be compared with the plain JAMR Generator results. As for AMR parsing, we added NER extensions to our SemEval-2016 general-domain AMR parser to handle the biomedical genre, rich in organic compound names, achieving Smatch F1=54.0%. |
Tasks | Amr Parsing, Text Generation |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/S17-2159/ |
PWC | https://paperswithcode.com/paper/rigotrio-at-semeval-2017-task-9-combining |
Repo | |
Framework | |