Paper Group NANR 9
Top a Splitter: Using Distributional Semantics for Improving Compound Splitting
Title | Top a Splitter: Using Distributional Semantics for Improving Compound Splitting |
Authors | Patrick Ziering, Stefan Müller, Lonneke van der Plas |
Abstract | |
Tasks | Machine Translation, Semantic Textual Similarity |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/W16-1807/ |
https://www.aclweb.org/anthology/W16-1807 | |
PWC | https://paperswithcode.com/paper/top-a-splitter-using-distributional-semantics |
Repo | |
Framework | |
Creating Resources for Dialectal Arabic from a Single Annotation: A Case Study on Egyptian and Levantine
Title | Creating Resources for Dialectal Arabic from a Single Annotation: A Case Study on Egyptian and Levantine |
Authors | Ramy Eskander, Nizar Habash, Owen Rambow, Arfath Pasha |
Abstract | Arabic dialects present a special problem for natural language processing because there are few resources, they have no standard orthography, and have not been studied much. However, as more and more written dialectal Arabic is found in social media, NLP for Arabic dialects becomes an important goal. We present a methodology for creating a morphological analyzer and a morphological tagger for dialectal Arabic, and we illustrate it on Egyptian and Levantine Arabic. To our knowledge, these are the first analyzer and tagger for Levantine. |
Tasks | Morphological Analysis |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1326/ |
https://www.aclweb.org/anthology/C16-1326 | |
PWC | https://paperswithcode.com/paper/creating-resources-for-dialectal-arabic-from |
Repo | |
Framework | |
Lemmatization and Morphological Tagging in German and Latin: A Comparison and a Survey of the State-of-the-art
Title | Lemmatization and Morphological Tagging in German and Latin: A Comparison and a Survey of the State-of-the-art |
Authors | Steffen Eger, Rüdiger Gleim, Alexander Mehler |
Abstract | This paper addresses the challenge of morphological tagging and lemmatization in morphologically rich languages, using German and Latin as examples. We focus on the question of what a practitioner can expect when using state-of-the-art solutions out of the box. Moreover, we contrast these with old(er) methods and implementations for POS tagging. We examine to what degree recent efforts in tagger development are reflected in improved accuracies, and at what cost in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluations are particularly insightful because the distribution of the data being tagged by a user will typically differ from the distribution on which the tagger was trained. Furthermore, two lemmatization techniques are evaluated. Finally, we compare pipeline tagging vs. a tagging approach that acknowledges dependencies between inflectional categories. |
Tasks | Lemmatization, Morphological Tagging |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1239/ |
https://www.aclweb.org/anthology/L16-1239 | |
PWC | https://paperswithcode.com/paper/lemmatization-and-morphological-tagging-in |
Repo | |
Framework | |
Operational Assessment of Keyword Search on Oral History
Title | Operational Assessment of Keyword Search on Oral History |
Authors | Elizabeth Salesky, Jessica Ray, Wade Shen |
Abstract | This project assesses the resources necessary to make oral history searchable by means of automatic speech recognition (ASR). There are many inherent challenges in applying ASR to conversational speech: smaller training set sizes and varying demographics, among others. We assess the impact of dataset size, word error rate and term-weighted value on human search capability through an information retrieval task on Mechanical Turk. We use English oral history data collected by StoryCorps, a national organization that provides all people with the opportunity to record, share and preserve their stories, and control for a variety of demographics including age, gender, birthplace, and dialect on four different training set sizes. We show comparable search performance using a standard speech recognition system as with hand-transcribed data, which is promising for increased accessibility of conversational speech and oral history archives. |
Tasks | Information Retrieval, Speech Recognition |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1049/ |
https://www.aclweb.org/anthology/L16-1049 | |
PWC | https://paperswithcode.com/paper/operational-assessment-of-keyword-search-on |
Repo | |
Framework | |
Data, tools and resources for mining social media drug chatter
Title | Data, tools and resources for mining social media drug chatter |
Authors | Abeed Sarker, Graciela Gonzalez |
Abstract | Social media has emerged as a crucial resource for obtaining population-based signals for various public health monitoring and surveillance tasks, such as pharmacovigilance. There is an abundance of knowledge hidden within social media data, and the volume is growing. Drug-related chatter on social media can include user-generated information that can provide insights into public health problems such as abuse, adverse reactions, long-term effects, and multi-drug interactions. Our objective in this paper is to present to the biomedical natural language processing, data science, and public health communities data sets (annotated and unannotated), tools and resources that we have collected and created from social media. The data we present was collected from Twitter using the generic and brand names of drugs as keywords, along with their common misspellings. Following the collection of the data, annotation guidelines were created over several iterations, which detail important aspects of social media data annotation and can be used by future researchers for developing similar data sets. The annotation guidelines were followed to prepare data sets for text classification, information extraction and normalization. In this paper, we discuss the preparation of these guidelines, outline the data sets prepared, and present an overview of our state-of-the-art systems for data collection, supervised classification, and information extraction. In addition to the development of supervised systems for classification and extraction, we developed and released unlabeled data and language models. We discuss the potential uses of these language models in data mining and the large volumes of unlabeled data from which they were generated. We believe that the summaries and repositories we present here of our data, annotation guidelines, models, and tools will be beneficial to the research community as a single-point entry for all these resources, and will promote further research in this area. |
Tasks | Epidemiology, Text Classification |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-5111/ |
https://www.aclweb.org/anthology/W16-5111 | |
PWC | https://paperswithcode.com/paper/data-tools-and-resources-for-mining-social |
Repo | |
Framework | |
Crowdsourcing Salient Information from News and Tweets
Title | Crowdsourcing Salient Information from News and Tweets |
Authors | Oana Inel, Tommaso Caselli, Lora Aroyo |
Abstract | The increasing streams of information pose challenges to both humans and machines. On the one hand, humans need to identify relevant information and consume only the information that matches their interests. On the other hand, machines need to understand the information that is published in online data streams and generate concise and meaningful overviews. We consider events as prime factors to query for information and generate meaningful context. The focus of this paper is to acquire empirical insights for identifying salience features in tweets and news about a target event, i.e., the event of "whaling". We first derive a methodology to identify such features by building up a knowledge space of the event enriched with relevant phrases, sentiments and ranked by their novelty. We applied this methodology to tweets and we have performed preliminary work towards adapting it to news articles. Our results show that crowdsourcing text relevance, sentiments and novelty (1) can be a main step in identifying salient information, and (2) provides a deeper and more precise understanding of the data at hand compared to state-of-the-art approaches. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1625/ |
https://www.aclweb.org/anthology/L16-1625 | |
PWC | https://paperswithcode.com/paper/crowdsourcing-salient-information-from-news |
Repo | |
Framework | |
Metrics for Evaluation of Word-level Machine Translation Quality Estimation
Title | Metrics for Evaluation of Word-level Machine Translation Quality Estimation |
Authors | Varvara Logacheva, Michal Lukasik, Lucia Specia |
Abstract | |
Tasks | Machine Translation |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-2095/ |
https://www.aclweb.org/anthology/P16-2095 | |
PWC | https://paperswithcode.com/paper/metrics-for-evaluation-of-word-level-machine |
Repo | |
Framework | |
Parallel Speech Corpora of Japanese Dialects
Title | Parallel Speech Corpora of Japanese Dialects |
Authors | Koichiro Yoshino, Naoki Hirayama, Shinsuke Mori, Fumihiko Takahashi, Katsutoshi Itoyama, Hiroshi G. Okuno |
Abstract | |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1737/ |
https://www.aclweb.org/anthology/L16-1737 | |
PWC | https://paperswithcode.com/paper/parallel-speech-corpora-of-japanese-dialects |
Repo | |
Framework | |
Detection of Text Reuse in French Medical Corpora
Title | Detection of Text Reuse in French Medical Corpora |
Authors | Eva D'hondt, Cyril Grouin, Aurélie Névéol, Efstathios Stamatatos, Pierre Zweigenbaum |
Abstract | Electronic Health Records (EHRs) are increasingly available in modern health care institutions either through the direct creation of electronic documents in hospitals' health information systems, or through the digitization of historical paper records. Each EHR creation method yields the need for sophisticated text reuse detection tools in order to prepare the EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) the detection of updated versions of the same document and 2) the detection of document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora that achieves an overall macro F-measure of 0.68 and 0.60, respectively, and correctly identifies all redundant document pairs of interest. |
Tasks | Optical Character Recognition |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-5112/ |
https://www.aclweb.org/anthology/W16-5112 | |
PWC | https://paperswithcode.com/paper/detection-of-text-reuse-in-french-medical |
Repo | |
Framework | |
Learning to Generate Textual Data
Title | Learning to Generate Textual Data |
Authors | Guillaume Bouchard, Pontus Stenetorp, Sebastian Riedel |
Abstract | |
Tasks | Recommendation Systems, Transfer Learning |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1167/ |
https://www.aclweb.org/anthology/D16-1167 | |
PWC | https://paperswithcode.com/paper/learning-to-generate-textual-data |
Repo | |
Framework | |
Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes
Title | Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes |
Authors | Davaajav Jargalsaikhan, Naoaki Okazaki, Koji Matsuda, Kentaro Inui |
Abstract | |
Tasks | Coreference Resolution, Entity Linking, Information Retrieval, Knowledge Base Population, Question Answering |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-3021/ |
https://www.aclweb.org/anthology/P16-3021 | |
PWC | https://paperswithcode.com/paper/building-a-corpus-for-japanese-wikification |
Repo | |
Framework | |
Suggestion Mining from Opinionated Text
Title | Suggestion Mining from Opinionated Text |
Authors | Sapna Negi |
Abstract | |
Tasks | Opinion Mining, Sentence Classification |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-3018/ |
https://www.aclweb.org/anthology/P16-3018 | |
PWC | https://paperswithcode.com/paper/suggestion-mining-from-opinionated-text |
Repo | |
Framework | |
Inspire at SemEval-2016 Task 2: Interpretable Semantic Textual Similarity Alignment based on Answer Set Programming
Title | Inspire at SemEval-2016 Task 2: Interpretable Semantic Textual Similarity Alignment based on Answer Set Programming |
Authors | Mishal Kazmi, Peter Schüller |
Abstract | |
Tasks | Chunking, Semantic Textual Similarity |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1171/ |
https://www.aclweb.org/anthology/S16-1171 | |
PWC | https://paperswithcode.com/paper/inspire-at-semeval-2016-task-2-interpretable |
Repo | |
Framework | |
Learning Additive Exponential Family Graphical Models via $\ell_{2,1}$-norm Regularized M-Estimation
Title | Learning Additive Exponential Family Graphical Models via $\ell_{2,1}$-norm Regularized M-Estimation |
Authors | Xiaotong Yuan, Ping Li, Tong Zhang, Qingshan Liu, Guangcan Liu |
Abstract | We investigate a subclass of exponential family graphical models of which the sufficient statistics are defined by arbitrary additive forms. We propose two $\ell_{2,1}$-norm regularized maximum likelihood estimators to learn the model parameters from i.i.d. samples. The first one is a joint MLE estimator which estimates all the parameters simultaneously. The second one is a node-wise conditional MLE estimator which estimates the parameters for each node individually. For both estimators, statistical analysis shows that under mild conditions the extra flexibility gained by the additive exponential family models comes at almost no cost of statistical efficiency. A Monte-Carlo approximation method is developed to efficiently optimize the proposed estimators. The advantages of our estimators over Gaussian graphical models and Nonparanormal estimators are demonstrated on synthetic and real data sets. |
Tasks | |
Published | 2016-12-01 |
URL | http://papers.nips.cc/paper/6106-learning-additive-exponential-family-graphical-models-via-ell_21-norm-regularized-m-estimation |
http://papers.nips.cc/paper/6106-learning-additive-exponential-family-graphical-models-via-ell_21-norm-regularized-m-estimation.pdf | |
PWC | https://paperswithcode.com/paper/learning-additive-exponential-family |
Repo | |
Framework | |
TGB at SemEval-2016 Task 5: Multi-Lingual Constraint System for Aspect Based Sentiment Analysis
Title | TGB at SemEval-2016 Task 5: Multi-Lingual Constraint System for Aspect Based Sentiment Analysis |
Authors | Fatih Samet Çetin, Ezgi Yıldırım, Can Özbey, Gülşen Eryiğit |
Abstract | |
Tasks | Aspect-Based Sentiment Analysis, Opinion Mining, Sentiment Analysis |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1054/ |
https://www.aclweb.org/anthology/S16-1054 | |
PWC | https://paperswithcode.com/paper/tgb-at-semeval-2016-task-5-multi-lingual |
Repo | |
Framework | |