Paper Group NANR 110
CLUZH at VarDial GDI 2017: Testing a Variety of Machine Learning Tools for the Classification of Swiss German Dialects
Title | CLUZH at VarDial GDI 2017: Testing a Variety of Machine Learning Tools for the Classification of Swiss German Dialects |
Authors | Simon Clematide, Peter Makarov |
Abstract | Our submissions for the GDI 2017 Shared Task are the results from three different types of classifiers: Naïve Bayes, Conditional Random Fields (CRF), and Support Vector Machine (SVM). Our CRF-based run achieves a weighted F1 score of 65% (third rank), 0.9% behind the best system. Measured by classification accuracy, our ensemble run (Naïve Bayes, CRF, SVM) reaches 67% (second rank), 1% below the best system. We also describe our experiments with Recurrent Neural Network (RNN) architectures. Since they performed worse than our non-neural approaches, we did not include them in the submission. |
Tasks | Language Identification, Text Classification |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1221/ |
PWC | https://paperswithcode.com/paper/cluzh-at-vardial-gdi-2017-testing-a-variety |
Repo | |
Framework | |
Arabic Dialect Identification Using iVectors and ASR Transcripts
Title | Arabic Dialect Identification Using iVectors and ASR Transcripts |
Authors | Shervin Malmasi, Marcos Zampieri |
Abstract | This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017. The goal of the task is to evaluate computational models to identify the dialect of Arabic utterances using both audio and text transcriptions. The ADI shared task dataset included Modern Standard Arabic (MSA) and four Arabic dialects: Egyptian, Gulf, Levantine, and North-African. The three systems submitted by MAZA are based on combinations of multiple machine learning classifiers arranged as (1) a voting ensemble; (2) a mean probability ensemble; (3) a meta-classifier. The best results were obtained by the meta-classifier, achieving 71.7% accuracy and ranking second among the six teams that participated in the ADI shared task. |
Tasks | Machine Translation |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1222/ |
PWC | https://paperswithcode.com/paper/arabic-dialect-identification-using-ivectors |
Repo | |
Framework | |
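The voting and mean-probability ensembles described in the MAZA abstract above can be sketched as follows. This is an illustrative toy example, not the authors' code: the base classifiers, character features, and utterance snippets are stand-ins.

```python
# Toy sketch of (1) a majority-vote ensemble and (2) a mean-probability
# ensemble over several base classifiers, as in the abstract above.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Invented placeholder "transcripts" and dialect labels.
texts = ["halla shlonak", "azayak ya basha", "labas alik", "kifak ya habibi"]
labels = ["gulf", "egyptian", "north-african", "levantine"]

X = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit_transform(texts)
models = [MultinomialNB(), LogisticRegression(), DecisionTreeClassifier()]
probas = [m.fit(X, labels).predict_proba(X) for m in models]
classes = models[0].classes_  # sklearn sorts classes identically per model

# (1) Majority vote over each model's hard predictions.
votes = np.array([p.argmax(axis=1) for p in probas])  # (n_models, n_samples)
vote_pred = [classes[np.bincount(col).argmax()] for col in votes.T]

# (2) Average the posterior probabilities, then take the argmax.
mean_pred = classes[np.mean(probas, axis=0).argmax(axis=1)]
```

A meta-classifier (the authors' best run) would instead stack the per-model probabilities as features for a second-level learner rather than averaging them.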
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Title | Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis |
Authors | |
Abstract | |
Tasks | |
Published | 2017-09-01 |
URL | https://www.aclweb.org/anthology/W17-5200/ |
PWC | https://paperswithcode.com/paper/proceedings-of-the-8th-workshop-on |
Repo | |
Framework | |
Compositionality for perceptual classification
Title | Compositionality for perceptual classification |
Authors | Staffan Larsson |
Abstract | |
Tasks | |
Published | 2017-01-01 |
URL | https://www.aclweb.org/anthology/W17-6923/ |
PWC | https://paperswithcode.com/paper/compositionality-for-perceptual |
Repo | |
Framework | |
Exploring Lexical and Syntactic Features for Language Variety Identification
Title | Exploring Lexical and Syntactic Features for Language Variety Identification |
Authors | Chris van der Lee, Antal van den Bosch |
Abstract | We present a method to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language. The method draws on a feature bundle representing text statistics, syntactic features, and word n-grams. Text statistics include average word length and sentence length, while syntactic features include ratios of function words and part-of-speech n-grams. The effectiveness of the classifier was measured by classifying Dutch subtitles developed for either Dutch or Flemish television. Several machine learning algorithms were compared, as well as feature combination methods, in order to find the optimal generalization performance. A machine-learning meta classifier based on AdaBoost attained the best F-score of 0.92. |
Tasks | Language Identification, Text Classification |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1224/ |
PWC | https://paperswithcode.com/paper/exploring-lexical-and-syntactic-features-for |
Repo | |
Framework | |
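The feature-bundle-plus-AdaBoost setup described in the abstract above can be sketched roughly as follows. This is a toy illustration under assumed details, not the paper's system: the example subtitles, the restriction to word n-grams, and the pipeline choices are invented.

```python
# Toy sketch: word-n-gram features feeding an AdaBoost classifier to
# separate Netherlandic from Flemish Dutch, loosely following the abstract.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented placeholder subtitles and variety labels.
subtitles = ["dat is echt leuk zeg", "da's keiplezant hoor",
             "wat een mooie dag vandaag", "amai wat een schoon weer"]
variety = ["netherlandic", "flemish", "netherlandic", "flemish"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # word unigrams and bigrams
    AdaBoostClassifier(n_estimators=50),
)
clf.fit(subtitles, variety)
pred = clf.predict(["wat een leuke dag"])
```

The paper's full feature bundle additionally includes text statistics (average word and sentence length) and syntactic ratios, which would be concatenated with the n-gram counts before boosting.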
Learning to Identify Arabic and German Dialects using Multiple Kernels
Title | Learning to Identify Arabic and German Dialects using Multiple Kernels |
Authors | Radu Tudor Ionescu, Andrei Butnaru |
Abstract | We present a machine learning approach for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the DSL 2017 Challenge. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided only for the Arabic data. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Our approach is shallow and simple, but the empirical results obtained in the shared tasks prove that it achieves very good results. Indeed, we ranked first in the ADI Shared Task with a weighted F1 score of 76.32% (4.62% above the second place) and fifth in the GDI Shared Task with a weighted F1 score of 63.67% (2.57% below the first place). |
Tasks | |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1225/ |
PWC | https://paperswithcode.com/paper/learning-to-identify-arabic-and-german |
Repo | |
Framework | |
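The character p-gram kernels and their combination described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the spectrum-style kernel construction, the unweighted kernel sum, and the toy transcripts are assumptions, and sklearn's Kernel Ridge Regression stands in for the KRR stage.

```python
# Toy sketch: linear kernels over character p-gram counts, combined by
# summation, then Kernel Ridge Regression on the precomputed Gram matrix.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.kernel_ridge import KernelRidge

# Invented placeholder "transcripts" with a +1/-1 dialect encoding.
train = ["grüezi mitenand", "moin moin", "servus beinand", "grüessech wohl"]
y = np.array([1.0, -1.0, -1.0, 1.0])

def pgram_kernel(docs, p):
    """Linear kernel over character p-gram counts (a 'spectrum' kernel)."""
    counts = CountVectorizer(analyzer="char", ngram_range=(p, p)).fit_transform(docs)
    return (counts @ counts.T).toarray().astype(float)

# Multiple-kernel combination: here a simple unweighted sum of p-gram kernels.
K = sum(pgram_kernel(train, p) for p in (2, 3, 4))

krr = KernelRidge(kernel="precomputed", alpha=1.0).fit(K, y)
pred = np.sign(krr.predict(K))  # binary decision from the regression output
```

At test time, the Gram matrix between test and training transcripts would be passed to `predict` instead; the paper additionally learns kernel weights rather than summing uniformly.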
LIPN at SemEval-2017 Task 10: Filtering Candidate Keyphrases from Scientific Publications with Part-of-Speech Tag Sequences to Train a Sequence Labeling Model
Title | LIPN at SemEval-2017 Task 10: Filtering Candidate Keyphrases from Scientific Publications with Part-of-Speech Tag Sequences to Train a Sequence Labeling Model |
Authors | Simon David Hernandez, Davide Buscaldi, Thierry Charnois |
Abstract | This paper describes the system used by the team LIPN in SemEval 2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications. The team participated in Scenario 1, which includes three subtasks: identification of keyphrases (Subtask A), classification of identified keyphrases (Subtask B), and extraction of relationships between two identified keyphrases (Subtask C). The presented system was mainly focused on the use of part-of-speech tag sequences to filter candidate keyphrases for Subtask A. Subtasks A and B were addressed as a sequence labeling problem using Conditional Random Fields (CRFs), and even though Subtask C was out of the scope of this approach, one rule was included to identify synonyms. |
Tasks | |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/S17-2174/ |
PWC | https://paperswithcode.com/paper/lipn-at-semeval-2017-task-10-filtering |
Repo | |
Framework | |
IITP at IJCNLP-2017 Task 4: Auto Analysis of Customer Feedback using CNN and GRU Network
Title | IITP at IJCNLP-2017 Task 4: Auto Analysis of Customer Feedback using CNN and GRU Network |
Authors | Deepak Gupta, Pabitra Lenka, Harsimran Bedi, Asif Ekbal, Pushpak Bhattacharyya |
Abstract | Analyzing customer feedback is the best way to channel the data into new marketing strategies that benefit entrepreneurs as well as customers. Therefore, an automated system that can analyze customer behavior is in great demand. Users may write feedback in any language, and hence mining appropriate information often becomes intractable. Especially in a traditional feature-based supervised model, it is difficult to build a generic system, as one has to understand the concerned language to find the relevant features. In order to overcome this, we propose deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based approaches that do not require handcrafting of features. We evaluate these techniques for analyzing customer feedback sentences in four languages, namely English, French, Japanese, and Spanish. Our empirical analysis shows that our models perform well in all four languages on the setups of the IJCNLP Shared Task on Customer Feedback Analysis. Our model achieved the second rank in French, with an accuracy of 71.75%, and third rank in all the other languages. |
Tasks | Document Classification, Emotion Classification, Sentiment Analysis |
Published | 2017-12-01 |
URL | https://www.aclweb.org/anthology/I17-4031/ |
PWC | https://paperswithcode.com/paper/iitp-at-ijcnlp-2017-task-4-auto-analysis-of |
Repo | |
Framework | |
A Neural Architecture for Dialectal Arabic Segmentation
Title | A Neural Architecture for Dialectal Arabic Segmentation |
Authors | Younes Samih, Mohammed Attia, Mohamed Eldesouki, Ahmed Abdelali, Hamdy Mubarak, Laura Kallmeyer, Kareem Darwish |
Abstract | The automated processing of Arabic dialects is challenging due to the lack of spelling standards and to the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks, without any normalization or use of lexical features or lexical resources. We treat segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that rely on additional resources. |
Tasks | Machine Translation, Morphological Analysis, Part-Of-Speech Tagging |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1306/ |
PWC | https://paperswithcode.com/paper/a-neural-architecture-for-dialectal-arabic |
Repo | |
Framework | |
A Morphological Analyzer for Gulf Arabic Verbs
Title | A Morphological Analyzer for Gulf Arabic Verbs |
Authors | Salam Khalifa, Sara Hassan, Nizar Habash |
Abstract | We present CALIMA-GLF, a Gulf Arabic morphological analyzer currently covering over 2,600 verbal lemmas. We describe in detail the process of building the analyzer, starting from phonetic dictionary entries to fully inflected orthographic paradigms and the associated lexicon and orthographic variants. We evaluate the coverage of CALIMA-GLF against Modern Standard Arabic and Egyptian Arabic analyzers on part of a Gulf Arabic novel. CALIMA-GLF verb analysis token recall for identifying the correct POS tag outperforms both the Modern Standard Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute, respectively. |
Tasks | Morphological Tagging, Part-Of-Speech Tagging |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1305/ |
PWC | https://paperswithcode.com/paper/a-morphological-analyzer-for-gulf-arabic |
Repo | |
Framework | |
Unsupervised Domain Adaptation for Clinical Negation Detection
Title | Unsupervised Domain Adaptation for Clinical Negation Detection |
Authors | Timothy Miller, Steven Bethard, Hadi Amiri, Guergana Savova |
Abstract | Detecting negated concepts in clinical texts is an important part of NLP information extraction systems. However, generalizability of negation systems is lacking, as cross-domain experiments suffer dramatic performance losses. We examine the performance of multiple unsupervised domain adaptation algorithms on clinical negation detection, finding only modest gains that fall well short of in-domain performance. |
Tasks | Domain Adaptation, Negation Detection, Unsupervised Domain Adaptation |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/W17-2320/ |
PWC | https://paperswithcode.com/paper/unsupervised-domain-adaptation-for-clinical |
Repo | |
Framework | |
Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach
Title | Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach |
Authors | Fahad Albogamy, Allan Ramsay, Hanady Ahmed |
Abstract | In this paper, we propose using a "bootstrapping" method for constructing a dependency treebank of Arabic tweets. This method uses a rule-based parser to create a small treebank of one thousand Arabic tweets and a data-driven parser to create a larger treebank by using the small treebank as a seed training set. We are able to create a dependency treebank from unlabelled tweets without any manual intervention. Experimental results show that this method can improve the speed of training the parser and the accuracy of the resulting parsers. |
Tasks | Domain Adaptation |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1312/ |
PWC | https://paperswithcode.com/paper/arabic-tweets-treebanking-and-parsing-a |
Repo | |
Framework | |
Post-Processing Techniques for Improving Predictions of Multilabel Learning Approaches
Title | Post-Processing Techniques for Improving Predictions of Multilabel Learning Approaches |
Authors | Akshay Soni, Aasish Pappu, Jerry Chia-mau Ni, Troy Chevalier |
Abstract | In Multilabel Learning (MLL), each training instance is associated with a set of labels, and the task is to learn a function that maps an unseen instance to its corresponding label set. In this paper, we present a suite of MLL-algorithm-independent post-processing techniques that exploit conditional and directional label dependencies in order to make the predictions from any MLL approach more coherent and precise. We solve a constrained optimization problem over the output produced by any MLL approach, and the result is a refined version of the input predicted label set. Using the proposed techniques, we show an absolute improvement of 3% on the English News and 10% on the Chinese E-commerce datasets for the P@K metric. |
Tasks | |
Published | 2017-11-01 |
URL | https://www.aclweb.org/anthology/I17-2011/ |
PWC | https://paperswithcode.com/paper/post-processing-techniques-for-improving |
Repo | |
Framework | |
Robust Dictionary Lookup in Multiple Noisy Orthographies
Title | Robust Dictionary Lookup in Multiple Noisy Orthographies |
Authors | Lingliang Zhang, Nizar Habash, Godfried Toussaint |
Abstract | We present the MultiScript Phonetic Search algorithm to address the problem of language learners looking up unfamiliar words that they heard. We apply it to Arabic dictionary lookup with noisy queries done using both the Arabic and Roman scripts. Our algorithm is based on a computational phonetic distance metric that can be optionally machine learned. To benchmark our performance, we created the ArabScribe dataset, containing 10,000 noisy transcriptions of random Arabic dictionary words. Our algorithm outperforms Google Translate's "did you mean" feature, as well as the Yamli smart Arabic keyboard. |
Tasks | Transliteration |
Published | 2017-04-01 |
URL | https://www.aclweb.org/anthology/W17-1315/ |
PWC | https://paperswithcode.com/paper/robust-dictionary-lookup-in-multiple-noisy |
Repo | |
Framework | |
ULISBOA at SemEval-2017 Task 12: Extraction and classification of temporal expressions and events
Title | ULISBOA at SemEval-2017 Task 12: Extraction and classification of temporal expressions and events |
Authors | Andre Lamurias, Diana Sousa, Sofia Pereira, Luka Clarke, Francisco M. Couto |
Abstract | This paper presents our approach to participating in the SemEval 2017 Task 12: Clinical TempEval challenge, specifically in the event and time expression span and attribute identification subtasks (ES, EA, TS, TA). Our approach consisted of training Conditional Random Fields (CRF) classifiers using the provided annotations, and creating manually curated rules to classify the attributes of each event and time expression. We used a set of common features for the event and time CRF classifiers, and a set of features specific to each type of entity, based on domain knowledge. Training only on the source domain data, our best F-scores were 0.683 and 0.485 for the event and time span identification subtasks. When adding target domain annotations to the training data, the best F-scores obtained were 0.729 and 0.554 for the same subtasks. We obtained the second-highest F-score of the challenge on the event polarity subtask (0.708). The source code of our system, Clinical Timeline Annotation (CiTA), is available at https://github.com/lasigeBioTM/CiTA. |
Tasks | |
Published | 2017-08-01 |
URL | https://www.aclweb.org/anthology/S17-2179/ |
PWC | https://paperswithcode.com/paper/ulisboa-at-semeval-2017-task-12-extraction |
Repo | |
Framework | |