May 5, 2019


Paper Group NANR 22

Discriminating Similar Languages with Linear SVMs and Neural Networks. Age and Gender Prediction on Health Forum Data. An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes. Towards Automatic Transcription of ILSE - an Interdisciplinary Longitudinal Study of Adult Development and Aging. Evaluating the Impact of Light Post-Ed …

Discriminating Similar Languages with Linear SVMs and Neural Networks


Title	Discriminating Similar Languages with Linear SVMs and Neural Networks
Authors	Çağrı Çöltekin, Taraka Rama
Abstract	This paper describes the systems we experimented with for participating in the discriminating between similar languages (DSL) shared task 2016. We submitted results of a single system based on support vector machines (SVM) with a linear kernel and character n-gram features, which ranked first in the closed training track for test set A. Besides the linear SVM, we also report additional experiments with a number of deep learning architectures. Despite our intuition that non-linear deep learning methods should be advantageous, linear models seem to fare better in this task, at least with the amount of data and the amount of effort we spent on tuning these models.
Tasks	Language Identification
Published	2016-12-01
URL	https://www.aclweb.org/anthology/W16-4802/
PDF	https://www.aclweb.org/anthology/W16-4802
PWC	https://paperswithcode.com/paper/discriminating-similar-languages-with-linear
Repo
Framework
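
As a rough illustration of the character n-gram features and linear decision function the abstract mentions, here is a minimal sketch. The n-gram range, the raw-count weighting, and the example weights are assumptions made for illustration; the abstract does not state the settings the authors used.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count overlapping character n-grams for every n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def linear_score(features, weights, bias=0.0):
    """Decision value of a linear model (w . x + b) over sparse features."""
    return bias + sum(weights.get(gram, 0.0) * count
                      for gram, count in features.items())


# Hypothetical weights; a trained linear SVM would supply one weight per n-gram.
example_weights = {"ab": 1.0, "ba": -0.5}
example_score = linear_score(char_ngrams("abab", 2, 2), example_weights)
```

A positive score would be read as one class and a negative score as the other; a multi-class setup such as language identification would typically keep one weight vector per language and pick the highest score.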

Age and Gender Prediction on Health Forum Data


Title	Age and Gender Prediction on Health Forum Data
Authors	Prasha Shrestha, Nicolas Rey-Villamizar, Farig Sadeque, Ted Pedersen, Steven Bethard, Thamar Solorio
Abstract	Health support forums have become a rich source of data that can be used to improve health care outcomes. A user profile, including information such as age and gender, can support targeted analysis of forum data. But users might not always disclose their age and gender. It is desirable then to be able to automatically extract this information from users' content. However, to the best of our knowledge there is no such resource for author profiling of health forum data. Here we present a large corpus, with close to 85,000 users, for profiling and also outline our approach and benchmark results to automatically detect a user's age and gender from their forum posts. We use a mix of features from a user's text as well as forum-specific features to obtain accuracy well above the baseline, thus showing that both our dataset and our method are useful and valid.
Tasks	Gender Prediction
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1541/
PDF	https://www.aclweb.org/anthology/L16-1541
PWC	https://paperswithcode.com/paper/age-and-gender-prediction-on-health-forum
Repo
Framework

An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes


Title	An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes
Authors	Kai Frederic Engelmann, Patrick Holthaus, Britta Wrede, Sebastian Wrede
Abstract	The term smart home refers to a living environment that by its connected sensors and actuators is capable of providing intelligent and contextualised support to its user. This may result in automated behaviors that blend into the user's daily life. However, currently most smart homes do not provide such intelligent support. A first step towards such intelligent capabilities lies in learning automation rules by observing the user's behavior. We present a new type of corpus for learning such rules from user behavior as observed from the events in a smart home's sensor and actuator network. The data contains information about intended tasks by the users and synchronized events from this network. It is derived from interactions of 59 users with the smart home in order to solve five tasks. The corpus contains recordings of more than 40 different types of data streams and has been segmented and pre-processed to increase signal quality. Overall, the data shows a high noise level on specific data types that can be filtered out by a simple smoothing approach. The resulting data provides insights into event patterns resulting from task-specific user behavior and thus constitutes a basis for machine learning approaches to learn automation rules.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1228/
PDF	https://www.aclweb.org/anthology/L16-1228
PWC	https://paperswithcode.com/paper/an-interaction-centric-dataset-for-learning
Repo
Framework

Towards Automatic Transcription of ILSE - an Interdisciplinary Longitudinal Study of Adult Development and Aging


Title	Towards Automatic Transcription of ILSE - an Interdisciplinary Longitudinal Study of Adult Development and Aging
Authors	Jochen Weiner, Claudia Frankenberg, Dominic Telaar, Britta Wendelstein, Johannes Schröder, Tanja Schultz
Abstract	The Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE) was created to facilitate the study of challenges posed by rapidly aging societies in developed countries such as Germany. ILSE contains over 8,000 hours of biographic interviews recorded from more than 1,000 participants over the course of 20 years. Investigations on various aspects of aging, such as cognitive decline, often rely on the analysis of linguistic features which can be derived from spoken content like these interviews. However, transcribing speech is a time and cost consuming manual process and so far only 380 hours of ILSE interviews have been transcribed. Thus, it is the aim of our work to establish technical systems to fully automatically transcribe the ILSE interview data. The joint occurrence of poor recording quality, long audio segments, erroneous transcriptions, varying speaking styles & crosstalk, and emotional & dialectal speech in these interviews presents challenges for automatic speech recognition (ASR). We describe our ongoing work towards the fully automatic transcription of all ILSE interviews and the steps we implemented in preparing the transcriptions to meet the interviews' challenges. Using a recursive long audio alignment procedure 96 hours of the transcribed data have been made accessible for ASR training.
Tasks	Speech Recognition
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1114/
PDF	https://www.aclweb.org/anthology/L16-1114
PWC	https://paperswithcode.com/paper/towards-automatic-transcription-of-ilse-a-an
Repo
Framework

Evaluating the Impact of Light Post-Editing on Usability


Title	Evaluating the Impact of Light Post-Editing on Usability
Authors	Sheila Castilho, Sharon O'Brien
Abstract	This paper discusses a methodology to measure the usability of machine translated content by end users, comparing lightly post-edited content with raw output and with the usability of source language content. The content selected consists of Online Help articles from a software company for a spreadsheet application, translated from English into German. Three groups of five users each used either the source text - the English version (EN) -, the raw MT version (DE_MT), or the light PE version (DE_PE), and were asked to carry out six tasks. Usability was measured using an eye tracker and cognitive, temporal and pragmatic measures of usability. Satisfaction was measured via a post-task questionnaire presented after the participants had completed the tasks.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1048/
PDF	https://www.aclweb.org/anthology/L16-1048
PWC	https://paperswithcode.com/paper/evaluating-the-impact-of-light-post-editing
Repo
Framework

Japanese Post-verbal Constructions Revisited


Title	Japanese Post-verbal Constructions Revisited
Authors	Kohji Kamada
Abstract
Tasks
Published	2016-10-01
URL	https://www.aclweb.org/anthology/Y16-3002/
PDF	https://www.aclweb.org/anthology/Y16-3002
PWC	https://paperswithcode.com/paper/japanese-post-verbal-constructions-revisited
Repo
Framework

Towards a unified account of resultative constructions in Korean


Title	Towards a unified account of resultative constructions in Korean
Authors	Juwon Lee
Abstract
Tasks
Published	2016-10-01
URL	https://www.aclweb.org/anthology/Y16-3023/
PDF	https://www.aclweb.org/anthology/Y16-3023
PWC	https://paperswithcode.com/paper/towards-a-unified-account-of-resultative
Repo
Framework

Empty element recovery by spinal parser operations


Title	Empty element recovery by spinal parser operations
Authors	Katsuhiko Hayashi, Masaaki Nagata
Abstract
Tasks
Published	2016-08-01
URL	https://www.aclweb.org/anthology/P16-2016/
PDF	https://www.aclweb.org/anthology/P16-2016
PWC	https://paperswithcode.com/paper/empty-element-recovery-by-spinal-parser
Repo
Framework

OSMAN - A Novel Arabic Readability Metric


Title	OSMAN - A Novel Arabic Readability Metric
Authors	Mahmoud El-Haj, Paul Rayson
Abstract	We present OSMAN (Open Source Metric for Measuring Arabic Narratives) - a novel open source Arabic readability metric and tool. It allows researchers to calculate readability for Arabic text with and without diacritics. OSMAN is a modified version of the conventional readability formulas such as Flesch and Fog. In our work we introduce a novel approach towards counting short, long and stress syllables in Arabic which is essential for judging readability of Arabic narratives. We also introduce an additional factor called "Faseeh" which considers aspects of script usually dropped in informal Arabic writing. To evaluate our methods we used Spearman's correlation metric to compare text readability for 73,000 parallel sentences from English and Arabic UN documents. The Arabic sentences were written with the absence of diacritics and in order to count the number of syllables we added the diacritics in using an open source tool called Mishkal. The results show that OSMAN readability formula correlates well with the English ones making it a useful tool for researchers and educators working with Arabic text.
Tasks
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1038/
PDF	https://www.aclweb.org/anthology/L16-1038
PWC	https://paperswithcode.com/paper/osman-a-a-novel-arabic-readability-metric
Repo
Framework
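
The evaluation described in the abstract compares readability scores across parallel sentences using Spearman's correlation. A minimal sketch of that statistic follows; it assumes no tied values, and the score lists are invented placeholders rather than data from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical readability scores for four parallel sentences.
english_scores = [3.1, 1.2, 2.5, 4.0]
arabic_scores = [2.0, 1.0, 4.0, 3.0]
rho = spearman(english_scores, arabic_scores)
```

With real data containing ties, a tie-aware rank (average ranks) and the Pearson correlation of the ranks would be the safer choice.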

Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation


Title	Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation
Authors	Shin Kanouchi, Katsuhito Sudoh, Mamoru Komachi
Abstract	This paper presents an improved lexicalized reordering model for phrase-based statistical machine translation using a deep neural network. Lexicalized reordering suffers from reordering ambiguity, data sparseness, and noise in the phrase table. The previous neural reordering model successfully addresses the first and second problems but fails to address the third. Therefore, we propose new features using phrase translation and word alignment to construct phrase vectors to handle inherently noisy phrase translation pairs. The experimental results show that our proposed method improves the accuracy of phrase reordering. We confirm that the proposed method works well with phrase pairs including NULL alignments.
Tasks	Machine Translation, Word Alignment
Published	2016-12-01
URL	https://www.aclweb.org/anthology/W16-4607/
PDF	https://www.aclweb.org/anthology/W16-4607
PWC	https://paperswithcode.com/paper/neural-reordering-model-considering-phrase
Repo
Framework

Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo


Title	Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo
Authors	Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, Gaël Richard
Abstract	Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) algorithms have become increasingly popular for Bayesian inference in large-scale applications. Even though these methods have proved useful in several scenarios, their performance is often limited by their bias. In this study, we propose a novel sampling algorithm that aims to reduce the bias of SG-MCMC while keeping the variance at a reasonable level. Our approach is based on a numerical sequence acceleration method, namely the Richardson-Romberg extrapolation, which simply boils down to running almost the same SG-MCMC algorithm twice in parallel with different step sizes. We illustrate our framework on the popular Stochastic Gradient Langevin Dynamics (SGLD) algorithm and propose a novel SG-MCMC algorithm referred to as Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD). We provide formal theoretical analysis and show that SGRRLD is asymptotically consistent, satisfies a central limit theorem, and its non-asymptotic bias and the mean squared-error can be bounded. Our results show that SGRRLD attains higher rates of convergence than SGLD in both finite-time and asymptotically, and it achieves the theoretical accuracy of the methods that are based on higher-order integrators. We support our findings using both synthetic and real data experiments.
Tasks	Bayesian Inference
Published	2016-12-01
URL	http://papers.nips.cc/paper/6514-stochastic-gradient-richardson-romberg-markov-chain-monte-carlo
PDF	http://papers.nips.cc/paper/6514-stochastic-gradient-richardson-romberg-markov-chain-monte-carlo.pdf
PWC	https://paperswithcode.com/paper/stochastic-gradient-richardson-romberg-markov
Repo
Framework
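
The abstract describes Richardson-Romberg extrapolation as running nearly the same sampler twice with different step sizes. The sketch below illustrates the bias-cancelling combination on a toy case. Everything specific here is an assumption for illustration: the target is a standard normal, the gradient is exact rather than stochastic, the two chains would use independent noise, and the step-size pair (gamma, gamma / 2) with the combination 2 * estimate(gamma / 2) - estimate(gamma) is the textbook form of the extrapolation, not necessarily the paper's exact construction.

```python
import math
import random


def langevin_second_moment(step, n_iter, seed):
    """Estimate E[x^2] under N(0, 1) with unadjusted Langevin dynamics.

    Toy setting (an assumption): U(x) = x^2 / 2, so grad U(x) = x, and the
    gradient is exact rather than a minibatch estimate.
    """
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(n_iter):
        x = x - step * x + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        total += x * x
    return total / n_iter


def richardson_romberg(est_full, est_half):
    """Combine estimates at steps gamma and gamma / 2 to cancel O(gamma) bias."""
    return 2.0 * est_half - est_full


def stationary_second_moment(step):
    """Closed-form E[x^2] of the discretised chain for this toy target."""
    return 1.0 / (1.0 - step / 2.0)


gamma = 0.2
biased = stationary_second_moment(gamma)        # single chain, step gamma
half = stationary_second_moment(gamma / 2.0)    # single chain, step gamma / 2
extrapolated = richardson_romberg(biased, half)
```

For this toy target the true value of E[x^2] is 1, and the closed-form values show the extrapolated estimate sitting much closer to it than either single-step estimate. In practice the two inputs would come from simulated chains, for example `langevin_second_moment(gamma, n, seed)` and `langevin_second_moment(gamma / 2, 2 * n, other_seed)`, at the cost of extra variance.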

Phonetic Inventory for an Arabic Speech Corpus


Title	Phonetic Inventory for an Arabic Speech Corpus
Authors	Nawar Halabi, Mike Wald
Abstract	Corpus design for speech synthesis is a well-researched topic in languages such as English compared to Modern Standard Arabic, and there is a tendency to focus on methods to automatically generate the orthographic transcript to be recorded (usually greedy methods). In this work, a study of Modern Standard Arabic (MSA) phonetics and phonology is conducted in order to create criteria for a greedy method to create a speech corpus transcript for recording. The size of the dataset is reduced a number of times using these optimisation methods with different parameters to yield a much smaller dataset with the same phonetic coverage as before the reduction, and this output transcript is chosen for recording. This is part of a larger work to create a completely annotated and segmented speech corpus for MSA.
Tasks	Speech Synthesis
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1116/
PDF	https://www.aclweb.org/anthology/L16-1116
PWC	https://paperswithcode.com/paper/phonetic-inventory-for-an-arabic-speech
Repo
Framework
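
The greedy transcript reduction mentioned in the abstract can be pictured as a set-cover style selection: repeatedly keep the sentence that adds the most not-yet-covered units until coverage matches the full pool. This is a generic sketch, not the authors' criteria; character bigrams stand in for phonetic units, and the function names are hypothetical.

```python
def char_bigrams(sentence):
    """Stand-in for phonetic units (an assumption): character bigrams."""
    return {sentence[i:i + 2] for i in range(len(sentence) - 1)}


def greedy_select(sentences, units_of):
    """Pick sentences until every unit seen in the pool is covered."""
    remaining = set()
    for sentence in sentences:
        remaining |= units_of(sentence)
    candidates = list(sentences)
    chosen = []
    while remaining:
        # Keep the candidate that covers the most still-uncovered units.
        best = max(candidates, key=lambda s: len(units_of(s) & remaining))
        chosen.append(best)
        remaining -= units_of(best)
        candidates.remove(best)
    return chosen


pool = ["abcd", "ab", "cd", "de"]
selected = greedy_select(pool, char_bigrams)
```

Here two of the four sentences are enough to cover every bigram in the pool, which mirrors the abstract's point that a much smaller transcript can keep the same coverage.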

Detecting Common Discussion Topics Across Culture From News Reader Comments


Title	Detecting Common Discussion Topics Across Culture From News Reader Comments
Authors	Bei Shi, Wai Lam, Lidong Bing, Yinqing Xu
Abstract
Tasks
Published	2016-08-01
URL	https://www.aclweb.org/anthology/P16-1064/
PDF	https://www.aclweb.org/anthology/P16-1064
PWC	https://paperswithcode.com/paper/detecting-common-discussion-topics-across
Repo
Framework

Does 'well-being' translate on Twitter?


Title	Does 'well-being' translate on Twitter?
Authors	Laura Smith, Salvatore Giorgi, Rishi Solanki, Johannes Eichstaedt, H. Andrew Schwartz, Muhammad Abdul-Mageed, Anneke Buffone, Lyle Ungar
Abstract
Tasks	Machine Translation, Sentiment Analysis
Published	2016-11-01
URL	https://www.aclweb.org/anthology/D16-1217/
PDF	https://www.aclweb.org/anthology/D16-1217
PWC	https://paperswithcode.com/paper/does-awell-beinga-translate-on-twitter
Repo
Framework

Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform


Title	Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform
Authors	Sharmin Muzaffar, Pitambar Behera, Girish Jha
Abstract	In South-Asian languages such as Hindi and Urdu, action verbs having compound constructions and serial verb constructions pose serious problems for natural language processing and other linguistic tasks. Urdu is an Indo-Aryan language spoken by 51,500,000 speakers in India. Action verbs that occur spontaneously in day-to-day communication are highly ambiguous in nature semantically and as a consequence cause disambiguation issues that are relevant and applicable to Language Technologies (LT) like Machine Translation (MT) and Natural Language Processing (NLP). IMAGACT4ALL is an ontology-driven web-based platform developed by the University of Florence for storing action verbs and their inter-relations. This group is currently collaborating with Jawaharlal Nehru University (JNU) in India to connect Indian languages on this platform. Action verbs are frequently used in both written and spoken discourses and refer to various meanings because of their polysemic nature. The IMAGACT4ALL platform stores each 3D animation image, each one of them referring to a variety of possible ontological types, which in turn makes the annotation task for the annotator quite challenging with regard to selecting verb argument structure having a range of probability distribution. The authors, in this paper, discuss the issues and challenges such as complex predicates (compound and conjunct verbs), ambiguously animated video illustrations, semantic discrepancies, and the factors of verb-selection preferences that have produced significant problems in annotating Urdu verbs on the IMAGACT ontology.
Tasks	Machine Translation
Published	2016-05-01
URL	https://www.aclweb.org/anthology/L16-1230/
PDF	https://www.aclweb.org/anthology/L16-1230
PWC	https://paperswithcode.com/paper/issues-and-challenges-in-annotating-urdu
Repo
Framework