May 5, 2019

Paper Group NANR 22

Discriminating Similar Languages with Linear SVMs and Neural Networks

Title Discriminating Similar Languages with Linear SVMs and Neural Networks
Authors Çağrı Çöltekin, Taraka Rama
Abstract This paper describes the systems we experimented with for participating in the Discriminating between Similar Languages (DSL) shared task 2016. We submitted results of a single system based on support vector machines (SVM) with a linear kernel and character n-gram features, which obtained the first rank in the closed training track for test set A. Besides the linear SVM, we also report additional experiments with a number of deep learning architectures. Despite our intuition that non-linear deep learning methods should be advantageous, linear models seem to fare better in this task, at least with the amount of data and the amount of effort we spent on tuning these models.
Tasks Language Identification
Published 2016-12-01
URL https://www.aclweb.org/anthology/W16-4802/
PDF https://www.aclweb.org/anthology/W16-4802
PWC https://paperswithcode.com/paper/discriminating-similar-languages-with-linear
Repo
Framework
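
The character n-gram features mentioned in the abstract above can be sketched as follows. This is a minimal illustration only: the function name `char_ngrams`, the n-gram range, and the sample string are assumptions, not details taken from the paper, and the resulting counts would still need to be weighted and passed to a linear SVM implementation.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count all character n-grams of length n_min..n_max in text.

    Hypothetical helper for illustration; the paper's actual feature
    extraction (n-gram range, weighting) is not described in the abstract.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the string.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Toy input: bigrams and trigrams of a short string.
features = char_ngrams("ovo je", n_min=2, n_max=3)
```

Each text becomes a sparse bag of overlapping character sequences, which is the kind of representation a linear classifier can use to separate closely related languages.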

Age and Gender Prediction on Health Forum Data

Title Age and Gender Prediction on Health Forum Data
Authors Prasha Shrestha, Nicolas Rey-Villamizar, Farig Sadeque, Ted Pedersen, Steven Bethard, Thamar Solorio
Abstract Health support forums have become a rich source of data that can be used to improve health care outcomes. A user profile, including information such as age and gender, can support targeted analysis of forum data. But users might not always disclose their age and gender. It is desirable then to be able to automatically extract this information from users' content. However, to the best of our knowledge there is no such resource for author profiling of health forum data. Here we present a large corpus, with close to 85,000 users, for profiling and also outline our approach and benchmark results to automatically detect a user's age and gender from their forum posts. We use a mix of features from a user's text as well as forum specific features to obtain accuracy well above the baseline, thus showing that both our dataset and our method are useful and valid.
Tasks Gender Prediction
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1541/
PDF https://www.aclweb.org/anthology/L16-1541
PWC https://paperswithcode.com/paper/age-and-gender-prediction-on-health-forum
Repo
Framework

An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes

Title An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes
Authors Kai Frederic Engelmann, Patrick Holthaus, Britta Wrede, Sebastian Wrede
Abstract The term smart home refers to a living environment that by its connected sensors and actuators is capable of providing intelligent and contextualised support to its user. This may result in automated behaviors that blend into the user's daily life. However, currently most smart homes do not provide such intelligent support. A first step towards such intelligent capabilities lies in learning automation rules by observing the user's behavior. We present a new type of corpus for learning such rules from user behavior as observed from the events in a smart home's sensor and actuator network. The data contains information about intended tasks by the users and synchronized events from this network. It is derived from interactions of 59 users with the smart home in order to solve five tasks. The corpus contains recordings of more than 40 different types of data streams and has been segmented and pre-processed to increase signal quality. Overall, the data shows a high noise level on specific data types that can be filtered out by a simple smoothing approach. The resulting data provides insights into event patterns resulting from task specific user behavior and thus constitutes a basis for machine learning approaches to learn automation rules.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1228/
PDF https://www.aclweb.org/anthology/L16-1228
PWC https://paperswithcode.com/paper/an-interaction-centric-dataset-for-learning
Repo
Framework

Towards Automatic Transcription of ILSE ― an Interdisciplinary Longitudinal Study of Adult Development and Aging

Title Towards Automatic Transcription of ILSE ― an Interdisciplinary Longitudinal Study of Adult Development and Aging
Authors Jochen Weiner, Claudia Frankenberg, Dominic Telaar, Britta Wendelstein, Johannes Schröder, Tanja Schultz
Abstract The Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE) was created to facilitate the study of challenges posed by rapidly aging societies in developed countries such as Germany. ILSE contains over 8,000 hours of biographic interviews recorded from more than 1,000 participants over the course of 20 years. Investigations on various aspects of aging, such as cognitive decline, often rely on the analysis of linguistic features which can be derived from spoken content like these interviews. However, transcribing speech is a time-consuming and costly manual process, and so far only 380 hours of ILSE interviews have been transcribed. Thus, it is the aim of our work to establish technical systems to fully automatically transcribe the ILSE interview data. The joint occurrence of poor recording quality, long audio segments, erroneous transcriptions, varying speaking styles & crosstalk, and emotional & dialectal speech in these interviews presents challenges for automatic speech recognition (ASR). We describe our ongoing work towards the fully automatic transcription of all ILSE interviews and the steps we implemented in preparing the transcriptions to meet the interviews' challenges. Using a recursive long audio alignment procedure, 96 hours of the transcribed data have been made accessible for ASR training.
Tasks Speech Recognition
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1114/
PDF https://www.aclweb.org/anthology/L16-1114
PWC https://paperswithcode.com/paper/towards-automatic-transcription-of-ilse-a-an
Repo
Framework

Evaluating the Impact of Light Post-Editing on Usability

Title Evaluating the Impact of Light Post-Editing on Usability
Authors Sheila Castilho, Sharon O'Brien
Abstract This paper discusses a methodology to measure the usability of machine translated content by end users, comparing lightly post-edited content with raw output and with the usability of source language content. The content selected consists of Online Help articles from a software company for a spreadsheet application, translated from English into German. Three groups of five users each used either the English source text (EN), the raw MT version (DE_MT), or the light PE version (DE_PE), and were asked to carry out six tasks. Usability was measured using an eye tracker and cognitive, temporal and pragmatic measures of usability. Satisfaction was measured via a post-task questionnaire presented after the participants had completed the tasks.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1048/
PDF https://www.aclweb.org/anthology/L16-1048
PWC https://paperswithcode.com/paper/evaluating-the-impact-of-light-post-editing
Repo
Framework

Japanese Post-verbal Constructions Revisited

Title Japanese Post-verbal Constructions Revisited
Authors Kohji Kamada
Abstract
Tasks
Published 2016-10-01
URL https://www.aclweb.org/anthology/Y16-3002/
PDF https://www.aclweb.org/anthology/Y16-3002
PWC https://paperswithcode.com/paper/japanese-post-verbal-constructions-revisited
Repo
Framework

Towards a unified account of resultative constructions in Korean

Title Towards a unified account of resultative constructions in Korean
Authors Juwon Lee
Abstract
Tasks
Published 2016-10-01
URL https://www.aclweb.org/anthology/Y16-3023/
PDF https://www.aclweb.org/anthology/Y16-3023
PWC https://paperswithcode.com/paper/towards-a-unified-account-of-resultative
Repo
Framework

Empty element recovery by spinal parser operations

Title Empty element recovery by spinal parser operations
Authors Katsuhiko Hayashi, Masaaki Nagata
Abstract
Tasks
Published 2016-08-01
URL https://www.aclweb.org/anthology/P16-2016/
PDF https://www.aclweb.org/anthology/P16-2016
PWC https://paperswithcode.com/paper/empty-element-recovery-by-spinal-parser
Repo
Framework

OSMAN ― A Novel Arabic Readability Metric

Title OSMAN ― A Novel Arabic Readability Metric
Authors Mahmoud El-Haj, Paul Rayson
Abstract We present OSMAN (Open Source Metric for Measuring Arabic Narratives), a novel open source Arabic readability metric and tool. It allows researchers to calculate readability for Arabic text with and without diacritics. OSMAN is a modified version of conventional readability formulas such as Flesch and Fog. In our work we introduce a novel approach towards counting short, long and stress syllables in Arabic, which is essential for judging the readability of Arabic narratives. We also introduce an additional factor called "Faseeh" which considers aspects of script usually dropped in informal Arabic writing. To evaluate our methods we used Spearman's correlation metric to compare text readability for 73,000 parallel sentences from English and Arabic UN documents. The Arabic sentences were written without diacritics, so in order to count the number of syllables we added the diacritics using an open source tool called Mishkal. The results show that the OSMAN readability formula correlates well with the English ones, making it a useful tool for researchers and educators working with Arabic text.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1038/
PDF https://www.aclweb.org/anthology/L16-1038
PWC https://paperswithcode.com/paper/osman-a-a-novel-arabic-readability-metric
Repo
Framework
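
The evaluation described in the OSMAN abstract relies on Spearman's rank correlation between readability scores of parallel sentences. A minimal sketch of that statistic is given below; the function names and the toy score lists are hypothetical and are not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Assumes equal-length inputs that are not constant (otherwise the
    denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical readability scores for four parallel sentences.
english_scores = [60.0, 45.5, 72.3, 30.1]
arabic_scores = [110.2, 95.0, 130.4, 99.9]
rho = spearman(english_scores, arabic_scores)
```

Because only ranks are compared, the two metrics do not need to share a scale; a value of rho near 1 indicates that they order the sentences similarly.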

Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation

Title Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation
Authors Shin Kanouchi, Katsuhito Sudoh, Mamoru Komachi
Abstract This paper presents an improved lexicalized reordering model for phrase-based statistical machine translation using a deep neural network. Lexicalized reordering suffers from reordering ambiguity, data sparseness and noise in the phrase table. The previous neural reordering model successfully addresses the first two problems but fails to address the third. Therefore, we propose new features using phrase translation and word alignment to construct phrase vectors to handle inherently noisy phrase translation pairs. The experimental results show that our proposed method improves the accuracy of phrase reordering. We confirm that the proposed method works well with phrase pairs including NULL alignments.
Tasks Machine Translation, Word Alignment
Published 2016-12-01
URL https://www.aclweb.org/anthology/W16-4607/
PDF https://www.aclweb.org/anthology/W16-4607
PWC https://paperswithcode.com/paper/neural-reordering-model-considering-phrase
Repo
Framework

Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo

Title Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo
Authors Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, Gaël Richard
Abstract Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) algorithms have become increasingly popular for Bayesian inference in large-scale applications. Even though these methods have proved useful in several scenarios, their performance is often limited by their bias. In this study, we propose a novel sampling algorithm that aims to reduce the bias of SG-MCMC while keeping the variance at a reasonable level. Our approach is based on a numerical sequence acceleration method, namely the Richardson-Romberg extrapolation, which simply boils down to running almost the same SG-MCMC algorithm twice in parallel with different step sizes. We illustrate our framework on the popular Stochastic Gradient Langevin Dynamics (SGLD) algorithm and propose a novel SG-MCMC algorithm referred to as Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD). We provide formal theoretical analysis and show that SGRRLD is asymptotically consistent, satisfies a central limit theorem, and that its non-asymptotic bias and mean squared error can be bounded. Our results show that SGRRLD attains higher rates of convergence than SGLD both in finite time and asymptotically, and it achieves the theoretical accuracy of the methods that are based on higher-order integrators. We support our findings using both synthetic and real data experiments.
Tasks Bayesian Inference
Published 2016-12-01
URL http://papers.nips.cc/paper/6514-stochastic-gradient-richardson-romberg-markov-chain-monte-carlo
PDF http://papers.nips.cc/paper/6514-stochastic-gradient-richardson-romberg-markov-chain-monte-carlo.pdf
PWC https://paperswithcode.com/paper/stochastic-gradient-richardson-romberg-markov
Repo
Framework
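
The bias reduction behind Richardson-Romberg extrapolation can be illustrated on a case where everything is computable exactly. The sketch below is not the SGRRLD algorithm itself: it assumes a standard Gaussian target, uses full gradients (no minibatch noise), and tracks only the variance of the discretized Langevin chain `x <- (1 - step) * x + sqrt(2 * step) * noise`. The function name and step size are hypothetical.

```python
def stationary_variance(step, iters=10_000):
    """Variance of the Euler-discretized Langevin chain for a N(0, 1) target.

    For x <- (1 - step) * x + sqrt(2 * step) * noise with unit-variance noise,
    the variance obeys v <- (1 - step)**2 * v + 2 * step; iterate to its
    fixed point. Assumes 0 < step < 2 so that the recursion converges.
    """
    v = 0.0
    for _ in range(iters):
        v = (1.0 - step) ** 2 * v + 2.0 * step
    return v


gamma = 0.2
v_full = stationary_variance(gamma)       # biased: about 1 / (1 - gamma / 2)
v_half = stationary_variance(gamma / 2)   # smaller step, smaller bias
# Richardson-Romberg combination cancels the first-order bias term.
v_rr = 2.0 * v_half - v_full

print(abs(v_full - 1.0), abs(v_half - 1.0), abs(v_rr - 1.0))
```

The true variance is 1. Both single-step-size estimates carry a bias proportional to the step size, while the combination `2 * v_half - v_full` leaves only a second-order error. The paper additionally deals with stochastic gradients and with the variance of the combined estimator, which this deterministic sketch does not cover.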

Phonetic Inventory for an Arabic Speech Corpus

Title Phonetic Inventory for an Arabic Speech Corpus
Authors Nawar Halabi, Mike Wald
Abstract Corpus design for speech synthesis is a well-researched topic in languages such as English compared to Modern Standard Arabic, and there is a tendency to focus on methods to automatically generate the orthographic transcript to be recorded (usually greedy methods). In this work, a study of Modern Standard Arabic (MSA) phonetics and phonology is conducted in order to create criteria for a greedy method to create a speech corpus transcript for recording. The size of the dataset is reduced a number of times using these optimisation methods with different parameters to yield a much smaller dataset with the same phonetic coverage as before the reduction, and this output transcript is chosen for recording. This is part of a larger work to create a completely annotated and segmented speech corpus for MSA.
Tasks Speech Synthesis
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1116/
PDF https://www.aclweb.org/anthology/L16-1116
PWC https://paperswithcode.com/paper/phonetic-inventory-for-an-arabic-speech
Repo
Framework
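
The idea of greedily shrinking a transcript while keeping its phonetic coverage can be sketched as a generic greedy set cover. This is an illustration under assumptions: the function name, the sentence ids, and the unit labels are hypothetical, and the paper's actual selection criteria for MSA are not reproduced here.

```python
def greedy_cover(sentences):
    """Pick sentences until every phonetic unit in the corpus is covered.

    sentences: dict mapping a sentence id to the set of phonetic units it
    contains. At each step the sentence covering the most still-uncovered
    units is chosen; ties go to the smallest id.
    """
    remaining = set()
    for units in sentences.values():
        remaining |= units
    chosen = []
    while remaining:
        # max() returns the first maximal item, so sorting fixes tie-breaking.
        best = max(sorted(sentences), key=lambda s: len(sentences[s] & remaining))
        chosen.append(best)
        remaining -= sentences[best]
    return chosen


# Toy corpus with made-up unit labels.
corpus = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"d"},
    "s4": {"a"},
}
selected = greedy_cover(corpus)
```

On this toy corpus two of the four sentences suffice to cover all four units. Greedy selection does not guarantee the smallest possible subset, but it is simple and is the family of methods the abstract refers to.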

Detecting Common Discussion Topics Across Culture From News Reader Comments

Title Detecting Common Discussion Topics Across Culture From News Reader Comments
Authors Bei Shi, Wai Lam, Lidong Bing, Yinqing Xu
Abstract
Tasks
Published 2016-08-01
URL https://www.aclweb.org/anthology/P16-1064/
PDF https://www.aclweb.org/anthology/P16-1064
PWC https://paperswithcode.com/paper/detecting-common-discussion-topics-across
Repo
Framework

Does 'well-being' translate on Twitter?

Title Does 'well-being' translate on Twitter?
Authors Laura Smith, Salvatore Giorgi, Rishi Solanki, Johannes Eichstaedt, H. Andrew Schwartz, Muhammad Abdul-Mageed, Anneke Buffone, Lyle Ungar
Abstract
Tasks Machine Translation, Sentiment Analysis
Published 2016-11-01
URL https://www.aclweb.org/anthology/D16-1217/
PDF https://www.aclweb.org/anthology/D16-1217
PWC https://paperswithcode.com/paper/does-awell-beinga-translate-on-twitter
Repo
Framework

Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform

Title Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform
Authors Sharmin Muzaffar, Pitambar Behera, Girish Jha
Abstract In South-Asian languages such as Hindi and Urdu, action verbs having compound constructions and serial verb constructions pose serious problems for natural language processing and other linguistic tasks. Urdu is an Indo-Aryan language spoken by 51,500,000 speakers in India. Action verbs that occur spontaneously in day-to-day communication are highly ambiguous in nature semantically and as a consequence cause disambiguation issues that are relevant and applicable to Language Technologies (LT) like Machine Translation (MT) and Natural Language Processing (NLP). IMAGACT4ALL is an ontology-driven web-based platform developed by the University of Florence for storing action verbs and their inter-relations. This group is currently collaborating with Jawaharlal Nehru University (JNU) in India to connect Indian languages on this platform. Action verbs are frequently used in both written and spoken discourses and refer to various meanings because of their polysemic nature. The IMAGACT4ALL platform stores each 3D animation image, each one of them referring to a variety of possible ontological types, which in turn makes the annotation task for the annotator quite challenging with regard to selecting verb argument structure having a range of probability distribution. The authors, in this paper, discuss the issues and challenges such as complex predicates (compound and conjunct verbs), ambiguously animated video illustrations, semantic discrepancies, and the factors of verb-selection preferences that have produced significant problems in annotating Urdu verbs on the IMAGACT ontology.
Tasks Machine Translation
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1230/
PDF https://www.aclweb.org/anthology/L16-1230
PWC https://paperswithcode.com/paper/issues-and-challenges-in-annotating-urdu
Repo
Framework