Paper Group NANR 22
Discriminating Similar Languages with Linear SVMs and Neural Networks
Title | Discriminating Similar Languages with Linear SVMs and Neural Networks |
Authors | Çağrı Çöltekin, Taraka Rama |
Abstract | This paper describes the systems we experimented with for participating in the discriminating between similar languages (DSL) shared task 2016. We submitted results of a single system based on support vector machines (SVM) with a linear kernel and character n-gram features, which ranked first in the closed training track for test set A. Besides the linear SVM, we also report additional experiments with a number of deep learning architectures. Despite our intuition that non-linear deep learning methods should be advantageous, linear models seem to fare better in this task, at least with the amount of data and the amount of effort we spent on tuning these models. |
Tasks | Language Identification |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-4802/ |
https://www.aclweb.org/anthology/W16-4802 | |
PWC | https://paperswithcode.com/paper/discriminating-similar-languages-with-linear |
Repo | |
Framework | |
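
The abstract above mentions character n-gram features as the input to a linear classifier. As a rough illustration of what such features look like (not the authors' actual feature pipeline), the following minimal sketch counts overlapping character n-grams in a string. The function name `char_ngrams`, the padding with spaces, and the default n-gram range are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count overlapping character n-grams of lengths n_min..n_max.

    The text is padded with one space on each side so that word-initial
    and word-final characters produce distinct n-grams (an assumption of
    this sketch, not something stated in the abstract).
    """
    counts = Counter()
    padded = f" {text} "
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the padded string.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    # Each resulting count could serve as one dimension of a sparse
    # feature vector passed to a linear classifier.
    print(char_ngrams("aba", 2, 2))
```

In practice the counts would typically be weighted (for example with tf-idf) before training, but the abstract does not say which weighting was used.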
Age and Gender Prediction on Health Forum Data
Title | Age and Gender Prediction on Health Forum Data |
Authors | Prasha Shrestha, Nicolas Rey-Villamizar, Farig Sadeque, Ted Pedersen, Steven Bethard, Thamar Solorio |
Abstract | Health support forums have become a rich source of data that can be used to improve health care outcomes. A user profile, including information such as age and gender, can support targeted analysis of forum data. But users might not always disclose their age and gender. It is desirable then to be able to automatically extract this information from users' content. However, to the best of our knowledge there is no such resource for author profiling of health forum data. Here we present a large corpus, with close to 85,000 users, for profiling and also outline our approach and benchmark results to automatically detect a user's age and gender from their forum posts. We use a mix of features from a user's text as well as forum specific features to obtain accuracy well above the baseline, thus showing that both our dataset and our method are useful and valid. |
Tasks | Gender Prediction |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1541/ |
https://www.aclweb.org/anthology/L16-1541 | |
PWC | https://paperswithcode.com/paper/age-and-gender-prediction-on-health-forum |
Repo | |
Framework | |
An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes
Title | An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes |
Authors | Kai Frederic Engelmann, Patrick Holthaus, Britta Wrede, Sebastian Wrede |
Abstract | The term smart home refers to a living environment that by its connected sensors and actuators is capable of providing intelligent and contextualised support to its user. This may result in automated behaviors that blend into the user's daily life. However, currently most smart homes do not provide such intelligent support. A first step towards such intelligent capabilities lies in learning automation rules by observing the user's behavior. We present a new type of corpus for learning such rules from user behavior as observed from the events in a smart home's sensor and actuator network. The data contains information about intended tasks by the users and synchronized events from this network. It is derived from interactions of 59 users with the smart home in order to solve five tasks. The corpus contains recordings of more than 40 different types of data streams and has been segmented and pre-processed to increase signal quality. Overall, the data shows a high noise level on specific data types that can be filtered out by a simple smoothing approach. The resulting data provides insights into event patterns resulting from task specific user behavior and thus constitutes a basis for machine learning approaches to learn automation rules. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1228/ |
https://www.aclweb.org/anthology/L16-1228 | |
PWC | https://paperswithcode.com/paper/an-interaction-centric-dataset-for-learning |
Repo | |
Framework | |
Towards Automatic Transcription of ILSE ― an Interdisciplinary Longitudinal Study of Adult Development and Aging
Title | Towards Automatic Transcription of ILSE ― an Interdisciplinary Longitudinal Study of Adult Development and Aging |
Authors | Jochen Weiner, Claudia Frankenberg, Dominic Telaar, Britta Wendelstein, Johannes Schröder, Tanja Schultz |
Abstract | The Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE) was created to facilitate the study of challenges posed by rapidly aging societies in developed countries such as Germany. ILSE contains over 8,000 hours of biographic interviews recorded from more than 1,000 participants over the course of 20 years. Investigations on various aspects of aging, such as cognitive decline, often rely on the analysis of linguistic features which can be derived from spoken content like these interviews. However, transcribing speech is a time and cost consuming manual process and so far only 380 hours of ILSE interviews have been transcribed. Thus, it is the aim of our work to establish technical systems to fully automatically transcribe the ILSE interview data. The joint occurrence of poor recording quality, long audio segments, erroneous transcriptions, varying speaking styles & crosstalk, and emotional & dialectal speech in these interviews presents challenges for automatic speech recognition (ASR). We describe our ongoing work towards the fully automatic transcription of all ILSE interviews and the steps we implemented in preparing the transcriptions to meet the interviews' challenges. Using a recursive long audio alignment procedure, 96 hours of the transcribed data have been made accessible for ASR training. |
Tasks | Speech Recognition |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1114/ |
https://www.aclweb.org/anthology/L16-1114 | |
PWC | https://paperswithcode.com/paper/towards-automatic-transcription-of-ilse-a-an |
Repo | |
Framework | |
Evaluating the Impact of Light Post-Editing on Usability
Title | Evaluating the Impact of Light Post-Editing on Usability |
Authors | Sheila Castilho, Sharon O'Brien |
Abstract | This paper discusses a methodology to measure the usability of machine translated content by end users, comparing lightly post-edited content with raw output and with the usability of source language content. The content selected consists of Online Help articles from a software company for a spreadsheet application, translated from English into German. Three groups of five users each used either the source text (the English version, EN), the raw MT version (DE_MT), or the light PE version (DE_PE), and were asked to carry out six tasks. Usability was measured using an eye tracker and cognitive, temporal and pragmatic measures of usability. Satisfaction was measured via a post-task questionnaire presented after the participants had completed the tasks. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1048/ |
https://www.aclweb.org/anthology/L16-1048 | |
PWC | https://paperswithcode.com/paper/evaluating-the-impact-of-light-post-editing |
Repo | |
Framework | |
Japanese Post-verbal Constructions Revisited
Title | Japanese Post-verbal Constructions Revisited |
Authors | Kohji Kamada |
Abstract | |
Tasks | |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/Y16-3002/ |
https://www.aclweb.org/anthology/Y16-3002 | |
PWC | https://paperswithcode.com/paper/japanese-post-verbal-constructions-revisited |
Repo | |
Framework | |
Towards a unified account of resultative constructions in Korean
Title | Towards a unified account of resultative constructions in Korean |
Authors | Juwon Lee |
Abstract | |
Tasks | |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/Y16-3023/ |
https://www.aclweb.org/anthology/Y16-3023 | |
PWC | https://paperswithcode.com/paper/towards-a-unified-account-of-resultative |
Repo | |
Framework | |
Empty element recovery by spinal parser operations
Title | Empty element recovery by spinal parser operations |
Authors | Katsuhiko Hayashi, Masaaki Nagata |
Abstract | |
Tasks | |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-2016/ |
https://www.aclweb.org/anthology/P16-2016 | |
PWC | https://paperswithcode.com/paper/empty-element-recovery-by-spinal-parser |
Repo | |
Framework | |
OSMAN ― A Novel Arabic Readability Metric
Title | OSMAN ― A Novel Arabic Readability Metric |
Authors | Mahmoud El-Haj, Paul Rayson |
Abstract | We present OSMAN (Open Source Metric for Measuring Arabic Narratives) - a novel open source Arabic readability metric and tool. It allows researchers to calculate readability for Arabic text with and without diacritics. OSMAN is a modified version of the conventional readability formulas such as Flesch and Fog. In our work we introduce a novel approach towards counting short, long and stress syllables in Arabic which is essential for judging readability of Arabic narratives. We also introduce an additional factor called "Faseeh" which considers aspects of script usually dropped in informal Arabic writing. To evaluate our methods we used Spearman's correlation metric to compare text readability for 73,000 parallel sentences from English and Arabic UN documents. The Arabic sentences were written with the absence of diacritics and in order to count the number of syllables we added the diacritics in using an open source tool called Mishkal. The results show that the OSMAN readability formula correlates well with the English ones, making it a useful tool for researchers and educators working with Arabic text. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1038/ |
https://www.aclweb.org/anthology/L16-1038 | |
PWC | https://paperswithcode.com/paper/osman-a-a-novel-arabic-readability-metric |
Repo | |
Framework | |
Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation
Title | Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation |
Authors | Shin Kanouchi, Katsuhito Sudoh, Mamoru Komachi |
Abstract | This paper presents an improved lexicalized reordering model for phrase-based statistical machine translation using a deep neural network. Lexicalized reordering suffers from reordering ambiguity, data sparseness and noise in the phrase table. The previous neural reordering model successfully addresses the first and second problems but fails to address the third. Therefore, we propose new features using phrase translation and word alignment to construct phrase vectors to handle inherently noisy phrase translation pairs. The experimental results show that our proposed method improves the accuracy of phrase reordering. We confirm that the proposed method works well with phrase pairs including NULL alignments. |
Tasks | Machine Translation, Word Alignment |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-4607/ |
https://www.aclweb.org/anthology/W16-4607 | |
PWC | https://paperswithcode.com/paper/neural-reordering-model-considering-phrase |
Repo | |
Framework | |
Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo
Title | Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo |
Authors | Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, Gaël Richard |
Abstract | Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) algorithms have become increasingly popular for Bayesian inference in large-scale applications. Even though these methods have proved useful in several scenarios, their performance is often limited by their bias. In this study, we propose a novel sampling algorithm that aims to reduce the bias of SG-MCMC while keeping the variance at a reasonable level. Our approach is based on a numerical sequence acceleration method, namely the Richardson-Romberg extrapolation, which simply boils down to running almost the same SG-MCMC algorithm twice in parallel with different step sizes. We illustrate our framework on the popular Stochastic Gradient Langevin Dynamics (SGLD) algorithm and propose a novel SG-MCMC algorithm referred to as Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD). We provide formal theoretical analysis and show that SGRRLD is asymptotically consistent, satisfies a central limit theorem, and its non-asymptotic bias and the mean squared-error can be bounded. Our results show that SGRRLD attains higher rates of convergence than SGLD in both finite-time and asymptotically, and it achieves the theoretical accuracy of the methods that are based on higher-order integrators. We support our findings using both synthetic and real data experiments. |
Tasks | Bayesian Inference |
Published | 2016-12-01 |
URL | http://papers.nips.cc/paper/6514-stochastic-gradient-richardson-romberg-markov-chain-monte-carlo |
http://papers.nips.cc/paper/6514-stochastic-gradient-richardson-romberg-markov-chain-monte-carlo.pdf | |
PWC | https://paperswithcode.com/paper/stochastic-gradient-richardson-romberg-markov |
Repo | |
Framework | |
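
The abstract above describes Richardson-Romberg extrapolation as running nearly the same sampler twice with different step sizes and combining the results. The following is a minimal sketch of that idea under several loudly stated assumptions: the target is a one-dimensional standard normal, the exact gradient is used instead of a stochastic one, the second chain uses step size γ/2, the two chains share their Gaussian noise, and the combined estimate is `2 * fine - coarse`. The function name `rr_second_moment` is hypothetical, and this is an illustration of the general technique rather than the authors' SGRRLD implementation.

```python
import math
import random


def rr_second_moment(gamma, n_steps, seed=0):
    """Estimate E[x^2] under N(0, 1) with two coupled Langevin chains.

    Returns (coarse, fine, extrapolated), where `coarse` uses step size
    gamma, `fine` uses gamma / 2, and `extrapolated` is 2 * fine - coarse.
    """
    rng = random.Random(seed)

    def grad_u(x):
        # Potential U(x) = x^2 / 2, so the target density is N(0, 1).
        return x

    half = gamma / 2.0
    x_coarse, x_fine = 0.0, 0.0
    sum_coarse, sum_fine = 0.0, 0.0
    for _ in range(n_steps):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Fine chain: two Euler steps of size gamma / 2.
        for z in (z1, z2):
            x_fine += -half * grad_u(x_fine) + math.sqrt(2.0 * half) * z
            sum_fine += x_fine ** 2
        # Coarse chain: one step of size gamma, driven by the same draws
        # (their normalised sum is again standard normal).
        z = (z1 + z2) / math.sqrt(2.0)
        x_coarse += -gamma * grad_u(x_coarse) + math.sqrt(2.0 * gamma) * z
        sum_coarse += x_coarse ** 2
    coarse = sum_coarse / n_steps
    fine = sum_fine / (2 * n_steps)
    return coarse, fine, 2.0 * fine - coarse


if __name__ == "__main__":
    coarse, fine, rr = rr_second_moment(gamma=0.5, n_steps=100_000)
    print(f"step gamma:   {coarse:.3f}")
    print(f"step gamma/2: {fine:.3f}")
    print(f"extrapolated: {rr:.3f}  (true value is 1)")
```

For this particular target the Euler chain with step size γ has stationary variance 1/(1 − γ/2), so its bias is of order γ. Combining the two chains cancels that first-order term and leaves a bias of order γ², which is the effect the extrapolation is meant to achieve.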
Phonetic Inventory for an Arabic Speech Corpus
Title | Phonetic Inventory for an Arabic Speech Corpus |
Authors | Nawar Halabi, Mike Wald |
Abstract | Corpus design for speech synthesis is a well-researched topic in languages such as English compared to Modern Standard Arabic, and there is a tendency to focus on methods to automatically generate the orthographic transcript to be recorded (usually greedy methods). In this work, a study of Modern Standard Arabic (MSA) phonetics and phonology is conducted in order to create criteria for a greedy method to create a speech corpus transcript for recording. The size of the dataset is reduced a number of times using these optimisation methods with different parameters to yield a much smaller dataset with the same phonetic coverage as before the reduction, and this output transcript is chosen for recording. This is part of a larger work to create a completely annotated and segmented speech corpus for MSA. |
Tasks | Speech Synthesis |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1116/ |
https://www.aclweb.org/anthology/L16-1116 | |
PWC | https://paperswithcode.com/paper/phonetic-inventory-for-an-arabic-speech |
Repo | |
Framework | |
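
The abstract above refers to a greedy method that shrinks a transcript while keeping its phonetic coverage. A generic greedy set-cover sketch of that idea is shown here: it repeatedly picks the utterance that covers the most still-uncovered units. The function name `greedy_cover`, the input layout, and the toy unit labels are assumptions for illustration. The selection criteria the authors derived from MSA phonology are not given in the abstract and are not reproduced.

```python
def greedy_cover(utterances):
    """Pick utterances until every phonetic unit is covered.

    `utterances` maps an utterance id to the set of units it contains.
    Returns the chosen ids in selection order.
    """
    remaining = dict(utterances)
    uncovered = set()
    for units in remaining.values():
        uncovered |= units

    chosen = []
    while uncovered:
        # Choose the utterance contributing the most uncovered units;
        # ties go to the earliest utterance in insertion order.
        best = max(remaining, key=lambda u: len(remaining[u] & uncovered))
        chosen.append(best)
        uncovered -= remaining.pop(best)
    return chosen


if __name__ == "__main__":
    toy = {
        "s1": {"a", "b"},
        "s2": {"b", "c"},
        "s3": {"a", "b", "c"},
    }
    # s3 alone covers every unit, so the reduced set has one utterance.
    print(greedy_cover(toy))
```

The reduced set covers exactly the same units as the full input, which mirrors the property the abstract states for the reduced transcript.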
Detecting Common Discussion Topics Across Culture From News Reader Comments
Title | Detecting Common Discussion Topics Across Culture From News Reader Comments |
Authors | Bei Shi, Wai Lam, Lidong Bing, Yinqing Xu |
Abstract | |
Tasks | |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-1064/ |
https://www.aclweb.org/anthology/P16-1064 | |
PWC | https://paperswithcode.com/paper/detecting-common-discussion-topics-across |
Repo | |
Framework | |
Does ‘well-being’ translate on Twitter?
Title | Does ‘well-being’ translate on Twitter? |
Authors | Laura Smith, Salvatore Giorgi, Rishi Solanki, Johannes Eichstaedt, H. Andrew Schwartz, Muhammad Abdul-Mageed, Anneke Buffone, Lyle Ungar |
Abstract | |
Tasks | Machine Translation, Sentiment Analysis |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1217/ |
https://www.aclweb.org/anthology/D16-1217 | |
PWC | https://paperswithcode.com/paper/does-awell-beinga-translate-on-twitter |
Repo | |
Framework | |
Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform
Title | Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform |
Authors | Sharmin Muzaffar, Pitambar Behera, Girish Jha |
Abstract | In South-Asian languages such as Hindi and Urdu, action verbs having compound constructions and serial verb constructions pose serious problems for natural language processing and other linguistic tasks. Urdu is an Indo-Aryan language spoken by 51,500,000 speakers in India. Action verbs that occur spontaneously in day-to-day communication are highly ambiguous in nature semantically and as a consequence cause disambiguation issues that are relevant and applicable to Language Technologies (LT) like Machine Translation (MT) and Natural Language Processing (NLP). IMAGACT4ALL is an ontology-driven web-based platform developed by the University of Florence for storing action verbs and their inter-relations. This group is currently collaborating with Jawaharlal Nehru University (JNU) in India to connect Indian languages on this platform. Action verbs are frequently used in both written and spoken discourses and refer to various meanings because of their polysemic nature. The IMAGACT4ALL platform stores each 3D animation image, each one of them referring to a variety of possible ontological types, which in turn makes the annotation task for the annotator quite challenging with regard to selecting verb argument structure having a range of probability distribution. The authors, in this paper, discuss the issues and challenges such as complex predicates (compound and conjunct verbs), ambiguously animated video illustrations, semantic discrepancies, and the factors of verb-selection preferences that have produced significant problems in annotating Urdu verbs on the IMAGACT ontology. |
Tasks | Machine Translation |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1230/ |
https://www.aclweb.org/anthology/L16-1230 | |
PWC | https://paperswithcode.com/paper/issues-and-challenges-in-annotating-urdu |
Repo | |
Framework | |