Paper Group NANR 64
Improving Bilingual Terminology Extraction from Comparable Corpora via Multiple Word-Space Models
Title | Improving Bilingual Terminology Extraction from Comparable Corpora via Multiple Word-Space Models |
Authors | Amir Hazem, Emmanuel Morin |
Abstract | There is a rich flora of word space models that have proven their efficiency in many different applications including information retrieval (Dumais, 1988), word sense disambiguation (Schütze, 1992), various semantic knowledge tests (Lund et al., 1995; Karlgren, 2001), and text categorization (Sahlgren, 2005). Based on the assumption that each model captures some aspects of word meanings and provides its own empirical evidence, we present in this paper a systematic exploration of the principal corpus-based word space models for bilingual terminology extraction from comparable corpora. We find that, once we have identified the best procedures, a very simple combination approach leads to significant improvements compared to individual models. |
Tasks | Information Retrieval, Text Categorization, Word Sense Disambiguation |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1661/ |
PWC | https://paperswithcode.com/paper/improving-bilingual-terminology-extraction |
Repo | |
Framework | |
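
The combination result reported in this abstract lends itself to a small illustration. Below is a minimal sketch, assuming each word-space model outputs a ranked list of candidate translations for a source term and that a simple mean-reciprocal-rank merge stands in for the authors' combination procedure; the model outputs are invented.

```python
# Illustrative sketch: combining candidate-translation rankings from
# several word-space models. The merging rule (summed reciprocal rank)
# is an assumption, not necessarily the paper's exact combination.

def combine_rankings(rankings):
    """rankings: list of dicts mapping candidate translation -> rank (1 = best)."""
    scores = {}
    for ranking in rankings:
        for candidate, rank in ranking.items():
            scores[candidate] = scores.get(candidate, 0.0) + 1.0 / rank
    # Higher combined score = better candidate.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of three word-space models for one source term:
model_a = {"network": 1, "grid": 2, "web": 3}
model_b = {"grid": 1, "network": 2, "lattice": 3}
model_c = {"network": 1, "lattice": 2, "grid": 3}

print(combine_rankings([model_a, model_b, model_c]))
# ['network', 'grid', 'lattice', 'web']
```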
Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts
Title | Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts |
Authors | Sowmya Vajjala, Detmar Meurers, Alexander Eitel, Katharina Scheiter |
Abstract | Computational approaches to readability assessment are generally built and evaluated using gold standard corpora labeled by publishers or teachers rather than being grounded in observations about human performance. Considering that both the reading process and the outcome can be observed, there is an empirical wealth that could be used to ground computational analysis of text readability. This will also support explicit readability models connecting text complexity and the reader's language proficiency to the reading process and outcomes. This paper takes a step in this direction by reporting on an experiment studying how the relation between text complexity and the reader's language proficiency affects the reading process and the performance outcomes of readers after reading. We modeled the reading process using three eye tracking variables: fixation count, average fixation count, and second pass reading duration. Our models for these variables explained 78.9%, 74% and 67.4% of the variance, respectively. Performance outcome was modeled through recall and comprehension questions, and these models explained 58.9% and 27.6% of the variance, respectively. While the online models give us a better understanding of the cognitive correlates of reading with text complexity and language proficiency, modeling of the offline measures can be particularly relevant for incorporating user aspects into readability models. |
Tasks | Eye Tracking |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-4105/ |
PWC | https://paperswithcode.com/paper/towards-grounding-computational-linguistic |
Repo | |
Framework | |
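
As a rough illustration of the modeling setup the abstract reports, here is a minimal sketch: a linear regression predicts an online reading measure (fixation count) from text complexity and reader proficiency, and R² gives the explained variance. The synthetic data and the two predictors are assumptions; the paper's models are richer.

```python
# Illustrative sketch: regressing an online reading measure (fixation
# count) on text complexity and reader proficiency, then reporting R^2
# (explained variance). The data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
complexity = rng.uniform(0, 1, n)    # hypothetical text complexity score
proficiency = rng.uniform(0, 1, n)   # hypothetical reader proficiency score
# Fixation count rises with complexity, falls with proficiency (plus noise).
fixations = 50 + 30 * complexity - 20 * proficiency + rng.normal(0, 5, n)

X = np.column_stack([complexity, proficiency, complexity * proficiency])
model = LinearRegression().fit(X, fixations)
print(f"explained variance (R^2): {model.score(X, fixations):.3f}")
```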
Dynamic pause assessment of keystroke logged data for the detection of complexity in translation and monolingual text production
Title | Dynamic pause assessment of keystroke logged data for the detection of complexity in translation and monolingual text production |
Authors | Arndt Heilmann, Stella Neumann |
Abstract | Pause analysis of keystroke-logged translations is a hallmark of process-based translation studies. However, an exact definition of what constitutes a cognitively effortful pause during the translation process has not yet been established (Saldanha and O'Brien, 2013). This paper investigates the design of a keystroke- and subject-dependent identification system for cognitive effort to track complexity in translation with keystroke logging (cf. also Dragsted (2005); Couto-Vale (in preparation)). It is an elastic measure that takes into account the idiosyncratic pause durations of translators as well as further confounds such as bigram frequency, letter frequency and some of the motor tasks involved in writing. The method is compared to a common static threshold of 1000 ms in an analysis of cognitive effort during the translation of grammatical functions from English to German. Additionally, the results are triangulated with eye tracking data for further validation. The findings show that, at least for smaller sets of data, a dynamic pause assessment may lead to more accurate results than a generic static pause threshold of similar duration. |
Tasks | Eye Tracking |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-4111/ |
PWC | https://paperswithcode.com/paper/dynamic-pause-assessment-of-keystroke-logged |
Repo | |
Framework | |
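
The contrast between a static 1000 ms threshold and a subject-dependent ("elastic") threshold can be sketched as follows. The percentile rule standing in for the paper's dynamic assessment is an assumption, and the keystroke intervals are invented; the actual method also controls for bigram frequency, letter frequency and motor confounds.

```python
# Illustrative sketch: flagging cognitively effortful pauses with a
# per-subject dynamic threshold instead of a fixed 1000 ms cut-off.
# The percentile-based rule is an assumption for illustration only.
import numpy as np

def dynamic_pauses(intervals_ms, percentile=95):
    """Return the subject-specific threshold and the indices of
    inter-keystroke intervals above it."""
    threshold = np.percentile(intervals_ms, percentile)
    return threshold, [i for i, d in enumerate(intervals_ms) if d > threshold]

def static_pauses(intervals_ms, threshold=1000):
    return [i for i, d in enumerate(intervals_ms) if d > threshold]

# Hypothetical keystroke log (inter-keystroke intervals in ms):
intervals = [120, 80, 95, 2400, 110, 130, 900, 105, 1800, 90]
thr, dyn = dynamic_pauses(intervals)
print(f"dynamic threshold: {thr:.0f} ms, pauses at {dyn}")
print(f"static 1000 ms pauses at {static_pauses(intervals)}")
```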
Larger-Context Language Modelling with Recurrent Neural Network
Title | Larger-Context Language Modelling with Recurrent Neural Network |
Authors | Tian Wang, Kyunghyun Cho |
Abstract | |
Tasks | Language Modelling |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-1125/ |
PWC | https://paperswithcode.com/paper/larger-context-language-modelling-with |
Repo | |
Framework | |
A Turkish-German Code-Switching Corpus
Title | A Turkish-German Code-Switching Corpus |
Authors | {"O}zlem {\c{C}}etino{\u{g}}lu |
Abstract | Bilingual communities often alternate between languages in both spoken and written communication. One such community, Germany residents of Turkish origin, produces Turkish-German code-switching by heavily mixing the two languages at the discourse, sentence, or word level. Code-switching in general, and Turkish-German code-switching in particular, has long been studied from a linguistic perspective. Yet resources to study it from a more computational perspective are limited, due to either small size or licence issues. In this work we contribute towards solving this problem with a corpus. We present a Turkish-German code-switching corpus consisting of 1029 tweets, with a majority of intra-sentential switches. We describe the different types of code-switching we have observed in our collection and our processing steps. The first step is data collection and filtering, followed by manual tokenisation and normalisation. Finally, we annotate the data with word-level language identification information. The resulting corpus is available for research purposes. |
Tasks | Language Identification |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1667/ |
PWC | https://paperswithcode.com/paper/a-turkish-german-code-switching-corpus |
Repo | |
Framework | |
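
The word-level language identification layer described in this abstract can be pictured with a minimal sketch: a naive lexicon lookup stands in for the corpus's manual annotation, and the toy lexicons and example tweet are constructed for illustration, not taken from the corpus.

```python
# Illustrative sketch: word-level language ID tags for a code-switched
# tweet, the annotation layer the corpus provides. The tiny lexicons
# and the example sentence are constructed for illustration.
TR = {"bugün", "çok", "güzel"}        # toy Turkish lexicon
DE = {"aber", "ich", "bin", "müde"}   # toy German lexicon

def tag_tokens(tokens):
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low in TR:
            tags.append((tok, "TR"))
        elif low in DE:
            tags.append((tok, "DE"))
        else:
            tags.append((tok, "OTHER"))  # e.g. punctuation, named entities
    return tags

tweet = "Bugün çok güzel aber ich bin müde".split()
print(tag_tokens(tweet))
# [('Bugün', 'TR'), ('çok', 'TR'), ('güzel', 'TR'), ('aber', 'DE'), ...]
```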
A Multi-media Approach to Cross-lingual Entity Knowledge Transfer
Title | A Multi-media Approach to Cross-lingual Entity Knowledge Transfer |
Authors | Di Lu, Xiaoman Pan, Nima Pourdamghani, Shih-Fu Chang, Heng Ji, Kevin Knight |
Abstract | |
Tasks | Cross-Lingual Entity Linking, Entity Linking, Face Recognition, Image Retrieval, Machine Translation, Transfer Learning |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-1006/ |
PWC | https://paperswithcode.com/paper/a-multi-media-approach-to-cross-lingual |
Repo | |
Framework | |
Modelling a Parallel Corpus of French and French Belgian Sign Language
Title | Modelling a Parallel Corpus of French and French Belgian Sign Language |
Authors | Laurence Meurant, Maxime Gobert, Anthony Cleve |
Abstract | The overarching objective underlying this research is to develop an online tool, based on a parallel corpus of French Belgian Sign Language (LSFB) and written Belgian French. This tool is aimed at assisting various tasks related to the comparison of LSFB and French, to the benefit of general users as well as teachers in bilingual schools, translators, interpreters, and linguists. These tasks include (1) the comprehension of LSFB or French texts, (2) the production of LSFB or French texts, (3) the translation between LSFB and French in both directions and (4) the contrastive analysis of these languages. The first step of the investigation aims at creating a unidirectional French-LSFB concordancer, able to align a one- or multiple-word expression from the translated French text with its corresponding expressions in the videotaped LSFB productions. We aim to test the efficiency of this concordancer for the extraction of a dictionary of meanings in context. In this paper, we present the modelling of the different data sources at our disposal and specifically the way they interact with one another. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1670/ |
PWC | https://paperswithcode.com/paper/modelling-a-parallel-corpus-of-french-and |
Repo | |
Framework | |
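
A minimal sketch of the unidirectional French-LSFB concordancer outlined above: a French expression is looked up in sentences aligned to time-coded LSFB video segments. The storage format and the example alignments are assumptions about how such data might be organised, not the project's actual model.

```python
# Illustrative sketch: a unidirectional French -> LSFB concordancer.
# Each French sentence is aligned to a time-coded segment of an LSFB
# video; querying an expression returns the matching segments.
alignments = [
    # (French sentence, video file, start_s, end_s) -- hypothetical data
    ("Le cours commence demain.", "lsfb_042.mp4", 12.4, 15.1),
    ("Demain, nous partons tôt.", "lsfb_042.mp4", 15.1, 18.0),
]

def concordance(expression):
    """Return the aligned LSFB video segments for every French sentence
    containing the queried expression."""
    expr = expression.lower()
    return [(sent, video, start, end)
            for sent, video, start, end in alignments
            if expr in sent.lower()]

for hit in concordance("demain"):
    print(hit)
```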
Multi-language Speech Collection for NIST LRE
Title | Multi-language Speech Collection for NIST LRE |
Authors | Karen Jones, Stephanie Strassel, Kevin Walker, David Graff, Jonathan Wright |
Abstract | The Multi-language Speech (MLS) Corpus supports NIST's Language Recognition Evaluation series by providing new conversational telephone speech and broadcast narrowband data in 20 languages/dialects. The corpus was built with the intention of testing system performance in the matter of distinguishing closely related or confusable linguistic varieties, and careful manual auditing of collected data was an important aspect of this work. This paper lists the specific data requirements for the collection and provides both a commentary on the rationale for those requirements as well as an outline of the various steps taken to ensure all goals were met as specified. LDC conducted a large-scale recruitment effort involving the implementation of candidate assessment and interview techniques suitable for hiring a large contingent of telecommuting workers, and this recruitment effort is discussed in detail. We also describe the telephone and broadcast collection infrastructure and protocols, and provide details of the steps taken to pre-process collected data prior to auditing. Finally, annotation training, procedures and outcomes are presented in detail. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1674/ |
PWC | https://paperswithcode.com/paper/multi-language-speech-collection-for-nist-lre |
Repo | |
Framework | |
使用字典學習法於強健性語音辨識(The Use of Dictionary Learning Approach for Robustness Speech Recognition) [In Chinese]
Title | 使用字典學習法於強健性語音辨識(The Use of Dictionary Learning Approach for Robustness Speech Recognition) [In Chinese] |
Authors | Bi-Cheng Yan, Chin-Hong Shih, Shih-Hung Liu, Berlin Chen |
Abstract | |
Tasks | Dictionary Learning, Speech Recognition |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/O16-1003/ |
PWC | https://paperswithcode.com/paper/a12c-aa-a-c314a14aeae3e34-ethe-use-of |
Repo | |
Framework | |
Recognizing Reference Spans and Classifying their Discourse Facets
Title | Recognizing Reference Spans and Classifying their Discourse Facets |
Authors | Kun Lu, Jin Mao, Gang Li, Jian Xu |
Abstract | |
Tasks | Information Retrieval, Learning-To-Rank, Text Classification |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/W16-1516/ |
PWC | https://paperswithcode.com/paper/recognizing-reference-spans-and-classifying |
Repo | |
Framework | |
A Position Encoding Convolutional Neural Network Based on Dependency Tree for Relation Classification
Title | A Position Encoding Convolutional Neural Network Based on Dependency Tree for Relation Classification |
Authors | Yunlun Yang, Yunhai Tong, Shulei Ma, Zhi-Hong Deng |
Abstract | |
Tasks | Feature Selection, Information Retrieval, Machine Translation, Relation Classification |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1007/ |
PWC | https://paperswithcode.com/paper/a-position-encoding-convolutional-neural |
Repo | |
Framework | |
Exploiting Arabic Diacritization for High Quality Automatic Annotation
Title | Exploiting Arabic Diacritization for High Quality Automatic Annotation |
Authors | Nizar Habash, Anas Shahrour, Muhamed Al-Khalil |
Abstract | We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined. |
Tasks | Tokenization |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1681/ |
PWC | https://paperswithcode.com/paper/exploiting-arabic-diacritization-for-high |
Repo | |
Framework | |
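
The core idea of the abstract, using available diacritics to narrow down the analyses of an undiacritized word, can be sketched minimally. The toy analysis table for the Arabic form كتب is invented; the paper enriches the input of a state-of-the-art morphological analyzer rather than filtering a lookup table.

```python
# Illustrative sketch: using diacritics to disambiguate among the
# analyses a morphological analyzer would return for the undiacritized
# surface form. The toy analyses for Arabic "ktb" are invented.
analyses = {
    "كتب": [  # undiacritized surface form
        {"diac": "كَتَبَ", "lemma": "كَتَبَ", "pos": "verb"},   # kataba 'he wrote'
        {"diac": "كُتُب", "lemma": "كِتَاب", "pos": "noun"},   # kutub 'books'
    ],
}

def analyze(word, diacritized=None):
    """Return analyses of `word`, filtered to those matching the
    diacritized form when one is available."""
    candidates = analyses.get(word, [])
    if diacritized is None:
        return candidates
    return [a for a in candidates if a["diac"] == diacritized]

print(analyze("كتب"))                        # ambiguous: two analyses
print(analyze("كتب", diacritized="كُتُب"))   # diacritics resolve it: noun
```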
Using a Small Lexicon with CRFs Confidence Measure to Improve POS Tagging Accuracy
Title | Using a Small Lexicon with CRFs Confidence Measure to Improve POS Tagging Accuracy |
Authors | Mohamed Outahajala, Paolo Rosso |
Abstract | Like most languages that have only recently begun to be investigated for Natural Language Processing (NLP) tasks, Amazigh lacks annotated corpora and still suffers from a scarcity of linguistic tools and resources. The main aim of this paper is to present a new part-of-speech (POS) tagger based on a new Amazigh tag set (AMTS) composed of 28 tags. In line with this goal, we have trained Conditional Random Fields (CRFs) to build a POS tagger for the Amazigh language. We have used the 10-fold technique to evaluate and validate our approach. The CRF 10-fold average accuracy is 87.95%, and the best single-fold result is 91.18%. In order to improve this result, we gathered a set of about 8k words with their POS tags. The collected lexicon was used with the CRF confidence measure in order to obtain a more accurate POS tagger. Hence, we obtained a better performance of 93.82%. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1683/ |
PWC | https://paperswithcode.com/paper/using-a-small-lexicon-with-crfs-confidence |
Repo | |
Framework | |
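
The lexicon-plus-confidence correction step the abstract describes admits a small sketch: when the tagger's per-token confidence falls below a threshold and the word appears in the POS lexicon, the lexicon tag wins. The threshold, the stand-in confidence scores, and the toy lexicon entries are assumptions.

```python
# Illustrative sketch: overriding low-confidence tagger predictions
# with a small POS lexicon, as the abstract describes for the Amazigh
# tagger. Threshold and confidence values are assumptions.
lexicon = {"tamazight": "NOUN", "isawal": "VERB"}  # toy stand-in for the 8k-word lexicon

def correct_with_lexicon(tagged, threshold=0.7):
    """tagged: list of (word, predicted_tag, confidence) triples,
    e.g. per-token marginal probabilities from a CRF."""
    corrected = []
    for word, tag, conf in tagged:
        if conf < threshold and word in lexicon:
            tag = lexicon[word]  # trust the lexicon over a shaky prediction
        corrected.append((word, tag))
    return corrected

# Hypothetical CRF output with per-token confidences:
output = [("isawal", "NOUN", 0.55), ("tamazight", "NOUN", 0.92)]
print(correct_with_lexicon(output))
# [('isawal', 'VERB'), ('tamazight', 'NOUN')]
```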
Discovering Fuzzy Synsets from the Redundancy in Different Lexical-Semantic Resources
Title | Discovering Fuzzy Synsets from the Redundancy in Different Lexical-Semantic Resources |
Authors | Hugo Gonçalo Oliveira, Fábio Santos |
Abstract | Although represented as such in wordnets, word senses are not discrete. To handle word senses as fuzzy objects, we exploit the graph structure of synonymy pairs acquired from different sources to discover synsets where words have different membership degrees that reflect confidence. Following this approach, a wide-coverage fuzzy thesaurus was discovered from a synonymy network compiled from seven Portuguese lexical-semantic resources. Based on a crowdsourcing evaluation, we can say that the quality of the obtained synsets is far from perfect but, as expected of a confidence measure, it increases significantly for higher cut-points on the membership and, at a certain point, reaches a 100% correct rate. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1687/ |
PWC | https://paperswithcode.com/paper/discovering-fuzzy-synsets-from-the-redundancy |
Repo | |
Framework | |
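
The redundancy-based membership degrees described above can be sketched minimally: a word's membership in a synset is taken here as the fraction of resources attesting its synonymy link with the seed word, and a cut-point on that degree trades coverage for confidence. The normalization rule and the Portuguese examples are assumptions for illustration.

```python
# Illustrative sketch: deriving fuzzy synset membership from redundancy
# across resources, then applying a cut-point on the membership degree.
# Each pair maps (word_a, word_b) -> number of resources (of 7) that
# list them as synonyms; the counts are invented.
pair_support = {
    ("carro", "automóvel"): 6,
    ("carro", "viatura"): 3,
    ("carro", "vagão"): 1,
}

def fuzzy_synset(seed, n_resources=7, cut=0.0):
    """Membership of each neighbour of `seed`, normalized by the number
    of resources; keep only members at or above the cut-point."""
    members = {seed: 1.0}
    for (a, b), support in pair_support.items():
        other = b if a == seed else a if b == seed else None
        if other is not None:
            members[other] = support / n_resources
    return {w: round(m, 2) for w, m in members.items() if m >= cut}

print(fuzzy_synset("carro"))            # full fuzzy synset
print(fuzzy_synset("carro", cut=0.5))   # higher cut-point, higher confidence
```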
Sarcasm Detection in Chinese Using a Crowdsourced Corpus
Title | Sarcasm Detection in Chinese Using a Crowdsourced Corpus |
Authors | Shih-Kai Lin, Shu-Kai Hsieh |
Abstract | |
Tasks | Sarcasm Detection, Sentiment Analysis |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/O16-1027/ |
PWC | https://paperswithcode.com/paper/sarcasm-detection-in-chinese-using-a |
Repo | |
Framework | |