Paper Group NANR 134
CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws. Evaluating Lexical Similarity to build Sentiment Similarity. SemEval 2016 Task 11: Complex Word Identification. The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese. Th …
CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws
Title | CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws |
Authors | Roland Schäfer |
Abstract | In this paper, I describe a method of creating massively huge web corpora from the CommonCrawl data sets and redistributing the resulting annotations in a stand-off format. Current EU (and especially German) copyright legislation categorically forbids the redistribution of downloaded material without express prior permission by the authors. Therefore, such stand-off annotations (or other derivatives) are the only format in which European researchers (like myself) are allowed to re-distribute the respective corpora. In order to make the full corpora available to the public despite such restrictions, the stand-off format presented here allows anybody to locally reconstruct the full corpora with the least possible computational effort. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1712/ |
https://www.aclweb.org/anthology/L16-1712 | |
PWC | https://paperswithcode.com/paper/commoncow-massively-huge-web-corpora-from |
Repo | |
Framework | |
Evaluating Lexical Similarity to build Sentiment Similarity
Title | Evaluating Lexical Similarity to build Sentiment Similarity |
Authors | Grégoire Jadi, Vincent Claveau, Béatrice Daille, Laura Monceaux |
Abstract | In this article, we propose to evaluate the lexical similarity information provided by word representations against several opinion resources using traditional Information Retrieval tools. Word representations have been used to build and to extend opinion resources such as lexicons and ontologies, and their performance has been evaluated on sentiment analysis tasks. We question this method by measuring the correlation between the sentiment proximity provided by opinion resources and the semantic similarity provided by word representations using different correlation coefficients. We also compare the neighbors found in word representations with lists of similar opinion words. Our results show that the proximity of words in state-of-the-art word representations is not very effective for building sentiment similarity. |
Tasks | Information Retrieval, Semantic Similarity, Semantic Textual Similarity, Sentiment Analysis |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1190/ |
https://www.aclweb.org/anthology/L16-1190 | |
PWC | https://paperswithcode.com/paper/evaluating-lexical-similarity-to-build |
Repo | |
Framework | |
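The abstract above describes correlating embedding-based similarity with sentiment proximity. The following is a minimal sketch of that kind of comparison, not the authors' actual pipeline: it assumes cosine similarity for the word representations and Spearman rank correlation as one possible coefficient. All function names are hypothetical, and no real lexicon or embedding data is included.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))
```

In use, `x` would hold the cosine similarities for a set of word pairs and `y` the sentiment proximities for the same pairs taken from an opinion resource. A coefficient near zero would be consistent with the paper's conclusion that embedding proximity is a weak proxy for sentiment similarity.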
SemEval 2016 Task 11: Complex Word Identification
Title | SemEval 2016 Task 11: Complex Word Identification |
Authors | Gustavo Paetzold, Lucia Specia |
Abstract | |
Tasks | Complex Word Identification, Lexical Simplification, Text Simplification |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1085/ |
https://www.aclweb.org/anthology/S16-1085 | |
PWC | https://paperswithcode.com/paper/semeval-2016-task-11-complex-word |
Repo | |
Framework | |
The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese
Title | The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese |
Authors | Miki Nishioka, Shiro Akasegawa |
Abstract | In this paper, we discuss our creation of a web corpus of spoken Hindi (COSH), one of the Indo-Aryan languages spoken mainly in the Indian subcontinent. We also point out notable problems we've encountered in the web corpus and the special concordancer. After observing the kind of technical problems we encountered, especially regarding annotation tagged by Shiva Reddy's tagger, we argue how they can be solved when using COSH for linguistic studies. Finally, we mention the kinds of linguistic research that we non-native speakers of Hindi can do using the corpus, especially in pragmatics and semantics, and from a comparative viewpoint to Japanese. |
Tasks | |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-3712/ |
https://www.aclweb.org/anthology/W16-3712 | |
PWC | https://paperswithcode.com/paper/the-development-of-a-web-corpus-of-hindi |
Repo | |
Framework | |
The Universal Dependencies Treebank of Spoken Slovenian
Title | The Universal Dependencies Treebank of Spoken Slovenian |
Authors | Kaja Dobrovoljc, Joakim Nivre |
Abstract | This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirms significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptic sentences, fewer and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1248/ |
https://www.aclweb.org/anthology/L16-1248 | |
PWC | https://paperswithcode.com/paper/the-universal-dependencies-treebank-of-spoken |
Repo | |
Framework | |
Improving Statistical Machine Translation with Selectional Preferences
Title | Improving Statistical Machine Translation with Selectional Preferences |
Authors | Haiqing Tang, Deyi Xiong, Min Zhang, Zhengxian Gong |
Abstract | Long-distance semantic dependencies are crucial for lexical choice in statistical machine translation. In this paper, we study semantic dependencies between verbs and their arguments by modeling selectional preferences in the context of machine translation. We incorporate preferences that verbs impose on subjects and objects into translation. In addition, bilingual selectional preferences between source-side verbs and target-side arguments are also investigated. Our experiments on Chinese-to-English translation tasks with large-scale training data demonstrate that statistical machine translation using verbal selectional preferences can achieve statistically significant improvements over a state-of-the-art baseline. |
Tasks | Machine Translation, Semantic Role Labeling, Word Sense Disambiguation |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1203/ |
https://www.aclweb.org/anthology/C16-1203 | |
PWC | https://paperswithcode.com/paper/improving-statistical-machine-translation-6 |
Repo | |
Framework | |
An Entity-Based approach to Answering Recurrent and Non-Recurrent Questions with Past Answers
Title | An Entity-Based approach to Answering Recurrent and Non-Recurrent Questions with Past Answers |
Authors | Anietie Andy, Mugizi Rwebangira, Satoshi Sekine |
Abstract | Community question answering (CQA) systems such as Yahoo! Answers allow registered users to ask and answer questions in various question categories. However, a significant percentage of asked questions in Yahoo! Answers are unanswered. In this paper, we propose to reduce this percentage by reusing answers to past resolved questions from the site. Specifically, we propose to satisfy unanswered questions in entity-rich categories by searching for and reusing the best answers to past resolved questions with shared needs. For unanswered questions that do not have a past resolved question with a shared need, we propose to use the best answer to a past resolved question with similar needs. Our experiments on a Yahoo! Answers dataset show that our approach retrieves most of the past resolved questions that have shared and similar needs to unanswered questions. |
Tasks | Community Question Answering, Entity Linking, Question Answering |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-4405/ |
https://www.aclweb.org/anthology/W16-4405 | |
PWC | https://paperswithcode.com/paper/an-entity-based-approach-to-answering |
Repo | |
Framework | |
The Manner/Result Complementarity in Chinese Motion Verbs Revisited
Title | The Manner/Result Complementarity in Chinese Motion Verbs Revisited |
Authors | Lei Qiu |
Abstract | |
Tasks | |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/Y16-2011/ |
https://www.aclweb.org/anthology/Y16-2011 | |
PWC | https://paperswithcode.com/paper/the-mannerresult-complementarity-in-chinese |
Repo | |
Framework | |
Learning Indonesian-Chinese Lexicon with Bilingual Word Embedding Models and Monolingual Signals
Title | Learning Indonesian-Chinese Lexicon with Bilingual Word Embedding Models and Monolingual Signals |
Authors | Xinying Qiu, Gangqin Zhu |
Abstract | We present research on learning an Indonesian-Chinese bilingual lexicon using monolingual word embeddings and bilingual seed lexicons to build a shared bilingual word embedding space. We make the first attempt to examine the impact of different monolingual signals for the choice of seed lexicons on the model performance. We found that although monolingual signals alone do not seem to outperform signals covering all words, the significant improvement for learning word translation of the same signal types may suggest that linguistic features possess value for further study in distinguishing the semantic margins of the shared word embedding space. |
Tasks | Document Classification |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-3720/ |
https://www.aclweb.org/anthology/W16-3720 | |
PWC | https://paperswithcode.com/paper/learning-indonesian-chinese-lexicon-with |
Repo | |
Framework | |
UTCNN: a Deep Learning Model of Stance Classification on Social Media Text
Title | UTCNN: a Deep Learning Model of Stance Classification on Social Media Text |
Authors | Wei-Fan Chen, Lun-Wei Ku |
Abstract | Most neural network models for document classification on social media focus on text information to the neglect of other information on these platforms. In this paper, we classify post stance on social media channels and develop UTCNN, a neural network model that incorporates user tastes, topic tastes, and user comments on posts. UTCNN not only works on social media texts, but also analyzes texts in forums and message boards. Experiments performed on Chinese Facebook data and English online debate forum data show that UTCNN achieves a 0.755 macro average f-score for supportive, neutral, and unsupportive stance classes on Facebook data, which is significantly better than models in which either user, topic, or comment information is withheld. This model design greatly mitigates the lack of data for the minor class. In addition, UTCNN yields a 0.842 accuracy on English online debate forum data, which also significantly outperforms results from previous work, showing that UTCNN performs well regardless of language or platform. |
Tasks | Document Classification, Text Classification |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1154/ |
https://www.aclweb.org/anthology/C16-1154 | |
PWC | https://paperswithcode.com/paper/utcnn-a-deep-learning-model-of-stance-1 |
Repo | |
Framework | |
ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data
Title | ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data |
Authors | Xiuwen Yi, Yu Zheng, Junbo Zhang, Tianrui Li |
Abstract | Many sensors have been deployed in the physical world, generating massive geo-tagged time series data. In reality, readings of sensors are usually lost at various unexpected moments because of sensor or communication errors. Those missing readings do not only affect real-time monitoring but also compromise the performance of further data analysis. In this paper, we propose a spatio-temporal multi-view-based learning (ST-MVL) method to collectively fill missing readings in a collection of geosensory time series data, considering 1) the temporal correlation between readings at different timestamps in the same series and 2) the spatial correlation between different time series. Our method combines empirical statistic models, consisting of Inverse Distance Weighting and Simple Exponential Smoothing, with data-driven algorithms, comprised of User-based and Item-based Collaborative Filtering. The former models handle general missing cases based on empirical assumptions derived from history data over a long period, standing for two global views from spatial and temporal perspectives respectively. The latter algorithms deal with special cases where empirical assumptions may not hold, based on recent contexts of data, denoting two local views from spatial and temporal perspectives respectively. The predictions of the four views are aggregated to a final value in a multi-view learning algorithm. We evaluate our method based on Beijing air quality and meteorological data, finding advantages to our model compared with ten baseline approaches. |
Tasks | Imputation, Multivariate Time Series Imputation, Multi-View Learning, Time Series |
Published | 2016-07-09 |
URL | https://www.microsoft.com/en-us/research/publication/st-mvl-filling-missing-values-in-geo-sensory-time-series-data/ |
https://www.ijcai.org/Proceedings/16/Papers/384.pdf | |
PWC | https://paperswithcode.com/paper/st-mvl-filling-missing-values-in-geo-sensory |
Repo | |
Framework | |
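The ST-MVL abstract names Inverse Distance Weighting (IDW) and Simple Exponential Smoothing (SES) as its two global views. The sketch below illustrates those two estimators in their textbook weighted-average form and a plain linear blend of their outputs. It is an illustration under stated assumptions, not the authors' implementation: the collaborative-filtering views are omitted, and the function names, the parameters `alpha` and `beta`, and the blend weights are all hypothetical.

```python
def idw_estimate(neighbors, alpha=2.0):
    """Spatial view: inverse-distance-weighted average.

    neighbors: list of (distance, value) pairs from other sensors
    at the same timestamp; distances must be positive.
    """
    weights = [distance ** -alpha for distance, _ in neighbors]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

def ses_estimate(readings, beta=0.5):
    """Temporal view: exponentially decaying weights over time gaps.

    readings: list of (gap, value) pairs from the same sensor, where
    gap >= 1 is the number of time steps away from the missing reading.
    """
    weights = [beta * (1 - beta) ** (gap - 1) for gap, _ in readings]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, readings))
    return weighted_sum / sum(weights)

def blend(view_predictions, view_weights):
    """Linear combination of per-view predictions (weights are assumed given)."""
    return sum(w * p for w, p in zip(view_weights, view_predictions))
```

For example, `idw_estimate([(1.0, 10.0), (2.0, 20.0)])` weights the nearer sensor four times as heavily as the farther one when `alpha=2.0`. In the paper the aggregation step is a learned multi-view model; the fixed weights passed to `blend` here stand in for it only for illustration.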
AMR Parsing with an Incremental Joint Model
Title | AMR Parsing with an Incremental Joint Model |
Authors | Junsheng Zhou, Feiyu Xu, Hans Uszkoreit, Weiguang Qu, Ran Li, Yanhui Gu |
Abstract | |
Tasks | Abstractive Text Summarization, AMR Parsing, Entity Linking, Machine Translation, Natural Language Inference, Question Answering |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1065/ |
https://www.aclweb.org/anthology/D16-1065 | |
PWC | https://paperswithcode.com/paper/amr-parsing-with-an-incremental-joint-model |
Repo | |
Framework | |
Using mention accessibility to improve coreference resolution
Title | Using mention accessibility to improve coreference resolution |
Authors | Kellie Webster, Joel Nothman |
Abstract | |
Tasks | Coreference Resolution |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-2070/ |
https://www.aclweb.org/anthology/P16-2070 | |
PWC | https://paperswithcode.com/paper/using-mention-accessibility-to-improve |
Repo | |
Framework | |
Incorporating Selectional Preferences in Multi-hop Relation Extraction
Title | Incorporating Selectional Preferences in Multi-hop Relation Extraction |
Authors | Rajarshi Das, Arvind Neelakantan, David Belanger, Andrew McCallum |
Abstract | |
Tasks | Knowledge Base Completion, Question Answering, Relation Extraction |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/W16-1304/ |
https://www.aclweb.org/anthology/W16-1304 | |
PWC | https://paperswithcode.com/paper/incorporating-selectional-preferences-in |
Repo | |
Framework | |
Sentence Clustering using PageRank Topic Model
Title | Sentence Clustering using PageRank Topic Model |
Authors | Kenshin Ikegami, Yukio Ohsawa |
Abstract | |
Tasks | Decision Making, Language Modelling, Topic Models |
Published | 2016-10-01 |
URL | https://www.aclweb.org/anthology/Y16-3003/ |
https://www.aclweb.org/anthology/Y16-3003 | |
PWC | https://paperswithcode.com/paper/sentence-clustering-using-pagerank-topic |
Repo | |
Framework | |