Paper Group NAWR 4
Adapting to All Domains at Once: Rewarding Domain Invariance in SMT. C4Corpus: Multilingual Web-size Corpus with Free License. ExploreKit: Automatic Feature Generation and Selection. An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability. Efficient Coarse-To-Fine PatchMatch for Large Displ …
Adapting to All Domains at Once: Rewarding Domain Invariance in SMT
Title | Adapting to All Domains at Once: Rewarding Domain Invariance in SMT |
Authors | Hoang Cuong, Khalil Sima'an, Ivan Titov |
Abstract | Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced sub-domains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance. |
Tasks | Domain Adaptation, Machine Translation |
Published | 2016-01-01 |
URL | https://www.aclweb.org/anthology/Q16-1008/ |
PWC | https://paperswithcode.com/paper/adapting-to-all-domains-at-once-rewarding |
Repo | https://github.com/hoangcuong2011/UDIT |
Framework | none |
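
The abstract above describes rewarding phrases that are not specialized to any induced subdomain. Below is a minimal sketch, under assumed toy counts, of one way such a domain-invariance signal could be computed: the entropy of a phrase pair's distribution over induced subdomains (the paper's actual features and estimation procedure differ).

```python
# Illustrative proxy, not the paper's exact feature set: score how
# domain-invariant a phrase pair is via the entropy of its distribution
# over induced subdomains. All counts below are hypothetical.
import math

def invariance_score(counts_per_subdomain):
    """Entropy of P(subdomain | phrase); higher = more domain-invariant."""
    total = sum(counts_per_subdomain)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts_per_subdomain if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical counts of a phrase pair in 4 induced subdomains.
generic_phrase = [120, 95, 110, 105]   # spread evenly -> domain-invariant
specific_phrase = [3, 410, 1, 0]       # concentrated  -> domain-specific

print(invariance_score(generic_phrase))   # close to log(4) ~ 1.386
print(invariance_score(specific_phrase))  # close to 0
```

A decoder feature of this kind would then be weighted on out-of-domain tuning data, as the abstract describes, so that safer, high-entropy phrases are preferred when the target domain is unknown.
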
C4Corpus: Multilingual Web-size Corpus with Free License
Title | C4Corpus: Multilingual Web-size Corpus with Free License |
Authors | Ivan Habernal, Omnia Zayed, Iryna Gurevych |
Abstract | Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12 million-page Web corpus (over 10 billion tokens) licensed under the Creative Commons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs. Our highly scalable Hadoop-based framework is able to process the full CommonCrawl corpus on a 2,000+ CPU cluster on the Amazon Elastic MapReduce infrastructure. The processing pipeline includes license identification, state-of-the-art boilerplate removal, exact-duplicate and near-duplicate document removal, and language detection. The construction of the corpus is highly configurable and fully reproducible, and we provide both the framework (DKPro C4CorpusTools) and the resulting data (C4Corpus) to the research community. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1146/ |
PWC | https://paperswithcode.com/paper/c4corpus-multilingual-web-size-corpus-with |
Repo | https://github.com/dkpro/dkpro-c4corpus |
Framework | none |
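
The pipeline stages listed in the abstract (license identification, boilerplate removal, duplicate removal, language detection) can be pictured with the toy, single-machine sketch below. The real system is the Hadoop-based DKPro C4CorpusTools running on Amazon EMR; every function body here is a simplified stand-in, and language detection is omitted.

```python
# Toy sketch of the corpus-construction stages; not the DKPro C4CorpusTools code.
import hashlib
import re

CC_LICENSE_RE = re.compile(r"creativecommons\.org/licenses/(by[-a-z]*)", re.I)

def detect_license(html):
    """Return a CC license tag if a CC license link is present, else None."""
    m = CC_LICENSE_RE.search(html)
    return f"cc-{m.group(1).lower()}" if m else None

def strip_boilerplate(html):
    # Stand-in for a real boilerplate remover: drop tags and collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()

def dedupe(docs):
    """Exact-duplicate removal by content hash (near-duplicate removal omitted)."""
    seen, unique = set(), []
    for d in docs:
        h = hashlib.sha1(d["text"].encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique

pages = [
    {"url": "http://example.org/a",
     "html": '<a href="https://creativecommons.org/licenses/by-sa/4.0/">CC</a><p>Hello world</p>'},
    {"url": "http://example.org/a-mirror",
     "html": '<a href="https://creativecommons.org/licenses/by-sa/4.0/">CC</a><p>Hello world</p>'},
    {"url": "http://example.org/b", "html": "<p>No license here</p>"},
]

corpus = []
for page in pages:
    lic = detect_license(page["html"])
    if lic is None:
        continue  # keep only permissively licensed pages
    corpus.append({"url": page["url"], "license": lic,
                   "text": strip_boilerplate(page["html"])})
corpus = dedupe(corpus)   # the mirror page is dropped as an exact duplicate
print(corpus)
```
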
ExploreKit: Automatic Feature Generation and Selection
Title | ExploreKit: Automatic Feature Generation and Selection |
Authors | Gilad Katz, Eui Chul Richard Shin, Dawn Song |
Abstract | Feature generation is one of the challenging aspects of machine learning. We present ExploreKit, a framework for automated feature generation. ExploreKit generates a large set of candidate features by combining information in the original features, with the aim of maximizing predictive performance according to user-selected criteria. To overcome the exponential growth of the feature space, ExploreKit uses a novel machine learning-based feature selection approach to predict the usefulness of new candidate features. This approach enables efficient identification of the new features and produces superior results compared to existing feature selection solutions. We demonstrate the effectiveness and robustness of our approach by conducting an extensive evaluation on 25 datasets and 3 different classification algorithms. We show that ExploreKit can achieve classification-error reduction of 20% overall. |
Tasks | Automated Feature Engineering, Feature Selection |
Published | 2016-01-01 |
URL | https://ieeexplore.ieee.org/document/7837936 |
| http://people.eecs.berkeley.edu/~dawnsong/papers/icdm-2016.pdf |
PWC | https://paperswithcode.com/paper/explorekit-automatic-feature-generation-and |
Repo | https://github.com/giladkatz/ExploreKit |
Framework | none |
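
A minimal sketch of the two stages the abstract describes: generating candidate features by combining original ones, then ranking candidates by an estimated usefulness. ExploreKit learns this ranker with a meta-model; absolute correlation with the label below is only a simple stand-in, and all data are synthetic.

```python
# Toy illustration of candidate feature generation + ranking (not ExploreKit's
# learned ranker).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # toy original features
y = (X[:, 0] * X[:, 1] > 0).astype(float)   # label depends on an interaction

# Stage 1: generate candidate features by combining the original ones.
candidates = {}
for i, j in combinations(range(X.shape[1]), 2):
    candidates[f"x{i}*x{j}"] = X[:, i] * X[:, j]
    candidates[f"x{i}+x{j}"] = X[:, i] + X[:, j]

# Stage 2: rank candidates by estimated usefulness (plain label correlation
# as a stand-in for the learned meta-ranker).
def usefulness(feature, label):
    return abs(np.corrcoef(feature, label)[0, 1])

ranked = sorted(candidates, key=lambda n: usefulness(candidates[n], y), reverse=True)
print(ranked[:3])   # "x0*x1" is expected to rank at or near the top
```
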
An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability
Title | An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability |
Authors | Johannes Hellrich, Udo Hahn |
Abstract | |
Tasks | Semantic Textual Similarity |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/W16-2114/ |
PWC | https://paperswithcode.com/paper/an-assessment-of-experimental-protocols-for |
Repo | https://github.com/hellrich/latech2016 |
Framework | none |
Efficient Coarse-To-Fine PatchMatch for Large Displacement Optical Flow
Title | Efficient Coarse-To-Fine PatchMatch for Large Displacement Optical Flow |
Authors | Yinlin Hu, Rui Song, Yunsong Li |
Abstract | As a key component in many computer vision systems, optical flow estimation, especially with large displacements, remains an open problem. In this paper we present a simple but powerful matching method that works in a coarse-to-fine scheme for optical flow estimation. Inspired by nearest neighbor field (NNF) algorithms, our approach, called CPM (Coarse-to-fine PatchMatch), blends an efficient random search strategy with a coarse-to-fine scheme for the optical flow problem. Unlike existing NNF techniques, which are efficient but whose results are often too noisy for optical flow due to the lack of global regularization, we propose a propagation step with a constrained random search radius between adjacent levels of the hierarchical architecture. The resulting correspondences enjoy a built-in smoothing effect, which is better suited to optical flow estimation than NNF techniques. Furthermore, our approach can also capture tiny structures with large motions, which are a problem for traditional coarse-to-fine optical flow algorithms. Interpolated with an edge-preserving interpolation method (EpicFlow), our method outperforms the state of the art on MPI-Sintel and KITTI, and runs much faster than competing methods. |
Tasks | Optical Flow Estimation |
Published | 2016-06-01 |
URL | http://openaccess.thecvf.com/content_cvpr_2016/html/Hu_Efficient_Coarse-To-Fine_PatchMatch_CVPR_2016_paper.html |
| http://openaccess.thecvf.com/content_cvpr_2016/papers/Hu_Efficient_Coarse-To-Fine_PatchMatch_CVPR_2016_paper.pdf |
PWC | https://paperswithcode.com/paper/efficient-coarse-to-fine-patchmatch-for-large |
Repo | https://github.com/YinlinHu/CPM |
Framework | none |
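
The core idea, per the abstract, is to constrain the search at each pyramid level to a small radius around the estimate propagated from the coarser level. The toy sketch below shows that constraint for a single patch on synthetic images; the real CPM works on a grid of seeds with PatchMatch-style propagation and random search rather than the brute-force window used here.

```python
# Toy coarse-to-fine matching for one patch; not the CPM implementation.
import numpy as np

rng = np.random.default_rng(1)

def build_pyramid(img, levels):
    """Naive image pyramid, returned coarse-to-fine."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(pyr[-1][::2, ::2])            # crude 2x downsampling
    return pyr[::-1]

def patch_cost(a, b, pa, pb, size=4):
    """Sum of absolute differences between two size x size patches."""
    (ya, xa), (yb, xb) = pa, pb
    if min(ya, xa, yb, xb) < 0:
        return np.inf
    pa_, pb_ = a[ya:ya + size, xa:xa + size], b[yb:yb + size, xb:xb + size]
    if pa_.shape != (size, size) or pb_.shape != (size, size):
        return np.inf
    return float(np.abs(pa_ - pb_).sum())

# Synthetic pair: the second image is the first shifted by a large displacement.
img1 = rng.random((64, 64))
img2 = np.roll(img1, shift=(12, 20), axis=(0, 1))

levels, radius = 3, 6                            # constrained per-level search radius
pyr1, pyr2 = build_pyramid(img1, levels), build_pyramid(img2, levels)
pos = (5, 5)                                     # patch position at the finest level
flow = (0, 0)
for lvl in range(levels):                        # coarse -> fine
    scale = 2 ** (levels - 1 - lvl)
    p = (pos[0] // scale, pos[1] // scale)
    if lvl > 0:
        flow = (flow[0] * 2, flow[1] * 2)        # propagate estimate to the finer level
    best, flow_new = np.inf, flow
    for dy in range(-radius, radius + 1):        # search only near the propagated flow
        for dx in range(-radius, radius + 1):
            cand = (flow[0] + dy, flow[1] + dx)
            cost = patch_cost(pyr1[lvl], pyr2[lvl], p,
                              (p[0] + cand[0], p[1] + cand[1]))
            if cost < best:
                best, flow_new = cost, cand
    flow = flow_new
print(flow)                                      # recovers (12, 20)
```
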
Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models
Title | Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models |
Authors | Karl Stratos, Michael Collins, Daniel Hsu |
Abstract | We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., "the" is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words. |
Tasks | Part-Of-Speech Tagging, Unsupervised Part-Of-Speech Tagging |
Published | 2016-01-01 |
URL | https://www.aclweb.org/anthology/Q16-1018/ |
PWC | https://paperswithcode.com/paper/unsupervised-part-of-speech-tagging-with |
Repo | https://github.com/karlstratos/anchor |
Framework | none |
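
To see why anchor words help, the hedged sketch below follows the Arora et al.-style recovery the abstract points to: if each tag has an anchor word emitted by that tag only, a non-anchor word's context distribution is a convex combination of the anchors' context distributions, and the mixing weights recover P(tag | word). The matrices are toy numbers, not the paper's estimator.

```python
# Toy anchor-based recovery of P(tag | word); illustrative only.
import numpy as np
from scipy.optimize import nnls

# Context distributions (over 3 context clusters) for 3 anchor words, one per tag.
anchors = np.array([
    [0.8, 0.1, 0.1],   # anchor for DET
    [0.1, 0.8, 0.1],   # anchor for NOUN
    [0.1, 0.1, 0.8],   # anchor for VERB
])

# Observed context distribution of an ambiguous non-anchor word
# (here constructed as 60% NOUN-like, 40% VERB-like).
word_context = 0.6 * anchors[1] + 0.4 * anchors[2]

# Non-negative least squares recovers the mixing weights over the anchors.
weights, _ = nnls(anchors.T, word_context)
weights /= weights.sum()
print(weights.round(2))   # ~[0.0, 0.6, 0.4], i.e. P(tag | word) in this toy setup
```
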
Latent Tree Language Model
Title | Latent Tree Language Model |
Authors | Tomáš Brychcín |
Abstract | |
Tasks | Language Modelling, Machine Translation, Optical Character Recognition, Speech Recognition |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1042/ |
PWC | https://paperswithcode.com/paper/latent-tree-language-model-1 |
Repo | https://github.com/brychcin/LTLM |
Framework | none |
A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation
Title | A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation |
Authors | Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, Alexander Sorkine-Hornung |
Abstract | Over the years, datasets and benchmarks have proven their fundamental importance in computer vision research, enabling targeted progress and objective comparisons in many fields. At the same time, legacy datasets may impede the evolution of a field due to saturated algorithm performance and the lack of contemporary, high-quality data. In this work we present a new benchmark dataset and evaluation methodology for the area of video object segmentation. The dataset, named DAVIS (Densely Annotated VIdeo Segmentation), consists of fifty high-quality, Full HD video sequences, spanning multiple occurrences of common video object segmentation challenges such as occlusions, motion blur and appearance changes. Each video is accompanied by densely annotated, pixel-accurate and per-frame ground truth segmentation. In addition, we provide a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics that measure the spatial extent of the segmentation, the accuracy of the silhouette contours and the temporal coherence. The results uncover strengths and weaknesses of current approaches, opening up promising directions for future works. |
Tasks | Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation |
Published | 2016-06-01 |
URL | http://openaccess.thecvf.com/content_cvpr_2016/html/Perazzi_A_Benchmark_Dataset_CVPR_2016_paper.html |
| http://openaccess.thecvf.com/content_cvpr_2016/papers/Perazzi_A_Benchmark_Dataset_CVPR_2016_paper.pdf |
PWC | https://paperswithcode.com/paper/a-benchmark-dataset-and-evaluation |
Repo | https://github.com/fperazzi/davis |
Framework | none |
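
Of the three complementary metrics mentioned in the abstract, the region one is the Jaccard index between predicted and ground-truth masks; a minimal version for binary NumPy masks is sketched below (contour accuracy and temporal stability require boundary matching and are omitted). The masks are toy arrays, not DAVIS data.

```python
# Minimal region-similarity (Jaccard) metric for binary segmentation masks.
import numpy as np

def jaccard(pred, gt):
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True          # toy ground-truth object
pred = np.zeros((10, 10), dtype=bool)
pred[3:9, 3:9] = True        # toy prediction, shifted by one pixel
print(round(jaccard(pred, gt), 3))   # 25 / 47 ~ 0.532
```
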
Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources
Title | Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources |
Authors | Gintar{.e} Grigonyt{.e}, Kristina Nilsson Bj{"o}rkenstam |
Abstract | |
Tasks | Language Acquisition |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/W16-6506/ |
PWC | https://paperswithcode.com/paper/language-independent-exploration-of |
Repo | https://github.com/ginta-re/Varseta |
Framework | none |
Parsing as Language Modeling
Title | Parsing as Language Modeling |
Authors | Do Kook Choe, Eugene Charniak |
Abstract | |
Tasks | Constituency Parsing, Dependency Parsing, Language Modelling |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/D16-1257/ |
PWC | https://paperswithcode.com/paper/parsing-as-language-modeling |
Repo | https://github.com/cdg720/emnlp2016 |
Framework | tf |
Valencer: an API to Query Valence Patterns in FrameNet
Title | Valencer: an API to Query Valence Patterns in FrameNet |
Authors | Alexandre Kabbach, Corentin Ribeyre |
Abstract | This paper introduces Valencer: a RESTful API to search for annotated sentences matching a given combination of syntactic realizations of the arguments of a predicate, also called a 'valence pattern', in the FrameNet database. The API takes as input an HTTP GET request specifying a valence pattern and outputs a list of exemplifying annotated sentences in JSON format. The API is designed to be modular and language-independent, and can therefore be easily integrated into other (NLP) server-side or client-side applications, as well as non-English FrameNet projects. Valencer is free, open-source, and licensed under the MIT license. |
Tasks | Question Answering |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-2033/ |
PWC | https://paperswithcode.com/paper/valencer-an-api-to-query-valence-patterns-in |
Repo | https://github.com/akb89/valencer |
Framework | none |
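
Since the API is an HTTP GET endpoint returning JSON, a client call looks roughly like the sketch below; the host, route, and parameter name are placeholders rather than the documented Valencer interface, so consult the repository for the actual query syntax.

```python
# Hedged client-side sketch of a valence-pattern query; endpoint and parameter
# name are placeholders, not the real Valencer API specification.
import requests

BASE_URL = "http://localhost:3030/valencer"   # placeholder host and route
pattern = "Donor.NP.Ext Theme.NP.Obj"          # example valence-pattern string

response = requests.get(BASE_URL, params={"vp": pattern}, timeout=10)
response.raise_for_status()
for sentence in response.json():               # list of annotated sentences (JSON)
    print(sentence)
```
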
Aicyber at SemEval-2016 Task 4: i-vector based sentence representation
Title | Aicyber at SemEval-2016 Task 4: i-vector based sentence representation |
Authors | Steven Du, Xi Zhang |
Abstract | |
Tasks | Sentiment Analysis, Speaker Verification |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1017/ |
PWC | https://paperswithcode.com/paper/aicyber-at-semeval-2016-task-4-i-vector-based |
Repo | https://github.com/StevenLOL/aicyber_semeval_2016_ivector |
Framework | none |
UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields
Title | UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields |
Authors | Mohammad Javad Hosseini, Noah A. Smith, Su-In Lee |
Abstract | |
Tasks | |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/S16-1143/ |
PWC | https://paperswithcode.com/paper/uw-cse-at-semeval-2016-task-10-detecting |
Repo | https://github.com/mjhosseini/2-CRF-MWE |
Framework | none |
PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors
Title | PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors |
Authors | David R. Mortensen, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, Lori Levin |
Abstract | This paper contributes to a growing body of evidence that, when coupled with appropriate machine-learning techniques, linguistically motivated, information-rich representations can outperform one-hot encodings of linguistic data. In particular, we show that phonological features outperform character-based models. PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. We show that this database boosts performance in various NER-related tasks. Phonologically aware, neural CRF models built on PanPhon features are able to perform better on monolingual Spanish and Turkish NER tasks than character-based models. They have also been shown to work well in transfer models (as between Uzbek and Turkish). PanPhon features also contribute measurably to orthography-to-IPA conversion tasks. |
Tasks | |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1328/ |
PWC | https://paperswithcode.com/paper/panphon-a-resource-for-mapping-ipa-segments |
Repo | https://github.com/dmort27/panphon |
Framework | none |
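
The resource's core mapping, IPA segment to articulatory feature vector, can be pictured with the self-contained toy below; the real PanPhon table covers over 5,000 segments and 21 features, while the handful of segments and features here is illustrative only (see the repository for the actual Python API).

```python
# Toy illustration of IPA-segment -> articulatory-feature-vector lookup;
# the segment inventory and feature set are placeholders, not PanPhon's table.
FEATURES = ["syl", "son", "voi", "nas"]     # small subset of feature names

SEGMENTS = {
    "p": [-1, -1, -1, -1],    # voiceless bilabial stop
    "m": [-1, +1, +1, +1],    # bilabial nasal
    "a": [+1, +1, +1, -1],    # open vowel
}

def word_to_vectors(ipa_word):
    """Map each IPA segment of a pre-segmented word to its feature vector."""
    return [SEGMENTS[seg] for seg in ipa_word if seg in SEGMENTS]

print(word_to_vectors("map"))
# [[-1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1]]
```
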
Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings
Title | Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings |
Authors | Ondřej Dušek, Filip Jurčíček |
Abstract | |
Tasks | Spoken Dialogue Systems, Text Generation |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-2008/ |
PWC | https://paperswithcode.com/paper/sequence-to-sequence-generation-for-spoken-1 |
Repo | https://github.com/UFAL-DSG/tgen |
Framework | tf |