Paper 3100

Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCamDat)
Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen
Pages 240-254

Abstract

We present a new database of L2 English writings, the EF Cambridge Open Language Database (EFCamDat), an open access resource built at the University of Cambridge in collaboration with EF Education First (EF), an international educational organization. The database consists of writings submitted to Englishtown, EF's online school. EFCamDat stands out for its size and rich individual longitudinal data from thousands of learners around the world. We further present results from a study evaluating the performance of automated part-of-speech tagging and parsing on EFCamDat data. The study provides EFCamDat users with information on the accuracy of the morphosyntactic annotations accompanying EFCamDat data. In particular, we investigate the effect of learner errors on parsing. The parser shows considerable robustness to learner errors, providing correct tagging and parsing assignments for just over half of words containing a learner error. The parser is particularly robust with semantic, word order, and local morphosyntax errors and succeeds in capturing the underlying syntactic dependencies. Natural language processing tools can thus achieve high accuracy scores and provide reliable annotations of syntactic categories and structures of L2 writing, which are crucial for SLA research.

Published in

Selected Proceedings of the 2012 Second Language Research Forum: Building Bridges between Disciplines
edited by Ryan T. Miller, Katherine I. Martin, Chelsea M. Eddington, Ashlie Henery, Nausica Marcos Miguel, Alison M. Tseng, Alba Tuninetti, and Daniel Walter