Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCamDat)

Jeroen Geertzen; Theodora Alexopoulou; Anna Korhonen

Paper 3100

Pages 240-254

Abstract

We present a new database of L2 English writings, the EF Cambridge Open Language Database (EFCamDat), an open access resource built at the University of Cambridge in collaboration with EF Education First (EF), an international educational organization. The database consists of writings submitted to Englishtown, EF's online school. EFCamDat stands out for its size and rich individual longitudinal data from thousands of learners around the world. We further present results from a study evaluating the performance of automated part-of-speech tagging and parsing on EFCamDat data. The study provides EFCamDat users with information on the accuracy of the morphosyntactic annotations accompanying EFCamDat data. In particular, we investigate the effect of learner errors on parsing. The parser shows considerable robustness to learner errors, providing correct tagging and parsing assignments for just over half of words containing a learner error. The parser is particularly robust with semantic, word order, and local morphosyntax errors and succeeds in capturing the underlying syntactic dependencies. Natural language processing tools can thus achieve high accuracy scores and provide reliable annotations of syntactic categories and structures of L2 writing, which are crucial for SLA research.

Published in

Selected Proceedings of the 2012 Second Language Research Forum: Building Bridges between Disciplines

edited by Ryan T. Miller, Katherine I. Martin, Chelsea M. Eddington, Ashlie Henery, Nausica Marcos Miguel, Alison M. Tseng, Alba Tuninetti, and Daniel Walter

ISBN 978-1-57473-464-5 library binding
vi + 254 pages
publication date: 2014
published by Cascadilla Proceedings Project, Somerville, MA, USA