L1-L2 Parallel Dependency Treebank as Learner Corpus

نویسندگان

  • John Lee
  • Keying Li
  • Herman Leung
چکیده

This opinion paper proposes the use of parallel treebank as learner corpus. We show how an L1-L2 parallel treebank — i.e., parse trees of non-native sentences, aligned to the parse trees of their target hypotheses — can facilitate retrieval of sentences with specific learner errors. We argue for its benefits, in terms of corpus reuse and interoperability, over a conventional learner corpus annotated with error tags. As a proof of concept, we conduct a case study on word-order errors made by learners of Chinese as a foreign language. We report precision and recall in retrieving a range of word-order error categories fromL1-L2 tree pairs annotated in the Universal Dependency framework.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Running head: L1 INFLUENCE ON MORPHEME ACQUISITION ORDER 1 L1 Influence on the Acquisition Order of English Grammatical Morphemes: A Learner Corpus Study

We revisit morpheme studies to evaluate the long-standing claim for a universal order of acquisition. We investigate the L2 acquisition order of six English grammatical morphemes by learners from seven L1 groups across five proficiency levels. Data are drawn from approximately 10,000 written exam scripts from the Cambridge Learner Corpus. The study establishes clear L1 influence on the absolute...

متن کامل

Learning with Learner Corpora: using the TLE for Native Language Identification

This study investigates the usefulness of the Treebank of Learner English (TLE) when applied to the task of Native Language Identification (NLI). The TLE is effectively a parallel corpus of Standard/Learner English, as there are two versions; one based on original learner essays, and the other an error-corrected version. We use the corpus to explore how useful a parser trained on ungrammatical ...

متن کامل

Measuring Interlanguage: Native Language Identification with L1-influence Metrics

The task of native language (L1) identification suffers from a relative paucity of useful training corpora, and standard within-corpus evaluation is often problematic due to topic bias. In this paper, we introduce a method for L1 identification in second language (L2) texts that relies only on much more plentiful L1 data, rather than the L2 texts that are traditionally used for training. In par...

متن کامل

REALEC learner treebank: annotation principles and evaluation of automatic parsing

The paper presents a Universal Dependencies (UD) annotation scheme for a learner English corpus. The REALEC dataset consists of essays written in English by Russian-speaking university students in the course of general English. The original corpus is manually annotated for learners’ errors and gives information on the error span, error type, and the possible correction of the mistake provided b...

متن کامل

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017