native language

Native Language Identification: A Key N-gram Category Approach

2013

Kristopher Kyle, Scott A. Crossley, Jianmin Dai, Danielle S. McNamara

This study explores the efficacy of an approach to native language identification that uses grammatical, rhetorical, semantic, syntactic, and cohesive function categories composed of key n-grams. The study found that a model based on these categories of key n-grams successfully predicted the L1 of essays written in English by L2 learners from 11 different L1 backgrounds with an a...

Language Transfer Hypotheses with Linear SVM Weights

2014

Shervin Malmasi, Mark Dras

Language transfer, the characteristic second language usage patterns caused by native language interference, is investigated by Second Language Acquisition (SLA) researchers seeking to find overused and underused linguistic features. In this paper we develop and present a methodology for deriving ranked lists of such features. Using very large learner data, we show our method’s ability to find ...

Oracle and Human Baselines for Native Language Identification

2015

Shervin Malmasi, Joel R. Tetreault, Mark Dras

We examine different ensemble methods, including an oracle, to estimate the upper limit of classification accuracy for Native Language Identification (NLI). The oracle outperforms state-of-the-art systems by over 10%, and results indicate that, for many misclassified texts, the correct class label receives a significant portion of the ensemble votes, often being the runner-up. We also present a pi...

Native Language Identification Using Large, Longitudinal Data

2014

Xiao Jiang, Yufan Guo, Jeroen Geertzen, Dora Alexopoulou, Lin Sun, Anna Korhonen

Native Language Identification (NLI) is a task aimed at determining the native language (L1) of learners of a second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to the recently released EFCamDat corpus, which is not only multiple times larger than previous L2 corpora but also provides longitudinal data at several...

The Interlanguage of Persian Learners of Italian: a Focus on Complex Predicates

Journal: Iranian Journal of Applied Language Studies 2011

Francesca Frontini, Francesca Mazzariello

This paper investigates the acquisition of Italian complex predicates by native speakers of Persian. Complex predication is not as pervasive a phenomenon in Italian as it is in Persian. Yet Italian native speakers use complex predicates productively; spontaneous data show that Persian learners of Italian seem to be perfectly aware of Italian complex predicates and use this familiar fea...

Stacked Sentence-Document Classifier Approach for Improving Native Language Identification

2017

Andrea Cimino, Felice Dell'Orletta

In this paper, we describe the approach of the ItaliaNLP Lab team to native language identification and discuss the results we submitted as participants in the essay track of the NLI Shared Task 2017. We introduce for the first time a 2-stacked sentence-document architecture for native language identification that is able to exploit both local sentence information and a wide set of general-purpose f...

Experimental Results on the Native Language Identification Shared Task

2013

Amjad Abu-Jbara, Rahul Jha, Eric Morley, Dragomir R. Radev

We present a system for automatically identifying the native language of a writer. We experiment with a large set of features and train them on a corpus of 9,900 essays written in English by speakers of 11 different languages. Our system achieved an accuracy of 43% on the test data, improved to 63% with improved feature normalization. In this paper, we present the features used in our system, d...

Recognizing English Learners' Native Language from Their Writings

2013

Baoli Li

Native Language Identification (NLI), which tries to identify the native language (L1) of a second language learner based on their writings, is helpful for advancing second language learning and authorship profiling in forensic linguistics. With the availability of relevant data resources, much work has been done to explore the native language of a foreign language learner. In this report, we p...

Robust, Lexicalized Native Language Identification

2012

Julian Brooke, Graeme Hirst

Previous approaches to the task of native language identification (Koppel et al., 2005) have been limited to small, within-corpus evaluations. Because these are restrictive and unreliable, we apply cross-corpus evaluation to the task. We demonstrate the efficacy of lexical features, which had previously been avoided due to the within-corpus topic confounds, and provide a detailed evaluation of ...

The MERLIN corpus: Learner language and the CEFR

2014

Adriane Boyd, Jirka Hana, Lionel Nicolas, Walt Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová, Chiara Vettori

The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains 2,290 learner texts produced in standardized language certifications covering CEFR levels A1–C1. The MERLIN annotation scheme includes a wide range of language characteri...
