Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface of Corpus and Computational Linguistics

Authors

Michael Beißwenger

Nelleke Oostdijk

Angelika Storrer

Henk van den Heuvel

Thierry Chanier

Celine Poudat

Benoit Sagot

Georges Antoniadis

Ciara Wigham

Linda Hriba

Julien Longhi

Abstract

The CoMeRe project aims to build a kernel corpus of different computer-mediated communication (CMC) genres in which French is the main language of interaction. It assembles interactions stemming from networks such as the Internet or telecommunications, covering monomodal and multimodal as well as synchronous and asynchronous communication. Corpora are assembled in a standard format, that of the Text Encoding Initiative (TEI). This implies extending the TEI model of text, through a European endeavor, so that it encompasses the richest and most complex CMC genres. This paper presents the Interaction Space model and explains how it has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email, and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: discourse analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As natural language processing (NLP) algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. To guarantee generic annotations, we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to the decisions made about which annotations to apply to which units and describe the processing pipeline for adding them. All CoMeRe corpora are verified through a multi-stage quality control process designed to allow corpora to move from one project phase to the next.
Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure Open Resources and Tools for Language (ORTOLANG). We therefore highlight issues and decisions made concerning the OpenData perspective.

Similar resources

Gender and Computer-Mediated Communication: Emoticons in a Digital Forum in Persian

This study aimed to gain an insight into whether computer-mediated communication (CMC) in the form of a digital forum can reflect gendered discursive practices. A great deal of research has now established that computer-mediated interactions embody gendered differences in the use of emoticons, but few studies have examined the potential effect of the gender of the emoticon-receiver on the frequ...

The Effect of CMC in Business Emails in Lingua Franca: Discourse Features and Misunderstandings

The paper argues that everyday exchange of business emails produces a development in the work-group relationship, which, in turn, makes new communication styles possible and acceptable by the users' habit to computer-mediated forms, even in unbalanced professional exchanges. The focus is on the (spoken) discourse features of email messages in a self-compiled corpus of selected computer-mediated...

Building and Using Corpora of Non-Native Czech

Investigating language acquisition by non-native learners helps to understand important linguistic issues and develop teaching methods, better suited both to the specific target language and to the learner. These tasks can now be based on empirical evidence from learner corpora. A learner corpus consists of language produced by language learners, typically learners of a second or foreign langua...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Challenges in the development of annotated corpora of computer-mediated communication in Indian Languages: A Case of Hindi

The present paper describes an ongoing effort to compile and annotate a large corpus of computer-mediated communication (CMC) in Hindi. It describes the process of the compilation of the corpus, the basic structure of the corpus and the annotation of the corpus and the challenges faced in the creation of such a corpus. It also gives a description of the technologies developed for the processing...

Publication date: 2014
