Chapter 4 Character encoding in corpus construction

نویسندگان

Anthony McEnery

Zhonghua Xiao

چکیده

Corpus linguistics has developed, over the past three decades, into a rich paradigm that addresses a great variety of linguistic issues ranging from monolingual research of one language to contrastive and translation studies involving many different languages. Today, while the construction and exploitation of English language corpora still dominate the field of corpus linguistics, corpora of other languages, either monolingual or multilingual, have also become available. These corpora have added notably to the diversity of corpus-based language studies. Character encoding is rarely an issue for alphabetical languages, like English, which typically still use ASCII characters. For many other languages that use different writing systems (e.g. Chinese), encoding is an important issue if one wants to display the corpus properly or facilitate data interchange, especially when working with multilingual corpora that contain a wide range of writing systems. Language specific encoding systems make data interchange problematic, since it is virtually impossible to display a multilingual document containing texts from different languages using such encoding systems. Such documents constitute a new Tower of Babel which disrupts communication. In addition to the problem with displaying corpus text or search results in general, an issue which is particular relevant to corpus building is that the character encoding in a corpus must be consistent if the corpus is to be searched reliably. This is because if the data in a corpus is encoded using different character sets, even though the internal difference is indiscernible to human eyes, a computer will make a distinction, thus leading to unreliable results. In many cases, however, multiple and often competing encoding systems complicate corpus building, providing a real problem. For example, the main difficulty in building a multilingual corpus such as EMILLE is the need to standardize the language data into a single character set (see Baker, Hardie & McEnery et al 2004). The encoding, together with other ancillary data such as markup and annotation schemes, should also be documented clearly. Such documentation must be made available to the users. A legacy encoding is typically designed to support one writing system, or a group of writing systems that use the same script (see discussion below). In contrast, Unicode is truly multilingual in that it can display characters from a very large number of writing systems. Unicode enables one to surmount this Tower of Babel by overcoming the inherent deficiencies of various legacy encodings. 2 It has also facilitated the task of corpus building (most notably for multilingual corpora and corpora involving non-Western languages). Hence, a general trend in corpus building is to encode corpora (especially multilingual corpora) using Unicode (e.g. EMILLE). Corpora encoded in Unicode can also take advantage of the latest Unicode-compliant corpus tools like Xaira (Burnard & Todd 2003) and WordSmith version 4.0 (Scott 2003). In this chapter, we will consider character encoding from the viewpoint of corpus linguistics rather than programming, which means that the account presented

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Recent Developments in the National Corpus of Polish

The aim of the paper is to present recent — as of July 2009 — developments in the construction of the National Corpus of Polish. The main developments are: 1) the design of text encoding XML schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, 3) numerous improvements in search tools.

متن کامل

A Study in Urdu Corpus Construction

We are interested in contributing a small, publicly available Urdu corpus of written text to the natural language processing community. The Urdu text is stored in the Unicode character set, in its native Arabic script, and marked up according to the Corpus Encoding Standard (CES) XML Document Type Definition (DTD). All the tags and metadata are in English. To date, the corpus is made entirely o...

متن کامل

Development of a Web-Scale Chinese Word N-gram Corpus with Parts of Speech Information

Web provides a large-scale corpus for researchers to study the language usages in real world. Developing a web-scale corpus needs not only a lot of computation resources, but also great efforts to handle the large variations in the web texts, such as character encoding in processing Chinese web texts. In this paper, we aim to develop a web-scale Chinese word N-gram corpus with parts of speech i...

متن کامل

Some new two-character sets in PG(5, q2) and a distance-2 ovoid in the generalized hexagon H(4)

In this paper, we construct a new infinite class of two-character sets in PG(5, q2) and determine their automorphism groups. From this construction arise new infinite classes of two-weight codes and strongly regular graphs, and a new distance-2 ovoid of the split Cayley hexagon of order 4.

متن کامل

Which XML Standards for Multilevel Corpus Annotation?

The paper attempts to answer the question: Which XML standard(s) should be used for multilevel corpus annotation? Various more or less specific standards and best practices are reviewed: TEI P5, XCES, work within ISO TC 37 / SC 4, TIGER-XML and PAULA. The conclusion of the paper is that the approach with the best claim to following text encoding standards consists in 1) using TEI-conformant sch...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2005

Chapter 4 Character encoding in corpus construction

نویسندگان

چکیده

منابع مشابه

Recent Developments in the National Corpus of Polish

A Study in Urdu Corpus Construction

Development of a Web-Scale Chinese Word N-gram Corpus with Parts of Speech Information

Some new two-character sets in PG(5, q2) and a distance-2 ovoid in the generalized hexagon H(4)

Which XML Standards for Multilevel Corpus Annotation?

عنوان ژورنال:

اشتراک گذاری