Designing a Common POS-Tagset Framework for Indian Languages
Abstract
Research on part-of-speech (POS) tagset design for European and East Asian languages began with simple listings of the important morphosyntactic features of a single language and has since matured towards hierarchical tagsets, decomposable tags, and common frameworks for multiple languages, such as EAGLES. Several tagsets have been developed for these languages, along with large amounts of annotated data to further research. Indian languages (ILs) present a contrasting picture, with very little research on tagset design issues. We present our work on designing a common POS-tagset framework for ILs, the result of an in-depth analysis of eight languages from two major families, Indo-Aryan and Dravidian. Our framework follows a hierarchical tagset layout similar to the EAGLES guidelines, but with significant changes as needed for the ILs.
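To make the idea of a hierarchical, decomposable tag more concrete, the following minimal Python sketch may help. It is an illustration only: the three-level layout (category, subtype, attributes), the dotted label syntax, and the example labels such as `N`, `C`, `gen`, and `num` are hypothetical assumptions and are not taken from the framework described in the abstract or from the EAGLES guidelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierarchicalTag:
    """A three-level tag; all labels used here are hypothetical examples."""
    category: str           # top level, e.g. "N" for noun
    subtype: str = ""       # second level, e.g. "C" for common noun
    attributes: tuple = ()  # third level: (name, value) pairs

    def encode(self) -> str:
        """Serialise the tag into a dotted label such as 'N.C.gen=fem'."""
        parts = [self.category]
        if self.subtype:
            parts.append(self.subtype)
        parts.extend(f"{name}={value}" for name, value in self.attributes)
        return ".".join(parts)


def decode(label: str) -> HierarchicalTag:
    """Decompose a dotted label back into its levels."""
    parts = label.split(".")
    subtype = ""
    attributes = []
    for part in parts[1:]:
        if "=" in part:
            name, value = part.split("=", 1)
            attributes.append((name, value))
        else:
            subtype = part
    return HierarchicalTag(parts[0], subtype, tuple(attributes))


def coarsen(tag: HierarchicalTag, level: int) -> HierarchicalTag:
    """Keep only the first `level` levels (1 = category, 2 = + subtype)."""
    if level <= 1:
        return HierarchicalTag(tag.category)
    if level == 2:
        return HierarchicalTag(tag.category, tag.subtype)
    return tag


tag = HierarchicalTag("N", "C", (("gen", "fem"), ("num", "sg")))
label = tag.encode()
```

In a layout of this kind, a language that does not mark a given attribute could simply omit it, and a tool that needs only coarse categories could call `coarsen(tag, 1)`, which is one way a shared framework might accommodate several languages at different granularities.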
Related resources
A Common Parts-of-Speech Tagset Framework for Indian Languages
We present a universal part-of-speech (POS) tagset framework covering most of the Indian languages (ILs), following a hierarchical and decomposable tagset schema. In spite of a significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; th...
BIS Annotation Standards With Reference to Konkani Language
The Bureau of Indian Standards (BIS) Part Of Speech (POS) tagset has been prepared for the Indian Languages by the POS Tag Standardization Committee of Department of Information Technology (DIT), New Delhi, India. The BIS POS tagset aims to ensure standardization in the POS tagging of all the Indian Languages. It has been used for POS tagging in the Indian Languages Corpora Initiative (ILCI) pr...
The CLE Urdu POS Tagset
The paper presents a design schema and details of a new Urdu POS tagset. This tagset is designed due to challenges encountered in working with existing tagsets for Urdu. It uses tags that judiciously incorporate information about special morpho-syntactic categories found in Urdu. With respect to the overall naming schema and the basic divisions, the tagset draws on the Penn Treebank and a Commo...
Parts Of Speech Tagging for Indian Languages: A Literature Survey
Part-of-speech (POS) tagging is the process of assigning a part-of-speech tag or other lexical class marker to each word in a sentence. In many natural language processing applications, such as word sense disambiguation, information retrieval, information processing, parsing, question answering, and machine translation, POS tagging is considered one of the basic necessary tool...
Part-of-speech Tagging for Grammar Checking of Punjabi (The Linguistics Journal, Volume 4, Issue 1)
Part-of-speech (POS) tagging is one of the major activities performed in a typical natural language processing application. This paper explores part-of-speech tagging for the Punjabi language, a member of the Modern Indo-Aryan family of languages. A tagset for use in grammar checking and other similar applications is proposed. This fine-grained tagset is based entirely on the grammatical catego...
Publication date: 2008