Automatic Language Identification in Code-Switched Hindi-English Social Media Text

Abstract

Natural Language Processing (NLP) tools typically struggle to process code-switched data, and so linguists are commonly forced to annotate such data manually. As this data becomes more readily available, automatic tools are increasingly needed to help speed up the annotation process and improve consistency. Last year, such a toolkit was developed to semi-automatically annotate transcribed bilingual code-switched Vietnamese-English speech data with token-based language information and POS tags (hereafter the CanVEC toolkit, L. Nguyen & Bryant, 2020). In this work, we extend this methodology to another language pair, Hindi-English, to explore the extent to which we can standardise the automation process. Specifically, we applied the principles behind the CanVEC toolkit to data from the International Conference on Natural Language Processing (ICON) 2016 shared task, which consists of social media posts (Facebook, Twitter and WhatsApp) that have been annotated with language and POS tags (Molina et al., 2016). We used the ICON-2016 annotations as the gold-standard labels in the language identification task. Ultimately, our tool achieved an F1 score of 87.99% on the ICON-2016 data. We then evaluated the first 500 tokens of each social media subset manually, and found that almost 40% of all errors were caused entirely by problems with the gold-standard, i.e., our system was correct. It is thus likely that the overall accuracy of our system is higher than reported. This shows great potential for effectively automating the annotation of code-switched corpora, on different language combinations and in different genres. We finally discuss some limitations of our approach and release our code and human evaluation together with this paper.
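As a rough illustration of the evaluation described in the abstract, the sketch below scores predicted token-level language labels against gold-standard labels using per-label and macro-averaged F1. It is not the authors' released code: the function names, the example labels (`hi`, `en`, `univ`) and the choice of macro averaging are assumptions made for illustration, since the abstract does not state which averaging scheme produced the reported 87.99%.

```python
from collections import Counter


def per_label_f1(gold, pred):
    """Return {label: F1} for two equal-length sequences of token labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct label for this token
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores (assumed averaging scheme)."""
    scores = per_label_f1(gold, pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical five-token example: one Hindi token is mislabelled as English.
gold = ["hi", "en", "hi", "en", "univ"]
pred = ["hi", "en", "en", "en", "univ"]
scores = per_label_f1(gold, pred)
overall = macro_f1(gold, pred)
```

If some gold labels are themselves wrong, as the manual evaluation in the abstract suggests, a score computed this way would understate the system's true accuracy.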

Related resources

Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text

In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data, developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community w...

POS Tagging of English-Hindi Code-Mixed Social Media Content

Code-mixing is frequently observed in user-generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by the presence of spelling variations, transliteration and non-adherence to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums, an...

Shallow Parsing Pipeline for Hindi-English Code-Mixed Social Media Text

POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments

We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and present a new feature set which addresses the transliteration problem inherent in social media. We achieve an 84% accuracy with the new feature set. We show that the context and joint modeling of language detection and POS tag layers do...

Sentiment Identification in Code-Mixed Social Media Text

Sentiment analysis is the Natural Language Processing (NLP) task dealing with the detection and classification of sentiments in texts. While some tasks deal with identifying the presence of sentiment in text (subjectivity analysis), other tasks aim at determining the polarity of the text, categorizing it as positive, negative or neutral. Whenever there is presence of sentiment in text, it has a s...

Journal

Journal title: Journal of Open Humanities Data

Year: 2021

ISSN: 2059-481X

DOI: https://doi.org/10.5334/johd.44