Encoding transliteration variation through dimensionality reduction: FIRE Shared Task on Transliterated Search

Authors

  • Parth Gupta
  • Paolo Rosso
  • Rafael E. Banchs
Abstract

For various reasons, a large amount of user-generated Web content exists in Roman script for languages that are written in indigenous scripts. In light of this phenomenon, search engines face the non-trivial problem of matching queries and documents in the transliterated space, where transliterated content contains extensive spelling variation. This paper describes our proposed method for handling such variation through non-linear dimensionality reduction techniques. The approach achieves a mean reciprocal rank (MRR) as high as 0.84 on the main task of ad hoc retrieval.
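For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean reciprocal rank is conventionally computed: for each query, take the reciprocal of the rank of the first relevant document (0 if none is retrieved), then average over queries. The function name, the argument layout, and the document identifiers are illustrative assumptions, not taken from the paper or from the shared task's evaluation scripts.

```python
from typing import Hashable, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> float:
    """Average, over queries, of 1 / rank of the first relevant document.

    rankings[i] is the ranked result list for query i (best first);
    relevant[i] is the set of documents judged relevant for query i.
    Both names are hypothetical and chosen only for this illustration.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0

    total = 0.0
    for ranked_docs, relevant_docs in zip(rankings, relevant):
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant_docs:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Made-up example: first relevant hits at ranks 1 and 3, and no hit at all.
example_rankings = [["d1", "d2"], ["d3", "d4", "d5"], ["d6"]]
example_relevant = [{"d1"}, {"d5"}, {"d9"}]
example_mrr = mean_reciprocal_rank(example_rankings, example_relevant)
```

Under this definition, an MRR of 0.84 means the first relevant document tends to appear at or very near the top of the ranked list.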


Similar resources

LIGA and Syllabification Approach for Language Identification and Back Transliteration : Shared Task Report by DAIICT

This paper addresses Subtask 1 of the Shared Task on Transliterated Search, a task in FIRE ’14. The task addresses the problem of data containing English words and transliterated words of Indian languages in English. The task calls for language identification and subsequent back transliteration into the native Indian scripts. The system proposed herewith implements Language ...


Mixed Script Ad hoc Retrieval using back transliteration and phrase matching through bigram indexing: Shared Task report by BIT, Mesra

This paper describes an approach for mixed-script ad hoc retrieval, a subtask of the FIRE 2015 Shared Task on Mixed Script Information Retrieval. We participated in subtask 2 of the shared task, where a statistical model was used to carry out back transliteration to Devanagari script. To perform the search, a bigram-based index of the documents was used and search was performed using pivot t...


Machine Learning Approach for Language Identification & Transliteration: Shared Task Report of IITP-TS

In this paper, we describe the system that we developed as part of our participation in the FIRE-2014 Shared Task on Transliterated Search. We participated only in Subtask 1, which focused on labeling the query words. The entire process consists of the following subtasks: language identification of each word in the text, named entity recognition and classification (NERC) and transliteration of t...


A Relevance feedback based approach for mixed script transliterated text search: Shared Task report by BIT Mesra, India

This paper describes the experiments carried out as part of our participation in the FIRE-2014 Transliterated Search Shared Task. We participated in subtask 2 and submitted two results generated by systems based on a relevance feedback approach. Given a collection of documents in mixed script, the task is to retrieve relevant documents using queries in either script. The spelling variation between dif...


Incorporating Pronunciation Variation into Extraction of Transliterated-term Pairs from Web Corpora

A novel approach to automatically extracting transliterated-term pairs from Web corpora is proposed in this paper. One of the most important issues addressed is that of taking pronunciation variation into account. Pronunciation variation is a phenomenon of pronunciation ambiguity that seriously affects the term transliteration and hence affects those results produced by transliteration processe...




Publication date: 2013