Trillions of Comparable Documents

Authors

  • Pascale Fung
  • Emmanuel Prochasson
  • Simon Shi
Abstract

We propose a novel multilingual Web crawler and sentence mining system to continuously mine and extract parallel sentences from trillions of websites, unconstrained by domain, URL structure, or publication date. The system is divided into three main modules: a Web crawler, comparable and parallel website matching, and parallel sentence extraction. Previous methods for mining parallel sentences from the Web focus on specific websites, such as newspaper agencies, or on sites sharing the same URL parents. The output of these previous systems is limited in scope and static in nature. As the Web is boundless and growing, we propose to continuously crawl the Web and update the pool of extracted parallel sentences. One main objective of our work is to improve statistical machine translation systems. Another objective is to take advantage of heterogeneous website documents to discover parallel sentences in hitherto undiscovered domains and genres, such as user-generated content. We investigate a host of recall-oriented vs. precision-oriented algorithms for comparable and parallel document matching, as well as for parallel sentence extraction. In the future, this system can be extended to mine other monolingual or bilingual linguistic resources from the Web.
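
The three-module structure described in the abstract can be pictured with a toy sketch. The abstract does not specify the algorithms used, so everything below is an assumption for illustration only: the in-memory `PAGES` dictionary stands in for the crawler, the tiny French-English `LEXICON` is hypothetical, and the Jaccard lexical overlap is a simple stand-in for the matching and extraction methods the authors investigate. A high `threshold` behaves in a precision-oriented way, a low one in a recall-oriented way.

```python
import re
from itertools import product

# Hypothetical toy French -> English lexicon (an assumption, not from the paper).
LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps", "chien": "dog", "mange": "eats"}

# Hypothetical in-memory "Web": URL -> (language, text). Stands in for real crawling.
PAGES = {
    "a.example/fr": ("fr", "Le chat dort. Le chien mange."),
    "a.example/en": ("en", "The cat sleeps. The dog eats."),
    "b.example/en": ("en", "Stock prices fell sharply today."),
}

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def translate_tokens(text, lexicon):
    """Translate known source tokens through the lexicon; unknown tokens are dropped."""
    return {lexicon[t] for t in tokenize(text) if t in lexicon}

def overlap(a, b):
    """Jaccard overlap between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def crawl(pages):
    """Module 1 stand-in: yield (url, lang, text) records."""
    for url, (lang, text) in pages.items():
        yield url, lang, text

def match_documents(docs, lexicon, threshold):
    """Module 2 stand-in: pair French and English documents by lexical overlap."""
    src = [d for d in docs if d[1] == "fr"]
    tgt = [d for d in docs if d[1] == "en"]
    return [
        (s, t)
        for s, t in product(src, tgt)
        if overlap(translate_tokens(s[2], lexicon), tokenize(t[2])) >= threshold
    ]

def extract_parallel_sentences(pair, lexicon, threshold):
    """Module 3 stand-in: keep sentence pairs whose overlap reaches the threshold."""
    s, t = pair
    src_sents = [x.strip() for x in s[2].split(".") if x.strip()]
    tgt_sents = [x.strip() for x in t[2].split(".") if x.strip()]
    return [
        (ss, ts)
        for ss, ts in product(src_sents, tgt_sents)
        if overlap(translate_tokens(ss, lexicon), tokenize(ts)) >= threshold
    ]

docs = list(crawl(PAGES))
pairs = match_documents(docs, LEXICON, threshold=0.5)
strict = extract_parallel_sentences(pairs[0], LEXICON, threshold=0.5)
loose = extract_parallel_sentences(pairs[0], LEXICON, threshold=0.1)
```

With these toy inputs, the strict threshold keeps only the two correctly aligned sentence pairs, while the loose threshold also admits the two mismatched pairs that share only the word "the". That is the precision versus recall trade-off the abstract refers to, shown here in its simplest form.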

Similar resources

Trading Partners

By now, the high expectations for business-to-business (B2B) applications have become familiar. Analysts have projected trillions of dollars of B2B transactions within a few years. Companies have begun to deploy B2B integration servers, which connect to existing back-end applications, and send and receive Extensible Markup Language (XML) documents over the Internet to automate business relatio...

Learning Comparable Corpora from Latent Semantic Analysis Simplified Document Space

Focusing on a systematic Latent Semantic Analysis (LSA) and Machine Learning (ML) approach, this research contributes to the development of a methodology for the automatic compilation of comparable collections of documents. Its originality lies within the delineation of relevant comparability characteristics of similar documents in line with an established definition of comparable corpora. Thes...

Sentence Alignment in Parallel, Comparable, and Quasi-comparable Corpora

We explore the usability of different bilingual corpora for the purpose of multilingual and cross-lingual natural language processing. The usability of a bilingual corpus is evaluated by the lexical alignment score calculated for the bi-lexicon pair distributed in the aligned bilingual sentence pairs. We compare and contrast a number of bilingual corpora, ranging from parallel, to comparable, and...

Digital Management of Literary and Artistic Property Rights

With the emergence and development of ICT, especially the Internet, intellectual property law has been facing new challenges. Electronic tools and other new ICTs have provided new and unique opportunities for humanity to produce and duplicate works; however, they have also increased the potential for breaches of authors' rights to a degree not comparable to the tools used in the past few decades....

LINA: Identifying Comparable Documents from Wikipedia

This paper describes the LINA system for the BUCC 2015 shared track. Following (Enright and Kondrak, 2007), our system identifies comparable documents by collecting counts of hapax words. We extend this method by filtering out document pairs sharing target documents using pigeonhole reasoning and cross-lingual information.

Publication date: 2011