Trillions of Comparable Documents
Authors
Abstract
We propose a novel multilingual Web crawler and sentence mining system that continuously mines and extracts parallel sentences from trillions of websites, unconstrained by domain, URL structure, or publication date. The system is divided into three main modules: a Web crawler, comparable and parallel website matching, and parallel sentence extraction. Previous methods for mining parallel sentences from the Web focus on specific websites, such as newspaper agencies, or on sites sharing the same URL parents. The output of these previous systems is limited in scope and static in nature. As the Web is boundless and growing, we propose to continuously crawl the Web and update the pool of extracted parallel sentences. One main objective of our work is to improve statistical machine translation systems. Another objective is to take advantage of heterogeneous website documents to discover parallel sentences in hitherto undiscovered domains and genres, such as user-generated content. We investigate a host of recall-oriented vs. precision-oriented algorithms for comparable and parallel document matching, as well as for parallel sentence extraction. In the future, this system can be extended to mine other monolingual or bilingual linguistic resources from the Web.
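The abstract does not say which matching algorithms were investigated, so the following is only a minimal sketch of the recall-oriented vs. precision-oriented trade-off. It assumes a hypothetical toy lexicon (`LEXICON`), a Jaccard overlap score, and a single tunable threshold, none of which come from the paper. A low threshold keeps more candidate pairs (recall-oriented); a high threshold keeps only near-exact matches (precision-oriented).

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy bilingual lexicon (source word -> target word).
# Illustrative only; not taken from the paper.
LEXICON: Dict[str, str] = {
    "haus": "house",
    "katze": "cat",
    "hund": "dog",
    "rot": "red",
}


def translate_tokens(text: str) -> Set[str]:
    """Map source-language tokens to target-language words via the toy lexicon."""
    return {LEXICON[tok] for tok in text.lower().split() if tok in LEXICON}


def overlap_score(src: str, tgt: str) -> float:
    """Jaccard overlap between translated source tokens and target tokens."""
    translated = translate_tokens(src)
    target = set(tgt.lower().split())
    union = translated | target
    return len(translated & target) / len(union) if union else 0.0


def match_pairs(
    src_items: List[str], tgt_items: List[str], threshold: float
) -> List[Tuple[str, str]]:
    """Return every (source, target) pair whose overlap meets the threshold."""
    return [
        (s, t)
        for s in src_items
        for t in tgt_items
        if overlap_score(s, t) >= threshold
    ]
```

The same function could in principle be applied at document level (matching) and at sentence level (extraction), with a looser threshold in the earlier stage so that candidates are not discarded too soon; that staging is an assumption of this sketch, not a claim about the described system.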
Similar resources
Trading Partners
By now, the high expectations for business-to-business (B2B) applications have become familiar. Analysts have projected trillions of dollars of B2B transactions within a few years. Companies have begun to deploy B2B integration servers, which connect to existing back-end applications, and send and receive Extensible Markup Language (XML) documents over the Internet to automate business relatio...
Learning Comparable Corpora from Latent Semantic Analysis Simplified Document Space
Focusing on a systematic Latent Semantic Analysis (LSA) and Machine Learning (ML) approach, this research contributes to the development of a methodology for the automatic compilation of comparable collections of documents. Its originality lies within the delineation of relevant comparability characteristics of similar documents in line with an established definition of comparable corpora. Thes...
Sentence Alignment in Parallel, Comparable, and Quasi-comparable Corpora
We explore the usability of different bilingual corpora for the purpose of multilingual and cross-lingual natural language processing. The usability of a bilingual corpus is evaluated by the lexical alignment score calculated for the bi-lexicon pair distributed in the aligned bilingual sentence pairs. We compare and contrast a number of bilingual corpora, ranging from parallel, to comparable, and...
Digital Management of Literary and Artistic Property Rights
With the emergence and development of ICT, especially the Internet, intellectual property law has been facing new challenges. Electronic tools and other new ICTs have provided new and unique opportunities for humanity to produce and duplicate works; however, they have also increased the potential for breaches of authors' rights to a degree not comparable to the tools used in the past few decades....
LINA: Identifying Comparable Documents from Wikipedia
This paper describes the LINA system for the BUCC 2015 shared track. Following (Enright and Kondrak, 2007), our system identifies comparable documents by collecting counts of hapax words. We extend this method by filtering out document pairs sharing target documents using pigeonhole reasoning and cross-lingual information.
Journal title:
Volume / Issue
Pages -
Publication date: 2011