A Formalism for Universal Segmentation of Text
Abstract
Sumo is a formalism for universal segmentation of text. Its purpose is to provide a framework for the creation of segmentation applications. It is called "universal" because the formalism itself is independent of the language of the documents to process and of the levels of segmentation (e.g. words, sentences, paragraphs, morphemes) considered by the target application. The framework relies on a layered structure representing the possible segmentations of the document. This structure and the tools to manipulate it are described, followed by detailed examples highlighting some features of Sumo.
Similar resources
A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling
In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...
Spécification et réalisation d'un formalisme générique pour la segmentation multiple de documents textuels multilingues
The issue of word segmentation, or tokenization, is often treated as a trivial matter because of the use of separators in writing. The rise of the Internet and the Web led to the availability of millions of documents in countless languages, which in turn led to a renewed interest in multilingual applications. These applications rapidly showed the limitations of the simplistic approaches...
Publication date: 2000