Keywords: page segmentation

Structure detection and segmentation of documents using 2D stochastic context-free grammars

Journal: Neurocomputing 2015

Francisco Alvaro, Francisco Cruz Fernandez, Joan-Andreu Sánchez, Oriol Ramos Terrades, José-Miguel Benedí

In this paper we define a bidimensional extension of Stochastic Context-Free Grammars for structure detection and segmentation of images of documents. Two sets of text classification features are used to perform an initial classification of each zone of the page. Then, the document segmentation is obtained as the most likely hypothesis according to a stochastic grammar. We used a dataset of his...

A Unified Algorithm for Identification of Various Tabular Structures from Document Images

Journal: IJDLS 2011

Sekhar Mandal, Amit Kumar Das, Partha Bhowmick, Bhabatosh Chanda

This paper presents a unified algorithm for segmentation and identification of various tabular structures from document page images. Such tabular structures include conventional tables and displayed mathzones, as well as Table of

Page Segmentation Using Script Identification Vectors: A First Look

1997

Judith Hochberg, Michael Cannon, Patrick Kelly, James White

This paper explores the use of script identification vectors in the analysis of multilingual document images. A script identification vector is calculated for each connected component in a document. The vector expresses the closest distance between the component and templates developed for each of thirteen scripts, including Arabic, Chinese, Cyrillic, and Roman. We calculate the first three pri...
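
The idea of a script identification vector described in this abstract can be illustrated with a minimal sketch. The abstract does not say how components are represented or which distance is used, so the tuple-of-numbers features, the Euclidean distance, and the names `script_id_vector` and `templates_by_script` are all assumptions made for illustration, not the authors' method.

```python
import math


def script_id_vector(component, templates_by_script):
    """Map each script to the distance from `component` to its closest template.

    `component` and every template are equal-length numeric feature tuples
    (a hypothetical representation; the abstract does not specify one).
    """
    return {
        script: min(math.dist(component, template) for template in templates)
        for script, templates in templates_by_script.items()
    }


# Toy usage with two scripts and made-up 2-D features.
toy_templates = {
    "Roman": [(0.0, 0.0), (3.0, 4.0)],
    "Cyrillic": [(6.0, 8.0)],
}
vector = script_id_vector((0.0, 0.0), toy_templates)
```

With thirteen scripts, as in the abstract, the resulting vector would have thirteen entries, one per script.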

A Quantitative Comparison of Semantic Web Page Segmentation Approaches

2015

Robert Kreuzer, Jurriaan Hage, A. J. Feelders

This paper explores the effectiveness of different semantic web page segmentation algorithms on modern websites. We compare three known algorithms each serving as an example of a particular approach to the problem, and one self-developed algorithm, WebTerrain, that combines two of the approaches. With our testing framework we have compared the performance of four algorithms for a large benchmar...

Markov Random Field Models to Extract The Layout of Complex Handwritten Documents

2006

Stéphane Nicolas, Thierry Paquet, Laurent Heutte

In this paper we consider the problem of segmenting complex handwritten pages such as novelists' drafts or authorial manuscripts. We propose to use stochastic and contextual models in order to cope with local spatial variability and to take into account prior knowledge about the global structure of the document image. The models we propose to use are Markov Random Field models. Using this ...

Performance Comparison of Six Algorithms for Page Segmentation

2006

Faisal Shafait, Daniel Keysers, Thomas M. Breuel

This paper presents a quantitative comparison of six algorithms for page segmentation: X-Y cut, smearing, whitespace analysis, constrained text-line finding, Docstrum, and Voronoi-diagram-based. The evaluation is performed using a subset of the UW-III collection commonly used for evaluation, with a separate training set for parameter optimization. We compare the results using both default param...
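
As an illustration of the first algorithm named in this abstract, a recursive X-Y cut can be sketched as below. This is a minimal sketch of the general technique, not the implementation evaluated in the paper: the binary-image representation (a list of rows of 0/1 values), the function names, and the `min_gap` threshold are assumptions.

```python
def _gaps(profile, min_gap):
    """Return (start, end) runs of zeros at least min_gap long, end exclusive."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            if i - start >= min_gap:
                runs.append((start, i))
            start = None
    return runs  # a trailing zero run would be margin, so it is ignored


def xy_cut(img, top=0, left=0, bottom=None, right=None, min_gap=2):
    """Split a binary image into blocks (top, left, bottom, right), ends exclusive."""
    bottom = len(img) if bottom is None else bottom
    right = len(img[0]) if right is None else right

    # Trim empty margins so the region is tight around its ink.
    rows = [r for r in range(top, bottom) if any(img[r][left:right])]
    if not rows:
        return []
    top, bottom = rows[0], rows[-1] + 1
    cols = [c for c in range(left, right)
            if any(img[r][c] for r in range(top, bottom))]
    left, right = cols[0], cols[-1] + 1

    # Projection profiles: ink count per row and per column.
    row_profile = [sum(img[r][left:right]) for r in range(top, bottom)]
    col_profile = [sum(img[r][c] for r in range(top, bottom))
                   for c in range(left, right)]

    def width(gap):
        return gap[1] - gap[0]

    best_row = max(_gaps(row_profile, min_gap), key=width, default=None)
    best_col = max(_gaps(col_profile, min_gap), key=width, default=None)
    if best_row is None and best_col is None:
        return [(top, left, bottom, right)]  # no valley wide enough: leaf block

    row_w = width(best_row) if best_row else 0
    col_w = width(best_col) if best_col else 0
    if row_w >= col_w:  # cut along the widest horizontal valley
        s, e = best_row
        return (xy_cut(img, top, left, top + s, right, min_gap)
                + xy_cut(img, top + e, left, bottom, right, min_gap))
    s, e = best_col  # otherwise cut along the widest vertical valley
    return (xy_cut(img, top, left, bottom, left + s, min_gap)
            + xy_cut(img, top, left + e, bottom, right, min_gap))
```

The sketch always cuts at the widest whitespace valley in either direction; other variants alternate between horizontal and vertical cuts or use separate thresholds per direction.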

Providing Ad Links to Travel Blog Entries Based on Link Types

2011

Aya Ishino, Hidetsugu Nanba, Toshiyuki Takezawa

Content-targeted advertising systems are becoming an increasingly important part of the funding for free web services. These programs automatically find relevant keywords on a web page, and then display ads based on those keywords. We propose a method for providing links to ads for travel products (which we call ad links) automatically. We extract keywords from citing areas of travel informatio...

Using tree-grammars for training set expansion in page classification

2003

Stefano Baldi, Simone Marinai, Giovanni Soda

In this paper we describe a method for the expansion of training sets made by XY trees representing page layout. This approach is appropriate when dealing with page classification based on MXY tree page representations. The basic idea is the use of tree grammars to model the variations in the tree which are caused by segmentation algorithms. A set of general grammatical rules are defined and us...

A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection

2016

Charmi Patel, Hiteishi Diwanji, Shuang Lin, Jie Chen, Zhendong Niu, Dandan Song, Fei Sun, Lejian Liao

A web page contains a large amount of information, including additional content such as hyperlinks, headers, footers, navigation panels, and advertisements, which can complicate content extraction. Page segmentation is used to detect noisy content blocks by detecting malicious URLs in web pages. The main aim of this research is to detect malicious URLs during content extraction by checking ...

Identifying the Defects in Glass Bottles Using Particle Swarm Optimization

2014

Mrs. Anupama, Mr. Prasanna

This paper aims to design and develop a suitable tool for identifying defects in glass bottles through visual inspection based on a segmentation algorithm. Defects are identified in three stages: image acquisition, pre-processing and filtering, and segmentation. In the image acquisition stage, samples of real-time images are taken and converted into monochrome images. In the Pre-pr...
