Layout & Language: Preliminary experiments in assigning logical structure to table cells
Authors
Abstract
We describe a prototype system for assigning table cells to their proper place in the table's logical (relational) structure, based on a simple model of table structure combined with a number of measures of cohesion between cell contents. Preliminary results suggest that very simple string-based cohesion measures are not sufficient for the extraction of relational information, and that future work should pursue the aim of more knowledge/data-intensive approximations to a notional subtype/supertype definition of the relationships between value and label cells.
1 Introduction
Real technical documents are full of text in tabular and other complex layout formats. Most representations of tabular data are layout or geometry-based: in SGML, in particular, Marcy Thompson notes "table markup contains a great deal of information about what a table looks like ... but very little about how the table relates the entries. ... [This] prevents me from doing automated context-based data retrieval or extraction." (Thompson, 1996, p. 151)
1.1 Views of tables
In (Douglas, Hurst, and Quinn, 1995) an analysis of table layout and linguistic characteristics was offered which emphasised the potential importance of linguistic information about the contents of cells to the task of assigning a layout-oriented table representation to the logical relational structure it embodies. Two views of tables were distinguished: a denotational and a functional view.
The denotation is the table viewed as a set of n-tuples, forming a relation between values drawn from n value-sets or domains. Domains typically consist of a set of values with a common supertype in some actual or notional Knowledge Representation scheme. The actual table may also include label cells which typically can be interpreted as a lexicalisation of the common supertype.
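The denotational view described above can be illustrated with a minimal Python sketch. The `Domain` class, the `is_relation_over` helper, and the sample construction-material data are all hypothetical and are not taken from the paper; the sketch only shows what it means for a table to denote a set of n-tuples over labelled domains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """A value set plus the label cell that lexicalises its common supertype."""
    label: str
    values: frozenset


def is_relation_over(tuples, domains):
    """Return True if every tuple has exactly one value per domain,
    and each value is drawn from the corresponding domain."""
    return all(
        len(t) == len(domains)
        and all(v in d.values for v, d in zip(t, domains))
        for t in tuples
    )


# Hypothetical two-column table: a label row plus two value rows.
material = Domain("Material", frozenset({"steel", "timber"}))
finish = Domain("Finish", frozenset({"galvanised", "varnished"}))

# The denotation: a set of 2-tuples relating the two domains.
denotation = {("steel", "galvanised"), ("timber", "varnished")}
```

Here the layout of the table is deliberately absent: the same `denotation` could be printed with domains as columns or as rows without changing the relation it expresses.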
We hypothesize that the contents of value cells and corresponding label cells for a given domain are significantly related in respect of some measures of cohesion that we can identify.
The functional view is a description of how the information presentation aspects of tables embody a decision structure (Wright, 1982) or reading path, which determines the order in which domains are accessed in building or looking up a tuple. To express a given table denotation according to a given functional view, there is a repertoire of layout patterns that express how domains can be grouped and ordered for reading in two dimensions. These layout patterns constitute a syntax of table structure, defining the basic geometric configurations that domain values and labels can appear in.
1.2 An information extraction task
Our application task is shallow information extraction in construction industry specification documents, containing many tables, which come to us via the miracles of OCR as formatted ASCII, e.g., in Figure 1. The predominant argument type of this genre of specification documents can be thought of as a form of 'assignment', similar to that in programming languages. Our aim is to fit each assignment into a frame that contains various elements represented in terms of the sublanguage world model, a simple part-of/type-of knowledge representation. The elements we are looking for are entities, attributes which the KR accepts as appropriate for
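As a rough illustration of the kind of very simple string-based cohesion measure the abstract refers to, the following Python sketch scores how uniform a group of cells is by comparing coarse character-class patterns. This is an assumed stand-in, not one of the paper's actual measures: the `shape` and `cohesion` functions and the sample cells are hypothetical.

```python
import re
from itertools import combinations


def shape(cell: str) -> str:
    """Collapse a cell to a coarse character-class pattern:
    runs of letters become 'a', runs of digits become '9'."""
    s = re.sub(r"[A-Za-z]+", "a", cell.strip())
    return re.sub(r"\d+", "9", s)


def cohesion(cells) -> float:
    """Fraction of cell pairs sharing the same shape (1.0 = fully cohesive).
    A group with fewer than two cells is treated as trivially cohesive."""
    pairs = list(combinations(cells, 2))
    if not pairs:
        return 1.0
    return sum(shape(a) == shape(b) for a, b in pairs) / len(pairs)


# Hypothetical column of value cells versus a mixed group.
column = ["12 mm", "25 mm", "100 mm"]
mixed = ["12 mm", "25 mm", "steel"]
```

A measure like this looks only at surface strings, so it cannot tell that "steel" and "timber" share a supertype, which is consistent with the abstract's finding that such measures are not sufficient on their own.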
Similar resources
Layout and Language: Preliminary Investigations in Recognizing the Structure of Tables
We describe a prototype system for assigning table cells to their proper place in the logical structure of the table, based on a simple model of table structure combined with a number of measures of cohesion between cells. A framework is presented for examining the effect of particular variables on the performance of the system, and preliminary results are presented showing the effect of cohesi...
Tabular Abstraction, Editing, and Formatting
This dissertation investigates the composition of high quality tables with the use of electronic tools. A generic model is designed to support the different stages of tabular composition, including the editing of logical structure, the specification of layout structure, and the formatting of concrete tables. The model separates a table's logical structure from its layout structure, which consists of tab...
Document image analysis with cooperative interaction between layout analysis and logical structure analysis
When a printed document is to be input to a computer system, the document must be converted to a computer-readable format, e.g., ASCII, PDF, RTF, CSV, or SGML/XML/HTML-tagged data. In order to obtain these data formats from a printed document, it is necessary to extract from the printed document as much information as possible, i.e., layout structure (layout objects and their hierarchical relat...
A duality between LM-fuzzy possibility computations and their logical semantics
Let X be a dcpo and let L be a complete lattice. The family σL(X) of all Scott continuous mappings from X to L is a complete lattice under the pointwise order; we call it the L-fuzzy Scott structure on X. Let E be a dcpo. A mapping g : σL(E) → M is called an LM-fuzzy possibility valuation of E if it preserves arbitrary unions. Denote by πLM(E) the set of all LM-fuzzy possibility valuations of E. T...
Bibliographic data extraction from HTML medical journal articles
MEDLINE, a biomedical literature database compiled by the US National Library of Medicine, contains 15 million records from approximately 5000 selected journals, and is searched over 3 million times a day worldwide. With more journal articles being published online in hypertext markup language (HTML), the automatic extraction of bibliographic data from HTML articles is important for creating MED...
Publication date: 1997