Preprocessing and Tokenisation Standards in DELPH-IN Tools
Abstract: We discuss preprocessing and tokenisation standards within DELPH-IN, a large-scale open-source collaboration providing multiple independent multilingual shallow and deep processors. We cover (i) a component-specific XML interface format which has been used for some time to pass preprocessor results to the PET parser, and (ii) our implementation of a more generic XML interface format influenced heavily by the (ISO working draft) Morphosyntactic Annotation Framework (MAF). Our generic format encapsulates the information which may be passed from the preprocessing stage to a parser: it uses standoff annotation, represents structural ambiguity as a lattice, supports intra-annotation dependencies, and allows for highly structured annotation content. This work builds on the existing Heart of Gold middleware system and on previous work on Robust Minimal Recursion Semantics (RMRS) as part of an inter-component interface. We give examples of usage with a number of the DELPH-IN processing components and deep grammars.
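As a rough illustration of two ideas named in the abstract, standoff annotation and a lattice for structural ambiguity, the following minimal Python sketch may help. It is not the XML format described in the paper and is not taken from any DELPH-IN tool: the `Token` class, its field names, the `paths` and `readings` helpers, and the example sentence are all hypothetical, chosen only to show how token hypotheses can point into the raw text by character offset while alternative tokenisations share one acyclic lattice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One lattice edge: a token hypothesis pointing back into the raw text.

    All field names here are illustrative assumptions, not part of any
    DELPH-IN or MAF schema.
    """
    ident: str
    source: int  # lattice node where this edge starts
    target: int  # lattice node where this edge ends
    start: int   # character offset into the raw text (inclusive)
    end: int     # character offset into the raw text (exclusive)


def surface(text, token):
    """Recover the token's surface string from its standoff offsets."""
    return text[token.start:token.end]


def paths(tokens, init, final):
    """Enumerate every token sequence from node init to node final.

    Assumes the lattice is acyclic; a cycle would recurse forever.
    """
    outgoing = {}
    for tok in tokens:
        outgoing.setdefault(tok.source, []).append(tok)

    def walk(node):
        if node == final:
            yield []
            return
        for tok in outgoing.get(node, []):
            for rest in walk(tok.target):
                yield [tok] + rest

    return list(walk(init))


def readings(text, tokens, init, final):
    """Return each complete tokenisation as a list of surface strings."""
    return [[surface(text, t) for t in p] for p in paths(tokens, init, final)]


# The raw text is never modified; tokens only reference spans of it.
TEXT = "Don't panic"
LATTICE = [
    Token("t1", 0, 2, 0, 5),   # "Don't" kept as a single token
    Token("t2", 0, 1, 0, 2),   # "Do"    (alternative split, first half)
    Token("t3", 1, 2, 2, 5),   # "n't"   (alternative split, second half)
    Token("t4", 2, 3, 6, 11),  # "panic"
]

if __name__ == "__main__":
    for reading in readings(TEXT, LATTICE, 0, 3):
        print(reading)
```

Here the two competing tokenisations of "Don't" coexist as parallel paths between lattice nodes 0 and 2, and a downstream parser could consume either path without the preprocessor having to commit to one.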
Publication date: 2006