Preprocessing and Tokenisation Standards in DELPH-IN Tools

Authors

  • Benjamin Waldron
  • Ann A. Copestake
  • Ulrich Schäfer
  • Bernd Kiefer

Abstract

We discuss preprocessing and tokenisation standards within DELPH-IN, a large-scale open-source collaboration providing multiple independent multilingual shallow and deep processors. We discuss (i) a component-specific XML interface format which has been used for some time to pass preprocessor results to the PET parser, and (ii) our implementation of a more generic XML interface format influenced heavily by the (ISO working draft) Morphosyntactic Annotation Framework (MAF). Our generic format encapsulates the information which may be passed from the preprocessing stage to a parser: it uses standoff annotation, a lattice for the representation of structural ambiguity, and intra-annotation dependencies, and it allows for highly structured annotation content. This work builds on the existing Heart of Gold middleware system, and on previous work on Robust Minimal Recursion Semantics (RMRS) as part of an inter-component interface. We give examples of usage with a number of the DELPH-IN processing components and deep grammars.
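To make the two central ideas of the abstract concrete (standoff annotation and a lattice that represents tokenisation ambiguity), the following is a minimal Python sketch. It is an illustration only and does not reproduce the XML formats discussed in the paper: the `Token` class, its field names, and the `surface` and `paths` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A lattice edge that points into the text instead of copying it.

    All field names are assumptions made for this sketch.
    """
    ident: str
    source: int  # lattice node where the edge starts
    target: int  # lattice node where the edge ends
    cfrom: int   # character offset of the first character
    cto: int     # character offset one past the last character


TEXT = "Don't stop"

# Two competing tokenisations of "Don't" share the same lattice:
# one edge 0 -> 2 for the whole word, or two edges 0 -> 1 -> 2.
TOKENS = [
    Token("t1", 0, 2, 0, 5),   # Don't
    Token("t2", 0, 1, 0, 2),   # Do
    Token("t3", 1, 2, 2, 5),   # n't
    Token("t4", 2, 3, 6, 10),  # stop
]


def surface(text, token):
    """Recover a token's string from its standoff offsets."""
    return text[token.cfrom:token.cto]


def paths(tokens, start, end):
    """Enumerate every path through the lattice from start to end."""
    by_source = {}
    for token in tokens:
        by_source.setdefault(token.source, []).append(token)

    def walk(node):
        if node == end:
            yield []
            return
        for token in by_source.get(node, []):
            for rest in walk(token.target):
                yield [token] + rest

    return list(walk(start))


for path in paths(TOKENS, 0, 3):
    print([surface(TEXT, token) for token in path])
```

Because each edge stores only offsets, the original text is never modified, and a parser consuming the lattice can choose between the alternative tokenisations rather than having one imposed by the preprocessor.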


Publication date: 2006