TimeML-Compliant Text Analysis for Temporal Reasoning

Authors

  • Branimir Boguraev
  • Rie Kubota Ando
Abstract

Reasoning with time needs more than just a list of temporal expressions. TimeML, an emerging standard for temporal annotation and a language capturing properties and relationships among time-denoting expressions and events in text, is a good starting point for bridging the gap between temporal analysis of documents and reasoning with the information derived from them. Hard as TimeML-compliant analysis is, the small size of the only currently available annotated corpus makes it even harder. We address this problem with a hybrid TimeML annotator, which uses cascaded finite-state grammars (for temporal expression analysis, shallow syntactic parsing, and feature generation) together with a machine learning component capable of effectively using large amounts of unannotated data.

1 Temporal Analysis of Documents

Many information extraction tasks limit analysis of time to identifying a narrow class of time expressions, which literally specify a temporal point or an interval. For instance, a recent (2004) ACE task is that of temporal expression recognition and normalisation (TERN; see http://timex2.mitre.org/tern.html). It targets absolute date/time specifications (e.g. “June 15th, 1998”), descriptions of intervals (“three semesters”), referential (relative) expressions (“last week”), and so forth. A fraction of such expressions may include a relational component (“the two weeks since the conference”, “a month of delays following the disclosure”), making them event-anchored; however, the majority refer only to what in a more syntactic framework would be considered a ‘temporal adjunct’. The TERN task thus does not address the general question of associating a time stamp with an event.

Deeper document analysis requires awareness of temporal aspects of discourse. Several applications have recently started addressing some issues of time.
Document summarisation tackles identification and normalisation of time expressions [Mani & Wilson, 2000], time stamping of event clauses [Filatova and Hovy, 2001], and temporal ordering of events in news [Mani et al., 2003]. Operational question answering (QA) systems can now (under certain conditions) answer e.g. ‘when’ or ‘how long’ questions [Prager et al., 2003].

Beyond manipulation of temporal expressions, advanced content analysis projects are beginning to define operational requirements for, in effect, temporal reasoning. More sophisticated QA, for instance, needs more than just information derived from ‘bare’ temporal markers [Pustejovsky et al., 2003; Schilder & Habel, 2003]. Intelligence analysis typically handles contradictory information, while looking for mutually corroborating facts; for this, temporal relations within such an information space are essential. Multi-document summarisation crucially requires temporal ordering over events described across the collection.

A temporal reasoner requires a framework capturing the ways in which relationships among entities are described in text, anchored in time, and related to each other. Related are questions of defining a representation that can accommodate components of a temporal structure, and implementing a text analysis process for instantiating such a structure.

This paper describes an effort towards an analytical framework for detailed time information extraction. We sketch the temporal reasoning component which is the ultimate ‘client’ of the analysis. We motivate our choice of TimeML, an emerging standard for annotation of temporal information in text, as a representational framework; in the process, we highlight TimeML’s main features, and characterise a mapping from a TimeML-compliant representation to an isomorphic set of time-points and intervals expected by the reasoner.

[Footnote: This work was supported by the ARDA NIMD (Novel Intelligence and Massive Data) program PNWD-SW-6059.]
We develop a strategy for time analysis of text, a synergistic approach deploying both finite-state (FS) grammars and machine learning techniques. The respective strengths of these technologies are well suited for the challenges of the task: complexity of analysis, and paucity of examples of TimeML-style annotation. A complex cascade of FS grammars targets certain components of TimeML (time expressions, in particular), identifies syntactic clues for marking other components (related to temporal links), and derives features for use by machine learning. The training is on a TimeML annotated corpus; given the small (and thus problematic for training) size of the only (so far) available reference corpus (TimeBank), we incorporate a learning strategy developed to leverage large volumes of unlabeled data. To our knowledge, this is the first attempt to use the representational principles of TimeML for practical analysis of time. This is also the first use of a TimeML corpus as reference data for implementing temporal analysis.

2 Motivation: Reasoning with Time

We are motivated by developing a useful, and reusable, temporal analysis framework, where ‘downstream’ applications are enabled to reason and draw inferences over time elements. A hybrid reasoner [Fikes et al., 2003], to be deployed in intelligence analysis, maintains a directed graph of time points, intervals defined via start and end points, and temporal relations such as BEFORE, AFTER, and EQUAL POINT. The graph is assumed generated via a mapping process, external to the reasoner, from a (temporal) text analysis. Relations are operationalised, and temporal algebra evaluates instances, draws inference over goals, and broadens a base of inferred assertions on the basis of relational axioms. An example within the reasoner’s inferential capability is:

(find instances of ?int such that (during ?int 2003)).
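The query above can be pictured with a small sketch. The Python below is not the reasoner of [Fikes et al., 2003]; it is a minimal illustration, under stated assumptions, of a directed graph of time points in which an edge means BEFORE, intervals are pairs of start and end points, and during is answered by reachability. The class name TimeGraph, its methods, and the interval names "summit" and "trial" are all hypothetical, and during is treated here as allowing shared endpoints.

```python
class TimeGraph:
    """Directed graph of time points; an edge a -> b asserts a BEFORE b.

    Hypothetical illustration only, not the reasoner described in the text.
    """

    def __init__(self):
        self.later = {}      # time point -> set of points asserted to be later
        self.intervals = {}  # interval name -> (start point, end point)

    def add_before(self, earlier, later):
        self.later.setdefault(earlier, set()).add(later)

    def add_interval(self, name, start, end):
        self.intervals[name] = (start, end)
        self.add_before(start, end)

    def not_after(self, a, b):
        """True if a equals b, or b is reachable from a over BEFORE edges."""
        seen, stack = {a}, [a]
        while stack:
            point = stack.pop()
            if point == b:
                return True
            for nxt in self.later.get(point, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def during(self, inner, outer):
        """Assumed reading: inner starts no earlier and ends no later than outer."""
        s_in, e_in = self.intervals[inner]
        s_out, e_out = self.intervals[outer]
        return self.not_after(s_out, s_in) and self.not_after(e_in, e_out)

    def find_during(self, outer):
        """Analogue of: find instances of ?int such that (during ?int outer)."""
        return sorted(n for n in self.intervals
                      if n != outer and self.during(n, outer))


g = TimeGraph()
g.add_interval("2003", "2003-start", "2003-end")
g.add_interval("summit", "summit-start", "summit-end")
g.add_interval("trial", "trial-start", "trial-end")
g.add_before("2003-start", "summit-start")  # summit lies inside 2003
g.add_before("summit-end", "2003-end")
g.add_before("trial-start", "2003-start")   # trial began before 2003
g.add_before("trial-end", "2003-end")
print(g.find_during("2003"))  # ['summit']
```

Only "summit" is returned because the start of "trial" cannot be shown to follow the start of 2003; a fuller reasoner would also apply relational axioms to broaden the set of inferred assertions.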
Reasoning with relations such as during (associating an event with an interval) and costarts (associating two events), instantiated for the example fragment

“On 9 August Iran accuses the Taliban of taking 9 diplomats and 35 truck drivers hostage in Mazar-e-Sharif. The crisis began with that accusation.”

would infer, on the basis of predicates like

(during Iran-accuses-Taliban-take-hostages August-9-1998)
(costarts Iran-accuses-Taliban-take-hostages Iran-Taliban-Crisis)

that the answer to the question “When did the Iranian-Taliban crisis begin?” is “August 9, 1998”.

Details of this inferential process need not concern us here. We gloss over issues like enumerating the range of temporal relations and axioms, describing the reasoner’s model of events (e.g. Iran-accuses-Taliban-take-hostages), and elaborating its notion of ‘a point in time’ (subsuming both literal expressions and event specifications). Operationally, a separate component maps temporal analysis results to a suitably neutral, and expressive, ontological representation of time (DAML-Time [Hobbs et al., 2002]). This allows a representation hospitable to a first-order logic inference formalism, like the one assumed in Hobbs et al., to be kept separate from surface text analysis, much like the traditional separation along the syntax-semantics interface.

We start from the belief that the representation for the reasoner is derivable from a TimeML-compliant text analysis (see footnote 2). TimeML is a proposal for annotating time information; e.g. the first example sentence above would be marked up as:

On 9 August Iran accuses the Taliban of taking 9 diplomats and 35 truck drivers hostage in Mazar-e-Sharif. The crisis began with that accusation.

[Footnote 2: We are not alone: work on temporal reasoning from a formal inference point of view reaches a similar conclusion: “... the [TimeML] annotation scheme itself, due to its closer tie to surface texts, can be used as the first pass in the syntax-semantics interface of a temporal resolution framework such as ours. The more complex representation, suitable for more sophisticated reasoning, can then be obtained by translating from the annotations.” [Han & Lavie, 2004].]

TimeML is described in Section 3. Essentially, it promotes explicit representation and typing of time expressions and events, and an equally explicit mechanism for linking these with temporal links, using a vocabulary of temporal relations. In addition to in-line mark-up, explicit links are marked. Event instance identifiers ei1, ei2, and ei4 refer to, respectively, the accusation in the first sentence, the crisis, and the reference to “that accusation” in the second sentence. The relType attributes on the link descriptions define temporal relationships between event instances and time expressions; in this particular example, an IDENTITY link encodes the coreference between the two mentions of the accusation event, one in each sentence. It is the combination of event descriptors, their anchoring to time points, and the semantics of relational links that enables the derivation of the during and costarts associations that the reasoner understands.

3 TimeML: a Mark-up Language for Time

Most content analysis applications to date do not explicitly incorporate temporal reasoning, and their needs can be met by analysis of simple time expressions (dates, intervals, etc.). This is largely the motivation for TERN’s TIMEX2 tag; at the same time it explains why TIMEX2 is inadequate for supporting the representational requirements outlined earlier (see footnote 3). TimeML aims at capturing the richness of time information in documents. It marks up more than just temporal expressions, and focuses on ways of systematically anchoring event predicates to time-denoting expressions, and on ordering such event expressions relative to each other.
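To make the hostage example of Section 2 concrete, the following Python sketch shows one way that link descriptions of the kind discussed there could be turned into the during and costarts predicates. It is an illustration under assumptions, not the authors' mapping component: the identifiers ei1, ei2, and ei4 and the IDENTITY link come from the text, but the time expression id t1, the other two relType values (IS_INCLUDED and BEGUN_BY), the direction of the IDENTITY link, and the rule mapping each relType to a predicate are all assumed for the purpose of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TLink:
    source: str    # event instance id
    target: str    # event instance id or time expression id
    rel_type: str  # TimeML relType value


def derive_predicates(tlinks, names):
    """Turn temporal links into reasoner-style predicates (assumed mapping)."""
    # IDENTITY links merge coreferent event instances onto one representative.
    alias = {l.source: l.target for l in tlinks if l.rel_type == "IDENTITY"}

    def resolve(ident):
        seen = set()
        while ident in alias and ident not in seen:
            seen.add(ident)
            ident = alias[ident]
        return ident

    predicates = []
    for link in tlinks:
        src, tgt = resolve(link.source), resolve(link.target)
        if link.rel_type == "IS_INCLUDED":
            # Assumption: an event included in a time maps to `during`.
            predicates.append(("during", names[src], names[tgt]))
        elif link.rel_type == "BEGUN_BY":
            # Assumption: "X begun by Y" maps to (costarts Y X).
            predicates.append(("costarts", names[tgt], names[src]))
    return predicates


# Hypothetical link inventory for the two example sentences.
links = [
    TLink("ei1", "t1", "IS_INCLUDED"),  # accusation anchored to "9 August"
    TLink("ei2", "ei4", "BEGUN_BY"),    # crisis begun by "that accusation"
    TLink("ei4", "ei1", "IDENTITY"),    # "that accusation" is the accusation
]
names = {
    "ei1": "Iran-accuses-Taliban-take-hostages",
    "ei2": "Iran-Taliban-Crisis",
    "t1": "August-9-1998",
}
for predicate in derive_predicates(links, names):
    print(predicate)
```

Resolving ei4 through the IDENTITY link before emitting predicates is what lets the costarts predicate mention the accusation of the first sentence rather than its second mention.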
TimeML derives higher expressiveness from explicitly separating representation of temporal expressions from that of events. Time analysis is distributed across four component structures: TIMEX3, SIGNAL, EVENT, and LINK; all are rendered as tags, with attributes [Saurí et al., 2004] (see footnote 4).

[Footnote 3: For a notable extension to TIMEX2, see [Gaizauskas & Setzer, 2002]. An attempt to codify some relational information linking the TIMEX with an event, it is still limited, both in terms of scope (only links with certain syntactic shape can be captured) and representational power (it is hard to separate an event mention from possibly multiple event instances); see [Pustejovsky et al., 2003].]

[Footnote 4: Additionally, a MAKEINSTANCE tag embodies the difference between event tokens and event instances: for example, the analysis of “Max taught on Monday and Tuesday” requires two different instances to be created for a teaching EVENT. Even if typically there is a one-to-one mapping between an EVENT and an instance, the language requires that a realisation of that event is created.]

TIMEX3 extends the TIMEX2 [Ferro, 2001] attributes (see footnote 5): it captures temporal expressions (commonly categorised as DATE, TIME, DURATION), both literal and intensionally specified. SIGNAL tags are (typically) function words indicative of relationships between temporal objects: temporal prepositions (for, during, etc.) or temporal connectives (before, while). EVENT, in TimeML nomenclature, is a cover term for situations that happen or occur; these can be punctual, or last for a period of time. TimeML posits a refined typology of events [Pustejovsky et al., 2003]. All classes of event expressions (tensed verbs, stative adjectives and other modifiers, event nominals) are marked up with suitable attributes on the EVENT tag. Finally, the LINK tag is used to encode a variety of relations that exist between the temporal elements in a document, as well as to establish an explicit ordering of events.
Three subtypes of the LINK tag are used to represent strict temporal relationships between events or between an event and a time (TLINK), subordination between two events or an event and a signal (SLINK), and aspectual relationship between an aspectual event and its argument (ALINK). TimeML’s richer component set, in-line mark-up of temporal primitives, and non-consuming tags for temporal relations across arbitrarily long text spans make it highly compatible with the current paradigm of annotation-based encapsulation of document analysis.

4 TimeML and Temporal Analysis

TimeML’s annotation-based representation facilitates integration of time analysis with the analysis of other syntactic and/or discourse phenomena; it also naturally supports exploitation of larger contextual effects by the temporal parser proper (see 4.4). This is a crucial observation, given that the most attractive characteristic of TimeML, its intrinsic richness of expression, also makes it challenging for analysis. There are two broad categories of problems for developing an automated TimeML analyser: of substance and of infrastructure. Substantive issues include normalising time expressions to a canonical representation (TIMEX3’s value attribute), identifying a broad range of events (e.g. event nominals and predicative adjectives acting as event specifiers), linking time-denoting expressions (typically a TIMEX3 and an EVENT), and typing of those LINKs. The infrastructure problems (small size and less than consistent mark-up of the TimeBank corpus) are due to the fact that this first version is largely a side product of a small number of annotators trying out TimeML’s expressive capabilities. TimeBank is thus intended as a reference, and not for training. Our hybrid approach to temporal parsing, combining finite-state (FS) recognition with machine learning from sparse data (4.2), is largely motivated by this nature of TimeBank.
4.1 The TimeBank corpus

TimeBank has only 186 documents (68.5K words). If we held out 10% of the corpus as test data, we would have barely over 60K words for training. Below we show counts of (EVENT-TIMEX3) TLINK (see footnote 6) and EVENT types [Saurí et al., 2004]. TLINK examples are particularly sparse; the data also shows highly uneven distribution of examples of different types. In comparison, the Penn TreeBank corpus for part-of-speech tagging contains >1M words (> 16 times larger than TimeBank); the CoNLL’03 named entity chunking training set (at http://cnts.uia.ac.be/conll2003/ner/) has over 200K words with 23K examples (15 times more than TLINK examples) over just 4 name classes (compared to the 13 TLINK classes defined by TimeML). TERN’s training set (almost 800 documents/300K words) is considered to be somewhat sparse, with over 8K TIMEX examples.

[Footnote 5: TIMEX2 and TIMEX3 differ substantially in their treatment of event anchoring and sets of times.]

TLINK type      # occurrences      EVENT type      # occurrences
IS_INCLUDED               866      OCCURRENCE              4,452
DURING                    146      STATE                   1,181
ENDS                      102      REPORTING               1,010
SIMULTANEOUS               69      I_ACTION                  668
ENDED_BY                   52      I_STATE                   586
AFTER                      41      ASPECTUAL                 295
BEGINS                     37      PERCEPTION                 51
BEFORE                     35
INCLUDES                   29
BEGUN_BY                   27
IAFTER                      5
IDENTITY                    5
IBEFORE                     1
Total:                  1,451      Total:                  8,243

4.2 Analytical strategy

Minimally, the reasoner would require that the analytical framework supports time stamping and temporal ordering of events; thus we target the analysis tasks of finding TIMEX3s, assigning canonical values, marking and typing EVENTs, and associating (some of) them with TIMEX3 tags. TIMEX3 expressions are naturally amenable to FS description. FS devices can also encode some larger context for time analysis (temporal connectives for marking putative events, clause boundaries for scoping possible event-time pairs, etc.; see 4.4). To complement such analysis, a machine learning approach can cast the problem of marking EVENTs as chunking.
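Casting EVENT marking as chunking can be illustrated with a short sketch. The text does not say which label encoding is used, so the Python below assumes the common B/I/O scheme, in which each token is labelled as beginning a chunk, continuing one, or lying outside any chunk. The function name to_bio and the EVENT class assigned to each word in the example are hypothetical and serve only to show the shape of the training data a chunker would see.

```python
def to_bio(tokens, event_spans):
    """Label tokens with B-/I-/O tags, one chunk per annotated EVENT.

    event_spans holds (start, end_exclusive, event_class) token offsets.
    The B/I/O encoding is an assumption; the source only says "chunking".
    """
    labels = ["O"] * len(tokens)
    for start, end, event_class in event_spans:
        labels[start] = f"B-{event_class}"
        for i in range(start + 1, end):
            labels[i] = f"I-{event_class}"
    return list(zip(tokens, labels))


tokens = "The crisis began with that accusation .".split()
# Illustrative class choices only, not taken from TimeBank.
spans = [(1, 2, "OCCURRENCE"), (2, 3, "ASPECTUAL"), (5, 6, "OCCURRENCE")]
for token, label in to_bio(tokens, spans):
    print(f"{token}\t{label}")
```

With the data in this form, marking and typing EVENTs reduces to predicting one label per token, which is the setting in which features derived by the FS cascade can be supplied to a classifier.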
Recently, [Ando, 2004] has developed a framework for exploiting large amounts of unannotated corpora in supervised learning for chunking. In such a framework, mid-to-high-level syntactic parsing, typically derived by FS cascades, can produce rich features for classifiers. Thus, we combine FS grammars for temporal expressions, embedded in a general purpose shallow parser, with machine learning trained with TimeBank and unannotated corpora.

[Footnote 6: In all of our experiments we exclude TIMEX3 markup in metadata; the TLINK counts only reflect links to temporal expressions in the body of documents.]

4.3 FS-based parser for temporal expressions

Viewing TIMEX3 analysis as an information extraction task, a cascade of finite-state grammars with broad coverage (compiled down to a single TIMEX3 automaton with 500 states and over 16000 transitions) targets abstract temporal entities such as UNIT, POINT, PERIOD, RELATION, etc.; these may be further decomposed and typed into e.g. MONTH, DAY, YEAR (for a UNIT); or INTERVAL or DURATION (for a PERIOD). Fine-grained analysis of temporal expressions, instantiating attributes like granularity, cardinality, ref_direction, and so forth, is crucially required for normalising a TIMEX3: representing “the last five years” as illustrated below facilitates the derivation of a value for the TIMEX3 value attribute.

[timex : [relative : true ] [ref_direction : past ]
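As a rough illustration of how fine-grained analysis of an expression such as “the last five years” supports normalisation, the Python sketch below extracts a few attributes and derives an ISO 8601 duration string. A single regular expression stands in for the FS cascade, which is a deliberate simplification. Only the relative and ref_direction attributes are visible in the structure above; the flat dictionary layout, the way cardinality and granularity are filled, and the function name analyse are assumptions made for the example.

```python
import re

# Small hypothetical lexicons; a real grammar would cover far more.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
UNIT_CODES = {"day": "D", "week": "W", "month": "M", "year": "Y"}

PATTERN = re.compile(
    r"\bthe\s+(?P<dir>last|past|next)\s+(?P<num>\w+)\s+"
    r"(?P<unit>day|week|month|year)s?\b",
    re.IGNORECASE,
)


def analyse(text):
    """Return assumed TIMEX3-style features for a relative period, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    word = match.group("num").lower()
    count = NUMBER_WORDS.get(word)
    if count is None and word.isdigit():
        count = int(word)
    if count is None:
        return None
    unit = match.group("unit").lower()
    direction = "future" if match.group("dir").lower() == "next" else "past"
    return {
        "relative": True,
        "ref_direction": direction,
        "cardinality": count,
        "granularity": unit,
        "value": f"P{count}{UNIT_CODES[unit]}",  # ISO 8601 duration
    }


print(analyse("sales fell over the last five years"))
```

Separating cardinality and granularity from the surface string is what makes the final step a simple formatting operation; anchoring the period to a reference date, which full normalisation would also require, is left out of this sketch.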

Related resources

TimeML-Compliant Analysis of Text Documents

Reasoning with temporal information requires a representation of time considerably more involved than just a list of temporal expressions, which typically define the extent of current time extraction efforts. TimeML is an emerging standard for temporal annotation, defining a language for expressing properties and relationships among time-denoting expressions and events in free text. This paper t...

Annotating and Reasoning about Time and Events

In this paper we discuss the relationship between TimeML, a rich specification language for event and temporal expressions in text, and the interpretation of these expressions in a temporal semantics. Specifically, we propose to demonstrate how a TimeML markup of text is interpreted within the DAML-Time Ontology and time framework of Hobbs (2002). We demonstrate the expressiveness of TimeML in ...

TimeBank-Driven TimeML Analysis

The design of TimeML as an expressive language for temporal information brings promises, and challenges; in particular, its representational properties raise the bar for traditional information extraction methods applied to the task of text-to-TimeML analysis. A reference corpus, such as TimeBank, is an invaluable asset in this situation; however, certain characteristics of TimeBank—size and co...

Argument Structure in TimeML

TimeML is a specification language for the annotation of events and temporal expressions in natural language text. In addition, the language introduces three relational tags linking temporal objects and events to one another. These links impose both aspectual and temporal ordering over time objects, as well as mark up subordination contexts introduced by modality, evidentiality, and factivity. ...

Temporal Annotation in the Clinical Domain

This article discusses the requirements of a formal specification for the annotation of temporal information in clinical narratives. We discuss the implementation and extension of ISO-TimeML for annotating a corpus of clinical notes, known as the THYME corpus. To reflect the information task and the heavily inference-based reasoning demands in the domain, a new annotation guideline has been dev...

KTimeML: Specification of Temporal and Event Expressions in Korean Text

TimeML, TimeBank, and TTK (TARSQI Project) have been playing an important role in enhancement of IE, QA, and other NLP applications. TimeML is a specification language for events and temporal expressions in text. This paper presents the problems and solutions for porting TimeML to Korean as a part of the Korean TARSQI Project. We also introduce the KTTK which is an automatic markup tool of temp...

Publication date: 2005