NTNU-2 at SemEval-2017 Task 10: Identifying Synonym and Hyponym Relations among Keyphrases in Scientific Documents
Abstract
This paper presents our relation extraction system for subtask C of SemEval-2017 Task 10: ScienceIE. Assuming that the keyphrases are already annotated in the input data, our work explores a wide range of linguistic features, applies various feature selection techniques, optimizes the hyperparameters and class weights, and experiments with different problem formulations (a single classification model vs. individual classifiers for each keyphrase type; a single-step classifier vs. a pipeline classifier for hyponym relations). The performance of five popular classification algorithms is evaluated for each problem formulation along with feature selection. The best setting achieved an F1 score of 71.0% for synonym and 30.0% for hyponym relations on the test data.

1 Problem Description

Task C of ScienceIE at SemEval-2017 (Augenstein et al., 2017) concerns identifying sentence-level ‘SYNONYM-OF’ (or ‘same-as’) and ‘HYPONYM-OF’ (‘is-a’) relations among three types of keyphrases in scientific documents: PROCESS (PR), TASK (TA) and MATERIAL (MA). The ‘SYNONYM-OF’ relation is symmetric, whereas the ‘HYPONYM-OF’ relation is directed. Hyponym relation prediction is thus associated with two ordered subtasks: (1) predicting relations between pairs of keyphrases; (2) predicting the direction of the relation. It is assumed that there are no relations between keyphrases of different types. Automatic identification of synonym/hyponym relations is useful for many NLP applications, e.g. knowledge base completion and ontology construction.

2 Challenges

The relation prediction task of ScienceIE is challenging and quite different from other semantic relation prediction tasks such as SemEval-2010 Task 8 (Hendrickx et al., 2009). In SemEval-2010 Task 8, there are two marked nominals in a sentence and the task is to predict whether any of nine semantic relations holds between the nominal pair. Although that task has more relations than ScienceIE (9 vs. 2), ScienceIE poses different challenges.
Instead of single-word nominals, the keyphrases of ScienceIE are arbitrarily large text spans referring to larger syntactico-semantic units. The top part of Table 1 shows the percentage of keyphrases longer than 10 tokens in the training (10.89%), development (8.76%) and test (6.71%) data. The problem with such large text spans is identifying the features that best represent the keyphrase and contribute most to the relation prediction task. Another challenge of ScienceIE is the occurrence of multiple keyphrases in one sentence, producing a large number of possible relations among keyphrase pairs, i.e., n(n−1)/2 for n keyphrases. As most of these are negative instances, the positive and negative classes are imbalanced. A third challenge is the potentially long distance between keyphrase pairs. The middle part of Table 1 shows that 49.2%, 57.68% and 43.77% of the keyphrase pairs in the training, development and test sets, respectively, are separated by more than 19 tokens. In addition, a number of other keyphrases can occur in between a pair of related keyphrases, as shown in Table 1. Finally, the number of synonym and hyponym relations in the training and development datasets is limited. The bottom part of Table 2 shows the frequencies of relations in the training and development datasets (ignoring inter-sentence keyphrase relations).
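The candidate-pair count mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the `(text, type)` keyphrase representation, the function names and the example sentence are all hypothetical and chosen only for illustration. The same-type filter reflects the stated assumption that no relations hold between keyphrases of different types.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical representation: (surface text, keyphrase type), where the
# type is one of "PROCESS", "TASK" or "MATERIAL".
Keyphrase = Tuple[str, str]


def max_pairs(n: int) -> int:
    """Number of unordered pairs among n keyphrases: n(n-1)/2."""
    return n * (n - 1) // 2


def candidate_pairs(keyphrases: List[Keyphrase]) -> List[Tuple[Keyphrase, Keyphrase]]:
    """Unordered pairs of keyphrases in one sentence that share a type."""
    return [(a, b) for a, b in combinations(keyphrases, 2) if a[1] == b[1]]


# Invented example sentence with five annotated keyphrases.
sentence = [
    ("support vector machine", "PROCESS"),
    ("SVM", "PROCESS"),
    ("kernel method", "PROCESS"),
    ("classification", "TASK"),
    ("text corpus", "MATERIAL"),
]

all_pairs = max_pairs(len(sentence))      # every unordered pair
same_type = candidate_pairs(sentence)     # pairs that can carry a relation
print(all_pairs, len(same_type))
```

Even after the same-type filter, only a small share of the remaining candidates would typically be true synonym or hyponym pairs, which is the source of the class imbalance described above.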
Related resources
The NTNU System at SemEval-2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications Using Multiple Conditional Random Fields
This study describes the design of the NTNU system for the ScienceIE task at the SemEval 2017 workshop. We use self-defined feature templates and multiple conditional random fields with extracted features to identify keyphrases along with categorized labels and their relations from scientific publications. A total of 16 teams participated in evaluation scenario 1 (subtasks A, B, and C), with on...
NTNU-1@ScienceIE at SemEval-2017 Task 10: Identifying and Labelling Keyphrases with Conditional Random Fields
We present NTNU’s systems for Task A (prediction of keyphrases) and Task B (labelling as Material, Process or Task) at SemEval 2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications (Augenstein et al., 2017). Our approach relies on supervised machine learning using Conditional Random Fields. Our system yields a micro F-score of 0.34 for Tasks A and B combined on the test...
SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers work...
Illinois-LH: A Denotational and Distributional Approach to Semantics
This paper describes and analyzes our SemEval 2014 Task 1 system. Its features are based on distributional and denotational similarities; word alignment; negation; and hypernym/hyponym, synonym, and antonym relations.
LIPN at SemEval-2017 Task 10: Filtering Candidate Keyphrases from Scientific Publications with Part-of-Speech Tag Sequences to Train a Sequence Labeling Model
This paper describes the system used by the team LIPN in SemEval 2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications. The team participated in Scenario 1, that includes three subtasks, Identification of keyphrases (Subtask A), Classification of identified keyphrases (Subtask B) and Extraction of relationships between two identified keyphrases (Subtask C). The presente...
Publication date: 2017