PAYMA: A Tagged Corpus of Persian Named Entities

Authors

Faili, Heshaam (College of Engineering, University of Tehran)

Mohseni, Mahdi (College of Engineering, University of Tehran)

Shahshahani, Mahsa Sadat (College of Engineering, University of Tehran)

Shakery, Azadeh (College of Engineering, University of Tehran)

Abstract:

The goal of the named entity recognition (NER) task is to classify the proper nouns in a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks, such as question answering and summarization. Although many studies have been conducted in this area for English, and state-of-the-art NER systems have reached F1 scores above 90 percent, there are very few studies on this task in Persian. One of the main reasons may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research we create a standard tagged Persian NER dataset, which will be distributed freely for research purposes. To construct this dataset, we studied the existing standard NER datasets in English and concluded that almost all of them are built from news data. We therefore collected documents from ten Persian news websites. Next, to provide the annotators with guidelines for tagging these documents, we studied the guidelines used to construct the CoNLL and MUC English datasets and created our own guidelines, taking Persian linguistic rules into account. Under these guidelines, every word in a document is labeled as person, location, organization, time, date, percent, currency, or other (words that belong to none of these seven classes). Like most existing English NER datasets, we use IOB encoding to annotate named entities: the first token of a named entity is labeled B, any following tokens of the same entity are labeled I, and words that are not part of any named entity are labeled O. The constructed corpus, named PAYMA, consists of 709 documents and 302,530 tokens, of which 41,148 are labeled as named entities and the rest are labeled O.
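The IOB scheme described above can be illustrated with a minimal sketch. The example sentence, the `spans_to_iob` helper, and the `B-PER` / `I-PER` tag spelling (a prefix joined to a class abbreviation with a hyphen, as is common in CoNLL-style data) are assumptions made for illustration; the abstract does not specify the exact label strings used in the corpus.

```python
from typing import List, Tuple


def spans_to_iob(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, class) entity spans into IOB tags.

    Hypothetical helper: the tag spelling "B-CLS" / "I-CLS" is an assumption.
    """
    tags = ["O"] * n_tokens              # every token starts outside any entity
    for start, end, cls in spans:
        tags[start] = f"B-{cls}"         # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{cls}"         # remaining tokens of the same entity
    return tags


# Invented example sentence, not taken from the corpus.
tokens = ["Sara", "Ahmadi", "flew", "to", "Tehran", "."]
spans = [(0, 2, "PER"), (4, 5, "LOC")]
tags = spans_to_iob(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here the two-token person name receives `B-PER` followed by `I-PER`, the single-token location receives only `B-LOC`, and all other tokens are `O`.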
To measure inter-annotator agreement, 160 documents were labeled by a second annotator. The kappa statistic, computed over the words labeled as named entities, was estimated at 95%. After creating the dataset, we used it to design a hybrid system for named entity recognition. We trained a statistical system based on the CRF algorithm and used its output as a feature for training a bidirectional LSTM recurrent neural network. In addition, we clustered the words with the k-means method and fed each word's cluster number to the LSTM network. Combining CRF with neural networks in this way, together with the per-word cluster number, is the novelty of this work. Experimental results show that the final model reaches an F1 score of 87% at the word level and 80% at the phrase level.
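The gap between the word-level and phrase-level scores can be made concrete with a small sketch. It assumes that word-level F1 counts a non-O token as correct when its full tag matches, and that phrase-level F1 counts an entity as correct only when its boundaries and class both match exactly. The abstract does not state the authors' exact scoring procedure, so the functions below (`f1`, `word_level_f1`, `phrase_level_f1`, `iob_to_spans`) and the toy tag sequences are hypothetical.

```python
from typing import List, Set, Tuple


def f1(tp: int, n_pred: int, n_gold: int) -> float:
    """F1 from true positives, predicted count, and gold count."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def iob_to_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, class) spans from "B-X" / "I-X" / "O" tags."""
    spans, start, cls = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a final span
        continues = tag.startswith("I-") and cls == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, cls))
            start, cls = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, cls = i, tag[2:]                 # open a new span
    return spans


def word_level_f1(gold: List[str], pred: List[str]) -> float:
    """Assumed definition: a non-O token is correct if its full tag matches."""
    tp = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    return f1(tp, sum(p != "O" for p in pred), sum(g != "O" for g in gold))


def phrase_level_f1(gold: List[str], pred: List[str]) -> float:
    """Assumed definition: an entity is correct only on an exact span match."""
    gold_spans, pred_spans = iob_to_spans(gold), iob_to_spans(pred)
    return f1(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))


# Toy sequences: the prediction misses the second token of the person name.
gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "O", "B-LOC", "O"]

print(f"word-level F1:   {word_level_f1(gold, pred):.2f}")    # 0.80
print(f"phrase-level F1: {phrase_level_f1(gold, pred):.2f}")  # 0.50
```

In this toy case a single missed token costs one word at the word level but a whole entity at the phrase level, which is one reason phrase-level scores tend to be lower.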


Similar resources

PEYMA: A Tagged Corpus for Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a text into classes such as person, location, and organization. This is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art NER systems have reached per...


References to Named Entities: a Corpus Study

References included in multi-document summaries are often problematic. In this paper, we present a corpus study performed to derive a statistical model for the syntactic realization of referential expressions. The interpretation of the probabilistic data helps us gain insight on how extractive summaries can be rewritten in an efficient manner to produce more fluent and easy-to-read text.


Recognizing Nested Named Entities in GENIA corpus

Nested Named Entities (nested NEs), one containing another, are commonly seen in biomedical text, e.g., accounting for 16.7% of all named entities in the GENIA corpus. While much work has been done on recognizing non-nested NEs, nested NEs have been largely neglected. In this work, we treat the task as a binary classification problem and solve it using Support Vector Machines. For each token in n...


The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0

In order to build an automatic named entity recognition (NER) system for machine learning, a large tagged corpus is necessary. This paper describes the manual construction of a Chinese named entity tagged corpus (CNEC 1.0) that can be used to improve NER performance. In this project, we define five named entity tags: PER (person name), LOC (location name), ORG (organization name), LAO (location...


Patent Retrieval in Chemistry Based on Semantically Tagged Named Entities

This paper reports on the work that has been conducted by Fraunhofer SCAI for Trec Chemistry (Trec-Chem) track 2009. The team of Fraunhofer SCAI participated in two tasks, namely Technology Survey and Prior Art Search. The core of the framework is an index of 1.2 million chemical patents provided as a data set by Trec. For the technology survey, three runs were submitted based on semantic dicti...
