Named Entity Extraction with Conditional Markov Models and Classifiers
Abstract
Languages differ widely in the conventions they use to signal named entities. Spanish, French, and English use the case distinction of the modern Roman alphabet to indicate proper names, and upper case is a fairly good indicator of a proper name. The situation is quite different in German, where upper case is a poor cue. In traditional Chinese scholarly works, certain proper names are indicated by underlining, and without that form of annotation locating a proper name would seem quite challenging. In light of the diversity found across languages and orthographic conventions, it is unclear whether any effective multilingual named entity extraction system will ever be built that does not rely on human expertise for customizing it to a particular language and domain.

Since we started by building a Spanish system without knowing what other language it would have to be applied to, the features we ended up using are all rather simple and generic in nature. No language experts were consulted. In the extraction component, we look at the orthographic string of a word with accents removed (since there are some inconsistencies regarding the presence or absence of accents), and we determine whether it starts with an upper case character. For the classification component we look at entire candidate phrases, and determine the length of the phrase, its position in a sentence, the immediately surrounding words, and what words occur within the phrase. For a word inside the phrase we determine whether it starts with an upper or lower case character (or neither), whether it contains any upper case or lower case characters (or neither), and we also use the entire orthographic string with accents removed. Fortunately, these features carry over fairly well to Dutch, the second language of the Shared Task, and may also have been sufficient for French or English, but would probably fall short for German.
Needless to say, radically different orthographic systems may require entirely different approaches, so the multilingual scope of our proposal is fairly limited.
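The word-level and phrase-level features described above can be illustrated with a short Python sketch. This is not the authors' implementation: the function names, the dictionary keys, the `<s>` and `</s>` boundary markers, and the use of Unicode NFD decomposition for accent removal are all assumptions made for illustration only.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Return the word with combining accent marks removed (assumed approach: NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def case_of_first(word: str) -> str:
    """Classify the first character as 'upper', 'lower', or 'neither'."""
    first = word[:1]
    if first.isupper():
        return "upper"
    if first.islower():
        return "lower"
    return "neither"


def word_features(word: str) -> dict:
    """Features for one word inside a candidate phrase (hypothetical key names)."""
    return {
        "string": strip_accents(word),
        "first_char": case_of_first(word),
        "has_upper": any(ch.isupper() for ch in word),
        "has_lower": any(ch.islower() for ch in word),
    }


def phrase_features(tokens: list, start: int, end: int) -> dict:
    """Features for the candidate phrase tokens[start:end] within one sentence."""
    return {
        "length": end - start,
        "position": start,
        # Boundary markers are an assumption; the abstract only mentions
        # "the immediately surrounding words".
        "prev_word": strip_accents(tokens[start - 1]) if start > 0 else "<s>",
        "next_word": strip_accents(tokens[end]) if end < len(tokens) else "</s>",
        "words": [word_features(w) for w in tokens[start:end]],
    }


if __name__ == "__main__":
    sentence = ["El", "presidente", "José", "María", "Aznar", "habló", "."]
    print(phrase_features(sentence, 2, 5))
```

How these features are fed to the conditional Markov model and the classifiers is not specified in the abstract, so the sketch stops at feature extraction.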
Similar resources
Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination
Chemical Named Entity Recognition (NER) is the basic step for consequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, extraction of the names of the molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods have good performance in the Persian language if they are trained with good features. To get good performanc...
Seminar Report Scalable Algorithms For Information Extraction
Information Extraction from unstructured sources like web is one of the interesting problems in machine learning. Part of Speech (PoS) tagging, segmentation of text, Named Entity Recognition (NER) are some of the applications of Information Extraction. There are many models like Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs) and Semi-Conditi...
A Survey on Machine Learning Techniques to Extract Chemical Names from Text Documents
The chemical name extraction has a great importance in the biomedical field. Named Entity Recognition is the subtask of information extraction that is used to identify named entities in the given data. There are various dictionary-based, rule-based and machine learning approaches available for Named Entity Recognition. Rule based techniques include hand written rules. In this paper an extensive...
Conditional Random Fields vs. Hidden Markov Models in a biomedical Named Entity Recognition task
With a recent quick development of a molecular biology domain the Information Extraction (IE) methods become very useful. Named Entity Recognition (NER), that is considered to be the easiest task of IE, still remains very challenging in molecular biology domain because of the complex structure of biomedical entities and the lack of naming convention. In this paper we apply two popular sequence ...
Named Entity Recognition for Indian Languages: A Survey
Named Entity Recognition (NER) is a subtask of Information Extraction (IE) used to identify and classify the names in any given data. Earlier studies were mostly based on handwritten rules, whereas nowadays Machine Learning models such as Hidden Markov Model (HMM), Maximum Entropy (MaxEnt), Maximum Entropy Markov model (MEMM), Support Vector Machine (SVM), Conditional Random Fields (CRFs) a...
Publication date: 2002