NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources
Authors
Abstract
Many algorithms for measuring Chinese word similarity have been introduced, since word similarity is a fundamental issue in various tasks of natural language processing. Previous work focused mainly on using existing semantic knowledge bases or large-scale corpora. However, both knowledge bases and corpora have limitations in coverage and data freshness. Ensemble learning is therefore used to improve performance by combining similarities. This paper describes a Chinese word similarity measure that uses ensemble learning over knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods rely on TYCCL and Hownet. Two corpus-based methods compute similarities by querying web search engines and by deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation shows that the TYCCL-based method performs best on the test dataset. However, with appropriately tuned parameters, ensemble learning can outperform all the other algorithms. In addition, deep learning on news corpora outperforms the other corpus-based methods.
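The fusion step described above can be sketched with support vector regression as implemented in scikit-learn. The feature layout, toy similarity scores, and gold ratings below are illustrative assumptions for the sketch, not data from the paper.

```python
# Sketch of the ensemble step: each word pair's similarity scores from the
# individual methods (TYCCL, Hownet, web search, news embeddings, microblog
# embeddings) become features for an SVR model trained against gold-standard
# human similarity ratings. All numbers here are made-up toy values.
import numpy as np
from sklearn.svm import SVR

# One row per word pair:
# [tyccl_sim, hownet_sim, web_sim, news_emb_sim, weibo_emb_sim]
X_train = np.array([
    [0.90, 0.85, 0.70, 0.88, 0.80],
    [0.20, 0.30, 0.25, 0.15, 0.22],
    [0.60, 0.55, 0.50, 0.65, 0.58],
    [0.10, 0.05, 0.12, 0.08, 0.11],
])
# Human-annotated gold similarity for each pair (e.g. on a 1-10 scale).
y_train = np.array([9.1, 2.4, 6.0, 1.2])

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# Predict the fused similarity for an unseen word pair.
pair_scores = np.array([[0.75, 0.70, 0.60, 0.72, 0.68]])
fused = model.predict(pair_scores)
print(round(float(fused[0]), 2))
```

In practice the model would be trained on the shared task's annotated word pairs, and kernel choice and `C` would be tuned by cross-validation, which is the "tuning parameters appropriately" step the abstract refers to.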
Similar Articles
Combining Word Embedding and Semantic Lexicon for Chinese Word Similarity Computation
Large corpus-based embedding methods have received increasing attention for their flexibility and effectiveness in many NLP tasks including Word Similarity (WS). However, these approaches rely on high-quality corpora and neglect the human’s intelligence contained in semantic resources such as Tongyici Cilin and Hownet. This paper proposes a novel framework for measuring the Chinese word similar...
Chinese Word Similarity Computing Based on Combination Strategy
Chinese word similarity computing is a fundamental task for natural language processing. This paper presents a method to calculate the similarity between Chinese words based on a combination strategy. We apply Baidubaike to train a Word2Vector model, and then integrate different methods, the dictionary-based method, the Word2Vector-based method, and the Chinese FrameNet (CFN)-based method, to calculate the semant...
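The embedding-based component these combination methods share reduces to cosine similarity between word vectors. A minimal sketch follows; the vectors are toy values, not embeddings actually trained on Baidubaike.

```python
# Cosine similarity between two word embedding vectors: the dot product
# of the vectors divided by the product of their Euclidean norms.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for real word embeddings.
vec_a = np.array([0.2, 0.8, 0.1])
vec_b = np.array([0.25, 0.75, 0.05])
print(round(cosine_similarity(vec_a, vec_b), 3))  # → 0.995
```

Scores like this one would then be rescaled and combined with the dictionary-based and CFN-based scores under whatever combination strategy the system uses.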
Overview of NLPCC Shared Task 4: Stance Detection in Chinese Microblogs
This paper presents the overview of the shared task, stance detection in Chinese microblogs, in NLPCC-ICCPOL 2016. The submitted systems are expected to automatically determine whether the author of a Chinese microblog is in favor of the given target, against the given target, or whether neither inference is likely. Different from regular evaluation tasks on sentiment analysis, the microblog te...
Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Micro-Blog Texts
In this paper, we give an overview of the shared task at the 5th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2016): Chinese word segmentation for micro-blog texts. Different from the popularly used newswire datasets, the dataset of this shared task consists of relatively informal micro-texts. Besides, we also use a new psychometric-inspired evaluation metric for ...
Ensemble of Feature Sets and Classification Methods for Stance Detection
Stance detection is the task of automatically determining the author's favorability towards a given target. However, the target may not be explicitly mentioned in the text, and the author may even express positive opinions while opposing the target, which makes the task more difficult. In this paper, we describe an ensemble framework which integrates various feature sets and classification methods, a...
Publication date: 2016