Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE

Author

  • Yvette Graham
Abstract

We provide an analysis of current evaluation methodologies applied to summarization metrics and identify the following areas of concern: (1) movement away from evaluation by correlation with human assessment; (2) omission of important components of human assessment from evaluations, in addition to large numbers of metric variants; (3) absence of methods for significance testing of improvements over a baseline. We outline an evaluation methodology that overcomes all of these challenges, providing the first method of significance testing suitable for the evaluation of summarization metrics. Our evaluation reveals for the first time which metric variants significantly outperform others, identifies optimal metric variants distinct from the currently recommended best variants, and shows the machine translation metric BLEU to perform on par with ROUGE for the purpose of evaluating summarization systems. We subsequently replicate a recent large-scale evaluation that relied on what we now know to be suboptimal ROUGE variants, revealing distinct conclusions about the relative performance of state-of-the-art summarization systems.
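The core of the methodology the abstract describes is (a) correlating metric scores with human assessment across systems and (b) significance-testing the difference between two metrics' correlations, which share the same human judgments and are therefore dependent. Below is a minimal, hypothetical Python sketch using the Williams test for dependent correlations, a standard choice in this setting; the data, metric labels, and sample size are invented for illustration and are not from the paper.

```python
# A minimal sketch (not the paper's released code) of correlating metric
# scores with human assessment and significance-testing the difference.
# The Williams test compares two dependent correlations that share a
# variable (here, the human scores). All data below are synthetic.
import numpy as np
from scipy.stats import pearsonr, t as t_dist

def williams_one_sided(r12, r13, r23, n):
    """P-value for H1: corr(metric A, human) > corr(metric B, human),
    given r23 = corr(metric A, metric B) over the same n systems."""
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    num = (r12 - r13) * np.sqrt((n - 1) * (1 + r23))
    den = np.sqrt(2 * K * (n - 1) / (n - 3)
                  + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3)
    return 1 - t_dist.cdf(num / den, df=n - 3)

rng = np.random.default_rng(0)
human = rng.normal(size=20)                        # human score per system
metric_a = human + rng.normal(scale=0.5, size=20)  # hypothetical variant A
metric_b = human + rng.normal(scale=1.0, size=20)  # hypothetical variant B

r12, _ = pearsonr(metric_a, human)
r13, _ = pearsonr(metric_b, human)
r23, _ = pearsonr(metric_a, metric_b)
print(f"p = {williams_one_sided(r12, r13, r23, n=len(human)):.4f}")
```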


Related Papers

VERT: A Method for Automatic Evaluation of Video Summaries

Video summarization has become an important tool for multimedia information processing, but the automatic evaluation of a video summarization system remains a challenge. A major issue is that an ideal "best" summary does not exist, although people can easily distinguish "good" from "bad" summaries. A similar situation arises in machine translation and text summarization, where specific automatic...


CS224d Project Final Report

We develop a Recurrent Neural Network (RNN) Language Model to extract sentences from Yelp Review Data for the purpose of automatic summarization. We compare these extracted sentences against user-generated tips in the Yelp Academic Dataset using ROUGE and BLEU metrics for summarization evaluation. The performance of a uni-directional RNN is compared against word-vector averaging.
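As a hedged illustration of the scoring step this report describes (not code from the report itself), the sketch below compares a hypothetical extracted sentence against a hypothetical user tip using NLTK's sentence-level BLEU; smoothing is applied because short texts frequently have zero higher-order n-gram matches.

```python
# Invented Yelp-style example: score a system-extracted sentence against
# a user-written tip with sentence-level BLEU. SmoothingFunction avoids
# a zero score when some n-gram orders have no matches.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

tip = "great happy hour deals on tuesdays".split()            # reference
extracted = "happy hour deals are great on tuesdays".split()  # system output

score = sentence_bleu([tip], extracted,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```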


ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

Evaluation of summarization tasks is crucial to determining the quality of machine-generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for summarization tasks. While ROUGE has been shown to be effective in capturing n-gram overlap between system and human-composed summaries, there are several limitations with the existing ROUGE...
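To make the n-gram overlap that ROUGE captures concrete, here is a minimal sketch of ROUGE-N recall against a single reference; the real toolkit adds stemming, stopword-handling options, multiple references, and precision/F-measure variants.

```python
# A minimal sketch of the n-gram overlap at the core of ROUGE-N:
# clipped n-gram matches divided by the number of reference n-grams.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system, reference, n=2):
    sys_counts = ngrams(system.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    overlap = sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat",
                     "the cat was sitting on the mat"))  # 3/6 = 0.5
```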


Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries

Automatic summarization evaluation is critical to the development of summarization systems. While ROUGE has been shown to correlate well with human evaluation of content match in text summarization, there are many characteristics of the multiparty meeting domain that may pose problems for ROUGE. In this paper, we carefully examine how well the ROUGE scores correlate with human evaluation...


Revisiting Summarization Evaluation for Scientific Articles

Evaluation of text summarization approaches has been mostly based on metrics that measure the similarity of system-generated summaries to a set of human-written gold-standard summaries. The most widely used metrics in summarization evaluation have been the ROUGE family. ROUGE relies solely on lexical overlap between the terms and phrases in the sentences; therefore, in cases of terminology variations...
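A small, invented example of the lexical-overlap limitation described above: a paraphrase that preserves the meaning but shares no terms with the reference scores zero under a purely lexical measure.

```python
# Invented example: a meaning-preserving paraphrase with different
# terminology gets zero overlap under a purely lexical (unigram) measure.
def unigram_recall(system, reference):
    sys_tokens = set(system.lower().split())
    ref_tokens = reference.lower().split()
    return sum(tok in sys_tokens for tok in ref_tokens) / len(ref_tokens)

reference = "the drug lowers cholesterol"
paraphrase = "this medication reduces lipid levels"
print(unigram_recall(paraphrase, reference))  # 0.0 despite equivalent meaning
```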



Journal:

Volume   Issue 

Pages  -

Publication date: 2015