ROUGE for Summary Evaluation
by Kavita Ganesan
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translations.
It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). Let’s say that we have the following system and reference summaries:
System Summary (what the machine produced):
the cat was found under the bed
Reference Summary (gold standard — usually by humans):
the cat was under the bed
If we consider just the individual words, the number of overlapping words between the system summary and reference summary is 6. This, however, does not tell you much as a metric. To get a good quantitative value, we can actually compute the precision and recall using the overlap.
Simply put, recall (in the context of ROUGE) refers to how much of the reference summary the system summary is recovering or capturing. If we are just considering the individual words, it can be computed as:
Recall = number of overlapping words / total words in reference summary
In this example, the recall would thus be:
Recall = 6 / 6 = 1.0
This means that all the words in the reference summary have been captured by the system summary, which indeed is the case for this example. Voila!
This looks really good for a text summarization system. But it does not tell you the other side of the story. A machine generated summary (system summary) can be extremely long, capturing all words in the reference summary. But, many of the words in the system summary may be useless, making the summary unnecessarily verbose.
This is where precision comes into play. In terms of precision, what you are essentially measuring is, how much of the system summary was in fact relevant or needed? Precision is measured as:
Precision = number of overlapping words / total words in system summary
In this example, the precision would thus be:
Precision = 6 / 7 = 0.86
This simply means that 6 out of the 7 words in the system summary were in fact relevant or needed. If we had the following system summary, as opposed to the example above — System Summary 2:
the tiny little cat was found under the big funny bed
The precision now becomes:
Precision = 6 / 11 = 0.55
Now, this doesn’t look so good, does it? That is because we have quite a few unnecessary words in the summary. The precision aspect becomes really crucial when you are trying to generate summaries that are concise in nature. Therefore, it is always best to compute both the precision and recall and then report the F-Measure.
If your summaries are in some way forced to be concise through some constraints, then you could consider using just the recall, since precision is of less concern in this scenario.
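The word-level precision, recall, and F-Measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the official ROUGE implementation: the function name `unigram_scores` is hypothetical, tokenization is plain whitespace splitting, and a repeated word is counted at most as many times as it appears in both summaries.

```python
from collections import Counter

def unigram_scores(system, reference):
    """Return (precision, recall, f_measure) over whitespace-split words."""
    sys_counts = Counter(system.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((sys_counts & ref_counts).values())
    sys_total = sum(sys_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / sys_total if sys_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

reference = "the cat was under the bed"
p, r, f = unigram_scores("the cat was found under the bed", reference)
print(round(p, 2), round(r, 2))  # 0.86 1.0

p2, r2, f2 = unigram_scores(
    "the tiny little cat was found under the big funny bed", reference
)
print(round(p2, 2), round(r2, 2))  # 0.55 1.0
```

Both system summaries reach a recall of 1.0, but the longer one is penalized on precision, which is why the F-Measure is the more informative single number here.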
ROUGE-N, ROUGE-S, and ROUGE-L can be thought of as the granularity of texts being compared between the system summaries and reference summaries.
ROUGE-N — measures unigram, bigram, trigram and higher order n-gram overlap
ROUGE-L — measures longest matching sequence of words using LCS. An advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence level word order. Since it automatically includes longest in-sequence common n-grams, you don’t need a predefined n-gram length.
ROUGE-S — measures the overlap of skip-grams, where a skip-gram is any pair of words in a sentence in order, allowing for arbitrary gaps. This is also called skip-gram co-occurrence. For example, skip-bigram measures the overlap of word pairs that can have a maximum of two gaps in between words. As an example, for the phrase “cat in the hat” the skip-bigrams would be “cat in, cat the, cat hat, in the, in hat, the hat”.
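To make the building blocks of ROUGE-L and ROUGE-S concrete, here is a rough sketch of the two underlying operations: the length of the longest common subsequence (LCS) of two word lists, and the enumeration of skip-bigrams. The function names and the `max_gap` parameter are illustrative assumptions, and the sketch stops short of the full ROUGE-L and ROUGE-S scores.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def skip_bigrams(words, max_gap=None):
    """In-order word pairs; max_gap limits the words skipped between them."""
    pairs = []
    for i, j in combinations(range(len(words)), 2):
        if max_gap is None or j - i - 1 <= max_gap:
            pairs.append((words[i], words[j]))
    return pairs

system = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()
print(lcs_length(system, reference))  # 6

print(skip_bigrams("cat in the hat".split(), max_gap=2))
# [('cat', 'in'), ('cat', 'the'), ('cat', 'hat'),
#  ('in', 'the'), ('in', 'hat'), ('the', 'hat')]
```

The LCS here is the whole reference summary ("the cat was under the bed"), even though "found" interrupts it in the system summary, which shows that in-sequence matches need not be consecutive.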
For example, ROUGE-1 refers to overlap of unigrams between the system summary and reference summary. ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.
Let’s take the example from above. Let us say we want to compute the ROUGE-2 precision and recall scores.
System Summary:
the cat was found under the bed
Reference Summary:
the cat was under the bed
System Summary Bigrams:
the cat, cat was, was found, found under, under the, the bed
Reference Summary Bigrams:
the cat, cat was, was under, under the, the bed
Based on the bigrams above, the ROUGE-2 recall is as follows:
ROUGE-2 Recall = number of overlapping bigrams / total bigrams in reference summary = 4 / 5 = 0.8
Essentially, the system summary has recovered 4 bigrams out of 5 bigrams from the reference summary, which is pretty good! Now the ROUGE-2 precision is as follows:
ROUGE-2 Precision = number of overlapping bigrams / total bigrams in system summary = 4 / 6 = 0.67
The precision here tells us that out of all the system summary bigrams, there is a 67% overlap with the reference summary. This is not too bad either. Note that as the summaries (both system and reference summaries) get longer and longer, there will be fewer overlapping bigrams. This is especially true in the case of abstractive summarization, where you are not directly re-using sentences for summarization.
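The bigram calculation above generalizes to any n. The following is a minimal sketch of ROUGE-N precision and recall, assuming whitespace tokenization and a single reference summary; `ngrams` and `rouge_n` are hypothetical helper names, not functions from an existing ROUGE package.

```python
from collections import Counter

def ngrams(words, n):
    """All consecutive n-grams of a word list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n(system, reference, n):
    """Return (precision, recall) based on overlapping n-grams."""
    sys_counts = Counter(ngrams(system.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Each n-gram is matched at most as often as it occurs in both summaries.
    overlap = sum((sys_counts & ref_counts).values())
    sys_total = sum(sys_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / sys_total if sys_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    return precision, recall

system = "the cat was found under the bed"
reference = "the cat was under the bed"
precision, recall = rouge_n(system, reference, 2)
print(round(precision, 2), round(recall, 2))  # 0.67 0.8
```

Calling the same function with `n=1` gives the word-level scores from the first example.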
The reason one would use ROUGE-2 (or other finer-granularity ROUGE measures) over or in conjunction with ROUGE-1 is to also show the fluency of the summaries or translation. The intuition is that if you more closely follow the word orderings of the reference summary, then your summary is actually more fluent.
For more in-depth information about these evaluation metrics, you can refer to Lin’s paper. Which measure to use depends on the specific task that you are trying to evaluate. If you are working on extractive summarization with fairly verbose system and reference summaries, then it may make sense to use ROUGE-1 and ROUGE-L. For very concise summaries, ROUGE-1 alone may suffice, especially if you are also applying stemming and stop word removal.