Comparing machine graded diagrams with human markers: some observations

Author

  • Pete Thomas
Abstract

In this paper we examine the performance of an automatic (machine) grading algorithm for entity-relationship (E-R) diagrams by comparing it with human-generated marks for a set of student answers to an assignment question. Using a variety of statistical tests, it is shown that the performance of the automatic marker is very close to that of the human markers: the Pearson correlation coefficient is 0.964 (significant at the 0.01 level, 2-tailed, N=26) and the Kendall tau-b correlation coefficient is 0.919 (significant at the 0.01 level, 2-tailed, N=26). The investigation revealed deficiencies in both the machine and human markers. There is prima facie evidence that the orientation (shape) of a diagram may influence humans to award lower marks than they should.

Introduction

As part of our ongoing research into machine understanding of imprecise diagrams [9], we have been investigating the particular problem of automatically grading student answers to assignment questions that require entity-relationship (E-R) diagrams to be drawn [13]. We have developed an automatic E-R diagram marker that is based on the results of earlier work on the automatic grading of textual answers to assignments [11]. The diagram marking tool conforms to a 5-stage architecture described in [9, 10]. The effectiveness of the automatic marker has been judged against the criterion of how well the automatically generated grades for a set of student drawings compare with marks generated by experts in the field. In this paper we report on some of the issues that these comparisons have raised about the nature of grading, both human and machine-based.

In our most recent experiments we have looked at two examples taken from the assessment of a database course. The first experiment was performed on student answers to an assignment early in the course, where the question was tightly specified and where we expected the majority of students to perform well.
In this scenario we expected the automatic marker to perform well. In the second experiment, we took student answers to a question posed in the final assignment of the course, which was much more open-ended. Here we expected a much wider diversity of answers and consequently a much poorer response from the marking tool. It turned out that our expectations about the performance of the marking tool were not met: the results for the second experiment were better than for the first. This unexpected result caused us to look in depth at the behaviour of both the automatic marker and the human markers, and at the way in which we evaluated the effectiveness of the automatic marker.

The paper is structured as follows. The next section discusses how tutors approach the marking of diagrams in our educational context and compares this with the approach used in the automatic marker. The third section compares the initial set of marks produced by the human and machine markers and identifies where the major discrepancies occurred. The fourth section looks in detail at the discrepancies and shows how a closer match between human and machine-generated marks was obtained. The paper concludes with a discussion of the findings and sets out the direction for future work.

The marking processes

In this section we describe, briefly, how the marking of the E-R diagrams was performed (a) by the human markers and (b) by the marking tool.

Human marking

In our environment (distance education), we typically have large numbers of students (in excess of one thousand) on each presentation of the database course. Student assignments are marked and commented upon by a team of tutors – experts in the database field with distance-teaching experience. Over 40 tutors are employed on this course. To ensure consistency of performance between tutors, two quality-assurance procedures are in place.
First, each tutor is provided with a set of ‘Tutor Notes’ containing both a sample solution and a comprehensive marking guide which explains how the marking scheme is to be applied. If a tutor is faced with a student answer which does not match the sample solution, they are expected to use their professional judgement and to assign marks within the guidelines set out in the Tutor Notes. Second, a process known as monitoring is invoked, in which the work of a tutor is examined by another expert, the monitor, whose role is to check both the marking accuracy and the usefulness of the tutor’s feedback to the student. Problems identified by the monitor in the marks awarded by a tutor can result either in an immediate re-grading of the student answer or in a request for the answer to be re-marked.

In our experiments we monitored the marking of all tutors and adjusted the marks for those answers where discrepancies were found between the tutor’s and the monitor’s marks. These adjusted marks were then used as the definitive measure of correctness of the students’ answers. It is an interesting aside that in all but one case where an adjustment was made, the adjustment was of a single mark. In the remaining case, however, the adjustment was 5 marks (the maximum mark for the question was 25); we shall return to this later.

Machine marking

The algorithm embodied in the automatic marker compares a student diagram with the sample solution and derives a measure of similarity (a value in the range 0 to 1). To derive the similarity measure, both the sample solution and the student diagram are decomposed into their constituent relationships, and the ‘best’ match between the two sets of relationships is determined. This process matches pairs of relationships, one from the student answer and one from the sample solution, and assigns a similarity measure to each pair. The similarity between the two diagrams is based on an aggregation of the similarities of the relationship pairs.
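The decompose-match-aggregate idea can be illustrated with a minimal sketch. This is not the tool's actual algorithm: the relationship fields (`entity_a`, `entity_b`, `name`, `cardinality`), the equal-weight per-field scoring, and the greedy pairing strategy are all simplifying assumptions made here for illustration.

```python
def relationship_similarity(student_rel, solution_rel):
    """Hypothetical per-pair similarity: the fraction of matching
    components. The real marker's measure is more sophisticated."""
    fields = ("entity_a", "entity_b", "name", "cardinality")
    matches = sum(student_rel.get(f) == solution_rel.get(f) for f in fields)
    return matches / len(fields)

def diagram_similarity(student_rels, solution_rels):
    """Greedily pair each solution relationship with its best-scoring
    unused student relationship, then average the pair scores over
    the solution's relationships (a value in the range 0 to 1)."""
    if not solution_rels:
        return 0.0
    unused = list(student_rels)
    scores = []
    for sol in solution_rels:
        if not unused:
            scores.append(0.0)  # no student relationship left to pair
            continue
        best = max(unused, key=lambda s: relationship_similarity(s, sol))
        scores.append(relationship_similarity(best, sol))
        unused.remove(best)  # each student relationship is used at most once
    return sum(scores) / len(solution_rels)
```

For example, a student diagram that reproduces a one-relationship solution exactly scores 1.0, while one that gets the entities and cardinality right but names the relationship differently scores 0.75 under this toy measure. A greedy pairing is used here for brevity; an optimal assignment (e.g. via the Hungarian algorithm) would be a natural alternative.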
Finally, the mark scheme is applied (effectively, 6 marks were available for each correct relationship and 1 mark for the correct identification of the entities – this exactly mirrored the instructions given to the human markers). This can be viewed as a shallow approach to determining similarities.

Initial comparison of marks

The marks for the automated tool were compared with the moderated human marks as follows. In the first experiment there were 26 student answers in the marking sample (all were from student volunteers). The first comparison used simple descriptive statistics; the results are shown in Table 1.

           Mean     St. Dev   Range
  Human    21.27    3.436     13 – 25
  Machine  22.08    2.497     15 – 25

Table 1: Descriptive statistics (N=26)

The descriptive statistics show that the machine marker is the more lenient marker, by one mark per student on average. There is a major discrepancy in the standard deviation, with the spread of human marks being much greater than that of the machine marker, a result confirmed by the range of marks awarded. This was not an unexpected result, because our experiments with the automatic marking of text have consistently shown the machine marker to have a narrower spread than the human markers.

The next test looks at correlations. Table 2 shows the results of three tests of correlation.

  Correlation   Coefficient   Significance level
  Pearson       0.939         0.01, 2-tailed
  Spearman      0.953         0.01, 2-tailed
  Kendall       0.889         0.01, 2-tailed

Table 2: Correlation tests

The Pearson correlation coefficient is a (parametric) measure of the closeness of the two sets of marks, whereas Spearman’s rho coefficient is a non-parametric test which measures how closely the two sets of marks rank the students. In both cases the results are extremely good, showing very close correlation. Kendall’s tau-b statistic is another measure of rank ordering which corrects for ties (of which there are some in this data), and again shows good correlation.

[Figure: Human (moderated) v Machine (original)]
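The parametric measure used above, the Pearson product-moment coefficient, is straightforward to compute directly; the following is a minimal self-contained sketch (the paper's own figures were presumably produced with a statistics package, and the mark data shown in the usage comment below is invented for illustration).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length
    lists of marks: covariance divided by the product of the
    standard deviations (here via sums of squared deviations)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (invented) human and machine marks for five answers:
# human   = [21, 25, 18, 23, 20]
# machine = [22, 25, 19, 24, 21]
# pearson(human, machine) would be close to 1, indicating that the
# two markers move together, as in Table 2.
```

Note that a high Pearson coefficient shows the two sets of marks vary together linearly; it does not by itself show they agree in absolute level, which is why the descriptive statistics in Table 1 and the rank-based Spearman and Kendall tests are also reported.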



Publication date: 2004