Beyond the sample size issue, there is the logistical challenge of ensuring that evaluators do not remember the attribute they initially assigned to a scenario when they see it a second time. This can be mitigated somewhat by increasing the sample size and, better yet, by waiting a while (perhaps one to two weeks) before the scenarios are presented to the reviewers again. Randomizing the order in which scenarios are presented from one pass to the next can also help. Evaluators also tend to behave differently when they know they are being examined, so the mere awareness that it is a test can skew the results. Concealing the test in some way can help, but it is almost impossible to achieve and raises ethical concerns. And beyond being at best marginally effective, these workarounds add complexity and time to an already demanding study. The audit should identify the specific people and codes that are the main sources of problems, and the attribute agreement assessment should determine the relative contributions of repeatability and reproducibility problems for those specific codes (and individuals). In addition, many bug databases have accuracy problems in the records that indicate where an error was introduced, because what gets recorded is where the error was detected, not where it was created. By the time an error is detected, there is often little information left to identify its origin, so the accuracy of the location assignment should also be part of the audit.

I entered all the defect results and evaluations into Minitab and ran the attribute agreement analysis. The agreement rates for "Within Appraisers" and "Appraiser vs Standard" were around 60 percent, and some kappa values were below 0.6. The result was quite bad. Attribute agreement analysis can be a great tool for detecting sources of inaccuracy in a bug tracking system, but it should be used with great care, consideration, and minimal complexity, if it is used at all. The best approach is to audit the database first and then use the results of that audit to perform a focused, streamlined analysis of repeatability and reproducibility. For example, if repeatability is the main problem, evaluators are confused or undecided about certain criteria. If reproducibility is the problem, evaluators hold strong opinions about certain conditions, but those opinions differ. If the problems show up across several evaluators, they are systemic or procedural.
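To make the two comparisons concrete, here is a minimal sketch of the "Within Appraisers" and "Appraiser vs Standard" checks computed with Cohen's kappa in Python. The original analysis was done in Minitab; the appraiser, the scenario codes, and the ratings below are made-up illustration data, not results from the study.

```python
# Hypothetical illustration of within-appraiser and appraiser-vs-standard agreement.
from sklearn.metrics import cohen_kappa_score

# One appraiser classified the same 10 scenarios twice (trial 1 and trial 2).
standard = ["UI", "Logic", "UI", "Data", "Logic", "UI", "Data", "Logic", "UI", "Data"]
trial1   = ["UI", "Logic", "UI", "Data", "Logic", "UI", "Data", "UI",    "UI", "Data"]
trial2   = ["UI", "Logic", "Data", "Data", "Logic", "UI", "Data", "Logic", "UI", "Data"]

# Within appraiser: how consistently one person repeats their own call.
within_kappa = cohen_kappa_score(trial1, trial2)

# Appraiser vs standard: how well a trial matches the known-correct code.
vs_standard_kappa = cohen_kappa_score(trial1, standard)

print(f"Within appraiser kappa:  {within_kappa:.2f}")
print(f"Appraiser vs standard:   {vs_standard_kappa:.2f}")
# Kappa values below roughly 0.6 are usually read as weak agreement,
# which is what made the result above "quite bad".
```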

If the problems concern only a few evaluators, they may simply require some individual attention. In either case, training or job aids can be tailored either to specific individuals or to all evaluators, depending on how many of them are assigning attributes inaccurately. The accuracy of a measurement system is analyzed by breaking it into two essential components: repeatability (the ability of a given evaluator to assign the same value or attribute repeatedly under the same conditions) and reproducibility (the ability of several evaluators to agree with one another across a set of conditions). In an attribute measurement system, repeatability or reproducibility problems inevitably cause accuracy problems. Moreover, if the overall accuracy, repeatability, and reproducibility are known, bias can be detected even in situations where decisions are consistently wrong. In this example, a repeatability assessment is used to illustrate the idea, but it applies equally to reproducibility. The point is that many samples are needed to detect differences in an attribute analysis, and doubling the number of samples from 50 to 100 does not make the test much more sensitive. The difference that needs to be detected naturally depends on the situation and on the level of risk the analyst is willing to accept in the decision, but the reality is that with 50 scenarios it will be difficult for an analyst to conclude that there is a statistically significant difference between the repeatability of two evaluators with match rates of 96 percent and 86 percent.
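A quick way to see the sample-size point is to run a two-proportion test on the two match rates. The sketch below assumes 50 (and then 100) scenarios per evaluator and counts chosen only to match 96 percent and 86 percent; it is an illustration of the statistical argument, not data from the study.

```python
# Hypothetical check: can 96% vs 86% repeatability be distinguished with n scenarios?
from statsmodels.stats.proportion import proportions_ztest

for n in (50, 100):
    matches = [round(0.96 * n), round(0.86 * n)]  # agreements for the two evaluators
    trials = [n, n]
    stat, p_value = proportions_ztest(matches, trials)
    print(f"n = {n:>3} per evaluator: z = {stat:.2f}, p = {p_value:.3f}")

# With 50 scenarios the p-value is typically above 0.05, so the analyst cannot
# confidently call the difference between the two evaluators real.
```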
