The results/numbers aren't interesting because the number of samples is woefully insufficient to draw any conclusions beyond "that's a nice-looking dashboard" or maybe "this is a cool idea".
You're right, the results and numbers are mainly for entertainment purposes. Still, this sample size would allow analyzing the main reasoning failure modes and how often they occur.