= Step 8: Statistical significance (paired tests) =
[Overall error] McNemar (Rule vs ML):
b=Rule correct & ML wrong: 4750
c=Rule wrong & ML correct: 46002
chi2=33528.6294, p=0
[False Positive on legit-only] McNemar (reducing false alarms):
b=Rule no-FP & ML FP: 4081
c=Rule FP & ML no-FP: 12188
chi2=4038.7999, p=0
[False Negative on phishing-only] McNemar (reducing misses):
b=Rule no-FN & ML FN: 669
c=Rule FN & ML no-FN: 33814
chi2=31856.9943, p=0
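
The chi2 values above match the continuity-corrected McNemar statistic, (|b - c| - 1)^2 / (b + c) with 1 degree of freedom. The log does not show the script that produced them, so the following is only a minimal sketch under that assumption. The function names `mcnemar_corrected` and `log10_p_large` are hypothetical, and the log10(p) estimate uses the large-argument approximation erfc(x) ~ exp(-x^2) / (x * sqrt(pi)), which is only meant to show the order of magnitude hidden behind a printed p=0.

```python
import math

def mcnemar_corrected(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar chi-square (1 df) and its p-value."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    # For very large chi2 this underflows to exactly 0.0 in double precision.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

def log10_p_large(chi2: float) -> float:
    """Approximate log10(p) for large chi2 (asymptotic erfc expansion)."""
    x = math.sqrt(chi2 / 2)
    return (-x * x - math.log(x * math.sqrt(math.pi))) / math.log(10)

# Discordant counts copied from the three McNemar blocks above.
for label, b, c in [("overall", 4750, 46002),
                    ("FP", 4081, 12188),
                    ("FN", 669, 33814)]:
    chi2, p = mcnemar_corrected(b, c)
    print(f"{label}: chi2={chi2:.4f}, p={p:.4g}, "
          f"log10(p)~{log10_p_large(chi2):.0f}")
```

Only the discordant counts b and c enter the statistic; pairs where both models are right or both are wrong do not affect it.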
[Effect size + 95% CI] Paired bootstrap (Rule rate - ML rate):
FP rate: Rule=0.2017, ML=0.0752, diff=0.1266, CI=(0.1228, 0.1305)
FN rate: Rule=0.7330, ML=0.1530, diff=0.5799, CI=(0.5757, 0.5842)
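
A paired bootstrap of this kind can be sketched as follows. The log does not say which bootstrap variant or how many resamples were used, so the percentile interval, `n_boot`, the seed, and the function name `paired_bootstrap_diff` are all assumptions. The toy error vectors are made up for illustration and are not the data behind the rates above.

```python
import random

def paired_bootstrap_diff(rule_err, ml_err, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for mean(rule_err) - mean(ml_err), resampling items jointly."""
    if len(rule_err) != len(ml_err):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    n = len(rule_err)
    # Per-item difference preserves the pairing: resampling an item
    # carries both models' outcomes for that item along with it.
    d = [r - m for r, m in zip(rule_err, ml_err)]
    point = sum(d) / n
    diffs = sorted(sum(rng.choices(d, k=n)) / n for _ in range(n_boot))
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return point, (lo, hi)

# Hypothetical toy data: 1000 legit items, 1 = false positive, 0 = correct.
# 60 items where both models err, 140 Rule-only, 40 ML-only, 760 neither.
rule_err = [1] * 60 + [1] * 140 + [0] * 40 + [0] * 760
ml_err   = [1] * 60 + [0] * 140 + [1] * 40 + [0] * 760

point, (lo, hi) = paired_bootstrap_diff(rule_err, ml_err)
print(f"diff={point:.4f}, CI=({lo:.4f}, {hi:.4f})")
```

Resampling items rather than resampling each model's errors independently is what makes the interval paired: it keeps the correlation between the two models' mistakes on the same item.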
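Interpretation tips:
- McNemar p < 0.05: the paired error rates of Rule and ML differ significantly; c > b means the difference favors ML.
- p=0 is floating-point underflow, not an exact zero; report it as p < 1e-300.
- chi2 values match the continuity-corrected statistic (|b - c| - 1)^2 / (b + c), 1 degree of freedom.
- Bootstrap CI excluding 0: the rate difference is unlikely to be a resampling artifact; a positive diff means ML has the lower rate.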