State

We want to determine whether a machine-learning–based classifier (RandomForest) performs better than a rule-based detector when both models are applied to the same set of test URLs.

Because each URL is evaluated by both classifiers, the data are paired, and the outcome for each classifier is categorical (correct or incorrect). Therefore, the appropriate analysis is McNemar's test, a chi-square test for paired categorical data.

Let
\(p_b\) = proportion of URLs where the Rule-based model is correct and the ML model is incorrect
\(p_c\) = proportion of URLs where the ML model is correct and the Rule-based model is incorrect


Plan

Overall Classification Error

Null Hypothesis

There is no difference in performance between the two classifiers: on URLs where they disagree, each classifier is equally likely to be the correct one.

Alternative Hypothesis

The ML-based classifier makes fewer errors than the rule-based classifier: on URLs where they disagree, the ML-based classifier is more often the correct one.


Conditions

  1. Paired data: Each URL is classified by both models.
  2. Categorical outcomes: Each classification is either correct or incorrect.
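  3. Large sample condition: The total number of discordant pairs, \(b + c\) (URLs on which exactly one classifier is correct), must be large enough for the chi-square approximation to hold.

    From the output, \(b + c = 4{,}750 + 46{,}002 = 50{,}752\), which far exceeds the thresholds commonly used for this test.

    All conditions are satisfied.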

Do

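Using the chi-square test for paired categorical data (McNemar's test, shown here without continuity correction), the test statistic is

\[ \chi^2 = \frac{(b - c)^2}{b + c} \]

where \(b\) is the number of URLs the Rule-based model classifies correctly and the ML model misclassifies, and \(c\) is the number of URLs the ML model classifies correctly and the Rule-based model misclassifies. Under the null hypothesis, the statistic follows a chi-square distribution with 1 degree of freedom.

Substituting the observed values:

\[ \chi^2 = \frac{(4{,}750 - 46{,}002)^2}{4{,}750 + 46{,}002} = \frac{41{,}252^2}{50{,}752} \approx 33{,}530 \]

The corresponding p-value is effectively zero: a chi-square statistic this large with 1 degree of freedom gives \(p < 0.0001\).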
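The calculation can be reproduced with a few lines of Python. This is a minimal sketch, assuming the uncorrected form of McNemar's statistic; the function name `mcnemar_chi2` is illustrative and not part of the original analysis. It uses only the standard library, relying on the fact that the upper tail of a chi-square distribution with 1 degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math


def mcnemar_chi2(b: int, c: int) -> tuple:
    """Return (statistic, two-sided p-value) for McNemar's test.

    b, c are the two discordant counts. No continuity correction is applied
    (an assumption; the corrected form uses (|b - c| - 1)**2 instead).
    """
    if b + c == 0:
        raise ValueError("McNemar's test needs at least one discordant pair")
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of chi-square with 1 df: P(X >= stat) = erfc(sqrt(stat / 2)).
    p_two_sided = math.erfc(math.sqrt(stat / 2))
    return stat, p_two_sided


# Discordant counts from the paired 2x2 table.
b, c = 4_750, 46_002
stat, p_two_sided = mcnemar_chi2(b, c)

# The alternative is one-sided (ML makes fewer errors). Because c > b, the
# observed direction agrees with it, so the one-sided p-value is half the
# two-sided one.
p_one_sided = p_two_sided / 2 if c > b else 1 - p_two_sided / 2

print(f"chi-square = {stat:.2f}")
print(f"one-sided p-value = {p_one_sided:.3g}")
```

With a statistic this large, the p-value underflows to zero in double precision, so the code can only confirm that it is smaller than any representable positive number.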


Conclude

Because the p-value is far smaller than the significance level \(\alpha\), we reject the null hypothesis.

There is extremely strong statistical evidence that the ML-based classifier has a lower overall error rate than the rule-based detector when applied to the same URLs.
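To put the conclusion in terms of error rates, the sketch below derives each classifier's overall error rate from the four cell counts of the paired 2×2 table in this section. The variable names (`a`, `d`, `rule_error`, `ml_error`, `gap`) are illustrative choices, not taken from the original output. It also shows that the gap between the two error rates depends only on the discordant cells, which is why the test uses only those.

```python
# Cell counts copied from the paired 2x2 table (test set).
a = 61_645  # both classifiers correct
b = 4_750   # Rule correct, ML wrong
c = 46_002  # Rule wrong, ML correct
d = 8_810   # both classifiers wrong
n = a + b + c + d

rule_error = (c + d) / n  # share of URLs in the "Rule Wrong" row
ml_error = (b + d) / n    # share of URLs in the "ML Wrong" column

# The concordant cells cancel, so the gap equals (c - b) / n.
gap = rule_error - ml_error

print(f"rule-based error rate: {rule_error:.2%}")
print(f"ML-based error rate:   {ml_error:.2%}")
print(f"difference:            {gap:.2%}")
```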

Paired 2×2 Table #1: Overall Correctness (TEST)

Rows = Rule-based classifier, Columns = ML-based classifier

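|              | ML Correct | ML Wrong | Row Total |
|--------------|-----------:|---------:|----------:|
| Rule Correct |     61,645 |    4,750 |    66,395 |
| Rule Wrong   |     46,002 |    8,810 |    54,812 |
| Column Total |    107,647 |   13,560 |   121,207 |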

Discordant cells (used in paired chi-square / McNemar test):

  • \(b = 4{,}750\) (Rule correct, ML wrong)
  • \(c = 46{,}002\) (Rule wrong, ML correct)
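A table like this can be built directly from per-URL predictions. The following is a minimal sketch on toy data: the function `paired_table` and the lists `y_true`, `rule_pred`, and `ml_pred` are hypothetical stand-ins, since the original pipeline and data layout are not shown here.

```python
from collections import Counter


def paired_table(y_true, rule_pred, ml_pred):
    """Count the four cells of the paired correctness table.

    Returns (a, b, c, d):
      a = both correct, b = Rule correct / ML wrong,
      c = Rule wrong / ML correct, d = both wrong.
    """
    counts = Counter(
        (rule == truth, ml == truth)
        for truth, rule, ml in zip(y_true, rule_pred, ml_pred)
    )
    return (
        counts[(True, True)],
        counts[(True, False)],
        counts[(False, True)],
        counts[(False, False)],
    )


# Toy labels (1 = malicious, 0 = benign), for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
rule_pred = [1, 0, 0, 0, 1, 0]
ml_pred = [1, 1, 1, 1, 0, 1]

a, b, c, d = paired_table(y_true, rule_pred, ml_pred)
print(f"a={a}, b={b}, c={c}, d={d}")
```

Each URL contributes to exactly one cell, so the four counts always sum to the number of URLs evaluated.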