Normalization

Problem A

4.1 Parameter selection:

● Popularity and Accessibility.
Our team chose five indicators: the number of participants, the number of participating countries, on-site heat, online heat, and prize money.
First, the number of participants in a competition is a key factor reflecting the popularity and attractiveness of a sport, and it relates directly to the Popularity criterion. A large number of participants indicates that the sport has a broad mass base and appeal.
The number of participating countries reflects how globalized a sport is. Although it relates mainly to Inclusivity, it also reflects the heat of a sport: if a sport is practiced in a large number of countries, it has greater potential for global promotion.
Counting participating countries is also a better indicator of a sport’s global popularity than the number of participants alone, because it shows, for example, whether a sport is popular in only one region.
The on-site (live) heat index is a key measure of Popularity and Accessibility. It combines the number of spectators, the number of days the games lasted, the number of seats, and the attendance rate. Live heat directly reflects the value of the Games and is a better indicator of support for the Games than the number of online viewers.
The online heat index also reflects the worldwide popularity of a sport. Many people prefer to stay at home and watch the events rather than attend in person, and this audience needs to be counted as well.
Prize money measures the marketability of a sport: tournaments with high prize money tend to attract more athletes and may have a slight impact on visibility.
● Gender Equity.
Gender Equity is straightforward: we use the numbers of male and female athletes as variables and compare them to determine whether a sport achieves gender equity.

● Sustainability.
In this section we combine several sustainability indicators, namely resource recyclability, maintenance cost per square meter, and construction cost per square meter, to determine whether the sustainability criterion is met.
● Inclusivity.
We use a composite inclusivity index to determine whether a sport is inclusive. It covers the participation of disadvantaged groups, cultural adaptability, geographic balance, and economic inclusivity. The participation of disadvantaged groups always deserves attention, and good sports tend to be inclusive of people with different abilities. Cultural adaptability and geographic balance determine whether a sport can be widely accepted around the world, while economic inclusivity concerns whether specific groups can afford the costs of the sport (e.g., equipment and training costs).
● Relevance and Innovation.
A sport’s relevance is more abstract, so we use ten indicators to determine its Spectator Appeal: Dynamism, Confrontation, Skill, Strategy, Variety, Audiovisual Experience, Accessibility, Cultural Implications, Participation, and Aesthetics. These are the elements that a sport needs to have.
● Safety and Fair Play.
Safety and Fair Play covers whether a sport is prone to safety incidents and whether it suffers from significant inequity, both of which directly affect its viability. We synthesize one indicator from three variables: the combined number of injuries, the percentage of doping, and the fairness of judging.

4.2 Parameter Normalization:

Before selecting and training the model, we need to normalize all the data, because many variables have intricate relationships that need to be adjusted for. After normalization, we merged the variables and retained 10 valid variables for the final model.

Participants: Log Normalization

Looking at the distribution of participants, we see that the data spans a very wide range, with a mean of about 1209 and a standard deviation of 1837, and has a distinct long tail (a few events have extremely high participant counts). Log normalization significantly compresses the larger values while preserving the nuances of the smaller values, making the results smoother and more suitable for subsequent analysis. According to the formula

x^{'} = \frac{\log ( x + 1 )}{\log ( \max ( x ) + 1 )}

we can derive the normalized variable.
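The log normalization formula can be sketched in a few lines. This is a minimal illustration, assuming non-negative counts; the function name `log_normalize` and the sample values are hypothetical and not taken from the dataset.

```python
import math

def log_normalize(values):
    """Scale non-negative values to [0, 1] via log(x + 1) / log(max(x) + 1)."""
    max_value = max(values)
    if max_value <= 0:
        # All zeros: log(max + 1) would be 0, so return zeros (assumption).
        return [0.0 for _ in values]
    denom = math.log(max_value + 1)
    return [math.log(x + 1) / denom for x in values]

# Hypothetical participant counts spanning several orders of magnitude.
participants = [0, 9, 99, 9999]
normalized = log_normalize(participants)
print(normalized)
```

With these values, a count of 99 maps to about 0.5 even though it is only 1% of the maximum, which shows how the long tail is compressed.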

Number of Participating Countries: Kernel Density Estimation (KDE) Normalization

We found that the number of participating countries follows a clear multimodal distribution: approximately 20 events are clustered around 100 and 190 participating countries, while the rest involve fewer than 80 countries. We use Kernel Density Estimation (KDE) with a Gaussian kernel as the smoothing function to estimate the probability density function (PDF) of the data. This method adapts well to multimodal distributions, is robust against outliers, and requires no assumptions about the underlying distribution.

Constructing the Kernel Density Estimation:
- Use KDE to generate the PDF of the data $f (x)$ :
  $f (x) = \frac{1}{nh} \sum_{i = 1}^{n} K (\frac{x - x _{i}}{h})$
  where:
  - $n$ is the number of data points;
  - $h$ is the bandwidth parameter;
  - $K$ is the kernel function, commonly the Gaussian kernel:
    $K (u) = \frac{1}{\sqrt{2 π}} e^{- \frac{u ^{2}}{2}}$

Based on the estimated PDF, we can calculate the cumulative distribution function (CDF) for each data point and then normalize the data using the CDF values.

Calculate the Cumulative Distribution Function (CDF) and Normalize:
- The CDF $F (x)$ is the integral of the PDF:
  $x^{'} = F (x) = \int_{- \infty}^{x} f (t) d t$
- In practice, the CDF is estimated for discrete points, and numerical integration is applied to the PDF from KDE.
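
The KDE-based normalization can be sketched as follows. For a Gaussian kernel, the integral of the KDE has a closed form (the average of normal CDFs centered at each data point), so this sketch uses that closed form instead of numerical integration; that is a substitution for illustration, not necessarily what the original analysis did. The bandwidth rule (normal reference rule of thumb), the function names, and the sample data are assumptions.

```python
import math
import statistics

def gaussian_cdf(u):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def kde_cdf_normalize(values, bandwidth=None):
    """Map each value x to F(x), the CDF of a Gaussian KDE fitted to values."""
    n = len(values)
    if bandwidth is None:
        # Assumption: normal reference rule of thumb; the text does not state h.
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    return [
        sum(gaussian_cdf((x - xi) / bandwidth) for xi in values) / n
        for x in values
    ]

# Hypothetical numbers of participating countries with two high clusters.
countries = [12, 25, 40, 60, 75, 100, 105, 185, 190, 195]
scores = kde_cdf_normalize(countries)
print([round(s, 3) for s in scores])
```

Because the CDF is monotone, the ordering of events is preserved while values inside a dense cluster are spread out more than values in sparse regions.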

Audience Count, Duration, Seating Capacity, Attendance Rate: Entropy Weighting Method Normalization

Next, we determined an indicator to assess the popularity of a sports event. We used audience count, event duration, seating capacity, and attendance rate as components to calculate a combined indicator. The entropy weighting method, which is based on information entropy, was employed to measure the information content of each indicator, automatically assigning weights and reducing human bias.

Step 1: Data Normalization

Raw data were normalized to the range [0, 1] to facilitate subsequent calculations. Since the data distribution showed no special characteristics, Min-Max normalization was used:

z_{ij} = \frac{x _{ij} - min ( x _{j} )}{max ( x _{j} ) - min ( x _{j} )}

Step 2: Calculate Proportions $p_{ij}$

After normalization, the proportion of each sample value within an indicator was calculated:

p_{ij} = \frac{z _{ij}}{\sum _{i = 1}^{n} z _{ij}}

Step 3: Calculate Information Entropy $E_{j}$

Using the proportions $p_{ij}$ , the information entropy for each indicator was computed:

E_{j} = - k \sum_{i = 1}^{n} p_{ij} \ln (p_{ij})

k = \frac{1}{\ln ( n )}

where $p_{ij} ln (p_{ij})$ is defined as 0 when $p_{ij} = 0$ .

Step 4: Calculate Weights $w_{j}$

Weights for each indicator were derived from the information entropy $E_{j}$ :

w_{j} = \frac{1 - E _{j}}{\sum _{j = 1}^{m} ( 1 - E _{j} )}

Step 5: Calculate Composite Scores $S_{i}$

The composite score for each sample was calculated as follows:

S_{i} = \sum_{j = 1}^{m} w_{j} \cdot z_{ij}
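
Steps 1 to 5 can be sketched as below. This is a minimal illustration under stated assumptions: data are stored column by column, a constant column is treated as carrying no information (weight 0), and all names and sample values are hypothetical.

```python
import math

def min_max(column):
    """Step 1: Min-Max normalization of one indicator to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def entropy_weights(z_columns):
    """Steps 2-4: proportions p_ij, entropy E_j, and weights w_j."""
    n = len(z_columns[0])
    k = 1.0 / math.log(n)
    diversities = []
    for col in z_columns:
        total = sum(col)
        if total == 0:
            # Assumption: a constant column is uninformative, so 1 - E_j = 0.
            diversities.append(0.0)
            continue
        entropy = 0.0
        for z in col:
            p = z / total
            if p > 0:  # p * ln(p) is defined as 0 when p = 0
                entropy -= k * p * math.log(p)
        diversities.append(1.0 - entropy)
    denom = sum(diversities)
    if denom == 0:
        # Assumption: fall back to equal weights if every column is constant.
        return [1.0 / len(z_columns) for _ in z_columns]
    return [d / denom for d in diversities]

def composite_scores(raw_columns):
    """Step 5: weighted sum of the normalized indicators for each sample."""
    z_columns = [min_max(c) for c in raw_columns]
    weights = entropy_weights(z_columns)
    n = len(z_columns[0])
    scores = [
        sum(w * col[i] for w, col in zip(weights, z_columns)) for i in range(n)
    ]
    return weights, scores

# Hypothetical data for four events.
spectators = [12000, 45000, 30000, 80000]
days = [3, 10, 7, 16]
seats = [15000, 50000, 40000, 90000]
attendance = [0.80, 0.90, 0.75, 0.89]
weights, scores = composite_scores([spectators, days, seats, attendance])
print(weights, scores)
```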

Intent: Jaccard Similarity Normalization

To evaluate the popularity of a sports event, we also examined its online performance by analyzing user search data. Search intent is considered a less significant factor and includes four types: Informational, Navigational, Transactional, and Local. These types may coexist; thus, we combined the presence of these four boolean values into one indicator. To emphasize the importance of covering multiple intent types, we used Jaccard Similarity for normalization. Jaccard Similarity measures the similarity between two sets as the ratio of their intersection to their union.

Step 1: Define the Sample Set $A$ and the Ideal Set $B$

Sample Set $A$

For each sample, extract fields with intent type equal to 1 to construct a set:

For example, for sample data $[1, 0, 1, 0]$ : $A = \{ \text{Informational}, \text{Transactional} \}$

Ideal Set $B$

An ideal set $B$ was defined based on the evaluation goals of Olympic sports:

If the Olympics emphasize economic benefits and promotional value: $B = \{ \text{Informational}, \text{Transactional} \} = [1, 0, 1, 0]$

Step 2: Compute Jaccard Similarity

S = \frac{| A \cap B |}{| A \cup B |}

$A$ : Set of intent types with a value of 1 in the current sample.
$B$ : Ideal set of intent types (e.g., $[1, 0, 1, 0]$ for Informational and Transactional).
$S$ : Similarity score in the range [0, 1], where 1 indicates a perfect match and 0 indicates no match.
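
The intent score can be illustrated with Python sets. This is a small sketch that assumes the flag order Informational, Navigational, Transactional, Local and the ideal set $[1, 0, 1, 0]$ given above; the helper names are hypothetical, and returning 0 for two empty sets is an assumption.

```python
INTENT_TYPES = ("Informational", "Navigational", "Transactional", "Local")

def intent_set(flags):
    """Build the set of intent types whose flag equals 1."""
    return {name for name, flag in zip(INTENT_TYPES, flags) if flag == 1}

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    union = a | b
    if not union:
        return 0.0  # Assumption: two empty sets score 0.
    return len(a & b) / len(union)

ideal = intent_set([1, 0, 1, 0])
print(jaccard(intent_set([1, 0, 1, 0]), ideal))  # perfect match
print(jaccard(intent_set([1, 1, 1, 1]), ideal))  # all four types present
print(jaccard(intent_set([0, 1, 0, 1]), ideal))  # no overlap
```

Note that a sample covering all four intent types scores 0.5 rather than 1, because types outside the ideal set enlarge the union.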

Online Popularity: PCA-Based Weight Generation and Normalization

For the calculated intent index and other variables, we applied a weighted sum approach using PCA to generate weights automatically. PCA avoids human intervention and provides strong interpretability as weights directly reflect each variable’s contribution to the overall variance.

Step 1: Data Standardization

The raw data matrix $X$ was standardized so that each variable had a mean of 0 and a standard deviation of 1:

z_{ij} = \frac{x _{ij} - μ _{j}}{σ _{j}}

$z_{ij}$ : Standardized value of the $i$ th sample for the $j$ th variable.
$μ_{j}$ : Mean of the $j$ th variable.
$σ_{j}$ : Standard deviation of the $j$ th variable.

This yields a standardized data matrix $Z \in R^{n \times m}$ , where $n$ is the number of samples and $m$ is the number of variables.

Step 2: Compute Covariance Matrix

The standardized matrix $Z$ is used to compute the covariance matrix $C$ :

C = \frac{1}{n - 1} Z^{T} Z

$C \in R^{m \times m}$ : Covariance matrix representing variable correlations.

Step 3: Eigenvalue and Eigenvector Decomposition

Decompose the covariance matrix $C$ to obtain:

Eigenvalues $λ_{i}$ : Variance contribution of each principal component.
Eigenvectors $a_{i}$ : Directions of each principal component.

C a_{i} = λ_{i} a_{i}

Step 4: Select Principal Components

Compute the variance ratio for each principal component:

\text{Variance Ratio}_{i} = \frac{λ_{i}}{\sum_{l = 1}^{m} λ_{l}}

Select the first $k$ components based on cumulative variance ratio (e.g., 80%-90%):

\text{Cumulative Variance Ratio} = \frac{\sum_{i = 1}^{k} λ_{i}}{\sum_{l = 1}^{m} λ_{l}}

Step 5: Compute PCA Weights

For the first principal component $a_{1}$ , the weight of each variable is the absolute value of its eigenvector component, normalized so that the weights sum to 1:

w_{j}^{PCA} = \frac{| a_{1 j} |}{\sum_{j = 1}^{m} | a_{1 j} |}

Step 6: Compute Composite Scores

The composite score for each sample is calculated as:

S_{i} = \sum_{j = 1}^{m} w_{j}^{PCA} \cdot z_{ij}
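
The weight computation from the first principal component can be sketched without external libraries. To stay dependency-free, this sketch finds the leading eigenvector by power iteration rather than a full eigendecomposition; that is a swapped-in technique for illustration, and a linear algebra library would normally be used. The variable names and sample data are hypothetical.

```python
import math
import statistics

def standardize(columns):
    """Step 1: z-score each variable (sample standard deviation, n - 1)."""
    out = []
    for col in columns:
        mu = statistics.mean(col)
        sigma = statistics.stdev(col)
        out.append([(x - mu) / sigma for x in col])
    return out

def covariance_matrix(z_columns):
    """Step 2: C = Z^T Z / (n - 1) for standardized columns."""
    n = len(z_columns[0])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
        for ci in z_columns
    ]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration; assumes the start vector is not orthogonal to a_1."""
    m = len(matrix)
    v = [float(i + 1) for i in range(m)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def pca_weights_and_scores(columns):
    """Steps 5-6: weights from |a_1j| and composite scores per sample."""
    z = standardize(columns)
    a1 = leading_eigenvector(covariance_matrix(z))
    total = sum(abs(x) for x in a1)
    weights = [abs(x) / total for x in a1]
    n = len(z[0])
    scores = [sum(w * col[i] for w, col in zip(weights, z)) for i in range(n)]
    return weights, scores

# Hypothetical online-popularity variables for five events.
search_volume = [10, 20, 30, 40, 50]
social_mentions = [12, 18, 33, 37, 52]
intent_index = [0.5, 0.0, 1.0, 0.67, 1.0]
weights, scores = pca_weights_and_scores(
    [search_volume, social_mentions, intent_index]
)
print(weights, scores)
```

Because the scores are built from standardized variables, they are centered at zero and may be negative; a further rescaling step would be needed to bring them into [0, 1].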

Prize Money: Robust S-Normalization

Prize money is another key indicator of a sport’s attractiveness, with higher prize money likely drawing more athletes. Prize money follows a long-tail distribution, with most events offering none and a few offering substantial sums. To address this, we used Robust S-Normalization, which smooths data around the median, suppresses outlier effects, and enhances contrast in the central range.

1. Compute Robust Statistics

From the distribution $x_{j}$ , calculate:

Median: $Median (x_{j})$
Interquartile Range (IQR): $IQR (x_{j}) = Q_{3} - Q_{1}$

2. Robust Normalization Formula

Apply the robust S-Normalization formula to scale data into the range (0, 1):

z_{ij} = \frac{1}{1 + exp ( - k \cdot \frac{x _{ij} - Median ( x _{j} )}{IQR ( x _{j} )} )}

$z_{ij}$ : Normalized value.

$x_{ij}$ : Original value for the $i$ th sample and $j$ th variable.
$k$ : Controls the steepness of the curve.

The Role of $k$

$k$ determines the steepness of the S-shaped curve:
- Large $k$ : The data distribution becomes more concentrated, with enhanced contrast in the middle range and extreme values rapidly approaching 0 or 1.
- Small $k$ : The data distribution is smoother, with greater influence of extreme values on the normalization results.

Due to the evident long-tail distribution in the data, we use a dynamic adjustment based on standard deviation: $k = \frac{1}{σ _{j}}$
- $σ_{j}$ is the standard deviation of the $j$ -th variable.
- Advantages: Sensitive to the data’s dispersion, suitable for high-variance data.
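
The robust S-normalization can be sketched as follows. Two points are assumptions rather than part of the text: the quartile method (`inclusive`) and the fallback when the IQR is zero, which can happen when most events offer no prize money. The function name and sample values are hypothetical.

```python
import math
import statistics

def robust_sigmoid_normalize(values, k=None):
    """Sigmoid of k * (x - median) / IQR, mapping values into [0, 1]."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Assumption: avoid division by zero when Q1 equals Q3.
        iqr = 1.0
    if k is None:
        # Dynamic steepness from the text: k = 1 / standard deviation.
        sigma = statistics.stdev(values)
        k = 1.0 / sigma if sigma > 0 else 1.0
    result = []
    for x in values:
        t = k * (x - median) / iqr
        t = max(-700.0, min(700.0, t))  # keep math.exp within float range
        result.append(1.0 / (1.0 + math.exp(-t)))
    return result

# Hypothetical prize money with a long right tail.
prizes = [0, 0, 1000, 5000, 20000, 100000, 1000000]
dynamic_k = robust_sigmoid_normalize(prizes)
fixed_k = robust_sigmoid_normalize(prizes, k=1.0)
print(dynamic_k)
print(fixed_k)
```

In this sketch the median always maps to 0.5. With raw currency values, the standard deviation is large, so $k = 1 / σ_{j}$ is very small and the scores stay close to 0.5; the fixed $k = 1$ run is included only to show the contrast.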

Gender Ratio: Nonlinear Proportional Reinforcement

For evaluating gender equality in the Olympics, we emphasize the importance of a balanced gender ratio (1:1) while avoiding undue influence from extreme scenarios. In this context, we chose a nonlinear proportional formula, which ensures that the index sharply declines to 0 when gender inequality is severe:

R = 1 - (\frac{| M - F |}{M + F})^{k}

$M$ : Number of males.
$F$ : Number of females.
$k > 1$ : Nonlinear reinforcement parameter to reward near-balance (we use $k = 3$ ).

When $M ≫ F$ or $F ≫ M$ , the score decreases rapidly, discouraging sports with significant gender imbalance.

Special cases: Events such as synchronized swimming that support only one gender are excluded from this index.
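
A short worked example of the gender ratio index with $k = 3$ is given below. The function name is hypothetical, and raising an error when there are no athletes is an assumption.

```python
def gender_balance_index(males, females, k=3):
    """R = 1 - (|M - F| / (M + F)) ** k; 1 means perfect balance."""
    total = males + females
    if total == 0:
        raise ValueError("at least one athlete is required")
    return 1.0 - (abs(males - females) / total) ** k

print(gender_balance_index(50, 50))   # balanced
print(gender_balance_index(75, 25))   # 3:1 ratio
print(gender_balance_index(100, 0))   # single gender
```

A 3:1 ratio still scores 0.875, while the score falls to 0 only as one gender disappears, so the exponent mainly penalizes severe imbalance.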

Resource Recycling, Maintenance Costs, and Construction Costs: Linear/Logarithmic + Weighted Sum

Normalization of these three variables considers their characteristics and data distribution:

Resource recycling: Higher values are better.
Maintenance costs: Lower values are better.
Construction costs: Lower costs per unit area are better.

Data analysis shows that resource recycling and maintenance costs follow approximately normal distributions, while construction costs are strongly right-skewed.

For resource recycling, we use Min-Max Normalization: $z_{ij} = \frac{x _{ij} - min ( x _{j} )}{max ( x _{j} ) - min ( x _{j} )}$
For maintenance costs, we use Inverse Normalization: $z_{ij} = 1 - \frac{x _{ij} - min ( x _{j} )}{max ( x _{j} ) - min ( x _{j} )}$
For unit-area construction costs, we use Logarithmic Normalization to suppress the influence of extreme values: $z_{ij} = \frac{\log ( x_{ij} + 1 ) - \min ( \log ( x_{j} + 1 ))}{\max ( \log ( x_{j} + 1 )) - \min ( \log ( x_{j} + 1 ))}$

After normalization, weights $w_{j}$ are assigned to each variable to calculate the overall score:

S_{i} = \sum_{j = 1}^{m} w_{j} \cdot z_{ij}

$w_{j}$ : Weight of the $j$ -th variable.
$z_{ij}$ : Normalized value of the $i$ -th sample for the $j$ -th variable.
$m = 3$ : Number of variables.

The entropy weight method was used to calculate each weight:

Calculate the entropy of each variable: $E_{j} = - \frac{1}{\ln ( n )} \sum_{i = 1}^{n} p_{ij} \ln (p_{ij}), \quad p_{ij} = \frac{z_{ij}}{\sum_{i = 1}^{n} z_{ij}}$
Calculate the weights: $w_{j} = \frac{1 - E_{j}}{\sum_{j = 1}^{m} ( 1 - E_{j} )}$

Finally, calculate the weighted sum:

S_{i} = \sum_{j = 1}^{m} w_{j} \cdot z_{ij}

Inclusive Index: Weighted Average Integration

The inclusivity index covers the participation of vulnerable groups, cultural adaptability, regional balance, and economic inclusiveness. Since we treat these factors as equally important, we use a weight of 0.25 for each in the weighted sum:

S = \sum_{i = 1}^{4} w_{i} \cdot x_{i}

$x_{i}$ : Value of the $i$ -th variable (0, 0.5, or 1).
$w_{i}$ : Weight of the $i$ -th variable, where $\sum w_{i} = 1$ .

Spectator Appeal: Voting Model

Spectator Appeal includes the following indicators: dynamism, competitiveness, skill, strategy, variability, audiovisual experience, clarity, cultural depth, engagement, and aesthetics. Since these indicators are equally important, we calculate their average value directly.

Formula

S = \frac{1}{n} \sum_{i = 1}^{n} x_{i}

Calculate the proportion of boolean values equal to 1.
No weighting is required; the score is based solely on the number of “passed” indicators.

Applicable Scenarios

Suitable for simple scenarios where all indicators are equally weighted.

Injury Rate: Covariance Contribution Method

The injury rate includes three variables: total injuries as a proportion of participants, injuries lasting more than one day, and injuries lasting more than seven days. These can be considered total injuries, short-term injuries, and long-term injuries. Due to their hierarchical relationship, we use each variable’s contribution to overall covariance to determine weights.

Calculate the row sum of the covariance matrix:
$S_{j} = \sum_{i} Cov (x_{j}, x_{i})$
Assign weights:
$w_{j} = \frac{S _{j}}{\sum S _{j}}$

Finally, use the weighted sum:

S_{i} = \sum_{j = 1}^{m} w_{j} \cdot z_{ij}
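
The covariance contribution weights can be sketched as below. The sample covariance (dividing by $n - 1$) is an assumption, since the text does not say which estimator was used, and the injury rates are hypothetical.

```python
def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def covariance_contribution_weights(columns):
    """w_j = S_j / sum(S_j), where S_j is row j's sum in the covariance matrix."""
    row_sums = [sum(covariance(cj, ci) for ci in columns) for cj in columns]
    total = sum(row_sums)
    return [s / total for s in row_sums]

def weighted_scores(columns, weights):
    """Weighted sum of the variables for each sample."""
    n = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n)]

# Hypothetical injury rates for four sports.
total_injuries = [0.10, 0.20, 0.15, 0.30]
over_one_day = [0.05, 0.12, 0.08, 0.20]
over_seven_days = [0.01, 0.03, 0.02, 0.06]
columns = [total_injuries, over_one_day, over_seven_days]
weights = covariance_contribution_weights(columns)
scores = weighted_scores(columns, weights)
print(weights, scores)
```

The weights are positive here because the three rates move together; with negatively correlated variables a row sum could become negative, which this sketch does not guard against.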

Safety/Fair Play: Variance Weighting Method

For indicators related to safety and fair play, including the proportion of participants using doping, the injury index, and referee fairness, we use the variance weighting method, which assigns larger weights to indicators with higher variance, that is, to the indicators that differ most between sports.

Calculate the variance of each indicator:
$Var_{j} = \frac{\sum_{i = 1}^{n} ( x_{ij} - \bar{x}_{j} )^{2}}{n - 1}$
Calculate weights:
$w_{j} = \frac{Var _{j}}{\sum Var _{j}}$

Finally, calculate the weighted sum:

S_{i} = \sum_{j = 1}^{m} w_{j} \cdot z_{ij}
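
The variance weighting formulas can be sketched as follows; the weights are proportional to each indicator's sample variance. The indicator values are hypothetical and assumed to be already normalized to [0, 1].

```python
import statistics

def variance_weights(columns):
    """w_j = Var_j / sum(Var_j), using the sample variance (n - 1)."""
    variances = [statistics.variance(col) for col in columns]
    total = sum(variances)
    return [v / total for v in variances]

# Hypothetical normalized indicators for four sports.
doping_rate = [0.10, 0.40, 0.20, 0.90]
injury_index = [0.30, 0.35, 0.32, 0.40]
referee_fairness = [0.80, 0.60, 0.70, 0.50]
columns = [doping_rate, injury_index, referee_fairness]
weights = variance_weights(columns)
scores = [
    sum(w * col[i] for w, col in zip(weights, columns))
    for i in range(len(doping_rate))
]
print(weights, scores)
```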

4.3 Model Development:

In this study, we selected a random forest model as the final analytical tool. The random forest model is well-suited for handling high-dimensional and complex data, effectively capturing nonlinear relationships between multiple variables. This is particularly advantageous for our analysis involving diverse variables. Additionally, its built-in feature importance analysis can help identify key factors influencing the selection of Olympic sports, such as audience interest, sustainability, and safety. The ensemble learning mechanism of random forests enhances the model’s robustness and predictive accuracy, reducing the risk of overfitting that may arise with a single model. Based on these characteristics, the random forest model is highly appropriate for solving the multi-factor decision-making problem of Olympic event selection.

Data Preprocessing

Handling Missing Values
For missing values ( $NaN$ ) in the dataset, we applied mean imputation:
$x_{i, j} = \begin{cases} mean (\{ x_{k, j} \mid x_{k, j} \neq NaN \}), & \text{if } x_{i, j} = NaN \\ x_{i, j}, & \text{otherwise} \end{cases}$
Numerical Conversion
All non-numerical data types were converted to $NaN$ to ensure data consistency.
Feature Partitioning
The dataset was divided into multiple blocks $A, B, C, \dots, G$ , each containing a specific group of features:
$X_{k} = \{ x_{i, j} \mid j \in \text{Feature Set } k \},$
$k \in \{ A, B, C, D, E, F, G \}$ .
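
The numerical conversion and mean imputation steps can be sketched as below. This is a minimal illustration, assuming row-oriented data and that conversion to $NaN$ happens before imputation; the function names and sample rows are hypothetical.

```python
import math

def to_numeric(value):
    """Convert a value to float; anything non-numerical becomes NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

def impute_column_means(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    numeric = [[to_numeric(v) for v in row] for row in rows]
    n_cols = len(numeric[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in numeric if not math.isnan(row[j])]
        # A column with no observed values stays NaN (assumption).
        means.append(sum(observed) / len(observed) if observed else math.nan)
    return [
        [means[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in numeric
    ]

# Hypothetical raw rows with a text placeholder and a missing entry.
raw_rows = [[1, "n/a"], [3, 4.0], [None, 8]]
clean_rows = impute_column_means(raw_rows)
print(clean_rows)
```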

Random forest is an ensemble algorithm based on decision trees. It enhances prediction accuracy and stability by generating multiple decision trees and aggregating their results. It randomly selects data and features, performing well with high-dimensional data and complex problems.

The following outlines the random forest algorithm, including variables in the training dataset, the target variable $Y$ , and relevant formulas.

Input and Output

Input Data: The training dataset includes $n$ samples, each with $m$ features, aiming to predict a target variable $Y$ .
- Training dataset: $\{ (X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots, (X_{n}, Y_{n}) \}$
- Each sample $X_{i} = (x_{i 1}, x_{i 2}, \dots, x_{im})$ is an $m$ -dimensional feature vector.
Output Data: Prediction results $\hat{Y}$ from multiple trained decision trees.

Basic Steps

Random Sampling of Training Data:
- Randomly sample $n_{tree}$ subsets from the training data with replacement (forming bootstrap samples) for training each decision tree.
- Due to sampling with replacement, some samples may appear multiple times in a subset, while others may not appear at all.
Building Decision Trees:
- During training, feature selection is randomized for each tree.
  1. Randomly select $m_{split}$ features from all features for node splitting.
  2. Choose a split point based on the selected features (maximizing information gain or minimizing Gini impurity).
  3. Recursively repeat the process for each node until stopping conditions (e.g., maximum depth or minimum samples) are met.
- Gini Impurity:
  $Gini (D) = 1 - \sum_{i = 1}^{K} p_{i}^{2}$
  where $p_{i}$ is the proportion of class $i$ in dataset $D$ , and $K$ is the total number of classes.
- Information Gain:
  $IG (D, f) = Entropy (D) - \sum_{v \in \{ v_{1}, v_{2}, \dots, v_{k} \}} \frac{| D_{v} |}{| D |} Entropy (D_{v})$
  where $Entropy (D)$ is the entropy of dataset $D$ , $D_{v}$ is the subset resulting from splitting on value $v$ of feature $f$ , and $| D_{v} |$ and $| D |$ are the sample sizes of the subset and original dataset, respectively.
Repeat Steps:
- Repeat steps 1 and 2 to construct $n_{tree}$ decision trees.
Voting Mechanism (Regression: Average, Classification: Majority Vote):
- For regression, the final prediction of the random forest is the average prediction from all trees: $\hat{Y}_{rf} = \frac{1}{n_{tree}} \sum_{t = 1}^{n_{tree}} \hat{Y}_{t}$ . For classification, the final prediction is the class predicted by the majority of trees.
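
The building blocks described above (split criteria, bootstrap sampling, and vote aggregation) can be sketched as small functions. This is an illustration of the individual formulas, not a full random forest; base-2 logarithms for the entropy and all function names are assumptions.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits (base 2 is an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

labels = [1, 1, 0, 0]
print(gini(labels))                                # 0.5
print(information_gain(labels, [[1, 1], [0, 0]]))  # 1.0 for a perfect split
print(majority_vote([1, 0, 1]))                    # 1
print(bootstrap_sample(labels, random.Random(0)))
```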

4.4 Random Forest Model Training

We trained the model using a dataset with $n = 762$ samples, each with $m = 10$ features. The dataset included variables such as $x_{1}$ (number of participants), $x_{2}$ (number of participating countries), and $x_{3}$ (gender ratio). The target variable $Y$ was binary, indicating whether the condition was met (e.g., whether the event is included in the 2028 Los Angeles Olympics).

Random Forest Model Configuration

The random forest model was configured with $n_{tree} = 100$ decision trees. At each split, $m_{split} = 5$ features were randomly selected. The maximum tree depth was set to $d_{max} = 10$ , with a minimum sample size of 10.

Evaluation Results

The model achieved an accuracy of $0.73$ and an F1 score of $0.87$ on the test set, indicating strong classification performance. Feature importance analysis revealed that “number of participants” and “online popularity” were the most influential factors.

Summary

The random forest model was successfully trained on this dataset and outperformed traditional linear models. We believe that tuning the number and depth of trees could improve the model’s performance further.

Drawbacks:

High computational complexity: Training and prediction can be time-consuming for large datasets and many trees.
Limited interpretability: Unlike a single decision tree, a random forest combines many trees, which makes the decision-making process of each tree, and of the model as a whole, difficult to interpret.
