Video Game Rating — What’s in the Head of the Raters

Philip Fei-Ran Lee
8 min read · May 24, 2021

Arts and entertainment shape much of our worldview, and everyone has a stake in what games contain. The question here is which model tells us the most about game ratings and the rating process. As this note unfolds, XGBoost will perform best out of the four chosen models (CatBoost, XGBoost, Random Forest, and Logistic Regression; LightGBM also joins the comparison along the way). My surmise is that XGBoost and its boosting process reflect the rating process better than the Random Forest Classifier does.

Data Source, Structure, and Limited Cleaning:

The dataset is Kaggle’s video game rating dataset, uploaded in 2021, and it is relatively clean. The features are binary (0 and 1), indicating whether or not a game exhibits a particular characteristic: 0 is No, and 1 is Yes.

The training dataset has more than 1,800 rows and 32 variables. The testing dataset has more than 500 rows and the same number of variables.

Only a few rows (fewer than 10) had entry errors in which the final rating was entered in place of one of the factors, so I simply used .dropna() to drop them.
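For concreteness, here is a minimal sketch of the loading and cleaning steps (the file and column names are assumptions based on the Kaggle dataset, not code from the original analysis):

import pandas as pd

# Assumed file and column names for the Kaggle ESRB rating dataset.
train = pd.read_csv('Video_games_esrb_rating.csv').dropna()
test = pd.read_csv('test_esrb.csv')

X, y = train.drop(columns='esrb_rating'), train['esrb_rating']
X_test = test.drop(columns='esrb_rating')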

Also, I had to cast the violence column to integer:

X['violence'] = X['violence'].astype(int)
X_test['violence'] = X_test['violence'].astype(int)

Actual Modeling

The task is to predict which factors contribute the most to the eventual rating in four categories: E (Everyone), ET (Everyone 10+), T (Teen), and M (Mature). The structure of the training and test datasets shows that most games are rated T.

Baseline:

The class distribution is relatively balanced, and the frequency of the majority class, T (teen_rating), serves as the baseline accuracy:

The target variable, thus, is the ESRB rating, with T as the majority class.
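A quick way to compute that baseline, plus the assumed train/validation split behind the X_train and X_val names used below (a sketch; the split parameters are my assumption, though the default 25% validation share matches the 473-row support in the reports later):

from sklearn.model_selection import train_test_split

# The majority-class share is the baseline accuracy.
print(y.value_counts(normalize=True))

# Assumed split behind the X_train / X_val names used below.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)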

Results of Four Models

Logistic Regression:

from sklearn.linear_model import LogisticRegression

model_lgr = LogisticRegression(random_state=42)
model_lgr.fit(X_train, y_train)

Training acc 0.8674188998589563

Validation acc 0.8224101479915433

Random Forest Classifier:

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=250, random_state=42, max_features=10, max_depth=32, max_leaf_nodes=3500, n_jobs=-1, max_samples=.8)
model_rf.fit(X_train, y_train)

Training acc 0.9280677009873061

Validation acc 0.8350951374207188

model_rf.get_params

<bound method BaseEstimator.get_params of RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=32, max_features=10,
max_leaf_nodes=3500, max_samples=0.8,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=250,
n_jobs=-1, oob_score=False, random_state=42, verbose=0,
warm_start=False)>

CatBoost:

from catboost import CatBoostClassifier

model_ctb = CatBoostClassifier(max_depth=16, learning_rate=5, random_seed=42, n_estimators=200)
model_ctb.fit(X_train, y_train)

Training acc 0.9026798307475318

Validation acc 0.8012684989429175

XGBoost (the best):

from xgboost import XGBClassifier

model_xgb = XGBClassifier(max_depth=35, n_estimators=300, n_jobs=-1, random_state=42)
model_xgb.fit(X_train, y_train)

Training acc 0.9280677009873061

Validation acc 0.8625792811839323

model_xgb.get_params

<bound method XGBModel.get_params of XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=35,
min_child_weight=1, missing=None, n_estimators=300, n_jobs=-1,
nthread=None, objective='multi:softprob', random_state=42,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)>
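A fifth model, LightGBM, also enters the comparison below as model_lgbm. Its constructor call never appears in this post; a plausible stand-in, with hyperparameters assumed to be in the spirit of the other models, would be:

from lightgbm import LGBMClassifier

# Assumed definition; the original call is not shown in the post.
model_lgbm = LGBMClassifier(n_estimators=250, max_depth=32, n_jobs=-1, random_state=42)
model_lgbm.fit(X_train, y_train)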

The single-run training and validation accuracies of all five models are as below:

  1. Define a function for the accuracy-score checks:

from sklearn.metrics import accuracy_score

def check_metrics(model):
    training_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    return training_acc, val_acc

2. Define two lists to iterate over:

model_name = ["Logistic Regression", "Random Forest Classifier", "XGBClassifier", "CatBoostClassifier", "LightGBM"]
models = [model_lgr, model_rf, model_xgb, model_ctb, model_lgbm]

3. Loop over the lists with the function and print the results:

for m in range(len(models)):
    model_acc_t, model_acc_v = check_metrics(models[m])
    name = model_name[m]
    print(name + ': ' + str(model_acc_t) + '; ' + str(model_acc_v))

Logistic Regression: 0.8575458392101551; 0.8393234672304439
Random Forest Classifier: 0.9294781382228491; 0.8435517970401691
XGBClassifier: 0.9294781382228491; 0.8456659619450317
CatBoostClassifier: 0.9294781382228491; 0.828752642706131
LightGBM: 0.9217207334273625; 0.8498942917547568

As it happens, tuning XGBoost did not improve its validation accuracy, and this post will not discuss tuning. LightGBM actually edges XGBoost on this single split, but it falls slightly behind on the 100-run average and on the count of cases rated with above 95% certainty, as the sections below show, and so it was discarded. Before tuning, the choice came down to two classifiers: XGBoost and Random Forest.

Random Forest: Taking 100 Run-Time Validation Average

  1. Looping the Model 100 Times

list_acc_rf = []

for i in range(1, 101):
    model = RandomForestClassifier(n_estimators=250, random_state=i, max_features=10, max_depth=32, max_leaf_nodes=3500, n_jobs=-1, max_samples=.8)
    model.fit(X_train, y_train)
    list_acc_rf.append(accuracy_score(y_val, model.predict(X_val)))

2. Take the Average of the Appended List:

list_avg_acc_rf = sum(list_acc_rf) / len(list_acc_rf)
print(list_avg_acc_rf)

0.8447357293868916

LightGBM: Taking 100 Run-Time Validation Average

list_acc_lgbm = []

for i in range(1, 101):
    # Note: max_features, max_leaf_nodes, and max_samples are scikit-learn-style
    # names rather than native LightGBM parameters, so LightGBM likely ignores them.
    model = LGBMClassifier(n_estimators=250, random_state=i, max_features=10, max_depth=32, max_leaf_nodes=3500, n_jobs=-1, max_samples=.8)
    model.fit(X_train, y_train)
    list_acc_lgbm.append(accuracy_score(y_val, model.predict(X_val)))

list_avg_acc_lgbm = sum(list_acc_lgbm) / len(list_acc_lgbm)
print(list_avg_acc_lgbm)

0.8435517970401684

XGBoost: Taking 100 Run-Time Validation Average

  1. Looping the Model 100 Times

list_acc_xgb = []

for i in range(1, 101):
    model = XGBClassifier(max_depth=35, n_estimators=300, n_jobs=-1, random_state=i)
    model.fit(X_train, y_train)
    list_acc_xgb.append(accuracy_score(y_val, model.predict(X_val)))

2. Take the Average of the Appended List:

list_avg_acc_xgb = sum(list_acc_xgb) / len(list_acc_xgb)
print(list_avg_acc_xgb)

0.845665961945031
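Averaging over 100 seeds smooths out run-to-run noise; k-fold cross-validation is a more standard alternative for the same purpose (a sketch, not the method used in this post):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy; depending on your xgboost version,
# y may need to be label-encoded first.
scores = cross_val_score(model_xgb, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())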

XGBoost shows roughly a 0.1-percentage-point higher 100-run average accuracy than Random Forest and is the chosen model for interpretation. The confusion-matrix plots and the number of cases rated with above 95% certainty will further clarify the reason for this choice.

The Greater Predictive Power:

I printed the number of titles rated with more than 95% certainty:

Random Forest Classifier:

# Rows where the predicted probability of the last class (T) exceeds 0.95.
X_test_nf = X_val[model_rf.predict_proba(X_val)[:, -1] > .95]
print(X_test_nf.shape)

(52, 32)

LightGBM:

X_test_nfgbm = X_val[model_lgbm.predict_proba(X_val)[:, -1] > .95]
print(X_test_nfgbm.shape)

(81, 32)

XGBoost:

X_test_nfxg = X_val[model_xgb.predict_proba(X_val)[:, -1] > .95]
print(X_test_nfxg.shape)

(83, 32)
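One caveat on the snippets above: predict_proba(...)[:, -1] reads only the last class column (T). If the intent is to count titles rated with over 95% certainty regardless of class, a row-wise maximum is closer to that goal (a sketch, not the code used above):

# Count titles whose top predicted class probability exceeds 0.95.
proba = model_xgb.predict_proba(X_val)
print((proba.max(axis=1) > .95).sum(), 'of', len(X_val), 'titles')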

In other words, XGBoost rates more cases with greater certainty. LightGBM is not as accurate as XGBoost on this count and is therefore discarded. The confusion matrices below further demonstrate the difference between the best sequential (boosting) and parallel (bagging) classification processes.

Random Forest Classifier:

Random Forest Classifier’s Confusion Matrix after Taking 100 Run-Time Validation Average

from sklearn.metrics import classification_report

# Classification report for the Random Forest model
print(classification_report(y_val, model_rf.predict(X_val)))

              precision    recall  f1-score   support

           E       0.94      0.94      0.94       103
          ET       0.70      0.80      0.75        89
           M       0.89      0.89      0.89       104
           T       0.84      0.78      0.81       177

    accuracy                           0.84       473
   macro avg       0.84      0.85      0.85       473
weighted avg       0.85      0.84      0.84       473
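The confusion-matrix plots referenced in the captions can be reproduced with scikit-learn’s display helper (a sketch; it assumes scikit-learn 1.0+ for from_estimator):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix on the validation split; swap in model_xgb for the XGBoost plot.
ConfusionMatrixDisplay.from_estimator(model_rf, X_val, y_val)
plt.show()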

XGBoost:

XGBoost Confusion Matrix after Taking 100 Run-Time Validation Average

print(classification_report(y_val, model_xgb.predict(X_val)))

              precision    recall  f1-score   support

           E       0.97      0.94      0.96       103
          ET       0.70      0.87      0.77        89
           M       0.90      0.90      0.90       104
           T       0.85      0.76      0.80       177

    accuracy                           0.85       473
   macro avg       0.85      0.87      0.86       473
weighted avg       0.86      0.85      0.85       473

XGBoost performed better because its recall (the true-positive rate) is higher nearly across the board: 1 point higher on Mature-rated games and 7 points higher on ET games. In fact, XGBoost’s recall on Teen-rated games is 2 points lower than Random Forest’s, but that deficiency is offset by the gains in the other two categories (ET and M) and by XGBoost’s higher precision on Teen-rated games. As a result, XGBoost is the best model for this task.

Important Features and Permutation Importance of Each Model

The nature of the question determines the best-fitting model. The rating process happens in a human brain, which may resemble sequential or parallel processing depending on the day or on the rater’s inherent biases. What matters in the case at hand, however, is the medium in which inappropriate content presents itself. If the presentation is sequential, boosting is the better simulation of the rating process.

As we can all agree, game playing and game rating are sequential. Random Forest therefore performs worse: it averages over trees rather than processing the information sequentially.

Random Forest Importance Features

Important Features of Random Forest Classifier
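The importance plot can be reproduced roughly as follows (a sketch; the author’s plotting code is not shown):

import pandas as pd

# Top-10 impurity-based feature importances; swap in model_xgb for the XGBoost plot below.
importances = pd.Series(model_rf.feature_importances_, index=X_train.columns)
importances.sort_values().tail(10).plot(kind='barh')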

“no_descriptors” means “no content descriptors,” according to the data dictionary, so it may indicate a willingness to hide content. Thus, the Random Forest model may seize on this indicator as the most crucial identifier of other content.

However, the reality of making or rating game content involves more than the presence or absence of descriptors. Blood and gore, for instance, are obviously the results and contents of violence; why should they matter so much less than the absence of descriptors? Likewise, strongly inappropriate language tends to signal everything else inappropriate in the plot. So we must discover why XGBoost predicts so much better.

XGBoost Importance Features

Important Features of XGBoost Classifier

The XGBoost model says that in most scenarios, strong language indicates the presence of other inappropriate content; in fact, strong language is often itself the host of such content. The lack of descriptors is a function of the content triggered by, or indicated by, strong language. After all, no one wants to advertise in the descriptors how much inappropriate content is couched in strong language.

Overall, strong sexual themes (3rd), along with blood and gore (4th) and mild fantasy violence (5th), predict the rating outcomes. Violence and sex often arrive wrapped in fantasy themes rather than in a totally explicit manner. Furthermore, sexual themes appear to form a more independent pathway than one merely triggered by strong language or missing descriptors, as their slightly higher importance in the plot reflects.

The permutation importances of the two models look nearly identical. Still, the two graphs below show that a sequential orientation and a parallel training orientation do yield different outcomes; taking averages is, after all, a less precise way of choosing pathways.

XGB Permutation Importance
Random Forest Permutation Importance
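For reference, the permutation importances behind both plots can be computed with scikit-learn’s inspection module (a sketch of the presumed setup):

import pandas as pd
from sklearn.inspection import permutation_importance

# Permutation importance on the validation split; repeat with model_rf.
result = permutation_importance(model_xgb, X_val, y_val, n_repeats=10, random_state=42)
print(pd.Series(result.importances_mean, index=X_val.columns).sort_values().tail(10))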

Conclusion — Why XGBoost Makes So Much Sense

It is all about sensitivity to the causal order of the rating process.

  1. Language Triggers the Lack of Description, Not the Reverse

Either way, strong language and a lack of description both indicate likely inappropriate content. XGBoost, however, correctly identifies the causal order (the language causes the lack of disclosure), while Random Forest does not. This result better matches the external reality of marketing and rating.

2. Sexual Theme (3rd) Is a Weighty Factor by Itself

Strong sexual themes rank 3rd. This makes sense: while strong language is the primary trigger and medium of sexual content, sexual themes themselves are also potent cues. XGBoost picked up on this reality and reflected it in the feature-importance ranking.

3. Violent Themes Trigger Higher Ratings, Not the Single Elements

XGBoost picked blood and gore, mild fantasy violence, and blood as the 4th, 5th, and 6th most important features. This selection shows great sensitivity to the thinking pathway of the game raters: the theme of fantasy violence leads the remaining factors because it is often the environment in which much inappropriate content is couched.

After all, spilling orc blood does not look as bad as spilling a non-combatant’s blood during a car theft, even though both are acts of violence of the same caliber.

As a result, XGBoost best explains the game-rating process and may help parents identify and detect the triggers of inappropriate content. Descriptors may tell little, but language and visual themes, as in all arts and entertainment, tell us what is coming.
