Random Forest with Grid Search

[Image by Google Gemini]

"Don't miss the forest for the trees" is a famous saying, and it applies to machine learning as well: a Random Forest usually performs better than a single decision tree. That's why Random Forest is one of my favorite ML algorithms. We are going to implement it in Google Colab.

In this blog, we walk through implementing Random Forest step by step:

- Getting the dataset
- Reading the dataset and plotting histograms for numerical variables
- One-hot encoding categorical variables
- Splitting train and test data
- Oversampling the minority class instances
- Fitting Random Forest
- Validating the model performance
- Hyperparameter tuning (Grid Search)
- Feature importance

Getting the dataset

Flight booking data is used. It is a public dataset from Kaggle with 50,000 records and 14 columns (13 features plus the target).

[Figure: Flight Bookings data]

The columns are described below. The target is booking_complete, which shows whether or not a customer completed a booking.

- num_passengers = number of passengers traveling
- sales_channel = sales channel the booking was made on
- trip_type = trip type (Round Trip, One Way, Circle Trip)
- purchase_lead = number of days between travel date and booking date
- length_of_stay = number of days spent at destination
- flight_hour = hour of flight departure
- flight_day = day of the week of flight departure
- route = origin -> destination flight route
- booking_origin = country from where the booking was made
- wants_extra_baggage = if the customer wanted extra baggage in the booking
- wants_preferred_seat = if the customer wanted a preferred seat in the booking
- wants_in_flight_meals = if the customer wanted in-flight meals in the booking
- flight_duration = total duration of flight (in hours)
- booking_complete = flag indicating if the customer completed the booking

Reading the dataset and plotting histograms for numerical variables

    import pandas as pd

    # Load the flight booking data (the file is read as ISO-8859-1).
    df = pd.read_csv('customer_booking.csv', encoding='ISO-8859-1')

    # Plot a histogram for every numerical column.
    df.hist(figsize=(10, 10))

From the histograms above,
the dependent variable booking_complete is imbalanced, and 5 variables are categorical, so they need one-hot encoding. A for loop is a handy way to identify the categorical variables: a column is treated as categorical if the value in its first row is a str.

    cat_cols = []
    num_cols = []
    for col in df.columns:
        # A column whose first value is a string is treated as categorical.
        if isinstance(df[col].iloc[0], str):
            cat_cols.append(col)
        else:
            num_cols.append(col)

One-hot encoding categorical variables

There are two main ways to do one-hot encoding: with sklearn or with pandas. Using pandas.get_dummies is easier coding-wise. Note that axis=1 must be passed to pd.concat; otherwise the new columns are concatenated along the rows. After a column is one-hot encoded, the original column is dropped.

    # Assumption: the original snippet never defines X and y; here they are
    # taken to be the feature columns and the booking_complete target.
    X = df.drop('booking_complete', axis=1)
    y = df['booking_complete']

    for col in cat_cols:
        # Append the dummy columns, then drop the original categorical column.
        X = pd.concat([X, pd.get_dummies(X[col])], axis=1)
        X = X.drop(col, axis=1)

Splitting train and test data

Splitting the data is very important in ML modeling. If the training and test sets are not completely separated, data leakage causes serious prediction problems. For example, both training accuracy and test accuracy can look good while the model still fails in production, because leakage inflates the test accuracy and hides poor generalization. Data leakage comes up again in the oversampling step.

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the data as the test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    y_train.hist()

Oversampling the minority class instances

As the histogram above shows, the distribution of the target is highly imbalanced. An imbalanced dataset degrades the performance of an ML model. Say a dataset is fraud-related and the fraud rate is 0.01%. If the model blindly predicts that no customer is fraudulent, its accuracy is 99.99%, yet it cannot prevent a single fraud. In other words, the model is useless even though its accuracy is 99.99%. So the imbalance should be handled by oversampling the minority class or undersampling the majority class. Oversampling is often preferred in practice.
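To make the fraud example above concrete, here is a minimal sketch in plain Python. The toy data (10,000 transactions with exactly one fraud case, which gives a 0.01% fraud rate) and the helper names accuracy and recall are made up for illustration; they are not part of the flight booking dataset or the article's code.

```python
# Hypothetical toy data: 10,000 transactions, exactly 1 of them fraudulent.
n_total = 10_000
n_fraud = 1
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that blindly predicts "not fraud" (0) for every transaction.
y_pred = [0] * n_total


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of truly positive cases that the model actually caught."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.2%}")
print(f"recall   = {rec:.2%}")
```

Accuracy looks excellent while recall on the fraud class is zero, which is why the later sections report recall and precision in addition to accuracy.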
SMOTE (Synthetic Minority Over-sampling Technique) uses the K-nearest neighbors (KNN) algorithm to create new minority class data points. In detail, SMOTE selects one of the minority class data points, finds its nearest minority class neighbors, chooses one of them, and creates a synthetic point between the two. The key point is that only the training data should be oversampled, not the test data. Oversampling both introduces data leakage.

    from imblearn.over_sampling import SMOTE

    # Oversample the training set only; the test set stays untouched.
    smote = SMOTE()
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
    y_train_resampled.hist()

Fitting Random Forest

We've done the data preparation for modeling. Let's fit a Random Forest. It gives 84% accuracy.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # Train on the oversampled training data, evaluate on the original test data.
    rf = RandomForestClassifier()
    rf.fit(X_train_resampled, y_train_resampled)
    y_pred = rf.predict(X_test)
    accuracy_score(y_test, y_pred)

Validating the model performance

The confusion matrix is the first choice when validating a model. True to its name, it isn't easy to read and interpret correctly, and it confuses many smart people.

    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf.classes_)
    disp.plot()
    plt.show()

The recall is 23%, so only 23% of the bookings that were actually completed are predicted as completed.

    from sklearn.metrics import recall_score

    recall_score(y_test, y_pred)

Precision is 42%, so fewer than half of the bookings predicted as completed were actually completed.

    from sklearn.metrics import precision_score

    precision_score(y_test, y_pred)

Hyperparameter tuning (Grid Search)

We can't stop here. Let's try to improve the model. Grid search with cross-validation (CV) will help us. We tune the following hyperparameters:

- max_depth: the maximum depth of each tree. Deeper trees tend to overfit, so a very high value makes the model fail to generalize.
- n_estimators: the number of trees in the forest.
- max_features: the number of features considered when looking for the best split. This is one of the main hyperparameters for preventing overfitting. The square root of the total number of features is a commonly recommended value.
- min_samples_leaf: the minimum number of samples required at a leaf node of each tree.

Setting verbose=3 prints progress as the search works through the grid search space.

    from sklearn.model_selection import GridSearchCV

    rf_grid = RandomForestClassifier()
    # The search space is a dict mapping parameter names to candidate values.
    gr_space = {
        'max_depth': [3, 5, 7, 10],
        'n_estimators': [100, 200, 300, 400, 500],
        'max_features': [10, 20, 30, 40],
        'min_samples_leaf': [1, 2, 4]
    }

    grid = GridSearchCV(rf_grid, gr_space, cv=3, scoring='accuracy', verbose=3)
    model_grid = grid.fit(X_train_resampled, y_train_resampled)

    print('Best hyperparameters are ' + str(model_grid.best_params_))
    print('Best score is: ' + str(model_grid.best_score_))

The grid has 4 × 5 × 4 × 3 = 240 combinations, which means 720 fits with 3-fold CV. The search took more than 4 hours on a Google Colab CPU, yet accuracy got worse than before. Recall is 47% and precision is 30%. In terms of recall and precision, however, the tuned model looks better: recall rose from 23% to 47% while precision fell from 42% to 30%, so the F1 score improved from 0.297 to 0.366. In that sense, the tuning worked.

    # Evaluate the best model found by the grid search on the test set.
    rf_optimized = model_grid.best_estimator_
    y_pred = rf_optimized.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Recall:   ', recall_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))

Feature importance

Random Forest provides feature importances, which help us interpret the model. The top 10 features are the day of departure (Mon, Tue, Wed, Fri, and Sun), booking origin (Australia and South Korea), flight duration, length of stay, and sales channel (mobile). Surprisingly, Thursday and Saturday have less impact on the model.

Conclusion

A random forest model was built using the airline booking dataset. During the data preprocessing stage, one-hot encoding was applied to the categorical variables, and oversampling was performed to address the imbalanced data. The training and test data were kept separate to ensure a reliable evaluation of model performance.
The initial random forest model achieved an accuracy of 84% but had low recall and precision. Hyperparameter tuning was therefore performed, and the F1 score improved from 0.297 to 0.366. Feature importance analysis showed that departure day, booking origin, flight duration, length of stay, and sales channel were the key predictive factors. Overall, preprocessing the imbalanced data, tuning the hyperparameters, and analyzing feature importance helped improve and interpret the random forest model. These insights can be useful for real-world airline booking prediction problems.

Random Forest with Grid Search was originally published in Cloud Villains on Medium.
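As a quick check on the F1 scores quoted in this post, the sketch below recomputes them from the rounded precision and recall values reported earlier (42% and 23% before tuning, 30% and 47% after). The helper name f1_from is made up for illustration; with scikit-learn, sklearn.metrics.f1_score(y_test, y_pred) would compute F1 directly from the predictions.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported in the post, not recomputed from the data.
f1_before = f1_from(precision=0.42, recall=0.23)  # default Random Forest
f1_after = f1_from(precision=0.30, recall=0.47)   # after grid search

print(f"F1 before tuning: {f1_before:.3f}")
print(f"F1 after tuning:  {f1_after:.3f}")
```

Because F1 is a harmonic mean, it rewards the more balanced precision and recall of the tuned model even though its precision and accuracy are lower.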
