Flight Price Prediction Machine Learning Model

by rudelabs.ai | Jan 16, 2023 | Coding Projects | 0 comments

Introduction

A Flight Price Prediction Machine Learning Model is a predictive model that uses historical flight price data to predict future flight prices. Such a model can be trained with algorithms such as linear regression, decision trees, random forests, or neural networks; the code in this post uses a random forest regressor. The input features include factors such as the departure and arrival locations, the date of travel, the airline, and the class of service. The output of the model is a predicted flight price. Airlines, travel agencies, and other businesses can use the model to predict prices and make pricing decisions.
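
To make the "features in, price out" idea concrete, here is a minimal sketch using pandas and scikit-learn. The six toy records and their prices are made up for illustration, and LinearRegression is used only because it is the simplest of the algorithms mentioned above; it is not the model trained later in this post.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up toy records; the values are illustrative only, not real fares.
flights = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "Vistara", "Air India", "Vistara"],
    "Total_Stops": [0, 1, 1, 0, 2, 1],
    "Duration_hours": [2, 5, 6, 2, 9, 5],
    "Price": [3900, 7600, 6200, 5100, 10400, 8300],
})

# One-hot encode the nominal column; numeric columns pass through unchanged.
X = pd.get_dummies(flights.drop(columns="Price"), columns=["Airline"], dtype=int)
y = flights["Price"]

model = LinearRegression().fit(X, y)

# A new, hypothetical flight must be encoded into the same columns as X.
new_flight = pd.DataFrame([{"Airline": "IndiGo", "Total_Stops": 1, "Duration_hours": 4}])
new_X = pd.get_dummies(new_flight, columns=["Airline"], dtype=int)
new_X = new_X.reindex(columns=X.columns, fill_value=0)

predicted_price = float(model.predict(new_X)[0])
print(round(predicted_price, 2))
```

The reindex call fills in the airline columns that the single new record does not produce, so the model always sees the same feature layout it was trained on.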

Objectives

The main objectives of creating a Flight Price Prediction Machine Learning Model include the following:

Price forecasting: The model can predict the future prices of flights, which can help airlines and travel agencies to adjust their prices accordingly and remain competitive.
Inventory management: The model can be used to predict flight demand, which can help airlines and travel agencies optimize their inventory and avoid overbooking or underbooking.
Revenue optimization: The model can maximize revenue by predicting the prices at which flights will sell the most, which can help airlines and travel agencies adjust their prices accordingly.
Personalized pricing: The model can be used to personalize pricing for different customers by considering factors such as their past purchase history, location, and demographics.
Anomaly detection: The model can detect abnormal prices, which can help airlines and travel agencies identify pricing errors or fraud.

Overall, the goal of creating a flight price prediction model is to improve pricing decisions, optimize inventory and revenue, and improve customer experience.
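
As a rough illustration of the anomaly detection objective, one possible approach is to compare actual prices with the model's predictions and flag the cases where the gap is unusually large. The function name, the threshold of 2 standard deviations, and the sample numbers below are all assumptions made for this sketch; the code in this post does not implement anomaly detection.

```python
from statistics import mean, stdev

def flag_price_anomalies(actual, predicted, threshold=2.0):
    """Return the indices whose residual (actual - predicted) lies more than
    `threshold` sample standard deviations away from the mean residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu = mean(residuals)
    sigma = stdev(residuals)
    if sigma == 0:
        return []  # all residuals identical, nothing stands out
    return [i for i, r in enumerate(residuals) if abs(r - mu) / sigma > threshold]

# Made-up prices: the fifth fare is far above what the model expected.
actual = [4100, 5200, 6150, 4900, 25000, 5600, 4300, 6050]
predicted = [4000, 5300, 6000, 5000, 5500, 5500, 4400, 6100]

print(flag_price_anomalies(actual, predicted))  # [4]
```

A fixed z-score threshold is a simple choice; with few samples a single extreme value also inflates the standard deviation, so a production system would likely need a more robust rule.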

Requirements

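Python
Jupyter Notebook
Python libraries: NumPy, pandas, Matplotlib, seaborn, and scikit-learn (reading the .xlsx files with pandas also needs an Excel engine such as openpyxl)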

Source Code

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

sns.set()

train_data = pd.read_excel(r"E:\MachineLearning\EDA\Flight_Price\Data_Train.xlsx")

pd.set_option('display.max_columns', None)

train_data.head()

train_data.info()

train_data["Duration"].value_counts()

train_data.dropna(inplace = True)

train_data["Journey_day"] = pd.to_datetime(train_data.Date_of_Journey, format="%d/%m/%Y").dt.day

train_data["Journey_month"] = pd.to_datetime(train_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month

# Date_of_Journey has now been converted into integer day and month columns, so the original column can be dropped.

train_data.drop(["Date_of_Journey"], axis = 1, inplace = True)

# Departure time is at which a plane leaves the gate.

# Similar to Date_of_Journey we will also extract values from Dep_Time

# Extracting Hours

train_data["Dep_hour"] = pd.to_datetime(train_data["Dep_Time"]).dt.hour

# Extracting Minutes

train_data["Dep_min"] = pd.to_datetime(train_data["Dep_Time"]).dt.minute

# Now we can drop Dep_Time as it is of no use

train_data.drop(["Dep_Time"], axis = 1, inplace = True)

train_data.head()

# Arrival time is when the plane pulls up to the gate.

# Similar to Date_of_Journey we can extract values from Arrival_Time

# Extracting Hours

train_data["Arrival_hour"] = pd.to_datetime(train_data.Arrival_Time).dt.hour

# Extracting Minutes

train_data["Arrival_min"] = pd.to_datetime(train_data.Arrival_Time).dt.minute

# Now we can drop Arrival_Time as it is of no use

train_data.drop(["Arrival_Time"], axis = 1, inplace = True)

train_data.head()

# The time taken by the plane to reach its destination is called Duration

# It is the difference between the departure time and the arrival time

# Convert the Duration column into a list

duration = list(train_data["Duration"])

for i in range(len(duration)):

    if len(duration[i].split()) != 2: # Check whether duration contains only hours or only minutes

        if "h" in duration[i]:

            duration[i] = duration[i].strip() + " 0m" # Add 0 minutes

        else:

            duration[i] = "0h " + duration[i] # Add 0 hours

duration_hours = []

duration_mins = []

for i in range(len(duration)):

    duration_hours.append(int(duration[i].split(sep = "h")[0])) # Extract hours from duration

    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1])) # Extract only minutes from duration

# Addition of duration_hours and duration_mins list to train_data dataframe

train_data["Duration_hours"] = duration_hours

train_data["Duration_mins"] = duration_mins

train_data.drop(["Duration"], axis = 1, inplace = True)

train_data.head()

train_data["Airline"].value_counts()

# The plot below shows that Jet Airways Business has the highest prices.

# Apart from that airline, almost all the others have a similar median price.



# Airline vs Price

sns.catplot(y = "Price", x = "Airline", data = train_data.sort_values("Price", ascending = False), kind="boxen", height = 6, aspect = 3)

plt.show()

# Since Airline is Nominal Categorical data we will perform OneHotEncoding

Airline = train_data[["Airline"]]

Airline = pd.get_dummies(Airline, drop_first= True)

Airline.head()

train_data["Source"].value_counts()

# Source vs Price

sns.catplot(y = "Price", x = "Source", data = train_data.sort_values("Price", ascending = False), kind="boxen", height = 4, aspect = 3)

plt.show()

# As Source is Nominal Categorical data we will perform OneHotEncoding

Source = train_data[["Source"]]

Source = pd.get_dummies(Source, drop_first= True)

Source.head()

train_data["Destination"].value_counts()

# As Destination is Nominal Categorical data we will perform OneHotEncoding

Destination = train_data[["Destination"]]

Destination = pd.get_dummies(Destination, drop_first = True)

Destination.head()

train_data["Route"]

# Additional_Info contains almost 80% no_info

# Route and Total_Stops are related to each other

train_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)

train_data["Total_Stops"].value_counts()

# Total_Stops is ordinal categorical data, so we encode it as integers (a manual form of label encoding).

# Each category (key) is replaced with its corresponding number of stops (value)

train_data.replace({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}, inplace = True)

train_data.head()

# Concatenate dataframe --> train_data + Airline + Source + Destination

data_train = pd.concat([train_data, Airline, Source, Destination], axis = 1)

data_train.head()

data_train.drop(["Airline", "Source", "Destination"], axis = 1, inplace = True)

data_train.head()

test_data = pd.read_excel(r"E:\MachineLearning\EDA\Flight_Price\Test_set.xlsx")

test_data.head()

# Preprocessing

print("Test data Info")

print("-"*75)

print(test_data.info())

print()

print()

print("Null values :")

print("-"*75)

test_data.dropna(inplace = True)

print(test_data.isnull().sum())

# EDA

# Date_of_Journey

test_data["Journey_day"] = pd.to_datetime(test_data.Date_of_Journey, format="%d/%m/%Y").dt.day

test_data["Journey_month"] = pd.to_datetime(test_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month

test_data.drop(["Date_of_Journey"], axis = 1, inplace = True)

# Dep_Time

test_data["Dep_hour"] = pd.to_datetime(test_data["Dep_Time"]).dt.hour

test_data["Dep_min"] = pd.to_datetime(test_data["Dep_Time"]).dt.minute

test_data.drop(["Dep_Time"], axis = 1, inplace = True)

# Arrival_Time

test_data["Arrival_hour"] = pd.to_datetime(test_data.Arrival_Time).dt.hour

test_data["Arrival_min"] = pd.to_datetime(test_data.Arrival_Time).dt.minute

test_data.drop(["Arrival_Time"], axis = 1, inplace = True)

# Duration

duration = list(test_data["Duration"])

for i in range(len(duration)):

    if len(duration[i].split()) != 2: # Check whether duration contains only hours or only minutes

        if "h" in duration[i]:

            duration[i] = duration[i].strip() + " 0m" # Add 0 minutes

        else:

            duration[i] = "0h " + duration[i] # Add 0 hours

duration_hours = []

duration_mins = []

for i in range(len(duration)):

    duration_hours.append(int(duration[i].split(sep = "h")[0])) # Extract hours from duration

    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1])) # Extract only minutes from duration

# Adding Duration column to test set

test_data["Duration_hours"] = duration_hours

test_data["Duration_mins"] = duration_mins

test_data.drop(["Duration"], axis = 1, inplace = True)

# Categorical data

print("Airline")

print("-"*75)

print(test_data["Airline"].value_counts())

Airline = pd.get_dummies(test_data[["Airline"]], drop_first= True) # Double brackets keep the "Airline_" column prefix used for the training data

print()

print("Source")

print("-"*75)

print(test_data["Source"].value_counts())

Source = pd.get_dummies(test_data[["Source"]], drop_first= True) # Double brackets keep the "Source_" column prefix used for the training data

print()

print("Destination")

print("-"*75)

print(test_data["Destination"].value_counts())

Destination = pd.get_dummies(test_data[["Destination"]], drop_first = True) # Double brackets keep the "Destination_" column prefix used for the training data

# Additional_Info contains almost 80% no_info

# Route and Total_Stops are related to each other

test_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)

# Replacing Total_Stops

test_data.replace({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}, inplace = True)

# Concatenate dataframe --> test_data + Airline + Source + Destination

data_test = pd.concat([test_data, Airline, Source, Destination], axis = 1)

data_test.drop(["Airline", "Source", "Destination"], axis = 1, inplace = True)

print()

print()

print("Shape of test data : ", data_test.shape)

data_train.shape

data_train.columns

X = data_train.loc[:, ['Total_Stops', 'Journey_day', 'Journey_month', 'Dep_hour',
    'Dep_min', 'Arrival_hour', 'Arrival_min', 'Duration_hours',
    'Duration_mins', 'Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo',
    'Airline_Jet Airways', 'Airline_Jet Airways Business',
    'Airline_Multiple carriers',
    'Airline_Multiple carriers Premium economy', 'Airline_SpiceJet',
    'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy',
    'Source_Chennai', 'Source_Delhi', 'Source_Kolkata', 'Source_Mumbai',
    'Destination_Cochin', 'Destination_Delhi', 'Destination_Hyderabad',
    'Destination_Kolkata', 'Destination_New Delhi']]

X.head()

y = data_train["Price"] # Select the target by name rather than by column position

y.head()

# Finds correlation between Independent and dependent attributes

plt.figure(figsize = (18,18))

sns.heatmap(train_data.select_dtypes(include = "number").corr(), annot = True, cmap = "RdYlGn") # Only numeric columns can be correlated

plt.show()

# Important feature using ExtraTreesRegressor

from sklearn.ensemble import ExtraTreesRegressor

selection = ExtraTreesRegressor()

selection.fit(X, y)

print(selection.feature_importances_)

# plot graph of feature importances to better visualize

plt.figure(figsize = (12,8))

feat_importances = pd.Series(selection.feature_importances_, index=X.columns)

feat_importances.nlargest(20).plot(kind='barh')

plt.show()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

from sklearn.ensemble import RandomForestRegressor

reg_rf = RandomForestRegressor()

reg_rf.fit(X_train, y_train)

y_pred = reg_rf.predict(X_test)

reg_rf.score(X_train, y_train)

reg_rf.score(X_test, y_test)

sns.histplot(y_test - y_pred, kde = True) # distplot is deprecated in recent seaborn versions

plt.show()

plt.scatter(y_test, y_pred, alpha = 0.5)

plt.xlabel("y_test")

plt.ylabel("y_pred")

plt.show()

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))

print('MSE:', metrics.mean_squared_error(y_test, y_pred))

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

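# Normalized RMSE: RMSE/(max(DV)-min(DV)), where DV is the dependent variable (Price)

np.sqrt(metrics.mean_squared_error(y_test, y_pred))/(max(y)-min(y))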

metrics.r2_score(y_test, y_pred)

from sklearn.model_selection import RandomizedSearchCV

#Randomized Search CV

# Number of trees in random forest

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

# Number of features to consider at every split

max_features = [1.0, 'sqrt'] # 1.0 uses all features; the older 'auto' option was removed in scikit-learn 1.3

# Maximum no. of levels in tree

max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]

# Minimum no. of samples required to split a node

min_samples_split = [2, 5, 10, 15, 100]

# Minimum no. of samples required at each leaf node

min_samples_leaf = [1, 2, 5, 10]

# Create the random grid

random_grid = {'n_estimators': n_estimators,

'max_features': max_features,

'max_depth': max_depth,

'min_samples_split': min_samples_split,

'min_samples_leaf': min_samples_leaf}

# Random search of parameters, using 5-fold cross-validation.

# With n_iter = 10, the search samples 10 different combinations

rf_random = RandomizedSearchCV(estimator = reg_rf, param_distributions = random_grid,scoring='neg_mean_squared_error', n_iter = 10, cv = 5, verbose=2, random_state=42, n_jobs = 1)

rf_random.fit(X_train,y_train)

rf_random.best_params_

prediction = rf_random.predict(X_test)

plt.figure(figsize = (8,8))

sns.histplot(y_test - prediction, kde = True) # distplot is deprecated in recent seaborn versions

plt.show()

plt.figure(figsize = (8,8))

plt.scatter(y_test, prediction, alpha = 0.5)

plt.xlabel("y_test")

plt.ylabel("y_pred")

plt.show()

print('MAE:', metrics.mean_absolute_error(y_test, prediction))

print('MSE:', metrics.mean_squared_error(y_test, prediction))

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
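
The listing above parses the Duration column twice, once for the training set and once for the test set, with the same pair of loops. As a hedged alternative, the logic can be factored into one small helper. The function name parse_duration is made up for this sketch, and it assumes the same string format the loops handle ("2h 50m", "19h", or "5m").

```python
def parse_duration(text):
    """Split a duration string such as "2h 50m", "19h" or "5m" into (hours, minutes).

    A missing part is treated as zero, which mirrors the "0h " / " 0m"
    padding done by the loops in the listing.
    """
    hours, minutes = 0, 0
    for part in text.split():
        if part.endswith("h"):
            hours = int(part[:-1])
        elif part.endswith("m"):
            minutes = int(part[:-1])
    return hours, minutes

print(parse_duration("2h 50m"))  # (2, 50)
print(parse_duration("19h"))     # (19, 0)
print(parse_duration("5m"))      # (0, 5)
```

With pandas, the helper could be applied to either data frame with something like train_data["Duration"].map(parse_duration), which avoids keeping two copies of the same parsing code in sync.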

Explanation of the Code

The outline of the steps in writing the code for a Flight Price Prediction Machine Learning Model is as follows.

1. First, we imported the libraries needed to build the model and loaded the dataset into the notebook.

2. The dataset consists of historical flight price data, which can be obtained from sources such as airlines, travel agencies, or online ticket-booking platforms. Such data typically includes the departure and arrival locations, the date of travel, the airline, the class of service, and the corresponding price. Here, the training and test sets are read from the Excel files Data_Train.xlsx and Test_set.xlsx.

3. We then cleaned and preprocessed the data so that it is in a suitable format for the model. In general, this step can involve removing outliers, normalizing values, and encoding categorical variables; in this project we dropped the rows with null values using the dropna() function.

4. Next, we extracted the features used as input to the model. From the date and time columns we derived the journey day and month, the departure and arrival hour and minute, and the duration in hours and minutes, and we dropped the Route and Additional_Info columns.

5. We applied one-hot encoding to the nominal features (Airline, Source, and Destination) and encoded the ordinal Total_Stops feature as integers from 0 to 4.

6. Next, we split the preprocessed data into training and test sets (80% and 20%) and trained a RandomForestRegressor to learn the relationship between the input features and the price. Before that, an ExtraTreesRegressor was fitted to rank the features by importance.

7. We then tuned the hyperparameters of the random forest with RandomizedSearchCV, using 5-fold cross-validation.

8. Finally, we evaluated the model by comparing the predicted prices with the actual prices in the test set and calculating the mean absolute error, mean squared error, root mean squared error, and R-squared.
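
To show what the evaluation metrics in the last step measure, the sketch below computes them by hand in plain Python. The function name and the four sample prices are made up for illustration; in the listing the same quantities come from sklearn.metrics. The normalized RMSE here divides by the range of the values passed in, whereas the listing divides by the range of the full target column.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R-squared and range-normalized RMSE."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n                               # mean squared error
    rmse = math.sqrt(mse)                          # same units as the price
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                       # share of variance explained
    nrmse = rmse / (max(y_true) - min(y_true))     # RMSE relative to price range
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "NRMSE": nrmse}

# Made-up actual and predicted prices, each prediction off by 500.
y_true = [4000, 6000, 8000, 10000]
y_pred = [4500, 5500, 8500, 9500]

results = regression_metrics(y_true, y_pred)
print(results["MAE"], results["MSE"], results["RMSE"])  # 500.0 250000.0 500.0
```

Because RMSE squares the errors before averaging, a few large misses raise it well above MAE; when every error has the same size, as in this toy data, the two coincide.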

Conclusion

We have built a Flight Price Prediction Machine Learning Model that predicts the price of flights, which can help travelers choose a suitable route and fare for their own needs.
