A Data Science Voyage: Analyzing & Exploring Flight Delays

Enzo Bergamo
14 min read · May 5, 2021

This project was originally produced as the final project for the course CIS 545: Big Data Analytics at the University of Pennsylvania, Spring 2021. This was co-authored by Joao Filippo and Daniel Ng, both also students at the University of Pennsylvania.


While flight delays may seem like no more than an inconvenience for travelers, the consequences of this phenomenon extend well beyond long waits at the airport. According to Peterson et al., U.S. net welfare would increase by $17.6 billion for a 10% reduction in flight delays. As the COVID-19 pandemic comes under control and passenger traffic slowly returns to normalcy, it becomes particularly relevant to consider the impacts and causes of delays. Making use of public flight data, this article explores how different features of a flight (airline, date, and airplane model, to name a few) correlate with delays and cancellations; to this end, we present a complete report of the steps and techniques involved in this process.

Data Wrangling & Exploratory Data Analysis

A First Step: Cleaning the Data

Most of the information in this article comes from the U.S. Department of Transportation’s 2015 flight dataset (available on Kaggle at www.kaggle.com/usdot/flight-delays). A quick overview of this dataset reveals an inconsistent naming schema for the origin and destination airports: for the month of October only, instead of the standardized IATA codes (JFK and PHL, for example), the data uses the DOT’s internal LID code system. The LID code is designed to be a persistent value, since IATA codes can change over time. However, since only October contains this issue, we can simply use the LID-to-IATA correspondence to map and standardize the data.

Unfortunately, we found that the FAA does not maintain this mapping. Luckily, we found a loosely formatted dataset (available on Kaggle at www.kaggle.com/smiller933/bts?select=L_AIRPORT_ID.csv) that maps each LID code to the corresponding city and airport, encoded in a single description field. The data came in a quasi-merged form, with strings formatted as “San Francisco, California: San Francisco International Airport”. To alleviate this issue, a RegEx was used to extract the airport name from the description field. Luckily, the DOT 2015 dataset contains airport names and their associated IATA codes. However, some airport names have additional terminology idiosyncrasies (e.g., Intl. vs. International), so we performed a Jaccard join on the names. Part of the corresponding code is shown below:
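The original notebook is not reproduced here; the sketch below shows one way to implement the RegEx extraction and Jaccard join described above (all function, column, and code names are our own illustrations, not the exact ones used in the project):

```python
import re
import pandas as pd

def extract_airport(description):
    """Pull the airport name out of strings like
    'San Francisco, California: San Francisco International Airport'."""
    match = re.search(r":\s*(.+)$", description)
    return match.group(1).strip() if match else description

def jaccard(a, b):
    """Jaccard similarity between two names, tokenized on whitespace."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def jaccard_join(lid_df, iata_df, threshold=0.5):
    """For each LID row, keep the IATA airport whose name is most similar."""
    rows = []
    for _, lid_row in lid_df.iterrows():
        scores = iata_df["AIRPORT"].map(lambda n: jaccard(lid_row["airport_name"], n))
        best = scores.idxmax()
        if scores[best] >= threshold:
            rows.append((lid_row["LID"], iata_df.loc[best, "IATA_CODE"]))
    return pd.DataFrame(rows, columns=["LID", "IATA_CODE"])
```

A pairwise similarity scan like this is quadratic in the number of airports, which is tolerable here because the airport tables are small, even though the flight table itself is huge.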

A major weakness of the flight dataset is the lack of information about the airplane model of each flight. This is problematic, since delays can often stem from mechanical issues. Luckily, the dataset includes the tail number of the corresponding airplane, and we believed that airplane model and age would play a major role in any classification models we were to use.

Using the Aircraft Registration Dataset from the Federal Aviation Administration for 2015, we were able to find matches (i.e. model and year) for the large majority of the planes involved in the flights, which we then proceeded to merge with our current data. In order to complete this step, some minor string manipulation was required before the merge, as illustrated below:
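As a rough illustration of that string manipulation, the snippet below normalizes tail numbers before a left merge; the column names are hypothetical stand-ins for the actual FAA registry schema, not a faithful copy of it:

```python
import pandas as pd

def merge_aircraft_info(flights, registry):
    """Left-join aircraft model/year onto flights via tail number.
    Assumes the registry omits the leading 'N' (e.g. '12345' for 'N12345')
    while the flight data keeps it."""
    flights = flights.copy()
    registry = registry.copy()
    # Normalize both sides to a bare, uppercase registration number.
    flights["N_NUMBER"] = (flights["TAIL_NUMBER"].astype(str)
                           .str.strip().str.upper().str.lstrip("N"))
    registry["N_NUMBER"] = (registry["N-NUMBER"].astype(str)
                            .str.strip().str.upper())
    # Keep every flight, even those without a registry match (imputed later).
    return flights.merge(registry[["N_NUMBER", "MODEL", "YEAR MFR"]],
                         on="N_NUMBER", how="left")
```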

At this point, the dataset was reasonably clean and ready for further use — the next step, therefore, consists of data visualization leveraging the wrangling process that was done. Below is a sample of our cleaned dataset:

A Second Step: Visualizing the Data (EDA)

Given the current state of the dataset, we can now explore the data and produce some visual representations of the information. A natural first step is the total number of flights per airport and per airline, which we obtained by grouping and merging. The result matches the official statistics for 2015 perfectly, an indication that both the data and the wrangling are a good representation of reality.

Given these simpler visualizations, we can start constructing more interesting and relevant visualizations. Namely, we can find which airlines have the highest percentage of delayed flights. This was done in a similar fashion, but this time we aggregated by airlines and looked specifically at arrival delays of at least 20 minutes to find the proportion of flights that exceed this threshold for each airline.
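A minimal sketch of that aggregation, assuming the Kaggle dataset's `AIRLINE` and `ARRIVAL_DELAY` columns (the function name is ours):

```python
import pandas as pd

def delay_rate_by_airline(flights, threshold=20):
    """Share of flights per airline whose arrival delay meets the threshold
    (in minutes), sorted from worst to best."""
    delayed = flights["ARRIVAL_DELAY"] >= threshold
    return delayed.groupby(flights["AIRLINE"]).mean().sort_values(ascending=False)
```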

Finally, we can also visualize how seasonality affects the delays and cancellations; for this purpose, we group by month.

It is worth noting that in February 2015, a potent blizzard hit in the Northeastern region of the United States. As a result, an atypical number of flights were canceled — this episode can be clearly seen in the plot above.

However, we were also interested in the dataset as a network: since each row provides an origin and a destination airport, we can create an adjacency list that defines a rich directed graph.
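A sketch of how such an adjacency list can be built from (origin, destination) pairs, with edge weights counting flights (helper names are our own):

```python
from collections import defaultdict

def build_adjacency(rows):
    """Build a directed adjacency list {origin: {dest: flight_count}}
    from (origin, destination) pairs, one per flight."""
    graph = defaultdict(lambda: defaultdict(int))
    for origin, dest in rows:
        graph[origin][dest] += 1
    return graph

def out_degree(graph, airport):
    """Number of distinct destinations served from an airport."""
    return len(graph.get(airport, {}))
```

Hub airports then show up directly as the nodes with the highest out-degree, which is what the dense center of the network plot reflects.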

High-fidelity version available here

Our visualization highlights a common business practice among airlines: the hub-and-spoke model. This approach leverages a few key airports as transit points to smaller, more regional airports. Indeed, we see an incredibly dense subgraph in the center; upon closer inspection, these are the major domestic hubs of the United States: ATL, DEN, LAX, ORD, and DTW, to name a few.

However, this visualization lacks geographical context. By merging in the coordinates of each airport, we obtained a dataset that could be visualized on a map.

An issue is that flights do not actually travel in a straight line on a map: a plane's route is slightly curved because of the spherical nature of our world. To accurately show the geodesic path between origin and destination airports, we must draw a great-circle flight path, for which we employed a package called GCMapper.
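GCMapper handles the drawing; for illustration, the geodesic path itself can be computed with spherical linear interpolation, as sketched below (these helpers are our own, not part of any package):

```python
import math

def _to_vec(lat, lon):
    """Convert latitude/longitude in degrees to a 3-D unit vector."""
    la, lo = math.radians(lat), math.radians(lon)
    return (math.cos(la) * math.cos(lo), math.cos(la) * math.sin(lo), math.sin(la))

def great_circle_points(lat1, lon1, lat2, lon2, n=50):
    """Sample n+1 points along the great-circle path between two distinct
    coordinates, using spherical linear interpolation (slerp)."""
    p1, p2 = _to_vec(lat1, lon1), _to_vec(lat2, lon2)
    # Angular distance between the endpoints, clamped for float safety.
    d = math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(p1, p2)))))
    points = []
    for i in range(n + 1):
        f = i / n
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        v = [a * x + b * y for x, y in zip(p1, p2)]
        lat = math.degrees(math.atan2(v[2], math.hypot(v[0], v[1])))
        lon = math.degrees(math.atan2(v[1], v[0]))
        points.append((lat, lon))
    return points
```

On a transcontinental route such as SFO to JFK, the sampled path bows north of both endpoints, which is exactly the curvature visible on the map.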

Geodesic Flight Map

We thus see our hub-and-spoke model come to life. In particular, there is tremendous transcontinental connectivity from West Coast hubs such as SFO, LAX, and SEA. Within the East Coast, we see particular activity throughout the Northeast Corridor, which branches away into long routes to Texas and Florida.

Machine Learning & Classification

In addition to visualizing the data, the extensive number of observations and features opens up a whole other dimension: understanding which factors are most closely correlated with flight delays and predicting such delays from the given characteristics. To this end, we explore different machine learning algorithms and evaluate their performance; the next few steps require extensive wrangling, which is also detailed below.


The sheer amount of data in this dataset, especially after including the merges described above, makes it impossible to use in its entirety (our dataset contains 5.25M flights, a testament to our busy skies). In order to prepare the data for the machine learning algorithms, a number of steps were taken.

Initially, a series of columns considered irrelevant were removed; additionally, since the model focused on delayed flights, observations corresponding to canceled flights were likewise removed. We also performed string cleaning to account for the different manufacturers and aircraft models. Then, to further reduce the complexity of the data and avoid overfitting, observations involving airports with fewer than two flights per day were also dropped.

When obtaining the airplane model data via tail number, a few observations did not have a match. As such, we imputed the manufacture date of the aircraft, largely by aggregating the existing data to use as a proxy. In addition, we performed one-hot encoding on all of our categorical features, a must since categorical data cannot be meaningfully mapped to numerical values (SFO has no meaningful numerical difference from ORD). One-hot encoding does, however, tremendously increase the number of columns, since we must encode airports, airplane models, and airlines. Finally, we sampled the dataset down to a smaller, more manageable size and applied PCA for dimensionality reduction. A sample of our one-hot encoded dataset is shown below:
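One-hot encoding along these lines can be done with pandas' `get_dummies`; the tiny frame below is purely illustrative:

```python
import pandas as pd

# Three hypothetical flights: categorical columns become indicator columns,
# numeric columns pass through untouched.
flights = pd.DataFrame({
    "ORIGIN_AIRPORT": ["SFO", "ORD", "SFO"],
    "AIRLINE": ["AA", "UA", "UA"],
    "DISTANCE": [1846, 700, 1846],
})
encoded = pd.get_dummies(flights, columns=["ORIGIN_AIRPORT", "AIRLINE"])
```

With hundreds of airports and dozens of aircraft models, this is exactly why the column count explodes after encoding.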

Due to its nature, the dataset was expected to be extremely imbalanced; after all, only a minority of flights are in fact delayed. To deal with this issue, we used undersampling so that we had equal numbers of observations corresponding to delayed and on-time flights. Additionally, the flights were separated into discrete bins: all flights that were on time or early were grouped together (with a delay of 0), and the remaining flights were binned in 30-minute intervals (i.e., a flight delayed by 15 minutes and another by 30 minutes fall in the same category). This allowed us to perform classification using both a random forest classifier and a feed-forward neural network. For these bins, we also performed rebalancing, but this time in different proportions.
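A minimal sketch of the binning and undersampling described above (function and column names are our own):

```python
import numpy as np
import pandas as pd

def bin_delay(delay_minutes):
    """Map an arrival delay in minutes to a 30-minute bin.
    On-time/early -> 0; (0, 30] -> 1; (30, 60] -> 2; and so on."""
    if delay_minutes <= 0:
        return 0
    return int(np.ceil(delay_minutes / 30))

def undersample(df, label_col="DELAY_BIN", random_state=0):
    """Downsample every class to the size of the smallest one."""
    smallest = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .sample(n=smallest, random_state=random_state))
```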

Even with all the steps described above, however, the size of the dataset and the number of features made Google Colab unfeasible: we lacked the CPU and RAM to proceed. Our first approach was to move to Colab Pro, but it soon became clear that even those resources were insufficient, and further reductions to the dataset resulted in a substantial loss of model accuracy. We then tried Apache Spark, but it also failed to hold the dataset for modeling. We ultimately decreased our sample size to at most 40%, an amount we found provided enough data points for accurate generalization.

After over 20 hours of constantly crashing Google Colab Pro (a sample of 10% was enough to crash the High-RAM instance), we settled on Amazon Web Services' SageMaker platform for access to additional computational power and resources. This process led us to better understand the industry role of greater computational power and different platforms, as well as the importance of distributed solutions.

Amazon SageMaker Studio: Familiar Interface, Superior Power

SageMaker is AWS's machine learning platform. While it has incredibly rich ML features, we constrained our usage to Jupyter Notebooks to complete our analysis. One unfortunate issue was that we had to use our personal accounts: SageMaker Studio essentially creates your own ML IDE environment, which requires provisioning IAM (non-root) roles, and AWS Educate Classrooms regrettably does not allow this.

Creating a SageMaker Studio is relatively simple: using the root account of your AWS organization, you can provision one quickly, with the option of multiple users logging in simultaneously. SageMaker charges based on two main criteria: EFS and EC2 ML instances. EFS (Elastic File System) is the main charge while idle; it holds the data for the Studio instance, including our datasets and notebooks. EC2 ML instances are repackaged EC2 instances designed solely for SageMaker usage, and they enjoy an integrated experience.

SageMaker Studio, which is used to interact with Jupyter Notebooks, uses the newer JupyterLab interface. In comparison to Google Colab, its interface is designed not only for multiple notebooks but also for monitoring and reprovisioning instances and restarting kernels.

AWS SageMaker Studio

SageMaker contains two pre-defined kernels (similar to EC2 AMIs) that we used for our analysis: Python 3 Data Science and Python 3 PyTorch 1.6 GPU Optimized. We used the former for LR/RF and the latter for the neural network.

Luckily, SageMaker lets you rapidly provision, upsize, and downsize instances depending on your workload. We eventually settled on two main instances. For scikit-learn (LR and RF), which has no GPU capability, we provisioned an ml.c5.9xlarge with 36 vCPUs and 72 GB of RAM; while we did not use all of the RAM, we needed more than the smaller 4xlarge offered. For PyTorch (the FNN), we used an ml.g4dn.4xlarge with 16 vCPUs, 64 GB of RAM, and an NVIDIA Tesla T4 GPU, which allowed us to use hardware acceleration to model our FNN more efficiently.

With our servers provisioned, we commenced with our modeling.

Model 1: Linear Regression

The first model we attempted was linear regression from scikit-learn, used to predict the expected delay in minutes for a particular data instance. We began with a standard scaler and then executed PCA to attempt to reduce dimensionality; note that for this model we did not use the discrete binning mentioned previously.
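The original code is not shown here; a hedged sketch of such a scaler-PCA-regression pipeline in scikit-learn, run on synthetic stand-in data for illustration, might look like this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real (much wider) feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=5.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, keep enough components for 95% of the variance, then regress.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
model.fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
```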

Explained Variance vs. Components (PCA), LR

However, our PCA results surprised us: even after scaling the data, PCA was unable to identify dimensions to drop, and the explained-variance curve is almost linear. We therefore could not reduce the dimensionality of our dataset and proceeded to run the linear regression model on its own.

We managed to get a reasonably satisfactory RMSE of about 22.6 minutes; in real-life terms, one can argue that such a margin of error is acceptable. The linear regression was a good starting point, as it demonstrated the feasibility of predictive modeling on this data and set a baseline for further models.

Random Forest

After linear regression, we moved to a more sophisticated and realistic approach: given the discrete binning strategy mentioned previously, we make use of a random forest classifier. While the binning process is required in order to use this classifier, it also has a real-life motivation: travelers will often treat a 45-minute and a 1-hour delay similarly.

We also wanted to see if this binning approach could significantly reduce the number of dimensions required. Employing PCA, we find a significant reduction in components. In fact, we find that only 3 or 4 components were necessary.

Using the random forest classifier, we were able to correctly classify the appropriate delay bin on the test set with an accuracy of around 85%. This is especially impressive considering that just under 40% of the resampled dataset had no delays, so the model correctly classified almost twice as many samples as a majority-class baseline would. The corresponding code is shown below.
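Our exact notebook is not reproduced here; a minimal scikit-learn sketch of the random forest step, on synthetic stand-in data (two classes for brevity, rather than the full set of delay bins), is:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PCA-reduced, rebalanced feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
```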

We also used GridSearch to perform hyperparameter tuning, but found that it made, in general, immaterial differences in accuracy. We did find that lower tree depths gave better accuracy, highlighting potential overfitting at larger depths given the idiosyncrasies of the model.
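A sketch of such a tuning run with scikit-learn's GridSearchCV, again on synthetic data (the parameter grid is illustrative, not the one we searched):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data: the label depends on a single feature threshold.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)

param_grid = {"max_depth": [2, 5, 10], "n_estimators": [50, 100]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)  # best_params_ / best_score_ hold the winning combination
```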

We indeed see this when training a single decision tree: a tree built on entropy achieved an accuracy of only 69.8%, a testament to the effectiveness of ensemble models.

Feed-Forward Neural Network

After the success of our random forest classifier, we moved on to PyTorch to try a feed-forward neural network (FNN). We used three linear layers, with ReLU activations in between to break linearity. Upon training the model with an SGD optimizer for 30 epochs, we obtained only 47% accuracy on delay-bin classification. While far better than chance, this is considerably worse than the random forest classifier.

However, we note that this could be due to a lack of training time: the model was still improving, albeit slowly. Thirty epochs already took considerably more time than the RF model. This goes to show that, despite their complexity, neural networks might not always be beneficial or yield the best models, especially given the overhead in code complexity and computing requirements. The code for the feed-forward neural network is below.
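The original notebook is not reproduced here; below is a minimal PyTorch sketch of the architecture described above. The class name, hidden size, and the random data in the training step are all illustrative:

```python
import torch
import torch.nn as nn

class DelayFNN(nn.Module):
    """Three linear layers with ReLU activations in between, as described
    above; input/hidden/output sizes here are placeholders."""
    def __init__(self, n_features, n_bins, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_bins),
        )

    def forward(self, x):
        return self.net(x)

# One SGD training step on random data, just to show the shape of the loop.
model = DelayFNN(n_features=10, n_bins=5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 10)           # batch of 32 feature vectors
y = torch.randint(0, 5, (32,))    # delay-bin labels
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

In the real run, this loop repeats over minibatches of the rebalanced dataset for each of the 30 epochs, with the GPU kernel handling the tensor math.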

Challenges & Obstacles

We faced a myriad of obstacles while piecing together this project. First, while the dataset did seem reasonably clean, there was an extensive amount of wrangling involved (merges involving non-exact matches for IATA codes using Jaccard joins and RegEx, merging datasets, and dealing with missing values). Then, moving on to visualizations and exploratory data analysis, understanding the steps necessary to produce the non-conventional maps (network, geographical-based maps) required a substantial amount of documentation and research.

However, these steps were expected and are natural components of any data science project. The main obstacle was related to the colossal amount of data involved with this analysis. This caused us immense trouble (and runtime crashes) and forced us to use tools with more RAM and CPU capacity, such as Google Colab Pro and Amazon Web Service SageMaker. Nonetheless, the final classifiers showed the power of these algorithms and yielded very satisfactory results.

Future Steps & Final Thoughts

This article documented the authors' first extensive encounter with big data analytics. It gave us first-hand experience with many of the concepts covered in the course, as well as the opportunity to face new and unique challenges; no college course, for example, will properly prepare you for the hardware limitations we ran into throughout this analysis. While we are very satisfied with the result, we quote our alma mater's founder, Benjamin Franklin, to justify the next section.

Without continual growth and progress, such words as improvement, achievement, and success have no meaning.

As such, we include below a few additional steps that could improve our models as well as offer additional insights into the data.

  • Run similar analysis including more years, identifying how flight delays have changed over time as well as long-term patterns.
  • Acquire more computing power and resources to be able to better handle the data and produce a more accurate model. This would be especially helpful for our FNN, which was constrained by time.
  • Expand the scope beyond the United States and look at international flights.
  • Incorporate additional types of machine learning models.
  • For Big Data classes, encourage and allow the usage of AWS SageMaker for larger data analyses.

This article uses code inspired from CIS 545: Big Data Analytics, a course at the University of Pennsylvania’s School of Engineering & Applied Science. Elements of this course are available through OpenDS4All.