The Truth About Airline Statistics

Table of Contents ?

Introduction
Step 1: Importing Libraries
Step 2: Gathering Data
Step 3: Univariate Exploration
Step 4: Bivariate Exploration
Step 5: Multivariate Exploration
Step 6: Random Exploration
Step 7: Conclusion

Introduction ?

What is the ASA?

ASA stands for ‘American Statistical Association’, ASA is the main professional organisation for statisticians in the United States. The organization was formed in November 1839 and is the second oldest continuously operating professional society in the United States. Every other year, at the Joint Statistical Meetings, the Graphics Section and the Computing Section join in sponsoring a special Poster Session called The Data Exposition, but more commonly known as The Data Expo. All of the papers presented in this Poster Session are reports of analyses of a common data set provided for the occasion. In addition, all papers presented in the session are encouraged to report the use of graphical methods employed during the development of their analysis and to use graphics to convey their findings.

Quoted from citation: https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.1de

What is the ASA 2009 Data Expo?

The ASA Statistical Computing and Graphics Data Expo is a bi-annual data exploration challenge. Participants are challenged to provide a graphical summary of important features of the data. The task is intentionally vague to allow different entries to focus on different aspects of the data, giving the participants maximum freedom to apply their skills. The 2009 data expo consisted of flight arrival and departure details for all commercial flights on major carriers within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed. The complete dataset and challenge are available on the competition website http://stat-computing.org/dataexpo/2009/.

Because the dataset is so large, we also provided participants introductions to useful tools for dealing with this scale of data: Linux command line tools, including sort, awk, and cut, and sqlite, a simple SQL database. Additionally, we provided pointers to supplemental data on airport locations, airline carrier codes, individual plane information, and weather. This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008.

Quoted from citation: https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.1de

Project Aims

This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008. For our analysis, we are going to be using the three year period between 1989 to 1991 as our sample dataset as the whole dataset is too large to practically observe with Jupyter. Our goal is to ask many questions such as:

How many flights are there?
Which airlines have the most flights?
What time of the week passengers fly the most?
How does season change the frequency and the destination of travel?
Which routes are the most popular?
Which routes and airports experience the most delays?
Which airports are the busiest in terms of inbound and outbound flights?
Which popular routes are delayed the most?
Which times are airports the busiest in terms of flights by each hour?
How does flight distance affect departure and arrival delays?
Which airline experiences the fewest delays?

Table column descriptions:

In [2]:

import pandas as pd
pd.read_csv('misc_data/master/variable-descriptions_master.csv')

Out[2]:

	name	description
0	Year	1987-2008
1	Month	12-Jan
2	DayofMonth	31-Jan
3	DayOfWeek	1 (Monday) – 7 (Sunday)
4	DepTime	actual departure time (local, hhmm)
5	CRSDepTime	scheduled departure time (local, hhmm)
6	ArrTime	actual arrival time (local, hhmm)
7	CRSArrTime	scheduled arrival time (local, hhmm)
8	UniqueCarrier	unique carrier code
9	FlightNum	flight number
10	TailNum	plane tail number
11	ActualElapsedTime	in minutes
12	CRSElapsedTime	in minutes
13	AirTime	in minutes
14	ArrDelay	arrival delay, in minutes
15	DepDelay	departure delay, in minutes
16	Origin	origin IATA airport code
17	Dest	destination IATA airport code
18	Distance	in miles
19	TaxiIn	taxi in time, in minutes
20	TaxiOut	taxi out time in minutes
21	Cancelled	was the flight cancelled?
22	CancellationCode	reason for cancellation (A = carrier, B = weat…
23	Diverted	1 = yes, 0 = no
24	CarrierDelay	in minutes
25	WeatherDelay	in minutes
26	NASDelay	in minutes
27	SecurityDelay	in minutes
28	LateAircraftDelay	in minutes

Step 1: Import ?

In [3]:

# Import libraries
import pandas as pd
import numpy as np
import datetime as dt
import markdown
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
%matplotlib inline

# Suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

# Display dataframe all columns
pd.set_option('display.max_rows', 636)
pd.set_option('display.max_columns', 636)

References: \ https://unicode.org/emoji/charts/full-emoji-list.html \ https://getemoji.com/In [4]:

# Formatting ticks for large values (this will change the large numbers in charts to short clean values)
def reformat_large_tick_values(tick_val, pos):
    """
    Turns large tick values (in the billions, millions and thousands) such as 4500 into 4.5K and also appropriately turns 4000 into 4K (no zero after the decimal).
    """
    if tick_val >= 1000000000:
        val = round(tick_val/1000000000, 1)
        new_tick_format = '{:}B'.format(val)
    elif tick_val >= 1000000:
        val = round(tick_val/1000000, 1)
        new_tick_format = '{:}M'.format(val)
    elif tick_val >= 1000:
        val = round(tick_val/1000, 1)
        new_tick_format = '{:}K'.format(val)
    elif tick_val < 1000:
        new_tick_format = round(tick_val, 1)
    else:
        new_tick_format = tick_val

    # make new_tick_format into a string value
    new_tick_format = str(new_tick_format)
    
    # code below will keep 4.5M as is but change values such as 4.0M to 4M since that zero after the decimal isn't needed
    index_of_decimal = new_tick_format.find(".")
    
    if index_of_decimal != -1:
        value_after_decimal = new_tick_format[index_of_decimal+1]
        if value_after_decimal == "0":
            # remove the 0 after the decimal point since it's not needed
            new_tick_format = new_tick_format[0:index_of_decimal] + new_tick_format[index_of_decimal+2:]
            
    return new_tick_format

Step 2: Gather ?

In [5]:

# flight data
df8991 = pd.read_csv('dataset/1989-1991.csv')

In [6]:

# misc_data
airports = pd.read_csv('misc_data/master/airports_master.csv')
carriers = pd.read_csv('misc_data/master/carriers_master.csv')
plane_data = pd.read_csv('misc_data/master/plane-data_master.csv')
var_desc = pd.read_csv('misc_data/master/variable-descriptions_master.csv')

In [7]:

df8991.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15031014 entries, 0 to 15031013
Data columns (total 17 columns):
Date                 object
DayOfWeek            object
DepTime              object
CRSDepTime           object
ArrTime              object
CRSArrTime           object
UniqueCarrier        object
FlightNum            int64
ActualElapsedTime    int64
CRSElapsedTime       int64
ArrDelay             int64
DepDelay             int64
Origin               object
Dest                 object
Distance             int64
Cancelled            int64
Diverted             int64
dtypes: int64(8), object(9)
memory usage: 1.9+ GB

In [8]:

# Converting date column
df8991['Date'] = pd.to_datetime(df8991['Date'])

In [9]:

from PIL import Image # Library for importing images
Image.open('images/Boeing_727-22_United_Airlines.jpg')

Out[9]:

Step 3: Univariate Exploration ?

During this step, we will investigate distributions of individual variables to prepare for observing the relationships between variables.In [10]:

df8991.head(10)

Out[10]:

	Date	DayOfWeek	DepTime	CRSDepTime	ArrTime	CRSArrTime	UniqueCarrier	FlightNum	ActualElapsedTime	CRSElapsedTime	ArrDelay	DepDelay	Origin	Dest	Distance
0	1989-01-23	Monday	14:19:00	12:30:00	17:42:00	15:52:00	UA	183	323	322	110	109	SFO	HNL	2398
1	1989-01-24	Tuesday	12:55:00	12:30:00	16:12:00	15:52:00	UA	183	317	322	20	25	SFO	HNL	2398
2	1989-01-25	Wednesday	12:30:00	12:30:00	15:33:00	15:52:00	UA	183	303	322	-19	0	SFO	HNL	2398
3	1989-01-26	Thursday	12:30:00	12:30:00	15:23:00	15:52:00	UA	183	293	322	-29	0	SFO	HNL	2398
4	1989-01-27	Friday	12:32:00	12:30:00	15:13:00	15:52:00	UA	183	281	322	-39	2	SFO	HNL	2398
5	1989-01-28	Saturday	12:28:00	12:30:00	15:50:00	15:52:00	UA	183	322	322	-2	-2	SFO	HNL	2398
6	1989-01-29	Sunday	16:39:00	12:30:00	19:42:00	15:52:00	UA	183	303	322	230	249	SFO	HNL	2398
7	1989-01-30	Monday	12:31:00	12:30:00	15:31:00	15:52:00	UA	183	300	322	-21	1	SFO	HNL	2398
8	1989-01-31	Tuesday	14:05:00	12:30:00	18:27:00	15:52:00	UA	183	382	322	155	95	SFO	HNL	2398
9	1989-01-02	Monday	10:57:00	10:45:00	15:37:00	15:54:00	UA	184	160	189	-17	12	DEN	IAD	1452

3.1. Total Flights by Month

In [11]:

# Extracting the month number from the dates
flight_month = df8991['Date'].dt.month

# Assigning the months
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Creating a new dataframe
flight_month = pd.DataFrame(flight_month)

# Renaming the column name of the new dataframe
flight_month.rename(columns = {'Date':'Month'},inplace=True)

# Adding a new column to count flights per month
flight_month['Flights'] = df8991['Date'].dt.month.count

# Grouping the data by month then assigning month names to 'month' column
month_data  = flight_month.groupby('Month').count()
month_data['Month'] = months

# Renaming each month number to a real month name
flights_by_month = month_data.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})
flights_by_month.drop(columns=['Month'], axis=1, inplace=True)

# Looking at the dataframe we created
flights_by_month

Out[11]:

	Flights
Month
Jan	1223834
Feb	1102190
Mar	1271415
Apr	1250728
May	1281666
Jun	1255135
Jul	1299911
Aug	1312995
Sep	1254058
Oct	1297516
Nov	1224153
Dec	1257413

In [12]:

# Total flights
flights_by_month['Flights'].sum()

Out[12]:

15031014

In [13]:

# Average flights per month
flights_by_month['Flights'].mean()

Out[13]:

1252584.5

In [14]:

# Working out percentage of flights in 'Feb' compared to the monthly average
round(flights_by_month.loc['Feb','Flights']/flights_by_month['Flights'].mean()-1, 2)

Out[14]:

-0.12

In [15]:

# Plot
flights_by_month.plot(kind = 'bar', figsize=(14.70, 8.27), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Month\n', fontsize = 14, weight = "bold")

# Set axis labels
plt.ylabel('Flights', fontsize = 10,  weight = "bold")
plt.xlabel('Month', fontsize = 10, weight = "bold")

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_month.png')
plt.show()

Observation 1: For this dataset we have used the data between 1989-1991, therefore we grouped the data into monthly averages between those years. can see August has the most flights on average and Feburary had the least. This shows us that summer months have the most flights and winter months the least. Feburary had 12% less flights compared to the monthly average of 1,252,585.

3.2. Total Flights by Season

In [16]:

# Create new dataframe
flights_by_season = flights_by_month
flights_by_season = month_data.rename(index={1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring', 6: 'Summer', 7: 'Summer', 8: 'Summer', 9: 'Autumn', 10: 'Autumn', 11: 'Autumn', 12: 'Winter'})

# Drop 'Month' column
flights_by_season.drop(columns=['Month'], axis=1, inplace=True)

# Group seasons
flights_by_season.rename_axis(index={'Month':'Season'}, inplace=True)
flights_by_season.reset_index(inplace=True)
flights_by_season = flights_by_season.groupby('Season').sum()

# Order index
flights_by_season = flights_by_season.reindex(['Spring', 'Summer', 'Autumn', 'Winter'])

# Looking at the dataframe we created
flights_by_season

Out[16]:

	Flights
Season
Spring	3803809
Summer	3868041
Autumn	3775727
Winter	3583437

In [17]:

flights_by_season['Flights'].pct_change(periods=1)

Out[17]:

Season
Spring         NaN
Summer    0.016886
Autumn   -0.023866
Winter   -0.050928
Name: Flights, dtype: float64

In [18]:

# Plot
flights_by_season.plot(kind = 'bar', figsize=(7, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Season', fontsize = 15)

# Set axis labels
plt.ylabel('Number of flights', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_season.png')
plt.show()

Observation 2: Here we are looking at flights by season, this plot shows us the total flights recorded for each season. Summer has the most total flights but the difference between them is not as much as you might think. From my calculations, the percentage difference between the best and worst seasons is only 5%. Spring has a strong performance placing 2nd, Autumn 3rd and Winter a more distant 4th.

3.3. Total Flights by Day

In [19]:

# Extracting the day number from the dates
flight_day = df8991['Date'].dt.dayofweek 

# Assigning the days
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Creating a new dataframe
flight_day = pd.DataFrame(flight_day)

# Renaming the column name of the new dataframe
flight_day.rename(columns = {'Date':'Day'},inplace=True)

# Adding a new column to count flights per day
flight_day['Flights'] = df8991['Date'].dt.dayofweek.count

# Grouping the data by day
day_data  = flight_day.groupby('Day').count()
day_data['Day'] = days

# Renaming each day number to a real day name
flights_by_day = day_data.rename(index={0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'})
flights_by_day.drop(columns=['Day'], axis=1, inplace=True)

# Looking at the dataframe we created
flights_by_day

Out[19]:

	Flights
Day
Mon	2211445
Tue	2210868
Wed	2200217
Thu	2186411
Fri	2180619
Sat	1964556
Sun	2076898

In [20]:

flights_by_day['Flights'].pct_change()

Out[20]:

Day
Mon         NaN
Tue   -0.000261
Wed   -0.004818
Thu   -0.006275
Fri   -0.002649
Sat   -0.099083
Sun    0.057184
Name: Flights, dtype: float64

In [21]:

# Average flights per day
round(flights_by_day['Flights'].mean(), 2)

Out[21]:

2147287.71

In [22]:

# Working out percentage of flights compared to the weekly average
round(flights_by_day.loc['Sat','Flights']/flights_by_day['Flights'].mean()-1, 2)

Out[22]:

-0.09

In [23]:

# Plot
flights_by_day.reindex(index=flights_by_day.index[::-1]).plot(kind = 'barh', figsize=(10, 5), rot=0)

# Set title label
plt.title('Total Flights by Day', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
ax.axvline(flights_by_day['Flights'].mean(), color='red', linewidth=2, linestyle='dashed', label='Average')

# Legend
plt.legend(bbox_to_anchor=(1.2, 1), loc=1, borderaxespad=0.0)

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_day.png')
plt.show()

Observation 3: This plot shows us total flights by day. Monday has the most total flights with 2,211,445 but the difference between Monday and the other weekdays is rather negligable. Saturday shows a near 10% dip in flights compared to Monday while Sunday dips under 6% using the same comparison, this suggests passengers are flying more during the week than weekends. This chart shows a 10% variance between the day with the highest and lowest amount of flights, 3 years of data shows consistency in strong demand for flights regardless of the day of the week.

3.4. Total Flights by Hour

In [24]:

# Creating flight_hour
flight_hour = df8991['DepTime']

# Creating a new dataframe
flight_hour = pd.DataFrame(flight_hour)

# Extracting the hour from the times
flight_hour['FlightHour'] = df8991['DepTime'].str[:2]

# Adding a new column to count flights per hour
flight_hour['Flights'] = flight_hour['FlightHour'].count

# Drop column
flight_hour.drop(columns=['DepTime'], axis=1, inplace=True)

# Grouping the data by hour
hour_data  = flight_hour.groupby('FlightHour').count()
flights_by_hour = hour_data

# Looking at the dataframe we created
flights_by_hour

Out[24]:

	Flights
FlightHour
00	61648
01	46131
02	8595
03	2593
04	5505
05	53452
06	779935
07	1049654
08	1103998
09	976624
10	834899
11	915756
12	977161
13	1032993
14	845945
15	882206
16	934469
17	988229
18	973944
19	842436
20	798213
21	503912
22	275905
23	136811

In [25]:

# Most popular flight hours
flights_by_hour.sort_values(by='Flights', ascending=False).head(5)

Out[25]:

	Flights
FlightHour
08	1103998
07	1049654
13	1032993
17	988229
12	977161

In [26]:

# Least popular flight hours 
flights_by_hour.sort_values(by='Flights', ascending=True).head(5)

Out[26]:

	Flights
FlightHour
03	2593
04	5505
02	8595
01	46131
05	53452

In [27]:

# Working out x difference in flights between busiest hour compared to the quietest hour
round(flights_by_hour.loc['08','Flights']/flights_by_hour.loc['03','Flights'], 2)

Out[27]:

425.76

In [28]:

# Plot
flights_by_hour.plot(kind = 'bar', figsize=(14.70, 8.27), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Hour\n', fontsize = 14)

# Set axis labels
plt.ylabel('Flights', fontsize = 10,  weight = "bold")
plt.xlabel('Time', fontsize = 10, weight = "bold")

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_hour.png')
plt.show()

Observation 4: We can see the total flights by hour here based on departure time. 08:00 is the most popular time to fly with 1,103,998 flights according to departure time data from 1989-1991. We can see that there is a dip at 10:00, perhaps due to scheduling or other external reasons. The next dip we see on the chart is at 14:00 hours before recovering again until 18:00 where the flights steadily drop off. 03:00 recorded the lowest number of flights at only 2,593, that is a -426x difference in total flights vs 08:00. Flights between midnight and 05:00 are lower due to many factors including being the time people are generally sleeping but also noise restrictions. Maybe we can look at flight duration next?

3.5. Flight Duration in Minutes

In [29]:

df8991['ActualElapsedTime'].min(), df8991['ActualElapsedTime'].max()

Out[29]:

(-591, 1883)

In [30]:

df8991['CRSElapsedTime'].min(), df8991['CRSElapsedTime'].max()

Out[30]:

(-59, 1565)

In [31]:

df8991[df8991['CRSElapsedTime'] == -59]

Out[31]:

	Date	DayOfWeek	DepTime	CRSDepTime	ArrTime	CRSArrTime	UniqueCarrier	FlightNum	ActualElapsedTime	CRSElapsedTime	ArrDelay	DepDelay	Origin	Dest	Distance	Cancelled	Diverted
13266400	1991-08-08	Thursday	06:56:00	07:00:00	11:04:00	07:01:00	NW	312	188	-59	243	-4	MSP	DCA	931	0	0

In [32]:

# Looking for 'ActualElapsedTime' values under 0
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] < 0].count()

Out[32]:

In [33]:

# Looking for 'CRSElapsedTime' values under 0
df8991['CRSElapsedTime'][df8991['CRSElapsedTime'] < 0].count()

Out[33]:

In [34]:

# Creating flight_duration and dropping time values under 0
flight_duration = df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] >= 0]

# Creating a new dataframe
flight_duration = pd.DataFrame(flight_duration)

# Extracting the duration data
flight_duration['FlightDuration'] = df8991['ActualElapsedTime']

# Adding a new column to count flights by duration
flight_duration['Flights'] = flight_duration['FlightDuration'].count

# Drop column
flight_duration.drop(columns=['ActualElapsedTime'], axis=1, inplace=True)

# Grouping the data by duration
duration_data  = flight_duration.groupby('FlightDuration').count()
flights_by_duration = duration_data

# Looking at the dataframe we created
flights_by_duration.head()

Out[34]:

	Flights
FlightDuration
0	2
1	2
2	4
3	1
4	3

In [35]:

# Average flight duration
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 0].mean()

Out[35]:

109.90256206131029

In [36]:

# Median flight duration
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 0].median()

Out[36]:

91.0

In [37]:

# Percentage of flights over 300 minutes
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 300].count()/df8991['ActualElapsedTime'].count()

Out[37]:

0.018513654501286475

In [38]:

# Percentage of flights between 0 and 300 minutes
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'].between(0, 300)].count()/df8991['ActualElapsedTime'].count()

Out[38]:

0.9814844161544923

In [39]:

# Checking the statistics
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 0].describe()

Out[39]:

count    1.503098e+07
mean     1.099026e+02
std      6.445965e+01
min      1.000000e+00
25%      6.400000e+01
50%      9.100000e+01
75%      1.400000e+02
max      1.883000e+03
Name: ActualElapsedTime, dtype: float64

In [40]:

# Maximum flight duration
df8991['ActualElapsedTime'].max()

Out[40]:

In [41]:

# Plot
flights_by_duration.plot(kind = 'area', figsize=(7, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Duration', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('Minutes', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_duration.png')
plt.show()

Observation 5: Total flights by duration shows us a lot of interesting insights. Our area graph above shows us flights average around 110 minutes while the median flight duration is 91 minutes. The longest flight duration was 1883 minutes which equates to over 31 hours! The conclusion is 98.1% of flights are under 300 minutes and the remaining 1.9% of flights are generally under 300 minutes or 5 hours. Distance is an ideal thing to look at next, how far do passengers travel and does flight distance vary depending on time?

3.6. Flight Distance in Miles

In [42]:

# Creating flight_distance
flight_distance = df8991['Distance']

# Creating a new dataframe
flight_distance = pd.DataFrame(flight_distance)

# Extracting the distance data
flight_distance['FlightDistance'] = df8991['Distance']

# Adding a new column to count flights by distance
flight_distance['Flights'] = flight_distance['FlightDistance'].count

# Drop column
flight_distance.drop(columns=['Distance'], axis=1, inplace=True)

# Grouping the data by distance
distance_data  = flight_distance.groupby('FlightDistance').count()
flights_by_distance = distance_data
flights_by_distance_idx = distance_data.reset_index()

# Looking at the dataframe we created
flights_by_distance.head()

Out[42]:

	Flights
FlightDistance
0	1
11	889
17	4
18	2
21	5696

In [43]:

# Minimum and maxiumum distance travelled
df8991['Distance'].min(), df8991['Distance'].max()

Out[43]:

(0, 4502)

In [44]:

# Average flight distance
df8991['Distance'].mean()

Out[44]:

632.9591811969572

In [45]:

df8991['Distance'][df8991['Distance'] > 2800].count()/df8991['Distance'].count()

Out[45]:

0.0016491901344779533

In [46]:

# Plot
plt.hist(data = flights_by_distance_idx, x = 'FlightDistance', bins = 25) #, y = 'Flights'

# Set title label
plt.title('Total Flights by Distance', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('Distance (Miles)', fontsize = 13)

# Ticks fixed for histogram
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1) + 'K'))

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_distance.png')
plt.show()

Observation 6: We are looking at total flights by distance using a histogram, flights are indicated on the left and distance is on the bottom. Most flights range from around 100 to 1000 miles and steadily decline after, with the average being 632 miles. 2500-3000 miles seems to be the point flights most drastically reduce. Statistically 0.1% of flights go over 2800 miles. Maybe we can look at carrier data to find some different correlations?

3.7. Flights by Carrier

In [47]:

# Creating flight_carrier
flight_carrier = df8991['UniqueCarrier']

# Creating a new dataframe
flight_carrier = pd.DataFrame(flight_carrier)

# Extracting the carrier data
flight_carrier['FlightCarrier'] = df8991['UniqueCarrier']

# Adding a new column to count flights by carrier
flight_carrier['Flights'] = flight_carrier['FlightCarrier'].count

# Drop column
flight_carrier.drop(columns=['UniqueCarrier'], axis=1, inplace=True)

# Grouping the data by carrier
carrier_data  = flight_carrier.groupby('FlightCarrier').count()
flights_by_carrier = carrier_data

# Looking at the dataframe we created
flights_by_carrier.head()

Out[47]:

	Flights
FlightCarrier
AA	2084513
AS	267975
CO	1241287
DL	2430469
EA	398362

In [48]:

# Creating a filtered database for top 10 carriers ascending
carriers_top_10 = flights_by_carrier.sort_values('Flights', ascending=False).iloc[0:10]

# Dataframe without index
carriers_top_10_idx = carriers_top_10.reset_index()

In [49]:

# Top 10 Carriers
carriers_top_10

Out[49]:

	Flights
FlightCarrier
US	2575160
DL	2430469
AA	2084513
UA	1774850
NW	1352989
CO	1241287
WN	950837
TW	756992
HP	623477
EA	398362

In [50]:

carriers.head()

Out[50]:

	code	description
0	02Q	Titan Airways
1	04Q	Tradewind Aviation
2	05Q	Comlux Aviation, AG
3	06Q	Master Top Linhas Aereas Ltd.
4	07Q	Flair Airlines Ltd.

In [51]:

# Filter
carriers[carriers['code'] == 'US']

Out[51]:

	code	description
1308	US	US Airways Inc. (Merged with America West 9/05…

In [52]:

# Renaming column to prepare for df merge
carriers_top_10_idx.rename(columns={"FlightCarrier":"code"}, inplace=True)

In [53]:

# Merging dataframes to get the carrier description
carriers_top_10_idx2 = pd.merge(left=carriers_top_10_idx, right=carriers, how='left', on='code')

In [54]:

carriers_top_10_idx2

Out[54]:

	code	Flights	description
0	US	2575160	US Airways Inc. (Merged with America West 9/05…
1	DL	2430469	Delta Air Lines Inc.
2	AA	2084513	American Airlines Inc.
3	UA	1774850	United Air Lines Inc.
4	NW	1352989	Northwest Airlines Inc.
5	CO	1241287	Continental Air Lines Inc.
6	WN	950837	Southwest Airlines Co.
7	TW	756992	Trans World Airways LLC
8	HP	623477	America West Airlines Inc. (Merged with US Air…
9	EA	398362	Eastern Air Lines Inc.

In [55]:

carriers_top_10_idx2['description'] = carriers_top_10_idx2['description'].str.replace(r"\(.*\)","")

In [56]:

carriers_top_10_idx2

Out[56]:

	code	Flights	description
0	US	2575160	US Airways Inc.
1	DL	2430469	Delta Air Lines Inc.
2	AA	2084513	American Airlines Inc.
3	UA	1774850	United Air Lines Inc.
4	NW	1352989	Northwest Airlines Inc.
5	CO	1241287	Continental Air Lines Inc.
6	WN	950837	Southwest Airlines Co.
7	TW	756992	Trans World Airways LLC
8	HP	623477	America West Airlines Inc.
9	EA	398362	Eastern Air Lines Inc.

In [57]:

carriers_top_10_idx2['Flights'].pct_change()

Out[57]:

0         NaN
1   -0.056187
2   -0.142341
3   -0.148554
4   -0.237688
5   -0.082559
6   -0.233991
7   -0.203868
8   -0.176376
9   -0.361064
Name: Flights, dtype: float64

In [58]:

# Variables
labels = [' ',' ',' ',' ',' ',' ',' ',' ',' ',' ']
explode = [0.1, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]

# Plot
carriers_top_10.plot(kind='pie', figsize=(14.70, 8.27), rot=0, subplots=True, labels=labels, legend=True, 
                     cmap='jet', autopct='%1.1f%%', pctdistance=1.15, explode=explode)

# Legend
carrier_list = list(carriers_top_10_idx2['description'])
plt.legend(labels = carrier_list, bbox_to_anchor=(1.55, 1), loc=1, borderaxespad=0.0)

# Set title label
plt.title('Total Flights by Carrier', loc='center', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_carrier.png')
plt.show()

Observation 7: The total flights by carrier is a good indication of which airlines were the most popular between 1989 and 1991. The most popular carrier is US Airways Inc with 2,575,160 flights between 1989-1991, that represents an 18.1% share of all flights from our data. Delta Air Lines Inc. is a close 2nd place with 2,430,469 or a 5.6% difference from 1st place with a 17% share of the market. Those two airlines would have gone toe to toe dominating aviation at that time. However, American Airlines Inc. features in 3rd place with 2,084,513 flights with a strong 14.7% market share. The gap widens to a 14% difference in total flights vs 1st place. This trend continues as the other airlines share smaller and smaller stakes in the market. Perhaps later we can find out which airline is the most punctual?

3.8. Flight Distance by Day

In [59]:

# Grouping data by day of week then sorting index
weekday_cat = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday = df8991.drop(['FlightNum'], axis=1).groupby(['DayOfWeek']).mean().reindex(weekday_cat)

# Rename index variable
idx = {"Monday": "Mon", "Tuesday":"Tue", "Wednesday":"Wed", "Thursday":"Thu", "Friday":"Fri", "Saturday":"Sat", "Sunday":"Sun"}

In [60]:

weekday.head(7)

Out[60]:

	ActualElapsedTime	CRSElapsedTime	ArrDelay	DepDelay	Distance	Cancelled	Diverted
DayOfWeek
Monday	109.058930	109.488231	4.964975	5.425532	628.187534	0.0	0.0
Tuesday	109.530635	109.422163	6.213578	6.144185	627.741170	0.0	0.0
Wednesday	109.712301	109.355599	7.021518	6.684452	627.115591	0.0	0.0
Thursday	110.101038	109.411749	8.744248	8.079987	627.562929	0.0	0.0
Friday	110.096784	109.529341	9.349141	8.810754	628.533479	0.0	0.0
Saturday	110.790458	112.260839	4.769930	6.301211	650.885998	0.0	0.0
Sunday	110.143351	111.337036	5.055255	6.310329	643.155490	0.0	0.0

In [61]:

weekday['Distance']

Out[61]:

DayOfWeek
Monday       628.187534
Tuesday      627.741170
Wednesday    627.115591
Thursday     627.562929
Friday       628.533479
Saturday     650.885998
Sunday       643.155490
Name: Distance, dtype: float64

In [62]:

# Figsize
fig = ax.get_figure()
plt.figure(figsize = [15, 6])
plt.suptitle('Elapsed Time and Distance vs Day', fontsize=15)

# Plot 1
plt.subplot(1, 2, 1)
weekday['ActualElapsedTime'].rename(index = idx, inplace = False).plot(kind = 'line', title='Actual Elapsed Time', legend=None)
plt.xlabel('')
plt.ylabel('')

# Plot 2
plt.subplot(1, 2, 1)
weekday['CRSElapsedTime'].rename(index = idx, inplace = False).plot(kind = 'line', title='Actual Elapsed Time vs CRS Elapsed Time (Min)', legend=None)
plt.legend()
plt.xlabel('')
plt.ylabel('')

# Plot 3
plt.subplot(1, 2, 2)
weekday['Distance'].rename(index = idx, inplace = False).plot(kind = 'line', title='Distance (Miles)', legend=None)
plt.xlabel('')
plt.ylabel('')

# Show
fig.tight_layout()
fig.subplots_adjust(top=2)
plt.savefig('images/charts/time_vs_distance_vs_day.png')
plt.show()

Observation 8: Elapsed time and distance versus day is a good way to find correlation between time of the week and flight delays with distance. Flight time seems to increase as the week passes. The weekend has the longest flights on average, indicating passengers might be travelling on weekend holidays but commute on weekdays. We also know from other data that fewer flights take place during the weekend but also delays are lower while flight duration and distance is longer on average. We should look into delays more, which airports are the most punctual and which are the least?

3.9. Delay by Origin and Destination

In [63]:

df8991['Origin'].unique()

Out[63]:

array(['SFO', 'DEN', 'HNL', 'LIH', 'PHL', 'OGG', 'IAD', 'EWR', 'LAX',
       'KOA', 'ORD', 'MKE', 'IAH', 'MSY', 'RIC', 'SEA', 'FLL', 'MCO',
       'BWI', 'HOU', 'MCI', 'SJC', 'CMH', 'OMA', 'ORF', 'BOS', 'ABQ',
       'SMF', 'OKC', 'SGF', 'SLC', 'ONT', 'SAN', 'BUF', 'LGB', 'MIA',
       'BDL', 'IND', 'TPA', 'SYR', 'PIT', 'STL', 'PDX', 'CLT', 'PHX',
       'CVG', 'MBS', 'ATL', 'DSM', 'LAS', 'AUS', 'JAN', 'MEM', 'DTW',
       'SRQ', 'MSP', 'DFW', 'PSP', 'MDT', 'FAT', 'BGR', 'PWM', 'ROC',
       'MDW', 'FAR', 'CHS', 'SAV', 'LGA', 'ANC', 'FSD', 'OAK', 'CLE',
       'CAE', 'TUL', 'HPN', 'GSO', 'RDU', 'DAY', 'GRR', 'MSN', 'JAX',
       'BNA', 'COS', 'SAT', 'CID', 'PBI', 'GEG', 'LNK', 'PVD', 'BIL',
       'SBA', 'ELP', 'SDF', 'TUS', 'SNA', 'ICT', 'BUR', 'ABE', 'RAP',
       'GTF', 'LIT', 'ALB', 'RNO', 'BHM', 'HSV', 'BOI', 'DCA', 'SUX',
       'TYS', 'MHT', 'EUG', 'MLI', 'BTV', 'PIA', 'FAI', 'MFR', 'MRY',
       'JFK', 'RSW', 'ERI', 'PHF', 'PSC', 'ISP', 'AVP', 'EVV', 'ELM',
       'LEX', 'ORH', 'BGM', 'TRI', 'CRW', 'TOL', 'MYR', 'ITH', 'GSP',
       'HTS', 'ACY', 'BLI', 'SCK', 'RDM', 'CCR', 'YKM', 'DAL', 'LBB',
       'CRP', 'AMA', 'HRL', 'MAF', 'DET', 'VPS', 'AZO', 'FWA', 'DLH',
       'ATW', 'GRB', 'BIS', 'GPT', 'LAN', 'GFK', 'RST', 'BZN', 'EAU',
       'LSE', 'MOT', 'MOB', 'MSO', 'BTR', 'CHA', 'SHV', 'MGM', 'PFN',
       'HDN', 'SBN', 'STT', 'STX', 'FAY', 'AVL', 'CAK', 'OAJ', 'TLH',
       'ROA', 'ILM', 'AGS', 'DAB', 'FNT', 'ISO', 'CHO', 'CMI', 'LYH',
       'UCA', 'PNS', 'EYW', 'APF', 'GNV', 'SJU', 'PUB', 'MLU', 'MLB',
       'CSG', 'CPR', 'IDA', 'JAC', 'HLN', 'FCA', 'JNU', 'BTM', 'DRO',
       'GJT', 'YUM', 'FLG', 'GCN', 'PIE', 'TVL', 'BFL', 'GUC', 'BET',
       'OME', 'OTZ', 'SCC', 'CDV', 'YAK', 'PSG', 'SIT', 'WRG', 'GUM',
       'MFE', 'LFT', 'YAP', 'ROR', 'SPN', 'ROP', 'TVC', 'GST', 'EGE',
       'SUN', 'PMD', 'SWF', 'EFD', 'PSE', 'HVN', 'TTN', 'BQN', 'SPI'],
      dtype=object)

In [64]:

origin = df8991.drop(['FlightNum'], axis=1).groupby(['Origin']).mean()

In [65]:

dest = df8991.drop(['FlightNum'], axis=1).groupby(['Dest']).mean()

In [66]:

origin['DepDelay'].sort_values(ascending=False).iloc[0:10]

Out[66]:

Origin
PSE    22.716667
YAP    19.763158
SUN    14.538462
HDN    12.525217
JFK    11.455878
ORD    10.036105
GUC     9.815362
JNU     9.490426
EWR     8.866009
PIT     8.842862
Name: DepDelay, dtype: float64

In [67]:

# Assigning high and low airport delay
aplow_delay = (origin['DepDelay'].sort_values(ascending=False)+dest['ArrDelay']).sort_values(ascending=True).iloc[0:10].sort_values(ascending=False)
aphigh_delay = (origin['DepDelay'].sort_values(ascending=False)+dest['ArrDelay']).sort_values(ascending=False).iloc[0:10].sort_values(ascending=True)

In [68]:

# Top 10 lowest delay airports
aplow_delay.sort_values(ascending=True)

Out[68]:

BQN   -10.351648
ATW    -5.999804
YKM     0.053702
MOT     0.203609
EAU     0.944523
PMD     1.263444
ROP     1.655739
LFT     2.202046
PUB     2.252053
GRB     2.559022
dtype: float64

In [69]:

# Top 10 highest delay airports
aphigh_delay.sort_values(ascending=False)

Out[69]:

PSE    53.250000
SUN    38.038462
YAP    37.362168
HDN    22.337867
GST    20.760290
SIT    20.030887
GUC    19.211647
JFK    18.920419
JNU    18.902215
SFO    18.836069
dtype: float64

In [70]:

# Chart 1
plt.subplot(1, 2, 1)
origin['DepDelay'].sort_values(ascending=False).iloc[0:10].plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Departure Airports by Delay')

# Chart 2
plt.subplot(1, 2, 2)
dest['ArrDelay'].sort_values(ascending=False).iloc[0:10].plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Arrival Airports by Delay', label='Delay in minutes')

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/delay_by_origin_dest.png')
plt.show()

Here we are looking at airports ranked by average delay times. PIT is has the lowest departure delays but PSE has the highest departure delays on average. SCC has the lowest arrival delays and PSE has the highest departure delays. We should look at total delay now, this should be easily done by adding the departure and arrival times.In [71]:

# Chart 1
plt.subplot(1, 2, 1)
aphigh_delay.plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Top 10 Airports by Highest Total Delay')

# Chart 2
plt.subplot(1, 2, 2)
aplow_delay.plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Top 10 Airports by Lowest Total Delay')

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/delay_by_airport.png')
plt.show()

Observation 9: The above plot shows airports by total delay this time. PSE – Mercedita Airport has the highest average delays at 53.2 minutes. BQN – Rafael Hernández Airport has the lowest average delays at -10.35 minutes. In other words, flights arrive 10 minutes early on average.In [72]:

# Airline carrier filter
carriers[carriers['code'] == 'UA']

Out[72]:

	code	description
1297	UA	United Air Lines Inc.

In [73]:

# Airport name filter
airports[airports['iata'] == 'PSE']

Out[73]:

	iata	airport	city	state	country	lat	long
2674	PSE	Mercedita	Ponce	PR	USA	18.008303	-66.563012

In [74]:

# Airports by flights
(df8991['Origin'] == 'ATW').sum()+(df8991['Dest'] == 'ATW').sum()

Out[74]:

Step 4: Bivariate Exploration ?

4.1. Airports by Traffic

In [75]:

df8991['Dest'].unique()

Out[75]:

array(['HNL', 'IAD', 'SFO', 'DEN', 'LIH', 'LAX', 'PDX', 'ORD', 'KOA',
       'OGG', 'BOS', 'EWR', 'IAH', 'MSY', 'PHX', 'OMA', 'ABE', 'SDF',
       'DFW', 'PHL', 'BTV', 'FAT', 'DTW', 'ABQ', 'TUL', 'AUS', 'MSP',
       'PIT', 'GSO', 'SMF', 'TUS', 'IND', 'GRR', 'HOU', 'ORF', 'SJC',
       'SAN', 'CMH', 'SEA', 'MCI', 'MEM', 'BDL', 'TPA', 'PBI', 'FSD',
       'BWI', 'DSM', 'CVG', 'MCO', 'MHT', 'PWM', 'BGR', 'BOI', 'RDU',
       'LAS', 'DAY', 'SAV', 'CHS', 'FAI', 'ANC', 'CLE', 'OAK', 'ICT',
       'OKC', 'MKE', 'GEG', 'LIT', 'TYS', 'BNA', 'BUF', 'COS', 'FAR',
       'SGF', 'LNK', 'SUX', 'RAP', 'GTF', 'BIL', 'CLT', 'ONT', 'SLC',
       'ATL', 'STL', 'MIA', 'MSN', 'MRY', 'JAX', 'MBS', 'CAE', 'RNO',
       'HPN', 'ROC', 'ALB', 'RSW', 'RIC', 'SBA', 'SAT', 'LGB', 'PVD',
       'HSV', 'MDT', 'JAN', 'DCA', 'BUR', 'JFK', 'SNA', 'PSP', 'SYR',
       'FLL', 'BHM', 'MLI', 'PIA', 'CID', 'SRQ', 'LGA', 'MDW', 'ELP',
       'EUG', 'MFR', 'ERI', 'AVP', 'ISP', 'PSC', 'EVV', 'MYR', 'GSP',
       'PHF', 'ITH', 'CRW', 'BGM', 'TOL', 'LEX', 'ELM', 'ORH', 'HTS',
       'TRI', 'ACY', 'SCK', 'CCR', 'YKM', 'RDM', 'BLI', 'HRL', 'DAL',
       'AMA', 'LBB', 'CRP', 'MAF', 'DET', 'GRB', 'FWA', 'RST', 'DLH',
       'AZO', 'GFK', 'LAN', 'LSE', 'EAU', 'BZN', 'MOT', 'BIS', 'MOB',
       'GPT', 'MSO', 'CHA', 'ATW', 'SHV', 'VPS', 'MGM', 'BTR', 'PFN',
       'HDN', 'SBN', 'STT', 'STX', 'CAK', 'FNT', 'AVL', 'OAJ', 'FAY',
       'ROA', 'ISO', 'DAB', 'ILM', 'AGS', 'CMI', 'TLH', 'CHO', 'LYH',
       'UCA', 'PNS', 'EYW', 'GNV', 'APF', 'SJU', 'PUB', 'MLB', 'MLU',
       'CSG', 'CPR', 'JAC', 'FCA', 'HLN', 'IDA', 'JNU', 'BTM', 'GJT',
       'YUM', 'DRO', 'FLG', 'GCN', 'BFL', 'GUC', 'PIE', 'TVL', 'BET',
       'OME', 'OTZ', 'SCC', 'YAK', 'CDV', 'SIT', 'PSG', 'WRG', 'GUM',
       'MFE', 'LFT', 'YAP', 'ROR', 'SPN', 'ROP', 'TVC', 'GST', 'EGE',
       'SUN', 'PMD', 'SWF', 'EFD', 'PSE', 'HVN', 'TTN', 'BQN', 'SPI'],
      dtype=object)

In [76]:

# Extracting the month number from the dates
flight_dest = df8991['Date'].dt.month

# Assigning the months
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Creating a new dataframe
flight_dest = pd.DataFrame(flight_dest)

# Renaming the column name of the new dataframe
flight_dest.rename(columns = {'Date':'Month'},inplace=True)

# Adding a new column to count flights per month
flight_dest['Flights'] = df8991['Date'].dt.month.count

# Adding a new column to sum destinations per month
flight_dest['Dest'] = df8991['Dest']

# Grouping the data by month then assigning month names to 'month' column
month_dest  = flight_dest.groupby([('Month'),('Dest')]).count()

# Renaming each month number to a real month name
flights_by_dest = month_dest.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})

# Looking at the dataframe we created
flights_by_dest.head()

Out[76]:

		Flights
Month	Dest
Jan	ABE	1281
ABQ	7264
ACY	160
AGS	780
ALB	3203

In [77]:

flights_by_dest_piv = flights_by_dest.reset_index().pivot(columns='Month',index='Dest',values='Flights')
flights_by_dest_piv = flights_by_dest_piv.reindex(columns= ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])

In [78]:

flights_by_dest_piv.head()

Out[78]:

Month	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
Dest
ABE	1281.0	1155.0	1169.0	1151.0	1276.0	1319.0	1499.0	1508.0	1490.0	1599.0	1540.0	1519.0
ABQ	7264.0	6549.0	7691.0	7645.0	7947.0	7708.0	7900.0	7909.0	7580.0	7826.0	7281.0	7311.0
ACY	160.0	144.0	163.0	162.0	209.0	211.0	249.0	242.0	188.0	190.0	182.0	184.0
AGS	780.0	703.0	782.0	747.0	796.0	789.0	796.0	809.0	784.0	810.0	789.0	823.0
ALB	3203.0	2989.0	3260.0	3215.0	3256.0	3160.0	3238.0	3312.0	3193.0	3356.0	3097.0	3165.0

In [79]:

flights_by_dest_piv.columns

Out[79]:

Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct',
       'Nov', 'Dec'],
      dtype='object', name='Month')

In [80]:

flights_by_dest_piv.sort_values(by=['Dest'], axis=0, ascending=False).head()

Out[80]:

Month	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
Dest
YUM	463.0	412.0	451.0	435.0	411.0	372.0	374.0	404.0	364.0	391.0	373.0	430.0
YKM	60.0	51.0	55.0	59.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
YAP	42.0	38.0	47.0	52.0	54.0	24.0	20.0	27.0	26.0	23.0	24.0	27.0
YAK	160.0	121.0	166.0	171.0	168.0	164.0	173.0	174.0	152.0	166.0	151.0	144.0
WRG	63.0	64.0	77.0	83.0	87.0	86.0	81.0	84.0	78.0	85.0	59.0	62.0

In [81]:

# Filling NaN with 0
flights_by_dest_piv = flights_by_dest_piv.fillna(0)

# Changing from float to np.int64 to remove .0
flights_by_dest_piv = flights_by_dest_piv.astype(np.int64)

# Summing across the rows using .sum and 'axis=0'
flights_by_dest_piv['Total'] = flights_by_dest_piv.sum(axis=1).astype(np.int64)

In [82]:

top_10_dest = flights_by_dest_piv.sort_values(by='Total', ascending=False).iloc[0:10,0:12]
top_10_dest

Out[82]:

Month	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
Dest
ORD	64860	58108	66654	66439	68519	66286	67693	68390	65764	67931	63949	66085
DFW	38769	34937	58998	58056	59562	59201	61925	62389	60610	61807	59015	61795
ATL	57380	51969	51962	50441	52188	52102	55742	58388	58889	62291	59045	60404
LAX	39072	34862	39950	39503	41515	41261	42912	42905	39996	41003	39004	39192
DEN	33867	30862	35191	34093	34796	34488	35836	36688	33778	34459	32398	34396
PHX	32979	29793	34678	33877	34737	33711	34804	34720	32948	34131	32723	33877
STL	30506	27696	31060	30712	31455	31567	32828	33232	31617	32362	29153	29462
SFO	30068	26649	30753	30482	31827	31614	32985	33238	30755	31824	29794	30346
DTW	29098	26382	29715	29650	30752	29824	30642	31366	29435	30746	28619	29245
PIT	29062	26246	29307	28722	29731	28558	29588	29576	28745	29720	28388	28690

In [83]:

df8991['Dest'].value_counts().head(10)

Out[83]:

ORD    790678
DFW    677064
ATL    670801
LAX    481175
DEN    410852
PHX    402978
STL    371650
SFO    370335
DTW    355474
PIT    346333
Name: Dest, dtype: int64

In [84]:

df8991['Origin'].value_counts().head(10)

Out[84]:

ORD    774861
ATL    669767
DFW    662101
LAX    482516
DEN    405662
PHX    400611
SFO    374001
STL    366913
DTW    349453
PIT    343552
Name: Origin, dtype: int64

In [85]:

# Extracting the month number from the dates
flight_origin = df8991['Date'].dt.month

# Assigning the months
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Creating a new dataframe
flight_origin = pd.DataFrame(flight_origin)

# Renaming the column name of the new dataframe
flight_origin.rename(columns = {'Date':'Month'},inplace=True)

# Adding a new column to count flights per month
flight_origin['Flights'] = df8991['Date'].dt.month.count

# Adding a new column to sum origininations per month
flight_origin['Origin'] = df8991['Origin']

# Grouping the data by month then assigning month names to 'month' column
month_origin  = flight_origin.groupby([('Month'),('Origin')]).count()

# Renaming each month number to a real month name
flights_by_origin = month_origin.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})

# Looking at the dataframe we created
flights_by_origin.head()

Out[85]:

		Flights
Month	Origin
Jan	ABE	1301
ABQ	7335
ACY	163
AGS	788
ALB	3268

In [86]:

flights_by_origin_piv = flights_by_origin.reset_index().pivot(columns='Month',index='Origin',values='Flights')
flights_by_origin_piv = flights_by_origin_piv.reindex(columns= ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])

In [87]:

# Filling NaN with 0
flights_by_origin_piv = flights_by_origin_piv.fillna(0)

# Changing from float to np.int64 to remove .0
flights_by_origin_piv = flights_by_origin_piv.astype(np.int64)

# Summing across the rows using .sum and 'axis=0'
flights_by_origin_piv['Total'] = flights_by_origin_piv.sum(axis=1).astype(np.int64)

In [88]:

top_10_origin = flights_by_origin_piv.sort_values(by='Total', ascending=False).iloc[0:10,0:12]
top_10_origin

Out[88]:

Month	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
Origin
ORD	63683	57104	65155	65023	67217	64757	66369	66467	64726	66832	62869	64659
ATL	57224	51831	51889	50462	52186	51917	55763	58313	58726	62141	58936	60379
DFW	37532	34338	57383	56859	58287	57753	60299	60768	59686	60774	57928	60494
LAX	39260	34981	40132	39607	41546	41426	42986	43020	40053	41105	39040	39360
DEN	33559	30592	34759	33763	34374	33898	35128	36124	33433	34119	31922	33991
PHX	32876	29661	34464	33563	34596	33422	34562	34440	32777	33942	32553	33755
SFO	30341	27027	31162	30702	32088	31962	33370	33629	30964	32088	30019	30649
STL	30176	27306	30606	30396	31023	31093	32449	32837	31197	31883	28845	29102
DTW	28584	25911	29229	29155	30192	29321	30138	30906	28979	30265	28131	28642
PIT	28787	26041	29022	28528	29542	28366	29325	29314	28534	29451	28215	28427

In [89]:

import seaborn as sns
fig, (ax1, ax2, axcb) = plt.subplots(1,3, figsize=(15,6), gridspec_kw={'width_ratios':[1.05,1,0.08]})

# Plot
fig.suptitle('Top 10 Airports by Flights', fontsize=15)
ax1.title.set_text('Origin Airports')
ax2.title.set_text('Destination Airports')
sns.heatmap(top_10_origin, cmap='Purples', linewidths=5, cbar=None, ax=ax1)
sns.heatmap(top_10_dest, cmap='Purples', linewidths=5, cbar_ax=axcb, ax=ax2)

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/top_10_airports_by_flights.png')
plt.show()

On the left shows the top 10 airports by flights departed and on the right are the destinations by flights arrived. ORD – Chicago O’Hare International has the most flights based on departure and arrivals which suggests it’s a hub for flights across America. ATL – Atlanta International has the 2nd most flights based on departure but 3rd based on arrival. DFW – Dallas-Fort Worth International is 3rd in departures but 2nd in arrivals, the opposite of ATL. This means flights between those two airports are common. We could delve deeper into that if we need to later.In [90]:

# Sum dataframes
top_10_airports = top_10_origin.rename_axis(index='Airport')+top_10_dest.rename_axis(index='Airport')
top_10_airports['Total'] = top_10_airports.sum(axis=1).astype(np.int64)

# Sorting values
top_10_ap_total = top_10_airports['Total']
top_10_airports = top_10_airports.sort_values(by='Total', ascending=False).iloc[0:10,0:12]

In [91]:

# Top 10 airports by total flights
top_10_ap_total.sort_values(ascending=False)

Out[91]:

Airport
ORD    1565539
ATL    1340568
DFW    1339165
LAX     963691
DEN     816514
PHX     803589
SFO     744336
STL     738563
DTW     704927
PIT     689885
Name: Total, dtype: int64

In [92]:

# Plot
fig, ax = plt.subplots(figsize=(14.70, 8.27))
sns.heatmap(top_10_airports.iloc[0:10,0:12], cmap='Blues', linewidths=5, cbar='Blues')

# Labels
ax.set_title('Top 10 Airports by Total Flights\n', fontsize=14, weight='bold')
plt.xlabel('Month'.title(),
               fontsize = 10, weight = "bold")
plt.ylabel('Airport'.title(),
               fontsize = 10, weight = "bold")

# Show
#plt.tight_layout(pad=0)
plt.savefig('images/charts/top_10_airports_by_total_flights.png')
plt.show()

Observation 10: This plot shows us the top 10 airports by total inbound and outbound flights. The chart is topped by Chicago O’Hare International in 1st place with 1,565,539 flights. 2nd place is William B Hartsfield-Atlanta International with 1,340,568, and close 3rd is Dallas-Fort Worth International with 1,339,165. During the summer Chicago O’Hare International is very busy with strong demand throughout the year. Atlanta International follows a similar trend but Dallas-Fort Worth International show a weak January and Feburary in terms of total flights.In [93]:

# Dataframe without index
top_10_origin_idx = top_10_origin.reset_index().iloc[:,0:1]

# Renaming column to prepare for df merge
top_10_origin_idx.rename(columns={"Origin":"iata"}, inplace=True)

# Merging dataframes to get the airport description
top_10_origin_desc = pd.merge(left=top_10_origin_idx, right=airports, how='left', on='iata')

# Show airport data
top_10_origin_desc.iloc[:,:-2]

Out[93]:

	iata	airport	city	state	country
0	ORD	Chicago O’Hare International	Chicago	IL	USA
1	ATL	William B Hartsfield-Atlanta Intl	Atlanta	GA	USA
2	DFW	Dallas-Fort Worth International	Dallas-Fort Worth	TX	USA
3	LAX	Los Angeles International	Los Angeles	CA	USA
4	DEN	Denver Intl	Denver	CO	USA
5	PHX	Phoenix Sky Harbor International	Phoenix	AZ	USA
6	SFO	San Francisco International	San Francisco	CA	USA
7	STL	Lambert-St Louis International	St Louis	MO	USA
8	DTW	Detroit Metropolitan-Wayne County	Detroit	MI	USA
9	PIT	Pittsburgh International	Pittsburgh	PA	USA

In [94]:

top_10_origin_desc['long'].describe()

Out[94]:

count     10.000000
mean     -98.076828
std       15.373173
min     -122.374843
25%     -110.172792
50%      -93.698595
75%      -85.296324
max      -80.232871
Name: long, dtype: float64

In [95]:

top_10_ap_total.sort_values(ascending=False)

Out[95]:

Airport
ORD    1565539
ATL    1340568
DFW    1339165
LAX     963691
DEN     816514
PHX     803589
SFO     744336
STL     738563
DTW     704927
PIT     689885
Name: Total, dtype: int64

4.2. Airlines by Month vs Average Distance

In [96]:

# Creating a Unique Carrier dataframe
UC1 = df8991['UniqueCarrier'].groupby(df8991['Date'].map(lambda x: x.month)).nunique().dropna()
UC2 = round(df8991['Distance'].groupby(df8991['Date'].map(lambda x: x.month)).mean(), 2)

# Merge and reindex dataframes
carriers_month_v_distance = pd.merge(left=UC1, right=UC2, how='left', on='Date')
carriers_month_v_distance = carriers_month_v_distance.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                                                                    7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})
# Rename Columns
carriers_month_v_distance.rename_axis("Month", axis="index", inplace=True)
carriers_month_v_distance.rename(columns = {'UniqueCarrier':'CarrierCount'},inplace=True)

# Saving as a new dataframe
carriers_month_v_distance = pd.DataFrame(carriers_month_v_distance)

# Show
carriers_month_v_distance

Out[96]:

	CarrierCount	Distance
Month
Jan	14	619.77
Feb	14	623.41
Mar	14	630.20
Apr	14	628.60
May	14	629.13
Jun	14	635.24
Jul	14	640.34
Aug	14	639.94
Sep	13	635.37
Oct	13	634.70
Nov	12	636.83
Dec	12	640.03

In [97]:

# Creating a combo chart
fig, ax1 = plt.subplots(figsize=(10,6))
color = 'tab:green'

# Creating Bar Plot
ax1.set_title('Airlines by Month vs Average Distance', fontsize=15)
ax1 = sns.barplot(x='Month', y='CarrierCount', data=carriers_month_v_distance.reset_index(), palette='summer')
ax1.set_xlabel('Month', fontsize=13)
ax1.set_ylabel('Carriers', fontsize=13, color=color, weight='bold')
ax1.tick_params(axis='y')

# Sharing the same x-axis
ax2 = ax1.twinx()
color = 'tab:red'

# Creating Line Plot
ax2 = sns.lineplot(x='Month', y='Distance', data=carriers_month_v_distance.reset_index(), 
                   sort=False, color=color, linewidth = 2, marker='o')
ax2.set_ylabel('\nAverage Distance (Mi)', fontsize=13, color=color, weight='bold')
ax2.tick_params(axis='y', color=color)

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/airlines_by_month_vs_distance.png')
plt.show()

Observation 11: Above is a combination chart showing us the amount of carriers active each month along with the average distance flown. On average, we can count around 12-14 carriers on any given month but the chart shows us from September the amount of carriers reduces into December suggesting reduced holidays. The opposite seems true when we look at average distance travelled by month. This number is increaing from 620 miles on average to 640 miles suggesting fewer longer flights. This correlation suggests perhaps passengers are flying less often but further in the winter than the summer. This could mean going from a cold country to a hotter country or going to see family for the festive season in December.

4.3. Average Flight Delay vs Distance

In [98]:

# Creating a dataframe
DD1 = df8991['DepDelay'].groupby(df8991['Date'].map(lambda x: x.day)).mean()
DD2 = df8991['ArrDelay'].groupby(df8991['Date'].map(lambda x: x.day)).mean()
DD3 = df8991['Distance'].groupby(df8991['Date'].map(lambda x: x.day)).mean()

# Merge and reindex dataframes
delay_v_distance = pd.merge(left=DD1, right=DD2, how='left', on='Date')
delay_v_distance = pd.merge(left=delay_v_distance, right=DD3, how='left', on='Date')

# Saving as a new dataframe
delay_v_distance = pd.DataFrame(delay_v_distance)

# Show
delay_v_distance

Out[98]:

	DepDelay	ArrDelay	Distance
Date
1	6.756960	6.279760	633.227517
2	7.148581	7.067214	633.218274
3	7.722741	7.702845	633.915910
4	6.065070	5.623313	633.371921
5	6.034198	5.818042	632.001161
6	6.525807	6.330133	631.732588
7	6.092675	5.939208	632.502854
8	6.242835	6.140405	632.523106
9	6.557927	6.414992	632.940562
10	6.341829	6.056148	632.540616
11	6.260230	5.892138	631.703551
12	5.819695	5.567792	630.539311
13	5.723200	5.511132	631.097914
14	6.827372	6.880148	632.665304
15	7.680069	7.815735	632.968215
16	7.398344	7.318719	633.384753
17	6.888403	6.965868	632.278565
18	6.953599	7.174428	632.314703
19	7.576420	7.859780	631.405105
20	8.048997	8.234841	632.202658
21	8.192566	8.112351	633.127681
22	8.683596	8.583389	633.981356
23	7.389391	6.976880	634.602349
24	5.878058	5.063008	634.855267
25	5.817795	5.020933	633.658841
26	6.168475	5.591412	631.978279
27	6.972767	6.594828	632.447910
28	7.683362	7.301313	634.785688
29	6.662649	6.500185	635.010834
30	6.456550	6.135588	634.696271
31	7.340170	6.895714	635.679141

In [99]:

# Delay with distance correlation
delay_v_distance.corr()

Out[99]:

	DepDelay	ArrDelay	Distance
DepDelay	1.000000	0.969679	0.305120
ArrDelay	0.969679	1.000000	0.149177
Distance	0.305120	0.149177	1.000000

In [100]:

delay_v_distance['TotalDelay'] = delay_v_distance['DepDelay']+delay_v_distance['ArrDelay']
delay_v_distance.corr()

Out[100]:

	DepDelay	ArrDelay	Distance	TotalDelay
DepDelay	1.000000	0.969679	0.305120	0.990868
ArrDelay	0.969679	1.000000	0.149177	0.993775
Distance	0.305120	0.149177	1.000000	0.221398
TotalDelay	0.990868	0.993775	0.221398	1.000000

In [101]:

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6), gridspec_kw={'width_ratios':[1,1]})

# Plot
fig.suptitle('Average Flight Delay vs Distance', fontsize=15)
sb.regplot(data = delay_v_distance, x = 'DepDelay', y = 'Distance', ax=ax1)
sb.regplot(data = delay_v_distance, x = 'ArrDelay', y = 'Distance', ax=ax2)

# Labels
ax1.title.set_text('')
ax1.set_xlabel('Departure Delay (Min)')
ax1.set_ylabel('Distance (Mi)')

ax2.title.set_text('')
ax2.set_xlabel('Arrival Delay (Min)')
ax2.set_ylabel('')

# Show
#plt.tight_layout(pad=3)
plt.subplots_adjust(top=0.9)
plt.savefig('images/charts/delay_vs_distance.png')
plt.show()

Observation 12: We used a scatterplot to show the average flight delay vs distance. Please keep in mind that we are using a daily average here to plot this data because millions of data points would not be clear. There doesn’t seem to be a clear correlation between the distance and departure or arrival delay but we can clearly see one when we add a line of best fit. We can clearly observe a positive correlation in both plots indicating as the flight distance increases the greater chance of delay increases too. We can move on to delays by airline for our univariate exploration section, which airline is the most punctual and maybe we can look at delay on various flight routes?

Step 5: Univariate Exploration ?

5.1. Routes by Flights, Airline and Average Delay

In [102]:

# Dataframe from columns
flight_route = df8991[['Origin','Dest']]

# Creating a new dataframe
flight_route = pd.DataFrame(flight_route)

# Creating columns
flight_route['Route'] = flight_route['Origin'].str.cat(flight_route['Dest'],sep=" - ")
flight_route['AverageDelay'] = df8991['DepDelay']+df8991['ArrDelay']
flight_route['Carrier'] = df8991['UniqueCarrier']
flight_route.head()

Out[102]:

	Origin	Dest	Route	AverageDelay	Carrier
0	SFO	HNL	SFO – HNL	219	UA
1	SFO	HNL	SFO – HNL	45	UA
2	SFO	HNL	SFO – HNL	-19	UA
3	SFO	HNL	SFO – HNL	-29	UA
4	SFO	HNL	SFO – HNL	-37	UA

In [103]:

flight_route.groupby(['Carrier', 'Route'])['AverageDelay'].mean().head()

Out[103]:

Carrier  Route    
AA       ABE - ALB    33.00000
         ABE - BDL     1.00000
         ABE - MDT    16.00000
         ABE - ORD     6.97619
         ABQ - DEN    52.00000
Name: AverageDelay, dtype: float64

In [104]:

# Top carriers by average delay 
flight_route.groupby('Carrier')['AverageDelay'].mean().sort_values(ascending=False)

Out[104]:

Carrier
PI        25.816885
UA        16.953211
DL        15.099069
EA        14.285615
WN        14.058281
US        13.584921
TW        13.468342
CO        12.723660
HP        11.795458
AA        11.353261
ML (1)    10.940089
AS        10.922795
PA (1)     9.541804
NW         8.327834
Name: AverageDelay, dtype: float64

In [105]:

# Top routes by average delay 
flight_route.groupby('Route')['AverageDelay'].mean().sort_values(ascending=False).iloc[0:10]

Out[105]:

Route
GUC - HDN    1980.000000
ICT - PDX     446.000000
BHM - LIT     407.000000
AMA - SLC     351.000000
MCI - ELP     271.000000
CLE - DAY     231.000000
SGF - TUL     220.333333
DCA - PHL     211.500000
TUL - MSP     204.000000
ORD - SUX     174.000000
Name: AverageDelay, dtype: float64

In [106]:

# Top carriers by total routes
flight_route.groupby('Carrier')['Route'].count().sort_values(ascending=False)

Out[106]:

Carrier
US        2575160
DL        2430469
AA        2084513
UA        1774850
NW        1352989
CO        1241287
WN         950837
TW         756992
HP         623477
EA         398362
PI         281905
AS         267975
PA (1)     223079
ML (1)      69119
Name: Route, dtype: int64

In [107]:

# Top routes by flights
flight_route['Route'].value_counts().iloc[0:10]

Out[107]:

SFO - LAX    69180
LAX - SFO    68754
LAX - PHX    39321
PHX - LAX    38756
LAX - LAS    31549
PHX - LAS    30778
LAS - LAX    30296
LAS - PHX    29775
LGA - ORD    29656
ORD - MSP    29308
Name: Route, dtype: int64

In [108]:

# Creating a new dataframe
top_10_flight_route = pd.DataFrame(flight_route['Route'].value_counts().iloc[0:10])
top_10_flight_route

Out[108]:

	Route
SFO – LAX	69180
LAX – SFO	68754
LAX – PHX	39321
PHX – LAX	38756
LAX – LAS	31549
PHX – LAS	30778
LAS – LAX	30296
LAS – PHX	29775
LGA – ORD	29656
ORD – MSP	29308

In [109]:

# Creating list
delay_list = []
item0 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[0]].mean())
item1 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[1]].mean())
item2 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[2]].mean())
item3 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[3]].mean())
item4 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[4]].mean())
item5 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[5]].mean())
item6 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[6]].mean())
item7 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[7]].mean())
item8 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[8]].mean())
item9 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[9]].mean())

# Creating a new dataframe
delay_list.append(item0+item1+item2+item3+item4+item5+item6+item7+item8+item9)
top_10_flight_delay = pd.DataFrame(delay_list)
top_10_flight_delay = pd.melt(top_10_flight_delay).drop(['variable'], axis=1).rename(columns={"value":"AverageDelay"})
top_10_flight_delay.index = top_10_flight_route.index
top_10_flight_delay

Out[109]:

	AverageDelay
SFO – LAX	14.286889
LAX – SFO	18.972918
LAX – PHX	15.574909
PHX – LAX	16.178011
LAX – LAS	11.177407
PHX – LAS	15.854929
LAS – LAX	11.442567
LAS – PHX	14.501092
LGA – ORD	14.211762
ORD – MSP	14.661662

In [110]:

top_10_flight_route = pd.concat([top_10_flight_route, top_10_flight_delay], axis=1).reindex(top_10_flight_route.index)
top_10_flight_route.rename(columns={"Route":"Flights"}, inplace=True)
top_10_flight_route['AverageDelay'] = round(top_10_flight_route['AverageDelay'], 2)
top_10_flight_route.index.name = 'Route'

In [111]:

top_10_flight_route

Out[111]:

	Flights	AverageDelay
Route
SFO – LAX	69180	14.29
LAX – SFO	68754	18.97
LAX – PHX	39321	15.57
PHX – LAX	38756	16.18
LAX – LAS	31549	11.18
PHX – LAS	30778	15.85
LAS – LAX	30296	11.44
LAS – PHX	29775	14.50
LGA – ORD	29656	14.21
ORD – MSP	29308	14.66

In [112]:

# Creating a dataframe for average delay by route and carrier
carrier_route_delay = pd.DataFrame(flight_route.groupby(['Carrier', 'Route'])['AverageDelay'].mean())
carrier_route_delay.head()

Out[112]:

		AverageDelay
Carrier	Route
AA	ABE – ALB	33.00000
ABE – BDL	1.00000
ABE – MDT	16.00000
ABE – ORD	6.97619
ABQ – DEN	52.00000

In [113]:

carrier_route_delay.filter(like='SFO - LAX', axis=0)

Out[113]:

		AverageDelay
Carrier	Route
AA	SFO – LAX	6.917208
AS	SFO – LAX	20.866279
CO	SFO – LAX	12.990581
DL	SFO – LAX	13.237706
NW	SFO – LAX	20.851767
PA (1)	SFO – LAX	8.613192
TW	SFO – LAX	13.563017
UA	SFO – LAX	17.250478
US	SFO – LAX	17.044848

In [114]:

# Get index names
carrier_route_delay.index.names

Out[114]:

FrozenList(['Carrier', 'Route'])

In [115]:

list(carrier_route_delay.index[0])

Out[115]:

['AA', 'ABE - ALB']

In [116]:

carrier_route_delay['RouteIndex'] = carrier_route_delay.reset_index(level=0, drop=True).index #.reset_index(level=0, drop=True) #.str.extractall(r"(A-Z)").index
carrier_route_delay['RouteIndex'] = carrier_route_delay['RouteIndex']
carrier_route_delay.head()

Out[116]:

		AverageDelay	RouteIndex
Carrier	Route
AA	ABE – ALB	33.00000	ABE – ALB
ABE – BDL	1.00000	ABE – BDL
ABE – MDT	16.00000	ABE – MDT
ABE – ORD	6.97619	ABE – ORD
ABQ – DEN	52.00000	ABQ – DEN

In [117]:

carrier_route_delay['RouteIndex'].iloc[0]

Out[117]:

'ABE - ALB'

In [118]:

carrier_route_delay.filter(like="ABE - ALB", axis=0)

Out[118]:

		AverageDelay	RouteIndex
Carrier	Route
AA	ABE – ALB	33.0	ABE – ALB

In [119]:

carrier_route_delay.filter(like=top_10_flight_route.index[1], axis=0)

Out[119]:

		AverageDelay	RouteIndex
Carrier	Route
AA	LAX – SFO	8.280271	LAX – SFO
AS	LAX – SFO	22.179775	LAX – SFO
CO	LAX – SFO	19.412461	LAX – SFO
DL	LAX – SFO	21.431304	LAX – SFO
NW	LAX – SFO	19.772613	LAX – SFO
PA (1)	LAX – SFO	21.444831	LAX – SFO
TW	LAX – SFO	32.326835	LAX – SFO
UA	LAX – SFO	20.176455	LAX – SFO
US	LAX – SFO	20.680116	LAX – SFO

In [120]:

carrier_route_delay.drop(['RouteIndex'], axis=1, inplace=True)

In [121]:

# Merge index 0-1
item0 = carrier_route_delay.filter(like=top_10_flight_route.index[0], axis=0)
item1 = carrier_route_delay.filter(like=top_10_flight_route.index[1], axis=0)
item0_1 = item0.merge(item1, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 2-3
item2 = carrier_route_delay.filter(like=top_10_flight_route.index[2], axis=0)
item3 = carrier_route_delay.filter(like=top_10_flight_route.index[3], axis=0)
item2_3 = item2.merge(item3, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_3 = item0_1.merge(item2_3, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 4-5
item4 = carrier_route_delay.filter(like=top_10_flight_route.index[4], axis=0)
item5 = carrier_route_delay.filter(like=top_10_flight_route.index[5], axis=0)
item4_5 = item4.merge(item5, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_5 = merge0_3.merge(item4_5, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 6-7
item6 = carrier_route_delay.filter(like=top_10_flight_route.index[6], axis=0)
item7 = carrier_route_delay.filter(like=top_10_flight_route.index[7], axis=0)
item6_7 = item5.merge(item6, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_7 = merge0_5.merge(item6_7, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 8-9
item8 = carrier_route_delay.filter(like=top_10_flight_route.index[8], axis=0)
item9 = carrier_route_delay.filter(like=top_10_flight_route.index[9], axis=0)
item8_9 = item8.merge(item9, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_9 = merge0_7.merge(item8_9, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Print dataframe
carrier_ra = merge0_9
carrier_ra

Out[121]:

		AverageDelay
Carrier	Route
AA	LAS – LAX	-0.577778
LAX – LAS	1.730290
LAX – SFO	8.280271
LGA – ORD	14.982707
ORD – MSP	19.809837
SFO – LAX	6.917208
AS	LAX – SFO	22.179775
SFO – LAX	20.866279
CO	LAS – LAX	6.295775
LAX – LAS	2.410959
LAX – SFO	19.412461
SFO – LAX	12.990581
DL	LAS – LAX	11.196488
LAX – LAS	13.751666
LAX – PHX	16.196706
LAX – SFO	21.431304
PHX – LAS	24.553367
PHX – LAX	13.126173
SFO – LAX	13.237706
HP	LAS – LAX	13.548347
LAX – LAS	8.688118
LAX – PHX	13.020739
PHX – LAS	15.177153
PHX – LAX	15.904401
ML (1)	LAS – LAX	118.200000
LAX – LAS	54.857143
PHX – LAS	30.956522
NW	LAX – SFO	19.772613
ORD – MSP	11.556449
SFO – LAX	20.851767
PA (1)	LAX – SFO	21.444831
SFO – LAX	8.613192
TW	LAS – LAX	17.000000
LAX – LAS	4.000000
LAX – PHX	5.272727
LAX – SFO	32.326835
ORD – MSP	7.494118
PHX – LAS	27.221639
PHX – LAX	76.833333
SFO – LAX	13.563017
UA	LAS – LAX	36.938462
LAX – LAS	17.198347
LAX – SFO	20.176455
LGA – ORD	13.425165
ORD – MSP	16.037805
PHX – LAX	16.000000
SFO – LAX	17.250478
US	LAS – LAX	9.396913
LAX – LAS	13.447669
LAX – SFO	20.680116
PHX – LAS	0.968202
SFO – LAX	17.044848
WN	LAS – LAX	11.451827
LAX – LAS	1.762376
LAX – PHX	17.719184
PHX – LAS	15.269559
PHX – LAX	17.112161

In [122]:

# Rename column for merge
carrier_ra_idx = carrier_ra.reset_index().rename(columns={"Carrier":"code"})

# Merge to get carrier names
carrier_ra_idx = pd.merge(left=carrier_ra_idx, right=carriers, how='left', on='code')
carrier_ra_idx['description'] = carrier_ra_idx['description'].str.replace(r"\(.*\)","")
carrier_ra_idx.tail()

Out[122]:

	code	Route	AverageDelay	description
52	WN	LAS – LAX	11.451827	Southwest Airlines Co.
53	WN	LAX – LAS	1.762376	Southwest Airlines Co.
54	WN	LAX – PHX	17.719184	Southwest Airlines Co.
55	WN	PHX – LAS	15.269559	Southwest Airlines Co.
56	WN	PHX – LAX	17.112161	Southwest Airlines Co.

In [123]:

list(carrier_ra_idx.reset_index()['description'].unique())

Out[123]:

['American Airlines Inc.',
 'Alaska Airlines Inc.',
 'Continental Air Lines Inc.',
 'Delta Air Lines Inc.',
 'America West Airlines Inc. ',
 'Midway Airlines Inc. ',
 'Northwest Airlines Inc.',
 'Pan American World Airways ',
 'Trans World Airways LLC',
 'United Air Lines Inc.',
 'US Airways Inc. ',
 'Southwest Airlines Co.']

In [124]:

# Colors
colors = ["#f94144","#f3722c","#f8961e","#f9844a","#f9c74f","#90be6d","#43aa8b","#4d908e","#577590","#277da1","#284b63","#463f3a"] # 12 Airlines

# Plot
carrier_ra.unstack(level=0).plot(kind='bar', figsize=(14.70, 8.27), color=colors, subplots=False)

# Title
plt.title('Average Airline Delay on Popular Routes', fontsize=15, weight='bold')

# Label
plt.xlabel('Route', weight='bold')
plt.ylabel('Delay (Min)', weight='bold')

# Legend
carrier_list = list(carrier_ra.reset_index()['Carrier'].unique())
long_carrier_list = list(carrier_ra_idx.reset_index()['description'].unique())
plt.legend(long_carrier_list, loc='0', bbox_to_anchor=(1.01, 1.01))

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/routes_by_flights_airline_average_delay.png')
plt.show()

In [125]:

# Plot
g = sns.catplot(data=carrier_ra_idx, kind='swarm', x='AverageDelay', y='Route', hue='description', height=10, s=9)
g._legend.set_title("Carriers")

# Label
plt.xlabel('Average Delay')

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/routes_by_flights_airline_average_delay_2.png')
plt.show()

Observation 13: Which airline is the most punctual and which is the least? And does flight route have an affect on delay times per carrier? LAS – LAX shows us many interesting insights. For example, American Airlines has the fewest delays on this route with an average delay of just -0.57 minutes, this means they are actually early. Midway Airlines Inc. has the most delays totalling an average of 118.2 minutes on this route which is astonishing. Things don’t fair too well for Midway on the return leg with an average delay of 54.8 minutes. For PHX – LAX, Trans World Airways LLC tops the charts with an average of 76.8 minutes and Delta Air Lines Inc. has the fewest delays with just 13.1 minutes for delays on average. For our final conclusion, we can say the average delay time is just 18.36 minutes if all carriers and routes are considered which is pretty good. The outlier here is the maximum delay of 118 minutes which is 644% of the delay average of 18.36 minutes. Let’s see which airline is the most punctual overall, can it still be Delta Air Lines Inc?In [126]:

# Filter route data
carrier_ra.filter(like='PHX - LAX', axis=0).sort_values(by='AverageDelay', ascending=False)

Out[126]:

		AverageDelay
Carrier	Route
TW	PHX – LAX	76.833333
WN	PHX – LAX	17.112161
UA	PHX – LAX	16.000000
HP	PHX – LAX	15.904401
DL	PHX – LAX	13.126173

In [127]:

carrier_ra.describe()

Out[127]:

	AverageDelay
count	57.000000
mean	18.367970
std	18.243396
min	-0.577778
25%	11.196488
50%	15.269559
75%	20.176455
max	118.200000

In [128]:

# Difference between max delay and mean delay in %
carrier_ra.max()/carrier_ra.mean()*100

Out[128]:

AverageDelay    643.511507
dtype: float64

5.2. Top 10 Airlines by Average Delay

In [129]:

# Dataframe from columns
top_10_carriers_by_delay = df8991['DepDelay']+df8991['ArrDelay']

# Creating a new dataframe
top_10_carriers_by_delay = pd.DataFrame(top_10_carriers_by_delay)

# Creating columns
top_10_carriers_by_delay['Carrier'] = df8991['UniqueCarrier']
top_10_carriers_by_delay.columns = ['AverageDelay','Carrier']
top_10_carriers_by_delay['DepDelay'] = df8991['DepDelay']
top_10_carriers_by_delay['ArrDelay'] = df8991['ArrDelay']

# Rename column for merge
top_10_carriers_by_delay = top_10_carriers_by_delay.rename(columns={"Carrier":"code"})

# Merge to get carrier names
top_10_carriers_by_delay = pd.merge(left=top_10_carriers_by_delay, right=carriers, how='left', on='code')
top_10_carriers_by_delay['Carrier'] = top_10_carriers_by_delay['description'].str.replace(r"\(.*\)","")
top_10_carriers_by_delay = top_10_carriers_by_delay.groupby(by='Carrier').mean().sort_values(by='AverageDelay', ascending=True)
top_10_carriers_by_delay

Out[129]:

	AverageDelay	DepDelay	ArrDelay
Carrier
Northwest Airlines Inc.	8.327834	4.487461	3.840373
Pan American World Airways	9.541804	5.263768	4.278036
Alaska Airlines Inc.	10.922795	4.882948	6.039847
Midway Airlines Inc.	10.940089	6.192480	4.747609
American Airlines Inc.	11.353261	5.929751	5.423510
America West Airlines Inc.	11.795458	6.301721	5.493738
Continental Air Lines Inc.	12.723660	6.735197	5.988463
Trans World Airways LLC	13.468342	6.699408	6.768934
US Airways Inc.	13.584921	6.825033	6.759888
Southwest Airlines Co.	14.058281	7.946357	6.111925
Eastern Air Lines Inc.	14.285615	8.222712	6.062903
Delta Air Lines Inc.	15.099069	6.969973	8.129096
United Air Lines Inc.	16.953211	8.579475	8.373736
Piedmont Aviation Inc.	25.816885	11.974005	13.842880

In [130]:

# Plot
top_10_carriers_by_delay.plot(kind='barh', figsize=(14.70, 8.27)) 

# Title
plt.title('Airline Carriers by Average Delay', fontsize=14, weight='bold')

# Label
plt.xlabel('Delay in Minutes (Lower is better)', fontsize=10, weight='bold')
plt.ylabel('Carrier', fontsize=10, weight='bold')

plt.legend(['Average Delay', 'Departure Delay', 'Arrival Delay'])

# Plot
plt.savefig('images/charts/airlines_by_average_delays.png')
plt.show()

Observation 14: Which airline is the most punctual? Northwest Airlines has the lowest average delay on flights for departure and arrival times combined. Piedmont Aviation is the least punctual carrier with almost 26 minutes average delay times with a 52% increased average delay time vs Northwest Airlines. Another trend we can notice is that departure delays are 5% higher than arrival delays on average suggesting various factors cause this like aircraft prepeation and passengers. Unfortunately Delta Air Lines Inc. is 3rd from bottom in terms of punctuality despite performing well in the most popular routes observed earlier.In [131]:

# Departure delays vs Arrival delays % difference
top_10_carriers_by_delay['DepDelay'].mean()/top_10_carriers_by_delay['ArrDelay'].mean()-1

Out[131]:

0.05605593713455348

In [132]:

# Percent change of delay between top 10 airlines
top_10_carriers_by_delay.pct_change()

Out[132]:

	AverageDelay	DepDelay	ArrDelay
Carrier
Northwest Airlines Inc.	NaN	NaN	NaN
Pan American World Airways	0.145773	0.172995	0.113964
Alaska Airlines Inc.	0.144731	-0.072347	0.411827
Midway Airlines Inc.	0.001583	0.268185	-0.213952
American Airlines Inc.	0.037767	-0.042427	0.142367
America West Airlines Inc.	0.038949	0.062729	0.012949
Continental Air Lines Inc.	0.078691	0.068787	0.090053
Trans World Airways LLC	0.058527	-0.005314	0.130329
US Airways Inc.	0.008656	0.018752	-0.001336
Southwest Airlines Co.	0.034845	0.164296	-0.095854
Eastern Air Lines Inc.	0.016171	0.034778	-0.008021
Delta Air Lines Inc.	0.056942	-0.152351	0.340793
United Air Lines Inc.	0.122798	0.230919	0.030094
Piedmont Aviation Inc.	0.522832	0.395657	0.653131

Step 6: Random Exploration ?

6.1. Plane Data

We can round off our exploration by looking at the plane_data table. There we can find data about plane types, manufacturers, models, engine types and issue dates.In [133]:

plane_data.head()

Out[133]:

	tailnum	type	manufacturer	issue_date	model	status	aircraft_type	engine_type	year
0	N10156	Corporation	EMBRAER	02/13/2004	EMB-145XR	Valid	Fixed Wing Multi-Engine	Turbo-Fan	2004
1	N102UW	Corporation	AIRBUS INDUSTRIE	05/26/1999	A320-214	Valid	Fixed Wing Multi-Engine	Turbo-Fan	1998
2	N10323	Corporation	BOEING	07/01/1997	737-3TO	Valid	Fixed Wing Multi-Engine	Turbo-Jet	1986
3	N103US	Corporation	AIRBUS INDUSTRIE	06/18/1999	A320-214	Valid	Fixed Wing Multi-Engine	Turbo-Fan	1999
4	N104UA	Corporation	BOEING	01/26/1998	747-422	Valid	Fixed Wing Multi-Engine	Turbo-Fan	1998

In [134]:

# Remove entries titled 'None' from year column
plane_data = plane_data[~plane_data['year'].isin(['None'])]

# Changing 'year' datatype from string to integer
plane_data['year'] = plane_data['year'].astype('int')

# Filtering years 1989 to 1991 and assigning to new df
plane_data_8991 = plane_data[(plane_data['year'] >= 1989) & (plane_data['year'] <= 1991)]

In [135]:

# Most popular aircraft manufacturers between 89-91
plane_data_8991['manufacturer'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Aircraft Manufacturers from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Aircraft', fontsize=13)

# Plot
plt.savefig('images/charts/top_manufacturers.png')
plt.show()

Which aircraft manufacturer is the most popular, and what is the market distribution like? Boeing is the most popular aircraft manufacturer by far which is no surprise. McDonnell Douglas takes 2nd and 3rd spot. Perhaps we can see which models are the most popular?In [136]:

# Top manufacturers
plane_data_8991['manufacturer'].value_counts().iloc[0:10].sort_values(ascending=False)

Out[136]:

BOEING                           267
MCDONNELL DOUGLAS AIRCRAFT CO     64
MCDONNELL DOUGLAS                 46
AIRBUS INDUSTRIE                  31
SAAB-SCANIA                        2
MCDONNELL DOUGLAS CORPORATION      1
Name: manufacturer, dtype: int64

In [137]:

# Market share
plane_data_8991['manufacturer'].value_counts(normalize=True)

Out[137]:

BOEING                           0.649635
MCDONNELL DOUGLAS AIRCRAFT CO    0.155718
MCDONNELL DOUGLAS                0.111922
AIRBUS INDUSTRIE                 0.075426
SAAB-SCANIA                      0.004866
MCDONNELL DOUGLAS CORPORATION    0.002433
Name: manufacturer, dtype: float64

In [138]:

# Most popular aircraft maodels between 89-91
plane_data_8991['model'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Aircraft Models from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Aircraft', fontsize=13)

# Plot
plt.savefig('images/charts/top_models.png')
plt.show()

We know Boeing is the most popular aircraft manufacturer but what aircraft model is the most popular? MD-88 has the most popular aircraft model which is interesting since Boeing dominates the market. Boeing still has 7 aircraft in the top 10 though. What about engine types?In [139]:

# Top 10 aircraft from 89-91
plane_data_8991['model'].value_counts().iloc[0:10].sort_values(ascending=False)

Out[139]:

MD-88             65
757-222           46
DC-9-82(MD-82)    35
737-4B7           27
757-232           27
757-223           27
737-322           26
737-5H4           18
767-332           17
A320-211          16
Name: model, dtype: int64

In [140]:

# Percent change between values
plane_data_8991['model'].value_counts().iloc[0:10].sort_values(ascending=False).pct_change()

Out[140]:

MD-88                  NaN
757-222          -0.292308
DC-9-82(MD-82)   -0.239130
737-4B7          -0.228571
757-232           0.000000
757-223           0.000000
737-322          -0.037037
737-5H4          -0.307692
767-332          -0.055556
A320-211         -0.058824
Name: model, dtype: float64

In [141]:

# Most popular engine types between 89-91
plane_data_8991['engine_type'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Engine Types from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Engines', fontsize=13)

# Plot
plt.savefig('images/charts/top_engine_types.png')
plt.show()

What is the most popular engine type? The most common engine type is Turbo-Fan with 234 records, followed by Turbo-Jet with 175 then Turbo-Prop with just 2 records.

Observation: Boeing is by far the most popular aircraft manufacturer between 1989 and 1991 with 267 aircraft totalling 65% of the market. The top 6 is rounded out by McDonnel Douglas and Airbus Industrie while SAAB-Scania and MD Corporation rank 5th and 6th. Interestingly, despite Boeing’s dominance the most popular aircraft is a McDonnel Douglas MD-88 with 65 aircraft recorded over our 3 year sample. The rest of the top 10 is dominated by 7 Boeing aircraft, 3rd spot is filled by a McDonnel Douglas DC-9 and 10th is taken by an Airbus A320-211. Finally on engine types, the most common engine type is Turbo-Fan with 234 records, followed by Turbo-Jet with 175 then Turbo-Prop with just 2 records.In [142]:

plane_data_8991['engine_type'].value_counts().iloc[0:10].sort_values(ascending=False)

Out[142]:

Turbo-Fan     234
Turbo-Jet     175
Turbo-Prop      2
Name: engine_type, dtype: int64

Step 7: Conclusion ?

In [143]:

Image.open('images/concorde.jpg')

Out[143]:

7.1. Interesting Findings

From our data, we can assess passengers fly more during the summer and least during the winter and more during midweek versus the weekend. The most popular time to travel for passengers is in the morning around 8:00. Most flights average a 110 minute flight time and an average distance is 632 miles. Interestingly, the average distance for flights increases during the winter months suggesting that passengers commute further to go on distant holidays or visit family and friends across the country. Airline carriers enjoy a competitive market with no more than 36% difference in total flights between carriers. Regarding carriers, we observed that flight delays increase with flight distance. Chicago O’Hare International is the busiest airport with over 1.5 million flights recorded over the 3 year period.

7.2. Project Answers

How many flights are there?\ 15,031,014 flights.

Which airlines have the most flights?\ US Airways Inc.

What time of the week passengers fly the most?\ Monday to Friday.

How does season change the frequency and the destination of travel?\ Summer increases the frequency of average flights whilst winter decreases it.

Which routes are the most popular?

SFO – LAX | 69,180 Flights
LAX – SFO | 68,754 Flights
LAX – PHX | 39,321 Flights

LAX: Los Angeles Intl, SFO: San Fransisco Intl, PHX: Phoenix Sky Harbor Intl.

Which routes and airports experience the most delays?\ GUC – HDN Gunisson to Haydon | 1980 minutes on average for route.\ PSE Mercedita Airport | 53.25 minutes delay on average is the highest for an airport.\ BQN Rafael Hernández Airport | -10.35 minutes delay (early) on average is lowest for an airport.

Which airports are the busiest in terms of inbound and outbound flights?\ Chicago O’Hare International.

Which popular routes are delayed the most?\ LAX – SFO | 68,754 Flights, 18.97 minutes delay on average.\ PHX – LAX | 38,756 Flights, 16.18 minutes delay on average.\ PHX – LAS | 30,778 Flights, 15.85 minutes delay on average.

Which times are airports the busiest in terms of flights by each hour?\ 08:00 – 09:00 has over 1 million flights during the morning.

How does flight distance affect departure and arrival delays?\ Flight distance increases total departure and arrival delays with 0.221398 positive correlation.

Which airline experiences the fewest delays?\ Northwest Airlines.

7.3. Key Insights for Presentation

For the presentation, I decided to look at flight counts using simple vertical and horizontal bar charts before moving on to more complex plots for deeper analysis. I used an area plot for flight duration because it looked visually clearer than using a line chart. I used a histogram for the flight distance plot to keep things interesting since my plan was to use many different plot types for the slides.

For total flights by carrier I decided to use a colourful pie chart which should catch the audiences attention since unique and vibrant colour selection is important. I also decided to explode the largest slice of the pie chart. For the delay section I used a combination of line charts and horizontal bar charts since it was the best way to show bivariate data that relies on categorical data. Next, I introduced a third variable which was categorical so I decided to use a heatmap. Heatmaps are great at showing multiple dimensions of data in a fun way, in our case it was for airport name, flight totals, and month. The darker the colour the more flights per grid box.

I used a combination chart to plot airlines vs month vs average distance. This allowed me to show multiple variables co-existing in the same axis clearly. Average flight delay vs distance used the scatterplot, another plot I was keen to try as it engages the audience with raw individual data points. This allowed us to use a line of best fit to intercept the points to visually see a correlation. For routes by flights vs airline average delay it was the perfect time to use a grouped bar chart, I found some interesting colour palettes for my chart from coolor.co. I could plot multiple airlines for each route on the same bar axis and take an average of them if we wanted to. I used a swarm plot on the same data for another angle to see if we could find out more. Average airlines by delay had a similar treatment with the grouped bar chart instead of replacing routes with airline carriers, this time I used a more modest colour palette. The random exploration at the end used simple bar charts again as they are quick and easy to explore data.

The Truth About Airline Statistics

Write A Comment Cancel Reply

Algorithmic Forex Mastery: Building A Trading Bot and Analytical Tools

Safer Gambling Analytics Part 2 – Session Aggregation

Safer Gambling Analytics Part 1 – Declined Deposits

Kaggle Marketing Analysis – Power BI Dashboard

The Truth About Airline Statistics

Table of Contents ?

Introduction ?

What is the ASA?

What is the ASA 2009 Data Expo?

Project Aims

Step 1: Import ?

Step 2: Gather ?

Step 3: Univariate Exploration ?

3.1. Total Flights by Month

3.2. Total Flights by Season

3.3. Total Flights by Day

3.4. Total Flights by Hour

3.5. Flight Duration in Minutes

3.6. Flight Distance in Miles

3.7. Flights by Carrier

3.8. Flight Distance by Day

3.9. Delay by Origin and Destination

Step 4: Bivariate Exploration ?

4.1. Airports by Traffic

4.2. Airlines by Month vs Average Distance

4.3. Average Flight Delay vs Distance

Step 5: Univariate Exploration ?

5.1. Routes by Flights, Airline and Average Delay

5.2. Top 10 Airlines by Average Delay

Step 6: Random Exploration ?

6.1. Plane Data

Step 7: Conclusion ?

7.1. Interesting Findings

7.2. Project Answers

7.3. Key Insights for Presentation

Related Posts

Write A Comment Cancel Reply