Table of Contents ?

Introduction ?

What is the ASA?

ASA stands for ‘American Statistical Association’, ASA is the main professional organisation for statisticians in the United States. The organization was formed in November 1839 and is the second oldest continuously operating professional society in the United States. Every other year, at the Joint Statistical Meetings, the Graphics Section and the Computing Section join in sponsoring a special Poster Session called The Data Exposition, but more commonly known as The Data Expo. All of the papers presented in this Poster Session are reports of analyses of a common data set provided for the occasion. In addition, all papers presented in the session are encouraged to report the use of graphical methods employed during the development of their analysis and to use graphics to convey their findings.

Quoted from citation: https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.1de

What is the ASA 2009 Data Expo?

The ASA Statistical Computing and Graphics Data Expo is a bi-annual data exploration challenge. Participants are challenged to provide a graphical summary of important features of the data. The task is intentionally vague to allow different entries to focus on different aspects of the data, giving the participants maximum freedom to apply their skills. The 2009 data expo consisted of flight arrival and departure details for all commercial flights on major carriers within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed. The complete dataset and challenge are available on the competition website http://stat-computing.org/dataexpo/2009/.

Because the dataset is so large, we also provided participants introductions to useful tools for dealing with this scale of data: Linux command line tools, including sort, awk, and cut, and sqlite, a simple SQL database. Additionally, we provided pointers to supplemental data on airport locations, airline carrier codes, individual plane information, and weather. This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008.

Quoted from citation: https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.1de

Project Aims

This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008. For our analysis, we are going to be using the three year period between 1989 to 1991 as our sample dataset as the whole dataset is too large to practically observe with Jupyter. Our goal is to ask many questions such as:

  • How many flights are there?
  • Which airlines have the most flights?
  • What time of the week passengers fly the most?
  • How does season change the frequency and the destination of travel?
  • Which routes are the most popular?
  • Which routes and airports experience the most delays?
  • Which airports are the busiest in terms of inbound and outbound flights?
  • Which popular routes are delayed the most?
  • Which times are airports the busiest in terms of flights by each hour?
  • How does flight distance affect departure and arrival delays?
  • Which airline experiences the fewest delays?

Table column descriptions:

In [2]:

import pandas as pd
pd.read_csv('misc_data/master/variable-descriptions_master.csv')

Out[2]:

namedescription
0Year1987-2008
1Month12-Jan
2DayofMonth31-Jan
3DayOfWeek1 (Monday) – 7 (Sunday)
4DepTimeactual departure time (local, hhmm)
5CRSDepTimescheduled departure time (local, hhmm)
6ArrTimeactual arrival time (local, hhmm)
7CRSArrTimescheduled arrival time (local, hhmm)
8UniqueCarrierunique carrier code
9FlightNumflight number
10TailNumplane tail number
11ActualElapsedTimein minutes
12CRSElapsedTimein minutes
13AirTimein minutes
14ArrDelayarrival delay, in minutes
15DepDelaydeparture delay, in minutes
16Originorigin IATA airport code
17Destdestination IATA airport code
18Distancein miles
19TaxiIntaxi in time, in minutes
20TaxiOuttaxi out time in minutes
21Cancelledwas the flight cancelled?
22CancellationCodereason for cancellation (A = carrier, B = weat…
23Diverted1 = yes, 0 = no
24CarrierDelayin minutes
25WeatherDelayin minutes
26NASDelayin minutes
27SecurityDelayin minutes
28LateAircraftDelayin minutes

Step 1: Import ?

In [3]:

# Import libraries
import pandas as pd
import numpy as np
import datetime as dt
import markdown
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
%matplotlib inline

# Suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

# Display dataframe all columns
pd.set_option('display.max_rows', 636)
pd.set_option('display.max_columns', 636)

References: \ https://unicode.org/emoji/charts/full-emoji-list.html \ https://getemoji.com/In [4]:

# Formatting ticks for large values (this will change the large numbers in charts to short clean values)
def reformat_large_tick_values(tick_val, pos):
    """
    Turns large tick values (in the billions, millions and thousands) such as 4500 into 4.5K and also appropriately turns 4000 into 4K (no zero after the decimal).
    """
    if tick_val >= 1000000000:
        val = round(tick_val/1000000000, 1)
        new_tick_format = '{:}B'.format(val)
    elif tick_val >= 1000000:
        val = round(tick_val/1000000, 1)
        new_tick_format = '{:}M'.format(val)
    elif tick_val >= 1000:
        val = round(tick_val/1000, 1)
        new_tick_format = '{:}K'.format(val)
    elif tick_val < 1000:
        new_tick_format = round(tick_val, 1)
    else:
        new_tick_format = tick_val

    # make new_tick_format into a string value
    new_tick_format = str(new_tick_format)
    
    # code below will keep 4.5M as is but change values such as 4.0M to 4M since that zero after the decimal isn't needed
    index_of_decimal = new_tick_format.find(".")
    
    if index_of_decimal != -1:
        value_after_decimal = new_tick_format[index_of_decimal+1]
        if value_after_decimal == "0":
            # remove the 0 after the decimal point since it's not needed
            new_tick_format = new_tick_format[0:index_of_decimal] + new_tick_format[index_of_decimal+2:]
            
    return new_tick_format

Step 2: Gather ?

In [5]:

# flight data
df8991 = pd.read_csv('dataset/1989-1991.csv')

In [6]:

# misc_data
airports = pd.read_csv('misc_data/master/airports_master.csv')
carriers = pd.read_csv('misc_data/master/carriers_master.csv')
plane_data = pd.read_csv('misc_data/master/plane-data_master.csv')
var_desc = pd.read_csv('misc_data/master/variable-descriptions_master.csv')

In [7]:

df8991.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15031014 entries, 0 to 15031013
Data columns (total 17 columns):
Date                 object
DayOfWeek            object
DepTime              object
CRSDepTime           object
ArrTime              object
CRSArrTime           object
UniqueCarrier        object
FlightNum            int64
ActualElapsedTime    int64
CRSElapsedTime       int64
ArrDelay             int64
DepDelay             int64
Origin               object
Dest                 object
Distance             int64
Cancelled            int64
Diverted             int64
dtypes: int64(8), object(9)
memory usage: 1.9+ GB

In [8]:

# Converting date column
df8991['Date'] = pd.to_datetime(df8991['Date'])

In [9]:

from PIL import Image # Library for importing images
Image.open('images/Boeing_727-22_United_Airlines.jpg')

Out[9]:

Step 3: Univariate Exploration ?

During this step, we will investigate distributions of individual variables to prepare for observing the relationships between variables.In [10]:

df8991.head(10)

Out[10]:

DateDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNumActualElapsedTimeCRSElapsedTimeArrDelayDepDelayOriginDestDistanceCancelledDiverted
01989-01-23Monday14:19:0012:30:0017:42:0015:52:00UA183323322110109SFOHNL239800
11989-01-24Tuesday12:55:0012:30:0016:12:0015:52:00UA1833173222025SFOHNL239800
21989-01-25Wednesday12:30:0012:30:0015:33:0015:52:00UA183303322-190SFOHNL239800
31989-01-26Thursday12:30:0012:30:0015:23:0015:52:00UA183293322-290SFOHNL239800
41989-01-27Friday12:32:0012:30:0015:13:0015:52:00UA183281322-392SFOHNL239800
51989-01-28Saturday12:28:0012:30:0015:50:0015:52:00UA183322322-2-2SFOHNL239800
61989-01-29Sunday16:39:0012:30:0019:42:0015:52:00UA183303322230249SFOHNL239800
71989-01-30Monday12:31:0012:30:0015:31:0015:52:00UA183300322-211SFOHNL239800
81989-01-31Tuesday14:05:0012:30:0018:27:0015:52:00UA18338232215595SFOHNL239800
91989-01-02Monday10:57:0010:45:0015:37:0015:54:00UA184160189-1712DENIAD145200

3.1. Total Flights by Month

In [11]:

# Extracting the month number from the dates
flight_month = df8991['Date'].dt.month

# Assigning the months
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Creating a new dataframe
flight_month = pd.DataFrame(flight_month)

# Renaming the column name of the new dataframe
flight_month.rename(columns = {'Date':'Month'},inplace=True)

# Adding a new column to count flights per month
flight_month['Flights'] = df8991['Date'].dt.month.count

# Grouping the data by month then assigning month names to 'month' column
month_data  = flight_month.groupby('Month').count()
month_data['Month'] = months

# Renaming each month number to a real month name
flights_by_month = month_data.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})
flights_by_month.drop(columns=['Month'], axis=1, inplace=True)

# Looking at the dataframe we created
flights_by_month

Out[11]:

Flights
Month
Jan1223834
Feb1102190
Mar1271415
Apr1250728
May1281666
Jun1255135
Jul1299911
Aug1312995
Sep1254058
Oct1297516
Nov1224153
Dec1257413

In [12]:

# Total flights
flights_by_month['Flights'].sum()

Out[12]:

15031014

In [13]:

# Average flights per month
flights_by_month['Flights'].mean()

Out[13]:

1252584.5

In [14]:

# Working out percentage of flights in 'Feb' compared to the monthly average
round(flights_by_month.loc['Feb','Flights']/flights_by_month['Flights'].mean()-1, 2)

Out[14]:

-0.12

In [15]:

# Plot
flights_by_month.plot(kind = 'bar', figsize=(14.70, 8.27), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Month\n', fontsize = 14, weight = "bold")

# Set axis labels
plt.ylabel('Flights', fontsize = 10,  weight = "bold")
plt.xlabel('Month', fontsize = 10, weight = "bold")

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_month.png')
plt.show()

Observation 1: For this dataset we have used the data between 1989-1991, therefore we grouped the data into monthly averages between those years. can see August has the most flights on average and Feburary had the least. This shows us that summer months have the most flights and winter months the least. Feburary had 12% less flights compared to the monthly average of 1,252,585.

3.2. Total Flights by Season

In [16]:

# Create new dataframe
flights_by_season = flights_by_month
flights_by_season = month_data.rename(index={1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring', 6: 'Summer', 7: 'Summer', 8: 'Summer', 9: 'Autumn', 10: 'Autumn', 11: 'Autumn', 12: 'Winter'})

# Drop 'Month' column
flights_by_season.drop(columns=['Month'], axis=1, inplace=True)

# Group seasons
flights_by_season.rename_axis(index={'Month':'Season'}, inplace=True)
flights_by_season.reset_index(inplace=True)
flights_by_season = flights_by_season.groupby('Season').sum()

# Order index
flights_by_season = flights_by_season.reindex(['Spring', 'Summer', 'Autumn', 'Winter'])

# Looking at the dataframe we created
flights_by_season

Out[16]:

Flights
Season
Spring3803809
Summer3868041
Autumn3775727
Winter3583437

In [17]:

flights_by_season['Flights'].pct_change(periods=1)

Out[17]:

Season
Spring         NaN
Summer    0.016886
Autumn   -0.023866
Winter   -0.050928
Name: Flights, dtype: float64

In [18]:

# Plot
flights_by_season.plot(kind = 'bar', figsize=(7, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Season', fontsize = 15)

# Set axis labels
plt.ylabel('Number of flights', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_season.png')
plt.show()

Observation 2: Here we are looking at flights by season, this plot shows us the total flights recorded for each season. Summer has the most total flights but the difference between them is not as much as you might think. From my calculations, the percentage difference between the best and worst seasons is only 5%. Spring has a strong performance placing 2nd, Autumn 3rd and Winter a more distant 4th.

3.3. Total Flights by Day

In [19]:

# Extracting the day number from the dates
flight_day = df8991['Date'].dt.dayofweek 

# Assigning the days
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Creating a new dataframe
flight_day = pd.DataFrame(flight_day)

# Renaming the column name of the new dataframe
flight_day.rename(columns = {'Date':'Day'},inplace=True)

# Adding a new column to count flights per day
flight_day['Flights'] = df8991['Date'].dt.dayofweek.count

# Grouping the data by day
day_data  = flight_day.groupby('Day').count()
day_data['Day'] = days

# Renaming each day number to a real day name
flights_by_day = day_data.rename(index={0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'})
flights_by_day.drop(columns=['Day'], axis=1, inplace=True)

# Looking at the dataframe we created
flights_by_day

Out[19]:

Flights
Day
Mon2211445
Tue2210868
Wed2200217
Thu2186411
Fri2180619
Sat1964556
Sun2076898

In [20]:

flights_by_day['Flights'].pct_change()

Out[20]:

Day
Mon         NaN
Tue   -0.000261
Wed   -0.004818
Thu   -0.006275
Fri   -0.002649
Sat   -0.099083
Sun    0.057184
Name: Flights, dtype: float64

In [21]:

# Average flights per day
round(flights_by_day['Flights'].mean(), 2)

Out[21]:

2147287.71

In [22]:

# Working out percentage of flights compared to the weekly average
round(flights_by_day.loc['Sat','Flights']/flights_by_day['Flights'].mean()-1, 2)

Out[22]:

-0.09

In [23]:

# Plot
flights_by_day.reindex(index=flights_by_day.index[::-1]).plot(kind = 'barh', figsize=(10, 5), rot=0)

# Set title label
plt.title('Total Flights by Day', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
ax.axvline(flights_by_day['Flights'].mean(), color='red', linewidth=2, linestyle='dashed', label='Average')

# Legend
plt.legend(bbox_to_anchor=(1.2, 1), loc=1, borderaxespad=0.0)

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_day.png')
plt.show()

Observation 3: This plot shows us total flights by day. Monday has the most total flights with 2,211,445 but the difference between Monday and the other weekdays is rather negligable. Saturday shows a near 10% dip in flights compared to Monday while Sunday dips under 6% using the same comparison, this suggests passengers are flying more during the week than weekends. This chart shows a 10% variance between the day with the highest and lowest amount of flights, 3 years of data shows consistency in strong demand for flights regardless of the day of the week.

3.4. Total Flights by Hour

In [24]:

# Creating flight_hour
flight_hour = df8991['DepTime']

# Creating a new dataframe
flight_hour = pd.DataFrame(flight_hour)

# Extracting the hour from the times
flight_hour['FlightHour'] = df8991['DepTime'].str[:2]

# Adding a new column to count flights per hour
flight_hour['Flights'] = flight_hour['FlightHour'].count

# Drop column
flight_hour.drop(columns=['DepTime'], axis=1, inplace=True)

# Grouping the data by hour
hour_data  = flight_hour.groupby('FlightHour').count()
flights_by_hour = hour_data

# Looking at the dataframe we created
flights_by_hour

Out[24]:

Flights
FlightHour
0061648
0146131
028595
032593
045505
0553452
06779935
071049654
081103998
09976624
10834899
11915756
12977161
131032993
14845945
15882206
16934469
17988229
18973944
19842436
20798213
21503912
22275905
23136811

In [25]:

# Most popular flight hours
flights_by_hour.sort_values(by='Flights', ascending=False).head(5)

Out[25]:

Flights
FlightHour
081103998
071049654
131032993
17988229
12977161

In [26]:

# Least popular flight hours 
flights_by_hour.sort_values(by='Flights', ascending=True).head(5)

Out[26]:

Flights
FlightHour
032593
045505
028595
0146131
0553452

In [27]:

# Working out x difference in flights between busiest hour compared to the quietest hour
round(flights_by_hour.loc['08','Flights']/flights_by_hour.loc['03','Flights'], 2)

Out[27]:

425.76

In [28]:

# Plot
flights_by_hour.plot(kind = 'bar', figsize=(14.70, 8.27), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Hour\n', fontsize = 14)

# Set axis labels
plt.ylabel('Flights', fontsize = 10,  weight = "bold")
plt.xlabel('Time', fontsize = 10, weight = "bold")

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_hour.png')
plt.show()

Observation 4: We can see the total flights by hour here based on departure time. 08:00 is the most popular time to fly with 1,103,998 flights according to departure time data from 1989-1991. We can see that there is a dip at 10:00, perhaps due to scheduling or other external reasons. The next dip we see on the chart is at 14:00 hours before recovering again until 18:00 where the flights steadily drop off. 03:00 recorded the lowest number of flights at only 2,593, that is a -426x difference in total flights vs 08:00. Flights between midnight and 05:00 are lower due to many factors including being the time people are generally sleeping but also noise restrictions. Maybe we can look at flight duration next?

3.5. Flight Duration in Minutes

In [29]:

df8991['ActualElapsedTime'].min(), df8991['ActualElapsedTime'].max()

Out[29]:

(-591, 1883)

In [30]:

df8991['CRSElapsedTime'].min(), df8991['CRSElapsedTime'].max()

Out[30]:

(-59, 1565)

In [31]:

df8991[df8991['CRSElapsedTime'] == -59]

Out[31]:

DateDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNumActualElapsedTimeCRSElapsedTimeArrDelayDepDelayOriginDestDistanceCancelledDiverted
132664001991-08-08Thursday06:56:0007:00:0011:04:0007:01:00NW312188-59243-4MSPDCA93100

In [32]:

# Looking for 'ActualElapsedTime' values under 0
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] < 0].count()

Out[32]:

29

In [33]:

# Looking for 'CRSElapsedTime' values under 0
df8991['CRSElapsedTime'][df8991['CRSElapsedTime'] < 0].count()

Out[33]:

12

In [34]:

# Creating flight_duration and dropping time values under 0
flight_duration = df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] >= 0]

# Creating a new dataframe
flight_duration = pd.DataFrame(flight_duration)

# Extracting the duration data
flight_duration['FlightDuration'] = df8991['ActualElapsedTime']

# Adding a new column to count flights by duration
flight_duration['Flights'] = flight_duration['FlightDuration'].count

# Drop column
flight_duration.drop(columns=['ActualElapsedTime'], axis=1, inplace=True)

# Grouping the data by duration
duration_data  = flight_duration.groupby('FlightDuration').count()
flights_by_duration = duration_data

# Looking at the dataframe we created
flights_by_duration.head()

Out[34]:

Flights
FlightDuration
02
12
24
31
43

In [35]:

# Average flight duration
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 0].mean()

Out[35]:

109.90256206131029

In [36]:

# Median flight duration
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 0].median()

Out[36]:

91.0

In [37]:

# Percentage of flights over 300 minutes
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 300].count()/df8991['ActualElapsedTime'].count()

Out[37]:

0.018513654501286475

In [38]:

# Percentage of flights between 0 and 300 minutes
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'].between(0, 300)].count()/df8991['ActualElapsedTime'].count()

Out[38]:

0.9814844161544923

In [39]:

# Checking the statistics
df8991['ActualElapsedTime'][df8991['ActualElapsedTime'] > 0].describe()

Out[39]:

count    1.503098e+07
mean     1.099026e+02
std      6.445965e+01
min      1.000000e+00
25%      6.400000e+01
50%      9.100000e+01
75%      1.400000e+02
max      1.883000e+03
Name: ActualElapsedTime, dtype: float64

In [40]:

# Maximum flight duration
df8991['ActualElapsedTime'].max()

Out[40]:

1883

In [41]:

# Plot
flights_by_duration.plot(kind = 'area', figsize=(7, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Duration', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('Minutes', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_duration.png')
plt.show()

Observation 5: Total flights by duration shows us a lot of interesting insights. Our area graph above shows us flights average around 110 minutes while the median flight duration is 91 minutes. The longest flight duration was 1883 minutes which equates to over 31 hours! The conclusion is 98.1% of flights are under 300 minutes and the remaining 1.9% of flights are generally under 300 minutes or 5 hours. Distance is an ideal thing to look at next, how far do passengers travel and does flight distance vary depending on time?

3.6. Flight Distance in Miles

In [42]:

# Creating flight_distance
flight_distance = df8991['Distance']

# Creating a new dataframe
flight_distance = pd.DataFrame(flight_distance)

# Extracting the distance data
flight_distance['FlightDistance'] = df8991['Distance']

# Adding a new column to count flights by distance
flight_distance['Flights'] = flight_distance['FlightDistance'].count

# Drop column
flight_distance.drop(columns=['Distance'], axis=1, inplace=True)

# Grouping the data by distance
distance_data  = flight_distance.groupby('FlightDistance').count()
flights_by_distance = distance_data
flights_by_distance_idx = distance_data.reset_index()

# Looking at the dataframe we created
flights_by_distance.head()

Out[42]:

Flights
FlightDistance
01
11889
174
182
215696

In [43]:

# Minimum and maxiumum distance travelled
df8991['Distance'].min(), df8991['Distance'].max()

Out[43]:

(0, 4502)

In [44]:

# Average flight distance
df8991['Distance'].mean()

Out[44]:

632.9591811969572

In [45]:

df8991['Distance'][df8991['Distance'] > 2800].count()/df8991['Distance'].count()

Out[45]:

0.0016491901344779533

In [46]:

# Plot
plt.hist(data = flights_by_distance_idx, x = 'FlightDistance', bins = 25) #, y = 'Flights'

# Set title label
plt.title('Total Flights by Distance', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('Distance (Miles)', fontsize = 13)

# Ticks fixed for histogram
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1) + 'K'))

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_distance.png')
plt.show()

Observation 6: We are looking at total flights by distance using a histogram, flights are indicated on the left and distance is on the bottom. Most flights range from around 100 to 1000 miles and steadily decline after, with the average being 632 miles. 2500-3000 miles seems to be the point flights most drastically reduce. Statistically 0.1% of flights go over 2800 miles. Maybe we can look at carrier data to find some different correlations?

3.7. Flights by Carrier

In [47]:

# Creating flight_carrier
flight_carrier = df8991['UniqueCarrier']

# Creating a new dataframe
flight_carrier = pd.DataFrame(flight_carrier)

# Extracting the carrier data
flight_carrier['FlightCarrier'] = df8991['UniqueCarrier']

# Adding a new column to count flights by carrier
flight_carrier['Flights'] = flight_carrier['FlightCarrier'].count

# Drop column
flight_carrier.drop(columns=['UniqueCarrier'], axis=1, inplace=True)

# Grouping the data by carrier
carrier_data  = flight_carrier.groupby('FlightCarrier').count()
flights_by_carrier = carrier_data

# Looking at the dataframe we created
flights_by_carrier.head()

Out[47]:

Flights
FlightCarrier
AA2084513
AS267975
CO1241287
DL2430469
EA398362

In [48]:

# Creating a filtered database for top 10 carriers ascending
carriers_top_10 = flights_by_carrier.sort_values('Flights', ascending=False).iloc[0:10]

# Dataframe without index
carriers_top_10_idx = carriers_top_10.reset_index()

In [49]:

# Top 10 Carriers
carriers_top_10

Out[49]:

Flights
FlightCarrier
US2575160
DL2430469
AA2084513
UA1774850
NW1352989
CO1241287
WN950837
TW756992
HP623477
EA398362

In [50]:

carriers.head()

Out[50]:

codedescription
002QTitan Airways
104QTradewind Aviation
205QComlux Aviation, AG
306QMaster Top Linhas Aereas Ltd.
407QFlair Airlines Ltd.

In [51]:

# Filter
carriers[carriers['code'] == 'US']

Out[51]:

codedescription
1308USUS Airways Inc. (Merged with America West 9/05…

In [52]:

# Renaming column to prepare for df merge
carriers_top_10_idx.rename(columns={"FlightCarrier":"code"}, inplace=True)

In [53]:

# Merging dataframes to get the carrier description
carriers_top_10_idx2 = pd.merge(left=carriers_top_10_idx, right=carriers, how='left', on='code')

In [54]:

carriers_top_10_idx2

Out[54]:

codeFlightsdescription
0US2575160US Airways Inc. (Merged with America West 9/05…
1DL2430469Delta Air Lines Inc.
2AA2084513American Airlines Inc.
3UA1774850United Air Lines Inc.
4NW1352989Northwest Airlines Inc.
5CO1241287Continental Air Lines Inc.
6WN950837Southwest Airlines Co.
7TW756992Trans World Airways LLC
8HP623477America West Airlines Inc. (Merged with US Air…
9EA398362Eastern Air Lines Inc.

In [55]:

carriers_top_10_idx2['description'] = carriers_top_10_idx2['description'].str.replace(r"\(.*\)","")

In [56]:

carriers_top_10_idx2

Out[56]:

codeFlightsdescription
0US2575160US Airways Inc.
1DL2430469Delta Air Lines Inc.
2AA2084513American Airlines Inc.
3UA1774850United Air Lines Inc.
4NW1352989Northwest Airlines Inc.
5CO1241287Continental Air Lines Inc.
6WN950837Southwest Airlines Co.
7TW756992Trans World Airways LLC
8HP623477America West Airlines Inc.
9EA398362Eastern Air Lines Inc.

In [57]:

carriers_top_10_idx2['Flights'].pct_change()

Out[57]:

0         NaN
1   -0.056187
2   -0.142341
3   -0.148554
4   -0.237688
5   -0.082559
6   -0.233991
7   -0.203868
8   -0.176376
9   -0.361064
Name: Flights, dtype: float64

In [58]:

# Variables
labels = [' ',' ',' ',' ',' ',' ',' ',' ',' ',' ']
explode = [0.1, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]

# Plot
carriers_top_10.plot(kind='pie', figsize=(14.70, 8.27), rot=0, subplots=True, labels=labels, legend=True, 
                     cmap='jet', autopct='%1.1f%%', pctdistance=1.15, explode=explode)

# Legend
carrier_list = list(carriers_top_10_idx2['description'])
plt.legend(labels = carrier_list, bbox_to_anchor=(1.55, 1), loc=1, borderaxespad=0.0)

# Set title label
plt.title('Total Flights by Carrier', loc='center', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_carrier.png')
plt.show()

Observation 7: The total flights by carrier is a good indication of which airlines were the most popular between 1989 and 1991. The most popular carrier is US Airways Inc with 2,575,160 flights between 1989-1991, that represents an 18.1% share of all flights from our data. Delta Air Lines Inc. is a close 2nd place with 2,430,469 or a 5.6% difference from 1st place with a 17% share of the market. Those two airlines would have gone toe to toe dominating aviation at that time. However, American Airlines Inc. features in 3rd place with 2,084,513 flights with a strong 14.7% market share. The gap widens to a 14% difference in total flights vs 1st place. This trend continues as the other airlines share smaller and smaller stakes in the market. Perhaps later we can find out which airline is the most punctual?

3.8. Flight Distance by Day

In [59]:

# Grouping data by day of week then sorting index
weekday_cat = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday = df8991.drop(['FlightNum'], axis=1).groupby(['DayOfWeek']).mean().reindex(weekday_cat)

# Rename index variable
idx = {"Monday": "Mon", "Tuesday":"Tue", "Wednesday":"Wed", "Thursday":"Thu", "Friday":"Fri", "Saturday":"Sat", "Sunday":"Sun"}

In [60]:

weekday.head(7)

Out[60]:

ActualElapsedTimeCRSElapsedTimeArrDelayDepDelayDistanceCancelledDiverted
DayOfWeek
Monday109.058930109.4882314.9649755.425532628.1875340.00.0
Tuesday109.530635109.4221636.2135786.144185627.7411700.00.0
Wednesday109.712301109.3555997.0215186.684452627.1155910.00.0
Thursday110.101038109.4117498.7442488.079987627.5629290.00.0
Friday110.096784109.5293419.3491418.810754628.5334790.00.0
Saturday110.790458112.2608394.7699306.301211650.8859980.00.0
Sunday110.143351111.3370365.0552556.310329643.1554900.00.0

In [61]:

weekday['Distance']

Out[61]:

DayOfWeek
Monday       628.187534
Tuesday      627.741170
Wednesday    627.115591
Thursday     627.562929
Friday       628.533479
Saturday     650.885998
Sunday       643.155490
Name: Distance, dtype: float64

In [62]:

# Figsize
fig = ax.get_figure()
plt.figure(figsize = [15, 6])
plt.suptitle('Elapsed Time and Distance vs Day', fontsize=15)

# Plot 1
plt.subplot(1, 2, 1)
weekday['ActualElapsedTime'].rename(index = idx, inplace = False).plot(kind = 'line', title='Actual Elapsed Time', legend=None)
plt.xlabel('')
plt.ylabel('')

# Plot 2
plt.subplot(1, 2, 1)
weekday['CRSElapsedTime'].rename(index = idx, inplace = False).plot(kind = 'line', title='Actual Elapsed Time vs CRS Elapsed Time (Min)', legend=None)
plt.legend()
plt.xlabel('')
plt.ylabel('')

# Plot 3
plt.subplot(1, 2, 2)
weekday['Distance'].rename(index = idx, inplace = False).plot(kind = 'line', title='Distance (Miles)', legend=None)
plt.xlabel('')
plt.ylabel('')

# Show
fig.tight_layout()
fig.subplots_adjust(top=2)
plt.savefig('images/charts/time_vs_distance_vs_day.png')
plt.show()

Observation 8: Elapsed time and distance versus day is a good way to find correlation between time of the week and flight delays with distance. Flight time seems to increase as the week passes. The weekend has the longest flights on average, indicating passengers might be travelling on weekend holidays but commute on weekdays. We also know from other data that fewer flights take place during the weekend but also delays are lower while flight duration and distance is longer on average. We should look into delays more, which airports are the most punctual and which are the least?

3.9. Delay by Origin and Destination

In [63]:

df8991['Origin'].unique()

Out[63]:

array(['SFO', 'DEN', 'HNL', 'LIH', 'PHL', 'OGG', 'IAD', 'EWR', 'LAX',
       'KOA', 'ORD', 'MKE', 'IAH', 'MSY', 'RIC', 'SEA', 'FLL', 'MCO',
       'BWI', 'HOU', 'MCI', 'SJC', 'CMH', 'OMA', 'ORF', 'BOS', 'ABQ',
       'SMF', 'OKC', 'SGF', 'SLC', 'ONT', 'SAN', 'BUF', 'LGB', 'MIA',
       'BDL', 'IND', 'TPA', 'SYR', 'PIT', 'STL', 'PDX', 'CLT', 'PHX',
       'CVG', 'MBS', 'ATL', 'DSM', 'LAS', 'AUS', 'JAN', 'MEM', 'DTW',
       'SRQ', 'MSP', 'DFW', 'PSP', 'MDT', 'FAT', 'BGR', 'PWM', 'ROC',
       'MDW', 'FAR', 'CHS', 'SAV', 'LGA', 'ANC', 'FSD', 'OAK', 'CLE',
       'CAE', 'TUL', 'HPN', 'GSO', 'RDU', 'DAY', 'GRR', 'MSN', 'JAX',
       'BNA', 'COS', 'SAT', 'CID', 'PBI', 'GEG', 'LNK', 'PVD', 'BIL',
       'SBA', 'ELP', 'SDF', 'TUS', 'SNA', 'ICT', 'BUR', 'ABE', 'RAP',
       'GTF', 'LIT', 'ALB', 'RNO', 'BHM', 'HSV', 'BOI', 'DCA', 'SUX',
       'TYS', 'MHT', 'EUG', 'MLI', 'BTV', 'PIA', 'FAI', 'MFR', 'MRY',
       'JFK', 'RSW', 'ERI', 'PHF', 'PSC', 'ISP', 'AVP', 'EVV', 'ELM',
       'LEX', 'ORH', 'BGM', 'TRI', 'CRW', 'TOL', 'MYR', 'ITH', 'GSP',
       'HTS', 'ACY', 'BLI', 'SCK', 'RDM', 'CCR', 'YKM', 'DAL', 'LBB',
       'CRP', 'AMA', 'HRL', 'MAF', 'DET', 'VPS', 'AZO', 'FWA', 'DLH',
       'ATW', 'GRB', 'BIS', 'GPT', 'LAN', 'GFK', 'RST', 'BZN', 'EAU',
       'LSE', 'MOT', 'MOB', 'MSO', 'BTR', 'CHA', 'SHV', 'MGM', 'PFN',
       'HDN', 'SBN', 'STT', 'STX', 'FAY', 'AVL', 'CAK', 'OAJ', 'TLH',
       'ROA', 'ILM', 'AGS', 'DAB', 'FNT', 'ISO', 'CHO', 'CMI', 'LYH',
       'UCA', 'PNS', 'EYW', 'APF', 'GNV', 'SJU', 'PUB', 'MLU', 'MLB',
       'CSG', 'CPR', 'IDA', 'JAC', 'HLN', 'FCA', 'JNU', 'BTM', 'DRO',
       'GJT', 'YUM', 'FLG', 'GCN', 'PIE', 'TVL', 'BFL', 'GUC', 'BET',
       'OME', 'OTZ', 'SCC', 'CDV', 'YAK', 'PSG', 'SIT', 'WRG', 'GUM',
       'MFE', 'LFT', 'YAP', 'ROR', 'SPN', 'ROP', 'TVC', 'GST', 'EGE',
       'SUN', 'PMD', 'SWF', 'EFD', 'PSE', 'HVN', 'TTN', 'BQN', 'SPI'],
      dtype=object)

In [64]:

origin = df8991.drop(['FlightNum'], axis=1).groupby(['Origin']).mean()

In [65]:

dest = df8991.drop(['FlightNum'], axis=1).groupby(['Dest']).mean()

In [66]:

origin['DepDelay'].sort_values(ascending=False).iloc[0:10]

Out[66]:

Origin
PSE    22.716667
YAP    19.763158
SUN    14.538462
HDN    12.525217
JFK    11.455878
ORD    10.036105
GUC     9.815362
JNU     9.490426
EWR     8.866009
PIT     8.842862
Name: DepDelay, dtype: float64

In [67]:

# Assigning high and low airport delay
aplow_delay = (origin['DepDelay'].sort_values(ascending=False)+dest['ArrDelay']).sort_values(ascending=True).iloc[0:10].sort_values(ascending=False)
aphigh_delay = (origin['DepDelay'].sort_values(ascending=False)+dest['ArrDelay']).sort_values(ascending=False).iloc[0:10].sort_values(ascending=True)

In [68]:

# Top 10 lowest delay airports
aplow_delay.sort_values(ascending=True)

Out[68]:

BQN   -10.351648
ATW    -5.999804
YKM     0.053702
MOT     0.203609
EAU     0.944523
PMD     1.263444
ROP     1.655739
LFT     2.202046
PUB     2.252053
GRB     2.559022
dtype: float64

In [69]:

# Top 10 highest delay airports
aphigh_delay.sort_values(ascending=False)

Out[69]:

PSE    53.250000
SUN    38.038462
YAP    37.362168
HDN    22.337867
GST    20.760290
SIT    20.030887
GUC    19.211647
JFK    18.920419
JNU    18.902215
SFO    18.836069
dtype: float64

In [70]:

# Chart 1
plt.subplot(1, 2, 1)
origin['DepDelay'].sort_values(ascending=False).iloc[0:10].plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Departure Airports by Delay')

# Chart 2
plt.subplot(1, 2, 2)
dest['ArrDelay'].sort_values(ascending=False).iloc[0:10].plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Arrival Airports by Delay', label='Delay in minutes')

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/delay_by_origin_dest.png')
plt.show()

Here we are looking at airports ranked by average delay times. PIT is has the lowest departure delays but PSE has the highest departure delays on average. SCC has the lowest arrival delays and PSE has the highest departure delays. We should look at total delay now, this should be easily done by adding the departure and arrival times.In [71]:

# Chart 1
plt.subplot(1, 2, 1)
aphigh_delay.plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Top 10 Airports by Highest Total Delay')

# Chart 2
plt.subplot(1, 2, 2)
aplow_delay.plot(
    kind = 'barh', figsize=(10, 5), legend=None, title='Top 10 Airports by Lowest Total Delay')

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/delay_by_airport.png')
plt.show()

Observation 9: The above plot shows airports by total delay this time. PSE – Mercedita Airport has the highest average delays at 53.2 minutes. BQN – Rafael Hernández Airport has the lowest average delays at -10.35 minutes. In other words, flights arrive 10 minutes early on average.In [72]:

# Airline carrier filter
carriers[carriers['code'] == 'UA']

Out[72]:

codedescription
1297UAUnited Air Lines Inc.

In [73]:

# Airport name filter
airports[airports['iata'] == 'PSE']

Out[73]:

iataairportcitystatecountrylatlong
2674PSEMerceditaPoncePRUSA18.008303-66.563012

In [74]:

# Airports by flights
(df8991['Origin'] == 'ATW').sum()+(df8991['Dest'] == 'ATW').sum()

Out[74]:

1861

Step 4: Bivariate Exploration ?

4.1. Airports by Traffic

In [75]:

df8991['Dest'].unique()

Out[75]:

array(['HNL', 'IAD', 'SFO', 'DEN', 'LIH', 'LAX', 'PDX', 'ORD', 'KOA',
       'OGG', 'BOS', 'EWR', 'IAH', 'MSY', 'PHX', 'OMA', 'ABE', 'SDF',
       'DFW', 'PHL', 'BTV', 'FAT', 'DTW', 'ABQ', 'TUL', 'AUS', 'MSP',
       'PIT', 'GSO', 'SMF', 'TUS', 'IND', 'GRR', 'HOU', 'ORF', 'SJC',
       'SAN', 'CMH', 'SEA', 'MCI', 'MEM', 'BDL', 'TPA', 'PBI', 'FSD',
       'BWI', 'DSM', 'CVG', 'MCO', 'MHT', 'PWM', 'BGR', 'BOI', 'RDU',
       'LAS', 'DAY', 'SAV', 'CHS', 'FAI', 'ANC', 'CLE', 'OAK', 'ICT',
       'OKC', 'MKE', 'GEG', 'LIT', 'TYS', 'BNA', 'BUF', 'COS', 'FAR',
       'SGF', 'LNK', 'SUX', 'RAP', 'GTF', 'BIL', 'CLT', 'ONT', 'SLC',
       'ATL', 'STL', 'MIA', 'MSN', 'MRY', 'JAX', 'MBS', 'CAE', 'RNO',
       'HPN', 'ROC', 'ALB', 'RSW', 'RIC', 'SBA', 'SAT', 'LGB', 'PVD',
       'HSV', 'MDT', 'JAN', 'DCA', 'BUR', 'JFK', 'SNA', 'PSP', 'SYR',
       'FLL', 'BHM', 'MLI', 'PIA', 'CID', 'SRQ', 'LGA', 'MDW', 'ELP',
       'EUG', 'MFR', 'ERI', 'AVP', 'ISP', 'PSC', 'EVV', 'MYR', 'GSP',
       'PHF', 'ITH', 'CRW', 'BGM', 'TOL', 'LEX', 'ELM', 'ORH', 'HTS',
       'TRI', 'ACY', 'SCK', 'CCR', 'YKM', 'RDM', 'BLI', 'HRL', 'DAL',
       'AMA', 'LBB', 'CRP', 'MAF', 'DET', 'GRB', 'FWA', 'RST', 'DLH',
       'AZO', 'GFK', 'LAN', 'LSE', 'EAU', 'BZN', 'MOT', 'BIS', 'MOB',
       'GPT', 'MSO', 'CHA', 'ATW', 'SHV', 'VPS', 'MGM', 'BTR', 'PFN',
       'HDN', 'SBN', 'STT', 'STX', 'CAK', 'FNT', 'AVL', 'OAJ', 'FAY',
       'ROA', 'ISO', 'DAB', 'ILM', 'AGS', 'CMI', 'TLH', 'CHO', 'LYH',
       'UCA', 'PNS', 'EYW', 'GNV', 'APF', 'SJU', 'PUB', 'MLB', 'MLU',
       'CSG', 'CPR', 'JAC', 'FCA', 'HLN', 'IDA', 'JNU', 'BTM', 'GJT',
       'YUM', 'DRO', 'FLG', 'GCN', 'BFL', 'GUC', 'PIE', 'TVL', 'BET',
       'OME', 'OTZ', 'SCC', 'YAK', 'CDV', 'SIT', 'PSG', 'WRG', 'GUM',
       'MFE', 'LFT', 'YAP', 'ROR', 'SPN', 'ROP', 'TVC', 'GST', 'EGE',
       'SUN', 'PMD', 'SWF', 'EFD', 'PSE', 'HVN', 'TTN', 'BQN', 'SPI'],
      dtype=object)

In [76]:

# Extracting the month number from the dates
flight_dest = df8991['Date'].dt.month

# Assigning the months
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Creating a new dataframe
flight_dest = pd.DataFrame(flight_dest)

# Renaming the column name of the new dataframe
flight_dest.rename(columns = {'Date':'Month'},inplace=True)

# Adding a new column to count flights per month
flight_dest['Flights'] = df8991['Date'].dt.month.count

# Adding a new column to sum destinations per month
flight_dest['Dest'] = df8991['Dest']

# Grouping the data by month then assigning month names to 'month' column
month_dest  = flight_dest.groupby([('Month'),('Dest')]).count()

# Renaming each month number to a real month name
flights_by_dest = month_dest.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})

# Looking at the dataframe we created
flights_by_dest.head()

Out[76]:

Flights
MonthDest
JanABE1281
ABQ7264
ACY160
AGS780
ALB3203

In [77]:

flights_by_dest_piv = flights_by_dest.reset_index().pivot(columns='Month',index='Dest',values='Flights')
flights_by_dest_piv = flights_by_dest_piv.reindex(columns= ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])

In [78]:

flights_by_dest_piv.head()

Out[78]:

MonthJanFebMarAprMayJunJulAugSepOctNovDec
Dest
ABE1281.01155.01169.01151.01276.01319.01499.01508.01490.01599.01540.01519.0
ABQ7264.06549.07691.07645.07947.07708.07900.07909.07580.07826.07281.07311.0
ACY160.0144.0163.0162.0209.0211.0249.0242.0188.0190.0182.0184.0
AGS780.0703.0782.0747.0796.0789.0796.0809.0784.0810.0789.0823.0
ALB3203.02989.03260.03215.03256.03160.03238.03312.03193.03356.03097.03165.0

In [79]:

flights_by_dest_piv.columns

Out[79]:

Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct',
       'Nov', 'Dec'],
      dtype='object', name='Month')

In [80]:

flights_by_dest_piv.sort_values(by=['Dest'], axis=0, ascending=False).head()

Out[80]:

MonthJanFebMarAprMayJunJulAugSepOctNovDec
Dest
YUM463.0412.0451.0435.0411.0372.0374.0404.0364.0391.0373.0430.0
YKM60.051.055.059.01.0NaNNaNNaNNaNNaNNaNNaN
YAP42.038.047.052.054.024.020.027.026.023.024.027.0
YAK160.0121.0166.0171.0168.0164.0173.0174.0152.0166.0151.0144.0
WRG63.064.077.083.087.086.081.084.078.085.059.062.0

In [81]:

# Filling NaN with 0
flights_by_dest_piv = flights_by_dest_piv.fillna(0)

# Changing from float to np.int64 to remove .0
flights_by_dest_piv = flights_by_dest_piv.astype(np.int64)

# Summing across the rows using .sum and 'axis=0'
flights_by_dest_piv['Total'] = flights_by_dest_piv.sum(axis=1).astype(np.int64)

In [82]:

top_10_dest = flights_by_dest_piv.sort_values(by='Total', ascending=False).iloc[0:10,0:12]
top_10_dest

Out[82]:

MonthJanFebMarAprMayJunJulAugSepOctNovDec
Dest
ORD648605810866654664396851966286676936839065764679316394966085
DFW387693493758998580565956259201619256238960610618075901561795
ATL573805196951962504415218852102557425838858889622915904560404
LAX390723486239950395034151541261429124290539996410033900439192
DEN338673086235191340933479634488358363668833778344593239834396
PHX329792979334678338773473733711348043472032948341313272333877
STL305062769631060307123145531567328283323231617323622915329462
SFO300682664930753304823182731614329853323830755318242979430346
DTW290982638229715296503075229824306423136629435307462861929245
PIT290622624629307287222973128558295882957628745297202838828690

In [83]:

df8991['Dest'].value_counts().head(10)

Out[83]:

ORD    790678
DFW    677064
ATL    670801
LAX    481175
DEN    410852
PHX    402978
STL    371650
SFO    370335
DTW    355474
PIT    346333
Name: Dest, dtype: int64

In [84]:

df8991['Origin'].value_counts().head(10)

Out[84]:

ORD    774861
ATL    669767
DFW    662101
LAX    482516
DEN    405662
PHX    400611
SFO    374001
STL    366913
DTW    349453
PIT    343552
Name: Origin, dtype: int64

In [85]:

# Extracting the month number from the dates
flight_origin = df8991['Date'].dt.month

# Assigning the months
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Creating a new dataframe
flight_origin = pd.DataFrame(flight_origin)

# Renaming the column name of the new dataframe
flight_origin.rename(columns = {'Date':'Month'},inplace=True)

# Adding a new column to count flights per month
flight_origin['Flights'] = df8991['Date'].dt.month.count

# Adding a new column to sum origininations per month
flight_origin['Origin'] = df8991['Origin']

# Grouping the data by month then assigning month names to 'month' column
month_origin  = flight_origin.groupby([('Month'),('Origin')]).count()

# Renaming each month number to a real month name
flights_by_origin = month_origin.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})

# Looking at the dataframe we created
flights_by_origin.head()

Out[85]:

Flights
MonthOrigin
JanABE1301
ABQ7335
ACY163
AGS788
ALB3268

In [86]:

flights_by_origin_piv = flights_by_origin.reset_index().pivot(columns='Month',index='Origin',values='Flights')
flights_by_origin_piv = flights_by_origin_piv.reindex(columns= ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])

In [87]:

# Filling NaN with 0
flights_by_origin_piv = flights_by_origin_piv.fillna(0)

# Changing from float to np.int64 to remove .0
flights_by_origin_piv = flights_by_origin_piv.astype(np.int64)

# Summing across the rows using .sum and 'axis=0'
flights_by_origin_piv['Total'] = flights_by_origin_piv.sum(axis=1).astype(np.int64)

In [88]:

top_10_origin = flights_by_origin_piv.sort_values(by='Total', ascending=False).iloc[0:10,0:12]
top_10_origin

Out[88]:

MonthJanFebMarAprMayJunJulAugSepOctNovDec
Origin
ORD636835710465155650236721764757663696646764726668326286964659
ATL572245183151889504625218651917557635831358726621415893660379
DFW375323433857383568595828757753602996076859686607745792860494
LAX392603498140132396074154641426429864302040053411053904039360
DEN335593059234759337633437433898351283612433433341193192233991
PHX328762966134464335633459633422345623444032777339423255333755
SFO303412702731162307023208831962333703362930964320883001930649
STL301762730630606303963102331093324493283731197318832884529102
DTW285842591129229291553019229321301383090628979302652813128642
PIT287872604129022285282954228366293252931428534294512821528427

In [89]:

import seaborn as sns
fig, (ax1, ax2, axcb) = plt.subplots(1,3, figsize=(15,6), gridspec_kw={'width_ratios':[1.05,1,0.08]})

# Plot
fig.suptitle('Top 10 Airports by Flights', fontsize=15)
ax1.title.set_text('Origin Airports')
ax2.title.set_text('Destination Airports')
sns.heatmap(top_10_origin, cmap='Purples', linewidths=5, cbar=None, ax=ax1)
sns.heatmap(top_10_dest, cmap='Purples', linewidths=5, cbar_ax=axcb, ax=ax2)

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/top_10_airports_by_flights.png')
plt.show()

On the left shows the top 10 airports by flights departed and on the right are the destinations by flights arrived. ORD – Chicago O’Hare International has the most flights based on departure and arrivals which suggests it’s a hub for flights across America. ATL – Atlanta International has the 2nd most flights based on departure but 3rd based on arrival. DFW – Dallas-Fort Worth International is 3rd in departures but 2nd in arrivals, the opposite of ATL. This means flights between those two airports are common. We could delve deeper into that if we need to later.In [90]:

# Sum dataframes
top_10_airports = top_10_origin.rename_axis(index='Airport')+top_10_dest.rename_axis(index='Airport')
top_10_airports['Total'] = top_10_airports.sum(axis=1).astype(np.int64)

# Sorting values
top_10_ap_total = top_10_airports['Total']
top_10_airports = top_10_airports.sort_values(by='Total', ascending=False).iloc[0:10,0:12]

In [91]:

# Top 10 airports by total flights
top_10_ap_total.sort_values(ascending=False)

Out[91]:

Airport
ORD    1565539
ATL    1340568
DFW    1339165
LAX     963691
DEN     816514
PHX     803589
SFO     744336
STL     738563
DTW     704927
PIT     689885
Name: Total, dtype: int64

In [92]:

# Plot
fig, ax = plt.subplots(figsize=(14.70, 8.27))
sns.heatmap(top_10_airports.iloc[0:10,0:12], cmap='Blues', linewidths=5, cbar='Blues')

# Labels
ax.set_title('Top 10 Airports by Total Flights\n', fontsize=14, weight='bold')
plt.xlabel('Month'.title(),
               fontsize = 10, weight = "bold")
plt.ylabel('Airport'.title(),
               fontsize = 10, weight = "bold")

# Show
#plt.tight_layout(pad=0)
plt.savefig('images/charts/top_10_airports_by_total_flights.png')
plt.show()

Observation 10: This plot shows us the top 10 airports by total inbound and outbound flights. The chart is topped by Chicago O’Hare International in 1st place with 1,565,539 flights. 2nd place is William B Hartsfield-Atlanta International with 1,340,568, and close 3rd is Dallas-Fort Worth International with 1,339,165. During the summer Chicago O’Hare International is very busy with strong demand throughout the year. Atlanta International follows a similar trend but Dallas-Fort Worth International show a weak January and Feburary in terms of total flights.In [93]:

# Dataframe without index
top_10_origin_idx = top_10_origin.reset_index().iloc[:,0:1]

# Renaming column to prepare for df merge
top_10_origin_idx.rename(columns={"Origin":"iata"}, inplace=True)

# Merging dataframes to get the airport description
top_10_origin_desc = pd.merge(left=top_10_origin_idx, right=airports, how='left', on='iata')

# Show airport data
top_10_origin_desc.iloc[:,:-2]

Out[93]:

iataairportcitystatecountry
0ORDChicago O’Hare InternationalChicagoILUSA
1ATLWilliam B Hartsfield-Atlanta IntlAtlantaGAUSA
2DFWDallas-Fort Worth InternationalDallas-Fort WorthTXUSA
3LAXLos Angeles InternationalLos AngelesCAUSA
4DENDenver IntlDenverCOUSA
5PHXPhoenix Sky Harbor InternationalPhoenixAZUSA
6SFOSan Francisco InternationalSan FranciscoCAUSA
7STLLambert-St Louis InternationalSt LouisMOUSA
8DTWDetroit Metropolitan-Wayne CountyDetroitMIUSA
9PITPittsburgh InternationalPittsburghPAUSA

In [94]:

top_10_origin_desc['long'].describe()

Out[94]:

count     10.000000
mean     -98.076828
std       15.373173
min     -122.374843
25%     -110.172792
50%      -93.698595
75%      -85.296324
max      -80.232871
Name: long, dtype: float64

In [95]:

top_10_ap_total.sort_values(ascending=False)

Out[95]:

Airport
ORD    1565539
ATL    1340568
DFW    1339165
LAX     963691
DEN     816514
PHX     803589
SFO     744336
STL     738563
DTW     704927
PIT     689885
Name: Total, dtype: int64

4.2. Airlines by Month vs Average Distance

In [96]:

# Creating a Unique Carrier dataframe
UC1 = df8991['UniqueCarrier'].groupby(df8991['Date'].map(lambda x: x.month)).nunique().dropna()
UC2 = round(df8991['Distance'].groupby(df8991['Date'].map(lambda x: x.month)).mean(), 2)

# Merge and reindex dataframes
carriers_month_v_distance = pd.merge(left=UC1, right=UC2, how='left', on='Date')
carriers_month_v_distance = carriers_month_v_distance.rename(index={1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                                                                    7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})
# Rename Columns
carriers_month_v_distance.rename_axis("Month", axis="index", inplace=True)
carriers_month_v_distance.rename(columns = {'UniqueCarrier':'CarrierCount'},inplace=True)

# Saving as a new dataframe
carriers_month_v_distance = pd.DataFrame(carriers_month_v_distance)

# Show
carriers_month_v_distance

Out[96]:

CarrierCountDistance
Month
Jan14619.77
Feb14623.41
Mar14630.20
Apr14628.60
May14629.13
Jun14635.24
Jul14640.34
Aug14639.94
Sep13635.37
Oct13634.70
Nov12636.83
Dec12640.03

In [97]:

# Creating a combo chart
fig, ax1 = plt.subplots(figsize=(10,6))
color = 'tab:green'

# Creating Bar Plot
ax1.set_title('Airlines by Month vs Average Distance', fontsize=15)
ax1 = sns.barplot(x='Month', y='CarrierCount', data=carriers_month_v_distance.reset_index(), palette='summer')
ax1.set_xlabel('Month', fontsize=13)
ax1.set_ylabel('Carriers', fontsize=13, color=color, weight='bold')
ax1.tick_params(axis='y')

# Sharing the same x-axis
ax2 = ax1.twinx()
color = 'tab:red'

# Creating Line Plot
ax2 = sns.lineplot(x='Month', y='Distance', data=carriers_month_v_distance.reset_index(), 
                   sort=False, color=color, linewidth = 2, marker='o')
ax2.set_ylabel('\nAverage Distance (Mi)', fontsize=13, color=color, weight='bold')
ax2.tick_params(axis='y', color=color)

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/airlines_by_month_vs_distance.png')
plt.show()

Observation 11: Above is a combination chart showing us the amount of carriers active each month along with the average distance flown. On average, we can count around 12-14 carriers on any given month but the chart shows us from September the amount of carriers reduces into December suggesting reduced holidays. The opposite seems true when we look at average distance travelled by month. This number is increaing from 620 miles on average to 640 miles suggesting fewer longer flights. This correlation suggests perhaps passengers are flying less often but further in the winter than the summer. This could mean going from a cold country to a hotter country or going to see family for the festive season in December.

4.3. Average Flight Delay vs Distance

In [98]:

# Creating a dataframe
DD1 = df8991['DepDelay'].groupby(df8991['Date'].map(lambda x: x.day)).mean()
DD2 = df8991['ArrDelay'].groupby(df8991['Date'].map(lambda x: x.day)).mean()
DD3 = df8991['Distance'].groupby(df8991['Date'].map(lambda x: x.day)).mean()

# Merge and reindex dataframes
delay_v_distance = pd.merge(left=DD1, right=DD2, how='left', on='Date')
delay_v_distance = pd.merge(left=delay_v_distance, right=DD3, how='left', on='Date')

# Saving as a new dataframe
delay_v_distance = pd.DataFrame(delay_v_distance)

# Show
delay_v_distance

Out[98]:

DepDelayArrDelayDistance
Date
16.7569606.279760633.227517
27.1485817.067214633.218274
37.7227417.702845633.915910
46.0650705.623313633.371921
56.0341985.818042632.001161
66.5258076.330133631.732588
76.0926755.939208632.502854
86.2428356.140405632.523106
96.5579276.414992632.940562
106.3418296.056148632.540616
116.2602305.892138631.703551
125.8196955.567792630.539311
135.7232005.511132631.097914
146.8273726.880148632.665304
157.6800697.815735632.968215
167.3983447.318719633.384753
176.8884036.965868632.278565
186.9535997.174428632.314703
197.5764207.859780631.405105
208.0489978.234841632.202658
218.1925668.112351633.127681
228.6835968.583389633.981356
237.3893916.976880634.602349
245.8780585.063008634.855267
255.8177955.020933633.658841
266.1684755.591412631.978279
276.9727676.594828632.447910
287.6833627.301313634.785688
296.6626496.500185635.010834
306.4565506.135588634.696271
317.3401706.895714635.679141

In [99]:

# Delay with distance correlation
delay_v_distance.corr()

Out[99]:

DepDelayArrDelayDistance
DepDelay1.0000000.9696790.305120
ArrDelay0.9696791.0000000.149177
Distance0.3051200.1491771.000000

In [100]:

delay_v_distance['TotalDelay'] = delay_v_distance['DepDelay']+delay_v_distance['ArrDelay']
delay_v_distance.corr()

Out[100]:

DepDelayArrDelayDistanceTotalDelay
DepDelay1.0000000.9696790.3051200.990868
ArrDelay0.9696791.0000000.1491770.993775
Distance0.3051200.1491771.0000000.221398
TotalDelay0.9908680.9937750.2213981.000000

In [101]:

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6), gridspec_kw={'width_ratios':[1,1]})

# Plot
fig.suptitle('Average Flight Delay vs Distance', fontsize=15)
sb.regplot(data = delay_v_distance, x = 'DepDelay', y = 'Distance', ax=ax1)
sb.regplot(data = delay_v_distance, x = 'ArrDelay', y = 'Distance', ax=ax2)

# Labels
ax1.title.set_text('')
ax1.set_xlabel('Departure Delay (Min)')
ax1.set_ylabel('Distance (Mi)')

ax2.title.set_text('')
ax2.set_xlabel('Arrival Delay (Min)')
ax2.set_ylabel('')

# Show
#plt.tight_layout(pad=3)
plt.subplots_adjust(top=0.9)
plt.savefig('images/charts/delay_vs_distance.png')
plt.show()

Observation 12: We used a scatterplot to show the average flight delay vs distance. Please keep in mind that we are using a daily average here to plot this data because millions of data points would not be clear. There doesn’t seem to be a clear correlation between the distance and departure or arrival delay but we can clearly see one when we add a line of best fit. We can clearly observe a positive correlation in both plots indicating as the flight distance increases the greater chance of delay increases too. We can move on to delays by airline for our univariate exploration section, which airline is the most punctual and maybe we can look at delay on various flight routes?

Step 5: Univariate Exploration ?

5.1. Routes by Flights, Airline and Average Delay

In [102]:

# Dataframe from columns
flight_route = df8991[['Origin','Dest']]

# Creating a new dataframe
flight_route = pd.DataFrame(flight_route)

# Creating columns
flight_route['Route'] = flight_route['Origin'].str.cat(flight_route['Dest'],sep=" - ")
flight_route['AverageDelay'] = df8991['DepDelay']+df8991['ArrDelay']
flight_route['Carrier'] = df8991['UniqueCarrier']
flight_route.head()

Out[102]:

OriginDestRouteAverageDelayCarrier
0SFOHNLSFO – HNL219UA
1SFOHNLSFO – HNL45UA
2SFOHNLSFO – HNL-19UA
3SFOHNLSFO – HNL-29UA
4SFOHNLSFO – HNL-37UA

In [103]:

flight_route.groupby(['Carrier', 'Route'])['AverageDelay'].mean().head()

Out[103]:

Carrier  Route    
AA       ABE - ALB    33.00000
         ABE - BDL     1.00000
         ABE - MDT    16.00000
         ABE - ORD     6.97619
         ABQ - DEN    52.00000
Name: AverageDelay, dtype: float64

In [104]:

# Top carriers by average delay 
flight_route.groupby('Carrier')['AverageDelay'].mean().sort_values(ascending=False)

Out[104]:

Carrier
PI        25.816885
UA        16.953211
DL        15.099069
EA        14.285615
WN        14.058281
US        13.584921
TW        13.468342
CO        12.723660
HP        11.795458
AA        11.353261
ML (1)    10.940089
AS        10.922795
PA (1)     9.541804
NW         8.327834
Name: AverageDelay, dtype: float64

In [105]:

# Top routes by average delay 
flight_route.groupby('Route')['AverageDelay'].mean().sort_values(ascending=False).iloc[0:10]

Out[105]:

Route
GUC - HDN    1980.000000
ICT - PDX     446.000000
BHM - LIT     407.000000
AMA - SLC     351.000000
MCI - ELP     271.000000
CLE - DAY     231.000000
SGF - TUL     220.333333
DCA - PHL     211.500000
TUL - MSP     204.000000
ORD - SUX     174.000000
Name: AverageDelay, dtype: float64

In [106]:

# Top carriers by total routes
flight_route.groupby('Carrier')['Route'].count().sort_values(ascending=False)

Out[106]:

Carrier
US        2575160
DL        2430469
AA        2084513
UA        1774850
NW        1352989
CO        1241287
WN         950837
TW         756992
HP         623477
EA         398362
PI         281905
AS         267975
PA (1)     223079
ML (1)      69119
Name: Route, dtype: int64

In [107]:

# Top routes by flights
flight_route['Route'].value_counts().iloc[0:10]

Out[107]:

SFO - LAX    69180
LAX - SFO    68754
LAX - PHX    39321
PHX - LAX    38756
LAX - LAS    31549
PHX - LAS    30778
LAS - LAX    30296
LAS - PHX    29775
LGA - ORD    29656
ORD - MSP    29308
Name: Route, dtype: int64

In [108]:

# Creating a new dataframe
top_10_flight_route = pd.DataFrame(flight_route['Route'].value_counts().iloc[0:10])
top_10_flight_route

Out[108]:

Route
SFO – LAX69180
LAX – SFO68754
LAX – PHX39321
PHX – LAX38756
LAX – LAS31549
PHX – LAS30778
LAS – LAX30296
LAS – PHX29775
LGA – ORD29656
ORD – MSP29308

In [109]:

# Creating list
delay_list = []
item0 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[0]].mean())
item1 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[1]].mean())
item2 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[2]].mean())
item3 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[3]].mean())
item4 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[4]].mean())
item5 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[5]].mean())
item6 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[6]].mean())
item7 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[7]].mean())
item8 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[8]].mean())
item9 = list(flight_route[flight_route['Route'] == top_10_flight_route.reset_index()['index'].loc[9]].mean())

# Creating a new dataframe
delay_list.append(item0+item1+item2+item3+item4+item5+item6+item7+item8+item9)
top_10_flight_delay = pd.DataFrame(delay_list)
top_10_flight_delay = pd.melt(top_10_flight_delay).drop(['variable'], axis=1).rename(columns={"value":"AverageDelay"})
top_10_flight_delay.index = top_10_flight_route.index
top_10_flight_delay

Out[109]:

AverageDelay
SFO – LAX14.286889
LAX – SFO18.972918
LAX – PHX15.574909
PHX – LAX16.178011
LAX – LAS11.177407
PHX – LAS15.854929
LAS – LAX11.442567
LAS – PHX14.501092
LGA – ORD14.211762
ORD – MSP14.661662

In [110]:

top_10_flight_route = pd.concat([top_10_flight_route, top_10_flight_delay], axis=1).reindex(top_10_flight_route.index)
top_10_flight_route.rename(columns={"Route":"Flights"}, inplace=True)
top_10_flight_route['AverageDelay'] = round(top_10_flight_route['AverageDelay'], 2)
top_10_flight_route.index.name = 'Route'

In [111]:

top_10_flight_route

Out[111]:

FlightsAverageDelay
Route
SFO – LAX6918014.29
LAX – SFO6875418.97
LAX – PHX3932115.57
PHX – LAX3875616.18
LAX – LAS3154911.18
PHX – LAS3077815.85
LAS – LAX3029611.44
LAS – PHX2977514.50
LGA – ORD2965614.21
ORD – MSP2930814.66

In [112]:

# Creating a dataframe for average delay by route and carrier
carrier_route_delay = pd.DataFrame(flight_route.groupby(['Carrier', 'Route'])['AverageDelay'].mean())
carrier_route_delay.head()

Out[112]:

AverageDelay
CarrierRoute
AAABE – ALB33.00000
ABE – BDL1.00000
ABE – MDT16.00000
ABE – ORD6.97619
ABQ – DEN52.00000

In [113]:

carrier_route_delay.filter(like='SFO - LAX', axis=0)

Out[113]:

AverageDelay
CarrierRoute
AASFO – LAX6.917208
ASSFO – LAX20.866279
COSFO – LAX12.990581
DLSFO – LAX13.237706
NWSFO – LAX20.851767
PA (1)SFO – LAX8.613192
TWSFO – LAX13.563017
UASFO – LAX17.250478
USSFO – LAX17.044848

In [114]:

# Get index names
carrier_route_delay.index.names

Out[114]:

FrozenList(['Carrier', 'Route'])

In [115]:

list(carrier_route_delay.index[0])

Out[115]:

['AA', 'ABE - ALB']

In [116]:

carrier_route_delay['RouteIndex'] = carrier_route_delay.reset_index(level=0, drop=True).index #.reset_index(level=0, drop=True) #.str.extractall(r"(A-Z)").index
carrier_route_delay['RouteIndex'] = carrier_route_delay['RouteIndex']
carrier_route_delay.head()

Out[116]:

AverageDelayRouteIndex
CarrierRoute
AAABE – ALB33.00000ABE – ALB
ABE – BDL1.00000ABE – BDL
ABE – MDT16.00000ABE – MDT
ABE – ORD6.97619ABE – ORD
ABQ – DEN52.00000ABQ – DEN

In [117]:

carrier_route_delay['RouteIndex'].iloc[0]

Out[117]:

'ABE - ALB'

In [118]:

carrier_route_delay.filter(like="ABE - ALB", axis=0)

Out[118]:

AverageDelayRouteIndex
CarrierRoute
AAABE – ALB33.0ABE – ALB

In [119]:

carrier_route_delay.filter(like=top_10_flight_route.index[1], axis=0)

Out[119]:

AverageDelayRouteIndex
CarrierRoute
AALAX – SFO8.280271LAX – SFO
ASLAX – SFO22.179775LAX – SFO
COLAX – SFO19.412461LAX – SFO
DLLAX – SFO21.431304LAX – SFO
NWLAX – SFO19.772613LAX – SFO
PA (1)LAX – SFO21.444831LAX – SFO
TWLAX – SFO32.326835LAX – SFO
UALAX – SFO20.176455LAX – SFO
USLAX – SFO20.680116LAX – SFO

In [120]:

carrier_route_delay.drop(['RouteIndex'], axis=1, inplace=True)

In [121]:

# Merge index 0-1
item0 = carrier_route_delay.filter(like=top_10_flight_route.index[0], axis=0)
item1 = carrier_route_delay.filter(like=top_10_flight_route.index[1], axis=0)
item0_1 = item0.merge(item1, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 2-3
item2 = carrier_route_delay.filter(like=top_10_flight_route.index[2], axis=0)
item3 = carrier_route_delay.filter(like=top_10_flight_route.index[3], axis=0)
item2_3 = item2.merge(item3, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_3 = item0_1.merge(item2_3, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 4-5
item4 = carrier_route_delay.filter(like=top_10_flight_route.index[4], axis=0)
item5 = carrier_route_delay.filter(like=top_10_flight_route.index[5], axis=0)
item4_5 = item4.merge(item5, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_5 = merge0_3.merge(item4_5, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 6-7
item6 = carrier_route_delay.filter(like=top_10_flight_route.index[6], axis=0)
item7 = carrier_route_delay.filter(like=top_10_flight_route.index[7], axis=0)
item6_7 = item5.merge(item6, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_7 = merge0_5.merge(item6_7, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Merge index 8-9
item8 = carrier_route_delay.filter(like=top_10_flight_route.index[8], axis=0)
item9 = carrier_route_delay.filter(like=top_10_flight_route.index[9], axis=0)
item8_9 = item8.merge(item9, how="outer", on="AverageDelay", left_index=True, right_index=True)
merge0_9 = merge0_7.merge(item8_9, how="outer", on="AverageDelay", left_index=True, right_index=True)

# Print dataframe
carrier_ra = merge0_9
carrier_ra

Out[121]:

AverageDelay
CarrierRoute
AALAS – LAX-0.577778
LAX – LAS1.730290
LAX – SFO8.280271
LGA – ORD14.982707
ORD – MSP19.809837
SFO – LAX6.917208
ASLAX – SFO22.179775
SFO – LAX20.866279
COLAS – LAX6.295775
LAX – LAS2.410959
LAX – SFO19.412461
SFO – LAX12.990581
DLLAS – LAX11.196488
LAX – LAS13.751666
LAX – PHX16.196706
LAX – SFO21.431304
PHX – LAS24.553367
PHX – LAX13.126173
SFO – LAX13.237706
HPLAS – LAX13.548347
LAX – LAS8.688118
LAX – PHX13.020739
PHX – LAS15.177153
PHX – LAX15.904401
ML (1)LAS – LAX118.200000
LAX – LAS54.857143
PHX – LAS30.956522
NWLAX – SFO19.772613
ORD – MSP11.556449
SFO – LAX20.851767
PA (1)LAX – SFO21.444831
SFO – LAX8.613192
TWLAS – LAX17.000000
LAX – LAS4.000000
LAX – PHX5.272727
LAX – SFO32.326835
ORD – MSP7.494118
PHX – LAS27.221639
PHX – LAX76.833333
SFO – LAX13.563017
UALAS – LAX36.938462
LAX – LAS17.198347
LAX – SFO20.176455
LGA – ORD13.425165
ORD – MSP16.037805
PHX – LAX16.000000
SFO – LAX17.250478
USLAS – LAX9.396913
LAX – LAS13.447669
LAX – SFO20.680116
PHX – LAS0.968202
SFO – LAX17.044848
WNLAS – LAX11.451827
LAX – LAS1.762376
LAX – PHX17.719184
PHX – LAS15.269559
PHX – LAX17.112161

In [122]:

# Rename column for merge
carrier_ra_idx = carrier_ra.reset_index().rename(columns={"Carrier":"code"})

# Merge to get carrier names
carrier_ra_idx = pd.merge(left=carrier_ra_idx, right=carriers, how='left', on='code')
carrier_ra_idx['description'] = carrier_ra_idx['description'].str.replace(r"\(.*\)","")
carrier_ra_idx.tail()

Out[122]:

codeRouteAverageDelaydescription
52WNLAS – LAX11.451827Southwest Airlines Co.
53WNLAX – LAS1.762376Southwest Airlines Co.
54WNLAX – PHX17.719184Southwest Airlines Co.
55WNPHX – LAS15.269559Southwest Airlines Co.
56WNPHX – LAX17.112161Southwest Airlines Co.

In [123]:

list(carrier_ra_idx.reset_index()['description'].unique())

Out[123]:

['American Airlines Inc.',
 'Alaska Airlines Inc.',
 'Continental Air Lines Inc.',
 'Delta Air Lines Inc.',
 'America West Airlines Inc. ',
 'Midway Airlines Inc. ',
 'Northwest Airlines Inc.',
 'Pan American World Airways ',
 'Trans World Airways LLC',
 'United Air Lines Inc.',
 'US Airways Inc. ',
 'Southwest Airlines Co.']

In [124]:

# Colors
colors = ["#f94144","#f3722c","#f8961e","#f9844a","#f9c74f","#90be6d","#43aa8b","#4d908e","#577590","#277da1","#284b63","#463f3a"] # 12 Airlines

# Plot
carrier_ra.unstack(level=0).plot(kind='bar', figsize=(14.70, 8.27), color=colors, subplots=False)

# Title
plt.title('Average Airline Delay on Popular Routes', fontsize=15, weight='bold')

# Label
plt.xlabel('Route', weight='bold')
plt.ylabel('Delay (Min)', weight='bold')

# Legend
carrier_list = list(carrier_ra.reset_index()['Carrier'].unique())
long_carrier_list = list(carrier_ra_idx.reset_index()['description'].unique())
plt.legend(long_carrier_list, loc='0', bbox_to_anchor=(1.01, 1.01))

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/routes_by_flights_airline_average_delay.png')
plt.show()

In [125]:

# Plot
g = sns.catplot(data=carrier_ra_idx, kind='swarm', x='AverageDelay', y='Route', hue='description', height=10, s=9)
g._legend.set_title("Carriers")

# Label
plt.xlabel('Average Delay')

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/routes_by_flights_airline_average_delay_2.png')
plt.show()

Observation 13: Which airline is the most punctual and which is the least? And does flight route have an affect on delay times per carrier? LAS – LAX shows us many interesting insights. For example, American Airlines has the fewest delays on this route with an average delay of just -0.57 minutes, this means they are actually early. Midway Airlines Inc. has the most delays totalling an average of 118.2 minutes on this route which is astonishing. Things don’t fair too well for Midway on the return leg with an average delay of 54.8 minutes. For PHX – LAX, Trans World Airways LLC tops the charts with an average of 76.8 minutes and Delta Air Lines Inc. has the fewest delays with just 13.1 minutes for delays on average. For our final conclusion, we can say the average delay time is just 18.36 minutes if all carriers and routes are considered which is pretty good. The outlier here is the maximum delay of 118 minutes which is 644% of the delay average of 18.36 minutes. Let’s see which airline is the most punctual overall, can it still be Delta Air Lines Inc?In [126]:

# Filter route data
carrier_ra.filter(like='PHX - LAX', axis=0).sort_values(by='AverageDelay', ascending=False)

Out[126]:

AverageDelay
CarrierRoute
TWPHX – LAX76.833333
WNPHX – LAX17.112161
UAPHX – LAX16.000000
HPPHX – LAX15.904401
DLPHX – LAX13.126173

In [127]:

carrier_ra.describe()

Out[127]:

AverageDelay
count57.000000
mean18.367970
std18.243396
min-0.577778
25%11.196488
50%15.269559
75%20.176455
max118.200000

In [128]:

# Difference between max delay and mean delay in %
carrier_ra.max()/carrier_ra.mean()*100

Out[128]:

AverageDelay    643.511507
dtype: float64

5.2. Top 10 Airlines by Average Delay

In [129]:

# Dataframe from columns
top_10_carriers_by_delay = df8991['DepDelay']+df8991['ArrDelay']

# Creating a new dataframe
top_10_carriers_by_delay = pd.DataFrame(top_10_carriers_by_delay)

# Creating columns
top_10_carriers_by_delay['Carrier'] = df8991['UniqueCarrier']
top_10_carriers_by_delay.columns = ['AverageDelay','Carrier']
top_10_carriers_by_delay['DepDelay'] = df8991['DepDelay']
top_10_carriers_by_delay['ArrDelay'] = df8991['ArrDelay']

# Rename column for merge
top_10_carriers_by_delay = top_10_carriers_by_delay.rename(columns={"Carrier":"code"})

# Merge to get carrier names
top_10_carriers_by_delay = pd.merge(left=top_10_carriers_by_delay, right=carriers, how='left', on='code')
top_10_carriers_by_delay['Carrier'] = top_10_carriers_by_delay['description'].str.replace(r"\(.*\)","")
top_10_carriers_by_delay = top_10_carriers_by_delay.groupby(by='Carrier').mean().sort_values(by='AverageDelay', ascending=True)
top_10_carriers_by_delay

Out[129]:

AverageDelayDepDelayArrDelay
Carrier
Northwest Airlines Inc.8.3278344.4874613.840373
Pan American World Airways9.5418045.2637684.278036
Alaska Airlines Inc.10.9227954.8829486.039847
Midway Airlines Inc.10.9400896.1924804.747609
American Airlines Inc.11.3532615.9297515.423510
America West Airlines Inc.11.7954586.3017215.493738
Continental Air Lines Inc.12.7236606.7351975.988463
Trans World Airways LLC13.4683426.6994086.768934
US Airways Inc.13.5849216.8250336.759888
Southwest Airlines Co.14.0582817.9463576.111925
Eastern Air Lines Inc.14.2856158.2227126.062903
Delta Air Lines Inc.15.0990696.9699738.129096
United Air Lines Inc.16.9532118.5794758.373736
Piedmont Aviation Inc.25.81688511.97400513.842880

In [130]:

# Plot
top_10_carriers_by_delay.plot(kind='barh', figsize=(14.70, 8.27)) 

# Title
plt.title('Airline Carriers by Average Delay', fontsize=14, weight='bold')

# Label
plt.xlabel('Delay in Minutes (Lower is better)', fontsize=10, weight='bold')
plt.ylabel('Carrier', fontsize=10, weight='bold')

plt.legend(['Average Delay', 'Departure Delay', 'Arrival Delay'])

# Plot
plt.savefig('images/charts/airlines_by_average_delays.png')
plt.show()

Observation 14: Which airline is the most punctual? Northwest Airlines has the lowest average delay on flights for departure and arrival times combined. Piedmont Aviation is the least punctual carrier with almost 26 minutes average delay times with a 52% increased average delay time vs Northwest Airlines. Another trend we can notice is that departure delays are 5% higher than arrival delays on average suggesting various factors cause this like aircraft prepeation and passengers. Unfortunately Delta Air Lines Inc. is 3rd from bottom in terms of punctuality despite performing well in the most popular routes observed earlier.In [131]:

# Departure delays vs Arrival delays % difference
top_10_carriers_by_delay['DepDelay'].mean()/top_10_carriers_by_delay['ArrDelay'].mean()-1

Out[131]:

0.05605593713455348

In [132]:

# Percent change of delay between top 10 airlines
top_10_carriers_by_delay.pct_change()

Out[132]:

AverageDelayDepDelayArrDelay
Carrier
Northwest Airlines Inc.NaNNaNNaN
Pan American World Airways0.1457730.1729950.113964
Alaska Airlines Inc.0.144731-0.0723470.411827
Midway Airlines Inc.0.0015830.268185-0.213952
American Airlines Inc.0.037767-0.0424270.142367
America West Airlines Inc.0.0389490.0627290.012949
Continental Air Lines Inc.0.0786910.0687870.090053
Trans World Airways LLC0.058527-0.0053140.130329
US Airways Inc.0.0086560.018752-0.001336
Southwest Airlines Co.0.0348450.164296-0.095854
Eastern Air Lines Inc.0.0161710.034778-0.008021
Delta Air Lines Inc.0.056942-0.1523510.340793
United Air Lines Inc.0.1227980.2309190.030094
Piedmont Aviation Inc.0.5228320.3956570.653131

Step 6: Random Exploration ?

6.1. Plane Data

We can round off our exploration by looking at the plane_data table. There we can find data about plane types, manufacturers, models, engine types and issue dates.In [133]:

plane_data.head()

Out[133]:

tailnumtypemanufacturerissue_datemodelstatusaircraft_typeengine_typeyear
0N10156CorporationEMBRAER02/13/2004EMB-145XRValidFixed Wing Multi-EngineTurbo-Fan2004
1N102UWCorporationAIRBUS INDUSTRIE05/26/1999A320-214ValidFixed Wing Multi-EngineTurbo-Fan1998
2N10323CorporationBOEING07/01/1997737-3TOValidFixed Wing Multi-EngineTurbo-Jet1986
3N103USCorporationAIRBUS INDUSTRIE06/18/1999A320-214ValidFixed Wing Multi-EngineTurbo-Fan1999
4N104UACorporationBOEING01/26/1998747-422ValidFixed Wing Multi-EngineTurbo-Fan1998

In [134]:

# Remove entries titled 'None' from year column
plane_data = plane_data[~plane_data['year'].isin(['None'])]

# Changing 'year' datatype from string to integer
plane_data['year'] = plane_data['year'].astype('int')

# Filtering years 1989 to 1991 and assigning to new df
plane_data_8991 = plane_data[(plane_data['year'] >= 1989) & (plane_data['year'] <= 1991)]

In [135]:

# Most popular aircraft manufacturers between 89-91
plane_data_8991['manufacturer'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Aircraft Manufacturers from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Aircraft', fontsize=13)

# Plot
plt.savefig('images/charts/top_manufacturers.png')
plt.show()

Which aircraft manufacturer is the most popular, and what is the market distribution like? Boeing is the most popular aircraft manufacturer by far which is no surprise. McDonnell Douglas takes 2nd and 3rd spot. Perhaps we can see which models are the most popular?In [136]:

# Top manufacturers
plane_data_8991['manufacturer'].value_counts().iloc[0:10].sort_values(ascending=False)

Out[136]:

BOEING                           267
MCDONNELL DOUGLAS AIRCRAFT CO     64
MCDONNELL DOUGLAS                 46
AIRBUS INDUSTRIE                  31
SAAB-SCANIA                        2
MCDONNELL DOUGLAS CORPORATION      1
Name: manufacturer, dtype: int64

In [137]:

# Market share
plane_data_8991['manufacturer'].value_counts(normalize=True)

Out[137]:

BOEING                           0.649635
MCDONNELL DOUGLAS AIRCRAFT CO    0.155718
MCDONNELL DOUGLAS                0.111922
AIRBUS INDUSTRIE                 0.075426
SAAB-SCANIA                      0.004866
MCDONNELL DOUGLAS CORPORATION    0.002433
Name: manufacturer, dtype: float64

In [138]:

# Most popular aircraft maodels between 89-91
plane_data_8991['model'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Aircraft Models from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Aircraft', fontsize=13)

# Plot
plt.savefig('images/charts/top_models.png')
plt.show()

We know Boeing is the most popular aircraft manufacturer but what aircraft model is the most popular? MD-88 has the most popular aircraft model which is interesting since Boeing dominates the market. Boeing still has 7 aircraft in the top 10 though. What about engine types?In [139]:

# Top 10 aircraft from 89-91
plane_data_8991['model'].value_counts().iloc[0:10].sort_values(ascending=False)

Out[139]:

MD-88             65
757-222           46
DC-9-82(MD-82)    35
737-4B7           27
757-232           27
757-223           27
737-322           26
737-5H4           18
767-332           17
A320-211          16
Name: model, dtype: int64

In [140]:

# Percent change between values
plane_data_8991['model'].value_counts().iloc[0:10].sort_values(ascending=False).pct_change()

Out[140]:

MD-88                  NaN
757-222          -0.292308
DC-9-82(MD-82)   -0.239130
737-4B7          -0.228571
757-232           0.000000
757-223           0.000000
737-322          -0.037037
737-5H4          -0.307692
767-332          -0.055556
A320-211         -0.058824
Name: model, dtype: float64

In [141]:

# Most popular engine types between 89-91
plane_data_8991['engine_type'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Engine Types from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Engines', fontsize=13)

# Plot
plt.savefig('images/charts/top_engine_types.png')
plt.show()

What is the most popular engine type? The most common engine type is Turbo-Fan with 234 records, followed by Turbo-Jet with 175 then Turbo-Prop with just 2 records.

Observation: Boeing is by far the most popular aircraft manufacturer between 1989 and 1991 with 267 aircraft totalling 65% of the market. The top 6 is rounded out by McDonnel Douglas and Airbus Industrie while SAAB-Scania and MD Corporation rank 5th and 6th. Interestingly, despite Boeing’s dominance the most popular aircraft is a McDonnel Douglas MD-88 with 65 aircraft recorded over our 3 year sample. The rest of the top 10 is dominated by 7 Boeing aircraft, 3rd spot is filled by a McDonnel Douglas DC-9 and 10th is taken by an Airbus A320-211. Finally on engine types, the most common engine type is Turbo-Fan with 234 records, followed by Turbo-Jet with 175 then Turbo-Prop with just 2 records.In [142]:

plane_data_8991['engine_type'].value_counts().iloc[0:10].sort_values(ascending=False)

Out[142]:

Turbo-Fan     234
Turbo-Jet     175
Turbo-Prop      2
Name: engine_type, dtype: int64

Step 7: Conclusion ?

In [143]:

Image.open('images/concorde.jpg')

Out[143]:

7.1. Interesting Findings

From our data, we can assess passengers fly more during the summer and least during the winter and more during midweek versus the weekend. The most popular time to travel for passengers is in the morning around 8:00. Most flights average a 110 minute flight time and an average distance is 632 miles. Interestingly, the average distance for flights increases during the winter months suggesting that passengers commute further to go on distant holidays or visit family and friends across the country. Airline carriers enjoy a competitive market with no more than 36% difference in total flights between carriers. Regarding carriers, we observed that flight delays increase with flight distance. Chicago O’Hare International is the busiest airport with over 1.5 million flights recorded over the 3 year period.

7.2. Project Answers

How many flights are there?\ 15,031,014 flights.

Which airlines have the most flights?\ US Airways Inc.

What time of the week passengers fly the most?\ Monday to Friday.

How does season change the frequency and the destination of travel?\ Summer increases the frequency of average flights whilst winter decreases it.

Which routes are the most popular?

  1. SFO – LAX | 69,180 Flights
  2. LAX – SFO | 68,754 Flights
  3. LAX – PHX | 39,321 Flights

LAX: Los Angeles Intl, SFO: San Fransisco Intl, PHX: Phoenix Sky Harbor Intl.

Which routes and airports experience the most delays?\ GUC – HDN Gunisson to Haydon | 1980 minutes on average for route.\ PSE Mercedita Airport | 53.25 minutes delay on average is the highest for an airport.\ BQN Rafael Hernández Airport | -10.35 minutes delay (early) on average is lowest for an airport.

Which airports are the busiest in terms of inbound and outbound flights?\ Chicago O’Hare International.

Which popular routes are delayed the most?\ LAX – SFO | 68,754 Flights, 18.97 minutes delay on average.\ PHX – LAX | 38,756 Flights, 16.18 minutes delay on average.\ PHX – LAS | 30,778 Flights, 15.85 minutes delay on average.

Which times are airports the busiest in terms of flights by each hour?\ 08:00 – 09:00 has over 1 million flights during the morning.

How does flight distance affect departure and arrival delays?\ Flight distance increases total departure and arrival delays with 0.221398 positive correlation.

Which airline experiences the fewest delays?\ Northwest Airlines.

7.3. Key Insights for Presentation

For the presentation, I decided to look at flight counts using simple vertical and horizontal bar charts before moving on to more complex plots for deeper analysis. I used an area plot for flight duration because it looked visually clearer than using a line chart. I used a histogram for the flight distance plot to keep things interesting since my plan was to use many different plot types for the slides.

For total flights by carrier I decided to use a colourful pie chart which should catch the audiences attention since unique and vibrant colour selection is important. I also decided to explode the largest slice of the pie chart. For the delay section I used a combination of line charts and horizontal bar charts since it was the best way to show bivariate data that relies on categorical data. Next, I introduced a third variable which was categorical so I decided to use a heatmap. Heatmaps are great at showing multiple dimensions of data in a fun way, in our case it was for airport name, flight totals, and month. The darker the colour the more flights per grid box.

I used a combination chart to plot airlines vs month vs average distance. This allowed me to show multiple variables co-existing in the same axis clearly. Average flight delay vs distance used the scatterplot, another plot I was keen to try as it engages the audience with raw individual data points. This allowed us to use a line of best fit to intercept the points to visually see a correlation. For routes by flights vs airline average delay it was the perfect time to use a grouped bar chart, I found some interesting colour palettes for my chart from coolor.co. I could plot multiple airlines for each route on the same bar axis and take an average of them if we wanted to. I used a swarm plot on the same data for another angle to see if we could find out more. Average airlines by delay had a similar treatment with the grouped bar chart instead of replacing routes with airline carriers, this time I used a more modest colour palette. The random exploration at the end used simple bar charts again as they are quick and easy to explore data.

 
Author

Write A Comment