Project 1 - Investigate a Dataset

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction

Gapminder datasets are being used to investigate the relationship between GDP/GNI per capita, standard of living, carbon footprint, healthcare and the importance of export in each country’s economy.

Few questions which are of interest:

1) What is the income per capita distribution by country?
2) Does lower level of corruption lead to higher income (GNI/capita)?
3) Does higher income (GNI/capita) lead to higher standard of living (HDI)?
4) What are the characteristics of export-dependent nations?
5) Will higher level of expenditure in healthcare, especially if led by government, contribute to higher life expectancy?
6) Do higher economic output (GDP/capita) and production of oil lead to higher carbon footpirnt (CO2 emission/capita)?

In [1]:

#Importing all required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Data Wrangling

Datasets are loaded, cleaned and merged in order to enable the studying of relationship between variables.

General Properties

Few datasets are printed out to evaluate the quality and completion of the data.

In [2]:

gdp_pc = pd.read_excel('indicator gapminder gdp_per_capita_ppp.xlsx', sheet_name='Data')
gni_pc = pd.read_excel('indicatorGNIpercapitaATLAS.xlsx', sheet_name='Data')
cpi_country = pd.read_excel('indicator ti cpi 2009.xlsx', sheet_name='Data')
hdi_country = pd.read_excel('Indicator_HDI.xlsx', sheet_name='Data')
life_expec = pd.read_excel('indicator life_expectancy_at_birth.xlsx', sheet_name='Data')
total_health = pd.read_excel('indicator_per capita total expenditure on health (ppp int. $).xlsx', sheet_name='Data')
government_health = pd.read_excel('indicator_per capita government expenditure on health (ppp int. $).xlsx', sheet_name='Data')
export_perc = pd.read_excel('Exports (p of GDP).xlsx', sheet_name='Data')
female = pd.read_excel('indicator_total population female.xlsx', sheet_name='Data')
male = pd.read_excel('indicator_total population male.xlsx', sheet_name='Data')
co2_pc = pd.read_excel('indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx', sheet_name='Data')
oil_prod = pd.read_excel('Oil Production.xlsx', sheet_name='Data')

In [3]:

#First few rows of all datasets are being loaded and printed out
gdp_pc.head()

#Abkhazia and Akrotiri and Dhekelia don't contain any value because it might be 
#difficult to collect data in unstable regions without functional government. 
#Thus, countries / regions with missing data should be excluded.

Out[3]:

	Country	1800	1801	1802	1803	1804	1805	1806	1807	1808	…	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015
0	Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	603.0	603.0	603.0	603.0	603.0	603.0	603.0	603.0	603.0	…	1173.0	1298.0	1311.0	1548.0	1637.0	1695.0	1893.0	1884.0	1877.0	1925.0
2	Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	667.0	667.0	668.0	668.0	668.0	668.0	668.0	668.0	668.0	…	7476.0	7977.0	8644.0	8994.0	9374.0	9640.0	9811.0	9961.0	10160.0	10620.0
4	Algeria	716.0	716.0	717.0	718.0	719.0	720.0	721.0	722.0	723.0	…	12088.0	12289.0	12314.0	12285.0	12494.0	12606.0	12779.0	12893.0	13179.0	13434.0

5 rows × 217 columns

In [4]:

gni_pc.head()

#There are more missing data for earlier years (before 2000's), 
#therefore only recent data should be used to ensure quality of the study.

Out[4]:

	Country	1962	1963	1964	1965	1966	1967	1968	1969	1970	…	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015
0	Aruba	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Andorra	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	42740.0	44250.0	43760.0	44180.0	41670.0	40580.0	41010.0	43270.0	45655.0	48170.0
2	Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	280.0	340.0	370.0	470.0	520.0	570.0	690.0	690.0	670.0	651.0
3	Angola	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	1620.0	2350.0	3010.0	3490.0	3550.0	3690.0	4150.0	4730.0	4850.0	4973.0
4	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	3120.0	3530.0	4080.0	4280.0	4360.0	4390.0	4370.0	4510.0	4460.0	4411.0

5 rows × 55 columns

In [5]:

#Data type are inspected
gdp_pc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Columns: 217 entries, Country to 2015
dtypes: float64(216), object(1)
memory usage: 444.2+ KB

In [6]:

cpi_country.info()

#Since the CPI dataset only contains year 2008 and 2009, only data 
#of year 2008 or 2009 should be used when datasets are combined

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
Country    180 non-null object
2008       180 non-null float64
2009       176 non-null float64
dtypes: float64(2), object(1)
memory usage: 4.3+ KB

In [7]:

#Table for GNI per capita after data are binned into category

gni_bin = gni_pc.loc[:,('Country', 2015)] #Year 2015 is chosen as it is the latest data
gni_bin = gni_bin.dropna() #Countries with NaN values are dropped
bins = [0, 1025, 4035, 12475, 1000000] #Income classification based on World Bank data
group_names = ['Low Income', 'Lower Middle Income', 'Upper Middle Income', 'High Income'] #4 Categories to classify the income of citizens
categories = pd.cut(gni_bin[2015], bins, labels=group_names)
binned_gni = pd.value_counts(categories) #Summarized the count for each classification group
binned_gni

#https://blogs.worldbank.org/opendata/new-country-classifications-2016
#low Income <$1025, Lower Middle Income $1,026-$4,035, Upper Middle Income $4,036-$12,475, High Income >$12,475

Out[7]:

Upper Middle Income    69
High Income            67
Lower Middle Income    53
Low Income             30
Name: 2015, dtype: int64

In [8]:

#Table for CPI index and GNI/capita of year 2009 by country is constructed

gni_percapita = gni_pc[['Country', '2009']] #Year 2009 is used as it is the latest data available for CPI
gni_percapita = gni_percapita.dropna() #NAs are dropped to ensure the quality of the study
gni_percapita.rename(columns={'2009':'GNI per capita'}, inplace=True) #Name of column is changed to make table columns clearer

cpi = cpi_country[['Country', 2009]] #Year 2009 is used as it is the latest data available
cpi = cpi.dropna() #NAs are dropped to make relationship easier to study
cpi.rename(columns={2009:'CPI'}, inplace=True) #Name of column is changed to make table columns clearer


cpi_gni_pc = pd.merge(left=cpi, right=gni_percapita, left_on='Country', right_on='Country') 
#Both datasets are merged using inner merge

cpi_gni_pc.head()

Out[8]:

	Country	CPI	GNI per capita
0	New Zealand	9.3	29410.0
1	Denmark	9.3	59840.0
2	Sweden	9.2	51900.0
3	Singapore	9.3	37080.0
4	Switzerland	8.7	70230.0

In [9]:

#Table for 2011 HDI index and GNI/capita by country is constructed

gni_percapita = gni_pc[['Country', '2011']] #Year 2011 is used as it is the latest data available
gni_percapita = gni_percapita.dropna() #NAs are dropped to make relationship easier to study
gni_percapita.rename(columns={'2011':'GNI per capita'}, inplace=True) #Name of column is changed to make table columns clearer

hdi = hdi_country[['Country', 2011]] #Year 2011 is used as it is the latest data availablex
hdi = hdi.dropna() #NAs are dropped to make relationship easier to study
hdi.rename(columns={2011:'HDI'}, inplace=True) #Name of column is changed to make table columns clearer

gni_pc_hdi = pd.merge(left=gni_percapita, right=hdi, left_on='Country', right_on='Country') #Both datasets are merged using inner merge

gni_pc_hdi.head()

Out[9]:

	Country	GNI per capita	HDI
0	Andorra	40580.0	0.838
1	Afghanistan	570.0	0.398
2	Angola	3690.0	0.486
3	Albania	4390.0	0.739
4	United Arab Emirates	35150.0	0.846

In [10]:

#Table for 2010 population, GDP/capita and Export Value/GDP in % by country is constructed

female_pop = female[['Country', 2010]] #Year 2010 is selected
female_pop = female_pop[28:] #Only individual country data is selected
female_pop = female_pop.dropna() #NAs are dropped

male_pop = male[['Country', 2010]] #Year 2010 is selected
male_pop = male_pop[28:] #Only individual country data is selected
male_pop = male_pop.dropna() #NAs are dropped

population = pd.merge(left=female_pop, right=male_pop, left_on='Country', right_on='Country') #Female and male pop. are merged into one table
population.rename(columns={'2010_x':'Female', '2010_y':'Male'}, inplace=True) #Column names are replaced to reflect the female and male pop.
population['Population'] = population['Female'] + population['Male'] #A new column for population is created by adding up female and male pop.
population['Population'] = (population['Population']).astype(int)
population = population[['Country', 'Population']] #Female and male population columns are excluded

gdp_percapita = gdp_pc[['Country', 2010]] #Year 2010 is used to match export % dataset
gdp_percapita = gdp_percapita.dropna() #NAs are dropped
gdp_percapita.rename(columns={2010:'GDP per capita'}, inplace=True) #Column name is changed

export_percentage = export_perc[['Country', '2010']] #Year 2010 is used as there are less missing data when compared to 2011
export_percentage = export_percentage.dropna() #NAs are dropped
export_percentage.rename(columns={'2010':'Export_perc'}, inplace=True) #Column name is changed

export_pop = pd.merge(left=export_percentage, right=population, left_on='Country', right_on='Country') #Inner merged is used
export_pop_gdp_pc = pd.merge(left=export_pop, right=gdp_percapita, left_on='Country', right_on='Country') #Inner merge is used again to combine 3 datasets

export_pop_gdp_pc.head()

Out[10]:

	Country	Export_perc	Population	GDP per capita
0	Afghanistan	20.124519	29117489	1637.0
1	Albania	32.427471	3169087	9374.0
2	Algeria	30.830405	35422589	12494.0
3	Angola	62.413106	18992707	7047.0
4	Argentina	21.706469	40665732	15765.0

In [11]:

#Table for 2010 life expectancy, total health expenditure, government health expenditure and GDP/capita by country is constructed

life_expectancy = life_expec[['Country', 2010]] #Year 2010 is used as it is the latest data available
life_expectancy = life_expectancy.dropna() #NAs are dropped
life_expectancy.rename(columns={2010:'Life Expectancy'}, inplace=True) #Name of column is changed to reflect actual column data

total_health_exp = total_health[['Country', 2010]] #Year 2010 is used as it is the latest data available
total_health_exp = total_health_exp.dropna() #NAs are dropped
total_health_exp.rename(columns={2010:'Total Health Exp'}, inplace=True) #Name of column is changed to reflect actual column data

government_health_exp = government_health[['Country', 2010]] #Year 2010 is used as it is the latest data available
government_health_exp = government_health_exp.dropna() #NAs are dropped
government_health_exp.rename(columns={2010:'Government Health Exp'}, inplace=True) #Name of column is changed to reflect actual column data

life_totalexp = pd.merge(left=life_expectancy, right=total_health_exp, left_on='Country', right_on='Country') #Inner merged is used
life_totalexp_governmentexp = pd.merge(left=life_totalexp, right=government_health_exp, left_on='Country', right_on='Country') #Inner merge is used to combine 3 datasets
life_totalexp_governmentexp_gdp_pc = pd.merge(left=life_totalexp_governmentexp, right=gdp_percapita, left_on='Country', right_on='Country') #Inner merge is used again to combine 4 datasets
life_totalexp_governmentexp_gdp_pc['Gov Exp %'] = life_totalexp_governmentexp_gdp_pc['Government Health Exp'] / life_totalexp_governmentexp_gdp_pc['Total Health Exp'] #A new column for population is created by adding up female and male pop.

life_totalexp_governmentexp_gdp_pc.head()

Out[11]:

	Country	Life Expectancy	Total Health Exp	Government Health Exp	GDP per capita	Gov Exp %
0	Afghanistan	53.6	44.465952	5.182912	1637.0	0.116559
1	Albania	77.2	577.275816	225.381942	9374.0	0.390423
2	Algeria	76.0	330.006155	257.052609	12494.0	0.778933
3	Andorra	84.7	3254.524918	2281.096968	38982.0	0.700900
4	Angola	57.6	168.130557	138.640325	7047.0	0.824599

In [12]:

#Construct a table for 2010 CO2 emission/capita, Oil Production and GDP/capita by country

co2_percapita = co2_pc[['Country', 2010]] #Year 2010 is used as it is the latest data available
co2_percapita = co2_percapita.dropna() #NAs are dropped
co2_percapita.rename(columns={2010:'CO2 per capita'}, inplace=True) #Name of column is updated

oil_production = oil_prod[['Country', 2010]] #Year 2010 is used as it is the latest data available
oil_production = oil_production.dropna() #NAs are dropped
oil_production.rename(columns={2010:'Oil Prod'}, inplace=True) #Name of column is updated

co2_oil = pd.merge(left=co2_percapita, right=oil_production, left_on='Country', right_on='Country') #Inner merged is used
co2_oil_gdp_pc = pd.merge(left=co2_oil, right=gdp_percapita, left_on='Country', right_on='Country')  #Inner merge is used again to combine 3 datasets

co2_oil_gdp_pc.head()

Out[12]:

	Country	CO2 per capita	Oil Prod	GDP per capita
0	Algeria	3.480977	7.550114e+07	12494.0
1	Angola	1.593918	9.197321e+07	7047.0
2	Argentina	4.466338	3.254959e+07	15765.0
3	Australia	16.752301	2.462724e+07	41330.0
4	Azerbaijan	4.976935	5.083820e+07	15950.0

Exploratory Data Analysis

Graph visualizations are created to study and show the relationship between variables.

1. What is the income per capita distribution by country?

In [13]:

plot1 = binned_gni.plot.bar(title='GNI per capita Distribution by Country')

Majority of the countries are classified as either Upper Middle Income or High Income.

2. Does lower level of corruption (CPI) lead to higher income (GNI/capita)?

In [14]:

plot2 = cpi_gni_pc.plot.scatter(x='CPI', y='GNI per capita', title='GNI per capita vs CPI')
plot2.set_yscale('log')

The graph above shows that countries with lower lovel of corruption (higher CPI index) tend to have higher income (GNI per capita). This could be due to the fact that governments that are transparent are more likely to invest public money to drive economoic growth which lead to higher income per capita and standard of living. On the other way, it is less likely that higher income per capita leads to less corrupted government.

3. Does higher income (GNI/capita) lead to higher standard of living (HDI)?

In [15]:

plot2 = gni_pc_hdi.plot.scatter(x='GNI per capita', y='HDI', title='HDI vs GNI per capita')
plot2.set_xscale('log')

Countries with higher income (GNI per capita) have higher human development index (HDI) which is an indicator of standard of living, education level and income level. As a result, both variable should show a strong positive correlation as income is one of the components in calculating HDI.

4. What are the characteristics of export-dependent nations?

In [16]:

plot3 = export_pop_gdp_pc.plot.scatter(x='Export_perc', y='Population', s=export_pop_gdp_pc['GDP per capita']*0.001, title='Population vs Export % as GDP vs GDP per capita');
plot3.set_yscale('log')

#The size of the bubble indicates the GDP per capita value of each country

In [17]:

export1 = export_pop_gdp_pc.sort_values(by='Export_perc', ascending=False)
export1.head()

Out[17]:

	Country	Export_perc	Population	GDP per capita
61	Hong Kong, China	218.861632	7069378	48108.0
121	Singapore	207.170536	4836691	72056.0
80	Luxembourg	172.366716	491772	91147.0
81	Macao, China	106.294511	547591	98722.0
66	Ireland	101.165280	4589002	43860.0

From the graph, countries that are more dependent of export when compared to their economy size (>75%) tend to have higher GDP per capita (bigger bubble size). At the same time, it is observed that the population size is also less than 10 millions in almost all cases.

Few examples of such countries/regions are Singapore, Hong Kong and Luxembourg where all of them are small and open economies.

4. Will higher level of expenditure in healthcare, especially government expenditure, lead to higher life expectancy?

In [18]:

plot4a = life_totalexp_governmentexp_gdp_pc.plot.scatter(x='Life Expectancy', y='Total Health Exp', s=life_totalexp_governmentexp_gdp_pc['Gov Exp %']*70, title='Total Health Expenditure vs Life Expectancy vs Government Health Expenditure');
plot4b = life_totalexp_governmentexp_gdp_pc.plot.scatter(x='Life Expectancy', y='GDP per capita', color='Red', title='GDP per capita vs Life Expectancy');

plot4a.set_yscale('log')
plot4b.set_yscale('log')

In the first graph (blue) above, higher expenditure in health corresponds with higher life expectancy. At the same time, it seems that most countries have similar level of governmental support when it comes to health expenditure and therefore if can be concluded higher level of government expenditure on health doesn’t result in higher life expectancy.

In the second graph (red), higher GDP per capita is highly correlated to higher life expectancy. One of the reasons could be due to higher GDP per capita enables citizens to have more disposable income to spend on health realted expenditure.

5. Do higher economic output (GDP/capita) and oil-producing nations have higher carbon footpirnt (CO2 emission/capita)?

In [19]:

plot5 = co2_oil_gdp_pc.plot.scatter(x='CO2 per capita', y='Oil Prod', s=co2_oil_gdp_pc['GDP per capita']*0.002, title='Oil Production vs CO2 per capita vs GDP per capita');
plot5.set_ylim(0,600000000)
plot5.set_xlim(0,50)

Out[19]:

(0, 50)

Countries with high CO2 emission per capita are those who are either oil-producing countries or those with high GDP per capita.

Conclusions

From the visualizations above, most of the inferences are based on correlation and not causation. In order to prove the causation among variables, most in depth studies need to be carried out and other independent variables need to be controlled.

Limitation

Most of the datasets have quite a number of missing datas. This causes a lot of the countries to be excluded in studying the relationship between variables.

Most recent data available are usually before the year of 2011. The lack of most updated data prevented the studying of how most economies perform after the global financial crisis in 2008.In [20]:

from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

Out[20]:

4294967295

Project 1 – Investigate a Dataset

Write A Comment Cancel Reply

Safer Gambling Analytics Part 2 – Session Aggregation

Safer Gambling Analytics Part 1 – Declined Deposits

Kaggle Marketing Analysis – Power BI Dashboard

Starbucks Data Science Blog