Table of Contents

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions

Introduction

Gapminder datasets are being used to investigate the relationship between GDP/GNI per capita, standard of living, carbon footprint, healthcare and the importance of export in each country’s economy.

Few questions which are of interest:

1) What is the income per capita distribution by country?
2) Does lower level of corruption lead to higher income (GNI/capita)?
3) Does higher income (GNI/capita) lead to higher standard of living (HDI)?
4) What are the characteristics of export-dependent nations?
5) Will higher level of expenditure in healthcare, especially if led by government, contribute to higher life expectancy?
6) Do higher economic output (GDP/capita) and production of oil lead to higher carbon footpirnt (CO2 emission/capita)?

In [1]:

#Importing all required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Data Wrangling

Datasets are loaded, cleaned and merged in order to enable the studying of relationship between variables.

General Properties

Few datasets are printed out to evaluate the quality and completion of the data.

In [2]:

gdp_pc = pd.read_excel('indicator gapminder gdp_per_capita_ppp.xlsx', sheet_name='Data')
gni_pc = pd.read_excel('indicatorGNIpercapitaATLAS.xlsx', sheet_name='Data')
cpi_country = pd.read_excel('indicator ti cpi 2009.xlsx', sheet_name='Data')
hdi_country = pd.read_excel('Indicator_HDI.xlsx', sheet_name='Data')
life_expec = pd.read_excel('indicator life_expectancy_at_birth.xlsx', sheet_name='Data')
total_health = pd.read_excel('indicator_per capita total expenditure on health (ppp int. $).xlsx', sheet_name='Data')
government_health = pd.read_excel('indicator_per capita government expenditure on health (ppp int. $).xlsx', sheet_name='Data')
export_perc = pd.read_excel('Exports (p of GDP).xlsx', sheet_name='Data')
female = pd.read_excel('indicator_total population female.xlsx', sheet_name='Data')
male = pd.read_excel('indicator_total population male.xlsx', sheet_name='Data')
co2_pc = pd.read_excel('indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx', sheet_name='Data')
oil_prod = pd.read_excel('Oil Production.xlsx', sheet_name='Data')

In [3]:

#First few rows of all datasets are being loaded and printed out
gdp_pc.head()

#Abkhazia and Akrotiri and Dhekelia don't contain any value because it might be 
#difficult to collect data in unstable regions without functional government. 
#Thus, countries / regions with missing data should be excluded.

Out[3]:

Country1800180118021803180418051806180718082006200720082009201020112012201320142015
0AbkhaziaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1Afghanistan603.0603.0603.0603.0603.0603.0603.0603.0603.01173.01298.01311.01548.01637.01695.01893.01884.01877.01925.0
2Akrotiri and DhekeliaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3Albania667.0667.0668.0668.0668.0668.0668.0668.0668.07476.07977.08644.08994.09374.09640.09811.09961.010160.010620.0
4Algeria716.0716.0717.0718.0719.0720.0721.0722.0723.012088.012289.012314.012285.012494.012606.012779.012893.013179.013434.0

5 rows × 217 columns

In [4]:

gni_pc.head()

#There are more missing data for earlier years (before 2000's), 
#therefore only recent data should be used to ensure quality of the study.

Out[4]:

Country1962196319641965196619671968196919702006200720082009201020112012201320142015
0ArubaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1AndorraNaNNaNNaNNaNNaNNaNNaNNaNNaN42740.044250.043760.044180.041670.040580.041010.043270.045655.048170.0
2AfghanistanNaNNaNNaNNaNNaNNaNNaNNaNNaN280.0340.0370.0470.0520.0570.0690.0690.0670.0651.0
3AngolaNaNNaNNaNNaNNaNNaNNaNNaNNaN1620.02350.03010.03490.03550.03690.04150.04730.04850.04973.0
4AlbaniaNaNNaNNaNNaNNaNNaNNaNNaNNaN3120.03530.04080.04280.04360.04390.04370.04510.04460.04411.0

5 rows × 55 columns

In [5]:

#Data type are inspected
gdp_pc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Columns: 217 entries, Country to 2015
dtypes: float64(216), object(1)
memory usage: 444.2+ KB

In [6]:

cpi_country.info()

#Since the CPI dataset only contains year 2008 and 2009, only data 
#of year 2008 or 2009 should be used when datasets are combined
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
Country    180 non-null object
2008       180 non-null float64
2009       176 non-null float64
dtypes: float64(2), object(1)
memory usage: 4.3+ KB

In [7]:

#Table for GNI per capita after data are binned into category

gni_bin = gni_pc.loc[:,('Country', 2015)] #Year 2015 is chosen as it is the latest data
gni_bin = gni_bin.dropna() #Countries with NaN values are dropped
bins = [0, 1025, 4035, 12475, 1000000] #Income classification based on World Bank data
group_names = ['Low Income', 'Lower Middle Income', 'Upper Middle Income', 'High Income'] #4 Categories to classify the income of citizens
categories = pd.cut(gni_bin[2015], bins, labels=group_names)
binned_gni = pd.value_counts(categories) #Summarized the count for each classification group
binned_gni

#https://blogs.worldbank.org/opendata/new-country-classifications-2016
#low Income <$1025, Lower Middle Income $1,026-$4,035, Upper Middle Income $4,036-$12,475, High Income >$12,475

Out[7]:

Upper Middle Income    69
High Income            67
Lower Middle Income    53
Low Income             30
Name: 2015, dtype: int64

In [8]:

#Table for CPI index and GNI/capita of year 2009 by country is constructed

gni_percapita = gni_pc[['Country', '2009']] #Year 2009 is used as it is the latest data available for CPI
gni_percapita = gni_percapita.dropna() #NAs are dropped to ensure the quality of the study
gni_percapita.rename(columns={'2009':'GNI per capita'}, inplace=True) #Name of column is changed to make table columns clearer

cpi = cpi_country[['Country', 2009]] #Year 2009 is used as it is the latest data available
cpi = cpi.dropna() #NAs are dropped to make relationship easier to study
cpi.rename(columns={2009:'CPI'}, inplace=True) #Name of column is changed to make table columns clearer


cpi_gni_pc = pd.merge(left=cpi, right=gni_percapita, left_on='Country', right_on='Country') 
#Both datasets are merged using inner merge

cpi_gni_pc.head()

Out[8]:

CountryCPIGNI per capita
0New Zealand9.329410.0
1Denmark9.359840.0
2Sweden9.251900.0
3Singapore9.337080.0
4Switzerland8.770230.0

In [9]:

#Table for 2011 HDI index and GNI/capita by country is constructed

gni_percapita = gni_pc[['Country', '2011']] #Year 2011 is used as it is the latest data available
gni_percapita = gni_percapita.dropna() #NAs are dropped to make relationship easier to study
gni_percapita.rename(columns={'2011':'GNI per capita'}, inplace=True) #Name of column is changed to make table columns clearer

hdi = hdi_country[['Country', 2011]] #Year 2011 is used as it is the latest data availablex
hdi = hdi.dropna() #NAs are dropped to make relationship easier to study
hdi.rename(columns={2011:'HDI'}, inplace=True) #Name of column is changed to make table columns clearer

gni_pc_hdi = pd.merge(left=gni_percapita, right=hdi, left_on='Country', right_on='Country') #Both datasets are merged using inner merge

gni_pc_hdi.head()

Out[9]:

CountryGNI per capitaHDI
0Andorra40580.00.838
1Afghanistan570.00.398
2Angola3690.00.486
3Albania4390.00.739
4United Arab Emirates35150.00.846

In [10]:

#Table for 2010 population, GDP/capita and Export Value/GDP in % by country is constructed

female_pop = female[['Country', 2010]] #Year 2010 is selected
female_pop = female_pop[28:] #Only individual country data is selected
female_pop = female_pop.dropna() #NAs are dropped

male_pop = male[['Country', 2010]] #Year 2010 is selected
male_pop = male_pop[28:] #Only individual country data is selected
male_pop = male_pop.dropna() #NAs are dropped

population = pd.merge(left=female_pop, right=male_pop, left_on='Country', right_on='Country') #Female and male pop. are merged into one table
population.rename(columns={'2010_x':'Female', '2010_y':'Male'}, inplace=True) #Column names are replaced to reflect the female and male pop.
population['Population'] = population['Female'] + population['Male'] #A new column for population is created by adding up female and male pop.
population['Population'] = (population['Population']).astype(int)
population = population[['Country', 'Population']] #Female and male population columns are excluded

gdp_percapita = gdp_pc[['Country', 2010]] #Year 2010 is used to match export % dataset
gdp_percapita = gdp_percapita.dropna() #NAs are dropped
gdp_percapita.rename(columns={2010:'GDP per capita'}, inplace=True) #Column name is changed

export_percentage = export_perc[['Country', '2010']] #Year 2010 is used as there are less missing data when compared to 2011
export_percentage = export_percentage.dropna() #NAs are dropped
export_percentage.rename(columns={'2010':'Export_perc'}, inplace=True) #Column name is changed

export_pop = pd.merge(left=export_percentage, right=population, left_on='Country', right_on='Country') #Inner merged is used
export_pop_gdp_pc = pd.merge(left=export_pop, right=gdp_percapita, left_on='Country', right_on='Country') #Inner merge is used again to combine 3 datasets

export_pop_gdp_pc.head()

Out[10]:

CountryExport_percPopulationGDP per capita
0Afghanistan20.124519291174891637.0
1Albania32.42747131690879374.0
2Algeria30.8304053542258912494.0
3Angola62.413106189927077047.0
4Argentina21.7064694066573215765.0

In [11]:

#Table for 2010 life expectancy, total health expenditure, government health expenditure and GDP/capita by country is constructed

life_expectancy = life_expec[['Country', 2010]] #Year 2010 is used as it is the latest data available
life_expectancy = life_expectancy.dropna() #NAs are dropped
life_expectancy.rename(columns={2010:'Life Expectancy'}, inplace=True) #Name of column is changed to reflect actual column data

total_health_exp = total_health[['Country', 2010]] #Year 2010 is used as it is the latest data available
total_health_exp = total_health_exp.dropna() #NAs are dropped
total_health_exp.rename(columns={2010:'Total Health Exp'}, inplace=True) #Name of column is changed to reflect actual column data

government_health_exp = government_health[['Country', 2010]] #Year 2010 is used as it is the latest data available
government_health_exp = government_health_exp.dropna() #NAs are dropped
government_health_exp.rename(columns={2010:'Government Health Exp'}, inplace=True) #Name of column is changed to reflect actual column data

life_totalexp = pd.merge(left=life_expectancy, right=total_health_exp, left_on='Country', right_on='Country') #Inner merged is used
life_totalexp_governmentexp = pd.merge(left=life_totalexp, right=government_health_exp, left_on='Country', right_on='Country') #Inner merge is used to combine 3 datasets
life_totalexp_governmentexp_gdp_pc = pd.merge(left=life_totalexp_governmentexp, right=gdp_percapita, left_on='Country', right_on='Country') #Inner merge is used again to combine 4 datasets
life_totalexp_governmentexp_gdp_pc['Gov Exp %'] = life_totalexp_governmentexp_gdp_pc['Government Health Exp'] / life_totalexp_governmentexp_gdp_pc['Total Health Exp'] #A new column for population is created by adding up female and male pop.

life_totalexp_governmentexp_gdp_pc.head()

Out[11]:

CountryLife ExpectancyTotal Health ExpGovernment Health ExpGDP per capitaGov Exp %
0Afghanistan53.644.4659525.1829121637.00.116559
1Albania77.2577.275816225.3819429374.00.390423
2Algeria76.0330.006155257.05260912494.00.778933
3Andorra84.73254.5249182281.09696838982.00.700900
4Angola57.6168.130557138.6403257047.00.824599

In [12]:

#Construct a table for 2010 CO2 emission/capita, Oil Production and GDP/capita by country

co2_percapita = co2_pc[['Country', 2010]] #Year 2010 is used as it is the latest data available
co2_percapita = co2_percapita.dropna() #NAs are dropped
co2_percapita.rename(columns={2010:'CO2 per capita'}, inplace=True) #Name of column is updated

oil_production = oil_prod[['Country', 2010]] #Year 2010 is used as it is the latest data available
oil_production = oil_production.dropna() #NAs are dropped
oil_production.rename(columns={2010:'Oil Prod'}, inplace=True) #Name of column is updated

co2_oil = pd.merge(left=co2_percapita, right=oil_production, left_on='Country', right_on='Country') #Inner merged is used
co2_oil_gdp_pc = pd.merge(left=co2_oil, right=gdp_percapita, left_on='Country', right_on='Country')  #Inner merge is used again to combine 3 datasets

co2_oil_gdp_pc.head()

Out[12]:

CountryCO2 per capitaOil ProdGDP per capita
0Algeria3.4809777.550114e+0712494.0
1Angola1.5939189.197321e+077047.0
2Argentina4.4663383.254959e+0715765.0
3Australia16.7523012.462724e+0741330.0
4Azerbaijan4.9769355.083820e+0715950.0

Exploratory Data Analysis

Graph visualizations are created to study and show the relationship between variables.

1. What is the income per capita distribution by country?

In [13]:

plot1 = binned_gni.plot.bar(title='GNI per capita Distribution by Country')

Majority of the countries are classified as either Upper Middle Income or High Income.

2. Does lower level of corruption (CPI) lead to higher income (GNI/capita)?

In [14]:

plot2 = cpi_gni_pc.plot.scatter(x='CPI', y='GNI per capita', title='GNI per capita vs CPI')
plot2.set_yscale('log')

The graph above shows that countries with lower lovel of corruption (higher CPI index) tend to have higher income (GNI per capita). This could be due to the fact that governments that are transparent are more likely to invest public money to drive economoic growth which lead to higher income per capita and standard of living. On the other way, it is less likely that higher income per capita leads to less corrupted government.

3. Does higher income (GNI/capita) lead to higher standard of living (HDI)?

In [15]:

plot2 = gni_pc_hdi.plot.scatter(x='GNI per capita', y='HDI', title='HDI vs GNI per capita')
plot2.set_xscale('log')

Countries with higher income (GNI per capita) have higher human development index (HDI) which is an indicator of standard of living, education level and income level. As a result, both variable should show a strong positive correlation as income is one of the components in calculating HDI.

4. What are the characteristics of export-dependent nations?

In [16]:

plot3 = export_pop_gdp_pc.plot.scatter(x='Export_perc', y='Population', s=export_pop_gdp_pc['GDP per capita']*0.001, title='Population vs Export % as GDP vs GDP per capita');
plot3.set_yscale('log')

#The size of the bubble indicates the GDP per capita value of each country

In [17]:

export1 = export_pop_gdp_pc.sort_values(by='Export_perc', ascending=False)
export1.head()

Out[17]:

CountryExport_percPopulationGDP per capita
61Hong Kong, China218.861632706937848108.0
121Singapore207.170536483669172056.0
80Luxembourg172.36671649177291147.0
81Macao, China106.29451154759198722.0
66Ireland101.165280458900243860.0

From the graph, countries that are more dependent of export when compared to their economy size (>75%) tend to have higher GDP per capita (bigger bubble size). At the same time, it is observed that the population size is also less than 10 millions in almost all cases.

Few examples of such countries/regions are Singapore, Hong Kong and Luxembourg where all of them are small and open economies.

4. Will higher level of expenditure in healthcare, especially government expenditure, lead to higher life expectancy?

In [18]:

plot4a = life_totalexp_governmentexp_gdp_pc.plot.scatter(x='Life Expectancy', y='Total Health Exp', s=life_totalexp_governmentexp_gdp_pc['Gov Exp %']*70, title='Total Health Expenditure vs Life Expectancy vs Government Health Expenditure');
plot4b = life_totalexp_governmentexp_gdp_pc.plot.scatter(x='Life Expectancy', y='GDP per capita', color='Red', title='GDP per capita vs Life Expectancy');

plot4a.set_yscale('log')
plot4b.set_yscale('log')

In the first graph (blue) above, higher expenditure in health corresponds with higher life expectancy. At the same time, it seems that most countries have similar level of governmental support when it comes to health expenditure and therefore if can be concluded higher level of government expenditure on health doesn’t result in higher life expectancy.

In the second graph (red), higher GDP per capita is highly correlated to higher life expectancy. One of the reasons could be due to higher GDP per capita enables citizens to have more disposable income to spend on health realted expenditure.

5. Do higher economic output (GDP/capita) and oil-producing nations have higher carbon footpirnt (CO2 emission/capita)?

In [19]:

plot5 = co2_oil_gdp_pc.plot.scatter(x='CO2 per capita', y='Oil Prod', s=co2_oil_gdp_pc['GDP per capita']*0.002, title='Oil Production vs CO2 per capita vs GDP per capita');
plot5.set_ylim(0,600000000)
plot5.set_xlim(0,50)

Out[19]:

(0, 50)

Countries with high CO2 emission per capita are those who are either oil-producing countries or those with high GDP per capita.

Conclusions

From the visualizations above, most of the inferences are based on correlation and not causation. In order to prove the causation among variables, most in depth studies need to be carried out and other independent variables need to be controlled.

Limitation

Most of the datasets have quite a number of missing datas. This causes a lot of the countries to be excluded in studying the relationship between variables.

Most recent data available are usually before the year of 2011. The lack of most updated data prevented the studying of how most economies perform after the global financial crisis in 2008.In [20]:

from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

Out[20]:

4294967295
Author

Write A Comment