Project 4 - Communicate Data Findings

Introduction of the topic and dataset
Dataset Investigation and preliminary wrangling
Further Data Wrangling
Univariate Exploration and Analysis
Bivariate Exploration and Analysis
Multivariate Exploration and Analysis
Conclusions and Answers

Introduction of the topic and dataset

Introduction to PISA

What Is PISA?

The Program for International Student Assessment (PISA) is a system of international assessments that allows countries to compare outcomes of learning as students near the end of compulsory schooling. PISA core assessments measure the performance of 15-year-old students in mathematics, science, and reading literacy every 3 years. Coordinated by the Organization for Economic Cooperation and Development (OECD), PISA was first implemented in 2000 in 32 countries. It has since grown to 65 education systems in 2012.

Project Aims

Project: Data Exploration of the performance of globally-selected 15/16-year-old students in Mathematics, Reading and Science Literacy, based on the results of the PISA 2012 test

What PISA Measures

PISA’s goal is to assess students’ preparation for the challenges of life as young adults. PISA assesses the application of knowledge in mathematics, science, and reading literacy to problems within a real-life context (OECD 1999). PISA does not focus explicitly on curricular outcomes and uses the term “literacy” in each subject area to indicate its broad focus on the application of knowledge and skills. For example, when assessing mathematics, PISA examines how well 15-year-old students can understand, use, and reflect on mathematics for a variety of real-life problems and settings that they may not encounter in the classroom. Scores on the PISA scales represent skill levels along a continuum of literacy skills.

Each PISA data collection cycle assesses one of the three core subject areas in depth (considered the major subject area), although all three core subjects are assessed in each cycle (the other two subjects are considered minor subject areas for that assessment year). Assessing all three subjects every 3 years allows countries to have a consistent source of achievement data in each of the three subjects, while rotating one area as the primary focus over the years. Mathematics was the major subject area in 2012, as it was in 2003, since each subject is a major subject area once every three cycles. In 2012, mathematics, science, and reading literacy were assessed primarily through a paper-and-pencil assessment, and problem solving was administered via a computer-based assessment. In addition to these core assessments, education systems could participate in optional paper-based financial literacy and computer-based mathematics and reading assessments. The United States participated in these optional assessments. Visit www.nces.ed.gov/surveys/pisa for more information on the PISA assessments, including information on how the assessments were designed and examples of PISA questions.

Ref.: NCES 2014-024, U.S. Department of Education

Introduction to the PISA 2012 dataset

PISA is a survey of students’ skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.

Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.

In [1]:

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

sb.set()

# We are interested in exploring the formatting of all columns (variables), hence we will display all of them
pd.set_option('display.max_rows', 636)
pd.set_option('display.max_columns', 636)

In [2]:

df = pd.read_csv('pisa2012.csv', encoding='latin-1', low_memory = False)

In [3]:

df.head(3)

Out[3]:

	Unnamed: 0	CNT	SUBNATIO	STRATUM	OECD	NC	SCHOOLID	STIDSTD	ST01Q01	ST02Q01	ST03Q01	ST03Q02	ST04Q01	ST05Q01	ST06Q01	ST07Q01	ST07Q02	ST07Q03	ST08Q01	ST09Q01	ST115Q01	ST11Q01	ST11Q02	ST11Q03	ST11Q04	ST11Q05	ST11Q06	ST13Q01	ST14Q01	ST14Q02	ST14Q03	ST14Q04	ST15Q01	ST17Q01	ST18Q01	ST18Q02	ST18Q03	ST18Q04	ST19Q01	ST20Q01	ST20Q02	ST20Q03	ST21Q01	ST25Q01	ST26Q01	ST26Q02	ST26Q03	ST26Q04	ST26Q05	ST26Q06	ST26Q07	ST26Q08	ST26Q09	ST26Q10	ST26Q11	ST26Q12	ST26Q13	ST26Q14	ST26Q15	ST26Q16	ST26Q17	ST27Q01	ST27Q02	ST27Q03	ST27Q04	ST27Q05	ST28Q01	ST29Q01	ST29Q02	ST29Q03	ST29Q04	ST29Q05	ST29Q06	ST29Q07	ST29Q08	ST35Q01	ST35Q02	ST35Q03	ST35Q04	ST35Q05	ST35Q06	ST37Q01	ST37Q02	ST37Q03	ST37Q04	ST37Q05	ST37Q06	ST37Q07	ST37Q08	ST42Q01	ST42Q02	ST42Q03	ST42Q04	ST42Q05	ST42Q06	ST42Q07	ST42Q08	ST42Q09	ST42Q10	ST43Q01	ST43Q02	ST43Q03	ST43Q04	ST43Q05	ST43Q06	ST44Q01	ST44Q03	ST44Q04	ST44Q05	ST44Q07	ST44Q08	ST46Q01	ST46Q02	ST46Q03	ST46Q04	ST46Q05	ST46Q06	ST46Q07	ST46Q08	ST46Q09	ST48Q01	ST48Q02	ST48Q03	ST48Q04	ST48Q05	ST49Q01	ST49Q02	ST49Q03	ST49Q04	ST49Q05	ST49Q06	ST49Q07	ST49Q09	ST53Q01	ST53Q02	ST53Q03	ST53Q04	ST55Q01	ST55Q02	ST55Q03	ST55Q04	ST57Q01	ST57Q02	ST57Q03	ST57Q04	ST57Q05	ST57Q06	ST61Q01	ST61Q02	ST61Q03	ST61Q04	ST61Q05	ST61Q06	ST61Q07	ST61Q08	ST61Q09	ST62Q01	ST62Q02	ST62Q03	ST62Q04	ST62Q06	ST62Q07	ST62Q08	ST62Q09	ST62Q10	ST62Q11	ST62Q12	ST62Q13	ST62Q15	ST62Q16	ST62Q17	ST62Q19	ST69Q01	ST69Q02	ST69Q03	ST70Q01	ST70Q02	ST70Q03	ST71Q01	ST72Q01	ST73Q01	ST73Q02	ST74Q01	ST74Q02	ST75Q01	ST75Q02	ST76Q01	ST76Q02	ST77Q01	ST77Q02	ST77Q04	ST77Q05	ST77Q06	ST79Q01	ST79Q02	ST79Q03	ST79Q04	ST79Q05	ST79Q06	ST79Q07	ST79Q08	ST79Q10	ST79Q11	ST79Q12	ST79Q15	ST79Q17	ST80Q01	ST80Q04	ST80Q05	ST80Q06	ST80Q07	ST80Q08	ST80Q09	ST80Q10	ST80Q11	ST81Q01	ST81Q02	ST81Q03	ST81Q04	ST81Q05	ST82Q01	ST82Q02	ST82Q03	ST83Q01	ST83Q02	ST83Q03	ST83Q04	ST84Q01	ST84Q02	ST84Q03	ST85Q01	ST85Q02	ST85Q03	ST85Q04	ST86Q01	ST86Q02	ST86Q03	ST86Q04	ST86Q05	ST87Q01	ST87Q02	ST87Q03	ST87Q04	ST87Q05	ST87Q06	ST87Q07	ST87Q08	ST87Q09	ST88Q01	ST88Q02	
ST88Q03	ST88Q04	ST89Q02	ST89Q03	ST89Q04	ST89Q05	ST91Q01	ST91Q02	ST91Q03	ST91Q04	ST91Q05	ST91Q06	ST93Q01	ST93Q03	ST93Q04	ST93Q06	ST93Q07	ST94Q05	ST94Q06	ST94Q09	ST94Q10	ST94Q14	ST96Q01	ST96Q02	ST96Q03	ST96Q05	ST101Q01	ST101Q02	ST101Q03	ST101Q05	ST104Q01	ST104Q04	ST104Q05	ST104Q06	IC01Q01	IC01Q02	IC01Q03	IC01Q04	IC01Q05	IC01Q06	IC01Q07	IC01Q08	IC01Q09	IC01Q10	IC01Q11	IC02Q01	IC02Q02	IC02Q03	IC02Q04	IC02Q05	IC02Q06	IC02Q07	IC03Q01	IC04Q01	IC05Q01	IC06Q01	IC07Q01	IC08Q01	IC08Q02	IC08Q03	IC08Q04	IC08Q05	IC08Q06	IC08Q07	IC08Q08	IC08Q09	IC08Q11	IC09Q01	IC09Q02	IC09Q03	IC09Q04	IC09Q05	IC09Q06	IC09Q07	IC10Q01	IC10Q02	IC10Q03	IC10Q04	IC10Q05	IC10Q06	IC10Q07	IC10Q08	IC10Q09	IC11Q01	IC11Q02	IC11Q03	IC11Q04	IC11Q05	IC11Q06	IC11Q07	IC22Q01	IC22Q02	IC22Q04	IC22Q06	IC22Q07	IC22Q08	EC01Q01	EC02Q01	EC03Q01	EC03Q02	EC03Q03	EC03Q04	EC03Q05	EC03Q06	EC03Q07	EC03Q08	EC03Q09	EC03Q10	EC04Q01A	EC04Q01B	EC04Q01C	EC04Q02A	EC04Q02B	EC04Q02C	EC04Q03A	EC04Q03B	EC04Q03C	EC04Q04A	EC04Q04B	EC04Q04C	EC04Q05A	EC04Q05B	EC04Q05C	EC04Q06A	EC04Q06B	EC04Q06C	EC05Q01	EC06Q01	EC07Q01	EC07Q02	EC07Q03	EC07Q04	EC07Q05	EC08Q01	EC08Q02	EC08Q03	EC08Q04	EC09Q03	EC10Q01	EC11Q02	EC11Q03	EC12Q01	ST22Q01	ST23Q01	ST23Q02	ST23Q03	ST23Q04	ST23Q05	ST23Q06	ST23Q07	ST23Q08	ST24Q01	ST24Q02	ST24Q03	CLCUSE1	CLCUSE301	CLCUSE302	DEFFORT	QUESTID	BOOKID	EASY	AGE	GRADE	PROGN	ANXMAT	ATSCHL	ATTLNACT	BELONG	BFMJ2	BMMJ1	CLSMAN	COBN_F	COBN_M	COBN_S	COGACT	CULTDIST	CULTPOS	DISCLIMA	ENTUSE	ESCS	EXAPPLM	EXPUREM	FAILMAT	FAMCON	FAMCONC	FAMSTRUC	FISCED	HEDRES	HERITCUL	HISCED	HISEI	HOMEPOS	HOMSCH	HOSTCUL	ICTATTNEG	ICTATTPOS	ICTHOME	ICTRES	ICTSCH	IMMIG	INFOCAR	INFOJOB1	INFOJOB2	INSTMOT	INTMAT	ISCEDD	ISCEDL	ISCEDO	LANGCOMM	LANGN	LANGRPPD	LMINS	MATBEH	MATHEFF	MATINTFC	MATWKETH	MISCED	MMINS	MTSUP	OCOD1	OCOD2	OPENPS	OUTHOURS	PARED	PERSEV	REPEAT	SCMAT	SMINS	STUDREL	SUBNORM	TCHBEHFA	TCHBEHSO	TCHBEHTD	TEACHSUP	TESTLANG	TIMEINT	USEMATH	USESCH	WEALTH	ANCATSCHL	ANCATTLNACT	ANCBELONG	ANCCLSMAN	ANCCOGACT	ANCINSTMOT	ANCINTMAT	ANCMATWKETH	ANCMTSUP	ANCSCMAT	
ANCSTUDREL	ANCSUBNORM	PV1MATH	PV2MATH	PV3MATH	PV4MATH	PV5MATH	PV1MACC	PV2MACC	PV3MACC	PV4MACC	PV5MACC	PV1MACQ	PV2MACQ	PV3MACQ	PV4MACQ	PV5MACQ	PV1MACS	PV2MACS	PV3MACS	PV4MACS	PV5MACS	PV1MACU	PV2MACU	PV3MACU	PV4MACU	PV5MACU	PV1MAPE	PV2MAPE	PV3MAPE	PV4MAPE	PV5MAPE	PV1MAPF	PV2MAPF	PV3MAPF	PV4MAPF	PV5MAPF	PV1MAPI	PV2MAPI	PV3MAPI	PV4MAPI	PV5MAPI	PV1READ	PV2READ	PV3READ	PV4READ	PV5READ	PV1SCIE	PV2SCIE	PV3SCIE	PV4SCIE	PV5SCIE	W_FSTUWT	W_FSTR1	W_FSTR2	W_FSTR3	W_FSTR4	W_FSTR5	W_FSTR6	W_FSTR7	W_FSTR8	W_FSTR9	W_FSTR10	W_FSTR11	W_FSTR12	W_FSTR13	W_FSTR14	W_FSTR15	W_FSTR16	W_FSTR17	W_FSTR18	W_FSTR19	W_FSTR20	W_FSTR21	W_FSTR22	W_FSTR23	W_FSTR24	W_FSTR25	W_FSTR26	W_FSTR27	W_FSTR28	W_FSTR29	W_FSTR30	W_FSTR31	W_FSTR32	W_FSTR33	W_FSTR34	W_FSTR35	W_FSTR36	W_FSTR37	W_FSTR38	W_FSTR39	W_FSTR40	W_FSTR41	W_FSTR42	W_FSTR43	W_FSTR44	W_FSTR45	W_FSTR46	W_FSTR47	W_FSTR48	W_FSTR49	W_FSTR50	W_FSTR51	W_FSTR52	W_FSTR53	W_FSTR54	W_FSTR55	W_FSTR56	W_FSTR57	W_FSTR58	W_FSTR59	W_FSTR60	W_FSTR61	W_FSTR62	W_FSTR63	W_FSTR64	W_FSTR65	W_FSTR66	W_FSTR67	W_FSTR68	W_FSTR69	W_FSTR70	W_FSTR71	W_FSTR72	W_FSTR73	W_FSTR74	W_FSTR75	W_FSTR76	W_FSTR77	W_FSTR78	W_FSTR79	W_FSTR80	WVARSTRR	VAR_UNIT	SENWGT_STU	VER_STU
0	1	Albania	80000	ALB0006	Non-OECD	Albania	1	1	10	1.0	2	1996	Female	No	6.0	No, never	No, never	No, never	None	None	1.0	Yes	Yes	Yes	Yes	NaN	NaN	<ISCED level 3A>	No	No	No	No	Other (e.g. home duties, retired)	<ISCED level 3A>	NaN	NaN	NaN	NaN	Working part-time <for pay>	Country of test	Country of test	Country of test	NaN	Language of the test	Yes	No	Yes	No	No	No	No	Yes	No	Yes	No	Yes	No	Yes	8002	8001	8002	Two	One	None	None	None	0-10 books	Agree	Strongly agree	Agree	Agree	Agree	Agree	Agree	Strongly agree	Disagree	Agree	Disagree	Agree	Agree	Agree	Not at all confident	Not very confident	Confident	Confident	Confident	Not at all confident	Confident	Very confident	Agree	Disagree	Agree	Agree	Agree	Agree	Agree	Disagree	Disagree	Disagree	Agree	Disagree	Disagree	Agree	NaN	Disagree	Likely	Slightly likely	Likely	Likely	Likely	Very Likely	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Agree	Courses after school Test Language	Major in college Science	Study harder Test Language	Maximum classes Science	Pursuing a career Math	Often	Sometimes	Sometimes	Sometimes	Sometimes	Never or rarely	Never or rarely	Never or rarely	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Every Lesson	Every Lesson	Every Lesson	Every Lesson	Every Lesson	Never or Hardly Ever	Most Lessons	Never or Hardly Ever	Every Lesson	Most Lessons	Every Lesson	Every Lesson	Every Lesson	Never or Hardly Ever	Most Lessons	Every Lesson	Every Lesson	Every Lesson	Always or almost always	Sometimes	Never or rarely	Always or almost always	Always or almost always	Always or almost always	Always or almost always	Often	Often	Never or Hardly Ever	Never or Hardly Ever	Never or Hardly Ever	Never or Hardly Ever	Never or Hardly Ever	Strongly disagree	Strongly disagree	Strongly disagree	Strongly disagree	Agree	Agree	Agree	Strongly agree	Strongly agree	Disagree	Agree	Strongly 
disagree	Disagree	Agree	Agree	Strongly disagree	Agree	Agree	Disagree	Agree	Agree	Strongly disagree	Strongly agree	Strongly agree	Strongly disagree	Agree	Strongly disagree	Agree	Agree	Strongly agree	Strongly disagree	Strongly disagree	Agree	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Strongly disagree	Disagree	Strongly disagree	Very much like me	Very much like me	Very much like me	Somewhat like me	Very much like me	Somewhat like me	Mostly like me	Mostly like me	Mostly like me	Somewhat like me	definitely do this	definitely do this	definitely do this	definitely do this	4.0	2.0	1.0	1.0	1.0	2.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	99	99	99	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A Simple calculator	99	99	99	StQ Form B	booklet 7	Standard set of booklets	16.17	0.0	Albania: Upper secondary education	0.32	-2.31	0.5206	-1.18	76.49	79.74	-1.3771	Albania	Albania	Albania	0.6994	NaN	-0.48	1.85	NaN	NaN	NaN	NaN	0.6400	NaN	NaN	2.0	ISCED 3A, ISCED 4	-1.29	NaN	ISCED 3A, ISCED 4	NaN	-2.61	NaN	NaN	NaN	NaN	NaN	-3.16	NaN	Native	NaN	NaN	NaN	0.80	0.91	A	ISCED level 3	General	NaN	Albanian	NaN	NaN	0.6426	-0.77	-0.7332	0.2882	ISCED 3A, ISCED 4	NaN	-0.9508	Building architects	Primary school teachers	0.0521	NaN	12.0	-0.3407	Did not repeat a <grade>	0.41	NaN	-1.04	-0.0455	1.3625	0.9374	0.4297	1.68	Albanian	NaN	NaN	NaN	-2.92	-1.8636	-0.6779	-0.7351	-0.7808	-0.0219	-0.1562	0.0486	-0.2199	-0.5983	-0.0807	-0.5901	-0.3346	406.8469	376.4683	344.5319	321.1637	381.9209	325.8374	324.2795	279.8800	267.4170	312.5954	409.1837	388.1524	373.3525	389.7102	415.4152	351.5423	
375.6894	341.4161	386.5945	426.3203	396.7207	334.4057	328.9531	339.8582	354.6580	324.2795	345.3108	381.1419	380.3630	346.8687	319.6059	345.3108	360.8895	390.4892	322.7216	290.7852	345.3108	326.6163	407.6258	367.1210	249.5762	254.3420	406.8496	175.7053	218.5981	341.7009	408.8400	348.2283	367.8105	392.9877	8.9096	13.1249	13.0829	4.5315	13.0829	13.9235	13.1249	13.1249	4.3389	4.3313	13.7954	4.5315	4.3313	13.7954	13.9235	4.3389	4.3313	4.5084	4.5084	13.7954	4.5315	13.1249	13.0829	4.5315	13.0829	13.9235	13.1249	13.1249	4.3389	4.3313	13.7954	4.5315	4.3313	13.7954	13.9235	4.3389	4.3313	4.5084	4.5084	13.7954	4.5315	4.5084	4.5315	13.0829	4.5315	4.3313	4.5084	4.5084	13.7954	13.9235	4.3389	13.0829	13.9235	4.3389	4.3313	13.7954	13.9235	13.1249	13.1249	4.3389	13.0829	4.5084	4.5315	13.0829	4.5315	4.3313	4.5084	4.5084	13.7954	13.9235	4.3389	13.0829	13.9235	4.3389	4.3313	13.7954	13.9235	13.1249	13.1249	4.3389	13.0829	19	1	0.2098	22NOV13
1	2	Albania	80000	ALB0006	Non-OECD	Albania	1	2	10	1.0	2	1996	Female	Yes, for more than one year	7.0	No, never	No, never	No, never	One or two times	None	1.0	Yes	Yes	NaN	Yes	NaN	NaN	<ISCED level 3A>	Yes	Yes	No	No	Working full-time <for pay>	<ISCED level 3A>	No	No	No	No	Working full-time <for pay>	Country of test	Country of test	Country of test	NaN	Language of the test	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	8001	8001	8002	Three or more	Three or more	Three or more	Two	Two	201-500 books	Disagree	Strongly agree	Disagree	Disagree	Agree	Agree	Disagree	Disagree	Strongly agree	Strongly agree	Disagree	Agree	Disagree	Agree	Confident	Very confident	Very confident	Confident	Very confident	Confident	Very confident	Not very confident	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Strongly agree	Strongly agree	Strongly disagree	Disagree	Agree	Disagree	Likely	Slightly likely	Slightly likely	Very Likely	Slightly likely	Likely	Agree	Agree	Strongly agree	Strongly agree	Strongly agree	Agree	Agree	Disagree	Agree	Courses after school Math	Major in college Science	Study harder Math	Maximum classes Science	Pursuing a career Science	Sometimes	Often	Always or almost always	Sometimes	Always or almost always	Never or rarely	Never or rarely	Often	relating to known	Improve understanding	in my sleep	Repeat examples	I do not attend <out-of-school time lessons> i…	2 or more but less than 4 hours a week	2 or more but less than 4 hours a week	Less than 2 hours a week	NaN	NaN	6.0	0.0	0.0	2.0	Rarely	Rarely	Frequently	Sometimes	Frequently	Sometimes	Frequently	Never	Frequently	Know it well, understand the concept	Know it well, understand the concept	Heard of it once or twice	Know it well, understand the concept	Know it well, understand the concept	Know it well, understand the concept	Never heard of it	Know it well, understand the concept	Know it well, understand the concept	Never heard of it	Know it well, understand the concept	Heard of it once or twice	Know it well, understand the concept	Know 
it well, understand the concept	Never heard of it	Heard of it often	45.0	45.0	45.0	7.0	6.0	2.0	NaN	30.0	Frequently	Sometimes	Frequently	Frequently	Sometimes	Sometimes	Sometimes	Sometimes	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Not at all like me	Not at all like me	Mostly like me	Somewhat like me	Very much like me	Somewhat like me	Not much like me	Not much like me	Mostly like me	Not much like me	probably not do this	probably do this	probably not do this	probably do this	1.0	2.0	3.0	2.0	2.0	3.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	99	99	99	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A Simple calculator	99	99	99	StQ Form A	booklet 9	Standard set of booklets	16.17	0.0	Albania: Upper secondary education	NaN	NaN	NaN	NaN	15.35	23.47	NaN	Albania	Albania	Albania	NaN	NaN	1.27	NaN	NaN	NaN	-0.0681	0.7955	0.1524	0.6387	-0.08	2.0	ISCED 3A, ISCED 4	1.12	NaN	ISCED 5A, 6	NaN	1.41	NaN	NaN	NaN	NaN	NaN	1.15	NaN	Native	NaN	NaN	NaN	-0.39	0.00	A	ISCED level 3	General	NaN	Albanian	NaN	315.0	1.4702	0.34	-0.2514	0.6490	ISCED 5A, 6	270.0	NaN	Tailors, dressmakers, furriers and hatters	Building construction labourers	-0.9492	8.0	16.0	1.3116	Did not repeat a <grade>	NaN	90.0	NaN	0.6602	NaN	NaN	NaN	NaN	Albanian	NaN	NaN	NaN	0.69	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	486.1427	464.3325	453.4273	472.9008	476.0165	325.6816	419.9330	378.6493	359.9548	
384.1019	373.1968	444.0801	456.5431	401.2385	461.2167	366.9653	459.6588	426.1645	423.0488	443.3011	389.5544	438.6275	417.5962	379.4283	438.6275	440.1854	456.5431	486.9216	458.1010	444.0801	411.3647	437.8486	457.3220	454.2063	460.4378	434.7328	448.7537	494.7110	429.2803	434.7328	406.2936	349.8975	400.7334	369.7553	396.7618	548.9929	471.5964	471.5964	443.6218	454.8116	8.9096	13.1249	13.0829	4.5315	13.0829	13.9235	13.1249	13.1249	4.3389	4.3313	13.7954	4.5315	4.3313	13.7954	13.9235	4.3389	4.3313	4.5084	4.5084	13.7954	4.5315	13.1249	13.0829	4.5315	13.0829	13.9235	13.1249	13.1249	4.3389	4.3313	13.7954	4.5315	4.3313	13.7954	13.9235	4.3389	4.3313	4.5084	4.5084	13.7954	4.5315	4.5084	4.5315	13.0829	4.5315	4.3313	4.5084	4.5084	13.7954	13.9235	4.3389	13.0829	13.9235	4.3389	4.3313	13.7954	13.9235	13.1249	13.1249	4.3389	13.0829	4.5084	4.5315	13.0829	4.5315	4.3313	4.5084	4.5084	13.7954	13.9235	4.3389	13.0829	13.9235	4.3389	4.3313	13.7954	13.9235	13.1249	13.1249	4.3389	13.0829	19	1	0.2098	22NOV13
2	3	Albania	80000	ALB0006	Non-OECD	Albania	1	3	9	1.0	9	1996	Female	Yes, for more than one year	6.0	No, never	No, never	No, never	None	None	1.0	Yes	Yes	No	Yes	No	No	<ISCED level 3B, 3C>	Yes	Yes	Yes	No	Working full-time <for pay>	<ISCED level 3A>	Yes	No	Yes	Yes	Working full-time <for pay>	Country of test	Country of test	Country of test	NaN	Language of the test	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes	Yes	Yes	No	Yes	No	Yes	8001	8001	8001	Three or more	Two	Two	One	Two	More than 500 books	Agree	Strongly agree	Agree	Agree	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Agree	Strongly agree	Strongly agree	Agree	Confident	Very confident	Very confident	Confident	Very confident	Not very confident	Very confident	Confident	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Strongly agree	Agree	Strongly agree	Strongly disagree	Strongly agree	Strongly disagree	Likely	Likely	Very Likely	Very Likely	Very Likely	Slightly likely	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Strongly agree	Agree	Strongly agree	Strongly agree	Strongly agree	Courses after school Math	Major in college Science	Study harder Math	Maximum classes Science	Pursuing a career Science	Sometimes	Always or almost always	Sometimes	Never or rarely	Always or almost always	Never or rarely	Never or rarely	Never or rarely	Most important	Improve understanding	learning goals	more information	Less than 2 hours a week	2 or more but less than 4 hours a week	4 or more but less than 6 hours a week	I do not attend <out-of-school time lessons> i…	NaN	6.0	6.0	7.0	2.0	3.0	Frequently	Sometimes	Frequently	Rarely	Frequently	Rarely	Frequently	Sometimes	Frequently	Never heard of it	Know it well, understand the concept	Heard of it once or twice	Know it well, understand the concept	Know it well, understand the concept	Know it well, understand the concept	Heard of it once or twice	Know it well, understand the concept	Know it well, understand the concept	Heard of it once or twice	Know it well, understand 
the concept	Know it well, understand the concept	Know it well, understand the concept	Know it well, understand the concept	Know it well, understand the concept	Know it well, understand the concept	60.0	NaN	NaN	5.0	4.0	2.0	24.0	30.0	Frequently	Frequently	Frequently	Frequently	Frequently	Frequently	Rarely	Rarely	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Not much like me	Not much like me	Very much like me	Very much like me	Somewhat like me	Mostly like me	Mostly like me	Very much like me	Mostly like me	Very much like me	probably not do this	definitely do this	definitely not do this	probably do this	1.0	3.0	4.0	1.0	3.0	4.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	99	99	99	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A Simple calculator	99	99	99	StQ Form A	booklet 3	Standard set of booklets	15.58	-1.0	Albania: Lower secondary education	NaN	NaN	NaN	NaN	22.57	NaN	NaN	Albania	Albania	Albania	NaN	NaN	1.27	NaN	NaN	NaN	0.5359	0.7955	1.2219	0.8215	-0.89	2.0	ISCED 5A, 6	-0.69	NaN	ISCED 5A, 6	NaN	0.14	NaN	NaN	NaN	NaN	NaN	-0.40	NaN	Native	NaN	NaN	NaN	1.59	1.23	A	ISCED level 2	General	NaN	Albanian	NaN	300.0	0.9618	0.34	-0.2514	2.0389	ISCED 5A, 6	NaN	NaN	Housewife	Bricklayers and related workers	0.9383	24.0	16.0	0.9918	Did not repeat a <grade>	NaN	NaN	NaN	2.2350	NaN	NaN	NaN	NaN	Albanian	NaN	NaN	NaN	-0.23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	
533.2684	481.0796	489.6479	490.4269	533.2684	611.1622	486.5322	567.5417	541.0578	544.9525	597.1413	495.1005	576.8889	507.5635	556.6365	594.8045	473.2902	554.2997	537.1631	568.3206	471.7324	431.2276	460.8272	419.5435	456.9325	559.7523	501.3320	555.0787	467.0587	506.7845	580.7836	481.0796	555.0787	453.8168	491.2058	527.0369	444.4695	516.1318	403.9648	476.4060	401.2100	404.3872	387.7067	431.3938	401.2100	499.6643	428.7952	492.2044	512.7191	499.6643	8.4871	12.7307	12.7307	4.2436	12.7307	12.7307	12.7307	12.7307	4.2436	4.2436	12.7307	4.2436	4.2436	12.7307	12.7307	4.2436	4.2436	4.2436	4.2436	12.7307	4.2436	12.7307	12.7307	4.2436	12.7307	12.7307	12.7307	12.7307	4.2436	4.2436	12.7307	4.2436	4.2436	12.7307	12.7307	4.2436	4.2436	4.2436	4.2436	12.7307	4.2436	4.2436	4.2436	12.7307	4.2436	4.2436	4.2436	4.2436	12.7307	12.7307	4.2436	12.7307	12.7307	4.2436	4.2436	12.7307	12.7307	12.7307	12.7307	4.2436	12.7307	4.2436	4.2436	12.7307	4.2436	4.2436	4.2436	4.2436	12.7307	12.7307	4.2436	12.7307	12.7307	4.2436	4.2436	12.7307	12.7307	12.7307	12.7307	4.2436	12.7307	19	1	0.1999	22NOV13

As we can see above, the dataset is very wide: each student record carries several hundred variables.

After consulting the data dictionary to find out what each of these columns represents, we settled on the following leads to explore:

#### We are interested in finding out how students from individual countries perform in Math, Reading and Science literacy.
- For that, we will check the worldwide and per-country distributions of Math, Reading and Science literacy scores, individually.
#### Once we can see the countries’ average literacy patterns in the different subjects, we are also curious about which countries the “geniuses” come from, meaning which countries have students with exceptionally high literacy scores.
- For that, we will check the distribution of exceptional scores in Math, Reading and Science literacy, grouped by country.
#### Lastly, we would like to find out whether students whose parents have different cultural backgrounds show different average scores compared with students raised in a homogeneous family background.
- For that, we will compare the distribution of mean scores in each subject between students with a homogeneous family background (parents born in the same country) and students with a heterogeneous family background (parents born in two different countries).

In [4]:

df = df[['CNT', 'ST03Q02', 'ST04Q01', 'AGE', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 
         'PV3READ', 'PV4READ', 'PV5READ','PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'COBN_F', 'COBN_M', 'COBN_S']]

In [5]:

df.head()

Out[5]:

	CNT	ST03Q02	ST04Q01	AGE	PV1MATH	PV2MATH	PV3MATH	PV4MATH	PV5MATH	PV1READ	PV2READ	PV3READ	PV4READ	PV5READ	PV1SCIE	PV2SCIE	PV3SCIE	PV4SCIE	PV5SCIE	COBN_F	COBN_M	COBN_S
0	Albania	1996	Female	16.17	406.8469	376.4683	344.5319	321.1637	381.9209	249.5762	254.3420	406.8496	175.7053	218.5981	341.7009	408.8400	348.2283	367.8105	392.9877	Albania	Albania	Albania
1	Albania	1996	Female	16.17	486.1427	464.3325	453.4273	472.9008	476.0165	406.2936	349.8975	400.7334	369.7553	396.7618	548.9929	471.5964	471.5964	443.6218	454.8116	Albania	Albania	Albania
2	Albania	1996	Female	15.58	533.2684	481.0796	489.6479	490.4269	533.2684	401.2100	404.3872	387.7067	431.3938	401.2100	499.6643	428.7952	492.2044	512.7191	499.6643	Albania	Albania	Albania
3	Albania	1996	Female	15.67	412.2215	498.6836	415.3373	466.7472	454.2842	547.3630	481.4353	461.5776	425.0393	471.9036	438.6796	481.5740	448.9370	474.1141	426.5573	Albania	Albania	Albania
4	Albania	1996	Female	15.50	381.9209	328.1742	403.7311	418.5309	395.1628	311.7707	141.7883	293.5015	272.8495	260.1405	361.5628	275.7740	372.7527	403.5248	422.1746	Albania	Albania	Albania

Here, we have only kept variables that may aid in the exploration of our desired leads.

Before we begin visualizing the data, further wrangling is needed to keep the information clear when working with the dataset.

Further Data Wrangling

In [6]:

# Replace NaN age values with the mean age of students in the dataset
df['AGE'] = df['AGE'].fillna(df['AGE'].mean())

In [7]:

# Replace NaN or 'Invalid' values for the father/mother/child birth country with 'Missing',
# which the dataset already uses to represent missing information

for col in ['COBN_F', 'COBN_M', 'COBN_S']:
    df[col] = df[col].fillna('Missing').replace('Invalid', 'Missing')

In [8]:

# Check if there are any columns which still have unwrangled NA values
# If this function prints a column, it will also show the total number of NA values in the column, otherwise nothing will print
# The ideal case is that this function does not print anything, meaning there are no more NA values in our working dataset

for column in df.columns:
    if (df[column].isna().sum() > 0):
        print((column) + '  ' + str(df[column].isna().sum()))
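
The same check can be expressed without an explicit loop. The sketch below is a minimal illustration on a small made-up frame; the `toy` data and the helper name `na_report` are assumptions and not part of the notebook. It returns only the columns that still contain missing values.

```python
import numpy as np
import pandas as pd


def na_report(frame: pd.DataFrame) -> pd.Series:
    """Return the NA count per column, keeping only columns with at least one NA."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Hypothetical toy data, used only to illustrate the helper
toy = pd.DataFrame({
    'AGE': [15.5, np.nan, 16.0],
    'COBN_F': ['Albania', 'Missing', 'Albania'],
})

report = na_report(toy)
print(report)  # an empty Series would mean no NA values remain
```

As with the loop, an empty result is the ideal case.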

Within the dataset, each literacy subject has 5 plausible values (score estimates) recorded per student. We will use the average of these 5 values as the student's score in that subject, as computed below:

In [9]:

# Average the 5 plausible values to obtain each student's PISA score in a subject

df['Math Score'] = (df['PV1MATH'] + df['PV2MATH'] + df['PV3MATH'] + df['PV4MATH'] + df['PV5MATH']) / 5
df['Reading Score'] = (df['PV1READ'] + df['PV2READ'] + df['PV3READ'] + df['PV4READ'] + df['PV5READ']) / 5
df['Science Score'] = (df['PV1SCIE'] + df['PV2SCIE'] + df['PV3SCIE'] + df['PV4SCIE'] + df['PV5SCIE']) / 5
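
The sums above can also be written as a row-wise mean over each subject's plausible-value columns. A minimal sketch on made-up numbers follows; the `toy` frame is hypothetical. One behavioural difference is worth noting: `mean(axis=1)` skips NaN by default, whereas the explicit sum propagates it, so `skipna=False` is passed here to match the notebook's arithmetic.

```python
import pandas as pd

# Hypothetical toy data using the same column naming scheme as the PISA file
toy = pd.DataFrame({
    'PV1MATH': [400.0, 500.0], 'PV2MATH': [410.0, 510.0], 'PV3MATH': [420.0, 520.0],
    'PV4MATH': [430.0, 530.0], 'PV5MATH': [440.0, 540.0],
})

# Names of the five plausible-value columns for Math
pv_math = [f'PV{i}MATH' for i in range(1, 6)]

# Row-wise mean of the five plausible values
toy['Math Score'] = toy[pv_math].mean(axis=1, skipna=False)
print(toy['Math Score'].tolist())
```

The same pattern would apply to the `READ` and `SCIE` columns by changing the suffix in the list comprehension.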

Since we will work only with the mean scores from now on, the plausible values used to compute them are no longer needed and will be dropped.

In [10]:

# Drop columns that are no longer needed

df.drop(columns = ['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 
                   'PV5READ','PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'], inplace = True)

We now need to give our variables more descriptive names than those initially provided. We will once again use the data dictionary of the PISA 2012 test to decipher the meaning of the original variable names.

In [11]:

# Rename columns appropriately

df.rename({'CNT' : 'Country', 'ST03Q02' : 'Birth year', 'ST04Q01' : 'Gender', 'AGE' : 'Age', 'COBN_F' : 'Birth Country Father', 
           'COBN_M' : 'Birth Country Mother', 'COBN_S' : 'Birth Country Child'}, axis = 'columns', inplace = True)

Since we need to know whether a student comes from a homogeneous or heterogeneous family background, we will engineer a new variable that captures this information.

In [12]:

df['Parents - Same Cultural Background'] = (df['Birth Country Father'] == df['Birth Country Mother'])

In [13]:

# Map the boolean comparison result to descriptive labels in a single step
df['Parents - Same Cultural Background'] = df['Parents - Same Cultural Background'].map({True: 'Same', False: 'Different'})
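
One consequence of the comparison above is that students whose parents' birth countries are both 'Missing' are labelled 'Same', because the two placeholder strings compare equal. If that is not the intended behaviour, a three-way label is one option. The sketch below is an assumption about how such a label could be built, shown on made-up data; the 'Unknown' category and the `toy` frame are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering the four relevant cases
toy = pd.DataFrame({
    'Birth Country Father': ['Albania', 'Italy', 'Missing', 'Missing'],
    'Birth Country Mother': ['Albania', 'Albania', 'Missing', 'Albania'],
})

father = toy['Birth Country Father']
mother = toy['Birth Country Mother']

# A pair is 'Unknown' when either parent's birth country is missing
unknown = (father == 'Missing') | (mother == 'Missing')

# Conditions are evaluated in order, so 'Unknown' takes precedence over 'Same'
toy['Parents - Same Cultural Background'] = np.select(
    [unknown, father == mother],
    ['Unknown', 'Same'],
    default='Different',
)
print(toy['Parents - Same Cultural Background'].tolist())
```

Whether to keep or exclude the 'Unknown' group in later comparisons is a design choice that depends on how many students fall into it.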

The final form of the wrangled working dataset is shown below:

In [14]:

df.head()

Out[14]:

	Country	Birth year	Gender	Age	Birth Country Father	Birth Country Mother	Birth Country Child	Math Score	Reading Score	Science Score	Parents – Same Cultural Background
0	Albania	1996	Female	16.17	Albania	Albania	Albania	366.18634	261.01424	371.91348	Same
1	Albania	1996	Female	16.17	Albania	Albania	Albania	470.56396	384.68832	478.12382	Same
2	Albania	1996	Female	15.58	Albania	Albania	Albania	505.53824	405.18154	486.60946	Same
3	Albania	1996	Female	15.67	Albania	Albania	Albania	449.45476	477.46376	453.97240	Same
4	Albania	1996	Female	15.50	Albania	Albania	Albania	385.50398	256.01010	367.15778	Same

In [15]:

df.shape

Out[15]:

(485490, 11)

Univariate Exploration and Analysis

In this section, we will investigate the distributions of individual variables.

Visualisation 1

First, we look at the distribution of PISA scores in each subject: its overall shape and its modal interval. To decide what counts as an “exceptionally high” score, we first need to understand what is typical in the data and in which intervals most scores lie.

In [16]:

plt.figure(figsize = [17, 5])

bins_hist = np.arange(0, 1000 + 1, 100)

plt.subplot(1, 3, 1)
plt.hist(df['Math Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Math Score');
plt.ylabel('Nr. of students')
plt.title("Math Score distribution across students");

plt.subplot(1, 3, 2)
plt.hist(df['Reading Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Reading Score');
plt.title("Reading Score distribution across students");

plt.subplot(1, 3, 3)
plt.hist(df['Science Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Science Score');
plt.title("Science Score distribution across students");

From the above distributions, we find that:

The literacy scores follow a clear unimodal distribution in each subject
The vast majority of students score between 300 and 600 points in each subject, while a small portion scores lower (between 100 and 300 points) or higher (between 600 and 800 points)
In each of the three distributions, the most populated interval is the one between 400 and 500 points, which is one of the two central intervals of the plotted 0 to 1000 range. This suggests that the PISA 2012 test was reasonably balanced with respect to its scoring methodology and the difficulty of its questions
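
The statements above are read off the histograms by eye. As a rough sketch of how they could be checked numerically, the same bins can be passed to `np.histogram`. The `scores` array below is hypothetical and stands in for a column such as `df['Math Score']`:

```python
import numpy as np

# Hypothetical scores standing in for df['Math Score']
scores = np.array([250.0, 380.5, 420.0, 455.3, 470.1, 510.9, 560.0, 640.2])

bins_hist = np.arange(0, 1000 + 1, 100)          # same bins as the histograms above
counts, edges = np.histogram(scores, bins = bins_hist)

shares = counts / counts.sum()                    # fraction of students per interval
modal_bin = int(np.argmax(counts))                # index of the most populated interval
modal_interval = (int(edges[modal_bin]), int(edges[modal_bin + 1]))
share_300_600 = shares[3:6].sum()                 # intervals starting at 300, 400 and 500

print(modal_interval, share_300_600)
```

Running this on the real score columns would replace "vast majority" with an actual percentage for each subject.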

Visualisation 2

Next, we want to find out which countries host “exceptionally performing” students, and how many of them there are in each country and subject.

Since the histograms above show that practically all scores lie between 100 and 800 points, we treat scores above 800 points as “exceptionally high”.

In [17]:

# Retrieve entries of students with scores above 800 points in each subject

high_math_score = df[df['Math Score'] > 800]['Country'].value_counts()
high_reading_score = df[df['Reading Score'] > 800]['Country'].value_counts()
high_science_score = df[df['Science Score'] > 800]['Country'].value_counts()

In [18]:

plt.figure(figsize = [15, 5])
plt.subplots_adjust(wspace = 0.6) # adjust spacing between subplots, in order to show long country names nicely
x_lim_max = high_math_score.values[0] + 6 # '+6' is done in order to show the text counts properly next to the bars


plt.subplot(1, 3, 1)
sb.barplot(y = high_math_score.index, x = high_math_score.values, color = sb.color_palette()[0])
plt.title('Students with Math score > 800');
plt.xlabel('Nr. of students')
plt.ylabel('Countries (ordered by student count)')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_math_score[label.get_text()] + 1, s = high_math_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);    


plt.subplot(1, 3, 2)
sb.barplot(y = high_reading_score.index, x = high_reading_score.values, color = sb.color_palette()[0])
plt.xlim(0, x_lim_max);
plt.title('Students with Reading score > 800');
plt.xlabel('Nr. of students')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_reading_score[label.get_text()] + 1, s = high_reading_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);    


plt.subplot(1, 3, 3)
sb.barplot(y = high_science_score.index, x = high_science_score.values, color = sb.color_palette()[0])
plt.xlim(0, x_lim_max);
plt.title('Students with Science score > 800');
plt.xlabel('Nr. of students')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_science_score[label.get_text()] + 1, s = high_science_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);

Here, we have counted the exceptional students in each country for each subject. From these counts, we find that:

Asian countries and Oceania, in particular China and Singapore, seem to be where exceptional students are most likely to come from
Singapore is the clear leader in excellence, since it ranks among the top 3 countries with the most exceptional students in all three subjects
Korea and Australia also have a respectable number of excellent students in both Math and Science
Besides Asia and Oceania, a number of European and North American countries also appear in the rankings: Canada is present in all three of the above distributions, while Poland, Czech Republic and Switzerland appear to be educating Math-bright students
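
The counts above are absolute, so a country with more sampled students can show more high scorers simply because of its sample size. A possible complement, sketched here on hypothetical toy data (not PISA results), is to also compute the share of each country's students above the threshold:

```python
import pandas as pd

# Hypothetical toy data, not PISA results
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Math Score': [810.0, 500.0, 450.0, 820.0, 805.0, 480.0],
})

is_high = toy['Math Score'] > 800                      # same threshold as above

high_counts = is_high.groupby(toy['Country']).sum()    # absolute number of high scorers
high_share = is_high.groupby(toy['Country']).mean()    # fraction of each country's sample

print(high_counts.to_dict(), high_share.to_dict())
```

In this toy example country A has twice as many high scorers as country B, but both have the same share, which is the distinction the sketch is meant to illustrate.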

Visualisation 3

Lastly, we would like to find out whether students with heterogeneous family roots score differently, on average, from students raised by parents with a homogeneous family background.

For that, we first look at the proportion of students whose parents have the same or different cultural backgrounds:

In [19]:

plt.figure(figsize=[3, 5]);
sb.countplot(x = 'Parents - Same Cultural Background', data = df, color = sb.color_palette()[0]);

y_ticks = np.arange(0, 450000 + 1, 50000)
plt.yticks(y_ticks, y_ticks);
plt.ylabel("Nr. of students");
plt.title('Family background distribution');

From this distribution, we discover that there are almost 9 times as many students whose parents share the same cultural background as students whose parents have different backgrounds.
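
The ratio quoted above can also be computed directly instead of being read from the bar heights. A minimal sketch, using a hypothetical series in place of `df['Parents - Same Cultural Background']`:

```python
import pandas as pd

# Hypothetical labels: 9 students with 'Same' and 1 with 'Different'
background = pd.Series(['Same'] * 9 + ['Different'])

counts = background.value_counts()                   # absolute counts per category
shares = background.value_counts(normalize = True)   # proportions per category
ratio = counts['Same'] / counts['Different']         # how many times larger the 'Same' group is

print(ratio, shares['Same'])
```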

Bivariate Exploration and Analysis

In this section, we will investigate the relationships between pairs of relevant variables in our data.

Visualisation 4

Having seen the global distribution of scores for each literacy category, we now want to look at how the country of residence/education relates to the scores in each subject.

We have also previously identified countries where exceptional students come from, such as China, Singapore, Korea or Poland.

Was this just a strong deviation from the norm of scores within these countries, or are these countries also among the leaders when it comes to educating all of their students in these subjects?

In [20]:

plt.figure(figsize = [15, 35])
plt.subplots_adjust(wspace = 0.85) # adjust spacing between subplots, in order to show long country names nicely

math_score_country_order = df.groupby('Country')['Math Score'].mean().sort_values(ascending = False).index
reading_score_country_order = df.groupby('Country')['Reading Score'].mean().sort_values(ascending = False).index
science_score_country_order = df.groupby('Country')['Science Score'].mean().sort_values(ascending = False).index

plt.subplot(1, 3, 1)
sb.boxplot(x = df['Math Score'], y = df['Country'], order = math_score_country_order, color = sb.color_palette()[9]);
plt.ylabel('Countries (ordered descendingly by score ranking)')
plt.title('Math score distributions by country');

plt.subplot(1, 3, 2)
sb.boxplot(x = df['Reading Score'], y = df['Country'], order = reading_score_country_order, color = sb.color_palette()[9]);
plt.ylabel(''); # Remove the redundant label
plt.title('Reading score distributions by country');

plt.subplot(1, 3, 3)
sb.boxplot(x = df['Science Score'], y = df['Country'], order = science_score_country_order, color = sb.color_palette()[9]);
plt.ylabel(''); # Remove the redundant label
plt.title('Science score distributions by country');

Here we have, in decreasing order of mean score, the rankings of countries for each of the three subjects. We find that:

Most countries achieve similar rankings across subjects, with 80% of countries staying within 10 places of their own ranking in every other subject. The remaining 20% of countries, which show larger deviations, are analyzed further in Visualisation 8, where we look at which subjects deviate and by how much
The top of the rankings shows that it was no coincidence that exceptional students come from Asian countries, in particular China and Singapore, as these countries take 1st, 2nd or 3rd place across all subjects. Another country that stood out was Poland, with 3 exceptional students in Mathematics. This is no coincidence either, as Poland places in the top 10 in Reading and Science, and 11th in Mathematics
The box-and-whiskers plot allows us to see the outliers in scores for each country. For example, even though a number of Singapore’s students achieved outstanding results in Mathematics, these results are not “out of the ordinary” (in other words, not outliers) from the country’s perspective. The outstanding results in Science, however, can be considered slightly outside the regular bounds, since Singapore’s outliers are visible in the plot
Similar findings can be obtained by picking a country of interest and checking its score distribution, rankings and outliers accordingly
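
Whether a score is drawn as an outlier depends on the whisker rule of the box plot. With the default setting (`whis = 1.5`), points more than 1.5 times the interquartile range beyond the quartiles are drawn individually. The sketch below illustrates that rule on hypothetical scores; `outlier_bounds` is a helper introduced only for this illustration:

```python
import numpy as np

def outlier_bounds(values, whis = 1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Hypothetical scores for a single country
scores = np.array([400.0, 450.0, 500.0, 550.0, 600.0, 900.0])

low, high = outlier_bounds(scores)
outliers = scores[(scores < low) | (scores > high)]

print(low, high, outliers)
```

This is why the same absolute score can be an outlier in one country and an ordinary result in another: the limits depend on each country's own quartiles.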

Visualisation 5

Next, we want to see whether any countries had multi-talented students, performing exceptionally (score > 800) in more than one discipline at the same time.

In [21]:

# Use the same strict threshold (> 800) as in Visualisation 2 and in the plot titles below
high_math_and_reading_score = df[(df['Math Score'] > 800) & (df['Reading Score'] > 800)]['Country'].value_counts()
high_math_and_science_score = df[(df['Math Score'] > 800) & (df['Science Score'] > 800)]['Country'].value_counts()
high_reading_and_science_score = df[(df['Reading Score'] > 800) & (df['Science Score'] > 800)]['Country'].value_counts()

In [22]:

plt.figure(figsize = [15, 1])
plt.subplots_adjust(wspace = 0.6) # adjust spacing between subplots, in order to show long country names nicely
x_lim_max = high_math_and_science_score.values[0] # adjust the proportions of the x-axis with respect to all 3 plots using this

plt.subplot(1, 3, 1)
sb.barplot(y = high_math_and_reading_score.index, x = high_math_and_reading_score.values, color = sb.color_palette()[0])
plt.title('Students with Math & Reading scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

plt.subplot(1, 3, 2)
sb.barplot(y = high_math_and_science_score.index, x = high_math_and_science_score.values, color = sb.color_palette()[0])
plt.title('Students with Math & Science scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

plt.subplot(1, 3, 3)
sb.barplot(y = high_reading_and_science_score.index, x = high_reading_and_science_score.values, color = sb.color_palette()[0])
plt.title('Students with Reading & Science scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

We have again produced plots of the number of students with exceptional results, this time in multiple fields. We find that:

Singapore leads the world ranking of multi-talented outstanding results, being the only country present in all three possible pairs of subjects. Coupled with the previous findings on Singapore’s average and above-average student results, this suggests that the country is a world leader in providing multidisciplinary, high-quality education across all measured disciplines.
China is once again present in the rankings, together with (South) Korea, suggesting that, as with Singapore, these countries have a very high quality of education in the measured fields. Taken together, the previous visualizations of scores indicate that such outstanding results are no simple coincidence, but a natural extension of these countries’ already high mean scores.
No students reached scores above 800 in all three subjects simultaneously, which is why there is no further visualization on this matter.
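
The check for students excelling in all three subjects is not shown in the notebook. One way it could be done, sketched on hypothetical toy data, is to count for each student how many subjects exceed the threshold:

```python
import pandas as pd

# Hypothetical toy data; unlike the PISA 2012 result above, it contains one student above 800 in all subjects
toy = pd.DataFrame({
    'Math Score':    [820.0, 810.0, 500.0, 805.0],
    'Reading Score': [700.0, 815.0, 480.0, 801.0],
    'Science Score': [830.0, 790.0, 510.0, 802.0],
})

score_cols = ['Math Score', 'Reading Score', 'Science Score']

# Number of subjects (0 to 3) in which each student scores above 800
n_high_subjects = (toy[score_cols] > 800).sum(axis = 1)
n_all_three = int((n_high_subjects == 3).sum())

print(n_high_subjects.tolist(), n_all_three)
```

On the real data, `n_all_three` would be expected to equal 0, matching the statement above.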

Visualisation 6

Lastly, we investigate whether students whose parents come from different cultural (country) backgrounds perform differently from students whose parents share the same background.

In [23]:

plt.figure(figsize = [15, 4])
plt.subplots_adjust(wspace = 1.2)

plt.subplot(1, 3, 1)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Math Score'], palette = 'Set2')
plt.title('Math scores related to family background');

plt.subplot(1, 3, 2)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Reading Score'], palette = 'Set2')
plt.title('Reading scores related to family background');

plt.subplot(1, 3, 3)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Science Score'], palette = 'Set2');
plt.title('Science scores related to family background');

It can be seen that students from heterogeneous family backgrounds do show slightly higher scores in all three subjects, as indicated by the median lines of the box plots, than students from homogeneous family backgrounds.
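
Since the box plots only show the difference visually, a small summary table can make its size explicit. The following is a sketch on hypothetical toy values, assuming the same column names as in the working dataset:

```python
import pandas as pd

# Hypothetical toy data, not PISA results
toy = pd.DataFrame({
    'Parents - Same Cultural Background': ['Same', 'Same', 'Same', 'Different', 'Different'],
    'Math Score': [400.0, 450.0, 500.0, 470.0, 490.0],
})

# Group size, mean and median of the score for each family background
summary = toy.groupby('Parents - Same Cultural Background')['Math Score'].agg(['count', 'mean', 'median'])
median_gap = summary.loc['Different', 'median'] - summary.loc['Same', 'median']

print(summary)
print(median_gap)
```

Reporting the group sizes next to the medians is useful here because the two groups differ strongly in size.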

Multivariate Exploration and Analysis

In this section, we will investigate the relationships among three or more variables at the same time.

Visualisation 7

Out of curiosity, we would like to know whether it is normal that most countries with many students scoring high in one subject also had, on average, students scoring high in the other two subjects.

Therefore, we study the pairwise relationships between Math, Reading and Science scores in a grid of scatter plots, in order to see the type and strength of correlation between them.

In [24]:

grid = sb.pairplot(data = df, vars=["Math Score", "Reading Score", "Science Score"]);
grid.fig.suptitle("Pair-by-pair representation of different scores' correlation", y = 1.02);

As could be expected, there is a very strong positive correlation between every pair of the three score variables. This explains the similar performance across subjects observed for many countries’ students.
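
The strength of the correlation is judged here from the scatter plots alone. A correlation matrix would put a number on it. The sketch below uses hypothetical toy scores and Pearson's coefficient, which is an assumption on our side since the notebook does not name a measure:

```python
import pandas as pd

# Hypothetical toy scores, not PISA results
toy = pd.DataFrame({
    'Math Score':    [400.0, 450.0, 500.0, 550.0, 600.0],
    'Reading Score': [410.0, 440.0, 520.0, 540.0, 610.0],
    'Science Score': [390.0, 460.0, 495.0, 560.0, 590.0],
})

# Pairwise Pearson correlation between the three score columns
corr = toy.corr(method = 'pearson')

print(corr.round(3))
```

Applied to the real data, the same call would be `df[['Math Score', 'Reading Score', 'Science Score']].corr()`.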

Visualisation 8

Lastly, even though we found that a positive correlation between students’ scores is normal and that most countries have similar rankings across all three subjects, about 20% of countries deviate from this behaviour: their rankings differ by more than 10 places between at least two subjects.

We will look at which countries these are and in which subjects the differences occur.

In [25]:

country_outliers = []

for country in df['Country'].unique():
    # Positions are zero-based: 0 is the country with the highest mean score
    math_place = math_score_country_order.get_loc(country)
    reading_place = reading_score_country_order.get_loc(country)
    science_place = science_score_country_order.get_loc(country)

    # Keep countries whose ranking differs by more than 10 places between any two subjects
    if (abs(math_place - reading_place) > 10 or
        abs(math_place - science_place) > 10 or
        abs(reading_place - science_place) > 10):
        country_outliers.append(country)

country_outliers.sort() # Sort countries alphabetically

for country in country_outliers:
    label = country + ':'
    math_place = math_score_country_order.get_loc(country)
    reading_place = reading_score_country_order.get_loc(country)
    science_place = science_score_country_order.get_loc(country)
    # Pad the columns so that the places line up
    print(f"{label:<27}Math place: {math_place:<6}Reading place: {reading_place:<6}Science place: {science_place}")

Austria:                   Math place: 17    Reading place: 31    Science place: 23
Florida (USA):             Math place: 44    Reading place: 33    Science place: 40
Iceland:                   Math place: 27    Reading place: 40    Science place: 42
Ireland:                   Math place: 20    Reading place: 5     Science place: 13
Israel:                    Math place: 43    Reading place: 32    Science place: 44
Kazakhstan:                Math place: 53    Reading place: 65    Science place: 54
Macao-China:               Math place: 6     Reading place: 17    Science place: 14
Massachusetts (USA):       Math place: 19    Reading place: 8     Science place: 15
Norway:                    Math place: 31    Reading place: 22    Science place: 33
Slovak Republic:           Math place: 34    Reading place: 45    Science place: 43
Slovenia:                  Math place: 37    Reading place: 46    Science place: 32
Switzerland:               Math place: 9     Reading place: 26    Science place: 26
United States of America:  Math place: 39    Reading place: 25    Science place: 30
Vietnam:                   Math place: 15    Reading place: 19    Science place: 7
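
The same selection can be expressed without an explicit loop by ranking the per-country means. The sketch below is an alternative formulation on hypothetical toy data, not the code used to produce the listing above. It uses 1-based ranks and a toy threshold of 1 place instead of 10:

```python
import pandas as pd

# Hypothetical toy data with three countries and two students each
toy = pd.DataFrame({
    'Country':       ['A', 'A', 'B', 'B', 'C', 'C'],
    'Math Score':    [590.0, 610.0, 490.0, 510.0, 390.0, 410.0],
    'Reading Score': [390.0, 410.0, 490.0, 510.0, 590.0, 610.0],
    'Science Score': [590.0, 610.0, 490.0, 510.0, 390.0, 410.0],
})

score_cols = ['Math Score', 'Reading Score', 'Science Score']

country_means = toy.groupby('Country')[score_cols].mean()
ranks = country_means.rank(ascending = False).astype(int)   # 1 = highest mean score

# Largest difference in rank between any two subjects, per country
rank_spread = ranks.max(axis = 1) - ranks.min(axis = 1)
deviating = rank_spread[rank_spread > 1].index.tolist()     # the notebook uses 10 here

print(ranks)
print(deviating)
```

The difference between the highest and lowest rank of a country equals the largest of the three pairwise differences, so the criterion matches the one used in the loop.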

In [26]:

df_country_outliers = df[['Country', 'Math Score', 'Reading Score', 'Science Score']][df['Country'].isin(country_outliers)]
df_country_outliers = df_country_outliers.melt('Country', var_name = 'Score Type', value_name = 'Scores')

In [27]:

plt.figure(figsize = [7, 10])

sb.pointplot(x = 'Scores', y = 'Country', hue = 'Score Type', data = df_country_outliers, linestyles = '', dodge = 0.4, ci = 'sd', palette = 'deep', order = country_outliers);
plt.legend(loc = 2);
plt.title('Score distribution across subjects of countries with strong deviations in results');

From this visualization, we can better understand the deviations in the outlier countries’ score patterns. For example, Kazakhstan deviates because of much weaker scores in Reading than in the other two disciplines, while Switzerland deviates through much higher scores in Mathematics than in the other two subjects. The same analysis can be performed for any country of interest.
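
The point plot draws, for each country and subject, the mean score with a bar for the standard deviation. As a sketch of how the underlying numbers could be tabulated, the long-format frame produced by `melt` can be grouped and aggregated. The toy data is hypothetical, and pandas' default sample standard deviation is an assumption that may differ slightly from what seaborn draws:

```python
import pandas as pd

# Hypothetical toy data in the wide format of the working dataset
toy = pd.DataFrame({
    'Country':       ['A', 'A', 'B', 'B'],
    'Math Score':    [500.0, 520.0, 400.0, 440.0],
    'Reading Score': [450.0, 470.0, 430.0, 450.0],
})

# Wide to long: one row per (student, subject) pair
long_form = toy.melt('Country', var_name = 'Score Type', value_name = 'Scores')

# Mean and standard deviation per country and subject
summary = long_form.groupby(['Country', 'Score Type'])['Scores'].agg(['mean', 'std'])

print(summary)
```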

Conclusions and Answers

Lastly, we will restate our three questions of interest, along with a summary of our conclusions:

How do students from individual countries perform in Math, Reading and Science literacy?
- We have found the literacy trends for each country, along with the countries whose rankings deviate strongly between subjects. The results can be explored in Visualisations 4 and 8.

What are the countries from which “geniuses” stem, meaning which countries have students with exceptionally high literacy scores?
- Generally, we have seen Asian countries as being the world leaders at educating students who perform exceptionally, not only in each different subject individually, but also at multiple subjects simultaneously. Singapore and China are examples of this.

Do students whose parents have different cultural backgrounds report any changes in average scores, compared with students raised in a homogeneous family background?
- We have discovered that students whose parents are from two different cultural backgrounds report, on average, slightly higher performance (scores) in all three measured subjects.
