Table of Contents

  • Introduction of the topic and dataset
  • Dataset Investigation and preliminary wrangling
  • Further Data Wrangling
  • Univariate Exploration and Analysis
  • Bivariate Exploration and Analysis
  • Multivariate Exploration and Analysis
  • Conclusions and Answers

Introduction of the topic and dataset

Introduction to PISA

What Is PISA?

The Program for International Student Assessment (PISA) is a system of international assessments that allows countries to compare outcomes of learning as students near the end of compulsory schooling. PISA core assessments measure the performance of 15-year-old students in mathematics, science, and reading literacy every 3 years. Coordinated by the Organization for Economic Cooperation and Development (OECD), PISA was first implemented in 2000 in 32 countries. It has since grown to 65 education systems in 2012.

Project Aims

Project: Data Exploration of the performance of globally-selected 15/16-year-old students in Mathematics, Reading and Science Literacy, based on the results of the PISA 2012 test

What PISA Measures

PISA’s goal is to assess students’ preparation for the challenges of life as young adults. PISA assesses the application of knowledge in mathematics, science, and reading literacy to problems within a real-life context (OECD 1999). PISA does not focus explicitly on curricular outcomes and uses the term “literacy” in each subject area to indicate its broad focus on the application of knowledge and skills. For example, when assessing mathematics, PISA examines how well 15-year-old students can understand, use, and reflect on mathematics for a variety of real-life problems and settings that they may not encounter in the classroom. Scores on the PISA scales represent skill levels along a continuum of literacy skills.

Each PISA data collection cycle assesses one of the three core subject areas in depth (considered the major subject area), although all three core subjects are assessed in each cycle (the other two subjects are considered minor subject areas for that assessment year). Assessing all three subjects every 3 years allows countries to have a consistent source of achievement data in each of the three subjects, while rotating one area as the primary focus over the years. Mathematics was the major subject area in 2012, as it was in 2003, since each subject is a major subject area once every three cycles. In 2012, mathematics, science, and reading literacy were assessed primarily through a paper-and-pencil assessment, and problem solving was administered via a computer-based assessment. In addition to these core assessments, education systems could participate in optional paper-based financial literacy and computer-based mathematics and reading assessments. The United States participated in these optional assessments. Visit www.nces.ed.gov/surveys/pisa for more information on the PISA assessments, including information on how the assessments were designed and examples of PISA questions.

Ref.: NCES 2014-024, U.S. Department of Education

Introduction to the PISA 2012 dataset

PISA is a survey of students’ skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.

Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.

In [1]:

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

sb.set()

# We are interested in exploring the formatting of all columns (variables), hence we will display all of them
pd.set_option('display.max_rows', 636)
pd.set_option('display.max_columns', 636)

In [2]:

df = pd.read_csv('pisa2012.csv', encoding='latin-1', low_memory = False)

In [3]:

df.head(3)

Out[3]:

[Output omitted: the first three rows of the raw dataset, all of them students from Albania. The table is several hundred columns wide, running from Unnamed: 0, CNT, SUBNATIO, STRATUM and OECD through the questionnaire items (ST…, IC…, EC…), the derived indices, the plausible values (PV1MATH … PV5SCIE) and the replicate weights (W_FSTR1 … W_FSTR80) to VER_STU.]

As the output above shows, the dataset is very wide, with far more variables than we can explore at once.

After consulting the dataset dictionary to find out what each of these columns represents, we settled on the following leads to explore:

  • We are interested in finding out how students from individual countries perform in Math, Reading and Science literacy.
    • For that, we will check the worldwide and per-country distributions of Math, Reading and Science literacy scores, individually.
  • Having seen the countries’ average literacy patterns in the different subjects, we are also curious about which countries the “geniuses” come from, meaning which countries have students with exceptionally high literacy scores.
    • For that, we will check the distribution of exceptional scores in Math, Reading and Science literacy, grouped by country.
  • Lastly, we would like to find out whether students whose parents have different cultural backgrounds show different average scores from students raised in a homogeneous family background.
    • For that, we will compare the distribution of mean scores in each subject between students with a homogeneous family background (parents born in the same country) and students with a heterogeneous family background (parents born in two different countries).

In [4]:

# Keep only the columns needed for the analysis; .copy() avoids SettingWithCopyWarning in later assignments
df = df[['CNT', 'ST03Q02', 'ST04Q01', 'AGE', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 
         'PV3READ', 'PV4READ', 'PV5READ','PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'COBN_F', 'COBN_M', 'COBN_S']].copy()

In [5]:

df.head()

Out[5]:

   CNT      ST03Q02  ST04Q01  AGE    PV1MATH   PV2MATH   PV3MATH   PV4MATH   PV5MATH
0  Albania  1996     Female   16.17  406.8469  376.4683  344.5319  321.1637  381.9209
1  Albania  1996     Female   16.17  486.1427  464.3325  453.4273  472.9008  476.0165
2  Albania  1996     Female   15.58  533.2684  481.0796  489.6479  490.4269  533.2684
3  Albania  1996     Female   15.67  412.2215  498.6836  415.3373  466.7472  454.2842
4  Albania  1996     Female   15.50  381.9209  328.1742  403.7311  418.5309  395.1628

   PV1READ   PV2READ   PV3READ   PV4READ   PV5READ
0  249.5762  254.3420  406.8496  175.7053  218.5981
1  406.2936  349.8975  400.7334  369.7553  396.7618
2  401.2100  404.3872  387.7067  431.3938  401.2100
3  547.3630  481.4353  461.5776  425.0393  471.9036
4  311.7707  141.7883  293.5015  272.8495  260.1405

   PV1SCIE   PV2SCIE   PV3SCIE   PV4SCIE   PV5SCIE   COBN_F   COBN_M   COBN_S
0  341.7009  408.8400  348.2283  367.8105  392.9877  Albania  Albania  Albania
1  548.9929  471.5964  471.5964  443.6218  454.8116  Albania  Albania  Albania
2  499.6643  428.7952  492.2044  512.7191  499.6643  Albania  Albania  Albania
3  438.6796  481.5740  448.9370  474.1141  426.5573  Albania  Albania  Albania
4  361.5628  275.7740  372.7527  403.5248  422.1746  Albania  Albania  Albania

Here, we have only kept variables that may aid in the exploration of our desired leads.
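
If memory or load time is a concern, the same column selection can also happen while the file is being read. The following is a minimal sketch on a made-up in-memory CSV; `csv_text`, `keep` and `small` are illustrative names that are not part of the notebook, and the `EXTRA` column is invented. `usecols` is a standard `pandas.read_csv` parameter.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for pisa2012.csv
csv_text = (
    "CNT,AGE,PV1MATH,EXTRA\n"
    "Albania,16.17,406.8469,x\n"
    "Albania,15.58,533.2684,y\n"
)

# Columns we want to keep; everything else is skipped while parsing
keep = ['CNT', 'AGE', 'PV1MATH']

small = pd.read_csv(io.StringIO(csv_text), usecols=keep)
print(small.shape)
```

With the real file, the same `usecols` list could be passed alongside the `encoding` and `low_memory` arguments already used above.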

Before we begin visualizing the data, further wrangling is needed to ensure clarity of information when working with the dataset.

Further Data Wrangling

In [6]:

# Replace missing (NaN) age values with the mean age of students in the dataset
df['AGE'] = df['AGE'].fillna(df['AGE'].mean())

In [7]:

# Replace NaN or 'Invalid' values in the father/mother/student birth country columns with 'Missing',
# which the dataset already uses to represent missing information
for col in ['COBN_F', 'COBN_M', 'COBN_S']:
    df[col] = df[col].fillna('Missing').replace('Invalid', 'Missing')

In [8]:

# Check if there are any columns which still have unwrangled NA values
# If this function prints a column, it will also show the total number of NA values in the column, otherwise nothing will print
# The ideal case is that this function does not print anything, meaning there are no more NA values in our working dataset

for column in df.columns:
    if (df[column].isna().sum() > 0):
        print((column) + '  ' + str(df[column].isna().sum()))
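
The same check can be written without an explicit loop. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for the working dataset, with invented values) and keeps only the columns that still contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the working dataset
toy = pd.DataFrame({
    'AGE': [15.5, np.nan, 16.0],
    'COBN_F': ['Albania', 'Missing', 'Albania'],
    'PV1MATH': [400.0, np.nan, np.nan],
})

# Count missing values per column, then keep only the non-zero counts
na_counts = toy.isna().sum()
na_counts = na_counts[na_counts > 0]

# An empty Series here would mean that no missing values remain
print(na_counts)
```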

Within the dataset, each student has 5 plausible values recorded for each literacy subject. We will use the mean of the 5 plausible values as a single working score per student and subject, as computed below:

In [9]:

# The mean of the five plausible values serves as the student's working score in each subject

df['Math Score'] = (df['PV1MATH'] + df['PV2MATH'] + df['PV3MATH'] + df['PV4MATH'] + df['PV5MATH']) / 5
df['Reading Score'] = (df['PV1READ'] + df['PV2READ'] + df['PV3READ'] + df['PV4READ'] + df['PV5READ']) / 5
df['Science Score'] = (df['PV1SCIE'] + df['PV2SCIE'] + df['PV3SCIE'] + df['PV4SCIE'] + df['PV5SCIE']) / 5
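
The row-wise mean can also be taken over a list of column names, which avoids spelling out every term. The sketch below uses a made-up frame (`toy` and its values are hypothetical) and also shows an alternative route for group statistics, where the statistic is computed once per plausible value and the five results are then averaged. The second route is included only as an assumption about how one might cross-check results; it is not what this notebook does.

```python
import pandas as pd

# Hypothetical toy frame: five math plausible values for three students
toy = pd.DataFrame({
    'CNT': ['A', 'A', 'B'],
    'PV1MATH': [400.0, 500.0, 600.0],
    'PV2MATH': [410.0, 510.0, 610.0],
    'PV3MATH': [420.0, 520.0, 620.0],
    'PV4MATH': [430.0, 530.0, 630.0],
    'PV5MATH': [440.0, 540.0, 640.0],
})

math_cols = [f'PV{i}MATH' for i in range(1, 6)]

# Per-student working score: row-wise mean of the five plausible values
toy['Math Score'] = toy[math_cols].mean(axis=1)

# Alternative for group statistics: one country mean per plausible value,
# then the average of those five country means
country_mean = toy.groupby('CNT')[math_cols].mean().mean(axis=1)

print(toy['Math Score'].tolist())
print(country_mean)
```

For a plain mean without missing values, both routes give the same number. They can differ for other statistics, such as percentiles or counts above a threshold.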

Since we will work only with the mean scores from now on, the plausible values used to compute them are no longer needed and will be dropped.

In [10]:

# Drop the plausible-value columns, which are no longer needed

df.drop(columns = ['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 
                   'PV5READ','PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'], inplace = True)

We now need to give our variables more descriptive names than the ones initially provided. We will decipher the meaning of the initial variable names using, once again, the data dictionary of the PISA 2012 test.

In [11]:

# Rename columns appropriately

df.rename({'CNT' : 'Country', 'ST03Q02' : 'Birth year', 'ST04Q01' : 'Gender', 'AGE' : 'Age', 'COBN_F' : 'Birth Country Father', 
           'COBN_M' : 'Birth Country Mother', 'COBN_S' : 'Birth Country Child'}, axis = 'columns', inplace = True)

Since we need to find out whether a student comes from a homogeneous or heterogeneous family background, we will engineer a new variable that records this information.

In [12]:

df['Parents - Same Cultural Background'] = (df['Birth Country Father'] == df['Birth Country Mother'])

In [13]:

# Map the boolean comparison result to readable labels in one step
df['Parents - Same Cultural Background'] = df['Parents - Same Cultural Background'].map({True: 'Same', False: 'Different'})
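
One caveat of a plain equality test is that two parents whose birth country is recorded as 'Missing' compare as equal and therefore end up labelled 'Same'. A minimal sketch of one way to keep such rows apart is shown below; the 'Unknown' label and the `toy` frame are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame with the renamed parent birth-country columns
toy = pd.DataFrame({
    'Birth Country Father': ['Albania', 'Italy', 'Missing', 'Missing'],
    'Birth Country Mother': ['Albania', 'Albania', 'Albania', 'Missing'],
})

father = toy['Birth Country Father']
mother = toy['Birth Country Mother']

# Rows where at least one parent's birth country is not known
either_missing = (father == 'Missing') | (mother == 'Missing')

# Label by equality first, then overwrite the rows with missing information
label = (father == mother).map({True: 'Same', False: 'Different'})
toy['Parents - Same Cultural Background'] = label.mask(either_missing, 'Unknown')

print(toy['Parents - Same Cultural Background'].tolist())
```

Whether such rows should be excluded or kept as their own category depends on the question being asked.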

The final form of the wrangled working dataset is shown below:

In [14]:

df.head()

Out[14]:

   Country  Birth year  Gender  Age    Birth Country Father  Birth Country Mother  Birth Country Child
0  Albania  1996        Female  16.17  Albania               Albania               Albania
1  Albania  1996        Female  16.17  Albania               Albania               Albania
2  Albania  1996        Female  15.58  Albania               Albania               Albania
3  Albania  1996        Female  15.67  Albania               Albania               Albania
4  Albania  1996        Female  15.50  Albania               Albania               Albania

   Math Score  Reading Score  Science Score  Parents - Same Cultural Background
0  366.18634   261.01424      371.91348      Same
1  470.56396   384.68832      478.12382      Same
2  505.53824   405.18154      486.60946      Same
3  449.45476   477.46376      453.97240      Same
4  385.50398   256.01010      367.15778      Same

In [15]:

df.shape

Out[15]:

(485490, 11)

Univariate Exploration and Analysis

In this section, we will investigate the distributions of individual variables.

Visualisation 1

First, we look at the distribution of PISA scores for each subject separately, noting its shape and its most populated interval. To decide what counts as an “exceptionally high” score, we first need to understand what is typical in the data and in which intervals most scores lie.

In [16]:

plt.figure(figsize = [17, 5])

bins_hist = np.arange(0, 1000 + 1, 100)

plt.subplot(1, 3, 1)
plt.hist(df['Math Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Math Score');
plt.ylabel('Nr. of students')
plt.title("Math Score distribution across students");

plt.subplot(1, 3, 2)
plt.hist(df['Reading Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Reading Score');
plt.title("Reading Score distribution across students");

plt.subplot(1, 3, 3)
plt.hist(df['Science Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Science Score');
plt.title("Science Score distribution across students");

From the above distributions, we find out that:

  • With 100-point bins, the literacy scores in each subject follow a smooth, unimodal distribution
  • The vast majority of students score between 300 and 600 points in each subject, while a small portion scores lower (between 100 and 300 points) or higher (between 600 and 800 points)
  • For each of the three subjects, the most populated interval is 400 to 500 points. This is consistent with how PISA scores are reported: the scale is standardised to a mean of about 500 points and a standard deviation of about 100 points across OECD countries, so the concentration around the middle reflects the scaling rather than the difficulty of the questions
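
Statements such as "the vast majority score between 300 and 600" can be backed by numbers as well as by the histogram. The sketch below uses a short made-up series of scores (`scores` is hypothetical) to compute the share of values inside an interval and a few percentiles.

```python
import pandas as pd

# Hypothetical scores standing in for one of the score columns
scores = pd.Series([250.0, 320.0, 410.0, 455.0, 480.0, 530.0, 590.0, 640.0])

# Share of students scoring between 300 and 600 points (both ends included)
share_300_600 = scores.between(300, 600).mean()

# A few percentiles to locate the bulk and the tails of the distribution
summary = scores.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

print(share_300_600)
print(summary)
```

Applied to the real columns, the upper percentiles would also give a data-driven alternative to a fixed cut-off when defining exceptional scores.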

Visualisation 2

Next, we want to find out which countries have “exceptionally performing” students, and how many, for each subject.

Since the histograms show visible counts only between roughly 100 and 800 points, we define “exceptionally high” scores as those above 800 points.

In [17]:

# Retrieve entries of students with scores above 800 points in each subject

high_math_score = df[df['Math Score'] > 800]['Country'].value_counts()
high_reading_score = df[df['Reading Score'] > 800]['Country'].value_counts()
high_science_score = df[df['Science Score'] > 800]['Country'].value_counts()

In [18]:

plt.figure(figsize = [15, 5])
plt.subplots_adjust(wspace = 0.6) # adjust spacing between subplots, in order to show long country names nicely
x_lim_max = high_math_score.values[0] + 6 # '+6' is done in order to show the text counts properly next to the bars


plt.subplot(1, 3, 1)
sb.barplot(y = high_math_score.index, x = high_math_score.values, color = sb.color_palette()[0])
plt.title('Students with Math score > 800');
plt.xlabel('Nr. of students')
plt.ylabel('Countries (ordered by student count)')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_math_score[label.get_text()] + 1, s = high_math_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);    


plt.subplot(1, 3, 2)
sb.barplot(y = high_reading_score.index, x = high_reading_score.values, color = sb.color_palette()[0])
plt.xlim(0, x_lim_max);
plt.title('Students with Reading score > 800');
plt.xlabel('Nr. of students')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_reading_score[label.get_text()] + 1, s = high_reading_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);    


plt.subplot(1, 3, 3)
sb.barplot(y = high_science_score.index, x = high_science_score.values, color = sb.color_palette()[0])
plt.xlim(0, x_lim_max);
plt.title('Students with Science score > 800');
plt.xlabel('Nr. of students')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_science_score[label.get_text()] + 1, s = high_science_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);

Here, we have counted the total number of exceptional students per country for each subject. From these counts, we find that:

  • Asian and Oceanian countries, in particular China and Singapore, seem to be the places where exceptional students are most likely to come from
  • Singapore is the clear leader in excellence, since it ranks among the top 3 countries with the most exceptional students in all three areas of study
  • Korea and Australia also have a respectable number of excellent students in both Math and Science
  • Besides Asia and Oceania, a number of European and North American countries also make the podium: Canada is present in all three of the above distributions, while Poland, the Czech Republic and Switzerland appear to be educating Math-bright students
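
The three per-subject counts above follow the same pattern, so they can also be gathered in a single loop. The following is a minimal sketch on invented toy data: the country labels `A`, `B`, `C` and all scores are hypothetical, and only the column names are taken from the analysis.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; all values are invented for illustration.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'C'],
    'Math Score': [810.0, 820.0, 700.0, 805.0],
    'Reading Score': [650.0, 801.0, 790.0, 600.0],
    'Science Score': [700.0, 790.0, 805.0, 810.0],
})

subjects = ['Math Score', 'Reading Score', 'Science Score']
threshold = 800  # same cut-off as used for "exceptionally high" scores

# One value_counts() Series per subject: number of students above the threshold, by country
high_counts = {
    subject: toy.loc[toy[subject] > threshold, 'Country'].value_counts()
    for subject in subjects
}

for subject, counts in high_counts.items():
    print(subject, counts.to_dict())
```

Keeping the counts in a dictionary keyed by subject makes it easier to reuse the same plotting code for each panel instead of repeating it three times.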

Visualisation 3

Lastly, we would like to find out whether students with heterogeneous family roots achieve different scores, on average, than students raised by parents with a homogeneous family background.

To do so, we first look at the proportion of students whose parents share the same cultural background versus those whose parents do not:

In [19]:

plt.figure(figsize=[3, 5]);
sb.countplot(x = 'Parents - Same Cultural Background', data = df, color = sb.color_palette()[0]);

y_ticks = np.arange(0, 450000 + 1, 50000)
plt.yticks(y_ticks, y_ticks);
plt.ylabel("Nr. of students");
plt.title('Family background distribution');

From this distribution, we discover that there are almost 9 times more students whose parents share the same cultural background than students whose parents come from different backgrounds.
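
A ratio like this can be read off the bar chart, but it can also be computed directly. The sketch below uses invented toy data, and the category labels `'Same'` and `'Different'` are assumptions, since the actual values stored in the column are not shown here.

```python
import pandas as pd

# Toy stand-in; the labels 'Same' / 'Different' and the counts are hypothetical.
toy = pd.DataFrame({
    'Parents - Same Cultural Background': ['Same'] * 9 + ['Different'] * 1
})

counts = toy['Parents - Same Cultural Background'].value_counts()

# How many same-background students there are per different-background student
ratio = counts['Same'] / counts['Different']

print(counts.to_dict())
print(f"Ratio same / different: {ratio:.1f}")
```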

Bivariate Exploration and Analysis

In this section, we will investigate the relationships between pairs of relevant variables in our data.

Visualisation 4

Having found the global distribution of scores for each literacy category, we now look into how the country of residence/education relates to the scores in each individual subject.

We have also previously identified countries that exceptional students come from, such as China, Singapore, Korea or Poland.

Were these students just a strong deviation from the norm within these countries, or are these countries also among the leaders when it comes to educating all of their students in these subjects?

In [20]:

plt.figure(figsize = [15, 35])
plt.subplots_adjust(wspace = 0.85) # adjust spacing between subplots, in order to show long country names nicely

math_score_country_order = df.groupby('Country')['Math Score'].mean().sort_values(ascending = False).index
reading_score_country_order = df.groupby('Country')['Reading Score'].mean().sort_values(ascending = False).index
science_score_country_order = df.groupby('Country')['Science Score'].mean().sort_values(ascending = False).index

plt.subplot(1, 3, 1)
sb.boxplot(x = df['Math Score'], y = df['Country'], order = math_score_country_order, color = sb.color_palette()[9]);
plt.ylabel('Countries (ordered descendingly by score ranking)')
plt.title('Math score distributions by country');

plt.subplot(1, 3, 2)
sb.boxplot(x = df['Reading Score'], y = df['Country'], order = reading_score_country_order, color = sb.color_palette()[9]);
plt.ylabel(''); # Remove the redundant label
plt.title('Reading score distributions by country');

plt.subplot(1, 3, 3)
sb.boxplot(x = df['Science Score'], y = df['Country'], order = science_score_country_order, color = sb.color_palette()[9]);
plt.ylabel(''); # Remove the redundant label
plt.title('Science score distributions by country');

Here we have, in decreasing order of mean score, the rankings of countries for each of the three subjects. We find that:

  • Most countries achieve similar rankings across subjects: roughly 80% of countries have rankings that differ by at most 10 places across the three subjects. The remaining 20% of countries, which show larger deviations, will be analyzed further on, as we would like to find out in which subjects they deviate and how large the differences are
  • From the top of the rankings, we can clearly see that it was no coincidence that exceptional students come from Asian countries, in particular China and Singapore, as these are the countries in 1st, 2nd or 3rd place across all subjects. Another country which impressed was Poland, with 3 exceptional students in Mathematics. This is no coincidence either, as Poland occupies a spot in the top 10 in Reading and Science, and is 11th in Mathematics
  • The box-and-whiskers plots also show the outliers in scores for each individual country. For example, although a number of Singapore’s students achieved outstanding results in Mathematics, these results are not outliers relative to the country’s own distribution. The outstanding results in Science, however, do fall slightly outside the regular bounds, as Singapore’s outliers in the plot show
  • Similar findings can be obtained by picking a country of interest and checking its score distribution, rankings and outliers accordingly
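
The share of countries whose rankings stay close together across subjects can be computed rather than read off the plots. The following is a rough sketch on invented toy data (three hypothetical countries `A`, `B`, `C`), using a maximum gap of 1 place where the full country list would use 10.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; all values are invented for illustration.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Math Score': [600.0, 620.0, 500.0, 520.0, 400.0, 420.0],
    'Reading Score': [450.0, 470.0, 580.0, 600.0, 500.0, 520.0],
    'Science Score': [590.0, 610.0, 510.0, 530.0, 410.0, 430.0],
})

subjects = ['Math Score', 'Reading Score', 'Science Score']

# Mean score per country and subject, then rank within each subject (1 = best)
mean_scores = toy.groupby('Country')[subjects].mean()
ranks = mean_scores.rank(ascending=False, method='min').astype(int)

# Gap between each country's worst and best rank across the three subjects
rank_spread = ranks.max(axis=1) - ranks.min(axis=1)

max_gap = 1  # hypothetical value for the toy data; the analysis uses 10
share_consistent = (rank_spread <= max_gap).mean()

print(ranks)
print(f"Share of countries within {max_gap} place(s): {share_consistent:.0%}")
```

Note that `rank` produces 1-based places, whereas positions obtained from an ordered index with `get_loc` are 0-based.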

Visualisation 5

Next, we would like to see whether any countries had multi-talented students, performing exceptionally (score > 800) in more than one discipline at the same time.

In [21]:

# Use the same strict threshold (> 800) as in the single-subject counts and the plot titles
high_math_and_reading_score = df[(df['Math Score'] > 800) & (df['Reading Score'] > 800)]['Country'].value_counts()
high_math_and_science_score = df[(df['Math Score'] > 800) & (df['Science Score'] > 800)]['Country'].value_counts()
high_reading_and_science_score = df[(df['Reading Score'] > 800) & (df['Science Score'] > 800)]['Country'].value_counts()

In [22]:

plt.figure(figsize = [15, 1])
plt.subplots_adjust(wspace = 0.6) # adjust spacing between subplots, in order to show long country names nicely
x_lim_max = high_math_and_science_score.values[0] # adjust the proportions of the x-axis with respect to all 3 plots using this

plt.subplot(1, 3, 1)
sb.barplot(y = high_math_and_reading_score.index, x = high_math_and_reading_score.values, color = sb.color_palette()[0])
plt.title('Students with Math & Reading scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

plt.subplot(1, 3, 2)
sb.barplot(y = high_math_and_science_score.index, x = high_math_and_science_score.values, color = sb.color_palette()[0])
plt.title('Students with Math & Science scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

plt.subplot(1, 3, 3)
sb.barplot(y = high_reading_and_science_score.index, x = high_reading_and_science_score.values, color = sb.color_palette()[0])
plt.title('Students with Reading & Science scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

We have once again produced plots to visualize the number of students with exceptional results, this time in multiple fields. We find that:

  • Singapore leads the world ranking of multi-talented outstanding results, being the only country present in all three possible combinations of two subjects. Combined with the earlier findings on Singapore’s average and above-average student results, this suggests that the country is a world leader in providing multidisciplinary, high-quality education across all measured disciplines.
  • China is once again present in the rankings, together with (South) Korea, which suggests that, as with Singapore, these countries offer a very high quality of education in the measured fields. Taken together, the previous visualizations suggest that such outstanding results are not a mere coincidence: they form the upper tail of score distributions whose means are already high.
  • No students managed to reach scores above 800 in all three subjects simultaneously, which is why there is no further visualization on this matter.
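
The check for students above the threshold in two or in all three subjects can be expressed with a single boolean frame. This is a minimal sketch on invented toy data; the country labels and scores are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; all values are invented for illustration.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'C'],
    'Math Score': [810.0, 805.0, 790.0, 820.0],
    'Reading Score': [805.0, 700.0, 810.0, 650.0],
    'Science Score': [780.0, 815.0, 820.0, 640.0],
})

subjects = ['Math Score', 'Reading Score', 'Science Score']
threshold = 800

# Boolean frame: True where a student's score exceeds the threshold
above = toy[subjects] > threshold

# Number of subjects in which each student is above the threshold
n_subjects_above = above.sum(axis=1)

# Students above the threshold in all three subjects (empty in this toy data)
all_three = toy[above.all(axis=1)]

# Countries of students above the threshold in at least two subjects
at_least_two = toy[n_subjects_above >= 2]['Country'].value_counts()

print(len(all_three))
print(at_least_two.to_dict())
```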

Visualisation 6

Lastly, we investigate whether students whose parents come from different cultural (country) backgrounds perform differently from students whose parents share the same background.

In [23]:

plt.figure(figsize = [15, 4])
plt.subplots_adjust(wspace = 1.2)

plt.subplot(1, 3, 1)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Math Score'], palette = 'Set2')
plt.title('Math scores related to family background');

plt.subplot(1, 3, 2)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Reading Score'], palette = 'Set2')
plt.title('Reading scores related to family background');

plt.subplot(1, 3, 3)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Science Score'], palette = 'Set2');
plt.title('Science scores related to family background');

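The box plots suggest that students from heterogeneous family backgrounds do achieve somewhat higher scores in all three subjects than students from homogeneous family backgrounds. Note that a box plot marks the median rather than the mean, so this is a comparison of typical scores rather than of strict averages.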
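
Since box plots display medians, a numeric summary can help confirm whether the group means point the same way. The sketch below uses invented toy data, and the category labels `'Same'` and `'Different'` are assumptions about how the column is encoded.

```python
import pandas as pd

# Toy stand-in; labels and scores are hypothetical.
toy = pd.DataFrame({
    'Parents - Same Cultural Background': ['Same', 'Same', 'Same', 'Different', 'Different'],
    'Math Score': [400.0, 500.0, 600.0, 480.0, 560.0],
    'Reading Score': [420.0, 480.0, 540.0, 500.0, 520.0],
    'Science Score': [410.0, 500.0, 590.0, 505.0, 535.0],
})

subjects = ['Math Score', 'Reading Score', 'Science Score']

# Mean and median per family-background group and subject;
# the result has two-level columns: (subject, statistic)
summary = toy.groupby('Parents - Same Cultural Background')[subjects].agg(['mean', 'median'])

print(summary)
```

With group sizes as unequal as in the real data, it is also worth reporting the group counts next to these statistics.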

Multivariate Exploration and Analysis

In this section, we will investigate the interactions among three or more variables at the same time.

Visualisation 7

Out of curiosity, we would like to know whether it is normal that most countries with many students scoring high in one subject also had, on average, students scoring high in the other two subjects.

Therefore, we study the pairwise relationships between Math Scores, Reading Scores and Science Scores in pair scatter plots, in order to see what type and strength of correlation exists.

In [24]:

grid = sb.pairplot(data = df, vars=["Math Score", "Reading Score", "Science Score"]);
grid.fig.suptitle("Pair-by-pair representation of different scores' correlation", y = 1.02);

As might be expected, there is a very strong, positive correlation between every pair of the three score variables. This is consistent with the relationships between scores observed earlier among many countries’ students.
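
The strength of these relationships can also be quantified with a correlation matrix. The sketch below generates synthetic scores that share a common component, purely to illustrate the call; the numbers are not PISA data and the noise levels are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic scores: a shared "ability" component plus subject-specific noise.
# These values are simulated for illustration and are not PISA results.
rng = np.random.default_rng(0)
n_students = 1000
ability = rng.normal(500, 100, size=n_students)

toy = pd.DataFrame({
    'Math Score': ability + rng.normal(0, 30, size=n_students),
    'Reading Score': ability + rng.normal(0, 30, size=n_students),
    'Science Score': ability + rng.normal(0, 30, size=n_students),
})

# Pairwise Pearson correlation coefficients between the three score columns
corr = toy.corr(method='pearson')

print(corr.round(2))
```

On the real data, the same `corr` call on the three score columns would give a numeric counterpart to the scatter plots.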

Visualisation 8

Finally, even though we found that a positive correlation between students’ scores is normal for a country, and that most countries have similar rankings across all three subjects, about 20% of countries deviate from this behaviour: their rankings differ by more than 10 places between at least two subjects.

We will look into which countries these are, and in which subjects the differences occur.

In [25]:

country_outliers = []

for country in df['Country'].unique():
    # 0-based positions of this country in each subject's ranking (0 = top-ranked)
    math_place = math_score_country_order.get_loc(country)
    reading_place = reading_score_country_order.get_loc(country)
    science_place = science_score_country_order.get_loc(country)

    # Keep countries whose rank differs by more than 10 places between any two subjects
    if (abs(math_place - reading_place) > 10
            or abs(math_place - science_place) > 10
            or abs(reading_place - science_place) > 10):
        country_outliers.append(country)

country_outliers.sort() # Sort countries alphabetically

for country in country_outliers:
    label = country + ':'
    math_place = math_score_country_order.get_loc(country)
    reading_place = reading_score_country_order.get_loc(country)
    science_place = science_score_country_order.get_loc(country)
    # Pad each field so that the columns line up
    print(f"{label:<27}Math place: {math_place:<6}Reading place: {reading_place:<6}Science place: {science_place}")

Austria:                   Math place: 17    Reading place: 31    Science place: 23
Florida (USA):             Math place: 44    Reading place: 33    Science place: 40
Iceland:                   Math place: 27    Reading place: 40    Science place: 42
Ireland:                   Math place: 20    Reading place: 5     Science place: 13
Israel:                    Math place: 43    Reading place: 32    Science place: 44
Kazakhstan:                Math place: 53    Reading place: 65    Science place: 54
Macao-China:               Math place: 6     Reading place: 17    Science place: 14
Massachusetts (USA):       Math place: 19    Reading place: 8     Science place: 15
Norway:                    Math place: 31    Reading place: 22    Science place: 33
Slovak Republic:           Math place: 34    Reading place: 45    Science place: 43
Slovenia:                  Math place: 37    Reading place: 46    Science place: 32
Switzerland:               Math place: 9     Reading place: 26    Science place: 26
United States of America:  Math place: 39    Reading place: 25    Science place: 30
Vietnam:                   Math place: 15    Reading place: 19    Science place: 7

In [26]:

df_country_outliers = df[['Country', 'Math Score', 'Reading Score', 'Science Score']][df['Country'].isin(country_outliers)]
df_country_outliers = df_country_outliers.melt('Country', var_name = 'Score Type', value_name = 'Scores')
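
The `melt` call reshapes the data from one column per subject to one row per (country, subject) pair, which is the layout a `hue`-based plot expects. A minimal illustration on invented toy data, with hypothetical countries `A` and `B`:

```python
import pandas as pd

# Toy "wide" frame: one column per subject; values are invented for illustration.
wide = pd.DataFrame({
    'Country': ['A', 'B'],
    'Math Score': [500.0, 450.0],
    'Reading Score': [480.0, 470.0],
    'Science Score': [510.0, 440.0],
})

# "Long" frame: one row per (Country, Score Type) pair
long = wide.melt('Country', var_name='Score Type', value_name='Scores')

print(long)
```

Each original row becomes three rows, so two countries yield six rows with the columns `Country`, `Score Type` and `Scores`.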

In [27]:

plt.figure(figsize = [7, 10])

sb.pointplot(x = 'Scores', y = 'Country', hue = 'Score Type', data = df_country_outliers, linestyles = '', dodge = 0.4, ci = 'sd', palette = 'deep', order = country_outliers);
plt.legend(loc = 2);
plt.title('Score distribution across subjects of countries with strong deviations in results');

From this visualization, we can better understand the deviations in the outlier countries’ score patterns. For example, Kazakhstan deviates because of much weaker scores in Reading than in the other two disciplines, while Switzerland deviates because of much higher scores in Mathematics than in the other two subjects. The same analysis can be performed for any country of interest.

Conclusions and Answers

Lastly, we will restate our three questions of interest, along with a summary of our conclusions:

  • How do students from individual countries perform in Math, Reading and Science literacy?
    • We have found the literacy trends for each individual country, along with the countries whose scores deviate strongly from the norm. The results can be explored in Visualisations 4 and 8.
  • What are the countries from which “geniuses” stem, meaning which countries have students with exceptionally high literacy scores?
    • Generally, we have seen Asian countries as being the world leaders at educating students who perform exceptionally, not only in each different subject individually, but also at multiple subjects simultaneously. Singapore and China are examples of this.
  • Do students whose parents have different cultural backgrounds report any changes in average scores, compared with students raised in a homogeneous family background?
    • We have discovered that students whose parents are from two different cultural backgrounds report, on average, slightly higher performance (scores) in all three measured subjects.