My Capstone for the Data Science Immersive at General Assembly

by oboechick on October 14, 2016


I started my capstone project by looking for some way to look at the last reports from Trends in International Mathematics and Science Study (TIMSS). This is a group that looks at the way that math is taught worldwide and creates studies to see the best way to teach math so that the students remember it years after they've stopped going to school. TIMSS does this for Science as well but I am less familiar with that branch of the group.

While I was looking for this data I read an article that summarized results from the Organisation for Economic Co‑operation and Development Programme for International Student Assessment (OECD PISA). I discovered that this was one of the umbrellas under which TIMSS published it results. So I dove in and collected the data.

PISA is a group that gives assessments to students ages 15 years 3 months to 16 years 2 months from about 70 different countries. The assessments determine how literate the students are in math, science, language, and finances (starting in 2012). This assessment is given every 3 years starting in 2000. I was able to collect the survey scores from the 2012 assessments.

This data came in the form of over 300 different excel spreadsheets about 250 of them had more than one sheet in the file. I decided that I would start by importing all of the spreadsheets into pandas dataframes and clean all of them with one function. I then followed these steps.

  1. Pulled all of the excel files into pandas dataframe
  2. Left out first row because it was irrelevant
  3. Combined 2nd and 3rd rows to make the headers more descriptive and understandable
  4. Wanted to merge all dataframes together using country column as the index but had the problem of how do I know which question goes with which header?
  5. Added question ID and sheet name to the beginning of every header
  6. Merged all the data
  7. Got dataframe with 65 rows and 4,095 columns
  8. Attempted to use logistic regression.
    1. Used the countries as target y.
    2. Used headers that had words “none”, “once”, “twice”, “four”, and “five” as my  X features.
    3. I got 1,194 columns in my X and I could not get this model to work

There were many problems with the way that I went about this. Apparently not all my headers got cleaned, half of the columns in my features list were click logs not assessment results, and I had click logs and survey results all in the one dataframe.

## What did I learn from this? 

Find a dictionary of the data or create one before you try to analyze it and take your time familiarizing yourself with the data. In the end it will take less time if you don't have to go back and correct things after you have done all the hard work. 


Return to blog