Extracting Game of Thrones data from Wikipedia

I'm learning different methods of importing data from the web in my Web and Cloud Computing class. This week, I learnt web scraping with BeautifulSoup, a Python package for extracting data from HTML and XML files.
In this post, I show how I collected Game of Thrones data from the Wikipedia page that lists all Game of Thrones episodes by season. More specifically, I scraped each episode's title, link, season, number of U.S. viewers, and running time, and summarized them in a pandas DataFrame.
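To give a feel for the approach, here is a minimal sketch of the idea. It parses a small made-up HTML table instead of the live Wikipedia page, so the markup, the placeholder episode names, the viewer numbers and the `parse_episodes` helper are all assumptions for illustration, not the code from the full post.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up, simplified stand-in for one season table (placeholder values only).
SAMPLE_HTML = """
<table>
  <tr><th>No.</th><th>Title</th><th>U.S. viewers (millions)</th></tr>
  <tr><td>1</td><td><a href="/wiki/Episode_A">Episode A</a></td><td>1.5</td></tr>
  <tr><td>2</td><td><a href="/wiki/Episode_B">Episode B</a></td><td>2.0</td></tr>
</table>
"""


def parse_episodes(html, season):
    """Turn the rows of an episode table into a pandas DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 3:
            continue  # the header row has <th> cells only, so skip it
        link = cells[1].find("a")
        rows.append({
            "season": season,
            "title": cells[1].get_text(strip=True),
            "link": link["href"] if link else None,
            "us_viewers_millions": float(cells[2].get_text(strip=True)),
        })
    return pd.DataFrame(rows)


episodes = parse_episodes(SAMPLE_HTML, season=1)
print(episodes)
```

On the real page, the HTML would come from a downloaded copy of the article, and the column positions would need to be checked against the actual table layout.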

Reflecting on my MDS experience

With no computer science background and limited coding experience, I expected UBC's Master of Data Science (MDS) program to be challenging, to say the least. However, it wasn't the first time I was stepping outside my comfort zone. When I was 16, I relocated to South Africa to attend the African Leadership Academy, where I had to study in a different language. A few years later, I started my career in finance without a finance background.


It has been 3 months, 12 courses, 24 quizzes, 48 assignments, 96 hours of in-class assignments, 144 hours of lectures and lots of sleepless nights, and it has been totally worth it. I like the block schedule because it allows immersion in 4 subjects every month. We have good instructors who care about their students, are open to feedback and are easily accessible. Presentations and group projects are great learning opportunities, and the diversity of the cohort makes for interesting discussions.
Among other things, I’ve learnt to wrangle data with R and Python, make effective plots, use recursion and dynamic programming, query a database using SQL, automate data science workflows, and implement supervised machine learning models such as random forests, Naïve Bayes and K-Nearest Neighbors. WOW!
My first computer science class, Algorithms and Data Structures, has been the most challenging so far, but it is the one I’m proudest of. In the second assignment, I used recursion to draw a Sierpinski triangle of depth 6.
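For readers curious about the recursion, here is a rough sketch of the idea rather than my assignment code. It only computes the corner points of the small triangles and leaves the drawing out. The function names and the convention that depth 0 means a single triangle are assumptions.

```python
def midpoint(p, q):
    """Return the point halfway between p and q."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def sierpinski(a, b, c, depth):
    """Return the filled triangles of a Sierpinski triangle as (a, b, c) tuples."""
    if depth == 0:
        return [(a, b, c)]  # base case: keep the triangle as it is
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    # Recurse into the three corner triangles; the middle one stays empty.
    return (sierpinski(a, ab, ca, depth - 1)
            + sierpinski(ab, b, bc, depth - 1)
            + sierpinski(ca, bc, c, depth - 1))


triangles = sierpinski((0.0, 0.0), (1.0, 0.0), (0.5, 0.866), depth=6)
print(len(triangles))  # 3**6 = 729 small triangles
```

Each level triples the number of triangles, so under this convention depth 6 produces 729 of them, which could then be passed to any plotting library.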

Relationship between gender and math grades

I completed my first project-based course, Data Science Workflows, which focused on the reproducibility of data analysis projects. I learnt to:

  • Write R, Python and shell scripts for non-interactive data analysis

Which online grocery store should I shop from in Vancouver?

I've always wished I were one of those people who enjoy grocery shopping, but I really don't. There are three main reasons. First, when I look at my busy calendar, I can't help but feel that the time I allocate to grocery shopping could be used for something more productive, such as working on an assignment, catching up with a relative or even just sleeping. Secondly, I always end up spending more time than planned because I rarely find the correct aisle on the first try, and when I finally do, I get confused by all the options. Lastly, I'm very good at overestimating the amount of groceries I can carry, so getting home is often a struggle.
In view of the upcoming winter and school assignments becoming more and more time-consuming, I've decided to explore online grocery shopping. My goals in doing so were to save time, minimize the amount of effort it takes to carry groceries and of course save money by finding the cheapest online store. I first researched all stores with an online presence and same-day delivery services. My delivery fee budget was at most $15 (significantly lower than the dollar value I would put on the time I spend in grocery stores weekly). In addition, I did not want any additional restrictions such as a minimum order value. Below is a summary of my findings:

summary table

As you can see from the table above, my options were quickly reduced to 3 stores: Save-On-Foods, SuperStore, and Walmart. Then, I picked 7 food items I usually purchase and looked up their prices on the stores' websites. The graphs below show the prices of bananas, eggs, milk, tomatoes, rice, potatoes, and oil across the 3 stores, as well as each store's delivery fee.

image2
image3

Did I hear SuperStore is the cheapest? Correct! Every product except oil is cheapest at SuperStore. If I were to purchase these 7 items, my total bill before tax, delivery fee included, would be $34.84 at SuperStore, $41.12 at Save-On-Foods and $41.52 at Walmart. SuperStore is therefore $6.68 cheaper than Walmart and $6.28 cheaper than Save-On-Foods. However, when the delivery fee is excluded, the price gaps shrink and Walmart becomes more competitive than Save-On-Foods: SuperStore is then $3.65 cheaper than Walmart and $4.27 cheaper than Save-On-Foods.
This means that assuming online prices are a reasonable proxy for in-store prices, Walmart is a cheaper option than Save-On-Foods as far as in-store shopping is concerned.
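The comparison above can be double-checked with a few lines of Python. This is only a sketch built from the totals and gaps quoted in the text. The variable names are my own, and the last step infers how much of each gap comes from delivery fees, since the individual fees are not restated here.

```python
# Totals (delivery included) and basket-only gaps, as quoted in the text.
totals_with_delivery = {"SuperStore": 34.84, "Save-On-Foods": 41.12, "Walmart": 41.52}
gaps_without_delivery = {"Save-On-Foods": 4.27, "Walmart": 3.65}

cheapest = min(totals_with_delivery, key=totals_with_delivery.get)

# How much more each other store costs than the cheapest one, delivery included.
gaps_with_delivery = {
    store: round(total - totals_with_delivery[cheapest], 2)
    for store, total in totals_with_delivery.items()
    if store != cheapest
}

# The part of each gap explained by the difference in delivery fees (inferred).
delivery_fee_gaps = {
    store: round(gaps_with_delivery[store] - gaps_without_delivery[store], 2)
    for store in gaps_with_delivery
}

print(cheapest)
print(gaps_with_delivery)
print(delivery_fee_gaps)
```

The delivery fee accounts for a larger share of the gap at Walmart than at Save-On-Foods, which is why the ranking of those two stores flips once delivery is excluded.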

image4

This analysis has two main limitations. First, a robust analysis requires more than 7 products. Secondly, I chose specific brands instead of an average price per item. This means that the findings may suffer from sampling bias and an analysis conducted with different brands and/or quantities could yield different results.

All things being equal, I will be shopping at SuperStore online very soon!


What is covariance?

Baking a chocolate cake for 6 people requires specific ingredients and quantities, such as 3 large eggs, 6 oz of sugar, 6 oz of flour, etc. If we instead want to bake for 5 people, decreasing the quantity of some ingredients might require changes in the quantity of other ingredients. The direction of the relationship between the ingredients’ quantities is called covariance.
If two ingredients increase (or decrease) together, their covariance is positive. For example, less flour is required when the number of eggs is reduced. However, if two ingredients vary in opposite directions, their covariance is negative. For instance, less sugar is needed when honey is added to the recipe. Lastly, if two ingredients don’t have any effect on each other (i.e. if they are independent), their covariance is zero.


However, does a 0 covariance imply independence? Let’s say you’re baking a lemon and a chocolate cake. The required sugar quantities are specified in the recipes; hence the covariance between the required amount of sugar in each cake is zero. But, imagine that you don’t have enough sugar for both cakes. This means that the amount of sugar you can add to the lemon cake depends on the quantity you’ll use for the chocolate cake. Therefore, the sugar quantities are not independent. This shows that a 0 covariance does not necessarily imply independence.
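A standard textbook illustration (not the cake example, and not from my course notes) makes the same point with numbers. Assume X takes the values -1, 0 and 1 with equal probability and Y is always X squared. Y is completely determined by X, yet their covariance is exactly zero. The helper names below are my own.

```python
from fractions import Fraction

xs = [-1, 0, 1]            # three equally likely values of X
ys = [x * x for x in xs]   # Y = X**2, so Y depends entirely on X


def mean(values):
    """Exact mean of equally likely values."""
    return Fraction(sum(values), len(values))


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


print(covariance(xs, ys))  # 0, even though Y is a function of X
```

Covariance only picks up the linear part of a relationship, so a perfectly symmetric dependence like this one goes undetected.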
Let’s continue with the baking analogy to explore another characteristic of covariance. If the covariance between eggs and sugar is 3 and the covariance between flour and cocoa powder is 10, can we conclude the latter covariance is stronger than the former? The answer is no. Covariance does not give us the strength of the relationship between two variables because it doesn’t have a minimum and a maximum value (i.e. it is not normalized). Its magnitude depends on the magnitude of the variables.


Luckily, it is possible to scale the covariance so that it has a minimum value of -1 and a maximum value of 1. This is done by dividing it by the product of the standard deviations of the two variables. Doing so transforms the covariance into its normalized form, called correlation. Nothing is lost though, just transformed.
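Here is a small sketch of why the normalization matters. The sugar and flour quantities are made-up numbers, and the helper functions are my own. Converting ounces to grams changes the covariance a lot, while the correlation stays the same.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Made-up quantities (in ounces) for four hypothetical cake sizes.
sugar_oz = [4.0, 6.0, 8.0, 10.0]
flour_oz = [5.0, 6.0, 9.0, 10.0]

OZ_TO_G = 28.35  # approximate grams per ounce
sugar_g = [q * OZ_TO_G for q in sugar_oz]
flour_g = [q * OZ_TO_G for q in flour_oz]

print(covariance(sugar_oz, flour_oz), covariance(sugar_g, flour_g))
print(correlation(sugar_oz, flour_oz), correlation(sugar_g, flour_g))
```

The covariance in grams is larger by a factor of 28.35 squared purely because of the change of units, whereas the two correlations agree up to floating-point rounding.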
In probability and statistics, covariance is formally defined as “the expected product of deviations of two random variables from their mean values” (lab 551, lecture 3). In other words, it tells us whether we should expect one random variable to be below (above) its mean when the other is.
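Written as a formula, the quoted definition reads as follows for two random variables X and Y with means E[X] and E[Y]. The symbols are the usual textbook notation, not necessarily the one used in the lecture.

```latex
\operatorname{Cov}(X, Y)
  = E\big[(X - E[X])\,(Y - E[Y])\big]
  = E[XY] - E[X]\,E[Y]
```

The product inside the expectation is positive when both variables are on the same side of their means and negative when they are on opposite sides, which is where the sign of the covariance comes from.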
Covariance plays a key role in data science. Random variables with high covariance can be combined without losing significant information. It is also useful for inferences; it can be used to predict one result from another. In addition, it allows the integration of data from multiple sets which is key to machine learning.