Statistical Analysis: Pandas and Seaborn on a Kaggle Dataset

When doing Statistical Analysis, curiosity and intuition are two of a Data Scientist’s most powerful tools. The third one may be Pandas.

In my previous Exploratory Data Analysis tutorial, I showed you how to:

  • Get an idea of how complete a Dataset is.
  • Plot a few of the variables.
  • Look at trends and tendencies over time.

To do this, we used Python’s Pandas framework in a Jupyter Notebook for Statistical Analysis and Data Processing, and the Seaborn framework for visualization.

In the previous article, as in this one, we used the 120 years of Olympics Dataset from Kaggle.

We looked at female participation over time, the probability distributions of athletes’ weights and heights, and other variables.

However, we did not look into the data about which sport each athlete practiced.

This time, we will focus on the Sport column of the Dataset, and glean some insights about it through statistical data analysis.

A few questions I can think of are:

  • What sports favor heavily built people? And what about tall people?
  • What sports are newer, and which are older? Are there any sports that actually lost the Olympics’ favor and stopped being played?
  • Are there some sports where the same teams always win? What about the most diverse sports, with winners from many different places?

Same as before, we’ll be using this GitHub project for the analysis, and you can fork it and add your own analysis and insights.

Let’s dive right in!

Statistical Analysis with Python: Weights and Statures

For our first analysis, we’ll look at what sports have the heaviest and tallest players. We’ll then see which have the lightest or shortest ones.

As we saw in the previous article, both height and weight are heavily dependent on sex. We also have more data on male athletes than on female ones.

Because of this, we’ll limit this analysis to the male ones.

Keep in mind, however, that the same code works for either sex by just switching the ‘Sex’ filter.

As you can see, if I group by sport I can take the min, max and average weight and height for each sport’s players.
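A minimal sketch of that grouping, assuming the Kaggle file’s column names (`Sex`, `Sport`, `Weight`, `Height`). The tiny DataFrame is made-up stand-in data so the example runs on its own, and `summarize_by_sport` is a hypothetical helper name, not code from the project:

```python
import pandas as pd

# Made-up stand-in rows; the real analysis would load the Kaggle CSV instead.
df = pd.DataFrame({
    "Sex":    ["M", "M", "M", "M", "F"],
    "Sport":  ["Basketball", "Basketball", "Gymnastics", "Gymnastics", "Gymnastics"],
    "Weight": [90.0, 100.0, 60.0, 66.0, 50.0],
    "Height": [195.0, 205.0, 165.0, 169.0, 155.0],
})


def summarize_by_sport(data, sex="M"):
    """Return min, max and mean of Weight and Height per sport for one sex."""
    subset = data[data["Sex"] == sex]
    return subset.groupby("Sport")[["Weight", "Height"]].agg(["min", "max", "mean"])


summary = summarize_by_sport(df)

# Sort by mean weight, heaviest first, and keep the top five sports.
print(summary.sort_values(("Weight", "mean"), ascending=False).head(5))
```

Passing `sex="F"` instead would give the same summary for female athletes.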

I then looked at the top 5 heaviest sports, and found this (in kilograms):

Sport             min    max    average
Tug-Of-War        75.0   118.0  95.61
Basketball        59.0   156.0  91.68
Rugby Sevens      65.0   113.0  91.00
Bobsleigh         55.0   145.0  90.38
Beach Volleyball  62.0   110.0  89.51

Not too unexpected, right? Tug-of-war practitioners, Basketball players and Rugby players are all heavy.

It’s quite interesting to see how much variation there is among Basketball players, who range from 59 to 156 kg, and to a lesser extent among Rugby Sevens players (65 to 113 kg), whereas every tug of war player weighs at least 75 kilos.

Then I just plotted the mean weight for each sport, and found that it roughly followed a normal distribution:

Statistical Analysis: weight distribution in Olympics Athletes.

Height has a similar, roughly normal distribution, but its variance is a lot smaller, with values tightly concentrated around the mean:

Statistical Analysis: height distribution of Olympic Athletes.

Next I set out to graph all individual means, in an ordered scatter plot, to see whether there were any outliers.
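One way such an ordered plot could be built is sketched here. The per-sport means are made-up values, and `plot_ordered_means` is a hypothetical helper, so treat this as an illustration rather than the code behind the figure:

```python
import pandas as pd

# Made-up mean weights per sport (kg), standing in for the real per-sport means.
mean_weight = pd.Series(
    {"Sport A": 72.0, "Sport B": 95.0, "Sport C": 63.0, "Sport D": 74.5}
)

# Sort ascending so any outliers end up at either end of the plot.
ordered = mean_weight.sort_values()


def plot_ordered_means(ordered_means):
    """Scatter each sport's mean against its position in the ordering."""
    # Imported here so the data preparation above runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.scatterplot(x=list(range(len(ordered_means))), y=ordered_means.values)
    ax.set_xlabel("Sport (ordered by mean weight)")
    ax.set_ylabel("Mean weight (kg)")
    plt.show()


print(ordered)
```

Calling `plot_ordered_means(ordered)` in a notebook draws the scatter plot.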

Data Analysis with Pandas.

In fact, the ‘heaviest’ sport is quite the outlier with respect to the rest of the graph. The same thing happens with the ‘lightest’.

If we look at heights, variance was clearly smaller. However, the plot reveals an even bigger difference between ‘outliers’ and people near the mean.

This is accentuated by the fact that most people do not really deviate a lot from it.

Data Visualization with Seaborn

For the lightest sports, the results can be obtained using the previously generated variable, plot_data.
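The structure of `plot_data` is not shown here. Assuming it is something like a pandas Series of mean weights indexed by sport, the lightest and heaviest sports could be pulled out as follows (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for plot_data: mean weight (kg) per sport, made-up values.
plot_data = pd.Series(
    {"Sport A": 72.0, "Sport B": 95.0, "Sport C": 63.0, "Sport D": 74.5, "Sport E": 66.0}
)

# nsmallest / nlargest return the n extreme values, already sorted.
lightest = plot_data.nsmallest(3)
heaviest = plot_data.nlargest(3)

for sport, weight in lightest.items():
    print(f"{sport}: {weight}")
```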

The results (omitting the heaviest ones, since we already saw those) are the following:

Gymnastics: 63.3436047592
Ski Jumping: 65.2458805355
Boxing: 65.2962797951
Trampolining: 65.8378378378
Nordic Combined: 66.9095595127

As you can see, Gymnastics athletes, even the male ones, are the lightest players, about two kilos below the next sport on average.

They are followed by Ski Jumping, Boxing (which kinda surprised me) and Trampolining, which actually makes a lot of sense.

If we instead look for the tallest and shortest athletes, the results will be a little less surprising.

I’m guessing we all expected the same sport to come up on top and, unsurprisingly, it did. At least we can now say it’s not a stereotype.

shortest (cm): 
Gymnastics: 167.644438396
Weightlifting: 169.153061224
Trampolining: 171.368421053
Diving: 171.555352242
Wrestling: 172.870686236
tallest (cm): 
Rowing: 186.882697947
Handball: 188.778373113
Volleyball: 193.265659955
Beach Volleyball: 193.290909091
Basketball: 194.872623574

We see Gymnastics practitioners are very light, and very short.

However, some sports in these rankings do not appear in the weight ones.

I wonder what ‘build’ (weight/height) each sport has?
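A minimal sketch of that ratio, assuming ‘build’ is computed per athlete (weight divided by height) and then averaged per sport. The rows are made up, and the column names follow the Kaggle file:

```python
import pandas as pd

# Made-up stand-in rows (kg and cm).
df = pd.DataFrame({
    "Sport":  ["Basketball", "Basketball", "Gymnastics", "Gymnastics"],
    "Weight": [90.0, 100.0, 60.0, 66.0],
    "Height": [200.0, 200.0, 160.0, 165.0],
})

# Assumption: ratio per athlete first, then the mean of those ratios per sport.
build = (df["Weight"] / df["Height"]).groupby(df["Sport"]).mean().sort_values()

print(build.head(5))  # smallest builds
print(build.tail(5))  # heaviest builds
```

Sorting before taking `head` or `tail` matters: a grouped result is ordered by sport name by default, so an unsorted `head(5)` would simply return the first five sports alphabetically.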

The plot has a pretty linear look, until we get to the top where most outliers fall:

Build (Weight/Height) distribution of Olympics’ athletes

And here are the least and most heavily built sports:

Smallest Build (kg/cm)
Alpine Skiing     0.441989
Archery           0.431801
Art Competitions  0.430488
Athletics         0.410746
Badminton         0.413997

Heaviest Build (kg/cm)
Tug-Of-War        0.523977
Rugby Sevens      0.497754
Bobsleigh         0.496656
Weightlifting     0.474433
Handball          0.473507

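So again Tug of War and Rugby Sevens are the most heavily built sports. The smallest-build list deserves a caveat: its rows are in alphabetical order rather than sorted by value, so it may not be the true bottom five. Among the sports it shows, Athletics has the lowest ratio (0.41 kg/cm), not Alpine Skiing.

Archery and Art Competitions (which I just learned is an Olympic sport and will require further research) also appear on it.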

Sports over time

Now that we’ve done several interesting things with those three columns, I’d like to start looking at the time variable. Specifically, the year.

I want to see whether new sports have been introduced to the Olympics, and when. I also wish to see which ones have been discontinued.

The following snippet will be generally useful any time we need to see when something arose for the first time, especially if we want to see an abnormal increase in a variable.
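A minimal sketch of such a snippet, assuming the `Sport` and `Year` columns of the Kaggle file. The rows are made up for illustration:

```python
import pandas as pd

# Made-up stand-in rows: one row per (sport, year) appearance.
df = pd.DataFrame({
    "Sport": ["Sport A", "Sport A", "Sport B", "Sport B", "Sport C", "Sport D"],
    "Year":  [1896, 1900, 1900, 2016, 1900, 1964],
})

# First year in which each sport appears.
first_year = df.groupby("Sport")["Year"].min()

# Number of sports whose first appearance falls in each year.
introduced_per_year = first_year.value_counts().sort_index()

print(introduced_per_year)
```

The same pattern (group, take the minimum of a time column, count per period) applies to any “first seen” question, not just sports.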

The graph shows us how many sports were practiced in the Olympics for the first time for each year. Or, in other words, how many sports were introduced each year:

So even though a lot of sports were there before 1910, and most were introduced before 1920, there have been many relatively new introductions. Looking at the data, I see there were many new sports introduced in 1936, and afterwards they were always brought in small sets (fewer than five sports).

There weren’t any new sports between 1936 and 1960, when Biathlon was introduced, and then they kept adding them pretty regularly:

Sport                      introduced
Biathlon                   1960
Luge                       1964
Volleyball                 1964
Judo                       1964
Table Tennis               1988
Baseball                   1992
Short Track Speed Skating  1992
Badminton                  1992
Freestyle Skiing           1992
Beach Volleyball           1996
Snowboarding               1998
Taekwondo                  2000
Trampolining               2000
Triathlon                  2000
Rugby Sevens               2016

An analogous analysis for deprecated sports (where max year is not recent) shows this list of sports, most of which I’ve never heard of (though that’s by no means a good metric of whether a sport is popular!).
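The analogous step can be sketched as below, again on made-up rows. The `cutoff` year is a hypothetical threshold for “not recent”, not a value taken from the original analysis:

```python
import pandas as pd

# Made-up stand-in rows: one row per (sport, year) appearance.
df = pd.DataFrame({
    "Sport": ["Sport A", "Sport A", "Sport B", "Sport B", "Sport C", "Sport D"],
    "Year":  [1896, 1900, 1900, 2016, 1920, 1948],
})

# Last year in which each sport appears.
last_year = df.groupby("Sport")["Year"].max()

cutoff = 2000  # hypothetical: sports last seen before this year count as dropped
discontinued = last_year[last_year < cutoff].sort_values()

print(discontinued)
```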

Sport                last year
Basque Pelota        1900
Croquet              1900
Cricket              1900
Roque                1904
Jeu De Paume         1908
Racquets             1908
Motorboating         1908
Lacrosse             1908
Tug-Of-War           1920
Rugby                1924
Military Ski Patrol  1924
Polo                 1936
Aeronautics          1936
Alpinism             1936
Art Competitions     1948

We see Art Competitions were last held in 1948.

Polo hasn’t been played at the Olympics since 1936, and the same goes for Aeronautics.

If anyone knows what exactly Aeronautics is, please let me know.

I’m picturing people in a plane but don’t see what the competition could be like.

Maybe plane races? Let’s bring those back!

That’s all for today, folks! I hope you’ve enjoyed this Statistical Analysis tutorial, and maybe you’ve got a new interesting fact to bring up at your next family dinner.

As usual, feel free to fork the code from this analysis and add your own insights.

As a follow-up, I’m thinking of training a small Machine Learning model to predict an athlete’s sex based on the sport, weight and height columns. Tell me what model you’d use!

And if you feel anything in this article was not properly explained, or is simply wrong, please also let me know, as I’m learning from these as well!

If you wish to go deeper into Statistical Analysis with Python, I highly recommend this O’Reilly book.

Follow me on Twitter or Medium for more Statistical Analysis articles, Python tutorials and anything else Data Related! If you liked this article, please share it with your data friends on Twitter.
