As a Data Scientist, I spend about a third of my time looking at data and trying to get meaningful insights, the discipline some call exploratory data analysis. These are the tools I use the most.
Today we will be looking at two awesome tools, following closely the code I uploaded on this GitHub project. One is Jupyter Notebooks, and the other is a Python library called Pandas.
If you’re very new to Python, I recommend reading a language tutorial before jumping into this post. I won’t be using anything too obscure, but I won’t stop to explain list comprehensions either. I recommend the O’Reilly book “Data Science From Scratch with Python”, but any other tutorial will do fine.
Jupyter Notebooks: Interactive Python Scripts
If you’ve ever used the Python console, you’ve probably realized how useful it can be. From easily debugging a function by checking if an output matches your expectations (I’m not saying this is a good replacement for tests, but who hasn’t done this?), to running expensive processes once so they’re loaded in memory and you can start testing other things on their output, there are many advantages to having an interactive environment.
Jupyter Notebooks are just that same environment, on steroids. It’s like running the Python console tool for a read-eval-print loop, but with a cute interface, and the ability to document and save what you tested. It’s very convenient, for instance, if you’re writing a Medium article about Python tools!
To install Jupyter, all you have to do is run this pip command:
python -m pip install jupyter
If you’re on Linux you should use this command:
pip3 install --user jupyter
(Note: a previous version of this article recommended using sudo for installation. A kind reader taught me this is not safe and is actually very bad practice, as it gives the setup.py program sudo privileges, which is generally no good and can allow malicious code to run. The more you know!)
Then to run the program, open your terminal in the folder where you’d like to store your notebooks, and just run:
jupyter notebook
Yes, it’s that simple. That command will start a server on your computer, and redirect your favorite browser to it. From there, just use the GUI to create a new notebook, and open it. Use the + button to create a new block of code, and the “cut” one to delete one. Each block of code can be run independently (though not concurrently) by putting your cursor into it and hitting “Shift+Enter”.
Now that you know how to run a Jupyter notebook, it would be wise to clone the project I just linked. All you have to do is click the green clone button, copy the link, and run
git clone *link*
Getting Started: CSV Files
Now for the fun part, let’s talk a bit about Pandas. Pandas is an Open Source Python library, maintained by the PyData community. It’s mostly used for Data Analysis and Processing, very often to manipulate CSV files.
In case you don’t know, a CSV is just a file format that stores tabular data: each object is a row, and each column (a Series, in Pandas terms) holds one of its values. Each row is a line in the file, and each value is separated from the previous one with a comma, hence the name Comma Separated Values. The first line is usually reserved for the header, with the names for each column.
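To make the format concrete, here is a small sketch that parses a CSV held in a string instead of a file. The names and salaries are made up for illustration.

```python
import io

import pandas as pd

# A tiny CSV: the first line is the header, every other line is one row.
csv_text = """name,surname,salary
Ada,Lovelace,7000
Alan,Turing,6500
"""

# read_csv accepts any file-like object, so a StringIO stands in for a file.
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))  # column names taken from the header line
print(len(df))           # number of data rows (the header is not counted)
```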
In this tutorial, we will first generate a mock dataset of ‘employee data’, from a very cold company that only stores each employee’s name, surname and salary. The dataset will be generated as a CSV file, as seen in the generate_people.py program.
We then generate a Pandas Dataframe: its abstraction over tabular data such as CSV files, where each column is a Pandas Series (an iterable with some vectorized operations).
That’s it. As you can see, a Dataframe is generated from a list of dictionaries, and an (optional) argument for column names. Exporting it to a CSV file is as easy as calling the to_csv method, with a filename as its only argument.
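The real generate_people.py lives in the repository. As a rough sketch of the same idea, assuming placeholder name lists, an arbitrary salary range and a fixed seed (none of which are necessarily what the script uses):

```python
import random

import pandas as pd

# Placeholder values for illustration only.
NAMES = ["Ada", "Alan", "Grace", "Linus"]
SURNAMES = ["Lovelace", "Turing", "Hopper", "Torvalds"]

random.seed(0)  # fixed seed so the mock data is reproducible

# One dictionary per employee (row); the keys become column names.
people = [
    {
        "name": random.choice(NAMES),
        "surname": random.choice(SURNAMES),
        "salary": random.randint(1000, 10000),
    }
    for _ in range(100)
]

# The optional columns argument fixes the column order.
df = pd.DataFrame(people, columns=["name", "surname", "salary"])

# index=False (an extra, not mentioned in the article) keeps the row
# index out of the file.
df.to_csv("random_people.csv", index=False)
```

Passing index=False is a choice made for this sketch so the file only contains the three columns; calling to_csv with just the filename also works, but writes the row index as an extra first column.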
Using Pandas for Data Analysis
Now for the reading and processing part, let’s run the “Analyse Dataset” notebook, also present in the repository. I’d suggest you open it and read this article with it on the side, but I will add the relevant snippets in case you’re on mobile or just feeling a bit lazy.
We first read the CSV file as a Dataframe with the lines:
import pandas as pd
df = pd.read_csv("random_people.csv")
Using df.head(k) for some k will let us see the first k rows of the dataframe, which will look pretty nice thanks to Jupyter’s magic. This is an easy way to get a sense of the data (and your main debugging tool when you start processing it).
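A quick sketch of head on a hypothetical stand-in for the employee data (outside a notebook you need print to see the result):

```python
import pandas as pd

# Made-up rows standing in for random_people.csv.
df = pd.DataFrame(
    {
        "name": ["Ada", "Alan", "Grace", "Linus", "Guido"],
        "surname": ["Lovelace", "Turing", "Hopper", "Torvalds", "van Rossum"],
        "salary": [7000, 6500, 8000, 5500, 6000],
    }
)

first_rows = df.head(3)  # a new Dataframe holding only the first 3 rows
print(first_rows)
print(df.shape)  # (rows, columns) of the full Dataframe
```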
In order to only see one of the Series, all we have to do is index it as if it were a dictionary field:
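For instance, with a small made-up Dataframe (the column names are placeholders matching the employee example):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Alan"], "salary": [7000, 6500]})

salaries = df["salary"]          # one column name -> a Series
subset = df[["name", "salary"]]  # a list of names -> a smaller Dataframe

print(type(salaries).__name__)
print(type(subset).__name__)
```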
You can call any of the usual aggregations you’d use in any other Data Analysis tool, like SQL, on any Series as a method. My favorite one if I want to get an intuition about a Series is the value_counts method, which displays each value and how many times it appears in the Series, ordered by descending number of occurrences. Other options include mean, sum, count and median.
An advantage of using these methods instead of opening the CSV by hand and running your own implementation of those functions is that most of them are vectorized (they run in compiled code, which can use SIMD instructions and other dark magicks), which means they will often be faster by roughly an order of magnitude. This also holds true for addition, subtraction, division and products, which can be broadcast through a Series very easily.
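Here is a small sketch of those aggregations on two invented Series:

```python
import pandas as pd

names = pd.Series(["Ada", "Alan", "Ada", "Grace", "Ada", "Alan"])
salaries = pd.Series([7000, 6500, 8000, 5500])

# value_counts: one entry per distinct value, most frequent first.
counts = names.value_counts()
print(counts)

print(salaries.mean())    # 6750.0
print(salaries.median())  # 6750.0
print(salaries.sum())     # 27000
print(salaries.count())   # 4
```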
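As a quick sketch of what broadcasting looks like in practice (the salary and bonus figures are invented):

```python
import pandas as pd

salaries = pd.Series([7000, 6500, 8000])
bonuses = pd.Series([500, 0, 1000])

# A scalar is broadcast to every element of the Series.
raised = salaries * 1.1
monthly = salaries / 12

# Two Series are combined element by element (aligned on their index).
total = salaries + bonuses

print(total.tolist())  # [7500, 6500, 9000]
```

No explicit loop appears anywhere: the arithmetic operators work on the whole Series at once.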
However sometimes you’ll want to apply a function to a Series that’s not as trivial, and maybe the people who made Pandas haven’t really thought of your use case. In that case, you can just define your own function, and use the apply method on the Series you want to modify. This will be a bit slower, but still runs smoothly for simple functions. It’s the pandas equivalent of Python’s native map, and will add a ton of flexibility to your processing.
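A minimal sketch of apply with a custom function. The salary_band rule is hypothetical, made up just to have something non-trivial to apply:

```python
import pandas as pd

df = pd.DataFrame({"salary": [7000, 6500, 8000, 5500]})


def salary_band(salary):
    """Hypothetical rule: label a salary as 'high' or 'regular'."""
    return "high" if salary >= 6000 else "regular"


# apply calls salary_band once per element and collects the results
# into a new Series with the same index.
df["band"] = df["salary"].apply(salary_band)
print(df)
```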
If you want to filter your Dataframe and only keep the rows that maintain a certain property, this is what you’ll do:
df_high = df[df["salary"]>=6000]
Where df["salary"]>=6000 could be replaced by anything that returns a Series of booleans, such as df["any series name"].apply(f) where f is a boolean function. Make sure the boolean Series has the same number of elements as the Dataframe you’re filtering, though.
Lastly, keep in mind that these filtered Dataframes are not meant to be modified in place: an assignment made on one of them does not reach the original Dataframe, and depending on your pandas version it may trigger a SettingWithCopyWarning. If you want to alter the selected rows, just add “.loc” after the dataframe’s name, before the first bracket, like this:
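A short sketch of both kinds of mask on made-up data. The combination with & at the end is an extra not covered in the text:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Ada", "Alan", "Grace", "Linus"],
        "salary": [7000, 5500, 8000, 6000],
    }
)

# A comparison produces a boolean Series, one entry per row.
mask = df["salary"] >= 6000
df_high = df[mask]

# A boolean function used through apply works as a mask too.
starts_with_a = df["name"].apply(lambda name: name.startswith("A"))
df_a = df[starts_with_a]

# Masks can be combined with & (and), | (or) and ~ (not).
df_high_a = df[mask & starts_with_a]

print(df_high)
```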
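A sketch of that pattern on a small made-up Dataframe (the threshold and the replacement value are arbitrary):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Alan", "Grace"], "salary": [7000, 5500, 8000]}
)

# The same kind of filter as before, written through .loc.
df_high = df.loc[df["salary"] >= 6000]

# Assigning through .loc on the original Dataframe updates it directly:
# rows where the mask is True get the new value in the "salary" column.
df.loc[df["salary"] < 6000, "salary"] = 6000

print(df["salary"].tolist())  # [7000, 6000, 8000]
```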
You can also add a column name as a second parameter to loc in order to keep a single filtered Series instead of the whole Dataframe. This is an example straight from the notebook.
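The notebook has its own version of this. The sketch below is a stand-in on made-up data, not the notebook’s exact code:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Alan", "Grace"], "salary": [7000, 5500, 8000]}
)

# Row mask first, column name second: the result is a single Series.
high_salaries = df.loc[df["salary"] >= 6000, "salary"]

print(high_salaries.tolist())  # [7000, 8000]
print(high_salaries.mean())    # 7500.0
```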
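Note that pandas does not always make it obvious whether a selection like these shares data with the original Dataframe, so an alteration made on one may or may not show up in the other, depending on the operation and the pandas version. To be safe, just call the copy method at the end of the line whenever you want an independent Dataframe.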
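A minimal sketch of the copy approach, again on invented data:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Alan", "Grace"], "salary": [7000, 5500, 8000]}
)

# copy() returns an independent Dataframe.
df_high = df.loc[df["salary"] >= 6000].copy()

# Changing the copy leaves the original untouched.
df_high["salary"] = df_high["salary"] + 1000

print(df["salary"].tolist())       # [7000, 5500, 8000]
print(df_high["salary"].tolist())  # [8000, 9000]
```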
That’s it, that was a quick crash course on Pandas and Notebooks. I hope you’ve found it useful, and if there’s anything else you feel I should have covered, or anything you’d like to ask me, you know where to find me. I am also open to any criticism, good or bad, as these are my first articles on Medium and I’m still getting used to it.
If you want to expand on this, here’s an article I wrote on how to do Data Analysis in parallel. Keep coding!