The XGBoost framework rose to prominence as Boosted Trees became popular in both Kaggle competitions and industrial applications.

XGBoost has proven remarkably hard to beat in **regression tasks**: supervised Machine Learning problems where we need to **predict a numerical value** for a given instance or data point, based on a set of known inputs.

Today we will train an XGBoost model for regression over the official Human Development Index dataset, and see how well we can predict a country’s life expectancy and other statistics.


## What is XGBoost?

XGBoost is a **machine learning library**, written in C++ with a **Python API**, that allows us to train **Boosted Trees** while exploiting **multicore parallelism**.

It is also available in R, though we won’t be covering that here.

To learn about other ways you can use multithreading, see my guide to multithreading in C.

### The task: Regression

Boosted Trees are a Machine Learning model for **regression**.

This means that, given a set of inputs and numeric labels, they will **estimate the function** that maps each input to its corresponding label.

Unlike in classification tasks though, we are interested in **continuous values**, and not a discrete set of classes, for the labels.

For instance, we may want to predict a person’s height given their weight and age, as opposed to, say, labeling them as male, female or other.

We will achieve this with the power of stacked **decision trees**.

A **decision tree** consists of a **root**, a set of **nodes**, and **leaves** at the end of each branch.

Each non-leaf node corresponds to a **numerical comparison** between a feature’s value and a threshold.

In order to predict a decision tree’s output for a given set of inputs, we will:

- Start at the root, and evaluate its corresponding comparison.
- Move to the right if it’s true, or to the left if it’s false (usually, though the opposite would also work).
- Stop when we reach a leaf, and return its value.

For instance, if we made a decision tree to predict whether an animal is a cat or a spider by counting its legs, we would only need a node that said *“if legs > 4”* and then two leaves that said “cat”, and “spider”.
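The cat-vs-spider tree above is small enough to sketch directly. The nested-dict layout and field names here are purely illustrative (this is not XGBoost’s internal format):

```python
# A toy decision tree: each internal node compares a feature against a
# threshold; each leaf just holds a value. Layout is purely illustrative.
tree = {
    "feature": "legs", "threshold": 4,
    "right": {"leaf": "spider"},  # comparison true  -> go right
    "left": {"leaf": "cat"},      # comparison false -> go left
}

def predict(node, x):
    # Start at the root and walk down until we reach a leaf.
    while "leaf" not in node:
        node = node["right"] if x[node["feature"]] > node["threshold"] else node["left"]
    return node["leaf"]

print(predict(tree, {"legs": 8}))  # -> spider
print(predict(tree, {"legs": 4}))  # -> cat
```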

To train the decision tree, we will make sure each node’s split is **optimal** for the part of our training set’s population that reaches it. For classification, “optimal” usually means minimizing **cross-entropy or Gini impurity**; for regression, it usually means minimizing the **variance (squared error)** within each resulting branch.

For another example of a supervised regression Machine Learning model, see my Text Generating LSTMs tutorial.

### XGBoost’s model: What are Gradient Boosted Trees?

**Boosted trees** are similar to random forests: they are a **combination of decision trees**, each one potentially only having access to **part of our available features**.

To make our XGBoost model, we will train a **set of decision trees**, each one returning a number or vector in the labels’ space. The same is true for a Random Forest.

However, for **XGBoost**, we will receive our input and return the **sum of all the trees’ outputs**. A random forest, by contrast, returns the **average of its trees’ outputs**.

In order for this to work, of course, both methods are trained differently.

Lastly, for classification instead of regression tasks, each leaf will usually return the distribution (or majority) of the classes among the training set elements that fall on it.

So how exactly are Boosted Trees different from Random Forests?

### Boosted Trees vs Random Forest: The difference

When training a **Boosted Tree**, unlike with random forests, we **change the labels every time** we add a new tree.

For every new tree, we update the labels by **subtracting the sum of the previous trees’ predictions**, multiplied by a certain learning rate; in other words, each new tree is fitted to the current residuals.

This way, each tree will effectively be learning to **correct the previous trees’ mistakes.**

Consequently, on the prediction phase, we will simply return the sum of the predictions of all of the trees, multiplied by the learning rate.
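That training-and-prediction loop is easy to sketch, with scikit-learn’s plain decision trees standing in for XGBoost’s far more optimized ones. The depth, learning rate, and tree count below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=50, eta=0.1, max_depth=2):
    trees, residual = [], y.astype(float).copy()
    for _ in range(n_trees):
        t = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual -= eta * t.predict(X)  # leave only what's still unexplained
        trees.append(t)
    return trees

def boost_predict(trees, X, eta=0.1):
    # Prediction is the learning-rate-scaled sum of every tree's output.
    return eta * sum(t.predict(X) for t in trees)

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel()
trees = boost(X, y)
print(np.abs(boost_predict(trees, X) - y).mean())  # small training error
```

Each tree only ever sees the residuals left by the ones before it, which is the whole trick.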

This also means, unlike with random forests or bagged trees, **this model will overfit** if we increase the **quantity of trees** too much. However, we will learn how to account for that.

To learn more about Boosted Trees, I strongly recommend you to read the official Docs for XGBoost. They taught me a lot, and explain the basics better and with nicer pictures.

If you want to dive even deeper into this and other machine learning models, the book Introduction to Statistical Learning was the one with which these concepts finally clicked for me, and I can’t recommend it enough.

## Using XGBoost with Python

XGBoost’s API is pretty straightforward, but we will also learn a bit about its hyperparameters. First of all however, I will show you today’s mission.

### Today’s Dataset: HDI Public Data

The HDI Dataset contains a lot of information about most countries’ development level, looking at many metrics and areas, over the decades.

For today’s article, I decided to only look at the data from the latest available year: 2017. This is just to keep things recent.

I also had to perform a bit of data reshaping and cleaning to make the original dataset into a more manageable and, particularly, consumable one.

The GitHub repository for this article is available here, and I encourage you to follow along with the Jupyter Notebook. However, as always, I will add the most relevant snippets here.

### Preprocessing the Data with Pandas

First of all we will read the dataset into memory. Since it contains a whole column for every year, and a row for each country and metric, it is pretty cumbersome to manage.

We will reshape it into something along the lines of:

{country: {metric1: value1, metric2: value2, ...} for country in countries}

So that we can feed it into our XGBoost model. Also since all these metrics are numerical, no more preprocessing will be needed before training.

This snippet gets rid of the rows that have a NaN value, and also tells us we have 195 different countries.
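The snippet itself lives in the notebook, but its core is a single *dropna* call. Here’s a toy version, with made-up rows and guessed column names standing in for the real CSV:

```python
import pandas as pd

# Column names here are guesses at the dataset's layout described above
# (one row per country/indicator pair, one column per year).
df = pd.DataFrame({
    "country_name":   ["Norway", "Norway", "Atlantis"],
    "indicator_name": ["Life expectancy at birth", "Mean years of schooling",
                       "Life expectancy at birth"],
    "2017":           [82.3, 12.6, None],
})

clean = df.dropna(subset=["2017"])      # drop rows missing the 2017 value
print(clean["country_name"].nunique())  # distinct countries left -> 1
```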

If any of these operations are new for you, or you’re having any trouble following along, try reading my introduction to pandas before continuing.

To see all the available *indicators* (metrics) the dataset provides, we will use the *unique* method.

df['indicator_name'].unique()

There are loads (97) of them, so I won’t list them here. Some are related to health, some to education, some to economics, and then a few are about women’s rights.

I thought the most interesting one for our label would be life expectancy, since I think it’s a pretty telling metric about a country.

Of course, feel free to try this same code out with a different label, and hit me up with the results!

Also from the huge list of indicators, I just chose a few by hand that I thought would be correlated with our label (or not), but we could have chosen other ones.

Here’s the code to convert our weirdly shaped dataset into something more palatable.
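Roughly, the conversion builds a dictionary of dictionaries, something like this (toy data again; the real column names may differ):

```python
import pandas as pd

# Toy stand-in for the cleaned long-format table; column names are guesses.
clean = pd.DataFrame({
    "country_name":   ["Norway", "Norway", "Japan"],
    "indicator_name": ["Life expectancy at birth", "Mean years of schooling",
                       "Life expectancy at birth"],
    "2017":           [82.3, 12.6, 84.1],
})

# Build {country: {indicator: value}}, one inner dict per country.
indic = {}
for _, row in clean.iterrows():
    indic.setdefault(row["country_name"], {})[row["indicator_name"]] = row["2017"]

print(indic["Norway"]["Life expectancy at birth"])  # -> 82.3
```

pandas can also do this in one step with `clean.pivot(index="country_name", columns="indicator_name", values="2017")`, which would additionally skip the transpose fix described below.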

Finally, after converting our dictionary into a DataFrame, I realized it was transposed: its columns were what should have been its rows, and vice versa. Here’s how I fixed it.

final_df = pd.DataFrame(indic).transpose()

### Features Correlation Analysis

I hypothesized the features I picked would be good for regression, but that assumption was based on my intuition only.

So I decided to check whether it held up statistically. Here’s how I made a correlation matrix.
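The actual snippet is in the notebook; a minimal equivalent, with two synthetic columns standing in for *final_df*, would be:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for final_df: schooling drives life expectancy plus noise.
rng = np.random.default_rng(0)
schooling = rng.normal(10, 3, 100)
final_df = pd.DataFrame({
    "Expected years of schooling": schooling,
    "Life expectancy at birth": 50 + 2.5 * schooling + rng.normal(0, 3, 100),
})

corr = final_df.corr()  # pairwise Pearson correlations between columns
print(corr.round(2))
```

Passing `corr` to matplotlib’s `matshow` or seaborn’s `heatmap` produces a colored matrix like the one shown below.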

And here’s the result:

As expected, life expectancy at birth is **highly correlated with vaccination** (take that, anti-vaxxers!), lack of vulnerable employment, and education.

I would expect the model to use these same features (the ones showing either the highest or lowest correlations) to predict the label.

Thankfully, **XGBoost provides feature importance tooling** (such as the *plot_importance* method) to check what the model is basing its predictions on.

This is a **huge advantage of Boosted Trees** over, for instance, Neural Networks, where **explaining the models’ decisions** to someone can be very **hard**, and even convincing ourselves that the model is being reasonable is difficult. But more on that later.

### Training the XGBoost Model

It’s finally time to train! After all that processing, you’d expect this part to be complicated, right? But it’s not, it’s just a simple method call.

We will first make sure there is no overlap between our training and testing data.

And now, here’s what you’ve been waiting for. This is the code to train an XGBoost model.

You’ll notice the code to actually train the model, and the one to generate the predictions, are pretty simple.

However, before that, there’s a bit of setup. Those values you see there are the model’s hyperparameters, which specify certain behaviors during training or prediction.

### XGBoost Hyperparameters Primer

*max_depth* refers to the maximum depth allowed for each tree in the ensemble. The bigger this parameter, the more complex the trees tend to be, and the faster they will usually overfit (all other things being equal).

*eta* is our learning rate. As I said earlier, it will multiply the output of each tree before fitting the next one, and also scale the sum when doing predictions afterwards. When set to 1 it does nothing; XGBoost’s actual default is 0.3. If set to a smaller number, the model will take more rounds to converge, but will usually generalize better. It behaves similarly to a Neural Network’s learning rate.

*colsample_bytree* refers to the fraction of the columns that will be available to each tree when generating its branches.

Its default value is 1, meaning “all of them”. Potentially, we may want to set it to a lower value, so that all of the trees can’t just use the best feature over and over. This way, the model becomes more robust to changes in the data distribution, and also overfits less.

Lastly, *num_rounds* (passed as *num_boost_round* in the Python API) refers to the number of training rounds: iterations where we consider adding a new tree. If early stopping is enabled, training will also stop when the evaluation metric hasn’t improved for a given number of rounds.

### Evaluating our results

Now let’s see if the model learned!
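The metric I used is the square root of the mean squared error (RMSE). With a handful of made-up labels and predictions standing in for the test set:

```python
import numpy as np

# Illustrative values; in the notebook these are y_test and model.predict(dtest).
y_test = np.array([72.0, 81.5, 60.3, 77.9])
preds  = np.array([70.1, 79.9, 65.0, 76.2])

rmse = np.sqrt(np.mean((preds - y_test) ** 2))
print(round(rmse, 2))  # -> 2.79
```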

Life expectancy had a standard deviation slightly over 7 in 2017 (I checked with Pandas’ *describe*), so an **RMSE of 4** is nothing to laugh at!

The training was also super fast, though this small dataset doesn’t really take advantage of XGBoost’s multicore capabilities.

However, I feel like the model is still underfitting: that means it hasn’t reached its full potential yet.

### Hyperparameter Tuning: Let’s Iterate

Since we think we are underfitting, let’s try allowing more complex trees (*max_depth* = 6), reducing the learning rate (*eta* = 0.1) and increasing the number of training rounds to 40.

This time, using the same metric, I’m pretty happy with the results!

We reduced the **RMSE on our test set to 3.15**! That’s less than half the standard deviation of our label, which is a pretty solid result.

Imagine predicting someone’s life expectancy with an error margin of 3 years, based on **a few statistics about their country**.

(Of course, that interpretation would be wrong, since the variation in life expectancy between individuals within a single country is definitely non-zero.)

### Understanding XGBoost’s decisions: Feature Importance

The model seems to be pretty accurate. However, what is it **basing its decisions** on?

To come to our aid, XGBoost gives us the *plot_importance* method. It will make a chart with all of our features (or the top N most important, if we pass it a value) ranked by their importance.

But how is *importance* measured?

The default criterion (*weight*) measures what percentage of our trees’ splits use each feature (the number of nodes using a certain feature over the total), but there are other options (*gain*, *cover*), and they all work differently.

I hadn’t really understood this until I read this very clear article on Medium, which I will just link, instead of trying to come up with another explanation for the same thing.

In our case, the first model gave this plot:

Which means our hypothesis was true: the features with the highest (and lowest) correlations are also the most important ones.

This is the feature importance plot for the second model, which performed better:

So both models were using the first three features the most, though the first one seems to have relied too much on *expected years of schooling*. Neat. This kind of analysis would’ve been a lot harder with, for instance, a neural network.

Try explaining your decisions to management when it requires linear algebra!

## Conclusions

XGBoost provided us with a pretty good regression, and even helped us understand what it was basing its predictions on.

Feature correlation analysis helped us theorize which features would be most important, and reality matched our expectations.

Finally, I think it is important to remark how fast training this kind of model can be, even though this particular example was not too good at demonstrating that.

In the future, I’d like to try XGBoost on a much bigger dataset. If you can think of any, let me know!

Also, this dataset looks like it would be fun for some Time Series Analysis, but I don’t have that much experience with those subjects.

Are there any books, articles, or any other sources you could recommend me on that? Please let me know in the comments!

*Follow me on **Medium** or **Twitter** to know when the next article comes out. If you learned something today, please spread this article on social media and tell your friends!*
