Magic: The Gathering and Statistics have been two of my passions for years. The game’s large card pool and long history make it a perfect fit for Data Analysis and Machine Learning.

In my Unsupervised Learning tutorial, in case you missed it, I applied K-Means Clustering (an Unsupervised Learning technique) to a Magic: The Gathering Dataset that I scraped myself from mtgtop8 using Python’s Scrapy library.

That article explains the technical side, but doesn’t get into the results, because I didn’t think my readers would be into it.

Since many people have stood up to voice their disagreement, I will now show you some of the things the Algorithm learned.

This is neither the first nor the last time I’ll say it: unsupervised learning can be spooky in how much it learns, even when you know how it works.


## The Data

The Dataset I used for this project contained only professional decks from last year, all from the Modern format. I did not include sideboards in this analysis. All of the decks I used for training and visualizations are available, alongside the code, in this GitHub project.

If you know of any good Dataset for casual decks, please let me know in the comments. Otherwise, I may scrape one in the future.

For this analysis, I’m looking at 777 different decks, containing a total of 642 unique cards (counting lands).
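
To give a sense of the shape of that data, here is a minimal sketch of turning decks into a deck-by-card count matrix. The toy decks, the `deck_to_vector` helper and the choice of copy counts as features are assumptions for illustration; the actual preprocessing is in the notebook.

```python
# Hypothetical toy decks: each deck maps a card name to its number of copies.
decks = [
    {"Forest": 10, "Llanowar Elves": 4, "Elvish Mystic": 4},
    {"Swamp": 12, "Thoughtseize": 4, "The Rack": 4},
    {"Forest": 8, "Llanowar Elves": 4, "Collected Company": 4},
]

# Sorted vocabulary of every card that appears in at least one deck.
vocabulary = sorted({card for deck in decks for card in deck})


def deck_to_vector(deck, vocabulary):
    """Return the deck as a list of copy counts, one entry per vocabulary card."""
    return [deck.get(card, 0) for card in vocabulary]


# One row per deck, one column per unique card.
matrix = [deck_to_vector(deck, vocabulary) for deck in decks]
print(len(matrix), "decks,", len(vocabulary), "unique cards")
```

With the real Dataset, the same construction would give 777 rows and 642 columns.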

## Magic: The Gathering Clustering Analysis: Statistics

First of all, I strongly encourage you to pull the repository and try the Jupyter Notebook yourself, as there may be some particular insights you find interesting that I may be missing.

That said, if you want to see what the Data say about a particular card and you don’t see it here, ask me in the comments (provided the card is part of the competitive meta, which we’ve seen is fairly small).

Now, the first question we’ll ask ourselves is…

### What are the Statistics for each Magic: The Gathering cluster?

Remember, we clustered decks, not cards, so we would expect each cluster to roughly represent an archetype, particularly one seeing play in the Modern meta.

First of all: here are the counts for each cluster. That is, how many decks fell into each.

We can see right off the bat there are two particularly small clusters, with fewer than 30 decks each. Let’s take a closer look.

### What cards fall into each Metagame cluster?
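
Counting decks per cluster takes only a few lines. This is a minimal sketch with made-up labels; the `labels` list and the size threshold are assumptions, standing in for whatever the clustering step returns.

```python
from collections import Counter

# Hypothetical cluster labels, one per deck, as a clustering step might return them.
labels = [0, 2, 1, 0, 2, 2, 0, 1, 2]

# Number of decks assigned to each cluster.
cluster_sizes = Counter(labels)
for cluster, size in sorted(cluster_sizes.items()):
    print(f"cluster {cluster}: {size} decks")

# Flag clusters below a size threshold as worth inspecting by hand.
small_clusters = sorted(c for c, n in cluster_sizes.items() if n < 3)
print("small clusters:", small_clusters)
```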

For cluster number 4, I took the 40 most-played cards of each deck in the cluster, and then intersected those sets to see what all of the decks had in common. I repeated that procedure for cluster number 6.

Cluster number 4:

{'Devoted Druid', 'Horizon Canopy', 'Ezuri, Renegade Leader', 'Forest', 'Elvish Archdruid', 'Pendelhaven', "Dwynen's Elite", 'Llanowar Elves', 'Collected Company', 'Windswept Heath', 'Temple Garden', 'Westvale Abbey', 'Razorverge Thicket', 'Heritage Druid', 'Elvish Mystic', 'Nettle Sentinel', 'Eternal Witness', 'Cavern of Souls', 'Chord of Calling', 'Vizier of Remedies', 'Selfless Spirit'}

Cluster number 6:

{'Funeral Charm', 'Liliana of the Veil', "Raven's Crime", 'Fatal Push', 'Thoughtseize', 'Wrench Mind', 'Bloodstained Mire', 'Smallpox', 'Inquisition of Kozilek', 'Mutavault', 'Urborg, Tomb of Yawgmoth', 'Infernal Tutor', 'Swamp', 'The Rack', "Bontu's Last Reckoning", 'Shrieking Affliction'}

It appears one of them is a green deck, using elves and green lands, while the other is built around discard, with cards like Liliana of the Veil and Inquisition of Kozilek.

Here’s the result of the same procedure for all of the clusters. See if you can tell which archetype each one belongs to. This also tells us about the distribution of the meta back when I got the data.
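
The intersection procedure described above can be sketched as follows. This is my reading of it, not the notebook’s code: the `top_cards` and `common_core` helpers and the toy decks are hypothetical, and each deck is assumed to be a mapping from card name to number of copies.

```python
from collections import Counter


def top_cards(deck, n):
    """Return the set of the n cards with the most copies in one deck."""
    return {card for card, _ in Counter(deck).most_common(n)}


def common_core(cluster_decks, n=40):
    """Intersect the top-n card sets of every deck in a cluster."""
    card_sets = [top_cards(deck, n) for deck in cluster_decks]
    return set.intersection(*card_sets) if card_sets else set()


# Hypothetical two-deck cluster; n is lowered to 2 to suit the toy data.
elf_decks = [
    {"Forest": 10, "Llanowar Elves": 4, "Elvish Mystic": 3},
    {"Forest": 9, "Llanowar Elves": 4, "Heritage Druid": 2},
]
core = common_core(elf_decks, n=2)
print(sorted(core))
```

Because the result is an intersection, a single unusual deck in a cluster can shrink the shared core considerably, which is worth keeping in mind when reading the lists.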

The same analysis on a more recent Dataset may even be useful in and of itself, if you’re into competitive tournaments.

### Statistical analysis of particular Magic: The Gathering cards

Three cards stood out to me in those lists: “*Mutavault*”, “*Inquisition of Kozilek*” and “*Llanowar Elves*”.

I wonder if they’re more common in other clusters? I didn’t really know *Mutavault* was so common in competitive play, and I think *Llanowar Elves* appearing in a deck tells us quite a bit about it.

As always, you can generate these graphs for any of the cards, or ask me if you’re interested in a particular one.
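
For reference, the numbers behind a per-card graph like these can be computed with a short helper. This is a hedged sketch: `decks_with_card_per_cluster`, the toy decks and the labels are all assumptions, and the plotting itself is left out.

```python
from collections import Counter


def decks_with_card_per_cluster(decks, labels, card):
    """Count, for each cluster, how many of its decks contain the given card."""
    counts = Counter()
    for deck, label in zip(decks, labels):
        if card in deck:
            counts[label] += 1
    return counts


# Hypothetical toy decks (card name -> copies) and their cluster labels.
decks = [
    {"Forest": 10, "Llanowar Elves": 4},
    {"Swamp": 12, "Mutavault": 4, "Thoughtseize": 4},
    {"Swamp": 10, "Mutavault": 2},
    {"Island": 8, "Mutavault": 3},
]
labels = [4, 6, 6, 1]

mutavault_counts = decks_with_card_per_cluster(decks, labels, "Mutavault")
print(dict(mutavault_counts))
```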

### What are the most versatile Magic: The Gathering cards in Modern?

Lastly, I’ll define a new property of a card: its versatility will be the number of different clusters that contain at least one deck using it.

Admittedly, that definition could be refined a bit more, for instance by counting appearances instead of just whether the card is in a deck or not.

However, the results this way are coherent enough, so I don’t think it needs any more tweaking. Here’s a list with the top 10 most versatile cards, after filtering Basic Lands out.

- Dismember
- Ghost Quarter
- Field of Ruin
- Cavern of Souls
- Thoughtseize
- Mutavault
- Sacred Foundry
- Stomping Ground
- Engineered Explosives
- Botanical Sanctum

They’re pretty much the ones you’d expect. However, I’m surprised Lightning Bolt didn’t make the cut. I wasn’t sure whether non-Basic Lands should count, but in the end I left them in.

The fact that I have no idea which card “Engineered Explosives” is proves I’m out of touch with the state of the meta, and maybe I should be playing more, but that’s beside the point.
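
The versatility measure defined above can be sketched in a few lines. The `versatility` and `most_versatile` helpers, the toy decks and the `BASIC_LANDS` set (only the five classic Basic Lands, which is an assumption) are illustrative, not the notebook’s code.

```python
from collections import defaultdict

# Assumption: only the five classic Basic Lands are filtered out.
BASIC_LANDS = {"Plains", "Island", "Swamp", "Mountain", "Forest"}


def versatility(decks, labels):
    """Map each card to the number of clusters with at least one deck using it."""
    clusters_by_card = defaultdict(set)
    for deck, label in zip(decks, labels):
        for card in deck:
            clusters_by_card[card].add(label)
    return {card: len(clusters) for card, clusters in clusters_by_card.items()}


def most_versatile(decks, labels, top=10):
    """Rank non-Basic-Land cards by versatility, breaking ties by name."""
    scores = versatility(decks, labels)
    ranked = sorted(
        (item for item in scores.items() if item[0] not in BASIC_LANDS),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:top]


# Hypothetical toy decks (card name -> copies) and their cluster labels.
decks = [
    {"Forest": 10, "Dismember": 2},
    {"Swamp": 12, "Dismember": 1, "Thoughtseize": 4},
    {"Swamp": 10, "Thoughtseize": 4},
]
labels = [0, 1, 1]

ranking = most_versatile(decks, labels, top=2)
print(ranking)
```

Counting appearances instead, as suggested earlier, would mean summing copies or decks per card rather than collecting cluster labels into a set.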

## Conclusion

As we expected, Magic: The Gathering can be a fun source of Data, and I think we have all learned a bit by seeing all these Statistics.

Personally, I’m still surprised a bit of glorified linear algebra could learn all about the meta of competitive play.

I’d be even more surprised if it learned about archetypes in casual play, where decks are more diverse, though my intuition tells me that, with enough clusters, even those should be properly characterized.

What do you think? Would you have liked to see any other bits of information? Were you expecting the algorithm to perform well?

And finally, what other domains do you think are fit for proper Statistical Analysis, particularly using other Unsupervised Machine Learning Techniques?

Please let me know any or all of that in the comments!


