Why I Use Median And When You Should Too

Tags: data-science learning meta

One of the issues I struggle with on MTG Card Prices it how to price a card based on historical sales.

My initial solution was to take the average price of the prices that had occurred in the last 4 weeks, rolling. Prices for individual cards always fluctuated a little, so smoothing out bumps worked well. Also, it was easy (MySQL has an avg() function).

While it worked for most of the cards, it would occasionally lead to strange results. An example of this is the Beta Sol Ring, which had sales like this:

The $1,725.98 price is a huge outlier, and while the auction itself included other beta cards, the auction title only mentioned the Sol Ring, so of course the matcher matched it to a Beta Sol Ring.

Using average with the above auctions shows a price of $253.59, which no one will pay for.

Knowing that any matching system will incorrectly match an auction like this, the question becomes, "How can I get a reasonable price given the occasional outlier?".

Fortunately, there is a solution to this problem, and I came across it recently while reading Data Analysis with Open Source Tools. Philip mentions that when you have a set that in not evenly distributed, you should use median, not average.

Using our set of prices above, the median comes to $73.77, which is close to what someone would actually pay for a Beta Sol Ring.

This is a much better solution, and it's a shame that MySQL doesn't have a built-in median function. I ended up using a substring_index solution that seemed like the least painful, and didn't require any self-joins.

Now my rule of thumb is: "If you know your data is evenly distributed, use average. Otherwise use median".

Date: 2011-07-27

Author: Ben