Notes on MADness

Abstract

Median Absolute Deviations give your author some surprises.

Target Audience

Newcomers to robust statistics; dabblers in R.

Notes

The analysis of TCP/IP volume data rapidly reveals the limits of Gaussian statistical methods. Looking for a robust analogue to the standard deviation as a measure of statistical dispersion often leads us to the interquartile range (IQR): this is, after all, the basis of John Tukey’s popular box plot representation, and it was this I used for a long time.

Median Absolute Deviation, whilst similar to IQR for symmetrical distributions¹, seems as though it might be even more robust to outliers. But searching for more information on the differences between the two, and on when each is most appropriate, I discovered some inconsistencies of usage that surprised me.

The first was that the abbreviation MAD is often used not only for Median AD, but also for Mean AD. Given this, finding that the central point used for each is not necessarily the one on the can seems a mere detail. Even an internet search for Median AD returns a large number of links that actually use the Mean AD. Caveat emptor! I have determined to use the abbreviation M̃AD (fashioned after the common use of x̃ to represent the median) for the Median AD in my reports, at least until someone tells me there’s a better one I’ve overlooked: whilst it is totally non-standard, it does at least cause the reader to question what they are looking at.
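To make the ambiguity concrete, here is a minimal sketch in R. The sample vector `x` and all variable names are illustrative assumptions, not data from the work described. It takes absolute deviations about both the median and the mean, then summarises each with both the median and the mean, giving four different numbers that might all be reported as "MAD".

```r
# Hypothetical sample with one large outlier (illustrative values only)
x <- c(2, 3, 3, 4, 5, 6, 40)

# Absolute deviations about each candidate central point
dev.from.median <- abs(x - stats::median(x))  # median(x) is 4
dev.from.mean   <- abs(x - mean(x))           # mean(x) is 9

# Four quantities that can all end up abbreviated as "MAD"
median.ad.about.median <- stats::median(dev.from.median)  # 1, the Median AD meant in these notes
mean.ad.about.median   <- mean(dev.from.median)           # 43/7, about 6.14
median.ad.about.mean   <- stats::median(dev.from.mean)    # 6
mean.ad.about.mean     <- mean(dev.from.mean)             # 62/7, about 8.86

print(c(median.ad.about.median, mean.ad.about.median,
        median.ad.about.mean, mean.ad.about.mean))
```

With a single outlier in seven points, the four values range from 1 to nearly 9, so it matters a good deal which one a report actually means.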

I then discovered that the R function stats::mad(), which makes it trivial to calculate this dispersion, by default applies a constant scaling factor of 1.4826 in order to estimate the Gaussian standard deviation², which may not suit our data, nor be what we want to calculate. I appear not to be the first person caught out by an inability to pay sufficient attention to the instructions. This is easy to bypass, if you spot it, simply by specifying:

stats::mad( data, constant = 1 )

if you want the pure Median AD (or a number other than one if you want an alternative correction).
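The effect of the default is easy to see side by side. The following is a small sketch on a made-up sample (the values in `data` are an assumption for illustration). It also shows where the default constant comes from: for a Gaussian, the Median AD is qnorm(3/4) times the standard deviation, so the correction is 1 / qnorm(3/4), roughly 1.4826.

```r
# Hypothetical sample (illustrative values only)
data <- c(2, 3, 3, 4, 5, 6, 40)

raw.mad    <- stats::mad(data, constant = 1)  # pure Median AD
scaled.mad <- stats::mad(data)                # default: constant = 1.4826

# Origin of the default constant: for Gaussian data the Median AD is
# qnorm(3/4) * sigma, so dividing by qnorm(3/4) estimates sigma.
gaussian.factor <- 1 / stats::qnorm(0.75)

print(c(raw = raw.mad, scaled = scaled.mad, factor = gaussian.factor))
```

For this sample the default result is simply the pure Median AD multiplied by 1.4826, which is only meaningful as a standard deviation estimate if the bulk of the data is roughly Gaussian.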

Though, if you’ve already calculated the median of the data, and you don’t require any of the other features of the function, it’s really just as simple to use the rather more transparent:

stats::median( abs(data - median.value) )

which you can then multiply by the factor of your choice.
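As a quick sanity check, the hand-rolled version can be compared against the built-in function. This is a sketch only: the sample values and the name `scale.factor` are assumptions for illustration, while `data` and `median.value` follow the names used above.

```r
# Hypothetical sample (illustrative values only)
data <- c(2, 3, 3, 4, 5, 6, 40)

# Median already calculated elsewhere in the analysis
median.value <- stats::median(data)

# Transparent, hand-rolled Median AD
manual.mad <- stats::median(abs(data - median.value))

# It should agree with the built-in function when the scaling is disabled
builtin.mad <- stats::mad(data, center = median.value, constant = 1)
agrees <- isTRUE(all.equal(manual.mad, builtin.mad))

# Apply whichever correction suits the data; 1.4826 here is only an example
scale.factor <- 1.4826
scaled.mad <- scale.factor * manual.mad

print(c(manual = manual.mad, builtin = builtin.mad, scaled = scaled.mad))
```

Keeping the scaling as an explicit, named multiplication leaves no doubt in the script about which correction, if any, was applied.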

Thanks

This article was inspired by work performed for SNCF, and is published with their permission, for which many thanks.

Author

Ian S. Worthington is not a statistician, nor does he play one on TV. He does though have over 25 years’ experience as a developer and systems SME delivering solutions with high-volume, high-performance transaction processing systems using z/OS, z/TPF, ALCS, z/VM, and Linux on (and off) IBM Z. His customers have included IBM (three of ’em: US, UK, and India), British Airways, American Express, Citibank (Ass. Bancorp), and currently SNCF. He holds a BSc in Applied Physics, and an MSc in Database and Information Science, both from the University of London.

Footnotes

¹ For a symmetric distribution, the IQR is twice the median absolute deviation (MAD), i.e. the MAD is the distance from the median to the Q1 or Q3 points.
² On the assumption of a large number of samples. Which may not be your case. See Akinshin, A. (2022). Finite-sample bias-correction factors for the median absolute deviation based on the Harrell-Davis quantile estimator and its trimmed modification. arXiv:2207.12005 [stat.ME] https://arxiv.org/pdf/2207.12005.pdf
³ https://stats.stackexchange.com/a/329378/286976
⁴ Rousseeuw, P.J., & Croux, C. (1993). Alternatives to the Median Absolute Deviation. Journal of the American Statistical Association, 88, 1273-1283. https://wis.kuleuven.be/stat/robust/papers/publications-1993/rousseeuwcroux-alternativestomedianad-jasa-1993.pdf