Notes on MADness

statistics
r
Author

Ian S. Worthington

Published

April 19, 2023

Abstract

Median Absolute Deviations give your author some surprises.

Target Audience

Newcomers to robust statistics; dabblers in R.

Notes

The analysis of TCP/IP volume data rapidly reveals the limits of Gaussian statistical methods. The search for a robust analogue to the standard deviation as a measure of statistical dispersion often leads to the interquartile range (IQR): this is, after all, the basis of John Tukey’s popular box plot representation, and it is what I used for a long time.

Median Absolute Deviation, whilst similar to IQR for symmetrical distributions[1], seems as though it might be even more robust to outliers. But in searching for more information on the differences, and on when each is most appropriate, I discovered some inconsistency of usage that surprised me.
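As a small illustration of that relationship, the following sketch compares the standard deviation, the IQR and the Median AD on a made-up sample, before and after one value is replaced by an extreme one. The sample values and the helper name median.ad are assumptions chosen for the example, not taken from any real traffic data.

```r
# Hypothetical helper: unscaled median absolute deviation about the median.
median.ad <- function(v) stats::median(abs(v - stats::median(v)))

# A made-up sample that is symmetric about its median (5).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# The same sample with its largest value replaced by an extreme one.
y <- replace(x, length(x), 900)

spread <- function(v) c(sd = stats::sd(v), IQR = stats::IQR(v), MedianAD = median.ad(v))

spread.x <- spread(x)
spread.y <- spread(y)

# For the symmetric sample the IQR is twice the Median AD; with the extreme
# value the standard deviation changes a great deal, while the IQR and the
# Median AD are unchanged in this particular case.
print(rbind(original = spread.x, with.outlier = spread.y))
```

A single extreme value is not enough to separate the IQR from the Median AD here; this only shows how both differ from the standard deviation.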

The first was that the abbreviation MAD is often used not only for Median AD but also for Mean AD. Given this, finding that the central point used for each is not necessarily the one on the can seems a mere detail. Even an internet search for Median AD returns a large number of links that actually use the Mean AD. Caveat emptor! I have determined to use the abbreviation AD (fashioned after the common use of x̃ to represent the median) for the Median AD in my reports: whilst it is totally non-standard, it does at least cause the reader to question what they are looking at. At least until someone tells me there’s a better one I’ve overlooked.
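To make the ambiguity concrete, here is a minimal sketch of the four quantities that can result from combining the two averaging functions with the two central points. The sample and the names in the variants vector are invented for illustration.

```r
# Made-up sample with one large value, so that mean and median differ.
x <- c(2, 3, 5, 8, 42)

centre.mean   <- mean(x)
centre.median <- stats::median(x)

# Each combination of averaging function and central point gives a
# different "absolute deviation", any of which may turn up labelled MAD.
variants <- c(
  mean.ad.about.mean     = mean(abs(x - centre.mean)),
  mean.ad.about.median   = mean(abs(x - centre.median)),
  median.ad.about.mean   = stats::median(abs(x - centre.mean)),
  median.ad.about.median = stats::median(abs(x - centre.median))
)

print(variants)
```

For this sample the four results are 12, 9, 9 and 3 respectively, so which definition a source is using matters a great deal.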

I then discovered that the R function stats::mad(), which makes it trivial to calculate this dispersion, applies by default a constant scaling factor of 1.4826 in order to estimate the Gaussian standard deviation[2], which may not apply to our data, nor be what we want to calculate. I appear not to be the first person caught out by an inability to pay sufficient attention to the instructions. This is easy to bypass, if you spot it, simply by specifying:

stats::mad( data, constant = 1 ) 

if you want the pure Median AD (or a number other than one if you want an alternative correction).
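The effect of the default is easy to see on a small sample. The sketch below uses made-up values; the R documentation for mad() describes the default constant as approximately 1/qnorm(3/4), which the code recomputes as a check.

```r
# Made-up sample, for illustration only.
data <- c(12, 15, 11, 19, 14, 13, 60)

scaled   <- stats::mad(data)                 # default: constant = 1.4826
unscaled <- stats::mad(data, constant = 1)   # pure Median AD

# The Gaussian consistency factor, of which 1.4826 is a rounded value.
gaussian.factor <- 1 / stats::qnorm(3 / 4)

print(c(scaled = scaled, unscaled = unscaled, ratio = scaled / unscaled))
print(gaussian.factor)
```

The ratio of the two results is the scaling constant itself, so an unexpectedly large "Median AD" is a hint that the default is still in force.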

Though, if you’ve already calculated the median of the data, and you don’t require any of the other features of the function, it’s really just as simple to use the rather more transparent:

stats::median( abs(data - median.value) )

which you can then multiply by the factor of your choice.
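As a quick check that the two routes agree, the following sketch computes the Median AD by hand and compares it with stats::mad() with the scaling switched off. The sample is made up, and the factor applied at the end is only one possible choice.

```r
# Made-up sample, for illustration only.
data <- c(12, 15, 11, 19, 14, 13, 60)

# Median calculated once and reused.
median.value <- stats::median(data)

# The transparent version: median of the absolute deviations from the median.
median.ad <- stats::median(abs(data - median.value))

# The library version, reusing the same centre and with no scaling.
library.ad <- stats::mad(data, center = median.value, constant = 1)

# Multiply by the factor of your choice, here the Gaussian one as an example.
scaled.ad <- median.ad * 1.4826

print(c(median.ad = median.ad, library.ad = library.ad, scaled.ad = scaled.ad))
```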

Further Reading

My own experiments with robust statistics in the context of TCP/IP volumetric analysis continue, though I am still working on clarifying my own thoughts on exactly when I want to take extreme values into account and when I wish to regard them as outliers: a question of particular interest for the identification of suspect TCP/IP traffic volume deviations. (Two links that I’m ruminating upon are a carefully thought out response to a question on Cross Validated, “Median Absolute Deviation vs Standard Deviation”[3], and a paper in JASA on “Alternatives to the Median Absolute Deviation”[4].)

Thanks

This article was inspired by work performed for SNCF, and is published with their permission, for which many thanks.

Author

Ian S. Worthington is not a statistician, nor does he play one on TV. He does though have over 25 years’ experience as a developer and systems SME delivering solutions with high-volume, high-performance transaction processing systems using z/OS, z/TPF, ALCS, z/VM, and Linux on (and off) IBM Z. His customers have included IBM (three of ’em: US, UK, and India), British Airways, American Express, Citibank (Ass. Bancorp), and currently SNCF. He holds a BSc in Applied Physics, and an MSc in Database and Information Science, both from the University of London.

© 2023 Ian S. Worthington

Footnotes

  1. For a symmetric distribution, the IQR is twice the median absolute deviation (MAD), i.e. the MAD is the distance from the median to the Q1 or Q3 points.

  2. On the assumption of a large number of samples. Which may not be your case. See Akinshin, A. (2022). Finite-sample bias-correction factors for the median absolute deviation based on the Harrell-Davis quantile estimator and its trimmed modification. arXiv:2207.12005 [stat.ME] https://arxiv.org/pdf/2207.12005.pdf

  3. https://stats.stackexchange.com/a/329378/286976

  4. Rousseeuw, P.J., & Croux, C. (1993). Alternatives to the Median Absolute Deviation. Journal of the American Statistical Association, 88, 1273-1283. https://wis.kuleuven.be/stat/robust/papers/publications-1993/rousseeuwcroux-alternativestomedianad-jasa-1993.pdf