Pythagorean means are nice and all, but throwing the median in the pot is really what turns this into random forest statistics: applying every function you can think of, and then gradually dropping the ones that make the result worse.
This is another one of Randall's Tips, this time a stats tip. This came as the first tip comic after the statistics tip in 2400: Statistics.
There are a number of different ways to identify the "average" value of a series of values. The most common unweighted methods are the median (take the central value from the ordered list of values if there is an odd number of them, or the value half-way between the two central values if there is an even number) and the arithmetic mean (add all the numbers up, then divide by the number of numbers). The geometric mean is less well known but works analogously, with multiplication and roots in place of addition and division: the geometric mean of n positive numbers is the nth root of the product of those numbers. If all of the numbers in a sequence are identical, then its arithmetic mean, geometric mean and median will be identical, since they are all equal to the common value of the terms of the sequence. However, if a sequence of positive numbers is not constant, then the arithmetic mean will be greater than the geometric mean, and the median may be different from either of those means.
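As a rough illustration of the three averages described above, here is a minimal Python sketch. The function names are chosen for this example only, and the geometric mean assumes all inputs are positive.

```python
import math
import statistics


def arithmetic_mean(values):
    # Add all the numbers up, then divide by how many there are.
    return sum(values) / len(values)


def geometric_mean(values):
    # nth root of the product; this sketch assumes every value is positive.
    return math.prod(values) ** (1 / len(values))


def median(values):
    # Central value of the sorted list, or the value half-way between
    # the two central values when the list has even length.
    return statistics.median(values)


sample = [1, 1, 2, 3, 5]
am = arithmetic_mean(sample)
gm = geometric_mean(sample)
md = median(sample)
print(am, gm, md)
```

For this non-constant sample the arithmetic mean comes out larger than the geometric mean, and the median differs from both.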
The arithmetic mean, the geometric mean, and the harmonic mean (not shown) are collectively known as the Pythagorean means. They are special cases of a more general mean formula that extends to various other kinds of mean (cubic, etc.).
Outliers and internal biases within the original sample can make a single 'average' overly sensitive to flaws in the data. Depending on which method you choose, the resulting value may be misleading, exaggerating or suppressing the significance of any blips.
In this depiction, the three named methods of averaging are embedded within a single function that produces a sequence of three values, one output for each of the methods. Since this output is itself a series of values, Randall suggests that it is ideally suited to being subjected to the same comparative 'averaging' method: not just once, but as many times as it takes to narrow down to a sequence of three values that are very close to one another.
The comic's value of 2.089 for GMDN(1,1,2,3,5) can be verified by carrying out the iteration:
| Iteration | Arithmetic mean | Geometric mean | Median |
|---|---|---|---|
| F1 | 2.4 | 1.974350486 | 2 |
| F2 | 2.124783495 | 2.116192461 | 2 |
| F3 | 2.080325319 | 2.079536819 | 2.116192461 |
| F4 | 2.0920182 | 2.091948605 | 2.080325319 |
| F5 | 2.088097374 | 2.088090133 | 2.091948605 |
| F6 | 2.089378704 | 2.089377914 | 2.088097374 |
| F7 | 2.088951331 | 2.088951244 | 2.089377914 |
| F8 | 2.089093496 | 2.089093487 | 2.088951331 |
| F9 | 2.089046105 | 2.089046103 | 2.089093487 |
| F10 | 2.089061898 | 2.089061898 | 2.089046105 |

Each row in this table shows the set Fn(..) composed of the average, geomean and median computed on the previous row, with the sequence {1,1,2,3,5} as the initial F0. While GMDN is not differentiable, due to the median, this can be interpreted as somewhat similar to a heat equation which approaches equilibrium through averaging. Interestingly, the maximum value alternates between the average and the median, while the minimum value alternates between the geomean and the median. This observation holds for many inputs.
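The iteration tabulated above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `F` and `gmdn`, the tolerance `tol`, and the iteration cap `max_iter` are assumptions made for this example, not part of the comic, and the inputs are assumed to be positive.

```python
import math
import statistics


def F(values):
    # One step: (arithmetic mean, geometric mean, median) of the inputs.
    n = len(values)
    return (
        sum(values) / n,
        math.prod(values) ** (1 / n),  # assumes positive inputs
        statistics.median(values),
    )


def gmdn(values, tol=1e-12, max_iter=1000):
    # Apply F repeatedly until the three outputs agree to within tol
    # (or until max_iter steps have been taken), then return one component.
    current = F(values)
    for _ in range(max_iter):
        if max(current) - min(current) <= tol:
            break
        current = F(current)
    return current[0]


print(round(gmdn([1, 1, 2, 3, 5]), 3))  # 2.089
```

Stopping on the spread between the largest and smallest of the three values is one possible convergence test; other stopping rules would work equally well here.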
To avoid distracting from the comedic effect, the comic leaves the definition of the GMDN as a simplified sketch. To make the definition mathematically rigorous, the implied infinite limit in the second line can be made precise, for example as the result of a fixed-point iteration:

\[ G = \lim_{k \to \infty} m_k, \quad \text{where } m_0 = F(x_1, x_2, \ldots, x_n) \text{ and } m_{k+1} = F(m_k) \text{ for } k \ge 0. \]

This definition is well-defined only if we can prove convergence to a fixed point of F for a given set of inputs. Indeed, convergence holds if all numbers are non-negative (see discussions for proof and more cases). Note that the above definition yields a three-dimensional fixed point G. Because all fixed points of F are of the form \(G = (g, g, g)\), with elements that are all equal, we can define \(\mathrm{GMDN}(x_1, x_2, \ldots, x_n) = G_1\), the first element of G. This formal definition avoids an inconsistency in the comic's definition sketch: GMDN as defined in the second line has the same three-dimensional output as F, but GMDN in the last line is shown to produce a single real number rather than a vector, so a final operation returning a single component is missing.

Note also that the comic's definition of the median as the (n+1)/2-th order statistic, i.e. the (n+1)/2-th smallest value, coincides with the more commonly used sample median only on lists of odd length. For lists of even length the sample median is usually defined as the (arithmetic) mean of the two middle values \(X_{n/2}\) and \(X_{(n+2)/2}\) instead. Indeed, for lists of even length \(X_{(n+1)/2}\) is not well-defined without adding a flooring operation, as (n+1)/2 is not an integer.

The comment in the title text suggests that this will save you the trouble of committing to the 'wrong' analysis, as it gradually shaves down any 'outlier average' that is unduly affected by anomalies in the original inputs. It is a method without any danger of divergence of values, since all three averaging methods stay within the interval covering the input values (and two of them will stay strictly within that interval).
The title text may also be a sly reference to an actual mathematical theorem, namely that if one performs this procedure only using the arithmetic mean and the harmonic mean, the result will converge to the geometric mean. Randall suggests that the (non-Pythagorean) median, which does not have such good mathematical properties with relation to convergence, is, in fact, the secret sauce in his definition.
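The arithmetic-harmonic iteration described above can be sketched as follows. The function name, the tolerance, and the two starting values are illustrative assumptions. Each step preserves the product of the pair, which is why the limit is the geometric mean of the starting values.

```python
import math


def arithmetic_harmonic_mean(a, b, tol=1e-12, max_iter=100):
    # Repeatedly replace (a, b) by their arithmetic and harmonic means.
    # The product a * b is unchanged by each step, so both values
    # approach sqrt(a * b). Assumes a and b are positive.
    for _ in range(max_iter):
        if abs(a - b) <= tol * max(a, b):
            break
        a, b = (a + b) / 2, 2 * a * b / (a + b)
    return (a + b) / 2


limit = arithmetic_harmonic_mean(5 / 6, 6 / 7)
print(limit, math.sqrt(5 / 7))
```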
The question of which mean to use is especially relevant for the arithmetic and harmonic means in the following example.
* Cueball has some US Dollars and wishes to buy Euros. Suppose the bank will exchange US Dollars to Euros at a rate of €5 for $6 (about 0.83333€/$ or 1.20000$/€).
* Megan has some Euros and wishes to buy US Dollars. Suppose the bank will exchange Euros to US Dollars at a rate of $7 for €6 (about 0.85714€/$ or 1.16667$/€).

Cueball and Megan decide to complete the exchange between themselves in order to save the bid-ask spread of the exchange rate, which is the cost the bank imposes on Cueball and Megan for its service as a market maker.

* Cueball offers to split the difference by averaging the rates €5:$6 and €6:$7, yielding a rate of €71:$84 (about 0.84524€/$ or 1.18310$/€).
* Megan offers to split the difference by averaging the rates $6:€5 and $7:€6, yielding a rate of €60:$71 (about 0.84507€/$ or 1.18333$/€).

In one direction (€/$), Cueball is using the arithmetic mean but Megan is using the harmonic mean, while in the other direction ($/€), Megan is using the arithmetic mean but Cueball is using the harmonic mean. This creates two new exchange rates which are closer together than the original rates, but the new rates are still different from each other. Megan and Cueball can then iterate this process, and the rates will converge to the geometric mean of the original rates, namely:

* sqrt((5/6)*(6/7)) = sqrt(5/7) = 0.84515€/$ or
* sqrt((6/5)*(7/6)) = sqrt(7/5) = 1.18322$/€.

There does exist an arithmetic-geometric mean, which is defined identically to this except with the arithmetic and geometric means, and sees some use in calculus. In some ways the GMDN is also philosophically similar to the truncated mean (extremities of the value range, e.g. the highest and lowest 10%, are ignored as not acceptable and not counted) or the Winsorized mean (instead of being ignored, the values are readjusted to the chosen floor/ceiling values that they lie beyond, so that they are still effectively counted as "edge" conditions), only with a strange dilution-and-compromise method rather than one where quantities can be culled or neutered just for being unexpectedly different from most of the other data.
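The truncated and Winsorized means mentioned above can be sketched as follows. This is a hand-rolled illustration rather than any particular library's implementation. The parameter `k` (the number of values affected at each end) and the sample data are assumptions made for this example; with ten values, `k=1` corresponds to the 10% cut-off mentioned in the text.

```python
def truncated_mean(values, k):
    # Drop the k smallest and k largest values, then average the rest.
    if 2 * k >= len(values):
        raise ValueError("k is too large for this many values")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, k):
    # Instead of dropping the extremes, clamp them to the nearest
    # value that is kept, then average all n values.
    if 2 * k >= len(values):
        raise ValueError("k is too large for this many values")
    ordered = sorted(values)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    clamped = [min(max(v, low), high) for v in ordered]
    return sum(clamped) / len(clamped)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up sample with one outlier
plain = sum(data) / len(data)
trimmed = truncated_mean(data, 1)
winsorized = winsorized_mean(data, 1)
print(plain, trimmed, winsorized)
```

In this made-up sample the single outlier pulls the plain arithmetic mean far above the other two results.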
The input sequence of numbers (1, 1, 2, 3, 5) chosen by Randall is also the opening of the Fibonacci sequence. This may have been selected because the Fibonacci sequence also has a convergent property: the ratio of two adjacent numbers in the sequence approaches the golden ratio as the length of the sequence approaches infinity.
Here is a table of averages classified by the various methods referenced:
Averages using various methods

| Method | Value | Formula |
|---|---|---|
| Arithmetic | 2.4 | Add all numbers, then divide the sum by n, where n is the number of terms. |
| Geometric | 1.9743504858348 | Multiply all numbers, then take the product's nth root, where n is the number of terms. |
| Median | 2 | Find the term or terms which separate the upper half of the set from the lower half. If the set has an even number of terms, find the arithmetic mean of the middle two terms. |
| GMDN | 2.089 | (see above) |