xkcd.WTF!?

Proxy Variable

Our work has produced great answers. Now someone just needs to figure out which questions they go with.

Explanation

In this comic, Hairy is discussing the use of a proxy variable with Cueball. In statistics, a proxy variable is a stand-in for one or more other variables that are difficult to measure. To be useful, a proxy variable must be correlated with what it is intended to represent. For example, a drug might aim to reduce deaths from a slow-acting disease. Testing whether it reduces deaths might take many years, so researchers might test for a proxy outcome instead, such as whether the drug appears to mitigate loss of bone density or cell damage. Similarly, physicians use blood pressure as one of many proxies for cardiovascular health.
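The requirement that a proxy be correlated with its target can be illustrated with a small simulation. This is only a sketch under invented assumptions: the "target" is random noise, the "good proxy" is the target plus measurement noise, and the "bad proxy" is unrelated noise. The names pearson, simulate, good_proxy and bad_proxy are hypothetical and do not come from the comic or from any real study.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=2000, seed=0):
    """Return (correlation of a good proxy, correlation of a bad proxy)."""
    rng = random.Random(seed)
    # Assumption: the hard-to-measure quantity is standard normal noise.
    target = [rng.gauss(0, 1) for _ in range(n)]
    # A good proxy tracks the target, plus some measurement noise.
    good_proxy = [t + rng.gauss(0, 0.5) for t in target]
    # A bad proxy is generated independently of the target.
    bad_proxy = [rng.gauss(0, 1) for _ in range(n)]
    return pearson(target, good_proxy), pearson(target, bad_proxy)


good_r, bad_r = simulate()
print(f"good proxy r = {good_r:.2f}, bad proxy r = {bad_r:.2f}")
```

In this toy setup the good proxy's correlation should be close to 1 and the bad proxy's close to 0. A high correlation alone still does not show that the proxy answers the right question, which is the point of the comic.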

Hairy dismisses the question of whether they are studying the right variable as too expensive to answer. This is ironic, and thus satirical, because good experimental design requires checking the robustness of every part of an experiment, even when the expense may be prohibitive. This comic might be referring to the recent discovery of nearly two decades of allegedly fraudulent Alzheimer's disease research supporting a mistaken proxy hypothesis.

Choosing the wrong proxy variable can make the research misleading or irrelevant or, as the title text suggests, leave it answering the wrong question. Interpreting proxy variable results therefore requires separating correlation from causation, so that it is clear which question the results actually answer. A proxy that is merely correlated with the target, rather than causally linked to it, supports weaker conclusions. Exploratory causal analysis can help find useful proxy variables, but it is difficult for the layperson to interpret and can mislead even when performed correctly: with a combinatorial explosion of possible proxy variables, some candidates will pass a traditional statistical significance test by chance alone, so F-scores or similar measures are required instead. The history of pharmaceutical research is largely a graveyard of failed proxy hypotheses; that is one of the reasons for experiment registration regulations.
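The way a large pool of candidate proxies defeats a naive significance test can be sketched as follows. Everything here is an illustrative assumption: the outcome and all 200 candidates are pure noise, the significance test uses the Fisher z approximation, and the correction shown is the Bonferroni adjustment, which is one standard remedy and not the F-score mentioned above. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_hits(n_samples=50, n_candidates=200, alpha=0.05, seed=1):
    """Count noise-only candidates that look significant, naively and after Bonferroni."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    # Two-sided critical z values: per-test alpha, and alpha split across all tests.
    z_naive = NormalDist().inv_cdf(1 - alpha / 2)
    z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_candidates))
    naive = corrected = 0
    for _ in range(n_candidates):
        # Each candidate proxy is unrelated noise, so every "hit" is a false positive.
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson(outcome, candidate)
        # Fisher z approximation: atanh(r) * sqrt(n - 3) is roughly standard normal.
        z = abs(math.atanh(r)) * math.sqrt(n_samples - 3)
        naive += z > z_naive
        corrected += z > z_bonferroni
    return naive, corrected


naive_hits, bonferroni_hits = count_hits()
print(f"naive hits: {naive_hits}, after Bonferroni: {bonferroni_hits}")
```

With a 5% threshold, roughly 5% of the unrelated candidates (about 10 of 200) would be expected to look "significant" by chance, while the corrected threshold should reject nearly all of them. Any candidate picked from the naive hits would be a proxy for nothing.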

The title text's notion of having an answer without knowing the actual question could also be a reference to the classic comedy science fiction novel The Hitchhiker's Guide to the Galaxy, in which Earth turns out to be a supercomputer built to figure out the question that goes with the answer "42."

Examples of noteworthy proxy variables

  • Loss of bone density or damage to cells for toxicity
  • Blood pressure for cardiovascular health
  • Amyloid markers for Alzheimer's disease
  • Local temperature for global warming severity
  • GDP growth for development (demolishing a hospital adds to GDP but subtracts from development)
  • Money supply size for price inflation (see e.g. the paradox of thrift)
  • Carbonic anhydrase expression for carbon sequestration
  • Asphalt production for carbon sequestration
  • Proportion of renewable energy for carbon reduction (see Jevons paradox)
  • Dialytic desalination for carbon sequestration[1][2]
  • Bacillus thuringiensis israelensis application for mosquito abatement
  • Indoor carbon dioxide levels for air quality and ventilation