If the data without outliers are normally distributed, as viewed on a histogram, and have an associated Normal Probability Plot with r > r_critical, then when we analyze the entire data set, outliers included, we expect:
1. Roughly equal numbers of data points above the upper fence (Q3 + 1.5*(Q3 - Q1)) and below the lower fence (Q1 - 1.5*(Q3 - Q1)).
2. Alternatively, would it be more accurate to weight each outlier by how far it falls outside the fences?
Possibility #2 may be more accurate; I will develop SAS algorithms to test it with both real data and massaged data.
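Before committing to SAS, the two candidate statistics can be sketched in a few lines of Python. This is only an illustration: the quartile definition below (linear interpolation) is one of several conventions, and the function names are my own, not from any established package.

```python
def fences(data):
    """Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    xs = sorted(data)
    n = len(xs)

    def quantile(p):
        # Linear-interpolation quantile -- one of several common
        # definitions (SAS, for instance, supports multiple via PCTLDEF).
        k = p * (n - 1)
        i = int(k)
        frac = k - i
        return xs[i] if i + 1 >= n else xs[i] + frac * (xs[i + 1] - xs[i])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outlier_asymmetry(data):
    """Possibility #1: count outliers below vs. above the fences.
    Possibility #2: weight each outlier by its distance past the fence.
    Returns ((n_low, n_high), (weight_low, weight_high))."""
    lo, hi = fences(data)
    n_low = sum(1 for x in data if x < lo)
    n_high = sum(1 for x in data if x > hi)
    w_low = sum(lo - x for x in data if x < lo)
    w_high = sum(x - hi for x in data if x > hi)
    return (n_low, n_high), (w_low, w_high)
```

For honest, roughly normal data both pairs should be approximately balanced; data trimmed of its low readings would show an excess on the high side by either measure.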
The importance of this lies in detecting phony, heavily massaged data. When the scientists have an agenda, obvious or not, massaged data is to be expected: we expect to see asymmetries, if not outright fabricated numbers, and we expect to see massaged data passed off as raw data.
For example, consider the British climate change/global warming scientists whose emails essentially confessed their extreme political views and their willingness to make up data to get the rest of the world to follow them. Would it be at all surprising to statistically analyze their so-called 'raw data' and find that it has been heavily massaged toward eliminating low temperature readings below the lower fence? Or that it is just plain made up? I will use the Newcomb-Benford Law to statistically analyze data first, to detect the highest-level fraud: made-up data.
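The Newcomb-Benford Law says that in many naturally occurring data sets the leading digit d appears with probability log10(1 + 1/d), so 1 leads about 30.1% of the time and 9 only about 4.6%. A minimal Python sketch of a chi-square screening test follows; the function names are mine, and the critical value 15.507 is the standard chi-square cutoff for 8 degrees of freedom at the 5% level.

```python
import math
from collections import Counter

# Benford probabilities for leading digits 1-9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading nonzero digit of a number (sign and leading zeros ignored)."""
    return int(str(abs(x)).lstrip("0.")[0])

def benford_chi_square(values):
    """Chi-square statistic comparing observed leading-digit frequencies
    with the Newcomb-Benford distribution.  Values well above 15.507
    (df = 8, alpha = 0.05) suggest the digits do not follow the law --
    a red flag for made-up data, though not proof by itself."""
    digits = Counter(first_digit(v) for v in values if v != 0)
    n = sum(digits.values())
    return sum((digits.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())
```

Fabricated numbers tend to have leading digits spread far more uniformly than Benford predicts, which is exactly what inflates this statistic. A chi-square screen flags suspicion only; deviations can also arise from legitimately non-Benford data (e.g., measurements confined to a narrow range, like daily temperatures), so the test must be applied to quantities known to follow the law.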
Question: How did Benford know that the data he analyzed to develop his law did not in fact already contain made-up data, thereby skewing the numbers?