### Archive

Archive for the ‘statistics’ Category

## Millikan Experiments and Selection Error

From Richard Feynman, “Cargo Cult Science”:

We have learned a lot from experience about how to handle some of
the ways we fool ourselves. One example: Millikan measured the
charge on an electron by an experiment with falling oil drops, and
got an answer which we now know not to be quite right. It’s a
little bit off, because he had the incorrect value for the
viscosity of air. It’s interesting to look at the history of
measurements of the charge of the electron, after Millikan. If you
plot them as a function of time, you find that one is a little
bigger than Millikan’s, and the next one’s a little bit bigger than
that, and the next one’s a little bit bigger than that, until
finally they settle down to a number which is higher.

Why didn’t they discover that the new number was higher right away?
It’s a thing that scientists are ashamed of–this history–because
it’s apparent that people did things like this: When they got a
number that was too high above Millikan’s, they thought something
must be wrong–and they would look for and find a reason why
something might be wrong. When they got a number closer to
Millikan’s value they didn’t look so hard. And so they eliminated
the numbers that were too far off, and did other things like that.
We’ve learned those tricks nowadays, and now we don’t have that
kind of a disease.


## Defining “Median”

The median of a set of numbers is the middle value. In the set (1,2,3), the median is 2. But how about the set (1,2,3,4)? Most commonly, people define the median as 2.5. That is a good measure of central tendency, I guess, but it isn’t satisfactory because it mixes the ideas of mean and median. Also, then the median isn’t a member of the set.

Perhaps the best definition is that the median is X, where X is the lowest value in the set such that at least 50% of the values are less than or equal to X.
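Here is a minimal sketch of that definition next to the usual averaging one. The function name `lowest_half_median` is my own label, and I read "50%" as "at least 50%" so that odd-sized sets work too. Python's standard library already has the same idea under the name `statistics.median_low`.

```python
import math
import statistics


def lowest_half_median(values):
    """Smallest member x of the set such that at least half the values are <= x."""
    ordered = sorted(values)
    needed = math.ceil(len(ordered) / 2)  # how many values must be <= x
    return ordered[needed - 1]


for data in [(1, 2, 3), (1, 2, 3, 4)]:
    print(
        data,
        "averaging median:", statistics.median(data),
        "lowest-half median:", lowest_half_median(data),
        "median_low:", statistics.median_low(data),
    )
```

Under this definition the median is always a member of the set, which is the property the averaging definition gives up.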


## I Want Comment Triage Software

Nobody comments here, so it’s not a personal need, but I want to see comments on blogs and articles organized differently. First I’ll say what I want to see, and then I’ll explain why.

Each comment will be directed to one of four triage categories. These will not be the traditional “Doesn’t need treatment now”, “Needs help”, and “Too hard to help–let him die” categories. Rather, they will be: Read more…


## Out of 1,791 IRS lawyers, 38 made big contributions to Democrats and 2 to Republicans: Meaningful?

What does it tell us if 38 IRS lawyers make big contributions to Democrats and 2 to Republicans, when there are 1,791 IRS lawyers total? The question came up today at Volokh Conspiracy. Isn’t 40 out of 1,791 too small a proportion?

No. Surprisingly, if a sample is chosen randomly, what matters is that the sample be big enough, not how big the population is. Thus, if 40 out of 500 is big enough, so is 40 out of 10,000. That’s why pollsters don’t use samples of more than about 1,000: if they’re really random samples, it doesn’t help much to go higher.
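A rough way to see this is to compute the standard error of a sample proportion with and without the textbook finite population correction. This is only a sketch: it assumes a simple random sample drawn without replacement, the function name `standard_error` is mine, and p = 0.5 is used because it gives the largest standard error.

```python
import math


def standard_error(p, n, population=None):
    """Standard error of a sample proportion p from a simple random sample of size n.

    If population is given, apply the finite population correction
    sqrt((N - n) / (N - 1)) for sampling without replacement.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return se


p = 0.5  # worst case for the standard error
for n, population in [(40, 500), (40, 1791), (40, 10000), (1000, None)]:
    label = "infinite" if population is None else population
    print(f"n={n:5d}  population={label!s:>8}  se={standard_error(p, n, population):.4f}")
```

The three rows with n = 40 come out nearly identical, while going from 40 to 1,000 observations cuts the standard error by a factor of five.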


## Regressions and Global Warming

December 29th, 2009

The webpost http://tamino.wordpress.com/2009/12/15/how-long/ has a nice step-by-step exposition of how to estimate whether there is a warming trend in temperature data 1975-2008, first using OLS, then using an AR-1 process, then an ARMA. The trend is significant. But the post is responding to the observation that the trend has flattened out since 2000. It doesn’t really respond to that.

To see why, note the graph above. It has artificial temperatures that rise from 1975 to 2000 and then flatten out. If you do an OLS regression, though, YEAR comes in significant with a t-statistic of 25.33 and an R² of .95. I just did it with Excel, because I haven’t installed StarOffice or STATA on my new computer here, but I’m sure that doing a serial correlation correction wouldn’t alter the result much. Yet eyeballing it, we can see that though temperatures have clearly risen since 1975, they have just as clearly flattened out since 2000. A linear regression just doesn’t summarize the data correctly.

Let’s do a couple more examples for fun and to drive home the point. In the second figure, the temperature levels out in 1982, but YEAR is still highly significant, with a t-stat of 4.89, though the R² drops to .42. (What’s the R² with the real data? Very small, I’d expect.)

Okay, now look at the third figure, in which the trend actually reverses. The t-stat is actually bigger, 4.98, and the R² is .43.

So don’t go and use a linear model when eyeballing the data tells you it isn’t appropriate. When you have a simple regression in which only one variable explains another, use your eyes first, and software second. Do remember, though, that checking for statistical significance, autocorrelation, and all those other things is useful too, so long as you start off right. Here, the question is not just “Have temperatures been rising with time over the past 30 years?” but, separately, “Have temperatures been rising with time over the past 10 years?”

The way to start addressing that with regression, by the way, is to do a regression of temperature on four variables: Constant, Year, a dummy equaling 1 if the year is after 1999 and 0 otherwise, and an interaction of that dummy with Year.
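Here is a small sketch of both points on made-up data. It is not my Excel spreadsheet: the numbers (a 0.02-per-year rise until 2000, then flat, with noise of standard deviation 0.05) and the function names are assumptions for illustration. It uses the fact that a regression with a full set of dummy interactions gives the same point estimates as separate regressions on the two sub-periods, so the four coefficients can be backed out from two simple fits. The standard errors would differ, since the pooled regression shares one error variance.

```python
import math
import random


def ols(x, y):
    """Simple OLS of y on a constant and x: slope, intercept, t-stat of slope, R^2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, slope / se_slope, 1 - sse / sst


def kinked_regression(years, temps, break_year=2000):
    """Coefficients on Constant, Year, Dummy (year >= break_year), and Dummy*Year."""
    pre = [(t, v) for t, v in zip(years, temps) if t < break_year]
    post = [(t, v) for t, v in zip(years, temps) if t >= break_year]
    b_pre, a_pre, _, _ = ols(*zip(*pre))
    b_post, a_post, _, _ = ols(*zip(*post))
    return {
        "constant": a_pre,
        "year": b_pre,
        "dummy": a_post - a_pre,
        "dummy_x_year": b_post - b_pre,
    }


# Hypothetical artificial data: rising until 2000, flat afterwards.
rng = random.Random(0)
years = list(range(1975, 2009))
temps = [0.02 * (min(t, 2000) - 1975) + rng.gauss(0, 0.05) for t in years]

slope, _, t_stat, r2 = ols(years, temps)
print(f"straight line: slope={slope:.4f}  t={t_stat:.1f}  R2={r2:.2f}")

coef = kinked_regression(years, temps)
print(f"slope before 2000: {coef['year']:.4f}")
print(f"slope from 2000 on: {coef['year'] + coef['dummy_x_year']:.4f}")
```

The straight-line fit reports a strongly significant trend, while the slope from 2000 on is the sum of the Year and interaction coefficients, which is the number the "last 10 years" question is really about.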

If a lot of people are interested, I could apply the serial correlation corrections to the artificial data or do this 4-variable regression on the real data, but maybe somebody else can take over now. My Excel spreadsheet is at http://rasmusen.org/t/2009/warming.xlsx, and a pdf of this post is at http://rasmusen.org/t/2009/warming.pdf. I’m Eric Rasmusen at erasmuse@indiana.edu, and this is December 29, 2009.


## Principal Components Analysis

From Wikipedia, Principal Components Analysis:

PCA is theoretically the optimal linear scheme, in terms of least mean square error, for compressing a set of high dimensional vectors into a set of lower dimensional vectors and then reconstructing the original set. It is a non-parametric analysis and the answer is unique and independent of any hypothesis about data probability distribution.


## Murder and Medicine

Comparing murder rates across 50-year time periods is misleading,
this blog post tells us:

As Lt. Col. Dave Grossman pointed out in his book On Killing, the
aggravated assault rate serves as a close proxy statistic for
attempted murders. And the aggravated assault rate has increased
dramatically since the 1950s even if the murder rate has not.
Criminologist Anthony Harris estimates today’s homicide rate would
triple if medical and rescue technologies had not improved since the
50s.

Grossman was kind enough to email me an excerpt from his new book On
Combat when I asked him for more detailed source citations for his
writing on this topic. He argues that in comparing today’s homicide
rate with the 1930s and before we ought to multiply today’s rate by
ten for a true comparison:

Since 1957, the U.S. per capita aggravated assault rate (which is,
essentially, the rate of attempted murder) has gone up nearly five-
fold, while the per capita murder rate has less than doubled. The
reason for this disparity is the vast progress in medical technology
since 1957, to include everything from mouth-to-mouth resuscitation,
to the national 911 emergency telephone system, to medical technology
advances. Otherwise, murder would be going up at the same rate as
attempted murder.

In 2002, Anthony Harris and a team of scholars from the University
of Massachusetts and Harvard, published a landmark study in the
journal, Homicide Studies, which concluded that medical technology
advances since 1970 have prevented approximately three out of four
murders. That is, if we had 1970s level medical technology, the murder
rate would be three or four times higher than it is today.

Furthermore, it has been noted that a hypothetical wound that nine
out of ten times would have killed a soldier in World War II, would
have been survived nine out of ten times by U.S. soldiers in Vietnam.
This is due to the great leaps in battlefield evacuation and medical
care technology between 1940 and 1970–and we have made even greater
progress in the years since. Thus, it is probably a conservative
statement to say that if today we had 1930s level evacuation
notification and medical technology (no automobiles and telephones for
most people, and no antibiotics), then we would have ten times the
murder rate we currently do. That is, attempts to inflict bodily harm
upon one another would result in death ten times more often.


## SAT Won’t Report Low Scores

January 9th, 2009

National Review’s blog reports that the SAT is changing so that only a student’s MAXIMUM score out of all the times he takes the test will be reported to colleges. What amazing favoritism to rich, stupid, applicants!

Or maybe not so amazing. This will be a bonanza for the SAT company, since their tests will be taken so many more times. This is especially true nowadays, when many colleges have merit-based scholarships and your $45 retest fee might have a 1/10 chance of yielding you $1000 extra in tuition breaks.

It also raises an interesting mathematical question. Suppose everyone ends up taking the test exactly 8 times. This will cost a lot more, of course, but will it yield more accurate evaluation of the applicants? Which provides more useful information:

1. A single test score.

2. The maximum of 8 test scores.

The answer depends on the distribution of an individual’s test scores for his given talent. If someone with ability X scores X on the test with probability .9 and X-y with probability .1, the Maximum is a better measure (in fact, then it is even better than the average of 8 test scores).

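If someone with ability X scores X on the test with probability .8, X-y with probability .1, and X+y with probability .1, which is better? The tempting answer is the maximum, on the argument that almost every person i ends up with a maximum of Xi+y, so we can simply subtract y and get his ability. With only 8 tries, though, the maximum is Xi+y with probability 1-.9^8, or about .57, and is Xi nearly all the rest of the time. The maximum is thus right less often than the single score, which is right with probability .8. The maximum would win only with many more retakes.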

If someone with ability X scores X on the test with probability .999 and X+y with probability .001, then, I think, the single reported score is better. It is right with probability .999, whereas the maximum will be X+y with probability 1-.999^8, so it will be right with only probability .992. (I haven’t phrased that carefully: what we care about is not the percentage of “right” answers but the variance of the measure minus the true ability, but in this special case the two criteria give the same answer.)
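These three cases can be checked exactly rather than by intuition. The sketch below sets y = 1 and takes the distribution of (score minus ability) as a dictionary. The function names are mine, and "right" is interpreted as the probability of the most likely value, on the assumption that colleges know the distribution and can subtract a constant shift. It is especially worth running for the second case, where everything hinges on how often the maximum actually reaches X+y in 8 tries.

```python
def max_distribution(score_dist, n_tests):
    """Distribution of the maximum of n_tests independent scores.

    score_dist maps (score - ability) to its probability.
    """
    result = {}
    cdf, prev_cdf_n = 0.0, 0.0
    for offset in sorted(score_dist):
        cdf += score_dist[offset]
        cdf_n = cdf ** n_tests  # P(all n scores <= offset)
        result[offset] = cdf_n - prev_cdf_n
        prev_cdf_n = cdf_n
    return result


def variance(dist):
    """Variance of (measure - ability), which ignores any constant bias."""
    mean = sum(o * p for o, p in dist.items())
    return sum(p * (o - mean) ** 2 for o, p in dist.items())


def prob_right(dist):
    """Chance the measure pins down ability once a known constant is subtracted."""
    return max(dist.values())


cases = {
    "X: .9, X-y: .1": {0: 0.9, -1: 0.1},
    "X: .8, X-y: .1, X+y: .1": {-1: 0.1, 0: 0.8, 1: 0.1},
    "X: .999, X+y: .001": {0: 0.999, 1: 0.001},
}

for name, single in cases.items():
    best_of_8 = max_distribution(single, 8)
    print(name)
    print(f"  single : P(right)={prob_right(single):.4f}  variance={variance(single):.5f}")
    print(f"  max of 8: P(right)={prob_right(best_of_8):.4f}  variance={variance(best_of_8):.5f}")
```

Using the variance of the error rather than the share of right answers is the criterion mentioned in the parenthetical above. The code reports both so the two can be compared case by case.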

What if the distribution of test scores around ability has a normal distribution? I don’t know. The answer might depend on the variance. I’ll ask our job candidate at lunch. He’s a couple of years out of grad school already, so he shouldn’t freak out at the question.
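A quick simulation gives a tentative answer for the normal case. The assumptions are mine: every applicant's score is ability plus independent normal noise with the same standard deviation `sigma`, so the upward bias of the maximum is a constant that could be subtracted, and the criterion is the variance of (measure minus ability). Because all three measures scale with `sigma`, the ranking does not depend on the variance.

```python
import random
import statistics


def simulate(n_people=20000, n_tests=8, sigma=1.0, seed=0):
    """Variance of (measure - ability) for three ways of summarizing n_tests scores."""
    rng = random.Random(seed)
    single, maximum, average = [], [], []
    for _ in range(n_people):
        errors = [rng.gauss(0.0, sigma) for _ in range(n_tests)]
        single.append(errors[0])              # one test only
        maximum.append(max(errors))           # best of n_tests
        average.append(sum(errors) / n_tests)  # mean of n_tests
    return {
        "single": statistics.pvariance(single),
        "maximum": statistics.pvariance(maximum),
        "average": statistics.pvariance(average),
    }


for name, var in simulate().items():
    print(f"{name:8s} variance of error: {var:.3f}")
```

With normal noise the maximum of 8 has a bit over a third of the error variance of a single score, so it is the more informative of the two, though the average of the 8 scores does better still.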


## Wald, LR, and Score Tests

From Cornell, “Econ 620: Three Classical Tests; Wald, LM(Score), and LR tests” is a good description of the Wald, likelihood ratio, and score tests. The Hausman test seems more like an LR test, since it estimates both the restricted and unrestricted equations. I found the statalist post below on the Wald test for exogeneity of regressors:

This test is mentioned along with the theory behind -ivprobit- in Wooldridge’s “Econometric Analysis of Cross Section and Panel Data” (2002, pp. 472-477).

For the maximum likelihood variant with a single endogenous variable, the test is simply a Wald test that the correlation parameter rho is equal to zero. That is, the test simply asks whether the error terms in the structural equation and the reduced-form equation for the endogenous variable are correlated. If there are multiple endogenous variables, then it is a joint test of the covariances between the k reduced form equations’ errors and the structural equation’s error.

In the two-step estimator, in the second stage we include the residuals from the first-stage OLS regression(s) as regressors. The Wald test is a test of significance on those residuals’ coefficients.

Categories: statistics Tags:

## Conditional Logit

October 11th, 2008

I was trying to understand how conditional logit and fixed effects in multinomial logit worked, to explain to someone who asked, and I failed. Greene’s text was not very helpful. The best thing I found was some notes from Penn: “Conditional Logistic Regression (CLR) for Matched or Stratified Data”. The bottom line seems to be that conditional logit (clogit in Stata) chooses its parameter estimates to maximize the likelihood of the variation we see within the strata, while ignoring variation across strata. Thus, if we have data on 30 people choosing to travel by either car or bus over 200 days, we could use 30 dummies for the people, but in conditional logit we don’t. Also, in conditional logit, unlike logit with dummies, if someone always travels by car instead of varying, that person is useless to the estimation.
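The last point can be seen in a toy version of the conditional likelihood for a binary choice. This is a sketch of the textbook fixed-effects logit formula, not of what Stata's clogit does internally. The covariate (a rain indicator for each day), the five-day sample, and the function name are all hypothetical. A person's contribution is the probability of his observed sequence of choices given how many days in total he took the car, and the person-specific intercept cancels out of that ratio.

```python
import math
from itertools import combinations


def conditional_likelihood(beta, x, y):
    """P(observed 0/1 sequence y | number of ones in y) under logit with a person effect.

    x[t] is the covariate on day t, y[t] = 1 if the car was chosen that day.
    """
    days = len(y)
    n_car = sum(y)
    numerator = math.exp(beta * sum(xt for xt, yt in zip(x, y) if yt == 1))
    # Sum over every way of placing the same number of car days in the sample.
    denominator = sum(
        math.exp(beta * sum(x[t] for t in chosen))
        for chosen in combinations(range(days), n_car)
    )
    return numerator / denominator


rain = [1, 0, 1, 0, 0]        # hypothetical covariate
varies = [1, 0, 1, 0, 0]      # takes the car only when it rains
always_car = [1, 1, 1, 1, 1]  # never varies

for beta in (0.0, 1.0, 2.5):
    print(
        f"beta={beta}: varying person {conditional_likelihood(beta, rain, varies):.3f}, "
        f"always-car person {conditional_likelihood(beta, rain, always_car):.3f}"
    )
```

The always-car person contributes exactly 1 whatever beta is, because there is only one way to arrange five car days in five days, so he tells the estimator nothing.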

Categories: statistics Tags:

## Ratio Variables in Regressions

I was reading Gibbs and Firebaugh (Criminology, 1990) on ratio variables in regressions. Suppose you regress Arrests/Crimes on Crimes/Population using city-by-city data, and in fact there is no causal connection. Will they be negatively correlated anyway, since Crimes is in both variables?

No, so long as all relevant control variables are in the regression. Here is a way to see it. Suppose we regress 1/Crimes on Crimes/Population. Suppose, too, that Crimes and Crimes/Population are independent, so that bigger cities do not have a higher crime rate. Then 1/Crimes and Crimes/Population will be uncorrelated.

If, of course, bigger cities do have higher crime rates, then 1/Crimes and Crimes/Population will be correlated, but if we suspect that to be true, then in our original regression we should have regressed Arrests/Crimes not only on Crimes/Population but also on the control variable Crimes.

There is some issue of measurement error: false correlation can arise if Crimes is measured with error. Then we are regressing Arrests/(Crimes+Error) on (Crimes+Error)/Population. I think if we use (Crimes+Error) as a control variable that will fix the problem, though.
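The measurement-error worry is easy to illustrate by simulation. Everything in this sketch is invented for illustration: the uniform ranges for population, crime rate, and arrest rate, the assumption that all three are independent across cities (so there is no true connection), and an error that is proportional to true crimes. It compares the correlation of the two ratios with and without the error. It does not test whether adding the control variable cures the problem.

```python
import math
import random


def pearson(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_cities=20000, error_spread=0.3, seed=1):
    """Correlation of Arrests/Crimes with Crimes/Population, clean and mismeasured."""
    rng = random.Random(seed)
    clean_x, clean_y, noisy_x, noisy_y = [], [], [], []
    for _ in range(n_cities):
        population = rng.uniform(1e5, 1e6)
        crimes = rng.uniform(0.02, 0.06) * population  # crime rate unrelated to size
        arrests = rng.uniform(0.2, 0.4) * crimes       # arrest rate unrelated to both
        # Measured crimes = true crimes + error, with error proportional to crimes.
        measured = crimes * (1 + rng.uniform(-error_spread, error_spread))
        clean_x.append(crimes / population)
        clean_y.append(arrests / crimes)
        noisy_x.append(measured / population)
        noisy_y.append(arrests / measured)
    return pearson(clean_x, clean_y), pearson(noisy_x, noisy_y)


clean, noisy = simulate()
print(f"correlation with crimes measured exactly:    {clean:+.3f}")
print(f"correlation with crimes measured with error: {noisy:+.3f}")
```

With exact data the shared Crimes term produces no correlation at all. With mismeasured Crimes the same error sits in the denominator of one ratio and the numerator of the other, and a clearly negative correlation appears.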

Categories: Tags:

Too little attention has been given to the news last August that NASA had made a year-2000 mistake in calculating US temperatures, a mistake that meant the temperatures after 2000 were all too high. Details are at Coyote Blog. The mistake was in the adjustment NASA makes for the fact that if a weather station’s location becomes urban, the temperature rises because cities are always hotter.

More important than the mistake itself are these points:

(1) NASA very quietly fixed its data without any indication to users that it had been wrong earlier.

(2) NASA’s adjustment is by a secret method it refuses to disclose to outsiders.

(3) NASA’s adjustment appears (hard to say since it’s kept secret) to both adjust “bad” stations (the ones in cities) down and “good” stations (the ones that read accurately) up, on the excuse of some kind of smoothing of off-trend stations.

(4) The NASA people doing the adjustment are not statisticians.

(5) It isn’t clear what, if any, adjustment is made to weather station data from elsewhere in the world. The US has some of the best data, and there seems to be no warming trend in the US.

Categories: Tags:

## Elasticities in Regressions

(Update of an old post.) Here is how to calculate elasticities from regression coefficients, a note possibly useful to economists who, like me, keep having to rederive this basic method:

1. The elasticity is (% change in Y)/(% change in X) = (dy/dx)*(x/y).
2. If y = beta*x then the elasticity is beta*(x/y).
3. If y = beta*log(x) then the elasticity is (beta/x)*(x/y) = beta/y.
4. If log(y) = beta*log(x) then the elasticity is (beta*y/x)*(x/y) = beta, which is a constant elasticity. (Reason: then y = exp(beta*log(x)), so dy/dx = beta*exp(beta*log(x))*(1/x) = beta*y/x.)
5. If log(y) = beta*x then the elasticity is (beta*y)*(x/y) = beta*x. (Reason: then y = exp(beta*x), so dy/dx = beta*exp(beta*x) = beta*y.)

6. If log(y) = alpha + beta*D, where D is a dummy variable, then we are interested in the finite jump from D=0 to D=1, not an infinitesimal elasticity. That percentage jump is

dy/y = exp(beta)-1,

because log(y,D=0) = alpha and log(y, D=1) = alpha + beta, so

(y,D=1)/(y, D=0) = exp(alpha+beta)/exp(alpha) = exp(beta)

and

dy/y = (y,D=1)/(y, D=0) -1 = exp(beta)-1

This is consistent, but not unbiased. We know that OLS is BLUE, and so unbiased, as an estimator of the impact of the dummy D on log(y), but that does not imply that it is unbiased as an estimator of the impact of D on y. That is because E(f(z)) does not equal f(E(z)) in general, and the ultimate effect of D on y, exp(beta)-1, is a nonlinear function of beta. Alexander Borisov pointed out to me that Peter Kennedy (AER, 1981) suggests using exp(betahat-vhat(betahat)/2)-1 as an estimate of the effect of going from D=0 to D=1; it is still biased, but less biased, and also consistent.
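To save rederiving these yet again, here is a small sketch that checks the formulas numerically. The function names and the example values (beta = 0.7, x = 3, a coefficient of 0.10 with estimated variance 0.01) are made up for illustration. The elasticity is computed by a central finite difference, so it is approximate.

```python
import math


def elasticity(f, x, h=1e-6):
    """Numerical point elasticity (dy/dx)*(x/y) of y = f(x), by central difference."""
    dydx = (f(x + h) - f(x - h)) / (2 * h)
    return dydx * x / f(x)


def dummy_effect(beta_hat):
    """Proportional jump in y from D=0 to D=1 when log(y) = alpha + beta*D."""
    return math.exp(beta_hat) - 1


def dummy_effect_kennedy(beta_hat, var_beta_hat):
    """Kennedy's less-biased version: exp(betahat - vhat(betahat)/2) - 1."""
    return math.exp(beta_hat - var_beta_hat / 2) - 1


beta, x = 0.7, 3.0
print("y = beta*log(x):      ", elasticity(lambda v: beta * math.log(v), x),
      "vs beta/y =", beta / (beta * math.log(x)))
print("log(y) = beta*log(x): ", elasticity(lambda v: math.exp(beta * math.log(v)), x),
      "vs beta =", beta)
print("log(y) = beta*x:      ", elasticity(lambda v: math.exp(beta * v), x),
      "vs beta*x =", beta * x)
print("dummy, naive:  ", dummy_effect(0.10))
print("dummy, Kennedy:", dummy_effect_kennedy(0.10, 0.01))
```

The Kennedy estimate is always a little below the naive one, and the gap shrinks as the variance of the coefficient estimate goes to zero, which is why both are consistent.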


## Partial Identification and Chi-Squared Tests


## An Umbrella with a Drip Case

I brought this umbrella back from Taipei. It has a case to prevent dripping from the wet umbrella onto the floor when it is folded up. The case opens automatically when you open the umbrella, telescoping down into a little cap on top of the umbrella.

Categories: statistics Tags:

## A Coin Flip Example for Intelligent Design

1. Suppose we come across a hundred bags of 20-chip draws from
a hundred different urns. Each bag contains 20 red chips. We naturally


## Case Control Studies and Repeated Sampling

A standard counterintuitive result in statistics is that if the
true model is logit, then it is okay to use a sample selected on the
Y’s, which is what the “case-control method” amounts to. You may select
1000 observations with Y=1 and 1000 observations with Y=0 and do
estimation of the effects of every variable but the constant in the
usual way, without any sort of weighting. This was shown in Prentice &
Pyke (1979). They also purport to show that the standard errors may be
computed in the usual way— that is, using the curvature (2nd


## Is Not Necessarily Equal To

At lunch at Nuffield I was just asking MM about some math notation I’d like: a symbol for “is not necessarily equal to”. For example, an economics paper might show the following:

Proposition: Stocks with equal risks might or might not have the same returns. In the model’s notation, x IS NOT NECESSARILY EQUAL TO y.


## Bayesian vs. Frequentist Statistical Theory: George and Susan

October 2nd, 2007

Susan either likes George or dislikes him. His prior belief is that there is a 50% chance that she likes him. He also believes that if she does, there is an 80% chance she will smile at him, and if she does not, there is a 60% chance. She smiles at him. What should he think of that?

The Frequentist approach says that George should choose the answer which has the greatest likelihood given the data, and so he should believe that she likes him.
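For comparison, the Bayesian update from George's numbers can be worked out directly with Bayes' rule. The function name is mine, and the inputs are just the 50% prior and the 80% and 60% smile probabilities from the story.

```python
def posterior_likes(prior_likes, p_smile_if_likes, p_smile_if_not):
    """P(she likes him | she smiled), by Bayes' rule."""
    joint_likes = prior_likes * p_smile_if_likes
    joint_not = (1 - prior_likes) * p_smile_if_not
    return joint_likes / (joint_likes + joint_not)


# Likelihood comparison: the smile is more probable if she likes him.
print("likelihood if she likes him:", 0.8, " if she does not:", 0.6)

# Bayesian posterior: the smile moves George's belief only modestly.
print("posterior that she likes him:", round(posterior_likes(0.5, 0.8, 0.6), 4))
```

The smile is more likely under "likes him", but because she smiles fairly often either way, the posterior probability rises only from 50% to 4/7, about 57%.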


## Weighted Least Squares and Why More Data is Better

In doing statistics, when should we weight different observations differently?

Suppose I have 10 independent observations of $x$ and I want to estimate the population mean, $\mu$. Why should I use the unweighted sample mean rather than weighting the first observation .91 and each of the rest by .01?

Either way, I get an unbiased estimate, but the unweighted mean gives me lower variance of the estimator. If I use just observation 1 (a weight of 100% on it) then my estimator has the variance of the disturbance. If I use two observations, then a big positive disturbance on observation 1 might be cancelled out by a big negative on observation 2. Indeed, the worst case is that observation 2 also has a big positive disturbance, in which case I am no worse off by having it. I do not want to overweight any one observation, because I want mistakes to cancel out as evenly as possible.

All this is completely free of the distribution of the disturbance term. It doesn't rely on the Central Limit Theorem, which says that as $n$ increases then the distribution of the estimator approaches the normal distribution (if I don't use too much weighting, at least!).

If I knew that observation 1 had a smaller disturbance on average, then I *would* want to weight it more heavily. That's heteroskedasticity.
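The arithmetic behind this is short enough to sketch. For independent observations, a weighted mean with weights summing to one has variance equal to the sum of (weight squared times disturbance variance). The function names and the heteroskedastic example (observation 1 with a quarter of the variance of the others) are my own illustrations.

```python
def estimator_variance(weights, disturbance_variances):
    """Variance of sum(w_i * x_i) for independent observations with weights summing to 1."""
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must sum to 1 for an unbiased estimate of the mean")
    return sum(w * w * v for w, v in zip(weights, disturbance_variances))


def inverse_variance_weights(disturbance_variances):
    """Weights proportional to 1/variance, the minimum-variance choice."""
    precisions = [1 / v for v in disturbance_variances]
    total = sum(precisions)
    return [p / total for p in precisions]


equal = [0.1] * 10
lopsided = [0.91] + [0.01] * 9

# Homoskedastic case: every disturbance has variance 1.
same = [1.0] * 10
print("equal weights:   ", estimator_variance(equal, same))
print("lopsided weights:", estimator_variance(lopsided, same))

# Heteroskedastic case: observation 1 is four times as precise as the rest.
hetero = [0.25] + [1.0] * 9
best = inverse_variance_weights(hetero)
print("equal weights, hetero:           ", estimator_variance(equal, hetero))
print("inverse-variance weights, hetero:", estimator_variance(best, hetero))
```

With equal disturbance variances the lopsided weighting is barely better than using observation 1 alone, while equal weights cut the variance to a tenth. Once observation 1 is known to be more precise, weighting it more heavily beats equal weights.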

Categories: statistics Tags:

## Asymptotics

Page 96 of David Cox’s 2006 Principles of Statistical
Inference has a very nice one-sentence summary of asymptotic theory:

[A]pproximations are derived on the basis that the amount of
information is large, errors of estimation are small, nonlinear
relations are locally linear and a central limit effect operates to
induce approximate normality of log likelihood derivatives.

Categories: statistics Tags:

## Bayesian vs. Frequentist Statistical Theory

The Frequentist view of probability is that a coin with a 50% probability of heads will turn up heads 50% of the time.

The Bayesian view of probability is that a coin with a 50% probability of heads is one on which a knowledgeable risk-neutral observer would put a bet at even odds.

The Bayesian view is better.

When it comes to statistics, the essence of the Frequentist view is to ask whether the number of heads that shows up in one or more trials is probable given the null hypothesis that the true odds in any one toss are 50%.

When it comes to statistics, the essence of the Bayesian view is to estimate, given the number of heads that shows up in one or more trials and the observer's prior belief about the odds, the probability that the odds are 50% versus the odds being some alternative number.

I like the frequentist view better. It's neater not to have a prior involved.
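The two approaches can be put side by side on a concrete coin. The numbers here are hypothetical: 8 heads in 10 tosses, a single alternative hypothesis that the coin lands heads 80% of the time, and a 50% prior on the coin being fair. The two-sided p-value is computed as the total probability of outcomes no more likely than the observed one, which is one common convention among several.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_sided_p_value(heads, tosses, p_null=0.5):
    """Frequentist: probability, under the null, of outcomes no more likely than this one."""
    observed = binom_pmf(heads, tosses, p_null)
    return sum(
        binom_pmf(k, tosses, p_null)
        for k in range(tosses + 1)
        if binom_pmf(k, tosses, p_null) <= observed * (1 + 1e-9)
    )


def posterior_fair(heads, tosses, prior_fair, p_alt, p_null=0.5):
    """Bayesian: probability the coin is fair versus one alternative value p_alt."""
    like_fair = binom_pmf(heads, tosses, p_null)
    like_alt = binom_pmf(heads, tosses, p_alt)
    return prior_fair * like_fair / (prior_fair * like_fair + (1 - prior_fair) * like_alt)


print("p-value for 8 heads in 10 tosses:", two_sided_p_value(8, 10))
print("posterior that the coin is fair: ", posterior_fair(8, 10, prior_fair=0.5, p_alt=0.8))
```

The frequentist number needs only the null hypothesis. The Bayesian number cannot be computed until both a prior and a specific alternative are supplied, which is the untidiness complained about above.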

Categories: statistics Tags: