## Case Control Studies and Repeated Sampling

A standard counterintuitive result in statistics is that if the

true model is logit, then it is okay to use a sample selected on the

Y’s, which is what the “case-control method” amounts to. You may select

1000 observations with Y=1 and 1000 observations with Y=0 and do

estimation of the effects of every variable but the constant in the

usual way, without any sort of weighting. This was shown in Prentice &

Pyke (1979). They also purport to show that the standard errors may be

computed in the usual way— that is, using the curvature (2nd

derivative) of the likelihood function. (Click here for more)

This, I was skeptical of. If the

constant is misestimated, how can you deduce the variance of the

disturbance term, and if you can’t deduce that, how can you deduce the

standard error of any of the coefficients? Nowhere have I seen a clear

demonstration or an intuition for the result, so I thought there might

be a crucial unnoticed mistake in the math somewhere, as is not unknown

in famous papers (e. g. Hotelling on location, Tullock on

overdissipation, Viner on average cost curves, and Rothschild-Stiglitz

on risk).

Since I did not follow all the steps of the Prentice-Pyke proof and so

did not know of any error in what they did, I tried doing a Monte Carlo

study which seemed to confirm my intuition.

Since then, however, I have seen where my Monte Carlo study went wrong,

and now I believe Prentice and Pyke. Some details are instructive.

1. An intuition– a bit shaky, I think, but better than nothing (let me

know if it’s false). Suppose that a coefficient is estimated

correctly by some estimator. We want to estimate the estimator’s

standard error, to know how variable the estimate would be if we

repeated the estimation with different disturbances. For this, we need

to know how noisy the data is. We do not need to know how noisy the data

in the whole population is, however, just how noisy in the kind of

sample we draw. If our procedure is to draw a biased sample, then we

need to know what will happen in other biased samples, not in the

population. It is okay to use the sample for this purpose. In using a

standard error, we are not generalizing anything to the population (not

estimating goodness of fit, for example), we are just generalizing to

repeated samples.

2. How to think about repeated sampling and how to do a Monte Carlow

study. What I did was to construct a population of 60,000 data points,

drawing X from a uniform distribution on [0,1] and a disturbance epsilon

from a logit density with an α “constant” coefficient of -4 and a

β X coefficient of 0. If α + epsilon < 0 then Y=0; if

α + epsilon >= 0 then Y = 1. That yields 1,039 points with

Y= 1, about 1.7% of them.

Our estimation procedure is to combine two random samples of 1,000

observations with Y=0 and 1,000 observations with Y=1 and do a logit

estimate of alpha and beta. We would expect the estimate of alpha to

be wrong– not close to 0.017– and the estimate of beta to be right–

close to 0.000– since we have a large enough sample that consistent

estimates ought to be close to the true parameters.

The maximum likelihood estimate would give us standard errors based on

the second derivative of the likelihood function or on bootstrapping.

In repeated sampling, we would expect the standard deviation of the

alpha estimates not to be close to the average of the estimates of its

standard error. The question to be investigated is whether the the

standard deviation of the beta estimates is close to the average of

the estimates of its standard error.

So far, so good. Where I made my mistake, I think, is in the

definition of “repeated sampling”. Ordinarily in frequentist thinking,

in repeated sampling we keep the X values the same in each sample, and

we draw new disturbances, which combine with the fixed X’s to give new

Y’s. That also amounts to conditioning on the X’s, though we wouldn’t

have had to condition the X’s, since our estimator should work fine even

if we changed the X’s in each sample too. (If we did change the X’s,

though, that change the information content in each sample— a sample

in which X only varied between .3 and .4 would have less information and

yield worse estimates than one with X varying widely between .02 and

.94. So in small samples, especially, we’d have to make some allowance

for that.)

Here, though, we can’t keep the X’s fixed. If we did, then although our

first sample would have 1,000 observations with Y=1, our succeeding

samples would have about 34. We wouldn’t be using the case-control

method.

So what we have to do is to think about repeated samples with 1,000 Y=0

observations and 1,000 Y=1’s. Turning our usual thinking upside down,

we need to keep the Y’s fixed, draw new disturbances, and let the X’s

vary. This is especially hard to think about here, because knowing Y

and epsilon does not tell us X– remember, Y is coarse and contains less

information than alpha + beta*X + epsilon,and beta is zero here too,

making things even worse.

The best way to proceed is to think about repeating the entire

scientific procedure, including the sampling as well as the estimation.

The way I did this was to take 100 n=2000 samples from the 60,000-point

population, each time combining equal-sized subsamples with Y=0 and

with Y=1.

Recall, however, that there are only 1,037 Y=1 values in the entire

population. Thus, my repeated sampling had to be with replacement, and

was using the same Y=1 observations over and over. It is OK to use the

same X values repeatedly, but these observations also had the same

epsilon values each time, so the samples are not independent in the way

needed for the law of large numbers to work. The standard errors

computed by maximum likelihood came out wrong— not equal to the

standard deviation of the estimates, but that is to be expected when the

draws are not independent.

Realizing this, I also tried doing the procedure with 100 n=200

samples instead of 100 n=2000 samples. I still used sampling with

replacement, but now there was less overlap between replacements, less

dependence between samples. And now the estimated standard errors were

close to the standard deviations.

This, I expect is what would happen if I did the kind of repeated

sampling that is our thought experiment for the kind of real studies

that use the case-control method. That thought experiment is to take

repeated draws of 60,000-point populations, with the same X’s each time

but with different epsilons and hence Y’s. Each of the 100 Monte Carlo

samples would be from a different population draw.

## Recent Comments