I am enjoying your book ‘Mostly Harmless Econometrics’. Thanks for writing it.

I have a question pertaining to the FUQ you define in Chapter 1. I think I understand the problem. But maybe I am missing something and you can help.

So we randomly select a group of kids to start later and we compare average test scores in grade 1 for both groups (the one that started at age 6 and the other at age 7). The problem, as you state it, is that maturation effects are confounded with start-age effects (M + S).

Now what if you supplemented this study by using a new group of children, none of whom go to school? Administer the same exam to one half of them at age 6 and to the other half at age 7. The difference in average performance for this group will give the maturation effect (M).

Subtract M from M+S to get S.

Does this make sense? I would appreciate your help in sorting this out.

Makes sense indeed, Deepti.

*As long as kids are in school there is a linear dependence between current age, years of schooling, and starting age. So for kids in school there can only be two independent effects; hence no way to answer three causal questions.*


*As you suggest, however, adding kids not in school solves this problem (at least hypothetically). Kids not in school isolate a pure age effect that you could then subtract from the combined starting-age and age effects for the kids in school. The FUQ’d nature of the question arises in studies that use data on kids still enrolled.*
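To see the arithmetic in action, here is a minimal simulation sketch (not from the book; the effect sizes M and S, and the linear model for scores, are made up purely for illustration) of the two-group design described above:

```python
# Minimal sketch of the start-age thought experiment (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
M, S, n = 2.0, 1.0, 100_000  # hypothetical maturation and start-age effects

def score(age, start_age):
    # test score = maturation * age + start-age effect * start_age + noise;
    # for kids not in school we set start_age = 0 (no start-age effect).
    return M * age + S * start_age + rng.normal(0, 1, n)

# In-school kids, both groups tested at the end of grade 1, so years of
# schooling are equal and drop out of the comparison:
# start-at-6 kids are tested at age 7, start-at-7 kids at age 8.
m_plus_s = score(8, 7).mean() - score(7, 6).mean()  # approx. M + S

# Out-of-school kids: test half at age 6 and the other half at age 7.
m_only = score(7, 0).mean() - score(6, 0).mean()    # approx. M

print(f"M+S = {m_plus_s:.2f}, M = {m_only:.2f}, implied S = {m_plus_s - m_only:.2f}")
```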

I think that there may be a small error on page 185 of Mostly Harmless Econometrics. It lies with the y-axes for the panels of Figure 4.5.1. Those panels are from pages 32-33 of Acemoglu and Angrist 2001 (http://www.nber.org/chapters/c11054.pdf), but in the Acemoglu-Angrist article, the y-axes range from -.03 to .07. In MHE — at least, in my printing — they range from -.2 to .6 and from -.1 to .4. I think that the axes are correct in the article, incorrect in MHE. But please don’t hesitate to let me know if I’m wrong about that.

*We’re not hesitating, John: you’re right! The axes in MHE Fig 4.5.1 should be divided by 10. Yikes!*

I have a question about semantics. You use the word "random" to describe both the treatment assignment variables D in RCTs, and also to describe the outcome variables Y in regression analysis. "Random" means something different in the first application than in the second. Indeed, you explain that Y varies with "systematic randomness" (MHE p. 29), but I am still a little confused about how to square away the two uses of the word. Any help?

*Good question, Lauren. There are many sorts of randomness in the ’Metrics world: random variables, random samples, randomized trials. We try to make the distinction clear in our new book, Mastering ’Metrics.*


In chapter 3, you emphasize early on the important property of prediction, for example in Theorems 3.1.2 and 3.1.5. In my econometrics training years ago, early initiation into regression focused more on OLS as BLUE than as BLP (best linear predictor). I was curious why, then, in your pedagogy you chose to make prediction, and not unbiasedness, so central a concept for introducing people to causal inference. I hesitate to say this, because it’s probably wrong, but I don’t even remember Greene’s textbook going into BLP at all. Why are these BLP properties so pedagogically valuable to you, as opposed to just focusing on BLUE like the traditional econometrics pedagogy seems to do?

Thanks! I’m a huge fan.

*Great question, Scott, all the more so in view of the release of our undergrad-focused Mastering ’Metrics this winter. Your undergrad econometrics training (like most people’s) focused on the sampling distribution of OLS. Hence you were tortured with the Gauss-Markov Theorem, which says that OLS is a Best Linear Unbiased Estimator (BLUE). MHE and MM are largely unconcerned with such things. Rather, we try to give our students a clear understanding of what regression means. To that end, we introduce regression as the best linear approximation to whatever conditional expectation function (CEF) motivates your empirical work. This is the BLP property you mention, which is a feature of regression unrelated to samples. (MM also emphasizes our interpretation of regression as a form of “automated matching.”)*


*In particular, our MHE/MM understanding of regression is divorced from sampling properties like BLUE, which, the attention your old-school training gave them notwithstanding, are (a) boring, (b) of little practical importance for the quality of your empirical work, and (c) untrue in most applications. BLUEness of OLS estimates (the solution to the least squares problem that Stata solves when you ask it to regress) holds only when the underlying CEF is linear, with constant residual variance to boot. Since there’s usually no reason to believe such things obtain in the empirical world, and no need to assume they do either, sampling properties like unbiasedness and efficiency (“best”) needn’t trouble us. When it comes to sampling properties, we care only about getting the standard errors right: also a boring problem, but necessary for statistical inference, and not driven by the sophomoric literalism of old-school ’metrics pedagogy.*

— Master Joshway
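To make the BLP point concrete, here is a small simulation sketch (a generic illustration, not the book’s code; the exponential CEF and all numbers are arbitrary): even when the CEF is nonlinear, the least-squares slope converges to the slope of the best linear approximation to that CEF.

```python
# Regression as BLP: OLS fits the CEF's best linear approximation,
# whether or not the CEF is linear. Illustrative values only.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.uniform(0, 2, n)
y = np.exp(x) + rng.normal(0, 1, n)         # nonlinear CEF: E[y|x] = exp(x)

ols_slope = np.polyfit(x, y, 1)[0]          # least-squares slope of y on x
blp_slope = np.polyfit(x, np.exp(x), 1)[0]  # slope of the BLP of the CEF itself

print(f"OLS slope = {ols_slope:.3f}, BLP-of-CEF slope = {blp_slope:.3f}")
```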

I have a question about including an LDV in a model with FE. Will the problem of inconsistent estimates also arise if my DV is measured in year t+3 while my LDV is measured, for economic reasons, in year t-3? My guess is that it would only be problematic if the LDV was measured in t+2. Could you please help clarify? Thank you so much for your time ... and for your wonderful book! -From a big fan of your book

*What’s LDV here, Celia? We used this as shorthand for Limited Dependent Variables, but you seem to be referencing Lagged Dependent Variables. Assuming that’s what you mean ... good question! The whole FE/LDV thing is tricky. Under some assumptions (e.g., serially uncorrelated residuals), a model in which treatment is determined by long-ago lagged outcomes can indeed be differenced to kill fixed effects, with no harm done and no further ado (as they say in Stataland). But if selection is on an LDV, who cares about fixed effects, anyway?! Remember, it’s what determines treatment that counts; this is the TAO of OVB. -- Master Joshway*
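For readers who want to see why the FE/LDV combination is tricky, here is a rough simulation sketch (not from the book; all parameter values are made up) of the classic problem with a contemporaneous lagged dependent variable in a short fixed-effects panel (Nickell bias):

```python
# Nickell bias: the within (FE) estimator of rho in
# y_it = rho * y_{i,t-1} + alpha_i + e_it is biased when T is small.
import numpy as np

rng = np.random.default_rng(2)
rho, N, T = 0.5, 5_000, 5                # true AR coefficient, units, periods
alpha = rng.normal(0, 1, N)              # unit fixed effects
y = np.zeros((N, T + 1))
for t in range(1, T + 1):
    y[:, t] = rho * y[:, t - 1] + alpha + rng.normal(0, 1, N)

# Within estimator: demean the lag and the outcome within each unit.
ylag = y[:, :-1] - y[:, :-1].mean(axis=1, keepdims=True)
ycur = y[:, 1:] - y[:, 1:].mean(axis=1, keepdims=True)
rho_fe = (ylag * ycur).sum() / (ylag ** 2).sum()

print(f"true rho = {rho}, FE estimate = {rho_fe:.3f} (biased down for small T)")
```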

In the last paragraph of p. 55, the expectation of $f_{i}(s-4)$ is taken and the expectation of $f_{i}(s-1)$ is not. The text reads:

Conditional on $X_{i}$, the average causal effect of a one-year increase in schooling is $E[f_{i}(s)-f_{i}(s-1)|X_{i}]$, while the average causal effect of a four-year increase in schooling is $E[f_{i}(s)-E[f_{i}(s-4)]|X_{i}]$.

In the second equation there is an expectation inside the expectation.

*Indeed, Robson, that inner E is a typo!*
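So, for the record, the corrected expression on p. 55, with the stray inner expectation removed, should read:

$$E[f_{i}(s) - f_{i}(s-4) \mid X_{i}]$$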

A brief question about statistical significance: taking a “population first” approach to econometrics, you note on page 36 that “the regression coefficients defined in this section are not estimators; rather, they are nonstochastic features of the joint distribution of dependent and independent variables.” You later imply on page 40 that the issue of statistical inference arises when we draw samples. My question is how to interpret standard errors in those (admittedly rare) instances when we have data on the entire population. Does this circumstance render the notion of statistical significance moot?

*Good question, Colin. No single answer, I’d say.*


*Some would say all data come from “super-populations,” that is, the data we happen to have could have come from other times, other places, or other people, even if we seem to have everyone from a particular scenario. Others take a model-based approach: some kind of stochastic process generates the data at hand, and there is always more where they came from. Finally, an approach known as randomization inference recognizes that even in finite populations, counterfactuals remain hidden, and therefore we always require inference. You could spend your life pondering such things. I have to admit I try not to.*

-JA
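Here is a toy sketch of the randomization-inference idea mentioned above (a generic illustration with simulated data): treating the data in hand as the whole finite population, uncertainty comes from the random assignment itself, so we re-randomize the treatment labels to build a p-value.

```python
# Randomization inference: permute the treatment assignment to get the
# null distribution of the difference in means. Illustrative values only.
import numpy as np

rng = np.random.default_rng(3)
d = np.repeat([0, 1], 50)               # hypothetical treatment assignment
y = 0.3 * d + rng.normal(0, 1, 100)     # hypothetical outcomes, whole population
observed = y[d == 1].mean() - y[d == 0].mean()

draws = []
for _ in range(10_000):
    p = rng.permutation(d)              # re-randomize the assignment
    draws.append(y[p == 1].mean() - y[p == 0].mean())

p_value = (np.abs(draws) >= abs(observed)).mean()
print(f"observed difference = {observed:.3f}, randomization p = {p_value:.3f}")
```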

Say the long regression of interest is

$$y_i = \alpha + \rho s_i + \gamma_1 MO_i + \gamma_2 IQ_i + v_i. \qquad (1)$$

Here, MO stands for motivation and IQ stands for intelligence. In your notation, then, $A_i = (MO_i, IQ_i)'$ and $\gamma = (\gamma_1, \gamma_2)'$.

In practice, motivation and intelligence are not observed and one estimates the short regression

$$y_i = \alpha + \rho s_i + \eta_i, \quad \text{with} \quad \eta_i = A_i'\gamma + v_i. \qquad (2)$$

Since $s_i$ is correlated with the error term $\eta_i$ (unless $s_i$ is uncorrelated with $A_i$ or $\gamma = 0$), the short regression has OVB. So far so good.

But then you claim that if one estimates the short regression with IV, using a suitable instrument $x_i$ that is uncorrelated with both of the omitted control variables $MO_i$ and $IQ_i$ (and thus uncorrelated with the error term $\eta_i$), on the one hand, and correlated with the regressor $s_i$, on the other, one can estimate $\rho$ consistently. Here, I have some doubt.

What if, instead of (1), the long regression of interest were ‘only’

$$y_i = \alpha^* + \rho^* s_i + \gamma IQ_i + u_i. \qquad (3)$$

So here $A_i = IQ_i$. Since the instrument $x_i$ is uncorrelated with the (single) omitted control variable $IQ_i$, estimating the short regression (2) with IV using the same instrument $x_i$ also results in a consistent estimator of $\rho^*$, according to your logic. But this would seem a contradiction, since $\rho^*$ differs from $\rho$.

*We start with 4.1.1, which defines a constant linear causal effect. So LATE = rho in this setup.*

*Some other OLS regression, which controls for only, say, a subset of the variables in A (assuming A is multivariate), does not produce the same rho, as Michael rightly notes. But IV is indifferent to the various OLS regressions you’re thinking about running. We have anchored the IV parameter by making regression 4.1.2 causal and arguing that it is what IV recovers.*

*Another way to put this: given our assumptions, LATE is rho in 4.1.1. What OLS regression produces this same parameter? Only the one including the controls required for selection on observables. Since Michael’s equation (3) is inadequately controlled, it won’t generate the same rho. How to see this in the math? It’s subtle. Take the residual, eta, in 4.1.1, and regress it on the IQ variable that appears in equation (3). The residual from this is orthogonal to IQ, of course. But since our A is Michael’s [MO, IQ], it’s not orthogonal to S, because we must control for MO as well as IQ to get orthogonality with S. Therefore the schooling coefficient in (3) is not the schooling coefficient in (1) or in our causal model, 4.1.1 and 4.1.2.*
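A simulation makes the resolution concrete. The sketch below (a generic illustration with made-up coefficients, not code from the book) generates data from the long regression (1) with an instrument independent of both MO and IQ: IV recovers rho, while OLS of equation (3), which controls for IQ but not MO, converges to a different schooling coefficient.

```python
# IV vs. an inadequately controlled OLS regression. Illustrative values only.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
rho, g1, g2 = 1.0, 0.5, 0.5                   # made-up coefficients for (1)
mo = rng.normal(0, 1, n)                      # motivation (unobserved)
iq = rng.normal(0, 1, n)                      # intelligence
x = rng.normal(0, 1, n)                       # instrument, independent of MO and IQ
s = x + mo + iq + rng.normal(0, 1, n)         # schooling driven by x, MO, and IQ
y = rho * s + g1 * mo + g2 * iq + rng.normal(0, 1, n)

iv = np.cov(x, y)[0, 1] / np.cov(x, s)[0, 1]  # simple IV: ratio of covariances

# OLS of equation (3): y on s, controlling for IQ only (MO omitted).
X = np.column_stack([np.ones(n), s, iq])
rho_star = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"true rho = {rho}, IV = {iv:.3f}, OLS with IQ only = {rho_star:.3f}")
```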