Average Treatment Effect
Li Gan
Nov, 2007
1. The Regression Method:
We are interested in the average change in an outcome y. Let y1 denote the outcome with treatment and y0 the outcome without treatment. The Average Treatment Effect is defined as:
ATE = E(y 1 – y 0) (1)
The difficulty in estimating (1) is that we observe either y1 or y0, not both, for each person. More precisely, let w = 1 if the person is treated and w = 0 otherwise. The observed outcome y can be written as:
y = (1-w ) y 0 + w y 1. (2)
If w is independent of (y0, y1), then:
E(y 1-y 0) = E(y 1-y 0 |w ) = E(y 1|w =1) – E(y 0|w =0)
In fact, independence is stronger than needed; the weaker assumption of mean independence suffices: E(y 0|w ) = E(y 0), E(y 1|w ) = E(y 1).
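As a minimal numerical sketch of this argument, the simulation below checks that the difference in group means recovers the ATE when w is assigned independently of the potential outcomes. The parameter values, sample size, and variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical potential outcomes: y0 = mu0 + v0, y1 = mu1 + v1
mu0, mu1 = 1.0, 3.0
y0 = mu0 + rng.normal(size=n)
y1 = mu1 + rng.normal(size=n)

# Treatment assigned independently of (y0, y1)
w = rng.integers(0, 2, size=n)

# Only one potential outcome is observed per person, as in (2)
y = (1 - w) * y0 + w * y1

# Infeasible benchmark (needs both potential outcomes)
ate_true = np.mean(y1 - y0)

# Feasible estimate: E(y | w = 1) - E(y | w = 0)
ate_hat = y[w == 1].mean() - y[w == 0].mean()

print(ate_true, ate_hat)
```

Both numbers should be close to mu1 - mu0; if w were instead chosen as a function of v0 or v1, the feasible estimate would generally drift away from the benchmark.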
Now let:
\[ y_0 = \mu_0 + v_0,\quad E(v_0) = 0; \qquad y_1 = \mu_1 + v_1,\quad E(v_1) = 0. \]
Therefore, (2) can be written as:
\[ y = (1-w)\,y_0 + w\,y_1 = \mu_0 + (\mu_1 - \mu_0)\,w + v_0 + w\,(v_1 - v_0) \tag{3} \]
First, assume conditional mean independence:
Assumption 1 (ATE 1): (a) E(y 0|w,x ) = E(y 0|x ), and (b) E(y 1|w,x ) = E(y 1|x )
Intuition: even though y 1 and y 0 may be correlated with w , they are uncorrelated with w if we partial out x .
Taking expectation of (3) (and with ATE 1):
E(y|w,x ) = μ0 + αw + g 0(x ) + w (g 1(x )- g 0(x )), (4)
where α=μ1- μ0 is the Average Treatment Effect (ATE), and g i (x )=E (v i |x ).
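The step to (4) can be written out explicitly. The following is a sketch that uses only the decomposition of y into μ0, μ1, v0, v1 and Assumption ATE 1, which implies E(v_j | w, x) = E(v_j | x) for j = 0, 1:

```latex
\begin{aligned}
E(y \mid w, x)
  &= \mu_0 + (\mu_1 - \mu_0)\,w + E(v_0 \mid w, x)
     + w\,\bigl[E(v_1 \mid w, x) - E(v_0 \mid w, x)\bigr] \\
  &= \mu_0 + \alpha\,w + E(v_0 \mid x)
     + w\,\bigl[E(v_1 \mid x) - E(v_0 \mid x)\bigr] \\
  &= \mu_0 + \alpha\,w + g_0(x) + w\,\bigl[g_1(x) - g_0(x)\bigr].
\end{aligned}
```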
Linearization of g i (x ):
E(y|w,x ) = μ0 + αw + x β0 + w (x-ψ)δ,
where ψ = E(x). Demeaning x in the interaction term ensures that g 1(x ) - g 0(x ) has mean zero, so that α remains the ATE. So the regression to estimate the ATE α is:
$y_i$ on $1,\ w_i,\ x_i,\ w_i (x_i - \bar{x})$, where $\bar{x}$ is the sample average of $x_i$, used in place of ψ.
Here the control functions involve not just xi, but also interactions of the covariates with the treatment variable.
We can estimate treatment effect conditional on x:
\[ \widehat{ATE}(x) = \hat{\alpha} + (x - \bar{x})\,\hat{\delta} \]
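The regression with demeaned interactions can be sketched on simulated data as follows. The data-generating process (linear potential outcomes, logistic assignment depending on x only) and all names are assumptions chosen for illustration; by construction the true ATE is 3 and the true interaction slope is 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Covariate with mean 2; treatment depends on x only, so ATE 1 holds given x
x = rng.normal(loc=2.0, scale=1.0, size=n)
p = 1.0 / (1.0 + np.exp(-(x - 2.0)))
w = (rng.uniform(size=n) < p).astype(float)

# Potential outcomes linear in x with different slopes:
# y1 - y0 = 1 + x + noise, so ATE = 1 + E(x) = 3
y0 = 1.0 + 0.5 * x + rng.normal(size=n)
y1 = 2.0 + 1.5 * x + rng.normal(size=n)
y = (1 - w) * y0 + w * y1

# Regress y on 1, w, x, w * (x - x_bar)
x_bar = x.mean()
X = np.column_stack([np.ones(n), w, x, w * (x - x_bar)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, delta_hat = coef[1], coef[3]

def ate_at(x_value):
    """Estimated treatment effect conditional on x = x_value."""
    return alpha_hat + (x_value - x_bar) * delta_hat

print(alpha_hat, delta_hat, ate_at(3.0))
```

Without demeaning, the coefficient on w would estimate the effect at x = 0 rather than the average effect, which is why the interaction uses x minus its sample mean.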
2. Propensity Score:
Let p(x ) = Pr(w =1|x ).
(w – p(x )) y = (w – p(x ))(wy 1 + (1-w ) y 0)
= wy 1 – p(x ) (1-w ) y 0 – p(x )wy 1, using w² = w and w (1-w ) = 0 for binary w.
Take conditional expectation with respect to y:
E y [(w – p(x ))y|w,x ]= wm 1(x )– p(x ) (1-w ) m 0(x )– p(x )wm 1(x ),
where E(y j |w,x )= E(y j |x )=m j (x ). Taking expectation with respect to w:
E w {E y [(w – p(x ))y|w,x ]|x }
= E w [wm 1(x )– p(x ) (1-w ) m 0(x )– p(x )wm 1(x )]
= p(x )m 1(x )– p(x ) (1- p(x )) m 0(x )– p(x ) p(x )m 1(x )
= m 1(x )p(x )(1-p(x ))- m 0(x )p(x )(1- p(x ))
=(m 1(x )-m 0(x ))p(x )(1-p(x ))
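Therefore,
\[ ATE(x) = m_1(x) - m_0(x) = \frac{E\bigl[(w - p(x))\,y \mid x\bigr]}{p(x)\,(1 - p(x))}, \]
and averaging over the distribution of x gives ATE = E{(w – p(x ))y / [p(x )(1-p(x ))]}.
A simple and popular estimator in program evaluation is obtained from OLS regression: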
$y_i$ on $1,\ w_i,\ \hat{p}(x_i)$,
where the coefficient on $w_i$ is the estimate of the treatment effect. In other words, the estimated propensity score plays the role of the control function.
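Both uses of the propensity score can be sketched in a few lines. This is an illustration under stated assumptions: a logit model for p(x) fitted by a hand-written Newton-Raphson routine (the helper `fit_logit` is hypothetical, not a library function), simulated data, and a treatment effect that is constant at 2 so that the OLS coefficient on w is directly comparable to the ATE.

```python
import numpy as np

def fit_logit(X, w, n_iter=25):
    """Newton-Raphson for a logit model; returns the coefficient vector."""
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ gamma)))
        grad = X.T @ (w - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        gamma = gamma + np.linalg.solve(hess, grad)
    return gamma

rng = np.random.default_rng(2)
n = 200_000

x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-0.5 * x))
w = (rng.uniform(size=n) < p_true).astype(float)

# Constant treatment effect of 2 (an assumption of this sketch)
y0 = 1.0 + x + rng.normal(size=n)
y1 = y0 + 2.0
y = (1 - w) * y0 + w * y1

# Estimated propensity score p_hat(x_i)
X = np.column_stack([np.ones(n), x])
p_hat = 1.0 / (1.0 + np.exp(-(X @ fit_logit(X, w))))

# Sample analogue of E[(w - p(x)) y / (p(x)(1 - p(x)))]
ate_weight = np.mean((w - p_hat) * y / (p_hat * (1.0 - p_hat)))

# OLS of y on 1, w, p_hat: the coefficient on w is the estimate
R = np.column_stack([np.ones(n), w, p_hat])
coef, *_ = np.linalg.lstsq(R, y, rcond=None)
ate_ols = coef[1]

print(ate_weight, ate_ols)
```

The weighting estimator becomes unstable when estimated scores are close to 0 or 1; the simulated scores here stay well inside the unit interval.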
3. Dummy Endogenous Variables
Consider the model again:
y = μ0 + αw + x β0 + u 0, (4)
where w is endogenous. Again, w = 1 if treated, and 0 otherwise.
Assume that Pr(w=1|x,z ) = G(x, z; γ), where z is a vector of instruments excluded from (4).
Procedure 1:
(1) Estimate the binary response model Pr(w i =1|x i ,z i ) = G(x i ,z i ;γ), and obtain the fitted values $\hat{G}_i$.
(2) Estimate (4) by IV using instruments 1, $\hat{G}_i$, and $x_i$.
Procedure 1 has important robustness properties:
(a) Because we use $\hat{G}_i$ as an IV, the model Pr(w i =1|x i ,z i ) = G(x i ,z i ;γ) does not have to be correctly specified.
(b) Technically, α and β0 are identified even if we do not have extra variables excluded from x, but the estimator can rarely be justified in this case. Suppose that w given x follows a probit model (no z). Because G(x, γ) = Φ(γ0 + x γ1) is a nonlinear function of x, it is not perfectly correlated with x, so it can be used as an IV for w.
(c) In principle, it is important to recognize that Procedure 1 is not the same as using $\hat{G}_i$ as a regressor in place of $w_i$, that is, running the OLS regression
$y_i$ on 1, $\hat{G}_i$, and $x_i$.
Consistency of the OLS estimators from the regression:
\[ y_i = \delta_0 + \alpha\,\hat{G}_i + x_i \beta_0 + u_i \tag{5} \]
would rely on G(·) being correctly specified. Note that the standard errors from (5) also need to be corrected, because $\hat{G}_i$ is a generated regressor.
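Procedure 1 can be sketched on simulated data as follows. Everything specific here is an assumption for illustration: the data-generating process, the hypothetical helper `fit_logit`, and the deliberate choice of a logit for G even though the simulated selection rule is a probit, which mimics robustness property (a).

```python
import numpy as np

def fit_logit(X, w, n_iter=25):
    """Newton-Raphson for a logit model; returns the coefficient vector."""
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ gamma)))
        grad = X.T @ (w - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        gamma = gamma + np.linalg.solve(hess, grad)
    return gamma

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)
z = rng.normal(size=n)                      # excluded instrument
e = rng.normal(size=n)                      # selection error
u0 = 0.6 * e + 0.8 * rng.normal(size=n)     # outcome error, correlated with e

# Endogenous treatment (probit-type selection) and outcome with alpha = 2
w = (0.5 * x + 1.0 * z + e > 0).astype(float)
y = 1.0 + 2.0 * w + 0.5 * x + u0

# Step (1): fit a binary response model and keep the fitted values G_hat
D = np.column_stack([np.ones(n), x, z])
G_hat = 1.0 / (1.0 + np.exp(-(D @ fit_logit(D, w))))

# Step (2): IV with instruments (1, G_hat, x) for regressors (1, w, x)
X = np.column_stack([np.ones(n), w, x])
Z = np.column_stack([np.ones(n), G_hat, x])
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

# Plain OLS of y on (1, w, x) for comparison; biased because w is endogenous
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_iv[1], beta_ols[1])
```

The model is just identified, so the IV estimator reduces to solving Z'X b = Z'y. The IV standard errors are not computed in this sketch.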
Allowing for an interaction term:
\[ y_i = \delta_0 + \alpha\,w_i + x_i \beta_0 + w_i (x_i - \bar{x})\,\delta + e_i \tag{6} \]
Procedure 2:
(a) Estimate Pr(w i =1|x i ,z i ) = G(x i ,z i ;γ)
(b) Estimate (6) by IV, using 1, $x_i$, $\hat{G}_i$, and $\hat{G}_i (x_i - \bar{x})$ as IVs.
The discussion is the same as before.
4. Regression discontinuity
It is useful to distinguish between two general settings, the Sharp and the Fuzzy Regression Discontinuity designs. In the sharp design, the assignment w i is a deterministic function of one of the covariates, the forcing (or treatment-determining) variable x :
Sharp design:
w i = 1(x i > x 0)
All units with x i > x 0 are assigned to the treatment group (and participation is mandatory for the individuals), and all units with x i ≤ x 0 are assigned to the control group. In this sharp design, we look at the discontinuity in the conditional expectation of the outcome given the covariates to uncover the ATE :
\[ ATE = \lim_{x \to x_0^{+}} E[y \mid x] - \lim_{x \to x_0^{-}} E[y \mid x] = E(y_1 - y_0 \mid x = x_0) \]
Fuzzy design:
E(w i |x i = x ) = Pr(w i = 1|x ) is discontinuous at a known value x 0.
The sharp and fuzzy designs differ in that in the sharp design the treatment assignment is deterministic given x , while in the fuzzy design the treatment assignment may depend on additional factors unobserved by the econometrician. In both designs, the discontinuity point x 0 is known.
Assumption (RD):
(i) $w^{+} = \lim_{x \to x_0^{+}} E(w \mid x)$ and $w^{-} = \lim_{x \to x_0^{-}} E(w \mid x)$ exist.
(ii) $w^{+} \neq w^{-}$.
In Angrist and Lavy (1999), an identifying assumption would be that the class size for a student in a school with a number of pupils approaching (for example) 800 from above differs from that of a student in a school with a number of pupils approaching 800 from below.
Assumption: E(y 1i – y 0i |x i = x ) is continuous in x at x 0.
This assumption is valid where we have reason to believe that persons close to the threshold x 0 are similar and thus would experience similar outcomes absent treatment.
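Theorem: The ATE, denoted as α, satisfies
\[ \alpha = \frac{y^{+} - y^{-}}{w^{+} - w^{-}}, \]
where $y^{+} = \lim_{x \to x_0^{+}} E(y \mid x)$ and $y^{-} = \lim_{x \to x_0^{-}} E(y \mid x)$ are defined analogously to $w^{+}$ and $w^{-}$.
Proof: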
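Let Δ be a small positive number. Then
\[
\begin{aligned}
E(y \mid x_0+\Delta) - E(y \mid x_0-\Delta)
 &= E\bigl(y_0 + w(y_1 - y_0) \mid x_0+\Delta\bigr) - E\bigl(y_0 + w(y_1 - y_0) \mid x_0-\Delta\bigr) \\
 &= E\bigl((y_1 - y_0)\,w \mid x_0+\Delta\bigr) - E\bigl((y_1 - y_0)\,w \mid x_0-\Delta\bigr) + E(y_0 \mid x_0+\Delta) - E(y_0 \mid x_0-\Delta) \\
 &= \alpha\,\bigl[E(w \mid x_0+\Delta) - E(w \mid x_0-\Delta)\bigr] + E(y_0 \mid x_0+\Delta) - E(y_0 \mid x_0-\Delta).
\end{aligned}
\]
The last equality treats the effect y 1 – y 0 as equal to α near x 0 (or, more generally, as mean independent of w given x near x 0).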
As Δ → 0, we have:
\[ y^{+} - y^{-} = \alpha\,(w^{+} - w^{-}) \]
Here we use the fact (assumption) that E(y 0|x ) is continuous at x 0 without treatment, so the last two terms cancel in the limit.
The conclusion follows.
Given this theorem, we can obtain an estimate of α by estimating y +, y -, w +, and w -. There are several ways to estimate these. The most popular way is to do it non-parametrically.
In practice,
\[
\hat{y}^{+} = \frac{\sum_i y_i\, 1(x_0 < x_i < x_0 + h)}{\sum_i 1(x_0 < x_i < x_0 + h)},
\qquad
\hat{y}^{-} = \frac{\sum_i y_i\, 1(x_0 - h < x_i < x_0)}{\sum_i 1(x_0 - h < x_i < x_0)}.
\]
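Note that for a sharp design RD, w + - w - = 1. For a fuzzy design RD,
\[
\hat{w}^{+} = \frac{\sum_i w_i\, 1(x_0 < x_i < x_0 + h)}{\sum_i 1(x_0 < x_i < x_0 + h)},
\qquad
\hat{w}^{-} = \frac{\sum_i w_i\, 1(x_0 - h < x_i < x_0)}{\sum_i 1(x_0 - h < x_i < x_0)},
\]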
where h is the bandwidth. An interesting note is that the resulting ratio estimator is numerically equivalent to an IV estimator for the regression of y i on w i for people in the subsample $x_0 - h < x_i < x_0 + h$, using $1(x_0 < x_i < x_0 + h)$ as the IV. The regression method can be useful because one can add control variables in the regression.
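The numerical equivalence between the ratio of local mean differences and the IV estimator can be checked directly. The sketch below uses a simulated fuzzy design; the cutoff, bandwidth, jump in the treatment probability (0.5), and true effect (2) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x0, h = 0.0, 0.1          # cutoff and bandwidth (assumed values)

# Fuzzy design: Pr(w = 1 | x) jumps from 0.2 to 0.7 at x0
x = rng.uniform(-1.0, 1.0, size=n)
p = 0.2 + 0.5 * (x > x0)
w = (rng.uniform(size=n) < p).astype(float)
y = 1.0 + 2.0 * w + 0.5 * x + rng.normal(size=n)

# Rectangular-kernel windows on each side of the cutoff
right = (x > x0) & (x < x0 + h)
left = (x > x0 - h) & (x < x0)

y_plus, y_minus = y[right].mean(), y[left].mean()
w_plus, w_minus = w[right].mean(), w[left].mean()
alpha_wald = (y_plus - y_minus) / (w_plus - w_minus)

# IV regression of y on (1, w) in the window, instrument 1(x0 < x < x0 + h)
window = left | right
ys, ws = y[window], w[window]
ds = right[window].astype(float)
X = np.column_stack([np.ones(ys.size), ws])
Z = np.column_stack([np.ones(ys.size), ds])
alpha_iv = np.linalg.solve(Z.T @ X, Z.T @ ys)[1]

print(alpha_wald, alpha_iv)
```

The two estimates agree up to floating-point error. Because E(y0 | x) has a slope in this simulation, the rectangular kernel leaves a bias of order h, so the estimate sits slightly away from 2; adding x as a control in the IV regression is one way to reduce it.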
Practically, for a sharp design,
1. Graph the data by computing the average value of the outcome variable over a set of bins. The bandwidth has to be large enough to have a sufficient amount of precision so that the plots look smooth on either side of the cutoff value, but at the same time small enough to make the jump around the cutoff value clear.
2. Estimate the treatment effect by running linear regressions on both sides of the cutoff point. Since we propose to use a rectangular kernel, these are just standard