Resampling Methods (CV, Bootstrap)
Introduction
Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain additional information about the fitted model (e.g. cross-validation, the bootstrap). Typical uses:
- Estimates of test-set prediction error (CV)
- S.E. and bias of estimated parameters (Bootstrap)
- C.I. of a target parameter (Bootstrap)
Cross-Validation
The training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter.
- Model complexity low: high bias, low variance
- Model complexity high: low bias, high variance

Ways to estimate the prediction error:
- Use a large designated test set
- Apply a mathematical adjustment to the training error, e.g.

$$C_p = \frac{1}{n}\left(SSE + 2d\hat{\sigma}^2\right), \qquad AIC = \frac{1}{n\hat{\sigma}^2}\left(SSE + 2d\hat{\sigma}^2\right), \qquad BIC = \frac{1}{n\hat{\sigma}^2}\left(SSE + \log(n)\,d\hat{\sigma}^2\right)$$

where $d$ is the number of predictors and $\hat{\sigma}^2$ is an estimate of the error variance (see the sketch after this list)
- CV: a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.
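A minimal sketch of the three adjustment formulas as code; the helper name and argument names are mine, not from the source:

```python
import numpy as np

def cp_aic_bic(sse, n, d, sigma2_hat):
    """Training-error adjustments for a model with d predictors fit on n
    observations; sigma2_hat is an estimate of the error variance Var(eps)."""
    cp = (sse + 2 * d * sigma2_hat) / n
    aic = (sse + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (sse + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    return cp, aic, bic
```

Since $\log(n) > 2$ for $n > 7$, BIC penalizes model size more heavily than AIC or $C_p$.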
The Validation Set Approach
A random split of the observations into two halves: the left part is the training set, the right part is the validation set (see the sketch after the list below).
Drawbacks
- The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set.
- Only a subset of the observations is used to fit the model.
- The validation-set error rate may therefore tend to overestimate the test error rate for the model fit on the entire data set.
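A minimal sketch of the approach; `fit` and `predict` are hypothetical callables standing in for whatever learning method is being evaluated:

```python
import numpy as np

def validation_set_mse(X, y, fit, predict, seed=0):
    """Validation set approach: one random half/half split; the MSE on the
    held-out half estimates the test error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, val = idx[:half], idx[half:]
    model = fit(X[train], y[train])
    return np.mean((y[val] - predict(model, X[val])) ** 2)
```

Rerunning with a different `seed` illustrates the first drawback: the estimate can change substantially with the split.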
Leave-One-Out Cross-Validation
LOOCV involves splitting the set of observations into two parts. However, instead of creating two subsets of comparable size, a single observation is used for the validation set, and the remaining observations make up the training set. This is repeated for each of the $n$ observations, and the resulting $n$ test errors are averaged.
In Linear Regression
For least squares, the LOOCV estimate can be computed from a single fit:

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$$

where $h_i$ is the leverage of observation $i$, so LOOCV becomes a weighted MSE (see the sketch after the list below).
Drawbacks
- Estimates from each fold are highly correlated and hence their average can have high variance.
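A sketch of the single-fit shortcut for linear regression (the function name is mine):

```python
import numpy as np

def loocv_linear(X, y):
    """LOOCV error for least squares via the leverage shortcut: one hat-matrix
    computation replaces n separate refits."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])      # design matrix with intercept
    H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)  # hat matrix
    h = np.diag(H)                             # leverages h_i
    resid = y - H @ y                          # residuals y_i - yhat_i
    return np.mean((resid / (1.0 - h)) ** 2)   # CV_(n)
```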
K-fold Cross-Validation
This approach involves randomly dividing the set of observations into $k$ groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining $k-1$ folds. This procedure is repeated $k$ times; each time, a different group of observations is treated as a validation set. This process results in $k$ estimates of the test error, and the k-fold CV estimate is computed by averaging these values:

$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i \qquad \text{or, for classification,} \qquad CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} Err_i$$

If $k = n$, this is LOOCV.
Typically, given these considerations, one performs k-fold cross-validation using $k = 5$ or $k = 10$, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance. A sketch follows.
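A compact sketch of k-fold CV for a regression method; as before, `fit` and `predict` are hypothetical callables:

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=10, seed=0):
    """k-fold CV estimate of the test MSE: average the k per-fold MSEs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mses = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        model = fit(X[train], y[train])
        mses.append(np.mean((y[val] - predict(model, X[val])) ** 2))
    return np.mean(mses)
```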
Bootstrap
A powerful statistical tool to quantify the uncertainty associated with a given estimator or statistical learning method. For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.
Steps
- Obtain $B$ data sets (each of $n$ observations) by repeatedly sampling from the original data set $Z$ with replacement, $B$ times.
- Each bootstrap data set, denoted $Z^{*1},\dots,Z^{*B}$, is the same size as the original data set $Z$, and the corresponding bootstrap estimates of $\alpha$ are denoted $\hat{\alpha}^{*1},\dots,\hat{\alpha}^{*B}$. Thus some observations may appear more than once and some not at all (each observation is left out of a given bootstrap sample with probability $(1-1/n)^n \approx e^{-1} \approx 0.368$).
Estimate of S.E.
For a generic estimator $\hat{\theta}$ with bootstrap replicates $\hat{\theta}^{*1},\dots,\hat{\theta}^{*B}$:

$$\widehat{SE}_B(\hat{\theta}) = \sqrt{\frac{1}{B-1}\sum_{r=1}^{B}\left(\hat{\theta}^{*r} - \bar{\theta}^{*}\right)^2}, \qquad \bar{\theta}^{*} = \frac{1}{B}\sum_{r=1}^{B}\hat{\theta}^{*r}$$
Estimate of C.I.
Bootstrap Percentile C.I.

$$[L, U] = \left[\hat{\theta}^{*}_{\alpha/2},\ \hat{\theta}^{*}_{1-\alpha/2}\right]$$

where $\hat{\theta}^{*}_{q}$ denotes the $q$ quantile of the bootstrap estimates.
Bootstrap S.E. based C.I.

$$[L, U] = \bar{\theta}^{*} \pm z_{1-\alpha/2} \times \widehat{SE}_B(\hat{\theta})$$

Better Option (Basic Bootstrap / Reverse Percentile Interval)

$$[L, U] = \left[2\hat{\theta} - \hat{\theta}^{*}_{1-\alpha/2},\ 2\hat{\theta} - \hat{\theta}^{*}_{\alpha/2}\right]$$
Key: the behavior of $\hat{\theta}^{*} - \hat{\theta}$ is approximately the same as the behavior of $\hat{\theta} - \theta$. Therefore:

$$\begin{aligned}
0.95 &= P\left(\hat{\theta}^{*}_{\alpha/2} \le \hat{\theta}^{*} \le \hat{\theta}^{*}_{1-\alpha/2}\right)\\
&= P\left(\hat{\theta}^{*}_{\alpha/2} - \hat{\theta} \le \hat{\theta}^{*} - \hat{\theta} \le \hat{\theta}^{*}_{1-\alpha/2} - \hat{\theta}\right)\\
&\approx P\left(\hat{\theta}^{*}_{\alpha/2} - \hat{\theta} \le \hat{\theta} - \theta \le \hat{\theta}^{*}_{1-\alpha/2} - \hat{\theta}\right)\\
&= P\left(2\hat{\theta} - \hat{\theta}^{*}_{1-\alpha/2} \le \theta \le 2\hat{\theta} - \hat{\theta}^{*}_{\alpha/2}\right)
\end{aligned}$$
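A sketch computing all three intervals from a set of bootstrap replicates (function name is mine; `reps` is a NumPy array such as the one returned by `bootstrap_se` above):

```python
import numpy as np
from statistics import NormalDist

def bootstrap_cis(theta_hat, reps, alpha=0.05):
    """Percentile, S.E.-based, and basic (reverse percentile) intervals
    from bootstrap replicates `reps` of the estimate `theta_hat`."""
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}
    se = reps.std(ddof=1)
    return {
        "percentile": (lo, hi),
        "se_based": (reps.mean() - z * se, reps.mean() + z * se),
        "basic": (2 * theta_hat - hi, 2 * theta_hat - lo),
    }
```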
In General
Each bootstrap sample has significant overlap with the original data. This will cause the bootstrap to seriously underestimate the true prediction error.
Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample (complicated).
If the data is a time series, we can't simply sample the observations with replacement. We can instead create blocks of consecutive observations and sample those with replacement; then we paste the sampled blocks together to obtain a bootstrap sample.
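A sketch of drawing one such block-bootstrap sample (names and the fixed block length are mine):

```python
import numpy as np

def block_bootstrap_sample(x, block_len=10, seed=0):
    """One bootstrap sample of a time series: draw whole blocks of
    consecutive observations with replacement and paste them together."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    sample = np.concatenate([x[s:s + block_len] for s in starts])
    return sample[:n]   # trim to the original length
```

Keeping blocks intact preserves the short-range dependence structure that plain resampling of individual observations would destroy.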
Bootstrap in Regression
Consider the simple linear model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad i = 1,\dots,n$$

Goal: find the S.E. and C.I. for $\hat{\beta}_0$ and $\hat{\beta}_1$.
Empirical Bootstrap
Resample the pairs $(X_1,Y_1),\dots,(X_n,Y_n)$ with replacement and obtain:
Bootstrap sample 1: $(X_1^{*1},Y_1^{*1}),\dots,(X_n^{*1},Y_n^{*1})$
Bootstrap sample 2: $(X_1^{*2},Y_1^{*2}),\dots,(X_n^{*2},Y_n^{*2})$
…
Bootstrap sample B: $(X_1^{*B},Y_1^{*B}),\dots,(X_n^{*B},Y_n^{*B})$
For each bootstrap sample, fit the regression and obtain $(\hat{\beta}_0^{*1},\hat{\beta}_1^{*1}),\dots,(\hat{\beta}_0^{*B},\hat{\beta}_1^{*B})$; then estimate the S.E. and C.I. (see the sketch below).
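A sketch for the simple linear model, with `X` and `Y` as 1-D NumPy arrays (function name is mine):

```python
import numpy as np

def pairs_bootstrap(X, Y, B=1000, seed=0):
    """Empirical bootstrap for simple linear regression: resample the
    (X_i, Y_i) pairs, refit OLS, collect (beta0*, beta1*) replicates."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    betas = np.empty((B, 2))
    for b in range(B):
        idx = rng.integers(0, n, size=n)         # sample pairs with replacement
        b1, b0 = np.polyfit(X[idx], Y[idx], 1)   # polyfit returns [slope, intercept]
        betas[b] = (b0, b1)
    return betas   # column SDs give SE(beta0_hat), SE(beta1_hat)
```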
Residual Bootstrap
Recall that the residuals $\hat{e}_i$ mimic the role of $\epsilon_i$. Bootstrap the residuals and obtain:
Bootstrap residuals 1: $\hat{e}_1^{*1},\dots,\hat{e}_n^{*1}$
Bootstrap residuals 2: $\hat{e}_1^{*2},\dots,\hat{e}_n^{*2}$
…
Bootstrap residuals B: $\hat{e}_1^{*B},\dots,\hat{e}_n^{*B}$
Generate the new bootstrap samples:

$$X_i^{*b} = X_i, \qquad Y_i^{*b} = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{e}_i^{*b}$$

For each bootstrap sample, fit the regression and estimate the S.E. and C.I. (a sketch follows).
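The same setup as above, but resampling residuals while keeping $X$ fixed:

```python
import numpy as np

def residual_bootstrap(X, Y, B=1000, seed=0):
    """Residual bootstrap: keep X fixed, resample the fitted residuals,
    rebuild Y* = b0 + b1*X + e*, and refit OLS on each sample."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    b1, b0 = np.polyfit(X, Y, 1)
    resid = Y - (b0 + b1 * X)
    betas = np.empty((B, 2))
    for b in range(B):
        e_star = resid[rng.integers(0, n, size=n)]   # resample residuals
        s1, s0 = np.polyfit(X, b0 + b1 * X + e_star, 1)
        betas[b] = (s0, s1)
    return betas
```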
Wild Bootstrap
When the variance of the error, $Var(\epsilon_i \mid X_i)$, depends on the value of $X_i$ (so-called heteroskedasticity), the residual bootstrap is unstable, because it swaps all the residuals regardless of the value of $X$; the wild bootstrap instead uses each observation's own residual only.
Generate IID random variables $V_1^b,\dots,V_n^b \sim N(0,1)$ and generate the new bootstrap samples:

$$X_i^{*b} = X_i, \qquad Y_i^{*b} = \hat{\beta}_0 + \hat{\beta}_1 X_i + V_i^b \hat{e}_i$$

For each bootstrap sample, fit the regression and estimate the S.E. and C.I.
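A final sketch, differing from the residual bootstrap only in how the residuals enter:

```python
import numpy as np

def wild_bootstrap(X, Y, B=1000, seed=0):
    """Wild bootstrap: perturb each observation's own residual with an
    independent N(0,1) multiplier, preserving heteroskedasticity in X."""
    rng = np.random.default_rng(seed)
    b1, b0 = np.polyfit(X, Y, 1)
    resid = Y - (b0 + b1 * X)
    betas = np.empty((B, 2))
    for b in range(B):
        V = rng.standard_normal(len(Y))   # V_i^b ~ N(0,1), IID
        s1, s0 = np.polyfit(X, b0 + b1 * X + V * resid, 1)
        betas[b] = (s0, s1)
    return betas
```

Because each $\hat{e}_i$ stays attached to its own $X_i$, the resampled errors inherit whatever variance pattern the data show across $X$.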