2016年12月11日 星期日

【R】Stepwise Regression 逐步回歸練習

        逐步回歸 Stepwise Regression 是一種線性回歸的建模方式,主要概念就是逐步一個一個的判斷每一個自變數Independent Variables 是否顯著影響應變數Dependent Variable, 顯著影響的就放入,不顯著的就剔除。逐步回歸的方式通常有兩種,Forward 和 Backward, 以下我參考這篇網站的教學(https://goo.gl/EAAurP),以R語言練習逐步回歸的建模過程。




➤第一步:讀取、觀察資料
這是一個關於甚麼因素會影響出生嬰兒體重的研究,包含了以下欄位的資料,我們用R讀取資料後進行重新編碼。


#ID : Indentification Code
#LOW : Low Birth Weight Baby (1=Yes under 2500g,0 = No)
#AGE : Mother's age in years
#LWT : Weight at Last Period
#RACE : 1=White, 2=Black, 3=Other 
#SMOKE : Smoke during the pregnancy (1=Yes,0=No)
#PTL : History of Premature Labour(# of time)
#HT : History of Hypertension (1=Yes,0=No)
#UI : Presence of Uterine Inrritability (1=Yes,0=No)
#FTV : Visits to Doctor During 1st trimester
#BWT : Baby's birth Weight in Grams

setwd("D:/Data analysis/R/Blog/Stepwise")
lowbwt <- br="" header="T)" lowbwt.csv="" read.csv=""> ##Recoding
lowbwt = within(lowbwt,{
##Relabel race
race = factor(RACE , levels = 1:3, labels = c("White","Black","Other"))
##Categorize ftv
ftv = cut(FTV, breaks = c(-Inf,0,2,Inf), labels = c("None","Normal","Many"))
ftv = relevel(ftv, ref = "Normal")
##Dichotomize prl
ptl = factor(PTL >=1, levels = c(FALSE, TRUE), labels = c("0","1+"))
##Categorie smoke ht ui
smoke = factor(SMOKE, levels = 0:1, labels = c("NO","YES"))
ht = factor(HT, levels = 0:1, labels = c("NO","YES"))
ui = factor(UI, levels = 0:1, labels = c("NO","YES"))
})



➤第二步:建立線性模型
以下為了進行逐步回歸的建模,我們先建立兩組線性模型,以新生嬰兒體重BWT為應變數,一組是將所有自變數放入的全模型,另一組則是什麼都不放入的空模型,然後我們將全模型的圖畫出來,觀察一下每個自變數對應變數的影響狀況。

##Create a full and null linear models
lm.fullmodel = lm(BWT~AGE + LWT + race + smoke + ptl + ht + ui + ftv, data = lowbwt)
lm.nullmodel = lm(BWT~1,data = lowbwt)
plot(allEffects(lm.fullmodel),ylim = c(2000,4000))




➤第三步:Manual F-test-based backward selection
第一個逐步回歸方法我們採用F-test 的向後逐步回歸,也就是我們將以全模型開始,逐步丟棄顯著性低的自變數,過程如下。

drop1(lm.fullmodel, test = "F")
drop1(update(lm.fullmodel, ~. -AGE),test ="F")
drop1(update(lm.fullmodel,~.-AGE-ftv),test = "F")
drop1(update(lm.fullmodel,~.-AGE-ftv-ptl),test = "F")
summary(update(lm.fullmodel,~.-AGE-ftv-ptl))
formula(update(lm.fullmodel,~.-AGE-ftv-ptl)) 


從Full model可以看出最不顯著的自變數為Age
我們將自變數Age拿掉後,觀察到第二個最不顯著的自變數ftv

我們將自變數ftv也拿掉後,觀察到第三個最不顯著的自變數ptl

我們已經將所有不顯著的自變數拿掉了!

最後我們觀察模型的結果

➤第四步:結論
以下我們可以觀察一下利用F-test做向後逐步回歸之後的結果,不過從R的console可以看出逐步回歸後模型的R-squared僅有0.2403,解釋能力不是太高,或許其中有許多因素,像是共線性或者資料蒐集不足等問題值得我們後續慢慢探討,不過這邊就算簡單練習用R進行逐步回歸的過程囉!

原本的全模型:BWT ~ AGE + LWT + race + smoke + ptl + ht + ui + ftv
逐步回歸結果:BWT ~ LWT + race + smoke + ht + ui
➤另外這裡也提供其他逐步回歸的建模方法的R-code 在下面,可以嘗試不同方法看看模型的變化以及解釋能力。

  • Manual F-test-based backward selection
  • Manual F-test-base forward selection
  • Automated F-test-based backward selection using rms::fasrbw()
  • AIC-based backward selection
  • AIC-based forward selection
  • AIC-based forward-backward selection

##Manual F-test-based backward selection
drop1(lm.fullmodel, test = "F")
drop1(update(lm.fullmodel, ~. -AGE),test ="F")
drop1(update(lm.fullmodel,~.-AGE-ftv),test = "F")
drop1(update(lm.fullmodel,~.-AGE-ftv-ptl),test = "F")
summary(update(lm.fullmodel,~.-AGE-ftv-ptl))
formula(lm.fullmodel)
formula(update(lm.fullmodel,~.-AGE-ftv-ptl))

##Manual F-test-base forward selection
add1(lm.nullmodel,scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv,test = "F" )
add1(update(lm.nullmodel,~.+ui),scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv,test = "F" )
add1(update(lm.nullmodel,~.+ui+race),scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv,test = "F" )
add1(update(lm.nullmodel,~.+ui+race+smoke),scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv,test = "F" )
add1(update(lm.nullmodel,~.+ui+race+smoke+ht),scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv,test = "F" )
add1(update(lm.nullmodel,~.+ui+race+smoke+ht+LWT),scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv,test = "F" )
summary(update(lm.nullmodel,~.+ui+race+smoke+ht+LWT))
formula(update(lm.nullmodel,~.+ui+race+smoke+ht+LWT))

#Automated F-test-based backward selection using rms::fasrbw()
install.packages("rms")
library(rms)
ols.fullmodel = ols(BWT~AGE + LWT + race + smoke + ptl + ht + ui + ftv, data = lowbwt)
fastbw(ols.fullmodel, rule = "p" , sls = 0.1)

#AIC-based backward selection
model.AIC.backward = step(lm.fullmodel,direction = "backward",trace = 1)
summary(model.AIC.backward)
formula(model.AIC.backward)

#AIC-based forward selection
model.AIC.forward = step(lm.nullmodel,direction = "forward",trace = 1,scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv)
summary(model.AIC.forward)
formula(model.AIC.forward)

#AIC-based forward-backward selection
model.AIC.both = step(lm.nullmodel,direction = "both",trace = 1,scope = ~AGE + LWT + race + smoke + ptl + ht + ui + ftv)
summary(model.AIC.both)
formula(model.AIC.both)