R语⾔_电信客户流失数据分析
1 引⾔
近年来,各⾏各业往往都会不可避免地⾯临⽤户流失的问题。研究表明,发展新⽤户所花费的宣传、促销等成本显然⾼于维持⽼⽤户的成本,因此,做好"客户流失预警"可以有效降低营销成本,做到精准营销。
如今,随着运营商的竞争不断加剧,电信运营商亟需提⾼⽤户留存率、增加⽤户黏性,减少客户流失。因此,需要对电信客户进⾏流失分析与预测,发掘客户流失的原因,进⽽改善⾃⾝业务,提⾼⽤户的满意度,延长⽤户⽣命周期。
2 数据来源与数据概况
2.1 数据来源
2.2 数据概况
电信客户流失数据集描述了电信⽤户是否流失以及其相关信息,共包含7043条记录,21个字段。 读⼊数据集后,了解数据集的基本信息。
> telco.data <- read.csv("WA_Fn-UC_-Telco-Customer-Churn.csv")
># 展⽰数据集的前六⾏数据
> head(telco.data)
customerID gender SeniorCitizen Partner Dependents tenure PhoneService
1 7590-VHVEG Female 0 Yes No 1 No
2 5575-GNVDE Male 0 No No 34 Yes
3 3668-QPYBK Male 0 No No 2 Yes
4 7795-CFOCW Male 0 No No 4
5 No
5 9237-HQITU Female 0 No No 2 Yes
6 9305-CDSKC Female 0 No No 8 Yes
MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
1 No phone rvice DSL No Yes No
2 No DSL Yes No Yes
3 No DSL Yes Yes No
4 No phone rvice DSL Yes No Yes
5 No Fiber optic No No No
6 Yes Fiber optic No No Yes
TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
1 No No No Month-to-month Yes
bono2 No No No One year No
3 No No No Month-to-month Yes
4 Yes No No One year No
5 No No No Month-to-month Yes
6 No Yes Yes Month-to-month Yes
PaymentMethod MonthlyCharges TotalCharges Churn
1 Electronic check 29.85 29.85 No
2 Mailed check 56.95 1889.50 No
3 Mailed check 53.85 108.15 Yes
4 Bank transfer (automatic) 42.30 1840.7
5 No
5 Electronic check 70.70 151.65 Yes
6 Electronic check 99.65 820.50 Yes
># 数据集的维度
> dim(telco.data)
[1] 7043 21
每个字段的介绍如下表所⽰:
字段名字段含义字段内容
customerID客户ID
gender性别Female & Male
字段名字段含义字段内容
SeniorCitizen⽼年⽤户1表⽰是,0表⽰不是Partner伴侣⽤户Yes or No Dependents亲属⽤户Yes or No
tenure在⽹时长0-72⽉
PhoneService是否开通电话服务服务Yes or No
MultipleLines是否开通多线服务Yes 、No or No phonervice 三种
InternetService是否开通上⽹服务No, DSL数字⽹络,fiber optic光纤⽹络
OnlineSecurity是否开通⽹络安全服务Yes,No,No internetrive
OnlineBackup是否开通在线备份服务Yes,No,No internetrive
DeviceProtection是否开通设备保护服务Yes,No,No internetrive
TechSupport是否开通技术⽀持服务Yes,No,No internetrive
StreamingTV是否开通⽹络电视Yes,No,No internetrive StreamingMovies是否开通⽹络电影Yes,No,No internetrive Contract签订合同⽅式按⽉,⼀年,两年PaperlessBilling是否开通电⼦账单Yes or No
PaymentMethod付款⽅式bank transfer,credit card,electronic check,mailed check MonthlyCharges⽉租费18.85-118.35
TotalCharges累计付费18.85-8684.8
Churn该⽤户是否流失Yes or No
3 研究问题
1. 分析⽤户特征与流失的关系
2. 流失客户普遍具有哪些特征?
3. 尝试找到合适的模型预测流失客户。
4. 针对性给出增加⽤户黏性、降低客户流失率的建议。
4 数据预处理
查看数据集中每个变量的类型。
> str(telco.data)
'data.frame': 7043 obs. of 21 variables:
$ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
实现英文
origen
sway$ gender : chr "Female" "Male" "Male" "Male" ...
$ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ Partner : chr "Yes" "No" "No" "No" ...
$ Dependents : chr "No" "No" "No" "No" ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ PhoneService : chr "No" "Yes" "Yes" "No" ...
$ MultipleLines : chr "No phone rvice" "No" "No" "No phone rvice" ...
$ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
$ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
$ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
$ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
$ TechSupport : chr "No" "No" "No" "Yes" ...
$ StreamingTV : chr "No" "No" "No" "No" ...
$ StreamingMovies : chr "No" "No" "No" "No" ...
$ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
$ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
$ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
$ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
$ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ Churn : chr "No" "No" "Yes" "No" ...
4.1 因⼦变量处理
需要将该数据集中的部分变量转化为因⼦类型。
> telco.data <- within(telco.data,{
+ SeniorCitizen <- factor(SeniorCitizen, levels = c(0,1), labels = c("No","Yes"))
+ Partner <- factor(Partner)
+ Dependents <- factor(Dependents)
+})
> Factors <- c("gender","PhoneService","MultipleLines","InternetService","OnlineSecurity","OnlineBackup","DeviceProtection","TechSupport","Streamin gTV","StreamingMovies","Contract","PaperlessBilling","PaymentMethod","Churn")
> telco.data[Factors]<- lapply(telco.data[Factors],factor)
4.2 缺失值处理
从图中可以看出,TotalCharges列有11个缺失值,占⽐⼤约0.16%。
> colSums(is.na(telco.data))
customerID gender SeniorCitizen Partner
造价师培训0 0 0 0
Dependents tenure PhoneService MultipleLines
0 0 0 0
InternetService OnlineSecurity OnlineBackup DeviceProtection
0 0 0 0
TechSupport StreamingTV StreamingMovies Contract
0 0 0 0
PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
0 0 0 11
Churn
> library(VIM)
> par(cex =0.72, font.axis =3)
> VIM::aggr(telco.data, prop =TRUE, numbers =TRUE)
处理缺失值数据的⼀种⽅法是插补均值、中位数或者众数。
从直⽅图可知,TotalCharges数据呈偏态分布。根据正态分布选均值、中位数填充,偏态分布选中位数填充的原则,选择⽤TotalCharges 列的中位数去填充这11个缺失值。
> hist(telco.data$TotalCharges, breaks =50, prob =TRUE,
+ main ="Histogram Of TotalCharges")
> library(Hmisc)
># 插补中位数
> telco.data$TotalCharges <- as.numeric(Hmisc::impute(telco.data$TotalCharges, median))
4.3 简化分类变量的属性值
OnlineSecurity、OnlineBackup、DeviceProtection、TechSupport、StreamingTV、StreamingMovies这六个变量的属性值有Yes、No、No internet rive 三种。
通过分析这六个变量和Churn⽣成的⼆维列联表,不难发现"No internetrive"出现 的频数是⼀致的,可以认为该属性值不影响客户流失率,所以简化属性值,将其并⼊"No"这⼀属性值。
>for(i in10:15)
+{
+ print(xtabs(~ Churn + get(names(telco.data)[i]), data = telco.data))
+}
结果如下:
get(names(telco.data)[i])
Churn No No internet rvice Yes
No 2037 1413 1724
Yes 1461 113 295
get(names(telco.data)[i])
Churn No No internet rvice Yes
一点英语No 1855 1413 1906
Yes 1233 113 523
get(names(telco.data)[i])
Churn No No internet rvice Yes
No 1884 1413 1877
Yes 1211 113 545
get(names(telco.data)[i])
Churn No No internet rvice Yes
No 2027 1413 1734
Yes 1446 113 310
get(names(telco.data)[i])
Churn No No internet rvice Yes
No 1868 1413 1893
Yes 942 113 814
get(names(telco.data)[i])
Churn No No internet rvice Yes
No 1847 1413 1914
Yes 938 113 818
># 将“No internetrive”并⼊“No”这⼀属性值
> levels(telco.data$OnlineSecurity)[2]<-"No"
> levels(telco.data$OnlineBackup)[2]<-"No"
> levels(telco.data$DeviceProtection)[2]<-"No"
> levels(telco.data$TechSupport)[2]<-"No"
> levels(telco.data$StreamingTV)[2]<-"No"
> levels(telco.data$StreamingMovies)[2]<-"No"英文简历字体
4.4 处理"量纲差异⼤"
⽬前属于这类特征的变量有:MonthlyCharges和TotalCharges。我打算采⽤连续特征离散化的处理⽅式。原因是离散化后的特征对异常数据有更强的鲁棒性,降低过拟合的风险,模型会更稳定,预测的效果也会更好。
数据离散化也称为分箱操作,其⽅法分为有监督分箱(卡⽅分箱、最⼩熵法分箱)和⽆监督分箱(等频分箱、等距分箱)。 本次为采⽤⽆监督分箱中的等频分箱进⾏操作。
> library(Hmisc)
> describe(telco.data[c("MonthlyCharges","TotalCharges")])
telco.data[c("MonthlyCharges", "TotalCharges")]
2 Variables 704
3 Obrvations
-----------------------------------------------------------------------------------
MonthlyCharges
n missing distinct Info Mean Gmd .05 .10 .25
7043 0 1585 1 64.76 34.39 19.65 20.05 35.50
.50 .75 .90 .95
70.35 89.85 102.60 107.40
lowest : 18.25 18.40 18.55 18.70 18.75, highest: 118.20 118.35 118.60 118.65 118.75
-----------------------------------------------------------------------------------
TotalCharges
n missing distinct Info Mean Gmd .05 .10 .25
7043 0 6531 1 2282 2447 49.65 84.61 402.23
.50 .75 .90 .95
1397.47 3786.60 5973.69 6921.02
coalitionlowest : 18.80 18.85 18.90 19.00 19.05, highest: 8564.75 8594.40 8670.10 8672.45 8684.80 -----------------------------------------------------------------------------------
>#根据描述性统计量将变量按0.25,0.5,0.75分位数分成4份希望英语单词
> c_u_t <-function(x, n =1){
+ result <- quantile(x, probs = q(0,1,1/n))
+ result[1]<- result[1]-0.001
+ return(result)
+}
> telco.data <- transform(telco.data,
+ MonthlyCharges_c = cut(telco.data$MonthlyCharges,
+ breaks =c_u_t(telco.data$MonthlyCharges, n=4),
+ labels = c(1,2,3,4)),
+ TotalCharges_c = cut(telco.data$TotalCharges,
+ breaks = c_u_t(telco.data$TotalCharges, n=4),
+ labels = c(1,2,3,4)))
> telco.data <- within(telco.data,{
+ MonthlyCharges_c <- relevel(MonthlyCharges_c, ref =1)
+ TotalCharges_c <- relevel(TotalCharges_c, ref =1)
+})
5 探索性数据分析
高标查看流失客户的数量和占⽐,由图可知,客户流失率约为26.54%。
> table(telco.data$Churn)
No Yes
5174 1869