[Data Analysis and Mining in Practice] Telecom Customer Churn Analysis and Prediction

Updated: 2023-06-16 23:11:52


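Background

One widely cited view on user retention holds that reducing customer churn by 5% can raise company profits by 25%-85%. Persistently high customer acquisition costs have pushed telecom operators against a "ceiling", and some even struggle to acquire new customers at all. As the market saturates, operators urgently need to increase user stickiness and extend the user life cycle. Churn analysis and prediction for telecom users is therefore essential.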

The data comes from the "Telecom Operator Customer Dataset" (电信运营商客户数据集) on kesci.

Dataset:

This article covers the following:

1. Background

2. Questions

3. Understanding the data

4. Data cleaning

5. Visual analysis

6. Churn prediction

7. Conclusions and recommendations

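Questions

1. Analyze the relationship between user features and churn.

2. Looking at the data as a whole, what features do churned users generally share?

3. Try to find a suitable model for predicting churned users.

4. Give targeted recommendations for increasing user stickiness and preventing churn.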

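Understanding the Data

According to its description, the dataset has 21 fields and 7043 records. Each record holds the features of one unique customer.

Our goal is to discover the relationship between the first 20 feature columns and the last column, which records whether the customer churned.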
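As a minimal sketch of that goal, the frame can be split into feature columns and the churn label by position. The three-column toy frame below is a hypothetical stand-in for the real 21-column dataset, and the names `features` and `target` are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has 20 feature columns plus 'Churn'.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "tenure": [1, 24, 60],
    "Churn": ["Yes", "No", "No"],
})

# Everything except the last column is treated as a feature.
features = toy.iloc[:, :-1]
# The last column is the churn label.
target = toy.iloc[:, -1]

print(list(features.columns))
print(target.name)
```

Splitting by position only works while the label remains the last column; selecting `toy["Churn"]` by name would be the more defensive choice.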

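Data Cleaning

The "完全合一" rules of data cleaning (a mnemonic formed from the first characters of the four criteria below):

Completeness: whether any single record contains null values, and whether the fields being counted are complete.

Comprehensiveness: look at all the values in a column and use common sense to judge whether the column has a problem, for example in its data definition, unit labels, or the data itself.

Validity: whether the type, content, and size of the data are legitimate. For example, whether the data contains non-ASCII characters, whether gender has an "unknown" value, or whether an age exceeds 150.

Uniqueness: whether the data contains duplicate records. Data is usually aggregated from different channels, so duplicates are common. Both row data and column data need to be unique.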
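The four rules above could be checked in pandas roughly as follows. This is only a sketch on a hypothetical toy frame: the column names mimic the Telco dataset, and the "plausible" tenure range and gender categories are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy records; values are made up to trigger each check.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "B2", "C3"],
    "gender": ["Female", "Male", "Male", "Unknown"],
    "tenure": [12, 0, 0, 500],
    "TotalCharges": ["350.5", " ", " ", "9000"],
})

# Completeness: count nulls per column.
# Note that blank strings such as " " are not counted as null.
null_counts = toy.isnull().sum()

# Comprehensiveness: list every value of a column to spot odd entries.
gender_values = toy["gender"].value_counts()

# Validity: flag values outside an assumed range or category set.
bad_tenure = toy[(toy["tenure"] < 0) | (toy["tenure"] > 100)]
bad_gender = toy[~toy["gender"].isin(["Female", "Male"])]

# Uniqueness: find rows that fully duplicate an earlier row.
duplicate_rows = toy[toy.duplicated()]

print(null_counts, gender_values, bad_tenure, bad_gender, duplicate_rows, sep="\n\n")
```

The completeness check reports no nulls here even though two `TotalCharges` entries are blank, which is the same trap the real dataset sets.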

Import the packages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

customerDF = pd.read_csv('/home/kesci/input/yidong4170/WA_Fn-UC_-Telco-Customer-Churn.csv')

# Check the size of the dataset
customerDF.shape
# Output: (7043, 21)

# Show all columns without truncation
pd.set_option('display.max_columns', None)

# View the first 10 rows
customerDF.head(10)

# Count nulls per column
pd.isnull(customerDF).sum()

# Check the data types
customerDF.info()
#customerDF.dtypes

# Converting 'TotalCharges' (total charges) to float raises an error:
# a string cannot be converted to a number.

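Check each field's data type, content, and count in turn. This shows that the "TotalCharges" (total charges) column is missing data for 11 users.

# Inspect the values in each column
for x in customerDF.columns:
    test = customerDF.loc[:, x].value_counts()
    print('{0} row count: {1}'.format(x, test.sum()))
    print('{0} data type: {1}'.format(x, customerDF[x].dtypes))
    print('{0} contents:\n{1}\n'.format(x, test))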

Use a forced conversion to turn "TotalCharges" (total charges) into floating-point data.

# Force conversion to numeric; values that cannot be converted become NaN.
# (convert_objects has been removed from pandas; pd.to_numeric does the same job.)
customerDF['TotalCharges'] = pd.to_numeric(customerDF['TotalCharges'], errors='coerce')
test = customerDF.loc[:, 'TotalCharges'].value_counts().sort_index()
print(test.sum())
# Output: 7032

# Look at the tenure of the users whose TotalCharges is missing
print(customerDF.tenure[customerDF['TotalCharges'].isnull()])
# Output: 11 records

# Fill the missing total charges with the monthly charges
customerDF['TotalCharges'] = customerDF['TotalCharges'].fillna(customerDF['MonthlyCharges'])

# Check that the replacement succeeded
print(customerDF[customerDF['tenure']==0][['tenure','MonthlyCharges','TotalCharges']])

# Change 'tenure' (months since joining) from 0 to 1
customerDF['tenure'] = customerDF['tenure'].replace(0, 1)

print(pd.isnull(customerDF['TotalCharges']).sum())
print(customerDF['TotalCharges'].dtypes)

Review the descriptive statistics of the data. Judging by general experience, all values look normal.
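The cleaning steps above can be replayed on a small frame to see what each one does. The sketch below uses a hypothetical three-row frame in which blank strings stand in for the missing total charges; `n_missing` is an illustrative name, not one from the original code.

```python
import pandas as pd

# Hypothetical toy frame: blank TotalCharges for users with tenure 0.
toy = pd.DataFrame({
    "tenure": [0, 2, 0],
    "MonthlyCharges": [20.0, 50.0, 80.5],
    "TotalCharges": [" ", "100.0", " "],
})

# Step 1: coerce to numeric; unparseable strings become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isnull().sum())

# Step 2: fill each missing total with that row's monthly charge.
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["MonthlyCharges"])

# Step 3: treat a tenure of 0 as one month.
toy["tenure"] = toy["tenure"].replace(0, 1)

print(n_missing)
print(toy)
```

Assigning the result back to the column, rather than calling `replace(..., inplace=True)` on a `.loc` slice, avoids relying on in-place modification of an intermediate object.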

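Visual Analysis

Based on general experience, the user features are divided into user attributes, service attributes, and contract attributes, and the visual analysis is carried out along these three dimensions.

First, look at the number and share of churned users.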

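plt.pie(customerDF['Churn'].value_counts(), labels=customerDF['Churn'].value_counts().index, autopct='%1.2f%%', explode=(0.1, 0))
plt.title('Churn(Yes/No) Ratio')
plt.show()

# Use the counts Series directly; the column name produced by to_frame()
# differs between pandas versions.
churnCounts = customerDF['Churn'].value_counts()
x = churnCounts.index
y = churnCounts.values
plt.bar(x, y, width=0.5, color='c')
# A font with Chinese glyphs must be installed for Chinese labels to display correctly
plt.title('Churn(Yes/No) Num')
plt.show()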
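The share that the pie chart displays can also be computed directly, which is handy for judging how imbalanced the classes are. A minimal sketch on a hypothetical ten-row series (the counts are made up and are not the dataset's):

```python
import pandas as pd

# Hypothetical labels: 7 retained users and 3 churned users.
churn = pd.Series(["No"] * 7 + ["Yes"] * 3, name="Churn")

# normalize=True returns proportions instead of raw counts.
shares = churn.value_counts(normalize=True)

# Ratio of majority class to minority class as a rough imbalance measure.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(round(imbalance_ratio, 2))
```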

This is an imbalanced dataset: churned users account for 26.54% of the total.

(1) User attribute analysis

def barplot_percentages(feature, orient='v', axis_name="percentage of customers"):
    # Share of all customers in each (feature value, Churn) combination.
    # Naming the Series before reset_index avoids depending on the default
    # value_counts column name, which differs between pandas versions.
    g = (customerDF.groupby(feature)["Churn"].value_counts() / len(customerDF)).rename(axis_name).reset_index()
    #print(g)
    if orient == 'v':
        ax = sns.barplot(x=feature, y=axis_name, hue='Churn', data=g, orient=orient)
        ax.set_yticklabels(['{:,.0%}'.format(y) for y in ax.get_yticks()])
        #plt.legend(fontsize=10)
    else:
        ax = sns.barplot(x=axis_name, y=feature, hue='Churn', data=g, orient=orient)
        ax.set_xticklabels(['{:,.0%}'.format(x) for x in ax.get_xticks()])
        plt.legend(fontsize=10)
    plt.title('Churn(Yes/No) Ratio as {0}'.format(feature))
    plt.show()

barplot_percentages("SeniorCitizen")
barplot_percentages("gender")

# Encode churn as 0/1 so that the mean of the column is the churn rate
customerDF['churn_rate'] = customerDF['Churn'].map({"No": 0, "Yes": 1})
g = sns.FacetGrid(customerDF, col="SeniorCitizen", height=4, aspect=.9)
ax = g.map(sns.barplot, "gender", "churn_rate", palette="Blues_d", order=['Female', 'Male'])
plt.rcParams.update({'font.size': 13})
plt.show()
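By default `sns.barplot` draws the mean of the y variable for each category, so the bar heights in the grid above are churn rates per gender within each SeniorCitizen group. The same numbers can be read off with a groupby. The sketch below uses a hypothetical eight-row frame, not the real data.

```python
import pandas as pd

# Hypothetical toy frame with the three columns used in the plot.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1, 1, 1],
    "gender": ["Female", "Female", "Male", "Male",
               "Female", "Female", "Male", "Male"],
    "Churn": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"],
})

# Encode churn as 0/1; the mean of this column is the churn rate.
toy["churn_rate"] = toy["Churn"].map({"No": 0, "Yes": 1})

# Churn rate for every (SeniorCitizen, gender) combination.
rates = toy.groupby(["SeniorCitizen", "gender"])["churn_rate"].mean()

print(rates)
```

Printing the table alongside the chart makes it easier to quote exact rates when writing up conclusions.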
