Data Analysis and Machine Learning in Practice, Project 1: Association Rules
1. Key concepts
Let's first pin down a few terms:
transaction, item, itemset
An association rule has the form X ==> Y.
**Support:** how frequently an itemset or rule appears across all transactions; σ(X) denotes the support count of itemset X.
**Confidence:** how frequently Y appears in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X).
For example, for the rule (X, Y) ==> Z:
Support: the probability that a transaction contains {X, Y, Z}.
Confidence: the conditional probability that a transaction containing {X, Y} also contains Z.
If the support and confidence of a rule X -> Y are greater than or equal to the user-specified minimum support (minsupport) and minimum confidence (minconfidence), then X -> Y is called a strong association rule; otherwise it is called a weak association rule.
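To make support and confidence concrete, here is a small hand computation in Python on five made-up transactions (the basket contents and counts are illustrative only, not from the project data):
# five made-up transactions over the items X, Y, Z
transactions = [{'X', 'Y', 'Z'}, {'X', 'Y'}, {'X', 'Z'}, {'Y', 'Z'}, {'X', 'Y', 'Z'}]
n = len(transactions)
count_xy = sum({'X', 'Y'} <= t for t in transactions)        # σ({X, Y})    = 3
count_xyz = sum({'X', 'Y', 'Z'} <= t for t in transactions)  # σ({X, Y, Z}) = 2
support_xyz = count_xyz / n             # 2/5 = 0.4
confidence_xy_z = count_xyz / count_xy  # 2/3 ≈ 0.67, confidence of {X, Y} -> Z
print(support_xyz, confidence_xy_z)
With, say, minsupport = 0.4 and minconfidence = 0.6, the rule {X, Y} -> Z in this toy table would qualify as a strong rule.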
**Lift:** how much the appearance of itemset A changes the probability that itemset B appears:
lift(A ==> B) = confidence(A ==> B) / support(B) = P(B|A) / P(B)
Lift patches a weakness of confidence, which ignores how common B is on its own: if lift = 1, A and B are independent and A does not make B any more likely; the larger the value (lift > 1), the more A boosts B and the stronger the association.
Leverage and conviction play a similar role to lift: the larger the value, the stronger the association.
Leverage(A ==> B) = P(A, B) − P(A)P(B)
Conviction(A ==> B) = P(A)P(!B) / P(A, !B)
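Sticking with the five toy transactions above and the rule A -> B with A = {X, Y} and B = {Z}, the three measures are straightforward arithmetic on the estimated probabilities (a sketch of the formulas only, not mlxtend's implementation):
p_a = 3 / 5    # P(A)    = support of {X, Y}
p_b = 4 / 5    # P(B)    = support of {Z}
p_ab = 2 / 5   # P(A, B) = support of {X, Y, Z}
confidence = p_ab / p_a                        # ≈ 0.667
lift = confidence / p_b                        # ≈ 0.833
leverage = p_ab - p_a * p_b                    # -0.08
conviction = (p_a * (1 - p_b)) / (p_a - p_ab)  # 0.6
print(lift, leverage, conviction)
Here all three land below their neutral values (lift < 1, leverage < 0, conviction < 1); in this made-up table {X, Y} actually makes Z slightly less likely, which is exactly the kind of rule a min_threshold of 1 on lift filters out.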
2. Working with the code
1) Use the mlxtend package to obtain frequent itemsets and rules
pip install mlxtend
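Once installed, the two functions used throughout this project are imported like this (a quick sanity check; the printed version depends on whatever pip installed):
import mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
print(mlxtend.__version__)  # confirms the package is importable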
2) Set a support threshold to select frequent itemsets.
With a minimum support of 50%:
apriori(df, min_support=0.5, use_colnames=True)
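As a minimal sketch of what apriori expects, here are the five toy transactions from the concepts section as a one-hot DataFrame (True means the item is in the transaction); the table is made up for illustration:
import pandas as pd
from mlxtend.frequent_patterns import apriori
df = pd.DataFrame({'X': [True, True, True, False, True],
                   'Y': [True, True, False, True, True],
                   'Z': [True, False, True, True, True]})
# keep only itemsets that appear in at least 50% of the transactions
frequent = apriori(df, min_support=0.5, use_colnames=True)
print(frequent)  # two columns: support, itemsets (frozensets of column names)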
3) Compute the rules
association_rules(df, metric='lift', min_threshold=1)
Different metrics and minimum thresholds can be specified.
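Continuing that sketch end to end, the rules step works the same way with a different metric, here confidence with an arbitrary 0.7 floor (both the toy table and the threshold are my own choices):
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
df = pd.DataFrame({'X': [True, True, True, False, True],
                   'Y': [True, True, False, True, True],
                   'Z': [True, False, True, True, True]})
frequent = apriori(df, min_support=0.5, use_colnames=True)
# keep only rules whose confidence is at least 0.7
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
# the result has antecedents/consequents plus the metric columns
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])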
4) pandas display settings
Set the maximum column width:
pd.options.display.max_colwidth = 100
Show all columns:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_columns', 5)  # show at most five columns
Show all rows:
pd.set_option('display.max_rows', None)
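If the wider display is only needed while printing the rules, pandas also offers a scoped variant that restores the defaults afterwards; an optional alternative to the global settings above:
import pandas as pd
# relax the display limits only inside the with-block, then fall back to the defaults
with pd.option_context('display.max_columns', None, 'display.max_colwidth', 100):
    print(pd.DataFrame({'text': ['x' * 120], 'value': [1]}))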
5) Split strings and return an indicator (dummy) matrix of the resulting values
Series.str.get_dummies(sep='|')
Parameters:
sep : string, default '|'
Returns:
DataFrame
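A quick demonstration of what str.get_dummies returns, on a made-up '|' separated genre column of the same shape used later in Experiment 3:
import pandas as pd
genres = pd.Series(['Action|Comedy', 'Comedy|Romance', 'Action'])
# one row per original string, one 0/1 column per distinct value
print(genres.str.get_dummies('|'))
#    Action  Comedy  Romance
# 0       1       1        0
# 1       0       1        1
# 2       1       0        0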
3. Experiments
Experiment 1:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
data ={'ID':[1, 2, 3, 4, 5, 6],
'Onion':[1, 0, 0, 1, 1, 1],
'Potato':[1, 1, 0, 1, 1, 1],
'Burger':[1, 1, 0, 0, 1, 1],
'Milk':[0, 1, 1, 1, 0, 1],
'Beer':[0, 0, 1, 0, 1, 0]}
df= pd.DataFrame(data)
df= df[['ID', 'Onion', 'Potato', 'Burger', 'Milk', 'Beer']]
print(df)
# find the frequent itemsets
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer']], min_support=0.50, use_colnames=True)
print(frequent_itemsets)
# association rules
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
print(rules)
# select the interesting rules
rules_select = rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.8)]
print(rules_select)
Experiment 2: the data first has to be converted to one-hot encoding
retail_shopping_basket ={'ID':[1, 2, 3, 4, 5, 6],
'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
['Soda', 'Chips', 'Milk'],
['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
['Soda', 'Coffee', 'Milk', 'Bread'],
['Beer', 'Chips']
]
}
retail = pd.DataFrame(retail_shopping_basket)
retail = retail[['ID', 'Basket']]
pd.options.display.max_colwidth = 100
# print(retail)
# one-hot encoding: join each basket into a string, then expand it into 0/1 columns
retail_id = retail.drop('Basket', axis=1)
retail_Basket = retail.Basket.str.join(',')
retail_Basket = retail_Basket.str.get_dummies(',')
retail = retail_id.join(retail_Basket)
# print(retail)
frequent_itemsets = apriori(retail.drop('ID', axis=1), use_colnames=True)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
print(rules)
Experiment 3: movie genre associations
movies = pd.read_csv('./ml-latest-small/movies.csv')
# print(movies.head(10))
movies_ohe = movies.drop('genres', axis=1).join(movies.genres.str.get_dummies('|'))
pd.options.display.max_columns = 100
# print(movies_ohe.head())
# print(movies_ohe.shape)
movies_ohe.set_index(['movieId', 'title'], inplace=True)
# print(movies_ohe.head(10))
# print(movies_ohe.shape)
frequent_itemsets_movies = apriori(movies_ohe, use_colnames=True, min_support=0.025)
# print(frequent_itemsets_movies)
rules_movies = association_rules(frequent_itemsets_movies, metric='lift', min_threshold=1.25)
# print(rules_movies)
rules_select = rules_movies[rules_movies.lift > 4].sort_values(by=['lift'], ascending=False)
print(rules_select)
4. Results for reference