Pandas Data Grouping: Operations Related to pd.groupby (Part 2)
Data preparation
import pandas as pd
# Suppose 10 people each took 4 courses and received the corresponding scores.
# The number of projects each person is responsible for is stored in the 'Project_num' column.
data = {'name': pd.Series(['Alice','Bob','Cathy','Dany','Ella','Ford','Gary','Ham','Ico','Jack']),
        'Math_A': pd.Series([1.1,2.2,3.3,4.4,5,3.2,2.4,1.5,4.3,4.5]),
        'English_A': pd.Series([3,2.6,2,1.7,3,3.3,4.4,5,3.2,2.4]),
        'Math_B': pd.Series([1.7,2.5,3.6,2.4,5,2.2,3.3,4.4,1.5,4.3]),
        'English_B': pd.Series([5,2.6,2.4,1.3,3,3.6,2.4,5,2.2,3.1]),
        'Project_num': pd.Series([2,3,0,1,7,2,1,5,3,4]),
        'Sex': pd.Series(['F','M','M','F','M','F','M','M','F','M'])
        }
df = pd.DataFrame(data)
print(df)
Output:
    name  Math_A  English_A  Math_B  English_B  Project_num Sex
0  Alice     1.1        3.0     1.7        5.0            2   F
1    Bob     2.2        2.6     2.5        2.6            3   M
2  Cathy     3.3        2.0     3.6        2.4            0   M
3   Dany     4.4        1.7     2.4        1.3            1   F
4   Ella     5.0        3.0     5.0        3.0            7   M
5   Ford     3.2        3.3     2.2        3.6            2   F
6   Gary     2.4        4.4     3.3        2.4            1   M
7    Ham     1.5        5.0     4.4        5.0            5   M
8    Ico     4.3        3.2     1.5        2.2            3   F
9   Jack     4.5        2.4     4.3        3.1            4   M
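1. Shifting data: df.shift
1.1 Shifting up and down
# Shift the whole table vertically. Shifting down is like inserting a blank row at the top:
# the index stays the same, so the last row has no index position left and is dropped.
print(df.shift(1))   # shift down by 1 row
print('\n')
print(df.shift(-2))  # shift up by 2 rows
print('\n')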
Output:
    name  Math_A  English_A  Math_B  English_B  Project_num  Sex
0    NaN     NaN        NaN     NaN        NaN          NaN  NaN
1  Alice     1.1        3.0     1.7        5.0          2.0    F
2    Bob     2.2        2.6     2.5        2.6          3.0    M
3  Cathy     3.3        2.0     3.6        2.4          0.0    M
4   Dany     4.4        1.7     2.4        1.3          1.0    F
5   Ella     5.0        3.0     5.0        3.0          7.0    M
6   Ford     3.2        3.3     2.2        3.6          2.0    F
7   Gary     2.4        4.4     3.3        2.4          1.0    M
8    Ham     1.5        5.0     4.4        5.0          5.0    M
9    Ico     4.3        3.2     1.5        2.2          3.0    F

    name  Math_A  English_A  Math_B  English_B  Project_num  Sex
0  Cathy     3.3        2.0     3.6        2.4          0.0    M
1   Dany     4.4        1.7     2.4        1.3          1.0    F
2   Ella     5.0        3.0     5.0        3.0          7.0    M
3   Ford     3.2        3.3     2.2        3.6          2.0    F
4   Gary     2.4        4.4     3.3        2.4          1.0    M
5    Ham     1.5        5.0     4.4        5.0          5.0    M
6    Ico     4.3        3.2     1.5        2.2          3.0    F
7   Jack     4.5        2.4     4.3        3.1          4.0    M
8    NaN     NaN        NaN     NaN        NaN          NaN  NaN
9    NaN     NaN        NaN     NaN        NaN          NaN  NaN
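A common reason to shift rows is to compare each row with the one before it. The following is a minimal sketch on a small Series (the values are borrowed from the first five entries of 'Project_num'; the variable names are illustrative only). It also shows why the integer column above turned into floats: the NaN that fills the gap forces a float dtype unless a fill_value is supplied.

```python
import pandas as pd

# Illustrative data: the first five 'Project_num' values from the table above.
projects = pd.Series([2, 3, 0, 1, 7], name="Project_num")

shifted = projects.shift(1)               # row 0 becomes NaN, dtype becomes float64
filled = projects.shift(1, fill_value=0)  # fill the gap with 0, dtype stays int64
change = projects - projects.shift(1)     # difference from the previous row

print(shifted.tolist())
print(filled.tolist())
print(change.tolist())
```
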
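1.2 Shifting left and right
print(df.shift(1, axis=1))   # shift right by 1 column; cells with an incompatible dtype show NaN
print('\n')
print(df.shift(-2, axis=1))  # shift left by 2 columns; cells with an incompatible dtype show NaN
print('\n')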
Output:
  name  Math_A  English_A  Math_B  English_B  Project_num    Sex
0  NaN     NaN        1.1     3.0        1.7          NaN  Alice
1  NaN     NaN        2.2     2.6        2.5          NaN    Bob
2  NaN     NaN        3.3     2.0        3.6          NaN  Cathy
3  NaN     NaN        4.4     1.7        2.4          NaN   Dany
4  NaN     NaN        5.0     3.0        5.0          NaN   Ella
5  NaN     NaN        3.2     3.3        2.2          NaN   Ford
6  NaN     NaN        2.4     4.4        3.3          NaN   Gary
7  NaN     NaN        1.5     5.0        4.4          NaN    Ham
8  NaN     NaN        4.3     3.2        1.5          NaN    Ico
9  NaN     NaN        4.5     2.4        4.3          NaN   Jack

  name  Math_A  English_A  Math_B  English_B  Project_num  Sex
0  NaN     1.7        5.0     NaN        NaN          NaN  NaN
1  NaN     2.5        2.6     NaN        NaN          NaN  NaN
2  NaN     3.6        2.4     NaN        NaN          NaN  NaN
3  NaN     2.4        1.3     NaN        NaN          NaN  NaN
4  NaN     5.0        3.0     NaN        NaN          NaN  NaN
5  NaN     2.2        3.6     NaN        NaN          NaN  NaN
6  NaN     3.3        2.4     NaN        NaN          NaN  NaN
7  NaN     4.4        5.0     NaN        NaN          NaN  NaN
8  NaN     1.5        2.2     NaN        NaN          NaN  NaN
9  NaN     4.3        3.1     NaN        NaN          NaN  NaN
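The mixed-dtype result above can be hard to read, because string and numeric columns cannot exchange values. As a simpler sketch, assuming we restrict the frame to numeric columns of a single dtype first (df_num is a hypothetical two-row excerpt of the score columns), a horizontal shift just moves each value one column to the right:

```python
import pandas as pd

# Hypothetical excerpt: the first two rows of three score columns, all float.
df_num = pd.DataFrame({
    "Math_A": [1.1, 2.2],
    "English_A": [3.0, 2.6],
    "Math_B": [1.7, 2.5],
})

# Each value moves one column to the right; the first column has nothing to
# receive and becomes NaN, and the old last column falls off the edge.
right = df_num.shift(1, axis=1)
print(right)
```
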
1.3 Shifting grouped data
for index, data in df.groupby(by='Sex'):
    print(index)
    print(data.shift(1))
    print('\n')
Output:
F
    name  Math_A  English_A  Math_B  English_B  Project_num  Sex
0    NaN     NaN        NaN     NaN        NaN          NaN  NaN
3  Alice     1.1        3.0     1.7        5.0          2.0    F
5   Dany     4.4        1.7     2.4        1.3          1.0    F
8   Ford     3.2        3.3     2.2        3.6          2.0    F

M
    name  Math_A  English_A  Math_B  English_B  Project_num  Sex
1    NaN     NaN        NaN     NaN        NaN          NaN  NaN
2    Bob     2.2        2.6     2.5        2.6          3.0    M
4  Cathy     3.3        2.0     3.6        2.4          0.0    M
6   Ella     5.0        3.0     5.0        3.0          7.0    M
7   Gary     2.4        4.4     3.3        2.4          1.0    M
9    Ham     1.5        5.0     4.4        5.0          5.0    M
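The loop above prints one shifted table per group. If the goal is instead to attach the "previous value within the same group" as a new column, shift can be called on the groupby object directly, and the result stays aligned with the original index. A minimal sketch, using a hypothetical five-row excerpt of the data and an illustrative column name prev_in_group:

```python
import pandas as pd

# Hypothetical excerpt: the first five rows of the example data.
df_small = pd.DataFrame({
    "name": ["Alice", "Bob", "Cathy", "Dany", "Ella"],
    "Project_num": [2, 3, 0, 1, 7],
    "Sex": ["F", "M", "M", "F", "M"],
})

# For each row, take the Project_num of the previous row of the same sex.
# The first row of every group has no predecessor and gets NaN.
df_small["prev_in_group"] = df_small.groupby("Sex")["Project_num"].shift(1)
print(df_small)
```
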
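2. Rolling windows: df.rolling
2.1 Rolling sum
print(df.rolling(window=3, min_periods=1, center=False, axis=0).sum())
# A window of length 3 rolls from top to bottom, one row at a time, and the values covered by each window are summed.
# For Math_A this gives 1.1, 1.1+2.2=3.3, 1.1+2.2+3.3=6.6, 2.2+3.3+4.4=9.9, 3.3+4.4+5=12.7, ...
# center controls whether each window is centered on the current row instead of ending at it.
# With center=True we would get 1.1+2.2=3.3, 1.1+2.2+3.3=6.6, 2.2+3.3+4.4=9.9, ...
print('\n')
# Example use: a rolling sum of the total trading volume over the past 30 days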
Output:
   Math_A  English_A  Math_B  English_B  Project_num
0     1.1        3.0     1.7        5.0          2.0
1     3.3        5.6     4.2        7.6          5.0
2     6.6        7.6     7.8       10.0          5.0
3     9.9        6.3     8.5        6.3          4.0
4    12.7        6.7    11.0        6.7          8.0
5    12.6        8.0     9.6        7.9         10.0
6    10.6       10.7    10.5        9.0         10.0
7     7.1       12.7     9.9       11.0          8.0
8     8.2       12.6     9.2        9.6          9.0
9    10.3       10.6    10.2       10.3         12.0
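To make the effect of center concrete, here is a small sketch that runs both settings on the first five Math_A values (the variable names trailing and centered are illustrative only). With center=False the window ends at the current row; with center=True it extends one row to each side of it.

```python
import pandas as pd

# The first five Math_A values from the example data.
math_a = pd.Series([1.1, 2.2, 3.3, 4.4, 5.0])

# Window ends at the current row: rows (i-2, i-1, i).
trailing = math_a.rolling(window=3, min_periods=1, center=False).sum()
# Window is centered on the current row: rows (i-1, i, i+1).
centered = math_a.rolling(window=3, min_periods=1, center=True).sum()

print(trailing.round(1).tolist())
print(centered.round(1).tolist())
```

Because min_periods=1, the incomplete windows at the edges still produce a value rather than NaN.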
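2.2 Rolling mean
print(df.groupby(['Sex'])['Project_num'].rolling(window=3).mean())
# By default min_periods=None, so a value is produced only once the window is full;
# that is why the first two values in each group are NaN.
# Here we first group by sex, then within each group take 3 people at a time, in order,
# and compute the average number of projects they are responsible for.
print('\n')
# This example has little practical meaning. In practice rolling is mostly used on
# time-series data, for instance to compute a moving average.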
Output:
Sex
F    0         NaN
     3         NaN
     5    1.666667
     8    2.000000
M    1         NaN
     2         NaN
     4    3.333333
     6    2.666667
     7    4.333333
     9    3.333333
Name: Project_num, dtype: float64
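The leading NaN values come entirely from min_periods. As a sketch on a hypothetical six-row excerpt (three people per sex), the same grouped rolling mean can be run with and without min_periods=1 to see the difference; strict and lenient are illustrative names.

```python
import pandas as pd

# Hypothetical excerpt: three F rows (index 0, 3, 5) and three M rows (index 1, 2, 4).
df_small = pd.DataFrame({
    "Project_num": [2, 3, 0, 1, 7, 2],
    "Sex": ["F", "M", "M", "F", "M", "F"],
})

grouped = df_small.groupby("Sex")["Project_num"]

# Default: a value appears only once 3 observations are available in the group.
strict = grouped.rolling(window=3).mean()
# min_periods=1: average over however many observations the window holds so far.
lenient = grouped.rolling(window=3, min_periods=1).mean()

print(strict)
print(lenient)
```

The result is indexed by (Sex, original row label), which is why the output above shows two index levels.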
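3. Ranking: df.rank
3.1 Overall ranking
df_2 = df.set_index('name')  # use the name column as the index
print(df_2)
print('\n')
for index, data in df_2.groupby(['Sex']):
    # show the data grouped by sex
    print(index)
    print(data)
    print('\n')
df_3 = df_2.rank(ascending=False)
# Rank every person within each column, here from largest to smallest.
# Tied values receive the average of their ranks: two values tied for places 2 and 3 both get 2.5.
print(df_3)
print('\n')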
Output:
       Math_A  English_A  Math_B  English_B  Project_num Sex
name
Alice     1.1        3.0     1.7        5.0            2   F
Bob       2.2        2.6     2.5        2.6            3   M
Cathy     3.3        2.0     3.6        2.4            0   M
Dany      4.4        1.7     2.4        1.3            1   F
Ella      5.0        3.0     5.0        3.0            7   M
Ford      3.2        3.3     2.2        3.6            2   F
Gary      2.4        4.4     3.3        2.4            1   M
Ham       1.5        5.0     4.4        5.0            5   M
Ico       4.3        3.2     1.5        2.2            3   F
Jack      4.5        2.4     4.3        3.1            4   M

       Math_A  English_A  Math_B  English_B  Project_num  Sex
name
Alice    10.0        5.5     9.0        1.5          6.5  8.5
Bob       8.0        7.0     6.0        6.0          4.5  3.5
Cathy     5.0        9.0     4.0        7.5         10.0  3.5
Dany      3.0       10.0     7.0       10.0          8.5  8.5
Ella      1.0        5.5     1.0        5.0          1.0  3.5
Ford      6.0        3.0     8.0        3.0          6.5  8.5
Gary      7.0        2.0     5.0        7.5          8.5  3.5
Ham       9.0        1.0     2.0        1.5          2.0  3.5
Ico       4.0        4.0    10.0        9.0          4.5  8.5
Jack      2.0        8.0     3.0        4.0          3.0  3.5
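The averaged ranks such as 1.5 and 7.5 come from the default tie rule, method='average'. When whole-number ranks are preferred, rank accepts other tie rules through its method parameter. A short sketch on five of the English_B scores (the selection of rows is only for illustration):

```python
import pandas as pd

# Five English_B scores from the example data, containing two pairs of ties.
english_b = pd.Series(
    [5.0, 2.6, 2.4, 2.4, 5.0],
    index=["Alice", "Bob", "Cathy", "Gary", "Ham"],
)

avg = english_b.rank(ascending=False)                    # ties share the mean rank
low = english_b.rank(ascending=False, method="min")      # ties share the lowest rank
dense = english_b.rank(ascending=False, method="dense")  # like "min", but no gaps after ties

print(pd.DataFrame({"average": avg, "min": low, "dense": dense}))
```
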
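3.2 Ranking within groups, by column
for index, data in df_2.groupby(['Sex']):
    # show the data grouped by sex
    print(index)
    print(data)
    print('\n')
df_4 = df_2.groupby(['Sex']).rank(ascending=False).sort_values('English_A')
# After grouping by sex, rank every person within their own group (separately for each column), from largest to smallest.
# Then sort by the 'English_A' rank, so the highest scorers in each group come first.
print(df_4)
print('\n')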
Output:
F
       Math_A  English_A  Math_B  English_B  Project_num Sex
name
Alice     1.1        3.0     1.7        5.0            2   F
Dany      4.4        1.7     2.4        1.3            1   F
Ford      3.2        3.3     2.2        3.6            2   F
Ico       4.3        3.2     1.5        2.2            3   F

M
       Math_A  English_A  Math_B  English_B  Project_num Sex
name
Bob       2.2        2.6     2.5        2.6            3   M
Cathy     3.3        2.0     3.6        2.4            0   M
Ella      5.0        3.0     5.0        3.0            7   M
Gary      2.4        4.4     3.3        2.4            1   M
Ham       1.5        5.0     4.4        5.0            5   M
Jack      4.5        2.4     4.3        3.1            4   M

       Math_A  English_A  Math_B  English_B  Project_num
name
Ford      3.0        1.0     2.0        2.0          2.5
Ham       6.0        1.0     2.0        1.0          2.0
Gary      4.0        2.0     5.0        5.5          5.0
Ico       2.0        2.0     4.0        3.0          1.0
Alice     4.0        3.0     3.0        1.0          2.5
Ella      1.0        3.0     1.0        3.0          1.0
Bob       5.0        4.0     6.0        4.0          4.0
Dany      1.0        4.0     1.0        4.0          4.0
Jack      2.0        5.0     3.0        2.0          3.0
Cathy     3.0        6.0     4.0        5.5          6.0
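When only one column matters, the within-group rank can be computed for that column alone and stored next to the original scores, which avoids losing the 'Sex' column as happens in the table above. A minimal sketch on a hypothetical five-row excerpt, with rank_in_group as an illustrative column name:

```python
import pandas as pd

# Hypothetical excerpt: English_A scores of the first five people.
df_small = pd.DataFrame(
    {"English_A": [3.0, 2.6, 2.0, 1.7, 3.0], "Sex": ["F", "M", "M", "F", "M"]},
    index=pd.Index(["Alice", "Bob", "Cathy", "Dany", "Ella"], name="name"),
)

# Rank English_A from largest to smallest, separately within each sex.
df_small["rank_in_group"] = (
    df_small.groupby("Sex")["English_A"].rank(ascending=False)
)
print(df_small.sort_values(["Sex", "rank_in_group"]))
```
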
3.3 Restricting ranks to the range 0~1: pct
df_5 = df_2['English_A'].rank(ascending=True, pct=True)
# pct=True expresses every rank as a fraction within the range 0~1
# ascending=True ranks from smallest to largest
print(df_5)
print('\n')
Output: