
Using gather: three important functions in R's tidyr package (gather, spread, ...)

Updated: 2023-05-25 22:02:49



tidyr is a very useful and frequently used package written by Hadley Wickham (the author of Tidy Data). It is often used together with the dplyr package, which he also wrote.

Preparation:

First install the tidyr package (the quotes are required, otherwise R reports an error)

install.packages("tidyr")

Load tidyr (the quotes are optional here)

library(tidyr)

gather()

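The gather function is similar to the pivot-table features in Excel (2016 and later), working in the unpivot direction: it converts a two-dimensional table whose column names contain values of a variable into a normalized table (like a relation in a database; see the example).

Let's first run ?gather and look at the official documentation: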

gather {tidyr} R Documentation

Gather columns into key-value pairs.

Description

Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.

Usage

gather(data, key = "key", value = "value", ..., na.rm = FALSE,
  convert = FALSE, factor_key = FALSE)

Arguments

data

A data frame.

key, value

Names of new key and value columns, as strings or symbols.

This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with rlang::ensym() (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).

... (this is an argument)

A selection of columns. If empty, all variables are selected. You can supply bare variable names, select all variables between x and z with x:z, exclude y with -y. For more options, see the dplyr::select() documentation. See also the section on selection rules below.

na.rm

If TRUE, will remove rows from output where the value column is NA.

convert

If TRUE will automatically run type.convert() on the key column. This is useful if the column types are actually numeric, integer, or logical.

factor_key

If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.

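Notes:

The first argument is the source data, which must be a data frame;

Next come the key and value names, which you choose yourself; they become the column headers (the two variable names) of the converted table;

The fourth argument selects the columns to gather; if it is omitted, all columns are gathered;

Optional arguments can follow, such as na.rm: with na.rm = TRUE, rows whose value is missing (NA) in the original table are dropped from the new table.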
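The effect of the key/value names, the column selection and na.rm can be seen on a tiny table. This is only a sketch: the data frame df and the column names id, x, y, name and val are made up for illustration.

```r
library(tidyr)

# Hypothetical data: x and y are column names that really encode a variable
df <- data.frame(id = c(1, 2), x = c(10, NA), y = c(30, 40))

# key column is called "name", value column "val"; -id keeps id out of the gathering
long <- gather(df, name, val, -id)

# Same call, but rows whose value is NA are dropped
long_no_na <- gather(df, name, val, -id, na.rm = TRUE)

print(long)
print(long_no_na)
```

Without na.rm the result keeps one row per original cell; with na.rm = TRUE the row for the missing x value disappears.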

gather() example

First build a data frame, stu:

stu

The meaning of this data frame is what you would expect: the number of students by grade and gender.
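A minimal sketch of how stu could be built, consistent with the female and male counts quoted in this example. The grade labels A to E are an assumption, since the original values are not shown.

```r
# Sketch of the stu data frame; only the female/male counts come from the text
stu <- data.frame(
  grade  = c("A", "B", "C", "D", "E"),  # assumed labels
  female = c(5, 4, 1, 2, 3),
  male   = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

print(stu)
```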

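The columns female and male are an instance of what was described above, column names that contain values of a variable: female and male should be values of the variable "gender", and the counts beneath them should sit under a variable (attribute) named "count". We need to keep the original grade column, drop the female and male columns, and add two columns, gender and count, whose values correspond to the original table. Use gather: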

gather(stu, gender, count,-grade)

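The result is as follows; rows and columns are now converted. The first argument is the source data stu, the second and third are the key-value pair (gender, count), and the fourth, -grade, excludes the grade column so that only the remaining two columns are gathered.

Looking only at those two columns, the original table maps to these pairs: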

(female, 5), (female, 4), (female, 1), (female, 2), (female, 3)

(male, 1), (male, 2), (male, 3), (male, 4), (male, 5),

That is, the original variable (attribute) names become the keys and the variable values become the values.
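The key-value pairs listed above can be reproduced with a short, self-contained run. The grade labels A to E are an assumption; the counts and the gather call are the ones from this example.

```r
library(tidyr)

# stu rebuilt for this sketch; grade labels are assumed
stu <- data.frame(
  grade  = c("A", "B", "C", "D", "E"),
  female = c(5, 4, 1, 2, 3),
  male   = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Column names female/male become values of gender; cell values go to count
stu_long <- gather(stu, gender, count, -grade)

print(stu_long)
```

The result has one row per (grade, gender) combination: the five female rows first, then the five male rows.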

You can then continue with normal statistical analysis.

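separate()

separate splits data: it separates a column that holds two variables at once (in the gather example above, it was the attribute names that formed a variable, one attribute name per value). Straight to an example:

separate() example

Build a new data frame, stu2: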

stu2

female_1=c(5, 4, 1, 2, 3), male_1=c(1, 2, 3, 4, 5),

female_2=c(4, 5, 1, 2, 3), male_2=c(0, 2, 3, 4, 6))

It looks much like stu above; the 1 and 2 after the gender denote the class.

First reshape it with the gather function as before:

stu2_new

No explanation needed, it works the same as above. The result is as follows:
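A sketch of this step, since the full statements are not shown. The four count columns are the ones given for stu2; the grade labels and the value column name count are assumptions, while gender_class is the key name used in the text.

```r
library(tidyr)

# stu2 rebuilt for this sketch; grade labels are assumed
stu2 <- data.frame(
  grade    = c("A", "B", "C", "D", "E"),
  female_1 = c(5, 4, 1, 2, 3), male_1 = c(1, 2, 3, 4, 5),
  female_2 = c(4, 5, 1, 2, 3), male_2 = c(0, 2, 3, 4, 6),
  stringsAsFactors = FALSE
)

# Gather the four gender/class columns into one key column and one value column
stu2_new <- gather(stu2, gender_class, count, -grade)

print(stu2_new)
```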

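But this is still not a normalized table: one column (gender_class) holds values that contain more than one attribute (variable). Use separate() to split it. Its usage is:

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)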

The first argument is the data frame to split;

The second argument is the column to split;

The third argument gives the names of the resulting columns (necessarily more than one), as a vector;

The fourth argument is the separator, given as a regular expression, or as a number indicating the position at which to split (the documentation says:

If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.)
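The two kinds of sep described in the documentation can be compared side by side. This is an illustrative sketch: the data frames pairs and codes and their column names are hypothetical.

```r
library(tidyr)

# Hypothetical inputs
pairs <- data.frame(pair = c("female_1", "male_2"), stringsAsFactors = FALSE)
codes <- data.frame(code = c("ab12", "cd34"), stringsAsFactors = FALSE)

# Character sep: a regular expression, here a literal underscore
by_regex <- separate(pairs, pair, c("gender", "class"), sep = "_")

# Numeric sep: split after the 2nd character, counting from the left
by_position <- separate(codes, code, c("prefix", "number"), sep = 2)

print(by_regex)
print(by_position)
```

Both calls produce two character columns, because convert = FALSE is the default.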

The remaining arguments are not covered one by one here; see the documentation.

What we need to do now is split the gender_class column:

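separate(stu2_new, gender_class, c("gender", "class"))

Note that the third argument is a vector, written with c(). The fourth argument would be "_", but it is omitted here because the default sep is a regular expression that matches any run of non-alphanumeric characters, and that includes the underscore.

The result is as follows: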
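A self-contained sketch of this step, also checking that the default separator and an explicit "_" give the same result. As before, the grade labels and the value column name count are assumptions.

```r
library(tidyr)

# Rebuild stu2 and its gathered form (grade labels and "count" are assumed)
stu2 <- data.frame(
  grade    = c("A", "B", "C", "D", "E"),
  female_1 = c(5, 4, 1, 2, 3), male_1 = c(1, 2, 3, 4, 5),
  female_2 = c(4, 5, 1, 2, 3), male_2 = c(0, 2, 3, 4, 6),
  stringsAsFactors = FALSE
)
stu2_new <- gather(stu2, gender_class, count, -grade)

# Default sep: any run of non-alphanumeric characters, which matches "_"
stu2_sep <- separate(stu2_new, gender_class, c("gender", "class"))

# Explicit sep for comparison
stu2_sep_explicit <- separate(stu2_new, gender_class, c("gender", "class"), sep = "_")

print(stu2_sep)
print(identical(stu2_sep, stu2_sep_explicit))
```

The class column comes back as character ("1", "2"); passing convert = TRUE would turn it into integers.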

spread()

spread widens a table: it splits the values of one column (key-value pairs) out into several columns.

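spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

key is the name of the column to split (a variable name), and value is the column of the original table whose values fill the new columns.

Straight to an example:

spread() example

Build the data frame stu3: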

stu3 is built from seven vectors: name, test, class1, class2, class3, class4, class5.

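There are 5 courses in total; each student takes two, and midterm and final grades are listed.

Clearly the original table is untidy data: the header contains a variable (class1-5), so start with the gather function. Note that there are many missing values here, so the na.rm = TRUE argument described above is useful; it automatically drops records with missing values (one record is one row):

Without na.rm = TRUE, the result looks like this:

(output truncated)

Analyzing the "NA" grades of courses a student did not take is meaningless, so in this case records with missing values should be discarded.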
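The difference na.rm makes can be checked on a stand-in for stu3. Everything in this sketch is hypothetical except the column layout (name, test, class1 to class5) and the rule that each of five students takes two of the five courses: the student names and letter grades are invented.

```r
library(tidyr)

# Hypothetical stand-in for stu3: 5 students x 2 tests, 2 courses each
stu3 <- data.frame(
  name   = rep(c("S1", "S2", "S3", "S4", "S5"), each = 2),
  test   = rep(c("midterm", "final"), times = 5),
  class1 = c("A", "C", NA, NA, NA, NA, NA, NA, "B", "B"),
  class2 = c(NA, NA, "D", "E", "C", "A", NA, NA, NA, NA),
  class3 = c("B", "C", NA, NA, NA, NA, "C", "C", NA, NA),
  class4 = c(NA, NA, "A", "A", NA, NA, NA, NA, NA, NA),
  class5 = c(NA, NA, NA, NA, "B", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# Keeping NA rows: one row per student, test and course, taken or not
with_na <- gather(stu3, class, grade, class1:class5)

# Dropping NA rows: only courses the student actually took remain
without_na <- gather(stu3, class, grade, class1:class5, na.rm = TRUE)

print(nrow(with_na))
print(nrow(without_na))
```

With this layout the first call yields 50 rows and the second 20, four per student.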

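The table now looks tidy, but every student has four records; within each course only test and grade differ, while the name and the class are the same. We also often need to analyze midterm and final grades separately, and the current layout is inconvenient for that.

Use the spread function to split the test column into two columns, midterm and final, whose values are the grades for the two selected courses.

To repeat: the second argument is the name of the column to split, and the third is the name of the column in the original table that supplies the values for the new columns.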

stu3_new

spread(stu3_new,test,grade)

The result is as follows:

We now have a very tidy table with only 10 rows, which is much easier to work with.
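The whole gather-then-spread sequence can be run end to end on stand-in data. The student names and letter grades below are hypothetical; the column names class and grade and the spread call are the ones used in this example.

```r
library(tidyr)

# Hypothetical stand-in for stu3: 5 students x 2 tests, 2 courses each
stu3 <- data.frame(
  name   = rep(c("S1", "S2", "S3", "S4", "S5"), each = 2),
  test   = rep(c("midterm", "final"), times = 5),
  class1 = c("A", "C", NA, NA, NA, NA, NA, NA, "B", "B"),
  class2 = c(NA, NA, "D", "E", "C", "A", NA, NA, NA, NA),
  class3 = c("B", "C", NA, NA, NA, NA, "C", "C", NA, NA),
  class4 = c(NA, NA, "A", "A", NA, NA, NA, NA, NA, NA),
  class5 = c(NA, NA, NA, NA, "B", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# Long form: one row per student, test and course actually taken
stu3_new <- gather(stu3, class, grade, class1:class5, na.rm = TRUE)

# Wide form: the values of test become columns, filled from grade
stu3_wide <- spread(stu3_new, test, grade)

print(stu3_wide)
```

Each row of stu3_wide is now one student in one course, with the midterm and final grades side by side.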

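One last addition: the class column now looks somewhat redundant, and plain numbers would be more concise. Use parse_number() from the readr package to extract the number (dplyr's mutate function is also used). The code: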

install.packages("dplyr")

install.packages("readr")

library(readr)

library(dplyr)

mutate(spread(stu3_new, test, grade), class = parse_number(class))

Final result:
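What parse_number() does to the class column can be seen in isolation. The small data frame here is hypothetical and stands in for the spread result.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the spread result
wide <- data.frame(
  name    = c("S1", "S1"),
  class   = c("class1", "class3"),
  final   = c("C", "C"),
  midterm = c("A", "B"),
  stringsAsFactors = FALSE
)

# parse_number() keeps the numeric part of each string and drops the rest
wide_num <- mutate(wide, class = parse_number(class))

print(wide_num)
```

After this step class is a numeric column (1 and 3 here) rather than text.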

Nice and tidy, isn't it? (*╹▽╹*)

————————————————

Published: 2023-05-25 22:02:49

Article link: https://www.wtabcd.cn/fanwen/fan/90/122588.html
