Python descriptive statistics
Descriptive Statistics
After data collection, most Psychology researchers use different ways to summarise the data. In this tutorial we will learn how to do descriptive statistics in Python. Python, being a programming language, offers many ways to carry out descriptive statistics. Pandas makes data manipulation and summary statistics quite similar to how you would do it in R. I believe that the dataframe in R is very intuitive to use, and Pandas offers a DataFrame object similar to R's. Also, many Psychology researchers may have experience with R.
Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy and SciPy. We start by using Pandas to obtain summary statistics and some variance measures. After that we continue with measures of central tendency (e.g., mean and median) using Pandas and NumPy. The harmonic, geometric, and trimmed means are not available as ready-made methods in Pandas or NumPy, so we use SciPy. Towards the end we learn how to get some measures of variability (e.g., variance using Pandas).
import numpy as np
import pandas as pd  # provides pd.Series, which is used further down
from pandas import DataFrame as df
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean
Simulate response time data
In experimental psychology, response time is often the dependent variable. I simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data also have two independent variables (IVs): “iv1” has 2 levels and “iv2” has 3 levels. The data are simulated at the same time as the dataframe is created, and the first descriptive statistics are obtained using the method describe.
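The simulation described above can be sketched roughly as follows. This is only an illustration: the sample size, the level names ("noise", "quiet"), the cell means, and the spread are all assumed values, not taken from the original experiment.

```python
import numpy as np
import pandas as pd

# All numbers below (subjects, cell means, spread) are illustrative assumptions.
rng = np.random.default_rng(seed=42)
n_subjects = 20
iv1_levels = ["noise", "quiet"]   # iv1 has 2 levels
iv2_levels = [1, 2, 3]            # iv2 has 3 levels

cells = []
for i2, level2 in enumerate(iv2_levels):
    for level1 in iv1_levels:
        # Hypothetical mean response time (ms) for this combination of levels
        mu = 600 + 150 * i2 + (300 if level1 == "noise" else 0)
        rt = rng.normal(loc=mu, scale=110.0, size=n_subjects)
        cells.append(pd.DataFrame({"id": range(n_subjects),
                                   "iv1": level1,
                                   "iv2": level2,
                                   "rt": rt}))

data = pd.concat(cells, ignore_index=True)

# First look at the data: count, mean, std, min, quartiles, and max
summary = data.describe()
print(summary)
```

Seeding the random generator makes the simulated values reproducible between runs.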
Descriptive statistics using Pandas
Pandas outputs summary statistics through the describe method. The output is a table, as you can see below.
Output table of data.describe()
Typically, a researcher is interested in the descriptive statistics for each level of the IVs. Therefore, I group the data by them. Calling describe on the grouped data gives aggregated statistics for each level of each IV. As can be seen from the output, it is somewhat hard to read. Note that the method unstack is used to get the mean, standard deviation (std), etc. as columns, which makes the output somewhat easier to read.
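A minimal sketch of the grouping step, using a small simulated data set whose values are assumptions made for illustration. In recent Pandas versions describe on a grouped column already returns the statistics as columns; the unstack call mentioned in the text was needed in older versions, where the result was a stacked Series.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 x 3 design with 10 observations per cell
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "iv1": np.repeat(["noise", "quiet"], 30),
    "iv2": np.tile(np.repeat([1, 2, 3], 10), 2),
    "rt": rng.normal(800, 110, size=60),
})

# Group by both independent variables
grouped_data = data.groupby(["iv1", "iv2"])

# One row per combination of levels, one column per statistic
desc = grouped_data["rt"].describe()
print(desc)
```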
Output from describe on the grouped data
Central tendency
Often we want to know something about the “average” or “middle” of our data. Using Pandas and NumPy, the two most commonly used measures of central tendency can be obtained: the mean and the median. The mode and trimmed mean can also be obtained using Pandas, but I will use methods from SciPy.
Mean
There are at least two ways of doing this using our grouped data. First, Pandas has the method mean;
But the method aggregate in combination with NumPy's mean can also be used;
Both methods give the same output, but the aggregate method has some advantages that I will explain later.
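The two approaches could look something like the sketch below. The tiny data set is made up for illustration. Note that recent Pandas versions may emit a warning when a NumPy function is passed to aggregate and suggest the string "mean" instead.

```python
import numpy as np
import pandas as pd

# Hypothetical response times (ms), two observations per cell
data = pd.DataFrame({
    "iv1": ["noise", "noise", "noise", "noise", "quiet", "quiet", "quiet", "quiet"],
    "iv2": [1, 1, 2, 2, 1, 1, 2, 2],
    "rt": [900.0, 1100.0, 1000.0, 1200.0, 500.0, 600.0, 550.0, 650.0],
})
grouped_data = data.groupby(["iv1", "iv2"])

# Way 1: the mean method of the grouped column
mean_method = grouped_data["rt"].mean().reset_index()

# Way 2: aggregate combined with NumPy's mean
mean_aggregate = grouped_data["rt"].aggregate(np.mean).reset_index()

print(mean_method)
print(mean_aggregate)
```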
Output of mean and aggregate using NumPy – Mean
Geometric & harmonic mean
Sometimes the geometric or harmonic mean can be of interest. These two descriptives can be obtained by calling the method apply with the functions gmean and hmean (from SciPy) as arguments. That is, neither Pandas nor NumPy has a dedicated method for calculating geometric and harmonic means.
Geometric
grouped_data['rt'].apply(gmean, axis=None).reset_index()
Harmonic
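As a rough, self-contained sketch, both means can be computed per group with apply. The data are invented so that the results are easy to check by hand, and wrapping the SciPy functions in a lambda that passes a plain NumPy array is a choice made here, not something the text prescribes.

```python
import pandas as pd
from scipy.stats import gmean, hmean

# Hypothetical response times (ms), two observations per group
data = pd.DataFrame({
    "iv1": ["noise", "noise", "quiet", "quiet"],
    "rt": [400.0, 900.0, 100.0, 400.0],
})
grouped_data = data.groupby("iv1")

# Geometric mean per group: sqrt(400 * 900) = 600 and sqrt(100 * 400) = 200
geometric = grouped_data["rt"].apply(lambda x: gmean(x.to_numpy())).reset_index()

# Harmonic mean per group: n divided by the sum of reciprocals
harmonic = grouped_data["rt"].apply(lambda x: hmean(x.to_numpy())).reset_index()

print(geometric)
print(harmonic)
```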
Trimmed mean
Trimmed means are, at times, used. Pandas and NumPy seem not to have methods for obtaining the trimmed mean. However, we can use the function trim_mean from SciPy. By using apply on our grouped data, we can call the function (‘trim_mean’) with an argument that removes 10 % of the largest and 10 % of the smallest values.
trimmed_mean = grouped_data['rt'].apply(trim_mean, .1)
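To see what the trimming does, here is a small sketch on a single made-up array of response times with one extreme value. With ten observations and a proportion of 0.1, one value is cut from each end before the mean is taken.

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical response times (ms) with one slow outlier
rt = np.array([480, 500, 510, 520, 530, 540, 550, 560, 570, 2500], dtype=float)

plain_mean = rt.mean()               # pulled upwards by the 2500 ms trial
trimmed = trim_mean(rt, 0.1)         # drops the smallest (480) and largest (2500) value

print(plain_mean, trimmed)
```

The trimmed mean stays close to the bulk of the data, while the ordinary mean is shifted by the single slow trial.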
Output from the mean values above (trimmed, harmonic, and geometric means):
Trimmed Mean
Harmonic Mean
Geometric Mean
Median
As with the mean, there are at least two ways of obtaining the median:
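The two ways might be sketched like this, mirroring the mean example. The data are made up, and recent Pandas versions may warn when a NumPy function is passed to aggregate.

```python
import numpy as np
import pandas as pd

# Hypothetical response times (ms), three observations per group
data = pd.DataFrame({
    "iv1": ["noise"] * 3 + ["quiet"] * 3,
    "rt": [900.0, 1000.0, 1500.0, 500.0, 520.0, 900.0],
})
grouped_data = data.groupby("iv1")

# Way 1: the median method of the grouped column
median_method = grouped_data["rt"].median().reset_index()

# Way 2: aggregate combined with NumPy's median
median_aggregate = grouped_data["rt"].aggregate(np.median).reset_index()

print(median_method)
```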
Output of aggregate using NumPy – Median.
Mode
There is a method (i.e., ) for getting the mode for a DataFrame object. However, it cannot be used on the grouped data so I will use mode from SciPy:
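One way this could look is sketched below. The data are invented and use repeated integer values, because the mode is only informative when values recur. The return shape of SciPy's mode has changed between versions, so the sketch uses np.atleast_1d to pull out a single value either way; that wrapper is an addition made here.

```python
import numpy as np
import pandas as pd
from scipy.stats import mode

# Hypothetical response times (ms, rounded) with repeated values
data = pd.DataFrame({
    "iv1": ["noise"] * 4 + ["quiet"] * 4,
    "rt": [500, 500, 620, 700, 410, 450, 450, 480],
})
grouped_data = data.groupby("iv1")

def group_mode(values):
    """Return the most frequent value of a group as a scalar."""
    result = mode(values.to_numpy())
    # Older SciPy returns a length-1 array, newer SciPy a scalar
    return np.atleast_1d(result.mode)[0]

modes = grouped_data["rt"].apply(group_mode).reset_index()
print(modes)
```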
Most of the time I would probably want to see all measures of central tendency at the same time. Luckily, aggregate enables us to use many NumPy and SciPy methods. In the example below, the median, standard deviation (std), and mean are all in the same output. Note that we have to add the trimmed means afterwards.
descr = grouped_data['rt'].aggregate([np.median, np.std, np.mean]).reset_index()
descr['trimmed_mean'] = pd.Series(trimmed_mean.values, index=descr.index)
Output of aggregate using some of the methods.
Measures of variability
Central tendency (e.g., the mean and median) is not the only type of summary statistic that we want to calculate. When doing data analysis we also want a measure of the variability of the data.
Standard deviation
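A minimal sketch with made-up data, assuming the grouped std method is what is wanted here. Pandas computes the sample standard deviation (ddof=1) by default, whereas NumPy's np.std defaults to the population version (ddof=0), so the two can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical response times (ms), two observations per group
data = pd.DataFrame({
    "iv1": ["noise", "noise", "quiet", "quiet"],
    "rt": [900.0, 1100.0, 500.0, 600.0],
})
grouped_data = data.groupby("iv1")

# Sample standard deviation (divides by n - 1), the Pandas default
sample_std = grouped_data["rt"].std()

# Population standard deviation (divides by n), matching NumPy's default
population_std = grouped_data["rt"].std(ddof=0)

print(sample_std.reset_index())
print(population_std.reset_index())
```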
Interquartile range
Note that here unstack() is used to get the quantiles as columns, which makes the output easier to read.
grouped_data['rt'].quantile([.25, .5, .75]).unstack()
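The quantile table can be turned into an actual interquartile range by subtracting the 25th from the 75th percentile. The following is a sketch on invented data, and the column name "iqr" is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Hypothetical response times (ms), five observations per group
data = pd.DataFrame({
    "iv1": ["noise"] * 5 + ["quiet"] * 5,
    "rt": [100.0, 200.0, 300.0, 400.0, 500.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
grouped_data = data.groupby("iv1")

# Quartiles per group, with the quantiles (0.25, 0.5, 0.75) as columns
quartiles = grouped_data["rt"].quantile([.25, .5, .75]).unstack()

# Interquartile range: 75th percentile minus 25th percentile
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```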
Variance
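A short sketch of the variance per group, assuming the grouped var method is what is intended. The data are made up, and as with the standard deviation Pandas uses the sample formula (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Hypothetical response times (ms), two observations per group
data = pd.DataFrame({
    "iv1": ["noise", "noise", "quiet", "quiet"],
    "rt": [900.0, 1100.0, 500.0, 600.0],
})
grouped_data = data.groupby("iv1")

# Sample variance per group (divides by n - 1)
variance = grouped_data["rt"].var()
print(variance.reset_index())

# The variance is the square of the standard deviation
std_squared = grouped_data["rt"].std() ** 2
```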
That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy really make the calculation almost as easy as doing it in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can pass in other methods or functions to obtain other types of descriptives.
