Normalization and Standardization in Machine Learning
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. Some algorithms also require normalization in order to model the data correctly.
For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.
Normalization avoids these problems by creating new values that maintain the general distribution and ratios in the source data, while keeping values within a scale applied across all numeric columns used in the model.
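To make the scale problem concrete, here is a minimal sketch with two made-up rows (the values and the names `a`, `b`, `low`, `high` are illustrative assumptions, not taken from any real dataset). It compares the Euclidean distance between the rows before and after each column is rescaled by its stated range.

```python
import numpy as np

# Made-up rows: column 0 lies in [0, 1], column 1 in [10_000, 100_000].
a = np.array([0.1, 20_000.0])
b = np.array([0.9, 20_500.0])

# Raw Euclidean distance: the 500-unit gap in column 1 swamps
# the 0.8 gap in column 0.
raw_distance = np.linalg.norm(a - b)

# Rescale each column by its stated range so both live on [0, 1].
low = np.array([0.0, 10_000.0])
high = np.array([1.0, 100_000.0])
a_scaled = (a - low) / (high - low)
b_scaled = (b - low) / (high - low)
scaled_distance = np.linalg.norm(a_scaled - b_scaled)

print(raw_distance, scaled_distance)
```

The raw distance is about 500 and is driven almost entirely by the second column, while the rescaled distance is about 0.8 and is driven by the first. Distance-based models such as k-nearest neighbors are sensitive to exactly this effect.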
There are several ways to normalize data. Some of them are as follows.
Log transformation
A log transformation is a very useful tool when your data clearly does not follow a normal distribution. It can reduce skewness in skewed data and can reduce the variability of the data. Make sure your data contains only positive, non-zero numbers, because the log of a negative number or of 0 is undefined. For non-negative data that may contain zeros, there is the log1p transformation, log(1 + x), which, as you might have guessed, adds 1 to every number and then takes the log.
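A small sketch of the log(1 + x) variant, using NumPy's `np.log1p` and its inverse `np.expm1`. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical right-skewed sample that contains a zero.
values = np.array([0.0, 1.0, 9.0, 99.0, 999.0])

# np.log1p computes log(1 + x), so the zero maps to 0 rather than -inf.
log_values = np.log1p(values)

# Ratio of the largest to the smallest non-zero value, before and after.
spread_before = values.max() / values[1]
spread_after = log_values.max() / log_values[1]

# np.expm1 computes exp(x) - 1 and undoes the transformation.
restored = np.expm1(log_values)
```

The ratio between the largest and smallest non-zero values shrinks considerably after the transformation, which is the compression of the long tail described above.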
Min-max scaling
When performing min-max scaling, you transform x to get the transformed x' by using the formula:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
This way of scaling brings all values into the range between 0 and 1.
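As a rough illustration, min-max scaling of a single column can be written in a few lines of NumPy. The function name `min_max_scale`, the sample values, and the handling of a constant column are assumptions made for this sketch.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # Constant column: the formula is undefined, so this sketch
        # returns zeros (one possible choice among several).
        return np.zeros_like(x)
    return (x - x.min()) / span

incomes = np.array([10_000.0, 25_000.0, 55_000.0, 100_000.0])
scaled = min_max_scale(incomes)
```

The smallest value maps to 0, the largest to 1, and 55,000 lands at 0.5 because it sits halfway through the 90,000-wide range.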
Standardization
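Standardization (z-score scaling) subtracts the mean μ and divides by the standard deviation σ:
$$x' = \frac{x - \mu}{\sigma}$$
After the transformation, x' has mean 0 and standard deviation 1.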
Note that standardization does not make data more normally distributed; it only changes the mean and the standard deviation!
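The point that standardization shifts and rescales without changing the shape can be checked numerically. The sketch below uses a synthetic exponential sample (an assumption for illustration) and a simple skewness helper; the names `standardize` and `skewness` are hypothetical.

```python
import numpy as np

def standardize(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def skewness(x):
    """Sample skewness computed as the mean of cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1_000)  # clearly non-normal data
z = standardize(skewed)

print(z.mean(), z.std())               # approximately 0 and 1
print(skewness(skewed), skewness(z))   # skewness is unchanged
```

The mean and standard deviation become 0 and 1, but the skewness is the same before and after, so the data is no closer to a normal distribution.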
Mean normalization
When performing mean normalization, you use the following formula:
$$x' = \frac{x - \operatorname{mean}(x)}{\max(x) - \min(x)}$$
The resulting distribution has values between -1 and 1, and a mean of 0.
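A minimal NumPy sketch of mean normalization, which centers on the mean and divides by the range. The function name `mean_normalize` and the sample values are illustrative assumptions.

```python
import numpy as np

def mean_normalize(x):
    """Center on the mean, then divide by the range (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

ages = np.array([2.0, 20.0, 35.0, 63.0])
normalized = mean_normalize(ages)
```

The result has mean 0, every value lies in [-1, 1], and the distance between the smallest and largest value is exactly 1.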
Unit vector transformation
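When performing a unit vector transformation, you divide x by its length to create a new vector x' of length 1:
$$x' = \frac{x}{\lVert x \rVert}$$
When all components of x are non-negative, the components of x' fall in the range [0, 1]; with negative components the range is [-1, 1].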
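A short sketch of the unit vector transformation using the Euclidean (L2) norm. The helper name `to_unit_vector` and the treatment of the zero vector are assumptions made here. Note that scikit-learn's `Normalizer` applies this kind of rescaling to each row (sample) rather than to each column.

```python
import numpy as np

def to_unit_vector(x):
    """Divide a vector by its Euclidean length so the result has length 1."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0:
        # The zero vector has no direction; this sketch leaves it unchanged.
        return x
    return x / norm

row = np.array([3.0, 4.0])
unit = to_unit_vector(row)  # length of row is 5, so this is [0.6, 0.8]
```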
Power transformation
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, scikit-learn's PowerTransformer supports the Box-Cox transformation and the Yeo-Johnson transformation. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
By default, zero-mean, unit-variance normalization is applied to the transformed data.
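The sketch below applies both methods with scikit-learn's `PowerTransformer` to a synthetic log-normal feature (the data is an assumption for illustration). Box-Cox is fitted on the strictly positive values, and Yeo-Johnson on a shifted copy that contains negatives.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical skewed, strictly positive feature as a single column.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

bc = PowerTransformer(method="box-cox")       # needs strictly positive input
yj = PowerTransformer(method="yeo-johnson")   # also accepts zeros and negatives

X_bc = bc.fit_transform(X)
X_yj = yj.fit_transform(X - 1.0)  # shifted so that some values are negative

# lambdas_ holds the fitted power parameter for each column.
print(bc.lambdas_, yj.lambdas_)
# With the default standardize=True the output has zero mean and unit variance.
print(X_bc.mean(), X_bc.std())
```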
Now that we have discussed various normalization, standardization, and transformation techniques, let's see an example of how to apply them in Python.
Here is the code snippet for the Titanic dataset, where I classify survivors using KNeighborsClassifier. The F1 score I got on the regular, non-normalized data is 49%.
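import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
# df_dummies (encoded Titanic features) and labels (survival) are prepared beforehand
X_train, X_test, y_train, y_test = train_test_split(df_dummies, labels, test_size=0.25, random_state=42)
knn = KNeighborsClassifier()  # constructor arguments are cut off in the source; defaults assumed
clf = knn.fit(X_train, y_train)
pred = clf.predict(X_test)
result = f1_score(y_test, pred)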
Here is the code snippet for the same dataset using StandardScaler, which gave me an F1 score of 79%.
from sklearn.preprocessing import StandardScaler
scaler2 = StandardScaler()
X_train_scaled2 = scaler2.fit_transform(X_train)  # fit on the training data only
X_test_scaled2 = scaler2.transform(X_test)  # reuse the training statistics
clf_scaled2 = knn.fit(X_train_scaled2, y_train)
scaled_pred2 = clf_scaled2.predict(X_test_scaled2)
result = f1_score(y_test, scaled_pred2)
Here is the code snippet for the same dataset using a power transformation, which gave me an F1 score of 77%.
from sklearn.preprocessing import PowerTransformer
yj = PowerTransformer(method="yeo-johnson")
X_train_yj = yj.fit(X_train).transform(X_train)
X_test_yj = yj.transform(X_test)
clf_transformed = knn.fit(X_train_yj, y_train)
transformed_pred = clf_transformed.predict(X_test_yj)
result = f1_score(y_test, transformed_pred)
Here is the code snippet for the same dataset using MinMaxScaler, which gave me an F1 score of 76%.
from sklearn.preprocessing import MinMaxScaler
scaler1 = MinMaxScaler()
X_train_scaled1 = scaler1.fit_transform(X_train)  # was scaler.fit_transform, an undefined name
X_test_scaled1 = scaler1.transform(X_test)
clf_scaled1 = knn.fit(X_train_scaled1, y_train)
scaled_pred1 = clf_scaled1.predict(X_test_scaled1)
result = f1_score(y_test, scaled_pred1)
Here is the code snippet for the same dataset using Normalizer, which gave me an F1 score of 62%.
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_normalized = normalizer.fit_transform(X_train)
X_test_normalized = normalizer.transform(X_test)
clf_normalized = knn.fit(X_train_normalized, y_train)
normalized_pred1 = clf_normalized.predict(X_test_normalized)
result = f1_score(y_test, normalized_pred1)
As you can see, different normalization, standardization, and transformation techniques give different F1 scores, so which one should you use? That depends on the dataset and its characteristics. One simple way to find out: try them all ;-)
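One way to "try them all" without repeating the boilerplate is to loop over the candidate transformers inside a scikit-learn pipeline and compare cross-validated F1 scores. The sketch below is only an illustration: it uses a synthetic dataset from `make_classification` as a stand-in (an assumption, since the Titanic frame is not reproduced here), so its scores will not match the numbers quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    MinMaxScaler, Normalizer, PowerTransformer, StandardScaler,
)

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X[:, 0] *= 100  # put one feature on a much larger scale than the rest

candidates = {
    "none": None,
    "standard": StandardScaler(),
    "yeo-johnson": PowerTransformer(method="yeo-johnson"),
    "min-max": MinMaxScaler(),
    "normalizer": Normalizer(),
}

scores = {}
for name, scaler in candidates.items():
    steps = [KNeighborsClassifier()]
    if scaler is not None:
        steps.insert(0, scaler)
    # The pipeline refits the scaler on each training fold, so no
    # statistics from the held-out fold leak into preprocessing.
    model = make_pipeline(*steps)
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="f1").mean()

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} F1 = {score:.3f}")
```

Cross-validation is used here instead of a single train/test split so that the ranking depends less on one particular split.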
Happy reading!