PyCaret: Comprehensive Machine Learning
Any machine learning project starts with loading the dataset and ends (continues?!) with finalizing the optimum model or ensemble of models for prediction on unseen data and production deployment.
As machine learning practitioners, we are aware that there are several pit stops to be made along the way to arrive at the best possible prediction performance. The intermediate steps include Exploratory Data Analysis (EDA) and data preprocessing (missing value treatment, outlier treatment, changing data types, encoding categorical features, data transformation, feature engineering/selection, sampling and train-test split, to name a few) before we can embark on model building, evaluation and then prediction.
We end up importing dozens of Python packages to help us do this, which means getting familiar with the syntax and parameters of multiple function calls within each of the packages.
Have you wished that there could be a single package that can handle the entire journey end to end with a consistent syntax interface? I sure have!
Enter PyCaret
Those wishes were answered by the PyCaret package, and it is now even more awesome with the release of pycaret 2.0.
Starting with this article, I will post a series on how pycaret helps us zip through the various stages of an ML project.
Installation
Installation is a breeze and is over in a few minutes, with all dependencies installed along the way. It is recommended to install inside a virtual environment to avoid any clash with other pre-installed packages.
pip install pycaret==2.0
Once installed, we are ready to begin! We import the package into our notebook environment. We will take up a classification problem here. Similarly, the respective PyCaret modules can be imported for scenarios involving regression, clustering, anomaly detection, NLP and association rules mining.
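As a rough sketch of how the task types listed above map onto PyCaret modules, the snippet below collects the module paths shipped with PyCaret 2.0 in one place. The helper `load_pycaret_module` and the task labels used as dictionary keys are hypothetical names introduced only for illustration; in a notebook the usual shortcut is simply the star import shown in the final comment.

```python
import importlib

# One PyCaret module per task type (module paths as shipped with PyCaret 2.0).
# The dictionary keys are illustrative labels, not PyCaret identifiers.
PYCARET_MODULES = {
    "classification": "pycaret.classification",
    "regression": "pycaret.regression",
    "clustering": "pycaret.clustering",
    "anomaly detection": "pycaret.anomaly",
    "nlp": "pycaret.nlp",
    "association rules": "pycaret.arules",
}


def load_pycaret_module(task):
    """Hypothetical helper: import and return the PyCaret module for a task.

    Calling it requires PyCaret to be installed.
    """
    return importlib.import_module(PYCARET_MODULES[task])


# In a notebook, the usual shortcut for a classification problem is:
#   from pycaret.classification import *
```

Every module exposes the same style of functions, so switching from classification to regression is mostly a matter of changing the import.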
We will use the Titanic dataset, which is available for download.
Let's check the first few rows of the dataset using the head() function:
Setup
The setup() function of pycaret does most (correction: ALL) of the heavy lifting that would otherwise take dozens of lines of code, in just a single line!
We just need to pass the dataframe and specify the name of the target feature as arguments. The setup command then generates output listing the data type it has inferred for each feature.
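A minimal sketch of that call, assuming the Titanic data is already in a pandas DataFrame and that the target column is named `survived` (the column name is an assumption; adjust it to match your copy of the dataset). The wrapper `run_setup` is a hypothetical name, used here only so that the import happens when the function is called.

```python
def run_setup(df, target="survived"):
    """Run PyCaret's classification setup on a pandas DataFrame.

    `target` is the name of the column to predict; "survived" is an
    assumed column name for the Titanic data.
    """
    # Imported inside the function so the sketch can be defined
    # even where PyCaret is not installed.
    from pycaret.classification import setup

    # The single line that triggers type inference and preprocessing.
    return setup(data=df, target=target)
```

In a notebook this reduces to `setup(data=df, target='survived')` after importing from `pycaret.classification`.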
setup has helpfully inferred the data types of the features in the dataset. If we agree with them, all we need to do is hit Enter. Otherwise, if you think the data types inferred by setup are not correct, you can type quit in the field at the bottom and go back to the setup function to make changes. We will see how to do that shortly. For now, let's hit Enter and see what happens.
[Figure: setup() output, shown in several parts]
Whew! A whole lot seems to have happened under the hood in just one line of innocuous-looking code! Let's take stock:
checked for missing values
identified numeric and categorical features
created train and test datasets from the original dataset
imputed missing values in continuous features with the mean
imputed missing values in categorical features with a constant value
done label encoding
...and a whole host of other options seem to be available, including outlier treatment, data scaling, feature transformation, dimensionality reduction, multi-collinearity treatment, feature selection and handling imbalanced data!
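To make the steps in the list above concrete, here is a small pure-Python sketch of the same ideas: a train/test split, mean imputation for a numeric column, constant imputation for a categorical column, and label encoding. It is not PyCaret's implementation. The toy rows, the 70/30 split, the seed and the `"not_available"` fill value are all assumptions chosen for illustration.

```python
import random
from statistics import mean

# Toy passenger records (hypothetical values); None marks a missing entry.
rows = [
    {"age": 22.0, "embarked": "S", "survived": 0},
    {"age": None, "embarked": "C", "survived": 1},
    {"age": 26.0, "embarked": None, "survived": 1},
    {"age": 35.0, "embarked": "S", "survived": 1},
    {"age": None, "embarked": "Q", "survived": 0},
    {"age": 54.0, "embarked": "S", "survived": 0},
    {"age": 2.0, "embarked": "S", "survived": 0},
    {"age": 27.0, "embarked": "S", "survived": 1},
    {"age": 14.0, "embarked": "C", "survived": 1},
    {"age": 4.0, "embarked": "S", "survived": 1},
]


def train_test_split(records, train_size=0.7, seed=123):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]


def impute(records, column, fill):
    """Return new records with missing values in `column` replaced by `fill`."""
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


def label_encode(values):
    """Map each distinct value to an integer code (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


train, test = train_test_split(rows)

# The mean is learned from the training rows only, then applied to both parts.
age_fill = mean(r["age"] for r in train if r["age"] is not None)
train = impute(impute(train, "age", age_fill), "embarked", "not_available")
test = impute(impute(test, "age", age_fill), "embarked", "not_available")

codes, mapping = label_encode([r["embarked"] for r in train])
```

Learning the fill value from the training rows alone keeps information from the test rows out of the preprocessing step.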
But hey! What is that on lines 11 and 12 of the output? The number of features in the train and test datasets is 1745? This seems to be a case of label encoding gone berserk, most probably on categorical features like name, ticket and cabin. Further in this article and in the next, we will look at how we can control setup as per our requirements to address such cases proactively.
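One plausible reading of that number, which the output itself does not confirm, is that each distinct value of a categorical column becomes a column of its own (one-hot style encoding). Under that assumption, columns that are nearly unique per passenger inflate the feature count quickly. The sketch below counts the resulting columns for a few hypothetical passenger records; the function name `encoded_width` and all values are made up for illustration.

```python
def encoded_width(records, categorical, numeric):
    """Count columns after giving every distinct categorical value its own column."""
    width = len(numeric)
    for column in categorical:
        width += len({r[column] for r in records})
    return width


# Hypothetical records: name is unique per row, ticket is nearly unique.
passengers = [
    {"name": "Passenger A", "ticket": "T-001", "sex": "male", "age": 22.0},
    {"name": "Passenger B", "ticket": "T-002", "sex": "female", "age": 38.0},
    {"name": "Passenger C", "ticket": "T-003", "sex": "female", "age": 26.0},
    {"name": "Passenger D", "ticket": "T-003", "sex": "male", "age": 35.0},
]

# age (1) + 4 names + 3 tickets + 2 sexes
with_identifiers = encoded_width(passengers, ["name", "ticket", "sex"], ["age"])

# age (1) + 2 sexes
without_identifiers = encoded_width(passengers, ["sex"], ["age"])
```

With only four rows the width already grows from 3 to 10 columns; on a dataset with over a thousand passengers the same effect would produce a feature count in the thousands.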
Customizing setup
To start with, how can we exclude features from model building, like the three features above? We pass the variables we want to exclude in the ignore_features argument of the setup function. Note that ID and DateTime columns, when inferred, are automatically set to be ignored for modelling.
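A minimal sketch of such a call, assuming the DataFrame columns are named `name`, `ticket` and `cabin` as in the text, and that the target column is called `survived` (an assumption). The wrapper `run_setup_ignoring` and the constant `IGNORED_FEATURES` are hypothetical names used only for illustration.

```python
# Columns to leave out of modelling, as named in the text.
IGNORED_FEATURES = ["name", "ticket", "cabin"]


def run_setup_ignoring(df, target="survived", ignore=IGNORED_FEATURES):
    """Run PyCaret's classification setup while excluding some columns.

    "survived" is an assumed target column name for the Titanic data.
    """
    # Imported inside the function so the sketch can be defined
    # even where PyCaret is not installed.
    from pycaret.classification import setup

    # ignore_features takes a list of column names to drop before modelling.
    return setup(data=df, target=target, ignore_features=list(ignore))
```

In a notebook this is equivalent to adding `ignore_features=['name', 'ticket', 'cabin']` to the original setup call.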
Note below that pycaret, while asking for our confirmation, has dropped the three features mentioned above. Let's press Enter and proceed.
In the resultant output (a truncated version is shown below), we can see that after setup the dataset shape is more manageable, with label encoding done only on the remaining, more relevant categorical features: