Materials Data Science: Descriptors and Machine Learning
Contents: how to use matminer, automatminer, pandas and scikit-learn to machine-learn materials properties.
Contents
A typical machine learning workflow can be summarized as:
1. Obtain raw inputs, such as a list of compositions and the associated target properties to learn.
2. Convert the raw inputs into descriptors or features that can be learned by machine learning algorithms.
3. Train a machine learning model on the data.
4. Plot and analyze the performance of the model.
1. Data retrieval and filtering
Matminer interfaces with many materials databases, including: Materials Project, Citrine, AFLOW, the Materials Data Facility (MDF), and the Materials Platform for Data Science (MPDS). In addition, it includes datasets from the published literature. Matminer hosts a repository of 26 (and growing) datasets taken from published, peer-reviewed machine learning studies of materials properties or from high-throughput computational studies. In this section, we show how to access and manipulate datasets from the published literature. For more information on accessing the other materials databases, see the matminer_examples repository.
The list of literature-based datasets can be printed using the get_available_datasets() function.
This also prints the information contained in each dataset, such as the number of samples, the target properties, and how the data were obtained (e.g., by theory or experiment).
from matminer.datasets import get_available_datasets

get_available_datasets()
Output:
boltztrap_mp: Effective mass and thermoelectric properties of 8924 compounds in The Materials Project database that are calculated by the BoltzTraP software p…
brgoch_superhard_training: 2574 materials used for training regressors that predict shear and bulk modulus.
castelli_perovskites: 18,928 perovskites generated with ABX combinatorics, calculating gllbsc band gap and pbe structure, and also reporting absolute band edge…
citrine_thermal_conductivity: Thermal conductivity of 872 compounds measured exp…
dielectric_constant: 1,056 structures with dielectric properties, calculated with DFPT-PBE.
double_perovskites_gap: Band gap of 1306 double perovskites (a_1-b_1-a_2-b_2-O6) calculated using the Gritsenko, van Leeuwen, van Lenthe and Baerends poten…
double_perovskites_gap_lumo: Supplementary LUMO data of 55 atoms for the double_perovskites_gap dataset.
elastic_tensor_2015: 1,181 structures with elastic properties calculated with DFT-PBE.
expt_formation_enthalpy: Experimental formation enthalpies for inorganic compounds… 1,276 entries…
expt_gap: Experimental band gap of 6354 inorganic semiconductors.
flla: 3938 structures and computed formation energies from "Crystal Structure Representations for Machine Learning Models of Formation Energies."
glass_binary: Metallic glass formation data for binary alloys, collected from vario…
glass_binary_v2: Identical to the glass_binary dataset, except… where there was a disagreement in gfa when merging, the class was defaulted to…
glass_ternary_hipt: Metallic glass formation dataset for ternary alloys, collected from the high-throughput sputtering experiments measuring whether it is possible…
glass_ternary_landolt: Metallic glass formation dataset for ternary alloys, collected from the "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys"…
heusler_magnetic: 1153 alloys, including 576 full, 449 half and 128 inverse Heusle…
jarvis_dft_2d: Various properties of 636 2D materials computed with the OptB88vdW and TBmBJ functionals, taken from the JARVIS DFT database.
jarvis_dft_3d: Various properties of 25,923 bulk materials computed with the OptB88vdW and TBmBJ functionals, taken from the JARVIS DFT database.
jarvis_ml_dft_training: Various properties of 24,759 bulk and 2D materials computed with the OptB88vdW and TBmBJ functionals, taken from the JARVIS DFT dat…
m2ax: Elastic properties of 223 stable M2AX compounds from "A comprehensive survey of M2AX phase elastic properties"…
matbench_dielectric: Matbench v0.1 test dataset… entries having…
matbench_expt_gap: Matbench v0.1… mentary informa…
matbench_expt_is_metal: …
matbench_glass: Matbench v0.1 test dataset for pre…ved from "Nonequilibrium Phase D…"
matbench_jdft2d: Matbench v0.1 test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Ad…
matbench_log_gvrh: Matbench v0.1 test d… from the Materials Project database.
matbench_log_kvrh: Matbench v0.1 test d… from the Materials Project database. R…
matbench_mp_e_form: … entries…
matbench_mp_gap: … entries hav…
matbench_mp_is_metal: … entries hav…
matbench_perovskites: … from an original dataset generated by Castelli…
matbench_phonons: Matbench v0.1… calcu…
matbench_steels: Matbench v0.1 test data…
mp_all_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct do…
mp_nostruct_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostr…
phonon_dielectric_mp: Phonon (lattice/atom vibrations) and dielectric properties of 1296 compounds computed via the ABINIT software package in the harmonic ap…
piezoelectric_tensor: 941 structures with piezoelectric properties, calculated with DFT-PBE.
steel_strength: 312 steels with experimental yield strength and ultimate tensile strength, extracted and cleaned (including de-duplicating) from Citrine.
wolverton_oxides: 4,914 perovskite oxides containing composition data, lattice constants, and formation + … perovskites are of the f…
The list of datasets is:
['boltztrap_mp',
'brgoch_superhard_training',
'castelli_perovskites',
'citrine_thermal_conductivity',
'dielectric_constant',
'double_perovskites_gap',
'double_perovskites_gap_lumo',
'elastic_tensor_2015',
'expt_formation_enthalpy',
'expt_gap',
'flla',
'glass_binary',
'glass_binary_v2',
'glass_ternary_hipt',
'glass_ternary_landolt',
'heusler_magnetic',
'jarvis_dft_2d',
'jarvis_dft_3d',
'jarvis_ml_dft_training',
'm2ax',
'matbench_dielectric',
'matbench_expt_gap',
'matbench_expt_is_metal',
'matbench_glass',
'matbench_jdft2d',
'matbench_log_gvrh',
'matbench_log_kvrh',
'matbench_mp_e_form',
'matbench_mp_gap',
'matbench_mp_is_metal',
'matbench_perovskites',
'matbench_phonons',
'matbench_steels',
'mp_all_20181018',
'mp_nostruct_20181018',
'phonon_dielectric_mp',
'piezoelectric_tensor',
'steel_strength',
'wolverton_oxides']
A dataset can be loaded using the load_dataset() function and the dataset name. To save installation space, datasets are not downloaded automatically when matminer is installed. Instead, the first time a dataset is loaded, it will be downloaded from the internet and stored in the matminer installation directory.
Let's load the dielectric constant dataset. It contains 1,056 structures with dielectric properties calculated with DFPT-PBE.
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
Output:
Fetching dielectric_… /files/13213475 to D:\anaconda3\lib\site-packages\matminer\datasets\dielectric_consta…
Manipulating and examining pandas DataFrame objects
The datasets are provided as pandas DataFrame objects. In Python, you can think of these as a kind of "spreadsheet" object. DataFrames come with several useful methods that can be used to explore and clean the data, some of which we explore below.
Examining the dataset
The head() function prints a summary of the first few rows of a dataset. You can scroll across to see more columns. From this, it is easy to see the types of data available in the dataset.
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
print(df.head())
Output:
  material_id  ...                                             poscar
0      mp-441  ...  Rb2Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1    mp-22881  ...  Cd1Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2    mp-28013  ...  Mn1I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3   mp-567290  ...  La2N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4   mp-560902  ...  Mn2F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
[5 rows x 16 columns]
Sometimes, if a dataset is very large, it won't be possible to see all the available columns. Instead, the full list of columns can be viewed using the columns attribute:
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
print(df.columns)
Output:
Index(['material_id','formula','nsites','space_group','volume',
'structure','band_gap','e_electronic','e_total','n',
'poly_electronic','poly_total','pot_ferroelectric','cif','meta',
'poscar'],
dtype='object')
pandas includes a function called describe() that helps determine statistics for the numerical and categorical columns in the data. Note that by default, describe() only describes numerical columns.
Sometimes, describe() will reveal outliers that indicate mistakes in the data.
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
print(df.describe())
Output:
            nsites  space_group  ...  poly_electronic   poly_total
count  1056.000000  1056.000000  ...      1056.000000  1056.000000
mean      7.530303   142.970644  ...         7.248049    14.777898
std       3.388443    67.264591  ...        13.054947    19.435303
min       2.000000     1.000000  ...         1.630000     2.080000
25%       5.000000    82.000000  ...         3.130000     7.557500
50%       8.000000   163.000000  ...         4.790000    10.540000
75%       9.000000   194.000000  ...         7.440000    15.482500
max      20.000000   229.000000  ...       256.840000   277.780000
[8 rows x 7 columns]
Indexing the dataset
We can access specific columns of a DataFrame by indexing the object using the column name. For example:
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
print(df["band_gap"])
Output:
0       1.88
1       3.52
2       1.17
3       1.12
4       2.87
        ...
1051    0.87
1052    3.60
1053    0.14
1054    0.21
1055    0.26
Name: band_gap, Length: 1056, dtype: float64
Alternatively, we can access specific rows of a DataFrame using the iloc attribute.
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
print(df.iloc[100])
Output:
material_id                                               mp-7140
formula                                                       SiC
nsites                                                          4
space_group                                                   186
volume                                                  42.005504
structure            [[-1.87933700e-06 1.78517223e+00 2.53458835e...
band_gap                                                      2.3
e_electronic                  [[6.9589498, -3.29e-06, 0.0000001]...
e_total                                  [[10.0001, -3.7006e-05,...
n                                                            2.66
poly_electronic                                              7.08
poly_total                                                  10.58
pot_ferroelectric                                           False
cif                                                 ##CIF 1.1\n
                                                            ###...
meta                 {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...
poscar               Si2C2\n1.0\n3.092007 0.000000 0.000000\n-1.54...
Name: 100, dtype: object
Filtering the dataset
pandas DataFrames make it very easy to filter the data based on specific columns. We can use the typical Python comparison operators (==, >, >=, <, etc.) to filter numerical values. For example, let's find all entries with a unit cell volume of at least 580. We do this by filtering on the volume column.
Note that we first generate a boolean mask: a series of True and False values that depend on the comparison. We can then use the mask to filter the DataFrame.
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
mask = df["volume"] >= 580
print(df[mask])
Output:
    material_id  ...                                             poscar
206    mp-23280  ...  As4Cl12\n1.0\n4.652758 0.000000 0.000000\n0.0...
216     mp-9064  ...  Rb6Te6\n1.0\n10.118717 0.000000 0.000000\n-5....
219    mp-23230  ...  P4Cl12\n1.0\n6.523152 0.000000 0.000000\n0.00...
251     mp-2160  ...  Sb8Se12\n1.0\n4.029937 0.000000 0.000000\n0.0...
[4 rows x 16 columns]
We can use this filtering ability to clean a dataset. For example, if we only want our dataset to contain semiconductors (materials with a nonzero band gap), we can easily achieve this by filtering on the band_gap column.
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
mask = df["band_gap"] > 0
semiconductor_df = df[mask]
print(semiconductor_df)
Output:
     material_id  ...                                             poscar
0         mp-441  ...  Rb2Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1       mp-22881  ...  Cd1Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2       mp-28013  ...  Mn1I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3      mp-567290  ...  La2N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4      mp-560902  ...  Mn2F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
...          ...  ...                                                ...
1051   mp-568032  ...  Cd1In2Se4\n1.0\n5.912075 0.000000 0.000000\n...
1052   mp-696944  ...  La2H2Br4\n1.0\n4.137833 0.000000 0.000000\n-...
1053    mp-16238  ...  Li2Ag1Sb1\n1.0\n4.078957 0.000000 2.354987\n...
1054     mp-4405  ...  Rb3Au1O1\n1.0\n5.617516 0.000000 0.000000\n0...
1055     mp-3486  ...  K2Sn2Sb2\n1.0\n4.446803 0.000000 0.000000\n-...
[1056 rows x 16 columns]
Often, datasets contain many additional columns that are not needed for machine learning. Before we can train a model on the data, we need to remove any irrelevant columns. We can remove entire columns from a dataset using the drop() function. This function can be used to drop both rows and columns.
The function accepts a list of items to drop. For columns, this is the column names, whereas for rows it is the row numbers. Finally, the axis option specifies whether the data to drop are columns (axis=1) or rows (axis=0).
For example, to drop the nsites, space_group, e_electronic and e_total columns, we can run:
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
print("Before drop:")
print(df.describe())
print("--" * 20)
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"], axis=1)
print("After drop:")
print(cleaned_df.describe())
Output:
Before drop:
            nsites  space_group  ...  poly_electronic   poly_total
count  1056.000000  1056.000000  ...      1056.000000  1056.000000
mean      7.530303   142.970644  ...         7.248049    14.777898
std       3.388443    67.264591  ...        13.054947    19.435303
min       2.000000     1.000000  ...         1.630000     2.080000
25%       5.000000    82.000000  ...         3.130000     7.557500
50%       8.000000   163.000000  ...         4.790000    10.540000
75%       9.000000   194.000000  ...         7.440000    15.482500
max      20.000000   229.000000  ...       256.840000   277.780000
[8 rows x 7 columns]
----------------------------------------
After drop:
            volume     band_gap            n  poly_electronic   poly_total
count  1056.000000  1056.000000  1056.000000      1056.000000  1056.000000
mean    166.420376     2.119432     2.434886         7.248049    14.777898
std      97.425084     1.604924     1.148849        13.054947    19.435303
min      13.980548     0.110000     1.280000         1.630000     2.080000
25%      96.262337     0.890000     1.770000         3.130000     7.557500
50%     145.944691     1.730000     2.190000         4.790000    10.540000
75%     212.106405     2.885000     2.730000         7.440000    15.482500
max     597.341134     8.320000    16.030000       256.840000   277.780000
Generating new columns
pandas DataFrames also make it easy to perform simple calculations on the data. Think of this like using formulas in an Excel spreadsheet. All the basic Python math operators (such as +, -, / and *) can be used.
For example, the dielectric dataset contains the electronic contribution to the dielectric constant (in the poly_electronic column) and the total (static) dielectric constant (in the poly_total column). The ionic contribution to the dielectric constant is then given by:
poly_ionic = poly_total - poly_electronic
Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called poly_ionic. This is as simple as assigning the data to the new column, even if the column doesn't exist yet.
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
df['poly_ionic'] = df['poly_total'] - df['poly_electronic']
print(df['poly_ionic'])
Output:
0        2.79
1        3.57
2        5.67
3       10.95
4        4.77
        ...
1051     4.09
1052     3.09
1053    19.99
1054    16.03
1055     3.08
Name: poly_ionic, Length: 1056, dtype: float64
2. Generating descriptors for machine learning
In this section, we will learn how to generate machine learning descriptors from pymatgen materials objects. First, we will generate some descriptors using matminer's "featurizer" classes. Next, we will use some of our knowledge of dataframes from the previous section to examine our descriptors and prepare them as input to a machine learning model.
Featurizers turn materials primitives into machine-learnable features. The general idea is that a featurizer accepts a materials primitive (e.g., a pymatgen Composition) and outputs a vector. For example:
\begin{align} f(\mathrm{Fe}_2\mathrm{O}_3) \rightarrow [1.5, 7.8, 9.1, 0.09] \end{align}
Matminer contains featurizers for the following pymatgen objects: * compositions * crystal structures * crystal sites * band structures * densities of states
Depending on the featurizer, the returned features may be: * numerical, categorical or mixed vectors * matrices * other pymatgen objects (for further processing)
Since we deal with pandas DataFrames most of the time, all featurizers also work on pandas DataFrames. We will provide examples of this later in the lesson.
In this lesson, we will go over the main methods shared by all featurizers. By the end of this unit, you will be able to use a single software interface to generate descriptors for a wide range of materials informatics problems.
Featurizing methods and basics
The core method of any matminer featurizer is featurize(). This method accepts a materials object and returns a machine learning vector or matrix. Let's look at an example for a pymatgen Composition:
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementFraction

fe2o3 = Composition("Fe2O3")
print("fe2o3:", fe2o3)

# As a simple example, we will get the fraction of each element using the ElementFraction featurizer.
ef = ElementFraction()

# Now we can featurize our composition.
element_fractions = ef.featurize(fe2o3)
print(element_fractions)
Output:
fe2o3: Fe2O3
[0,0,0,0,0,0,0,0.6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
We have successfully generated features for learning, but what do they mean? One way to check is to read the feature descriptions in the featurizer documentation... but an easier way is to use the feature_labels() method.
from matminer.featurizers.composition import ElementFraction

ef = ElementFraction()
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)
Output:
['H','He','Li','Be','B','C','N','O','F','Ne','Na','Mg','Al','Si','P','S','Cl','Ar','K','Ca','Sc','Ti','V','Cr','Mn','Fe','Co','Ni','Cu','Zn','Ga','Ge','As','Se','Br','Kr',
We can now look up the labels in the order the features were generated.
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementFraction

fe2o3 = Composition("Fe2O3")

# As a simple example, we will get the fraction of each element using the ElementFraction featurizer.
ef = ElementFraction()

# Now we can featurize our composition.
element_fractions = ef.featurize(fe2o3)
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])
Output:
O 0.6
Fe 0.4
Featurizing dataframes
We just generated some descriptors and their labels for a single sample, but most of the time our data will be in pandas DataFrames. Fortunately, matminer featurizers implement a featurize_dataframe() method for interacting with DataFrames.
Let's get a new dataset from matminer and use our ElementFraction featurizer on it.
First, we download the dataset as in the previous section. In this example, we will download a dataset of superhard materials.
from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("brgoch_superhard_training")
print(df.head())
Output:
Fetching brgoch_superhard_… /files/13858931 to D:\anaconda3\envs\pythonProject1\lib\site-packages\matm…
  formula  ...  suspect_value
0   AlPt3  ...          False
1   Mn2Nb  ...          False
2    HfO2  ...          False
3   Cu3Pt  ...          False
4   Mg3Pt  ...          False
Next, we can use the featurize_dataframe() method (implemented by all featurizers) to apply ElementFraction to all the data at once. The only required arguments are the input dataframe and the input column name (in this case, composition). By default, featurize_dataframe() uses multiprocessing for parallelization.
import pandas as pd

# Show all DataFrame columns (None means show all; a number can also be given)
pd.set_option('display.max_columns', None)
# Stop the DataFrame output from wrapping lines (False disables wrapping, True enables it)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("brgoch_superhard_training")
print(df.head())
print("---" * 20)

from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()

if __name__ == '__main__':
    df = ef.featurize_dataframe(df, "composition")
    print(df.head())
Output:
  formula  bulk_modulus  shear_modulus  composition  material_id  structure  brgoch_feats  suspect_value
0   AlPt3    225.230461      91.197748     (Al, Pt)       mp-188  [[0. 0. 0.] Al, [0. 1.96140395 1.96140...  {'atomic_number_feat_1': 123.5, 'atomic_number...  False
1   Mn2Nb    232.696340      74.590157     (Mn, Nb)     mp-12659  [[-2.23765223e-08 1.42974191e+00 5.92614104e...  {'atomic_number_feat_1': 45.5, 'atomic_n...  False
2    HfO2    204.573433      98.564374      (Hf, O)       mp-352  [[2.24450185 3.85793022 4.83390736] O, [2.7788...  {'atomic_number_feat_1': 44.0, 'atomic_numbe...  False
3   Cu3Pt    159.312640      51.778816     (Cu, Pt)     mp-12086  [[0. 1.86144248 1.86144248] Cu, [1.861...  {'atomic_number_feat_1': 82.5, 'atomic_number_...  False
4   Mg3Pt     69.637565      27.588765     (Mg, Pt)     mp-18707  [[0. 0. 2.73626461] Mg, [0....  {'atomic_number_feat_1': 57.0, 'atomic_number_...  False
------------------------------------------------------------
  formula  bulk_modulus  shear_modulus  composition  material_id  structure  brgoch_feats  suspect_value  H  He  ...
(the same rows as above, now followed by the element fraction columns H, He, ...)
Structure featurizers
We can use the same syntax with other types of featurizer. Let's now assign descriptors to a structure. We do this using the same syntax as for the composition featurizer. First, let's load a dataset containing structures.
from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("phonon_dielectric_mp")
print(df.head())
Output:
Fetching phonon_dielectric_… /files/13297571 to D:\anaconda3\envs\pythonProject1\lib\site-packages\matminerd…
         mpid  ...  formula
0     mp-1000  ...     BaTe
1  mp-1002124  ...      HfC
2  mp-1002164  ...      GeC
3    mp-10044  ...      BAs
4  mp-1008223  ...     CaSe
Let's use DensityFeatures to calculate some basic density features of these structures.
from matminer.datasets.dataset_retrieval import load_dataset
from matminer.featurizers.structure import DensityFeatures

df = load_dataset("phonon_dielectric_mp")
densityf = DensityFeatures()
print(densityf.feature_labels())
Output:
['density', 'vpa', 'packingfraction']
These are the features we will get. Now, we use featurize_dataframe() to generate these features for all samples in the dataframe. Because the featurizer takes structures as input, we select the "structure" column.
import pandas as pd

# Show all DataFrame columns (None means show all; a number can also be given)
pd.set_option('display.max_columns', None)
# Stop the DataFrame output from wrapping lines (False disables wrapping, True enables it)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
print("---" * 20)

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    print(df.head())
Output:
         mpid  eps_electronic  eps_total  last phdos peak  structure  formula
0     mp-1000        6.311555  12.773454        98.585771  [[2.8943817 2.04663693 5.01321616] Te, [0. 0....     BaTe
1  mp-1002124       24.137743  32.965593       677.585725  [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...      HfC
2  mp-1002164        8.111021  11.169464       761.585719  [[0. 0. 0.] Ge, [3.45311592 3.45311592 -3.45...      GeC
3    mp-10044       10.032168  10.128936       701.585723  [[0.98372595 0.69559929 1.70386332] B, [0. 0....      BAs
4  mp-1008223        3.979201   6.394043       204.585763  [[0. 0. 0.] Ca, [4.95 4.95 -4.95] Se]     CaSe
------------------------------------------------------------
         mpid  ...  formula   density        vpa  packingfraction
0     mp-1000  ...     BaTe  4.937886  44.545547         0.596286
1  mp-1002124  ...      HfC  9.868234  16.027886         0.531426
2  mp-1002164  ...      GeC  5.760895  12.199996         0.394180
3    mp-10044  ...      BAs  5.087634  13.991016         0.319600
4  mp-1008223  ...     CaSe  2.750191  35.937000         0.428523
Conversion featurizers
In addition to band structure/DOS/structure/composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects in a fault-tolerant way (for example, adding oxidation states to a composition). These featurizers can be found in matminer.featurizers.conversions and use the same featurize()/featurize_dataframe() syntax as the other featurizers.
The dataset we loaded earlier only contains a formula column with string objects. To convert these data into a composition column containing pymatgen Composition objects, we can use the StrToComposition conversion featurizer on the formula column.
import pandas as pd

# Show all DataFrame columns (None means show all; a number can also be given)
pd.set_option('display.max_columns', None)
# Stop the DataFrame output from wrapping lines (False disables wrapping, True enables it)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
print("---" * 20)

from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

if __name__ == '__main__':
    df = stc.featurize_dataframe(df, "formula")
    print(df.head())
Output:
         mpid  eps_electronic  eps_total  last phdos peak  structure  formula
0     mp-1000        6.311555  12.773454        98.585771  [[2.8943817 2.04663693 5.01321616] Te, [0. 0....     BaTe
1  mp-1002124       24.137743  32.965593       677.585725  [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...      HfC
2  mp-1002164        8.111021  11.169464       761.585719  [[0. 0. 0.] Ge, [3.45311592 3.45311592 -3.45...      GeC
3    mp-10044       10.032168  10.128936       701.585723  [[0.98372595 0.69559929 1.70386332] B, [0. 0....      BAs
4  mp-1008223        3.979201   6.394043       204.585763  [[0. 0. 0.] Ca, [4.95 4.95 -4.95] Se]     CaSe
------------------------------------------------------------
         mpid  ...  formula  composition
0     mp-1000  ...     BaTe     (Ba, Te)
1  mp-1002124  ...      HfC      (Hf, C)
2  mp-1002164  ...      GeC      (Ge, C)
3    mp-10044  ...      BAs      (B, As)
4  mp-1008223  ...     CaSe     (Ca, Se)
Advanced capabilities
Before we get to the exercises, featurizers have a few powerful capabilities worth mentioning (and more that are not covered here).
Handling errors
Often, data is messy and certain featurizers will encounter errors. Set ignore_errors=True in featurize_dataframe() to skip errors; you can also set return_errors=True if you would like the errors returned in an additional column.
Citing the authors
Many featurizers are implemented using methods found in peer-reviewed studies. Please cite these original works using the citations() method, which returns BibTeX-formatted references in a Python list.
3. Machine learning models
In parts 1 and 2, we demonstrated how to download datasets and add machine-learnable features. In part 3, we show how to train machine learning models on a dataset and analyze the results.
Scikit-learn
This part makes extensive use of the scikit-learn package, an open-source Python package for machine learning. Matminer is designed to make machine learning with scikit-learn as easy as possible. Other machine learning packages also exist, such as those implementing neural network architectures. These packages can also be used with matminer but are outside the scope of this workshop.
Loading and preparing a pre-featurized dataset
First, let's load a dataset that we can use for machine learning. Beforehand, we have already added some composition and structure features to the elastic_tensor_2015 dataset used in exercises 1 and 2.
import pandas as pd

# Show all DataFrame columns (None means show all; a number can also be given)
pd.set_option('display.max_columns', None)
# Stop the DataFrame output from wrapping lines (False disables wrapping, True enables it)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    print(df.head())
Output:
  material_id    formula  nsites  space_group      volume  structure  elastic_anisotropy     G_Reuss       G_VRH     G_Voigt  K_Reuss ...
0    mp-10003    Nb4CoSi      12          124  194.419802  [[0.94814328 2.07280467 2.5112] Nb, [5.273...            0.030688   96.844535   97.141604   97.438674 ...
1    mp-10010  Al(CoSi)2       5          164   61.987320  [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278...            0.266910   93.939650   96.252006   98.564362 ...
2    mp-10015       SiOs       2          221   25.952539  [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os]            0.756489  120.962289  130.112955  139.263621 ...
3    mp-10021         Ga       4           63   76.721433  [[0. 1.09045794 0.84078375] Ga, [0....            2.376805   12.205989   15.101901   17.997812   49.025...
4    mp-10025      SiRu2      12           62  160.300999  [[1.0094265 4.24771709 2.9955487] Si, [3.028...            0.196930  100.110773  101.947798  103.784823 ...
We first need to partition the dataset into the "target" property and the "features" used for learning. In this model, we will use the bulk modulus (K_VRH) as the target property. We use the values attribute of the dataframe to get the target property as a numpy array, rather than as a pandas Series object.
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    print(y)
Output:
[194.26888436 175.44990675 295.07754499 ...  89.41816126  99.3845653
  35.93865993]
Machine learning algorithms can only be trained using numerical features. We therefore need to remove any non-numerical columns from our dataset. In addition, we want to remove the K_VRH column from the feature set, as the model should not know the target property in advance.
The dataset loaded above includes the structure, formula and composition columns that were previously used to generate the machine-learnable features. Let's remove them using the pandas drop() function discussed in section 1. Remember that axis=1 indicates we are dropping columns rather than rows.
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    print("There are {} possible descriptors:".format(X.shape[1]))
    print(X.columns)
Output:
There are 14 possible descriptors:
Index(['nsites', 'space_group', 'volume', 'elastic_anisotropy', 'G_Reuss',
       'G_VRH', 'G_Voigt', 'K_Reuss', 'K_Voigt', 'poisson_ratio',
       'kpoint_density', 'density', 'vpa', 'packingfraction'],
      dtype='object')
Trying a random forest model with scikit-learn
The scikit-learn library makes it easy to use the features we have generated to train machine learning models. It implements a variety of different regression models and contains tools for cross-validation.
To save time, we will only try a single model in this example, but it is best practice to experiment with several models to see which performs best for your machine learning problem. A good "starting" model is the random forest model. Let's create a random forest model:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=1)
Note that we created the model with the number of estimators (n_estimators) set to 100. n_estimators is an example of a machine learning hyperparameter. Most models contain many tunable hyperparameters. To achieve good performance, it is necessary to fine-tune these parameters for each individual machine learning problem. There is currently no easy way to know the optimal hyperparameters in advance; typically, trial and error is used.
We can now train our model to use the input features (X) to predict the target property (y). This is achieved using the fit() function:
rf.fit(X, y)
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
Evaluating model performance
Next, we need to assess the performance of our model. To do this, we first ask the model to predict the bulk modulus for every entry in our original dataframe:
y_pred = rf.predict(X)
Next, we can check the accuracy of our model by looking at the root-mean-square error (RMSE) of our predictions. Scikit-learn provides a mean_squared_error() function to calculate the mean squared error. We then take its square root to obtain our final performance metric:
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

import numpy as np
from sklearn.metrics import mean_squared_error

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    y_pred = rf.predict(X)
    mse = mean_squared_error(y, y_pred)
    print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
Output:
training RMSE = 0.801 GPa
An RMSE of 0.801 GPa looks reasonable! However, as the model was trained and evaluated on exactly the same data, this is not a true estimate of how the model will perform on unseen materials (the main purpose of machine learning studies).
Cross validation
To get a more accurate estimate of predictive performance, and to verify that we are not overfitting, we need to check the cross-validation score rather than the fitting score.
In cross-validation, the data is randomly partitioned into n "splits" (in this case 10), with each split containing roughly the same number of samples. The model is trained on n−1 splits (the training set), and its performance is assessed by comparing the predictions to the actual values for the final split (the test set). This process is repeated until every split has been used as the test set at some point. The cross-validation score is the average score across all test sets.
There are many ways of partitioning the data into splits. In this example, we use the KFold method and choose the number of splits to be 10, i.e., 90% of the data is used for training and 10% is used as the test set.
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
Note that we set random_state=1 to ensure every participant obtains the same answer for their model.
Finally, the cross-validation score can be obtained automatically using scikit-learn's cross_val_score() function. This function requires a machine learning model, the input features, and the target property as arguments. Note that we pass the kfold object as the cv argument to make cross_val_score() use the correct test/train splits.
For each split, the model is trained from scratch before its performance is evaluated. As we have to train and predict 10 times, cross-validation usually takes some time to perform. In our case, the model is quite small, so this process only takes about a minute. The final cross-validation score is the average across all splits.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    y_pred = rf.predict(X)
    mse = mean_squared_error(y, y_pred)
    print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
    rmse_scores = [np.sqrt(abs(s)) for s in scores]
    print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
Output:
training RMSE = 0.801 GPa
Mean RMSE: 1.731
Note that our RMSE is somewhat larger, because it now reflects the true predictive power of the model. Still, a mean RMSE of ~1.7 GPa is not bad!
Visualizing model performance
We can visualize the predictive performance of our model by plotting our predictions against the actual values for every sample in the test set of each test/train split.
First, we obtain the predicted values for each split's test set using the cross_val_predict method. This is similar to the cross_val_score method, except that it returns the actual predictions rather than the model score.
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(rf, X, y, cv=kfold)
For plotting, we use matminer's PlotlyFig module, which can help you rapidly generate publication-ready plots. PlotlyFig can produce many different plot types; explaining its use in detail is beyond the scope of this tutorial, but examples of the available plots can be found in the matminer_examples repository.
from matminer.figrecipes.plot import PlotlyFig

pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               mode='notebook')
pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])],
      labels=df['formula'],
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}],
      showlegends=False)

# This code needs to be run in Jupyter.
# * The easiest way to get Jupyter is to install JupyterLab and Jupyter Notebook via Anaconda.
# * Per the figrecipes manual, iopub_data_rate_limit must be set to 1.0e10 in the Jupyter notebook
#   configuration, otherwise plotting fails; this can be done from the terminal of the current Python environment.
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from matminer.figrecipes.plot import PlotlyFig

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    # rf.fit(X, y)
    # y_pred = rf.predict(X)
    # mse = mean_squared_error(y, y_pred)
    # print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    # scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
    # rmse_scores = [np.sqrt(abs(s)) for s in scores]
    # print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
    y_pred = cross_val_predict(rf, X, y, cv=kfold)
    pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
                   y_title='Predicted bulk modulus (GPa)',
                   mode='notebook')
    pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])],
          labels=df['formula'],
          modes=['markers', 'lines'],
          lines=[{}, {'color': 'black', 'dash': 'dash'}],
          showlegends=False)
Not bad! However, there are clearly some outliers (you can hover over the points with your mouse to see which materials they are).
Model interpretation
An important aspect of machine learning is being able to understand why a model makes certain predictions. Random forest models are particularly easy to interpret, as they possess a feature_importances_ attribute that contains the importance of each feature in deciding the final prediction. Let's look at the feature importances of our model.
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    print(rf.feature_importances_)
Output:
[1.33190021e-05 1.59243590e-05 3.00289964e-05 1.31361719e-04
 4.30431755e-04 4.14454063e-04 2.24079389e-04 8.12596104e-01
 1.85751384e-01 1.10635270e-04 1.32575216e-06 3.02149617e-05
 4.80359906e-05 2.02700688e-04]
To make sense of these values, we need to know which feature each number corresponds to. We can use PlotlyFig to plot the importances of the 5 most important features.
importances = rf.feature_importances_
included = X.columns.values
indices = np.argsort(importances)[::-1]

pf = PlotlyFig(y_title='Importance (%)',
               title='Feature by importances',
               mode='notebook')
pf.bar(x=included[indices][0:5], y=importances[indices][0:5])
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

from matminer.figrecipes.plot import PlotlyFig
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")

from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)

if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH",
                 'elastic_tensor_original', 'poscar', 'compliance_tensor',
                 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    importances = rf.feature_importances_
    included = X.columns.values
    indices = np.argsort(importances)[::-1]
    pf = PlotlyFig(y_title='Importance (%)',
                   title='Feature by importances',
                   mode='notebook')
    pf.bar(x=included[indices][0:5], y=importances[indices][0:5])
4. Automated machine learning with Automatminer
Automatminer is a package for automatically creating ML pipelines using matminer's featurizers, feature-reduction techniques, and automated machine learning (AutoML). Automatminer works end to end, from raw data to predictions, with no human input required.
* Put in a dataset, get out a machine that predicts materials properties.
* Automatminer is competitive with state-of-the-art, hand-tuned machine learning models in several domains of materials informatics.
* Automatminer also includes utilities for running MatBench, a materials-science ML benchmark.
* Learn more about Automatminer and MatBench in the documentation.
How does Automatminer work? Automatminer automatically "decorates" a dataset using hundreds of descriptor techniques from matminer's descriptor library, picks the most useful features for learning, and runs a separate AutoML pipeline. Once the pipeline has been fit, it can be summarized in a text file, saved to disk, or used to make predictions on new materials.
Automatminer overview: materials primitives (such as crystal structures) go in one end, and property predictions come out the other. MatPipe handles the intermediate operations, such as assigning descriptors, cleaning problematic data, data conversions, imputation, and machine learning.
MatPipe: `MatPipe` is Automatminer's central object. It has an sklearn BaseEstimator syntax for "fit" and "predict" operations: simply "fit" on your training data, then "predict" on your test data. MatPipe uses pandas DataFrames as input and output: put in DataFrames (of materials), get out DataFrames (of property predictions).
Overview: in this section, we will go over the basic steps of training and predicting data with Automatminer. We will also use Automatminer's introspection API to look inside our automated pipeline.
* First, we will load a dataset of about 4,600 dielectric constants from the Materials Project.
* Next, we will fit an Automatminer `MatPipe` (pipeline) to the data.
* Then, we will predict dielectric constants from structures and see how our predictions did (note: this is not an easy problem!).
* We will use `MatPipe`'s introspection methods to examine our pipeline.
* Finally, we will look at how to save and load a pipeline for reproducible predictions.
* Note: for brevity, we use a single train-test split in this notebook. To run a full Automatminer benchmark, see the benchmark docs.
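The fit/predict, DataFrame-in/DataFrame-out pattern described above can be illustrated with a toy estimator. `MeanPipe` below is a hypothetical stand-in that just predicts the training mean, using lists of dicts in place of DataFrames:

```python
class MeanPipe:
    """Toy stand-in for MatPipe: fit on training rows, predict on new rows.

    A real MatPipe featurizes, cleans, and runs AutoML; this sketch only
    predicts the mean of the training target, to show the interface shape.
    """

    def fit(self, rows, target):
        values = [row[target] for row in rows]
        self.target_ = target
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self, rows):
        # Return new rows with a "<target> predicted" column added,
        # mirroring MatPipe's output column naming.
        col = self.target_ + " predicted"
        return [dict(row, **{col: self.mean_}) for row in rows]


train = [{"structure": "s1", "n": 1.5}, {"structure": "s2", "n": 2.5}]
test = [{"structure": "s3"}]
pipe = MeanPipe().fit(train, "n")
print(pipe.predict(test))  # [{'structure': 's3', 'n predicted': 2.0}]
```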
Preparing a dataset for machine learning. Let's load a dataset to play with. In this example, we will use matminer to load one of the MatBench v0.1 datasets.
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    print(df.head())
Result:
                                           structure         n
0  [[4.29304147 2.4785886  1.07248561] S, [4.2930...  1.752064
1  [[3.95051434 4.51121437 0.28035002] K, [4.3099...  1.652859
2  [[-1.78688104 4.79604117 1.53044621] Rb, [-1...   1.867858
3  [[4.51438064 4.51438064 0.] Mn, [0.133...         2.676887
4  [[-4.36731958 6.8886097  0.50929706] Li, [-2...   1.793232
Inspecting the dataset, we can see that it has only the "structure" and "n" (dielectric constant) columns.
Next, we can generate a train-test split for evaluating Automatminer.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
Let's remove the target property from the test DataFrame so we can be sure we aren't giving Automatminer any test information.
Our target variable is "n".
target = "n"
prediction_df = test_df.drop(columns=[target])
prediction_df.head()
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    print(prediction_df.head())
Result:
                                              structure
1802  [[3.71205866 2.14315394 1.14375057] Si, [-3.71...
1881  [[0. 0. 0.] Cd, [1.35314892 0.95682078 2.34372...
1288  [[-0.50714072 4.9893142  6.08288682] K, [-1....
4490  [[3.90704797 2.76270011 6.76720559] Si, [0.558...
32    [[1.91506173 1.23473956 4.58373805] P, [5.553...
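Under the hood, the shuffled 80/20 split that `train_test_split` performs amounts to a seeded shuffle followed by an index cut. A minimal sketch with the standard library (using a list of integers as stand-in rows):

```python
import random

data = list(range(10))  # stand-in for the dataframe's rows

rng = random.Random(20191014)  # fixed seed, like random_state above
shuffled = data[:]
rng.shuffle(shuffled)

split = int(len(shuffled) * 0.8)  # test_size=0.2 -> 80% train
train, test = shuffled[:split], shuffled[split:]

print(len(train), len(test))  # 8 2
assert set(train) | set(test) == set(data)  # no rows lost or duplicated
```

Fixing the seed is what makes the split reproducible from run to run.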
Fitting and predicting with Automatminer's MatPipe
Now we have everything we need to start our AutoML pipeline. For simplicity, we will use a MatPipe preset. MatPipe is highly customizable with hundreds of configuration options, but most use cases are served by one of the preset configurations, accessed with the from_preset method.
In this example, for time reasons, we will use the "debug" preset, which spends about 1.5 minutes on machine learning. If you have more time, the "express" preset is a good choice.
from automatminer import MatPipe

pipe = MatPipe.from_preset("debug")
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
Result:
(Several FutureWarning deprecation messages from sklearn/joblib, omitted here.)
Fitting the pipeline
To fit an Automatminer MatPipe to your data, pass in your training data and the desired target:
pipe.fit(train_df, target)
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
Result:
2021-03-15 20:45:47 INFO Problem type is: regression
2021-03-15 20:45:47 INFO Fitting MatPipe pipeline to data.
2021-03-15 20:45:47 INFO AutoFeaturizer: Starting fitting.
2021-03-15 20:45:47 INFO AutoFeaturizer: Adding compositions from structures.
2021-03-15 20:45:47 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
StructureToOxidStructure: 0%| | 0/3811 [00:00<?, ?it/s]
StructureToComposition: 0%| | 0/3811 [00:00<?, ?it/s]
ComplexWarning: Casting complex values to real discards the imaginary part
zeros[:len(eigs)] = eigs
SineCoulombMatrix: 100%|██████████| 3811/3811 [00:17<00:00, 219.80it/s]
2021-03-15 20:48:44 INFO AutoFeaturizer: Finished transforming.
2021-03-15 20:48:44 INFO DataCleaner: Starting fitting.
2021-03-15 20:48:44 INFO DataCleaner: Cleaning with respect to samples with sample na_method 'drop'
2021-03-15 20:48:44 INFO DataCleaner: Replacing infinite values with nan for easier screening.
2021-03-15 20:48:44 INFO DataCleaner: Before handling na: 3811 samples, 421 features
2021-03-15 20:48:44 INFO DataCleaner: Handling feature na by max na threshold of 0.01 with method 'drop'.
2021-03-15 20:48:44 INFO DataCleaner: After handling na: 3811 samples, 421 features
2021-03-15 20:48:44 INFO DataCleaner: Finished fitting.
2021-03-15 20:48:44 INFO FeatureReducer: Starting fitting.
2021-03-15 20:48:45 INFO FeatureReducer: 285 features removed due to cross correlation more than 0.95
2021-03-15 20:52:46 INFO TreeFeatureReducer: Finished tree-based feature reduction of 135 initial features to 13
2021-03-15 20:52:46 INFO FeatureReducer: Finished fitting.
2021-03-15 20:52:46 INFO FeatureReducer: Starting transforming.
2021-03-15 20:52:46 INFO FeatureReducer: Finished transforming.
2021-03-15 20:52:46 INFO TPOTAdaptor: Starting fitting.
27 operators have been imported by TPOT.
Optimization Progress: 0%| | 0/10 [00:00<?, ?pipeline/s]
ExtraTreesRegressor(StandardScaler(SelectPercentile(input_matrix, SelectPercentile__percentile=99)), ExtraTreesRegressor__bootstrap=True, ...)
Optimization Progress: 100%|██████████| 30/30 [00:57<00:00, 1.41s/pipeline] Generation 2 - Current Pareto front scores:
-3 -0.465852 ExtraTreesRegressor(StandardScaler(SelectPercentile(input_matrix, SelectPercentile__percentile=99)), ExtraTreesRegressor__bootstrap=...)
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.
TPOT closed prematurely. Will use the current best pipeline.
2021-03-15 20:53:51 INFO TPOTAdaptor: Finished fitting.
2021-03-15 20:53:51 INFO MatPipe successfully fit.
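The FeatureReducer step in the log above drops one of every pair of features whose cross-correlation exceeds 0.95. A minimal sketch of that idea, with a hand-rolled Pearson correlation and hypothetical feature columns (Automatminer's actual implementation differs in detail):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def reduce_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, kc)) <= threshold for kc in kept.values()):
            kept[name] = column
    return list(kept)

# Hypothetical feature columns: "b" is a near-copy of "a", "c" is independent.
features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.1, 2.0, 3.1, 4.0],   # correlated with "a" above 0.95
    "c": [4.0, 1.0, 3.0, 2.0],
}
print(reduce_correlated(features))  # ['a', 'c']
```

Dropping near-duplicate columns like "b" shrinks the feature matrix without losing information, which is why the log shows 285 of the original features removed at this stage.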
Predicting new data
Our MatPipe is now fit. Let's predict our test data with predict. This should only take a few minutes.
prediction_df = pipe.predict(prediction_df)
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)
Result:
2020-07-27 14:36:25 INFO Beginning MatPipe prediction using fitted pipeline.
2020-07-27 14:36:25 INFO AutoFeaturizer: Starting transforming.
2020-07-27 14:36:25 INFO AutoFeaturizer: Adding compositions from structures.
2020-07-27 14:36:25 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
2020-07-27 14:37:09 INFO AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.
2020-07-27 14:37:15 INFO AutoFeaturizer: Featurizing with ElementProperty.
2020-07-27 14:37:19 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
2020-07-27 14:37:22 INFO AutoFeaturizer: Featurizing with SineCoulombMatrix.
2020-07-27 14:37:28 INFO AutoFeaturizer: Finished transforming.
2020-07-27 14:37:28 INFO DataCleaner: Starting transforming.
2020-07-27 14:37:28 INFO DataCleaner: Cleaning with respect to samples with sample na_method 'fill'
2020-07-27 14:37:28 INFO DataCleaner: Replacing infinite values with nan for easier screening.
2020-07-27 14:37:28 INFO DataCleaner: Before handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO DataCleaner: After handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO DataCleaner: Finished transforming.
2020-07-27 14:37:28 INFO FeatureReducer: Starting transforming.
2020-07-27 14:37:28 WARNING FeatureReducer: Target not found in columns to transform.
2020-07-27 14:37:28 INFO FeatureReducer: Finished transforming.
2020-07-27 14:37:28 INFO TPOTAdaptor: Starting predicting.
2020-07-27 14:37:28 INFO TPOTAdaptor: Prediction finished successfully.
2020-07-27 14:37:28 INFO TPOTAdaptor: Finished predicting.
2020-07-27 14:37:28 INFO MatPipe prediction completed.
Examine the prediction DataFrame
MatPipe puts the predictions in a column called "{target} predicted":
prediction_df.head()
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)
    print(prediction_df.head())
Result:
      MagpieData range AtomicWeight  ...  n predicted
1802                     102.710600  ...     1.951822
1881                      15.189000  ...     3.295348
1288                      49.380600  ...     1.656971
4490                       0.000000  ...     4.706100
32                         2.572238  ...     2.754411
Score predictions
Now let's score our predictions with the mean absolute error and compare them to sklearn's dummy regressor.
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe
from sklearn.metrics import mean_absolute_error
from sklearn.dummy import DummyRegressor

if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)

    # Fit the dummy regressor
    dr = DummyRegressor()
    dr.fit(train_df["structure"], train_df[target])
    dummy_test = dr.predict(test_df["structure"])

    # Score dummy and MatPipe
    true = test_df[target]
    matpipe_test = prediction_df[target + " predicted"]
    mae_matpipe = mean_absolute_error(true, matpipe_test)
    mae_dummy = mean_absolute_error(true, dummy_test)
    print("Dummy MAE: {}".format(mae_dummy))
    print("MatPipe MAE: {}".format(mae_matpipe))
Result:
Dummy MAE: 0.7772666142371938
MatPipe MAE: 0.5582
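The comparison above hinges on the mean absolute error. A minimal sketch with made-up values (not from the dielectric dataset) shows why a mean-predicting dummy sets the baseline a real model must beat:

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions, for illustration only.
y_true = [1.5, 2.0, 3.5, 4.0]

# DummyRegressor's default strategy predicts the training mean everywhere.
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)   # 2.75 for every sample

# A hypothetical model that tracks the truth more closely.
model_pred = [1.6, 2.2, 3.3, 3.8]

print(mae(y_true, mean_pred))   # 1.0
print(mae(y_true, model_pred))  # about 0.175
```

A model whose MAE is below the dummy's (as MatPipe's 0.5582 is below 0.777 above) has learned something beyond the target's average.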
Examining the internals of MatPipe
Examine the internals of a MatPipe with the dict/text summaries from inspect (a long, comprehensive version with the names of all relevant attributes) or summarize (an executive summary).
import pprint
# Get a summary and save a copy to json
summary = pipe.summarize(filename="MatPipe_predict_experimental_gap_from_composition_")
pprint.pprint(summary)
Result:
{'data_cleaning':{'drop_na_targets':'True',
'encoder':'one-hot',
'feature_na_method':'drop',
'na_method_fit':'drop',
'na_method_transform':'fill'},
'feature_reduction':{'reducer_params':"{'tree':{'importance_percentile':"
"0.9,'mode':'regression',"
"'random_state':0}}",
'reducers':"('corr','tree')"},
'features':['MagpieDatarangeAtomicWeight',
'MagpieDataavg_devAtomicWeight',
'MagpieDatameanMeltingT',
'MagpieDatamaximumElectronegativity',
'MagpieDatameanElectronegativity',
'MagpieDataavg_devElectronegativity',
'MagpieDataavg_devNUnfilled',
'MagpieDatameanGSvolume_pa',
'sinecoulombmatrixeig0',
'sinecoulombmatrixeig6',
'sinecoulombmatrixeig7'],
'featurizers':{'bandstructure':[BandFeaturizer()],
'composition':[ElementProperty(data_source=
features=['Number','MendeleevNumber','AtomicWeight',
'MeltingT','Column','Row','CovalentRadius',
'Electronegativity','NsValence','NpValence',
'NdValence','NfValence','NValence','NsUnfilled',
'NpUnfilled','NdUnfilled','NfUnfilled','NUnfilled',
'GSvolume_pa','GSbandgap','GSmagmom',
'SpaceGroupNumber'],
stats=['minimum','maximum','range','mean','avg_dev',
'mode'])],
'dos':[DOSFeaturizer()],
'structure':[SineCoulombMatrix()]},
 'ml_model': "Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),\n"
             "steps=[('selectpercentile',\n"
             "SelectPercentile(percentile=23,\n"
             "score_func=<function f_regression at 0x7f92217f2040>)),\n"
             "('robustscaler', RobustScaler()),\n"
             "('randomforestregressor',\n"
             "RandomForestRegressor(bootstrap=False, max_features=0.05,\n"
             "min_samples_leaf=7, min_samples_split=5,\n"
             "n_estimators=20))])"}
# Explain the MatPipe's internals more comprehensively
details = pipe.inspect(filename="MatPipe_predict_experimental_gap_from_composition_")
print(details)
Result:
{'autofeaturizer': {'autofeaturizer': {'cache_src': None, 'preset': 'debug',
    'featurizers': {'composition': [ElementProperty(data_source=
                    features=['Number', 'MendeleevNumber', 'AtomicWeight',
                              'MeltingT', 'Column', 'Row', 'CovalentRadius',
                              'Electronegativity', 'NsValence', 'NpValence',
                              'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
                              'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
                              'GSvolume_pa', 'GSbandgap', 'GSmagmom',
                              'SpaceGroupNumber'],
                    stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
                           'mode'])],
                    'structure': [SineCoulombMatrix()],
                    'bandstructure': [BandFeaturizer()],
                    'dos': [DOSFeaturizer()]},
    'exclude': [], 'functionalize': False, ...},
 ...
 (The full output also prints the entire TPOT search space -- every candidate
 regressor, scaler, and feature selector with its hyperparameter grid -- and is
 truncated here for brevity.)
 ...
 max_eval_time_mins=1, max_time_mins=1, memory='auto', n_jobs=2,
 population_size=10, scoring='neg_mean_absolute_error',
 template='Selector-Transformer-Regressor', verbosity=3),
 '_features': ['MagpieData range AtomicWeight', 'MagpieData avg_dev AtomicWeight', ...]}
Accessing MatPipe's internal objects directly
You can access MatPipe's internal objects directly, instead of through text summaries; you just need to know which attributes to access. See the online API documentation or the source code for more information.
# Access some attributes of MatPipe directly, instead of via a text digest
print(pipe.learner.best_pipeline)
Result:
Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),
         steps=[('selectpercentile',
                 SelectPercentile(percentile=23,
                                  score_func=<function f_regression at 0x7f92217f2040>)),
                ('robustscaler', RobustScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=False, max_features=0.05,
                                       min_samples_leaf=7, min_samples_split=5,
                                       n_estimators=20))])
print(pipe.autofeaturizer.featurizers["composition"])
Result:
[ElementProperty(data_source=
features=['Number','MendeleevNumber','AtomicWeight',
'MeltingT','Column','Row','CovalentRadius',
'Electronegativity','NsValence','NpValence',
'NdValence','NfValence','NValence','NsUnfilled',
'NpUnfilled','NdUnfilled','NfUnfilled','NUnfilled',
'GSvolume_pa','GSbandgap','GSmagmom',
'SpaceGroupNumber'],
stats=['minimum','maximum','range','mean','avg_dev',
'mode'])]
print(pipe.autofeaturizer.featurizers["structure"])
Result:
[SineCoulombMatrix()]
Persistence of pipelines
Being able to reproduce your results is an important aspect of materials informatics. MatPipe provides convenient methods for saving and loading entire pipelines for use by others.
Save a MatPipe for later use with save. Load it with MatPipe.load.
filename = "MatPipe_predict_experimental_gap_from_composition.p"
pipe.save(filename)
pipe_loaded = MatPipe.load(filename)
Result:
2020-07-27 14:37:33 INFO Loaded MatPipe from file MatPipe_predict_experimental_gap_from_composition.p.
2020-07-27 14:37:33 WARNING Only use this model to make predictions (do not retrain!). Backend was serialized as only the top model, not the full automl backend.
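MatPipe handles its own serialization, but the save/load round-trip idea can be sketched with the standard library's pickle. `ToyModel` below is a hypothetical fitted model, not a MatPipe:

```python
import os
import pickle
import tempfile

class ToyModel:
    """Hypothetical fitted model to round-trip through a file."""
    def __init__(self, mean):
        self.mean_ = mean

    def predict(self, n_rows):
        return [self.mean_] * n_rows

model = ToyModel(mean=2.0)

# Save the fitted model to disk, then load it back for reproducible predictions.
path = os.path.join(tempfile.mkdtemp(), "toy_model.p")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    model_loaded = pickle.load(f)

print(model_loaded.predict(3))  # [2.0, 2.0, 2.0]
```

The loaded object predicts exactly as the saved one did, which is the property the MatPipe warning above is protecting: the reloaded pipeline is for prediction only, not for retraining.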
Published: 2022-11-14