Using distributed XGBoost with PySpark

Updated: 2023-05-20 10:04:51

Tested and verified working.
Environment:
Python 3.6.5
PySpark: 2.4.5
Spark: 2.4.3
Steps:
Step 1: Set up the environment.
Step 2: Download the required files:
1. xgboost4j-0.72.jar
2. xgboost4j-spark-0.72.jar
3. sparkxgb.zip
Step 3:
1. Key point 1: add xgboost4j-0.72.jar and xgboost4j-spark-0.72.jar to the job (via --jars or the spark.jars config).
2. Key point 2: every executor needs the Python package, so call spark.sparkContext.addPyFile("hdfs:///xxxx/xxx/sparkxgb.zip").
3. Place the three files above in:
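For cluster submission, key point 1 typically looks like the spark-submit invocation below. This is a sketch only: the master, the HDFS path, and the script name `your_training_script.py` are placeholder assumptions, not from the original post.

```shell
# Ship the two jars to the JVM classpath and the Python package to every executor.
# --py-files is an alternative to calling sparkContext.addPyFile() in code.
spark-submit \
  --master yarn \
  --jars xgboost4j-0.72.jar,xgboost4j-spark-0.72.jar \
  --py-files hdfs:///xxxx/xxx/sparkxgb.zip \
  your_training_script.py
```

Either way, both jars must reach the JVM side and sparkxgb must reach the Python side of every executor.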
Code example:
import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars xgboost4j-spark-0.72.jar,xgboost4j-0.72.jar pyspark-shell'  # running locally: put the jars next to this script; when submitting a job, pass them via --jars
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("PythonWordCount") \
    .getOrCreate()
spark.sparkContext.addPyFile("hdfs:///xxxx/xxx/sparkxgb.zip")  # path to sparkxgb.zip; use a local path when running locally
from sparkxgb import XGBoostEstimator  # import only after addPyFile has distributed the package
# Load Data
dataPath = "xxx\\spark-2.4.3-bin-hadoop2.6\\data\\mllib\\sample_binary_classification_data.txt"
dataDF = spark.read.format("libsvm").load(dataPath)

# Split into Train/Test
trainDF, testDF = dataDF.randomSplit([0.8, 0.2], seed=1000)
# Define and train model
xgboost = XGBoostEstimator(
    # General Params
    nworkers=1, nthread=1, checkpointInterval=-1, checkpoint_path="",
    use_external_memory=False, silent=0, missing=float("nan"),
    # Column Params
    featuresCol="features", labelCol="label", predictionCol="prediction",
    weightCol="weight", baseMarginCol="baseMargin",
    # Booster Params
    booster="gbtree", base_score=0.5, objective="binary:logistic", eval_metric="error",
    num_class=2, num_round=2, seed=None,
    # Tree Booster Params
    eta=0.3, gamma=0.0, max_depth=6, min_child_weight=1.0, max_delta_step=0.0, subsample=1.0,
    colsample_bytree=1.0, colsample_bylevel=1.0, reg_lambda=0.0, alpha=0.0, tree_method="auto",
    sketch_eps=0.03, scale_pos_weight=1.0, grow_policy='depthwise', max_bin=256,
    # Dart Booster Params
    sample_type="uniform", normalize_type="tree", rate_drop=0.0, skip_drop=0.0,
    # Linear Booster Params
    lambda_bias=0.0
)
xgboost_model = xgboost.fit(trainDF)
# Transform test set
xgboost_model.transform(testDF).show()

# Write model/classifier
xgboost.write().overwrite().save("xgboost_class_test")
xgboost_model.write().overwrite().save("xgboost_model")
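The sample data loaded above is in LIBSVM format: each line is `label index1:value1 index2:value2 ...` with sorted, conventionally 1-based feature indices. A minimal plain-Python sketch of the format for illustration only (Spark's libsvm reader does this at scale and converts indices to 0-based vectors):

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-formatted line into (label, {index: value})."""
    parts = line.strip().split()
    label = float(parts[0])
    # Remaining tokens are "index:value" pairs; only non-zero features are stored
    features = {int(idx): float(val)
                for idx, val in (tok.split(":") for tok in parts[1:])}
    return label, features

label, features = parse_libsvm_line("1 1:0.5 3:2.0")
print(label)     # 1.0
print(features)  # {1: 0.5, 3: 2.0}
```

The sparse encoding is why libsvm files suit binary classification samples with many zero features.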
Appendix: version test with xgboost4j-0.90.jar, xgboost4j-spark-0.90.jar, and sparkxgb.zip — this combination did not run successfully; the details are below.
Note: 0.90 has no XGBoostEstimator.
0.90 code example:
spark = SparkSession \
.builder \
.master("local") \
.appName("PythonWordCount") \
.getOrCreate()
spark.sparkContext.addPyFile("hdfs:///xxxx/xxx/sparkxgb.zip")
from sparkxgb import XGBoostClassifier  # 0.90 provides XGBoostClassifier instead of XGBoostEstimator
# Load Data
dataPath = "xxx\\spark-2.4.3-bin-hadoop2.6\\data\\mllib\\sample_binary_classification_data.txt"
dataDF = spark.read.format("libsvm").load(dataPath)

# Split into Train/Test
trainDF, testDF = dataDF.randomSplit([0.8, 0.2], seed=1000)
# Define and train model
xgboost = XGBoostClassifier(
    # General Params
    nworkers=1, nthread=1, checkpointInterval=-1, checkpoint_path="",
    use_external_memory=False, silent=0, missing=float("nan"),
    # Column Params
    featuresCol="features", labelCol="label", predictionCol="prediction",
    weightCol="weight", baseMarginCol="baseMargin",
    # Booster Params
    booster="gbtree", base_score=0.5, objective="binary:logistic", eval_metric="error",
    num_class=2, num_round=2, seed=None,
    # Tree Booster Params
    eta=0.3, gamma=0.0, max_depth=6, min_child_weight=1.0, max_delta_step=0.0, subsample=1.0,
    colsample_bytree=1.0, colsample_bylevel=1.0, reg_lambda=0.0, alpha=0.0, tree_method="auto",
    sketch_eps=0.03, scale_pos_weight=1.0, grow_policy='depthwise', max_bin=256,
    # Dart Booster Params
    sample_type="uniform", normalize_type="tree", rate_drop=0.0, skip_drop=0.0,
    # Linear Booster Params
    lambda_bias=0.0
)
xgboost_model = xgboost.fit(trainDF)
# Transform test set
xgboost_model.transform(testDF).show()

# Write model/classifier
xgboost.write().overwrite().save("xgboost_class_test")
xgboost_model.write().overwrite().save("xgboost_model")
This fails with:
Traceback (most recent call last):
  File "D:/gyl/scalaProgram/python_OwnerIdentify/test.py", line 48, in <module>
    missing=float("+inf"))
  File "D:\Program Files\python\python3\lib\site-packages\pyspark\__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "D:\software\bigData\pyspark_study-master\source\pyspark-xgboost\sparkxgb.zip\sparkxgb\xgboost.py", line 85, in __init__
  File "D:\software\bigData\pyspark_study-master\source\pyspark-xgboost\sparkxgb.zip\sparkxgb\common.py", line 68, in __init__
  File "D:\Program Files\python\python3\lib\site-packages\pyspark\ml\wrapper.py", line 67, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
This TypeError generally means the JVM could not find the corresponding Java class — i.e. the xgboost4j jars were not on the classpath; double-check the --jars / spark.jars settings and that the jar versions match the sparkxgb.zip version.
Note: in this version, use float("+inf") for missing values.
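Since the missing-value sentinel differs between versions (float("nan") in 0.72 vs float("+inf") here), a quick plain-Python reminder of how the two sentinels behave — NaN never compares equal to anything, including itself, so it must be detected with math.isnan rather than ==:

```python
import math

nan = float("nan")
pos_inf = float("+inf")

# NaN compares unequal to everything, even itself, so == checks silently fail
print(nan == nan)               # False
print(math.isnan(nan))          # True  -- the correct way to detect NaN

# +inf, by contrast, compares equal to itself and can be matched with ==
print(pos_inf == float("inf"))  # True
print(math.isinf(pos_inf))      # True
```

This is one reason libraries let you configure which sentinel marks a missing value: an equality-based check only works for +inf, not for NaN.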
