When doing machine learning with PySpark, the ML library may not be fully functional and you may want to use other libraries such as scikit-learn.
The learning at that time needs to be done separately because Spark's DataFrame does not support it in the first place, but inference can be done smoothly by using UDF, so it is posted as a reminder.
If you have a trained model (model
: scikit-learn image), you can do as follows.
data
is the DataFrame of the inference data and features
is the list of explanatory variables.
Here, the result predicted by model.predict (x)
is returned, and it is necessary to replace it with the prediction function of the created model as appropriate.
Similarly, if the return value is a continuous value, change it to DoubleType ()
.
Inference using a trained model on pyspark
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType
@pandas_udf(returnType=IntegerType())
def predict_udf(*cols):
X = pd.concat(cols, axis=1)
return pd.Series(model.predict(X))
data.withColumn('predict', predict_udf(*features))
Recommended Posts