In-Database Python Analysis for SQL Developers
Step 5: Training and saving a model using T-SQL
In this step you will learn how to use the model trained in the previous step. "Use" here means "deploying the model to production for scoring." This is easy to expand because the Python code is contained in the stored procedure. To make new observation predictions from your application, just call the stored procedure.
There are two ways to call a Python model from a stored procedure:
-** Batch scoring mode : Use a SELECT query to serve multiple rows of data. The stored procedure returns a table of observations corresponding to the input - Individual scoring mode **: Pass a set of individual parameter values as input. The stored procedure returns a single record or value.
The stored procedure PredictTipSciKitPy
uses the scikit-learn model.
The stored procedure PredictTipSciKitPy
is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).
In Management Studio's Object Explorer, expand Programming> Stored Procedures.
Right-click on PredictTipSciKitPy
and select Modify to open the Transact-SQL script in a new query window.
DROP PROCEDURE IF EXISTS PredictTipSciKitPy;
GO
CREATE PROCEDURE [dbo].[PredictTipSciKitPy](@model varchar(50), @inquery nvarchar(max))
AS
BEGIN
DECLARE @lmodel2 varbinary(max) = (select model from nyc_taxi_models where name = @model);
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import pickle
import numpy
# import pandas
from sklearn import metrics
mod = pickle.loads(lmodel2)
X = InputDataSet[["passenger_count", "trip_distance", "trip_time_in_secs", "direct_distance"]]
y = numpy.ravel(InputDataSet[["tipped"]])
prob_array = mod.predict_proba(X)
prob_list = [item[1] for item in prob_array]
prob_array = numpy.asarray(prob_list)
fpr, tpr, thresholds = metrics.roc_curve(y, prob_array)
auc_result = metrics.auc(fpr, tpr)
print("AUC on testing data is:", auc_result)
OutputDataSet = pandas.DataFrame(data=prob_list, columns=["predictions"])
',
@input_data_1 = @inquery,
@input_data_1_name = N'InputDataSet',
@params = N'@lmodel2 varbinary(max)',
@lmodel2 = @lmodel2
WITH RESULT SETS ((Score float));
END
GO
--Specify the model name to use in the input parameters of the stored procedure.
--Fetch the serialized model from the nyc_taxi_models table based on the specified model name.
--The serialized model is stored in the Python variable mod
.
--The new scored case is taken from the Transact-SQL query specified by @ input_data_1
. The result of this query is saved in the default data frame ʻInputDat a Set. --This data frame is passed to the function
predict_proba of the logistic regression model
modcreated using the scikit-learn model. --The function
predict_proba returns a float value that indicates the probability of receiving an arbitrary amount of chips. --In addition, calculate the precision metric ʻAUC (area under curve)
. Precision metrics such as AUC are only generated if you specify the objective variable (ie the tapped column) in addition to the explanatory variables. The prediction does not require the objective variable (variable Y), but it is required to calculate the accuracy metric. Therefore, if the data to be scored does not have an objective variable, remove the AUC calculation block from the source code and modify the stored procedure to simply return the probability of receiving a chip by the explanatory variable (variable X).
The stored procedure PredictTipRxPy
uses a model created using the revoscalepy library. This works much like the procedure PredictTipSciKitPy
, but with some changes to the revoscalepy function.
The stored procedure PredictTipRxPy
is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).
In Management Studio's Object Explorer, expand Programming> Stored Procedures.
Right-click on PredictTipRxPy
and select Modify to open the Transact-SQL script in a new query window.
DROP PROCEDURE IF EXISTS PredictTipRxPy;
GO
CREATE PROCEDURE [dbo].[PredictTipRxPy](@model varchar(50), @inquery nvarchar(max))
AS
BEGIN
DECLARE @lmodel2 varbinary(max) = (select model from nyc_taxi_models where name = @model);
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import pickle
import numpy
# import pandas
from sklearn import metrics
from revoscalepy.functions.RxPredict import rx_predict
mod = pickle.loads(lmodel2)
X = InputDataSet[["passenger_count", "trip_distance", "trip_time_in_secs", "direct_distance"]]
y = numpy.ravel(InputDataSet[["tipped"]])
prob_array = rx_predict(mod, X)
prob_list = prob_array["tipped_Pred"].values
prob_array = numpy.asarray(prob_list)
fpr, tpr, thresholds = metrics.roc_curve(y, prob_array)
auc_result = metrics.auc(fpr, tpr)
print("AUC on testing data is:", auc_result)
OutputDataSet = pandas.DataFrame(data=prob_list, columns=["predictions"])
',
@input_data_1 = @inquery,
@input_data_1_name = N'InputDataSet',
@params = N'@lmodel2 varbinary(max)',
@lmodel2 = @lmodel2
WITH RESULT SETS ((Score float));
END
GO
The stored procedures PredictTipSciKitPy
and PredictTipRxPy
require two input parameters.
--Query to extract data to be scored --Trained model identifier used for scoring
In this section, you'll learn how to pass these arguments to a stored procedure to easily modify both the model and the data used for scoring.
Define the input data and call the stored procedure for scoring as follows: This example uses the stored procedure PredictTipSciKitPy
for scoring and passes the model name and query string.
DECLARE @query_string nvarchar(max) -- Specify input query
SET @query_string='
select tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance
from nyctaxi_sample_testing'
EXEC [dbo].[PredictTipSciKitPy] 'SciKit_model', @query_string;
The stored procedure returns an estimated probability of receiving a tip for each operation record passed as part of an input query. The predicted values are displayed in the results pane of Management Studio. The accuracy metric ʻAUC (area under the curve)` is output in the message pane.
To use the revoscalepy model for scoring, call the stored procedure PredictTipRxPy.
DECLARE @query_string nvarchar(max) -- Specify input query
SET @query_string='
select tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance
from nyctaxi_sample_testing'
EXEC [dbo].[PredictTipRxPy] 'revoscalepy_model', @query_string;
In some cases, instead of batch scoring, you want to pass a single case and get a single result based on that value. For example, configure an Excel worksheet, web application, or Reporting Services report to call a stored procedure based on user input.
In this section, you will learn how to call the stored procedures PredictTipSingleModeSciKitPy
and PredictTipSingleModeRxPy
to create a single prediction.
The stored procedures PredictTipSingleModeSciKitPy
and PredictTipSingleModeRxPy
are defined in SQL Server through Step 2: Import data into SQL Server using PowerShell.
In Management Studio's Object Explorer, expand Programming> Stored Procedures.
Right-click PredictTipSingleModeSciKitPy
or PredictTipSingleModeRxPy
and select Modify to open the Transact-SQL script in a new query window.
These stored procedures use the scikit-learn and revoscalepy models to perform scoring as follows:
--The model name and multiple single values are provided as input. These inputs include the number of passengers, driving distance, and so on.
--The table-valued function fnEngineerFeatures
takes latitude and longitude as input and converts them directly to distance.
--When calling a stored procedure from an external application, make sure that the input data matches the required input functionality of the Python model. This includes casting the input data to a Python data type and validating the data type and data length.
--The stored procedure creates a score based on the stored Python model.
Below is a stored procedure PredictTipSingleModeSciKitPy
that performs scoring using the scikit-learn model.
```SQL:PredictTipSingleModeSciKitPy
CREATE PROCEDURE [dbo].[PredictTipSingleModeSciKitPy](@model varchar(50), @passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0)
AS
BEGIN
DECLARE @inquery nvarchar(max) = N'
SELECT * FROM [dbo].[fnEngineerFeatures](
@passenger_count,
@trip_distance,
@trip_time_in_secs,
@pickup_latitude,
@pickup_longitude,
@dropoff_latitude,
@dropoff_longitude)
'
DECLARE @lmodel2 varbinary(max) = (select model from nyc_taxi_models where name = @model);
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import pickle
import numpy
# import pandas
# Load model and unserialize
mod = pickle.loads(model)
# Get features for scoring from input data
X = InputDataSet[["passenger_count", "trip_distance", "trip_time_in_secs", "direct_distance"]]
# Score data to get tip prediction probability as a list (of float)
prob = [mod.predict_proba(X)[0][1]]
# Create output data frame
OutputDataSet = pandas.DataFrame(data=prob, columns=["predictions"])
',
@input_data_1 = @inquery,
@params = N'@model varbinary(max),@passenger_count int,@trip_distance float,
@trip_time_in_secs int ,
@pickup_latitude float ,
@pickup_longitude float ,
@dropoff_latitude float ,
@dropoff_longitude float',
@model = @lmodel2,
@passenger_count =@passenger_count ,
@trip_distance=@trip_distance,
@trip_time_in_secs=@trip_time_in_secs,
@pickup_latitude=@pickup_latitude,
@pickup_longitude=@pickup_longitude,
@dropoff_latitude=@dropoff_latitude,
@dropoff_longitude=@dropoff_longitude
WITH RESULT SETS ((Score float));
END
GO
```
Below is a stored procedure PredictTipSingleModeRxPy
that performs scoring using the revoscalepy model.
```SQL
CREATE PROCEDURE [dbo].[PredictTipSingleModeRxPy](@model varchar(50), @passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0)
AS
BEGIN
DECLARE @inquery nvarchar(max) = N'
SELECT * FROM [dbo].[fnEngineerFeatures](
@passenger_count,
@trip_distance,
@trip_time_in_secs,
@pickup_latitude,
@pickup_longitude,
@dropoff_latitude,
@dropoff_longitude)
'
DECLARE @lmodel2 varbinary(max) = (select model from nyc_taxi_models where name = @model);
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import pickle
import numpy
# import pandas
from revoscalepy.functions.RxPredict import rx_predict
# Load model and unserialize
mod = pickle.loads(model)
# Get features for scoring from input data
x = InputDataSet[["passenger_count", "trip_distance", "trip_time_in_secs", "direct_distance"]]
# Score data to get tip prediction probability as a list (of float)
prob_array = rx_predict(mod, x)
prob_list = prob_array["tipped_Pred"].values
# Create output data frame
OutputDataSet = pandas.DataFrame(data=prob_list, columns=["predictions"])
',
@input_data_1 = @inquery,
@params = N'@model varbinary(max),@passenger_count int,@trip_distance float,
@trip_time_in_secs int ,
@pickup_latitude float ,
@pickup_longitude float ,
@dropoff_latitude float ,
@dropoff_longitude float',
@model = @lmodel2,
@passenger_count =@passenger_count ,
@trip_distance=@trip_distance,
@trip_time_in_secs=@trip_time_in_secs,
@pickup_latitude=@pickup_latitude,
@pickup_longitude=@pickup_longitude,
@dropoff_latitude=@dropoff_latitude,
@dropoff_longitude=@dropoff_longitude
WITH RESULT SETS ((Score float));
END
GO
```
Open a new query window in Management Studio and enter the explanatory variable column to call the stored procedure.
-- Call stored procedure PredictTipSingleModeSciKitPy to score using SciKit-Learn model
EXEC [dbo].[PredictTipSingleModeSciKitPy] 'SciKit_model', 1, 2.5, 631, 40.763958,-73.973373, 40.782139,-73.977303
-- Call stored procedure PredictTipSingleModeRxPy to score using revoscalepy model
EXEC [dbo].[PredictTipSingleModeRxPy] 'revoscalepy_model', 1, 2.5, 631, 40.763958,-73.973373, 40.782139,-73.977303
The seven explanatory variable values are in the following order: - passenger_count - trip_distance - trip_time_in_secs - pickup_latitude - pickup_longitude - dropoff_latitude - dropoff_longitude
As a result, the probability that a tip will be paid in an operation with the above parameters is returned.
![sqldev-python-step6-5-gho9o9.png](https://qiita-image-store.s3.amazonaws.com/0/195839/27bc40a6-07ae-595d-fec1-4c030e6c04ab.png)
In this tutorial, you learned how to work with Python code embedded in stored procedures. We found that integration with Transact-SQL makes it even easier to deploy Python models for forecasting and incorporate model retraining as part of an enterprise data workflow.
Step 5: Training and saving a model using T-SQL
In-Database Python Analysis for SQL Developers
Step 6: Operationalize the Model
Machine Learning Services with Python
Recommended Posts