In-Database Python Analysis for SQL Developers
Step 4: Feature extraction of data using T-SQL
In this step, you will learn how to train a machine learning model using the Python packages scikit-learn and revoscalepy. These Python packages are already installed with SQL Server Machine Learning Services, so you can load modules and call the required functions from within your stored procedure. Train your model with the data features you create and store the trained model in a SQL Server table.
TrainTestSplit
The stored procedure TrainTestSplit
is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).
TrainTestSplit
and select Modify to open the Transact-SQL script in a new query window.TrainTestSplit
splits the data in the nyctaxi_sample table into two tables, nyctaxi_sample_training and nyctaxi_sample_testing.
```SQL:TrainTestSplit
CREATE PROCEDURE [dbo].[TrainTestSplit](@pct int)
AS
DROP TABLE IF EXISTS dbo.nyctaxi_sample_training
SELECT * into nyctaxi_sample_training FROM nyctaxi_sample WHERE (ABS(CAST(BINARY_CHECKSUM(medallion,hack_license) as int)) % 100) < @pct
DROP TABLE IF EXISTS dbo.nyctaxi_sample_testing
SELECT * into nyctaxi_sample_testing FROM nyctaxi_sample
WHERE (ABS(CAST(BINARY_CHECKSUM(medallion,hack_license) as int)) % 100) > @pct
GO
```
Run the stored procedure and enter an integer that represents the percentage you want to assign to the training set. For example, the following statement assigns 60% of the data to a training set. Training and test data are stored in two separate tables.
EXEC TrainTestSplit 60
GO
In this section, you will use the training data you created to create a stored procedure that trains your model. This stored procedure uses the scikit-learn function to train a logistic regression model. This is implemented by using the system stored procedure sp_execute_external_script to call the Python runtime installed with SQL Server.
The model is facilitated by retraining the model by defining new training data as parameters and creating a stored procedure that wraps the call to the system stored procedure sp_execute_exernal_script.
The stored procedure TrainTipPredictionModelSciKitPy
is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).
In Management Studio's Object Explorer, expand Programming> Stored Procedures.
Right-click TrainTipPredictionModelSciKitPy
and select Modify to open the Transact-SQL script in a new query window.
DROP PROCEDURE IF EXISTS TrainTipPredictionModelSciKitPy;
GO
CREATE PROCEDURE [dbo].[TrainTipPredictionModelSciKitPy](@trained_model varbinary(max) OUTPUT)
AS
BEGIN
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import numpy
import pickle
# import pandas
from sklearn.linear_model import LogisticRegression
##Create SciKit-Learn logistic regression model
X = InputDataSet[["passenger_count", "trip_distance", "trip_time_in_secs", "direct_distance"]]
y = numpy.ravel(InputDataSet[["tipped"]])
SKLalgo = LogisticRegression()
logitObj = SKLalgo.fit(X, y)
##Serialize model
trained_model = pickle.dumps(logitObj)
',
@input_data_1 = N'
select tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance
from nyctaxi_sample_training
',
@input_data_1_name = N'InputDataSet',
@params = N'@trained_model varbinary(max) OUTPUT',
@trained_model = @trained_model OUTPUT;
;
END;
GO
Execute the following SQL statement to register the trained model in the nyc_taxi_models table.
DECLARE @model VARBINARY(MAX);
EXEC TrainTipPredictionModelSciKitPy @model OUTPUT;
INSERT INTO nyc_taxi_models (name, model) VALUES('SciKit_model', @model);
Make sure that one new record is added to the nyc_taxi_models table and the serialized model is registered.
Then train your logistic regression model with the stored procedure TrainTipPredictionModelRxPy
using the new release RevoScalePy package. Python's RevoScalePy package contains algorithms for object definition, data processing, and machine learning similar to those provided by R's RevoScaleR package. This library allows you to train predictive models using common algorithms such as logistic, linear regression, and decision trees, create computational contexts, move data between computational contexts, and process data. For more information on RevoScalePy, see Introducing RevoScalePy.
The stored procedure TrainTipPredictionModelRxPy
is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).
In Management Studio's Object Explorer, expand Programming> Stored Procedures.
Right-click TrainTipPredictionModelRxPy
and select Modify to open the Transact-SQL script in a new query window.
DROP PROCEDURE IF EXISTS TrainTipPredictionModelRxPy;
GO
CREATE PROCEDURE [dbo].[TrainTipPredictionModelRxPy](@trained_model varbinary(max) OUTPUT)
AS
BEGIN
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
import numpy
import pickle
# import pandas
from revoscalepy.functions.RxLogit import rx_logit
## Create a logistic regression model using rx_logit function from revoscalepy package
logitObj = rx_logit("tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance", data = InputDataSet);
## Serialize model
trained_model = pickle.dumps(logitObj)
',
@input_data_1 = N'
select tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance
from nyctaxi_sample_training
',
@input_data_1_name = N'InputDataSet',
@params = N'@trained_model varbinary(max) OUTPUT',
@trained_model = @trained_model OUTPUT;
;
END;
GO
--The SELECT query uses the custom scalar function fnCalculateDistance to calculate the direct distance between the boarding position and the disembarking position. The result of the query is stored in the default Python input variable ʻInputDataset. --The Python script calls the logisticRegression function of revoscalepy included in Machine Learning Services to create a logistic regression model. --Create a model with tapped as the objective variable (label) and passenger_count, trip_distance, trip_time_in_secs, and direct_distance as the explanatory variables (feature). --The trained model indicated by the Python variable
logitObj` is serialized and returned as an output parameter. By registering this output in the nyc_taxi_models table, you can use it repeatedly for future prediction.
Execute the following SQL statement to register the trained model in the nyc_taxi_models table.
DECLARE @model VARBINARY(MAX);
EXEC TrainTipPredictionModelRxPy @model OUTPUT;
INSERT INTO nyc_taxi_models (name, model) VALUES('revoscalepy_model', @model);
Data processing and model fitting can take a few minutes. Messages that are piped to Python's stdout stream are displayed in the Management Studio message window.
![sqldev-python-step5-4-gho9o9.png](https://qiita-image-store.s3.amazonaws.com/0/195839/ed97cdcb-3322-c5c2-c24d-e7d5d71bcf21.png)
Make sure that one new record is added to the nyc_taxi_models table and the serialized model is registered.
The next step is to create a forecast using the trained model.
Step 4: Feature extraction of data using T-SQL
In-Database Python Analysis for SQL Developers
Step 5: Train and save a model using T-SQL
Machine Learning Services with Python
Recommended Posts