In-Database Python Analysis for SQL Developers
Step 3: Data exploration and visualization
Step 5: Training and saving a model using T-SQL
After exploring the data, we will gather some insights from the data and move on to feature engineering. The process of feature extraction from raw data is an important step in advanced analytical modeling.
In this step, you will learn how to use Transact-SQL functions to extract features from raw data. Then call that function from the stored procedure to create a table containing the feature values.
The meter distance values recorded in the original data may not represent geographic or travel distances, so the coordinates available in this dataset are used to direct the distance between the boarding and disembarking positions. To calculate. To do this, use the Haversine Formula (https://en.wikipedia.org/wiki/Haversine_formula) in your custom Transact-SQL function.
The T-SQL function fnCalculateDistance
uses a Haversine expression to calculate the distance, and the T-SQL function fnEngineerFeatures
creates a table containing all the features.
The T-SQL function fnCalculateDistance
is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).
In Management Studio's Object Explorer, expand Programming> Functions> Scalar Value Functions.
Right-click fnCalculateDistance
and select Modify to open the Transact-SQL script in a new query window.
CREATE FUNCTION [dbo].[fnCalculateDistance](@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)
-- User-defined function that calculates the direct distance between two geographical coordinates
RETURNS float
AS
BEGIN
DECLARE @distance decimal(28, 10)
-- Convert to radians
SET @Lat1 = @Lat1 / 57.2958
SET @Long1 = @Long1 / 57.2958
SET @Lat2 = @Lat2 / 57.2958
SET @Long2 = @Long2 / 57.2958
-- Calculate distance
SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))
--Convert to miles
IF @distance <> 0
BEGIN
SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
END
RETURN @distance
END
GO
--This function is a scalar value function that returns a single data value of a predefined type. --The latitude and longitude values obtained from the boarding and disembarking positions are used as inputs. The Haversine formula converts positions to radians and uses these values to calculate the direct distance between two locations.
The T-SQL function fnEngineerFeatures
is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).
In Management Studio's Object Explorer, expand Programming> Functions> Table Value Functions.
Right-click fnCalculateDistance
and select Modify to open the Transact-SQL script in a new query window.
fnEngineerFeatures
is a table-valued function that takes multiple columns as input and returns multiple feature-valued columns. The purpose of fnEngineerFeatures
is to create a feature value set to use for model building. fnEngineerFeatures
calls fnCalculateDistance
to get the linear distance between the boarding position and the disembarking position.
```SQL:fnEngineerFeatures
CREATE FUNCTION [dbo].[fnEngineerFeatures](
@passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0)
RETURNS TABLE
AS
RETURN
(
-- Add the SELECT statement with parameter references here
SELECT
@passenger_count AS passenger_count,
@trip_distance AS trip_distance,
@trip_time_in_secs AS trip_time_in_secs,
[dbo].[fnCalculateDistance](@pickup_latitude,@pickup_longitude,@dropoff_latitude,@dropoff_longitude) AS direct_distance
)
GO
```
As an operation check, let's calculate the geographical distance for the record where the meter distance value is set to 0 even though the boarding position and the getting-off position are different.
SELECT tipped, fare_amount, passenger_count,(trip_time_in_secs/60) as TripMinutes,
trip_distance, pickup_datetime, dropoff_datetime,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS direct_distance
FROM nyctaxi_sample
WHERE pickup_longitude != dropoff_longitude and pickup_latitude != dropoff_latitude and trip_distance = 0
ORDER BY trip_time_in_secs DESC
As you can see, the distances reported by the meters are not always recorded as an indication of geographical distance. These pretreatments are characteristic engineering is the reason why engineering is important.
In the next step, you'll learn how to use these features to create and train machine learning models using Python.
Step 5: Training and saving a model using T-SQL
Step 3: Data exploration and visualization
In-Database Python Analysis for SQL Developers
Step 4: Create Data Features using T-SQL
Machine Learning Services with Python
Recommended Posts