Step 3: Data exploration and visualization

Data science solution development typically involves intensive data exploration and data visualization. In this step, you will explore the sample data and generate some plots. In addition, you will learn how to serialize graphics objects in Python and how to deserialize and create plots.

[!NOTE] This tutorial only shows the binary classification model. It is also possible to build other models such as regression classification and multinomial classification.

Data confirmation

The original dataset provides the taxi identifier and driving record in separate files, but we have joined with medallion, hack_license, and pickup_datetime as keys to make the sample data easier to use. The records used are sampled at 1% of the original number of records. The sampled dataset has 1,703,957 rows and 23 columns.

** Taxi identifier **

--The medallion column shows the unique ID number of the taxi. --The hack_license column shows the driver's anonymized driver's license number.

** Driving record and fare record **

――Each driving record includes the location and time of boarding and alighting, and the driving distance. --Each fare record contains payment information such as payment type, total payment amount, tip amount and so on. --The last three columns can be used for various machine learning tasks. --The tip_amount column contains consecutive numbers and can be used as a label column (objective variable) for regression analysis. --The tipped column has only yes / no values and can be used as a label column (objective variable) for binary classification. --The tip_class column has multiple ** class labels ** and can be used as a label column (objective variable) for multinomial classification. --The value used as the label column is based on the tip_amount column.

 |Column|rule|
 |:--|:--|
 |tipped|If tip_amount > 0, tipped = 1, otherwise tipped = 0|
 |tip_class|Class 0: tip_amount =$ 0<br/> Class 1: tip_amount >$ 0and tip_amount <=$ 5<br/> Class 2: tip_amount >$ 5and tip_amount <=10 dollars<br/> Class 3: tip_amount >10 dollarsand tip_amount <=$ 20<br/> Class 4: tip_amount >$ 20|

Create plots in Python within T-SQL

Visualization is important for understanding the distribution of data and outliers, and Python provides many packages for data visualization. The matplotlib module includes many features for creating histograms, scatter plots, box plots, and other data exploration graphs.

In this section, you will learn how to work with plots using stored procedures. Here, the plot is treated as varbinary type data.

Store plot as varbinary data type

The ** RevoScalePy ** package of the Python library included in SQL Server 2017 Machine Learning Services is equivalent to the RevoScaleR package of the R library. This example uses rxHistogram to plot a histogram based on the Transact-SQL query result data. Wrap it in a PlotHistogram stored procedure for ease of use.

This stored procedure returns a serialized Python drawing object as a stream of varbinary data. You cannot view the binary data directly, but you can use Python code on the client to deserialize the binary data and save the image file on the client computer.

Define a SerializePlots stored procedure

The stored procedure SerializePlots is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).

In Management Studio's Object Explorer, expand Programming> Stored Procedures.

Right-click SerializePlots and select Modify to open the Transact-SQL script in a new query window.


CREATE PROCEDURE [dbo].[SerializePlots]
AS
BEGIN
  SET NOCOUNT ON;
  DECLARE @query nvarchar(max) =
  N'SELECT cast(tipped as int) as tipped, tip_amount, fare_amount FROM [dbo].[nyctaxi_sample]'
  EXECUTE sp_execute_external_script
  @language = N'Python',
  @script = N'
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import pickle

fig_handle = plt.figure()
plt.hist(InputDataSet.tipped)
plt.xlabel("Tipped")
plt.ylabel("Counts")
plt.title("Histogram, Tipped")
plot0 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
plt.clf()

plt.hist(InputDataSet.tip_amount)
plt.xlabel("Tip amount ($)")
plt.ylabel("Counts")
plt.title("Histogram, Tip amount")
plot1 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
plt.clf()

plt.hist(InputDataSet.fare_amount)
plt.xlabel("Fare amount ($)")
plt.ylabel("Counts")
plt.title("Histogram, Fare amount")
plot2 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
plt.clf()

plt.scatter( InputDataSet.fare_amount, InputDataSet.tip_amount)
plt.xlabel("Fare Amount ($)")
plt.ylabel("Tip Amount ($)")
plt.title("Tip amount by Fare amount")
plot3 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
plt.clf()

OutputDataSet = plot0.append(plot1, ignore_index=True).append(plot2, ignore_index=True).append(plot3, ignore_index=True)
',
                                 @input_data_1 = @query
  WITH RESULT SETS ((plot varbinary(max)))
END

GO

--The variable @ query defines the query text passed as input @ input_data_1 to the Python code block. -** Histograms and scatter plots are created with the figure of the matplotlib library ** and these objects are serialized using the ** pickle library **. --Python drawing objects are serialized into ** pandas ** dataframes for output.

Output varbinary data as an image file

Run the following query in Management Studio.
```
EXEC [dbo].[SerializePlots]
```
Change the connection string in the Python script DeserializeSavePlots.py in the downloaded files to suit your environment, and then execute it.

** For SQL Server Authentication **

```python:DeserializeSavePlots
import pyodbc
import pickle
import os
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER={SERVER_NAME};DATABASE={DB_NAME};UID={USER_NAME};PWD={PASSOWRD}')
cursor = cnxn.cursor()
cursor.execute("EXECUTE [dbo].[SerializePlots]")
tables = cursor.fetchall()
for i in range(0, len(tables)):
    fig = pickle.loads(tables[i][0])
    fig.savefig(str(i)+'.png')
print("The plots are saved in directory: ",os.getcwd())
```

** For Windows authentication **

```Python:DeserializeSavePlots.py
import pyodbc
import pickle
import os
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER={SERVER_NAME};DATABASE={DB_NAME};Trusted_Connection=yes;')
cursor = cnxn.cursor()
cursor.execute("EXECUTE [dbo].[SerializePlots]")
tables = cursor.fetchall()
for i in range(0, len(tables)):
    fig = pickle.loads(tables[i][0])
    fig.savefig(str(i)+'.png')
print("The plots are saved in directory: ",os.getcwd())
```

> [!NOTE]

Match the Python runtime version on the server and client. Also, the version of the Python library such as matplotlib used by the client should match the server or be a higher version.