As practice with the map function in the PySpark environment I set up earlier, I tried ridge regression.
The original material is from edX's Scalable Machine Learning lecture.
The data can be obtained from the Million Song Dataset.
To state the result up front: even though the grid search included the default values, the regression fitted with the default values turned out to be more accurate, which is baffling.
First of all, the imports.
#library import
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from test_helper import Test
import numpy as np
#data load
rawData = sc.textFile('millionsong.txt')
For the model I used LinearRegressionWithSGD. First, take a couple of raw lines to check the format.
samplePoints = rawData.take(2)
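The take() only grabs the first two raw lines; printing them (an extra check that is not in the original listing) should confirm that each record is one comma-separated string with the year as its first field, which is exactly what parsePoint below assumes.
# Inspect the sampled raw lines (extra check)
print samplePoints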
#Create LabeledPoint
def parsePoint(line):
    # First field is the year label; the remaining fields are the features
    tokens = line.split(',')
    label = tokens[0]
    features = tokens[1:]
    return LabeledPoint(label, features)
parsedData = rawData.map(parsePoint)
print parsedData.take(3)
[LabeledPoint(2001.0, [0.884123733793,0.610454259079,0.600498416968,0.474669212493,0.247232680947,0.357306088914,0.344136412234,0.339641227335,0.600858840135,0.425704689024,0.60491501652,0.419193351817]), LabeledPoint(2001.0, [0.854411946129,0.604124786151,0.593634078776,0.495885413963,0.266307830936,0.261472105188,0.506387076327,0.464453565511,0.665798573683,0.542968988766,0.58044428577,0.445219373624]), LabeledPoint(2001.0, [0.908982970575,0.632063159227,0.557428975183,0.498263761394,0.276396052336,0.312809861625,0.448530069406,0.448674249968,0.649791323916,0.489868662682,0.591908113534,0.4500023818])]
Next, divide the data into Training, Validation, and Test.
# Divide parsedData into Training, Validation and Test
weights = [.8, .1, .1]
seed = 50
parsedTrainData, parsedValData, parsedTestData = parsedData.randomSplit(weights, seed)
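As a quick sanity check (an extra step, not in the original listing), caching the three RDDs and counting them should show roughly an 80/10/10 split:
# Cache the splits and check their sizes
parsedTrainData.cache()
parsedValData.cache()
parsedTestData.cache()
print parsedTrainData.count(), parsedValData.count(), parsedTestData.count()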
# Fit the model with default values
fitModel = LinearRegressionWithSGD.train(parsedTrainData)
print fitModel
(weights=[348.391703677,312.158507888,303.514705245,268.768326368,260.265535915,321.082923267,345.636059404,369.96940298,414.587178279,328.611497772,389.972179858,348.42792115], intercept=0.0)
Compare the actual and predicted values with this model. The error is about two years.
# Prediction
testPoint = parsedTrainData.take(1)[0]
print testPoint.label
testPrediction = fitModel.predict(testPoint.features)
print testPrediction
2001.0
2003.04838193
Next, evaluate the model with the root mean squared error (RMSE). RMSE will also be used as the evaluation function for the hyperparameter grid search later.
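For reference, over n (label, prediction) pairs the RMSE is defined as
RMSE = sqrt( (1/n) * sum_i (label_i - prediction_i)^2 )
which is exactly what the two helper functions below compute.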
# Define functions to calculate RMSE (Root Mean Squared Error)
def squaredError(label, prediction):
    ## Calculates the squared error for a single prediction.
    return (label - prediction)**2

def calcRMSE(labelsAndPreds):
    ## Calculates the root mean squared error for an `RDD` of (label, prediction) tuples.
    return np.sqrt(labelsAndPreds
                   .map(lambda (label, prediction): squaredError(label, prediction))
                   .mean())
#Create new RDD with actual label and predicted label
labelsAndPreds = parsedValData.map(lambda lp: (lp.label, fitModel.predict(lp.features)))
# Calculate RMSE
rmseValLR1 = calcRMSE(labelsAndPreds)
print rmseValLR1
126.788570325
With the default values, the RMSE is about 126. So next, I'll try to find better hyperparameters with a grid search.
##Grid search
# Values to use when training the linear regression model
miniBatchFrac = 1.0 # miniBatchFraction
regType = 'l2' # regType
useIntercept = True # intercept
# Initial value for the minimum RMSE
min_rmseVal = 10**10
# Search for the best hyperparameters
modelRMSEs = []
for grid_alpha in [1e-5, 1.0, 10, 100]:
    for grid_numIters in [50, 100, 500]:
        for grid_reg in [1e-5, 0.0, 1.0]:
            model = LinearRegressionWithSGD.train(parsedTrainData,
                                                  iterations=grid_numIters,
                                                  step=grid_alpha,
                                                  regParam=grid_reg)
            labelsAndPreds = parsedValData.map(lambda lp: (lp.label, model.predict(lp.features)))
            rmseVal = calcRMSE(labelsAndPreds)
            if rmseVal < min_rmseVal:
                min_rmseVal = rmseVal
                best_alpha = grid_alpha
                best_numIter = grid_numIters
                best_reg = grid_reg
print "best_alpha:{},best_numIter:{}, best_reg:{}, best_rmseVal:{}".format(best_alpha, best_numIter, best_reg, min_rmseVal)
best_alpha:1.0,best_numIter:500, best_reg:1e-05, best_rmseVal:117.120806943
Here the RMSE has dropped slightly to about 117, so we can expect the accuracy to improve. Let's fit with these hyperparameters right away and compare the actual and predicted values, as with the first model.
# Fit with the hyperparameters chosen by the grid search
Final_model = LinearRegressionWithSGD.train(parsedTrainData,
                                            iterations=best_numIter,
                                            step=best_alpha,
                                            regParam=best_reg)
# Compare the Final_model prediction with the actual label
Final_testPoint = parsedTrainData.take(1)[0]
print Final_testPoint.label
Final_Prediction = Final_model.predict(Final_testPoint.features)
print Final_Prediction
2001.0
1969.7210425
The accuracy is dramatically worse, with an error of nearly 30 years. Why? I can't pin down the cause right away, so I'll update this post if I try again and figure it out.
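One thing worth checking next (a minimal diagnostic sketch, not in the original post; it assumes Final_model, parsedValData and calcRMSE are still in scope) is whether the refit model reproduces the grid-search RMSE of about 117 on the validation set, since a single training point is not a reliable measure of accuracy. Note also that regType, useIntercept and miniBatchFrac were defined before the grid search but never passed to train(), so both the grid search and this final fit used MLlib's defaults for those arguments.
# Diagnostic sketch: re-evaluate the refit model on the validation set.
# If this prints something close to 117, the refit matches the grid-search model
# and the large error above only shows up on that single training point.
labelsAndPredsFinal = parsedValData.map(lambda lp: (lp.label, Final_model.predict(lp.features)))
print calcRMSE(labelsAndPredsFinal)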