This time, I would like to use VARISTA to build a model that predicts the sales of game software. I followed the Codexa article below. That article uses Amazon SageMaker, but I'll try the same thing with VARISTA.

I tried to predict the sales of game software with XGBoost [Amazon SageMaker notebook + model training + model hosting]

However, the data needs a little formatting, which means writing some code, so I do that part in Google Colaboratory. (For now, my Python knowledge is at the dabbling level.)

Time required

The hands-on work takes about 10 minutes. Learning takes about 1 or 2 minutes at level 1 and about 1.5 hours at level 3.

Incurred costs

Free with VARISTA's Free account

Download data

Download from the following page of Kaggle. Video Game Sales with Ratings

The description of the data on Kaggle is as follows.

This dataset was scraped from Metacritic. Unfortunately, Metacritic only covers a subset of the platforms, so some of the aggregated data is missing. Also, for some games, the variables described below are missing.

Critic_score: An aggregate critic score compiled by Metacritic staff
Critic_count: The number of critics used to calculate Critic_score
User_score: Score given by Metacritic subscribers
User_count: Number of users who voted for the user score
Developer: Game development company
Rating: ESRB rating

So keep in mind that **quite a lot of values are missing, and this is not sales data for every game.**
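To get a feel for how much is missing, something like the following can be run in Colaboratory. This is only a minimal sketch: the rows are made up for illustration, the helper name `missing_ratio` is my own, and the column spellings follow the description above, so the header of the actual CSV may be capitalized differently.

```python
import pandas as pd


def missing_ratio(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy rows invented for illustration; they are not taken from the Kaggle file.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Critic_score": [80.0, None, None, 65.0],
    "User_score": [None, None, None, 7.1],
})

ratios = missing_ratio(toy)
print(ratios)
```

With the real file, the same helper could be applied to the DataFrame returned by `pd.read_csv`.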

Data processing

This time, I process the data a little, following the article I referred to. A title with more than 1 million in sales is defined as a hit, so I add a new column that is "yes" for titles above 1 million and "no" for the rest.

I don't really like setting up a local environment, so I write this code in Google Colaboratory to process the data.

Colaboratory - Google Colab

import pandas as pd
filename = './sample_data/kaggle/Video_Games_Sales_as_at_22_Dec_2016.csv'
data = pd.read_csv(filename)
# Set the target:
# create y from Global_Sales, where more than 1 (1 million) counts as a hit
data['y'] = 'no'
# Boolean mask + column label requires .loc (not .iloc)
data.loc[data['Global_Sales'] > 1, 'y'] = 'yes'
pd.set_option('display.max_rows', 20)
# View the data
data
# Save the processed data as a new CSV
# (the row index is written too, as an unnamed first column)
data.to_csv('sample_data/kaggle/Add_y_Column_Video_Games_Sales.csv')

You can see that **y** has been added as the rightmost column.

Running the above code generates a file called "Add_y_Column_Video_Games_Sales.csv", so download it.

Upload data to VARISTA

Click here for VARISTA

Create a new project in VARISTA and upload the **Add_y_Column_Video_Games_Sales.csv** you created. This time, select **y** as the column to predict.

Data confirmation

The outline of the data is as follows.

The number of releases seems to peak in 2008-2010.

The most common platforms are PS2 and DS, followed by PS3. Smartphones do not seem to be included.

The distribution of genres is like this.

EA seems to be at the top in the number of published titles. I'm glad that there are many Japanese game companies.

As for whether or not a title was a hit, it is quite a narrow gate: 2,057 of 16,719 titles. I used to develop smartphone games, and my impression was that a million hit needed either capital or luck. On top of that, this data is for consoles, so it's even harder...
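For reference, 2,057 out of 16,719 works out to roughly 12%. The snippet below is a small sketch that rebuilds a label column from just those two counts and tabulates it. The helper name `class_balance` is my own, and on the real data the `y` column of the processed CSV would be passed in instead.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each label value."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Rebuild a label column from the two counts quoted in the text.
hits, total = 2057, 16719
y = pd.Series(["yes"] * hits + ["no"] * (total - hits), name="y")

balance = class_balance(y)
print(balance)
```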

See the correlation

yes (yellow): million hit
no (light blue): not a million hit

Platform

NES and GB (Game Boy) have a high million-hit rate. Is it because there weren't many other options when these consoles were popular?

Publisher / Developer

Since I'm Japanese, Nintendo and Square Enix always catch my eye, and it's amazing that all the titles they developed in this data are million hits. Judging from this graph, Nintendo may be good at planning and developing its own games but not so good at selling games developed by other companies.

The difference between Publisher and Developer is that the Publisher is the company that sells and distributes a game, and the Developer is the company that develops it. In some cases, the Developer is also the Publisher.

Critic_score & Critic_count

User_score

Learning

Learning was done at **level 3**. The detailed parameter settings are the same ones I have used since the Titanic example. Level 1 learning finishes in a few minutes, but with this setting it took about an hour because a large number of parameters are searched.

Also, I turned off the columns that are directly related to the predicted column (Unnamed0, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales) in the dataset.
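In VARISTA this is done by toggling columns on the screen, but if the same exclusion were done in pandas before uploading, it might look like the sketch below. The toy rows are made up, the helper name `drop_leaky_columns` is my own, and "Unnamed: 0" is the name pandas gives the saved index column when the CSV is read back in, which I am assuming corresponds to the Unnamed0 shown in VARISTA.

```python
import pandas as pd

# Columns that give away the answer (y is derived from Global_Sales),
# plus the saved row index.
LEAKY_COLUMNS = [
    "Unnamed: 0", "NA_Sales", "EU_Sales",
    "JP_Sales", "Other_Sales", "Global_Sales",
]


def drop_leaky_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without the leaky columns; absent ones are skipped."""
    return df.drop(columns=LEAKY_COLUMNS, errors="ignore")


# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Name": ["A", "B"],
    "Platform": ["PS2", "DS"],
    "NA_Sales": [0.9, 0.1],
    "Global_Sales": [1.5, 0.2],
    "y": ["yes", "no"],
})

cleaned = drop_leaky_columns(toy)
print(list(cleaned.columns))
```

Leaving Global_Sales in would let the model simply read the answer off that column, which is why these columns are excluded.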

Check the result

When I check the score, it looks like this.

The confusion matrix is displayed like this. Of the 103 cases used as test data for verification, it seems that 82 were hits and 21 were misses.

Also, this time VARISTA warned me that the training data is imbalanced. The only real remedy is to adjust the amount of data, for example by undersampling, and I would like to see what actually happens when I do. I will make time to try it again.
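I have not tried it yet, but random undersampling could be sketched in pandas roughly as follows. This is one simple way to do it under my own assumptions (the function name `undersample`, the toy data, and the fixed seed are all mine), not something VARISTA provides.

```python
import pandas as pd


def undersample(df: pd.DataFrame, target: str = "y", seed: int = 0) -> pd.DataFrame:
    """Randomly shrink every class to the size of the smallest class."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data invented for illustration: 2 hits and 8 non-hits.
toy = pd.DataFrame({
    "Name": [f"game{i}" for i in range(10)],
    "y": ["yes"] * 2 + ["no"] * 8,
})

balanced = undersample(toy)
print(balanced["y"].value_counts())
```

The trade-off is that most of the "no" rows are thrown away, so the model sees much less data.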

The threshold used to decide between Yes and No also seems to be adjusted automatically. In this case, a prediction is judged as Yes if its score exceeds 0.222.

Since there is no separate test data, it should really be created by holding out part of the training data. This time I kept things simple and verified using the split that VARISTA makes automatically.
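In other words, the usual cutoff of 0.5 is replaced by a lower one. A tiny sketch of that rule follows. The scores are made up, the function name `to_label` is my own, and how VARISTA arrives at 0.222 internally is not something I know.

```python
def to_label(score: float, threshold: float = 0.222) -> str:
    """Judge 'yes' only when the score exceeds the threshold."""
    return "yes" if score > threshold else "no"


# Made-up scores for illustration.
scores = [0.05, 0.222, 0.30, 0.80]
labels = [to_label(s) for s in scores]
print(labels)
```

A lower threshold makes sense when hits are rare, since the scores for the minority class tend to be small.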
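If I were to hold out test data myself, a split that keeps the yes/no ratio the same on both sides might look like this sketch. The function name `stratified_holdout`, the 20% test fraction, and the toy data are my own assumptions, and this is not how VARISTA splits the data.

```python
import pandas as pd


def stratified_holdout(df: pd.DataFrame, target: str = "y",
                       test_frac: float = 0.2, seed: int = 0):
    """Split df into (train, test), sampling test_frac of each class."""
    test = pd.concat([
        group.sample(frac=test_frac, random_state=seed)
        for _, group in df.groupby(target)
    ])
    train = df.drop(index=test.index)
    return train, test


# Toy data invented for illustration: 5 hits and 15 non-hits.
toy = pd.DataFrame({
    "Name": [f"game{i}" for i in range(20)],
    "y": ["yes"] * 5 + ["no"] * 15,
})

train, test = stratified_holdout(toy)
print(len(train), len(test))
```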

If this article made you decide to use VARISTA, please use the link below! Both you and I earn $7 in credits! m(_ _)m

https://console.varista.ai/welcome/jamaica-draft-coach-cup-blend

Reference article

I tried to predict the sales of game software with XGBoost [Amazon SageMaker notebook + model training + model hosting]
