This time, I downloaded J League data from a soccer data site called FootyStats and examined it with Python. The CSV data for the 2020 season was not yet available for download, so I am using the 2019 data. The site covers leagues around the world, not only the J League, so it is fun just to browse.
For the J League (J1, J2, J3, and cup matches), match data, team data, and player data can each be downloaded separately. (The standings are from the J.LEAGUE Data Site.)
#Import various libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Since the FootyStats data has no points column, I read the standings table directly from its URL on the J.LEAGUE Data Site and displayed it.
#Capture data
j1_rank = 'https://data.j-league.or.jp/SFRT01/?search=search&yearId=2019&yearIdLabel=2019%E5%B9%B4&competitionId=460&competitionIdLabel=%E6%98%8E%E6%B2%BB%E5%AE%89%E7%94%B0%E7%94%9F%E5%91%BD%EF%BC%AA%EF%BC%91%E3%83%AA%E3%83%BC%E3%82%B0&competitionSectionId=0&competitionSectionIdLabel=%E6%9C%80%E6%96%B0%E7%AF%80&homeAwayFlg=3'
j1_rank = pd.read_html(j1_rank)
df_rank = pd.DataFrame(j1_rank[0])
df_rank.index = df_rank.index + 1
# The table uses the site's Japanese headers:
# team, points, wins, draws, losses, goals scored, goals conceded, goal difference
df_rank[['チーム', '勝点', '勝', '分', '敗', '得点', '失点', '得失点差']]
df_rank.describe()
The mean is 47 points and the median is 46.5 points, so the two are almost the same. In the standings, the teams with 46-47 points sit in the middle of the table or slightly above it, around 6th to 10th place.
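As a small sketch of how the mean and median can be pulled out of `describe()` and compared: the points values below are made up for illustration and are not the 2019 table, and the `points` Series is a hypothetical stand-in for the points column of `df_rank`.

```python
import pandas as pd

# Hypothetical points values, for illustration only (not the real 2019 standings)
points = pd.Series([70, 64, 63, 47, 46, 46, 45, 31], name="points")

# describe() returns count, mean, std, min, 25%, 50%, 75%, max
summary = points.describe()
mean_points = summary["mean"]
median_points = summary["50%"]  # the 50th percentile is the median

# A small gap suggests the distribution is not strongly skewed
gap = abs(mean_points - median_points)
print(mean_points, median_points, gap)
```

With these made-up numbers the gap is larger than in the real table, where the mean and median nearly coincide.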
Next, load the CSV data for this analysis. The team data alone has as many as 293 columns, so it is quite hard to check what kind of data is available.
df_team = pd.read_csv('j1_team_2019.csv')
pd.set_option('display.max_columns', None)
df_team.head(6)
len(df_team.columns)
# Number of columns: 293
You can also display the column names with df_team.columns, but personally I find them easier to read when printed one per line with a for loop.
# Iterating over a DataFrame yields its column names
for col in df_team:
    print(col)
I have not listed every item here, but the data is very detailed and includes things such as scoring rates in the first and second halves of matches.
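With this many columns, filtering the names by keyword can be easier than reading the whole list. The following is a minimal sketch on an empty stand-in DataFrame; `df_demo` is hypothetical, and `wins_away` is an assumed column name that does not appear in this post.

```python
import pandas as pd

# Hypothetical stand-in for df_team; only the column names matter here.
# 'wins_away' is an assumed name used for illustration.
df_demo = pd.DataFrame(
    columns=["team_name", "wins", "wins_home", "wins_away",
             "goals_scored", "goals_conceded"]
)

# Plain Python: keep column names containing a keyword
home_cols = [col for col in df_demo.columns if "home" in col]

# pandas: DataFrame.filter(like=...) selects columns whose name contains the string
goal_cols = df_demo.filter(like="goals").columns.tolist()

print(home_cols)
print(goal_cols)
```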
I summarized the win rates of the J1 teams. I also wanted to see which clubs are strong at home, that is, what share of each club's wins came at home, so I added a column for the home share of wins and sorted the table by it in descending order.
# Win rate over the 34-match season
df_team['wins_rate'] = df_team['wins'] / 34
# Home win rate over the 17 home matches
df_team['home_wins_rate'] = df_team['wins_home'] / 17
# Share of all wins that came at home
df_team['wins_at_home'] = df_team['wins_home'] / df_team['wins']
# Build the summary in a separate DataFrame so that df_team keeps all of its columns for the plots below
df_home = df_team[['team_name', 'wins', 'wins_home', 'wins_rate', 'home_wins_rate', 'wins_at_home']].sort_values('wins_at_home', ascending=False).reset_index(drop=True)
df_home.index = df_home.index + 1
df_home.rename(columns={'team_name': 'Club', 'wins': 'Wins', 'wins_home': 'Home wins', 'wins_rate': 'Win rate', 'home_wins_rate': 'Home win rate', 'wins_at_home': 'Home share of wins'})
Nagoya won 9 matches over the season, and 7 of them (about 78%) were at home. Sendai's home share is also high. Further down the list, Kawasaki, who have been title contenders in recent years, were surprisingly low.
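The three rates can be checked on a tiny example. This is only a sketch: the team names and most numbers are hypothetical, and only the 9-wins, 7-at-home row mirrors the Nagoya figures quoted above.

```python
import pandas as pd

# Hypothetical rows; "Team A" mirrors the 9 wins / 7 home wins case from the text
demo = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team C"],
    "wins": [9, 16, 12],
    "wins_home": [7, 5, 6],
})

MATCHES = 34       # matches per team in the season
HOME_MATCHES = 17  # home matches per team

demo["wins_rate"] = demo["wins"] / MATCHES
demo["home_wins_rate"] = demo["wins_home"] / HOME_MATCHES
demo["wins_at_home"] = demo["wins_home"] / demo["wins"]

# Sort by the home share of wins, highest first
ranked = demo.sort_values("wins_at_home", ascending=False).reset_index(drop=True)
print(ranked[["team_name", "wins_at_home"]])
```

Note that a team with zero wins would make the last ratio undefined, so a real table may need that case handled separately.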
Obviously, the more goals a team scores, the more games it tends to win, but let's look at the correlation between goals scored and wins anyway. The plot has goals scored on the horizontal axis and wins on the vertical axis. As expected, there is a positive correlation.
df = df_team
plt.scatter(df['goals_scored'], df['wins'])
plt.xlabel('goals_scored')
plt.ylabel('wins')
# Label each point with its team name
for i, txt in enumerate(df.team_name):
    plt.annotate(txt, (df['goals_scored'].values[i], df['wins'].values[i]))
plt.show()
Conversely, let's look at the correlation between goals conceded and wins. There seems to be a (negative) correlation here as well, but it does not look as strong as the one between goals scored and wins.
df = df_team
plt.scatter(df['goals_conceded'], df['wins'])
plt.xlabel('goals_conceded')
plt.ylabel('wins')
# Label each point with its team name
for i, txt in enumerate(df.team_name):
    plt.annotate(txt, (df['goals_conceded'].values[i], df['wins'].values[i]))
plt.show()
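The two plots differ only in the x-axis column, so they could be wrapped in one small helper. This is a sketch rather than the code used for the figures above: `labeled_scatter` is a hypothetical name, and the rows in `demo` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def labeled_scatter(df, x_col, y_col="wins", label_col="team_name"):
    """Scatter x_col against y_col and label each point with label_col."""
    fig, ax = plt.subplots()
    ax.scatter(df[x_col], df[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    for x, y, label in zip(df[x_col], df[y_col], df[label_col]):
        ax.annotate(label, (x, y))
    return ax


# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team C"],
    "wins": [9, 16, 12],
    "goals_scored": [45, 57, 39],
    "goals_conceded": [50, 34, 40],
})

ax_scored = labeled_scatter(demo, "goals_scored")
ax_conceded = labeled_scatter(demo, "goals_conceded")
```

Returning the Axes object leaves it to the caller to show or save the figure.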
Let's compute the correlation coefficient between goals scored and wins, and between goals conceded and wins.
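# Correlation between wins and goals scored
wins = df['wins']
goals_scored = df['goals_scored']
r = np.corrcoef(wins, goals_scored)  # 2x2 correlation matrix
r[0, 1]  # the off-diagonal entry is the coefficient
# Correlation coefficient: 0.7184946
# Correlation between wins and goals conceded
goals_conceded = df['goals_conceded']
r = np.corrcoef(wins, goals_conceded)
r[0, 1]
# Correlation coefficient: -0.58795491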
The correlation coefficient between goals scored and wins is fairly high at about 0.72. The coefficient between goals conceded and wins is about -0.59 (absolute value 0.59), so there does seem to be a correlation, but it is weaker than the one for goals scored.
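For reference, the value reported by `np.corrcoef` is the Pearson correlation coefficient, which can also be computed by hand. The sketch below uses made-up numbers, not the 2019 data, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical values, for illustration only
wins = np.array([9, 16, 12, 22, 7], dtype=float)
goals = np.array([45, 57, 39, 56, 30], dtype=float)


def pearson_r(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r_manual = pearson_r(wins, goals)
# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(wins, goals)[0, 1]
print(r_manual, r_numpy)
```

The two values should agree up to floating-point error, and the result always lies between -1 and 1.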
I am analyzing various other things as well, so I may add to this post later. Once the 2020 season data becomes available for download, I also plan to look at that season. Because of COVID-19, the schedule became congested and the substitution rules changed, so I would like to compare how things differ from a normal year.