This time, I downloaded J League data from a soccer data site called FootyStats and examined it with Python. Because the CSV data for the 2020 season was not yet available for download, I am using the 2019 data. The site covers not only the J League but also leagues around the world, so it is interesting just to browse.

For the J-League (J1, J2, J, Cup match), match data, team data, and player data can each be downloaded separately. (The standings come from the J.LEAGUE Data Site.)

#Import various libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

2019 standings

Since the FootyStats data has no points column, I read the standings table directly from the J.LEAGUE Data Site URL and displayed it.

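# Read the standings table from the J.LEAGUE Data Site
j1_rank_url = 'https://data.j-league.or.jp/SFRT01/?search=search&yearId=2019&yearIdLabel=2019%E5%B9%B4&competitionId=460&competitionIdLabel=%E6%98%8E%E6%B2%BB%E5%AE%89%E7%94%B0%E7%94%9F%E5%91%BD%EF%BC%AA%EF%BC%91%E3%83%AA%E3%83%BC%E3%82%B0&competitionSectionId=0&competitionSectionIdLabel=%E6%9C%80%E6%96%B0%E7%AF%80&homeAwayFlg=3'
# pd.read_html returns a list of DataFrames, one per HTML table on the page
j1_rank = pd.read_html(j1_rank_url)
df_rank = j1_rank[0]
# Use a 1-based index so that it matches the league position
df_rank.index = df_rank.index + 1
# Column labels are the Japanese headers of the site's table:
# team, points, wins, draws, losses, goals scored, goals conceded, goal difference
df_rank[['チーム', '勝点', '勝', '分', '敗', '得点', '失点', '得失点差']]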


Take a quick look at the basic statistics

df_rank.describe()

The mean is 47 points and the median is 46.5, so the two are almost the same. In the standings, the teams with 46 to 47 points are clustered in the middle of the table or slightly above it, from 10th to 6th place.
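As a minimal sketch of how the mean and the median can be read out of describe(), the following uses made-up point totals for eight hypothetical teams (not the real 2019 standings). The column name points and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up point totals for eight hypothetical teams (not the real standings)
demo = pd.DataFrame({'points': [70, 60, 50, 47, 46, 40, 30, 25]})

# describe() returns a Series indexed by statistic name for a single column
summary = demo['points'].describe()
mean_points = summary['mean']
# The median is reported under the '50%' label
median_points = summary['50%']

print(mean_points, median_points)
```

When the mean and the median are close, as here, the distribution is not being pulled strongly to one side by a few very high or very low totals.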


Read the FootyStats data

Load the CSV data for this analysis. The team data has as many as 293 columns, which makes it quite hard to check what kind of data is available.

df_team = pd.read_csv('j1_team_2019.csv')
pd.set_option('display.max_columns', None)
df_team.head(6)



len(df_team.columns)
# Number of columns: 293

Take a quick look at what kind of data you have

You can also display them with df.columns, but I personally find them easier to read when printed one per line with a for loop.

# Iterating over a DataFrame yields its column names
for column in df_team:
    print(column)


Not every item is shown here, but the data appears to be very detailed, including things like the scoring rate in the first and second halves of a match.
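With this many columns, filtering the names by keyword can be easier than reading the whole list. The following is a rough sketch; the helper columns_containing is hypothetical, and the stand-in frame only carries the handful of column names used later in this article.

```python
import pandas as pd

# Stand-in frame: only the column names matter for this example
demo = pd.DataFrame(columns=['team_name', 'wins', 'wins_home',
                             'goals_scored', 'goals_conceded'])


def columns_containing(df, keyword):
    """Return the column names of df that contain keyword."""
    return [column for column in df.columns if keyword in column]


wins_columns = columns_containing(demo, 'wins')
goal_columns = columns_containing(demo, 'goals')
print(wins_columns)
print(goal_columns)
```

The same call on the real data would be columns_containing(df_team, 'wins'), assuming df_team has been loaded as above.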

Win rate

I summarized the win rates of the J1 teams. I wanted to see which clubs are strong at home, that is, how many of each club's wins came at home, so I added a "home share of wins" column and sorted by it in descending order.

# Win rate (34 league matches per team)
df_team['wins_rate'] = df_team['wins'] / 34
# Home win rate (17 home matches per team)
df_team['home_wins_rate'] = df_team['wins_home'] / 17
# Share of wins that came at home
df_team['wins_at_home'] = df_team['wins_home'] / df_team['wins']
# Keep the summary in its own DataFrame so that df_team retains every column for the later sections
df_wins = df_team[['team_name', 'wins', 'wins_home', 'wins_rate', 'home_wins_rate', 'wins_at_home']].sort_values('wins_at_home', ascending=False).reset_index(drop=True)
df_wins.index = df_wins.index + 1
df_wins.rename(columns={'team_name': 'Club name', 'wins': 'Wins', 'wins_home': 'Home wins', 'wins_rate': 'Win rate', 'home_wins_rate': 'Home win rate', 'wins_at_home': 'Home share of wins'})


Nagoya won 9 matches over the season, and 7 of them (about 78%) were at home. Sendai's home share is also high. Further down the list, Kawasaki, who have been competing for the title in recent years, were surprisingly low.
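To make the three ratios concrete, here is a small worked sketch using Nagoya's figures from the text (9 wins, 7 of them at home). The function name win_rates and the handling of a winless team are assumptions for illustration; the 34 and 17 match counts are the ones used in the code above.

```python
def win_rates(wins, wins_home, matches=34, home_matches=17):
    """Return (win rate, home win rate, share of wins that came at home)."""
    wins_rate = wins / matches
    home_wins_rate = wins_home / home_matches
    # A winless team would cause a division by zero, so return None in that case
    wins_at_home = wins_home / wins if wins > 0 else None
    return wins_rate, home_wins_rate, wins_at_home


# Nagoya: 9 wins over the season, 7 of them at home
nagoya = win_rates(wins=9, wins_home=7)
print([round(value, 3) for value in nagoya])
```

Note that the home share of wins says nothing about how often a team wins overall: a club with few wins can still rank high on it if most of those wins came at home.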

Correlation between goals scored and wins

Obviously, the more goals you score, the more games you win, but let's check the correlation between goals scored and wins. Goals scored are plotted on the horizontal axis and wins on the vertical axis. As expected, there is a positive correlation.

df = df_team
plt.scatter(df['goals_scored'], df['wins'])
plt.xlabel('goals_scored')
plt.ylabel('wins')


Show team names

plt.scatter(df['goals_scored'], df['wins'])
# Label each point with its team name
for name, x, y in zip(df['team_name'], df['goals_scored'], df['wins']):
    plt.annotate(name, (x, y))
plt.xlabel('goals_scored')
plt.ylabel('wins')
plt.show()


Correlation between the number of goals conceded and the number of wins

Conversely, let's look at the correlation between goals conceded and wins. There appears to be a (negative) correlation here as well, but it does not look as strong as the one between goals scored and wins.

df = df_team
plt.scatter(df['goals_conceded'], df['wins'])
plt.xlabel('goals_conceded')
plt.ylabel('wins')


Show team names

plt.scatter(df['goals_conceded'], df['wins'])
# Label each point with its team name
for name, x, y in zip(df['team_name'], df['goals_conceded'], df['wins']):
    plt.annotate(name, (x, y))
plt.xlabel('goals_conceded')
plt.ylabel('wins')
plt.show()

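The two labelled scatter plots differ only in the column on the horizontal axis, so the plotting steps could be wrapped in one helper. This is a sketch under assumptions: scatter_with_labels is a hypothetical name, and the three rows are made-up values, not the real 2019 figures.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_with_labels(df, x_col, y_col, label_col='team_name'):
    """Scatter df[x_col] against df[y_col] and label every point."""
    fig, ax = plt.subplots()
    ax.scatter(df[x_col], df[y_col])
    for label, x, y in zip(df[label_col], df[x_col], df[y_col]):
        ax.annotate(label, (x, y))
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return ax


# Made-up rows for illustration only
demo = pd.DataFrame({
    'team_name': ['Team A', 'Team B', 'Team C'],
    'goals_scored': [60, 45, 30],
    'goals_conceded': [30, 40, 55],
    'wins': [20, 13, 7],
})
ax_scored = scatter_with_labels(demo, 'goals_scored', 'wins')
ax_conceded = scatter_with_labels(demo, 'goals_conceded', 'wins')
```

In a notebook the figures are rendered automatically; in a script, call plt.show() afterwards. With the real data, the call would be scatter_with_labels(df, 'goals_scored', 'wins').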

Correlation coefficient

Let's find the correlation coefficient between goals scored and wins, and between goals conceded and wins.

Correlation coefficient between goals scored and wins

wins = df['wins']
goals_scored = df['goals_scored']
# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is the coefficient
r = np.corrcoef(wins, goals_scored)[0, 1]
r
# Correlation coefficient: 0.7184946

Correlation coefficient between goals conceded and wins

wins = df['wins']
goals_conceded = df['goals_conceded']
r = np.corrcoef(wins, goals_conceded)[0, 1]
r
# Correlation coefficient: -0.58795491

As expected, the correlation coefficient between goals scored and wins is high, at about 0.72. The correlation coefficient between goals conceded and wins is about -0.59 (absolute value 0.59), so there is a correlation, but it is weaker than the one for goals scored.
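For reference, the value that np.corrcoef reports is Pearson's correlation coefficient: the covariance of the two series divided by the product of their standard deviations. The sketch below computes it from that definition and compares it with NumPy's result. The numbers are made up for illustration and are not the 2019 data.

```python
import numpy as np

# Made-up values for illustration only
wins = np.array([20.0, 15.0, 13.0, 9.0, 7.0])
goals_scored = np.array([60.0, 50.0, 45.0, 40.0, 30.0])

# Deviations from the mean of each series
dx = wins - wins.mean()
dy = goals_scored - goals_scored.mean()

# Pearson's r: sum of products of deviations over the root of the product of squared sums
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns a 2x2 matrix; the diagonal is 1 and the off-diagonal is r
matrix = np.corrcoef(wins, goals_scored)
r_numpy = matrix[0, 1]

print(matrix.shape)
print(r_manual, r_numpy)
```

The coefficient ranges from -1 to 1, so comparing absolute values, as done above for 0.72 and 0.59, is a reasonable way to compare the strength of two linear relationships.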

I am analyzing various other things as well, so I may add to this article. Also, once the data for the 2020 season becomes available for download, I plan to look at that season too. Because of COVID-19, the schedule became congested and the substitution rules changed, so I would like to compare how things changed from a normal season.

I tried to analyze J League data with Python