☆ Professor Anzai...!! I want to analyze data... Part 1: Data preparation ☆ Let's analyze NBA player stats (results) with Python

Hello to basketball lovers all across the country. My name is Hikoichi Aida. I usually work as a manager and data scientist for a high school basketball team, analyzing all kinds of data.

This time, I would like to analyze player stats (results) from the NBA, the professional basketball league in the United States. It will only be a simple analysis, but please bear with me.

This first part covers data preparation, so it is about scraping and preprocessing. I don't know when the second and later parts will appear, so please forgive me. They may never come.

Environment

I used Google Colaboratory. Everything introduced in this article runs with only the pre-installed libraries, which is very convenient.

Data collection

Scraping

While looking into how to handle the data collection part, I found the following blog article.

Most of what I do here is the same as in that article. The one change is that I also scrape the URL of each player's personal page, because I thought I might want to scrape the players' height and weight as well. The table repeats its column-name row at regular intervals, so those rows are skipped.

from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

data = pd.DataFrame()
# two seasons as written here; use range(2000, 2020) to cover all 20 seasons (2000-2019)
years = list(range(2000, 2002))
for year in years:
    url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)
    # fetch the HTML from the given URL and parse it
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")

    # use get_text() to extract the column names from the first header row
    headers = [th.get_text() for th in soup.find_all('tr', limit=2)[0].find_all('th')]
    # drop the first column (the Basketball Reference ranking), then add URL and Year
    headers = ['URL'] + headers[1:] + ['Year']

    # keep only rows with td cells and a player link; this skips the repeated header rows
    rows = soup.find_all('tr')[1:]
    player_stats = [[row.a.get('href')] + [td.get_text() for td in row.find_all('td')]
                    for row in rows if row.find_all('td') and row.a]
    stats = pd.DataFrame(player_stats)
    stats['Year'] = str(year)
    stats.columns = headers
    data = pd.concat([data, stats])
data = data.dropna()
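
To see why the condition in the list comprehension (the row has td cells and a link) skips the repeated column-name rows, here is a minimal sketch on a hand-written HTML fragment. The fragment is an assumption that only imitates the layout of the stats table and is not copied from the site: header rows contain only th cells, while player rows contain td cells and a link to the personal page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the table layout (an assumption, not real site HTML).
html = """
<table>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>1</th><td><a href="/players/a/abdulta01.html">Tariq Abdul-Wahad</a></td><td>11.4</td></tr>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>2</th><td><a href="/players/a/abdursh01.html">Shareef Abdur-Rahim</a></td><td>20.3</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")[1:]  # drop the first header row

# Keep only rows that have td cells and a link; repeated header rows have neither.
player_stats = [
    [row.a.get("href")] + [td.get_text() for td in row.find_all("td")]
    for row in rows
    if row.find_all("td") and row.a
]

for record in player_stats:
    print(record)
```

Only the two player rows survive, each starting with the personal-page URL, and the header row in the middle is dropped.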

Here is an example of the page to be scraped.

All the stats for the players who played in a given season are on a single page, so you could probably collect enough data just by selecting the table and copy-pasting it into spreadsheet software such as Excel.

This time, I targeted 20 seasons of data (2000-2019). The scraping result looks like this.

     URL                        Player               Pos  Age  Tm    G  GS    MP   FG   FGA   FG%   3P  3PA   3P%   2P   2PA   2P%  eFG%   FT  FTA   FT%  ORB  DRB   TRB  AST  STL  BLK  TOV   PF   PTS  Year
  0  /players/a/abdulta01.html  Tariq Abdul-Wahad    SG    25  TOT  61  56  25.9  4.5  10.6  .424  0.0  0.4  .130  4.4  10.2  .435  .426  2.4  3.2  .756  1.7  3.1   4.8  1.6  1.0  0.5  1.7  2.4  11.4  2000
  1  /players/a/abdulta01.html  Tariq Abdul-Wahad    SG    25  ORL  46  46  26.2  4.8  11.2  .433  0.0  0.5  .095  4.8  10.7  .447  .435  2.5  3.3  .762  1.7  3.5   5.2  1.6  1.2  0.3  1.9  2.5  12.2  2000
  2  /players/a/abdulta01.html  Tariq Abdul-Wahad    SG    25  DEN  15  10  24.9  3.4   8.7  .389  0.1  0.1  .500  3.3   8.6  .388  .393  2.1  2.8  .738  1.6  1.9   3.5  1.7  0.4  0.8  1.3  2.1   8.9  2000
  3  /players/a/abdursh01.html  Shareef Abdur-Rahim  SF    23  VAN  82  82  39.3  7.2  15.6  .465  0.4  1.2  .302  6.9  14.4  .478  .477  5.4  6.7  .809  2.7  7.4  10.1  3.3  1.1  1.1  3.0  3.0  20.3  2000
  4  /players/a/alexaco01.html  Cory Alexander       PG    26  DEN  29   2  11.3  1.0   3.4  .286  0.3  1.2  .257  0.7   2.2  .302  .332  0.6  0.8  .773  0.3  1.2   1.4  2.0  0.8  0.1  1.0  1.3   2.8  2000
...
703  /players/z/zellety01.html  Tyler Zeller         C     29  MEM   4   1  20.5  4.0   7.0  .571  0.0  0.0        4.0   7.0  .571  .571  3.5  4.5  .778  2.3  2.3   4.5  0.8  0.3  0.8  1.0  4.0  11.5  2019
704  /players/z/zizican01.html  Ante Žižić           C     22  CLE  59  25  18.3  3.1   5.6  .553  0.0  0.0        3.1   5.6  .553  .553  1.6  2.2  .705  1.8  3.6   5.4  0.9  0.2  0.4  1.0  1.9   7.8  2019
705  /players/z/zubaciv01.html  Ivica Zubac          C     21  TOT  59  37  17.6  3.6   6.4  .559  0.0  0.0        3.6   6.4  .559  .559  1.7  2.1  .802  1.9  4.2   6.1  1.1  0.2  0.9  1.2  2.3   8.9  2019
706  /players/z/zubaciv01.html  Ivica Zubac          C     21  LAL  33  12  15.6  3.4   5.8  .580  0.0  0.0        3.4   5.8  .580  .580  1.7  2.0  .864  1.6  3.3   4.9  0.8  0.1  0.8  1.0  2.2   8.5  2019
707  /players/z/zubaciv01.html  Ivica Zubac          C     21  LAC  26  25  20.2  3.8   7.2  .538  0.0  0.0        3.8   7.2  .538  .538  1.7  2.3  .733  2.3  5.3   7.7  1.5  0.4  0.9  1.4  2.5   9.4  2019

Preprocessing

Missing value

Percentage stats such as FG% (field goal percentage, i.e. shooting success rate) appear to be empty when the number of attempts is 0. Replace those empty values with NaN.

import numpy as np
data = data.replace(r'^\s*$', np.nan, regex=True)  # np.nan: the np.NaN alias was removed in NumPy 2.0
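
As a small check of what this replacement does, the sketch below applies the same pattern to two rows whose values are taken from the scraping result above. The name `sample` is only for illustration.

```python
import numpy as np
import pandas as pd

# Values taken from the scraped table above: 3P% is empty when 3PA is 0.
sample = pd.DataFrame({
    "Player": ["Tariq Abdul-Wahad", "Tyler Zeller"],
    "3PA": ["0.4", "0.0"],
    "3P%": [".130", ""],
})

# Cells that are empty or whitespace-only become NaN; other strings are untouched.
cleaned = sample.replace(r"^\s*$", np.nan, regex=True)

print(cleaned)
print(cleaned["3P%"].isna().tolist())
```

The pattern `^\s*$` only matches cells that are entirely empty or whitespace, so ordinary values such as `.130` are left alone.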

Data type change

The scraped values are all strings. Convert the columns you want to treat as numbers to float. Some columns contain only integers, but handling them separately is a hassle, so I convert everything to float. Before conversion, the percentage stats are written as .XXX, so I prepend a 0 to turn them into 0.XXX. (Python's float() would also accept .424 as-is, so this step is mainly for consistent notation.)

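# percentage columns (FG%, 3P%, ...) are stored as '.XXX' strings
add_zero_cols = [col for col in data.columns if '%' in col]
# Age plus every column from G through PTS (everything after Tm except Year)
num_cols = ['Age'] + list(data.columns[5:-1])

# prepend '0' so that '.424' becomes '0.424'
for col in add_zero_cols:
    data[col] = '0' + data[col]
# convert the numeric columns from strings to float
for col in num_cols:
    data[col] = data[col].astype(float)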
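
As a cross-check on the conversion, the sketch below uses `pd.to_numeric` on a few values from the table above. This is my own alternative for illustration, not the method used in this article. It suggests that strings such as .424 are accepted directly, so with this approach the leading 0 is not needed, and that NaN is carried through.

```python
import numpy as np
import pandas as pd

# A few string values taken from the scraping result, with one missing 3P%.
sample = pd.DataFrame({
    "FG%": [".424", ".571"],
    "3P%": [".130", np.nan],
    "G": ["61", "4"],
})

# pd.to_numeric picks a numeric dtype per column (integers stay integers).
converted = sample.apply(pd.to_numeric)

print(converted.dtypes)
print(converted)
```

One difference from `astype(float)` is that an all-integer column such as G stays an integer column here.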

Let's check the result by showing the top 10 scoring averages.

data.sort_values('PTS', ascending=False)[['Player', 'PTS', 'Year']].head(10)

       Player             PTS   Year
11135  James Harden       36.1  2019
 3266  Kobe Bryant*       35.4  2006
 3428  Allen Iverson*     33.0  2006
 1813  Tracy McGrady*     32.1  2003
 7954  Kevin Durant       32.0  2014
 3818  Kobe Bryant*       31.6  2007
10167  Russell Westbrook  31.6  2017
 1249  Allen Iverson*     31.4  2002
 3442  LeBron James       31.4  2006
  715  Allen Iverson*     31.1  2001

Nothing looks wrong. The list is full of superstars who are well known even in Japan, such as James Harden, Allen Iverson, and Kobe Bryant.

I took a quick look at the data source page to see what the * after a player's name means, but I couldn't tell. It may mark players who are in the Hall of Fame.

Additional data: height and weight

I also collected height and weight data from the personal-page URLs that I added to the scraping targets. (This only takes a small change to the code at the beginning, so the code for the additional data is omitted.)
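
The collection code is omitted, but the Weight and Height values in this article are metric, so some unit conversion is presumably involved. The following is only a sketch of that step under assumptions: that the personal page gives height as feet-inches text such as 6-6 and weight in pounds, and that the factors are 0.0254 m per inch and a rounded 0.454 kg per pound. The function names `height_to_m` and `weight_to_kg` are hypothetical.

```python
def height_to_m(height_text: str) -> float:
    """Convert a 'feet-inches' string such as '6-6' to meters (assumed input format)."""
    feet, inches = height_text.split("-")
    total_inches = int(feet) * 12 + int(inches)
    return round(total_inches * 0.0254, 2)


def weight_to_kg(pounds: float) -> float:
    """Convert pounds to kilograms with the rounded factor 0.454 (an assumption)."""
    return round(pounds * 0.454, 2)


print(height_to_m("6-6"))  # 1.98
print(weight_to_kg(223))   # 101.24
```

With these assumed factors, 6-6 and 223 lb give 1.98 and 101.24, which happen to match a Height and Weight pair in the final table. That is why the rounded 0.454 is used here rather than the exact 0.45359237.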

Finally, the data looks like this. (The Weight and Height columns have been added at the far right.)

Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS Year Weight Height
0 Tariq Abdul-Wahad SG 25.0 TOT 61.0 56.0 25.9 4.5 10.6 0.424 0.0 0.4 0.130 4.4 10.2 0.435 0.426 2.4 3.2 0.756 1.7 3.1 4.8 1.6 1.0 0.5 1.7 2.4 11.4 2000 101.24 1.98
3 Shareef Abdur-Rahim SF 23.0 VAN 82.0 82.0 39.3 7.2 15.6 0.465 0.4 1.2 0.302 6.9 14.4 0.478 0.477 5.4 6.7 0.809 2.7 7.4 10.1 3.3 1.1 1.1 3.0 3.0 20.3 2000 102.15 2.06
5 Ray Allen* SG 24.0 MIL 82.0 82.0 37.4 7.8 17.2 0.455 2.1 5.0 0.423 5.7 12.2 0.468 0.516 4.3 4.9 0.887 1.0 3.4 4.4 3.8 1.3 0.2 2.2 2.3 22.1 2000 93.07 1.96
7 John Amaechi C 29.0 ORL 80.0 53.0 21.1 3.8 8.8 0.437 0.0 0.1 0.167 3.8 8.7 0.439 0.438 2.8 3.6 0.766 0.8 2.6 3.3 1.2 0.4 0.5 1.7 2.0 10.5 2000 122.58 2.08
8 Derek Anderson SG 25.0 LAC 64.0 58.0 34.4 5.9 13.4 0.438 0.9 2.8 0.309 5.0 10.7 0.472 0.470 4.2 4.8 0.877 1.3 2.8 4.0 3.4 1.4 0.2 2.6 2.3 16.9 2000 88.08 1.96
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11561 Delon Wright PG 26.0 TOT 75.0 13.0 22.7 3.2 7.4 0.434 0.7 2.2 0.298 2.6 5.2 0.492 0.478 1.6 2.0 0.793 0.9 2.6 3.5 3.3 1.2 0.4 1.0 1.4 8.7 2019 83.08 1.96
11566 Thaddeus Young PF 30.0 IND 81.0 81.0 30.7 5.5 10.4 0.527 0.6 1.8 0.349 4.8 8.6 0.564 0.557 1.1 1.7 0.644 2.4 4.1 6.5 2.5 1.5 0.4 1.5 2.4 12.6 2019 99.88 2.03
11567 Trae Young PG 20.0 ATL 81.0 81.0 30.9 6.5 15.5 0.418 1.9 6.0 0.324 4.6 9.6 0.477 0.480 4.2 5.1 0.829 0.8 2.9 3.7 8.1 0.9 0.2 3.8 1.7 19.1 2019 81.72 1.86
11572 Ante Žižić C 22.0 CLE 59.0 25.0 18.3 3.1 5.6 0.553 0.0 0.0 NaN 3.1 5.6 0.553 0.553 1.6 2.2 0.705 1.8 3.6 5.4 0.9 0.2 0.4 1.0 1.9 7.8 2019 115.32 2.08
11573 Ivica Zubac C 21.0 TOT 59.0 37.0 17.6 3.6 6.4 0.559 0.0 0.0 NaN 3.6 6.4 0.559 0.559 1.7 2.1 0.802 1.9 4.2 6.1 1.1 0.2 0.9 1.2 2.3 8.9 2019 108.96 2.13

By the way, here are the 10 tallest players of the last 20 years.

# df: the DataFrame after the Height and Weight columns were added (built by the omitted code; assumed from the name used here)
df.groupby(['Player']).max().reset_index().sort_values('Height', ascending=False)[['Player', 'Year', 'Height', 'Weight']].head(10)

      Player            Year  Height  Weight
 672  Gheorghe Mureșan  2000    2.31  137.56
1641  Shawn Bradley     2005    2.29  106.69
1890  Yao Ming*         2011    2.29  140.74
1653  Sim Bhullar       2015    2.26  163.44
1442  Pavel Podkolzin   2006    2.26  118.04
1656  Slavko Vraneš     2004    2.26  124.85
1519  Rik Smits         2000    2.24  113.50
 172  Boban Marjanović  2019    2.24  131.66
1449  Peter John Ramos  2005    2.21  124.85
 583  Edy Tavares       2017    2.21  118.04

The unit is meters (m). In third place is the famous Yao Ming, also known as the Great Wall of China. That is seriously tall.

Closing

Now that the data is ready, I would like to visualize it next time.

Bonus

Top 10 in assists per game

data.sort_values('AST', ascending=False)[['Player', 'AST', 'Year']].head(10)

      Player          AST   Year
6617  Deron Williams  12.8  2011
7061  Rajon Rondo     11.7  2012
9486  Rajon Rondo     11.7  2016
4085  Steve Nash*     11.6  2007
4686  Chris Paul      11.6  2008
2977  Steve Nash*     11.5  2005
6440  Steve Nash*     11.4  2011
6509  Rajon Rondo     11.2  2011
9819  James Harden    11.2  2017
7641  Rajon Rondo     11.1  2013

Top 10 in rebounds per game

data.sort_values('TRB', ascending=False)[['Player', 'TRB', 'Year']].head(10)

       Player          TRB   Year
 7236  Earl Barron     18.0  2013
  651  Danny Fortson   16.3  2001
10372  Andre Drummond  16.0  2018
11056  Andre Drummond  15.6  2019
 1974  Ben Wallace     15.4  2003
 6391  Kevin Love      15.2  2011
10537  DeAndre Jordan  15.2  2018
 8695  DeAndre Jordan  15.0  2015
 9163  Andre Drummond  14.8  2016
 6894  Dwight Howard   14.5  2012

It looks like players who have not played a minimum number of games need to be excluded from rankings like these.
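
A minimal sketch of such a filter, using four rows taken from the scraping result shown earlier. The cutoff `MIN_GAMES = 20` is a hypothetical value chosen only for illustration, not an official qualification rule.

```python
import pandas as pd

# Rows taken from the 2000 and 2019 scraping output shown earlier in this article.
sample = pd.DataFrame({
    "Player": ["Shareef Abdur-Rahim", "Tyler Zeller", "Ante Žižić", "Ivica Zubac"],
    "G": [82.0, 4.0, 59.0, 59.0],
    "TRB": [10.1, 4.5, 5.4, 6.1],
    "Year": ["2000", "2019", "2019", "2019"],
})

MIN_GAMES = 20  # hypothetical cutoff, chosen only for illustration

# Drop players below the cutoff first, then rank the remaining rows.
qualified = sample[sample["G"] >= MIN_GAMES]
top = qualified.sort_values("TRB", ascending=False)[["Player", "G", "TRB", "Year"]]
print(top)
```

Tyler Zeller (4 games) is dropped before sorting, so a player with only a handful of games can no longer reach the top of the ranking.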
