Hello to basketball lovers all over the country. My name is Hikoichi Aida. I usually work as the manager and data scientist of a high school basketball team, analyzing all kinds of data.
This time, I would like to analyze the player stats of the NBA, the professional basketball league in the United States. The analysis is simple, but I hope you will follow along.
This first installment covers data preparation: scraping and preprocessing. I don't know when the second and later installments will come, so please forgive me. They may never come.
I used Google Colaboratory. Everything shown here runs with only the pre-installed libraries, which is very convenient.
While looking into how to collect the data, I found the following blog article.
Most of the code is the same as in that article, but I thought I might later scrape each player's height and weight, so I also collected the URL of each player's personal page. The table repeats its column-name row at regular intervals, so those rows are skipped.
from urllib.request import urlopen

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

data = pd.DataFrame()
years = list(range(2000, 2020))  # seasons 2000-2019
for year in years:
    url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)
    # this is the HTML from the given URL
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    # use getText() to extract the header text we need into a list
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    # drop the first column (Basketball Reference's ranking), then add URL and Year
    headers = ['URL'] + headers[1:] + ['Year']
    rows = soup.findAll('tr')[1:]
    # keep only rows with data cells and a player link (skips the repeated header rows)
    player_stats = [[row.a.get('href')] + [td.getText() for td in row.findAll('td')]
                    for row in rows if row.findAll('td') and row.a]
    stats = pd.DataFrame(player_stats)
    stats['Year'] = str(year)
    stats.columns = headers
    data = pd.concat([data, stats])
data = data.dropna()
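The row filter in the list comprehension is easier to see on a tiny table. The sketch below runs offline: the HTML string and the player names are made up for illustration, and it only mimics the layout the code above appears to rely on (a rank in a `th` cell, the player link in the first `td`, and a repeated header row made only of `th` cells).

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the assumed layout: data rows hold <td> cells,
# while the repeated header row holds only <th> cells.
SAMPLE_HTML = """
<table>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>1</th><td><a href="/players/x/xxxxx01.html">Player X</a></td><td>11.4</td></tr>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>2</th><td><a href="/players/y/yyyyy01.html">Player Y</a></td><td>20.3</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The first row gives the column names; drop the rank column and add URL.
headers = [th.get_text() for th in soup.find_all("tr", limit=2)[0].find_all("th")]
headers = ["URL"] + headers[1:]

rows = soup.find_all("tr")[1:]
player_stats = [
    [row.a.get("href")] + [td.get_text() for td in row.find_all("td")]
    for row in rows
    # Header rows have no <td> cells and no link, so they are dropped here.
    if row.find_all("td") and row.a
]

print(headers)
print(player_stats)
```

Only the two data rows survive, each starting with the link taken from the player's name cell.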
Here is an example of the page to be scraped.
All the stats for the players who appeared in a season are on a single page, so you could also collect enough data by simply copying and pasting the table into spreadsheet software such as Excel.
This time, we targeted data for 20 years (2000-2019). The scraping result looks like this.
| | URL | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | eFG% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | /players/a/abdulta01.html | Tariq Abdul-Wahad | SG | 25 | TOT | 61 | 56 | 25.9 | 4.5 | 10.6 | .424 | 0.0 | 0.4 | .130 | 4.4 | 10.2 | .435 | .426 | 2.4 | 3.2 | .756 | 1.7 | 3.1 | 4.8 | 1.6 | 1.0 | 0.5 | 1.7 | 2.4 | 11.4 | 2000 |
1 | /players/a/abdulta01.html | Tariq Abdul-Wahad | SG | 25 | ORL | 46 | 46 | 26.2 | 4.8 | 11.2 | .433 | 0.0 | 0.5 | .095 | 4.8 | 10.7 | .447 | .435 | 2.5 | 3.3 | .762 | 1.7 | 3.5 | 5.2 | 1.6 | 1.2 | 0.3 | 1.9 | 2.5 | 12.2 | 2000 |
2 | /players/a/abdulta01.html | Tariq Abdul-Wahad | SG | 25 | DEN | 15 | 10 | 24.9 | 3.4 | 8.7 | .389 | 0.1 | 0.1 | .500 | 3.3 | 8.6 | .388 | .393 | 2.1 | 2.8 | .738 | 1.6 | 1.9 | 3.5 | 1.7 | 0.4 | 0.8 | 1.3 | 2.1 | 8.9 | 2000 |
3 | /players/a/abdursh01.html | Shareef Abdur-Rahim | SF | 23 | VAN | 82 | 82 | 39.3 | 7.2 | 15.6 | .465 | 0.4 | 1.2 | .302 | 6.9 | 14.4 | .478 | .477 | 5.4 | 6.7 | .809 | 2.7 | 7.4 | 10.1 | 3.3 | 1.1 | 1.1 | 3.0 | 3.0 | 20.3 | 2000 |
4 | /players/a/alexaco01.html | Cory Alexander | PG | 26 | DEN | 29 | 2 | 11.3 | 1.0 | 3.4 | .286 | 0.3 | 1.2 | .257 | 0.7 | 2.2 | .302 | .332 | 0.6 | 0.8 | .773 | 0.3 | 1.2 | 1.4 | 2.0 | 0.8 | 0.1 | 1.0 | 1.3 | 2.8 | 2000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
703 | /players/z/zellety01.html | Tyler Zeller | C | 29 | MEM | 4 | 1 | 20.5 | 4.0 | 7.0 | .571 | 0.0 | 0.0 | | 4.0 | 7.0 | .571 | .571 | 3.5 | 4.5 | .778 | 2.3 | 2.3 | 4.5 | 0.8 | 0.3 | 0.8 | 1.0 | 4.0 | 11.5 | 2019 |
704 | /players/z/zizican01.html | Ante Žižić | C | 22 | CLE | 59 | 25 | 18.3 | 3.1 | 5.6 | .553 | 0.0 | 0.0 | | 3.1 | 5.6 | .553 | .553 | 1.6 | 2.2 | .705 | 1.8 | 3.6 | 5.4 | 0.9 | 0.2 | 0.4 | 1.0 | 1.9 | 7.8 | 2019 |
705 | /players/z/zubaciv01.html | Ivica Zubac | C | 21 | TOT | 59 | 37 | 17.6 | 3.6 | 6.4 | .559 | 0.0 | 0.0 | | 3.6 | 6.4 | .559 | .559 | 1.7 | 2.1 | .802 | 1.9 | 4.2 | 6.1 | 1.1 | 0.2 | 0.9 | 1.2 | 2.3 | 8.9 | 2019 |
706 | /players/z/zubaciv01.html | Ivica Zubac | C | 21 | LAL | 33 | 12 | 15.6 | 3.4 | 5.8 | .580 | 0.0 | 0.0 | | 3.4 | 5.8 | .580 | .580 | 1.7 | 2.0 | .864 | 1.6 | 3.3 | 4.9 | 0.8 | 0.1 | 0.8 | 1.0 | 2.2 | 8.5 | 2019 |
707 | /players/z/zubaciv01.html | Ivica Zubac | C | 21 | LAC | 26 | 25 | 20.2 | 3.8 | 7.2 | .538 | 0.0 | 0.0 | | 3.8 | 7.2 | .538 | .538 | 1.7 | 2.3 | .733 | 2.3 | 5.3 | 7.7 | 1.5 | 0.4 | 0.9 | 1.4 | 2.5 | 9.4 | 2019 |
Rate stats such as FG% (field goal percentage, i.e. shooting success rate) appear to be empty when the number of attempts is 0. Replace them with NaN.
data = data.replace(r'^\s*$', np.nan, regex=True)
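To see what that pattern actually catches, here is a small sketch on a toy DataFrame. The values are made up; only the column names `3PA` and `3P%` come from the scraped table.

```python
import numpy as np
import pandas as pd

# Toy frame: empty and whitespace-only strings stand in for missing percentages.
toy = pd.DataFrame({"3PA": ["0.0", "1.2", "0.0"], "3P%": ["", ".302", "  "]})

# ^\s*$ matches strings that are empty or contain only whitespace.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

print(cleaned["3P%"].isna().tolist())
```

Cells that hold real text, such as `0.0` or `.302`, are left untouched.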
Every column is still a string. Convert the columns you want to treat as numbers to float. Some columns contain only integers, but handling them separately is a hassle, so I convert everything to float.
Before conversion, the rate stats are written as .XXX, so I prepend a 0 to get 0.XXX. (Strictly speaking, Python's float conversion also accepts the .XXX form as is, so this step is mainly for tidiness.)
add_zero_cols = [col for col in data.columns if '%' in col]
num_cols = ['Age'] + list(data.columns[5:-1])
for col in add_zero_cols:
    data[col] = '0' + data[col]
for col in num_cols:
    data[col] = data[col].astype(float)
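A minimal sketch of the same conversion on toy data, to show that NaN cells survive both the string concatenation and the cast. The rows are made up, and `pd.to_numeric` is shown only as an alternative I am adding for comparison; it is not what the article uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Player": ["A", "B"],
    "FG%": [".424", np.nan],
    "PTS": ["11.4", "7.8"],
})

# Same idea as the article: prepend "0" to the percentage strings, then cast.
as_article = toy.copy()
as_article["FG%"] = "0" + as_article["FG%"]  # NaN stays NaN
for col in ["FG%", "PTS"]:
    as_article[col] = as_article[col].astype(float)

# Alternative: pd.to_numeric parses ".424" directly, without the leading zero.
direct = pd.to_numeric(toy["FG%"])

print(as_article.dtypes)
print(direct.tolist())
```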
Let's check the result by showing the top 10 scoring averages (points per game).
data.sort_values('PTS', ascending=False)[['Player', 'PTS', 'Year']].head(10)
| | Player | PTS | Year |
---|---|---|---|
11135 | James Harden | 36.1 | 2019 |
3266 | Kobe Bryant* | 35.4 | 2006 |
3428 | Allen Iverson* | 33.0 | 2006 |
1813 | Tracy McGrady* | 32.1 | 2003 |
7954 | Kevin Durant | 32.0 | 2014 |
3818 | Kobe Bryant* | 31.6 | 2007 |
10167 | Russell Westbrook | 31.6 | 2017 |
1249 | Allen Iverson* | 31.4 | 2002 |
3442 | LeBron James | 31.4 | 2006 |
715 | Allen Iverson* | 31.1 | 2001 |
There seems to be no problem. The list is full of superstars who are well known in Japan too, such as James Harden, Allen Iverson, and Kobe Bryant.
I took a quick look at the data source page to see what the * after a name means, but I couldn't tell for sure. It may mark players who are in the Hall of Fame.
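Whatever the marker means, it is part of the name string, so "Kobe Bryant*" and "Kobe Bryant" would count as different players in a group-by. A small sketch of one way to separate the two, on made-up rows; the `Starred` column name is hypothetical and not part of the article's data.

```python
import pandas as pd

toy = pd.DataFrame({"Player": ["Kobe Bryant*", "James Harden"]})

# Keep the marker as a boolean column, then strip it from the name.
toy["Starred"] = toy["Player"].str.endswith("*")
toy["Player"] = toy["Player"].str.rstrip("*")

print(toy)
```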
I also collected height and weight from the personal-page URLs that were included in the scrape. (This only takes a small change to the code at the beginning, so the additional code is omitted.)
Finally, the data looks like this. (The Weight and Height columns have been added at the far right.)
| | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | eFG% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | Year | Weight | Height |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tariq Abdul-Wahad | SG | 25.0 | TOT | 61.0 | 56.0 | 25.9 | 4.5 | 10.6 | 0.424 | 0.0 | 0.4 | 0.130 | 4.4 | 10.2 | 0.435 | 0.426 | 2.4 | 3.2 | 0.756 | 1.7 | 3.1 | 4.8 | 1.6 | 1.0 | 0.5 | 1.7 | 2.4 | 11.4 | 2000 | 101.24 | 1.98 |
3 | Shareef Abdur-Rahim | SF | 23.0 | VAN | 82.0 | 82.0 | 39.3 | 7.2 | 15.6 | 0.465 | 0.4 | 1.2 | 0.302 | 6.9 | 14.4 | 0.478 | 0.477 | 5.4 | 6.7 | 0.809 | 2.7 | 7.4 | 10.1 | 3.3 | 1.1 | 1.1 | 3.0 | 3.0 | 20.3 | 2000 | 102.15 | 2.06 |
5 | Ray Allen* | SG | 24.0 | MIL | 82.0 | 82.0 | 37.4 | 7.8 | 17.2 | 0.455 | 2.1 | 5.0 | 0.423 | 5.7 | 12.2 | 0.468 | 0.516 | 4.3 | 4.9 | 0.887 | 1.0 | 3.4 | 4.4 | 3.8 | 1.3 | 0.2 | 2.2 | 2.3 | 22.1 | 2000 | 93.07 | 1.96 |
7 | John Amaechi | C | 29.0 | ORL | 80.0 | 53.0 | 21.1 | 3.8 | 8.8 | 0.437 | 0.0 | 0.1 | 0.167 | 3.8 | 8.7 | 0.439 | 0.438 | 2.8 | 3.6 | 0.766 | 0.8 | 2.6 | 3.3 | 1.2 | 0.4 | 0.5 | 1.7 | 2.0 | 10.5 | 2000 | 122.58 | 2.08 |
8 | Derek Anderson | SG | 25.0 | LAC | 64.0 | 58.0 | 34.4 | 5.9 | 13.4 | 0.438 | 0.9 | 2.8 | 0.309 | 5.0 | 10.7 | 0.472 | 0.470 | 4.2 | 4.8 | 0.877 | 1.3 | 2.8 | 4.0 | 3.4 | 1.4 | 0.2 | 2.6 | 2.3 | 16.9 | 2000 | 88.08 | 1.96 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11561 | Delon Wright | PG | 26.0 | TOT | 75.0 | 13.0 | 22.7 | 3.2 | 7.4 | 0.434 | 0.7 | 2.2 | 0.298 | 2.6 | 5.2 | 0.492 | 0.478 | 1.6 | 2.0 | 0.793 | 0.9 | 2.6 | 3.5 | 3.3 | 1.2 | 0.4 | 1.0 | 1.4 | 8.7 | 2019 | 83.08 | 1.96 |
11566 | Thaddeus Young | PF | 30.0 | IND | 81.0 | 81.0 | 30.7 | 5.5 | 10.4 | 0.527 | 0.6 | 1.8 | 0.349 | 4.8 | 8.6 | 0.564 | 0.557 | 1.1 | 1.7 | 0.644 | 2.4 | 4.1 | 6.5 | 2.5 | 1.5 | 0.4 | 1.5 | 2.4 | 12.6 | 2019 | 99.88 | 2.03 |
11567 | Trae Young | PG | 20.0 | ATL | 81.0 | 81.0 | 30.9 | 6.5 | 15.5 | 0.418 | 1.9 | 6.0 | 0.324 | 4.6 | 9.6 | 0.477 | 0.480 | 4.2 | 5.1 | 0.829 | 0.8 | 2.9 | 3.7 | 8.1 | 0.9 | 0.2 | 3.8 | 1.7 | 19.1 | 2019 | 81.72 | 1.86 |
11572 | Ante Žižić | C | 22.0 | CLE | 59.0 | 25.0 | 18.3 | 3.1 | 5.6 | 0.553 | 0.0 | 0.0 | NaN | 3.1 | 5.6 | 0.553 | 0.553 | 1.6 | 2.2 | 0.705 | 1.8 | 3.6 | 5.4 | 0.9 | 0.2 | 0.4 | 1.0 | 1.9 | 7.8 | 2019 | 115.32 | 2.08 |
11573 | Ivica Zubac | C | 21.0 | TOT | 59.0 | 37.0 | 17.6 | 3.6 | 6.4 | 0.559 | 0.0 | 0.0 | NaN | 3.6 | 6.4 | 0.559 | 0.559 | 1.7 | 2.1 | 0.802 | 1.9 | 4.2 | 6.1 | 1.1 | 0.2 | 0.9 | 1.2 | 2.3 | 8.9 | 2019 | 108.96 | 2.13 |
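The article omits how Height and Weight were produced, so the following is only a sketch of the unit conversion, not the author's code. It assumes the player page lists height as a feet-inches string such as `6-6` and weight in pounds. Dividing the weights in the table by 0.454 gives almost exactly whole numbers (101.24 / 0.454 is about 223), which suggests a rounded factor of 0.454 kg per pound was used; that is my inference, and the exact factor is 0.45359237.

```python
# Assumed conversion factors (see the note above about 0.454).
KG_PER_LB = 0.454
M_PER_INCH = 0.0254


def height_to_m(feet_inches: str) -> float:
    """Convert a 'feet-inches' string such as '6-6' to meters, rounded to 2 decimals."""
    feet, inches = feet_inches.split("-")
    return round((int(feet) * 12 + int(inches)) * M_PER_INCH, 2)


def weight_to_kg(pounds: float) -> float:
    """Convert pounds to kilograms, rounded to 2 decimals."""
    return round(pounds * KG_PER_LB, 2)


print(height_to_m("6-6"))
print(weight_to_kg(223))
```

With these assumptions, `6-6` and 223 lb come out as 1.98 m and 101.24 kg, the same values as the first row of the table.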
By the way, here are the 10 tallest players of the last 20 years. (Here df is the DataFrame with the Height and Weight columns added.)
df.groupby(['Player']).max().reset_index().sort_values('Height', ascending=False)[['Player', 'Year', 'Height', 'Weight']].head(10)
| | Player | Year | Height | Weight |
---|---|---|---|---|
672 | Gheorghe Mureșan | 2000 | 2.31 | 137.56 |
1641 | Shawn Bradley | 2005 | 2.29 | 106.69 |
1890 | Yao Ming* | 2011 | 2.29 | 140.74 |
1653 | Sim Bhullar | 2015 | 2.26 | 163.44 |
1442 | Pavel Podkolzin | 2006 | 2.26 | 118.04 |
1656 | Slavko Vraneš | 2004 | 2.26 | 124.85 |
1519 | Rik Smits | 2000 | 2.24 | 113.50 |
172 | Boban Marjanović | 2019 | 2.24 | 131.66 |
1449 | Peter John Ramos | 2005 | 2.21 | 124.85 |
583 | Edy Tavares | 2017 | 2.21 | 118.04 |
Height is in meters (m). In third place is Yao Ming, also known as the Great Wall of China. These players are enormous.
Now that the data is ready, I would like to visualize it next time.
data.sort_values('AST', ascending=False)[['Player', 'AST', 'Year']].head(10)
| | Player | AST | Year |
---|---|---|---|
6617 | Deron Williams | 12.8 | 2011 |
7061 | Rajon Rondo | 11.7 | 2012 |
9486 | Rajon Rondo | 11.7 | 2016 |
4085 | Steve Nash* | 11.6 | 2007 |
4686 | Chris Paul | 11.6 | 2008 |
2977 | Steve Nash* | 11.5 | 2005 |
6440 | Steve Nash* | 11.4 | 2011 |
6509 | Rajon Rondo | 11.2 | 2011 |
9819 | James Harden | 11.2 | 2017 |
7641 | Rajon Rondo | 11.1 | 2013 |
data.sort_values('TRB', ascending=False)[['Player', 'TRB', 'Year']].head(10)
| | Player | TRB | Year |
---|---|---|---|
7236 | Earl Barron | 18.0 | 2013 |
651 | Danny Fortson | 16.3 | 2001 |
10372 | Andre Drummond | 16.0 | 2018 |
11056 | Andre Drummond | 15.6 | 2019 |
1974 | Ben Wallace | 15.4 | 2003 |
6391 | Kevin Love | 15.2 | 2011 |
10537 | DeAndre Jordan | 15.2 | 2018 |
8695 | DeAndre Jordan | 15.0 | 2015 |
9163 | Andre Drummond | 14.8 | 2016 |
6894 | Dwight Howard | 14.5 | 2012 |
Judging from outliers such as Earl Barron's 18.0 at the top of the rebounding list, players who have not played a minimum number of games need to be excluded.
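One way to do that is to filter on the games-played column `G` before sorting. The sketch below uses made-up rows, and `MIN_GAMES` is a hypothetical cutoff chosen only for illustration, not an official qualification rule.

```python
import pandas as pd

# Made-up rows: one player with a single game and two with full seasons.
toy = pd.DataFrame({
    "Player": ["Low Sample", "Regular A", "Regular B"],
    "G": [1.0, 80.0, 79.0],
    "TRB": [18.0, 16.0, 15.6],
    "Year": ["2013", "2018", "2019"],
})

MIN_GAMES = 41  # hypothetical threshold

# Drop players below the threshold, then rank by rebounds per game.
qualified = toy[toy["G"] >= MIN_GAMES]
top = qualified.sort_values("TRB", ascending=False)[["Player", "TRB", "Year"]].head(10)

print(top)
```

The same filter can be applied before the PTS and AST rankings.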