Data scraping, also known as web scraping, is the process of extracting data from a website programmatically. The destination of the extracted data varies: in some cases it is channelled to another website, but it is more commonly saved to a spreadsheet or a local file on your computer. Aside from directly querying a REST API, it is one of the most efficient ways to get data from the web.

We are going to be extracting Premier League data such as:

  • All-time Top Scorers
  • 2020/21 League Table
  • 2020/21 Top Scorers

Importing Libraries

We are going to be using the official Premier League website to extract the data we need. The libraries we are using are as follows:

  • Pandas: This will be used to generate our dataframes
  • Numpy: Numpy will be used to calculate our numeric data such as player statistics
  • Beautiful Soup: A parsing library that we will use to extract data for each player and put it into a table
  • Matplotlib & Seaborn: We will use these two libraries to plot charts of the data we extract at the EDA stage
  • Datetime: We will use datetime to deal with date of birth and match date strings
  • RegEx (RE): A library for matching and extracting patterns in strings
  • Requests: This library will be used for sending HTTP/1.1 requests to the Premier League website
  • JSON: Requests builds query strings and form-encodes PUT and POST data for us, so there is no need to do either by hand. We will use the JSON library to read the JSON data that the API returns in response to our HTTP requests
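
Before going further it can help to confirm that everything is installed. The snippet below is a small sketch rather than part of the original notebook: `missing_libraries` is a hypothetical helper that uses the standard library's importlib to report which third-party packages cannot be found. The code later in this post appears to assume the usual aliases, namely `import pandas as pd`, `import numpy as np`, `from bs4 import BeautifulSoup` and `from datetime import datetime`, plus plain `import json`, `import re` and `import requests`.

```python
from importlib.util import find_spec

# Import names of the third-party libraries used in this tutorial.
# Note that Beautiful Soup is imported under the name "bs4".
THIRD_PARTY = ["pandas", "numpy", "bs4", "matplotlib", "seaborn", "requests"]


def missing_libraries(names=THIRD_PARTY):
    """Return the names from `names` that cannot be imported (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_libraries()
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All third-party libraries are available.")
```

The datetime, re and json modules ship with Python, so they never need installing.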

Planning

Our objective is to take data from the existing tables where possible, using as little code as possible. However, some of the data we want will need to be curated specifically for our requirements. It would also be nice to have the data as clean as possible before it is written to the dataframe.

Method 1: HTML Table Scraping

The easiest way to extract tabular data from a website is to take it from an existing HTML table. We can do that by downloading the page with requests and reading its tables with pandas.

Method 2: Beautiful Soup Scraping

Beautiful Soup is a great way to extract data from various sources by targeting specific HTML elements such as p and div tags. We can then create a dataframe and append each element to it programmatically using loops. However, this approach has certain limitations, especially on the Premier League website, which loads its pages dynamically. This means the HTML page has elements that are not present at first and are only generated on the client side as the user scrolls down. An example of this behaviour can be seen when we scroll down the players page to reveal more players.

The problem with this is that the HTML file does not contain the hidden elements, so if you tried to scrape the page you would only get the data for the visible elements. The only way to load the page fully would be to scroll down manually until all the elements have been revealed, then scrape the page. This can be rather tedious: if you did that with something like a Twitter feed you could be scrolling for hours! There are two solutions for this.

  • Option 1: We could use Selenium, a browser automation tool, to open a headless browser and scroll down the page, then pass the resulting HTML to Beautiful Soup to parse and scrape.
  • Option 2: We can access the API by intercepting and reusing the JSON requests used to pull data into the search results when we scroll down the page.
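
To make the Beautiful Soup approach described above a little more concrete, here is a minimal sketch. The HTML string and its class names (`player`, `name`, `club`) are made up for illustration and do not reflect the real markup of the Premier League website; only `BeautifulSoup`, `find_all`, `find` and `get_text` are real library calls.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a page of player cards.
html = """
<div class="player"><p class="name">Harry Kane</p><p class="club">Tottenham Hotspur</p></div>
<div class="player"><p class="name">Jamie Vardy</p><p class="club">Leicester City</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Loop over every matching div and collect one dictionary per player.
rows = []
for card in soup.find_all("div", class_="player"):
    rows.append({
        "Name": card.find("p", class_="name").get_text(strip=True),
        "Club": card.find("p", class_="club").get_text(strip=True),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`. Note that the parser only ever sees the elements contained in the HTML it is given, which is exactly why content loaded later by scrolling is missed.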

[Image: images/premier_league_players.png]

Method 3: Using an API and JSONs

We can use this method to interface directly with the API that populates the webpage. The JSON file contains all the data we need to populate our dataframe, but we can also use it in conjunction with our Beautiful Soup scraping method to fill in any data that is missing for a player. To find the API in Google Chrome, right-click on the webpage and choose ‘Inspect’, then go to the ‘Network’ tab of the developer tools panel that opens. In the Network tab we can see every request the webpage makes to the server. We need to narrow this down to XHR and Fetch requests, so we click the ‘XHR’ filter.

Once this is all set up, scroll down on the ‘Players’ page and you will notice entries start to appear in the Network tab. These are URL requests: the webpage is dynamic and generates content with JavaScript as we scroll down. Although it looks like an infinitely scrolling page, it is actually paginated.
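
Because the page is paginated behind the scenes, the same requests can be replayed in a loop. The sketch below shows the general idea only: `fetch_all_pages` and `fake_fetch` are hypothetical names, and the zero-based `page` argument is an assumption, so check the real parameter names in the Network tab. A stand-in function replaces the API so the example runs without a network connection.

```python
def fetch_all_pages(fetch_page, page_size=20, max_pages=100):
    """Collect results page by page until a short or empty page is returned.

    fetch_page(page, page_size) is any function that returns a list of results
    for one page; in practice it would wrap a requests.get() call.
    """
    results = []
    for page in range(max_pages):
        batch = fetch_page(page, page_size)
        results.extend(batch)
        if len(batch) < page_size:  # a short page means we reached the end
            break
    return results


# Stand-in for the real API: 45 fake players served in pages.
fake_players = [f"player_{i}" for i in range(45)]


def fake_fetch(page, page_size):
    start = page * page_size
    return fake_players[start:start + page_size]


all_players = fetch_all_pages(fake_fetch)
print(len(all_players))  # 45
```

The `max_pages` limit is a safety net so that a misbehaving endpoint cannot keep the loop running forever.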

Using Datetime to keep track of when the data was scraped

# Creating a datetime variable
from datetime import datetime

now = datetime.now()
current_date = now.strftime("%d/%m/%y")
current_time = now.strftime("%H:%M:%S")

# Print
print("Current Date =", current_date)
print("Current Time =", current_time)
Current Date = 29/03/21
Current Time = 19:33:48
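
One way to put this timestamp to use, sketched here as an assumption rather than something the original notebook does, is to include it in the name of the file the data is saved to, so that repeated scrapes do not overwrite each other. `stamped_filename` is a hypothetical helper.

```python
from datetime import datetime


def stamped_filename(prefix, when=None):
    """Return a CSV filename that contains the scrape date and time."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y-%m-%d_%H%M')}.csv"


print(stamped_filename("pl_player_stats", datetime(2021, 3, 29, 19, 33, 48)))
# pl_player_stats_2021-03-29_1933.csv
```

Putting the year first keeps the files in chronological order when they are sorted by name.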

Method 1: HTML Table Scraping

We can use requests.get() to download the page into a variable called response, then pass response.text to pd.read_html(), which returns a list of every table it finds on the page. We store that list in at_goals_r, select the table we need by its index, then remove the data or columns we don’t need.

All-time goal scorers Table

# Extracting basic tabular data for all-time goal scorers
response = requests.get('https://www.premierleague.com/stats/top/players/goals') 
at_goals_r = pd.read_html(response.text)
at_goals_r
[    Rank                   Player               Club        Nationality  Stat  \
 0      1             Alan Shearer                  -            England   260   
 1      2             Wayne Rooney                  -            England   208   
 2      3              Andrew Cole                  -            England   187   
 3      4            Sergio Agüero    Manchester City          Argentina   181   
 4      5            Frank Lampard                  -            England   177   
 5      6            Thierry Henry                  -             France   175   
 6      7            Robbie Fowler                  -            England   163   
 7      8            Jermain Defoe                  -            England   162   
 8      9               Harry Kane  Tottenham Hotspur            England   160   
 9     10             Michael Owen                  -            England   150   
 10    11            Les Ferdinand                  -            England   149   
 11    12         Teddy Sheringham                  -            England   146   
 12    13         Robin van Persie                  -        Netherlands   144   
 13    14  Jimmy Floyd Hasselbaink                  -        Netherlands   127   
 14    15             Robbie Keane                  -            Ireland   126   
 15    16           Nicolas Anelka                  -             France   125   
 16    17             Dwight Yorke                  -  Trinidad & Tobago   123   
 17    18           Steven Gerrard                  -            England   120   
 18    19              Jamie Vardy     Leicester City            England   115   
 19    20            Romelu Lukaku                  -            Belgium   113   
 
     Unnamed: 5  
 0          NaN  
 1          NaN  
 2          NaN  
 3          NaN  
 4          NaN  
 5          NaN  
 6          NaN  
 7          NaN  
 8          NaN  
 9          NaN  
 10         NaN  
 11         NaN  
 12         NaN  
 13         NaN  
 14         NaN  
 15         NaN  
 16         NaN  
 17         NaN  
 18         NaN  
 19         NaN  ]
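# Loading the all-time list and using iloc to remove the last column, 'Unnamed: 5' (the filter column)
# Note: Styler.hide_index() was removed in pandas 2.0; on newer versions use .style.hide(axis="index")
at_goals = at_goals_r[0].iloc[:,:-1].style.hide_index() 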
at_goals
Rank  Player                   Club               Nationality        Stat
1     Alan Shearer             -                  England            260
2     Wayne Rooney             -                  England            208
3     Andrew Cole              -                  England            187
4     Sergio Agüero            Manchester City    Argentina          181
5     Frank Lampard            -                  England            177
6     Thierry Henry            -                  France             175
7     Robbie Fowler            -                  England            163
8     Jermain Defoe            -                  England            162
9     Harry Kane               Tottenham Hotspur  England            160
10    Michael Owen             -                  England            150
11    Les Ferdinand            -                  England            149
12    Teddy Sheringham         -                  England            146
13    Robin van Persie         -                  Netherlands        144
14    Jimmy Floyd Hasselbaink  -                  Netherlands        127
15    Robbie Keane             -                  Ireland            126
16    Nicolas Anelka           -                  France             125
17    Dwight Yorke             -                  Trinidad & Tobago  123
18    Steven Gerrard           -                  England            120
19    Jamie Vardy              Leicester City     England            115
20    Romelu Lukaku            -                  Belgium            113

All-time assists Table

# Extracting basic tabular data for all-time assists
response = requests.get('https://www.premierleague.com/stats/top/players/goal_assist') 
at_ast_r = pd.read_html(response.text)

# Loads the all-time list
at_ast = at_ast_r[0].iloc[:,:-1].style.hide_index() 
at_ast
Rank  Player              Club             Nationality  Stat
1     Ryan Giggs          -                Wales        162
2     Cesc Fàbregas       -                Spain        111
3     Wayne Rooney        -                England      103
4     Frank Lampard       -                England      102
5     Dennis Bergkamp     -                Netherlands  94
6     David Silva         -                Spain        93
7     Steven Gerrard      -                England      92
8     James Milner        Liverpool        England      85
9     David Beckham       -                England      80
10    Kevin De Bruyne     Manchester City  Belgium      77
11    Teddy Sheringham    -                England      76
12    Thierry Henry       -                France       74
13    Andrew Cole         -                England      73
14    Ashley Young        -                England      69
15    Darren Anderton     -                England      68
16    Gareth Barry        -                England      64
16    Matthew Le Tissier  -                England      64
16    Alan Shearer        -                England      64
19    Christian Eriksen   -                Denmark      62
19    Nolberto Solano     -                Peru         62

2020-21 League Table

# Extracting basic tabular data for the league table
response = requests.get('https://www.premierleague.com/tables') 
pl_table_r = pd.read_html(response.text)
# Loads the first table and drops the last three columns
pl_table = pl_table_r[0].iloc[:,:-3]

# Remove excess rows containing the string 'Recent' (.copy() avoids a SettingWithCopyWarning below)
pl_table = pl_table[~pl_table['Club'].str.contains('Recent')].copy()

# Cleaning up the 'Club' column by removing abbreviations
pl_table['Club'] = pl_table['Club'].str.slice(0, -5)

# Renaming and cleaning up the 'Position  Pos' column (raw string so that \d is passed to the regex engine as-is)
pl_table.rename(columns={'Position  Pos':'Position'}, inplace=True)
pl_table['Position'] = pl_table['Position'].str.extract(r'(\d+)').astype(int)

# Final table
pl_table.style.hide_index()
Position  Club                       Played Pl   Won W   Drawn D   Lost L   GF   GA   GD    Points Pts
1         Manchester City            30          22      5         3        64   21   +43   71
2         Manchester United          29          16      9         4        56   32   +24   57
3         Leicester City             29          17      5         7        53   32   +21   56
4         Chelsea                    29          14      9         6        44   25   +19   51
5         West Ham United            29          14      7         8        45   35   +10   49
6         Tottenham Hotspur          29          14      6         9        49   30   +19   48
7         Liverpool                  29          13      7         9        48   36   +12   46
8         Everton                    28          14      4         10       40   37   +3    46
9         Arsenal                    29          12      6         11       40   32   +8    42
10        Aston Villa                28          12      5         11       39   30   +9    41
11        Leeds United               29          12      3         14       45   47   -2    39
12        Crystal Palace             29          10      7         12       31   47   -16   37
13        Wolverhampton Wanderers    29          9       8         12       28   38   -10   35
14        Southampton                29          9       6         14       36   51   -15   33
15        Burnley                    29          8       9         12       22   37   -15   33
16        Brighton and Hove Albion   29          7       11        11       32   36   -4    32
17        Newcastle United           29          7       7         15       28   48   -20   28
18        Fulham                     30          5       11        14       23   38   -15   26
19        West Bromwich Albion       29          3       9         17       20   57   -37   18
20        Sheffield United           29          4       2         23       16   50   -34   14
# Column names
pl_table.columns
Index(['Position', 'Club', 'Played  Pl', 'Won  W', 'Drawn  D', 'Lost  L', 'GF',
       'GA', 'GD', 'Points  Pts'],
      dtype='object')

This method is ideal if you want a quick and easy extraction of tabular data, but what if the data we need does not come from a single structured table? In that case we need to build a table ourselves. Next we will look at using Beautiful Soup and RE.

Method 2: Beautiful Soup Scraping

Initially, I used a Chrome extension called ‘Link Grabber’ to extract the URLs from the player page, since I was having trouble scraping with the API and automating the scrolling. I manually scrolled down until the whole page had loaded before running it, then I filtered the links to just ‘/overview’ to get links for players only. I found just over 844 player links and saved them to a .csv file. In the long run, however, it is more efficient to intercept and reuse the HTTP requests with their query parameters, and it also makes the extraction easier to repeat if the data needs to be updated at a future date. Unfortunately, scraping with Beautiful Soup alone will not work here. We would need a library like Selenium to automate the browser, and that is currently outside the scope of this tutorial.

Potential Issues

  • Some players don’t have countries, so these will have to be set to NaN or None in order to keep the shape of the df
  • We need to find a way to get more links from the page because of the infinite scroll (asynchronous loading)
  • If we scrape players in different positions it creates NaN columns, as the data available differs by position
  • We need to choose a single season to take data from, instead of grabbing all data, for more precise data
  • Some players are no longer in the league, so their current club name will be missing
  • Data is missing before 2007 as Opta was not collecting match statistics at that time
  • Missing weights can cause our script to stop due to NaN errors
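
Several of these issues come down to records with missing fields. One possible way to keep every record the same shape, sketched here with made-up players and a hypothetical `normalise` helper, is to fill the gaps with None before building the dataframe:

```python
# The columns every record should have (an illustrative subset).
COLUMNS = ["Name", "Club", "Height", "Weight", "Nationality"]


def normalise(record, columns=COLUMNS):
    """Return a dict with exactly `columns` as keys, using None for any gaps."""
    return {column: record.get(column) for column in columns}


# Made-up scraped records: the second one is missing three fields.
scraped = [
    {"Name": "Player A", "Club": "Club A", "Height": "188cm", "Weight": "86kg", "Nationality": "England"},
    {"Name": "Player B", "Club": "Club B"},
]

rows = [normalise(record) for record in scraped]

# List the players that still need data filling in from another source.
missing = [row["Name"] for row in rows if None in row.values()]
print(missing)  # ['Player B']
```

pandas does something similar on its own when a dataframe is built from a list of dictionaries, filling absent keys with NaN, but making the gaps explicit makes it easier to report which players need attention.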

Method 3: Using an API and JSONs

First, we need to get the ‘Request URL’ of the page we are getting data from. As discussed earlier, we can find this in the ‘Network’ tab of the browser’s developer tools. Remember that it will only be revealed once we have triggered a URL request by scrolling down the page. Inside the ‘Network’ tab there is a filter called ‘XHR’ (short for XMLHttpRequest), which shows the HTTP requests the page’s scripts make to a web server. With this filter on, we can see every such request made by the current webpage. We need to click on one of the entries that appear in the list; its ‘Headers’ section will show us every parameter required to make a successful request to the server.

At the top of the ‘Headers’ section we will see the ‘Request URL’: https://footballapi.pulselive.com/football/stats/ranked/players/goals. If the requirements are not satisfied, the request will result in a ‘403 ERROR: The request could not be satisfied’, which you can see if you try to enter the URL directly into your browser. We need to include headers and query parameters to send a request to the server successfully. We can start by copying the parameters from the Headers section and modifying them to our specifications. In this case, instead of calling each page individually, we can set the page size (the number of search results) to what we need.
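
To see what the query parameters do to the Request URL, the sketch below builds the final address by hand with the standard library. It is only an illustration: requests does this for us when we pass `params`, and the dictionary here is a subset of the parameters used in the request later in this post.

```python
from urllib.parse import urlencode

url = "https://footballapi.pulselive.com/football/stats/ranked/players/goals"

# A subset of the query parameters, copied from the browser's request.
query_params = {
    "pageSize": 20,
    "compSeasons": 363,
    "comps": 1,
    "positions": "FORWARD",
}

# urlencode turns the dictionary into "key=value" pairs joined by "&".
full_url = url + "?" + urlencode(query_params)
print(full_url)
# https://footballapi.pulselive.com/football/stats/ranked/players/goals?pageSize=20&compSeasons=363&comps=1&positions=FORWARD
```

Changing `pageSize` in the dictionary is all it takes to ask for more results in a single request.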

Plan of action:

Our plan of action here is to parse the JSON file we obtain from the URL Request then scrape the data from each player to obtain the playerId. This is important as the playerId is also used to navigate to their individual stats webpages, for example: https://www.premierleague.com/players/13286/Tammy-Abraham/overview

We can navigate to each player’s page by extracting the playerId from each result and inserting it into the player URL. The beauty is that we can use a hybrid method: scraping with Beautiful Soup, then parsing the JSON file for any data missing from our scraping attempts.
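
As a small sketch of that URL building step, the hypothetical helper below reproduces the example address above. The rule for the name part (spaces replaced with hyphens) is an assumption inferred from that single example and may not hold for every player; the scraping function later in this post simply uses the fixed word player in that position.

```python
def player_overview_url(player_id, display_name):
    """Build an overview URL from a playerId and a display name (hypothetical helper)."""
    slug = display_name.strip().replace(" ", "-")
    # The JSON stores ids as floats such as 13286.0, so convert to int first.
    return f"https://www.premierleague.com/players/{int(player_id)}/{slug}/overview"


print(player_overview_url(13286.0, "Tammy Abraham"))
# https://www.premierleague.com/players/13286/Tammy-Abraham/overview
```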

Here is the code I used below:

# API URL
url = "https://footballapi.pulselive.com/football/stats/ranked/players/goals"

# Headers required for making a GET request. It is a good practice to provide headers with each request.
headers = {
    "content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "DNT": "1",
    "Origin": "https://www.premierleague.com",
    "Referer": "https://www.premierleague.com/players",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}

# Query parameters required to make get request
queryParams = {
    "age": 0,
    "pageSize": 20,
    "compSeasons": 363,
    "comps": 1,
    "compCodeForActivePlayer": "EN_PR",
    "positions": "FORWARD",
    "altIds": "true"
}

# Sending the request with url, headers, and query params
response = requests.get(url = url, headers = headers, params = queryParams)

# Empty containers to store player id's and other info
player_ids_test = []
info_test = []
counter = 0

# if response status code is 200 OK, then
if response.status_code == 200:
    # load the json data
    data_test = json.loads(response.text)
    # print the required data
    for player in data_test["stats"]["content"]:
        counter +=1
        print(counter, {player["owner"]["name"]["display"]:player["owner"]["info"]["positionInfo"]}) #int(player["id"])
        player_ids_test.append(int(player["owner"]["id"]))
        info_test.append(player["owner"]["info"]["positionInfo"])
1 {'Harry Kane': 'Centre Striker'}
2 {'Mohamed Salah': 'Left/Centre/Right Winger'}
3 {'Patrick Bamford': 'Centre Striker'}
4 {'Dominic Calvert-Lewin': 'Left/Centre Second Striker'}
5 {'Son Heung-Min': 'Left/Centre/Right Winger'}
6 {'Jamie Vardy': 'Centre Striker'}
7 {'Alexandre Lacazette': 'Centre Striker'}
8 {'Ollie Watkins': 'Left/Centre/Right Second Striker'}
9 {'Callum Wilson': 'Centre Striker'}
10 {'Pierre-Emerick Aubameyang': 'Centre Striker'}
11 {'Riyad Mahrez': 'Left/Right Winger'}
12 {'Marcus Rashford': 'Left/Centre Second Striker'}
13 {'Raheem Sterling': 'Forward'}
14 {'Wilfried Zaha': 'Left/Centre/Right Winger'}
15 {'Danny Ings': 'Centre Striker'}
16 {'Neal Maupay': 'Centre Striker'}
17 {'Che Adams': 'Centre Striker'}
18 {'Gabriel Jesus': 'Left/Centre/Right Second Striker'}
19 {'Sadio Mané': 'Left/Right Winger'}
20 {'Tammy Abraham': 'Centre Striker'}

JSON data

These are the contents of the JSON data for each player. We can delve into the data more deeply if we take the JSON file from the player’s page instead of the search results page. For example, we have access to hundreds of data points in the form of statistical events such as touches on the ball, passing accuracy, duels won and so on.

player
{'owner': {'playerId': 136935.0,
  'info': {'position': 'F',
   'shirtNum': 18.0,
   'positionInfo': 'Centre Striker',
   'loan': True},
  'nationalTeam': {'isoCode': 'GB-ENG',
   'country': 'England',
   'demonym': 'English'},
  'currentTeam': {'name': 'Chelsea',
   'club': {'name': 'Chelsea', 'abbr': 'CHE', 'id': 4.0},
   'teamType': 'FIRST',
   'shortName': 'Chelsea',
   'id': 4.0,
   'altIds': {'opta': 't8'}},
  'active': True,
  'birth': {'date': {'millis': 875750400000.0, 'label': '2 October 1997'},
   'country': {'isoCode': 'GB-ENG',
    'country': 'England',
    'demonym': 'English'},
   'place': 'London'},
  'age': '23 years 178 days',
  'name': {'display': 'Tammy Abraham', 'first': 'Tammy', 'last': 'Abraham'},
  'id': 13286.0,
  'altIds': {'opta': 'p173879'}},
 'rank': 20.0,
 'name': 'goals',
 'value': 6.0,
 'description': 'Todo: goals',
 'additionalInfo': {}}
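
Reading values out of a nested record like this can fail with a KeyError when a player has a field missing. The sketch below uses a trimmed copy of the record above and a hypothetical `dig` helper that returns a default instead of raising. It also converts the `millis` field, which holds milliseconds since 1 January 1970, into a date.

```python
from datetime import datetime, timezone


def dig(record, *keys, default=None):
    """Follow `keys` through nested dicts, returning `default` if any is missing."""
    for key in keys:
        if not isinstance(record, dict) or key not in record:
            return default
        record = record[key]
    return record


# A trimmed copy of the record shown above.
player = {
    "owner": {
        "name": {"display": "Tammy Abraham"},
        "birth": {"date": {"millis": 875750400000.0, "label": "2 October 1997"}},
        "id": 13286.0,
    },
    "value": 6.0,
}

# Convert milliseconds to seconds before building the datetime.
millis = dig(player, "owner", "birth", "date", "millis")
born = datetime.fromtimestamp(millis / 1000, tz=timezone.utc).date()

print(dig(player, "owner", "name", "display"), born)             # Tammy Abraham 1997-10-02
print(dig(player, "owner", "currentTeam", "name", default="-"))  # -
```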

We have now generated a list of playerIds. The request above returned the 20 highest-ranked forwards for goals in the competition season we selected (compSeasons 363), all of them players still in the Premier League. Check out the ids extracted below.

player_ids_test
[3960,
 5178,
 4291,
 9576,
 4999,
 8979,
 6899,
 8658,
 8454,
 5110,
 8983,
 13565,
 4316,
 4539,
 8245,
 16256,
 10905,
 19680,
 6519,
 13286]

Building our dataframe

Now for the fun part! We will create a function that automatically appends the extracted data to a dataframe and prints the name of each player as it is added. It also warns us if a player has missing data instead of stopping on the error; this data can then be filled in automatically from the JSON data if needed. Data that sits outside the tables, namely Club, Position, Date of Birth, Height, Weight and Nationality, is assigned to its columns manually. The result is automatically saved to a .csv file.
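
The function below reads each statistic from the stats page as one string and splits it into a label and a value. As a rough illustration (the sample strings are made up, and `split_stat` is a hypothetical helper rather than part of the scraper), splitting from the right with `rsplit(maxsplit=1)` keeps multi-word labels intact:

```python
def split_stat(text):
    """Split 'Goals per match 0.68' into ('Goals per match', '0.68')."""
    # Split once, at the last run of whitespace, then trim both halves.
    parts = [part.strip() for part in text.rsplit(maxsplit=1)]
    return tuple(parts) if len(parts) == 2 else None


print(split_stat("Goals per match 0.68"))      # ('Goals per match', '0.68')
print(split_stat("  Shots on target\n 390 "))  # ('Shots on target', '390')
print(split_stat("Goals"))                     # None
```

A plain `split()` would break the first example into four pieces, which is why it is only suitable for single-word labels.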

def get_data(player_num, index):
    # player_num is the playerId; index is the player's position in data_test['stats']['content']
    url = "https://www.premierleague.com/players/{}/player".format(player_num)
    soup = BeautifulSoup(requests.get(url + "/overview").content, "html.parser")
    data = {}
    data["Name"] = soup.select_one(".playerDetails .name").text
    club = soup.select_one(".playerbadgeContainer .visuallyHidden")
    warning = 'WARNING!!! ------- Missing Data for ' + data["Name"]

    try:
        if club:
            print(data["Name"])
            data["Club"] = club.text
            data["Position"] = soup.select_one('.label:-soup-contains("Position") + .info').text
            data["Date of Birth"] = soup.select_one('.label:-soup-contains("Date of Birth") + .info').text.strip()
            data["Height"] = soup.select_one('.label:-soup-contains("Height") + .info').text.strip()
            data["Weight"] = soup.select_one('.label:-soup-contains("Weight") + .info').text.strip()
            data["Nationality"] = soup.select_one('.label:-soup-contains("Nationality") + .info').text.strip()
    except AttributeError:
        # select_one() returns None when a label is missing from the page, so .text fails.
        # Fall back to the JSON data for the nationality and flag the player.
        data["Nationality"] = data_test['stats']['content'][index]['owner']['birth']['country']['isoCode']
        print(warning)

    # The stats page holds the numeric statistics
    soup = BeautifulSoup(requests.get(url + "/stats").content, "html.parser")

    # Top stats are single-word labels followed by a value
    for s in soup.select(".topStat"):
        v = s.text.split()
        if len(v) == 2:
            data[v[0]] = v[1]

    # Other stats can have multi-word labels, so split on the last whitespace only
    for s in soup.select(".normalStat"):
        v = list(map(str.strip, s.text.rsplit(maxsplit=1)))
        if len(v) == 2:
            data[v[0]] = v[1]

    return data

# Scrape every player, passing the loop position so the JSON fallback finds the right record
list_of_data = []
for counter, num in enumerate(player_ids_test):
    print(counter, num)
    list_of_data.append(get_data(num, counter))

pps = pd.DataFrame(list_of_data)
pps['Role'] = info_test
pps.to_csv('pl_player_stats_test.csv', index=False)
0 3960
Harry Kane
1 5178
Mohamed Salah
2 4291
Patrick Bamford
3 9576
Dominic Calvert-Lewin
4 4999
Son Heung-Min
5 8979
Jamie Vardy
6 6899
Alexandre Lacazette
7 8658
Ollie Watkins
8 8454
Callum Wilson
9 5110
Pierre-Emerick Aubameyang
10 8983
Riyad Mahrez
11 13565
Marcus Rashford
12 4316
Raheem Sterling
13 4539
Wilfried Zaha
WARNING!!! ------- Missing Data for Wilfried Zaha
14 8245
Danny Ings
15 16256
Neal Maupay
16 10905
Che Adams
17 19680
Gabriel Jesus
18 6519
Sadio Mané
19 13286
Tammy Abraham
# We can access different parts of the JSON file like this:
data_test['stats']['content'][0]['owner']['birth']['country']['isoCode']
'GB-ENG'
pps
NameClubPositionDate of BirthHeightWeightNationalityAppearancesGoalsWinsLossesGoals per matchHeaded goalsGoals with right footGoals with left footPenalties scoredFreekicks scoredShotsShots on targetShooting accuracy %Hit woodworkBig chances missedAssistsPassesPasses per matchBig Chances CreatedCrossesYellow cardsRed cardsFoulsOffsidesTacklesBlocked shotsInterceptionsClearancesHeaded ClearanceRole
0Harry KaneTottenham HotspurForward28/07/1993 (27)188cm86kgEngland237160133540.6826993424188139044%2694335,06321.364820227023116615621168175137Centre Striker
1Mohamed SalahLiverpoolForward15/06/1992 (28)175cm71kgEgypt14992100210.625117613052323545%1269324,23828.44492204077787312925189Left/Centre/Right Winger
2Patrick BamfordLeeds UnitedForward05/09/1993 (27)185cm71kgEngland561515330.273210101084542%419657110.25113048212821153323Centre Striker
3Dominic Calvert-LewinEvertonForward16/03/1997 (24)187cm71kgEngland1393854540.27161750025010642%54392,06414.859532001684954443511089Left/Centre Second Striker
4Son Heung-MinTottenham HotspurForward08/07/1992 (28)183cm78kgSouth Korea18866107440.35437250039917343%1637384,43423.59443344274107131105684520Left/Centre/Right Winger
5Jamie VardyLeicester CityForward11/01/1987 (34)179cm74kgEngland236115100820.4913742824051725149%1993363,23413.758269213195217138926311375Centre Striker
6Alexandre LacazetteArsenalForward28/05/1991 (29)175cm73kgFrance1224854390.3953675124011849%540182,42319.8616631401566710247405742Centre Striker
7Ollie WatkinsAston VillaForward30/12/1995 (25)180cm70kgEngland281012110.3635210773343%711361321.89618103324241971411Left/Centre/Right Second Striker
8Callum WilsonNewcastle UnitedForward27/02/1992 (29)180cm66kgEngland1475145690.3583299026210640%749171,97013.424881502061095054245939Centre Striker
9Pierre-Emerick AubameyangArsenalForward18/06/1989 (31)187cm80kgGabon1096350350.5845099126712246%1045132,34621.52181825141666764305016Centre Striker
10Riyad MahrezManchester CityForward21/02/1991 (30)179cm67kgAlgeria22266115610.3410518347420343%1231446,87130.958781170128512301381465422Left/Right Winger
11Marcus RashfordManchester UnitedForward31/10/1997 (23)180cm70kgEngland1715392310.3144276136316144%1049283,91922.9230285161110768689437840Left/Centre Second Striker
12Raheem SterlingManchester CityForward08/12/1994 (26)170cm69kgEngland28495177580.33861261159724541%2284499,15532.24675323013221032751721385622Forward
13Wilfried ZahaCrystal PalaceForwardNaNNaNNaNCôTE D’IVOIRE23645771010.1913682037513135%936265,02621.33956137130589307119128338Left/Centre/Right Winger
14Danny IngsSouthamptonForward23/07/1992 (28)178cm73kgEngland1335140550.38729147030412441%1130122,29017.22171101201035712774406237Centre Striker
15Neal MaupayBrighton and Hove AlbionForward14/08/1996 (24)173cm69kgFrance641816250.28396401566240%11851,18418.5821406127334912135Centre Striker
16Che AdamsSouthamptonForward13/07/1996 (24)175cm70kgScotland581121260.19010100813341%220674812.9131610402826246209Centre Striker
17Gabriel JesusManchester CityForward03/04/1997 (23)175cm73kgBrazil1234895140.39924152026213150%863202,28718.59181613098706652361913Left/Centre/Right Second Striker
18Sadio ManéLiverpoolForward10/04/1992 (28)175cm69kgSenegal22091131420.411155250051021542%1577336,82031503222432521372601051076033Left/Right Winger
19Tammy AbrahamChelseaForward02/10/1997 (23)190cm80kgEngland872634330.33201001636842%329591910.567322055392331125343Centre Striker
# Filling missing data for Zaha manually
pps.loc[13, 'Height'] = '180cm' # Height
pps.loc[13, 'Weight'] = '66kg' # Weight
pps.loc[13, 'Date of Birth'] = '10/11/1992 (28)' # DOB
pps.loc[13, 'Nationality'] = 'Ivory Coast' # Nation

Height and Weight contain cm and kg
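
The cleaning code that follows relies on the regular expression `(\d+)`, which captures the first run of digits in a string. As a quick sketch of what that pattern does outside pandas (`first_number` is a hypothetical helper), the same idea can be tried on single values with the re module:

```python
import re


def first_number(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else None


print(first_number("188cm"))                   # 188
print(first_number("86kg"))                    # 86
print(first_number("44%"))                     # 44
print(first_number("5,063".replace(",", "")))  # 5063
```

The thousands separator has to be removed first, otherwise the pattern would stop at the comma and return 5.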

# Removing strings from integers (raw strings so that \d is passed to the regex engine as-is)
pps['Height'] = pps['Height'].str.extract(r'(\d+)').astype(int)
pps['Weight'] = pps['Weight'].str.extract(r'(\d+)').astype(int)
pps['Passes'] = pps['Passes'].str.replace(',', '').astype(int)
pps['Shooting accuracy %'] = pps['Shooting accuracy %'].str.replace('%', '').astype(int)
pps['Goals'] = pps['Goals'].astype(int)

# Add age
pps['Age'] = pps['Date of Birth'].apply(lambda st: st[st.find("(")+1:st.find(")")])

# Strip age from DOB
#pps['Date of Birth'] = pps['Date of Birth'].str[:10]
pps.drop(['Date of Birth'], axis=1, inplace=True)

# Move columns
col = pps.pop('Name')
pps.insert(0, col.name, col)
col = pps.pop('Role')
pps.insert(3, col.name, col)
col = pps.pop('Nationality')
pps.insert(4, col.name, col)
col = pps.pop('Age')
pps.insert(5, col.name, col)

# Show
pps
NameClubPositionRoleNationalityAgeHeightWeightAppearancesGoalsWinsLossesGoals per matchHeaded goalsGoals with right footGoals with left footPenalties scoredFreekicks scoredShotsShots on targetShooting accuracy %Hit woodworkBig chances missedAssistsPassesPasses per matchBig Chances CreatedCrossesYellow cardsRed cardsFoulsOffsidesTacklesBlocked shotsInterceptionsClearancesHeaded Clearance
0Harry KaneTottenham HotspurForwardCentre StrikerEngland2718886237160133540.6826993424188139044269433506321.364820227023116615621168175137
1Mohamed SalahLiverpoolForwardLeft/Centre/Right WingerEgypt281757114992100210.625117613052323545126932423828.44492204077787312925189
2Patrick BamfordLeeds UnitedForwardCentre StrikerEngland2718571561515330.273210101084542419657110.25113048212821153323
3Dominic Calvert-LewinEvertonForwardLeft/Centre Second StrikerEngland24187711393854540.271617500250106425439206414.859532001684954443511089
4Son Heung-MinTottenham HotspurForwardLeft/Centre/Right WingerSouth Korea281837818866107440.35437250039917343163738443423.59443344274107131105684520
5Jamie VardyLeicester CityForwardCentre StrikerEngland3417974236115100820.4913742824051725149199336323413.758269213195217138926311375
6Alexandre LacazetteArsenalForwardCentre StrikerFrance29175731224854390.395367512401184954018242319.8616631401566710247405742
7Ollie WatkinsAston VillaForwardLeft/Centre/Right Second StrikerEngland2518070281012110.3635210773343711361321.89618103324241971411
8Callum WilsonNewcastle UnitedForwardCentre StrikerEngland29180661475145690.358329902621064074917197013.424881502061095054245939
9Pierre-Emerick AubameyangArsenalForwardCentre StrikerGabon31187801096350350.5845099126712246104513234621.52181825141666764305016
10Riyad MahrezManchester CityForwardLeft/Right WingerAlgeria301796722266115610.3410518347420343123144687130.958781170128512301381465422
11Marcus RashfordManchester UnitedForwardLeft/Centre Second StrikerEngland23180701715392310.3144276136316144104928391922.9230285161110768689437840
12Raheem SterlingManchester CityForwardForwardEngland261706928495177580.33861261159724541228449915532.24675323013221032751721385622
13Wilfried ZahaCrystal PalaceForwardLeft/Centre/Right WingerIvory Coast281806623645771010.191368203751313593626502621.33956137130589307119128338
14Danny IngsSouthamptonForwardCentre StrikerEngland28178731335140550.38729147030412441113012229017.22171101201035712774406237
15Neal MaupayBrighton and Hove AlbionForwardCentre StrikerFrance2417369641816250.283964015662401185118418.5821406127334912135
16Che AdamsSouthamptonForwardCentre StrikerScotland2417570581121260.19010100813341220674812.9131610402826246209
17Gabriel JesusManchester CityForwardLeft/Centre/Right Second StrikerBrazil23175731234895140.3992415202621315086320228718.59181613098706652361913
18Sadio ManéLiverpoolForwardLeft/Right WingerSenegal281756922091131420.411155250051021542157733682031503222432521372601051076033
19Tammy AbrahamChelseaForwardCentre StrikerEngland2319080872634330.33201001636842329591910.567322055392331125343
pps.dtypes
Name                     object
Club                     object
Position                 object
Role                     object
Nationality              object
Age                      object
Height                    int32
Weight                    int32
Appearances              object
Goals                     int32
Wins                     object
Losses                   object
Goals per match          object
Headed goals             object
Goals with right foot    object
Goals with left foot     object
Penalties scored         object
Freekicks scored         object
Shots                    object
Shots on target          object
Shooting accuracy %       int32
Hit woodwork             object
Big chances missed       object
Assists                  object
Passes                    int32
Passes per match         object
Big Chances Created      object
Crosses                  object
Yellow cards             object
Red cards                object
Fouls                    object
Offsides                 object
Tackles                  object
Blocked shots            object
Interceptions            object
Clearances               object
Headed Clearance         object
dtype: object

Fix data types

# Int Columns
int_cols = ['Age', 'Height', 'Weight', 'Appearances', 'Goals', 'Wins', 'Losses', 'Headed goals', 
            'Goals with right foot', 'Goals with left foot', 'Penalties scored', 
            'Freekicks scored', 'Shots', 'Shots on target', 'Shooting accuracy %', 
            'Hit woodwork', 'Big chances missed', 'Assists', 'Passes', 'Big Chances Created', 
            'Crosses', 'Yellow cards', 'Red cards', 'Fouls', 'Offsides', 'Tackles',
            'Blocked shots', 'Interceptions', 'Clearances', 'Headed Clearance']
pps[int_cols] = pps[int_cols].astype(int)

# Float Columns
pps[['Goals per match', 'Passes per match']] = pps[['Goals per match', 'Passes per match']].astype(float).round(2)

Dates

# If we needed to convert the dates to datetime objects then we could use this code
# (the format '%d %B %Y' matches strings such as '2 October 1997'):
# date_time_str = *data_here*
# date_time_obj = datetime.strptime(date_time_str, '%d %B %Y')
# print('Date:', date_time_obj.date())
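
The scraped Date of Birth strings use a different layout, for example '28/07/1993 (27)', so they would need the format '%d/%m/%Y' once the age in brackets has been removed. A minimal sketch, with `parse_dob` as a hypothetical helper:

```python
from datetime import datetime


def parse_dob(text):
    """Turn a string such as '28/07/1993 (27)' into a date, dropping the age."""
    date_part = text.split("(")[0].strip()
    return datetime.strptime(date_part, "%d/%m/%Y").date()


print(parse_dob("28/07/1993 (27)"))  # 1993-07-28
```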

# All-time top strikers in the pl sorted by goals scored by current players in the premier league
top_scorer = pps[pps['Role'].str.contains("Striker")].sort_values(by=['Goals'], ascending=False).copy()

# Add rank
top_scorer["Rank"] = np.arange(len(top_scorer))+1
col = top_scorer.pop('Rank')
top_scorer.insert(0, col.name, col)

# Hide index (Styler.hide_index() was removed in pandas 2.0; hide(axis="index") works on pandas 1.4+)
top_scorer_ndx = top_scorer.style.hide(axis="index")

# Name index
top_scorer_n = top_scorer.set_index('Name')

# Show df
top_scorer #_ndx
RankNameClubPositionRoleNationalityAgeHeightWeightAppearancesGoalsWinsLossesGoals per matchHeaded goalsGoals with right footGoals with left footPenalties scoredFreekicks scoredShotsShots on targetShooting accuracy %Hit woodworkBig chances missedAssistsPassesPasses per matchBig Chances CreatedCrossesYellow cardsRed cardsFoulsOffsidesTacklesBlocked shotsInterceptionsClearancesHeaded Clearance
01Harry KaneTottenham HotspurForwardCentre StrikerEngland2718886237160133540.6826993424188139044269433506321.364820227023116615621168175137
52Jamie VardyLeicester CityForwardCentre StrikerEngland3417974236115100820.4913742824051725149199336323413.7058269213195217138926311375
93Pierre-Emerick AubameyangArsenalForwardCentre StrikerGabon31187801096350350.5845099126712246104513234621.52181825141666764305016
114Marcus RashfordManchester UnitedForwardLeft/Centre Second StrikerEngland23180701715392310.3144276136316144104928391922.9230285161110768689437840
85Callum WilsonNewcastle UnitedForwardCentre StrikerEngland29180661475145690.358329902621064074917197013.4024881502061095054245939
146Danny IngsSouthamptonForwardCentre StrikerEngland28178731335140550.38729147030412441113012229017.22171101201035712774406237
67Alexandre LacazetteArsenalForwardCentre StrikerFrance29175731224854390.395367512401184954018242319.8616631401566710247405742
178Gabriel JesusManchester CityForwardLeft/Centre/Right Second StrikerBrazil23175731234895140.3992415202621315086320228718.59181613098706652361913
39Dominic Calvert-LewinEvertonForwardLeft/Centre Second StrikerEngland24187711393854540.271617500250106425439206414.859532001684954443511089
1910Tammy AbrahamChelseaForwardCentre StrikerEngland2319080872634330.303201001636842329591910.567322055392331125343
1511Neal MaupayBrighton and Hove AlbionForwardCentre StrikerFrance2417369641816250.283964015662401185118418.50821406127334912135
212Patrick BamfordLeeds UnitedForwardCentre StrikerEngland2718571561515330.273210101084542419657110.205113048212821153323
1613Che AdamsSouthamptonForwardCentre StrikerScotland2417570581121260.19010100813341220674812.90131610402826246209
714Ollie WatkinsAston VillaForwardLeft/Centre/Right Second StrikerEngland2518070281012110.3635210773343711361321.89618103324241971411
# Top-scoring wide players among current Premier League players, sorted by all-time league goals
# (matches any role containing "Left", so left-sided second strikers are included too)
top_wscorer = pps[pps['Role'].str.contains("Left")].sort_values(by=['Goals'], ascending=False).copy()

# Add rank
top_wscorer["Rank"] = np.arange(len(top_wscorer))+1
col = top_wscorer.pop('Rank')
top_wscorer.insert(0, col.name, col)

# Move age column
col = top_wscorer.pop('Age')
top_wscorer.insert(5, col.name, col)

# Move role column
col = top_wscorer.pop('Role')
top_wscorer.insert(4, col.name, col)

# Hide index (Styler.hide_index() was removed in pandas 2.0; hide(axis="index") works on pandas 1.4+)
top_wscorer = top_wscorer.style.hide(axis="index")

# Show df
top_wscorer
RankNameClubPositionRoleAgeNationalityHeightWeightAppearancesGoalsWinsLossesGoals per matchHeaded goalsGoals with right footGoals with left footPenalties scoredFreekicks scoredShotsShots on targetShooting accuracy %Hit woodworkBig chances missedAssistsPassesPasses per matchBig Chances CreatedCrossesYellow cardsRed cardsFoulsOffsidesTacklesBlocked shotsInterceptionsClearancesHeaded Clearance
1Mohamed SalahLiverpoolForwardLeft/Centre/Right Winger28Egypt1757114992100210.6200005117613052323545126932423828.440000492204077787312925189
2Sadio ManéLiverpoolForwardLeft/Right Winger28Senegal1756922091131420.4100001155250051021542157733682031.000000503222432521372601051076033
3Son Heung-MinTottenham HotspurForwardLeft/Centre/Right Winger28South Korea1837818866107440.350000437250039917343163738443423.590000443344274107131105684520
4Riyad MahrezManchester CityForwardLeft/Right Winger30Algeria1796722266115610.300000410518347420343123144687130.9500008781170128512301381465422
5Marcus RashfordManchester UnitedForwardLeft/Centre Second Striker23England180701715392310.31000044276136316144104928391922.92000030285161110768689437840
6Gabriel JesusManchester CityForwardLeft/Centre/Right Second Striker23Brazil175731234895140.39000092415202621315086320228718.590000181613098706652361913
7Wilfried ZahaCrystal PalaceForwardLeft/Centre/Right Winger28Ivory Coast1806623645771010.1900001368203751313593626502621.3000003956137130589307119128338
8Dominic Calvert-LewinEvertonForwardLeft/Centre Second Striker24England187711393854540.2700001617500250106425439206414.8500009532001684954443511089
9Ollie WatkinsAston VillaForwardLeft/Centre/Right Second Striker25England18070281012110.36000035210773343711361321.890000618103324241971411
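
The striker and winger tables above repeat the same filter, sort and rank steps. One way to avoid the repetition is a small helper like the sketch below. The function name ranked_by_goals and the toy rows are assumptions made for illustration; only the column names Role and Goals come from the frames above.

```python
import pandas as pd


def ranked_by_goals(df, role_text):
    """Keep rows whose Role contains role_text, sort by Goals, put Rank first."""
    mask = df["Role"].str.contains(role_text, regex=False)
    out = df[mask].sort_values(by="Goals", ascending=False).copy()
    out.insert(0, "Rank", list(range(1, len(out) + 1)))
    return out


# Toy rows standing in for the scraped player frame.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Role": ["Centre Striker", "Left/Right Winger",
             "Left/Centre Second Striker", "Goalkeeper"],
    "Goals": [12, 30, 18, 0],
})

strikers = ranked_by_goals(toy, "Striker")
wide = ranked_by_goals(toy, "Left")
print(strikers)
print(wide)
```

As in the tables above, a player whose role mentions both "Left" and "Striker" appears in both results.
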
print("Average Age:", top_scorer['Age'].mean())
print("Average Height:", round(top_scorer['Height'].mean(), 2), "cm")
print("Average Weight:", round(top_scorer['Weight'].mean(), 2), "kg")
print("Average Goals:", round(top_scorer['Goals'].mean(), 2))
print("Average goals with right foot:", round(top_scorer['Goals with right foot'].mean(), 2))
Average Age: 26.5
Average Height: 180.86 cm
Average Weight: 73.29 kg
Average Goals: 50.5
Average goals with right foot: 32.07

Headed goals %

round(top_scorer_n['Headed goals'] / top_scorer_n['Goals'] * 100, 2).sort_values(ascending=False)
Name
Dominic Calvert-Lewin        42.11
Ollie Watkins                30.00
Patrick Bamford              20.00
Gabriel Jesus                18.75
Neal Maupay                  16.67
Harry Kane                   16.25
Callum Wilson                15.69
Danny Ings                   13.73
Tammy Abraham                11.54
Jamie Vardy                  11.30
Alexandre Lacazette          10.42
Marcus Rashford               7.55
Pierre-Emerick Aubameyang     6.35
Che Adams                     0.00
dtype: float64

Goals with right foot %

round(top_scorer_n['Goals with right foot'] / top_scorer_n['Goals'] * 100, 2).sort_values(ascending=False)
Name
Che Adams                    90.91
Pierre-Emerick Aubameyang    79.37
Marcus Rashford              79.25
Tammy Abraham                76.92
Alexandre Lacazette          75.00
Jamie Vardy                  64.35
Callum Wilson                62.75
Harry Kane                   61.88
Danny Ings                   56.86
Gabriel Jesus                50.00
Neal Maupay                  50.00
Ollie Watkins                50.00
Dominic Calvert-Lewin        44.74
Patrick Bamford              13.33
dtype: float64

Goals with left foot %

round(top_scorer_n['Goals with left foot'] / top_scorer_n['Goals'] * 100, 2).sort_values(ascending=False)
Name
Patrick Bamford              66.67
Neal Maupay                  33.33
Gabriel Jesus                31.25
Danny Ings                   27.45
Jamie Vardy                  24.35
Harry Kane                   21.25
Ollie Watkins                20.00
Callum Wilson                17.65
Alexandre Lacazette          14.58
Pierre-Emerick Aubameyang    14.29
Marcus Rashford              13.21
Dominic Calvert-Lewin        13.16
Che Adams                     9.09
Tammy Abraham                 3.85
dtype: float64

Goal Efficiency %

round(top_scorer_n['Goals'] / top_scorer_n['Shots'] * 100, 2).sort_values(ascending=False)
Name
Pierre-Emerick Aubameyang    23.60
Jamie Vardy                  22.24
Alexandre Lacazette          20.00
Callum Wilson                19.47
Gabriel Jesus                18.32
Harry Kane                   18.16
Danny Ings                   16.78
Tammy Abraham                15.95
Dominic Calvert-Lewin        15.20
Marcus Rashford              14.60
Patrick Bamford              13.89
Che Adams                    13.58
Ollie Watkins                12.99
Neal Maupay                  11.54
dtype: float64
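
The four percentage calculations above share one pattern: divide one column by another, scale to 100, round and sort. A hedged sketch of a reusable helper follows. The name pct_of and the toy numbers are hypothetical; the column names mirror those used above.

```python
import pandas as pd


def pct_of(df, part, whole="Goals"):
    """Return `part` as a percentage of `whole`, rounded to 2 dp, highest first."""
    return (df[part] / df[whole] * 100).round(2).sort_values(ascending=False)


# Toy numbers, indexed by name like top_scorer_n.
toy = pd.DataFrame(
    {"Goals": [40, 20, 10], "Headed goals": [10, 1, 4], "Shots": [200, 80, 100]},
    index=pd.Index(["Player A", "Player B", "Player C"], name="Name"),
)

headed_pct = pct_of(toy, "Headed goals")
efficiency = pct_of(toy, "Goals", whole="Shots")
print(headed_pct)
print(efficiency)
```
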

# Work on a copy so the percentage columns are not added to top_scorer itself.
# No sort_values() is needed here: column assignment aligns on the index.
top_scorer_dist = top_scorer.copy()
top_scorer_dist['Headed goals %'] = round(top_scorer['Headed goals'] / top_scorer['Goals'] * 100, 2)
top_scorer_dist['Goals with right foot %'] = round(top_scorer['Goals with right foot'] / top_scorer['Goals'] * 100, 2)
top_scorer_dist['Goals with left foot %'] = round(top_scorer['Goals with left foot'] / top_scorer['Goals'] * 100, 2)
top_scorer_dist['Goal efficiency %'] = round(top_scorer['Goals'] / top_scorer['Shots'] * 100, 2)
top_scorer_dist[['Headed goals %', 'Goals with right foot %', 'Goals with left foot %']].median()
Headed goals %             14.710
Goals with right foot %    62.315
Goals with left foot %     18.825
dtype: float64
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

top_scorer_dist[['Name', 'Rank', 'Headed goals %', 'Goals with right foot %', 'Goals with left foot %', 'Goal efficiency %']].style.apply(highlight_max)
    Name                       Rank  Headed goals %  Goals with right foot %  Goals with left foot %  Goal efficiency %
 0  Harry Kane                    1       16.250000                61.880000               21.250000          18.160000
 5  Jamie Vardy                   2       11.300000                64.350000               24.350000          22.240000
 9  Pierre-Emerick Aubameyang     3        6.350000                79.370000               14.290000          23.600000
11  Marcus Rashford               4        7.550000                79.250000               13.210000          14.600000
 8  Callum Wilson                 5       15.690000                62.750000               17.650000          19.470000
14  Danny Ings                    6       13.730000                56.860000               27.450000          16.780000
 6  Alexandre Lacazette           7       10.420000                75.000000               14.580000          20.000000
17  Gabriel Jesus                 8       18.750000                50.000000               31.250000          18.320000
 3  Dominic Calvert-Lewin         9       42.110000                44.740000               13.160000          15.200000
19  Tammy Abraham                10       11.540000                76.920000                3.850000          15.950000
15  Neal Maupay                  11       16.670000                50.000000               33.330000          11.540000
 2  Patrick Bamford              12       20.000000                13.330000               66.670000          13.890000
16  Che Adams                    13        0.000000                90.910000                9.090000          13.580000
 7  Ollie Watkins                14       30.000000                50.000000               20.000000          12.990000
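
# 'Name' is left out: corr() only applies to numeric columns, and pandas 2.0+ raises an error on text columns
top_scorer_dist[['Rank', 'Headed goals %', 'Goals with right foot %', 'Goals with left foot %', 'Goal efficiency %']].corr()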
                              Rank  Headed goals %  Goals with right foot %  Goals with left foot %  Goal efficiency %
Rank                      1.000000        0.278093                -0.307265                0.200387          -0.758831
Headed goals %            0.278093        1.000000                -0.673554                0.198372          -0.292960
Goals with right foot %  -0.307265       -0.673554                 1.000000               -0.850395           0.321372
Goals with left foot %    0.200387        0.198372                -0.850395                1.000000          -0.216355
Goal efficiency %        -0.758831       -0.292960                 0.321372               -0.216355           1.000000
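
Each cell in the matrix above is a Pearson correlation coefficient, the default method of DataFrame.corr(). The value of about -0.76 between Rank and goal efficiency suggests that, in this small sample, strikers with a better (lower) rank tend to convert a larger share of their shots. The sketch below computes one such coefficient by hand and compares it with pandas. The two short series are toy numbers, not the scraped values, and the helper name pearson is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data: five ranks and made-up goal-efficiency percentages.
rank = pd.Series([1, 2, 3, 4, 5], dtype=float)
efficiency = pd.Series([22.0, 23.5, 18.0, 19.0, 12.5])


def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


manual = pearson(rank, efficiency)
builtin = rank.corr(efficiency)  # Series.corr defaults to Pearson as well
print(round(manual, 4), round(builtin, 4))
```

With only 14 players in the real table, coefficients like these are best read as rough indications rather than firm relationships.
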
top_scorer_dist[['Name', 'Headed goals %', 'Goals with right foot %', 'Goals with left foot %']].set_index('Name').iloc[::-1].plot(kind='barh', figsize=(10,15))
plt.xlabel('Goal Distribution %', fontsize=13, fontweight='bold')
plt.ylabel('Name', fontsize=13, fontweight='bold')
plt.title('Goal Distribution', fontsize=17, fontweight='bold')
plt.show()
# Top scorer goal distribution compared (free kicks and penalties not included)
top_scorer_dist[['Name', 'Headed goals %', 'Goals with right foot %', 'Goals with left foot %']].set_index('Name').iloc[::-1]
                           Headed goals %  Goals with right foot %  Goals with left foot %
Name
Ollie Watkins                       30.00                    50.00                   20.00
Che Adams                            0.00                    90.91                    9.09
Patrick Bamford                     20.00                    13.33                   66.67
Neal Maupay                         16.67                    50.00                   33.33
Tammy Abraham                       11.54                    76.92                    3.85
Dominic Calvert-Lewin               42.11                    44.74                   13.16
Gabriel Jesus                       18.75                    50.00                   31.25
Alexandre Lacazette                 10.42                    75.00                   14.58
Danny Ings                          13.73                    56.86                   27.45
Callum Wilson                       15.69                    62.75                   17.65
Marcus Rashford                      7.55                    79.25                   13.21
Pierre-Emerick Aubameyang            6.35                    79.37                   14.29
Jamie Vardy                         11.30                    64.35                   24.35
Harry Kane                          16.25                    61.88                   21.25

Conclusion

And that’s it! We have just programmatically scraped data from the Premier League website using three different methods: first reading an HTML table with pd.read_html(), then parsing pages with Beautiful Soup, and finally querying a REST API. Hopefully the emojis added some style to our substance. Congratulations!

