Posts Tagged With: BeautifulSoup

Scraping Tabular Data from Web Pages using Python

In my continuing quest to make sense of the baseball pitching data that I’m studying, I realized that I needed the game performance of the pitchers on whom I have pitching data. This helps validate any theories about what the data from training tells us about ability.

Since we’re using the 2024-2025 Brevard College Rapsodo data, we want to add the 2025 season statistics to see if our analysis of their fall and early spring data matches with their performance.

BeautifulSoup

The package I’m using to parse the HTML is called BeautifulSoup and was created by Leonard Richardson in 2004 (and he’s maintained & upgraded it [with help] for over 20 years!) We’re not going to do anything complex, but it is powerful.

We’re importing two objects from the library – BeautifulSoup to get all the HTML and SoupStrainer to allow use to choose what to parse. We also have to import pandas to put our table in memory and requests to get the document from the web.

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import pandas as pd
import requests

url = "https://bctornados.com/sports/baseball/stats"
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc)

The Web Page

We’re very fortunate that the Brevard team’s web page is designed well. In particular, the section that contains the pitching data is tagged as “individual-overall-pitching”. This made finding and scraping the data far easier. It’s not odd to see web page developers leaving id’s off or for CMS systems to fail to include them. Here’s the top of the appropriate section on Brevard’s page:

<!-- Individual - Overall - Pitching -->
<section id="individual-overall-pitching">

This makes it very easy for us to acquire the tabular data and put it into a DataFrame.

only_pitching = SoupStrainer(id='individual-overall-pitching')

pitching_section = BeautifulSoup(html_doc, "html.parser", parse_only=only_pitching)

pitching_table = pitching_section.find('table')

pitching_data_df = pd.read_html(str(pitching_table))[0]

Quick transform on the name column

Since I realized this put some gobbledygook in the Player Name column, I did slip in code to modify that, but even if you’re scraping pitching data from somewhere, you might not need those steps, but here’s the code I used.

# Function to convert "Last, First Jersey#etc" to "First Last"
def transform_name(name):
if pd.isna(name):
return name
parts = name.split(',')
if len(parts) < 2:
return name
last = parts[0].strip()
first_parts = parts[1].strip().split()
first = first_parts[0]
return f"{first} {last}"

# Apply transformation
pitching_data_df['Player'] = pitching_data_df['Player'].apply(transform_name)

To Excel

Now, I wanted to know immediately how well this worked, so I dumped it into an Excel spreadsheet.

pitching_data_df.to_excel('brevard.xlsx', index=False)

Next Steps

Now that I have my practice data and my game data in DataFrames, I can start merging in the performance results when displaying practice data and drawing conclusions.

This series includes:

Categories: Python | Tags: , , , , , , , , , , , , , | 4 Comments

Blog at WordPress.com.