Posts Tagged With: artificial-intelligence

Scraping Tabular Data from Web Pages using Python

In my continuing quest to make sense of the baseball pitching data that I’m studying, I realized that I needed the game performance of the pitchers on whom I have pitching data. This helps validate any theories about what the data from training tells us about ability.

Since we’re using the 2024-2025 Brevard College Rapsodo data, we want to add the 2025 season statistics to see if our analysis of their fall and early spring data matches with their performance.

BeautifulSoup

The package I’m using to parse the HTML is called BeautifulSoup. It was created by Leonard Richardson in 2004, and he has maintained and upgraded it (with help) for over 20 years. We’re not going to do anything complex here, but it is a powerful library.

We’re importing two objects from the library: BeautifulSoup to parse the HTML and SoupStrainer to allow us to choose which part of the page to parse. We also have to import pandas to put our table in memory and requests to get the document from the web.

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import pandas as pd
import requests

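url = "https://bctornados.com/sports/baseball/stats"
r = requests.get(url, timeout=30)
r.raise_for_status()  # stop early if the page did not load
html_doc = r.text
# Name the parser explicitly so bs4 doesn't have to guess (and warn)
soup = BeautifulSoup(html_doc, "html.parser")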

The Web Page

We’re very fortunate that the Brevard team’s web page is designed well. In particular, the section that contains the pitching data has the id “individual-overall-pitching”. This made finding and scraping the data far easier. It’s not unusual for web developers to leave ids off or for a CMS to fail to include them. Here’s the top of the relevant section on Brevard’s page:

<!-- Individual - Overall - Pitching -->
<section id="individual-overall-pitching">

This makes it very easy for us to acquire the tabular data and put it into a DataFrame.

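from io import StringIO

# Parse only the element whose id matches the pitching section
only_pitching = SoupStrainer(id='individual-overall-pitching')
pitching_section = BeautifulSoup(html_doc, "html.parser", parse_only=only_pitching)

# Grab the first table inside that section
pitching_table = pitching_section.find('table')

# Recent pandas versions expect a file-like object, not a raw HTML string
pitching_data_df = pd.read_html(StringIO(str(pitching_table)))[0]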
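
To see what SoupStrainer is doing in isolation, here’s a minimal sketch on a made-up HTML snippet. The batting section, the player names, and the numbers are placeholders I invented, not Brevard’s data. It builds the DataFrame by hand from the table cells, so it doesn’t depend on the extra HTML parser that pd.read_html needs.

```python
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd

# Made-up page with two sections; only the pitching one interests us
sample_html = """
<section id="individual-overall-batting">
  <table><tr><th>Player</th><th>AVG</th></tr>
         <tr><td>Hitter, Example 7</td><td>.300</td></tr></table>
</section>
<section id="individual-overall-pitching">
  <table><tr><th>Player</th><th>ERA</th></tr>
         <tr><td>Pitcher, Example 21</td><td>3.00</td></tr>
         <tr><td>Reliever, Sample 35</td><td>4.50</td></tr></table>
</section>
"""

# The strainer tells the parser to keep only the element with this id
only_pitching = SoupStrainer(id="individual-overall-pitching")
section = BeautifulSoup(sample_html, "html.parser", parse_only=only_pitching)

# Turn each table row into a list of cell texts (header row first)
table = section.find("table")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
sample_df = pd.DataFrame(rows[1:], columns=rows[0])
print(sample_df)
```

The batting table never makes it into the parsed document, which is the whole point of the strainer: find('table') can only land on the table we care about.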

Quick transform on the name column

This left some gobbledygook in the Player column, so I slipped in code to clean it up. If you’re scraping pitching data from somewhere else, you might not need this step, but here’s the code I used.

# Function to convert "Last, First Jersey#etc" to "First Last"
def transform_name(name):
    if pd.isna(name):
        return name
    parts = name.split(',')
    if len(parts) < 2:
        return name
    last = parts[0].strip()
    first_parts = parts[1].strip().split()
    if not first_parts:  # nothing after the comma, so leave the value alone
        return name
    first = first_parts[0]
    return f"{first} {last}"

# Apply transformation
pitching_data_df['Player'] = pitching_data_df['Player'].apply(transform_name)
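
Here’s a quick check of what that function does, run on a few made-up values. The names are placeholders, and I’m only guessing at the kinds of stray rows a stats table might contain (a totals row, an empty cell).

```python
import pandas as pd

# Convert "Last, First Jersey#etc" to "First Last"; leave anything else alone
def transform_name(name):
    if pd.isna(name):
        return name
    parts = name.split(',')
    if len(parts) < 2:
        return name
    last = parts[0].strip()
    first_parts = parts[1].strip().split()
    if not first_parts:
        return name
    return f"{first_parts[0]} {last}"

# Placeholder values: a name with a jersey number, a name without one,
# a row with no comma at all, and a missing value
sample_df = pd.DataFrame(
    {"Player": ["Pitcher, Example 21", "Reliever, Sample", "Totals", None]}
)
sample_df["Player"] = sample_df["Player"].apply(transform_name)
print(sample_df["Player"].tolist())
```

Rows without a comma and missing values pass through untouched, so the transform is safe to apply to the whole column.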

To Excel

Now, I wanted to know immediately how well this worked, so I dumped it into an Excel spreadsheet.

pitching_data_df.to_excel('brevard.xlsx', index=False)

Next Steps

Now that I have my practice data and my game data in DataFrames, I can start merging in the performance results when displaying practice data and drawing conclusions.
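
Here’s a minimal sketch of what that merge might look like. Everything in it is an assumption for illustration: the two small DataFrames stand in for my practice and game data, the players and numbers are invented, and ERA is just one stat column I picked as an example.

```python
import pandas as pd

# Stand-in for the practice data: one row per pitch (invented values)
practice_df = pd.DataFrame({
    "Player Name": ["Pat Example", "Pat Example", "Sam Sample"],
    "Speed": [80.0, 82.0, 75.0],
})

# Stand-in for the scraped game stats: one row per pitcher (invented values)
game_df = pd.DataFrame({
    "Player": ["Pat Example", "Sam Sample", "Lee Nopractice"],
    "ERA": [3.00, 4.50, 2.25],
})

# Summarize practice data to one row per player, then attach game stats
avg_speed = practice_df.groupby("Player Name", as_index=False)["Speed"].mean()
merged = avg_speed.merge(
    game_df,
    left_on="Player Name",
    right_on="Player",
    how="left",       # keep every player who has practice data
    indicator=True,   # adds a _merge column showing which rows matched
)
print(merged)
```

The how="left" choice keeps every pitcher I have practice data for, and the _merge column makes it easy to spot names that didn’t match between the two sources.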

Categories: Python

Automate PDF Creation from Data Visualizations with Python

When creating some really great visualizations, I wondered how I could create a bunch of them and not be overwhelmed by the process of exporting them to PDF files to share with others. So I explored a few options and found that img2pdf was most suitable: always lossless, small, and fast. It allows me to loop through my data, create charts, move them to individual PDFs, and then combine them into multi-page PDFs.

I’m using DataLab by DataCamp, where I’m learning Python, Data Science, and AI. So, some of my code may rely on that environment and YMMV. Installing and importing img2pdf was very straightforward for me.

!pip install img2pdf
import img2pdf

It turned out to be pretty simple to loop through my CSV data, using one of the columns to get the data by player, and then create each graph, save it as a PDF, and combine them into multi-page PDFs by player and by category.

I created a directory structure for the files, with a Fig Storage directory for all the individual PDFs and a directory for each team and year. This allows me to scale it to handle data at volume, letting me focus on analyzing that data instead of being bogged down in copying and pasting.
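
If you’d rather have the script create those folders, here’s a minimal sketch using pathlib. The ensure_output_dirs helper is a hypothetical name of mine, and the demo runs in a temporary directory so it doesn’t touch your real files.

```python
import tempfile
from pathlib import Path

def ensure_output_dirs(base_dir, season, team):
    """Create the chart storage folder and the season/team folder if missing."""
    base = Path(base_dir)
    fig_storage = base / "Fig Storage"
    team_dir = base / f"{season} {team}"
    for folder in (fig_storage, team_dir):
        # exist_ok=True makes this safe to call on every run
        folder.mkdir(parents=True, exist_ok=True)
    return fig_storage, team_dir

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    fig_storage, team_dir = ensure_output_dirs(tmp, "2025", "ICI")
    print(fig_storage.is_dir(), team_dir.name)
```

Calling something like ensure_output_dirs(".", season, team) before the chart loop would avoid a missing-folder error the first time a new team or season is processed.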

Within each iteration of the loop, the code creates an empty list, imagefiles, to hold that player’s filenames, so that those files can be combined into the player’s summary PDF once the charts have all been generated. Outside the loop, there is another list, byDateArrayFiles, that stores all of the filenames to be bundled together into a ‘Velocity by Date’ file.

Here’s a sample of the loop with only two charts created. I create 8 different ones for each player, but showing them all here would be excessive. This gives you the idea.

import matplotlib.pyplot as plt
import seaborn as sns

# pitchLogic is the pitch-data DataFrame, loaded from my CSV before this snippet
season = '2025'
team = 'ICI'
byDateArrayFiles = []
playerNames = pitchLogic['Player Name'].unique()
for player in playerNames:
    player_df = pitchLogic[pitchLogic['Player Name'] == player]
    imagefiles = []

    # Chart 1: velocity by date, drawn on the axes we just created
    bydatefig, bydateax = plt.subplots()
    sns.lineplot(x='Date', y='Speed', data=player_df, ax=bydateax).set(title='Velocity for ' + player)
    bydatefig.autofmt_xdate(rotation=75)
    filename = "Fig Storage/" + player + ' VelocityByDate.jpg'
    bydatefig.savefig(filename)
    byDateArrayFiles.append(filename)
    imagefiles.append(filename)

    # Chart 2: movement profile; .copy() so the relabeling doesn't touch a slice of player_df
    only100_player_df = player_df[abs(player_df['Slot Diff']) <= 100].copy()
    only100_player_df.loc[only100_player_df['Location'] != 'K', 'Location'] = 'BB'
    slotfig = sns.relplot(x='Horiz Mvmt', y='Vertical Mvmt', data=only100_player_df, kind='scatter', hue='Slot Diff', size='Location', style='Type').set(title='Slot Difference and Movement Profile for ' + player)
    filename = "2025 Samples/" + player + ' SlotMovement.jpg'
    slotfig.savefig(filename)
    imagefiles.append(filename)

    # Free the figures before moving on to the next player
    plt.close('all')

    # One multi-page PDF per player
    with open(season + " " + team + "/" + player + ".pdf", "wb") as pdf_file:
        pdf_file.write(img2pdf.convert(imagefiles))

# One PDF holding every player's velocity-by-date chart
with open(season + " " + team + "/Velocity By Date.pdf", "wb") as pdf_file:
    pdf_file.write(img2pdf.convert(byDateArrayFiles))

This all saved me loads of time and headaches generating the charts. It lets me quickly explore whether my visualizations are meaningful. It also makes modifying them or updating them very easy. I can put a season’s worth of pitches into the system and have a suite of charts for each player a minute later.

Categories: Python, Visualizations