Posts Tagged With: ai

Scraping Tabular Data from Web Pages using Python

In my continuing quest to make sense of the baseball pitching data that I’m studying, I realized that I needed the game performance of the pitchers on whom I have pitching data. This helps validate any theories about what the data from training tells us about ability.

Since we’re using the 2024-2025 Brevard College Rapsodo data, we want to add the 2025 season statistics to see if our analysis of their fall and early spring data matches with their performance.

BeautifulSoup

The package I’m using to parse the HTML is called BeautifulSoup. It was created by Leonard Richardson in 2004, and he has maintained and upgraded it (with help) for over 20 years! We’re not going to do anything complex here, but it is a powerful library.

We’re importing two objects from the library: BeautifulSoup to parse the HTML and SoupStrainer to allow us to choose which part to parse. We also have to import pandas to put our table in memory and requests to get the document from the web.

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import pandas as pd
import requests

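url = "https://bctornados.com/sports/baseball/stats"
r = requests.get(url)
r.raise_for_status()  # stop early if the page did not load
html_doc = r.text
# Name the parser explicitly so BeautifulSoup does not have to guess (and warn)
soup = BeautifulSoup(html_doc, "html.parser")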

The Web Page

We’re very fortunate that the Brevard team’s web page is designed well. In particular, the section that contains the pitching data has the id “individual-overall-pitching”, which made finding and scraping the data far easier. It’s not unusual for web developers to leave ids off or for a CMS to fail to include them. Here’s the top of the relevant section on Brevard’s page:

<!-- Individual - Overall - Pitching -->
<section id="individual-overall-pitching">

This makes it very easy for us to acquire the tabular data and put it into a DataFrame.

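from io import StringIO

# Parse only the section whose id matches, rather than the whole page
only_pitching = SoupStrainer(id='individual-overall-pitching')

pitching_section = BeautifulSoup(html_doc, "html.parser", parse_only=only_pitching)

# Grab the first table inside that section
pitching_table = pitching_section.find('table')

# Newer pandas versions expect a file-like object rather than a raw HTML string
pitching_data_df = pd.read_html(StringIO(str(pitching_table)))[0]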
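
If pd.read_html isn’t an option (it relies on an extra HTML parser such as lxml being installed), the same table can be pulled apart by hand. Here’s a minimal sketch of that idea, run against a tiny made-up HTML snippet. The players and numbers are invented for illustration; only the section id matches the real page.

```python
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd

# Made-up stand-in for the real page; only the pitching section id is taken from it.
sample_html = """
<section id="individual-overall-batting">
  <table><tr><th>Player</th><th>AVG</th></tr>
  <tr><td>Doe, Jane 7</td><td>.300</td></tr></table>
</section>
<section id="individual-overall-pitching">
  <table>
    <tr><th>Player</th><th>ERA</th><th>IP</th></tr>
    <tr><td>Smith, John 12</td><td>3.50</td><td>40.0</td></tr>
    <tr><td>Jones, Sam 34</td><td>4.25</td><td>31.2</td></tr>
  </table>
</section>
"""

# Keep only the pitching section while parsing
only_pitching = SoupStrainer(id="individual-overall-pitching")
section = BeautifulSoup(sample_html, "html.parser", parse_only=only_pitching)
table = section.find("table")

# Turn every table row into a list of cell texts; the first row is the header
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
sample_df = pd.DataFrame(rows[1:], columns=rows[0])
print(sample_df)
```

Every value comes back as a string this way, so numeric columns would still need converting (for example with pd.to_numeric) before any analysis.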

Quick transform on the name column

This put some gobbledygook in the player name column, so I slipped in code to clean it up. If you’re scraping pitching data from somewhere else, you might not need this step, but here’s the code I used.

# Function to convert "Last, First Jersey#etc" to "First Last"
def transform_name(name):
    if pd.isna(name):
        return name
    parts = name.split(',')
    if len(parts) < 2:
        return name
    last = parts[0].strip()
    first_parts = parts[1].strip().split()
    if not first_parts:  # nothing after the comma, so leave the value alone
        return name
    first = first_parts[0]
    return f"{first} {last}"

# Apply transformation
pitching_data_df['Player'] = pitching_data_df['Player'].apply(transform_name)
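
To see what the transform does, here is a quick check on a few invented values. A compact version of the function is included so the snippet runs on its own, with one extra guard for a value that has nothing after the comma. The sample names and the “Totals” row are assumptions about what the raw column might contain, not values from the Brevard page.

```python
import pandas as pd

def transform_name(name):
    """Convert "Last, First Jersey#etc" to "First Last"; leave anything else alone."""
    if pd.isna(name):
        return name
    parts = name.split(',')
    if len(parts) < 2:
        return name
    first_parts = parts[1].strip().split()
    if not first_parts:
        return name
    return f"{first_parts[0]} {parts[0].strip()}"

# Invented sample values: a normal row, a summary row, and a missing value
sample = pd.Series(["Smith, John 12", "Totals", None])
cleaned = sample.apply(transform_name)
print(cleaned.tolist())
```

Rows without a comma pass through unchanged, which is handy if the table includes a summary line at the bottom.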

To Excel

Now, I wanted to know immediately how well this worked, so I dumped it into an Excel spreadsheet.

pitching_data_df.to_excel('brevard.xlsx', index=False)

Next Steps

Now that I have my practice data and my game data in DataFrames, I can start merging in the performance results when displaying practice data and drawing conclusions.
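
As a rough sketch of that merge, assume the practice data has a 'Player Name' column and the scraped table has a 'Player' column holding names in the same “First Last” form. The rows below, and the 'Speed' and 'ERA' values, are invented placeholders rather than real data.

```python
import pandas as pd

# Invented placeholder rows standing in for the practice (Rapsodo) data
practice_df = pd.DataFrame({
    "Player Name": ["John Smith", "John Smith", "Sam Jones", "Al Brown"],
    "Speed": [84.1, 85.3, 79.8, 81.0],
})

# Invented placeholder rows standing in for the scraped season statistics
game_df = pd.DataFrame({
    "Player": ["John Smith", "Sam Jones"],
    "ERA": [3.50, 4.25],
})

# A left merge keeps every practice pitch; indicator=True adds a _merge column
# that flags rows with no matching game statistics.
merged = practice_df.merge(
    game_df, how="left", left_on="Player Name", right_on="Player", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Player Name"].unique()
print(merged)
print("No game stats for:", list(unmatched))
```

Checking the unmatched names first is a quick way to catch spelling differences between the two sources before drawing any conclusions.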

Categories: Python

Automate PDF Creation from Data Visualizations with Python

When creating some really great visualizations, I wondered how I could create a bunch of them and not be overwhelmed by the process of exporting them to PDF files to share with others. So I explored a few options and found that img2pdf was most suitable: always lossless, small, and fast. It allows me to loop through my data, creating charts, moving them into individual PDFs, and then combining them into multi-page PDFs.

I’m using DataLab by DataCamp, where I’m learning Python, Data Science, and AI. So, some of my code may rely on that environment and YMMV. Installing and importing img2pdf was very straightforward for me.

!pip install img2pdf
import img2pdf

It turned out to be pretty simple to loop through my CSV data, using one of the columns to get the data by player, and then create each graph, saving it as a PDF, then combining them into multi-page PDFs by player and by category.

I created a directory structure for the files, with a Fig Storage directory for all the individual PDFs and a directory for each team and year. This allows me to scale it to handle data at volume, letting me focus on analyzing that data instead of being bogged down in copying and pasting.
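
One way to set up that layout is with pathlib. This is only a sketch: the helper names are hypothetical, the player name is invented, and the folder names follow the ones used in the loop further down ('Fig Storage' plus a '<season> <team>' folder).

```python
from pathlib import Path
import tempfile

def make_output_dirs(base, season, team):
    """Create the shared figure folder and the per-season, per-team PDF folder."""
    fig_dir = Path(base) / "Fig Storage"
    team_dir = Path(base) / f"{season} {team}"
    for folder in (fig_dir, team_dir):
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return fig_dir, team_dir

def chart_path(fig_dir, player, chart_name):
    """Build the image path for one player's chart."""
    return fig_dir / f"{player} {chart_name}.jpg"

# Demo in a throwaway directory; in a real run, base would be the project folder
base = tempfile.mkdtemp()
fig_dir, team_dir = make_output_dirs(base, "2025", "ICI")
print(chart_path(fig_dir, "John Smith", "VelocityByDate"))
```

Creating the folders up front means the chart loop never fails halfway through because a directory is missing.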

On each pass through the loop, the code creates an empty list, imagefiles, to hold that player’s filenames, so that those files can be combined into the player’s summary PDF once the charts have all been generated. Outside the loop, there is another list, byDateArrayFiles, which collects all of the filenames to be bundled together into a ‘Velocity by Date’ file.

Here’s a sample of the loop with only two charts created. I have 8 different ones created for each player, but that would be excessive. This gives you the idea.

import matplotlib.pyplot as plt
import seaborn as sns

# pitchLogic is the DataFrame of pitch data loaded from the CSV
season = '2025'
team = 'ICI'
byDateArrayFiles = []
playerNames = pitchLogic['Player Name'].unique()
for player in playerNames:
    player_df = pitchLogic[pitchLogic['Player Name'] == player]
    imagefiles = []

    # Chart 1: velocity over time
    bydatefig, bydateax = plt.subplots()
    sns.lineplot(x='Date', y='Speed', data=player_df, ax=bydateax)
    bydateax.set(title='Velocity for ' + player)
    bydatefig.autofmt_xdate(rotation=75)
    filename = "Fig Storage/" + player + ' VelocityByDate.jpg'
    bydatefig.savefig(filename)
    plt.close(bydatefig)  # free the figure before moving to the next player
    byDateArrayFiles.append(filename)
    imagefiles.append(filename)

    # Chart 2: movement profile, limited to rows where |Slot Diff| <= 100
    # .copy() so the Location edit below does not act on a view of player_df
    only100_player_df = player_df[abs(player_df['Slot Diff'])<=100].copy()
    only100_player_df.loc[only100_player_df['Location'] != 'K', 'Location'] = 'BB'
    slotfig = sns.relplot(x='Horiz Mvmt', y='Vertical Mvmt', data=only100_player_df, kind='scatter', hue='Slot Diff', size='Location', style='Type').set(title='Slot Difference and Movement Profile for ' + player)
    filename = "2025 Samples/" + player + ' SlotMovement.jpg'
    slotfig.savefig(filename)
    imagefiles.append(filename)

    # One multi-page PDF per player
    with open(season + " " + team + "/" + player + ".pdf", "wb") as pdf_file:
        pdf_file.write(img2pdf.convert(imagefiles))

# After the loop: one PDF holding every player's velocity chart
with open(season + " " + team + "/Velocity By Date.pdf", "wb") as pdf_file:
    pdf_file.write(img2pdf.convert(byDateArrayFiles))

This all saved me loads of time and headaches generating the charts. It lets me quickly explore whether my visualizations are meaningful. It also makes modifying them or updating them very easy. I can put a season’s worth of pitches into the system and have a suite of charts for each player a minute later.

Categories: Python, Visualizations

Using ChatGPT to Write Code in LotusScript, TypeScript and C#

So, I was reading about an open position that I might be interested in and followed some links that led to an article on AI in programming. I wondered just how good it would be, so I went ahead and asked ChatGPT to write an agent for me.

write an agent for use in HCL Notes that sends messages to users at intervals determined by a configuration document that identifies which documents to notify about using a querystring and whom to notify by a list on the configuration document. The configuration document should specify whether to send individual messages for each document or a list of documents.

I thought it couldn’t do half the job that I did. I figured I’d need to go through a bunch of refining iterations, explaining that I wanted a function to do this and another function to do that. Nope.

Sub Initialize
    Dim session As New NotesSession
    Dim db As NotesDatabase
    Set db = session.CurrentDatabase

    ' Retrieve configuration document
    Dim configDoc As NotesDocument
    Set configDoc = GetConfigDocument(db)

    If configDoc Is Nothing Then
        MsgBox "Configuration document not found!", 48, "Error"
        Exit Sub
    End If

    ' Read configuration values
    Dim queryString As String
    queryString = configDoc.QueryString(0)

    Dim usersToNotify As Variant
    usersToNotify = configDoc.UsersToNotify

    Dim notificationMethod As String
    notificationMethod = configDoc.NotificationMethod(0)

    Dim interval As Integer
    interval = configDoc.IntervalInHours(0)

    ' Find the documents to notify about based on query string
    Dim view As NotesView
    Set view = db.GetView("YourDocumentsView") ' Adjust view as needed
    Dim doc As NotesDocument
    Set doc = view.GetFirstDocument

    Dim docsToNotify As NotesDocumentCollection
    Set docsToNotify = New NotesDocumentCollection

    Do While Not (doc Is Nothing)
        If Evaluate(queryString, doc) Then
            docsToNotify.Add doc
        End If
        Set doc = view.GetNextDocument(doc)
    Loop

    ' Send notifications based on configuration
    Dim user As String
    ForAll user In usersToNotify
        If notificationMethod = "Individual" Then
            Call SendIndividualMessages(user, docsToNotify)
        ElseIf notificationMethod = "List" Then
            Call SendListMessage(user, docsToNotify)
        End If
    End ForAll

    ' Wait for the interval before running again (example of a timed agent)
    Call ScheduleNextRun(interval)

End Sub

' Function to retrieve configuration document
Function GetConfigDocument(db As NotesDatabase) As NotesDocument
    Dim configView As NotesView
    Set configView = db.GetView("ConfigView") ' Adjust view as needed
    Set GetConfigDocument = configView.GetFirstDocument
End Function

' Function to send individual messages
Sub SendIndividualMessages(user As String, docs As NotesDocumentCollection)
    Dim doc As NotesDocument
    Set doc = docs.GetFirstDocument

    While Not (doc Is Nothing)
        Dim mailDoc As NotesDocument
        Set mailDoc = db.CreateDocument
        mailDoc.Form = "Memo"
        mailDoc.Subject = "Document Notification"
        mailDoc.SendTo = user
        mailDoc.Body = "You have a document to review: " & doc.GetItemValue("Subject")(0)

        ' Send the mail
        Call mailDoc.Send(False)

        ' Move to next document
        Set doc = docs.GetNextDocument(doc)
    Wend
End Sub

' Function to send list of documents in one message
Sub SendListMessage(user As String, docs As NotesDocumentCollection)
    Dim doc As NotesDocument
    Set doc = docs.GetFirstDocument

    Dim bodyText As String
    bodyText = "You have the following documents to review:" & Chr(10)

    While Not (doc Is Nothing)
        bodyText = bodyText & doc.GetItemValue("Subject")(0) & Chr(10)
        Set doc = docs.GetNextDocument(doc)
    Wend

    Dim mailDoc As NotesDocument
    Set mailDoc = db.CreateDocument
    mailDoc.Form = "Memo"
    mailDoc.Subject = "Documents Notification"
    mailDoc.SendTo = user
    mailDoc.Body = bodyText

    ' Send the mail
    Call mailDoc.Send(False)
End Sub

' Function to schedule the next run of the agent
Sub ScheduleNextRun(interval As Integer)
    ' Schedule the agent to run again in 'interval' hours
    Dim agent As NotesAgent
    Set agent = session.CurrentDatabase.GetAgent("YourAgentName") ' Adjust agent name
    Call agent.Schedule(Interval:=interval)
End Sub

I’ve written one myself and I’m pretty proud of it. I talk about it often. Now, I talk about how my team created the same kind of functionality in MS Dynamics 365 and was working on a full stack version before we all got furloughed (USAID contract cuts). Nonetheless, I think this version for Notes, which was completed in the time it took to scroll through the message, would have been a great start.

Some of the positions I’m looking at use TypeScript in the tech stack, and I know TypeScript is still a JavaScript-based language. So, I decided to ask it to do the same in TypeScript. Notes is integrated with email, so some of the extra coding in the TypeScript version wasn’t necessary in LotusScript.

import nodemailer from 'nodemailer';
import axios from 'axios';
import { CronJob } from 'cron';

// Interface for Configuration Document
interface ConfigDocument {
  queryString: string;
  usersToNotify: string[];
  notificationMethod: 'Individual' | 'List';
  intervalInHours: number;
}

// Function to fetch configuration document from an API or database
async function getConfigDocument(): Promise<ConfigDocument> {
  try {
    const response = await axios.get('http://your-api-url/config');
    return response.data;
  } catch (error) {
    console.error('Error fetching configuration document:', error);
    throw new Error('Failed to fetch config document');
  }
}

// Function to send email
async function sendEmail(to: string, subject: string, body: string): Promise<void> {
  // Create a transporter using your email service
  let transporter = nodemailer.createTransport({
    service: 'gmail', // Example service
    auth: {
      user: 'your-email@gmail.com', // Replace with your email
      pass: 'your-email-password', // Replace with your email password
    },
  });

  // Set up email data
  let mailOptions = {
    from: 'your-email@gmail.com',
    to: to,
    subject: subject,
    text: body,
  };

  try {
    await transporter.sendMail(mailOptions);
    console.log('Email sent successfully to', to);
  } catch (error) {
    console.error('Error sending email:', error);
  }
}

// Function to get documents from the Notes database based on query string
async function fetchDocuments(query: string): Promise<any[]> {
  try {
    // Assuming there is an API endpoint to fetch documents based on a query
    const response = await axios.get(`http://your-api-url/documents?query=${encodeURIComponent(query)}`);
    return response.data;
  } catch (error) {
    console.error('Error fetching documents:', error);
    throw new Error('Failed to fetch documents');
  }
}

// Function to send individual messages for each document
async function sendIndividualMessages(users: string[], documents: any[]): Promise<void> {
  for (const user of users) {
    for (const doc of documents) {
      const subject = `Document Notification: ${doc.title}`;
      const body = `Hello, you have a document to review: ${doc.title}\n\nDetails: ${doc.details}`;
      await sendEmail(user, subject, body);
    }
  }
}

// Function to send a list of documents in one email
async function sendListMessage(users: string[], documents: any[]): Promise<void> {
  const documentList = documents.map((doc) => `- ${doc.title}`).join('\n');
  const body = `Hello, you have the following documents to review:\n\n${documentList}`;

  for (const user of users) {
    const subject = 'Documents Notification';
    await sendEmail(user, subject, body);
  }
}

// Function to run the agent logic
async function runAgent() {
  try {
    const config = await getConfigDocument();

    // Fetch documents based on the query string from the configuration
    const documents = await fetchDocuments(config.queryString);

    // Send notifications based on the method defined in the config
    if (config.notificationMethod === 'Individual') {
      await sendIndividualMessages(config.usersToNotify, documents);
    } else {
      await sendListMessage(config.usersToNotify, documents);
    }
  } catch (error) {
    console.error('Error during agent run:', error);
  }
}

// Scheduling the agent to run at a defined interval (in hours)
async function scheduleAgent() {
  const config = await getConfigDocument();

  // Using Cron to schedule the agent run based on the interval (in hours)
  const cronExpression = `0 0 */${config.intervalInHours} * * *`; // Run at interval in hours
  new CronJob(cronExpression, runAgent, null, true, 'America/New_York'); // Adjust timezone as needed

  console.log(`Agent scheduled to run every ${config.intervalInHours} hours`);
}

// Start the scheduling
scheduleAgent().catch((error) => console.error('Error scheduling agent:', error));
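
A detail worth checking in a generated schedule like this one: in cron syntax, a step such as */N in the hour field is matched against the hours of each day, so the gap between runs is only even when N divides 24. Here’s a small Python sketch (an illustration added alongside the generated code, not part of ChatGPT’s output; the function names are made up) that lists which hours such an expression would fire at and the gaps between runs.

```python
def cron_step_hours(step):
    """Hours of the day matched by '*/step' in a cron hour field."""
    return list(range(0, 24, step))

def gaps_between_runs(step):
    """Hours between consecutive runs, including the wrap past midnight."""
    hours = cron_step_hours(step)
    following = hours[1:] + [hours[0] + 24]
    return [later - earlier for earlier, later in zip(hours, following)]

# Compare an interval that divides 24 evenly with one that does not
for step in (6, 5):
    print(step, cron_step_hours(step), gaps_between_runs(step))
```

With an interval of 6 every gap is six hours, but with 5 the last run of the day is followed by one only four hours later, so an arbitrary intervalInHours would need a different scheduling approach.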

Why not go further afield and have it do the full stack concept version for me? This one is in C#.

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Mail;
using System.Net.Http;
using System.Threading.Tasks;
using Quartz;
using Quartz.Impl;

class Program
{
    // Configuration Document class
    public class ConfigDocument
    {
        public string QueryString { get; set; }
        public List<string> UsersToNotify { get; set; }
        public string NotificationMethod { get; set; } // "Individual" or "List"
        public int IntervalInHours { get; set; }
    }

    // Document class that represents a document in the Notes database
    public class Document
    {
        public string Title { get; set; }
        public string Details { get; set; }
    }

    // Function to fetch configuration from an API or database
    static async Task<ConfigDocument> GetConfigDocumentAsync()
    {
        // Replace with actual logic to fetch the configuration
        // For demonstration, we're using hardcoded values
        return new ConfigDocument
        {
            QueryString = "SELECT * FROM Documents WHERE status = 'pending'",
            UsersToNotify = new List<string> { "user1@example.com", "user2@example.com" },
            NotificationMethod = "Individual",
            IntervalInHours = 1
        };
    }

    // Function to fetch documents based on the query string
    static async Task<List<Document>> FetchDocumentsAsync(string query)
    {
        // Replace with actual logic to query the HCL Notes database or API
        // Simulating fetching documents
        return new List<Document>
        {
            new Document { Title = "Document 1", Details = "Details of Document 1" },
            new Document { Title = "Document 2", Details = "Details of Document 2" }
        };
    }

    // Function to send an email
    static async Task SendEmailAsync(string to, string subject, string body)
    {
        var smtpClient = new SmtpClient("smtp.gmail.com")
        {
            Port = 587,
            Credentials = new NetworkCredential("your-email@gmail.com", "your-email-password"),
            EnableSsl = true,
        };

        var message = new MailMessage
        {
            From = new MailAddress("your-email@gmail.com"),
            Subject = subject,
            Body = body,
            IsBodyHtml = false,
        };

        message.To.Add(to);

        await smtpClient.SendMailAsync(message);
    }

    // Function to send individual messages
    static async Task SendIndividualMessagesAsync(List<string> users, List<Document> documents)
    {
        foreach (var user in users)
        {
            foreach (var doc in documents)
            {
                var subject = $"Document Notification: {doc.Title}";
                var body = $"Hello, you have a document to review:\n\n{doc.Title}\n\nDetails: {doc.Details}";

                await SendEmailAsync(user, subject, body);
            }
        }
    }

    // Function to send a list of documents in a single email
    static async Task SendListMessageAsync(List<string> users, List<Document> documents)
    {
        var documentList = string.Join("\n", documents.ConvertAll(doc => $"- {doc.Title}"));
        var body = $"Hello, you have the following documents to review:\n\n{documentList}";

        foreach (var user in users)
        {
            var subject = "Documents Notification";
            await SendEmailAsync(user, subject, body);
        }
    }

    // Main Agent Logic
    static async Task RunAgentAsync()
    {
        try
        {
            var config = await GetConfigDocumentAsync();
            var documents = await FetchDocumentsAsync(config.QueryString);

            // Send notifications based on the method specified in the config
            if (config.NotificationMethod == "Individual")
            {
                await SendIndividualMessagesAsync(config.UsersToNotify, documents);
            }
            else if (config.NotificationMethod == "List")
            {
                await SendListMessageAsync(config.UsersToNotify, documents);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error during agent run: {ex.Message}");
        }
    }

    // Scheduling the agent to run at the specified interval (in hours)
    static async Task ScheduleAgentAsync()
    {
        var config = await GetConfigDocumentAsync();

        var scheduler = await StdSchedulerFactory.GetDefaultScheduler();
        await scheduler.Start();

        // Schedule the job with Quartz.NET to run at the specified interval
        var job = JobBuilder.Create<Job>()
            .WithIdentity("DocumentNotificationJob")
            .Build();

        var trigger = TriggerBuilder.Create()
            .WithIdentity("DocumentNotificationTrigger")
            .StartNow()
            .WithSimpleSchedule(x => x.WithIntervalInHours(config.IntervalInHours).RepeatForever())
            .Build();

        await scheduler.ScheduleJob(job, trigger);

        Console.WriteLine($"Agent scheduled to run every {config.IntervalInHours} hours");
    }

    // Quartz job that triggers the agent run
    public class Job : IJob
    {
        public Task Execute(IJobExecutionContext context)
        {
            return RunAgentAsync();
        }
    }

    // Main Entry Point
    static async Task Main(string[] args)
    {
        await ScheduleAgentAsync();

        // Keep the application running
        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();
    }
}

Now, in each case, the code needs to be refined and enhanced. I imagine that could just be done with more specific descriptions of what’s needed, but dang, this AI stuff is powerful. I suspect that if I give ChatGPT some of my code and ask it to port it to another tech stack, it’s going to save me loads of time. I can focus on design patterns and on getting interfaces right, leaving the grunt work to the AI. These are indeed exciting times.

Categories: AI Coding, C#, Code Challenge, TypeScript
