Wednesday, October 15, 2025

Primary Sources Download

How I Automated Downloading Thousands of Historical Documents from Founders Online (And Created a 2.5 Million Word Problem)

It all started with a book.

I'd been reading Jill Lepore's We the People, and I was completely absorbed in the world of America's founding era. Lepore has this incredible way of weaving together primary sources—letters, speeches, diary entries—to tell a richer, more nuanced story than any secondhand account could provide. I found myself wanting to dig deeper, to read more of these original documents myself.

That's when I had an idea: what if I could gather all these primary sources and feed them into Google NotebookLM? For those who haven't tried it, NotebookLM is this amazing AI tool that lets you upload your own documents and then chat with them, ask questions, make connections. Imagine being able to ask, "What did Benjamin Franklin think about democracy?" and having the AI cite directly from his actual writings. That was my goal.

But first, I needed to get those writings. All of them. And that's where things got interesting.

The Challenge: Getting the Documents Out

Founders Online has all this amazing content, but there's no "download all" button for a specific author. Sure, you can browse, search, and read online, but I wanted local copies—something I could upload to NotebookLM and interact with in a completely new way. The solution? Their API and metadata files.

After some poking around, I discovered that Founders Online publishes their entire catalog as a JSON metadata file. Every single document, with details about authors, dates, and—most importantly—the permalink to access each one. This was my golden ticket.

Step One: Understanding the Metadata

The first thing I did was download the founders-online-metadata.json file. This thing is massive—we're talking about thousands and thousands of historical documents all catalogued in one place. Each entry includes:

  • The document title
  • The author(s)
  • The date
  • A permalink (the URL path to the document)
  • The project it belongs to (Franklin, Adams, Washington, etc.)

Opening up this file and seeing all that structured data was exciting. It meant I could filter for exactly what I wanted—say, everything written by Benjamin Franklin—and then figure out how to download each document programmatically.
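Getting that first feel for the catalog takes only a few lines of PowerShell. A minimal sketch, assuming the property names match the fields listed above ("authors", "project", and so on); inspect a few entries in your own copy before relying on them:

```powershell
# Load the full catalog into memory (property names such as 'project'
# are assumptions based on the fields described above; verify them
# against a sample entry first)
$metadata = Get-Content 'founders-online-metadata.json' -Raw | ConvertFrom-Json

# How many documents are catalogued?
$metadata.Count

# Which projects are represented, and how large is each?
$metadata | Group-Object project | Sort-Object Count -Descending |
    Select-Object Name, Count
```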

Step Two: Building the Script

I'm on Windows 11, so PowerShell was my tool of choice. The plan was straightforward:

  1. Parse the JSON metadata
  2. Filter for documents by my chosen author (starting with Benjamin Franklin)
  3. Extract the document identifier from each permalink
  4. Hit the API endpoint for each document
  5. Save the full text locally as individual JSON files

Sounds simple enough, right? Well, as with most coding projects, the devil was in the details.

The Identifier Puzzle

The first hurdle was figuring out how to construct the correct API URL. The permalinks in the metadata look something like /documents/Franklin/01-02-02-0056, but you can't just plug that directly into the API. I had to extract the identifier part (that 01-02-02-0056 bit) and build the proper API endpoint: https://founders.archives.gov/API/docdata/Franklin/01-02-02-0056.

I wrote a PowerShell script with some regex pattern matching to pull out those identifiers. After a few test runs and manual URL checks to make sure I was hitting the right endpoints, I had it working. The script would loop through each Benjamin Franklin document, grab the content from the API, and save it to my hard drive.
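The extraction step is small enough to show in full. A minimal sketch, using the permalink and endpoint formats described above:

```powershell
# A permalink from the metadata, e.g. /documents/Franklin/01-02-02-0056;
# capture the project slug and the identifier, then build the API URL
$permalink = '/documents/Franklin/01-02-02-0056'
if ($permalink -match '^/documents/([^/]+)/(.+)$') {
    $project = $Matches[1]   # 'Franklin'
    $id      = $Matches[2]   # '01-02-02-0056'
    $apiUrl  = "https://founders.archives.gov/API/docdata/$project/$id"
}
$apiUrl
```

Running this yields https://founders.archives.gov/API/docdata/Franklin/01-02-02-0056, the endpoint format shown above.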

Rate Limiting and Being a Good Citizen

Here's an important point: when you're hitting someone's API hundreds or thousands of times, you need to be respectful. I built in rate limiting—basically telling my script to wait a bit between each request so I wasn't hammering their server. Nobody likes a greedy script that acts like a denial-of-service attack.

I also made sure to organize everything into subdirectories. Franklin's documents went into a "Franklin" folder, which kept things tidy and made it easy to find what I was looking for later.
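Putting the pieces together, the loop looks roughly like this. This is a sketch rather than my exact script, and the author string 'Franklin, Benjamin' is an assumption about how his name is recorded in the metadata:

```powershell
# Load the catalog, filter for Franklin, then fetch one document at a
# time, pausing between requests and saving into a 'Franklin' subfolder
$metadata = Get-Content 'founders-online-metadata.json' -Raw | ConvertFrom-Json
$docs = $metadata | Where-Object { $_.authors -contains 'Franklin, Benjamin' }

$outDir = Join-Path $PWD 'Franklin'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

foreach ($doc in $docs) {
    if ($doc.permalink -match '^/documents/([^/]+)/(.+)$') {
        $url  = "https://founders.archives.gov/API/docdata/$($Matches[1])/$($Matches[2])"
        $dest = Join-Path $outDir "$($Matches[2]).json"
        try {
            Invoke-RestMethod -Uri $url | ConvertTo-Json -Depth 10 |
                Set-Content -Path $dest -Encoding UTF8
        } catch {
            Write-Warning "Failed: $url"
        }
        Start-Sleep -Milliseconds 500   # rate limiting: pause between requests
    }
}
```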

Step Three: Expanding to Other Authors

Once I had the script working for Benjamin Franklin, I thought, "Why stop here?" So I adapted it for Abigail Adams. This is where things got a little tricky.

The Name Variation Challenge

Abigail Adams appears in the metadata in different ways: sometimes as "Adams, Abigail," sometimes as "Adams, Abigail Smith" (using her maiden name), and occasionally just as "Smith, Abigail." If I only filtered for one version of her name, I'd miss a bunch of her documents.

The solution was to expand my filter logic to catch all variations:

# Load the catalog, then match any recorded form of her name
$metadata = Get-Content 'founders-online-metadata.json' -Raw | ConvertFrom-Json

$docs = $metadata | Where-Object {
    ($_.authors -contains "Adams, Abigail") -or
    ($_.authors -contains "Adams, Abigail Smith") -or
    ($_.authors -contains "Smith, Abigail")
}

This made sure I was capturing everything she wrote, regardless of how her name was recorded in the metadata.

The Next Problem: Combining Everything for NotebookLM

So now I had thousands of individual JSON files sitting in a folder—one for each document Franklin wrote. This was great for organization, but NotebookLM doesn't want thousands of separate files. It wants clean, readable documents it can process.

I needed to combine everything into a single markdown file.

This is where I turned to Perplexity AI. I asked it to write me a script that would:

  • Read all the JSON files
  • Extract the text content from each one
  • Combine them into a single, formatted markdown document

Perplexity delivered. The script worked beautifully, pulling together letters, speeches, essays, everything Franklin wrote into one comprehensive file. I hit run, waited a few minutes, and then opened the result.
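The combining step itself is short. Here's a minimal sketch of the same idea in PowerShell; the 'title' and 'text' property names are assumptions about the shape of the saved API responses, so check one of your files and adjust:

```powershell
# Read every saved JSON file, pull out a title and body, and emit each
# as a markdown section ('title' and 'text' are assumed property names)
$combined = foreach ($file in Get-ChildItem 'Franklin' -Filter '*.json') {
    $doc = Get-Content $file.FullName -Raw | ConvertFrom-Json
    "## $($doc.title)`n`n$($doc.text)`n"
}
$combined | Set-Content 'franklin-complete.md' -Encoding UTF8
```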

The 2.5 Million Word Problem

Here's where I learned an important lesson about the difference between "collecting data" and "usable data."

The combined Franklin markdown file clocked in at over 2.5 million words.

To put that in perspective:

  • War and Peace is about 587,000 words
  • The entire Harry Potter series is around 1 million words
  • My Franklin file was two and a half Harry Potter series

NotebookLM has limits. And 2.5 million words exceeds those limits by... a lot.

I stared at this massive file and had to laugh. I'd successfully automated the collection and compilation of one of history's most prolific writers, only to discover that I'd collected too much of a good thing. The file was simply too large for NotebookLM to process.

This wasn't a failure of the scripts—they worked perfectly. It was a reality check about the scale of these historical figures' output. Benjamin Franklin didn't just write a few famous letters. The man wrote constantly for decades. And I'd just tried to cram his entire literary output into a single file.

What I Learned Along the Way

This project taught me several valuable lessons:

1. APIs Are Your Friend: When a website has a public API, it's usually the best way to get bulk data. It's designed for this kind of thing, unlike web scraping, which can be fragile and sometimes unwelcome.

2. Metadata Is Gold: That JSON file with all the document metadata was the key to everything. It let me filter, sort, and target exactly what I needed without guessing.

3. Edge Cases Matter: Things like name variations, different permalink formats, and rate limiting aren't just details—they're the difference between a script that kind of works and one that reliably gets everything you need.

4. Start Small, Then Scale: I validated everything with Benjamin Franklin first—testing the identifier extraction, checking the API responses manually, making sure files were saving correctly. Once that worked, expanding to other authors was straightforward.

5. Success Can Create New Problems: Sometimes your automation works too well. Just because you can download and compile everything doesn't mean your downstream tools can handle everything. I should have checked NotebookLM's limits before creating a file larger than several novels combined.
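One habit that would have caught the problem earlier: count the words before uploading. A quick sketch (the 'franklin-complete.md' filename is a placeholder for whatever your combined file is called):

```powershell
# Rough word count of the combined file, worth comparing against the
# limits of whatever tool you plan to feed it to
$text  = Get-Content 'franklin-complete.md' -Raw
$words = ($text -split '\s+' | Where-Object { $_ }).Count
"$words words"
```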

Making It Modular

The beauty of this approach is that it's completely modular. Want to download everything George Washington wrote? Just change the author filter to "Washington, George" and adjust the project slug in the API path. The core logic stays the same.

Here's the basic pattern:

  1. Filter the metadata for your chosen author
  2. Extract the document identifiers
  3. Construct the API URLs
  4. Download and save

You can clone this process for any author in the Founders Online collection.
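That clone-and-tweak pattern can be folded into a single parameterized function. A sketch under the same assumptions as before (author-name strings and metadata property names may need adjusting for your copy of the catalog):

```powershell
# Download every document by a given author; pass all known name
# variations plus the project slug used in the API path
function Get-FounderDocs {
    param(
        [string[]]$AuthorNames,   # e.g. 'Washington, George'
        [string]  $Project        # e.g. 'Washington'
    )

    $metadata = Get-Content 'founders-online-metadata.json' -Raw | ConvertFrom-Json
    $docs = $metadata | Where-Object {
        $authors = $_.authors
        ($AuthorNames | Where-Object { $authors -contains $_ }).Count -gt 0
    }

    New-Item -ItemType Directory -Path $Project -Force | Out-Null
    foreach ($doc in $docs) {
        if ($doc.permalink -match '^/documents/[^/]+/(.+)$') {
            $id  = $Matches[1]
            $url = "https://founders.archives.gov/API/docdata/$Project/$id"
            Invoke-RestMethod -Uri $url | ConvertTo-Json -Depth 10 |
                Set-Content (Join-Path $Project "$id.json") -Encoding UTF8
            Start-Sleep -Milliseconds 500   # stay polite to the server
        }
    }
}

# Usage:
# Get-FounderDocs -AuthorNames 'Washington, George' -Project 'Washington'
```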

The Solution: Selective Compilation

So where does this leave me with my NotebookLM project? I have a few options:

  • Break it into chunks: Create multiple markdown files for different time periods or topics in Franklin's life
  • Be selective: Filter for specific types of documents (just letters to John Adams, for example)
  • Use a different tool: Find something designed to handle massive datasets
  • Try other authors: Maybe someone less prolific than Franklin (though good luck with that among the Founding Fathers)

The point is, the automation worked. The scripts did exactly what they were supposed to do. Now I just need to be smarter about what I'm automating and how much I'm trying to process at once.
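The chunking option is the easiest to sketch. This version splits on whole words, which will cut across document boundaries; splitting on the per-document headings would be cleaner but takes a bit more logic. The 400,000-word ceiling and the 'franklin-complete.md' filename are both placeholders to adjust:

```powershell
# Split one huge markdown file into fixed-size word chunks
$words = (Get-Content 'franklin-complete.md' -Raw) -split '\s+' |
    Where-Object { $_ }
$chunkSize = 400000
$part = 1
for ($i = 0; $i -lt $words.Count; $i += $chunkSize) {
    $end = [Math]::Min($i + $chunkSize - 1, $words.Count - 1)
    ($words[$i..$end] -join ' ') |
        Set-Content "franklin-part-$part.md" -Encoding UTF8
    $part++
}
```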

Final Thoughts

Projects like this remind me why I love working with technology. Historical archives like Founders Online do the hard work of digitizing and cataloging these documents. They provide the API and metadata to make them accessible. And then, with a bit of scripting knowledge and help from modern AI tools, anyone can build automated workflows to work with that data in creative ways.

The 2.5 million word markdown file sitting on my hard drive is both a success and a challenge. It's proof that the automation worked perfectly, but also a reminder that tools have limits and "more data" isn't always "better data."

Still, inspired by Jill Lepore's masterful use of primary sources in We the People, I'm not giving up on the NotebookLM idea. I'll just need to be more strategic about what I feed it. Maybe start with Franklin's correspondence with specific individuals, or documents from a particular decade. The beautiful thing is that I have all the data, perfectly organized. Now I just need to figure out the right way to use it.

And hey, at least now I can say I have Benjamin Franklin's entire body of work on my laptop. All 2.5 million words of it. How cool is that?

