I Needed to Pull A Ton of Files Out of Salesforce. Here’s What I Built.

salesforce binge

2 months ago

When the platform says “no” and the vendors say “$8/user/month,” sometimes you just build it yourself.

The Problem

You’ve got files in Salesforce. Contracts, proposals, signed PDFs, that one mystery screenshot someone attached to a Case in 2019. Hundreds of them. Maybe thousands.

And now you need them out. Migration project. Compliance audit. Org cleanup. Backup before a big deployment. Org-to-org migration. It doesn’t matter why; you need the actual files on your disk.

So you look for the “Export All Files” button.

There isn’t one. 🙃

Data Loader? Handles records like a champ. Binary files? Nope! Not its thing.

Data Export Wizard? The name writes checks the feature can’t cash. It exports records as CSVs, not a single binary file in sight. Some wizard!

Apex? Sure, go ahead – right up until you hit the 6 MB heap limit (12 MB if you go async). Game over!

AppExchange tools? They exist. I even tried one. Genuinely solid tool, no complaints about the functionality. But when you’re staring at a $360/year invoice with a minimum of four user licenses… to download files you already own… from your own org… let’s just say that procurement meeting wrote itself. My finance team had questions. Good questions.

Now, to be fair, if you’re already running an ETL platform like MuleSoft or Informatica, or you’ve got a backup tool like Odaseva keeping your files in sync, you might already have this covered. Those tools can handle binary extraction as part of a larger pipeline. But if your org doesn’t come with a six-figure integration budget (and most don’t), you’re back to square one.

I first faced this a few years ago, and recently someone jogged my memory about the requirement. So I did what any self-respecting architect would do on a Friday night: I built the thing myself. (So you don’t think I’m lazy, I did try the paid route first! 😄)

Why It’s Harder Than It Should Be

Here’s the part that trips people up. Salesforce doesn’t store files the way you’d expect.

Files live across three objects: ContentDocument (the logical file, think of it as the envelope), ContentVersion (the actual binary, versioned), and ContentDocumentLink (the junction that ties a file to whatever record it’s attached to: an Account, a Case, an Opportunity). Three objects. For one file. Welcome to Salesforce content management.

For extraction, what you care about is ContentVersion where IsLatest = true.
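As a rough sketch of that query (not lifted from the tool itself), here is one way to build the SOQL in shell. The function name `build_soql` and the exact field list are my assumptions; the fields themselves are standard ContentVersion fields, and `ContentSize` is in bytes.

```shell
# Build the SOQL query for the latest version of every file.
# $1 = minimum size in MB (0 or empty means no size filter).
# build_soql is a hypothetical helper name used for illustration.
build_soql() {
  local min_mb="${1:-0}"
  local soql="SELECT Id, ContentDocumentId, Title, FileExtension, ContentSize, Checksum FROM ContentVersion WHERE IsLatest = true"
  if [ "$min_mb" -gt 0 ]; then
    # ContentSize is stored in bytes, so convert MB to bytes.
    soql="$soql AND ContentSize > $(( min_mb * 1024 * 1024 ))"
  fi
  printf '%s\n' "$soql"
}

# Usage (needs an authenticated sf CLI session):
#   sf data query --query "$(build_soql 5)" --json | jq -r '.result.records[].Id'
```

Keeping the query in one small function makes the optional size filter easy to test without touching an org.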

The only reliable way to pull the binary? REST API. You stream it directly from /services/data/vXX.X/sobjects/ContentVersion/{Id}/VersionData (where vXX.X is your API version). No heap limits. No file size constraints. Beautiful!
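To make the streaming part concrete, here is a minimal sketch with curl. The helper names `version_data_url` and `download_version` are hypothetical, and `v60.0` is only a placeholder API version; swap in whatever your org supports.

```shell
# Assumption: v60.0 is a placeholder; override API_VERSION for your org.
API_VERSION="${API_VERSION:-v60.0}"

# Build the REST URL that serves the binary for one ContentVersion.
# $1 = instance URL, $2 = ContentVersion Id
version_data_url() {
  printf '%s/services/data/%s/sobjects/ContentVersion/%s/VersionData\n' \
    "${1%/}" "$API_VERSION" "$2"
}

# Stream one file straight to disk. curl writes as it reads, so the whole
# file never has to sit in memory.
# $1 = instance URL, $2 = access token, $3 = ContentVersion Id, $4 = output path
download_version() {
  curl --silent --show-error --fail --location \
    --header "Authorization: Bearer $2" \
    --output "$4" \
    "$(version_data_url "$1" "$3")"
}

# Usage (assumes the sf CLI is already authenticated to the org):
#   org_json=$(sf org display --json)
#   url=$(printf '%s' "$org_json" | jq -r '.result.instanceUrl')
#   token=$(printf '%s' "$org_json" | jq -r '.result.accessToken')
#   download_version "$url" "$token" 068XXXXXXXXXXXXXXX ./contract.pdf
```

The `--fail` flag matters here: without it, an expired token would leave you with an error payload saved under the file's name.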

But stitching that together from scratch means handling authentication and pagination, tracking what’s already downloaded, verifying nothing got corrupted in transit, and dealing with duplicate filenames and special characters. That’s a full weekend you don’t get back. And if it fails halfway through a thousand files? Start over. Fun.

I wanted something I could run once, walk away, and trust.

The Tool

Meet the Salesforce Files Extractor — an open-source bash tool that does exactly what the name says.

It runs in two modes:

🖥️ Local Script

Drop into your terminal and go:

./extract-files.sh                  # All files, 5 parallel downloads
./extract-files.sh 5                # Only files over 5 MB
./extract-files.sh 5 ./output 10    # 5 MB filter, custom folder, 10 parallel

Prerequisites? Just sf CLI, jq, and curl. That’s it.
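If you're curious how three positional arguments like those could drive parallel downloads, here is a small sketch. It is my illustration, not the script's actual code: `parse_args`, `run_parallel`, and the default folder name are all assumptions. The only real dependency is `xargs -P`, which both GNU and BSD xargs provide.

```shell
# Positional arguments with defaults, mirroring the three usage lines above.
# The default output folder is an assumption, not taken from the real script.
parse_args() {
  MIN_SIZE_MB="${1:-0}"
  OUTPUT_DIR="${2:-./extracted_files}"
  PARALLEL="${3:-5}"
}

# Read one item per line from stdin and run a command on each, N at a time.
# $1 = number of parallel jobs, remaining args = command (item is appended last)
run_parallel() {
  local jobs="$1"
  shift
  xargs -P "$jobs" -n 1 "$@"
}

# Usage (download-one.sh is a hypothetical per-file worker):
#   parse_args "$@"
#   cat ids.txt | run_parallel "$PARALLEL" ./download-one.sh
```

Output order is not guaranteed with parallel workers, so anything that depends on order should sort afterwards.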

⚙️ GitHub Action

Prefer automation? The included GitHub Action lets you trigger extraction from the Actions tab. It downloads to a separate extracted-data branch — your main stays clean with only source code. Hit the button, come back later, files are committed and pushed.

⚠️ Important: If you use the Action, keep your repo private. You don’t want your Salesforce files on a public branch.

What You Get

| Feature | Why It Matters |
| --- | --- |
| 5 parallel downloads (configurable) | Hundreds of files don’t take all day |
| Automatic resume | Fails at file 847? Re-run picks up at 848 |
| MD5 checksum verification | Every file verified against `ContentVersion.Checksum` |
| CSV manifest | Full traceability: ID, title, size, owner, checksum status |
| Duplicate filename handling | Two files named `Contract.pdf`? Second one gets the ContentVersionId appended |
| Size filtering | Only need files over 10 MB? One argument |
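The duplicate-filename rule is simple enough to sketch. This is an illustration under my own assumptions: `sanitize_name` and `unique_path` are hypothetical helpers, the set of replaced characters is my choice, and I assume the title and extension arrive as separate values.

```shell
# Replace characters that are awkward on common filesystems with underscores.
sanitize_name() {
  printf '%s' "$1" | sed 's#[/\\:*?"<>|]#_#g'
}

# Pick an output path; if the name is already taken, append the ContentVersion Id.
# $1 = output dir, $2 = title, $3 = extension (may be empty), $4 = ContentVersion Id
unique_path() {
  local base ext path
  base="$(sanitize_name "$2")"
  ext="${3:+.$3}"
  path="$1/$base$ext"
  if [ -e "$path" ]; then
    path="$1/${base}_$4$ext"
  fi
  printf '%s\n' "$path"
}

# Usage:
#   out=$(unique_path ./output "Contract" "pdf" "068XXXXXXXXXXXXXXX")
```

With parallel workers, two downloads could check the same name at the same moment, so a real implementation would want to resolve names before fanning out.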

The resume feature is probably my favorite. Real-world extractions fail: network blips, token expiry, your laptop going to sleep. Being able to re-run and pick up exactly where you left off turns a fragile process into a reliable one.
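Here is roughly how a manifest-driven resume can work. Treat the manifest layout as an assumption for this sketch (ContentVersion Id in the first column, a status in the last, `OK` meaning verified); `already_done` and `pending_ids` are hypothetical names.

```shell
# Succeed if the manifest already records this Id with status OK.
# $1 = manifest path, $2 = ContentVersion Id
already_done() {
  [ -f "$1" ] && awk -F, -v id="$2" \
    '$1 == id && $NF == "OK" { found = 1 } END { exit !found }' "$1"
}

# Read Ids from stdin and print only those that still need downloading.
# $1 = manifest path
pending_ids() {
  local manifest="$1" id
  while IFS= read -r id; do
    already_done "$manifest" "$id" || printf '%s\n' "$id"
  done
}

# Usage:
#   cat all_ids.txt | pending_ids ./output/_file_manifest.csv
```

Files that failed verification are not marked `OK`, so they get retried on the next run instead of being silently skipped.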

Under the Hood (Brief)

The flow is straightforward:

SOQL query → REST API stream → MD5 verify → CSV manifest

The script queries ContentVersion for all latest files (with optional size filter), then streams each binary directly to disk via the REST API. Each download is verified against Salesforce’s stored MD5 checksum. Everything gets logged to _file_manifest.csv — which also doubles as the resume tracker.
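The verification step can be sketched in a few lines. The helper names `file_md5` and `verify_md5` are mine, and I compare case-insensitively as a precaution rather than because I know which case Salesforce returns.

```shell
# Print the MD5 of a file as lowercase hex, on Linux (md5sum) or macOS (md5).
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum < "$1" | cut -d ' ' -f 1
  else
    md5 -q "$1"
  fi
}

# Succeed only if the file's MD5 matches the expected checksum.
# $1 = file path, $2 = expected checksum (for example ContentVersion.Checksum)
verify_md5() {
  local actual expected
  actual="$(file_md5 "$1")"
  expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage:
#   verify_md5 ./output/Contract.pdf "$checksum" && status=OK || status=FAILED
```

The resulting status is what you would write into the manifest row for that file.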

For the GitHub Action, an orphan branch (extracted-data) keeps extracted files completely separated from your source code. Clean git history. No merge conflicts with your workflow files.
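For anyone who hasn't used orphan branches, this is the general idea in plain git. It is a sketch of the technique, not the Action's actual steps, and `start_orphan_branch` is a hypothetical helper; it also assumes the branch doesn't exist yet.

```shell
# Switch a repo to a brand-new branch with no history and no source files.
# $1 = repo directory, $2 = branch name (the article uses extracted-data)
start_orphan_branch() {
  (
    cd "$1" || exit 1
    git checkout --quiet --orphan "$2" || exit 1
    # --orphan keeps the previous branch's files staged; drop them so the
    # new branch starts empty.
    git rm -r --quiet --force --ignore-unmatch . >/dev/null
  )
}

# Usage:
#   start_orphan_branch . extracted-data
#   cp -R /tmp/extracted/. . && git add -A && git commit -m "Extracted files"
```

Because the branch shares no commits with main, extracted files never show up in your source history.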

I’m not going to turn this into a tutorial — the README covers setup, authentication, usage, and edge cases in detail. There’s also a bonus list-files.apex script if you just want to inventory your org’s files without downloading anything.

Try It

The Salesforce Files Extractor is open source, MIT licensed. Free to use, free to modify, free to contribute to.

If you’ve ever stared at a Salesforce org full of files and thought “How do I get these out?” — this is your answer.

⭐ Star it on GitHub if it saves you a weekend.

🐛 Found a bug or want a feature? Open an issue. PRs welcome.

This was a real itch I needed to scratch, and I figured if I needed it, others probably do too. That’s the whole point of open source.

Until next time! 🙂