Why PhotoSpeak exists

Photos without their stories are just pictures of people at places. PhotoSpeak gives photos their voice back.

The Problem

You know how when you come back from a holiday and you sit everyone down and go through the photos? You talk through each one: who's in it, what was happening, why it's funny or important. That experience is 90% of the point of photos. But it's never captured. It happens in the moment and then it's gone.

That's fine when you're 35 and you remember everything. But people get old. People get dementia. People die. And all those stories die with them. The photos survive though — people digitise slides, scan old prints, pass them down. But a photo without its story is just a picture of some people at a place. It doesn't mean anything anymore.

The Insight

Nobody cares that there's a car in a photo. It gets a bit more interesting when you know it's a Ford GT. But when you know it was Dad's, and he bought it in 1980 and had it until he passed away in 2022 — now that photo means something. That's what a photo should be. Not just pixels. The story.

The Solution

PhotoSpeak is the modern version of writing on the back of the photo. You open up a folder of images, click through them one by one, and just talk about each one — naturally, like you're showing someone.

"That's Mum and Dad, my brother and me, standing out front of the house in '94. Dad's wearing his singlet like he always does."

AI takes everything — what you said, what it detected — and extracts all the structured data: names, locations, dates, events, tags, a clean description. But it keeps your original words too. Your actual transcript. That's the sentimental part.

Data Philosophy

XMP is the source of truth

Everything portable lives in the image file itself. Title, description, keywords, voice note text, voice audio, face regions — all stored as standard XMP metadata. Not in some app. Not in a cloud database. In the actual photo.
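Because XMP is an XML packet embedded in the file itself, a quick read doesn't even need a metadata library. A minimal sketch (reading only; a real writer would use exiftool or an XMP library to update the packet safely):

```python
def extract_xmp(image_bytes: bytes):
    """Pull the raw XMP packet out of an image file's bytes.

    XMP is embedded as an XML packet delimited by <x:xmpmeta ...>
    and </x:xmpmeta>, so a plain byte scan is enough for a quick read.
    Returns the packet as a string, or None if there is no packet.
    """
    start = image_bytes.find(b"<x:xmpmeta")
    if start == -1:
        return None
    end = image_bytes.find(b"</x:xmpmeta>", start)
    if end == -1:
        return None
    end += len(b"</x:xmpmeta>")
    return image_bytes[start:end].decode("utf-8", "replace")
```

Whatever wrote the photo, the story is right there in the bytes.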

SQLite is a local cache

The database is rebuilt from XMP on folder load. Delete it anytime. It exists purely for fast search and filtering. The files are the truth.
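Rebuilding from scratch keeps the cache honest. A sketch of what that looks like (table layout and field names are illustrative, not PhotoSpeak's actual schema):

```python
import sqlite3

def rebuild_cache(db_path, photos):
    """Drop and rebuild the search cache from parsed XMP metadata.

    `photos` is a list of dicts already read out of each file's XMP.
    The table is recreated every time — the files, not the DB, are
    the truth, so the cache is always safe to throw away.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS photos")
    conn.execute(
        """CREATE TABLE photos (
               path        TEXT PRIMARY KEY,
               title       TEXT,
               description TEXT,
               keywords    TEXT,   -- comma-joined for simple LIKE search
               transcript  TEXT
           )"""
    )
    rows = [
        (p["path"], p.get("title"), p.get("description"),
         ",".join(p.get("keywords", [])), p.get("transcript"))
        for p in photos
    ]
    conn.executemany("INSERT INTO photos VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```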

No sidecar files

Nothing loose to get separated or forgotten. Everything travels with the photo. Copy it, share it, back it up — the story comes with it.

No lock-in

All metadata uses industry-standard IPTC, Dublin Core, and MWG tags. Readable by Lightroom, digiKam, Apple Photos, and any XMP-aware software. Delete PhotoSpeak and the stories remain.
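An illustrative mapping from PhotoSpeak-style fields to the standard XMP properties other tools read (the field names on the left are assumptions; the tags on the right are the published Dublin Core, IPTC, and MWG properties):

```python
# Standard XMP properties, readable by Lightroom, digiKam, etc.
# Left-hand field names are illustrative, not PhotoSpeak's real ones.
STANDARD_TAGS = {
    "title":        "dc:title",                   # Dublin Core
    "description":  "dc:description",
    "keywords":     "dc:subject",
    "date":         "photoshop:DateCreated",      # IPTC Core
    "location":     "Iptc4xmpExt:LocationShown",  # IPTC Extension
    "face_regions": "mwg-rs:Regions",             # MWG face regions
}
```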

Graceful degradation

Each AI model is optional. If one fails to load, everything else keeps working. The pipeline adapts to your hardware. No model is required — you can annotate photos with voice alone.
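The load-what-you-can approach can be sketched in a few lines (model names and the loader shape are illustrative):

```python
import logging

def load_models(loaders):
    """Try each optional model loader; keep whatever succeeds.

    `loaders` maps a model name to a zero-arg callable that returns
    the loaded model. A failure is logged and skipped, so every other
    part of the pipeline keeps working.
    """
    models = {}
    for name, load in loaders.items():
        try:
            models[name] = load()
        except Exception as exc:
            logging.warning("Skipping optional model %s: %s", name, exc)
    return models
```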

Everything runs on your machine

Whisper, face detection, CLIP, Florence, YOLOv8, BioCLIP, MiDaS — all local. No data leaves your computer unless you choose a cloud LLM provider. Even then, only the voice note text and context are sent, never the image itself.

Local Speech-to-Text

Faster Whisper runs entirely on your machine. Your voice recordings never leave your computer.

Local Vision AI

Face detection, object detection, captioning, depth estimation, species ID — all local models. Your photos stay private.

Your Choice of LLM

Use Ollama for fully offline operation, or Anthropic Claude / OpenAI-compatible endpoints if you prefer cloud quality. The choice is yours.
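Dispatching to a provider might look like this sketch — model names and the prompt are illustrative, the endpoints are the providers' documented ones. Note what is never in either payload: the image.

```python
def build_llm_request(provider, voice_text, detections):
    """Build the request for the chosen LLM provider.

    Only text goes out — the voice note transcript and the local
    detection context. The image itself is never sent.
    """
    prompt = (
        "Extract names, locations, dates, events, tags and a clean "
        f"description.\nDetected: {detections}\nVoice note: {voice_text}"
    )
    if provider == "ollama":  # fully offline, local server
        return {
            "url": "http://localhost:11434/api/generate",
            "body": {"model": "llama3", "prompt": prompt, "stream": False},
        }
    # OpenAI-compatible chat endpoint (also covers many local servers)
    return {
        "url": "https://api.openai.com/v1/chat/completions",
        "body": {"model": "gpt-4o-mini",
                 "messages": [{"role": "user", "content": prompt}]},
    }
```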

Photos shouldn't be a pixel archive. They should be a living record with a voice attached.