How to use PhotoSpeak

PhotoSpeak turns your spoken descriptions into rich, searchable photo metadata. Talk about your photos, and AI extracts the details.

Five-Step Workflow

1. Open a Folder

Click Open Folder and browse to a folder of photos. PhotoSpeak indexes them and runs the AI pipeline automatically — face detection, CLIP embeddings, object detection, and more.
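The indexing step can be pictured as a simple scan for image files. A minimal sketch, assuming a typical extension filter (the real list PhotoSpeak uses may differ):

```python
from pathlib import Path

# Assumed extension filter -- illustrative, not PhotoSpeak's actual list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tiff", ".webp"}

def index_folder(folder: str) -> list[Path]:
    """Return image files in the folder, sorted for a stable gallery order."""
    root = Path(folder)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Each indexed file would then be handed to the AI pipeline stages in turn.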

2. Record Your Description

Click Record image details and talk about the photo. Mention who's in it, where it was taken, when, and what's happening. You can record multiple takes — they get stitched together automatically.

3. Extract Fields

Click Extract Fields. The AI reads your transcript plus EXIF data, GPS, weather, and detections to fill in title, description, keywords, people, location, and more.
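Conceptually, extraction bundles your transcript with the automatic context before the AI fills in fields. A sketch of that input assembly, with illustrative field names that are assumptions rather than PhotoSpeak's actual schema:

```python
def build_extraction_input(transcript: str, exif: dict, detections: list[str]) -> dict:
    """Combine the voice-note transcript with machine-readable context.
    Field names are illustrative; the real payload may differ."""
    return {
        "transcript": transcript.strip(),
        "capture_time": exif.get("DateTimeOriginal"),
        "gps": exif.get("GPSPosition"),
        "detected_objects": detections,
    }
```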

4. Review & Edit

Review the extracted fields. Click suggestion chips to pick alternatives. Edit anything by hand. Accept or reject face and object detections.

5. Browse & Export

Open the Gallery to browse, search, and filter your annotated photos. Export to a standalone HTML gallery you can share with anyone.
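A standalone HTML export amounts to rendering one self-contained page from the annotated records. A toy sketch, assuming hypothetical "file" and "title" record fields:

```python
import html

def export_gallery(records: list[dict]) -> str:
    """Render a minimal standalone HTML gallery page.
    html.escape guards against titles containing markup characters."""
    cells = "\n".join(
        f'<figure><img src="{html.escape(r["file"])}" alt="">'
        f'<figcaption>{html.escape(r.get("title", ""))}</figcaption></figure>'
        for r in records
    )
    return f"<!doctype html><html><body>\n{cells}\n</body></html>"
```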

Everything is saved in the image file. All metadata is written to XMP — it travels with the photo. No database lock-in, no cloud dependency. Copy the image anywhere and the metadata comes with it.
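Writing XMP into the file goes through exiftool. A hedged sketch of how such a call could be assembled (the exact tags and flags PhotoSpeak uses are assumptions; XMP-dc:Title and XMP-dc:Subject are standard exiftool tag names):

```python
def exiftool_write_args(image: str, title: str, keywords: list[str]) -> list[str]:
    """Build an exiftool command that writes XMP metadata in place.
    -overwrite_original avoids leaving *_original backup files."""
    args = ["exiftool", "-overwrite_original", f"-XMP-dc:Title={title}"]
    args += [f"-XMP-dc:Subject={kw}" for kw in keywords]
    args.append(image)
    return args
```

The resulting list would be run with something like subprocess.run(args, check=True).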

Recording Tips

Better transcripts produce better metadata. Here's how to get the most out of your recordings.

Mention the basics

Who is in the photo, where it was taken, when, and what's happening. These are the four pillars the AI uses to populate fields.

Tell the story

Don't just list facts — tell the story behind the photo. The AI turns your words into a narrative description. The more you say, the richer the output. A 30-second recording produces far better results than a 5-second one.

Use names

Say people's names clearly: "That's Nana Margaret on the left, and Uncle Dave." The AI uses these to populate the People field and match with face detections.

Multiple takes are fine

Record once, then record again to add details. The voice note text appends automatically, and all audio recordings are stitched together. You won't lose anything.
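The append behaviour for the text side can be sketched in a few lines (a simple blank-line separator is an assumption; PhotoSpeak may join takes differently):

```python
def append_take(existing: str, new_take: str) -> str:
    """Append a new transcript take to the voice note text,
    separated by a blank line so takes stay readable."""
    new_take = new_take.strip()
    if not existing.strip():
        return new_take
    return existing.rstrip() + "\n\n" + new_take
```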

Edit the transcript

After recording, you can freely edit the voice note text — fix names, add details you forgot, remove mistakes. The AI only sees the final text.

Keyboard Shortcuts

Key        Action
Space      Toggle recording
← / →      Previous / next image
Ctrl + S   Save metadata
Ctrl + Z   Undo
Ctrl + E   Extract fields from voice note text
Alt + A    Toggle annotate mode (draw boxes)
D          Detect faces
Esc        Close dialogs / lightbox

In the Gallery, use Ctrl+click to select multiple images. Arrow keys navigate the lightbox.

System Requirements

Python 3.10+

Install dependencies with pip install -r requirements.txt, then run python main.py. The app opens at http://localhost:8008.

exiftool

The metadata engine. Windows: download from exiftool.org, rename to exiftool.exe, add to PATH. macOS: brew install exiftool. Linux: sudo apt install libimage-exiftool-perl.

ffmpeg

Required for audio conversion (voice notes to Opus). Windows: download from gyan.dev, add bin/ to PATH. macOS: brew install ffmpeg. Linux: sudo apt install ffmpeg.
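The conversion ffmpeg performs can be expressed as a command along these lines (codec flags are standard ffmpeg options; the bitrate and exact settings PhotoSpeak uses are assumptions):

```python
def ffmpeg_opus_args(src: str, dst: str, bitrate: str = "32k") -> list[str]:
    """Build an ffmpeg command converting a voice recording to Opus.
    Speech compresses well at low Opus bitrates; -y overwrites dst."""
    return ["ffmpeg", "-y", "-i", src,
            "-c:a", "libopus", "-b:a", bitrate, dst]
```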

First run downloads Whisper (~150 MB). Other models download on first use. A GPU is helpful but not required — everything works on CPU.