PhotoSpeak turns your spoken descriptions into rich, searchable photo metadata. Talk about your photos, and AI extracts the details.
Click Open Folder and browse to a folder of photos. PhotoSpeak indexes them and runs the AI pipeline automatically — face detection, CLIP embeddings, object detection, and more.
Click Record image details and talk about the photo. Mention who's in it, where it was taken, when, and what's happening. You can record multiple takes — they get stitched together automatically.
Click Extract Fields. The AI reads your transcript plus EXIF data, GPS, weather, and detections to fill in title, description, keywords, people, location, and more.
Review the extracted fields. Click suggestion chips to pick alternatives. Edit anything by hand. Accept or reject face and object detections.
Open the Gallery to browse, search, and filter your annotated photos. Export to a standalone HTML gallery you can share with anyone.
Everything is saved in the image file. All metadata is written to XMP — it travels with the photo. No database lock-in, no cloud dependency. Copy the image anywhere and the metadata comes with it.
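Because the metadata lives in the file as XMP, any standard tool can read or write it. As a rough sketch of what "written to XMP" means in practice, here is a hypothetical helper that builds an ExifTool command for embedding a title and keywords; the tag names are standard XMP Dublin Core tags, but PhotoSpeak's exact field mapping may differ.

```python
def exiftool_write_args(image_path, title, keywords):
    """Build an exiftool command that embeds XMP metadata in the image file.

    Illustrative sketch: tag names are standard XMP-dc tags, not necessarily
    the exact fields PhotoSpeak writes.
    """
    args = ["exiftool", "-overwrite_original", f"-XMP-dc:Title={title}"]
    # One -XMP-dc:Subject per keyword; exiftool accumulates them into a list.
    args += [f"-XMP-dc:Subject={kw}" for kw in keywords]
    args.append(image_path)
    return args

cmd = exiftool_write_args("beach.jpg", "Family picnic", ["beach", "family"])
# With exiftool on PATH you could run it: subprocess.run(cmd, check=True)
print(cmd)
```

Since the metadata is inside the image, copying `beach.jpg` anywhere carries the title and keywords with it.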
Better transcripts produce better metadata. Here's how to get the most out of your recordings.
Cover the four pillars the AI uses to populate fields: who is in the photo, where it was taken, when, and what's happening.
Don't just list facts — tell the story behind the photo. The AI turns your words into a narrative description. The more you say, the richer the output. A 30-second recording produces far better results than a 5-second one.
Say people's names clearly: "That's Nana Margaret on the left, and Uncle Dave." The AI uses these to populate the People field and match with face detections.
Record once, then record again to add details. Each new transcript is appended to the voice note text automatically, and all audio recordings are stitched together. You won't lose anything.
After recording, you can freely edit the voice note text — fix names, add details you forgot, remove mistakes. The AI only sees the final text.
| Key | Action |
|---|---|
| Space | Toggle recording |
| ← → | Previous / next image |
| Ctrl + S | Save metadata |
| Ctrl + Z | Undo |
| Ctrl + E | Extract fields from voice note text |
| Alt + A | Toggle annotate mode (draw boxes) |
| D | Detect faces |
| Esc | Close dialogs / lightbox |
In the Gallery, use Ctrl+click to select multiple images. Arrow keys navigate the lightbox.
Install dependencies with `pip install -r requirements.txt` and run with `python main.py`. The app opens at http://localhost:8008.
ExifTool is the metadata engine. Windows: download from exiftool.org, rename the executable to `exiftool.exe`, and add it to PATH. macOS: `brew install exiftool`. Linux: `sudo apt install libimage-exiftool-perl`.
FFmpeg is required for audio conversion (voice notes to Opus). Windows: download a build from gyan.dev and add its `bin/` directory to PATH. macOS: `brew install ffmpeg`. Linux: `sudo apt install ffmpeg`.
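For a sense of what the Opus conversion involves, here is a minimal sketch of an ffmpeg invocation for a browser-recorded voice note. The bitrate and input format are assumptions for illustration, not PhotoSpeak's exact settings.

```python
def ffmpeg_opus_args(src, dst):
    """Build an ffmpeg command converting a recorded voice note to Opus.

    Sketch only: codec flags are real ffmpeg options, but the bitrate is an
    illustrative choice, not necessarily what PhotoSpeak uses.
    """
    return [
        "ffmpeg", "-y",      # overwrite the output file without asking
        "-i", src,           # input recording (e.g. WebM from the browser)
        "-c:a", "libopus",   # encode the audio stream as Opus
        "-b:a", "32k",       # speech-friendly bitrate
        dst,
    ]

print(ffmpeg_opus_args("take1.webm", "take1.opus"))
```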
The first run downloads the Whisper speech-recognition model (~150 MB). Other models download on first use. A GPU is helpful but not required — everything works on CPU.
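Before the first run, a quick sanity check (not part of PhotoSpeak itself) can confirm the external tools above are actually on PATH:

```python
import shutil

def missing_tools(names=("exiftool", "ffmpeg")):
    """Return the external tools from `names` that are not found on PATH."""
    return [n for n in names if shutil.which(n) is None]

missing = missing_tools()
if missing:
    print("Install before running PhotoSpeak:", ", ".join(missing))
else:
    print("All external tools found.")
```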