PhotoSpeak combines voice recording, computer vision, and language AI to turn spoken descriptions into rich, searchable, portable photo metadata. Here's what's happening behind the scenes.
Everything portable lives in the image file itself. Title, description, keywords, voice note text, voice audio, face regions — all stored as standard XMP metadata using industry-standard IPTC, Dublin Core, and MWG tags. Readable by Lightroom, digiKam, Apple Photos, and any XMP-aware software.
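To make the idea concrete, here is a minimal sketch of the Dublin Core portion of such an XMP packet, built with the standard library. Real XMP wraps `dc:title` and `dc:description` in `rdf:Alt` language alternatives and embeds the packet in the file via an XMP toolkit; this sketch flattens that for brevity, and the example values are illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def build_xmp(title, description, keywords):
    """Build a minimal Dublin Core XMP description block."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description")
    # dc:title / dc:description are language alternatives in full XMP;
    # flattened to plain elements here for brevity.
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}description").text = description
    # Keywords live in dc:subject as an unordered rdf:Bag.
    bag = ET.SubElement(ET.SubElement(desc, f"{{{DC}}}subject"), f"{{{RDF}}}Bag")
    for kw in keywords:
        ET.SubElement(bag, f"{{{RDF}}}li").text = kw
    return ET.tostring(rdf, encoding="unicode")

packet = build_xmp("Nan at the beach", "Margaret at Bondi, 1974", ["family", "beach"])
```

Because these are plain Dublin Core tags, any XMP-aware reader resolves them the same way.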
The database is rebuilt from the images on folder load. Delete it anytime — it exists purely for fast search and filtering. The files are the truth.
Nothing loose to get separated or forgotten. Everything travels with the photo. Copy it, share it, back it up — the story comes with it.
All metadata uses open, industry-standard tags. Delete PhotoSpeak and the stories remain in your photos, readable by any photo software that supports XMP.
All vision and audio models run locally: face detection, object recognition, captioning, depth estimation, species identification, speech-to-text. No image or audio data is sent to external servers for analysis; your photos never leave your computer.
Use Ollama for fully offline operation with any open-source model. Or choose Anthropic Claude or any OpenAI-compatible endpoint for cloud-quality extraction. The choice is yours.
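The two backend shapes differ only in their request format. This sketch builds the request for each without sending it; the Ollama endpoint path is its real `/api/generate` route, while the base URL and model names are placeholders you would replace with your own.

```python
import json

def build_request(backend, prompt, model):
    """Build the HTTP request for the chosen metadata-extraction backend.
    Model names here are placeholders; point the URL at your own server."""
    if backend == "ollama":
        # Ollama's local generate endpoint; stream=False returns one JSON reply.
        return ("http://localhost:11434/api/generate",
                {"model": model, "prompt": prompt, "stream": False})
    if backend == "openai-compatible":
        # Any OpenAI-compatible chat endpoint accepts this message shape.
        return ("https://api.example.com/v1/chat/completions",
                {"model": model, "messages": [{"role": "user", "content": prompt}]})
    raise ValueError(f"unknown backend: {backend}")

url, payload = build_request("ollama", "Extract keywords for this caption.", "llama3")
body = json.dumps(payload)
```

Because only the request builder changes, swapping backends never touches the rest of the pipeline.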
When sharing, voice clips, transcripts, and metadata in transit are end-to-end encrypted. Key exchange happens on collection invite. The relay server sees only opaque encrypted blobs.
Local-first with a thin relay server. Offline edits queue and merge on reconnect. The relay never sees originals or unencrypted metadata. Originals never leave your machine — only display-resolution copies and encrypted metadata pass through.
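One reasonable policy for merging queued offline edits is per-field last-write-wins, sketched below. This is an illustration of the sync idea, not PhotoSpeak's actual merge rules.

```python
def merge_edits(server_state, queued_edits):
    """Merge offline edits into server state; the newest timestamp wins
    per (photo, field) pair, so concurrent edits to different fields
    never clobber each other."""
    for edit in sorted(queued_edits, key=lambda e: e["ts"]):
        key = (edit["photo"], edit["field"])
        current = server_state.get(key)
        if current is None or edit["ts"] >= current["ts"]:
            server_state[key] = {"value": edit["value"], "ts": edit["ts"]}
    return server_state
```

In the real system both sides of the merge would be encrypted blobs from the relay's point of view; only the clients see field values.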
PhotoSpeak builds a persistent memory of the people, places, animals, events, and organisations in your world — a knowledge graph that grows every time you use it. You can teach it directly through the "Teach Me" onboarding, or it learns as you annotate.
PhotoSpeak resolves nicknames, maiden names, and informal references automatically. Say "Nan" and it knows you mean Margaret Elizabeth Walsh. Say "the old house" and it maps to 42 George Street.
Tracks relationships between entities: spouse, parent, child, sibling, friend, employer, attendee, alias. When you name a person, related context enriches the metadata.
Knowledge builds across your entire collection. Name a face once and the identity propagates. Mention a place once and it's linked everywhere it appears.
PhotoSpeak doesn't just listen — it asks follow-up questions based on what it sees and what it remembers about your family. It prompts for details you wouldn't have thought to mention, building richer metadata with each conversation.
When you open a folder, the pipeline runs automatically in the background. Each step is independent — if one fails, everything else keeps working. Models are modular and can be swapped for alternatives. New steps are added regularly.
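The fault isolation described above amounts to catching each step's failure independently, sketched here with hypothetical step functions:

```python
def run_pipeline(image, steps):
    """Run each analysis step independently; a failing step is recorded
    and skipped so every other step still completes."""
    results, errors = {}, {}
    for name, fn in steps:
        try:
            results[name] = fn(image)
        except Exception as exc:
            errors[name] = str(exc)   # isolate the failure, keep going
    return results, errors
```

Because steps share no state beyond the image, a swapped-in or newly added model slots into the list without touching the others.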
| Step | What It Does |
|---|---|
| Inventory | Catalogues images, extracts EXIF data, builds working set |
| Normalise | Orientation correction, consistent sizing |
| Perceptual Hash | Fingerprints images for duplicate and near-duplicate detection |
| Face Detection | Finds faces in images using neural network detection |
| Face Embedding | Generates identity embeddings for face matching across photos |
| Face Clustering | Groups faces across images into distinct identities |
| Face from Person | Extracts face crops from full-body person detections |
| Object Detection | Identifies objects in images (80+ categories) |
| Image Captioning & OCR | Generates descriptions, dense captions, and reads text in images |
| Species Identification | Identifies animal and plant species |
| Text Recognition | Dedicated text recognition on detected text regions |
| Semantic Embeddings | Generates embeddings for visual search and similarity matching |
| Depth Estimation | Estimates depth from a single image for spatial understanding |
| Colour Analysis | Extracts colour palette and dominant colour information |
| B&W Detection | Detects black-and-white images and estimates original era |
| Era Detection | Estimates when a photo was originally taken from visual cues |
| Environment | Classifies indoor/outdoor, scene type, and lighting conditions |
| Pose & Activity | Estimates human poses and what people are doing |
| Scene Understanding | High-level scene analysis and context extraction |
| Geocoding | Converts GPS coordinates to place names and addresses |
| Weather Lookup | Retrieves historical weather conditions for the date and location |
| Collection | Groups related images into logical collections |
| Face Animation | Generates animation data for living portrait effects |
| ...and growing | New analysis steps are added as models improve. |
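As one example of a pipeline step, the Perceptual Hash stage can be illustrated with a classic average hash: downscale to an 8×8 grayscale grid, then set one bit per pixel above the mean. This sketch takes the grid as given rather than doing the resize.

```python
def average_hash(pixels):
    """64-bit average hash of an 8x8 grayscale grid (values 0-255).
    A real pipeline resizes the image to 8x8 first; we take the grid as given."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(a, b):
    """Bit distance between two hashes: 0 = identical, small = near-duplicate."""
    return bin(a ^ b).count("1")
```

Near-duplicates (re-saves, light crops, small edits) land a few bits apart, so a low Hamming threshold groups them without exact matching.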
Fields appear progressively as they're generated. Watch title, description, keywords, and people fill in live.
Semantic embeddings find visually similar photos in your collection. Discover connections and duplicates you didn't know existed.
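Visual search over those embeddings reduces to nearest-neighbour ranking by cosine similarity, sketched here with plain lists standing in for real embedding vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_embedding, library):
    """Rank (path, embedding) pairs by similarity to the query photo."""
    return sorted(library, key=lambda item: cosine(query_embedding, item[1]),
                  reverse=True)
```

At collection scale an approximate-nearest-neighbour index would replace the linear scan, but the ranking idea is the same.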
Edit the extraction prompts to customise how PhotoSpeak interprets your photos. Changes take effect immediately — no restart needed.
JPEG, PNG, TIFF, WebP, and HEIC. All standard image formats with full XMP read/write support.
Fetches historical weather, elevation, sunrise/sunset, and nearby points of interest from GPS coordinates and date.
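A historical-weather lookup from GPS and date can be sketched as building a query against an Open-Meteo-style archive endpoint; the provider and the chosen daily variables are assumptions you would swap for whichever service PhotoSpeak is configured with.

```python
from urllib.parse import urlencode

def weather_url(lat, lon, date):
    """Build a historical-weather query for the photo's GPS fix and date
    (date as YYYY-MM-DD). Open-Meteo archive shape; swap in your provider."""
    params = {
        "latitude": f"{lat:.4f}",
        "longitude": f"{lon:.4f}",
        "start_date": date,
        "end_date": date,
        "daily": "temperature_2m_max,precipitation_sum",
    }
    return "https://archive-api.open-meteo.com/v1/archive?" + urlencode(params)

url = weather_url(-33.8568, 151.2153, "1974-01-26")
```

The same coordinates drive the geocoding, elevation, and sunrise/sunset lookups, so one GPS fix enriches several fields at once.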
Pipeline presets (Quick, Standard, Full) let you balance speed and depth. Individual steps can be enabled or disabled. Everything adapts to your hardware.
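Presets plus per-step overrides can be modelled as sets of enabled step names; the step names and preset contents below are illustrative, not the shipped defaults.

```python
# Illustrative preset definitions; step names follow the pipeline table above.
PRESETS = {
    "quick":    {"inventory", "normalise", "perceptual_hash", "face_detection"},
    "standard": {"inventory", "normalise", "perceptual_hash", "face_detection",
                 "object_detection", "captioning", "semantic_embeddings"},
}
PRESETS["full"] = PRESETS["standard"] | {"depth", "species", "pose", "weather"}

def enabled_steps(preset, overrides=None):
    """Start from a preset, then apply per-step enable/disable overrides."""
    steps = set(PRESETS[preset])
    for step, on in (overrides or {}).items():
        if on:
            steps.add(step)
        else:
            steps.discard(step)
    return steps
```

Hardware adaptation then becomes a matter of picking the preset and toggling the expensive steps off.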
Open standards. Local processing. Your data, your photos, your stories.