How it all works

PhotoSpeak combines voice recording, computer vision, and language AI to turn spoken descriptions into rich, searchable, portable photo metadata. Here's what's happening behind the scenes.

The file is the source of truth

XMP Metadata

Everything portable lives in the image file itself. Title, description, keywords, voice note text, voice audio, and face regions are all stored as XMP metadata using industry-standard IPTC, Dublin Core, and MWG tags, so they are readable by Lightroom, digiKam, Apple Photos, and any other XMP-aware software.
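To make the idea concrete, here is a minimal sketch of what a Dublin Core portion of an XMP packet might look like when built and read back with Python's standard library. The function names `build_xmp` and `read_keywords` are hypothetical, not PhotoSpeak's API, and embedding the packet into an image file (normally done with a metadata tool) is not shown.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs used by XMP packets.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Return a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp(title, description, keywords):
    """Build a small XMP packet (hypothetical helper, illustration only)."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    # dc:title and dc:description are language alternatives (rdf:Alt).
    for tag, text in (("title", title), ("description", description)):
        alt = ET.SubElement(ET.SubElement(desc, q("dc", tag)), q("rdf", "Alt"))
        li = ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"})
        li.text = text
    # dc:subject holds keywords as an unordered bag (rdf:Bag).
    bag = ET.SubElement(ET.SubElement(desc, q("dc", "subject")), q("rdf", "Bag"))
    for keyword in keywords:
        ET.SubElement(bag, q("rdf", "li")).text = keyword
    return ET.tostring(xmpmeta, encoding="unicode")


def read_keywords(xmp_text):
    """Read the dc:subject keywords back out of a packet."""
    root = ET.fromstring(xmp_text)
    return [li.text for li in root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)]


xmp = build_xmp("Nan at the beach", "Summer holiday", ["beach", "family"])
print(read_keywords(xmp))
```

Because the packet is plain RDF/XML with well-known namespaces, any XMP-aware reader can recover the same fields without knowing anything about the tool that wrote them.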

SQLite is a local cache

The database is rebuilt from the images on folder load. Delete it anytime — it exists purely for fast search and filtering. The files are the truth.
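A disposable cache of this kind can be sketched with Python's built-in `sqlite3` module. The table layout, the `rebuild_cache` and `search` helpers, and the sample records are assumptions made for illustration; the real schema is not described here.

```python
import sqlite3


def rebuild_cache(conn, photos):
    """Drop and recreate the index from metadata read out of the image files."""
    conn.execute("DROP TABLE IF EXISTS photos")
    conn.execute(
        "CREATE TABLE photos (path TEXT PRIMARY KEY, title TEXT, keywords TEXT)"
    )
    conn.executemany(
        "INSERT INTO photos (path, title, keywords) VALUES (?, ?, ?)",
        [(p["path"], p["title"], ",".join(p["keywords"])) for p in photos],
    )
    conn.commit()


def search(conn, term):
    """Return paths whose title or keywords contain the search term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT path FROM photos WHERE title LIKE ? OR keywords LIKE ? ORDER BY path",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


# Hypothetical metadata, standing in for values parsed from each file's XMP.
photos = [
    {"path": "a.jpg", "title": "Nan at the beach", "keywords": ["beach", "family"]},
    {"path": "b.jpg", "title": "Garden", "keywords": ["roses"]},
]

conn = sqlite3.connect(":memory:")
rebuild_cache(conn, photos)
print(search(conn, "beach"))
```

Since `rebuild_cache` starts from an empty table every time, deleting the database loses nothing: the next folder load produces the same index from the files.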

No sidecar files

Nothing loose to get separated or forgotten. Everything travels with the photo. Copy it, share it, back it up — the story comes with it.

No lock-in

All metadata uses open, industry-standard tags. Delete PhotoSpeak and the stories remain in your photos, readable by any photo software that supports XMP.

Everything runs on your machine

All vision and audio models run locally. Face detection, object recognition, captioning, depth estimation, species identification, and speech-to-text all happen on your computer, so your photos stay private.

Local AI

Every vision and audio model runs on your machine. No data is sent to external servers for image analysis, and your photos are never uploaded for processing.

Your Choice of LLM

Use Ollama for fully offline operation with any open-source model, or choose Anthropic Claude or any OpenAI-compatible endpoint if you prefer a hosted model for extraction. The choice is yours.

End-to-End Encryption

When you share a collection, voice clips, transcripts, and metadata are end-to-end encrypted in transit. Key exchange happens at collection invite. The relay server sees only opaque encrypted blobs.

Hybrid sync

Local-first with a thin relay server. Offline edits queue and merge on reconnect. Originals never leave your machine: only display-resolution copies and encrypted metadata pass through the relay, which never sees unencrypted metadata.
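The queue-then-merge behaviour can be illustrated with a small sketch. The merge rule used here (last writer wins per field, by timestamp) is an assumption chosen for simplicity, since the actual merge strategy is not specified, and the `Edit`, `OfflineQueue`, and `merge` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    photo_id: str
    field_name: str
    value: str
    timestamp: float  # when the edit was made on the device


@dataclass
class OfflineQueue:
    """Holds edits made while disconnected (hypothetical structure)."""
    pending: list = field(default_factory=list)

    def record(self, edit):
        self.pending.append(edit)

    def flush(self):
        """Hand over all queued edits and empty the queue."""
        edits, self.pending = self.pending, []
        return edits


def merge(state, edits):
    """Apply edits to state, keeping the newest value per (photo, field).

    state maps (photo_id, field_name) -> (timestamp, value).
    Assumed rule: last writer wins.
    """
    for edit in sorted(edits, key=lambda e: e.timestamp):
        key = (edit.photo_id, edit.field_name)
        if key not in state or state[key][0] <= edit.timestamp:
            state[key] = (edit.timestamp, edit.value)
    return state


queue = OfflineQueue()
queue.record(Edit("p1", "title", "Beach", 10.0))
queue.record(Edit("p1", "title", "Beach day", 12.0))

# Another device changed the same field at t=11 while this one was offline.
shared_state = {("p1", "title"): (11.0, "Seaside")}
merge(shared_state, queue.flush())
print(shared_state)
```

A per-field rule like this keeps unrelated edits from clobbering each other, at the cost of silently discarding the older of two conflicting values.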

Your Memories, Remembered

PhotoSpeak builds a persistent memory of the people, places, animals, events, and organisations in your world — a knowledge graph that grows every time you use it. You can teach it directly through the "Teach Me" onboarding, or it learns as you annotate.

Entity Resolution

PhotoSpeak resolves nicknames, maiden names, and informal references automatically. Say "Nan" and it knows you mean Margaret Elizabeth Walsh. Say "the old house" and it maps to 42 George Street.
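At its simplest, this kind of resolution is an alias lookup. The sketch below uses a plain dictionary and a hypothetical `EntityResolver` class; it does not attempt fuzzy matching or context-based disambiguation.

```python
class EntityResolver:
    """Maps informal mentions to canonical entity names (illustrative only)."""

    def __init__(self):
        self._aliases = {}

    def teach(self, canonical, *aliases):
        """Register a canonical name together with any informal references."""
        for name in (canonical, *aliases):
            self._aliases[name.casefold().strip()] = canonical

    def resolve(self, mention):
        """Return the canonical name, or the mention unchanged if unknown."""
        return self._aliases.get(mention.casefold().strip(), mention)


resolver = EntityResolver()
resolver.teach("Margaret Elizabeth Walsh", "Nan")
resolver.teach("42 George Street", "the old house")

print(resolver.resolve("Nan"))
print(resolver.resolve("The old house"))
```

Unknown mentions pass through unchanged, which leaves room for a follow-up question rather than a wrong guess.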

Relationship Mapping

Tracks relationships between entities: spouse, parent, child, sibling, friend, employer, attendee, alias. When you name a person, related context enriches the metadata.
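One way to picture this is as a set of labelled edges between entities. The sketch below is an assumption about structure, not the actual storage format; `RelationshipGraph` and the example person "Tom Walsh" are hypothetical.

```python
from collections import defaultdict

# Relationship types listed above.
RELATION_TYPES = {
    "spouse", "parent", "child", "sibling",
    "friend", "employer", "attendee", "alias",
}


class RelationshipGraph:
    """Stores (subject, relation, object) edges between named entities."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, subject, relation, obj):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        self._edges[subject].add((relation, obj))

    def context_for(self, subject):
        """Return related entities that could enrich a photo's metadata."""
        return sorted(self._edges.get(subject, ()))


graph = RelationshipGraph()
graph.relate("Margaret Elizabeth Walsh", "spouse", "Tom Walsh")
graph.relate("Margaret Elizabeth Walsh", "alias", "Nan")

print(graph.context_for("Margaret Elizabeth Walsh"))
```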

Cross-Photo Intelligence

Knowledge builds across your entire collection. Name a face once and the identity propagates. Mention a place once and it's linked everywhere it appears.

Your Personal Interviewer

PhotoSpeak doesn't just listen — it asks follow-up questions based on what it sees and what it remembers about your family. It prompts for details you wouldn't have thought to mention, building richer metadata with each conversation.

23+ Steps of Automatic Analysis

When you open a folder, the pipeline runs automatically in the background. Each step is independent — if one fails, everything else keeps working. Models are modular and can be swapped for alternatives. New steps are added regularly.
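The "if one fails, everything else keeps working" property comes down to isolating each step's errors. Here is a minimal sketch, run sequentially for simplicity; the `run_pipeline` function and the three stand-in steps are hypothetical.

```python
def run_pipeline(image_path, steps):
    """Run every step, recording failures instead of stopping on them."""
    results, errors = {}, {}
    for name, step in steps:
        try:
            results[name] = step(image_path)
        except Exception as exc:  # one failing step must not block the rest
            errors[name] = str(exc)
    return results, errors


# Stand-in steps for illustration only.
def read_exif(path):
    return {"path": path}


def detect_faces(path):
    raise RuntimeError("model not loaded")


def caption(path):
    return "a photo"


steps = [("inventory", read_exif), ("faces", detect_faces), ("caption", caption)]
results, errors = run_pipeline("a.jpg", steps)
print(results)
print(errors)
```

Because each step is just a named callable, swapping a model for an alternative means replacing one entry in the list.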

[Pipeline diagram: Image → Inventory → Faces → OCR → Species → Geocode → Weather → Story → Enriched]
Step | What It Does
Inventory | Catalogs images, extracts EXIF data, builds working set
Normalise | Orientation correction, consistent sizing
Perceptual Hash | Fingerprints images for duplicate and near-duplicate detection
Face Detection | Finds faces in images using neural network detection
Face Embedding | Generates identity embeddings for face matching across photos
Face Clustering | Groups faces across images into distinct identities
Face from Person | Extracts face crops from full-body person detections
Object Detection | Identifies objects in images (80+ categories)
Image Captioning & OCR | Generates descriptions, dense captions, and reads text in images
Species Identification | Identifies animal and plant species
Text Recognition | Dedicated text recognition on detected text regions
Semantic Embeddings | Generates embeddings for visual search and similarity matching
Depth Estimation | Estimates depth from a single image for spatial understanding
Colour Analysis | Extracts colour palette and dominant colour information
B&W Detection | Detects black-and-white images and estimates original era
Era Detection | Estimates when a photo was originally taken from visual cues
Environment | Classifies indoor/outdoor, scene type, and lighting conditions
Pose & Activity | Estimates human poses and what people are doing
Scene Understanding | High-level scene analysis and context extraction
Geocoding | Converts GPS coordinates to place names and addresses
Weather Lookup | Retrieves historical weather conditions for the date and location
Collection | Groups related images into logical collections
Face Animation | Generates animation data for living portrait effects
...and growing. New analysis steps are added as models improve.
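As an illustration of the Perceptual Hash step, here is a sketch of one common technique, the average hash, compared by Hamming distance. Which hashing algorithm PhotoSpeak actually uses is not stated, so this is an assumed stand-in; it expects a small grayscale grid (for example an image already downscaled to 8x8 by an imaging library).

```python
def average_hash(pixels):
    """Hash a 2D grid of grayscale values: one bit per pixel, set if above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count differing bits; small distances suggest near-duplicate images."""
    return bin(a ^ b).count("1")


# Synthetic 8x8 gradient standing in for a downscaled photo.
image = [[(row * 8 + col) * 4 for col in range(8)] for row in range(8)]
brighter = [[value + 3 for value in row] for row in image]   # same picture, slightly brighter
inverted = [[255 - value for value in row] for row in image]  # very different picture

print(hamming_distance(average_hash(image), average_hash(brighter)))
print(hamming_distance(average_hash(image), average_hash(inverted)))
```

A uniform brightness shift moves every pixel and the mean together, so the fingerprint is unchanged, which is why this family of hashes tolerates re-exports and light edits.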

Built for power users

Streaming Extraction

Fields appear progressively as they're generated. Watch title, description, keywords, and people fill in live.
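Progressive display of this kind is typically built on a generator that emits each field as soon as it is complete. The sketch below assumes, purely for illustration, that the model streams text in arbitrary chunks using a line-based `field: value` format; the real wire format is not described, and `stream_fields` is a hypothetical name.

```python
def stream_fields(chunks):
    """Yield (field, value) pairs as soon as each 'field: value' line is complete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line currently in the buffer.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if ":" in line:
                name, value = line.split(":", 1)
                yield name.strip(), value.strip()
    # Emit a final line that arrived without a trailing newline.
    if ":" in buffer:
        name, value = buffer.split(":", 1)
        yield name.strip(), value.strip()


# Chunk boundaries fall mid-field, as they would with a streaming response.
chunks = ["title: Nan at", " the beach\nkeyw", "ords: beach, family\n"]
for name, value in stream_fields(chunks):
    print(name, "=", value)
```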

Similar Photo Search

Semantic embeddings find visually similar photos in your collection. Discover connections and duplicates you didn't know existed.
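Embedding-based search usually ranks photos by cosine similarity between vectors. Here is a minimal sketch with toy three-dimensional vectors; real embeddings have hundreds of dimensions, and the `most_similar` helper and file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, index, top_k=3):
    """Return the paths of the top_k embeddings closest to the query."""
    scored = [(cosine_similarity(query, vec), path) for path, vec in index.items()]
    scored.sort(reverse=True)
    return [path for _, path in scored[:top_k]]


# Toy embeddings for illustration only.
index = {
    "beach1.jpg": [0.9, 0.1, 0.0],
    "beach2.jpg": [0.8, 0.2, 0.1],
    "forest.jpg": [0.0, 0.1, 0.9],
}

print(most_similar([1.0, 0.1, 0.0], index, top_k=2))
```

A linear scan like this is fine for a sketch; a large collection would normally use an approximate nearest-neighbour index instead.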

Custom Prompts

Edit the extraction prompts to customise how PhotoSpeak interprets your photos. Changes take effect immediately — no restart needed.

Supported Formats

JPEG, PNG, TIFF, WebP, and HEIC, all with full XMP read/write support.

Weather & Environment

Fetches historical weather, elevation, sunrise/sunset, and nearby points of interest from GPS coordinates and date.

Configurable Pipeline

Pipeline presets (Quick, Standard, Full) let you balance speed and depth. Individual steps can be enabled or disabled. Everything adapts to your hardware.
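Presets with per-step overrides can be pictured as a base list plus enable and disable adjustments. The preset names come from the text above, but which steps each preset contains is an assumption made for this sketch, as is the `resolve_steps` helper.

```python
# Assumed preset contents, for illustration only.
QUICK = ["inventory", "normalise", "perceptual_hash"]
STANDARD = QUICK + ["face_detection", "object_detection", "captioning"]
FULL = STANDARD + ["depth_estimation", "geocoding", "weather_lookup"]

PRESETS = {"quick": QUICK, "standard": STANDARD, "full": FULL}


def resolve_steps(preset, enable=(), disable=()):
    """Start from a preset, then apply individual enable/disable overrides."""
    steps = list(PRESETS[preset])
    for name in enable:
        if name not in steps:
            steps.append(name)
    disabled = set(disable)
    return [name for name in steps if name not in disabled]


print(resolve_steps("quick", enable=["geocoding"], disable=["perceptual_hash"]))
```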

Open standards. Local processing. Your data, your photos, your stories.