PhotoSpeak combines voice recording, computer vision, and language AI to turn spoken descriptions into rich, searchable, portable photo metadata. Here's what's happening behind the scenes.
Everything portable lives in the image file itself. Title, description, keywords, voice note text, voice audio, face regions — all stored as standard XMP metadata using industry-standard IPTC, Dublin Core, and MWG tags. Readable by Lightroom, digiKam, Apple Photos, and any XMP-aware software.
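To make the idea concrete, here is a minimal sketch of the Dublin Core portion of such an XMP packet, built with the standard library. Real XMP wraps `dc:title` and `dc:description` in `rdf:Alt` language alternatives and embeds the packet in the file via an XMP toolkit; this sketch flattens that for brevity, and the example values are illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def build_xmp(title, description, keywords):
    """Build a minimal Dublin Core XMP description block."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description")
    # dc:title / dc:description are language alternatives in full XMP;
    # flattened to plain elements here for brevity.
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}description").text = description
    # Keywords live in dc:subject as an unordered rdf:Bag.
    bag = ET.SubElement(ET.SubElement(desc, f"{{{DC}}}subject"), f"{{{RDF}}}Bag")
    for kw in keywords:
        ET.SubElement(bag, f"{{{RDF}}}li").text = kw
    return ET.tostring(rdf, encoding="unicode")

packet = build_xmp("Nan at the beach", "Margaret at Bondi, 1974", ["family", "beach"])
```

Because these are plain Dublin Core tags, any XMP-aware reader resolves them the same way.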
The database is rebuilt from the images on folder load. Delete it anytime — it exists purely for fast search and filtering. The files are the truth.
Nothing loose to get separated or forgotten. Everything travels with the photo. Copy it, share it, back it up — the story comes with it.
All metadata uses open, industry-standard tags. Delete PhotoSpeak and the stories remain in your photos, readable by any photo software that supports XMP.
All vision and audio models run locally: face detection, object recognition, captioning, depth estimation, species identification, speech-to-text. No image or audio data is sent to external servers for analysis; your photos never leave your computer.
Use Ollama for fully offline operation with any open-source model. Or choose Anthropic Claude or any OpenAI-compatible endpoint for cloud-quality extraction. The choice is yours.
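The two backend shapes differ only in their request format. This sketch builds the request for each without sending it; the Ollama endpoint path is its real `/api/generate` route, while the base URL and model names are placeholders you would replace with your own.

```python
import json

def build_request(backend, prompt, model):
    """Build the HTTP request for the chosen metadata-extraction backend.
    Model names here are placeholders; point the URL at your own server."""
    if backend == "ollama":
        # Ollama's local generate endpoint; stream=False returns one JSON reply.
        return ("http://localhost:11434/api/generate",
                {"model": model, "prompt": prompt, "stream": False})
    if backend == "openai-compatible":
        # Any OpenAI-compatible chat endpoint accepts this message shape.
        return ("https://api.example.com/v1/chat/completions",
                {"model": model, "messages": [{"role": "user", "content": prompt}]})
    raise ValueError(f"unknown backend: {backend}")

url, payload = build_request("ollama", "Extract keywords for this caption.", "llama3")
body = json.dumps(payload)
```

Because only the request builder changes, swapping backends never touches the rest of the pipeline.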
When sharing, voice clips, transcripts, and metadata in transit are end-to-end encrypted. Key exchange happens on collection invite. The relay server sees only opaque encrypted blobs.
Local-first with a thin relay server. Offline edits queue and merge on reconnect. The relay never sees originals or unencrypted metadata. Originals never leave your machine — only display-resolution copies and encrypted metadata pass through.
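One reasonable policy for merging queued offline edits is per-field last-write-wins, sketched below. This is an illustration of the sync idea, not PhotoSpeak's actual merge rules.

```python
def merge_edits(server_state, queued_edits):
    """Merge offline edits into server state; the newest timestamp wins
    per (photo, field) pair, so concurrent edits to different fields
    never clobber each other."""
    for edit in sorted(queued_edits, key=lambda e: e["ts"]):
        key = (edit["photo"], edit["field"])
        current = server_state.get(key)
        if current is None or edit["ts"] >= current["ts"]:
            server_state[key] = {"value": edit["value"], "ts": edit["ts"]}
    return server_state
```

In the real system both sides of the merge would be encrypted blobs from the relay's point of view; only the clients see field values.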
PhotoSpeak builds a persistent memory of the people, places, animals, events, and organisations in your world — a knowledge graph that grows every time you use it. You can teach it directly through the "Teach Me" onboarding, or it learns as you annotate.
PhotoSpeak resolves nicknames, maiden names, and informal references automatically. Say "Nan" and it knows you mean Margaret Elizabeth Walsh. Say "the old house" and it maps to 42 George Street.
Tracks relationships between entities: spouse, parent, child, sibling, friend, employer, attendee, alias. When you name a person, related context enriches the metadata.
Knowledge builds across your entire collection. Name a face once and the identity propagates. Mention a place once and it's linked everywhere it appears.
PhotoSpeak doesn't just listen — it asks follow-up questions based on what it sees and what it remembers about your family. It prompts for details you wouldn't have thought to mention, building richer metadata with each conversation.
When you open a folder, the pipeline runs automatically in the background. Each step is independent — if one fails, everything else keeps working. Models are modular and can be swapped for alternatives. New steps are added regularly.
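The fault isolation described above amounts to catching each step's failure independently, sketched here with hypothetical step functions:

```python
def run_pipeline(image, steps):
    """Run each analysis step independently; a failing step is recorded
    and skipped so every other step still completes."""
    results, errors = {}, {}
    for name, fn in steps:
        try:
            results[name] = fn(image)
        except Exception as exc:
            errors[name] = str(exc)   # isolate the failure, keep going
    return results, errors
```

Because steps share no state beyond the image, a swapped-in or newly added model slots into the list without touching the others.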
| Step | What It Does |
|---|---|
| Inventory | Catalogues images, extracts EXIF data, builds working set |
| Normalise | Orientation correction, consistent sizing |
| Perceptual Hash | Fingerprints images for duplicate and near-duplicate detection |
| Face Detection | Finds faces in images using neural network detection |
| Face Embedding | Generates identity embeddings for face matching across photos |
| Face Clustering | Groups faces across images into distinct identities |
| Face from Person | Extracts face crops from full-body person detections |
| Object Detection | Identifies objects in images (80+ categories) |
| Image Captioning & OCR | Generates descriptions, dense captions, and reads text in images |
| Species Identification | Identifies animal and plant species |
| Text Recognition | Dedicated text recognition on detected text regions |
| Semantic Embeddings | Generates embeddings for visual search and similarity matching |
| Depth Estimation | Estimates depth from a single image for spatial understanding |
| Colour Analysis | Extracts colour palette and dominant colour information |
| B&W Detection | Detects black-and-white images and estimates original era |
| Era Detection | Estimates when a photo was originally taken from visual cues |
| Environment | Classifies indoor/outdoor, scene type, and lighting conditions |
| Pose & Activity | Estimates human poses and what people are doing |
| Scene Understanding | High-level scene analysis and context extraction |
| Geocoding | Converts GPS coordinates to place names and addresses |
| Weather Lookup | Retrieves historical weather conditions for the date and location |
| Collection | Groups related images into logical collections |
| Face Animation | Generates animation data for living portrait effects |
| ...and growing | New analysis steps are added as models improve. |
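As one example of a pipeline step, the Perceptual Hash stage can be illustrated with a classic average hash: downscale to an 8×8 grayscale grid, then set one bit per pixel above the mean. This sketch takes the grid as given rather than doing the resize.

```python
def average_hash(pixels):
    """64-bit average hash of an 8x8 grayscale grid (values 0-255).
    A real pipeline resizes the image to 8x8 first; we take the grid as given."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(a, b):
    """Bit distance between two hashes: 0 = identical, small = near-duplicate."""
    return bin(a ^ b).count("1")
```

Near-duplicates (re-saves, light crops, small edits) land a few bits apart, so a low Hamming threshold groups them without exact matching.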
Fields appear progressively as they're generated. Watch title, description, keywords, and people fill in live.
Semantic embeddings find visually similar photos in your collection. Discover connections and duplicates you didn't know existed.
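Visual search over those embeddings reduces to nearest-neighbour ranking by cosine similarity, sketched here with plain lists standing in for real embedding vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_embedding, library):
    """Rank (path, embedding) pairs by similarity to the query photo."""
    return sorted(library, key=lambda item: cosine(query_embedding, item[1]),
                  reverse=True)
```

At collection scale an approximate-nearest-neighbour index would replace the linear scan, but the ranking idea is the same.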
Edit the extraction prompts to customise how PhotoSpeak interprets your photos. Changes take effect immediately — no restart needed.
JPEG, PNG, TIFF, WebP, and HEIC. All standard image formats with full XMP read/write support.
Fetches historical weather, elevation, sunrise/sunset, and nearby points of interest from GPS coordinates and date.
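A historical-weather lookup from GPS and date can be sketched as building a query against an Open-Meteo-style archive endpoint; the provider and the chosen daily variables are assumptions you would swap for whichever service PhotoSpeak is configured with.

```python
from urllib.parse import urlencode

def weather_url(lat, lon, date):
    """Build a historical-weather query for the photo's GPS fix and date
    (date as YYYY-MM-DD). Open-Meteo archive shape; swap in your provider."""
    params = {
        "latitude": f"{lat:.4f}",
        "longitude": f"{lon:.4f}",
        "start_date": date,
        "end_date": date,
        "daily": "temperature_2m_max,precipitation_sum",
    }
    return "https://archive-api.open-meteo.com/v1/archive?" + urlencode(params)

url = weather_url(-33.8568, 151.2153, "1974-01-26")
```

The same coordinates drive the geocoding, elevation, and sunrise/sunset lookups, so one GPS fix enriches several fields at once.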
Pipeline presets (Quick, Standard, Full) let you balance speed and depth. Individual steps can be enabled or disabled. Everything adapts to your hardware.
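Presets plus per-step overrides can be modelled as sets of enabled step names; the step names and preset contents below are illustrative, not the shipped defaults.

```python
# Illustrative preset definitions; step names follow the pipeline table above.
PRESETS = {
    "quick":    {"inventory", "normalise", "perceptual_hash", "face_detection"},
    "standard": {"inventory", "normalise", "perceptual_hash", "face_detection",
                 "object_detection", "captioning", "semantic_embeddings"},
}
PRESETS["full"] = PRESETS["standard"] | {"depth", "species", "pose", "weather"}

def enabled_steps(preset, overrides=None):
    """Start from a preset, then apply per-step enable/disable overrides."""
    steps = set(PRESETS[preset])
    for step, on in (overrides or {}).items():
        if on:
            steps.add(step)
        else:
            steps.discard(step)
    return steps
```

Hardware adaptation then becomes a matter of picking the preset and toggling the expensive steps off.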
Open standards. Local processing. Your data, your photos, your stories.