Everything under the hood

PhotoSpeak combines voice recording, computer vision, and language AI to turn your spoken descriptions into rich, searchable photo metadata.

Core Features

🎤 Voice Annotation

Record yourself talking about each photo. Multiple takes are stitched together automatically. Your raw words are preserved alongside the AI-polished version.
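Stitching multiple takes is plain concatenation at the audio level. A minimal sketch using Python's stdlib `wave` module (PhotoSpeak's pipeline works with Opus rather than WAV, and `stitch_takes` is a hypothetical name for illustration):

```python
import wave

def stitch_takes(take_paths, out_path):
    """Concatenate several WAV takes into one recording.

    Assumes all takes share the same sample rate, width, and
    channel count; real code would validate or resample.
    """
    frames, params = [], None
    for path in take_paths:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()
            frames.append(w.readframes(w.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)
```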

👤 Face Detection &amp; Clustering

YuNet neural network finds faces. Embeddings group them across photos into identities. Name one face and the name propagates to every match.
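The grouping and name propagation can be sketched in a few lines: faces whose embeddings are close join the same identity, and naming any member labels the whole cluster. A greedy pure-Python illustration (the actual clustering algorithm and threshold are not specified in the source):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def cluster_faces(embeddings, threshold=0.9):
    """Greedy sketch: each face joins the first identity whose
    seed embedding it matches, otherwise it starts a new one."""
    clusters = []  # list of (member_indices, seed_embedding)
    for i, emb in enumerate(embeddings):
        for members, seed in clusters:
            if cosine(emb, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append(([i], emb))
    return [members for members, _ in clusters]

def propagate_name(clusters, face_index, name):
    """Name one face and every face in its cluster gets the name."""
    return {i: name
            for members in clusters if face_index in members
            for i in members}
```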

🔍 Streaming Extraction

Fields appear progressively as the AI generates them. Watch title, description, keywords, and people fill in live.
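One way to surface fields before the full response arrives is to scan the growing buffer for completed key/value pairs. A regex-based sketch, not the app's actual parser, handling only string and flat-array values:

```python
import json
import re

# Matches "key": "string value" or "key": [flat array]
FIELD_RE = re.compile(r'"(\w+)"\s*:\s*("(?:[^"\\]|\\.)*"|\[[^\]]*\])')

def stream_fields(chunks):
    """Yield (field, value) pairs as soon as they complete in a
    streamed JSON response, without waiting for the final brace."""
    buf, seen = "", set()
    for chunk in chunks:
        buf += chunk
        for m in FIELD_RE.finditer(buf):
            name = m.group(1)
            if name not in seen:
                seen.add(name)
                yield name, json.loads(m.group(2))
```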

📷 Photographer Portrait

Embed a photo of yourself in every image you annotate. The photographer is finally part of the story, not just the person behind the camera.

🧠 Knowledge Graph

The "Teach Me" onboarding builds knowledge about people, places, and events. The AI resolves nicknames and adds richer context automatically.
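Nickname resolution reduces to an alias index over the learned people. A minimal sketch, with the data shape assumed for illustration:

```python
def build_alias_index(people):
    """people maps canonical name -> list of nicknames, as might be
    collected during onboarding (shape is an assumption here)."""
    index = {}
    for canonical, nicknames in people.items():
        index[canonical.lower()] = canonical
        for nick in nicknames:
            index[nick.lower()] = canonical
    return index

def resolve(name, index):
    """Map a spoken name to its canonical identity, or pass it
    through unchanged if it is unknown."""
    return index.get(name.lower(), name)
```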

📈 Similar Photos

CLIP embeddings find visually similar photos in your collection. Discover connections you didn't know existed.
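Similarity search over CLIP embeddings is a nearest-neighbour ranking by cosine similarity. A pure-Python sketch (a real collection would use numpy or a vector index):

```python
from math import sqrt

def top_similar(query, library, k=3):
    """Rank photos by cosine similarity of their CLIP embeddings.
    `library` maps photo id -> embedding vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) *
                      sqrt(sum(y * y for y in b)))
    ranked = sorted(library.items(),
                    key=lambda kv: cos(query, kv[1]),
                    reverse=True)
    return [pid for pid, _ in ranked[:k]]
```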

Era Detection

Automatically flags photos-of-photos where EXIF data belongs to the scanning device, not the original. Prevents wrong metadata from being applied.
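The core signal is a mismatch between the camera's EXIF date and the era the visual cues suggest. A sketch of that rule, with the tolerance an illustrative assumption rather than PhotoSpeak's actual threshold:

```python
def is_photo_of_photo(exif_year, estimated_era_year, tolerance=15):
    """Flag a likely scan or re-photograph: the EXIF date (from the
    scanning device) is far later than the estimated visual era."""
    if exif_year is None or estimated_era_year is None:
        return False
    return exif_year - estimated_era_year > tolerance
```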

🌦 Weather &amp; Environment

Fetches historical weather, elevation, sunrise/sunset, and nearby points of interest based on GPS coordinates and date.
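As an example of what such a lookup involves, here is a request builder for the public Open-Meteo historical archive API; the source does not say which provider PhotoSpeak actually queries, so treat the endpoint and parameters as an illustrative stand-in:

```python
from urllib.parse import urlencode

def weather_request_url(lat, lon, date):
    """Build a historical-weather query for one GPS point and date
    (date as an ISO string, e.g. "1998-07-04")."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": date,
        "end_date": date,
        "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
        "timezone": "auto",
    }
    return "https://archive-api.open-meteo.com/v1/archive?" + urlencode(params)
```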

📝 Custom Prompts

Edit the extraction prompts in the prompts/ folder. Changes take effect immediately — no restart needed.
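Restart-free reloading is typically a modification-time check before each use. A sketch of the mechanism (class and method names are illustrative, not PhotoSpeak's API):

```python
import os

class PromptLoader:
    """Re-read a prompt file whenever it changes on disk, so edits
    take effect on the next extraction without a restart."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._text = None

    def get(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:  # file changed (or first read)
            with open(self.path, encoding="utf-8") as f:
                self._text = f.read()
            self._mtime = mtime
        return self._text
```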

[Screenshot: Annotator interface]

23 Steps of Automatic Analysis

When you open a folder, the pipeline runs automatically in the background. Each step is independent — if one model fails to load, everything else keeps working.

| Step | What It Does |
| --- | --- |
| inventory | Catalogs images, extracts EXIF, builds working set |
| normalise | Orientation correction, consistent sizing |
| phash | Perceptual hashing for duplicate/near-duplicate detection |
| face_detect | Finds faces using YuNet neural network |
| face_embed | Generates face embeddings for identity matching |
| face_cluster | Groups faces across images into identities |
| face_from_person | Extracts face crops from full-body person detections |
| object_detect | YOLOv8 object detection (80+ categories) |
| florence_detect | Florence-2 captioning, dense detection, and OCR |
| bioclip_species | Identifies animal/plant species using BioCLIP |
| trocr_ocr | Dedicated text recognition on detected text regions |
| clip_embed | CLIP semantic embeddings for search and similarity |
| depth | MiDaS monocular depth estimation |
| dominant_colors | Extracts palette and dominant colour analysis |
| colourise | Detects B&W images, estimates original era |
| era_detect | Estimates photo era from visual cues |
| environment | Classifies indoor/outdoor, scene type, lighting |
| pose_activity | Estimates human poses and activities |
| visual_context | High-level scene understanding via vision LLM |
| geocode | Reverse geocodes GPS coordinates to place names |
| weather | Looks up historical weather for date + location |
| collection | Groups related images into collections |
| animate_faces | Generates face animation data for living portraits |
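To make one of these steps concrete, the phash idea can be sketched with a simple average hash: a tiny grayscale thumbnail becomes a bit vector, and near-duplicates are pairs whose hashes differ in few bits. Real perceptual hashes (pHash, dHash) add a DCT or gradient step; this is the minimal version:

```python
def average_hash(pixels):
    """Hash a tiny grayscale image (list of rows of 0-255 values):
    1 where a pixel is brighter than the mean, 0 otherwise."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Near-duplicate images have a small Hamming distance
    between their hashes."""
    return sum(a != b for a, b in zip(h1, h2))
```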
[Screenshot: Gallery view]

Built for Privacy and Portability

100% Local AI

Whisper, face detection, CLIP, Florence, YOLOv8, BioCLIP, MiDaS — all run on your machine. No data leaves your computer unless you choose a cloud LLM.

Multiple LLM Providers

Ollama (local, any model), Anthropic Claude (cloud, best quality), or any OpenAI-compatible endpoint (LM Studio, vLLM).

XMP Source of Truth

All metadata lives in the image file. SQLite is a local cache rebuilt on demand. No sidecar files. No database lock-in.
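Because the image files hold all the metadata, rebuilding the cache is just re-reading every file and repopulating tables. A sketch with Python's stdlib sqlite3, where `records` stands in for metadata parsed from each file's XMP and the schema is illustrative, not PhotoSpeak's actual schema:

```python
import sqlite3

def rebuild_cache(records, db_path=":memory:"):
    """Drop and repopulate the local cache from per-image metadata.
    Safe to throw away: the XMP in the files remains authoritative."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS photos")
    con.execute(
        "CREATE TABLE photos (path TEXT PRIMARY KEY, title TEXT, keywords TEXT)"
    )
    con.executemany(
        "INSERT INTO photos VALUES (?, ?, ?)",
        [(r["path"], r["title"], ",".join(r["keywords"])) for r in records],
    )
    con.commit()
    return con
```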

Opus Voice Encoding

Voice recordings compressed to Opus 24kbps (16kHz mono). Base64-encoded and embedded in XMP. ~200KB per minute.
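A back-of-envelope check of that figure: 24,000 bits/s × 60 s ÷ 8 = 180,000 bytes of raw Opus per minute, and base64 expands 3 bytes into 4 characters, so the embedded form is roughly a third larger. The ~200KB claim sits in that range; exact sizes depend on container overhead.

```python
def voice_note_bytes(seconds, bitrate_bps=24_000, base64=False):
    """Approximate payload size of a voice note at a constant
    bitrate, optionally after base64 encoding for XMP embedding."""
    raw = seconds * bitrate_bps // 8  # bits -> bytes
    return raw * 4 // 3 if base64 else raw
```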

Graceful Degradation

Each AI model is optional. If one fails to load, everything else keeps working. The pipeline adapts to your hardware.
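The mechanism behind this is isolating each step in its own error boundary, so one model's failure only blanks that step's output. A sketch (step names and signatures are illustrative):

```python
def run_pipeline(image, steps):
    """Run each analysis step independently; a failing step is
    recorded and skipped so the rest still complete."""
    results, errors = {}, {}
    for name, step in steps:
        try:
            results[name] = step(image)
        except Exception as exc:
            errors[name] = str(exc)
    return results, errors
```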

Standard Formats

JPEG, PNG, TIFF, WebP, HEIC. Standard IPTC/Dublin Core/MWG tags. Readable by Lightroom, digiKam, Apple Photos.

Model Stack

| Model | Size | Purpose |
| --- | --- | --- |
| YuNet / MediaPipe | tiny | Face detection |
| Florence-2 base | 0.23B | Captioning, object detection, OCR |
| YOLOv8 | varies | Object detection (80+ categories) |
| BioCLIP | 0.4B | Species identification |
| CLIP | 0.4B | Semantic embeddings for search |
| MiDaS | 0.1B | Monocular depth estimation |
| Whisper base | 0.07B | Speech-to-text transcription |
| Qwen 3 8B | 8B | Structured metadata extraction (text) |
| Qwen 2.5 VL 7B | 7B | Structured metadata extraction (vision) |