PhotoSpeak combines voice recording, computer vision, and language AI to turn your spoken descriptions into rich, searchable photo metadata.
Record yourself talking about each photo. Multiple takes are stitched together automatically. Your raw words are preserved alongside the AI-polished version.
The YuNet neural network detects faces. Embeddings group them across photos into identities. Name one face and the name propagates to every match.
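The exact clustering algorithm isn't specified here; a minimal sketch of the idea, using greedy cosine-similarity grouping over hypothetical embedding vectors (`cluster_faces` and `propagate_name` are illustrative names, not PhotoSpeak's API):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_faces(embeddings, threshold=0.9):
    """Greedy clustering: each face joins the first identity whose
    centroid it matches above `threshold`, else starts a new one."""
    identities = []  # each: {"centroid": [...], "members": [...]}
    for idx, emb in enumerate(embeddings):
        for ident in identities:
            if cosine(emb, ident["centroid"]) >= threshold:
                ident["members"].append(idx)
                n = len(ident["members"])
                # Running mean keeps the centroid up to date.
                ident["centroid"] = [
                    (c * (n - 1) + e) / n
                    for c, e in zip(ident["centroid"], emb)
                ]
                break
        else:
            identities.append({"centroid": list(emb), "members": [idx]})
    return identities

def propagate_name(identities, face_idx, name, labels):
    """Naming one face labels every member of its identity."""
    for ident in identities:
        if face_idx in ident["members"]:
            for m in ident["members"]:
                labels[m] = name
```

Real face embeddings are high-dimensional; the threshold would be tuned per embedding model.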
Fields appear progressively as the AI generates them. Watch title, description, keywords, and people fill in live.
Embed a photo of yourself in every image you annotate. The photographer is finally part of the story, not just the person behind the camera.
The "Teach Me" onboarding builds knowledge about people, places, and events. The AI resolves nicknames and adds richer context automatically.
CLIP embeddings find visually similar photos in your collection. Discover connections you didn't know existed.
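As a sketch of how similar-photo search over CLIP embeddings works (brute-force cosine ranking; PhotoSpeak's actual index structure isn't stated):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def most_similar(query, library, top_k=3):
    """Rank every photo in `library` ({path: embedding}) by cosine
    similarity to `query` and return the top_k (path, score) pairs."""
    scored = sorted(
        ((path, cosine(query, emb)) for path, emb in library.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:top_k]
```

For large collections this linear scan would typically be replaced with an approximate-nearest-neighbour index.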
Automatically flags photos-of-photos, where the EXIF data belongs to the scanning device rather than the original photograph, preventing the wrong metadata from being applied.
Fetches historical weather, elevation, sunrise/sunset, and nearby points of interest based on GPS coordinates and date.
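The weather service PhotoSpeak queries isn't named here. As one illustration, a free option is Open-Meteo's historical archive API; the endpoint and parameter names below follow its public docs and should be verified before use:

```python
from urllib.parse import urlencode

def weather_url(lat, lon, date):
    """Build a historical-weather request for a photo's GPS fix and date.
    Endpoint/params assume Open-Meteo's archive API (verify against its docs)."""
    base = "https://archive-api.open-meteo.com/v1/archive"
    params = {
        "latitude": f"{lat:.4f}",
        "longitude": f"{lon:.4f}",
        "start_date": date,   # ISO date, e.g. "1999-07-04"
        "end_date": date,
        "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
        "timezone": "auto",
    }
    return f"{base}?{urlencode(params)}"
```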
Edit the extraction prompts in the prompts/ folder. Changes take effect immediately — no restart needed.
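One common way to make prompt edits take effect without a restart is to re-read the file whenever its modification time changes. A hypothetical sketch (`load_prompt` is an illustrative helper, not PhotoSpeak's actual mechanism):

```python
import os

_cache = {}  # path -> (mtime, text)

def load_prompt(path):
    """Return the prompt text, re-reading the file whenever it changes
    on disk so edits apply on the next use without restarting."""
    mtime = os.path.getmtime(path)
    cached = _cache.get(path)
    if cached and cached[0] == mtime:
        return cached[1]
    with open(path, encoding="utf-8") as f:
        text = f.read()
    _cache[path] = (mtime, text)
    return text
```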
When you open a folder, the pipeline runs automatically in the background. Each step is independent — if one model fails to load, everything else keeps working.
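Step independence of this kind usually comes down to isolating each step behind its own error boundary. A minimal sketch of that pattern (the step names and result shape are illustrative, not PhotoSpeak's internals):

```python
def run_pipeline(image, steps):
    """Run each (name, fn) step independently; a failing step records
    an error instead of aborting the rest of the pipeline."""
    results = {}
    for name, step in steps:
        try:
            results[name] = {"ok": True, "data": step(image)}
        except Exception as exc:  # e.g. a model that failed to load
            results[name] = {"ok": False, "error": str(exc)}
    return results
```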
| Step | What It Does |
|---|---|
| inventory | Catalogs images, extracts EXIF, builds working set |
| normalise | Orientation correction, consistent sizing |
| phash | Perceptual hashing for duplicate/near-duplicate detection |
| face_detect | Finds faces using the YuNet neural network |
| face_embed | Generates face embeddings for identity matching |
| face_cluster | Groups faces across images into identities |
| face_from_person | Extracts face crops from full-body person detections |
| object_detect | YOLOv8 object detection (80+ categories) |
| florence_detect | Florence-2 captioning, dense detection, and OCR |
| bioclip_species | Identifies animal/plant species using BioCLIP |
| trocr_ocr | Dedicated text recognition on detected text regions |
| clip_embed | CLIP semantic embeddings for search and similarity |
| depth | MiDaS monocular depth estimation |
| dominant_colors | Extracts colour palette and dominant colours |
| colourise | Detects B&W images, estimates original era |
| era_detect | Estimates photo era from visual cues |
| environment | Classifies indoor/outdoor, scene type, lighting |
| pose_activity | Estimates human poses and activities |
| visual_context | High-level scene understanding via vision LLM |
| geocode | Reverse geocodes GPS coordinates to place names |
| weather | Looks up historical weather for date + location |
| collection | Groups related images into collections |
| animate_faces | Generates face animation data for living portraits |
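The phash step's near-duplicate detection can be illustrated with a difference hash; the specific perceptual-hash variant PhotoSpeak uses isn't stated, and real implementations first downscale the image to a small grayscale grid:

```python
def dhash(pixels):
    """Difference hash over a grayscale grid (rows of equal length):
    each bit records whether a pixel is brighter than its right neighbour."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; small distances mean near-duplicates."""
    return bin(a ^ b).count("1")
```

Two photos would then be flagged as near-duplicates when the Hamming distance between their hashes falls below a small threshold.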
Whisper, face detection, CLIP, Florence, YOLOv8, BioCLIP, MiDaS — all run on your machine. No data leaves your computer unless you choose a cloud LLM.
Choose an LLM backend: Ollama (local, any model), Anthropic Claude (cloud, best quality), or any OpenAI-compatible endpoint (LM Studio, vLLM).
All metadata lives in the image file. SQLite is a local cache rebuilt on demand. No sidecar files. No database lock-in.
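A "rebuilt on demand" cache means the database can be dropped and repopulated from the image files at any time. A hypothetical sketch of that idea with `sqlite3` (the schema is illustrative, not PhotoSpeak's):

```python
import sqlite3

def rebuild_cache(db_path, photos):
    """Recreate the local cache from scratch. The image files remain the
    single source of truth; `photos` maps file path -> title read from XMP."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS photos")
    con.execute("CREATE TABLE photos (path TEXT PRIMARY KEY, title TEXT)")
    con.executemany("INSERT INTO photos VALUES (?, ?)", photos.items())
    con.commit()
    return con
```

Because nothing in the database is authoritative, deleting it loses no data.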
Voice recordings compressed to Opus 24kbps (16kHz mono). Base64-encoded and embedded in XMP. ~200KB per minute.
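The per-minute figure can be sanity-checked: 24 kbps is 3,000 bytes/s of Opus audio, and base64 expands that by 4/3. A small estimator (assumed arithmetic, ignoring container and XMP framing overhead):

```python
def xmp_audio_size(seconds, bitrate_kbps=24):
    """Estimate embedded-recording size: raw Opus bytes, then base64
    expansion (4 output bytes per 3 input bytes)."""
    raw = seconds * bitrate_kbps * 1000 // 8  # bytes of Opus audio
    b64 = (raw + 2) // 3 * 4                  # bytes after base64 encoding
    return raw, b64
```

One minute works out to roughly 176 KiB raw and 234 KiB base64-encoded, so "~200 KB per minute" is the right order of magnitude.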
Each AI model is optional. If one fails to load, everything else keeps working. The pipeline adapts to your hardware.
JPEG, PNG, TIFF, WebP, HEIC. Standard IPTC/Dublin Core/MWG tags. Readable by Lightroom, digiKam, Apple Photos.
| Model | Size | Purpose |
|---|---|---|
| YuNet / MediaPipe | tiny | Face detection |
| Florence-2 base | 0.23B | Captioning, object detection, OCR |
| YOLOv8 | varies | Object detection (80+ categories) |
| BioCLIP | 0.4B | Species identification |
| CLIP | 0.4B | Semantic embeddings for search |
| MiDaS | 0.1B | Monocular depth estimation |
| Whisper base | 0.07B | Speech-to-text transcription |
| Qwen 3 8B | 8B | Structured metadata extraction (text) |
| Qwen 2.5 VL 7B | 7B | Structured metadata extraction (vision) |