PhotoSpeak combines voice recording, computer vision, and language AI to turn your spoken descriptions into rich, searchable photo metadata.
Record yourself talking about each photo. Multiple takes are stitched together automatically. Your raw words are preserved alongside the AI-polished version.
The YuNet neural network detects faces. Embeddings group them across photos into identities. Name one face and the name propagates to every match.
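The exact clustering algorithm isn't specified here; a minimal sketch of the idea, using greedy cosine-similarity grouping over hypothetical embedding vectors (`cluster_faces` and `propagate_name` are illustrative names, not PhotoSpeak's API):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_faces(embeddings, threshold=0.9):
    """Greedy clustering: each face joins the first identity whose
    centroid it matches above `threshold`, else starts a new one."""
    identities = []  # each: {"centroid": [...], "members": [...]}
    for idx, emb in enumerate(embeddings):
        for ident in identities:
            if cosine(emb, ident["centroid"]) >= threshold:
                ident["members"].append(idx)
                n = len(ident["members"])
                # Running mean keeps the centroid up to date.
                ident["centroid"] = [
                    (c * (n - 1) + e) / n
                    for c, e in zip(ident["centroid"], emb)
                ]
                break
        else:
            identities.append({"centroid": list(emb), "members": [idx]})
    return identities

def propagate_name(identities, face_idx, name, labels):
    """Naming one face labels every member of its identity."""
    for ident in identities:
        if face_idx in ident["members"]:
            for m in ident["members"]:
                labels[m] = name
```

Real face embeddings are high-dimensional; the threshold would be tuned per embedding model.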
Fields appear progressively as the AI generates them. Watch title, description, keywords, and people fill in live.
Embed a photo of yourself in every image you annotate. The photographer is finally part of the story, not just the person behind the camera.
The "Teach Me" onboarding builds knowledge about people, places, and events. The AI resolves nicknames and adds richer context automatically.
CLIP embeddings find visually similar photos in your collection. Discover connections you didn't know existed.
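As a sketch of how similar-photo search over CLIP embeddings works (brute-force cosine ranking; PhotoSpeak's actual index structure isn't stated):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def most_similar(query, library, top_k=3):
    """Rank every photo in `library` ({path: embedding}) by cosine
    similarity to `query` and return the top_k (path, score) pairs."""
    scored = sorted(
        ((path, cosine(query, emb)) for path, emb in library.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:top_k]
```

For large collections this linear scan would typically be replaced with an approximate-nearest-neighbour index.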
Automatically flags photos-of-photos, where the EXIF data belongs to the scanning device rather than the original photograph, preventing the wrong metadata from being applied.
Fetches historical weather, elevation, sunrise/sunset, and nearby points of interest based on GPS coordinates and date.
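The weather service PhotoSpeak queries isn't named here. As one illustration, a free option is Open-Meteo's historical archive API; the endpoint and parameter names below follow its public docs and should be verified before use:

```python
from urllib.parse import urlencode

def weather_url(lat, lon, date):
    """Build a historical-weather request for a photo's GPS fix and date.
    Endpoint/params assume Open-Meteo's archive API (verify against its docs)."""
    base = "https://archive-api.open-meteo.com/v1/archive"
    params = {
        "latitude": f"{lat:.4f}",
        "longitude": f"{lon:.4f}",
        "start_date": date,   # ISO date, e.g. "1999-07-04"
        "end_date": date,
        "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
        "timezone": "auto",
    }
    return f"{base}?{urlencode(params)}"
```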
Edit the extraction prompts in the prompts/ folder. Changes take effect immediately — no restart needed.
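One common way to make prompt edits take effect without a restart is to re-read the file whenever its modification time changes. A hypothetical sketch (`load_prompt` is an illustrative helper, not PhotoSpeak's actual mechanism):

```python
import os

_cache = {}  # path -> (mtime, text)

def load_prompt(path):
    """Return the prompt text, re-reading the file whenever it changes
    on disk so edits apply on the next use without restarting."""
    mtime = os.path.getmtime(path)
    cached = _cache.get(path)
    if cached and cached[0] == mtime:
        return cached[1]
    with open(path, encoding="utf-8") as f:
        text = f.read()
    _cache[path] = (mtime, text)
    return text
```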
When you open a folder, the pipeline runs automatically in the background. Each step is independent — if one model fails to load, everything else keeps working.
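Step independence of this kind usually comes down to isolating each step behind its own error boundary. A minimal sketch of that pattern (the step names and result shape are illustrative, not PhotoSpeak's internals):

```python
def run_pipeline(image, steps):
    """Run each (name, fn) step independently; a failing step records
    an error instead of aborting the rest of the pipeline."""
    results = {}
    for name, step in steps:
        try:
            results[name] = {"ok": True, "data": step(image)}
        except Exception as exc:  # e.g. a model that failed to load
            results[name] = {"ok": False, "error": str(exc)}
    return results
```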
| Step | What It Does |
|---|---|
| inventory | Catalogs images, extracts EXIF, builds working set |
| normalise | Orientation correction, consistent sizing |
| phash | Perceptual hashing for duplicate/near-duplicate detection |
| face_detect | Finds faces using the YuNet neural network |
| face_embed | Generates face embeddings for identity matching |
| face_cluster | Groups faces across images into identities |
| face_from_person | Extracts face crops from full-body person detections |
| object_detect | YOLOv8 object detection (80+ categories) |
| florence_detect | Florence-2 captioning, dense detection, and OCR |
| bioclip_species | Identifies animal/plant species using BioCLIP |
| trocr_ocr | Dedicated text recognition on detected text regions |
| clip_embed | CLIP semantic embeddings for search and similarity |
| depth | MiDaS monocular depth estimation |
| dominant_colors | Extracts colour palette and dominant colours |
| colourise | Detects B&W images, estimates original era |
| era_detect | Estimates photo era from visual cues |
| environment | Classifies indoor/outdoor, scene type, lighting |
| pose_activity | Estimates human poses and activities |
| visual_context | High-level scene understanding via vision LLM |
| geocode | Reverse geocodes GPS coordinates to place names |
| weather | Looks up historical weather for date + location |
| collection | Groups related images into collections |
| animate_faces | Generates face animation data for living portraits |
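The phash step's near-duplicate detection can be illustrated with a difference hash; the specific perceptual-hash variant PhotoSpeak uses isn't stated, and real implementations first downscale the image to a small grayscale grid:

```python
def dhash(pixels):
    """Difference hash over a grayscale grid (rows of equal length):
    each bit records whether a pixel is brighter than its right neighbour."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; small distances mean near-duplicates."""
    return bin(a ^ b).count("1")
```

Two photos would then be flagged as near-duplicates when the Hamming distance between their hashes falls below a small threshold.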
Whisper, face detection, CLIP, Florence, YOLOv8, BioCLIP, MiDaS — all run on your machine. No data leaves your computer unless you choose a cloud LLM.
Choose an LLM backend: Ollama (local, any model), Anthropic Claude (cloud, best quality), or any OpenAI-compatible endpoint (LM Studio, vLLM).
All metadata lives in the image file. SQLite is a local cache rebuilt on demand. No sidecar files. No database lock-in.
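A "rebuilt on demand" cache means the database can be dropped and repopulated from the image files at any time. A hypothetical sketch of that idea with `sqlite3` (the schema is illustrative, not PhotoSpeak's):

```python
import sqlite3

def rebuild_cache(db_path, photos):
    """Recreate the local cache from scratch. The image files remain the
    single source of truth; `photos` maps file path -> title read from XMP."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS photos")
    con.execute("CREATE TABLE photos (path TEXT PRIMARY KEY, title TEXT)")
    con.executemany("INSERT INTO photos VALUES (?, ?)", photos.items())
    con.commit()
    return con
```

Because nothing in the database is authoritative, deleting it loses no data.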
Voice recordings compressed to Opus 24kbps (16kHz mono). Base64-encoded and embedded in XMP. ~200KB per minute.
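The per-minute figure can be sanity-checked: 24 kbps is 3,000 bytes/s of Opus audio, and base64 expands that by 4/3. A small estimator (assumed arithmetic, ignoring container and XMP framing overhead):

```python
def xmp_audio_size(seconds, bitrate_kbps=24):
    """Estimate embedded-recording size: raw Opus bytes, then base64
    expansion (4 output bytes per 3 input bytes)."""
    raw = seconds * bitrate_kbps * 1000 // 8  # bytes of Opus audio
    b64 = (raw + 2) // 3 * 4                  # bytes after base64 encoding
    return raw, b64
```

One minute works out to roughly 176 KiB raw and 234 KiB base64-encoded, so "~200 KB per minute" is the right order of magnitude.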
Each AI model is optional. If one fails to load, everything else keeps working. The pipeline adapts to your hardware.
JPEG, PNG, TIFF, WebP, HEIC. Standard IPTC/Dublin Core/MWG tags. Readable by Lightroom, digiKam, Apple Photos.
| Model | Size | Purpose |
|---|---|---|
| YuNet / MediaPipe | tiny | Face detection |
| Florence-2 base | 0.23B | Captioning, object detection, OCR |
| YOLOv8 | varies | Object detection (80+ categories) |
| BioCLIP | 0.4B | Species identification |
| CLIP | 0.4B | Semantic embeddings for search |
| MiDaS | 0.1B | Monocular depth estimation |
| Whisper base | 0.07B | Speech-to-text transcription |
| Qwen 3 8B | 8B | Structured metadata extraction (text) |
| Qwen 2.5 VL 7B | 7B | Structured metadata extraction (vision) |