AI-Powered Personalised Story Generator: A Six-Stage Multimodal Pipeline for Theme-Controlled Narrated Video Synthesis
Keywords:
Multimodal AI, Automated Storytelling, Video Synthesis, Vision-Language Models, Text-to-Speech, CLIP, GPT-4o-mini, Streamlit
Abstract
We present the Multimodal Story Generator (MSG), a six-stage AI pipeline that transforms heterogeneous inputs (text prompts, still images, and video clips) into themed, narrated MP4 video stories with synchronised captions and background music. MSG integrates OpenAI CLIP ViT-B/32 for 512-dimensional visual feature extraction; Sentence-Transformers all-MiniLM-L6-v2 for text embeddings, with a novel normalisation scheme that maps them to the same 512 dimensions; Gaussian Mixture Model clustering for scene grouping; the GPT-4o-mini Vision API for image-grounded story generation; a multi-backend Text-to-Speech engine with six emotionally modulated voice profiles; and a PIL/MoviePy composition pipeline with cross-platform caption rendering. The system supports six narrative themes (Default, Adventure, Romance, Comedy, Mystery, Documentary) and is deployed as an interactive Streamlit web application. Three engineering contributions address portability issues: a unified embedding dimension normalisation that enables mixed-modality clustering, PIL-based caption compositing that eliminates a platform-specific MoviePy crash on Windows, and live environment variable injection that ensures API keys propagate at runtime. We report system characterisation experiments covering pipeline latency, component integration correctness, and qualitative output assessment.
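To make the embedding dimension normalisation concrete, the following is a minimal sketch of one way to bring 384-dimensional all-MiniLM-L6-v2 text embeddings into the 512-dimensional CLIP space. The abstract does not specify the scheme, so the zero-padding plus L2 normalisation used here is an assumption, and the helper name `to_common_dim` is hypothetical. The example uses only the standard library and placeholder vectors rather than real model outputs.

```python
import math
from typing import List, Sequence

TARGET_DIM = 512  # CLIP ViT-B/32 embedding size, as stated in the abstract


def to_common_dim(vec: Sequence[float], target_dim: int = TARGET_DIM) -> List[float]:
    """Zero-pad (or truncate) a vector to target_dim, then L2-normalise it.

    Assumed scheme for illustration only; the paper's actual method may differ.
    """
    values = list(vec[:target_dim])
    values.extend([0.0] * (target_dim - len(values)))
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return values  # leave an all-zero vector unchanged
    return [v / norm for v in values]


# Placeholder vectors standing in for real model outputs.
text_vec = [0.5] * 384    # all-MiniLM-L6-v2 produces 384-d embeddings
image_vec = [0.25] * 512  # CLIP ViT-B/32 produces 512-d embeddings

text_512 = to_common_dim(text_vec)
image_512 = to_common_dim(image_vec)
print(len(text_512), len(image_512))
```

Once both modalities share a dimension and a unit norm, they can be stacked into a single matrix and passed to a clustering step such as a Gaussian Mixture Model. Zero-padding preserves the original text-embedding geometry but does not align the two embedding spaces semantically; a learned projection would be an alternative design.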