AI-Powered Personalised Story Generator: A Six-Stage Multimodal Pipeline for Theme-Controlled Narrated Video Synthesis

Authors

  • Sarubala T, Department of Computer Science, Saranathan College of Engineering, Tiruchirapalli, Tamil Nadu, India
  • Shruthika B E, Department of Computer Science, Saranathan College of Engineering, Tiruchirapalli, Tamil Nadu, India
  • Yavvna Lakshmi J, Department of Computer Science, Saranathan College of Engineering, Tiruchirapalli, Tamil Nadu, India

Keywords

Multimodal AI, Automated Storytelling, Video Synthesis, Vision-Language Models, Text-to-Speech, CLIP, GPT-4o-mini, Streamlit

Abstract

We present the Multimodal Story Generator (MSG), a six-stage AI pipeline that transforms heterogeneous inputs (text prompts, still images, and video clips) into themed, narrated MP4 video stories with synchronised captions and background music. MSG integrates six components: OpenAI CLIP ViT-B/32 for 512-dimensional visual feature extraction; Sentence-Transformers all-MiniLM-L6-v2 for text embeddings, which a novel dimensionality normalisation scheme maps to 512 dimensions; Gaussian Mixture Model clustering for scene grouping; the GPT-4o-mini Vision API for image-grounded story generation; a multi-backend Text-to-Speech engine with six emotionally modulated voice profiles; and a PIL/MoviePy composition pipeline with cross-platform caption rendering. The system supports six narrative themes (Default, Adventure, Romance, Comedy, Mystery, Documentary) and is deployed as an interactive Streamlit web application. Three engineering contributions address portability issues: a unified embedding dimension normalisation that enables mixed-modality clustering, PIL-based caption compositing that avoids a platform-specific MoviePy crash on Windows, and live environment variable injection that ensures the API key is propagated at runtime. We report system characterisation experiments covering pipeline latency, component integration correctness, and qualitative output assessment.
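
To illustrate the idea behind the unified embedding dimension, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the text embeddings are brought to 512 dimensions, so zero-padding followed by L2 normalisation is assumed here purely for illustration. The function name `normalise_dim` and the placeholder vectors are hypothetical; in practice the vectors would come from all-MiniLM-L6-v2 (384 dimensions) and CLIP ViT-B/32 (512 dimensions).

```python
import math
from typing import List, Sequence

TARGET_DIM = 512  # CLIP ViT-B/32 embedding width, as stated in the abstract


def normalise_dim(vec: Sequence[float], target_dim: int = TARGET_DIM) -> List[float]:
    """Zero-pad (or truncate) a vector to target_dim, then L2-normalise it.

    Zero-padding is an assumption made for illustration; the source does not
    describe the actual scheme.
    """
    padded = list(vec[:target_dim]) + [0.0] * max(0, target_dim - len(vec))
    norm = math.sqrt(sum(x * x for x in padded))
    if norm == 0.0:
        return padded  # leave an all-zero vector unchanged
    return [x / norm for x in padded]


# Placeholder vectors standing in for real model outputs:
# all-MiniLM-L6-v2 yields 384-d text embeddings, CLIP ViT-B/32 yields 512-d ones.
text_vec = [0.5] * 384
image_vec = [0.25] * 512

# After normalisation both modalities share one 512-d space and could be
# stacked into a single matrix for clustering.
unified = [normalise_dim(text_vec), normalise_dim(image_vec)]
```

Once every embedding has the same width and unit length, text and image vectors can be passed together to a single clustering step such as the Gaussian Mixture Model named above.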

Published

2026-04-15