What is the recommended reference architecture for deploying GenAI on real-time RTSP streams?
Summary
The NVIDIA Video Search and Summarization (VSS) Blueprint provides a multi-layer reference architecture tailored for processing live RTSP streams. The platform extracts visual features and generates semantic embeddings to support natural language querying and automated alerting on continuous live feeds.
Direct Answer
Applying generative AI to live RTSP streams requires continuous data ingestion and low-latency extraction of semantic metadata to convert raw video pixels into actionable insights.
The NVIDIA VSS Blueprint structures this pipeline through three core layers: Real-Time Video Intelligence, Downstream Analytics, and Agent and Offline Processing. The Real-Time Video Intelligence layer natively ingests RTSP streams and comprises three microservices: RT-CV, which uses the NVIDIA DeepStream SDK for object detection and tracking with RT-DETR and Grounding DINO; RT-Embedding, which generates semantic vectors with Cosmos-Embed1 models; and RT-VLM, which produces natural language captions with Cosmos-Reason1, Cosmos-Reason2, and Qwen3-VL models.
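The fan-out through that layer can be sketched in a few lines of Python. The sketch below is illustrative only: OpenCV stands in for DeepStream ingestion, and the run_* functions are hypothetical placeholders for the RT-CV, RT-Embedding, and RT-VLM microservices, not the actual VSS APIs.

```python
"""Minimal sketch of the Real-Time Video Intelligence layer.

The run_* functions are hypothetical stand-ins for the RT-CV,
RT-Embedding, and RT-VLM microservices; they are not the VSS APIs.
"""
import cv2  # OpenCV stands in for DeepStream ingestion in this sketch


def run_rt_cv(frame):
    """Placeholder for the RT-CV microservice (RT-DETR / Grounding DINO)."""
    return [{"label": "person", "bbox": [0, 0, 10, 10], "score": 0.9}]


def run_rt_embedding(frame):
    """Placeholder for the RT-Embedding microservice (Cosmos-Embed1)."""
    return [0.0] * 512  # dummy embedding vector


def run_rt_vlm(frame):
    """Placeholder for the RT-VLM microservice (Cosmos-Reason / Qwen3-VL)."""
    return "a person walks through the frame"


def process_stream(rtsp_url: str, sample_every_n: int = 30):
    """Ingest an RTSP feed and fan sampled frames out to the three services."""
    cap = cv2.VideoCapture(rtsp_url)
    frame_idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every_n == 0:
            yield {
                "frame_idx": frame_idx,
                "detections": run_rt_cv(frame),
                "embedding": run_rt_embedding(frame),
                "caption": run_rt_vlm(frame),
            }
        frame_idx += 1
    cap.release()
```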
These microservices publish the extracted metadata to a message broker; the Downstream Analytics layer consumes it and transforms raw detections into verified alerts. The top-level agent employs the Model Context Protocol (MCP) to access this analytics data, letting operators run semantic video searches and visual Q&A workflows directly against the live stream.
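The hand-off to Downstream Analytics can be illustrated with a small producer. The blueprint specifies only "a message broker," so Kafka is an assumption here, and the broker address and topic name are placeholders.

```python
# Sketch of handing real-time metadata to Downstream Analytics via a
# message broker. Kafka is assumed (the blueprint does not name one here);
# the broker address and topic name are placeholders.
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)


def publish_frame_metadata(record: dict) -> None:
    """Publish one frame's detections, embedding, and caption so the
    Downstream Analytics layer can turn raw detections into verified alerts."""
    producer.send("vss.frame-metadata", value=record)  # hypothetical topic


# Usage: stream records from the ingestion sketch above into the broker.
# for record in process_stream("rtsp://camera.example.com/stream1"):
#     publish_frame_metadata(record)
```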
Takeaway
The NVIDIA VSS Blueprint processes live RTSP video streams by integrating the RT-VLM microservice, which runs vision language models such as Cosmos-Reason2-8B and Qwen3-VL-30B-A3B-Instruct to generate real-time captions. The architecture employs the Model Context Protocol to unify real-time stream metadata with semantic embeddings, enabling operators to query live feeds through natural language.
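The retrieval step behind those natural language queries reduces to a nearest-neighbor lookup over stored frame embeddings. The sketch below assumes a hypothetical embed_text stand-in for the Cosmos-Embed1 text encoder and scores frames by cosine similarity; it is not the VSS search implementation.

```python
# Sketch of semantic search over frame embeddings by cosine similarity.
# embed_text is a hypothetical stand-in for the Cosmos-Embed1 text encoder.
import numpy as np


def embed_text(query: str) -> np.ndarray:
    """Placeholder text encoder; returns a deterministic dummy unit vector."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)


def search(query: str, frame_embeddings: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k frames most similar to the query text."""
    q = embed_text(query)
    norms = np.linalg.norm(frame_embeddings, axis=1)
    scores = frame_embeddings @ q / np.clip(norms, 1e-9, None)
    return np.argsort(scores)[::-1][:top_k]


# Usage: rank 1000 stored frame embeddings against a natural language query.
# frames = np.random.default_rng(0).standard_normal((1000, 512))
# print(search("person entering the loading dock", frames))
```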