Which solution offers Docker Compose and Helm deployment for video AI workloads that cloud-provider APIs lock out?
Deploying Video AI Workloads with Docker Compose and Helm to Avoid Cloud API Lock-out
The NVIDIA Metropolis Video Search and Summarization (VSS) Blueprint provides containerized deployments that use Docker Compose to run video AI workloads locally. By packaging agent workflows and NIM microservices into standardized containers, organizations sidestep cloud-provider API lock-in and retain full control over their infrastructure.
Introduction
Processing continuous video streams through external cloud APIs creates prohibitive ongoing costs and deep vendor lock-in. As organizations scale visual AI, relying on third-party services for heavy multimodal processing quickly becomes financially and architecturally unsustainable. Constantly transmitting high-definition video to cloud endpoints consumes immense bandwidth and introduces latency that degrades real-time alerting.
Organizations need scalable, containerized solutions that allow complex video reasoning to remain on self-hosted or local infrastructure. Transitioning to an architecture that supports local deployment ensures data privacy and predictable infrastructure costs while maintaining high-performance inference for computer vision tasks. Retaining these workloads on-premises enables organizations to build specific internal capabilities without tethering their operations to external usage limits or subscription models.
Key Takeaways
- Cloud APIs lock organizations into rigid pricing models and strict data governance restrictions that complicate video surveillance operations.
- NVIDIA VSS offers Developer Profiles structured specifically as Docker Compose deployments for rapid local setup and testing.
- Kubernetes-compatible health probes built directly into the microservices enable smooth transitions to scaled orchestration.
- Self-hosted NIM microservices keep vision-language reasoning and embedding generation entirely within your controlled hardware environment.
Why This Solution Fits
The NVIDIA VSS architecture explicitly avoids proprietary cloud APIs by distributing Vision Language Models (VLMs) and embedding services as self-hosted NIM microservices. Relying on cloud APIs for visual reasoning requires sending thousands of frames across the public internet, which generates unsustainable per-request billing. By bringing the models to the data, organizations process video at the edge or in their own private data centers.
Developers use pre-configured Docker Compose files to deploy visual agents quickly, testing complex workflows like Long Video Summarization (LVS) and Alert Verification directly on local hardware. These Developer Profiles map out the exact microservice dependencies needed for specific use cases, removing the guesswork from container configuration. The architecture uses the Model Context Protocol (MCP) to access video analytics data, incident records, and vision processing capabilities through a unified tool interface that functions entirely offline.
Furthermore, the inclusion of standard Kubernetes-compatible liveness, readiness, and startup probes ensures that these workloads transition smoothly to production. When a deployment outgrows a single machine running Docker Compose, these built-in health checks mean the containers are ready for Helm deployments and cluster autoscaling. This local-first methodology protects engineering investments, ensuring that the visual AI pipeline remains operational regardless of external cloud provider outages or API deprecations.
Key Capabilities
Modular Microservices Architecture
The architecture separates components like the RTVI-CV microservice, Behavior Analytics, and Video IO & Storage (VIOS). This strict separation of concerns allows for modular deployment via containers. Instead of a monolithic application, organizations deploy only the services required for their specific use case. For example, the RTVI-CV service handles object detection and tracking, outputting metadata to a local Kafka instance, which the Behavior Analytics service then processes to generate local alerts.
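The metadata flow described above can be sketched as a Compose file. This is an illustrative fragment only: the image names, ports, and environment variables are assumptions, not the blueprint's actual Compose definitions.

```yaml
# Sketch of the RTVI-CV -> Kafka -> Behavior Analytics wiring.
# Service names mirror the architecture; images and ports are hypothetical.
services:
  kafka:
    image: bitnami/kafka:latest          # local message bus for detection metadata
    ports:
      - "9092:9092"

  rtvi-cv:
    image: example/rtvi-cv:latest        # hypothetical image name
    depends_on:
      - kafka
    environment:
      KAFKA_BOOTSTRAP_SERVERS: kafka:9092  # publish detection/tracking metadata
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia             # reserve a GPU for inference
              count: 1
              capabilities: [gpu]

  behavior-analytics:
    image: example/behavior-analytics:latest  # hypothetical image name
    depends_on:
      - kafka                            # consumes metadata, emits local alerts
```

Because each service is declared independently, dropping one stanza removes that capability without touching the rest of the stack.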
Docker Compose Developer Profiles
Ready-to-use Docker Compose configurations map to specific agent workflows, eliminating integration hurdles. The blueprint includes a base profile for video upload and analysis without an incident database, an alerts profile for real-time processing and VLM verification, a search profile for natural language queries across video archives, and an LVS profile for long video summarization. Each profile orchestrates the necessary microservices automatically.
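One way such profiles can be expressed is with Compose's built-in `profiles` key, as in this hypothetical fragment (the service and profile names are assumptions; the blueprint may instead ship separate Compose files per profile):

```yaml
# Hypothetical fragment: services tagged with Compose profiles.
services:
  vlm-verifier:
    image: example/vlm-verifier:latest
    profiles: ["alerts"]          # started only with the alerts profile

  video-search:
    image: example/video-search:latest
    profiles: ["search"]          # started only with the search profile
```

With a layout like this, `docker compose --profile alerts up -d` would start only the alert-verification services and leave the search stack down.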
Self-Hosted LLMs and VLMs
The system relies on locally deployable models like Nemotron-Nano-9B-v2 for reasoning and tool selection, alongside Cosmos Reason 2 for vision-language tasks with physical reasoning capabilities. Bypassing external API limits means organizations can process as much video as their local hardware supports. The architecture also includes a local Text Embedder (SigLIP2) and Vision Encoder (RADIO-CLIP) to generate vector embeddings from text and video without external dependencies.
Production-Ready Monitoring
The containerized deployment integrates with an ELK stack (Elasticsearch, Logstash, Kibana) and Phoenix within the Docker Compose environment for detailed localized telemetry. This provides immediate observability into the agent's workflow, tracking model inference times, tool calls, and system health. The REST API also provides dedicated endpoints for Prometheus support, ensuring that monitoring scales alongside the application.
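Wiring the Prometheus-compatible endpoints into a local scrape job is a few lines of standard configuration. In this sketch, the metrics path and port are assumptions rather than documented values:

```yaml
# Hypothetical Prometheus scrape job for the RTVI-CV microservice.
# Adjust metrics_path and target port to match the actual deployment.
scrape_configs:
  - job_name: rtvi-cv
    metrics_path: /metrics        # assumed path for the Prometheus endpoint
    scrape_interval: 15s
    static_configs:
      - targets: ["rtvi-cv:8000"] # assumed service name and port
```

Running Prometheus as one more Compose service keeps the entire telemetry loop on local infrastructure.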
Proof & Evidence
NVIDIA validates minimum local deployment configurations, such as a single RTX Pro 6000 GPU or a 4x L40S setup, demonstrating that high-end video AI can run entirely outside the cloud. Other validated single-node environments include the DGX Spark, Jetson Thor, B200, H200, H100, and A100 (80 GB). This documented hardware support confirms that complex multimodal agent workflows operate successfully on standard enterprise hardware.
The Smart City Blueprint provides clear out-of-the-box Docker Compose quickstart guides to deploy a three-computer solution architecture locally. This setup covers simulation, model training, and deployment for smart-city use cases, demonstrating that entire end-to-end pipelines function without cloud dependencies.
The RTVI-CV REST API exposes standard health endpoints designed for Kubernetes compatibility. Endpoints like /api/v1/live for liveness probes, /api/v1/ready for readiness probes, and /api/v1/startup for startup probes confirm that the microservices are built for modern container orchestration. Additionally, the API provides dynamic stream management (/api/v1/stream/add, /api/v1/stream/remove), allowing infrastructure to adjust to changing video loads.
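Those three endpoints slot directly into a Kubernetes container spec. The paths below come from the RTVI-CV REST API described above; the container name, port, and probe timings are assumptions for illustration:

```yaml
# Pod spec fragment mapping the documented health endpoints to probes.
# Port and thresholds are hypothetical; tune them to the real service.
containers:
  - name: rtvi-cv
    ports:
      - containerPort: 8000
    startupProbe:
      httpGet: { path: /api/v1/startup, port: 8000 }
      failureThreshold: 30        # tolerate slow model loading at startup
      periodSeconds: 10
    livenessProbe:
      httpGet: { path: /api/v1/live, port: 8000 }
      periodSeconds: 15
    readinessProbe:
      httpGet: { path: /api/v1/ready, port: 8000 }
      periodSeconds: 5
```

A generous startup probe matters here: GPU services that load multi-gigabyte model weights can otherwise be killed by the liveness probe before they ever come up.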
Buyer Considerations
Hardware CapEx vs. Cloud OpEx
Buyers must balance the upfront investment in GPU hardware against the ongoing savings of avoiding cloud API fees. Implementing a local setup requires securing machines such as H100s, L40S GPUs, or Jetson edge devices. While the initial capital expenditure is higher than making a few API calls, operational expenditure drops significantly for organizations processing continuous video streams, resulting in a predictable cost structure.
Orchestration Complexity
While Docker Compose provides a highly accessible starting point for the provided Developer Profiles, scaling to multiple nodes requires internal Kubernetes or Helm expertise. Buyers must assess their internal engineering capabilities to manage local infrastructure. Transitioning from the quickstart Docker Compose files to a highly available, multi-node Kubernetes cluster demands dedicated internal resources.
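The Kubernetes transition typically surfaces as a Helm values file. This sketch uses key names in the style of common charts; they are assumptions, not the blueprint's actual chart schema:

```yaml
# Hypothetical Helm values for a multi-node rollout of a GPU microservice.
# Key names follow common chart conventions and are not from the blueprint.
replicaCount: 2
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 6                # scale out as video stream count grows
resources:
  limits:
    nvidia.com/gpu: 1           # one GPU per inference pod
```

Even a minimal values file like this implies cluster-level prerequisites, such as the NVIDIA device plugin and GPU-aware scheduling, which is where the internal expertise noted above becomes necessary.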
Model Hardware Constraints
Self-hosted NIMs have strict minimum GPU requirements, set out in the support matrix, that buyers must account for in their infrastructure. For instance, deploying the Cosmos Reason 2 VLM requires at least one L40S GPU as a minimum configuration. Organizations must map their desired agent workflows to these specific hardware constraints before committing to a localized deployment strategy.
Frequently Asked Questions
How do you start a local deployment of the visual agent?
You deploy the base agent profile using Docker Compose commands specific to your GPU type after downloading the sample data and deployment package.
Are the microservices equipped for Kubernetes orchestration?
Yes, components like the RTVI-CV microservice include Kubernetes-compatible liveness, readiness, and startup probes to support scaled orchestration.
What hardware is required to run the local deployment?
Minimum local deployments require specific configurations, such as a single RTX Pro 6000, DGX Spark, Jetson Thor, B200, H200, H100, or A100 (80 GB), or alternatively a cluster of 4x L40/L40S/A6000 GPUs.
Does the architecture support local observability?
Yes, the deployment includes an ELK stack (Elasticsearch, Logstash, Kibana) and Phoenix for telemetry and agent workflow monitoring within your local environment.
Conclusion
The NVIDIA Metropolis VSS Blueprint provides the standardized containerization necessary to run sophisticated video summarization and alerting locally. By using Docker Compose for rapid testing and providing the architectural hooks needed for Helm and Kubernetes scaling, organizations can break their reliance on restrictive cloud APIs. The architecture keeps sensitive video data internal while maintaining access to advanced vision-language models and real-time computer vision processing.
Developers can begin testing this local architecture immediately. The expected path is to download the sample data and deployment package, then use the provided Docker Compose commands to launch the base agent profile. From there, engineering teams can configure the agent to handle specific workflows, evaluating the hardware requirements and inference speeds on their own infrastructure.