A complete enterprise-grade Ollama deployment solution for Kubernetes—covering namespace isolation, persistent storage, Deployment setup, and service exposure. This guide includes pre-deployment checks, step-by-step operations, validation methods, and production optimization tips (GPU acceleration, multi-replica scaling) to help developers and DevOps teams quickly launch stable, scalable Ollama services on K8s.

As lightweight LLMs gain traction, Ollama has become a top choice for SMBs deploying AI capabilities thanks to its one-command setup and low barrier to entry. However, single-machine Ollama deployments face production challenges: model loss, uncontrolled resource usage, and difficulty scaling. Kubernetes addresses these pain points with its container orchestration capabilities. This guide provides a full-stack Ollama deployment solution for K8s, supporting use cases from testing to production.
Ensure your environment meets these requirements to avoid configuration issues:
K8s Cluster: at least one node (v1.24+ recommended, containerd runtime supported). Verify with kubectl get nodes (nodes must be in Ready state).
Persistent Storage: the cluster must provide persistent storage (e.g., a default StorageClass, NFS, Local Path) to save Ollama models (without persistence, models are lost whenever the Pod restarts).
Resource Reservation:
7B model: Minimum 4 CPU cores + 8GB memory
13B model: Minimum 8 CPU cores + 16GB memory
GPU acceleration: Pre-install NVIDIA device plugin (nvidia-device-plugin).
Tools: Local kubectl installed with cluster access. Verify with kubectl cluster-info.
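A quick pre-flight check covering the items above (assuming kubectl already points at the target cluster; the GPU line only matters if you plan to use acceleration, and the device plugin may live in a different namespace depending on how it was installed):

# Nodes must be Ready and the API server reachable
kubectl get nodes
kubectl cluster-info

# A default StorageClass (or one you can reference explicitly) must exist
kubectl get storageclass

# Optional: confirm the NVIDIA device plugin DaemonSet is running if you plan to use GPUs
kubectl get daemonset -n kube-system | grep -i nvidia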
Ollama's K8s deployment requires 4 key components: namespace isolation, persistent storage (PVC), Deployment, and Service exposure. Below are production-ready configurations with inline comments; adjust them as needed.
Create a dedicated namespace to separate Ollama resources from other services:
apiVersion: v1
kind: Namespace
metadata:
  name: ai-services        # All Ollama resources reside here
  labels:
    app: ollama            # Unified label for resource filtering

Ollama stores models in /root/.ollama by default. Use PVC to persist models across Pod restarts:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-services
spec:
  accessModes:
    - ReadWriteOnce        # Use ReadWriteMany for multi-replica (requires storage class support such as NFS)
  resources:
    requests:
      storage: 50Gi        # 7B model ≈ 4GB, 13B ≈ 13GB; reserve 20% redundancy for multiple models
  storageClassName: "standard"   # Replace with your cluster's storage class (e.g., aws-ebs, gcp-pd, nfs-client)

The Deployment manages the Ollama container lifecycle (resource limits, health checks, environment variables):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-services
  labels:
    app: ollama
spec:
  replicas: 1                    # Enable multi-replica only after solving storage sharing (e.g., NFS)
  selector:
    matchLabels:
      app: ollama
  strategy:
    type: Recreate               # Prevent model file conflicts in multi-replica scenarios
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest   # Use a specific version in production (e.g., ollama/ollama:0.1.48)
          # For GPU acceleration: use ollama/ollama:nvidia (requires NVIDIA device plugin)
          ports:
            - containerPort: 11434      # Default Ollama API port
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"          # Allow external access (default: localhost only)
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "2"                # Adjust based on available memory (default: 3)
          resources:
            requests:                   # Minimum resources for scheduling
              cpu: "4"
              memory: "12Gi"            # 7B model: 8GB+, 13B model: 16GB+
            limits:                     # Prevent resource contention
              cpu: "8"
              memory: "16Gi"
              # GPU configuration (uncomment if using the NVIDIA image)
              # nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /root/.ollama        # Default Ollama model directory (do not modify)
              subPath: ollama                 # Isolate directory for shared storage
            - name: cache-volume
              mountPath: /root/.cache/ollama  # Temporary cache for faster model loading
          livenessProbe:                # Restart Pod if service fails
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 120    # Longer delay for model loading
            periodSeconds: 20
            timeoutSeconds: 5
          readinessProbe:               # Remove Pod from Service if unready
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 60
            periodSeconds: 10
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000                   # Allow write access to the model directory
      affinity:
        # Prefer scheduling to GPU nodes (if available)
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: In
                    values:
                      - "true"
      tolerations:
        # Tolerate GPU node taints
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ollama-models-pvc
        - name: cache-volume
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi              # In-memory cache for faster inference

The Service gives the ephemeral Pod IPs a stable access address:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-services
  labels:
    app: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
      name: http
  type: ClusterIP   # Modify for external access:
                    # - NodePort: add nodePort: 30080 (range: 30000-32767)
                    # - LoadBalancer: for cloud environments (AWS ELB, GCP LB)
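If you need quick access from outside the cluster without an Ingress, the same Service can be switched to NodePort; a minimal sketch of the changed spec (30080 is an arbitrary choice within the 30000-32767 range, and metadata stays the same):

spec:
  type: NodePort
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
      protocol: TCP
      name: http
      nodePort: 30080   # Ollama becomes reachable at http://<node-ip>:30080

In cloud environments, type: LoadBalancer provisions an external load balancer (AWS ELB, GCP LB) instead.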
Save all configurations to ollama-k8s-deploy.yaml and run:

kubectl apply -f ollama-k8s-deploy.yaml

Verify resource status (all components must be healthy):
# Check namespace
kubectl get ns | grep ai-services
# Check PVC (STATUS: Bound)
kubectl get pvc -n ai-services
# Check Deployment (READY: 1/1)
kubectl get deployment -n ai-services
# Check Pod (STATUS: Running, RESTARTS: 0)
kubectl get pods -n ai-services

Troubleshooting: If the Pod is stuck in Pending, run kubectl describe pod <pod-name> -n ai-services to check for storage binding issues or resource shortages.
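When a Pod stays Pending or keeps restarting, the events usually point at the cause (an unbound PVC, insufficient CPU/memory, or an untolerated taint). Two commands worth keeping at hand:

# Scheduling and volume-binding errors show up under Events
kubectl describe pod <pod-name> -n ai-services

# Recent events in the namespace, oldest first
kubectl get events -n ai-services --sort-by=.lastTimestamp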
Once the Pod is running, pull your target model (e.g., Llama3-8B):
# Enter the Ollama container (replace <pod-name> with actual Pod name)
kubectl exec -it -n ai-services <pod-name> -- /bin/sh
# Pull model (e.g., Llama3-8B)
ollama pull llama3:8b
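If you prefer not to exec into the container, the same pull can be triggered through Ollama's HTTP API (a sketch, assuming the API is reachable from where you run it, e.g., via the port-forward shown further below):

curl http://localhost:11434/api/pull -d '{
  "name": "llama3:8b"
}'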
Check pull progress with:

kubectl logs -f <pod-name> -n ai-services

To validate in-cluster access, query the API from a temporary test Pod:

# Run a test Pod
kubectl run -it busybox --image=busybox:1.35 -- /bin/sh
# Test model list API
wget -qO- http://ollama-service.ai-services:11434/api/tags

Forward the Service port to your local machine:
kubectl port-forward -n ai-services service/ollama-service 11434:11434

Test the chat API with curl:
curl http://localhost:11434/api/chat -d '{
"model": "llama3:8b",
"messages": [{"role": "user", "content": "What is Kubernetes?"}]
}'

A successful response (JSON format) indicates the Ollama service is operational.
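Note that /api/chat streams its reply as a sequence of JSON objects by default; for a single consolidated JSON response (easier to read in a quick test), "stream": false can be added to the request body:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3:8b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is Kubernetes?"}]
}'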
Multi-replica scenarios: Use ReadWriteMany-supported storage (NFS, AWS EFS, GCP Filestore).
Performance: Use SSD for faster model loading (HDD may cause timeouts for 13B+ models).
CPU/Memory:
7B model: 4-8 cores + 8-12GB memory
13B model: 8-16 cores + 16-32GB memory
34B model: 16+ cores + 64GB+ memory
GPU Acceleration: Use ollama/ollama:nvidia image (3-5x faster model loading, 50% lower latency).
Monitoring: Use Prometheus + Grafana to track model status, API throughput, and latency (Ollama exposes /metrics endpoint).
Log Collection: Forward container logs to ELK or Loki for troubleshooting.
High Availability: Add a PodDisruptionBudget (PDB) to avoid service downtime during maintenance (sketch below).
RBAC Permissions: Restrict access to the ai-services namespace to authorized users (sketch below).
Network Isolation: Use a NetworkPolicy to allow access only from trusted services (sketch below).
Image Security: Store Ollama images in a private registry to prevent tampering.
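For the PodDisruptionBudget suggestion above, a minimal sketch (it only becomes meaningful once you run more than one replica; with a single replica it simply blocks voluntary evictions):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: ai-services
spec:
  minAvailable: 1          # Keep at least one Ollama Pod during voluntary disruptions (node drains, upgrades)
  selector:
    matchLabels:
      app: ollama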
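For the RBAC suggestion, a sketch of a namespace-scoped, read-only Role plus RoleBinding; the group name ai-team is a placeholder for whatever subject your cluster actually uses:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ollama-viewer
  namespace: ai-services
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ollama-viewer-binding
  namespace: ai-services
subjects:
  - kind: Group
    name: ai-team            # Placeholder: replace with your team's group or users
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ollama-viewer
  apiGroup: rbac.authorization.k8s.io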
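And for network isolation, a NetworkPolicy sketch that only admits traffic to the Ollama port from Pods carrying a trusted-client label (the ollama-client: "true" label is an assumption; use whatever selector matches your callers). NetworkPolicy only takes effect if your CNI plugin enforces it (e.g., Calico, Cilium):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-trusted
  namespace: ai-services
spec:
  podSelector:
    matchLabels:
      app: ollama                       # Applies to the Ollama Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ollama-client: "true"     # Assumed label on trusted client Pods
      ports:
        - protocol: TCP
          port: 11434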
This deployment balances stability, scalability, and resource efficiency—suitable for both testing and production. Key advantages include:
Persistent models: Avoid repeated downloads via PVC.
Controllable resources: Prevent resource contention with CPU/memory limits.
Flexible scaling: Support single-replica debugging and multi-replica high availability.
Adjust configurations (storage capacity, resource limits, Service type) based on your cluster resources and model requirements to quickly launch lightweight AI services.