LLM Protector | Matt G

Try It

This demo runs 5 representative attacks against llama-3.1-8b-instant via Groq. The full local tool runs the complete attack library against any Ollama model you choose. GitHub ↗

Optionally set a system prompt to test. The scanner runs 5 representative attacks — prompt injection, DAN jailbreak, roleplay bypass, separator trick — and reports whether the model complied or refused each one.

Description

What it does: LLM Protector scans a local Ollama model for prompt injection and jailbreak vulnerabilities. It fires a library of categorized attack prompts at the model, detects whether the model complied or refused, and produces a per-attack report.

Why it matters: Most developers deploying local LLMs don't test their system prompts against adversarial inputs. A single DAN-style prompt or indirect injection can bypass guardrails that seem solid during normal use.

How it runs: FastAPI backend on localhost hits the Ollama/v1/chat/completions endpoint (with fallback to the native /api/chat). Attacks run concurrently up to a configurable concurrency limit. Results stream back to a React/Vite frontend as NDJSON.

Architecture

  LLM Protector — Local Tool Architecture
  ══════════════════════════════════════════

  ┌──────────────────────┐     ┌──────────────────────┐
  │   React / Vite UI    │────▶│  FastAPI Backend      │
  │   localhost:5173     │     │  localhost:8000        │
  │                      │◀────│                        │
  │  · Select model      │     │  · Load test_attacks   │
  │  · Set system prompt │     │    .yaml               │
  │  · View results live │     │  · Run attacks with    │
  │  · Filter by status  │     │    semaphore (3 concurrent)
  └──────────────────────┘     │  · Detect: refusal     │
                                │    phrases vs compliance│
                                │  · Stream NDJSON back  │
                                └──────────┬─────────────┘
                                           │
                                           ▼
                                ┌──────────────────────┐
                                │  Ollama               │
                                │  localhost:11434       │
                                │                        │
                                │  Any installed model:  │
                                │  llama3, mistral,      │
                                │  gemma, phi3, etc.     │
                                └──────────────────────┘

  Attack categories in test_attacks.yaml:
  ├── Prompt Injection   (classic override, suffix, separator)
  ├── Jailbreaking       (DAN, roleplay, fictional framing)
  ├── System Prompt Leak (extraction attempts)
  └── Indirect Injection (via tool/context payloads)

WSL2 is fully supported — the backend auto-detects the Windows gateway IP when Ollama is running on the host.

Dev Notes

Detection Logic

Each attack specifies how to score it: refusal-phrase matching, keyword presence in the response, or both. The detect_vulnerability function returns vulnerable, safe, or uncertain.

WSL2 Support

WSL2 can't reach Windows localhost directly. The backend detects WSL via /proc/version and resolves the Windows host IP from the default gateway, then tests connectivity before choosing which URL to use.

Extending Attacks

All attacks live in test_attacks.yaml. Add a new entry with id, category, severity, prompt, and a detection rule — the backend picks it up with no code changes.