AURA Demo Stack

100% Local AI Gaming Assistant

System Architecture Flow

flowchart TB
    %% Input Sources
    subgraph INPUT["🎮 INPUT LAYER"]
        MIC["🎤 Microphone<br/>Audio Input"]
        SCREEN["🖥️ Game Screen<br/>Visual Input"]
    end

    %% Audio Processing
    subgraph AUDIO["🔊 AUDIO PROCESSING"]
        WAKE["Wake Word Detection<br/>Trigger: 'Aura'"]
        RIVA["NVIDIA RIVA STT<br/>GPU-Accelerated<br/>Real-time Recognition"]
        MIC ---|Audio Stream| WAKE
        WAKE ---|Activated| RIVA
    end

    %% Vision Processing
    subgraph VISION["👁️ VISION PROCESSING"]
        CAPTURE["MSS Screen Capture<br/>60 FPS Capable<br/>Multi-Monitor"]
        OCR["Tesseract OCR<br/>Text Extraction<br/>98% Accuracy"]
        QWEN["Qwen2-VL Model<br/>Scene Understanding<br/>7B Parameters"]
        SCREEN ---|Screenshots| CAPTURE
        CAPTURE ---|Images| OCR
        CAPTURE ---|Images| QWEN
    end

    %% Core Processing
    subgraph CORE["🧠 INTELLIGENCE CORE"]
        CMD["Command Registry<br/>Fuzzy Matching<br/>Intent Recognition"]
        CONTEXT["Context Builder<br/>State Management<br/>History Tracking"]
        PERSONALITY["Personality System<br/>Dynamic Traits<br/>Response Shaping"]
        MISTRAL["Mistral 7B LLM<br/>Reasoning Engine<br/>8K Context Window"]
        RIVA ---|Transcribed Text| CMD
        CMD ---|Commands| CONTEXT
        OCR ---|Extracted Text| CONTEXT
        QWEN ---|Scene Analysis| CONTEXT
        CONTEXT ---|Full Context| PERSONALITY
        PERSONALITY ---|Enhanced Prompt| MISTRAL
    end

    %% Output Generation
    subgraph OUTPUT["🔊 OUTPUT LAYER"]
        RESPONSE["Response Generator<br/>Natural Language"]
        TTS["Coqui TTS Engine<br/>Jenny Voice Model<br/>Neural Synthesis"]
        SPEAKER["🔊 Audio Output<br/>Game/Streaming"]
        MISTRAL ---|AI Response| RESPONSE
        RESPONSE ---|Text| TTS
        TTS ---|Audio| SPEAKER
    end

    %% Data Storage
    subgraph DATA["💾 LOCAL STORAGE"]
        DB[("SQLite Database<br/>Sessions<br/>Conversations<br/>Personality States")]
        CONTEXT -.->|Store| DB
        MISTRAL -.->|Log| DB
        DB -.->|Retrieve| CONTEXT
    end

    %% Styling
    classDef inputClass fill:#2a3f5f,stroke:#00d4ff,stroke-width:2px,color:#e0e0ff
    classDef audioClass fill:#1a3a2a,stroke:#00ff88,stroke-width:2px,color:#e0e0ff
    classDef visionClass fill:#3a2a1a,stroke:#ffaa00,stroke-width:2px,color:#e0e0ff
    classDef coreClass fill:#2a1a3a,stroke:#9d4edd,stroke-width:2px,color:#e0e0ff
    classDef outputClass fill:#1a2a3a,stroke:#00ff88,stroke-width:2px,color:#e0e0ff
    classDef dataClass fill:#0a1a1a,stroke:#4a5f7f,stroke-width:2px,color:#e0e0ff
    class MIC,SCREEN inputClass
    class WAKE,RIVA audioClass
    class CAPTURE,OCR,QWEN visionClass
    class CMD,CONTEXT,PERSONALITY,MISTRAL coreClass
    class RESPONSE,TTS,SPEAKER outputClass
    class DB dataClass

🎤 Speech Recognition

NVIDIA RIVA STT

GPU-accelerated speech recognition engine providing industry-leading accuracy with minimal latency.

  • Latency: <200ms wake word
  • Mode: Streaming real-time
  • GPU Usage: ~2GB VRAM
  • Accuracy: 95%+ in gaming
Integration: audio/record.py • Wake word: "Aura"
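A minimal wake-word check over a transcript stream might look like the sketch below. This is illustrative only; the actual detection in audio/record.py may operate on audio features rather than text, and the threshold is an assumption:

```python
import difflib

def heard_wake_word(transcript: str, trigger: str = "aura",
                    threshold: float = 0.8) -> bool:
    """Fuzzy-match the trigger against each transcribed word so
    near-misses like 'ora' or 'aura,' still activate."""
    for word in transcript.lower().split():
        word = word.strip(".,!?")
        if difflib.SequenceMatcher(None, word, trigger).ratio() >= threshold:
            return True
    return False
```

Fuzzy matching matters here because streaming STT often mangles short trigger words at utterance boundaries.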

🔊 Voice Synthesis

Coqui TTS - Jenny Voice

High-quality neural TTS running entirely on local GPU, providing natural and expressive speech.

  • Generation: <500ms typical
  • Quality: 22kHz sample rate
  • GPU Usage: ~1.5GB VRAM
  • Voice Type: Female, pleasant
Integration: voices/backend_voice.py • Model: VCTK/vits
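To keep the <500ms figure on long replies, responses are typically split into sentence-sized chunks so synthesis can start before the whole reply is processed. A sketch of that chunking; the strategy is an assumption, not necessarily what voices/backend_voice.py does:

```python
import re

def chunk_for_tts(text: str, max_chars: int = 200) -> list[str]:
    """Split a response at sentence boundaries, merging short
    sentences up to max_chars so each TTS call stays snappy."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```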

👁️ Vision Analysis

Qwen2-VL via Ollama

State-of-the-art multimodal model for understanding game scenes and providing visual context.

  • Model Size: 7B parameters
  • Analysis Rate: 2-5 FPS
  • VRAM Usage: ~6GB
  • Capabilities: Scene + Text
Integration: llm/query.py • Ollama model: qwen2-vl
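Ollama serves models over a local HTTP API (POST to /api/generate, default port 11434) and accepts base64-encoded images for multimodal models. A sketch of building that request body; AURA's actual call lives in llm/query.py and may differ:

```python
import base64
import json

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "qwen2-vl") -> str:
    """Assemble the JSON body for Ollama's /api/generate endpoint.
    Images go in the 'images' field as base64 strings."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }
    return json.dumps(payload)

# The body would be POSTed to http://localhost:11434/api/generate
```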

🧠 Reasoning Engine

Mistral 7B via Ollama

Fast and efficient LLM providing intelligent game assistance with personality-aware responses.

  • Model Size: 7B parameters
  • Response Time: <2 seconds
  • VRAM Usage: ~5GB
  • Context: 8K tokens
Integration: llm/query.py • Personality: personality_manager.py
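"Personality-aware" in practice usually means folding traits and visual context into the prompt before the Mistral call. A sketch with illustrative trait names; the real schema in personality_manager.py is not shown here:

```python
def build_prompt(user_text: str, scene: str, traits: dict) -> str:
    """Fold personality traits and visual context into one prompt,
    keeping it compact so it fits Mistral's 8K-token window."""
    trait_line = ", ".join(f"{k}={v:.1f}" for k, v in traits.items())
    return (
        f"You are Aura, a gaming assistant. Traits: {trait_line}.\n"
        f"Current scene: {scene}\n"
        f"Player: {user_text}\nAura:"
    )
```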

📝 Text Extraction

Tesseract OCR

Battle-tested OCR engine for extracting game text, UI elements, and puzzle clues.

  • Version: 5.0+
  • Languages: 100+ supported
  • Processing: <200ms/frame
  • Accuracy: 98%+ clear text
Integration: computer_vision/ocr.py • Preprocessing included
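Raw Tesseract output from stylized game UIs is noisy, so a cleanup pass usually complements the image preprocessing mentioned above. A sketch of one such filter; the exact filtering in computer_vision/ocr.py is an assumption:

```python
import re

def clean_ocr_text(raw: str, min_len: int = 2) -> str:
    """Drop lines that are mostly OCR noise: stray punctuation
    and one-character fragments; collapse repeated whitespace."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        # keep lines that are mostly alphanumeric and long enough
        alnum = sum(c.isalnum() or c.isspace() for c in line)
        if len(line) >= min_len and alnum / max(len(line), 1) > 0.6:
            lines.append(re.sub(r"\s+", " ", line))
    return "\n".join(lines)
```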

🖥️ Screen Capture

MSS (Python-MSS)

Lightning-fast cross-platform screenshot library with minimal performance impact.

  • Max FPS: 60+ capable
  • Latency: <10ms
  • CPU Usage: Minimal
  • Memory: Efficient
Integration: computer_vision/capture.py • Multi-monitor support
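MSS captures arbitrary regions, so grabbing just a HUD corner instead of the full frame cuts OCR work. A sketch of the region math; the dict shape matches mss's documented region API, while the fractions and corner names are illustrative:

```python
def hud_region(monitor: dict, frac_w: float = 0.3, frac_h: float = 0.2,
               corner: str = "top-left") -> dict:
    """Compute an mss-style region dict covering a corner of the
    monitor, e.g. where a game draws its HUD."""
    w = round(monitor["width"] * frac_w)
    h = round(monitor["height"] * frac_h)
    left = monitor["left"] if "left" in corner else monitor["left"] + monitor["width"] - w
    top = monitor["top"] if "top" in corner else monitor["top"] + monitor["height"] - h
    return {"left": left, "top": top, "width": w, "height": h}

# With mss:  sct.grab(hud_region(sct.monitors[1]))
```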

Real-time Performance Metrics

  • Wake Word: <200ms
  • STT Streaming: Real-time
  • Vision Analysis: 2-5 FPS
  • LLM Response: <2 sec
  • TTS Generation: <500ms
  • End-to-End: <3 sec
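The end-to-end figure follows from summing the serial stages of one voice turn (STT streams concurrently with speech, and vision analysis runs continuously in the background, so neither adds to the serial path). A quick budget check with the numbers above:

```python
# Worst-case budgets (ms) for the serial stages of one voice turn.
BUDGET_MS = {
    "wake_word": 200,       # detection after "Aura" is spoken
    "llm_response": 2000,   # Mistral 7B generation
    "tts": 500,             # Coqui synthesis of the reply
}

total = sum(BUDGET_MS.values())
assert total <= 3000, "end-to-end budget exceeded"
print(f"end-to-end worst case: {total} ms")  # 2700 ms, under the <3 s target
```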

Minimum Requirements

  • GPU RTX 3060 12GB
  • RAM 16GB DDR4
  • Storage 50GB SSD
  • CPU 6-core modern

Recommended Setup

  • GPU RTX 4070+
  • RAM 32GB DDR5
  • Storage 100GB NVMe
  • CPU 8-core high perf

VRAM Usage

  • RIVA STT ~2GB
  • Coqui TTS ~1.5GB
  • Qwen2-VL ~6GB
  • Mistral 7B ~5GB
  • Total if all resident: ~14.5GB (Ollama unloads idle models, so peak usage is typically lower)

Quick Start Guide

1. Install Dependencies

# Create virtual environment
python -m venv aura-env
aura-env\Scripts\activate.bat      # Windows
# On Linux/macOS: source aura-env/bin/activate

# Install requirements
pip install -r requirements.txt

2. Download Models

# Install Ollama from https://ollama.ai
# Then pull required models
ollama pull mistral
ollama pull qwen2-vl

# Coqui TTS models download automatically on first run

3. Configure NVIDIA RIVA

# Follow NVIDIA RIVA setup guide
# Ensure RIVA server is running on localhost:50051
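A quick way to confirm the RIVA server is up before launching AURA is a plain TCP probe of the gRPC port. This only checks reachability, not RIVA health; for a real health check, use the RIVA client tools:

```python
import socket

def riva_reachable(host: str = "localhost", port: int = 50051,
                   timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on the
    RIVA gRPC port; False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```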

4. Launch AURA

# Start with Blue Prince demo
python main_new.py --game blue_prince --input-mode riva --whisper-mode riva

# Open admin console in separate terminal
python utils/admin_console_system_shock.py

# Voice activation: Say "Aura" followed by your command

5. Verify Setup

# Test wake word detection
"Aura, can you hear me?"

# Test vision capabilities
"Aura, what do you see on screen?"

# Check admin console for real-time logs