AURA Demo Stack

100% Local AI Gaming Assistant

System Architecture Flow

flowchart TB
    %% Input Sources
    subgraph INPUT["🎮 INPUT LAYER"]
        MIC["🎤 Microphone<br/>Audio Input"]
        SCREEN["🖥️ Game Screen<br/>Visual Input"]
    end

    %% Audio Processing
    subgraph AUDIO["🔊 AUDIO PROCESSING"]
        WAKE["Wake Word Detection<br/>Trigger: 'Aura'"]
        RIVA["NVIDIA RIVA STT<br/>GPU-Accelerated<br/>Real-time Recognition"]
        MIC ---|Audio Stream| WAKE
        WAKE ---|Activated| RIVA
    end

    %% Vision Processing
    subgraph VISION["👁️ VISION PROCESSING"]
        CAPTURE["MSS Screen Capture<br/>60 FPS Capable<br/>Multi-Monitor"]
        OCR["Tesseract OCR<br/>Text Extraction<br/>98% Accuracy"]
        QWEN["Qwen2-VL Model<br/>Scene Understanding<br/>7B Parameters"]
        SCREEN ---|Screenshots| CAPTURE
        CAPTURE ---|Images| OCR
        CAPTURE ---|Images| QWEN
    end

    %% Core Processing
    subgraph CORE["🧠 INTELLIGENCE CORE"]
        CMD["Command Registry<br/>Fuzzy Matching<br/>Intent Recognition"]
        CONTEXT["Context Builder<br/>State Management<br/>History Tracking"]
        PERSONALITY["Personality System<br/>Dynamic Traits<br/>Response Shaping"]
        MISTRAL["Mistral 7B LLM<br/>Reasoning Engine<br/>8K Context Window"]
        RIVA ---|Transcribed Text| CMD
        CMD ---|Commands| CONTEXT
        OCR ---|Extracted Text| CONTEXT
        QWEN ---|Scene Analysis| CONTEXT
        CONTEXT ---|Full Context| PERSONALITY
        PERSONALITY ---|Enhanced Prompt| MISTRAL
    end

    %% Output Generation
    subgraph OUTPUT["🔊 OUTPUT LAYER"]
        RESPONSE["Response Generator<br/>Natural Language"]
        TTS["Coqui TTS Engine<br/>Jenny Voice Model<br/>Neural Synthesis"]
        SPEAKER["🔊 Audio Output<br/>Game/Streaming"]
        MISTRAL ---|AI Response| RESPONSE
        RESPONSE ---|Text| TTS
        TTS ---|Audio| SPEAKER
    end

    %% Data Storage
    subgraph DATA["💾 LOCAL STORAGE"]
        DB[("SQLite Database<br/>Sessions<br/>Conversations<br/>Personality States")]
        CONTEXT -.->|Store| DB
        MISTRAL -.->|Log| DB
        DB -.->|Retrieve| CONTEXT
    end

    %% Styling
    classDef inputClass fill:#2a3f5f,stroke:#00d4ff,stroke-width:2px,color:#e0e0ff
    classDef audioClass fill:#1a3a2a,stroke:#00ff88,stroke-width:2px,color:#e0e0ff
    classDef visionClass fill:#3a2a1a,stroke:#ffaa00,stroke-width:2px,color:#e0e0ff
    classDef coreClass fill:#2a1a3a,stroke:#9d4edd,stroke-width:2px,color:#e0e0ff
    classDef outputClass fill:#1a2a3a,stroke:#00ff88,stroke-width:2px,color:#e0e0ff
    classDef dataClass fill:#0a1a1a,stroke:#4a5f7f,stroke-width:2px,color:#e0e0ff
    class MIC,SCREEN inputClass
    class WAKE,RIVA audioClass
    class CAPTURE,OCR,QWEN visionClass
    class CMD,CONTEXT,PERSONALITY,MISTRAL coreClass
    class RESPONSE,TTS,SPEAKER outputClass
    class DB dataClass

🎤 Speech Recognition

NVIDIA RIVA STT

GPU-accelerated speech recognition engine providing industry-leading accuracy with minimal latency.

  • Latency: <200ms wake word
  • Mode: Streaming real-time
  • GPU Usage: ~2GB VRAM
  • Accuracy: 95%+ in gaming
Integration: audio/record.py • Wake word: "Aura"
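A minimal wake-word check over a transcript stream might look like the sketch below. This is illustrative only; the actual detection in audio/record.py may operate on audio features rather than text, and the threshold is an assumption:

```python
import difflib

def heard_wake_word(transcript: str, trigger: str = "aura",
                    threshold: float = 0.8) -> bool:
    """Fuzzy-match the trigger against each transcribed word so
    near-misses like 'ora' or 'aura,' still activate."""
    for word in transcript.lower().split():
        word = word.strip(".,!?")
        if difflib.SequenceMatcher(None, word, trigger).ratio() >= threshold:
            return True
    return False
```

Fuzzy matching matters here because streaming STT often mangles short trigger words at utterance boundaries.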

🔊 Voice Synthesis

Coqui TTS - Jenny Voice

High-quality neural TTS running entirely on local GPU, providing natural and expressive speech.

  • Generation: <500ms typical
  • Quality: 22kHz sample rate
  • GPU Usage: ~1.5GB VRAM
  • Voice Type: Female, pleasant
Integration: voices/backend_voice.py • Model: VCTK/vits
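To keep the <500ms figure on long replies, responses are typically split into sentence-sized chunks so synthesis can start before the whole reply is processed. A sketch of that chunking; the strategy is an assumption, not necessarily what voices/backend_voice.py does:

```python
import re

def chunk_for_tts(text: str, max_chars: int = 200) -> list[str]:
    """Split a response at sentence boundaries, merging short
    sentences up to max_chars so each TTS call stays snappy."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```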

👁️ Vision Analysis

Qwen2-VL via Ollama

State-of-the-art multimodal model for understanding game scenes and providing visual context.

  • Model Size: 7B parameters
  • Analysis Rate: 2-5 FPS
  • VRAM Usage: ~6GB
  • Capabilities: Scene + Text
Integration: llm/query.py • Ollama model: qwen2-vl
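Ollama serves models over a local HTTP API (POST to /api/generate, default port 11434) and accepts base64-encoded images for multimodal models. A sketch of building that request body; AURA's actual call lives in llm/query.py and may differ:

```python
import base64
import json

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "qwen2-vl") -> str:
    """Assemble the JSON body for Ollama's /api/generate endpoint.
    Images go in the 'images' field as base64 strings."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }
    return json.dumps(payload)

# The body would be POSTed to http://localhost:11434/api/generate
```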

🧠 Reasoning Engine

Mistral 7B via Ollama

Fast and efficient LLM providing intelligent game assistance with personality-aware responses.

  • Model Size: 7B parameters
  • Response Time: <2 seconds
  • VRAM Usage: ~5GB
  • Context: 8K tokens
Integration: llm/query.py • Personality: personality_manager.py
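"Personality-aware" in practice usually means folding traits and visual context into the prompt before the Mistral call. A sketch with illustrative trait names; the real schema in personality_manager.py is not shown here:

```python
def build_prompt(user_text: str, scene: str, traits: dict) -> str:
    """Fold personality traits and visual context into one prompt,
    keeping it compact so it fits Mistral's 8K-token window."""
    trait_line = ", ".join(f"{k}={v:.1f}" for k, v in traits.items())
    return (
        f"You are Aura, a gaming assistant. Traits: {trait_line}.\n"
        f"Current scene: {scene}\n"
        f"Player: {user_text}\nAura:"
    )
```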

📝 Text Extraction

Tesseract OCR

Battle-tested OCR engine for extracting game text, UI elements, and puzzle clues.

  • Version: 5.0+
  • Languages: 100+ supported
  • Processing: <200ms/frame
  • Accuracy: 98%+ clear text
Integration: computer_vision/ocr.py • Preprocessing included
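Raw Tesseract output from stylized game UIs is noisy, so a cleanup pass usually complements the image preprocessing mentioned above. A sketch of one such filter; the exact filtering in computer_vision/ocr.py is an assumption:

```python
import re

def clean_ocr_text(raw: str, min_len: int = 2) -> str:
    """Drop lines that are mostly OCR noise: stray punctuation
    and one-character fragments; collapse repeated whitespace."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        # keep lines that are mostly alphanumeric and long enough
        alnum = sum(c.isalnum() or c.isspace() for c in line)
        if len(line) >= min_len and alnum / max(len(line), 1) > 0.6:
            lines.append(re.sub(r"\s+", " ", line))
    return "\n".join(lines)
```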

🖥️ Screen Capture

MSS (Python-MSS)

Lightning-fast cross-platform screenshot library with minimal performance impact.

  • Max FPS: 60+ capable
  • Latency: <10ms
  • CPU Usage: Minimal
  • Memory: Efficient
Integration: computer_vision/capture.py • Multi-monitor support
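MSS captures arbitrary regions, so grabbing just a HUD corner instead of the full frame cuts OCR work. A sketch of the region math; the dict shape matches mss's documented region API, while the fractions and corner names are illustrative:

```python
def hud_region(monitor: dict, frac_w: float = 0.3, frac_h: float = 0.2,
               corner: str = "top-left") -> dict:
    """Compute an mss-style region dict covering a corner of the
    monitor, e.g. where a game draws its HUD."""
    w = round(monitor["width"] * frac_w)
    h = round(monitor["height"] * frac_h)
    left = monitor["left"] if "left" in corner else monitor["left"] + monitor["width"] - w
    top = monitor["top"] if "top" in corner else monitor["top"] + monitor["height"] - h
    return {"left": left, "top": top, "width": w, "height": h}

# With mss:  sct.grab(hud_region(sct.monitors[1]))
```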

Real-time Performance Metrics

  • Wake Word: <200ms
  • STT Streaming: Real-time
  • Vision Analysis: 2-5 FPS
  • LLM Response: <2 sec
  • TTS Generation: <500ms
  • End-to-End: <3 sec
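The end-to-end figure follows from summing the serial stages of one voice turn (STT streams concurrently with speech, and vision analysis runs continuously in the background, so neither adds to the serial path). A quick budget check with the numbers above:

```python
# Worst-case budgets (ms) for the serial stages of one voice turn.
BUDGET_MS = {
    "wake_word": 200,       # detection after "Aura" is spoken
    "llm_response": 2000,   # Mistral 7B generation
    "tts": 500,             # Coqui synthesis of the reply
}

total = sum(BUDGET_MS.values())
assert total <= 3000, "end-to-end budget exceeded"
print(f"end-to-end worst case: {total} ms")  # 2700 ms, under the <3 s target
```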

Minimum Requirements

  • GPU RTX 3060 12GB
  • RAM 16GB DDR4
  • Storage 50GB SSD
  • CPU 6-core modern

Recommended Setup

  • GPU RTX 4070+
  • RAM 32GB DDR5
  • Storage 100GB NVMe
  • CPU 8-core high perf

VRAM Usage

  • RIVA STT ~2GB
  • Coqui TTS ~1.5GB
  • Qwen2-VL ~6GB
  • Mistral 7B ~5GB
  • Total if all resident: ~14.5GB (Ollama unloads idle models, so peak usage is typically lower)

Quick Start Guide

1. Install Dependencies

# Create virtual environment
python -m venv aura-env
aura-env\Scripts\activate.bat      # Windows
# On Linux/macOS: source aura-env/bin/activate

# Install requirements
pip install -r requirements.txt

2. Download Models

# Install Ollama from https://ollama.ai
# Then pull required models
ollama pull mistral
ollama pull qwen2-vl

# Coqui TTS models download automatically on first run

3. Configure NVIDIA RIVA

# Follow NVIDIA RIVA setup guide
# Ensure RIVA server is running on localhost:50051
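A quick way to confirm the RIVA server is up before launching AURA is a plain TCP probe of the gRPC port. This only checks reachability, not RIVA health; for a real health check, use the RIVA client tools:

```python
import socket

def riva_reachable(host: str = "localhost", port: int = 50051,
                   timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on the
    RIVA gRPC port; False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```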

4. Launch AURA

# Start with Blue Prince demo
python main_new.py --game blue_prince --input-mode riva --whisper-mode riva

# Open admin console in separate terminal
python utils/admin_console_system_shock.py

# Voice activation: Say "Aura" followed by your command

5. Verify Setup

# Test wake word detection
"Aura, can you hear me?"

# Test vision capabilities
"Aura, what do you see on screen?"

# Check admin console for real-time logs