Building My Own AI Powerhouse: A Journey Setting Up Local LLMs on the Framework Desktop
For months, I've been fascinated by the idea of running large language models locally—no API costs, complete privacy, and the freedom to experiment without limits. When AMD announced the Ryzen AI Max+ 395 with its revolutionary unified memory architecture, I knew it was time to build something special.
This is the story of how I transformed a Framework Desktop into a personal AI workstation capable of running 70-billion parameter models—the same class of AI that typically requires enterprise-grade hardware costing tens of thousands of dollars.
Spoiler: It wasn't always smooth sailing, but the destination was worth every troubleshooting session.
Why This Hardware Changes Everything
Before diving into the build, let me explain why the Framework Desktop with the Ryzen AI Max+ 395 is such a game-changer for local AI.
Traditional setups using NVIDIA GPUs hit a wall: even the mighty RTX 4090 maxes out at 24GB of VRAM. Running a 70B parameter model? You'd need two of them, plus deal with the PCIe bandwidth bottleneck between cards.
The Ryzen AI Max+ 395 takes a different approach. It uses unified memory—the CPU, GPU, and NPU all share the same pool of ultra-fast LPDDR5x-8000 RAM. My configuration:
- CPU: AMD Ryzen AI Max+ 395 (16-core/32-thread, up to 5.1GHz boost)
- GPU: AMD Radeon 8060S (40 Compute Units, RDNA 3.5 architecture)
- NPU: Dedicated AI accelerator for inference tasks
- Memory: 128GB LPDDR5x-8000 unified RAM
- Storage: 256GB NVMe (boot) + 2TB NVMe (models and data)
- Networking: 5 Gigabit Ethernet + Wi-Fi 7
The magic number here is 128GB of unified memory. By configuring 96GB for the iGPU in the BIOS, I effectively have a "GPU" with more VRAM than any consumer graphics card on the market—and it's all accessible without copying data across a PCIe bus.
[IMAGE: Photo of Framework Desktop hardware]
Phase 1: The Foundation (Late November)
Before the Framework arrived, I did my homework. I had an existing Zorin Linux box running AnythingLLM with smaller models, which helped me understand the software stack I'd need:
- Ollama as the model inference engine
- Open WebUI for a ChatGPT-like interface
- AnythingLLM for document analysis and RAG (Retrieval Augmented Generation)
I also compared hardware options. An Intel i9-12900HK mini-PC was tempting, but the Framework's 128GB unified memory pool made it the obvious choice for serious embedding work and large models.
Phase 2: Hardware Arrives & BIOS Configuration (Early December)
The box arrived, and I immediately dove into BIOS configuration—this step is critical for AI workloads.
The "Secret" VRAM Setting
By default, integrated GPUs often reserve only 512MB of system RAM. For LLMs, that's useless. I navigated to:
Setup Utility → Advanced → iGPU Configuration
And set the iGPU Memory Configuration to Custom: 96GB.
This single change transforms the system from a regular desktop into an AI workstation. The remaining 32GB stays available for the OS and applications—plenty of headroom.
Secure Boot: Off
ROCm (AMD's compute platform) and Docker play much nicer without Secure Boot enabled. I disabled it to prevent "Permission Denied" errors when loading GPU drivers.
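Both BIOS changes are easy to sanity-check later, once Linux is installed. A minimal sketch, assuming the in-kernel amdgpu driver is loaded and mokutil is present (the GPU usually shows up as card0, but it can be card1 on some systems):
# Secure Boot should report as disabled
mokutil --sb-state
# The iGPU carve-out, reported in bytes (expect roughly 96GB)
cat /sys/class/drm/card0/device/mem_info_vram_total
# Same number, human-readable
numfmt --to=iec < /sys/class/drm/card0/device/mem_info_vram_total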
Phase 3: Ubuntu Installation (December 3rd)
I chose Ubuntu 25.10 (Questing Quokka) with kernel 6.17. This was deliberate—the Strix Halo architecture requires very recent kernels:
- Kernel 6.14+ for NPU recognition
- Kernel 6.16+ for a critical GPU memory bug fix
Older "stable" LTS kernels would have left me fighting driver issues. Sometimes bleeding edge is the right choice.
The base installation was straightforward, followed by essential packages:
- Google Chrome (for testing the web UIs)
- Git, GCC, build tools
- Docker and docker-compose
- SSH server for remote access
- FFmpeg and multimedia codecs
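For reference, most of that list collapses into a single apt command. A rough sketch, with the caveat that package names can shift between Ubuntu releases and that Chrome comes from Google's own .deb rather than the archive:
sudo apt update
sudo apt install -y git build-essential docker.io docker-compose-v2 openssh-server ffmpeg
# Let the everyday user run docker without sudo (takes effect after logging out and back in)
sudo usermod -aG docker $USER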
Phase 4: ROCm Installation—The Tricky Part (December 4th)
Here's where things got interesting. AMD's ROCm platform is their answer to NVIDIA's CUDA, but installing it on cutting-edge hardware requires finesse.
The standard ROCm installer wants to replace your kernel modules. On a system with kernel 6.17 (newer than AMD's official support matrix), that's a recipe for disaster. The solution? User-space only installation:
# Download the installer (using 24.04 "Noble" base for compatibility)
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/noble/amdgpu-install_6.3.60300-1_all.deb
# Install the configuration tool
sudo apt install ./amdgpu-install_6.3.60300-1_all.deb
sudo apt update
# CRITICAL: Install user-space libraries only, skip kernel modules
sudo amdgpu-install --usecase=rocm,hip --no-dkms -y
# Grant GPU access permissions
sudo usermod -aG render,video $LOGNAME
# Reboot to apply
sudo reboot
The --no-dkms flag is the hero here. It tells the installer: "I trust my kernel's built-in AMD drivers—just give me the compute libraries."
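Because everything now leans on the kernel's built-in driver, it's worth confirming that driver really is the one in use before going further. A quick check (dkms may not even be installed, which is fine):
# The in-tree amdgpu module should be loaded
lsmod | grep -w amdgpu
# And no out-of-tree amdgpu build should be registered with DKMS
dkms status 2>/dev/null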
Verification
After reboot, the moment of truth:
$ rocm-smi
# Shows GPU temperature, power, VRAM usage
$ rocminfo | grep gfx
# Returns: Name: gfx1151
That gfx1151 identifier confirmed ROCm was seeing the Strix Halo GPU correctly. Success!
Phase 5: The LLM Stack (December 4th-5th)
With ROCm working, I deployed the AI infrastructure: Ollama installed natively (managed by systemd, as configured below), plus two web front ends running in Docker. A sample Open WebUI launch command follows the table.
| Service | Purpose | Port |
|---|---|---|
| Ollama | Model inference engine (ROCm-accelerated) | 11434 |
| Open WebUI | ChatGPT-like web interface | 3000 |
| AnythingLLM | Document workspace & RAG | 3001 |
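For completeness, here's roughly how the Open WebUI container gets launched; a minimal sketch based on the project's published Docker instructions. The container listens on port 8080 internally, and OLLAMA_BASE_URL points at Docker's default gateway address on Linux (more on that choice in the lessons-learned section):
docker run -d --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://172.17.0.1:11434 \
  -v open-webui:/app/backend/data \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main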
Ollama Configuration
Ollama needed some tweaks to work optimally:
# Edit the service configuration
sudo systemctl edit ollama.service
# Add these environment variables:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_FLASH_ATTENTION=1"
# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart ollama
The OLLAMA_HOST=0.0.0.0 setting allows connections from Docker containers and other machines on the network. OLLAMA_FLASH_ATTENTION=1 enables an optimization that significantly speeds up context processing.
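A quick way to confirm the daemon picked up the new settings and is actually reachable:
# Ollama should answer over HTTP (swap in the server's LAN IP when testing from another machine)
curl http://localhost:11434/api/version
# Shows which models are currently loaded into memory
ollama ps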
Phase 6: The First Real Test—Llama 3.3 70B
Time to stop playing with "toy" models. With 96GB of VRAM available, I pulled the big one:
ollama run llama3.3
40GB download. Several minutes of anticipation. Then...
I watched rocm-smi in another terminal as the model loaded. VRAM usage climbed from 1% to 46%—roughly 44GB of the 96GB allocation.
It worked.
A 70-billion parameter model, running entirely in local memory, with 50GB of headroom left for context windows and multi-model setups.
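For anyone reproducing this, watching the load happen takes nothing more than a couple of one-liners in a second terminal:
# Refresh GPU temperature, power, and VRAM stats every second
watch -n 1 rocm-smi
# Or zero in on just the VRAM numbers
rocm-smi --showmeminfo vram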
Performance Numbers
- Inference speed: ~3.5-4 tokens per second
- Theoretical maximum: ~6 t/s (limited by memory bandwidth: 256 GB/s ÷ 42GB model ≈ 6 t/s)
- Context capacity: 32K tokens comfortably, with room to push higher
For comparison, cloud APIs like Claude or GPT-4 stream at similar speeds. The difference? My queries never leave my network.
Phase 7: Remote Access Setup
I wanted to access this AI server from anywhere in my house—not just the machine itself.
Static IP Configuration
Using nmtui, I configured a static IP:
- Address: 192.168.1.217/24
- Gateway: 192.168.1.1
- DNS: 1.1.1.1
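If you'd rather skip nmtui's menus, the same result via nmcli looks roughly like this; note that the connection name "Wired connection 1" is just the Ubuntu default and yours may differ (check with nmcli con show):
nmcli con mod "Wired connection 1" \
  ipv4.method manual \
  ipv4.addresses 192.168.1.217/24 \
  ipv4.gateway 192.168.1.1 \
  ipv4.dns 1.1.1.1
nmcli con up "Wired connection 1"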
SSH Access from Windows
I set up passwordless SSH with a memorable alias. On my Windows machine:
# Generate key pair
ssh-keygen -t ed25519
# Copy to Linux machine
type $env:USERPROFILE\.ssh\id_ed25519.pub | ssh steve@192.168.1.217 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Then created ~/.ssh/config:
Host AI
    HostName 192.168.1.217
    User steve
Now I just type ssh AI and I'm in. No password, no IP address to remember.
Phase 8: AnythingLLM for Research
Open WebUI is great for general chat, but my real goal was document analysis—querying research papers, historical texts, and philosophical works.
AnythingLLM deployment:
export STORAGE_LOCATION=$HOME/anythingllm
mkdir -p $STORAGE_LOCATION
touch "$STORAGE_LOCATION/.env"
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
--add-host=host.docker.internal:host-gateway \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
--restart unless-stopped \
mintplexlabs/anythingllm
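A quick health check after launch; since the container wasn't given an explicit name, filtering by image works:
# Confirm the container is running and the web UI answers
docker ps --filter ancestor=mintplexlabs/anythingllm
curl -I http://localhost:3001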
Key Configuration Decisions
Through extensive testing (and helpful guidance from AI assistants), I settled on these optimized settings:
| Setting | Value | Why |
|---|---|---|
| Embedder Model | nomic-embed-text-v1 | 8192 token context vs 512 for the default; superior retrieval accuracy |
| Vector Database | LanceDB | 100% local, zero latency, no separate server needed |
| Chunk Size | 8000 characters | ~2-3 pages per chunk; good balance of context and precision |
| Chunk Overlap | 1500 characters | Prevents sentences from being cut off between chunks |
| Max Context Snippets | 20 | Enables deep synthesis across many document sections |
| Similarity Threshold | Low (0.3-0.4) | Casts wider net for philosophical/historical research |
Lessons Learned (The Hard Way)
1. Context Window vs. Memory
I initially set the context window to 128K tokens—the theoretical maximum. First complex query? Out of Memory crash.
The math: 70B model (~42GB) + 128K context KV cache (~40-60GB) + OS overhead = more than 96GB.
Solution: Dropped to 32K tokens. Still massive (roughly 24,000 words, on the order of 60-80 pages of text), but stable.
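One way to make that limit stick, rather than relying on a UI setting, is to bake it into a dedicated model variant with an Ollama Modelfile. A small sketch (the llama3.3-32k tag is just my own naming choice):
# Create a 32K-context variant of the model
cat > Modelfile <<'EOF'
FROM llama3.3
PARAMETER num_ctx 32768
EOF
ollama create llama3.3-32k -f Modelfile
ollama run llama3.3-32k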
2. The Docker Networking Gotcha
On Linux, host.docker.internal doesn't work by default like it does on Windows/Mac. Open WebUI couldn't find Ollama until I changed the API URL to http://172.17.0.1:11434 (Docker's gateway IP on Linux).
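That 172.17.0.1 address isn't guaranteed either; the bridge IP can vary between installs. Either pass --add-host=host.docker.internal:host-gateway when starting the container (as the AnythingLLM command above already does), or look up the real gateway address:
# Docker's default bridge lives on docker0; its IPv4 address is what containers see as the host
ip -4 addr show docker0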
3. Agent Models Need to Be Smaller
The 70B model is brilliant at reasoning but sometimes "overthinks" simple tool-use commands. For agent tasks (like web search), a smaller 8B model responds more reliably to structured instructions.
4. Web Search: Still a Work in Progress
Getting AnythingLLM's web search agent to actually trigger searches proved frustrating. Even with DuckDuckGo configured and the agent enabled, the model often just hallucinated answers instead of searching. The troubleshooting continues—likely a workspace prompt or agent model configuration issue.
The Final Setup
After a week of configuration and testing, here's what I'm running:
Hardware:
- Framework Desktop (FRAMDACP06)
- AMD Ryzen AI Max+ 395 with 96GB iGPU allocation
- 128GB LPDDR5x-8000 unified memory
- 2.25TB NVMe storage
- 5GbE wired networking
Software:
- Ubuntu 25.10 (Kernel 6.17)
- ROCm 6.3 (user-space installation)
- Docker with Ollama, Open WebUI, and AnythingLLM
- Primary model: Llama 3.3 70B (Q4 quantization)
- Embedder: nomic-embed-text-v1
- Vector DB: LanceDB
Capabilities:
- Run state-of-the-art 70B models locally
- Process documents up to 32K tokens of context
- RAG across large document collections
- Access from any device on the network
- Zero API costs, complete privacy
What's Next?
This project isn't finished. On my roadmap:
- Fix web search agent: The tooling exists; I just need to nail down the configuration
- Explore thinking models: Qwen3 and DeepSeek-R1 for complex reasoning tasks
- Fine-tuning experiments: Training custom models on my own data
- Remote access beyond LAN: Secure access when away from home
- Image generation: Adding Stable Diffusion/Flux to the stack
Is It Worth It?
Absolutely—with caveats.
This setup is ideal if you:
- Value privacy and want AI processing to stay local
- Have heavy, ongoing AI usage that would rack up API costs
- Want to experiment with models, prompts, and configurations
- Enjoy the technical challenge of building systems
- Need to process sensitive documents that can't go to cloud APIs
It's probably not for you if:
- You just need occasional AI help (cloud APIs are easier)
- You want plug-and-play simplicity
- Budget is the primary concern (the hardware isn't cheap)
- You need the absolute cutting edge in model capabilities (cloud models update faster)
For me, as someone who does extensive research across health topics, entertainment history, and AI development itself, having a personal AI workstation has been transformative. The ability to query local documents, maintain complete privacy, and tinker endlessly with configurations makes this one of the most satisfying tech projects I've undertaken.
The future of AI isn't just in the cloud. Sometimes, the most powerful AI is the one sitting in your office, ready to work whenever you are.
Have questions about building your own local AI setup? Drop a comment below—I'm happy to share more details about any part of this journey.
[IMAGE: Screenshot of Open WebUI running Llama 3.3 70B]
Tags: AI, LLM, Framework, AMD, Ryzen AI Max, ROCm, Ollama, Open WebUI, AnythingLLM, local AI, self-hosted, machine learning