Building My Own AI Powerhouse: A Journey Setting Up Local LLMs on the Framework Desktop
For months, I've been fascinated by the idea of running large language models locally—no API costs, complete privacy, and the freedom to experiment without limits. When AMD announced the Ryzen AI Max+ 395 with its revolutionary unified memory architecture, I knew it was time to build something special.
This is the story of how I transformed a Framework Desktop into a personal AI workstation capable of running 70-billion parameter models—the same class of AI that typically requires enterprise-grade hardware costing tens of thousands of dollars.
Spoiler: It wasn't always smooth sailing, but the destination was worth every troubleshooting session.
Why This Hardware Changes Everything
Before diving into the build, let me explain why the Framework Desktop with the Ryzen AI Max+ 395 is such a game-changer for local AI.
Traditional setups using NVIDIA GPUs hit a wall: even the mighty RTX 4090 maxes out at 24GB of VRAM. Running a 70B parameter model? You'd need two of them, plus deal with the PCIe bandwidth bottleneck between cards.
The Ryzen AI Max+ 395 takes a different approach. It uses unified memory—the CPU, GPU, and NPU all share the same pool of ultra-fast LPDDR5x-8000 RAM. My configuration:
- CPU: AMD Ryzen AI Max+ 395 (16-core/32-thread, up to 5.1GHz boost)
- GPU: AMD Radeon 8060S (40 Compute Units, RDNA 3.5 architecture)
- NPU: Dedicated AI accelerator for inference tasks
- Memory: 128GB LPDDR5x-8000 unified RAM
- Storage: 256GB NVMe (boot) + 2TB NVMe (models and data)
- Networking: 5 Gigabit Ethernet + Wi-Fi 7
The magic number here is 128GB of unified memory. By configuring 96GB for the iGPU in the BIOS, I effectively have a "GPU" with more VRAM than any consumer graphics card on the market—and it's all accessible without copying data across a PCIe bus.
[IMAGE: Photo of Framework Desktop hardware]
Phase 1: The Foundation (Late November)
Before the Framework arrived, I did my homework. I had an existing Zorin Linux box running AnythingLLM with smaller models, which helped me understand the software stack I'd need:
- Ollama as the model inference engine
- Open WebUI for a ChatGPT-like interface
- AnythingLLM for document analysis and RAG (Retrieval Augmented Generation)
I also compared hardware options. An Intel i9-12900HK mini-PC was tempting, but the Framework's 128GB unified memory pool made it the obvious choice for serious embedding work and large models.
Phase 2: Hardware Arrives & BIOS Configuration (Early December)
The box arrived, and I immediately dove into BIOS configuration—this step is critical for AI workloads.
The "Secret" VRAM Setting
By default, integrated GPUs often reserve only 512MB of system RAM. For LLMs, that's useless. I navigated to:
Setup Utility → Advanced → iGPU Configuration
And set the iGPU Memory Configuration to Custom: 96GB.
This single change transforms the system from a regular desktop into an AI workstation. The remaining 32GB stays available for the OS and applications—plenty of headroom.
Secure Boot: Off
ROCm (AMD's compute platform) and Docker play much nicer without Secure Boot enabled. I disabled it to prevent "Permission Denied" errors when loading GPU drivers.
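Both BIOS changes are easy to sanity-check later, once Linux is installed. A minimal sketch, assuming the in-kernel amdgpu driver is loaded and mokutil is present (the GPU usually shows up as card0, but it can be card1 on some systems):
# Secure Boot should report as disabled
mokutil --sb-state
# The iGPU carve-out, reported in bytes (expect roughly 96GB)
cat /sys/class/drm/card0/device/mem_info_vram_total
# Same number, human-readable
numfmt --to=iec < /sys/class/drm/card0/device/mem_info_vram_total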
Phase 3: Ubuntu Installation (December 3rd)
I chose Ubuntu 25.10 (Questing Quokka) with kernel 6.17. This was deliberate—the Strix Halo architecture requires very recent kernels:
- Kernel 6.14+ for NPU recognition
- Kernel 6.16+ for a critical GPU memory bug fix
Older "stable" LTS kernels would have left me fighting driver issues. Sometimes bleeding edge is the right choice.
The base installation was straightforward, followed by essential packages:
- Google Chrome (for testing the web UIs)
- Git, GCC, build tools
- Docker and docker-compose
- SSH server for remote access
- FFmpeg and multimedia codecs
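For reference, most of that list collapses into a single apt command. A rough sketch, with the caveat that package names can shift between Ubuntu releases and that Chrome comes from Google's own .deb rather than the archive:
sudo apt update
sudo apt install -y git build-essential docker.io docker-compose-v2 openssh-server ffmpeg
# Let the everyday user run docker without sudo (takes effect after logging out and back in)
sudo usermod -aG docker $USER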
Phase 4: ROCm Installation—The Tricky Part (December 4th)
Here's where things got interesting. AMD's ROCm platform is their answer to NVIDIA's CUDA, but installing it on cutting-edge hardware requires finesse.
The standard ROCm installer wants to replace your kernel modules. On a system with kernel 6.17 (newer than AMD's official support matrix), that's a recipe for disaster. The solution? User-space only installation:
# Download the installer (using 24.04 "Noble" base for compatibility)
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/noble/amdgpu-install_6.3.60300-1_all.deb
# Install the configuration tool
sudo apt install ./amdgpu-install_6.3.60300-1_all.deb
sudo apt update
# CRITICAL: Install user-space libraries only, skip kernel modules
sudo amdgpu-install --usecase=rocm,hip --no-dkms -y
# Grant GPU access permissions
sudo usermod -aG render,video $LOGNAME
# Reboot to apply
sudo reboot
The --no-dkms flag is the hero here. It tells the installer: "I trust my kernel's built-in AMD drivers—just give me the compute libraries."
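Because everything now leans on the kernel's built-in driver, it's worth confirming that driver really is the one in use before going further. A quick check (dkms may not even be installed, which is fine):
# The in-tree amdgpu module should be loaded
lsmod | grep -w amdgpu
# And no out-of-tree amdgpu build should be registered with DKMS
dkms status 2>/dev/null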
Verification
After reboot, the moment of truth:
$ rocm-smi
# Shows GPU temperature, power, VRAM usage
$ rocminfo | grep gfx
# Returns: Name: gfx1151
That gfx1151 identifier confirmed ROCm was seeing the Strix Halo GPU correctly. Success!
Phase 5: The LLM Stack (December 4th-5th)
With ROCm working, I deployed the AI infrastructure: Ollama installed natively (managed by systemd, as configured below), plus two web front ends running in Docker. A sample Open WebUI launch command follows the table.
| Service | Purpose | Port |
|---|---|---|
| Ollama | Model inference engine (ROCm-accelerated) | 11434 |
| Open WebUI | ChatGPT-like web interface | 3000 |
| AnythingLLM | Document workspace & RAG | 3001 |
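For completeness, here's roughly how the Open WebUI container gets launched; a minimal sketch based on the project's published Docker instructions. The container listens on port 8080 internally, and OLLAMA_BASE_URL points at Docker's default gateway address on Linux (more on that choice in the lessons-learned section):
docker run -d --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://172.17.0.1:11434 \
  -v open-webui:/app/backend/data \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main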
Ollama Configuration
Ollama needed some tweaks to work optimally:
# Edit the service configuration
sudo systemctl edit ollama.service
# Add these environment variables:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_FLASH_ATTENTION=1"
# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart ollama
The OLLAMA_HOST=0.0.0.0 setting allows connections from Docker containers and other machines on the network. OLLAMA_FLASH_ATTENTION=1 enables an optimization that significantly speeds up context processing.
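A quick way to confirm the daemon picked up the new settings and is actually reachable:
# Ollama should answer over HTTP (swap in the server's LAN IP when testing from another machine)
curl http://localhost:11434/api/version
# Shows which models are currently loaded into memory
ollama ps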
Phase 6: The First Real Test—Llama 3.3 70B
Time to stop playing with "toy" models. With 96GB of VRAM available, I pulled the big one:
ollama run llama3.3
40GB download. Several minutes of anticipation. Then...
I watched rocm-smi in another terminal as the model loaded. VRAM usage climbed from 1% to 46%—roughly 44GB of the 96GB allocation.
It worked.
A 70-billion parameter model, running entirely in local memory, with 50GB of headroom left for context windows and multi-model setups.
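For anyone reproducing this, watching the load happen takes nothing more than a couple of one-liners in a second terminal:
# Refresh GPU temperature, power, and VRAM stats every second
watch -n 1 rocm-smi
# Or zero in on just the VRAM numbers
rocm-smi --showmeminfo vram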
Performance Numbers
- Inference speed: ~3.5-4 tokens per second
- Theoretical maximum: ~6 t/s (limited by memory bandwidth: 256 GB/s ÷ 42GB model ≈ 6 t/s)
- Context capacity: 32K tokens comfortably, with room to push higher
For comparison, cloud APIs like Claude or GPT-4 stream at similar speeds. The difference? My queries never leave my network.
Phase 7: Remote Access Setup
I wanted to access this AI server from anywhere in my house—not just the machine itself.
Static IP Configuration
Using nmtui, I configured a static IP:
- Address: 192.168.1.217/24
- Gateway: 192.168.1.1
- DNS: 1.1.1.1
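If you'd rather skip nmtui's menus, the same result via nmcli looks roughly like this; note that the connection name "Wired connection 1" is just the Ubuntu default and yours may differ (check with nmcli con show):
nmcli con mod "Wired connection 1" \
  ipv4.method manual \
  ipv4.addresses 192.168.1.217/24 \
  ipv4.gateway 192.168.1.1 \
  ipv4.dns 1.1.1.1
nmcli con up "Wired connection 1"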
SSH Access from Windows
I set up passwordless SSH with a memorable alias. On my Windows machine:
# Generate key pair
ssh-keygen -t ed25519
# Copy to Linux machine
type $env:USERPROFILE\.ssh\id_ed25519.pub | ssh steve@192.168.1.217 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Then created ~/.ssh/config:
Host AI
    HostName 192.168.1.217
    User steve
Now I just type ssh AI and I'm in. No password, no IP address to remember.
Phase 8: AnythingLLM for Research
Open WebUI is great for general chat, but my real goal was document analysis—querying research papers, historical texts, and philosophical works.
AnythingLLM deployment:
export STORAGE_LOCATION=$HOME/anythingllm
mkdir -p $STORAGE_LOCATION
touch "$STORAGE_LOCATION/.env"
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
--add-host=host.docker.internal:host-gateway \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
--restart unless-stopped \
mintplexlabs/anythingllm
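A quick health check after launch; since the container wasn't given an explicit name, filtering by image works:
# Confirm the container is running and the web UI answers
docker ps --filter ancestor=mintplexlabs/anythingllm
curl -I http://localhost:3001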
Key Configuration Decisions
Through extensive testing (and helpful guidance from AI assistants), I settled on these optimized settings:
| Setting | Value | Why |
|---|---|---|
| Embedder Model | nomic-embed-text-v1 | 8192 token context vs 512 for the default; superior retrieval accuracy |
| Vector Database | LanceDB | 100% local, zero latency, no separate server needed |
| Chunk Size | 8000 characters | ~2-3 pages per chunk; good balance of context and precision |
| Chunk Overlap | 1500 characters | Prevents sentences from being cut off between chunks |
| Max Context Snippets | 20 | Enables deep synthesis across many document sections |
| Similarity Threshold | Low (0.3-0.4) | Casts wider net for philosophical/historical research |
Lessons Learned (The Hard Way)
1. Context Window vs. Memory
I initially set the context window to 128K tokens—the theoretical maximum. First complex query? Out of Memory crash.
The math: 70B model (~42GB) + 128K context KV cache (~40-60GB) + OS overhead = more than 96GB.
Solution: Dropped to 32K tokens. Still massive (roughly 24,000 words, on the order of 60-80 pages of text), but stable.
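One way to make that limit stick, rather than relying on a UI setting, is to bake it into a dedicated model variant with an Ollama Modelfile. A small sketch (the llama3.3-32k tag is just my own naming choice):
# Create a 32K-context variant of the model
cat > Modelfile <<'EOF'
FROM llama3.3
PARAMETER num_ctx 32768
EOF
ollama create llama3.3-32k -f Modelfile
ollama run llama3.3-32k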
2. The Docker Networking Gotcha
On Linux, host.docker.internal doesn't work by default like it does on Windows/Mac. Open WebUI couldn't find Ollama until I changed the API URL to http://172.17.0.1:11434 (Docker's gateway IP on Linux).
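That 172.17.0.1 address isn't guaranteed either; the bridge IP can vary between installs. Either pass --add-host=host.docker.internal:host-gateway when starting the container (as the AnythingLLM command above already does), or look up the real gateway address:
# Docker's default bridge lives on docker0; its IPv4 address is what containers see as the host
ip -4 addr show docker0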
3. Agent Models Need to Be Smaller
The 70B model is brilliant at reasoning but sometimes "overthinks" simple tool-use commands. For agent tasks (like web search), a smaller 8B model responds more reliably to structured instructions.
4. Web Search: Still a Work in Progress
Getting AnythingLLM's web search agent to actually trigger searches proved frustrating. Even with DuckDuckGo configured and the agent enabled, the model often just hallucinated answers instead of searching. The troubleshooting continues—likely a workspace prompt or agent model configuration issue.
The Final Setup
After a week of configuration and testing, here's what I'm running:
Hardware:
- Framework Desktop (FRAMDACP06)
- AMD Ryzen AI Max+ 395 with 96GB iGPU allocation
- 128GB LPDDR5x-8000 unified memory
- 2.25TB NVMe storage
- 5GbE wired networking
Software:
- Ubuntu 25.10 (Kernel 6.17)
- ROCm 6.3 (user-space installation)
- Docker with Ollama, Open WebUI, and AnythingLLM
- Primary model: Llama 3.3 70B (Q4 quantization)
- Embedder: nomic-embed-text-v1
- Vector DB: LanceDB
Capabilities:
- Run state-of-the-art 70B models locally
- Process documents up to 32K tokens of context
- RAG across large document collections
- Access from any device on the network
- Zero API costs, complete privacy
What's Next?
This project isn't finished. On my roadmap:
- Fix web search agent: The tooling exists; I just need to nail down the configuration
- Explore thinking models: Qwen3 and DeepSeek-R1 for complex reasoning tasks
- Fine-tuning experiments: Training custom models on my own data
- Remote access beyond LAN: Secure access when away from home
- Image generation: Adding Stable Diffusion/Flux to the stack
Is It Worth It?
Absolutely—with caveats.
This setup is ideal if you:
- Value privacy and want AI processing to stay local
- Have heavy, ongoing AI usage that would rack up API costs
- Want to experiment with models, prompts, and configurations
- Enjoy the technical challenge of building systems
- Need to process sensitive documents that can't go to cloud APIs
It's probably not for you if:
- You just need occasional AI help (cloud APIs are easier)
- You want plug-and-play simplicity
- Budget is the primary concern (the hardware isn't cheap)
- You need the absolute cutting edge in model capabilities (cloud models update faster)
For me, as someone who does extensive research across health topics, entertainment history, and AI development itself, having a personal AI workstation has been transformative. The ability to query local documents, maintain complete privacy, and tinker endlessly with configurations makes this one of the most satisfying tech projects I've undertaken.
The future of AI isn't just in the cloud. Sometimes, the most powerful AI is the one sitting in your office, ready to work whenever you are.
Have questions about building your own local AI setup? Drop a comment below—I'm happy to share more details about any part of this journey.
[IMAGE: Screenshot of Open WebUI running Llama 3.3 70B]
Tags: AI, LLM, Framework, AMD, Ryzen AI Max, ROCm, Ollama, Open WebUI, AnythingLLM, local AI, self-hosted, machine learning