This week I remembered that I have an NVIDIA 4090.
For the last year, it has basically been a very expensive HDMI port for playing Baldur's Gate 3. It sat in my "Dark Tower" PC humming quietly while I paid monthly subscriptions to OpenAI and Anthropic so I could rent intelligence from someone else's servers.
Then I saw an ad for a new "Virtual Avatar" service. Personalized conversational agent, video generation, the whole futuristic package. I clicked, tried the free tier, and yeah, it was impressive right up until it did the thing every shiny AI product does and hit me with the paywall.
"Buy credits to continue." "Upgrade for image generation."
I stared at the pricing, then I stared at my tower, and the absurdity finally landed. I was acting like compute is some scarce resource I need to beg for, while a 4090 was sitting five feet away from me, bored.
So I decided to replicate the service locally. Not because I'm trying to do anything edgy, but because I want control. I want the knob-turning, the data staying in my house, and the ability to build without asking permission or swiping a card every ten minutes.
The Dark Tower: The Boring Details
Since builders always ask, here's the box:
- CPU: AMD Ryzen 9 5900X
- GPU: NVIDIA GeForce RTX 4090
- Motherboard: ASRock X570 PG Velocita
- RAM: 64GB DDR4 (2x G.SKILL F4-32GVK sticks)
Most of my coding happens inside Ubuntu via WSL on that same machine, which I otherwise treat as headless. For UI-heavy stuff like LM Studio and ComfyUI, I pop back into Windows, click the buttons, then go back to living in my terminal like a normal person.
The Education Gap
Six years ago I got laid off from my first job and did the responsible thing: I went back to school to learn "AI."
This was pre-Transformer. It was the era of data cleanup, linear algebra, and waiting for a laptop to die on a CSV file, so I finished the course and filed AI under Data Science Drudgery.
What's funny is I didn't "miss" AI after that. I'm a power user of ChatGPT, Gemini, and Claude, and they live inside my Cursor workflow for coding tasks. I've been getting value from the cloud models constantly.
The thing I got wrong was assuming local would be a pain in the ass. I treated it like the Linux-on-your-desktop phase of my life: technically possible, emotionally expensive, and guaranteed to consume a weekend.
Turns out the game changed. It isn't "AI as data science" the way I learned it. It's engineering, tooling, pipelines, and iteration speed. It's less about proving theorems and more about wiring systems together until the machine starts doing something useful.
Phase 1: The Mind (LM Studio)
I started with the brain. Download LM Studio, grab a model, point it at the GPU, and see if I could get something that felt less like a corporate FAQ and more like a collaborator.
For the initial chatbot I used: mn-captainerisnebula-chimera-v1.1-thinking-claudeopus4.5-12b-heretic-uncensored (Q8_0)
Even at 12B, local inference changes your posture. It's not literally instant, but it feels immediate compared to round-tripping through the cloud, and "cost" becomes a one-time hardware decision instead of a constant mental tax.
The bigger revelation was how much personality comes from the frame you put around the model. People dunk on "prompt engineering" like it's a scam job title, but the truth is obvious the moment you use local models seriously. Prompts are interfaces. They are the control surface, and a single adjective can change the entire chemical composition of the output.
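Concretely, the frame is just a system prompt riding along with every request. Here's a minimal sketch against LM Studio's local server, assuming the default port (1234) and its OpenAI-compatible chat endpoint; the function names, persona text, and `local-model` placeholder are my own, not part of any official API:

```python
import json
import urllib.request

# LM Studio's built-in server speaks the OpenAI chat-completions format
# on port 1234 by default. The persona lives entirely in the system
# prompt: swap one adjective there and the whole character shifts.
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_chat(persona: str, user_msg: str, model: str = "local-model") -> dict:
    """Frame a request: the system prompt is the control surface."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.8,
    }

def ask(payload: dict) -> str:
    """POST to the local server and pull out the reply text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_chat(
    "You are a sarcastic cyberpunk mechanic, not a corporate FAQ.",
    "Why does my avatar keep smiling like a stock photo?",
)
# ask(payload) returns the reply once LM Studio is actually serving.
```

Changing the persona string is the whole "interface" change: no redeploy, no credits, just a different system prompt on the next request.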
Text was the first win. Visuals were the next obsession.
Phase 2: The Spaghetti Monster (ComfyUI)
I wanted to generate a photo-realistic avatar that matched the persona. I tried the easy tools first and they failed in the predictable way: everything came out generic, smooth, and vaguely stock-photo. Too many training wheels, too many defaults, too much "approved content."
Research kept pointing to ComfyUI, which I had been avoiding because every screenshot looked like someone spilled pasta on a motherboard. Then I installed it, opened localhost, and stared at the node graph. It looked like spaghetti, but it also looked like a pipeline, and that was the moment it clicked.
This wasn't "art." This was a system. It felt like n8n for images, except the boxes weren't HTTP calls and cron triggers, they were latent spaces and samplers.
Once you label the parts, it becomes almost comforting:
- Model / checkpoint: The brain.
- CLIP: The eyes that read text.
- KSampler: The chaos engine that actually makes the image happen.
- VAE: The renderer that turns latent soup into pixels.
My setup ended up centered around Z-Image-Turbo (BF16, T2I NP version). I'm not going to pretend I understand every component, but I do understand the workflow now, and that matters more than pretending I'm a diffusion researcher.
I started turning knobs like a goblin:
- cfg is basically how hard the model clings to your prompt.
- denoise is how much hallucination you allow.
- steps is your time budget.
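Those knobs are scriptable, too. ComfyUI exposes an HTTP API (port 8188 by default), and a workflow exported with "Save (API Format)" is plain JSON, so a sketch like this can sweep settings without touching the browser. The node id and stub workflow below are placeholders, not my real graph:

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"  # ComfyUI's default API port

def turn_knobs(workflow: dict, cfg: float, steps: int, denoise: float) -> dict:
    """Override sampler settings on every KSampler node in an
    API-format workflow (the JSON from 'Save (API Format)')."""
    for node in workflow.values():
        if node.get("class_type") == "KSampler":
            node["inputs"]["cfg"] = cfg          # prompt adherence
            node["inputs"]["steps"] = steps      # time budget
            node["inputs"]["denoise"] = denoise  # allowed hallucination
    return workflow

def queue_prompt(workflow: dict) -> None:
    """Send the workflow to the local ComfyUI queue."""
    data = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(
        COMFY_URL, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

# A stub standing in for a real exported workflow:
workflow = {"3": {"class_type": "KSampler",
                  "inputs": {"cfg": 7.0, "steps": 20, "denoise": 1.0}}}
workflow = turn_knobs(workflow, cfg=4.5, steps=12, denoise=0.85)
# queue_prompt(workflow) fires it at a running ComfyUI instance.
```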
After enough iterations, I got the avatar. Bright curly red hair, cybernetic body, the vibe I wanted. It worked, and I felt like a wizard for about thirty minutes. Then I tried to make her do something specific and the machine reminded me who was really in charge.
Phase 3: Bias, Tokens, and the LoRA Wall
The scenario I wanted was simple in my head: a cyberpunk mechanic, crouching, fixing an engine in neon rain. Not "portrait of woman," but an actual moment. A pose. A story.
I spent hours prompting and got the same output every time: generic portraits. Clean framing. Pretty faces. No engine, no crouch, no action.
That's when I learned two brutal lessons.
First, dataset bias is not a theory you argue about on Twitter. It's a wall you slam into at 2 a.m. The base model I was using in that phase had a clear bias in what it considered "default," and when I tried to push it toward a consistent red-haired, curly-haired character, it felt like I was wrestling the weights themselves.
Second, tokens have strength. "Fixing" is weak. "Wrench" is strong. The model doesn't understand intent the way a human does; it understands association density. If you want mechanics, you need objects and phrasing that are heavily represented in the training data, and even then you might get cosplay instead of a believable action pose.
At a certain point you stop asking, "what is the right prompt?" and start asking, "how do I teach the model a new concept?"
Enter LoRA.
If a foundation model is a university education, a LoRA is a weekend workshop. You don't rewrite the whole brain. You bolt on a specialized skill and pray it latches.
This is where vision came in. For generating captions for LoRA training images, I used: llama-joycaption-beta-one-hf-llava (Q8_0)
Also, minor correction to the usual meme: not every model wants "tag soup." The Lumina-style workflow is less danbooru tags and more chunky text blocks. Humans speak English, and this one actually wants something closer to English, which is nice until it isn't.
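The captioning call is just another local vision request. A sketch, assuming LM Studio (or any OpenAI-compatible server) is hosting the JoyCaption model; the instruction wording and function name are mine:

```python
import base64
from pathlib import Path

def caption_request(image_path: str,
                    model: str = "llama-joycaption-beta-one-hf-llava") -> dict:
    """Build an OpenAI-style vision request for a locally served
    captioner. The instruction asks for prose, not tag soup, since
    this training flow wants chunky natural-language captions."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in two or three plain sentences."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

Trainers in the kohya lineage generally expect the caption as a sidecar `.txt` next to each image, so the loop becomes: build the request, send it to the local server, write the reply to `image_name.txt`, repeat for the whole dataset.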
So I went down the rabbit hole: Reddit threads, Civitai, weird community jargon, half-broken tutorials, and a growing suspicion that every "simple" AI workflow is secretly three scripts in a trench coat. I collected images, captioned them, trained, tweaked, retrained, and repeated that loop across Wednesday, Thursday, and Friday.
The Anticlimax
My LoRA failed.
The training data was noisy, the results came out warped, and the outputs had that unmistakable "something is deeply wrong with anatomy" energy. More Cronenberg than Cyberpunk.
But the failure still paid out because I learned the stack. I learned how to use ReActor for face swapping so I could keep identity consistent across generations, and I learned how to structure the workflow so each step has one job and the output of that job feeds the next. By the end of it, I had a pipeline that could reliably produce coherent, private, cost-free iterations.
Not perfect. Not finished. Not the mythical "virtual avatar service" clone. But real.
The Perspective Shift
I used to think local AI was a toy for people who couldn't afford cloud credits. I thought the cloud was the serious way and local was the hobbyist way.
Now I think I had it backwards.
Cloud is for tourists. Local is for residents.
Tourists rent a hotel, follow the guided tour, buy souvenirs, and leave. Residents learn the streets, make the tools their own, and build routines that compound. They don't need permission to experiment, and they don't pay per thought.
Next Step: Jarvis, but Scuffed
The funniest part is I already had a place to plug this in.
I run a Telegram bot in WSL that does a bunch of little things for me. It's a general-purpose utility bot, and until now I was piping some of its tasks through ChatGPT because it was convenient. Now I can point it at LM Studio and run a local model for the stuff that matters, plus actual conversation in Telegram, plus the kind of tool-using workflow that starts to feel dangerously close to having my own personal Jarvis.
Just without the billionaire budget.
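The swap itself is small because LM Studio's server mimics the OpenAI chat-completions API, so the wiring is mostly a routing decision. A sketch, with hypothetical task names and the default local port:

```python
# Routing for the Telegram bot: which jobs stay on the tower and which
# (if any) still go out to the cloud. Task names are made up for
# illustration; a real bot's commands would differ.
LOCAL_TASKS = {"chat", "summarize", "caption"}

LOCAL_ENDPOINT = "http://localhost:1234/v1/chat/completions"
CLOUD_ENDPOINT = "https://api.openai.com/v1/chat/completions"

def pick_endpoint(task: str) -> str:
    """Private or frequent jobs hit LM Studio on the local box; since
    it speaks the same chat-completions format as the cloud API, the
    rest of the bot's request code doesn't change."""
    return LOCAL_ENDPOINT if task in LOCAL_TASKS else CLOUD_ENDPOINT
```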
Jarvis, run it local.
