Notes on Self-Hosting LLMs (Early 2026)

What to expect when running local models on consumer hardware

Models change constantly. This advice hasn’t, at least not yet.

Whether self-hosting LLMs makes sense really depends on what you’re trying to do.

If your goal is to learn and play around, it’s definitely possible. Worth doing, even. Especially if you’re thinking ahead to when models inevitably get better. If your goal is to replace frontier labs and paid products, that’s realistically not happening.

Model Tiers

At the top you have the frontier paid-API providers: OpenAI, Google, Anthropic. These will always be the most capable. Below that are cheaper alternatives that are roughly 90% as good. These tend to be frontier-adjacent open-weight models, many from Chinese labs: DeepSeek, GLM, Qwen, Kimi.

These models are generally open source or open weight, and you can self-host them. The problem is that at this capability level, they’re huge. In practice, it’s usually cheaper to just use the hosted API versions.

Dense vs MoE: 32B vs 256B-32A

When looking at models, you'll see two architectural types. Dense models are what's meant when something says "32B parameters." Over the past year or so, most new architectures have moved toward Mixture of Experts (MoE), which is sparse: only a subset of the total parameters is active for any given token.

For self-hosting, this distinction doesn't help as much as you'd hope. You'll see MoE models labeled with both numbers, e.g. "256B-32A," where the second figure is the active parameter count, but you still need to load the entire model into memory. MoE sparsity reduces compute per token; it doesn't shrink your VRAM budget.

Because of that, there aren’t many dense models that are both strong and easy to self-host. With 32GB VRAM, you’re at the low end of what you’d need for something meaningfully capable.
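The memory math above is worth making concrete. A minimal sketch, assuming unquantized bf16/fp16 weights at 2 bytes per parameter (a rough baseline; quantization lowers it, and activations and KV cache add overhead on top):

```python
def weight_memory_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    """Approximate GB needed just to hold the weights in memory.

    total_params_b: TOTAL parameter count in billions. For MoE models this
    is the total, not the active count, because all experts must be
    resident even though only some fire per token.
    """
    return total_params_b * 1e9 * bytes_per_param / 1e9

# A dense 32B model at bf16:
print(weight_memory_gb(32))   # 64.0 GB -- already double a 32GB card
# A "256B-32A" MoE still needs all 256B loaded:
print(weight_memory_gb(256))  # 512.0 GB
```

The second number is why the frontier-adjacent open-weight models are usually cheaper to rent than to run.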

Practical Advice

On top of model selection, there’s the whole quantization rabbit hole. Different quant strategies, specific finetunes, etc. Good rule of thumb: check the model README for recommended quantization, or just go with the most downloaded version.
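To make the quant trade-off concrete, here's a rough size estimator. The bits-per-weight figures are ballpark numbers I'm assuming for common GGUF quant types; real file sizes vary because mixed-precision schemes quantize different tensors differently, so treat the output as an estimate only:

```python
# Assumed, approximate bits-per-weight for common GGUF quants.
QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def quant_size_gb(params_b: float, quant: str) -> float:
    """Estimated weight size in GB: billions of params * bits / 8."""
    return params_b * QUANT_BITS[quant] / 8

for q in QUANT_BITS:
    print(f"32B at {q}: ~{quant_size_gb(32, q):.1f} GB")
```

This is why a 32B model that's hopeless at fp16 on a 24-32GB card becomes plausible at Q4: roughly a quarter of the memory, usually with a modest quality hit.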

A solid resource is r/LocalLLaMA. People frequently ask “here’s my hardware, what’s the best model I can run?” and you’ll usually find current, practical answers.

On Mac, LM Studio has better Metal support than most alternatives. Most tooling targets CUDA first; support for Metal and Apple's unified memory is spottier, which makes things trickier.
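One wrinkle worth knowing: on Apple Silicon, only part of the unified memory is actually available to the GPU. macOS reports the real limit via Metal's recommendedMaxWorkingSetSize; the 75% cap below is just an assumed ballpark for a quick sanity check:

```python
def usable_gpu_memory_gb(total_ram_gb: float, cap: float = 0.75) -> float:
    """Rough usable GPU memory on a Mac with unified memory.

    cap is an assumption; the actual value varies by machine and is
    reported by Metal as recommendedMaxWorkingSetSize.
    """
    return total_ram_gb * cap

print(usable_gpu_memory_gb(64))  # 48.0 -- what a 64 GB Mac can realistically offer
```

So when sizing a model for a Mac, budget against this number, not the sticker RAM.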


So yeah, self-hosting LLMs is very cool and worth experimenting with. It's also very much not a straight path, and there are lots of sharp edges. Set expectations accordingly.