One of the biggest challenges when getting started with local LLMs is choosing the right model for your use case. There are hundreds of options available, all with different strengths, hardware requirements, context sizes, and reasoning capabilities.
A model that works well for a coding assistant might perform poorly as an automation agent. A model optimized for reasoning may feel too slow for interactive use. In the end, a lot of it comes down to the hardware you have available.
This article is meant to help you make practical choices when selecting a local LLM, whether you are building:
- local “copilot” style coding assistants
- terminal agents
- workflow automation
- tool-calling assistants
- document analysis pipelines
- research agents
The goal is not to find the “best” model overall, but the best model for your hardware and workload.
Dense vs MoE Models
One of the first decisions you will encounter is whether to use a dense model or an MoE (Mixture of Experts) model.
You can usually recognize MoE models from names like:
- 35B-A3B
- 57B-A8B
The A<number>B suffix indicates how many billions of parameters are active for each generated token during inference.
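If you script your model downloads or benchmarks, this naming convention can be parsed mechanically. The following is a minimal sketch: `parse_moe_name` is a hypothetical helper, and it assumes the name contains the `<total>B-A<active>B` pattern, which not every model family follows.

```python
import re
from typing import Optional, Tuple

# Matches names such as "35B-A3B": total size first, then "A<number>B" for the active size.
_MOE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", re.IGNORECASE)


def parse_moe_name(name: str) -> Optional[Tuple[float, float]]:
    """Return (total_billions, active_billions), or None if the pattern is absent."""
    match = _MOE_PATTERN.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


for model_name in ("35B-A3B", "57B-A8B", "14B"):
    print(model_name, "->", parse_moe_name(model_name))
```

A result of `None` only means the name does not follow this convention. It does not prove the model is dense.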
Dense Models
Dense models activate all parameters for every generated token.
Examples:
- 7B
- 14B
- 27B
- 32B
With dense models, larger generally means better quality:
- stronger reasoning
- better factual recall
- more stable output
- fewer hallucinations
- more reliable code generation
The downside is significantly higher hardware requirements and slower inference speeds.
For example, a 14B dense model continuously runs all 14 billion parameters during inference, which quickly becomes expensive in terms of VRAM and memory bandwidth. Once you move beyond 14B–32B models, consumer hardware starts becoming a serious limitation unless you have a high-end GPU setup.
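The VRAM and bandwidth cost can be approximated with some back-of-the-envelope arithmetic. The sketch below assumes that the weights dominate memory use and that generating one token requires reading every weight once, which gives a rough ceiling on speed. The 500 GB/s bandwidth figure is an illustrative assumption, not a measurement of any particular GPU, and both function names are hypothetical.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def decode_speed_upper_bound(
    params_billions: float, bits_per_weight: float, bandwidth_gb_per_s: float
) -> float:
    """Rough ceiling on tokens/second if every weight is read once per token."""
    return bandwidth_gb_per_s / weights_size_gb(params_billions, bits_per_weight)


BANDWIDTH_GB_PER_S = 500.0  # assumed, illustrative value

for bits in (16, 8, 4):
    size = weights_size_gb(14, bits)
    speed = decode_speed_upper_bound(14, bits, BANDWIDTH_GB_PER_S)
    print(f"14B dense at {bits}-bit: ~{size:.0f} GB of weights, at most ~{speed:.0f} tokens/s")
```

Real numbers will be worse than this estimate, since the context cache, activations, and runtime overhead also consume memory and bandwidth.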
MoE (Mixture of Experts) Models
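MoE models only activate part of the model (a subset of "experts") for each token. This allows very large models to run at usable speeds on relatively limited hardware, although the full set of weights still has to fit in memory, whether that is VRAM, unified memory, or system RAM with offloading.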
For example:
35B-A3B
- Total parameters: 35B
- Active parameters per token: only 3B
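To make "active parameters" more concrete, here is a toy sketch of top-k routing, a common way MoE layers pick which experts to run. It is a simplified illustration and not the implementation of any specific model: the experts are plain scalar functions, the gate scores are hard-coded rather than produced by a learned router, and the names `top_k_route` and `moe_forward` are hypothetical.

```python
import math


def top_k_route(gate_scores, k):
    """Select the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    best = gate_scores[chosen[0]]
    exps = [math.exp(gate_scores[i] - best) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, gate_scores, k=2):
    """Evaluate only the routed experts and blend their outputs by routing weight."""
    routed = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    used = [i for i, _ in routed]
    return output, used


# Four toy "experts"; a real expert is a feed-forward network, not a scalar multiply.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, -1.0, 1.5]  # assumed router output for one token

output, used = moe_forward(1.0, experts, gate_scores, k=2)
print(f"experts used: {used} of {len(experts)}")
```

Only two of the four experts do any work for this token, but all four still have to be kept in memory because the next token may be routed elsewhere.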
In practice, MoE models often feel like a smaller dense model with access to broader knowledge. They can be surprisingly capable for their size and hardware footprint, especially for:
- long-context tasks
- tool orchestration
- automation agents
- structured workflows
- reasoning-heavy pipelines
However, there are tradeoffs.
Compared to dense models targeting similar hardware requirements, MoE models can sometimes show:
- less consistent code generation
- weaker structured output reliability
- more hallucinations
- unstable formatting in longer generations
That said, this varies heavily between model families and with training quality. Modern MoE models such as Gemma’s recent variants can perform extremely well in real-world usage, especially for agent-style systems and longer reasoning tasks. For example, I got significantly better results from Gemma 4 26B-A4B (6-bit) on my coding agent workflows than from the dense Qwen 3 14B (6-bit) model.
Which Type Should You Use?
There is no universal answer, but some patterns appear consistently.
Dense models are usually better for:
- coding assistants
- deterministic automation
- repetitive workflows
- factual precision
MoE models are often good for:
- research agents
- long reasoning chains
- multi-step tool orchestration
- assistants that call external tools
- memory-heavy workflows
- lower-end hardware setups
If your system depends heavily on predictable structured output, dense models are generally the safer choice, but they require better hardware.
If your goal is broader reasoning capability within limited hardware constraints, MoE models can offer surprisingly strong value.
Quantization
Quantization is another major factor when running local models.
Quantization reduces model size by lowering numerical precision. Lower precision means:
- lower VRAM usage
- faster inference
- smaller files
…but also lower output quality.
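The precision loss can be illustrated with a small experiment. The sketch below applies plain symmetric round-to-nearest quantization to random values and measures how far the restored values drift from the originals. This is a deliberately simplified scheme: real formats such as Q4_K_M quantize in blocks with their own scaling data and behave better than this, and the random "weights" are stand-ins for illustration only.

```python
import random


def quantize_roundtrip(weights, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic stand-in data

errors = {}
for bits in (8, 6, 4, 3, 2):
    errors[bits] = mean_abs_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit: mean absolute error {errors[bits]:.4f}")
```

The error grows as the bit width shrinks, which is the basic reason very low-bit quantizations lose output quality.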
In my own testing, aggressively quantized models would sometimes:
- randomly omit parts of generated code
- lose formatting consistency
- produce incomplete structured output
- degrade noticeably during long responses
For agent workflows or automation systems, this can become a serious reliability issue. If your hardware allows it, 6-bit or 8-bit quantizations are usually the safest choice for reliable output.
Common Quantization Levels
| Quantization | Size Relative to 16-bit | Quality Impact | Recommendation |
|---|---|---|---|
| Q8 | ~50% | Minimal | Best quality if VRAM allows |
| Q6_K | ~38% | Very small | Excellent balance |
| Q4_K_M | ~25% | Moderate | Most common local setup |
| Q3 | ~19% | Significant | Usually poor for coding |
| Q2 | ~13% | Severe | Mostly unusable |
In practice:
- Q6_K is often the sweet spot for consumer hardware
- Q8 gives the best reliability
- Q4 is acceptable for experimentation
- Q3 and below tend to become unreliable for serious work
What Kind of Models Should I Run on My Hardware?
The exact requirements depend on context size, quantization format, and inference engine, but the table below should give a realistic idea of what kind of local LLMs are practical on different hardware setups. Macs benefit from a unified memory architecture, which lets the GPU use a large share of system RAM as GPU memory. On traditional PC setups, you are typically limited by the amount of VRAM available on your GPU.
| Available VRAM | Realistic Model Options | Typical Use Cases |
|---|---|---|
| 6–8 GB | 7B dense models, small Q4 MoE models | Basic coding assistants, chat, lightweight automation |
| 10–12 GB | 14B dense at Q4/Q6, smaller MoE models | Coding assistants, terminal agents, structured workflows |
| 16 GB | 14B dense at Q8, 27B MoE at Q4/Q6 | Agent workflows, tool-calling systems, longer reasoning tasks |
| 24 GB | 27B dense at Q4/Q6, larger MoE models | Advanced coding agents, research assistants, multi-step automation |
| 32 GB+ | 32B+ dense models, large MoE models at higher precision | Heavy reasoning workloads, large-context systems, high-quality local assistants |
| 48 GB+ | Most open-weight models at comfortable quantization levels | Professional local AI setups, multi-agent systems, large-scale experimentation |
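A quick way to sanity-check a row of this table is to estimate whether the weights fit in a given VRAM budget. The sketch below is a rough rule of thumb rather than a precise calculator: `Candidate` and `fits` are hypothetical names, the nominal bits per weight ignore the extra scaling data that real quantized files carry, and the 0.9 headroom factor (leaving 10% for context and runtime overhead) is an assumption you should tune for your own setup.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    total_params_billions: float  # for MoE models, use total (not active) parameters
    bits_per_weight: float        # nominal value, e.g. 4 for Q4, 6 for Q6, 8 for Q8


def weights_gb(candidate: Candidate) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10^9 bytes)."""
    return candidate.total_params_billions * candidate.bits_per_weight / 8


def fits(candidate: Candidate, vram_gb: float, headroom: float = 0.9) -> bool:
    """True if the weights fit within the assumed usable share of VRAM."""
    return weights_gb(candidate) <= vram_gb * headroom


candidates = [
    Candidate("14B dense, Q8", 14, 8),
    Candidate("27B dense, Q4", 27, 4),
    Candidate("27B dense, Q6", 27, 6),
    Candidate("32B dense, Q8", 32, 8),
]

VRAM_GB = 24  # example budget
for c in candidates:
    verdict = "fits" if fits(c, VRAM_GB) else "does not fit"
    print(f"{c.name}: ~{weights_gb(c):.1f} GB -> {verdict} in {VRAM_GB} GB")
```

Long context windows can need considerably more headroom than this, so treat a "fits" result as a starting point for testing rather than a guarantee.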
Conclusion
Choosing a local LLM is ultimately about balancing quality, speed, memory usage, and the type of workload you are trying to run. Dense and MoE models both have their strengths, and quantization can dramatically change how a model behaves in practice. A smaller high-quality model running at higher precision can often outperform a much larger aggressively quantized one, while well-trained MoE models can provide surprisingly strong reasoning and context handling on limited hardware. In the end, benchmarks and parameter counts only tell part of the story. The best approach is to test models against your actual workflows and find the setup that gives the most reliable results on your hardware.
