One of the biggest challenges when getting started with local LLMs is choosing the right model for your use case. There are hundreds of options available, all with different strengths, hardware requirements, context sizes, and reasoning capabilities.

A model that works well for a coding assistant might perform poorly as an automation agent. A model optimized for reasoning may feel too slow for interactive use. In the end, a lot of it comes down to the hardware you have available.

This article is meant to help you make practical choices when selecting a local LLM, whether you are building:

local “copilot” style coding assistants
terminal agents
workflow automation
tool-calling assistants
document analysis pipelines
research agents

The goal is not to find the “best” model overall, but the best model for your hardware and workload.

Dense vs MoE Models

One of the first decisions you will encounter is whether to use a dense model or an MoE (Mixture of Experts) model.

You can usually recognize MoE models from names like:

35B-A3B
57B-A8B

The A<number>B suffix gives the number of parameters, in billions, that are active for each generated token. The number before it is the total parameter count.
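
The naming convention above can be read mechanically. The following is a minimal sketch, assuming model names follow the `<total>B-A<active>B` pattern described here; the `ModelSize` type and `parse_size` helper are hypothetical names for illustration, not part of any library.

```python
import re
from typing import NamedTuple, Optional


class ModelSize(NamedTuple):
    total_b: float             # total parameters, in billions
    active_b: Optional[float]  # active parameters per token; None for dense models


# Matches "35B-A3B" style MoE names, or a plain dense size such as "14B".
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)B(?:-A(\d+(?:\.\d+)?)B)?", re.IGNORECASE)


def parse_size(name: str) -> Optional[ModelSize]:
    """Extract total and active parameter counts from a model name, if present."""
    match = _SIZE_RE.search(name)
    if match is None:
        return None
    total = float(match.group(1))
    active = float(match.group(2)) if match.group(2) else None
    return ModelSize(total, active)


print(parse_size("35B-A3B"))
print(parse_size("Qwen 3 14B"))
```

A name without the A<number>B suffix is treated as dense here, which is only a heuristic: some MoE models do not advertise their active parameter count in the name.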

Dense Models

Dense models activate all parameters for every generated token.

Examples:

With dense models, larger generally means better quality:

stronger reasoning
better factual recall
more stable output
fewer hallucinations
more reliable code generation

The downside is significantly higher hardware requirements and slower inference speeds.

For example, a 14B dense model uses all 14 billion parameters for every token it generates, which quickly becomes expensive in terms of VRAM and memory bandwidth. Once you move beyond 14B–32B models, consumer hardware becomes a serious limitation unless you have a high-end GPU setup.

MoE (Mixture of Experts) Models

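MoE models activate only a subset of their parameters (a few “experts”) for each token. This cuts the compute and memory bandwidth needed per token, so a very large model can run at usable speeds on relatively limited hardware. The full set of weights still has to be stored in VRAM or system RAM, so memory requirements follow the total parameter count, not the active one.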

For example:

35B-A3B
Total parameters: 35B
Active parameters per token: only 3B
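
To make the two numbers concrete, here is a rough back-of-the-envelope sketch. It assumes weight size is simply parameters times bits per weight, ignores the KV cache and runtime overhead, and uses hypothetical helper names (`weight_gb`, `moe_footprint`). All weights still have to be stored, while only the active ones are read for each token.

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a parameter count in billions."""
    return params_b * bits_per_weight / 8


def moe_footprint(total_b: float, active_b: float, bits_per_weight: float) -> dict:
    """Compare what an MoE model must store with what it reads per generated token."""
    return {
        "stored_gb": weight_gb(total_b, bits_per_weight),
        "read_per_token_gb": weight_gb(active_b, bits_per_weight),
    }


# A 35B-A3B model at a nominal 6 bits per weight.
print(moe_footprint(total_b=35, active_b=3, bits_per_weight=6))
# {'stored_gb': 26.25, 'read_per_token_gb': 2.25}
```

Under these assumptions the model needs roughly as much storage as a 35B dense model, but each token touches about as much data as a 3B dense model would, which is why it can feel fast.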

In practice, MoE models often feel like a smaller dense model with access to broader knowledge. They can be surprisingly capable for their size and hardware footprint, especially for:

long-context tasks
tool orchestration
automation agents
structured workflows
reasoning-heavy pipelines

However, there are tradeoffs.

Compared to dense models targeting similar hardware requirements, MoE models can sometimes show:

less consistent code generation
weaker structured output reliability
more hallucinations
unstable formatting in longer generations

That said, this varies heavily between model families and training quality. Modern MoE models, such as recent Gemma variants, can perform extremely well in real-world usage, especially for agent-style systems and longer reasoning tasks. For example, I got significantly better results on my coding agent workflows from Gemma 4 26B-A4B (6-bit) than from the dense Qwen 3 14B (6-bit) model.

Which Type Should You Use?

There is no universal answer, but some patterns appear consistently.

Dense models are usually better for:

coding assistants
deterministic automation
repetitive workflows
factual precision

MoE models are often good for:

research agents
long reasoning chains
multi-step tool orchestration
assistants that call external tools
memory-heavy workflows
lower-end hardware setups

If your system depends heavily on predictable structured output, dense models are generally the safer choice, but they require better hardware.

If your goal is broader reasoning capability within limited hardware constraints, MoE models can offer surprisingly strong value.

Quantization

Quantization is another major factor when running local models.

Quantization reduces model size by lowering numerical precision. Lower precision means:

lower VRAM usage
faster inference
smaller files

…but also lower output quality.

In my own testing, aggressively quantized models would sometimes:

randomly omit parts of generated code
lose formatting consistency
produce incomplete structured output
degrade noticeably during long responses

For agent workflows or automation systems, this can become a serious reliability issue. If your hardware allows it, 6-bit or 8-bit quantizations are usually the safest choice for reliable output.
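
One way to put a number on these failures is to measure how often a model's output parses as the structure you expect. The sketch below assumes your workflow expects a JSON object with known keys; `generate` is a hypothetical callable that you would wire to your own inference engine, and `json_pass_rate` is an illustrative name.

```python
import json
from typing import Callable, Iterable, Set


def json_pass_rate(
    generate: Callable[[str], str],
    prompts: Iterable[str],
    required_keys: Set[str],
) -> float:
    """Return the fraction of prompts whose output is a JSON object with all required keys."""
    prompt_list = list(prompts)
    if not prompt_list:
        return 0.0
    passed = 0
    for prompt in prompt_list:
        try:
            data = json.loads(generate(prompt))
        except json.JSONDecodeError:
            continue  # truncated or malformed output counts as a failure
        if isinstance(data, dict) and required_keys.issubset(data.keys()):
            passed += 1
    return passed / len(prompt_list)


# Stand-in for a real model: canned outputs, one complete, one truncated, one missing a key.
canned = {
    "a": '{"tool": "ls", "args": []}',
    "b": '{"tool": "ls"',
    "c": '{"tool": "ls"}',
}
rate = json_pass_rate(lambda p: canned[p], ["a", "b", "c"], {"tool", "args"})
print(f"{rate:.2f}")
```

Running the same prompt set against two quantizations of the same model gives a direct comparison for your own workload, which tends to be more informative than general benchmarks.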

Common Quantization Levels

Quantization	Size Relative to FP16	Quality Impact	Recommendation
Q8	~50%	Minimal	Best quality if VRAM allows
Q6_K	~38%	Very small	Excellent balance
Q4_K_M	~25%	Moderate	Most common local setup
Q3	~19%	Significant	Usually poor for coding
Q2	~13%	Severe	Mostly unusable
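
The relative sizes in the table can be turned into rough file-size estimates. This sketch assumes the percentages are relative to 16-bit weights (2 bytes per parameter), which matches the nominal bit widths; real files often differ somewhat because formats mix precisions and store extra metadata. `RELATIVE_SIZE` and `estimated_size_gb` are illustrative names.

```python
# Relative sizes copied from the table above, assumed to be relative to 16-bit weights.
RELATIVE_SIZE = {
    "Q8": 0.50,
    "Q6_K": 0.38,
    "Q4_K_M": 0.25,
    "Q3": 0.19,
    "Q2": 0.13,
}


def estimated_size_gb(params_b: float, quant: str) -> float:
    """Rough model size in GB for a parameter count given in billions."""
    fp16_gb = params_b * 2  # 2 bytes per parameter at 16-bit precision
    return fp16_gb * RELATIVE_SIZE[quant]


for quant in RELATIVE_SIZE:
    print(f"14B at {quant}: ~{estimated_size_gb(14, quant):.1f} GB")
```

Treat the results as a lower bound for planning: context (KV cache) and runtime overhead come on top of the weights.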

In practice:

Q6_K is often the sweet spot for consumer hardware
Q8 gives the best reliability
Q4 is acceptable for experimentation
Q3 and below tend to become unreliable for serious work

What Kind Of Models Should I Run on My Hardware?

The exact requirements depend on context size, quantization format, and inference engine, but the table below should give a realistic idea of which local LLMs are practical on different hardware setups. Apple Silicon Macs benefit from a unified memory architecture, which lets the GPU use system RAM for model weights. On traditional PC setups, you are typically limited by the amount of VRAM on your GPU.

Available VRAM	Realistic Model Options	Typical Use Cases
6–8 GB	7B dense models, small Q4 MoE models	Basic coding assistants, chat, lightweight automation
10–12 GB	14B dense at Q4/Q6, smaller MoE models	Coding assistants, terminal agents, structured workflows
16 GB	14B dense at Q8, 27B MoE at Q4/Q6	Agent workflows, tool-calling systems, longer reasoning tasks
24 GB	27B dense at Q4/Q6, larger MoE models	Advanced coding agents, research assistants, multi-step automation
32 GB+	32B+ dense models, large MoE models at higher precision	Heavy reasoning workloads, large-context systems, high-quality local assistants
48 GB+	Most open-weight models at comfortable quantization levels	Professional local AI setups, multi-agent systems, large-scale experimentation
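
For a model that is not listed, a simple fit check gives a first approximation. The sketch below assumes nominal bit widths for each quantization level and a fixed 2 GB headroom for context and runtime overhead; both are assumptions you should adjust, and `best_quant` is a hypothetical helper. For MoE models, pass the total parameter count, not the active one.

```python
from typing import Optional

# Nominal bits per weight, highest precision first. Real formats vary slightly.
QUANT_BITS = [("Q8", 8), ("Q6", 6), ("Q4", 4)]


def best_quant(params_b: float, vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-precision quantization whose weights fit, or None if none do."""
    for name, bits in QUANT_BITS:
        weights_gb = params_b * bits / 8
        if weights_gb + headroom_gb <= vram_gb:
            return name
    return None


print(best_quant(14, 16))  # 14B model on a 16 GB card
print(best_quant(7, 8))    # 7B model on an 8 GB card
print(best_quant(32, 8))   # 32B model on an 8 GB card
```

Long contexts can need far more than 2 GB of headroom, so a result that only just fits is worth verifying on the actual hardware.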

Conclusion

Choosing a local LLM is ultimately about balancing quality, speed, memory usage, and the type of workload you are trying to run. Dense and MoE models both have their strengths, and quantization can dramatically change how a model behaves in practice. A smaller high-quality model running at higher precision can often outperform a much larger aggressively quantized one, while well-trained MoE models can provide surprisingly strong reasoning and context handling on limited hardware. In the end, benchmarks and parameter counts only tell part of the story. The best approach is to test models against your actual workflows and find the setup that gives the most reliable results on your hardware.
