Granite 4.1 LLMs: A No-Nonsense Look Under the Hood

IBM’s Granite team just dropped a detailed technical walkthrough of how they built the Granite 4.1 LLMs, and honestly, it’s refreshing to see a company share this level of detail without the usual marketing fluff. The models themselves are dense decoder-only transformers at 3B, 8B, and 30B parameters, trained from scratch on roughly 15 trillion tokens. What caught my eye is that the 8B instruct model reportedly matches or beats the previous Granite 4.0-H-Small, which was a 32B MoE with 9B active parameters. That’s a solid efficiency gain, especially since all Granite 4.1 models ship under Apache 2.0.

The Architecture Choices

The core design is pretty standard for modern LLMs: Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input/output embeddings. Nothing exotic here, but that’s not a bad thing. The 3B model uses an embedding size of 2560, 40 layers, 40 attention heads with 8 KV heads, and an MLP hidden size of 8192. The 8B raises the embedding size to 4096 and keeps the same layer and head counts, with an attention head size of 128 and an MLP hidden size of 12800. The 30B keeps the 4096 embedding size but goes to 64 layers, 32 heads, and a much larger MLP hidden size of 32768. All three share the same training pipeline and data strategy, which is smart for consistency.
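To get a feel for what GQA buys at long context, here is a rough back-of-the-envelope sketch of KV-cache size for the 8B configuration. It assumes 8 KV heads (the post only gives that number explicitly for the 3B), bf16 storage at 2 bytes per value, 512K meaning 512 × 1024 tokens, and no cache compression. The function name `kv_cache_bytes_per_token` is made up for this example.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of keys and values stored per token across all layers."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# 8B figures from the post: 40 layers, 40 query heads, head size 128.
# ASSUMPTION: 8 KV heads, carried over from the 3B description.
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 40, 40, 8, 128

gqa_bytes = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
# Hypothetical comparison: every query head gets its own KV head (no sharing).
mha_bytes = kv_cache_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM)

queries_per_kv_head = Q_HEADS // KV_HEADS  # query heads sharing each KV head
context_tokens = 512 * 1024                # ASSUMPTION: 512K = 524,288 tokens

print(queries_per_kv_head, "query heads per KV head")                    # 5
print(gqa_bytes // 1024, "KiB per token with GQA")                       # 160
print(mha_bytes // 1024, "KiB per token without KV sharing")             # 800
print(gqa_bytes * context_tokens / 2**30, "GiB for one 512K sequence")   # 80.0
```

Under those assumptions a single full-length sequence would still need about 80 GiB of cache, which is worth keeping in mind when reading the long-context claims further down.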

Pre-Training: Five Phases of Progressive Refinement

This is where things get interesting. Instead of a single training run, Granite 4.1 uses five distinct phases, each with a different data mix and learning rate schedule. The idea is to start broad and gradually shift toward higher-quality, domain-specific data. Phase 1 runs on 10 trillion tokens with a power LR schedule. The mix is CommonCrawl (59%), code (20%), technical content (10.5%), math (7%), multilingual (2%), and domain-specific data (1.5%). That’s a lot of web data, but it’s the foundation.
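To put those percentages in absolute terms, here is a small sketch that turns the Phase 1 mix into per-source token budgets. The shares and the 10 trillion total come from the post; the dictionary keys and the helper `token_budgets` are names invented for illustration.

```python
PHASE1_TOKENS = 10e12  # 10 trillion tokens, per the post

# Phase 1 shares in percent, as quoted in the post.
PHASE1_MIX = {
    "commoncrawl": 59.0,
    "code": 20.0,
    "technical": 10.5,
    "math": 7.0,
    "multilingual": 2.0,
    "domain_specific": 1.5,
}


def token_budgets(mix_percent: dict, total_tokens: float) -> dict:
    """Split a token budget across sources according to percentage shares."""
    total_share = sum(mix_percent.values())
    if abs(total_share - 100.0) > 1e-6:
        raise ValueError(f"mix sums to {total_share}, expected 100")
    return {name: total_tokens * share / 100.0
            for name, share in mix_percent.items()}


budgets = token_budgets(PHASE1_MIX, PHASE1_TOKENS)
for name, tokens in budgets.items():
    print(f"{name:16s} {tokens / 1e12:5.2f}T tokens")
```

Even at 7%, math gets about 0.7 trillion tokens in this phase, so "small share" is relative.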

Phase 2 drops to 2 trillion tokens and pivots hard toward math and code. Math jumps from 7% to 35% (a 5x increase), code from 20% to 30%, and they introduce 9% synthetic data. CommonCrawl gets slashed to 12% and filtered for quality. This phase is all about building reasoning capabilities without losing general language understanding.

Phase 3 is where mid-training begins, with 2 trillion tokens and an exponential decay LR. The data mix becomes more balanced: CommonCrawl-HQ, math, and code each at ~16.67%, plus synthetic (8.5%), technical (12.5%), multilingual (4.5%), and a notable 12.5% long chain-of-thought data. They also start blending in instruction tuning data at 7.5% for language and 4.5% for code. This is where the model starts learning to follow instructions and reason step by step.
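The post doesn’t describe how the mix is enforced in the data loader. One common way to do it is to pick the source of each training document at random with probability proportional to its share. The sketch below does that with the Phase 3 numbers; `sample_sources` and the source names are hypothetical.

```python
import random
from collections import Counter

# Phase 3 shares in percent, as quoted in the post (the ~16.67% entries are rounded).
PHASE3_MIX = {
    "commoncrawl_hq": 16.67,
    "math": 16.67,
    "code": 16.67,
    "synthetic": 8.5,
    "technical": 12.5,
    "multilingual": 4.5,
    "long_cot": 12.5,
    "language_instructions": 7.5,
    "code_instructions": 4.5,
}


def sample_sources(mix, n, seed=0):
    """Draw n source names with probability proportional to their share."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]  # relative weights, no need to sum to 1
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(PHASE3_MIX, n=100_000)
counts = Counter(draws)
for name, target in PHASE3_MIX.items():
    sampled = 100 * counts[name] / len(draws)
    print(f"{name:22s} target {target:5.2f}%  sampled {sampled:5.2f}%")
```

A real pipeline would more likely weight by tokens than by documents, but the principle is the same.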

Phase 4 is a refinement stage at only 0.5 trillion tokens, with a linear LR decay to zero. The data mix shifts even more toward quality: CommonCrawl-HQ at 40%, code and math at 20% each, with smaller amounts of long CoT (6%), code instructions (5%), and language instructions (9%). This is essentially a high-quality annealing phase to polish the model.
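The post names three schedule shapes across the phases (power, exponential decay, linear decay to zero) without giving formulas or hyperparameters. The sketch below shows one plausible form of each, just to make the shapes concrete. Every peak rate, warmup length, exponent, and step count here is a placeholder, not a value from IBM.

```python
def power_lr(step, peak_lr=3e-4, warmup_steps=1_000, exponent=0.5):
    """Linear warmup to peak_lr, then decay proportional to step ** -exponent."""
    s = step + 1  # 1-based step so the first update has a non-zero rate
    if s <= warmup_steps:
        return peak_lr * s / warmup_steps
    return peak_lr * (warmup_steps / s) ** exponent


def exponential_lr(step, total_steps, start_lr=1e-4, end_lr=1e-5):
    """Geometric interpolation from start_lr down to end_lr."""
    frac = min(step, total_steps) / total_steps
    return start_lr * (end_lr / start_lr) ** frac


def linear_to_zero_lr(step, total_steps, start_lr=1e-5):
    """Straight-line decay that reaches exactly zero at total_steps."""
    frac = min(step, total_steps) / total_steps
    return start_lr * (1.0 - frac)


for step in (0, 500, 1_000, 10_000, 100_000):
    print(step,
          power_lr(step),
          exponential_lr(step, total_steps=100_000),
          linear_to_zero_lr(step, total_steps=100_000))
```

The practical difference is in the tail: a power schedule never reaches zero, which makes it easy to keep training, while the linear schedule in the final phase deliberately anneals all the way down.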

Phase 5 is long-context extension, taking the context window from 4K to 512K tokens in three stages: 32K, 128K, and finally 512K. For the 512K extension, they use 80% books and 20% code repository data (only for the 8B and 30B models). They do a model merge after each stage to avoid degrading short-context performance, which is a clever trick. The RULER benchmark results show the 8B base model scoring 83.6 at 32K, 79.1 at 64K, and 73.0 at 128K. Not bad for a dense 8B model.
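The post doesn’t say which merge method, which checkpoints, or what weights are used. A common and simple option is linear interpolation between the checkpoint from before a stage and the one after it, and that is what the sketch below assumes. The toy checkpoints are plain dicts of flat weight lists, and `merge_checkpoints` and `alpha` are names chosen for this example.

```python
def merge_checkpoints(base, extended, alpha=0.5):
    """Return (1 - alpha) * base + alpha * extended for every parameter."""
    if base.keys() != extended.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(extended[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * e
            for b, e in zip(base[name], extended[name])
        ]
    return merged


# Toy "checkpoints": parameter name -> flat list of weights.
before_stage = {"layer0.w": [0.10, -0.20], "layer0.b": [0.00]}
after_stage = {"layer0.w": [0.30, -0.10], "layer0.b": [0.04]}

merged = merge_checkpoints(before_stage, after_stage, alpha=0.5)
print(merged)
```

The intuition is that the merged weights stay close to the short-context solution while picking up part of the long-context update; `alpha` controls that trade-off.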

Supervised Fine-Tuning and Reinforcement Learning

The SFT stage uses about 4.1 million high-quality curated samples, filtered through an LLM-as-Judge framework. I’d love to see more details on that judge model and the exact filtering criteria, but the post doesn’t go into it. The RL stage uses on-policy GRPO with DAPO loss (from Yu et al., 2025), which is a relatively new approach. They apply it in multiple stages to systematically strengthen math, coding, instruction following, and general chat. This multi-stage RL pipeline is becoming more common, and it makes sense: you can’t optimize everything at once without breaking something.
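For readers who haven’t met GRPO or DAPO, here is a minimal pure-Python sketch of the core ideas as I understand them from the papers, not IBM’s implementation: rewards are standardized within a group of responses to the same prompt, groups with identical rewards are skipped, and the clipped surrogate loss uses separate lower and upper clip bounds and is averaged over tokens. The clip values, the use of population standard deviation, and the toy rewards and log-probs are all illustrative assumptions.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # ASSUMPTION: population std
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards):
    """DAPO-style dynamic sampling keeps only groups with mixed outcomes."""
    return max(rewards) != min(rewards)


def dapo_token_loss(new_logps, old_logps, advantages,
                    clip_low=0.2, clip_high=0.28):
    """Clipped surrogate loss averaged over every token in the group.

    new_logps / old_logps: one list of per-token log-probs per response.
    advantages: one scalar per response, shared by all of its tokens.
    """
    total, n_tokens = 0.0, 0
    for resp_new, resp_old, adv in zip(new_logps, old_logps, advantages):
        for lp_new, lp_old in zip(resp_new, resp_old):
            ratio = math.exp(lp_new - lp_old)
            # Decoupled clip range: the upper bound is looser than the lower one.
            clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return -total / n_tokens  # negated because optimizers minimize


# Toy group: four responses to one prompt, reward 1.0 = correct.
rewards = [1.0, 0.0, 0.0, 1.0]
old_logps = [[-1.0, -1.2], [-0.8], [-2.0, -1.5, -0.3], [-0.9]]
new_logps = [list(resp) for resp in old_logps]  # fully on-policy: ratio is 1

if has_learning_signal(rewards):
    advantages = group_advantages(rewards)
    loss = dapo_token_loss(new_logps, old_logps, advantages)
    print(advantages, loss)
```

Because the loss is averaged over tokens rather than over responses, the long incorrect response in the toy group carries more weight than the short correct ones.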

What I Like and What I’d Question

I appreciate that IBM published this level of detail. The five-phase pre-training strategy is well thought out, and the progressive data annealing is something more teams should adopt. The fact that the 8B model beats a 32B MoE from the previous generation is genuinely impressive, though I’d want to see independent benchmarks before getting too excited. The Apache 2.0 license is also a big plus for the community.

That said, I’m a bit skeptical about the 512K context window. The RULER scores for the 8B base model drop from 83.6 at 32K to 73.0 at 128K, the figures quoted here stop well short of 512K, and I wonder how well the model actually uses that full context in practice. Also, the SFT data curation process feels like a black box. An LLM-as-Judge is fine, but without knowing the judge’s biases or the filtering thresholds, it’s hard to evaluate the quality.

Overall, Granite 4.1 looks like a solid family of models with a thoughtful training pipeline. The focus on data quality over quantity, the staged pre-training, and the multi-stage RL are all good practices. I’ll be keeping an eye on the community’s experience with these models, especially the 8B instruct variant.
