What It Is
Large language models are, at their core, vast arrays of numerical weights — the values that encode learned knowledge and determine how the model responds to input. Standard models store each weight as a 16-bit or 32-bit floating-point number. Quantisation is the practice of reducing that precision to save memory and computation. Taking that logic to its extreme, 1-bit quantisation represents every weight as a single binary value: either −1 or +1.
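The idea can be made concrete in a few lines. The sketch below (ours, not any particular library's API) binarises a weight matrix to {−1, +1}, keeping a single floating-point scale per matrix so the quantised weights stay close to the originals in magnitude; the per-matrix absmean scale follows the approach described in the BitNet papers.

```python
import numpy as np

def binarise(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantise a weight matrix to {-1, +1} with one shared float scale."""
    scale = np.abs(weights).mean()          # single scale for the whole matrix
    binary = np.where(weights >= 0, 1, -1)  # each weight becomes one bit of sign
    return binary.astype(np.int8), float(scale)

# The dequantised approximation is W ~ scale * binary
W = np.array([[0.42, -0.17], [-0.03, 0.88]])
B, s = binarise(W)
```

The storage saving comes from packing the ±1 values as bits; the tiny per-matrix scale is negligible overhead.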
The practical implication is substantial. A standard 8-billion-parameter model at 16-bit precision requires roughly 16 gigabytes of memory; the same architecture with 1-bit weights needs approximately 1 gigabyte, a roughly 14–16× reduction depending on how many components (embeddings, output layers) remain at higher precision. More importantly, 1-bit inference replaces energy-intensive floating-point multiply-accumulate operations with cheap integer additions and subtractions. This restructuring of the arithmetic is what drives the reported energy-efficiency gains.
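The arithmetic identity behind that energy claim is easy to show. With weights restricted to {−1, +1}, a dot product needs no multiplications at all: each activation is simply added or subtracted. The sketch below illustrates the identity only; production kernels pack the weights as bits and use vectorised operations rather than a Python loop.

```python
import numpy as np

def binary_matvec(B: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product with {-1,+1} weights, using only add/subtract."""
    out = np.zeros(B.shape[0])
    for i in range(B.shape[0]):
        for j in range(x.shape[0]):
            # +1 weight: add the activation; -1 weight: subtract it
            out[i] += x[j] if B[i, j] > 0 else -x[j]
    return out

B = np.array([[1, -1, 1], [-1, -1, 1]], dtype=np.int8)
x = np.array([0.5, 2.0, 1.0])
assert np.allclose(binary_matvec(B, x), B @ x)  # same result, no multiplies
```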
A closely related variant, sometimes called 1.58-bit quantisation, uses ternary weights: −1, 0, or +1. Three states require log₂(3) ≈ 1.585 bits to represent — hence the name. The additional zero value allows some weights to be effectively silenced, giving the model an extra representational degree of freedom. Both approaches require the model to be trained natively at this precision from the start (quantisation-aware training); standard post-training quantisation fails at bit-widths below 4, producing sharply degraded outputs.
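A minimal sketch of the ternary case, in the style of BitNet b1.58's absmean scheme: scale by the mean absolute weight, then round each weight to the nearest value in {−1, 0, +1}. Weights small relative to the scale land on zero, which is the "silencing" effect described above. (The function below is illustrative, not the paper's exact implementation.)

```python
import numpy as np

def ternarise(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantise weights to {-1, 0, +1} using an absmean scale."""
    scale = np.abs(weights).mean()
    # Round to nearest integer, then clip into the ternary range
    ternary = np.clip(np.round(weights / scale), -1, 1)
    return ternary.astype(np.int8), float(scale)

W = np.array([0.9, -0.05, -1.3, 0.4])
T, s = ternarise(W)  # the small -0.05 weight is silenced to 0
```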
Key Facts and Dates
The primary academic lineage runs through Microsoft Research. In October 2023, researchers there published BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv 2310.11453), introducing a 1-bit linear layer that could be substituted directly into standard Transformer architectures. The follow-on paper, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (arXiv 2402.17764, February 2024), demonstrated that ternary-weight models could match full-precision performance at the 3-billion-parameter scale and above — with 3.55× less GPU memory and, critically, 71.4× less energy for the arithmetic operations that dominate inference. That last figure is for the multiply-accumulate step specifically, not total system power; real-world energy reductions for full inference on edge hardware fall in the 2–5× range.
By October 2024, Microsoft had published a CPU inference framework (bitnet.cpp, arXiv 2410.16144) enabling 100-billion-parameter 1.58-bit models to run at usable speeds on a single laptop CPU — a meaningful threshold for edge and on-device deployment. In April 2025, Microsoft released the first open-source native 1-bit model trained at scale: BitNet b1.58 2B4T, a 2-billion-parameter model trained on 4 trillion tokens, with performance comparable to full-precision models of similar size.
The editorial reference is to PrismML, a startup that announced its Bonsai model family on 31 March 2026, with coverage appearing 4 April 2026. PrismML uses true binary weights (not ternary) applied end-to-end across all model components. Their claimed specifications for the Bonsai 8B model: 1.15 GB memory footprint (14× smaller than an FP16 8B equivalent), 5× more energy efficient on edge hardware, and inference speeds of 131 tokens/second on an Apple M4 Pro. Models are released under Apache 2.0 on HuggingFace. Important caveat: these figures are PrismML’s own claims. As of publication, no independent third-party benchmarking has been published. The 14× compression figure is mathematically consistent; the 5× energy efficiency claim is directionally consistent with Microsoft Research’s data but has not been externally validated.
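The consistency check on the 14× figure is straightforward back-of-envelope arithmetic. One assumption below is ours, not PrismML's: that the gap between 1.0 GB of packed 1-bit weights and the stated 1.15 GB footprint is components kept at higher precision.

```python
params = 8e9                    # 8-billion-parameter model
fp16_gb = params * 2 / 1e9      # 2 bytes per weight -> 16.0 GB
packed_gb = params / 8 / 1e9    # 1 bit per weight, 8 weights per byte -> 1.0 GB
claimed_gb = 1.15               # PrismML's stated Bonsai 8B footprint
ratio = fp16_gb / claimed_gb    # compression relative to the FP16 baseline
```

The ratio works out to roughly 13.9, which matches the claimed 14×.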
Why It Matters for AI Governance and Narratives
The standard framing of AI competition centres on capability and compute: who has the largest clusters, the most advanced chips, access to the most data. Sovereign AI strategies — in the EU, India, the Gulf states, and elsewhere — are largely structured around this framing, treating data centres and chip supply chains as the chokepoints worth controlling.
1-bit quantisation disrupts that frame at the infrastructure layer. If competitive inference can be achieved on consumer hardware — a laptop, a smartphone, a low-cost server without GPUs — then the strategic value of large centralised compute concentrations changes. The question shifts from who controls the data centre to who can build and deploy capable systems at the edge. These are not equivalent questions, and they imply different governance interventions.
For narrative analysis, this matters because it is efficiency claims — not capability claims — that are increasingly doing political work in the AI discourse. A model that matches frontier performance at a fraction of the cost challenges both the infrastructure-sovereignty narrative (large states) and the moat narrative (large labs). It creates rhetorical space for smaller actors — startups, research institutions, national programmes with modest compute budgets — to claim competitive relevance. Whether the technical claims hold up under independent scrutiny is a separate question; the narrative utility is operative regardless.
The PrismML announcement is a specific instance of this pattern: a well-funded startup explicitly invoking the language of democratisation and sovereignty-independence to position efficiency gains as a geopolitical argument, not merely a technical one.
Where to Learn More
- BitNet b1.58 paper (Microsoft Research, Feb 2024): https://arxiv.org/abs/2402.17764 — the primary academic source for ternary-weight LLMs at scale
- BitNet original paper (Microsoft Research, Oct 2023): https://arxiv.org/abs/2310.11453 — introduces the 1-bit linear layer architecture
- Microsoft BitNet GitHub repository: https://github.com/microsoft/BitNet — includes bitnet.cpp inference framework and open model weights
- BitNet b1.58 2B4T model (HuggingFace): https://huggingface.co/microsoft/bitnet-b1.58-2B-4T — first open-source native 1-bit LLM at scale, with technical report at arXiv 2504.12285