
Open-Source AI Model Builders

Organizations that release more than just weights — training data, recipes, checkpoints, and code. A working reference.

Open means: weights + data + code + checkpoints, reproducible from scratch.
Last updated June 2026

These organizations release weights + training data + code + checkpoints — the full stack, reproducible from scratch.

Allen Institute for AI (Ai2) Gold Standard
Models: OLMo / Molmo / Tülu. Full training code, every intermediate checkpoint, training configs, data provenance, and evaluation pipelines are public, all under Apache 2.0. Sizes go up to 32B. The latest releases are Olmo 3 (Nov 2025) and Olmo 3.1 32B, for which every training and fine-tuning dataset (up to 6 trillion tokens) is available for download without license restrictions, alongside OlmoTrace for tracing outputs back to training data.
Roadmap: Ai2 has confirmed a sparse Olmo-MoE for 2026, with more model flows, toolkits, and reasoning-focused releases coming, especially around the 32B scale they call a "sweet spot."
Hugging Face Data Stewards
Models: SmolLM / SmolVLM. SmolLM3 (3B) is a fully open model — open weights plus full training details including the public data mixture and training configs — pretrained on 11.2T tokens. They also maintain the FineWeb / FineWeb-Edu corpora that many other open models train on.
Roadmap: continued iteration on the SmolLM/SmolVLM family and FineWeb data releases, with intermediate checkpoints and post-training data released progressively.
EleutherAI Original Open-Data Lab
Models: Pythia. Datasets: The Pile, Common Pile. Their newest direction is the Common Pile v0.1 (mid-2025), built entirely from openly licensed and public-domain text, after consulting legal experts on what counts as a sufficiently open license.
Roadmap: growing the pool of openly licensed data and training models on it, explicitly to sidestep the copyright problem. They argue that the common idea that unlicensed text drives performance is unjustified.
BigCode (Hugging Face + ServiceNow) Code Models
Models: StarCoder 2. A code model whose training code and underlying dataset (The Stack v2) are released publicly. A rare fully open entry in the code-generation space.
LLM360 Community-Owned AGI
Models: Amber / Crystal / K2. A research lab built around the phrase "community-owned AGI." K2, developed with MBZUAI and Petuum, is fully transparent: all artifacts are open-sourced, including code, data, model checkpoints, and intermediate results.
Roadmap: mission-driven rather than tightly scheduled, focused on standards and tooling for fully reproducible large-model research.
Smaller / Newer Fully-Open Entrants
  • Stanford's Marin
  • Apertus (70B) — from ETH Zürich / EPFL, Switzerland
  • AMD's Instella
  • Zyphra — Zamba models + the Zyda dataset
  • BLOOM, T5 — older but fully open
EleutherAI groups Ai2, Hugging Face, Zyphra, and LLM360 together as the organizations defying the industry's transparency decline.

Open weights with detailed technical reports, but no training data release. These sit between the fully-open builders and the fully-closed labs.

DeepSeek Open Weights + Reports
Models: DeepSeek-V3, DeepSeek-R1. Open weights with detailed technical reports on architecture and training methodology, but no dataset release. R1's reinforcement-learning pipeline is documented enough to be reproducible in principle.
Mistral AI Open Weights + Papers
Models: Mixtral, Mistral 7B. Open weights (some Apache 2.0) with research papers describing recipes, but training data is not released. A member of the NVIDIA Nemotron Coalition.
Cohere For AI Open Weights + Data Cards
Models: Aya family. Open weights with multilingual data cards documenting composition, but not the raw dataset itself.
NVIDIA — Nemotron Frontier-Scale Open
NVIDIA builds frontier-scale open models and releases the data; "frontier-scale" here is distinct from "frontier-class" capability. Nemotron 3 (Dec 2025) shipped with training datasets, recipes, and ~10T tokens of open data. The Nemotron Coalition (GTC, March 2026) pools data and compute with Mistral, Perplexity, Cursor, LangChain, and others.
Roadmap: a coalition-built base model, co-developed with Mistral, that will underpin the upcoming Nemotron 4 family and is intended to be open-sourced on completion.
The honest trend note: The big closed labs are moving in the opposite direction from this group. Before 2022, even OpenAI, Anthropic, and Google DeepMind disclosed substantial detail about their pretraining data mixtures; they have since stopped, with researchers specifically citing lawsuits as the reason. The realistic forward picture is a widening split: the closed frontier gets less transparent on data, while this open-data cohort carries the full-transparency torch, increasingly using openly licensed data specifically to stay legally durable.
Caveat on roadmaps: Only Ai2 (Olmo-MoE 2026) and NVIDIA (Nemotron 4) have given fairly concrete public timelines. The others have stated direction (more data, more reproducibility) without firm dates — so their forward plans should not be read as scheduled commitments.
The Open Source Initiative published its Open Source AI Definition (OSAID 1.0) in October 2024. It requires sufficiently detailed information about the data used for training, the complete source code used to train and run the system, and the model weights. Note that it does not mandate releasing the training dataset itself. Even by that bar, almost none of the popular "open" models qualify.
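The two bars used on this page can be written down as simple checks. The sketch below is illustrative only: the `Release` schema, the field names, the tier labels, and the example releases are all hypothetical, and `meets_osaid_sketch` is a rough reading of the three OSAID requirements named above (data information, code, weights), not an official conformance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Artifacts published with a model release (hypothetical schema)."""
    name: str
    weights: bool = False
    training_data: bool = False     # the dataset itself is downloadable
    code: bool = False
    checkpoints: bool = False       # intermediate training checkpoints
    data_information: bool = False  # detailed documentation of the data


def tier(r: Release) -> str:
    """Apply this page's definition: fully open needs all four artifacts."""
    if r.weights and r.training_data and r.code and r.checkpoints:
        return "fully open"
    if r.weights:
        return "open weights"
    return "closed"


def meets_osaid_sketch(r: Release) -> bool:
    """Rough reading of OSAID 1.0: data information, code, and weights.

    Assumption: the dataset itself and checkpoints are not required.
    """
    return r.weights and r.code and r.data_information


# Hypothetical releases, not real models.
full = Release("example-full", weights=True, training_data=True, code=True,
               checkpoints=True, data_information=True)
documented = Release("example-documented", weights=True, code=True,
                     data_information=True)
weights_only = Release("example-weights-only", weights=True)

for r in (full, documented, weights_only):
    print(f"{r.name}: tier={tier(r)}, osaid_sketch={meets_osaid_sketch(r)}")
```

The middle example is the interesting case: a release can document its data well enough to pass the OSAID-style check while still landing in the open-weights tier here, because the dataset and checkpoints are not published.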
Sources