Alpaca ↔ ChatML Dataset Converter
Convert Alpaca format ({instruction, input, output}) to OpenAI messages JSON or ChatML text, or reverse the process to recover Alpaca from a messages array. Handles single objects, arrays, and JSONL for fine-tuning pipeline integration.
How to use the Alpaca ↔ ChatML Dataset Converter
Paste Alpaca or OpenAI messages data. The tool accepts a single JSON object, a JSON array, or JSONL (one object per line). Select the conversion direction; when converting from Alpaca, also choose whether the output should be OpenAI messages JSON or ChatML text.
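As a rough illustration of how the three accepted input shapes could be told apart, here is a minimal Python sketch. The function name load_records is hypothetical and this is not the tool's actual implementation; it assumes that anything which fails to parse as one JSON document should be read as JSONL.

```python
import json


def load_records(text: str) -> list[dict]:
    """Parse a single JSON object, a JSON array, or JSONL into a list of records."""
    text = text.strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: input that is not one valid JSON document is JSONL,
        # so parse each non-empty line as its own object.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A JSON array is already a list of records; wrap a single object in a list.
    return data if isinstance(data, list) else [data]
```

A malformed JSONL line still raises json.JSONDecodeError here, which is probably the behavior you want when validating a dataset before training.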
Alpaca → OpenAI messages: The instruction becomes the user turn. If input is non-empty, it is appended to the user content, separated from the instruction by two newlines. The output becomes the assistant turn.
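The mapping described above can be sketched in a few lines of Python. The function name alpaca_to_messages is hypothetical, and treating a whitespace-only input as empty is an assumption, not something the tool documents.

```python
def alpaca_to_messages(record: dict) -> dict:
    """Convert one Alpaca record into an OpenAI-style messages object."""
    instruction = record.get("instruction") or ""
    extra = record.get("input") or ""
    # Assumption: a missing or whitespace-only input counts as empty.
    if extra.strip():
        user_content = f"{instruction}\n\n{extra}"
    else:
        user_content = instruction
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record.get("output") or ""},
        ]
    }
```

For example, a record with instruction "Translate to French." and input "Good morning" yields the user content "Translate to French.\n\nGood morning".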
Alpaca → ChatML text: Produces formatted <|im_start|>role\n...\n<|im_end|> blocks used by llama.cpp, Mistral, and many GGUF-format models.
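A minimal Python sketch of this serialization follows, using the block layout exactly as written above. The function name messages_to_chatml is hypothetical. Note that some chat templates place <|im_end|> directly after the content with no newline before it, so check the output against the chat template of your target model.

```python
def messages_to_chatml(messages: list[dict]) -> str:
    """Serialize a messages array as ChatML text, one block per turn."""
    blocks = []
    for message in messages:
        # Layout assumed from the description above:
        # <|im_start|>role\ncontent\n<|im_end|>
        blocks.append(
            f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>"
        )
    # Blocks are separated by a single newline.
    return "\n".join(blocks)
```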
OpenAI messages → Alpaca: The first user message becomes instruction, and the first assistant message becomes output. Multi-turn conversations are reduced to a single instruction/output pair, so later turns are lost (a limitation of the Alpaca format).
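The reverse direction can be sketched as below. The function name messages_to_alpaca is hypothetical, and leaving the input field empty is an assumption: the description above does not say how the tool fills it.

```python
def messages_to_alpaca(messages: list[dict]) -> dict:
    """Collapse a messages array into a single Alpaca record."""
    # Take the first message of each role; system and later turns are ignored.
    first_user = next(
        (m["content"] for m in messages if m.get("role") == "user"), ""
    )
    first_assistant = next(
        (m["content"] for m in messages if m.get("role") == "assistant"), ""
    )
    # Assumption: input stays empty because the original instruction/input
    # split cannot be recovered from the merged user content.
    return {"instruction": first_user, "input": "", "output": first_assistant}
```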
Alpaca and ChatML dataset formats
The Alpaca format was introduced by Stanford with the original Alpaca dataset (52k instruction-following examples). Each record is a flat JSON object with three fields: instruction (the task description), input (optional context or data to process), and output (the expected model response). It became the de facto standard for single-turn supervised fine-tuning datasets and is still widely used on Hugging Face for instruction datasets.
ChatML (Chat Markup Language) is the tokenization convention used by OpenAI's fine-tuned models and adopted broadly by the open-source community, including Mistral, WizardLM, and others. The format uses the special tokens <|im_start|> and <|im_end|> to delimit speaker turns. llama.cpp, Ollama, and LM Studio use ChatML templates when loading GGUF models that declare the chatml chat template. ChatML is a text serialization of the messages array, not a separate data structure.
Converting Alpaca to the OpenAI messages format is required for fine-tuning via the OpenAI API, Axolotl, or any framework that expects a multi-turn conversation format. Converting to ChatML text is needed when building training datasets for llama.cpp-compatible models or when preprocessing data for frameworks that accept raw text rather than structured JSON.
Common use cases
- OpenAI fine-tuning — convert an Alpaca JSONL dataset to OpenAI messages format required by the /v1/fine_tuning/jobs endpoint.
- llama.cpp training data — produce ChatML text for SFT training with llama.cpp or Axolotl with chatml format.
- Dataset standardization — merge Alpaca and ShareGPT datasets by first converting both to OpenAI messages format.
- Evaluation harness prep — convert messages format to Alpaca when an evaluation framework expects instruction/output pairs.
- Dataset inspection — quickly visualize how Alpaca examples would look as formatted chat exchanges.
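For the OpenAI fine-tuning use case, the conversion can run line by line over a JSONL dataset. The sketch below is one way to do that in Python under stated assumptions: the function name alpaca_jsonl_to_openai is hypothetical, every line is assumed to carry instruction and output keys, and each output line is a JSON object with a top-level messages array.

```python
import json
from typing import Iterable, Iterator


def alpaca_jsonl_to_openai(lines: Iterable[str]) -> Iterator[str]:
    """Yield one OpenAI-style JSONL line per Alpaca JSONL line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        user_content = record["instruction"]
        if record.get("input"):
            # Non-empty input is appended after two newlines.
            user_content += "\n\n" + record["input"]
        example = {
            "messages": [
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": record["output"]},
            ]
        }
        # ensure_ascii=False keeps non-English text readable in the output.
        yield json.dumps(example, ensure_ascii=False)
```

Because the function accepts any iterable of strings, an open file object can be passed in directly and the results written out one line at a time, which keeps memory use flat on large datasets.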
Frequently asked questions
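What happens to the Alpaca "input" field?
If input is non-empty, it is appended to the user message after the instruction, separated by two newlines. If it is empty, the user message contains only the instruction.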
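Can I convert multi-turn conversations back to Alpaca?
Only partially. The first user message becomes instruction and the first assistant message becomes output; the Alpaca format cannot represent any further turns.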
What is the ChatML format exactly?
Each turn is written as <|im_start|>role\ncontent\n<|im_end|>. The <|im_start|> and <|im_end|> tokens are special vocabulary entries in models trained with ChatML; the template is what separates turns during training and inference.