Alpaca ↔ ChatML Dataset Converter

Convert Alpaca format ({instruction, input, output}) to OpenAI messages JSON or ChatML text, or reverse the process to recover Alpaca from a messages array. Handles single objects, arrays, and JSONL for fine-tuning pipeline integration.

How to use the Alpaca ↔ ChatML Dataset Converter

Paste Alpaca or OpenAI messages data. The tool accepts a single JSON object, a JSON array, or JSONL (one object per line). Select the conversion direction; when converting from Alpaca, also choose whether the output should be OpenAI messages JSON or ChatML text.
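The three accepted input shapes can be normalized into one list of records before conversion. The following is a minimal sketch of one way to do that, not the tool's actual parser; the function name load_records is hypothetical.

```python
import json


def load_records(text: str) -> list[dict]:
    """Parse a single JSON object, a JSON array, or JSONL into a list of dicts."""
    text = text.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not one valid JSON document: treat it as JSONL, one object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A lone object is wrapped so callers always receive a list.
    return parsed if isinstance(parsed, list) else [parsed]
```

Trying a whole-document parse first means a pretty-printed array spanning many lines is not mistaken for JSONL.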

Alpaca → OpenAI messages: The instruction becomes the user turn. If input is non-empty it is appended to the user content with two newlines. The output becomes the assistant turn.
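The mapping described above can be sketched in a few lines of Python. This is an illustration under the stated rules, not the tool's source code; the function name alpaca_to_messages is hypothetical, and treating a whitespace-only input as empty is an assumption.

```python
def alpaca_to_messages(record: dict) -> dict:
    """Convert one Alpaca record into an OpenAI-style messages object."""
    instruction = record["instruction"]
    # Assumption: a missing, null, or whitespace-only input counts as empty.
    context = record.get("input") or ""
    if context.strip():
        user_content = f"{instruction}\n\n{context}"
    else:
        user_content = instruction
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }
```

For example, a record with instruction "Translate to French.", input "Hello", and output "Bonjour" yields a user turn of "Translate to French.\n\nHello" followed by an assistant turn of "Bonjour".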

Alpaca → ChatML text: Produces <|im_start|>role\n...\n<|im_end|> blocks, the turn format expected by models trained with the ChatML template, including many GGUF models run through llama.cpp and many community fine-tunes of Mistral.
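As a rough sketch, the ChatML rendering amounts to wrapping each message in the block layout shown above. The function name to_chatml is hypothetical, and joining consecutive blocks with a single newline is an assumption; the tool's exact separator is not specified here.

```python
def to_chatml(messages: list[dict]) -> str:
    """Render a messages array as ChatML text, one block per turn."""
    blocks = [
        f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>"
        for m in messages
    ]
    # Assumption: blocks are separated by a single newline.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_chatml([
        {"role": "user", "content": "Translate to French.\n\nHello"},
        {"role": "assistant", "content": "Bonjour"},
    ]))
```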

OpenAI messages → Alpaca: The first user message becomes instruction, and the first assistant message becomes output. Multi-turn conversations are reduced to that single instruction/output pair, and later turns are dropped (a limitation of the Alpaca format).
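The reverse direction can be sketched as picking the first message of each role. The function name messages_to_alpaca is hypothetical. Leaving input empty is an assumption, since a merged user turn cannot be reliably split back into instruction and input.

```python
def messages_to_alpaca(example: dict) -> dict:
    """Reduce an OpenAI-style messages object to one Alpaca record."""
    messages = example["messages"]
    # Take the first message of each role; fall back to "" if a role is absent.
    first_user = next(
        (m["content"] for m in messages if m["role"] == "user"), ""
    )
    first_assistant = next(
        (m["content"] for m in messages if m["role"] == "assistant"), ""
    )
    # Assumption: input is left empty rather than guessed from the user turn.
    return {"instruction": first_user, "input": "", "output": first_assistant}
```

System messages and any turns after the first user/assistant pair are ignored in this sketch, which mirrors the data loss described above.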

Alpaca and ChatML dataset formats

The Alpaca format was introduced by Stanford with the original Alpaca dataset (52k instruction-following examples). Each record is a flat JSON object with three fields: instruction (the task description), input (optional context or data to process), and output (the expected model response). It became the de facto standard for single-turn supervised fine-tuning datasets and is still widely used on Hugging Face for instruction datasets.

ChatML (Chat Markup Language) is a chat serialization convention introduced by OpenAI for its chat models and adopted broadly by the open-source community, for example in Qwen models and in many community fine-tunes of Mistral. The format uses the special tokens <|im_start|> and <|im_end|> to delimit speaker turns. llama.cpp, Ollama, and LM Studio apply the ChatML template when loading GGUF models that declare the chatml chat template. It is a text serialization of the messages array, not a separate data structure.

Converting Alpaca to OpenAI messages format is required for chat-model fine-tuning via the OpenAI API, and is useful for frameworks such as Axolotl when they are configured to expect a messages-style conversation format. Converting to ChatML text is useful when building training datasets for models that use the ChatML template, or when preprocessing data for frameworks that accept raw text rather than structured JSON.

Common use cases

  • OpenAI fine-tuning: convert an Alpaca JSONL dataset to the OpenAI messages format expected for training files used with the /v1/fine_tuning/jobs endpoint.
  • llama.cpp training data: produce ChatML text for supervised fine-tuning of llama.cpp-compatible models, or for Axolotl configured with the chatml format.
  • Dataset standardization: merge Alpaca and ShareGPT datasets by first converting both to OpenAI messages format.
  • Evaluation harness prep: convert messages format to Alpaca when an evaluation framework expects instruction/output pairs.
  • Dataset inspection: quickly visualize how Alpaca examples would look as formatted chat exchanges.

Frequently asked questions

What happens to the Alpaca "input" field?

When input is non-empty, it is appended to the instruction with two newlines (a blank-line separator): "instruction\n\ninput". This matches the convention used in many Alpaca-to-chat conversion scripts. Note that the original Stanford Alpaca training prompt is different: it wraps the fields in a template with "### Instruction:", "### Input:", and "### Response:" headers.

Can I convert multi-turn conversations back to Alpaca?

Only partially. Alpaca is inherently single-turn. When converting OpenAI messages to Alpaca, only the first user message becomes instruction and the first assistant message becomes output. Multi-turn context is lost — this is a format limitation.

What is the ChatML format exactly?

Each turn is wrapped: <|im_start|>role\ncontent\n<|im_end|>. The im_start and im_end tokens are special vocabulary entries in models trained with ChatML; the template is what separates turns during training and inference.

Does this work with Alpaca datasets that have extra fields?

Extra fields (e.g., system_prompt, id, source) are preserved in the output JSON when converting to messages format. When converting to ChatML text, only instruction/input/output are rendered.
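A minimal sketch of how extra fields might be carried through to messages output follows. The function name and the choice to place extra fields at the top level next to messages are assumptions; the tool's exact output layout may differ.

```python
ALPACA_KEYS = {"instruction", "input", "output"}


def alpaca_to_messages_with_extras(record: dict) -> dict:
    """Convert an Alpaca record to messages format, keeping unknown fields."""
    # Everything that is not a core Alpaca field is passed through unchanged.
    extras = {k: v for k, v in record.items() if k not in ALPACA_KEYS}
    context = record.get("input") or ""
    user_content = record["instruction"] + (f"\n\n{context}" if context else "")
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ],
        # Assumption: extras sit beside "messages" at the top level.
        **extras,
    }
```

With this layout, a record carrying id and source fields keeps both after conversion, while instruction, input, and output are folded into the messages array.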