UTF-8 Byte Counter

How many bytes is this string? For anything beyond plain ASCII the answer is not the character count. This tool measures the exact UTF-8 byte length of your text and shows it next to the UTF-16 code-unit count, code-point count, and visible grapheme count, so you can see precisely where multibyte characters and emoji add weight.

How to use the UTF-8 Byte Counter

Type or paste text into the box. The four headline cards update live: UTF-8 bytes is the size the text occupies when encoded as UTF-8 (the default of the modern web, JSON, and most files); UTF-16 units is the count JavaScript's .length returns; code points is the number of Unicode scalar values; and graphemes is the number of characters a human actually sees.
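As a rough illustration of how the four headline numbers can be derived, here is a minimal JavaScript sketch. The function name `measure` is hypothetical, and this is not necessarily how the tool itself is implemented; it assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (current browsers and recent Node.js releases).

```javascript
// Hypothetical helper: returns the four measures described above.
function measure(text) {
  // UTF-8 bytes: encode the string and count the resulting bytes.
  const utf8Bytes = new TextEncoder().encode(text).length;

  // UTF-16 code units: exactly what String.prototype.length reports.
  const utf16Units = text.length;

  // Code points: the string iterator advances one code point at a time.
  const codePoints = [...text].length;

  // Graphemes: user-perceived characters, per Unicode segmentation rules.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const graphemes = [...segmenter.segment(text)].length;

  return { utf8Bytes, utf16Units, codePoints, graphemes };
}

// "h\u00e9llo" is "héllo" with a precomposed é.
console.log(measure("h\u00e9llo"));
// → { utf8Bytes: 6, utf16Units: 5, codePoints: 5, graphemes: 5 }
```

Only the accented letter costs an extra byte here, so the UTF-8 figure is the only one that moves.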

The detail table underneath breaks down the relationship: how many bytes each character averages, and the size of the longest single character in bytes. For pure ASCII all four numbers are equal, with one exception: a CRLF line break is two bytes but counts as a single grapheme. As soon as you add an accented letter, a CJK ideograph, or an emoji, they diverge, and the table makes the gap explicit.
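The two detail figures could be computed along these lines. This sketch assumes that "character" here means a grapheme, which may not match the tool's exact definition, and the name `byteBreakdown` is made up for illustration.

```javascript
// Hypothetical helper: average and largest UTF-8 size per grapheme.
// Assumption: "character" is interpreted as a grapheme cluster.
function byteBreakdown(text) {
  const encoder = new TextEncoder();
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

  let totalBytes = 0;
  let largestBytes = 0;
  let count = 0;

  for (const { segment } of segmenter.segment(text)) {
    const size = encoder.encode(segment).length;
    totalBytes += size;
    if (size > largestBytes) largestBytes = size;
    count += 1;
  }

  // Guard against division by zero for empty input.
  const averageBytes = count === 0 ? 0 : totalBytes / count;
  return { averageBytes, largestBytes };
}

// One ASCII letter, one accented letter, one emoji: 1 + 2 + 4 bytes.
console.log(byteBreakdown("a\u00e9\u{1F600}").largestBytes); // → 4
```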

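This is handy whenever a byte limit matters (a database column defined in bytes, a network packet, or a metadata field) and you need to know whether your text fits. SMS segments are a related case, but they are sized in GSM-7 or UCS-2 units rather than UTF-8 bytes, so the UTF-16 figure is the closer guide there.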

Bytes, characters, and why they differ

UTF-8 is a variable-length encoding. ASCII characters (the basic Latin letters, digits, and common punctuation) take a single byte each, so for ASCII-only text the byte count equals the character count. Every other character costs more: code points up to U+07FF, which cover most accented Latin letters plus Greek and Cyrillic, take two bytes; the rest of the Basic Multilingual Plane, including the vast majority of CJK ideographs, takes three; and everything above U+FFFF, including most emoji and rarer symbols, takes four. A string that looks short can therefore be surprisingly large in bytes.
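The width rule depends only on the numeric value of each code point, so it can be sketched without an encoder at all. The function names below are illustrative, not part of any library.

```javascript
// Number of UTF-8 bytes needed for a single code point.
function utf8Width(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // most accented Latin, Greek, Cyrillic
  if (codePoint <= 0xffff) return 3; // rest of the BMP, including most CJK
  return 4;                          // supplementary planes, including most emoji
}

// Sum the widths over every code point in the string.
function utf8Length(text) {
  let bytes = 0;
  for (const ch of text) {
    bytes += utf8Width(ch.codePointAt(0));
  }
  return bytes;
}

console.log(utf8Width(0x41));    // "A" → 1
console.log(utf8Width(0xe9));    // "é" → 2
console.log(utf8Width(0x4e2d));  // "中" → 3
console.log(utf8Width(0x1f600)); // "😀" → 4
```

For well-formed strings, `utf8Length` should agree with `new TextEncoder().encode(text).length`.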

On top of that, the words "character" and "length" are ambiguous. JavaScript measures strings in UTF-16 code units, where characters outside the Basic Multilingual Plane (including most emoji) count as two. The number of code points counts each Unicode scalar value once. And the number of graphemes counts what a person perceives as one character, which can be built from several code points: a flag emoji is two, an emoji with a skin-tone modifier is at least two, and a family emoji joined by zero-width joiners can be seven. This is why "👋🏽".length is 4 in JavaScript even though you see one symbol: the hand and the skin-tone modifier are separate code points, and each needs two UTF-16 units.
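To see where the extra units come from, it can help to list the code points inside a single visible symbol. The helper name `describeCodePoints` is hypothetical.

```javascript
// List each code point in a string with its UTF-16 unit count.
function describeCodePoints(text) {
  return [...text].map((ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    utf16Units: ch.length,
  }));
}

// Waving hand followed by a medium skin-tone modifier.
const wave = "\u{1F44B}\u{1F3FD}";
console.log(wave.length); // → 4
console.log(describeCodePoints(wave));
// → [ { codePoint: 'U+1F44B', utf16Units: 2 },
//     { codePoint: 'U+1F3FD', utf16Units: 2 } ]
```

A flag built from two regional-indicator code points, such as "\u{1F1EF}\u{1F1F5}", breaks down the same way: two code points, four UTF-16 units, one visible symbol.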

Knowing which measure a system uses prevents off-by-a-lot errors. A column declared as VARCHAR(20) may mean 20 bytes or 20 characters depending on the database. A "280 character" limit might count code points. An API that truncates by bytes can split a multibyte character in half and corrupt it. Counting all four numbers at once lets you reason about whichever limit you are actually up against.
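One way to avoid the mid-character cut is to truncate on grapheme boundaries while tracking the UTF-8 size. The following is a sketch under that assumption, with a hypothetical function name; a real system may need different rules, such as appending an ellipsis or normalizing first.

```javascript
// Truncate text so its UTF-8 encoding fits in maxBytes,
// never splitting a grapheme (and therefore never a multibyte sequence).
function truncateUtf8(text, maxBytes) {
  const encoder = new TextEncoder();
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

  let used = 0;
  let result = "";

  for (const { segment } of segmenter.segment(text)) {
    const size = encoder.encode(segment).length;
    if (used + size > maxBytes) break; // the next grapheme would not fit
    used += size;
    result += segment;
  }
  return result;
}

console.log(truncateUtf8("h\u00e9llo", 2)); // → "h" (the 2-byte é does not fit)
console.log(truncateUtf8("h\u00e9llo", 3)); // → "hé"
```

Cutting on grapheme boundaries rather than code points means a skin-toned emoji is either kept whole or dropped whole, at the cost of sometimes leaving a few bytes unused.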

Common use cases

  • Database column limits. Check whether a name or message fits a byte-defined column before it gets silently truncated.
  • API and protocol fields. Verify text fits a byte-bounded field so multibyte characters are not cut mid-sequence.
  • Estimating storage and bandwidth. See the true byte cost of internationalized or emoji-heavy content.
  • Debugging length mismatches. Understand why a string's byte size, JavaScript length, and visible character count all differ.

Frequently asked questions

Why is the byte count larger than the character count?

UTF-8 uses one byte for ASCII but two to four bytes for other characters. Accented letters, CJK text, and emoji all push the byte count above the character count.

What is the difference between code points and graphemes?

A grapheme is one character as a human sees it; it can be made of several code points. A flag or a skin-toned emoji is a single grapheme built from multiple code points.

Why does JavaScript .length disagree with the others?

JavaScript counts UTF-16 code units. Characters above U+FFFF, including most emoji, occupy two units, so .length over-counts them relative to code points.

Does this measure UTF-16 or UTF-32 byte size?

The byte figure is UTF-8, the dominant encoding for files and the web. UTF-16 would use two or four bytes per code point and UTF-32 a flat four, but those are far less common on disk and the wire.
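If you do need the other sizes, they follow directly from the counts already described. This sketch ignores any byte-order mark, and the name `encodedSizes` is illustrative only.

```javascript
// Byte sizes of the same text in three encodings (no byte-order mark).
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length,
    utf16: text.length * 2,      // two bytes per UTF-16 code unit
    utf32: [...text].length * 4, // four bytes per code point
  };
}

// One ASCII letter plus one emoji outside the BMP.
console.log(encodedSizes("a\u{1F600}"));
// → { utf8: 5, utf16: 6, utf32: 8 }
```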