Text to Binary/Hex Converter

Convert text to binary and hexadecimal representation. Character-by-character breakdown with ASCII/UTF-8 codes. Free online text encoder.

Converting "Hello" to binary takes one tap; the interesting part is the encoding question nobody asks until something breaks. Is "ł" the byte 0xB3 (Windows-1250) or the two bytes 0xC5 0x82 (UTF-8) or the four bytes 0x00 0x00 0x01 0x42 (UTF-32)? "Same character" produces different bytes depending on the encoding. This converter shows ASCII / Latin-1 / UTF-8 / UTF-16 / UTF-32 encodings in binary, decimal, and hex, side by side — so you can see why your "non-ASCII characters mangled" bug is a UTF-8 / Windows-1250 mismatch.

What "convert text to binary" actually means

Text is characters; computers store bytes. The mapping between characters and bytes is an encoding. ASCII (1963) maps 128 characters to 7 bits each. ISO-8859-1 / Latin-1 extends to 256 characters (8 bits). Code pages (Windows-1252, Windows-1250, KOI8-R) are mutually-incompatible 256-character maps. Unicode assigns codepoints to every character (~150,000 in 2026); UTF-8, UTF-16, UTF-32 are different encodings of those codepoints into bytes.
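As a rough illustration of that character-to-byte mapping, the sketch below encodes a string with a chosen encoding and prints each byte in binary or hex. The helper names `to_bits` and `to_hex` are made up for this example and are not part of the tool; the codec names are the ones Python's standard library uses.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Encode text, then render each byte as 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


def to_hex(text: str, encoding: str = "utf-8") -> str:
    """Encode text, then render each byte as 2 uppercase hex digits."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


print(to_bits("Hi", "ascii"))    # 01001000 01101001
print(to_hex("ł", "utf-8"))      # C5 82
print(to_hex("ł", "cp1250"))     # B3
```

The same character yields different bytes only because the `encoding` argument changes; the conversion to binary or hex digits is identical in every case.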

UTF-8 is the de facto standard for modern text. It is variable-length (1-4 bytes per codepoint), ASCII-compatible (every ASCII byte is itself in UTF-8), and self-synchronizing (continuation bytes always have the bit pattern 10xxxxxx, so from any offset you reach the next character boundary by skipping at most 3 bytes). When in doubt, save as UTF-8.
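The self-synchronizing property can be sketched in a few lines. This is a minimal illustration, assuming well-formed UTF-8 input; `is_continuation` and `next_boundary` are hypothetical helper names chosen for the example.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "ał😀".encode("utf-8")    # 61 | C5 82 | F0 9F 98 80
print(next_boundary(data, 2))    # 3: index 2 is the second byte of "ł"
print(next_boundary(data, 4))    # 7: skips the three trailing bytes of the emoji
```

Because lead bytes and continuation bytes never share a bit pattern, a reader dropped into the middle of a stream can resynchronize without decoding from the start.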

Working example: a Polish character

Input

Character: ł (lower-case Polish "l" with stroke, Unicode U+0142)

Output

Codepoint:        U+0142 (322 decimal)

UTF-8 (2 bytes):  C5 82 = 11000101 10000010
UTF-16 (2 bytes): 01 42 = 00000001 01000010
UTF-32 (4 bytes): 00 00 01 42
Latin-2 (1 byte): B3 = 10110011
Windows-1250:     B3 = 10110011
Latin-1 (cannot represent — ł is not in this code page)
ASCII (cannot represent)

A text file with one "ł" character:
  As UTF-8: 2 bytes (C5 82)
  As Windows-1250: 1 byte (B3)
  When a UTF-8 file is opened as Windows-1252, "ł" appears as two characters: "Å‚" (opened as Windows-1250 it appears as "Ĺ‚")
  When a Windows-1250 file is opened as UTF-8, "ł" appears as one invalid byte / replacement char

The "Å‚" mojibake and its relatives ("Ã¼" for "ü", "Ã¥" for "å") are among the most common encoding bugs in Polish, German, and Scandinavian text. The fix is almost always "save and reload in the same encoding"; the diagnosis is "look at the bytes and see which encoding interprets them sensibly".
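The mismatch can be reproduced directly. The following sketch decodes the bytes of "ł" with the wrong codec in both directions and then reverses one wrong decode; the variable names are illustrative only, and the repair step assumes the mangled text has not been altered or re-saved lossily.

```python
original = "ł"
utf8_bytes = original.encode("utf-8")        # C5 82

# UTF-8 bytes read with a single-byte code page: two wrong characters.
as_cp1252 = utf8_bytes.decode("cp1252")      # 'Å‚'
as_cp1250 = utf8_bytes.decode("cp1250")      # 'Ĺ‚'

# The opposite mistake: a lone B3 is a stray continuation byte in UTF-8,
# so a lenient decoder substitutes U+FFFD (the replacement character).
cp1250_bytes = original.encode("cp1250")     # B3
as_utf8 = cp1250_bytes.decode("utf-8", errors="replace")

# While the bytes are intact, the wrong decode can be reversed.
repaired = as_cp1252.encode("cp1252").decode("utf-8")

print(as_cp1252, as_cp1250, repr(as_utf8), repaired)
```

Which mojibake you see depends on the code page used for the wrong decode, which is itself a useful diagnostic clue.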

Encodings you will encounter

  • ASCII — 7-bit, 128 characters. Latin alphabet, digits, common punctuation. Universal but inadequate for non-English.
  • Latin-1 (ISO-8859-1) — 8-bit, 256 chars. Western European Latin letters with diacritics (é, ñ, ü). Common in pre-2000 web pages.
  • Latin-2 (ISO-8859-2) — Central European (Polish, Czech, Slovak, Hungarian, Croatian).
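  • Latin-9 (ISO-8859-15) — Latin-1 with eight rarely used symbols replaced by € and the letters Š, š, Ž, ž, Œ, œ, Ÿ.
  • Windows-1252 — Microsoft's superset of Latin-1 that puts printable punctuation (curly quotes, dashes, €) in the 0x80-0x9F range, which Latin-1 reserves for C1 control codes. Browsers decode content labelled ISO-8859-1 as Windows-1252; the two differ only in those 32 byte values.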
  • Windows-1250 — Microsoft Central European. Common on Polish/Czech Windows systems.
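  • UTF-8 — variable-length Unicode. 1 byte for ASCII, 2 for accented Latin, Greek, Cyrillic, Hebrew, and Arabic, 3 for the rest of the Basic Multilingual Plane (including most CJK), 4 for codepoints above U+FFFF (most emoji, rare scripts). The default for web, JSON, and most modern formats.
  • UTF-16 — 2 or 4 bytes per character. Used internally by the Windows API, Java strings, and JavaScript strings. Byte order matters: without a BOM, the reader has to know whether the data is little-endian or big-endian.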
  • UTF-32 — fixed 4 bytes per character. Simpler indexing but wasteful. Rarely used in storage; sometimes used in-memory for codepoint-level work.
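To make the size differences in the list above concrete, here is a small sketch that counts encoded bytes for a few sample characters. The sample set and the `byte_lengths` helper are arbitrary choices for illustration; the little-endian codec variants are used so that Python does not prepend a BOM to the count.

```python
samples = ["A", "é", "ł", "中", "😀"]


def byte_lengths(ch: str) -> dict[str, int]:
    """Number of bytes needed for one character in each Unicode encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

UTF-32 is constant at 4 bytes, UTF-16 jumps from 2 to 4 only above U+FFFF, and UTF-8 grows gradually from 1 to 4.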

BOM (Byte Order Mark) and why it matters

BOM is U+FEFF at the start of a file, used in UTF-16 / UTF-32 to indicate byte order. In UTF-8 the byte order is fixed, so the BOM (EF BB BF) is decorative and almost always a bug. A UTF-8 BOM at the start of a shell script breaks the script ("$'\xef\xbb\xbf': command not found"). At the start of a CSV, the first column header gets three phantom bytes prepended. Most modern editors offer "UTF-8 without BOM"; pick it.
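A short sketch of how the phantom bytes show up and how to drop them. It relies on Python's standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; `strip_utf8_bom` is a hypothetical helper name and the CSV content is invented for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading EF BB BF if present; leave other input untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = codecs.BOM_UTF8 + b"name,age\n"

# A plain UTF-8 decode keeps the BOM as an invisible U+FEFF in the first header.
print(repr(raw.decode("utf-8").split(",")[0]))      # '\ufeffname'

# The "utf-8-sig" codec drops a leading BOM while decoding.
print(repr(raw.decode("utf-8-sig").split(",")[0]))  # 'name'
```

Reading with `utf-8-sig` is harmless for files without a BOM, which makes it a reasonable default when the input source is not under your control.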

When to reach for this tool

  • You are debugging a "the character renders as a question mark / box / mojibake" bug and need to see what bytes are actually there.
  • You are setting up a parser or serializer and want to confirm a specific character produces the expected byte sequence in your target encoding.
  • You are explaining to a junior engineer why "save as UTF-8" matters and want a visual comparison of encodings for the same text.
  • You are reverse-engineering a binary file format and need to identify which encoding interprets bytes 0x80-0xFF as legible text.

What this tool will not do

  • It will not detect the encoding of unknown bytes. Encoding detection (chardet, ICU) is heuristic and can be wrong; this tool shows what a given encoding produces, not what an unknown sequence is.
  • It will not handle "legacy" multi-byte encodings (Shift_JIS, GB2312, Big5) fully. Asian legacy encodings are intricate; modern code should use UTF-8.
  • It will not produce HTML/XML entity encoding (such as &#322; for ł). For that, use the HTML entity encoder tool.

All conversion happens in your browser. Useful for inspecting bytes in sensitive content without uploading.

Frequently asked questions

Why does saving as UTF-8 sometimes break my file?

Usually one of: (1) the file has bytes invalid in UTF-8 (likely it was actually Windows-1252 and contains 0x95 or similar single-byte non-ASCII characters); (2) the saving tool adds a BOM and downstream consumers don't handle it; (3) the same save also converts line endings from CRLF to LF. Inspect the bytes before and after; the bug location is usually visible.
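For case (1), one way to find the offending byte is to attempt a strict decode and read the offset from the exception. This is a sketch only: `first_invalid_utf8` is a hypothetical helper and the sample bytes are invented.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) where strict UTF-8 decoding fails, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


sample = b"item \x95 one"            # 0x95 is a bullet in Windows-1252
print(first_invalid_utf8(sample))    # (5, 149)
print(sample.decode("cp1252"))       # item • one
```

If the reported byte makes sense in Windows-1252 or another code page, the file was probably never UTF-8 in the first place.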

How many bytes is the 😀 emoji?

Codepoint U+1F600. UTF-8: 4 bytes (F0 9F 98 80). UTF-16: 4 bytes (surrogate pair D83D DE00). UTF-32: 4 bytes. Note: many "emoji" are actually sequences of multiple codepoints joined with zero-width joiners (👨‍👩‍👧 = 5 codepoints, 18 bytes in UTF-8: three 4-byte emoji plus two 3-byte joiners). Counting "characters" depends on what you mean.
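The counts can be checked with a few lines of Python. In this sketch the family sequence is built from explicit codepoints (man, woman, girl, an arbitrary choice for the example) so the zero-width joiners are visible in the source; `sizes` is a hypothetical helper.

```python
ZWJ = "\u200d"  # zero-width joiner
grinning = "\U0001F600"
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"


def sizes(text: str) -> dict[str, int]:
    """Codepoint count and encoded byte counts (no BOM) for a string."""
    return {
        "codepoints": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


print(sizes(grinning))
print(sizes(family))
```

Note that `len()` in Python counts codepoints, not what a reader perceives as one character; grouping a joined sequence into a single unit requires grapheme segmentation, which the standard library does not provide.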

Is UTF-8 the same as Unicode?

No. Unicode is the character set (mapping codepoints to characters). UTF-8 is one of several encodings of Unicode codepoints to bytes. Other Unicode encodings: UTF-16, UTF-32. Saying "encode as Unicode" is ambiguous; "encode as UTF-8" is precise.

Why is ASCII still relevant?

Because UTF-8 is a superset of ASCII — every ASCII byte is itself in UTF-8. A pure-ASCII file is a valid UTF-8 file. Programmers writing English code, config files, and command lines work in the ASCII subset of UTF-8 daily without thinking about it. ASCII compatibility is why UTF-8 won over UTF-16 for storage.

How do I tell which encoding a file is in?

Check for a BOM at the start (FE FF = UTF-16BE, FF FE = UTF-16LE, EF BB BF = UTF-8 with BOM). If no BOM and the file has bytes 0x80-0xFF, try UTF-8 first; if the high-bit bytes form valid UTF-8 sequences, it almost certainly is UTF-8 (the prefix patterns are constrained enough that random byte sequences rarely look valid). If UTF-8 decode fails, try Windows-1252 / Latin-1 / locale-specific code pages.
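The procedure above can be sketched as a small function. It is a heuristic, not a detector: `guess_encoding` and its `fallback` parameter are names invented for this example, and the UTF-32 BOM checks are an addition beyond the BOMs listed above (they have to run before the UTF-16LE check because FF FE 00 00 starts with FF FE).

```python
import codecs

# Ordered so that longer BOMs are tested before their shorter prefixes.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """BOM first, then strict UTF-8, then a caller-chosen code page."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


print(guess_encoding("zażółć".encode("utf-8")))                     # utf-8
print(guess_encoding("zażółć".encode("cp1250"), fallback="cp1250")) # cp1250
```

The fallback cannot be inferred from the bytes alone; it should reflect where the file came from, for example Windows-1250 for files from a Polish Windows system.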

What is a "code page", and is it the same as an encoding?

Yes, in practice. Microsoft uses "code page" for its numbered encodings, most of them 8-bit single-byte ones like Windows-1252 (code page 1252) or CP-437 (DOS). "Encoding" is the broader term that includes multi-byte encodings like UTF-8.

Published · Updated · E-Utils editorial team