Invisible Character Detector

Find and remove hidden Unicode characters in text. Detect zero-width spaces, RTL marks, and homoglyphs. Debug copy-paste issues.

A zero-width space (U+200B) is exactly what it sounds like: a character that takes up no width on the page but still counts as content in your string. Such characters sneak in from Word, from chat apps, and from copy-paste in browsers, and they break exact-match comparisons in ways that are infuriating to debug because the text "looks identical". This detector finds zero-width characters, right-to-left overrides, byte-order marks, non-breaking spaces, soft hyphens, homoglyphs (Latin "a" vs Cyrillic "а"), and roughly fifty other invisible-or-confusing Unicode codepoints.

The codepoints that look identical to ASCII

  • NO-BREAK SPACE (U+00A0) — looks like a regular space. Breaks split-on-space tokenization. Most common source: copying from PDFs and from web pages that use the &nbsp; entity.
  • ZERO WIDTH SPACE (U+200B) — completely invisible. Inserted for line-break hints in CJK text and word-boundary marks. Sometimes used maliciously to defeat keyword filters.
  • ZERO WIDTH NON-JOINER (U+200C) / JOINER (U+200D) — invisible, used for ligature control. Required for some Persian/Arabic text. Breaks length comparisons.
  • WORD JOINER (U+2060) — like a zero-width space but indicates "do not break here". Same problem for ASCII comparisons.
  • BYTE ORDER MARK (U+FEFF) — invisible at start of file. Required by some Windows tools, fatal to shell scripts (the shell errors with "command not found" pointing at a phantom byte sequence at the file start).
  • SOFT HYPHEN (U+00AD) — invisible until end-of-line, where it becomes a hyphen. Sometimes used as a watermark in copy-protected text.
  • COMBINING DIACRITICAL MARKS — separate codepoints that attach to a base letter. "é" can be one codepoint (U+00E9) or two (e + combining acute U+0301). Both render identically; byte equality fails.
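
The codepoints listed above can be located with a short scan. The following is a minimal Python sketch, not this tool's implementation: the `SUSPECTS` set and the `find_invisibles` helper are hypothetical names, and the watch list covers only the characters named in this list plus combining marks.

```python
import unicodedata

# Hypothetical watch list: only the codepoints named in the list above.
SUSPECTS = {
    0x00A0,  # NO-BREAK SPACE
    0x00AD,  # SOFT HYPHEN
    0x200B,  # ZERO WIDTH SPACE
    0x200C,  # ZERO WIDTH NON-JOINER
    0x200D,  # ZERO WIDTH JOINER
    0x2060,  # WORD JOINER
    0xFEFF,  # BYTE ORDER MARK (named ZERO WIDTH NO-BREAK SPACE in Unicode)
}


def find_invisibles(text: str) -> list[tuple[int, str, str]]:
    """Return (index, 'U+XXXX', name) for each suspect or combining character."""
    hits = []
    for i, ch in enumerate(text):
        # unicodedata.combining() returns a non-zero class for combining marks.
        if ord(ch) in SUSPECTS or unicodedata.combining(ch):
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


sample = "\ufeffuser\u200bname\u00a0here"
for hit in find_invisibles(sample):
    print(hit)
```

The scan only reports positions and names; deciding what to remove is a separate step, since several of these characters have legitimate uses.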

Homoglyphs — characters that look like other characters

  • Cyrillic а (U+0430) vs Latin a (U+0061) — visually identical. The classic phishing technique: register раypal.com with Cyrillic letters in place of the Latin ones, and most users will not notice the difference.
  • Greek ο (U+03BF), Cyrillic о (U+043E), Latin o (U+006F) — three different letters that look identical in most fonts.
  • Greek capital Iota Ι, Latin capital I, vertical bar | — sometimes ambiguous depending on font.
  • IDN/Punycode — internationalized domain names use Punycode (xn--) for non-ASCII characters. Browsers display Unicode form for trusted scripts, Punycode form for mixed-script (anti-spoofing). Always check the URL bar carefully.
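
To see what the Punycode form of a lookalike domain looks like, Python's built-in `idna` codec can be used. This is a sketch under one assumption: the codec implements the older IDNA 2003 rules, so it illustrates the `xn--` encoding rather than what any particular browser or registrar does.

```python
# Cyrillic er (U+0440) and Cyrillic a (U+0430), followed by Latin "ypal.com".
spoofed = "\u0440\u0430ypal.com"
genuine = "paypal.com"

# The "idna" codec converts each non-ASCII label to its xn-- (Punycode) form.
genuine_ascii = genuine.encode("idna")
spoofed_ascii = spoofed.encode("idna")

print(genuine_ascii)  # pure-ASCII labels pass through unchanged
print(spoofed_ascii)  # the first label becomes an xn-- label

# Decoding restores the Unicode form, which is what a user would see rendered.
print(spoofed_ascii.decode("idna") == spoofed)
```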

Working example: a sneaky comment review

Input

paypal.com vs раypal.com  (both 10 visible chars)

Output

String 1: p-a-y-p-a-l-.-c-o-m
  Codepoints: U+0070 U+0061 U+0079 U+0070 U+0061 U+006C U+002E U+0063 U+006F U+006D
  All Latin. Clean.

String 2: р-а-y-p-a-l-.-c-o-m
  Codepoints: U+0440 U+0430 U+0079 U+0070 U+0061 U+006C U+002E U+0063 U+006F U+006D
  First two characters are Cyrillic (Ukrainian/Russian) lowercase 'er' and 'a'.
  String 2 is a homoglyph attack — visually identical, byte-different.

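Domain registrars and browsers warn about mixed-script IDNs. Inside a string field (a comment, a chat message, a username), no one warns you. Unicode normalization (NFC) alone will not catch this, because Cyrillic а never normalizes to Latin a. Normalize first to rule out composed-vs-decomposed differences, then run script detection or a confusables check to flag mixed-script strings.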
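
The codepoint listing in the example can be reproduced in a few lines of Python. The sketch below also approximates script detection by taking the first word of each letter's Unicode name; that is a heuristic chosen because the standard library does not expose the Unicode Script property, and the helper names `codepoints` and `scripts` are hypothetical.

```python
import unicodedata


def codepoints(text: str) -> str:
    """Format a string as space-separated U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def scripts(text: str) -> set[str]:
    """Approximate each letter's script by the first word of its Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


genuine = "paypal.com"
spoofed = "\u0440\u0430ypal.com"  # Cyrillic er, Cyrillic a, then Latin

print(codepoints(spoofed))
print(scripts(genuine))  # a single script
print(scripts(spoofed))  # more than one script: worth flagging

# NFC leaves the Cyrillic letters as they are, so the strings still differ.
print(unicodedata.normalize("NFC", spoofed) == genuine)
```

A string whose letters span more than one script is not automatically malicious, but in a username or domain label it is a reasonable trigger for review.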

When invisibles are legitimate

  • Zero-width joiner in emoji sequences — 👨‍👩‍👧 is "man + ZWJ + woman + ZWJ + girl" forming a family emoji. Stripping ZWJ breaks the emoji.
  • Soft hyphens for typesetting — used legitimately in books and long-form text to mark good line-break points. Stripping them is fine for plain text but loses hyphenation hints.
  • Right-to-left marks in mixed Arabic/Hebrew/Latin text — required for correct display. Aggressive stripping breaks bidirectional text.
  • Combining marks for accented scripts — fundamental for Devanagari, Thai, and many other writing systems. Always normalize Unicode rather than stripping combining marks.
  • BOM in UTF-16/UTF-32 files — required for byte-order disambiguation. Remove only from UTF-8 (where BOM is decorative and often harmful).
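
One way to respect these cases is to strip from an explicit list instead of removing everything invisible. The sketch below assumes a hypothetical policy for plain-text fields (the `STRIP` set and `strip_unwanted` are illustrative names); the right list depends on your data.

```python
# Hypothetical policy: characters that rarely belong in a plain-text field.
STRIP = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u2060",  # WORD JOINER
    "\ufeff",  # BOM / ZERO WIDTH NO-BREAK SPACE
    "\u00ad",  # SOFT HYPHEN (removing it loses hyphenation hints)
}
# Deliberately kept: U+200D (ZWJ), U+200C (ZWNJ), U+200E/U+200F (LRM/RLM),
# and all combining marks.


def strip_unwanted(text: str) -> str:
    """Remove only the characters in STRIP; everything else passes through."""
    return "".join(ch for ch in text if ch not in STRIP)


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
dirty = "\ufeffhello\u200b " + family

clean = strip_unwanted(dirty)
print(clean == "hello " + family)  # the emoji sequence keeps its joiners
```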

When to reach for this tool

  • You are debugging "the database value matches the form value but my == comparison returns false" — almost always invisible characters in one of them.
  • You inherited a script that "works on my machine, fails on CI" with cryptic shell errors — very likely a BOM at the file start.
  • You are reviewing user-submitted content for malicious zero-width injections (used in some carding-fraud techniques to bypass keyword filters).
  • You are exporting copy-paste from Word, Pages, or Google Docs to code/config files and want to strip smart quotes, NBSPs, and other typography characters that break ASCII tooling.

What this tool will not do

  • It will not normalize Unicode for you (NFC, NFD, NFKC, NFKD). Use the text-cleaner tool for that — different operation than detection.
  • It will not detect all spoofing. Some confusables require font-aware detection (rendered glyphs vs codepoints); some require context (Unicode TR39 confusable detection). The tool finds the common cases without false-positiving on legitimate uses.
  • It will not strip-and-rewrite by default. Detection is non-destructive; the tool reports what is there. Use the strip option only after reviewing what would be removed.

Text is analyzed entirely in your browser. Internal data, user submissions, and source code stay local.

Frequently asked questions

Why does my CSV file fail to import even though it looks correct?

Most common cause: a BOM at the start of the file. The first column header is "ID" visually but is actually "\uFEFFID" once decoded (the bytes EF BB BF followed by "ID" in UTF-8). The importer looks for a column named "ID" and does not find it. Save without BOM (most editors have a "without BOM" option), or strip the first three bytes from a UTF-8 BOM'd file.
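
The failure and one possible fix can be reproduced in Python. This is a sketch using in-memory bytes in place of a real file; the column names are made up for illustration. The `utf-8-sig` codec drops a leading BOM when one is present and otherwise behaves like plain UTF-8.

```python
import csv
import io

# Simulated file contents: UTF-8 BOM (EF BB BF) followed by a two-column CSV.
raw = b"\xef\xbb\xbfID,name\r\n1,Ada\r\n"

# Decoding as plain UTF-8 keeps the BOM as U+FEFF attached to the first header.
plain = next(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))
print(plain)

# Decoding as utf-8-sig removes the leading BOM before parsing.
fixed = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline="")))
print(fixed)

print("ID" in plain, "ID" in fixed)
```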

How are zero-width characters used maliciously?

Three main ways: (1) keyword filter evasion — inserting ZWSP between letters bypasses naive substring matching; (2) homograph spoofing — combined with non-Latin lookalikes to make fake login pages; (3) text watermarking — invisibly mark documents to trace leaks. None of these are common in average user content, but worth scanning for in security-sensitive contexts.
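
Filter evasion, the first case, is easy to demonstrate. The sketch below matches keywords against a copy of the message with zero-width characters removed and leaves the original untouched; the `BLOCKED` list and `is_blocked` helper are hypothetical, and a real filter would need more than substring matching.

```python
# Translation table that deletes common zero-width characters (maps each to None).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

BLOCKED = ["forbidden"]  # hypothetical keyword list


def is_blocked(message: str) -> bool:
    """Match keywords against a copy with zero-width characters removed."""
    probe = message.translate(ZERO_WIDTH).casefold()
    return any(word in probe for word in BLOCKED)


evasive = "for\u200bbid\u200bden offer"
print("forbidden" in evasive)  # naive substring match misses it
print(is_blocked(evasive))     # matching on the stripped copy finds it
```

Matching on a copy matters because the joiners removed for comparison may be legitimate in the stored text, for example inside emoji sequences.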

Should I always strip non-ASCII characters from user input?

No. International users have names with accents, emoji, and non-Latin scripts. Strip silently and you exclude them. Better: normalize Unicode (NFC), detect intentional vs accidental invisibles, and reject only the specific categories that do not belong (BOM, ZWSP) while preserving legitimate ones (accented letters, CJK, RTL marks for Arabic/Hebrew).

How do I tell ASCII space from NBSP visually?

You cannot, in most fonts. Use a tool to look at codepoints, or enable "show invisibles" in your editor (VS Code: Editor: Render Whitespace = all). The byte is different (0x20 vs 0xC2 0xA0 in UTF-8) but the visual width is identical.
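
The byte difference, and its effect on tokenization, can be checked directly. A small Python sketch:

```python
space = " "
nbsp = "\u00a0"

print(space.encode("utf-8"))  # one byte, 0x20
print(nbsp.encode("utf-8"))   # two bytes, 0xC2 0xA0

line = "alpha\u00a0beta gamma"

# Splitting on a literal ASCII space leaves the NBSP inside the first token.
print(line.split(" "))

# split() with no argument splits on any Unicode whitespace, NBSP included.
print(line.split())

# Replacing NBSP with a regular space makes both approaches agree.
print(line.replace(nbsp, space).split(" "))
```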

Are emoji invisible characters?

No, emoji are visible by definition. But many emoji sequences contain invisible characters, such as joiners (ZWJ) or tag characters. The flag of England is a base character, 🏴 (U+1F3F4), followed by five tag characters that spell "gbeng" and a closing cancel tag (U+E007F); the tag characters are invisible on their own but render as the flag when combined with the base. Treat emoji as a unit, not as individual characters.
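
Counting codepoints shows how much of an emoji sequence is invisible. In the Python sketch below the sequences are written as escapes so that nothing is lost in transit, and `decode_tags` is a hypothetical helper that maps whatever tag characters are present back to the ASCII letters they mirror.

```python
# Family: man, ZWJ, woman, ZWJ, girl.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"

# Flag of England: black flag, five tag letters, then CANCEL TAG (U+E007F).
england = "\U0001F3F4\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067\U000E007F"

print(len(family))   # 5 codepoints for one visible glyph
print(len(england))  # 7 codepoints for one visible glyph


def decode_tags(sequence: str) -> str:
    """Map tag characters U+E0020..U+E007E back to the ASCII they mirror."""
    return "".join(
        chr(ord(ch) - 0xE0000)
        for ch in sequence
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


print(decode_tags(england))  # gbeng
```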

Will normalizing Unicode break my data?

For most data, no. Most text typed on keyboards or submitted through web forms is already in NFC, so normalizing to NFC usually changes nothing. The risky case is data stored as NFD that someone later normalizes to NFC: the result is semantically identical but byte-different, so stored keys, hashes, and filenames may no longer match. Test before bulk-normalizing.
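
A dry run is one way to test before bulk-normalizing: count how many values would change without writing anything. The sketch below uses Python's `unicodedata` module; the `values` list is made-up sample data, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

composed = "caf\u00e9"     # é as one codepoint (U+00E9)
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)          # different codepoints, same rendering
print(len(composed), len(decomposed))  # 4 and 5

# After NFC normalization the two forms compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)

# Dry run over hypothetical stored values: which ones would NFC change?
values = ["plain", composed, decomposed]
would_change = [v for v in values if not unicodedata.is_normalized("NFC", v)]
print(len(would_change))
```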

Last updated · E-Utils editorial team