Text Cleaner
Clean up text by removing duplicate lines, trimming whitespace, removing empty lines, and normalizing spaces. Multiple operations in one pass. Free online text sanitizer.
Most "clean this text" tasks are not glamorous: strip trailing whitespace, collapse runs of blank lines, remove duplicate entries, normalize CRLF to LF, replace smart quotes with ASCII. The frustrating part is doing six of those in sequence — usually with a chain of sed and tr commands you re-derive every time. This cleaner runs a configurable pipeline in one pass, shows a before/after diff, and tells you the byte count saved.
Operations the cleaner can apply
- Trim whitespace — leading, trailing, or both. Per line or globally.
- Collapse internal whitespace — multiple spaces/tabs become a single space.
- Remove blank lines — every empty line, or runs of 2+ collapsed to one.
- Deduplicate lines — case-sensitive or case-insensitive. Preserves first occurrence; the order of remaining lines is preserved.
- Sort lines — ascending or descending, locale-aware or byte-order, case-sensitive optional.
- Normalize line endings — CRLF (Windows) ↔ LF (Unix) ↔ CR (old Mac, rare). Editors that "just work" hide which one is in use; a CI job that fails with "line 1: $'\r': command not found" is signaling CRLF line endings in a shell script.
- Normalize Unicode — NFC / NFD / NFKC / NFKD. Different forms of "café" can differ at the byte level (c-a-f-é as four codepoints vs c-a-f-e-combining-accent as five). String comparisons fail across forms; normalize to NFC for safe equality.
- Replace smart quotes — “ ” ‘ ’ ‹ › « » → " and '. Diacritic-aware: é stays é, but “ becomes ".
- Strip non-ASCII — drop or replace everything outside 0x20–0x7E. Useful when feeding to an old printer protocol or a system that mojibakes anything else.
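Several of the operations above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names and the `SMART_QUOTES` table are hypothetical, and "whitespace" is assumed to mean spaces and tabs only.

```python
import re

# Hypothetical mapping for illustration; the tool's real table is not shown on this page.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u00ab": '"', "\u00bb": '"',  # double guillemets
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u2039": "'", "\u203a": "'",  # single guillemets
})

def normalize_line_endings(text: str) -> str:
    """Convert CRLF and lone CR to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def trim_lines(text: str) -> str:
    """Strip leading and trailing spaces/tabs from every line (expects LF endings)."""
    return "\n".join(line.strip(" \t") for line in text.split("\n"))

def collapse_internal_whitespace(text: str) -> str:
    """Turn each run of spaces/tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text)

def collapse_blank_runs(text: str) -> str:
    """Collapse runs of 2+ empty lines into one (expects trimmed lines)."""
    return re.sub(r"\n{3,}", "\n\n", text)

def replace_smart_quotes(text: str) -> str:
    """Map typographic quotes to ASCII; accented letters are left alone."""
    return text.translate(SMART_QUOTES)
```

Chaining them in the order shown (line endings first, then trim, then the rest) turns `"  café  \r\n\r\n\r\n\r\n“hello” \t world\r"` into `'café\n\n"hello" world\n'`.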
Working example
Input
Apple Apple banana Apple cherry cherry
Output
Pipeline: trim, dedupe (case-insensitive), normalize CRLF → LF, remove blank lines, sort.
Result:
Apple
banana
cherry
Diff: removed 5 lines (3 dupes, 2 blank), normalized 1 line ending, trimmed 4 lines of whitespace.
Bytes saved: 27 → 21.
Order matters in the pipeline: trim before dedupe (otherwise "Apple" and "Apple " look different); normalize line endings before splitting on \n (otherwise blank-line detection misses CRLF blank lines).
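The effect of ordering can be seen in a small Python sketch. Both functions below are hypothetical illustrations (not the tool's code); they apply the same ideas in a different order to the same CRLF input.

```python
def naive_clean(text):
    """Wrong order: split on LF, dedupe, then drop blanks, with no trimming."""
    lines = text.split("\n")                 # every line still ends in "\r"
    lines = list(dict.fromkeys(lines))       # "Apple\r" and "Apple \r" look different
    return [ln for ln in lines if ln != ""]  # "\r" is not an empty string

def ordered_clean(text):
    """Normalize endings, trim, drop blanks, then dedupe (first occurrence wins)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [ln.strip() for ln in text.split("\n")]
    lines = [ln for ln in lines if ln]
    return list(dict.fromkeys(lines))

raw = "Apple\r\nApple \r\n\r\nbanana\r\n"
print(naive_clean(raw))    # ['Apple\r', 'Apple \r', '\r', 'banana\r']
print(ordered_clean(raw))  # ['Apple', 'banana']
```

The naive version keeps both "Apple" variants and the CRLF-only blank line, which is exactly the failure described above.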
Hidden characters worth checking for
- NO-BREAK SPACE (U+00A0) — looks like a space, fails most "if (c === ' ')" checks. Sneaks in from web pages, Word docs, and Slack copy-paste.
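- ZERO WIDTH SPACE (U+200B) — completely invisible. Used legitimately to mark line-break opportunities in scripts written without spaces (Thai, Khmer), but often inserted accidentally by chat apps.
- BYTE ORDER MARK (U+FEFF) — invisible at start-of-file (bytes EF BB BF in UTF-8). Breaks shell scripts because the "#!" shebang is no longer the first thing in the file, and confuses JSON parsers that do not expect it.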
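- LINE SEPARATOR (U+2028) / PARAGRAPH SEPARATOR (U+2029) — old typographic newlines. Legal unescaped inside JSON strings, but a syntax error inside JavaScript string literals before ES2019. ES2019 changed JavaScript to accept them (JSON itself did not change), so JSON pasted into script source can still break on older engines.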
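- RIGHT-TO-LEFT MARK (U+200F) / LEFT-TO-RIGHT MARK (U+200E) — bidi controls. Used legitimately in Arabic/Hebrew mixed text; bidi controls, in particular the embedding, override, and isolate characters (U+202A–U+202E, U+2066–U+2069), can be used maliciously to make code display in a different order than it is parsed.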
- Soft hyphen (U+00AD) — invisible until a line break, where it becomes a hyphen. Web pages with copy-paste-protection sometimes use it as a tracking mark.
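A quick way to check a string for the characters listed above is to scan it and report each hit by position and Unicode name. The sketch below uses Python's standard `unicodedata` module; the `SUSPECTS` set and the `find_hidden` name are assumptions made for this example, and the set covers only the characters named in this list.

```python
import unicodedata

# Only the code points discussed above; extend as needed.
SUSPECTS = {
    "\u00a0",  # NO-BREAK SPACE
    "\u200b",  # ZERO WIDTH SPACE
    "\ufeff",  # BYTE ORDER MARK
    "\u2028",  # LINE SEPARATOR
    "\u2029",  # PARAGRAPH SEPARATOR
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u00ad",  # SOFT HYPHEN
}

def find_hidden(text):
    """Return (index, 'U+XXXX', Unicode name) for each suspect character."""
    hits = []
    for index, ch in enumerate(text):
        if ch in SUSPECTS:
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits

print(find_hidden("a\u00a0b\u200bc"))
# [(1, 'U+00A0', 'NO-BREAK SPACE'), (3, 'U+200B', 'ZERO WIDTH SPACE')]
```

Note that the Unicode database reports U+FEFF under its formal name, ZERO WIDTH NO-BREAK SPACE, rather than "byte order mark".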
When to reach for this tool
- You pasted a list of emails from a Word doc and need to deduplicate and strip the smart quotes and stray spaces before importing.
- You inherited a CSV with mixed CRLF and LF line endings that breaks your importer; normalize to LF and re-run.
- You are debugging "the user input does not match the database value" — almost certainly trailing whitespace, smart quotes, or a NBSP.
- You need to compare two configuration files and want to normalize whitespace and line endings before diffing so the diff shows real changes only.
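For the last case, normalizing before diffing can be sketched with Python's standard `difflib`. The helper names are hypothetical, and "normalize whitespace" is assumed here to mean stripping trailing whitespace only, so indentation differences would still show up.

```python
import difflib

def normalize(text):
    """Unify line endings to LF and strip trailing whitespace from each line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [line.rstrip() for line in text.split("\n")]

def meaningful_diff(a, b):
    """Unified diff of two texts after normalization; empty list if equivalent."""
    return list(difflib.unified_diff(
        normalize(a), normalize(b), fromfile="a", tofile="b", lineterm=""
    ))

old = "host = example\r\nport = 80  \r\n"
new = "host = example\nport = 8080\n"
for line in meaningful_diff(old, new):
    print(line)
```

Only the `port` line appears as changed; the CRLF/LF difference on the `host` line and the trailing spaces do not.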
What this tool will not do
- It will not infer your intent. "Clean this up" is ambiguous — pick operations explicitly. The default pipeline (trim + collapse blanks) is conservative; aggressive operations (sort, dedupe, strip non-ASCII) are opt-in.
- It will not fix CSV-specific issues (escaping commas inside quoted fields, balancing quotes). For CSV, use a CSV parser, not a line-based cleaner.
- It will not preserve formatting in structured documents (Markdown, JSON, YAML). Whitespace inside code blocks or string literals is significant; running this cleaner over a code file will reformat string content.
Text is processed entirely in your browser. Internal lists, customer data, and confidential drafts are not sent to any server.
Frequently asked questions
Why does my deduplicated list still have duplicates?
Almost always one of: (1) one line has a trailing space and the other does not — trim first, then dedupe; (2) one is "Café" with NFC normalization, the other with NFD — normalize Unicode first; (3) case-mismatch and dedupe is case-sensitive — switch to case-insensitive.
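The three causes map onto one idea: compare lines by a normalized key rather than by their raw text. A minimal Python sketch, with hypothetical function names, assuming NFC as the target form and `str.casefold` for case-insensitive matching:

```python
import unicodedata

def dedupe_key(line, case_insensitive=True):
    """Key used for comparison: trimmed, NFC-normalized, optionally case-folded."""
    key = unicodedata.normalize("NFC", line.strip())
    return key.casefold() if case_insensitive else key

def dedupe(lines, case_insensitive=True):
    """Keep the first occurrence of each key, in original order."""
    seen = set()
    kept = []
    for line in lines:
        key = dedupe_key(line, case_insensitive)
        if key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return kept

lines = ["Caf\u00e9", "Cafe\u0301 ", "caf\u00e9", "tea"]
print(dedupe(lines))                          # ['Café', 'tea']
print(dedupe(lines, case_insensitive=False))  # ['Café', 'café', 'tea']
```

The second entry is the NFD spelling with a trailing space; without trimming and normalization it would survive as a "duplicate".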
How do I keep the original order while deduplicating?
Dedupe-without-sort keeps the first occurrence of each line and leaves the surviving lines in their original order. Dedupe-with-sort produces an alphabetical list and discards the original order. Pick the one matching your downstream consumer.
What is NFC vs NFD normalization?
NFC (Composed) represents "é" as one codepoint U+00E9. NFD (Decomposed) represents it as "e" + combining acute U+0301. Both display identically but differ at the byte level. NFC is more compact; NFD is easier for algorithmic processing (separate base letter and accent). Most web inputs are NFC.
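The difference is easy to observe with Python's standard `unicodedata` module. The snippet below is only an illustration; `same_text` is a hypothetical helper name.

```python
import unicodedata

composed = "caf\u00e9"     # é as one code point (NFC form)
decomposed = "cafe\u0301"  # e + combining acute accent (NFD form)

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 4 5
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))   # 5 6
print(same_text(composed, decomposed))   # True
```

A plain `==` fails even though both strings render as "café"; comparing after normalization succeeds.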
Will the cleaner break my regex or code?
It can. If you are cleaning a file with regex patterns, whitespace in a pattern is significant: a literal space, or a space inside a [ ] character class, matches exactly that character, so collapsing internal whitespace will change behavior. For code/regex, prefer a code formatter to a text cleaner.
How do I detect the line ending of my input?
The cleaner reports CRLF / LF / CR / mixed counts. "Mixed" means your file has both — pick one to normalize to. Always pick LF unless you specifically need Windows-format output for a downstream tool.
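A count like this can be reproduced outside the tool with a single regular expression. The sketch below is an assumption about how such a report could be computed, not the cleaner's own code; `count_line_endings` is a hypothetical name.

```python
import re

def count_line_endings(text):
    """Return (counts, verdict) where verdict is CRLF, LF, CR, mixed, or none."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    # "\r\n" is listed first so a CRLF pair is counted once, not as CR plus LF.
    for match in re.finditer(r"\r\n|\n|\r", text):
        ending = match.group()
        if ending == "\r\n":
            counts["CRLF"] += 1
        elif ending == "\n":
            counts["LF"] += 1
        else:
            counts["CR"] += 1
    kinds = [kind for kind, n in counts.items() if n]
    if len(kinds) > 1:
        verdict = "mixed"
    elif kinds:
        verdict = kinds[0]
    else:
        verdict = "none"
    return counts, verdict

print(count_line_endings("a\r\nb\nc\rd"))
# ({'CRLF': 1, 'LF': 1, 'CR': 1}, 'mixed')
```

When reading from a file in Python, open it with `newline=""` so the original endings are not translated to LF before counting.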
Can I save my cleanup pipeline as a preset?
Settings persist in localStorage for return visits. For shared or reproducible pipelines, copy the operation list out and re-create it in each session, or script it with sed / awk / Python for true reproducibility.
Last updated · E-Utils editorial team