Same, Same — but Different

Strings that look identical to you can be completely different to Python. Here's how to see what the computer sees — one copyable snippet at a time.

a а

Those are two characters. They render the same in almost every font. Yet "a" == "а" is False — one is Latin, one is Cyrillic. The whole page is variations on that surprise. Run the snippets; don't take my word for it.

Your toolkit — three calls from `unicodedata`

unicodedata.name(ch): the official name of a character (your X-ray).
unicodedata.normalize("NFKC", s): fold compatibility variants (ligatures, fullwidth forms, decomposed accents) into one canonical form.
[hex(ord(c)) for c in s]: the actual code points behind a string.

The last question turns these into a search engine over the entire Unicode database.
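Here is a minimal sketch that bundles the name and code-point calls into one helper. The name xray is made up for this page, not part of unicodedata. One real subtlety: unicodedata.name() raises ValueError for characters that have no name (most control characters), so the sketch passes a default instead.

```python
import unicodedata

def xray(s):
    """Return one (code point, name) pair per character in s. Hypothetical helper."""
    rows = []
    for ch in s:
        # name() raises ValueError for unnamed characters such as "\n",
        # so supply a fallback string as the second argument.
        rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return rows

for code_point, name in xray("a\u0430\n"):
    print(code_point, name)
```

Every later snippet on this page is some variation of this loop.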

Are these two letters the same?

a vs а

They look identical. Compare them and ask each one its name.

import unicodedata
pair = ("a", "а")
print(pair[0] == pair[1])
for ch in pair:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

what you'll see

False
U+0061  LATIN SMALL LETTER A
U+0430  CYRILLIC SMALL LETTER A
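One way to catch this kind of swap is to check whether a single word mixes alphabets. The standard library has no script lookup, so the sketch below leans on an assumption: that the first word of a letter's Unicode name ("LATIN", "CYRILLIC", "HEBREW"…) is a good-enough stand-in for its script. The function name scripts is hypothetical.

```python
import unicodedata

def scripts(word):
    """Rough script guess: the first word of each letter's Unicode name."""
    found = set()
    for ch in word:
        if ch.isalpha():
            # Assumption: "LATIN SMALL LETTER A" -> "LATIN" approximates the script.
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

print(scripts("paypal"))        # a single script
print(scripts("p\u0430ypal"))   # two scripts: the second letter is Cyrillic
```

A word that reports more than one script is not proof of forgery, but it is worth a second look.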

Is that a quote mark — or a gershayim?

ד"ר (straight quote) vs ד״ר (gershayim)

Hebrew abbreviations (like ד״ר = "Dr.") use a dedicated mark, gershayim, that looks just like an ASCII double-quote. Mix them up and your lookups silently miss.

import unicodedata
ascii_q   = 'ד"ר'
gershayim = "ד״ר"
print(ascii_q == gershayim)
for s in (ascii_q, gershayim):
    print(f"U+{ord(s[1]):04X}  {unicodedata.name(s[1])}")

what you'll see

False
U+0022  QUOTATION MARK
U+05F4  HEBREW PUNCTUATION GERSHAYIM
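If your lookups should treat both marks alike, one option is to fold them to a single character before comparing. This is a sketch under an assumption: that losing the distinction is acceptable for your data. The table FOLD and the function fold_quotes are made-up names; str.maketrans and str.translate are ordinary built-ins.

```python
# Hypothetical folding table: Hebrew punctuation -> ASCII look-alike.
FOLD = str.maketrans({
    "\u05f4": '"',   # HEBREW PUNCTUATION GERSHAYIM -> QUOTATION MARK
    "\u05f3": "'",   # HEBREW PUNCTUATION GERESH    -> APOSTROPHE
})

def fold_quotes(s):
    """Replace gershayim/geresh with their ASCII look-alikes."""
    return s.translate(FOLD)

ascii_q   = "\u05d3\"\u05e8"        # dalet, ASCII quote, resh
gershayim = "\u05d3\u05f4\u05e8"    # dalet, gershayim, resh
print(fold_quotes(ascii_q) == fold_quotes(gershayim))   # True
```

Fold only the comparison key; keep the original string if the typography matters.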

Same word — why is its length different?

שלום (plain) vs שָׁלוֹם (with niqqud)

Vowel points (niqqud) are combining marks: separate code points that stack onto a letter. The word looks almost the same, but it's longer.

import unicodedata
plain   = "שלום"
pointed = "שָלום"
print(len(plain), len(pointed), plain == pointed)
print([unicodedata.name(c) for c in pointed])

what you'll see

4 5 False
['HEBREW LETTER SHIN', 'HEBREW POINT QAMATS', 'HEBREW LETTER LAMED', 'HEBREW LETTER VAV', 'HEBREW LETTER FINAL MEM']
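To match pointed and unpointed spellings, you can drop the combining marks before comparing. A minimal sketch, assuming that throwing away the marks is acceptable (it is lossy: it also strips accents from Latin text). strip_marks is a made-up name; unicodedata.combining() returns a character's combining class, which is 0 for ordinary letters and non-zero for niqqud.

```python
import unicodedata

def strip_marks(s):
    """Decompose, then drop every character with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

plain   = "\u05e9\u05dc\u05d5\u05dd"          # shin, lamed, vav, final mem
pointed = "\u05e9\u05b8\u05dc\u05d5\u05dd"    # same, with qamats after shin
print(strip_marks(pointed) == plain, len(strip_marks(pointed)))   # True 4
```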

Is é one character, or two?

café (composed) vs café (decomposed)

"é" can be a single code point (NFC) or "e" + a combining accent (NFD). Same glyph, different bytes — until you normalize.

import unicodedata
nfc = "caf\u00e9"    # é as one code point (escape survives copy-paste)
nfd = "cafe\u0301"   # e + COMBINING ACUTE ACCENT
print(nfc == nfd, len(nfc), len(nfd))
print(unicodedata.normalize("NFC", nfd) == nfc)
print([unicodedata.name(c) for c in nfd])

what you'll see

False 4 5
True
['LATIN SMALL LETTER C', 'LATIN SMALL LETTER A', 'LATIN SMALL LETTER F', 'LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
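"Different bytes" is easy to check for yourself. The sketch below encodes both spellings as UTF-8; the escapes are used so an editor can't quietly normalize the literals.

```python
nfc = "caf\u00e9"    # é as a single code point
nfd = "cafe\u0301"   # e followed by a combining acute accent

print(nfc.encode("utf-8"))   # b'caf\xc3\xa9'
print(nfd.encode("utf-8"))   # b'cafe\xcc\x81'
```

The composed form takes 5 bytes, the decomposed form 6, which is why hashes and database keys built from the raw strings disagree.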

Why won't this string split on spaces?

hello world (looks normal…)

The "space" between the words isn't a space — it's a no-break space (U+00A0). It compares unequal and resists split(" ").

import unicodedata
a = "hello world"
b = "hello\u00a0world"   # NO-BREAK SPACE, written as an escape so copy-paste can't lose it
print(a == b)
print(a.split(" "), b.split(" "))
print(f"U+{ord(b[5]):04X}  {unicodedata.name(b[5])}")

what you'll see

False
['hello', 'world'] ['hello\xa0world']
U+00A0  NO-BREAK SPACE
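A small follow-up worth running: split() with no argument behaves differently from split(" "). With no argument Python splits on any character for which str.isspace() is true, and the no-break space is one of them.

```python
b = "hello\u00a0world"       # no-break space between the words

print(b.split(" "))          # ['hello\xa0world']  (only a literal U+0020 splits)
print(b.split())             # ['hello', 'world']  (any Unicode whitespace splits)
print("\u00a0".isspace())    # True
```

Whether that is what you want depends on the data: a no-break space is sometimes there on purpose, to keep two words together.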

What's hiding inside this word?

shalom (6 letters… or is it?)

Some characters take zero width — a zero-width space, a right-to-left mark. Invisible to you, very real to len().

import unicodedata
visible = "shalom"
sneaky  = "sha\u200blom\u200f"   # hidden ZERO WIDTH SPACE and RIGHT-TO-LEFT MARK, as escapes
print(visible == sneaky, len(visible), len(sneaky))
print([f"U+{ord(c):04X} {unicodedata.name(c)}"
       for c in sneaky if ord(c) > 0x7f])

what you'll see

False 6 8
['U+200B ZERO WIDTH SPACE', 'U+200F RIGHT-TO-LEFT MARK']
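Both stowaways belong to the Unicode category Cf ("format"), so one way to clean a string is to drop that whole category. Treat this as a sketch, not a universal fix: Cf also contains the zero-width joiner that some emoji sequences and scripts need, so stripping it blindly can damage legitimate text. strip_format_chars is a made-up name.

```python
import unicodedata

def strip_format_chars(s):
    """Drop every character whose general category is Cf (format)."""
    return "".join(c for c in s if unicodedata.category(c) != "Cf")

sneaky  = "sha\u200blom\u200f"   # zero-width space inside, right-to-left mark at the end
cleaned = strip_format_chars(sneaky)
print(cleaned == "shalom", len(sneaky), len(cleaned))   # True 8 6
```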

This word shows 6 letters — why is `len()` 7?

abc⟦RLO⟧xyz (what you type) → abczyx (what you SEE)

A right-to-left override (U+202E) is an invisible control character that reverses everything after it. The screen shows 6 glyphs; len() says 7 — the 7th is the hidden override. This is the Trojan Source trick: text that reads one way and means another.

import unicodedata
s = "abc\u202exyz"
print(repr(s), len(s))
for c in s:
    if unicodedata.category(c) == "Cf":
        print(f"U+{ord(c):04X}  {unicodedata.name(c)}  bidi={unicodedata.bidirectional(c)}")

what you'll see

'abc\u202exyz' 7
U+202E  RIGHT-TO-LEFT OVERRIDE  bidi=RLO

The bidi family — all invisible, all reorder: marks (LRM/RLM), overrides (LRO/RLO), embeddings (LRE/RLE/PDF), isolates (LRI/RLI/FSI/PDI).
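A scanner for that family could look like the sketch below. Overrides, embeddings and isolates each report their own bidirectional class, but the marks LRM and RLM report plain "L" and "R", so the sketch checks those two by code point. The names BIDI_CLASSES, BIDI_MARKS and find_bidi_controls are all made up for this example.

```python
import unicodedata

# Bidirectional classes of the overrides, embeddings and isolates.
BIDI_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}
# LRM / RLM have class "L" / "R", so they are matched by code point instead.
BIDI_MARKS = {"\u200e", "\u200f"}

def find_bidi_controls(text):
    """Return (index, code point, name) for every bidi control in text."""
    hits = []
    for i, c in enumerate(text):
        if c in BIDI_MARKS or unicodedata.bidirectional(c) in BIDI_CLASSES:
            hits.append((i, f"U+{ord(c):04X}", unicodedata.name(c)))
    return hits

print(find_bidi_controls("abc\u202exyz"))
# [(3, 'U+202E', 'RIGHT-TO-LEFT OVERRIDE')]
```

Running something like this over source files or user input before review is one way to spot Trojan Source style tricks.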

Two identical lines — `\n` vs `\r\n`

"a⏎b" (Unix · \n) vs "a⏎b" (Windows · \r\n)

Windows ends a line with carriage-return + line-feed (\r\n); Unix uses just \n. Identical on screen — but len differs, == fails, and split("\n") leaves a stray \r clinging to every line.

import unicodedata
unix = "a\nb"
win  = "a\r\nb"
print(unix == win, len(unix), len(win))
print(win.split("\n"))
print(win.splitlines())
print([f"U+{ord(c):04X} {unicodedata.category(c)}" for c in win if ord(c) < 0x20])

what you'll see

False 3 4
['a\r', 'b']
['a', 'b']
['U+000D Cc', 'U+000A Cc']
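If you want one convention everywhere, a common approach is to rewrite every line ending to \n before doing anything else. A minimal sketch; the order of the two replacements matters, because handling a lone \r first would turn each \r\n into two newlines.

```python
def to_unix_newlines(s):
    """Convert Windows (\\r\\n) and old Mac (\\r) line endings to \\n."""
    # Replace the two-character sequence first, then any leftover lone \r.
    return s.replace("\r\n", "\n").replace("\r", "\n")

mixed = "a\r\nb\rc\n"
print(repr(to_unix_newlines(mixed)))   # 'a\nb\nc\n'
```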

Are all of these the number 3?

3 (ASCII) · ٣ (Arabic-Indic) · ３ (fullwidth)

To int() they're all 3 — Python understands Unicode digits. But they're three different code points, which trips up exact matching and tokenizers.

import unicodedata
for d in ("3", "٣", "３"):
    print(repr(d), d.isdigit(), int(d), unicodedata.name(d))

what you'll see

'3' True 3 DIGIT THREE
'٣' True 3 ARABIC-INDIC DIGIT THREE
'３' True 3 FULLWIDTH DIGIT THREE
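For exact matching you may want every decimal digit rewritten as ASCII. The sketch below uses unicodedata.decimal(), which returns the digit's value, or the supplied default for anything that is not a decimal digit. ascii_digits is a made-up name, and the choice of decimal() over digit() is deliberate: decimal() leaves characters like superscript ² alone.

```python
import unicodedata

def ascii_digits(s):
    """Rewrite every Unicode decimal digit as its ASCII equivalent."""
    out = []
    for c in s:
        value = unicodedata.decimal(c, None)   # None for non-digits
        out.append(str(value) if value is not None else c)
    return "".join(out)

print(ascii_digits("\u0663\uff13 3"))   # 33 3
```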

How do I find every character that looks like an X?

This is the real superpower. The Unicode database is searchable by name. Want every quotation-mark-like character — every X, X₁, X₂ a forger could swap in? Grep the names. (Swap "quotation mark" for "alef", "space", "digit three"…)

import unicodedata, sys

def look_alikes(needle):
    needle = needle.upper()
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")
        if needle in name:
            print(f"U+{cp:04X}  {chr(cp)}  {name}")

look_alikes("quotation mark")

what you'll see (first 6 of 30; the exact total depends on the Unicode version your Python ships with)

U+0022  "  QUOTATION MARK
U+00AB  «  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00BB  »  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2018  ‘  LEFT SINGLE QUOTATION MARK
U+2019  ’  RIGHT SINGLE QUOTATION MARK
U+201A  ‚  SINGLE LOW-9 QUOTATION MARK
... 30 total
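If you want the matches as data rather than printed lines, a variant that returns a list makes counting and slicing easy. find_by_name is a made-up name. Keep in mind what a name search can and cannot do: it finds characters whose names share a phrase, not characters that merely look alike, so searching "latin small letter a" will never surface the Cyrillic а.

```python
import sys
import unicodedata

def find_by_name(needle):
    """Return (code point, name) for every character whose name contains needle."""
    needle = needle.upper()
    hits = []
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")   # "" for unnamed code points
        if needle in name:
            hits.append((cp, name))
    return hits

hits = find_by_name("quotation mark")
print(len(hits), "matches")
for cp, name in hits[:6]:
    print(f"U+{cp:04X}  {chr(cp)}  {name}")
```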

How to actually fix it

When you want "looks the same" to mean the same — dedup, search, matching — fold everything to one canonical form first: NFKC (collapses compatibility look-alikes) + casefold (aggressive lowercasing).

import unicodedata

def canon(s):
    return unicodedata.normalize("NFKC", s).casefold()

print(canon("ﬁle") == canon("file"))
print(canon("CAFÉ") == canon("café"))

what you'll see

True
True

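NFKC is a sledgehammer: it rewrites ﬁ→fi, ３→3, and more. It does not merge cross-script look-alikes, though; Latin a and Cyrillic а are still different after canon(). It's great for matching but wrong if you must preserve the exact original text (e.g. character offsets into a document). In that case, compare canonical forms but keep the raw string.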
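Keeping the raw string while comparing canonical forms can be sketched as a dedup that keys on canon() but stores the first original spelling it sees. dedup is a made-up name; canon is repeated so the snippet runs on its own.

```python
import unicodedata

def canon(s):
    return unicodedata.normalize("NFKC", s).casefold()

def dedup(records):
    """Keep the first raw spelling for each canonical form."""
    seen = {}
    for raw in records:
        seen.setdefault(canon(raw), raw)   # the key is folded, the value is untouched
    return list(seen.values())

records = ["\ufb01le", "file", "FILE", "caf\u00e9", "cafe\u0301"]
print(len(dedup(records)))   # 2
```

The two survivors are the ligature spelling of "file" and the composed "café", exactly as they were typed.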

Why any of this matters

Search & dedup — "identical" records that never match; duplicates you can't find.
Security — homograph / look-alike-domain attacks (раypal.com with a Cyrillic а).
Data & ML — tokenizers split disguised digits and spaces oddly; "the same" label fails to join.
Offsets — combining marks and zero-width characters make len() and string indices disagree with what the eye counts.

Everything here is Python 3 standard library — just unicodedata. Best way to learn it: pick a character you don't trust and run unicodedata.name() on it. Then search the names for its look-alikes.
