Strings that look identical to you can be completely different to Python. Here's how to see what the computer sees — one copyable snippet at a time.
Those are two characters. They render the same in almost every font. Yet
"a" == "а" is False — one is Latin, one is Cyrillic.
The whole page is variations on that surprise. Run the snippets; don't take my word for it.
unicodedata
unicodedata.name(ch): the official name of a character (your X-ray).
unicodedata.normalize("NFKC", s): fold look-alikes & variants into one canonical form.
[hex(ord(c)) for c in s]: the actual code points behind a string.
The last question turns these into a search engine over the entire Unicode database.
They look identical. Compare them and ask each one its name.
import unicodedata
pair = ("a", "а")
print(pair[0] == pair[1])
for ch in pair:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

False
U+0061 LATIN SMALL LETTER A
U+0430 CYRILLIC SMALL LETTER A
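One way to catch this kind of swap automatically is to check whether a single word mixes scripts. The sketch below is a rough heuristic, not a standard API: `scripts_used` is a hypothetical helper, and it assumes the first word of a character's official name ("LATIN", "CYRILLIC", ...) is a good enough stand-in for its script.

```python
import unicodedata

def scripts_used(s):
    """Return the set of script-like name prefixes for the letters in s."""
    found = set()
    for ch in s:
        if ch.isalpha():
            # Heuristic: first word of the official name, e.g. "LATIN" or "CYRILLIC".
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found

print(scripts_used("paypal"))        # every letter is Latin
print(scripts_used("p\u0430ypal"))   # second letter is U+0430 CYRILLIC SMALL LETTER A
```

A word that reports more than one prefix is worth a second look. Legitimate mixed-script text exists, so treat this as a flag, not a verdict.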
Hebrew abbreviations (like ד״ר = "Dr.") use a dedicated mark, gershayim, that looks just like an ASCII double-quote. Mix them up and your lookups silently miss.
import unicodedata
ascii_q = 'ד"ר'
gershayim = "ד״ר"
print(ascii_q == gershayim)
for s in (ascii_q, gershayim):
    print(f"U+{ord(s[1]):04X} {unicodedata.name(s[1])}")

False
U+0022 QUOTATION MARK
U+05F4 HEBREW PUNCTUATION GERSHAYIM
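Normalization does not help here: gershayim has no compatibility mapping, so NFKC leaves it alone. If lookups should treat the two marks as the same, one option is an explicit translation table. This is a minimal sketch; `PUNCT_FOLD` and `fold_punct` are hypothetical names, and the choice of which marks to fold is an assumption you would adapt to your data.

```python
import unicodedata

# Hypothetical folding table: Hebrew punctuation -> ASCII look-alikes.
PUNCT_FOLD = str.maketrans({
    "\u05f4": '"',   # HEBREW PUNCTUATION GERSHAYIM -> QUOTATION MARK
    "\u05f3": "'",   # HEBREW PUNCTUATION GERESH -> APOSTROPHE
})

def fold_punct(s):
    """Replace the mapped punctuation marks; leave everything else untouched."""
    return s.translate(PUNCT_FOLD)

ascii_q = "\u05d3\"\u05e8"         # dalet, ASCII quotation mark, resh
gershayim = "\u05d3\u05f4\u05e8"   # dalet, gershayim, resh

# NFKC alone does not turn gershayim into an ASCII quote.
print(unicodedata.normalize("NFKC", gershayim) == ascii_q)
# After folding, the two spellings compare equal.
print(fold_punct(gershayim) == fold_punct(ascii_q))
```

Fold only for comparison keys. The gershayim is the correct character in Hebrew text, so the stored string should keep it.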
Vowel points (niqqud) are combining marks: separate code points that stack onto a letter. The word looks almost the same, but it's longer.
import unicodedata
plain = "שלום"
pointed = "שָלום"
print(len(plain), len(pointed), plain == pointed)
print([unicodedata.name(c) for c in pointed])

4 5 False
['HEBREW LETTER SHIN', 'HEBREW POINT QAMATS', 'HEBREW LETTER LAMED', 'HEBREW LETTER VAV', 'HEBREW LETTER FINAL MEM']
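If pointed and unpointed spellings should match, a common approach is to decompose the string and drop every combining mark. The sketch below assumes that is what you want; `strip_marks` is a hypothetical helper, and it is lossy by design because it also removes accents in other scripts.

```python
import unicodedata

def strip_marks(s):
    """Decompose s, then drop every character with a nonzero combining class."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

pointed = "\u05e9\u05b8\u05dc\u05d5\u05dd"   # shin + qamats + lamed + vav + final mem
plain = "\u05e9\u05dc\u05d5\u05dd"           # the same word without the vowel point

print(strip_marks(pointed) == plain)
print(strip_marks("caf\u00e9"))              # the acute accent is removed as well
```

Use the stripped form as a search key only. Niqqud and accents carry meaning, so the original string is the one to display and store.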
"é" can be a single code point (NFC) or "e" + a combining accent (NFD). Same glyph, different bytes — until you normalize.
import unicodedata
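nfc = "caf\u00e9"   # one code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT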
print(nfc == nfd, len(nfc), len(nfd))
print(unicodedata.normalize("NFC", nfd) == nfc)
print([unicodedata.name(c) for c in nfd])

False 4 5
True
['LATIN SMALL LETTER C', 'LATIN SMALL LETTER A', 'LATIN SMALL LETTER F', 'LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
The "space" between the words isn't a space — it's a no-break space (U+00A0). It compares unequal and resists split(" ").
import unicodedata
a = "hello world"
b = "hello\u00a0world"  # U+00A0 NO-BREAK SPACE between the words
print(a == b)
print(a.split(" "), b.split(" "))
print(f"U+{ord(b[5]):04X} {unicodedata.name(b[5])}")

False
['hello', 'world'] ['hello\xa0world']
U+00A0 NO-BREAK SPACE
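The failure above comes from passing a literal " " to split. Called with no argument, str.split breaks on any Unicode whitespace, the no-break space included. A small sketch of that difference, using the same two strings:

```python
a = "hello world"
b = "hello\u00a0world"   # U+00A0 NO-BREAK SPACE between the words

print(b.split(" "))      # explicit separator: the no-break space is not matched
print(b.split())         # no argument: any Unicode whitespace separates words

# Rejoining with a plain space gives a string that compares equal to a.
print(" ".join(b.split()) == a)
```

This also collapses runs of whitespace and drops leading and trailing whitespace, which may or may not be what you want.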
Some characters take zero width — a zero-width space, a right-to-left mark. Invisible to you, very real to len().
import unicodedata
visible = "shalom"
sneaky = "sha\u200blom\u200f"  # hides U+200B ZERO WIDTH SPACE and U+200F RIGHT-TO-LEFT MARK
print(visible == sneaky, len(visible), len(sneaky))
print([f"U+{ord(c):04X} {unicodedata.name(c)}"
       for c in sneaky if ord(c) > 0x7f])

False 6 8
['U+200B ZERO WIDTH SPACE', 'U+200F RIGHT-TO-LEFT MARK']
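To neutralize these for comparison, one option is to drop every character in the Unicode "Cf" (format) category, which covers both characters in the example. This is a sketch under that assumption; `strip_format_chars` is a hypothetical helper, and the placement of the hidden characters in the sample string is made up for illustration.

```python
import unicodedata

def strip_format_chars(s):
    """Drop characters whose general category is Cf (format controls)."""
    return "".join(c for c in s if unicodedata.category(c) != "Cf")

visible = "shalom"
sneaky = "sha\u200blom\u200f"   # zero-width space inside, right-to-left mark at the end

print(len(sneaky), len(strip_format_chars(sneaky)))
print(strip_format_chars(sneaky) == visible)
```

Cf also includes characters that do real work, such as the zero-width joiner in emoji sequences and in some scripts, so apply this to comparison keys and not to text you will display.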
len() 7?
A right-to-left override (U+202E) is an invisible control character that reverses everything after it. The screen shows 6 glyphs; len() says 7, and the 7th is the hidden override. This is the Trojan Source trick: text that reads one way and means another.
import unicodedata
s = "abc\u202exyz"
print(repr(s), len(s))
for c in s:
    if unicodedata.category(c) == "Cf":
        print(f"U+{ord(c):04X} {unicodedata.name(c)} bidi={unicodedata.bidirectional(c)}")

'abc\u202exyz' 7
U+202E RIGHT-TO-LEFT OVERRIDE bidi=RLO
The bidi family — all invisible, all reorder: marks (LRM/RLM), overrides (LRO/RLO), embeddings (LRE/RLE/PDF), isolates (LRI/RLI/FSI/PDI).
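A scanner for that family can be as simple as a set of code points. The sketch below lists the characters named above by code point; `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical names, and the set is limited to the family listed here.

```python
import unicodedata

# Code points for the bidi controls named above.
BIDI_CONTROLS = {
    0x200E, 0x200F,                          # marks: LRM, RLM
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,  # LRE, RLE, PDF, LRO, RLO
    0x2066, 0x2067, 0x2068, 0x2069,          # isolates: LRI, RLI, FSI, PDI
}

def find_bidi_controls(s):
    """Return (index, official name) for every bidi control found in s."""
    return [(i, unicodedata.name(c))
            for i, c in enumerate(s)
            if ord(c) in BIDI_CONTROLS]

print(find_bidi_controls("abc\u202exyz"))
print(find_bidi_controls("plain text"))
```

An explicit set is used because the marks LRM and RLM report an ordinary bidirectional class from unicodedata.bidirectional, so checking for classes like "RLO" alone would miss them.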
\n vs \r\n
Windows ends a line with carriage-return + line-feed (\r\n); Unix uses just \n. Identical on screen, but len differs, == fails, and split("\n") leaves a stray \r clinging to every line.
import unicodedata
unix = "a\nb"
win = "a\r\nb"
print(unix == win, len(unix), len(win))
print(win.split("\n"))
print(win.splitlines())
print([f"U+{ord(c):04X} {unicodedata.category(c)}" for c in win if ord(c) < 0x20])

False 3 4
['a\r', 'b']
['a', 'b']
['U+000D Cc', 'U+000A Cc']
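When you need the whole string and not a list of lines, a small normalizer does the job. This is a sketch; `normalize_newlines` is a hypothetical helper, and it assumes a lone \r should also count as a line ending.

```python
def normalize_newlines(s):
    """Convert \\r\\n and lone \\r line endings to \\n."""
    # Replace the two-character sequence first so it does not become two newlines.
    return s.replace("\r\n", "\n").replace("\r", "\n")

unix = "a\nb"
win = "a\r\nb"

print(normalize_newlines(win) == unix)
print(repr(normalize_newlines("x\ry\r\nz")))
```

The order of the two replace calls matters: swapping them would turn every \r\n into a blank line.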
To int() they're all 3 — Python understands Unicode digits. But they're three different code points, which trips up exact matching and tokenizers.
import unicodedata
for d in ("3", "٣", "３"):
    print(repr(d), d.isdigit(), int(d), unicodedata.name(d))

'3' True 3 DIGIT THREE
'٣' True 3 ARABIC-INDIC DIGIT THREE
'３' True 3 FULLWIDTH DIGIT THREE
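For exact matching, it can help to rewrite every decimal digit as its ASCII equivalent before comparing. The sketch below uses unicodedata.decimal with a default value; `ascii_digits` is a hypothetical helper, and it deliberately leaves non-decimal characters such as superscripts alone.

```python
import unicodedata

def ascii_digits(s):
    """Replace each Unicode decimal digit in s with its ASCII digit."""
    out = []
    for c in s:
        # decimal() returns the digit's value, or the default for non-decimal characters.
        value = unicodedata.decimal(c, None)
        out.append(str(value) if value is not None else c)
    return "".join(out)

mixed = "3" + "\u0663" + "\uff13"   # ASCII, Arabic-Indic, and fullwidth three
print(ascii_digits(mixed))
print(ascii_digits("room \u0663\uff13") == "room 33")
```

Only characters with a decimal value are touched, so letters, punctuation, and superscript digits pass through unchanged.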
This is the real superpower. The Unicode database is searchable by name. Want every quotation-mark-like character — every X, X₁, X₂ a forger could swap in? Grep the names. (Swap "quotation mark" for "alef", "space", "digit three"…)
import unicodedata, sys
def look_alikes(needle):
    needle = needle.upper()
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")
        if needle in name:
            print(f"U+{cp:04X} {chr(cp)} {name}")

look_alikes("quotation mark")

U+0022 " QUOTATION MARK
U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00BB » RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2018 ‘ LEFT SINGLE QUOTATION MARK
U+2019 ’ RIGHT SINGLE QUOTATION MARK
U+201A ‚ SINGLE LOW-9 QUOTATION MARK
... 30 total
When you want "looks the same" to mean the same — dedup, search, matching — fold everything to one canonical form first: NFKC (collapses compatibility look-alikes) + casefold (aggressive lowercasing).
import unicodedata
def canon(s):
    return unicodedata.normalize("NFKC", s).casefold()

print(canon("ﬁle") == canon("file"))
print(canon("CAFÉ") == canon("café"))

True
True
NFKC is a sledgehammer: it rewrites ﬁ→fi, ３→3, and more. Great for matching, wrong if you must preserve the exact original text (e.g. character offsets into a document). In that case, compare canonical forms but keep the raw string.
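One way to compare canonical forms while keeping the raw text is to use the canonical form only as a dictionary key. This is a minimal sketch; `group_by_canon` is a hypothetical helper, and `canon` is the NFKC + casefold function from the snippet above, repeated so the example runs on its own.

```python
import unicodedata
from collections import defaultdict

def canon(s):
    """Comparison key: NFKC normalization followed by casefold."""
    return unicodedata.normalize("NFKC", s).casefold()

def group_by_canon(strings):
    """Map each canonical key to the list of raw strings that share it."""
    groups = defaultdict(list)
    for raw in strings:
        groups[canon(raw)].append(raw)   # the raw string is stored unchanged
    return dict(groups)

raw_strings = ["\ufb01le", "file", "FILE", "caf\u00e9", "cafe\u0301"]
groups = group_by_canon(raw_strings)

for key, members in groups.items():
    print(repr(key), [len(m) for m in members])
```

The keys decide what counts as "the same"; the values still hold every original spelling, so offsets into the source text stay valid.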
len() and string indices disagree with what the eye counts.