mediumBit ManipulationArrays & Hashing

UTF-8 Validation

mediumTime: O(n)Space: O(1)

Recognize the pattern

validate byte sequence encodingmulti-byte character rulesbit pattern checking

Brute force idea

If you approach UTF-8 Validation in the most literal way possible, you get this: No simpler alternative — you must validate the encoding rules byte by byte. It is a fair place to begin because it matches the surface of the question, yet it does not capture the deeper structure that makes the problem simpler.

Better approach

The real unlock in UTF-8 Validation comes when you notice this: Scan left to right. Check the leading bits of each byte to determine if it starts a 1, 2, 3, or 4-byte character. Then verify the next N-1 bytes start with '10'. Instead of recomputing the world every time, you preserve just enough context to let the next decision become obvious.

Key invariant

The compass for UTF-8 Validation is this: The first byte's leading bits determine how many continuation bytes follow. Each continuation byte must start with '10' (bits 7-6). Any violation means invalid encoding. As long as that statement keeps holding, you can trust the steps built on top of it.

Watch out for

The trap in UTF-8 Validation usually looks like this: Not checking that continuation bytes actually start with '10' — a valid first byte followed by wrong continuation bytes is still invalid. When the code becomes mechanical before the idea is clear, small edge cases start breaking the whole story.

Arrays & Hashing Pattern