Characters & Character Classes

A character class matches a single character drawn from a specified set. Custom classes are written inside square brackets […]; shorthand escapes such as \d name common sets.

Shorthand Classes

Shorthand	Equivalent	Matches
`\d`	`[0-9]`	Digit
`\D`	`[^0-9]`	Non-digit
`\w`	`[a-zA-Z0-9_]`	Word char (ASCII)
`\W`	`[^a-zA-Z0-9_]`	Non-word char
`\s`	`[ \t\n\r\f\v]`	Whitespace
`\S`	`[^ \t\n\r\f\v]`	Non-whitespace
`\h`	`[ \t]`	Horizontal space (PCRE/Perl; in Ruby `\h` means hex digit)
`\N`	any non-newline	Like `.` but unaffected by dotall mode (PCRE/Perl)
`\R`	any line break	`\n`, `\r`, `\r\n`, etc. (PCRE)
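
To see the ASCII shorthands in action, here is a minimal sketch using Python's re module. The sample strings and variable names are made up for illustration, and re.ASCII is passed so that \d, \w, and \s behave like the ASCII equivalents in the table.

```python
import re

# Hypothetical sample line mixing a tab, spaces, digits, and word characters.
line = "id=42\tname=ada_l  score=97"

# \d+ : runs of digits (ASCII-only because of re.ASCII).
digits = re.findall(r"\d+", line, flags=re.ASCII)

# \s+ : split on any run of whitespace (a tab or two spaces here).
fields = re.split(r"\s+", line, flags=re.ASCII)

# \W : every character that is not a letter, digit, or underscore.
separators = re.findall(r"\W", "a-b c_d", flags=re.ASCII)

print(digits)      # ['42', '97']
print(fields)      # ['id=42', 'name=ada_l', 'score=97']
print(separators)  # ['-', ' ']
```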

\w and Unicode

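By default, \w is ASCII-only ([a-zA-Z0-9_]) in JavaScript, PCRE, and Java. It becomes Unicode-aware in Python 3's re (the default for str patterns), in PCRE with the UCP option, and in Java with UNICODE_CHARACTER_CLASS, where it also matches accented letters and letters from other scripts. JavaScript's /u flag alone does not make \w Unicode-aware; use \p{L} there instead. This matters for international text.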
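
A short sketch of the difference, assuming Python 3, where str patterns are Unicode-aware unless re.ASCII is given. The sample text is illustrative and written with escapes so the code points are unambiguous.

```python
import re

# U+00EF is i with diaeresis, U+00E9 is e with acute.
text = "na\u00efve caf\u00e9_42"

# Default (Unicode) mode: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# ASCII mode: \w is exactly [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words == ["na\u00efve", "caf\u00e9_42"])  # True
print(ascii_words)                                      # ['na', 've', 'caf', '_42']
```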

Custom Character Classes

# Match a hex digit
[0-9a-fA-F]

# Match everything except angle brackets
[^<>]

# Ranges: a-z, A-Z, 0-9 (ranges follow code-point order)
[A-Za-z_][A-Za-z0-9_]*    # valid identifier

# Literals inside classes: ] ^ - \ need special treatment
[]^\\-]    # matches ], ^, \, -
             # ] is literal when placed first (not in JavaScript)
             # ^ negates ONLY at position 0
             # - is a range ONLY between two chars, so it goes last here

# Union of classes inside [] (most Perl-style flavors; not POSIX bracket expressions)
[\d\s]     # digit OR whitespace
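
The patterns above can be tried directly. This is a minimal sketch in Python's re; the constant names and test strings are illustrative only.

```python
import re

HEX_DIGITS = re.compile(r"[0-9a-fA-F]+")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
NOT_ANGLE = re.compile(r"[^<>]+")
DIGIT_OR_SPACE = re.compile(r"[\d\s]+")

print(bool(HEX_DIGITS.fullmatch("DeadBeef")))  # True
print(bool(HEX_DIGITS.fullmatch("xyz")))       # False
print(bool(IDENTIFIER.fullmatch("_tmp1")))     # True
print(bool(IDENTIFIER.fullmatch("1abc")))      # False: cannot start with a digit
print(NOT_ANGLE.findall("<b>bold</b>"))        # ['b', 'bold', '/b']
print(DIGIT_OR_SPACE.findall("a1 2b"))         # ['1 2']
```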

Key rules inside `[…]`

^ at position 0 → negation. Anywhere else → literal ^
- between two chars → range. At start or end → literal -
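] must be escaped as \] (or placed first as []…], which works in POSIX and PCRE but not in JavaScript, where [] is an empty class)
\ starts an escape in Perl-style flavors; in POSIX bracket expressions it is a literal backslash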
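
Each rule can be checked with a one-liner. The sketch below uses Python's re, a Perl-style flavor; the variable names and inputs are illustrative.

```python
import re

# ^ first negates the class; elsewhere it is a literal caret.
negated = re.findall(r"[^a]", "a^b")
literal_caret = re.findall(r"[a^]", "a^b")

# - between two characters is a range; at the end it is a literal hyphen.
ranged = re.findall(r"[a-c]", "a-c b")
literal_hyphen = re.findall(r"[ac-]", "a-c b")

# \] is a literal closing bracket, \\ a literal backslash.
escaped = re.findall(r"[\]\\]", r"x]y\z")

print(negated)         # ['^', 'b']
print(literal_caret)   # ['a', '^']
print(ranged)          # ['a', 'c', 'b']
print(literal_hyphen)  # ['a', '-', 'c']
print(escaped)         # [']', '\\']
```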

POSIX Character Classes (ERE / PCRE)

POSIX	Meaning
`[:alpha:]`	Letters (locale-aware)
`[:alnum:]`	Letters + digits
`[:digit:]`	Digits `[0-9]`
`[:lower:]`	Lowercase
`[:upper:]`	Uppercase
`[:space:]`	Whitespace (all)
`[:punct:]`	Punctuation
`[:print:]`	Printable characters
`[:xdigit:]`	Hex digits
`[:blank:]`	Space and tab only

Usage: [[:alpha:]]. A POSIX class is valid only inside a bracket expression, hence the double brackets; it can be combined with other members, as in [[:alpha:]_].
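
Python's standard re does not understand POSIX classes, so the sketch below translates a few of them into plain ASCII ranges, which also shows what each name stands for in the C locale. The table POSIX_ASCII and the helper expand_posix are hypothetical names introduced here, and the mapping is partial and not locale-aware.

```python
import re

# ASCII (C-locale) range bodies for a few POSIX class names; illustrative, not complete.
POSIX_ASCII = {
    "alpha": "a-zA-Z",
    "alnum": "a-zA-Z0-9",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
    "blank": r" \t",
    "space": r" \t\n\r\f\v",
}

def expand_posix(pattern: str) -> str:
    """Replace each [:name:] with its ASCII range body; unknown names are left as-is."""
    return re.sub(
        r"\[:([a-z]+):\]",
        lambda m: POSIX_ASCII.get(m.group(1), m.group(0)),
        pattern,
    )

translated = expand_posix("[[:alpha:]_][[:alnum:]_]*")
print(translated)                                     # [a-zA-Z_][a-zA-Z0-9_]*
print(bool(re.fullmatch(translated, "snake_case1")))  # True
```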

Unicode Property Escapes

Supported in PCRE, Python's third-party regex module (not the standard re), Java, and JavaScript with the /u or /v flag.

\p{L}               # any letter
\p{Lu}              # uppercase letter
\p{Ll}              # lowercase letter
\p{N}               # any numeric character (digits, Roman numerals, fractions)
\p{Z}               # separator (space-like)
\p{P}               # punctuation
\p{S}               # symbol (currency, math…)
\P{L}               # NOT a letter (capital P = negation)

# Script (syntax varies: JS requires Script=…, PCRE accepts the bare name, e.g. \p{Latin})
\p{Script=Latin}
\p{Script=Cyrillic}
\p{Script=Han}

# Block (Java/Perl; PCRE does not support block properties)
\p{InGreek}
\p{InBasicLatin}
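
The one- and two-letter names in \p{…} are Unicode general categories. Python's standard re has no \p{…}, but the standard unicodedata module exposes the same categories, so the sketch below uses it to show what each property selects. The helper in_category is a hypothetical name, and this is a stand-in for illustration, not a regex feature.

```python
import unicodedata

def in_category(ch: str, prefix: str) -> bool:
    """True if the general category of ch starts with prefix (e.g. 'L', 'Lu', 'N')."""
    return unicodedata.category(ch).startswith(prefix)

# a, Greek capital omega, digit nine, space, euro sign, exclamation mark
sample = ["a", "\u03a9", "9", " ", "\u20ac", "!"]

categories = [unicodedata.category(ch) for ch in sample]
letters = [ch for ch in sample if in_category(ch, "L")]

print(categories)                       # ['Ll', 'Lu', 'Nd', 'Zs', 'Sc', 'Po']
print(len(letters))                     # 2
print(unicodedata.category("\u2167"))   # Nl (Roman numeral eight, covered by N)
```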

Set Operations (JS `/v` flag, ES2024)

// Intersection: letters that are also ASCII
/[\p{L}&&[\x00-\x7F]]/v

// Subtraction: letters except vowels
/[\p{L}--[aeiouAEIOU]]/v

// Nested character classes
/[[a-z][A-Z][0-9]]/v    // equivalent to [a-zA-Z0-9]
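
Engines without set operators can often approximate them with lookaheads. The sketch below does this in Python's re, which has no /v equivalent. It is an emulation under that assumption, not the same syntax, and it uses [^\W\d_] as a rough stand-in for \p{L}.

```python
import re

# Subtraction: ASCII letters except vowels, via a negative lookahead.
consonants = re.findall(r"(?![aeiouAEIOU])[A-Za-z]", "Regex")

# Intersection: letters that are also ASCII, via a positive lookahead.
# [^\W\d_] means "word character that is neither a digit nor an underscore".
ascii_letters = re.findall(r"(?=[\x00-\x7F])[^\W\d_]", "caf\u00e9 9_")

print(consonants)     # ['R', 'g', 'x']
print(ascii_letters)  # ['c', 'a', 'f']
```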