A practical regex cheatsheet for people who keep forgetting

Regex is one of those skills you learn, mostly forget, re-learn, mostly forget again, and then look up every time you need it. The main reason is that regex is dense — a lot of behavior is packed into a few characters, and the syntax is optimized for writing, not reading. The second reason is that the documentation is either overwhelming or condescending, with very little in between.

This page is the reference I wish I had pinned to the inside of my skull. It covers the tokens I actually use, roughly in order of how often I use them, with notes about the ones that have bitten me.

The mental model: patterns are programs

The core insight that makes regex click is that a pattern isn't a description — it's a program. It runs character-by-character through your input, and each token is an instruction: match this, then match that, and if that fails back up and try something else. Once you see regex as a tiny stack machine instead of a magic string, the weird behaviors (catastrophic backtracking, greedy vs lazy matching, why .* sometimes matches too much) all make sense.

Everything else on this page is just a vocabulary for telling that machine what to do.

Character classes

Character classes say "match any one of these characters." They're the workhorse of regex — most patterns are mostly character classes glued together with quantifiers.

The essentials:

. — any character except newline (usually)
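\d — a digit. Equivalent to [0-9] in JavaScript; some engines (Python 3, .NET) also accept other Unicode digits by default
\w — a "word" character: letter, digit, or underscore. Equivalent to [A-Za-z0-9_] in ASCII-only engines such as JavaScript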
\s — any whitespace (space, tab, newline, etc.)
\D, \W, \S — the inverses
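
To see the shorthand classes above in action, here is a small sketch using Python's built-in re module. The sample string and variable names are made up for illustration, and the re.ASCII flag is there because Python would otherwise let \d and \w match non-ASCII digits and letters too.

```python
import re

# Hypothetical sample text, chosen to contain digits, words and a tab.
sample = "Order #42 shipped on 2024-03-05\tvia air_mail"

# re.ASCII limits \d, \w and \s to their ASCII meanings, so \d behaves
# like [0-9] and \w like [A-Za-z0-9_].
digits = re.findall(r"\d+", sample, flags=re.ASCII)
words = re.findall(r"\w+", sample, flags=re.ASCII)
fields = re.split(r"\s+", sample)
non_digits = re.findall(r"\D+", "a1b22c")

# The dot skips newlines unless the DOTALL flag is set.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", flags=re.DOTALL)

print(digits)       # ['42', '2024', '03', '05']
print(words)
print(fields)
print(non_digits)   # ['a', 'b', 'c']
print(dot_default)  # ['a', 'b']
print(dot_all)      # ['a', '\n', 'b']
```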

Custom character classes go in square brackets. The things that trip people up:

Inside [ ], most special characters lose their special meaning. [.+*] matches a literal dot, plus, or asterisk.
- is a range indicator, so [a-z] is a–z. If you want a literal dash, put it first or last: [-abc] or [abc-].
^ at the start of a character class means "not". [^aeiou] is any character that is not a lowercase vowel, which includes digits, spaces, and punctuation. Outside of brackets, ^ is an anchor (see below).
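
The three bracket rules above are easy to check for yourself. This is a quick sketch in Python's re module; the input strings are arbitrary examples, not anything special.

```python
import re

# Inside brackets, ".", "+" and "*" are plain characters.
literals = re.findall(r"[.+*]", "1.5+2*3")

# A dash placed first or last in the class is a literal dash.
dash_first = re.findall(r"[-abc]", "a-b")
dash_last = re.findall(r"[abc-]", "a-b")

# A dash between two characters is a range.
lowercase_runs = re.findall(r"[a-z]+", "abc-XYZ-def")

# A leading ^ negates the class: everything except these characters.
not_vowels = re.findall(r"[^aeiou]", "regex 1")

print(literals)        # ['.', '+', '*']
print(dash_first)      # ['a', '-', 'b']
print(lowercase_runs)  # ['abc', 'def']
print(not_vowels)      # ['r', 'g', 'x', ' ', '1']
```

Note that the negated class happily matches the space and the digit: "not a vowel" is not the same thing as "a consonant."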

Unicode property escapes let you match things like \p{Letter} for any letter in any script. In JavaScript they require the u flag. Handy for international text, though not all engines support them.

Quantifiers

Quantifiers say "how many of the previous thing." They're where most of the power and most of the problems come from.

? — zero or one (optional)
* — zero or more
+ — one or more
{n} — exactly n
{n,} — n or more
{n,m} — between n and m (inclusive)
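
One way to get a feel for the quantifier table is to run each one against a few candidates and see which survive. The sketch below uses Python's re.fullmatch so that the whole candidate has to match; the helper name matches and the test words are my own choices.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# ?  : zero or one "u"
optional = [w for w in ("color", "colour", "colouur") if matches(r"colou?r", w)]
# *  : zero or more "b"
star = [w for w in ("ac", "abc", "abbbc") if matches(r"ab*c", w)]
# +  : one or more "b"
plus = [w for w in ("ac", "abc", "abbbc") if matches(r"ab+c", w)]
# {n}, {n,} and {n,m} applied to digits
exact = [w for w in ("123", "1234", "12345") if matches(r"\d{4}", w)]
at_least = [w for w in ("1", "12", "12345") if matches(r"\d{2,}", w)]
bounded = [w for w in ("1", "12", "1234", "12345") if matches(r"\d{2,4}", w)]

print(optional)  # ['color', 'colour']
print(star)      # ['ac', 'abc', 'abbbc']
print(plus)      # ['abc', 'abbbc']
print(exact)     # ['1234']
print(at_least)  # ['12', '12345']
print(bounded)   # ['12', '1234']
```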

All of these are greedy by default — they match as much as they can before giving up. Append a ? to make them lazy, matching as little as possible. Compare:

".*" on "abc" "def" matches the whole string, from the first quote to the last
".*?" on the same input matches just "abc"

This is the single distinction that causes the most "why doesn't my regex work?" moments. If you're matching "everything up to the next X" and you got "everything up to the last X," you need lazy quantifiers.
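
Here is the quoted-string comparison as runnable Python, as a sketch. The third pattern, which uses a negated character class instead of a lazy quantifier, is not from the list above; it is a common alternative I'm adding because it gets the same result here without relying on backtracking.

```python
import re

text = '"abc" "def"'

# Greedy: .* runs to the end of the string, then backs up to the last quote.
greedy = re.findall(r'".*"', text)

# Lazy: .*? stops at the first closing quote it can reach.
lazy = re.findall(r'".*?"', text)

# Alternative: "anything that is not a quote" can never run past one.
negated = re.findall(r'"[^"]*"', text)

print(greedy)   # ['"abc" "def"']
print(lazy)     # ['"abc"', '"def"']
print(negated)  # ['"abc"', '"def"']
```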

Anchors and boundaries

Anchors don't match characters — they match positions. They're how you say "at the start," "at the end," or "at a word boundary."

^ — start of string (or line, if the m flag is set)
$ — end of string (or line)
\b — word boundary: the transition between \w and non-\w
\B — not a word boundary

Word boundaries are deceptively useful. \bfoo\b matches the word "foo" but not inside "foobar" or "barfoo." This is how you do whole-word search without having to enumerate all the delimiters.
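
A short Python sketch of the anchors and the whole-word search described above. The strings are invented for the demo, and re.MULTILINE is Python's spelling of the m flag.

```python
import re

text = "foo foobar barfoo foo."

# \bfoo\b only matches where "foo" stands alone as a word.
whole_word_starts = [m.start() for m in re.finditer(r"\bfoo\b", text)]

# Without the boundaries, every occurrence of the letters matches.
any_foo_count = len(re.findall(r"foo", text))

lines = "first line\nsecond line"

# By default ^ and $ anchor to the whole string.
first_word = re.findall(r"^\w+", lines)
last_word = re.findall(r"\w+$", lines)

# With the multiline flag they anchor to each line.
first_words = re.findall(r"^\w+", lines, flags=re.MULTILINE)
last_words = re.findall(r"\w+$", lines, flags=re.MULTILINE)

print(whole_word_starts)  # [0, 18]
print(any_foo_count)      # 4
print(first_word, last_word)
print(first_words, last_words)
```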

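One gotcha: \b is defined in terms of \w, which in JavaScript and other ASCII-default engines doesn't include non-ASCII letters, so a boundary can appear in the middle of a word like "café." For international text in those engines, you may need explicit lookarounds instead.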
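
Python makes this gotcha easy to see because it lets you switch between the two behaviors. In this sketch the re.ASCII flag stands in for an ASCII-only engine, while Python's default for text patterns is Unicode-aware; the word "café" is just a convenient test case.

```python
import re

word = "caf\u00e9"  # "café" with a single precomposed é

# ASCII-only \w stops at the é, so the "word" is cut short.
ascii_words = re.findall(r"\w+", word, flags=re.ASCII)

# Python's default \w for str patterns treats é as a word character.
unicode_words = re.findall(r"\w+", word)

# With ASCII rules there is a word boundary between "f" and "é",
# so \bcaf\b matches inside "café".
ascii_boundary_hit = re.search(r"\bcaf\b", word, flags=re.ASCII) is not None

# With Unicode rules there is no boundary there, so it does not match.
unicode_boundary_hit = re.search(r"\bcaf\b", word) is not None

print(ascii_words)           # ['caf']
print(unicode_words)         # ['café']
print(ascii_boundary_hit)    # True
print(unicode_boundary_hit)  # False
```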

Groups and captures

Parentheses do two things at once: they group, and they capture. Grouping lets quantifiers apply to more than one character — (ab)+ matches "ab", "abab", "ababab." Capturing means the matched text is available afterwards, either for reference inside the regex or for extraction in code.

(foo) — capturing group, numbered 1, 2, 3... in the order their opening parens appear
(?:foo) — non-capturing group. Groups without saving. Useful when you just want the grouping behavior.
(?<name>foo) — named group. Much more readable than numbered references.
\1, \2... — backreferences inside the regex. (a|b)\1 matches "aa" or "bb" but not "ab."

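A pattern like \b(\w+)\s+\1\b finds a repeated word with whitespace in between, the classic example of where backreferences shine. The \b anchors matter: without them, (\w+)\s+\1 also matches the "the the" at the start of "the then."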
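
The group types above, sketched in Python. One caveat I'm confident about: Python's re module spells named groups (?P<name>...) rather than (?<name>...), so the syntax below differs slightly from the list. The sample strings are invented.

```python
import re

# Numbered and named access to the same captures.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-03")
year_by_number = m.group(1)
month_by_name = m.group("month")

# findall returns the capture when there is one, and the whole match
# when the group is non-capturing.
capturing = re.findall(r"(ab)+", "ab abab x")
non_capturing = re.findall(r"(?:ab)+", "ab abab x")

# Backreference: \1 must repeat exactly what group 1 matched.
same_twice = [s for s in ("aa", "bb", "ab") if re.fullmatch(r"(a|b)\1", s)]

# Repeated-word finder, with and without word boundaries.
doubled = re.search(r"\b(\w+)\s+\1\b", "it was the the best").group(0)
false_hit = re.search(r"(\w+)\s+\1", "the then").group(0)
no_false_hit = re.search(r"\b(\w+)\s+\1\b", "the then")

print(year_by_number, month_by_name)  # 2024 03
print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['ab', 'abab']
print(same_twice)     # ['aa', 'bb']
print(doubled)        # the the
print(false_hit)      # the the
print(no_false_hit)   # None
```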

Lookaheads and lookbehinds

These are zero-width assertions: they check whether something is (or isn't) there, without consuming it. You use them when you need context for your match but don't want the context to be part of the match.

(?=foo) — positive lookahead: assert that foo follows
(?!foo) — negative lookahead: assert that foo does not follow
(?<=foo) — positive lookbehind: assert that foo precedes
(?<!foo) — negative lookbehind: assert that foo does not precede

A common use: matching a number that's followed by a specific unit, but capturing only the number. \d+(?=\s*kg) matches 42 in "42 kg" but not the kg part. Without a lookahead, you would have to match the unit too and pull the number out of a capturing group, as in (\d+)\s*kg.
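
Here is the kg example as a Python sketch, next to the capturing-group version for comparison, plus one lookbehind pair. The price string and the dollar-sign patterns are my own illustration rather than anything from the list above.

```python
import re

weights = "42 kg, 7kg, 13 lb"

# Lookahead: the unit must follow, but is not part of the match.
with_lookahead = re.findall(r"\d+(?=\s*kg)", weights)

# Same numbers via a capturing group; the match itself includes "kg".
with_capture = re.findall(r"(\d+)\s*kg", weights)

prices = "cost $15, qty 3"

# Positive lookbehind: digits directly after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers NOT directly after a dollar sign.
# The \b stops the engine from matching the "5" out of "15".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(with_lookahead)    # ['42', '7']
print(with_capture)      # ['42', '7']
print(after_dollar)      # ['15']
print(not_after_dollar)  # ['3']
```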

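Lookbehinds can still be finicky. Some engines only support fixed-length lookbehinds, meaning (?<=\d+) fails to compile. Modern JavaScript (ES2018 and later) and .NET handle variable-width lookbehinds, but PCRE has traditionally required each lookbehind alternative to have a fixed length, and Python's built-in re module still does. If you're targeting several engines or older environments, it's worth checking.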
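
Python's built-in re module is one engine where you can watch the fixed-width restriction happen. This sketch uses a hypothetical helper, compiles, to report whether a pattern is accepted, and then shows the capturing-group workaround I usually fall back on.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Exactly three digits: fixed width, accepted.
fixed_ok = compiles(r"(?<=\d{3})x")

# One or more digits: variable width, rejected by Python's re.
variable_ok = compiles(r"(?<=\d+)x")

# Workaround: match the context normally and capture only the part you want.
workaround = re.findall(r"\d+(x)", "12x 345x y")

print(fixed_ok)     # True
print(variable_ok)  # False
print(workaround)   # ['x', 'x']
```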

A small library of patterns worth keeping

These are the ones I keep reaching for. Copy, paste, adjust, don't try to memorize.

Email (pragmatic version):

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

This isn't RFC 5322 compliant. The fully compliant regexes in circulation are enormous and nobody actually uses them. This version accepts the vast majority of real-world addresses and rejects obvious garbage. For actual validation, send a verification email.
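
A sketch of how I'd wire the email pattern into Python. The function name looks_like_email is made up. The one real design choice is fullmatch: the pattern has no anchors, so a plain search would happily find an address inside a longer string of junk.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def looks_like_email(text):
    """Return True when the entire string has the shape of an email address."""
    return EMAIL.fullmatch(text) is not None

candidates = [
    "ada@example.com",
    "first.last+tag@mail.example.co.uk",
    "no-at-sign.example.com",
    "user@localhost",
    "ada@example.com and some junk",
]
accepted = [c for c in candidates if looks_like_email(c)]

# search() is the right call for pulling addresses out of free text.
found = EMAIL.search("write to ada@example.com today").group(0)

print(accepted)  # ['ada@example.com', 'first.last+tag@mail.example.co.uk']
print(found)     # ada@example.com
```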

URL:

https?:\/\/[\w.-]+(?:\.[\w.-]+)+[\w\-._~:/?#[\]@!$&'()*+,;=%]*

Matches http and https URLs. Permissive on the path and query string because URLs can legally contain a lot of characters.
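
Running the URL pattern over some text in Python shows both what it catches and one side effect of being permissive. This is a sketch: the sample sentences are invented, and the rstrip cleanup at the end is my own habit, not part of the pattern.

```python
import re

URL = re.compile(r"https?:\/\/[\w.-]+(?:\.[\w.-]+)+[\w\-._~:/?#[\]@!$&'()*+,;=%]*")

text = "Docs: https://example.com/a/b?x=1#top and http://sub.example.org"
urls = URL.findall(text)

# The host must contain at least one dot, so bare hostnames are skipped.
bare_host = URL.search("http://localhost/admin")

# Because "." is a legal URL character, sentence punctuation gets swallowed.
with_trailing_dot = URL.findall("Visit https://example.com.")
cleaned = [u.rstrip(".,;:!?") for u in with_trailing_dot]

print(urls)               # ['https://example.com/a/b?x=1#top', 'http://sub.example.org']
print(bare_host)          # None
print(with_trailing_dot)  # ['https://example.com.']
print(cleaned)            # ['https://example.com']
```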

IPv4 address:

\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b

This actually validates the 0–255 range on each octet, unlike the lazier \d+\.\d+\.\d+\.\d+.
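
To check the range claim, here is the IPv4 pattern in a Python sketch. I pass re.ASCII so that \d means exactly 0 through 9, and the helper name is_ipv4 is hypothetical. One thing the test cases make visible: the pattern accepts leading zeros, which you may or may not want.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
IPV4 = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b", re.ASCII)

def is_ipv4(text):
    """Return True when the entire string is a dotted quad with octets 0-255."""
    return IPV4.fullmatch(text) is not None

candidates = ["192.168.0.1", "255.255.255.255", "256.1.1.1", "1.2.3", "010.001.000.001"]
valid = [c for c in candidates if is_ipv4(c)]

# In free text, the word boundaries keep out-of-range octets from half-matching.
in_text = IPV4.search("ping 10.0.0.300 failed")

print(valid)    # ['192.168.0.1', '255.255.255.255', '010.001.000.001']
print(in_text)  # None
```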

ISO date:

\d{4}-\d{2}-\d{2}

Matches YYYY-MM-DD. If you care about valid months and days, regex is the wrong tool — parse it and check with a real date library.
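
The "regex for shape, library for meaning" split looks like this in Python, as a sketch. The function name parse_iso_date is hypothetical; date.fromisoformat is the standard-library call doing the real validation.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)

def parse_iso_date(text):
    """Return a date for a valid YYYY-MM-DD string, or None otherwise."""
    # Step 1: the regex only checks the shape.
    if ISO_DATE.fullmatch(text) is None:
        return None
    # Step 2: the date library checks months, days and leap years.
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None

print(parse_iso_date("2024-02-29"))  # 2024-02-29
print(parse_iso_date("2023-02-29"))  # None (not a leap year)
print(parse_iso_date("2024-13-01"))  # None (no month 13)
print(parse_iso_date("24-1-1"))      # None (wrong shape)
```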

Hex color:

#(?:[0-9a-fA-F]{3}){1,2}\b

Matches #abc and #aabbcc. To also accept #aabbccdd (with alpha), add an eight-digit alternative: #(?:(?:[0-9a-fA-F]{3}){1,2}|[0-9a-fA-F]{8})\b
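
A quick Python check of the hex color pattern against a line of CSS. The second pattern is one assumed way of adding an eight-digit alternative for alpha, and the CSS snippet is invented.

```python
import re

HEX = re.compile(r"#(?:[0-9a-fA-F]{3}){1,2}\b")

# Assumed extension: also accept eight hex digits (color plus alpha).
HEX_ALPHA = re.compile(r"#(?:(?:[0-9a-fA-F]{3}){1,2}|[0-9a-fA-F]{8})\b")

css = "a { color: #abc; background: #aabbcc; border-color: #aabbccdd; outline: #ab; }"

# The trailing \b rejects #aabbccdd here instead of matching its first six digits.
basic = HEX.findall(css)
with_alpha = HEX_ALPHA.findall(css)

print(basic)       # ['#abc', '#aabbcc']
print(with_alpha)  # ['#abc', '#aabbcc', '#aabbccdd']
```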

Where regex goes wrong

Three things cause most regex failures:

Catastrophic backtracking. Nested quantifiers over the same input, in patterns like (a+)+ or (a|a)*, can make a backtracking engine try every possible way of splitting the input when the overall match fails, which is exponential in the input's length. A 30-character input can hang the browser for seconds. Rule of thumb: be suspicious of any quantified group whose contents can match the same text in more than one way, which usually means a quantifier inside a group that also has a quantifier. If you need the nesting, use an atomic group (?>...) where supported (JavaScript does not support it), which tells the engine not to backtrack into the group.

Greedy where you wanted lazy. If your match is "too long," you probably want *? or +? instead of * or +.

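Unicode surprises. In JavaScript and other ASCII-default engines, \w does not match non-ASCII letters. Other engines, including Python 3 and .NET, treat \w as Unicode-aware by default. If you're working with international text, check which behavior you have, then either use explicit ranges or Unicode property escapes (\p{L}) where supported.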
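
If you want to feel the catastrophic backtracking from the first item for yourself, this Python sketch times the nested pattern against its flat equivalent on an input that is designed to fail. The helper timed_match and the input length of 20 are my choices; I deliberately don't quote timings because they depend on your machine. Try raising the length by a few characters and watch the first number grow.

```python
import re
import time

def timed_match(pattern, text):
    """Return (matched, seconds) for a full match of pattern against text."""
    start = time.perf_counter()
    matched = re.fullmatch(pattern, text) is not None
    return matched, time.perf_counter() - start

# A run of a's followed by one character that forces the match to fail.
text = "a" * 20 + "!"

# Nested quantifiers: the engine tries many ways of splitting the a's.
nested_found, nested_secs = timed_match(r"(a+)+", text)

# Flat equivalent: accepts the same strings, with only one way to match.
flat_found, flat_secs = timed_match(r"a+", text)

print(f"nested: matched={nested_found} in {nested_secs:.4f}s")
print(f"flat:   matched={flat_found} in {flat_secs:.6f}s")
```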

And one meta-rule: if your regex is more than two lines of code, it's probably the wrong abstraction. Parsing HTML with regex is famously a bad idea (use a DOM parser). Parsing JSON with regex is also bad (use JSON.parse). Parsing anything with a formal grammar — URLs, email addresses, SQL — eventually leads to pain. Regex is a great tool for small matching jobs; the moment your pattern starts to look like a program, reach for a different tool.

If you want to see patterns running live, explain themselves, and match against your own input, try our Regex Rambler. It shows the plain-English breakdown of any regex you paste, which is genuinely the fastest way to understand someone else's pattern (including your own, from six months ago).

Made with love by a very serious person pretending not to be. Tooly McToolface is a workshop of free, client-side web tools.