Magic numbers: how a file knows what it is

Your operating system shows you a PDF icon, double-clicking opens a PDF reader, and you assume the file is a PDF. Most of the time it is. But that whole chain rests on the few letters after the dot, and those letters are the one part of a file anyone can change without touching its contents. The file's true identity lives somewhere more honest: its first few bytes.

The extension is just a sticker.

A filename extension is metadata stored outside the file's data — it's part of the name, like a label on a jar. Nothing enforces that the label matches the contents. You can rename report.pdf to report.zip and the bytes inside are byte-for-byte identical; only the name changed. Operating systems lean on the extension because it's cheap: it's right there in the name, no need to open the file. That convenience is also the weakness — a wrong, missing, or malicious extension sends the OS to the wrong app, and it has no idea anything is off.
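To see the sticker idea in action, here is a minimal sketch that writes a stand-in file, renames it, and hashes the contents before and after. The helper names (sha256_of, rename_and_compare) and the file contents are made up for illustration; the stand-in bytes are not a complete PDF.

```python
import hashlib
import os
import tempfile
from typing import Tuple


def sha256_of(path: str) -> str:
    """Hash the file's contents; the filename is not part of the input."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def rename_and_compare(directory: str) -> Tuple[str, str]:
    """Create report.pdf, rename it to report.zip, return both content hashes."""
    pdf_path = os.path.join(directory, "report.pdf")
    zip_path = os.path.join(directory, "report.zip")
    with open(pdf_path, "wb") as f:
        # Stand-in bytes for the demo, not a complete PDF.
        f.write(b"%PDF-1.7\n% placeholder contents\n")
    before = sha256_of(pdf_path)
    os.rename(pdf_path, zip_path)
    after = sha256_of(zip_path)
    return before, after


with tempfile.TemporaryDirectory() as tmp:
    before, after = rename_and_compare(tmp)
    print(before == after)  # the hashes match: only the name changed
```

The hash depends only on the bytes, so it cannot tell report.pdf from report.zip.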

Magic numbers: signatures at the start.

Most binary formats begin with a fixed, distinctive byte sequence called a magic number or file signature. It's a deliberate fingerprint the format's designers put at offset zero so that any program can confirm "yes, this really is one of mine" before trying to parse it. A few you'll recognize once you've seen the bytes:

Format	First bytes (hex)	As text
PNG	89 50 4E 47 0D 0A 1A 0A	‹89›PNG…
JPEG	FF D8 FF	—
PDF	25 50 44 46	%PDF
ZIP	50 4B 03 04	PK‹03›‹04›
GIF	47 49 46 38	GIF8
ELF (Linux binary)	7F 45 4C 46	‹7F›ELF

Some are clearly human-readable on purpose: %PDF, GIF8, the PK in ZIP (the initials of Phil Katz, who created the format). Others mix in deliberately "un-typeable" bytes. PNG's signature is a small masterpiece of defensive design: the leading 0x89 is a non-ASCII byte (so a transfer that strips the high bit turns it into 0x09 and the signature no longer matches), the 0D 0A pair and the lone trailing 0A catch newline translation between Windows and Unix in either direction, and 0x1A ("Ctrl-Z", end-of-file on DOS) stops the file dumping garbage to the screen if you display it with the DOS type command. The signature doesn't just identify the file; it detects the most common ways a transfer can corrupt it.
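The corruption cases above can be sketched as a small classifier. This is an illustration only: diagnose_png_header is a hypothetical helper, and it assumes the simplest form of each kind of damage (every high bit cleared, every newline translated).

```python
PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")


def diagnose_png_header(header: bytes) -> str:
    """Classify the first bytes of a supposed PNG (hypothetical helper)."""
    if header.startswith(PNG_SIGNATURE):
        return "ok"
    # A 7-bit channel clears the high bit: 0x89 becomes 0x09.
    if header.startswith(b"\x09PNG"):
        return "high bit stripped"
    # CRLF -> LF translation drops the 0x0D from the 0D 0A pair.
    if header.startswith(b"\x89PNG\n\x1a\n"):
        return "CRLF converted to LF"
    # Naive LF -> CRLF translation inserts 0x0D before every 0x0A.
    if header.startswith(b"\x89PNG\r\r\n\x1a\r\n"):
        return "LF converted to CRLF"
    return "not a PNG signature"


# Simulate two kinds of damage and check that each is recognized.
stripped = bytes(b & 0x7F for b in PNG_SIGNATURE)
unix_newlines = PNG_SIGNATURE.replace(b"\r\n", b"\n")
print(diagnose_png_header(stripped))
print(diagnose_png_header(unix_newlines))
```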

Reading the type is then simple: open the file, read the first handful of bytes, and compare against a table of known signatures. This is exactly what the Unix file command (and the libmagic library behind it) does, and what browsers do when they "sniff" a download.
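As a rough sketch of that read-and-compare loop, using only the six signatures from the table above. The names SIGNATURES, sniff, and sniff_file are made up here; real detectors such as libmagic use far larger tables and more rules.

```python
from typing import Optional

# (magic bytes at offset 0, format name), taken from the table above.
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "PNG"),
    (bytes.fromhex("FFD8FF"), "JPEG"),
    (b"%PDF", "PDF"),
    (b"PK\x03\x04", "ZIP"),
    (b"GIF8", "GIF"),
    (b"\x7fELF", "ELF"),
]


def sniff(header: bytes) -> Optional[str]:
    """Return the first format whose signature prefixes the header, else None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and sniff them."""
    with open(path, "rb") as f:
        return sniff(f.read(16))


print(sniff(b"%PDF-1.7\n"))
print(sniff(b"just some text"))
```

Sixteen bytes is enough for this table because the longest signature in it is eight bytes; formats that keep their marker at an offset need a longer read.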

Containers, offsets, and shared formats.

It's not always the first four bytes. Several common formats are containers that wrap different content, so the distinguishing bytes sit at an offset:

RIFF files start with RIFF, then four size bytes, then a four-byte type at offset 8: WEBP for a WebP image, WAVE for audio, AVI plus a trailing space for video. Same wrapper, three different files; you must read offset 8 to tell them apart.
ISO Base Media files (MP4, MOV, HEIC, AVIF) carry the marker ftyp at offset 4, followed by a four-byte "brand" code that says which flavor it is: heic, avif, isom, or qt padded with two spaces. A HEIC photo and an MP4 video share the same box-based container layout.
TAR archives put their ustar marker all the way out at offset 257.
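The three offset cases above might be checked like this. It is a sketch, not a full detector: sniff_container is a hypothetical name, the returned labels are arbitrary, and the caller is assumed to pass at least the first 262 bytes so the TAR marker is in range.

```python
from typing import Optional

RIFF_TYPES = {b"WEBP": "WebP", b"WAVE": "WAV", b"AVI ": "AVI"}


def sniff_container(data: bytes) -> Optional[str]:
    """Look for signatures that do not sit at offset zero (illustrative only)."""
    # RIFF: "RIFF", four size bytes, then a four-byte type at offset 8.
    if data[0:4] == b"RIFF" and len(data) >= 12:
        return RIFF_TYPES.get(data[8:12], "RIFF, unknown type")
    # ISO Base Media: four size bytes, "ftyp" at offset 4, brand at offset 8.
    if data[4:8] == b"ftyp" and len(data) >= 12:
        brand = data[8:12].decode("ascii", errors="replace")
        return f"ISO BMFF, brand {brand!r}"
    # TAR: the "ustar" marker sits at offset 257.
    if data[257:262] == b"ustar":
        return "TAR"
    return None


# A hand-built RIFF header: size field of zero, type WEBP.
print(sniff_container(b"RIFF" + bytes(4) + b"WEBP"))
```

Slicing past the end of a bytes object yields an empty result instead of raising, so short inputs simply fail to match.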

And one signature can mean several file types. .docx, .xlsx, .pptx, .jar, and .epub are all ZIP archives underneath — Office documents are just zipped folders of XML. So byte-sniffing honestly reports "ZIP," and only a deeper look inside (or the extension) distinguishes a spreadsheet from a Java library. Detection tells you the container; it can't always tell you the intent.
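A "deeper look inside" could be sketched with Python's standard zipfile module. The marker entries used here are the conventional ones (a mimetype entry for EPUB, [Content_Types].xml plus a word/, xl/, or ppt/ folder for Office files, META-INF/MANIFEST.MF for a typical JAR), but classify_zip itself is a hypothetical helper and the checks are heuristics, not validation.

```python
import io
import zipfile


def classify_zip(data: bytes) -> str:
    """Guess what a ZIP archive is being used for from its entry names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        # EPUB: a "mimetype" entry holding the EPUB media type.
        if "mimetype" in names and zf.read("mimetype").strip() == b"application/epub+zip":
            return "EPUB"
        # Office Open XML: a content-types part plus a format-specific folder.
        if "[Content_Types].xml" in names:
            for prefix, kind in (("word/", "DOCX"), ("xl/", "XLSX"), ("ppt/", "PPTX")):
                if any(name.startswith(prefix) for name in names):
                    return kind
            return "Office Open XML, unknown kind"
        # JAR: usually carries a manifest (assumed here; it is not guaranteed).
        if "META-INF/MANIFEST.MF" in names:
            return "JAR"
    return "ZIP"
```

Anything that matches none of the markers is reported as plain ZIP, which is the same honest answer byte-sniffing gives.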

Why detection beats trusting the name.

Reading the real type fixes real problems:

Mislabeled files. A download saved with the wrong extension won't open; detecting the true type lets you rename it correctly so the OS picks the right app.
No extension at all. Files pulled from APIs, extracted from archives, or downloaded without a filename in the Content-Disposition header sometimes arrive nameless. The bytes still know what they are.
Security. An upload form that trusts .jpg in the filename can be fooled into storing an executable or a script. Server-side, you check the magic bytes, not the name — though even that isn't a complete defense (see below).
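For the upload case, a first-pass check might compare the claimed extension against the opening bytes. This is a sketch under assumptions: the allow-list, the MAGIC_BY_EXTENSION table, and extension_matches_content are all made up for illustration, and passing this check says nothing about whether the rest of the file is safe.

```python
import os

# Hypothetical allow-list: extension -> acceptable signatures at offset zero.
MAGIC_BY_EXTENSION = {
    ".png": (bytes.fromhex("89504E470D0A1A0A"),),
    ".jpg": (bytes.fromhex("FFD8FF"),),
    ".jpeg": (bytes.fromhex("FFD8FF"),),
    ".gif": (b"GIF8",),
    ".pdf": (b"%PDF",),
}


def extension_matches_content(filename: str, header: bytes) -> bool:
    """True only if the extension is allowed and the header carries its magic."""
    ext = os.path.splitext(filename)[1].lower()
    magics = MAGIC_BY_EXTENSION.get(ext)
    if magics is None:
        return False  # unknown or missing extension: reject by default
    return header.startswith(magics)


# An ELF executable uploaded as "holiday.jpg" is caught by the header check.
print(extension_matches_content("holiday.jpg", b"\x7fELF\x02\x01\x01"))
```

Rejecting unknown extensions by default is a design choice for this sketch; a tool that repairs mislabeled files would sniff the type first and suggest the extension instead.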

This is why "fix the extension" and "transcode the file" are different operations. Renaming corrects the label so software stops being confused; it does not change the data. You can't turn a real PDF into a real PNG by renaming — that requires re-encoding the bytes through an actual codec. Honest tools keep those two actions clearly separate.

Where sniffing stops working.

Magic numbers are a strong signal, not a proof. Their limits are worth knowing:

Plain text has no magic. A .txt, .csv, .json, or source-code file is just characters — there's no signature to match, so detectors fall back to heuristics ("does this look like UTF-8 text? does it start with { or [?"). That's a guess, not a fingerprint.
Collisions exist. The bytes CA FE BA BE open both a Java .class file and a Mach-O "fat" (multi-architecture) binary. A short signature can be ambiguous, and a hostile file can deliberately start with valid magic bytes for one type while being something else (a "polyglot").
Matching the header isn't validating the file. The first eight bytes saying "PNG" doesn't mean the rest is a well-formed, safe image. For security, signature-checking is a first filter, not the whole job — you still need a real parser, size limits, and sandboxing.
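Two of these limits can be sketched in code, both as guesses and not fingerprints. The text check asks whether a sample decodes as UTF-8 and peeks at the first character. The CA FE BA BE check relies on a heuristic assumed here: in a Java class file the four bytes after the magic hold the minor and major version, and the major version is at least 45, while in a Mach-O fat binary they hold a count of architectures, which is normally small. The function names and the threshold of 45 are illustrative choices.

```python
import codecs
import struct


def guess_text_kind(sample: bytes) -> str:
    """Heuristic only: a guess about text, not a signature match."""
    if b"\x00" in sample:
        return "binary"
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "binary"
    if sample.lstrip()[:1] in (b"{", b"["):
        return "maybe JSON"
    return "text"


def classify_cafebabe(header: bytes) -> str:
    """Disambiguate CA FE BA BE using the next four bytes (heuristic)."""
    if len(header) < 8 or header[:4] != b"\xca\xfe\xba\xbe":
        return "not CAFEBABE"
    (value,) = struct.unpack(">I", header[4:8])
    # Java: minor (2 bytes) then major (2 bytes), so the value is at least 45.
    # Mach-O fat: number of architectures, assumed to be below 45.
    if value < 45:
        return "probably a Mach-O fat binary"
    return "probably a Java class"


print(guess_text_kind(b'{"name": "report"}'))
print(classify_cafebabe(bytes.fromhex("CAFEBABE00000034")))
```

Neither function proves anything: a binary file can happen to be valid UTF-8, and a polyglot can be built to satisfy either branch.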

The honest summary: a magic number tells you what a file claims to be, with high confidence for well-designed binary formats and much less for text or adversarial inputs. Treat it as the best available hint, not a guarantee.

Takeaways.

The thing to remember: the extension is an editable label; the real type lives in the opening bytes. Detection reads those bytes (sometimes at an offset, sometimes shared across formats), which is great for fixing mislabeled or nameless files — but text has no signature, signatures can collide, and matching a header is not the same as validating or sandboxing the file.

Once you know files announce themselves in their first bytes, a lot of everyday weirdness makes sense: why a renamed file still opens fine, why an upload validator that trusts the extension is a security hole, and why "this won't open" is so often just a wrong sticker on a perfectly good jar.

See what a file really is.

The File Inspector reads the magic bytes of anything you drop in — 50+ formats — and tells you the true type, the MIME, and a hex view of the header. It flags a lying extension, re-downloads with the right one, and converts common images. Nothing is uploaded; it all runs in your browser.

Open the File Inspector

Made with love by a very serious person pretending not to be. Tooly McToolface is a workshop of free, client-side web tools. If reading bytes is your thing, a JWT decoded field by field takes the same lens to auth tokens, and the File Inspector and Base64 tool are the matching tools.
