How-To & Life · Guide · Developer Utilities
How to encode text to binary
ASCII vs UTF-8, byte layout for multi-byte characters, binary vs hex representation, why binary-encoded text balloons in size, and decoding back.
Converting text to binary is the most direct way to show what a computer actually sees when you type a character. Every letter becomes a sequence of bits, either 7 or 8 per character for ASCII, or up to 32 bits for a UTF-8 Unicode character. The conversion sounds mechanical, but three decisions shape the output: ASCII vs UTF-8, 7-bit vs 8-bit framing, and big-endian vs little-endian byte order. Getting any of the three wrong turns “Hello” into unreadable garbage on the other side. This guide covers what a byte actually encodes, how ASCII and UTF-8 differ in the encoding of the same character, the role of endianness, the legitimate modern uses (retro hardware, teaching, debugging binary protocols), and the common conversion mistakes to avoid.
What a byte holds
A byte is 8 bits. Each bit is 0 or 1. 8 bits give 256 possible values (00000000 through 11111111), typically written as 0 to 255 decimal or 00 to FF hex. Text encodings map each character to one or more bytes; binary conversion shows those bytes as explicit sequences of 0s and 1s.
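A quick illustration in JavaScript (the language used later in this guide): Number.prototype.toString(2) renders a byte's bits directly.

```javascript
// A byte holds one of 256 values, 0 through 255.
console.log((255).toString(2));                 // '11111111' — the largest byte
console.log((0).toString(2).padStart(8, '0'));  // '00000000' — the smallest
console.log((65).toString(2).padStart(8, '0')); // '01000001' — the byte for 'A'
```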
ASCII — 7 bits per character
ASCII (1963, standardized in ANSI X3.4-1968) defines 128 characters in 7 bits. Letters, digits, punctuation, and control codes.
A = 65 = 01000001
a = 97 = 01100001
0 = 48 = 00110000
space = 32 = 00100000
newline = 10 = 00001010
The 8th bit is zero in pure ASCII. Many systems historically used the extra bit for extended character sets (“high ASCII” for accented letters, box-drawing), with every vendor picking a different mapping — the mess that Unicode was invented to end.
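The "8th bit is zero" property makes pure ASCII easy to test for. A minimal sketch (isAscii is a hypothetical helper, not a built-in):

```javascript
// Pure ASCII means every code point is below 128 (0x80),
// so the leading bit of each byte is 0.
function isAscii(str) {
  return [...str].every(ch => ch.codePointAt(0) < 0x80);
}

console.log(isAscii('Hello')); // true
console.log(isAscii('café'));  // false — é is outside the 7-bit range
```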
UTF-8 — Unicode in 1 to 4 bytes
UTF-8 (RFC 3629) is the dominant encoding on the modern web. It encodes every Unicode code point using 1 to 4 bytes:
1 byte (code points 0–127): identical to ASCII. 0xxxxxxx. Leading bit 0.
2 bytes (128–2047): 110xxxxx 10xxxxxx. Covers Latin extended, Greek, Cyrillic, Hebrew, Arabic.
3 bytes (2048–65535): 1110xxxx 10xxxxxx 10xxxxxx. Covers most CJK ideographs, Devanagari, and the BMP.
4 bytes (65536–1114111): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Covers emoji, rarer CJK, historic scripts.
Example: “é” (U+00E9) in UTF-8 is 11000011 10101001 (0xC3 0xA9 — two bytes). The emoji “😀” (U+1F600) is four UTF-8 bytes.
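The continuation-byte layout can be verified by hand. This sketch reassembles U+00E9 from its two UTF-8 bytes by stripping the 110 and 10 prefixes and joining the payload bits:

```javascript
// é in UTF-8 is 11000011 10101001 (0xC3 0xA9).
// Byte 1 keeps its low 5 bits, byte 2 its low 6 bits.
const [b1, b2] = [0xC3, 0xA9];
const codePoint = ((b1 & 0b00011111) << 6) | (b2 & 0b00111111);

console.log(codePoint.toString(16));          // 'e9'
console.log(String.fromCodePoint(codePoint)); // 'é'
```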
UTF-16 and UTF-32
UTF-16: 2 bytes for Basic Multilingual Plane, 4 bytes for supplementary planes via surrogate pairs. Default in-memory encoding for JavaScript strings, Java strings, and Windows APIs. UTF-16 files often begin with a byte-order mark (BOM) to signal endianness.
UTF-32: always 4 bytes per code point. Rare outside internal representations where random access to code points matters.
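Because JavaScript strings are UTF-16 under the hood, a supplementary-plane character like 😀 exposes its surrogate pair directly:

```javascript
const grin = '😀'; // U+1F600, outside the Basic Multilingual Plane

console.log(grin.length);                      // 2 — two UTF-16 code units
console.log(grin.charCodeAt(0).toString(16));  // 'd83d' — high surrogate
console.log(grin.charCodeAt(1).toString(16));  // 'de00' — low surrogate
console.log(grin.codePointAt(0).toString(16)); // '1f600' — the full code point
```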
Endianness
For encodings wider than 1 byte (UTF-16, UTF-32, binary numbers), the order of bytes matters.
Big-endian (network byte order, Motorola 68k): most significant byte first. The 16-bit value 0x1234 is stored as 12 34.
Little-endian (Intel x86, ARM default, modern CPUs): least significant byte first. 0x1234 stored as 34 12.
UTF-16 files start with a byte-order mark (FE FF for BE, FF FE for LE) so readers know. UTF-8 has no endianness (single byte units) but some systems still prepend a BOM (EF BB BF) as an encoding marker; it breaks many parsers and is best avoided.
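A sketch with DataView, whose setUint16 takes an explicit little-endian flag, makes the two byte orders concrete:

```javascript
// Write the 16-bit value 0x1234 in both byte orders and inspect the raw bytes.
const buf = new ArrayBuffer(2);
const view = new DataView(buf);

view.setUint16(0, 0x1234, false); // false = big-endian (network order)
console.log([...new Uint8Array(buf)].map(b => b.toString(16))); // ['12', '34']

view.setUint16(0, 0x1234, true);  // true = little-endian
console.log([...new Uint8Array(buf)].map(b => b.toString(16))); // ['34', '12']
```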
Converting step by step
To encode “Hi” to ASCII binary:
H -> 72 -> 01001000
i -> 105 -> 01101001
Full: 01001000 01101001
In JavaScript:
function textToBinary(str) {
  return [...str].map(ch =>
    ch.codePointAt(0).toString(2).padStart(8, '0')
  ).join(' ');
}
textToBinary('Hi'); // '01001000 01101001'
Note: codePointAt(0) returns a single Unicode code point, but that is not the same as a UTF-8 byte. For proper UTF-8 byte-level output:
function utf8Binary(str) {
  const bytes = new TextEncoder().encode(str);
  return [...bytes].map(b =>
    b.toString(2).padStart(8, '0')
  ).join(' ');
}
utf8Binary('café');
// '01100011 01100001 01100110 11000011 10101001'
7-bit framing
Pure ASCII output is sometimes written with 7 bits per character, dropping the leading zero:
H = 1001000 (7 bits)
i = 1101001 (7 bits)
Used historically in 7-bit serial protocols and email transports before 8-bit MIME became universal. In 2026 this is primarily a puzzle-design choice or a bandwidth trick on constrained radio links.
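The earlier textToBinary function needs only a narrower pad width to produce 7-bit framing. A sketch (toBinary7 is a hypothetical name):

```javascript
// 7-bit framing: same code points, one less bit per character.
// Only safe for pure ASCII input (code points below 128).
function toBinary7(str) {
  return [...str].map(ch =>
    ch.codePointAt(0).toString(2).padStart(7, '0')
  ).join(' ');
}

console.log(toBinary7('Hi')); // '1001000 1101001'
```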
Base-N conversions
Binary is base-2. Related encodings use more compact bases:
Hexadecimal (base-16): 4 bits per digit. H becomes 48.
Octal (base-8): 3 bits per digit. H becomes 110.
Base64: groups of 6 bits expressed as one of 64 printable ASCII characters. Common for embedding binary in text where binary is unsafe (email, JSON, URLs).
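Number.prototype.toString accepts the radix directly, so the same byte can be shown in each base. This sketch assumes btoa is available (browsers and Node 16+):

```javascript
// The byte 72 ('H') in each base mentioned above.
console.log((72).toString(2));  // '1001000'
console.log((72).toString(16)); // '48'
console.log((72).toString(8));  // '110'
console.log(btoa('H'));         // 'SA==' — Base64 of the single byte 0x48
```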
Real uses today
Teaching how computers store text. Clearer than any metaphor.
Debugging binary protocols at the byte level. When a custom protocol says “3 bytes header, magic value 0x4A6F,” reading the binary or hex dump is how you confirm.
Retro computing: punch cards, paper tape, 6502 assembly, Z80. Everything is bits down there.
Encoding layers in CTF or puzzle hunts: ciphertext wrapped in Base64 wrapped in hex wrapped in binary. Recognize each layer and unwrap.
Signage and novelty: T-shirts, tattoos, and the occasional physical plaque. Convert and proofread — a flipped bit is permanent.
Character encoding detection
Given a binary blob, is it ASCII, Latin-1, UTF-8, or something else? Heuristics:
All bytes under 0x80 → safe to interpret as ASCII or UTF-8.
Bytes 0x80–0xBF never appear first in UTF-8 — if one starts a “character,” it is not valid UTF-8.
Presence of the UTF-8 BOM (0xEF 0xBB 0xBF) at the start is a strong signal of UTF-8.
Failure to decode as UTF-8 usually means Latin-1, Windows-1252, or a legacy Asian encoding (Shift JIS, GB18030). Modern libraries (chardet, uchardet) get it right most of the time but are not magic.
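TextDecoder's fatal option turns the "does it decode as UTF-8?" heuristic into a few lines. A sketch (looksLikeUtf8 is a hypothetical helper, not a library function):

```javascript
// Strict UTF-8 decoding throws on any invalid byte sequence,
// which suggests the blob is Latin-1, Windows-1252, or similar.
function looksLikeUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

console.log(looksLikeUtf8(new Uint8Array([0xC3, 0xA9]))); // true  — valid 2-byte é
console.log(looksLikeUtf8(new Uint8Array([0xE9])));       // false — lone Latin-1 é
```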
Common mistakes
Forgetting that one character is not always one byte. In UTF-8, “café” is 4 characters but 5 bytes in composed form (6 bytes if the é is decomposed into e plus a combining accent). Always distinguish character length from byte length.
Using charCodeAt for non-BMP characters. It returns a UTF-16 code unit, not a full code point. For emoji and rarer CJK, use codePointAt.
Omitting leading zeros. A is binary 01000001 (8 bits); writing 1000001 looks tidy but confuses fixed-width parsers. Always pad.
Mixing endianness halfway through a file. Output UTF-16-BE for the headers and UTF-16-LE for the body, and the reader gets garbage. Pick one and stick with it; a BOM announces the choice.
Adding a UTF-8 BOM by accident. Windows Notepad famously prepends one. Many parsers (older PHP, some JSON libraries) choke. If you must have a marker, use Content-Type: ...; charset=utf-8 at the protocol level.
Treating binary output as a string to concatenate. Concatenating the string “01000001” onto another binary string is fine; adding the number 0x41 performs arithmetic instead. Keep the type consistent.
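The character-versus-byte confusion from the first mistake above is easy to demonstrate with TextEncoder:

```javascript
const word = 'café';

console.log([...word].length);                      // 4 code points
console.log(word.length);                           // 4 UTF-16 code units
console.log(new TextEncoder().encode(word).length); // 5 UTF-8 bytes — é takes two
```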
Run the numbers
Convert text to and from binary with the binary text encoder. Pair with the Base64 encoder/decoder when you need compact binary-in-text representation, and the Caesar cipher tool for layered puzzles that mix base conversion with simple encryption.