Encode and Decode Strings Solution
Design an algorithm to encode a list of strings to a single string, and decode it back. The encoded string is transmitted over a network and decoded back to the original list.
Problem Statement
Design an algorithm to encode a list of strings into a single string, and decode it back to the original list. The encoded string is transmitted over a network, so the algorithm must handle strings containing any possible character โ including delimiters.
Brute Force โ Delimiter-Based
Join strings with a delimiter (e.g., comma or pipe). On decode, split by the delimiter. Simple โ but fundamentally broken if any string contains the delimiter character.
"," as delimiter and a string contains ",", the decoder splits incorrectly. Escaping the delimiter adds complexity and edge cases. No delimiter is safe when strings can contain any ASCII character. We need a fundamentally different approach.
| Metric | Value |
|---|---|
| Time | O(n) โ concatenation and split |
| Correctness | โ Fails when strings contain the delimiter |
Length-Prefix โ O(n)
Prefix each string with its length followed by a separator character (e.g., #). The decoder reads the length number, then reads exactly that many characters โ no ambiguity possible regardless of string contents.
- Encode: For each string, append
length + "#" + string. E.g.,"hello"โ"5#hello". - Decode: Read characters until you hit
#โ that's the length as a number. Read the nextlengthcharacters โ that's the string. Repeat.
- Any time you need to serialize/deserialize a collection of variable-length items into a single stream โ think length-prefix encoding.
- The signal is "items can contain any characters including separators." This is how HTTP chunked transfer encoding, Protocol Buffers, and many binary protocols work.
- The pattern appears in LC 271, serialization interviews, and system design discussions.
# as separator?: - The separator between the length and the string content could be any character.
- We choose
#because it makes the code readable. - The separator is NOT a delimiter between strings โ it's only between the length number and the string body.
- Since we read the length first, we know exactly how many characters to consume, making the
#inside string content harmless.
Visual Walkthrough
Encode and decode ["hello", "wo#rld", ""]:
- The string
"wo#rld"contains the separator character#. - The decoder doesn't care โ it reads the length (6) first, then blindly consumes the next 6 characters.
- The
#inside the content is just another character.
Code โ Java & Python
Complexity Analysis
| Approach | Time | Space | Trade-off |
|---|---|---|---|
| Delimiter (broken) | O(n) | O(n) | Fast but incorrect for arbitrary input |
| Escaping + Delimiter | O(n) | O(n) | Correct but complex escape logic |
| Length-Prefix โ optimal | O(n) | O(n) | Simple, correct for all inputs, no ambiguity |
- Both encode and decode process each character exactly once.
- The length prefix adds a constant overhead per string (a few digits + one
#), which is negligible compared to the string content.
Edge Cases & Pitfalls
| Case | Input | Why It Matters |
|---|---|---|
| Empty string in list | ["", "a"] | Encodes as "0#1#a". Length 0 means read 0 chars โ produces "". |
| All empty strings | ["", "", ""] | Encodes as "0#0#0#". Three reads of length 0. |
| String contains # | ["a#b"] | Encodes as "3#a#b". Decoder reads len=3, consumes "a#b" โ correct. |
| String contains digits | ["123"] | Encodes as "3#123". Decoder finds first # at pos 1, reads len=3. |
| Single string | ["hello"] | Encodes as "5#hello". Simplest case โ one length prefix, one string. |
| Long strings | ["a"ร200] | Length prefix is "200" โ 3 digits. Works fine; number of digits scales with log(len). |
indexOf('#') starting from position 0 instead of position i. If a previous string's content contains #, seeking from the wrong position finds the wrong separator. Always search for # starting from the current parsing position i.