Statistics Homework 3 — RSA Frequency Attack
Overview: RSA as a Per-Letter Cipher
This homework extends frequency analysis to RSA encryption applied letter-by-letter. When RSA encrypts individual letters (A–Z mapped to 1–26), it degenerates into a simple substitution cipher: each plaintext letter maps deterministically to a unique numeric token. This deterministic, monoalphabetic mapping is vulnerable to frequency attacks that align ciphertext token frequencies with reference language profiles.
We explore two statistical attack methods:
- Greedy rank-based matching: Sort tokens by frequency and map them to letters sorted by expected frequency (E, T, A, ...)
- Chi-squared minimization: Find the letter-token permutation that minimizes statistical distance from the reference distribution
Security note: Real RSA never encrypts single letters—it operates on large blocks with randomized padding (e.g., OAEP), which destroys letter-level structure. This per-letter variant is intentionally weak for educational purposes.
1. RSA Parameters & Input Text
RSA Key Configuration
First prime number for RSA modulus
Second prime number for RSA modulus
Public exponent, coprime to φ(n)
Plaintext Input
Enter text to encrypt, or load a sample text for frequency analysis
2. Encrypted Output & Frequency Attack
Ciphertext (RSA Token Stream)
Attack Methods
Use statistical techniques to recover plaintext without the private key:
Recovered Plaintext
3. Frequency Distribution Comparison
This chart compares the observed frequencies of mapped letters (derived from ciphertext tokens) against the English reference distribution. The mapping is determined by the selected attack method.
4. Theoretical Foundations
Problem Formulation
Let \(\mathcal{A} = \{A,\dots,Z\}\) and define an injective encoding \(\mathrm{enc}:\mathcal{A}\to\{1,\dots,26\}\) with \(\mathrm{enc}(A)=1\) and \(\mathrm{enc}(Z)=26\). Per-letter RSA encryption with public key \((n,e)\) maps each letter \(X\) to ciphertext token: \[ c \equiv \mathrm{enc}(X)^e \pmod{n} \] Punctuation and spaces remain unchanged.
RSA Background
Given primes \(p,q\) and public exponent \(e\) with \(\gcd(e,\varphi(n))=1\), where:
- \(n = pq\) (modulus)
- \(\varphi(n) = (p-1)(q-1)\) (Euler's totient function)
- \(d \equiv e^{-1} \pmod{\varphi(n)}\) (private exponent)
Correctness follows from Euler's theorem: \(m^{ed} \equiv m \pmod{n}\) for any \(m\) coprime to \(n\).
Why Per-Letter RSA is Vulnerable
The mapping \(X \mapsto [\mathrm{enc}(X)^e \bmod n]\) is deterministic and (for typical parameters) injective over \(\{1,\dots,26\}\). Each plaintext letter always produces the same token, creating a monoalphabetic substitution cipher. The empirical distribution of tokens mirrors the letter distribution of the underlying language, enabling frequency attacks.
Attack Strategy 1: Greedy Rank-Based Matching
Let \(T\) be the set of ciphertext tokens with empirical frequencies \(f_T\). Let \(f_L\) be the reference language distribution (English). The greedy approach:
- Sort tokens by \(f_T\) in descending order: \(t_1, t_2, \ldots, t_k\)
- Sort letters by \(f_L\) in descending order: \(E, T, A, \ldots\)
- Map \(t_i \to \) letter at position \(i\)
This heuristic maximizes positional overlap and typically recovers a significant portion of plaintext.
Attack Strategy 2: Chi-Squared Minimization
A more principled approach scores each candidate letter-token permutation \(\pi: T \to \mathcal{A}\) against the reference profile. Given observed token counts \(O_t\) and mapping \(\pi\), we form letter counts: \[ O_L = \sum_{\pi(t)=L} O_t \] Compare to expected counts \(E_L = N \cdot f_L(L)\) where \(N\) is total token count. The chi-squared distance: \[ \chi^2(\pi) = \sum_{L \in \mathcal{A}} \frac{(O_L - E_L)^2}{E_L} \] We seek \(\hat{\pi} = \arg\min_\pi \chi^2(\pi)\). Since exhaustive search over \(26!\) permutations is infeasible, we use an iterative greedy refinement: start with rank-based assignment, then swap letter-token pairs to reduce \(\chi^2\).
Alternative Attack: Direct Factorization
When RSA parameters are small (e.g., \(n < 10^6\)), an attacker can recover the private key directly by factoring \(n = pq\). If factorization succeeds, compute \(\varphi(n) = (p-1)(q-1)\) and recover: \[ d \equiv e^{-1} \pmod{\varphi(n)} \] This renders frequency analysis unnecessary.
RSA security depends on the computational hardness of integer factorization. For large moduli (\(n > 2^{2048}\)), factorization is computationally infeasible. However, for small moduli, algorithms like:
- Trial division: \(O(\sqrt{n})\) complexity, feasible for \(p, q < 1000\)
- Pollard's rho: Expected time \(O(\sqrt[4]{n})\)
- Quadratic sieve: Subexponential for moderate \(n\)
NIST recommends minimum 2048-bit keys for RSA (FIPS PUB 186-4, SP 800-57 Part 1). The toy parameters used here are intentionally weak for educational demonstration.
Limitations & Real-World Countermeasures
This per-letter RSA construction is intentionally insecure. Production RSA systems:
- Encrypt large message blocks (never individual letters)
- Use randomized padding schemes (e.g., OAEP) that destroy statistical structure
- Employ key sizes of 2048 bits or larger
With proper implementation, frequency attacks become infeasible. Per-letter RSA behaves like a classical monoalphabetic substitution cipher and serves purely as a pedagogical tool for understanding statistical cryptanalysis.
Further Refinements
Advanced frequency attacks incorporate:
- Bigram/trigram analysis: Use two- or three-letter frequency patterns
- Dictionary cross-checking: Validate decoded words against language dictionaries
- Iterative swap optimization: Refine mappings by local search heuristics
For comprehensive cryptanalysis theory, see Stinson & Paterson (2018) Cryptography: Theory and Practice or Katz & Lindell (2020) Introduction to Modern Cryptography.
Key Insight: Why Frequency Attacks Work
Per-letter RSA creates a deterministic one-to-one mapping from letters to numeric tokens. The ciphertext token histogram mirrors the plaintext letter histogram. By aligning the most frequent token with the most frequent letter (E in English), and continuing this rank-based matching, we recover large portions of the message.
Chi-squared minimization provides a principled optimization criterion that typically outperforms simple rank matching by finding the permutation with minimal statistical distance from the reference language profile.