Statistics Homework 3 — RSA Frequency Attack

Overview: RSA as a Per-Letter Cipher

This homework extends frequency analysis to RSA encryption applied letter-by-letter. When RSA encrypts individual letters (A–Z mapped to 1–26), it degenerates into a simple substitution cipher: each plaintext letter maps deterministically to a unique numeric token. This deterministic, monoalphabetic mapping is vulnerable to frequency attacks that align ciphertext token frequencies with reference language profiles.

We explore two statistical attack methods:

Greedy rank-based matching: Sort tokens by frequency and map them to letters sorted by expected frequency (E, T, A, ...)
Chi-squared minimization: Find the letter-token permutation that minimizes statistical distance from the reference distribution

Security note: Real RSA never encrypts single letters—it operates on large blocks with randomized padding (e.g., OAEP), which destroys letter-level structure. This per-letter variant is intentionally weak for educational purposes.

1. RSA Parameters & Input Text

RSA Key Configuration

p — First Prime

First prime number for RSA modulus

q — Second Prime

Second prime number for RSA modulus

e — Public Exponent

Public exponent, coprime to φ(n)

RSA Modulus

—

χ² Distance

—

Plaintext Input

Enter text to encrypt, or load a sample text for frequency analysis

Plaintext:

2. Encrypted Output & Frequency Attack

Ciphertext (RSA Token Stream)

Encrypted Tokens:

Attack Methods

Use statistical techniques to recover plaintext without the private key:

Recovered Plaintext

Decoded Text:

3. Frequency Distribution Comparison

This chart compares the observed frequencies of mapped letters (derived from ciphertext tokens) against the English reference distribution. The mapping is determined by the selected attack method.

4. Theoretical Foundations

Problem Formulation

Let \(\mathcal{A} = \{A,\dots,Z\}\) and define an injective encoding \(\mathrm{enc}:\mathcal{A}\to\{1,\dots,26\}\) with \(\mathrm{enc}(A)=1\) and \(\mathrm{enc}(Z)=26\). Per-letter RSA encryption with public key \((n,e)\) maps each letter \(X\) to ciphertext token: \[ c \equiv \mathrm{enc}(X)^e \pmod{n} \] Punctuation and spaces remain unchanged.

RSA Background

Given primes \(p,q\) and public exponent \(e\) with \(\gcd(e,\varphi(n))=1\), where:

\(n = pq\) (modulus)
\(\varphi(n) = (p-1)(q-1)\) (Euler's totient function)
\(d \equiv e^{-1} \pmod{\varphi(n)}\) (private exponent)

Correctness follows from Euler's theorem: \(m^{ed} \equiv m \pmod{n}\) for any \(m\) coprime to \(n\).

Why Per-Letter RSA is Vulnerable

The mapping \(X \mapsto [\mathrm{enc}(X)^e \bmod n]\) is deterministic and (for typical parameters) injective over \(\{1,\dots,26\}\). Each plaintext letter always produces the same token, creating a monoalphabetic substitution cipher. The empirical distribution of tokens mirrors the letter distribution of the underlying language, enabling frequency attacks.

Attack Strategy 1: Greedy Rank-Based Matching

Let \(T\) be the set of ciphertext tokens with empirical frequencies \(f_T\). Let \(f_L\) be the reference language distribution (English). The greedy approach:

Sort tokens by \(f_T\) in descending order: \(t_1, t_2, \ldots, t_k\)
Sort letters by \(f_L\) in descending order: \(E, T, A, \ldots\)
Map \(t_i \to \) letter at position \(i\)

This heuristic maximizes positional overlap and typically recovers a significant portion of plaintext.

Attack Strategy 2: Chi-Squared Minimization

A more principled approach scores each candidate letter-token permutation \(\pi: T \to \mathcal{A}\) against the reference profile. Given observed token counts \(O_t\) and mapping \(\pi\), we form letter counts: \[ O_L = \sum_{\pi(t)=L} O_t \] Compare to expected counts \(E_L = N \cdot f_L(L)\) where \(N\) is total token count. The chi-squared distance: \[ \chi^2(\pi) = \sum_{L \in \mathcal{A}} \frac{(O_L - E_L)^2}{E_L} \] We seek \(\hat{\pi} = \arg\min_\pi \chi^2(\pi)\). Since exhaustive search over \(26!\) permutations is infeasible, we use an iterative greedy refinement: start with rank-based assignment, then swap letter-token pairs to reduce \(\chi^2\).

Alternative Attack: Direct Factorization

When RSA parameters are small (e.g., \(n < 10^6\)), an attacker can recover the private key directly by factoring \(n = pq\). If factorization succeeds, compute \(\varphi(n) = (p-1)(q-1)\) and recover: \[ d \equiv e^{-1} \pmod{\varphi(n)} \] This renders frequency analysis unnecessary.

RSA security depends on the computational hardness of integer factorization. For large moduli (\(n > 2^{2048}\)), factorization is computationally infeasible. However, for small moduli, algorithms like:

Trial division: \(O(\sqrt{n})\) complexity, feasible for \(p, q < 1000\)
Pollard's rho: Expected time \(O(\sqrt[4]{n})\)
Quadratic sieve: Subexponential for moderate \(n\)

NIST recommends minimum 2048-bit keys for RSA (FIPS PUB 186-4, SP 800-57 Part 1). The toy parameters used here are intentionally weak for educational demonstration.

Limitations & Real-World Countermeasures

This per-letter RSA construction is intentionally insecure. Production RSA systems:

Encrypt large message blocks (never individual letters)
Use randomized padding schemes (e.g., OAEP) that destroy statistical structure
Employ key sizes of 2048 bits or larger

With proper implementation, frequency attacks become infeasible. Per-letter RSA behaves like a classical monoalphabetic substitution cipher and serves purely as a pedagogical tool for understanding statistical cryptanalysis.

Further Refinements

Advanced frequency attacks incorporate:

Bigram/trigram analysis: Use two- or three-letter frequency patterns
Dictionary cross-checking: Validate decoded words against language dictionaries
Iterative swap optimization: Refine mappings by local search heuristics

For comprehensive cryptanalysis theory, see Stinson & Paterson (2018) Cryptography: Theory and Practice or Katz & Lindell (2020) Introduction to Modern Cryptography.

Key Insight: Why Frequency Attacks Work

Per-letter RSA creates a deterministic one-to-one mapping from letters to numeric tokens. The ciphertext token histogram mirrors the plaintext letter histogram. By aligning the most frequent token with the most frequent letter (E in English), and continuing this rank-based matching, we recover large portions of the message.

Chi-squared minimization provides a principled optimization criterion that typically outperforms simple rank matching by finding the permutation with minimal statistical distance from the reference language profile.