Statistics Homework 3 — RSA Frequency Attack

Overview: RSA as a Per-Letter Cipher

This homework extends frequency analysis to RSA encryption applied letter-by-letter. When RSA encrypts individual letters (A–Z mapped to 1–26), it degenerates into a simple substitution cipher: each plaintext letter maps deterministically to a unique numeric token. This deterministic, monoalphabetic mapping is vulnerable to frequency attacks that align ciphertext token frequencies with reference language profiles.

We explore two statistical attack methods:

Security note: Real RSA never encrypts single letters—it operates on large blocks with randomized padding (e.g., OAEP), which destroys letter-level structure. This per-letter variant is intentionally weak for educational purposes.

1. RSA Parameters & Input Text

RSA Key Configuration

First prime number for RSA modulus

Second prime number for RSA modulus

Public exponent, coprime to φ(n)

RSA Modulus
χ² Distance

Plaintext Input

Enter text to encrypt, or load a sample text for frequency analysis

2. Encrypted Output & Frequency Attack

Ciphertext (RSA Token Stream)

Attack Methods

Use statistical techniques to recover plaintext without the private key:

Recovered Plaintext

3. Frequency Distribution Comparison

This chart compares the observed frequencies of mapped letters (derived from ciphertext tokens) against the English reference distribution. The mapping is determined by the selected attack method.

4. Theoretical Foundations

Problem Formulation

Let \(\mathcal{A} = \{A,\dots,Z\}\) and define an injective encoding \(\mathrm{enc}:\mathcal{A}\to\{1,\dots,26\}\) with \(\mathrm{enc}(A)=1\) and \(\mathrm{enc}(Z)=26\). Per-letter RSA encryption with public key \((n,e)\) maps each letter \(X\) to ciphertext token: \[ c \equiv \mathrm{enc}(X)^e \pmod{n} \] Punctuation and spaces remain unchanged.

RSA Background

Given primes \(p,q\) and public exponent \(e\) with \(\gcd(e,\varphi(n))=1\), where:

Correctness follows from Euler's theorem: \(m^{ed} \equiv m \pmod{n}\) for any \(m\) coprime to \(n\).

Why Per-Letter RSA is Vulnerable

The mapping \(X \mapsto [\mathrm{enc}(X)^e \bmod n]\) is deterministic and (for typical parameters) injective over \(\{1,\dots,26\}\). Each plaintext letter always produces the same token, creating a monoalphabetic substitution cipher. The empirical distribution of tokens mirrors the letter distribution of the underlying language, enabling frequency attacks.

Attack Strategy 1: Greedy Rank-Based Matching

Let \(T\) be the set of ciphertext tokens with empirical frequencies \(f_T\). Let \(f_L\) be the reference language distribution (English). The greedy approach:

  1. Sort tokens by \(f_T\) in descending order: \(t_1, t_2, \ldots, t_k\)
  2. Sort letters by \(f_L\) in descending order: \(E, T, A, \ldots\)
  3. Map \(t_i \to \) letter at position \(i\)

This heuristic maximizes positional overlap and typically recovers a significant portion of plaintext.

Attack Strategy 2: Chi-Squared Minimization

A more principled approach scores each candidate letter-token permutation \(\pi: T \to \mathcal{A}\) against the reference profile. Given observed token counts \(O_t\) and mapping \(\pi\), we form letter counts: \[ O_L = \sum_{\pi(t)=L} O_t \] Compare to expected counts \(E_L = N \cdot f_L(L)\) where \(N\) is total token count. The chi-squared distance: \[ \chi^2(\pi) = \sum_{L \in \mathcal{A}} \frac{(O_L - E_L)^2}{E_L} \] We seek \(\hat{\pi} = \arg\min_\pi \chi^2(\pi)\). Since exhaustive search over \(26!\) permutations is infeasible, we use an iterative greedy refinement: start with rank-based assignment, then swap letter-token pairs to reduce \(\chi^2\).

Alternative Attack: Direct Factorization

When RSA parameters are small (e.g., \(n < 10^6\)), an attacker can recover the private key directly by factoring \(n = pq\). If factorization succeeds, compute \(\varphi(n) = (p-1)(q-1)\) and recover: \[ d \equiv e^{-1} \pmod{\varphi(n)} \] This renders frequency analysis unnecessary.

RSA security depends on the computational hardness of integer factorization. For large moduli (\(n > 2^{2048}\)), factorization is computationally infeasible. However, for small moduli, algorithms like:

NIST recommends minimum 2048-bit keys for RSA (FIPS PUB 186-4, SP 800-57 Part 1). The toy parameters used here are intentionally weak for educational demonstration.

Limitations & Real-World Countermeasures

This per-letter RSA construction is intentionally insecure. Production RSA systems:

With proper implementation, frequency attacks become infeasible. Per-letter RSA behaves like a classical monoalphabetic substitution cipher and serves purely as a pedagogical tool for understanding statistical cryptanalysis.

Further Refinements

Advanced frequency attacks incorporate:

For comprehensive cryptanalysis theory, see Stinson & Paterson (2018) Cryptography: Theory and Practice or Katz & Lindell (2020) Introduction to Modern Cryptography.

Key Insight: Why Frequency Attacks Work

Per-letter RSA creates a deterministic one-to-one mapping from letters to numeric tokens. The ciphertext token histogram mirrors the plaintext letter histogram. By aligning the most frequent token with the most frequent letter (E in English), and continuing this rank-based matching, we recover large portions of the message.

Chi-squared minimization provides a principled optimization criterion that typically outperforms simple rank matching by finding the permutation with minimal statistical distance from the reference language profile.