Statistics Homework 6 — Online Algorithms

Overview: Online Algorithms for Statistics

Computing statistics like mean and variance is straightforward for static datasets. However, the naive approach of recalculating from scratch whenever new data arrives is computationally inefficient and impractical for streaming data that arrives continuously, for datasets too large to fit in memory, and for real-time systems with tight latency budgets.

Online algorithms (also called incremental or streaming algorithms) solve this problem by computing statistics using recurrence relations that update existing estimates with each new observation, avoiding the need to store all previous data or recalculate from scratch.

The Computational Problem

Naive batch computation requires:

  • Storage: All \(n\) data points must be stored → \(O(n)\) space
  • Time: Recalculate entire sum each update → \(O(n)\) per update, \(O(nk)\) for \(k\) updates
  • Overflow risk: Summing large numbers can cause numerical overflow
  • Memory pressure: Cannot handle datasets larger than available RAM

Solution: Online algorithms maintain running estimates requiring only \(O(1)\) space and \(O(1)\) time per update.

1. Online Mean Computation

The arithmetic mean can be updated incrementally using a simple recurrence relation. Let \(\bar{x}_n\) denote the mean of \(n\) observations \(x_1, \ldots, x_n\).

1.1 Mathematical Derivation

Starting with the definition: \[ \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i \]

When a new observation \(x_{n+1}\) arrives, the new mean is: \[ \bar{x}_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1} x_i = \frac{1}{n+1}\left(\sum_{i=1}^{n} x_i + x_{n+1}\right) \]

We can express the old sum in terms of the old mean: \(\sum_{i=1}^{n} x_i = n\bar{x}_n\). Substituting: \[ \bar{x}_{n+1} = \frac{1}{n+1}\left(n\bar{x}_n + x_{n+1}\right) \]

Expanding and rearranging: \[ \bar{x}_{n+1} = \frac{n\bar{x}_n}{n+1} + \frac{x_{n+1}}{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1} \]

This gives us the recurrence relation for the online mean: \[ \bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1} \]
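As a quick numeric check, the recurrence can be verified against the batch definition at each step (the variable names below are illustrative):

```python
data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]  # illustrative observations

mean = 0.0
for n, x in enumerate(data, start=1):
    mean += (x - mean) / n       # recurrence: x̄_n = x̄_{n-1} + (x_n - x̄_{n-1})/n
    direct = sum(data[:n]) / n   # batch definition, for comparison
    assert abs(mean - direct) < 1e-12

print(mean)  # 23/6 ≈ 3.8333
```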

1.2 Algorithm Properties

  • Single pass: each observation is processed exactly once and can then be discarded
  • \(O(1)\) space: only the current mean \(\bar{x}_n\) and the count \(n\) are stored
  • \(O(1)\) time per update
  • Numerically stable: each step adds a small correction \(\frac{x_{n+1} - \bar{x}_n}{n+1}\) instead of maintaining a large running sum

1.3 Implementation

def online_mean():
    n = 0
    mean = 0.0
    
    def update(x):
        nonlocal n, mean
        n += 1
        mean += (x - mean) / n
        return mean
    
    return update

# Usage
mean_updater = online_mean()
for x in [10, 20, 30, 40, 50]:
    print(f"Mean after {x}: {mean_updater(x)}")
# Output: Mean after 10: 10.0
#         Mean after 20: 15.0
#         Mean after 30: 20.0
#         Mean after 40: 25.0
#         Mean after 50: 30.0

In C/C++, the implementation is similarly concise:

#include <stdio.h>

void online_mean_update(double *mean, int *n, double x) {
    (*n)++;                          /* one more observation */
    *mean += (x - *mean) / (*n);     /* same recurrence as above */
}

// Usage
int main(void) {
    double mean = 0.0;
    int n = 0;
    double data[] = {10.0, 20.0, 30.0, 40.0, 50.0};
    for (int i = 0; i < 5; i++) {
        online_mean_update(&mean, &n, data[i]);
        printf("Mean: %f\n", mean);
    }
    return 0;
}

2. Online Variance: Welford's Algorithm

Computing variance online is more challenging because the variance formula \(s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2\) requires the mean \(\bar{x}_n\), which changes as data arrives. The naive approach of storing sum of squares suffers from numerical instability and potential overflow.

The Two-Pass Problem

Computing variance naively requires two passes: first to calculate \(\bar{x}_n\), then to compute \(\sum (x_i - \bar{x}_n)^2\). Online algorithms solve this using Welford's algorithm, which computes both mean and variance in a single pass using recurrence relations.
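For contrast, here is a minimal sketch of the two-pass approach just described (the function name is illustrative; it assumes the full dataset is available in memory):

```python
def two_pass_variance(values):
    """Sample variance via two passes: first the mean, then squared deviations."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n                      # pass 1: compute the mean
    ss = sum((x - mean) ** 2 for x in values)   # pass 2: sum squared deviations
    return ss / (n - 1)

print(two_pass_variance([10, 20, 30, 40, 50]))  # 250.0
```

Welford's algorithm, derived next, obtains the same result in a single pass without storing the data.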

2.1 Mathematical Derivation

Welford's algorithm (also known as the online algorithm for variance) uses a recurrence relation based on the sum of squared differences. Let us define: \[ S_n = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 \]

Then the sample variance is: \[ s_n^2 = \frac{S_n}{n-1} \]

We need to find a recurrence relation for \(S_n\). Starting from: \[ S_{n+1} = \sum_{i=1}^{n+1} (x_i - \bar{x}_{n+1})^2 \]

Expanding and using the online mean update: \[ S_{n+1} = \sum_{i=1}^{n} (x_i - \bar{x}_{n+1})^2 + (x_{n+1} - \bar{x}_{n+1})^2 \]

Substituting \(\bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1}\) and after algebraic manipulation: \[ S_{n+1} = S_n + (x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1}) \]

This yields Welford's algorithm:

  1. Initialize: \(\bar{x}_1 = x_1\), \(S_1 = 0\), \(n = 1\)
  2. For each new \(x_{n+1}\):
    • Update mean: \(\bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1}\)
    • Update sum of squares: \(S_{n+1} = S_n + (x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1})\)
    • Compute variance: \(s_{n+1}^2 = \frac{S_{n+1}}{n}\)

2.2 Algorithm Properties

  • Single pass: mean and variance are computed together in one sweep over the data
  • \(O(1)\) space: only \(\bar{x}_n\), \(S_n\), and \(n\) are stored
  • \(O(1)\) time per update
  • Numerically stable: \(S_n\) accumulates products of deviations from the running mean, avoiding the catastrophic cancellation of the \(\sum x_i^2 - n\bar{x}_n^2\) formula

2.3 Implementation

def online_variance():
    n = 0
    mean = 0.0
    S = 0.0  # Sum of squared differences
    
    def update(x):
        nonlocal n, mean, S
        n += 1
        delta = x - mean
        mean += delta / n
        S += delta * (x - mean)
        variance = S / (n - 1) if n > 1 else 0.0
        return mean, variance
    
    return update

# Usage
var_updater = online_variance()
for i, x in enumerate([10, 20, 30, 40, 50], start=1):
    mean, var = var_updater(x)
    print(f"n={i}, mean={mean:.2f}, variance={var:.2f}")

In C/C++:

#include <stdio.h>

void online_variance_update(double *mean, double *S, int *n, double x) {
    (*n)++;
    double delta = x - *mean;        /* deviation from the old mean */
    *mean += delta / (*n);
    *S += delta * (x - *mean);       /* deviation from old mean times deviation from new mean */
}

double sample_variance(double S, int n) {
    return (n > 1) ? S / (n - 1) : 0.0;
}

// Usage
int main(void) {
    double mean = 0.0, S = 0.0;
    int n = 0;
    double data[] = {10.0, 20.0, 30.0, 40.0, 50.0};
    for (int i = 0; i < 5; i++) {
        online_variance_update(&mean, &S, &n, data[i]);
        printf("Mean: %f, Variance: %f\n", mean, sample_variance(S, n));
    }
    return 0;
}

3. Interactive Calculator

Enter values one at a time to see how online algorithms update mean and variance incrementally. Compare with batch computation to observe numerical differences and efficiency.

The calculator demonstrates Welford's algorithm in action, showing how statistics update with each new observation without storing the entire dataset.
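For readers without access to the calculator, the same comparison can be sketched in a few lines, pairing the Welford update from Section 2 with Python's standard `statistics` module as the batch reference (the stream values are illustrative):

```python
import statistics

stream = [12.0, 15.0, 11.0, 19.0, 14.0]  # illustrative observations

n, mean, S = 0, 0.0, 0.0
for x in stream:
    # Online (Welford) update: O(1) work, O(1) state
    n += 1
    delta = x - mean
    mean += delta / n
    S += delta * (x - mean)
    online_var = S / (n - 1) if n > 1 else 0.0

    # Batch recomputation over everything seen so far: O(n) work, O(n) storage
    seen = stream[:n]
    batch_mean = statistics.fmean(seen)
    batch_var = statistics.variance(seen) if n > 1 else 0.0

    print(f"n={n}: online mean={mean:.2f}, var={online_var:.2f} | "
          f"batch mean={batch_mean:.2f}, var={batch_var:.2f}")
```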

4. Theoretical Foundations

4.1 Time Complexity Analysis

Method          | Per Update | Total (\(n\) updates)
----------------|------------|----------------------
Naive (batch)   | \(O(n)\)   | \(O(n^2)\)
Online          | \(O(1)\)   | \(O(n)\)

For streaming data with \(n\) observations, the naive approach requires \(O(n^2)\) total operations (each update recomputes from scratch), while the online algorithm requires only \(O(n)\) operations. This represents a quadratic improvement in time complexity.
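The quadratic gap can be made concrete by counting additions as a rough proxy for work (the helper names are illustrative; the model assumes each batch recomputation of a length-\(i\) sum costs about \(i\) additions):

```python
def naive_total_additions(n_updates):
    # Update i recomputes a sum over i values: about i additions each time
    return sum(range(1, n_updates + 1))  # n(n+1)/2, i.e. O(n^2)

def online_total_additions(n_updates):
    # Each update does a constant amount of arithmetic
    return n_updates  # O(n)

for n in (10, 100, 1000):
    print(n, naive_total_additions(n), online_total_additions(n))
# At n=1000: 500500 additions for naive recomputation vs 1000 online
```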

4.2 Space Complexity Analysis

Method             | Storage Required
-------------------|--------------------------------------------------
Naive (batch)      | \(O(n)\) — all data points
Online (mean)      | \(O(1)\) — 2 values (\(\bar{x}_n\), \(n\))
Online (variance)  | \(O(1)\) — 3 values (\(\bar{x}_n\), \(S_n\), \(n\))

Online algorithms achieve constant space complexity, independent of dataset size. This is crucial for:

  • Embedded and IoT devices with limited RAM
  • Streaming pipelines where the data cannot be stored
  • Long-running monitors that process effectively unbounded data
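To illustrate the constant-space property, the following sketch computes the mean of a million values drawn from a generator, so the dataset is never materialized (the names are illustrative):

```python
def value_stream():
    """Yield a million observations one at a time; nothing is kept in memory."""
    for i in range(1_000_000):
        yield float(i)

n, mean = 0, 0.0
for x in value_stream():
    n += 1
    mean += (x - mean) / n  # total state: two numbers, regardless of n

print(n, mean)  # mean of 0..999999 is ~499999.5
```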

4.3 Numerical Stability

The naive approach of accumulating \(\sum x_i\) and \(\sum x_i^2\) can suffer from:

  • Catastrophic cancellation: \(\frac{1}{n}\sum x_i^2 - \bar{x}_n^2\) subtracts two nearly equal large numbers, destroying significant digits (and can even yield a negative variance)
  • Overflow: the accumulated sums grow without bound as \(n\) and the magnitudes of the \(x_i\) grow

Online algorithms mitigate these issues by working with deviations from the running mean: the quantities \(x - \bar{x}_n\) stay small even when the raw values are large, so no large intermediate sums are ever formed.

4.4 Overflow Prevention Example

# Naive approach - accumulates large sums
def naive_variance(values):
    n = len(values)
    sum_x = sum(values)                    # grows without bound
    sum_x2 = sum(x*x for x in values)      # can overflow fixed-width floats
    mean = sum_x / n
    # Subtracting two nearly equal large quantities loses precision
    variance = (sum_x2 / n - mean * mean) * (n / (n - 1))
    return mean, variance

# Online approach - avoids overflow
def online_variance_safe():
    n = 0
    mean = 0.0
    S = 0.0
    def update(x):
        nonlocal n, mean, S
        n += 1
        delta = x - mean
        mean += delta / n
        S += delta * (x - mean)
        return mean, S / (n - 1) if n > 1 else 0.0
    return update

# With fixed-width accumulators (e.g., C doubles or NumPy float32), the naive
# sums can overflow outright; in pure Python floats the dominant failure mode
# is precision loss. The online method sidesteps both.
var_updater = online_variance_safe()
for _ in range(1_000_000):
    var_updater(1e10)  # values are streamed, never stored

4.5 Accuracy Comparison

For well-conditioned data (values of similar magnitude), both methods produce identical results within floating-point precision. However, online methods are more stable for:

  • Data with a large common offset (mean large relative to the standard deviation)
  • Very long streams, where rounding error in large accumulated sums compounds
  • Mixed magnitudes, where small values are absorbed into a large running sum
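The offset effect can be demonstrated directly. In this sketch (the data values are illustrative) the true sample variance is 30; Welford's update recovers it, while the sum-of-squares formula subtracts two nearly equal quantities near \(10^{18}\) and does not:

```python
# Four values with a large common offset; true sample variance is 30.
offset = 1e9
data = [offset + d for d in (4.0, 7.0, 13.0, 16.0)]
n = len(data)

# Naive sum-of-squares formula: cancellation between two ~1e18 quantities
mean = sum(data) / n
naive = (sum(x * x for x in data) / n - mean * mean) * (n / (n - 1))

# Welford's algorithm: works with small deviations from the running mean
m, S = 0.0, 0.0
for i, x in enumerate(data, start=1):
    delta = x - m
    m += delta / i
    S += delta * (x - m)
welford = S / (n - 1)

print(naive, welford)  # welford gives 30.0; naive does not
```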

4.6 Mathematical Foundations

Online algorithms rely on recurrence relations and mathematical properties of statistical moments: the mean update follows from the linearity of summation, and Welford's update from expanding the shifted sum of squared deviations. Both recurrences are exact; they introduce no approximation beyond floating-point rounding.

For rigorous treatments, see Knuth (1997) for computational aspects and Welford (1962) for the original variance algorithm derivation.

4.7 Applications in Cybersecurity

Network Traffic Monitoring:

  • Maintain running means and variances of packet rates, byte counts, and connections per host to build traffic baselines without storing packet logs

Intrusion Detection Systems (IDS/IPS):

  • Flag observations that deviate from the running mean by several running standard deviations (e.g., \(|x - \bar{x}_n| > k \cdot s_n\)) as potential anomalies, updating the baseline in real time as traffic evolves

Performance Monitoring:

  • Track latency and throughput statistics on live systems in \(O(1)\) memory, suitable for resource-constrained sensors and high-rate event streams

Key Insight: Online vs Batch Algorithms

Batch algorithms require \(O(n)\) space and \(O(n)\) time per update, making them impractical for streaming data. Online algorithms achieve \(O(1)\) space and \(O(1)\) time per update while maintaining numerical stability and accuracy. This makes them essential for real-time cybersecurity monitoring, IoT devices, and any system where data arrives continuously and memory is limited.

References

  • Welford, B. P. (1962). "Note on a Method for Calculating Corrected Sums of Squares and Products." Technometrics, 4(3), 419–420.
  • Knuth, D. E. (1997). The Art of Computer Programming, Vol. 2: Seminumerical Algorithms (3rd ed.), Section 4.2.2. Addison-Wesley.