Statistics Homework 6 — Online Algorithms
Overview: Online Algorithms for Statistics
Computing statistics like mean and variance is straightforward for static datasets. However, the naive approach of recalculating from scratch whenever new data arrives is computationally inefficient and impractical for:
- Streaming data: Continuous data ingestion in real-time systems
- Large datasets: Data that cannot fit in memory
- Resource-constrained devices: Embedded systems, IoT sensors, edge devices
- Real-time monitoring: Anomaly detection, performance dashboards
Online algorithms (also called incremental or streaming algorithms) solve this problem by computing statistics using recurrence relations that update existing estimates with each new observation, avoiding the need to store all previous data or recalculate from scratch.
The Computational Problem
Naive batch computation requires:
- Storage: All \(n\) data points must be stored → \(O(n)\) space
- Time: Recalculate entire sum each update → \(O(n)\) per update, \(O(nk)\) for \(k\) updates
- Overflow risk: Summing large numbers can cause numerical overflow
- Memory pressure: Cannot handle datasets larger than available RAM
Solution: Online algorithms maintain running estimates requiring only \(O(1)\) space and \(O(1)\) time per update.
1. Online Mean Computation
The arithmetic mean can be updated incrementally using a simple recurrence relation. Let \(\bar{x}_n\) denote the mean of \(n\) observations \(x_1, \ldots, x_n\).
1.1 Mathematical Derivation
Starting with the definition: \[ \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i \]
When a new observation \(x_{n+1}\) arrives, the new mean is: \[ \bar{x}_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1} x_i = \frac{1}{n+1}\left(\sum_{i=1}^{n} x_i + x_{n+1}\right) \]
We can express the old sum in terms of the old mean: \(\sum_{i=1}^{n} x_i = n\bar{x}_n\). Substituting: \[ \bar{x}_{n+1} = \frac{1}{n+1}\left(n\bar{x}_n + x_{n+1}\right) \]
Expanding and rearranging: \[ \bar{x}_{n+1} = \frac{n\bar{x}_n}{n+1} + \frac{x_{n+1}}{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1} \]
This gives us the recurrence relation for the online mean: \[ \bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1} \]
1.2 Algorithm Properties
- Initialization: \(\bar{x}_1 = x_1\) (mean of one observation is the observation itself)
- Update formula: \(\bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1}\)
- Storage: Only \(\bar{x}_n\) and \(n\) required → \(O(1)\) space
- Time complexity: \(O(1)\) per update
- Numerical stability: Incremental updates avoid summing large numbers
1.3 Implementation
```python
def online_mean():
    n = 0
    mean = 0.0

    def update(x):
        nonlocal n, mean
        n += 1
        mean += (x - mean) / n
        return mean

    return update

# Usage
mean_updater = online_mean()
for x in [10, 20, 30, 40, 50]:
    print(f"Mean after {x}: {mean_updater(x)}")

# Output: Mean after 10: 10.0
#         Mean after 20: 15.0
#         Mean after 30: 20.0
#         Mean after 40: 25.0
#         Mean after 50: 30.0
```
In C/C++, the implementation is similarly concise:
```c
#include <stdio.h>

void online_mean_update(double *mean, int *n, double x) {
    (*n)++;
    *mean += (x - *mean) / (*n);
}

// Usage
int main(void) {
    double mean = 0.0;
    int n = 0;
    double data[] = {10.0, 20.0, 30.0, 40.0, 50.0};
    for (int i = 0; i < 5; i++) {
        online_mean_update(&mean, &n, data[i]);
        printf("Mean: %f\n", mean);
    }
    return 0;
}
```
2. Online Variance: Welford's Algorithm
Computing variance online is more challenging because the variance formula \(s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2\) requires the mean \(\bar{x}_n\), which itself changes as data arrives. The naive single-pass alternative of accumulating \(\sum x_i\) and \(\sum x_i^2\) suffers from numerical instability (catastrophic cancellation when subtracting two large, nearly equal quantities) and potential overflow.
The Two-Pass Problem
Computing variance naively requires two passes: first to calculate \(\bar{x}_n\), then to compute \(\sum (x_i - \bar{x}_n)^2\). Online algorithms solve this using Welford's algorithm, which computes both mean and variance in a single pass using recurrence relations.
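For contrast, a minimal two-pass sketch (the function name `two_pass_variance` is illustrative) makes the two traversals explicit — and shows why it cannot work on a stream, since the whole dataset must be available before the first pass:

```python
def two_pass_variance(values):
    """Sample variance in two passes: first the mean, then the deviations."""
    n = len(values)
    mean = sum(values) / n                        # Pass 1: mean
    S = sum((x - mean) ** 2 for x in values)      # Pass 2: squared deviations
    return S / (n - 1)

print(two_pass_variance([10, 20, 30, 40, 50]))   # 250.0
```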
2.1 Mathematical Derivation
Welford's algorithm (also known as the online algorithm for variance) uses a recurrence relation based on the sum of squared differences. Let us define: \[ S_n = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 \]
Then the sample variance is: \[ s_n^2 = \frac{S_n}{n-1} \]
We need to find a recurrence relation for \(S_n\). Starting from: \[ S_{n+1} = \sum_{i=1}^{n+1} (x_i - \bar{x}_{n+1})^2 \]
Expanding and using the online mean update: \[ S_{n+1} = \sum_{i=1}^{n} (x_i - \bar{x}_{n+1})^2 + (x_{n+1} - \bar{x}_{n+1})^2 \]
Substituting \(\bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1}\) and after algebraic manipulation: \[ S_{n+1} = S_n + (x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1}) \]
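The algebraic manipulation can be filled in explicitly. Write \(d = x_{n+1} - \bar{x}_n\), so that \(\bar{x}_{n+1} = \bar{x}_n + \frac{d}{n+1}\) and \(x_{n+1} - \bar{x}_{n+1} = \frac{n}{n+1}d\). Each old term picks up the same shift: \[ x_i - \bar{x}_{n+1} = (x_i - \bar{x}_n) - \frac{d}{n+1} \] Squaring and summing over \(i = 1, \ldots, n\), the cross term vanishes because \(\sum_{i=1}^{n}(x_i - \bar{x}_n) = 0\): \[ \sum_{i=1}^{n}(x_i - \bar{x}_{n+1})^2 = S_n + \frac{n\,d^2}{(n+1)^2} \] Adding the new point's contribution \((x_{n+1} - \bar{x}_{n+1})^2 = \frac{n^2 d^2}{(n+1)^2}\): \[ S_{n+1} = S_n + \frac{n\,d^2}{(n+1)^2}(1 + n) = S_n + \frac{n\,d^2}{n+1} = S_n + d \cdot \frac{n\,d}{n+1} = S_n + (x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1}) \]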
This yields Welford's algorithm:
- Initialize: \(\bar{x}_1 = x_1\), \(S_1 = 0\), \(n = 1\)
- For each new \(x_{n+1}\):
- Update mean: \(\bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1}\)
- Update sum of squares: \(S_{n+1} = S_n + (x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1})\)
- Compute variance: \(s_{n+1}^2 = \frac{S_{n+1}}{n}\)
2.2 Algorithm Properties
- Single pass: Computes mean and variance simultaneously in one traversal
- Numerical stability: Avoids catastrophic cancellation in sum-of-squares computation
- Storage: Only \(\bar{x}_n\), \(S_n\), and \(n\) required → \(O(1)\) space
- Time complexity: \(O(1)\) per update
- Accuracy: Provably more accurate in finite precision than the naive single-pass sum-of-squares formula
2.3 Implementation
```python
def online_variance():
    n = 0
    mean = 0.0
    S = 0.0  # Sum of squared differences from the current mean

    def update(x):
        nonlocal n, mean, S
        n += 1
        delta = x - mean
        mean += delta / n
        S += delta * (x - mean)
        variance = S / (n - 1) if n > 1 else 0.0
        return mean, variance

    return update

# Usage
var_updater = online_variance()
for i, x in enumerate([10, 20, 30, 40, 50], start=1):
    mean, var = var_updater(x)
    print(f"n={i}, mean={mean:.2f}, variance={var:.2f}")
```
In C/C++:
```c
#include <stdio.h>

void online_variance_update(double *mean, double *S, int *n, double x) {
    (*n)++;
    double delta = x - *mean;
    *mean += delta / (*n);
    *S += delta * (x - *mean);
}

double sample_variance(double S, int n) {
    return (n > 1) ? S / (n - 1) : 0.0;
}

// Usage
int main(void) {
    double mean = 0.0, S = 0.0;
    int n = 0;
    double data[] = {10.0, 20.0, 30.0, 40.0, 50.0};
    for (int i = 0; i < 5; i++) {
        online_variance_update(&mean, &S, &n, data[i]);
        printf("Mean: %f, Variance: %f\n", mean, sample_variance(S, n));
    }
    return 0;
}
```
3. Interactive Calculator
Enter values one at a time to see how online algorithms update mean and variance incrementally. Compare with batch computation to observe numerical differences and efficiency.
The calculator demonstrates Welford's algorithm in action, showing how statistics update with each new observation without storing the entire dataset.
4. Theoretical Foundations
4.1 Time Complexity Analysis
| Method | Per Update | Total (\(n\) updates) |
|---|---|---|
| Naive (batch) | \(O(n)\) | \(O(n^2)\) |
| Online | \(O(1)\) | \(O(n)\) |
For streaming data with \(n\) observations, the naive approach performs \(O(n^2)\) total operations (each update recomputes from scratch), while the online algorithm performs only \(O(n)\): total work drops from quadratic to linear in the number of observations.
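The contrast can be sketched with two running-mean helpers (the function names are illustrative); both produce the same sequence of means, but the naive version re-sums the prefix at every step:

```python
import math

def running_means_naive(values):
    # Recompute the mean from scratch after each arrival: O(k) work at step k
    return [sum(values[:k]) / k for k in range(1, len(values) + 1)]

def running_means_online(values):
    # Update the previous mean in place: O(1) work at every step
    means, mean = [], 0.0
    for k, x in enumerate(values, start=1):
        mean += (x - mean) / k
        means.append(mean)
    return means

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
naive, online = running_means_naive(data), running_means_online(data)
assert all(math.isclose(a, b) for a, b in zip(naive, online))
```

The two sequences agree to floating-point precision, but over \(n\) arrivals the naive helper touches \(1 + 2 + \cdots + n = O(n^2)\) elements while the online helper touches \(n\).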
4.2 Space Complexity Analysis
| Method | Storage Required |
|---|---|
| Naive (batch) | \(O(n)\) — all data points |
| Online (mean) | \(O(1)\) — 2 values (\(\bar{x}_n\), \(n\)) |
| Online (variance) | \(O(1)\) — 3 values (\(\bar{x}_n\), \(S_n\), \(n\)) |
Online algorithms achieve constant space complexity, independent of dataset size. This is crucial for:
- Embedded systems: Limited memory resources
- Streaming applications: Cannot store entire data streams
- Large datasets: Data may not fit in memory
4.3 Numerical Stability
The naive approach of accumulating \(\sum x_i\) and \(\sum x_i^2\) can suffer from:
- Integer overflow: For integer types, large sums exceed representable range
- Floating-point overflow: For very large numbers, sums can exceed maximum representable value
- Loss of precision: Adding numbers of very different magnitudes loses precision
Online algorithms mitigate these issues:
- Incremental updates: Only process one value at a time, avoiding large accumulations
- Centered calculations: Welford's algorithm uses differences from the running mean, which tend to be smaller
- Early normalization: Mean updates normalize by \(n\), keeping values bounded
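The loss-of-precision failure mode is easy to demonstrate directly in double-precision floating point: once an accumulator exceeds \(2^{53}\), adding 1.0 to it is silently a no-op.

```python
# 2**53 is the last point where doubles have unit spacing; beyond it,
# adding 1.0 rounds straight back to the accumulator's current value.
big = 2.0 ** 53
assert big + 1.0 == big

# Repeated across a stream, the lost increments never show up in the sum:
total = 2.0 ** 53
for _ in range(1000):
    total += 1.0
assert total == 2.0 ** 53  # all thousand additions were absorbed
```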
4.4 Overflow and Precision Example
```python
# Naive approach: accumulating raw sums risks overflow in fixed-width
# types (e.g., 32-bit integers in C) and catastrophic cancellation in floats
def naive_variance(values):
    n = len(values)
    sum_x = sum(values)                   # Large raw accumulator
    sum_x2 = sum(x * x for x in values)   # Grows even faster
    mean = sum_x / n
    variance = (sum_x2 / n - mean * mean) * (n / (n - 1))
    return mean, variance

# Online approach: no large accumulators, works on centered differences
def online_variance_safe():
    n = 0
    mean = 0.0
    S = 0.0

    def update(x):
        nonlocal n, mean, S
        n += 1
        delta = x - mean
        mean += delta / n
        S += delta * (x - mean)
        return mean, S / (n - 1) if n > 1 else 0.0

    return update

# Data with a huge mean and tiny spread: sum_x2 / n and mean * mean are
# both ~1e20, so their subtraction cancels nearly all significant digits
data = [1e10 + x for x in (1.0, 2.0, 3.0, 4.0, 5.0)]

var_updater = online_variance_safe()
for x in data:
    mean, var = var_updater(x)

# Welford's centered update returns variance ≈ 2.5 (correct); the naive
# formula can return a badly wrong, even negative, value on this input
```
4.5 Accuracy Comparison
For well-conditioned data (values of similar magnitude), both methods produce identical results within floating-point precision. However, online methods are more stable for:
- Large datasets: Avoid accumulation errors
- Streaming data: Maintain accuracy over long streams
- Mixed scales: Better handling when values span multiple orders of magnitude
4.6 Mathematical Foundations
Online algorithms rely on recurrence relations and mathematical properties of statistical moments:
- Linearity of expectation: \(E[X+Y] = E[X] + E[Y]\) enables incremental mean updates
- Variance decomposition: \(\text{Var}(X) = E[X^2] - (E[X])^2\) relates to sum-of-squares
- Algebraic identities: Sum-of-squares updates derived from difference equations
For rigorous treatments, see Knuth (1997) for computational aspects and Welford (1962) for the original variance algorithm derivation.
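The variance decomposition above is easy to verify numerically on a small sample (population form, dividing by \(n\)):

```python
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

mean = sum(data) / n                              # E[X]
ex2 = sum(x * x for x in data) / n                # E[X^2]
var_direct = sum((x - mean) ** 2 for x in data) / n

# Var(X) = E[X^2] - (E[X])^2 holds up to floating-point rounding
assert abs((ex2 - mean * mean) - var_direct) < 1e-9
```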
4.7 Applications in Cybersecurity
Network Traffic Monitoring:
- Real-time baselines for anomaly detection (mean ± 3σ thresholds)
- Monitor thousands of connections without storing all historical data
- \(O(1)\) updates enable immediate anomaly detection with minimal latency
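As a sketch of the mean ± 3σ idea (the threshold, warm-up length, and traffic values are illustrative assumptions, not part of any particular IDS), Welford's recurrences slot directly into a streaming detector:

```python
import math

def make_anomaly_detector(k=3.0, warmup=10):
    """Flag observations more than k standard deviations from the running mean."""
    n, mean, S = 0, 0.0, 0.0

    def observe(x):
        nonlocal n, mean, S
        # Check against the baseline built from earlier observations
        anomalous = False
        if n >= warmup:
            std = math.sqrt(S / (n - 1))
            anomalous = abs(x - mean) > k * std
        # Welford update: fold the new observation into the baseline
        n += 1
        delta = x - mean
        mean += delta / n
        S += delta * (x - mean)
        return anomalous

    return observe

detector = make_anomaly_detector()
traffic = [100, 102, 99, 101, 98, 103, 100, 97, 102, 99, 500]
flags = [detector(x) for x in traffic]
# Only the final spike of 500 is flagged; the baseline values are not.
```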
Intrusion Detection Systems (IDS/IPS):
- Track statistical features of system calls, file access patterns, user behaviors
- Continuous learning as system behavior evolves
- Efficient operation on edge devices and resource-constrained sensors
Performance Monitoring:
- Live security dashboards with real-time statistics (response times, error rates)
- Rolling window statistics without buffering entire datasets
- Dynamic alert thresholds computed in real-time
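One caveat: exact rolling-window statistics do require buffering the window. A common buffer-free approximation for dashboards is the exponentially weighted moving average (EWMA), sketched below; the smoothing factor `alpha` is an illustrative choice.

```python
def ewma(alpha=0.1):
    """Exponentially weighted moving average: O(1) space, no window buffer."""
    avg = None

    def update(x):
        nonlocal avg
        # First observation seeds the average; later ones blend in with weight alpha
        avg = x if avg is None else (1 - alpha) * avg + alpha * x
        return avg

    return update

tracker = ewma()
for latency_ms in [100, 105, 98, 400, 102]:  # one spike at 400
    smoothed = tracker(latency_ms)
# The spike is dampened in `smoothed` rather than dominating it.
```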
Key Insight: Online vs Batch Algorithms
Batch algorithms require \(O(n)\) space and \(O(n)\) time per update, making them impractical for streaming data. Online algorithms achieve \(O(1)\) space and \(O(1)\) time per update while maintaining numerical stability and accuracy. This makes them essential for real-time cybersecurity monitoring, IoT devices, and any system where data arrives continuously and memory is limited.
References
- Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420.
- Knuth, D. E. (1997). The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd ed.). Addison-Wesley.
- Chan, T. F., Golub, G. H., & LeVeque, R. J. (1983). Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37(3), 242–247.