Statistics Homework 5 — Location & Dispersion

Overview: Descriptive Statistics Fundamentals

A fundamental task in statistics is to summarize a distribution using representative values. Two complementary aspects are essential:

  • Location (central tendency): where the values are centered along the measurement axis
  • Dispersion (spread): how much the values vary around that center

Together, location and dispersion provide a concise summary of a distribution's essential characteristics.

Why These Measures Matter

Location and dispersion measures are crucial across multiple domains.

This homework explores the variety of location and dispersion measures, their mathematical foundations, appropriate use cases, and practical applications through interactive computation.

1. Location Measures (Central Tendency)

Location measures describe where a distribution is "centered" along the measurement axis. Different measures capture different notions of "typical value," and the choice depends on data characteristics and analysis goals.

Formal Definition

For a sample \(x_1, x_2, \ldots, x_n\) of \(n\) observations, a location measure is a function \(L: \mathbb{R}^n \to \mathbb{R}\) that maps data to a single representative value, typically satisfying:

  • Translation invariance: \(L(x_1 + c, \ldots, x_n + c) = L(x_1, \ldots, x_n) + c\)
  • Scale equivariance: \(L(cx_1, \ldots, cx_n) = c \cdot L(x_1, \ldots, x_n)\) for \(c > 0\)

1.1 Arithmetic Mean

The arithmetic mean (often simply called the "mean") is the most familiar location measure, calculated as the sum of all values divided by their count: \[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

The arithmetic mean is the center of mass of the data points, treating each observation equally. It is optimal for minimizing the sum of squared deviations (least squares property) and is the expected value when data are viewed as equally likely outcomes.

When to use: The arithmetic mean is appropriate when:

  • Data are measured on an interval or ratio scale
  • The distribution is roughly symmetric and free of extreme outliers
  • Every observation should count equally

Limitations: The arithmetic mean is highly sensitive to outliers; a single extreme value can dramatically shift the mean.
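As a quick illustration, here is a minimal sketch using only Python's standard library; the response-time values are invented to show how one extreme value shifts the mean:

```python
from statistics import mean

# Invented response times in milliseconds; the last value is an outlier.
times_ms = [98, 102, 101, 97, 100, 250]

print(f"mean with outlier:    {mean(times_ms):.1f} ms")      # pulled upward by the 250 ms value
print(f"mean without outlier: {mean(times_ms[:-1]):.1f} ms")  # close to the typical value
```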

1.2 Weighted Mean

The weighted mean assigns different importance (weights) to observations: \[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \] where \(w_i \geq 0\) are the weights associated with each \(x_i\).

Weighted means are fundamental in survey sampling (where weights represent sampling probabilities), portfolio analysis (where weights are investment amounts), and any situation where observations have different reliabilities or represent different population sizes.

When to use:

  • Observations differ in importance, reliability, or the population size they represent
  • Survey data with sampling weights, or portfolio returns weighted by investment amounts

Special case: When all weights are equal, the weighted mean reduces to the arithmetic mean.
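A minimal sketch of the formula (Python standard library; the grades and credit weights are made up for illustration):

```python
# Weighted mean: sum(w_i * x_i) / sum(w_i)
grades  = [85, 92, 78]   # illustrative course grades
credits = [3, 4, 2]      # illustrative credit weights

weighted = sum(w * x for w, x in zip(credits, grades)) / sum(credits)
print(f"weighted mean: {weighted:.2f}")

# Sanity check: equal weights reduce to the ordinary arithmetic mean.
assert abs(sum(1 * x for x in grades) / len(grades) - sum(grades) / len(grades)) < 1e-12
```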

1.3 Geometric Mean

The geometric mean is defined as: \[ \bar{x}_g = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \] This is equivalent to exponentiating the arithmetic mean of logarithms: \[ \bar{x}_g = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln(x_i)\right) \]

The geometric mean is appropriate for multiplicative relationships and rates of change. It always produces a value less than or equal to the arithmetic mean (by the inequality of arithmetic and geometric means).

When to use:

  • Averaging growth rates, investment returns, or other multiplicative factors
  • Data that are analyzed on a logarithmic scale or span several orders of magnitude

Requirements: All values must be positive.
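A sketch of both forms of the definition (standard library; the growth factors are illustrative):

```python
import math
from statistics import geometric_mean

factors = [1.10, 0.95, 1.20]   # e.g., +10%, -5%, +20% yearly growth (invented)

g_lib = geometric_mean(factors)                                     # library call (Python 3.8+)
g_log = math.exp(sum(math.log(x) for x in factors) / len(factors))  # exp of the mean log
print(f"geometric mean: {g_lib:.4f} (via logs: {g_log:.4f})")
```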

1.4 Harmonic Mean

The harmonic mean is the reciprocal of the arithmetic mean of reciprocals: \[ \bar{x}_h = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} \]

The harmonic mean is useful for rates and ratios, particularly when dealing with averages of speeds or densities. It is always less than or equal to the geometric mean, which is less than or equal to the arithmetic mean (for positive data).

When to use:

  • Averaging rates defined per fixed unit, such as speeds over equal distances
  • Averaging ratios or densities where the denominator varies across observations

Requirements: All values must be strictly positive (a zero value would make its reciprocal undefined).
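The classic average-speed example as a short sketch (standard library):

```python
from statistics import harmonic_mean

# Equal distances driven at 60 km/h and 40 km/h: the correct average speed
# is the harmonic mean (48 km/h), not the arithmetic mean (50 km/h).
speeds = [60, 40]

print(harmonic_mean(speeds))                      # 48.0
print(len(speeds) / sum(1 / v for v in speeds))   # same value from the definition
```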

1.5 Probabilistic Mean (Expected Value)

For a discrete random variable \(X\) with probability mass function \(p_X(x_i) = P(X = x_i)\), the expected value (or probabilistic mean) is: \[ E[X] = \sum_{i} x_i \cdot p_X(x_i) \]

For a continuous random variable with probability density function \(f(x)\), the expected value is: \[ E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \]

The expected value is the theoretical "center of mass" of a probability distribution, representing the long-run average value if the experiment were repeated infinitely many times. It is fundamental to probability theory and statistical inference.

When to use:

  • The probability distribution is known or modeled rather than estimated from a sample
  • Theoretical work: defining population parameters, long-run averages, or expected losses

Connection to sample mean: The sample arithmetic mean is an unbiased estimator of the population expected value under appropriate sampling assumptions.
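A small sketch of the discrete case, using a fair six-sided die so the probabilities are exact:

```python
# E[X] = sum over i of x_i * P(X = x_i)
values = [1, 2, 3, 4, 5, 6]
probs  = [1 / 6] * 6                   # fair die

assert abs(sum(probs) - 1.0) < 1e-12   # a PMF must sum to 1
expected = sum(x * p for x, p in zip(values, probs))
print(f"E[X] = {expected:.2f}")        # 3.50
```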

1.6 Trimmed Mean

The trimmed mean removes a specified percentage of observations from both ends of the ordered data before calculating the arithmetic mean of the remaining values. For a \(k\%\) trimmed mean, we discard the smallest \(k\%\) and largest \(k\%\) of observations: \[ \bar{x}_{\text{trim}, k} = \frac{1}{n - 2\lfloor kn/100 \rfloor}\sum_{i=\lfloor kn/100 \rfloor + 1}^{n - \lfloor kn/100 \rfloor} x_{(i)} \] where \(x_{(i)}\) denotes the \(i\)-th order statistic.

Trimmed means provide robustness to outliers while retaining more information than the median. They are particularly useful when outliers are present but we want a location measure that uses more of the data than the median does.

When to use:

  • Outliers or heavy tails are present in the data
  • A robust location measure is wanted that still uses more of the data than the median

Common choices: 10%, 20%, or 25% trimming; the 25% trimmed mean is sometimes called the midmean.
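A sketch of the formula above (standard library only; the sample values are invented, and SciPy users could reach for scipy.stats.trim_mean instead):

```python
import math
from statistics import mean

def trimmed_mean(data, k):
    """k% trimmed mean: drop the floor(k*n/100) smallest and largest values."""
    xs = sorted(data)
    g = math.floor(k * len(xs) / 100)
    return mean(xs[g:len(xs) - g])

sample = [3, 5, 6, 6, 7, 8, 9, 10, 12, 95]   # one gross outlier (invented data)
print(trimmed_mean(sample, 10))              # drops the 3 and the 95 before averaging
```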

1.7 Winsorized Mean

The Winsorized mean replaces extreme values (rather than removing them) with the values at the trimming thresholds, then computes the arithmetic mean: \[ \bar{x}_{\text{win}, k} = \frac{1}{n}\left(\lfloor kn/100 \rfloor \cdot x_{(\lfloor kn/100 \rfloor + 1)} + \sum_{i=\lfloor kn/100 \rfloor + 1}^{n - \lfloor kn/100 \rfloor} x_{(i)} + \lfloor kn/100 \rfloor \cdot x_{(n - \lfloor kn/100 \rfloor)}\right) \]

Winsorization reduces the influence of outliers while preserving the sample size, making it useful when we want a robust estimate but need to maintain the original count of observations.

When to use:

  • Outliers are present but the original sample size must be preserved
  • A robust location estimate is needed without discarding any observations
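A sketch following the same notation (invented data; SciPy's scipy.stats.mstats.winsorize is a library alternative):

```python
import math
from statistics import mean

def winsorized_mean(data, k):
    """k% Winsorized mean: clamp the extremes to the threshold order statistics."""
    xs = sorted(data)
    n = len(xs)
    g = math.floor(k * n / 100)
    lo, hi = xs[g], xs[n - g - 1]                    # x_(g+1) and x_(n-g) in 1-based notation
    return mean(min(max(x, lo), hi) for x in xs)     # sample size n is preserved

sample = [3, 5, 6, 6, 7, 8, 9, 10, 12, 95]
print(winsorized_mean(sample, 10))   # the 3 is replaced by 5 and the 95 by 12
```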

1.8 Median

The median is the value that splits the ordered data in half: \[ \text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases} \]

The median is the most robust location measure, with a breakdown point of 50% (meaning up to half the data can be outliers without affecting the median). It minimizes the sum of absolute deviations.

When to use:

  • Skewed distributions or data containing outliers (e.g., incomes, response times)
  • Ordinal data, where only the ordering of values is meaningful
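A short sketch of the odd/even cases and of the robustness claim (standard library; invented values):

```python
from statistics import median

print(median([7, 1, 5]))       # odd n: the middle of 1, 5, 7 -> 5
print(median([7, 1, 5, 3]))    # even n: average of 3 and 5   -> 4.0

# Robustness: replacing the largest value with a gross outlier barely matters.
print(median([1, 3, 5, 7]), median([1, 3, 5, 10_000]))   # both 4.0
```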

1.9 Mode

The mode is the most frequently occurring value in a dataset. For continuous data, it is often defined as the value at which the probability density function (PDF) or probability mass function (PMF) reaches its maximum.

The mode is the only location measure applicable to nominal (categorical) data. A distribution may be unimodal (one peak), bimodal (two peaks), or multimodal (several peaks).

When to use:

  • Nominal (categorical) data, where means and medians are undefined
  • Identifying the most common value or detecting multiple clusters (multimodality)
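A sketch showing the mode on nominal data and a bimodal sample (standard library; the values are made up):

```python
from statistics import mode, multimode

colors = ["red", "blue", "red", "green", "blue", "red"]   # nominal data
print(mode(colors))        # 'red', the most frequent category

counts = [2, 3, 3, 5, 5, 8]
print(multimode(counts))   # [3, 5] -- the sample is bimodal
```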

2. Dispersion Measures (Spread & Variability)

While location measures describe where a distribution is centered, dispersion measures describe how spread out the data are around that center. Dispersion answers the critical question: How representative is the location measure?

Cybersecurity Application

Dispersion is essential for anomaly detection: understanding baseline variability enables setting appropriate detection thresholds. A mean response time of 100ms has very different implications when:

  • σ = 5ms: Highly predictable, anomalies easily detected
  • σ = 50ms: High variability, requires wider thresholds to avoid false positives

2.1 Variance

The variance is the average squared deviation from the mean. For a sample \(x_1, x_2, \ldots, x_n\): \[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \] where \(\bar{x}\) is the sample mean. The \(n-1\) denominator (Bessel's correction) makes this an unbiased estimator of the population variance.

The population variance (for a complete population) is: \[ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2 \] where \(\mu\) is the population mean.

Variance measures spread in squared units, making it difficult to interpret directly. However, it has desirable mathematical properties (additivity for independent random variables) and is fundamental to statistical theory.

When to use:

  • Theoretical derivations and models, where additivity for independent variables is needed
  • As an intermediate quantity; for reporting, the standard deviation (same units as the data) is usually preferred
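A sketch contrasting the sample and population formulas (standard library; illustrative values):

```python
from statistics import variance, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, mean = 5

print(variance(data))    # sample variance, n-1 denominator (Bessel's correction)
print(pvariance(data))   # population variance, n denominator
```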

2.2 Standard Deviation

The standard deviation is the square root of the variance: \[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Standard deviation has the same units as the original data, making it much more interpretable than variance. It represents the "typical" distance of observations from the mean.

Interpretation: For normally distributed data, approximately 68% of values fall within one standard deviation of the mean (\(\bar{x} \pm s\)), 95% within two standard deviations (\(\bar{x} \pm 2s\)), and 99.7% within three standard deviations (\(\bar{x} \pm 3s\)). This is the empirical rule (68-95-99.7 rule).

When to use:

  • Approximately symmetric, roughly normal data without extreme outliers
  • Reporting spread in the original units alongside the mean (e.g., \(\bar{x} \pm s\))
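A small simulation sketch of the empirical rule (standard library; the distribution parameters are arbitrary):

```python
import random
from statistics import mean, stdev

random.seed(0)
data = [random.gauss(100, 5) for _ in range(10_000)]   # simulated, roughly normal sample

m, s = mean(data), stdev(data)
share = sum(m - s <= x <= m + s for x in data) / len(data)
print(f"mean={m:.1f}, sd={s:.2f}, within one sd: {share:.3f}")   # roughly 0.68
```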

2.3 Mean Absolute Deviation (MAD)

The mean absolute deviation is the average absolute deviation from the mean: \[ \text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}| \]

MAD is more robust to outliers than standard deviation because it uses absolute values rather than squares. It represents the average distance observations are from the mean, making it intuitively interpretable.

When to use:

  • A spread measure that is directly interpretable as an average distance from the mean
  • Data with moderate outliers, where squaring deviations would exaggerate their influence
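A minimal sketch of the definition (standard library; small invented sample):

```python
from statistics import mean

def mean_abs_deviation(data):
    """Average absolute distance of the observations from their mean."""
    m = mean(data)
    return mean(abs(x - m) for x in data)

print(mean_abs_deviation([2, 4, 6, 8]))   # mean 5; distances 3, 1, 1, 3 -> 2.0
```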

2.4 Range

The range is the difference between the maximum and minimum values: \[ R = x_{\max} - x_{\min} \]

Range is the simplest dispersion measure but is highly sensitive to outliers. A single extreme value can dramatically inflate the range, making it unrepresentative of typical variability.

When to use:

  • Quick, rough checks of spread and of plausible minimum/maximum values
  • Very small samples where a simple summary suffices

Limitations: Highly sensitive to outliers; ignores the distribution of values between extremes.

2.5 Interquartile Range (IQR)

The interquartile range is the difference between the third quartile (\(Q_3\)) and first quartile (\(Q_1\)): \[ \text{IQR} = Q_3 - Q_1 \]

The IQR contains the middle 50% of the data, making it robust to outliers. Quartiles are calculated by ordering the data and finding values that divide it into quarters.

IQR is often used to identify outliers: observations beyond \(Q_1 - 1.5 \times \text{IQR}\) or \(Q_3 + 1.5 \times \text{IQR}\) are considered potential outliers (Tukey's method).

When to use:

  • Skewed data or data with outliers, typically reported alongside the median
  • Outlier screening via Tukey's fences, as in the sketch below
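A sketch of the IQR and Tukey's fences (standard library; the data are invented, and note that quartile conventions differ slightly between software packages):

```python
from statistics import quantiles

data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 42]      # 42 looks suspicious

q1, _, q3 = quantiles(data, n=4)            # quartiles ('exclusive' method by default)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper}), outliers={outliers}")
```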

2.6 Coefficient of Variation (CV)

The coefficient of variation is the ratio of standard deviation to the mean: \[ CV = \frac{s}{\bar{x}} \]

CV is a dimensionless measure, often expressed as a percentage (multiply by 100), that allows comparison of variability across different scales or units. It answers: "What percentage of the mean is the standard deviation?"

When to use:

  • Comparing variability of quantities measured on different scales or in different units
  • Ratio-scale data with a meaningful, positive mean

Requirements: Mean must be non-zero. Best for ratio-scale data.
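A short sketch (standard library; invented latency values):

```python
from statistics import mean, stdev

latencies_ms = [95, 102, 99, 110, 97, 104]   # invented response times
cv = stdev(latencies_ms) / mean(latencies_ms)
print(f"CV = {cv:.3f} ({cv:.1%})")           # dimensionless, comparable across units
```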

2.7 Median Absolute Deviation (MAD about Median)

The median absolute deviation about the median is a robust dispersion measure: \[ \text{MAD}_{\text{median}} = \text{median}(|x_i - \text{median}(x)|) \]

This measures typical deviation from the median using the median itself, making it highly robust to outliers. It has a breakdown point of 50%, meaning up to half the data can be outliers without affecting the measure.

When to use:

  • Heavily contaminated data or data with many outliers
  • A robust scale estimate to pair with the median, or to standardize data robustly
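A sketch of this robust scale estimate (standard library; the constant 1.4826 ≈ 1/0.6745 links it to σ for normal data, see Section 4.4):

```python
from statistics import median

def mad_about_median(data):
    """Median absolute deviation about the median."""
    m = median(data)
    return median(abs(x - m) for x in data)

data = [2, 3, 3, 4, 5, 6, 100]          # one gross outlier (invented)
print(mad_about_median(data))           # 1.0 -- essentially unaffected by the 100
print(1.4826 * mad_about_median(data))  # robust estimate of sigma for ~normal data
```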

2.8 Standard Error of the Mean (SEM)

The standard error of the mean quantifies the variability of the sample mean itself: \[ \text{SEM} = \frac{s}{\sqrt{n}} = \frac{\text{standard deviation}}{\sqrt{\text{sample size}}} \]

SEM describes how much the sample mean would vary if we repeated the sampling process many times. It is crucial for confidence intervals and hypothesis testing. Note that SEM decreases as sample size increases, reflecting that larger samples provide more precise estimates.

When to use:

  • Quantifying the precision of the sample mean (confidence intervals, hypothesis tests)
  • Reporting uncertainty about the mean rather than the spread of individual observations
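A short sketch (standard library; invented measurements):

```python
import math
from statistics import stdev

data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]   # invented measurements
sem = stdev(data) / math.sqrt(len(data))
print(f"SEM = {sem:.3f}")   # shrinks as the sample size grows
```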

2.9 Percentile Ranges

Various percentile ranges measure spread using different portions of the distribution:

  • Interdecile range: \(P_{90} - P_{10}\), the spread of the central 80% of the data
  • \(P_{95} - P_{5}\) range: the spread of the central 90% of the data
  • The IQR (\(P_{75} - P_{25}\)) is the special case covering the central 50%

These measures are robust and can be customized to exclude specific tail percentages, making them useful for trimmed or robust analysis.
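A sketch of one such range (standard library; the 10th and 90th percentiles are read off the deciles):

```python
from statistics import quantiles

data = list(range(1, 101))            # illustrative data: the integers 1..100
deciles = quantiles(data, n=10)       # cut points P10, P20, ..., P90
p10, p90 = deciles[0], deciles[-1]
print(f"10-90 percentile range: {p90 - p10:.1f}")
```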

3. Interactive Calculator

Compute all location and dispersion measures for your own data. Enter values as a comma-separated list, and optionally provide weights for weighted calculations.

Explore how different measures produce different results based on data characteristics, and observe how dispersion quantifies the representativeness of location measures.

4. Theoretical Foundations

Location and dispersion measures are deeply connected to probability theory, optimization, and statistical inference.

4.1 Mathematical Properties

Different location measures optimize different criteria:

  • Arithmetic mean: minimizes the sum of squared deviations \(\sum_i (x_i - c)^2\)
  • Median: minimizes the sum of absolute deviations \(\sum_i |x_i - c|\)
  • Mode: maximizes the probability mass or density (the most frequent value)

4.2 Inequality of Means

For positive numbers \(x_1, \ldots, x_n\), the means satisfy (with equality only when all values are equal): \[ \bar{x}_h \leq \bar{x}_g \leq \bar{x} \] That is, harmonic mean ≤ geometric mean ≤ arithmetic mean. This ordering reflects that harmonic and geometric means give more weight to smaller values.
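A quick numeric check of this ordering (standard library; any positive values can be substituted):

```python
from statistics import harmonic_mean, geometric_mean, mean

xs = [2.0, 8.0, 32.0]   # arbitrary positive values
h, g, a = harmonic_mean(xs), geometric_mean(xs), mean(xs)
print(h <= g <= a, round(h, 3), round(g, 3), round(a, 3))   # True 4.571 8.0 14.0
```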

4.3 Sensitivity to Outliers

Location measures:

  • Highly sensitive: arithmetic mean (a single extreme value can shift it substantially)
  • Moderately robust: trimmed and Winsorized means
  • Highly robust: median (breakdown point of 50%)

Dispersion measures:

  • Highly sensitive: range, variance, standard deviation
  • Moderately robust: mean absolute deviation
  • Highly robust: IQR and the median absolute deviation (breakdown point of 50%)

4.4 Mathematical Relationships

For normal distributions: \[ \text{MAD} \approx 0.7979 \times \sigma, \quad \text{IQR} \approx 1.349 \times \sigma, \quad \text{MAD}_{\text{median}} \approx 0.6745 \times \sigma \]

4.5 Chebyshev's Inequality

For any distribution (not just normal), Chebyshev's inequality provides a bound: \[ P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} \]

This means that, regardless of distribution shape, at least 75% of the data fall within 2σ of the mean, at least 88.9% within 3σ, and at least 93.75% within 4σ.

4.6 Measurement Scales

The choice of location and dispersion measures depends on the measurement scale:

  • Nominal: mode only
  • Ordinal: mode and median; spread described via ranks or the range of categories
  • Interval: mean, median, mode; variance, standard deviation, IQR (the CV is not meaningful)
  • Ratio: all measures, including the geometric and harmonic means and the CV

Key Insight: Representativeness of the Mean

The fundamental question in descriptive statistics is: How representative is the mean? This depends critically on dispersion:

  • Low dispersion: Mean is highly representative (most observations cluster near it)
  • High dispersion: Mean is less representative (many observations deviate substantially)

Rule of thumb: CV < 0.15 suggests low variability (mean is representative), CV > 0.30 suggests high variability (consider median + IQR instead).

4.7 Applications in Cybersecurity

Anomaly Detection: establish the baseline location (mean or median) and dispersion (standard deviation, IQR, or MAD) of metrics such as login rates or response times, then flag observations that deviate from the baseline by more than a chosen number of dispersion units.

Risk Assessment: the dispersion of loss or downtime estimates quantifies uncertainty; two threats with the same expected impact but different variability may call for different mitigations.

Performance Monitoring: track the location (median latency) and dispersion (IQR, percentile ranges) of system metrics so that alert thresholds reflect normal variability rather than arbitrary fixed limits.
