Statistics Homework 5 — Location & Dispersion
Overview: Descriptive Statistics Fundamentals
A fundamental task in statistics is to summarize a distribution using representative values. Two complementary aspects are essential:
- Location (Central Tendency): Where is the distribution centered? What is the typical value?
- Dispersion (Spread): How spread out are the data? How representative is the center?
Together, location and dispersion provide a comprehensive summary of any distribution's essential characteristics.
Why These Measures Matter
Location and dispersion measures are crucial across multiple domains:
- Cybersecurity: Characterize baseline patterns and detect anomalies through variability analysis
- Risk Analysis: Quantify expected outcomes (location) and uncertainty (dispersion)
- Performance Monitoring: Establish normal operation ranges and identify unusual variability
- Quality Control: Define acceptable variation and identify out-of-spec processes
This homework surveys the main location and dispersion measures, their mathematical foundations, appropriate use cases, and practical applications through interactive computation.
1. Location Measures (Central Tendency)
Location measures describe where a distribution is "centered" along the measurement axis. Different measures capture different notions of "typical value," and the choice depends on data characteristics and analysis goals.
Formal Definition
For a sample \(x_1, x_2, \ldots, x_n\) of \(n\) observations, a location measure is a function \(L: \mathbb{R}^n \to \mathbb{R}\) that maps data to a single representative value, typically satisfying:
- Translation invariance: \(L(x_1 + c, \ldots, x_n + c) = L(x_1, \ldots, x_n) + c\)
- Scale equivariance: \(L(cx_1, \ldots, cx_n) = c \cdot L(x_1, \ldots, x_n)\) for \(c > 0\)
1.1 Arithmetic Mean
The arithmetic mean (often simply called the "mean") is the most familiar location measure, calculated as the sum of all values divided by their count: \[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
The arithmetic mean is the center of mass of the data points, treating each observation equally. It is optimal for minimizing the sum of squared deviations (least squares property) and is the expected value when data are viewed as equally likely outcomes.
When to use: The arithmetic mean is appropriate when:
- Data are symmetrically distributed (or close to symmetric)
- All observations should have equal influence
- The variable is measured on an interval or ratio scale
- There are no extreme outliers significantly skewing the distribution
Limitations: The arithmetic mean is highly sensitive to outliers; a single extreme value can dramatically shift the mean.
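This sensitivity is easy to demonstrate with a quick sketch using Python's standard `statistics` module (the response-time values here are invented for illustration):

```python
from statistics import mean

# Hypothetical response-time sample (ms)
times = [98, 102, 95, 105, 100]
baseline = mean(times)         # the balance point of the data: 100

# A single extreme value drags the mean far from the bulk of the data
shifted = mean(times + [500])  # jumps to about 166.7
```

Five of the six observations are near 100, yet one outlier moves the mean by more than 60%.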
1.2 Weighted Mean
The weighted mean assigns different importance (weights) to observations: \[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \] where \(w_i \geq 0\) are the weights associated with each \(x_i\).
Weighted means are fundamental in survey sampling (where weights represent sampling probabilities), portfolio analysis (where weights are investment amounts), and any situation where observations have different reliabilities or represent different population sizes.
When to use:
- Observations represent different sample sizes or populations
- Some observations are more reliable or important than others
- Aggregating data from heterogeneous groups with different sizes
- Accounting for sampling design (e.g., stratified sampling)
Special case: When all weights are equal, the weighted mean reduces to the arithmetic mean.
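A minimal sketch of both the general and equal-weight cases (the grade components and weights are made up):

```python
# Weighted mean of hypothetical course components
scores = [85, 92, 78]        # exam, homework, project
weights = [0.5, 0.3, 0.2]    # relative importance

grade = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
# 0.5*85 + 0.3*92 + 0.2*78 = 85.7

# With all weights equal, the formula reduces to the arithmetic mean
equal = sum(x for x in scores) / len(scores)   # 85.0
```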
1.3 Geometric Mean
The geometric mean is defined as: \[ \bar{x}_g = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \] This is equivalent to exponentiating the arithmetic mean of logarithms: \[ \bar{x}_g = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln(x_i)\right) \]
The geometric mean is appropriate for multiplicative relationships and rates of change. It always produces a value less than or equal to the arithmetic mean (by the inequality of arithmetic and geometric means).
When to use:
- Data represent rates of change, ratios, or percentages (e.g., growth rates, return rates)
- Variables are inherently multiplicative
- Working with data spanning multiple orders of magnitude
- Calculating average ratios or proportions
Requirements: All values must be positive.
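A short sketch using `statistics.geometric_mean` (Python 3.8+), with invented annual returns:

```python
from statistics import geometric_mean, mean

# Hypothetical annual growth factors: +10%, -5%, +20%
factors = [1.10, 0.95, 1.20]

g = geometric_mean(factors)   # equivalent constant growth factor per year
a = mean(factors)             # always >= g, by the AM-GM inequality
```

Compounding at the constant factor `g` for three years reproduces the same total growth as the three varying factors, which is exactly why the geometric mean is the right "average" for rates of change.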
1.4 Harmonic Mean
The harmonic mean is the reciprocal of the arithmetic mean of reciprocals: \[ \bar{x}_h = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} \]
The harmonic mean is useful for rates and ratios, particularly when dealing with averages of speeds or densities. It is always less than or equal to the geometric mean, which is less than or equal to the arithmetic mean (for positive data).
When to use:
- Calculating average rates (e.g., average speed over equal distances)
- Working with ratios where the denominator varies (e.g., price per unit when quantities differ)
- Financial calculations involving P/E ratios or similar metrics
- Harmonic progression contexts
Requirements: All values must be positive; a zero value makes its reciprocal undefined.
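The classic average-speed example, sketched with the standard library (the speeds are invented):

```python
from statistics import harmonic_mean

# Round trip over two equal-distance legs: 60 km/h out, 30 km/h back.
# The arithmetic mean (45) overstates the average speed; total distance
# divided by total time gives 2 / (1/60 + 1/30) = 40 km/h.
avg_speed = harmonic_mean([60, 30])
```

The slower leg takes twice as long, so it dominates the time-weighted average, which is why the harmonic mean sits below the arithmetic mean.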
1.5 Probabilistic Mean (Expected Value)
For a discrete random variable \(X\) with probability mass function \(p_X(x_i) = P(X = x_i)\), the expected value (or probabilistic mean) is: \[ E[X] = \sum_{i} x_i \cdot p_X(x_i) \]
For a continuous random variable with probability density function \(f(x)\), the expected value is: \[ E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \]
The expected value is the theoretical "center of mass" of a probability distribution, representing the long-run average value if the experiment were repeated infinitely many times. It is fundamental to probability theory and statistical inference.
When to use:
- Working with probability distributions (theoretical or empirical)
- Calculating expected outcomes in risk analysis or decision theory
- Defining parameters of statistical models
- Computing theoretical moments of distributions
Connection to sample mean: The sample arithmetic mean is an unbiased estimator of the population expected value under appropriate sampling assumptions.
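For a discrete distribution, the definition translates directly into a weighted sum; here is a sketch with a made-up incident-cost PMF:

```python
# Expected value of a hypothetical discrete loss distribution
outcomes = [0, 10_000, 100_000]   # incident cost
probs    = [0.90, 0.09, 0.01]     # P(cost)

assert abs(sum(probs) - 1.0) < 1e-12   # sanity check: a valid PMF
ev = sum(x * p for x, p in zip(outcomes, probs))
# E[X] = 0*0.90 + 10000*0.09 + 100000*0.01 = 1900
```

Note that the expected loss (1900) is a value the variable never actually takes; it is the long-run average, not a typical single outcome.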
1.6 Trimmed Mean
The trimmed mean removes a specified percentage of observations from both ends of the ordered data before calculating the arithmetic mean of the remaining values. For a \(k\%\) trimmed mean, we discard the smallest \(k\%\) and largest \(k\%\) of observations: \[ \bar{x}_{\text{trim}, k} = \frac{1}{n - 2\lfloor kn/100 \rfloor}\sum_{i=\lfloor kn/100 \rfloor + 1}^{n - \lfloor kn/100 \rfloor} x_{(i)} \] where \(x_{(i)}\) denotes the \(i\)-th order statistic.
Trimmed means provide robustness to outliers while retaining more information than the median. They are particularly useful when outliers are present but we want a location measure that uses more of the data than the median does.
When to use:
- Data contain outliers but we want to use more than just the median
- Need a robust estimate that is less sensitive than the arithmetic mean
- Working with skewed distributions where outliers are expected
- Combining robustness with efficiency (using more data than median)
Common choices: 10%, 20%, or 25% trimming are common; the 25% trimmed mean is sometimes called the midmean.
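A hand-rolled sketch with the standard library only (the data are invented; library implementations such as SciPy's `trim_mean` handle edge cases more carefully):

```python
from statistics import mean

def trimmed_mean(data, k):
    """k% trimmed mean: drop the floor(k*n/100) smallest and largest values."""
    xs = sorted(data)
    g = (k * len(xs)) // 100
    return mean(xs[g:len(xs) - g])

# One gross outlier (900) barely matters at 10% trimming:
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 900]
robust = trimmed_mean(data, 10)   # drops 12 and 900, leaving 16.75
```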
1.7 Winsorized Mean
The Winsorized mean replaces extreme values (rather than removing them) with the values at the trimming thresholds, then computes the arithmetic mean: \[ \bar{x}_{\text{win}, k} = \frac{1}{n}\left(\lfloor kn/100 \rfloor \cdot x_{(\lfloor kn/100 \rfloor + 1)} + \sum_{i=\lfloor kn/100 \rfloor + 1}^{n - \lfloor kn/100 \rfloor} x_{(i)} + \lfloor kn/100 \rfloor \cdot x_{(n - \lfloor kn/100 \rfloor)}\right) \]
Winsorization reduces the influence of outliers while preserving the sample size, making it useful when we want a robust estimate but need to maintain the original count of observations.
When to use:
- Similar situations as trimmed mean, but need to preserve sample size
- Want robustness while maintaining all \(n\) observations
- Computing robust variance estimates alongside location
- Outlier treatment that doesn't discard information completely
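The clamping step can be sketched in a few lines (same invented data as the trimmed-mean discussion above, so the two approaches are directly comparable):

```python
from statistics import mean

def winsorized_mean(data, k):
    """k% Winsorized mean: clamp the g extreme values on each side,
    where g = floor(k*n/100)."""
    xs = sorted(data)
    g = (k * len(xs)) // 100
    lo, hi = xs[g], xs[len(xs) - 1 - g]
    return mean(min(max(x, lo), hi) for x in xs)

# 12 is pulled up to 14 and 900 down to 20; all ten observations kept
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 900]
robust = winsorized_mean(data, 10)   # 16.8
```

Unlike trimming, the result is still an average of \(n\) values, which matters when downstream formulas (e.g., variance estimates) assume the full sample size.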
1.8 Median
The median is the value that splits the ordered data in half: \[ \text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases} \]
The median is the most robust location measure, with a breakdown point of 50% (meaning up to half the data can be outliers without affecting the median). It minimizes the sum of absolute deviations.
When to use:
- Data are highly skewed or contain many outliers
- Robustness is more important than efficiency
- Working with ordinal data (the arithmetic mean, by contrast, requires an interval or ratio scale)
- Need a measure that represents the "middle" of ordered data
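The robustness contrast with the mean in one short sketch (invented latencies):

```python
from statistics import mean, median

latencies = [95, 98, 100, 102, 5000]   # one wild outlier
med = median(latencies)   # 100: completely unaffected by the 5000
avg = mean(latencies)     # 1079: dragged toward the outlier
```

Here four of five observations sit near 100, and only the median reports that.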
1.9 Mode
The mode is the most frequently occurring value in a dataset. For continuous data, it is often defined as the value at which the probability density function (PDF) or probability mass function (PMF) reaches its maximum.
The mode is the only location measure applicable to nominal (categorical) data. A distribution can be unimodal (one peak), bimodal (two peaks), or multimodal (multiple peaks).
When to use:
- Categorical or nominal data
- Finding the "typical" value in frequency-based contexts
- Identifying peaks or clusters in distributions
- When the most common value is of primary interest
2. Dispersion Measures (Spread & Variability)
While location measures describe where a distribution is centered, dispersion measures describe how spread out the data are around that center. Dispersion answers the critical question: How representative is the location measure?
- Low dispersion: Values cluster tightly around the center; the mean is highly representative
- High dispersion: Values are widely scattered; individual observations may differ substantially from the mean
Cybersecurity Application
Dispersion is essential for anomaly detection: understanding baseline variability enables setting appropriate detection thresholds. A mean response time of 100ms has very different implications when:
- σ = 5ms: Highly predictable, anomalies easily detected
- σ = 50ms: High variability, requires wider thresholds to avoid false positives
2.1 Variance
The variance is the average squared deviation from the mean. For a sample \(x_1, x_2, \ldots, x_n\): \[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \] where \(\bar{x}\) is the sample mean. The \(n-1\) denominator (Bessel's correction) makes this an unbiased estimator of the population variance.
The population variance (for a complete population) is: \[ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2 \] where \(\mu\) is the population mean.
Variance measures spread in squared units, making it difficult to interpret directly. However, it has desirable mathematical properties (additivity for independent random variables) and is fundamental to statistical theory.
When to use:
- Theoretical analysis and mathematical derivations
- Computing other statistics (standard deviation, standard error)
- Analysis of variance (ANOVA) and regression
- When squared deviations are meaningful (e.g., squared errors)
2.2 Standard Deviation
The standard deviation is the square root of the variance: \[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Standard deviation has the same units as the original data, making it much more interpretable than variance. It represents the "typical" distance of observations from the mean.
Interpretation: For normally distributed data, approximately 68% of values fall within one standard deviation of the mean (\(\bar{x} \pm s\)), 95% within two standard deviations (\(\bar{x} \pm 2s\)), and 99.7% within three standard deviations (\(\bar{x} \pm 3s\)). This is the empirical rule (68-95-99.7 rule).
When to use:
- Describing spread in interpretable units
- Comparing variability across different groups
- Setting thresholds for anomaly detection
- Understanding typical deviation from the mean
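Putting the empirical rule to work for thresholding can be sketched as follows (baseline values invented; the mean + 3σ rule mirrors the cybersecurity example above):

```python
from statistics import mean, stdev

# Hypothetical baseline of response times (ms)
times = [96, 98, 99, 100, 100, 101, 102, 104]
m, s = mean(times), stdev(times)   # m = 100, s = sqrt(6) ≈ 2.45

# Mean + 3-sigma anomaly threshold: under normality, ~99.7% of
# baseline observations fall below m + 3s
threshold = m + 3 * s

def is_anomalous(t):
    return t > threshold
```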
2.3 Mean Absolute Deviation (MAD)
The mean absolute deviation is the average absolute deviation from the mean: \[ \text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}| \]
MAD is more robust to outliers than standard deviation because it uses absolute values rather than squares. It represents the average distance observations are from the mean, making it intuitively interpretable.
When to use:
- When robustness to outliers is important
- Simple, intuitive measure of average deviation
- Alternative to standard deviation for skewed distributions
- When absolute deviations are more meaningful than squared deviations
2.4 Range
The range is the difference between the maximum and minimum values: \[ R = x_{\max} - x_{\min} \]
Range is the simplest dispersion measure but is highly sensitive to outliers. A single extreme value can dramatically inflate the range, making it unrepresentative of typical variability.
When to use:
- Quick, intuitive assessment of overall spread
- When extreme values are of interest
- Initial exploratory data analysis
- Understanding the full extent of the data
Limitations: Highly sensitive to outliers; ignores the distribution of values between extremes.
2.5 Interquartile Range (IQR)
The interquartile range is the difference between the third quartile (\(Q_3\)) and first quartile (\(Q_1\)): \[ \text{IQR} = Q_3 - Q_1 \]
The IQR contains the middle 50% of the data, making it robust to outliers. Quartiles are calculated by ordering the data and finding values that divide it into quarters.
IQR is often used to identify outliers: observations beyond \(Q_1 - 1.5 \times \text{IQR}\) or \(Q_3 + 1.5 \times \text{IQR}\) are considered potential outliers (Tukey's method).
When to use:
- Robust measure of spread for skewed or outlier-prone distributions
- Outlier detection (boxplots use IQR)
- Complementing median-based analysis
- When extreme values should be ignored
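Quartile conventions differ between packages; this sketch uses `statistics.quantiles` (Python 3.8+) with its default "exclusive" method, on invented data, to apply Tukey's fences:

```python
from statistics import quantiles

data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 50]
q1, _, q3 = quantiles(data, n=4)    # default 'exclusive' convention
iqr = q3 - q1

# Tukey's fences for flagging potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]   # [50]
```

With a different quartile method (e.g., "inclusive", or NumPy's default interpolation) the fences shift slightly, so report which convention you used.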
2.6 Coefficient of Variation (CV)
The coefficient of variation is the ratio of standard deviation to the mean: \[ CV = \frac{s}{\bar{x}} \]
CV is a dimensionless measure, often reported as a percentage, that allows comparison of variability across different scales or units. It answers: "What percentage of the mean is the standard deviation?"
When to use:
- Comparing variability across different units or scales
- Assessing relative variability independent of magnitude
- Quality control and process capability analysis
- When mean values differ substantially across groups
Requirements: Mean must be non-zero. Best for ratio-scale data.
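A sketch comparing two made-up metrics whose raw standard deviations are similar but whose relative variability is not:

```python
from statistics import mean, stdev

def cv(xs):
    """Coefficient of variation: spread relative to the mean, unit-free."""
    return stdev(xs) / mean(xs)

# Two hypothetical metrics on incomparable scales
latency_ms = [95, 98, 100, 102, 105]
payload_kb = [0.4, 1.2, 2.1, 5.0, 9.8]

stable, volatile = cv(latency_ms), cv(payload_kb)
# Both have s near 3.8, but latency varies ~4% of its mean
# while payload varies more than 100% of its mean.
```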
2.7 Median Absolute Deviation (MAD about Median)
The median absolute deviation about the median is a robust dispersion measure: \[ \text{MAD}_{\text{median}} = \text{median}(|x_i - \text{median}(x)|) \]
This measures typical deviation from the median using the median itself, making it highly robust to outliers. It has a breakdown point of 50%, meaning up to half the data can be outliers without affecting the measure.
When to use:
- Maximum robustness to outliers is required
- Working with heavily contaminated data
- Complementing median-based location measures
- Robust statistical analysis
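The definition nests two medians, which takes only a few lines (data invented; the 1.4826 factor is the usual consistency constant for normal data, the reciprocal of the 0.6745 relation given in Section 4.4):

```python
from statistics import median

def mad_median(xs):
    """Median absolute deviation about the median."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

# Heavily contaminated sample: the measure shrugs off the outlier
data = [1, 2, 3, 4, 5, 1000]
spread = mad_median(data)   # 1.5

# For roughly normal data, MAD_median * 1.4826 estimates sigma
sigma_hat = 1.4826 * spread
```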
2.8 Standard Error of the Mean (SEM)
The standard error of the mean quantifies the variability of the sample mean itself: \[ \text{SEM} = \frac{s}{\sqrt{n}} = \frac{\text{standard deviation}}{\sqrt{\text{sample size}}} \]
SEM describes how much the sample mean would vary if we repeated the sampling process many times. It is crucial for confidence intervals and hypothesis testing. Note that SEM decreases as sample size increases, reflecting that larger samples provide more precise estimates.
When to use:
- Constructing confidence intervals for the mean
- Assessing precision of location estimates
- Hypothesis testing and statistical inference
- Understanding sampling variability
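A sketch of SEM and a rough confidence interval (data invented; the z = 1.96 multiplier is the large-sample approximation, and a t critical value would be more appropriate for a sample this small):

```python
from math import sqrt
from statistics import mean, stdev

data = [95, 97, 98, 99, 100, 101, 102, 103, 105]
n = len(data)
sem = stdev(data) / sqrt(n)   # s / sqrt(n)

# Approximate 95% confidence interval for the mean
m = mean(data)
ci = (m - 1.96 * sem, m + 1.96 * sem)
```

Quadrupling the sample size halves the SEM, which is the sense in which larger samples buy precision.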
2.9 Percentile Ranges
Various percentile ranges measure spread using different portions of the distribution:
- 90-10 percentile range: Difference between 90th and 10th percentiles (contains middle 80% of data)
- 95-5 percentile range: Difference between 95th and 5th percentiles (contains middle 90% of data)
- Quartile deviation (semi-interquartile range): half the IQR, \((Q_3 - Q_1)/2\)
These measures are robust and can be customized to exclude specific tail percentages, making them useful for trimmed or robust analysis.
3. Interactive Calculator
Compute all location and dispersion measures for your own data. Enter values as a comma-separated list, and optionally provide weights for weighted calculations.
Explore how different measures produce different results based on data characteristics, and observe how dispersion quantifies the representativeness of location measures.
4. Theoretical Foundations
Location and dispersion measures are deeply connected to probability theory, optimization, and statistical inference.
4.1 Mathematical Properties
Different location measures optimize different criteria:
- Arithmetic mean: Minimizes \(\sum_{i=1}^{n} (x_i - c)^2\) (sum of squared deviations)
- Median: Minimizes \(\sum_{i=1}^{n} |x_i - c|\) (sum of absolute deviations)
- Mode: Maximizes the probability/density at the chosen value
- Geometric mean: Minimizes \(\sum_{i=1}^{n} (\ln(x_i) - \ln(c))^2\) in log space
4.2 Inequality of Means
For positive numbers \(x_1, \ldots, x_n\), the means satisfy (with equality only when all values are equal): \[ \bar{x}_h \leq \bar{x}_g \leq \bar{x} \] That is, harmonic mean ≤ geometric mean ≤ arithmetic mean. This ordering reflects that harmonic and geometric means give more weight to smaller values.
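The ordering is easy to verify numerically; a two-value sketch makes the gaps visible:

```python
from statistics import geometric_mean, harmonic_mean, mean

xs = [2.0, 8.0]
h, g, a = harmonic_mean(xs), geometric_mean(xs), mean(xs)
# h = 3.2, g = 4.0, a = 5.0 — the ordering h <= g <= a holds,
# with equality only when all values coincide
```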
4.3 Sensitivity to Outliers
Location measures:
- Most sensitive: Arithmetic mean
- Moderately robust: Trimmed and Winsorized means
- Highly robust: Median (breakdown point 50%)
- Least affected: Mode
Dispersion measures:
- Most sensitive: Range, Variance, Standard deviation
- Moderately robust: Mean absolute deviation, Coefficient of variation
- Highly robust: IQR, Median absolute deviation
4.4 Mathematical Relationships
For normal distributions: \[ \text{MAD} \approx 0.7979 \times \sigma, \quad \text{IQR} \approx 1.349 \times \sigma, \quad \text{MAD}_{\text{median}} \approx 0.6745 \times \sigma \]
4.5 Chebyshev's Inequality
For any distribution (not just normal), Chebyshev's inequality provides a bound: \[ P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} \]
This means at least 75% of the data fall within 2σ of the mean, at least 88.9% within 3σ, and at least 93.75% within 4σ, regardless of the distribution's shape.
4.6 Measurement Scales
The choice of location and dispersion measures depends on the measurement scale:
- Nominal: Mode only
- Ordinal: Median, mode, range/IQR
- Interval: Mean, median, mode, variance, standard deviation, MAD, IQR
- Ratio: All measures including geometric and harmonic means, coefficient of variation
Key Insight: Representativeness of the Mean
The fundamental question in descriptive statistics is: How representative is the mean? This depends critically on dispersion:
- Low dispersion: Mean is highly representative (most observations cluster near it)
- High dispersion: Mean is less representative (many observations deviate substantially)
Rule of thumb: CV < 0.15 suggests low variability (mean is representative), CV > 0.30 suggests high variability (consider median + IQR instead).
4.7 Applications in Cybersecurity
Anomaly Detection:
- Network traffic baselines combine location and dispersion (mean + 3σ thresholds)
- High dispersion requires wider thresholds to avoid false positives
Risk Assessment:
- Expected value (location) quantifies expected losses
- Variance (dispersion) quantifies uncertainty and risk
Performance Monitoring:
- Median or trimmed means provide robust summaries for skewed response time distributions
- Low variability indicates reliable systems, high variability suggests instability
References
- Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2014). Mathematical Statistics with Applications (7th ed.). Cengage Learning.
- Huber, P. J. (2004). Robust Statistics. Wiley-Interscience.
- Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
- Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing (3rd ed.). Academic Press.