Polimake

The normal distribution: from De Moivre (1733) to Gauss and Laplace, the Central Limit Theorem, and why knowing this changes business decisions

The normal distribution explained with the depth it deserves: its origin in Abraham de Moivre's 1733 approximation to the binomial distribution (later included in The Doctrine of Chances), its formalization by Gauss in 1809, the Central Limit Theorem that explains its ubiquity, its real business applications (A/B testing, quality control, finance), and the cases where it does NOT apply (Pareto, power-law distributions, fat tails).


The team behind Polimake. We explore the intersection of technology, creativity, and automation.

A normal distribution—also called a Gaussian distribution or a bell curve—is the mathematical representation of how many natural and social phenomena are distributed: most values cluster around an average (the mean), and extreme values become progressively less frequent the farther they get from that center. Its characteristic symmetrical bell shape is probably the most recognizable image in statistics after the histogram, and the mathematical formula that describes it is one of the most widely used in science.

For someone in marketing, product, or any discipline that makes data-based decisions, understanding what the normal distribution assumes and where it breaks down is probably one of the most valuable statistical skills you can acquire. Assuming normality when the data doesn't meet it leads to wrong conclusions with real consequences. This article covers the history, the math in accessible language, the honest applications, and the cases where it does NOT apply—something most introductions to the topic leave out.

The origin: Abraham de Moivre, 1733

The normal distribution bears the name of Carl Friedrich Gauss (1777–1855), but its mathematical description is over seventy years older than his treatment. Abraham de Moivre, a French mathematician exiled in London, circulated in 1733 a short Latin paper, Approximatio ad summam terminorum binomii (a+b)^n in seriem expansi, containing the first known mathematical description of what we now call the normal distribution. He later included an English translation in the second edition (1738) of The Doctrine of Chances, a work first published in 1718.

De Moivre was studying the distribution of outcomes when tossing many coins. When you flip a coin 100 times and count how many heads come up, you don't get 50 every time; the count varies. De Moivre showed that the probabilities of the possible head counts trace a curve with a specific shape, symmetrical around the expected value (50), and he derived the formula that approximates that curve.
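As a rough illustration of the coin-toss result, the sketch below compares the exact probability of each head count in 100 tosses with the height of a bell curve at the same point. The function names are illustrative, and the curve uses the standard binomial values: mean n·p and standard deviation sqrt(n·p·(1−p)).

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact probability of exactly k heads in n tosses with head probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


n, p = 100, 0.5
mu = n * p                          # expected number of heads: 50
sigma = math.sqrt(n * p * (1 - p))  # standard deviation of the count: 5

for k in (40, 45, 50, 55, 60):
    exact = binomial_pmf(k, n, p)
    approx = normal_pdf(k, mu, sigma)
    print(f"{k} heads: exact {exact:.4f}, normal curve {approx:.4f}")
```

The two columns agree to about three decimal places, which is the sense in which the bell curve approximates the coin-toss distribution.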

The normal distribution is one of the classic cases of a discovery attributed late. Until the 20th century, "the De Moivre–Laplace law" was a common name. Today calling it "Gaussian" is standard but not strictly accurate historically.

Pierre-Simon Laplace (1749–1827), a French mathematician, significantly refined De Moivre's work during the early decades of the 19th century, especially in his Théorie analytique des probabilités (1812). Laplace contributed decisively to the Central Limit Theorem, which explains why the normal distribution appears so ubiquitously in real data (we cover it below).

Carl Friedrich Gauss published in 1809 Theoria motus corporum coelestium ("Theory of the Motion of the Heavenly Bodies"), where he derived the normal distribution in a completely different context: analyzing errors in astronomical measurement. Gauss observed that when several astronomers measured the same star, the measurement errors were distributed following a bell-shaped curve. His mathematical treatment in astronomy gave him fame and lent his name to the phenomenon, even though De Moivre and Laplace had described it earlier.

The historical reason it's called "Gaussian": Gauss's work was more visible in the 19th-century scientific community, and its association with least-squares methods (also developed independently by Legendre) cemented his name.

The formula and the parameters

Although this article avoids excessive math, it's worth knowing the basic concepts:

The normal distribution has two parameters that completely define it:

Mean (μ, mu). The center of the distribution. The most frequent value, the average. If a population has a mean height of 175 cm, μ = 175.

Standard deviation (σ, sigma). The spread. It indicates how much the values typically deviate from the average. If σ = 7 cm, about 68% of people are within 7 cm of 175 cm, that is, between 168 and 182 cm.

The complete mathematical formula of the normal distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

You don't need to understand it in detail to use it. What matters operationally are two properties:

The 68-95-99.7 rule (the empirical rule). In a normal distribution:

  • 68% of the values are within 1 standard deviation of the mean (μ ± σ).
  • 95% are within 2 standard deviations (μ ± 2σ).
  • 99.7% are within 3 standard deviations (μ ± 3σ).

This means values beyond 3 standard deviations from the mean are extremely rare (~0.3%). This rule is one of the most useful statistical tools for quickly assessing whether an observed value is typical or anomalous.

Symmetry. The normal distribution is symmetrical around the mean. There are roughly the same number of values below the mean as above it. If the data doesn't meet this symmetry (skewness), it's not perfectly normal.
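Both properties can be checked numerically. The sketch below uses Python's standard-library NormalDist with the height example (mean 175 cm, standard deviation 7 cm); the helper names share_within and z_score are illustrative, not part of any library.

```python
from statistics import NormalDist

# Height example from the text: mean 175 cm, standard deviation 7 cm.
heights = NormalDist(mu=175, sigma=7)


def share_within(dist: NormalDist, k: float) -> float:
    """Probability that a value lies within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


def z_score(x: float, dist: NormalDist) -> float:
    """Distance of x from the mean, measured in standard deviations."""
    return (x - dist.mean) / dist.stdev


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(heights, k):.4f}")

# Symmetry: the density is equal at the same distance on either side of the mean.
print(heights.pdf(165), heights.pdf(185))

# A height of 196 cm is 3 standard deviations above the mean.
print(z_score(196, heights))
```

A quick screening rule follows from this: an observation with an absolute z-score above 3 is rare under a normal model, so it is worth a second look.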

The Central Limit Theorem: why the normal distribution is ubiquitous

The fundamental reason the normal distribution appears so often in real data is a deep mathematical result. The Central Limit Theorem (CLT), developed over the 19th and early 20th centuries (with contributions from Laplace, Lyapunov, Lindeberg, and Lévy), essentially states:

When you average or sum many independent random variables with finite variance, each with its own distribution (which doesn't even have to be normal), the result tends to follow a normal distribution as the number of variables grows, provided no single variable dominates the total.

The practical consequence is enormous. Any phenomenon that is the aggregate result of many small independent factors tends to be normally distributed:

  • Human height. Determined by many genes, nutrition, environment, etc.—all small factors that add up.
  • Measurement errors. Each experiment has many small sources of error that sum together.
  • Averages of many independent outcomes (what De Moivre observed with coins).
  • IQ scores (as a statistical construct—the IQ scale itself is designed to be normal).
  • The performance of many individuals on similar tasks.
  • The distribution of errors in controlled industrial processes.

The CLT is why, for decades, the assumption of normality was almost automatic in applied statistics. If something is the aggregate result of multiple independent factors, it should be approximately normal.

But it's also why, when the data is NOT normal, it's usually because some assumption of the CLT isn't met—the factors aren't independent, or one factor dominates, or there are nonlinear effects. Knowing this guides you to investigate the underlying cause.
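A small simulation makes the theorem concrete. The sketch below averages values drawn from an exponential distribution, which is strongly right-skewed, and measures how the skewness of those averages shrinks as more values go into each average. The choice of distribution, the sample sizes, and the seed are arbitrary assumptions for illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


def means_of_samples(sample_size, n_means, rng):
    """Draw n_means samples of sample_size exponential values; return each sample's mean."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_means)
    ]


rng = random.Random(2024)  # fixed seed so the run is repeatable
skew_by_size = {}
for sample_size in (1, 5, 50):
    means = means_of_samples(sample_size, 4000, rng)
    skew_by_size[sample_size] = skewness(means)
    print(f"averages of {sample_size:>2} values: skewness {skew_by_size[sample_size]:.2f}")
```

With one value per average the result is just the raw skewed data; by 50 values per average the distribution of averages is close to symmetric, even though no individual value is normally distributed.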

Real applications in business and marketing

A/B testing and experimentation. The core of how statistical tests are designed. The assumption of normality lets you calculate statistical significance (p-values), sample sizes, and confidence intervals. Platforms like Optimizely, VWO, Convert, and Google Optimize (discontinued in 2023) operate on this math. When someone says "this variant is 95% significant," they're saying the observed difference falls outside the typical range we'd expect by chance under the assumption of a normal distribution.

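Quality control / Six Sigma. The name "Six Sigma" comes literally from the normal distribution: the target is a process whose specification limits sit six standard deviations from the mean. The well-known figure of 3.4 defects per million opportunities includes a conventional allowance for the process mean drifting by 1.5 sigma over time, so it corresponds to the tail beyond 4.5 sigma; a perfectly centered six-sigma process would produce about 0.002 defects per million. Six Sigma was developed at Motorola in the '80s (Bill Smith) and popularized by Jack Welch at GE in the '90s. It remains a standard methodology in manufacturing quality.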
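The 3.4 figure can be checked from normal tail probabilities. The sketch below assumes the conventional 1.5-sigma long-term shift used in Six Sigma practice, which turns a six-sigma limit into a 4.5-sigma one-sided tail; the function name upper_tail is illustrative.

```python
import math


def upper_tail(k: float) -> float:
    """P(Z > k) for a standard normal variable; erfc stays accurate far into the tail."""
    return 0.5 * math.erfc(k / math.sqrt(2))


PER_MILLION = 1_000_000

# Six-sigma limit with the mean shifted 1.5 sigma toward it: a 4.5-sigma one-sided tail.
shifted_dpmo = upper_tail(6.0 - 1.5) * PER_MILLION

# Perfectly centered process: both tails beyond 6 sigma.
centered_dpmo = 2 * upper_tail(6.0) * PER_MILLION

print(f"with 1.5-sigma shift: {shifted_dpmo:.2f} defects per million")
print(f"perfectly centered:   {centered_dpmo:.4f} defects per million")
```

The shifted case gives roughly 3.4 defects per million, while the centered case gives about 0.002, which shows how much of the headline number comes from the drift allowance.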

Finance and risk models. The Black-Scholes model (1973), the basis for much of financial options pricing, assumes that the log returns of an asset are normally distributed (equivalently, that prices are log-normal). Although this assumption has been shown to be insufficient (markets have "fat tails" that the normal underestimates), the model is still used as an approximation. More on this in the limits section.

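User behavior analysis. Some user metrics, such as visitor age within a segment, are often approximately normal, which allows standard statistical analysis. Others, such as time on page and income, are usually right-skewed and tend to be close to normal only after a log transformation.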

Forecasting / sales forecasting. Assuming a normal distribution of forecast errors, you can build confidence intervals ("expected sales: 1,000 ± 100 units with 95% confidence"). Covered in sales forecasting.
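A minimal sketch of that interval calculation, assuming forecast errors are normal with a known standard deviation. The value error_sd=51 is hypothetical, chosen so the result matches the "1,000 ± 100" example; the function name is illustrative.

```python
from statistics import NormalDist


def forecast_interval(point_forecast, error_sd, confidence=0.95):
    """Symmetric interval around a point forecast, assuming normal forecast errors."""
    # For 95% confidence this is the 97.5th percentile of the standard normal, about 1.96.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * error_sd
    return point_forecast - half_width, point_forecast + half_width


low, high = forecast_interval(1000, error_sd=51)  # error_sd is a hypothetical value
print(f"95% interval: {low:.0f} to {high:.0f} units")
```

If past forecast errors are skewed or have fat tails, an interval built this way will be too narrow on at least one side.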

Surveys and sampling. The CLT is why relatively small samples can estimate population parameters with precision. Covered in the questionnaire.

Organizational performance metrics. Employee productivity, process times, sales cycles—frequently approximately normal in aggregate.

When the normal distribution does NOT apply

Here's the most important part, and the one most introductions leave out. Many phenomena relevant to business do NOT follow normal distributions, and assuming normality when it doesn't apply produces seriously wrong predictions.

Power-law and Pareto distributions. We cover this in detail in the 80/20 rule / Pareto Principle. Many economic phenomena follow power-law distributions, not normal ones:

  • Revenue per customer: most customers generate little, a few generate the majority.
  • Web traffic per page: a few pages concentrate most of the visits.
  • City sizes. Few megacities, many small towns.
  • Word frequency (Zipf's Law).
  • Earthquake magnitudes.
  • Sales per product. A few best-sellers concentrate sales.

In power-law distributions, "outliers" are far more frequent and far more extreme than the normal predicts. Assuming normality leads to dramatically underestimating extreme events.
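To see how different this is from a bell curve, the sketch below simulates revenue per customer from a Pareto distribution and measures how much of the total comes from the top 20% of customers. The shape parameter 1.16 is an assumption (in theory it yields roughly an 80/20 split), and the data is simulated, not real.

```python
import random
import statistics

ALPHA = 1.16  # assumed Pareto shape; in theory gives roughly an 80/20 split

rng = random.Random(7)  # fixed seed so the run is repeatable
revenue = sorted((rng.paretovariate(ALPHA) for _ in range(10_000)), reverse=True)

top_fifth = revenue[: len(revenue) // 5]  # the 20% of customers with the highest revenue
top_share = sum(top_fifth) / sum(revenue)

mean_revenue = statistics.fmean(revenue)
median_revenue = statistics.median(revenue)

print(f"share of revenue from top 20% of customers: {top_share:.0%}")
print(f"mean {mean_revenue:.2f} vs median {median_revenue:.2f}")
```

In normally distributed data the mean and median nearly coincide and the top 20% would hold only a little more than 20% of the total; here the mean sits well above the median because a handful of very large values pull it up.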

Fat tails. Some distributions are bell-shaped but with tails much fatter than the normal: extreme events are far more likely. Nassim Nicholas Taleb has popularized this concept, especially in finance (the books Fooled by Randomness, 2001, and The Black Swan, 2007). The stock market crash of October 19, 1987, when the Dow Jones fell 22.6% in one day, is often described as a move of more than 20 standard deviations of daily returns. Under the assumption of normality such an event should essentially never be observed (figures on the order of once every 10^50 years are quoted). It occurred.
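The scale of the mismatch is easy to compute. The sketch below gives the probability that a normal variable moves more than k standard deviations in either direction. Mapping the 1987 crash onto this scale requires an assumption about daily volatility: with a hypothetical daily standard deviation of 1%, a 22.6% fall is a 22.6-sigma move.

```python
import math


def two_sided_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal variable, via erfc to avoid rounding to zero."""
    return math.erfc(k / math.sqrt(2))


for k in (3, 5, 10, 20):
    p = two_sided_tail(k)
    print(f"beyond {k:>2} sd: probability {p:.2e} (about 1 observation in {1 / p:.2e})")

# Hypothetical: daily returns with a 1% standard deviation make a 22.6% fall a 22.6-sigma move.
ASSUMED_DAILY_SD = 0.01
crash_sigmas = 0.226 / ASSUMED_DAILY_SD
print(f"{crash_sigmas:.1f} sd: probability {two_sided_tail(crash_sigmas):.2e}")
```

The exact "once every N years" figure depends heavily on the volatility assumed, but under any plausible value the normal model assigns the event a probability indistinguishable from zero.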

Bimodal or multimodal distributions. When there are two (or more) "peaks" instead of one. Example: the weight of adults in a population can have two modes (one for men, another for women). Treating the whole set as normal mixes two different distributions and produces wrong conclusions.

Skewed distributions. When one tail is longer than the other. Personal income is a classic example: many below the average, a few extremely high above it. The log-normal distribution fits better than the normal.

Discrete distributions with limited options. Survey results on a 1–5 scale, number of children per family, binary decisions. The normal is continuous and unbounded at both ends; these are discrete and bounded.

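Events with rapid accumulation or exhaustion. Cumulative sales of a newly launched product typically follow an S-curve over time, and technology adoption (covered in early adopter) follows a diffusion curve. These are growth trajectories, not values scattered symmetrically around a stable mean, so tools built on the normality assumption don't transfer to them directly.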

How to check whether your data is approximately normal

Before assuming normality and applying statistical tools that depend on it, it's worth verifying:

Visual: histogram. Plot the data on a histogram and see whether it looks like a symmetrical bell. If it's clearly skewed, multimodal, or has odd tails, it's not normal.

Visual: QQ plot (quantile-quantile plot). A chart that compares the quantiles of your data with those of a normal distribution. If the data is normal, the points lie approximately on a straight line.

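Formal statistical tests: Shapiro-Wilk (best for small samples), Kolmogorov-Smirnov, Anderson-Darling. They produce a p-value; if it's low, you reject the hypothesis that the data is normal. With very large samples these tests flag even trivial deviations, so read them alongside the plots.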

Descriptive statistics: skewness and kurtosis. In a normal distribution, skewness ≈ 0 and kurtosis ≈ 3 (or 0 if you use "excess kurtosis"). Significantly different values suggest non-normality.

A practical rule for deciding: many statistical methods are reasonably robust to modest deviations from normality. But if the deviation is large, you have to consider transformations (logarithmic, square root) or non-parametric methods (which don't assume normality).
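Some of these checks need only the standard library. The sketch below computes skewness, excess kurtosis, and a numeric stand-in for the QQ plot (the correlation between the sorted data and the matching normal quantiles, which is close to 1 when the QQ plot is a straight line). The function names and the two simulated datasets are illustrative; the formal tests listed above require a statistics package and are not reproduced here.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def skewness(xs):
    """Third standardized moment: about 0 for normal data."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 3 for x in xs])


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3: about 0 for normal data."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 4 for x in xs]) - 3.0


def qq_correlation(xs):
    """Correlation between sorted data and standard normal quantiles (QQ plot straightness)."""
    n = len(xs)
    data = sorted(xs)
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    md, mt = fmean(data), fmean(theory)
    cov = sum((d - md) * (t - mt) for d, t in zip(data, theory))
    var_d = sum((d - md) ** 2 for d in data)
    var_t = sum((t - mt) ** 2 for t in theory)
    return cov / math.sqrt(var_d * var_t)


rng = random.Random(42)  # fixed seed so the run is repeatable
normal_like = [rng.gauss(100, 15) for _ in range(2000)]
skewed = [rng.lognormvariate(0, 1) for _ in range(2000)]

for name, xs in (("normal-like", normal_like), ("skewed", skewed)):
    print(f"{name}: skewness {skewness(xs):.2f}, "
          f"excess kurtosis {excess_kurtosis(xs):.2f}, "
          f"QQ correlation {qq_correlation(xs):.3f}")
```

On the normal-like sample all three numbers sit near their ideal values (0, 0, and 1); on the skewed sample each one moves clearly away.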

Common mistakes in applying the normal

Assuming normality by default without checking. The most expensive mistake. It leads to wrong conclusions in analytics, A/B testing, finance, and forecasting.

Confusing the normal distribution with the uniform distribution. The normal concentrates values near the center; the uniform spreads them out equally. They're different things.

Calculating means on data with extreme outliers without considering the distribution. If the data follows a power-law distribution, the mean is misleading because it's skewed by the few very large values. The median is typically more representative.

Using parametric tests (which assume normality) on data that clearly isn't normal. The conclusions about statistical significance can be unreliable, especially with small samples.

Basing extreme predictions on the normality assumption. Especially in finance and risk management. The 2007–2008 financial crisis was partly fueled by models that underestimated the risk of extreme events.

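Confusing "lots of data" with "a normal distribution." Having 10,000 points doesn't guarantee normality. A larger sample makes the sample mean behave more normally (that is what the CLT says), but the individual values keep whatever distribution the phenomenon produces. Whether the raw data is normal depends on the nature of the phenomenon being measured, not on the sample size.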

Ignoring the context of the phenomenon. What generates this data? If it's the aggregate result of many independent factors, normality is plausible. If it depends on a few dominant factors, probably not.

Running an A/B test with a very small sample. Some tests assume a sample size large enough for the CLT to apply. With a very small n, the assumptions break down.

Not considering transformations when they apply. If the data is log-normal (common in income, times), taking the logarithm turns it normal and lets you apply standard tools.
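A minimal sketch of that transformation, using simulated log-normal "durations" (the parameters 3.0 and 0.8 are arbitrary assumptions). It compares the ratio of mean to median before and after taking logarithms; a ratio near 1 is one quick sign of symmetry.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable

# Simulated right-skewed data, e.g. task durations in seconds (hypothetical parameters).
durations = [rng.lognormvariate(3.0, 0.8) for _ in range(5000)]
logged = [math.log(d) for d in durations]


def mean_over_median(xs):
    """Ratio of mean to median; close to 1 when the data is roughly symmetric."""
    return statistics.fmean(xs) / statistics.median(xs)


print(f"raw data:        mean/median = {mean_over_median(durations):.2f}")
print(f"log-transformed: mean/median = {mean_over_median(logged):.2f}")
```

Results computed on the log scale (means, intervals) need to be converted back with the exponential before being reported, and a back-transformed mean is a geometric mean, not the arithmetic one.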

The normal distribution in A/B testing: the most common operational case

For many marketing and product professionals, A/B testing is the context where the normal distribution shows up most routinely. It's worth grounding it:

When a company runs an A/B test (version A vs. version B), it's measuring whether the observed difference in some metric (conversion, time on page, revenue per user) is statistically significant or could be random noise.

The reasoning: if the estimate of the metric for each version (for example, the observed conversion rate) is approximately normally distributed around its true value, we can calculate the probability of observing a difference that large or larger by pure chance, that is, if the two versions truly performed the same. That probability is the p-value.

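A p-value below 0.05 (typically) means that, if there were no real difference between the versions, a difference at least this large would show up less than 5% of the time. By convention the result is then considered "statistically significant." It is not the probability that the difference is noise.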
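For conversion rates, one common way to carry out this calculation is a two-proportion z-test. The sketch below is a minimal version with hypothetical numbers (500 conversions out of 10,000 visitors for A, 570 out of 10,000 for B); it is not a description of how any particular platform computes its results.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both versions share one rate, estimated by pooling.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    # Probability of a difference at least this extreme if the null hypothesis is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical test: 5.0% conversion for A, 5.7% for B.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

The normal approximation used here relies on both samples being large enough that each observed rate is roughly normally distributed around its true value.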

But there are nuances worth attention:

Statistical significance is not magnitude. A difference can be statistically significant and operationally irrelevant (a 0.01% conversion improvement).

Sample size matters. With large samples, almost any difference becomes significant. The question is whether the magnitude justifies action.

Multiple testing. Running many tests simultaneously increases the probability of finding "significance" by chance. Corrections like Bonferroni adjust for this.
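The arithmetic behind this is short. The sketch below assumes the tests are independent, which is a simplification, and shows how the chance of at least one false positive grows with the number of tests, along with the per-test threshold that Bonferroni would use. Function names are illustrative.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the overall rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 5, 20):
    uncorrected = family_wise_error_rate(0.05, n_tests)
    threshold = bonferroni_threshold(0.05, n_tests)
    corrected = family_wise_error_rate(threshold, n_tests)
    print(f"{n_tests:>2} tests: {uncorrected:.1%} chance of a false positive, "
          f"Bonferroni threshold {threshold:.4f} brings it to {corrected:.1%}")
```

With 20 uncorrected tests the chance of at least one spurious "win" is about 64%; the price of the correction is that each individual test needs stronger evidence.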

Sequential testing. Looking at results and stopping when there's significance (peeking) inflates the false-positive rate. Bayesian tests or formal sequential tests are alternatives.

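The methodology's assumptions. Traditional frequentist A/B testing assumes that the sample mean of the measured metric is approximately normally distributed, which the CLT delivers when samples are large enough. If the metric is very skewed (revenue per user in e-commerce is typically very skewed), that approximation needs far larger samples, and at typical sample sizes the tests can give unreliable results.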

Modern A/B testing platforms (Optimizely, VWO, Convert, Mixpanel, Eppo, GrowthBook) use sophisticated statistical methods, but the underlying math still depends—to varying degrees—on the normal distribution or approximations to it.

The normal distribution and creative operations

For a brand that makes data-based decisions—A/B testing of creatives, campaign performance analysis, hypothesis evaluation—understanding what the statistics you use daily do and don't assume affects the quality of those decisions.

That discipline connects to creative operations: creative KPIs should be defined with awareness of which distributions they follow, and the learnings from A/B testing feed the editorial calendar and content production better when you understand their statistical limits.

At Polimake the measurement logic lives in three surfaces: Studio coordinates experiments based on defined KPIs, Studio produces variants for testing, and Media stores the history of variants and results that enable longitudinal analysis.


If you lead product, marketing, analytics, or any role that makes data-based decisions and you arrived here looking for an answer about the normal distribution, the most useful thing you can take from this article is probably the combination of three ideas: the normal distribution is ubiquitous in data for deep mathematical reasons (the CLT), standard statistical tools depend on its assumption, and many phenomena relevant to business do NOT follow it (power-law, fat-tailed, multimodal, skewed). The most valuable statistical skill isn't knowing how to calculate means and deviations, but knowing when the assumptions hold and when they don't.

To complement this, the 80/20 rule / Pareto Principle covers the counter-model of power-law distributions, sales forecasting covers one of the contexts where the normality assumption frequently breaks down, and the questionnaire covers the sampling context where the CLT is relevant.
