Statistical benchmarking

In statistics, benchmarking is a method of using auxiliary information to adjust the sampling weights used in an estimation process, in order to yield more accurate estimates of totals.

Suppose we have a population in which each unit $k$ has an associated "value" $Y(k)$. For example, $Y(k)$ could be the wage of an employee $k$, or the cost of an item $k$. Suppose we want to estimate the sum $Y$ of all the $Y(k)$. We take a sample of units, assign a sampling weight $W(k)$ to each sampled $k$, and then sum $W(k)\cdot Y(k)$ over all sampled $k$.
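As a minimal sketch of this weighted estimate, the following Python snippet sums $W(k)\cdot Y(k)$ over a small sample. The function name `estimate_total` and the sample figures are hypothetical and chosen only for illustration.

```python
def estimate_total(values, weights):
    """Estimate the population total as the sum of W(k) * Y(k) over sampled units."""
    if len(values) != len(weights):
        raise ValueError("each sampled unit needs exactly one value and one weight")
    return sum(w * y for w, y in zip(weights, values))


# Hypothetical sample: three employees, each representing 100 people.
wages = [30_000.0, 45_000.0, 52_000.0]   # Y(k) for the sampled k
weights = [100.0, 100.0, 100.0]          # W(k) for the sampled k

estimated_total = estimate_total(wages, weights)
print(estimated_total)  # 12700000.0
```

Here each sampled wage is scaled up by the number of population units the sampled employee is taken to represent.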

One property usually common to the weights $W(k)$ described here is that their sum over all sampled $k$ is an estimate of the total number of units in the population (for example, the total employment, or the total number of items). Because it is based on a sample, this estimate will generally differ from the true number of units in the population. Similarly, the estimate of the total $Y$ (the sum of $W(k)\cdot Y(k)$ over all sampled $k$) will also differ from the true population total.

We do not know the true population total $Y$ (if we did, there would be no point in sampling!). Yet we often do know the true value of the quantity that the sum of the $W(k)$ estimates, namely the number of units in the population. For example, we may not know the total earnings of the population or the total cost of the population, but often we know the total employment or total volume of sales. Even if we do not know these exactly, there are often surveys, done by other organizations or at earlier times, that provide very accurate estimates of these auxiliary quantities. One important function of a population census is to provide data that can be used for benchmarking smaller surveys.

The benchmarking procedure begins by breaking the population into benchmarking cells. Cells are formed by grouping together units that share common characteristics, for example, similar $Y(k)$, although any grouping that enhances the accuracy of the final estimates can be used. For each cell $C$, we let $W(C)$ be the sum of all $W(k)$, where the sum is taken over all sampled $k$ in the cell $C$. For each cell $C$, we let $T(C)$ be the auxiliary value for cell $C$, which is commonly called the "benchmark target" for cell $C$. Next, we compute a benchmark factor $F(C)=T(C)/W(C)$. Then, we adjust each weight $W(k)$ by multiplying it by the benchmark factor $F(C)$ of the cell $C$ that contains $k$. The net result is that the estimated $W$ [formed by summing $F(C)\cdot W(k)$] will now equal the benchmark target total $T$. But the more important benefit is that the estimate of the total of $Y$ [formed by summing $F(C)\cdot W(k)\cdot Y(k)$] will tend to be more accurate.
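The procedure can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name `benchmark_weights`, the cell labels, and all numbers are hypothetical.

```python
from collections import defaultdict


def benchmark_weights(cells, weights, targets):
    """Return (adjusted weights, factors), where each adjusted weight is F(C) * W(k).

    cells[i]   : benchmarking cell C of sampled unit i
    weights[i] : original sampling weight W(k) of sampled unit i
    targets[C] : benchmark target T(C) for cell C
    """
    # W(C): sum of the sampling weights of the sampled units in each cell.
    cell_weight = defaultdict(float)
    for c, w in zip(cells, weights):
        cell_weight[c] += w

    # A cell with a target but no sampled units has W(C) = 0, so F(C) is undefined.
    unsampled = set(targets) - set(cell_weight)
    if unsampled:
        raise ValueError(f"no sampled units in cells: {sorted(unsampled)}")

    # F(C) = T(C) / W(C)
    factors = {c: targets[c] / wc for c, wc in cell_weight.items()}
    adjusted = [factors[c] * w for c, w in zip(cells, weights)]
    return adjusted, factors


# Hypothetical sample of four units in two cells.
cells = ["A", "A", "B", "B"]
weights = [50.0, 50.0, 50.0, 50.0]      # W(k)
values = [10.0, 12.0, 20.0, 24.0]       # Y(k)
targets = {"A": 50.0, "B": 150.0}       # T(C), e.g. from a census

adjusted, factors = benchmark_weights(cells, weights, targets)

raw_total = sum(w * y for w, y in zip(weights, values))
benchmarked_total = sum(w * y for w, y in zip(adjusted, values))

print(factors)            # {'A': 0.5, 'B': 1.5}
print(sum(adjusted))      # 200.0, equal to T(A) + T(B)
print(raw_total)          # 3300.0
print(benchmarked_total)  # 3850.0
```

In this made-up example the sample over-represents cell A and under-represents cell B, so the adjustment shifts weight toward the units in B, which have larger values, and the estimated total rises accordingly.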

Relationship to stratified sampling

Benchmarking is sometimes referred to as 'post-stratification' because of its similarities to stratified sampling. The difference between the two is that in stratified sampling, we decide in advance how many units will be sampled from each stratum (equivalent to benchmarking cells); in benchmarking, we select units from the broader population, and the number chosen from each cell is a matter of chance.
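The contrast can be illustrated with a small simulation. The sketch below uses hypothetical cell labels, population sizes, and function names. It draws one simple random sample from the whole population, where cell counts are left to chance, and one stratified sample, where they are fixed in advance.

```python
import random
from collections import Counter


def simple_random_sample_counts(population_cells, n, rng):
    """Sample n units from the whole population; per-cell counts are random."""
    return Counter(rng.sample(population_cells, n))


def stratified_sample_counts(population_cells, allocation, rng):
    """Sample a pre-chosen number of units from each stratum; counts are fixed."""
    counts = Counter()
    for cell, n_cell in allocation.items():
        members = [i for i, c in enumerate(population_cells) if c == cell]
        counts[cell] = len(rng.sample(members, n_cell))
    return counts


rng = random.Random(0)  # seeded only so that runs are repeatable
population_cells = ["young"] * 300 + ["old"] * 700

srs_counts = simple_random_sample_counts(population_cells, 100, rng)
strat_counts = stratified_sample_counts(
    population_cells, {"young": 30, "old": 70}, rng
)

print(dict(srs_counts))    # counts vary from sample to sample
print(dict(strat_counts))  # always {'young': 30, 'old': 70}
```

Only the total of the simple random sample is controlled; how it splits between the two cells depends on the draw.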

The advantage of stratified sampling is that the sample numbers in each stratum can be controlled for desired accuracy outcomes. Without this control, we may end up with too much sample in one stratum and not enough in another. It is even possible that a sample will contain no members from a certain cell, in which case benchmarking fails because $W(C)=0$, leading to a divide-by-zero problem. In such cases, it is necessary to 'collapse' cells together so that each remaining cell has an adequate sample size.
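One simple way to collapse cells, assuming the cells have a natural ordering (such as age bands), is to merge neighbouring cells until every merged group reaches a minimum sample size. The sketch below is one possible rule among many, not a standard prescribed by the method. The function name `collapse_cells`, the threshold `min_n`, and the counts are hypothetical.

```python
def collapse_cells(ordered_cells, sample_counts, min_n):
    """Greedily merge adjacent cells until each group has at least min_n sampled units.

    Returns a dict mapping every original cell to its merged group (a tuple of cells).
    """
    groups = []
    current, current_n = [], 0
    for cell in ordered_cells:
        current.append(cell)
        current_n += sample_counts.get(cell, 0)
        if current_n >= min_n:
            groups.append(current)
            current, current_n = [], 0

    # An undersized tail is folded into the last complete group, if there is one.
    if current:
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)

    return {cell: tuple(group) for group in groups for cell in group}


# Hypothetical age bands; the youngest band has no sampled units at all.
age_bands = ["18-24", "25-34", "35-44", "45-64", "65+"]
counts = {"18-24": 0, "25-34": 12, "35-44": 9, "45-64": 15, "65+": 2}

mapping = collapse_cells(age_bands, counts, min_n=10)
print(mapping["18-24"])  # ('18-24', '25-34')
print(mapping["65+"])    # ('35-44', '45-64', '65+')
```

After collapsing, the benchmark target of a merged cell is the sum of the targets of the cells it contains, and the benchmark factor is computed for the merged cell as a whole.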

Because it gives no control over the sample size in each cell, benchmarking is generally used in situations where stratified sampling is impractical. For instance, when selecting people from a telephone directory, we cannot tell their ages in advance, so we cannot easily stratify the sample by age. However, we can collect this information from the people sampled, allowing us to benchmark against demographic information.

