Imagine you are shopping on Amazon and a search returns a list of 50 items. You scroll down, click an item, continue scrolling, and click on a few more. How does an analyst know whether an item was actually displayed on your screen, so they can calculate the click rate (clicked/displayed)? How do they know whether you saw only 15, or 20, or all 50 items? Is there a principled way to estimate the furthest point you scrolled to based on your clicks, and therefore how many items were actually displayed? It turns out this is “The German Tank Problem”.
Intro
For most mature tech companies, the example above is not a problem, as they very likely already capture the scroll position in their telemetry data. When this data is not available, however, we often have no better option than to treat the maximum click rank as the minimum displayed rank. For instance, when a page returns 50 items and the last click occurs at rank 30, we can safely say the user has seen at least the 1st through the 30th items. But this is surely an underestimate: if the user scrolled to the 40th item and the last click was on the 30th, our estimate would still be 30.
The German Tank Problem
Interestingly, this problem has been tackled before. The German Tank Problem, as summarized by an LLM:
In the statistical theory of estimation, the German tank problem consists of estimating the maximum of a discrete uniform distribution from sampling without replacement. In simple terms, suppose there exists an unknown number of items which are sequentially numbered from 1 to N. A random sample of these items is taken and their sequence numbers observed; the problem is to estimate N from these observed numbers. The problem can be approached using either frequentist inference or Bayesian inference, leading to different results.
During World War II, mathematicians wanted to estimate the number of tanks Germany was producing each month. The only data available to them was the serial numbers of captured tanks, which they assumed were sequential and started from 1. For example, if 4 captured tanks had the numbers 19, 40, 42, and 60, they knew Germany had produced at least 60 tanks but not how many more than 60.
There are several ways to make this estimate, and they are not the focus of this article; refer to Wikipedia if you are interested. One of the simplest is the formula N = m + m/k - 1, where N is the estimated total number of tanks, m is the maximum serial number observed, and k is the number of observations. The intuition is that the average gap between observed numbers is about m/k - 1 unobserved serials, and the formula assumes one more gap of that size lies beyond the maximum.
Applying the formula to the example above, N = 60 + 60/4 - 1 = 74. During World War II, they estimated that Germany was producing 246 tanks per month, and the actual number found after the war was 245.
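For readers who prefer code, here is a minimal Python sketch of the formula above. The function name estimate_max is my own choice for illustration, and it implements only this one simple estimator, not the other approaches mentioned.

```python
def estimate_max(observed):
    """Estimate N from observed serial numbers using N = m + m/k - 1."""
    if not observed:
        raise ValueError("need at least one observation")
    m = max(observed)  # largest serial number seen
    k = len(observed)  # number of observations
    return m + m / k - 1


# The four captured tanks from the example
print(estimate_max([19, 40, 42, 60]))  # 74.0
```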
Estimation of Displayed Items
Returning to our user behavior problem, a search session is equivalent to a batch of tanks: each item is a tank, each click is a captured tank, and the position (rank) of an item is its serial number. The exact formula from the German Tank Problem can be applied here. Using the same example, if a user clicked the 19th, 40th, 42nd, and 60th items, we can infer that this user scrolled to the 74th item. The click ratio is then 4/74 instead of 4/60.
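A rough Python sketch of the click ratio calculation might look like the following. The function name estimated_click_rate and the optional total_items cap are my own additions for illustration: the cap assumes a user cannot have seen more items than the page returned, which the formula by itself does not guarantee.

```python
def estimated_click_rate(click_ranks, total_items=None):
    """Clicks divided by the estimated number of displayed items."""
    k = len(click_ranks)
    if k == 0:
        raise ValueError("need at least one click")
    m = max(click_ranks)
    displayed = m + m / k - 1  # German tank estimate of scroll depth
    if total_items is not None:
        # Assumption: the user cannot see more items than were returned
        displayed = min(displayed, total_items)
    return k / displayed


clicks = [19, 40, 42, 60]
naive_rate = len(clicks) / max(clicks)  # 4/60, treats the last click as scroll depth
print(naive_rate, estimated_click_rate(clicks))  # 4/60 versus 4/74
```

The naive rate is always at least as high as the estimated one, because the estimated number of displayed items is never smaller than the maximum click rank.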
Or, in the screenshot of me shopping for a tank that I can afford, the click ranks are 4, 7, and 14, so N = 14 + 14/3 - 1 ≈ 17.7, indicating I scrolled to the second-to-last row in the screenshot (ignoring the fact that a row of 5 items is displayed together).
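If you did want to account for whole rows, one possible adjustment (my own suggestion, not part of the German Tank Problem formula) is to round the estimate up to the end of the row it falls in. The helper name round_up_to_row and the 5 items per row are assumptions taken from this example only.

```python
import math


def round_up_to_row(estimate, items_per_row):
    """Round an estimated rank up to the last item of its row."""
    return math.ceil(estimate / items_per_row) * items_per_row


estimate = 14 + 14 / 3 - 1  # about 17.67, from click ranks 4, 7, 14
print(round_up_to_row(estimate, 5))  # 20
```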
Ending
Knowing this technique, the analyst provides a more accurate estimate. Great job. But no one cares.