Trust.

Above is a video from WebPageTest … Notice at the 2.0s mark … they are completely different. Nothing changed. There is no API to add in network effects, nothing but a very slight difference in the render order. The left becomes visually complete at 2.0s, the right at 2.3s.


I have a 15% difference … but really, who cares? 0.3s isn’t perceptually different or worse. The question then becomes: what is the margin of error? Should it be based on human perception, or derived statistically from the test and time of day?

WebPageTest helps you figure this out, sorta. It gives you a range over a series of tests, but you still have to do the math AND know what math to do. Most people look at a histogram of data and still don’t know what kind of impact the 95th percentile has on their performance.

So, let’s do some math…

If I’m looking at WebPageTest, I see that over 9 tests I ended up with a time to fully rendered between 1.7s and 2.1s … according to WebPageTest, I have a standard deviation of 0.2s and a mean of 1.9s. This means that the 2.0s test above was within one standard deviation of the mean, and the second test (2.3s) was no more than two standard deviations above it. In my humble opinion, two standard deviations is a pretty simple measure of “who gives a damn.”
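For anyone who wants to do this math by hand, here is a minimal sketch in Python. The nine timings are hypothetical, not my raw test data; I picked them only so the summary lands on the mean of 1.9s and standard deviation of 0.2s quoted above. The `z_score` helper is my own name for "how many standard deviations away is this run."

```python
import statistics

# Hypothetical timings in seconds for nine runs. These are NOT the raw
# WebPageTest results; they were chosen only to reproduce the quoted
# summary (mean 1.9s, standard deviation 0.2s, range 1.7s to 2.1s).
runs = [1.7, 1.7, 1.7, 1.7, 1.9, 2.1, 2.1, 2.1, 2.1]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)  # sample standard deviation (divides by n - 1)


def z_score(value, mean, stdev):
    """Return how many standard deviations `value` sits from `mean`."""
    return (value - mean) / stdev


print(f"mean={mean:.1f}s stdev={stdev:.1f}s")
for t in (2.0, 2.3):
    print(f"{t:.1f}s is {z_score(t, mean, stdev):+.1f} standard deviations from the mean")
```

Whether the tool reports a sample or a population standard deviation is something I am assuming here, so check before comparing numbers directly.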

This means that anywhere from 1.5s to 2.3s, I really can’t say that it’s anything special. If I were measuring the effect of a change, I’d want to verify that the new mean is at least two standard deviations away from the old mean as well.
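The two-standard-deviation rule can be sketched in a few lines of Python. The function names and the two "new mean" values below are hypothetical, made up purely to show one result inside the noise band and one outside it. This is a rule of thumb, not a formal significance test.

```python
def noise_band(mean, stdev, k=2.0):
    """Return the (low, high) range within k standard deviations of the mean."""
    return (mean - k * stdev, mean + k * stdev)


def is_real_change(old_mean, old_stdev, new_mean, k=2.0):
    """True if the new mean is at least k old standard deviations from the old mean."""
    return abs(new_mean - old_mean) >= k * old_stdev


# Mean and standard deviation as quoted in the text.
low, high = noise_band(1.9, 0.2)
print(f"noise band: {low:.1f}s to {high:.1f}s")

# Hypothetical post-change means, for illustration only.
print(is_real_change(1.9, 0.2, 1.7))  # 0.2s faster: still inside the noise
print(is_real_change(1.9, 0.2, 1.4))  # 0.5s faster: outside the noise band
```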

Here’s an example of a popular website:

Normal load: 7.4s to 11.4s (range of 4.0s)
Improvement over time: 6.3s to 10.7s (range of 4.4s)

Did it get better? Not in my opinion. If they were playing darts, they just got better at hitting the wall. That’s an overlap of ~60%. Only ~40% of page loads will see any improvement, and only by chance. If improvement comes down to random chance, then there’s no real improvement.
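One rough way to put a number on that overlap is to compare the shared part of the two ranges with the total span they cover together. That choice of denominator is an assumption on my part. It gives about 65%, in the same ballpark as the ~60% above, while dividing by either range's own width gives a higher figure.

```python
def overlap_seconds(a, b):
    """Return the length of the overlap between two (low, high) ranges."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return max(0.0, high - low)


before = (7.4, 11.4)  # normal load range, in seconds
after = (6.3, 10.7)   # range after the improvement, in seconds

shared = overlap_seconds(before, after)
# Total span covered by either range (assumed denominator, see above).
union = max(before[1], after[1]) - min(before[0], after[0])

print(f"shared range: {shared:.1f}s")
print(f"overlap as a share of the combined span: {shared / union:.0%}")
```

This only compares the endpoints of the two ranges. It says nothing about how the individual page loads are distributed inside them.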