When they were younger, both my kids were fans of Dr Seuss's Sleep Book (as I was in my turn). Among the delights for a small bed-time reader, Dr Seuss provides real time statistics about the number of people currently asleep, and (like a good statistics provider) publishes his methodology:
"We find out how many, we learn the amount,
By an Audio-Telly-o-Tally-o Count.
On a mountain, halfway between Reno and Rome,
We have a machine in a plexiglass dome
Which listens and looks into everyone's home.
And whenever it sees a new sleeper go flop,
It jiggles and lets a new Biggel-Ball drop.
Our chap counts these balls as they plup in a cup,
And that's how we know who is down and who's up."
There's also a wonderfully goofy illustration of the machine, and I think it was this, rather than any predestination to work on usage statistics, that made this one of my favourite parts of the book when I was a child.
In the real world, web usage statistics sometimes seem to offer the power (and intrusion) of the Audio-Telly-o-Tally-o Count, only to snatch it away again, and offer subtly different statistics, with various caveats.
As an example, suppose you own a website, and understandably want to know "how many visits did people make to my site in the last week?"
If you had an Audio-Telly-o-Tally-o Count, your "chap" would magically listen and look into everyone's home or office, find people actually visiting the site... and then all that remains is to count the Biggel-Balls.
Of course, that's not what a usage stats tool does.
When someone requests a page from your website, their browser sends one or more requests to your website's "webserver" to send the text, images and other content that the customer needs to view the page. Along with that request comes some information about the customer's computer - its IP address, operating system, screen size, and some information about the kind of browser the customer is using. It also often tells us the URL of the page the customer came from. The webserver can record all this (in a file called a "server log") along with the time of the request, for analysis later.

This combination of facts is fairly unique to the computer - think of it perhaps as being like a footprint. When the customer requests the next page, all this happens again, creating a further "footprint". As the customer visits further pages, his or her computer creates a series of further "footprints" - the process of following their visit for analysis purposes is a bit like following a trail of footprints down a beach.
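To make the "footprint" idea concrete, here is a minimal sketch in Python of how an analysis tool might pull one out of a server log. It assumes the widely-used Apache/Nginx "combined" log format; the sample log line, the field names and the choice of fields making up the footprint are all illustrative, not a description of any particular tool.

```python
import re

# One request, in the common Apache/Nginx "combined" log format.
# (An invented example line; a real server log has one of these per request.)
LOG_LINE = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /products/widget.html HTTP/1.1" 200 5120 '
    '"http://www.example.com/index.html" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"'
)

# A pattern for the combined log format: IP address, timestamp, request,
# status code, response size, referring URL, and browser ("user agent").
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

record = LOG_PATTERN.match(LOG_LINE).groupdict()

# The "footprint": a combination of fields that is fairly (but, as the
# detective will discover, not completely) unique to one computer.
footprint = (record["ip"], record["user_agent"])
print(footprint)
```

The referrer field is what lets the detective see which direction the previous footprint came from - and when it is empty, that is the "teleported in by magic" case discussed below.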
For completeness here I should say that not all analysis runs from server logs. In a popular alternative the webserver includes a small program with each webpage that the customer's browser runs when it assembles the page. The "small program" (used for example by Google Analytics) causes a message to be sent out to a log file for analysis later. And there are other methods. For the purposes of the discussion here, it comes to much the same thing.
As another aside, it is of course sometimes possible to require every user to log in, and then to follow their individually-identified activity with a cookie. That provides more detailed information, but is not always desirable (in some circumstances requiring people to log in makes them go away instead; not everyone will accept cookies, and so on).
Pushing on with the "footprints on the beach" analogy, it's worth noting that we have a science fiction or fantasy beach here - trails can suddenly start as if someone was teleported in by futuristic technology or magic (e.g. the customer came in from a bookmark, or typed the URL of our page rather than following a link that we can detect). Similarly, trails of footprints almost always suddenly stop (e.g. the customer stopped using their browser, or went to another site). The way the Internet works means that customers don't have to do anything formally to leave your site; they just stop requesting pages.
Imagine now a detective following these trails of footprints around on the beach. How do the clues compare with the Audio-Telly-o-Tally-o Count?
The detective has the following problems:
- "Footprints" are fairly unique to a given computer, but not completely so. If Big Corporation Inc. has bought a batch of identical computers and successfully forbids its staff from customizing them in any way, then all the computers will have identical footprints. The detective may struggle to sort out all those size 42 Converse Sneakers. The usual counting rule is to count all this as one user (technically one "unique browser") whereas the Audio-Telly-o-Tally-o Count can magically see several people, and so drops several Biggel-Balls.
- The Audio-Telly-o-Tally-o Count magically sees exactly where people stop using your website and do something else - and where they are still on the website, but not requesting pages. The detective only has the observation that the footprints stopped (the standard is to declare that a user session has ended if there are no more page requests for 30 minutes). Clearly this is arbitrary - the Audio-Telly-o-Tally-o Count might know that the user is still avidly reading a long web page, has broken off to answer the phone etc. So we might get one Biggel-Ball, as opposed to counting a new session each time there is a 30-minute gap.
- The detective is counting "footprints" of a computer, not the people behind it. So imagine a public library which has one computer, on which people come and go all day, many of them looking at your website. The Audio-Telly-o-Tally-o Count magically follows this, counting the people coming and going. The detective, following computer-generated "footprints", does not know that a different human is now filling those shoes. If there's a 30-minute break, of course the detective assumes this is a new session, but if the queue at the computer is moving swiftly enough, then this won't happen often, and several different humans will be counted as one visit.
- The Audio-Telly-o-Tally-o Count magically watches as a user switches from their desktop PC to their laptop or mobile device, or their computer at home, and can tell that this is one human continuing his or her visit. But each of these devices has a different footprint for the detective - the trainers suddenly stop, and a pair of heels carry on down the beach. So (unless the customer identifies themselves, e.g. by logging in) the detective counts a new visit each time the user switches device.
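The detective's counting rules above - one "unique browser" per footprint, and a new session after any 30-minute silence - can be sketched in a few lines of Python. The footprints and timestamps here are invented for illustration; real tools work from millions of log records, but the logic is the same.

```python
from datetime import datetime, timedelta

# The standard (and, as noted above, arbitrary) session timeout.
SESSION_TIMEOUT = timedelta(minutes=30)

# Invented example data: (footprint, time of page request).
requests = [
    ("browser-A", datetime(2023, 10, 10, 9, 0)),
    ("browser-A", datetime(2023, 10, 10, 9, 5)),
    ("browser-A", datetime(2023, 10, 10, 10, 0)),  # 55-minute gap: a new session
    ("browser-B", datetime(2023, 10, 10, 9, 2)),
]

def count_sessions(requests, timeout=SESSION_TIMEOUT):
    """Count sessions: a new session starts whenever a footprint reappears
    after being silent for longer than the timeout (or appears for the
    first time)."""
    last_seen = {}
    sessions = 0
    for footprint, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(footprint)
        if previous is None or when - previous > timeout:
            sessions += 1
        last_seen[footprint] = when
    return sessions

# One "unique browser" per distinct footprint.
unique_browsers = len({footprint for footprint, _ in requests})
print(unique_browsers, count_sessions(requests))  # 2 unique browsers, 3 sessions
```

Notice how the library computer and the device-switching problems show up here: the code cannot tell that "browser-A" at 10:00 might be a different human, nor that "browser-B" might be the same human on a different device. The errors are baked into what the log can record, not into the counting code.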
So, since we do not have an Audio-Telly-o-Tally-o Count, we can't count "visits" exactly in the common-sense meaning of the term. We can count "unique browsers" and "sessions" and combine those into a "visit" - a statistic which has some sources of error, but at least the major sources of error are known and the statistic is captured by a known and reproducible method. Note that the methodology is such that errors will usually result in under-counting: probably better for the business and its advertisers than getting an inflated idea of the traffic. It's currently the best that can be done, not due to the limitations of your usage analytics tool or usage analytics people, but due to problems with what you can actually measure, and the decisions you have to make to interpret this.
That takes us into the realm of the many things where "I don't see why we can't just..." meets "It's a bit more complicated than that". But that needs a whole new blog post.
So this time I think the last word belongs to Dr Seuss, and his observation ("One Fish, Two Fish, Red Fish, Blue Fish") that:
"Every day from here to there,
Funny things are everywhere."