Surprisingly little actual analysis seems to get done in the world of Web Analytics. A huge library of common statistical methods exists for analysing data: from simple methods like correlation to more complex methods like linear and non-linear regression or cluster analysis. So why are these techniques not routinely employed in web analytics the way they are for most other data analysis, asks Gary Angel, president of digital analytics consultancy Semphonic. This article is copyright 2012 The Best Customer Guide.

The problem, according to Angel, is that the topographic nature of websites prevents these basic statistical techniques from working well in the digital realm. He identifies several methods for solving the unique problems that digital data and website structure present, and highlights a method that can effectively enable statistical analysis of digital data by controlling the effects of topography. These techniques provide a completely new set of opportunities for effectively measuring, analysing, and optimising your digital properties to drive better online performance.

When we do digital analytics, the essential behaviour we see is a trail of where a visitor went in the virtual space of a single site, and we assume that when visitors navigate to a place, they did so with a fixed intention. In other words, we tend to assume that the content that visitors view is an accurate reflection of their interests.

This assumption that what a visitor chose to view is reflective of their interests and intentions seems like a fairly safe bet. However, the way visitors traverse a website is controlled, to some extent, by the options and paths you provide. A website has a structure, so the visitor is encouraged to travel on certain link paths to reach a destination. Indeed, some paths may not be available to a visitor at all. So a website can sometimes be like a magician's trick: the card you pick was forced on you and isn't what you meant to choose at all. There is a fundamental tension between these two basic principles: if we don't take account of the structure of a website when we examine behaviour, we are highly likely to misread intention.

In a statistical analysis of traffic in San Francisco, HWY 1 is strongly correlated with reaching the Golden Gate Bridge. In web terms, we might assume that if getting to the Golden Gate Bridge is our success metric, then HWY 1 is a major contributor. And, of course, it is - but in a totally meaningless way. There are only two ways to get to the bridge, and HWY 1 is one of them, so if it wasn't strongly correlated, we'd be very surprised. But no analyst would ever be foolish enough to think that a straightforward correlation model would work for analysing city traffic. Surprisingly, though, many have made exactly that mistake when it comes to websites. This is surprising because websites are very much like city streets. Some pathways are big and broad, while others are small and narrow and, sometimes, there is simply no direct way to get from Point A to Point B.

Basic statistical analysis techniques aren't designed to handle data sets where the data is topographically arranged - and the structure of websites creates a deep topology to web data. Simple correlation analysis, for example, does nothing to separate out the impact of natural structure and visitor intention. So pages that are closely related navigationally are almost always highly correlated. This makes it impossible to read true intention from the numbers and renders simple correlation almost completely useless.

So any real analysis of visitor behaviour has to take account of topology before correlation or intentionality can be inferred. From a heuristic standpoint, we think of websites as having a hierarchical structure. At the top is the Home Page, followed by Section or Main Menu pages, and underneath each of these pages lives additional content, often with further hierarchical nestings. While clearly valuable, this type of abstract hierarchical ordering isn't perfect. It doesn't provide a clear representation of the distance between two points, nor does it capture many of the intricacies of website structure. Key content, for example, is often directly available from the Home Page but may be 'structured' as several layers deep within the website.

Nevertheless, many UI designs begin with a hierarchical structure diagram of the website, and this type of representation is a good place to start thinking about capturing a topography in digital form. With this type of representation, we could map the distance between any two pages as the number of boxes along the hierarchy that have to be traversed to reach one from the other. By instantiating an abstract digital hierarchy like this, the analyst can create a 'design view' of the distance between any two points on the website. For a complex website, building this abstract website structure is a lot of work, especially if a pre-existing design topology hasn't been constructed as part of the UI design. Still, as a digital representation of the design view of the website, it can be an invaluable asset to analysis.
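As a rough sketch of this idea, the hierarchy can be stored as parent links and the distance between two pages counted as the number of edges traversed through their nearest common ancestor. The page names and structure below are invented purely for illustration:

```python
# Hypothetical design hierarchy: each page maps to its parent (None = root).
SITE_TREE = {
    "home": None,
    "products": "home",
    "support": "home",
    "widgets": "products",
    "gadgets": "products",
    "faq": "support",
}

def path_to_root(page):
    """Return the list of pages from a page up to the root, inclusive."""
    path = [page]
    while SITE_TREE[page] is not None:
        page = SITE_TREE[page]
        path.append(page)
    return path

def tree_distance(a, b):
    """Number of hierarchy edges between two pages in the design view."""
    ancestors_a = path_to_root(a)
    ancestors_b = path_to_root(b)
    ancestor_set_b = set(ancestors_b)
    # Walk up from 'a' until we reach an ancestor shared with 'b'.
    for steps_up, node in enumerate(ancestors_a):
        if node in ancestor_set_b:
            return steps_up + ancestors_b.index(node)
    raise ValueError("pages are not in the same tree")

print(tree_distance("widgets", "faq"))  # 4: widgets-products-home-support-faq
```

The same function gives a distance of 2 between sibling pages like 'widgets' and 'gadgets', which is the kind of objective distance variable the rest of the article relies on.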

Fortunately, you don't have to invent a topology for a website. Using an algorithmic approach to analysis, it's possible to create a behavioural topology of the website. The behavioural map works by creating a topography of the website based on visitors' 'previous page' steps. To begin with, the analysis identifies all top-level pages - those where a majority of views are classified as 'entries'. Next, any webpage for which the most common 'previous page' is one of the top-level pages becomes a second-level page, with its parent node being the relevant top-level page. This process continues until every page on the website has been classified somewhere in the tree.
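A minimal sketch of this classification process, using invented session data (each session is simply the ordered list of pages a visitor viewed):

```python
from collections import Counter

# Hypothetical clickstream sessions, for illustration only.
sessions = [
    ["home", "products", "widgets"],
    ["home", "support", "faq"],
    ["products", "widgets"],
    ["home", "products", "gadgets"],
]

# Count entries and total views per page, and 'previous page' frequencies.
entries, views = Counter(), Counter()
prev_counts = {}  # page -> Counter of pages seen immediately before it
for session in sessions:
    entries[session[0]] += 1
    for i, page in enumerate(session):
        views[page] += 1
        if i > 0:
            prev_counts.setdefault(page, Counter())[session[i - 1]] += 1

# Top level: pages where a majority of views are entries.
parent = {p: None for p in views if entries[p] / views[p] > 0.5}

# Repeatedly attach each unclassified page beneath its most common
# 'previous page', once that page has itself been placed in the tree.
changed = True
while changed:
    changed = False
    for page, prevs in prev_counts.items():
        if page in parent:
            continue
        most_common_prev = prevs.most_common(1)[0][0]
        if most_common_prev in parent:
            parent[page] = most_common_prev
            changed = True
```

With this toy data, 'home' comes out as the only top-level page, 'products' and 'support' attach beneath it, and so on down the tree - a behavioural hierarchy derived entirely from observed navigation rather than the design diagram.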

This behavioural topography is incredibly rich for analytic purposes. It can provide a deep measure of the difference between your designer's view of the site and actual usage, and a topographic distance between points on the website with which to measure correlation more appropriately. You can also use it to measure relationships between branches of the tree and to find places where jumps across branches are common or rare.

A topographic approach provides unique algorithmic website analytics that are fundamentally different from those you could hope to create with more traditional statistical methods such as correlation or regression. However, one of the virtues of a topographic analysis is that, by establishing both a logical and a behavioural distance between points, it provides a method of controlling for distance between two points. With distance available explicitly as a variable, it's much easier for the analyst to incorporate it as part of a standard statistical analysis.

With a direct incorporation of distance from the topographic mapping, there is no subjective assessment involved and the distance measurement is rigorous. Correlation to success between pages that are equidistant in the topology is significantly more meaningful because you've controlled for the most important influence of website structure.
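One simple way to apply this control, sketched below with entirely invented page pairs, correlations, and distances: group page-to-success correlations by topographic distance, so that comparisons are only made between equidistant pairs.

```python
# Hypothetical (page pair) -> (raw correlation with success, tree distance).
pairs = {
    ("products", "checkout"): (0.62, 2),
    ("faq", "checkout"): (0.30, 2),
    ("widgets", "checkout"): (0.55, 4),
    ("blog", "checkout"): (0.18, 4),
}

# Bucket pairs by distance so each comparison holds structure constant.
by_distance = {}
for pair, (corr, dist) in pairs.items():
    by_distance.setdefault(dist, []).append((pair, corr))

# Within each distance band, rank pairs by correlation.
for dist, group in sorted(by_distance.items()):
    group.sort(key=lambda item: item[1], reverse=True)
    print(f"distance {dist}:", group)
```

The point of the grouping is that a 0.55 correlation at distance 4 may be far more impressive than a 0.62 at distance 2, because the structural 'push' toward success is much weaker four steps away.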

So web analytics data isn't directly analysable with straightforward statistical techniques because websites aren't simply open collections of randomly accessible content. Pages on a website are not equidistant, and because websites embody a structure, it's impossible to analyse behaviour meaningfully unless you've first accounted for that structure.

A topographical design model creates a logical model of the site (rather like a sitemap) and then counts distances between nodes in the hierarchy. Even better, a behavioural topology model can be built showing how users actually navigate the structure, and from this model, distance between nodes can be calculated based either on the distance in the tree or on the average number of clicks actually taken between points.

These topographic models create numerous new analytic opportunities that few web analysts have explored. They form a distinct set of analysis techniques that are quite different from traditional marketing analytics. Interestingly, however, they also open up the opportunity to use classic statistical analysis techniques more fully. By creating objective measures of distance and a true topography of the website, these models make it possible to look at the relationship between content and outcome on the website while controlling for the site's inherent structure.