Introduction to exploratory data analysis for geostatistics

Here is a small series of articles motivated by a somewhat broad question from a student using ArcGis Geostatistical Analyst: how to interpret the QQplot, the trend analysis and the variogram? Whether it is Geostatistical Analyst or any other geostatistical tool, we are supposed to start, before any interpolation, with an exploratory analysis of the data. Why? Simply because geostatistical tools assume a number of characteristics in the data, and if these assumptions do not hold for our data set, our results will be wrong. We will discuss which principles geostatistical tools are based on and how to use exploratory analysis tools to check the necessary hypotheses.

Some principles of geostatistics

Let’s start with the basics of geostatistics. Unlike deterministic interpolation approaches, geostatistics assumes that all values within your study area are the result of a random process. A random process does not mean that all events are independent.

Geostatistics is based on random processes with dependence. An example of such a process: we toss three coins and observe whether each lands heads or tails. The fourth coin is not tossed; its result is derived from the first three.

The rule for determining the result of the fourth coin is:

• if the second and third coins are equal, the fourth coin shows the same result as the first;
• if not, the fourth coin shows the opposite of the first.

In a spatial or temporal context, such dependence is called autocorrelation.
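The dependence rule above can be sketched in a few lines of Python (a minimal illustration of the article's toy process, not part of any GIS tool):

```python
import random

def toss_three_then_derive_fourth():
    """Toss three fair coins, then derive the fourth by the dependence rule:
    if the second and third tosses are equal, the fourth equals the first;
    otherwise it is the opposite of the first."""
    first, second, third = (random.choice("HT") for _ in range(3))
    if second == third:
        fourth = first
    else:
        fourth = "T" if first == "H" else "H"
    return first, second, third, fourth

random.seed(42)
print(toss_three_then_derive_fourth())
```

Each of the first three tosses is pure chance, yet the fourth value is entirely determined by the others: a random process with dependence.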

As this is the basic idea behind all of geostatistics, it is worth lingering on it a little. Tossing a coin is the very symbol of chance, so we will call it a "random process". So far, nothing new.

On the other hand, if you toss the coin 100 times, you expect to get heads about as many times as tails. Nothing shocking there either. Now, if we have tossed the coin 99 times and got 50 heads and 49 tails, what would you predict for the 100th toss?

If you say tails, you are doing geostatistics. You know that tossing a coin is a random process, but you are also convinced that the results show a certain dependence on a theoretical model (over a large number of tosses there will be 50% heads and 50% tails).

However, the 100th toss is just as unpredictable as the 1st (one chance in two). That is why understanding geostatistics is not so simple.
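We can check this claim by simulation: among runs of fair tosses that happen to start with 50 heads and 49 tails, the 100th toss still comes up heads about half the time. A minimal sketch (the number of trials is arbitrary):

```python
import random

random.seed(1)
hits = total = 0
for _ in range(100_000):
    seq = [random.random() < 0.5 for _ in range(100)]  # 100 fair tosses
    if sum(seq[:99]) == 50:      # keep runs with 50 heads / 49 tails so far
        total += 1
        hits += seq[99]          # was the 100th toss heads?

print(hits / total)  # close to 0.5: the 100th toss is still one chance in two
```

The history of the tosses tells us about the long-run model, but it does not improve the prediction of any single independent toss.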

Let’s forget our coins for the time being and use a more geographical example. If you randomly draw a pair of XY coordinates anywhere in the world, what would be your chance of guessing its altitude?

Now suppose I tell you that we are following a GPS track, and I give you the altitude of a point every 50 m. What are your chances of guessing the altitude of the next point?

It is the same as with the coins: in theory your chances are identical in both cases but, in practice, if you analyze the preceding GPS points, you could almost predict the altitude of the next one.

Prediction of random processes with dependency

How does all this relate to geostatistics and the prediction of unmeasured values? In the example of the coins the rules of dependence were given, whereas in the GPS example one had to find them out. In practice, the dependency rules remain unknown. Therefore, in geostatistics, there are two main tasks: (1) to discover the rules of dependence and (2) to make predictions. As you can see in the examples, predictions can only be made once we know the dependency rules.
Kriging is based on these two tasks: (1) semivariogram and covariance analysis (spatial autocorrelation) and (2) prediction of unknown values. Because of these two distinct tasks, it has been said that geostatistics uses the data twice: first to estimate the spatial autocorrelation, and then to make the predictions.

Understanding stationarity

Consider again the example of the coins. There is only one dependency rule for the coin tossing, and with this single set of measured values there is no hope of discovering the dependency rule unless someone explains it to us. However, thanks to the continuous observation of many samples (our GPS points), the dependencies can become obvious. In general, statistics is based on a notion of replication: repetition from which one can estimate, and from which the variability and uncertainty of the estimate can be understood.
In a spatial context, the idea of stationarity is used to obtain the necessary replication. Stationarity is an assumption that is often realistic for spatial data. There are two types of stationarity.
The first is called stationarity of the mean. It is assumed here that the mean is constant among the samples and independent of the sample locations.
The second type is called second-order stationarity when applied to the covariance, and intrinsic stationarity when applied to semivariograms. Second-order stationarity is the assumption that the covariance is the same between any two points at the same distance and in the same direction, regardless of which two points you choose. The covariance depends only on the distance between two values, not on their location.

In the previous diagram, the covariance between the pairs of points connected by the black lines is assumed to be the same.

All of this is clear in statistical terms. But since you do not have to be a statistician, let's translate it into plain English.

Here we have points A and points B connected by black lines. Covariance is a measure of how two variables vary together. Regardless of the calculation formula, in this example it is a measure of the height difference of A and B relative to the average height of the terrain. If the values of A were completely independent of B, the covariance would be zero (0). If it is not zero, we can assume there is a link between the two variables A and B. If we consider points separated by the same distance, so as to respect second-order stationarity, the covariance of all these pairs of points must be roughly equal.
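To make this concrete, here is a minimal sketch on an invented 1-D elevation transect (the numbers are illustrative, not real data): the covariance at a given lag is the mean product of the deviations of each pair of values from the overall mean.

```python
# Synthetic elevations sampled at regular intervals along a transect.
elev = [100, 102, 105, 104, 107, 110, 108, 111, 113, 112]

def lag_covariance(values, lag):
    """Covariance between all pairs (z[i], z[i+lag]) at a given lag:
    mean of the products of deviations from the transect mean."""
    mean = sum(values) / len(values)
    pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
    return sum((a - mean) * (b - mean) for a, b in pairs) / len(pairs)

# Under second-order stationarity this value depends only on the lag,
# not on where along the transect the pairs happen to sit.
print(lag_covariance(elev, 1))
print(lag_covariance(elev, 3))
```

On this transect the covariance at lag 1 is larger than at lag 3: nearby points are more strongly linked than distant ones.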

In the diagram above it seems to be the case. In the one that follows, do you think it is also the case?

Well, yes. Even if the differences in value between the pairs in the flat and in the steep areas are very different, the covariance measures the difference between the height of A and the average of the points A, and between the height of B and the average of the points B. It will be substantially the same regardless of the slope. However, in the following diagram:

The covariance of the red pairs will be different from that of the blue pairs. If we take only the blue pairs, second-order stationarity is respected; if we consider the whole set of pairs, it is not.

For semivariograms, the same principle is applied to the variance. Intrinsic stationarity is the assumption that the variance of the difference between observed values is the same between any two points at the same distance and in the same direction, regardless of which two points you choose.
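The semivariogram makes this operational: at each lag, the empirical semivariance is half the mean squared difference between values that far apart. A minimal sketch on the same kind of invented 1-D transect (illustrative numbers only); if intrinsic stationarity holds, this quantity depends only on the lag, not on location:

```python
def semivariance(values, lag):
    """Empirical semivariance at a given lag on a regular 1-D transect:
    half the mean squared difference between values `lag` steps apart."""
    diffs = [(values[i + lag] - values[i]) ** 2
             for i in range(len(values) - lag)]
    return sum(diffs) / (2 * len(diffs))

elev = [100, 102, 105, 104, 107, 110, 108, 111, 113, 112]
for h in (1, 2, 3):
    # Semivariance typically grows with the lag: distant points differ more.
    print(h, semivariance(elev, h))
```

Plotting these values against the lag is exactly what the semivariogram cloud tools in Geostatistical Analyst (or any other geostatistical package) do for you.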

Second-order and intrinsic stationarity are necessary hypotheses to obtain the required replication and, therefore, to estimate the dependence rules, which in turn allows us to make predictions and evaluate their uncertainty. Note that it is the spatial information (a similar distance between any two points) that provides the replication.

The notion of distance will be present throughout the geostatistical analysis. For the time being, let's just rely on common sense: the closer two points are, the more similar their values tend to be, and the more likely they are to be linked.

On the other hand, as they move apart, this link becomes less visible, until the values of the two points become completely independent.

So here we are with a series of assumptions about our data that will allow us to use geostatistical tools to predict values where we have no data.

But what would happen if our assumptions were false? Quite simply, our predictions would be false too. Therefore, BEFORE using the geostatistical tools, we must verify that the basic assumptions these tools rely on are met. This step is called Exploratory Analysis of Spatial Data (AEDS).

Exploratory Analysis of Spatial Data (AEDS)

Exploratory analysis of spatial data allows you to examine your data in different ways. Before creating a surface, AEDS gives you a deeper understanding of the phenomenon under study, so that you can make better decisions about your data.

Note for users of ArcGis Geostatistical Analyst:

In ArcGis, the AEDS environment consists of a series of tools, each of which offers a particular view of the data. Each view can be manipulated and explored, allowing different perspectives on the data. Each view is interconnected with all the other views as well as with ArcMap. That is, if a bar is selected in the histogram, the points in that bar are also selected in the QQPlot (if open), in any other open AEDS view, and on the ArcMap map.

Each AEDS tool lets you examine your data in a different view. Each view is displayed in a separate window and fully interacts with the ArcMap view as well as with the other AEDS windows. The available tools are: Histogram, Voronoi Map, Normal QQPlot, Trend Analysis, Semivariogram/Covariance Cloud, General QQPlot and Crosscovariance Cloud.

Note for others:

Whether in QGis or other GIS software, even if the tools are not packaged as in the Geostatistical Analyst application, you have the same exploratory analysis tools. Everything said here is valid whatever software you use.

Exploring the data distribution, searching for global and local outliers, looking for global trends, examining spatial autocorrelation, and understanding the covariance among multiple data sets are all essential tasks to perform on your data. This set of analyses makes up the AEDS.

In the next article we will discuss the use of histograms according to the AEDS framework.