### Stats - See Updates Below

Lubos is back in the climate wars, and he appears to have a point. I haven't read the paper or done the calculations, but his Mathematica principal components calculation seems to show that there is something fishy about some recent Antarctica data. The data is represented as time series of temperatures from 5509 locations. However, essentially all the variance turns out to be captured by the equivalent of just four [Karhunen-Loeve transformed virtual] measurement locations.

If you represent each time series as a vector, the 5509 measurement points produce a 5509-dimensional vector space - but all the vectors lie within a four-dimensional subspace.

This is quite odd, and suggests a degree of bogosity. Of course it's not odd at all that the temperatures at different locations in Antarctica should be correlated, but the enormous differences in scale of the eigenvalues do not appear compatible with that being the full explanation.
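The check itself is straightforward to sketch. Below is a minimal illustration on *synthetic* data (not the actual Antarctic records): a station-by-time matrix generated from exactly four underlying spatial modes, whose eigenvalue spectrum then shows essentially all the variance in the first four principal components - the signature described above. The station and time counts are just stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 600 readings at 5509 stations, generated from only
# 4 underlying spatial patterns -- a synthetic stand-in for data whose
# variance lives in a four-dimensional subspace.
n_times, n_stations, rank = 600, 5509, 4
patterns = rng.normal(size=(rank, n_stations))   # 4 spatial modes
amplitudes = rng.normal(size=(n_times, rank))    # their time histories
data = amplitudes @ patterns                     # (times x stations)

# Principal components via SVD of the centered matrix; the squared
# singular values are proportional to the covariance eigenvalues.
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
var = s**2 / np.sum(s**2)

print("variance in top 4 components:", var[:4].sum())
print("5th / 1st eigenvalue ratio:", s[4]**2 / s[0]**2)
```

For genuinely correlated but full-rank data, the eigenvalues would fall off smoothly rather than dropping to machine-precision zero after the fourth.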

Anybody know what's really going on here?

UPDATE: OK, I looked through the Steig et al. paper as well as Tapio Schneider's paper developing the statistical methodology they used: Journal of Climate, 14, 5 (March, 2001) "Analysis of Incomplete Climate Data: Estimation of Mean Values and Covariance Matrices and the Importance of Missing Values."

The problem they were dealing with: they wanted to find the best estimate of means and covariances and their behavior over time for the Antarctic continent from irregularly spaced and sporadic weather station records. The frequent gaps in the record mean that those matrices are underdetermined (the time series vectors have unknown components). Consequently, the statistical estimation problem can be ill-posed. That problem can be dealt with by "regularizing" the data (in effect smoothing it). It is a non-linear estimation problem, and Schneider's method is an iterative smoothing method.
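To give the flavor of such an iterative scheme, here is a toy sketch - emphatically *not* Schneider's actual RegEM algorithm, which uses ridge-regularized regression - in which the gaps are alternately filled from a low-rank estimate and the estimate refit by truncated SVD. The function name and all parameters are invented for illustration.

```python
import numpy as np

def iterative_lowrank_infill(data, mask, rank=4, n_iter=200):
    """Toy stand-in for regularized iterative infilling (NOT RegEM):
    alternately fill missing entries from a low-rank estimate and refit
    that estimate by truncated SVD. mask is True where data is known."""
    filled = np.where(mask, data, data[mask].mean())  # crude start: global mean
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mean, full_matrices=False)
        s[rank:] = 0.0                                # regularize by truncation
        estimate = (u * s) @ vt + mean
        filled = np.where(mask, data, estimate)       # keep known values fixed
    return filled

# Synthetic demo: a rank-3 "climate field" with ~30% of readings missing.
rng = np.random.default_rng(1)
true = rng.normal(size=(120, 3)) @ rng.normal(size=(3, 40))
mask = rng.random(true.shape) > 0.3
recovered = iterative_lowrank_infill(true, mask, rank=3)
err = np.abs(recovered - true)[~mask].max()
print("worst infilled error:", err)
```

The key point survives the simplification: the output is, by construction, (close to) a rank-`rank` matrix, so any subsequent PCA of it will show only that many significant components.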

The method has some features and effects in common with the more ad hoc method of truncation of principal components, but has more formal justification. The point is that you wind up with only a few principal components because the others would contribute only noise - their effects are smaller than the other known noise sources.

In any case, the methodology explains Lubos's results - he only sees a few principal components because the others have been filtered out in the RegEM estimation. You should try your method on raw data, Lubos. I predict a lot of large (and spurious) principal components due to the data gaps.
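That prediction is easy to illustrate on synthetic data (again, not the actual records): take a low-rank field, punch out a large fraction of the readings, fill the gaps crudely with each station's mean, and count how many principal components are now needed to capture the variance. The sizes and the 40% gap fraction are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)

# A rank-4 synthetic field standing in for a smoothed reconstruction.
n_times, n_stations = 300, 500
field = rng.normal(size=(n_times, 4)) @ rng.normal(size=(4, n_stations))

def n_components_for(data, frac=0.999):
    """Number of principal components needed to capture `frac` of the variance."""
    s = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, frac) + 1)

print("complete field:", n_components_for(field))   # four suffice

# Knock out 40% of the readings and fill each gap with the station mean
# of the surviving data -- a stand-in for PCA on raw, gappy records.
mask = rng.random(field.shape) < 0.4
col_means = np.nanmean(np.where(mask, np.nan, field), axis=0)
gappy = np.where(mask, col_means, field)
print("mean-filled gappy field:", n_components_for(gappy))
```

The mean-filled residue acts like broadband noise, smearing variance across many spurious components instead of the original four.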

Arun's comments are very nearly correct, with the only qualification that the values in question are not so much modeled as the results of an estimation process.

Another virtue is that the principal components have natural interpretations in terms of familiar large scale weather or climate features.

UPDATE II: The Schneider paper is very clear and available online: http://ams.allenpress.com/perlserv/?request=get-document&doi=10.1175%2F1520-0442%282001%29014%3C0853%3AAOICDE%3E2.0.CO%3B2