Saturday, June 6, 2015

A First Look at Ross McKitrick's "A First Look"

[edit: Ross has been kind enough to reply here, and in fact continued to be kind so I will adjust some of the language here.  He has written an updated version of his article, which can be read here.  I'll also add edits as necessary to reflect our exchange (or language fixes) in the comments; they shall be in red. - 06/08/15, 3:00pm EST]

Recent corrections to buoy and ship temperature measurements have resulted in a revised global temperature dataset which shows that the global warming "hiatus" since 1998 probably doesn't exist.  Karl et al. (2015) comment on the effects of three corrections in particular:

  1. Ship data have been shown to be consistently warmer than buoy data when they measure from the same region; this is important because the prevalence of buoy data has greatly increased within the past couple decades, so this introduces a known cooling bias in those recent years to the raw data.  This correction was carried out by bringing the buoy measurements up by the global mean of these regional differences.  It would have also been possible to correct this by bringing the ship measurements down; more on why this doesn't matter, and is in fact less preferable, later.
  2. In addition to being more prevalent, buoy measurements are also more precise: they are subject to less noise than ship measurements are.  As such, since we prefer to have precise measurements, they were weighted more relative to ship data, by a factor equal to the ratio of the two methods' measurement variances.
  3. Finally, the ship data themselves are wrong as well.  Two main methods exist for ship measurements: engine room intake, which overestimates sea surface temperatures due to exposure to heat in the engine room; and bucket haul measurements, which underestimate sea surface temperatures due to a heightened rate of heat loss from the bucket as it is hoisted from the ocean.  Changes in the prevalence of certain methods relative to each other, as well as changes in the prevalence of insulated v. non-insulated buckets, affect temperature measurements over time.  While corrections for ship data had been applied prior to the start of American involvement in World War II, ship metadata (basically, data about measurement method) shows that large changes in the relative frequency of these methods still occurred after the war.  This insulated v. non-insulated bucket bias in particular was corrected by continuing colocation comparisons to nighttime marine air temperature measurements to the present day, instead of ending them in 1941 as before.
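The weighting in correction (2) is just inverse-variance weighting, a standard statistical technique.  Here is a minimal sketch of the idea; the function name and all numbers are invented for illustration and are not taken from Karl et al.:

```python
import numpy as np

def inverse_variance_mean(buoy, ship, var_buoy, var_ship):
    """Combine buoy and ship readings for one region, weighting each
    source by the inverse of its measurement noise variance."""
    w_buoy = 1.0 / var_buoy
    w_ship = 1.0 / var_ship
    return (w_buoy * buoy + w_ship * ship) / (w_buoy + w_ship)

# Illustrative numbers only: suppose ship noise variance is 3x the buoy's.
buoy_reading, ship_reading = 15.0, 15.4
combined = inverse_variance_mean(buoy_reading, ship_reading,
                                 var_buoy=0.1, var_ship=0.3)
# The combined estimate sits closer to the (less noisy) buoy reading.
```

The precise weighting Karl et al. used differs in detail, but the principle is the same: noisier measurements get less say in the combined estimate.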
These corrections were part of the newly updated Extended Reconstructed Sea Surface Temperature (ERSST) record, now at Version 4.  The first was detailed by Kennedy et al. (2011B), and the second and third by Huang et al. (2015A) (who provided the final ERSST.v4).  As Karl et al. state in their paper, the contribution of each to the increased 2000-2014 trend of 0.064˚C/decade over ERSST.v3b was 0.014˚C/dec, 0.012˚C/dec, and 0.030˚C/dec.

People familiar with the wider array of indicators we have for global heat accumulation will know that the "hiatus" was an artifact of surface and atmospheric temperatures only.  The global oceans, the largest heat sink (shown below), show no sign of such a hiatus, which is why, instead of the "skeptic" cry of "global warming stopped in 1998", serious scientific inquiries have tried to explain the alleged hiatus through minor changes in surface radiative forcing, changes in atmospheric circulation, or an increase in ocean heat uptake.
The newest paper from Karl et al. essentially calls into question the idea of a hiatus at all, and instead blames it on incomplete sampling and a change in sea surface temperature measurement methodology over the same time period.

Watts Up With That has been hard at work trying to find some fault with the paper, and as shown by Sou at HotWhopper, Anthony probably isn't trying his best to provide an objective approach to analyzing the paper.  (I greatly understate that.)  But a somewhat serious person, Dr. Ross McKitrick, has commented on the paper there all the same, and I'd like to respond to some of his "first looks" at the new paper.

Minor Points

To get a couple small things out of the way: Ross starts off his article by saying that the idea of a hiatus comes from examining several datasets, all of them some version of surface or atmospheric temperatures (lower tropospheric in particular).  He says his last dataset is from the 0-2000m Argo float network; actually, the dataset I graphed above is from that.  The caption of the figure he cited specifically says it comes from 5-m data, again a surface record, and I am particularly disappointed that his graph starts at the peak of the 1997-1998 El Niño, but so be it.

That all of these show the same "hiatus" is more an indication that they're all basically measuring the same thing.  But even then, one has to wonder what McKitrick means by "examination", since especially for the first several datasets he links to, it's not at all clear that there has been a hiatus, especially given the rather (ahem) inconvenient truth of the data from the most recent ~12 months.  But even without the 2014 data, has McKitrick done any analysis like what Tamino has performed on the GISTEMP data?  Change point analysis, ANOVA, fitting polynomials to residuals?  My guess is he has not, given that Tamino found, for each test, results that were (and I paraphrase only slightly) "not even close" to significant.

The "hiatus" appears to be clear if you compare model predictions with observations—but it really does help to know why models predict what they do, and what happens if you give the models the correct inputs and correct ENSO pattern, something "skeptics" don't seem keen on discussing in detail.  Kevin Cowtan at Skeptical Science has done that in spades.

Major Points

Small potatoes aside, let me get to some of the more substantive points of McKitrick's post.  His first point to make with regard to correction (1) is to point out what he thinks is a large uncertainty in the bias between buoys and ships (emphasis McKitrick's original):
However, Kennedy et al. note that the estimate is very uncertain: it is 0.12±1.7˚C!
McKitrick refers to Table 5 of Kennedy et al., which shows that the mean global estimate is 0.12˚C, with a standard deviation of 0.85˚C (so he obtained 1.7 by multiplying that by 2, which is fine).  This is a rather curious mistake for someone who knows about statistics to make, because what was being calculated was a sample mean.  (edit: revamped this next sentence for clarity)  For the uncertainty of a sample mean, you do not use the standard deviation of the samples, but the standard error: the square root of the variance divided by the sample size (i.e. √[var(x)/n] ).  Table 5 also gives the standard error values, and it is those values that should be used for the estimate of the uncertainty of the mean.  (If one wants to calculate the standard errors for each value, one can again refer to Table 5, where the overlap counts, i.e. the sample sizes, are clearly included in the right-most column.)  The estimate is thus not very uncertain, but in fact very certain: 0.12±0.02˚C.  Karl et al. used this global mean value.
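To see the SD v. SE distinction numerically, here is a small simulation.  The 0.12˚C and 0.85˚C figures come from Table 5 as quoted above; the sample size is an assumption I back-computed from the reported 0.02˚C standard error, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1806           # assumed: roughly (0.85 / 0.02)**2, back-computed from the SE
true_bias = 0.12   # global mean ship-buoy difference (Table 5)
spread = 0.85      # standard deviation of individual differences (Table 5)

# Simulated individual ship-minus-buoy differences:
diffs = rng.normal(true_bias, spread, n)

sd = diffs.std(ddof=1)    # describes the scatter of single ship-buoy pairs
se = sd / np.sqrt(n)      # describes the uncertainty of the MEAN difference
```

A single colocated pair is indeed uncertain to within roughly ±1.7˚C, but the mean of ~1800 such pairs is pinned down to within a few hundredths of a degree, which is the quantity Karl et al. actually used.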

[edit: Ross' correction is to clarify that the regional uncertainty is high for these values.  I agree with him, but I do not think that this solves the standard error v. standard deviation problem, as sample size helps to fix that problem when calculating a mean.  I can probably write a post to illustrate this.]

McKitrick's next point about (1) is actually to contest the use of the global mean itself, instead pointing out how the differences vary by region and that other analyses from the Hadley group (i.e. Kennedy et al.) and Hirahara et al. (2014) used the regional analyses.

He doesn't really provide any analysis to show why the use of the global mean would cause an important deviation from the use of the regional means, which is understandable since, after all, these are new papers and McKitrick probably doesn't have ready access to all of the data; but throwing in this faux uncertainty based on nothing more than two other papers using the regional values should be discouraged.  Or rather, one paper: Hirahara et al. used the global value for their analysis (emphasis mine):
The mean ERI bias of +0.13˚C is obtained and is within the range for the global region listed in Table 5 of Kennedy et al. (2011).  The biases appear to vary regionally and seasonally with large biases in the Central North Pacific and the southern oceans.  However, sampling data are insufficient to attribute the features to the ERI bias.  Thus, only the global mean bias is used.
So Hirahara et al. seem to be of the opinion that regional analysis may not be appropriate.  Either way, we should remember a common platitude at this point: all models are wrong, but some are useful.  I think that, unless McKitrick would like to provide some alternative analysis that shows a very large difference between using the global mean value v. the regional values, the model used by Karl et al. (and Hirahara et al.) can be considered useful for what it intends to do: fix the known ship biases.

McKitrick does not contest correction (2), but says of correction (3) that:
However, this particular step has been considered before by Kennedy et al. and Hirahara et al., who opted for alternative methods in part because, as Kennedy et al. and others have pointed out, the NMAT data have their own "pervasive systematic errors", some of which were mentioned above.
More context shows this is incorrect, though.  The very next sentence in the Kennedy (2013) paper he cites, for instance, is this:
The use of NMAT to adjust SST data is, to an extent, unavoidable as the heat loss from a bucket does depend on the air-sea temperature difference.
The Kennedy (2013) paper discusses two methods that were proposed by previous researchers, both of which use NMAT data to figure out the relative fraction of insulated v. non-insulated buckets.  The later one, used by Karl et al. (and also by Huang et al., as McKitrick fails to point out), was proposed by Smith and Reynolds (2002) and used collocated measurements; the earlier one was proposed by Folland and Parker (1995), who used:
[...] a simplified physical model of the buckets used to make SST measurements combined with fields of climatological air-temperature, SST, humidity, wind and solar radiation. [...] The fractional contributions of canvas and wooden buckets were estimated by assuming a linear change over time from a mix of wooden and canvas buckets to predominantly canvas buckets by 1920.  The rate of this change was estimated by minimizing the air-sea temperature difference in the tropics.  The same method was also used in Rayner et al. (2006) and Kennedy et al. (2011C).
In other words, the Kennedy et al. (2011) papers also used NMAT in some fashion to estimate the fractional contribution of insulated v. non-insulated buckets.  Furthermore, again from Hirahara et al. in their section 4(c), they also used the Folland and Parker method, which uses NMAT; however, to Ross' point as well, Hirahara et al. used metadata in addition to NMAT to estimate the fraction of insulated v. non-insulated measurements, in particular after 1971 and in fact exclusively between 1941 and 1971.

So what does all of this mean?  Essentially every bucket correction used NMAT in some fashion: Smith and Reynolds, Folland and Parker, Rayner et al., Hirahara et al., Kennedy et al., Huang et al., and Karl et al.  All of the papers say this.  In particular, discussion of Karl et al. should include discussion of Huang et al., which to my understanding used exclusively NMAT corrections to calculate the bucket biases; Karl et al. replicate their analysis directly.  Ross suggests this method is not preferred, but argues this only from the use of another method by a different group.  The comparisons between these two methods are shown in the first figure in the next section; the results seem similar between Huang et al. and Hirahara et al. in the 1998-2000 period in question, so I'm disinclined to agree there's a problem here.

"Numerical Example"

Several of the simulated corrections Ross provided in his numerical example do not seem justified given the types of corrections applied above.  To start, he introduces a linearly increasing negative adjustment to the pseudo-ship data starting in 1940.  Why?  It doesn't seem at all relevant to any of the three corrections above, unless he thinks that is what the continued NMAT correction is.  Ross does not really discuss the Huang et al. paper, from which Karl et al. said they got the new ERSST.v4 corrections.  It should be brought up, however, because their Figure 6, which shows the full effect of the continued NMAT correction, makes clear that it is not a linear increase with time.
It doesn't even matter that much in the final model since the simulated ship fraction goes down with time, but since McKitrick seems to be so critical of Karl et al. for not choosing the particular NMAT method that he likes (for no particular reason), we should probably expect some due diligence here too.

This particular correction would cause a downward trend, though it doesn't really matter in the long run for two reasons.  First, the buoy fraction takes over in Ross' example.  Second, the model starts to diverge toward the end of the series because of the massive over-correction applied to the pseudo-buoy series.  Why does McKitrick assume that the correction should be a massive overstatement like this?  He gives no reason anywhere in his post; and since the quasi-independent estimates given by Kennedy et al. and Hirahara et al. are very close to each other, we might be safe in saying they're much more accurate than the over-correction in Ross' model would imply.  So, I'll choose the "correct" correction.  What's more, when you remove that unjustified growing negative correction on the ship series, you get a more accurate model, as shown.
The dip toward the end is due to the similarly massive negative "corrections" applied to the ship data in 1990 and 2000, which McKitrick thinks are necessary, though according to Huang et al. above they are not, at least for 1990.  But McKitrick's model does not seriously consider the actual reason that the NMAT correction is needed in the first place, which is to divine the fractional composition of insulated v. non-insulated bucket measurements.  So it's not even clear that this correction should be simulated in this fashion; in fact it's almost certain that's not how it should be done at all.
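To make the over-correction point concrete, here is a toy blend of my own.  This is not McKitrick's actual setup: the flat "true" series, the 0.12˚C bias, the buoy fractions, and the 0.30˚C over-correction are all invented for illustration.  The point is that even when the true temperature is perfectly flat, applying too large a buoy adjustment while the buoy fraction grows manufactures a trend, whereas the correct adjustment does not:

```python
import numpy as np

years = np.arange(2000, 2015)
true_anom = np.zeros(len(years))                  # flat "true" temperature
buoy_frac = np.linspace(0.1, 0.9, len(years))     # buoys take over with time

ship = true_anom + 0.12    # ships read warm by the known bias
buoy = true_anom.copy()    # buoys read the true value

trends = {}
for adj in (0.12, 0.30):   # correct adjustment vs. an over-correction
    blended = buoy_frac * (buoy + adj) + (1 - buoy_frac) * ship
    trends[adj] = np.polyfit(years, blended, 1)[0]   # slope in deg/yr

# With adj=0.12 the blend stays flat; with adj=0.30 the changing buoy
# fraction alone introduces a spurious warming trend.
```

This is exactly why the size of the simulated correction matters: the divergence in Ross' figure comes from his chosen over-correction, not from anything inherent to the blending.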

If the accuracy of the NMAT correction is in question, then let's hear why, not these misleading/ambiguous statements about which researchers prefer one method over another.  And let's certainly not vastly overestimate the correction in a numerical example to make a misleading point as well.  Until the time when McKitrick (or any one else at WUWT) wants to take this paper seriously, I think the best thing to do is to consider the authors as knowing what they're talking about.


  1. "Table 5 gives the standard error values"
    I was a little confused here because all info is in Table 5, just in different columns, maybe you could write:
    "Table 5 also gives the standard error values"

    A statistician confusing the sample standard deviation and the standard error of the mean, that is quite something. Scary what mitigation scepticism does to a person.

    1. Thanks! I'll clarify that a bit more.

    2. McKitrick isn't a statistician any more than McIntyre is, rather he's an economist.

  2. I have posted a revised version of the document at that addresses some of your criticisms. I will send it to Anthony to post at WUWT.
    Regarding the use of SD or SE, the site-specific uncertainty is as indicated by the SD. I have revised the text to clarify that the bias uncertainty applies to the specific location.

    As to whether it matters using region-specific information or the global mean, in several places you ask whether this or that adjustment matters. There is a prima facie case that the 3 main adjustments in K15 must matter since they themselves attribute their new results to them. This presumably includes, in part, using +0.12 everywhere rather than using variable adjustments based on metadata.

    You are correct that Hirahara et al use a global mean adjustment, but they also use a model of variations in bucket type over time to change the weights. However they do not use regional variations--I misread them on that point and I've corrected that statement.

    You are misreading my paragraph on adjustment #3 and your accusation of lying is uncalled for. For the post-1941 period Kennedy and Hirahara use MAT data to estimate aspects of their adjustment calculations but they also use metadata and they do not rely exclusively on NMAT. The K15 group rely on achieving a trendless difference between SST and NMAT in collocations to calibrate their bias coefficients, and they themselves state that this introduces a large change in the 1998-2000 interval that has a big effect on the end-of-sample trend. I have revised the text to clarify the point that other teams don't solely rely on fitting to NMAT but also use metadata.

    I have nothing against NMAT or any other data set (in fact I've added the NMAT graphs to my document). The point I am making is that the K15 results arose not due to new data but new adjustments based on their expert judgment, and it is not immediately obvious that these are the correct decisions. The numerical example is not meant to replicate the construction of SST data, but to show how estimated adjustments can introduce trends not observed in the underlying data. You could construct any number of examples that go any way you like.

    1. All right, I made some mistakes in my previous comments replying to you (mixing up the 0.030˚C figure, commenting on whether there is metadata v. not metadata in a contradictory way, and so on). I would edit if I knew how, so I deleted my comments and allow me to reply again with some brevity. Again, for each paragraph respectively:

      I'm not sure that fixes the SD v. SE problem, since the error of the sampling mean is given by the standard error. Even if the regions did not differ, you seem like you would be making the same argument. It seems John Kennedy feels the same way.

      I ask why it matters mainly because I am not convinced that the regional usage would make a very large difference from the global usage. I do agree that the global usage's contribution of +0.014˚C is important, but there does not seem to be a good reason to suspect the regional usage would give a substantially different number. Clearly Hirahara et al. did not see it as a big issue.

      (I don't have much to say for your next paragraph, but my next below may tie in a bit.)

      I probably am misreading it, and I will also remove accusations of lying from my paper. I have reread the relevant section in Hirahara et al., and yes you are correct that they compare temperatures to other metadata to divine the fractional contribution, exclusively from 1941 to 1971 and combined with NMAT after. I can correct my article and commentary in that regard.

      Given your clarification on NMAT, I can fix up other commentary that suggests you don't trust it. My main point with the numerical example criticism wasn't that it's a bad idea to show how the corrections can cause certain effects (I use a similar example in my next post), but rather to point out that the main positive uptick at the end of your graph is due to a choice of large correction to your simulated buoy data that does not seem justified. Since the folks who read WUWT tend to dislike corrections that give warmer trends, that your example happened to give a more positive trend compared to the "true" value didn't seem honest. If that was not your intent, I apologize. My graph with the blue line is perhaps another way of giving a different example, as you suggest in your last sentence.

    2. OK, my edits have been added, I hope they are fair.

    3. Thanks for making the edits.

      Regarding the SD vs SE question, there is no dispute that SE gives the variance of the estimated global mean, if the unweighted global mean is what you are interested in. What is really at issue is how good an approximation +0.12 is to each site-specific adjustment. Remember that SE is attached to the unweighted global mean. But the observations are heavily clustered in shipping lanes, overweighted in the NH Atlantic. To estimate the global mean you need to weight observations according to location and sample density. Suppose that you had enough meta data to do that. Then you wouldn't need to use a global average as an approximation everywhere, you could use location-specific bias adjustments. After all, why throw out all that information? But the data are inadequate and K15 rely instead on a single global number. It artificially understates the uncertainty of what they are doing to refer to the unweighted global mean SE. Put another way, the difference between an SD of 0.85 and an SE of 0.02 is an indicator of the amount of information lost by not using location-specific adjustments.

      As to whether it matters: again, it's one of 3 steps that together make all the difference. So yes, it matters. Is it correct? Not for me to say, but it is important to point out how much information is being lost in this step.

      I don't refer to Huang et al. but I assume K15 = Huang et al. for the purpose of the ERSSTv4 data set. That is, K15 use the data set from Huang et al. So the references are interchangeable.

      There is another key methodological difference that hasn't been discussed yet: The change in the smoothing algorithm for the SST-MAT bias adjustments between ERSSTv3b and v4. Assumption 5 in the Huang/Karl method requires the bias fitting coefficients only to vary slowly and smoothly over time. v3b used a linear coefficient model. v4 uses a Lowess smoothing model with a relatively low parameter (f=0.1) that allows the parameter to move up and down within a single decade. Look at Fig 5 in Huang et al and you will see that f=0.2, which is still low, flattens the end segment of the bias adjustments. I suspect this would be an influential change. So you have to be prepared to make a convincing case for f=0.1 rather than f=0.2; but the fact that the choice matters is an indication of non-robustness.

    4. "[...] But the observations are heavily clustered in shipping lanes, overweighted in the NH Atlantic. To estimate the global mean you need to weight observations according to location and sample density. [...]"

      Hm... actually I agree with you here. The average should be area-weighted.

      I don't have a map of their locations but we can probably do some very crude calculations. They also don't rigorously define (for us) their regions, although there are some more approximations we can use for napkin math.

      My rough approximations are that the oceans are equally distributed between the north, south, and tropical regions, and also (where required) divided equally between east and west; and that the measurements are uniformly distributed amongst those. And so the Pacific with a total area of ~165 million square kilometers would have its N, TE, TW, SE, SW regions cover 55, 27.5, 27.5, 27.5, and 27.5 mill. km^2 respectively. Not correct, but useful.

      Using this area weighting scheme, for a mean value I get 0.146 instead of 0.121. For a standard error, since the equation I used for the mean was essentially:

      m = w1*x1 + w2*x2 + ... + w11*x11

      I think I can just use a propagation of errors formula, and I get 0.012. So, still pretty low. So a slightly higher value, roughly equivalent error. I can do a quick post on this issue (and I would welcome feedback there when I get it up).
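      The napkin math above can be sketched as follows.  The regional values and areas here are placeholders, not the actual Table 5 entries (which I don't reproduce here); the point is only to show the area weighting and the propagation-of-errors formula for a weighted sum:

```python
import numpy as np

def area_weighted_mean(values, std_errs, areas):
    """Area-weighted mean of regional bias estimates, with its standard
    error from the propagation-of-errors formula for a weighted sum:
    se(m) = sqrt(sum_i (w_i * se_i)^2), where the w_i sum to 1."""
    w = np.asarray(areas, float)
    w = w / w.sum()
    m = np.sum(w * np.asarray(values))
    se_m = np.sqrt(np.sum((w * np.asarray(std_errs)) ** 2))
    return m, se_m

# Placeholder regional estimates (NOT the real Table 5 numbers):
vals  = np.array([0.10, 0.18, 0.05, 0.15])    # regional ship-buoy biases
ses   = np.array([0.03, 0.05, 0.04, 0.03])    # their standard errors
areas = np.array([55.0, 27.5, 27.5, 55.0])    # million km^2, invented split
m, se_m = area_weighted_mean(vals, ses, areas)
```

      Because the weights are fractions that sum to 1, each regional standard error enters the combined error shrunk by its weight, which is why the area-weighted SE stays small even when individual regional errors are larger.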


      "v4 uses a Lowess smoothing model with a relatively low parameter (f=0.1) that allows the parameter to move up and down within a single decade. Look at Fig 5 in Huang et al and you will see that f=0.2, which is still low, flattens the end segment of the bias adjustments."

      This really depends on what you mean by "high" or "low". The values themselves do not matter so much as their effects on the smoothing; since an f value of 0.2 causes considerable smoothing of the data, in particular across the WWII transition, it's a lot "higher" in what it does. Huang et al. do discuss that as a particular reason why 0.1 is preferred, because it better preserves that change; even more so to the point, they say it may not even be good enough at doing that.

      They do provide discussion on why 0.1 is preferred: it serves as a balance between filtering out spurious noise and preserving the multidecadal variability that other authors have pointed out. Clearly 0.05 would be worse at accomplishing the former, and 0.2 worse at accomplishing the latter. And 0.2 doesn't recreate the war transition well. So in short, Huang et al. do make a case for 0.1 over 0.2.

      "the fact that the choice matters is an indication of non-robustness."

      The robustness of the end effect is only as dependent on the choice of smoothing factor as the choice between 0.1 and 0.2 is uncertain. That we can see a difference between the two is only slightly less banal an observation than that we see a difference between 0.1 and the straight-line approximation used in ERSST.v3b. The question is: is there any a priori reason (that is, not looking at the end of the series) to prefer 0.1 over 0.2? And yes, there are a couple reasons to prefer it, namely (1) that there is variability the authors judged to be important which 0.1 captured better than 0.2 did, and (2) that 0.1 does a much better job of displaying the WWII transition than does 0.2, a feature we know from metadata was very sudden (and which 0.2 would give the impression of not being sudden).
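      Since the f-parameter discussion is concrete, here is a hand-rolled minimal lowess (tricube-weighted local linear fits, my own simplified sketch rather than the actual ERSST code, with no robustness iterations) applied to a step series standing in for the sudden WWII transition. The smaller span tracks the sudden change better, which is the a priori case for f=0.1:

```python
import numpy as np

def lowess(x, y, frac):
    """Minimal lowess: local linear fit at each point with tricube
    weights; the neighborhood size is set by frac (fraction of data)."""
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))
    yhat = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]             # k nearest neighbors
        w = (1.0 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube kernel
        sw = np.sqrt(w)
        # Weighted least squares via ordinary lstsq on scaled rows:
        A = np.column_stack([np.ones(k), x[idx]]) * sw[:, None]
        beta, *_ = np.linalg.lstsq(A, y[idx] * sw, rcond=None)
        yhat[i] = beta[0] + beta[1] * x[i]
    return yhat

# A sudden transition, a stand-in for the WWII jump in measurement mix:
x = np.arange(100, dtype=float)
y = np.where(x < 50, 0.0, 1.0)

err_small = np.mean(np.abs(lowess(x, y, 0.1) - y))   # sharper span
err_large = np.mean(np.abs(lowess(x, y, 0.2) - y))   # heavier smoothing
# err_small < err_large: the smaller span blurs the step less.
```

      The wider the smoothing span, the more points on either side of the jump get averaged across it, so f=0.2 necessarily smears a sudden transition that f=0.1 can still resolve.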