The National Weather Service (NWS) has provided river forecasts for navigation and flood warning since the mid-1800s (National Oceanic and Atmospheric Administration (NOAA), 1994). The NWS mission includes the protection of life and property and enhancement of the national economy. In so doing, the NWS strives to deliver quality forecasts with increasing accuracy. One established method of improving forecasts is a post-analysis, or verification, comparing forecast data to observed data. The typical method of determining the accuracy of a river forecast has been to pair observed and forecast values in order to calculate statistics such as root mean-squared error (RMSE) and mean absolute error. Such calculations are certainly useful for comparing magnitudes of error from one event to another at a single forecast point, but they cannot easily be compared from one location to another. For example, one river may vary in flow from thousands to tens of thousands of cubic feet per second (cfs) while another varies from hundreds to thousands. An RMSE of 1000 cfs may indicate a "good" forecast for the former river, whereas an RMSE of 1000 cfs for the latter would not. There is a need for a measure that is normalized from one river to the next, so that the agency can make fair assessments and prioritize identified needs.
Statistics involving error magnitude also do not take the rarity of the event into account. River conditions vary greatly with the weather and many other hydrologic and geologic variables, resulting in a wide range of flows. Conditions that are common, or in line with climatology, might be considered "easier" to forecast than rarer events. The Linear Error in Probability Space (LEPS)-based skill score (Wilks, 1995) could take this difficulty into account and add valuable information to verification. Additionally, the LEPS-based skill score is not dependent on the scale of the variable and could be used as a normalized score.
The purpose of this project is to develop the LEPS-based skill score (SSLEPS) and test its usefulness alongside more traditional methods of verification. The score will be determined using monthly cumulative distribution functions (CDFs), or frequency curves, of mean daily flow for six separate river flow gaging points. The six points have both similarities and differences in basin size, climate regime, and river bed slope. Results using the LEPS-based skill score will be used in conjunction with the RMSE to provide the forecaster or agency with additional information regarding the quality of the forecast.
Fulfilling the NWS mission to provide quality river forecasts has progressed with the implementation of vastly improved technology, higher resolution data sets, and the enhanced ability to apply scientific methods to hydrologic forecasting. Assessing forecasts after the fact can show trends in improvement and help confirm that resources are being used wisely. Post-analysis can also increase the experience-based knowledge of the forecaster by indicating biases in forecast models or the input of inaccurate data into those models. Statistical calculations such as RMSE can show such trends. However, in serving the public, how can the NWS state whether or not a forecast was of good quality? During a flood, errors are typically higher than during normal flow. How can we relay the difference to the public?
A recent example of this occurred during the 1997 Red River of the North flood at Grand Forks, ND, and East Grand Forks, MN. A near-record forecast of 48.8 feet above the gage datum of 779 feet mean sea level was issued two months in advance. As the time to crest drew closer and additional precipitation added to the projected runoff, the NWS updated the forecast to 50 feet (LeFever et al., 1999). As meteorological conditions such as rapid warming exacerbated the snowmelt flooding, several upward revisions were made to the forecast, to as high as 54.0 feet. The cities had built levees to provide protection to 52 feet. The crest occurred at 54.35 feet, breaking the previous record of 50.2 feet set in 1897. The cities were inundated with flood waters. In this unprecedented event, the NWS provided the best forecast possible given the current technological applications. How good was it? The agency knows there is always room for improvement, as an RMS error would show in this case. The addition of a skill score would allow a five-foot error in a rare event to be compared to a smaller error in a more common event. These data could then be used together to gauge the overall improvement in the quality of NWS river forecasts. The variation in error could also be used to define the limits of the hydrologic forecasting science for rare events.
a. Data Stratification
The NWS provides river forecasts twice daily for major rivers, and twice daily during high flow for hundreds of smaller rivers and tributaries. Flows can vary greatly over a period of a few weeks and seasonal averages or extremes can mask this variation. For this reason, mean daily flows over monthly periods were chosen to represent a climatology of stream flows. A series of 12 monthly CDFs were created for six river forecast points using mean daily flow values from the U.S. Geological Survey (USGS) Hydro-Climatic Data Network (HCDN). The HCDN consists of streamflow records for 1,703 sites throughout the United States and its Territories, spanning the period 1874 through 1988. The records have been quality assured for accuracy and natural conditions (e.g., free of reservoir controls) (Slack and Landwehr, 1992).
The sites for this project were chosen through stratified random sampling (Maidment, 1993). The sampling area was made up of the Missouri and Upper Mississippi River basins, Region 10 and Region 7 of the HCDN, respectively, since forecast and observed flow data in this area would be readily available at the NWS Central Region Headquarters. To create the best possible CDF representing streamflow climatology, the sample set was limited to gaging stations with at least 50 years of data. Rivers forecast by the NWS rarely fall to zero flow; therefore, the sample was further limited to streams with perennial (i.e., non-zero) streamflow. In an effort to use a variety of data sets, points were then separated according to different groups of basin characteristics and U.S. standard climatic regions for temperature and precipitation (NOAA, 2000), as described below. Using these constraints, 70 gaging stations remained.
The 70 stations were divided into groups of various basin and climatological regimes. Rainfall, basin area, and basin bed slope, being important parameters affecting streamflow (Bedient and Huber, 1992), were chosen for grouping the data. Climatological regions were chosen within the Missouri and Mississippi River basins from the nine regions defined by the National Climatic Data Center and depicted in Figure 1. The East North Central and Central Regions were chosen due to availability of data. All 1,703 sites in the HCDN were used to determine classes of basin area; 1,556 sites were available with slope information. Histograms of both basin area and slope were created using Microsoft Excel (1995-96), and JMP IN software (Sall and Lehman, 1996) was used to test for normality. Neither distribution was found to be normal. A variety of unequal class lengths were tried until normal distributions were found; otherwise, nearly all the data would have been lumped into one class interval. The histograms were plotted as relative frequencies divided by class interval, with total area under the histogram equal to 1 (Johnson, 1994). Figure 2 shows the histogram and Table 1 the classes for basin area. Figure 3 and Table 2 show the histogram and classes for slope, respectively.
Figure 2. Basin Area Histogram.
Two different basin areas and slopes were selected. Basin area classes ranging from 200.01 to 2000 mi^2 (A1) and from 2000.01 to 10,000 mi^2 (A2) were chosen, since forecasts are generally not made for smaller basins and streamflow variability is less for larger basins. Variability is needed to show the use of the skill score at different streamflows. Slope classes from 3.01 to 8.00 ft/mi (S1) and from 8.01 to 40.00 ft/mi (S2) were chosen as the two classes with the highest frequencies, and thus more sites from which to choose.
b. Randomization of Data
The 70 sites were divided into the two climatic regimes, the East North Central (ENC) and Central (CEN) Regions. Since the decision was made to create six CDFs, three combinations of the two slopes and two basin areas were needed within each climate regime. Available combinations were ENC_S1A1, ENC_S1A2, ENC_S2A1, ENC_S2A2, CEN_S1A1, CEN_S1A2, CEN_S2A1, and CEN_S2A2, and are shown in Appendix A. To achieve an unbiased selection of six stations to study, the sites were selected randomly from the stratified groups using the random number generator in Excel. Selected groups were ENC_S1A1, ENC_S2A1, ENC_S1A2, CEN_S1A1, CEN_S2A1, and CEN_S1A2. The random number generator was then used to select one station from each group. These were the Sugar River near Brodhead, WI (station number (#) 5436500); Jump River at Sheldon, WI (#5362000); Nishnabotna River near Hamburg, IA (#6810000); Meramec River near Steeleville, MO (#7013000); Castor River at Zalma, MO (#7021000); and Gasconade River at Jerome, MO (#6933500), respectively.
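The group-then-pick procedure above can be sketched in a few lines. Here Python's random module stands in for the Excel random number generator; the station lists are abbreviated and partly hypothetical, included only to illustrate the unbiased selection step.

```python
import random

random.seed(42)  # fixed seed only so the example is reproducible

# Abbreviated, partly hypothetical station lists for two of the stratified groups
groups = {
    "ENC_S1A1": ["5436500", "5544200", "5571000"],
    "CEN_S1A1": ["7013000", "7052500", "7066000"],
}

# One unbiased random pick per stratified group, mirroring the Excel selection
selected = {name: random.choice(stations) for name, stations in groups.items()}
print(selected)
```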
c. Creation and Comparison of CDFs
The HCDN historical data files for the six selected sites were imported into Excel and ranked from largest to smallest. Since the distributions of mean daily flow by month were unknown, CDFs were created using the Cunnane plotting position equation (Bedient and Huber, 1992; Maidment, 1993),

q_i = (i - 0.4) / (n + 0.2),

where i is the rank of a value in the ordered sample, n is the number of values, and q_i is the cumulative probability assigned to the i-th ranked flow.
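A minimal sketch of building an empirical CDF with Cunnane plotting positions; the sample flows below are hypothetical stand-ins for a month of mean daily flow values.

```python
import numpy as np

def cunnane_cdf(flows):
    """Empirical CDF using the Cunnane plotting position q_i = (i - 0.4)/(n + 0.2)."""
    x = np.sort(np.asarray(flows, dtype=float))   # ascending order
    n = x.size
    ranks = np.arange(1, n + 1)                   # rank 1 = smallest value
    q = (ranks - 0.4) / (n + 0.2)                 # cumulative (non-exceedance) probability
    return x, q

# Hypothetical mean daily flows (cfs) for one month at one gage
flows = [120.0, 450.0, 80.0, 300.0, 150.0]
x, q = cunnane_cdf(flows)
```

Note that the Cunnane positions never reach exactly 0 or 1, which is why extrapolation was later needed for flows beyond the historical record.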
As suggested by Ritter (1999), the CDF computation was carried out to at least four decimal places (five places were significant). Graphical presentations of the CDFs can be found in Appendix B. JMP IN was used to create the graphical CDFs and to test them for normality. None were found to be normal. For each month, each CDF was compared to every other CDF to determine whether a significant difference existed in the distributions, using a Kolmogorov-Smirnov (K-S) test of significance (Johnson, 1994; Kanji, 1993; Wilks, 1995) at the alpha = 0.05 level. The K-S test was used since the data were not normal. The null hypothesis, H0, was that the two samples were drawn from the same distribution; the alternative hypothesis, HA, was that the two distributions differed. The test statistic, D = max|F1(x) - F2(x)|, looked for the largest (in absolute value) difference between the empirical CDFs. Since absolute values were used, the test was one-sided, and the null hypothesis was rejected at the (alpha x 100)% level if D > c(alpha) sqrt((n1 + n2)/(n1 n2)), where n1 and n2 are the two sample sizes and c(0.05) = 1.36.
Kanji (1993) noted a limitation of the K-S test: best results occur with a sufficiently large data set. Montgomery (1997) stated that the probability of a Type II error, beta (failure to reject H0 when H0 is false), is dependent on the sample size. As the sample size increases, beta decreases and the power of the test (1 - beta) increases. With each CDF of 50 or more years of data containing 1723-2294 mean daily flow values, it was concluded that the K-S test was sufficiently powerful.
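The two-sample comparison can be reproduced with SciPy's `ks_2samp`, which computes the maximum absolute difference between two empirical CDFs and its p-value. The flow samples below are synthetic stand-ins for two sites' monthly records, not the study data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic mean daily flows (cfs) for two sites, one month, ~60 years of record
site_a = rng.lognormal(mean=6.0, sigma=0.8, size=1800)
site_b = rng.lognormal(mean=6.5, sigma=0.8, size=1800)

stat, p = ks_2samp(site_a, site_b)   # stat = max |F1(x) - F2(x)|
reject_h0 = bool(p < 0.05)           # True: the distributions differ significantly
```

With samples this large, even modest distributional differences are detected, consistent with the power argument above.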
The null hypothesis was rejected (i.e., the CDFs were significantly different) for all combinations except the following: Steeleville and Jerome for all months, and Brodhead and Sheldon for the month of June. One could conclude the same CDF could be used for Steeleville and Jerome, from the CEN_S1A1 and CEN_S1A2 groups, respectively. With Brodhead and Sheldon having similar CDFs for only one month, it would be best to keep them separate. Since the ENC_S1A1 and ENC_S1A2 stations (Brodhead and Sheldon, WI) were significantly different, one cannot conclude the same CDF can be used for all S1A1 and S1A2 groups. One can conclude the divisions of selected climate, slope, and area are significantly different. The results of these comparisons generally support using individual CDFs for each site.
d. Verification of Forecast Flow Data
Of the six sites, four were noted as NWS forecast points: Brodhead, Hamburg, Steeleville, and Jerome. Once again, the random number generator was used to select Hamburg as the point to test the concept of verification using RMSE and SSLEPS. RMSE incorporates both systematic and random errors (Maidment, 1993). A perfect RMSE is zero, and the score increases as errors in forecast-observation pairs increase. The SSLEPS measures the difference in the probability of occurrence between forecasts and observations with respect to the climatological CDF, rather than comparing the magnitudes of the specific forecasts and observations. It is assumed that in the portion of the CDF where a forecast value is more likely, it should be easier to forecast than for values that are climatologically less likely. The reference forecast in the skill score equation is the climatological median, 0.50000, the point at which there is a 50% chance of the variable being less than the value at that point. According to Wilks (1995), "transforming forecasts and observations to the cumulative probability scale assesses larger penalties for forecast errors in regions where the probability is more strongly concentrated", such as around the 50th percentile of the CDF. This will need to be kept in mind when evaluating the usefulness of the skill score. A skill score of 1 is perfect, zero means no improvement over climatology, and negative values imply a forecast worse than climatology.
In a series of N forecasts, where Fi represents the i-th forecast, Oi the i-th observation, and F(x) the climatological cumulative distribution function as defined by the frequency curves in Appendix B, the skill score is

SSLEPS = 1 - [ sum_{i=1..N} |F(Fi) - F(Oi)| ] / [ sum_{i=1..N} |0.5 - F(Oi)| ],

so that perfect forecasts score 1 and forecasts no better than the climatological median score 0.
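A minimal sketch of the two scores, assuming paired forecast/observed flows and their climatological CDF probabilities are already in hand; all values below are hypothetical.

```python
import numpy as np

def rmse(forecasts, observations):
    """Root mean-squared error of paired forecast and observed flows (cfs)."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(observations, dtype=float)
    return float(np.sqrt(np.mean((f - o) ** 2)))

def ss_leps(pf, po):
    """LEPS-based skill score from cumulative probabilities F(Fi) and F(Oi),
    with the climatological median (0.5) as the reference forecast."""
    pf = np.asarray(pf, dtype=float)
    po = np.asarray(po, dtype=float)
    return float(1.0 - np.sum(np.abs(pf - po)) / np.sum(np.abs(0.5 - po)))

# Hypothetical flows (cfs) and their CDF probabilities
f, o = [9000.0, 12000.0], [8500.0, 13000.0]
pf, po = [0.95, 0.97], [0.94, 0.98]
error = rmse(f, o)          # magnitude error in cfs
skill = ss_leps(pf, po)     # 1 = perfect, 0 = no better than climatology
```

Because the observed probabilities sit near the tail of the CDF, the denominator is large and small probability errors yield a skill score close to 1, illustrating the rarity effect discussed in the text.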
The sample set comprised flood-only forecasts covering April 2, 1998 through July 26, 1998. The forecasts were in stage height and had to be converted to flow using a rating curve (Figure 4). To obtain the forecast and observed frequencies from the CDF curves, an interpolation algorithm was run using Corel Quattro Pro software (1997). The data can be found in Appendix C. In June and July, several of the forecast and observed flows were higher than the largest value in the CDF and the rating curve. In order to place a frequency value on those flows, the CDF and rating curve were extended using curve fitting routines in JMP IN.
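One way to reproduce the frequency lookup is simple linear interpolation on the CDF. The short CDF below is a hypothetical stand-in for a full monthly curve; a real application would also need the curve-fit extension for flows beyond the table.

```python
import numpy as np

# Hypothetical monthly CDF: flows (cfs, ascending) and cumulative probabilities
cdf_flow = np.array([100.0, 500.0, 2000.0, 8000.0, 20000.0])
cdf_prob = np.array([0.05, 0.30, 0.60, 0.90, 0.999])

def frequency(flow_cfs):
    """Interpolate the cumulative (non-exceedance) probability for a flow value."""
    return float(np.interp(flow_cfs, cdf_flow, cdf_prob))

p = frequency(5000.0)   # halfway between 2000 and 8000 cfs
```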
Verification results are shown in Table 3. RMSEs for April and May were similar and the lowest of the four months. July's RMSE was nearly twice that of April, but with a slightly higher skill score. The forecast flow frequency values ranged from 0.91785 to 0.99874 in April and from 0.94951 to 0.99995 in July, well above the median flow frequency of 0.50000. This indicates forecasting of unlikely events, and therefore more difficult forecasts. The resulting April skill score of 0.98776 is quite high (1.0 is perfect). Comparing April to June, the verification indicates forecasts were not as good in June, with a much higher RMSE (5399 cfs compared to 1405 cfs in April) and a skill score of 0.94805. Still, with 1.0 being perfect, this would appear to be a "good" skill score. Comparing the flow frequency ranges in Table 4, June has a wider range, which could explain the lower skill score.
Table 4. Monthly range of forecast flow frequency values: Month | Min F(F) | Max F(F)
It appeared that flood-only forecasts might not provide a wide range of skill scores to evaluate, so it was decided to investigate scores for a daily forecast point. As none of the four forecast points selected in the random sampling were daily forecast points, another selection method was needed. Of the four HCDN-selected rivers, only the Meramec River had a daily forecast point, Eureka, MO. A wide range of flows was desirable to show more variety in the skill score, and Eureka was found to have a variety of flows from June through August 1998. Eureka was also described as a good site for such an experiment, as it "is a mid-basin point..., downstream of a couple modeled tributary rivers. This modeling situation gives enough forecast lead time for a dampening out of many of the spurious signals that may be coming from any given headwater point, allowing the "true" signal to be read." (Buan, 2000). Eureka was also part of the HCDN data and already part of the stratified group CEN_S1A2. It seemed a very appropriate choice.
Eureka CDFs for June, July, and August were created as above and are contained in Appendix D. These CDFs were also compared to the other CEN_S1A2 basin, the Gasconade River at Jerome, MO to see if significant differences existed. Again using the K-S test of significance as before, the hypothesis that the Jerome and Eureka distributions were the same was rejected for all three months, further supporting the need for individual CDFs. The forecasts and observations for Eureka were converted from stage height to flow using the Eureka rating table (Figure D4 in Appendix D), and frequency values were obtained through interpolation of the CDFs. The August CDF had to be extended due to a higher flow than in the HCDN data set.
Since the flow data for Eureka varied more from low to high flow, it was decided to separate the data into two groups, the more extreme and the more common, so as not to attenuate differences between the two groups. The 25th and 75th percentiles (Johnson, 1994) of the CDFs were determined using JMP IN. CDF values in the upper 25% and lower 25% were grouped together as the more extreme flows, and values in the inner quartiles were grouped as the more common occurrences. The forecast flow data with accompanying observations were then placed in one of the two groups, depending on the frequency value of the forecast data. Eureka RMSE and SSLEPS were derived for these two groups as well as for the month as a whole (all quartiles). Results are shown in Table 5.
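The quartile split described above can be sketched as follows; the pairs and probabilities are hypothetical, and group membership is decided by the CDF probability of the forecast value.

```python
def split_by_quartiles(pairs, forecast_probs):
    """Split (forecast, observed) flow pairs into inner (0.25 <= p <= 0.75)
    and outer quartile groups based on the forecast's CDF probability."""
    inner, outer = [], []
    for pair, p in zip(pairs, forecast_probs):
        (inner if 0.25 <= p <= 0.75 else outer).append(pair)
    return inner, outer

# Hypothetical (forecast, observed) flows in cfs with forecast CDF probabilities
pairs = [(900.0, 850.0), (5000.0, 5200.0), (12000.0, 11000.0)]
probs = [0.40, 0.70, 0.95]
inner, outer = split_by_quartiles(pairs, probs)
```

RMSE and SSLEPS would then be computed separately on each group, so that the easier inner-quartile forecasts do not dilute the scores of the more extreme events.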
The forecast data for Hamburg were also divided according to its CDF inner and outer quartiles, and all observed-forecast pairs were found to fall in the outer quartiles, specifically the 75th quartile (Q75). Using the Q75 data for Eureka provided the additional advantage of allowing the Hamburg flood-only scores to be compared with the Eureka scores, as both data sets were based on a specific range in probabilities. It no longer mattered that one was a daily forecast point and the other flood-only. Skill scores for Hamburg and Eureka in June were 0.94805 and 0.89884, respectively. For July, Hamburg had a score of 0.98836 and Eureka 0.73059. Are these differences significant? To answer that, individual skill scores were calculated for each pair of forecast and observed flow values in June and July. Since the skill scores were not normally distributed and the data set was small, a test of significance was computed using the Wilcoxon rank sum test in JMP IN. Skill scores for both months were shown to be significantly different. Looking only at these scores, one might look further into the Eureka forecasts to identify any particular problems in July 1998, such as model deficiencies, sparsity of data, etc.
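The significance comparison can be reproduced with SciPy's Wilcoxon rank-sum test; the per-pair skill scores below are hypothetical stand-ins for the two forecast points' values.

```python
from scipy.stats import ranksums

# Hypothetical per-pair LEPS skill scores for two forecast points, same month
ss_point_a = [0.99, 0.98, 0.97, 0.99, 0.98]
ss_point_b = [0.80, 0.75, 0.78, 0.82, 0.77]

stat, p = ranksums(ss_point_a, ss_point_b)
significantly_different = bool(p < 0.05)
```

A rank-based test is appropriate here because, as noted above, the skill scores are not normally distributed and the samples are small.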
With data split into the inner and outer quartiles, Table 5 shows a wide variety of skill scores. The values in the more common group, the inner quartiles, are much lower (worse) than scores in the outer quartiles. As stated earlier, skill scores in regions of higher probability of occurrence assess a higher penalty. These results show usefulness in the skill scores, especially in the outer quartiles. For example, it is obvious that June showed "better" forecasting than July in the outer quartiles, where the RMSE was 3602 cfs for June compared to 5424 cfs in July, and the respective skill scores were 0.89884 and 0.76189.

Table 5. Eureka verification by quartile group: Month | Inner Quartiles (n, RMSE (cfs), SSLEPS) | Outer Quartiles (n, RMSE (cfs), SSLEPS) | Q75 (n, RMSE (cfs), SSLEPS, Min F(F), Max F(F))

The calculations over the individual months (all quartiles) showed a similar trend to the outer-quartile calculations but with slightly lower values. However, these monthly values failed to reflect the lower RMSEs in the inner quartiles, important information regarding error in the actual forecast. This further supports the idea of using the inner and outer quartiles as a verification method rather than combining all the data for a month.
Since RMSEs involve squaring errors, they will be increasingly higher where differences between forecasts and observations increase; large differences strongly influence RMSEs. If it is true that forecasting in the inner quartiles is easier, RMSEs there should be noticeably lower. The results shown in Table 5 support this. Using only RMSE values, one would conclude forecasts in the inner quartiles were better than in the outer quartiles. Comparing only the skill scores, one would conclude the forecasts for the more extreme events were much better than for the common events. How can the two scores be used together? Is there a relation between RMSE and SSLEPS that would provide a more complete test of quality? These questions may be answered by creating a climatology of RMSE values and skill scores such that the forecaster could evaluate how his or her forecast scores compare to typical scores. An RMSE for a single forecast is the same as an absolute error (AE), |F - O|. Providing a normal absolute error with each forecast might give the public a better understanding of the typical range in the forecast (DeWeese, 2000) and could add significant value. Additionally, the NWS could track improvement in forecasts.
An advantage with the skill scores, because they are based on probabilities, is that a flood on one river can be compared to a flood on another river. As shown above, a group of flood-only forecasts should not be compared to a group of daily forecasts. But if forecast or observed data for any point are split into similar probability ranges such as with the inner and outer quartiles, or flood only with the 75th Quartile, the results can be compared.
e. Grand Forks Case Study
In April 1997 an unprecedented flood occurred in the Red River of the North basin in North Dakota and Minnesota. The only verification score available at the time was one comparing the magnitudes of the forecast and observed stages. However, the flood was literally "way off the charts" and went beyond the NWS's scientific and technical abilities. It came down to forecaster subjectivity, knowledge of the river, and experience. Comparing crest forecasts issued during the week before the observed crest with the actual observed crest, some forecasts were as much as four feet too low. Given the rarity of the event, how "good" were these forecasts?
An April CDF, Figure 5, was calculated for Grand Forks using USGS historical data from the National Water Information System, as the site was not available in the HCDN. Estimated data were deleted. The rating curve shown in Appendix D, Figure D5, was used to convert gage height to flow as before. This rating curve, provided by the North Central River Forecast Center, was for the observed flow during the flood and is quite complicated in that some rising stages had decreasing flows due to damming effects. Since forecasters during the week prior to the crest were working with the increasing-flow portion of this rating, that was the portion used for the flow values in these calculations (Figure D6). Extending this rating for higher stages resulted in higher flows, contrary to what actually happened due to damming.
Table 6 shows the forecast and observed crest flows, estimated using the rating curve in Figure D6, as well as individual and overall verification scores. Forecasts were for crests only, therefore each forecast was paired with the observed crest flow estimated to be 115,000 cfs from Figure D5. (Note: Individual RMSEs are the same as absolute error, |Fi - Oi|.)
Table 6. Grand Forks crest forecasts and verification scores: Date forecast issued | Forecast Crest (ft) | Observed Crest (ft), 4/22/97 | Forecast Crest (cfs) | Observed Crest (cfs) | RMSE (cfs) | SSLEPS
A trend of increasing skill scores and decreasing RMSEs is shown in the individual forecasts from February 27 through April 17, 1997. April 18 marked the beginning of decreasing flow with continued rising stage, at which point RMSEs increased rapidly and skill scores decreased. The near-perfect skill scores reflect the rarity of the event; very small changes in score must be considered meaningful, as the rate of change of the CDF in this region is very small. The frequency values for the forecast flows ranged from 0.99520 to 0.99988; the range for the Hamburg data in April was from 0.91785 to 0.99874. The April skill score for Hamburg was 0.98776, and for the combined Grand Forks forecasts, 0.99685. Again using the Wilcoxon rank sum test of significance for the individual skill scores of the Hamburg April data and all crest forecast data for Grand Forks, the scores were not found to be significantly different. One could conclude the forecasts were equally good for the two extreme events.
The LEPS-based skill score adds more information to more traditional verification methods and appears to be quite useful, as it is not dependent on the scale of the variable being verified. By taking the rarity of the event into account, it conveys information about the quality of the forecast. The LEPS-based skill score by itself can be used to compare a forecast on one river to a forecast on an entirely different river. This would be of great benefit to the NWS in making fair assessments and prioritizing identified needs. For individual river forecast points, SSLEPS should be used with RMSE or similar error scores to indicate how much improvement is needed to lower the error in the actual magnitude of the forecast. Additional benefit would be attained by developing a climatology of RMSE or AE values and skill scores such that the forecaster could evaluate his or her forecast against typical errors and skill scores. The agency could also track improvement in forecasts over time.
This study did not show that one common CDF for different sites or sites with similar basin characteristics could be used in calculating the LEPS-based skill score. Individual CDFs using years of mean daily flow data would need to be created. Current NWS river forecasting software will create CDFs on an annual basis, not monthly. If a change could be made such that the software could create monthly CDFs of mean daily flow, use of the SSLEPS could be further evaluated.
Acknowledgments. The author would like to thank A. Juliann Meyer and Thomas Gurss of the NWS Missouri Basin River Forecast Center; Steven Buan and John Halquist of the NWS North Central River Forecast Center; and Shilpa Shenvi and Geoffrey Bonnin of the NWS Office of Hydrologic Development for providing the data for this study; and John Schaake also of the NWS Office of Hydrologic Development for his guidance in the development of this project.
Bedient, P.B., and W.C. Huber, 1992: Hydrology and Floodplain Analysis, Second Edition. Addison-Wesley Publishing Company, 692 pp.
Buan, S., 2000: Hydrologist, NWS North Central River Forecast Center, Personal Communication.
Corel Corporation Limited, 1997: Corel Quattro Pro 8 [software].
DeWeese, M., 2000: Senior Hydrologist, NWS North Central River Forecast Center, Personal Communication.
Johnson, R., 1994: Miller & Freund's Probability and Statistics for Engineers. Prentice-Hall, 630 pp.
Kanji, G., 1993: 100 Statistical Tests. Sage Publications, 216 pp.
LeFever, J.A., J.P. Bluemle, and R.P. Waldkirch, 1999: Flooding in the Grand Forks-East Grand Forks North Dakota and Minnesota area. North Dakota Geological Survey Educational Series No. 25, 1-63.
Maidment, D.R. (ed.), 1993: Handbook of Hydrology. McGraw-Hill, New York, approx. 1450 pp.
Microsoft Corporation, 1995-96: Getting Results with Microsoft Excel 97. Microsoft Corporation, 27 pp.
Montgomery, D.C., 1997: Design and Analysis of Experiments, Fourth Edition. John Wiley and Sons, 704 pp.
National Oceanic and Atmospheric Administration, 1994: Natural Disaster Survey Report, The Great Flood of 1993, 244 pp.
National Oceanic and Atmospheric Administration, 2000: U.S. Regional Analysis of 1998 Climate, http://www.ncdc.noaa.gov/ol/climate/research/1998/ann/usrgns_pg.gif (May 10, 2000).
Ritter, T., 1999: Ritter's Crypto Glossary and Dictionary of Technical Cryptography. http://www.io.com/~ritter/GLOSSARY.HTM (April 15, 2000).
Sall, J., and A. Lehman, SAS Institute Inc., 1996: JMP Start Statistics. Duxbury Press, 520 pp.
Slack, J.R., and J.M. Landwehr, 1992: HCDN: A U.S. Geological Survey streamflow data set for the United States for the study of climate variations, 1874-1988. USGS Open-File Report 92-129. http://www.rvarvs.er.usgs.gov/hcdn_report/content.html (February 20, 2000).
Wilks, D., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.