Rodeo II Sprint: Sub-Seasonal Climate Forecasting - temp34 Task, period #17


Challenge Overview

Note
This challenge is NOT a part of TCO20 MM Track.

Remark

This sprint is part of the second phase of the Rodeo II Challenge, which follows the previous marathon match. A large part of the problem statement is the same, but there are also some major changes. You are not required to have taken part in the previous marathon match in order to compete in this sprint.

Introduction

Water managers need more skillful information on weather and climate conditions to utilize water resources efficiently and to reduce the impact of hydrologic variations, such as the onset of drought or a wet weather extreme. Lacking skillful sub-seasonal information limits water managers’ ability to prepare for shifts in hydrologic regimes and can pose major threats to their ability to manage this valuable resource.

The challenge of sub-seasonal forecasting encompasses lead times of 15 to 45 days into the future, which lie between those of weather forecasting (i.e. up to 15 days, where initial ocean and atmospheric conditions matter most) and seasonal to longer-lead climate forecasting (i.e. beyond 45 days, where slowly varying earth system conditions matter most, such as sea surface temperatures, soil moisture, and snowpack).

The Rodeo II Challenge series is a continuation of the Sub-Seasonal Climate Forecast Rodeo I contest, which the US Bureau of Reclamation launched in December 2016. The primary component of Rodeo I was a year of submitting forecasts every 2 weeks in real time. Teams were ranked on their performance over the year and needed to outperform two benchmark forecasts from NOAA. Additional prize eligibility requirements included a method documentation summary, 11-year hindcasts, and code testing.

The current challenge is the second step in a series of contests that builds on the results of Rodeo I. This document is the problem specification of 4 × 26 = 104 recurring data science challenges (sprints) that aim to create high-quality predictive algorithms for weather data over the next full year.

There will be 4 independent tracks (the same as in the marathon match); in each, you will be solving a specific task. Each track will be split into 26 sprints, with each sprint lasting 2 weeks. There will be separate prizes for each sprint. There will also be quarterly and overall (annual) bonus prizes. To be eligible for the quarterly and annual prizes, you must outperform both the winning solution for that category from Rodeo I and a sub-seasonal forecast from NOAA.

Prize Structure

The total sum of all prizes is $720,000!

  • Each sprint, each track (26 × 4; there is no need to beat the NOAA or Rodeo I scores):

    • 1st    $500

    • 2nd    $350

    • 3rd    $250

    • 4th    $175

    • 5th    $100

  • Each quarterly bonus (4 × 4; awarded to competitors whose score calculated over the quarter is higher than both the NOAA and Rodeo I scores for the same period):

    • 1st    $6,000

    • 2nd    $4,500

    • 3rd    $3,000

    • 4th    $2,000

    • 5th    $1,000

    • 6th    $250

    • 7th    $250

    • 8th    $250

    • 9th    $250

    • 10th    $250

  • Overall bonus (each track; awarded to competitors whose score calculated over the 52 weeks is higher than both the NOAA and Rodeo I scores for the same period):

    • 1st    $25,000

    • 2nd    $17,500

    • 3rd    $10,000

    • 4th    $7,500

    • 5th    $5,000

    • 6th    $2,500

    • 7th    $2,000

    • 8th    $1,500

    • 9th    $1,250

    • 10th    $1,000

The sum of any bonus funds that are not awarded will be reallocated to later prize purses: 50% will go to the next quarter's purse, and the other 50% will go towards the overall prize. Any funds not awarded during the overall bonus evaluation (when there are fewer than 10 competitors beating the NOAA and Rodeo I scores) will be redistributed among the winning positions.

In Q1, $18,250 was not awarded. Following the reallocation rules, half of this amount ($9,125, i.e., $2,281.25 per track) was added to the Q2 quarterly purses, and the other half to the overall bonus purses. The resulting values for the Q2 quarterly and overall bonus prizes are listed below (Q3 and Q4 bonus prizes remain unchanged):

  • Q2 quarterly bonus (each track; awarded to competitors whose score calculated over the quarter is higher than both the NOAA and Rodeo I scores for the same period):

    • 1st    $6,684.38

    • 2nd    $4,956.25

    • 3rd    $3,342.19

    • 4th    $2,228.13

    • 5th    $1,114.06

    • 6th    $341.25

    • 7th    $341.25

    • 8th    $341.25

    • 9th    $341.25

    • 10th    $341.25

  • Overall bonus (each track; awarded to competitors whose score calculated over the 52 weeks is higher than both the NOAA and Rodeo I scores for the same period):

    • 1st    $25,684.38

    • 2nd    $17,956.25

    • 3rd    $10,342.19

    • 4th    $7,728.13

    • 5th    $5,114.06

    • 6th    $2,591.25

    • 7th    $2,091.25

    • 8th    $1,591.25

    • 9th    $1,341.25

    • 10th    $1,091.25

 

Task Overview

In the 4 tracks your task is to predict the following variables:

  • "temp34": 14-day average of temperature using a forecast outlook of 15-28 days (weeks 3-4),

  • "prec34": 14-day total precipitation using a forecast outlook of 15-28 days (weeks 3-4),

  • "temp56": 14-day average of temperature using a forecast outlook of 29-42 days (weeks 5-6),

  • "prec56": 14-day total precipitation using a forecast outlook of 29-42 days (weeks 5-6).

Technically, these 4 tasks are organized as 4 × 26 = 104 different contests, and you are free to participate in any number of them. The 4 tracks are identical in most aspects (data, challenge specification, contest forum, schedule, etc.), but they have individual leaderboards and sets of prizes. Likewise, the 26 contests within the same track have individual leaderboards and prizes, but to be eligible for the quarterly and overall bonuses, you should take part in all the corresponding contests. This contest is about the "temp34" task, period #17.

The quality of your algorithm will be judged by how closely the predicted weather data matches the actual, measured values. See Scoring for details.

Schedule of Sprints (the current sprint is marked below; all prediction ranges start at 00:00 UTC)

Period #  Submission Deadline    “temp34”/“prec34”   “temp56”/“prec56”   Quarter
                                 Prediction Range    Prediction Range
1      2019-10-15 00:00 UTC   Oct 29 - Nov 11    Nov 12 - Nov 25      Q1
2      2019-10-29 00:00 UTC   Nov 12 - Nov 25    Nov 26 - Dec 9       Q1
3      2019-11-12 00:00 UTC   Nov 26 - Dec 9     Dec 10 - Dec 23      Q1
4      2019-11-26 00:00 UTC   Dec 10 - Dec 23    Dec 24 - Jan 6       Q1
5      2019-12-10 00:00 UTC   Dec 24 - Jan 6     Jan 7 - Jan 20       Q1
6      2019-12-24 00:00 UTC   Jan 7 - Jan 20     Jan 21 - Feb 3       Q1
7      2020-01-07 00:00 UTC   Jan 21 - Feb 3     Feb 4 - Feb 17       Q1
Q1 Bonus Evaluation
8      2020-01-21 00:00 UTC   Feb 4 - Feb 17     Feb 18 - Mar 2       Q2
9      2020-02-04 00:00 UTC   Feb 18 - Mar 2     Mar 3 - Mar 16       Q2
10     2020-02-18 00:00 UTC   Mar 3 - Mar 16     Mar 17 - Mar 30      Q2
11     2020-03-03 00:00 UTC   Mar 17 - Mar 30    Mar 31 - Apr 13      Q2
12     2020-03-17 00:00 UTC   Mar 31 - Apr 13    Apr 14 - Apr 27      Q2
13     2020-03-31 00:00 UTC   Apr 14 - Apr 27    Apr 28 - May 11      Q2
Q2 Bonus Evaluation
14     2020-04-14 00:00 UTC   Apr 28 - May 11    May 12 - May 25      Q3
15     2020-04-28 00:00 UTC   May 12 - May 25    May 26 - Jun 8       Q3
16     2020-05-12 00:00 UTC   May 26 - Jun 8     Jun 9 - Jun 22       Q3
17     2020-05-26 00:00 UTC   Jun 9 - Jun 22     Jun 23 - Jul 6       Q3   ← current sprint
18     2020-06-09 00:00 UTC   Jun 23 - Jul 6     Jul 7 - Jul 20       Q3
19     2020-06-23 00:00 UTC   Jul 7 - Jul 20     Jul 21 - Aug 3       Q3
Q3 Bonus Evaluation
20     2020-07-07 00:00 UTC   Jul 21 - Aug 3     Aug 4 - Aug 17       Q4
21     2020-07-21 00:00 UTC   Aug 4 - Aug 17     Aug 18 - Aug 31      Q4
22     2020-08-04 00:00 UTC   Aug 18 - Aug 31    Sep 1 - Sep 14       Q4
23     2020-08-18 00:00 UTC   Sep 1 - Sep 14     Sep 15 - Sep 28      Q4
24     2020-09-01 00:00 UTC   Sep 15 - Sep 28    Sep 29 - Oct 12      Q4
25     2020-09-15 00:00 UTC   Sep 29 - Oct 12    Oct 13 - Oct 26      Q4
26     2020-09-29 00:00 UTC   Oct 13 - Oct 26    Oct 27 - Nov 9       Q4
Q4 and Final (Overall) Bonus Evaluation

 

Note that since we will evaluate the solutions on live weather data, the results will be announced only after the corresponding prediction range has elapsed.

Input Data

There is no official training data set; you are free to use any data. E.g., you can use the Subseasonal Rodeo data set, from which the data files used in the previous marathon match were created. The data set is described in detail in this publication, which also gives pointers to the original sources of the data, where further information on each data file can be found. Notice especially section A.2 (page 10) with its list of several sources, some of which are updated daily with actual measurements.

Ground Truth Data

We will use two sources to generate the ground truth for scoring: one for temperature and one for precipitation, both described below.

These sources contain daily temperature and precipitation values over the entire globe with a resolution of 0.5° × 0.5°. The global field starts from the 0.5° × 0.5° lat/lon grid box centered at (lat, lon) = (−89.75, 0.25), going from west to east first, and then south to north. Together, there are 720 × 360 = 259,200 grid boxes.
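
For illustration, here is a minimal Python sketch (our own helper, not an official tool) that maps a grid box center (lat, lon) to its flat index in this ordering:

def grid_index(lat, lon):
    # Flat index of the 0.5° × 0.5° grid box centered at (lat, lon).
    # The field starts at (-89.75, 0.25) and runs west to east first,
    # then south to north, so there are 720 boxes per latitude row.
    row = round((lat + 89.75) / 0.5)   # 0..359, south to north
    col = round((lon - 0.25) / 0.5)    # 0..719, west to east
    return row * 720 + col

# grid_index(-89.75, 0.25) == 0 and grid_index(89.75, 359.75) == 259199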

The first source (for temperature) contains, for each year, one file named CPC_GLOBAL_T_V0.x_0.5deg.lnx.<year>.gz. After extracting it, you get the file CPC_GLOBAL_T_V0.x_0.5deg.lnx.<year>. This is a raw binary file of size

<number of days> × 4 × 259,200 × 4    bytes.

For a regular year, the size is 365 × 4 × 259,200 × 4 = 1,513,728,000 bytes, and for a leap year, the size is 366 × 4 × 259,200 × 4 = 1,517,875,200 bytes.

The file contains single-precision floating-point values (i.e., 4 bytes per value) with little-endian ordering. Each block of 4 × 259,200 × 4 = 4,147,200 consecutive bytes (reading the blocks from the beginning of the file) corresponds to a single day of the year. Each day's block can be split into 4 sub-blocks of size 259,200 × 4 = 1,036,800 bytes. Each of these 4 sub-blocks contains the following 720 × 360 = 259,200 values (ordered as described a few paragraphs above; a reading sketch in Python follows the list):

  • 1st sub-block:        tmax values, daily maximum temperature in °C

  • 2nd sub-block:        nmax values, not used for ground truth calculation

  • 3rd sub-block:        tmin values, daily minimum temperature in °C

  • 4th sub-block:        nmin values, not used for ground truth calculation

Missing values are represented as -999.
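
The following is a minimal reading sketch under the layout described above (assuming NumPy; the function name and interface are ours, not part of the challenge):

import numpy as np

def read_temp_day(path, day_of_year):
    # Read the tmax and tmin sub-blocks (°C) for one 1-based day of the
    # year from an extracted CPC_GLOBAL_T file.
    n = 720 * 360                  # 259,200 grid boxes
    day_bytes = 4 * n * 4          # 4 sub-blocks of n float32 values
    with open(path, "rb") as f:
        f.seek((day_of_year - 1) * day_bytes)
        block = np.fromfile(f, dtype="<f4", count=4 * n).reshape(4, n)
    # Sub-blocks 2 and 4 (nmax, nmin) are not used for ground truth.
    tmax, tmin = block[0], block[2]
    return tmax, tmin              # -999 marks missing values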

For the current year, the file is updated daily and is not compressed (no .gz extension). E.g., on Sep 10, 2019 at 8:00 GMT, which was the 253rd day of the year, the file CPC_GLOBAL_T_V0.x_0.5deg.lnx.2019 contained regular values only in the first 251 blocks (there is some time lag between when the measurements are recorded and when the file is updated). The other blocks were filled with -999.

The second source (for precipitation) contains a subfolder for each year; e.g., the data for 2019 is in the RT/2019/ folder. There is a separate file for each day, named PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.<YYYYMMDD>.RT, where <YYYYMMDD> is the respective date. This is a raw binary file of size 2 × 259,200 × 4 = 2,073,600 bytes.

The file contains single-precision floating-point values (i.e., 4 bytes per value) with little-endian ordering. The file can be split into 2 sub-blocks of size 259,200 × 4 = 1,036,800 bytes. Each of these 2 sub-blocks contains the following 720 × 360 = 259,200 values (ordered in the same way as for temperature; a reading sketch follows the list):

  • 1st sub-block:        prec values, daily precipitation in 0.1 mm units (e.g., the value 123.45 means 12.345 mm of daily precipitation)

  • 2nd sub-block:        number of gauges available in the grid box; not used for ground truth calculation

Also here, missing values are represented as -999.
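
A matching sketch for one daily precipitation file (again assuming NumPy; the names are ours):

import numpy as np

def read_prec_day(path):
    # Read the prec sub-block (0.1 mm units) from one PRCP_CU_GAUGE file.
    n = 720 * 360
    data = np.fromfile(path, dtype="<f4", count=2 * n).reshape(2, n)
    return data[0]        # gauge counts (2nd sub-block) are unused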

The lat/lon resolution of the source grid is 0.5° × 0.5°, while the target contest resolution is 1° × 1°. Therefore, the interpolation described in the above-mentioned publication (section A.1, page 10) is performed for all 3 relevant variables (tmax, tmin, prec). E.g., when calculating the value for the target point (lat, lon) = (40, 253), we take the average of the values at the points (39.75, 252.75), (39.75, 253.25), (40.25, 252.75), (40.25, 253.25), weighted by the cosine of the latitude in radians.
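
A sketch of this averaging (reusing the grid_index helper sketched earlier; the handling of missing -999 neighbors follows the cited publication and is not shown here):

import math

def interp_to_1deg(field, lat, lon):
    # Cosine-latitude weighted average of the four surrounding
    # 0.5-degree values, giving the value at an integer (lat, lon).
    def grid_index(p_lat, p_lon):      # as in the earlier sketch
        return round((p_lat + 89.75) / 0.5) * 720 + round((p_lon - 0.25) / 0.5)
    pts = [(lat - 0.25, lon - 0.25), (lat - 0.25, lon + 0.25),
           (lat + 0.25, lon - 0.25), (lat + 0.25, lon + 0.25)]
    weights = [math.cos(math.radians(p_lat)) for p_lat, _ in pts]
    values = [field[grid_index(p_lat, p_lon)] for p_lat, p_lon in pts]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)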

Finally, to obtain the ground truth values (a short sketch follows the list):

  • For “temp34” and “temp56”, we calculate tmaxAvg and tminAvg as the average of the 14 tmax and tmin values, respectively. Then we take (tmaxAvg + tminAvg)/2.

  • For “prec34” and “prec56”, we calculate precSum as the sum of the 14 prec values. Then we take precSum/10 (the final division by 10 converts the units from 0.1 mm to mm).
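
Put together, a minimal sketch of this final step (inputs are the 14 interpolated daily values at one target point):

def ground_truth(tmax_days, tmin_days, prec_days):
    # tmax_days, tmin_days, prec_days: lists of 14 daily values at one
    # target grid point, already interpolated to the 1° × 1° grid.
    tmax_avg = sum(tmax_days) / 14
    tmin_avg = sum(tmin_days) / 14
    temp = (tmax_avg + tmin_avg) / 2    # ground truth for temp34/temp56
    prec = sum(prec_days) / 10          # 0.1 mm units -> mm, for prec34/prec56
    return temp, prec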

We provide a tool to generate the ground truth values.

Output File

In this contest you must submit both the result and the code that was used to generate it. There are no restrictions in terms of resources and languages, other than what is described in the “General Notes” section. (Unlike in the previous marathon match, you do not need to dockerize your solution during the submission phase.) We should be able to run your code by following your documentation (which you only need to submit after the result announcement, if you win a prize). Your code must create predictions for the period listed in the schedule table, for the 514 target grid points listed here.

Your predictions must be given in a single CSV file named <submission deadline>.csv, where <submission deadline> is the date in YYYY-MM-DD format from the second column of the schedule table; that is, 2020-05-26.csv for this sprint. The file must be formatted as follows:

lat,lon,temp34,prec34,temp56,prec56

 where

  • lat and lon are the latitude and longitude of the target grid point,

  • temp34 is your predicted 14-day average temperature at the given location, in °C. The average is taken over the prediction range listed in the schedule table. The daily average is calculated as the average of the tmin and tmax variables.

  • prec34 is your predicted total 14-day precipitation at the given location, in mm, over the prediction range listed in the schedule table.

  • temp56 is as temp34, over the other prediction range.

  • prec56 is as prec34, over the other prediction range.

Your solution file should contain the above header line (optional), plus exactly 514 data lines (mandatory): one line for each of the 514 target grid points.

Your solution file must be formatted as specified above even if you participate in only a subset of the 4 tracks. In each track only values from a specific column will be used for evaluation. If you don't want to participate in one or more of the 4 tracks, simply don't make submissions on those tracks' online submission interfaces. You may use the special value nan in columns corresponding to tasks you don't want to solve.

Sample lines:

lat,lon,temp34,prec34,temp56,prec56
27,261,15.5,18.7,0,nan
27,262,11.7,15.0,21,nan
28,261,9.2,12.3,28.8,nan
28,262,19.3,21.1,22,nan
. . . 

In the sample above, the contestant decided not to take part in the "prec56" track.
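
For reference, a minimal sketch of producing such a file (predict is a hypothetical user-supplied function returning the four predicted values for one grid point; it is not part of the challenge):

import csv

def write_submission(points, predict, path="2020-05-26.csv"):
    # points: the 514 (lat, lon) target grid points from the published list.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["lat", "lon", "temp34", "prec34", "temp56", "prec56"])
        for lat, lon in points:
            temp34, prec34, temp56, prec56 = predict(lat, lon)
            writer.writerow([lat, lon, temp34, prec34, temp56, prec56])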

Submission Format

This match uses the "submit result" submission style. Only your last submission will be evaluated. You must also submit the code which generated your result. If you place in a prize-winning position (either in a single sprint or in a quarterly/overall ranking), you will have to submit documentation which describes how to run your code to reproduce the result (see the “Final Prizes” section below). If you use huge data sources which are not downloaded automatically from the internet by your code, do not include them in your submission. Instead, if you win a prize, describe in the documentation which data must be downloaded manually before running your code.

You must submit all the files packed in a ZIP file with the following structure in the root directory:

/
  <submission deadline>.csv    ← file with your predictions
  code/                        ← folder
    <content>                  ← all your code, structured as you wish
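
A minimal packing sketch (the names and paths are ours; any tool that produces this layout works):

import os
import zipfile

def pack_submission(csv_path, code_dir, out_path="submission.zip"):
    # Put the predictions CSV in the ZIP root and the code under code/.
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(csv_path, arcname=os.path.basename(csv_path))
        for root, _, files in os.walk(code_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, code_dir)
                z.write(full, arcname=os.path.join("code", rel))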

During the submission phase, there will be no provisional leaderboard nor any information indicating whether your submission is valid. 

Scoring

During scoring, your solution CSV file will be matched against ground truth data using the following algorithm.

If your ZIP file does not contain the expected solution file in the root directory, you will receive a score of 0.

If your solution is invalid (e.g. if the tester tool can't successfully parse its content, or it does not contain the expected number of lines), you will receive a score of 0.

Otherwise your score is calculated as:

err = RMSE(actual, predicted),

where the average is taken over the 514 grid points, actual is the true observed value (temperature or precipitation) at each location, and predicted is your algorithm's output at the corresponding location.

For the quarterly/overall bonus prizes, errAvg is calculated as the average of the err values over the respective period. You may miss at most 1 sprint in a quarter (to remain eligible for that quarter's prize) and at most 4 sprints overall (to remain eligible for the overall prizes; here, the missed sprints do not have to be in different quarters). For score calculation purposes, your missed submissions will be replaced with the long-term average (1981-2010) for that period.

Finally, for display purposes your score is mapped to the [0...100] range (100 being the best), as:

score_single = 100 / (0.1*err + 1)        ← for a single sprint

score_bonus = 100 / (0.1*errAvg + 1)        ← for bonus prizes
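
The calculation, as a sketch in Python (the official scoring script linked below is authoritative):

import math

def display_score(actual, predicted):
    # RMSE over the 514 grid points, then mapped to the [0, 100] range.
    err = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                    / len(actual))
    return 100 / (0.1 * err + 1)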

The scoring script is available here. The leaderboard of all challenges (including current quarterly and overall rankings) is hosted here.

General Notes

  • This match is not rated

  • Teaming is allowed. Topcoder members are permitted to form teams for this competition, and after forming a team, Topcoder members of the same team are permitted to collaborate with other members of their team.

    • To form a team, a Topcoder member may recruit other Topcoder members and register the team by completing this Topcoder Teaming Form. Each team must declare a Captain.

    • All participants in a team must be registered Topcoder members in good standing, and must individually register for this Competition and accept its Terms and Conditions prior to joining the team.

    • Team Captains must apportion prize distribution percentages for each teammate on the Teaming Form. The sum of all prize portions must equal 100%.

    • The minimum permitted size of a team is 1 member, with no upper limit. However, our teaming form only allows up to 10 members in a team. If you have more than 10 members in your team, please email us directly at support@topcoder.com and we will register your team.

    • Only team Captains may submit a solution to the Competition. Notwithstanding Topcoder rules and conditions to the contrary, solutions submitted by any Topcoder member who is a member of a team on this challenge but is not the Captain of the team are not permitted, are ineligible for award, may be deleted, and may be grounds for dismissal of the entire team from the challenge.

    • The deadline for forming teams is 11:59pm ET on the 14th day following the start date of each scoring period.

    • Topcoder will prepare a Teaming Agreement for each team that has completed the Topcoder Teaming Form, and distribute it to each member of the team. Teaming Agreements must be electronically signed by each team member to be considered valid. All Teaming Agreements are void unless electronically signed by all team members by 11:59pm ET of the 21st day following the start date of each scoring period; any Teaming Agreement received after this period is void. Teaming Agreements may not be changed in any way after signature.

  • Relinquish - Topcoder is allowing registered competitors or teams to "relinquish". Relinquishing means the member or team will compete, and we will score their solution, but they will not be eligible for a prize. Once a competitor or team relinquishes, we post their name to a forum thread labeled "Relinquished Competitors".

  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

  • In this match you may use any programming language and libraries, including commercial solutions, provided Topcoder is able to run your code free of charge. You may also use open source languages and libraries, with the restrictions listed in the next section below. If your solution requires licenses, you must have these licenses and be able to legally install them in a testing VM (see the “Requirements to Win a Prize” section). Submissions will be deleted/destroyed after they are confirmed. Topcoder will not purchase licenses to run your code. Prior to submission, please make absolutely sure your submission can be run by Topcoder free of cost, and with all necessary licenses pre-installed in your solution. Topcoder is not required to contact submitters for additional instructions if the code does not run. If we are unable to run your solution due to license problems, including any requirement to download a license, your submission might be rejected. Be sure to contact us right away if you have concerns about this requirement.

  • You may use open source languages and libraries provided they are equally free for your use, use by another competitor, or use by the client. If your solution includes licensed elements (software, data, programming language, etc) make sure that all such elements are covered by licenses that explicitly allow commercial use.

  • If your solution includes licensed software (e.g. commercial software, open source software, etc), you must include the full license agreements with your submission. Include your licenses in a folder labeled “Licenses”. Within the same folder, include a text file labeled “README” that explains the purpose of each licensed software package as it is used in your solution.     

Final Prizes

In order to receive a final prize in a single sprint, you must do all the following:

  • Achieve a score in the top five according to the final test results. See the "Scoring" section above.

  • Once the final scores are posted and winners are announced, the prize winner candidates have 7 days to submit a report outlining their final algorithm, explaining the logic behind it and the steps of their approach, and including documentation on how to run the code. You will receive a template that helps you create your final report.

  • If you place in a prize winning rank but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.

In order to receive a bonus quarterly/overall prize, you must do all the following:

  • Achieve a score in the top ten according to the final test results. See the "Scoring" section above, including the information about the allowable missed sprints.

  • Your score must be higher than both the score calculated over NOAA’s predictions and the score calculated over Rodeo I's predictions (i.e., you must beat the better of the two).

  • Once the final scores are posted and winners are announced, the prize winner candidates have 7 days to submit:

    • A report outlining their final algorithm, explaining the logic behind it and the steps of their approach, and including documentation on how to run the code. You will receive a template that helps you create your final report.

    • Dockerized version of your algorithm, along with any assets/materials necessary to deploy, use and train it. The technical details on how to dockerize the solution are described in a separate document. The code in your container should produce the same results as in your submissions.

  • If you place in a prize winning rank but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.