Topcoder - Create CronJob For Populating Changed Challenges To Elasticsearch


Challenge Overview

Previously, we created the initial approach to populate a challenge or a list of challenges into Elasticsearch.

For this challenge, we'd like to create a cronjob that will do the following:

1. Please use https://github.com/spinscale/dropwizard-jobs as the job framework; if you have a better choice, please raise it in the forum and ask for approval.

The job should run at a regular interval to find the recently changed challenge ids with the following query:

                  SELECT DISTINCT project_id
                  FROM project
                  WHERE modify_date < sysdate AND modify_date > <<last run timestamp>>
                  UNION
                  SELECT DISTINCT project_id
                  FROM project_info
                  WHERE modify_date < sysdate AND modify_date > <<last run timestamp>>
                  UNION
                  SELECT DISTINCT project_id
                  FROM project_phase
                  WHERE modify_date < sysdate AND modify_date > <<last run timestamp>>
                  UNION
                  SELECT DISTINCT project_id
                  FROM upload
                  WHERE modify_date < sysdate AND modify_date > <<last run timestamp>>
                  UNION
                  SELECT DISTINCT project_id
                  FROM resource
                  WHERE modify_date < sysdate AND modify_date > <<last run timestamp>>
                  UNION
                  SELECT DISTINCT project_id
                  FROM prize
                  WHERE modify_date < sysdate AND modify_date > <<last run timestamp>>

The interval should be configurable in the YAML file and overridable by environment variables; see the sketch below.
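A minimal sketch of how the job could look with dropwizard-jobs follows. The class name, the 10-minute default interval, and the step comments are assumptions; the exact doJob signature differs between dropwizard-jobs versions, and the interval can alternatively be looked up from the Dropwizard YAML config as described in the dropwizard-jobs README.

    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;

    import de.spinscale.dropwizard.jobs.Job;
    import de.spinscale.dropwizard.jobs.annotations.Every;

    // Hypothetical sketch: runs every 10 minutes; the interval can instead be read from
    // the Dropwizard YAML config (and thus overridden via environment variables).
    @Every("10m")
    public class ChallengeSyncJob extends Job {

        @Override
        public void doJob(JobExecutionContext context) throws JobExecutionException {
            // 1. Try to acquire the distributed lock (item 2); skip this run if another
            //    instance already holds it.
            // 2. Read the last run timestamp from Redis (item 3), falling back to a very
            //    old date for the initial load.
            // 3. Run the query from item 1 to collect the changed challenge ids.
            // 4. Re-index the changed challenges into Elasticsearch in batches (item 5).
            // 5. Save this run's start time back to Redis and release the lock.
        }
    }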

2. The service will possibly be deployed on several machines behind a load balancer, so several instances of the job could run simultaneously. Use a distributed lock to ensure that only one cronjob runs at a time; the jobs are identical, so there is no need for more than one to run concurrently.

You can use Redisson to achieve this (a sketch follows below); see https://github.com/redisson/redisson/wiki/8.-Distributed-locks-and-synchronizers
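A sketch of how the lock could be used with Redisson, assuming a single-server Redis setup; the lock name, the 30-minute lease time, and the guard class itself are assumptions.

    import java.util.concurrent.TimeUnit;

    import org.redisson.Redisson;
    import org.redisson.api.RLock;
    import org.redisson.api.RedissonClient;
    import org.redisson.config.Config;

    public class JobLockGuard {

        private final RedissonClient redisson;

        public JobLockGuard(String redisAddress) {
            // redisAddress comes from configuration, e.g. "redis://127.0.0.1:6379" (item 4).
            Config config = new Config();
            config.useSingleServer().setAddress(redisAddress);
            this.redisson = Redisson.create(config);
        }

        // Runs the task only if this instance wins the lock; other instances simply skip
        // the run, since the jobs are identical and only one needs to execute.
        public void runExclusively(Runnable task) throws InterruptedException {
            RLock lock = redisson.getLock("challenge-sync-job-lock");
            // waitTime = 0: don't wait if another instance is already running the job.
            // leaseTime = 30 minutes: safety net so the lock is released if this instance dies.
            if (lock.tryLock(0, 30, TimeUnit.MINUTES)) {
                try {
                    task.run();
                } finally {
                    lock.unlock();
                }
            }
        }
    }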

3. The cronjob will store and retrieve the last run timestamp from Redis, so that the timestamp can be used in the query from item 1. If there is no last run timestamp, treat the run as an initial load and use a sufficiently old timestamp.

The time at which the new run started should then be saved back to Redis.

For this challenge, let's use the Redis cache to store this information (a sketch follows below). Note that for different environments the key should be different, so it is better to add a per-environment prefix, or make the prefix configurable too.
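A minimal sketch of the timestamp bookkeeping via Redisson; the key layout (environment prefix plus a fixed suffix) and the epoch-zero fallback for the initial load are assumptions.

    import java.util.Date;

    import org.redisson.api.RBucket;
    import org.redisson.api.RedissonClient;

    public class LastRunTimestampStore {

        private final RedissonClient redisson;
        private final String key;

        public LastRunTimestampStore(RedissonClient redisson, String envPrefix) {
            this.redisson = redisson;
            // e.g. "dev:challenge-sync:last-run" vs "prod:challenge-sync:last-run"
            this.key = envPrefix + ":challenge-sync:last-run";
        }

        // Returns the stored timestamp, or a sufficiently old date (epoch) to force an
        // initial full load when nothing has been stored yet.
        public Date getLastRun() {
            RBucket<Long> bucket = redisson.getBucket(key);
            Long millis = bucket.get();
            return millis == null ? new Date(0L) : new Date(millis);
        }

        // Saves the start time of the current run for the next execution to pick up.
        public void saveRunStart(Date startedAt) {
            RBucket<Long> bucket = redisson.getBucket(key);
            bucket.set(startedAt.getTime());
        }
    }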

4. The Redis cache settings should be configurable in the YAML file and overridable by environment variables (see the configuration sketch below).
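For example, the settings could be exposed through the Dropwizard configuration class roughly as sketched below; the field names and defaults are assumptions. Dropwizard's standard EnvironmentVariableSubstitutor can then be used so the YAML values can be overridden from environment variables.

    import javax.validation.constraints.NotNull;

    import com.fasterxml.jackson.annotation.JsonProperty;
    import io.dropwizard.Configuration;

    public class ChallengeSyncConfiguration extends Configuration {

        // Redis endpoint used for both the distributed lock and the last-run timestamp.
        @NotNull
        @JsonProperty("redisAddress")
        private String redisAddress = "redis://localhost:6379";

        // Per-environment key prefix (item 3), e.g. "dev", "qa", "prod".
        @JsonProperty("redisKeyPrefix")
        private String redisKeyPrefix = "dev";

        // Batch size for Elasticsearch updates (item 5).
        @JsonProperty("batchSize")
        private int batchSize = 100;

        public String getRedisAddress() { return redisAddress; }

        public String getRedisKeyPrefix() { return redisKeyPrefix; }

        public int getBatchSize() { return batchSize; }
    }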

5. It is possible that the query in item 1 returns a large list of challenge ids (for example, on the initial load). The job should be able to do the update in batches, retrieving a configurable number of ids (for example, 100) at a time and updating them in Elasticsearch.

When listing the challenge ids, be sure to use descending order so that the newer challenges are updated first; this is important for us. A batching sketch follows below.
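A sketch of the batching step. The indexChallenges(...) call stands in for the existing challenge-to-Elasticsearch loader from the previous challenge, and sorting by id descending is used here as a simple stand-in for "newest first"; both are assumptions.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class BatchIndexer {

        private final int batchSize; // configurable, e.g. 100

        public BatchIndexer(int batchSize) {
            this.batchSize = batchSize;
        }

        public void indexInBatches(List<Long> challengeIds) {
            // Process newer challenges first (descending order), as required above.
            List<Long> sorted = new ArrayList<>(challengeIds);
            sorted.sort(Comparator.reverseOrder());

            for (int from = 0; from < sorted.size(); from += batchSize) {
                int to = Math.min(from + batchSize, sorted.size());
                indexChallenges(sorted.subList(from, to));
            }
        }

        private void indexChallenges(List<Long> batch) {
            // Placeholder: delegate to the existing code that populates these challenges
            // into Elasticsearch.
        }
    }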

 

Final Submission Guidelines

- Code Changes
- Verification Steps

ELIGIBLE EVENTS:

2018 Topcoder(R) Open

Review style: Final Review

Review: Community Review Board

Approval: User Sign-Off

ID: 30061798