Challenge Overview

Challenge Objectives

  • Prepare an ideation document, along with an estimated cost sheet, for migrating an existing structural and functional genome annotation pipeline to the AWS Cloud, based on the existing workflow diagram provided.
  • Create sample infrastructure code, as a Terraform or AWS CloudFormation template, for the proposed architecture.

Project Background

In this project we are migrating our client's internally developed pipelines for structural (GSAP) and functional (GFAP) genome annotation to the cloud. These pipelines are heavily utilized to meet all bioinformatics needs related to genome annotation requests across a wide variety of species of interest (crops, weeds, insects, fungi), and they are currently established to run on on-premises HPC resources.

Due to the ever-growing queue of annotation requests arising from different R&D functions, the time spent discussing prioritization, and increasing competition for a finite amount of on-premises compute resources, the annotation workflow was identified as a key impact opportunity for cloud deployment as a potential solution to meet internal needs and deadlines.

The goal is a bioinformatics pipeline written in Python that is fully Dockerized, optimized to run on the AWS Cloud, and performs genome annotation in an efficient, cost-effective manner.

Technology Stack

  • Python
  • Perl
  • MySQL
  • Oracle
  • Docker
  • AWS
  • Multiple third-party tools (detailed in forum)
  • FireWorks for workflow tracking (currently used)
  • MongoDB (used by Fireworks)

Individual Requirements

Requirements

  • The expected modular workflow is provided in the forum. Based on that workflow, you have to perform the following tasks for Structural Annotation (SA), Quality Control (QC), and Functional Annotation (FA). A run may be any of the following combinations:
    • SA + QC
    • SA + QC + FA
    • FA only
  • Identify and document the right workflow orchestration tool for the pipeline in AWS (SWF or Step Functions); a minimal Step Functions sketch follows the note at the end of this section.
    • The process should notify users of execution issues or failures.
    • The process should allow a step to be restarted if it fails in the middle of the pipeline.
    • The process should enable users to abort and restart a run at will. A job may be restarted with or without modifications to the configuration file.
    • Every module within the SA, QC, and FA workflows should also function independently; that is, individual modules should be able to run on their own as well as part of the pipeline.
    • The process should enable users to toggle modules on/off.
  • Identify and document how AWS Batch can be used to run multiple jobs in this pipeline (see the Batch sketch below).
  • Document how to store the Docker images for the different Python modules in ECR, for use by ECS (the Batch sketch below shows a job definition pointing at an ECR image).
    • All input and output of the Python scripts will use S3 (see the S3 I/O sketch below).
  • Document how CloudWatch can be used for monitoring in the above architecture (see the CloudWatch sketch below).
  • Create sample provisioning code with some mock Python modules.
  • Estimate the cost of the cloud infrastructure in an XLS spreadsheet so that we can estimate the cost of the pipeline. Our client is hoping to take advantage of Spot pricing and other cost-saving mechanisms that are available. If the cost savings have performance implications, please indicate this in your analysis and discuss the trade-offs.
  • If there are other components of the AWS Platform or AWS Big Data Stack that you think would be helpful, please include those in your proposal. Note that the client does NOT want to change their current functional methodology. We’re not looking for suggestions about improvements in Machine Learning capability or the underlying genome annotation process in this challenge. Rather, we’re trying to execute the existing workflow and tool set as expeditiously as possible and move it to a scalable cloud-based infrastructure.
Note: One job of the current pipeline runs in approximately 7–10 days on-premises with about 1,000 CPUs of total computing capacity. The client wants to be able to scale the annotation process to handle multiple runs of the pipeline (up to 10) in parallel, but of course they’d like the execution to be as time- and cost-efficient as possible. Document your recommendation in the ideation document; a back-of-envelope compute-cost calculation is also sketched below.
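
Choosing between SWF and Step Functions is part of the deliverable, but to make the notification, retry/restart, and module-toggle requirements concrete, here is a minimal sketch assuming Step Functions drives AWS Batch. All ARNs, queue and module names, and the run_fa toggle field are hypothetical placeholders, not details of the client's environment.

import json
import boto3

# --- Hypothetical placeholders; real ARNs come from the provisioning code ---
JOB_QUEUE = "arn:aws:batch:us-east-1:123456789012:job-queue/annotation-queue"
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:annotation-alerts"
SFN_ROLE = "arn:aws:iam::123456789012:role/annotation-sfn-role"

def batch_task(name, job_definition, next_state):
    """A Task state that runs one pipeline module as an AWS Batch job,
    retries it on failure, and routes to an SNS notification if it
    finally fails."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {"JobName": name, "JobQueue": JOB_QUEUE,
                       "JobDefinition": job_definition},
        # Keep the execution input (e.g. the run_fa toggle) intact.
        "ResultPath": "$.last_job",
        "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 60,
                   "MaxAttempts": 2, "BackoffRate": 2.0}],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        "Next": next_state,
    }

definition = {
    "Comment": "SA + QC + optional FA genome annotation pipeline",
    "StartAt": "StructuralAnnotation",
    "States": {
        "StructuralAnnotation": batch_task("sa", "sa-module", "QualityControl"),
        "QualityControl": batch_task("qc", "qc-module", "CheckRunFA"),
        # Module toggle: the execution input decides whether FA runs.
        "CheckRunFA": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.run_fa", "BooleanEquals": True,
                         "Next": "FunctionalAnnotation"}],
            "Default": "Done",
        },
        "FunctionalAnnotation": batch_task("fa", "fa-module", "Done"),
        # Notify users of the failure, then mark the execution failed.
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": ALERT_TOPIC, "Message.$": "$"},
            "Next": "FailRun",
        },
        "FailRun": {"Type": "Fail"},
        "Done": {"Type": "Succeed"},
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="genome-annotation-pipeline",
    definition=json.dumps(definition),
    roleArn=SFN_ROLE,
)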
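
A companion sketch of the AWS Batch side: registering a job definition that points at a module image in ECR (Batch schedules the containers onto ECS) and submitting dependent and array jobs. The account ID, queue, job-definition and bucket names, and resource sizes are assumptions; Spot capacity would be configured on the queue's compute environment in the provisioning code.

import boto3

batch = boto3.client("batch")

# Each Python module is a Docker image stored in ECR; the job definition
# points Batch at that image.
batch.register_job_definition(
    jobDefinitionName="sa-module",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sa-module:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "16384"},  # MiB
        ],
    },
)

# Submit the SA step for one run.
sa = batch.submit_job(
    jobName="run-001-sa",
    jobQueue="annotation-queue",  # hypothetical queue
    jobDefinition="sa-module",
    containerOverrides={"environment": [
        {"name": "INPUT_S3", "value": "s3://annotation-data/run-001/input/"},
    ]},
)

# QC starts only after SA succeeds; Batch enforces the dependency.
batch.submit_job(
    jobName="run-001-qc",
    jobQueue="annotation-queue",
    jobDefinition="qc-module",
    dependsOn=[{"jobId": sa["jobId"]}],
)

# Embarrassingly parallel work (e.g. one container per sequence chunk)
# can fan out as a single array job.
batch.submit_job(
    jobName="run-001-sa-chunks",
    jobQueue="annotation-queue",
    jobDefinition="sa-module",
    arrayProperties={"size": 100},  # 100 parallel child containers
    dependsOn=[{"jobId": sa["jobId"]}],
)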
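
Because every module must run both standalone and inside the pipeline, with all input and output on S3, each container can use a thin wrapper like the following. The bucket layout, the /tmp/work directory, and the annotate command are hypothetical stand-ins for the real module entry points.

import argparse
import pathlib
import subprocess
import boto3

s3 = boto3.client("s3")

def download_prefix(bucket, prefix, dest):
    """Pull every input object under an S3 prefix into a local work dir."""
    dest.mkdir(parents=True, exist_ok=True)
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            s3.download_file(bucket, obj["Key"], str(dest / pathlib.Path(obj["Key"]).name))

def upload_dir(bucket, prefix, src):
    """Push result files back to S3 under the run's output prefix."""
    for path in src.iterdir():
        if path.is_file():
            s3.upload_file(str(path), bucket, prefix + path.name)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run one pipeline module.")
    parser.add_argument("--bucket", required=True)
    parser.add_argument("--run-id", required=True)
    parser.add_argument("--step", required=True, choices=["sa", "qc", "fa"])
    args = parser.parse_args()

    work = pathlib.Path("/tmp/work")
    download_prefix(args.bucket, f"{args.run_id}/input/", work)
    # Hypothetical stand-in for the real annotation tool invocation.
    subprocess.run(["annotate", "--step", args.step, str(work)], check=True)
    upload_dir(args.bucket, f"{args.run_id}/output/{args.step}/", work)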
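
For monitoring, one possible CloudWatch pattern (among others, such as alarming on Batch state-change events) is a metric filter over the container logs plus an alarm that notifies the users' SNS topic. /aws/batch/job is the default Batch log group; the metric namespace and topic ARN are placeholders.

import boto3

logs = boto3.client("logs")
cw = boto3.client("cloudwatch")

# Count ERROR lines emitted by the pipeline containers.
logs.put_metric_filter(
    logGroupName="/aws/batch/job",
    filterName="annotation-errors",
    filterPattern="ERROR",
    metricTransformations=[{
        "metricName": "AnnotationErrors",
        "metricNamespace": "GenomeAnnotation",  # hypothetical namespace
        "metricValue": "1",
    }],
)

# Alarm (and notify users via SNS) when any errors appear within 5 minutes.
cw.put_metric_alarm(
    AlarmName="annotation-pipeline-errors",
    Namespace="GenomeAnnotation",
    MetricName="AnnotationErrors",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:annotation-alerts"],
)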
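
Finally, a back-of-envelope compute-cost calculation based on the figures in the note above. The dollar rates are placeholders only, not quoted AWS prices; substitute current On-Demand and Spot rates for the instance family you propose.

VCPUS = 1000            # on-prem capacity cited in the note above
DAYS_LOW, DAYS_HIGH = 7, 10

ON_DEMAND_PER_VCPU_HR = 0.05  # PLACEHOLDER rate, not a quoted AWS price
SPOT_DISCOUNT = 0.70          # placeholder; AWS advertises up to ~90% off

for days in (DAYS_LOW, DAYS_HIGH):
    vcpu_hours = VCPUS * 24 * days
    on_demand = vcpu_hours * ON_DEMAND_PER_VCPU_HR
    spot = on_demand * (1 - SPOT_DISCOUNT)
    print(f"{days} days: {vcpu_hours:,} vCPU-hours, "
          f"~${on_demand:,.0f} on-demand, ~${spot:,.0f} spot")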

Final Submission Guidelines

What to Submit

  • An ideation document in Word or PDF format
  • The document should include detailed architecture diagrams of the envisioned cloud infrastructure.
  • Please provide a cost estimate for the execution of each vertical section (SA, QC, and FA) of the workflow, as well as the cost of executing the complete pipeline. Please include data transfer costs.
  • Infrastructure provisioning code for some mock Python scripts.

Evaluation Criteria

1. Submissions will be evaluated on the level of detail in the proposal and the rigor of the cost estimates.
2. Appropriate use of AWS infrastructure.
3. Documentation of the use of AWS workflow tools.
4. Evaluation of the provisioning code.

ELIGIBLE EVENTS:

2020 Topcoder® Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30104621