Challenge Overview

Background

According to Wikipedia, "Mud logging is the creation of a detailed record (well log) of a borehole by examining the cuttings of rock brought to the surface by the circulating drilling medium (most commonly drilling mud)." Quartz Energy has provided Topcoder with a set of mud logs, and we’re developing an application to extract structured meaning from these records. The documents are very interesting - they are even oil-well shaped! You can read more details about them here. If oil is revealed in a well-hole sample, a "Show" may be recorded in the logs. This is one of the most important pieces of information in the mud logs. Our first step in gathering information from these files is to find the relevant mud-logging terms within the text of these mud logs.

Requirements

In previous challenges such as this one, the Topcoder community developed a command-line Java application that extracts a set of phrases from a Mud Log image file using optical character recognition (OCR) technology such as Tesseract and Google Vision.  These images are typically in TIFF format.  Now our client wishes to operationalize this application in the following ways:
  1. The application is going to be deployed to the Azure Cloud
  2. We’re going to split the “commands” of the application -- extract, remove outliers, and generate marked images -- into separate Azure (cloud) functions.  Conveniently, Azure functions support Java.  For this challenge we’re going to focus on the extract command.
  3. The application should write the extracted image and phrase data to a Cosmos NoSQL Database.
  4. Mud Log images and supporting metadata files (SIF or LIC files) will be loaded into Azure’s Blob Storage.
  5. Each Azure function execution should process one image.  Ideally, the application processes each image in a few seconds, but the maximum runtime for processing each image must be limited to 5 minutes to avoid timeouts on the Azure platform. More information about Azure Functions quotas is available here: https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale. We’ll need to adjust our algorithm slightly so that it doesn’t partition and process component images in an infinite loop.
  6. The extract Azure function should be triggered only when image files are loaded to the image storage bucket; upon being triggered, it will find the corresponding LIC files in another storage bucket.
  7. Image processing will need to be handled in memory if possible.  The current process generates a lot of temp files which are undesirable in the serverless context.
  8. All legacy Tesseract code and command-line code that isn’t relevant to the cloud function extraction process should be removed.
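To make requirements 5 concrete, the time budget and the infinite-loop risk in the partitioning step can both be guarded explicitly. The sketch below is illustrative only, not the existing codebase's API: the names `PartitionGuard`, `MAX_DEPTH`, and the tile-splitting logic are assumptions, and a real implementation would split actual sub-images rather than string labels.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not the existing codebase): guards the recursive
 * image-partitioning step with a maximum depth and a wall-clock deadline,
 * so a single function execution can never loop indefinitely or run past
 * the 5-minute Azure Functions budget.
 */
public class PartitionGuard {
    // Hypothetical limits; tune to the real tiling algorithm.
    static final int MAX_DEPTH = 4;
    static final Duration BUDGET = Duration.ofMinutes(5);

    /** Collects tile labels instead of real sub-images, for brevity. */
    static void partition(String tile, int depth, Instant deadline, List<String> out) {
        if (Instant.now().isAfter(deadline)) {
            // Fail fast instead of letting the Azure platform kill the execution.
            throw new IllegalStateException("Time budget exhausted before partitioning finished");
        }
        if (depth >= MAX_DEPTH) {   // hard cap prevents infinite re-partitioning
            out.add(tile);
            return;
        }
        // A real implementation would decide here whether the tile is already
        // small enough to OCR directly; this sketch always splits until the cap.
        partition(tile + ".L", depth + 1, deadline, out);
        partition(tile + ".R", depth + 1, deadline, out);
    }

    public static void main(String[] args) {
        List<String> tiles = new ArrayList<>();
        partition("root", 0, Instant.now().plus(BUDGET), tiles);
        System.out.println(tiles.size()); // 2^MAX_DEPTH = 16 leaf tiles
    }
}
```

The depth cap alone guarantees termination; the deadline check additionally lets the function move the image to the error bucket itself rather than being cut off mid-write by the platform timeout.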
It’s probably obvious from the requirements above, but in this challenge, you’ll be constructing a serverless workflow on the Azure Cloud platform:
  1. Images and SIF or LIC files are loaded to a designated “Unprocessed” Blob Storage location.
  2. This triggers an Azure function execution to process the images.  The trigger should occur when image files are added to the storage location.  We’ll instruct users to add the metadata files (either SIF or LIC files) first.
  3. Data is extracted and written to Cosmos DB.
  4. Once processing is complete, remove the image from the unprocessed Blob Storage location and store it in a “Processed” Blob Storage location.
  5. Any images that cannot be processed are stored in another storage bucket, an “error” bucket.
Other events will be fired to execute the remove outliers and generate marked image functions (not required for this challenge).
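The routing decisions in the workflow above (which blobs trigger extraction, where to look for the matching metadata file, and where a blob lands afterwards) can be sketched as plain string logic. The container names "processed" and "error", and the helper names below, are assumptions for illustration, not names fixed by the challenge.

```java
import java.util.Locale;

/**
 * Illustrative sketch of the blob-routing rules in the serverless workflow:
 * trigger only on TIFF images, derive the candidate metadata blob names,
 * and pick the destination container based on the processing outcome.
 */
public class BlobRouting {

    /** Only TIFF images should trigger the extract function. */
    static boolean isTriggerImage(String blobName) {
        String lower = blobName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".tif") || lower.endsWith(".tiff");
    }

    /** Candidate metadata blob names for an image, trying LIC then SIF. */
    static String[] metadataCandidates(String imageBlobName) {
        String base = imageBlobName.replaceFirst("\\.[^.]+$", "");
        return new String[] { base + ".LIC", base + ".SIF" };
    }

    /** Destination container once processing finishes (assumed names). */
    static String destinationContainer(boolean success) {
        return success ? "processed" : "error";
    }

    public static void main(String[] args) {
        System.out.println(isTriggerImage("well-123.TIFF"));        // true
        System.out.println(metadataCandidates("well-123.tiff")[0]); // well-123.LIC
        System.out.println(destinationContainer(false));            // error
    }
}
```

Filtering on the image extension in the trigger (rather than triggering on every blob) is what lets users upload the SIF/LIC metadata files first without starting a function execution prematurely.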

Here's a basic architecture diagram for clarity:


Final Submission Guidelines

Project Deliverables

  1. Please provide your Azure Function Java code and database schema creation scripts to Topcoder in zipped format. 
  2. Set up the complete workflow outlined above in a personal Azure Account.  This includes Blob Storage and the Cosmos NoSQL database with the working schema.
  3. Carefully document your configuration/solution either in the Readme.md or in a User Guide.doc.  Screenshots of the Azure console configuration are essential so we can duplicate your work.
  4. Please record a screen share video of both your Azure configuration and the execution of the app. 
  5. Use of the Google Vision API is mandatory.  This is already implemented in the current codebase.
  6. Java code and Maven scripts from the previous challenges can be found in the Code Document forums along with sample data.  The current code does a significant amount of image processing to aid in character retrieval.  This processing should be ported to the new solution.

ELIGIBLE EVENTS:

Topcoder Open 2019

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30095252