Challenge Overview
According to Wikipedia, “Mud logging is the creation of a detailed record (well log) of a borehole by examining the cuttings of rock brought to the surface by the circulating drilling medium (most commonly drilling mud).” Quartz Energy has provided Topcoder with a set of mud logs and we’re developing an application to extract structured meaning from these records. The documents are very interesting -- they are even oil-well shaped! You can read more details about them here. The mud log images files (500 of them) for this challenge can be downloaded here.
If oil is revealed in a well hole sample, a “Show” may be recorded in logs. This is one of the most important pieces of information in the mud logs. Our first attempt to gather information from these files is going to be to find text related to the word “Show” from the text extracted from the images files. You can view the relevant show terminology here.
For this challenge, your application should perform the following functions:
-
Extract the raw text from each mud log image file.
-
Store the raw text in a database along with the mud file image name.
-
Give each image file a score based on the number of occurrences of the show phrases identified in the raw text. The scoring rubric is found on the same excel worksheet with the Show Terminology. Each occurrence of the “Show” word or phrase is significant and the scores should be cumulative based on each occurrence. Store the score in the database. Store the show word or phrases found in the databases
-
Create a summary report of the image filenames, scores, raw text, and extract phrases sorted by score descending.
-
Create a graph/plot which displays the highest scoring images files. Over time we’ll develop an idea about what threshold is worth examining. Initially however, we should set a configurable threshold scoring parameter and display all files which have a score above that value.
Notes:
Of all the tasks outlined above, task #1 is by far the most difficult. The text in these files is often handwritten or obscured by grid lines and many of the images are poor quality.
Other Requirements
-
You should implement your OCR extraction routine in a file called “extracttext.py”. This file should have a function extract_text which accepts a directory path parameter and should return a dictionary. Each key-value pair in the dictionary will correspond to the filename (key) and the raw text associated with each file (value) of one image file found in the directory path provided. Using this interface will help us to automate evaluation of the OCR solutions in future contests. A template extracttext.py sample can be found here.
Technology Overview
Python 3.6.x
MySQL 5.7.+
Final Submission Guidelines
1. Please submit all code required by the application in your submission.zip
2. Document the build process for your code including all dependencies (pip installs etc..)
3. You may use any Python Open Source libraries or technologies provided they are available for commercial use.