Challenge Overview
According to Wikipedia, “Mud logging is the creation of a detailed record (well log) of a borehole by examining the cuttings of rock brought to the surface by the circulating drilling medium (most commonly drilling mud).” Quartz Energy has provided Topcoder with a set of mud logs, and we’re developing an application to extract structured meaning from these records. The documents are very interesting; they are even oil-well shaped! You can read more details about them here. For this challenge, 101 mud log image files are being offered as training data. You can download this data set here.
If oil is revealed in a well hole sample, a “Show” may be recorded in the logs. This is one of the most important pieces of information in the mud logs. In a previous challenge, Topcoder member chok68 produced the winning submission, which we’re going to use as our baseline OCR solution. The code for the previous challenge can be found here.
Here is what the existing application already does:

1. Creates a MySQL database designated by the DB_DATABASENAME parameter in the .env file.
2. Iterates through all the mud log images in a directory designated on the command line.
3. Extracts the raw text from each mud log image file.
4. Stores the raw text in a database along with the mud log image file name.
5. Gives each image file a score based on the number of occurrences of the show phrases identified in the raw text, and stores the relevant phrases and scores in the database.
6. Creates a summary report of the image file names, scores, raw text, and extracted phrases, sorted by score descending.
7. Creates a graph/plot which displays the highest-scoring image files.

Full instructions on how to set up and execute the solution can be found in the ReadMe file in the root directory of the submission. You can download the submission here.
Notes:
Of all the tasks outlined above, task #3 (raw text extraction) is by far the most difficult. Many of the images are poor quality, and the text appears in a variety of different fonts and layouts.
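To give a feel for one common mitigation, here is a minimal OCR sketch, assuming opencv-python and pytesseract are installed. The preprocessing steps and the page segmentation mode are illustrative guesses, not the baseline's actual pipeline:

```python
# Hypothetical preprocessing sketch -- not the baseline's actual pipeline.
# Assumes opencv-python and pytesseract, with Tesseract on the PATH.
import cv2
import pytesseract

def extract_raw_text(image_path: str) -> str:
    """OCR a mud log image after basic cleanup (grayscale + Otsu binarization)."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding often helps with faded scans; tune against the data set.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # psm 6 ("assume a single uniform block of text") is a guess; mud logs vary
    # enough that other page segmentation modes may do better.
    return pytesseract.image_to_string(binary, config="--psm 6")
```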
New Requirements
In this challenge, we’re going to build on the previously built solution and add a few new requirements:
1. Please add the following columns to the database schema (a migration sketch appears after this list):
   - IMAGE_OCR_PHRASE.OCR_PHRASE_TYPE CHARACTER(10) NOT NULL
   - IMAGE_OCR_PHRASE.OCR_PHRASE_COUNT INT NOT NULL DEFAULT 0
   - IMAGE_OCR.PHRASE_COUNT INT NOT NULL DEFAULT 0
2. There are 4 valid values for the IMAGE_OCR_PHRASE.OCR_PHRASE_TYPE field: Show, Stain, Trace, Negative.
3. The scoring rubric has been updated to include some new terms and one additional type of phrase to identify: a negative phrase. Here is the revised scoring rubric. The current solution does NOT implement the negative case scoring, so you will be able to improve on the baseline solution's score simply by implementing the negative case. A negative phrase is just a Show, Stain, or Trace phrase with the word "no" in front of it. All phrases are case-insensitive. (A minimal detection sketch appears after this list.)
4. Your solution should populate the three new database fields requested in #1 above. Each phrase of any type (Show, Stain, Trace, or Negative) counts as 1 toward IMAGE_OCR.PHRASE_COUNT. For example, if you identify 3 phrases in an image:

   Oil Stain
   Oil Stain
   Oil Stain

   then the values for the various fields are:

   IMAGE_OCR.PHRASE_COUNT = 3
   IMAGE_OCR.SCORE = 6
   IMAGE_OCR_PHRASE.OCR_PHRASE_TYPE = “Stain”
   IMAGE_OCR_PHRASE.OCR_PHRASE_COUNT = 3
   IMAGE_OCR_PHRASE.OCR_PHRASE = “Oil Stain”
   IMAGE_OCR_PHRASE.SCORE = 6

   (See the aggregation sketch after this list.)
5. Topcoder has manually inspected about 200 image files to determine the ground truth data that will be the basis for scoring the submissions. We’re providing a subset of this data as training data. The evaluation of the solutions will occur against both the training data and an additional set of testing data that isn’t being provided in advance. Here is the ground truth data for the images provided in the training data set. We are scoring for the inclusion of the negative phrases in the accuracy count: although they don’t affect the IMAGE_OCR.SCORE column, the negative phrases should appear in your IMAGE_OCR.PHRASE_COUNT totals.
6. The scoring will be conducted based on accuracy against the phrase counts, so your application must find each of the phrases. We’ll score each submission based on the sum of the distances between the submission’s phrase counts and the ground truth phrase counts for all the images in the testing data set. The submission with the lowest score is the most accurate. (A scoring sketch appears after this list.)
7. The submissions will be compared against each other for accuracy based on the score described above and ranked in accuracy order, with the most accurate (the lowest cumulative distance score) receiving a 10 and the next receiving a 9 in the performance element of the scorecard. The theoretical perfect accuracy score is 0, which would receive a final score of 10. Please review the scorecard to see the weighting of the performance characteristics. It is possible that we’ll have a tie in the accuracy scoring, and we’ll allow a tie in that element of the competition. Although the accuracy element of the competition will be heavily weighted, meeting the functional requirements and good coding style and practice are important and could be decisive in the competition.
8. Produce new images in an output folder, using the same file names, that highlight the phrases you found, as discussed here: https://stackoverflow.com/questions/20831612/getting-the-bounding-box-of-the-recognized-words-using-python-tesseract. Ideally, these highlights are color-coded Green, Light Green, Yellow, and Red to agree with the phrase type of the terms. Please see the scoring rubric for details on the phrase types. (A highlighting sketch appears below.)
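For requirement #1, the schema change might look like the sketch below. The table and column names come from this spec; the use of mysql-connector-python and the bare ALTER TABLE approach are assumptions about the baseline's environment.

```python
# Hypothetical migration sketch for requirement #1. The connection parameters
# and use of mysql-connector-python are assumptions about the baseline setup.
import mysql.connector

DDL = [
    "ALTER TABLE IMAGE_OCR_PHRASE ADD COLUMN OCR_PHRASE_TYPE CHARACTER(10) NOT NULL",
    "ALTER TABLE IMAGE_OCR_PHRASE ADD COLUMN OCR_PHRASE_COUNT INT NOT NULL DEFAULT 0",
    "ALTER TABLE IMAGE_OCR ADD COLUMN PHRASE_COUNT INT NOT NULL DEFAULT 0",
]

def migrate(**conn_params):
    """Apply the new columns. If the tables already hold rows, the NOT NULL
    column without a DEFAULT may need a DEFAULT clause added first."""
    conn = mysql.connector.connect(**conn_params)
    try:
        cursor = conn.cursor()
        for statement in DDL:
            cursor.execute(statement)
        conn.commit()
    finally:
        conn.close()
```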
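For requirement #3, the Negative type is mechanical to detect once the rubric's phrase list is in hand. A minimal sketch, assuming the phrases are available as plain strings (the phrases and types below are placeholders; substitute the revised rubric's actual terms):

```python
import re

# Placeholder phrase-to-type map -- substitute the full list and types
# from the revised scoring rubric.
PHRASE_TYPES = {
    "oil show": "Show",
    "oil stain": "Stain",
    "trace oil": "Trace",
}

def find_phrases(raw_text: str):
    """Yield (matched_text, phrase_type) pairs found in OCR'd text.

    A Negative phrase is any Show/Stain/Trace phrase preceded by the
    word "no"; all matching is case-insensitive.
    """
    for phrase, phrase_type in PHRASE_TYPES.items():
        # \b word boundaries keep "no" from matching inside other words.
        pattern = re.compile(r"\b(no\s+)?" + re.escape(phrase) + r"\b",
                             re.IGNORECASE)
        for match in pattern.finditer(raw_text):
            yield match.group(0), ("Negative" if match.group(1) else phrase_type)
```

For example, on the text "No oil stain but good oil show" this yields ("No oil stain", "Negative") and ("oil show", "Show").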
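For requirement #4, one way to derive the example's field values is to group matches by phrase text and type. The weights here are not authoritative: Stain = 2 is inferred from the worked example above, Negative = 0 follows from requirement #5's note that negative phrases don't affect IMAGE_OCR.SCORE, and the Show and Trace values are placeholders to be read from the revised rubric.

```python
from collections import Counter

# Per-occurrence score weights. Stain=2 is inferred from the worked example;
# Negative=0 follows from the spec; Show and Trace are placeholders -- take
# the real values from the revised scoring rubric.
WEIGHTS = {"Show": 1, "Stain": 2, "Trace": 1, "Negative": 0}

def aggregate(matches):
    """matches: iterable of (phrase_text, phrase_type) as yielded above.

    Returns (image_phrase_count, image_score, per_phrase_rows), where each
    row mirrors one IMAGE_OCR_PHRASE record. Lowercasing for grouping is a
    design choice here, since all matching is case-insensitive anyway.
    """
    counts = Counter((text.lower(), ptype) for text, ptype in matches)
    rows = []
    for (text, ptype), n in counts.items():
        rows.append({
            "OCR_PHRASE": text,
            "OCR_PHRASE_TYPE": ptype,
            "OCR_PHRASE_COUNT": n,
            "SCORE": n * WEIGHTS[ptype],
        })
    image_phrase_count = sum(r["OCR_PHRASE_COUNT"] for r in rows)
    image_score = sum(r["SCORE"] for r in rows)
    return image_phrase_count, image_score, rows
```

On the worked example (three "Oil Stain" matches) this produces one IMAGE_OCR_PHRASE row with count 3 and score 6, and image totals of 3 and 6, matching the spec.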
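Requirement #6 reads as a cumulative distance over phrase counts. The exact metric isn't spelled out beyond "sum of all the distances", so absolute difference is an assumption in this sketch:

```python
def evaluation_score(submission_counts: dict, ground_truth_counts: dict) -> int:
    """Sum of |submitted PHRASE_COUNT - ground truth PHRASE_COUNT| over all
    test images; lower is better, and 0 is a perfect score.

    Both arguments map image file name -> phrase count. Images the
    submission missed entirely are treated as count 0.
    """
    return sum(abs(submission_counts.get(name, 0) - truth)
               for name, truth in ground_truth_counts.items())
```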
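Finally, requirement #8 follows the linked Stack Overflow approach: pytesseract.image_to_data returns word-level bounding boxes that can be drawn back onto the image. The type-to-color mapping below is a guess at the intended coding, and this sketch boxes single words only; multi-word phrases would need their word boxes merged.

```python
# Sketch of the highlighting output (requirement #8), following the
# image_to_data approach from the linked Stack Overflow answer.
import os
import cv2
import pytesseract
from pytesseract import Output

COLORS_BGR = {                 # OpenCV uses BGR channel ordering
    "Show": (0, 128, 0),       # green
    "Stain": (144, 238, 144),  # light green
    "Trace": (0, 255, 255),    # yellow
    "Negative": (0, 0, 255),   # red
}

def highlight(image_path: str, out_dir: str, word_types: dict):
    """Draw a colored box around each OCR'd word whose lowercased text is a
    key of word_types (word -> phrase type). Saves under the same file name."""
    img = cv2.imread(image_path)
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    for i, word in enumerate(data["text"]):
        ptype = word_types.get(word.strip().lower())
        if ptype:
            x, y, w, h = (data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i])
            cv2.rectangle(img, (x, y), (x + w, y + h), COLORS_BGR[ptype], 2)
    cv2.imwrite(os.path.join(out_dir, os.path.basename(image_path)), img)
```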
Technology Overview
Python 3.6.x
MySQL 5.7+
Final Submission Guidelines
1. Please submit all code required by the application in your submission.zip
2. Document the build process for your code, including all dependencies (pip installs, etc.). Please update the existing README.md file as needed to allow for straightforward deployment of your solution.
3. You may use any Python Open Source libraries or technologies provided they are available for commercial use.