Challenge Overview
Our client has developed multiple applications that allow users to extract data and tag phrases in Mud Log images. Currently, however, these applications are standalone: our data extraction tool produces a csv file, while our tagging tool loads data into a MySQL database, and the two aren't integrated. In this challenge we're going to develop a tool that uses a common schema.
We also need to incorporate functionality that lets the tool process depth registration files. These files have a “.LIC” extension, but they are really just text files.
Background:
-
We’ve developed a schema across our Mud Log program that our data extraction tool will need to support. That schema can be found here: https://drive.google.com/file/d/1nu1ue7tnRHYF2xg929uY2HFpALtHemOL/view?usp=sharing
-
The schema has three tables: IMAGE_OCR, IMAGE_OCR_PHRASE, IMAGE_OCR_DEPTH. Currently our data extraction tool only really supports the IMAGE_OCR_PHRASE table so we’ll need to make some updates to populate the other info.
-
MySQL 5.7 has been provisioned on each image: User: root / PW: topcoder
The schema above has NOT been loaded into the database.
The application will need to create a new version of this schema with each data extraction run.
-
For this challenge, we’ll provision an AWS Instance for you with Mud Log Phrase extraction code already working and deployed. The following commands execute our mud log extraction application on a data set of 205 images used for a recent Marathon Match. In your instance, you can run the data extraction application by following these steps:
$ cd mudOCR
$ sudo bash test.sh ~/mud_images/testing ~/mud_model/ ~/phrases_out.csv
A smaller data set for this challenge is also provided on the server:
~/mud_log_integration/images
~/mud_log_integration/lic
To execute the OCR phrase extraction process against this smaller data set you would execute the following:
$ cd mudOCR
$ sudo bash test.sh ~/mud_log_integration/images ~/mud_model/ ~/integration_phrases_out.csv
-
This code iterates through a set of image files and produces a csv file that looks like the following:
us49023204260000_3770841.tif,no cut,1091,38762,1154,38786
us49023204260000_3770841.tif,no cut,1062,46544,1126,46568
us49023204260000_3770841.tif,cut,1081,58116,1118,58140
us49023204260000_3770841.tif,odor,1021,71885,1068,71910
us49023204260000_3770841.tif,odor,1010,64275,1052,64299
us49023204260000_3770841.tif,odor,1052,69046,1098,69070
us49023204260000_3770841.tif,odor,1097,63541,1144,63565
us49023204260000_3770841.tif,odor,1067,64108,1114,64134
us49023204260000_3770841.tif,odor,1014,66387,1060,66411
us49023204260000_3770841.tif,odor,1007,63156,1053,63179
us49023204260000_3770841.tif,cut,1027,73108,1063,73132
us49023204260000_3770841.tif,odor,1067,65870,1114,65896
us49023204260000_3770841.tif,cut,1012,66888,1050,66912
us49023204260000_3770841.tif,cut,997,45038,1034,45062
us49023204260000_3770841.tif,cut,1077,65723,1104,65745
us49023204260000_3770841.tif,cut,1079,66822,1116,66846
us49023204260000_3770841.tif,odor,1006,62830,1054,62855
us49023204260000_3770841.tif,odor,1006,62663,1054,62687
us49023204260000_3770841.tif,cut,1046,24514,1088,24537
us49023204260000_3770841.tif,odor,1115,40046,1160,40070
The columns are the mud log file name, the phrase found, X1, Y1, X2, Y2 where X1, Y1 are the pixel coordinates of the upper left corner, and X2, Y2 are the pixel coordinates of the lower right corner of a bounding box surrounding the phrase.
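The column layout above can be read back with a small parser. The following is a minimal sketch, not the extraction tool's own code: the class name `PhraseRow` and its methods are hypothetical, and it assumes file names never contain commas.

```java
import java.util.Arrays;

/** One row of the extraction csv: file name, phrase, and bounding box corners. */
final class PhraseRow {
    final String fileName;
    final String phrase;
    final int x1, y1, x2, y2;

    PhraseRow(String fileName, String phrase, int x1, int y1, int x2, int y2) {
        this.fileName = fileName;
        this.phrase = phrase;
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /** Parses a line such as "a.tif,no cut,1091,38762,1154,38786". */
    static PhraseRow parse(String line) {
        String[] f = line.trim().split(",");
        if (f.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 fields: " + line);
        }
        int n = f.length;
        // The last four fields are the coordinates; everything between the
        // file name and the coordinates is treated as the phrase text.
        String phrase = String.join(",", Arrays.copyOfRange(f, 1, n - 4));
        return new PhraseRow(
                f[0].trim(),
                phrase.trim(),
                Integer.parseInt(f[n - 4].trim()),
                Integer.parseInt(f[n - 3].trim()),
                Integer.parseInt(f[n - 2].trim()),
                Integer.parseInt(f[n - 1].trim()));
    }

    /** Vertical midpoint of the bounding box, i.e. the average of Y1 and Y2. */
    double centerY() {
        return (y1 + y2) / 2.0;
    }
}
```

Reading the coordinates from the right-hand end of the line keeps the parser working even if a phrase itself were to contain a comma.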
Requirements for this challenge:
-
Create a command line Java application -- based on the data extraction code provided above -- which does the following:
-
Create a new database with the schema outlined above and a name provided/configured by user running the application.
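Because the database name comes from the user, it is worth validating before it is placed in a DDL statement. The sketch below shows one possible approach using plain JDBC; the class `DatabaseCreator`, its method names, and the allowed-character rule are assumptions for illustration, and the table definitions themselves must come from the schema document linked above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.regex.Pattern;

final class DatabaseCreator {
    // Assumption: only letters, digits and underscores are accepted, up to
    // 64 characters, so the name can be embedded safely in the statement.
    private static final Pattern SAFE_NAME = Pattern.compile("[A-Za-z0-9_]{1,64}");

    /** Builds the CREATE DATABASE statement for a validated name. */
    static String createDatabaseSql(String dbName) {
        if (dbName == null || !SAFE_NAME.matcher(dbName).matches()) {
            throw new IllegalArgumentException("Unsupported database name: " + dbName);
        }
        return "CREATE DATABASE `" + dbName + "`";
    }

    /** Connects to the server and creates the database for this run. */
    static void createDatabase(String jdbcUrl, String user, String password, String dbName)
            throws SQLException {
        try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password);
             Statement statement = connection.createStatement()) {
            statement.executeUpdate(createDatabaseSql(dbName));
            // The CREATE TABLE statements for IMAGE_OCR, IMAGE_OCR_PHRASE and
            // IMAGE_OCR_DEPTH would follow here, taken from the schema document.
        }
    }
}
```

Running `createDatabase` requires a MySQL JDBC driver on the classpath, which would be declared as a Maven dependency.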
-
Gathers basic information about each image (height in pixels, width in pixels, and file path) and loads it into the database. The tool should allow processing of files in a local directory or in an AWS S3 folder. For this challenge, you should process the images in the following folder: ~/mud_log_integration/images
-
The image files have also been loaded to the following S3 folder: https://s3.amazonaws.com/mud-log-integration-images
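For local files, the width and height can be read from the image header without decoding the full image. This is a sketch using the standard `javax.imageio` API; the class name `ImageInfo` is hypothetical. Note that the ImageIO shipped with Java 8 has no built-in TIFF reader, so an additional ImageIO TIFF plugin on the classpath is assumed for the mud log files.

```java
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

/** Path and pixel dimensions of one image file. */
final class ImageInfo {
    final String path;
    final int width;
    final int height;

    ImageInfo(String path, int width, int height) {
        this.path = path;
        this.width = width;
        this.height = height;
    }

    /** Reads the dimensions of the first image in the file. */
    static ImageInfo read(File file) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) {
                throw new IOException("Cannot open " + file);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader available for " + file);
            }
            ImageReader reader = readers.next();
            try {
                // ignoreMetadata = true: only the header is needed here.
                reader.setInput(in, true, true);
                return new ImageInfo(file.getAbsolutePath(),
                        reader.getWidth(0), reader.getHeight(0));
            } finally {
                reader.dispose();
            }
        }
    }
}
```

Files in S3 would first need to be downloaded or streamed through an S3 client; that part is not shown here.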
-
Processes a set of “.LIC” files which contain depth values associated with Y pixel coordinates. The depth and Y values should be loaded into the IMAGE_OCR_DEPTH table. For this challenge, you should process the LIC files in the following folder: ~/mud_log_integration/lic. Note that there can be more than one LIC file per image. LIC files contain a “Well” field which relates to a Well Id and Well Image Name.
-
The “LIC” files have also been loaded to the following S3 folder: https://s3.amazonaws.com/mud-log-integration-lic
-
The application should be able to process more than one LIC file per image file. A sample of this file layout is provided.
-
Extracts a set of phrases from a directory (either S3 or local directory) which is loaded with image files. Generally, these are .tiff image files.
-
The phrase extraction process should populate all the fields of the IMAGE_OCR_PHRASE table, including PHRASE_TYPE and SCORE as well as the ESTIMATED_DEPTH field. The ESTIMATED_DEPTH values can be calculated by cross-referencing the y values found by the extraction (the average of y1 and y2) with the y values found in the IMAGE_OCR_DEPTH table (which originally came from the “LIC” files).
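One way to do this cross-reference is linear interpolation between the two nearest depth registration points. This is an assumption, since the challenge does not prescribe a method, and the class name `DepthEstimator` is hypothetical. A minimal sketch for the depth points of a single image:

```java
import java.util.Map;
import java.util.OptionalDouble;
import java.util.TreeMap;

/** Estimates depth from a Y pixel coordinate using registered (Y, depth) points. */
final class DepthEstimator {
    private final TreeMap<Integer, Double> depthByY = new TreeMap<>();

    /** Registers one (Y pixel, depth) pair, e.g. a row from IMAGE_OCR_DEPTH. */
    void addPoint(int y, double depth) {
        depthByY.put(y, depth);
    }

    /**
     * Interpolates linearly between the nearest points above and below y.
     * Assumption: outside the registered range, the nearest point's depth is used.
     */
    OptionalDouble estimate(double y) {
        if (depthByY.isEmpty()) {
            return OptionalDouble.empty();
        }
        Map.Entry<Integer, Double> lo = depthByY.floorEntry((int) Math.floor(y));
        Map.Entry<Integer, Double> hi = depthByY.ceilingEntry((int) Math.ceil(y));
        if (lo == null) {
            return OptionalDouble.of(hi.getValue());
        }
        if (hi == null) {
            return OptionalDouble.of(lo.getValue());
        }
        if (lo.getKey().equals(hi.getKey())) {
            return OptionalDouble.of(lo.getValue());
        }
        double t = (y - lo.getKey()) / (double) (hi.getKey() - lo.getKey());
        return OptionalDouble.of(lo.getValue() + t * (hi.getValue() - lo.getValue()));
    }
}
```

For example, with points (100, 1000.0) and (200, 1100.0), a phrase whose y1 and y2 average to 150 would be assigned an estimated depth of 1050.0.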
-
Each additional phrase discovered should update the PHRASE_COUNT and SCORE in the IMAGE_OCR table. The scoring is simple:
PHRASE_TYPE = “SHOW” IMAGE_OCR_PHRASE.SCORE = 3
PHRASE_TYPE = “STAIN” IMAGE_OCR_PHRASE.SCORE = 2
PHRASE_TYPE = “TRACE” IMAGE_OCR_PHRASE.SCORE = 1
PHRASE_TYPE = “NEGATIVE” IMAGE_OCR_PHRASE.SCORE = 0
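The scoring table above maps naturally onto an enum. This is a sketch only; how an extracted phrase such as "odor" or "no cut" is assigned a PHRASE_TYPE is not defined in this section and is not shown.

```java
/** Phrase types and their scores, as listed in the scoring table. */
enum PhraseType {
    SHOW(3),
    STAIN(2),
    TRACE(1),
    NEGATIVE(0);

    private final int score;

    PhraseType(int score) {
        this.score = score;
    }

    /** Value to store in IMAGE_OCR_PHRASE.SCORE for this type. */
    int score() {
        return score;
    }
}
```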
-
The application should populate the IMAGE_OCR.SCORE and IMAGE_OCR.PHRASE_COUNT values by rolling up the scores and counts for each image from the IMAGE_OCR_PHRASE table.
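The roll-up can be done in SQL or in memory before writing to IMAGE_OCR. The following in-memory sketch uses hypothetical class names (`ImageTotals`, `ScoreRollup`) and assumes the file name identifies an image:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Running totals for one image. */
final class ImageTotals {
    int phraseCount;
    int score;
}

/** Accumulates phrase counts and scores per image file. */
final class ScoreRollup {
    private final Map<String, ImageTotals> totals = new LinkedHashMap<>();

    /** Records one extracted phrase and its score against its image. */
    void addPhrase(String fileName, int phraseScore) {
        ImageTotals t = totals.computeIfAbsent(fileName, k -> new ImageTotals());
        t.phraseCount++;
        t.score += phraseScore;
    }

    /** Returns the totals for an image, or null if no phrase was recorded. */
    ImageTotals totalsFor(String fileName) {
        return totals.get(fileName);
    }
}
```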
-
Use Maven as a build tool.
-
Add a report command to the tool which produces a spreadsheet with two tabs.
-
Summary: Three Columns: File Name, Phrase Count, Score. The table lists the phrase count and score totals from the IMAGE_OCR table. It should be sorted in Phrase Count descending order.
-
Phrases: Eight Columns: File Name, Phrase, Phrase Type, X1, Y1, X2, Y2, Estimated Depth. It should be ordered by File Name and then by Y1, so that all the phrases associated with each file appear together, in vertical (Y) order.
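The two sort orders can be expressed as comparators applied before the rows are written to the spreadsheet. This sketch covers only the ordering; the row classes are hypothetical and carry just the fields needed for sorting, and writing the workbook itself is not shown.

```java
import java.util.Comparator;

final class ReportOrdering {
    /** One row of the Summary tab. */
    static final class SummaryRow {
        final String fileName;
        final int phraseCount;
        final int score;

        SummaryRow(String fileName, int phraseCount, int score) {
            this.fileName = fileName;
            this.phraseCount = phraseCount;
            this.score = score;
        }
    }

    /** One row of the Phrases tab (only the sort-relevant fields are shown). */
    static final class PhraseLine {
        final String fileName;
        final String phrase;
        final int y1;

        PhraseLine(String fileName, String phrase, int y1) {
            this.fileName = fileName;
            this.phrase = phrase;
            this.y1 = y1;
        }
    }

    /** Summary tab: highest phrase count first. */
    static final Comparator<SummaryRow> SUMMARY_ORDER =
            Comparator.comparingInt((SummaryRow r) -> r.phraseCount).reversed();

    /** Phrases tab: by file name, then top to bottom by Y1. */
    static final Comparator<PhraseLine> PHRASE_ORDER =
            Comparator.comparing((PhraseLine p) -> p.fileName)
                    .thenComparingInt(p -> p.y1);
}
```

The same ordering could equally be pushed into the SQL queries with ORDER BY clauses.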
-
Note that retraining the algorithm (which is a time consuming process) is not required or expected.
Technologies
Java 8
Maven
MySQL 5.7.x
Final Submission Guidelines
- Please make sure your code is properly configured and ready to run on the VM; we'll use your VM for quick verification.
- Please submit a zip containing your code for the tool and a detailed readme explaining how to deploy, configure, and run your code. Also include details on how to quickly verify your submission on the VM.
- Make sure the code is built, deployed, and run using the Maven build script.