The challenge is finished.

Challenge Overview

In previous challenges, we developed an application that predicts a particular attribute (LNAM) from a set of attributes present in a log file.  The log files are text files in LAS format, an industry standard in oil and gas exploration.  Our process predicts well on smaller sets of log files, but we've discovered that as the number of log files and curve attributes increases, the number of potential features also increases dramatically and our predictive ability begins to suffer.  It's a classic case of the curse of dimensionality.
 
We'd like to take a step back from our current prediction model and see if we can create a set of curve attribute grouping rules that we can apply to the LNAM assignments.  This rule generation process will still require some training, but it takes its inspiration from the methodology that the human experts use.
 
Here's the process that we need to introduce into our app: 
1.  In our training set we need to group our LAS files by LNAM.
2.  We should then accumulate a list of all possible curve attributes for each LNAM group.
3.  The application should identify the list of common curve attributes across each LNAM class.  The goal is to identify not only the curve attributes that LNAM classes share but also the curve attributes that make one LNAM class distinct from another.
4.  Curve attributes are also related to vendor.  Each logging operator has a set of tools that they operate.  Some of the logging tools have common names, but some are distinct by vendor.  We have some documents that map curve attributes (tools) to vendors.  The logging operators are designated in the LAS files by the SRVC element.  Vendor documents are attached in a zip file in the Code Document section of the forums.  The “mnemonic” and “Company” columns are the ones that are relevant to this challenge.
5.  Assign the LNAM based on the curves and, if necessary, the logging operator (SRVC).
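As a rough illustration of Steps 1-3 and 5, here is a minimal sketch using only the standard library. It assumes each LAS file has already been reduced to an (LNAM, curve mnemonics) pair; the function names, the sample LNAM values, and the overlap-based scoring are all hypothetical and are not part of the existing app. "Common" is read here as curves present in every file of a class, and "distinct" as curves seen in no other class, which is only one possible interpretation of Step 3. Vendor (SRVC) information from Step 4 is not used.

```python
from collections import defaultdict


def group_curves_by_lnam(records):
    """Steps 1-2: group per-file curve sets by LNAM.

    records is an iterable of (lnam, curves) pairs, one pair per LAS file.
    """
    groups = defaultdict(list)
    for lnam, curves in records:
        groups[lnam].append(set(curves))
    return dict(groups)


def summarize(groups):
    """Step 3: per LNAM, find all, common, and distinct curves."""
    all_curves = {lnam: set().union(*sets) for lnam, sets in groups.items()}
    summary = {}
    for lnam, sets in groups.items():
        # Curves seen under any other LNAM class.
        others = set()
        for other, curves in all_curves.items():
            if other != lnam:
                others |= curves
        summary[lnam] = {
            "all": all_curves[lnam],
            "common": set.intersection(*sets),   # in every file of the class
            "distinct": all_curves[lnam] - others,  # in no other class
        }
    return summary


def predict_lnam(curves, summary):
    """Step 5 (assumed rule): pick the LNAM with the highest Jaccard overlap."""
    curves = set(curves)

    def score(lnam):
        known = summary[lnam]["all"]
        union = curves | known
        return len(curves & known) / len(union) if union else 0.0

    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(summary), key=score)


# Invented sample data, for illustration only.
records = [
    ("GAMMA RAY", ["DEPT", "GR", "CCL"]),
    ("GAMMA RAY", ["DEPT", "GR"]),
    ("CEMENT BOND", ["DEPT", "CBL", "CCL", "GR"]),
]
summary = summarize(group_curves_by_lnam(records))
print(predict_lnam(["DEPT", "CBL", "GR"], summary))
```

A set-overlap score is only a starting point; the SRVC value could be used to break ties or to restrict the candidate LNAM classes to those seen for that vendor.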
 
Our current app has several features that are useful:
 
1.  It parses and loads a set of LAS file attributes into a pandas dataframe.  
2.  It cleans and normalizes the LAS files in preparation for transformation to pandas. The app generates a cleaned copy of the data into a new folder.
3.  It produces a testing.csv containing the LNAM attribute extracted from each file in a directory of LAS files.
4.  It can compare the output of the prediction process (e.g., prediction.csv) to the testing.csv (the ground truth) and generate an accuracy score.  The process can also generate more detailed classification reports by LNAM.
5.  It one-hot encodes the curve attributes in the dataframe.
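For reference, the comparison described in item 4 can be sketched as follows. This is not the app's actual implementation: it assumes that both csv files carry "File Name" and "LNAM" columns (the prediction output lists those columns, but the layout of testing.csv is an assumption), and the function names are hypothetical.

```python
import csv
import io


def load_labels(handle, key="File Name", label="LNAM"):
    """Read a csv file object into a {file name: LNAM} dictionary."""
    return {row[key]: row[label] for row in csv.DictReader(handle)}


def accuracy(truth, predicted):
    """Fraction of ground-truth files whose predicted LNAM matches.

    Files missing from the predictions count as wrong.
    """
    if not truth:
        return 0.0
    hits = sum(1 for name, lnam in truth.items() if predicted.get(name) == lnam)
    return hits / len(truth)


# Invented sample data standing in for testing.csv and prediction.csv.
testing_csv = io.StringIO("File Name,LNAM\na.las,GAMMA RAY\nb.las,CEMENT BOND\n")
prediction_csv = io.StringIO("File Name,LNAM\na.las,GAMMA RAY\nb.las,GAMMA RAY\n")

truth = load_labels(testing_csv)
predicted = load_labels(prediction_csv)
print(accuracy(truth, predicted))
```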
  
Please revise the current application with some new command line arguments which align with the process outlined above: 
 
1.  aggregate-curves (Steps #1-2 above).  I think it would be helpful to write this information to a "curves" directory with each curve's parameters listed in its own file.  Ultimately, this app will be processing large numbers of files.
2.  process-curves (Step #3 above).  This step should generate the curve parameters in memory but also produce an xlsx report of the n most common LNAMs.  The value of n (the number of LNAMs) should be configurable.  The report should also list the absolute number and relative frequency of the LNAMs in the data set.
3.  curve-predict (Steps #4 and #5 above).  We produce a set of metadata as the final output of our prediction process.  The app should use the curve and vendor info to predict the LNAM and generate the requested prediction output in csv form.  The current app produces output with the following columns:  File Name,UWI,LNAM,Service Company,Log Type,Cased Holed Flag,Generic Toolstring.  The logic to generate the Log Type, Cased Holed Flag, and Generic Toolstring fields is already present in the app.
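One possible way to wire up the three commands is with argparse sub-commands, sketched below. The command names come from the list above; every option name (--las-dir, --curves-dir, --top-n, --report, --vendor-dir, --output) and every default is an assumption made for illustration, and the existing app may prefer plain flags instead. The lnam_frequency helper shows the absolute and relative frequencies that the process-curves report asks for.

```python
import argparse
from collections import Counter


def build_parser():
    """Build a parser with one sub-command per new processing step."""
    parser = argparse.ArgumentParser(description="LNAM curve-rule tooling (sketch)")
    sub = parser.add_subparsers(dest="command")
    sub.required = True  # set as an attribute so this also works on Python 3.6

    agg = sub.add_parser("aggregate-curves", help="group LAS files by LNAM")
    agg.add_argument("--las-dir", required=True)     # hypothetical option
    agg.add_argument("--curves-dir", default="curves")

    proc = sub.add_parser("process-curves", help="build rules and xlsx report")
    proc.add_argument("--curves-dir", default="curves")
    proc.add_argument("--top-n", type=int, default=10)  # the configurable n
    proc.add_argument("--report", default="lnam_report.xlsx")

    pred = sub.add_parser("curve-predict", help="predict LNAM and write csv")
    pred.add_argument("--las-dir", required=True)
    pred.add_argument("--vendor-dir")                # vendor mapping documents
    pred.add_argument("--output", default="prediction.csv")
    return parser


def lnam_frequency(lnams, top_n):
    """Return (LNAM, absolute count, relative frequency) for the top_n LNAMs.

    Relative frequency is measured against the whole data set, not the top n.
    """
    counts = Counter(lnams)
    total = sum(counts.values())
    return [(lnam, n, n / total) for lnam, n in counts.most_common(top_n)]


args = build_parser().parse_args(["process-curves", "--top-n", "2"])
# Invented LNAM values, for illustration only.
rows = lnam_frequency(["A", "A", "A", "B", "B", "C"], args.top_n)
for lnam, count, share in rows:
    print(lnam, count, round(share, 3))
```

The rows could then be loaded into a pandas DataFrame and written with DataFrame.to_excel, which needs an Excel writer engine such as openpyxl; if that package is not already installed, it would have to be added to requirements.txt.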
 
A couple of additional issues to resolve/consider: 
1.  The current app seems to undercount at least one (and maybe more) of the curve attributes.  For example, the curve attribute “CCL” is underrepresented in a recent review of the value_counts() output from the dataframe compared with a grep count in the target directory containing the LAS files.  Please validate that all the curve parameters/attributes are being recorded in our initial dataframe and output reports and are resolving properly in our prediction logic.  I think there could be an issue with the app’s one-hot encoding logic.
2.  We've tried some things like truncating the curve attribute names to reduce the number of inputs.  So far it hasn't helped predictive ability much using the initial Random Forest, but you're welcome to experiment with various simplification schemes.
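When checking the first issue, it may help to have an independent count to compare against the dataframe. The sketch below reads curve mnemonics straight from the ~Curve section of LAS text and keeps two tallies: the number of files containing a mnemonic and the total number of occurrences. This is a hypothetical helper, not code from the app, and it makes no claim about the actual cause of the discrepancy. It is only meant to separate two effects worth ruling out: a one-hot column records presence once per file, while a grep count tallies every matching line, including repeated mnemonics within a file and matches outside the curve section.

```python
from collections import Counter


def curve_mnemonics(las_text):
    """Return the mnemonics listed in the ~Curve section of one LAS file."""
    mnemonics = []
    in_curve_section = False
    for line in las_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("~"):
            # Section headers start with "~"; the curve section is "~C...".
            in_curve_section = stripped[1:2].upper() == "C"
            continue
        if not in_curve_section or not stripped or stripped.startswith("#"):
            continue
        # The mnemonic is everything before the first period on the line.
        mnemonics.append(stripped.split(".", 1)[0].strip().upper())
    return mnemonics


def count_curves(las_texts):
    """Tally files containing each mnemonic and total occurrences."""
    files_with = Counter()
    occurrences = Counter()
    for text in las_texts:
        names = curve_mnemonics(text)
        occurrences.update(names)
        files_with.update(set(names))  # at most one count per file
    return files_with, occurrences


# Invented minimal LAS content with a repeated CCL curve.
SAMPLE_LAS = """~Version Information
 VERS.   2.0 : CWLS LOG ASCII STANDARD
~Curve Information
 DEPT.FT   : Depth
 CCL .MV   : Casing collar locator
 CCL .MV   : Casing collar locator, second pass
 GR  .GAPI : Gamma ray
~A  DEPT  CCL  CCL  GR
 100.0  0.1  0.2  55.0
"""

files_with, occurrences = count_curves([SAMPLE_LAS, SAMPLE_LAS])
print(files_with["CCL"], occurrences["CCL"])
```

If the per-file tally matches the dataframe but the occurrence tally matches grep, the difference comes from repeated mnemonics rather than from lost files; if neither matches, the parsing or encoding step is the more likely place to look.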
 
Evaluation Criteria
In addition to good coding style and accomplishing the tasks set out above, 30% of the score will be a ranked assessment of your LNAM predictions against our ground truth data.  The submission with the highest score in this scorecard element will receive a 10, the next a 9, and so on.
 
You are welcome/encouraged to use your creativity to increase the predictive ability of your solution.  Please let us know in your solution documents/README files which optimizations you made.

Final Submission Guidelines

Please use the code provided in the Code Document forums as a basis for the solution.  The codebase is Python 3.6.

Provide some documentation for the changes you made to make the algorithm more predictive.

Include any new dependencies in the requirements.txt file.

 

ELIGIBLE EVENTS:

Topcoder Open 2019

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30087734