Challenge Overview
-
Our client is an insurance company that currently provides its potential customers with forms, which the customers fill out to describe the insurance options they want to include in their policy. The form currently lists many redundant and unstructured options, along with many free-text input fields. This makes it difficult for customers to find and select the options they need. The insurance company also loses the opportunity to point the customer to a standard or recommended list of options that most people might find useful.
-
In the near future, the client is looking to build a simplified insurance options (benefits) configurator, where the customer can go ahead and select options according to their needs in a hierarchical manner. Example of the intended hierarchy (example only for general idea, not exactly based on the provided dataset):
-
Q1 - Coverage for transplants?
-
Q1.1 - Kidney Transplant coverage?
-
Yes - 100% coverage
-
Copay - 20%
-
Copay - 10%
-
No
-
-
Q1.2 - Hip Transplant coverage?
-
Yes - 100% coverage
-
No
This hierarchy was created for the client in the last challenge of the series. It was built by analyzing the anonymized historical configuration data, which contains the choices made by customers on the unstructured options discussed above. This raw dataset will also be used in this challenge.
In this challenge, we will add new columns to the raw dataset. Some of these columns serve the new goals listed below, and the others compile the results of the last challenges into the raw dataset.
Results of the previous challenges
-
The top two submissions of the last challenge can be found in the forum. Participation in the last challenge is not required for this challenge, and for reference the specification of the last challenge can be found here.
-
In the last challenge, a hierarchical tree of topics/labels was created where each node in the tree described a label. The ordering is of the following format for the winning submission:
-
T_1 copay
-
T_1_1 visit
-
T_1_1_1 office visit copay
-
-
In the last challenge, the submitters also created a mapping from each row in the provided raw Excel file to the leaf of the hierarchy tree that it is best associated with. So the last challenge produced two outputs: the hierarchy tree and the row-to-leaf mapping.
-
In addition to the last challenge, a few more challenges were conducted in this series, where the goal was to reduce the variation in the ‘answer’ column values in the raw dataset. The results of the previous challenge submissions are also included in the forum.
In the last challenge, the participants had the option to use the previous submissions as their starting base, or start from scratch. The winner of the last submission chose to start from scratch, but possibly took some inspiration from the submissions of the previous challenges on how to achieve variability reduction.
The Three Goals of this challenge
-
Goal 1 - Generic Answer Column
-
The primary goal of this challenge is to use the results of the previous challenges, particularly the last challenge, to find the generic version of the ‘answer’ value of each row in the raw dataset, particularly those rows with input type = text. The raw dataset can be found in the forum.
-
The main problem the client is facing is that when people fill out the form, they sometimes enter text manually, or in a manner that is not consistent with how other users enter similar or identical information. As a result, there are many rows where the freely entered text values actually mean the same thing but are written a little differently: they may have different numeric values, some additional words, different stop words, different punctuation, etc.
-
To solve the above problem, the client wants to first clean each answer by removing all unnecessary words, spaces, numbers, punctuation, etc., so that the ‘answer’ value is represented in its most generic form.
-
The generic version of the answer should be such that if two answers mean the same thing, they ideally have the same generic answer. It should be possible to group the rows on the basis of the unique generic answer assigned to each row by your algorithm. This generic answer value should be added to a new column called ‘generic_answer’.
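As a rough illustration of the cleaning and grouping idea described above, here is a minimal sketch using only the Python standard library. The regular expressions, the stop-word list, and the function names `generic_answer` and `group_by_generic` are assumptions made for this example; they are not part of the dataset or of any previous submission, and a real solution would likely need a richer normalization.

```python
import re
from collections import defaultdict

# Assumption: numbers (optionally with a currency sign or percent sign) carry
# no meaning for the generic form and can be dropped.
NUM_RE = re.compile(r"\$?\d+(?:[.,]\d+)*%?")
# Anything that is not a lowercase letter or whitespace is treated as punctuation.
PUNCT_RE = re.compile(r"[^a-z\s]")
# Assumption: a tiny illustrative stop-word list, not a recommended one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "for", "per", "be"}


def generic_answer(answer: str) -> str:
    """Lowercase the answer and strip numbers, punctuation and stop words."""
    text = answer.lower()
    text = NUM_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def group_by_generic(answers):
    """Group raw answers by their generic form."""
    groups = defaultdict(list)
    for answer in answers:
        groups[generic_answer(answer)].append(answer)
    return dict(groups)
```

With this sketch, "The copay is $20 per visit." and "Copay: $35.50 per visit" both reduce to the same generic answer, so they fall into one group.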
-
The requirement of finding the generic answer is very similar (almost identical) to the requirements of the first 3 challenges of this series, but the participants are free to use a new approach or improve upon the existing results of Challenge 3 as they see fit. The challenge specifications of those 3 challenges can be found here: Challenge 1 and Challenge 2 (these were hosted simultaneously) and Challenge 3. Note that only the results of Challenge 3 have been provided; they combine and improve the results of the first 2 challenges.
To get an idea on how to get these generic answers for each row, the participants are strongly encouraged to refer to the results of challenge 3 (combination of the first 2 + some additional improvements), which can be found in the forum.
Furthermore, the winning submission of the previous challenge also carried out considerable steps to reduce variability during its execution, so where possible those steps can be extended directly to obtain the generic answer.
-
Important - In addition to taking inspiration from the previous submission, the hierarchical categorization/tree created in the last challenge can also be used to find this generic answer. One idea would be to first group the rows using the hierarchy created in the last challenge, and then look for common patterns in the text within the group to find the generic answer. The participants are free to use the results of the last challenge in any way they find suitable.
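The "group first, then look for common patterns" idea could be sketched as follows. This is only one possible reading: it assumes each row is available as a `(leaf_id, answer)` pair taken from the last challenge's mapping, and it uses the tokens shared by every answer in a group as the pattern. The function names are hypothetical.

```python
from collections import defaultdict


def common_pattern(answers):
    """Return the tokens present in every answer, in first-answer order."""
    token_lists = [a.lower().split() for a in answers]
    if not token_lists:
        return ""
    shared = set(token_lists[0])
    for tokens in token_lists[1:]:
        shared &= set(tokens)
    seen = set()
    ordered = []
    for token in token_lists[0]:
        if token in shared and token not in seen:
            seen.add(token)
            ordered.append(token)
    return " ".join(ordered)


def generic_by_leaf(rows):
    """rows: iterable of (leaf_id, answer). Returns {leaf_id: shared pattern}."""
    groups = defaultdict(list)
    for leaf_id, answer in rows:
        groups[leaf_id].append(answer)
    return {leaf_id: common_pattern(answers) for leaf_id, answers in groups.items()}
```

A strict token intersection is brittle when a group contains even one unusual answer, so a frequency threshold (keep tokens found in most answers) may work better in practice.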
-
Goal 2 - Intent Column
-
Once the generic answer is found, it should be simplified further into its most basic concept, called the intent from here onward. For example, the answer value “lifetime maximum for hospice respite is limited to 15 days inpatient/15 days outpatient” can have a simple intent like ‘hospice life maximum is limited’, or maybe even ‘hospice life maximum limited’.
-
The key difference between the intent and the generic_answer value of a row is that there can be multiple unique generic answers under one intent, but not vice versa. Also, the generic_answer is the most simplified version of the answer text, whereas the intent does not have to be a simplified version of the answer. It can be worded differently from the answer. The overall goal of the intent column is that a marketing professional who reads the value should understand the 'intention' behind the option selected or written by a customer.
-
In some cases, the intent value and the generic answer may be identical or very similar, if your algorithm finds that they are best worded in the same or nearly the same way.
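The one-intent-per-generic-answer relationship described above can be checked mechanically on the output. The following is a small sketch; the function name `intent_violations` and the `(generic_answer, intent)` pair format are assumptions for illustration only.

```python
from collections import defaultdict


def intent_violations(rows):
    """rows: iterable of (generic_answer, intent) pairs.

    Returns the generic answers that were assigned more than one intent,
    which would break the "many generic answers per intent, not vice versa" rule.
    """
    intents = defaultdict(set)
    for generic, intent in rows:
        intents[generic].add(intent)
    return {g: sorted(v) for g, v in intents.items() if len(v) > 1}
```

An empty result means every generic answer maps to exactly one intent, while several generic answers may still share that intent.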
-
Goal 3 - Compile the hierarchy tree into columns
-
This is a relatively simple goal. Just parse the already available hierarchical tree, which was created in the last challenge, and using the mapping generated in the last challenge, add the labels of each node of an answer in a separate column, such as - Root category or Level 0, Level 1, Level 2, Level 3, and so on. For example, in the example given above, the values can be:
Root Category column: Copay
Level 1 column: Visit
Level 2 column: Office Visit Copay
Level 3 column: N/A
Level 4 column: N/A
… Level N column: N/A
Here N is the maximum depth of the hierarchical tree generated in the last challenge. Please check the tree and assign N accordingly.
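A minimal sketch of this parsing step is shown below. It assumes the tree is available as text lines of the form `T_1_1_1 office visit copay` (node id, a space, then the label), as in the example from the winning submission, and that a node's ancestors can be read off its id prefixes. The actual file format of the last challenge's output may differ, and the function names are hypothetical. Column names follow the ones listed in the submission guidelines (root_level, level_1, ...).

```python
def parse_tree(lines):
    """Map node ids such as 'T_1_1' to their labels."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        node_id, _, label = line.partition(" ")
        labels[node_id] = label.strip()
    return labels


def tree_depth(labels):
    """Deepest level index N, where a root such as 'T_1' is level 0."""
    return max(node_id.count("_") for node_id in labels) - 1


def level_columns(leaf_id, labels, max_level):
    """Return the root_level, level_1, ..., level_N values for one leaf."""
    parts = leaf_id.split("_")  # e.g. ["T", "1", "1", "1"]
    # Ancestor ids from the root down to the leaf: T_1, T_1_1, T_1_1_1.
    path = ["_".join(parts[:i]) for i in range(2, len(parts) + 1)]
    values = [labels.get(node_id, "N/A") for node_id in path]
    # Pad the levels below the leaf with N/A.
    values += ["N/A"] * (max_level + 1 - len(values))
    names = ["root_level"] + [f"level_{i}" for i in range(1, max_level + 1)]
    return dict(zip(names, values))
```

Labels are kept exactly as they appear in the tree; capitalizing them as in the example above is a formatting choice left to the participant.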
Data and additional code access
The Data is available in the forums. In addition, the winning submissions of previous challenges attempting to solve the problem have also been included in the forums.
Important - The participants are free to start from scratch or use the provided submissions and extend/modify them to achieve the objectives of this challenge. However, the participants are required to use the output hierarchical categorization of the last challenge. Participants who choose to implement the generic_answer and intent generation algorithm from scratch should have their code internally invoke the winning algorithm of the last challenge to generate the hierarchy and the mapping. It is optional to use this hierarchical information in finding the generic_answer and intent values, but it is compulsory to parse the hierarchical tree in order to populate the hierarchy-related columns.
Expected technologies - The participants are free to use any technique they like to achieve the results as long as everything is implemented in Python. It can be a string parsing based implementation, or it might include advanced techniques from the field of NLP, Deep Learning or Machine Learning (ML). If the ML route is chosen, the participants are free to train models themselves, or use ready-made machine learning/deep learning models available online, as long as they are available to be used in commercial software free of charge.
Data Description -
The available dataset is not very large (~70MB). You will have access to the entire data set. Check the sample and metadata file available in the forums for a complete definition of all data fields.
Here is a definition of some terms that will help with understanding the provided data:
-
Benefit is basically an insurance option, which describes the coverage of a particular health care service by the insurance provider.
-
Coverage code is a unique identifier for a set of benefits/options provided as a group of services with actual start/end dates for the coverage
-
Product consists of a set of coverage codes (and hence benefits) and is used internally to align coverage codes to internal rules and procedures.
Benefits data is the core data set for this challenge. Here is how this data is generated. Benefits configuration is arranged in a question/answer format on a website. The benefit has a hard-coded question, and then several types of available answers to round out the question. For example, a question might be “Is this a High Deductible Health Plan?” and the user might have a choice of 2 check boxes, radio buttons, or a drop down with Yes/No toggles. The answer then becomes the statement combination of the Q&A, leaving “No, this is not a high deductible health plan.” As another example, the question might be “The out of pocket maximum is:” and the user enters “$3000”, leaving the answer to be “The OPM is $3000.” Or finally the question might be “Enter additional comments here”, in which case the user might enter free-form text and that free-form text becomes the answer. These sets of answers are rolled up to form all of the benefits for a specific coverage code.
The actual data set contains flat data records that:
-
List the insurance options/benefits (the 'answer' column) and question identifiers (sequence_id)
-
Connect benefits to coverage codes
-
Connect coverage codes to internal products
-
Start/End date for the coverage
And also these useful columns:
-
Type_of_tag - information about the type of field presented in the software - radio button, checkbox, text input, dropdown
-
Value flag - this is only populated for records where type of tag is text - It denotes whether the user entry field is a text field (meaning all open free form text allowed) or it is a numeric field, meaning the answer may contain some text that is automatically generated by the software, but the user can only enter a numeric value.
-
Top50_flag - This just denotes that the coverage code is for a very important client
Note - All the columns can be considered in your analysis to find the generic_answer and the intent. Furthermore, as mentioned above, the hierarchy tree generated in the previous challenge can also be used to assist in achieving the goals.
It should be noted that ‘Type_of_tag = N/A’ refers to simple inputs like checkboxes and option buttons, which are not that difficult to deal with. The primary difficulty resides in the rows with ‘Type_of_tag = text input’, where the answer column is usually entered by a human and is NOT generated by software as in the case of checkboxes or option buttons. So the primary focus should be devoted to those rows.
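To focus on the free-text rows, the dataset can first be split by tag type. The sketch below assumes the data is read as CSV with a column named `type_of_tag` and that text rows are marked `text` or `text input`; the exact column name and values should be confirmed against the metadata file in the forums.

```python
import csv


def read_rows(handle):
    """Read CSV rows as dictionaries keyed by column name."""
    return list(csv.DictReader(handle))


def split_by_tag(rows, tag_column="type_of_tag", text_values=("text", "text input")):
    """Separate human-entered text rows from software-generated ones.

    tag_column and text_values are assumptions; adjust them to the real data.
    """
    text_rows, other_rows = [], []
    for row in rows:
        tag = (row.get(tag_column) or "").strip().lower()
        if tag in text_values:
            text_rows.append(row)
        else:
            other_rows.append(row)
    return text_rows, other_rows
```

The text rows can then go through the heavier generic_answer and intent logic, while the remaining rows may need only light normalization.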
Final Submission Guidelines
What to Submit
-
Command line based code written to achieve the 3 goals mentioned above. Python 3.7 should be used to implement the submission.
-
A PDF/Word/Markup format based report detailing the techniques and algorithms used to achieve the goals.
-
The updated CSV file, with these new columns: generic_answer, intent and hierarchical columns such as root_level, level_1, level_2, level_3, and so on.
-
A README.md file detailing the deployment instructions.
Technology Stack
-
Python