Challenge Overview
Problem Statement
Prize Distribution
We will ask the 5 winners to submit a Docker file for their solution and will provide guidance in that regard if needed. We will pay the winners who submit a Docker file an additional $100 each for their efforts. If you choose not to submit a Docker file, you will need to document precisely the steps needed to recreate your submission.

Requirements to Win a Prize
In order to receive a prize, you must do all of the following:
If you place in the top 5 but fail to do any of the above, you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.

Background
As healthcare systems evolve through mergers, acquisitions, and partnerships, identifying duplicate and erroneous information about entities such as doctors, practices, and clinics becomes a significant problem when data from various sources is combined. As the number and frequency of these mergers increases, there is a growing need to establish a single "source of truth" using data from key business domains. Examples of these domains include healthcare provider, patient, payer, location, item, medication, procedures, information management and reporting, facility, diagnoses, guidelines, and codes. The processes for creating, managing, and governing a single "source of truth" are broadly classified under Master Data Management (MDM). In MDM, it is necessary to reconcile inconsistencies and eliminate redundancies. Duplicate records are one such redundancy, and they tend to occur because of one or more of the following:
Objective
Create an algorithm that:
You may download a data set, which includes ground truth, to develop your algorithm here. You will need to test your code against this data set and submit your results as a .csv file on our site. The .csv file will be used to score your submission.

Data Description
The data is in .csv format with the following columns:
N.B. Multiple provider taxonomies may be given for a single provider.

Ground truth for the training set is provided in a .csv file as a list of duplicate pairs. Each row contains the two comma-separated id numbers of a duplicate pair.

Data Format Requirements
A single .csv file must be created which contains the answers for the given test. The .csv file should not contain any header. Each line of the .csv file corresponds to a single duplicate pair and contains the following values, comma separated: the id of the first record (i), the id of the second record (j), and the probability that these records are duplicates (Mij). The index of the first record (i) must be less than the index of the second record (j), and the probability must be greater than 0 and less than or equal to 1. For example, the row "45,300,0.75" means that there is a 75% chance that records 45 and 300 are duplicates. All pairs which are not specified in your .csv file will be assumed to have a duplicate probability of zero for scoring. Identical pair indices, where i = j, should not be specified, as such records are guaranteed to be equal.

Functions
During the contest, only your results will be submitted. You will submit code which implements only one function, getAnswerURL(). Your function will return a String corresponding to the URL of your answer .csv file. You may upload your .csv file to a cloud hosting service such as Dropbox, which can provide a direct link to the file. To create a direct sharing link in Dropbox, right-click on the uploaded file and select Share. You should be able to copy a link to this specific file which ends with the tag "?dl=0". This URL will point directly to your file if you change this tag to "?dl=1". You can then use this link in your getAnswerURL() function. Your complete code that generates these results will be tested at the end of the contest.

Hint
In addition to supervised learning algorithms that predict duplicates, we suggest trying approaches that compare pairs of records directly and intelligently, so that only a subset of all pairs of records is examined fully for duplicates (see the blocking sketch at the end of this statement).

Scoring
We will use the mean squared error of prediction to evaluate your submissions. Define:

dij = 1 if the ith record is a duplicate of the jth record, or 0 if it is not.
Brier Score = sum of (Mij - dij)^2 over all pairs (i, j) where i < j.
Score = 1,000,000,000 / (Brier Score + 1,000)

(An illustrative local-scoring sketch appears after the Notes on Time Limits section below.)

NOTE: For provisional tests, only a portion of the (i, j) pairs (approximately half) will be considered for scoring. The remainder will be considered for system testing results.

Resources
English Names and Nickname Corpus
Unified Medical Language System (UMLS)

Report
Your report must be at least 2 pages long, contain at least the following sections, and use the section and bullet names below.

Your Information
This section must contain at least the following:
Approach Used
Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code. This section must contain at least the following:
Notes on Time Limits
Your final code will be executed on an Amazon m4.xlarge machine running Linux. The time limit for code execution is 1 hour. Submissions will be limited to once every 2 hours, so plan your submissions near the end of the contest carefully.
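The answer-file format and the scoring rule described above can be checked locally against the training ground truth before submitting. The following Python sketch is only an illustration under assumed file names ("predictions.csv" and "ground_truth.csv" are placeholders, as is the Dropbox URL); it is not part of the official test harness.

    import csv

    def write_answer(pairs, path="predictions.csv"):
        # pairs: iterable of (i, j, m_ij) with i < j and 0 < m_ij <= 1; no header row.
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            for i, j, m in pairs:
                assert i < j and 0.0 < m <= 1.0
                writer.writerow([i, j, m])

    def local_score(pred_path="predictions.csv", truth_path="ground_truth.csv"):
        # Ground truth: one duplicate pair "i,j" per row.
        truth = set()
        with open(truth_path, newline="") as f:
            for a, b in csv.reader(f):
                i, j = sorted((int(a), int(b)))
                truth.add((i, j))
        brier = 0.0
        predicted = set()
        with open(pred_path, newline="") as f:
            for a, b, m in csv.reader(f):
                i, j = int(a), int(b)
                predicted.add((i, j))
                d = 1.0 if (i, j) in truth else 0.0
                brier += (float(m) - d) ** 2
        # Unlisted pairs are treated as probability 0, so every true duplicate
        # pair that was not predicted contributes (0 - 1)^2 = 1 to the Brier score.
        brier += len(truth - predicted)
        return 1_000_000_000 / (brier + 1_000)

    def getAnswerURL():
        # Direct-download link to the uploaded answer file (note the "?dl=1" tag).
        return "https://www.dropbox.com/s/your-file-id/predictions.csv?dl=1"

Note that the provisional score reported during the contest uses only about half of the (i, j) pairs, so a local score computed this way will not match the leaderboard exactly.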
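The Hint section above suggests examining only a subset of all record pairs. One common technique consistent with that hint (though not prescribed by this statement) is blocking: group records by one or more cheap keys and compare only records that fall in the same block. The sketch below is a rough illustration; the field names ("last_name", "zip") are hypothetical, and the real ones come from the column list in the Data Description section.

    from collections import defaultdict
    from itertools import combinations

    def name_zip_key(rec):
        # Block on normalized last name plus 5-digit zip (hypothetical column names).
        if rec.get("last_name") and rec.get("zip"):
            return rec["last_name"].strip().lower(), rec["zip"][:5]
        return None

    def candidate_pairs(records, key_funcs):
        # records: dict mapping record id -> record dict.
        # Returns the set of (i, j) pairs, i < j, sharing at least one blocking key.
        pairs = set()
        for key_func in key_funcs:
            blocks = defaultdict(list)
            for rid, rec in records.items():
                key = key_func(rec)
                if key is not None:
                    blocks[key].append(rid)
            for ids in blocks.values():
                for i, j in combinations(sorted(ids), 2):
                    pairs.add((i, j))
        return pairs

Only the candidate pairs returned here would then be compared in detail (string similarity, taxonomy overlap, a supervised model trained on the ground truth, etc.), and the resulting probabilities written out with a routine like write_answer above.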
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.