TCO - Living Progress - Data to Drops - Multiclass Learning and Classification

Key Information

The challenge is finished.

Challenge Overview

Background

Millions of people around the world rely on water points for their daily existence. Too often, the water points fail and communities are left without the water they desperately need. A lack of basic information on these failures has made keeping water flowing a major challenge for governments and aid agencies. Across the globe, many communities have come to rely on public water access points. People come to these water points and open the tap or work the handpump to fill their containers. As long as the water is flowing, they carry it home and use it for sustenance: drinking, cooking, cleaning, bathing, and more. The water from these points is a critical foundation for success, health, and prosperity.

Unfortunately, these water points systematically fail due to technical breakdowns, water scarcity, vandalism, and misuse. When these water points fail, the very foundation for community wellbeing fails. People have to revert to distant water sources, dirty water, or water sold at exorbitant prices. Better understanding the causes of failure will allow NGOs and governments to avoid these failures, ensuring that water services last over time.

The recently launched Water Point Data Exchange (WPDx) has made significant progress in analyzing these failures and establishing a path forward to lasting services. WPDx consists of a data exchange standard and a central repository of compliant data. The water point data is aggregated from governments, NGOs, academia, and other sources and then standardized for integration into the central repository. This unprecedented library of information is already providing a foundation for improved research and effective policies to help keep water flowing. The major limitation of WPDx is the presence of several open text fields among the standardized attributes. These fields (such as water point status and water point type) allow for much-needed flexibility, but severely curtail analysis. The solution developed in this challenge will provide secondary processing on the WPDx data to convert those open text values into meaningful categories that allow for analysis.

This challenge is part of the HPE Living Progress Challenge Blitz Program (secure top placements on the leaderboard to win additional cash prizes).

Requirements

In a previous challenge, the Topcoder community developed a fascinating array of algorithms to categorize a provided set of water source and water technology values.  Many of the algorithms performed extremely well, categorizing partially unseen data at greater than 90% accuracy.

In this challenge we'd like to do something similar with a different set of test data, which identifies the status and condition of various water sources. We want you to develop classification algorithms for this new set of data that are capable of automated “learning”. We suggest that you submit two Python scripts: the first performs the training (weighting, association analysis, branch analysis, etc.), and the second performs the actual classification. We’ll provide 2500 records from the whole dataset so that you can validate your solution.
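To make the training/classification split concrete, here is a minimal sketch of one possible approach: counting how often each word co-occurs with each category, then scoring new text against those counts. This is only an illustration, not the method the challenge prescribes. The function names `tokenize`, `train`, and `classify` are hypothetical, and the sketch picks a single best category even though a record may carry several.

```python
import collections
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(records):
    """Build token/category co-occurrence counts.

    records: iterable of (status_text, list_of_categories) pairs.
    Returns a mapping {token: {category: count}}.
    """
    weights = collections.defaultdict(lambda: collections.defaultdict(int))
    for text, categories in records:
        for token in set(tokenize(text)):
            for category in categories:
                weights[token][category] += 1
    return weights


def classify(text, weights):
    """Return the best-scoring category, or None if no token is known."""
    scores = collections.defaultdict(int)
    for token in set(tokenize(text)):
        for category, count in weights.get(token, {}).items():
            scores[category] += count
    if not scores:
        return None
    # Highest score wins; ties are broken alphabetically for determinism.
    return min(scores, key=lambda c: (-scores[c], c))
```

In a real submission, the counts produced by `train` would be written to disk by the training script and loaded by the classification script, so that no manual step sits between the two.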

The training data can be accessed here: https://drive.google.com/file/d/0B9bY2DOzMq65QTJEbG4tQk91ckE/view?usp=sharing

We’ll test the solutions with the data above plus ~2500 records that are not provided to you in advance. The solutions will be evaluated for accuracy: fifty percent of the score for each submission will be based on the accuracy of your categorization. The accuracy metric is fairly simple:

Accuracy = (# of correct responses) / (# of total responses)
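The metric can be sketched in a few lines of Python. The challenge text does not say how a response with several categories is judged, so this sketch assumes a response counts as correct only when the predicted "Status Category" string matches the expected one exactly. The function name `accuracy` is hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of responses that exactly match the expected value.

    Assumption: a response is correct only on an exact string match.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    # float() keeps the division exact on Python 2.7 as well.
    return float(correct) / len(expected)
```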

What we’re doing in this challenge is mapping the values in the #status field to a list of status categories. The list of status categories should be saved as a comma-separated list in the "Status Category" column, with leading and trailing spaces trimmed from each category. It won’t be possible to categorize every field; there are null values even in the training data.
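As a rough sketch of the formatting rule for the "Status Category" column, the helper below trims each category and joins the results with commas. Dropping empty entries and duplicates is an assumption made here, not something the challenge states, and the name `format_categories` is hypothetical.

```python
def format_categories(categories):
    """Join categories into one comma-separated cell value.

    Each category is stripped of leading and trailing spaces.
    Assumption: empty entries and duplicates are dropped.
    """
    cleaned = []
    for category in categories:
        category = category.strip()
        if category and category not in cleaned:
            cleaned.append(category)
    return ",".join(cleaned)
```

A status that cannot be categorized yields an empty string, which leaves the cell blank.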

Here is the list of categories you’ll be mapping from the #status field:

abandon
const_des
decommissioned
distance
electricity
flooded
Functional
funds_maintain
lowdem_alt
mechanic
missingparts
mng_other
overcrowding
quality
quantity_other
rationing
repair_fail
siting
technical_other
underconst
vague
vandalism
water_resource

Additional Requirements

- You should use Python 2.7 to complete this application.
- Please name your training Python script training.py.
- Please name your classification Python script classification.py.
- The classification.py script should take two command line parameters. The 1st parameter should be the file path of the input file. The 2nd is the file path of the output file.
- The output file format should have three columns: Row ID, #status and “Status Category”. Status Category contains the categories you are assigning.
- The training and test files are in csv format and will be English only.  Your app should be able to read and write this format.
- Your training method should not require manual intervention (e.g. coding if/else statements in the classification.py file) beyond the execution of the training.py script itself.
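The command line and file-format requirements above could be met by a skeleton along these lines. It is a sketch under stated assumptions: the input file is assumed to have a header row with "Row ID" and "#status" columns, and `classify_status` is a hypothetical placeholder for whatever model training.py produces. The `open_csv` helper exists only because the `csv` module expects different file modes on Python 2.7 and Python 3.

```python
import csv
import sys


def open_csv(path, mode):
    """Open a file the way the csv module expects on Python 2.7 or 3."""
    if sys.version_info[0] < 3:
        return open(path, mode + "b")
    return open(path, mode, newline="")


def classify_status(status):
    """Hypothetical placeholder: return a list of categories for a status.

    A real script would load the model written by training.py instead.
    """
    return []


def run(input_path, output_path, classify=classify_status):
    """Read the input CSV and write Row ID, #status, Status Category."""
    with open_csv(input_path, "r") as src, open_csv(output_path, "w") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(["Row ID", "#status", "Status Category"])
        for row in reader:
            status = row.get("#status") or ""  # null values stay blank
            categories = [c.strip() for c in classify(status)]
            writer.writerow([row.get("Row ID", ""), status, ",".join(categories)])


def main(argv):
    """Entry point: argv[1] is the input path, argv[2] the output path."""
    if len(argv) != 3:
        sys.stderr.write("usage: classification.py INPUT_CSV OUTPUT_CSV\n")
        return 2
    run(argv[1], argv[2])
    return 0

# In classification.py itself, the last line would be:
#     sys.exit(main(sys.argv))
```

It would then be invoked as `python classification.py input.csv output.csv`, where both file names are placeholders.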


Final Submission Guidelines

Submission Deliverables

1. Please submit all code required by the application in your submission.zip.
2. Document the build process for your code, including all dependencies (pip installs, etc.). Please provide this in Markdown format.
3. Provide instructions on how to execute your application.

ELIGIBLE EVENTS:

2016 TopCoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30054206