Challenge Overview
Project Overview
Governments across the world are increasingly applying open government practices such as crowdsourcing to develop stronger policies and to engage citizens, providing access to civic influence beyond the election cycle. When hundreds of ideas from citizens flow in, crowdsourcers face a problem: a lack of efficient analysis and synthesis tools.
Civic CrowdAnalytics, a group at Stanford University, is developing solutions to address this problem and is taking steps toward more participatory, inclusive, and transparent democratic societies, making sure that all citizens have an equal opportunity to get their voices heard.
This challenge is part of the HPE Living Progress Challenge Blitz Program (Secure top placements in the leaderboard to grab additional cash prizes).
Competition Task Overview
In this challenge, you will analyze and categorize a data set of crowdsourced ideas for civic improvements. The ideas relate to transportation issues in Palo Alto, CA.
The training data set (317 examples) can be found here: https://docs.google.com/spreadsheets/d/1tyZu4gNumrQT0xWg0iytf6CIPBw48LbdpwwrxuVUh48/edit?usp=sharing
The complete category hierarchy can be found in this separate document: https://docs.google.com/spreadsheets/d/1BNXL38LxjRH5QEzQUhS58kl9fDTWUPdeWQ4ncerPnpw/edit?usp=sharing
We’re asking you to use Haven OnDemand as the baseline classification technology. Haven OnDemand has a free developer tier, and you can sign up for a developer account on the Haven OnDemand site. Haven OnDemand has a number of APIs that might be relevant, and you are encouraged to create a categorization index on the system. Several APIs may prove useful for categorizing the data: Classify Document, Find Similar, and Query Text Index, among others. Your code to access Haven OnDemand should be written in Python. For this particular application, you may not use other external Python machine learning APIs or libraries.
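To make the expected integration concrete, here is a minimal sketch of calling the Find Similar API from Python with the requests library. The endpoint URL, parameter names, and response shape are assumptions based on the Haven OnDemand documentation, and the index name training_index is hypothetical; verify both against your own account before relying on them.

from __future__ import print_function
import requests

API_KEY = "your-havenondemand-api-key"  # from your developer account
FIND_SIMILAR_URL = "https://api.havenondemand.com/1/api/sync/findsimilar/v1"

def find_similar(text, index="training_index"):
    # Ask Haven OnDemand for indexed training documents similar to `text`.
    # "indexes" restricts the search to your own text index; "print": "all"
    # requests all stored fields (e.g., category tags) in the results.
    response = requests.post(FIND_SIMILAR_URL, data={
        "apikey": API_KEY,
        "text": text,
        "indexes": index,
        "print": "all",
    })
    response.raise_for_status()
    return response.json().get("documents", [])

if __name__ == "__main__":
    for doc in find_similar("Add protected bike lanes on El Camino Real")[:3]:
        print(doc.get("reference"), doc.get("weight"))

The weights of the returned neighbors could then be aggregated by category to pick your first and second guesses for each tag.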
Fifty percent of the scoring for your submission will be based on the effectiveness of your classification responses. We’re going to compare the accuracy of the submissions against each other.
The scoring function will work as follows:
1 point will be given for each correct main category and subcategory 1 association
.5 points will be given for each correct subcategory 2 association
.25 points will be given for each correct subcategory 3 or 4 association
We’re also going to allow you to make a second guess for each category tag. The second guess will be ignored if the first guess is correct. The scoring for the second guesses will be as follows:
.5 points will be given for each correct secondary main category and subcategory 1 association
.25 points will be given for each correct secondary subcategory 2 association
.125 points will be given for each correct secondary subcategory 3 or 4 association
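To illustrate the arithmetic: if, for a given record, your first guesses match the main category, subcategory 1, and subcategory 2 but miss subcategory 3, and your second guess for subcategory 3 is correct (and the record has no subcategory 4), that record scores 1 + 1 + .5 + .125 = 2.625 points.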
After we determine your score based on your responses, we’ll create decile ranges from the set of scores across all competitors. Submissions with the highest scores in the range will receive a “10” for this performance metric, and those that are least accurate will receive a “1”. You will be provided with a dataset of 317 records for training and testing your solution. We’ll validate your solution and generate scores with data that we don’t provide in advance.
To facilitate the testing of your solution, we’re going to ask you to set up a simple Python web service and publish it on the web so it is available to our test harness. Your service, in turn, will need to call Haven OnDemand. You will still need to submit your code to Topcoder through the normal submission process, but we’ll be doing the initial categorization validation online. Platforms like Heroku allow you to host your REST service for free.
Your service should expose the following endpoint and accept POST requests with a JSON body in this format:
http://your.ip.address/api/v1/categorize
{
    "document" : [
        {
            "id" : "1",
            "content" : "A large block of text, which you should categorize."
        },
        {
            "id" : "2",
            "content" : "A large block of text, which you should also categorize."
        }
    ]
}
Your service should return JSON in the following format:
{
    "document" : [
        {
            "id" : "1",
            "content" : "A large block of text, which you should categorize.",
            "primary_main_category" : "Big picture infrastructure",
            "primary_subcategory1" : "other subcategory",
            "primary_subcategory2" : "other subcategory",
            "primary_subcategory3" : "other subcategory",
            "primary_subcategory4" : "other subcategory",
            "secondary_main_category" : "Public Transit",
            "secondary_subcategory1" : "other subcategory",
            "secondary_subcategory2" : "other subcategory",
            "secondary_subcategory3" : "other subcategory",
            "secondary_subcategory4" : "other subcategory"
        },
        {
            "id" : "2",
            "content" : "A large block of text, which you should also categorize.",
            "primary_main_category" : "Private Transit",
            "primary_subcategory1" : "other subcategory",
            "primary_subcategory2" : "other subcategory",
            "primary_subcategory3" : "other subcategory",
            "primary_subcategory4" : "other subcategory",
            "secondary_main_category" : "Non-motor powered transit",
            "secondary_subcategory1" : "other subcategory",
            "secondary_subcategory2" : "other subcategory",
            "secondary_subcategory3" : "other subcategory",
            "secondary_subcategory4" : "other subcategory"
        }
    ]
}
Your service should support at least 10 document records per request and return a response within 1 minute.
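As a sketch only (not a required implementation), the service shape could look like the following, using Flask, one of the open source Python libraries permitted for setting up web services. The helper categorize_text is hypothetical and stands in for your Haven OnDemand-backed classification logic.

import flask

app = flask.Flask(__name__)

def categorize_text(content):
    # Hypothetical helper: call Haven OnDemand here and return the ten
    # primary/secondary category fields for a single document.
    return {"primary_main_category": "...", "secondary_main_category": "..."}

@app.route("/api/v1/categorize", methods=["POST"])
def categorize():
    # Parse the request body, categorize each document, and echo the
    # id/content back alongside the predicted category fields.
    payload = flask.request.get_json(force=True)
    results = []
    for doc in payload.get("document", []):
        record = {"id": doc["id"], "content": doc["content"]}
        record.update(categorize_text(doc["content"]))
        results.append(record)
    return flask.jsonify({"document": results})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)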
Technology Overview
Linux
Haven OnDemand
Python 2.7
REST
JSON
Final Submission Guidelines
Submission Deliverables
1. Please submit all code required by the application in your submission.zip
2. Document the build process for your code, including all dependencies (pip installs, etc.)
3. Your documentation should include the process required to load data to Haven OnDemand. Please describe the configuration of your indices, if you use them, and the parameters of your Haven OnDemand calls. A sketch of one possible loading approach appears after this list.
4. If you write code to load data to Haven OnDemand, please include that in your submission.
5. We will be reviewing code and building the winning submissions locally to verify performance results.
6. Use of machine learning libraries other than Haven OnDemand is not allowed for this challenge. However, we are interested in what other technologies might support this task. Please join the forum topic labelled “Machine Learning Classification Technologies” and add your input there. You can name other technologies and methods that you think could help address the challenge.
7. You may use Open Source Python libraries to set up your web services, parse JSON, etc.
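For reference, here is a minimal sketch of one way to create a categorization index and load training records into it. It assumes the Create Text Index and Add to Text Index sync endpoints and their parameter names as documented for Haven OnDemand (including the "categorization" index flavor), and the index and field names are hypothetical; check all of these against the current documentation.

import json
import requests

API_KEY = "your-havenondemand-api-key"
BASE_URL = "https://api.havenondemand.com/1/api/sync"

def create_index(index="training_index"):
    # Assumed API: Create Text Index, using the categorization flavor
    # suggested by the challenge description.
    r = requests.post(BASE_URL + "/createtextindex/v1", data={
        "apikey": API_KEY,
        "index": index,
        "flavor": "categorization",
    })
    r.raise_for_status()

def load_documents(records, index="training_index"):
    # Assumed API: Add to Text Index. Category tags are stored as extra
    # fields (hypothetical names) so later queries can return them.
    documents = {"document": [{
        "reference": rec["id"],
        "content": rec["content"],
        "main_category": rec["main_category"],
        "subcategory1": rec["subcategory1"],
    } for rec in records]}
    r = requests.post(BASE_URL + "/addtotextindex/v1", data={
        "apikey": API_KEY,
        "index": index,
        "json": json.dumps(documents),
    })
    r.raise_for_status()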