Challenge Overview
Project Overview
Governments across the world are increasingly applying open government practices such as crowdsourcing to develop stronger policies and to engage citizens, providing access to civic influence beyond the election cycle. When hundreds of ideas from citizens flow in, the crowdsourcers face a problem: a lack of efficient analysis and synthesis tools.
Civic CrowdAnalytics, a group at Stanford University, is developing solutions to address this problem, and is taking steps towards more participatory, inclusive and transparent democratic societies, making sure that all citizens have an equal opportunity to get their voices heard.
This challenge is part of the HPE Living Progress Challenge Blitz Program (secure a top placement on the leaderboard to earn additional cash prizes).
Competition Task Overview
In this challenge, you will analyze and categorize two data sets of crowd-sourced ideas for civic improvements. One of the data sets relates to transportation issues in Palo Alto, CA.
The training data set (317 examples) can be found here:
https://docs.google.com/spreadsheets/d/1tyZu4gNumrQT0xWg0iytf6CIPBw48LbdpwwrxuVUh48/edit?usp=sharing
The complete category hierarchy can be found in this separate document:
https://docs.google.com/spreadsheets/d/1BNXL38LxjRH5QEzQUhS58kl9fDTWUPdeWQ4ncerPnpw/edit?usp=sharing
The second data set is from Chile and is in Spanish; however, the inputs and categories are clearly defined. The "Cuerpo" column is the input text to be categorized and the "Etiquetas ciudadanas" column contains the category tags.
https://drive.google.com/open?id=0ByjxTGykXQjATTNTYWFDeUFvRG8
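As a rough illustration only, the sketch below loads both training sets with pandas, assuming you have downloaded the Google Sheets as local CSV files; the file names are placeholders, and only the Chile column names ("Cuerpo" and "Etiquetas ciudadanas") come from this specification:

import pandas as pd

# Palo Alto training data (317 rows); column names depend on the actual sheet.
palo_alto = pd.read_csv("palo_alto_training.csv")
print(palo_alto.shape)

# Chile training data (1,045 rows, Spanish text): "Cuerpo" holds the input
# text and "Etiquetas ciudadanas" holds the category tags.
chile = pd.read_csv("chile_training.csv")
texts = chile["Cuerpo"].astype(str)
labels = chile["Etiquetas ciudadanas"].astype(str)
print(texts.head())
print(labels.head())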
In a previous challenge, we asked the Topcoder Community to use Haven OnDemand as our classification technology. Haven OnDemand has a free developer tier and you can sign up for a developer account here. For this challenge, however, you may use any Python machine learning APIs or libraries, provided their licensing allows for commercial use.
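For example, one permissible open-source baseline (illustrative only, not a required approach) is a scikit-learn pipeline of TF-IDF features and logistic regression that produces two guesses per document; the texts and labels variables are carried over from the loading sketch above, and the sample sentence is invented:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("model", LogisticRegression()),
])

# Fit on the training texts and category labels loaded earlier.
clf.fit(texts, labels)

# Primary guess = highest-probability class, secondary guess = runner-up.
probabilities = clf.predict_proba(["Add more bike lanes near the train station."])[0]
top_two = np.argsort(probabilities)[::-1][:2]
print(clf.classes_[top_two[0]], clf.classes_[top_two[1]])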
Fifty percent of the scoring for your submission will be based upon the effectiveness of your classification responses. We’re going to compare the accuracy of the submissions against each other.
The scoring function will work as follows for the Palo Alto data:
1 point will be given for each correct main category and subcategory 1 association
.5 points will be given for each correct subcategory 2 association
.25 points will be given for each correct subcategory 3 or 4 association
We’re also going to allow you to make a second guess for each category tag. The second guess will be ignored if the first guess is correct. The scoring for the second guesses will be as follows:
.5 points will be given for each correct secondary main category and subcategory 1 association
.25 points will be given for each correct secondary subcategory 2 association
.125 points will be given for each correct secondary subcategory 3 or 4 association
For the Chile data set the scoring will work as follows:
1 point will be given for each correct primary main category
.5 points will be given for each correct secondary main category response.
We’ll average the scores from the two data sets for each submission to determine the winner. After we determine your score based on your responses, we’ll create a decile range from the set of scores across all competitors. Submissions with the highest scores in the range will receive a “10” for this performance metric and those that are the least accurate will receive a “1”. You will be provided with training data of 317 records for the Palo Alto data set and 1,045 records for the Chile data set to test and validate your solution. We’ll validate your solution and generate scores with data that we don’t provide in advance.
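The snippet below is only our illustrative reading of the Palo Alto point scheme, not the official scorer; the prediction field names mirror the response JSON defined later in this specification, and the keys of the ground-truth dict are hypothetical:

# Points awarded when the first (primary) guess for a field is correct.
PRIMARY_POINTS = {
    "primary_main_category": 1.0,
    "primary_subcategory1": 1.0,
    "primary_subcategory2": 0.5,
    "primary_subcategory3": 0.25,
    "primary_subcategory4": 0.25,
}

def score_record(truth, pred):
    # A correct second guess earns half the primary points and is only
    # considered when the corresponding first guess is wrong.
    total = 0.0
    for field, points in PRIMARY_POINTS.items():
        level = field.replace("primary_", "")   # e.g. "main_category"
        if pred.get(field) == truth.get(level):
            total += points
        elif pred.get("secondary_" + level) == truth.get(level):
            total += points / 2.0
    return total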
In order to facilitate the testing of your solution, we’re going to ask you to set up a simple Python web service and publish it on the web so it is available to our test harness. Your service, in turn, will need to provide categorization responses based on our input. You will still need to submit your code to Topcoder through the normal submission process, but we’ll be doing the initial categorization validation online. Platforms like Heroku allow you to host your REST service for free.
Your API should have the following endpoints:
http://your.ip.address/api/v1/categorize/paloalto
http://your.ip.address/api/v1/categorize/chile
Your service should accept the following JSON in a post request:
{
  "document" :
  [
    {
      "id" : "1",
      "content" : "A large block of text, which you should categorize."
    }, {
      "id" : "2",
      "content" : "A large block of text, which you should also categorize."
    }
  ]
}
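As a small server-side sketch, the helper below unpacks that payload into (id, content) pairs; the "document", "id", and "content" field names come from the schema above, while the helper name itself is hypothetical:

import json

def parse_request(raw_body):
    # Return a list of (id, content) pairs from the raw JSON request body.
    payload = json.loads(raw_body)
    return [(doc["id"], doc["content"]) for doc in payload.get("document", [])]

sample = '{"document": [{"id": "1", "content": "A large block of text."}]}'
print(parse_request(sample))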
Your service should return a JSON response in the following format:
{
  "document" :
  [
    {
      "id" : "1",
      "content" : "A large block of text, which you should categorize.",
      "primary_main_category" : "Big picture infrastructure",
      "primary_subcategory1" : "other subcategory",
      "primary_subcategory2" : "other subcategory",
      "primary_subcategory3" : "other subcategory",
      "primary_subcategory4" : "other subcategory",
      "secondary_main_category" : "Public",
      "secondary_subcategory1" : "other subcategory",
      "secondary_subcategory2" : "other subcategory",
      "secondary_subcategory3" : "other subcategory",
      "secondary_subcategory4" : "other subcategory"
    }, {
      "id" : "2",
      "content" : "A large block of text, which you should also categorize.",
      "primary_main_category" : "Private Transit",
      "primary_subcategory1" : "other subcategory",
      "primary_subcategory2" : "other subcategory",
      "primary_subcategory3" : "other subcategory",
      "primary_subcategory4" : "other subcategory",
      "secondary_main_category" : "Non-motor powered transit",
      "secondary_subcategory1" : "other subcategory",
      "secondary_subcategory2" : "other subcategory",
      "secondary_subcategory3" : "other subcategory",
      "secondary_subcategory4" : "other subcategory"
    }
  ]
}
Your code should support at least 10 document records at a time and return a response within 1 minute. For the Chile endpoint, the requests will be the same, but the responses should look like the following:
{
  "document" :
  [
    {
      "id" : "1",
      "content" : "A large block of text, which you should categorize.",
      "primary_main_category" : "Big picture infrastructure",
      "secondary_main_category" : "Public"
    }, {
      "id" : "2",
      "content" : "A large block of text, which you should also categorize.",
      "primary_main_category" : "Private Transit",
      "secondary_main_category" : "Non-motor powered transit"
    }
  ]
}
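Putting the pieces together, here is a minimal Flask skeleton that wires up both endpoints and returns responses in the formats shown above; Flask is one possible stack rather than a requirement, and predict_paloalto() and predict_chile() are hypothetical placeholders for whatever model you actually train:

from flask import Flask, request, jsonify

app = Flask(__name__)

PALO_ALTO_FIELDS = [
    "primary_main_category", "primary_subcategory1", "primary_subcategory2",
    "primary_subcategory3", "primary_subcategory4",
    "secondary_main_category", "secondary_subcategory1",
    "secondary_subcategory2", "secondary_subcategory3", "secondary_subcategory4",
]

def predict_paloalto(text):
    # Placeholder: a real implementation returns model predictions for all
    # ten Palo Alto fields listed above.
    return {field: "TODO" for field in PALO_ALTO_FIELDS}

def predict_chile(text):
    # Placeholder: Chile responses only need the two main-category fields.
    return {"primary_main_category": "TODO", "secondary_main_category": "TODO"}

def categorize_with(predict_fn):
    # Echo each document back with its id, content, and predicted categories.
    documents = request.get_json(force=True).get("document", [])
    results = []
    for doc in documents:
        record = {"id": doc["id"], "content": doc["content"]}
        record.update(predict_fn(doc["content"]))
        results.append(record)
    return jsonify({"document": results})

@app.route("/api/v1/categorize/paloalto", methods=["POST"])
def categorize_paloalto():
    return categorize_with(predict_paloalto)

@app.route("/api/v1/categorize/chile", methods=["POST"])
def categorize_chile():
    return categorize_with(predict_chile)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)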
We have developed a Test Harness in Python that you can use to test the services you create. It works well and has been updated to cover both endpoints. You can download it from the following Drive folder:
https://drive.google.com/file/d/0ByjxTGykXQjAcjhFRnVfenNqYkE/view?usp=sharing
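For a quick manual check in addition to the harness, a short client using the requests library might look like the following; the URL is the placeholder address from the endpoint list above and the sample documents are invented:

import requests

payload = {
    "document": [
        {"id": "1", "content": "Add more bike lanes near the train station."},
        {"id": "2", "content": "Extend the shuttle schedule on weekends."},
    ]
}
response = requests.post(
    "http://your.ip.address/api/v1/categorize/paloalto",
    json=payload,
    timeout=60,  # the spec requires a response within one minute
)
print(response.status_code)
print(response.json())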
Technology Overview
Linux
Data Science
Python 2.7
REST
JSON
Final Submission Guidelines
Submission Deliverables
1. Please submit all code required by the application in your submission.zip
2. Document the build process for your code, including all dependencies (pip installs, etc.).
3. We will be reviewing code and building the winning submissions locally to verify that they perform as they do on your web service.
4. You may use any Python Open Source libraries or technologies provided they are available for commercial use.
5. If you use some other API or platform, please document all steps required to replicate your results (e.g., with Haven OnDemand, this typically means creating text indices and loading training data to this external platform).