Challenge Overview
Challenge Description
This challenge is a proof of concept that will serve as the base for a new project that analyzes social media posts using sentiment analysis. The sentiment analysis will be used to determine trends in sentiment over a period of time.
John Hancock, like any company, cares about its customers and how it is perceived on social media. Many companies have neglected these channels at their peril: now more than ever, social media is where people go to both praise and criticize their experiences with brands.
This challenge will focus on analyzing social media messages over a period of time and:
* Determining whether each message is positive or negative
* Categorizing those posts into specific categories
Environment Requirements
You are welcome to offer suggestions for languages and libraries to use for this particular challenge, with these guidelines:
1. Python is allowed, preferably Python 3
2. Any third-party libraries, such as NLTK or scikit-learn, must be freely available under a liberal license, such as BSD or MIT
3. Your solution should be easily installed in Linux or Mac OS X
Input data
A concurrent challenge will implement scrapers that target both Twitter and Facebook and import the scraped posts into a consistent MongoDB data structure. This challenge will filter and pull the data from Mongo for analysis.
The specification here describes the MongoDB data store and the expected features: https://www.topcoder.com/challenge-details/30057575/?type=develop . For this challenge, you are expected to create data that follows that set of features. Note that we don't expect things to match up exactly with the data scraper because it is still in development. A future F2F will merge the two challenge outputs together so that the feature names match up between the two pieces of code.
Sentiment analysis
We want the ability to filter the data down and then do the two things listed below. Please provide clear documentation on how your solution can implement each of the examples provided. Note that your solution for filtering data from the scraped data store in Mongo should be flexible enough to target keywords, timespans, specific hashtags, users, and so on.
* Sentiment analysis (good or bad)
* Common terms
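As an illustration of the flexible filtering described above, here is a minimal sketch that builds a MongoDB query document from optional keywords, hashtags, users and a timespan. The field names (`text`, `hashtags`, `user`, `created_at`) are assumptions made for this example, since the scraper's schema is still in development, and the function name `build_filter` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone


def build_filter(keywords=None, hashtags=None, users=None, days=None, now=None):
    """Return a MongoDB query document. All field names are assumptions."""
    query = {}
    if keywords:
        # Case-insensitive match on any of the keywords in the message text.
        pattern = "|".join(re.escape(k) for k in keywords)
        query["text"] = {"$regex": pattern, "$options": "i"}
    if hashtags:
        # Assumes hashtags are stored lowercased and without the leading '#'.
        query["hashtags"] = {"$in": [h.lstrip("#").lower() for h in hashtags]}
    if users:
        query["user"] = {"$in": list(users)}
    if days is not None:
        # Keep only messages created within the last `days` days.
        now = now or datetime.now(timezone.utc)
        query["created_at"] = {"$gte": now - timedelta(days=days)}
    return query


# Example: messages mentioning "Boston Marathon" in the last 30 days.
boston_query = build_filter(keywords=["Boston Marathon"], days=30)
```

With pymongo, a dictionary like this would typically be passed as the filter argument to `collection.find(...)`. Every criterion is optional, so the same helper can serve keyword, hashtag, user and timespan filters alone or combined.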
Sentiment
It should be possible to run sentiment analysis on the MongoDB data as a whole. We might want to schedule this at some point, so we need a script that can:
1. Go through each individual comment or tweet and determine sentiment.
2. Assign the sentiment and the probability to the tweet or comment
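The two steps above could look roughly like the sketch below. The word lists and the smoothed score are only a stand-in for a trained classifier (for example one built with NLTK), and the field names `text`, `sentiment` and `sentiment_probability` are assumptions, not part of the specification.

```python
import re

# Tiny illustrative word lists; a real solution would use a trained model.
POSITIVE = {"good", "great", "love", "amazing", "thanks", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "worst", "angry"}


def score_text(text):
    """Return (label, probability) for one message."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Laplace-smoothed share of positive hits, used as a rough probability.
    p_positive = (pos + 1) / (pos + neg + 2)
    if p_positive > 0.5:
        return "positive", p_positive
    if p_positive < 0.5:
        return "negative", 1 - p_positive
    return "unclear", 0.5


def annotate(doc):
    """Return a copy of a message document with sentiment fields attached."""
    label, prob = score_text(doc["text"])
    return {**doc, "sentiment": label, "sentiment_probability": round(prob, 3)}
```

Messages with no sentiment words, or with an equal number of positive and negative ones, come back as "unclear", which matches the third bucket used in the report example.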
The script should have two modes:
1. A mode that goes through all comments, whether or not they already have sentiment attached, and redoes the sentiment analysis
2. An "update" mode that will only update the comments that don't already have sentiment attached
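One possible shape for the two modes is sketched below. The mode names `full` and `update`, the `--mode` flag and the `sentiment` field are assumptions chosen for this example. The loop runs over in-memory documents so that it is self-contained; against MongoDB, the filter returned by `mode_filter` would select the same documents.

```python
import argparse


def mode_filter(mode):
    """MongoDB filter selecting the documents a run should process."""
    if mode == "full":
        return {}  # every document, scored or not
    if mode == "update":
        return {"sentiment": {"$exists": False}}  # only unscored documents
    raise ValueError(f"unknown mode: {mode!r}")


def run(docs, classify, mode="update"):
    """Score documents in place; return how many were processed."""
    mode_filter(mode)  # validates the mode name
    processed = 0
    for doc in docs:
        if mode == "update" and "sentiment" in doc:
            continue  # already scored, leave untouched
        doc["sentiment"], doc["sentiment_probability"] = classify(doc["text"])
        processed += 1
    return processed


def parse_args(argv=None):
    """Command-line interface with a hypothetical --mode flag."""
    parser = argparse.ArgumentParser(description="Assign sentiment to comments")
    parser.add_argument("--mode", choices=["full", "update"], default="update")
    return parser.parse_args(argv)
```

Defaulting to `update` keeps scheduled runs cheap, while `full` is available when the classifier changes and existing scores need to be recomputed.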
Required examples:
These examples should be provided and documented in your submission. Please provide clear Python code for pulling this data. Note that any prerequisites should be defined. You should also provide clear validation to prove to the client and reviewers that the results from the script match what is in MongoDB.
* Are people saying good or bad things about the Boston Marathon?
* Is the sentiment trend for filter "Boston Marathon" going from good to bad or bad to good over the last 30 days?
* What important terms are most mentioned in the last 60 days when the tweets are filtered by the term "marathon"?
* What important terms are most mentioned in the last 60 days on a specific Twitter user's feed and the replies on that feed?
* Report output, like "Over the past 30 days, 65% of the 75 comments on the Twitter feed have been positive, 25% have been negative and 10% are unclear."
* Over the past 30 days, these are the most common terms on the Twitter feed for John Hancock:
  * Boston Marathon (11 times, 5% of all posts)
  * Retirement (8 times, 3% of all posts)
  * John Hancock + ETF
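A minimal sketch of how the report sentence and the common-terms list might be produced from already-scored documents follows. The field names `text` and `sentiment`, the short stopword list and both function names are assumptions for illustration. Terms are counted once per post so that the percentage reads as "share of all posts".

```python
import re
from collections import Counter

# Small illustrative stopword list; NLTK ships a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "is", "was"}


def sentiment_report(docs, days=30, source="the Twitter feed"):
    """Build the summary sentence from documents with a 'sentiment' field."""
    total = len(docs)
    if total == 0:
        return f"Over the past {days} days, there were no comments on {source}."
    counts = Counter(d.get("sentiment", "unclear") for d in docs)
    pct = {k: round(100 * counts[k] / total)
           for k in ("positive", "negative", "unclear")}
    return (
        f"Over the past {days} days, {pct['positive']}% of the {total} comments "
        f"on {source} have been positive, {pct['negative']}% have been negative "
        f"and {pct['unclear']}% are unclear."
    )


def top_terms(docs, n=3):
    """Return (term, post_count, percent_of_posts) for the n most common terms."""
    post_counts = Counter()
    for d in docs:
        # Use a set so each term counts at most once per post.
        words = set(re.findall(r"[a-z]+", d["text"].lower())) - STOPWORDS
        post_counts.update(words)
    total = len(docs)
    return [(t, c, round(100 * c / total)) for t, c in post_counts.most_common(n)]
```

This version only counts single words. Multi-word terms such as "Boston Marathon" would need bigram counting or a phrase list on top of it.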
Recommendations
If you choose to target Python, NLTK is a good toolkit with loads of documentation: http://www.nltk.org/book/ch01.html
You are also welcome to provide IPython notebooks as part of your solution, especially when showing how to implement the examples outlined above.
Final Submission Guidelines
Submission requirements
You must submit:
* The source code for your solution
* A deployment guide that describes:
  * How to configure your tool
  * How to run your tool
  * How to view results
Bonus functionality
For bonus functionality:
* Graphs using matplotlib would be useful for plotting things like the sentiment trend, or a bar chart / histogram of important terms mentioned over a period of time.
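As a rough sketch of the sentiment-trend plot, the code below aggregates scored posts into a daily share of positive posts, estimates the direction of the trend with a least-squares slope, and optionally saves a line chart. The field names `created_at` (assumed to be a `datetime.date`) and `sentiment`, the function names, and the sample posts are all made up for illustration. matplotlib is imported only inside `plot_trend`, so the aggregation works without it.

```python
from collections import defaultdict
from datetime import date


def daily_positive_share(docs):
    """Map each day to the fraction of that day's posts labelled positive."""
    totals, positives = defaultdict(int), defaultdict(int)
    for d in docs:
        day = d["created_at"]
        totals[day] += 1
        positives[day] += d["sentiment"] == "positive"
    return {day: positives[day] / totals[day] for day in sorted(totals)}


def trend_slope(series):
    """Least-squares slope of the share per day; above zero means improving."""
    days = list(series)
    if len(days) < 2:
        return 0.0
    xs = [(day - days[0]).days for day in days]
    ys = [series[day] for day in days]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def plot_trend(series, path="sentiment_trend.png"):
    """Save a line chart of the series; requires matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(list(series), list(series.values()), marker="o")
    ax.set_xlabel("Day")
    ax.set_ylabel("Share of positive posts")
    ax.set_title("Sentiment trend")
    fig.autofmt_xdate()
    fig.savefig(path)
    plt.close(fig)


# Made-up sample posts, only to show the expected input shape.
posts = [
    {"created_at": date(2015, 4, 1), "sentiment": "negative"},
    {"created_at": date(2015, 4, 1), "sentiment": "positive"},
    {"created_at": date(2015, 4, 2), "sentiment": "positive"},
    {"created_at": date(2015, 4, 3), "sentiment": "positive"},
]
series = daily_positive_share(posts)
slope = trend_slope(series)
```

A slope above zero suggests sentiment moving from bad to good over the window, and a slope below zero the reverse. A bar chart of important terms could be built the same way from term counts.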