Challenge Overview
Challenge Objectives
In this challenge, you will perform basic analysis on the OLD_INV_HDR table to identify duplicate customers.
Project Background
Hestia is a wholesale distributor of a variety of plumbing and building-related products. To stay competitive and ahead of the technology curve, and to offer customers the best and easiest way to do business, Hestia wants to understand the data of its thousands of loyal customers. Hestia has engaged Topcoder to perform data science work on its large volume of data and gain insights from it.
Challenge Details
Hestia has multiple business centers (“branches”), each with an associated network of warehouses. Each business center has its own data system, which poses challenges for data science work:
- The same customer can purchase from multiple business centers, so the customer ID cannot be used as a unique identifier. These customers are duplicate buyers.
- Invoice numbers can overlap across branches.
- Invoice numbers are “rolled over”, which means the same number can be repeated after a few years.
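As a minimal sketch of how the three issues above might be handled, the Python below builds a composite invoice key and a normalized customer key. The field names (branch, invoice_no, invoice_year, cust_id, cust_name, ship_zip) and the sample rows are assumptions made for illustration; the actual OLD_INV_HDR columns may differ. Matching on name plus ZIP code is only one possible heuristic, not a prescribed method.

```python
import re
from collections import defaultdict

# Hypothetical rows: the real OLD_INV_HDR column names may differ.
rows = [
    {"branch": "Richmond", "invoice_no": "1001", "invoice_year": 2015,
     "cust_id": "C17", "cust_name": "Acme Plumbing, Inc.", "ship_zip": "23220"},
    {"branch": "Newport News", "invoice_no": "1001", "invoice_year": 2015,
     "cust_id": "C902", "cust_name": "ACME PLUMBING INC", "ship_zip": "23220-1234"},
    {"branch": "Richmond", "invoice_no": "1001", "invoice_year": 2019,
     "cust_id": "C44", "cust_name": "Blue Ridge Builders", "ship_zip": "22630"},
]


def invoice_key(row):
    """Combine branch, year and number so overlapping or rolled-over
    invoice numbers do not collide (assumes no roll-over within a year)."""
    return (row["branch"], row["invoice_year"], row["invoice_no"])


def customer_key(row):
    """Crude cross-branch identity: normalized name plus 5-digit ZIP."""
    name = re.sub(r"[^a-z0-9 ]", "", row["cust_name"].lower())
    name = re.sub(r"\b(inc|llc|co|corp)\b", "", name)  # drop common suffixes
    name = " ".join(name.split())                      # collapse whitespace
    return (name, row["ship_zip"][:5])


# Group the branches each normalized customer has bought from.
branches_by_customer = defaultdict(set)
for row in rows:
    branches_by_customer[customer_key(row)].add(row["branch"])

# A customer seen at more than one branch is a candidate duplicate buyer.
duplicate_buyers = {
    key: sorted(branches)
    for key, branches in branches_by_customer.items()
    if len(branches) > 1
}
print(duplicate_buyers)
```

Exact-key matching like this will miss misspellings and address variations, which is where the clustering approaches requested below come in.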
Based on the dataset provided in the forum, you have to perform the following:
- Data Analysis (preferably using Python 3)
- Documentation that covers the following, which will be used to evaluate your solution:
- Hypothesis building - Identify all possible columns that can be used for clustering
- For example, demographics of the customer
- Shipping Address
- Data Cleaning - Are there any outliers? Did you apply capping and flooring to the variables?
- Feature Identification
- Approaches tried (e.g., k-means) and the reason for picking the model in your submission
- Clustering Details
- Any other thoughts you have on the dataset
- Provide Customer Cluster Analysis Result
- Visualization - Create a cluster graph that shows (as clusters/circles of different radii) which other nearby centers a given center's customers are also buying from.
For example, say Richmond is your focus. Show Richmond in the center, then show the other centers (Front Royal, Newport News, etc.) that Richmond's customers also buy from. The bigger the circle/cloud of dots, the larger the number of transactions.
- When many customers are buying from two branches, show whether there are important similarities in what they are buying.
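For the visualization item above, the numbers behind the circles could be derived roughly as follows. This is a sketch under stated assumptions: the (customer, branch) pairs are made-up sample data with one pair per invoice, the customer labels stand in for whatever de-duplicated customer identity your analysis produces, and the helper names co_purchase_counts and circle_radii are hypothetical. Radii are scaled with a square root so that circle area, rather than radius, is proportional to the transaction count.

```python
import math
from collections import Counter

# Hypothetical (customer, branch) pairs, one per invoice.
transactions = [
    ("cust_a", "Richmond"), ("cust_a", "Front Royal"), ("cust_a", "Front Royal"),
    ("cust_b", "Richmond"), ("cust_b", "Newport News"),
    ("cust_c", "Newport News"),
    ("cust_d", "Richmond"),
]


def co_purchase_counts(transactions, focus):
    """Count transactions at other branches made by customers of `focus`."""
    focus_customers = {cust for cust, branch in transactions if branch == focus}
    counts = Counter()
    for cust, branch in transactions:
        if branch != focus and cust in focus_customers:
            counts[branch] += 1
    return counts


def circle_radii(counts, max_radius=1.0):
    """Scale radii so that circle area is proportional to the count."""
    if not counts:
        return {}
    largest = max(counts.values())
    return {b: max_radius * math.sqrt(n / largest) for b, n in counts.items()}


counts = co_purchase_counts(transactions, "Richmond")
radii = circle_radii(counts)
for branch, n in counts.most_common():
    print(f"{branch}: {n} transactions, radius {radii[branch]:.2f}")
```

The resulting radii can be passed to any plotting library to draw the focus branch in the center with the other branches around it.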
Final Submission Guidelines
- Documentation
- Code