Hestia - Customer Clustering Analysis Ideation

The challenge is finished.

Challenge Overview

Challenge Objectives

In this challenge you have to perform basic analysis work on the OLD_INV_HDR table to identify duplicate customers.

Project Background

Hestia is a wholesale distributor of a variety of plumbing and building-related products. In an effort to stay competitive and ahead of the technology curve, and to offer customers the best and easiest way to do business, Hestia wants to understand the data of its thousands of loyal customers. Hestia has engaged Topcoder to perform data science work on this large volume of data and extract insights from it.

Challenge Details

Hestia has multiple business centers (“branches”) with a network of warehouses associated with each business center. Each business center has its own data system. This poses challenges for data science work:
  • The same customer can purchase from multiple business centers, so the customer id cannot be treated as a unique identifier. These customers are duplicate buyers.
  • Invoice numbers can overlap across branches.
  • Invoice numbers are “rolled over”, which means the same number can be reused after a few years.
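Because invoice numbers overlap across branches and roll over after a few years, an invoice number alone cannot identify an invoice. One possible workaround is a composite key. The sketch below is illustrative only: the column names (branch, invoice_no, invoice_date, cust_id) and the sample rows are assumptions, not taken from the data dictionary, and using the invoice year as the roll-over discriminator is also an assumption that should be checked against the real data.

```python
# Hypothetical rows standing in for OLD_INV_HDR. The column names are
# assumptions for illustration, not the real data dictionary.
rows = [
    {"branch": "RICHMOND", "invoice_no": "1001",
     "invoice_date": "2012-03-01", "cust_id": "C1"},
    {"branch": "NEWPORT NEWS", "invoice_no": "1001",
     "invoice_date": "2012-03-02", "cust_id": "C7"},
    {"branch": "RICHMOND", "invoice_no": "1001",
     "invoice_date": "2017-06-15", "cust_id": "C4"},
]


def invoice_key(row):
    """Composite invoice key: (branch, invoice year, invoice number).

    The branch separates numbers that overlap across branches; the year
    separates numbers reused after a roll-over (assuming a number is
    never reused within the same year).
    """
    year = row["invoice_date"][:4]
    return (row["branch"], year, row["invoice_no"])


def customer_key(row):
    """Customer ids are only meaningful within one branch's system."""
    return (row["branch"], row["cust_id"])


invoice_keys = [invoice_key(r) for r in rows]
raw_numbers = {r["invoice_no"] for r in rows}

print(len(raw_numbers), "raw invoice number(s)")
print(len(set(invoice_keys)), "distinct composite key(s)")
```

A branch-scoped customer key like this only keeps records apart; deciding which keys belong to the same real-world customer is the clustering task itself.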

Based on the dataset provided in the forum, you have to deliver the following:
  • Data Analysis (preferably using Python 3)
  • Documentation covering the following points, which will be used to evaluate your solution:
    • Hypothesis building - Identify all possible columns that can be used for clustering
      • For example, demographics of the customer
      • Shipping Address
    • Data Cleaning - Are there any outliers? Did you apply capping and flooring to the variables?
    • Feature Identification
    • Approaches tried (e.g. k-means) and the reason for picking the model in your submission
    • Clustering Details
    • Any other thoughts you have on the dataset
  • Provide Customer Cluster Analysis Result
    • Visualization - Create a cluster graph that shows (as clusters/circles of different radii) which other nearby centers a given center's customers are also buying from.
      E.g. say Richmond is your focus. Show Richmond in the center. Then show the other centers (Front Royal, Newport News, etc.) from which Richmond's customers are also buying. The bigger the circle/cloud of dots, the larger the number of transactions.
    • When many customers are buying from two branches, determine whether there are important similarities in what they are buying.
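The numbers behind the visualization described above can be sketched as follows. This is a minimal illustration, assuming duplicate customers have already been resolved to a shared identifier (the hypothetical entity_id below) and using made-up transactions. For a focus branch it counts the transactions that the focus branch's customers made at every other branch; those counts could then drive the circle radii.

```python
from collections import Counter, defaultdict

# Hypothetical (entity_id, branch) pairs, one per transaction.
# entity_id stands for a resolved customer, i.e. the output of the
# duplicate-detection step; the values here are invented.
transactions = [
    ("E1", "Richmond"), ("E1", "Front Royal"), ("E1", "Front Royal"),
    ("E2", "Richmond"), ("E2", "Newport News"),
    ("E3", "Newport News"),
    ("E4", "Richmond"),
]


def co_purchase_counts(transactions, focus):
    """Count transactions at other branches made by customers of `focus`."""
    branches_by_entity = defaultdict(set)
    for entity, branch in transactions:
        branches_by_entity[entity].add(branch)

    # Customers who bought at least once from the focus branch.
    focus_entities = {
        entity for entity, branches in branches_by_entity.items()
        if focus in branches
    }

    counts = Counter()
    for entity, branch in transactions:
        if entity in focus_entities and branch != focus:
            counts[branch] += 1
    return counts


counts = co_purchase_counts(transactions, "Richmond")
for branch, n in counts.most_common():
    print(branch, n)
```

Counting transactions rather than distinct customers follows the requirement that a bigger circle means more transactions; counting distinct customers instead would be a one-line change.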
The data dictionary is also provided in the forum.
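For the data cleaning item above, capping and flooring can be sketched as clipping each numeric variable to a lower and an upper percentile. The example below is only an illustration: the values are invented, the variable name invoice_totals is hypothetical, and the 20th/80th percentile cut-offs and the nearest-rank percentile definition are arbitrary choices made for a small sample.

```python
import math


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an already sorted list (pct in 0..100)."""
    rank = max(1, math.ceil(pct * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def cap_and_floor(values, lower_pct=5, upper_pct=95):
    """Clip values below the lower percentile up to it (flooring) and
    values above the upper percentile down to it (capping)."""
    ordered = sorted(values)
    floor_value = percentile(ordered, lower_pct)
    cap_value = percentile(ordered, upper_pct)
    return [min(max(v, floor_value), cap_value) for v in values]


# Invented invoice totals with one very large and one very small value.
invoice_totals = [12.0, 15.5, 14.0, 13.2, 16.8, 15.1, 14.7, 13.9, 9000.0, 0.01]
capped = cap_and_floor(invoice_totals, lower_pct=20, upper_pct=80)
print(capped)
```

Clipping keeps every row in the dataset, which matters for distance-based methods such as k-means, where a single extreme value can dominate the cluster centers.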

Final Submission Guidelines

  • Documentation
  • Code

ELIGIBLE EVENTS:

Topcoder Open 2019

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30086633