Challenge Overview
Our customer wants to increase the retirement account sales to users who are already visiting their banking site. Your job is to analyze website visitor and account data and show how to identify likely buyers.
The data consists of click-stream data from visitors of different types: some visitors buy retirement programs on their first visit, while others click around and learn more first. Still others have been customers for many years, and their funds on deposit, demographics, and browsing behavior now indicate that they might be interested in retirement accounts.
The ideal outcome of this challenge is a list of user categories according to account demographics and click activity and, for each category, a probability that the users in that category are interested in retirement accounts.
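As a rough illustration of the desired output shape, the sketch below estimates a per-category probability as the empirical share of users in that category who opened a retirement account. The category labels and the `records` toy data are hypothetical and not taken from the dataset; a real solution would likely use a fitted model rather than raw frequencies.

```python
from collections import defaultdict


def category_rates(records):
    """Empirical open rate per category.

    records -- iterable of (category, opened) pairs, opened being 0 or 1
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for category, opened in records:
        totals[category] += 1
        positives[category] += opened
    return {category: positives[category] / totals[category] for category in totals}


# Hypothetical toy data: (category label, opened a retirement account?)
records = [
    ("45-59/used_planner", 1),
    ("45-59/used_planner", 0),
    ("45-59/used_planner", 1),
    ("under_45/no_planner", 0),
    ("under_45/no_planner", 0),
]
rates = category_rates(records)
```

With very small categories these raw rates are noisy, so some form of smoothing or a minimum category size may be worth considering.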
Background
The customer, a global wealth management bank, is looking to analyze the actions its clients perform in its web application. The main goal is to predict whether a user is giving out signals that they are planning their retirement, i.e., opening a retirement account. The PoC is not meant to show predictions per se; it is meant to show how to classify users, though of course for those visitors or visitor categories that may be potential buyers, it should output probabilities by class.
In this challenge, you will get access to the submission which has data exploration and some basic analysis on how clickstream data can be used to qualify potential buyers of retirement accounts.
Task Detail
In this challenge, the goal is to efficiently analyze the data, engineer the necessary features, and build an algorithm for predicting whether the client is planning her retirement. The ground truth is defined as whether the user will open a retirement account.
In practice, at a certain time (e.g., every morning on weekdays), we will run the predictor algorithm, taking as input all the client clickstream data and client information data up to that time, and predict the probabilities of opening a retirement account within the next 7 days for all potential clients (i.e., those who don’t have a retirement account yet).
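The daily scoring setup suggests a training label of the form "opens a retirement account within 7 days after the snapshot date". Below is a minimal sketch of that labelling under stated assumptions: `retirement_open_dates` is a hypothetical mapping from client id to opening date, and a client whose account was opened on or before the snapshot date is treated as no longer eligible.

```python
from datetime import date, timedelta

HORIZON_DAYS = 7  # prediction window stated in the task


def make_label(snapshot, open_date, horizon_days=HORIZON_DAYS):
    """Return 1 or 0, or None if the client is not eligible at the snapshot.

    snapshot  -- date on which the predictor runs
    open_date -- date the client opened a retirement account, or None
    """
    if open_date is None:
        return 0  # never opened a retirement account
    if open_date <= snapshot:
        return None  # assumption: already has an account, so not a potential client
    return 1 if open_date <= snapshot + timedelta(days=horizon_days) else 0


# Hypothetical toy data: client id -> retirement account opening date
retirement_open_dates = {
    "c1": date(2020, 3, 5),
    "c2": None,
    "c3": date(2020, 2, 1),
    "c4": date(2020, 4, 1),
}
snapshot = date(2020, 3, 2)
labels = {cid: make_label(snapshot, d) for cid, d in retirement_open_dates.items()}
```

Sliding the snapshot date over the history yields many training examples per client, which is one possible way to study the effect of training data size.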
The main idea is to use the clickstream data (the primary data source) together with user information, such as demographic info (e.g., age), account info, and balances, to predict whether the client is planning her retirement. Some simple signals for retirement planning could be:
- The user has used retirement planning features on the website
- The user's age is between 45 and 59
- The user doesn't have a retirement account
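The three signals above could be turned into baseline flags along these lines. This is only a sketch: the record fields `birth_date`, `retirement_page_clicks`, and `has_retirement_account` are hypothetical names, not columns from the provided tables.

```python
from datetime import date


def age_on(birth_date, as_of):
    """Full years elapsed between birth_date and as_of."""
    years = as_of.year - birth_date.year
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not happened yet this year
    return years


def simple_signals(client, as_of):
    """Compute the three rule-of-thumb flags for one client record (a dict)."""
    birth_date = client.get("birth_date")
    age = age_on(birth_date, as_of) if birth_date else None
    return {
        "used_retirement_features": client.get("retirement_page_clicks", 0) > 0,
        "age_45_to_59": age is not None and 45 <= age <= 59,
        "no_retirement_account": not client.get("has_retirement_account", False),
    }


# Hypothetical client record
client = {
    "birth_date": date(1970, 6, 15),
    "retirement_page_clicks": 3,
    "has_retirement_account": False,
}
signals = simple_signals(client, as_of=date(2020, 6, 14))
```

Flags like these are probably most useful as a baseline against which engineered features and learned models can be compared.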
The expected final deliverables are short white papers with data analysis (e.g., tables and figures to illustrate the engineered features and their effectiveness), PoC code (e.g., a sample pipeline), and example evaluation results (e.g., accuracy, F1 scores, survival analysis, ...).
Accuracy and efficiency are two important aspects of the final predictor. Therefore:
- Please explore the effect of the size of the training data -- how much data should be given as input to the predictor without losing too much accuracy?
- If some of the data preparation can be precomputed, please describe how to build the exact pipeline for aggregating the data and show the efficiency improvement. For example, one of your features might be “total assets at age 25”, which can be calculated once, so the predictor may not need all the earlier data.
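One way to picture the precomputation idea is an aggregate that is built once from history and then updated with each new day's clicks only. The sketch below uses a simple per-client click count as a stand-in feature; the `(client_id, page)` tuple layout and the page names are hypothetical.

```python
from collections import Counter


def update_click_counts(cached_counts, new_clicks):
    """Fold a batch of clicks into cached per-client totals.

    cached_counts -- Counter mapping client id -> clicks seen so far
    new_clicks    -- iterable of (client_id, page) tuples for the new batch only
    """
    updated = Counter(cached_counts)  # copy, so the cache passed in is not mutated
    for client_id, _page in new_clicks:
        updated[client_id] += 1
    return updated


# Hypothetical toy data
history = [("c1", "home"), ("c2", "ira_info"), ("c1", "planner")]
today = [("c1", "planner"), ("c3", "home")]

cache = update_click_counts(Counter(), history)  # one-off precomputation
cache = update_click_counts(cache, today)        # cheap daily update
full_recompute = update_click_counts(Counter(), history + today)
```

The daily update touches only the new clicks, yet gives the same totals as recomputing over the whole history, which is the efficiency gain the pipeline description is expected to demonstrate.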
It is up to you to analyze the data and figure out the appropriate features for prediction -- be creative! All the features should be clearly derived from the input data set, without using any other external data sources. All analysis should be backed up by data analysis done in Python.
Please note that for training you have the entire click history, even from after the retirement account was opened. The predictor must NOT use any click or account history info for dates after the retirement account was opened. Using this data is not relevant to the real-world use case, and your submission will be disqualified if it uses later data to predict the opening of a retirement account in the past.
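A simple guard against this kind of leakage is to filter every client's history before feature extraction. The sketch below assumes clicks are `(click_date, page)` tuples and, conservatively, also drops clicks made on the opening day itself; both choices are assumptions rather than rules from the challenge.

```python
from datetime import date


def visible_clicks(clicks, snapshot, open_date=None):
    """Keep only the clicks the predictor may legitimately see.

    clicks    -- list of (click_date, page) tuples for one client
    snapshot  -- prediction date; later clicks are unknown at scoring time
    open_date -- retirement account opening date, or None if never opened
    """
    return [
        (click_date, page)
        for click_date, page in clicks
        if click_date <= snapshot and (open_date is None or click_date < open_date)
    ]


# Hypothetical toy data for one client
clicks = [
    (date(2020, 1, 10), "home"),
    (date(2020, 2, 1), "planner"),
    (date(2020, 2, 20), "ira_dashboard"),
]
kept = visible_clicks(clicks, snapshot=date(2020, 3, 1), open_date=date(2020, 2, 15))
```

The same cutoff would need to be applied to account, balance, and position history, not only to clicks.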
Data Description
The available dataset is huge -- tens of GB of clicks that users performed in the application. You will have access to a subset of this dataset -- roughly 50%. The remaining data is held out for future challenges.
Besides the clickstream data there are a few other files:
- Clients data - info about the clients
- Accounts data - info about the accounts
- Client Account Relationship - a one-to-many relationship between the clients and the accounts
- Account classification - details about types of accounts
- Derived client accounts - sub-accounts linked to the main client account
- Account Cash Balance - balances for cash accounts (daily)
- Account Positions - info about investment positions for the accounts
The “Tables details.xlsx” document provides info about various tables and columns in the data. The “Account classification” table has info about the various types of retirement accounts. See the “ClientAccountClissification” sheet in the data description document for a list of retirement account codes.
If an account has multiple clients (in client_account_relationship) that means it's a joint account (e.g., one person is account holder while other has power of attorney).
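Joint accounts can be found by grouping the relationship table by account and counting distinct clients. The following sketch works on plain `(client_id, account_id)` pairs; the identifiers are made up, and the real table likely has more columns.

```python
from collections import defaultdict


def find_joint_accounts(relationships):
    """Return {account_id: set of client ids} for accounts with more than one client.

    relationships -- iterable of (client_id, account_id) pairs
    """
    clients_by_account = defaultdict(set)
    for client_id, account_id in relationships:
        clients_by_account[account_id].add(client_id)  # set ignores duplicate rows
    return {
        account_id: clients
        for account_id, clients in clients_by_account.items()
        if len(clients) > 1
    }


# Hypothetical toy data, including one duplicated row
relationships = [
    ("c1", "a1"),
    ("c2", "a1"),
    ("c1", "a2"),
    ("c3", "a3"),
    ("c3", "a3"),
]
joint = find_joint_accounts(relationships)
```

Whether a joint account's balance and activity should count fully for each linked client is a modelling decision worth stating explicitly.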
Tips about the data:
- There are NaNs, in some cases in important fields. Contestants should explain how they deal with such data.
- The entire clickstream history and client account performance info are available in this challenge. However, it is probably not necessary to have the entire account history to accurately predict retirement planning -- for example, data that is 10+ years old might not be very important or helpful.
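As one possible treatment of missing values (not a prescribed one), a numeric field can be filled with the median of the observed values while a separate flag records that the value was missing, so a model can still learn from the missingness itself. The `ages` list is hypothetical toy data.

```python
import math
from statistics import median


def impute_with_flag(values):
    """Fill None/NaN entries with the median of observed values.

    Returns (filled_values, is_missing_flags).
    """
    is_missing = [
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    ]
    observed = [v for v, missing in zip(values, is_missing) if not missing]
    fill = median(observed) if observed else 0.0  # fallback when nothing is observed
    filled = [fill if missing else v for v, missing in zip(values, is_missing)]
    return filled, is_missing


# Hypothetical toy data with two missing ages
ages = [52, float("nan"), 47, None, 61]
filled, flags = impute_with_flag(ages)
```

In a train/test setting the fill value should be computed on the training portion only and then reused for scoring, otherwise information leaks from the evaluation data.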
Final Submission Guidelines
Contents
A document with details of the proposed algorithm and/or a proof-of-concept solution, pseudo-code, or any documentation or previous research papers that help illustrate the proposal.
The final submission should be a report, more like a technical paper. It should include, but is not limited to, the following contents. The client will judge the feasibility and the quality of your proposed likelihood function.
- Title : Title of your idea
- Abstract / Description : High level overview / statement of your idea
- Outline of your proposed approach
- Outline of the approaches that you have considered and their pros and cons
- Justify your final choice
- Details : Detailed description. You must provide details of each step and details of how it should be implemented
- Description of the entire mechanism
- The advantage of your idea - why it could be better than others
- If your idea includes some theory or known papers:
  - Reason why you chose it
  - Details on how it will be used
  - Reference to the papers of the theory
- Reasonings behind the feasibility of your idea
Format
- The document should be a minimum of 2 pages, in PDF or Word format, and describe your ideas.
- It should be written in English.
- Using charts, diagrams, and tables to explain your ideas more comprehensively is encouraged.
Judging Criteria
You will be judged on the quality of your ideas, the quality of your description of the ideas, and how much benefit it can provide to the client. The winner will be chosen based on the most logical and convincing reasoning as to how and why the idea presented will meet the objective. Note that this contest will be judged subjectively by the client and Topcoder. However, the judging criteria will largely be the basis for the judgement.
Accuracy (50%)
- Please justify your final chosen model conceptually and discuss the pros and cons of all compared models.
- Please establish evaluation metrics and benchmark the models that you have tried.
- Please explore and explain the data characteristics and outline the main findings -- graphs and other visuals are highly encouraged
- Provide recommendations for any additional data sets that might be useful to increase the accuracy.
- Please discuss the effect of the training data size to the accuracy.
- Please discuss the data preparation pipeline. What can be precomputed? And what must be calculated in real-time?
- Data analysis scripts with environment setup and instructions on how to run the analysis.
- Predictor training and testing scripts with deployment/verification instructions.
- Should be implemented using Python.
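For the evaluation-metric item in the criteria above, precision, recall, and F1 can be computed directly from true labels and thresholded probabilities. This is a plain-Python sketch with made-up numbers; the 0.5 threshold is an arbitrary assumption, and with rare positive events a different threshold or a ranking metric may be more informative.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for binary labels and predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: true labels and predicted probabilities
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
precision, recall, f1 = precision_recall_f1(y_true, y_prob)
```

Running the same function on models trained with progressively smaller history windows is one simple way to report how training data size affects accuracy.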
Submission Guideline
You can submit at most TWO solutions, but we encourage you to include your best solution, with as much detail as possible, in a single submission.
Supplementary materials
You will be able to download them through the links provided in the forum.