Challenge Overview
Background
Over the last few months, a series of challenges has been run to generate high-quality financial forecasts for a consumer broadband brand. These challenges have been known as ‘CFO Forecasting’ in various formats. As a result, a set of high-quality, high-accuracy algorithmic forecasts has been produced for a number of financial target variables.
This challenge is being rerun to generate a similarly high-quality forecast for a ‘sister brand’, owned by the same client, which offers a similar broadband product set to the Small and Medium Enterprise (SME) market. A series of models for the original broadband product set has been provided as a basis for the SME forecast, though these need not be used if superior performance is achieved through a different algorithm.
Challenge Objective
The objective of this challenge is to generate the highest accuracy predictions possible for the 5 financial variables outlined below, for each of the two products. The accuracy of the forecast must at least improve on the Threshold target quoted for each variable / product.
The model should be tailored to a 12-month forecast horizon but must be extendable beyond this period.
The accuracy of a prediction will be evaluated using MAPE (Mean Absolute Percentage Error) on the privatised data set, using data withheld over a period of 12 months.
Business context
The two products are broadband products:
- Tortoise (legacy product, declining product, available everywhere) - slow download speeds.
- Falcon (main product, reaching maturity, available in most of the country) - faster download speeds.
These two products are interdependent: Falcon is an upgrade of the earlier Tortoise product, and Falcon's growth depends to a large extent on upgrading customers from Tortoise. There is therefore a gradual move from Tortoise to Falcon.
The five variables are financial metrics:
- Gross adds – the number of new subscribers by product joining the brand during a month
- Leavers – the number of subscribers by product who terminated service with the brand during that month
- Net migrations – the number of subscribers who remained with the brand but moved to another product. This usually is an upgrade to faster broadband speed.
These three ‘discontinuous’ variables are seasonal; vary significantly from month to month; and are almost entirely dependent on the market, and competitor pressures at that point in time.
- Average Revenue per new customer - the average monthly revenue paid by new customers in the first month of the customer’s contract.
- Revenue - the total revenue generated by the subscriber base per product per month.
These two ‘continuous’ variables have a significant monthly recurring component with only small incremental changes (positive or negative) from month to month. They are therefore smooth and continuous, with only gradual shifts in value.
Challenge Thresholds and Targets
Your submission will be judged on two criteria.
- Minimizing error (MAPE).
- Achieving the Thresholds and Targets designated in the tables above.
The details will be outlined in the Quantitative Scoring section below.
Business Insights
The relationship between key financial variables
- Closing Base for a Product = Volume Opening Base for that Product + Gross Adds – Leavers + Net Migrations to that Product
- Net Adds = Volume Closing Base – Volume Opening Base
- Revenue = Average Volume over the period * Base ARPU
Net Migrations is the difference in the number of existing customers with Sandesh Brand 3 that move onto and off a specific broadband product per month. A positive net migrations value (before data set Privatisation) means that more customers are moving onto the product than moving off.
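The relationships above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the figures are invented, and "Average Volume over the period" is assumed here to be the simple mean of the opening and closing base, which the specification does not state.

```python
def roll_forward_base(opening_base, gross_adds, leavers, net_migrations):
    """Return the closing base for each month, given monthly movements."""
    closing = []
    base = opening_base
    for ga, lv, nm in zip(gross_adds, leavers, net_migrations):
        # Closing Base = Opening Base + Gross Adds - Leavers + Net Migrations
        base = base + ga - lv + nm
        closing.append(base)
    return closing


def net_adds(opening_base, closing_base):
    """Net Adds = Closing Base - Opening Base."""
    return closing_base - opening_base


def revenue(opening_base, closing_base, base_arpu):
    """Revenue = Average Volume * Base ARPU.

    Assumption: average volume is the mean of opening and closing base.
    """
    average_volume = (opening_base + closing_base) / 2
    return average_volume * base_arpu


# Invented figures for a declining product such as Tortoise.
closing = roll_forward_base(1000, [50, 40], [30, 35], [-10, -5])
print(closing)                         # [1010, 1010]
print(net_adds(1000, closing[0]))      # 10
print(revenue(1000, closing[0], 20.0)) # 20100.0
```

Because the base rolls forward from the three movement variables, forecasts of Gross Adds, Leavers and Net Migrations can be cross-checked against a Revenue forecast for consistency.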
Price Increases
Over the period of the data set, a price increase to the existing customer base has been initiated. Details of the average increase and timing are included in the data set. Price increases will impact the following variables to a greater or lesser extent. Building in this impact may improve the accuracy of the predictions.
- Leavers
- Revenue
The effect of the price increase should be distinguished from outliers and handled appropriately.
Outlier treatment
Within the data sets provided there are a number of outliers due to reasons outside of the market. These outliers should be treated as deemed appropriate to improve forecasting performance.
Any such treatment must be clearly documented and explained in the submission. Here are some examples that should be considered, though these are not exhaustive.
Average Revenue per New Customer
Outlier adjustment required:
There is a significant ‘step change outlier’ between Mar 18 and Apr 18. This is due to an artificial re-correction and should be adjusted. Please build this adjustment into the submission and clearly document how the ‘step change outlier’ has been handled in your model. The treatment needs to be generic so that it can be applied to the real data set.
The trough in April 17 on the lower of the two lines (Tortoise) is an outlier and should be treated. Please build this treatment into the submission, clearly document how the outlier has been handled, and ensure the treatment is generic so that it can be applied to the real data set.
Please note the increase between Dec 17 and Mar 18 is genuine and is driven by a price increase.
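One generic way to handle a step change is to estimate the size of the level shift from the values on either side of the break and move the earlier history onto the later level. The sketch below is an assumption, not the treatment used in previous challenges: `remove_level_shift` and its parameters are hypothetical, and the break position is passed in rather than hardcoded. Where a genuine trend (such as a price increase) sits next to the break, the local medians will absorb part of it, so the window needs care.

```python
from statistics import median


def remove_level_shift(series, break_index, window=3):
    """Align the history before `break_index` with the level after it.

    The shift is estimated as the difference between the median of the
    `window` points after the break and the `window` points before it.
    """
    before = series[max(0, break_index - window):break_index]
    after = series[break_index:break_index + window]
    shift = median(after) - median(before)
    # Move the pre-break history; leave the post-break values untouched.
    adjusted = [x + shift for x in series[:break_index]] + list(series[break_index:])
    return adjusted, shift


adjusted, shift = remove_level_shift([1.0, 1.0, 1.0, 5.0, 5.0, 5.0], break_index=3)
print(shift)     # 4.0
print(adjusted)  # [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
```

Shifting the older segment, rather than the recent one, keeps the most recent observations at their reported level, which is usually what a forecast should continue from.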
Revenue - Tortoise and Falcon.
There are a couple of apparent outliers in the Tortoise data sets. These need to be identified and treated to improve the forecast of the resulting data set:
- Mar 2018
- Feb - Mar 2017
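Isolated outliers such as these can be detected without hardcoding dates, for example with a Hampel-style filter that compares each point with the median of its neighbours. The sketch below is one possible generic treatment rather than a prescribed one; the function name, window size and threshold are assumptions, and the sample data is invented.

```python
from statistics import median


def hampel_filter(series, window=3, n_sigmas=3.0):
    """Replace points that sit far from the local median.

    A point is flagged when it deviates from the median of its
    neighbourhood by more than `n_sigmas` robust standard deviations,
    where the robust scale is 1.4826 * MAD (median absolute deviation).
    Returns the cleaned series and the indices that were replaced.
    """
    cleaned = list(series)
    flagged = []
    for i in range(len(series)):
        lo = max(0, i - window)
        hi = min(len(series), i + window + 1)
        neighbours = series[lo:hi]
        med = median(neighbours)
        mad = median(abs(x - med) for x in neighbours)
        scale = 1.4826 * mad
        if scale > 0 and abs(series[i] - med) > n_sigmas * scale:
            cleaned[i] = med
            flagged.append(i)
    return cleaned, flagged


values = [10, 11, 10, 12, 11, 30, 11, 10, 12, 11]
cleaned, flagged = hampel_filter(values)
print(flagged)     # [5]
print(cleaned[5])  # 11
```

A sustained shift, such as a genuine price increase, moves the local median with it and is therefore much less likely to be flagged than a single-month spike or trough. Strongly seasonal series may need the filter applied to deseasonalised values.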
Financial Year modeling:
Sandesh reports its financial year from April - March. This may contribute to seasonality based on financial year, and quarters (Jun, Sep, Dec, and Mar), rather than calendar year.
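A minimal sketch of how financial-year features could be derived from a calendar date is shown below. The function name and the returned keys are hypothetical, and labelling the financial year by the calendar year in which it starts is an assumption.

```python
def fiscal_period(year, month, start_month=4):
    """Map a calendar (year, month) to financial-year features.

    With start_month=4 the financial year runs April to March, so the
    quarters end in June, September, December and March.
    """
    offset = (month - start_month) % 12  # 0 for April, 11 for March
    fiscal_month = offset + 1
    return {
        # Assumption: the financial year is labelled by its starting year.
        "fiscal_year": year if month >= start_month else year - 1,
        "fiscal_month": fiscal_month,
        "fiscal_quarter": offset // 3 + 1,
        "is_quarter_end": fiscal_month % 3 == 0,
    }


print(fiscal_period(2018, 3))
# {'fiscal_year': 2017, 'fiscal_month': 12, 'fiscal_quarter': 4, 'is_quarter_end': True}
print(fiscal_period(2018, 4))
# {'fiscal_year': 2018, 'fiscal_month': 1, 'fiscal_quarter': 1, 'is_quarter_end': False}
```

Features like `fiscal_quarter` and `is_quarter_end` can then be supplied to a model as exogenous regressors or seasonal indicators.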
Anonymised and Privatised data set:
‘Z-score’ is used to privatise the real data.
For all variables, the following formula is used to privatise the data:
z_i = (x_i – μ) / σ
where z_i = z-score of the i-th value for the given variable
x_i = actual value
μ = mean of the given variable
σ = standard deviation for the given variable
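The transformation can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the sample values are invented, and the population standard deviation is used because the specification does not say whether the population or sample form was applied.

```python
from statistics import mean, pstdev


def privatise(values):
    """Return z-scores plus the mean and standard deviation used."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in values], mu, sigma


def deprivatise(z_values, mu, sigma):
    """Invert the z-score transformation (requires the original mu and sigma)."""
    return [z * sigma + mu for z in z_values]


raw = [2, 4, 4, 4, 5, 5, 7, 9]
z, mu, sigma = privatise(raw)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Note that privatised values are centred on zero, so roughly half of them are negative and some are very close to zero. The original scale can only be recovered by someone who holds μ and σ.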
Modeling Insight derived from previous challenges.
Previous models proven on this type of data
Various algorithmic models have proven themselves to be useful in similar previous univariate prediction challenges.
RNNs (LSTM or Seq2Seq) have proven successful on the customer movement variables (Gross Adds, Leavers, and Net Migrations), while SARIMAX or VARMAX have been more successful on the ‘smooth curve’ variables that describe the customer base, such as Revenue.
The best examples of the previous algorithms have been included in the specification. These models can be used as a foundation for further refinement.
Optimise the algorithms by minimising RMSE
It is recommended to optimise the models by minimising RMSE (Root Mean Squared Error) rather than MAPE, because of the privatisation method used. It is strongly believed that minimising RMSE will create the best model capable of being retrained on the real data set.
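One plausible reading of this recommendation is that z-scored values lie close to zero, where a percentage error becomes unstable, while RMSE depends only on the absolute size of the errors. The sketch below illustrates the effect with invented numbers; the function names are hypothetical and `mape` here uses the standard definition with an absolute value in the denominator.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error, expressed as a fraction."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


# Every prediction is off by the same amount (0.1), but the first
# actual value is close to zero, as often happens with z-scores.
actual = [0.01, 1.0, -1.0]
predicted = [0.11, 1.1, -0.9]

print(rmse(actual, predicted))  # about 0.1
print(mape(actual, predicted))  # about 3.4, dominated by the near-zero actual
```

A model tuned on MAPE would spend most of its effort on the points nearest zero in the privatised data, which are not special in the real data.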
Final Submission Guidelines
Submission Format
Your submission must include the following items:
- The filled test data. We will evaluate the results quantitatively (see below).
- A report about your model, including data analysis, model details, local cross validation results, and variable importance.
- Deployment instructions describing how to install the required libraries and how to run the code.
Expected in Submission
- Working Python code that runs on different data sets in the same format
- Report with a clear explanation of all the steps taken to solve the challenge (refer to section “Challenge Details”) and of how to run the code
- No hardcoding (e.g., column names, possible values of each column, ...) in the code is allowed. We will run the code on some different datasets
- All models in one code with clear inline comments
- Flexibility to extend the code to forecast for additional months
Quantitative Scoring
Given two values, one ground truth value (gt) and one predicted value (pred), we define the relative error as:
MAPE(gt, pred) = |gt - pred| / gt
We then compute the raw_score(gt, pred) as
raw_score(gt, pred) = max{ 0, 1 - MAPE(gt, pred) }
That is, if the relative error exceeds 100%, you will receive a zero score in this case.
The final MAPE score for each variable is computed based on the average of raw_score, and then multiplied by 100.
Final score = 100 * average( raw_score(gt, pred) )
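The scoring formulas above translate directly into code. The sketch below implements them as written, with hypothetical function names and invented values; it assumes positive ground truth values, since the specification does not say how zero or negative values are handled.

```python
def raw_score(gt, pred):
    """raw_score = max(0, 1 - |gt - pred| / gt), as defined above.

    Assumption: gt is positive; the formula as written divides by gt.
    """
    relative_error = abs(gt - pred) / gt
    return max(0.0, 1.0 - relative_error)


def final_score(gts, preds):
    """Final score = 100 * average(raw_score) over all predictions."""
    scores = [raw_score(g, p) for g, p in zip(gts, preds)]
    return 100.0 * sum(scores) / len(scores)


# Relative errors of 10%, 25% and 200%: the last one is clipped to zero.
gts = [100.0, 200.0, 50.0]
preds = [110.0, 150.0, 150.0]
print([raw_score(g, p) for g, p in zip(gts, preds)])  # [0.9, 0.75, 0.0]
print(final_score(gts, preds))                         # about 55
```

Because each raw score is clipped at zero, a single very poor prediction costs at most one month's worth of score and cannot drag the average below zero.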
MAPE scores will be 50% of the total scoring.
You will also receive a score between 0 and 1 for all the thresholds and targets that you achieve. Each threshold will be worth 0.05 points and each target will be worth 0.05 points. If you achieve the target for a particular variable you’ll get the threshold points as well so you’ll receive 0.1 points for that variable. Your points for all the variables will be added together.
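The points arithmetic can be sketched as follows. The function name and the threshold and target values are hypothetical, and the sketch assumes that a threshold or target is "achieved" when the MAPE for that variable is at or below the quoted value, with the target being the stricter of the two.

```python
def threshold_points(mape, threshold, target):
    """Points for one variable: 0.05 for the threshold, 0.05 more for the target.

    Assumption: lower MAPE is better and target <= threshold.
    """
    points = 0.0
    if mape <= threshold:
        points += 0.05
        if mape <= target:
            points += 0.05
    return points


# Hypothetical threshold of 10% and target of 5% MAPE.
print(threshold_points(0.20, threshold=0.10, target=0.05))  # 0.0
print(threshold_points(0.08, threshold=0.10, target=0.05))  # 0.05
print(threshold_points(0.04, threshold=0.10, target=0.05))  # 0.1
```

With five variables for each of two products there are ten variable / product pairs, so meeting every target would give 10 * 0.1 = 1.0, which matches the stated range of 0 to 1.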
Judging Criteria
Your solution will be evaluated using a combination of quantitative and qualitative criteria.
- Effectiveness (80%)
  - We will evaluate your forecasts by comparing them to the ground truth data. Please check the “Quantitative Scoring” section for details.
  - The smaller the MAPE, the better.
  - Please review the targets and thresholds above as these will be included in the scoring.
- Clarity (10%)
  - The model is clearly described, with reasonable justification for the choices made.
- Reproducibility (10%)
  - The results must be reproducible. We understand that there might be some randomness for ML models, but please try your best to keep the results the same or at least similar across different runs.
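A common first step towards reproducible runs is to fix the random seeds at the start of the script. The helper below is a minimal sketch with a hypothetical name; it seeds Python's `random` module and, if it happens to be installed, NumPy. Deep learning frameworks keep their own random state and need to be seeded separately through their own APIs.

```python
import random


def set_seed(seed=42):
    """Seed the random number generators used by the script."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass


set_seed(7)
first_run = [random.random() for _ in range(3)]
set_seed(7)
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)  # True
```

Even with fixed seeds, some GPU operations are non-deterministic, so results may still differ slightly between runs.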