Challenge Overview
Background
Over the last few months, a series of challenges has been run to generate high-quality financial forecasts for a consumer broadband brand. These challenges have been known as ‘CFO Forecasting’ in various formats. As a result, a set of high-quality, high-accuracy algorithmic forecasts has been produced for 6 financial target variables.
This new challenge is being initiated to generate a similarly high-quality forecast for a ‘sister brand’, owned by the same client, offering a similar but competing consumer broadband product set. The models for the original broadband product set have been provided as a basis for the ‘sister brand’ forecast, though they need not be used if a different algorithm achieves superior performance.
Challenge Objective
The objective of this challenge is to generate the highest accuracy predictions possible for the 6 financial variables outlined below, for each of the two products. The accuracy of the forecast must at least improve on the Threshold target quoted for each variable / product.
The model should be tailored to a 12-month forecast horizon but must be extendable beyond this time period.
The accuracy of a prediction will be evaluated using MAPE on the privatised data set over a period of 7 months.
Business context
The two products are broadband products:
-
Tortoise (legacy product, declining product, available everywhere) - slow download speeds.
-
Falcon (main product, reaching maturity, available in most of the country) - faster download speeds.
These two products are interdependent: Falcon is an upgrade of Tortoise, and Falcon's growth depends to a large extent on upgrading customers from Tortoise. There is therefore a gradual move from Tortoise to Falcon.
The six variables are financial metrics:
-
Gross adds – the number of new subscribers by product joining the brand during a month
-
Leavers – the number of subscribers by product who terminated service with the brand during that month
-
Net migrations – the number of subscribers who remained with the brand but moved to another product. This usually is an upgrade to faster broadband speed.
These three ‘discontinuous’ variables are seasonal, vary significantly from month to month, and are almost entirely dependent on market and competitor pressures at that point in time.
-
Closing Base - the number of subscribers by product at the end of the month
-
Average Revenue per user - the average monthly revenue paid by a subscriber per month for the service
-
Revenue - the total revenue generated by the subscriber base per product per month.
These three ‘continuous’ variables have a significant monthly recurring component with only small monthly incremental changes (positive or negative). They are therefore smooth and continuous, with only gradual shifts in value.
Challenge Thresholds and Targets
Your submission will be judged on two criteria.
-
Minimizing error (MAPE).
-
Achieving the Thresholds and Targets designated in the tables above.
The details will be outlined in the Quantitative Scoring section below.
Business Insights
The relationship between key financial variables
-
Closing Base for a Product = Volume Opening Base for that Product + Gross Adds – Leavers + Net Migrations to that Product
-
Net Adds = Volume Closing Base – Volume Opening Base
-
Revenue = Average Volume over the period * Base ARPU
Net Migrations is the difference between the number of existing Sandesh Brand 3 customers who move onto a specific broadband product and the number who move off it in a given month. A positive Net Migrations value (before data set privatisation) means that more customers are moving onto the product than moving off. For Tortoise, a legacy product approaching ‘end of life’, Net Migrations is therefore negative, since customers are mostly upgrading from this product to the superior Falcon product. Falcon Net Migrations is positive, since a large number of customers are upgrading to this product from Tortoise.
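The relationships above can be sketched in a few lines of Python. This is an illustrative sketch only: the class and function names are hypothetical, the numbers are made up, and "Average Volume over the period" is assumed here to be the mean of the opening and closing base, which the specification does not state. The identities apply to actual values; they are not expected to hold on the privatised values, because each variable is standardised with its own mean and standard deviation.

```python
from dataclasses import dataclass


@dataclass
class MonthlyMovement:
    """One month of actual (non-privatised) figures for one product."""
    opening_base: float
    gross_adds: float
    leavers: float
    net_migrations: float  # positive: more customers moving onto the product
    arpu: float


def closing_base(m: MonthlyMovement) -> float:
    # Closing Base = Opening Base + Gross Adds - Leavers + Net Migrations
    return m.opening_base + m.gross_adds - m.leavers + m.net_migrations


def net_adds(m: MonthlyMovement) -> float:
    # Net Adds = Closing Base - Opening Base
    return closing_base(m) - m.opening_base


def revenue(m: MonthlyMovement) -> float:
    # ASSUMPTION: average volume is the mean of opening and closing base.
    average_volume = (m.opening_base + closing_base(m)) / 2
    return average_volume * m.arpu


# Made-up figures for a declining legacy product (negative net migrations).
tortoise = MonthlyMovement(opening_base=1000, gross_adds=20, leavers=50,
                           net_migrations=-70, arpu=25.0)
print(closing_base(tortoise), net_adds(tortoise), revenue(tortoise))
```

A check of this kind can be useful after reversing the privatisation, to confirm that independently forecast variables remain mutually consistent.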
A selection of graphs to visualise key relationships:
Note:
-
For all charts, the x and y axes intersect at 0.
-
All charts are based on actual data sets, and refer to their respective axis.
Sandesh Brand 3: Revenue and Closing Base
Note:
Spikes in Dec ‘18 are outliers due to anomalies in the data set. These outliers should be smoothed out before modeling.
Slight adjustments in Oct ‘18 and Mar ‘19 in Closing Base for both Tortoise and Falcon are due to anomalies in Net Migrations - see later graphs. These anomalies can also be smoothed out.
Sandesh Brand 3 - Tortoise: Gross Adds, Churn and Net migrations (Discontinuous variables)
Note: The spike and trough in Net Migrations in Oct ‘18 and Mar ‘19 are also anomalies. These two outlier points should be rebalanced to smooth the data. This applies to both this product, Tortoise, and the next graph, Falcon.
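One simple way to smooth known outlier points is to replace them with a linear interpolation between the nearest non-anomalous neighbours. The sketch below is an assumption about how this could be done, not the method used in previous challenges; the function name and the sample values are hypothetical, and the outlier positions are supplied by the caller rather than detected automatically.

```python
def smooth_outliers(values, outlier_idx):
    """Replace flagged points by linear interpolation between the nearest
    non-flagged neighbours. Points at the edge of the series take the value
    of their single available neighbour."""
    out = list(values)
    bad = set(outlier_idx)
    n = len(out)
    for i in sorted(bad):
        left = i - 1
        while left >= 0 and left in bad:
            left -= 1
        right = i + 1
        while right < n and right in bad:
            right += 1
        if left >= 0 and right < n:
            frac = (i - left) / (right - left)
            out[i] = values[left] + frac * (values[right] - values[left])
        elif left >= 0:
            out[i] = values[left]
        elif right < n:
            out[i] = values[right]
    return out


# Made-up monthly series with an anomalous spike at position 2.
series = [1.0, 2.0, 10.0, 4.0]
print(smooth_outliers(series, [2]))
```

Because the anomalous months are named in the notes above, passing their positions explicitly avoids accidentally smoothing genuine seasonal peaks.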
Sandesh Brand 3 - Falcon: Gross Adds, Churn and Net Migrations (Discontinuous variables)
Financial Year modeling:
Sandesh reports its financial year from April to March. This may contribute to seasonality based on the financial year and its quarter ends (Jun, Sep, Dec, and Mar), rather than the calendar year.
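If financial-year seasonality is to be exposed to a model, calendar dates can be mapped to financial-year features. The following is a minimal sketch; the function name and the choice of returned features are illustrative assumptions, with only the April start month taken from the text above.

```python
from datetime import date


def fiscal_features(d: date, start_month: int = 4):
    """Return (financial year start, financial quarter 1-4, is quarter-end month)
    for a financial year beginning in start_month (April by default)."""
    offset = (d.month - start_month) % 12  # 0 for April ... 11 for March
    fy_start = d.year if d.month >= start_month else d.year - 1
    quarter = offset // 3 + 1
    is_quarter_end = offset % 3 == 2  # Jun, Sep, Dec, Mar for an April start
    return fy_start, quarter, is_quarter_end


print(fiscal_features(date(2018, 12, 1)))  # December falls in Q3 of FY 2018/19
print(fiscal_features(date(2019, 3, 1)))   # March closes the financial year
```

Features of this kind could be supplied as exogenous regressors or used to choose the seasonal period, depending on the model.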
Anonymised and Privatised data set:
‘Z-score’ is used to privatise the real data.
The following formula is used to privatise the data for all variables:
zi = (xi – μ) / σ
where zi = z-score of the ith value for the given variable
xi = actual value
μ = mean of the given variable
σ = standard deviation for the given variable
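The z-score transform and its inverse can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption: the specification does not say whether the population or sample form was used. The inverse can only be applied by someone who holds the original μ and σ.

```python
from statistics import mean, pstdev


def privatise(values):
    """Return (z-scores, mu, sigma) for one variable."""
    mu = mean(values)
    sigma = pstdev(values)  # ASSUMPTION: population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardise a constant series")
    return [(x - mu) / sigma for x in values], mu, sigma


def restore(z_scores, mu, sigma):
    """Invert the transform: x = z * sigma + mu."""
    return [z * sigma + mu for z in z_scores]


actual = [10.0, 20.0, 30.0]  # made-up values
z, mu, sigma = privatise(actual)
print(z)
print(restore(z, mu, sigma))
```

Privatised values are centred on zero and can be negative, which matters when choosing an error metric for model optimisation.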
Modeling Insight derived from previous challenges.
Previous models proven on this type of data
Various algorithmic models have proven themselves to be useful in similar previous univariate prediction challenges.
RNNs (LSTM or Seq2Seq) have proven successful on the customer movement variables - Gross Adds, Leavers, and Net Migrations - while SARIMAX or VARMAX has been more successful on the ‘smooth curve’ variables that describe the customer base - Closing Base, ARPU and Revenue.
The best examples of the previous algorithms have been included in the specification. These models can be used as a foundation for further refinement.
Optimise the algorithms by minimising RMSE
It is recommended to optimise the models by minimising RMSE, rather than MAPE, because of the privatisation method used. It is strongly believed that minimising RMSE will create the best model capable of being retrained on the real data set.
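One plausible reading of this recommendation is that z-scored values cluster around zero, so a relative error can become very large for ground truth values close to zero, while RMSE treats the same absolute error identically wherever it occurs. The sketch below illustrates this with made-up numbers; the helper names are hypothetical, and the absolute value in the relative-error denominator is an assumption made so that negative privatised values are handled.

```python
import math


def rmse(gt, pred):
    """Root mean squared error over paired values."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / len(gt))


def relative_errors(gt, pred):
    """Per-point relative error; ASSUMPTION: abs() in the denominator."""
    return [abs(g - p) / abs(g) for g, p in zip(gt, pred)]


# Both predictions miss by the same absolute amount (0.1).
gt = [0.01, 1.0]
pred = [0.11, 1.1]
print(rmse(gt, pred))             # about 0.1
print(relative_errors(gt, pred))  # about 10.0 for the near-zero point, 0.1 for the other
```

A model tuned on relative error would be dominated by the months whose privatised value happens to sit near zero, which says nothing about their importance in the real data.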
Correlations in the principal Target variables.
Initial correlation analysis demonstrates relationships between the key target variables, in line with their mathematical dependencies.
In addition, Leavers and Net Migrations (NM) show some negative correlation.
CB = Closing Base; GA = Gross Adds; NM = Net Migrations.
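A correlation check of this kind can be reproduced with a plain Pearson coefficient. The sketch below uses made-up series and a hypothetical helper name. Pearson correlation is unchanged by a z-score transform, since that is a positive linear rescaling, so coefficients computed on the privatised data match those of the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up monthly values, for illustration only.
leavers = [50.0, 55.0, 60.0, 52.0]
net_migrations = [-10.0, -14.0, -20.0, -11.0]
print(pearson(leavers, net_migrations))
```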
Final Submission Guidelines
Submission Format
Your submission must include the following items:
-
The filled test data. We will evaluate the results quantitatively (See below)
-
A report about your model, including data analysis, model details, local cross validation results, and variable importance.
-
Deployment instructions describing how to install the required libraries and how to run the code.
Expected in Submission
1. Working Python code that runs on different data sets in the same format
2. Report with a clear explanation of all the steps taken to solve the challenge (refer to the section “Challenge Details”) and of how to run the code
3. No hardcoding (e.g., column names, possible values of each column, ...) in the code is allowed. We will run the code on some different datasets
4. All models in one code with clear inline comments
5. Flexibility to extend the code to forecast for additional months
Quantitative Scoring
Given two values, one ground truth value (gt) and one predicted value (pred), we define the relative error as:
MAPE(gt, pred) = |gt - pred| / gt
MAPE scores are generated for each predicted value and then averaged by variable. Overall MAPE scores are computed by averaging the MAPE scores for each variable.
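The averaging described above can be sketched as follows. The function names and sample values are hypothetical. The sketch uses the absolute value of the ground truth in the denominator, which matches the formula above for positive values; how negative privatised values are treated by the official scorer is not stated, so this is an assumption.

```python
def relative_error(gt, pred):
    # ASSUMPTION: abs(gt) in the denominator to handle negative z-scores.
    return abs(gt - pred) / abs(gt)


def overall_mape(truth, forecast):
    """truth and forecast map each variable name to a list of values.
    Returns (per-variable MAPE, overall MAPE averaged across variables)."""
    per_variable = {}
    for name, gts in truth.items():
        errors = [relative_error(g, p) for g, p in zip(gts, forecast[name])]
        per_variable[name] = sum(errors) / len(errors)
    overall = sum(per_variable.values()) / len(per_variable)
    return per_variable, overall


# Made-up values for two hypothetical variables.
truth = {"a": [2.0, 4.0], "b": [-1.0]}
forecast = {"a": [1.0, 5.0], "b": [-1.5]}
per_variable, overall = overall_mape(truth, forecast)
print(per_variable, overall)
```

Because the overall score is a mean of per-variable means, every variable carries equal weight regardless of how many points it contains.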
You will also receive a score between 0 and 1 for the thresholds and targets that you achieve. Each threshold is worth 0.033 points and each target is worth 0.05 points. If you achieve the target for a particular variable you also receive the threshold points, for a total of 0.083 points for that variable. Your points for all the variables are added together.
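The points calculation can be sketched as below. Several details are assumptions rather than facts from this specification: that thresholds and targets are expressed as MAPE values, that a target is stricter (lower) than its threshold, that "achieving" means scoring strictly below the quoted value, and that points are awarded per variable / product pair (6 variables x 2 products, which gives a maximum of 12 x 0.083 = 0.996). The function name and the numeric threshold and target values are hypothetical.

```python
THRESHOLD_POINTS = 0.033
TARGET_POINTS = 0.05


def points_for(mape, threshold, target):
    """Points earned for one variable / product pair.
    ASSUMPTION: lower MAPE is better and 'achieved' means strictly below."""
    points = 0.0
    if mape < threshold:
        points += THRESHOLD_POINTS
    if mape < target:
        points += TARGET_POINTS
    return points


# Hypothetical threshold 0.10 and target 0.05 for a single pair.
print(points_for(0.12, threshold=0.10, target=0.05))  # misses both
print(points_for(0.08, threshold=0.10, target=0.05))  # threshold only
print(points_for(0.03, threshold=0.10, target=0.05))  # threshold and target

# Maximum total if all 12 pairs reach their targets.
max_total = sum(points_for(0.0, 0.10, 0.05) for _ in range(12))
print(round(max_total, 3))
```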
The solutions are then ranked by a combination of overall MAPE score and achievement on the designated thresholds and targets, each weighted equally. Ties will be resolved in favour of the submission with the lowest MAPE value.
Judging Criteria
Your solution will be evaluated using a hybrid of quantitative and qualitative criteria.
-
Effectiveness (80%)
-
We will evaluate your forecasts by comparing them to the ground truth data. Please check the “Quantitative Scoring” section for details.
-
The smaller the MAPE, the better.
-
Please review the targets and thresholds above as these will be included in the scoring.
-
Clarity (10%)
-
The model is clearly described, with reasonable justifications about the choice.
-
Reproducibility (10%)
-
The results must be reproducible. We understand that there might be some randomness for ML models, but please try your best to keep the results the same or at least similar across different runs
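A common way to reduce run-to-run variation is to fix the random seeds at the start of the script. The sketch below is a minimal example with a hypothetical helper name; it seeds Python's built-in generator and, if NumPy happens to be installed, NumPy's global generator. Deep learning frameworks keep their own generators and need to be seeded separately through their own APIs, and some GPU operations may remain non-deterministic even then.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; only the stdlib generator is seeded


set_seed(7)
first = random.random()
set_seed(7)
second = random.random()
print(first == second)
```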