
Challenge Overview

Background

Over the last few months, a series of challenges, run under the ‘CFO Forecasting’ name in various formats, has generated high quality financial forecasts for a consumer broadband brand.  As a result, a set of accurate algorithmic forecasts has been produced for a number of financial target variables.

This new challenge aims to generate a similarly high quality forecast for a ‘sister brand’, owned by the same client, which offers a similar broadband product set to the Small and Medium Enterprise (SME) market.  The models built for the original broadband product set have been provided as a basis for the SME forecast, though they need not be used if superior performance is achieved with a different algorithm.

Challenge Objective

The objective of this challenge is to generate the most accurate predictions possible for the 7 financial variables outlined below, for each of the two products. The accuracy of the forecast must at least improve on the Threshold target quoted for each variable / product.

The model should be tailored to a 12-month forecast horizon but must be extendable beyond this time period.

The accuracy of a prediction will be evaluated using MAPE (Mean Absolute Percentage Error) on the privatised data set over a period of 8 months.

Business context

The two products are broadband products: 

  • Tortoise (legacy product, declining, available everywhere) - slow download speeds.

  • Rabbit (main product, reaching maturity, available in most of the country) - faster download speeds.

The two products are interdependent: Rabbit is an upgrade of Tortoise, and Rabbit's growth depends to a large extent on upgrading customers from the earlier product.  There is therefore a gradual move from Tortoise to Rabbit.

The seven variables are financial metrics:

  • Gross adds – the number of new subscribers by product joining the brand during a month

  • Leavers – the number of subscribers by product who terminated service with the brand during that month

  • Net migrations – the number of subscribers who remained with the brand but moved to another product.  This is usually an upgrade to a faster broadband speed.

These three ‘discontinuous’ variables are seasonal, vary significantly from month to month, and are almost entirely driven by the market and competitor pressures at that point in time.

  • Closing Base - the number of subscribers by product at the end of the month

  • Average Revenue per new customer - the average monthly revenue paid by new customers in the first month of the customer’s contract.

  • Average Revenue per existing customer - the average monthly revenue paid for the service by all subscribers in the customer base.

  • Revenue - the total revenue generated by the subscriber base per product per month.

These four ‘continuous’ variables have a significant monthly recurring component with only a small incremental change (positive or negative) each month.  They are therefore smooth and continuous, with only gradual shifts in value.

Challenge Thresholds and Targets

 

Your submission will be judged on two criteria.

  1. Minimizing error (MAPE).

  2. Achieving the Thresholds and Targets designated in the tables above.

The details will be outlined in the Quantitative Scoring section below.

Business Insights

The relationship between key financial variables

  • Closing Base for a Product = Volume Opening Base for that Product + Gross Adds – Leavers + Net Migrations to that Product

  • Net Adds = Volume Closing Base – Volume Opening Base

  • Revenue = Average Volume over the period * Base ARPU 
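
As a rough illustration, these identities translate directly into code. The sketch below is not taken from the provided models; the column names (opening_base, gross_adds, etc.) are hypothetical, and ‘Average Volume over the period’ is approximated here as the midpoint of the opening and closing base.

    import pandas as pd

    # Hypothetical monthly frame for one product; the real data set may use different labels.
    df = pd.DataFrame({
        "opening_base":   [100_000, 101_500],
        "gross_adds":     [3_000, 2_800],
        "leavers":        [2_000, 2_100],
        "net_migrations": [500, 450],     # positive = net inflow to this product
        "base_arpu":      [25.0, 25.2],
    })

    # Closing Base = Opening Base + Gross Adds - Leavers + Net Migrations
    df["closing_base"] = (df["opening_base"] + df["gross_adds"]
                          - df["leavers"] + df["net_migrations"])

    # Net Adds = Closing Base - Opening Base
    df["net_adds"] = df["closing_base"] - df["opening_base"]

    # Revenue = Average Volume over the period * Base ARPU (midpoint approximation)
    df["revenue"] = (df["opening_base"] + df["closing_base"]) / 2 * df["base_arpu"]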

Net Migrations is the difference between the number of existing Sandesh Brand 3 customers who move onto a specific broadband product and the number who move off it each month. A positive net migrations value (before privatisation of the data set) means that more customers are moving onto the product than off it.  Therefore for Tortoise, a legacy product approaching ‘end of life’, Net Migrations is negative, since customers are mostly upgrading from this product to the superior Rabbit product.  Rabbit Net Migrations are positive, since a large number of customers are upgrading to this product from Tortoise.

Price Increases

Over the period of the data set, a price increase was applied to the existing customer base.  Details of the average increase and its timing are included in the data set.  The price increase will impact the following variables to a greater or lesser extent; building in this impact may improve the accuracy of the predictions.

  • Leavers

  • Average Revenue per existing customer

  • Revenue

The effect of the price increase should be distinguished from outliers and handled appropriately.
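
One way to build this impact in, shown as a hedged sketch only, is to encode the price increase as exogenous regressors for the model. The monthly index, the increase month and the column names below are placeholders; the real timing should be taken from the details supplied with the data set.

    import pandas as pd

    # Placeholder monthly index and price-increase month (hypothetical values).
    idx = pd.period_range("2017-01", "2019-06", freq="M")
    price_rise_month = pd.Period("2018-04", freq="M")

    exog = pd.DataFrame(index=idx)
    # Step regressor: 1 from the increase onwards (persistent shift in ARPU / Revenue).
    exog["price_step"] = (idx >= price_rise_month).astype(int)
    # Pulse regressor: 1 only in the month itself (temporary spike in Leavers).
    exog["price_pulse"] = (idx == price_rise_month).astype(int)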

 

Outlier treatment

Within the data sets provided there are a number of outliers arising for reasons outside the market.  These outliers should be treated as deemed appropriate to improve forecasting performance.

Any such treatment must be clearly documented and explained in the submission.  Here are some examples that should be considered, though they are not exhaustive.

Average Revenue per New Customer

Outlier adjustment required:

There is a significant ‘step change outlier’ between Mar 18 and Apr 18. This is due to an artificial recorrection and should be adjusted.  Please build this into the submission and document how the ‘step change outlier’ has been handled.

Please note that the increase between Dec 17 and Mar 18 is genuine and is driven by a price increase.
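
One possible treatment, shown here only as a sketch, is to re-level the series before the step month so that it is continuous across the artificial recorrection. The function below assumes a monthly pandas PeriodIndex and, as a simplification, attributes the whole Mar 18 to Apr 18 jump to the step; whatever treatment is actually used should be documented as described above.

    import pandas as pd

    def remove_step_change(series: pd.Series, step_month: str = "2018-04"):
        """Re-level the segment before `step_month` so the series is continuous
        across the artificial step; returns the adjusted series and the shift,
        so the adjustment can be reported and, if needed, reversed."""
        s = series.copy()
        step = pd.Period(step_month, freq="M")
        shift = s[step] - s[step - 1]   # crude: attributes the whole jump to the step
        s[s.index < step] += shift      # shift the pre-step segment onto the new level
        return s, shift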

Revenue - Tortoise and Rabbit

There are a number of apparent outliers in both data sets - these need to be identified and treated to improve the forecast of the resulting data set.

  • Oct 2018

  • Mar - Apr 2018 - Tortoise particularly

  • Jan - Apr 2017

Average revenue per existing customer - Tortoise and Rabbit

Since ARPU is the revenue per existing customer, it is not surprising that ARPU exhibits anomalies aligned with Revenue.  These should be identified and corrected.

  • Oct 2018

  • Mar - Apr 2018 - Tortoise particularly

  • Jan - Apr 2017

Financial Year modeling:

Sandesh reports its financial year from April to March.  This may contribute to seasonality based on the financial year and its quarters (ending Jun, Sep, Dec, and Mar), rather than the calendar year.
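
If this fiscal seasonality is modelled explicitly, one minimal sketch (assuming a monthly pandas PeriodIndex; the feature names are illustrative) is:

    import pandas as pd

    # Hypothetical monthly index; the financial year runs April - March, so quarters
    # are anchored on a March year end (Q-MAR): Q1 = Apr-Jun, ..., Q4 = Jan-Mar.
    idx = pd.period_range("2017-01", "2019-06", freq="M")
    features = pd.DataFrame(index=idx)
    features["fiscal_quarter"] = idx.asfreq("Q-MAR").quarter   # 1-4 within the financial year
    features["is_quarter_end"] = idx.month.isin([6, 9, 12, 3]).astype(int)
    features["is_fiscal_year_end"] = (idx.month == 3).astype(int)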

Anonymised and Privatised data set:

‘Z-score’ standardisation is used to privatise the real data.

For all the variables, the following formula is used to privatise the data:

            zi = (xi – μ) / σ

where zi = z-score of the ith value for the given variable

            xi  = actual value

            μ = mean of the given variable

            σ = standard deviation for the given variable
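
For reference, a minimal sketch of this transform (and of its inverse, which would only be possible with the real mean and standard deviation, which are not disclosed) is:

    import pandas as pd

    def privatise(series: pd.Series):
        """Z-score privatisation as described above: z_i = (x_i - mu) / sigma."""
        mu, sigma = series.mean(), series.std()
        return (series - mu) / sigma, mu, sigma

    def deprivatise(z: pd.Series, mu: float, sigma: float) -> pd.Series:
        """Invert the transform; requires the real mu and sigma."""
        return z * sigma + mu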

 

Modeling Insight derived from previous challenges

Previous models proven on this type of data

Various algorithmic models have proven useful in similar previous univariate prediction challenges.

RNNs (LSTM or Seq2Seq) have proven successful on the customer movement variables - Gross Adds, Leavers, and Net Migrations - while SARIMAX or VARMAX models have been more successful on the ‘smooth curve’ variables that describe the customer base - Closing Base, ARPU and Revenue.

The best examples of the previous algorithms have been included in the specification.  These models can be used as a foundation for further refinement.
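
For illustration only, a minimal SARIMAX fit with statsmodels on one of the ‘smooth curve’ variables might look like the sketch below. The (p, d, q)(P, D, Q, s) settings and the optional exogenous regressors are placeholders to be tuned, not the configuration of the provided models.

    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def fit_and_forecast(y, exog=None, exog_future=None, horizon=12):
        """Fit a SARIMAX model on a monthly series and forecast `horizon` months ahead."""
        model = SARIMAX(y, exog=exog,
                        order=(1, 1, 1),               # illustrative, to be tuned
                        seasonal_order=(1, 0, 1, 12),  # monthly seasonality
                        enforce_stationarity=False,
                        enforce_invertibility=False)
        result = model.fit(disp=False)
        return result.forecast(steps=horizon, exog=exog_future)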

Optimise the algorithms by minimising RMSE

It is recommended to optimise the models by minimising RMSE rather than MAPE, because of the privatisation method used.  It is strongly believed that minimising RMSE will create the best model, one capable of being retrained on the real data set.
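
As a reference point, RMSE on the privatised series can be computed as follows (a plain NumPy sketch; the function name is our own):

    import numpy as np

    def rmse(actual, predicted):
        """Root Mean Squared Error - the suggested optimisation objective, since
        z-scored values can sit near zero, where MAPE becomes unstable."""
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        return float(np.sqrt(np.mean((actual - predicted) ** 2)))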

Quantitative Scoring

Given two values, one ground truth value (gt) and one predicted value (pred), we define the relative error as:

    MAPE(gt, pred) = |gt - pred| / gt

We then compute the raw_score(gt, pred) as    

    raw_score(gt, pred) = max{ 0, 1 - MAPE(gt, pred) }

That is, if the relative error exceeds 100%, you will receive a zero score for that point.

The final MAPE score for each variable is computed as the average of raw_score, multiplied by 100.

Final score = 100 * average( raw_score(gt, pred) )
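
A direct reading of this scoring for a single variable, as a plain NumPy sketch:

    import numpy as np

    def final_score(gt, pred):
        """Per-point relative error, clipped raw score, averaged and scaled to 0-100."""
        gt, pred = np.asarray(gt, float), np.asarray(pred, float)
        mape = np.abs(gt - pred) / gt              # relative error per point
        raw_score = np.maximum(0.0, 1.0 - mape)    # zero once the error exceeds 100%
        return 100.0 * raw_score.mean()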

MAPE scores will be 50% of the total scoring.

You will also receive a score between 0 and 1 based on the thresholds and targets that you achieve.  Each threshold is worth 0.0314 points and each target is worth 0.04 points. If you achieve the target for a particular variable you also receive the threshold points, so you get 0.0714 points for that variable.  Your points for all the variables will be added together.  Ties, if they occur, will be resolved in favour of the lowest overall MAPE.

Judging Criteria

Your solution will be evaluated in a hybrid quantitative and qualitative way.

  • Effectiveness (80%)

    • We will evaluate your forecasts by comparing them to the ground truth data. Please check the “Quantitative Scoring” section for details.

    • The smaller the MAPE, the better.

    • Please review the targets and thresholds above as these will be included in the scoring.

  • Clarity (10%)

    • The model is clearly described, with reasonable justification for the choices made.

  • Reproducibility (10%)

    • The results must be reproducible. We understand that there might be some randomness in ML models, but please try your best to keep the results the same, or at least similar, across different runs.



Final Submission Guidelines

Submission Format

Your submission must include the following items:

  • The filled test data. We will evaluate the results quantitatively (see below).

  • A report about your model, including data analysis, model details, local cross validation results, and variable importance. 

  • Deployment instructions describing how to install the required libraries and how to run the code.

Expected in Submission

1. Working Python code which works on different sets of data in the same format

2. Report with a clear explanation of all the steps taken to solve the challenge (refer to the section “Challenge Details”) and of how to run the code

3. No hardcoding (e.g., column names, possible values of each column, ...) is allowed in the code. We will run the code on different datasets

4. All models in one code with clear inline comments 

5. Flexibility to extend the code to forecast for additional months

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30115865