
Challenge Overview

The objective of this challenge is to generate time-series forecasts with the highest possible accuracy for the 8 financial variables outlined below, for each of the two products.

The accuracy of the forecast must at least meet the Threshold quoted for each variable / product.  Extra kudos will be gained for exceeding the Target accuracy!

The model should be tailored to a 12-month forecast horizon but must be extendable beyond this time period.

The accuracy of a prediction will be evaluated using MAPE (Mean Absolute Percentage Error) on the privatised data set over a period of 7 - 9 months.

Introduction

A mobile network, referred to here as Sandesh, is looking to create a high-quality, high-accuracy forecast for a number of key financial metrics used to manage its business on a monthly and annual cycle.  This challenge is focused on the network’s most important brand across both its handset (and SIM) and SIM-only product groups, and includes the most significant measures of the size of its customer base and the key measures of value.  The business operates on an annual cycle and is therefore interested in forecast predictions over a 12 - 15 month time horizon.

It is anticipated that a number of iterations will be required to reach a final solution with the necessary high accuracy.  This challenge is therefore expected to deliver good accuracy, as well as a strong foundation to refine in the next iteration.

Business Context

The two products are mobile products: 

  • Leopard.  A bundle of a 24-month mobile / cellular contract, including airtime and a data allowance, and a new mobile handset (cell phone). This is the traditional package sold in the mobile industry.

  • Panther.  A SIM card sold with a contract including airtime and a data allowance.  No handset is sold with this product. The contract length is typically either 24 months or a simple 30-day rolling contract.  This is a more recent product type, introduced as consumers have looked to source their handsets separately from their airtime contract.

There is a loose relationship between these two products: consumers have looked to reduce costs and increase flexibility by moving to Panther, although the Leopard product remains the more popular solution.

The eight variables are financial metrics:

  • Gross adds – the number of new subscribers by product joining the brand during a month

  • Leavers (aka Churn) – the number of subscribers by product who terminated service with the brand during that month

  • Net migrations – the number of subscribers who remained with the brand but moved to another product.  This is usually a move from the handset product (Leopard) to SIM only (Panther) to reduce costs.

These three ‘discontinuous’ variables are seasonal, vary significantly from month to month, and depend on the market, the nature of the customer offering, and competitor pressures at that point in time. The other financial metrics are:

  • Closing Base - the number of subscribers by product at the end of the month

  • Average Revenue per User excluding equipment (ARPU excl. equip.) - the average monthly revenue paid by a subscriber for their ‘airtime and data allowance’ service.  For the Leopard product, this metric excludes the cost of the handset and covers only the airtime cost. In the case of Panther, since no equipment is included, this is the only measure of ARPU.

  • Mobile Data Revenue - the revenue generated by the subscriber base per product per month for the ‘airtime and data’ service

  • Average Revenue per User Equipment - the average monthly revenue paid by a subscriber for the equipment.  This is only applicable to Leopard.

  • Equipment revenue - the total revenue generated from the sale of equipment as part of the bundled ‘Leopard’ product.  This variable is only applicable to Leopard.

These five ‘continuous’ variables have a significant monthly recurring component with only a small incremental (positive or negative) monthly change.  They are therefore smooth and continuous, with only gradual shifts in value.

Additional data sets are included to provide a range of independent variables that may prove valuable in forecasting the target variables.  A complete list is included with the data set, but a selection is described here:

  • Brand Net Promoter Score (Brand NPS). A measure of customer satisfaction with the product based on customers’ willingness to recommend the product to their friends and colleagues.

  • Leavers - Number of customers about to leave each month.  This is the number of customers in the final month of their contract; these customers will be legally allowed to leave, or to upgrade their contract without punitive charges, the following month.  This metric is a strong indicator of ‘Leavers’ volumes - please see the Business Insight section.  It is closely linked to Gross Adds volumes from 24 months earlier.

  • Average number of months remaining on customer contract per month.  This is the average time left to run on the contracts of customers in the subscriber base and is a measure of the relative age of the customers on the base.  For this customer base, the contract lengths are fairly evenly spread across the 24-month and 30-day rolling periods.

  • Out of Contract %.  This is a measure of the proportion of the customer base that is no longer within its contract period.  These customers are legally allowed to leave without cost, and this measure will therefore have a clear link to ‘Leavers’.  For historic reasons, this measure should only be considered from April 2016 onwards.

  • Out of Bundle Revenue.  This revenue is generated when customers exceed their monthly ‘airtime and data’ allowances.  This is not included in Mobile data revenue, nor does it contribute to ARPU.

  • Roaming Revenue.  This revenue is generated when customers use their mobile phones abroad and are charged over and above their ‘airtime and data’ allowance.  This revenue is not included in Mobile data revenue, nor does it contribute to ARPU.

Business Insight

For Mobile Data Revenue, only data from April 2016 onwards should be considered.

For Average Revenue per User excluding equipment sales (labeled as Handset (Mobile Data)), only data from April 2016 onwards should be considered.

Out of Contract % (for Leopard and Panther) may be driving Leaver volumes.  As the percentage of customers out of contract increases, so the number of customers allowed or able to leave without punitive charges increases - this is likely to drive, and explain, increases in Leavers.

Correlation:  There is a 52% correlation between Churn and absolute Out of Contract volumes (Closing Base x Out of Contract%)
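
As an illustration only, a correlation of this kind could be checked along the following lines.  This is a minimal sketch assuming a pandas DataFrame with one row per month and hypothetical column names (closing_base, out_of_contract_pct, leavers); the real data set uses its own Generic Keys.

    import pandas as pd

    # df holds one row per month with the privatised series.  The column names
    # below are hypothetical placeholders, not the real Generic Keys.
    def churn_out_of_contract_correlation(df: pd.DataFrame) -> float:
        # Absolute Out of Contract volume = Closing Base x Out of Contract %
        abs_out_of_contract = df["closing_base"] * df["out_of_contract_pct"]
        # Pearson correlation with monthly Leavers (Churn)
        return abs_out_of_contract.corr(df["leavers"])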

There is a strong relationship between the variables describing the Customer base.  

  • Out of Contract % describes the number of customers eligible to leave

  • Leavers - Number of customers about to churn per month describes the number of customers about to come out of contract and drive an increase in ‘Out of Contract%’

  • Gross Adds from two years previously for Leopard and one year previously for Panther closely defines the ‘Number of customers about to churn per month’ as these customers come to the end of their contract period.

  • Upgrades - the volume of existing customers changing their contract

Correlations: 

  • Gross Adds from two years earlier is 61% correlated to Leavers
  • Upgrades is 72% correlated to Leavers (when considered from Apr 2016 onwards)
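
One way to exploit these lagged relationships in a model is to build shifted regressors from Gross Adds.  The sketch below assumes a monthly pandas DataFrame with hypothetical column names; the 24-month and 12-month lags follow the business insight above (two years for Leopard, one year for Panther).

    import pandas as pd

    def add_lagged_gross_adds(df: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical column names; lags chosen to mirror the contract lengths.
        out = df.copy()
        out["leopard_gross_adds_lag24"] = out["leopard_gross_adds"].shift(24)
        out["panther_gross_adds_lag12"] = out["panther_gross_adds"].shift(12)
        return out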

Financial Year modeling:

Sandesh reports its financial year from April to March.  This may contribute to seasonality based on the financial year and its quarters (ending in Jun, Sep, Dec, and Mar), rather than the calendar year.
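
If this financial-year seasonality is encoded explicitly as features, one possible sketch (assuming a Time Period column formatted like 2019-08) is:

    import pandas as pd

    def add_fiscal_features(df: pd.DataFrame, time_col: str = "Time Period") -> pd.DataFrame:
        out = df.copy()
        months = pd.to_datetime(out[time_col], format="%Y-%m")
        # Shift the calendar so that April becomes month 1 of the financial year.
        out["fiscal_month"] = (months.dt.month - 4) % 12 + 1
        # Financial quarters then end in Jun (3), Sep (6), Dec (9) and Mar (12).
        out["fiscal_quarter"] = (out["fiscal_month"] - 1) // 3 + 1
        return out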

Anonymised and Privatised data set:

A ‘Z-score’ transform is used to privatise the real data.

For all the variables, the following formula is used to privatise the data:

            zi = (xi – μ) / σ

where zi = z-score of the ith value for the given variable

            xi  = actual value

            μ = mean of the given variable

            σ = standard deviation for the given variable
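
For intuition, the forward transform and its inverse look like the sketch below.  Note that the per-variable μ and σ are not published with the privatised data, so the inverse is shown only to make the relationship explicit.

    import pandas as pd

    def privatise(series: pd.Series) -> pd.Series:
        # z_i = (x_i - mu) / sigma, applied per variable (column)
        return (series - series.mean()) / series.std()

    def deprivatise(z: pd.Series, mu: float, sigma: float) -> pd.Series:
        # Inverse transform x_i = z_i * sigma + mu; mu and sigma are not
        # provided to competitors, so this cannot be applied in practice.
        return z * sigma + mu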

Targets and Thresholds

Your submission will be judged on two criteria.

  1. Minimizing error (MAPE).

  2. Achieving the Thresholds and Targets designated in the tables above.

It is recommended to optimise the models to minimise RMSE, as opposed to MAPE.  The privatisation method used (see the ‘Anonymised and Privatised data set’ section above) can distort the error analysis.

The details will be outlined in the Quantitative Scoring section below.

Quantitative Scoring

Given two values, one ground truth value (gt) and one predicted value (pred), we define the relative error as:

    MAPE(gt, pred) = |gt - pred| / gt

We then compute the raw_score(gt, pred) as

    raw_score(gt, pred) = max{ 0, 1 - MAPE(gt, pred) }

That is, if the relative error exceeds 100%, you will receive a zero score for that data point.

The final MAPE score for each variable is computed as the average of raw_score across all time periods, multiplied by 100.

Final score = 100 * average( raw_score(gt, pred) )
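
A minimal sketch of this scoring for a single variable, assuming NumPy arrays of ground-truth and predicted values, might look like:

    import numpy as np

    def final_score(gt: np.ndarray, pred: np.ndarray) -> float:
        # Relative error as written in the challenge formula; a zero (or negative)
        # ground-truth value would need special handling that the text does not specify.
        rel_err = np.abs(gt - pred) / gt
        # raw_score = max(0, 1 - relative error): errors above 100% score zero.
        raw_score = np.maximum(0.0, 1.0 - rel_err)
        # Average over all time periods, scaled to 0-100.
        return 100.0 * float(raw_score.mean())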

MAPE scores will be 50% of the total scoring.

You will also receive a score between 0 and 1 for the thresholds and targets that you achieve.  Each threshold is worth 0.0314 points and each target is worth 0.04 points. If you achieve the target for a particular variable you also get the threshold points, so you receive 0.0714 points for that variable.  Your points for all the variables will be added together.  Ties, if they occur, will be resolved by the lowest overall MAPE score.

Judging Criteria

Your solution will be evaluated using a hybrid of quantitative and qualitative criteria.

  • Effectiveness (80%)

    • We will evaluate your forecasts by comparing them to the ground truth data. Please check the “Quantitative Scoring” section for details.

    • The smaller the MAPE, the better.

    • Please review the targets and thresholds above as these will be included in the scoring.

  • Clarity (10%)

    • The model is clearly described, with reasonable justification for the choices made.

  • Reproducibility (10%)

    • The results must be reproducible. We understand that there might be some randomness for ML models, but please try your best to keep the results the same or at least similar across different runs.



Final Submission Guidelines

Submission Format

Your submission must include the following items:

  • The filled test data. We will evaluate the results quantitatively (see below).

    • Please use Time Period and the Generic Keys as the column names.

    • The values in the Time Period column look like 2019-08.

    • The values in each Generic Key column are the predicted values, i.e., floating-point numbers.

    • The final spreadsheet has an Nx(M+1) shape, where N is the number of time periods and M is the number of variables to predict in this challenge. The “+1” is the Time Period column. A sketch of assembling such a table is shown after this list.

  • A report about your model, including data analysis, model details, local cross validation results, and variable importance. 

  • Deployment instructions describing how to install the required libraries and how to run the code.
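
A minimal sketch of assembling a submission table of the Nx(M+1) shape described above, with hypothetical Generic Key names, might look like:

    import pandas as pd

    def build_submission(time_periods, predictions: dict) -> pd.DataFrame:
        # time_periods: e.g. ["2019-08", "2019-09", ...]                (N rows)
        # predictions:  {generic_key: list of floats, one per period}   (M columns)
        out = pd.DataFrame({"Time Period": time_periods})
        for key, values in predictions.items():
            out[key] = values
        return out  # N x (M + 1); the "+1" is the Time Period column

    # Hypothetical usage:
    # build_submission(["2019-08", "2019-09"],
    #                  {"GK_001": [0.12, 0.15],
    #                   "GK_002": [-0.30, -0.27]}).to_csv("submission.csv", index=False)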

Expected in Submission

1. Working Python code that works on different sets of data in the same format

2. Report with a clear explanation of all the steps taken to solve the challenge (refer to the “Challenge Details” section) and of how to run the code

3. No hardcoding (e.g., column names, possible values of each column, ...) is allowed in the code. We will run the code on different datasets

4. All models in one code with clear inline comments 

5. Flexibility to extend the code to forecast additional months (see the sketch below)
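
As an illustration of point 5, the forecast horizon can be passed as a parameter rather than hard-coded to 12 months.  This sketch assumes a fitted model object exposing a generic predict(n_periods)-style call (a placeholder, not a specific library API):

    import pandas as pd

    def forecast(history: pd.DataFrame, model, horizon_months: int = 12) -> pd.DataFrame:
        # `model` is any fitted object exposing a predict(n_periods)-style call.
        last_period = pd.Period(history["Time Period"].iloc[-1], freq="M")
        future_periods = [str(last_period + i) for i in range(1, horizon_months + 1)]
        preds = model.predict(horizon_months)
        return pd.DataFrame({"Time Period": future_periods, "prediction": preds})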

Review style: Final Review (Community Review Board)

Approval: User Sign-Off

ID: 30114766