
Challenge Overview

 

Background

We have previously run a series of code challenges related to this forecasting problem. In this challenge, we focus on the normalised net migrations across three products: Tortoise, Rabbit and Cheetah.

Challenge Objectives

For Net Migrations (Normalised) across the three products, the objective of this challenge is to build a forecast model that minimises error, measured as MAPE against actual performance, without overfitting. The model should be tailored to a 12-month forecast horizon but must be extendable beyond that period. Given the limited data set available, accuracy will be measured as a reduction in MAPE over a 6-month period.

Net Migrations (Normalised) is the target variable for which a forecast model must be generated. For each product, this variable has been normalised to reflect performance in a standard trading month (see the section on Trading Days below). It is included in the privatised data set. The privatised actual performance, prior to normalisation, is also included for reference, but the challenge objective is based on the Net Migrations (Normalised) variable.

Challenge Details

Baseline Models

We will provide two baseline models: an LSTM model and a SARIMAX model. The code can be found in the Code Document forums.

Training Data

The training data set covers all data before 18/19_Q4_Mar. Each row describes an item on a certain date, with the following columns. The password for the file can be found in the Code Document forums.

  • Generic Group

  • Generic Brand

  • Generic Product Category

  • Generic Product

  • Generic Variable

  • Generic Sub-Variable

  • Generic LookupKey

  • Units

  • Time Period (a month)

The items include metrics such as revenue, volume base, gross adds, leavers, net migrations and average revenue per customer (see the Background section) for Broadband in the Consumer market, also broken down to the Product level.

The ground truth file has the same number of rows, but only one column: the target value. You can use this data set to train and test your algorithm locally.

Testing Data

The testing data set covers the months from 18/19_Q4_Mar to the present. It has the same format as the training set, but no ground truth is provided.

You are asked to make predictions for the testing data. You will need to append a final column named “Value” to the testing data, filled with your model’s predictions.
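As a minimal sketch, assuming the test set is loaded with pandas (the column names and sample values below are placeholders from the column list above, not a prescribed API):

```python
import pandas as pd

# Hypothetical sketch: append model predictions as the final "Value" column
# of the testing data. Column names here are illustrative placeholders.
def append_predictions(test_df: pd.DataFrame, predictions) -> pd.DataFrame:
    out = test_df.copy()
    out["Value"] = list(predictions)  # one prediction per test row
    return out

test_df = pd.DataFrame({
    "Generic Product": ["Rabbit", "Cheetah"],
    "Time Period": ["19/20_Q1_Apr", "19/20_Q1_Apr"],
})
filled = append_predictions(test_df, [0.12, -0.07])
```

The filled frame can then be written back out (e.g. with `to_csv`) in the same format as the input.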

Measurement

We will evaluate your predictions on hold-out test cases using MAPE, as described in the Quantitative Scoring section below.

Additional Information

Business Insight:

The three products are broadband products: 

  • Tortoise (legacy product, declining, available everywhere)

  • Rabbit (biggest product, reaching maturity, available in most of the country)

  • Cheetah (best and most expensive product, new and growing rapidly, but only available in limited geographies)

There is no obligation for customers to upgrade to newer/better products. The footprint of Cheetah is small but growing. Many customers do not upgrade immediately when a new product becomes available: uptake lags footprint.

Net Migrations is the difference per month between the number of existing Sandesh Brand 1 customers moving onto a specific broadband product and the number moving off it. A positive net migrations value (before privatisation) means that more customers are moving onto the product than off it. Therefore, for Tortoise, a legacy product approaching ‘end of life’, Net Migrations is negative, since customers are mostly upgrading from this product to the superior Rabbit product. Rabbit net migrations are positive, since a large number of customers are upgrading to it from Tortoise; however, a much smaller number are starting to upgrade from Tortoise and Rabbit to the new Cheetah product.

The relationship between Net Migrations across the three products:

Net migrations, when considered across all three products, sum to zero. Because net migrations reflect movement between products, ‘Net migrations - Tortoise’ + ‘Net migrations - Rabbit’ + ‘Net migrations - Cheetah’ = 0. Up until the launch of Cheetah in late 2017, ‘Net migrations - Tortoise’ + ‘Net migrations - Rabbit’ = 0.
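This constraint can double as a sanity check or a post-processing step on per-product forecasts. A minimal sketch (the even reallocation of the residual is one arbitrary choice, not part of the challenge specification):

```python
def enforce_zero_sum(tortoise: float, rabbit: float, cheetah: float):
    """Adjust three per-product net-migration values so they sum to zero."""
    residual = tortoise + rabbit + cheetah
    adjustment = residual / 3.0  # spread the residual evenly (one simple choice)
    return tortoise - adjustment, rabbit - adjustment, cheetah - adjustment

t, r, c = enforce_zero_sum(-105.0, 98.0, 10.0)
total = t + r + c  # effectively zero, up to floating point
```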

The relationship between key financial variables

  • Volume Closing Base for a Product = Volume Opening Base for that Product + Gross Adds – Leavers + Net Migrations to that Product

  • Volume Net Adds = Volume Closing Base – Volume Opening Base

  • Revenue = Average Volume over the period * Base ARPU 
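These identities translate directly into code. A small sketch with illustrative numbers (the function and variable names are the editor's own):

```python
def closing_base(opening_base, gross_adds, leavers, net_migrations):
    # Volume Closing Base = Opening Base + Gross Adds - Leavers + Net Migrations
    return opening_base + gross_adds - leavers + net_migrations

def net_adds(closing, opening):
    # Volume Net Adds = Closing Base - Opening Base
    return closing - opening

def revenue(average_volume, base_arpu):
    # Revenue = Average Volume over the period * Base ARPU
    return average_volume * base_arpu

closing = closing_base(1000.0, 50.0, 30.0, 20.0)
adds = net_adds(closing, 1000.0)
```

These relationships are useful both as features and as consistency checks on a fitted model's outputs.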

Net Migrations series - Rabbit as an example ...

Note: Net migrations - Tortoise is the mirror image of the Rabbit trend.

Net migrations - Rabbit exhibits some distinctive time-series patterns.

Trends:

There appear to be four periods of differing trend from April ’11 to August ’19, including a short-term peak in early 2019. The business decisions behind these trend shifts are being investigated.

Seasonality:

When considering the 6-point moving average of the normalised data set, seasonality is clearly evident in at least the last three years: a peak in Dec/Jan every year and a trough every August since 2016.
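The 6-point moving average can be computed with a simple trailing window; a minimal sketch:

```python
def moving_average(series, window=6):
    """Trailing moving average; returns len(series) - window + 1 points."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

smoothed = moving_average([1, 2, 3, 4, 5, 6, 7], window=6)
```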

Noise:

The impact of trading days has been removed in the normalised data set. Though this removes some of the monthly peaks in the actual data, the normalised data set remains variable.

Trading Days’ impact has been removed 

Sandesh reports its financials in trading months, weeks and days. Every trading month contains a whole number of trading weeks, either 4 or 5, to maintain consistency as the units roll up. This means that any given month has either 28 or 35 trading days. This has been found to have a very significant impact on the forecast, especially for Gross Adds, Leavers and Net Migrations.

To allow for this irregular and somewhat artificial ‘noise’ in the key variables of Gross Adds, Leavers and Net Migrations, these variables have been normalised to a standard 30.3-day month prior to privatisation. Predictions are therefore required for these normalised values, and the ‘noise’ will be added back in after prediction.
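The brief does not spell out the exact normalisation formula; a plausible reading, assumed here, is simple proportional scaling to a 30.3-day month:

```python
STANDARD_MONTH_DAYS = 30.3  # standard trading month used in the brief

def normalise(value, trading_days):
    # Assumed proportional scaling; the exact formula is not given in the brief.
    return value * STANDARD_MONTH_DAYS / trading_days

def denormalise(normalised_value, trading_days):
    # Add the trading-day 'noise' back in after prediction.
    return normalised_value * trading_days / STANDARD_MONTH_DAYS
```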

The coefficient of variation for Net Migrations on the actual and normalised data is:

    Tortoise product: 71 (actual), 37.5 (normalised)

    Rabbit product: 196 (actual), 77 (normalised)

While variation remains significant in the normalised data set, it is greatly reduced by this process of normalisation.
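The coefficient of variation can be reproduced on any series as follows. This is a sketch: whether the published figures use the population or sample standard deviation is not stated, so the population form is assumed.

```python
import statistics

def coefficient_of_variation(series):
    # CV = 100 * standard deviation / |mean| (population std dev assumed)
    return 100.0 * statistics.pstdev(series) / abs(statistics.fmean(series))
```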

Data regarding the number of trading days for each month is provided for information.

Financial year modeling

Financial year for Sandesh is April to March (instead of January to December), hence Q1 is April, May and June.
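Any calendar features derived for modelling must follow this April-to-March year. A small sketch (the helper name is the editor's own):

```python
def fiscal_quarter(calendar_month: int) -> int:
    """Map a calendar month (1-12) to Sandesh's financial quarter (Q1 = Apr-Jun)."""
    return (calendar_month - 4) % 12 // 3 + 1
```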

Challenge structure

Anonymised and Privatised data set:

‘Z-score’ is used to privatise the real data.

For all variables, the following formula is used to privatise the data:

            zi = (xi – μ) / σ

where zi = z-score of the ith value for the given variable

            xi  = actual value

            μ = mean of the given variable

            σ = standard deviation for the given variable
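The transformation above can be reproduced per variable; a minimal sketch (the population standard deviation is assumed, as the brief does not specify):

```python
import statistics

def z_scores(values):
    # z_i = (x_i - mu) / sigma, computed per variable
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std dev assumed
    return [(x - mu) / sigma for x in values]

zs = z_scores([1.0, 2.0, 3.0])
```

Note that z-scored values are centred on zero, so privatised series can be negative even where the underlying quantity is positive.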

Modeling Insight derived from previous challenges.

An LSTM model (see the included code) and a SARIMAX (univariate) model have proven to be the best algorithms for predicting the target variables in this data set. These codebases can be found in the Code Document forum for this challenge.

LSTM has proven successful on the customer-movement variables (Gross Adds and Leavers), while SARIMAX is most successful on the ‘smooth curve’ variables that describe the customer base (Closing Base, ARPU and Revenue).

However, neither model has demonstrated a capability to accurately predict Net Migrations thus far. Both models are included as a foundation or starting point, but it is anticipated that they will need modification to account for the factors driving the trend changes over the data set.

 


Final Submission Guidelines

Submission Format

Your submission must include the following items:

  • The filled test data. We will evaluate the results quantitatively (see below).

  • A report about your model, including data analysis, model details, local cross validation results, and variable importance. 

  • Deployment instructions describing how to install the required libraries and how to run the code.

Expected in Submission

1. Working Python code that runs on different data sets in the same format

2. A report with a clear explanation of all the steps taken to solve the challenge (see the “Challenge Details” section) and of how to run the code

3. No hardcoding in the code (e.g., column names, possible values of each column) is allowed, as we will run the code on different data sets

4. All models in one codebase, with clear inline comments

5. Flexibility to extend the code to forecast for additional months

Quantitative Scoring

Given two values, one ground-truth value (gt) and one predicted value (pred), we define the relative error as:

    MAPE(gt, pred) = |gt - pred| / |gt|

We then compute the raw_score(gt, pred) as 

    raw_score(gt, pred) = max{ 0, 1 - MAPE(gt, pred) }

That is, if the relative error exceeds 100%, you will receive a zero score for that case.

The final score is computed based on the average of raw_score, and then multiplied by 100.

Final score = 100 * average( raw_score(gt, pred) )
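The scoring above can be sketched directly; note that abs() in the MAPE denominator is assumed here so the error stays non-negative when the z-scored ground truth is negative:

```python
def mape(gt, pred):
    # Relative error; abs() in the denominator is assumed, since
    # z-scored ground-truth values can be negative.
    return abs(gt - pred) / abs(gt)

def raw_score(gt, pred):
    # Zero score once the relative error exceeds 100%.
    return max(0.0, 1.0 - mape(gt, pred))

def final_score(gts, preds):
    return 100.0 * sum(raw_score(g, p) for g, p in zip(gts, preds)) / len(gts)
```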

We will use this as part of the evaluation.

Judging Criteria

Your solution will be evaluated in a hybrid quantitative and qualitative way.

  • Effectiveness (80%)

    • We will evaluate your forecasts by comparing them to the ground-truth data. Please check the “Quantitative Scoring” section for details.

    • The smaller the MAPE, the better.

    • The model must achieve better performance than the provided baseline models.

  • Clarity (10%)

    • The model is clearly described, with reasonable justification for the choices made.

  • Reproducibility (10%)

    • The results must be reproducible. We understand that there may be some randomness in ML models, but please try your best to keep the results the same, or at least similar, across different runs.

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30110295