Challenge Overview
Challenge Objectives
We previously ran an ideation challenge, which produced some initial data analysis and documented multiple approaches. Based on the ideation submissions, this challenge asks you to:
- Perform detailed analysis work using Jupyter Notebook
- Build the model
- Train the model and predict the forecast values for the next five years based on the given training dataset.
- Include extra, publicly-available data that helps strengthen your forecasts.
Project Background
Telecom providers sell products such as broadband and mobile phone contracts. These contracts consist of products of different types and capabilities, which are outlined below:
- Mobile
- Broadband
For each of the products the customer would like to forecast the following:
- Volume base (opening base and closing base) – the total number of subscribers / connections at a point in time
- Gross adds – the number of new subscribers / connections during a period
- Leavers – the number of subscribers / connections which terminated service with the provider during that period
- Net migrations – the number of subscribers / connections which remained with the provider but moved to another product
- Average revenue per customer/connection
- Revenue
- Re-grades/upsell
Average revenue per customer = Revenue / ((Opening Base + Closing Base) / 2)
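As a minimal sketch, the average revenue formula above can be written as a small Python function. The function name and the sample figures are illustrative assumptions, not part of the challenge dataset.

```python
def average_revenue_per_customer(revenue, opening_base, closing_base):
    """Revenue divided by the average of the opening and closing base."""
    average_base = (opening_base + closing_base) / 2
    if average_base == 0:
        raise ValueError("average base is zero; average revenue is undefined")
    return revenue / average_base


# Illustrative numbers only (not taken from the challenge dataset).
arpu = average_revenue_per_customer(revenue=52500.0, opening_base=1000, closing_base=1100)
print(arpu)  # 50.0
```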
Our model would eventually need to account for:
- Different subscription lengths
- Shocks introduced in the market, e.g. competitor price disruption, seasonal releases of new product lines (e.g. Samsung in Spring, Apple in the Fall), and changes in market or technology-related regulation
- Forecasts that are within 20% of eventual actual results for the following year. The 20% figure is a target rather than a requirement for this challenge, and the solutions that perform best will receive higher ratings
- What-if scenario building by users (future scope)
- Ability to load new datasets in order to test them for relevance against the validated forecast (e.g. whether they improve accuracy; future scope)
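One possible way to check a forecast against the 20% target in the list above is sketched here. It assumes "within 20%" means the absolute difference between forecast and actual is at most 20% of the actual value; the brief does not define the measure precisely, and the function name is hypothetical.

```python
def within_tolerance(forecast, actual, tolerance=0.20):
    """True when |forecast - actual| is at most tolerance * |actual|.

    Assumption: the error is measured relative to the actual value.
    """
    if actual == 0:
        return forecast == 0  # relative error is undefined for a zero actual
    return abs(forecast - actual) <= tolerance * abs(actual)


# Illustrative values only.
print(within_tolerance(forecast=115, actual=100))  # True
print(within_tolerance(forecast=125, actual=100))  # False
```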
Technology Stack
- Python 3.7.x
Individual Requirements
Data Analysis
Perform data analysis on the given dataset and save the notebook, which must be shared with your submission. The notebook should cover the following items:
- Feature importance and selection procedures, preferably using histograms
- We’re specifically interested in measuring how strongly the data provided in this challenge, and any data you add on your own, impact the “forecastability” of gross-adds, net migrations, leavers, volume base and average revenue per customer. Which features account most for changes in values?
- If you tried other methods before settling on an approach, you may keep that work in the notebook for reference
- Code should be documented appropriately (within the code): explain how the different areas of the model work.
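As a rough first-pass screen for the feature importance item above, candidate features can be ranked by the absolute Pearson correlation with a target series such as gross adds. This is only a sketch under stated assumptions: the feature names and values are hypothetical, correlation captures linear association only, and it is not a substitute for model-based importance measures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort feature names by |correlation| with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical monthly features and target; values are made up for illustration.
gross_adds = [120, 135, 150, 160, 180, 210]
features = {
    "marketing_spend": [10, 12, 14, 15, 18, 22],
    "competitor_price": [30, 30, 29, 31, 30, 30],
}
for name, score in rank_features(features, gross_adds):
    # A simple text bar per feature stands in for a plotted chart.
    print("{:<18} {:.2f} {}".format(name, score, "#" * int(score * 20)))
```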
The dataset provided in the forum can be used for training and testing your model. You may split it into training and testing sets.
Please note that we have withheld the current year’s values for final validation of your (and future) models.
Please also note that the model is expected to produce monthly, quarterly and annual forecasts. By default, data provided is monthly. When the input data is provided only quarterly or annually, it has been logged in the last month of the quarter / year and the other months have been left blank.
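The roll-up from monthly to quarterly and annual figures could be sketched as follows. It assumes flow metrics (e.g. gross adds, leavers, revenue) are summed over the period while point-in-time metrics (e.g. closing base) take the last available month; that treatment, the function name, and the sample values are assumptions, not something the brief specifies. Blank months are represented as None and skipped, so a series logged only in the last month of each quarter passes through unchanged.

```python
def aggregate(monthly, period="quarter", how="sum"):
    """Roll up {(year, month): value or None} to quarterly or annual figures.

    how="sum"  is meant for flow metrics, how="last" for point-in-time metrics.
    """
    if period not in ("quarter", "year") or how not in ("sum", "last"):
        raise ValueError("unsupported period or aggregation")
    buckets = {}
    for (year, month), value in sorted(monthly.items(), key=lambda item: item[0]):
        if value is None:
            continue  # blank month: nothing was logged
        key = (year, (month - 1) // 3 + 1) if period == "quarter" else year
        buckets.setdefault(key, []).append(value)
    if how == "sum":
        return {key: sum(values) for key, values in buckets.items()}
    return {key: values[-1] for key, values in buckets.items()}


# Illustrative monthly gross adds for one year (made-up values).
gross_adds = {(2019, m): 10 * m for m in range(1, 13)}
print(aggregate(gross_adds, "quarter"))  # keys are (year, quarter)
print(aggregate(gross_adds, "year"))     # keys are the year

# A series provided only quarterly: logged in the last month, others blank.
quarterly_only = {(2019, 1): None, (2019, 2): None, (2019, 3): 45}
print(aggregate(quarterly_only, "quarter"))
```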
You are permitted to source your own data sets and models to improve the accuracy of your forecasts. The data sets can be of the same frequency intervals as the data set in the forum or at other frequency intervals (e.g. daily).
You must request approval in the forum before using externally sourced data sets. Data sets sourced externally must come from a credible source. Your test results must quantify the relative improvement in the forecast provided by your external data set(s).
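One simple way to quantify the relative improvement from an external data set is the fractional reduction in an error metric (for example MAE) between a baseline model and the same model with the external data added. This is a sketch of one reasonable definition, not a prescribed measure; the function name and the numbers are hypothetical.

```python
def relative_improvement(baseline_error, augmented_error):
    """Fractional error reduction from adding external data (positive = better)."""
    if baseline_error == 0:
        raise ValueError("baseline error is zero; relative improvement is undefined")
    return (baseline_error - augmented_error) / baseline_error


# Illustrative errors only: baseline MAE of 200, MAE of 150 with external data.
print(relative_improvement(200.0, 150.0))  # 0.25, i.e. a 25% reduction in error
```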
Prediction Format
You must submit a CSV file for each of the forecasts below, using the same set of product categories as the training dataset, with monthly columns. A template is provided in the forum. Create a new tab for each evaluation of your model.
Forecast Evaluation
To test the accuracy of your forecast, provide the mean absolute error (MAE) and root mean squared error (RMSE) values for the following test scenarios:
- Remove every other month
- Remove random months for 15% of the data set
- Repeat the random-removal scenario (the second item above) three times
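The test scenarios above could be set up along the following lines. This is a sketch under assumptions: "every other month" is taken to mean holding out the second, fourth, sixth month and so on, the random scenario is repeated with different seeds, and `mean_of_visible` is a deliberately naive placeholder that you would replace with your own model. All function names are hypothetical.

```python
import math
import random


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def every_other_month(n_months):
    """Indices of the held-out months when every second month is removed."""
    return list(range(1, n_months, 2))


def random_months(n_months, fraction=0.15, seed=None):
    """Indices of a random hold-out covering `fraction` of the months."""
    n_removed = max(1, round(n_months * fraction))
    return sorted(random.Random(seed).sample(range(n_months), n_removed))


def evaluate(series, held_out, predict):
    """Hide the held-out months, predict them, and return (MAE, RMSE)."""
    hidden = set(held_out)
    visible = [None if i in hidden else v for i, v in enumerate(series)]
    predicted = [predict(visible, i) for i in held_out]
    actual = [series[i] for i in held_out]
    return mae(actual, predicted), rmse(actual, predicted)


def mean_of_visible(visible, index):
    """Placeholder model: predicts the mean of the months that were kept."""
    known = [v for v in visible if v is not None]
    return sum(known) / len(known)


# Made-up 36-month series; replace with a real metric from the dataset.
series = [100 + 2 * i for i in range(36)]
scenarios = {"every_other_month": every_other_month(len(series))}
for run in (1, 2, 3):  # the random 15% scenario, run three times
    scenarios["random_15pct_run_{}".format(run)] = random_months(len(series), seed=run)
for name, held_out in scenarios.items():
    scenario_mae, scenario_rmse = evaluate(series, held_out, mean_of_visible)
    print("{}: MAE={:.2f} RMSE={:.2f}".format(name, scenario_mae, scenario_rmse))
```

Fixing the seeds keeps the three random runs reproducible, which makes the reported MAE and RMSE values easier to verify.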
Regarding Final Review: We realize the dataset is limited for forecasting. We are seeking to learn how much the factors we already have, and any additional data you bring, predict future performance. Therefore the “explainability” of your model and strength of your written analysis and recommendations will be extremely important.
Deployment Guide
Make sure you provide a README.md that covers how to run the script in any environment.
Final Submission Guidelines
- Data analysis code notebook
- Source code
- Documentation:
  - Overview: Describe your approach in “layman's terms”.
  - Methods: Describe what you did to come up with this approach, e.g. literature search, experimental testing, etc. If you augmented any of the ideas provided as input, describe your innovations.
  - Materials: Did your approach use a specific technology beyond Jupyter? Any libraries? List all tools and libraries you used.
  - Discussion: Include your analysis in this section. Explain what you attempted, considered or reviewed that worked, and especially what didn’t work or what you rejected. For anything that didn’t work or was rejected, briefly explain why (e.g. such-and-such needs more data than we have). If you are pointing to somebody else’s work (e.g. you’re citing a well-known implementation or literature), describe in detail how that work relates to this work, and what would have to be modified.
  - Data: What other data should one consider? Is it in the public domain? Is it derived? Is it necessary in order to achieve the aims? Also, what about the data described/provided - is it enough?
  - Assumptions and Risks: What are the main risks of this approach, and what assumptions are you or the model making? What are the pitfalls of the data set and approach?
  - Results: Did you implement your approach? How’d it perform? If you’re not providing an implementation, use this section to explain the EXPECTED results.
  - Other: Discuss any other issues or attributes that don’t fit neatly above that you’d also like to include.