Challenge Overview

The client for this challenge is a large telecom / mobile network operator.
This challenge expands on an ideation we recently ran. The goal is to build an algorithm that analyzes historical data and predicts 5 separate key performance indicators (KPIs) on the radio access network (RAN), up to 2 hours in advance, with 85% accuracy.
 
In this challenge we will expand on the work done in phase 1, focusing on three areas:
  • Formally preparing and cleaning the data
  • Determining how long a model is viable before it starts to lose effectiveness
  • Improving and validating the accuracy
 

Prediction target:

The target is the 5 KPIs below, two hours into the future.  So, if we want to predict the KPIs at 11AM for a specific lncel_name, we can train using the data from 9AM and before, make the prediction, and then compare it to the actual values at 11AM.  The client objective is 85% accuracy with the given data.  The data set is quite large, so please ensure the performance requirements below are met.
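One way to build this two-hour-ahead target is sketched below. It is only an illustration, and it assumes pandas, a period_start_time column already parsed as datetimes, and one row per cell per timestamp. The helper name add_future_target, the "_target" suffix, and the hourly demo data are hypothetical.

```python
import pandas as pd

def add_future_target(df, kpi, horizon="2h"):
    """Attach the value of `kpi` observed `horizon` later for the same cell."""
    future = df[["lncel_name", "period_start_time", kpi]].copy()
    # Shift the future rows back in time so they line up with the row
    # from which the prediction would be made.
    future["period_start_time"] = future["period_start_time"] - pd.Timedelta(horizon)
    future = future.rename(columns={kpi: f"{kpi}_target"})
    return df.merge(future, on=["lncel_name", "period_start_time"], how="left")

# Made-up hourly data for a single cell
demo = pd.DataFrame({
    "lncel_name": ["cell_a"] * 3,
    "period_start_time": pd.to_datetime(
        ["2020-01-01 09:00", "2020-01-01 10:00", "2020-01-01 11:00"]
    ),
    "avg_act_ues_dl": [4.0, 5.0, 6.0],
})
labelled = add_future_target(demo, "avg_act_ues_dl")
```

In this sketch the 9AM row receives the 11AM value as its target, and rows with no observation two hours later are left as missing so they can be dropped before training.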

Background

Below are the 5 RAN KPIs that you must build models to predict -
  • Downlink (DL) physical resource block (PRB) Utilisation - This KPI shows the average value of the Physical Resource Block (PRB) utilization per TTI (Transmission Time Interval) in the downlink direction. The utilization is defined as the ratio of used to available PRBs per TTI.
    • In the data, this is the e_utran_avg_prb_usage_per_tti_dl column
  • DL Traffic Volume - This KPI is the Packet Data Convergence Protocol (PDCP) Service Data Unit (SDU) volume on the eUu interface per cell in the downlink direction.
    • In the data, this is the pdcp_sdu_volume__dl column
  • DL Average End User Throughput - This KPI indicates the IP scheduled end-user throughput in DL for QCIx Services. Unit of Kbps.
    • In the data, this is the avg_pdcp_cell_thp_dl column
  • Average number of users - This KPI shows the average number of User Equipment (UEs) having one PRB during the measurement period.
    • In the data, this is the avg_act_ues_dl column
  • Spectrum Efficiency - Spectral efficiency usually is expressed as “bits per second per hertz,” or bits/s/Hz. In other words, it can be defined as the net data rate in bits per second (bps) divided by the bandwidth in hertz. Net data rate and symbol rate are related to the raw data rate which includes the usable payload and all overhead.
    • In the data, the columns are dl_spectral_efficiency and ul_spectral_efficiency
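The two ratio definitions above (PRB utilisation and spectral efficiency) can be checked with a small worked example. The numbers are made up for illustration, and the function names are hypothetical rather than anything present in the data set.

```python
def prb_utilisation(used_prbs, available_prbs):
    """Ratio of used to available PRBs in one TTI."""
    return used_prbs / available_prbs

def spectral_efficiency(net_bits_per_second, bandwidth_hz):
    """Net data rate divided by bandwidth, in bits/s/Hz."""
    return net_bits_per_second / bandwidth_hz

# Illustrative values: 45 of 100 PRBs in use, 30 Mbit/s carried over 20 MHz
utilisation = prb_utilisation(45, 100)
efficiency = spectral_efficiency(30e6, 20e6)
```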
 

General Terminology:

Data description and key data challenge:

Historical data is provided as a large CSV file with 256 columns and approximately 16,000,000 records.  The data is sorted by date in the period_start_time_date column.  You don’t have to use all the data in the CSV file, but there should be plenty there to test correlations and models.
 
We have very limited documentation on what the columns represent.  You can ask about specific columns, but the only ones we know for sure are the KPI columns (detailed above), period_start_time, mrbts_sbts_name (the tower ID), lnbts_name (the site ID), and lncel_name (the cell name).

Ideation

An ideation was recently run to investigate the data and offer potential approaches to building the models for prediction.  This ideation will be available in the forum, and competitors are encouraged to use this to guide their development.
 

Task Detail

In this code challenge, we are going to build on and enhance the work done in phase 1.

Preparing the data


The winning submission from phase 1 did a good job of data analysis and formatting the data so that it is useful with machine learning algorithms.  The data is not clean and needs to be formally handled, converted, and cleaned up.  This has to apply to both the training data and the prediction data.  One problem with the winning submission from phase 1 is that, when we try to use the unseen data that’s been held back, the submission doesn’t handle a few edge cases where there are strings in what are assumed to be numeric columns.  

Your submission needs to formally deal with any potential data issues, like the string → numeric issue noted above, missing values, etc…

The data preparation MUST be done as a function that can be called and applied to either the training data or the prediction data.  This function needs to properly handle the data conversion, formatting, stripping, etc… The goal is that when we train the model we will:
  • Load the training data from CSV
  • Call the data preparation function
  • Take the output and use that for training
Alternatively, for prediction, the flow will be the same:
  • Load the prediction data from CSV
  • Call the data preparation function
  • Take the output and pass it to the model for prediction

The data preparation function must be robust and work on the unseen data.
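A minimal sketch of such a function is shown below, assuming pandas and assuming that every column other than the four documented identifier columns should be numeric. The name prepare_data and the per-cell forward-fill policy are illustrative choices, not requirements of the challenge.

```python
import pandas as pd

# Columns documented as identifiers; everything else is assumed to be numeric
ID_COLUMNS = ["period_start_time", "mrbts_sbts_name", "lnbts_name", "lncel_name"]

def prepare_data(raw):
    """Return a cleaned copy of `raw`, usable for both training and prediction."""
    df = raw.copy()
    df.columns = [c.strip() for c in df.columns]
    numeric_cols = [c for c in df.columns if c not in ID_COLUMNS]
    for col in numeric_cols:
        if not pd.api.types.is_numeric_dtype(df[col]):
            # Strip stray whitespace from string cells before conversion
            df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)
        # Anything that still cannot be parsed becomes NaN instead of raising
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df["period_start_time"] = pd.to_datetime(df["period_start_time"], errors="coerce")
    df = df.dropna(subset=["period_start_time", "lncel_name"])
    df = df.sort_values(["lncel_name", "period_start_time"]).reset_index(drop=True)
    # Fill gaps with the last known value of the same cell (uses past data only)
    df[numeric_cols] = df.groupby("lncel_name")[numeric_cols].ffill()
    return df

# Made-up input containing a string in a numeric column
raw = pd.DataFrame({
    "period_start_time": ["2020-01-01 09:00", "2020-01-01 10:00", "2020-01-01 11:00"],
    "lncel_name": ["cell_a", "cell_a", "cell_a"],
    "avg_act_ues_dl": ["4.0", "bad", " 6.5 "],
})
clean = prepare_data(raw)
```

Forward filling is used here because it never looks ahead in time. Values that are missing at the very start of a cell's history remain missing and would still need a policy of their own.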

Prediction model accuracy and life

One concern we have is about how long a prediction model will be effective before needing to be retrained.  Because we are dealing with timeseries data, we can always add new, past data onto the training data and retrain the model(s) with the new data.  We want to ensure we have a good idea of:
  • How effective a model is after it’s been retrained on new data
  • How long that model will continue to meet the 85% accuracy requirements
  • Can we build a model that has at least a 24 hour “lifespan”?  
    • If not, why not, and what would an alternative be for meeting the 85% KPI requirements?
The final deliverables are a README and a Jupyter notebook that builds the data set, trains the models, and makes predictions.

You can build one model per-KPI, but please ensure that you do NOT build a model per-cell name.  We want the approach to work on new and unseen cells, without historical data for a specific cell in particular.
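One possible way to estimate a model's "lifespan" is to group prediction errors by how long after the training cut-off they were made, then find where the error first exceeds 15%. The sketch below assumes pandas and a predictions table with hypothetical columns period_start_time, y_true and y_pred. The function name, the 6-hour bucket size, and the demo values are likewise illustrative.

```python
import pandas as pd

def model_lifespan_hours(preds, cutoff, threshold=15.0, bucket_hours=6):
    """Hours after `cutoff` for which the bucketed MAPE stays at or below `threshold`."""
    age_h = (preds["period_start_time"] - pd.Timestamp(cutoff)) / pd.Timedelta("1h")
    bucket = (age_h // bucket_hours).astype(int)
    ape = (preds["y_true"] - preds["y_pred"]).abs() / preds["y_true"].abs() * 100
    mape_by_bucket = ape.groupby(bucket).mean().sort_index()
    lifespan = 0
    for b, mape in mape_by_bucket.items():
        if mape > threshold:
            break  # the model is considered expired from this bucket onwards
        lifespan = (b + 1) * bucket_hours
    return lifespan, mape_by_bucket

# Made-up predictions whose error grows as the model ages
cutoff = pd.Timestamp("2020-01-01 00:00")
preds = pd.DataFrame({
    "period_start_time": [cutoff + pd.Timedelta(hours=h) for h in (0, 6, 12, 18, 24, 30)],
    "y_true": [100.0] * 6,
    "y_pred": [95.0, 92.0, 90.0, 88.0, 80.0, 70.0],
})
lifespan, mape_by_bucket = model_lifespan_hours(preds, cutoff)
```

In this demo the error crosses the threshold in the bucket that starts 24 hours after the cut-off. The sketch assumes there are predictions in every bucket; gaps would need separate handling.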

Code format

In the Jupyter notebook, please ensure that the 3 main tasks are clearly separated and easy to follow:
  • Preparing the dataset
  • Training the models
  • Making predictions
These MUST be individual functions that we can easily pull out in the future when we deploy the code.  As part of the review, we will be running the code and making predictions on an unseen dataset, so it’s important that the code is very easy to modify to use new data files.
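The separation could look roughly like the skeleton below. It is a sketch only: the linear least-squares "model" is a stand-in chosen to keep the example short, and FEATURES, TARGET, MODEL_PATH, and the three function names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

FEATURES = ["avg_act_ues_dl"]        # hypothetical feature list
TARGET = "avg_act_ues_dl_target"     # hypothetical name for the KPI two hours ahead
MODEL_PATH = os.path.join(tempfile.gettempdir(), "kpi_model.pkl")

def prepare_data(raw):
    """Stand-in for the real data preparation function."""
    df = raw.copy()
    for col in FEATURES:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.dropna(subset=FEATURES)

def _design_matrix(prepared):
    # Feature columns plus a constant column for the intercept
    return np.column_stack([prepared[FEATURES].to_numpy(dtype=float), np.ones(len(prepared))])

def train_model(prepared, model_path=MODEL_PATH):
    """Fit one model for one KPI across all cells and save it to a file."""
    y = prepared[TARGET].to_numpy(dtype=float)
    weights, *_ = np.linalg.lstsq(_design_matrix(prepared), y, rcond=None)
    with open(model_path, "wb") as fh:
        pickle.dump(weights, fh)
    return model_path

def predict(prepared, model_path=MODEL_PATH):
    """Load the saved model and predict from prepared data."""
    with open(model_path, "rb") as fh:
        weights = pickle.load(fh)
    return _design_matrix(prepared) @ weights

# Made-up data following target = 2 * feature + 1
train_raw = pd.DataFrame({"avg_act_ues_dl": ["1", "2", "3", "4"],
                          "avg_act_ues_dl_target": [3.0, 5.0, 7.0, 9.0]})
train_model(prepare_data(train_raw))
preds = predict(prepare_data(pd.DataFrame({"avg_act_ues_dl": ["5"]})))
```

Both training and prediction consume the output of the same preparation function directly, and the trained model is written to a file before it is used.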

The unseen data is a continuation of the data that’s been provided to competitors.  It’s an additional 4,025,149 rows of data, starting from the latest timestamp in the given data set.

Performance

We need to ensure that the models can be trained quickly on relatively standard hardware.  You can assume 32GB of RAM for training purposes, but not any GPUs at the moment.  It’s fine to use things like Tensorflow that can be GPU accelerated, but we can’t require the acceleration.  The models should be trainable in an hour or less on normal hardware.  If this is a problem with your chosen approach, we can discuss further in the forum.  Note that the ideation has suggestions on how to do incremental training.

Additional scoring

As part of review, we will be ensuring that the submissions meet the 85% KPI requirement, using a simple algorithm below to generate models and predictions in an iterative fashion throughout the unseen data and then evaluating the results.

We will attempt to do this using the data preparation, training, and prediction functions you provide.  We will:
  • Create one large data set with the current training data and unseen data
  • Prepare the data using your data preparation function
  • For all the unseen data
    • Create models targeting a specific date range
    • Generate predictions on a date range
    • Save the predictions
    • Update the date range to a later range until all the unseen data has been predicted
  • Evaluate the predictions for all KPIs using a simple mean absolute percentage error (MAPE), which should stay under 15%

import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    # Mean of |error / actual| in percent; assumes y_true contains no zeros
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100



As an example, if the unseen data is 30 days, but our model “life” is only 24 hours, we will look at the training up to day 1, generate a model, generate 24 hours of predictions for day 1, and add those to the prediction output.  Then, we will take the data from day 1, add it to the training data, and then train up to day 2, generate a model, generate 24 hours of predictions for day 2, and add those to the prediction output.
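The iterative procedure in this example can be sketched as a walk-forward loop. This is an approximation for self-testing, not the reviewers' actual script. It assumes pandas, a period_start_time column of datetimes, and a hypothetical "target" column. The walk_forward name and the train_fn and predict_fn callables are placeholders for your own functions. In a real run, training rows whose target falls after the cut-off should also be excluded.

```python
import numpy as np
import pandas as pd

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def walk_forward(df, first_test_time, life, train_fn, predict_fn,
                 time_col="period_start_time"):
    """Retrain every `life`, predicting only the window after each cut-off."""
    outputs = []
    start = pd.Timestamp(first_test_time)
    step = pd.Timedelta(life)
    end = df[time_col].max()
    while start <= end:
        train = df[df[time_col] < start]          # everything before the cut-off
        test = df[(df[time_col] >= start) & (df[time_col] < start + step)]
        if not test.empty:
            model = train_fn(train)
            outputs.append(test.assign(prediction=predict_fn(model, test)))
        start += step                             # move the window forward
    return pd.concat(outputs, ignore_index=True)

# Made-up data: one row every 12 hours over 3 days
times = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(6) * 12, unit="h")
data = pd.DataFrame({"period_start_time": times,
                     "target": [10.0, 10.0, 20.0, 20.0, 30.0, 30.0]})

# Trivial stand-in model: predict the mean of the training targets
result = walk_forward(
    data, "2020-01-02", "24h",
    train_fn=lambda train: train["target"].mean(),
    predict_fn=lambda model, test: np.full(len(test), model),
)
score = mean_absolute_percentage_error(result["target"], result["prediction"])
```

The stand-in model performs poorly on this trending data, which is the point of the exercise: the loop reports an honest out-of-sample score for each retraining window.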

Because of the iterative nature of this review, please ensure that you understand how the scoring will work.  You are welcome to code something similar to the above in your own submission for testing.  It’s important that your code can be easily modified to work in the above fashion.  If it can’t, or we can’t understand it, that will greatly impact your review.  You MUST ensure that you provide clear and concise documentation both in your README and the Jupyter notebook.

Judging Criteria

You will be judged on:
  • Feasibility and Completeness
    • Did you complete all required KPI predictions?
    • Did you meet the 85% accuracy threshold on unseen data?
    • Did your submission fit inside the performance requirements?
    • Did you properly implement a data preparation function that works?
    • Is training done in a straightforward manner, using the output from the data preparation function directly?  
    • Is the model saved to a file?
    • Is prediction done in a straightforward manner, using the output from the data preparation function directly?
    • Did you properly document and investigate how long a model will last and how often they need to be retrained?

Final Submission Guidelines

  • Documentation (in text, .doc, .pdf, or .md format)
  • Code (in a Jupyter Notebook with readme)
  • Please provide a single configuration variable to allow us to change the location of the data for reviewer systems.  Don’t hard-code the path to all the files.
  • Don’t require hardware, like CUDA cards, but if you want to optionally provide a way to target them, that’s fine.
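A single configuration variable could be set up as in the sketch below. The environment variable name RAN_DATA_DIR, the default folder, and the file names are all hypothetical; the point is only that every path is derived from one place.

```python
import os
from pathlib import Path

# The only setting a reviewer should need to change
DATA_DIR = Path(os.environ.get("RAN_DATA_DIR", "./data"))

# Every other path is derived from DATA_DIR (file names are placeholders)
TRAIN_CSV = DATA_DIR / "train.csv"
UNSEEN_CSV = DATA_DIR / "unseen.csv"
MODEL_DIR = DATA_DIR / "models"
```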



Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30119449