Challenge Overview
The client for this challenge is a large telecom / mobile network operator. This challenge expands on an ideation we recently ran and starts developing an algorithm that can analyze historical data and predict 5 separate key performance indicators (KPIs) on the radio access networks (RAN), up to 2 hours in advance, with 85% accuracy.
The eventual solution will be built in later phases and not by this challenge. In this challenge we will take the data set and build models to make predictions for the KPIs. Some data cleansing and preparation is necessary and within the scope of this challenge.
Prediction target:
The target is the 5 KPIs below, two hours into the future. For example, to predict the KPIs at 11AM for a specific lncel_name, we can train using the data from 9AM and earlier, make the prediction, and then compare it to the actual values at 11AM. The client objective is 85% accuracy with the given data. The data set is quite large, so please ensure the performance requirements below are met.
Background
Below are the 5 RAN KPIs that you must build models to predict:
- Downlink (DL) Physical Resource Block (PRB) Utilisation - This KPI shows the average PRB utilisation per Transmission Time Interval (TTI) in the downlink direction. Utilisation is defined as the ratio of used to available PRBs per TTI.
- In the data, this is the e_utran_avg_prb_usage_per_tti_dl column
- DL Traffic Volume - This KPI is the Packet Data Convergence Protocol (PDCP) Service Data Unit (SDU) volume on the eUu interface per cell in the downlink direction.
- In the data, this is the pdcp_sdu_volume__dl column
- DL Average End User Throughput - This KPI indicates the IP scheduled end-user throughput in DL for QCIx services. The unit is Kbps.
- In the data, this is the avg_pdcp_cell_thp_dl column
- Average number of users - This KPI shows the average number of User Equipments (UEs) having one PRB during the measurement period.
- In the data, this is the avg_act_ues_dl column
- Spectrum Efficiency - Spectral efficiency usually is expressed as “bits per second per hertz,” or bits/s/Hz. In other words, it can be defined as the net data rate in bits per second (bps) divided by the bandwidth in hertz. Net data rate and symbol rate are related to the raw data rate which includes the usable payload and all overhead.
- In the data, the columns are dl_spectral_efficiency and ul_spectral_efficiency
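The two-hour-ahead target can be sketched as a self-join on time. This is only an illustration under stated assumptions: period_start_time has already been parsed to a datetime, each (lncel_name, period_start_time) pair appears once, and the helper name add_future_targets and the `_target` suffix are made up for this example. The KPI column names are the ones listed above.

```python
import pandas as pd

# KPI columns named in the challenge text (spectral efficiency has two columns).
KPI_COLUMNS = [
    "e_utran_avg_prb_usage_per_tti_dl",
    "pdcp_sdu_volume__dl",
    "avg_pdcp_cell_thp_dl",
    "avg_act_ues_dl",
    "dl_spectral_efficiency",
    "ul_spectral_efficiency",
]
HORIZON = pd.Timedelta(hours=2)  # prediction horizon from the challenge


def add_future_targets(df, time_col="period_start_time", cell_col="lncel_name"):
    """Attach each KPI's value two hours later as `<kpi>_target` columns.

    Hypothetical helper: assumes `time_col` is datetime-typed and that
    (cell, time) pairs are unique.
    """
    future = df[[cell_col, time_col] + KPI_COLUMNS].copy()
    # A row observed at t + 2h becomes the target for the row observed at t.
    future[time_col] = future[time_col] - HORIZON
    future = future.rename(columns={c: f"{c}_target" for c in KPI_COLUMNS})
    # The inner join drops rows that have no observation two hours later.
    return df.merge(future, on=[cell_col, time_col], how="inner")
```

Joining on shifted timestamps rather than shifting by a fixed number of rows avoids assuming a particular sampling interval, which the data documentation does not state.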
General Terminology:
- DL PRB Utilisation
- DL Traffic Volume
- DL Average End User Throughput
- Spectrum Efficiency
Data description and key data challenge:
History data is provided as a large CSV file with 256 columns and approximately 16,000,000 records. This data is sorted by date in the period_start_time_date column. You don’t have to use all the data in the CSV file, but there should be plenty there to test correlations and models.
We have very limited documentation on what the columns represent. You can ask about specific ones, but the only ones we know for sure are the KPI columns (detailed above); period_start_time; mrbts_sbts_name, which is the tower ID; lnbts_name, which is the site ID; and lncel_name, which is the cell name.
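With 256 columns and roughly 16,000,000 rows, reading only the columns you need, in chunks, keeps memory use manageable. The sketch below shows one way to do that with pandas. The function name load_history, the chunk size, and the rule of dropping rows with no cell name are assumptions for illustration, and it assumes pandas can parse the period_start_time format directly.

```python
import pandas as pd

# Identifier columns described in the challenge text.
ID_COLUMNS = ["period_start_time", "mrbts_sbts_name", "lnbts_name", "lncel_name"]


def load_history(csv_source, value_columns, chunksize=500_000):
    """Read only the needed columns, chunk by chunk, into one DataFrame.

    Hypothetical helper; `csv_source` is a path or file-like object.
    """
    wanted = ID_COLUMNS + list(value_columns)
    chunks = pd.read_csv(
        csv_source,
        usecols=wanted,                     # skip the other ~250 columns
        parse_dates=["period_start_time"],  # assumes a parseable timestamp format
        chunksize=chunksize,                # yields DataFrames of this many rows
    )
    parts = []
    for chunk in chunks:
        # Assumed cleaning rule: rows without a cell name cannot be used.
        parts.append(chunk.dropna(subset=["lncel_name"]))
    return pd.concat(parts, ignore_index=True)
```

A call such as `load_history(path, ["avg_act_ues_dl"])` would then return the four identifier columns plus the one requested KPI.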
Ideation
An ideation was recently run to investigate the data and offer potential approaches to building the models for prediction. This ideation will be available in the forum, and competitors are encouraged to use this to guide their development.
Task Detail
In this code challenge, we are looking for solutions for the following problems:
- Build an algorithm which can compute the 5 KPIs within the general performance requirements
- What approaches / modelling techniques can be used to meet the KPIs? What are the risks or consequences of using your approach?
- Meet a target accuracy of 85%
The final deliverables are a README and a Jupyter notebook that builds the data set, trains the models, and makes predictions.
Code format
In the Jupyter notebook, please ensure that the 3 main tasks are clearly separated and easy to follow:
- Building / cleaning the dataset
- Training the models
- Making predictions
Preferably, these would be individual functions that we can easily pull out in the future when we deploy the code. As part of the review, we will be running the code and making predictions on an unseen dataset, so it’s important that the code is very easy to modify to use new data files.
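One possible shape for the three separated functions is sketched below. The function names and the straight-line least-squares fit are placeholders chosen only to keep the example short and runnable; they are not a recommended model. The sketch assumes period_start_time is already a datetime column and handles one KPI at a time.

```python
import numpy as np
import pandas as pd

HORIZON = pd.Timedelta(hours=2)  # prediction horizon from the challenge


def build_dataset(raw, kpi):
    """Clean the raw frame and pair each KPI value with its value two hours later."""
    df = raw[["lncel_name", "period_start_time", kpi]].dropna()
    future = df.rename(columns={kpi: "target"})
    # A row observed at t + 2h becomes the target for the row observed at t.
    future["period_start_time"] = future["period_start_time"] - HORIZON
    return df.merge(future, on=["lncel_name", "period_start_time"], how="inner")


def train_model(dataset, kpi):
    """Fit target = slope * current + intercept (placeholder model only)."""
    slope, intercept = np.polyfit(
        dataset[kpi].to_numpy(), dataset["target"].to_numpy(), deg=1
    )
    return {"kpi": kpi, "slope": float(slope), "intercept": float(intercept)}


def predict(model, frame):
    """Predict the KPI two hours ahead for each row of `frame`."""
    return model["slope"] * frame[model["kpi"]] + model["intercept"]
```

Keeping the fitted model as a plain return value, rather than notebook-global state, is what makes each step easy to lift into a deployment later and to rerun against a new data file.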
Performance
We need to ensure that the models can be trained quickly on relatively standard hardware. You can assume 32GB of RAM for training purposes, but no GPUs at the moment. It’s fine to use things like TensorFlow that can be GPU accelerated, but we can’t require the acceleration. The models should be trainable in an hour or less on normal hardware. If this is a problem with your chosen approach, we can discuss further in the forum. Note that the ideation has suggestions on how to do incremental training.
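To make the one-hour training limit visible in the notebook, the training cell can be wrapped in a small timer. This is a minimal sketch using only the standard library; the name training_budget and the result dictionary are hypothetical, not part of any required interface.

```python
import time
from contextlib import contextmanager

TRAINING_BUDGET_SECONDS = 60 * 60  # one hour, per the performance requirement


@contextmanager
def training_budget(label, budget_seconds=TRAINING_BUDGET_SECONDS):
    """Time the enclosed block and record whether it stayed within budget."""
    result = {"label": label, "elapsed": None, "within_budget": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Filled in when the block exits, whether or not it raised.
        result["elapsed"] = time.perf_counter() - start
        result["within_budget"] = result["elapsed"] <= budget_seconds
        print(f"{label}: {result['elapsed']:.1f}s (budget {budget_seconds}s)")
```

Usage would look like `with training_budget("PRB utilisation model") as timing: ...`, after which `timing["within_budget"]` can be reported in the README.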
Additional scoring
As part of review, we will be scoring the submissions against unseen data that’s been held back. This will be a significant part of the final score. The scoring algorithm and details will be provided in the forum for all members to understand how this will happen. The scores will be calculated and posted as part of the review.
Judging Criteria
You will be judged on:
- Feasibility and Completeness
- Did you complete all required KPI predictions?
- Did you meet the 85% accuracy threshold on unseen data?
- Did your submission fit inside the performance requirements?
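The official scoring algorithm will be posted in the forum, so the definition of "accuracy" is not fixed here. For local sanity checks before then, one common stand-in is 1 minus the mean absolute percentage error (MAPE). The sketch below is that stand-in only; it is an assumption, not the review metric, and the function name is made up.

```python
def mape_accuracy(actual, predicted, eps=1e-9):
    """Return 1 - MAPE, floored at 0. Assumed metric, NOT the official scoring.

    Points whose actual value is (near) zero are skipped, because the
    percentage error is undefined there.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if abs(a) > eps]
    if not pairs:
        raise ValueError("no non-zero actual values to score against")
    mape = sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)
    return max(0.0, 1.0 - mape)
```

Under this stand-in, a score of 0.85 or higher would correspond to the 85% target. Note that skipping zero actuals is a design choice that can flatter models on lightly loaded cells.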
Final Submission Guidelines
- Documentation (in text, .doc, .pdf, or .md format)
- Code (in a Jupyter Notebook with readme)
- Please provide a single configuration variable to allow us to change the location of the data for reviewer systems. Don’t hard-code the path to all the files.
- Don’t require hardware, like CUDA cards, but if you want to optionally provide a way to target them, that’s fine.
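The single configuration variable could look like the sketch below, placed in the first notebook cell. The environment variable name RAN_DATA_DIR, the default ./data, and the file name history.csv are all hypothetical; substitute whatever the submission actually uses.

```python
import os
from pathlib import Path

# The one place reviewers need to change. RAN_DATA_DIR is a hypothetical
# environment variable; editing the default string works just as well.
DATA_DIR = Path(os.environ.get("RAN_DATA_DIR", "./data"))


def data_file(name):
    """Resolve a data file name relative to the configured directory."""
    return DATA_DIR / name


# Every path in the notebook is derived from DATA_DIR (file name is illustrative).
HISTORY_CSV = data_file("history.csv")
```

Deriving every file path from one variable means the notebook can be pointed at the held-back data by changing a single line or by setting the environment variable before launching Jupyter.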