Customer Churn (Drop Off) Modeling


Challenge Overview

Welcome to the Customer Churn (Drop Off) Modeling Marathon Match! Our client is interested in recurring forecasts of which of their customers are likely to churn from their customer base in the near future.

The client maintains a regularly updated dataset of their current and past customers, the services they subscribed to in different periods, and other known historic and current customer information. The client also has an existing model, which is retrained monthly on the latest historic data and forecasts which customers will cancel, or won't renew, the services they are consuming within the next three months. The aim of this match is to create a better model, one that beats the existing client forecast in accuracy while keeping similar timing requirements for (re-)training and modeling.

Prizes

(If a smaller number of placements is shown above in the page header, don't worry: the actual number of prizes and their amounts will be updated later to match the following list.)

1st: $12,100
2nd:  $7,500
3rd:  $4,000
4th:  $3,000
5th:  $1,800
6th:  $1,100
7th:    $700
8th:    $500
9th:    $200
10th:   $100

Dataset Details

In the challenge forum, available after registration for the challenge, you will find the competitor pack, which includes the training dataset. The dataset includes two similar input CSV files: history.csv and input.csv.
  • Each record in these files corresponds to a single customer (the customer ID is in the second column, md5_cust_party_key) and month (the record's year and month, in YYYYMM format, are in the first column, report_period_m_cd).
     
  • The history.csv file contains records for the period Jan 2018 - July 2019, and the forecast target in its last column (target_ind), which equals 1 if the customer churns (leaves the customer base) within the following three months, and 0 if he does not.

    Important:
    • When we say “churns within the following three months”, we actually mean “churns within the three months following the next month”; i.e. a target flag of 1 in January 2018 means this customer will churn sometime after the beginning of March 2018 and before the end of May 2018. The month in-between, February 2018 in this case, is skipped because that is when the history update and the forecast for the next three months are done, based on the data at the end of the previous month (see the sketch after this list).
       
    • When a customer churns, in general his ID won't be present in the database after the churn; but that is not guaranteed: if the customer returns later, he may re-appear under the same ID in the future. In some cases it is even possible that a customer leaves the customer base and then returns within the same month. In this case he will be present in the database without gaps, but the target flag will still report that he churned.
  • The input.csv file is similar to history.csv, but it contains records for November 2019 only, and it does not contain the target column, which you effectively need to re-create. With the training dataset you are provided a separate ground_truth.json file, which lists the IDs of the November 2019 customers who churned and who did not churn in the following three months (Jan 2020 - March 2020).
     
  • The training dataset is based on ~50% of the customer base. The historic provisional and final testing data (history.csv) on the Topcoder server are based on a differently selected ~50% of customers. The provisional input.csv is based on the same ~50% of customers as the corresponding history data, and the final input.csv contains the entire customer base. The timeframe of the provisional data matches the training data: historic data from January 2018 to July 2019, and input for November 2019; the timeframe of the final dataset is larger: historic data from January 2018 to November 2019, and input for March 2020.
     
  • For data-protection reasons, most of the numeric data in the dataset have been z-transformed (the corresponding columns have z in their names); i.e. the actual values x in such columns were replaced by z_transform(x) = (x - average(x)) / standard_deviation(x). The client has verified that their existing model is able to make meaningful forecasts on such data.
[Table omitted: comparison of the final, provisional, and training datasets.]
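
To make the skip-month rule concrete, here is a minimal Python sketch of the target window arithmetic (the helper names below are ours, for illustration only; they are not part of the competitor pack):

def add_months(yyyymm, k):
    # Shift a YYYYMM integer (as in report_period_m_cd) by k months.
    y, m = divmod(yyyymm, 100)
    total = y * 12 + (m - 1) + k
    return (total // 12) * 100 + (total % 12) + 1

def churn_window(report_period):
    # target_ind == 1 in report_period means the customer churns within
    # the three months that follow the next (skipped) month.
    return add_months(report_period, 2), add_months(report_period, 4)

print(churn_window(201801))  # (201803, 201805): March - May 2018
print(churn_window(201911))  # (202001, 202003): matches ground_truth.json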

Submission Format

You will submit a dockerized version of your code, i.e. the source code with a Dockerfile, placed inside a folder named code and zipped so that the code folder is located in the root of the resulting archive (an example submission is included in the competitor pack; it is based on the generic template of a Topcoder MM submission, and you are encouraged to ask further questions about dockerization and the submission format in the challenge forum). The Topcoder system will build it and run the resulting Docker container with input and output folders mounted as local paths inside the container, and their local path names passed as command-line arguments; i.e. the container will effectively be run like:

docker run --shm-size 4G -v <local_data_path>:/data:ro -v <local_writable_area_path>:/workdir your_submission_image ./test.sh /data /workdir

Thus, your code will get two local folder paths, /data and /workdir, as two command-line arguments <arg1> and <arg2>. From the first one you will be able to read the inputs: <arg1>/history.csv and <arg1>/input.csv. Into the second one you should output your solution file: <arg2>/solution.txt. Into the solution file you are expected to write, one per line, the IDs of all customers present in input.csv, sorted by the predicted probability of their churn in the next three months; i.e. the first one should be the customer with the highest predicted probability to churn, and the last one should be the customer with the lowest probability. Each customer ID should be present exactly once. A minimal sketch of this I/O contract is shown below.
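
For illustration, here is a minimal Python sketch of the I/O contract (solve.py and the constant placeholder probability are our assumptions; a real submission would train a model on <arg1>/history.csv and may use any language):

import csv
import sys

data_dir, work_dir = sys.argv[1], sys.argv[2]

# Read the customers to be ranked; md5_cust_party_key is the customer ID column.
probs = {}
with open(data_dir + "/input.csv", newline="") as f:
    for row in csv.DictReader(f):
        probs[row["md5_cust_party_key"]] = 0.5  # placeholder: use a real model here

# Write one ID per line, highest predicted churn probability first.
order = sorted(probs, key=probs.get, reverse=True)
with open(work_dir + "/solution.txt", "w") as out:
    out.write("\n".join(order) + "\n")

Your test.sh would then invoke it as, e.g., python3 solve.py $1 $2.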

Scoring

  • If your output misses any expected customer ID, or contains any duplicated or unknown customer IDs, it is invalid and gets a zero score. It also gets a zero score if it does not complete within the runtime limit.
    Solution runtime threshold: 2 hours.
     
  • The score of a valid submission equals
    SCORE = 100 x num_churned_10p / num_churned_10p_max
    where num_churned_10p is the number of customers within the first 10% of your output who actually churned, and num_churned_10p_max is the maximum possible number of churned customers within 10% of the customer base (i.e. it may be less than or equal to 10% of the total number of customers). A re-implementation sketch of this metric is shown after this list.

    The competitor pack includes the local scorer, a simple JavaScript program, to run with three command-line arguments: the path to the scored solution.txt, the path to the ground_truth.json file, and the path to the folder where result.txt and details.txt will be written (the first of them will contain the score, and the second will contain extra details; in particular, its last column contains the score value as calculated for fractions of customers other than the 10% used for the official score).

    To run the scorer with Node, execute in the scorer folder:
    $ npm install
    $ node index.js path/to/solution.txt path/to/ground_truth.json path/to/output


    Alternatively, you can build and run it with Docker by executing in its folder:
    $ docker build -t churn-scorer .

    $ mkdir workdir
    $ # place solution.txt and ground_truth.json inside ./workdir
    $ docker run --rm -v "$(pwd)/workdir":/workdir churn-scorer ./run-scorer.sh /workdir/solution.txt /workdir/ground_truth.json /workdir
    Of course, using the -v Docker argument you can mount your machine's directories into the container in whatever way is more convenient.

     
  • The prize eligibility threshold score is set to 30.0. It is somewhat below the score achieved by the existing client model, which is 34.0 on both the provisional and final datasets. It has been confirmed by a Topcoder tester that a score of 30.0 is achievable with reasonable effort by a model created from scratch. A random-guess solution, which outputs customer IDs shuffled in random order, scores ~10.0.
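
For reference, here is a hedged re-implementation sketch of the official metric in Python (it assumes churned_ids is the set of ground-truth churned customer IDs extracted from ground_truth.json, whose exact layout is documented in the competitor pack; the official JavaScript scorer remains authoritative, e.g. regarding how the 10% cutoff is rounded):

def score(solution_order, churned_ids, fraction=0.10):
    # solution_order: customer IDs, highest predicted churn probability first.
    top_n = int(len(solution_order) * fraction)  # rounding mode is an assumption
    num_churned_10p = sum(1 for cid in solution_order[:top_n] if cid in churned_ids)
    # The best possible top slice contains min(top_n, total churned) churners.
    num_churned_10p_max = min(top_n, len(churned_ids))
    return 100.0 * num_churned_10p / num_churned_10p_max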

Other Rules

  • This match is rated (TCO-eligible) and individual (no teaming allowed).
     
  • Any 3rd-party components of your solution should be available under permissive open-source licenses similar to the MIT License. In case of doubt, or a strong desire to use something covered by a more restrictive license, please contact the copilot for approval.
     
  • There are no programming language restrictions, though the client team currently uses Python and R for their modeling. If you are considering different alternatives, with Python or R among the options, we encourage you to stick with them.
     
  • Online testing is performed on t2.xlarge AWS EC2 machines, with a 50 GB volume and no GPU support.
     
  • To claim a prize, you must submit, within a week after the announcement of the final results, a detailed write-up explaining your solution, its implementation, and any other details relevant to its usage and further development.