Medium | 500 Points | Topcoder Skill Builder Competition | Databricks | Apache Spark

Key Information

The challenge is finished.

Challenge Overview



This is the 500-point Medium-level problem of the Topcoder Skill Builder Competition for Databricks and Apache Spark. For more context on the challenge, register for the Host Competition before submitting a solution to this problem.

 

Technology Stack

  • Python

  • Scala

  • SQL

  • Apache Spark

  • Databricks

You can use Python, Scala, SQL, or any combination of these in your notebook. Using Apache Spark is mandatory.

Problem Statement

Notebook Setup

  • Sign up for the Databricks Community Edition

  • Note that during sign-up, you will be prompted to choose between the Business and Community editions. Be sure to select the Community Edition, which is the free version of Databricks

  • You will be working with the dataset available from GitHub Archive. You can upload the dataset to your Databricks workspace and import it when working on the tasks below, or you can download it at run time. Familiarity with GitHub / version control will help you understand the terminology used in the tasks below.

  • Once you have signed up, proceed with the steps below

 

Data Ingestion Task

 
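As a hedged illustration of run-time ingestion, an hourly GitHub Archive dump can be fetched like this. The URL pattern follows GH Archive's published naming scheme (one gzipped JSON-lines file per UTC hour); the year used here is an assumption, since the challenge text does not state one:

```python
# Hypothetical sketch of run-time ingestion for a GH Archive hourly dump.
# The 7 October, 12 PM hour matches the question in Data Processing Task 1;
# the year (2021) is assumed, not stated in the challenge.
import urllib.request

def gharchive_url(year: int, month: int, day: int, hour: int) -> str:
    # GH Archive serves one gzipped JSON-lines file per UTC hour,
    # named YYYY-MM-DD-H.json.gz (the hour is not zero-padded).
    return f"https://data.gharchive.org/{year}-{month:02d}-{day:02d}-{hour}.json.gz"

url = gharchive_url(2021, 10, 7, 12)
# Download, then load into Spark with spark.read.json("/tmp/events.json.gz"):
# urllib.request.urlretrieve(url, "/tmp/events.json.gz")
```

Spark's JSON reader handles gzipped JSON-lines files directly, so no manual decompression step is needed after the download.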

Data Processing Task 1

  • Once imported, you need to prepare a frequency table for the dataset. Specifically, you need to answer the question: which were the most popular event types on 7th October at 12 PM?

  • Each row in the dataset is a JSON object with a “type” attribute. You need to collect the unique values of this attribute and count their occurrences.

  • Next, using the Databricks visualization feature, plot the top 10 most popular event types as a pie chart.

  • Your notebook must contain, in its cells, the commands you used to arrive at this result.

  • That completes this task.

 

Data Processing Task 2

  • You will be working with the same data as before (as received in the Data ingestion task above).

  • This task involves analyzing the data for events created by bots

  • For this, you need to analyze the payload attribute of the event, which contains details about the user that generated the event. The user type tells you whether the user is a bot.

  • Only consider events that were generated by the users of type “Bot”.

  • Next, you need to determine the most active repositories based on bot actions. That is, determine which repositories (the repo.name attribute) saw the most bot-generated events; the repository with the most events is the most active in our scenario. You need to collect the top 20 repositories.

  • Finally, display the data as a two-column table, with the first column holding the repository name and the second the event count. Arrange the rows in descending order, with the most active repository first.

  • Your notebook must contain, in its cells, the commands you used to arrive at this result.

  • That completes this task.

 

Finally, publish your notebook. Databricks will provide you with the public URL where your notebook can be accessed.

 

Important Notes

  • Don’t just write the commands needed to complete the tasks. You need to run all the cells in the notebook, display the output, verify that it meets expectations, and then publish.

  • This contest is part of the Databricks Skill Builder Contest.

  • Successfully completing the task will earn you 500 points on the Databricks Skill Builder Leaderboard.

  • All tasks will be part of a single notebook. DO NOT provide multiple notebooks.

 

Problems

  1. Easy: 250 Points

  2. Medium: 500 Points - This contest

  3. Hard: 1000 Points



Final Submission Guidelines

Submit a text file that contains the link to your Databricks notebook.

 

ELIGIBLE EVENTS: 2021 Topcoder(R) Open

REVIEW STYLE:

Final Review: Community Review Board

Approval: User Sign-Off


ID: 30149036