Challenge Overview
This is the 500-point Medium-level problem of the Topcoder Skill Builder Competition for Databricks and Apache Spark. For more challenge context, register for the host competition before submitting a solution to this problem.
Technology Stack
- Python
- Scala
- SQL
- Apache Spark
- Databricks
You can use Python, Scala, SQL, or a combination of these in your notebook. Using Apache Spark is mandatory.
Problem Statement
Notebook Setup
- Sign up for the Databricks Community Edition here.
- Note that during sign-up you will be prompted to choose between the Business and Community editions. Be sure to select the Community Edition, which is the free version of Databricks.
- You will be working with the dataset available on Github Archive. You can upload the dataset to your Databricks workspace and import it when working on the tasks below, or you can download it at run time. You should be familiar with Github / version control to understand the terminology used in the tasks below.
- Once you have signed up, proceed with the steps below.
Data Ingestion Task
- Create a notebook in Databricks.
- In this notebook, import the data for 7 October 2020 at 12 PM UTC from Github Archive using Apache Spark and load it into a DataFrame (a sketch follows this list).
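A minimal PySpark sketch of this step, assuming Github Archive's standard hourly YYYY-MM-DD-H.json.gz file naming; the local download path is illustrative:

```python
import urllib.request

# Download the hourly Github Archive dump for 7 October 2020, 12 PM UTC.
url = "https://data.gharchive.org/2020-10-07-12.json.gz"
local_path = "/tmp/2020-10-07-12.json.gz"  # illustrative local path
urllib.request.urlretrieve(url, local_path)

# Spark reads gzipped, newline-delimited JSON directly; `spark` is the
# SparkSession predefined in Databricks notebooks.
df = spark.read.json(f"file:{local_path}")
df.printSchema()
```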
Data Processing Task 1
- Once imported, prepare a frequency table for the dataset. Specifically, you need to answer the question: which were the most popular event types on 7 October at 12 PM?
- Each row in the dataset is a JSON object with a "type" attribute. Collect the unique values of this attribute and count their occurrences.
- Next, using the Databricks visualization feature, plot the top 10 most popular event types as a pie chart (a sketch follows this list).
- Your notebook must contain the commands in the cells that you used to arrive at this result.
- That completes this task.
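A minimal sketch of the aggregation, assuming `df` is the DataFrame from the ingestion task; the pie chart itself is then configured through the plot options of the `display()` output:

```python
from pyspark.sql.functions import desc

# Frequency table: count the occurrences of each event type.
event_counts = df.groupBy("type").count().orderBy(desc("count"))

# Top 10 most popular event types; switch the display() output to a
# pie chart via the cell's plot options.
display(event_counts.limit(10))
```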
Data Processing Task 2
- You will be working with the same data as before (as loaded in the Data Ingestion Task above).
- This task involves analyzing the data for events created by bots.
- For this, you need to examine the payload attribute of the event, under which more details about the user that generated the event can be found. The user's type tells you whether or not the user is a bot.
- Only consider events that were generated by users of type "Bot".
- Next, determine the most active repositories based on bot actions. That is, determine which repositories (the repo.name attribute) were the most active, based on the interactions carried out by bots. The repository with the most events is the most active one in our scenario. Collect the top 20 repositories.
- Finally, display the data as a table with two columns: the first containing the name of the repository and the second containing the count. Arrange the rows in descending order, with the most active repository appearing first (a sketch follows this list).
- Your notebook must contain the commands in the cells that you used to arrive at this result.
- That completes this task.
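A minimal sketch, again assuming `df` from the ingestion task. The path to the user's type inside `payload` varies by event type; `payload.pull_request.user.type` below is purely illustrative, so inspect the schema (`df.printSchema()`) to find the paths that apply to the events you are filtering:

```python
from pyspark.sql.functions import col, desc

# Hedged example: payload.pull_request.user.type is one possible
# location of the user type; rows lacking that path evaluate to null
# and are dropped by the filter.
bot_events = df.filter(col("payload.pull_request.user.type") == "Bot")

# Top 20 repositories by number of bot-generated events,
# most active first.
top_repos = (
    bot_events.groupBy(col("repo.name").alias("repository"))
              .count()
              .orderBy(desc("count"))
              .limit(20)
)
display(top_repos)
```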
Finally, publish your notebook. Databricks will provide you with a public URL where your notebook can be accessed.
Important Notes
- Don't just write the commands necessary to complete the tasks. You need to run all the cells in the notebook, display the output, verify that it meets expectations, and then publish.
- This contest is part of the Databricks Skill Builder Contest.
- Successfully completing the tasks will earn you 500 points on the Databricks Skill Builder leaderboard.
- All tasks must be part of a single notebook. DO NOT provide multiple notebooks.
Problems
- Medium: 500 points - this contest
Final Submission Guidelines
Submit a text file that contains the link to your published Databricks notebook.