Challenge Overview
This is the 500-point Medium-level problem of the Topcoder Skill Builder Competition for Databricks and Apache Spark. For more challenge context, register for the host competition before submitting a solution to this problem.
Technology Stack
- Python
- Scala
- SQL
- Apache Spark
- Databricks
You can use Python, Scala, SQL, or a combination of these in your notebook. Using Apache Spark is mandatory.
Problem Statement
Notebook Setup
- Sign up for the Databricks Community Edition here.
- Note that during sign-up you will be prompted to choose between the Business and Community editions. Be sure to select the Community Edition, which is the free version of Databricks.
- You will be working with the dataset available on Github Archive. You can upload the dataset to your Databricks workspace and import it when working on the tasks below, or you can download it at run time. You should be familiar with Github / version control to understand the terminology used in the tasks below.
- Once you have signed up, proceed with the steps below.
Data Ingestion Task
- Create a notebook in Databricks.
- In this notebook, import the data for 7 October 2020 at 12 PM UTC from Github Archive using Apache Spark and load it into a DataFrame (a sketch follows this list).
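A minimal PySpark sketch of this step, assuming Github Archive's standard hourly YYYY-MM-DD-H.json.gz file naming; the local download path is illustrative:

```python
import urllib.request

# Download the hourly Github Archive dump for 7 October 2020, 12 PM UTC.
url = "https://data.gharchive.org/2020-10-07-12.json.gz"
local_path = "/tmp/2020-10-07-12.json.gz"  # illustrative local path
urllib.request.urlretrieve(url, local_path)

# Spark reads gzipped, newline-delimited JSON directly; `spark` is the
# SparkSession predefined in Databricks notebooks.
df = spark.read.json(f"file:{local_path}")
df.printSchema()
```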
Data Processing Task 1
- Once imported, prepare a frequency table for the dataset. Specifically, you need to answer the question: which were the most popular event types on 7 October at 12 PM?
- Each row in the dataset is a JSON object with a "type" attribute. Collect the unique values of this attribute and count their occurrences.
- Next, using the Databricks visualization feature, plot the top 10 most popular event types as a pie chart (a sketch follows this list).
- Your notebook must contain the commands in the cells that you used to arrive at this result.
- That completes this task.
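A minimal sketch of the aggregation, assuming `df` is the DataFrame from the ingestion task; the pie chart itself is then configured through the plot options of the `display()` output:

```python
from pyspark.sql.functions import desc

# Frequency table: count the occurrences of each event type.
event_counts = df.groupBy("type").count().orderBy(desc("count"))

# Top 10 most popular event types; switch the display() output to a
# pie chart via the cell's plot options.
display(event_counts.limit(10))
```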
Data Processing Task 2
- You will be working with the same data as before (as loaded in the Data Ingestion Task above).
- This task involves analyzing the data for events created by bots.
- For this, you need to examine the payload attribute of the event, under which more details about the user that generated the event can be found. The user's type tells you whether or not the user is a bot.
- Only consider events that were generated by users of type "Bot".
- Next, determine the most active repositories based on bot actions. That is, determine which repositories (the repo.name attribute) were the most active, based on the interactions carried out by bots. The repository with the most events is the most active one in our scenario. Collect the top 20 repositories.
- Finally, display the data as a table with two columns: the first containing the name of the repository and the second containing the count. Arrange the rows in descending order, with the most active repository appearing first (a sketch follows this list).
- Your notebook must contain the commands in the cells that you used to arrive at this result.
- That completes this task.
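A minimal sketch, again assuming `df` from the ingestion task. The path to the user's type inside `payload` varies by event type; `payload.pull_request.user.type` below is purely illustrative, so inspect the schema (`df.printSchema()`) to find the paths that apply to the events you are filtering:

```python
from pyspark.sql.functions import col, desc

# Hedged example: payload.pull_request.user.type is one possible
# location of the user type; rows lacking that path evaluate to null
# and are dropped by the filter.
bot_events = df.filter(col("payload.pull_request.user.type") == "Bot")

# Top 20 repositories by number of bot-generated events,
# most active first.
top_repos = (
    bot_events.groupBy(col("repo.name").alias("repository"))
              .count()
              .orderBy(desc("count"))
              .limit(20)
)
display(top_repos)
```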
Finally, publish your notebook. Databricks will provide you with a public URL where your notebook can be accessed.
Important Notes
- Don't just write the commands necessary to complete the tasks. You need to run all the cells in the notebook, display the output, verify that it meets expectations, and then publish.
- This contest is part of the Databricks Skill Builder Contest.
- Successfully completing the tasks will earn you 500 points on the Databricks Skill Builder leaderboard.
- All tasks must be part of a single notebook. DO NOT provide multiple notebooks.
Problems
- Medium: 500 points - this contest
Final Submission Guidelines
Submit a text file that contains the link to your published Databricks notebook.