Hard | 1000 Points | Topcoder Skill Builder Competition | Databricks | Apache Spark


Challenge Overview



This is the 1000-point Hard level problem of the Topcoder Skill Builder Competition for Databricks and Apache Spark. For more challenge context, register for the Host Competition before submitting a solution to this problem.

 

Technology Stack

  • Python

  • Scala

  • SQL

  • Apache Spark

  • Databricks

You can use Python, Scala, SQL, or a combination of these in your notebook. Using Apache Spark is mandatory.

Problem Statement

Notebook Setup

  • Sign up for the Databricks Community Edition

  • Note that during sign-up, you will be prompted to choose between the Business and Community editions. Be sure to select the Community Edition, which is the free version of Databricks

  • You will be working with the dataset available from the GitHub Archive. You can upload the dataset to your Databricks workspace and import it while working on the tasks below, or you can download it at run time (a sketch of the run-time approach follows this list). You should be familiar with GitHub / version control to understand the terminology used in the tasks below.

  • Once you have signed up, proceed with the steps below
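If you go the run-time route, a minimal sketch of downloading one hourly GitHub Archive file and loading it into a Spark DataFrame could look like the following. The date, hour, and paths are illustrative assumptions, and spark and dbutils are the objects a Databricks notebook already provides:

```python
# Minimal sketch: pull one hourly GitHub Archive file at run time (date/hour assumed).
import urllib.request

url = "https://data.gharchive.org/2021-10-31-0.json.gz"        # hour 0 of 31 Oct 2021 (assumed)
local_path = "/tmp/2021-10-31-0.json.gz"
urllib.request.urlretrieve(url, local_path)                     # download to the driver's local disk

# Copy the file into DBFS so Spark can read it; dbutils is predefined in Databricks notebooks.
dbutils.fs.cp(f"file:{local_path}", "dbfs:/FileStore/gharchive/2021-10-31-0.json.gz")

# spark.read.json handles gzipped JSON-lines files directly; one row per GitHub event.
events_df = spark.read.json("dbfs:/FileStore/gharchive/2021-10-31-0.json.gz")
events_df.printSchema()
```

GH Archive publishes one file per hour, so covering a full day simply means repeating this for each hourly file.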

 

Data Ingestion Task

 

Data Cleaning Task

  • Analyze the data and figure out the attribute(s) on events of type “PullRequestEvent” that tell you the language of the repository to which the pull request was submitted (HINT: it is nested under the “payload” attribute). You need to observe the values of two attributes:

    • One that lets you know that the pull request was opened

    • And the other that lets you know the language of the repository in which the pull request was opened.

  • Not all “PullRequestEvent” records will have this information. Ignore the records that lack it.

  • Thus, you will first filter out all records that are NOT of type “PullRequestEvent”, and then filter out all records that do not carry the language of the repository to which the pull request was submitted (a sketch of this filtering follows the list).
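As a rough guide, the filtering could be sketched as below. It assumes the ingested DataFrame is named events_df (as in the ingestion sketch above) and that the two attributes are “payload.action” and “payload.pull_request.base.repo.language”; identifying the real attributes is part of the task, so verify them against events_df.printSchema().

```python
from pyspark.sql import functions as F

pr_df = (
    events_df
    .filter(F.col("type") == "PullRequestEvent")                       # keep only pull request events
    .filter(F.col("payload.action") == "opened")                       # pull request was opened (assumed attribute)
    .withColumn("language", F.col("payload.pull_request.base.repo.language"))  # repo language (assumed attribute)
    .filter(F.col("language").isNotNull())                             # drop records without language info
)
pr_df.select("repo.name", "language").show(5, truncate=False)
```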

 

Data Processing Task 1

  • Once you have the dataset cleaned, write the commands to group the events based on the language associated with the repository

  • That is, from the pull request events you filtered in the previous task, determine the programming language of the repositories to which the pull requests were submitted, and group the repositories by language (see the sketch after this list)

  • Using the Databricks visualization feature, plot the languages as a pie chart. This should show each language's share and thus which languages are most popular.

  • Your notebook must contain all the commands in the cells that you used to arrive at this.
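A sketch of the grouping, assuming the cleaned DataFrame pr_df from the previous task:

```python
from pyspark.sql import functions as F

language_counts = (
    pr_df.groupBy("language")          # one group per repository language
         .count()
         .orderBy(F.col("count").desc())
)
# display() renders the result; switch the cell's visualization to a pie chart
# to see each language's share.
display(language_counts)
```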

 

Data Processing Task 2

  • Start with the data received from the Data Ingestion task earlier.

  • For events related to issues, the payload contains information about the labels associated with the issue. Ignore events that do not have the issue label information.

  • Collect the name attribute of the labels.

  • Determine the top 10 most popular label names and plot them as a simple bar chart, with the label name on the x-axis and the label count on the y-axis. Arrange them in ascending order of count (see the sketch after this list).

  • Your notebook must contain the commands in the cells that you used to arrive at this.
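A sketch of this task, assuming the ingested events_df, that issue events carry the type “IssuesEvent”, and that the labels sit under “payload.issue.labels” as an array of structs with a “name” field (all of these are assumptions to check against the schema):

```python
from pyspark.sql import functions as F

label_counts = (
    events_df
    .filter(F.col("type") == "IssuesEvent")                     # events related to issues (assumed type)
    .filter(F.size("payload.issue.labels") > 0)                 # ignore events with no label information
    .withColumn("label", F.explode("payload.issue.labels"))     # one row per label
    .groupBy(F.col("label.name").alias("label_name"))
    .count()
    .orderBy(F.col("count").desc())
    .limit(10)                                                   # top 10 label names
    .orderBy(F.col("count").asc())                               # ascending order for the bar chart
)
# Plot with the bar chart visualization: label_name on the x axis, count on the y axis.
display(label_counts)
```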

 

Data Processing Task 3

  • Start with the data received from the Data Ingestion task earlier.

  • GitHub allows users to watch the activity on a repository. For the 31st of October, determine the top 5 watched repositories (event type is “WatchEvent”).

  • Display them in a table with a single column whose header is “Repository Name”, containing the repository names. Arrange them in descending order, so the most-watched repository appears first (see the sketch after this list).

  • Your notebook must contain the commands in the cells that you used to arrive at this.
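A sketch, assuming events_df covers the 31st of October (as in the ingestion step) and that the repository name is available as “repo.name”:

```python
from pyspark.sql import functions as F

top_watched = (
    events_df
    .filter(F.col("type") == "WatchEvent")                       # watch events only
    .groupBy(F.col("repo.name").alias("repository_name"))
    .count()
    .orderBy(F.col("count").desc())                              # most-watched repository first
    .limit(5)
    .select(F.col("repository_name").alias("Repository Name"))   # single column with the required header
)
display(top_watched)
```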

 

Finally, publish your notebook. Databricks will provide you with the public URL where your notebook can be accessed.

 

Important Notes

  • Don’t just write the commands necessary to complete the tasks. You need to run all the cells in the notebook, display the output, verify that it meets the expectations, and then publish.

  • This contest is part of the Databricks Skill Builder Contest

  • Successfully completing the task will earn you 1000 points on the Databricks Skill Builder Leaderboard.

  • All tasks will be part of a single notebook. DO NOT provide multiple notebooks.

 

Problems

  1. Easy: 250 Points

  2. Medium: 500 Points

  3. Hard: 1000 Points - This contest



Final Submission Guidelines

Submit a text file that contains the link to your Databricks Notebook

 

ELIGIBLE EVENTS:

2021 Topcoder(R) Open

Review style: Final Review (Community Review Board)

Approval: User Sign-Off

ID: 30149041