Challenge Overview
This is the 250-point Easy-level problem of the Topcoder Skill Builder Competition for Databricks and Apache Spark. For more context on the challenge, register for the Host Competition before submitting a solution to this problem.
Technology Stack
- Python
- Scala
- SQL
- Apache Spark
- Databricks
You can use Python, Scala, SQL, or a combination of these in your notebook. You are encouraged to use Apache Spark, but it is optional.
Problem Statement
Notebook Setup
- Sign up for the Databricks Community Edition here.
- During sign-up, you will be prompted to choose between the Business and Community editions. Be sure to select the Community Edition, which is the free version of Databricks.
- You will be working with the dataset available in GitHub Archive. You can upload the dataset to your Databricks workspace and import it when working on the tasks below, or you can download it at run time. You need to be familiar with GitHub and version control to understand the terminology used in the tasks below.
- Once you have signed up, proceed with the steps below.
Data Ingestion Task
- Create a notebook in Databricks.
- In this notebook, import the data for October 1, 2020, at 9 AM UTC from GitHub Archive.
- Once imported, print the first 10 entries in the dataset in tabular format. Print ONLY the following attributes:
  - The event type (the type attribute)
  - The handle of the actor associated with the event (the actor.login attribute)
  - The repository name associated with the event (the repo.name attribute)
  - The date and time of the event (the created_at attribute)
- The table must include a header row.
- Your notebook must contain, in its cells, the commands that you used to arrive at the above result.
- Next, publish your notebook. Databricks will provide you with the public URL where your notebook can be accessed.
- That completes the task.
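The ingestion steps above can be sketched in plain Python. This is only an illustration, not the expected solution: the URL pattern (one gzipped JSON-lines file per hour on data.gharchive.org, with the hour not zero-padded) is an assumption that the task text does not state, and the helper names `archive_url`, `read_events`, `pick`, `format_table`, and `fetch_and_print` are hypothetical. It uses only the standard library rather than Spark, so that the selection logic is easy to follow.

```python
import gzip
import io
import json
import urllib.request
from datetime import datetime

# The four attributes the task asks for, as dotted paths into each event.
COLUMNS = ("type", "actor.login", "repo.name", "created_at")


def archive_url(hour_utc: datetime) -> str:
    """Build the hourly file URL (assumed pattern, hour not zero-padded)."""
    return f"https://data.gharchive.org/{hour_utc:%Y-%m-%d}-{hour_utc.hour}.json.gz"


def read_events(gz_bytes: bytes, limit: int = 10) -> list:
    """Parse the first `limit` events from gzipped JSON-lines bytes."""
    events = []
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if len(events) >= limit:
                break
            if line.strip():
                events.append(json.loads(line))
    return events


def pick(event: dict, dotted: str):
    """Follow a dotted path such as 'actor.login' through nested dicts."""
    value = event
    for key in dotted.split("."):
        value = value.get(key) if isinstance(value, dict) else None
    return value


def format_table(events: list) -> str:
    """Render the selected columns as a padded text table with a header row."""
    rows = [COLUMNS] + [tuple(str(pick(e, c)) for c in COLUMNS) for e in events]
    widths = [max(len(row[i]) for row in rows) for i in range(len(COLUMNS))]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


def fetch_and_print(hour_utc: datetime) -> None:
    """Download one hourly file and print its first 10 events as a table."""
    with urllib.request.urlopen(archive_url(hour_utc)) as resp:
        print(format_table(read_events(resp.read())))
```

Calling `fetch_and_print(datetime(2020, 10, 1, 9))` would attempt the download for the hour named in the task. In a Databricks notebook that already has a `spark` session, a roughly equivalent Spark version would read the file with `spark.read.json(path)` and then call `.select("type", "actor.login", "repo.name", "created_at").show(10, truncate=False)`, where `path` is wherever the downloaded file was stored.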
Important Notes
- Don’t just write the commands necessary to complete this task. You need to run all the cells in the notebook, display the output, verify that it meets the expectations, and then publish.
- This contest is part of the Databricks Skill Builder Contest.
- Successfully completing the task will earn you 250 points on the Databricks Skill Builder Leaderboard.
Problems
- Easy: 250 Points (this contest)
Final Submission Guidelines
Submit a text file that contains the link to your published Databricks notebook.