Challenge Overview
Challenge Objectives
- Develop a Cloud Function to process the imported data (cleaning procedure)
- Build API endpoints
Project Background
- As part of this series of challenges, we are building a tool that helps the client identify duplicate data, decide which data to use, and finally save the data to the database. If the data satisfies more than one criterion, the application puts it in a set of staging tables for review. The client then logs into the UI built in an earlier challenge, checks/reviews the staging data, edits it or makes a selection, and lets the jobs process the data with the new criteria.
- In a previous challenge, we built Cloud Functions that import data from a Google Cloud Storage bucket into the Firestore database.
- In this challenge, we will implement the data cleaning functions and develop the new API services.
Technology Stack
- Cloud Functions
- Python
- Firestore
- CloudSQL
- PubSub
- CSV
Code access
See challenge forums to get access to the project repository. Use the develop branch as a starting point.
Code from the previous stage of the project (Java), which imports data from local CSV files, is available in a GitLab repository and can be used for reference - see the forums for access to the repository.
Individual requirements
The application monitors an Object Storage bucket and processes the files through Cloud Functions. There are 3 types of files uploaded to the bucket:
- Channel Mapping file - defines how the columns in the data file map to the global data schema. Data files are produced by different "vendors" and have different columns; we use this channel mapping file to map the vendor columns to the consolidated column names. The Channel Mapping file name will follow the pattern "CHANNEL_MAPPING*.csv".
- Ingest Reference file - connects the data files to the vendors (i.e., which mapping to use) and defines the well name and ID (well name and api_no_10 columns) and the stage number.
- Data files - these files contain the actual measurement records.
We have existing Cloud Functions that import the raw data into 5 different Firestore collections:
- ChannelMapping - the mapping of the raw data columns to the master dataset
- Well - info about wells
- Stage - info about stages
- IngestReference - info about all the files uploaded to the bucket
- MasterEvents - the raw data
The first task in this challenge is to move the IngestReference data from Firestore into a CloudSQL instance (Postgres). This should be a straightforward change, as the data model remains the same. The motivation for this change is that we need to search/filter the data in the ingest reference collection, and Firestore does not support filtering string fields by prefix, suffix, or wildcard matches.
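For reference, here is a minimal sketch of what the migration could look like, assuming a psycopg2 connection and illustrative column names (file_name, well_name, api_no_10, stage_no and error_message are placeholders; the real table should simply mirror the existing IngestReference model):

```python
# Sketch only: copy IngestReference documents from Firestore into Postgres.
# Column and field names below are placeholders, not mandated by the spec.
import psycopg2
from google.cloud import firestore

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS ingest_reference (
    id            TEXT PRIMARY KEY,
    file_name     TEXT,
    well_name     TEXT,
    api_no_10     TEXT,
    stage_no      INTEGER,
    error_message TEXT
);
"""

def migrate_ingest_reference(pg_conn):
    db = firestore.Client()
    with pg_conn.cursor() as cur:
        cur.execute(CREATE_SQL)
        for doc in db.collection("IngestReference").stream():
            data = doc.to_dict()
            cur.execute(
                """INSERT INTO ingest_reference (id, file_name, well_name, api_no_10, stage_no)
                   VALUES (%s, %s, %s, %s, %s)
                   ON CONFLICT (id) DO NOTHING""",
                (doc.id, data.get("fileName"), data.get("wellName"),
                 data.get("apiNo10"), data.get("stageNo")),
            )
    pg_conn.commit()
```

Once the data lives in Postgres, prefix/suffix/wildcard searches become plain LIKE queries, which is exactly the filtering Firestore could not provide.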
The second task is to add support for one more file type - the channel ranges definition. This file will be named "CHANNEL_RANGE*.csv", and the import function should simply import all of its records into the new ChannelRanges collection (overwriting all the data in the collection, if any exists). A sample file is attached in the forums. Also update the data processing function to store the error message (if any) in the IngestReference table.
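A rough sketch of the overwrite-style import, assuming the CSV exposes channel/min/max columns (the actual headers come from the sample file in the forums):

```python
# Sketch only: import CHANNEL_RANGE*.csv into the ChannelRanges collection,
# replacing whatever is already there. CSV column names are assumptions.
import csv
import io
from google.cloud import firestore, storage

def import_channel_ranges(bucket_name, blob_name):
    db = firestore.Client()
    col = db.collection("ChannelRanges")

    # Overwrite semantics: drop all existing documents first.
    for doc in col.stream():
        doc.reference.delete()

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    reader = csv.DictReader(io.StringIO(blob.download_as_text()))
    for row in reader:
        col.add({
            "channel": row["channel"],
            "min": float(row["min"]),
            "max": float(row["max"]),
        })
```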
The third task in this challenge is to build a Cloud Function that cleans the raw data (the MasterEvents collection). Data cleaning is done per stage (each record in MasterEvents is tied to a single Stage), so the input to the Cloud Function will be the unique stage ID. The cleaning procedure is as follows (a rough sketch of the logic follows the list):
- There are a lot of columns in MasterEvents records, and some share a common name but have a different suffix (_dup_n). For example, we can have variableA_dup_1, variableA_dup_2, etc. The data cleaning goal is to pick just one of these values as the correct one and discard the others.
- Each duplicate column is checked to determine the overall percentage of its data that is in range (the allowed range for a column is defined in the ChannelRanges collection - e.g., for variableA, check the channel ranges definition to get the allowed range).
- Next, each record in the stage is processed separately.
- For each duplicate column group, if there is only one column in range, that value is picked. If there are no values in range, but there is only one column that is in range 99% of the time (looking at all rows in the stage), then that column is picked.
- If all the columns in all the records of a stage are clean, the clean data is persisted into the Events collection and the Stage record is updated with status "Processed".
- Otherwise, the data records are not modified, but the Stage record is updated with status "Needs Manual Review" and its requiredReviewColumns attribute is set to the list of columns that need manual review (e.g. variableA, variableC, etc.).
- The function should return all stage information, the total number of records, the number of clean rows and the number of rows that require manual review (in JSON format).
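To make the intent of the rules above concrete, here is a simplified, in-memory sketch of the per-stage cleaning pass. It assumes records are plain dicts and that `ranges` maps each base column name to its (min, max) taken from ChannelRanges; the handling of the "more than one value in range" case and the exact JSON keys are assumptions, since the spec leaves them open:

```python
# Sketch only: per-stage cleaning of duplicate (_dup_n) columns.
import re
from collections import defaultdict

DUP_RE = re.compile(r"^(?P<base>.+)_dup_\d+$")

def clean_stage(records, ranges):
    """records: list of dicts for one stage; ranges: {base column: (min, max)}."""
    # Group duplicate columns by base name, e.g. variableA_dup_1 / variableA_dup_2 -> variableA.
    groups = defaultdict(set)
    for rec in records:
        for col in rec:
            m = DUP_RE.match(col)
            if m:
                groups[m.group("base")].add(col)

    def pct_in_range(col, lo, hi):
        # Share of rows in the whole stage where this duplicate column is within range.
        vals = [r[col] for r in records if col in r]
        return sum(lo <= v <= hi for v in vals) / len(vals) if vals else 0.0

    clean_rows, review_columns = [], set()
    for rec in records:
        out, row_clean = dict(rec), True
        for base, cols in groups.items():
            lo, hi = ranges[base]
            in_range = [c for c in cols if c in rec and lo <= rec[c] <= hi]
            if len(in_range) == 1:
                pick = in_range[0]
            elif not in_range:
                # Fallback: the single column that is in range >= 99% of the time across the stage.
                mostly = [c for c in cols if pct_in_range(c, lo, hi) >= 0.99]
                pick = mostly[0] if len(mostly) == 1 else None
            else:
                pick = None  # several in-range values: treated as needing review (assumption)
            if pick is None:
                review_columns.add(base)
                row_clean = False
            else:
                out[base] = rec[pick]
            for c in cols:
                out.pop(c, None)
        if row_clean:
            clean_rows.append(out)

    # Returned shape is illustrative; the real function also returns the stage info itself.
    return {
        "totalRecords": len(records),
        "cleanRows": len(clean_rows),
        "rowsNeedingReview": len(records) - len(clean_rows),
        "requiredReviewColumns": sorted(review_columns),
    }
```

If every row comes out clean, the caller would persist the cleaned rows to the Events collection and mark the Stage "Processed"; otherwise it would leave the raw data untouched and set the "Needs Manual Review" status and requiredReviewColumns, as described above.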
The fourth task in this challenge is to design and implement an API for getting information about the data in the IngestReference table. This will be a single endpoint that accepts a single parameter named basePath (a string, the base directory to search the files in) and returns a list of all the matching records.
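One possible shape for this endpoint, written as an HTTP-triggered Cloud Function; the PG_DSN environment variable, the file_name column and the response format are all assumptions, not part of the spec:

```python
# Sketch only: HTTP Cloud Function that lists IngestReference rows matching basePath.
import json
import os

import psycopg2
import psycopg2.extras

def search_ingest_reference(request):
    """Returns IngestReference rows whose file path starts with the basePath parameter."""
    base_path = request.args.get("basePath", "")
    conn = psycopg2.connect(os.environ["PG_DSN"])  # assumed connection string from env config
    try:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            # A prefix match like this is exactly what Firestore could not give us.
            cur.execute(
                "SELECT * FROM ingest_reference WHERE file_name LIKE %s ORDER BY file_name",
                (base_path + "%",),
            )
            rows = cur.fetchall()
    finally:
        conn.close()
    return (json.dumps(rows, default=str), 200, {"Content-Type": "application/json"})
```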
Sample data files are posted in the forums. Logging is required for the Cloud Functions, and unit tests are required (minimum 80% coverage). The bucket name, database connection parameters, etc. should be configurable via the Firebase environment.
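One simple way to keep these values configurable is to read them from environment variables set at deploy time (or from the Firebase environment configuration, if the functions are managed through Firebase); the variable names below are placeholders only:

```python
import os

# Placeholder variable names - document the actual keys in the README.
DATA_BUCKET = os.environ.get("DATA_BUCKET")
PG_DSN = os.environ.get("PG_DSN")
```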
Create a README file with details on how to deploy and verify the Cloud Functions. The deployment should be done using the gcloud CLI tool.
What To Submit
- Submit the full code
- Submit the build/deployment/test documentation
- Submit a short demo video