Challenge Overview
Challenge Objectives
- Develop Cloud Functions to process files in a Google Cloud Object Storage bucket
- Store the data in a Cloud Firestore database
Project Background
- As part of this series of challenges, we are building a tool that will help the client identify duplicate data, decide which data to use, and finally save the data to the database. If the data satisfies more than one criterion, the application puts the data in a set of staging tables for review. The client will then log into the UI we built earlier, check/review the staging data, edit it or make a selection, and let the jobs process the data with the new criteria.
- In this challenge, we’ll build cloud functions that import the data from a Google Cloud Object Storage bucket into the Firestore database.
- Future challenges will implement data cleaning functions and develop the new API services.
Technology Stack
- Cloud Functions
- Python (preferred) or NodeJS (secondary)
- Cloud Firestore
- PubSub
- CSV
Code access
We’re starting a new codebase for the cloud functions, so you should create the project structure.
Code from the previous stage of the project (Java) that imports data from local CSV files is available in a GitLab repository and can be used for reference - see the forums for access to the repository.
Individual requirements
The application will monitor an Object Storage bucket and process the files through Cloud functions. There are 3 types of files uploaded to the bucket:
- Channel Mapping file - defines how the columns in a data file map to the global data schema. Data files are produced by different “vendors” and have different columns. We use this channel mapping file to map the vendor columns to the consolidated column names. The Channel Mapping file name will follow this pattern: “CHANNEL_MAPPING*.csv”
- Ingest Reference file - connects the data files to the vendors (i.e. which mapping to use), and defines the well name and ID (well name and api_no_10 columns) and the stage number
- Data files - these files contain the actual measurement records
The expected workflow is that a user uploads the files to the bucket, and a cloud function is triggered to read the file information and store it in the Firestore database. Once all the information for a file is available (there is a Channel Mapping file, an Ingest Reference file, and the data file), another cloud function is triggered (using PubSub) to actually process the file, read the channel mapping, and store the normalized records in the MasterEvents collection.
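For orientation, here is a minimal sketch of how the upload trigger could be wired in Python, assuming a background Cloud Function bound to the bucket’s google.storage.object.finalize event. The naming rule used to recognize Ingest Reference files is an assumption (only the Channel Mapping pattern is defined above), and the function names are illustrative.

```python
# Minimal sketch of the upload-trigger entry point (Python background Cloud
# Function on google.storage.object.finalize). The Ingest Reference naming
# convention is an assumption; only the Channel Mapping pattern is specified.
import logging

def classify_file(name: str) -> str:
    """Classify an uploaded object as one of the three expected file types."""
    upper = name.upper()
    if upper.startswith("CHANNEL_MAPPING") and upper.endswith(".CSV"):
        return "CHANNEL_MAPPING"
    if upper.startswith("INGEST_REFERENCE"):  # assumed naming convention
        return "INGEST_REFERENCE"
    return "DATA_FILE"

def on_file_uploaded(event, context):
    """Triggered for each file uploaded to the monitored bucket."""
    bucket, name = event["bucket"], event["name"]
    file_type = classify_file(name)
    logging.info("File %s uploaded to %s, classified as %s", name, bucket, file_type)
    # Dispatch to the handlers described in the requirements below.
```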
- File uploaded Cloud function
  This function will be triggered for each file uploaded to the bucket. The actions depend on the type of file being uploaded:
  - Channel Mapping file - just insert the parsed data into the ChannelMapping collection (a parsing sketch follows this list)
  - Ingest Reference file
    - Upsert a record in the Well collection (a collection that contains all info about wells - well name, api_no_10)
    - Insert a record in the Stage collection (stage_no, api_no_10, county, vendor)
    - For each of the records in the file, do the following in a transaction over the IngestReference collection (a transaction sketch follows this list):
      - if there is no existing record with the same file path - insert the record and set the status to "WAITING_FOR_FILE"
      - if there is a record, update it, set the status to "PENDING_PROCESSING", and emit an event to trigger the file processing cloud function (next section), passing the file path as a parameter
  - Data file
    In a transaction over the IngestReference collection, for each record:
    - If there is no existing record in the IngestReference collection with the same path - insert it (filename, path) and set the status to "WAITING_FOR_REFERENCE"
    - If there is an existing record, update it, set the status to "PENDING_PROCESSING", and emit an event to trigger the file processing cloud function (next section), passing the file path as a parameter
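As a rough illustration of the Channel Mapping branch, the sketch below parses the CSV from the bucket and inserts the rows, assuming the google-cloud-storage and google-cloud-firestore client libraries; the per-row document layout (raw CSV columns plus a source_file field) is an assumption.

```python
# Sketch of the Channel Mapping branch, assuming the google-cloud-storage and
# google-cloud-firestore clients. The per-row document layout is an assumption.
import csv
import io
import logging

from google.cloud import firestore, storage

def handle_channel_mapping(bucket_name: str, blob_name: str) -> None:
    """Parse a CHANNEL_MAPPING*.csv file and insert its rows into ChannelMapping."""
    text = storage.Client().bucket(bucket_name).blob(blob_name).download_as_text()
    rows = list(csv.DictReader(io.StringIO(text)))

    db = firestore.Client()
    batch = db.batch()
    for row in rows:  # assumes the file stays under Firestore's 500-writes-per-batch limit
        batch.set(db.collection("ChannelMapping").document(), {**row, "source_file": blob_name})
    batch.commit()
    logging.info("Inserted %d ChannelMapping rows from %s", len(rows), blob_name)
```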
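Both the Ingest Reference and the Data file branches need the same transactional insert-or-update on IngestReference, differing only in the waiting status (WAITING_FOR_FILE vs WAITING_FOR_REFERENCE). One possible shape is sketched below; deriving the document ID from the file path, the field names, and the PubSub topic/environment variables are all assumptions.

```python
# Sketch of the shared IngestReference transaction. The path-derived document
# ID, the field names, and the PubSub topic/environment variables are assumptions.
import hashlib
import os

from google.cloud import firestore, pubsub_v1

db = firestore.Client()
publisher = pubsub_v1.PublisherClient()
TOPIC = publisher.topic_path(os.environ["GCP_PROJECT"], os.environ.get("PROCESS_TOPIC", "process-file"))

def _doc_id(path: str) -> str:
    """Deterministic document ID so one file path always maps to one record."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()

@firestore.transactional
def _upsert(transaction, path: str, fields: dict, waiting_status: str) -> bool:
    """Insert with the waiting status, or mark the record ready for processing."""
    doc_ref = db.collection("IngestReference").document(_doc_id(path))
    snapshot = doc_ref.get(transaction=transaction)
    if not snapshot.exists:
        transaction.set(doc_ref, {**fields, "path": path, "status": waiting_status})
        return False
    transaction.update(doc_ref, {**fields, "status": "PENDING_PROCESSING"})
    return True

def upsert_ingest_reference(path: str, fields: dict, waiting_status: str) -> None:
    if _upsert(db.transaction(), path, fields, waiting_status):
        # Emit the event that triggers the file processing function (next section).
        publisher.publish(TOPIC, data=path.encode("utf-8")).result()
```

Publishing is kept outside the transactional function so that a transaction retry does not emit duplicate events.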
- File processing Cloud function
  This function will be called for each input data file, probably with many instances in parallel. Here is what it needs to do:
  - Read the file from the bucket
  - Read the ChannelMapping configuration
  - Read the IngestReference record for the file
  - Normalize the records (a normalization sketch follows this list):
    - If a column has only zeros for all records, it should be removed
    - Map each column to its consolidated column name using the channel mapping file and the vendor name. Note that in the consolidated file the same column name can repeat multiple times - each repetition should be named with a _dup_N suffix. For example, if there are 3 columns that map to FR_Conc, the final column names will be FR_Conc_dup_0, FR_Conc_dup_1 and FR_Conc_dup_2
  - Save each record into the MasterEvents collection
  - Update the record in the IngestReference collection to "PROCESSED", or to "FAILED" in case of errors
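A possible implementation of the normalization step is sketched below. It assumes the records have already been parsed into a list of dicts and the channel mapping reduced to a simple {vendor_column: consolidated_name} dict (both assumptions); how a consolidated name that occurs only once should be named is not specified, so it is left unsuffixed here.

```python
# Sketch of the normalization step. Assumes records are already parsed into a
# list of dicts and the channel mapping reduced to {vendor_column: consolidated_name}.
from typing import Dict, List

def _is_zero(value) -> bool:
    try:
        return float(value) == 0.0
    except (TypeError, ValueError):
        return False

def normalize_records(records: List[Dict[str, str]], mapping: Dict[str, str]) -> List[Dict[str, str]]:
    """Drop all-zero columns and rename to consolidated names, suffixing repeats with _dup_N."""
    if not records:
        return []

    # 1. Remove columns that contain only zeros across all records.
    columns = [c for c in records[0] if not all(_is_zero(r.get(c)) for r in records)]

    # 2. Rename using the channel mapping; repeated consolidated names become
    #    name_dup_0, name_dup_1, ... (names occurring once are kept plain -
    #    the spec only shows the repeated case, so this is an assumption).
    targets = [mapping.get(c, c) for c in columns]
    totals = {t: targets.count(t) for t in targets}
    seen: Dict[str, int] = {}
    renamed = {}
    for column, target in zip(columns, targets):
        if totals[target] > 1:
            renamed[column] = f"{target}_dup_{seen.get(target, 0)}"
            seen[target] = seen.get(target, 0) + 1
        else:
            renamed[column] = target

    return [{renamed[c]: r.get(c) for c in columns} for r in records]
```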
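Persisting the normalized records and updating the status could then look roughly like this; the batch size, the document layout, and catching all exceptions to mark the record FAILED are assumptions.

```python
# Sketch of writing normalized records to MasterEvents and updating the
# IngestReference status. Batch size and error handling are assumptions.
import logging

from google.cloud import firestore

def save_and_finalize(db: firestore.Client, ingest_ref_doc_id: str, records) -> None:
    try:
        batch, pending = db.batch(), 0
        for record in records:
            batch.set(db.collection("MasterEvents").document(), record)
            pending += 1
            if pending == 400:  # stay under Firestore's 500-writes-per-batch limit
                batch.commit()
                batch, pending = db.batch(), 0
        if pending:
            batch.commit()
        status = "PROCESSED"
    except Exception:
        logging.exception("Failed to store MasterEvents records for %s", ingest_ref_doc_id)
        status = "FAILED"
    db.collection("IngestReference").document(ingest_ref_doc_id).update({"status": status})
```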
Sample data files are posted in the forums. Logging is required for the cloud functions, and unit tests are required. The bucket name, database connection parameters, etc. should be configurable using the Firebase environment configuration.
Create a README file with details on how to deploy and verify the cloud functions. Deployment should be done using the gcloud CLI tool.
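One way to satisfy the configuration requirement in a Python deployment is to read settings from environment variables supplied at deploy time (the Firebase environment config mentioned above maps most directly to a NodeJS implementation); the variable names below are illustrative only.

```python
# Sketch of configuration via environment variables set at deploy time.
# Variable names are illustrative, not prescribed by the spec.
import os

BUCKET_NAME = os.environ.get("BUCKET_NAME", "")
PROCESS_TOPIC = os.environ.get("PROCESS_TOPIC", "process-file")
MASTER_EVENTS_COLLECTION = os.environ.get("MASTER_EVENTS_COLLECTION", "MasterEvents")
```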
What To Submit
Submit the full code
Submit the build/deployment/test documentation
Submit a short demo video