Challenge Overview
Challenge Objectives
- Develop Cloud Functions to support frontend features: processing, manual review, and lookups
Project Background
- As part of this series of challenges, we are building a tool that will help the client identify duplicate data, decide which data to use, and finally save the data to the database. If the data satisfies more than one criterion, the application puts it in a set of staging tables for review. The client will then log into the UI built in an earlier challenge, check/review the staging data, edit it or make a selection, and let the jobs process the data with the new criteria.
- In a previous challenge, we built Cloud Functions that import the data from a Google Cloud Storage bucket into the Firestore and CloudSQL databases and clean the data automatically.
- In this challenge, we will implement new manual review and lookup API endpoints.
Technology Stack
- Cloud Functions
- Python
- Firestore
- CloudSQL
- Pub/Sub
- CSV
Code access
See challenge forums to get access to the project repository. Use the develop branch as a starting point.
Code from the previous stage of the project (Java) that imports data from local CSV files is available in a GitLab repository and can be used for reference - see the forums for access to that repository.
Individual requirements
The application monitors an Object Storage bucket and processes the files through Cloud Functions. There are four types of files uploaded to the bucket (see the routing sketch after this list):
- Channel Mapping File - defines how the columns in a data file map to the global data schema. Data files are produced by different "vendors" and have different columns; the channel mapping file maps the vendor columns to the consolidated column names. The channel mapping file name follows the pattern "CHANNEL_MAPPING*.csv".
- Ingest Reference File - connects the data files to the vendors (which mapping to use) and defines the well name and ID (well name and api_no_10 columns) and the stage number.
- Data Files - these files contain the actual measurement records.
- Channel Ranges Definition - these files contain the allowed value ranges for all columns in the master data set and are used for cleaning the data (from a set of duplicate columns we pick only the one that is in the allowed range).
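As a minimal illustration of how the import function might route an uploaded file by its name when the storage trigger fires (only the CHANNEL_MAPPING*.csv pattern is specified above - the other patterns and the function names are assumptions):

import fnmatch

# Assumed name patterns: only CHANNEL_MAPPING*.csv is given in the spec;
# the other two are placeholders used purely for illustration.
FILE_PATTERNS = {
    "CHANNEL_MAPPING*.csv": "channel_mapping",
    "INGEST_REFERENCE*.csv": "ingest_reference",
    "CHANNEL_RANGES*.csv": "channel_ranges",
}

def classify_upload(file_name):
    """Return the type of an uploaded file, defaulting to a data file."""
    for pattern, file_type in FILE_PATTERNS.items():
        if fnmatch.fnmatch(file_name, pattern):
            return file_type
    return "data"

def on_file_uploaded(event, context):
    """Background Cloud Function for google.storage.object.finalize events."""
    file_type = classify_upload(event["name"])
    print("Processing %s from bucket %s as %s" % (event["name"], event["bucket"], file_type))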
We have existing Cloud Functions that import the raw data into Firestore and CloudSQL - 5 different collections:
- ChannelMapping - the mapping of the raw data columns to the master dataset
- Well - info about wells
- Stage - info about stages
- IngestReference - info about all the files uploaded to the bucket (this is stored in CloudSQL)
- MasterEvents - the raw data
The below endpoints are identical to the endpoints in the old backend - they just need to be ported to Python as Cloud Functions.
Lookup endpoint - wells
This endpoint will simply list records from the wells table and will support filtering by name and api_no_10.
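A rough sketch of what this lookup could look like as an HTTP Cloud Function, assuming the wells live in a Firestore collection named Well with name and api_no_10 fields (names are illustrative, not confirmed):

import json

from google.cloud import firestore

db = firestore.Client()

def lookup_wells(request):
    """HTTP Cloud Function: list wells, optionally filtered by name and api_no_10."""
    name = request.args.get("name")
    api_no_10 = request.args.get("api_no_10")

    query = db.collection("Well")
    if name:
        query = query.where("name", "==", name)
    if api_no_10:
        query = query.where("api_no_10", "==", api_no_10)

    wells = [doc.to_dict() for doc in query.stream()]
    return (json.dumps(wells, default=str), 200, {"Content-Type": "application/json"})

The stage lookup below can follow the same pattern, with additional range filters (>= / <=) on the start and end date fields.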
Lookup endpoint - stage
List records from the Stage table and support filtering by well name, api_no_10, and start and end date.
Login endpoint
This endpoint will authenticate a user with a username/password combination and return a JWT token. Create a Users collection in Firestore and prepopulate it with a default user account. Make sure to hash user passwords - do not store plain-text values.
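A sketch of the login flow, assuming passwords are stored as bcrypt hashes in the Users collection and that the PyJWT and bcrypt packages are used; the field names and the JWT_SECRET environment variable are assumptions:

import datetime
import json
import os

import bcrypt
import jwt
from google.cloud import firestore

db = firestore.Client()
JWT_SECRET = os.environ.get("JWT_SECRET", "change-me")  # assumed configuration value

def login(request):
    """HTTP Cloud Function: verify username/password and return a JWT token."""
    body = request.get_json(silent=True) or {}
    username = body.get("username", "")
    password = body.get("password", "")

    docs = list(db.collection("Users").where("username", "==", username).limit(1).stream())
    if not docs:
        return (json.dumps({"error": "invalid credentials"}), 401, {"Content-Type": "application/json"})

    user = docs[0].to_dict()
    if not bcrypt.checkpw(password.encode("utf-8"), user["password_hash"].encode("utf-8")):
        return (json.dumps({"error": "invalid credentials"}), 401, {"Content-Type": "application/json"})

    # PyJWT 2.x returns a str; the token expires after one hour here.
    token = jwt.encode(
        {"sub": username, "exp": datetime.datetime.utcnow() + datetime.timedelta(hours=1)},
        JWT_SECRET,
        algorithm="HS256",
    )
    return (json.dumps({"token": token}), 200, {"Content-Type": "application/json"})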
Get manual review details
Input parameters are well name and stage number. The response should follow the format of the old API and contain a list of variable groups (a variable group is a duplicate column group - for example, for columns Test_rate_dup_1 and Test_rate_dup_2 the group name is Test_rate, and the individual variables are channels). For each channel, include its name and statistics: number of records in range, maximum, minimum, median, mean, and variance.
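The per-channel statistics can be computed with the standard library; the sketch below assumes the in-range values for a channel are already loaded into a non-empty list, and derives the group name by stripping the _dup_N suffix:

import re
import statistics

def group_name(column):
    """Strip a trailing _dup_N suffix, e.g. Test_rate_dup_2 -> Test_rate."""
    return re.sub(r"_dup_\d+$", "", column)

def channel_stats(values):
    """Statistics for the in-range values of one channel."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        # population variance; swap in statistics.variance if sample variance is expected
        "variance": statistics.pvariance(values),
    }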
Get channel data values
This endpoint is used to get data values for a specific channel. Input parameters are wellName, stage number, channel name, start date (optional), and sparse time interval (seconds, optional). It will return the data from the MasterEvents collection for that channel/stage/well and will skip values according to the sparse time interval parameter (for example, if the sparse time interval is 10, select only records where the time since the previous record is at least 10 seconds). This is done for performance reasons, to avoid sending all the data unnecessarily.
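One way to read the sparse time interval (this sketch interprets "the previous record" as the last record that was kept, and assumes each record carries a datetime timestamp field):

def sparse_filter(records, interval_seconds):
    """Keep only records spaced at least interval_seconds apart.

    records is assumed to be sorted by its "timestamp" (datetime) value.
    """
    if not interval_seconds:
        return list(records)

    kept = []
    last_ts = None
    for record in records:
        ts = record["timestamp"]
        if last_ts is None or (ts - last_ts).total_seconds() >= interval_seconds:
            kept.append(record)
            last_ts = ts
    return kept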
Submit manual review
Inputs to this endpoint are well name, stage number, and a list of columns to pick as clean values. All other duplicate value columns should be discarded, all clean data for that stage should be copied to the Events collection, and the stage record status should be updated to "PROCESSED".
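A rough outline of the submit step, assuming the clean rows for the stage have already been assembled and that the stage document can be located by well name and stage number; the field names (well_name, stage_number, status) are assumptions:

from google.cloud import firestore

db = firestore.Client()

def apply_manual_review(well_name, stage_number, selected_columns, clean_rows):
    """Copy the selected clean data to Events and mark the stage PROCESSED."""
    # A real implementation would split batches of more than 500 writes.
    batch = db.batch()
    events = db.collection("Events")

    for row in clean_rows:
        # keep the columns the reviewer picked plus any non-duplicate fields
        event = {k: v for k, v in row.items() if k in selected_columns or "_dup_" not in k}
        batch.set(events.document(), event)

    stage_query = (
        db.collection("Stage")
        .where("well_name", "==", well_name)
        .where("stage_number", "==", stage_number)
        .limit(1)
    )
    for doc in stage_query.stream():
        batch.update(doc.reference, {"status": "PROCESSED"})

    batch.commit()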
Please refactor the existing and new functions into separate modules - main.py has become too large to manage, so the idea is to split the functions into smaller modules.
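One possible split (module names are only suggestions):

main.py          - thin entry points that export the deployed Cloud Functions
ingest.py        - existing import and cleaning logic
lookups.py       - wells and stage lookup endpoints
auth.py          - login endpoint and JWT helpers
manual_review.py - review details, channel data, and submit review endpoints
config.py        - environment/configuration helpers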
Sample data files are posted in the forums. Logging is required for the Cloud Functions, and unit tests are required (minimum coverage 80%). Bucket name, database connection parameters, etc. should be configurable using the Firebase environment.
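For example, the configuration could be read from environment variables set at deploy time (the variable names below are only examples):

import os

BUCKET_NAME = os.environ.get("BUCKET_NAME")
DB_CONNECTION_NAME = os.environ.get("DB_CONNECTION_NAME")
DB_USER = os.environ.get("DB_USER")
DB_PASSWORD = os.environ.get("DB_PASSWORD")
JWT_SECRET = os.environ.get("JWT_SECRET")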
Create a README file with details on how to deploy and verify the cloud functions. The deployment should be done using gcloud CLI tool.
What To Submit
Submit the full code
Submit the build/deployment/test documentation
Submit a short demo video that demonstrates the complete flow - uploading data files (one that will be cleaned automatically and another that will require manual review), triggering processing of records, getting the status of the stage processing, listing wells/stages/channels, getting manual review data, and submitting a manual review