Challenge Overview
Background Overview
We want to update the way we periodically collect information from a website. Today, a person manually looks up the information on the Texas RRC (Railroad Commission) website and records it in Excel spreadsheets, and there is no means of data visualization or analytics.
In short, an administrator at our client manually updates this spreadsheet based on publicly available information.
We're going to build a process that scrapes the website to populate the majority of the information. The admin will still need to go in and verify that the information is being updated correctly and fill in missing values from some handwritten documents. The state of Texas still allows the submission of handwritten Well Permits and other documents, so automated extraction will be difficult for some elements.
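To make the admin's review step concrete, one could flag which scraped fields are still empty so the admin knows where to look. The following is a minimal sketch, assuming each scraped record is a plain dictionary; the function name and all field names are hypothetical and not taken from the actual spreadsheet.

```python
def fields_needing_review(record, required_fields):
    """Return the required fields that the scraper left empty or missing."""
    missing = []
    for field in required_fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Hypothetical record: the lease name came from a handwritten form and
# could not be scraped, and the permit date was never found at all.
scraped = {
    "api_no": "42-123-45678",
    "operator": "Example Operator LLC",
    "lease_name": "",
}
required = ["api_no", "operator", "lease_name", "permit_date"]
needs_review = fields_needing_review(scraped, required)
print(needs_review)
```

A list like this could drive a simple report for the admin instead of requiring a full manual pass over every row.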
Challenge Overview
For this challenge, we are looking to build a tool that will simulate the process and scrape data automatically:
- We have provided a Word doc showing how this is done manually today. We expect this challenge to simulate the same process using code.
- The data we’d like to scrape is highlighted with red circles in the document. We need to extract the values and save them into a SQL Server database; we’ll provide the table schema in the forum. The table schema should be updated to best match the data we are scraping in this challenge. Note that you can skip populating columns S, T, U, V, BE, BF, BG, BH in this challenge, but you must NOT delete these columns.
- We need to scrape as much data as possible based on what’s described in the Word doc, but for now we can skip the items that require OCR and simply insert a link to the document instead.
- You may use Python or Java for this tool.
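The scrape-and-store flow described above could be sketched roughly as follows. This is an illustration only, not the expected solution: the HTML fragment, the field labels, the `well_permits` table, and its column names are all hypothetical, and the standard-library `sqlite3` module stands in for SQL Server so the sketch runs without a database server or network access. A real solution would fetch pages from the RRC site, use a SQL Server driver, and follow the schema provided in the forum.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical HTML fragment; the real RRC pages will be structured differently.
SAMPLE_HTML = """
<table>
  <tr><th>API No.</th><td>42-123-45678</td></tr>
  <tr><th>Operator</th><td>Example Operator LLC</td></tr>
  <tr><th>Permit Document</th><td><a href="https://example.com/permit.pdf">View</a></td></tr>
</table>
"""


class LabelValueParser(HTMLParser):
    """Collects <th>label</th><td>value</td> pairs and any link inside the <td>."""

    def __init__(self):
        super().__init__()
        self.rows = {}     # label -> visible text of the value cell
        self.links = {}    # label -> href found inside the value cell
        self._cell = None  # "th" or "td" while the parser is inside one
        self._label = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell, self._text = tag, []
        elif tag == "a" and self._cell == "td":
            href = dict(attrs).get("href")
            if href:
                self.links[self._label] = href

    def handle_data(self, data):
        if self._cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._cell:
            text = "".join(self._text).strip()
            if tag == "th":
                self._label = text
            else:
                self.rows[self._label] = text
            self._cell = None


def scrape(html):
    parser = LabelValueParser()
    parser.feed(html)
    return {
        "api_no": parser.rows.get("API No."),
        "operator": parser.rows.get("Operator"),
        # Handwritten documents would need OCR, so only the link is stored.
        "permit_doc_url": parser.links.get("Permit Document"),
    }


def save(conn, record):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS well_permits ("
        "api_no TEXT PRIMARY KEY, operator TEXT, permit_doc_url TEXT, "
        "manual_notes TEXT)"  # stand-in for a column the scraper never populates
    )
    # Update scraped columns only, so values the admin entered by hand survive.
    cur = conn.execute(
        "UPDATE well_permits SET operator = :operator, "
        "permit_doc_url = :permit_doc_url WHERE api_no = :api_no",
        record,
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO well_permits (api_no, operator, permit_doc_url) "
            "VALUES (:api_no, :operator, :permit_doc_url)",
            record,
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
record = scrape(SAMPLE_HTML)
save(conn, record)
print(conn.execute("SELECT * FROM well_permits").fetchall())
```

The update-then-insert pattern is one way to keep the skipped columns in place and untouched across repeated runs; the `manual_notes` column here merely plays the role of columns such as S, T, U, V that must remain in the table but are not populated by the scraper.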
Technologies
Python / Java 8
The solution should work on different platforms (Mac / Linux / Windows)
Final Submission Guidelines
Submission Deliverables
A single zip file that contains the following:
- Complete code that covers the above-mentioned requirements.
- A detailed README in Markdown format describing how to configure, run, and verify your submission.
- Verification details should be provided (video or screenshots are welcome).