VA Online Memorial - Data scraper improvements


Challenge Overview

The Department of Veterans Affairs' (VA) National Cemetery Administration (NCA) seeks to create an interactive digital experience that enables virtual memorialization of the millions of people interred at VA national cemeteries. This online memorial space will allow visitors to honor, cherish, share, and pay their respects and permit researchers, amateurs, students, and professionals to share information about Veterans.

In a previous challenge we built a data scraper that imports Veterans' burial data into our Postgres database. In this challenge we will improve its performance and change the Veteran identification logic.
First, importing the entire data set currently requires significant resources (4 GB+ of memory) and is a lengthy process. The main cause is reading entire CSV files into memory instead of processing them line by line. The goal is to improve performance so the import can run with smaller resources (for example, on a Heroku instance); see the streaming sketch below.
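The following is a minimal sketch of a line-by-line import, assuming Node.js (the base code is a Node project). The processRow callback is a hypothetical hook that would map a parsed row onto the existing models, and the CSV handling here is simplified (it does not handle quoted fields).

```js
const fs = require('fs');
const readline = require('readline');

// Stream the file line by line so only the current row is held in memory.
async function importFile(filePath, processRow) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity
  });

  let header = null;
  for await (const line of rl) {
    const values = line.split(','); // naive split; a real CSV parser should handle quoted fields
    if (!header) {
      header = values;              // first line holds the column names
      continue;
    }
    const row = {};
    header.forEach((name, i) => { row[name] = values[i]; });
    await processRow(row);          // hand off one row at a time
  }
}

module.exports = { importFile };
```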
Second, the scraper uses a database transaction for each row of the input file, which can leave partially imported data if an error occurs. Change this so that we use one transaction per input file: on any error, the entire file is either imported fully or no changes are made to the database. A sketch follows.
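Below is a minimal sketch of a per-file transaction, assuming the scraper uses Sequelize against Postgres as in the base repository. The importFile helper is the streaming sketch above, and the Veteran model name is an illustrative assumption.

```js
// Wrap the whole file in a single managed transaction: it commits when the
// callback resolves and rolls back automatically if any row throws, so each
// file imports all-or-nothing.
async function importFileAtomically(sequelize, filePath) {
  await sequelize.transaction(async (transaction) => {
    await importFile(filePath, async (row) => {
      // every write for this file shares the same transaction
      await sequelize.models.Veteran.create(row, { transaction });
    });
  });
}
```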
Third, the scraper ignores any row that has no information in these columns: first/last name, birth/burial date, and cemetery name/city/address. That keeps the imported data complete, but it also skips a lot of records. We want you to analyze the skipped rows and propose one or more alternative import strategies that would yield better results. For example, you might suggest using the "relationship" and "v_first_name" columns as an additional index when the birth/death date columns are empty (this is a made-up example, not based on the data). We understand that any additional information will cause us to import more rows, but potentially also duplicate rows. You do not need to implement the new import strategy, but you must document the steps and the results (e.g., with the example above: we would import XX new rows, but running the import again [sync] would create N duplicate rows). A sketch for tallying which required columns are missing in skipped rows is shown below.
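This minimal sketch reuses the streaming importFile above to count, for every row that fails the current completeness check, which required columns are empty, so alternative strategies can be compared. The column names in REQUIRED are assumptions based on the challenge text, not verified against the data set.

```js
// Assumed column names for the current completeness check.
const REQUIRED = [
  'd_first_name', 'd_last_name',
  'd_birth_date', 'd_death_date',
  'cem_name', 'cem_city', 'cem_addr_one'
];

// Returns how many rows would be skipped and how often each required
// column is the reason.
async function analyzeSkippedRows(filePath) {
  const missingCounts = {};
  let skipped = 0;
  await importFile(filePath, async (row) => {
    const missing = REQUIRED.filter((col) => !row[col] || !row[col].trim());
    if (missing.length === 0) return; // row passes the current check
    skipped += 1;
    for (const col of missing) {
      missingCounts[col] = (missingCounts[col] || 0) + 1;
    }
  });
  return { skipped, missingCounts };
}
```

The resulting counts can back the strategy document, e.g. showing how many skipped rows could be recovered by relaxing a given column requirement.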

Base code is available at https://github.com/topcoderinc/va-online-memorial



Final Submission Guidelines

Submit the updated scraper source code and the updated deployment guide (DG)
Submit a document with suggestions for improving import

ELIGIBLE EVENTS:

2018 Topcoder(R) Open

Review style: Final Review (Community Review Board)

Approval: User Sign-Off

ID: 30062318