Project Gwalldata - Python Recursive PDF Obfuscator

Register
Submit a solution
The challenge is finished.

Challenge Overview

Project Gwalldata Overview

Thanks for your interest in the Project Gwalldata!

This is the very beginning of an interesting and building project based around python, text-processing, pdf and word documents. In the coming weeks and months there are many further contests planned as our complexity and needs ramp up and we're glad to have you on board from the start! Basically, our client has extremely large documents (both Word and pdf files) that need to be checked in various ways and with various methods for consistency amongst a possible large number of variables. Data preparation is going to involve creating realistic data to provide to the community so that we can simulate our real documents.

One of our first steps in order to complete data preparation for running our series of contests is going to be obfuscating data that occurs with-in a series of pdf documents. The data must be obfuscated in a way that we can later reverse so that we can release "real" data to the community for testing without actually providing real data. (e.g. if our file contains "Company A - Trade Secret Company A Has" then obfuscated it may become "Word Yes - Lava The Flexible Word Yes Temporary" - each word is replaced with something different but consistent)

Competition Task Overview

This challenge should develop a python command line utility to recursively obfuscate the contents of a folder containing PDFs and possibly folders of other folders and/or PDF documents.

There are many PDF read libraries for Python (e.g. pyPdf, PDFMiner, Reportlab, etc.) - you can choose your own preferred python pdf library to complete this task but you need to confirm the library license in challenge forum if it's not before using it in your submission. The general rule of thumb is if the library is free to use with no constraints, it's more likely to be accepted.

We need to create and store a mapping file and the suggested format is:

{originalValue1}={obfuscatedValue1}

{originalValue2}={obfuscatedValue2}

{originalValue3}={obfuscatedValue3}

....

When you process the words in the documents, for each word, you should check if it already has obfuscated value and create obfuscated value for it if no existing mappings. If you have a more appropriate suggestion for the mapping-format file in order to improve efficiency and/or storage space you're welcome to suggest and/or use it in your solution.

The original value - obfuscated value file should be written to a separate file.

You can assume we will have three types of the "words".

alphabetic, numeric and alphanumeric.

It is important that the obfuscated value should have same type as the original value. e.g. 123 -> 312, Word -> Otherword, Alpha231 -> Some123

In terms of images contained in the PDF files - you can ignore these. In terms of table structures - these should remain consistent (see attached PDF documents)

Testing

Unfortunately we do not have a large set of test data to supply with this contest, but have provided two PDF files for reference - please create/supply your own beyond these while you are developing and testing. Your submission should be able to recursively handle any number of pdf files contained within a root directory and it's sub-folders. Only one mapping file should be created and you should save it to the root level.



Final Submission Guidelines

Technology Overview

  • Python 2.7 should be used.
  • The Source Code should be clear and well commented.
  • A Deployment Guide should detail exactly how to test and run your submission.
  • Test Documents also showing your results should be included

Review style

Final Review

Community Review Board

Approval

User Sign-Off

Challenge links

ID: 30042894