Project Gwalldata - Python Word and PDF Table Data Consistency Checker

The challenge is finished.

Challenge Overview

Project Gwalldata Overview

Thanks for your interest in Project Gwalldata!

This is the very beginning of an interesting, growing project built around Python, text processing, and PDF and Word documents. Many further contests are planned for the coming weeks and months as our complexity and needs ramp up, and we're glad to have you on board from the start! Our client has extremely large documents (both Word and PDF files) that need to be checked in various ways, using various methods, for consistency across a potentially large number of variables. Data preparation will involve creating realistic data to provide to the community so that we can simulate our real documents.

One of our first steps in completing data preparation for this series of contests will be obfuscating the data that occurs within a series of PDF documents.

Competition Task Overview

This challenge is to develop a Python command-line utility that looks at the tables in a PDF or Word document and checks them for data consistency. This means that all numbers in a table should have the same number of decimal places (or none), and any units should be consistent too.
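The consistency rule above can be sketched for a single table column as follows. This is only a minimal illustration, not a reference implementation: it assumes the table cells have already been extracted as strings, and the names `CELL_RE`, `parse_cell`, and `check_column` are hypothetical. It also assumes that the most common decimal count and unit in a column are the expected ones, which the challenge does not specify.

```python
import re
from collections import Counter

# Hypothetical cell pattern: optional sign, digits, optional decimal part,
# then an optional unit that contains no digits (e.g. "12.50 kg").
# Thousands separators and units with digits (e.g. "m2") are not handled.
CELL_RE = re.compile(r'^\s*[-+]?\d+(?:\.(\d+))?\s*([^\d\s,.][^\d]*?)?\s*$')


def parse_cell(text):
    """Return (decimal_places, unit) for a numeric cell, or None otherwise."""
    match = CELL_RE.match(text)
    if match is None:
        return None
    decimals = len(match.group(1)) if match.group(1) else 0
    unit = match.group(2) or ''
    return decimals, unit


def check_column(cells):
    """Return (index, cell, message) for each cell that deviates from the
    most common decimal count or unit among the numeric cells."""
    numeric = []
    for index, cell in enumerate(cells):
        parsed = parse_cell(cell)
        if parsed is not None:  # non-numeric cells such as "n/a" are skipped
            numeric.append((index, cell, parsed))
    if not numeric:
        return []

    # Assumption: the majority value defines what "consistent" means.
    common_decimals = Counter(p[0] for _, _, p in numeric).most_common(1)[0][0]
    common_unit = Counter(p[1] for _, _, p in numeric).most_common(1)[0][0]

    issues = []
    for index, cell, (decimals, unit) in numeric:
        if decimals != common_decimals:
            issues.append((index, cell, 'expected %d decimal places, found %d'
                           % (common_decimals, decimals)))
        if unit != common_unit:
            issues.append((index, cell, 'expected unit %r, found %r'
                           % (common_unit, unit)))
    return issues
```

Calling `check_column` once per extracted column keeps the check independent of how the tables were read from the PDF or Word file. If two values are equally common, the choice of the expected one is arbitrary in this sketch.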

The existing utility, word-counter, uses the Win32 API to read .doc / .docx files and pyPdf to read PDF documents.

We would prefer that the same libraries be used for this challenge. If you have a better alternative, please confirm it with us in the challenge forum.

The utility should output a plain text file.

The plain text file should describe all of the inconsistencies found.
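One possible shape for that plain text output is sketched below. The report layout, the `(table_no, row_no, message)` issue tuples, and the function names `format_report` and `write_report` are all assumptions made for illustration; the challenge does not prescribe a format.

```python
def format_report(issues_by_file):
    """Build the report text from a dict that maps a document path to a
    list of (table_no, row_no, message) tuples (a hypothetical structure)."""
    lines = []
    for path in sorted(issues_by_file):
        lines.append(path)
        issues = issues_by_file[path]
        if not issues:
            lines.append('  no inconsistencies found')
        for table_no, row_no, message in issues:
            lines.append('  table %d, row %d: %s' % (table_no, row_no, message))
        lines.append('')  # blank line between documents
    return '\n'.join(lines)


def write_report(issues_by_file, output_path):
    """Write the formatted report to a single plain text file."""
    with open(output_path, 'w') as handle:
        handle.write(format_report(issues_by_file))
```

Keeping the formatting separate from the file writing makes the report text easy to inspect in tests without touching the disk.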

Testing

Unfortunately, we do not have a large set of test data to supply with this contest, but we have provided two PDF files for reference. Please create and supply your own beyond these while you are developing and testing. Your submission should be able to recursively handle any number of PDF files contained within a root directory and its subfolders. Only one mapping file should be created, and it should be saved at the root level.
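The recursive handling described above could be approached with the standard library's `os.walk`, as in the minimal sketch below. The function name `find_documents` and the default extension tuple are assumptions; Word extensions could be added to the tuple if Word files should be scanned as well.

```python
import os


def find_documents(root_dir, extensions=('.pdf',)):
    """Return the paths of all files under root_dir (including subfolders)
    whose names end with one of the given extensions, case-insensitively."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subfolders in a stable, sorted order
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                matches.append(os.path.join(dirpath, name))
    return matches
```

The single output file would then be written to `os.path.join(root_dir, ...)` with a file name of your choosing, so that it sits at the root level regardless of where the scanned documents were found.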



Final Submission Guidelines

Technology Overview

  • Python 2.7 should be used.
  • The Source Code should be clear and well commented.
  • A Deployment Guide should detail exactly how to test and run your submission.
  • Test documents showing your results should also be included.

Review style

Final Review: Community Review Board

Approval: User Sign-Off

Challenge links

ID: 30043078