Project Gwalldata - Python Word and PDF Anchor Loader

The challenge is finished.

Challenge Overview

Project Gwalldata Overview

Thanks for your interest in Project Gwalldata!

We're underway with this interesting, growing project built around Python, text processing, and PDF and Word documents, and we'd love for you to get involved! Many further contests are planned in the coming weeks and months as our complexity and needs ramp up, and we're glad to have you on board at this point. To summarize, our client has extremely large documents (both Word and PDF files) that need to be checked in various ways for consistency across a potentially large number of variables. Data preparation will involve creating realistic data to provide to the community so that we can simulate our real documents.

One of the current steps in our running series of contests is to identify all of the anchors contained within a document and check the consistency of the content they link to.

Competition Task Overview

In this challenge you will develop a Python command-line utility that reads all anchors from a PDF or Word (.doc and .docx) document and stores, for each anchor, its id, its content, and the content it links to.
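As a rough illustration of the command-line shape described above, here is a minimal sketch. The names `SUPPORTED_EXTENSIONS`, `document_kind`, `build_parser` and `main` are hypothetical and not part of the existing word-counter utility, and the actual anchor reading is left out; the sketch only shows argument parsing and dispatch by file extension.

```python
import argparse
import os

# Hypothetical mapping from file extension to the kind of reader needed.
SUPPORTED_EXTENSIONS = {".pdf": "pdf", ".doc": "word", ".docx": "word"}


def document_kind(path):
    """Return 'pdf', 'word', or None for an unsupported file."""
    extension = os.path.splitext(path)[1].lower()
    return SUPPORTED_EXTENSIONS.get(extension)


def build_parser():
    """Build the argument parser for the (assumed) command-line interface."""
    parser = argparse.ArgumentParser(
        description="Extract anchors from PDF and Word documents.")
    parser.add_argument("documents", nargs="+",
                        help="paths of .pdf, .doc or .docx files to process")
    return parser


def main(argv=None):
    """Parse arguments and report which reader each document would need."""
    args = build_parser().parse_args(argv)
    for path in args.documents:
        kind = document_kind(path)
        if kind is None:
            print("skipping unsupported file: %s" % path)
        else:
            # Placeholder: a real submission would extract the anchors here.
            print("would read %s anchors from: %s" % (kind, path))
    return 0
```

A real script would call `main()` under the usual `if __name__ == "__main__":` guard, and would replace the placeholder branch with calls into the Win32 and pyPdf readers.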

The existing utility, word-counter, uses the Win32 API to read .doc/.docx files and pyPdf to read PDF documents.

We strongly prefer that you use the same libraries for this challenge. If you have a better alternative, please confirm with us in the challenge forum before moving ahead. We want to ensure consistency as we go forward, so without an extremely good reason we ask that you default to our suggestions.

The utility should generate a csv file for each .doc/.docx/.pdf file fed in.

The format is:

{anchor id}, {anchor content}, {linked content}, {makes sense?}

....

....

The {makes sense?} field should be a simple boolean value indicating whether the linked content appears to match the anchor content.
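The row format above could be produced along the following lines. This is only a sketch: the challenge does not define how {makes sense?} is decided, so the word-overlap rule in `makes_sense` is an assumption, and the function names are hypothetical. The anchors are taken as already-extracted tuples of (anchor id, anchor content, linked content).

```python
import csv
import re


def tokens(text):
    """Return the set of lower-cased alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def makes_sense(anchor_content, linked_content):
    """Assumed heuristic: True if the two texts share at least one word."""
    return bool(tokens(anchor_content) & tokens(linked_content))


def write_rows(handle, anchors):
    """Write one CSV row per (anchor_id, anchor_content, linked_content)."""
    writer = csv.writer(handle)
    for anchor_id, anchor_content, linked_content in anchors:
        writer.writerow([anchor_id, anchor_content, linked_content,
                         makes_sense(anchor_content, linked_content)])
```

Under Python 2.7 the output file would normally be opened in binary mode (`open(path, "wb")`) before being passed to `write_rows`. A stricter heuristic, such as requiring most anchor words to appear in the linked text, may suit real documents better.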

Testing

Unfortunately, we do not have a large set of test data to supply with this contest, but we have provided two PDF files for reference; please create your own additional test documents while you are developing and testing. Your submission should be able to recursively handle any number of PDF files contained within a root directory and its sub-folders. Only one mapping file should be created, and it should be saved at the root level.
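The recursive handling described above might be sketched with `os.walk` as follows. The function names and the default file name `anchors.csv` are assumptions made for illustration; the challenge does not name the mapping file.

```python
import os


def find_pdfs(root):
    """Return sorted paths of all .pdf files under root and its sub-folders."""
    found = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(directory, name))
    return sorted(found)


def mapping_path(root, filename="anchors.csv"):
    """Path of the single mapping file at the root level (name is assumed)."""
    return os.path.join(root, filename)
```

Sorting the result keeps the output order stable between runs, which makes it easier to compare mapping files produced from the same directory tree.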



Final Submission Guidelines

Technology Overview

  • Python 2.7 should be used.
  • The Source Code should be clear and well commented.
  • A Deployment Guide should detail exactly how to test and run your submission.
  • Test documents, together with the results your submission produces for them, should be included.

Review style

Final Review

Community Review Board

Approval

User Sign-Off

Challenge links

ID: 30043080