Challenge Overview
Project Gwalldata Overview
Thanks for your interest in Project Gwalldata!
This is the very beginning of an interesting, growing project built around Python, text processing, and PDF and Word documents. Many further contests are planned over the coming weeks and months as our complexity and needs ramp up, and we're glad to have you on board from the start! Our client has extremely large documents (both Word and PDF files) that need to be checked in various ways, using various methods, for consistency across a possibly large number of variables. Data preparation will involve creating realistic data to provide to the community so that we can simulate our real documents.
One of our first steps in completing data preparation for this series of contests is to identify how frequently certain words occur within a document, so that we can determine which data is more likely to need obfuscation.
Competition Task Overview
In this challenge you should develop a Python command-line utility that counts the frequency of the words in a document and can exclude certain common words from the count. The words to ignore should be read from a CSV configuration file, e.g. "the, and, a, ignore me, something, there". Please populate the configuration file as you see fit with 10-15 example words.
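The counting step described above can be sketched as follows. This is only a minimal illustration, not a reference solution: the function names `load_ignore_entries` and `count_words` are hypothetical, the tokenizing rule (lowercase letters, digits and inner apostrophes) is an assumption, and so is the choice to treat a multi-word entry such as "ignore me" as a phrase that is removed before counting.

```python
import csv
import re
from collections import Counter

# Assumption: a "word" is a run of letters/digits, optionally joined by apostrophes.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def load_ignore_entries(lines):
    """Read ignore entries from CSV lines; every non-empty cell is one entry."""
    entries = set()
    for row in csv.reader(lines, skipinitialspace=True):
        for cell in row:
            cell = " ".join(cell.lower().split())  # lowercase, normalise spaces
            if cell:
                entries.add(cell)
    return entries


def count_words(text, ignore_entries=()):
    """Return a Counter of lowercase words, skipping the ignored entries."""
    text = text.lower()
    # Assumption: multi-word entries are stripped out as whole phrases first,
    # longest phrase first so shorter phrases cannot break up longer ones.
    phrases = sorted((e for e in ignore_entries if " " in e), key=len, reverse=True)
    for phrase in phrases:
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        text = re.sub(pattern, " ", text)
    single_words = set(e for e in ignore_entries if " " not in e)
    return Counter(w for w in WORD_RE.findall(text) if w not in single_words)


if __name__ == "__main__":
    # In the real utility the lines would come from the CSV configuration file,
    # e.g. "with open(config_path) as handle: load_ignore_entries(handle)".
    ignore = load_ignore_entries(["the, and, a, ignore me, something, there"])
    sample = "The cat and the dog. Ignore me, there is a cat."
    for word, count in count_words(sample, ignore).most_common():
        print("%s %d" % (word, count))
```

`load_ignore_entries` accepts any iterable of lines, so an open file handle works as well as the in-memory list used here. Matching is case-insensitive in this sketch; whether the real utility should be case-sensitive is left open by the task description.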
This utility should support PDF and Word (.doc, .docx) documents.
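One possible way to organise support for the three formats is to choose a text extractor from the file extension. The sketch below is an assumption about structure only: the extractor functions are hypothetical placeholders to be filled in with whichever libraries are approved, and the `--ignore` option name is invented for illustration.

```python
import argparse
import os


def extract_pdf(path):
    """Hypothetical placeholder: return the text of a PDF file."""
    raise NotImplementedError("plug in the approved PDF library here")


def extract_doc(path):
    """Hypothetical placeholder: return the text of a legacy .doc file."""
    raise NotImplementedError("plug in the chosen .doc reader here")


def extract_docx(path):
    """Hypothetical placeholder: return the text of a .docx file."""
    raise NotImplementedError("plug in the chosen .docx reader here")


# Map each supported extension to the function that extracts its text.
EXTRACTORS = {
    ".pdf": extract_pdf,
    ".doc": extract_doc,
    ".docx": extract_docx,
}


def pick_extractor(path):
    """Return the extractor for a path, based on its lowercase extension."""
    extension = os.path.splitext(path)[1].lower()
    try:
        return EXTRACTORS[extension]
    except KeyError:
        raise ValueError("Unsupported file type: %r" % extension)


def build_parser():
    """Build the command-line parser (option names are illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Count word frequencies in a PDF or Word document.")
    parser.add_argument("document", help="path to a .pdf, .doc or .docx file")
    parser.add_argument("--ignore", metavar="CSV",
                        help="CSV file listing words to leave out of the count")
    return parser
```

With this layout, a `main` function would call `build_parser().parse_args()`, pass `args.document` to `pick_extractor`, and feed the extracted text to the counting step. Adding another format later only requires a new entry in `EXTRACTORS`.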
There are many PDF reading libraries for Python (e.g. pyPdf, PDFMiner, ReportLab, etc.). You can choose your own preferred Python PDF library to complete this task, but you need to confirm the library's license in the challenge forum before using it in your submission. As a general rule of thumb, if the library is free to use with no constraints, it's more likely to be accepted.
For .docx, reading the files is straightforward and there are many libraries available (the .docx format is XML-based and open).
For .doc, one approach seems to be calling Win32 COM to read the files - if you know of a better way we're more than happy to hear it too! :)
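To illustrate why the .docx case mentioned above is straightforward: a .docx file is a ZIP archive whose main text lives in the part `word/document.xml`. The sketch below reads that part using only the standard library. The function name `docx_to_text` is hypothetical, and the sketch deliberately ignores headers, footers, footnotes and comments, which are stored in separate parts; a dedicated library may be a better choice for the real submission.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; each <w:t> holds one piece.
        pieces = [node.text for node in paragraph.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

The returned string can be passed directly to the word-counting step. Because paragraphs are joined with newlines, words at paragraph boundaries do not run together.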
Please structure your code logically and comment liberally so it's easy to see at a glance what is happening.
Testing
Please make sure to test using the documents attached to the contest as a baseline.
You may also provide your own .pdf, .doc and .docx documents of similar length for testing, although this is not required; the content is unimportant.
Reviewers are also required to use some of their own test documents to test the submissions.
Final Submission Guidelines
Technology Overview
- Python 2.7 should be used.
- The Source Code should be clear and well commented.
- A Deployment Guide should detail exactly how to test and run your submission.
- Test Documents, together with the results your submission produces for them, should be included.