Challenge Overview
Project Gwalldata Overview
Thanks for your interest in Project Gwalldata!
We're now midway through a project built around Python, text processing, and PDF and Word documents. Further contests are planned over the coming weeks and months as the complexity and our needs ramp up, and we're glad to have you on board! Our client has extremely large documents (both Word and PDF files) that need to be checked in various ways for consistency across a potentially large number of variables. Data preparation will involve creating realistic data to provide to the community so that we can simulate the real documents.
The current challenge is to convert our existing code base to work across multiple platforms.
Competition Task Overview
In past challenges, we implemented the following features on Windows (and tested them only on Windows).
- A GUI to call the following utilities:
  - Anchor Loader
  - Link Checker
  - Pdf Obfuscator
  - Table Consistency Checker
  - Table Extractor
  - Word Counter
The existing code uses the Win32 API to drive the Microsoft Word application, which reads and writes the .doc, .docx, and .pdf files.
This challenge should remove the Win32 API calls (and any other Windows-only code) and use cross-platform libraries to implement the same functions.
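One possible way to structure the replacement is to have every utility go through a single entry point that picks a reader by file extension, so that no utility talks to Word directly. The sketch below is only an illustration under that assumption: the names `register_reader` and `read_document` are hypothetical and do not come from the existing code base.

```python
import os

# Hypothetical registry: lowercase file extension -> reader callable.
_READERS = {}


def register_reader(*extensions):
    """Decorator that registers a function as the reader for the given extensions."""
    def decorator(func):
        for ext in extensions:
            _READERS[ext.lower()] = func
        return func
    return decorator


def read_document(path):
    """Return the text of `path` using the reader registered for its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in _READERS:
        raise ValueError("no cross-platform reader registered for %r" % ext)
    return _READERS[ext](path)
```

With this layout, a pure Python reader and a Java-backed reader could be registered for different extensions without changing the calling utilities.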
We prefer a pure Python implementation, but we have found no good cross-platform Python library that can read and write .doc files and write .pdf files.
It is therefore acceptable to use Java libraries through a Python-Java bridge library (e.g. JPype, Py4J).
.docx files can be processed with a pure Python library (e.g. python-docx).
.pdf files can be read with a pure Python library (pyPdf, PyPDF2, PDFMiner, etc.; some of the existing utilities already use them).
If you implement the same functions in pure Python (without any Java library), you will get a 20% bonus!
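As a rough illustration of the pure Python route for .docx files, the sketch below counts words using only the standard library, relying on the fact that a .docx file is a zip archive whose main text lives in `word/document.xml`. In practice python-docx would likely be the more convenient choice. The function names `docx_paragraphs` and `count_words` are hypothetical, and the whitespace-based word definition is an assumption that may not match what the existing Word Counter reports.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in word/document.xml.

    `source` is a path or a file-like object. Headers, footers, and
    footnotes live in other parts of the archive and are not read here.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # Concatenate the text runs of the paragraph in document order.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs


def count_words(source):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return sum(len(re.findall(r"\S+", text)) for text in docx_paragraphs(source))
```

This runs unchanged on Windows, Mac OS X, and Linux because it never starts Word. It does not handle legacy .doc files, which use a different binary format.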
Testing
The submission must be tested on Windows, Mac OS X, and Linux.
All functions must be tested and must produce the same results as the existing code.
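One way to check a text result against a baseline recorded from the existing Windows code is to normalize platform-specific differences before comparing. The sketch below is an assumption about how such a check could look, not a prescribed procedure. The helper names are hypothetical, and treating line endings and trailing whitespace as insignificant is a choice that should be confirmed against what "same result" means for each utility.

```python
import hashlib


def normalize_output(text):
    """Unify line endings and strip trailing whitespace on each line."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    body = "\n".join(line.rstrip() for line in lines)
    # Use exactly one trailing newline so a missing final newline is ignored.
    return body.strip("\n") + "\n"


def output_digest(text):
    """Return a SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_output(text).encode("utf-8")).hexdigest()


def same_result(baseline_text, candidate_text):
    """True if the two outputs match after normalization."""
    return output_digest(baseline_text) == output_digest(candidate_text)
```

The digest for each baseline could be recorded once on Windows and then compared on the other two platforms. Binary outputs such as generated .pdf files would need a different comparison.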
Final Submission Guidelines
Technology Overview
- Python 2.7 should be used.
- The Source Code should be clear and well commented.
- A Deployment Guide should detail exactly how to test and run your submission.
- Test documents that show your results should be included.