Challenge Overview
Project Gwalldata Overview
Thanks for your interest in the Project Gwalldata!
We're underway with this interesting and building project based around python, text-processing, pdf and word documents and we'd love for you to get involved! In the coming weeks and months there are many further contests planned as our complexity and needs ramp up and we're glad to have you on board at this point! To summarize, our client has extremely large documents (both Word and pdf files) that need to be checked in various ways and with various methods for consistency amongst a possible large number of variables. Data preparation is going to involve creating realistic data to provide to the community so that we can simulate our real documents.
One of the current steps in order to complete data preparation for running our series of contests is going to be identifying and saving tabular information contained within PDFs and Word Documents. The data must be stored, along with its location in the document it was discovered in. All relevant table information should be stored (Header, Headings, Column Names, Row Names, Data, Footnotes etc.)
Competition Task Overview
This challenge should develop a python command line utility to identify all of the tables of data within the pdf or word (.doc and .docx) document and store that information about the rows and columns for us.
The existing utility - word-counter use Win32 API to read .doc / .docx and pyPdf to read the PDF document.
These libraries should be used for this challenge. If you have any other better choice, please confirm with us in challenge forum before moving ahead as we're looking to keep a level of consistency as we move forward.
The utility should generate a csv file for each .doc/.docx/.pdf file.
The format is:
{table name}, {# of data columns}, {# of data rows}, {# of footnotes}, {location in the document}
{footnotes}
{column name 1}, {column name 2}, ...., {column name n}
{[row1, column1]}, {[row1, column2]}, ... , {row1, column n}
.....
.....
{[row m, column1]}, {[row m, column2]}, ... , {row m, column n}
{empty line}
// next table data
Testing
Unfortunately we do not have a large set of test data to supply with this contest, but have provided two PDF files for reference - please create/supply your own beyond these while you are developing and testing. Your submission should be to handle the document specified when it is launched from the command line. Please clarify any questions around the storage of the data tables in the forums should questions arise.
Final Submission Guidelines
Technology Overview
- Python 2.7 should be used.
- The Source Code should be clear and well commented.
- A Deployment Guide should detail exactly how to test and run your submission.
- Test Documents also showing your results should be included