Challenge Overview
Project Overview
Electrical engineers send our client, a Global Fortune 500 company, technical documents (PDFs) that include one or more electrical panel configurations. Each configuration describes the types of breakers, their ratings in amps, what each breaker connects to, and details about the panel itself. These sections of the document are called schedules, and they are presented as tables. The problem is that each engineering firm uses its own format for this information. This project seeks to automate the task of identifying the schedules in the PDFs and extracting meaningful data from them.
Contest Details
This is the second challenge in the Panel Schedule Importer series, in which we are attempting to reduce the amount of human intervention required to extract electrical panel schedules from technical documents in order to provide the customer with a quote.
For this challenge, we will provide you with the top 2 submissions from our previous challenge, “Panel Schedule Importer - Machine Learning Bakeoff”. Both submissions extract tables from multi-page PDFs and store them as images. We will also provide you with the customer's current OpenCV-based solution and 12 sample PDFs of varying degrees of complexity.
Your task is to provide a solution that uses one of the provided codebases, or your own, to parse all of the PDFs, extract the tables, and store them as images so that they can be processed later by an OCR job. Some of the tables (panel schedules) will be easy for the code to identify; others will be quite difficult, perhaps impossible. The goal of this challenge is to establish a benchmark of the maximum number of tables that can be successfully extracted, so that we can improve upon it with deterministic and/or machine learning approaches.
In addition, we would like your solution to output a CSV log file that contains the details of the panel schedules it identified. At a minimum, the columns should be the PDF filename, the PDF page number, and the output image filename. An ideal solution might also record additional details such as the confidence that the identified region is a table, the size of the image file, Cartesian coordinates, or the result of any pre-parsing that determines whether the PDF is text, image, or a combination of the two. This information will be essential for later processing jobs. It is critical that your solution assumes that there may be more than one panel schedule per PDF.
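As an illustration of the log described above, here is a minimal sketch that uses only the Python standard library. The three required columns come from the challenge text; the column spellings, the optional fields (confidence, image_bytes, x, y, width, height, data_type), and the names ScheduleRecord and write_log are assumptions made for this example, not a prescribed schema.

```python
import csv
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class ScheduleRecord:
    """One row per extracted panel schedule (hypothetical schema)."""
    # Required by the challenge text.
    filename: str
    pdf_page_number: int
    output_image: str
    # Optional details; names and units are assumptions for this sketch.
    confidence: Optional[float] = None
    image_bytes: Optional[int] = None
    x: Optional[int] = None
    y: Optional[int] = None
    width: Optional[int] = None
    height: Optional[int] = None
    data_type: Optional[str] = None  # "text", "image", or "both"


def write_log(records: Iterable[ScheduleRecord], log_path: Path) -> None:
    """Write a header plus one CSV row per record; unset fields become empty cells."""
    columns = [f.name for f in fields(ScheduleRecord)]
    with log_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for record in records:
            # The csv module writes None as an empty string.
            writer.writerow(asdict(record))
```

Because each record is a separate row, a PDF page that holds several panel schedules simply produces several rows that share the same filename and page number.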
Points To Note
The major requirements of this contest are as follows:
a. Correctly identify and extract schedule tables from the provided PDF files as images. The images should be written to a single directory with a systematic naming convention, e.g. pdfname_pdfPageNumber_sequenceNumber.bmp. You are free to store these images in any format you like.
b. Create a CSV-based log file that keeps track of each of the files processed. At least the file name, page number, and output image filename are needed. Confidence and data type (text, image, both) are optional.
c. Identify the schedule tables with maximum accuracy.
d. Your solution should not “cheat” by configuring your code base for known table locations. The provided PDFs are just a sample subset.
e. Solutions will be judged primarily on the accuracy of identifying and extracting schedule tables from the provided PDFs. Solutions will also be evaluated for processing speed, and judging will favor fast processing of PDFs.
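The naming convention in requirement (a) can be sketched as a small helper. The pattern pdfname_pdfPageNumber_sequenceNumber comes from the text above; everything else is an assumption for illustration: the function names image_name and parse_image_name, 1-based page and sequence numbers, and the choice to replace underscores and other unusual characters in the PDF name with hyphens so that the result always splits cleanly on "_".

```python
import re
from pathlib import Path
from typing import Tuple


def image_name(pdf_path: str, page_number: int, sequence_number: int,
               ext: str = "bmp") -> str:
    """Build pdfname_pdfPageNumber_sequenceNumber.ext for one extracted table."""
    stem = Path(pdf_path).stem
    # Assumption: collapse anything other than letters, digits, '.' and '-'
    # (including '_') into '-' so '_' only ever separates the three parts.
    safe_stem = re.sub(r"[^A-Za-z0-9.-]+", "-", stem).strip("-")
    return f"{safe_stem}_{page_number}_{sequence_number}.{ext}"


def parse_image_name(name: str) -> Tuple[str, int, int]:
    """Recover (pdf name, page number, sequence number) from an image name."""
    stem, page, sequence = Path(name).stem.rsplit("_", 2)
    return stem, int(page), int(sequence)
```

For example, the second table found on page 3 of a file named "E-601 Panel Schedules.pdf" would be stored as E-601-Panel-Schedules_3_2.bmp under these assumptions.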
Additional minimum requirements:
- Since the supplied solutions involve complex dependencies and environment setups, we would like your solution to include a Dockerfile. This is not required; however, it should eliminate environmental discrepancies that may prevent judges from reviewing your submission. The customer-supplied code base is part of a Visual Studio project and is not expected to be dockerized.
- You can use the approach described in one of the two submissions from the Bakeoff contest or in the client-supplied code, or you are welcome to use your own approach. You will notice that none of the supplied code bases use any machine learning approaches. The use of machine learning is encouraged at this point; however, the primary goal of this challenge is to get a benchmark of how accurately the three possible solutions identify tables.
- You are only allowed to use MIT-licensed, BSD-licensed, Mozilla-licensed, or Apache-licensed libraries in your solution.
- Optional: You may wrap your solution in a script that cycles through the entire set of 12 PDFs, or you may allow it to be called once for each PDF.
- Note that you can submit your solution in Python, Java, or C#.
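The optional wrapper script mentioned in this list could look roughly like the following sketch, which accepts either a single PDF or a directory of PDFs and records the time spent on each file, since processing speed is part of the judging. The names collect_pdfs, run, extract_placeholder, and main are hypothetical, and extract_placeholder stands in for whichever extraction code you choose; it does no real work.

```python
import argparse
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional, Sequence


def collect_pdfs(target: Path) -> List[Path]:
    """Return one PDF, or every PDF directly inside a directory, sorted by name."""
    if target.is_dir():
        return sorted(p for p in target.iterdir() if p.suffix.lower() == ".pdf")
    if target.suffix.lower() == ".pdf":
        return [target]
    raise ValueError(f"{target} is neither a PDF nor a directory")


def run(pdfs: Sequence[Path], extract: Callable[[Path], None]) -> Dict[str, float]:
    """Call extract on each PDF and return the elapsed seconds per file name."""
    timings: Dict[str, float] = {}
    for pdf in pdfs:
        start = time.perf_counter()
        extract(pdf)
        timings[pdf.name] = time.perf_counter() - start
    return timings


def extract_placeholder(pdf: Path) -> None:
    """Hypothetical stand-in for the real table extraction step."""


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Extract panel schedules from PDFs.")
    parser.add_argument("target", type=Path, help="a PDF file or a directory of PDFs")
    args = parser.parse_args(argv)
    for name, seconds in run(collect_pdfs(args.target), extract_placeholder).items():
        print(f"{name}: {seconds:.3f}s")
    return 0
```

Supporting both call styles through one argument keeps a single entry point for judges; to run it from a shell, add the usual `if __name__ == "__main__":` guard that calls `main()`.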
Final Submission Guidelines
Include a detailed deployment guide (a README file is also fine as long as it contains deployment instructions) along with your source code and upload it to Topcoder.
Don't forget to include an unlisted link to your video that shows your solution in action.