Challenge Overview
Welcome to the Panel Schedule Importer - Machine Learning Bakeoff contest. This is the first in a series of contests aimed at using Machine Learning to help our client with their problem.
Project Overview
Electrical engineers send our client, a Global Fortune 500 company, technical documents (PDFs) that include one or more electrical panel configurations. Each configuration describes the types of breakers, their ratings in amps, a description of what each breaker connects to, and details about the panel itself. These sections of the document are called schedules, and they are depicted in table format. The problem is that different engineering firms use their own formats to present this information. Fortunately, the formats are all very similar: typically a table with two major columns that signify the left and right sides of the panel, and minor columns with supporting details about the breakers. The goal of the bake-off challenge is to identify the right machine learning platform(s) that could be trained to seek out these "electric panel schedules" inside PDF documents in a way that reduces human intervention. The output of this system would be a canonical spec for a panel that would be fed into a quoting engine. The sample below has been extracted from a PDF and annotated for clarity.
Contest Overview
For this challenge, you will help our client identify a Machine Learning platform. Hosted platforms such as Google Cloud Platform ML Engine, Amazon AWS Machine Learning, IBM Watson, and Azure Machine Learning, open source packages such as TensorFlow, Apache SINGA, and Apache Spark MLlib, and any other ML platform or library are all fair game. There are multiple opportunities to adopt ML strategies in this project. A few that we have identified are:
- Identifying and extracting schedule tables from image-based or text-based PDFs.
- Classifying extracted schedule tables against known or similar formats based on columns, headers, and footers.
- Removing annotations that may obscure the schedule itself and/or the text inside it.
- Identifying columns inside schedules by matching them against known or unknown column title synonyms.
- Extracting data via OCR or text from the PDF and normalizing indeterminate schedule formats for input to a quoting engine.
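To make the column-identification item more concrete, here is a minimal sketch of matching raw column titles against a synonym table, with a fuzzy fallback for OCR typos. The synonym entries, the canonical label names, and the `normalize_header` helper are all assumptions made for illustration; the real entries would come from the synonym guide provided with the contest, and a trained model may well replace this kind of lookup.

```python
import difflib

# Hypothetical synonym table for illustration only; real entries would
# come from the synonym guide shared in the contest materials.
COLUMN_SYNONYMS = {
    "circuit_number": ["ckt", "ckt no", "circuit", "cir"],
    "description": ["description", "load description", "load served"],
    "breaker_amps": ["amps", "trip", "bkr", "breaker"],
    "poles": ["poles", "pole", "p"],
}


def normalize_header(raw, cutoff=0.8):
    """Map a raw column title to a canonical label, or None if unknown."""
    # Lowercase, drop punctuation commonly seen in headers, collapse spaces.
    cleaned = " ".join(raw.lower().replace(".", " ").replace("#", " ").split())
    # Invert the table: synonym -> canonical label.
    lookup = {syn: canon for canon, syns in COLUMN_SYNONYMS.items() for syn in syns}
    if cleaned in lookup:
        return lookup[cleaned]
    # Fuzzy fallback to tolerate small OCR or spelling errors.
    close = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


for title in ["CKT. NO.", "Descripton", "VOLTAGE"]:
    print(title, "->", normalize_header(title))
```

Headers that return `None` are the "unknown" synonyms mentioned above; a real system might queue them for human review or feed them back as training data.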
In addition, we would like you to think long term about an artifact, such as a JSON object, that would capture the normalization of the panel schedule. The idea is that it would provide a discrete set of data and metadata describing how the ML task was applied at the time it was run. If the system were retrained, the object itself could be rerun through the ML pipeline and potentially result in an improved output. This object artifact might include the following:
- Literal, unparsed data for the schedule (input).
- Parsed (normalized) data from the schedule (output).
- Mapping: the translation from input labels to normalized (standard) output labels.
- Base64 image of the schedule.
This is just one approach to store the state of the ML task - we are open to any other approaches you might have.
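As a rough illustration of what such an artifact could look like, the sketch below assembles the four items listed above into a single JSON-serializable object. Every field name (`input`, `output`, `mapping`, `image_base64`, `metadata`), the `build_artifact` helper, and the sample values are assumptions for illustration, not a schema defined by the client.

```python
import base64
import json
from datetime import datetime, timezone


def build_artifact(raw_rows, parsed_rows, label_mapping, image_bytes, model_version):
    """Bundle one schedule's input, output, mapping, and image (hypothetical schema)."""
    return {
        "input": {"rows": raw_rows},        # literal, unparsed schedule data
        "output": {"rows": parsed_rows},    # normalized schedule data
        "mapping": label_mapping,           # input label -> standard label
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "metadata": {
            # Recorded so the artifact can be rerun after retraining.
            "model_version": model_version,
            "processed_at": datetime.now(timezone.utc).isoformat(),
        },
    }


# Placeholder values only; a real run would use extracted table data
# and the bytes of the cropped schedule image.
artifact = build_artifact(
    raw_rows=[["CKT", "LOAD DESCRIPTION", "TRIP"], ["1", "LIGHTING", "20A"]],
    parsed_rows=[{"circuit_number": 1, "description": "LIGHTING", "breaker_amps": 20}],
    label_mapping={
        "CKT": "circuit_number",
        "LOAD DESCRIPTION": "description",
        "TRIP": "breaker_amps",
    },
    image_bytes=b"placeholder image bytes",
    model_version="example-0.1",
)
print(json.dumps(artifact, indent=2))
```

Keeping the raw input and the image alongside the output is what makes the rerun scenario possible: a retrained pipeline can start again from `input` or `image_base64` and compare its result against the stored `output`.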
Luckily for us, our client has been working on this problem for a while, and they have provided a white paper on their current solution, named PNL writeup, which we have shared in the contest forum. This document includes a discussion of the current solution's limitations. We hope that this challenge and the subsequent series of challenges will take the lessons learned and the improvements proposed and build a reliable, robust, and efficient electrical panel schedule import tool.
Contest Details
After registering for the contest, you will find additional details in the contest forum. The main requirements of this contest are:
- Determine the best Machine Learning based solution to identify the panel schedules in the provided PDF files.
- There are several Machine Learning facets to this task, so we would like you to submit a scoring rubric that compares the different platforms and/or libraries you are considering, in support of your recommendation.
- Define an overall approach to a system that extracts, stores, and normalizes panel schedules in varied formats contained inside PDFs.
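One possible shape for the scoring rubric is a weighted score per candidate, sketched below. The criteria, the weights, the candidate names, and the 1-to-5 scores are placeholders chosen purely for illustration; they are not measurements of any real platform, and you are free to structure your rubric differently.

```python
# Hypothetical criteria and weights; choose your own and justify them.
CRITERIA_WEIGHTS = {
    "table_detection": 0.35,
    "ocr_quality": 0.25,
    "training_effort": 0.20,
    "cost": 0.20,
}


def weighted_score(scores, weights):
    """Return the weighted average of per-criterion scores."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Placeholder scores on a 1-5 scale, not real evaluation results.
candidates = {
    "Candidate A": {"table_detection": 4, "ocr_quality": 3, "training_effort": 2, "cost": 3},
    "Candidate B": {"table_detection": 3, "ocr_quality": 4, "training_effort": 4, "cost": 2},
}

ranked = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], CRITERIA_WEIGHTS),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name], CRITERIA_WEIGHTS):.2f}")
```

Whatever form the rubric takes, stating the criteria and weights explicitly makes it easier for reviewers to see how the recommendation follows from the comparison.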
Kindly note that we are expecting a detailed document that clearly describes the approach to be taken to solve our customer’s problem. We should be able to use your document as a reference to run the next set of contests. As such, we strongly discourage submissions that are actually just an essay on the topic.
Materials Provided
- A detailed document describing what a typical specification looks like
- A white paper provided by the client that describes their current solution
- A synonym guide - useful for interpreting fields in the specification
- Multiple sample files