Challenge Overview
INTRODUCTION
Welcome to the sparck Policy Number Identification challenge. As part of this challenge you will assemble a machine learning classifier and REST service that can identify insurance policy numbers given their location on a document page.
REQUIREMENTS
For this challenge you will have two main deliverables: the classifier and the REST service. The machine learning portion may be written in Java or Python and must be deployable to the AWS cloud. The REST service must be written in Java, be able to interact with the classifier, and be deployable to the AWS cloud.
Classifier
You will be creating a classifier capable of accepting training data consisting of document contents and expected results. From this data your classifier should be able to recognize policy numbers in documents with greater than 95% accuracy.
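One way to check the 95% target is sketched below. It assumes exactly one expected policy number per document and counts a prediction as correct only on an exact string match; the class name `AccuracyCheck` and this definition of accuracy are illustrative assumptions, not part of the challenge materials.

```java
import java.util.List;

public class AccuracyCheck {

    /**
     * Returns the fraction of documents whose predicted policy number
     * exactly equals the expected one (assumed definition of accuracy).
     * An empty string can stand for "no policy number on this document".
     */
    public static double accuracy(List<String> predicted, List<String> expected) {
        if (predicted.size() != expected.size()) {
            throw new IllegalArgumentException("Lists must have the same length");
        }
        if (expected.isEmpty()) {
            return 0.0; // nothing to score
        }
        int correct = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / expected.size();
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        List<String> expected = List.of("HO-01362771", "23-205235-02");
        List<String> predicted = List.of("HO-01362771", "23-205235-99");
        System.out.println("Accuracy: " + accuracy(predicted, expected));
    }
}
```

Scoring against documents held out from training gives a more honest estimate than scoring on the training data itself.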
Policy numbers could be located anywhere on a page, or not at all. Generally speaking, they will be in close proximity to other text on the page such as “Policy”, “Policy #”, “Reference” or “Reference #”. Here, close proximity refers to physical distance between the two pieces of text on the page.
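Physical proximity could be turned into a numeric feature for the classifier along the following lines. This is a minimal sketch: the `Token` type, its `x`/`y` coordinates (taken here as the center of a token's bounding box) and the keyword list are hypothetical, since the actual document structure is defined by the sample JSON in the forums.

```java
import java.util.List;
import java.util.Locale;

public class ProximityFeature {

    /** Hypothetical token: its text plus the center of its bounding box. */
    public static final class Token {
        final String text;
        final double x;
        final double y;

        public Token(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed keyword stems, based on the examples "Policy" and "Reference".
    private static final List<String> KEYWORDS = List.of("policy", "reference");

    static boolean isKeyword(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (lower.startsWith(keyword)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Euclidean distance from the candidate to the nearest keyword token
     * on the same page, or positive infinity if the page has no keyword.
     */
    public static double distanceToNearestKeyword(Token candidate, List<Token> page) {
        double best = Double.POSITIVE_INFINITY;
        for (Token token : page) {
            if (token == candidate || !isKeyword(token.text)) {
                continue;
            }
            double distance = Math.hypot(token.x - candidate.x, token.y - candidate.y);
            best = Math.min(best, distance);
        }
        return best;
    }
}
```

A smaller distance suggests the candidate is more likely to be a policy number; the classifier can learn how much weight to give this feature from the training data.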
Policy numbers could also follow a general pattern, such as “HO-01362771” or “23-205235-02”. They could be all numbers or alphanumeric combinations. Your classifier must be smart enough to detect and learn these patterns and locations in order to return the most accurate results possible.
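As a starting point, candidate strings can be collected with a regular expression before the classifier scores them. The sketch below is an assumption about what a candidate looks like (groups of uppercase letters and digits joined by hyphens, containing at least five digits); the pattern and the `MIN_DIGITS` threshold are illustrative only, and a fixed regex does not replace learning patterns from the training data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CandidateFinder {

    // Groups of 2+ uppercase letters/digits, optionally joined by hyphens.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b[A-Z0-9]{2,}(?:-[A-Z0-9]{2,})*\\b");

    // Assumed threshold that filters out short numbers such as years.
    private static final int MIN_DIGITS = 5;

    /** Returns every substring that looks like a possible policy number. */
    public static List<String> findCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            String match = matcher.group();
            long digits = match.chars().filter(Character::isDigit).count();
            if (digits >= MIN_DIGITS) {
                candidates.add(match);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(findCandidates("Policy # HO-01362771 issued 2019"));
    }
}
```

Each candidate found this way can then be combined with location information and passed to the classifier for the final decision.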
REST Service
You will be creating a REST service capable of accepting a full document in JSON format. The service will then run the JSON data against the classifier, returning any policy numbers located within the document.
1. The exposed endpoint should be /extractpolicynumber
2. The JSON document should be POSTed in the body to this endpoint
3. Return JSON in the response body
If no policy numbers are located, simply return an empty JSON object.
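The endpoint behavior described above can be sketched with the HTTP server that ships with the JDK, chosen here only to keep the example free of dependencies. The `classify` method is a stub standing in for the call to your classifier, and the `policyNumbers` field name is a hypothetical placeholder; the actual response format is defined by the sample output JSON in the forums.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PolicyNumberService {

    /** Stub: a real service would pass the document to the classifier here. */
    public static List<String> classify(String documentJson) {
        return List.of();
    }

    /** Builds the response body; returns an empty JSON object if nothing was found. */
    public static String toJson(List<String> numbers) {
        if (numbers.isEmpty()) {
            return "{}";
        }
        StringBuilder sb = new StringBuilder("{\"policyNumbers\":[");
        for (int i = 0; i < numbers.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            // Escape backslashes and quotes so the output stays valid JSON.
            String escaped = numbers.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append("]}").toString();
    }

    static void handle(HttpExchange exchange) throws IOException {
        if (!"POST".equals(exchange.getRequestMethod())) {
            exchange.sendResponseHeaders(405, -1); // only POST is accepted
            exchange.close();
            return;
        }
        String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
        byte[] response = toJson(classify(body)).getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json");
        exchange.sendResponseHeaders(200, response.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(response);
        }
    }

    /** Starts the server and registers the /extractpolicynumber endpoint. */
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/extractpolicynumber", PolicyNumberService::handle);
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080); // port 8080 is an arbitrary choice for local testing
    }
}
```

With the stub in place, a POST to /extractpolicynumber returns `{}`, which matches the behavior required when no policy numbers are located.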
RESOURCES
In the challenge forums you will find:
-- Training data for your classifier:
   -- Training_Documents.zip contains full document contents
   -- Training_Numbers.zip contains corresponding policy numbers
-- Sample JSON file containing the document structure to parse via REST service
-- Sample JSON file containing the expected output from the REST service
Final Submission Guidelines
-- Java source for your solution with well-commented blocks where appropriate
-- Any dependencies required to run your solution
-- Your solution must accept the sample input JSON data and return the sample output JSON data as posted in the forums
-- You must include both deliverables as outlined above
-- Provide instructions/details on how to test your accuracy
-- You are free to use third-party libraries so long as their licenses allow you to do so