Challenge Overview
Prize
1st place - $2000
2nd place - $1000
3rd place - $500
Overview
AWS Serverless: https://aws.amazon.com/serverless/
Tesseract is an open-source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages. Resource: Tesseract GitHub wiki.
The goal of this challenge is to perform four image-processing operations as fast as possible.
Challenge
In this challenge, you will receive a 2112-page document. The baseline (i.e., the previous winning solution) performs the following functions on the sample 2112 pages.
- Convert documents (split into pages).
- OCR pages.
- Merge pages (merge the split pages into a single file).
- Load documents (from S3) into a user interface that allows users to view the pages.
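The split, OCR, and merge steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the baseline's actual code: the names `run_pipeline` and `tesseract_command`, the page file names, and the use of a thread pool are all hypothetical. The three steps are passed in as callables so any suitable library can be plugged in. The only real tool referenced is the `tesseract` command line, whose `pdf` config writes a searchable PDF to `<outputbase>.pdf`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def tesseract_command(image_path: str, output_base: str) -> List[str]:
    """Build the CLI call `tesseract <image> <outputbase> pdf`.

    Tesseract appends the `.pdf` extension to output_base itself.
    The command is only built here, not executed.
    """
    return ["tesseract", image_path, output_base, "pdf"]


def run_pipeline(
    document: str,
    split: Callable[[str], Sequence[str]],
    ocr: Callable[[str], str],
    merge: Callable[[Sequence[str]], str],
    workers: int = 8,  # assumed pool size, tune for the target environment
) -> str:
    """Split a document, OCR its pages in parallel, then merge them."""
    pages = split(document)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so the merged
        # file keeps the original page order.
        ocr_pages = list(pool.map(ocr, pages))
    return merge(ocr_pages)
```

In a real submission, `ocr` might run the command from `tesseract_command` via `subprocess.run`, and each callable would read its input from and write its output to S3.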
Here is the file:
https://drive.google.com/open?id=1DmkLpsKL79YDn4jXdNFO8j0jXyqGETim
Baseline
The overall quality score (quality_score_overall) should be 0.5. Use this script to calculate the score:
https://drive.google.com/file/d/1-U-bEJK-uphuUa8T7RpRAaOuPMS-4hmV/view?usp=sharing
Final Submission Guidelines
Submission
We expect you to tweak the provided baseline. Carefully document the changes you have made, and provide a new README file with instructions on how to run the new code.
Judging Criteria
- Accuracy (30%)
  - The OCR results should maintain sufficiently high accuracy. The Tesseract conversion score (refer to the baseline package for details) should be greater than 0.5.
- Efficiency (40%)
  - The running time should be as short as possible.
  - The solution should be able to run 100 concurrent executions of the 2k-page file for one hour without any issues (e.g., out-of-memory errors, crashes, etc.). This is a hard requirement.
- Easy-to-Deploy (30%)
  - All functions must read from and write to S3. This is a hard requirement.
  - You can use any open-source libraries (Apache and MIT licenses are acceptable) to convert and merge documents.
  - The OCR step must be done using Tesseract.
  - We will give bonus scores to submissions that use a microservice approach. Computation on AWS serverless services is preferred.
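To get a feel for the efficiency and S3 requirements above, the workload can be worked out with a few lines of arithmetic. The sketch below is only an illustration: the page count and concurrency come from this challenge, while the S3 key layout (`page_key`), the batch size, and the idea of fanning pages out in batches to serverless workers are assumptions, not part of the baseline. No AWS calls are made.

```python
from typing import Iterator

PAGES_PER_DOCUMENT = 2112    # from the challenge statement
CONCURRENT_EXECUTIONS = 100  # from the hard efficiency requirement


def total_pages_in_flight() -> int:
    """Pages being processed when all executions run at once."""
    return PAGES_PER_DOCUMENT * CONCURRENT_EXECUTIONS


def page_key(run_id: str, stage: str, page: int) -> str:
    """Hypothetical S3 key layout: <run_id>/<stage>/page-0001.pdf."""
    return f"{run_id}/{stage}/page-{page:04d}.pdf"


def batches(num_pages: int, batch_size: int) -> Iterator[range]:
    """Yield 1-based page ranges, one per worker invocation."""
    for start in range(1, num_pages + 1, batch_size):
        yield range(start, min(start + batch_size, num_pages + 1))


if __name__ == "__main__":
    print(total_pages_in_flight())          # 211200
    print(len(list(batches(2112, 100))))    # 22
    print(page_key("run-42", "ocr", 7))     # run-42/ocr/page-0007.pdf
```

Giving every stage its own key prefix keeps each function's S3 input and output separate, and a fixed batch size bounds the memory each worker needs, which matters for the out-of-memory requirement.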