Challenge Overview

Problem Statement

ORC Error Reduction Marathon Challenge

Prize Distribution

1st place - $5,000

2nd place - $2,500

3rd place - $1,500

4th place - $1,000

Introduction

We are looking for solutions that can reduce the OCR errors in an unsupervised manner. The modified texts will be validated against a keyword/keyphrase vocabulary and the original texts. We hope to receive wonderful solutions. Good Luck!

Requirements to Win a Prize

In order to receive a prize, you must do the following:

Achieve a score in the top 4, according to system test results calculated using the test dataset. The score must be higher than 300,000.0. See the ��Data Description and Scoring�� section below.
Within 7 days from the announcement of the contest winners, submit a complete 2-page (minimum) report that: (1) outlines your final algorithm and (2) explains the logic behind and steps of your approach.

If you place in the top 4, but fail to do all of the above, then you will not receive a prize, which will be awarded to the contestant with the next best performance who completed the submission requirements above.

Background

Optical character recognition (OCR) software is capable of turning images of text into text, but it is not perfect. There are sources of errors that occur because of the nature of the document image itself (sizing, rotation, etc.) that should be handled in preprocessing, but there are sources of errors that need to be handled in post-processing, after the image is converted to text. Among the more common types of errors are segmentation errors (`too m uc h wh itespac e` or `toolittle whitespace`), `character rnisrecognitioii`, smudges or other artifacts that could cause misrecognition of `pun,ctuation`.

The cleaner we can make the images before the OCR step, and the cleaner we can make the output of the OCR, the more confidently we will be able to use this text as input to other models. Your challenge is to find these needles in the haystack, and train a model to find and correct the types of common issues described above -- as well as ones we might not anticipate.

Data Description

There are 10,076 OCR documents in total. There are 8,241,011 total words, while only 15,497 unique words. All words have been anonymized.

We have randomly split these documents into 3 sets and processed them into paragraphs:

Train Set: 260,007 paragraphs. The longest one has 844 words, while the average number of words is 10.63.
Provisional Set: 261,294 paragraphs. The longest one has 1508 words, while the average number of words is 10.58.
Final Test Set: 254,461 paragraphs. The longest one has 1806 words, while the average number of words is 10.67.

We provide the "Train Set" for your local development and testing. In addition, we have a lexicon list of 373,837 keyphrases. These phrases are used in the evaluation. You can downloaded them from the following link.

        https://www.dropbox.com/s/koj6yuqjim8lyd6/to_contestant_v2.zip?dl=1

Implementation

You are asked to implement a class OCRErrorReduction with two functions: initialize and clean. The initialize function will be called exactly once at the beginning of the test. It receives a list of single-space-separated strings where every string represent a keyphrase. The clean function will be called multiple times during the test. Every time, it receives an original OCR paragraph. It needs to return a cleaned paragraph.

Scoring

For each OCR document, we want to maximize the ratio of the words that can be tagged as keyphrases over the total number of words, while the cleaned string must be still not far away from the original string. Therefore, given the dictionary D and two strings, i.e., cleaned string (CS) and original string (OS), we define the raw score as

        raw(CS, OS | D) = Sqrt( (Match(CS, D) / |CS|)  * (1 - EditDistance(CS, OS) / max(|CS|, |OS|)) )

Where Match(CS, D) represents the maximum number of words that can be matched in the string CS to the dictionary D, |CS| and |OS| denote the number of words in these two strings, respectively. In order to calculate Match(CS, D), we will simply split the string by whitespace, and run a dynamic programming algorithm. |CS| is required to be between 1 and max(2 * |OS|, |OS| + 100). Otherwise, the raw score will be 0.0. More details can be found in the released test harness. If you find any question or error regarding the test harness, please post it in the forum.

The final score will be the average of the raw scores on all tested paragraphs multiplied by 1,000,000.

In the Example Test, we will use the first 10,000 paragraphs in the released "Train Set" for evaluation. The time limit is 20 mins. In the Provisional Test and System Test, we will use the whole "Provisional Test Set" and "System Test Set" respectively. The time limits are both 2 hours.

Definition

Class:	OCRErrorReduction
Method:	initialize
Parameters:	String[]
Returns:	int
Method signature:	int initialize(String[] keyphrases)

Method:	clean
Parameters:	String
Returns:	String
Method signature:	String clean(String paragrah)
(be sure your methods are public)

Notes

This match is rated

The allowed programming languages are C/C++, Java, and Python. You can use open source libraries.

You can include ONLY open source code in your submission, provided it is free for you to use and would be for the client as well. Code from open source libraries under the BSD or MIT licenses will be accepted. Other open source licenses may be accepted too, just be sure to ask us.

Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself, possible solution techniques or related to data analysis.

The memory limit is 2048 MB.

There is no explicit code size limit. The implicit source code size limit is around 10 MB (it is not advisable to submit codes of size close to that or larger).

The compilation time limit is 600 seconds. You can find information about compilers that we use and compilation options here.

Examples

"1"

Returns: "Seed: 1"

example test

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.

OCR Error Reducer - OCR Error Reducer