Challenge Overview
Problem Statement | ||||||||||||||||||||||||
ORC Error Reduction Marathon ChallengePrize Distribution
1st place - $5,000 IntroductionWe are looking for solutions that can reduce the OCR errors in an unsupervised manner. The modified texts will be validated against a keyword/keyphrase vocabulary and the original texts. We hope to receive wonderful solutions. Good Luck! Requirements to Win a PrizeIn order to receive a prize, you must do the following:
BackgroundOptical character recognition (OCR) software is capable of turning images of text into text, but it is not perfect. There are sources of errors that occur because of the nature of the document image itself (sizing, rotation, etc.) that should be handled in preprocessing, but there are sources of errors that need to be handled in post-processing, after the image is converted to text. Among the more common types of errors are segmentation errors (`too m uc h wh itespac e` or `toolittle whitespace`), `character rnisrecognitioii`, smudges or other artifacts that could cause misrecognition of `pun,ctuation`. The cleaner we can make the images before the OCR step, and the cleaner we can make the output of the OCR, the more confidently we will be able to use this text as input to other models. Your challenge is to find these needles in the haystack, and train a model to find and correct the types of common issues described above -- as well as ones we might not anticipate. Data DescriptionThere are 10,076 OCR documents in total. There are 8,241,011 total words, while only 15,497 unique words. All words have been anonymized. We have randomly split these documents into 3 sets and processed them into paragraphs:
We provide the "Train Set" for your local development and testing. In addition, we have a lexicon list of 373,837 keyphrases. These phrases are used in the evaluation. You can downloaded them from the following link. https://www.dropbox.com/s/koj6yuqjim8lyd6/to_contestant_v2.zip?dl=1 ImplementationYou are asked to implement a class OCRErrorReduction with two functions: initialize and clean. The initialize function will be called exactly once at the beginning of the test. It receives a list of single-space-separated strings where every string represent a keyphrase. The clean function will be called multiple times during the test. Every time, it receives an original OCR paragraph. It needs to return a cleaned paragraph. ScoringFor each OCR document, we want to maximize the ratio of the words that can be tagged as keyphrases over the total number of words, while the cleaned string must be still not far away from the original string. Therefore, given the dictionary D and two strings, i.e., cleaned string (CS) and original string (OS), we define the raw score as raw(CS, OS | D) = Sqrt( (Match(CS, D) / |CS|) * (1 - EditDistance(CS, OS) / max(|CS|, |OS|)) ) Where Match(CS, D) represents the maximum number of words that can be matched in the string CS to the dictionary D, |CS| and |OS| denote the number of words in these two strings, respectively. In order to calculate Match(CS, D), we will simply split the string by whitespace, and run a dynamic programming algorithm. |CS| is required to be between 1 and max(2 * |OS|, |OS| + 100). Otherwise, the raw score will be 0.0. More details can be found in the released test harness. If you find any question or error regarding the test harness, please post it in the forum. The final score will be the average of the raw scores on all tested paragraphs multiplied by 1,000,000. In the Example Test, we will use the first 10,000 paragraphs in the released "Train Set" for evaluation. The time limit is 20 mins. In the Provisional Test and System Test, we will use the whole "Provisional Test Set" and "System Test Set" respectively. The time limits are both 2 hours. | ||||||||||||||||||||||||
Definition | ||||||||||||||||||||||||
| ||||||||||||||||||||||||
Notes | ||||||||||||||||||||||||
- | This match is rated | |||||||||||||||||||||||
- | The allowed programming languages are C/C++, Java, and Python. You can use open source libraries. | |||||||||||||||||||||||
- | You can include ONLY open source code in your submission, provided it is free for you to use and would be for the client as well. Code from open source libraries under the BSD or MIT licenses will be accepted. Other open source licenses may be accepted too, just be sure to ask us. | |||||||||||||||||||||||
- | Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself, possible solution techniques or related to data analysis. | |||||||||||||||||||||||
- | The memory limit is 2048 MB. | |||||||||||||||||||||||
- | There is no explicit code size limit. The implicit source code size limit is around 10 MB (it is not advisable to submit codes of size close to that or larger). | |||||||||||||||||||||||
- | The compilation time limit is 600 seconds. You can find information about compilers that we use and compilation options here. | |||||||||||||||||||||||
Examples | ||||||||||||||||||||||||
0) | ||||||||||||||||||||||||
|
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.