Topcoder Challenge | Topcoder Community

Challenge Overview

Our goal is to classify files that contain multiple documents, possibly of different types. We have built a machine learning model based on Document Template Matching, Random Forest, and Hidden Markov Models to classify the type of each page. The page-level accuracy is reasonably good, but we want to further improve the document-level F1, which is calculated from the detected document types and boundaries.

Background

For every single page’s classification, we are now using ElasticSearch (ES), with an index that contains 151 templates, and a Random Forest (RF) model. The outputs of the ES service (top 10 scores for 10 doctypes) provide the inputs to the RF model, which returns the top 3 predictions of the doctype and the corresponding probability. We then employ an HMM model to predict the page-level classification based on the current and previous page prediction. The goal of the HMM model is to predict the transition from one doctype to another doctype based on page sequence patterns of multi-doc files.

The percentage of pages with the correct doctype among the top 3 RF scores is 98%. However, when these pages are assembled into documents (by grouping consecutive pages of the same doctype), the document-level F1 is only around 14%. Looking at the data, most of these errors arise from misclassification of the first 1-3 pages. For example, given a 15-page Purchase Agreement (doctype), the document-level prediction from the HMM was:

"doctype": "1003", "score": 0.1, "pagerange": "1"
"doctype": "CREDIT REPORT", "score": 0.1, "pagerange": "2"
"doctype": "PURCHASE AGREEMENT", "score": 0.52, "pagerange": "3"
"doctype": "COMPLIANCE TEST RESULTS", "score": 0.13, "pagerange": "4"
"doctype": "PURCHASE AGREEMENT", "score": 0.39, "pagerange": "5-14"
"doctype": "CLOSING INSTRUCTIONS", "score": 0.2079200614621186, "pagerange": "15"

The goal of the HMM model is to predict the document classification and determine the document boundaries based on the pattern of page sequences. Each doctype has a characteristic average page count and standard deviation, and we expect this pattern to manifest itself in the transition matrix. When the first-page prediction is incorrect, the error propagates to the subsequent page(s), so the aggregate document prediction is incorrect even if the page-level classification is partially accurate.
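The assembly step described above (grouping consecutive pages that share a doctype) can be sketched as follows. This is a minimal illustration, not the existing pipeline: the function name `pages_to_documents` and the file ID `"file-1"` are hypothetical, and the page labels simply mirror the 15-page example above.

```python
from itertools import groupby


def pages_to_documents(file_id, page_doctypes):
    """Collapse per-page doctype labels (page 1 first) into
    (file_id, start_page, end_page, doctype) tuples, one per run of
    consecutive pages that share the same label."""
    documents = []
    page = 1
    for doctype, run in groupby(page_doctypes):
        length = sum(1 for _ in run)
        documents.append((file_id, page, page + length - 1, doctype))
        page += length
    return documents


# Page labels mirroring the 15-page example (hypothetical file ID).
labels = (
    ["1003", "CREDIT REPORT", "PURCHASE AGREEMENT", "COMPLIANCE TEST RESULTS"]
    + ["PURCHASE AGREEMENT"] * 10
    + ["CLOSING INSTRUCTIONS"]
)
docs = pages_to_documents("file-1", labels)
for doc in docs:
    print(doc)
```

Note that a purely run-based grouping like this merges two adjacent documents of the same doctype into one, and a single mislabeled page splits a document into several pieces, which is one way a high page-level accuracy can still yield a low document-level F1.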

Task Detail

You will be given a dataset of the predictions from the Random Forest model for a sequence of pages, the correct doctype classification, and the page range. You will also be given the prediction of our existing HMM model along with the sequence of documents (with the corresponding number of pages and doctype).

You are asked to build an ML model to further improve the document-level F1 score. Specifically, we will convert the page-level predictions into tuples of the form <fileID, startPage, endPage, docType>. The ground truth and your predictions each induce a set of such tuples, denoted GT and Pred respectively. We will then compute the F1 score between these two sets as follows.

Precision = |GT & Pred| / |Pred|

Recall = |GT & Pred| / |GT|

F1 = Precision * Recall * 2 / (Precision + Recall)

where “&” means “set intersection”.
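As a sketch of how this metric behaves, the snippet below computes it over two small sets of tuples. The function name `document_f1` and the sample tuples are illustrative assumptions, and returning 0.0 when nothing matches is a convention chosen here to avoid dividing by zero; the official scorer may handle that case differently.

```python
def document_f1(gt, pred):
    """Document-level F1 between two collections of
    (file_id, start_page, end_page, doctype) tuples.
    A predicted document counts only if all four fields match exactly."""
    gt, pred = set(gt), set(pred)
    matched = len(gt & pred)
    if matched == 0:
        # Assumed convention: no exact matches gives F1 = 0.
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gt)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: two true documents, three predicted ones.
gt = [
    ("file-1", 1, 15, "PURCHASE AGREEMENT"),
    ("file-1", 16, 18, "CREDIT REPORT"),
]
pred = [
    ("file-1", 1, 4, "PURCHASE AGREEMENT"),    # wrong boundary
    ("file-1", 5, 15, "PURCHASE AGREEMENT"),   # wrong boundary
    ("file-1", 16, 18, "CREDIT REPORT"),       # exact match
]
score = document_f1(gt, pred)  # precision 1/3, recall 1/2
print(round(score, 3))
```

Because matching is exact, a document whose doctype is right but whose start or end page is off by one contributes nothing to the intersection.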

The dataset (link) has been anonymized. The column headers in the CSV are largely self-explanatory.

Hint: One hypothesis is that a forward-backward algorithm for the HMM will resolve some of the errors seen in classifying the first 1-3 pages.
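To illustrate the hint, here is a minimal pure-Python sketch of forward-backward smoothing on a toy two-state problem. Everything in it is an assumption made for illustration: the transition matrix, the initial distribution, and the use of per-page RF probabilities as stand-ins for emission likelihoods are not taken from the existing model. The point is only that the backward pass lets evidence from later pages revise the belief about the first page.

```python
def forward_backward(initial, transition, emission):
    """Return smoothed per-page posterior probabilities over states.

    initial[i]       : P(state i on the first page)
    transition[i][j] : P(state j on page t+1 | state i on page t)
    emission[t][i]   : likelihood of page t's evidence under state i
                       (assumed here: the RF probability for doctype i;
                       at least one entry per page must be positive)
    """
    n_pages, n_states = len(emission), len(initial)

    # Forward pass, rescaled at every page to avoid numerical underflow.
    forward, prev = [], None
    for t in range(n_pages):
        row = []
        for j in range(n_states):
            if t == 0:
                prior = initial[j]
            else:
                prior = sum(prev[i] * transition[i][j] for i in range(n_states))
            row.append(prior * emission[t][j])
        total = sum(row)
        prev = [v / total for v in row]
        forward.append(prev)

    # Backward pass, also rescaled.
    backward = [[1.0] * n_states for _ in range(n_pages)]
    for t in range(n_pages - 2, -1, -1):
        row = [
            sum(transition[i][j] * emission[t + 1][j] * backward[t + 1][j]
                for j in range(n_states))
            for i in range(n_states)
        ]
        total = sum(row)
        backward[t] = [v / total for v in row]

    # Combine both passes and normalise each page's distribution.
    posteriors = []
    for f_row, b_row in zip(forward, backward):
        joint = [f * b for f, b in zip(f_row, b_row)]
        total = sum(joint)
        posteriors.append([v / total for v in joint])
    return posteriors


# Toy setup: state 0 = doctype A, state 1 = doctype B, "sticky" transitions.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
# Page 1 weakly favours B; pages 2-4 clearly favour A.
emission = [[0.4, 0.6], [0.8, 0.2], [0.8, 0.2], [0.8, 0.2]]

smoothed = forward_backward(initial, transition, emission)
first_page_label = max(range(2), key=lambda i: smoothed[0][i])
print(first_page_label, [round(p, 3) for p in smoothed[0]])
```

In this toy case a forward-only estimate would label the first page as B, since only that page's evidence is available, while the smoothed posterior favours A once the later pages are taken into account.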

Final Submission Guidelines

Submission

We mainly require three things in the submission:

Python 3 code. There should be two functions, one for fitting curves and forecasting, and another for evaluation. Please make appropriate use of existing libraries and document the code.
A document including the key ideas of your method.
A filled test.csv file that contains the predictions for all the pages. They should follow their original order.

Format

The document should be at least 2 pages long, in PDF or Word format, and describe your ideas.
It should be written in English. This documentation isn’t being evaluated for English grammar and spelling, only the quality of your ideas and your technical approach.
Charts, diagrams, and tables are encouraged where they help explain your ideas more fully.

Judging Criteria

You will be judged on document-level accuracy and on how well the code is organized and documented. Note that this contest will be judged subjectively by the client and Topcoder; however, the criteria below will largely form the basis for the judgement.

Accuracy (75%)

Please fill in the test.csv file with your predictions for every page.
We will compute the document-level F1 scores as defined above.

Feasibility (15%)

The solution must be simple and compatible with our current framework.
The response time (online serving) should be fast enough, i.e., < 1 second per 10 pages.
The size of the model must be less than 60 MB.
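One rough way to check a candidate against the response-time and model-size limits above is sketched below. The helper names, the use of `pickle` as the serialization format, and the toy model and predictor are all assumptions for illustration; the client's actual measurement procedure is not specified here.

```python
import pickle
import time


def model_size_mb(model):
    """Size of the pickled model in megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return len(pickle.dumps(model)) / (1024 * 1024)


def seconds_per_10_pages(predict, pages, repeats=5):
    """Best-of-`repeats` wall-clock time of predict(pages), scaled to 10 pages."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(pages)
        best = min(best, time.perf_counter() - start)
    return best * 10 / len(pages)


# Stand-in model and predictor, purely for illustration.
toy_model = {"transition": [[0.9, 0.1], [0.1, 0.9]]}
toy_pages = list(range(30))

size_mb = model_size_mb(toy_model)
latency = seconds_per_10_pages(lambda pages: [0 for _ in pages], toy_pages)
print(f"model size: {size_mb:.4f} MB, latency: {latency:.6f} s per 10 pages")
```

Timings taken this way depend on the machine and exclude any network or service overhead, so they are best read as a lower bound on online serving time.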

Clearness (10%)

Please make sure your report is well-written.
Please make sure your code is well-documented.
The code should be implemented using Python3 only.

Multi-Document File Classification Code Challenge

ID: 30090438