The challenge is finished.

Challenge Overview

Introduction

Implement an API for the following functions:
  • Get PDF information
  • PDF image conversion
  • Create a PDF with text information retrieved from an image using OCR
 

Background

The ultimate goal is to run the API on AWS Lambda. Although this challenge doesn’t involve anything related to AWS, please keep this goal in mind and do your best to avoid compatibility issues when it’s time for us to move to AWS.
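One way to keep the later move to Lambda cheap is to keep the core logic free of AWS-specific code and wrap it in a thin handler. The following is a minimal sketch, assuming an API Gateway proxy-style event; the function `get_pdf_info` and the `url` query parameter are hypothetical and are not taken from the API spec.

```python
import json


def get_pdf_info(pdf_url):
    """Hypothetical core logic: knows nothing about AWS, so it can be unit tested directly."""
    return {"url": pdf_url}


def lambda_handler(event, context):
    """Thin adapter: translate a proxy-style event into a call to the core logic."""
    # queryStringParameters may be missing or None when no query string is sent.
    params = event.get("queryStringParameters") or {}
    pdf_url = params.get("url")  # "url" is an assumed parameter name
    headers = {"Content-Type": "application/json"}
    if not pdf_url:
        body = json.dumps({"message": "url is required"})
        return {"statusCode": 400, "headers": headers, "body": body}
    return {"statusCode": 200, "headers": headers, "body": json.dumps(get_pdf_info(pdf_url))}
```

With this split, the same `get_pdf_info` function can be served by a local web framework during this challenge and by the handler later.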

 

Architecture

The following is the complete architecture for this system. It also includes the AWS part, which you can ignore in this challenge, but reviewing it will help you understand the big picture.
 

Test Files

The sample PDFs for verification are listed below:  
The JSON file obtained by OCR of the sample PDF is as follows:  

Description

In this development challenge, the following functions need to be developed. Each function must be implemented according to the API specification described in the “API Specs” section.

1. Get PDF information API

Get the following information about the PDF file:
  • Whether the PDF contains text information
  • Whether the PDF contains text information with an appropriate encoding set
  • Whether the PDF is password-protected and whether reading its text information is permitted
  • The following general PDF information
    • Title
    • Author
    • Subtitle
    • Keywords
    • Creation date
    • Modification date
    • Creating application
    • PDF conversion application
    • PDF version
    • Page size
    • Number of pages
    • File size
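A few of these fields can be read from the raw bytes without any PDF library. The sketch below is only a rough illustration using the Python standard library: it reads the PDF version from the `%PDF-x.y` header, reports the file size, and uses a naive byte search for `/Encrypt` as a hint that the file may be protected. The function names and the result keys are hypothetical, and a real implementation would use a proper PDF parser for the remaining fields.

```python
import re


def basic_pdf_info(data: bytes) -> dict:
    """Extract a few fields from raw PDF bytes (heuristic sketch, not a full parser)."""
    # The header comment "%PDF-x.y" is expected near the start of the file.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    version = match.group(1).decode("ascii") if match else None
    return {
        "version": version,
        "file_size": len(data),
        # Naive hint only: an encrypted PDF references an /Encrypt dictionary.
        "maybe_encrypted": b"/Encrypt" in data,
    }


def points_to_mm(points: float) -> float:
    """Convert PDF user-space units (1 pt = 1/72 inch) to millimetres."""
    return round(points * 25.4 / 72, 1)
```

For example, a page that is 595 x 842 points converts to roughly 209.9 x 297.0 mm, which is A4.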
 

2. PDF image conversion API

Convert the specified PDF file to the PNG image format.
For a multi-page PDF, one image file is output per page.
Conversion sequence diagram


3. PDF creation API with text information

Run OCR on the specified image file and create a PDF with text information based on the OCR result. That is, the created PDF shows the original image with transparent text overlaid on it, and the font size of the overlaid text is adjusted to match the size of the corresponding text in the image.


Selecting the area of the page that shows a character must select the corresponding text, so that it can be copied correctly.
The created PDF must be displayable in a browser (Chrome / Firefox) with PDF.js. In addition, all text information recognized by OCR must be retrievable with getTextContent of PDF.js.
Multiple image files are output as one PDF, with one PDF page per image file.
There is no need to resize images when converting them to PDF.
Creation sequence diagram
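The overlay requirement above comes down to mapping each OCR bounding box to a text position and font size on the PDF page. The following is a minimal sketch, assuming that each OCR entry carries a `boundingPoly.vertices`-style list of `x`/`y` pixel coordinates with the origin at the top-left (as in Google Vision output), and that one image pixel maps to one PDF point because images are not resized. The function name and the returned keys are hypothetical.

```python
def text_placement(vertices, page_height):
    """Map an OCR bounding box (image coordinates) to a PDF text placement.

    Image coordinates have the origin at the top-left with y growing downward;
    PDF coordinates have the origin at the bottom-left with y growing upward.
    """
    # A vertex may omit "x" or "y" when the value is 0, so default to 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    return {
        "x": left,
        "y": page_height - bottom,  # flip the vertical axis
        "width": right - left,      # target width for horizontal scaling
        "font_size": bottom - top,  # box height used as the font size
    }
```

A PDF library would then draw the recognized text at (`x`, `y`) in an invisible text rendering mode, scaled horizontally to `width`, so that selecting the visible characters in the image selects the hidden text.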

4. OCR engine

We will provide a JSON file containing the OCR result of a sample PDF, produced with the Google Vision API, so please use this JSON file for this development contest. The languages that are subject to OCR are as follows.
  • Japanese
  • English
 
It is not necessary to verify Japanese in this development contest, but please implement the API so that a language other than English can be set.
 
The JSON files provided for this development contest are OCR results obtained from images, not directly from PDFs, to avoid security issues with reading PDFs. OCR from images is also performed in direct tasks.
 

5. Other

You may use a commercially available open-source library for each API implementation. If you want to use an expensive library, please get our approval in advance.
 
You may want to explore these libraries for implementation:  

6. Use case

The final use case sequence will be posted as reference information.
It is assumed that a PDF from which text information cannot be acquired is OCRed again.
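The re-OCR decision in this use case could be derived from the result of the Get PDF information API. The following is a small sketch under that assumption; the keys `has_text`, `encoding_ok`, and `text_readable` are hypothetical names and do not come from the API spec.

```python
def needs_ocr(info: dict) -> bool:
    """Return True when usable text cannot be acquired from the PDF."""
    has_text = bool(info.get("has_text"))
    encoding_ok = bool(info.get("encoding_ok"))
    # Assume text is readable unless the PDF information says otherwise.
    readable = bool(info.get("text_readable", True))
    return not (has_text and encoding_ok and readable)
```

A caller would run the image conversion and PDF creation steps only when this function returns True.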
Use case sequence diagram

API Specs

  • Swagger format
  • API Spec details
  • ankouPdfconverterV1Convert2imageJsonGet and ankouPdfconverterV1Convert2textpdfJsonGet are out of the scope of this challenge.

Limitations

The file size of the source resource cannot exceed 30MB.
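This limit can be enforced before any processing starts. The following is a minimal sketch, assuming that 30MB means 30 x 1024 x 1024 bytes (the exact definition is not stated) and that a file of exactly that size is still accepted; the constant and function names are hypothetical.

```python
# Assumption: 1 MB = 1024 * 1024 bytes; the source does not define the unit.
MAX_SOURCE_BYTES = 30 * 1024 * 1024


def check_source_size(size_bytes: int) -> None:
    """Raise ValueError when the source resource exceeds the size limit."""
    if size_bytes > MAX_SOURCE_BYTES:
        raise ValueError(
            "source is %d bytes; the limit is %d bytes" % (size_bytes, MAX_SOURCE_BYTES)
        )
```

Checking the size first avoids spending time and memory on converting a file that will be rejected anyway.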

Final Submission Guidelines

Submission

Please zip and submit the following set of files:
  • Source code (Node.js 8.10 or Python 3.6)
  • Readme (describe necessary information such as the execution method)
  • Unit test

ELIGIBLE EVENTS:

2020 Topcoder(R) Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30107067