Challenge Overview

Introduction

Implement an API that provides the following functions:

Get PDF information
PDF image conversion
Create a PDF with text information retrieved from an image using OCR

Background

The ultimate goal is to run the API on AWS Lambda. Although this challenge does not involve anything AWS-specific, please keep this goal in mind and do your best to avoid compatibility issues when the time comes for us to move to AWS.
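One way to keep the code Lambda-friendly is to put the logic in plain functions and keep the Lambda entry point as a thin wrapper. The following is a minimal sketch under assumptions: `get_pdf_info` is a hypothetical placeholder for the core logic, and the event shape (a base64-encoded PDF in `event["body"]`, as in an API Gateway proxy integration) is assumed, not taken from the API spec.

```python
import base64
import json
import os
import tempfile


def get_pdf_info(pdf_path):
    """Hypothetical core function: plain Python, no AWS imports."""
    return {"fileSize": os.path.getsize(pdf_path)}


def handler(event, context):
    """Thin Lambda-style entry point that wraps the core function.

    Assumes event["body"] holds the base64-encoded PDF; the real
    event shape depends on how the API is wired up.
    """
    pdf_bytes = base64.b64decode(event["body"])
    # The temp directory normally resolves to /tmp, which is the
    # writable location on Lambda.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(pdf_bytes)
        result = get_pdf_info(path)
    finally:
        os.remove(path)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }
```

Because the core function knows nothing about AWS, it can be unit tested locally, and only the wrapper has to change when the deployment target changes.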

Architecture

The following is the complete architecture of this system. It also includes the AWS part, which you can ignore in this challenge, but it is still worth reviewing to understand the big picture.

Test Files

The sample PDFs for verification are listed below:

Sample PDF Japanese: A file with proper encoding. If you don't understand Japanese, please use the two English files below.
Sample PDF English1: A file with proper encoding.
Sample PDF English2: A file that does not have the proper encoding set.

The JSON file obtained by OCR of the sample PDF is as follows:

Description

In this development challenge, the following functions need to be developed. Each function must be implemented according to the API specification described in the “API Specs” section.

1. Get PDF information API

Retrieve the following information about the PDF file:

Whether text information is present
Whether text information is present and the appropriate encoding is set
- The purpose is to determine whether g_font_error occurs in PDF.js getTextContent. In the affected case, the items returned by getTextContent have no usable str and their fontName is ‘g_font_error’. “Sample PDF English2” is an example of this.
Whether the PDF is password protected, and whether reading its text information is permitted
The following general information about the PDF:
- Title
- Author
- Subtitle
- Keywords
- Created date
- Updated date
- Application used to create the document
- Application used to convert the document to PDF
- PDF version
- ~~Page size~~
- Number of pages
- File size
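The two text-related checks above can be sketched as a small classifier. This is only an illustration under assumptions: the input is a list of items shaped like the `items` array returned by PDF.js getTextContent (dicts with `str` and `fontName`), and the result keys `hasText` and `hasTextWithEncoding` are hypothetical names, not the ones from the API spec.

```python
from typing import Dict, List

FONT_ERROR = "g_font_error"


def classify_text_items(items: List[Dict[str, str]]) -> Dict[str, bool]:
    """Classify text items shaped like PDF.js getTextContent().items.

    Each item is assumed to be a dict with "str" and "fontName" keys.
    """
    has_text = len(items) > 0
    # Any item carrying the error font name marks the encoding as broken.
    has_font_error = any(it.get("fontName") == FONT_ERROR for it in items)
    return {
        "hasText": has_text,
        "hasTextWithEncoding": has_text and not has_font_error,
    }
```

A PDF for which `hasTextWithEncoding` is false would be a candidate for re-OCR in the use case described later.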

2. PDF image conversion API

Convert the specified PDF file to PNG images.
For a multi-page PDF, one image file is output per page.
Conversion sequence diagram
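As a rough sketch of the one-image-per-page behavior, the example below uses the open-source pdf2image package, which is one possible choice and not one named by this challenge. It relies on the poppler binaries, which would have to be packaged separately for AWS Lambda. The function names and the `<stem>_0001.png` naming scheme are assumptions for illustration.

```python
import os
from typing import List


def page_image_path(output_dir: str, stem: str, page_number: int) -> str:
    """Build the output path for one page, e.g. <dir>/<stem>_0001.png."""
    return os.path.join(output_dir, "{}_{:04d}.png".format(stem, page_number))


def convert_pdf_to_png(pdf_path: str, output_dir: str, dpi: int = 200) -> List[str]:
    """Render every page of pdf_path to a PNG file and return the paths."""
    # Imported here so the naming helper works without pdf2image installed.
    from pdf2image import convert_from_path

    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    paths = []
    for number, page in enumerate(pages, start=1):
        path = page_image_path(output_dir, stem, number)
        page.save(path, "PNG")
        paths.append(path)
    return paths
```

Zero-padded page numbers keep the output files in page order when they are sorted by name.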

3. PDF creation API with text information

Run OCR on the specified image file and create a PDF with text information based on the OCR result. That is, the output is a PDF that shows the original image with transparent text overlaid on it, where the font size and position of each piece of overlaid text are adjusted to match the corresponding text in the image.

Selecting the area of the page that shows a character must select the corresponding text, and copying the selection must yield the correct text.
The created PDF must display correctly in the browser (Chrome / Firefox) with PDF.js. In addition, all text information recognized by OCR must be retrievable with getTextContent of PDF.js.
Multiple image files are output as a single PDF with one page per image file.
There is no need to resize images when converting them to PDF.
Creation sequence diagram
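Placing the overlay text requires mapping each OCR bounding box to a position and font size on the PDF page. The following is a minimal sketch, assuming the box is given as four vertices in image pixels with the origin at the top-left (as in a Google Vision `boundingBox`) and that the page uses one PDF unit per pixel since no resizing is needed. The function name and the choice of the box bottom as the text baseline are assumptions.

```python
from typing import Dict, List, Tuple


def text_box_to_pdf(
    vertices: List[Dict[str, int]],
    image_height: float,
    scale: float = 1.0,
) -> Tuple[float, float, float]:
    """Map an OCR bounding box to (x, y, font_size) in PDF units.

    vertices: four corner points in image pixels, origin at top-left.
        A missing "x" or "y" key is treated as 0.
    image_height: image height in pixels.
    scale: PDF units per pixel (1.0 keeps the image at its pixel size).
    """
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    left, top, bottom = min(xs), min(ys), max(ys)
    font_size = (bottom - top) * scale
    # PDF places the origin at the bottom-left, so flip the y axis.
    # The box bottom is used as an approximate text baseline.
    x = left * scale
    y = (image_height - bottom) * scale
    return x, y, font_size
```

The returned values would then be passed to whichever PDF library draws the text. In PDF terms, invisible text corresponds to text rendering mode 3, and Japanese text additionally needs a font that covers the required glyphs.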

4. OCR engine

We provide the OCR result of the sample PDF, produced with the Google Vision API, as a JSON file. Please use this JSON file for this challenge.

The languages that are subject to OCR are as follows.

Japanese
English

You do not need to verify Japanese in this challenge, but please implement the API so that a language other than English can be set.
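One way to keep the language configurable is to pass it as a parameter all the way down to the OCR request. This challenge works from the provided JSON and does not call the OCR engine, so the following is only a sketch of what a later request could look like. It builds a Google Vision `images:annotate` request body with `languageHints`; the function name and parameters are assumptions.

```python
import base64
import json
from typing import List


def build_ocr_request(image_bytes: bytes, languages: List[str]) -> str:
    """Build a JSON body for the Vision images:annotate endpoint."""
    request = {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Language codes such as "ja" or "en" come from configuration.
                "imageContext": {"languageHints": languages},
            }
        ]
    }
    return json.dumps(request)
```

With this shape, switching from English to Japanese means changing the `languages` argument, for example `["ja"]`, and not the code.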

To avoid security issues with reading PDFs, the JSON files provided for this challenge were produced by OCR on images, not directly on PDFs. OCR on images is also what is performed in direct tasks.

5. Other

You may use open-source libraries that permit commercial use for each API implementation. If you want to use a paid library, please get our approval in advance.

You may want to explore these libraries for implementation:

6. Use case

The final use case sequence is provided as reference information.
It is assumed that a PDF from which text information cannot be retrieved is re-OCRed.
Use case sequence diagram

API Specs

Swagger format
API Spec details
ankouPdfconverterV1Convert2imageJsonGet and ankouPdfconverterV1Convert2textpdfJsonGet are out of the scope of this challenge.

Limitations

The file size of the source resource cannot exceed 30MB.
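A size check like this is easiest to enforce before any parsing starts. The sketch below assumes that 30MB means 30 * 1024 * 1024 bytes, which the challenge does not state, and the function name is hypothetical.

```python
MAX_SOURCE_BYTES = 30 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes


def is_within_size_limit(size_bytes: int, limit: int = MAX_SOURCE_BYTES) -> bool:
    """Return True if a source of size_bytes does not exceed the limit."""
    return size_bytes <= limit
```

The size can come from `os.path.getsize(path)` for a local file or from `len(data)` for bytes already in memory.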

Final Submission Guidelines

Submission

Please zip and submit the following set of files:

Source code (Node.js 8.10 or Python 3.6)
Readme
Unit tests
In the Readme, describe the necessary information, such as how to run the code.

PDF Converter API Challenge

ID: 30107067