The challenge is finished.

Challenge Overview

Unstructured text mining is key to the future of data science. Many data sources provide valuable data in unstructured and semi-structured formats. HTML tables are frequently messy and hard to locate within large bodies of text. Headers change or are misaligned, and numbers appear in different formats. Table titles are frequently placed outside the table structure. Mastering the extraction of this information in a generic way can provide a very powerful tool for understanding many different forms of data.

Our goal in this challenge is to get ideas and a design for a tool that takes an HTML file as input and produces structured text in JSON format containing the data from the tables in the document. Specifically, the tool should extract every number in each table, recognize missing values, and label each number with the items below, subject to the formatting rules included in the list:
    The row name including sub-row names aligned to the tree
    The column header including any information for multiple headers
    The units for the number e.g. dollars, shares
    The date associated with the number from the header, e.g. 2001
    Any ancillary information such as if it is a pro forma number or not
    The value in absolute units, e.g. 1 million should be 1000000
    No commas or other separators in values
    The name of the table from any headers above the table or outside table
    The date of the document from the index
    The company name / ID associated with the table 
    The originating document type if there is a type
    All HTML tags must be stripped from the output
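
The value rules in the list above (absolute units, no separators, missing values) can be sketched roughly as follows. This is a minimal illustration, not the expected output format, which is the one attached in the forums. The JSON field names, the `SCALES` and `MISSING` tables, and the sample figures are all assumptions made up for this example.

```python
import json
import re
from decimal import Decimal, InvalidOperation

# Assumed scale words, as they might appear in a header such as "(In millions)".
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}
# Assumed markers for a missing value (empty cell, dashes, "n/a").
MISSING = {"", "-", "\u2013", "\u2014", "n/a", "na"}


def normalize_value(raw, scale_word=None):
    """Return the cell as a number in absolute units, or None if missing."""
    text = raw.strip()
    if text.lower() in MISSING:
        return None
    # Accounting notation: "(1,234)" means -1234.
    negative = text.startswith("(") and text.endswith(")")
    cleaned = re.sub(r"[,$%\s()]", "", text)
    if cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:]
    try:
        number = Decimal(cleaned)
    except InvalidOperation:
        return None  # not a number, e.g. a text cell
    if not number.is_finite():
        return None
    # "millions" and "million" both map to the same multiplier.
    number *= SCALES.get((scale_word or "").lower().rstrip("s"), 1)
    if negative:
        number = -number
    if number == number.to_integral_value():
        return int(number)
    return float(number)


def build_record(raw, row_path, column_headers, units, date, table_name,
                 scale_word=None):
    """Assemble one output record; the key names are hypothetical."""
    return {
        "row": row_path,
        "column": column_headers,
        "units": units,
        "date": date,
        "table": table_name,
        "value": normalize_value(raw, scale_word),
    }


if __name__ == "__main__":
    # Illustrative inputs only, not taken from any real filing.
    record = build_record("1,234", ["Net sales", "Products"],
                          ["Three Months Ended"], "dollars", "2001",
                          "Example table", scale_word="millions")
    print(json.dumps(record))
```

Keeping the row and column labels as lists is one way to preserve sub-row nesting and multi-level headers; a real design would also need fields for the document date, company ID, document type, and pro forma flag.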

As a sample document, use Apple's SEC 10-Q filing (https://tinyurl.com/y95t497g), specifically the table on page 3. The expected output for the table data is attached in the forums.
You can use other 10-Q filings as sample documents (Facebook, Alphabet, Microsoft), but bear in mind that the tool should be generic and must not rely on the specific formatting of those sample documents.
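
Whichever filing is used as a sample, a format-independent first step is to pull the raw cell text out of every table with all tags stripped. The sketch below uses only Python's standard-library `html.parser`; the class and function names are hypothetical, and it assumes `tr`, `td`, and `th` tags are properly closed. It ignores nested tables, `colspan`, and `rowspan`, all of which a real design would have to handle.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # one entry per <table> seen
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse whitespace, including &nbsp;.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Inline tags such as <b> or <font> are dropped; only text is kept.
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return all tables in the document as nested lists of plain strings."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    sample = ("<table><tr><th>Item</th><th><b>2001</b></th></tr>"
              "<tr><td>Net&nbsp;sales</td><td>1,234</td></tr></table>")
    print(extract_tables(sample))
```

Working from plain cell grids like this keeps the later labeling steps (header detection, row hierarchy, units) separate from HTML parsing, which may make the component easier to fit into a larger architecture.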

This tool is intended to be a component of a larger architecture. Implementation must be in Python, Java, or Scala. Spark is acceptable as a platform but not required. Any libraries used must be Apache, BSD, or MIT licensed. GPL-licensed libraries are not permitted.

Final Submission Guidelines

Submit a document describing your solution. Additional design documents such as UML diagrams are highly desirable. Your proposal will be used as the basis for implementation, so make sure it contains all the necessary implementation details.

ELIGIBLE EVENTS:

2018 Topcoder(R) Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30060070