Challenge Overview
Unstructured text mining is key to the future of data science. Many data sources provide valuable data in unstructured and semi-structured formats. Tables in HTML are frequently messy and hard to locate within large bodies of text. Headers change or are misaligned, and numbers appear in different formats. Table titles are frequently placed outside the table structure. Mastering the extraction of this information in a generic way can provide a very powerful tool for understanding many different forms of data.
Our goal in this challenge is to gather ideas and a design for a tool that takes an HTML file as input and outputs structured text in JSON format containing the data from the tables in the document. Specifically, the tool should extract every number in each table, recognize missing values, and label each number with the following:
- The row name, including sub-row names aligned to the tree
- The column header, including any information for multiple headers
- The units for the number, e.g. dollars, shares
- The date associated with the number from the header, e.g. 2001
- Any ancillary information, such as if it is a pro forma number or not
- The value in absolute units, e.g. 1 million should be 1000000
- No commas or other separators in values
- The name of the table from any headers above the table or outside the table
- The date of the document from the index
- The company name / ID associated with the table
- The originating document type, if there is a type
- All HTML tags must be stripped from the output
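As an illustration of the value requirements above (absolute units, no separators, missing values), here is a minimal Python sketch. The function names, the JSON field names, the set of missing-value markers, and the "(In millions)" caption convention are all assumptions made for this example and are not part of the challenge specification. It also assumes a single scale per table, so captions that mix scales (for example, dollars in millions but shares in thousands) would need extra handling.

```python
import json
import re
from decimal import Decimal

# Hypothetical scale words; extend as needed for other documents.
SCALE_WORDS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}
# Assumed markers for a missing value: empty cell, hyphen, en dash, em dash, "N/A".
MISSING = {"", "-", "\u2013", "\u2014", "n/a"}


def detect_scale(caption):
    """Return the multiplier implied by text such as '(In millions)', else 1."""
    match = re.search(r"\bin\s+(thousand|million|billion)s?\b", caption, re.IGNORECASE)
    return SCALE_WORDS[match.group(1).lower()] if match else 1


def normalize_value(cell, scale=1):
    """Convert a cell such as '$ 1,234' or '(56)' to an absolute number.

    Returns None for a missing value. Non-numeric text raises
    decimal.InvalidOperation, which a caller would need to handle.
    """
    text = cell.replace("$", "").replace(",", "").strip()
    if text.lower() in MISSING:
        return None
    # Accounting convention (assumed here): parentheses mark a negative number.
    negative = text.startswith("(") and text.endswith(")")
    value = Decimal(text.strip("()").strip()) * scale
    if negative:
        value = -value
    return int(value) if value == value.to_integral_value() else float(value)


def make_record(row_path, column_headers, cell, caption, unit):
    """Build one output record; the field names are illustrative only."""
    return {
        "row": row_path,
        "columns": column_headers,
        "unit": unit,
        "value": normalize_value(cell, detect_scale(caption)),
    }


# Placeholder labels and numbers, not taken from any real filing.
record = make_record(
    ["Example parent row", "Example sub-row"],
    ["Example period", "2001"],
    "1,234",
    "(In millions)",
    "dollars",
)
print(json.dumps(record))
```

Decimal arithmetic is used instead of floats so that scaling a value such as 2.09 by one million does not introduce rounding noise.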
As a sample document, we can use the SEC Apple 10-Q (https://tinyurl.com/y95t497g), specifically the table on page 3. The expected output for the table data is attached in the forums.
You can also use other 10-Q filings as sample documents (Facebook, Alphabet, Microsoft), but bear in mind that the tool should be generic and must not rely on the specific formatting of those sample documents.
This tool is intended to be a component of a larger architecture. Implementation must be in Python, Java, or Scala. Spark is acceptable as a platform but not required. Any libraries used must be Apache, BSD, or MIT licensed. GPL-licensed libraries are not permitted.
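As a starting point for a Python design, the sketch below reads every table in an HTML string into a grid of tag-free cell strings using only the standard library's html.parser module, so it adds no third-party dependency. The names TableGridParser and extract_tables are hypothetical. The sketch makes simplifying assumptions: nested tables and rowspan are not handled, and each cell is expected to have an explicit closing tag.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # finished tables
        self._rows = None  # rows of the table being read, or None
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None
        self._span = 1     # colspan of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            span = dict(attrs).get("colspan") or "1"
            self._span = max(1, int(span)) if span.isdigit() else 1

    def handle_data(self, data):
        # Only text inside a cell is kept; inline tags such as <b> are dropped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace, including non-breaking spaces.
            text = " ".join("".join(self._cell).split())
            # Repeat the text across spanned columns so columns stay aligned.
            self._row.extend([text] * self._span)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html):
    """Return all tables in the HTML string as nested lists of strings."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Placeholder markup, not taken from any real filing.
sample = (
    "<table><tr><th colspan='2'>Example header</th></tr>"
    "<tr><td>Row&nbsp;A</td><td><b>1,234</b></td></tr></table>"
)
print(extract_tables(sample))
```

Repeating a spanned header across the columns it covers is one possible design choice: it keeps every row the same width, which makes it easier to pair each number with the header text above it when multiple header rows are present.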