The challenge is finished.

Challenge Overview

Welcome to this ideation challenge, please read on for details!

Please note that this is a data-science-based ideation challenge. There is a checkpoint you can submit to in order to receive feedback and potentially one of five $100 prizes. We've detailed exactly what we're expecting as outputs below.

If you’re already familiar with natural language processing and love the idea of competing to provide the best ideas for processing human-friendly information, then we just might have the challenge for you. Savvy investors know that a company's financial documents (such as Form 10-K, Form 10-Q, Form 20-F, Form S-1, or private financial statements) can be a great way to learn about a company’s finances, plus get information about the risks and challenges the company faces. Financial document information is presented in an organized, clean way that is easy to read … for humans. These documents, while published in HTML and readable to a human, have very little set structure and no standards for formatting and presentation. This makes it a minor nightmare to parse them with software for further analysis. With this ideation challenge we are seeking your knowledge and expertise to propose ideas, algorithms, and/or solutions on how we can take a financial document in HTML as an input, and spit it back out with annotations for the key major divisions, headers and subheaders. We are very excited to see what you can come up with!


Overview

Parsing financial documents to break them apart is hard, and it makes general analysis across companies difficult. The SEC has addressed this by requiring companies to file XBRL, an XML-based version of their financial statements, but XBRL filings are missing information like risk factors or management discussion and analysis. What we want, in essence, is a document outliner that can break up these documents into their nested structure: “Items” (which are major divisions in 10-Ks or 10-Qs, for example), “notes” (parts of the document that explain the major financial statements or tables), headers (such as “General”, “Operating Segments”, or “Operations” under Item 1: Business), and subheaders (such as “Our Business” or “What we Offer” under the “General” header), all of which are key components of these documents.
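To make the nested structure concrete, here is a minimal sketch of how an outline could be represented. The class name OutlineNode, its fields, and the render helper are illustrative assumptions and not part of the challenge materials; the titles are the examples quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutlineNode:
    """One entry in the document outline (hypothetical structure)."""
    kind: str   # "item", "note", "header", or "subheader"
    title: str
    children: List["OutlineNode"] = field(default_factory=list)

    def add(self, child: "OutlineNode") -> "OutlineNode":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def render(node: OutlineNode, depth: int = 0) -> List[str]:
    """Return the outline as indented lines, one line per node."""
    lines = ["  " * depth + f"[{node.kind}] {node.title}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Build the example hierarchy described in the overview.
item1 = OutlineNode("item", "Item 1: Business")
general = item1.add(OutlineNode("header", "General"))
general.add(OutlineNode("subheader", "Our Business"))
general.add(OutlineNode("subheader", "What we Offer"))
item1.add(OutlineNode("header", "Operating Segments"))

outline_text = "\n".join(render(item1))
```

Printing outline_text shows the item at the top level, the headers indented once, and the subheaders indented twice. A tree like this can hold any depth of sub-sub-headers without changes.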


Objective

The objective of the ideation is to find key items and sub-items within financial documents. For example, Form 10-K documents consist of 20 major sections called “Items” (some of which are optional; you can find more information about the items here), some large tables of financial data, and a section of “notes” explaining the financial data. Likewise, Form 10-Q documents are company financials that are issued quarterly, and are reduced versions of 10-Ks. Form 20-F documents are foreign company filings, and are generally complex due to different accounting standards throughout the world. Form S-1 is the initial registration for new securities. Private financial statements are statements that private companies often file for their boards and investors. Financial documents are generally created using word processing applications like Microsoft Word or Google Docs. You can find more information on the SEC website as well as by reviewing the documents themselves (search for your favorite public company here). Within the “Items”, there are usually headers, sub-headers, sub-sub-headers, etc., but the overall structure of the item content is not mandated.

We want to be able to take a financial document in HTML as an input, and spit it back out with annotations for the key items, notes, headers and sub-headers. While financial document formatting and content are very inconsistent between companies, they are generally consistent between different documents for the same company, and may also be more consistent for companies in the same industry. So a company's 2014 10-K will have a very similar structure to its 2014 10-Q statements and to its 2015 10-K.
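As a rough illustration of "HTML in, annotations out", the sketch below uses Python's standard html.parser module to flag short blocks of text that are entirely bold as heading candidates. The choice of bold text as a cue, the 120-character limit, the tag sets, and all names (HeadingCandidateParser, find_heading_candidates) are assumptions made for this example; a real solution would likely combine many more cues, such as font-weight values like 700, centering, and capitalization.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "td"}          # assumed block-level containers
BOLD_TAGS = {"b", "strong"}
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class HeadingCandidateParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, is_bold) for each open element
        self.chunks = []       # (text, is_bold) pieces of the current block
        self.annotations = []  # output: one dict per heading candidate

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in BLOCK_TAGS:
            self.flush()
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        is_bold = tag in BOLD_TAGS or "font-weight:bold" in style
        self.stack.append((tag, is_bold))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((text, any(bold for _, bold in self.stack)))

    def flush(self):
        """Close the current block and record it if every piece was bold."""
        if self.chunks:
            text = " ".join(piece for piece, _ in self.chunks)
            if all(bold for _, bold in self.chunks) and len(text) <= 120:
                self.annotations.append(
                    {"type": "heading_candidate", "text": text}
                )
            self.chunks = []


def find_heading_candidates(html):
    parser = HeadingCandidateParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # handle trailing text that no closing tag flushed
    return parser.annotations


sample = (
    "<p><b>Item 1. Business</b></p>"
    "<p>We sell <b>widgets</b> worldwide.</p>"
    '<div style="font-weight: bold">General</div>'
)
candidates = find_heading_candidates(sample)
```

On the made-up sample, the first paragraph and the styled div are flagged, while the middle paragraph is skipped because only part of it is bold.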


Can’t you just…

The sheer variety of these documents means that a single approach might not always work.  We’ve included a document with samples of the “Note” headers from a large sample of the documents from various companies; they are numbered in a huge variety of ways, and sometimes they aren’t numbered at all. And although formatting is consistent within a single document, formatting across documents varies wildly. Headings are almost never identified with H tags. So the approach must be flexible enough to deal with many different documents and formats.
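To show why a single pattern falls short, here is a small sketch that recognizes "Item" and "Note" headings written in a few numbering styles. The patterns and the example strings are hypothetical and were not taken from the provided sample document; unnumbered note headers are deliberately not matched, which is exactly the kind of gap a flexible approach has to cover.

```python
import re

# "Item 1A. Risk Factors", "ITEM 7: ...", "Item 2 - Properties"
ITEM_RE = re.compile(
    r"^\s*item\s+(\d{1,2}[a-z]?)\b\s*[.:\-]?\s*(.*)$", re.IGNORECASE
)

# "Note 3 - Inventories", "NOTE B: Income Taxes", "Note IV. Debt"
# Roman numerals are tried before single letters so "IV" is not read as "I".
NOTE_RE = re.compile(
    r"^\s*note\s+(\d{1,2}|[ivxl]+|[a-z])\b\s*[.:)\-]?\s*(.*)$", re.IGNORECASE
)


def classify_line(text):
    """Return (kind, label, title) for a recognized heading, else None."""
    match = ITEM_RE.match(text)
    if match:
        return ("item", match.group(1).upper(), match.group(2).strip())
    match = NOTE_RE.match(text)
    if match:
        return ("note", match.group(1).upper(), match.group(2).strip())
    return None


examples = [
    "Item 1A. Risk Factors",
    "NOTE 3 - Inventories",
    "Note B: Income Taxes",
    "Note IV. Debt",
    "Summary of Significant Accounting Policies",  # unnumbered: not matched
]
results = [classify_line(line) for line in examples]
```

The last example returns None, so text patterns like these would need to be combined with other evidence, such as formatting or position in the document.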


Attachments

Be sure to check the forum posting to get access and descriptions of the relevant attachments we're providing for this challenge.
 

Requirements To Win a Prize

Checkpoint - Five Prizes of $100
In order to be eligible for a checkpoint prize you must submit to this challenge by June 22nd at 19:00 EDT.
Any submissions timestamped before then will be eligible and considered in the running for a prize for both the checkpoint and final submission. After this time you will only be eligible for final submission prizes.
The advantages of submitting to the checkpoint phase will be to get client feedback around your ideas and the opportunity to receive one of the five $100 checkpoint prizes.
You are not required to submit for the checkpoint in order to submit for the final phase.

Final Submission:
Achieve a passing score in the top 5. See the “Evaluation Criteria” section below. 
Submit a complete report, at least 2 pages long, outlining your proposal. The required content appears in the “Submission Requirement” section below. 

** Bonus Opportunity: $500 **
If we see something truly innovative, we reserve the right to award an additional $500 bonus prize to the submission that comes up with a truly innovative way to solve the problem. The recipient of this prize does not need to be the first place winner.


Final Submission Guidelines

Submission Requirement

All submissions should include a document of up to 5 pages (2-4 pages are recommended) in Word or PDF format that describes your resources, your thought process, and all useful links, and justifies how your solution satisfies our judging criteria and goals.

More specifically, your document should include:
1. Report – Your report should be 2-5 pages long, and describe in detail how you would go about both finding and classifying headers.
2. Platform – The technologies and techniques used should be suitable for use in a production system. This must, above all, be implementable, so while new and/or experimental technologies can be used, they should be suitable for use in production systems, or mitigation plans should be included. We prefer open source libraries. Besides the cost advantages, open source often has good community support, as well as generally allowing greater modification and customization. If third-party software is to be used, the following should be given:
a. Name
b. Version
c. Vendor (if proprietary) / License (if open-source)
d. Estimated Annual Cost
e. Link to product and license
3. Completeness – If your submission depends on other resources, such as a corpus of machine learning data, you should include information about how that resource can be acquired. While we don’t expect every detail to be spelled out, major dependencies should be well described.
4. Algorithms – Key algorithms should be described. If proprietary software is used, best-available information is acceptable. The following information should be included:
a. Name
b. Written description with performance bounds
c. Past implementations (where applicable)
d. Published references (where applicable)
e. Pseudocode
f. Additional inputs required, if any
g. Strategies for incorporating feedback about misclassifications
h. Estimates for accuracy, including the basis for such estimates
i. Known advantages and disadvantages of the approach
j. Known scalability issues (or reasons why it is particularly scalable)

Evaluation criteria

You will be judged on the quality of your ideas, the quality of your description of those ideas, and how much benefit they can provide for the client. The winner will be chosen based on the most logical and convincing reasoning as to how and why the idea presented will meet the objective.

As an ideation challenge, this contest will be at least somewhat subjective. However, the following criteria will largely be the basis for the judgement:
1. Accuracy
2. Implementability 
3. Completeness (i.e. coverage of all critical sections of the financial statements)
4. Robustness of report (i.e. how much of the challenge objectives are covered and documented as part of the submission requirement)

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30054564