Challenge Overview
1. Context
Project Context
- This challenge is part of a brand new project to introduce new AI-based capabilities to the Topcoder platform. As of now, the project's immediate goal is to extract useful information from the challenge specification and explore its use cases.
Challenge Context
Within the project context mentioned above, the current challenge aims to deliver an implementation that can generate relevant tags for a challenge, using its challenge specification as input.
2. Expected Outcome
- A command-line or API based tool that takes a challenge specification string as input and returns a list of tag objects that are relevant to the challenge specification. The exact nature of this expected list and the tag objects is described in the Expected output format sub-section under the Challenge Details section below.
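For illustration only, a command-line wrapper could look roughly like the following sketch, where generate_tags is a hypothetical placeholder for the competitor's actual implementation and the file-path argument is an assumed interface:

    import argparse
    import json

    def generate_tags(spec: str) -> list:
        # Hypothetical placeholder for the competitor's actual tagging logic.
        return []

    def main() -> None:
        parser = argparse.ArgumentParser(description="Generate tags for a challenge specification")
        parser.add_argument("spec_file", help="path to a file containing the challenge_spec string")
        args = parser.parse_args()
        with open(args.spec_file, "r", encoding="utf-8") as f:
            spec = f.read()
        # Print the tag object list as JSON on stdout.
        print(json.dumps(generate_tags(spec), indent=2))

    if __name__ == "__main__":
        main()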
3. Challenge Details
Dataset
- The dataset (in JSON format) can be found in the forum thread. The data is in the format:

    challenge_id: {
        "sub_track": sub_track of the challenge,
        "challenge_url": url of the challenge,
        "technologies": the list of technology tags of the challenge,
        "platforms": list of platform tags of the challenge,
        "challenge_spec": the challenge specification string including HTML
    }
Input
It should be noted that although five attributes of the challenges have been provided, the final implementation should take a single string, i.e. only the challenge_spec value, as input to generate the output. The other details can be used as needed to improve the program. The input string can come either with or without the HTML; the tool should be able to handle both cases.
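As a hedged example, one way to handle both cases is to strip any markup before further processing. The sketch below assumes the BeautifulSoup (bs4) library, which is only one of several acceptable approaches:

    from bs4 import BeautifulSoup

    def normalise_spec(challenge_spec: str) -> str:
        # Returns plain text whether or not the spec contains HTML;
        # plain-text specs pass through essentially unchanged.
        return BeautifulSoup(challenge_spec, "html.parser").get_text(separator=" ")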
Expected output format
The output of the tool should be a list of string tags, or a list of tag objects, where each object can include additional metadata about the tag. For example, if the generated tags are 'Python', 'Machine Learning', 'NLP', 'Shopping', 'e-commerce' and 'Improve Sales', then the output should be a list of objects:
    [
        {
            tag: 'Python',
            type: 'required_skill',
            source: 'external_EMSI'
        },
        {
            tag: 'Machine Learning',
            type: 'required_skill',
            source: 'external_EMSI'
        },
        {
            tag: 'NLP',
            type: 'required_skill',
            source: 'external_EMSI'
        },
        {
            tag: 'Shopping',
            type: 'problem_domain',
            source: 'custom'
        },
        {
            tag: 'e-commerce',
            type: 'problem_domain',
            source: 'external_ABCD'
        },
        {
            tag: 'Improve Sales',
            type: 'summary_phrases',
            source: 'custom'
        }
    ]
Hence, the expected attributes of each tag object are:
- tag - the name of the tag
- type - the category of the tag. Competitors are free to create a list of their own categories. Some ideas for tag categories are: 'required_skill', 'problem_domain', 'target_audience', 'summary_phrases', etc. Among these, 'required_skill' and 'problem_domain' are essential; any additional categories are optional but welcome and can potentially lead to a higher review score.
- source - the source of this tag. If the value has been suggested by a custom model or implementation built by the competitor, its source attribute should be set to 'custom'. If the tag has been extracted from an external API/service, like EMSI, then the tag object's source should be set to 'external_XYZ', where XYZ is the name of the service. For example, any tag that has been received from the EMSI service should have the source attribute 'external_EMSI'. In fact, the use of EMSI is required for skill tags. (For more details about EMSI, kindly check the 'External Data' section below.)
- match_score - (Optional) the estimated confidence score of how well the tag matches the spec. The accuracy of this score is not critical and will not affect the review score unless it is particularly misleading.
- Any additional optional attribute which might be useful can be included.
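As an illustration only (not a required structure), the tag objects could be modelled in Python roughly as follows; the class and field names simply mirror the attributes listed above:

    from dataclasses import dataclass, asdict
    from typing import Optional

    @dataclass
    class Tag:
        tag: str                              # the name of the tag
        type: str                             # e.g. 'required_skill', 'problem_domain'
        source: str                           # 'custom' or 'external_XYZ'
        match_score: Optional[float] = None   # optional confidence score

    # Serialising a tag object for the output list:
    print(asdict(Tag(tag='Python', type='required_skill', source='external_EMSI')))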
Important note about Technology and Platform Tags
The platform and technology tags are available for some of the challenges (mostly Development challenges) in the provided dataset. It should be noted that these tags might not be reliable, as they are assigned manually by copilots and are at times not exhaustive.
Furthermore, although it is expected that the output for Code challenge spec inputs contains similar kinds of tags, the expected output tags should NOT be limited to tags that resemble the current technology and platform tags. The output tags should ideally also contain additional kinds of tags, such as the 'problem_domain' and 'summary_phrases' tags described above.
External Data
In addition to any custom NLP implementations, the use of external data and resources is not only allowed but encouraged.
About EMSI:
The use of the EMSI skill extraction service is mandatory for extracting skill tags from the spec text: https://skills.emsidata.com/extraction. Note that competitors are free to add additional skill tags using their own models (which should carry the 'custom' source attribute value, as described in the Expected output format section above).
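As a hedged sketch only, a call to the EMSI extraction service with the requests library might look roughly like the following. The auth flow, endpoint URLs, payload fields and response shape are assumptions taken from EMSI's public documentation and must be verified against the link above:

    import requests

    def get_emsi_token(client_id: str, client_secret: str) -> str:
        # Assumed OAuth client-credentials flow for EMSI.
        resp = requests.post(
            "https://auth.emsicloud.com/connect/token",
            data={"grant_type": "client_credentials",
                  "client_id": client_id,
                  "client_secret": client_secret,
                  "scope": "emsi_open"},
        )
        resp.raise_for_status()
        return resp.json()["access_token"]

    def extract_skill_tags(spec_text: str, token: str) -> list:
        # Assumed extraction endpoint and payload; check the EMSI docs.
        resp = requests.post(
            "https://emsiservices.com/skills/versions/latest/extract",
            headers={"Authorization": f"Bearer {token}"},
            json={"text": spec_text, "confidenceThreshold": 0.6},
        )
        resp.raise_for_status()
        # Response shape ('data' -> 'skill' -> 'name') is also an assumption.
        return [{"tag": item["skill"]["name"],
                 "type": "required_skill",
                 "source": "external_EMSI"}
                for item in resp.json().get("data", [])]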
Similar services can be explored for inspiration for other types of tags, such as the 'problem_domain' and 'summary_phrases' types. It should be noted that any paid resource, or any tool whose use might lead to a violation of its license, should not be used.
A few ideas
In addition to exploring resources like Emsidata Skills (for skill tags) and other similar kinds of APIs for other kinds of tags, state-of-the-art NLP techniques can be explored. Particularly, deep learning based text summarization models might be useful to generate some initial summaries, from which individual words and phrases can be extracted - these phrases can be set as 'summary_phrases', but these should be one or two word phrases, and not long sentences.
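As a rough sketch of that idea, assuming the Hugging Face transformers library and a freely available summarization model (the model name and the naive phrase chunking below are illustrative choices, not requirements):

    from transformers import pipeline

    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    def summary_phrase_tags(spec_text: str) -> list:
        # Crude character truncation to respect typical model input limits.
        summary = summarizer(spec_text[:3000], max_length=60, min_length=10)[0]["summary_text"]
        # Naive chunking into one- or two-word candidate phrases; a proper
        # noun-phrase extractor could be substituted here.
        words = [w.strip(".,") for w in summary.split()]
        phrases = {" ".join(words[i:i + 2]) for i in range(0, len(words) - 1, 2)}
        return [{"tag": p, "type": "summary_phrases", "source": "custom"}
                for p in phrases if 0 < len(p.split()) <= 2]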
Furthermore, if deep learning is used, freely available existing models can be used as a starting point for transfer-learning based implementations. Finally, a hybrid implementation can be created which uses various kinds of services, techniques and models to generate the final list. In general, apart from the requirement of using the EMSI API for skill tags, there is no restriction on the kind of technology used, as long as the input and output format is consistent with the requirements.
4. Scorecard Aid
Judging Criteria
- The review of this challenge will be subjective in nature. That is, scoring will be based on the reviewer's subjective analysis of the submission. Broadly, the subjective review will be along the following lines:
- Basic correctness of the submission with respect to the requirements
- Correctness of the input/output format
- Code quality and how well the files are structured (to ensure ease of future extensibility)
- Quality of the documentation and any verification steps
- The quality of the output tags - once the above criteria are met, the quality of the outputs of the various submissions will be compared and scores will be assigned accordingly. The ranking will be broadly based on the quality of the tagging.
Final Submission Guidelines
- A command-line based or API based tool that fulfills the requirement mentioned above.
- A demo script that demonstrates the proper working of the tool by passing the challenge_spec string from each of the challenges in the provided dataset as input and generating the output. The output should be saved to a CSV in the format (for each row): challenge_url, output tag object list (a minimal sketch of this step follows this list).
- Clear instructions about how to run the tool. Here, no assumptions should be made about any software to be pre-installed. The instructions should be such that someone with basic knowledge of how to use the command line can deploy the code. Note - In case it is an API tool, the instructions should be clear on how to deploy it locally.
- A separate document listing the strategy used.
- Optional - Any additional documentation, verification step or video/audio that can help better understand the submissions.
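For illustration only, the demo output step referenced above might look roughly like this sketch, where generate_tags is a hypothetical placeholder for the competitor's own tool entry point and the file names are assumptions:

    import csv
    import json

    def write_demo_csv(challenges: dict, out_path: str = "demo_output.csv") -> None:
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["challenge_url", "tags"])
            for data in challenges.values():
                tags = generate_tags(data["challenge_spec"])  # hypothetical tool entry point
                writer.writerow([data["challenge_url"], json.dumps(tags)])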