Challenge Overview
Problem Statement
Competitors with the best 5 submissions in this contest according to system test results will receive the following prizes:

Prize (USD)
1st: $10,000
2nd: $5,000
3rd: $2,500
4th: $1,500
5th: $1,000

Bonus Prize - $100 per winning submission
We will ask the 5 winners to submit a Docker file for their solution and will provide guidance in that regard. We will pay the winners who submit a Docker file an additional $100 each for their efforts.

Introduction
This contest builds upon one we ran recently. It uses the same format that allows submissions to employ any language or open source libraries that meet the licensing requirements specified below in the "Final Code Submission" section.

Requirements to Win a Prize
In order to receive a prize, you must do all the following:
If you place in the top 5 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all the above.

Background
Faith Comes by Hearing ("FCBH") is dedicated to spreading the message of the Bible across the globe. Over the years, FCBH has observed that many people can't read or live in oral communities. The organization wishes to allow as many people as possible to hear the Bible in their native languages.

Objective
To enable this, FCBH seeks algorithms that can correctly identify the languages spoken in audio recordings. Given speech data from multiple languages, your algorithm must identify which languages are spoken.

Data Description
You will receive both a training data set ("Contestant Training Data") and a test data set ("Contestant Test Data"). Both data sets contain recorded speech in 176 languages ("Possible Languages Spoken"). A separate .mp3 file stores each speech recording, and only 1 language is spoken in each file. The Contestant Training Data contains:
The Contestant Test Data contains:
There is no indication of which Possible Language Spoken in each Contestant Test Data file is the Actual Language Spoken in that file (i.e., the data are unlabeled).

Algorithm Output Format Requirement
Your algorithm must output 3 records for each Speech File. Each record must contain 3 comma-separated fields in the following order:
An example output appears below for 2 Speech Files:
fileName1.mp3,LanguageA,1
fileName1.mp3,LanguageB,3
fileName1.mp3,LanguageC,2
fileName2.mp3,LanguageG,3
fileName2.mp3,LanguageT,1
fileName2.mp3,LanguageB,2

Functions
During the contest, only your results will be submitted. You will submit code which implements only one function, getURL(). Your function will return a String corresponding to the URL of your answer .csv file. You may upload your .csv file to a cloud hosting service such as Dropbox, which can provide a direct link to your .csv file. To create a direct sharing link in Dropbox, right-click the uploaded file and select Share. You should be able to copy a link to this specific file which ends with the tag "?dl=0". This URL will point directly to your file if you change this tag to "?dl=1". You can then use this link in your getURL() function. Another common option is to share the file through Google Drive. If you choose that, please use the following format to create a direct sharing link: "https://drive.google.com/uc?export=download&id=" + id; You may share your result file in any other way, but make sure the link you provide opens the file stream directly. Your complete code that generates these results will be tested at the end of the contest.

Hints
The following links may provide helpful background information on previous, similar work. Sections "IV. Acoustic-Phonetic Approaches" and "V. Topics in System Developments" of Spoken Language Recognition: From Fundamentals to Practice by H. Li et al. (2013) could be particularly useful.
- National Institute of Standards and Technology (NIST) Language Recognition Evaluation 2009
- NIST Language Recognition Evaluation 2011
You may find the techniques, tools, and research papers in the following section useful.
Technique: Language Identification (Weka)
Technique: Spoken Language Classification
Technique: Voice Pattern Designs
Technique: Audio Processing
Technique: Machine Learning through Audio Analysis
Technique: Various

Scoring
We will run both provisional and system tests using the same submission, but you will not know which Speech Files are used for which test. Your algorithm's performance will be quantified as follows. For each Speech File:
- If possiblelLang = Actual Language Spoken and langRank = 1, then 1.0 points
- Else if possiblelLang = Actual Language Spoken and langRank = 2, then 0.40 points
- Else if possiblelLang = Actual Language Spoken and langRank = 3, then 0.16 points
- Else 0.00 points
Score = 1,000 * Total points for all Speech Files in the Contestant Test Data
The maximum possible total scores for example, provisional, and system testing are 0, 3,520,000, and 8,800,000. If there are ties in the final system test results, then we will break them using algorithm run time on the Amazon Web Services m4.xlarge virtual machine described above. However, we will not measure algorithm run time or break any ties during the contest.

Report
Your report must be at least 2 pages long, contain at least the following sections, and use the section names below.
Contact Information
Approach Used
Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.

Final Code Submission
You must submit all code used in your algorithm. You may use any programming language you like, provided it is free and open source, both for you and for the client. If you use something other than a standard Python/gcc/Java install, which is what our testers usually use, please include installation instructions for that language, and likewise for any additional libraries and packages. You must submit evidence that your code runs successfully to completion. The code will be run on CentOS 6.5 x86_64 HVM.

Data File Downloads
Definition
Notes
- In the evaluation after the contest, the submitted code must actually return results by processing the data files, without manually hard-coded results. Specifically, we should be able to execute the submitted code using the original contest data and obtain results similar to those submitted during the contest.
- Usage of additional data is acceptable, provided it is freely available in the same manner as the code and libraries that are used. Any additional data should likewise be submitted with the code after the contest (in the case of a top 5 finish). The submitted code with additional data will be subject to the same requirement of being able to reproduce the results submitted during the contest.
- 2/7 of the submitted languages are used for provisional testing, and 5/7 are used for system testing. This will be *approximately* 20 and 50 of each language (but not exactly, since the split was chosen randomly).
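The rule in the Scoring section can be sketched in a few lines of Python. This is an illustration only, not the official scorer: the names `RANK_SCORE`, `score_submission`, and `actual_language` are hypothetical, the ground-truth mapping is something contestants do not have for the test data, and the sketch assumes each Speech File lists a given language at most once. Points are stored already multiplied by 1,000 so the total stays an exact integer.

```python
# Points per matching record, pre-multiplied by 1,000 (1.0, 0.40, 0.16 points).
RANK_SCORE = {1: 1000, 2: 400, 3: 160}


def score_submission(records, actual_language):
    """Return the total Score for a list of (file_name, language, rank) records.

    actual_language maps file_name -> Actual Language Spoken. It is a
    hypothetical ground-truth table used only for local validation.
    """
    total = 0
    for file_name, language, rank in records:
        if actual_language.get(file_name) == language:
            # Ranks other than 1, 2, or 3 earn nothing.
            total += RANK_SCORE.get(rank, 0)
    return total


# The example output from the statement, with an assumed ground truth.
records = [
    ("fileName1.mp3", "LanguageA", 1),
    ("fileName1.mp3", "LanguageB", 3),
    ("fileName1.mp3", "LanguageC", 2),
    ("fileName2.mp3", "LanguageG", 3),
    ("fileName2.mp3", "LanguageT", 1),
    ("fileName2.mp3", "LanguageB", 2),
]
truth = {"fileName1.mp3": "LanguageC", "fileName2.mp3": "LanguageG"}
print(score_submission(records, truth))  # 400 + 160 = 560
```

Running the same function on a held-out slice of the Contestant Training Data gives a local estimate of the provisional score.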
Examples
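As an illustration of the Algorithm Output Format Requirement, the sketch below turns per-language confidence scores into the 3 records required for each Speech File. The field order (file name, language, rank) is inferred from the example output in the statement, and `top3_records`, `write_records`, and the confidence dictionaries are hypothetical names and data, not part of the contest API.

```python
import csv
import io


def top3_records(file_name, scores):
    """Return up to 3 (file_name, language, rank) tuples, best guess first.

    scores maps language -> confidence, where higher means more likely.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:3]
    return [(file_name, language, rank)
            for rank, (language, _) in enumerate(ranked, start=1)]


def write_records(predictions, out):
    """Write one comma-separated record per line to the text stream out."""
    writer = csv.writer(out, lineterminator="\n")
    for file_name, scores in predictions.items():
        writer.writerows(top3_records(file_name, scores))


# Hypothetical model output for a single Speech File.
predictions = {
    "fileName1.mp3": {
        "LanguageA": 0.7, "LanguageB": 0.1, "LanguageC": 0.2, "LanguageD": 0.0,
    },
}
buffer = io.StringIO()
write_records(predictions, buffer)
print(buffer.getvalue(), end="")
```

Passing an open file instead of the `io.StringIO` buffer would produce the answer .csv file directly. A model that scores fewer than 3 languages yields fewer than 3 records here, so a real submission would need to score at least 3 candidates per file.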
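The link conversions described in the Functions section can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names, the Dropbox share URL, and the module-level form of `getURL()` are hypothetical, and the class and method signature the contest system expects are not shown in this statement.

```python
def dropbox_direct_link(share_url):
    """Change the trailing "?dl=0" tag of a Dropbox share link to "?dl=1"."""
    if not share_url.endswith("?dl=0"):
        raise ValueError("expected a Dropbox share link ending in ?dl=0")
    return share_url[:-1] + "1"


def google_drive_direct_link(file_id):
    """Build a direct link in the format given in the statement."""
    return "https://drive.google.com/uc?export=download&id=" + file_id


def getURL():
    # Placeholder share link; replace it with the link to your own answer file.
    return dropbox_direct_link(
        "https://www.dropbox.com/s/HYPOTHETICAL_ID/answers.csv?dl=0"
    )


print(getURL())
```

Opening the returned URL in a private browser window is a quick check that it starts the file download directly rather than showing a preview page.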
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.