Challenge Overview
Problem Statement
Prize Distribution

Competitors with the best 3 submissions in this contest according to system test results will receive the following prizes:

1st: $2,500
2nd: $1,500
3rd: $1,000

Introduction

This contest uses a new format that allows submissions to employ any language or open-source libraries that meet the licensing requirements specified in the "Final Code Submission" section below. Furthermore, we have included some helpful hints on potentially viable approaches. We look forward to seeing your results in this new format.

Background

Faith Comes by Hearing ("FCBH") is dedicated to spreading the message of the Bible across the globe. Over the years, FCBH has observed that many people cannot read or live in oral communities. The organization wishes to allow as many people as possible to hear the Bible in their native languages. To enable this, FCBH seeks algorithms that can correctly and quickly identify the languages spoken in audio recordings. Given speech data from multiple languages, your algorithm must learn how to identify which languages are spoken in new speech recordings. This learning requirement is most important to FCBH, as reflected in the "Algorithm Learning Evaluation" section below.

Requirements to Win a Prize

In order to receive a prize, you must do all of the following:
If you place in the top 3 but fail to do any of the above, you will not receive a prize; it will instead be awarded to the contestant with the next-best performance who did all of the above.

Data Description

You will receive both a training data set ("Contestant Training Data") and a test data set ("Contestant Test Data"). Both data sets contain recorded speech in 5 languages: English, French, Italian, German, and Spanish (the "Possible Languages Spoken"). Each speech recording is stored in a separate .mp3 file, and only 1 language is spoken in each file. The Contestant Training Data contains:
The Contestant Test Data contains:
There is no indication of which Possible Language Spoken in each Contestant Test Data file is the Actual Language Spoken in that file (i.e., the data are unlabeled).

Algorithm Output Format Requirement

Your algorithm must output 5 records for each Speech File. Each record must contain 3 comma-separated fields in the following order:
Your algorithm must output the results of testing each Speech File against every Possible Language Spoken in the Contestant Test Data. Thus, the number of rows in a complete set of output will be Total Speech Files * Possible Languages Spoken = 600 * 5 = 3,000. The "code" that is submitted during the contest should merely be a hardcoded set of return values. An example Java submission for 2 Speech Files appears below:

    public class SpokenLanguages {
        public String[] identify() {
            return new String[]{
                "fileName1.mp3,English,1",
                "fileName1.mp3,French,3",
                "fileName1.mp3,Italian,4",
                "fileName1.mp3,German,2",
                "fileName1.mp3,Spanish,5",
                "fileName2.mp3,English,5",
                "fileName2.mp3,French,1",
                "fileName2.mp3,Italian,2",
                "fileName2.mp3,German,4",
                "fileName2.mp3,Spanish,3"
            };
        }
    }

Hints

Different voices are used in the Contestant Training Data and the Contestant Test Data, and sometimes multiple speakers may be talking simultaneously. Focus on correctly identifying the language spoken, not on counting the speakers or distinguishing between them. The following links may provide helpful background information on previous, similar work. Sections "IV. Acoustic-Phonetic Approaches" and "V. Topics in System Developments" of Spoken Language Recognition: From Fundamentals to Practice by H. Li et al. (2013) could be particularly useful.

National Institute of Standards and Technology (NIST) Language Recognition Evaluation 2009
NIST Language Recognition Evaluation 2011

You may find the techniques, tools, and research papers in the following section useful.
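Before the technique links, here is a minimal, non-authoritative sketch in Java of the required output shape: 5 records per Speech File, each record being "fileName,language,rank", with the ranks 1 through 5 each used exactly once per file. The file name and ranking below are illustrative only, and `recordsFor`/`validRanks` are hypothetical helpers, not part of the contest interface.

```java
import java.util.ArrayList;
import java.util.List;

public class OutputFormatSketch {
    static final String[] LANGUAGES =
        {"English", "French", "Italian", "German", "Spanish"};

    // ranks[i] is the rank (1..5) assigned to LANGUAGES[i].
    static List<String> recordsFor(String fileName, int[] ranks) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < LANGUAGES.length; i++) {
            rows.add(fileName + "," + LANGUAGES[i] + "," + ranks[i]);
        }
        return rows;
    }

    // Sanity check: each file must use each rank 1..5 exactly once.
    static boolean validRanks(int[] ranks) {
        boolean[] seen = new boolean[6];
        for (int r : ranks) {
            if (r < 1 || r > 5 || seen[r]) return false;
            seen[r] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] ranks = {1, 3, 4, 2, 5};  // hypothetical ranking for one file
        for (String row : recordsFor("fileName1.mp3", ranks)) {
            System.out.println(row);
        }
        // A complete submission emits 600 files * 5 languages = 3,000 rows.
    }
}
```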
Technique: Language Identification (Weka)
Technique: Spoken Language Classification
Technique: Voice Pattern Designs
Technique: Audio Processing
Technique: Machine Learning through Audio Analysis
Technique: Various

Scoring

We will run both provisional and system tests using the same submission, but you will not know which Speech Files are used for which test. Your algorithm's performance will be quantified as follows. For each Speech File:

- If possibleLang = Actual Language Spoken and langRank = 1, then 1.00 points
- Else if possibleLang = Actual Language Spoken and langRank = 2, then 0.40 points
- Else if possibleLang = Actual Language Spoken and langRank = 3, then 0.16 points
- Else 0.00 points

Score = 10,000 * Total points for all Speech Files in the Contestant Test Data

The maximum possible total scores for example, provisional, and system testing are 100,000, 1,900,000, and 4,000,000, respectively. If there are ties in the final system test results, we will break them using algorithm run time on the Evaluation Test Data, as described below. However, we will not measure algorithm run time or break any ties during the contest.

Algorithm Learning Evaluation

As mentioned in the "Background" section above, FCBH needs algorithms that can learn to identify new languages from new audio data. We will take the top 3 algorithms according to system test results on the Contestant Test Data and evaluate their ability to learn as follows, using a new training data set ("Evaluation Training Data") and a new test data set ("Evaluation Test Data") described below. The Evaluation Training Data contains:
The Evaluation Test Data contains:
There is no indication of which Possible Language Spoken in each Evaluation Test Data file is the Actual Language Spoken in that file (i.e., the data are unlabeled). For this evaluation, you must write a make/build file and execution file (possibly a .sh script or similar) for your algorithm that do all the following:
On a common Amazon Web Services c3.large virtual machine, we will do all the following:
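The per-record scoring rule from the Scoring section can be sketched in a few lines of Java. This is a hedged illustration of the stated point values, not the official scorer; `pointsFor` is a hypothetical name.

```java
public class ScoringSketch {
    // The record for the actual language earns 1.0, 0.40, or 0.16 points
    // when it is ranked 1, 2, or 3; every other record earns 0.
    static double pointsFor(String possibleLang, String actualLang, int langRank) {
        if (!possibleLang.equals(actualLang)) return 0.0;
        switch (langRank) {
            case 1:  return 1.00;
            case 2:  return 0.40;
            case 3:  return 0.16;
            default: return 0.00;
        }
    }

    public static void main(String[] args) {
        // One speech file whose actual language is French, ranked 2nd:
        double pts = pointsFor("French", "French", 2);  // 0.40
        // The final score multiplies the total points by 10,000.
        System.out.println(10000 * pts);
    }
}
```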
Report

Your report must be at least 2 pages long and must contain at least the following sections.

Your Information

This section must contain the following:
Approach Used

Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code. This section must contain the following:
Final Code Submission

You must submit all code used in your algorithm. You may use any programming language you like, provided it is a free, open-source solution for you to use, and would be for the client as well. If you use something other than a standard Python/gcc/Java install, as usually used by our testers, please include installation instructions for that language, and likewise for any additional libraries and packages. You must submit evidence that your code runs successfully to completion. The code will be run on CentOS 6.5 x86_64 HVM.

Data File Downloads
Definition
Notes
- In the evaluation after the contest, the submitted code must actually return results by processing the data files, without manually hard-coded results specified. Specifically, we should be able to execute the submitted code using the original contest data and obtain results similar to those submitted during the contest.
- Usage of additional data is acceptable, provided it is freely available in the same manner as the code/libraries that are used. Any additional data should likewise be submitted with the code after the contest (in the case of a top-3 finish). The submitted code with additional data will be subject to the same requirement of being able to produce the results submitted during the contest.
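The notes above require that, after the contest, results come from processing the data files rather than from hard-coded strings. A hedged Java skeleton of that shape appears below; `rankLanguages` is a hypothetical stub standing in for a real trained classifier, and the directory argument is an assumption about how the code might be invoked.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EvaluationSkeleton {
    static final String[] LANGUAGES =
        {"English", "French", "Italian", "German", "Spanish"};

    // Stub: a real solution would extract acoustic features from the mp3
    // and rank the 5 languages by model score. This placeholder returns a
    // fixed ranking purely for illustration.
    static int[] rankLanguages(File mp3) {
        return new int[]{1, 2, 3, 4, 5};
    }

    // Produces 5 output rows per .mp3 file found in the directory, so the
    // results are derived from the data at run time, not hard-coded.
    static List<String> processDirectory(File dir) {
        List<String> rows = new ArrayList<>();
        File[] files = dir.listFiles((d, name) -> name.endsWith(".mp3"));
        if (files == null) return rows;
        Arrays.sort(files);
        for (File f : files) {
            int[] ranks = rankLanguages(f);
            for (int i = 0; i < LANGUAGES.length; i++) {
                rows.add(f.getName() + "," + LANGUAGES[i] + "," + ranks[i]);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String row : processDirectory(dir)) {
            System.out.println(row);
        }
    }
}
```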
Examples
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.