Challenge Overview
Problem Statement
Overall Prizes
Sub-Contest Prizes (per sub-contest, 4 total)
Background

The Healthy Birth, Growth, and Development (HBGD) program addresses the dual problem of growth faltering/stunting and poor neurocognitive development, including contributing factors such as fetal growth restriction and preterm birth. The HBGD program is creating a unified strategy for integrated interventions to solve complex questions about (1) life cycle, (2) pathophysiology, (3) interventions, and (4) scaling intervention delivery. The HBGD Open Innovation platform was developed to mobilize the global "unusual suspects" data science community to better understand how to improve neurocognitive and physical health for children worldwide. The data science contests are aimed at developing predictive models and tools that quantify geographic, regional, cultural, socioeconomic, and nutritional trends that contribute to poor neurocognitive and physical growth outcomes in children. The solutions developed by this challenge will support the efforts of the HBGD Open Innovation initiative.

Objective

The goal of this contest is to develop flexible methods that can adaptively fill in, back-fill, and predict time-series using a large number of heterogeneous training datasets. The data is a set of thousands of aggressively obfuscated, multivariate time-series measurements, with multiple output variables and multiple input variables. For each time-series, parts are missing: either individual measurements or entire sections. Each time-series has a different number of known and missing measurements, and the goal is to fill in the missing output variables as accurately as possible. How the missing input variables are treated is an open question, and is one of the key challenges to solve. Unlike many data science contest problems, this problem does not fit neatly into the standard machine learning framework: the series are sparsely and irregularly sampled, the input variables can themselves be missing, each series has multiple outputs to predict, and the series differ in length.
The goal of this contest is two-fold: we wish not only to obtain a very good, flexible solution, but also to encourage competitors to try a diverse set of approaches and to document which ones worked and why. Ideally, we would like competitors to try methodologies outside the standard scope of contest algorithms. While many contests are won with Random Forests and Gradient Boosted Regression Trees, we would like competitors to branch out and try recurrent neural networks, Gaussian Process Regression, polynomial regression, VARMAX processes, and other approaches. To facilitate this goal, several sub-contests will run in parallel with the main contest. Each of these sub-contests focuses on a particular type of solution approach, and carries additional prizes (in addition to the primary prize pool for the overall contest). Competitors should include a brief explanation of how their solution fits into the framework of one of the sub-contests, where applicable. Illustrative code sketches for each of the four approaches are given after the descriptions below.

Mixed Effects Models

Linear and non-linear mixed effects models are appropriate for our data, as we have discrete subjects for which we would like discernible models. Further, it would be valuable to have interpretable estimates for the effects of each nominal variable. For solid implementations, see the lme4 and nlme packages in R. Key challenge: linear models are insufficiently expressive for human growth data, but it is tricky to extend non-linear mixed effects models to the multivariate case.

Neural Networks

Deep neural networks and recurrent neural networks (RNNs) are of great interest because of the empirically successful research that has emerged recently, suggesting that these types of models have the potential to revolutionize many other computational fields. RNNs are of particular interest as they have the capacity to model variable-length inputs and deal natively with multiple outputs. For interesting recent work, see: https://arxiv.org/abs/1606.04130. There are many good RNN implementations for Python in Theano, TensorFlow, Keras, and other packages. Key challenge: RNNs implicitly assume that the inputs are regularly sampled, but our data is both sparsely and irregularly sampled.

Tree-Based Models

Random Forests and Gradient Boosted Decision Trees are some of the most popular machine learning models. Even though our data is not a precise fit to the independent and identically distributed vectors-of-features model underlying classic supervised learning, predictive models can still be fit and evaluated to good effect. The xgboost package and scikit-learn have great implementations of these types of models. Key challenge: tree-based models do not perform well with high-cardinality nominal variables, but these variables (such as subject id) provide key information that is necessary for good predictions.

Matrix Completion Models

Although not an obvious approach for time-series data, matrix completion methods can be used to address the sparsely sampled nature of our data. For example, instead of users and items for rows and columns, consider using subjects and time (days) for rows and columns. LightFM and libFM are two good packages to consider here. Key challenge: integrating the side information for each row and each column.

Note that competitors are free to submit different solutions in multiple sub-contests and/or in the main contest as well, and can win prizes in more than one.
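For the Mixed Effects Models sub-contest, the following is a minimal, illustrative sketch (not a reference implementation) of a linear mixed-effects fit in Python using statsmodels' MixedLM. The file name "training.csv" and the particular covariates in the formula are assumptions for illustration only; the lme4 and nlme packages in R offer equivalent and non-linear variants.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed file name; column names follow the data description below,
    # with COVAR_CONTINUOUS_1, COVAR_CONTINUOUS_2 chosen arbitrarily.
    train = pd.read_csv("training.csv")

    # One grouping label per individual: STUDYID + SUBJID uniquely identify a subject.
    train["subject"] = train["STUDYID"].astype(str) + "_" + train["SUBJID"].astype(str)
    obs = train.dropna(subset=["y1"])      # fit only on rows where y1 is observed

    # Random intercept and random slope on TIMEVAR1 for each subject;
    # fixed effects on a couple of continuous covariates.
    model = smf.mixedlm(
        "y1 ~ TIMEVAR1 + COVAR_CONTINUOUS_1 + COVAR_CONTINUOUS_2",
        data=obs,
        groups=obs["subject"],
        re_formula="~TIMEVAR1",
    )
    fit = model.fit()
    print(fit.summary())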
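For the Neural Networks sub-contest, one common workaround for the irregular-sampling challenge is to feed the elapsed time since the previous measurement (a delta-t feature) alongside the other inputs of a padded, masked RNN. The Keras sketch below is a rough illustration under that assumption; the sequence length, feature count, and zero-padding scheme are placeholders, not contest requirements.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    MAX_LEN, N_FEATURES = 64, 40     # assumed padded length and input width (incl. delta-t)

    model = keras.Sequential([
        # Skip padded time steps (rows that are all zeros).
        layers.Masking(mask_value=0.0, input_shape=(MAX_LEN, N_FEATURES)),
        layers.LSTM(64, return_sequences=True),       # one hidden state per measurement
        layers.TimeDistributed(layers.Dense(3)),      # predict y1, y2, y3 at every step
    ])
    # Mean absolute error mirrors the |p - t| scoring used in this contest.
    model.compile(optimizer="adam", loss="mae")

    # X: (num_subjects, MAX_LEN, N_FEATURES) padded inputs, one sequence per subject
    # Y: (num_subjects, MAX_LEN, 3) padded targets
    # model.fit(X, Y, epochs=20, batch_size=32)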
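For the Tree-Based Models sub-contest, the sketch below shows one possible way to sidestep the high-cardinality subject id: replace it with a per-subject summary of the observed target and let a gradient-boosted model handle the remaining missing inputs natively. File name and feature choices are illustrative assumptions.

    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingRegressor

    train = pd.read_csv("training.csv")              # assumed file name
    obs = train.dropna(subset=["y1"])

    # Replace the high-cardinality subject id with each subject's mean observed y1.
    subj_mean = obs.groupby(["STUDYID", "SUBJID"])["y1"].mean().rename("subj_y1_mean")
    obs = obs.join(subj_mean, on=["STUDYID", "SUBJID"])

    features = ["TIMEVAR1", "TIMEVAR2", "subj_y1_mean"] + \
               [c for c in obs.columns if c.startswith("COVAR_CONTINUOUS")]

    # Histogram-based GBDT handles missing feature values natively; an absolute-error
    # loss (where your scikit-learn version supports it) would match the contest metric.
    model = HistGradientBoostingRegressor(max_iter=300)
    model.fit(obs[features], obs["y1"])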
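For the Matrix Completion Models sub-contest, the sketch below builds a subjects-by-days matrix for one target and fills its gaps with a simple iterative truncated-SVD reconstruction. This is a bare-bones illustration: the day binning, latent rank, and file name are assumptions, and it ignores the side-information challenge noted above (which LightFM and libFM are designed to handle).

    import numpy as np
    import pandas as pd

    train = pd.read_csv("training.csv")                                # assumed file name
    train["subject"] = train["STUDYID"].astype(str) + "_" + train["SUBJID"].astype(str)
    train["day"] = train["TIMEVAR1"].round().astype(int)               # assumed day-level binning

    # Rows = subjects, columns = days, entries = observed y1 (NaN where unobserved).
    M = train.pivot_table(index="subject", columns="day", values="y1").to_numpy()
    mask = ~np.isnan(M)

    X = np.where(mask, M, np.nanmean(M))      # start by filling gaps with the global mean
    rank = 5                                  # assumed latent rank
    for _ in range(50):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]   # low-rank reconstruction
        X = np.where(mask, M, X_hat)          # keep observed entries fixed, update the gaps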
Data Description

The training and test data contain the following columns:

----------+--------------------+------------+-------------------------------------------------------
Column #s | Column Name(s)     | Data Type  | Description
----------+--------------------+------------+-------------------------------------------------------
1-3       | y1, y2, y3         | Float      | The three dependent variables to be predicted in test
----------+--------------------+------------+-------------------------------------------------------
4         | STUDYID            | Integer    |
----------+--------------------+------------+-------------------------------------------------------
5         | SITEID             | Integer    |
----------+--------------------+------------+-------------------------------------------------------
6         | COUNTRY            | Integer    |
----------+--------------------+------------+-------------------------------------------------------
7         | SUBJID             | Integer    |
----------+--------------------+------------+-------------------------------------------------------
8         | TIMEVAR1           | Float      |
----------+--------------------+------------+-------------------------------------------------------
9         | TIMEVAR2           | Float      |
----------+--------------------+------------+-------------------------------------------------------
10-39     | COVAR_CONTINUOUS_n | Float      | (30 fields)
----------+--------------------+------------+-------------------------------------------------------
40-47     | COVAR_ORDINAL_n    | Integer    | (8 fields)
----------+--------------------+------------+-------------------------------------------------------
48-55     | COVAR_NOMINAL_n    | Char       | (8 fields)
----------+--------------------+------------+-------------------------------------------------------
56-58     | y1, y2, y3 missing | True/False | (3 fields) whether the value exists in the ground truth
----------+--------------------+------------+-------------------------------------------------------

The combination of STUDYID and SUBJID is sufficient to uniquely identify a specific individual. Adding TIMEVAR1 is sufficient to uniquely identify each row.

The validation and test data file contains the same fields as the training data, with one primary difference: y1, y2, and y3 are not given, and are left empty. The last three columns contain the values "True" or "False", indicating whether y1, y2, or y3 is missing from the ground truth data. This test data file contains the tests for both provisional and system testing; however, you will not know which set each row belongs to. Any given subject/study pair belongs entirely to one or the other.

The submitted predictions should contain one row for each row in the test data set. Each row should contain three comma-separated values: the predicted values for y1, y2, and y3. The rows should be in the same order as given in the test data. Note again that wherever the test data indicates we have no ground truth for one or more of the three values, you may use 0 for the prediction of that value, as it will be ignored and will not contribute towards scoring.

Scoring

Your score for each individual prediction p, compared against the actual ground-truth value t, will be |p - t|. The score for each row, r, will then be the mean of the scores for the individual predictions on that row (possibly 1, 2, or 3 values). Over the full n rows, your final score will be calculated as 10 * (1 - Sum(r) / n). Thus a score of 10.00 represents perfect predictions with no error at all. All scores will be rounded down to two decimal places to help prevent overfitting.
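The snippet below is a small offline sketch of this scoring rule, which may be useful for local validation. The (n, 3) array layout and the handling of rows with no scored values are illustrative assumptions; the official scorer is authoritative.

    import numpy as np

    def contest_score(pred, truth, missing):
        """pred, truth: (n, 3) float arrays; missing: (n, 3) bool, True where no ground truth exists."""
        err = np.abs(pred - truth)
        err[missing] = 0.0                               # ignored values contribute no error
        counts = (~missing).sum(axis=1)                  # number of scored values in each row (1-3)
        row_err = err.sum(axis=1) / np.maximum(counts, 1)
        return 10.0 * (1.0 - row_err.mean())             # 10.00 means perfect predictions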
(Note that we may internally evaluate at higher precision in the event tie-breaking is needed for prize awards.) Submissions which are malformed in any way, such as having the wrong number of rows or values that do not parse as numeric, will score 0. To submit your entry, your code only needs to implement a single method, getURL(), which takes no parameters and returns the URL at which your predictions CSV file can be downloaded.

Example Testing

All example testing for this contest should be done offline, using the provided data. Note, however, that you may make a "Test Examples" submission using your predictions file for the provided test data. This will not provide any provisional scoring, but will confirm that the predictions file works correctly (the URL is accessible, the number of rows and columns is correct, and the numerical values parse correctly). This is not required, but can be used as a basic sanity check before making a full submission.

Baseline

A baseline score of 9.88 has been achieved using a combination of methodologies. Competitors in the main challenge will need to reach a score of 9.90 to be eligible for a prize, and sub-contest entries will need a score of at least 9.80.

General Notes
Requirements to Win a Prize
If you place in the top 5 overall, or in the top 3 in one of the four sub-contests, but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.
Definition
Examples
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.