Accelerating Science with Crowdsourcing: How Topcoder Sped Up GWAS Analysis
Precision medicine is one of the brightest stars rising in medicine and technology today. Simply put, precision medicine is the practice of tying a patient's treatment decisions to their genetic characteristics. Practitioners use tests and analytics (not just age, weight, history and gender) to understand how a patient is likely to progress through a disease, and then treat accordingly. Patients are organized into treatment classes based on similarities in how their genes vary. Although it sounds futuristic, this approach to healthcare has been around for years. The main roadblocks to expanding precision medicine have been the availability of tests and of the knowledge bases that power treatment decisions.
Genome-wide association studies (GWAS) are used to create these knowledge bases. A GWAS relates genetic variants in individuals to specific phenotypes (e.g., disease status). This association of differing genotypes (i.e., your cells' manufacturing templates) with disease-related phenotypic traits (i.e., the actual products made from those templates) can help identify new therapeutic targets and support the stratification of patients who would benefit most from specific drug classes. Whereas it was once complicated just to collect the necessary data, the challenge has since shifted to data analysis.
More speed, more science
One of the most popular tools for analyzing GWAS results is the open source package PLINK 1.07. But for today's large genotype-phenotype data sets, association analyses can take several hours for a single phenotype. Recently, the Pfizer Business Technology High Performance Compute group, collaborators from the Crowd Innovation Lab at Harvard University, and Topcoder were able to speed up this analysis. Through crowdsourcing, the logistic regression in PLINK 1.07 was accelerated 591-fold. (The full case study is available from GigaScience.) What once took 4.8 hours now takes 29 seconds: an entire afternoon spent on a single study is now reduced to a half-time commercial.
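To give a feel for the kind of computation being sped up, here is a minimal sketch of a logistic regression association test for a single variant, written in plain Python. It is an illustration only and is not PLINK's code or the contest-winning code: the function names `fit_logistic` and `wald_p_value`, the toy data, and the choice of a Newton-Raphson fit with a Wald test are all assumptions made for this example. A real GWAS repeats a fit like this for each of a very large number of variants, which is why the speed of this one step matters so much.

```python
import math


def fit_logistic(genotypes, phenotypes, iterations=25):
    """Fit logit P(case) = b0 + b1 * genotype by Newton-Raphson.

    genotypes: copies of the minor allele per individual (0, 1 or 2).
    phenotypes: 1 for case, 0 for control.
    Returns (b0, b1, standard error of b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0             # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0     # entries of the 2x2 information matrix
        for x, y in zip(genotypes, phenotypes):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the explicit inverse.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    # Variance of b1 is the matching diagonal entry of the inverse matrix
    # (evaluated at the last iterate before the final, tiny step).
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


def wald_p_value(beta, se):
    """Two-sided p-value for beta = 0 under a normal approximation."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2.0))


# Toy data (made up): 24 individuals grouped by minor-allele count.
genotypes = [0] * 10 + [1] * 10 + [2] * 4
phenotypes = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 3 + [0] * 1

b0, b1, se = fit_logistic(genotypes, phenotypes)
print("odds ratio per allele:", math.exp(b1))
print("p-value:", wald_p_value(b1, se))
```

Even this tiny version loops over every individual on every iteration, so the cost grows with both the number of samples and the number of variants tested.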
Now researchers can quite literally do more science — try more, experiment more, get more insight — in less time. Scientific discovery has moved from the rate at which a few pairs of hands can work to the speed of insight powered by CPUs. In this particular contest, the Topcoder Community acted as a paid Wikipedia of sorts; the best minds from around the world (none with knowledge specific to this field) came to compete and, as a result, solved this problem. Not only did they solve it, but they also boosted the quality of an open source package.
Open innovation and challenge-based crowdsourcing
Through open innovation and data science challenges on the Topcoder platform, the logistic regression for GWAS analysis was made hundreds of times faster. Contestants from Topcoder's 1.1M+ member base around the world competed to outscore one another over the course of the challenge. They submitted code that was automatically run and scored by Topcoder, with results posted to a real-time leaderboard. The contest portion of this optimization took place in two rounds of only two weeks each, and the results are in use today. The code was donated and incorporated into the PLINK2 open source project, where it is available to the greater computational biology community. This project enables more complex analyses of phenotype-genotype data sets and further extends precision medicine's reach.
Get more from your data through crowdsourcing. Learn how you can turn big data into big insights with Topcoder.