Challenge Overview
Problem Statement
Prizes
1st Place: $15,000
Progressive prizes: 4 × $1,000, awarded to the top four placements midway through the challenge (23:55 EDT, January 26, 2019). Note that teams which have passed DP pre-screening will have a significant score boost. For the full breakdown of each match period, see the information provided on Challenge.gov.

By registering for and competing in this challenge, you consent to sharing your contact information with NIST, and agree to be contacted by NIST in the event you are entitled to a payment on this challenge, as payments will be issued by NIST and not Topcoder.

This match is TCO19 eligible. Everybody who scores above 10k in the system test will get TCO19 points according to these rules.

Introduction

Detailed and accurate data builds the foundation for academic research, informed decision making, and a large variety of smart software systems. Collection and sharing of relevant datasets are often complicated by various restrictions, among them the need to protect private and sensitive information contained in the dataset from third parties, who nevertheless may have valid reasons to use the data. Consider a dataset listing all law enforcement incidents in a metropolitan area, where each row is an incident and the columns contain related information such as incident location, date and time of the 911 call, time of police arrival at the site, etc. Independent researchers would benefit significantly from analyzing this type of public safety data, for example to compare police presence and public protection across city neighborhoods. However, it is not acceptable to share sensitive information about specific incidents that might allow the identification of victims, suspects, or other sensitive details of these events.

One way to secure this data is to censor all sensitive information from the dataset before making it publicly available, for example by removing exact locations of incidents, exact times, names of involved people, etc. However, this significantly reduces the value of the remaining data for analysis. Another approach, based on Differential Privacy and synthetic data, allows for a better balance between data privacy and the loss of useful information. The overall idea of synthetic data is the following: using the original dataset, one may generate a synthetic (also called mock or artificial) dataset that describes fictional criminal events, such that the collective statistical properties of the synthetic dataset are very similar to those of the original one, and research conclusions drawn from the synthetic dataset hold true for the original one. Differential Privacy is a formal guarantee of privacy that considers the output probability distributions of randomized, privacy-preserving algorithms. Synthetic data generators that satisfy the Differential Privacy guarantee produce data that can be safely released to the public in place of the original sensitive data. We encourage you to read specialized literature and original research on differential privacy.

NIST's (National Institute of Standards and Technology) PSCR (Public Safety Communications Research) division and Topcoder are running a series of three similar marathon matches, each two months long, in which you will work to create new, or improve existing, differentially private synthetic data generation algorithms. This is the second marathon match in the series, with a $39k prize pool. The total prize pool of the series is up to $150k, which includes $29k for the first match and $62k for the final match, plus $20k in additional prizes at the end of the final match for the winning teams who provide their full code solutions in an open source repository.
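For reference, the standard formal definition of (ε, δ)-differential privacy is the following: a randomized algorithm M satisfies (ε, δ)-differential privacy if, for every pair of datasets D and D' that differ in a single record, and for every set S of possible outputs,

    Pr[M(D) ∈ S] ≤ exp(ε) × Pr[M(D') ∈ S] + δ

Smaller values of ε and δ correspond to a stronger privacy guarantee; the Problem section below fixes the particular ε and δ values your algorithm must be run with in this match.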
In each marathon match, you will be given an original dataset, for which you will generate a synthetic dataset. We will run various statistical tests on the submitted synthetic datasets, and score them based on the proximity of their statistical properties to those of the original data.

Problem

The competitor pack provides you with an extract of 2016 data from the San Francisco Fire Department's Call for Service dataset: in total, ~305k records and 32 columns of data in CSV format, plus additional JSON files with the data dictionary. For convenience of handling, all categorical data values were converted to numerical form. Details are explained in the README.md document included in the pack.

Your goal is to develop an (ε, δ)-differentially private algorithm. You will run that algorithm on the provided dataset with an arbitrary δ ≤ 1/n², where n is the number of dataset records, and three values of ε: 1.0, 0.1, and 0.01, to produce three output CSV files. You will name these output files 1_0.csv, 0_1.csv, and 0_01.csv, pack them together into a ZIP archive, and submit the resulting ZIP to the Topcoder system for preliminary scoring, as explained below. For your convenience, the competitor pack includes Python scripts for submission generation, and a naive implementation of a simple differentially private synthetic data algorithm as an example.

To ensure that submitted solutions were generated by valid (ε, δ)-differentially private algorithms, and with the specified values of ε and δ, you will be requested to submit a clear and complete write-up of your algorithm, along with a clear and correct proof that the algorithm satisfies differential privacy, for manual pre-screening by a team of experts in differential privacy. Competitors approved during pre-screening will receive a ×1000 boost to their further provisional scores displayed on the challenge leaderboard; i.e., submissions without such approval will have scores within the [0; 1,000] range, while submissions from pre-screened competitors, with the ×1000 factor, will score within the [0; 1,000,000] range. Pre-screening approval should not be considered an official certification of your algorithm's correct use of differential privacy, although the pre-screening team will do their best to ensure it. Algorithms with the potential of winning prizes will be reviewed with greater care at the conclusion of the match. To keep the leaderboard accurate, and to ensure fair play during the competition, you may be requested to submit a revised write-up and pass the pre-screening procedure again, to spot-check the current revision of your algorithm.

Differential Privacy Pre-Screening Procedure

Only the team captain can submit to have your solution pre-screened for differential privacy compliance (see below for teaming rules; notably, the team captain must be a US citizen). To apply for the pre-screening, the team captain should fill out this Google Form. The form requests your Topcoder username, your contact email, and a write-up containing a complete and clear algorithm specification and the theoretical proof that the algorithm satisfies differential privacy. Within approximately one week, you will be notified of the pre-screening outcome. Teams that pass pre-screening will be announced on the forum after each week's pre-screen review session; their leaderboard scores will receive the ×1000 point boost shortly thereafter. Teams that fail pre-screening will receive direct feedback by email.
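To make the task concrete: with n ≈ 305,000 records, δ ≤ 1/n² means δ ≤ roughly 1.1 × 10^-11. The naive baseline shipped in the competitor pack is not reproduced here; the sketch below is only a hypothetical illustration of the simplest kind of ε-differentially private generator (a Laplace-noised histogram over a few columns, with synthetic rows sampled from it). Column names are placeholders, the δ term is ignored, and no privacy-budget accounting across the three ε values is shown.

    import numpy as np
    import pandas as pd

    def naive_dp_synthesize(df, columns, epsilon, n_rows, rng=None):
        """Sketch of a naive epsilon-DP generator: add Laplace noise to the
        joint histogram of a few columns, then sample synthetic rows from
        the noisy histogram. Illustrative only, not the pack's baseline."""
        rng = np.random.default_rng() if rng is None else rng

        # Joint histogram (contingency table) over the chosen columns.
        # NOTE: a real DP algorithm must take the value domain from the
        # public data dictionary, not from the private data itself.
        counts = df.groupby(columns).size()
        domain = pd.MultiIndex.from_product(
            [sorted(df[c].unique()) for c in columns], names=columns)
        counts = counts.reindex(domain, fill_value=0).astype(float)

        # Each record falls into exactly one cell, so the histogram has
        # L1 sensitivity 1 and Laplace noise of scale 1/epsilon suffices.
        noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
        noisy = noisy.clip(lower=0.0)

        # Sampling from the noisy histogram is post-processing, so the
        # synthetic rows inherit the epsilon-DP guarantee.
        probs = noisy / noisy.sum()
        idx = rng.choice(len(probs), size=n_rows, p=probs.values)
        return pd.DataFrame(list(domain[idx]), columns=columns)

    # Hypothetical usage (placeholder file and column names):
    # df = pd.read_csv("fire_calls_2016.csv")
    # synth = naive_dp_synthesize(df, ["Call Type", "Priority", "Battalion"], 1.0, len(df))
    # synth.to_csv("1_0.csv", index=False)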
Submission of Generated Privatized Datasets into the Topcoder System

As explained earlier, your submission into the Topcoder system will be a ZIP archive containing three files, named 1_0.csv, 0_1.csv, and 0_01.csv, generated with the corresponding values of ε and with δ ≤ 1/n². Each of these CSV files must be smaller than 100 MB in size. An auxiliary Python script is included in the competitor pack to help you generate and check the submission archive. Once ready, you will upload this ZIP to a file sharing service that supports direct download URLs (for example, you may upload it to Google Drive, get the sharing URL, and replace its /open segment with /uc to get the direct download URL). The resulting URL should be submitted into the Topcoder system wrapped in a simple Java class, such as this example:

    class NistDp2 {
        public String getAnswerUrl() {
            return "https://drive.google.com/uc?id=1MUMH0IZ3-zgfLYloFXvOrhA4EpKgEBrl";
        }
    }

You can submit your solutions at most once every four hours.

Provisional Scoring

In this match, we use a combination of two scoring methods, explained in detail below. The first one is the same method used in the first match, developed to quantify the ability of your differential privacy algorithm to preserve the clustering characteristics of the dataset being privatized. For that purpose, we compare 3-column marginal density distributions between the original and synthetic datasets. In this second match we add a new scoring method: we generate sets of rules a dataset row may or may not satisfy, and we compare the fraction of records in the original and privatized datasets which satisfy those rules. In addition to expanding the dataset similarity criteria evaluated by the first scoring method, we expect this heuristic to efficiently capture dataset properties that may be relevant to common analytics tasks, such as classification, without biasing scores towards accuracy on any specific analytics task. The overall score will be the average of the scores given by each method to each of your solutions, generated for the different epsilons. Notice that the maximal score in each method would be achieved by submitting the original dataset as the solution; of course, that would violate differential privacy, and thus would not be a valid solution.

Scoring Method #1

A single test works on three dataset columns, picked randomly. The domain of possible values in those columns is split into buckets. For example, say two columns are categorical with 50 and 200 possible values, and the third column is numerical, with integer values between a and b, and also permits empty values. Our scoring algorithm divides the domain of the numerical column into 100 equal ranges, of size (b - a)/100 each. As the column allows empty values, it adds a "virtual range" for its empty values. It then creates 50 × 101 × 200 buckets, plus one special bucket for records outside of the valid value range; and, for both the original and submitted datasets, it counts the number of records falling into each bucket. The resulting counts are divided by the total number of records in each dataset to get the density distribution of records. Then, our scoring algorithm calculates the absolute difference of the density distributions for the original and submitted datasets, taking the sum of absolute differences of density values over corresponding pairs of buckets. Due to the normalization of density distributions to 1.0, the resulting difference is a number s between 0.0 (perfect match of density distributions) and 2.0 (the density distributions of the original and synthetic datasets do not overlap at all). The resulting single-test score is defined as

    S = 1 000 000 × (1 - s / 2)

To calculate the score shown on the provisional leaderboard, we create a set of 100 such tests, with randomly picked columns for each test. The scores from the separate tests are averaged; and, if the submitter has not been approved in the pre-screening procedure, the score is divided by 1 000.
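A minimal sketch of how one Scoring Method #1 test could be computed is shown below, assuming the challenge's convention that categorical values are already numeric; the bucket-boundary details and the special out-of-range bucket are simplified, and the column names in the usage example are placeholders (this is not the organizers' actual scoring code).

    import numpy as np
    import pandas as pd

    def method1_test_score(original, synthetic, cols, bins=100):
        """Sketch of one Method #1 test: compare the 3-column marginal
        density distributions by L1 distance, then map to a score."""
        def bucket(col, df):
            x, ref = df[col], original[col]
            if ref.nunique() > bins:                 # treat as numeric
                lo, hi = ref.min(), ref.max()        # 100 equal ranges
                b = np.floor((x - lo) / (hi - lo) * bins).clip(0, bins - 1)
                return b.fillna(bins).astype(int)    # extra bucket for empties
            return x.fillna(-1).astype(int)          # categorical: bucket per value

        def density(df):
            keys = list(zip(*(bucket(c, df) for c in cols)))
            return pd.Series(keys).value_counts(normalize=True)

        s = density(original).sub(density(synthetic), fill_value=0).abs().sum()  # in [0, 2]
        return 1_000_000 * (1 - s / 2)

    # Hypothetical usage: average over 100 random column triples.
    # score = np.mean([method1_test_score(orig_df, synth_df, list(triple))
    #                  for triple in triples])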
Scoring Method #2

A single test consists of a set of rules for different columns. Each column has a 33% chance of being included in the set; thus, on average, a single test sets rules for ~11 columns. For a categorical column, the rule is a randomly picked subset of its possible values (from 1 to the maximum number of values); for numeric columns, the rule is a randomly picked range of values. A dataset record satisfies the set of test rules if, in all its categorical columns that are included in the test, it has values belonging to the corresponding rules' subsets, and, in all its numeric columns that are included in the test, it has values within the selected ranges. The way tests are generated guarantees that the original dataset contains at least one record matching the test rules. For the i-th test we calculate the fraction of records satisfying the test rules in the original dataset (f_o,i) and in the privatized dataset (f_p,i). Then we quantify their mismatch using the following formulas:

    d_i = ln(max(f_p,i, 10^-6)) - ln(f_o,i)

    Δ = sqrt( (1/N) × Σ_{i=1..N} (d_i)² ),   where N = 300 is the total number of tests

    SCORE = 10^6 × max(0, 1 + Δ / ln(10^-3))

The score is divided by 1 000 if the submitter has not been approved in the pre-screening.
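A minimal sketch of the Method #2 computation, assuming the rules have already been generated (the rule-generation step and the N = 300 test harness are omitted, and the rule encoding is an assumption; this is not the organizers' actual scoring code):

    import math
    import numpy as np
    import pandas as pd

    def satisfies(df, rules):
        """Boolean mask of records matching every rule. rules maps a column
        name to either a set of allowed values (categorical column) or an
        inclusive (low, high) range (numeric column)."""
        mask = pd.Series(True, index=df.index)
        for col, rule in rules.items():
            if isinstance(rule, tuple):
                lo, hi = rule
                mask &= df[col].between(lo, hi)
            else:
                mask &= df[col].isin(rule)
        return mask

    def method2_score(original, synthetic, tests):
        """tests: list of rule dicts, each matching >= 1 original record."""
        d = []
        for rules in tests:
            f_o = satisfies(original, rules).mean()
            f_p = satisfies(synthetic, rules).mean()
            d.append(math.log(max(f_p, 1e-6)) - math.log(f_o))
        delta = math.sqrt(np.mean(np.square(d)))
        return 1_000_000 * max(0.0, 1 + delta / math.log(1e-3))

    # Hypothetical usage with made-up rules on placeholder columns:
    # tests = [{"Call Type": {3, 7}, "Priority": (1, 2)}]
    # print(method2_score(orig_df, synth_df, tests))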
Final Scoring and Validation

To determine the final ranking and prize eligibility, at the end of the provisional phase of the match the top performers from the leaderboard will be invited (via the forum) to participate in a sequestered phase for final scoring and validation. Invited teams will need to submit Docker containers with their executable synthetic data generators, final algorithm write-ups and privacy proofs, readable source code, and code guides. Note: it is very important for invited teams to watch the forum during the sequestered scoring phase. If there are problems with any of the submitted items, teams may be notified via the forum and given the opportunity to correct and resubmit. Teams will not be contacted by email to correct problems.

Final accuracy scoring will take place over a different year of the dataset, reflecting data from a different set of individuals, ensuring that the algorithm refinement process over the data made available during the provisional phase does not violate differential privacy guarantees. The first round of the final scoring will consist of testing your synthetic data generators against an independent ground truth, consisting of another set of randomly generated tests that work the same way as described above, potentially using additional values of (ε, δ). The top performing solutions will also undergo an additional, thorough differential privacy validation, including reviews of both the algorithm definitions and the actual source code by the team of differential privacy experts. Final code submissions should include a short code guide (e.g. a readme file) that clearly separates the source code files into pre-processing, privatization, and post-processing code, and identifies where in the source code each significant algorithm step occurs. We will also run the submitted solutions on the original dataset to verify that their scores were achieved with the correct ε and δ values. We reserve the right to adjust the provisional and final scoring methodologies during the contest, in a way that is fair to all competitors, in case any flaws are found in the original protocol.

Legal Rules

This marathon match is subject to the NIST official rules. In the event of any discrepancy or inconsistency between the terms and conditions of the contest rules and the terms and conditions of the NIST Official Rules on Challenge.gov, as specified within the Challenge Specific Agreement, the latter shall control. Note the following points: teaming is allowed, and participation of organizations is allowed. Participants who are not eligible for prizes for legal reasons may still participate if they waive the prizes.

Eligibility

To be eligible for a prize:
Teaming

Teaming is allowed in this competition, along with the participation of organizations. However, there are some conditions:
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.