Challenge Overview
Challenge Objective
In this challenge, you will apply differential privacy (DP) techniques to the given dataset so that no record can be correlated with a real-world object, person, or entity. You may use any available open source libraries.
Project Background
The client is exploring the use of data science challenges for various business use cases. As part of data preparation for that work, we need to protect privileged information and prevent linkage attacks before opening the data to the community. Multiple levels of masking might be required. We need a data masking solution that is highly scalable and easy to apply to the dataset.
Development Assets
- The sample xls file containing the required columns that need to be masked will be shared in the challenge forums.
Technology Stack
- Python 2.7
- RAPPOR (https://github.com/google/rappor)
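RAPPOR builds on randomized response, so a minimal sketch of that underlying idea may help when evaluating it. The code below is not the RAPPOR API; the function names `randomized_response` and `estimate_rate`, the parameter `p_truth`, and the sample data are all hypothetical, and only the Python standard library is used.

```python
import random

def randomized_response(bit, p_truth, rng):
    """Report the true bit with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return bit
    return 1 if rng.random() < 0.5 else 0

def estimate_rate(reports, p_truth):
    """Debias the observed mean.

    E[report] = p_truth * rate + (1 - p_truth) * 0.5, solved for rate.
    """
    observed = sum(reports) / float(len(reports))
    return (observed - (1.0 - p_truth) * 0.5) / p_truth

# Hypothetical data: 30% of 10,000 respondents hold the sensitive attribute.
rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000
reports = [randomized_response(b, 0.5, rng) for b in true_bits]
estimate = estimate_rate(reports, 0.5)  # close to 0.3, never exact
```

No single report reveals the respondent's true bit with certainty, yet the aggregate rate can still be recovered approximately. Lowering `p_truth` strengthens privacy and widens the error of the estimate.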
Individual Requirements
Scope
- Implement a masking program in Python that can reasonably prevent linkage attacks, using the recommended libraries or any good anonymization software. Masking should create a statistical twin of the data rather than random noise.
- The model used for masking should be documented properly for review purposes. Include a doc or PDF that describes your approach.
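One possible reading of "statistical twin rather than random noise" for a single numeric column is sketched below: build a histogram, add Laplace noise to the bin counts, then sample new values from the noisy histogram. This is an illustrative assumption, not the required approach. The names `laplace_noise` and `synthetic_column`, the bounds, the bin count, and `epsilon` are hypothetical, and the sketch assumes the row count and the column bounds are not themselves sensitive.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def synthetic_column(values, low, high, bins, epsilon, rng):
    """Sample a synthetic column from a Laplace-noised histogram of `values`."""
    width = (high - low) / float(bins)
    counts = [0] * bins
    for v in values:
        clipped = min(max(v, low), high)
        idx = min(int((clipped - low) / width), bins - 1)
        counts[idx] += 1
    # Adding or removing one row changes one count by 1, so scale = 1 / epsilon.
    noisy = [max(c + laplace_noise(1.0 / epsilon, rng), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0:
        # Fall back to a uniform histogram if every noisy count was clamped.
        noisy = [1.0] * bins
        total = float(bins)
    out = []
    for _ in range(len(values)):
        # Pick a bin with probability proportional to its noisy count.
        r = rng.random() * total
        acc = 0.0
        chosen = bins - 1
        for i, weight in enumerate(noisy):
            acc += weight
            if r < acc:
                chosen = i
                break
        # Draw uniformly inside the chosen bin.
        out.append(low + chosen * width + rng.random() * width)
    return out

# Hypothetical column: 2,000 values centred on 50.
rng = random.Random(42)
real = [rng.gauss(50.0, 10.0) for _ in range(2000)]
twin = synthetic_column(real, 0.0, 100.0, 20, 1.0, rng)
```

The synthetic values keep the overall shape of the column without copying any original value. A per-column histogram does not preserve correlations between columns, so a real solution would need to address those separately.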
Make sure to provide two separate documents for validation.
A README.md that covers:
- Deployment - how to build and test your submission.
- Configuration - the configuration used by the submission.
- Dependency Installation - a clear, up-to-date, step-by-step guide for installing dependencies.
You can also describe how each requirement is validated in this document, which makes it easier for reviewers to map the requirements to your submission.
Important Notes
- The dataset provided has only a limited set of rows; however, the review will be done against a bigger dataset with more than 1000 rows. Also, the year columns in the dataset will hold data for up to 10 years (up to a YEAR10 column), so make sure that your code handles all columns.
- If you feel that DP is not the right way to do this privatization after checking the data, you can suggest and implement what would be the right way to create a statistical twin.
- The review for this will be subjective based on the details provided below.
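Regarding the note on year columns above, one way to avoid hard-coding column names is to detect them by pattern. The sketch below assumes the columns are named `YEAR1` through `YEAR10`; only `YEAR10` is confirmed by this description, and the helper name `year_columns` and the other header names are hypothetical.

```python
import re

# Assumed naming scheme: YEAR followed by a number, for example YEAR1 ... YEAR10.
YEAR_COLUMN = re.compile(r"^YEAR(\d+)$")

def year_columns(header):
    """Return the YEARn column names found in `header`, sorted by n."""
    found = []
    for name in header:
        match = YEAR_COLUMN.match(name.strip().upper())
        if match:
            found.append((int(match.group(1)), name))
    # Sort numerically so that YEAR10 comes after YEAR2, not after YEAR1.
    return [name for _, name in sorted(found)]

# Hypothetical header row from the input file.
header = ["ID", "YEAR1", "YEAR10", "YEAR2", "REGION"]
cols = year_columns(header)
```

Detecting the columns this way lets the same code run on the small sample and on the larger review dataset, whichever year columns each contains.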
This submission will be subjectively reviewed; however, the following criteria will be taken into account to pick the best submission. Your submission will be reviewed on these requirements:
- Challenge Spec Requirements (40%)
  - Requirements Coverage
- Coding Standards (10%)
  - Best Practices
  - Code Quality
- Development Requirements (40%)
  - Testing against bigger dataset
  - Performance
  - Deployment
- Documentation (10%)
Final Submission Guidelines
- All original source code.
- Documentation