Challenge Overview
Challenge Detail
In this challenge you have to implement differential privacy masking on the given dataset so that no correlation to real-world objects, people, or entities is possible, making use of any available open source libraries.
Project Background
The client is exploring the possibility of using data science challenges for various business use cases. As part of the data preparation for data science work, we need to protect privileged information and prevent linkage attacks before opening the data to the community. Multiple levels of masking might be required for this. We need a data masking solution that is highly scalable and easy to use for the dataset.
Technology Stack
- Java 8
- Python
A few recommended open source libraries that can be used to meet the masking requirements are listed below. You are free to research and use other open source libraries after getting approval.
- https://github.com/uber/sql-differential-privacy
- https://github.com/arx-deidentifier/arx
Individual Requirements
Challenge Input
The sample dataset CSV file, containing the columns that need to be masked, will be shared in the challenge forums.
Scope
- You have to implement a masking program in Java or Python that can reasonably prevent linkage attacks, using the recommended libraries or any other good anonymization software.
- Your script should be highly optimized and able to run on millions of rows without crashing.
- The model used for masking should be properly documented for review purposes. Include a Google Doc or PDF that describes your approach and specifically explains your choice of epsilon.
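To make the role of epsilon concrete, here is a minimal Python sketch of the Laplace mechanism applied to a counting query. It is an illustration only, not the required approach: the function names `laplace_noise` and `noisy_count` are hypothetical, the sensitivity of 1 assumes that adding or removing one row changes the count by at most one, and a production solution would more likely rely on a vetted library than on hand-rolled floating-point sampling.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(values, predicate, epsilon, rng):
    """Return a count of matching values with Laplace noise added.

    Assumption: a counting query has sensitivity 1, so the noise scale
    is 1 / epsilon. Smaller epsilon means more noise and more privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()  # seed only for reproducible experiments
    ages = [23, 35, 41, 29, 52, 67, 38]
    for eps in (0.1, 1.0, 10.0):
        print(eps, noisy_count(ages, lambda a: a >= 40, eps, rng))
```

Running the sketch with several epsilon values shows the trade-off that the approach document is asked to explain: at epsilon 0.1 the reported count can be far from the true value, while at epsilon 10 it is usually within a fraction of a unit.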
The winner is expected to provide support for any issues encountered when the code is run on the larger dataset. Additional prize money will be provided for this, based on the amount of work.
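Since the code will eventually be run on a much larger dataset, one way to keep memory use flat is to stream the CSV row by row instead of loading it whole. The sketch below assumes this pattern and uses only the Python standard library. The names `mask_value` and `mask_csv`, the column names, and the key are hypothetical. Note that the keyed hash shown here is simple pseudonymization used as a placeholder for the real masking step; on its own it is not differential privacy and does not fully prevent linkage attacks.

```python
import csv
import hashlib
import hmac
import io


def mask_value(value, key):
    """Replace a value with a truncated keyed SHA-256 digest.

    Placeholder masking step: pseudonymization, not differential privacy.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_csv(src, dst, columns, key):
    """Stream rows from src to dst, masking the named columns.

    Only one row is held in memory at a time. Returns the number of
    data rows written.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        for col in columns:
            row[col] = mask_value(row[col], key)
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # In-memory demo; hypothetical column names.
    src = io.StringIO("id,name,age\n1,Alice,30\n2,Bob,41\n")
    dst = io.StringIO()
    print(mask_csv(src, dst, ["name"], b"demo-secret-key"), "rows masked")
    print(dst.getvalue())
```

For real files, open both the input and the output with `newline=""`, as the `csv` module documentation recommends, and pass the file objects to `mask_csv` in place of the `io.StringIO` buffers.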
Deployment Guide and Validation Document
Make sure to provide two separate documents for validation.
A README.md that covers:
- Deployment - covers how to build and test your submission.
- Configuration - documents the configuration options used by the submission.
- Dependency Installation - gives a clear, up-to-date, step-by-step guide for installing dependencies.
Validation of each requirement can be described in this document, which will make it easier for reviewers to map the requirements to your submission.
Final Submission Guidelines
- Documentation
- Project Source Code