Challenge Overview
Issue
SNI stands for Server Name Indication. For an HTTPS connection between a mobile app client and a server, it lets the same IP address and port serve multiple web services, each distinguished by a unique server name that the client sends during the TLS handshake so the server can present the matching digital certificate. SNIs are one of the biggest contributors to EDR aggregations, which occupy approximately 10% of the entire footprint of the client’s Hadoop cluster (32 PB). A consolidation scheme should therefore yield a significant reduction in cost and storage, along with related efficiencies.
As you can imagine, on a large network the amount of captured traffic data per SNI can be very large. In fact, at one client, SNIs account for roughly 10% of the entire footprint of their Hadoop cluster (which is ~32 PB).
Our client has close to a million SNIs that need to be consolidated into a few broad domains. The same web service or host often produces several similar SNIs from the same application (think Google or YouTube), and these can be grouped under a single domain.
Task
We need help from the TopCoder Community to develop an algorithm that identifies the commercial name behind each included URI (e.g. YouTube) and clusters the entries under that name to shorten the list. Attached is an example data set you can use to train your algorithm.
Warning: Please be advised that some of the data we are working with contains adult content.
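To make the task concrete, here is a minimal Python sketch of one naive way to guess a commercial name from an SNI: take the label just before the top-level suffix and group SNIs that share it. This is only an illustration, not the expected solution. The function names `brand_label` and `group_by_brand` and the small `GENERIC_SECOND_LEVEL` set are hypothetical; a real solution would more likely consult the Public Suffix List and handle brands that span several domains.

```python
from collections import defaultdict

# Assumption: a tiny hand-picked set of generic second-level labels
# (as in "co.uk" or "com.au"). This is NOT a complete list.
GENERIC_SECOND_LEVEL = {"co", "com", "org", "net", "ac", "gov", "edu"}


def brand_label(sni: str) -> str:
    """Return a naive guess at the brand label of a host name."""
    labels = sni.lower().strip(".").split(".")
    if len(labels) < 2:
        # Single-label names (e.g. "localhost") are returned unchanged.
        return labels[0]
    # Handle two-part suffixes such as "co.uk": two-letter country code
    # preceded by a generic second-level label.
    if (
        len(labels) >= 3
        and len(labels[-1]) == 2
        and labels[-2] in GENERIC_SECOND_LEVEL
    ):
        return labels[-3]
    return labels[-2]


def group_by_brand(snis):
    """Group SNIs by their naive brand label, preserving input order."""
    groups = defaultdict(list)
    for sni in snis:
        groups[brand_label(sni)].append(sni)
    return dict(groups)
```

Under this heuristic, `www.youtube.com` and `m.youtube.com` both map to `youtube`. It would not, on its own, merge names such as `youtube` and `googlevideo` into one commercial service; that needs additional knowledge or a similarity step.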
Possible Approach
Apply text similarity to the SNIs to identify similar ones.
Rely on the Levenshtein distance between SNIs, treated as strings.
Extend to new strings with machine learning / neural networks.
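The first two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference solution: `levenshtein` is a standard dynamic-programming edit distance, while `similarity`, `cluster_snis`, the greedy "compare against the first member of each cluster" strategy, and the default threshold of 0.75 are all hypothetical choices made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cluster_snis(snis, threshold=0.75):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster.
    The threshold is an arbitrary illustrative value."""
    clusters = []
    for sni in snis:
        for cluster in clusters:
            if similarity(sni, cluster[0]) >= threshold:
                cluster.append(sni)
                break
        else:
            clusters.append([sni])
    return clusters
```

This greedy pass compares each SNI against every existing cluster, so on close to a million SNIs it would need a pre-grouping step (for example by domain suffix) or an approximate nearest-neighbour index to stay tractable. The result also depends on input order and on the chosen threshold.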
Final Submission Guidelines
Submit the zipped code to the challenge by the deadline.