The challenge is finished.

Challenge Overview

Issue

SNI stands for Server Name Indication, a TLS extension. On an HTTPS connection between a mobile app client and a server, the client sends the requested server name during the TLS handshake. This lets the same IP address and port host multiple web services, each identified by a unique server name and served with its own digital certificate.

SNIs are one of the biggest contributors to EDR aggregations, which occupy approximately 10% of the entire footprint of the client’s Hadoop cluster (32 PB). A consolidation scheme would therefore significantly reduce storage and cost and bring related efficiency gains.

As you can imagine, on a large network the amount of captured traffic data per SNI can be very large. In fact, at one client, SNIs represent some 10% of the entire footprint of their Hadoop cluster (which is ~32 PB).

Our client has close to a million SNIs that need to be consolidated into a few broad domains. A single web service or host (think Google or YouTube) typically produces many similar SNIs from the same application, and these can be grouped together.


Task

We need help from the TopCoder Community to develop an algorithm that identifies the commercial name for each included URI (e.g. YouTube) and clusters the entries under that name to shorten the list.
Attached is an example data set you can use to train your algorithm.
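
As a starting point, the "commercial name" step could be sketched as a naive hostname heuristic. This is only an illustration, not the challenge's required method: the helper names `guess_brand` and `group_by_brand` and the small `MULTI_LABEL_SUFFIXES` set are assumptions made for this example, and a real solution would likely need a full public suffix list plus a mapping from CDN domains to their parent brand.

```python
from collections import defaultdict

# Assumption: a tiny hand-written sample of two-label suffixes.
# A real solution would consult a complete public suffix list.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def guess_brand(sni: str) -> str:
    """Guess a brand label from an SNI hostname (hypothetical helper)."""
    labels = sni.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return labels[0]
    # Skip over suffixes such as "co.uk" so "news.bbc.co.uk" yields "bbc".
    if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES and len(labels) >= 3:
        return labels[-3]
    # Otherwise take the label just before the top-level domain.
    return labels[-2]


def group_by_brand(snis):
    """Group SNI hostnames under their guessed brand label."""
    groups = defaultdict(list)
    for sni in snis:
        groups[guess_brand(sni)].append(sni)
    return dict(groups)


if __name__ == "__main__":
    sample = ["www.youtube.com", "m.youtube.com", "news.bbc.co.uk"]
    print(group_by_brand(sample))
```

Note that this heuristic keeps related domains apart (for example, a video CDN hostname would not be grouped with the main site), which is one reason the similarity-based approaches listed under Possible Approach may still be needed.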

Warning: Please be advised that some of the data we are working with contains adult content.


Possible Approach

Apply text similarity measures to identify similar SNIs.
Rely on the Levenshtein distance between SNIs treated as strings.
Extend to new, unseen strings with machine learning / neural networks.
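
The first two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function names, the length-normalized distance, the greedy single-pass clustering, and the `threshold=0.3` default are all choices made for this example. The greedy pass compares each SNI against every existing cluster representative, so it would need blocking or indexing to scale to around a million SNIs.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, in the range [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cluster(snis, threshold=0.3):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is within the threshold, else start a new one."""
    clusters = []
    for sni in snis:
        for members in clusters:
            if normalized_distance(sni, members[0]) <= threshold:
                members.append(sni)
                break
        else:
            clusters.append([sni])
    return clusters


if __name__ == "__main__":
    hosts = ["i1.ytimg.com", "i2.ytimg.com", "i3.ytimg.com", "maps.google.com"]
    for members in cluster(hosts):
        print(members)
```

With the sample hosts, the three `ytimg.com` names differ by a single character and fall into one cluster, while `maps.google.com` starts its own. The threshold is a tuning parameter and would need to be calibrated against the provided data set.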



Final Submission Guidelines

Submit the zipped code to the challenge by the deadline.

ELIGIBLE EVENTS:

2018 Topcoder(R) Open

Review style

Final Review

Community Review Board

Approval

User Sign-Off

ID: 30066633