| Prize Distribution
Prize US$
1st 3,500
2nd 1,750
3rd 1,000
4th 750
Background
Genpact support representatives receive a lot of emails about various different matters. The current method by which the firm processes them and determines their content appears below.
- A workflow system receives emails at a generic support address (e.g., support@genpact.com).
- The email is captured as a support ticket.
- Support Reps read and classify the email into categories like "Invoice Processing", "Payment Status", "Others", etc.
- Some of the data from the email is retyped or manually copied and pasted into fields on the case. For example, an email classified as "Invoice Processing" might include data such as invoice number and amount of payment.
This highly manual process isn���t scalable, and Genpact wants to automate it.
Objective
In the customary training and test framework, create an algorithm that will classify emails received by Genpact into categories based on their contents and extract important data from them.
Data Description
In the data available for this contest, each record contains 4 or more columns belonging to a single message:
+------+-----------+--------------------+-------------------------------------------------------+
|Column| Variable | Type | Description |
+------+-----------+--------------------+-------------------------------------------------------+
| 1 |ID |Integer, Fixed |Email unique identifier |
+------+-----------+--------------------+-------------------------------------------------------+
| 2 |Subject |Character, Fixed |Email subject |
+------+-----------+--------------------+-------------------------------------------------------+
| 3 |Body |Character, Fixed |Email HTML body |
+------+-----------+--------------------+-------------------------------------------------------+
| | |Character, Fixed, |Contains the category identifier that the email was |
| 4 |Category |present only on |was classified into. |
| | |training data | |
+------+-----------+--------------------+-------------------------------------------------------+
| | |Character, Variable,|Contains the name of a field extracted from the |
|5,7...|Field Name |present only on |message. May appear multiple times, or not be present |
| | |training data |at all, always together with a following Field Value. |
+------+-----------+--------------------+-------------------------------------------------------+
| | |Character, Variable,|Contains the value of the field, extracted from the |
|6,8...|Field Value|present only on |message. When it is present, it contains a single |
| | |training data |non-empty substring of the subject or the message body.|
+------+-----------+--------------------+-------------------------------------------------------+
The data used in this match contains messages classified into 10 different categories.
For the field name, "Invoice Number", "Invoice Amount", "Account Number", "PO Number", "Voucher Number" and "Vendor ID" are present in the data used in this match.
Note that for a single message, the same field name may appear multiple times, associated with different values.
Each pair (field name / file value) appears only once, even if its content is present more than once in the message.
Records are formatted as
ID,SUBJECT,BODY,CATEGORY[,FIELD NAME 1, FIELD VALUE 1]...[,FIELD NAME N, FIELD VALUE N]
Example Training data:
1,"Sub1","Body1","Invoice Processing","Invoice Number","123"
2,"Sub2","Body2","Payment Status","Invoice Number","4567","Invoice Number","89/01"
3,"Sub3","Body3","Invoice for Scan"
4,"Sub4","Body4","Payment Status","Invoice Amount","1,972.09","PO Number","00-2012"
5,"Sub5","Body5","Others"
It is important to note that although most subjects and bodies are in English, there are messages that use different languages, and solutions must deal with multiple languages.
Emails and websites addresses contained in the messages were obfuscated. Also reference numbers present in many subjects and bodies were replaced by "[ref]".
Data Sets
- The full data set used in this contest contains a total of about 12,000 messages.
- The full data set was divided randomly into 40% for example tests, 20% for provisional tests, and 40% for system tests.
- For each test, approximately 66% of the data (from that segment) is randomly selected for training, and the remainder for testing.
- For provisional tests, all example data is also added to the training set.
- For system tests, all example and approximately 50% of the provisional data, randomly selected, is also added to the training set.
- There are 1 example case, 2 provisional test cases and at least 10 system test cases.
The example data set, that can be used for local tests, is available here.
Functions
You must implement a class EmailClassification with two functions:
- train, which receives the testType (0, 1, or 2, to indicate Example, Provisional, or System test, respectively), and trainingData.
This function will be called only once per test case, and any integer value can be returned.
In the string[] trainingData, each string contains a single record, and has 4 or more tokens, comma-separated, in the same order as described in Data Description section above.
Note that character fields could contain commas and, to avoid incorrect interpretation, the content of these columns starts and ends with double quotes.
- classify, which receives a string msgData with data from a single message and timeLeft in milliseconds (just as convenience to help in time control).
The format of msgData is the same as a single item of trainingData, except that it contains only the first 3 columns (ID, subject, body).
This method will be called multiple times, one for each testing message.
The returned string[] should contain 6 or more elements.
The first 6 should be 3 pairs of categories and the confidence level of each option.
These categories should be your top 3 choices, i.e. the 3 categories you are most confident in.
Then for each extracted data field found by your solution, it should return a pair of strings with the field name and the field value.
So the expected returned should looks like this:
"CAT1","CONF1","CAT2","CONF2","CAT3","CONF3"[,"FIELD NAME 1","FIELD VALUE 1"]...[,"FIELD NAME N","FIELD VALUE N"]
Confidence should be numeric values in the range between 0 and 1 inclusive.
They don���t need to sum up to one (see scoring section for further details).
Also note that no particular order among category/confidence pairs, neither among fields name/value pairs are required, as long as the first strings are the 3 category/confidence pairs.
If your solution is absolutely sure about the correct category, or for any reason doesn���t want to use all three options, you can simply return an empty string for the category name and "0" as its confidence.
Scoring
The score is composed by two parts:
- Classification Score: For each testing message, the squared Euclidean distance (E) between your returned confidence for each category and the ground truth values will be calculated. Let SSE be the sum of these squared distances (E). A baseline sum of squared distances (SSE_Base) will be calculated by predicting the frequency observed in the complete data set of the three most frequent categories, i.e. the baseline will be calculated using the following categories/confidences:
Baseline Categories = {"Others","0.408","Invoice Processing","0.202","Invoice for Scan","0.081"}
Your classification score will then be calculated as a generalized R2 measure of fit:
ClassificationScore = 1 - SSE / SSE_Base
For example, if we have 4 categories (A to D), your return is {"A", "0.9", "B", "0.2", "C", "0.1"} and the correct category is "A", for this message E (Squared Euclidean Distance) would be:
E = squared distance between (0.9, 0.2, 0.1, 0) and (1, 0, 0, 0) =>
E = (0.9 - 1) ^ 2 + (0.2 - 0) ^ 2 + (0.1 - 0) ^ 2 + (0 - 0) ^ 2 =>
E = 0.01 + 0.04 + 0.01 + 0 = 0.06
In the same scenario, if the correct category was D, then we would have:
E = squared distance between (0.9, 0.2, 0.1, 0) and (0, 0, 0, 1) =>
E = (0.9 - 0) ^ 2 + (0.2 - 0) ^ 2 + (0.1 - 0) ^ 2 + (0 - 1) ^ 2 =>
E = 0.81 + 0.04 + 0.01 + 1 = 1.86
- Data Extraction Score: for the data extraction part, each field name/value pair will be compared to ground truth pairs, for all testing messages. CorrectPairs (present both on returned pairs and ground truth) will add, while IncorrectReturnedPairs (returned by your solution and not present on ground truth) will reduce your score. TotalPairs is the total number of pairs present on ground truth of all messages used as test.
DataExtractionScore = (CorrectPairs - IncorrectReturnedPairs) / TotalPairs
The final score for each test case is a weighted sum between these score as follows:
Score = MAX(0, 800000 * ClassificationScore + 200000 * DataExtractionScore)
If your solution fails to produce a proper return value, your score for this test case will be 0.
The overall score on a set of test cases is the arithmetic average of scores on single test cases from the set. The match standings displays overall scores on provisional tests for all competitors who have made at least 1 full test submission. The winners are competitors with the highest overall scores on system tests.
Limits
- Because different test types deal with different volumes of data, the time limits will also differ. Example tests are limited to 900s (15 minutes), provisional tests to 1200s (20 minutes) and system tests to 1800s (30 minutes). The testType parameter will be 0, 1, or 2, to indicate Example, Provisional, or System test, respectively, so that your code can take timing into account.
- The memory limit is 2048 megabytes.
- There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). The compilation time limit is 60 seconds.
- You will be allowed to make example submissions each 30 minutes and provisional submissions each 3 hours.
General Notes
- This match is rated.
- The allowed programming languages are Java, Python, C++, C# and VB.
- You can include open source code in your submission, provided it is free for you to use and would be for the client as well. Code from open source libraries under the BSD or MIT licenses will be accepted. Other open source licenses may be accepted too, just be sure to ask us.
- The usage of external data and pre-trained models are allowed, as long they meet the license requirements.
- The test servers have only the default installs of all languages, so no additional libraries will be available.
- Keep in mind that, as we are dealing with real world data in this contest, it is possible to have some noise, like misclassified messages or missed extracted data. While it is obviously not ideal, you should found this in few messages, and the data sets used for this contest will remain the same. Solutions must handle such situations in the best possible way.
- Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself, possible solution techniques or related to data analysis.
- Topcoder ran an Ideation Contest about this problem. The winning submissions are available here as references. Feel free to use any ideas from these submissions that you find useful.
Requirements to Win a Prize
In order to receive a prize, you must do all of the following:
- Achieve a score in the top 4, according to system test results. See the "Scoring" section above.
- Within 7 days from the announcement of the challenge winners, submit a complete report at least 2 pages long containing at least the following sections:
- Your Information
- First Name
- Last Name
- Topcoder handle
- Email address
- Approach Used
Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.
This section must contain at least the following:
- Approaches considered
- Approach ultimately chosen
- Steps to approach ultimately chosen, including references to specific lines of your code
- Advantages and disadvantages of the approach chosen
- Comments on open source resources used
- Special guidance given to algorithm based on training
- Potential improvements to your algorithm
- Local Programs Used
- If any parameters were obtained from the training data set, you will also need to provide the program(s) used to generate these parameters.
- There is no restriction on the programming language/libraries used to generate these training parameters, provided they are a free, open-source solution for you to use and would be for the client as well.
If you place in the top 4 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.
|