Wednesday, October 6, 2004 Match summary It is safe to say that few TopCoder members suffer from fear of math, but the results of the latest Single Round Match suggest that fear of text is epidemic. Both divisions were caught off guard by a set of problems in which parsing did not act as an obstacle to performing a task, as it usually does, but constituted the task itself. Jan_Kuipers, gawry, and lars topped the leader board in Division One. Special mention goes to skanthak and bmerry, who made a game effort to solve the hard problem and fell just short on a day when no one in either division got the hard. Division Two saw a victory by ggoprea, followed by the harmoniously named runners-up wizy and quazee.
The ProblemsbloggoShortcutsUsed as: Division Two - Level One:
We are asked to replace the special characters '_' and '*' in a piece of text with the HTML tags for italics and boldface, respectively. The principal difficulty is in deciding whether to use the opening tag, "<i>" or "<b>", or rather the closing tag, "</i>" or "</b>", whenever we come across a special character. In my C implementation, I have a variable called void expand(char *s) { int i, open = 0; for (i = 0; s[i] != '\0'; i++) if (s[i] == '_' || s[i] == '*') { if (open) printf("</%c>", s[i] == '_' ? 'i' : 'b'); else printf("<%c>", s[i] == '_' ? 'i' : 'b'); open = (open+1)%2; } else printf("%c", s[i]); printf("\n"); }
The Used as: Division Two - Level Two: Used as: Division One - Level One:
Given a dictionary of syllabized words such as "squir rel" and "el e phant", we are to verify and if necessary correct a hyphenated word such as "eleph-ant". Among the five problems of the match, I would say that this one requires the most effort for the least reward. Rather than algorithmic prowess, it is swift fingers and attention to detail that lead to success here.
In my C implementation, I first make a copy void hyphenate(char **dictionary, int dnum, char **candidates, int cnum) { char cand[100], good[100], *dict; int ci, di, cx, hx, hh, bx; for (ci = 0; ci < cnum; ci++) { for (cx = 0, hx = -1; candidates[ci][cx] != '\0'; cx++) if (candidates[ci][cx] == '-') hx = cx; else cand[cx - (hx == -1 ? 0 : 1)] = candidates[ci][cx]; cand[cx - (hx == -1 ? 0 : 1)] = '\0';
For each item for (di = 0; di < dnum; di++) { dict = dictionary[di]; if (!strcmp(cand, dict)) { printf("%s\n", cand); break; }
Otherwise, I step through bx = -100; for (cx = hh = 0; cand[cx] != '\0' && dict[cx+hh] != '\0'; cx++) { if (dict[cx+hh] == ' ') { if (abs(cx-hx) < abs(bx-hx)) bx = cx; hh++; } if (cand[cx] != dict[cx+hh]) break; } if (bx == -100 || cand[cx] != '\0' || dict[cx+hh] != '\0') continue;
If all non-space, non-hyphen characters match, I have the right dictionary item. Now, unless the best if (bx == hx) { printf("correct\n"); break; } strncpy(good, cand, bx); good[bx] = '-'; strcpy(&good[bx+1], &cand[bx]); printf("%s\n", good); break; } } } It is not difficult to verify hyphens, but mathematically inclined coders must recoil from this problem because it cannot be generalized or solved systematically. I think it resembles the tasks that professional software developers undertake regularly rather than the puzzles that attract academic researchers. bloggoSequenceSearchUsed as: Division Two - Level Three:
Given a query consisting of words separated by ellipses, such as "fourscore...years...ago", we are to find text passages that contain every query word order. If we wanted to perform this task at high speed, we would build a data structure that maps words to their positions in a documentin other words, an indexand look for sequences of word-position numbers. But we can get by with a solution that searches directly on the text, since each document is only 50 characters long. At worst, we would begin searching at each of the 50 characters and continue to the end of the document, for a total of about 50*50/2 = 1250 character comparisons. My solution starts matching only at the beginning of each word in the document, so I need no more than half that many comparisons.
The most important part of my C implementation is a function that takes a query The branching occurs when I come across a period character in the pattern, denoting the start of an ellipsis. I can now skip the next word in the string and make a recursive call to the matching function. void match(char *pat, int pi, char *str, int si, int *ends) { int skip; if (pat[pi] == '.') { while (!isalpha(str[++si]) && str[si] != '\0'); if (str[si] == '\0') return; skip = si; while (isalpha(str[++skip])); match(pat, pi, str, skip, ends); pi += 3; } On the other hand, I also try to match the current word in the string. The matching attempt has three possible outcomes: it fails in the event of different alphabetic characters in the string and pattern; succeeds if I reach the end of each; or continues with a recursive call if I reach a period in the pattern and a non-alphabetic character in the string. while (tolower(pat[pi]) == tolower(str[si]) && isalpha(str[si])) { pi++; si++; } if (pat[pi] == '\0' && !isalpha(str[si])) { ends[si] = 1; } else if (pat[pi] == '.' && !isalpha(str[si]) && str[si] != '\0') { match(pat, pi, str, si, ends); } }
After int cmp(const void *a, const void *b) { int *sa = *((int **) a), *sb = *((int **) b); int da = sa[1] - sa[0], db = sb[1] - sb[0]; if (da != db) return da - db; if (sa[2] != sb[2]) return sa[2] - sb[2]; return sa[0] - sb[0]; } A quick calculation on the back of an envelope tells me that I won't find more than 20000 passages among all documents. There are roughly 25 possible points at which a passage may begin or end in any document, making a total of about 50*25*25/2=15625 passages over all documents. void sequence_search(char **docs, int doc_num, char *query) { int *ends = calloc(100, sizeof(int)), i, ix, beg, p, q; int dlen, flag = 0, **passages = calloc(20000, sizeof(int *)), pnum = 0; char *doc, buf[100]; for (i = 0; i < 20000; i++) passages[i] = calloc(3, sizeof(int));
I traverse each document, updating the for (i = 0; i < doc_num; i++) { dlen = strlen(doc = docs[i]); for (beg = 0; doc[beg] != '\0'; beg++) if (isalpha(doc[beg]) && !flag) { memset(&ends[beg+1], 0, (dlen-beg)*sizeof(int)); match(query, 0, doc, beg, ends); for (ix = beg+1; ix <= dlen; ix++) if (ends[ix]) { passages[pnum][0] = beg; passages[pnum][1] = ix; passages[pnum++][2] = i; } flag = 1; } else if (!isalpha(doc[beg])) flag = 0; } Finally, I can sort the passages using my comparison function. qsort(passages, pnum, sizeof(int *), cmp); for (i = 0; i < (pnum < 5 ? pnum : 5); i++) { p = passages[i][0]; q = passages[i][1]; strncpy(buf, &docs[passages[i][2]][p], q-p); buf[q-p] = '\0'; printf("%d %d [%s]\n", passages[i][2], p, buf); } } If you are interested in the notion of building an index for each document, see my solution to the proximity-search problem below. bloggoDocStructureUsed as: Division One - Level Two:
Given a pair of syntactically correct HTML documents, we are to compare their tag trees and determine whether one is an intree of the other, meaning that the latter contains every path of the former descending in order from the root. HTML documents can be parsed painlessly if we think of them as parenthesized expressions. Each opening tag is comparable to a left parenthesis, and each closing tag is like a right parenthesis. When building a tag tree, we use the name of each tag as a node label, discarding all other text. Because I know from the input constraints that a node will not have more than 1000 children, my C implementation has a fixed-size child array in each node. struct Tree { char label[100]; struct Tree *children[1000]; int child_num; };
Some people like to push left parentheses or opening tags onto a stack, then pop them at a right parenthesis or closing tag. My preferred approach is to write a function that makes a recursive call at an opening tag and returns at a closing tag. The following function takes a string position int parse(struct Tree *tree, char *str, int ix) { int jx; struct Tree *child; while (1) { while (str[ix++] != '<'); if (str[ix] == '/') { while (str[ix++] != '>'); return ix; } child = calloc(1, sizeof(struct Tree)); jx = 0; while (str[ix++] != '>') child->label[jx++] = str[ix-1]; child->label[jx] = '\0'; ix = parse(tree->children[tree->child_num++] = child, str, ix); } }
To find out whether one tree is an intree of another, I use a recursive function to compare the children of each. In the function below, if (intree(atree->children[aix], 0, btree->children[bix], 0) && intree(atree, aix+1, btree, bix+1)) return 1;until lbackstrom pointed out to me that such a function would be too slow because it did a great deal of unnecessary work. If the current child of atree is an intree of the current child of btree but the later children fail, there is no need to go on and find a further match for the current child of atree .
Observe that if the later children fail starting from this point, they will fail again starting from a later point. We may as well pair the children greedily, using the earliest intree matches we come across.
int intree(struct Tree *atree, int aix, struct Tree *btree, int bix) { if (aix == 0 && bix == 0 && strcmp(atree->label, btree->label)) return 0; if (aix == atree->child_num) return 1; if (bix == btree->child_num) return 0; if (intree(atree->children[aix], 0, btree->children[bix], 0)) return intree(atree, aix+1, btree, bix+1); return intree(atree, aix, btree, bix+1); }
A fancy int count(struct Tree *tree) { int ct = 1, i; for (i = 0; i < tree->child_num; i++) ct += count(tree->children[i]); return ct; } To start the whole process off, I make a preliminary "html" node and advance a pointer in each document to the end of the first opening tag, which is guaranteed to be of "html" type. void compare(char *docA, char *docB) { struct Tree *treeA = calloc(1, sizeof(struct Tree)); struct Tree *treeB = calloc(1, sizeof(struct Tree)); int ix = 0, AinB, BinA; sprintf(treeA->label, "html"); while (docA[ix++] != '>'); parse(treeA, docA, ix); sprintf(treeB->label, "html"); while (docB[ix++] != '>'); parse(treeB, docB, ix); AinB = intree(treeA, 0, treeB, 0); BinA = intree(treeB, 0, treeA, 0); if (AinB && BinA) printf("equivalent\n"); else if (!AinB && BinA) printf("incompatible\n"); else if (AinB) printf("intree %d\n", count(treeB)-count(treeA)); else printf("outtree %d\n", count(treeA)-count(treeB)); } If the trees are neither equivalent nor incompatible, I subtract their node counts in the appropriate order to find out how many nodes they have in common. bloggoProximitySearchUsed as: Division One - Level Three:
Given a query expressing proximity relationships between pairs of passages, we are to find the shortest and earliest matches in a collection of documents. Although we have only 50 documents, each of which accommodates fewer than 400 passages, the structure of the search query may entail several pairwise comparisons between sets of matching passages in each document. A textual index is in order if we want to perform the comparisons at high speed. Most languages offer a hashing facility that works as a ready-made data structure for mapping words to word positions, but I wrote my solution in C without resorting to any high-level functions. I implemented my index using a trie, which is just another name for a maximal DFA. Words are easily inserted into or retrieved from a trie by recursive descent. Each node stores the starting indices of the word that terminates at that node. struct Index { struct Index *next[128]; int count, pos[100]; }; void put(struct Index *index, char *s, int pos, int ix, int end) { int ci = (int) tolower(s[ix]); if (ix == end) { index->pos[index->count++] = pos; return; } if (index->next[ci] == NULL) index->next[ci] = calloc(1, sizeof(struct Index)); put(index->next[ci], s, pos, ix+1, end); } struct Index *get(struct Index *index, char *s, int ix, int end) { int ci = (int) tolower(s[ix]); if (ix < end && index->next[ci] == NULL) return NULL; return ((ix == end) ? index : get(index->next[ci], s, ix+1, end)); }
To build an index for a certain document, I scan its text and put each maximal sequence of alphabetic characters into the trie along with its word number. At the same time, I store the character indices of each word in the array struct Index *make_index(char *text, int **spans) { int ix = 0, begin = -1, pos = 0; struct Index *index = calloc(1, sizeof(struct Index)); do { if (isalpha(text[ix]) && begin == -1) begin = ix; else if (!isalpha(text[ix]) && begin != -1) { spans[pos][0] = begin; spans[pos][1] = ix; put(index, text, pos++, begin, ix); begin = -1; } } while (text[ix++] != '\0'); return index; } To parse a query, I take advantage of the fact that the outermost '+' operator must be surrounded by balanced parenthetical expressions. Once I have found the position of this '+', I can split the query in two and continue parsing recursively. struct Parse { char word[100]; struct Parse *left, *right; int distance; }; struct Parse *make_parse(char *s, int begin, int end) { int ix = begin, bix = 0, depth = 0; char buf[1000]; struct Parse *parse = calloc(1, sizeof(struct Parse)); if (s[begin] == '(') { while (s[++ix] != '+' || depth > 0) { depth += s[ix] == '(' ? 1 : 0; depth -= s[ix] == ')' ? 1 : 0; } parse->left = make_parse(s, begin+1, ix-1); while (s[++ix] != ' ') buf[bix++] = s[ix]; buf[bix] = '\0'; parse->right = make_parse(s, ix+1, end-1); parse->distance = atoi(buf); return parse; } strncpy(parse->word, &s[begin], end-begin); parse->word[end-begin] = '\0'; return parse; } When I'm looking for passages matching a query, the base case is that the query consists of a single word. I can then reach into the index to retrieve a list of word positions, if any. int add_passages(int **passages, int pnum, int docid, struct Index *index, struct Parse *parse) { int i, j, *used = calloc(700, sizeof(int)), pnum1, pnum2; int a, b, c, d, distance, x, y; if (parse->word[0] != '\0') { index = get(index, parse->word, 0, strlen(parse->word)); if (index == NULL) return pnum; for (i = index->count-1; i >= 0; i--) { passages[pnum][0] = passages[pnum][1] = index->pos[i]; passages[pnum++][2] = docid; } return pnum; }
Otherwise, I call my passage-retrieval function recursively for the left and right branches of the parse tree. I compute pairwise intersections of the passages in each set, saving the resulting passages that are within the specified range. The distance = parse->distance; pnum1 = add_passages(passages, pnum+400, docid, index, parse->left); pnum2 = add_passages(passages, pnum1, docid, index, parse->right); for (i = pnum+400; i < pnum1; i++) for (j = pnum1; j < pnum2; j++) { x = passages[i][0] < passages[j][0] ? i : j; y = passages[i][0] < passages[j][0] ? j : i; a = passages[x][0]; b = passages[x][1]; c = passages[y][0]; d = passages[y][1]; if ((c < b || c-b <= distance+1) && !used[25*a+(b<d?d:b)]) { passages[pnum][0] = a; passages[pnum][1] = b < d ? d : b; passages[pnum++][2] = docid; used[25*a+(b<d?d:b)] = 1; } } free(used); return pnum; }
Each element of int cmp(const void *a, const void *b) { int *sa = *((int **) a), *sb = *((int **) b); int da = sa[1] - sa[0], db = sb[1] - sb[0]; if (da != db) return da - db; if (sa[2] != sb[2]) return sa[2] - sb[2]; return sa[0]-sb[0]; } Preliminary to the search proper, I allocate space for the passages and characters spans, then parse the query. void proximity_search(char **docs, char *query) { struct Index *index; struct Parse *parse; int i, j, **passages = calloc(20000, sizeof(int *)), pnum = 0; int ***spans = calloc(50, sizeof(int **)), p, q, a, b, docid; char buf[1000]; for (i = 0; i < 20000; i++) passages[i] = calloc(3, sizeof(int)); for (i = 0; i < 50; i++) { spans[i] = calloc(25, sizeof(int *)); for (j = 0; j < 25; j++) spans[i][j] = calloc(2, sizeof(int)); } parse = make_parse(query, 0, strlen(query)); For each document, I make an index and call my passage-retrieval function, which returns the number of passages found thus far. Finally, I sort the passages and cull the top ten. for (i = 0; i < 50; i++) { index = make_index(docs[i], spans[i]); pnum = add_passages(passages, pnum, i, index, parse); } qsort(passages, pnum, sizeof(int *), cmp); for (i = 0; j < (pnum < 10 ? pnum : 10); i++) { a = passages[i][0]; b = passages[i][1]; docid = passages[i][2]; p = spans[docid][a][0]; q = spans[docid][b][1]; strncpy(buf, &docs[docid][p], q-p); buf[q-p] = '\0'; printf("%d %d [%s]\n", docid, p, buf); } } The solution to this problem consists of three parts: building an index, parsing the query, and finding passage intersections. I do not find any one of these parts to be very dull or very difficult, but it appears from the match results that it would have been wiser to assign them as separate problems rather than rolling all three into one. |
|