The Surface/Symbol Divide

I noted William's post on a new paper (to be published at ICDM) which his student, Richard Wang, has written. The paper describes a system which uses a smart combination of wrapper induction, set (list) discovery and graph-based ranking to materialize, as if by magic, expanded sets of terms. For example, enter Ursa Major, Ursa Minor, Orion and you will get back a nice list of related terms (including Taurus, Gemini, etc.).

I love this type of thing. It is close to some of the things I worked on during my thesis (mining information from tables in text) and is an example of a very active research area (which includes quite a bit of work on mining Wikipedia for entities and relationships).

However, there is no free lunch. This approach to knowledge discovery is fixed at the surface level of text (and, to be complete, at the surface level of the documents' representation language). Consequently, the performance of the system highlights both what is good about statistical surface techniques (little training is required, which is often the case for systems that combine document structure, textual data and high-precision seed input; they work in (m)any language(s); they are fast) and what is bad (they have no real knowledge of language).

An example of this problem can be seen when we give the seeds {obama, clinton} to the system. The following results appear:

# Entity Weight
1 obama 1.00000
2 clinton 1.00000
3 edwards 0.13000
4 romney 0.11125
5 mccain 0.10493
6 he 0.08484
7 giuliani 0.07974
8 the 0.06658
9 bush 0.06585
10 hillary 0.06373

While many of these results are fine, there are also errors which illustrate the separation of surface and symbolic processing: "he" and "the" are function words, not entities, yet they are ranked alongside the candidates' names.
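To make the point concrete, here is a minimal Python sketch of how even a sliver of symbolic knowledge would catch these two errors. It is not how the system in the paper works, and I am not suggesting it as a fix for it. The ranked list is copied from the table above; the `FUNCTION_WORDS` set and the `drop_function_words` helper are my own hypothetical names, and the word list is hand-picked purely for illustration.

```python
# Ranked output copied from the results table above: (entity, weight).
results = [
    ("obama", 1.00000),
    ("clinton", 1.00000),
    ("edwards", 0.13000),
    ("romney", 0.11125),
    ("mccain", 0.10493),
    ("he", 0.08484),
    ("giuliani", 0.07974),
    ("the", 0.06658),
    ("bush", 0.06585),
    ("hillary", 0.06373),
]

# Hypothetical, hand-picked set of closed-class words (an assumption made
# for this sketch; a real system would need a per-language resource).
FUNCTION_WORDS = {"he", "she", "it", "they", "the", "a", "an", "of", "and"}


def drop_function_words(ranked, function_words=FUNCTION_WORDS):
    """Keep only entries whose surface form is not a known function word."""
    return [
        (term, weight)
        for term, weight in ranked
        if term.lower() not in function_words
    ]


filtered = drop_function_words(results)

# Print the surviving entries with fresh rank numbers.
for rank, (term, weight) in enumerate(filtered, start=1):
    print(f"{rank} {term} {weight:.5f}")
```

Note the trade-off: the filter depends on a language-specific word list, which gives up exactly the language independence that makes the surface approach attractive in the first place.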
