| Tuesday, 2008 October 7 17:23:10 GMT | Our catalog has 32123 courses and 6224 programs |
Web interfaces didn't work because people didn't submit information often enough to remember a password. Right now, all interaction with our listers occurs through e-mail attachments.
We tried machine parsing for a while. In contrast to topic classification, where machine learning works beautifully, we have mostly given up on machine parsing for raw data. The assumption was that we would write a parser that used Perl regular expressions to extract information from a web page. Then, if the web page changed, we'd just run the parser again and get the new data.
The problem was that writing a parser required someone who understood the Perl pattern-matching language. For someone who did understand Perl, writing a parser wasn't particularly time consuming.
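To make the idea concrete, here is a minimal sketch of that kind of per-page parser. It is written in Python rather than Perl, and the page fragment, the pattern, and the names PAGE, ROW, and parse_courses are all hypothetical; none of them come from our actual code. It only illustrates why each page layout needed its own hand-written pattern.

```python
import re

# Hypothetical page fragment; real lister pages each had their own layout.
PAGE = """
<tr><td class="title">Introduction to Physics</td><td class="teacher">John Smith</td></tr>
<tr><td class="title">World History</td><td class="teacher">Jane Doe</td></tr>
"""

# One hand-written pattern per page layout (assumed layout, for illustration).
ROW = re.compile(
    r'<td class="title">(?P<title>[^<]+)</td>\s*'
    r'<td class="teacher">(?P<teacher>[^<]+)</td>'
)

def parse_courses(html):
    """Return one dict per course row matched by the hand-written pattern."""
    return [m.groupdict() for m in ROW.finditer(html)]

courses = parse_courses(PAGE)
```

If the page's content changes but its layout does not, running parse_courses again picks up the new data. The pattern itself is the part that needs someone who can read and write regular expressions.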
Topic classification is a two-step process. First we look at the class identification for the class, and then we use a lookup table to make a first guess as to the topic of the class. This lookup table is included in our database download package or you can look at it here.
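The first step can be sketched roughly as follows. This is a Python illustration, not the project's code, and it assumes the class identification looks like a department prefix followed by a number (for example "PHYS 101"). The table entries and the names PREFIX_TO_TOPIC and first_guess_topic are hypothetical; the real lookup table is the one in the database download package.

```python
import re

# Hypothetical entries; the real table ships with the database download package.
PREFIX_TO_TOPIC = {"PHYS": "Physics", "HIST": "History", "MATH": "Mathematics"}

def first_guess_topic(class_id):
    """Map the leading letters of a class identifier to a topic, or None if unknown."""
    m = re.match(r"[A-Za-z]+", class_id.strip())
    if m is None:
        return None  # identifier has no letter prefix to look up
    return PREFIX_TO_TOPIC.get(m.group(0).upper())
```

A result of None means the table has no first guess, and the decision is left entirely to the second step.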
The next step involves doing a pattern match against titles of courses that are already in the database and then guessing a best match using Bayesian statistics. The code to do this is available in the file topics.pl in our library package or you can also look at it here. I think that the main procedure is refine_topic1.
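As a rough illustration of the second step, the sketch below uses a small naive Bayes classifier over title words, trained on titles that already carry a topic. This is an assumption about the general technique, written in Python; it is not a port of refine_topic1 or topics.pl, and the names tokenize, train, guess_topic, and EXISTING, along with the sample titles, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(title):
    """Lowercase a title and split it into alphabetic words."""
    return re.findall(r"[a-z]+", title.lower())

def train(labeled_titles):
    """Count topics and per-topic words from (title, topic) pairs already in the database."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for title, topic in labeled_titles:
        topic_counts[topic] += 1
        for w in tokenize(title):
            word_counts[topic][w] += 1
            vocab.add(w)
    return topic_counts, word_counts, vocab

def guess_topic(title, model):
    """Return the topic with the highest naive Bayes log score (add-one smoothing)."""
    topic_counts, word_counts, vocab = model
    total = sum(topic_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic, n in topic_counts.items():
        score = math.log(n / total)  # prior: share of existing courses with this topic
        denom = sum(word_counts[topic].values()) + len(vocab)
        for w in tokenize(title):
            score += math.log((word_counts[topic][w] + 1) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical stand-in for courses already in the database.
EXISTING = [
    ("Introduction to Physics", "Physics"),
    ("Physics of Waves", "Physics"),
    ("Introduction to World History", "History"),
    ("History of Modern Europe", "History"),
]
model = train(EXISTING)
```

With this toy data, guess_topic("Advanced Physics", model) picks "Physics" because "physics" appears only in titles already filed under that topic. The quality of the guess depends entirely on how many labeled titles are already stored.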
It's interesting to see why we have more or less successfully automated topic assignment but not field assignment. Basically, our computer systems "know" that "Introduction to Physics" is a physics class because courses that have similar names have already been associated with a topic in the database. By contrast, our program doesn't know that "John Smith" is a teacher name rather than a course description because that information isn't stored anywhere. In essence, we are using our database itself as a rudimentary neural net.