Unlike most other distance learning catalogs, the GNA distance learning catalog exists mainly to serve as a test bed for research and development of web technologies. Two technologies used extensively in this catalog are:
- automated data extraction
- automated topic classification
The first technology uses pattern matching to extract course information from a web page. The system is trained by adding tags to a web entry, and it then uses an edit distance algorithm to extract course information. To get course information into the catalog, we have a three-layer process, depending on how many courses are being added:
- 1-5 courses get added using our form-based entry
- 5-50 courses get typed in manually, with data entry outsourced to http://www.suntecindia.com/
- 50+ courses get added using the automated data extractor
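The tag-then-match idea behind the automated extractor can be sketched roughly as follows. This is only an illustration under stated assumptions, not the catalog's actual code: the tag syntax (`<title>...</title>`), the "Label: value" line layout, the function names and the `max_distance` threshold are all hypothetical. A tagged example entry teaches the system which label precedes which field, and edit distance then lets slightly different labels on other pages still match.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def normalize(label):
    """Lowercase a label and drop the trailing colon and whitespace."""
    return label.strip().rstrip(":").strip().lower()


# Hypothetical tag format: "Some label: <field>example value</field>"
TAGGED_LINE = re.compile(r"^(?P<label>.*?)<(?P<field>\w+)>.*?</(?P=field)>\s*$")


def train(tagged_entry):
    """Learn a mapping from normalized label text to field name."""
    labels = {}
    for line in tagged_entry.splitlines():
        match = TAGGED_LINE.match(line.strip())
        if match:
            labels[normalize(match.group("label"))] = match.group("field")
    return labels


def extract(entry, labels, max_distance=2):
    """Pull field values out of an untagged entry with fuzzy label matching."""
    record = {}
    if not labels:
        return record
    for line in entry.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # no "Label: value" structure on this line
        label = normalize(label)
        best = min(labels, key=lambda known: edit_distance(label, known))
        if edit_distance(label, best) <= max_distance:
            record[labels[best]] = value.strip()
    return record


tagged = (
    "Course title: <title>Intro to Biology</title>\n"
    "Instructor: <instructor>Jane Doe</instructor>"
)
labels = train(tagged)
page = "Course Title: Linear Algebra\nInstructors: A. Nguyen\nLocation: online"
record = extract(page, labels)
```

Here "Instructors" is one edit away from the trained label "Instructor", so it is still recognized, while "Location" is too far from any trained label and is skipped.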
The second technology still needs some work. We began with naive Bayesian classification, but we still get a lot of odd matches, so we are trying to improve topic classification.
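For reference, a naive Bayesian topic classifier can be as small as the sketch below. It is a generic multinomial naive Bayes with add-one smoothing, not the catalog's own classifier; the class name, the whitespace tokenizer and the sample topics are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTopics:
    """Multinomial naive Bayes over course descriptions (illustrative)."""

    def __init__(self):
        self.topic_docs = Counter()              # topic -> number of training texts
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.vocabulary = set()

    def train(self, text, topic):
        words = text.lower().split()
        self.topic_docs[topic] += 1
        self.word_counts[topic].update(words)
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the topic with the highest log-probability score."""
        if not self.topic_docs:
            raise ValueError("classifier has not been trained")
        words = text.lower().split()
        total_docs = sum(self.topic_docs.values())
        scores = {}
        for topic, n_docs in self.topic_docs.items():
            counts = self.word_counts[topic]
            # Add-one smoothing so unseen words do not zero out a topic.
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            scores[topic] = score
        return max(scores, key=scores.get)


classifier = NaiveBayesTopics()
classifier.train("introduction to organic chemistry", "science")
classifier.train("cell biology and genetics", "science")
classifier.train("world history since 1500", "humanities")
classifier.train("modern european history and literature", "humanities")
topic = classifier.classify("history of literature")
```

With so little training text, common words such as "and" or "introduction" can easily tip a course into the wrong topic, which is one plausible source of odd matches.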
Another area of research we've added is a system of filter plugins, which resemble the tagging used in social bookmarking.
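One way to picture filter plugins that behave like tags is a registry of small predicate functions, each registered under a tag name. The sketch below is hypothetical throughout: the registry, the decorator, the course fields (`level`, `cost`) and the two example filters are invented for illustration and do not describe the catalog's real plugin interface.

```python
FILTERS = {}  # tag name -> predicate taking a course dict


def filter_plugin(tag):
    """Decorator that registers a predicate function under a tag name."""
    def register(func):
        FILTERS[tag] = func
        return func
    return register


@filter_plugin("high-school")
def is_high_school(course):
    return course.get("level") == "high school"


@filter_plugin("free")
def is_free(course):
    return course.get("cost") == 0


def tags_for(course):
    """All tags whose filter accepts this course, in alphabetical order."""
    return sorted(tag for tag, accepts in FILTERS.items() if accepts(course))


def courses_tagged(courses, tag):
    """Courses accepted by the filter registered under the given tag."""
    return [course for course in courses if FILTERS[tag](course)]


courses = [
    {"title": "Algebra I", "level": "high school", "cost": 0},
    {"title": "Organic Chemistry", "level": "university", "cost": 300},
]
```

Adding a new filter then means writing one more decorated function, and browsing by filter works much like browsing by tag on a bookmarking site.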
Aside from this, the whole thing runs on PostgreSQL under Mandriva Cooker. This is bleeding-edge code, so we find a number of bugs, which we report to the open source community.
=List of catalog projects=
- Add filter for high school courses
- Create e-mail to link for new courses.