AutomatedDataExtraction (r6, 24 Apr 2005, 14:25, TWikiGuest)


Keywords: wrapper induction, screen scraping

See also GnaLabs and AutomatedTopicClassification.

The Automated Data Extraction (ADE) code has been modularized and uploaded to CPAN as the module WWW::Extractor.

= Where to get it =

The source for the module is at

http://www.gnacademy.org/src/lib/WWW/Extractor.pm

with a Perl package at

http://www.gnacademy.org/beta/WWW/Extractor

The learn.wrapper script is still available at

https://www.gnacademy.org/src/bin/learn.wrapper

POD documentation is also available.

= How it works =

There are a few original ideas in the design of WWW::Extractor:

  • First, it uses edit distance matching rather than finite state machines.
  • Second, it preclassifies the tokens: before attempting a match, it determines whether each token is an HTML tag, a number, or a word.
  • Third, it infers the grammar from a single entry, deducing the structure of the entries from that one sample. This is particularly useful for screen scraping, since most of the web pages we want to extract information from are generated by templating systems.
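
The preclassification step in the second point can be illustrated with a short sketch. It is written in Python for brevity and is not the module's Perl code; the class names (tag, space, number, word) and the function name classify are assumptions made for this example.

```python
import re

# Hypothetical token classes; the module's real categories may differ.
TOKEN_RE = re.compile(r"""
    (?P<tag>    <[^>]*> )                     # an HTML tag such as <td> or </a>
  | (?P<space>  \s+ )                         # a run of whitespace
  | (?P<number> \d+(?:\.\d+)?(?![^\s<]) )     # a standalone integer or decimal
  | (?P<word>   [^\s<]+ )                     # any other run of text
""", re.VERBOSE)


def classify(html):
    """Return a list of (token_class, text) pairs for the given HTML."""
    # lastgroup is the name of the alternative that matched this token.
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(html)]


for token_class, text in classify("<td>Price: 42</td>"):
    print(token_class, repr(text))
```

Matching on the class of a token rather than on its text is what lets two entries with different titles or prices still line up: both reduce to the same sequence of classes.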

The source code for the ADE is at https://www.gnacademy.org/src/bin/learn.wrapper and is released under the terms of the GPL.

The script is inspired by the edit distance technique described by Chidlovskii et al., but it is also able to use semantic information.

The script looks at a sample entry into which keywords have been inserted. It breaks the file down into tokens and classifies those tokens into categories (e.g. blank space, web text, web tag). The code then uses edit distance to find a best-fit match; when it finds one, it inserts the tags and outputs the result.
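
The matching step described above might look roughly like the following sketch. It is Python rather than the script's Perl and is not taken from learn.wrapper: the function name align_fields, the (class, label) template format, and the unit edit costs are all assumptions made for illustration. The template stands for the classified sample entry with its inserted keywords; the tokens are a new entry to extract from.

```python
def align_fields(template, tokens):
    """Align a labelled template with new tokens and return {label: text}.

    template: list of (token_class, label or None) from the sample entry.
    tokens:   list of (token_class, text) from the entry to extract.
    """
    n, m = len(template), len(tokens)
    # dist[i][j] = edit distance between template[:i] and tokens[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = template[i - 1][0] == tokens[j - 1][0]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                       # template token missing
                dist[i][j - 1] + 1,                       # extra token in entry
                dist[i - 1][j - 1] + (0 if same else 1),  # match or substitute
            )
    # Walk back through the table, copying text for matched, labelled slots.
    fields = {}
    i, j = n, m
    while i > 0 and j > 0:
        same = template[i - 1][0] == tokens[j - 1][0]
        if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            label = template[i - 1][1]
            if same and label is not None:
                fields[label] = tokens[j - 1][1]
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return fields


# Sample entry: <li> TITLE <b> PRICE </li>, with two labelled slots.
template = [("tag", None), ("word", "title"), ("tag", None),
            ("number", "price"), ("tag", None)]
# A new entry that has one extra tag compared with the sample.
tokens = [("tag", "<li>"), ("word", "Perl"), ("tag", "<b>"),
          ("tag", "<i>"), ("number", "30"), ("tag", "</li>")]
print(align_fields(template, tokens))
```

Because the alignment tolerates insertions and deletions, the extra tag in the new entry costs one edit but does not prevent the title and price slots from being recovered.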

I used edit distance rather than finite state machines because the FSM-based schemes seem to be overkill.

= Reviews =

http://citeseer.ist.psu.edu/eikvil99information.html Information Extraction from World Wide Web: A Survey (1999)

= Papers =

http://citeseer.ist.psu.edu/chidlovskii00automatic.html Automatic Wrapper Generation for Web Search Engines

http://citeseer.ist.psu.edu/hsu98generating.html Generating Finite-State Transducers For Semi-Structured Data Extraction From The Web (1998)

http://citeseer.ist.psu.edu/gao99autowrapper.html Autowrapper: Automatic Wrapper Generation for Multiple Online Services

http://citeseer.ist.psu.edu/muslea98stalker.html STALKER: Learning Extraction Rules for Semistructured, Web-based Information Sources (1998)

http://citeseer.ist.psu.edu/448792.html Automatic Data Extraction from Lists and Tables in Web Sources

http://citeseer.ist.psu.edu/521682.html IEPAD: Information Extraction Based on Pattern Discovery

http://citeseer.ist.psu.edu/kushmerick97wrapper.html Wrapper Induction for Information Extraction

http://citeseer.ist.psu.edu/embley99recordboundary.html Record-Boundary Discovery In Web Documents (1999)


Calculating edit distance: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit.html Dynamic Programming Algorithm (DPA) for Edit-Distance
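
For reference, the dynamic programming calculation can be written in a few lines. This is a generic Levenshtein distance with unit costs, sketched in Python; it is not code from WWW::Extractor, and the function name edit_distance is chosen for this example. It keeps only two rows of the table, which is enough when only the distance (and not the alignment) is needed.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit costs)."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                   # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance(["tag", "word"], ["tag", "number", "word"]))
```

The function accepts any two sequences, so it works equally on strings and on lists of token classes.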

-- TWikiGuest - 24 Apr 2005

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors and is licensed under the terms of the GNU Free Documentation License.