Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching

  Sigit Dewanto
  Computer Science Department, Gadjah Mada University, Yogyakarta
  sigitdewanto@gmail.com

  Khabib Mustofa
  Computer Science Department, Gadjah Mada University, Yogyakarta
  khabib@ugm.ac.id

  ABSTRACT

  The Internet, and the web in particular, has turned into a vast source of information. Most web content is currently generated from data stored in databases. From the information provider's point of view, the presentation of such data tends to follow predefined structures or fixed templates. On the other hand, some users want to consume this structured data for further processing. Extracting such data is useful because it enables people to obtain and integrate data from multiple sources. This paper discusses the problem of information extraction from such web pages. An automatic pattern discovery method based on tree matching is used as the structured data extraction method. The main advantage of the method is that it requires less human intervention. In this paper we discuss the implementation of an extractor using that method and evaluate the approach in terms of correctness (recall) and precision. Experimental results show that almost all extraction targets can be successfully extracted by the developed extractor. However, structured data that are not targeted are sometimes extracted as well. This leads to the need for a manual tuning or filter feature in the developed extractor.

  Keywords: information extraction, structured data extraction, tree matching, web content mining

1. INTRODUCTION

  Structured data in web pages usually contain important information. Such data are often retrieved from underlying databases and displayed in web pages using fixed templates. These structured data are called data records [4]. Automatic extraction of data records by computer can help people gather information from the web.

  Structured data extraction from web pages has been studied by researchers. Existing methods addressing the problem can be classified into three categories [4]. Methods in the first category provide languages that facilitate the construction of data extraction systems. In this category the user must manually construct the data record pattern for the extraction target (manual extraction). Methods in the second category, also called wrapper induction, use machine learning techniques to learn and construct wrappers (data extractors) from human-labeled examples. Manual labeling is time-consuming and hard to scale to a large number of sites on the web. Methods in the third category are based on the idea of automatic pattern discovery. They have an advantage over the other methods because they do not require separate training, validation, and application phases [2]. Methods in the third category can be divided into two groups: string matching based and HTML tree matching based. It has been shown that automatic pattern discovery methods based on HTML tree matching are more efficient than the string matching approaches [3].

  The rest of this paper is organized as follows. Section 3 presents DEPTA, which is one of the automatic pattern discovery methods based on tree matching. Section 4 describes the differences between the developed extractor and DEPTA. Sections 5 and 6 cover the evaluation scheme and the experimental results, and Section 7 concludes the paper and suggests future work.

2. WEB PAGES AND DATA RECORDS

  Many web pages, especially those offering products, contain data records. The presentation of such pages can look very different to a human reader, but every page can be classified into one of two classes: list pages and detail pages. A list page contains one or more objects, while a detail page contains information about only one object. Figure 1 shows an example of a list page. In that page there are two data regions: a horizontal one (upper half) and a vertical one (lower half). A data region is a collection of similar data records (information about objects of a similar type) that occupy a contiguous part of a web page. In every data region, the data records are formatted uniformly using the same template. Figure 2, on the other hand, is an example of a detail page.

  Figure 1. Example of a list page containing horizontal and vertical data regions

  This paper discusses an approach applicable to list pages only, not to detail pages.

  Figure 2. Example of a detail page containing information about an iPod

3. DATA EXTRACTION BASED ON PARTIAL TREE ALIGNMENT (DEPTA)

  Automatic pattern discovery methods have been studied by researchers because of the shortcomings of manual extraction and wrapper induction methods. Both of those methods are difficult to apply to a large number of sites. Moreover, if the encoding template used by a site changes, the existing wrapper for the site becomes invalid.

  Automatic pattern extraction is possible because the data records in web pages are usually encoded using a very small number of fixed templates. It is possible to find these templates by mining repeated patterns in multiple data records. Both string matching and tree matching based methods can be used to mine these patterns. Tree matching can be used because web pages are written in HyperText Markup Language (HTML), so they can be modeled as trees.

  DEPTA is one of the automatic pattern discovery methods based on tree matching, developed by Zhai and Liu [4]. DEPTA is able to extract data records from a single list page. A list page is a page that contains one or more lists of objects (for example, a page that displays a list of books including their authors, publishers, prices, etc.).

  The general architecture of DEPTA is illustrated in Figure 3. The input to the system is a web page that contains one or more data records (a page can have more than one area that contains regularly structured data records). The system is composed of the following main components:

  1. DOM tree builder (or tag tree builder): The function of this component is to build a document object model (DOM) tree from an input page. In DEPTA this component is implemented using the MSHTML API, which returns the rendering information of each HTML element in the browser (also called visual information). Using the rendering information and the HTML opening tags, this component builds the DOM tree of an input page.

  2. Data regions identifier: This component traverses the DOM tree top down to identify each area or region in the input page that contains a list of similar data records.

  3. Data records identifier: This component detects the boundaries of individual data records in each data region (an area that contains data records describing similar objects). The output of this component is a list of data records (still in HTML) from each data region.

  4. Data item extractor: This component aligns and extracts data items (analogous to cells in a table) using the partial tree alignment algorithm. The output of this component is one or more tables (in accordance with the number of identified data regions) that contain aligned data items. For an input page with multiple data regions, data from each region is put in a separate table.

  Figure 3. The general architecture of the DEPTA system [4]

  The tree matching algorithm used in DEPTA is simple tree matching (STM). This algorithm is used in the data regions identifier, the data records identifier, and the data item extractor (as part of partial tree alignment). In DEPTA, the number of tree matching computations is minimized by using visual information.

4. DEVELOPMENT OF EXTRACTOR

  The extractor created in this research, called Structured Data Extractor (SDE), is developed based on DEPTA and implemented using the open source web scripting language PHP combined with JavaScript. There are some differences between the SDE and DEPTA implementations. The differences are as follows:

  1. In tag tree building, DEPTA uses HTML opening tags and visual information while SDE uses an HTML DOM parser.

  2. DEPTA uses visual information in tag tree building, tree matching, and detection of gaps between data record candidates, while SDE does not.

  3. In scoring the similarity between data record candidates, DEPTA only considers tags that contain text while SDE considers all tags.

  4. DEPTA is capable of extracting non-contiguous data records whereas SDE is not.
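  The paper names simple tree matching (STM) as the matching algorithm but does not spell it out. The sketch below follows the commonly cited dynamic programming formulation of STM and is only an illustration in Python, not the PHP code of SDE. The `Node` class, the function names, and the normalization by the average tree size are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag tree node: a tag name plus its ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b, keeping sibling order.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 counts the matched roots


def count_nodes(tree: Node) -> int:
    """Count all nodes in a tag tree."""
    return 1 + sum(count_nodes(child) for child in tree.children)


def normalized_similarity(a: Node, b: Node) -> float:
    """Scale the STM score by the average size of the two trees (assumed normalization)."""
    return simple_tree_matching(a, b) / ((count_nodes(a) + count_nodes(b)) / 2)


# Two record candidates that differ by one extra <b> tag.
record_1 = Node("tr", [Node("td", [Node("a")]), Node("td")])
record_2 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("b")])])
print(simple_tree_matching(record_1, record_2))             # 4
print(round(normalized_similarity(record_1, record_2), 3))  # 0.889
```

  A normalized score like this can be compared with a similarity threshold to decide whether two candidates belong to the same data region. The threshold value itself is not given in the text.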

4.1 Tag Tree Building using HTML DOM Parser

  The first step performed by SDE in data records extraction is building a tag tree from an input web page. The HTML parser used in tag tree building is an open source library, NekoHTMLParser, which is a DOM-based parser. A DOM-based parser builds a DOM tree from a web page. Nodes in a DOM tree have different types: Element, Text, Attribute, Comment, etc. Each HTML tag (or pair of HTML tags) in a web page is represented as an Element node. Each text within a pair of HTML tags is represented as a Text node and is a child of the Element node representing that pair of tags. Each attribute in an HTML tag is represented as an Attribute node attached to the Element node representing that tag. Each comment in the HTML is represented as a Comment node.

  Because the structured data extraction method needs only the tag tree structure, not the entire DOM tree structure, a tag tree is built from the DOM tree. Each node in a tag tree represents one tag in the web page. In other words, there is only one type of node in a tag tree: the node representing an HTML tag in the web page (the same as an Element node in the DOM tree). Each text within a pair of HTML tags is stored as the innerText property of the node representing that pair of tags.
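  The tag tree described above can be sketched as follows. This is a minimal illustration, not the SDE implementation: it uses Python's standard `html.parser` module instead of NekoHTMLParser and builds the tag tree directly from parser events rather than from an intermediate DOM tree. The names `TagNode`, `TagTreeBuilder`, `build_tag_tree`, and `inner_text` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagNode:
    """One node per HTML tag; text is kept as a property, not as a child node."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.inner_text = ""


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = TagNode("#root")  # artificial root above the document
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self.current)  # attributes are ignored on purpose
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the node without descending.
        self.current.children.append(TagNode(tag, self.current))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            joined = self.current.inner_text + " " + text
            self.current.inner_text = joined.strip()


def build_tag_tree(html):
    """Parse an HTML string and return the artificial root of its tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = build_tag_tree("<ul><li><b>A</b> x<br>y</li><li>B</li></ul><!-- c -->")
first_item = root.children[0].children[0]
print([child.tag for child in first_item.children])  # ['b', 'br']
print(first_item.inner_text)                         # x y
```

  Comments and attributes produce no nodes here, which mirrors the statement that a tag tree has only one node type.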

5. EVALUATION SCHEME

  SDE is executed with web pages from 20 websites as inputs. From each website, two different web pages containing structured data are selected as samples. Fourteen of the twenty websites are job vacancy information providers from Indonesia. The other websites are international job vacancy information providers. The following parameters are recorded for each execution of SDE:

  1. The number of target data records contained in the input page. This parameter is called actual data records.

  2. The number of correctly extracted actual data records.

  3. The number of incorrectly extracted actual data records. A data record is incorrectly extracted when only part of its content is extracted, or when information outside the data record boundary is extracted and included in it.

  4. The number of actual data records that were not identified/extracted by SDE.

  5. The number of data records extracted by SDE. The extracted data records are not always the same as the actual data records, because irrelevant data records may also be extracted.

  6. The number of target data items in the correctly extracted actual data records. This parameter is called actual data items.

  7. The number of correctly aligned actual data items from the correctly extracted actual data records. Correctly aligned data items are data items that are aligned with other data items containing the same type of information.

  8. The number of incorrectly aligned actual data items from the correctly extracted actual data records. Incorrectly aligned data items are data items that contain the same type of information but are not aligned in the same column. For example, five data items should be aligned in the same column (because they contain the same type of information), but in the actual alignment three of them are aligned in one column and the other two in a different column. Another case of incorrect alignment is when data items containing different types of information are aligned in one column.

  9. The number of unaligned (unidentified by SDE) actual data items from the correctly extracted actual data records.

  10. The number of aligned data items from the correctly extracted actual data records. The aligned data items are not always the same as the actual data items from the correctly extracted data records.

  After all parameters have been recorded, the precision and recall values of data records extraction and data items alignment are calculated for each input page. Precision and recall are widely used measures for evaluating information retrieval systems and have been adapted to evaluate information extraction systems [1]. In the evaluation of SDE, the precision of data records extraction is divided into two kinds. The first kind considers irrelevant extracted data records. In this case the precision of data records extraction is the number of correctly extracted actual data records divided by the number of data records extracted by SDE. The second kind does not consider irrelevant extracted data records. In this case the precision is the number of correctly extracted actual data records divided by the number of actual data records extracted, both correctly and incorrectly. The recall of data records extraction is the number of correctly extracted actual data records divided by the number of actual data records in the input web page.

  The precision of data items alignment is the number of correctly aligned actual data items divided by the number of data items aligned by SDE. The recall of data items alignment is the number of correctly aligned actual data items divided by the number of actual data items in the correctly extracted actual data records.

6. EXPERIMENTAL RESULTS

  Table 1 shows the experimental results of data records extraction for each input web page. The “ACT” column shows the number of actual data records in the input web page. The “COR” column shows the number of correctly extracted actual data records. The “WRG” column shows the number of incorrectly extracted actual data records. The “MISS” column shows the number of actual data records not identified by SDE. The “FOU” column shows the number of data records extracted by SDE.

  Table 1 shows that, of the 40 pages, incorrectly extracted actual data records are found in only three input pages. In the two web pages selected from bursa-kerja.ptkpt.net, all of the actual data records are incorrectly extracted because the structure of the actual data records in these pages differs from the assumption used in the extraction method. In the second page selected from gkarir.com, two data records are incorrectly extracted because they are not in a contiguous area (separated). The extraction method used can only identify data records if there are two or more data records of the same type located in one contiguous area.
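  The precision and recall definitions of Section 5 can be written down as a short calculation. The sketch below is illustrative only. The function and key names are hypothetical, and returning `None` for a zero denominator is an assumption of this example, not something the paper specifies. The sample counts are those of the second gkarir.com page (data records) and the first jobs.com page (data items) as read from the result tables.

```python
def percent(numerator, denominator):
    """Return a percentage rounded to two decimals, or None when undefined."""
    if denominator == 0:
        return None
    return round(100.0 * numerator / denominator, 2)


def record_extraction_scores(act, cor, wrg, fou):
    """Scores for data records extraction from the ACT, COR, WRG and FOU counts."""
    return {
        # precision that considers irrelevant extracted data records
        "precision_with_irrelevant": percent(cor, fou),
        # precision that ignores irrelevant extracted data records
        "precision_without_irrelevant": percent(cor, cor + wrg),
        "recall": percent(cor, act),
    }


def item_alignment_scores(act, cor, fou):
    """Scores for data items alignment from the ACT, COR and FOU counts."""
    return {
        "precision": percent(cor, fou),
        "recall": percent(cor, act),
    }


# gkarir.com, second page: 7 actual records, 5 correct, 2 wrong, 97 extracted.
print(record_extraction_scores(act=7, cor=5, wrg=2, fou=97))
# {'precision_with_irrelevant': 5.15, 'precision_without_irrelevant': 71.43, 'recall': 71.43}

# jobs.com, first page: 472 actual items, 468 correctly aligned, 472 aligned.
print(item_alignment_scores(act=472, cor=468, fou=472))
# {'precision': 99.15, 'recall': 99.15}
```

  The large gap between the two precision values for the gkarir.com page comes entirely from the 97 extracted records, most of which are irrelevant.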

  Table 1. Data records extraction results

  No.  Input page                      ACT  COR  WRG  MISS  FOU
   1.  datakarir.com 1                  18   18              77
   2.  datakarir.com 2                  18   18              27
   3.  lowongan-pekerjaan.net 1         30   30             156
   4.  lowongan-pekerjaan.net 2         30   30              32
   5.  jobitcom.com 1                   14   14             149
   6.  jobitcom.com 2                   14   14              51
   7.  gkarir.com 1                     12   12              99
   8.  gkarir.com 2                      7    5    2         97
   9.  jobindo.com 1                    10   10              52
  10.  jobindo.com 2                    20   20              28
  11.  ww.jobsdb.com 1                  50   50             152
  12.  ww.jobsdb.com 2                  50   50             140
  13.  bursa-kerja.ptkpt.net 1         238       238         49
  14.  bursa-kerja.ptkpt.net 2         230       230         51
  15.  duniakarir.com 1                 10   10              12
  16.  duniakarir.com 2                 10   10              12
  17.  jobstreet.com 1                  50   50              71
  18.  jobstreet.com 2                  50   50              69
  19.  klikkarir.com 1                  18   18              54
  20.  klikkarir.com 2                  10   10              58
  21.  lowongankerja.com 1               5    5              23
  22.  lowongankerja.com 2               5    5              23
  23.  karir.tv 1                       20   20              61
  24.  karir.tv 2                       20   20              61
  25.  karir.com 1                      34   34              46
  26.  karir.com 2                     109  109             121
  27.  lowongan-kerja.terbaru.com 1      9    9              36
  28.  lowongan-kerja.terbaru.com 2      9    9              38
  29.  yahoo.com 1                      30   30             103
  30.  yahoo.com 2                      30   30              96
  31.  jobs.com 1                       25   24          1   34
  32.  jobs.com 2                       50   50              55
  33.  indeed.com 1                     10   10              49
  34.  indeed.com 2                     10   10              49
  35.  efinancialcareers.sg 1           30   30             105
  36.  efinancialcareers.sg 2           30   30             105
  37.  careerone.com.au 1               10   10              44
  38.  careerone.com.au 2               15   15              56
  39.  careerbuilder.com 1              10   10              79
  40.  careerbuilder.com 2               4    4              39

  There is only one actual data record that is not identified by SDE: a record in the first of the pages selected from jobs.com. This record is not extracted because it is too different from the other data records in the same data region (i.e., the similarity threshold is not reached).

  Table 2 shows the experimental results of data items alignment for the correctly extracted actual data records from each input web page. The “ACT” column shows the number of actual data items in the correctly extracted actual data records (actual data items). The “COR” column shows the number of correctly aligned actual data items. The “WRG” column shows the number of incorrectly aligned actual data items. The “MISS” column shows the number of unaligned (unidentified by SDE) actual data items. The “FOU” column shows the number of aligned data items from the correctly extracted actual data records.

  Table 2. Data items alignment results

  No.  Input page                      ACT  COR  WRG  MISS  FOU
   1.  datakarir.com 1                 198  198             216
   2.  datakarir.com 2                 198  198             216
   3.  lowongan-pekerjaan.net 1        360  360             390
   4.  lowongan-pekerjaan.net 2        360  360             390
   5.  jobitcom.com 1                  154  154             168
   6.  jobitcom.com 2                  154  154             168
   7.  gkarir.com 1                    132  132             144
   8.  gkarir.com 2                     55   55              60
   9.  jobindo.com 1                    60   60              60
  10.  jobindo.com 2                   160  160             160
  11.  ww.jobsdb.com 1                 450  450             550
  12.  ww.jobsdb.com 2                 450  450             550
  13.  bursa-kerja.ptkpt.net 1
  14.  bursa-kerja.ptkpt.net 2
  15.  duniakarir.com 1                 80   80              80
  16.  duniakarir.com 2                 80   80              80
  17.  jobstreet.com 1                 400  400             400
  18.  jobstreet.com 2                 400  400             400
  19.  klikkarir.com 1                 154  154             154
  20.  klikkarir.com 2                  50   50              50
  21.  lowongankerja.com 1             145  145             145
  22.  lowongankerja.com 2             145  145             145
  23.  karir.tv 1                      240  240             260
  24.  karir.tv 2                      240  240             260
  25.  karir.com 1                     170  170             170
  26.  karir.com 2                     545  545             545
  27.  lowongan-kerja.terbaru.com 1    126  124    2        126
  28.  lowongan-kerja.terbaru.com 2     90   90              90
  29.  yahoo.com 1                     480  480             480
  30.  yahoo.com 2                     480  480             480
  31.  jobs.com 1                      472  468    4        472
  32.  jobs.com 2                      250  250             250
  33.  indeed.com 1                    260  260             260
  34.  indeed.com 2
  35.  efinancialcareers.sg 1          210  210             210
  36.  efinancialcareers.sg 2          210  210             210
  37.  careerone.com.au 1               60   60              60
  38.  careerone.com.au 2              135  135             135
  39.  careerbuilder.com 1              40   40              40
  40.  careerbuilder.com 2              16   16              16

  Table 3 gives the precision and recall values of data records extraction and data items alignment for the 40 web pages. Column “A” shows the precision (in percent) of data records extraction when irrelevant data records are considered. Column “B” shows the precision (in percent) of data records extraction when irrelevant data records are not considered. Column “C” shows the recall (in percent) of data records extraction. Column “D” shows the precision (in percent) of data items alignment. Column “E” shows the recall (in percent) of data items alignment. The table shows that the arithmetic mean of the precision of data records extraction that considers irrelevant data records is low (below 50%). This reveals the shortcoming of the extraction method using automatic pattern discovery: there are many irrelevant data records in the extraction results. The table also shows that the arithmetic mean of the recall of data items alignment is below 100% even though there are no unaligned data items. This is caused by data items that contain only non-printing characters (e.g. space, tab) or formatting tags (such as BR).

  Table 3. Precision and recall of data records extraction and data items alignment

  No.  Input page                           A       B       C       D       E
   1.  datakarir.com 1                  23.38     100     100   91.67     100
   2.  datakarir.com 2                  66.67     100     100   91.67     100
   3.  lowongan-pekerjaan.net 1         19.23     100     100   92.31     100
   4.  lowongan-pekerjaan.net 2         93.75     100     100   92.31     100
   5.  jobitcom.com 1                     9.4     100     100   91.67     100
   6.  jobitcom.com 2                   27.45     100     100   91.67     100
   7.  gkarir.com 1                     12.12     100     100   91.67     100
   8.  gkarir.com 2                      5.15   71.43   71.43   91.67     100
   9.  jobindo.com 1                    19.23     100     100     100     100
  10.  jobindo.com 2                    71.43     100     100     100     100
  11.  ww.jobsdb.com 1                  32.89     100     100   81.82     100
  12.  ww.jobsdb.com 2                  35.71     100     100   81.82     100
  13.  bursa-kerja.ptkpt.net 1                                    100     100
  14.  bursa-kerja.ptkpt.net 2                                    100     100
  15.  duniakarir.com 1                 83.33     100     100     100     100
  16.  duniakarir.com 2                 83.33     100     100     100     100
  17.  jobstreet.com 1                  70.42     100     100     100     100
  18.  jobstreet.com 2                  72.46     100     100     100     100
  19.  klikkarir.com 1                  33.33     100     100     100     100
  20.  klikkarir.com 2                  17.24     100     100     100     100
  21.  lowongankerja.com 1              21.74     100     100     100     100
  22.  lowongankerja.com 2              21.74     100     100     100     100
  23.  karir.tv 1                       32.79     100     100   92.31     100
  24.  karir.tv 2                       32.79     100     100   92.31     100
  25.  karir.com 1                      73.91     100     100     100     100
  26.  karir.com 2                      90.08     100     100     100     100
  27.  lowongan-kerja.terbaru.com 1        25     100     100   98.41   98.41
  28.  lowongan-kerja.terbaru.com 2     23.68     100     100     100     100
  29.  yahoo.com 1                      29.13     100     100     100     100
  30.  yahoo.com 2                      31.25     100     100     100     100
  31.  jobs.com 1                       70.59     100      96   99.15   99.15
  32.  jobs.com 2                       90.91     100     100     100     100
  33.  indeed.com 1                     20.41     100     100     100     100
  34.  indeed.com 2                     20.41     100     100    96.3     100
  35.  efinancialcareers.sg 1           28.57     100     100     100     100
  36.  efinancialcareers.sg 2           28.57     100     100     100     100
  37.  careerone.com.au 1               22.73     100     100     100     100
  38.  careerone.com.au 2               26.79     100     100     100     100
  39.  careerbuilder.com 1              12.66     100     100     100     100
  40.  careerbuilder.com 2              10.26     100     100     100     100
       Average                          37.26   94.29   94.19   96.92   99.94

7. CONCLUSION AND FUTURE WORKS

  A structured data extractor using an automatic pattern discovery method based on tree matching has been successfully developed and evaluated. The method is based on DEPTA, with several differences in the implementation. Experimental results show that almost all targeted structured data can be successfully extracted (the average recall values for data records extraction and data items alignment are above 90%). However, the extractor also extracts structured data that are not targeted, i.e. irrelevant data, as shown by the average precision of data records extraction when irrelevant data records are considered. Because of this, users must manually filter the extraction results. This happens because the extraction method considers only the tree structure pattern and is not able to understand the information contained in the data records.

  Actual data records are incorrectly extracted by SDE when their structure differs from the assumptions used by the extraction method, or when there is only one actual data record in a contiguous area (separated from other data records of a similar type). Actual data records are not extracted (unidentified) by SDE when the similarity threshold is not satisfied. Actual data items are incorrectly aligned because the extraction method only matches the tag tree structure and the strings in the data items, without understanding the meaning of the data items.

  The following steps are worth examining as future work:

  1. Saving the discovered data record patterns so they can be reused to extract data records from web pages with the same encoding template (without discovering the patterns again). The problems of how to extract data records from a web page using previously saved patterns and how to detect the encoding template also need to be considered.

  2. Integrating the developed system with information extraction tools based on natural language processing to improve the precision of data records extraction and data items alignment. This could overcome the shortcoming that the extraction method cannot understand the information contained in data items, which is why irrelevant data records are extracted.

8. REFERENCES

  [1] Benchalli, S.S., Hiremath, P.S., Algur, S.P., and Udapudi, V.R., "Mining Data Regions from Web Pages", 2005.

  [2] Breuel, T.M., "Information Extraction from HTML Documents by Structural Matching", Proceedings of the 2nd International Workshop on Web Document Analysis (WDA2003), PARC, Inc., Palo Alto, CA, USA, 2003, [online]: http://www.csc.liv.ac.uk/~wda2003/Papers/Section_I/Paper_3.pdf, access date 12 Feb 2010.

  [3] Yeonjung, K., Jeahyun, P., Taehwan, K., and Joongmin, C., "Web Information Extraction by HTML Tree Edit Distance Matching", Proceedings of the International Conference on Convergence Information Technology (ICCIT.2007), Washington, DC, US, 2007, [online]: http://dx.doi.org/10.1109/ICCIT.2007.398, access date 13 Feb 2010.

  [4] Zhai, Yanhong and Liu, Bing, "Structured Data Extraction from the Web Based on Partial Tree Alignment", IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 12, pp. 1614-1628, IEEE Educational Activities Department, Piscataway, NJ, USA, 2006, [online]: http://dx.doi.org/10.1109/TKDE.2006.197, access date 14 Feb 2010.