
2. Retrieve and process the corresponding page
3. Discover new URLs
4. Repeat on each found URL

As can be seen above, the key steps are steps 1, 2 and 3. Step 4 just repeats the previous steps, especially steps 2 and 3. To execute the fourth step, there are three graph-browsing methods that can be used by user agents (Senellart, 2009):

1. Depth First: the user agent processes the first URL found before searching for other URLs on the page.
2. Breadth First: the user agent searches for all URLs on the page before processing the first URL found.
3. Combination of Depth and Breadth First: breadth first with a limited depth on each discovered website.

This research will use the third method, the combination of Depth and Breadth First, to collect mobile phone information from websites so that it can be compared with the mobile phone criteria given by the user of the recommender system.

2.3 Regular Expression

Section 2.2 explained the four steps of web crawling, and that step 4 just repeats steps 2 and 3. In step 2, the user agent retrieves and processes the web page in order to discover new URLs (step 3). Some sources of new URLs that can be found on an HTML page are (Senellart, 2009):

1. Hyperlink
   Example: <a href="…"></a>
2. Media
   Example: <img src="…">, <embed src="…">, <object data="…">
3. Frame
   Example: <frame src="…">, <iframe src="…">
4. JavaScript link
   Example: window.open("…")
5. Referrer URLs
6. Sitemaps
7. Etc.
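The traversal chosen in Section 2.2, breadth first with a limited depth, can be illustrated with a small Java sketch. It is not the crawler built in this research: the class name DepthLimitedCrawler, the method crawl and the in-memory link map are hypothetical stand-ins, and the map replaces real page retrieval (step 2) and URL discovery (step 3) so that the example runs without network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DepthLimitedCrawler {

    // Visits pages level by level (breadth first) and stops after maxDepth levels.
    // 'links' is a hypothetical map from a URL to the URLs found on that page.
    static List<String> crawl(String seed, Map<String, List<String>> links, int maxDepth) {
        List<String> processed = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        List<String> level = new ArrayList<>();
        level.add(seed);
        seen.add(seed);

        for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String url : level) {
                processed.add(url); // step 2: retrieve and process the page
                // step 3: discover new URLs, skipping those already seen
                for (String found : links.getOrDefault(url, Collections.<String>emptyList())) {
                    if (seen.add(found)) {
                        next.add(found);
                    }
                }
            }
            level = next; // step 4: repeat on the URLs that were found
        }
        return processed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("A", "D"));
        links.put("D", Arrays.asList("E"));

        // E is discovered at depth 3, beyond the limit, so it is never processed.
        System.out.println(crawl("A", links, 2)); // prints [A, B, C, D]
    }
}
```

Processing one whole level before moving to the next gives the breadth-first order, and the depth counter keeps the crawl from wandering arbitrarily far from the starting page.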

This research needs to find sources 1 and 2, hyperlinks and media (especially images), to collect data for the recommender system. To find these specific resources on an HTML page, the system needs to apply regular expressions.

A regular expression (Goyvaerts and Levithan, 2009) is a specific kind of text pattern that can be used with many modern applications and programming languages. Regular expressions are used to search, edit and manipulate text (Vogel, 2007). Note that the recommender system in this research will be built using the Java programming language, so only regular expressions in Java are discussed here.

A regular expression has three basic elements: common matching symbols, metacharacters and quantifiers (Vogel, 2007). The common matching symbols in Java can be seen in Table 2.1. Metacharacters are symbols with a predefined meaning that make certain common patterns easier to use (Vogel, 2007). Examples of metacharacters in Java can be seen in Table 2.2. Quantifiers are symbols that define how often an element can occur (Vogel, 2007). The quantifiers in Java can be seen in Table 2.3.

Table 2.1 Regular expression common matching symbols (Vogel, 2007)

Symbol      Description
.           Matches any character
^regex      regex must match at the beginning of the line
regex$      regex must match at the end of the line
[abc]       Set definition, can match the letter a or b or c
[abc][vz]   Set definition, can match a or b or c followed by either v or z
[^abc]      When ^ appears as the first character inside [], it negates the pattern. This can match any character except a or b or c
[a-d1-7]    Ranges, letter between a and d and figures from 1 to 7, will not match d1
X|Z         Finds X or Z
XZ          Finds X directly followed by Z
$           Checks if a line end follows

Table 2.2 Examples of regular expression metacharacters (Vogel, 2007)

Symbol      Description
\d          Any digit, short for [0-9]
\D          A non-digit, short for [^0-9]
\s          A whitespace character, short for [ \t\n\x0b\r\f]
\S          A non-whitespace character, short for [^\s]
\w          A word character, short for [a-zA-Z_0-9]
\W          A non-word character, short for [^\w]
\S+         Several non-whitespace characters

Table 2.3 Regular expression quantifiers (Vogel, 2007)

Symbol      Description
*           Occurs zero or more times, is short for {0,}
+           Occurs one or more times, is short for {1,}
?           Occurs zero or one time, is short for {0,1}
{X}         Occurs X number of times; {} gives the repetition count of the preceding element
{X,Y}       Occurs between X and Y times
*?          ? after a quantifier makes it a reluctant quantifier, which tries to find the smallest match
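As an illustration of how these symbols combine, the following Java sketch extracts hyperlink and image URLs (sources 1 and 2) from an HTML string using java.util.regex. The class name LinkExtractor, the two patterns and the sample HTML are assumptions made for this example, not the patterns used by the recommender system. The patterns only handle double-quoted attribute values, and a regular expression is not a full HTML parser, so unusual markup may be missed or matched incorrectly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // <a ... href="URL": \s+ is one or more whitespace characters,
    // [^>]*? reluctantly skips other attributes, group 1 captures the URL.
    static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // <img ... src="URL": same structure, for image sources.
    static final Pattern IMG_SRC = Pattern.compile(
            "<img\\s+[^>]*?src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every URL captured by group 1 of the given pattern.
    static List<String> extract(Pattern pattern, String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment used only for this demonstration.
        String html = "<a href=\"/phones/x1.html\">X1</a>"
                + "<img src=\"/img/x1.png\" alt=\"X1\">"
                + "<a class=\"nav\" href = \"/phones/x2.html\">X2</a>";

        System.out.println(extract(HREF, html));    // prints [/phones/x1.html, /phones/x2.html]
        System.out.println(extract(IMG_SRC, html)); // prints [/img/x1.png]
    }
}
```

The patterns use elements from all three tables: the negated sets [^>] and [^"] from Table 2.1, the metacharacter \s from Table 2.2, and the quantifiers +, * and the reluctant *? from Table 2.3.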

2.5 Extended Weighted Tree Similarity
