2. Retrieve and process the corresponding page
3. Discover new URLs
4. Repeat on each found URL

As shown above, the key steps are steps 1, 2 and 3; step 4 simply repeats the previous steps, in particular steps 2 and 3. To carry out step 4, a user agent can use one of three graph-browsing methods (Senellart, 2009):
1. Depth First
   The user agent processes the first URL found before searching for other URLs on the page.
2. Breadth First
   The user agent finds all URLs on the page before processing the first URL found.
3. Combination of Depth and Breadth First
   Breadth first with a limited depth on each discovered website.

This research uses the third method, the combination of depth and breadth first, to collect mobile phone information from websites so that it can be compared with the mobile phone criteria given by the user of the recommender system.

2.3 Regular Expression
Section 2.2 described the four steps of web crawling and noted that step 4 simply repeats steps 2 and 3.
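As a recap, the crawling loop with a breadth-first strategy and a limited depth can be sketched as follows. This is only a minimal illustration, not the system's actual crawler: the class name CrawlSketch, the in-memory LINKS map (a made-up link graph that stands in for real page downloads) and the maxDepth parameter are all assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {

    // Hypothetical link graph: each URL maps to the URLs found on its page.
    // A real crawler would download the page and extract the links instead.
    static final Map<String, List<String>> LINKS = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"),
            "C", List.of("D", "A"),
            "D", List.of("E"));

    // Breadth-first crawl that does not go deeper than maxDepth levels below the start URL.
    static List<String> crawl(String start, int maxDepth) {
        List<String> processed = new ArrayList<>();
        Set<String> seen = new HashSet<>(List.of(start));
        List<String> level = List.of(start);
        for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String url : level) {
                processed.add(url);                        // step 2: retrieve and process the page
                for (String found : LINKS.getOrDefault(url, List.of())) {
                    if (seen.add(found)) {                 // step 3: keep only URLs not seen before
                        next.add(found);
                    }
                }
            }
            level = next;                                  // step 4: repeat on the URLs just found
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(crawl("A", 1)); // prints [A, B, C]
    }
}
```

All URLs on one level are processed before any URL on the next level, and the depth limit keeps the crawl from wandering arbitrarily far from the starting page.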
In step 2, the user agent retrieves and processes the web page in order to discover new URLs (step 3). New URLs can be found in several places on an HTML page (Senellart, 2009):
1. Hyperlink
   Example: <a href="…"></a>
2. Media
   Example:
   <img src="…">
   <embed src="…">
   <object data="…">
3. Frame
   Example:
   <frame src="…">
   <iframe src="…">
4. JavaScript link
   Example: window.open("…")
5. Referrer URLs
6. Sitemaps
7. Etc.
This research needs sources 1 and 2, hyperlinks and media (images in particular), to collect data for the recommender system. To locate these resources in an HTML page, the system applies regular expressions.
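A minimal sketch of how hyperlink and image URLs could be pulled out of an HTML string with java.util.regex is shown below. The class name LinkExtractor, the two patterns and the sample HTML are assumptions made for this illustration; the patterns only handle double-quoted attribute values and are not a full HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // Captures the double-quoted href value of an <a> tag.
    // Assumption: single-quoted or unquoted attribute values are ignored.
    static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Captures the double-quoted src value of an <img> tag.
    static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the first capture group of every match of pattern in html.
    static List<String> extract(Pattern pattern, String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up page fragment, used only to exercise the patterns.
        String html = "<p><a href=\"/phones/1.html\">Phone 1</a>"
                + "<img src=\"/img/1.jpg\" alt=\"x\"></p>";
        System.out.println(extract(HREF, html));    // prints [/phones/1.html]
        System.out.println(extract(IMG_SRC, html)); // prints [/img/1.jpg]
    }
}
```

The word boundary \b after the tag name keeps the hyperlink pattern from matching other tags that merely start with the letter a.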
A regular expression (Goyvaerts and Levithan, 2009) is a specific kind of text pattern that can be used with many modern applications and programming languages. Regular expressions are used to search, edit and manipulate text (Vogel, 2007).

Note that the recommender system in this research is built in the Java programming language, so only Java regular expressions are discussed here.

Regular expressions have three basic elements: common matching symbols, metacharacters and quantifiers (Vogel, 2007). The common matching symbols in Java are listed in Table 2.1. Metacharacters are symbols with a predefined meaning that make certain common patterns easier to use (Vogel, 2007); examples in Java are listed in Table 2.2. Quantifiers are symbols that define how often an element can occur (Vogel, 2007); the quantifiers in Java are listed in Table 2.3.
Table 2.1 Regular expression common matching symbols (Vogel, 2007)

Symbol    Description
.         Matches any character
^regex    regex must match at the beginning of the line
regex$    regex must match at the end of the line
Table 2.2 Examples of regular expression metacharacters (Vogel, 2007)

Symbol    Description
\d        Any digit, short for [0-9]
\D        A non-digit, short for [^0-9]
\s        A whitespace character, short for [ \t\n\x0b\r\f]
\S        A non-whitespace character, short for [^\s]
\w        A word character, short for [a-zA-Z_0-9]
\W        A non-word character, short for [^\w]
\S+       One or more non-whitespace characters
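The metacharacters can be tried out with a few standard String methods, as in the sketch below. The class name MetacharacterDemo, its helper methods and the sample strings are made up for this illustration. Note that in Java source code each backslash has to be doubled, so \d is written as "\\d".

```java
class MetacharacterDemo {

    // \d+ : the whole string consists of one or more digits.
    static boolean isDigits(String s) {
        return s.matches("\\d+");
    }

    // \w+ : the whole string consists of letters, digits or underscores.
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    // \s+ : every run of whitespace is replaced by a single space.
    static String normalizeSpaces(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // \W+ : the string is split at every run of non-word characters.
    static String[] words(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(isDigits("2009"));                // prints true
        System.out.println(isWord("phone 1"));               // prints false (contains a space)
        System.out.println(normalizeSpaces("a \t\n b"));     // prints a b
        System.out.println(String.join("|", words("Phone 1, 84 g"))); // prints Phone|1|84|g
    }
}
```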
Table 2.3 Regular expression quantifiers (Vogel, 2007)

Symbol       Description
*            Occurs zero or more times, short for {0,}
+            Occurs one or more times, short for {1,}
?            Occurs zero or one time, short for {0,1}
{X}          Occurs exactly X times; {} specifies the number of occurrences of the preceding element
{X,Y}        Occurs between X and Y times
*?           ? after a quantifier makes it a reluctant quantifier; it tries to find the smallest match
[abc]        Set definition; matches the letter a, b or c
[abc][vz]    Set definition; matches a, b or c followed by either v or z
[^abc]       When ^ appears as the first character inside [], it negates the pattern; matches any character except a, b or c
[a-d1-7]     Ranges; matches one letter between a and d or one digit from 1 to 7, but does not match the string d1
X|Z          Finds X or Z
XZ           Finds X directly followed by Z
$            Checks if a line end follows
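The difference between greedy and reluctant quantifiers, and the behaviour of sets and ranges, can be seen in a short sketch. The class name QuantifierDemo, the helper firstMatch and the sample strings are assumptions introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // Returns the first substring of input that matches regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>one</b><b>two</b>";

        // Greedy * takes as much as possible, so the match runs to the last </b>.
        System.out.println(firstMatch("<b>.*</b>", html));   // prints <b>one</b><b>two</b>

        // Reluctant *? takes as little as possible, so the match stops at the first </b>.
        System.out.println(firstMatch("<b>.*?</b>", html));  // prints <b>one</b>

        // {X} and {X,Y} bound the number of repetitions.
        System.out.println(firstMatch("\\d{3}", "model 12345"));    // prints 123
        System.out.println(firstMatch("\\d{2,4}", "model 12345"));  // prints 1234

        // A set or range matches exactly one character.
        System.out.println("d".matches("[a-d1-7]"));   // prints true
        System.out.println("d1".matches("[a-d1-7]"));  // prints false (two characters)

        // A negated set matches any character not listed.
        System.out.println(firstMatch("[^abc]", "abcd"));  // prints d
    }
}
```

When extracting a single tag or attribute from HTML, the reluctant form is usually the safer choice, because the greedy form can swallow several elements at once.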
2.5 Extended Weighted Tree Similarity