6
Chapter 2 Literature Review
2.1 Previous Research
Recommender systems have changed the way of people in shop online Millner et al, 2003. The result of that study
conclude that recommender system have potential to provide value to their users. Recommender systems are providing value to users
in many different content and commerce environment Millner et al, 2004.
In the previous research about extended weighted tree similarity algorithm Sarno et al, 2003, it’s concluded that the
implementation of it makes the system select more preferred agent buyer or seller agent among agents having the same similarity
value.
Application of Weighted Tree Similarity for E-Business environment on another previous research concluded that this
algorithm can be parameterized by different functions to adjust the similarity of the sub trees. The algorithm can also be applied
in other environments wherein weighted trees are used, and implemented in many programming languages Bhavsar et al,
2005.
The application of Extended Weighted Tree Similarity in Job Searching had not too wide result like OR method and also
not too strict result as AND method Yulianti, 2010. This research suggested that the application will be better if it became
on line application.
2.2 Web Crawling
Web crawling, or Web Spider or Web Robots is an autonomous user agent that retrieve page from the web Senellart,
2009. To crawl the web, the user agent will follow several step, they are:
1. Start from a given or set URL
7
2. Retrieve and process the corresponding page 3. Discover new URLs
4. Repeat on each found URL
As can be seen above, the key steps are step 1, 2 and 3. Step 4 just repeats the previous steps, especially step 2 and 3. To
execute the fourth step, there are three Graph Browser methods that can be used by the user agents Senellart, 2009.
1. Depth First The User Agent will process the first found URL
before search other URLs on the page 2. Breath First
The User agent will search all URLs on the page before process the first found URLs
3. Combination of Depth and Breath First Bread first with limited depth on each discovered
website This Research will use the third method, Combination of
Deep and Breath First to collect mobile phone’s information from the website to be compared with mobile phone’s criteria from the
user of the recommender system. 2.3 Regular Expression
Chapter 2.2 has explained about four steps of web crawling. It also explained that step four just repeat steps 2 and 3.
In step 2, the user agent will retrieve and process the web page to discover new URLs step 3. Some sources of new URLs that can
be found on HTML page Senellart, 2009:
1. Hyperlink Example: a href = “…”a
2. Media
Example:
img src = “ …” embed src = “…”
object data =”…” 3. Frame
Example:
frame src = ”…”
8
iframe src = “…” 4. JavaScript link
Example: window.open“…” 5. Referrer URLs
6. Sitemaps 7. Etc
This research need to find source 1 and 2, hyperlink and
media, especially image to collect data for the recommender system. And to find out the specific resource that mentioned
above on HTML page, the system need to apply regular expression.
Regular expression Goyvaerts and Levithan, 2009 is specific kind of text pattern that can be used with many modern
application and programming language. Regular expression is used to search, edit and manipulate text Vogel, 2007.
As a note, the recommender system in this research will be build use Java programming language, so regular expression
that is discussed here just regular expression in Java. Regular expression has three basic elements, common
matching symbols, metacharacters and quantifier Vogel, 2007. Regular expression’s common matching symbol in java can be
seen at Table 2.1. Metacharacters are symbols that have meaning that already defined and make certain common pattern easy to use
Vogel, 2007. The list of example of regular expression’s metacharacter in Java can be seen at Table 2.2. Quantifiers are
symbols that define how often an element can occur Vogel, 2007. The list of regular expression’s quantifier in Java can be
seen at Table 2.3.
Table 2.1 Regular expression common matching symbol
Vogel, 2007
Symbol Description
.
Matches any sign
regex
regex must match at the beginning of the line
regex
Finds regex must match at the end of the line
9
Table 2.2 Regular Expression’s metacharacter’s example
Vogel, 2007
Symbol Description
\d Any digit, short for [0-9]
\D A non-digit, short for [0-9]
\s A whitespace character, short for [ \t\n\x0b\r\f]
\S A non-whitespace character, for short for [\s]
\w A word character, short for [a-zA-Z_0-9]
\W A non-word character [\w]
\S+ Several non-whitespace characters
Table 2.3 Regular expression’s quantifier Vogel, 2007
Symbol Description
Occurs zero or more times, is short for {0,} +
Occurs one or more times, is short for {1,} ?
Occurs no or one times, ? is short for {0,1} {X}
Occurs X number of times, {} describes the order of the preceding liberal
{X,Y} .Occurs betw een X and Y times,
? ? aft er a qualifier makes it a reluctant quantifier, it tries to
find the smallest match. [abc]
Set definition, can match the letter a or b or c
[abc][vz]
Set definition, can match a or b or c followed by either v or z
[ abc]
When a appears as the first character inside [] when it negates the pattern. This can match any character
except a or b or c
[a-d1-7]
Ranges, letter between a and d and figures from 1 to 7, will not match d1
X| Z
Finds X or Z
XZ
Finds X directly followed by Z Checks if a line end follows
10
2.5 Extended Weighted Tree Similarity
Weighted Tree Similarity is a tree similarity for match- making of agents Bhavsar et al, 2005. The agents are Buyer
agent, that deal with information that the buyer want to buy and Seller agent, that deal with information that the seller want to sell
Yang et al, 2000. The tree that is used in this research is XML tree base on weighted extension of Object-Oriented RuleML
Boley, 2003. The agents are represented by the XML tree that consists of Node-Labeled, Arc-Labeled, and Arc-Weighted
Bhavsar et al, 2005. Node-Labeled is data structure for information representation in various areas. Arc-Labeled is
representation of attributes of product. Arc-Weighted is representation of the relative importance of the product’s
attributes. Criteria, specification and detail are Node-Labeled .Label is Arc-Labeled and weight is Arc-Weighted. As mentioned
in chapter 1, the criteria of the mobile phone that will be counted in this research are price, vendor, and feature.
Formula 2.1 is show the how to decide weight of child of XML tree Sarno, 2003. The formula to count similarity using
Extended Weighted Tree Similarity is shown in Formula 2.2 Sarno, 2003:
Formula 2.1
Formula to decide weight in Extended Weighted Tree Similarity
Explanation: W
: Weighted N
: count of the item Freq
: Frequency
Formula 2.2 Formula to count similarity using Extended Weighted
Tree Similarity
S = ∑Wi Wj∑Wii Wjj…
W = N freq
11
Explanation: S
: Similarity Wi
: Weighted of Parent tree of Input Wj
: Weighted of Parent tree from website Wii
: Weighted of child tree of Input Wjj
: Weighted of child tree from website Figure 2.1 is example of XML tree that represent mobile
phone’s information from site. Formula 2.1 is used in the XML tree to count the weight of every specification.
Figure 2.1 XML tree from website
Figure 2.2 is example of the XML tree that represent mobile phone’s information from user’s input. The weight in the
XML tree is decided by the user that represents his concern about the specification of the mobile phone.
12
Figure 2.2 XML tree from user’s input
Formula 2.2 will be used by the recommender system to count the similarity. The following lines is shown the example
how the formula count similarity between XML trees in figure 2.1 and figure 2.2
S = 0.330.2 0.5 0.4 + 0.5 0.6 0.33 0.7 1.0 0.8 + 0.33 0.1 0.5 0.7 +
0.5 0.3 S = 0.66 0.2 + 0.3 + 0.231 0.8 + 0.033
0.35 + 0.15 S = 0.33 + 0.1848 + 0.0165
S = 0.5313 As can be seen in the previous lines, the similarity between
XML trees in figure 2.1 and figure 2.2 is 0, 5313. The implementation of the formula in the code will be discussed in
chapter 4. 2.2
Recommender System
Recommender system is system that has ability to tailor its output to a particular user implies that it must be able to infer
what the user requires based on previous or current interaction with user or other similar user Mobasher, 2007.
Recommender system is classified based on 2 point of view, they are architectural point of view and algorithm point of
view Mobasher, 2007.
13
From architectural point of view, recommender system has two different generation approaches that classify it to 2 main
categories Mobasher, 2007: 1. Memory based
Memory based systems simply memorize all the data and generalize from it at the time of generating
recommendations. They
are therefore
more susceptible to scalability issues.
2. Model based Model based approached is that perform the
computationally expensive learning phase offline. In other hand, from algorithm point of view,
recommender system is classified to 2 different categories, they are Mobasher, 2007:
1. Knowledge based Knowledge based recommenders rely either on
explicit domain knowledge about the items or knowledge about the users Burke, 2000.
2. Content Filtering In Content-based filtering systems, a user profile
captures the content descriptions of items in which that user has previously expressed interest
3. Collaborative Filtering This system is collaboration of Knowledge Based and
Content filtering system Herlocker et al, 1999. From architectural point of view, recommender system
that is applied in this research is model based because that computing proses not consider any data from the database just
from the input from user. And from algorithm point of view, Knowledge based is the categories of recommender system that is
applied in this system because input from user will be matched with knowledge about the mobile phone through Extended
Weighted Tree Similarity Algorithm.
2.3 Software Analysis