New Method for Extracting Keyword for the Social Actor

  


  Results of plagiarism analysis from 2017-12-28 15:20 UTC New Method for Extracting Keyword for the Social Actor.pdf Date: 2017-12-28 14:54 UTC




  

Mahyuddin K. M. Nasution

  

Information Technology Department

Fakultas Ilmu Komputer dan Teknologi Informasi (Fasilkom-TI)

and

Centre of Information System

  

Universitas Sumatera Utara, Medan 20155 USU, Sumatera Utara, Indonesia

mahyunst@yahoo.com, mahyuddin@usu.ac.id

Abstract. In this paper we study the relationship between query and search engine by exploring some properties and applying their relations to extract keywords for any social actor by proposing a new method. The proposed approach is based on considering the results of the search engine in the singleton and the doubleton. We develop a novel method for extracting keywords automatically from the Web with the mirror shade concept (M2M). Results show the potential of the proposed approach: in the experiment, the performance (recall and precision) of a keyword depends on both weights (singleton and tfidf) and the distance between them.

Keywords: singleton, doubleton, search engine, query, information retrieval.

1 Introduction

In a search space with a large repository such as the Web, it is difficult to obtain accurate information about any social actor or social agent, that is, an entity endowed with human agency: an individual who attempts to get to grips with the challenge of changing the world around the agent into a better one. In this case, major obstacles often accompany the capabilities of search engines, such as ambiguity [2] and bias [5]. Therefore, corresponding keywords are always necessary to pry information out of the heaps of data or documents on the Web. In this paper we propose a new method for automatically generating and selecting the keyword for someone as a social actor on the Web, based on the principles of information retrieval (IR) and a model of search engine.

We define some terminologies and the properties of a model of search engine [13,14,15,19].

T1 A term $t_x$ consists of at least one or a set of words in a pattern, or $t_x = (w_1 w_2 \ldots w_l)$, $l \le k$, where $k$ is a number of parameters representing word $w$, $l$ is the number of tokens (vocabularies) in $t_x$, and $|t_x| = k$ is the size of $t_x$.

T2 Let a set of web pages indexed by search engine be $\Omega$, i.e., a set that contains ordered pairs of the terms $t_{x_i}$ and the web pages $\omega_{x_j}$, or $(t_{x_i}, \omega_{x_j})$, $i = 1, \ldots, I$, $j = 1, \ldots, J$. The relation table that consists of two columns $t_x$ and $\omega_x$ is a representation of $(t_{x_i}, \omega_{x_j})$, where $\Omega_x = \{(t_x, \omega_x)_{ij}\} \subset \Omega$ or $\Omega_x = \{\omega_{x_1}, \ldots, \omega_{x_j}\}$. The cardinality of $\Omega$ is denoted by $|\Omega|$.

T3 Let $t_x$ be a search term, and $t_x \in \Sigma$, where $\Sigma$ is a set of singleton search terms of the search engine. A vector space $\Omega_x \subseteq \Omega$ is a singleton search engine event (singleton space of event) of web pages that contain an occurrence of $t_x \in \omega_x$. The cardinality of $\Omega_x$ is denoted by $|\Omega_x|$.

T4 Let $t_x$ and $t_y$ be two different search terms, $t_x \neq t_y$, $t_x, t_y \in \Sigma$, where $\Sigma$ is a set of singleton search terms of the search engine. A doubleton search term is $D = \{\{t_x, t_y\} : t_x, t_y \in \Sigma\}$ and its vector space, denoted by $\Omega_x \cap \Omega_y$, is a doubleton search engine event (doubleton space of event) of web pages that contain a co-occurrence of $t_x$ and $t_y$ such that $t_x, t_y \in \omega_x$ and $t_x, t_y \in \omega_y$, where $\Omega_x, \Omega_y, \Omega_x \cap \Omega_y \subseteq \Omega$.

For $t_x$ and $t_y$ the search terms with conditions $t_x \neq t_y$, $t_x \cap t_y \neq \emptyset$ and $|t_y| < |t_x|$, we have $\forall w_y \in t_y$, $\exists w_x \in t_x$ with $w_x = w_y$, $t_x \in \omega_x$ and $t_y \in \omega_y$, such that

$$t_x \cap t_y = t_y \quad \text{and} \quad t_x \cup t_y = t_x \tag{1}$$

and

$$\omega_x \cap \omega_y = \omega_y \quad \text{and} \quad \omega_x \cup \omega_y = \omega_x. \tag{2}$$

Similarly, let $t_x$ and $t_y$ be two search terms of different queries; we have $\Omega_x \cap \Omega_y = \emptyset$ and clearly $|\Omega_x \cap \Omega_y| = 0$, then

$$|\Omega_x \cup \Omega_y| = |\Omega_x| + |\Omega_y|. \tag{3}$$

Let $\Omega_x = \{(t_x, \omega_x)\}$. Based on the meaning of Eq. (1) and Eq. (2), we have $\Omega_x \cup \Omega_y = \{(t_x, \omega_x)\} \cup \{(t_y, \omega_y)\} = \{(t_x \cup t_y, \omega_x \cup \omega_y)\} = \{(t_x, \omega_x)\} = \Omega_x$. Therefore, based on Eq. (3) we obtain $|\Omega_x \cup \Omega_y| = |\Omega_x| + |\Omega_y|$. For different [...] are as follows.

P1 Let $t_x$ and $t_y$ be search terms. If $t_x \neq t_y$, $t_x \cap t_y \neq \emptyset$ and $|t_y| < |t_x|$, then the singleton search engine event of $t_x$ and $t_y$ is $\Omega_x = \Omega_x \cup \Omega_y$, or $|\Omega_x \cup \Omega_y| = |\Omega_x|$.

[...] $\Omega_y|$, (4)
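As a rough illustration of the singleton and doubleton events in T3 and T4, the sketch below models an indexed collection as a small dictionary of pages and computes the two events as plain Python sets. The page texts, the function names `singleton` and `doubleton`, and the restriction to single-word terms are assumptions made for this example; a real search engine exposes only hit counts, not the page sets themselves.

```python
# Toy index: page id -> page text (hypothetical data, not from the paper).
pages = {
    1: "mahyuddin nasution universitas sumatera utara",
    2: "nasution keyword extraction",
    3: "social network extraction from web",
    4: "keyword extraction for social actor nasution",
}

def singleton(term: str) -> set[int]:
    """Omega_x: ids of pages that contain an occurrence of a one-word term."""
    return {pid for pid, text in pages.items() if term in text.split()}

def doubleton(term_x: str, term_y: str) -> set[int]:
    """Omega_x ∩ Omega_y: ids of pages where both terms co-occur."""
    return singleton(term_x) & singleton(term_y)

omega_x = singleton("nasution")              # singleton event of t_x
omega_y = singleton("keyword")               # singleton event of t_y
omega_xy = doubleton("nasution", "keyword")  # doubleton event

print(len(omega_x), len(omega_y), len(omega_xy))
```

The cardinalities printed at the end play the role of the hit counts $|\Omega_x|$, $|\Omega_y|$ and $|\Omega_x \cap \Omega_y|$ used throughout the text.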

P3 Let $t_y$ and $t_z$ be search terms. If $t_y \neq t_z$, $t_y \cap t_z = \emptyset$, and $\omega_y \cap \omega_z \neq \emptyset$, then $|\Omega_y \cap \Omega_z| = |\{(t_y, \omega_y)\} \cap \{(t_z, \omega_z)\}| = |\{(t_y \cap t_z, \omega_y \cap \omega_z)\}| \le |\Omega_z|$, $\Omega_y, \Omega_z \subseteq \Omega$.

Based on P1, $|\Omega_x \cap \Omega_y| = |\{(t_x, \omega_x)\} \cap \{(t_y, \omega_y)\}| = |\{(t_x \cap t_y, \omega_x \cap \omega_y)\}| = |\{(t_y, \omega_y)\}| = |\Omega_y|$, or

$$|\Omega_x \cap \Omega_y| = |\Omega_y|. \tag{5}$$

Because $|\Omega_y| \le |\Omega_x|$, we have $|\Omega_x \cap \Omega_y| \le |\Omega_x|$. However, by P2, $|\Omega_x \cap \Omega_y| = |\{(t_x, \omega_x)\} \cap \{(t_y, \omega_y)\}| = |\{(t_x \cap t_y, \omega_x \cap \omega_y)\}| = |\emptyset| = 0$. This means that

$$|\Omega_x \cap \Omega_y| \le |\Omega_x| \;\wedge\; |\Omega_x \cap \Omega_y| \le |\Omega_y|. \tag{6}$$

Based on P3, $|\Omega_x \cap \Omega_y| = |\{(t_x, \omega_x)\} \cap \{(t_y, \omega_y)\}| = |\{(t_x \cap t_y, \omega_x \cap \omega_y)\}| = |\{(t_x, \omega_x)\}| = |\Omega_x|$, or

$$|\Omega_x \cap \Omega_y| = |\Omega_x|. \tag{7}$$

Therefore, Eqs. (5), (6) and (7) clearly give $|\Omega_x \cap \Omega_y| \le |\Omega_x| \le |\Omega|$ or $|\Omega_x \cap \Omega_y| \le |\Omega_y| \le |\Omega|$. This proves the following theorem.

Theorem 1. Let $t_x$ and $t_y$ be search terms. If $t_x \neq t_y$, but $\{(t_x, \omega_x)\} \cap \{(t_y, \omega_y)\} \neq \emptyset$, then a doubleton search engine event of $t_x$ and $t_y$ is $\Omega_x \cap \Omega_y$, with $\Omega_x \cap \Omega_y \subseteq \Omega$, $|\Omega_x \cap \Omega_y| \le |\Omega_x| \le |\Omega|$ and $|\Omega_x \cap \Omega_y| \le |\Omega_y| \le |\Omega|$.

Otherwise, let $t_x$ and $t_y$ be any search terms; we can derive a formula by starting from Eq. (7), based on Eq. (5) and then P1, i.e.,

$$|\Omega_x \cup \Omega_y| = |\Omega_x| + |\Omega_y| - |\Omega_x \cap \Omega_y|, \tag{8}$$

and we know that $|\Omega_x| = |\Omega_x \cap \Omega_x|$ and $|\Omega_y| = |\Omega_y \cap \Omega_y|$, so Eq. (8) becomes

$$|\Omega_x \cup \Omega_y| = |\Omega_x \cap \Omega_x| + |\Omega_y \cap \Omega_y| - |\Omega_x \cap \Omega_y|. \tag{9}$$
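The bounds stated in Theorem 1 can be checked numerically on any concrete sets of page ids. The sketch below is only an illustration: the function names and the sample sets are assumptions, and the union size is computed with the standard inclusion-exclusion identity for two finite sets.

```python
def satisfies_theorem_1(omega_x: set, omega_y: set, omega: set) -> bool:
    """Check |Ωx ∩ Ωy| <= |Ωx| <= |Ω| and |Ωx ∩ Ωy| <= |Ωy| <= |Ω|."""
    inter = omega_x & omega_y
    return (
        inter <= omega  # the doubleton event is a subset of Ω
        and len(inter) <= len(omega_x) <= len(omega)
        and len(inter) <= len(omega_y) <= len(omega)
    )

def union_size(omega_x: set, omega_y: set) -> int:
    """|Ωx ∪ Ωy| from the two singleton sizes and the doubleton size."""
    return len(omega_x) + len(omega_y) - len(omega_x & omega_y)

# Hypothetical page-id sets inside an index of ten pages.
omega = set(range(10))
omega_x = {1, 2, 4}
omega_y = {2, 4, 7}

print(satisfies_theorem_1(omega_x, omega_y, omega))
print(union_size(omega_x, omega_y), len(omega_x | omega_y))
```

In practice only the three hit counts are observable, so `union_size` is the part that carries over to real search engine results.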

As information about any social actor, the singleton and the doubleton are the basis of some statistical properties of the search engine that relate to the social actor. However, either a singleton or a doubleton depends on formulating a query, i.e., where and how the keyword is there. Some techniques for mining keywords from information sources have been proposed, for example to estimate classifiers for labelling some messages by simple [1] and more sophisticated [3,4] approaches. For feature extraction several methods have been developed [7]. Some of them are the substring search method [20], model and prototype system [8], peer clustering [11], co-occurrence analysis [9], lexical chains [6], PageRank [21], latent semantic analysis [10], etc. In this case, the singleton [16] and the doubleton [17] are the necessary condition for gaining the information of a social actor from the Web, because both singleton and doubleton contain bias and ambiguity, while other purposes require a sufficient condition [18] so that the major obstacles can be reduced or eliminated. Some of the following formulas will be evidence against some of the approaches and [...]

If the singleton is accompanied by a summary of the Web, then the involvement of the singleton and doubleton in the computation generates descriptions (as keyword candidates) of a social actor as follows.

Definition 1. Let $t_a$ be a search term. $S = \{w_1, \ldots, w_{max}\}$ is a Web snippet (briefly, snippet), $S \subset \omega \in \Omega$, where $max \le 50$ words to the left and right of $t_a$ returned by search engines. $L = \{S_i : i = 1, \ldots, I\}$ is a list of snippets.

We construct a relationship of actors-snippets-words based on the frequency of words in Web pages as environments of a social actor as follows.

Definition 2. A relationship between social actors, web snippets and words is defined as the mixture $p(a, S, w) = a \times S \times w$, $a \in A$, $S \in L \subseteq \Omega$, $w \in S$. A vector space of $p(a, S, w)$ is $[\nu_i, \ldots, \nu_j]$, $\nu_i \ge \ldots \ge \nu_j$, where $w_i, \ldots, w_j$ are the unique words in $S$ and $\nu_i, \ldots, \nu_j$ are the weights of the words.

Statistically, the task of the relationship in Definition 2 is simply to gather and record information about words, features, and web pages, where term weights reflect the relative importance of words in web pages. One of the most common types used in older retrieval models is known as tf.idf weighting [12], whereby we can generate the vector $\nu$ for each word/term $w$. This information is then used for recognizing the different social actors in web pages, based on clustering all words by using one of the similarity measures, such as the Jaccard coefficient $jc = |\Omega_a \cap \Omega_b| / (|\Omega_a| + |\Omega_b| - |\Omega_a \cap \Omega_b|)$. For this purpose, we define the undirected graph of words $G = (V, E)$ to describe the relations between words [12].
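The Jaccard coefficient above needs only three hit counts: the two singletons and the doubleton. A minimal sketch follows; the function name `jaccard` and the sample counts are assumptions made for illustration, and a zero denominator is mapped to 0.0 as a convention of this example.

```python
def jaccard(hits_a: int, hits_b: int, hits_ab: int) -> float:
    """jc = |Ωa ∩ Ωb| / (|Ωa| + |Ωb| - |Ωa ∩ Ωb|), computed from hit counts."""
    denominator = hits_a + hits_b - hits_ab
    if denominator <= 0:
        return 0.0  # no page contains either term
    return hits_ab / denominator

# Hypothetical hit counts for two words and their co-occurrence.
print(jaccard(100, 50, 25))
```

Two words that always co-occur get a coefficient of 1, and two words that never co-occur get 0, which is what makes the measure usable as an edge weight in the graph $G$.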

Definition 3. Assume a sub-graph $G'$, $G' \subset G$, is a micro-cluster that satisfies the following conditions:

1. There is a set of words $\mathbf{w} = \{w_x, \ldots, w_y\}$ whose vector space is $[\nu_x, \ldots, \nu_y]$, $\nu_x \ge \ldots \ge \nu_y \ge \alpha$, where $\alpha$ is a threshold.

2. There is a one-one function $f : \mathbf{w} \to V'$ such that $f(w) = v_w$, $\forall w \in \mathbf{w}$ $\exists v_w \in V'$, where $v_w \in V'$ is a vertex in $G'$.

3. There is a one-one function $\rho : \mathbf{w} \times \mathbf{w} \to E'$ such that $\rho(w_x, w_y) = e$, $\forall w_x, w_y \in \mathbf{w}$, where $\rho$ is a relation among words and $e \in E'$ is an edge in $G'$.

The micro-cluster is denoted by $G' = \langle V', E', \mathbf{w}, f, \rho, \alpha \rangle$.

A micro-cluster is a maximal clique sub-graph where a node represents a word that has the highest score in a document. However, the collection of mentioned words does not exactly refer to the same social actors. To group the words into the appropriate cluster, we construct the trees of words. This is based on an assumption that the words that appear in the same domains have the closest [...]

[...] graph of the micro-cluster $G'$, and is denoted by $T = \langle V_T, E_T, \mathbf{w}_T, f, \rho, \alpha \rangle$, where $V_T \subseteq V'$, $E_T \subset E'$, and $\mathbf{w}_T \subseteq \mathbf{w}$.
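A cycle-free structure $T$ of this kind can be obtained, for instance, by keeping the strongest word relations first and discarding any relation that would close a cycle. The sketch below does this with a Kruskal-style maximum spanning forest. The paper does not name a specific algorithm, so the choice of Kruskal, the function name, and the sample weights are all assumptions of this example.

```python
def maximum_spanning_forest(words, edges):
    """Keep the strongest edges that do not create a cycle.

    words: iterable of word labels (vertices of the micro-cluster).
    edges: list of (weight, word_u, word_v) tuples.
    Returns the kept edges, strongest first.
    """
    parent = {w: w for w in words}

    def find(w):
        # Union-find root lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    kept = []
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not close a cycle
            parent[root_u] = root_v
            kept.append((weight, u, v))
    return kept

# Hypothetical relation strengths between three candidate words.
words = ["network", "university", "journal"]
edges = [
    (0.9, "network", "university"),
    (0.5, "university", "journal"),
    (0.2, "network", "journal"),
]
tree = maximum_spanning_forest(words, edges)
print(tree)
```

The weakest relation, which would have closed the triangle, is the one that gets dropped.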

In building the optimal micro-cluster, we save the strongest relations in $T$ between one word and another in $G'$ until $T$ has no cycle. We introduce an intrusive word about the social actor; at least one word of the optimal micro-cluster has the strongest relation with the intrusive word, and an optimal micro-cluster is a group of words that refer to that social actor. However, overlapping keywords also exist in the same list. We define a strategy to select the relevant words among all listed candidates. In this case, there are a few potential words as keyword candidates.

Definition 5. A vector space $ss = [|\Omega_x|, \ldots, |\Omega_y|]$ is a mirror shade of a micro-cluster if there is a one-one function $g : \mathbf{w} \to ss$, where $\Omega_x, \ldots, \Omega_y$ are in the event space. Let $z$ be the greatest value in $ss$; the vector space in the range $[0, 1]$ is relatively defined as $ss_{[0,1]} = [|\Omega_x| / z, \ldots, |\Omega_y| / z] = [\mu_x, \ldots, \mu_y]$.

We can also generate, for example, other vectors from $\Omega$ for words $w_i, \ldots, w_j$ respectively, such that $[\mu_i, \ldots, \mu_j]$ is a mirror shade of $[\nu_i, \ldots, \nu_j]$ from tf.idf.
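The normalisation in Definition 5 amounts to dividing every hit count by the largest one. A minimal sketch, assuming the hit counts are already available as a dictionary; the function name `mirror_shade` and the sample counts are hypothetical.

```python
def mirror_shade(hit_counts: dict[str, int]) -> dict[str, float]:
    """Scale singleton hit counts |Ω_w| into [0, 1] by the greatest count z."""
    if not hit_counts:
        return {}
    z = max(hit_counts.values())
    if z == 0:
        return {word: 0.0 for word in hit_counts}
    return {word: count / z for word, count in hit_counts.items()}

# Hypothetical hit counts for three candidate words.
mu = mirror_shade({"network": 500, "university": 1000, "computer": 250})
print(mu)
```

The word with the most hits always receives $\mu = 1$, so a small $\mu$ marks a word that is rare on the Web as a whole.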

Lemma 1. Let $ss_T \subseteq ss$, then $ss_T$ is the mirror shade of an optimal micro-cluster $T$.

Proof. Let $ss_T \subseteq ss$. Based on Definition 5 we have $\mathbf{w}_T \subseteq \mathbf{w}$, i.e., $g(\mathbf{w}_T) = ss_T$, or, because $g$ is a one-one function, $g^{-1}(ss_T) = \mathbf{w}_T \subseteq \mathbf{w}$. Next, by applying Definition 3, $f(\mathbf{w}_T) = V_T$, or, because $f$ is a one-one function, $f^{-1}(V_T) = \mathbf{w}_T \subseteq \mathbf{w}$, and $ss_T = g(\mathbf{w}_T) = g(f^{-1}(V_T)) = g f^{-1}(V_T)$, and we obtain $\rho(\mathbf{w}_T, \mathbf{w}_T) = \rho(f^{-1}(V_T), f^{-1}(V_T)) \subseteq E$, so $\rho(ss_T \times ss_T) = \rho(g(\mathbf{w}_T) \times g(\mathbf{w}_T)) = \rho(g(f^{-1}(V_T)) \times g(f^{-1}(V_T))) = \rho(g f^{-1}(V_T) \times g f^{-1}(V_T)) = g f^{-1} \rho(V_T \times V_T)$. Because $g f^{-1}$ is also a one-one function, this means that $V_T \subseteq V$ has $ss_T$ as a mirror shade of $\mathbf{w}_T$.

Lemma 2. Let $t_a$, $t_x$, $t_y$ be search terms, and $\Omega_a, \Omega_x, \Omega_y \subseteq \Omega$ the singletons of them. If $\nu_x \ge \nu_y$, then $|\Omega_x \cap \Omega_a| \ge |\Omega_y \cap \Omega_a|$.

Proof. By T2 and T3, $\Omega_a = \{(t_a, \omega_a)\}$, $\Omega_x = \{(t_x, \omega_x)\}$ and $\Omega_y = \{(t_y, \omega_y)\}$. Let $L_a$ be a list of snippets $S_a$ for query $t_a$, or $L_a = \{(t_a, S_a)\}$, and by Definition 1 also $L_a = \{(t_x, S_a)\}$ or $L_a = \{(t_y, S_a)\}$. It applies that $\{(t_x, S_a)\} = \{(t_x, S_a)\} \cap \{(t_a, S_a)\} = \{(t_x, \omega_a)\} \cap \{(t_a, \omega_a)\} = \{(t_x, \omega_x)\} \cap \{(t_a, \omega_a)\} = \Omega_x \cap \Omega_a$, and similarly $\{(t_y, S_a)\} = \Omega_y \cap \Omega_a$. By Definition 2 we obtain $\nu_x = |\{(t_x, S_a)\}|$ and $\nu_y = |\{(t_y, S_a)\}|$, and $\nu_x \ge \nu_y$ means that $|\{(t_x, S_a)\}| \ge |\{(t_y, S_a)\}|$ and $|\Omega_x \cap \Omega_a| \ge |\Omega_y \cap \Omega_a|$.

Lemma 2 states that the words that appear frequently in certain snippets but rarely in the remaining snippets are the words strongly associated with only one of the social actors. If $\nu_x \ge \ldots \ge \nu_y$ then $|\Omega_x \cap \Omega_a| \ge \ldots \ge |\Omega_y \cap \Omega_a|$. Otherwise, if the words do not appear frequently in Web pages except in those about only one social actor, then the words are strongly associated with that social actor. The last [...]

Lemma 3. Let $t_a$, $t_x$, $t_y$ be search terms, and $\Omega_a, \Omega_x, \Omega_y \subseteq \Omega$ the singletons of them. Let $|\Omega_x| \ge |\Omega_y|$; if $|\Omega_x \cap \Omega_a| \le |\Omega_y \cap \Omega_a|$, then $|\Omega_a - \Omega_x| \ge |\Omega_a - \Omega_y|$.

  x y x a y a a x a y

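Proof. By the conditions of Theorem 1, we obtain $|\Omega_y| = |\{(t_y, \omega_y)\}| = |\{(t_y \cap t_a, \omega_y \cap \omega_a)\}| = |\{(t_y, \omega_y)\} \cap \{(t_a, \omega_a)\}| = |\Omega_y \cap \Omega_a|$, or $|\Omega_y| = |\Omega_y \cap \Omega_a|$, and similarly we also have $|\Omega_x| = |\Omega_x \cap \Omega_a|$, such that $|\Omega_y| \le |\Omega_x| \Rightarrow |\Omega_y \cap \Omega_a| \le |\Omega_x \cap \Omega_a|$. This contradicts the assumption that $|\Omega_x \cap \Omega_a| \le |\Omega_y \cap \Omega_a|$, but does not contradict $|\Omega_y \cap \Omega_a| \le |\Omega_x|$. Similarly, it is derived from the left side of $|\Omega_y| \le |\Omega_x|$: because $|\Omega_y \cap \Omega_a| \le |\Omega_y|$, then $|\Omega_x \cap \Omega_a| \le |\Omega_x|$. Therefore, based on the condition of P1 and Theorem 1, we obtain

$$|\Omega_x \cap \Omega_a| \le |\Omega_y \cap \Omega_a|$$
$$|(\Omega_x \cap \Omega_a)| - |\Omega_a| \le |(\Omega_y \cap \Omega_a)| - |\Omega_a|$$
$$|(\Omega_x \cap \Omega_a) - \Omega_a| \le |(\Omega_y \cap \Omega_a) - \Omega_a|$$
$$|\Omega_a - (\Omega_x \cap \Omega_a)| \ge |\Omega_a - (\Omega_y \cap \Omega_a)|$$
$$|\neg(\Omega_x \cap \Omega_a)| \ge |\neg(\Omega_y \cap \Omega_a)|$$
$$|(\Omega_x \cup \Omega_a) - (\Omega_x \cap \Omega_a)| \ge |(\Omega_y \cup \Omega_a) - (\Omega_y \cap \Omega_a)|$$

and for $\Omega_a = \Omega_x \cup \Omega_a = \Omega_y \cup \Omega_a$, $\Omega_x = \Omega_x \cap \Omega_a$ and $\Omega_y = \Omega_y \cap \Omega_a$, we obtain

$$|\Omega_a - \Omega_x| \ge |\Omega_a - \Omega_y|.$$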

  a x a y

Lemma 3 explains that the distance between a social actor $t_a$ and the candidate words $t_x$ and $t_y$ can be used to select an appropriate keyword, or if $\mu_x \ge \mu_y$ then $t_y$ is a priority word that is closest to $t_a$. Let $\nu \in [0, 1]$ be a weight of word $w$ and $\mu \in [0, 1]$ be a vector in $ss_T$; there are three conditions of relation between $\nu$ and $\mu$: (1) $\nu = \mu \Leftrightarrow \nu - \mu = 0$, (2) $\nu < \mu \Leftrightarrow \nu - \mu < 0$ (negative), and (3) $\nu > \mu \Leftrightarrow \nu - \mu > 0$ (positive).

Proposition 1. If the interval $[0, 1]$ is divided by a straight line into two areas, $[0, \frac{1}{2})$ and $[\frac{1}{2}, 1]$, then there are six patterns of conditions satisfying the relation between $\nu$ and $\mu$, i.e., (1) $\nu \ge \mu$, $\nu \ge \frac{1}{2}$, $\mu \le \frac{1}{2}$; (2) $\nu \ge \mu$, $\nu \ge \frac{1}{2}$, $\mu \ge \frac{1}{2}$; (3) $\nu \ge \mu$, $\nu \le \frac{1}{2}$, $\mu \le \frac{1}{2}$; (4) $\nu < \mu$, $\nu < \frac{1}{2}$, $\mu > \frac{1}{2}$; (5) $\nu < \mu$, $\nu > \frac{1}{2}$, $\mu > \frac{1}{2}$; and (6) $\nu < \mu$, $\nu < \frac{1}{2}$, $\mu < \frac{1}{2}$.

Proof. Let us summarise the conditions of relation between $\nu$ and $\mu$ into (i) $\nu \ge \mu$ and (ii) $\nu < \mu$. Based on the condition $\nu \ge \mu$ from Lemma 2 and Lemma 3, we can determine the value in {TRUE, FALSE} for the relation patterns between $\nu$ and $\mu$: (1) If $\nu \ge \mu$, then $\nu \ge \frac{1}{2}$ and $\mu \le \frac{1}{2}$ (TRUE); (2) If $\nu \ge \mu$, then $\nu \ge \frac{1}{2}$ and $\mu \ge \frac{1}{2}$ (TRUE); (3) If $\nu \ge \mu$, then $\nu \le \frac{1}{2}$ and $\mu \le \frac{1}{2}$ (TRUE); (4) If $\nu < \mu$, then $\nu < \frac{1}{2}$ and $\mu > \frac{1}{2}$ (TRUE); (5) If $\nu < \mu$, then $\nu > \frac{1}{2}$ and $\mu > \frac{1}{2}$ (TRUE); (6) If $\nu < \mu$, then $\nu < \frac{1}{2}$ and $\mu < \frac{1}{2}$ (TRUE); (7) If $\nu \ge \mu$, then $\nu \le \frac{1}{2}$ and $\mu \ge \frac{1}{2}$ (FALSE); and (8) If $\nu < \mu$, then $\nu > \frac{1}{2}$ and $\mu < \frac{1}{2}$ (FALSE). Thus, there are only six patterns with TRUE value.
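The count of six patterns can be cross-checked by brute force. The sketch below enumerates pairs $(\nu, \mu)$ on a coarse grid and records which combinations of "is $\nu \ge \mu$", "is $\nu$ in the upper half" and "is $\mu$ in the upper half" actually occur. The grid and the encoding as boolean triples are assumptions of this example, and the halves are taken as $[0, \frac{1}{2})$ and $[\frac{1}{2}, 1]$ as in the statement of the proposition.

```python
from itertools import product

# Grid over [0, 1] that includes the boundary value 1/2.
grid = [k / 4 for k in range(5)]  # 0.0, 0.25, 0.5, 0.75, 1.0

# Each observed pattern is (nu >= mu, nu in upper half, mu in upper half).
observed = {
    (nu >= mu, nu >= 0.5, mu >= 0.5)
    for nu, mu in product(grid, grid)
}

all_patterns = set(product([True, False], repeat=3))
impossible = all_patterns - observed

print(len(observed))
print(sorted(impossible))
```

The two combinations that never occur are $\nu \ge \mu$ with $\nu$ in the lower half and $\mu$ in the upper half, and $\nu < \mu$ with $\nu$ in the upper half and $\mu$ in the lower half, which matches the two FALSE cases in the proof.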

We can sort the candidate words by using the six patterns of conditions satisfying Proposition 1, and the word selected as keyword is a candidate word with [...]

[...] candidates, then the suitable keyword is a keyword candidate with the highest value of the vector space of $p(a, S, w)$ and the lowest value of mirror shade, where the distance between the two values is large enough.

Proof. Let $\mathbf{w}_T$ be the keyword candidates in $T$. Each word in $\mathbf{w}_T$ has a value in a vector space of $p(a, S, w)$ and a value in a vector space of mirror shade. Based on Lemma 2, there is a word that has the highest value of $p(a, S, w)$ in $[0, 1]$, while by Lemma 3 the mentioned word has the lowest value of mirror shade in $[0, 1]$, i.e., $\nu > \mu$ in $[0, 1]$. Therefore, only three of the patterns of Proposition 1 apply: (a) $\nu \ge \mu$, $\nu \ge \frac{1}{2}$, $\mu \le \frac{1}{2}$; (b) $\nu \ge \mu$, $\nu \ge \frac{1}{2}$, $\mu \ge \frac{1}{2}$; and (c) $\nu \ge \mu$, $\nu \le \frac{1}{2}$, $\mu \le \frac{1}{2}$.

Let us define a distance between $\nu$ and $\mu$: $\delta = \nu - \mu$. For the first pattern, with max values: $\delta = 1 - \frac{1}{2} = \frac{1}{2}$; with min values: $\delta = \frac{1}{2} - 0 = \frac{1}{2}$; and with max-min values: $\delta = 1 - 0 = 1$. For the second pattern, with max values: $\delta = 1 - 1 = 0$; with min values: $\delta = \frac{1}{2} - \frac{1}{2} = 0$; and with max-min values: $\delta = 1 - \frac{1}{2} = \frac{1}{2}$. For the last pattern, with max values: $\delta = \frac{1}{2} - \frac{1}{2} = 0$; with min values: $\delta = 0 - 0 = 0$; and with max-min values: $\delta = \frac{1}{2} - 0 = \frac{1}{2}$. Thus, one pattern gives the maximum value, i.e., $\delta = 1$, where the max values of $\nu$ and $\mu$ are respectively $1$ and $\frac{1}{2}$, while the min values of them are respectively $\frac{1}{2}$ and $0$. It means that there is a keyword candidate $w$ in $\mathbf{w}_T$ as an optimal keyword, where $\nu$ = max value and $\mu$ = min value, or $\nu - \mu$ = max − min values.

Three values of each word $w \in \mathbf{w}$ determine a relationship between $w$ and any social actor. The last theorem expresses that the suitable keyword will provide a query with enriched information through the semantic relations of their contents, and this gives more effective retrieval of information. The effectiveness of using keywords depends on the query levels, generally based on $\delta_x \ge \ldots \ge \delta_y$ if and only if $t_x$ is the suitable top keyword. The following is an algorithm that uses the micro-cluster and the mirror shade (therefore we call it the MM method (M2M)) for generating keywords.

generate(keyword)

INPUT : A set of social actors

OUTPUT : keyword(s) of each social actor

STEPS :

1. $\mathbf{w} = \{w_1, w_2, \ldots, w_n\}$.

2. $\{\nu_1, \nu_2, \ldots, \nu_n\}$ ← Generate a vector for each $w \in \mathbf{w}$ based on tf.idf.

3. $\{\mu_1, \mu_2, \ldots, \mu_n\}$ ← Generate a vector for each hit count of $w \in \mathbf{w}$ divided by the highest hit count.

4. $G'$ ← Build a micro-cluster using the singleton and doubleton of $\mathbf{w}$.

5. $T$ ← Make the optimal micro-cluster.

6. If $T$ does not consist of trees, then collect and cut nodes with degree deg > 1, separating $T$ into trees.

7. Select a cluster from the trees of $T$ by using a predefined stable attribute.

8. Take the maximum $\delta$ from the candidate keywords in a cluster.
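The final selection step of the algorithm can be sketched as follows, assuming the tf.idf weights $\nu$ and the mirror shade values $\mu$ of the candidate words in the chosen cluster are already available as dictionaries. The function name `select_keyword` and the sample values are hypothetical; the earlier steps (snippet collection, clustering, tree construction) are not reproduced here.

```python
def select_keyword(nu: dict[str, float], mu: dict[str, float]) -> tuple[str, float]:
    """Return the candidate word with the largest distance delta = nu - mu."""
    # Only words that have both a tf.idf weight and a mirror shade value count.
    deltas = {word: nu[word] - mu[word] for word in nu if word in mu}
    if not deltas:
        raise ValueError("no candidate word has both a nu and a mu value")
    best = max(deltas, key=deltas.get)
    return best, deltas[best]

# Hypothetical weights for candidates in one cluster.
nu = {"network": 0.9, "university": 0.8, "journal": 0.6}
mu = {"network": 0.1, "university": 0.7, "journal": 0.5}

keyword, delta = select_keyword(nu, mu)
print(keyword, round(delta, 2))
```

A word that is heavy in the actor's snippets (high $\nu$) but comparatively rare on the Web overall (low $\mu$) wins, which is the behaviour the distance $\delta$ is meant to capture.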

Fig. 1. The optimal micro-cluster

4 Experiment

  

Let us consider the information context of social actors that includes all relevant relationships with their interaction history, where Yahoo! search engines fall short of utilizing any specific information, especially micro-cluster information, and therefore we use full text index search in web snippets. In the experiment, we use a maximum of 500 web snippets for the search term $t_a$ representing an actor, and we consider words where the TF.IDF value > 0.3 × the highest value of TF.IDF, or the maximum number is 30 words [...] "Abdullah Mohd Zin". We test for 143 names, and we obtain 8 (5.59%) actors without a cluster of candidate words, 14 (9.09%) actors [...]. In the case of "Abdullah Mohd Zin" we have trees of words as micro-clusters of words. We can arrange the keyword candidates individually according to their proximity to the stable attribute "academic", i.e., a set of words in SK = {sciences, faculty, associate, econ, prof, environment, career, journal, network, university, report, relationship, context, ...}. SK and the maximum $\delta$ exactly determine that "network" is a keyword for the actor "Abdullah Mohd Zin" as an academic (not a politician). In this case, the first keyword is "computer", while the second keyword is "university", and for the dataset of "Abdullah Mohd Zin" with 143 files we obtain the recall and precision for the keywords as 59.44% & 58.74% and 58.62% & [...]
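For reference, recall and precision for one keyword can be computed from the set of files retrieved with that keyword and the set of files that are actually relevant to the actor. The sketch below uses the standard definitions; the function name and the sample file ids are assumptions, and the paper's own evaluation data is not reproduced.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Standard precision and recall over sets of file identifiers."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical file ids: four retrieved, three relevant, two in common.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
print(precision, recall)
```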

Fig. 2. Recall and precision of the optimal micro-cluster

5 Conclusion and Future Work

Studying the properties of the relation between query and search engine gave semantic meaning to the social actors. One of them is to provide a keyword for any social actor or a clue about the social actor. The mirror-shade approach played a role in selecting the top keyword from the summary of web pages about the actor. Our near future work is to experiment and look into IR performance.

References

1. Abu-Nimeh, S., Nappa, D., Wang, X., Nair, S.: A comparison of machine learning techniques for phishing detection. In: Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, pp. 60–69 (2007)

2. Adriani, M.: Using statistical term similarity for sense disambiguation in cross-language information retrieval. Information Retrieval 2, 69–80 (2000)

3. Bergholz, A., Chang, J.-H., Paass, G., Reichartz, F., Strobel, S.: Improved phishing detection using model-based features. In: Proceedings of Fifth Conference on Email and Anti-Spam (2008)

4. Bergholz, A., Beer, J.D., Glahn, S., Moens, M.-F., Paass, G., Strobel, S.: New filtering approaches for phishing email. Journal of Computer Security (2009)

5. Buckley, C., Dimmick, D., Soboroff, I., Voorhees, E.: Bias and the limits of pooling for large collections. Information Retrieval 10, 491–508 (2007)

6. Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Information Processing and Management 43, 1705–1714 (2007)

7. Fette, I., Sadeh, N., Tomasic, A.: Learning to detect phishing emails. In: ACM Proceedings of the 16th International Conference on World Wide Web, pp. [...]

8. [...] In: Palade, V., Howlett, R.J., Jain, L. (eds.) KES 2003. LNCS, vol. 2773, pp. 843–849. Springer, Heidelberg (2003)

9. Kim, B.-M., Li, Q., Lee, K., Kang, B.-Y.: Extraction of representative keywords considering co-occurrence in positive documents. In: Wang, L., Jin, Y. (eds.) FSKD 2005. LNCS (LNAI), vol. 3614, pp. 752–761. Springer, Heidelberg (2005)

10. L'Huillier, G., Hevia, A., Weber, R., Ríos, S.: Latent semantic analysis and keyword extraction for phishing classification. In: IEEE International Conference on Intelligence and Security Informatics (ISI), Vancouver, BC, Canada, pp. 129–131 (2010)

11. Liang, B., Tang, J., Li, J., Wang, K.-H.: Keyword extraction based peer clustering. In: Jin, H., Pan, Y., Xiao, N., Sun, J. (eds.) GCC 2004. LNCS, vol. 3251, pp. 827–830. Springer, Heidelberg (2004)

12. Nasution, M.K.M., Noah, S.A.: Superficial method for extracting social network for academic using web snippets. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds.) RSKT 2010. LNCS (LNAI), vol. 6401, pp. 483–490. Springer, Heidelberg (2010)

13. Nasution, M.K.M., Noah, S.A.M.: Extraction of academic social network from online database. In: Noah, S.A.M., et al. (eds.) IEEE Proceeding of 2011 International Conference on Semantic Technology and Information Retrieval, Putrajaya, Malaysia, pp. 64–69. IEEE (2011)

14. Nasution, M.K.M., Noah, S.A.M., Saad, S.: Social network extraction: Superficial method and information retrieval. In: Proceeding of International Conference on Informatics for Development (ICID 2011), pp. c2-110-c2-115 (2011)

15. Nasution, M.K.M., Noah, S.A.M.: Information Retrieval Model: A Social Network Extraction Perspective. In: IEEE Proc. of CAMP (2012)

16. Nasution, M.K.M.: Simple search engine model: Adaptive properties. Cornell University Library (arXiv:1212.3906v1) (2012)

17. Nasution, M.K.M.: Simple search engine model: Adaptive properties for doubleton. Cornell University Library (arXiv:1212.4702v1) (2012)

18. Nasution, M.K.M.: Simple search engine model: Selective properties. Cornell University Library (arXiv:1303.3964v1) (2012)

19. Nasution, M.K.M. (Mahyuddin): Kaedah dangkal bagi pengekstrakan rangkaian sosial akademik dari Web, Ph.D. Dissertation, Universiti Kebangsaan Malaysia (2013) (in Malay)

20. Okada, M., Ando, K., Le, S.S., Hayashi, Y., Aoe, J.-I.: An efficient substring search method by using delayed keyword extraction. Information Processing and Management 37, 741–761 (2001)

21. Wang, J., Liu, J., Wang, C.: Keyword extraction based on PageRank. In: Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS (LNAI), vol. 4426, pp. 857–864. Springer, Heidelberg (2007)