The Vector Space Model

  The Vector Space Model (VSM)

  • Terms are axes of the space
  • Documents are points or vectors in this space
  • So we have a |V|-dimensional vector space

Doc 1 : makan makan

Doc 2 : makan nasi

  Incidence Matrix (Binary TF)

  Term  | Doc 1 | Doc 2
  Makan | 1     | 1
  Nasi  | 0     | 1

  Inverted Index (Raw TF, term counts)

  Term  | Doc 1 | Doc 2
  Makan | 2     | 1
  Nasi  | 0     | 1

  Inverted Index (Logarithmic TF, 1 + log10(tf))

  Term  | Doc 1 | Doc 2
  Makan | 1.3   | 1
  Nasi  | 0     | 1

  Inverted Index (TF-IDF, idf = log10(N/df))

  Term  | IDF | Doc 1 | Doc 2
  Makan | 0   | 0     | 0
  Nasi  | 0.3 | 0     | 0.3

  Now extend Doc 1 with one more term:

Doc 1 : makan makan jagung

Doc 2 : makan nasi

  Term   | IDF | Doc 1 | Doc 2
  Makan  | 0   | 0     | 0
  Nasi   | 0.3 | 0     | 0.3
  Jagung | 0.3 | 0.3   | 0

  Note that makan, which occurs in every document, gets weight 0 under TF-IDF.
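  As a concrete check on the tables above, here is a minimal Python sketch that computes the binary, raw, logarithmic, and tf-idf weights for the two-document makan/nasi/jagung collection, assuming base-10 logarithms (which match the 1.3 = 1 + log10 2 and 0.3 = log10 2 values shown):

```python
import math

docs = {"Doc 1": ["makan", "makan", "jagung"], "Doc 2": ["makan", "nasi"]}
N = len(docs)

# Raw term frequency of each term in each document.
tf = {name: {} for name in docs}
for name, terms in docs.items():
    for t in terms:
        tf[name][t] = tf[name].get(t, 0) + 1

# Document frequency and idf = log10(N / df).
vocab = sorted({t for terms in docs.values() for t in terms})
df = {t: sum(1 for name in docs if t in tf[name]) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def log_tf(f):
    return 1 + math.log10(f) if f > 0 else 0.0

for t in vocab:
    for name in docs:
        f = tf[name].get(t, 0)
        print(t, name,
              "binary:", 1 if f else 0,
              "raw:", f,
              "log:", round(log_tf(f), 2),
              "tf-idf:", round(log_tf(f) * idf[t], 2))
```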

  • Terms are axes of the space
  • Documents are points or vectors in this space
  • So we have a |V|-dimensional vector space
  • The weight can be anything: binary, TF, TF-IDF, and so on.
  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
  • These are very sparse vectors: most entries are zero (see the sketch after this list)
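  Because almost every entry is zero, real systems never materialize the full |V|-dimensional array; a dictionary keyed by term is the usual sparse representation. A minimal sketch (the weights below are illustrative):

```python
# Sparse representation: store only the non-zero term weights.
# Doc 1 : makan makan jagung, with log-TF weights; absent terms weigh 0.
doc1 = {"makan": 1.3, "jagung": 1.0}
doc2 = {"makan": 1.0, "nasi": 1.0}

def sparse_dot(u, v):
    """Dot product that only touches terms present in both vectors."""
    if len(u) > len(v):          # iterate over the shorter vector
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

print(sparse_dot(doc1, doc2))    # 1.3 * 1.0 = 1.3
```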

  How About the Query?

  Represent the query as a vector too.

  • Key idea 1: Represent documents as vectors in the space
  • Key idea 2: Do the same for queries: represent them as vectors in the space
  • Key idea 3: Rank documents according to their proximity to the query in this space

  PROXIMITY?

  • Proximity = similarity of vectors
  • Proximity ≈ inverse of distance
  • The document with the greatest proximity to the query gets the highest score, and therefore the highest rank

  How to Measure Vector Space Proximity?

  • First cut: distance between two points
    • ( = distance between the end points of the two vectors)
    • Euclidean distance?
  • Euclidean distance is a bad idea . . .
  • . . . because Euclidean distance is large for vectors of different lengths.

  Doc 1 : gossip
  Doc 2 : jealous
  Doc 3 : gossip jealous gossip jealous ... (gossip : 90x, jealous : 70x)

  Query : gossip jealous

  Inverted Index (Logarithmic TF)

  Term    | Doc 1 | Doc 2 | Doc 3 | Query
  Gossip  | 1     | 0     | 2.95  | 1
  Jealous | 0     | 1     | 2.84  | 1

  Inverted Index (TF-IDF)

  Term    | IDF  | Doc 1 | Doc 2 | Doc 3 | Query
  Gossip  | 0.17 | 0.17  | 0     | 0.50  | 0.17
  Jealous | 0.17 | 0     | 0.17  | 0.48  | 0.17

  [Figure: the query and Doc 1, Doc 2, Doc 3 plotted in the gossip-jealous plane]

  Idea? The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query q and the distribution of terms in Doc 3 are very similar.
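  A quick numeric check, using the log-TF coordinates from the table above:

```python
import math

# Log-TF vectors in the (gossip, jealous) plane, from the table above.
points = {
    "Doc 1": (1.0, 0.0),
    "Doc 2": (0.0, 1.0),
    "Doc 3": (2.95, 2.84),
}
query = (1.0, 1.0)

for name, p in points.items():
    print(name, round(math.dist(query, p), 2))  # Euclidean distance to query
# Doc 1 1.0, Doc 2 1.0, Doc 3 2.68 -> by Euclidean distance Doc 3 looks
# *least* similar to the query, despite having the most similar term mix.
```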

  • Thought experiment: take a document d and append it to itself. Call this document d′.
  • "Semantically" d and d′ have the same content
  • The Euclidean distance between the two documents can be quite large

  [Figure: d and d′ plotted in the gossip-jealous plane]

  • The angle between the two documents is 0, corresponding to maximal similarity.
  • Key idea: Rank documents according to the angle with the query, not the distance.

  • The following two notions are equivalent:
    • Rank documents in increasing order of the angle between query and document
    • Rank documents in decreasing order of cosine(query, document)
  • Cosine is a monotonically decreasing function on the interval [0°, 180°]
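  To see why the two orderings coincide, note a few values of the cosine on [0°, 180°]:

  \[ \cos 0^\circ = 1, \quad \cos 45^\circ \approx 0.71, \quad \cos 90^\circ = 0, \quad \cos 180^\circ = -1 \]

  As the angle grows, the cosine only shrinks, so ranking by increasing angle and ranking by decreasing cosine produce the same order.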

  But how, and why, should we be computing cosines?

  Recall the dot product:

  \[ \vec{a} \cdot \vec{b} = |\vec{a}| \, |\vec{b}| \, \cos\theta \]

  where |a| is the magnitude (length) of vector a, |b| is the magnitude (length) of vector b, and θ is the angle between a and b. Rearranging gives cos(θ) = (a · b) / (|a| × |b|). Applied to a query q and a document d over a vocabulary V:

  \[ \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \, |\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2} \; \sqrt{\sum_{i=1}^{|V|} d_i^2}} \]

  where q_i is the tf-idf weight (or whatever) of term i in the query, and d_i is the tf-idf weight (or whatever) of term i in the document. cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
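  A direct transcription of this formula into Python (a minimal sketch for dense weight vectors):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| |d|) for two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Gossip/jealous example: by angle, the query is almost identical to Doc 3.
query = (1.0, 1.0)
doc3 = (2.95, 2.84)
print(round(cosine_similarity(query, doc3), 4))  # 0.9998
```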

  • A vector can be (length-) normalized by dividing each of its components by its length
  • Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
  • Unit vector = a vector whose length is exactly 1 (the unit length)

  [Figure: d and d′ in the gossip-jealous plane; after length normalization both lie on the unit circle]

  \[ \hat{d} = \frac{\vec{d}}{\|\vec{d}\|_2}, \qquad \|\vec{d}\|_2 = \sqrt{\sum_{i=1}^{|V|} d_i^2} \]

  • Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization (verified numerically below).
  • Long and short documents now have comparable weights
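  A small numeric illustration with hypothetical raw-TF weights for d; appending d to itself doubles every count, yet both documents normalize to the same unit vector:

```python
import math

def normalize(v):
    """Divide each component by the vector's L2 length -> a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(round(x / length, 3) for x in v)

# Hypothetical raw-TF weights: d' is d appended to itself, so counts double.
d  = (2.0, 1.0)
d2 = (4.0, 2.0)

print(round(math.dist(d, d2), 2))  # 2.24 -> Euclidean distance is large
print(normalize(d))                # (0.894, 0.447)
print(normalize(d2))               # (0.894, 0.447) -> identical after normalization
```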

  After normalization: for length-normalized q and d,

  \[ \cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i \]

  since |q| = |d| = 1, the denominator of the cosine formula drops out.

  • With the non-negative weights used here, the value of the cosine similarity lies in [0, 1]

  Example

  Document: car insurance auto insurance
  Query: best car insurance
  N = 1,000,000 documents; idf = log10(N/df); tf-wt = 1 + log10(tf-raw)

  Query

  Term      | tf-raw | tf-wt | df    | idf | tf.idf | n'lize
  auto      | 0      | 0     | 5000  | 2.3 | 0      | 0
  best      | 1      | 1     | 50000 | 1.3 | 1.3    | 0.34
  car       | 1      | 1     | 10000 | 2.0 | 2.0    | 0.52
  insurance | 1      | 1     | 1000  | 3.0 | 3.0    | 0.78

  Query length = √(1.3² + 2.0² + 3.0²) ≈ 3.8

  Document

  Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
  auto      | 1      | 1     | 2.3 | 2.3    | 0.46
  best      | 0      | 0     | 1.3 | 0      | 0
  car       | 1      | 1     | 2.0 | 2.0    | 0.40
  insurance | 2      | 1.3   | 3.0 | 3.9    | 0.79

  Doc length = √(2.3² + 2.0² + 3.9²) ≈ 4.9

  Recall that for length-normalized q and d, cos(q, d) = q · d = Σ q_i d_i. Putting the two normalized vectors together:

  Document: car insurance auto insurance
  Query: best car insurance

  Term      | Query tf.idf | Query n'lize | Doc tf.idf | Doc n'lize | Product
  auto      | 0            | 0            | 2.3        | 0.46       | 0
  best      | 1.3          | 0.34         | 0          | 0          | 0
  car       | 2.0          | 0.52         | 2.0        | 0.40       | 0.21
  insurance | 3.0          | 0.78         | 3.9        | 0.79       | 0.62

  Cosine score = 0 + 0 + 0.21 + 0.62 ≈ 0.83: this is the document's ranking score for the query.
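  The score can also be recomputed directly from the un-normalized tf.idf vectors (a sketch; term order is auto, best, car, insurance):

```python
import math

# Un-normalized tf.idf vectors over (auto, best, car, insurance).
query = (0.0, 1.3, 2.0, 3.0)
doc   = (2.3, 0.0, 2.0, 3.9)

q_len = math.sqrt(sum(x * x for x in query))   # ~3.8
d_len = math.sqrt(sum(x * x for x in doc))     # ~4.9

score = sum((qi / q_len) * (di / d_len) for qi, di in zip(query, doc))
print(round(score, 2))   # 0.83
```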

  Summary: Vector Space Ranking

  • Represent the query as a weighted tf-idf vector
  • Represent each document as a weighted tf-idf vector
  • Compute the cosine similarity score for the query vector and each document vector
  • Rank documents with respect to the query by score
  • Return the top K (e.g., K = 10) documents to the user (a sketch of this pipeline follows)
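  A minimal end-to-end sketch of this pipeline, using log-TF × idf weighting and cosine ranking over a tiny hypothetical collection (the document names and contents are illustrative assumptions):

```python
import math
from collections import Counter

def log_tf(count):
    return 1 + math.log10(count) if count > 0 else 0.0

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector: term -> (1 + log10 tf) * idf."""
    counts = Counter(tokens)
    return {t: log_tf(c) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_tokens, docs, idf, k=10):
    """Score every document against the query and return the top K."""
    q = tfidf_vector(query_tokens, idf)
    scored = [(cosine(q, tfidf_vector(d, idf)), name) for name, d in docs.items()]
    return sorted(scored, reverse=True)[:k]

# Toy collection (hypothetical); idf computed from the collection itself.
docs = {"d1": ["car", "insurance", "auto", "insurance"],
        "d2": ["best", "car", "rental"]}
N = len(docs)
df = Counter(t for d in docs.values() for t in set(d))
idf = {t: math.log10(N / c) for t, c in df.items()}

print(rank(["best", "car", "insurance"], docs, idf, k=10))
```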
  How Similar Are the Novels?

  SaS : Sense and Sensibility
  PaP : Pride and Prejudice
  WH : Wuthering Heights

  Term frequencies (counts)

  term      | SaS | PaP | WH
  affection | 115 | 58  | 20
  jealous   | 10  | 7   | 11
  gossip    | 2   | 0   | 6
  wuthering | 0   | 0   | 38

  Note: to simplify this example, we don't do idf weighting.

  Log frequency weighting (1 + log10 tf)

  term      | SaS  | PaP  | WH
  affection | 3.06 | 2.76 | 2.30
  jealous   | 2.00 | 1.85 | 2.04
  gossip    | 1.30 | 0    | 1.78
  wuthering | 0    | 0    | 2.58

  After length normalization

  term      | SaS   | PaP   | WH
  affection | 0.789 | 0.832 | 0.524
  jealous   | 0.515 | 0.555 | 0.465
  gossip    | 0.335 | 0     | 0.405
  wuthering | 0     | 0     | 0.588

  cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
  cos(SaS, WH) ≈ 0.79
  cos(PaP, WH) ≈ 0.69
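  The whole example can be reproduced in a few lines (a sketch; log weighting with base-10 logs and no idf, as in the tables above):

```python
import math

# Term counts from the table above, in the order (SaS, PaP, WH).
counts = {
    "affection": (115, 58, 20),
    "jealous":   (10, 7, 11),
    "gossip":    (2, 0, 6),
    "wuthering": (0, 0, 38),
}

def log_tf(c):
    return 1 + math.log10(c) if c > 0 else 0.0

# Log-weighted, then length-normalized, vector per novel.
novels = ["SaS", "PaP", "WH"]
weighted = {n: [log_tf(counts[t][i]) for t in counts] for i, n in enumerate(novels)}
unit = {n: [w / math.sqrt(sum(x * x for x in v)) for w in v]
        for n, v in weighted.items()}

def cos(a, b):
    return sum(x * y for x, y in zip(unit[a], unit[b]))

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```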