The Vector Space Model

  The Vector Space Model (VSM)

  • Terms are the axes of the space
  • Documents are points, or vectors, in this space
  • So we have a |V|-dimensional vector space

Example:

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence matrix (binary TF):

  Term    Doc 1   Doc 2
  Makan   1       1
  Nasi    0       1

Inverted index with raw TF (counts):

  Term    Doc 1   Doc 2
  Makan   2       1
  Nasi    0       1

Inverted index with logarithmic TF (1 + log10 tf, or 0 if the term is absent):

  Term    Doc 1   Doc 2
  Makan   1.3     1
  Nasi    0       1

Inverted index with TF-IDF, where idf = log10(N/df) and N = 2. Makan occurs in both documents, so its idf is log10(2/2) = 0 and its tf-idf weight vanishes; nasi occurs only in Doc 2, so its idf is log10(2/1) ≈ 0.3:

  Term    IDF    Doc 1   Doc 2
  Makan   0      0       0
  Nasi    0.3    0       0.3
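A minimal sketch in plain Python (illustrative helper names, no IR library) that reproduces the four weighting schemes for the two toy documents:

```python
import math

# A sketch of the four weighting schemes applied to the toy collection.
docs = {"Doc 1": "makan makan".split(), "Doc 2": "makan nasi".split()}
N = len(docs)                                  # number of documents
vocab = sorted({t for words in docs.values() for t in words})

def log_tf(tf):
    # Logarithmic TF: 1 + log10(tf) when the term occurs, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term):
    # idf = log10(N / df); df = number of documents containing the term.
    df = sum(term in words for words in docs.values())
    return math.log10(N / df)

for term in vocab:
    for name, words in docs.items():
        tf = words.count(term)
        print(f"{term:5s} {name}: binary={int(tf > 0)} raw={tf} "
              f"log={log_tf(tf):.1f} tf-idf={log_tf(tf) * idf(term):.1f}")
```

Running it confirms the table entries, e.g. 1 + log10(2) ≈ 1.3 for makan in Doc 1 and log10(2/1) ≈ 0.3 for the idf of nasi.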

  • The weight can be anything: binary, TF, TF-IDF, and so on.
  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
  • These are very sparse vectors: most entries are zero, as sketched below.
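Because almost every entry is zero, implementations rarely store the full |V|-length array. A common choice (a sketch in plain Python, not a library API) is a term-to-weight map that omits zero entries:

```python
# Store each document as a map from term to non-zero weight
# instead of a dense |V|-length vector.
doc1 = {"makan": 2}                 # raw TF for "makan makan"
doc2 = {"makan": 1, "nasi": 1}      # raw TF for "makan nasi"

def weight(doc, term):
    # Terms absent from the map implicitly have weight 0.
    return doc.get(term, 0)

print(weight(doc1, "nasi"))         # 0, without storing it explicitly
```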

  How About The Query?

  Treat the query as a vector too:

  • Key idea 1: represent documents as vectors in the space
  • Key idea 2: do the same for queries: represent them as vectors in the same space

  Proximity?

  • Proximity = similarity of the vectors
  • Proximity ≈ inverse of distance
  • The document with the greatest proximity to the query gets the highest score and is therefore ranked first.

  How to Measure Vector Space Proximity?

  • First cut: distance between two points (= distance between the endpoints of the two vectors): Euclidean distance?
  • Euclidean distance is a bad idea . . .
  • . . . because Euclidean distance is large for vectors of different lengths, as the following example shows.

Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous ... (the pair repeated many times over)

Query : gossip jealous

Inverted index with logarithmic TF:

  Term      Doc 1   Doc 2   Doc 3   Query
  Gossip    1       0       2.95    1
  Jealous   0       1       2.84    1

Inverted index with TF-IDF (N = 3, and each term occurs in two documents, so idf = log10(3/2) ≈ 0.17 for both):

  Term      IDF    Doc 1   Doc 2   Doc 3   Query
  Gossip    0.17   0.17    0       0.50    0.17
  Jealous   0.17   0       0.17    0.48    0.17

The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query q and the distribution of terms in Doc 3 are very similar.
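The point can be checked numerically. A short sketch using the (gossip, jealous) coordinates from the tf-idf table above:

```python
import math

# Euclidean distances from the query, using the (gossip, jealous)
# tf-idf coordinates in the table above.
q    = (0.17, 0.17)   # Query : gossip jealous
doc1 = (0.17, 0.00)
doc2 = (0.00, 0.17)
doc3 = (0.50, 0.48)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for name, d in (("Doc 1", doc1), ("Doc 2", doc2), ("Doc 3", doc3)):
    print(name, round(euclidean(q, d), 2))
# Doc 3 comes out farthest (~0.45), although its term distribution is
# the closest match to the query's.
```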

  • Thought experiment: take a document d and append it to itself; call this document d′.
  • "Semantically", d and d′ have the same content.
  • The Euclidean distance between the two documents can be quite large, yet the angle between them is 0, corresponding to maximal similarity.

  Key idea: rank documents according to their angle with the query.

  • The following two notions are equivalent:
    – Rank documents in increasing order of the angle between query and document
    – Rank documents in decreasing order of cosine(query, document)
  • Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°]

  But how – and why – should we be computing cosines?

  From the definition of the dot product: a · b = |a| × |b| × cos(θ), where |a| and |b| are the magnitudes (lengths) of the vectors a and b, and θ is the angle between them. Hence cos(θ) = (a · b) / (|a| × |b|).

For weight vectors q and d over the vocabulary V:

\[
\cos(\vec{q}, \vec{d}) \;=\; \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \;=\; \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} \;=\; \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}
\]

where q_i is the tf-idf weight (or whatever weighting is used) of term i in the query, and d_i is the tf-idf weight of term i in the document. cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
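A direct transcription of this formula into Python (a sketch, not a library API):

```python
import math

# The cosine formula above, transcribed directly: q and d are
# equal-length sequences of term weights (tf-idf or any other scheme).
def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    len_q = math.sqrt(sum(qi * qi for qi in q))
    len_d = math.sqrt(sum(di * di for di in d))
    return dot / (len_q * len_d)

# Revisiting the gossip/jealous example: the query and Doc 3 are nearly
# parallel, so their cosine is close to 1 despite the large distance.
print(round(cosine((0.17, 0.17), (0.50, 0.48)), 3))  # ≈ 1.0
```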

  • A vector can be (length-) normalized by dividing each of its components by its length.
  • Dividing a vector by its length makes it a unit (length) vector, on the surface of the unit hypersphere.
  • Unit vector = a vector whose length is exactly 1 (the unit length).

  [Figure: d and d′ plotted in the gossip–jealous plane; before normalization d′ is twice as long as d, after length normalization the two vectors coincide.]

  • Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length normalization.
  • Long and short documents now have comparable weights.

After normalization, for length-normalized q and d, the cosine is simply the dot product:

\[
\cos(\vec{q}, \vec{d}) \;=\; \vec{q} \cdot \vec{d} \;=\; \sum_{i=1}^{|V|} q_i d_i
\]

  • With non-negative weights such as tf-idf, the value of the cosine similarity lies in [0, 1].
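A minimal sketch of this normalization, showing that d and d′ (d appended to itself) collapse to the same unit vector; the weights here are illustrative:

```python
import math

def normalize(v):
    # Divide each component by the vector's length, yielding a unit vector.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d       = [2.0, 1.0]        # illustrative weights for document d
d_prime = [4.0, 2.0]        # d appended to itself: every count doubles

print(normalize(d))         # ≈ [0.894, 0.447]
print(normalize(d_prime))   # identical after normalization
```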

  Example?

Query: best car insurance (N = 1,000,000)

  Term        tf-raw   tf-wt   df       idf   tf.idf   n'lize
  auto        0        0       5000     2.3   0        0
  best        1        1       50000    1.3   1.3      0.34
  car         1        1       10000    2.0   2.0      0.52
  insurance   1        1       1000     3.0   3.0      0.78

(tf-wt = 1 + log10(tf-raw); idf = log10(N/df); n'lize divides each tf.idf by the query length √(1.3² + 2.0² + 3.0²) ≈ 3.83.)

Document: car insurance auto insurance

  Term        tf-raw   tf-wt   idf   tf.idf   n'lize
  auto        1        1       2.3   2.3      0.46
  best        0        0       1.3   0        0
  car         1        1       2.0   2.0      0.40
  insurance   2        1.3     3.0   3.9      0.79

(n'lize divides each tf.idf by the document length √(2.3² + 2.0² + 3.9²) ≈ 4.95.)

With q and d both length-normalized, the score is just the dot product cos(q, d) = Σ q_i d_i, computed term by term:

Query: best car insurance
Document: car insurance auto insurance

  Term        Query tf.idf   Query n'lize   Doc tf.idf   Doc n'lize   Product
  auto        0              0              2.3          0.46         0
  best        1.3            0.34           0            0            0
  car         2.0            0.52           2.0          0.40         0.21
  insurance   3.0            0.78           3.9          0.79         0.62

Cosine score = 0 + 0 + 0.21 + 0.62 ≈ 0.83
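A numeric check of this worked example (the df values and the weighting scheme are the ones in the tables above):

```python
import math

# Recomputing the worked example: log TF x idf, then length normalization.
N = 1_000_000
terms = ["auto", "best", "car", "insurance"]
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

def ltc(tf_raw):
    w = [(1 + math.log10(tf)) * math.log10(N / df[t]) if tf else 0.0
         for t, tf in zip(terms, tf_raw)]
    length = math.sqrt(sum(x * x for x in w))
    return [x / length for x in w]

q = ltc([0, 1, 1, 1])   # Query: best car insurance
d = ltc([1, 0, 1, 2])   # Document: car insurance auto insurance

print([round(x, 2) for x in q])                    # [0.0, 0.34, 0.52, 0.78]
print([round(x, 2) for x in d])                    # [0.46, 0.0, 0.4, 0.79]
print(round(sum(a * b for a, b in zip(q, d)), 2))  # score ≈ 0.83
```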

  Summary: Vector Space Ranking

  • Represent the query as a weighted tf-idf vector
  • Represent each document as a weighted tf-idf vector
  • Compute the cosine similarity score for the query vector and each document vector
  • Rank documents with respect to the query by score
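Putting these steps together, a compact end-to-end sketch (hypothetical helper names, not a real library):

```python
import math
from collections import Counter

# End-to-end sketch of vector space ranking over a tiny in-memory corpus.
def tfidf(tokens, vocab, df, N):
    tf = Counter(tokens)
    return [(1 + math.log10(tf[t])) * math.log10(N / df[t]) if tf[t] else 0.0
            for t in vocab]

def normalize(v):
    length = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / length for x in v]

def rank(query, docs):
    N = len(docs)
    vocab = sorted({t for text in docs.values() for t in text.split()})
    df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
    q = normalize(tfidf(query.split(), vocab, df, N))
    scored = [(name,
               sum(a * b for a, b in
                   zip(q, normalize(tfidf(text.split(), vocab, df, N)))))
              for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank("makan nasi", {"Doc 1": "makan makan", "Doc 2": "makan nasi"}))
# [('Doc 2', 1.0), ('Doc 1', 0.0)]
```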

  Cosine Similarity Amongst 3 Documents

  How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

  Term frequencies (counts):

  term        SaS   PaP   WH
  affection   115   58    20
  jealous     10    7     11
  gossip      2     0     6
  wuthering   0     0     38

  Log frequency weighting (1 + log10 tf):

  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

  After length normalization:

  term        SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0       0.405
  wuthering   0       0       0.588

  cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
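A quick check of the novels example (counts from the first table; log TF, length normalization, then dot products):

```python
import math

# Reproducing the novels example: raw counts -> log TF -> unit vectors.
counts = {
    "SaS": [115, 10, 2, 0],   # affection, jealous, gossip, wuthering
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def unit_log_tf(v):
    w = [1 + math.log10(tf) if tf else 0.0 for tf in v]
    length = math.sqrt(sum(x * x for x in w))
    return [x / length for x in w]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

sas, pap, wh = (unit_log_tf(counts[b]) for b in ("SaS", "PaP", "WH"))
print(round(dot(sas, pap), 2))   # cos(SaS, PaP) ≈ 0.94
print(round(dot(sas, wh), 2))    # cos(SaS, WH)  ≈ 0.79
```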