The Vector Space Model

  The Vector Space Model (VSM)

  • Terms are the axes of the space
  • Documents are points, or vectors, in this space
  • So we have a |V|-dimensional vector space

Example:

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence matrix (binary TF):

  Term    Doc 1   Doc 2
  Makan   1       1
  Nasi    0       1

Inverted index with raw TF (counts):

  Term    Doc 1   Doc 2
  Makan   2       1
  Nasi    0       1

Inverted index with logarithmic TF (1 + log10 tf, or 0 if the term is absent):

  Term    Doc 1   Doc 2
  Makan   1.3     1
  Nasi    0       1

Inverted index with TF-IDF, where idf = log10(N/df) and N = 2. Makan occurs in both documents, so its idf is log10(2/2) = 0 and its tf-idf weight vanishes; nasi occurs only in Doc 2, so its idf is log10(2/1) ≈ 0.3:

  Term    IDF    Doc 1   Doc 2
  Makan   0      0       0
  Nasi    0.3    0       0.3
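A minimal sketch in plain Python (illustrative helper names, no IR library) that reproduces the four weighting schemes for the two toy documents:

```python
import math

# A sketch of the four weighting schemes applied to the toy collection.
docs = {"Doc 1": "makan makan".split(), "Doc 2": "makan nasi".split()}
N = len(docs)                                  # number of documents
vocab = sorted({t for words in docs.values() for t in words})

def log_tf(tf):
    # Logarithmic TF: 1 + log10(tf) when the term occurs, else 0.
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term):
    # idf = log10(N / df); df = number of documents containing the term.
    df = sum(term in words for words in docs.values())
    return math.log10(N / df)

for term in vocab:
    for name, words in docs.items():
        tf = words.count(term)
        print(f"{term:5s} {name}: binary={int(tf > 0)} raw={tf} "
              f"log={log_tf(tf):.1f} tf-idf={log_tf(tf) * idf(term):.1f}")
```

Running it confirms the table entries, e.g. 1 + log10(2) ≈ 1.3 for makan in Doc 1 and log10(2/1) ≈ 0.3 for the idf of nasi.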

  • The weight can be anything: binary, TF, TF-IDF, and so on.
  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
  • These are very sparse vectors: most entries are zero, as sketched below.
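Because almost every entry is zero, implementations rarely store the full |V|-length array. A common choice (a sketch in plain Python, not a library API) is a term-to-weight map that omits zero entries:

```python
# Store each document as a map from term to non-zero weight
# instead of a dense |V|-length vector.
doc1 = {"makan": 2}                 # raw TF for "makan makan"
doc2 = {"makan": 1, "nasi": 1}      # raw TF for "makan nasi"

def weight(doc, term):
    # Terms absent from the map implicitly have weight 0.
    return doc.get(term, 0)

print(weight(doc1, "nasi"))         # 0, without storing it explicitly
```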

  How About The Query?

  Treat the query as a vector too:

  • Key idea 1: represent documents as vectors in the space
  • Key idea 2: do the same for queries: represent them as vectors in the same space

  Proximity?

  • Proximity = similarity of the vectors
  • Proximity ≈ inverse of distance
  • The document with the greatest proximity to the query gets the highest score and is therefore ranked first.

  How to Measure Vector Space Proximity?

  • First cut: distance between two points (= distance between the endpoints of the two vectors): Euclidean distance?
  • Euclidean distance is a bad idea . . .
  • . . . because Euclidean distance is large for vectors of different lengths, as the following example shows.

Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous ... (the pair repeated many times over)

Query : gossip jealous

Inverted index with logarithmic TF:

  Term      Doc 1   Doc 2   Doc 3   Query
  Gossip    1       0       2.95    1
  Jealous   0       1       2.84    1

Inverted index with TF-IDF (N = 3, and each term occurs in two documents, so idf = log10(3/2) ≈ 0.17 for both):

  Term      IDF    Doc 1   Doc 2   Doc 3   Query
  Gossip    0.17   0.17    0       0.50    0.17
  Jealous   0.17   0       0.17    0.48    0.17

The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query q and the distribution of terms in Doc 3 are very similar.
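The point can be checked numerically. A short sketch using the (gossip, jealous) coordinates from the tf-idf table above:

```python
import math

# Euclidean distances from the query, using the (gossip, jealous)
# tf-idf coordinates in the table above.
q    = (0.17, 0.17)   # Query : gossip jealous
doc1 = (0.17, 0.00)
doc2 = (0.00, 0.17)
doc3 = (0.50, 0.48)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for name, d in (("Doc 1", doc1), ("Doc 2", doc2), ("Doc 3", doc3)):
    print(name, round(euclidean(q, d), 2))
# Doc 3 comes out farthest (~0.45), although its term distribution is
# the closest match to the query's.
```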

  • Thought experiment: take a document d and append it to itself; call this document d′.
  • "Semantically", d and d′ have the same content.
  • The Euclidean distance between the two documents can be quite large, yet the angle between them is 0, corresponding to maximal similarity.

  Key idea: rank documents according to their angle with the query.

  • The following two notions are equivalent:
    – Rank documents in increasing order of the angle between query and document
    – Rank documents in decreasing order of cosine(query, document)
  • Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°]

  But how – and why – should we be computing cosines?

  From the definition of the dot product: a · b = |a| × |b| × cos(θ), where |a| and |b| are the magnitudes (lengths) of the vectors a and b, and θ is the angle between them. Hence cos(θ) = (a · b) / (|a| × |b|).

For weight vectors q and d over the vocabulary V:

\[
\cos(\vec{q}, \vec{d}) \;=\; \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \;=\; \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} \;=\; \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}
\]

where q_i is the tf-idf weight (or whatever weighting is used) of term i in the query, and d_i is the tf-idf weight of term i in the document. cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
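A direct transcription of this formula into Python (a sketch, not a library API):

```python
import math

# The cosine formula above, transcribed directly: q and d are
# equal-length sequences of term weights (tf-idf or any other scheme).
def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    len_q = math.sqrt(sum(qi * qi for qi in q))
    len_d = math.sqrt(sum(di * di for di in d))
    return dot / (len_q * len_d)

# Revisiting the gossip/jealous example: the query and Doc 3 are nearly
# parallel, so their cosine is close to 1 despite the large distance.
print(round(cosine((0.17, 0.17), (0.50, 0.48)), 3))  # ≈ 1.0
```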

  • A vector can be (length-) normalized by dividing each of its components by its length.
  • Dividing a vector by its length makes it a unit (length) vector, on the surface of the unit hypersphere.
  • Unit vector = a vector whose length is exactly 1 (the unit length).

  [Figure: d and d′ plotted in the gossip–jealous plane; before normalization d′ is twice as long as d, after length normalization the two vectors coincide.]

  • Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length normalization.
  • Long and short documents now have comparable weights.

After normalization, for length-normalized q and d, the cosine is simply the dot product:

\[
\cos(\vec{q}, \vec{d}) \;=\; \vec{q} \cdot \vec{d} \;=\; \sum_{i=1}^{|V|} q_i d_i
\]

  • With non-negative weights such as tf-idf, the value of the cosine similarity lies in [0, 1].
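A minimal sketch of this normalization, showing that d and d′ (d appended to itself) collapse to the same unit vector; the weights here are illustrative:

```python
import math

def normalize(v):
    # Divide each component by the vector's length, yielding a unit vector.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d       = [2.0, 1.0]        # illustrative weights for document d
d_prime = [4.0, 2.0]        # d appended to itself: every count doubles

print(normalize(d))         # ≈ [0.894, 0.447]
print(normalize(d_prime))   # identical after normalization
```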

  Example?

Query: best car insurance (N = 1,000,000)

  Term        tf-raw   tf-wt   df       idf   tf.idf   n'lize
  auto        0        0       5000     2.3   0        0
  best        1        1       50000    1.3   1.3      0.34
  car         1        1       10000    2.0   2.0      0.52
  insurance   1        1       1000     3.0   3.0      0.78

(tf-wt = 1 + log10(tf-raw); idf = log10(N/df); n'lize divides each tf.idf by the query length √(1.3² + 2.0² + 3.0²) ≈ 3.83.)

Document: car insurance auto insurance

  Term        tf-raw   tf-wt   idf   tf.idf   n'lize
  auto        1        1       2.3   2.3      0.46
  best        0        0       1.3   0        0
  car         1        1       2.0   2.0      0.40
  insurance   2        1.3     3.0   3.9      0.79

(n'lize divides each tf.idf by the document length √(2.3² + 2.0² + 3.9²) ≈ 4.95.)

With q and d both length-normalized, the score is just the dot product cos(q, d) = Σ q_i d_i, computed term by term:

Query: best car insurance
Document: car insurance auto insurance

  Term        Query tf.idf   Query n'lize   Doc tf.idf   Doc n'lize   Product
  auto        0              0              2.3          0.46         0
  best        1.3            0.34           0            0            0
  car         2.0            0.52           2.0          0.40         0.21
  insurance   3.0            0.78           3.9          0.79         0.62

Cosine score = 0 + 0 + 0.21 + 0.62 ≈ 0.83
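A numeric check of this worked example (the df values and the weighting scheme are the ones in the tables above):

```python
import math

# Recomputing the worked example: log TF x idf, then length normalization.
N = 1_000_000
terms = ["auto", "best", "car", "insurance"]
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

def ltc(tf_raw):
    w = [(1 + math.log10(tf)) * math.log10(N / df[t]) if tf else 0.0
         for t, tf in zip(terms, tf_raw)]
    length = math.sqrt(sum(x * x for x in w))
    return [x / length for x in w]

q = ltc([0, 1, 1, 1])   # Query: best car insurance
d = ltc([1, 0, 1, 2])   # Document: car insurance auto insurance

print([round(x, 2) for x in q])                    # [0.0, 0.34, 0.52, 0.78]
print([round(x, 2) for x in d])                    # [0.46, 0.0, 0.4, 0.79]
print(round(sum(a * b for a, b in zip(q, d)), 2))  # score ≈ 0.83
```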

  Summary: Vector Space Ranking

  • Represent the query as a weighted tf-idf vector
  • Represent each document as a weighted tf-idf vector
  • Compute the cosine similarity score for the query vector and each document vector
  • Rank documents with respect to the query by score
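Putting these steps together, a compact end-to-end sketch (hypothetical helper names, not a real library):

```python
import math
from collections import Counter

# End-to-end sketch of vector space ranking over a tiny in-memory corpus.
def tfidf(tokens, vocab, df, N):
    tf = Counter(tokens)
    return [(1 + math.log10(tf[t])) * math.log10(N / df[t]) if tf[t] else 0.0
            for t in vocab]

def normalize(v):
    length = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / length for x in v]

def rank(query, docs):
    N = len(docs)
    vocab = sorted({t for text in docs.values() for t in text.split()})
    df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
    q = normalize(tfidf(query.split(), vocab, df, N))
    scored = [(name,
               sum(a * b for a, b in
                   zip(q, normalize(tfidf(text.split(), vocab, df, N)))))
              for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank("makan nasi", {"Doc 1": "makan makan", "Doc 2": "makan nasi"}))
# [('Doc 2', 1.0), ('Doc 1', 0.0)]
```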

  Cosine Similarity Amongst 3 Documents

  How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

  Term frequencies (counts):

  term        SaS   PaP   WH
  affection   115   58    20
  jealous     10    7     11
  gossip      2     0     6
  wuthering   0     0     38

  Log frequency weighting (1 + log10 tf):

  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

  After length normalization:

  term        SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0       0.405
  wuthering   0       0       0.588

  cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
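A quick check of the novels example (counts from the first table; log TF, length normalization, then dot products):

```python
import math

# Reproducing the novels example: raw counts -> log TF -> unit vectors.
counts = {
    "SaS": [115, 10, 2, 0],   # affection, jealous, gossip, wuthering
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def unit_log_tf(v):
    w = [1 + math.log10(tf) if tf else 0.0 for tf in v]
    length = math.sqrt(sum(x * x for x in w))
    return [x / length for x in w]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

sas, pap, wh = (unit_log_tf(counts[b]) for b in ("SaS", "PaP", "WH"))
print(round(dot(sas, pap), 2))   # cos(SaS, PaP) ≈ 0.94
print(round(dot(sas, wh), 2))    # cos(SaS, WH)  ≈ 0.79
```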