The Vector Space Model

  The Vector Space Model (VSM)

  • Terms are axes of the space
  • Documents are points or vectors in this space
  • So we have a |V|-dimensional vector space

Doc 1 : makan makan

Doc 2 : makan nasi

  Incidence Matrix (Binary TF)

  Term  | Doc 1 | Doc 2
  Makan | 1     | 1
  Nasi  | 0     | 1

  Inverted Index (Raw TF, term counts)

  Term  | Doc 1 | Doc 2
  Makan | 2     | 1
  Nasi  | 0     | 1

  Inverted Index (Logarithmic TF, 1 + log10(tf))

  Term  | Doc 1 | Doc 2
  Makan | 1.3   | 1
  Nasi  | 0     | 1

  Inverted Index (TF-IDF, idf = log10(N/df))

  Term  | IDF | Doc 1 | Doc 2
  Makan | 0   | 0     | 0
  Nasi  | 0.3 | 0     | 0.3

  Now extend Doc 1 with one more term:

Doc 1 : makan makan jagung

Doc 2 : makan nasi

  Term   | IDF | Doc 1 | Doc 2
  Makan  | 0   | 0     | 0
  Nasi   | 0.3 | 0     | 0.3
  Jagung | 0.3 | 0.3   | 0

  Note that makan, which occurs in every document, gets weight 0 under TF-IDF.
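  As a concrete check on the tables above, here is a minimal Python sketch that computes the binary, raw, logarithmic, and tf-idf weights for the two-document makan/nasi/jagung collection, assuming base-10 logarithms (which match the 1.3 = 1 + log10 2 and 0.3 = log10 2 values shown):

```python
import math

docs = {"Doc 1": ["makan", "makan", "jagung"], "Doc 2": ["makan", "nasi"]}
N = len(docs)

# Raw term frequency of each term in each document.
tf = {name: {} for name in docs}
for name, terms in docs.items():
    for t in terms:
        tf[name][t] = tf[name].get(t, 0) + 1

# Document frequency and idf = log10(N / df).
vocab = sorted({t for terms in docs.values() for t in terms})
df = {t: sum(1 for name in docs if t in tf[name]) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def log_tf(f):
    return 1 + math.log10(f) if f > 0 else 0.0

for t in vocab:
    for name in docs:
        f = tf[name].get(t, 0)
        print(t, name,
              "binary:", 1 if f else 0,
              "raw:", f,
              "log:", round(log_tf(f), 2),
              "tf-idf:", round(log_tf(f) * idf[t], 2))
```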

  • Terms are axes of the space
  • Documents are points or vectors in this space
  • So we have a |V|-dimensional vector space
  • The weight can be anything: binary, TF, TF-IDF, and so on.
  • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
  • These are very sparse vectors: most entries are zero (see the sketch after this list)
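  Because almost every entry is zero, real systems never materialize the full |V|-dimensional array; a dictionary keyed by term is the usual sparse representation. A minimal sketch (the weights below are illustrative):

```python
# Sparse representation: store only the non-zero term weights.
# Doc 1 : makan makan jagung, with log-TF weights; absent terms weigh 0.
doc1 = {"makan": 1.3, "jagung": 1.0}
doc2 = {"makan": 1.0, "nasi": 1.0}

def sparse_dot(u, v):
    """Dot product that only touches terms present in both vectors."""
    if len(u) > len(v):          # iterate over the shorter vector
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

print(sparse_dot(doc1, doc2))    # 1.3 * 1.0 = 1.3
```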

  How About the Query?

  Represent the query as a vector too.

  • Key idea 1: Represent documents as vectors in the space
  • Key idea 2: Do the same for queries: represent them as vectors in the space
  • Key idea 3: Rank documents according to their proximity to the query in this space

  PROXIMITY?

  • Proximity = similarity of vectors
  • Proximity ≈ inverse of distance
  • The document with the greatest proximity to the query gets the highest score, and therefore the highest rank

  How to Measure Vector Space Proximity?

  • First cut: distance between two points
    • ( = distance between the end points of the two vectors)
    • Euclidean distance?
  • Euclidean distance is a bad idea . . .
  • . . . because Euclidean distance is large for vectors of different lengths.

  Doc 1 : gossip
  Doc 2 : jealous
  Doc 3 : gossip jealous gossip jealous ... (gossip : 90x, jealous : 70x)

  Query : gossip jealous

  Inverted Index (Logarithmic TF)

  Term    | Doc 1 | Doc 2 | Doc 3 | Query
  Gossip  | 1     | 0     | 2.95  | 1
  Jealous | 0     | 1     | 2.84  | 1

  Inverted Index (TF-IDF)

  Term    | IDF  | Doc 1 | Doc 2 | Doc 3 | Query
  Gossip  | 0.17 | 0.17  | 0     | 0.50  | 0.17
  Jealous | 0.17 | 0     | 0.17  | 0.48  | 0.17

  [Figure: the query and Doc 1, Doc 2, Doc 3 plotted in the gossip-jealous plane]

  Idea? The Euclidean distance between the query and Doc 3 is large, even though the distribution of terms in the query q and the distribution of terms in Doc 3 are very similar.
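  A quick numeric check, using the log-TF coordinates from the table above:

```python
import math

# Log-TF vectors in the (gossip, jealous) plane, from the table above.
points = {
    "Doc 1": (1.0, 0.0),
    "Doc 2": (0.0, 1.0),
    "Doc 3": (2.95, 2.84),
}
query = (1.0, 1.0)

for name, p in points.items():
    print(name, round(math.dist(query, p), 2))  # Euclidean distance to query
# Doc 1 1.0, Doc 2 1.0, Doc 3 2.68 -> by Euclidean distance Doc 3 looks
# *least* similar to the query, despite having the most similar term mix.
```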

  • Thought experiment: take a document d and append it to itself. Call this document d′.
  • "Semantically" d and d′ have the same content
  • The Euclidean distance between the two documents can be quite large

  [Figure: d and d′ plotted in the gossip-jealous plane]

  • The angle between the two documents is 0, corresponding to maximal similarity.
  • Key idea: Rank documents according to the angle with the query, not the distance.

  • The following two notions are equivalent:
    • Rank documents in increasing order of the angle between query and document
    • Rank documents in decreasing order of cosine(query, document)
  • Cosine is a monotonically decreasing function on the interval [0°, 180°]
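  To see why the two orderings coincide, note a few values of the cosine on [0°, 180°]:

  \[ \cos 0^\circ = 1, \quad \cos 45^\circ \approx 0.71, \quad \cos 90^\circ = 0, \quad \cos 180^\circ = -1 \]

  As the angle grows, the cosine only shrinks, so ranking by increasing angle and ranking by decreasing cosine produce the same order.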

  But how, and why, should we be computing cosines?

  Recall the dot product:

  \[ \vec{a} \cdot \vec{b} = |\vec{a}| \, |\vec{b}| \, \cos\theta \]

  where |a| is the magnitude (length) of vector a, |b| is the magnitude (length) of vector b, and θ is the angle between a and b. Rearranging gives cos(θ) = (a · b) / (|a| × |b|). Applied to a query q and a document d over a vocabulary V:

  \[ \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \, |\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2} \; \sqrt{\sum_{i=1}^{|V|} d_i^2}} \]

  where q_i is the tf-idf weight (or whatever) of term i in the query, and d_i is the tf-idf weight (or whatever) of term i in the document. cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
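  A direct transcription of this formula into Python (a minimal sketch for dense weight vectors):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| |d|) for two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Gossip/jealous example: by angle, the query is almost identical to Doc 3.
query = (1.0, 1.0)
doc3 = (2.95, 2.84)
print(round(cosine_similarity(query, doc3), 4))  # 0.9998
```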

  • A vector can be (length-) normalized by dividing each of its components by its length
  • Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
  • Unit vector = a vector whose length is exactly 1 (the unit length)

  [Figure: d and d′ in the gossip-jealous plane; after length normalization both lie on the unit circle]

  \[ \hat{d} = \frac{\vec{d}}{\|\vec{d}\|_2}, \qquad \|\vec{d}\|_2 = \sqrt{\sum_{i=1}^{|V|} d_i^2} \]

  • Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization (verified numerically below).
  • Long and short documents now have comparable weights
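  A small numeric illustration with hypothetical raw-TF weights for d; appending d to itself doubles every count, yet both documents normalize to the same unit vector:

```python
import math

def normalize(v):
    """Divide each component by the vector's L2 length -> a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(round(x / length, 3) for x in v)

# Hypothetical raw-TF weights: d' is d appended to itself, so counts double.
d  = (2.0, 1.0)
d2 = (4.0, 2.0)

print(round(math.dist(d, d2), 2))  # 2.24 -> Euclidean distance is large
print(normalize(d))                # (0.894, 0.447)
print(normalize(d2))               # (0.894, 0.447) -> identical after normalization
```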

  After normalization: for length-normalized q and d,

  \[ \cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i \]

  since |q| = |d| = 1, the denominator of the cosine formula drops out.

  • With the non-negative weights used here, the value of the cosine similarity lies in [0, 1]

  Example

  Document: car insurance auto insurance
  Query: best car insurance
  N = 1,000,000 documents; idf = log10(N/df); tf-wt = 1 + log10(tf-raw)

  Query

  Term      | tf-raw | tf-wt | df    | idf | tf.idf | n'lize
  auto      | 0      | 0     | 5000  | 2.3 | 0      | 0
  best      | 1      | 1     | 50000 | 1.3 | 1.3    | 0.34
  car       | 1      | 1     | 10000 | 2.0 | 2.0    | 0.52
  insurance | 1      | 1     | 1000  | 3.0 | 3.0    | 0.78

  Query length = √(1.3² + 2.0² + 3.0²) ≈ 3.8

  Document

  Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
  auto      | 1      | 1     | 2.3 | 2.3    | 0.46
  best      | 0      | 0     | 1.3 | 0      | 0
  car       | 1      | 1     | 2.0 | 2.0    | 0.40
  insurance | 2      | 1.3   | 3.0 | 3.9    | 0.79

  Doc length = √(2.3² + 2.0² + 3.9²) ≈ 4.9

  Recall that for length-normalized q and d, cos(q, d) = q · d = Σ q_i d_i. Putting the two normalized vectors together:

  Document: car insurance auto insurance
  Query: best car insurance

  Term      | Query tf.idf | Query n'lize | Doc tf.idf | Doc n'lize | Product
  auto      | 0            | 0            | 2.3        | 0.46       | 0
  best      | 1.3          | 0.34         | 0          | 0          | 0
  car       | 2.0          | 0.52         | 2.0        | 0.40       | 0.21
  insurance | 3.0          | 0.78         | 3.9        | 0.79       | 0.62

  Cosine score = 0 + 0 + 0.21 + 0.62 ≈ 0.83: this is the document's ranking score for the query.
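  The score can also be recomputed directly from the un-normalized tf.idf vectors (a sketch; term order is auto, best, car, insurance):

```python
import math

# Un-normalized tf.idf vectors over (auto, best, car, insurance).
query = (0.0, 1.3, 2.0, 3.0)
doc   = (2.3, 0.0, 2.0, 3.9)

q_len = math.sqrt(sum(x * x for x in query))   # ~3.8
d_len = math.sqrt(sum(x * x for x in doc))     # ~4.9

score = sum((qi / q_len) * (di / d_len) for qi, di in zip(query, doc))
print(round(score, 2))   # 0.83
```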

  Summary: Vector Space Ranking

  • Represent the query as a weighted tf-idf vector
  • Represent each document as a weighted tf-idf vector
  • Compute the cosine similarity score for the query vector and each document vector
  • Rank documents with respect to the query by score
  • Return the top K (e.g., K = 10) documents to the user (a sketch of this pipeline follows)
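  A minimal end-to-end sketch of this pipeline, using log-TF × idf weighting and cosine ranking over a tiny hypothetical collection (the document names and contents are illustrative assumptions):

```python
import math
from collections import Counter

def log_tf(count):
    return 1 + math.log10(count) if count > 0 else 0.0

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector: term -> (1 + log10 tf) * idf."""
    counts = Counter(tokens)
    return {t: log_tf(c) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_tokens, docs, idf, k=10):
    """Score every document against the query and return the top K."""
    q = tfidf_vector(query_tokens, idf)
    scored = [(cosine(q, tfidf_vector(d, idf)), name) for name, d in docs.items()]
    return sorted(scored, reverse=True)[:k]

# Toy collection (hypothetical); idf computed from the collection itself.
docs = {"d1": ["car", "insurance", "auto", "insurance"],
        "d2": ["best", "car", "rental"]}
N = len(docs)
df = Counter(t for d in docs.values() for t in set(d))
idf = {t: math.log10(N / c) for t, c in df.items()}

print(rank(["best", "car", "insurance"], docs, idf, k=10))
```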
  How Similar Are the Novels?

  SaS : Sense and Sensibility
  PaP : Pride and Prejudice
  WH : Wuthering Heights

  Term frequencies (counts)

  term      | SaS | PaP | WH
  affection | 115 | 58  | 20
  jealous   | 10  | 7   | 11
  gossip    | 2   | 0   | 6
  wuthering | 0   | 0   | 38

  Note: to simplify this example, we don't do idf weighting.

  Log frequency weighting (1 + log10 tf)

  term      | SaS  | PaP  | WH
  affection | 3.06 | 2.76 | 2.30
  jealous   | 2.00 | 1.85 | 2.04
  gossip    | 1.30 | 0    | 1.78
  wuthering | 0    | 0    | 2.58

  After length normalization

  term      | SaS   | PaP   | WH
  affection | 0.789 | 0.832 | 0.524
  jealous   | 0.515 | 0.555 | 0.465
  gossip    | 0.335 | 0     | 0.405
  wuthering | 0     | 0     | 0.588

  cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
  cos(SaS, WH) ≈ 0.79
  cos(PaP, WH) ≈ 0.69
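  The whole example can be reproduced in a few lines (a sketch; log weighting with base-10 logs and no idf, as in the tables above):

```python
import math

# Term counts from the table above, in the order (SaS, PaP, WH).
counts = {
    "affection": (115, 58, 20),
    "jealous":   (10, 7, 11),
    "gossip":    (2, 0, 6),
    "wuthering": (0, 0, 38),
}

def log_tf(c):
    return 1 + math.log10(c) if c > 0 else 0.0

# Log-weighted, then length-normalized, vector per novel.
novels = ["SaS", "PaP", "WH"]
weighted = {n: [log_tf(counts[t][i]) for t in counts] for i, n in enumerate(novels)}
unit = {n: [w / math.sqrt(sum(x * x for x in v)) for w in v]
        for n, v in weighted.items()}

def cos(a, b):
    return sum(x * y for x, y in zip(unit[a], unit[b]))

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```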