The Vector Space Model
The Vector Space Model (VSM):
- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space

Example:
Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF)
Term      Doc 1   Doc 2
makan       1       1
nasi        0       1
Inverted Index (Raw TF)
Term      Doc 1   Doc 2
makan       2       1
nasi        0       1
Inverted Index (Logarithmic TF)
Logarithmic TF weight = 1 + log10(tf) for tf > 0, else 0; e.g. 1 + log10(2) ≈ 1.3.

Term      Doc 1   Doc 2
makan      1.3      1
nasi        0       1
Inverted Index (TF-IDF)
IDF = log10(N / df). With N = 2 documents: makan occurs in both documents (df = 2), so IDF(makan) = log10(2/2) = 0; nasi occurs only in Doc 2 (df = 1), so IDF(nasi) = log10(2/1) = 0.3.

Term      IDF    Doc 1   Doc 2
makan      0       0       0
nasi      0.3      0      0.3

Comparison of the weighting schemes:

            Binary TF      Raw TF     Logarithmic TF     TF-IDF
Term      Doc 1  Doc 2  Doc 1  Doc 2   Doc 1  Doc 2    Doc 1  Doc 2
makan       1      1      2      1      1.3     1        0      0
nasi        0      1      0      1       0      1        0     0.3

Modified example: Doc 1 : makan makan jagung, Doc 2 : makan nasi. Now jagung has df = 1, so IDF(jagung) = 0.3 and its TF-IDF weight in Doc 1 is 1 × 0.3 = 0.3, while makan (which appears in every document) still has weight 0 everywhere.
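The four weighting schemes above can be sketched in a few lines of Python. This is a minimal illustration of the formulas on the toy example, not a production indexer:

```python
import math

docs = {"Doc 1": "makan makan".split(), "Doc 2": "makan nasi".split()}
N = len(docs)
vocab = sorted({t for words in docs.values() for t in words})

# Raw term frequency of each vocabulary term in each document.
tf = {d: {t: words.count(t) for t in vocab} for d, words in docs.items()}

# Document frequency: number of documents containing each term.
df = {t: sum(1 for d in docs if tf[d][t] > 0) for t in vocab}

def log_tf(f):
    # Logarithmic TF: 1 + log10(tf) for tf > 0, else 0.
    return 1 + math.log10(f) if f > 0 else 0.0

def tf_idf(f, t):
    # TF-IDF: logarithmic TF times IDF = log10(N / df).
    return log_tf(f) * math.log10(N / df[t])

for d in docs:
    for t in vocab:
        f = tf[d][t]
        print(d, t, f, round(log_tf(f), 2), round(tf_idf(f, t), 2))
```

With these two documents, makan gets TF-IDF weight 0 everywhere (it occurs in every document), while nasi gets 0.3 in Doc 2, matching the table.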
- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space
- The weight can be anything: binary, TF, TF-IDF, and so on.
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors - most entries are zero.
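Because most entries are zero, real systems never store the dense |V|-dimensional vector. A minimal sketch of the usual sparse alternative, a term-to-weight map (here with raw TF):

```python
def sparse_vector(text):
    """Raw-TF vector stored sparsely: only terms that occur are kept;
    every absent term has an implicit weight of 0."""
    vec = {}
    for term in text.split():
        vec[term] = vec.get(term, 0) + 1
    return vec

v = sparse_vector("makan makan")
print(v)                  # {'makan': 2}
print(v.get("nasi", 0))   # 0 - absent terms are zero
```

Storage is proportional to the number of distinct terms in the document, not to |V|.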
How About The Query?
Represent the query as a vector too:
- Key idea 1: Represent documents as vectors in the space
- Key idea 2: Do the same for queries: represent them as vectors in the space
- Key idea 3: Rank documents according to their proximity to the query in this space

PROXIMITY?
- Proximity = similarity of vectors
- Proximity ≈ inverse of distance
- The document with the greatest proximity to the query gets the highest score, and therefore the highest rank.

How to Measure Vector Space Proximity?
- First cut: distance between two points (= distance between the end points of the two vectors)
- But Euclidean distance is a bad idea . . .
- . . . because Euclidean distance is large for vectors of different lengths.

Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous ... (gossip : 90x, jealous : 70x)
Query : gossip jealous

              Logarithmic TF              IDF            TF-IDF
Term      Doc 1  Doc 2  Doc 3  Query          Doc 1  Doc 2  Doc 3  Query
gossip      1      0    2.95     1     0.17    0.17    0    0.50   0.17
jealous     0      1    2.84     1     0.17     0    0.17   0.48   0.17

[Figure: Doc 1, Doc 2, Doc 3, and the query plotted in the gossip-jealous plane.]

Idea? The Euclidean distance between the query and Doc 3 is large even though the distribution of terms in the query q and the distribution of terms in the document Doc 3 are very similar.
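This failure is easy to check numerically. The sketch below uses the tf-idf values from the table above (a toy calculation, not library code):

```python
import math

# TF-IDF vectors over the axes (gossip, jealous), from the table.
vectors = {
    "Doc 1": (0.17, 0.0),
    "Doc 2": (0.0, 0.17),
    "Doc 3": (0.50, 0.48),
    "query": (0.17, 0.17),
}

def euclidean(a, b):
    # Straight-line distance between the end points of the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

q = vectors["query"]
for name in ("Doc 1", "Doc 2", "Doc 3"):
    print(name, round(euclidean(q, vectors[name]), 3))
```

Doc 3 comes out farthest from the query even though its term distribution is the closest match.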
- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.

[Figure: d and d′ plotted in the gossip-jealous plane; they lie on the same ray from the origin.]

- Key idea: Rank documents according to their angle with the query, not their distance.
- The following two notions are equivalent:
  - Rank documents in increasing order of the angle between query and document
  - Rank documents in decreasing order of cosine(query, document)
- Cosine is a monotonically decreasing function on the interval [0°, 180°]
But how – and why – should we be computing cosines?

Recall the dot product: a · b = |a| × |b| × cos(θ), where |a| is the magnitude (length) of vector a, |b| is the magnitude of vector b, and θ is the angle between a and b. Therefore:

  cos(θ) = (a · b) / (|a| × |b|)

Applied to a query q and a document d over a vocabulary of |V| terms:

  cos(q, d) = (q · d) / (|q| |d|)
            = Σ_{i=1..|V|} q_i d_i / ( √(Σ_{i=1..|V|} q_i²) × √(Σ_{i=1..|V|} d_i²) )

- q_i is the tf-idf weight (or whatever) of term i in the query
- d_i is the tf-idf weight (or whatever) of term i in the document
- cos(q, d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
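A direct transcription of the formula into Python (dense lists for clarity; the vector values below are made up for illustration):

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|), over aligned components."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd)

# A document appended to itself points in the same direction,
# so its cosine with the original is 1 (maximal similarity):
d = [0.17, 0.0, 0.3]
d2 = [2 * x for x in d]
print(round(cosine(d, d2), 6))   # 1.0: angle of 0 degrees
```

Note that scaling a vector does not change its cosine with anything, which is exactly the length-invariance the Euclidean distance lacked.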
- A vector can be (length-) normalized by dividing each of its components by its length
- Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
- Unit vector = a vector whose length is exactly 1 (the unit length)

[Figure: d and d′ before normalization have different lengths; after length-normalization both lie on the unit circle and coincide.]

- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- Long and short documents now have comparable weights

After normalization, for length-normalized q and d (|q| = |d| = 1):

  cos(q, d) = q · d = Σ_{i=1..|V|} q_i d_i

- The value of the cosine similarity is in [0, 1] (for non-negative weights)
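Length normalization is a one-liner; the sketch below also shows that d and its doubled copy become identical unit vectors (toy values, chosen for illustration):

```python
import math

def normalize(v):
    """Divide each component by the vector's Euclidean length,
    producing a unit vector."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d = [2.0, 3.0]
d_doubled = [4.0, 6.0]   # d appended to itself
# After normalization the two vectors coincide:
print(normalize(d))
print(normalize(d_doubled))

# For unit vectors, cosine similarity is just the dot product:
u, w = normalize([1.0, 0.0]), normalize([1.0, 1.0])
print(sum(a * b for a, b in zip(u, w)))   # ~0.707, i.e. cos(45 degrees)
```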
Example: Document: car insurance auto insurance; Query: best car insurance; N = 1,000,000.

Query weighting:
Term        tf-raw  tf-wt    df     idf   tf.idf  normalized
auto          0       0     5000    2.3     0        0
best          1       1    50000    1.3    1.3     0.34
car           1       1    10000    2.0    2.0     0.52
insurance     1       1     1000    3.0    3.0     0.78

Query length = √(1.3² + 2.0² + 3.0²) ≈ 3.8
Document weighting:
Term        tf-raw  tf-wt   idf   tf.idf  normalized
auto          1       1     2.3    2.3      0.47
best          0       0     1.3     0        0
car           1       1     2.0    2.0      0.41
insurance     2      1.3    3.0    3.9      0.79

Doc length = √(2.3² + 2.0² + 3.9²) ≈ 4.9
Putting it together (for the normalized vectors, the cosine is just the dot product):

             Query            Document
Term      tf.idf  norm     tf.idf  norm    product
auto         0      0        2.3   0.47       0
best        1.3   0.34        0      0        0
car         2.0   0.52       2.0   0.41     0.21
insurance   3.0   0.78       3.9   0.79     0.62

Score(q, d) ≈ 0 + 0 + 0.21 + 0.62 = 0.83
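The worked example can be reproduced programmatically. The sketch below applies (1 + log10 tf) × log10(N/df) weighting with cosine normalization to both sides; the printed score differs from hand-rounded table values only in the last digits:

```python
import math

N = 1_000_000
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "car": 1, "insurance": 2}

def weights(tf):
    # (1 + log10 tf) * log10(N/df) per term, then length-normalize.
    w = {t: (1 + math.log10(f)) * math.log10(N / df[t]) for t, f in tf.items()}
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

q, d = weights(query_tf), weights(doc_tf)
# Dot product over the query's terms (absent terms contribute 0).
score = sum(q[t] * d.get(t, 0.0) for t in q)
print(round(score, 2))
```

The score works out to about 0.83, dominated by the insurance term, which carries the highest idf and occurs twice in the document.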
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user
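The five steps above amount to a few dozen lines. A minimal in-memory sketch (no positional index or efficiency tricks; the document names and contents are toy data):

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """Sparse tf-idf vector: (1 + log10 tf) * log10(N / df)."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(f)) * math.log10(N / df[t]) for t, f in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs, k=10):
    N = len(docs)
    df = Counter(t for toks in docs.values() for t in set(toks))
    vecs = {name: tfidf_vector(toks, df, N) for name, toks in docs.items()}
    # Drop query terms not in the collection, then score and sort.
    qv = tfidf_vector([t for t in query if t in df], df, N)
    scored = sorted(((cosine(qv, v), name) for name, v in vecs.items()), reverse=True)
    return [(name, round(s, 3)) for s, name in scored[:k]]

docs = {
    "Doc 1": "makan makan jagung".split(),
    "Doc 2": "makan nasi".split(),
}
print(rank("nasi makan".split(), docs))
```

For the query "nasi makan", Doc 2 ranks first: makan has idf 0 in this two-document collection, so the score is carried entirely by nasi.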
Example: cosine similarity amongst three novels:
SaS : Sense and Sensibility
PaP : Pride and Prejudice
WH  : Wuthering Heights

Term frequencies (counts). Note: to simplify this example, we don't do idf weighting.

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Log frequency weighting:
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30    0     1.78
wuthering    0      0     2.58

After length normalization:
term        SaS    PaP    WH
affection   0.789  0.832  0.524
jealous     0.515  0.555  0.465
gossip      0.335   0     0.405
wuthering    0      0     0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69
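The novel example can be checked the same way; the sketch below reproduces the three cosine values from the raw counts (log frequency weighting, length normalization, no idf, as the slide states):

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weight_normalized(tf):
    # 1 + log10(tf) for tf > 0, else 0; then length-normalize.
    w = {t: (1 + math.log10(f)) if f > 0 else 0.0 for t, f in tf.items()}
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

vecs = {name: log_weight_normalized(tf) for name, tf in counts.items()}

def cos(a, b):
    # Vectors are already unit length, so cosine = dot product.
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

for pair in (("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")):
    print(pair, round(cos(*pair), 2))
```

Sense and Sensibility and Pride and Prejudice come out most similar, as expected for two Austen novels.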