The Vector Space Model (VSM)
- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space

Doc 1 : makan makan
Doc 2 : makan nasi

Incidence Matrix (Binary TF)

Term  | Doc 1 | Doc 2
Makan |   1   |   1
Nasi  |   0   |   1
Inverted Index - Raw TF

Term  | Doc 1 | Doc 2
Makan |   2   |   1
Nasi  |   0   |   1
Inverted Index - Logarithmic TF (w = 1 + log10(tf) if tf > 0, else 0)

Term  | Doc 1 | Doc 2
Makan |  1.3  |   1
Nasi  |   0   |   1
Inverted Index - TF-IDF (idf = log10(N/df), N = 2; df(makan) = 2 so idf = 0, df(nasi) = 1 so idf = log10(2) ≈ 0.3)

Term  | IDF | Doc 1 | Doc 2
Makan |  0  |   0   |   0
Nasi  | 0.3 |   0   |  0.3
Now add a rare term to Doc 1: the common term makan still scores 0, while the rare terms get weight.

Doc 1 : makan makan jagung
Doc 2 : makan nasi

Term   | Raw TF (D1, D2) | Log TF (D1, D2) | IDF | TF-IDF (D1, D2)
Makan  |      2, 1       |     1.3, 1      |  0  |      0, 0
Nasi   |      0, 1       |      0, 1       | 0.3 |      0, 0.3
Jagung |      1, 0       |      1, 0       | 0.3 |     0.3, 0
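The weighting schemes above (raw TF, logarithmic TF, tf-idf) can be sketched in a few lines of Python. A minimal sketch over the slides' two-document toy collection, assuming log base 10 as in the tables:

```python
import math

# Toy two-document collection from the slides
# ("makan" = eat, "nasi" = rice in Indonesian).
docs = {
    "Doc 1": ["makan", "makan"],
    "Doc 2": ["makan", "nasi"],
}
N = len(docs)

def raw_tf(term, doc):
    return docs[doc].count(term)

def log_tf(term, doc):
    # 1 + log10(tf) if the term occurs, else 0
    tf = raw_tf(term, doc)
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term):
    df = sum(1 for toks in docs.values() if term in toks)
    return math.log10(N / df)

def tf_idf(term, doc):
    return log_tf(term, doc) * idf(term)

for term in ("makan", "nasi"):
    print(term, [round(tf_idf(term, d), 2) for d in docs])
# makan -> [0.0, 0.0]   (it occurs everywhere, so idf = 0)
# nasi  -> [0.0, 0.3]
```

Note how makan, which appears in every document, ends up with tf-idf 0 regardless of its term frequency.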
- Terms are axes of the space
- Documents are points or vectors in this space
- So we have a |V|-dimensional vector space
- The weight can be anything: binary, TF, TF-IDF, and so on.
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors - most entries are zero.
How About The Query?

Query as a vector too...
- Key idea 1: represent documents as vectors in the space
- Key idea 2: do the same for queries: represent them as vectors in the space
PROXIMITY?

- Proximity = similarity of vectors
- Proximity ≈ inverse of distance
- The documents with the greatest proximity to the query receive the highest scores, and are therefore ranked first.

How to Measure Vector Space Proximity?
- First cut: distance between two points (= distance between the end points of the two vectors), i.e. Euclidean distance
- But Euclidean distance is a bad idea . . . because Euclidean distance is large for vectors of different lengths
Doc 1 : gossip
Doc 2 : jealous
Doc 3 : gossip jealous gossip jealous gossip jealous … (both terms repeated many times)
Query : gossip jealous

Inverted Index - Logarithmic TF

Term    | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  |   1   |   0   | 2.84  |   1
Jealous |   0   |   1   | 2.95  |   1

Inverted Index - TF-IDF (N = 3; each term occurs in 2 of the 3 documents, so idf = log10(3/2) ≈ 0.17)

Term    | IDF  | Doc 1 | Doc 2 | Doc 3 | Query
Gossip  | 0.17 | 0.17  |   0   | 0.48  | 0.17
Jealous | 0.17 |   0   | 0.17  | 0.50  | 0.17

[Figure: the tf-idf vectors plotted on gossip/jealous axes; the query points in almost the same direction as Doc 3.]

Idea? The Euclidean distance between the query and Doc 3 is large even though the distribution of terms in the query q and the distribution of terms in the document Doc 3 are very similar.
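The mismatch between distance and similarity can be checked numerically. A small sketch using illustrative tf-idf coordinates on the (gossip, jealous) axes, matching the example's tables (the underlying term counts are assumptions):

```python
import math

# tf-idf coordinates (gossip, jealous) for the three documents and the query.
vecs = {
    "Doc 1": (0.17, 0.0),
    "Doc 2": (0.0, 0.17),
    "Doc 3": (0.48, 0.50),
    "Query": (0.17, 0.17),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

q = vecs["Query"]
for name in ("Doc 1", "Doc 2", "Doc 3"):
    d = vecs[name]
    print(name, round(euclidean(q, d), 2), round(cosine(q, d), 2))
# Doc 3 is the *farthest* from the query by Euclidean distance,
# yet the *closest* by angle (cosine near 1).
```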
- Thought experiment: take a document d and append it to itself; call this document d′.
- "Semantically" d and d′ have the same content.
- The Euclidean distance between the two documents can be quite large.
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to their angle with the query, not their distance.
- The following two notions are equivalent:
  - Rank documents in decreasing order of the angle between query and document
  - Rank documents in increasing order of cosine(query, document)
- Cosine is a monotonically decreasing function of the angle on the interval [0°, 180°].
But how – and why – should we be computing cosines?
a · b = |a| × |b| × cos(θ)

where:
  |a| is the magnitude (length) of vector a
  |b| is the magnitude (length) of vector b
  θ is the angle between a and b

so cos(θ) = (a · b) / (|a| × |b|)

Applied to a query q and a document d over the |V| vocabulary terms:

cos(q, d) = (q · d) / (|q| × |d|)
          = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i²) × sqrt(Σ_{i=1..|V|} d_i²) )

q_i is the tf-idf weight (or whatever) of term i in the query
d_i is the tf-idf weight (or whatever) of term i in the document
cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
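The cosine formula translates directly into code. A minimal sketch, with term-weight vectors represented as plain lists (the example values are made up):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| * |d|) for equal-length weight lists."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Vectors pointing the same way score 1, orthogonal vectors score 0:
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0
```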
- A vector can be (length-)normalized by dividing each of its components by its length
- Dividing a vector by its length makes it a unit (length) vector (on the surface of the unit hypersphere)
- Unit vector = a vector whose length is exactly 1 (the unit length)

[Figure: after length-normalization, d and d′ map to the same point on the unit circle.]

- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- Long and short documents now have comparable weights.

After normalization, for q and d length-normalized:

cos(q, d) = q · d = Σ_{i=1..|V|} q_i d_i

- The value of the cosine similarity lies in [0, 1] (tf-idf weights are never negative).
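The d / d′ thought experiment can be verified numerically. A small sketch with a made-up raw-tf weight vector (with raw tf, appending a document to itself doubles every component):

```python
import math

def normalize(v):
    # Divide each component by the vector's length -> unit vector.
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

d = (2.0, 1.0, 3.0)                # toy term-weight vector (made up)
d_prime = tuple(2 * x for x in d)  # d appended to itself: every tf doubles

# After normalization the two vectors coincide ...
print(normalize(d))
print(normalize(d_prime))

# ... and for unit vectors, cosine similarity is just the dot product:
dot = sum(x * y for x, y in zip(normalize(d), normalize(d_prime)))
print(round(dot, 6))  # 1.0 -> maximal similarity
```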
Example

Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000

Query:

Term      | tf-raw | tf-wt | df    | idf | tf.idf | n'lize
auto      |   0    |   0   |  5000 | 2.3 |   0    |   0
best      |   1    |   1   | 50000 | 1.3 |  1.3   |  0.34
car       |   1    |   1   | 10000 | 2.0 |  2.0   |  0.52
insurance |   1    |   1   |  1000 | 3.0 |  3.0   |  0.78
Document:

Term      | tf-raw | tf-wt | idf | tf.idf | n'lize
auto      |   1    |   1   | 2.3 |  2.3   |  0.46
best      |   0    |   0   | 1.3 |   0    |   0
car       |   1    |   1   | 2.0 |  2.0   |  0.40
insurance |   2    |  1.3  | 3.0 |  3.9   |  0.79

(Document vector length = sqrt(2.3² + 2.0² + 3.9²) ≈ 4.95.)
Since q and d are now length-normalized, cos(q, d) = Σ q_i d_i:

Term      | Query n'lize | Document n'lize | Product
auto      |      0       |      0.46       |    0
best      |     0.34     |       0         |    0
car       |     0.52     |      0.40       |   0.21
insurance |     0.78     |      0.79       |   0.62

cos(q, d) ≈ 0 + 0 + 0.21 + 0.62 ≈ 0.83

Summary: Vector Space Ranking
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
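The steps above can be sketched end-to-end in Python. A minimal sketch assuming the worked example's collection statistics (N and the df values are the example's, not real statistics) and applying log-tf × idf with length normalization to both query and document:

```python
import math

N = 1_000_000  # collection size from the example

# Document frequencies from the example's table.
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

def ltc_vector(tokens):
    """Weight each term with (1 + log10 tf) * log10(N/df), then length-normalize."""
    w = {}
    for t in set(tokens):
        tf = tokens.count(t)
        w[t] = (1 + math.log10(tf)) * math.log10(N / df[t])
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

def score(q, d):
    # Both vectors are unit length, so cosine = dot product.
    return sum(w * d.get(t, 0.0) for t, w in q.items())

q = ltc_vector("best car insurance".split())
d = ltc_vector("car insurance auto insurance".split())
print(round(score(q, d), 2))  # 0.83
```

Ranking a whole collection just means computing `score(q, d)` for every document vector and sorting in decreasing order of score.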
Cosine similarity amongst 3 documents

How similar are the novels?
  SaS : Sense and Sensibility
  PaP : Pride and Prejudice
  WH  : Wuthering Heights

Term frequencies (counts):

term      | SaS | PaP | WH
affection | 115 | 58  | 20
jealous   | 10  |  7  | 11
gossip    |  2  |  0  |  6
wuthering |  0  |  0  | 38
Log frequency weighting (1 + log10 tf):

term      | SaS  | PaP  | WH
affection | 3.06 | 2.76 | 2.30
jealous   | 2.00 | 1.85 | 2.04
gossip    | 1.30 |  0   | 1.78
wuthering |  0   |  0   | 2.58

After length normalization:

term      | SaS   | PaP   | WH
affection | 0.789 | 0.832 | 0.524
jealous   | 0.515 | 0.555 | 0.465
gossip    | 0.335 |   0   | 0.405
wuthering |   0   |   0   | 0.588
cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
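The table arithmetic can be checked in a few lines. A sketch using the log-weighted values above (note the WH wuthering weight 2.58 = 1 + log10(38) is computed here, as the slide's table omits it):

```python
import math

# Log-frequency weighted vectors
# (axes: affection, jealous, gossip, wuthering).
novels = {
    "SaS": (3.06, 2.00, 1.30, 0.00),
    "PaP": (2.76, 1.85, 0.00, 0.00),
    "WH":  (2.30, 2.04, 1.78, 2.58),
}

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def cos(a, b):
    # Normalize first, then cosine is just the dot product.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

print(round(cos(novels["SaS"], novels["PaP"]), 2))  # 0.94
print(round(cos(novels["SaS"], novels["WH"]), 2))   # 0.79
print(round(cos(novels["PaP"], novels["WH"]), 2))   # 0.69
```

As expected, the two Austen novels (SaS, PaP) are far more similar to each other than either is to Wuthering Heights.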