I am asking if someone knows how to calculate the baseline for cosine similarity?
What do you mean by "baseline"? And how do you intend to use the "baseline"?
In https://stackoverflow.com/questions/53796642/how-to-calculate-the-baseline-for-cosine-similarity, someone (you?) wrote:
I am calculating Cosine Similarity and as a first step I am calculating the SUMPRODUCT and I am using the range of (B2:H2) as my baseline.
=SUMPRODUCT($B$2:$H$2;B2:H2)
As a second step I am calculating the Square Root of the Sum of the Squares.
=SQRT(SUMSQ($B$2:$H$2))*SQRT(SUMSQ(B2:H2))
In this calculation, I am also using the range of (B2:H2) as my baseline. I have two questions:
How should I interpret the values of the baseline?
How should I determine what the values for the baseline should be?
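That makes no sense to me. In row 2, where $B$2:$H$2 and B2:H2 are the same range, dividing the first result by the second always gives 1, regardless of the values in B2:H2 (as long as they are not all zero). Proof: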
Code:
SQRT(SUMSQ(B2:H2))*SQRT(SUMSQ(B2:H2)) = SQRT(SUMSQ(B2:H2))^2 = SUMSQ(B2:H2)
SUMSQ(B2:H2) = B2^2 + C2^2 +...+ H2^2 = SUMPRODUCT(B2:H2,B2:H2)
So SUMPRODUCT(B2:H2,B2:H2) / SQRT(SUMSQ(B2:H2))^2 = SUMPRODUCT(B2:H2,B2:H2) / SUMPRODUCT(B2:H2,B2:H2) = 1
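If it helps to check the proof numerically outside Excel, here is a minimal Python sketch. The function name `cosine_similarity` and the values in `row2` are my own placeholders standing in for B2:H2, not anything from your workbook.

```python
import math

def cosine_similarity(a, b):
    """Mirror of SUMPRODUCT(a, b) / (SQRT(SUMSQ(a)) * SQRT(SUMSQ(b)))."""
    dot = sum(x * y for x, y in zip(a, b))      # SUMPRODUCT(a, b)
    norm_a = math.sqrt(sum(x * x for x in a))   # SQRT(SUMSQ(a))
    norm_b = math.sqrt(sum(y * y for y in b))   # SQRT(SUMSQ(b))
    return dot / (norm_a * norm_b)

# Hypothetical values standing in for B2:H2; any row that is not all zeros works.
row2 = [3, 0, 1, 4, 0, 2, 5]

# Comparing the row with itself, as the row-2 formula does.
self_similarity = cosine_similarity(row2, row2)
print(self_similarity)
```

Whatever nonzero values you put in `row2`, the result should be 1 up to floating-point rounding, which is the point of the proof above.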
Is there some significance to your using $B$2:$H$2 v. B2:H2?
For example, do you intend to copy the formula down a column so that the second row becomes:
SUMPRODUCT($B$2:$H$2, B3:H3) / ( SQRT(SUMSQ($B$2:$H$2)) * SQRT(SUMSQ(B3:H3)) )
which can also be written:
SUMPRODUCT($B$2:$H$2, B3:H3) / SQRT(SUMSQ($B$2:$H$2)) / SQRT(SUMSQ(B3:H3))
If so, I can understand what you might mean by calling B2:H2 a "baseline".
But even so, it does not make sense to me to ask "what the values for the baseline should be?". It is what it is!
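To illustrate what copying the formula down a column would do, here is a rough Python sketch, assuming that is indeed your intent. The helper names `sumproduct` and `sumsq` and all the row values are hypothetical; they only imitate the Excel functions and a made-up sheet. It also checks that the two ways of writing the formula agree.

```python
import math

def sumproduct(a, b):
    # Imitates Excel's SUMPRODUCT for two equal-length ranges.
    return sum(x * y for x, y in zip(a, b))

def sumsq(a):
    # Imitates Excel's SUMSQ for one range.
    return sum(x * x for x in a)

# Hypothetical data: the fixed row plays the role of $B$2:$H$2.
baseline = [1, 2, 0, 1, 0, 3, 1]
rows = [
    [1, 2, 0, 1, 0, 3, 1],  # row 2: the baseline itself
    [2, 4, 0, 2, 0, 6, 2],  # row 3: the baseline doubled
    [0, 0, 5, 0, 2, 0, 0],  # row 4: nonzero only where the baseline is zero
]

# First form: one division by the product of the two square roots.
form_a = [
    sumproduct(baseline, r) / (math.sqrt(sumsq(baseline)) * math.sqrt(sumsq(r)))
    for r in rows
]

# Second form: two successive divisions.
form_b = [
    sumproduct(baseline, r) / math.sqrt(sumsq(baseline)) / math.sqrt(sumsq(r))
    for r in rows
]

for a, b in zip(form_a, form_b):
    print(a, b)
```

In this made-up data, the first two rows point in the same direction as the baseline, so they score 1 (up to rounding), and the last row shares no nonzero position with it, so it scores 0.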
Although the cosine similarity can be defined for any pair of vectors B2:H2 and B3:H3, the articles that I pointed to in posting #2 use cosine similarity to compare text.
Suppose you have 2 sentences that each use some combination (zero or more) of 7 words. B2:H2 is the count of each word in sentence #1, and B3:H3 is the count of each word in sentence #2. The cosine similarity calculated above is a measure of similarity, where 1 means the same relative word counts and 0 means completely different (no shared words).
In that context, to call sentence #1 (represented by B2:H2) the "baseline" simply means that all other sentences are compared to it. It has no numeric value.
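As a sketch of the text-comparison idea, the following Python turns sentences into word-count rows and compares each one to the first. The 7-word vocabulary, the sentences, and the function names are all invented for illustration; they are not taken from the articles or from your data.

```python
import math
from collections import Counter

# Hypothetical 7-word vocabulary, one word per column B..H.
vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def word_counts(sentence):
    # Count how often each vocabulary word occurs in the sentence.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Sentence #1 is the "baseline": every other sentence is compared to it.
baseline = word_counts("the cat sat on the mat")
similar = word_counts("the dog sat on the mat")
unrelated = word_counts("dog ran")

sim_similar = cosine_similarity(baseline, similar)
sim_unrelated = cosine_similarity(baseline, unrelated)
print(sim_similar, sim_unrelated)
```

With these invented sentences, the second one shares most of its words with the baseline and scores about 0.875, while the third shares none and scores 0.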