Conversion from Kullback-Leibler (KL) divergence to Jensen-Shannon (JS) divergence, and cross-entropy intuition
From KL divergence to JS divergence
Kullback-Leibler (KL) divergence D(p, q) is asymmetric: D(p, q) is in general not equal to D(q, p). Jensen-Shannon (JS) divergence can be viewed as its symmetric counterpart.
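As a minimal sketch of this asymmetry (the two discrete distributions p and q below are made up purely for illustration and are not part of the main example):

import numpy as np
import scipy.stats

# Two made-up discrete distributions over the same three outcomes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

# scipy.stats.entropy(p, q) computes the KL divergence D(p, q)
d_pq = scipy.stats.entropy(p, q, base=2)  # approx. 0.54
d_qp = scipy.stats.entropy(q, p, base=2)  # approx. 0.61
# d_pq != d_qp, confirming that KL divergence is asymmetric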
The following code example shows the conversion from KL divergence to JS divergence:
import scipy.stats
import scipy.spatial
import numpy as np

# Start with two normally distributed samples
# with identical standard deviations;
# the two means are 2 standard deviations away from each other
x1 = scipy.stats.norm.rvs(loc=0, scale=1, size=1000)
x2 = scipy.stats.norm.rvs(loc=2, scale=1, size=1000)

# Construct empirical PDFs from these two samples
hist1 = np.histogram(x1, bins=10)
hist1_dist = scipy.stats.rv_histogram(hist1)
hist2 = np.histogram(x2, bins=10)
hist2_dist = scipy.stats.rv_histogram(hist2)

# Evaluate both empirical PDFs on a common grid
X = np.linspace(-4, 6, 10)
Y1 = hist1_dist.pdf(X)
Y2 = hist2_dist.pdf(X)

# Obtain the point-wise mean of the two PDFs Y1 and Y2, denote it as M
M = (Y1 + Y2) / 2

# Compute the Kullback-Leibler divergence between Y1 and M
d1 = scipy.stats.entropy(Y1, M, base=2)
# d1 = 0.406

# Compute the Kullback-Leibler divergence between Y2 and M
d2 = scipy.stats.entropy(Y2, M, base=2)
# d2 = 0.300

# Take the average of d1 and d2
# to get the symmetric Jensen-Shannon divergence
js_dv = (d1 + d2) / 2
# js_dv = 0.353

# Jensen-Shannon distance is the square root of the JS divergence
js_distance = np.sqrt(js_dv)
# js_distance = 0.594

# Check it against scipy's calculation
js_distance_scipy = scipy.spatial.distance.jensenshannon(Y1, Y2)
# js_distance_scipy = 0.493
The difference between the KL-divergence-derived JS distance (0.594) and scipy's JS distance (0.493) is not caused by the coarse binning, since both calculations use the same Y1 and Y2. It comes from the logarithm base: the manual calculation uses base 2, while scipy.spatial.distance.jensenshannon defaults to the natural logarithm; converting, sqrt(0.353 * ln 2) ≈ 0.495, which lines up with scipy's result.
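To confirm this, the base can be passed explicitly (a small check added on top of the original example; base is an existing keyword argument of jensenshannon):

# With base=2, scipy's result is directly comparable to the manual calculation
js_distance_scipy_base2 = scipy.spatial.distance.jensenshannon(Y1, Y2, base=2)
# js_distance_scipy_base2 ≈ 0.59, in line with js_distance above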
Relationship between cross-entropy, entropy, and KL divergence
Cross-entropy H(p, q) is the expected number of bits needed to encode data drawn from distribution p when using a code optimized for distribution q ("bits" implies base 2, from the information theory standpoint; in practical implementations, either the natural base or base 2 can be used):
H(p, q) = -sum(p * log(q))
By rewriting log(q) as log(q) = log(q/p) + log(p), we get the following equation:
H(p, q) = -sum(p * log(q/p)) - sum(p * log(p)) = Dkl(p, q) + H(p)
In other words, the cross-entropy between distributions p and q is the entropy of distribution p plus the KL divergence between p and q. This also gives the intuition behind the KL divergence: it is the number of extra bits needed to encode information drawn from distribution p when using the code we'd build for distribution q instead of the code built for p itself. The more log(q/p) deviates from zero (weighted by p), the more different distribution q is relative to distribution p. A short numerical check of this decomposition follows the list below.
- Entropy H(p) = -sum(p * log(p)): this is the expected number of bits of information carried by a value drawn from distribution p. The higher the entropy, the more chaotic and unpredictable the values from distribution p can be.
- KL divergence Dkl(p, q) = -sum(p * log(q/p)) = sum(p * (log(p) - log(q))) = sum(p * (number of extra bits per outcome)); here "extra" means relative to the number of bits needed when encoding with the code optimized for distribution p itself.
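As a minimal numerical sketch of this decomposition (the discrete distributions p and q below are made up for illustration; scipy.stats.entropy returns the entropy when given one distribution and the KL divergence when given two):

import numpy as np
import scipy.stats

# Two made-up discrete distributions used only to check the identity numerically
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])

# Entropy of p: H(p) = -sum(p * log2(p))
entropy_p = scipy.stats.entropy(p, base=2)

# KL divergence: Dkl(p, q) = sum(p * (log2(p) - log2(q)))
kl_pq = scipy.stats.entropy(p, q, base=2)

# Cross-entropy computed directly from its definition: H(p, q) = -sum(p * log2(q))
cross_entropy = -np.sum(p * np.log2(q))

# The decomposition holds: H(p, q) = H(p) + Dkl(p, q)
assert np.isclose(cross_entropy, entropy_p + kl_pq)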