Conversion from Kullback-Leibler (KL) divergence to Jensen-Shannon (JS) divergence, and cross-entropy intuition

Shan Dou
3 min read · Jul 26, 2021


From KL divergence to JS divergence

Kullback-Leibler (KL) divergence D(p, q) is asymmetric: in general, D(p, q) ≠ D(q, p). Jensen-Shannon (JS) divergence can be viewed as its symmetric counterpart: it averages the KL divergence of p against the point-wise mean M = (p + q)/2 and the KL divergence of q against M, so swapping p and q leaves it unchanged.
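
A quick way to see the asymmetry is to compare the two orderings directly. Below is a minimal sketch (the distributions p and q are arbitrary illustrative values, not from the example that follows):

import numpy as np
import scipy.stats
import scipy.spatial
# Two arbitrary discrete distributions (hypothetical example values)
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])
# KL divergence changes when the arguments are swapped
d_pq = scipy.stats.entropy(p, q, base=2)  # approx. 1.66 bits
d_qp = scipy.stats.entropy(q, p, base=2)  # approx. 1.97 bits
# Jensen-Shannon distance is the same in either direction
js_pq = scipy.spatial.distance.jensenshannon(p, q, base=2)  # approx. 0.63
js_qp = scipy.spatial.distance.jensenshannon(q, p, base=2)  # approx. 0.63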

The following code example shows the conversion from KL divergence to JS divergence:

import scipy.stats
import scipy.spatial
import numpy as np
# Start with two normally distributed samples
# with identical standard deviations;
# The two means are 2 standard deviations away from each other
x1 = scipy.stats.norm.rvs(loc=0, scale=1, size=1000)
x2 = scipy.stats.norm.rvs(loc=2, scale=1, size=1000)
# Construct empirical PDFs from these two samples
hist1 = np.histogram(x1, bins=10)
hist1_dist = scipy.stats.rv_histogram(hist1)
hist2 = np.histogram(x2, bins=10)
hist2_dist = scipy.stats.rv_histogram(hist2)
X = np.linspace(-4, 6, 10)
Y1 = hist1_dist.pdf(X)
Y2 = hist2_dist.pdf(X)
# Obtain point-wise mean of the two PDFs Y1 and Y2, denote it as M
M = (Y1 + Y2) / 2
# Compute Kullback-Leibler divergence between Y1 and M
d1 = scipy.stats.entropy(Y1, M, base=2)
# d1 = 0.406
# Compute Kullback-Leibler divergence between Y2 and M
d2 = scipy.stats.entropy(Y2, M, base=2)
# d2 = 0.300
# Averaging d1 and d2 gives the symmetric
# Jensen-Shannon divergence
js_dv = (d1 + d2) / 2
# js_dv = 0.353
# Jensen-Shannon distance is the square root of the JS divergence
js_distance = np.sqrt(js_dv)
# js_distance = 0.594
# Check it against scipy's calculation
js_distance_scipy = scipy.spatial.distance.jensenshannon(Y1, Y2)
# js_distance_scipy = 0.493

The difference between the KL-divergence-derived JS distance and scipy's JS distance is mostly a matter of logarithm base: the manual calculation uses base 2 (bits), whereas scipy.spatial.distance.jensenshannon defaults to the natural logarithm (nats). The very coarse 10-bin empirical PDFs contribute only a small additional difference.
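
A minimal check of the base effect, reusing Y1, Y2, and js_distance from the snippet above (the base keyword of jensenshannon selects the logarithm base):

# Ask scipy for the base-2 version to match the manual calculation
js_distance_scipy_base2 = scipy.spatial.distance.jensenshannon(Y1, Y2, base=2)
# js_distance_scipy_base2 is now close to js_distance (approx. 0.59 here)
# Equivalently, rescale the natural-log result into bits:
# js_distance_scipy / np.sqrt(np.log(2)) is also approx. 0.59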

Relationship between cross-entropy, entropy, and KL divergence

Cross-entropy H(p, q) is the expected number of bits needed to encode outcomes drawn from distribution p when we use a code optimized for distribution q (base 2 from an information theory standpoint; in practical implementations, either the natural base or base 2 can be used):

H(p, q) = -sum(p * log(q))

By rewriting log(q) as log(q) = log(q/p) + log(p), we get the following equation:

H(p, q) = -sum(p * log(p)) - sum(p * log(q/p)) = H(p) + Dkl(p, q)

In other words, the cross-entropy between distributions p and q is the entropy of distribution p plus the KL divergence between p and q. This also gives the intuition behind the KL divergence: it is the number of extra bits needed to encode outcomes drawn from distribution p when we use the code built for distribution q instead of the optimal code for p. The larger these extra bits are on average, the more distribution q differs from distribution p. (A short numerical check follows the list below.)

  • Entropy H(p) = -sum(p * log(p)); this is the expected number of bits of information embedded in distribution p. The higher the entropy, the more unpredictable the values drawn from distribution p
  • KL divergence Dkl(p, q) = -sum(p * log(q/p)) = sum(p * (log(p) - log(q))) = sum(p * (number of extra bits per outcome)); here "extra" means relative to the number of bits needed to encode information in distribution p with its own optimal code
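
To make the decomposition concrete, here is a small numerical sketch (again with arbitrary illustrative distributions p and q, not values from the article):

import numpy as np
# Two arbitrary discrete distributions (hypothetical example values)
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])
# Entropy of p: H(p) = -sum(p * log2(p))
entropy_p = -np.sum(p * np.log2(p))
# KL divergence: Dkl(p, q) = sum(p * (log2(p) - log2(q)))
kl_pq = np.sum(p * (np.log2(p) - np.log2(q)))
# Cross-entropy: H(p, q) = -sum(p * log2(q))
cross_entropy_pq = -np.sum(p * np.log2(q))
# The decomposition H(p, q) = H(p) + Dkl(p, q) holds numerically
print(np.isclose(cross_entropy_pq, entropy_p + kl_pq))  # True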

Other useful resources

  1. Aurélien Géron: A Short Introduction to Entropy, Cross-Entropy and KL-Divergence
  2. Occam math: What is KL-divergence | KL-divergence vs cross-entropy | Machine learning interview Qs
  3. Analytics University: Kullback Leibler Divergence || Machine Learning || Statistics
