r"""

.. _quantum_natural_gradient:

Quantum natural gradient
========================

.. meta::
    :property="og:description": The quantum natural gradient method can achieve faster convergence for quantum machine
        learning problems by taking into account the intrinsic geometry of qubits.
    :property="og:image": https://pennylane.ai/qml/_images/qng_optimization.png

.. related::

   tutorial_backprop Quantum gradients with backpropagation
   tutorial_vqe_qng Accelerating VQE with quantum natural gradient

*Author: Josh Izaac — Posted: 11 October 2019. Last updated: 25 January 2021.*

This example demonstrates the quantum natural gradient optimization technique
for variational quantum circuits, originally proposed in
`Stokes et al. (2019) <https://arxiv.org/abs/1909.02108>`__.

Background
----------

The most successful class of quantum algorithms for use on near-term noisy quantum hardware
is the so-called variational quantum algorithm. As laid out in the
:ref:`Concepts section <glossary_variational_circuit>`, in variational quantum algorithms
a low-depth parametrized quantum circuit ansatz is chosen, and a problem-specific
observable measured. A classical optimization loop is then used to find
the set of quantum parameters that *minimize* a particular measurement expectation value
of the quantum device. Examples of such algorithms include the :doc:`variational quantum
eigensolver (VQE) <tutorial_vqe>`, the
`quantum approximate optimization algorithm (QAOA) <https://arxiv.org/abs/1411.4028>`__,
and :ref:`quantum neural networks (QNN) <quantum_neural_net>`.

Most recent demonstrations
of variational quantum algorithms have used gradient-free classical optimization
methods, such as the Nelder-Mead algorithm. However, the parameter-shift rule
(as implemented in PennyLane) allows the user to automatically compute
analytic gradients of quantum circuits. This opens up the possibility to train
quantum computing hardware using gradient descent—the same method used to train
deep learning models.
However, one caveat has surfaced with gradient descent: how do we choose the optimal
step size for our variational quantum algorithms, to ensure successful and
efficient optimization?

The natural gradient
^^^^^^^^^^^^^^^^^^^^

In standard gradient descent, each optimization step is given by

.. math:: \theta_{t+1} = \theta_t -\eta \nabla \mathcal{L}(\theta),

where :math:`\mathcal{L}(\theta)` is the cost as a function of
the parameters :math:`\theta`, and :math:`\eta` is the learning rate
or step size. In essence, each optimization step calculates the
steepest descent direction around the local value of :math:`\theta_t`
in the parameter space, and updates :math:`\theta_t\rightarrow \theta_{t+1}`
by this vector.

The problem with the above approach is that each optimization step
is strongly connected to a *Euclidean geometry* on the parameter space.
The parametrization is not unique, and different parametrizations can distort
distances within the optimization landscape.

For example, consider the following cost function :math:`\mathcal{L}`, parametrized
using two different coordinate systems, :math:`(\theta_0, \theta_1)`, and
:math:`(\phi_0, \phi_1)`:

|

.. figure:: ../demonstrations/quantum_natural_gradient/qng7.png
    :align: center
    :width: 90%
    :target: javascript:void(0)

|

Performing gradient descent in the :math:`(\theta_0, \theta_1)` parameter
space, we are updating each parameter by the same Euclidean distance,
and not taking into account the fact that the cost function might vary at a different
rate with respect to each parameter.

Instead, if we perform a change of coordinate system (re-parametrization)
of the cost function, we might find a parameter space where variations in :math:`\mathcal{L}`
are similar across different parameters. This is the case with the new parametrization
:math:`(\phi_0, \phi_1)`; the cost function is unchanged,
but we now have a nicer geometry in which to perform gradient descent, and a more
informative stepsize. This leads to faster convergence, and can help avoid optimization
becoming stuck in local minima. For a more in-depth explanation,
including why the parameter space might not be best represented by a Euclidean space,
see `Yamamoto (2019) <https://arxiv.org/abs/1909.05074>`__.

However, what if we avoid gradient descent in the parameter space altogether?
If we instead consider the optimization problem as a
probability distribution of possible output values given an input
(i.e., `maximum likelihood estimation <https://en.wikipedia.org/wiki/Likelihood_function>`_),
a better approach is to perform the gradient descent in the *distribution space*, which is
dimensionless and invariant with respect to the parametrization. As a result,
each optimization step will always choose the optimal step size for every
parameter, regardless of the parametrization.

In classical neural networks, the above process is known as
*natural gradient descent*, and was first introduced by
`Amari (1998) <https://www.mitpressjournals.org/doi/abs/10.1162/089976698300017746>`__.
The standard gradient descent is modified as follows:

.. math:: \theta_{t+1} = \theta_t - \eta F^{-1}\nabla \mathcal{L}(\theta),

where :math:`F` is the `Fisher information matrix <https://en.wikipedia.org/wiki/Fisher_information#Matrix_form>`__.
The Fisher information matrix acts as a metric tensor, transforming the
steepest descent in the Euclidean parameter space to the steepest descent in the
distribution space.

The quantum analog
^^^^^^^^^^^^^^^^^^

In a similar vein, it has been shown that the standard Euclidean geometry
is sub-optimal for optimization of quantum variational algorithms
`(Harrow and Napp, 2019) <https://arxiv.org/abs/1901.05374>`__.
The space of quantum states instead possesses a unique invariant metric
tensor known as the Fubini-Study metric tensor :math:`g_{ij}`, which can be used to
construct a quantum analog to natural gradient descent:

.. math:: \theta_{t+1} = \theta_t - \eta g^{+}(\theta_t)\nabla \mathcal{L}(\theta),

where :math:`g^{+}` refers to the pseudo-inverse.
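
For concreteness, a single update step of this form can be written in a few lines
of NumPy. The following is purely illustrative: the metric tensor ``g`` and the
gradient ``grad`` are placeholder values, standing in for quantities that would be
evaluated on the quantum device.

.. code-block:: python

    import numpy as np

    eta = 0.01                                # step size
    theta = np.array([0.1, 0.2])              # current parameters
    grad = np.array([0.4, -0.3])              # gradient of the cost (placeholder)
    g = np.array([[0.2, 0.0], [0.0, 0.05]])   # metric tensor (placeholder)

    # quantum natural gradient update using the pseudo-inverse of g
    theta_new = theta - eta * np.linalg.pinv(g) @ grad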

.. note::

    It can be shown that the Fubini-Study metric tensor reduces
    to the Fisher information matrix in the classical limit.

    Furthermore, in the limit where :math:`\eta\rightarrow 0`,
    the dynamics of the system are equivalent to imaginary-time
    evolution within the variational subspace, as proposed in
    `McArdle et al. (2018) <https://arxiv.org/abs/1804.03023>`__.

"""

##############################################################################
# Block-diagonal metric tensor
# ----------------------------
#
# A block-diagonal approximation to the Fubini-Study metric tensor
# of a variational quantum circuit can be evaluated on quantum hardware.
#
# Consider a variational quantum circuit
#
# .. math::
#
#     U(\boldsymbol{\theta})|\psi_0\rangle = V_L(\theta_L) W_L V_{L-1}(\theta_{L-1}) W_{L-1}
#       \cdots V_{\ell}(\theta_{\ell}) W_{\ell} \cdots V_{0}(\theta_{0}) W_{0} |\psi_0\rangle
#
# where
#
# * :math:`|\psi_0\rangle` is the initial state,
# * :math:`W_\ell` are layers of non-parametrized quantum gates,
# * :math:`V_\ell(\theta_\ell)` are layers of parametrized quantum gates
#   with :math:`n_\ell` parameters :math:`\theta_\ell = \{\theta^{(\ell)}_0, \dots, \theta^{(\ell)}_{n_\ell-1}\}`.
#
# Further, assume all parametrized gates can be written in the form
# :math:`X(\theta^{(\ell)}_{i}) = e^{i\theta^{(\ell)}_{i} K^{(\ell)}_i}`,
# where :math:`K^{(\ell)}_i` is the *generator* of the parametrized operation.
#
# For each parametric layer :math:`\ell` in the variational quantum circuit
# the :math:`n_\ell\times n_\ell` block-diagonal submatrix
# of the Fubini-Study tensor :math:`g_{ij}^{(\ell)}` is calculated by:
#
# .. math::
#
#     g_{ij}^{(\ell)} = \langle \psi_{\ell-1} | K_i K_j | \psi_{\ell-1} \rangle
#     - \langle \psi_{\ell-1} | K_i | \psi_{\ell-1}\rangle
#     \langle \psi_{\ell-1} |K_j | \psi_{\ell-1}\rangle
#
# where
#
# .. math::
#
#     | \psi_{\ell-1}\rangle = V_{\ell-1}(\theta_{\ell-1}) W_{\ell-1} \cdots V_{0}(\theta_{0}) W_{0} |\psi_0\rangle.
#
# That is, :math:`|\psi_{\ell-1}\rangle` is the quantum state prior to the application
# of parametrized layer :math:`\ell`, and we write :math:`K_i \equiv K_i^{(\ell)}` for brevity.
#
# Let's consider a small variational quantum circuit example coded in PennyLane:

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=3)


@qml.qnode(dev, interface="autograd")
def circuit(params):
    # |psi_0>: state preparation
    qml.RY(np.pi / 4, wires=0)
    qml.RY(np.pi / 3, wires=1)
    qml.RY(np.pi / 7, wires=2)

    # V0(theta0, theta1): Parametrized layer 0
    qml.RZ(params[0], wires=0)
    qml.RZ(params[1], wires=1)

    # W1: non-parametrized gates
    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[1, 2])

    # V_1(theta2, theta3): Parametrized layer 1
    qml.RY(params[2], wires=1)
    qml.RX(params[3], wires=2)

    # W2: non-parametrized gates
    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[1, 2])

    return qml.expval(qml.PauliY(0))


params = np.array([0.432, -0.123, 0.543, 0.233])
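
##############################################################################
# As an optional, illustrative step (assuming the ``qml.draw`` transform is
# available in your installed PennyLane version), we can print the structure of
# this circuit directly:

print(qml.draw(circuit)(params))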

##############################################################################
# The above circuit has 4 trainable parameters, arranged in two distinct parametrized
# layers of 2 parameters each.
#
# .. figure:: ../demonstrations/quantum_natural_gradient/qng1.png
#     :align: center
#     :width: 90%
#     :target: javascript:void(0)
#
# |
#
# (Note that in this example, the first non-parametrized layer :math:`W_0`
# is simply the identity.) Since there are two layers, each with two parameters,
# the block-diagonal approximation consists of two
# :math:`2\times 2` matrices, :math:`g^{(0)}` and :math:`g^{(1)}`.
#
# .. figure:: ../demonstrations/quantum_natural_gradient/qng2.png
#     :align: center
#     :width: 30%
#     :target: javascript:void(0)
#
# To compute the first block-diagonal :math:`g^{(0)}`, we create subcircuits consisting
# of all gates prior to the layer, and observables corresponding to
# the *generators* of the gates in the layer:
#
# .. figure:: ../demonstrations/quantum_natural_gradient/qng3.png
#     :align: center
#     :width: 30%
#     :target: javascript:void(0)

g0 = np.zeros([2, 2])


def layer0_subcircuit(params):
    """This function contains all gates that
    precede parametrized layer 0"""
    qml.RY(np.pi / 4, wires=0)
    qml.RY(np.pi / 3, wires=1)
    qml.RY(np.pi / 7, wires=2)


##############################################################################
# We then post-process the measurement results in order to determine :math:`g^{(0)}`,
# as follows.
#
# .. figure:: ../demonstrations/quantum_natural_gradient/qng4.png
#     :align: center
#     :width: 50%
#     :target: javascript:void(0)
#
# We can see that the diagonal terms are simply given by the variance of the
# corresponding Pauli operators. Since each rotation gate is generated by
# (up to a sign) half of a Pauli operator, e.g., :math:`R_Z(\theta) = e^{-i\theta Z/2}`,
# a factor of :math:`1/4` appears in each matrix element:


@qml.qnode(dev, interface="autograd")
def layer0_diag(params):
    layer0_subcircuit(params)
    return qml.var(qml.PauliZ(0)), qml.var(qml.PauliZ(1))


# calculate the diagonal terms
varK0, varK1 = layer0_diag(params)
g0[0, 0] = varK0 / 4
g0[1, 1] = varK1 / 4

##############################################################################
# The following two subcircuits are then used to calculate the
# off-diagonal covariance terms of :math:`g^{(0)}`:


@qml.qnode(dev, interface="autograd")
def layer0_off_diag_single(params):
    layer0_subcircuit(params)
    return qml.expval(qml.PauliZ(0)), qml.expval(qml.PauliZ(1))


@qml.qnode(dev, interface="autograd")
def layer0_off_diag_double(params):
    layer0_subcircuit(params)
    # construct the matrix for Z ⊗ Z, the tensor product of the layer-0 Pauli generators
    ZZ = np.kron(np.diag([1, -1]), np.diag([1, -1]))
    return qml.expval(qml.Hermitian(ZZ, wires=[0, 1]))


# calculate the off-diagonal terms
exK0, exK1 = layer0_off_diag_single(params)
exK0K1 = layer0_off_diag_double(params)

g0[0, 1] = (exK0K1 - exK0 * exK1) / 4
g0[1, 0] = (exK0K1 - exK0 * exK1) / 4

##############################################################################
# Note that, by definition, the block-diagonal matrices must be real and
# symmetric.
#
# We can repeat the above process to compute :math:`g^{(1)}`. The subcircuit
# required is given by
#
# .. figure:: ../demonstrations/quantum_natural_gradient/qng8.png
#     :align: center
#     :width: 50%
#     :target: javascript:void(0)

g1 = np.zeros([2, 2])


def layer1_subcircuit(params):
    """This function contains all gates that
    precede parametrized layer 1"""
    # |psi_0>: state preparation
    qml.RY(np.pi / 4, wires=0)
    qml.RY(np.pi / 3, wires=1)
    qml.RY(np.pi / 7, wires=2)

    # V0(theta0, theta1): Parametrized layer 0
    qml.RZ(params[0], wires=0)
    qml.RZ(params[1], wires=1)

    # W1: non-parametrized gates
    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[1, 2])


##############################################################################
# Using this subcircuit, we can now generate the submatrix :math:`g^{(1)}`.
#
# .. figure:: ../demonstrations/quantum_natural_gradient/qng5.png
#     :align: center
#     :width: 50%
#     :target: javascript:void(0)


@qml.qnode(dev, interface="autograd")
def layer1_diag(params):
    layer1_subcircuit(params)
    return qml.var(qml.PauliY(1)), qml.var(qml.PauliX(2))


##############################################################################
# As before, the diagonal terms are again given by the variance,

varK0, varK1 = layer1_diag(params)
g1[0, 0] = varK0 / 4
g1[1, 1] = varK1 / 4


##############################################################################
# while the off-diagonal terms require covariance between the two
# observables to be computed.


@qml.qnode(dev, interface="autograd")
def layer1_off_diag_single(params):
    layer1_subcircuit(params)
    return qml.expval(qml.PauliY(1)), qml.expval(qml.PauliX(2))


@qml.qnode(dev, interface="autograd")
def layer1_off_diag_double(params):
    layer1_subcircuit(params)
    # construct the matrix for Y ⊗ X, the tensor product of the layer-1 Pauli generators
    X = np.array([[0, 1], [1, 0]])
    Y = np.array([[0, -1j], [1j, 0]])
    YX = np.kron(Y, X)
    return qml.expval(qml.Hermitian(YX, wires=[1, 2]))


# calculate the off-diagonal terms
exK0, exK1 = layer1_off_diag_single(params)
exK0K1 = layer1_off_diag_double(params)

g1[0, 1] = (exK0K1 - exK0 * exK1) / 4
g1[1, 0] = g1[0, 1]


##############################################################################
# Putting this all together, the block-diagonal approximation to the Fubini-Study
# metric tensor for this variational quantum circuit is
from scipy.linalg import block_diag

g = block_diag(g0, g1)
print(np.round(g, 8))


##############################################################################
# PennyLane contains a built-in function for computing the Fubini-Study metric
# tensor, :func:`~.pennylane.metric_tensor`, which
# we can use to verify this result:
print(np.round(qml.metric_tensor(circuit, approx="block-diag")(params), 8))

##############################################################################
# In contrast to our manual computation, which required 6 separate quantum
# evaluations, the built-in PennyLane metric tensor implementation
# requires only 2 quantum evaluations, one per layer. This is achieved by
# automatically detecting the layer structure and noting that all
# observables measured within a layer commute, allowing them to be measured simultaneously.
#
# Therefore, by combining the quantum natural gradient optimizer with the analytic
# parameter-shift rule to optimize a variational circuit with :math:`d` parameters
# and :math:`L` parametrized layers, a total of :math:`2d+L` quantum evaluations
# are required per optimization step.
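#
# For the example circuit above, with :math:`d = 4` parameters and :math:`L = 2`
# parametrized layers, this amounts to :math:`2 \cdot 4 + 2 = 10` circuit
# evaluations per optimization step.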
#
# Note that the :func:`~.pennylane.metric_tensor` function also supports computing the diagonal
# approximation to the metric tensor:
print(qml.metric_tensor(circuit, approx="diag")(params))

##############################################################################
# Furthermore, the returned metric tensor is **fully differentiable**; include it
# in your cost function, and train or optimize its value!
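#
# As a quick, illustrative check (a minimal sketch, not part of the original
# workflow), we can differentiate a single entry of the block-diagonal metric
# tensor with respect to the circuit parameters using autograd. The entry
# ``[2, 2]`` belongs to the second layer, and therefore depends on the layer-0
# parameters:

# gradient of the metric tensor entry g[2, 2] with respect to the parameters
mt_fn = qml.metric_tensor(circuit, approx="block-diag")
print(qml.grad(lambda p: mt_fn(p)[2, 2])(params))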

##############################################################################
# Quantum natural gradient optimization
# -------------------------------------
#
# PennyLane provides an implementation of the quantum natural gradient
# optimizer, :class:`~.pennylane.QNGOptimizer`. Let's compare the optimization convergence
# of the QNG Optimizer and the :class:`~.pennylane.GradientDescentOptimizer` for the simple variational
# circuit above.

steps = 200
init_params = np.array([0.432, -0.123, 0.543, 0.233], requires_grad=True)

##############################################################################
# Performing vanilla gradient descent:

gd_cost = []
opt = qml.GradientDescentOptimizer(0.01)

theta = init_params
for _ in range(steps):
    theta = opt.step(circuit, theta)
    gd_cost.append(circuit(theta))

##############################################################################
# Performing quantum natural gradient descent:

qng_cost = []
opt = qml.QNGOptimizer(0.01)

theta = init_params
for _ in range(steps):
    theta = opt.step(circuit, theta)
    qng_cost.append(circuit(theta))
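
##############################################################################
# Under the hood, each step taken by the ``QNGOptimizer`` is roughly equivalent
# to the following manual update. This is an illustrative sketch only: it uses
# the block-diagonal metric tensor approximation and a plain pseudo-inverse in
# place of the optimizer's internal regularization.

eta = 0.01
theta_manual = init_params

# evaluate the (approximate) metric tensor and the gradient at the current parameters
g_theta = qml.metric_tensor(circuit, approx="block-diag")(theta_manual)
grad_theta = qml.grad(circuit)(theta_manual)

# natural gradient update using the pseudo-inverse of the metric tensor
theta_manual = theta_manual - eta * np.linalg.pinv(g_theta) @ grad_theta
print(theta_manual)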


##############################################################################
# Plotting the cost vs optimization step for both optimization strategies:
from matplotlib import pyplot as plt

plt.style.use("seaborn")
plt.plot(gd_cost, "b", label="Vanilla gradient descent")
plt.plot(qng_cost, "g", label="Quantum natural gradient descent")

plt.ylabel("Cost function value")
plt.xlabel("Optimization steps")
plt.legend()
plt.show()

##############################################################################
# References
# ----------
#
# 1. Shun-Ichi Amari. "Natural gradient works efficiently in learning."
#    `Neural computation 10.2,  251-276 <https://www.mitpressjournals.org/doi/abs/10.1162/089976698300017746>`__, 1998.
#
# 2. James Stokes, Josh Izaac, Nathan Killoran, Giuseppe Carleo.
#    "Quantum Natural Gradient." `arXiv:1909.02108 <https://arxiv.org/abs/1909.02108>`__, 2019.
#
# 3. Aram Harrow and John Napp. "Low-depth gradient measurements can improve
#    convergence in variational hybrid quantum-classical algorithms."
#    `arXiv:1901.05374 <https://arxiv.org/abs/1901.05374>`__, 2019.
#
# 4. Naoki Yamamoto. "On the natural gradient for variational quantum eigensolver."
#    `arXiv:1909.05074 <https://arxiv.org/abs/1909.05074>`__, 2019.
#
#
# About the author
# ----------------
# .. include:: ../_static/authors/josh_izaac.txt