\input notes \nopagenumbers \pageno=1 \advance\voffset by 2\baselineskip \advance\vsize by -2\baselineskip \def\vr{\vec{r}} \def\deg{$^{\circ}$} \def\ms{m s$^{-1}$\;\;} \def\vxi{\vec{\xi}} \def\omo{\varpi_o} \def\om{\omega} \def\bul{{$\bullet$}} \def\sza{\theta_\odot} \def\mza{\mu_\odot} \def\fo{F_\odot} \def\se{\sigma_{ext}} \def\wm{Wm^{-2}} \def\inu{I_{\nu}} \def\fo{F_{\odot}} \def\tl{\tilde{\cal L}} \def\frame #1#2#3#4{\vbox{\hrule height #1pt \hbox{\vrule width #1pt\kern #2pt \vbox{\kern #2pt \vbox{\hsize #3\noindent #4}\kern #2pt} \kern #2pt\vrule width #1pt} \hrule height0pt depth #1pt}} \def\fitframe #1#2#3{\vbox{\hrule height#1pt \hbox{\vrule width#1pt\kern #2pt \vbox{\kern #2pt\hbox{#3}\kern #2pt} \kern #2pt\vrule width#1pt} \hrule height0pt depth#1pt}} \def\chapa{\tensl AT721\hfil\folio} \def\chapb{\tensl\folio\hfil{\sl Chapter 12 -- Linear Inverse problems}} \vskip 15mm \noindent {\largebold AT721 Section 12: \hfil} \vskip 10mm \noindent {\largebold Introduction to Bayes theorem and General Linear inverse \hfil} \vskip 15mm \noindent General references: \noindent Rodgers, C.D., 2000; Inverse Methods for atmospheric sounding, {\it World Sci}, chapter 2 \noindent Bernardo and Smith, 2000; Bayesian Theory, Wiley, 586pp \noindent Other References: \noindent Evans et al., 2002; Submillimeter-wave cloud ice radiometer: Simulations of retreival algorithm performance, {\it J. Geophys. Res.}, 107, 10.1029/2001JD000709 \noindent L'Ecuyer and Stephens, 2002; An Uncertanty model Bayesion Monte carlo retrieval algorithms: application to the TRMM observing system, {\it Q.J.Roy. Meteorol. Soc.}, 128, 1713-1737. \noindent Strand, 1974; coefficient errors caused by using the wrong covariance matrix in the general linear model, {\it Annal Statistics},2, 935-949. We have seen that the solutions to many inversion problems require compromises between the kind of information wanted and the kind of information that in fact can be derived from any given data set. These compromises are forced upon us when we attempt to introduce information into the inversion process that is not known precisely but rather is known only within some bounds. \bmedskip \section{\bf 12.1 \quad Probability and pdfs} \bmedskip A probability represents a state of knowledge rather than a physical entity. Add discussion on probability Suppose $x$ is a variable that takes on a continuous set of values over some interval on the real axis, $a\leq x\leq b$ . We further assume there exists a piecewise continuous function $p_X(x)$ such that the probability, $P(a\leq x\leq b)$ , that $X$ has a value in the interval $a\leq x\leq b$ is given by the area under the curve $p_X(x)$ between $x=a$ and $x=b$ $$ P(a\leq x\leq b) = \int^b_a p_X (x)dx \eqno(12.1) $$ where $p_X(x)$ is the probability density function (or pdf). The pdf must satisfy $$ p_X(x)\geq 0 $$ $$ \int p_X (x)dx=1 $$ where the interval is over the entire range of $x$. \bmedskip \section{\it 12.1.1 \quad Properties of pdfs} \bmedskip A pdf is a continuous function that may have a complicated shape. Therefore it is helpful to derive a few quantities from the distributions that summarize its major properties. One useful property is some indication of the most likely measurement which is the one of highest probability (Fig. 12.1) occurring at the peak of the distribution , referred to as the maximum likelihood point. If the distribution is skewed, this value, however, may not be a good indication of the most typical value. In these circumstances, the expected or mean value is a better characterization, $$ \langle x \rangle =\int x p_X (x)dx \eqno(12.2) $$ where $\langle x \rangle$ is the expected or mean value of $x$. The $nth$ moment of $x$ is defined thus follows $$ \langle x^n \rangle =\int x^n p_X (x)dx=1 \eqno(12.3) $$ and the variance of $x$ is $[\langle x^2 \rangle -\langle x \rangle^2]$ and the standard deviation $$ \sigma_x= [\langle x^2 \rangle -\langle x \rangle^2]^{1 \over 2} \eqno(12.4) $$ which characterizes the width of the distribution. \midinsert \vskip 50mm \hsize=4.25truein {\noindent {\bf Fig. 12.1} \fb The maximum likelihood point of the pdf for data gives the most probable value of the data. In general, this value can differ from the mean datum $\langle x \rangle$.}\endinsert \bmedskip \section{\it 12.1.1 \quad Joint Probability and correlated data} \bmedskip Experiments usually involve collection of more than one datum. We therefore need to quantify the probability that a set of random numbers take on a given value. For example, for continuous random variables $x$ and $y$, the joint probability is the probability that both $x$ and $y$ fall in specified ranges and $$ \int \int p_{XY} (x,y)dxdy=1 \eqno(12.5) $$ Furthermore, the marginal pdf for $X$ is obtained by integrating over $y$ $$ p_X(x) = \int p_{XY} (x,y)dy \eqno(12.6) $$ In some cases, the data are correlated. High values of one datum, for example, occur either with high or low values with another datum (Fig. 12.2a). The joint distribution takes this correlation into account and given this distribution, correlations can be tested by selecting a function that divides the $x,y$ plane into four quadrants of alternating sign centered on the center of the distribution (Fig. 12.2b). Correlated distributions tend to be concentrated in two opposite quadrants. If $[x-\langle x \rangle] [ y-\langle y \rangle]$ is the function then the resulting measure of the correlation is the covariance of $x$ and $y$ is defined as $$ cov(x,y)=\int \int [ x-\langle x \rangle] [ y-\langle y \rangle]p_{XY} (x,y)dxdy \eqno(12.7) $$ $$ = \langle xy \rangle - \langle x \rangle\langle y \rangle $$ which characterizes the basic shape of the joint probability. \midinsert \vskip 50mm \hsize=4.25truein {\noindent {\bf Fig. 12.2} \fb (a) The pdf p(x,y) is contoured as a function of x and x. The angle $\theta$ is a measure of the correlation and is related to the covariance. (b) The function divides the x-y plane into four quadrants of alternating sign.}\endinsert The correlation of $x$ and $y$ is defined as $$ cor(x,y)={cov(x,y) \over \sigma_x \sigma_y} \eqno(12.8) $$ The correlation $cor(x,y)$ is dimensionless and has the following properties \noindent(i) $cor(x,y)=cor(y,x)$ \noindent(ii)$-1\leq cor(x,y) \leq 1$ \noindent(iii) $cor(x,x)=1, cor(x,-x)=-1$ \noindent(iv) $cor(ax+b,cy+d)=cor(x,y)$ if $a,c \neq 0$ For independent $x$ and $y$, the following hold: \noindent(i) $p_{XY}(x,y)=p_X(x)p_Y(y) \eqno(12.9)$ \noindent(ii) $\langle xy \rangle = \langle x \rangle\langle y \rangle$ \noindent(iii) $\langle (x+y)^2 \rangle -$\langle x+y \rangle^2 = \noindent(iv) $cov(x,y)=0$ (note the converse of (iv) is not always true). \bmedskip \section{\it 12.1.1 \quad Functions of random variable} \bmedskip Often problems require the pdf not of $x$ but of some new variable $y=f(y)$, where $f(x)$ is a known function of $x$. The pdf of $y$ $$ p_y(y)=\int \delta (y- f(x))p_x(x) dx \eqno(12.11) $$ where $\delta (y- f(x))$ is the Dirac delta function. \bmedskip \section{\bf 12.2 \quad The Gaussian Distribution} \bmedskip Although the pdf of many types of variables in many problems may be governed by distributions functions of varying forms, the Gaussian or normal distribution describes the pdfs of a vast number types of problems. In fact, Jaynes points out that if the statistics of a given problem are not known precisely, assumptions for the pdf other than Gaussian introduces spurious information into the problem. We will consider only this form of distribution. Mention also maximum entropy - minimum information of normal dist (Rodgers, chapt 2). The normalized form of the distribution is $$ p_x(x)= {1 \over \sigma \sqrt{2\pi}} \exp [(x - \langle x \rangle)/2\sigma^2] \eqno(12.12) $$ where $\langle x \rangle$ and $\sigma$ are defined above. \bmedskip \section{\it 12.2.1 \quad Confidence Limits} \bmedskip The probability $$ P(a\leq x\leq b)=\int^b_a p_X (x)dx={1 \over \sigma \sqrt{2\pi}} \int^b_a \exp[ (x - \langle x \rangle)/2\sigma^2] \eqno(12.13) $$ and for the case with $x=0, a=-\sigma$ and $b=\sigma$, the probability that a parameter falls within the range is 68\%. Figure 11.3. presents this probability as a function of parameter range measured in units of $\sigma$. For example, the probability that a parameter falls within the range $-2\sigma \leq x \leq 2\sigma$ is 95.4\% and so on. \midinsert \vskip 50mm \hsize=4.25truein {\noindent {\bf Fig. 12.3} \fb The probability that a parameter $x$ falls within t standard deviations of the mean value \langle x \rangle.}\endinsert \bmedskip \section{\it 12.2.2 \quad Joint Probability} \bmedskip The joint distribution of two independent Gaussian variables is, according to 12.x, the product of two Gaussian distributions. When the data are correlated, the distribution has the form $$ p_y({\bf y})= {1 \over {2\pi}^{{1\over 2}} \mid {\bf S}_y \mid^{{1\over 2}}} \exp [({\bf y} - \langle {\bf y} \rangle)^T {\bf S}_y^{-1} ({\bf y} - \langle {\bf y} \rangle)] \eqno(12.14) $$ where ${\bf S}_y$ is a matrix of correlations between the different data, $$ S_{y,ij}= \int (y_i-\langle {\bf y} \rangle )(y_j-\langle {\bf y} \rangle) p_Y(y) dy \eqno(12.15) $$ When the individual 'measurements' $y_i$ are uncorrelated with respect to measurements $y_j$ then $$ {\bf S}_y = \pmatrix{\sigma_1^2 & 0 & \ldots \cr 0 & \sigma_2^2 & \ldots \cr & \ldots & \sigma_n^2 \cr} $$ fix up and show examples \bmedskip \section{\bf 12.3 \quad Bayes Theorem} \bmedskip About three centuries ago, people started to give serious thought to the question of how to reason in situations that lack any certainty. Reverand Thomas Bayes is credited with providing an approach that was communicatd after his death in 1761 by Richard Price to the Royal Society in 1763. The technical result at the hear of the essay {\it An essay towards solving a problem in then doctrine of chances} is what we know know as {\it 'Bayes' theorem'}. It was Laplace in 1812, however, - apparently unaware of Bayes' work - who stated the theorem in its general form that we use today. The use of Bayes theorem in the context of inverse problems is now widespread. Discussion of this theorem in the context of inverse problems may be facilitated with the the following definitions: \noindent \bul ~~$ p({\bf x})$ as the {\it prior} pdf of the state ${\bf x}$. This means that the quantity $p({\bf x})d{\bf x}$ is the probability such that ${\bf x}$ lies in the multidimensional volume $({\bf x},{\bf x}+d{\bf x})$ expressing quantitatively our knowledge of ${\bf x}$ before the measurement is made. \noindent \bul ~~$p({\bf y})$ as the {\it prior} pdf of the measurement with a similar meaning. This is the pdf of the measurement before it is made. \noindent \bul ~~$p({\bf x},{\bf y})$ as the joint prior pdf of ${\bf x}$ and ${\bf y}$ meaning that $p({\bf x},{\bf y})d{\bf x}d{\bf y}$ is the probability that ${\bf x}$ lies in $({\bf x},{\bf x}+d{\bf x})$ and ${\bf y}$ lies in $({\bf y},{\bf y}+d{\bf y})$ \noindent \bul ~~$p({\bf y}\mid{\bf x})$ as the conditional pdf of ${\bf y}$ given ${\bf x}$ meaning that $p({\bf y}\mid{\bf x})d{\bf y}$ is the probability that ${\bf y}$ lies in $({\bf y},{\bf y}+d{\bf y})$ when ${\bf x}$ has a given value. This is a function derived by the forward model. \noindent \bul ~~ $p({\bf x}\mid{\bf y})$ as the conditional pdf of ${\bf x}$ given ${\bf y}$ meaning that $p({\bf x}\mid{\bf y})d{\bf x}$ is the probability that ${\bf x}$ lies in $({\bf x},{\bf x}+d{\bf x})$ when ${\bf y}$ has a given value. This is the quantity of interest for solving the inverse problem. \midinsert \vskip 50mm \hsize=4.25truein {\noindent {\bf Fig. 12.3} \fb Fig. 12.3 Illustration of Bayes' theorem for a two-dimensional case.}\endinsert Figure 12.3 provides a conceptual illustration of the concepts behind these definitions in the context of $x$ and $y$ are scalar. Shown are the contours of $p(x,y$ as well as $p(x)$ and $p(y)$ where $$ p(x)= \int^{\infty}_{-\infty} p(x,y)dy ~~~{\rm and}~~~p(y)= \int^{\infty}_{-\infty} p(x,y)dx $$ Bayes theorem states that $$ p({\bf x}\mid{\bf y})={p({\bf y}\mid{\bf x})p({\bf x}) \over p({\bf y})} \eqno(12.16) $$ where the left hand side is the {\it posterior} pdf of the state when the measurement is given. This represents an updating to our prior knowledge $p({\bf x})$ given the measurement. The most likely value of ${\bf x}$ derived from this {\it posterior} pdf might therefore 'contain' our inverse 'solution'. Our knowledge contained in $p({\bf y}\mid{\bf x})$ is explicitly expressed in terms of the forward model and the statistical description of both the error of this model and the error of the measurement. The factor $p({\bf y})$ can practically be ignored as it merely represents a normalizing factor being entirely independent of ${\bf x}$. The general approach to inversion thus follows these basic steps: \noindent \bul ~~ We express our prior knowledge of ${\bf x}$ as a pdf \noindent \bul ~~ The measurement process is expressed as a forward model, $f({\bf x})$, which maps the state space ${\bf x}$ into a measurement space, via $$ {\bf y}= f({\bf x})+ \varepsilon $$ \noindent \bul ~~ Bayes' theorem then provides the formalism to invert this mapping and calculate the {\it posterior} pdf by updating the prior pdf with the measurement pdf \noindent \bul ~~ The most probable solution is then inferred from this posterior pdf Bayes theorem is general, it is not just a specific inverse method which produces a solution but rather encompasses all inverse methods by providing a way of characterizing all possible solutions and assigning a probability density to each. The forward model is not explicitly inverted and one can obtain different explicit 'answers' to the inverse problem according to how the most 'likely' solution is to be defined. The method provides us with a deeper intuition about how the measurement improves our knowledge of the state. \bmedskip \section{\bf 12.4 \quad An example application } \bmedskip From Frank's work \bmedskip \section{\bf 12.5 \quad Linear inverse problem with Gaussian Statistics} \bmedskip The Bayesian approach provides us with a framework for understanding the inverse problem. Given a measurement together with a description of its error statistics, a forward model describing the relationship between the measurement and the unknown state, and any other a priori information, Bayes relationship allows us to identify the class of possible states consistent with all available information and assign a pdf to them. One approach to evaluate (12.16) to obtain the {\it posterior} pdf and thus the most probable solution is to carry out forward model calculations over the entire possible range of ${\bf x}$ to establish $p({\bf x}\mid{\bf y})$ in the form of a histogram. We have already seen Franks example Another more explicit way of evaluating of (12.16) follows by invoking the assumption of Gaussian statistics. It follows from (12.14) that $$ -2 \ln p({\bf y}\mid{\bf x})= ({\bf y} - f({\bf x}))^T {\bf S}_y^{-1}({\bf y} - f({\bf x})) + c_1 \eqno(12.17) $$ where $c_1$ is a constant and ${\bf S}_y$ is the measurement error covariance. If we also invoke these statistics to describe our prior knowledge $$ -2 \ln p({\bf x})= ({\bf x} - {\bf x}_a)^T {\bf S}_a^{-1}({\bf x} - {\bf x}_a) \eqno(12.18) $$ where ${\bf x}_a$ is the a priori value of ${\bf x}$ and ${\bf S}_a$ is the associated covariance matrix. Combining (12.17) and (12.18) into (12.16) provides the following for the {\it posterior} pdf: $$ -2 \ln p({\bf x}\mid{\bf y})= ({\bf y} - {\bf K}{\bf x})^T {\bf S}_y^{-1}({\bf y} - {\bf K}{\bf x}) +({\bf x} - {\bf x}_a)^T {\bf S}_a^{-1}({\bf x} - {\bf x}_a) + c_1 \eqno(12.19) $$ where we introduce a linear form $f({\bf x})\approx {\bf K} {\bf x}$ for the forward model. Since the combination of Gaussian pdfs results in a pdf of the same form, we can write (12.19) as $$ -2 \ln p({\bf x}\mid{\bf y})= ({\bf x} - \langle{\bf x}\rangle)^T {\bf S}^{-1}({\bf x} - \langle{\bf x}\rangle) \eqno(12.20) $$ where the desired {\it posterior} pdf is of Gaussian form with the expected value $\langle{\bf x}\rangle$ and associated covariance ${\bf S}_x$. We can match like terms in (12.20) and (12.19) to arrive at $$ {\bf S}_x^{-1}={\bf K}^T{\bf S}_y^{-1}{\bf K}+ {\bf S}_a^{-1} $$ and $$ \langle{\bf x}\rangle={\bf x}_a+ {\bf S}_a{\bf K}^T({\bf K}{\bf S}_a {\bf K}^T + {\bf S}_y) ({\bf y} - {\bf K}{\bf x}) \eqno(12.21) add 2x2 example \bmedskip \section{\bf 12.6 \quad The maximum a posteriori solution} \bmedskip As noted above, the Bayes relationship allows us to identify the class of possible states consistent with all available information we have to solve our inverse problem. It assigns not a single state but a pdf to represent the class of states. Practical problems require the identification of just one possible state assigning it as a 'solution'. There are a number of possible ways a solution can be constructed from the pdf. Perhaps the most straight forward way of selecting this single state from all states represented by the pdf is to choose either the most likely state, i.e. one for which $p({\bf x}\mid{\bf y})$ is a maximum or the expected value solution $$ \langle{\bf x}\rangle=\int{\bf x}p({\bf x}\mid{\bf y}) d{\bf x} $$ and for either case, the width of the distribution is a measure of the error. For Gaussian density functions the two solutions are identical and are given by (12.21). For these statistics these solutions are also the same as the minimum variance solutions. \bye