Numerical Methods
by
Dr. Anita Pal
Assistant Professor
Department of Mathematics
National Institute of Technology Durgapur
Durgapur-713209
email: [email protected]
Chapter 1
Numerical Errors
Module No. 1
Two major techniques are used to solve any mathematical problem: analytical and numerical. An analytical solution is obtained in a compact form and generally it is free from error. On the other hand, a numerical method is a technique used to solve a problem with the help of a computer or calculator. In general, the solution obtained by this method contains some error. But for some classes of problems it is very difficult to obtain an analytical solution, and for these problems we generally use numerical methods. For example, the solutions of complex non-linear differential equations cannot be determined by analytical methods, but these problems can easily be solved by numerical methods. In a numerical method there is always scope for errors to occur, and hence it is important to understand the source, propagation, magnitude, and rate of growth of these errors.
To solve a problem with the help of a computer, a special kind of method is required, and such a method is known as a numerical method. Analytical methods are not suitable for solving a problem by computer. Thus, numerical methods are highly appreciated and extensively used by scientists and engineers.
Let us now discuss the sources of error.
It is well known that the solution of a problem obtained by a numerical method contains some error. Our intention is to minimize the error, and to do so the most essential thing is to identify the causes or sources of the error. Three sources of error, viz. inherent errors, round-off errors and truncation errors, occur when a problem is solved using a numerical method. They are discussed below.
(i) Inherent errors: These types of errors occur due to the simplified assumptions made during mathematical modelling of the problem. They also occur when the data is obtained from physical measurements of the parameters of the proposed problem.
(ii) Round-off errors: Generally, numerical methods are performed using a computer. In numerical computation, all numbers are represented as decimal fractions. A computer can store only a finite number of digits for a number, and some numbers, viz. 1/3, 1/6, 1/7 etc., cannot be represented exactly by a decimal fraction with finitely many digits. Thus, to represent these numbers some digits must be discarded, i.e. the numbers must be rounded off to some finite number of digits. So in arithmetic computation some errors occur due to the finite representation of numbers; these errors are called round-off errors. They depend on the word length of the computer used.
(iii) Truncation errors: These errors occur due to the finite representation of an inherently infinite process. This type of error is explained by an example. Let us consider the cosine series.
The Taylor series expansion of cos x is
cos x = 1 − x²/2! + x⁴/4! − x⁶/6! + · · · .
It is well known that this series is infinite. If we consider only the first five terms to calculate the value of cos x for a given x, then we obtain an approximate value. The error occurs due to the truncation of the remaining terms of the series, and it is called the truncation error.
Note that the truncation error is independent of the computational machine.
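The size of this truncation error is easy to see numerically. A small Python sketch (added here for illustration; not part of the original notes) compares the five-term sum with the machine value of cos x:

```python
import math

def cos_truncated(x, terms=5):
    """Approximate cos(x) by the first `terms` terms of its Taylor series."""
    return sum((-1)**k * x**(2*k) / math.factorial(2*k) for k in range(terms))

x = 1.2
approx = cos_truncated(x)
exact = math.cos(x)
print(approx, exact, abs(exact - approx))  # the last number is the truncation error
```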
It may be noted that the same number may be exact as well as approximate. For example, the number 3 is exact when it represents the number of rooms of a house, and approximate when it represents the number π.
The accuracy of a solution is defined in terms of the number of digits used in the computation. The significant digits or significant figures of a number are all its digits, except for the zeros which appear to the left of the first non-zero digit. The zeros at the end of a number are, however, always significant digits. The numbers 0.000342 and 8921.2300 have 3 and 8 significant digits respectively.
Sometimes we need to cut off usable digits. The number of digits to be cut off depends on the problem. This process of cutting off digits from a number is called rounding-off of numbers. That is, in the rounding process the number is approximated by a very close number consisting of a smaller number of digits. In that case, one or more digits are kept with the number, taken from left to right, and all other digits are discarded.
Rules of rounding-off
(i) If the discarded digits constitute a number which is larger than half the unit in the last decimal place that remains, then the last digit that is left is increased by one. If the discarded digits constitute a number which is smaller than half the unit in the last decimal place that remains, then the digits that remain do not change.
(ii) If the discarded digits constitute a number which is equal to half the unit in the last decimal place that remains, then the last digit that is left is increased by one if it is odd, and is unchanged if it is even.
This rule is often called the rule of the even digit.
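Python's built-in round happens to implement exactly this round-half-to-even rule, so the rules can be checked directly (an illustrative sketch; note that binary floating point can blur decimal halves):

```python
# Rule (i): discarded part larger / smaller than half a unit in the last kept place
print(round(2.786, 2))  # 2.79 (discarded 0.006 > 0.005, last kept digit increased)
print(round(2.783, 2))  # 2.78 (discarded 0.003 < 0.005, kept digits unchanged)

# Rule (ii): discarded part exactly half a unit -> round to the even digit
print(round(2.5), round(3.5))  # 2 4 (2.5 goes down to even 2; 3.5 goes up to even 4)
```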
In Table 1.1, we consider different cases to illustrate the round-off process. In this table the numbers are rounded off to six significant figures. A computer, however, keeps more digits during round-off; how many depends on the computer and on the type of the number declared in a programming language.
Note that rounded-off numbers contain errors, and these errors are called round-off errors.
Absolute error:
Let xA be an approximate value of the exact number xT. Then the absolute error is denoted by ∆x and satisfies the relation
∆x ≥ |xT − xA|.
Note that the absolute error is an upper bound of the difference between xT and xA. This definition is applicable when there are many approximate values of the exact number xT. Otherwise, ∆x = |xT − xA|.
Also, the exact value xT lies between xA − ∆x and xA + ∆x. This can be written as
xT = xA ± ∆x.      (1.1)
Note that the absolute error measures the total error, and hence it measures only the quantitative side of the error; it does not measure the qualitative side, i.e. how accurate the measurement is. For example, suppose the length and the width of a pond are determined by a tape in metres, with width w = 50 ± 2 m and length l = 250 ± 2 m. In both measurements the absolute error is 2 m, but it is obvious that the second measurement is more accurate.
To determine the quality of measurements, we introduce a new concept called relative error.
Relative error:
The relative error is denoted by δx and is defined by
δx = ∆x/|xA| or ∆x/|xT|, where |xT| ≠ 0 and |xA| ≠ 0.
This expression can also be written as
xT = xA(1 ± δx) or xA = xT(1 ± δx).
Note that the absolute error is the total error when the whole thing is measured, while the relative error is the error when we measure 1 unit. That is, the relative error is the error per unit measurement.
In the above example, the relative errors are δw = 2/50 = 0.04 and δl = 2/250 = 0.008. Thus, the second measurement is more accurate.
In general, the relative error measures both the quantity of the error and the quality of the measurement. Thus, the relative error is a better measure of error than the absolute error.
Percentage error:
The relative error is measured on a 1 unit scale, while the percentage error is measured on a 100 unit scale. The percentage error is given by δx × 100%. This error is sometimes called the relative percentage error. The percentage error measures both quantity and quality. Generally, the percentage error is reported when the relative error is very small.
Note that the relative and percentage errors are free from the unit of measurement, while the absolute error depends on the measuring unit.
Example 1.1 Find the absolute, relative and percentage errors in xA when xT = 1/7 and xA = 0.1429.
Solution. The absolute error is
∆x = |xT − xA| = |1/7 − 0.1429| = |1 − 1.0003|/7 = 0.0003/7 = 0.000043,
rounded to two significant figures.
The relative error is
δx = ∆x/xT = 0.000043/(1/7) = 0.000301 ≈ 0.0003.
The percentage error is δx × 100% = 0.0003 × 100% ≈ 0.03%.
Example 1.2 Find the absolute error and the exact number corresponding to the approximate number xA = 7.543, assuming that the percentage error is 0.1%.
Solution. Here δx = 0.1% = 0.001, so the absolute error is ∆x = |xA|·δx = 7.543 × 0.001 ≈ 0.0075, and the exact number is xT = 7.543 ± 0.0075.
Example 1.3 Suppose two exact numbers and their approximate values are given by
xT = 17/19 ≈ 0.8947 and yT = √71 ≈ 8.4261.
Find out which approximation is better.
Solution. To find the absolute errors, we take the values with a larger number of digits: xT = 0.894736· · · and yT = 8.426149· · · .
Therefore, the absolute error in the first approximation is ∆x = |0.894736· · · − 0.8947| ≈ 0.000036, and ∆y = |8.426149· · · − 8.4261| ≈ 0.000049.
To compare the approximations we compute the relative errors: δx = 0.000036/0.8947 ≈ 0.000040 and δy = 0.000049/8.4261 ≈ 0.0000058. Since δy < δx, the second approximation is better.
A decimal integer can be represented in many ways. For example, the number 7600000 can be written as 760 × 10⁴ or 76.0 × 10⁵ or 0.7600000 × 10⁷. Note that each number has two parts: the first part is called the mantissa and the second part the exponent. In the last form the mantissa is a proper fraction and the first digit after the decimal point is non-zero. This form is known as the normalized form, and it is commonly used in computers.
Every positive decimal number a can be expressed as
a = d1·10^m + d2·10^(m−1) + · · · + di·10^(m−i+1) + · · · ,
where the di are the digits constituting the number (i = 1, 2, . . .), the digit d1 ≠ 0, and 10^(m−i+1) is the place value of the ith digit counted from the left.
Let dn be the nth digit of the approximate number x. This digit is called a valid significant digit (or simply a valid digit) if it satisfies the condition
∆x ≤ 0.5 × 10^(m−n+1).      (1.3)
If inequality (1.3) is not satisfied, the digit dn is said to be doubtful. If dn is a valid digit, then all the digits preceding dn are also valid.
Theorem 1.1 If a number is correct up to n significant figures and the first significant digit is k, then the relative error is less than
1/(k × 10^(n−1)).
Proof. Let xA and xT be the approximate and exact values. Also, assume that xA is correct up to n significant figures and m decimal places. Three cases arise:
(i) m < n, (ii) m = n, (iii) m > n.
Chapter 1
Numerical Errors
Module No. 2
This module is a continuation of Module 1. In this module, the propagation of error during arithmetic operations is discussed in detail. Also, the representation of numbers in a computer and their arithmetic calculations are explained.
Thus the absolute error in a sum of approximate numbers is equal to the sum of the absolute errors of all the numbers.
The following points should be kept in mind during addition of numbers:
(i) identify the number (or numbers) of least accuracy,
(ii) round off the other numbers so as to retain one digit more than in the identified number.
Subtraction
The case of subtraction is similar to addition. Let x1 and x2 be approximate values of the exact numbers X1 and X2 respectively, and let X = X1 − X2, x = x1 − x2.
Then one can write X1 = x1 ± ∆x1 and X2 = x2 ± ∆x2.
Now, |X − x| = |(X1 − x1) − (X2 − x2)| ≤ |X1 − x1| + |X2 − x2|. Hence,
∆x = ∆x1 + ∆x2.
It may be noted that the absolute error in the difference of two numbers is equal to the sum of the individual absolute errors.
Multiplication
Let us consider two exact numbers X1 and X2 with approximate values x1 and x2. Let X1 = x1 ± ∆x1 and X2 = x2 ± ∆x2, where ∆x1 and ∆x2 are the absolute errors in x1 and x2.
Now, X1X2 = x1x2 ± x1∆x2 ± x2∆x1 ± ∆x1·∆x2.
Therefore, |X1X2 − x1x2| ≤ |x1∆x2| + |x2∆x1| + |∆x1·∆x2|. The terms |∆x1| and |∆x2| represent errors and are small, so their product is smaller still; we discard it and divide both sides by |x| = |x1x2| to get the relative error. Hence, the relative error is
|(X1X2 − x1x2)/(x1x2)| ≤ |∆x2/x2| + |∆x1/x1|.      (2.3)
From this expression we conclude that the relative error in the product of two numbers is equal to the sum of the individual relative errors.
This result can be extended to n numbers as follows. Let X = X1X2 · · · Xn and x = x1x2 · · · xn. Then
|(X − x)/x| ≤ |∆x1/x1| + |∆x2/x2| + · · · + |∆xn/xn|.      (2.4)
That is, the total relative error in a product of n numbers is equal to the sum of the individual relative errors.
In particular, let all the approximate values x1, x2, . . ., xn be positive and x = x1x2 · · · xn. Then log x = log x1 + log x2 + · · · + log xn.
In this case,
∆x/x = ∆x1/x1 + ∆x2/x2 + · · · + ∆xn/xn.
Hence, |∆x/x| ≤ |∆x1/x1| + |∆x2/x2| + · · · + |∆xn/xn|.
Let us consider another particular case. Suppose x = kx1, where k is a non-zero real number. Now,
δx = |∆x/x| = |k∆x1/(kx1)| = |∆x1/x1| = δx1.
Also, |∆x| = |k∆x1| = |k||∆x1|.
Observe that the relative errors in x and x1 are the same, while the absolute error in x is |k| times the absolute error in x1.
Division
Let X1 and X2 be two exact numbers and their approximate values be x1 and x2. Again, let X = X1/X2 and x = x1/x2.
If ∆x1 and ∆x2 are the absolute errors, then X1 = x1 + ∆x1 and X2 = x2 + ∆x2. Suppose both x1 and x2 are non-zero.
Now,
X − x = (x1 + ∆x1)/(x2 + ∆x2) − x1/x2 = (x2∆x1 − x1∆x2)/(x2(x2 + ∆x2)).
Dividing both sides by x and taking absolute values:
|(X − x)/x| = |(x2∆x1 − x1∆x2)/(x1(x2 + ∆x2))| = |x2/(x2 + ∆x2)| · |∆x1/x1 − ∆x2/x2|.
It may be observed that the relative error in the quotient is greater than or equal to the difference of the individual relative errors.
For positive numbers one can determine the error using the logarithm function. Let x1 and x2 be the approximate numbers and x = x1/x2.
Now, log x = log x1 − log x2. Thus,
∆x/x = ∆x1/x1 − ∆x2/x2, i.e. |∆x/x| ≤ |∆x1/x1| + |∆x2/x2|.
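These propagation rules are easy to check numerically. A small Python sketch (illustrative, not from the original text):

```python
def rel_err_product(xs, dxs):
    """Relative error of a product: sum of the individual relative errors."""
    return sum(dx / abs(x) for x, dx in zip(xs, dxs))

def rel_err_quotient(x1, dx1, x2, dx2):
    """Relative error bound of a quotient: sum of the relative errors."""
    return dx1 / abs(x1) + dx2 / abs(x2)

# Numbers and absolute errors of Example 2.3 below: x1 = 12.4, x2 = 45.356
print(rel_err_product([12.4, 45.356], [0.05, 0.0005]))  # ~0.004043, i.e. 0.40%
```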
Example 2.1 Find the sum of the approximate numbers 120.237, 0.8761, 78.23, 0.001234, 234.3, 128.34, 35.4, 0.0672, 0.723, 0.08734, in each of which all the written digits are valid. Find the absolute error in the sum.
Solution. The least exact numbers are 234.3 and 35.4; the maximum error of each of them is 0.05. We therefore round off all the numbers to two decimal places (one digit more than in the least exact numbers).
Their sum is 120.24 + 0.88 + 78.23 + 0.00 + 234.3 + 128.34 + 35.4 + 0.07 + 0.72 + 0.09 = 598.27.
Rounding off the sum to one decimal place gives 598.3.
There are two types of errors in the sum. The first is the initial error: the sum of the errors of the least exact numbers and the errors of the other numbers, which equals 0.05 × 2 + 0.0005 × 8 = 0.104 ≈ 0.10.
The second is the error in rounding off the sum, which is 598.3 − 598.27 = 0.03.
Thus, the total absolute error in the sum is 0.10 + 0.03 = 0.13.
Finally, the sum can be expressed as 598.3 ± 0.13.
Example 2.2 Let x1 = 43.5 and x2 = 76.9 be two approximate numbers with corresponding absolute errors 0.02 and 0.008 respectively. Find the difference between these numbers and evaluate the absolute and relative errors.
Solution. Here x = x1 − x2 = −33.4 and the total absolute error is ∆x = 0.02 + 0.008 = 0.028.
Hence the difference is −33.4 and the absolute error is 0.028.
The relative error is 0.028/|−33.4| ≈ 0.00084 = 0.084%.
Example 2.3 Let x1 = 12.4 and x2 = 45.356 be two approximate numbers, all digits of both numbers being valid. Find the product and the relative and absolute errors.
Solution. The numbers of valid decimal places in the first and second approximate numbers are one and three respectively. So we round off the second number to one decimal place. After rounding, the numbers become x1 = 12.4 and x2 = 45.4.
Now the product is x = x1x2 = 12.4 × 45.4 = 562.96 ≈ 56.0 × 10.
The result is rounded to three significant figures, because the least number of valid significant digits of the given numbers is 3.
The relative error in the product is
δx = ∆x1/x1 + ∆x2/x2 = 0.05/12.4 + 0.0005/45.356 = 0.004043 ≈ 0.40%.
Example 2.4 Let x1 = 7.235 and x2 = 8.72 be two approximate numbers, all of whose digits are valid. Find the quotient and also the relative and absolute errors.
Solution. Here x1 = 7.235 and x2 = 8.72 have four and three valid significant digits respectively. Now,
x1/x2 = 7.235/8.72 = 0.830.
We keep three significant digits, since the least exact number contains three valid significant digits.
The absolute errors in x1 and x2 are ∆x1 = 0.0005 and ∆x2 = 0.005 respectively.
The relative error in the quotient is
δx = ∆x1/x1 + ∆x2/x2 = 0.0005/7.235 + 0.005/8.72 = 0.000069 + 0.000573 ≈ 0.00064 = 0.064%,
and the absolute error is therefore 0.830 × 0.00064 ≈ 0.0005.
Let x1 be an approximate value of an exact number X1 and let its relative error be δx1. Now we determine the relative error of x = x1^k, where k is a real number.
For a positive integer k,
x = x1^k = x1 · x1 · · · (k times),
so by the product rule δx = k δx1. Thus, the relative error of the approximate number x is k times the relative error of x1.
Let us now consider the kth root of a positive approximate value x1, i.e. the number x = x1^(1/k). Since x1 > 0,
log x = (1/k) log x1.
Therefore,
∆x/x = (1/k)(∆x1/x1), or |∆x/x| = (1/k)|∆x1/x1|.
Thus, the relative error in the kth root of x1 is
δx = (1/k) δx1.
Example 2.5 Let a = 5.27, b = 28.61, c = 15.8 be the approximate values of some numbers, and let the absolute errors in a, b, c be 0.01, 0.04 and 0.02 respectively. Calculate the value of
E = a² ∛b / c³
and the error in the result.
Solution. It is given that the absolute errors are ∆a = 0.01, ∆b = 0.04 and ∆c = 0.02. One more significant figure is retained in the intermediate calculations. The approximate values of the terms a², ∛b, c³ are 27.77, 3.0585 and 3944.0 respectively.
The approximate value of the expression is
E = (27.77 × 3.0585)/3944.0 = 0.0215.
Three significant digits are taken in the result, since the least number of significant digits in the given numbers is three.
The relative error is given by
δE = 2δa + (1/3)δb + 3δc = 2 × (0.01/5.27) + (1/3) × (0.04/28.61) + 3 × (0.02/15.8)
≈ 0.0038 + 0.00047 + 0.0038 ≈ 0.008 = 0.8%.
In general, if y = f(x1, x2, . . ., xn) and ∆x1, ∆x2, . . ., ∆xn are the absolute errors in the arguments, then the total absolute error is
∆y = Σ_{i=1}^{n} |∂f/∂xi| ∆xi.
This is the formula to calculate the total absolute error when computing a function of several variables. The relative error can be calculated as
∆y/y = Σ_{i=1}^{n} (∂f/∂xi)(∆xi/y).
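A quick numerical check of Example 2.5 (a Python sketch added for illustration):

```python
a, b, c = 5.27, 28.61, 15.8
da, db, dc = 0.01, 0.04, 0.02

E = a**2 * b**(1/3) / c**3                  # approximate value of the expression
# a power multiplies the relative error of its base by the exponent
dE_rel = 2*(da/a) + (1/3)*(db/b) + 3*(dc/c)
print(E, dE_rel)                            # ~0.0215 and ~0.008 (i.e. 0.8%)
```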
It should be remembered that some significant digits may be lost during arithmetic calculation due to the finite representation of numbers in computing instruments. This error is called significant error.
In the following two cases there is a high chance of losing significant digits, and care should be taken in these situations:
(i) when two nearly equal numbers are subtracted, and
(ii) when a division is made by a very small divisor compared to the dividend.
It should be remembered that significant error is more serious than round-off error. These cases are illustrated in the following examples.
Example 2.6 Find the difference √10.23 − √10.21 and calculate the relative error in the result.
Solution. Let X1 = √10.23 and X2 = √10.21, with approximate values x1 = 3.198 and x2 = 3.195. Let X = X1 − X2.
Then the absolute errors are ∆x1 = 0.0005 and ∆x2 = 0.0005, and the approximate difference is x = 3.198 − 3.195 = 0.003.
Thus the total absolute error in the subtraction is ∆x = 0.0005 + 0.0005 = 0.001, and the relative error is δx = 0.001/0.003 = 0.3333.
But by changing the calculation scheme one can obtain a more accurate result. For example,
X = √10.23 − √10.21 = (10.23 − 10.21)/(√10.23 + √10.21)
= 0.02/(3.198 + 3.195) ≈ 0.003128 = x (say).
The relative error is now
δx = (∆x1 + ∆x2)/(x1 + x2) = 0.001/(3.198 + 3.195) = 0.0002 = 0.02%.
Observe that this relative error is much smaller than in the previous case.
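The effect is easy to reproduce in Python (an illustrative sketch, rounding the square roots to the same four significant digits used in the example):

```python
import math

x1 = round(math.sqrt(10.23), 3)       # 3.198
x2 = round(math.sqrt(10.21), 3)       # 3.195

naive = x1 - x2                       # subtraction of nearly equal numbers
stable = (10.23 - 10.21) / (x1 + x2)  # rationalized form avoids the cancellation

exact = math.sqrt(10.23) - math.sqrt(10.21)
print(naive, stable, exact)           # the stable form is far closer to exact
```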
Example: Find the roots of the quadratic equation x² − 1500x + 0.5 = 0 using four-digit arithmetic.
Solution. To illustrate the difficulties of the problem, let us assume that the computing machine uses four significant digits for all arithmetic calculations. The roots of this equation are
(1500 ± √(1500² − 2))/2.
Now, 1500² − 2 = 0.2250 × 10⁷ − 0.0000 × 10⁷ = 0.2250 × 10⁷.
Thus √(1500² − 2) = 0.1500 × 10⁴.
Hence the roots are
(0.1500 × 10⁴ ± 0.1500 × 10⁴)/2 = 0.1500 × 10⁴, 0.0000 × 10⁴.
That is, the smaller root is zero (correct up to four decimal places); this occurs due to the finite representation of the numbers. But note that 0 is not a root of the given equation.
To get a more accurate result, we transform the arithmetic calculation. The smaller root of the equation is now calculated as follows:
(1500 − √(1500² − 2))/2 = (1500 − √(1500² − 2))(1500 + √(1500² − 2)) / [2(1500 + √(1500² − 2))]
= 2/[2(1500 + √(1500² − 2))] = 0.0003333.
Hence the smaller root of the equation is 0.0003333, which is much closer to the exact root. The other root is 0.1500 × 10⁴.
This situation may arise whenever |4ac| ≪ b². So it is suggested that care be taken when two nearly equal numbers are subtracted. It is done by carrying a sufficient number of reserve valid digits.
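A sketch of this idea as a general stable quadratic solver in Python (illustrative; the helper name is mine):

```python
import math

def quadratic_roots(a, b, c):
    """Roots of ax^2 + bx + c = 0 without subtracting nearly equal numbers."""
    d = math.sqrt(b*b - 4*a*c)
    # compute the larger-magnitude root first, then use root1 * root2 = c/a
    q = -(b + math.copysign(d, b)) / 2
    return q / a, c / q

print(quadratic_roots(1, -1500, 0.5))  # (~1499.99967, ~0.000333333)
```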
It was mentioned earlier that numerical methods are used to solve problems with a computer. But a computer has limitations in storing numbers, whether integers or real (floating point) numbers. Typically, two bytes of memory are used to store an integer and four bytes to store a floating point number. Due to this limitation of space, the rules for arithmetic operations used in mathematics do not always hold in computer arithmetic.
The representation of a floating point number in a computer is different from our conventional technique. The computer representation is designed to preserve the maximum number of significant digits and to increase the range of values of the real numbers; it is known as the normalized floating point mode. In this representation the whole number is converted to a proper fraction in such a way that the first digit after the decimal point is non-zero, and the number is adjusted by multiplying by some power of 10. For example, the number 3876.23 is represented in normalized form as .387623 × 10⁴, and in computer representation it is written as .387623E4 (E4 denotes 10⁴). It is observed that in normalized floating point representation a number has two parts: the mantissa and the exponent. In this example, .387623 is the mantissa and 4 is the exponent. In this representation the mantissa is always greater than or equal to .1 in magnitude, and the exponent is an integer.
To explain computer arithmetic, in this section it is assumed that the computer uses only four digits to store the mantissa and two digits for the exponent. The mantissa and the exponent have their own signs. Under this assumption, the range of magnitudes of floating point numbers is .1000 × 10⁻⁹⁹ to .9999 × 10⁹⁹.
In this section, the four basic arithmetic operations on normalized floating point
numbers are discussed.
2.4.1 Addition
The addition of two normalized floating point numbers is done by using the following
rules:
(i) If the two numbers have the same exponent, then the mantissas are added directly and the exponent of the sum is the common exponent.
(ii) If the exponents are different, then the number with the lower exponent is shifted to the higher exponent by adjusting its mantissa, and then the above rule is used to add them.
Solution. (i) Here the exponents are the same, so by the first rule one can add the numbers by adding the mantissas. The sum is .7554E15.
(ii) In this case also the exponents are equal, and as before the sum is 1.4199E10. Notice that the mantissa contains five significant figures, but it is assumed that the computer can store only four. So the number is shifted right one place before storing it in memory: the exponent is increased by 1 and the last digit is truncated. Hence the sum is finally .1419E11.
(iii) For this problem the exponents are different and the difference is 8 − 3 = 5. The mantissa of the smaller number (lower exponent) is shifted right 5 places and the number becomes .0000E8. Now the numbers have the same exponent, and the final result is .0000E8 + .3218E8 = .3218E8.
(iv) In this case the exponents are also different and the difference is 27 − 25 = 2. So the mantissa of the smaller number (here the first number) is shifted right by 2 places and it becomes .0038E27. Now the sum is .0038E27 + .8541E27 = .8579E27.
(v) This case is different. The exponents are the same and the sum is 1.4772E99. The mantissa has five significant digits, so it is shifted right and the exponent is increased by 1, becoming 100. Since by our assumption the maximum value of the exponent is 99, the number is larger than the capacity of the floating point format of the assumed computer. The number cannot be stored; this situation is called an overflow condition, and the computer will generate an error message.
2.4.2 Subtraction
Solution. (i) Here the exponents are equal, hence the mantissas are directly subtracted. The result is
.8432E10 − .2832E10 = .5600E10.
(ii) Here also the exponents are equal, so the result is .2697E15 − .2693E15 = .0004E15. The mantissa is not in normalized form. Since the computer always stores normalized numbers, we have to convert it: the normalized number corresponding to .0004E15 is .4000E12. This is the final answer.
(iii) In these numbers the exponents are different. The number with the smaller exponent is shifted right, and the exponent is increased by 1 for every right shift. The second number becomes .0278E−16. Thus the result is .2134E−16 − .0278E−16 = .1856E−16.
2.4.3 Multiplication
Two normalized floating point numbers are multiplied by multiplying the mantissas
and adding the exponents. After multiplication, the mantissa is converted into nor-
malized floating point form and the exponent is adjusted accordingly. Multiplication is
illustrated in the following examples.
Example 2.10 Multiply the following floating point numbers:
(i) .2198E6 by .5671E12
(ii) .2318E17 by .8672E–17
(iii) .2341E52 by .9231E51
(iv) .2341E–53 by .7652E-51.
Solution. (i) In this case, .2198E6 × .5671E12 = .12464858E18.
Note that the mantissa has 8 significant figures, but as per our assumed computer the result will be .1246E18 (the last four significant figures are truncated).
(ii) Here, .2318E17 × .8672E−17 = .20101696E0 = .2010E0.
(iii) .2341E52 × .9231E51 = .21609771E103.
In this case the exponent has three digits, which is not allowed in our assumed computer. An overflow condition occurs, so an error message will be generated.
(iv) .2341E−53 × .7652E−51 = .17913332E−104 = .1791E−104. Here the exponent is below the minimum allowed value, so an underflow condition occurs and again an error message will be generated.
2.4.4 Division
The division of normalized floating point numbers is similar to division of ordinary numbers, the only difference being that the mantissa retains only four significant digits (as per our assumed computer) instead of all digits. The quotient mantissa is written in normalized form and the exponent is adjusted accordingly.
Example 2.11 Perform the following divisions
(i) .8765E43 ÷ .3131E21
(ii) .9999E5 ÷ .1452E–99
(iii) .3781E–18 ÷ .2871E94.
Solution. (i) .8765E43 ÷ .3131E21 = 2.7994251038E22 = .2799E23.
(ii) In this case the number is divided by a very small number:
.9999E5 ÷ .1452E−99 = 6.8863636364E104 = .6886E105.
The exponent exceeds the maximum allowed value, so an overflow condition occurs.
Sometimes floating point arithmetic gives unpredictable results due to the truncation of the mantissa. To illustrate this situation, let us consider the following example. It is well known that (1/6) × 12 = 2. But in floating point arithmetic 1/6 = .1667, and hence (1/6) × 12 = .1667 × 12 = .2000E1. One can also determine the value of (1/6) × 12 by repeated addition. Note that .1667 + .1667 + .1667 + .1667 + .1667 + .1667 = 1.0002 = .1000E1, but adding .1667 twelve times gives .1996E1.
Thus, in floating point arithmetic multiplication is not always the same as repeated addition.
From these examples one might think that numerical computation is very dangerous. But it is not so dangerous, as an actual computer generally stores seven digits as mantissa (in single precision); the larger length of the mantissa gives more accurate results.
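This behaviour can be simulated with Python's decimal module by truncating every intermediate result to a four-digit mantissa, as in our assumed computer (an illustrative sketch):

```python
from decimal import Decimal, Context, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)  # 4-digit mantissa with truncation

sixth = Decimal("0.1667")                   # 1/6 rounded to four digits
print(ctx.multiply(sixth, Decimal(12)))     # 2.000 (multiplication)

s = Decimal(0)
for _ in range(12):                         # repeated addition, truncated each step
    s = ctx.add(s, sixth)
print(s)                                    # 1.996 -- not the same as 2.000
```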
Chapter 1
Numerical Errors
Module No. 3
Different types of finite difference operators are defined; among them the forward difference, backward difference and central difference operators are widely used. In this section, these operators are discussed.
The third order differences are also defined in a similar manner, i.e. ∆³yi = ∆²yi+1 − ∆²yi. All the forward differences for the arguments x0, x1, . . ., x4 are shown in the following table.

x    y     ∆       ∆²       ∆³       ∆⁴
x0   y0
           ∆y0
x1   y1            ∆²y0
           ∆y1              ∆³y0
x2   y2            ∆²y1              ∆⁴y0
           ∆y2              ∆³y1
x3   y3            ∆²y2
           ∆y3
x4   y4
If any entry of the difference table is erroneous, then this error spreads over the table in a triangular pattern.
The propagation of error in a difference table is illustrated in Table 3.2. Let us assume that y3 is erroneous and that the amount of the error is ε. The following observations may be noted from Table 3.2.
x y ∆y ∆2 y ∆3 y ∆4 y ∆5 y
x0 y0
∆y0
x1 y1 ∆2 y0
∆y1 ∆3 y0 + ε
x2 y2 ∆2 y1 + ε ∆4 y0 − 4ε
∆y2 + ε ∆3 y1 − 3ε ∆5 y0 + 10ε
x3 y3 + ε ∆2 y2 − 2ε ∆4 y1 + 6ε
∆y3 − ε ∆3 y2 + 3ε ∆5 y1 − 10ε
x4 y4 ∆2 y3 + ε ∆4 y2 − 4ε
∆y4 ∆3 y3 − ε
x5 y5 ∆2 y4
∆y5
x6 y6
(i) The error spreads in a triangular pattern: each higher order difference column contains one more erroneous entry than the previous one.
(ii) The error is maximum (in magnitude) along the horizontal line through the erroneous tabulated value.
(iii) In the kth difference column, the coefficients of the errors are the binomial coefficients in the expansion of (1 − x)^k. In particular, the errors in the second difference column are ε, −2ε, ε; in the third difference column they are ε, −3ε, 3ε, −ε; and so on.
If there is an error in a single entry of the table, then we can detect and correct it from the difference table. The position of the erroneous entry can be identified by performing the following steps:
(i) If at any stage the differences do not follow a smooth pattern, then there is an error.
(ii) If the differences of some order (this generally happens in higher orders) alternate in sign, then the middle entry contains an error.
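A short Python sketch (illustrative, not from the notes) that builds a forward difference table and shows how a single planted error fans out with binomial coefficients:

```python
def difference_table(y):
    """Return the columns of the forward difference table of the sequence y."""
    table = [list(y)]
    while len(table[-1]) > 1:
        prev = table[-1]
        table.append([prev[i+1] - prev[i] for i in range(len(prev) - 1)])
    return table

y = [x**3 for x in range(8)]   # a cubic: its fourth differences should be zero
y[4] += 1                      # plant an error of +1 at y4

print(difference_table(y)[4])  # shows 1, -4, 6, -4: coefficients of (1 - x)^4
```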
Properties
Example 3.1
Example 3.2 Show that
∆[f(x)/g(x)] = [g(x)∆f(x) − f(x)∆g(x)] / [g(x + h)g(x)], g(x) ≠ 0.
Solution.
∆[f(x)/g(x)] = f(x + h)/g(x + h) − f(x)/g(x)
= [f(x + h)g(x) − g(x + h)f(x)] / [g(x + h)g(x)]
= {g(x)[f(x + h) − f(x)] − f(x)[g(x + h) − g(x)]} / [g(x + h)g(x)]
= [g(x)∆f(x) − f(x)∆g(x)] / [g(x + h)g(x)].
In particular, ∇y1 = y1 − y0, ∇y2 = y2 − y1, . . ., ∇yn = yn − yn−1.
These are called the first order backward differences. The second order differences are denoted by ∇²y2, ∇²y3, . . ., ∇²yn. The first two second order backward differences are
∇²y2 = ∇(∇y2) = ∇(y2 − y1) = ∇y2 − ∇y1 = (y2 − y1) − (y1 − y0) = y2 − 2y1 + y0, and
∇²y3 = y3 − 2y2 + y1, ∇²y4 = y4 − 2y3 + y2.
The other second order differences can be obtained in a similar manner.
In general,
∇ⁿyi = ∇ⁿ⁻¹yi − ∇ⁿ⁻¹yi−1, n = 1, 2, . . .,
where ∇⁰yi = yi and ∇¹yi = ∇yi.
Like forward differences, the backward differences can be written in a tabular form, called the backward difference or horizontal difference table. All the backward differences for the arguments x0, x1, . . ., x4 are shown in Table 3.3.
x y ∇ ∇2 ∇3 ∇4
x0 y0
x1 y1 ∇y1
x2 y2 ∇y2 ∇ 2 y2
x3 y3 ∇y3 ∇ 2 y3 ∇ 3 y3
x4 y4 ∇y4 ∇ 2 y4 ∇ 3 y4 ∇4 y4
It is observed from the forward and backward difference tables that, for a given table of values, both tables contain the same entries. Practically there is no difference between the values in the two tables; theoretically, however, they have separate significance.
There is another kind of finite difference operator, known as the central difference operator. This operator is denoted by δ and is defined by
δf(x) = f(x + h/2) − f(x − h/2), i.e. δyi = yi+1/2 − yi−1/2.
In general,
δⁿyi = δⁿ⁻¹yi+1/2 − δⁿ⁻¹yi−1/2.
The central differences for the arguments x0, x1, . . ., x4 are shown in the following table.
x y δ δ2 δ3 δ4
x0 y0
δy1/2
x1 y1 δ 2 y1
δy3/2 δ 3 y3/2
x2 y2 δ 2 y2 δ 4 y2
δy5/2 δ 3 y5/2
x3 y3 δ 2 y3
δy7/2
x4 y4
It may be observed that all odd order differences have fractional suffixes, while the even order differences have integer suffixes.
Shift operator, E:
The shift operator is defined by Ef(x) = f(x + h), i.e. Eyi = yi+1.
Note that the shift operator increases the subscript of y by one. When the shift operator is applied twice to the function f(x), the subscript of y is increased by 2.
That is,
E²f(x) = E[Ef(x)] = E[f(x + h)] = f(x + 2h).
In general,
Eⁿf(x) = f(x + nh), i.e. Eⁿyi = yi+n.
The inverse shift operator can be defined in a similar manner. It is denoted by E⁻¹ and is defined by
E⁻¹f(x) = f(x − h).
Similarly, second and higher order inverse operators are defined as
E⁻²f(x) = f(x − 2h) and E⁻ⁿf(x) = f(x − nh).
Properties
Average operator, µ:
The average operator µ is defined by
µf(x) = (1/2)[f(x + h/2) + f(x − h/2)].
Differential operator, D:
The differential operator is well known from differential calculus and is denoted by D. This operator gives the derivative. That is,
Df(x) = (d/dx) f(x) = f′(x)      (3.19)
D²f(x) = (d²/dx²) f(x) = f″(x)      (3.20)
· · · · · · · · ·
Dⁿf(x) = (dⁿ/dxⁿ) f(x) = f⁽ⁿ⁾(x).      (3.21)
The factorial notation is a very useful notation in the calculus of finite differences. Using this notation one can find the differences of all orders by rules similar to those of differential calculus. It is also a very useful and simple notation for finding anti-differences. The nth factorial of x is denoted by x(n) and is defined by
x(n) = x(x − h)(x − 2h) · · · (x − (n − 1)h),
and its first difference is ∆x(n) = nh x(n−1).
Note that this property is analogous to the differential formula D(xⁿ) = nxⁿ⁻¹ when h = 1.
The above formula can also be used to find the anti-difference (like integration in integral calculus), as
∆⁻¹x(n−1) = (1/(nh)) x(n).      (3.24)
A lot of useful and interesting results can be derived relating the operators discussed above. First of all, we determine the relation between the forward and backward difference operators. Note that ∆y0 = y1 − y0 = ∇y1, ∆²y0 = y2 − 2y1 + y0 = ∇²y2, etc.
In general,
∆ⁿyi = ∇ⁿyi+n, i = 0, 1, 2, . . . .      (3.25)
Again, ∆f(x) = f(x + h) − f(x) = Ef(x) − f(x) = (E − 1)f(x). From this relation one can conclude that the operators ∆ and E − 1 are equivalent. That is,
∆ ≡ E − 1 or E ≡ ∆ + 1.      (3.26)
Similarly, ∇f(x) = f(x) − f(x − h) = f(x) − E⁻¹f(x) = (1 − E⁻¹)f(x). That is,
∇ ≡ 1 − E⁻¹.      (3.27)
The central difference operator can also be expressed in terms of the shift operator:
δf(x) = f(x + h/2) − f(x − h/2) = E^(1/2)f(x) − E^(−1/2)f(x) = (E^(1/2) − E^(−1/2))f(x).
That is,
δ ≡ E^(1/2) − E^(−1/2).
For the average operator,
µ²f(x) = (1/4)[E^(1/2) + E^(−1/2)]² f(x)
= (1/4)[(E^(1/2) − E^(−1/2))² + 4] f(x) = [(1/4)δ² + 1] f(x).
Hence,
µ ≡ √(1 + δ²/4).      (3.30)
Every operator defined earlier can be expressed in terms of the other operator(s). A few more relations among the operators ∆, ∇, E and δ are deduced in the following.
∆ ≡ ∇E ≡ δE 1/2 . (3.31)
There is a very nice relation between the operators E and D, deduced below.
Ef(x) = f(x + h) = f(x) + hf′(x) + (h²/2!)f″(x) + (h³/3!)f‴(x) + · · ·   [by Taylor's series]
= f(x) + hDf(x) + (h²/2!)D²f(x) + (h³/3!)D³f(x) + · · ·
= [1 + hD + h²D²/2! + h³D³/3! + · · ·] f(x)
= e^(hD) f(x).
Hence,
E ≡ e^(hD).      (3.32)
Some of the operators are commutative with other operators. For example, µ and E are commutative, as
µEf(x) = µf(x + h) = (1/2)[f(x + 3h/2) + f(x + h/2)],
and
Eµf(x) = E·(1/2)[f(x + h/2) + f(x − h/2)] = (1/2)[f(x + 3h/2) + f(x + h/2)].
Hence,
µE ≡ Eµ.      (3.38)
(1 + ∆)(1 − ∇) ≡ 1. (3.39)
(ii)
µf(x) = (1/2)[E^(1/2) + E^(−1/2)]f(x) = (1/2)[e^(hD/2) + e^(−hD/2)]f(x)
= cosh(hD/2) f(x).
Thus, µ ≡ cosh(hD/2).
(iii)
[(∆ + ∇)/2] f(x) = (1/2)[∆f(x) + ∇f(x)]
= (1/2)[f(x + h) − f(x) + f(x) − f(x − h)]
= (1/2)[f(x + h) − f(x − h)] = (1/2)[E − E⁻¹]f(x)
= µδf(x) (as in the previous case).
Thus,
µδ ≡ (∆ + ∇)/2.      (3.40)
Hence,
∆∇ ≡ ∇∆ ≡ (E 1/2 − E −1/2 )2 ≡ δ 2 . (3.41)
(v)
[∆E⁻¹/2 + ∆/2] f(x) = (1/2)[∆f(x − h) + ∆f(x)]
= (1/2)[f(x) − f(x − h) + f(x + h) − f(x)]
= (1/2)[f(x + h) − f(x − h)] = (1/2)[E − E⁻¹]f(x)
= (1/2)(E^(1/2) + E^(−1/2))(E^(1/2) − E^(−1/2))f(x)
= µδf(x).
Hence,
∆E⁻¹/2 + ∆/2 ≡ µδ.      (3.42)
(vi) [µ + δ/2] f(x) = {(1/2)[E^(1/2) + E^(−1/2)] + (1/2)[E^(1/2) − E^(−1/2)]} f(x) = E^(1/2) f(x).
Thus,
E^(1/2) ≡ µ + δ/2.      (3.43)
(vii) δµf(x) = (1/2)(E^(1/2) + E^(−1/2))(E^(1/2) − E^(−1/2))f(x) = (1/2)[E − E⁻¹]f(x).
Therefore,
(1 + δ²µ²)f(x) = [1 + (1/4)(E − E⁻¹)²] f(x)
= [1 + (1/4)(E² − 2 + E⁻²)] f(x) = (1/4)(E + E⁻¹)² f(x)
= [1 + (1/2)(E^(1/2) − E^(−1/2))²]² f(x) = [1 + δ²/2]² f(x).
Hence,
1 + δ²µ² ≡ (1 + δ²/2)².      (3.44)
(viii)
[δ²/2 + δ√(1 + δ²/4)] f(x)
= (1/2)(E^(1/2) − E^(−1/2))² f(x) + (E^(1/2) − E^(−1/2)) √(1 + (1/4)(E^(1/2) − E^(−1/2))²) f(x)
= (1/2)[E + E⁻¹ − 2] f(x) + (1/2)(E^(1/2) − E^(−1/2))(E^(1/2) + E^(−1/2)) f(x)
= (1/2)[E + E⁻¹ − 2] f(x) + (1/2)(E − E⁻¹) f(x)
= (E − 1) f(x).
Hence,
δ²/2 + δ√(1 + δ²/4) ≡ E − 1 ≡ ∆.      (3.45)
In Table 3.5 it is shown how each operator can be expressed in terms of the others.

Table 3.5: Relationships among the operators.

      E                       ∆                         ∇                         δ                          hD
E     E                       ∆ + 1                     (1 − ∇)⁻¹                 1 + δ²/2 + δ√(1 + δ²/4)    e^(hD)
∆     E − 1                   ∆                         (1 − ∇)⁻¹ − 1             δ²/2 + δ√(1 + δ²/4)        e^(hD) − 1
∇     1 − E⁻¹                 1 − (1 + ∆)⁻¹             ∇                         −δ²/2 + δ√(1 + δ²/4)       1 − e^(−hD)
δ     E^(1/2) − E^(−1/2)      ∆(1 + ∆)^(−1/2)           ∇(1 − ∇)^(−1/2)           δ                          2 sinh(hD/2)
µ     (E^(1/2) + E^(−1/2))/2  (1 + ∆/2)(1 + ∆)^(−1/2)   (1 − ∇/2)(1 − ∇)^(−1/2)   √(1 + δ²/4)                cosh(hD/2)
hD    log E                   log(1 + ∆)                −log(1 − ∇)               2 sinh⁻¹(δ/2)              hD
x(0) = 1
x(1) = x
x(2) = x(x − h)      (3.46)
x(3) = x(x − h)(x − 2h)
x(4) = x(x − h)(x − 2h)(x − 3h)
and so on.
From these equations it is obvious that the base terms (x, x², x³, . . .) of a polynomial can be expressed in terms of the factorial notations x(1), x(2), x(3), . . ., as shown below:
1 = x(0)
x = x(1)
x² = x(2) + hx(1)      (3.47)
x³ = x(3) + 3hx(2) + h²x(1)
x⁴ = x(4) + 6hx(3) + 7h²x(2) + h³x(1)
and so on.
Note that the degree of xᵏ (for any k = 1, 2, 3, . . .) remains unchanged when it is expressed in factorial notation. This observation leads to the following lemma.
Lemma 3.1 Any polynomial f(x) in x can be expressed in factorial notation with the same degree.
Since all the base terms of a polynomial can be expressed in factorial notation, every polynomial can be written with the help of factorial notation. Once a polynomial is expressed in factorial notation, its differences can be determined by rules similar to those of differential calculus.
Example 3.4 Express f(x) = 10x⁴ − 41x³ + 4x² + 3x + 7 in factorial notation and find its first and second differences.
Solution. For simplicity, we assume that h = 1.
Now by (3.47), x = x(1), x² = x(2) + x(1), x³ = x(3) + 3x(2) + x(1), x⁴ = x(4) + 6x(3) + 7x(2) + x(1).
Substituting these values into the function f(x), we obtain
f(x) = 10[x(4) + 6x(3) + 7x(2) + x(1)] − 41[x(3) + 3x(2) + x(1)] + 4[x(2) + x(1)] + 3x(1) + 7
= 10x(4) + 19x(3) − 49x(2) − 24x(1) + 7.
Now the relation ∆x(n) = nx(n−1) (Property 3.1) is used to find the first and second order differences. Therefore,
∆f(x) = 10·4x(3) + 19·3x(2) − 49·2x(1) − 24·1x(0) = 40x(3) + 57x(2) − 98x(1) − 24
= 40x(x − 1)(x − 2) + 57x(x − 1) − 98x − 24 = 40x³ − 63x² − 75x − 24,
and ∆²f(x) = 120x(2) + 114x(1) − 98 = 120x(x − 1) + 114x − 98 = 120x² − 6x − 98.
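A small Python check (added for illustration) that the differences computed directly agree with the formulas obtained above:

```python
def f(x):
    return 10*x**4 - 41*x**3 + 4*x**2 + 3*x + 7

def delta(g, h=1):
    """Forward difference operator: (delta g)(x) = g(x + h) - g(x)."""
    return lambda x: g(x + h) - g(x)

df, d2f = delta(f), delta(delta(f))
for x in range(4):
    print(df(x) == 40*x**3 - 63*x**2 - 75*x - 24,
          d2f(x) == 120*x**2 - 6*x - 98)     # all True
```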
The above process of converting a polynomial into factorial notation is very laborious when the degree of the polynomial is large. There is a systematic method, similar to Maclaurin's formula in differential calculus, which is used to convert a polynomial into factorial notation. This technique is also useful for a function which satisfies Maclaurin's theorem for infinite series.
Let f(x) be a polynomial in x of degree n. We assume that in factorial notation f(x) is of the form
f(x) = a0 + a1 x(1) + a2 x(2) + · · · + an x(n),
where the coefficients a0, a1, . . ., an are to be determined.
Solution. Let h = 1. For the given function, f(0) = 11, f(1) = 17, f(2) = 203, f(3) = 1091, f(4) = 3563.
The remaining coefficients are obtained in a similar way.
NON-LINEAR EQUATIONS
A problem that most students should be familiar with from ordinary algebra is that of finding the root of an equation f(x) = 0, i.e., the value of the argument that makes f zero.
More precisely, if the function is defined as y = f(x), we seek the value α such that f(α) = 0. The precise terminology is that α is a zero of the function f, or a root of the equation f(x) = 0.
Note that we have not yet specified what kind of function f is. The obvious case is when f is an ordinary real-valued function of a single real variable x, but we can also consider the problem when f is a vector-valued function of a vector-valued variable, in which case the expression above is a system of equations.
Broadly, the methods can be classified as:
• Bracketing methods
• Open end methods
Bracketing methods comprise:
• Bisection method
• Regula falsi or false position method
Open end methods comprise:
• Newton-Raphson method
• Secant method
• Muller's method
• Fixed point method
• Bairstow's method
• Ramanujan's method
• Graeffe's root-squaring method
• Quotient-difference method
BISECTION METHOD
It is based on the theorem that if a function f(x) is continuous between a and b, and f(a) and f(b) are of opposite signs, then there must be at least one root in between.
Algorithm
Choose two points a and b such that f(a)f(b) < 0. This means that f is negative at one point and positive at the other.
Let c be the midpoint of the interval [a, b], i.e., c = (a + b)/2, and consider the product f(a)f(c). There are three possibilities:
1. f(a)f(c) < 0; this means that a root (there might be more than one) is between a and c, i.e., α ∈ [a, c].
2. f(a)f(c) = 0; if we assume that we already know f(a) ≠ 0, this means that f(c) = 0, thus α = c and we have found a root.
3. f(a)f(c) > 0; this means that a root must lie in the other half of the interval, i.e., α ∈ [c, b].
At first glance, this is helpful only if we get the second case and land right on top of a root, and this does not seem very likely. However, a second look reveals that if (1) or (3) holds, we now have a root localized to an interval ([a, c] or [c, b]) that is half the length of the original interval [a, b]. If we now repeat the process, the interval of uncertainty is again decreased by half, and so on, until we have the root localized to within any tolerance we desire.
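A minimal Python sketch of the algorithm (illustrative; the tolerance test is one of several reasonable choices):

```python
def bisection(f, a, b, tol=1e-10):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while b - a > tol:
        c = (a + b) / 2
        if f(a) * f(c) <= 0:   # root lies in [a, c] (or c is the root)
            b = c
        else:                  # root lies in [c, b]
            a = c
    return (a + b) / 2

import math
print(bisection(lambda x: 2 - math.exp(x), 0, 1))  # ~0.6931 = ln 2
```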
EXAMPLE 3.1
If f(x) = 2 − eˣ, and we take the original interval to be [a, b] = [0, 1], then the first several steps of the computation are as follows:
f(a) = 1, f(b) = −0.7183; c = (0 + 1)/2 = 1/2; f(c) = 0.3513 > 0
[a, b] ← [1/2, 1];
f(a) = 0.3513, f(b) = −0.7183; c = (1/2 + 1)/2 = 3/4; f(c) = −0.1170 < 0
[a, b] ← [1/2, 3/4];
f(a) = 0.3513, f(b) = −0.1170; c = (1/2 + 3/4)/2 = 5/8; f(c) = 0.1318 > 0
[a, b] ← [5/8, 3/4].
Thus, we have reduced the "interval of uncertainty" from [0, 1], which has length 1, to [5/8, 3/4], which has length 1/8 = 0.125. If we were to continue the process we would eventually have the root localized to within an interval of length as small as we want, since each step cuts the interval of uncertainty in half.
Bisection Convergence and Error:
Let [a0, b0] = [a, b] be the initial interval, with f(a)f(b) < 0. Define the approximate root as
xn = cn = (b_{n−1} + a_{n−1})/2.
Then there exists a root α ∈ [a, b] such that
|α − xn| ≤ (1/2)ⁿ (b − a).
Moreover, to achieve an accuracy of |α − xn| ≤ ε it suffices to take
n ≥ [log(b − a) − log ε] / log 2.
This follows from
(bn − an) = (1/2)(b_{n−1} − a_{n−1}),
(bn − an) = (1/2)ⁿ (b0 − a0),
|α − xn| ≤ (1/2)(b_{n−1} − a_{n−1}) = (1/2)(1/2)^{n−1} (b0 − a0) = (1/2)ⁿ (b0 − a0).
False Position / Regula Falsi Method
• It is also termed the linear interpolation method.
• It is a bracketing method.
A straight line is drawn from A(a1, f(a1)) to B(b1, f(b1)); the point c1 where it intersects the abscissa is an improved estimate of the root.
• If f(a1)f(c1) < 0, then [a1, c1] brackets the root. Otherwise, the root is in [c1, b1]. In the figure it just so happens that [a1, c1] brackets the root. This means the left end is unchanged, while the right end is adjusted to c1. Therefore, the interval used in the next iteration is [a2, b2], where a2 = a1 and b2 = c1. Continuing this process generates a sequence c2, c3, . . .
The equation of the line connecting points A and B is
y − f(b1) = [f(b1) − f(a1)]/(b1 − a1) · (x − b1).
To find the x-intercept, set y = 0 and solve for x = c1:
c1 = b1 − f(b1)(b1 − a1)/[f(b1) − f(a1)],
which simplifies to
c1 = [a1 f(b1) − b1 f(a1)] / [f(b1) − f(a1)].
Since f(c1) > 0, the root must lie in [c1, b1], so the left endpoint is adjusted to a2 = c1 and the right end remains unchanged, b2 = b1. Therefore, the updated interval is [a2, b2] = [1.189493, 4]. Next,
c2 = [a2 f(b2) − b2 f(a2)] / [f(b2) − f(a2)] = 2.515720.
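A Python sketch of the update rule (illustrative; the stopping criterion is my own choice):

```python
def false_position(f, a, b, tol=1e-10, max_iter=100):
    """Regula falsi: like bisection, but c is the x-intercept of the chord AB."""
    for _ in range(max_iter):
        c = (a * f(b) - b * f(a)) / (f(b) - f(a))
        if abs(f(c)) < tol:
            break
        if f(a) * f(c) < 0:    # [a, c] brackets the root
            b = c
        else:                  # [c, b] brackets the root
            a = c
    return c
```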
Newton's Method
• An open method that requires only one current guess; the root does not need to be bracketed.
• Consider some point x0. If we approximate f(x) as a line about x0, then we can again solve for the root of the line:
f(x) ≈ f′(x0)(x − x0) + f(x0).
• Solving f(x) = 0 leads to the following iteration:
x1 = x0 − f(x0)/f′(x0),
and in general
x_{i+1} = x_i − f(x_i)/f′(x_i).
Newton's Method
• This can also be seen from Taylor's series. Assume we have a guess x0 close to the actual root, and expand f(x) about this point. Writing x = x_i + ∆x,
f(x_i + ∆x) = f(x_i) + ∆x f′(x_i) + (∆x²/2!) f″(x_i) + · · · ≈ 0.
• If ∆x is small, then the higher powers ∆xⁿ quickly go to zero, so keeping only the linear term,
∆x ≈ −f(x_i)/f′(x_i), i.e. x_{i+1} = x_i − f(x_i)/f′(x_i).
Newton's Method
• Graphically, follow the tangent line down to its intersection with the x-axis.
• Problems: the method needs the initial guess to be close, or the function to behave nearly linearly within the range.
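A compact Python sketch of the iteration (illustrative; the derivative is supplied by the caller):

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson: x_{i+1} = x_i - f(x_i)/f'(x_i)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# square root of 2 as the root of f(x) = x^2 - 2, starting from x0 = 1
print(newton(lambda x: x*x - 2, lambda x: 2*x, 1.0))  # 1.4142135623730951
```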
Finding a square-root
• Ever wonder why they call this a square root?
• Consider the root of the equation f(x) = x² − a = 0.
• This of course works for any power p: x = a^(1/p) is the root of f(x) = xᵖ − a = 0, p ∈ R.
Finding a square-root
• Example: √2 = 1.4142135623730950488016887242097...
• Let x0 be one and apply Newton's method. With f(x) = x² − 2,
x_{i+1} = x_i − (x_i² − 2)/(2x_i) = x_i/2 + 1/x_i.
x0 = 1
x1 = (1/2)(1 + 2/1) = 3/2 = 1.5000000000
x2 = (1/2)(3/2 + 4/3) = 17/12 = 1.4166666667
Finding a square-root
• Example: √2 = 1.4142135623730950488016887242097
• Note the rapid convergence:
x3 = (1/2)(17/12 + 24/17) = 577/408 = 1.414215686
x4 = 1.4142135623746
x5 = 1.4142135623730950488016896
x6 = 1.4142135623730950488016887242097
• Note, this was done with the standard Microsoft calculator at maximum precision.
Finding a square-root
• Can we come up with a better initial guess?
• Sure, just divide the exponent by 2. Remember the bias offset: use bit-masks to extract the exponent into an integer, modify it, and set the initial guess.
• For 2, this will lead to x0 = 1 (round down).
Convergence Rate of Newton's Method
Let e_n = α − x_n, so that α = x_n + e_n. Then
0 = f(α) = f(x_n + e_n) = f(x_n) + e_n f′(x_n) + (1/2)e_n² f″(ξ_n), for some ξ_n between α and x_n,
so that
f(x_n) + e_n f′(x_n) = −(1/2)e_n² f″(ξ_n).
Now,
e_{n+1} = α − x_{n+1} = α − x_n + f(x_n)/f′(x_n) = e_n + f(x_n)/f′(x_n)
= [e_n f′(x_n) + f(x_n)]/f′(x_n),
and therefore
e_{n+1} = −(1/2) [f″(ξ_n)/f′(x_n)] e_n².
Convergence Rate of Newton's Method
The method converges quadratically:
if e_n ≈ 10^(−k), then e_{n+1} ≈ c · 10^(−2k).
Newton's Algorithm
• Requires the derivative function to be evaluated, hence more function evaluations per iteration.
• A robust solution would check whether the iteration is stepping too far and limit the step.
• Most uses of Newton's method assume the approximation is pretty close and apply one to three iterations blindly.
Division by Multiplication
Consider f(x) = 1/x − a = 0.
• Newton's method gives the iteration:
x_{k+1} = x_k − (1/x_k − a)/(−1/x_k²) = x_k + x_k²(1/x_k − a) = x_k(2 − a x_k).
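A quick Python illustration of this division-free iteration (only multiplications and subtractions appear in the update):

```python
def reciprocal(a, x0, iters=6):
    """Approximate 1/a using x_{k+1} = x_k (2 - a x_k); needs a decent x0."""
    x = x0
    for _ in range(iters):
        x = x * (2 - a * x)
    return x

print(reciprocal(7.0, 0.1))  # ~0.142857142857... = 1/7
```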
Reciprocal Square Root
Note that √a = a · (1/√a), so computing the reciprocal square root also yields the square root after one multiplication.
Let f(x) = 1/x² − a = 0. Then f′(x) = −2/x³, and Newton's method gives
x_{k+1} = x_k + (x_k/2) − (x_k³/2) a = (1/2) x_k (3 − a x_k²).
1/Sqrt(2)
• Let's look at the convergence for the reciprocal square root of 2.
1/Sqrt(x)
An IEEE-style floating point number has the form
x = (−1)ˢ · 2^(e−127) · (1.m)₂.
– What is the reciprocal?
– What is the square root?
1/Sqrt(x)
• Theoretically there is a slight problem:
the mantissa 1.m produces a result between 1 and 2, hence it remains normalized, 1.m′;
but for 1/√x we get a number between 1/2 and 1, so we need to shift the exponent.
Secant Method
What if we do not know the derivative of f(x)? Replace it by the slope of the secant line through the last two iterates:
x_{k+1} = x_k − f(x_k)(x_k − x_{k−1}) / [f(x_k) − f(x_{k−1})],
which is the secant method.
Convergence Rate of Secant
Using Taylor's series, it can be shown that
e_{k+1} = α − x_{k+1} ≈ −(1/2)[f″(ξ_k)/f′(ξ_k)] e_k e_{k−1} = c · e_k e_{k−1}.
This is a recursive definition of the error term. Expressed out, we can say that
|e_{k+1}| ≈ C|e_k|^φ, where φ = (1 + √5)/2 ≈ 1.62.
We call this super-linear convergence.
Convergence Rate of Secant
Consider x cos x + 1 = 0. With two iterates x2 and x3 in hand, the next secant step is
x4 = x3 − f(x3)(x3 − x2)/[f(x3) − f(x2)] = 2.0229.
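A Python sketch of the secant iteration (illustrative; the starting guesses are mine):

```python
import math

def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant method: Newton with f' replaced by a finite difference slope."""
    for _ in range(max_iter):
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        x0, x1 = x1, x2
        if abs(x1 - x0) < tol:
            break
    return x1

print(secant(lambda x: x * math.cos(x) + 1, 1.0, 2.0))  # root of x cos x + 1 = 0
```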
Fixed-Point Iteration / Iteration Method / Method of Successive Substitution
The equation f(x) = 0 may be written as f(x) = x − φ(x) = 0, hence x = φ(x), which suggests the iteration
x_{k+1} = φ(x_k).
Many problems also take on the specialized form g(x) = x, where we seek the x that satisfies this equation.
Fixed-Point Iteration
For example, the iteration
x_{i+1} = x_i − (x_i² − 2)/(2x_i)
is of this form, with φ(x) = x − (x² − 2)/(2x); its fixed point is √2.
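A generic fixed-point driver in Python (illustrative):

```python
def fixed_point(phi, x0, tol=1e-12, max_iter=100):
    """Iterate x = phi(x) until successive values agree to within tol."""
    x = x0
    for _ in range(max_iter):
        x_new = phi(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

print(fixed_point(lambda x: x - (x*x - 2) / (2*x), 1.0))  # converges to sqrt(2)
```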
MULLER'S METHOD
• An extension of the secant method: a quadratic curve is fitted through three points (x1, f(x1)), (x2, f(x2)) and (x3, f(x3)).
• The point x4, one of the roots of the quadratic, is taken as the next approximation:
P(x) = a0 + a1(x − c) + a2(x − c)².
• For c = x3 and x = x4, assuming x4 to be the root,
a2(x4 − x3)² + a1(x4 − x3) + a0 = 0,
so
(x4 − x3) = −2a0 / (a1 ± √(a1² − 4a2a0)).
• This form of the quadratic formula is chosen to avoid subtractive cancellation.
• Requiring the parabola to interpolate f at x = x1, x2 and x3:
a2(x1 − x3)² + a1(x1 − x3) + a0 = p(x1) = f(x1)
a2(x2 − x3)² + a1(x2 − x3) + a0 = p(x2) = f(x2)
a2(x3 − x3)² + a1(x3 − x3) + a0 = p(x3) = f(x3)
• Let h1 = x1 − x3, h2 = x2 − x3 and fi = f(xi). With these modifications the equations become
a2 h1² + a1 h1 + a0 = f1
a2 h2² + a1 h2 + a0 = f2
0 + 0 + a0 = f3
• Since a0 = f3, we can obtain a1 and a2 from
a2 h1² + a1 h1 = f1 − f3 = d1
a2 h2² + a1 h2 = f2 − f3 = d2
• This results in
a1 = (d2 h1² − d1 h2²) / [h1 h2 (h1 − h2)]
a2 = (d1 h2 − d2 h1) / [h1 h2 (h1 − h2)]
h4 = −2a0 / [a1 ± √(a1² − 4a2 a0)]
• x4 = x3 + h4
• The sign is chosen in such a way that the magnitude of the denominator is largest.
• Now x2, x3 and x4 are taken as the initial guesses to calculate h5 and hence x5.
• Example: Solve the Leonardo equation f(x) = x³ + 2x² + 10x − 20 = 0 by Muller's method.
• Solution: Take the three starting guesses x1 = 0, x2 = 1, x3 = 2. Then
f1 = −20, f2 = −7, f3 = 16
h1 = x1 − x3 = −2, h2 = x2 − x3 = −1
d1 = f1 − f3 = −36, d2 = f2 − f3 = −23
D = h1 h2 (h1 − h2) = 2(−2 + 1) = −2
a1 = [(−23)(−2)² − (−36)(−1)²]/(−2) = 28
a2 = [(−36)(−1) − (−23)(−2)]/(−2) = 5
a0 = f3 = 16
h4 = −2·16/[28 ± (28² − 4·5·16)^(1/2)] = −32/49.54
• Taking the positive sign, x4 = x3 + h4 = 2 − 0.646 = 1.354.
• Iteration 2: x1 = 1, x2 = 2, x3 = 1.354, so
h1 = x1 − x3 = −0.354, h2 = x2 − x3 = 0.646
f1 = −7, f2 = 16, f3 = −0.3096797
d1 = f1 − f3 = −6.6903202, d2 = f2 − f3 = 16.3096797
D = h1 h2 (h1 − h2) = 0.2287031
a1 = [d2 h1² − d1 h2²]/D = 21.1454
a2 = [d1 h2 − d2 h1]/D = 6.354
a0 = f3 = −0.3096797
h5 = −2a0/[a1 + (a1² − 4a2 a0)^(1/2)] = 0.6193594/42.4762 = 0.01458
• x5 = x3 + h5 = 1.3686472.
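A Python sketch of the full iteration (illustrative; complex arithmetic is used so that a negative discriminant is also handled):

```python
import cmath

def muller(f, x1, x2, x3, tol=1e-12, max_iter=50):
    """Muller's method: fit a quadratic through three points, step to its root."""
    for _ in range(max_iter):
        h1, h2 = x1 - x3, x2 - x3
        d1, d2 = f(x1) - f(x3), f(x2) - f(x3)
        D = h1 * h2 * (h1 - h2)
        a1 = (d2 * h1**2 - d1 * h2**2) / D
        a2 = (d1 * h2 - d2 * h1) / D
        a0 = f(x3)
        disc = cmath.sqrt(a1 * a1 - 4 * a2 * a0)
        # choose the sign that makes the denominator largest in magnitude
        denom = a1 + disc if abs(a1 + disc) > abs(a1 - disc) else a1 - disc
        x4 = x3 - 2 * a0 / denom
        if abs(x4 - x3) < tol:
            return x4
        x1, x2, x3 = x2, x3, x4
    return x4

print(muller(lambda x: x**3 + 2*x**2 + 10*x - 20, 0, 1, 2))  # ~(1.3688081+0j)
```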
Bairstow's Method
• Here a quadratic factor of a higher order polynomial is obtained.
• Consider the cubic equation f(x) = a3x³ + a2x² + a1x + a0.
• Let x² + Rx + S be the exact factor and x² + rx + s an approximate factor; the quotient and remainder terms will then be linear:
f(x) = (x² + rx + s)(b3x + b2) + b1x + b0.
• If r = R and s = S, then b1 and b0 will be zero.
• Comparing coefficients we have
b3 = a3
b2 = a2 − rb3
b1 = a1 − rb2 − sb3
b0 = a0 − sb2.
• In general, the values of the bi are given by the following recurrence formula:
bn = an
bn−1 = an−1 − rbn
bi = ai − rbi+1 − sbi+2, for i = n−2 down to 1
b0 = a0 − sb2.
• For x² + rx + s to be an exact factor we need
b0(r, s) = 0 = b1(r, s).
• The corrections Δr0 and Δs0 computed from these conditions are applied to the initial assumptions r0, s0 to determine r1 and s1.
SOLUTION OF LINEAR EQUATIONS
The equation
a1x1 + a2x2 + a3x3 + · · · + anxn = b
may be written in the concise form
Σ_{i=1}^{n} ai xi = b.
A single such equation has infinitely many solutions for the xi. To have a unique solution, the number of equations and the number of variables must be the same. A set of n such independent equations is termed a system of equations or simultaneous equations:
a11x1 + a12x2 + · · · + a1nxn = b1
a21x1 + a22x2 + · · · + a2nxn = b2
a31x1 + a32x2 + · · · + a3nxn = b3
.
an1x1 + an2x2 + · · · + annxn = bn
In matrix notation this is written as Ax = b. It can be solved by:
• Elimination approaches
• Iterative approaches
Elimination methods are:
1. Basic Gauss elimination method
2. Gauss elimination with pivoting
3. Gauss-Jordan method
4. LU decomposition method
5. Matrix inverse method
Given an arbitrary system of equations, four possibilities may arise:
1. The system has a unique solution.
2. The system has no solution.
3. The system has infinitely many solutions.
4. The system is ill conditioned.
For systems of equations the following observations are made:
1. For a unique solution the number of equations and variables must be the same.
2. If the number of equations is less than the number of variables, the system is said to be under-determined and a unique solution may not be possible.
3. If the number of equations is larger than the number of variables, the system is said to be over-determined and a unique solution may or may not be possible.
4. The system is said to be homogeneous when all the constants bi are zero.
Direct solutions to linear systems of algebraic equations
• Solve the system of equations AX = B. By Cramer's rule, xk = |Ak|/|A|, where Ak is the matrix A with its kth column replaced by the vector B, and |A| is the determinant of the matrix A.
• For each B vector we must evaluate N + 1 determinants of size N, where N defines the size of the matrix A.
• A determinant can be evaluated using the method of expansion by cofactors; for a triangular matrix it is simply the product of the diagonal entries.
• The inverse of a product satisfies [A1A2A3]⁻¹ = A3⁻¹A2⁻¹A1⁻¹, and the inverse of a diagonal matrix diag(a11, a22, a33) is diag(1/a11, 1/a22, 1/a33).
Gauss Elimination Type Solutions to Banded Matrices
• Banded matrices have their non-zero entries contained within a defined number of positions to the left and right of the diagonal (the bandwidth).
• [The original slide shows an N×N banded matrix alongside its compact diagonal storage: each row is shifted so that the diagonal entries line up in a single column, and only the bands are stored.]
Partial Pivoting
• Always look below the pivot element, pick the row with the largest value, and switch rows.
Complete Pivoting
• Look at all columns and all rows to the right of/below the pivot element and switch so that the largest element possible is in the pivot position.
• For complete pivoting, you must also change the order of the variable array.
• Pivoting procedures give large diagonal elements, which minimize round-off error and increase accuracy.
• Pivoting is not required when the matrix is diagonally dominant. A matrix is diagonally dominant when the absolute value of each diagonal term is greater than the sum of the absolute values of the off-diagonal terms in its row.
Example:
Find the inverse of the matrix
A =
[1 2 3]
[0 1 2]
[0 0 1]
Let
A⁻¹ =
[a11 a12 a13]
[0   a22 a23]
[0   0   a33]
Since A⁻¹A = I, we can write
a11 = 1, a22 = 1, a33 = 1
2a11 + a12 = 0, so a12 = −2
2a22 + a23 = 0, so a23 = −2
3a11 + 2a12 + a13 = 0, so a13 = 1
Hence
A⁻¹ =
[1 −2  1]
[0  1 −2]
[0  0  1]
Example:
Use Gauss elimination to solve the following system of equations:
2x + y + z = 10
3x + 2y + 3z = 18
x + 4y + 9z = 16
Eliminate x from the second and third equations:
• Multiply the first equation by −3/2 and add it to the second, giving y + 3z = 6.
• Multiply the first equation by −1/2 and add it to the third, giving 7y + 17z = 22.
From the last two equations eliminate y: multiply the first of them by −7 and add it to the second, giving −4z = −20, i.e. z = 5.
The upper triangular form is given as:
2x + y + z = 10
y + 3z = 6
z = 5
Back substitution gives x = 7, y = −9 and z = 5.
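A compact Python sketch of Gauss elimination with back substitution (illustrative; partial pivoting included as discussed above):

```python
def gauss_solve(A, b):
    """Solve Ax = b by Gauss elimination with partial pivoting."""
    n = len(b)
    A = [row[:] for row in A]      # work on copies
    b = b[:]
    for k in range(n - 1):
        # partial pivoting: put the largest remaining entry on the diagonal
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):  # eliminate column k below the pivot
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1): # back substitution
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

print(gauss_solve([[2, 1, 1], [3, 2, 3], [1, 4, 9]], [10, 18, 16]))  # ~[7, -9, 5]
```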
Example:
Solve the equations
0.0003120 x1 + 0.006032 x2 = 0.003328
0.50000 x1 + 0.8942 x2 = 0.9471
The exact solution is x1 = 1 and x2 = 0.5. First solve the system with pivoting.
To compute the inverse X = A⁻¹ = [xij] of a 3×3 matrix A, we solve the three systems
A[x11 x21 x31]ᵀ = [1 0 0]ᵀ, A[x12 x22 x32]ᵀ = [0 1 0]ᵀ, A[x13 x23 x33]ᵀ = [0 0 1]ᵀ.
We can apply Gauss elimination to each of these systems, and the result in each case is the corresponding column of A⁻¹. We can solve all three systems simultaneously using the augmented matrix
[a11 a12 a13 | 1 0 0]
[a21 a22 a23 | 0 1 0]
[a31 a32 a33 | 0 0 1]
The multipliers used in the elimination can be collected in the unit lower triangular matrix
[1   0   0]
[l21 1   0]
[l31 l32 1]
Example:
The matrix A is given as
A = [ 2 1 1 ]
    [ 3 2 3 ]
    [ 1 4 9 ].
The augmented matrix is
[ 2 1 1 | 1 0 0 ]
[ 3 2 3 | 0 1 0 ]
[ 1 4 9 | 0 0 1 ].
After the first stage
[ 2 1   1    | 1    0 0 ]
[ 0 1/2 3/2  | -3/2 1 0 ]
[ 0 7/2 17/2 | -1/2 0 1 ].
After the second stage, the three right-hand-side columns, processed together with the same triangular matrix
[ 2 1   1   ]
[ 0 1/2 3/2 ]
[ 0 0   -2  ],
become (1, -3/2, 10), (0, 1, -7) and (0, 0, 1); back substitution on each then yields the corresponding column of A^(-1).
Euclidean norm of A = ( Σ_{i=1}^{N} Σ_{j=1}^{N} aij² )^(1/2).
If the matrix A is diagonally dominant,
i.e. the absolute value of each diagonal
term is ≥ the sum of the absolute values of the off-diagonal terms
in its row,
|aii| ≥ Σ_{j=1, j≠i}^{N} |aij|,  i = 1, 2, . . . , N,
then the matrix is not ill-conditioned.
Effects of ill-conditioning are most
serious in large dense matrices (e.g.
especially those obtained in such
problems as curve fitting by least
squares)
• Sparse banded matrices which result
from Finite Difference and Finite
Element methods are typically much
better conditioned (i.e. can solve fairly
large sets of equations without excessive
roundoff error problems)
Ways to overcome ill-conditioning
• Make sure you pivot!
• Use large word size (use double
precision)
• Can use error correction schemes to
improve the accuracy of the answers
• Use iterative methods
Factor Method (Cholesky Method)
• Problem with Gauss elimination
• Right hand side “load” vector, B ,
must be available at the time of matrix
triangulation
• If B is not available during the
triangulation process, the entire
triangulation process must be
repeated!
• Procedure is not well suited for
solving problems in which B changes
AX = B1 O (N3) + O (N2) Steps
AX = B2 O (N3) + O (N2) Steps
:
AX = BR O (N3) + O (N2) Steps
• Using Gauss elimination, O(N3R)
operations, where N = size of the
system of equation and R = the
number of different load vectors
which must be solved for
The concept of the factor method is to
facilitate the solution of multiple right-hand
sides without having to go through
a re-triangulation process for each Br.
Factorization step
• Given A, find L and U such that
A= LU
Where A, L and U are NxN matrices.
• We note that |A| = |L||U| ≠ 0, and
therefore neither L nor U can be
singular.
• The N² equations of A = LU can determine only N² unknowns!
• L is defined as a lower triangular
matrix.
• U is defined as an upper triangular
matrix.
• Together, L and U have N² + N unknown entries.
• Reduce the number of unknowns by
selecting either
• lii = 1, i = 1, 2, . . . , N — Doolittle
method
• uii = 1, i = 1, 2, . . . , N — Crout method
• Now we only have N² unknowns! We
can solve for all unknown elements of
L and U by proceeding from left to
right and top to bottom.
• Factorization proceeds from left to
right and then top to bottom as:
[ a11 a12 a13 ]   [ u11     u12               u13                       ]
[ a21 a22 a23 ] = [ l21 u11 l21 u12 + u22     l21 u13 + u23             ]
[ a31 a32 a33 ]   [ l31 u11 l31 u12 + l32 u22 l31 u13 + l32 u23 + u33   ]
(In the printed slides the current unknown being solved is shown in red and the values already solved in blue.)
Note: in each summation on the right-hand side, the number of terms is the smaller of the two subscripts of the corresponding entry of the left-hand matrix.
a11 = u11
a12 = u12
a13 = u13
a21 = l21u11
a22 = l21u12 + u22
a23 = l21u13 + u23
a31 = l31u11
a32 = l31u12 + l32u22
a33 = l31u13 + l32u23 + u33
These factorization equations follow the Doolittle
method (lii = 1).
Note: besides these two methods there
are infinitely many ways to factorize A,
depending on how the N extra unknowns are apportioned
between the L and U matrices and what values they are given.
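The left-to-right, top-to-bottom sweep above translates directly into code. Here is a minimal Python sketch of the Doolittle factorization (lii = 1); the function name is ours and no pivoting is attempted, so all pivots ukk are assumed non-zero:

    import numpy as np

    def lu_doolittle(A):
        # Doolittle LU factorization: A = L U with unit diagonal in L.
        n = A.shape[0]
        L = np.eye(n)
        U = np.zeros((n, n))
        for k in range(n):
            # Row k of U: u_kj = a_kj - sum_{s<k} l_ks u_sj
            for j in range(k, n):
                U[k, j] = A[k, j] - L[k, :k] @ U[:k, j]
            # Column k of L: l_ik = (a_ik - sum_{s<k} l_is u_sk) / u_kk
            for i in range(k + 1, n):
                L[i, k] = (A[i, k] - L[i, :k] @ U[:k, k]) / U[k, k]
        return L, U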
Now considering the equation to be
solved
AX = B
• However A= LU where L and U are
known
LUX = B
Forward/backward substitution
procedures to obtain a solution
• Changing the order in which the
product is formed
L(UX) = B
• Now let Y = U X
• Hence we have two systems of
simultaneous equations
LY = B
UX=Y
• Apply a forward substitution sweep to
solve for Y for the system of
equations L Y = B
• Apply a backward substitution sweep
to solve for X for the system of
equations U X = Y
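A sketch of the two substitution sweeps in Python (again with illustrative names); together with the factorization sketch above, lu_solve(*lu_doolittle(A), b) solves AX = B:

    import numpy as np

    def lu_solve(L, U, b):
        n = len(b)
        # Forward substitution: solve L y = b.
        y = np.zeros(n)
        for i in range(n):
            y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
        # Backward substitution: solve U x = y.
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
        return x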
Notes on Factorization Methods
• Procedure
• Perform the factorization by solving
for L and U
• Perform the sequential forward and
backward substitution procedures to
solve for Y and X.
• The factor method is very similar to
Gauss elimination although the order in
which the operations are carried out is
somewhat different.
• Number of operations
• O(N3) for LU decomposition (same as
triangulation for Gauss)
• O(N2) forward/backward substitution
(same as backward sweep for Gauss)
Advantages of LU factorization over
Gauss Elimination
• Can solve for any load vector B at any
time with O(N2) operations (other than
triangulation which is done only once
with O(N3) operations)
• Generally has somewhat smaller
roundoff error
Example comparing costs
If we are solving R systems of NxN
equations in which the matrix A stays the
same and only the vector B changes,
compare the overall costs for Gauss
elimination and LU factorization
• Gauss Elimination costs
• Triangulation cost = R[O(N³)]
• Back substitution cost = R[O(N²)]
• Total cost = R[O(N³) + O(N²)]
• Total cost for large N ≈ R[O(N³)]
• LU factorization costs
• LU factorization cost = O(N³)
• Back/forward substitution cost = R[O(N²)]
• Total cost = O(N³) + R[O(N²)]
• Total cost for R >> N ≈ R[O(N²)]
• Considering some typical values for R
and N
• We can also implement LU
factorization (decomposition) in
banded mode and the savings
compared to banded Gauss
elimination would be O(M) (where
M = bandwidth)
• Substituting the factored form of the
matrix and changing the order in which
products are taken
L(LTX) = B
• Let LTX = Y
• Now sequentially solve
LY = B by forward substitution
LTX = Y by backward substitution
LDLT Method:
Decompose A = LDLT
• where L is a lower triangular matrix
• where D is a diagonal matrix
• Set the diagonal terms of L to unity.
CHOLESKY’S FACTORIZATION
or the square-root method
A = LLT, or equivalently A = UTU with U = LT upper triangular. Writing out A = UTU,
[ a11 a12 a13 ]   [ u11 0   0   ] [ u11 u12 u13 ]
[ a21 a22 a23 ] = [ u12 u22 0   ] [ 0   u22 u23 ]
[ a31 a32 a33 ]   [ u13 u23 u33 ] [ 0   0   u33 ].
Comparing entries,
a11 = u11², so u11 = (a11)^(1/2),
u12 = a12/u11,  u13 = a13/u11,
a22 = u12² + u22², so u22 = (a22 − u12²)^(1/2),
a23 = u12u13 + u22u23, so u23 = (a23 − u12u13)/u22,
a33 = u13² + u23² + u33², so u33 = (a33 − u13² − u23²)^(1/2).
In general, for j > i,
uii = ( aii − Σ_{k=1}^{i−1} uki² )^(1/2),
uij = ( aij − Σ_{k=1}^{i−1} uki ukj ) / uii,
for example u45 = (a45 − u14u15 − u24u25 − u34u35)/u44.
For the worked example,
u23 = (a23 − u12u13)/u22 = (22 − 2·3)/2 = 16/2 = 8,
and thus
U = [ 1 2 3 ]
    [ 0 2 8 ]
    [ 0 0 3 ].
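The general formula above is all that is needed for code. A minimal Python sketch (name ours; no symmetry or positive-definiteness checks) which reproduces the U of the worked example when applied to A = UTU = [[1, 2, 3], [2, 8, 22], [3, 22, 82]]:

    import numpy as np

    def cholesky_upper(A):
        # Factor a symmetric positive definite A as A = U^T U.
        n = A.shape[0]
        U = np.zeros((n, n))
        for i in range(n):
            # Diagonal: u_ii = sqrt(a_ii - sum_{k<i} u_ki^2)
            U[i, i] = np.sqrt(A[i, i] - U[:i, i] @ U[:i, i])
            # Off-diagonal: u_ij = (a_ij - sum_{k<i} u_ki u_kj) / u_ii
            for j in range(i + 1, n):
                U[i, j] = (A[i, j] - U[:i, i] @ U[:i, j]) / U[i, i]
        return U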
ITERATIVE SOLUTIONS TO LINEAR ALGEBRAIC EQUATIONS
• As finer discretizations are being
applied with Finite Difference and
Finite Element codes:
• Matrices are becoming increasingly
larger
• Density of matrices is becoming
increasingly smaller
• Banded storage direct solution
algorithms no longer remain attractive
as solvers for very large systems of
simultaneous equations.
Example
• For a typical Finite Difference or
Finite Element code, the resulting
algebraic equations have between 5 and
10 nonzero entries per matrix row (i.e.
per algebraic equation associated with
each node)
[Illustration: a large sparse matrix A — each row holds only a handful of non-zero entries near the diagonal; all remaining entries are zero.]
Banded compact matrix density
• Storage required for banded compact
storage mode equals NM where N =
size of the matrix, and M = full
bandwidth
• Total nonzero entries in the matrix
assuming (a typical estimate of) 5 non-
zero entries per matrix row = 5N
• Banded compact matrix density = the
ratio of actual non-zero entries to entries
stored in banded compact mode = 5N/(NM) = 5/M.
In Jacobi's iteration each component is updated from the previous sweep's values:
x1^(k+1) = (b1 − a12 x2^(k) − a13 x3^(k)) / a11
x2^(k+1) = (b2 − a21 x1^(k) − a23 x3^(k)) / a22
x3^(k+1) = (b3 − a31 x1^(k) − a32 x2^(k)) / a33,
and, in general,
xi^(k+1) = ( bi − Σ_{j=1, j≠i}^{N} aij xj^(k) ) / aii,  1 ≤ i ≤ N, k ≥ 0.
For example, the 2 × 2 system 5x + 10y = 1, 2x + 3y = 4 is rewritten as
5x^(k+1) = 1 − 10y^(k),  3y^(k+1) = 4 − 2x^(k),
i.e.
x^(k+1) = 1/5 − 2y^(k),  y^(k+1) = 4/3 − (2/3)x^(k).
Start with the solution guess x^[0] = −1, y^[0] = −1 and start iterating on the solution.
In the Gauss-Seidel variant, each sweep uses the newest values as soon as they are available:
x2^(k+1) = (b2 − a21 x1^(k+1) − a23 x3^(k)) / a22
x3^(k+1) = (b3 − a31 x1^(k+1) − a32 x2^(k+1)) / a33.
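A minimal Python sketch of the two sweeps (names ours). For guaranteed convergence the coefficient matrix should be diagonally dominant; the sketch only shows the mechanics:

    import numpy as np

    def jacobi(A, b, x0, iters=50):
        # Jacobi: every component is updated from the previous sweep.
        x = x0.astype(float).copy()
        D = np.diag(A)
        R = A - np.diagflat(D)          # off-diagonal part of A
        for _ in range(iters):
            x = (b - R @ x) / D
        return x

    def gauss_seidel(A, b, x0, iters=50):
        # Gauss-Seidel: components are updated in place, so each sweep
        # already uses the newest values of earlier components.
        x = x0.astype(float).copy()
        n = len(b)
        for _ in range(iters):
            for i in range(n):
                s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
                x[i] = (b[i] - s) / A[i, i]
        return x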
.
Chapter 4
Module No. 1
Expanding f(x0 + h) = 0 in Taylor's series,
f(x0) + h f′(x0) + (h²/2!) f″(x0) + · · · = 0.
Since h is small, the second and higher power terms of h are neglected and
the above equation reduces to
f(x0) + h f′(x0) = 0, or, h = − f(x0)/f′(x0).
Therefore,
x1 = x0 + h = x0 − f(x0)/f′(x0). (1.1)
Similarly, the next approximation is
x2 = x1 − f(x1)/f′(x1), (1.2)
and in general,
xn+1 = xn − f(xn)/f′(xn). (1.3)
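A minimal Python sketch of this iteration (the function name and stopping rule are ours); it reproduces the root of Example 1.1 below:

    import math

    def newton(f, fprime, x0, tol=1e-10, max_iter=50):
        # Newton-Raphson: x_{n+1} = x_n - f(x_n)/f'(x_n).
        x = x0
        for _ in range(max_iter):
            x_new = x - f(x) / fprime(x)
            if abs(x_new - x) < tol:
                return x_new
            x = x_new
        return x

    root = newton(lambda x: x**3 - 2*math.sin(x) - 2,
                  lambda x: 3*x**2 - 2*math.cos(x), 1.0)
    print(round(root, 5))   # 1.58736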
Geometrical interpretation
[Figure: the tangent to y = f(x) at x0 cuts the x-axis at x1, the tangent at x1 cuts it at x2, and so on; the intersections approach the root ξ of f(x) = 0.]
The choice of initial guess x0 is a very serious task. If the initial guess is close to
the root, then the method converges rapidly. But, if the initial guess is not close
to the root, or if it is wrong, then the method may generate an endless cycle. Also, if
the initial guess is not close to the exact root, the method may generate a divergent
sequence of approximate roots.
Thus, to choose the initial guess the following rule is suggested.
Let a root of the equation f(x) = 0 lie in the interval [a, b]. If f(a) · f″(a) > 0, then
x0 = a is taken as the initial guess of the equation f(x) = 0, and if f(b) · f″(b) > 0,
then x0 = b is taken as the initial guess.
Suppose a root of the equation f (x) = 0 lies in the interval [a, b].
The Newton-Raphson iteration formula (1.3) is
xi+1 = xi − f(xi)/f′(xi) = φ(xi) (say). (1.4)
If ξ is the exact root, then it satisfies
ξ = φ(ξ). (1.5)
Let |φ′(x)| ≤ l for all x ∈ [a, b]. Then from the equation (1.6),
|ξ − xn+1| ≤ l^(n+1) |ξ − x0|.
To find the rate of convergence, let εn = xn − ξ be the error at the nth iteration, so that xn = εn + ξ. Then from (1.4),
εn+1 + ξ = εn + ξ − f(εn + ξ)/f′(εn + ξ).
That is, expanding f and f′ about ξ and using f(ξ) = 0,
εn+1 = εn − [ εn + (εn²/2) f″(ξ)/f′(ξ) + · · · ][ 1 + εn f″(ξ)/f′(ξ) + · · · ]^(−1)
     = (1/2) εn² f″(ξ)/f′(ξ) + O(εn³).
Neglecting the third and higher powers of εn, the above expression reduces to
εn+1 = Cεn², where C = f″(ξ)/(2f′(ξ)) is a constant. (1.9)
Example 1.1 Using Newton-Raphson method find a root of the equation x3 − 2 sin x −
2 = 0 correct up to five decimal places.
Solution. Let f (x) = x3 − 2 sin x − 2. One root lies between 1 and 2. Let x0 = 1 be
the initial guess.
The iteration scheme is
f (xn )
xn+1 = xn −
f ′ (xn )
= xn − (xn³ − 2 sin xn − 2)/(3xn² − 2 cos xn).
n xn xn+1
0 1.000000 2.397806
1 2.397806 1.840550
2 1.840550 1.624820
3 1.624820 1.588385
4 1.588385 1.587366
5 1.587366 1.587365
Therefore, one root of the given equation is 1.58736 correct up to five decimal places.
Example 1.2 Find an iteration scheme to find the kth root of a number a and hence
find the cube root of 2.
Solution. The kth root of a is a root of f(x) = x^k − a = 0, with f′(x) = kx^(k−1). The iteration scheme is
xn+1 = xn − f(xn)/f′(xn) = xn − (xn^k − a)/(k xn^(k−1)) = (k xn^k − xn^k + a)/(k xn^(k−1))
     = (1/k) [ (k − 1)xn + a/xn^(k−1) ].
For the cube root of 2, k = 3 and a = 2. Starting from x0 = 1, the successive iterations are
n xn xn+1
0 1.00000 1.33333
1 1.33333 1.26389
2 1.26389 1.25993
3 1.25993 1.25992
Thus, the value of the cube root of 2 is 1.2599, correct up to four decimal places.
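The same scheme takes only a few lines of Python (names ours), reproducing the table above:

    def kth_root_step(x, a, k):
        # One Newton step for f(x) = x^k - a, in the rearranged form
        # x_{n+1} = ((k - 1) x + a / x^(k-1)) / k.
        return ((k - 1) * x + a / x**(k - 1)) / k

    x = 1.0
    for _ in range(5):
        x = kth_root_step(x, a=2, k=3)
    print(round(x, 4))   # 1.2599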
Example 1.3 Suppose 2xn+1 = (5xn³ + 1)/9 is an iteration scheme to find a root of the
equation f(x) = 0. Find the function f(x).
Solution. The given scheme is
2xn+1 = (5xn³ + 1)/9.
If the scheme converges, then lim(n→∞) xn = l, a root of f(x) = 0.
Now, lim(n→∞) 18xn+1 = 5 lim(n→∞) xn³ + 1.
That is, 18l = 5l³ + 1, or 5l³ − 18l + 1 = 0.
Therefore, the required equation is 5x³ − 18x + 1 = 0, and hence f(x) = 5x³ − 18x + 1.
Example 1.4 Discuss the Newton-Raphson method to find a root of the equation x15 −
1 = 0 starting with x0 = 0.5.
Solution. It is obvious that the only real root of the given equation is 1.
Here f (x) = x15 − 1.
Therefore,
xn+1 = xn − (xn^15 − 1)/(15xn^14) = (14xn^15 + 1)/(15xn^14).
Let the initial guess be x0 = 0.5. Then
x1 = (14 × (0.5)^15 + 1)/(15 × (0.5)^14) = 1092.7333. This is
far away from the root 1. This is because 0.5 is not close enough to the exact root
x = 1.
But, the initial guess x0 = 0.9 gives the first approximate root as x1 = 1.131416 and
it is close to the root 1.
This example shows the importance of initial guess in Newton-Raphson method.
The Newton-Raphson method may also be used to find the complex root. This is
illustrated in the following example.
Example 1.5 Using the Newton-Raphson method, find a complex root of the equation z³ + 3z² + 3z + 2 = 0.
Solution. Let z0 = 0.5 + 0.5i = (0.5, 0.5) be the initial guess and f(z) = z³ + 3z² + 3z + 2.
Then f ′ (z) = 3z 2 + 6z + 3. The Newton-Raphson iteration scheme is
f (zn )
zn+1 = zn − .
f ′ (zn )
The values of zn and zn+1 at each iteration are tabulated below:
n zn zn+1
0 ( 0.50000000, 0.50000000) (–0.10666668, 0.41333333)
1 (–0.10666668, 0.41333333) (–0.62715298, 0.53778100)
2 (–0.62715298, 0.53778100) (–0.47841841, 1.0874815)
3 (–0.47841841, 1.0874815) (–0.50884020, 0.90368903)
4 (–0.50884020, 0.90368903) (–0.50117314, 0.86686337)
5 (–0.50117314, 0.86686337) (–0.50000149, 0.86602378)
6 (–0.50000149, 0.86602378) (–0.49999994, 0.86602539)
7 (–0.49999994, 0.86602539) (–0.49999994, 0.86602539)
Using the Newton-Raphson method, one can determine a multiple root of the equation
f(x) = 0. But, the following modified formula
xn+1 = xn − k f(xn)/f′(xn) (1.10)
gives a faster convergent scheme, where k is the multiplicity of the root. The term
(1/k) f′(xn) in the formula is the slope of the straight line passing through the point (xn, f(xn))
and intersecting the x-axis at the point (xn+1, 0).
Let ξ be a root of the equation f(x) = 0 with multiplicity k. Then ξ is also a root
of the equation f′(x) = 0 with multiplicity (k − 1). In general, ξ is a root of the
equation f^(p)(x) = 0 with multiplicity (k − p), p < k. If the equation f(x) = 0 has a
root with multiplicity k and if the initial guess is very close to the exact root ξ, then
the expressions
εn+1 = εn − k f(εn + ξ)/f′(εn + ξ)
become, on expanding f and f′ in Taylor series about ξ and using f(ξ) = f′(ξ) = · · · = f^(k−1)(ξ) = 0,
εn+1 = εn − k · [ (εn^k/k!) f^(k)(ξ) + (εn^(k+1)/(k+1)!) f^(k+1)(ξ) + · · · ]
            / [ (εn^(k−1)/(k−1)!) f^(k)(ξ) + (εn^k/k!) f^(k+1)(ξ) + · · · ].
Let C = [1/(k(k+1))] · f^(k+1)(ξ)/f^(k)(ξ). Neglecting cube and higher order terms of εn, the above
equation becomes εn+1 = Cεn².
Thus, the rate of convergence of the scheme (1.10) is quadratic.
Example 1.6 Find the multiple root with multiplicity 3 of the equation x4 − x3 − 3x2 +
5x − 2 = 0.
Solution. Let the initial guess be x0 = 0.5. Also, let f (x) = x4 − x3 − 3x2 + 5x − 2.
f ′ (x) = 4x3 − 3x2 − 6x + 5, f ′′ (x) = 12x2 − 6x − 6, f ′′′ (x) = 24x − 6.
The first iterated values are
x1 = x0 − 3 f(x0)/f′(x0) = 0.5 − 3 f(0.5)/f′(0.5) = 1.035714,
x1 = x0 − 2 f′(x0)/f″(x0) = 0.5 − 2 f′(0.5)/f″(0.5) = 1.083333, and
x1 = x0 − f″(x0)/f‴(x0) = 0.5 − f″(0.5)/f‴(0.5) = 1.5.
The first two values of x1 are close to 1. This indicates that the equation may have a
multiple root near 1.
Let x1 = 1.035714.
Then x2 = x1 − 3 f(x1)/f′(x1) = 1.035714 − 3 f(1.035714)/f′(1.035714) = 1.000139,
x2 = x1 − 2 f′(x1)/f″(x1) = 1.035714 − 2 f′(1.035714)/f″(1.035714) = 1.000277, and
x2 = x1 − f″(x1)/f‴(x1) = 1.000277 − f″(1.000277)/f‴(1.000277) = 1.000812.
Here it is seen that the three values of x2 are very close to 1. So the equation has a
multiple root near 1 of multiplicity 3.
Let x2 = 1.000139.
The third iterated values are
x3 = x2 − 3 f(x2)/f′(x2) = 1.000000,  x3 = x2 − 2 f′(x2)/f″(x2) = 1.000000, and
x3 = x2 − f″(x2)/f‴(x2) = 1.000000.
All the values of x3 are the same, and hence one root of the equation is 1.000000 correct
up to six decimal places, with multiplicity 3.
Note that in the Newton-Raphson method the derivative of the function f(x) is
evaluated at each iteration. That is, to find xn+1, the value of f′(xn) is required for
n = 0, 1, 2, . . .. Therefore, at each iteration two functions are evaluated at the point
xn, n = 0, 1, 2, . . ., and a separate method is required to find derivatives. Thus, in
each iteration of this method more calculations are needed. But, the following proposed
method can reduce the computational effort:
xn+1 = xn − f(xn)/f′(x0). (1.11)
In this method, the derivative of f(x) is calculated only at the initial guess x0, and
obviously this reduces the computation time of each iteration. But, the rate of convergence
of this method reduces to 1. This is proved in the following theorem.
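A sketch of the modified scheme (1.11) in Python (names ours); applied to Example 1.7 below it reproduces the root 0.6527:

    def newton_fixed_slope(f, fprime_x0, x0, iters=10):
        # Modified Newton: the derivative is evaluated once, at x0,
        # and reused in every iteration (linear convergence).
        x = x0
        for _ in range(iters):
            x = x - f(x) / fprime_x0
        return x

    f = lambda x: x**3 - 3*x**2 + 1          # Example 1.7 below
    print(round(newton_fixed_slope(f, -2.25, 0.5), 4))   # 0.6527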
Theorem 1.3 The rate of convergence of the modified Newton-Raphson method (1.11)
is linear.
Proof. Let ξ be an exact root of the equation f(x) = 0 and xn be the approximate
root at the nth iteration. Then f(ξ) = 0. Let εn = xn − ξ be the error at the nth iteration.
Now, from the formula (1.11), we have
εn+1 + ξ = εn + ξ − f(εn + ξ)/f′(x0).
Expanding f(εn + ξ) about ξ and neglecting square and higher power terms of εn, the above equation reduces to
εn+1 = εn [ 1 − f′(ξ)/f′(x0) ].
Let C = 1 − f′(ξ)/f′(x0), which is free from εn. Using this notation, the above error equation becomes
εn+1 = Cεn.
This shows that the rate of convergence of the formula (1.11) is linear.
Example 1.7 Find a root of the equation x3 − 3x2 + 1 = 0 using modified Newton-
Raphson formula (1.11) and Newton-Raphson method correct up to four decimal places.
Solution. Let f (x) = x3 − 3x2 + 1. One root of this equation lies between 0 and 1.
Let the initial guess be x0 = 0.5. Now, f ′ (x) = 3x2 − 6x and hence f ′ (x0 ) = −2.25.
The iteration scheme for the formula (1.11) is
xn+1 = xn − f(xn)/f′(x0) = xn − (xn³ − 3xn² + 1)/(−2.25) = (xn³ − 3xn² + 2.25xn + 1)/2.25.
All the approximate roots are calculated in the following table.
n xn xn+1
0 0.50000 0.66667
1 0.66667 0.65021
2 0.65021 0.65313
3 0.65313 0.65263
4 0.65263 0.65272
5 0.65272 0.65270
Therefore, 0.6527 is a root of the given equation correct up to four decimal places.
By Newton-Raphson method
The iteration scheme for Newton-Raphson method is
xn+1 = xn − f(xn)/f′(xn) = xn − (xn³ − 3xn² + 1)/(3xn² − 6xn) = (2xn³ − 3xn² − 1)/(3xn² − 6xn).
Let x0 = 0.5. The successive iterations are shown below.
n xn xn+1
0 0.50000 0.66667
1 0.66667 0.65278
2 0.65278 0.65270
3 0.65270 0.65270
.
Chapter 4
Module No. 2
Let A = max{|a1|, |a2|, . . . , |an|} and B = max{|a0|, |a1|, . . . , |an−1|}. Then the magnitude of a
root of the equation (2.2) lies between 1/(1 + B/|an|) and 1 + A/|a0|.
The other methods are also available to find the upper bound of the positive roots of
the polynomial equation. Two such results are stated below:
Theorem 2.1 (Lagrange’s). If the coefficients of the polynomial
a0 xn + a1 xn−1 + · · · + an−1 x + an = 0
satisfy the conditions a0 > 0, a1, a2, . . . , am−1 ≥ 0, am < 0, for some m ≤ n, then the
upper bound of the positive roots of the equation is 1 + (B/a0)^(1/m), where B is the greatest
of the absolute values of the negative coefficients of the polynomial.
Theorem 2.2 (Newton's). If for x = c the polynomial f(x) and its derivatives f′(x), f″(x), . . . assume positive values, then c is the upper bound of
the positive roots of the equation f(x) = 0.
In the following sections, two iteration methods, viz. Birge-Vieta and Bairstow meth-
ods are discussed to find all the roots of a polynomial equation of degree n.
Assume that Qn−1 (x) and R be the quotient and remainder when Pn (x) is divided
by the factor (x − ξ). Here, Qn−1 (x) is a polynomial of degree (n − 1), so it can be
written as
Thus,
Pn (x) = (x − ξ)Qn−1 (x) + R. (2.5)
If ξ is an exact root of the equation Pn (x) = 0, then R must be zero. Thus, the value
of R depends on the accuracy of ξ. The Newton-Raphson method or any other method
may be used to find the value of ξ, starting from an initial guess x0, such that
xk+1 = xk − Pn(xk)/P′n(xk), k = 0, 1, 2, . . . . (2.7)
This method determines the approximate value of ξ, so for this ξ, R is not exactly 0,
but it is a small number.
Since Pn(x) is a polynomial, it is differentiable everywhere. Also, the values of
Pn(xk) and P′n(xk) can be determined by synthetic division or any other method. To
find the polynomial Qn−1(x) and R, compare the coefficients of like powers of x on
both sides of the equation (2.5). Thus we get the following equations:
a1 = b1 − ξ,  i.e.  b1 = a1 + ξ
a2 = b2 − ξb1,  i.e.  b2 = a2 + ξb1
· · ·
ak = bk − ξbk−1,  i.e.  bk = ak + ξbk−1
· · ·
an = R − ξbn−1,  i.e.  R = an + ξbn−1.
Thus, the b's are computed by the recurrence
bk = ak + ξbk−1, k = 1, 2, . . . , n, with b0 = 1 and bn = R. (2.9)
That is, Pn(ξ) = R = bn.
Again, differentiating (2.5), P′n(x) = Qn−1(x) + (x − ξ)Q′n−1(x), so that P′n(ξ) = Qn−1(ξ).
Thus, the evaluation of P′n(x) is the same as that of Pn(x). Differentiating (2.9) with respect to
ξ, we get
dbk/dξ = bk−1 + ξ dbk−1/dξ. (2.12)
We denote
dbk/dξ = ck−1. (2.13)
Then
ck = bk + ξck−1, k = 1, 2, . . . , n − 1. (2.14)
Therefore,
P′n(ξ) = dR/dξ = dbn/dξ = cn−1 [using (2.13)],
and the iteration scheme (2.7) becomes
xk+1 = xk − bn/cn−1, k = 0, 1, 2, . . . . (2.15)
x0 | 1  a1     a2     · · ·  an−2     an−1     an
   |    x0     x0 b1  · · ·  x0 bn−3  x0 bn−2  x0 bn−1
   | 1  b1     b2     · · ·  bn−2     bn−1     bn = R
   |    x0     x0 c1  · · ·  x0 cn−3  x0 cn−2
   | 1  c1     c2     · · ·  cn−2     cn−1 = P′n(x0)
Table 2.1: Tabular form of b’s and c’s.
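Table 2.1 is exactly the data flow of a small program. A Python sketch of one Birge-Vieta step for a monic polynomial (names ours); with the data of Example 2.1 below it reproduces x1 = −0.340909:

    def birge_vieta_step(a, x0):
        # a = [a1, ..., an] for P(x) = x^n + a1 x^(n-1) + ... + an.
        b, c = 1.0, 1.0                  # b0 = c0 = 1
        for i, ai in enumerate(a):
            b_new = ai + x0 * b          # b_k = a_k + x0 b_{k-1}
            if i < len(a) - 1:
                c = b_new + x0 * c       # c_k = b_k + x0 c_{k-1}
            b = b_new
        return x0 - b / c                # b = P(x0), c = c_{n-1} = P'(x0)

    print(round(birge_vieta_step([1, -8, -11, -3], -0.5), 6))   # -0.340909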
Example 2.1 Find all the roots of the polynomial equation x4 +x3 −8x2 −11x−3 = 0.
Solution. Let P4 (x) = x4 + x3 − 8x2 − 11x − 3 be the given polynomial. Also, let the
initial guess be x0 = −0.5.
First iteration for first root
–0.5 1 1 –8 –11 –3
–0.500000 –0.250000 4.125000 3.437500
–0.5 1 0.500000 –8.250000 –6.875000 0.437500 =b4 = P4 (x0 )
–0.500000 –0.000000 4.125000
1 0.000000 –8.250000 –2.750000=c3 = P40 (x0 )
Therefore,
x1 = x0 − b4/c3 = −0.500000 − 0.437500/(−2.750000) = −0.340909.
This is the first iterated value.
–0.340909 1 1 –8 –11 –3
–0.340909 –0.224690 2.803872 2.794135
–0.340909 1 0.659091 –8.224690 –8.196128 –0.205865=b4
–0.340909 –0.108471 2.840850
1 0.318182 –8.333161 –5.355278=c3
–0.379351 1 1 –8 –11 –3
–0.379351 –0.235444 3.124121 2.987720
–0.379351 1 0.620649 –8.235444 –7.875879 –0.012280=b4
–0.379351 –0.091537 3.158846
1 0.241299 –8.326981 –4.717033=c3
Therefore, x3 = x2 − b4/c3 = −0.379351 − (−0.012280)/(−4.717033) = −0.381954.
–0.381954 1 1 –8 –11 –3
–0.381954 –0.236065 3.145798 2.999944
–0.381954 1 0.618046 –8.236065 –7.854202 –0.000056=b4
–0.381954 –0.090176 3.180241
1 0.236092 –8.326241 –4.673960=c3
Then x4 = x3 − b4/c3 = −0.381954 − (−0.000056)/(−4.673960) = −0.381966.
Therefore,
x1 = x0 − b3/c2 = 1.000000 − (−14.472139)/(−4.000010) = −2.618026.
Second iteration for second root
x2 − 2.00000x − 3.00003 = 0.
Note 2.1 The Birge-Vieta method is used to find all real roots of a polynomial equation.
But, the current form of this method is not applicable to find the complex roots. After
modification, this method may be used to find all roots (real or complex) of a polynomial
equation. Since the method is based on the Newton-Raphson method, its rate of convergence
is quadratic, the same as that of the Newton-Raphson method.
This method is also an iterative method. In this method, a quadratic factor is ex-
tracted from the polynomial Pn (x) by iteration. As a by product the deflated polynomial
(the polynomial obtained by dividing Pn (x) by the quadratic factor) is also obtained. It
is well known that the determination of roots (real or complex) of a quadratic equation
is easy. Therefore, by extracting all quadratic factors one can determine all the roots of
a polynomial equation. This is the basic principle of Bairstow method.
Let the polynomial Pn (x) of degree n be
These are two non-linear equations in p and q and these equations can be solved by
Newton-Raphson method for two variables (discussed in Module 3 of this chapter).
Let (pT , qT ) be the exact values of p and q and ∆p, ∆q be the (errors) corrections to
p and q. Therefore,
pT = p + ∆p and qT = q + ∆q.
Hence,
a1 = b1 + p,  i.e.  b1 = a1 − p
a2 = b2 + pb1 + q,  i.e.  b2 = a2 − pb1 − q
· · ·
ak = bk + pbk−1 + qbk−2,  i.e.  bk = ak − pbk−1 − qbk−2 (2.24)
· · ·
an−1 = M + pbn−2 + qbn−3,  i.e.  M = an−1 − pbn−2 − qbn−3
an = N + qbn−2,  i.e.  N = an − qbn−2.
In general,
bk = ak − pbk−1 − qbk−2, k = 1, 2, . . . , n. (2.25)
Note that M and N depend on the b's. Differentiate the equation (2.25) with respect
to p and q to find the partial derivatives of M and N.
∂bk/∂p = −bk−1 − p ∂bk−1/∂p − q ∂bk−2/∂p,  ∂b0/∂p = ∂b−1/∂p = 0 (2.27)
∂bk/∂q = −bk−2 − p ∂bk−1/∂q − q ∂bk−2/∂q,  ∂b0/∂q = ∂b−1/∂q = 0. (2.28)
For simplification, we denote
∂bk/∂p = −ck−1, k = 1, 2, . . . , n (2.29)
and ∂bk/∂q = −ck−2. (2.30)
With this notation, the equation (2.27) simplifies as
ck−1 = bk−1 − pck−2 − qck−3 . (2.31)
Therefore,
Mp = ∂bn−1/∂p = −cn−2
Np = ∂bn/∂p + p ∂bn−1/∂p + bn−1 = bn−1 − cn−1 − pcn−2
Mq = ∂bn−1/∂q = −cn−3
Nq = ∂bn/∂q + p ∂bn−1/∂q = −(cn−2 + pcn−3).
From the equation (2.22), the explicit expressions for ∆p and ∆q, are obtained as
follows:
∆p = − [ bn cn−3 − bn−1 cn−2 ] / [ c²n−2 − cn−3(cn−1 − bn−1) ]
∆q = − [ bn−1(cn−1 − bn−1) − bn cn−2 ] / [ c²n−2 − cn−3(cn−1 − bn−1) ]. (2.34)
Therefore, the improved values of p and q are p + ∆p and q + ∆q. Thus, if p0, q0 are
the initial guesses of p and q, then the first approximate values of p and q are
p1 = p0 + ∆p, q1 = q0 + ∆q.
Table 2.2 is helpful to calculate the values of the bk's and ck's, where p0 and q0 are taken
as the initial values of p and q.
The second approximate values p2 , q2 of p and q are determined from the equations:
p2 = p1 + ∆p, q2 = q1 + ∆q.
In general,
pk+1 = pk + ∆p, qk+1 = qk + ∆q, (2.36)
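A Python sketch of one Bairstow update (names ours; it assumes a monic polynomial, and the starting values p0, q0 must be supplied):

    def bairstow_step(a, p, q):
        # a = [a1, ..., an] for P(x) = x^n + a1 x^(n-1) + ... + an,
        # trial quadratic factor x^2 + p x + q.
        b = [1.0]                            # b0 = 1
        for k, ak in enumerate(a):           # b_k = a_k - p b_{k-1} - q b_{k-2}
            b.append(ak - p * b[-1] - (q * b[-2] if k >= 1 else 0.0))
        c = [1.0]                            # c_k = b_k - p c_{k-1} - q c_{k-2}
        for k in range(1, len(b) - 1):
            c.append(b[k] - p * c[-1] - (q * c[-2] if k >= 2 else 0.0))
        bn, bn1 = b[-1], b[-2]
        cn1, cn2, cn3 = c[-1], c[-2], c[-3]
        den = cn2**2 - cn3 * (cn1 - bn1)     # denominator of (2.34)
        dp = -(bn * cn3 - bn1 * cn2) / den
        dq = -(bn1 * (cn1 - bn1) - bn * cn2) / den
        return p + dp, q + dq

Each call implements one pass of (2.36); iterate until ∆p and ∆q fall below the required tolerance.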
Example 2.2 Extract all the quadratic factors from the equation x4 + 2x3 + 3x2 + 4x +
1 = 0 by using Bairstow method and hence solve this equation.
First iteration
∆p = − [ b4c1 − b3c2 ] / [ c2² − c1(c3 − b3) ] = 1.978261,  ∆q = − [ b3(c3 − b3) − b4c2 ] / [ c2² − c1(c3 − b3) ] = 0.891304.
Second iteration
∆p = 0.52695, ∆q = −0.29857.
p2 = p1 + ∆p = 1.90790, q2 = q1 + ∆q = −2.86999.
Third iteration
∆p = −0.479568, ∆q = −0.652031.
p3 = p2 + ∆p = 1.998693, q3 = q2 + ∆q = 0.739273.
Fourth iteration
∆p = −0.187110, ∆q = −0.258799.
p4 = p3 + ∆p = 1.811583, q4 = q3 + ∆q = 0.480474.
Fifth iteration
∆p = −0.015050, ∆q = −0.020515.
p5 = p4 + ∆p = 1.796533, q5 = q4 + ∆q = 0.459960.
Sixth iteration
∆p = −0.000062, ∆q = −0.000081.
p6 = p5 + ∆p = 1.796471, q6 = q5 + ∆q = 0.459879.
Note that, ∆p and ∆q are correct up to four decimal places. Thus p = 1.7965, q =
0.4599 correct up to four decimal places.
Therefore, a quadratic factor is x2 + 1.7965x + 0.4599 and the deflated polynomial is
Q2 (x) = P4 (x)/(x2 + 1.7965x + 0.4599) = x2 + 0.2035x + 2.1745.
Thus, P4 (x) = (x2 + 1.7965x + 0.4599)(x2 + 0.2035x + 2.1745).
Hence, the roots of the given equation are
−0.309212, −1.487258, (−0.1018, 1.4711), (−0.1018, −1.4711).
.
Chapter 5
Module No. 1
Systems of linear and non-linear equations occur in many applications. To solve
a system of linear equations many direct and iterative methods have been developed. The oldest
and simplest methods are Cramer's rule and the matrix inverse method. But, these meth-
ods depend on evaluation of a determinant and computation of the inverse of the coefficient
matrix. A few methods are available to evaluate a determinant; among them the pivoting
method is most efficient and applicable for all types of determinants. In this module,
the pivoting method is discussed to evaluate a determinant and the inverse of the coefficient
matrix. Then, the matrix inverse method is described to solve a system of linear equations.
Other direct and iteration methods are discussed in the next modules.
A system of m linear equations with n variables is given by
The quantities x1 , x2 , . . ., xn are the unknowns (variables) of the system and a11 ,
a12 , . . ., amn are called the coefficients and generally they are known. The numbers
b1 , b2 , . . . , bm are constant or free terms of the system.
The above system of equations (1.1) can be written as a single equation:
Σ_{j=1}^{n} aij xj = bi,  i = 1, 2, . . . , m. (1.2)
Also, the entire system of equations (1.1) can be written with the help of matrices as
AX = b, (1.3)
where
A = [ a11 a12 · · · a1n ]
    [ a21 a22 · · · a2n ]
    [ · · ·             ]
    [ am1 am2 · · · amn ],  b = (b1, b2, . . . , bm)^T  and  X = (x1, x2, . . . , xn)^T. (1.4)
A system of linear equations may or may not have a solution. If the system of
linear equations (1.1) has a solution then the system is called consistent otherwise it is
called inconsistent or incompatible. Again, a consistent system of linear equations
may have a unique solution or multiple solutions. Finding a unique solution is easy, but
determining multiple solutions, if they exist, is a complicated problem.
To solve a system of linear equations, usually three types of elementary transfor-
mations are applied. These are discussed below.
Interchange: The order of two equations can be changed.
Scaling: Multiplication of both sides of an equation by any non-zero number.
Replacement: Addition to (subtraction from) both sides of one equation of the cor-
responding sides of another equation multiplied by any number.
If for a system, all the constant terms b1 , b2 , . . . , bm are zero, then the system is called
homogeneous system otherwise it is called the non-homogeneous system.
Two types of methods are available to solve a system of linear equations, viz. direct
methods and iteration methods.
Again, many direct methods are used to solve a system of equations, among them
Cramer’s rule, matrix inversion, Gauss elimination, matrix factorization, etc. are well
known.
Also, the most widely used iteration methods are Jacobi's iteration, Gauss-Seidel's itera-
tion, etc.
In many applications, we have to determine the value of a determinant. So an efficient
method is required for this purpose. One efficient method based on pivoting is discussed
in the following section.
Let D be a determinant of order n given by
D = | a11 a12 · · · a1n |
    | a21 a22 · · · a2n |
    | · · ·             |
    | an1 an2 · · · ann |.
Using the elementary row operations, D can be reduced to the following upper tri-
angular form:
D′ = | a11 a12     a13     · · · a1n        |
     | 0   a22^(1) a23^(1) · · · a2n^(1)    |
     | 0   0       a33^(2) · · · a3n^(2)    |
     | · · ·                                |
     | 0   0       0       · · · ann^(n−1)  |.
To convert to this form a lot of elementary operations are required. To convert all
the elements of the first column, except the first element, to 0 the following elementary
operations are used:
aij^(1) = aij − (ai1/a11) a1j, for i, j = 2, 3, . . . , n.
Similarly, to convert all the elements of the second column below the second element
to 0, the following operations are used:
aij^(2) = aij^(1) − (ai2^(1)/a22^(1)) a2j^(1), for i, j = 3, 4, . . . , n.
All these elementary operations can be written as
aij^(k) = aij^(k−1) − (aik^(k−1)/akk^(k−1)) akj^(k−1); (1.5)
i, j = k + 1, . . . , n; k = 1, 2, . . . , n − 1 and aij^(0) = aij, i, j = 1, 2, · · · , n.
Once D′ is available, the value of D is given by the product of the diagonal elements,
D = a11 a22^(1) a33^(2) · · · ann^(n−1).
It is observed that the formula for the elementary operations is simple and easy to
program. The time taken by this method is O(n³). But, there is a serious drawback
of this formula, which is discussed below.
To compute the value of aij^(k), one division is required. If akk^(k−1) is zero,
then the method fails, and if akk^(k−1) is very small, then there is a chance of losing significant
digits or of data overflow. To avoid this situation the pivoting techniques are used.
A pivot is the largest magnitude element in a row or in a column or in the principal
diagonal or the leading or trailing sub-matrix of order i (2 ≤ i ≤ n).
Let us consider the following matrix to illustrate these terms:
A = [ 0  1   0  −5 ]
    [ 1 −8   3  10 ]
    [ 9  3 −33  18 ]
    [ 4 −40  9  11 ].
For this matrix, 9 is the pivot for the first column, −33 is the pivot for the principal
diagonal, −40 is the pivot for the entire matrix and −8 is the pivot for the trailing
sub-matrix
[ 0  1 ]
[ 1 −8 ].
If any one of the column pivot elements (during the elementary operations) is zero or very
small relative to the other elements in that row, then we rearrange the remaining rows in
such a way that the pivot becomes non-zero, or at least not a very small number. The method
is called pivoting. The pivoting methods are of two types, viz. partial pivoting and
complete pivoting; these are discussed below.
In partial pivoting method, the pivot is the largest magnitude element in a column.
In the first stage, find the first pivot which is the largest element in magnitude among
the elements of first column. If it is a11 , then there is nothing to do. If it is ai1 , then
interchange rows i and 1. Then apply the elementary row operations to make all the
elements of first column, except first element, to 0. In the next stage, the second pivot
is determined by finding the largest element in magnitude among the elements of second
column leaving first element and let it be aj2 . In this case, interchange second and jth
rows and then apply elementary row operations. This process continues (n − 1)
times. In general, at the ith stage, the smallest index j ≥ i is chosen for which
|aji^(i−1)| = max{|aki^(i−1)|, k = i, i + 1, . . . , n},
and then the ith and jth rows are interchanged.
In partial pivoting, the pivot is chosen from column. But, in complete pivoting the pivot
element is the largest element (in magnitude) among all the elements of the determinant.
Let it be at the (l, m)th position for first time.
Thus, alm is the first pivot. Then interchange the first row with the lth row, and the
first column with the mth column. In the second stage, the largest element (in magnitude) is
determined among all elements leaving the first row and first column. This element is
the second pivot.
In this manner, at the kth stage, we choose l and m such that
(k) (k)
|alm | = max{|aij |, i, j = k, k + 1, . . . , n}.
Then interchange the rows k, l and columns k, m. In this case, akk is the kth pivot
element.
It is obvious that the complete pivoting is more complicated than the partial pivot-
ing. Partial pivoting is easy to program. Generally, partial pivoting is used for hand
calculation.
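The partial pivoting procedure for determinants is easy to program. A minimal Python sketch (name ours) which, applied to the matrix of Example 1.1 below, returns −56:

    import numpy as np

    def det_partial_pivot(A):
        # Reduce to upper triangular form with partial pivoting;
        # each row interchange flips the sign of the determinant.
        A = A.astype(float).copy()
        n = A.shape[0]
        sign = 1.0
        for k in range(n - 1):
            p = k + int(np.argmax(np.abs(A[k:, k])))
            if p != k:
                A[[k, p]] = A[[p, k]]
                sign = -sign
            if A[k, k] == 0.0:
                return 0.0               # column of zeros: determinant is 0
            for i in range(k + 1, n):
                A[i, k:] -= (A[i, k] / A[k, k]) * A[k, k:]
        return sign * float(np.prod(np.diag(A)))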
We have mentioned earlier that pivoting is used to find the value of all kinds of
determinants. To determine the pivot and to interchange the rows and/or columns some
additional time is required. But, for some types of determinants one can determine
the value without pivoting. Such types of determinants are stated below.
Example 1.1 Convert the determinant
A = | 1   0  3 |
    | −2  7  1 |
    | 5  −1  6 |
into the upper triangular form using (i) partial pivoting, and (ii) complete pivoting and
hence determine the value of A.
Solution. (i) (Partial pivoting) The largest element in the first column is 5, present
in the third row and it is the first pivot of A. Therefore, first and third rows are
interchanged and the reduced determinant is
5 −1 6
−2 7 1 .
1 0 3
Since two rows are interchanged, the value of the determinant is to be multiplied
by −1. To keep track of this a variable sign is used; in this case its value is sign = −1.
Now, we apply the elementary row operations to convert all elements of the first column,
except the first, to 0.
Adding 2/5 times the first row to the second row and −1/5 times the first row to the third
row, i.e. R′2 = R2 + (2/5)R1 and R′3 = R3 − (1/5)R1. (R2 and R′2 represent the original second
row and the modified second row respectively.)
The reduced determinant is
5 −1 6
0 33/5 17/5 .
0 1/5 9/5
Now, we determine the second pivot element. In this case, the pivot element is at the
(2, 2)th position, therefore no interchange is required.
Adding (−1/5)/(33/5) = −1/33 times the second row to the third row, i.e. R′3 = R3 − (1/33)R2.
The reduced determinant is
5 −1 6
0 33/5 17/5 .
0 0 56/33
Note that this is an upper triangular determinant and hence its value is sign ×
(5)(33/5)(56/33) = −56.
(ii) (Complete pivoting) The largest element in A is 7, at position (2,2). Interchanging
first and second columns and assigning sign = −1, and then interchanging first and second
rows and setting sign = −sign = 1, the updated determinant is
| 7 −2 1 |
| 0  1 3 |
| −1 5 6 |.
Adding 1/7 times the first row to the third row, i.e. using the formula R′3 = R3 + (1/7)R1,
the reduced determinant is
| 7 −2    1    |
| 0  1    3    |
| 0  33/7 43/7 |.
Now, we determine the second pivot element from the sub-matrix obtained by deleting the
first row and column, that is, from the trailing sub-matrix
[ 1     3    ]
[ 33/7  43/7 ].
The second pivot is 43/7, at position (3,3). Interchange the second and third columns,
setting sign = −sign = −1, and then interchange the second and third rows, so that
the modified determinant is
| 7  1    −2   |
| 0  43/7 33/7 |
| 0  3     1   |
and sign = 1.
Now, we apply the row operation R′3 = R3 − (21/43)R2 and obtain the required upper
triangular determinant
| 7  1     −2    |
| 0  43/7  33/7  |
| 0  0   −56/43  |.
Hence the value is sign × (7)(43/7)(−56/43) = −56, as before.
In the pivoting method, the symmetry or regularity of the original matrix may be lost. It
is easily observed that partial pivoting requires less time, as it needs fewer interchanges
than complete pivoting. Again, partial pivoting needs fewer comparisons to get the
pivot element. A combination of partial and complete pivoting is expected to be very
effective, not only for computing a determinant but also for solving a system of linear
equations. Pivoting prevents the loss of significant digits.
Let A be a non-singular square matrix and there exists a matrix B such that AB = I.
Then B is called the inverse of A and vice-versa. The inverse of a matrix is denoted by
A−1 . Now, using some theories of matrices it can be shown that the inverse of a matrix
A is given by
adj A
A−1 = . (1.7)
|A|
In this method, the matrix A is augmented with a unit matrix of the same size, and only
elementary row operations are applied to get the inverse of the matrix. Let the order
of the matrix A be n × n and let it be augmented with the unit matrix I. This augmented
matrix is denoted by [A|I]; its order is n × 2n. The augmented matrix is of the following form:
[A|I] = [ a11 a12 · · · a1n | 1 0 · · · 0 ]
        [ a21 a22 · · · a2n | 0 1 · · · 0 ]
        [ · · ·                           ]
        [ an1 an2 · · · ann | 0 0 · · · 1 ]. (1.8)
Now, the inverse of A is calculated in two phases. In the first phase, the first half of the
augmented matrix is converted into an upper triangular matrix by using only elementary
row operations. In the second phase, this upper triangular matrix is converted to an
identity matrix by using only row operations. All these operations are applied on the
augmented matrix [A|I].
After the second phase, the augmented matrix [A|I] is transformed to [I|A−1]. Thus, the
right half becomes the inverse of A. Symbolically, we can write
[A|I] −−(Gauss-Jordan)−→ [I|A−1].
Example 1.2 Use the partial pivoting method to find the inverse of the following matrix:
A = [ 2   0  1 ]
    [ −1  3  4 ]
    [ 4  −2  0 ].
Solution. The augmented matrix [A|I] is
[A|I] = [ 2   0  1 | 1 0 0 ]
        [ −1  3  4 | 0 1 0 ]
        [ 4  −2  0 | 0 0 1 ].
Phase 1 (reduction to upper triangular form):
In the first column 4 is the largest element, so it is the first pivot. We interchange
first and third rows to place the pivot element 4 at the (1,1) position. Then, the above
matrix becomes
[ 4  −2  0 | 0 0 1 ]
[ −1  3  4 | 0 1 0 ]
[ 2   0  1 | 1 0 0 ]
∼ [ 1  −1/2  0 | 0 0 1/4 ]   R′1 = (1/4)R1
  [ −1  3    4 | 0 1 0   ]
  [ 2   0    1 | 1 0 0   ]
∼ [ 1  −1/2  0 | 0 0 1/4  ]   R′2 = R2 + R1;  R′3 = R3 − 2R1
  [ 0   5/2  4 | 0 1 1/4  ]
  [ 0   1    1 | 1 0 −1/2 ].
All the elements of the first column, except the first, become 0. Now, we convert the element
at position (3,2) to 0. For this purpose, we find the largest element (in magnitude) in
the second column leaving the first element, and it is 5/2. Fortunately, it is at the (2,2) position
and so there is no need to interchange any rows.
∼ [ 1  −1/2  0   | 0 0   1/4  ]   R′2 = (2/5)R2
  [ 0   1    8/5 | 0 2/5 1/10 ]
  [ 0   1    1   | 1 0   −1/2 ]
∼ [ 1  −1/2  0    | 0 0    1/4  ]   R′3 = R3 − R2
  [ 0   1    8/5  | 0 2/5  1/10 ]
  [ 0   0   −3/5  | 1 −2/5 −3/5 ]
∼ [ 1  −1/2  0   | 0    0   1/4  ]   R′3 = −(5/3)R3
  [ 0   1    8/5 | 0    2/5 1/10 ]
  [ 0   0    1   | −5/3 2/3 1    ].
Phase 2 (back substitution of the right half) then reduces the left half to I; the right half becomes
A−1 = (1/6) [ 8   −2  −3 ]
            [ 16  −4  −9 ]
            [ −10  4   6 ].
By analyzing each step of the method to find the inverse of a matrix A of order n × n, it
can be shown that the time complexity to compute the inverse of a non-singular matrix
is O(n3 ).
Ax = b.
Multiplying both sides by A−1 on the left, the solution is obtained as
x = A−1 b. (1.9)
Note 1.3 It is mentioned earlier that the time to compute the inverse of an n×n matrix
is O(n3 ) and this amount of time is required to multiply two matrices of same order.
Hence, the time complexity to solve a system of linear equations containing n equations
is O(n3 ).
.
Chapter 5
Module No. 3
Ax = b (3.1)
where
A = [ a11 a12 · · · a1n ]
    [ a21 a22 · · · a2n ]
    [ · · ·             ]
    [ an1 an2 · · · ann ],  b = (b1, b2, . . . , bn)^T  and  x = (x1, x2, . . . , xn)^T. (3.2)
Suppose the coefficient matrix A is factorized as A = LU, where L is a lower triangular
matrix and U is an upper triangular matrix. Then the system becomes
LUx = b. (3.5)
Let Ux = z. Then Lz = b, which in explicit form is
l11 z1 = b1
l21 z1 + l22 z2 = b2
l31 z1 + l32 z2 + l33 z3 = b3 (3.6)
···································· ··· ···
ln1 z1 + ln2 z2 + ln3 z3 + · · · + lnn zn = bn .
This system of equations can be solved by forward substitution, i.e. the value of z1
is obtained from first equation and using this value, z2 can be determined from second
equation and so on. From last equation we can determine the value of zn , as in this
stage the values of the variables z1 , z2 , . . . , zn−1 are available.
By finding the values of z, one can solve the equation Ux = z. In explicit form, this
system is
Observe that the value of the last variable xn can be determined from the last
equation. Using this value one can compute the value of xn−1 from the last but one
equation, and so on. Lastly, from the first equation we can find the value of the variable
x1 , as in this stage all other variables are already known. This process is called the
backward substitution.
Thus, the outline to solve the system of equations Ax = b is given. But, the compli-
cated step is to determine the matrices L and U. The matrices L and U are obtained
from the relation A = LU. Note that, this matrix equation gives n2 equations contain-
ing lij and uij for i, j = 1, 2, . . . , n. But, the number of elements of the matrices L and
U are n(n + 1)/2 + n(n + 1)/2 = n2 + n. So, n additional equations/conditions are
required to find L and U completely. Such conditions are discussed below.
When uii = 1, for i = 1, 2, . . . , n, then the method is known as Crout’s decom-
position method. When lii = 1, for i = 1, 2, . . . , n then the method is known as
Doolittle’s method for decomposition. In particular, when lii = uii for i = 1, 2, . . . , n
then the corresponding method is called Cholesky’s decomposition method.
Similarly, from second column and second row we get the following equations.
The matrix equations Lz = b and Ux = z can also be solved by finding the inverses
of L and U as
z = L−1 b (3.10)
and x = U−1 z. (3.11)
But, the process is time consuming, because finding of inverse takes much time.
It may be noted that the time to find the inverse of a triangular matrix is less than
an arbitrary matrix.
The inverse of A can also be determined from the relation A−1 = (LU)−1 = U−1 L−1.
Let L = [lij ] and U = [uij ] be the lower and upper triangular matrices.
• The inverse of lower (upper) triangular matrix is also a lower (upper) triangular
matrix.
Express A as A = LU, where L and U are lower and upper triangular matrices and
hence solve the system of equations 2x1 − 3x2 + x3 = 1, x1 + 2x2 − 3x3 = 4, 4x1 −
x2 − 2x3 = 8.
Also, determine L−1 , U−1 , A−1 and |A|.
Solution. Let
A = [ 2 −3  1 ]
    [ 1  2 −3 ]
    [ 4 −1 −2 ]
  = [ l11 0   0   ] [ 1 u12 u13 ]   [ l11  l11u12        l11u13                 ]
    [ l21 l22 0   ] [ 0 1   u23 ] = [ l21  l21u12 + l22  l21u13 + l22u23        ]
    [ l31 l32 l33 ] [ 0 0   1   ]   [ l31  l31u12 + l32  l31u13 + l32u23 + l33  ].
To find the values of lij and uij, comparing both sides we obtain
l11 = 2, l21 = 1, l31 = 4
l11u12 = −3 or, u12 = −3/2
l11u13 = 1 or, u13 = 1/2
l21u12 + l22 = 2 or, l22 = 7/2
l31u12 + l32 = −1 or, l32 = 5
l21u13 + l22u23 = −3 or, u23 = −1
l31u13 + l32u23 + l33 = −2 or, l33 = 1.
2z1 = 1,
z1 + 7/2z2 = 4,
4z1 + 5z2 + z3 = 8.
[L|I] = [ 2  0   0 | 1 0 0 ]
        [ 1  7/2 0 | 0 1 0 ]
        [ 4  5   1 | 0 0 1 ]
∼ [ 1  0   0 | 1/2 0 0 ]   R′1 = (1/2)R1
  [ 1  7/2 0 | 0   1 0 ]
  [ 4  5   1 | 0   0 1 ]
∼ [ 1  0   0 | 1/2  0 0 ]   R′2 = R2 − R1, R′3 = R3 − 4R1
  [ 0  7/2 0 | −1/2 1 0 ]
  [ 0  5   1 | −2   0 1 ]
∼ [ 1  0   0 | 1/2  0   0 ]   R′2 = (2/7)R2
  [ 0  1   0 | −1/7 2/7 0 ]
  [ 0  5   1 | −2   0   1 ]
∼ [ 1  0   0 | 1/2   0     0 ]   R′3 = R3 − 5R2.
  [ 0  1   0 | −1/7  2/7   0 ]
  [ 0  0   1 | −9/7  −10/7 1 ]
Thus, L−1 = [ 1/2   0     0 ]
            [ −1/7  2/7   0 ]
            [ −9/7  −10/7 1 ].
Using same process, one can determine U−1 . But, here another method is used
to determine U−1 . We know that the inverse of an upper triangular matrix is upper
triangular.
Therefore, let
U−1 = [ 1 b12 b13 ]
      [ 0 1   b23 ]
      [ 0 0   1   ].
A = LLt , (3.13)
where L = [lij ], lij = 0, i < j, a lower triangular matrix, LT is the transpose of the
matrix L.
Again, the matrix A can be written as
A = UUt , (3.14)
LLt x = b. (3.15)
Let
Lt x = z, (3.16)
then
Lz = b. (3.17)
Using forward substitution one can easily solve the equation (3.17) to obtain the
vector z. Then, by solving the equation Lt x = z using back substitution, we obtain
the vector x.
Alternately, the values of z and then x can be determined from the following equa-
tions.
From this discussion, it is clear that the solution of the system of equations is com-
pletely depends on the matrix L. The procedure to compute the matrix L is discussed
below.
A = [ l11 0   0   · · · 0   ] [ l11 l21 · · · lj1 · · · ln1 ]
    [ l21 l22 0   · · · 0   ] [ 0   l22 · · · lj2 · · · ln2 ]
    [ · · ·                 ] [ · · ·                       ]
    [ ln1 ln2 ln3 · · · lnn ] [ 0   0   · · · 0   · · · lnn ]
  = [ l11²     l21l11          · · ·  ln1l11                    ]
    [ l21l11   l21² + l22²     · · ·  ln1l21 + ln2l22           ]
    [ · · ·                                                     ]
    [ ln1l11   l21ln1 + l22ln2 · · ·  ln1² + ln2² + · · · + lnn² ].
l31l21 + l32l22 = 39 or l32 = (1/l22)(39 − l31l21) = 4
l31² + l32² + l33² = 26 or l33 = (26 − l31² − l32²)^(1/2) = 1.
Therefore, L = [ 2 0 0 ]
               [ 1 9 0 ]
               [ 3 4 1 ].
Now, the system of equations Lz = b becomes
2z1 = 16
z1 + 9z2 = 206
3z1 + 4z2 + z3 = 113.
unit matrix I to the lower triangular matrix. This lower triangular matrix is the inverse
of L. Now, AA−1 = I reduces to LUA−1 = I, i.e.
L−1 = UA−1. (3.21)
The left hand side of the equation (3.21) is a lower triangular matrix. Also, the
matrices U and L−1 are known. Therefore, by solving the system of equations (3.21)
using substitution one can easily determine the matrix A−1 .
The following problem is considered to illustrate the method.
Example 3.3 Find the inverse of the matrix
A = [ 1  2 4 ]
    [ 1 −2 6 ]
    [ 2 −1 0 ]
by using Gauss elimination method.
Solution. The augmented matrix [A|I] is
[A|I] = [ 1  2 4 | 1 0 0 ]
        [ 1 −2 6 | 0 1 0 ]
        [ 2 −1 0 | 0 0 1 ]
→ [ 1  2  4 | 1  0 0 ]   R′2 = R2 − R1, R′3 = R3 − 2R1
  [ 0 −4  2 | −1 1 0 ]
  [ 0 −5 −8 | −2 0 1 ]
→ [ 1  2  4     | 1    0    0 ]   R′3 = R3 − (5/4)R2
  [ 0 −4  2     | −1   1    0 ]
  [ 0  0 −21/2  | −3/4 −5/4 1 ].
Thus, we get
U = [ 1  2  4     ]        L−1 = [ 1    0    0 ]
    [ 0 −4  2     ]   and        [ −1   1    0 ]
    [ 0  0 −21/2  ]              [ −3/4 −5/4 1 ].
Let A−1 = [ x11 x12 x13 ]
          [ x21 x22 x23 ]
          [ x31 x32 x33 ].
We generally presume that the size of the coefficient matrix of a system is not very
large, that is, that the entire matrix can be stored in the primary memory of a computer.
But, in many applications, the size of the matrix is very large and it
cannot be stored in the primary memory of a computer. So, in such cases the entire
matrix is divided into some matrices of lower size. With the help of these lower order
matrices, one can find the inverse of the given matrix. This process of division is known
as the matrix partitioning method. This method is also useful when a few more variables, and
consequently a few more equations, are added to the original system.
Suppose the coefficient matrix A is partitioned as
A = [ B | C ]
    [ D | E ] (3.22)
where B, C, D and E are sub-matrices of orders l × l, l × m, m × l and m × m respectively,
with l + m = n. Let the inverse of A be partitioned in the same manner as
A−1 = [ P | Q ]
      [ R | S ] (3.23)
where the matrices P, Q, R and S are of the same orders as those of the matrices B, C, D
and E respectively. Then
AA−1 = [ B | C ] [ P | Q ]   [ I1 | 0  ]
       [ D | E ] [ R | S ] = [ 0  | I2 ], (3.24)
where I1 and I2 are identity matrices of orders l and m. This gives
BP + CR = I1
BQ + CS = 0
DP + ER = 0
DQ + ES = I2.
Solving these matrix equations,
S = (E − DB−1 C)−1
Q = −B−1 CS
R = −(E − DB−1 C)−1 DB−1 = −SDB−1
P = B−1 (I1 − CR) = B−1 − B−1 CR.
Note that, we have to determine the inverse of two square matrices B and (E − DB−1 C)
of order l × l and m × m respectively.
That is, the inverse of the matrix A of order n × n depends on the inverses of two
lower order (roughly half) matrices. If the matrices B, C, D, E are still large to fit in
the computer memory, then further partition them.
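A Python sketch of these formulae (name ours; it assumes B and the matrix E − DB−1C are invertible):

    import numpy as np

    def block_inverse(A, l):
        # Partition A = [[B, C], [D, E]] with B of order l x l and
        # assemble A^{-1} = [[P, Q], [R, S]] from the formulae above.
        B, C = A[:l, :l], A[:l, l:]
        D, E = A[l:, :l], A[l:, l:]
        Binv = np.linalg.inv(B)
        S = np.linalg.inv(E - D @ Binv @ C)   # S = (E - D B^{-1} C)^{-1}
        R = -S @ D @ Binv                     # R = -S D B^{-1}
        Q = -Binv @ C @ S                     # Q = -B^{-1} C S
        P = Binv - Binv @ C @ R               # P = B^{-1} - B^{-1} C R
        return np.block([[P, Q], [R, S]])

    A = np.array([[1., 2., 3.], [2., -1., 0.], [0., 2., 4.]])   # Example 3.4
    print(np.allclose(block_inverse(A, 2) @ A, np.eye(3)))      # True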
Example 3.4 Using matrix partition method find the inverse of the matrix
A = [ 1  2 3 ]
    [ 2 −1 0 ]
    [ 0  2 4 ].
Hence, find the solution of the system of equations
x1 + 2x2 + 3x3 = 1
2x1 − x2 = 0
2x2 + 4x3 = −1.
Solution. Suppose the matrix A is partitioned as
A = [ B | C ]
    [ D | E ]
where
B = [ 1  2 ],  C = [ 3 ],  D = [ 0 2 ],  E = [ 4 ].
    [ 2 −1 ]       [ 0 ]
Then the inverse of A is given by
A−1 = [ P | Q ]
      [ R | S ],
where the matrices P, Q, R and S are obtained from the formulae above.
.
Chapter 5
Module No. 4
where
aij^(1) = aij − (ai1/a11) a1j;  i, j = 2, 3, . . . , n.
Now, to eliminate x2 (here also it is assumed that a22^(1) ≠ 0) from the third, fourth, . . ., and
nth equations, the second equation is multiplied by −a32^(1)/a22^(1), −a42^(1)/a22^(1), . . ., −an2^(1)/a22^(1) respectively
and successively added to the third, fourth, . . ., and nth equations. The reduced system
of equations has the new coefficients
aij^(2) = aij^(1) − (ai2^(1)/a22^(1)) a2j^(1);  i, j = 3, 4, . . . , n.
where
aij^(k) = aij^(k−1) − (aik^(k−1)/akk^(k−1)) akj^(k−1);
i, j = k + 1, . . . , n; k = 1, 2, . . . , n − 1, and apq^(0) = apq; p, q = 1, 2, . . . , n.
Note that, from last equation one can determine the value of xn easily. From last
but one equation we can determine the value of xn−1 using the value of xn obtained
from last equation. In this process, we can determine the values of all variables and this
process is known as back substitution.
From the last equation we have xn = bn^(n−1)/ann^(n−1). Using this value, we can determine the
value of xn−1 from the last but one equation, and so on. Finally, the first equation gives
the value of x1.
The process of determining the values of the variables xi is a back substitution, because
we first determine the value of the last variable xn; the evaluation of the elements
aij^(k) is the forward elimination.
Note 4.1 In Gauss elimination method, it is assumed that the diagonal elements are
non-zero. If one of these elements is zero or close to zero, then the method is not applicable
to solve the system of equations, though the system may have a solution. In this case, the partial
or complete pivoting method must be used to find a solution or a better solution.
Example. Solve the system of equations 2x1 − x2 + x3 = 5, x1 + 2x2 + 3x3 = 10, x1 + 3x2 − 2x3 = 7 by Gauss elimination method.
Solution. Multiplying the second and third equations by 2 and subtracting them from
the first equation, we obtain
2x1 − x2 + x3 = 5,
−5x2 − 5x3 = −15,
−7x2 + 5x3 = −9.
Multiplying the third equation by 5/7 and subtracting it from the second equation, we get
2x1 − x2 + x3 = 5,
−5x2 − 5x3 = −15,
−(60/7) x3 = −60/7.
Observe that the value of x3 can easily be determined from the third equation and
it is x3 = 1. Using this value, from second equation we have x2 = 2. Finally, from first
equation 2x1 = 5 + 2 − 1 = 6, i.e. x1 = 3.
Hence, the solution is x1 = 3, x2 = 2, x3 = 1.
In the Gauss-Jordan method, the coefficient matrix is reduced to the identity matrix, so the
system takes the form
[ 1 0 · · · 0 ] [ x1 ]   [ b′1 ]
[ 0 1 · · · 0 ] [ x2 ] = [ b′2 ]
[ · · ·       ] [ ·· ]   [ ··  ]
[ 0 0 · · · 1 ] [ xn ]   [ b′n ]. (4.5)
Symbolically,
[A|b] −−(Gauss-Jordan)−→ [I|b′]. (4.6)
Example. Solve the system of equations x1 + x2 + x3 = 4, 2x1 − x2 + 3x3 = 1, 3x1 + 2x2 − x3 = 1 by Gauss-Jordan method.
Solution. The augmented matrix [A|b] is
[A|b] = [ 1  1  1 | 4 ]
        [ 2 −1  3 | 1 ]
        [ 3  2 −1 | 1 ]
∼ [ 1  1  1 | 4   ]   R′2 = R2 − 2R1, R′3 = R3 − 3R1
  [ 0 −3  1 | −7  ]
  [ 0 −1 −4 | −11 ]
∼ [ 1  1  1     | 4     ]   R′3 = R3 − (1/3)R2
  [ 0 −3  1     | −7    ]
  [ 0  0 −13/3  | −26/3 ]
∼ [ 1  1  1    | 4   ]   R′3 = −(3/13)R3, R′2 = −(1/3)R2
  [ 0  1 −1/3  | 7/3 ]
  [ 0  0  1    | 2   ]
∼ [ 1  0  4/3  | 5/3 ]   R′1 = R1 − R2
  [ 0  1 −1/3  | 7/3 ]
  [ 0  0  1    | 2   ]
∼ [ 1  0  0 | −1 ]   R′1 = R1 − (4/3)R3, R′2 = R2 + (1/3)R3.
  [ 0  1  0 | 3  ]
  [ 0  0  1 | 2  ]
Hence, the solution is x1 = −1, x2 = 3, x3 = 2.
b1 x1 + c1 x2 = d1
a2 x1 + b2 x2 + c2 x3 = d2 (4.7)
a3 x2 + b3 x3 + c3 x4 = d3
·················· ···
an xn−1 + bn xn = dn .
In matrix form Ax = d, where
A = [ b1 c1 0  0  · · · 0    0    0    ]
    [ a2 b2 c2 0  · · · 0    0    0    ]
    [ 0  a3 b3 c3 · · · 0    0    0    ]
    [ · · ·                            ]
    [ 0  0  0  0  · · · an−1 bn−1 cn−1 ]
    [ 0  0  0  0  · · · 0    an   bn   ]
and d = (d1, d2, . . . , dn)^T. (4.8)
This matrix has many interesting properties. Note that the main diagonal and its
two adjacent (below and upper) diagonals are non-zero and all other elements are zero.
This special matrix is called tri-diagonal matrix and the system of equations is called a
tri-diagonal system of equations. This matrix is also known as band matrix.
A tri-diagonal system of equations can be solved by the methods discussed earlier.
But, this system has some special properties. Exploring these special properties, the
system can be solved by a simple way, starting from the LU decomposition method.
Let A = LU where
L = [ γ1 0  0  · · · 0    0    0  ]
    [ β2 γ2 0  · · · 0    0    0  ]
    [ · · ·                       ]
    [ 0  0  0  · · · βn−1 γn−1 0  ]
    [ 0  0  0  · · · 0    βn   γn ]
and
U = [ 1 α1 0  · · · 0 0 0    ]
    [ 0 1  α2 · · · 0 0 0    ]
    [ · · ·                  ]
    [ 0 0  0  · · · 0 1 αn−1 ]
    [ 0 0  0  · · · 0 0 1    ].
Then
LU = [ γ1 γ1α1       0          · · · 0  0           ]
     [ β2 α1β2 + γ2  γ2α2       · · · 0  0           ]
     [ 0  β3         α2β3 + γ3  · · · 0  0           ]
     [ · · ·                                         ]
     [ 0  0          0          · · · βn αn−1βn + γn ].
Now, comparing both sides of the matrix equation LU = A, we obtain the
following system of equations:
γ1 = b1,  γiαi = ci, or, αi = ci/γi, i = 1, 2, . . . , n − 1
βi = ai, i = 2, . . . , n
γi = bi − αi−1βi = bi − ai ci−1/γi−1, i = 2, 3, . . . , n.
Hence, the elements of the matrices L and U are given by the following equations:
γ1 = b1,
γi = bi − ai ci−1/γi−1, i = 2, 3, . . . , n (4.9)
βi = ai, i = 2, 3, . . . , n (4.10)
αi = ci/γi, i = 1, 2, . . . , n − 1. (4.11)
For example, consider the tri-diagonal system
x1 + 2x2 = 4,  −x1 + 2x2 + 3x3 = 6,  3x2 + x3 = 8,
so that b = (1, 2, 1), a2 = −1, a3 = 3, c1 = 2, c2 = 3 and d = (4, 6, 8). Then
γ1 = b1 = 1
γ2 = b2 − a2 c1/γ1 = 2 − (−1)·2 = 4
γ3 = b3 − a3 c2/γ2 = 1 − 3·(3/4) = −5/4.
Forward substitution gives
z1 = d1/γ1 = 4,  z2 = (d2 − a2z1)/γ2 = 5/2,  z3 = (d3 − a3z2)/γ3 = −2/5,
and back substitution gives
x3 = z3 = −2/5,  x2 = z2 − (c2/γ2)x3 = 14/5,  x1 = z1 − (c1/γ1)x2 = −8/5.
Therefore, the required solution is x1 = −8/5, x2 = 14/5, x3 = −2/5.
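The whole procedure — factorization plus the two sweeps — is the classical tri-diagonal (Thomas) algorithm. A Python sketch built on (4.9)-(4.11) (names ours; a[0] and c[-1] are unused placeholders, and all γi are assumed non-zero):

    def thomas(a, b, c, d):
        n = len(b)
        gamma = [b[0]]
        alpha = []
        z = [d[0] / gamma[0]]
        for i in range(1, n):
            alpha.append(c[i - 1] / gamma[i - 1])       # alpha_{i-1} (4.11)
            gamma.append(b[i] - a[i] * alpha[-1])        # gamma_i (4.9)
            z.append((d[i] - a[i] * z[-1]) / gamma[i])   # forward sweep
        x = [0.0] * n
        x[-1] = z[-1]
        for i in range(n - 2, -1, -1):                   # backward sweep
            x[i] = z[i] - alpha[i] * x[i + 1]
        return x

    # The worked example above:
    print(thomas([0, -1, 3], [1, 2, 1], [2, 3, 0], [4, 6, 8]))
    # [-1.6, 2.8, -0.4]  i.e.  (-8/5, 14/5, -2/5)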
The above method is not applicable for all kinds of tri-diagonal equations. The
equations (4.12) and (4.13) are valid only if γi 6= 0 for all i = 1, 2, . . . , n. If any one of γi
is zero at any stage, then the method fails. Remember that this method is based on LU
decomposition method and LU decomposition method is applicable and gives unique
solution if all the principal minors of the coefficient matrix are non-zero. Fortunately,
a modified method is available if one or more γi are zero. The modified method is
described below.
d's contain x. Thus, the product P = Π_{i=1}^{n} di depends on x. Lastly, the value of the
determinant is obtained by substituting x = 0 in P.
Example 4.4 Find the values of the following tri-diagonal determinants.
A = | 1  1  0 |        B = | 1   2  0 |
    | 1  1 −3 |            | −1  2 −2 |
    | 0 −1  3 |            | 0  −1  2 |.