RETRIEVAL OF ATMOSPHERIC COMPENSATION FOR HYPERSPECTRAL IMAGERY USING DATA MINING TECHNIQUES
A Thesis Presented
By
Junling Gao
to
The Department of Mechanical & Industrial Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
in the field of
Industrial Engineering
Northeastern University
Boston, Massachusetts
May 2019
ABSTRACT
One of the biggest sources of interference in hyperspectral remote sensing is the atmosphere,
which can degrade multi- and hyperspectral imagery. The quick atmospheric correction
(QUAC) utilizes an in-scene approach and is significantly faster than physics-based
methods, but more approximate. In this research we attempt to use data mining techniques
to retrieve the atmospheric correction. The key QUAC assumption is that the average of
diverse endmember reflectance spectra, excluding highly structured materials (e.g.,
vegetation, shallow water, mud), is always the same. The biggest contribution of this
research is to derive comparable results without this key assumption. To this end, we
assembled a training set and a validation set of hyperspectral cubes from the AVIRIS sensor
website. We built an endmember reflectance library retrieved using QUAC and the
training cubes. For a testing cube, the gain featured in QUAC is derived using only the fact
that the gain is independent of in-scene spatial coordinates. We find that the derived gain
is comparable with that from QUAC and yields similar reflectance. We also tested an
unsupervised machine learning algorithm to compensate for the atmospheric effects directly,
featuring clustering and classification algorithms.
TABLE OF CONTENTS
1. BACKGROUND
1.1 History of Remote Sensing
1.2 History of Hyperspectral Remote Sensing
1.3 Method of Atmospheric Correction Based on Hyperspectral Remote Sensing
2. INTRODUCTION OF QUAC ALGORITHM
3. ALGORITHM DESCRIPTION
3.1 Trial Algorithm
3.1.1 Trial Algorithm Description
3.1.2 Result Discussion
3.2 Adjusted Algorithm
3.2.1 Endmember Library Building
3.2.2 Gain Retrieving
3.2.3 Gain Retrieving Example
3.2.4 Results Discussion
3.2.5 Future Improvement
4. UNSUPERVISED MACHINE LEARNING METHOD
4.1 Algorithm Description
4.1.1 K-Means Clustering
4.1.2 Select the Sample Set
4.2 Build the Random Forest Model
4.3 Test the model using remaining data from same cube that build the model
4.4 Test the model using new data cube
4.5 Results Discussion
REFERENCES
1. BACKGROUND
1.1 History of Remote Sensing
Remote sensing as an integrated technology was proposed by American scholars in the 1960s.
To describe this technique and method more fully, E. L. Pruitt defined remote
sensing as “a technique for obtaining images or data of a target being detected
photographically or non-photographically”. In practice, we generally regard remote sensing
as a technique that operates far from the target and that determines, measures, and analyzes
the nature of the target without direct contact. Because of its wide range of applications,
remote sensing has received great attention and has developed rapidly.
Satellite remote sensing has pushed remote sensing technology to a new stage of
comprehensive development and widespread application. Since the launch of the first Earth
Resources Satellite in 1972, China, the United States, France, Russia, ESA, Japan, India
and other countries have launched a number of Earth observation satellites.
Nowadays, the multi-sensor technology of satellite remote sensing covers
all parts of the atmospheric window. Optical remote sensing spans visible light,
near-infrared and short-wave infrared to detect the reflection and scattering of the target.
Thermal infrared remote sensing ranges from 8 to 14 micrometers
and detects the radiation characteristics of the target, namely its emissivity and
temperature. Microwave remote sensing wavelengths range from 1 mm to 100 cm;
passive microwave remote sensing mainly detects the target’s radiance and temperature.
After more than half a century of development, remote sensing technology and its multi-field
applications have entered a new stage. It can not only passively receive the natural light
reflected from ground objects but can also use synthetic aperture radar and laser radar
to actively emit electromagnetic waves and achieve all-weather ground observation (Zhang,
2017). The relationship between remote sensing and national-level activities, for example
the national economy, ecological protection and national security, is becoming ever
closer. Beyond these “big things”, remote sensing is also close to our daily life. It plays an
important role in weather forecasting, air quality monitoring, electronic maps and
navigation (Zhang, 2017). In the 21st century, remote sensing has three new features:
high spatial resolution, high spectral resolution and high temporal resolution.
Because of these, the technology can be applied to more new fields.
1.2 History of Hyperspectral Remote Sensing
The branch discussed in this thesis is one of these “three high” features,
high spectral resolution.
High-spectral-resolution sensing is also called hyperspectral remote sensing. Imaging
spectrometers are used to capture ground radiation information in dozens or
hundreds of spectral channels. As a result, when the spatial image is obtained,
every pixel also has a continuous spectral curve containing the diagnostic spectral
features of the object.
The very first imaging spectrometer, AIS-1, was successfully developed in the United
States in 1983. In 1987, the United States launched the second generation of hyperspectral
imager, AVIRIS, which has been continuously updated and has become an incubator for
the development of aerospace hyperspectral remote sensing technology in the United States.
Since then, many countries have developed a variety of aviation imaging spectrometers,
such as CASI in Canada, ROSIS in Germany, HyMap in Australia and so on.
The outstanding features and advantages of hyperspectral remote sensing make it play an
increasingly important role in many fields.
For example, through analysis of the diagnostic spectral characteristics of mineral
elements, hyperspectral remote sensing can accurately identify and map mineral
components; in vegetation research, hyperspectral data can be used to retrieve vegetation
physical and chemical parameters; in water quality monitoring, spectral
retrieval of chlorophyll, yellow matter, suspended matter and other components in water
reveals the distribution of black and odorous water and its pollution sources. Hyperspectral
remote sensing technology also has great potential in military target reconnaissance,
position and equipment camouflage identification, and battlefield environment background
analysis.
1.3 Method of Atmospheric Correction Based on Hyperspectral Remote Sensing
When we analyze remote sensing data, we find that the measured signal mixes
contributions from several sources. To quantitatively retrieve Earth surface
information and accurately identify ground features, we have to perform atmospheric
correction to recover the true reflectance of the surface under solar illumination.
The purpose of atmospheric correction is to eliminate the influence of atmospheric and
illumination factors on the reflection of ground objects. In a broad sense, atmospheric
correction obtains real physical model parameters such as reflectivity,
emissivity or surface temperature. In the narrow sense, atmospheric correction obtains
the true reflectivity of the ground object by eliminating the influence of water vapor,
oxygen, carbon dioxide, methane and ozone on the reflection of the ground object and
the influence of atmospheric molecular and aerosol scattering. In most cases,
atmospheric correction is also the process of retrieving the true reflectivity of the ground object.
Common absolute atmospheric correction methods include the LOWTRAN radiative transfer
model, the moderate-resolution MODTRAN model, the ATCOR model, the 6S model, Dark
Object Subtraction, statistical reflectance retrieval methods such as Flat Field,
log residuals, Internal Average Relative Reflectance and Empirical Line, as well as QUAC and
FLAASH.
2. INTRODUCTION OF QUAC ALGORITHM
The presence of the atmosphere has a huge impact on views of the Earth’s surface from
aircraft and spacecraft. The degradation includes attenuation of reflected light as well
as loss of contrast due to sunlight scattering by atmospheric aerosols and molecules
(Bernstein, Jin, Gregor, & Adler-Golden, 2012). The task of remote sensing is to
eliminate these atmospheric effects so that we obtain the inherent spectral reflectance
of the surface materials.
Many atmospheric correction methods and algorithms exist. They include
those based on first-principles radiation transport (RT)
calculations (Gao, Heidebrecht, & Goetz, 1993; Perkins et al., 2012), empirical approaches
such as the empirical line method (ELM) (Roberts, 1985; Kruse, Kierein-Young, &
Boardman, 1990), and approximate in-scene methods such as IAR (internal average
reflectance) (Kruse, 1988) and FF (flat field) (Roberts, Yamaguchi, & Lyon, 1986). None
of these approaches combines high accuracy, high computational speed, and
independence from prior knowledge.
Accordingly, a newer atmospheric correction method, QUAC (Bernstein et al., 2005;
Bernstein, Adler-Golden, Perkins, Berk, & Levine, 2005, 2006; Bernstein, Adler-Golden,
Sundberg, & Ratkowski, 2008), was developed to address most of these shortcomings.
The mathematical theory of atmospheric correction is described below.
$$L_{obs} = (A + C\rho_{ave}) + B\rho \qquad (1)$$
Here $L_{obs}$ is the observed spectral radiance for a pixel, $\rho$ is the
surface reflectance for that pixel, and $\rho_{ave}$ is the spatially averaged reflectance of
the pixels surrounding it.
In practice, the term $A + C\rho_{ave}$ can be treated as a constant
within one specific image, so it acts as an offset common to all
the pixels in the image. For $B$, we take the reciprocal and set Gain equal to
$1/B$. Rearranging the terms gives
$$\rho = \mathrm{Gain}\,(L_{obs} - \mathrm{Offset}) \qquad (2)$$
For QUAC, we determine these parameters directly from the in-scene spectral data and a
key underlying assumption (Bernstein, Jin, Gregor, & Adler-Golden, 2012).
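$$\mathrm{Gain} = \frac{\langle \rho_{end} \rangle_{lib}}{\langle L_{obs} - \mathrm{Offset} \rangle_{end}} \qquad (3)$$
$$\mathrm{Offset} = \min(\text{pixel value for each band}) \qquad (4)$$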
In equation (3), the numerator, $\langle \rho_{end} \rangle_{lib}$, is the mean value of the
endmembers from the reference library, so this part is related to the
reflectance of pure materials. The denominator contains $L_{obs}$, the radiance that we obtain
directly from the sensor.
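As a minimal sketch of equations (2) and (4), the per-band offset and the gain-corrected reflectance can be computed as below. The function names, the flattened list-of-spectra layout and the toy numbers are illustrative assumptions rather than part of QUAC, and the gain is simply taken as given here.

```python
def band_offsets(pixels):
    """Equation (4): per-band offset = minimum pixel value in that band."""
    n_bands = len(pixels[0])
    return [min(p[b] for p in pixels) for b in range(n_bands)]


def to_reflectance(pixels, gain, offset):
    """Equation (2): reflectance = Gain * (L_obs - Offset), band by band."""
    return [
        [g * (value - o) for value, g, o in zip(p, gain, offset)]
        for p in pixels
    ]


# Toy radiance data: 3 pixels x 2 bands (hypothetical values).
radiance = [[12.0, 30.0], [10.0, 50.0], [16.0, 40.0]]
offset = band_offsets(radiance)
gain = [0.5, 0.1]  # assumed known for this illustration
reflectance = to_reflectance(radiance, gain, offset)
```

Because the offset is the band minimum, the darkest pixel in each band maps to zero reflectance in that band.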
As mentioned before, QUAC provides high accuracy and at the
same time very good computational speed. However, these advantages rest
on one assumption. The key QUAC assumption, which empirically holds for most scenes,
is that the average of diverse endmember reflectance spectra, excluding highly structured
materials (e.g., vegetation, shallow water, mud), is always the same (Bernstein, Jin, Gregor,
& Adler-Golden, 2012).
We first need to determine the channels and the number of endmembers to extract
from the whole scene. Once the number of endmembers and the channels are fixed, we
apply the same channels and the same number to the reference library. That is, we
take the reference library of pure-material reflectance, use SMACC to retrieve
endmembers from the library with the same number and the same channels, and then run
QUAC. Because the raw data from the sensor is radiance, we have
to assume that the reflectance used to calculate the Gain is the same as the reflectance
retrieved from the reference library.
Fig 1 The original data cube for Lake St. Clair, Maine.
Fig 2 QUAC revised data cube for Lake St. Clair, Maine.
For example, we ran the QUAC algorithm on one data cube downloaded from the AVIRIS
website using ENVI 5.5 for Mac. Comparing the data for the
same pixel before and after running QUAC shows the difference. Fig 1 is the
original data cube and Fig 2 is the revised data cube. The
peak value in Fig 2 is smaller than in Fig 1, and the revised image
is visibly darker than the original one.
3. ALGORITHM DESCRIPTION
3.1 Trial Algorithm
3.1.1 Trial Algorithm Description
Fig 3 The schematic diagram for the trial algorithm. For each channel $\lambda(k)$, the row of values $X_{k1}, X_{k2}, \ldots, X_{kn}$ is expanded into an $n \times n$ array.
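The first idea is to obtain a “matching” directly, given a certain
constraint, according to the equation:
$$\ln\rho_{lib}^{(1)} - \ln\rho_{lib}^{(2)} = \ln\left(\frac{L_{obs}^{(1)} - \mathrm{Offset}}{L_{obs}^{(2)} - \mathrm{Offset}}\right) \;\Longrightarrow\; \ln\frac{\rho_{lib}^{(1)}}{\rho_{lib}^{(2)}} = \ln\left(\frac{L_{obs}^{(1)} - \mathrm{Offset}}{L_{obs}^{(2)} - \mathrm{Offset}}\right) \qquad (5)$$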
(Fig 3, array panel: the $n \times n$ table of pairwise ratios for channel $\lambda(1)$, with entry $(i, j)$ equal to $X_{1i}/X_{1j}$.)
Writing formula (2) for two different pixels and taking the logarithm of their ratio gives
formula (5). The Gain term cancels, which means we eliminate the main assumption of
the QUAC algorithm. Before applying equation (5), we need to state
two conditions.
The first condition is that we use the QUAC Material Library, which means
that we use all the pure materials in the calculation and then match each endmember directly
with the corresponding material. The second constraint, applied throughout the
calculation, is that the numerator is kept the same.
The main idea of this trial algorithm is shown in Fig 3. We first calculate the left
side of equation (5) under the first condition, using the QUAC Material Library.
It has 168 natural and manmade materials, and each material has
224 values, one for each of 224 channels. The calculation proceeds as
described above. For example, we choose the first channel, $\lambda$ = 365.92981 nm,
and put each material in turn in the numerator with every other material in the
denominator. This gives a 168 × 168 data set for each of the
224 channels.
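The per-channel ratio table described above can be sketched as follows. This is an illustration only, assuming the library is stored as a list of spectra (one list of channel values per material); the helper name `ratio_matrix` and the three-material toy library are hypothetical.

```python
def ratio_matrix(spectra, channel):
    """Return the n x n table whose (i, j) entry is X_i / X_j at one channel."""
    values = [s[channel] for s in spectra]
    return [[num / den for den in values] for num in values]


# Toy library: 3 materials x 2 channels (the real library is 168 x 224).
library = [[0.2, 0.4], [0.1, 0.8], [0.4, 0.2]]
table = ratio_matrix(library, channel=0)
```

Repeating the call for every channel index yields the full stack of tables, one per channel.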
Next, when a new image comes in, we first use
SMACC to retrieve endmembers for the whole image. We use
endmembers in the calculation because they simplify the data cube, and the formula
applies to endmembers as well. Suppose we have radiance data for 50 endmembers.
To fill in the right side of the formula, we follow the same process as for the left
side: we keep one endmember as the numerator and use the remaining endmembers as
denominators. This gives a 50 × 50 data set for each channel.
The last step is to search for the smallest distance. For example, we try to determine
which material in the Material Library the first endmember from the new image belongs
to.
The full data set for all the endmembers is 50 × 50 × 224. We take the 50 ratio vectors
in which the first endmember X1 is the numerator and calculate their distance
to each group of 168 vectors in the 168 × 168 × 224 library data set. The smallest Euclidean
distance should be 0, or at least close to 0 relative to the magnitude of the data. In that
case we can say that the two numerators match each other, and we can
determine the specific material for each endmember.
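The matching step amounts to a nearest-vector search under the Euclidean distance. A minimal brute-force sketch, assuming each ratio vector is a plain list of numbers; the function name `closest_row` and the toy vectors are hypothetical.

```python
import math


def closest_row(query, candidates):
    """Return (index, distance) of the candidate nearest to query (Euclidean)."""
    best_index, best_dist = -1, math.inf
    for i, row in enumerate(candidates):
        d = math.dist(query, row)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


# Hypothetical ratio vectors: one from the scene, three from the library.
scene_ratios = [1.0, 2.0, 0.5]
library_ratios = [[1.0, 4.0, 0.5], [1.0, 2.1, 0.5], [3.0, 2.0, 0.5]]
index, distance = closest_row(scene_ratios, library_ratios)
```

A returned distance near zero, relative to the size of the values, would indicate a plausible match.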
However, after carrying out the full calculation, we get the plot below. No matter
which pixel we choose, the results are not good. We can find the smallest distance for each
endmember of the test cube, but the difference from the matching material is still large.
3.1.2 Result Discussion
Since this result is far from what we expected, we need to go back and
figure out why such a large difference arises.
We are using two entirely different sources. One is an independent, pre-existing library
that has no relationship with the real data cube. The other is real-world data collected
directly from the sensor. The reflectance data in the QUAC Material Library describe pure
materials, but the endmembers obtained from the sensor are not that pure.
3.2 Adjusted Algorithm
Having identified the main reason why the matching did not work, we return to the basic
physics, which is illustrated in Fig 4.
Fig 4 Schematic diagram of the sensor receiving the reflection information from the surface object in
reality. (Labels in the diagram: $\rho_{obs}$; paths A, B and C; $\rho_{sur}$; $\rho_{ave}$.)
The sensor can receive three types of reflection: A, B and C. In
formula (5), we made the mistake of using the wrong data source for path B. So
we are going to revise the term $\rho_{lib}$.
Recall that we do not want to rely on an assumption
based on the QUAC Material Library. We want a more general method based on
information from the image itself rather than on data that has no relation to the scene.
The underlying mathematics is essentially unchanged. The formula we use is still the
following.
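$$\rho_{sur}^{(i)} = \frac{1}{B}\left(L_{obs}^{(i)} - \mathrm{Offset}\right) \;\Longrightarrow\; B = \frac{L_{obs}^{(i)} - \mathrm{Offset}}{\rho_{sur}^{(i)}}, \qquad i \in \{1, 2, \ldots, N\} \qquad (6)$$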
In the formula, $L_{obs}^{(i)}$ is the radiance from the $i$th pixel or from any subset of pixels
(for example, the endmembers), and $\rho_{sur}^{(i)}$ is the reflectance of the corresponding
pixels. Suppose we have pixel 1 and pixel 2 and pick the same band for both.
We then have:
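$$\rho_{sur}^{(1)}(\lambda, x, y) = \left(L_{obs}^{(1)}(\lambda, x, y) - \mathrm{Offset}(\lambda)\right) \times \left(\frac{1}{B(\lambda)}\right) \qquad (7)$$
$$\rho_{sur}^{(2)}(\lambda, x, y) = \left(L_{obs}^{(2)}(\lambda, x, y) - \mathrm{Offset}(\lambda)\right) \times \left(\frac{1}{B(\lambda)}\right) \qquad (8)$$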
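Taking the logarithm of (7) and (8), we get
$$\ln\rho_{sur}^{(1)} = \ln\left(\frac{1}{B}\right) + \ln\left(L_{obs}^{(1)} - \mathrm{Offset}\right) \qquad (9)$$
$$\ln\rho_{sur}^{(2)} = \ln\left(\frac{1}{B}\right) + \ln\left(L_{obs}^{(2)} - \mathrm{Offset}\right) \qquad (10)$$
Subtracting (10) from (9), we get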
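$$\ln\rho_{sur}^{(1)} - \ln\rho_{sur}^{(2)} = \ln\left(\frac{L_{obs}^{(1)} - \mathrm{Offset}}{L_{obs}^{(2)} - \mathrm{Offset}}\right) \;\Longrightarrow\; \ln\frac{\rho_{sur}^{(1)}}{\rho_{sur}^{(2)}} = \ln\left(\frac{L_{obs}^{(1)} - \mathrm{Offset}}{L_{obs}^{(2)} - \mathrm{Offset}}\right) \qquad (11)$$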
According to (11), $\ln(1/B)$ cancels in the calculation. That is
to say, $1/B$ does not depend on which pixels or endmembers we choose; it depends only on $\lambda$.
So the key point is
$$B = \frac{L_{obs}^{(1)} - \mathrm{Offset}}{\rho_{sur}^{(1)}} = \frac{L_{obs}^{(2)} - \mathrm{Offset}}{\rho_{sur}^{(2)}} = \frac{L_{obs}^{(3)} - \mathrm{Offset}}{\rho_{sur}^{(3)}} = \cdots \qquad (12)$$
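A quick numerical check of this pixel independence, sketched in Python under the linear model of equation (6). The three-band values of B, the offset and the two reflectance spectra are made-up illustrations, not data from this work.

```python
# Assumed per-band values of B and Offset for a three-band toy sensor.
B = [2.0, 4.0, 5.0]
offset = [1.0, 0.5, 0.25]

# Two different surface reflectance spectra (hypothetical pixels 1 and 2).
rho_1 = [0.10, 0.20, 0.40]
rho_2 = [0.30, 0.05, 0.25]


def simulate_radiance(rho):
    """Equation (6) solved for radiance: L_obs = B * rho + Offset, per band."""
    return [b * r + o for b, r, o in zip(B, rho, offset)]


def recover_b(radiance, rho):
    """Equation (6) solved for B: (L_obs - Offset) / rho, per band."""
    return [(rad - o) / r for rad, o, r in zip(radiance, offset, rho)]


b_from_pixel_1 = recover_b(simulate_radiance(rho_1), rho_1)
b_from_pixel_2 = recover_b(simulate_radiance(rho_2), rho_2)
gain = [1.0 / b for b in b_from_pixel_1]  # Gain = 1/B
```

Both pixels return the same per-band ratio even though their reflectances differ, which is the property the adjusted algorithm relies on.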
3.2.1 Endmember Library Building
As mentioned before, the algorithm uses reflectance data from the scene
itself. So the first step is to build an endmember library.
Because the number of endmembers in an image is usually limited, we focus on
images, or parts of images, that contain as much material information as possible. We
downloaded several data cubes from AVIRIS and created training and testing data sets, each
containing 10 subsets of data cubes taken from different parts of one image or from different images.
After separating the data into training and testing sets, we use the training set to create the
endmember library. The software we use is ENVI 5.5.1. In most cases, linear combinations
of a small number of endmember spectra (~10 – 100) can accurately represent a large
number of spectra (>10,000) associated with an image (Bernstein, Jin, Gregor, &
Adler-Golden, 2012). So, for each data cube subset in the training set, we use SMACC to find
its endmembers and then combine all the endmember reflectance data
into one file, which is the initial endmember library.
Once we have this initial combined library, we apply
several data conditioning steps to make sure that a valid baseline
is determined. The steps are (1) removal of border pixels, (2) exclusion of values ≤ 0 and (3)
exclusion of values ≥ 10,000.
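The three conditioning steps can be sketched as a simple filter. This is a hedged illustration: the `(row, col, spectrum)` record layout and the function name `condition_library` are hypothetical, and dropping a whole spectrum when any band fails a check is an assumption, since the text does not say whether single values or whole spectra are excluded.

```python
def condition_library(pixels, n_rows, n_cols, upper=10_000):
    """Keep spectra that pass the three conditioning steps.

    pixels: list of (row, col, spectrum) tuples.
    Assumption: a spectrum is dropped entirely if any band fails a check.
    """
    kept = []
    for row, col, spectrum in pixels:
        on_border = row in (0, n_rows - 1) or col in (0, n_cols - 1)
        if on_border:
            continue  # step (1): border pixel
        if any(v <= 0 for v in spectrum):
            continue  # step (2): values <= 0
        if any(v >= upper for v in spectrum):
            continue  # step (3): values >= 10,000
        kept.append(spectrum)
    return kept


# Hypothetical 3 x 3 scene with two bands per pixel.
candidates = [
    (0, 1, [500, 900]),    # border pixel
    (1, 1, [500, 900]),    # valid
    (1, 1, [-3, 900]),     # non-positive value
    (1, 1, [500, 12000]),  # value too large
]
library = condition_library(candidates, n_rows=3, n_cols=3)
```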
3.2.2 Gain Retrieving
After building this library, the next step is to test new data from the testing
set. According to the formulas above, the endmember library should
work for any pixel. We randomly select several radiance spectra, which
are the data read directly from the sensor, and divide each selected radiance spectrum by
every library endmember reflectance spectrum, which gives one set of
ratio values per radiance spectrum. When we compare these sets with each other, we expect
the closest vectors between every two sets to be almost the same.
Because this problem involves a high-dimensional data set, a simple exhaustive comparison
cannot cope with the computational cost. We decided to use a data structure called the
K-Dimension Tree (KD-Tree). A KD-Tree is a binary tree with a root
node, a left subtree and a right subtree. The construction process is shown in Fig 5 below.
Fig 5 The process of getting the smallest distance
1. Determine a dimension
2. Calculate the median of the chosen dimension
3. Split into left-side tree and right-side tree
4. Switch the baseline dimension and repeat Step 1
5. Repeat Steps 1-4 until all the points are in the tree
Following this flow, we build the tree. The next step is to search for the closest vectors
between the two sets.
The input is the tree, already built from one of the ratio sets. The output is the
nearest vector for each vector in the other set.
The search has several steps. First, we find the leaf node of the KD-Tree that contains the
target vector X.
To do so, we search down recursively from the root node: if the value of the target vector X
in the node's split dimension is smaller than the value of the split node, we move to the
left child node; otherwise, we move to the right child node. We repeat this until we reach a leaf node.
Next, we temporarily regard this leaf as the closest vector and start backtracking.
Backtracking checks whether any other vector is closer to the target vector
X, that is, whether a branch that has not yet been searched
contains a vector closer than the current one.
If the distance between X and an unvisited branch under the parent node is less than the
current smallest distance, a closer vector may exist there, so we enter that branch
and perform the same search process as in the first step. If we find a closer vector, we
update the current “nearest vector” and the smallest distance.
Backtracking proceeds from the bottom up and ends at the root node, when no branch can be
closer to X.
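The build and search procedures described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this work: the names `Node`, `build` and `nearest` and the two-dimensional toy points are assumptions, whereas the real ratio vectors have one dimension per retained channel.

```python
import math


class Node:
    def __init__(self, point, index, axis, left=None, right=None):
        self.point, self.index, self.axis = point, index, axis
        self.left, self.right = left, right


def build(points, indices=None, depth=0):
    """Split on the median of one dimension, cycling through dimensions."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return Node(
        points[indices[mid]], indices[mid], axis,
        build(points, indices[:mid], depth + 1),
        build(points, indices[mid + 1:], depth + 1),
    )


def nearest(node, target, best=None):
    """Return (distance, index) of the stored point closest to target."""
    if node is None:
        return best
    d = math.dist(node.point, target)
    if best is None or d < best[0]:
        best = (d, node.index)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)  # descend toward the leaf first
    if abs(diff) < best[0]:             # backtrack: other side may be closer
        best = nearest(far, target, best)
    return best


# Hypothetical two-dimensional points standing in for ratio vectors.
points = [[2.0, 3.0], [5.0, 4.0], [9.0, 6.0], [4.0, 7.0], [8.0, 1.0], [7.0, 2.0]]
tree = build(points)
distance, index = nearest(tree, [9.0, 2.0])
```

The far branch is visited only when the splitting plane lies closer to the target than the best distance found so far, which is the backtracking test described in the text.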
3.2.3 Gain Retrieving Example
We randomly select one test data cube from the testing group. The data information is
listed below. The site is Rosemount, Minnesota. We run SMACC on the radiance
data and get 50 endmembers. We use the 3rd, 4th, 5th, 6th, 7th and 8th endmembers as the
input test data. Fig 6 shows the test radiance data.
Fig 6 Six radiance data to test the algorithm
For these selected test data we do not use all 224 bands of the original
cube. Some bands are affected by atmospheric absorption, and taking the ratio there gives
infinite values, which are meaningless. So we remove 74 bands from
the original wavelength set before the analysis, covering the wavelength ranges
1343-1453 nm, 1731-2069 nm and 2377-2497 nm.
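The band removal can be sketched as a wavelength mask. The range limits come from the text; treating the end points as inclusive, the helper names and the sample band centres are assumptions made for illustration.

```python
# Wavelength ranges (nm) removed because of atmospheric absorption.
EXCLUDED_RANGES_NM = [(1343, 1453), (1731, 2069), (2377, 2497)]


def keep_band(wavelength_nm):
    """True if the band lies outside every excluded range (bounds inclusive)."""
    return not any(lo <= wavelength_nm <= hi for lo, hi in EXCLUDED_RANGES_NM)


def drop_absorption_bands(wavelengths_nm, spectrum):
    """Return the (wavelength, value) pairs of the bands that are kept."""
    return [(w, v) for w, v in zip(wavelengths_nm, spectrum) if keep_band(w)]


# Hypothetical band centres and radiance values.
wavelengths = [365.9, 1300.0, 1400.0, 1900.0, 2200.0, 2400.0]
spectrum = [10.0, 8.0, 0.0, 0.0, 3.0, 0.5]
kept = drop_absorption_bands(wavelengths, spectrum)
```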
Because this is new data, we do not know which library
endmember corresponds to each radiance. According to the equation
$$B = \frac{L_{obs}^{(1)} - \mathrm{Offset}}{\rho_{sur}^{(i_1)}} = \frac{L_{obs}^{(2)} - \mathrm{Offset}}{\rho_{sur}^{(i_2)}} = \frac{L_{obs}^{(3)} - \mathrm{Offset}}{\rho_{sur}^{(i_3)}} \qquad (13)$$
where each $i_k \in \{1, 2, 3, \ldots\}$ indexes an endmember in the endmember library,
we know that the goal is to find, for each pair of test radiances, the two ratio vectors
(each formed with a library endmember) that have the smallest Euclidean distance. So we
compute the ratios against the library for all six test data and get six frames of ratio
values. Then we calculate the distance for each of the 15 pairs: test1 and test2, test1 and
test3, test1 and test4, test1 and test5, test1 and test6, test2 and test3, test2 and test4,
test2 and test5, test2 and test6, test3 and test4, test3 and test5, test3 and test6, test4
and test5, test4 and test6, test5 and test6.
For each pair, we get the smallest distance. For example, the smallest distance
between test1 and test2 is mindis_12 = 4136.6244652. The index of this value
gives us two pieces of information.
First, it tells us the corresponding reflectance endmember for each
test radiance. According to the analysis above, whichever
radiance we take, dividing it by its matching endmember should give the same ratio.
So the index of the smallest difference between ratios identifies both the endmember
corresponding to the test1 radiance and the endmember corresponding to the test2 radiance.
The row index of the distance value is the index of the endmember in the library that
corresponds to test1, and the column index is the index of the endmember in the library
that corresponds to test2.
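The pairwise comparison and the meaning of the row and column indices can be sketched as follows. This is a brute-force illustration under stated assumptions: the radiances are taken as already offset-corrected, the names `ratio_frame` and `closest_pair` are hypothetical, and the toy library and B values are made up so that the correct match is known in advance.

```python
import itertools
import math


def ratio_frame(radiance, library):
    """One ratio spectrum per library endmember: radiance / reflectance."""
    return [[rad / ref for rad, ref in zip(radiance, rho)] for rho in library]


def closest_pair(frame_a, frame_b):
    """Return (distance, i, j) minimising the Euclidean distance between
    row i of frame_a and row j of frame_b."""
    return min(
        (math.dist(a, b), i, j)
        for i, a in enumerate(frame_a)
        for j, b in enumerate(frame_b)
    )


# Hypothetical library (3 endmembers x 2 bands) and a common B per band.
library = [[0.1, 0.4], [0.2, 0.2], [0.5, 0.1]]
B = [100.0, 50.0]
# Offset-corrected test radiances generated from endmembers 0 and 2.
tests = {
    "test1": [b * r for b, r in zip(B, library[0])],
    "test2": [b * r for b, r in zip(B, library[2])],
}
frames = {name: ratio_frame(rad, library) for name, rad in tests.items()}
results = {
    (a, b): closest_pair(frames[a], frames[b])
    for a, b in itertools.combinations(sorted(frames), 2)
}
```

In this toy case the closest pair recovers library rows 0 and 2, and the matching ratio vector equals B, whose reciprocal is the Gain. With six test spectra, `itertools.combinations` would produce the 15 pairs listed in the text.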
Second, it gives what we are after, the Gain value. According to the
formula and the explanation above, the two ratio vectors that give the smallest distance
between two test data should be close. Carrying out the calculation gives the following
indices for all the test data we use. The results are shown in Table 1.
Pair Name          Index (1)      Index (2)
Test1 and Test2    Test1 = 220    Test2 = 352
Test1 and Test3    Test1 = 220    Test3 = 371
Test1 and Test4    Test1 = 243    Test4 = 340
Test1 and Test5    Test1 = 238    Test5 = 269
Test1 and Test6    Test1 = 220    Test6 = 361
Test2 and Test3    Test2 = 266    Test3 = 252
Test2 and Test4    Test2 = 265    Test4 = 175
Test2 and Test5    Test2 = 353    Test5 = 362
Test2 and Test6    Test2 = 64     Test6 = 286
Test3 and Test4    Test3 = 79     Test4 = 259
Test3 and Test5    Test3 = 97     Test5 = 269
Test3 and Test6    Test3 = 288    Test6 = 234
Test4 and Test5    Test4 = 222    Test5 = 284
Test4 and Test6    Test4 = 310    Test6 = 352
Test5 and Test6    Test5 = 164    Test6 = 353
Table 1 the result of comparing the distance. The first column is the pair that calculated the distance;
the second column is the index of one of the test data in the pair; the third column is the other test data’s index.
From the indices we can clearly see that the pairs test1-test5 and
test3-test5 give the same index for test5 (269), and the pairs
test1-test2, test1-test3 and test1-test6 give the same index for test1 (220).
This means that the reflectance data for these pairs can be found in the endmember library
we built, or at least a very close match can. We therefore focus on these
distances, which involve six data sets. Based on the table above, we take
the six data sets from the corresponding ratio sets and plot them in Fig 7 below. (The
curves in the plot are the reciprocals of the values obtained directly from the data sets.)
Fig 7 Six Gain curves picked from two pairs from the test data. The blue one is derived from test 1; the
orange one is derived from test 3; the green one is derived from test 6.
Although these distances seem large, the original magnitude of the data is
large and the number of channels is 150. The six curves almost
overlap each other. The main differences between the curves visible in the
plot are at the two peak values and in the channels between 2268 nm and 2367 nm. To
make the Gain more general, we take the average, shown in Fig 8.
Fig 8 The average Gain curve of the six derived Gain curves.
With this average Gain curve, we compare against the original Gain curve
derived from QUAC to see how close the two are. Plotting both in
the same figure gives Fig 9 below.
Fig 9 The orange one is the average Gain curve; the blue one is the original QUAC gain curve.
The figure shows that the two curves are close, especially in the channels
between 365 nm and 1343 nm. Considering that the three Gain values used to compute the
average are not exactly the same (the distance between them is not 0),
this result is encouraging. Measured as a percentage of the original curve, the difference
between the average curve and the original curve is ±19.60%. Comparing each
Gain with the original Gain gives the following six plots.
Fig 10 Test 1 Gain compared with the original curve.
Fig 11 Test 5 Gain compared with the original curve.
Fig 12 Test 2 Gain compared with the original curve.
Fig 13 Test 3 Gain compared with the original curve.
Fig 14 Test 6 Gain compared with the original curve.
Fig 15 Test 3 Gain compared with the original curve.
For the test 1 Gain, the difference is ±13.51% and is
evenly distributed over all channels. For the test 5 Gain, the difference is ±36.18% and
is concentrated in the channels between 2268 nm and 2367 nm and at the two
peak values. For the test 2 Gain, the difference is ±17.44% and is evenly
distributed over all channels. For the test 3 Gain in Fig 13, the difference is ±12.61% and is
evenly distributed over all channels. For the test 6 Gain, the difference is ±14.76% and
is concentrated in the channels between 2268 nm and 2367 nm. For the test 3
Gain in Fig 15, the difference is ±29.01% and is concentrated in the
channels between 2268 nm and 2367 nm and at the two peak values. With the average Gain in
hand, we apply it to a radiance spectrum that was not selected and compare the result with the
corresponding reflectance from the original QUAC. The result is shown in Fig 16. The
two curves follow the same trend and the difference between them is small.
Fig 16 The blue line is the actual QUAC reflectance; the orange line is the revised QUAC reflectance,
which is derived using the new average Gain value.
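The averaging and application steps described above can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes a linear per-channel model, reflectance = gain * (radiance - offset), with the offset defaulting to zero, and all function names and numbers are hypothetical.

```python
import numpy as np

def average_gain(gains):
    """Average several per-channel gain curves (one curve per row)."""
    return np.asarray(gains, dtype=float).mean(axis=0)

def apply_gain(radiance, gain, offset=0.0):
    """Convert radiance to reflectance with a linear gain/offset model.

    Assumption: reflectance = gain * (radiance - offset), per channel.
    """
    radiance = np.asarray(radiance, dtype=float)
    return gain * (radiance - offset)

# Toy data: three test gain curves over four channels (values are made up).
test_gains = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.18, 0.33, 0.38],
    [0.08, 0.22, 0.27, 0.42],
])
mean_gain = average_gain(test_gains)

# Held-out radiance spectra (pixels x channels), also made up.
held_out_radiance = np.array([[100.0, 200.0, 300.0, 400.0]])
revised_reflectance = apply_gain(held_out_radiance, mean_gain)
```

The revised reflectance can then be plotted against the reflectance produced by the original QUAC for the same pixels.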
The results discussed so far are for a single data cube, from Rosemount, Minnesota. Because
we have several test data cubes, we also run the algorithm on all the remaining cubes. For each
cube, we check whether the difference between the derived Gain and the original Gain, divided
by the original Gain, is in a reasonable range, and we plot the two curves together with the
percentage value. Fig 17 shows the data from Grayling Fire in Maine. After choosing the same
endmember radiance data and running the algorithm, we obtain the two Gain curves. The main
difference is distributed mostly among the 150 channels and the two peak values. The
percentage distance relative to the original Gain is 0.576.
Fig 17 The left picture is the actual in-scene image; the right one compares the actual QUAC Gain with the retrieved Gain.
Fig 18 shows the data from Eastern Peatland in Ontario. After choosing the same endmember
radiance data and running the algorithm, we obtain the two Gain curves. These two Gain curves
differ more than in any of the previous results: the gap is large both at the peak values and
in the other channels. The percentage distance relative to the original Gain is 1.345.
Fig 18 The left picture is the actual in-scene image; the right one compares the actual QUAC Gain with the retrieved Gain.
Fig 19 shows the data from Oak Ridge in TNE/W. After choosing the same endmember radiance
data and running the algorithm, we obtain the two Gain curves. These two Gain curves also
differ considerably, but here the main difference is concentrated around the two peak values,
and the values in the remaining channels are close. The percentage distance relative to the
original Gain is 1.154.
Fig 19 The left picture is the actual in-scene image; the right one compares the actual
QUAC Gain with the retrieved Gain.
Fig 20 also shows data from Oak Ridge in TNE/W. After choosing the same endmember radiance
data and running the algorithm, we obtain the two Gain curves. These two Gain curves also
differ considerably, and again the main difference is concentrated around the two peak
values. The percentage distance relative to the original Gain is 1.105.
Fig 20 The left picture is the actual in-scene image; the right one compares the actual
QUAC Gain with the retrieved Gain.
Fig 21 shows data from Rosemount, Minnesota. After choosing the same endmember radiance data
and running the algorithm, we obtain two Gain curves that match quite well. The percentage
distance relative to the original Gain is 0.3238.
Fig 21 The left picture is the actual in-scene image; the right one compares the actual
QUAC Gain with the retrieved Gain.
Fig 22 also shows data from Rosemount, Minnesota. After choosing the same endmember radiance
data and running the algorithm, we obtain the two Gain curves. The percentage distance
relative to the original Gain is 0.4261.
Fig 22 The left picture is the actual in-scene image; the right one compares the actual
QUAC Gain with the retrieved Gain.
3.2.4 Results Discussion
Each Gain calculated by our algorithm shows reasonably good accuracy. We now need to
understand where the remaining difference, or error, comes from. Fig 23 shows the error
ratio between the retrieved gain curve and the original gain curve.
Fig 23 The error of each test data cube
As mentioned before, the value is calculated by subtracting the original gain from the
retrieved gain and dividing by the original gain. The figure gives the location of each cube
together with the value that measures the difference between the retrieved gain and the
original gain.
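The error measure just described can be sketched as below. The per-channel formula follows the text. The collapse to a single number (mean absolute value) is an assumption made only for illustration, since the text does not say which aggregate produces the reported figures, and the function names and gain values are hypothetical.

```python
import numpy as np

def relative_gain_error(retrieved, original):
    """Per-channel (retrieved - original) / original."""
    retrieved = np.asarray(retrieved, dtype=float)
    original = np.asarray(original, dtype=float)
    return (retrieved - original) / original

def summarize_error(rel_error):
    """Collapse the per-channel error to one number.

    Assumption: mean absolute value; the aggregate used for the
    reported figures is not stated in the text.
    """
    return float(np.mean(np.abs(rel_error)))

# Made-up gain curves over three channels.
original_gain = np.array([1.0, 2.0, 4.0])
retrieved_gain = np.array([1.1, 1.8, 4.0])

per_channel = relative_gain_error(retrieved_gain, original_gain)
score = summarize_error(per_channel)
```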
We believe that several factors cause the fluctuation in Fig 23. First, the library is not
large enough to cover all the endmembers.
We use two completely different sets of AVIRIS data, one to build the library and one to run
the algorithm and make the comparison. The data used to build the library may not contain
every material present in the testing data.
This is why the comparison always shows some difference between the two Gain curves.
To check whether the difference arises mainly because too few endmembers were used to build
the library, we randomly pick a data cube that was used to build the library and validate the
algorithm on it. The Gain curve derived from this training data cube is expected to be
exactly the same as the original QUAC curve.
We choose the data from Sleeping Bear, Maine, downloaded from AVIRIS, which was also used to
build the library, and process it in the same way as the data from Rosemount, Minnesota. We
randomly select 6 test radiance spectra from the Sleeping Bear data cube and then perform the
distance calculation to find the closest library entry for each. In this case, the distance
between each test radiance and its closest match should be 0.
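The closest-match lookup used in this check can be illustrated with a short sketch. It assumes Euclidean distance over all channels, which is one plausible choice and not necessarily the metric used in this work. The function name and the library values are hypothetical.

```python
import numpy as np

def closest_library_entry(test_spectrum, library):
    """Return (index, distance) of the library row nearest to test_spectrum.

    Assumption: Euclidean distance over all channels.
    """
    library = np.asarray(library, dtype=float)
    diffs = library - np.asarray(test_spectrum, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    idx = int(np.argmin(distances))
    return idx, float(distances[idx])

# Made-up library of three radiance spectra over four channels.
library_radiance = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 41.0],
    [50.0, 60.0, 70.0, 80.0],
])

# A spectrum taken from a cube that was used to build the library is
# itself a library entry, so its nearest distance is exactly 0.
idx, dist = closest_library_entry(library_radiance[1], library_radiance)
```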
Fig 24 The average value of 20 Gain compared with the original QUAC curve.
The resulting plot is shown in Fig 24. The blue line is the original Gain curve and the
orange line is the Gain calculated by the algorithm. The two curves overlap completely.
Measured numerically, that is, by subtracting the original Gain from the derived Gain and
dividing by the original Gain, the percentage difference is 0%.
We apply the new Gain to the radiance data that were not selected and compare the result
with the corresponding reflectance data to see how close the two are. The result is shown in
Fig 25.
Fig 25 The blue line is the actual reflectance; the orange line is the revised one.
The two lines overlap completely. This result shows that the algorithm works well if we
have enough data and revise the endmember library accordingly.
Second, the information contained in our selected testing data is not adequate. Fig 17 to
Fig 22 show the test cubes that we have. Only one gain curve differs strongly from the
original QUAC Gain across all channels. In the image for this special case, most of the
objects on the ground are vegetation. This may be why the retrieved gain is far from the
original one.
Apart from this all-vegetation image, the other images contain a variety of ground objects.
The endmembers used to build the library do not contain that much vegetation, so we could not
find a good match for this particular testing data cube. For this reason, when selecting a
test data cube, we have to avoid images that contain only one kind of ground object.
3.2.5 Future Improvement
Based on the main causes of error identified above, we propose several future improvements.
The first is to increase the number of data cubes used to build the endmember library. Once
we have enough data sets to build a more general library, we do not have to use every
endmember from every data cube. We could simplify the library by clustering it and using the
mean value of each cluster as a new endmember. This simplification would make the algorithm
more general.
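One way to sketch this library simplification is with K-Means, whose cluster centers are the mean spectra of the clusters. The choice of K-Means, the number of clusters, the `n_init` value and the synthetic library below are all assumptions for illustration, and `simplify_library` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans

def simplify_library(endmembers, n_clusters, random_state=0):
    """Replace a large endmember library by its K-Means cluster means."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    km.fit(endmembers)
    # cluster_centers_ holds the mean spectrum of each cluster.
    return km.cluster_centers_

rng = np.random.default_rng(0)
# Synthetic library: 200 endmember spectra over 8 channels in 4 loose groups.
centers = rng.uniform(0.0, 1.0, size=(4, 8))
library = np.vstack([c + 0.01 * rng.standard_normal((50, 8)) for c in centers])

reduced = simplify_library(library, n_clusters=4)
```

The reduced library has one row per cluster, so later distance searches scan far fewer entries.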
So far we have only compared our results with QUAC. Because QUAC itself differs from FLAASH,
we could also compare them with other methods such as FLAASH and ELM to assess the accuracy
of the result and to see how our algorithm performs relative to FLAASH.
We could also collect more data from different kinds of sensors. Our work so far focuses on
one specific sensor, and we would like to know whether the algorithm works for other sensors.
4. UNSUPERVISED MACHINE LEARNING METHOD
The algorithm discussed in the previous chapters focuses on the endmembers of the data cube,
which means that preprocessing is needed before the algorithm can be applied. The method
discussed in this chapter instead uses all the pixels of the data cube.
4.1 Algorithm Description
4.1.1 K-Means Clustering
When we receive a new data cube, the data obtained directly from the sensor are radiance.
Besides the radiance data, we also have the reflectance data for the corresponding pixels.
The very first step is to clean the data. In the radiance data, we drop every pixel that has
negative values, and we drop the same pixels from the reflectance data. In the reflectance
data, we drop every pixel that has negative values or values larger than 10000, and we drop
these bad pixels from the radiance data as well.
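The cleaning step can be sketched with a boolean mask. The sketch assumes the cube has been flattened to arrays of shape (pixels, channels) and that a pixel is bad if any of its channels is out of range. The function name and the toy values are hypothetical.

```python
import numpy as np

def drop_bad_pixels(radiance, reflectance, max_reflectance=10000):
    """Keep only pixels that are valid in both arrays (rows = pixels)."""
    radiance = np.asarray(radiance)
    reflectance = np.asarray(reflectance)
    ok_radiance = (radiance >= 0).all(axis=1)
    ok_reflectance = ((reflectance >= 0) &
                      (reflectance <= max_reflectance)).all(axis=1)
    keep = ok_radiance & ok_reflectance
    return radiance[keep], reflectance[keep], keep

# Five made-up pixels with two channels each.
radiance = np.array([[1, 2], [-1, 3], [4, 5], [6, 7], [8, 9]])
reflectance = np.array([[100, 200], [300, 400], [-5, 600],
                        [700, 20000], [800, 900]])

clean_radiance, clean_reflectance, keep = drop_bad_pixels(radiance, reflectance)
```

Applying one shared mask keeps the radiance and reflectance rows aligned pixel by pixel.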
After this preprocessing, we focus on the radiance data first. We apply K-Means clustering
to the radiance, using the implementation in the scikit-learn package. For example, for the
test cube we set the number of clusters to 50 and leave the other parameters at their
defaults, which are max_iter = 300, tol = 0.0001, precompute_distances = 'auto', verbose = 0,
random_state = None.
After the clustering succeeds, each cluster has a label from 0 to 49. We then arrange the
clusters in descending order of the number of points in each cluster.
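A minimal sketch of the clustering and ordering step follows. It uses scikit-learn's `KMeans` with the `max_iter` and `tol` values quoted above. `precompute_distances` is omitted because it has been removed from recent scikit-learn releases, and the explicit `n_init=10` is an assumption. The helper name and the synthetic pixels (three clusters instead of 50) are for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_rank(radiance, n_clusters=50, random_state=None):
    """Cluster pixels with K-Means and rank clusters by size, largest first."""
    km = KMeans(n_clusters=n_clusters, max_iter=300, tol=1e-4,
                n_init=10, random_state=random_state)
    labels = km.fit_predict(radiance)
    counts = np.bincount(labels, minlength=n_clusters)
    order = np.argsort(-counts, kind="stable")  # cluster ids, descending size
    return labels, counts, order

rng = np.random.default_rng(1)
# Synthetic "radiance": three well-separated groups of 60, 30 and 10 pixels.
pixels = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 5)),
    rng.normal(50.0, 1.0, size=(30, 5)),
    rng.normal(100.0, 1.0, size=(10, 5)),
])
labels, counts, order = cluster_and_rank(pixels, n_clusters=3, random_state=0)
```

Here `order[0]` is the id of the largest cluster and `counts[order]` lists the cluster sizes from largest to smallest.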
4.1.2 Select the Sample Set
As described above, the clusters are already ordered by the number of points in each cluster.
We need to randomly pick some samples from each cluster. Since the number of points differs
from cluster to cluster, we randomly choose 25 points from each cluster and combine them with
the corresponding cluster labels to form a training set.
At the same time, we extract the corresponding reflectance data for each pixel whose
radiance data is used in the sample set. We average these reflectance data within each
cluster and tie the mean reflectance to the corresponding label 0-49.
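The sampling and averaging described in this section might look like the sketch below. Taking all points when a cluster has fewer than 25 members is an assumption, since the text does not cover that case. The function name, the seed and the synthetic data (four clusters instead of 50) are hypothetical.

```python
import numpy as np

def build_sample_set(radiance, reflectance, labels, per_cluster=25, seed=0):
    """Draw up to per_cluster random pixels from every cluster.

    Returns the sampled radiance, their cluster labels, and an array of
    shape (n_clusters, n_channels) holding the mean reflectance of the
    sampled pixels of each cluster.
    """
    rng = np.random.default_rng(seed)
    n_clusters = int(labels.max()) + 1
    X, y = [], []
    mean_reflectance = np.full((n_clusters, reflectance.shape[1]), np.nan)
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            continue  # empty cluster: leave its mean as NaN
        take = min(per_cluster, members.size)  # assumption for small clusters
        chosen = rng.choice(members, size=take, replace=False)
        X.append(radiance[chosen])
        y.append(np.full(take, k))
        mean_reflectance[k] = reflectance[chosen].mean(axis=0)
    return np.vstack(X), np.concatenate(y), mean_reflectance

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), [100, 60, 40, 10])  # made-up cluster labels
radiance = rng.uniform(0, 1, size=(labels.size, 6))
reflectance = rng.uniform(0, 10000, size=(labels.size, 6))

X, y, mean_refl = build_sample_set(radiance, reflectance, labels)
```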
4.2 Build the Random Forest Model
Once we have our training set, we can build a Random Forest model using these data. We split
the data into training and testing data and set the binary labels as the target value. The
hyperparameters of the random forest are listed below: bootstrap = True, n_estimators = 30,
criterion = 'entropy', random_state = 42.
We use 75% of the data in the sample set to train the random forest model and the remaining
25% as the test set to test the model. The results of the model are displayed in Fig 26. The
overall accuracy is 0.8856.
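The training setup above can be sketched with scikit-learn as follows. The hyperparameters are the ones listed in the text. The synthetic sample set, the use of `train_test_split` and its `random_state` are assumptions for illustration, so the accuracy obtained here says nothing about the value reported for the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the sample set: 5 clusters, 25 spectra each, 6 channels.
n_clusters, per_cluster, n_channels = 5, 25, 6
y = np.repeat(np.arange(n_clusters), per_cluster)
X = 10.0 * y[:, None] + rng.normal(0.0, 1.0, size=(y.size, n_channels))

# 75% of the sample set for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=30, criterion="entropy",
                               bootstrap=True, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # overall accuracy on the test split
```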
After running the model, as Fig 26 shows, we obtain the accuracy for each predicted label.
We then find the corresponding reflectance data for each test sample and compare them with
the label reflectance data obtained in the previous steps.
…
Fig 26 The table shows the accuracy for each label.
4.3 Test the model using remaining data from the same cube used to build the model
Once we have the model, we would like to test it with different data. For example, we test
the remaining points in label 10.
The result is shown in Fig 27. The orange line is the original reflectance of the input
radiance data. The blue line is the predicted reflectance derived with the random forest
model.
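One reading of how a predicted reflectance is derived is that the model predicts a cluster label for the input radiance and the mean reflectance tied to that label is returned. The sketch below follows that reading as an assumption. `predict_reflectance` is a hypothetical helper and all data are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def predict_reflectance(model, radiance, mean_reflectance):
    """Map each radiance spectrum to the mean reflectance of its predicted cluster."""
    predicted_labels = model.predict(radiance)
    return mean_reflectance[predicted_labels]

rng = np.random.default_rng(0)
# Made-up training data: 3 clusters, 20 radiance spectra each, 4 channels.
y = np.repeat(np.arange(3), 20)
X = 10.0 * y[:, None] + rng.normal(0.0, 1.0, size=(y.size, 4))
# Made-up mean reflectance spectrum tied to each cluster label (row = label).
mean_reflectance = np.array([
    [100.0, 200.0, 300.0, 400.0],
    [1000.0, 1100.0, 1200.0, 1300.0],
    [5000.0, 5100.0, 5200.0, 5300.0],
])

model = RandomForestClassifier(n_estimators=30, criterion="entropy",
                               random_state=42).fit(X, y)
new_radiance = np.array([[0.2, -0.1, 0.3, 0.0], [20.1, 19.8, 20.3, 20.0]])
predicted = predict_reflectance(model, new_radiance, mean_reflectance)
```

Under this reading, every pixel assigned to a cluster receives the same predicted spectrum, so a cluster that spans a wide range of spectra will yield a coarse prediction.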
We can also inspect all the data chosen as the training set for the random forest model.
Fig 28 plots all the radiance labeled as 10 that we chose. Because the number of clusters we
chose is small compared with the amount of raw data, we can clearly see the wide range of
values within the cluster.
Fig 27 The predicted reflectance vs the actual reflectance.
Fig 28 All the reflectance data in label 10.
4.4 Test the model using new data cube
The information in Fig 27 and Fig 28 comes from the same label in the same data cube that
was used to build the model. We want to test whether the model can be applied to a new data
cube.
We choose the data cube from Rosemount, Minnesota, downloaded from AVIRIS. After obtaining
its radiance data, we randomly pick 3 radiance spectra from the whole 119660. The results are
shown in the three plots in Fig 29 to Fig 31.
Each plot shows the actual reflectance corresponding to the chosen radiance and the predicted
reflectance obtained from the random forest model.
Fig 29 The blue line is derived from RF; the orange line is the raw reflectance.
Fig 30 The blue line is derived from RF; the orange line is the raw reflectance.
Fig 31 The blue line is derived from RF; the orange line is the raw reflectance.
When the new radiance data are fed to the random forest model, the results shown in the
figures are reasonable.
4.5 Results Discussion
Sections 4.3 and 4.4 present several results for the unsupervised machine learning method.
Some results are good, but some are not. After further thought about what could lead to
these results, we list several reasons.
The first reason is the number of clusters. The cluster number of 50 may not be the best
choice for this case. As we can see in Fig 28, label 10 contains many radiance spectra and
the range of the radiance data is wide. As a consequence, the result in Fig 27 from section
4.3 shows a big gap between the predicted reflectance and the actual raw data.
The second reason is that we do not have enough variety of radiance from different ground
objects to train the random forest model. In the results from section 4.4, some of the data
from the new cube fit the model well, as in Fig 29 and Fig 31. But one sample performs badly
when plugged into the model: in Fig 30 the predicted curve and the actual curve have
completely different shapes.
Based on the main reasons identified above, future studies can make improvements and check
whether the results improve.
REFERENCES
Zhang Bing (2017). Current Status and Future Prospects of Remote Sensing[J]. Bulletin of
Chinese Academy of Sciences, 32(7), 774-784.
Bernstein, L. S., Jin, X., Gregor, B., & Adler-Golden, S. M. (2012). Quick atmospheric
correction code: algorithm description and recent upgrades. Optical
engineering, 51(11), 111719.
Gao, B. C., Heidebrecht, K. B., & Goetz, A. F. (1993). Derivation of scaled surface
reflectances from AVIRIS data. Remote sensing of Environment, 44(2-3), 165-178.
Perkins, T., Adler-Golden, S. M., Matthew, M. W., Berk, A., Bernstein, L. S., Lee, J., &
Fox, M. (2012). Speed and accuracy improvements in FLAASH atmospheric
correction of hyperspectral imagery. Optical Engineering, 51(11), 111707.
Roberts, D. (1985). Calibration of airborne imaging spectrometer data to percent
reflectance using field spectral measurements. In 19. International Symposium on
Remote Sensing of Environment (pp. 679-688).
Kruse, F. A., Kierein-Young, K. S., & Boardman, J. W. (1990). Mineral mapping at Cuprite,
Nevada with a 63-channel imaging spectrometer. Photogrammetric Engineering
and Remote Sensing, 56(1), 83-92.
Kruse, F. A. (1988). Use of airborne imaging spectrometer data to map minerals associated
with hydrothermally altered rocks in the northern grapevine mountains, Nevada,
and California. Remote Sensing of Environment, 24(1), 31-51.
Roberts, D. A., Yamaguchi, Y., & Lyon, R. J. P. (1986). Comparison of various techniques
for calibration of AIS data. NASA STI/Recon Technical Report N, 87, 21-30.
Bernstein, L. S., Adler-Golden, S. M., Sundberg, R. L., Levine, R. Y., Perkins, T. C., Berk,
A., ... & Hoke, M. L. (2005). A new method for atmospheric correction and
aerosol optical property retrieval for VIS-SWIR multi-and hyperspectral imaging
sensors: QUAC (QUick atmospheric correction). SPECTRAL SCIENCES INC
BURLINGTON MA.
Bernstein, L. S., Adler-Golden, S. M., Perkins, T. C., Berk, A., & Levine, R. Y. (2005). U.S.
Patent No. 6,909,815. Washington, DC: U.S. Patent and Trademark Office.
Bernstein, L. S., Adler-Golden, S. M., Perkins, T. C., Berk, A., & Levine, R. Y. (2006). U.S.
Patent No. 7,046,859. Washington, DC: U.S. Patent and Trademark Office.
Bernstein, L. S., Adler-Golden, S. M., Sundberg, R. L., & Ratkowski, A. J. (2008, October).
In-scene-based atmospheric correction of uncalibrated VISible-SWIR (VIS-SWIR)
hyper-and multi-spectral imagery. In Remote Sensing of Clouds and the
Atmosphere XIII (Vol. 7107, p. 710706). International Society for Optics and
Photonics.