
Stereo Vision with Texture Learning for Fault-tolerant Automatic Baling

Morten Rufus Blas (a,b,*), Mogens Blanke (a,c)

(a) Technical University of Denmark, Department of Electrical Engineering, Automation and Control Group, Elektrovej build. 326, DK-2800 Kgs. Lyngby, Denmark
(b) CLAAS Agrosystems, Bøgeskovvej 6, 3490 Kvistgaard, Denmark
(c) CeSOS, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway

Abstract

This paper presents advances in automated baling using stereo vision. A robust classification scheme is developed for learning and classifying based on texture and shape. Using a state-of-the-art texton approach, a fast classifier is suggested that can handle non-linearities and artifacts in data. Shape information is employed to make the classifier robust to large variations in lighting conditions and to greatly reduce the likelihood that artifacts in signals from the stereo vision system lead to gross errors in estimated object positions. The classifier is tested on data from a stereo-vision guidance system on a tractor. The system is shown to be able to classify cut plant material (called swath) by learning its appearance. A 3D classifier is successfully used to train the texture classifier. It is demonstrated from field tests how fault-tolerant fusion of steering reference data is obtained for an automated baling vehicle.

*Corresponding author. Email addresses: rufus.blas@claas.com (Morten Rufus Blas), mb@elektro.dtu.dk (Mogens Blanke)

Preprint submitted to Computers and Electronics in Agriculture October 16, 2010

Keywords: texture classification, field navigation, fault-tolerance, stereo vision, robotics

1. Introduction

Dependable navigation in outdoor, unstructured environments is crucial if autonomous machines shall become a reality in agriculture. Blind GPS information is not sufficient due to the presence of non-mapped objects, uncertainties and limited GPS availability. When a machine shall make navigation decisions without human interference, an ability to recognise specific structures in the environment is needed. Using sensors suited for outdoor use, typically stereo cameras and laser range finders, the algorithms that process their data need to be robust against artifacts in signals and natural variations of the environment.

Recent efforts towards following field structures, without relying on GPS, were presented in [1], [2] and [3]. General findings were that texture methods had limited success, due to large variations in conditions, and colour methods were restricted to mainly identifying green plants [4]. A mapping approach using visual odometry was described in [5], showing advantages over GPS in terms of robustness and accuracy over short distances. In robotics, recent uses of classification based on colour and texture include [6], where terrain for a Mars rover was classified. A combination of visual and geometric features was used in [7] for outdoor classification, where texture was used in a bag-of-words approach and off-line training was needed. A very large set of filter banks was needed to handle all variation. The approach described in [8] was more efficient and suggested a texture representation that is also used in this paper. It has the advantage that the texture can be learnt online, which allows the classifier to be optimised for individual scenes without the need for large filter banks.

Combining a 3D classifier and a texture classifier could be a solution to the general problem of following structures in the field, for example, to plow, seed, spray or harvest. Focusing on baling as a specific harvesting task, this requires following rows of cut straw or grass (swath) in order to pick the material up and process it into bales. This is a labour intensive and repetitive task which is of interest to automate. The difficulties pertaining to automating this task are similar to the difficulties in automating a large range of other agricultural tasks. The ability to track such structures using 3D shape information from a stereo camera and/or GPS information was previously demonstrated in [9], but issues remained with respect to robustness.

This paper considers how existing 3D classification methods in agriculture could be made more robust and tolerant to faults. A combination of texture classification, mapping, and supervision is suggested to achieve this goal. A novel texton-based classifier is presented that uses online methods to learn texture information about the swath and the surroundings. It is then analysed how this information could be integrated in a mapping system to keep track of measured swath positions, and an implementation is presented where the map is used to guide a vehicle along a swath by steering the tractor's front wheels while a driver controls the throttle and brakes. Results from field tests finally demonstrate how supervision and fusion of 3D and texture information in a map were successfully applied in an automatic baling system.

2. System Overview

An automatic baling system is composed of two parts: a vision system and a control system, as described in Fig. 1. In the vision system, a left and a right image are acquired from a stereo camera. A stereo algorithm extracts 3D information from the images. This is then fed into a 3D tracking algorithm that tracks the swath field structure. A learning algorithm teaches a texture tracking algorithm to track the swath based on the 3D tracking. Supervision allows the learning process to be fault-tolerant. Mapping provides fusion of tracking information. To enable mapping, a fault-tolerant positioning module fuses and supervises visual odometry and GPS information. The output from the mapping system is fed to a track controller that in turn steers the wheels of the tractor.

2.1. Hardware

In a proof-of-concept configuration and for the present paper, image processing was done on a laptop which interfaces to the vehicle controller ECU through the tractor's CAN bus. The driver interfaces to the control system through a terminal to change settings and engages/disengages the automatic steering system through a switch. The baler has integrated pressure sensors which are used to measure the bale diameter. This information is used in the controller to assure an even filling of the bale chamber. A wheel angle sensor provides feedback about the angle of the front wheels of the tractor. Hydraulics allow actuation of the front wheels. A stereo camera is mounted in front of the tractor and a Real Time Kinematic (RTK) GPS on the roof provides position information.

3. Stereo processing

Stereo vision perceives depth using triangulation. The distance to a point is determined by the triangle between the point and where it appears in each of two images. To do this, the two images must be aligned. The process of aligning the images is part of a calibration step. Given a calibrated stereo camera, the images are then aligned by warping them. This is known as rectification. This gives two cameras with parallel optical axes and horizontal epipolar lines. A dense estimation of range is then performed at each pixel by matching along the epipolar lines [10]. This is done using a correlation window with a typical size of 10x10 pixels. The correlation window matches texture in the two images with each other. The output of the matching process is a disparity image (see Fig. 11). This gives the difference, in pixels, between the position of an object in the two camera images. The horizontal distance from the image centre to the object image is d_l for the left image and d_r for the right image. Then the disparity value d is given by:

d = d_l - d_r \qquad (1)

A pixel in the disparity image can then be projected to a 3D coordinate using triangulation.

For a stereo camera, a point in the depth map is defined by (u, v, d), which is the column, row, and disparity value in the disparity image. This can be projected to and from a 3D coordinate defined by (x, y, z) in the camera coordinate system by:

\mathbf{x}_l = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \frac{b}{d} \begin{pmatrix} u - u_0 \\ v - v_0 \\ f \end{pmatrix}, \qquad (2)

where u_0 and v_0 are the column and row coordinates of the optical centre of the image in pixels, b is the camera baseline and f is the focal length of the rectified image.
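As a minimal illustration of Eq. (2), the following Python sketch projects a pixel with a given disparity into camera coordinates. The calibration values in the example are placeholders, not those of the camera used in the paper.

```python
import numpy as np

def disparity_to_3d(u, v, d, u0, v0, f, b):
    """Project an image point (u, v) with disparity d [pixels] to camera
    coordinates (x, y, z) using Eq. (2). b is the baseline, f the focal
    length [pixels] of the rectified image, (u0, v0) the optical centre."""
    d = np.asarray(d, dtype=float)
    x = (u - u0) * b / d
    y = (v - v0) * b / d
    z = f * b / d
    return np.stack([x, y, z], axis=-1)

# Example with placeholder calibration values:
point = disparity_to_3d(u=400, v=300, d=25.0, u0=320, v0=240, f=600.0, b=0.12)
print(point)  # [x, y, z] in metres
```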

4. Height Classification

In order to identify the swath in 3D, the ground plane is first estimated (see the work by Konolige [11]). The distance from each pixel in the disparity image to the plane is then calculated (using the 3D coordinates and the point-to-plane distance). Pixels below the plane are set to zero distance. The height values are then normalised based on a median value of a subset of the largest height values. The output is a 2D image scaled from 0 to 1, with a higher value indicating a higher likelihood of pertaining to the swath (under the assumption that only the swath is higher than the ground plane). In practice this performs well even though the ground may be uneven and/or hilly. In the local area used for height classification, the ground plane approximation works well.
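A minimal sketch of this height-classification step, assuming a per-pixel 3D point map and a ground plane in Hessian normal form; the fraction of pixels used for the normalising median is an illustrative choice, not a value from the paper.

```python
import numpy as np

def height_image(points, plane_normal, plane_offset, top_fraction=0.05):
    """points: (H, W, 3) array of 3D coordinates (invalid pixels may be NaN).
    The ground plane is {x : n.x + plane_offset = 0}. Returns an (H, W) image
    in [0, 1]; larger values indicate larger height above the ground plane."""
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    h = points @ n + plane_offset          # signed point-to-plane distance
    h = np.nan_to_num(h, nan=0.0)
    h[h < 0] = 0.0                         # pixels below the plane -> zero distance
    # Normalise by the median of a subset of the largest height values.
    top = np.sort(h, axis=None)[::-1]
    k = max(1, int(top_fraction * top.size))
    scale = np.median(top[:k])
    if scale <= 0:
        return np.zeros_like(h)
    return np.clip(h / scale, 0.0, 1.0)
```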

5. Texture Classification

5.1. General Aspects

In a seminal paper, Leung and Malik [12] showed that many textures can be represented and re-created using a small number of basis vectors extracted from local descriptors; they called these basis vectors textons. While Leung and Malik used a filter bank, Varma and Zisserman [13] showed that small local texture neighbourhoods may be better than large filter banks. In addition, a small local neighbourhood vector can be much faster to compute than multichannel filtering such as Gabor filters over large neighbourhoods.

5.2. Texton Labelling

Given a colour image as input, pixel neighbourhoods in the image are grouped into one of 23 texton types (the texton image). This number was chosen as a compromise between quality and speed. These textons are learnt from the training image. This is done by first extracting a descriptor in the form of a vector from each pixel location in the image. For each location, the vector p_i is:

p_i = \begin{pmatrix} W_1 L_c \\ W_2 a_c \\ W_2 b_c \\ W_3 (L_1 - L_c) \\ \vdots \\ W_3 (L_8 - L_c) \end{pmatrix} \qquad (3)

where L_c, a_c, b_c is the colour of the pixel at this location in CIE L*a*b* colour-space, and (L_1 - L_c), ..., (L_8 - L_c) are the intensity differences between the pixel at this location and the 8 surrounding pixels in a 3 x 3 neighbourhood. The vector elements are weighted using {W_1 = 0.5, W_2 = 1, W_3 = 0.5}. A K-means algorithm is then run on all these descriptors to extract cluster centres, which we refer to as textons. The K-means algorithm finds the set of textons mu_j that partitions the descriptors into k sets S = {S_1, ..., S_k} by trying to minimize [14]:

S^* = \arg\min_{S} \sum_{j=1}^{k} \sum_{p_i \in S_j} \| p_i - \mu_j \|^2 \qquad (4)

where p_i are the descriptors in the training image(s) and S_j is the set of descriptors belonging to cluster j out of k clusters.

Each pixel location in the image is then labeled as belonging to a texton by finding the nearest texton in Euclidean space. An example of such a classification is shown in Fig. 2.
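The descriptor of Eq. (3), the K-means clustering of Eq. (4) and the nearest-texton labelling could be sketched as follows. The weights and the number of textons are as stated in the text, but the use of scikit-learn/scikit-image and the function names are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab

W1, W2, W3 = 0.5, 1.0, 0.5   # weights from the text
K_TEXTONS = 23               # number of textons used in the paper

def texton_descriptors(rgb):
    """Build the 11-dimensional descriptor of Eq. (3) for every interior pixel."""
    lab = rgb2lab(rgb)
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    H, W = L.shape
    feats = [W1 * L[1:-1, 1:-1], W2 * a[1:-1, 1:-1], W2 * b[1:-1, 1:-1]]
    # Intensity differences to the 8 neighbours in a 3x3 neighbourhood.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            feats.append(W3 * (L[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
                               - L[1:-1, 1:-1]))
    return np.stack(feats, axis=-1).reshape(-1, 11)

def learn_and_label_textons(rgb, k=K_TEXTONS):
    """Learn k textons from one training image and return the texton image."""
    desc = texton_descriptors(rgb)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(desc)
    texton_img = km.labels_.reshape(rgb.shape[0] - 2, rgb.shape[1] - 2)
    return texton_img, km.cluster_centers_   # labels and the textons themselves
```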

5.3. Texture Training

The texton image is used as the basic input to train the texture classifier. Textons were only calculated for the left image. Additional redundancy could be achieved by using both the left and right images; however, as they view the same scene they will often have similar failure modes, so this redundancy was not used in order to get faster performance. Given a training mask representing the location of the swath in an image (marking the "swath" region as well as the surrounding "not swath" region) and a texton image, the average histogram of texton occurrences in 32 x 32 image patches can be constructed for the "swath" and the "not swath" case (with 32 x 32 pixels corresponding to 1-2% of the total image size).

As is apparent from the swath images in Fig. 4, there are multiple objects with unique textures present both in the swath and in the area surrounding the swath. For example, stubble from harvested plants may be present in part of the image, or tire tracks from other machines may form separate textures. The solution presented here is to identify and model each texture independently by representing each of them with a mean histogram. This also allows handling of how texture changes with distance from the camera, because objects further away have a smaller pixel resolution. Identification of the different textures is done by taking the list of histograms for the "swath" and "not swath" cases and clustering them (similarly to what was done for the textons) independently of each other using K-means [14]. This was done using a single training image. For the results presented here, the number of clusters was 3 for each case. The reason for using K-means was simply speed. Other probabilistic learning approaches could be used, including Support Vector Machines (SVM) [15] and Gaussian Mixture Models (GMM) [16].
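A sketch of this training step: per-patch texton histograms are collected under the "swath" and "not swath" masks and each set is reduced to m = 3 mean histograms with K-means. The patch sampling step and helper names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def patch_histograms(texton_img, k_textons, patch=32, step=8):
    """Texton-occurrence histograms of patch x patch windows sampled on a grid."""
    H, W = texton_img.shape
    hists, centres = [], []
    for y in range(0, H - patch, step):
        for x in range(0, W - patch, step):
            window = texton_img[y:y + patch, x:x + patch]
            h = np.bincount(window.ravel(), minlength=k_textons).astype(float)
            hists.append(h / h.sum())
            centres.append((y + patch // 2, x + patch // 2))
    return np.array(hists), np.array(centres)

def train_texture_model(texton_img, swath_mask, k_textons, m=3):
    """Cluster the histograms of the 'swath' and 'not swath' regions separately."""
    hists, centres = patch_histograms(texton_img, k_textons)
    in_swath = swath_mask[centres[:, 0], centres[:, 1]].astype(bool)
    model = {}
    for name, sel in (("swath", in_swath), ("not_swath", ~in_swath)):
        km = KMeans(n_clusters=m, n_init=4, random_state=0).fit(hists[sel])
        model[name] = km.cluster_centers_    # m mean histograms per hypothesis
    return model
```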

5.4. Texture Classifier

Classification is done on a texton image by identifying a likelihood that a histogram centred around a pixel is either "swath" or "not swath". This is represented by two hypotheses: H_s for "swath" and H_n for "not swath". A suitable distance measure must be used to compare the histograms. Having considered several distance measures, the sum-of-absolute distances (SAD) was chosen, and the ratio of the distances from an observed distribution to either of the hypotheses was used for classification. A main reason for using this distance measure was that it can be implemented to run very fast, since it does not involve square roots or exponential functions. Given two histograms h_1 and h_2, the distance between them is (defined by the ⊕ operator):

h_1 \oplus h_2 = \sum_{j=1}^{k} | h_1(j) - h_2(j) | \qquad (5)

In [16] it was demonstrated that GMMs could be used to model the distribution of colours of a dirt road. Each Gaussian could be made to recognize different sets of colours, allowing multi-modal colour distributions to be modelled by mixing the Gaussians. RGB colour data is 3-dimensional, which is much smaller than a texton histogram with 64 dimensions. Modeling texture with even just one Gaussian was found to be too slow on current hardware, and a number of Gaussians seem needed to represent the distributions well. A compromise was achieved by modeling the texture histograms as a number of K-means clusters. A penalty in doing this is that the density of the distributions is not modeled.

The output of the texture training allows the histograms at each pixel in an image to be labeled as belonging to either H_s or H_n. These histograms are then clustered separately for each hypothesis using K-means with m = 3 clusters.

The hypotheses are then represented by H_s = {h_{s,1}, ..., h_{s,m}} and H_n = {h_{n,1}, ..., h_{n,m}}. The classifier is then formulated as a distance ratio to the nearest cluster under each hypothesis:

d = \frac{\min(h \oplus h_{n,1}, \ldots, h \oplus h_{n,m})}{\min(h \oplus h_{s,1}, \ldots, h \oplus h_{s,m})} \qquad (6)

The distance ratio is computed for each pixel in the image. A geometrical analysis is then used to detect if a field structure is present in the image. An example of classifying an image is given in Fig. 2.
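Using the hypothetical model dictionary from the training sketch above, the per-pixel ratio of Eq. (6) with the SAD distance of Eq. (5) could be computed as follows.

```python
import numpy as np

def sad(h1, h2):
    """Sum-of-absolute distances between two histograms, Eq. (5)."""
    return np.abs(h1 - h2).sum()

def swath_likelihood_ratio(h, model):
    """Distance ratio of Eq. (6); values > 1 favour the 'swath' hypothesis."""
    d_not = min(sad(h, c) for c in model["not_swath"])
    d_swath = min(sad(h, c) for c in model["swath"])
    return d_not / max(d_swath, 1e-9)
```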

5.5. Swath Detection

The swath in the image is parameterised by a width, position and orientation. A mask can then be constructed for all feasible swath parameterisations (see Fig. 3 for an example of a mask). An exhaustive search is then performed within a quantised set of the parameter space to maximise a match score. In Fig. 3 the dark region has a value of -1 and the white region has a value of 1. This mask is multiplied pixel-wise with the classified images, for both the height and the texture, and then summed to produce a scalar value indicating the goodness of a match.
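The exhaustive mask search could look like the sketch below: a signed mask (+1 inside the candidate swath band, -1 outside) is correlated with a classified image over a quantised grid of widths, positions and orientations. The mask geometry and parameter grids are illustrative assumptions.

```python
import numpy as np

def swath_mask(shape, centre, width, angle_deg):
    """+1 inside a straight band with the given centre column, width [px] and
    orientation (anchored at the bottom image row), -1 outside."""
    H, W = shape
    rows, cols = np.mgrid[0:H, 0:W]
    offset = (cols - centre) + np.tan(np.radians(angle_deg)) * (rows - H)
    return np.where(np.abs(offset) <= width / 2, 1.0, -1.0)

def detect_swath(classified, widths, centres, angles):
    """Exhaustive search maximising sum(mask * classified image)."""
    best, best_params = -np.inf, None
    for w in widths:
        for c in centres:
            for a in angles:
                score = float((swath_mask(classified.shape, c, w, a) * classified).sum())
                if score > best:
                    best, best_params = score, (w, c, a)
    return best_params, best
```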

5.6. Updating Texture Model

The texture training produces a model of the texture that the classifier uses. In the current implementation, training is performed at a rate of about 1 Hz. Training is only done if the 3D classifier match score is above a threshold. The newly trained texture model is then validated by comparing the match score obtained with the old model against that of the new one; if the new model is better, the old model is discarded. It has been considered whether it could pay off to use an incremental texture model where the old model is not completely discarded, but this approach had difficulties with local minima. In effect, as new models are constantly being tested, the learning system does have an incremental attribute in that the method often converges to a single training image that best represents the field variations. Nevertheless, it retains the ability to quickly switch model if a new type of environment is encountered.

Attempts were initially made to implement an offline model instead of learning the textures online. However, as can be seen in Fig. 4, the variation that needs to be captured by such a model is quite large. For natural environments it is difficult to get training data for generating offline models that can handle all the necessary variation. In practice one could do the above training offline without any changes. From an implementation point of view, the training is done concurrently with texture tracking on a separate processor/core, so it does not slow down the system.
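The update rule described above amounts to: retrain only when the 3D classifier is confident, and keep the new model only if it scores better than the old one. A compact sketch, with the training and scoring functions and the 3D score threshold as hypothetical stand-ins for the routines discussed earlier:

```python
def maybe_update_model(old_model, frame, score_3d, train_fn, score_fn,
                       min_3d_score=0.5):
    """Retrain at ~1 Hz when the 3D match score is good; keep whichever of the
    old and new texture models achieves the better match score on this frame."""
    if score_3d < min_3d_score:
        return old_model                  # 3D tracking not trusted; no training
    new_model = train_fn(frame)           # e.g. train_texture_model(...)
    if old_model is None or score_fn(frame, new_model) > score_fn(frame, old_model):
        return new_model                  # discard the old model
    return old_model
```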

6. Classification Results

In order to evaluate the performance of the classifier, a simple scoring scheme was designed. The classification rate was defined as the number of correct pixel classifications D_c normalised by the maximum number of correct pixel classifications D_{c,max}; a high classification rate is better. The false alarm rate was defined as the number of false pixel classifications (false alarms) D_f normalised by the maximum number of possible false pixel classifications D_{f,max}; low false alarm rates are preferred. A hand-labeled image was used as ground truth.

\text{Classification rate} = \frac{D_c}{D_{c,max}} \qquad (7)

\text{False alarm rate} = \frac{D_f}{D_{f,max}} \qquad (8)
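A minimal sketch of Eqs. (7)-(8), under the interpretation that D_{c,max} is the number of ground-truth swath pixels and D_{f,max} the number of ground-truth non-swath pixels:

```python
import numpy as np

def classification_rates(predicted, truth):
    """predicted, truth: boolean swath masks of equal shape.
    Returns (classification rate, false alarm rate) as in Eqs. (7)-(8)."""
    correct = np.logical_and(predicted, truth).sum()        # D_c
    false_alarms = np.logical_and(predicted, ~truth).sum()  # D_f
    return correct / truth.sum(), false_alarms / (~truth).sum()
```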

A set of 20 consecutive images (sampled at about 1 Hz) was selected at random from 5 different data-sets of swaths (see Fig. 4). The data-sets had few shadows and all images in a data-set had the sun coming from the same angle (these parameters can cause the texture to change). A hand-labeled set of ground truth images was made for all 100 images. For each of the sequences, the classifier was learnt from one image and then applied to the others. For this training image the position of the swath was calculated using the 3D classifier, which extracts the width and position of the swath based on the 3D profile in the stereo images.

The results (see Fig. 5) are very good, with an average detection rate of around 90% relative to the ground truth. The false alarm rate is around 4% and is thus also good. Some images in data-set #3 have problems getting the heading of the swath correct, as in Fig. 4 where 2(c) has a slightly wrong heading. This error was attributed to the limited resolution at that distance, meaning that there was not enough texture information to classify reliably on this specific swath. There was a spike on image #17 in data-set #4. This was due to an ambiguity in the image where the classifier chose a solution slightly to the left of the hand-labeled image to compensate for a large lump of swath lying off centre relative to the rest of the swath. This is shown in Fig. 6. Such artifacts are unavoidable in a natural environment, and fault-tolerant techniques need to be employed to avoid undesired behaviour of the steering control. The fault-tolerance aspects are discussed in Section 8.

7. Mapping

The mapping system relies on recent research for high-precision positioning using visual odometry (VO) fused with GPS (see [11]). By tracking features that change between images, the change in pose of the camera can be estimated. This is done by computing changes in position between image frames. GPS provides global correction so drift in the VO subsystem can be avoided. This fused position estimate is used to maintain a map of the position of the swath. The swath map is modelled as a Taylor series expansion of a clothoid [17]:

y(x) = y_0 + \tan(\phi)\, x + \frac{C_0 x^2}{2} \qquad (9)

where y_0 is the lateral offset between the vehicle and the swath centre, phi is the angle of the swath relative to the vehicle, C_0 is the curvature of the swath and x is the distance ahead of the vehicle. Estimating the curvature from a single image can be difficult. This is due to the natural variance of the swath position and width. Estimating the curvature requires a larger "accumulated field of view", which is obtained as follows.

For each tracked image, the centre point of the swath is extracted for each horizontal line in the image. The centre points are stored as a list. Old centre points are deleted if they represent positions behind the baler, and new centre points are added as new images are tracked. Clothoid parameters are estimated by least-squares fitting of the parameters in Eq. 9 to the centre points in the list, hence representing a range from the camera field of view back to the pickup on the baler. The baler picks up the swath, so the dynamic map naturally ends at the baler.
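Fitting Eq. (9) to the accumulated centre points is a linear least-squares problem in (y_0, tan phi, C_0); a minimal sketch using numpy:

```python
import numpy as np

def fit_clothoid(xs, ys):
    """Least-squares fit of y(x) = y0 + tan(phi) x + C0 x^2 / 2 (Eq. 9) to the
    accumulated swath centre points (xs ahead of the vehicle, ys lateral)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    A = np.column_stack([np.ones_like(xs), xs, xs ** 2 / 2.0])
    (y0, tan_phi, C0), *_ = np.linalg.lstsq(A, ys, rcond=None)
    return y0, np.arctan(tan_phi), C0   # lateral offset, angle phi, curvature
```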

8. Supervision and fault-tolerance

Two supervision systems run concurrently in the software. The first operates solely on position sensor information (Positioning Supervision in Fig. 1); it is based on work described in [9] and is not described further here. The second is concerned with supervising the trained texture classifier (Texture Supervision in Fig. 1).

To make sure the trained texture classifier is not faulty, a supervision system has been set up for validation. The principle is to evaluate the texture classifier relative to the 3D classifier over a number of frames. Change detection theory is applied to determine whether the texture classifier output is similar to the 3D tracking. Two residuals are formed: one for lateral and another for angular deviation. The residuals are simply the differences between the outputs of the 3D and the texture-based classifiers.

When no prior information can be assumed, statistical change detection is often based on a log likelihood ratio s_i:

s_i = \ln \frac{p_{\theta_1}(r_i)}{p_{\theta_0}(r_i)} \qquad (10)

where the probability densities of the observed distributions are used, p_{theta_1} for the faulty case and p_{theta_0} for the normal case, respectively, and the value of a residual at sample instant i is denoted r_i.

A decision on whether a change has taken place needs averaging over a number of values of s_i in order to get adequate confidence. In the implemented change detection algorithm, a standard cumulative sum (CUSUM) algorithm is used, see e.g. [18] and references herein, to detect a change in mean:

S(k) = \sum_{i=1}^{k} s_i = \sum_{i=1}^{k} \ln \frac{p_{\theta_1}(r_i)}{p_{\theta_0}(r_i)} \qquad (11)

To obtain estimates of the probability density functions for the normal and faulty cases, histograms of the residuals were analysed for cases with and without faults. These are shown in Fig. 7. By inspection, faults appear to be given by a shift in the mean mu of the residuals r_i while the standard deviation sigma remains unchanged. Therefore, the distributions for the normal and faulty cases are approximated by scalar Gaussian distributions,

\text{Normal}: \; \mathcal{N}(\mu_0, \sigma): \quad p_{\theta_0}(r_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(r_i - \mu_0)^2}{2\sigma^2}\right) \qquad (12)

\text{Fault}: \; \mathcal{N}(\mu_1, \sigma): \quad p_{\theta_1}(r_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(r_i - \mu_1)^2}{2\sigma^2}\right) \qquad (13)

A scalar test function g(k) was formed as an approximation of S(k) over a window of N samples; the test function is the well-known

g(k) = \sum_{i=k-N}^{k} \ln \frac{p_{\theta_1}(r_i)}{p_{\theta_0}(r_i)} = \frac{\mu_1 - \mu_0}{\sigma^2} \sum_{i=k-N}^{k} \left( r_i - \frac{\mu_1 + \mu_0}{2} \right) \qquad (14)

If a test is to be made to detect a change of known magnitude, the standard cumulative sum (CUSUM) test can be used. If the magnitude of the change must be assumed unknown, a Generalised Likelihood Ratio (GLR) test should be employed [19]. In this paper, a CUSUM detector is used because, in a fault-tolerant guidance system, position deviations of a known (small) magnitude can be tolerated. Changes in mean of 20 cm and 5 deg are used to reject a trained texture classifier. The variances are estimated to be sigma^2 = 88.0 cm^2 and sigma^2 = 16.5 deg^2, respectively. A sample window of N = 20 (roughly 2 seconds) was used to calculate the cumulative sum, with tests for both a positive and a negative change in mean. A fault flag was triggered if this value exceeded gamma = 5 for the lateral fault, and gamma = 2 for the angular fault.
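A minimal sketch of a two-sided, windowed CUSUM decision of the kind given by Eqs. (10)-(14), assuming the Gaussian shift-in-mean model with mu_0 = 0; the example constants are the lateral-residual values quoted above.

```python
import numpy as np

def cusum_decision(residuals, mu1, sigma2, N=20, gamma=5.0):
    """Windowed CUSUM for a shift in mean from 0 to +/- mu1 with known variance.
    Returns True if either the positive or the negative test exceeds gamma."""
    r = np.asarray(residuals[-N:], dtype=float)        # last N samples (~2 s)
    g_pos = (mu1 / sigma2) * np.sum(r - mu1 / 2.0)     # Eq. (14) with mu0 = 0
    g_neg = (-mu1 / sigma2) * np.sum(r + mu1 / 2.0)    # mirrored (negative) test
    return max(g_pos, g_neg) > gamma

# Lateral residual with the constants from the text:
# cusum_decision(lateral_residuals, mu1=20.0, sigma2=88.0, N=20, gamma=5.0)
```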


For a change in mean of A = 20 cm over the sample window, this corresponds to detecting a change in mean between N(0, sigma^2/N) and N(A, sigma^2/N). The probability of a false alarm, P_FA, is the right-tail area beyond the threshold under the no-fault distribution, and the probability of detection, P_D, is the corresponding area under the faulty distribution [19]:

P_{FA} = Q\!\left(\frac{\gamma}{\sqrt{\sigma^2/N}}\right) = 0.0086 \qquad (15)

P_D = Q\!\left(\frac{\gamma - A}{\sqrt{\sigma^2/N}}\right) = 0.9999 \qquad (16)

where Q is the right-tail probability of the standard normal distribution. For the angular deviation, similar results are obtained: P_FA = 0.0138, P_D = 0.9995. This gives a good detection rate while also assuring a reasonably low false alarm rate.
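Equations (15)-(16) follow directly from the Gaussian right-tail probability Q, and the quoted numbers can be reproduced, for example, with scipy (norm.sf is the Q-function):

```python
from math import sqrt
from scipy.stats import norm

def detector_performance(gamma, A, sigma2, N):
    """P_FA = Q(gamma / sqrt(sigma^2/N)), P_D = Q((gamma - A) / sqrt(sigma^2/N))."""
    s = sqrt(sigma2 / N)
    return norm.sf(gamma / s), norm.sf((gamma - A) / s)

print(detector_performance(gamma=5.0, A=20.0, sigma2=88.0, N=20))  # lateral: ~(0.0086, >0.9999)
print(detector_performance(gamma=2.0, A=5.0, sigma2=16.5, N=20))   # angular: ~(0.0138, 0.9995)
```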

The residual checks are run concurrently with the rest of the system and can be seen in Fig. 8 for the run in Fig. 13. The maximum value of g(k) over the positive and negative tests is shown at each time step. In this run two faults are detected; only the second fault is detected by both residuals.

Individual match scores for the two classifiers are first thresholded. If both are accepted, they are passed to the supervision system. If no faults are detected, the results are fused by taking a weighted mean of the lateral and angular deviations (weighted by the match scores). If a fault is detected, the texture classifier information is not added to the map. Mapping provides intelligent filtering of the classifier outputs, so jumps are not experienced when switching between the information sources used. The approach outlined gave good fault-tolerance against artifacts in the image processing.

9. Control

A tracking control system to collect the swath is shown in the block diagram in Fig. 1. The control system remains active as long as the match score is above a predefined threshold and there is map information ahead of the vehicle such that a reference track can be computed. The control system has a variable offset that is calculated to ensure the bale chamber is filled evenly. If the bale chamber is unevenly filled, the bale becomes cone shaped. Pressure sensors inside the baler provide a measure of how evenly it is filled. The bale must have a certain size before the pressure sensors give usable feedback. To compensate for this lack of feedback, the controller has two states. In the initial state, where pressure has not yet built up, an open-loop steering pattern is followed where the vehicle alternates between being positioned on the left edge of the swath and on the right edge of the swath. This motion is parameterised by how far out this edge is relative to the centre of the swath, how long the vehicle should follow an edge, and how sharply it should change sides. The steering system changes mode when the bale size reaches a minimal level and the sensor feedback can be used. The sensor feedback provides a signal in the range -1 to 1 indicating how cone shaped the bale is. An adequately amplified version of this signal is added to the measured lateral offset. The swath parameters y_0, phi, C_0 are finally fed to the tracking controller.
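The lateral-offset reference fed to the tracking controller can be summarised by the two controller states described above. The zig-zag amplitude and period, the feedback gain and the bale-size threshold in this sketch are illustrative placeholders, not the values used in the field tests.

```python
def lateral_offset_reference(t, bale_diameter, pressure_signal,
                             min_diameter=80.0, edge_offset=0.4,
                             edge_period=10.0, gain=0.3):
    """Offset [m] added to the swath-centre reference.
    State 1 (small bale): open-loop zig-zag between the swath edges.
    State 2 (pressure feedback usable): offset proportional to the cone-shape
    signal from the baler, which lies in [-1, 1]."""
    if bale_diameter < min_diameter:
        side = 1.0 if int(t / edge_period) % 2 == 0 else -1.0   # alternate sides
        return side * edge_offset
    return gain * pressure_signal
```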

10. Results

10.1. Texture Versus 3D

A comparison between the measured lateral and angular deviation of the swath position relative to the vehicle for the 3D and texture algorithms is shown in Fig. 10. Clearly, the two follow each other closely. It is interesting to note that the two algorithms complement each other in that sudden drops in the match score seldom occur in both at the same time. The match score was on average higher for the texture method than for the 3D method in this sequence. This may differ from swath to swath, depending upon how high the swath is and how slanted its edges are. Towards the end of the sequence (around frame 1500) the 3D method started to have problems detecting a height difference on a very flat section of swath, but the texture method allowed tracking to continue to the end. An example of the normalized images for texture and 3D classification can be seen in Fig. 11. A comparison between the two classification methods and the fused approach using mapping is given in Fig. 12. Specifically, the texture method was found to be more sensitive to changes in lighting than the 3D method, for example due to shadows. The 3D method can potentially have problems with mistaking standing crop for the field structure, to which the texture method is less sensitive. By combining texture and 3D, a more robust system can be achieved. However, the system is still dependent on training from 3D, so it requires some initial 3D structure for the first training. If the 3D tracking detects a wrong 3D structure, this can potentially be detected by comparing it to the texture classification, which is less sensitive to this.

10.2. Field Tests

Results for the system operating live are provided in Fig. 13. The results span a period of about 800 s and involve creating 7 bales. The system steers autonomously. There were 3 drops in the match score, which occurred where the swath ended in the headland and manual steering was required to bring the vehicle to the next length of swath. During the turning periods the lateral deviation measurements were ignored. The match scores shown have been normalised from 0 to 1. The vehicle needed to stop when ejecting a finished bale, which can be observed from the velocity plot. The driver kept the velocity at 2 m/s but also increased it to close to 3 m/s for a certain part of the results.

11. Conclusion

There has generally been little effort at improving redundancy in vision-based guidance systems. The system presented here improved redundancy by combining 3D measurements, texture, and mapping. The results in this paper showed that textures present in outdoor agricultural environments can be learnt and tracked robustly. 3D data provided by the stereo camera facilitated such learning. Geometric shape constraints allowed an initial sorting of false positives from a swath by analysis of width, position, and orientation. Supervision provided further detection of faults by comparing the learnt model to the 3D tracking. Future work on the texture classifier will involve eliminating the need for online learning based on the mask from 3D tracking and thus make it more capable of performing independently of 3D data. More field tests will also need to be made with the texture classifier to see how it handles moving shadows and lighting changes.

The mapping component allowed additional parameters, such as the swath curvature, to be tracked and increased the robustness of the overall system. Extensive tests demonstrated the system to work online on an autonomously steered vehicle on a variety of swaths.

12. References

[1] J. Jin, L. Tang, Corn plant sensing using real-time stereo vision, J. Field Robot. 26 (6-7) (2009) 591–608. doi:10.1002/rob.v26:6/7.

[2] F. Rovira-Mas, S. Han, J. Wei, J. F. Reid, Autonomous guidance of a corn harvester using stereo vision, Agricultural Engineering International: the CIGR Ejournal 9.

[3] T. Coen, A. Vanrenterghem, W. Saeys, J. De Baerdemaeker, Autopilot for a combine harvester, Comput. Electron. Agric. 63 (1) (2008) 57–64. doi:10.1016/j.compag.2008.01.014.

[4] T. Bakker, H. Wouters, K. van Asselt, J. Bontsema, L. Tang, J. Muller, G. van Straten, A vision based row detection system for sugar beet, Comput. Electron. Agric. 60 (1) (2008) 87–95. doi:10.1016/j.compag.2007.07.006.

[5] M. Kise, Q. Zhang, Creating a panoramic field image using multi-spectral stereovision system, Comput. Electron. Agric. 60 (1) (2008) 67–75. doi:10.1016/j.compag.2007.07.002.

[6] A. Angelova, L. Matthies, D. Helmick, P. Perona, Fast terrain classification using variable-length representation for autonomous navigation, 2007, pp. 1–8.

[7] I. Posner, M. Cummins, P. Newman, A generative framework for fast urban labeling using spatial and temporal context, Auton. Robots 26 (2-3) (2009) 153–170. doi:10.1007/s10514-009-9110-6.

[8] M. R. Blas, M. Agrawal, A. Sundaresan, K. Konolige, Fast color/texture segmentation for outdoor robots, in: IROS, 2008, pp. 4078–4085.

[9] M. R. Blas, M. Blanke, Natural environment modeling and fault-diagnosis for automated agricultural vehicle, in: Proceedings 17th IFAC World Congress, Seoul, Korea, 2008, pp. 1590–1595.

[10] K. Konolige, Small vision systems: hardware and implementation, in: Intl. Symp. on Robotics Research, 1997, pp. 111–116.

[11] K. Konolige, M. Agrawal, M. R. Blas, R. C. Bolles, B. Gerkey, J. Sola, A. Sundaresan, Mapping, navigation, and learning for off-road traversal, J. Field Robot. 26 (1) (2009) 88–113. doi:10.1002/rob.v26:1.

[12] T. Leung, J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, International Journal of Computer Vision 43 (1) (2001) 29–44.

[13] M. Varma, A. Zisserman, Texture classification: Are filter banks necessary?, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2 (2003) 691–696.

[14] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.

[15] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144–152.

[16] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L.-E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, P. Mahoney, Stanley: The robot that won the DARPA Grand Challenge, Journal of Field Robotics 23 (9) (2006) 661–692.

[17] B. Southall, C. J. Taylor, Stochastic road shape estimation, IEEE International Conference on Computer Vision 1 (2001) 205. doi:10.1109/ICCV.2001.10022.

[18] M. Blanke, M. Kinnaert, J. Lunze, M. Staroswiecki, Diagnosis and Fault-Tolerant Control, 2nd Edition, Springer, 2006.

[19] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume 2, Prentice Hall, 1998.

Figure 1: Block diagram of vision system and tracking control for baling. Supervised classification and positioning provide mapping of field structures, which is fed to the steering controller.


Figure 2: (a) Left image from the stereo camera. (b) Texton classification; each pixel colour represents a different texton (using a set of random colours). (c) The stereo camera image with a transparency mask based on the swath classification. (d) Swath classification based on texture, with intensity representing the strength of classification. (e) Visualization of the texture classification using 3D stereo information.

Figure 3: (a) The original image with a swath in the middle. (b) Mask illustrating width, position, and orientation of the swath extracted from the stereo algorithm.


Figure 4: Images 1(a-e) illustrate the training images used. Image #10 in each dataset is classified in 2(a-e), with the blue lines representing the swath bounds.

[Figure 5 consists of two panels, "Datasets - Detection Rate" and "Datasets - False Alarms", plotting per cent against image # (1-20) for Sets 1-5.]

Figure 5: Classification and false alarm rate for the tested sequences.

Figure 6: The worst match in the used datasets according to the scoring scheme, where the algorithm follows a lump of swath slightly to the left of the rest of the swath. Image #17 in dataset #4.

[Figure 7 consists of four histogram panels: "Angular Deviation (no fault)" and "Angular Deviation (faulty)" in degrees, "Lateral Deviation (no fault)" and "Lateral Deviation (faulty)" in cm.]

Figure 7: Normalised histograms for residuals with and without faults. The histograms are shown with fitted Gaussian probability density functions.

[Figure 8 consists of two panels, "Fault Detection (Angular Fault)" and "Fault Detection (Lateral Fault)", plotting g(k) against frame number.]

Figure 8: Fault detection in angular and lateral deviations using a CUSUM change detector. The thick line defines the threshold for a fault.

Figure 9: The open-loop zig-zag motion of the vehicle is clearly illustrated by the red line. The vehicle alternately drives on the left edge of the swath and then on the right edge.


[Figure 10 consists of three panels plotted against frame number for both 3D and Texture: "Lateral Deviation" (cm), "Angular Deviation" (degrees) and "Match Score".]

Figure 10: Results for 1800 frames of video. The measured lateral and angular deviation of the swath relative to the vehicle are plotted for both 3D and texture. The match score is shown for the two algorithms. There was no swath inside the field of view between frames 950-1100, 1270-1380 and 1600-1800.

Figure 11: An example of classification results from an image pair. Only the left image is shown for clarity. In the disparity image, warmer colours indicate shorter range. The whiter a pixel is in the classified images, the stronger the field structure is detected.

Property | 3D | Texture | 3D and texture + mapping
Sensitivity to lighting | Low | High | Low
Sensitivity to other 3D structures | High | Low | High (but can detect fault)
Requires 3D field structure | Yes | No | No (after training)
Requires field structure with texture different from surroundings | No | Yes | No
Can reliably identify field structure curvature | No | No | Yes
Can detect classification faults | No | No | Yes
Can fuse multiple image classifications (reduce noise) | No | No | Yes

Figure 12: Comparison of classification methods.

[Figure 13 consists of four panels plotted against time (s): "Lateral Deviation VS Bale Pressure" (scaled pressure difference and swath lateral deviation, cm), "Bale Diameter" (cm), "Match Score" and "Velocity" (m/s).]

Figure 13: Results for a 13 minute run where 7 bales are created. The controller is using the feedback signal from the baler.