[Lecture Notes in Computer Science] Fuzzy Systems and Knowledge Discovery Volume 4223 || A Fuzzy Symbolic Inference System for Postal Address Component Extraction and Labelling

L. Wang et al. (Eds.): FSKD 2006, LNAI 4223, pp. 937 – 946, 2006. © Springer-Verlag Berlin Heidelberg 2006

A Fuzzy Symbolic Inference System for Postal Address Component Extraction and Labelling

P. Nagabhushan1, S.A. Angadi2, and B.S. Anami3

1 Department of Studies in Computer Science, University of Mysore, Mysore
2 Department of Studies in Computer Science, University of Mysore, Mysore and Basaveshwar Engineering College, Bagalkot
3 Department of Computer Science and Engineering, BEC, Bagalkot

[email protected]

Abstract. For automatic address reading, it is important to properly segregate the different components present in the destination postal address under different labels, namely addressee name, house number, street number, extension/area name, destination town name and the like. This task is not as easy as it would appear, particularly for unstructured postal addresses such as those found in India. This paper presents a fuzzy symbolic inference system for postal mail address component extraction and labelling. The work uses a symbolic representation for postal addresses and a symbolic knowledge base for postal address component labelling. A symbolic similarity measure, treated as a fuzzy membership function, is devised and is used for finding the distance of the extracted component to a probable label. An alpha cut based de-fuzzification technique is employed for labelling and for evaluating the confidence in the decision. The methodology is tested on 500 postal addresses and an efficiency of 94% is obtained for address component labelling.

Keywords: Postal address component labelling, Fuzzy methodology, Symbolic similarity measure, alpha cut based de-fuzzification, Inference System.

1 Introduction

Efforts to make postal mail services efficient are seen the world over, and there has been a spurt of activity in the postal automation area in recent times. The computer vision tasks in postal automation are listed in [1]. Delivery of mail to the addressee at the destination place requires sorting for onward dispatch at the origin post office, re-sorting if needed at intermediate post offices, and lastly sorting for distribution. Hence mail sorting is a very important and skilled task which should be made efficient to improve the quality of mail services. It can be made efficient by devising tools for the automation of the various sub tasks of mail sorting. Towards this end, tools/techniques from different domains such as pattern recognition, image processing, graph theory, optimization, soft computing, etc. need to be applied.

The different aspects of postal services that need to be automated are discussed in [2]. The literature survey reveals that researchers around the world are addressing various issues required for postal automation, especially contributing to mail sorting, but little effort is found in simulating the human expertise required for postal mail handling. A few of these works are described here. An algorithmic prototype for automatic verification and validation of postal addresses is presented in [3]. A methodology for truthing, testing and evaluation of postal address components is proposed in [4]. A formal method for information theoretic analysis of postal address components is given in [5]. Address component identification, required for postal automation in India and other countries which do not have structured address formats, has not been attempted. The task of address component labelling is similar to text/word categorization, and the literature abounds with general text categorization works applied to other domains [6].

In this work a fuzzy symbolic inference system for extraction and labelling of postal address components is presented. A Symbolic similarity measure is devised for identifying the address component labels using the symbolic representation of the postal address and a symbolic knowledge base. The similarity measure is a fuzzy membership function as it gives approximate nearness to various possible labels. This necessitates the disambiguation of the similarity formulation and is carried out by an inference mechanism using fuzzy alpha cut methodology. The alpha cut set is further used in defining a confidence value for the decision made. The methodology has given a labelling accuracy of 94%.

The remaining part of the paper is organized into five sections. Section 2 presents a discussion on the postal mail address component labelling problem. Section 3 gives the symbolic representation of the postal address and the symbolic knowledge base employed. Section 4 describes the fuzzy symbolic inference system for address component labelling. It elaborates the similarity formulation and alpha cut based de-fuzzification technique used for disambiguation and confidence evaluation. Section 5 gives the results and provides critical comments. Section 6 presents the conclusion.

2 Postal Mail Address Component Labelling Problem

The structure of postal addresses in developed countries like USA, UK, etc. is fairly standardized [7,8], and this is facilitated by the structured layout of the localities. The addresses are always written using the same structure, hence the line of occurrence is sufficient to identify an address component such as street name, postal code, etc. The same standardization, though, is not found in a country like India, and it is difficult to devise a standard address format for the postal addresses in India. Indian postal addresses generally give a description of the geographical location of the delivery point of the addressee, for example, Near Playground, Behind CTO, Besides City Hospital, etc. A typical set of examples of UK, USA and Indian addresses is given in Figure 1.

As indicated by address-3 in Figure 1, the postal addresses in the Indian context are not very structured and the destination addresses are written using the location description. The postal addresses generally make use of well known land marks, houses of famous personalities and popular names of roads for describing the addressee and mail delivery point. All these give an unstructured nature to the postal addresses. People also use synonyms like street/road for cross, avenue for road, etc. when writing destination addresses and may sometimes use wrong spellings. After studying a large number of postal addresses, the various components that may be present in a typical Indian postal address are found to be about twenty. Every address will not contain all the components, and some addresses may contain more than one value for the same component type. The postal addresses in general are approximate/incomplete/imprecise descriptions of the mail delivery points (addressee). It is required to identify these components of an address for its proper interpretation. This address component labelling task is not trivial, particularly when the addresses are unstructured and the labelling is to be based on the address information itself. This paper presents a fuzzy symbolic inference system for labelling the address components of an unstructured postal address, taking Indian addresses as a case study. The methodology can be adopted in other countries having a similar unstructured format.

Address 1: UK Address:
Nildram Ltd [recipient]
Ardenham Court [probably the building name: Not all addresses have this part.]
Oxford Road [street name]
AYLESBURY [postal town (town/city)]
BUCKINGHAMSHIRE [county (not needed)]
HP19 3EQ [postal code]
GREAT BRITAIN [country name, if posted from outside country]

Address 2: USA Address:
JOHN DOE [recipient]
BITBOOST SYSTEMS [Organization, required if office address]
SUITE 5A-1204 [Suite name, if available and length on street name line is not sufficient]
421 E DRACHMAN [Site no. and street name with direction]
TUCSON AZ 85705 [Place, state and zip code]
USA [country name, if posted from outside country]

Address 3: Indian Address:
Mr. Joseph [recipient]
Near Kalika Devi Temple, [Landmark]
Behind Govt Hospital [Landmark]
Kollur-01 [Place and PIN]
Karnataka [State]
India [Country name]

Fig. 1. Typical Addresses

3 Symbolic Representation

The symbolic representation of objects is an advantageous one especially for objects which have different and varying number of fields and corresponding data/ knowledge bases [9]. Section 3.1 presents the symbolic representation of postal address and section 3.2 describes the symbolic knowledge base employed in this work.

3.1 Postal Address

Some of the fields of postal addresses are qualitative, such as addressee name, care of name, etc. Other fields such as house number, road number, postal code (postal index number/PIN), etc. may be numeric, though their use is non numeric in nature. The values taken by most of the fields for a given address can be distinct or one among the given range or enumerated list of values. A postal address may not contain all the possible fields. This description of the postal address makes it a suitable candidate for representation using the symbolic data approach [9].

Symbolic objects offer a formal methodology to represent such variable information about an entity. Symbolic objects are extensions of classical data types. Symbolic objects can be of three different types, Assertion Object, Hoard Object and Synthetic Object. An assertion object is a conjunction of events pertaining to a given object. An event is a pair which links feature variables and feature values. A Hoard object is a collection of one or more assertion objects, whereas a synthetic object is a collection of one or more hoard objects [12]. The postal address object is described as a hoard object consisting of three assertion type objects [10] namely Addressee, Location and Place as described in (1).

POSTAL ADDRESS OBJECT = {[Addressee], [Location], [Place]}   (1)

The Addressee specifies the name and other personal details of the mail recipient; the Location specifies the geographical position of the mail delivery point; and the Place specifies the city/town or village of the mail recipient. Each of these assertion objects is defined by a collection of events described by the feature variables. The feature variables or postal address fields of the different assertion objects are listed in (2), (3) and (4). Each feature describes some aspect of the object and all the features together completely specify the assertion objects. However, certain features remain missing in a typical postal address because they are not available, and in some cases the written address may contain more than the required address components (typically more values for one feature, viz. two or more landmarks).

[Addressee = (Addressee Name)(Care of Name)(Qualification)(Profession)(Salutation)(Designation)]   (2)

[Location = (House Number)(House Name)(Road)(Area)(LandMark)(PBNo)(Firm)]   (3)

[Place = (Post)(Taluk)(District)(State)(Place)(PIN)(Via)]   (4)

A typical postal address and its representation as a symbolic object is given in Table 1.

Table 1. A Typical Postal Address Object

Postal Address:
Shri Shankar S Menisinkai, Certified Engineer, “GuruKrupa”, 12th Main Road, Vidyagiri, Bagalkot-587102, Karnataka State

Symbolic Representation:
PostalAddressObject = {
[Addressee = (Salutation=Shri), (AddresseeName=ShankarSManisinkai), (Designation=Certified Engineer)],
[Location = (HouseName=GuruKrupa), (Road=12thMainRoad), (Area=Vidyagiri)],
[Place = (place=Bagalkot), (PIN=587102), (State=Karnataka)] }
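The hoard object of Table 1 can be pictured as a nested mapping. The following is a minimal Python sketch under stated assumptions: the dictionary layout, the helper name `make_postal_address_object` and the choice to store every feature as a list (so that repeated features such as two landmarks fit) are illustrative and are not taken from the paper.

```python
# Illustrative sketch only; the data layout is an assumption, not the authors' code.

# Feature variables of the three assertion objects, as listed in (2), (3) and (4).
FEATURES = {
    "Addressee": ["AddresseeName", "CareOfName", "Qualification",
                  "Profession", "Salutation", "Designation"],
    "Location": ["HouseNumber", "HouseName", "Road", "Area",
                 "LandMark", "PBNo", "Firm"],
    "Place": ["Post", "Taluk", "District", "State", "Place", "PIN", "Via"],
}


def make_postal_address_object(events):
    """Group (feature, value) events into the three assertion objects.

    Missing features are simply absent; repeated features accumulate
    in a list of values.
    """
    obj = {name: {} for name in FEATURES}
    for feature, value in events:
        for assertion, names in FEATURES.items():
            if feature in names:
                obj[assertion].setdefault(feature, []).append(value)
                break
        else:
            raise ValueError(f"unknown feature: {feature}")
    return obj


# The address of Table 1 expressed as events.
address = make_postal_address_object([
    ("Salutation", "Shri"),
    ("Designation", "Certified Engineer"),
    ("HouseName", "GuruKrupa"),
    ("Road", "12th Main Road"),
    ("Area", "Vidyagiri"),
    ("Place", "Bagalkot"),
    ("PIN", "587102"),
    ("State", "Karnataka"),
])
```

Features that do not occur in the written address, such as Care of Name here, are left out of the object rather than stored as empty values.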

3.2 Knowledge Base for Address Component Labelling

The symbolic knowledge base employed for postal address component labelling is devised based on the frame structured knowledge base presented in [11] and the study of a large number of postal addresses. The symbolic knowledge base used in this work provides a systematic approach for address component labelling and an improved performance as compared to the work described in [11]. The symbolic knowledge base, AD_COMP_KB, is organized as a synthetic object of three hoard objects, namely the Addressee knowledge base (Addresseekb), the Location knowledge base (Locationkb) and the Place knowledge base (Placekb), as given in (5).

AD_COMP_KB = {[Addresseekb], [Locationkb], [Placekb]}   (5)

AD_COMP_KB = {
{Addresseekb = [Salutation] [Addressee Name] [Care of Name] [Qualification] [Profession] [Designation]}
{Locationkb = [House No.] [House Name] [Road No.] [Road Name] [Area Name] [Land Mark] [POST BOX] [Firm Name] [PIN Code] [POST]}
{Placekb = [Place] [Taluk] [District] [VIA] [State] [Country]}
}

Fig. 2. Structure of Symbolic Address Component Knowledge Base

The hoard objects are made of assertion objects as detailed in Figure 2. All the assertion objects of the symbolic knowledge base have the events described in Figure 3. The knowledge base is populated with the values extracted by observing a large number of postal addresses.

Events of Assertion Object = {(Number of Words), (Occurring Line), (Number Present), (Inv Comma Present), (ALL CAPITALS), (keywords), (tokens)}

Fig. 3. Events Associated with Assertion Object

4 Fuzzy Symbolic Inference System

The postal address component labelling for unstructured addresses is carried out by the symbolic knowledge base supported fuzzy inference system. The postal address component inference system takes the destination postal address in text form as input, separates the probable components and labels them. The proposed system assumes that different components are on separate lines or on the same line separated by a comma. The fuzzy symbolic inference system for address component extraction and labelling is depicted in Figure 4.
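The separation assumption stated above (one component per line, or comma-separated components on a line) can be sketched in a few lines of Python. The function name `split_components` and the decision to skip blank lines when counting address lines are assumptions made for illustration. Further steps visible in Table 2, such as splitting a salutation from the name that follows it, are not covered by this sketch.

```python
def split_components(address_text):
    """Split an address into candidate components.

    Returns (line_number, component) pairs, with line numbers starting
    at 1. Blank lines are skipped and do not count as address lines
    (an assumption of this sketch).
    """
    components = []
    line_no = 0
    for line in address_text.splitlines():
        if not line.strip():
            continue
        line_no += 1
        # Components on the same line are separated by commas.
        for part in line.split(","):
            part = part.strip()
            if part:
                components.append((line_no, part))
    return components


parts = split_components(
    "Mr. Bhosale Chandra,\n"
    "Near Daddennavar Hospital, Extension Area,\n"
    "Bagalkot, 587101"
)
```

Keeping the line number with each component preserves the information needed for the Occurring Line event of Figure 3.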

The inference for address component labelling is done at the assertion object level. The labelled components (the identified assertion objects) are then grouped into the postal hoard object (the symbolic representation of the postal address). The inference mechanism uses symbolic analysis for labelling the address components, using the similarity measure defined in section 4.1 as a fuzzy membership value and the fuzzy alpha cut technique for assigning a confidence measure to the decision.

[Figure 4: block diagram with the blocks Destination Postal Address, Fuzzy Symbolic Address Component Extraction and Labelling, Symbolic Postal Address Component Knowledge base, Extracted and Labelled Address Components, and Manual Intervention.]

Fig. 4. The Fuzzy Symbolic Inference System for Address Component Extraction and Labelling

4.1 Symbolic Similarity Measure for Address Component Labelling

The problem of address component labelling is not easy, and the label should be ascertained from the information specified by the component alone. The presence of some key words and their occurrence relative to the other components helps in identifying the components. The symbolic data analysis for address component labelling needs distance/similarity measures to map the input to possible candidates. Widely used symbolic data distance measures for similarity are described in [9,12], including distance measures for interval type data, absolute value/ratio type data, etc. The distance/similarity measure described in [12] is made up of three components, namely similarity due to position, similarity due to content and similarity due to span of the two objects being compared. The position similarity is defined only for interval type data and describes the distance of one object to the initial position of the other object. The span similarity is defined for both interval and absolute type data and describes the range/fraction of similarity between the objects. The content similarity describes the nearness between the contents of the two objects. The similarity measures defined in [12] have been used for clustering, classification, etc., and have been tested on fat oil and iris data. As the postal object has only absolute values, the span and content similarity measures defined in [12] are modified and used in this fuzzy symbolic inference system for address component labelling.

The similarity measure gives the similarity of the input component with various component labels (assertion objects) of the symbolic synthetic object AD_COMP_KB. The similarity measure between ith input component (IPi) and jth component label (ctj) of the knowledge base is found using (6).

$$S(IP_i, ct_j) = \frac{1}{EV} \sum_{k=1}^{EV} netsim_k, \quad 1 \le i \le n,\ 1 \le j \le m \tag{6}$$

Where, n is the number of available components in the input address and m is the number of possible component labels or assertion objects in the knowledge base. EV takes a value of 7, representing the seven events of the assertion objects.

The values of netsimk are calculated for each event of assertion object using the computations implied in (7) for the first five to calculate content similarity and (8) for the last two to calculate span and content similarity.

$$netsim_k = wf_k \cdot \frac{\mathit{Interse}}{\mathit{Sum\_IP\_KB}}, \quad 1 \le k \le 5 \tag{7}$$

$$netsim_k = wf_k \cdot \left( \frac{\mathit{Interse}}{\mathit{Sum\_IP\_KB}} + \frac{\mathit{Comp\_IP} + \mathit{Comp\_KB}}{2 \cdot \mathit{Sum\_IP\_KB}} \right), \quad 6 \le k \le 7 \tag{8}$$

Where,
Interse is the number of words/elements common to the input component and the component label under test,
Comp_IP is the number of words/elements in the input component,
Comp_KB is the number of words/elements in the component label (knowledge base) under test, and
Sum_IP_KB = Comp_IP + Comp_KB − Interse.

The weight factors wfk are pre-defined for every component and the values are assigned based on the importance of the events in different labels. This similarity measure is the fuzzy membership function of the input component in the component label class. The actual decision of the label class is made using the de-fuzzification technique described in section 4.2.
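The computation in (6), (7) and (8) can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: representing each event as a Python set of values, the function names `netsim` and `similarity`, and returning 0 when both sets are empty are all assumptions. The weight factors are passed in by the caller, since the paper does not list their values.

```python
EV = 7  # number of events per assertion object, as stated for (6)


def netsim(k, wf_k, ip_values, kb_values):
    """Weighted similarity of event k, following (7) for k = 1..5
    and (8) for k = 6..7. Both value collections are sets (assumption)."""
    comp_ip = len(ip_values)
    comp_kb = len(kb_values)
    interse = len(ip_values & kb_values)
    sum_ip_kb = comp_ip + comp_kb - interse
    if sum_ip_kb == 0:
        # Assumption: an event empty on both sides contributes nothing.
        return 0.0
    content = interse / sum_ip_kb
    if 1 <= k <= 5:
        return wf_k * content
    if 6 <= k <= EV:
        span = (comp_ip + comp_kb) / (2 * sum_ip_kb)
        return wf_k * (content + span)
    raise ValueError(f"event index out of range: {k}")


def similarity(ip_events, kb_events, weights):
    """S(IP_i, ct_j) of (6): the mean of netsim_k over the EV events."""
    if not (len(ip_events) == len(kb_events) == len(weights) == EV):
        raise ValueError("expected exactly EV events and weights")
    total = sum(
        netsim(k, weights[k - 1], ip_events[k - 1], kb_events[k - 1])
        for k in range(1, EV + 1)
    )
    return total / EV
```

With unit weights the last two events can each contribute up to 2, so the weight factors also determine the overall range of the similarity value.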

4.2 Fuzzy Symbolic Methodology for Address Component Labelling

The methodology for address component labelling involves separating the components (in separate lines or separated by commas) and extracting the required features. These features are stored in a newly devised data structure called Postal Address Information Structure (PDIS). The structure of PDIS is given in Figure 5. Then the PDIS is used to find the similarity measure with all the component labels.

After the symbolic similarity measure is calculated for the various component labels for an input component using equation (6), the component labels are arranged in decreasing order of similarity value in a similarity array. To decide which component class the input component belongs to, a de-fuzzification process is taken up. The de-fuzzification is done by defining the fuzzy α-cut set. The α value is calculated using equation (9).

$$\alpha = S_0 - DFC \cdot S_0 \tag{9}$$

Where, S0 is the maximum similarity value obtained for the input component and DFC is the de-fuzzification constant, taken as 0.1 based on experimentation with postal address components.


The alpha cut set is obtained from the similarity array by taking into the cut set all the members of the similarity array whose value is greater than α. This is depicted pictorially in Figure 6. The α-cut set is used to identify the component label with assigned confidence value for the decision. If the α-cut set has only one member then the component label, ct0 (corresponding to I0 and S0 ) is assigned to the input component with confidence measure of 100.
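The α-cut step can be sketched as below, reading (9) as α = S0 − DFC · S0. The function name `alpha_cut` and the use of a label-to-similarity dictionary as the similarity array are assumptions made for illustration.

```python
DFC = 0.1  # de-fuzzification constant, as given for (9)


def alpha_cut(similarities, dfc=DFC):
    """Return (alpha, cut_set) for one input component.

    similarities maps each component label to its similarity value.
    cut_set lists the (label, similarity) pairs whose similarity is
    greater than alpha, in decreasing order of similarity.
    """
    if not similarities:
        raise ValueError("no similarity values given")
    # The similarity array: labels in decreasing order of similarity.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    s0 = ranked[0][1]
    alpha = s0 - dfc * s0
    cut_set = [(label, s) for label, s in ranked if s > alpha]
    return alpha, cut_set
```

Because α is a fixed fraction of S0, the best label is always in the cut set whenever S0 is positive, and a second label joins it only when its similarity is within 10% of the best one.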

Postal Address Component {
  Number of words   Integer   // Stores the number of tokens in the component
  Occurring Line    Integer   // The address line where the component occurs
  Number            Boolean   // Flag, set if one of the token is a number
  Inverted Comma    Boolean   // Flag, set if one or more of tokens are in inverted comma
  All Capitals      Boolean   // Flag, set if one of the tokens has all capital characters
  Marked            Boolean   // Flag, set if one of the key words is present
  Category          String    // To store the category of key word/ address component
  Tokens            String    // To store the tokens/ address of the address component
  Confidence        String    // To store the confidence level of the identification
  Component Type    String    // To identify/ label the component
}

Fig. 5. Postal Address Information Structure
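The structure of Figure 5 maps naturally onto a record type. The sketch below is a possible Python rendering, with a hypothetical `extract_features` helper that fills the flags from the component text. The tokenisation rules, the set of quote characters and the `keywords` parameter are assumptions; the paper does not specify them.

```python
from dataclasses import dataclass, field
from typing import List

QUOTES = "\"“”"  # characters treated as inverted commas (assumption)


@dataclass
class PostalAddressComponent:
    number_of_words: int       # number of tokens in the component
    occurring_line: int        # address line where the component occurs
    number: bool               # set if one of the tokens is a number
    inverted_comma: bool       # set if a token is in inverted commas
    all_capitals: bool         # set if one token is all capital letters
    marked: bool = False       # set if one of the key words is present
    category: str = ""         # category of key word / component
    tokens: List[str] = field(default_factory=list)
    confidence: str = ""       # confidence level of the identification
    component_type: str = ""   # the assigned label


def extract_features(text, line_no, keywords=frozenset()):
    """Fill the feature fields for one component (hypothetical helper)."""
    tokens = [t.strip(QUOTES + ".,") for t in text.split()]
    tokens = [t for t in tokens if t]
    return PostalAddressComponent(
        number_of_words=len(tokens),
        occurring_line=line_no,
        number=any(t.isdigit() for t in tokens),
        inverted_comma=any(q in text for q in QUOTES),
        all_capitals=any(t.isalpha() and t.isupper() for t in tokens),
        marked=any(t.lower() in keywords for t in tokens),
        tokens=tokens,
    )


component = extract_features("Near Daddennavar Hospital", 2, keywords={"near"})
```

The category, confidence and component type fields are left at their defaults here, since they are filled only after the labelling decision is made.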

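[Figure 6: plot of similarity value (vertical axis) against the component labels ct0, ct1, ct2, ..., ctn-1, ctn (horizontal axis), with the alpha level marked; α-cut set = {ct0, ct1, ct2}.]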

Fig. 6. The De-fuzzification Process and the α cut set

If the α-cut set has more than one component label, then the probable component labels are output in decreasing order of confidence. The confidence of the system in a given component label is evaluated using equation (10). If a particular label has a confidence of above 50%, then the component is assigned that label; otherwise manual resolution is resorted to.

$$C_{i,j} = \frac{S_j}{\sum_{k=1}^{p} S_k} \times 100, \quad 1 \le j \le p,\ 1 \le i \le n \tag{10}$$

Where,
Ci,j is the confidence of assigning the jth component label to the ith input component,
n is the number of input components,
p is the number of component labels in the α-cut set, and
Sj is the similarity of the ith input component with the jth component label in the similarity array.
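The confidence evaluation of (10) and the 50% decision rule can be sketched as follows. The function names and the convention of returning `None` as the label when manual resolution is needed are assumptions made for illustration.

```python
def confidences(cut_set):
    """Confidence (in percent) of each label in the alpha-cut set, per (10).

    cut_set is a list of (label, similarity) pairs; the result is in
    decreasing order of confidence.
    """
    if not cut_set:
        raise ValueError("empty alpha-cut set")
    total = sum(s for _, s in cut_set)
    ranked = sorted(cut_set, key=lambda kv: kv[1], reverse=True)
    return [(label, 100.0 * s / total) for label, s in ranked]


def decide(cut_set, threshold=50.0):
    """Assign the best label if its confidence is above the threshold.

    Returns (label, confidence); label is None when the decision is
    left to manual resolution.
    """
    label, conf = confidences(cut_set)[0]
    if conf > threshold:
        return label, conf
    return None, conf
```

A cut set with a single member always yields a confidence of 100, which matches the single-member case described earlier.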


5 Results and Discussions

The fuzzy symbolic inference system for address component labelling is tested on various types of addresses and the results are encouraging. Table 2 summarizes the output of the system for typical input addresses and lists the two highest similarity values generated for each input component, with the corresponding identified labels. The overall results are given in Table 3. The total efficiency of the system is about 94% and can be increased by strengthening the symbolic knowledge base. The developed system is robust enough for use in practical situations. The system has achieved an average component-wise address identification efficiency of 94.68%.

Table 2. Result of Address Component Identification

Input Address: Mr. Bhosale Chandra, Near Daddennavar Hospital, Extension Area, Bagalkot, 587101

Component | Similarity measure with label (highest) | Similarity measure with label (second highest) | Alpha cut set | Assigned Label | Confidence of decision
Mr | 0.228, Salutation | 0.1, Addressee | {Salutation} | Salutation | 100
Bhosale Chandra | 0.148, addressee | 0.1, Care of Name | {Addressee} | Addressee | 100
Near Daddennavar Hospital | 0.228, Landmark | 0.1, Care of Name | {Land Mark} | Land Mark | 100
Extension Area | 0.278, Area | 0.093, Landmark | {Area} | Area | 100
Bagalkot | 0.228, Place | 0.114, PIN | {Place} | Place | 100
587101 | 0.114, PIN | 0.1, State | {PIN} | PIN | 100

Input Address: Shri, S K Deshpande, “Padmakunja”, 15th Cross, Moonlight Bar, Vidyagiri, Bagalkot, 587102

Component | Similarity measure with label (highest) | Similarity measure with label (second highest) | Alpha cut set | Assigned Label | Confidence of decision
Shri | 0.228, Salutation | 0.1, Addressee | {Salutation} | Salutation | 100
S K Deshpande | 0.123, Addressee | 0.1, Care of Name | {Addressee} | Addressee | 100
Padmakunja | 0.186, House Name | 0.1, Care of Name | {House Name} | House Name | 100
15th Cross | 0.119, Road Number | 0.1, Care of Name | {Road Number} | Road Number | 100
Moonlight Bar | 0.93, Landmark | 0.86, Postbox | {Landmark, PostBox} | Landmark | 52
Vidyagiri | 0.186, Areaname | 0.1, State | {Areaname} | Areaname | 100
Bagalkot | 0.2, place | 0.1, State | {State} | State | 100
587102 | 0.126, Pincode | 0.107, Post | {Pincode} | Pincode | 100
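As a worked check of the one row in Table 2 whose α-cut set has two members (Moonlight Bar), the arithmetic below applies the α-cut rule with DFC = 0.1, read as α = S0 − DFC · S0, followed by the confidence formula (10). The subscript names are shorthand introduced here for readability.

```latex
\begin{align*}
\alpha &= S_0 - DFC \cdot S_0 = 0.93 - 0.1 \times 0.93 = 0.837 \\
S_{\text{Postbox}} &= 0.86 > 0.837
  \;\Rightarrow\; \text{cut set} = \{\text{Landmark}, \text{Postbox}\} \\
C_{\text{Landmark}} &= \frac{0.93}{0.93 + 0.86} \times 100 \approx 51.96 \approx 52 \\
C_{\text{Postbox}} &= \frac{0.86}{0.93 + 0.86} \times 100 \approx 48.04
\end{align*}
```

The Landmark confidence is just above the 50% threshold, so the label is assigned without manual resolution, in agreement with the tabulated value of 52.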

Table 3. Overall Results of Address Component Identification

The columns "All 100%", ">75% and <100%" and "<75%" give the confidence of component labelling.

Sl. No | Particulars | All 100% | >75% and <100% | <75% | Percentage of total addresses (=500)
1 | Correctly labeled addresses | 399 | 70 | 01 | 94
2 | Addresses with one incorrectly labeled component | 18 | 02 | 03 | 4.6
3 | Addresses with two or more incorrectly labeled components | 05 | 01 | 01 | 1.4


6 Conclusions

The fuzzy symbolic methodology for address component labelling presented in this paper has addressed one of the very important sub tasks of integrated postal automation, namely extracting and labelling of postal address components. These labelled address components form a symbolic address object, which can be further used in address interpretation and mapping to the mail delivery point. It employs symbolic similarity measures for address component labelling, which is treated as fuzzy membership function. The fuzzy alpha cut method is employed for de-fuzzification and deciding on the label of components with confidence value. The inference methodology suggested here is an important prior step for postal address interpretation and dynamic optimal route generation for delivery of mail.

References

1. Giovani Garibotto, 2002, “Computer Vision in Postal Automation”, Elsag Bailey-TELEROBOT, 2002.

2. P. Nagabhushan, 1998, “Towards Automation in Indian Postal Services: A Loud Thinking”, Technovision, Special Volume, pp 128-139

3. M.R. Premalatha and P. Nagabhushan, 2001, “An algorithmic prototype for automatic verification and validation of PIN code: A step towards Postal Automation”, NCDAR, 13th and 14th July 2001, PESCE Mandya, India, pp 225-233

4. Srirangaraj Setlur, A. Lawson, Venu Govindaraju and Sargur N. Srihari, 2001, “Truthing, Testing and Evaluation Issues in Complex Systems”, Sixth IAPR International Conference on Document Analysis and Recognition, Seattle, WA, pp 1205-1214

5. Sargur N. Srihari, Wen-jann Yang and Venugopal Govindaraju, 1999, “Information Theoretic Analysis of Postal Address Fields for Automatic Address Interpretation”, ICDAR-99, Bangalore, India, pp 309-312

6. Fabrizio Sebastiani, 2002, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol 34, No. 1, pp 1-47

7. http://www.bitboost.com/ref/international-address-formats.html

8. Universal Postal Union Address Standard, “FGDC Address Standard Version 2”.

9. Bock H.-H., Diday E., 2000, “Analysis of Symbolic Data”, Heidelberg 2000

10. P. Nagabhushan, S.A. Angadi, B.S. Anami, 2005, “A Symbolic Data Structure for Postal Address Representation and Address Validation through Symbolic Knowledge Base”, Premi 05, 18-22 December 2005, Kolkata, India, Springer Verlag, LNCS 3776, pp 388-393

11. P. Nagabhushan, S.A. Angadi, B.S. Anami, 2005, “A Knowledge-Base Supported Inferencing of Address Components in Postal Mail”, NVGIP 05, 2nd and 3rd March 2005, JNNCE, Shimoga, India

12. K. Chidanada Gowda, 2004, “Symbolic Objects and Symbolic Classification”, Proceedings of International Conference on Symbolic and Spatial Data Analysis: Mining Complex Data Structures, Pisa, September 20th, 2004, pp 1-18
