
Data for Machine Learning

Data generation and simulation of a logistics operation for machine learning

Erik Hedman

Bachelor of Science in Engineering, Computer Game Development

2017

Luleå tekniska universitet

Institutionen för system- och rymdteknik

Abstract

In the logistics business, a priority is to deliver packages to the right place at the right time. Mistakes can happen in any task where a human makes a decision. In this project, a simulation of a logistics operation is developed and used to generate data for machine learning algorithms. This project is one part of a bigger project. The algorithm will be trained to discover abnormalities in the flow of packages, with the goal of reducing the number of wrongly handled packages. Machine learning algorithms and their training are parts of the bigger project and will not be covered in this paper. This project was proposed by the IT consulting company Data Ductus.

Summary

A priority in the logistics industry is to deliver packages to the right place at the right time. When a human makes a decision about a task, mistakes can happen. In this project, a simulation of a logistics system is developed that generates data to be used by a machine learning algorithm. This project is part of a larger project. The algorithm is to be trained to detect deviant behaviour in the flow of packages, with the goal of reducing the number of incorrectly handled packages. Machine learning algorithms and the training of algorithms are parts of the larger project and will not be explained in this article. This project was proposed by the IT consulting company Data Ductus.

Abbreviations and Terms

AI – Artificial Intelligence

IoT – Internet of Things

IT – Information Technology

Contents

1 Introduction

1.1 Data Ductus

1.2 Goals & Purpose

1.3 Limitations

1.4 Background

1.4.1 Artificial Intelligence

1.4.2 Machine Learning

1.4.3 The Swedish Mail Format

1.5 Social, Ethical and Environmental Considerations

1.6 Method

1.6.1 Python

2 Design and Implementation

2.1 Setting up the database

2.2 Generating the data

2.3 Simulating the data

3 Results

3.1 The database

3.2 Data generator

3.3 Simulation

3.4 Result summary

4 Discussion

5 Conclusion

Appendix


1 Introduction

Logistics is the implementation of an operation, i.e. the management of the flow of packages between a start and an end location. The goal of the operation is to get the packages to the right location at the right time. If a package is delivered to the wrong location, the business risks losing its customers' trust. A late package can lead to a bad reputation. Thus, making sure packages get to the right place at the right time is crucial to the operation.

In the postal system, each package goes through several steps. First, the package is prepared with a name and address. It is then placed in a mailbox that is later emptied by a postal worker, who delivers the contents to the closest mail terminal. The packages are sorted twice, once by the first three digits of the postal code and once by the last two. The first sort is a rough sort and the second is a more precise one. The package is then sent to one of the 750 dispensing offices and later on to the destination mailbox [13].

Within this system mistakes can happen [5], which can lead to delivery delays or even misplaced packages. A product that can decrease the number of wrongly handled packages would support such a business. The product must be able to understand the logistics operation and how the packages should be handled, in order to recognise when a mistake is made and how to correct it. To avoid coding a solution for every mistake that can be made, an application that can learn is needed. This sort of software, if done correctly, could be applied to many different logistics businesses.

In recent years, machine learning has become more popular for making predictions from data [6], due to the computing power of current computers. Creating this kind of artificial intelligence requires data for training and testing. The data, in this case, will be in the form of letters with the Swedish mailing address format. When using only manufactured data to train a model, it is important that the data is created to look as real as possible. Otherwise, the transition from simulation to real-world situations can be difficult.

This report is based on a project done for the IT consulting firm Data Ductus with the goal of improving logistics operations using software that can detect and correct mistakes made in a supply chain.

1.1 Data Ductus

Data Ductus is a multinational IT consulting firm specialised in technically advanced solutions. They help their customers succeed with their businesses by combining deep technical expertise and business knowledge, creating tailor-made solutions with tangible benefits.

Their services include system development and integration, network service management solutions

& orchestration, IoT-expertise, as well as operation, management and support.

They can provide services within a vast range of industries due to their highly skilled engineers and

project managers and are known for their ability to adapt and meet changing business requirements

quickly.

With offices in both Sweden and the US, they have offered their services globally since 1989. Their customers range from large international groups to small start-ups.

1.2 Goals & Purpose

The goals of this project were to:

1. Generate data in the form of letters.

2. Create a simulation of a flow of packages.

3. Feed the generated data to the simulation.

4. Merge the system with a model that observes and learns from the simulation. This model is the product of two other projects.

5. Make the simulation work for both batch and real-time streaming.

The purpose of the project is to create software that can find and correct errors in a logistics system. These errors are the outcome of human mistakes. The software will be designed as machine learning software, i.e. a model that will learn how the system works and what a correctly sent letter looks like versus an incorrectly sent one.

The model will be created such that it can be applied to real-world logistics operations. Thus, the software should reduce the number of wrongly handled packages.

1.3 Limitations

The main limitations were the time available and the lack of prior knowledge about the subject. The implementation time was greatly reduced by all the research that was necessary to understand the subject and how to go about the implementation. Since this is one part of a bigger project, it also had to be compatible with the other parts; the format of the data input was influenced by the other parts of the project.

1.4 Background

This section covers background information on the most important concepts used in this paper. A big subject such as machine learning requires a bit of background to understand.

1.4.1 Artificial Intelligence

A machine that can accomplish tasks that require intelligence when done by humans is often seen as artificial intelligence [8]. The Turing test is an approach to defining intelligence [1]. The test consists of a human interrogator and a machine. The machine must answer some written questions. If the interrogator can’t tell whether the responses come from a computer or a human, then the computer passes the test. The computer requires four skills to pass the test:

Natural language processing to communicate with the interrogator.

Knowledge representation to remember what is observed.

Automated reasoning to use its memory to answer questions.

Machine learning to detect patterns and adapt to new circumstances.

The Turing test avoids physical contact on purpose because physical contact is unnecessary for displaying intelligence. The total Turing test is an extension of the original Turing test which adds a video signal so that the interrogator can test the subject’s visual abilities; this test also includes a hatch for the interrogator to pass objects to the subject. To pass the total Turing test, the machine will need two additional attributes:

Computer vision to see objects that the interrogator hands the subject.

Robotics to receive and manipulate the objects that are given from the interrogator.

The six attributes described above cover most of AI, although AI researchers have spent little time focusing on the test. It is believed to be more important to understand the underlying principles of intelligence than to design a machine specifically made to complete these tasks [1].


1.4.2 Machine Learning

Machine learning is everywhere. Spam filters, recommender systems and self-driving cars are examples of what machine learning can accomplish. Simply put, machine learning is a machine improving its performance on future tasks after making observations about its environment. Building a machine learning solution is a complex task, so following a specific workflow is a good idea. A usual workflow consists of the following:

1. Defining the problem

Describe the problem and list similar problems and assumptions. Explain why the problem needs to be solved. Describe how to solve the problem.

2. Preparing the data

Search for the available data, see what can be removed and if something is missing. Get the data in the right format. Scale the data if needed.

3. Spot checking algorithms

Test a lot of different algorithms to check which work for your data.

4. Improving the results

After the spot check, run an analysis on the parameters of the top algorithms, to push the algorithms to the limit.

These four steps can be applied to most machine learning problems. The first advantage of building a machine learning application is its learning capability. If trained on some datasets, the program will eventually learn to represent the data as different features [14]. The old approach, in which a data scientist analyses the data and defines the features manually, requires more time in some cases and might not be possible. In recent years, machine learning has been used to find relevant features in otherwise tangled datasets. Such feature finding can be used, for example, in face recognition and speech recognition.

Another advantage is parameter tuning. An advanced neural network can have more than a million tuneable parameters. A human couldn’t possibly fine-tune such a large number manually to find the optimal parameters. Therefore, learning algorithms such as gradient descent can be used to find the best tuning. A disadvantage of machine learning is that there is no guarantee that all problems can be solved with a machine learning algorithm.
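As an illustration of how gradient descent tunes a parameter automatically, the following is a minimal sketch on a toy problem with a single parameter. The function name, the learning rate and the toy loss are assumptions chosen for the example and are not taken from the project.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# Toy loss f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
best = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(best, 4))
```

A real model applies the same update to every parameter at once, which is what makes tuning millions of parameters feasible.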

The large amount of data sometimes required to train a model can be troublesome to work with or collect. Fortunately, machine learning algorithms vary widely in complexity: some require less data and some require more.

1.4.3 The Swedish Mail Format

There are five attributes that describe where the letter should arrive: name, street, street number, zip code and city [9].


Figure 1: Letter case.

An explanation of the attributes in Figure 1:

The name is the name of the receiver, in this case “Mottagare Mottagarsson”.

The street is the specified street for the letter to arrive at, which is directly connected with the last two digits of the zip code.

The street number specifies which number on the street the letter should arrive at; this is the end of the delivery.

The zip code is the attribute with the most information. The first two digits describe a city/area, the third gives the delivery form and the last two digits specify a bundle of streets [10].

The city/area is the city/area that the letter will arrive at; this attribute relates to the first two digits of the zip code.

Sometimes the sender's details are also written on the letter. These can be in the same form as the receiver's.
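The structure of the zip code described above can be sketched in a few lines of Python. This is only an illustration: the names `ZipCodeParts` and `split_zip_code` are hypothetical, and the example zip code is an arbitrary five-digit value.

```python
from typing import NamedTuple


class ZipCodeParts(NamedTuple):
    region: str         # first two digits: city/area
    delivery_form: str  # third digit: delivery form
    street_bundle: str  # last two digits: bundle of streets


def split_zip_code(zip_code: str) -> ZipCodeParts:
    """Split a five-digit zip code into the three parts described above."""
    digits = zip_code.replace(" ", "")  # accept both "97187" and "971 87"
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"expected five digits, got {zip_code!r}")
    return ZipCodeParts(digits[:2], digits[2], digits[3:])


parts = split_zip_code("971 87")
print(parts.region, parts.delivery_form, parts.street_bundle)
```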

1.5 Social, Ethical and Environmental Considerations

Machine learning is a hot topic of discussion, primarily because of automation. Automation has already replaced a lot of physical labour, but most of the machines that replaced those jobs were programmed for a specific task. Therefore, only jobs that are predictable in nature could be taken over. In more recent years a lot of research has been done on “intelligent” machines. This is where machine learning enters the picture.

The creation of smarter software will result in humans being replaced in even more advanced jobs. This is a huge problem for the economy if no solution is found. Automation is inevitable and society must adapt to it. The author of an article [11] mentions one approach, a fully automated economy in which everyone has a guaranteed basic income. In the same article, the author also mentions the Peltzman effect as a counterargument to the claim that lazy people will not do any work if they get money anyway. An article [12] describes the effect as follows: “The Peltzman Effect is the hypothesised tendency of people to react to a safety regulation by increasing other risky behaviour, offsetting some or all of the benefit of the regulation.”


All data that is used to train the model is manufactured, so that no one's name or address is compromised.

1.6 Method

The implementation process for this project consisted mostly of research on different subjects within machine learning and AI. The first phase was to understand what machine learning is and how it can be applied to logistics. This included deciding which machine learning library to use and how to use it. The next step was to decide what type of data to generate and how to introduce human mistakes into the data. The last two steps of the project were to build the simulation of the data flow and merge the system with the AI.

1.6.1 Python

After the research phase of the project, it was necessary to create a prototype as fast as possible with the machine learning libraries available. Python was chosen because it allows code to be implemented quickly and has applicable machine learning libraries, such as Scikit-Learn and TensorFlow. We ended up using Scikit-Learn.


2 Design and Implementation

The general design of the project is to generate data in the form of letters, simulate the letters in a logistics operation and save data to be used by a machine learning algorithm. The letters get their properties from a database containing zip codes, cities and other information needed for a letter. Errors representing human mistakes are introduced in the simulation so that the AI can learn what a mistake can look like. The project can be divided into three parts, which are explained in more detail below.

Figure 2: The general design concept for the simulation. Output 1 is the raw data generated.

Output 2 is in the form of dispatched letters.

2.1 Setting up the database

The database consists of a list of names [15], a list of cities with their zip codes and a list of cities with their streets. The database is built to make expansion simple, so that it is easy to add more locations and names. A list of links for the post centres used in the simulation is also saved in the database.

2.2 Generating the data

Data is generated from the database in the form of letters in the Swedish mail format. The generator loads the configuration from the database as shown in Figure 2. Four arguments are required: the number of letters to be generated, the percentage of critical errors, the percentage of non-critical errors and the output file name. Critical errors are “mistakes” that make the letter defective so that it can’t be sent. Letters with non-critical errors can still be sent in the simulation. The letters with critical errors are tagged so that the AI can learn to sort out inadequate letters. The following errors are introduced in the data:

Missing zip code (critical)

Missing street (critical)

Missing street number (critical)

Wrong zip code length (critical)

Missing name (non-critical)

Missing city (non-critical)
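The distinction between critical and non-critical errors can be expressed as a small check. The sketch below is an assumption about how such a check could look, not the project's code; the dictionary keys and function names are hypothetical, and a zip code is taken to be valid when it has five characters.

```python
def critical_errors(letter: dict) -> list:
    """Return the critical errors found on a letter (empty list if none)."""
    errors = []
    zip_code = (letter.get("zip_code") or "").replace(" ", "")
    if not zip_code:
        errors.append("missing zip code")
    elif len(zip_code) != 5:
        errors.append("wrong zip code length")
    if not letter.get("street"):
        errors.append("missing street")
    if not letter.get("street_number"):
        errors.append("missing street number")
    return errors


def is_legitimate(letter: dict) -> bool:
    """A letter can be sent as long as it has no critical error."""
    return not critical_errors(letter)


# Name and city are missing, but these errors are non-critical.
letter = {"name": "", "street": "Storgatan", "street_number": "12",
          "zip_code": "97187", "city": ""}
print(is_legitimate(letter), critical_errors(letter))
```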

To explain one iteration of the “Data Generator” part in Figure 2, we start with a list of addresses that were generated from the database. First, letters are generated by randomly picking an address from the list and the name of someone who “lives” there. A user-defined number of letters will then be tampered with. For each error to be generated, a letter that does not already have an error is picked at random. When an unaltered letter is found, the information on that letter is changed by one of the available errors. The data is now fully generated and is sent to the simulation.
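The generation step described above could be sketched as follows. This is a simplified illustration under several assumptions: each address record already carries a name, all field and function names are hypothetical, the output file argument is left out, and `random.sample` is used to pick distinct letters instead of repeatedly drawing until an unaltered letter is found. The two percentages must not add up to more than 100.

```python
import random

CRITICAL = ["missing_zip", "missing_street", "missing_street_number", "wrong_zip_length"]
NON_CRITICAL = ["missing_name", "missing_city"]


def apply_error(letter, error):
    """Alter one field of the letter according to the chosen error."""
    if error == "missing_zip":
        letter["zip_code"] = ""
    elif error == "missing_street":
        letter["street"] = ""
    elif error == "missing_street_number":
        letter["street_number"] = ""
    elif error == "wrong_zip_length":
        letter["zip_code"] = letter["zip_code"][:-1]  # drop the last digit
    elif error == "missing_name":
        letter["name"] = ""
    elif error == "missing_city":
        letter["city"] = ""


def generate_letters(addresses, count, critical_pct, non_critical_pct, rng):
    """Generate letters and tamper with a user-defined share of them."""
    # Each letter copies a randomly chosen address record.
    letters = [dict(rng.choice(addresses), legitimate=True) for _ in range(count)]
    n_critical = count * critical_pct // 100
    n_non_critical = count * non_critical_pct // 100
    # sample() returns distinct indices, so no letter receives two errors.
    chosen = rng.sample(range(count), n_critical + n_non_critical)
    for i in chosen[:n_critical]:
        apply_error(letters[i], rng.choice(CRITICAL))
        letters[i]["legitimate"] = False  # tag letters that cannot be sent
    for i in chosen[n_critical:]:
        apply_error(letters[i], rng.choice(NON_CRITICAL))
    return letters


addresses = [{"name": "Mottagare Mottagarsson", "street": "Storgatan",
              "street_number": "1", "zip_code": "97187", "city": "Kiruna"}]
letters = generate_letters(addresses, 100, 20, 10, random.Random(0))
print(sum(not l["legitimate"] for l in letters), "letters tagged as defective")
```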

2.3 Simulating the data

The simulation is built such that each city is represented by a post centre and each post centre is connected to the city’s streets. Links between post centres, which decide where the post centres can send their letters, are loaded from the database. Letters are loaded into the system from the generated data. The system removes all invalid letters and sends each legitimate letter to a random post centre. This step represents the mailbox discharge, where a postal worker empties a mailbox and delivers the mail to the closest sorting terminal. Within each post centre, the letters are sorted by zip code.

Figure 3: The design of the main simulation loop.

If the first two digits of the zip code match the post centre’s zip code digits, the letter is marked to be sent to its delivery address in the city. If they don’t match, the letter is marked to be sent to another post centre with the correct zip code.
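The sorting rule above amounts to a prefix comparison. A minimal sketch, assuming each post centre is identified by a two-digit zip code prefix (the function name and return values are hypothetical):

```python
def sort_letter(letter_zip: str, centre_prefix: str) -> str:
    """Decide whether a letter stays in this post centre's city or is forwarded."""
    if letter_zip.replace(" ", "")[:2] == centre_prefix:
        return "deliver_locally"
    return "forward_to_other_centre"


print(sort_letter("971 87", "97"))
print(sort_letter("111 20", "97"))
```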

The system loops through all the post centres’ sorted letters and, with an error chance that is decided by the user, creates a “mistake”. The following errors can be manufactured:

The marked address is changed to a different street within the city

The marked address’s street number is changed to a different random number

The marked address is changed to a random different sorting centre

Sorted letters that are tampered with are flagged so that the AI can learn that the delivery is incorrect. The sorted letters are saved in a file for the AI to observe. All sorted letters are sent to their marked address, whether it is the correct one or not. This process is repeated several times, as Figure 3 shows. The number of iterations is defined by the user.
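How a sorting mistake is introduced and flagged could be sketched as below. This is an illustration only: the field names, the `max_number` parameter and the use of the end city to represent the destination sorting centre are assumptions. Each list of alternatives must contain at least two entries so that a different value can be chosen.

```python
import random


def maybe_tamper(transfer, error_chance_pct, streets, centres, max_number, rng):
    """Return a copy of the transfer, possibly with a wrong destination."""
    result = dict(transfer, correct_delivery=True)
    # rng.random() is in [0, 1), so a chance of 0 never tampers and 100 always does.
    if rng.random() * 100 >= error_chance_pct:
        return result
    kind = rng.choice(["street", "street_number", "centre"])
    if kind == "street":
        options = [s for s in streets if s != result["end_street"]]
        result["end_street"] = rng.choice(options)
    elif kind == "street_number":
        options = [n for n in range(1, max_number + 1)
                   if n != result["end_street_number"]]
        result["end_street_number"] = rng.choice(options)
    else:
        options = [c for c in centres if c != result["end_city"]]
        result["end_city"] = rng.choice(options)
    result["correct_delivery"] = False  # flag the transfer as incorrect
    return result


transfer = {"end_street": "Storgatan", "end_street_number": 5, "end_city": "Kiruna"}
print(maybe_tamper(transfer, 50, ["Storgatan", "Kungsgatan"],
                   ["Kiruna", "Boden"], 200, random.Random(42)))
```

Returning a copy keeps the original marking available, which makes it easy to compare the intended and the tampered destination.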

The simulation of one letter in one loop, as described in Figure 3, is explained here. The first step is the data generator. The size of the address list is checked and a random number between 0 and the size of the list is used to choose the destination of the letter. The generator randomly adds errors to the generated letters; no errors were generated in this case. The system checks whether the letter has adequate information; if so, the letter is distributed to one of the existing sorting centres. The receiving sorting centre is chosen at random.


The letter is sorted by its zip code in the sorting centre. Additional information is added to the letter: the current location and the end location. An error in sorting can be made; the chance of this happening is user defined. In this example the letter gets an error, and one of the three kinds of error is chosen at random. An error in the form of a wrong city is generated, so the data on the letter is changed to match a random different city. The data that is changed is the newly added end location. The post centre now sends the letter to the wrong city. The sent letter is saved as data for the machine learning algorithm. The letter arrives in the wrong city and goes through the same process again. This time no errors were generated in the sorting process. The letter is sent to the correct city. Data is saved for the machine learning algorithm. Arriving at the correct city’s sorting centre, the letter is sent to the right address and is correctly delivered. Data is saved. Each time letters are sent from the sorting centres, data is saved to a file for examination by the machine learning algorithm.


3 Results

In this section, the results of the project are described in three parts: the database, the data generator and lastly the simulation.

3.1 The database

The database consists of 100 different names, 21 cities and 44 streets. The number of street numbers can be set by the user. With 200 street numbers, the total number of unique addresses is 184 800.
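The stated total is consistent with multiplying the number of cities, streets and street numbers, which assumes that every street is available in every city and that names are not part of the count. A short check of the arithmetic:

```python
cities = 21
streets = 44          # assumed to be available in every city
street_numbers = 200  # user-defined in the simulation

unique_addresses = cities * streets * street_numbers
print(unique_addresses)  # 184800
```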

3.2 Data Generator

Data is generated and saved as a CSV file. One file is saved per loop, and the data saved to the file is used by the machine learning model. The file consists of a header that describes each column; each line under the header represents one letter. The error percentage can be set between 0 and 100.

3.3 Simulation

The simulation can load data that is generated in the form of letters. Letters that are loaded into the system are simulated as a logistics operation. An error chance can be set between 0 and 100 and determines the chance that a letter is sent to the wrong address from a post centre; all post centres have the same percentage. The letter then gets a different end location than the address on the letter. The simulation saves all letters sent in the system to a CSV file. One file is saved per loop, and the data saved to the file is used by the machine learning model. The data saved for each transfer is the following:

Name

Surname

Street

Street number

Zip code

City

Legitimate - Boolean

Start street

Start street number

Start zip code

Start city

End street

End street number

End zip code

End city

Correct Delivery – Boolean

The first seven parameters describe the information on the letter that is sent. The next four parameters give the start position of the letter. Parameters with the “End” prefix give the position the letter will have after the delivery; this may not be the correct position. The last parameter is a true or false value that says whether the transfer is a correct one or not.
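A file with this layout could be written with Python's standard `csv` module. The sketch below is an assumption about the format: the column identifiers are hypothetical spellings of the sixteen parameters listed above, and an in-memory buffer stands in for the output file.

```python
import csv
import io

COLUMNS = [
    "name", "surname", "street", "street_number", "zip_code", "city", "legitimate",
    "start_street", "start_street_number", "start_zip_code", "start_city",
    "end_street", "end_street_number", "end_zip_code", "end_city",
    "correct_delivery",
]


def write_transfers(transfers, out_file):
    """Write a header line followed by one line per transfer."""
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(transfers)


# One example transfer; fields that are not set are left empty.
row = dict.fromkeys(COLUMNS, "")
row.update(name="Mottagare", surname="Mottagarsson", legitimate=True,
           correct_delivery=False)

buffer = io.StringIO()
write_transfers([row], buffer)
print(buffer.getvalue())
```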

3.4 Result Summary

The tests used a data set of 100 000 letters with 20% critical errors, 10% non-critical errors and an error chance in the simulation of 50%. With these settings, the simulation was tested ten times with ten iterations of the main loop. Table 1 shows the relation between time and the number of letters generated. The relation between time and the number of letters simulated in one loop is shown in Table 2. The ten test runs can be found in an attached file named “Sim10Iterations.xlsx”. The average time for each iteration is given in Table 3.

The tests show that the data generator and the simulation work and fulfil almost all the requirements from the problem formulation. Data is generated and simulated with errors introduced. The data can be applied to a machine learning algorithm in batch. The functionality that was not finished before the deadline was real-time streaming.

Size       Time
10000      0.162
100000     1.65
1000000    16.5

Table 1: The left column “Size” is the number of letters generated; the right column “Time” shows the time it took to generate the data in seconds.

Size       Time
10000      0.316
100000     3.35
1000000    39.5

Table 2: The left column “Size” is the number of letters simulated; the right column “Time” shows the time in seconds it takes for the simulation to do one iteration, i.e. send all letters in the system one time.
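The timings in Tables 1 and 2 can be turned into an approximate throughput to see how the two steps scale. The following sketch uses only the values from the tables; the variable and function names are chosen for the example.

```python
# Seconds per data set size, copied from Table 1 and Table 2.
generation = {10_000: 0.162, 100_000: 1.65, 1_000_000: 16.5}
simulation = {10_000: 0.316, 100_000: 3.35, 1_000_000: 39.5}


def letters_per_second(timings):
    """Convert {size: seconds} into {size: letters handled per second}."""
    return {size: round(size / seconds) for size, seconds in timings.items()}


print("generation:", letters_per_second(generation))
print("simulation:", letters_per_second(simulation))
```

The generation rate stays close to 60 000 letters per second at every size, while the simulation rate drops as the data set grows.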

Iteration  Time
1          5.99
2          8.33
3          10.4
4          9.53
5          11.9
6          10.3
7          12.4
8          10.1
9          10.1
10         12.7

Table 3: Times for each main loop iteration using a data set size of 100 000. This table is the average of 10 test runs.


4 Discussion

One thing that worked well in this project was the communication with the two other projects, which built the machine learning model. We had a lot of fruitful discussions about how the pieces should be put together. This was especially valuable in the research phase, where we gave each other suggestions about good research material.

The amount of time spent on research was so large that it held back the implementation and development of the simulation; on the other hand, the research was necessary to understand a problem as complex as machine learning. All the time that was invested early in the research phase made it easier to understand what sort of data a machine learning algorithm can use. I did so much research because I wanted to learn as much as possible about machine learning: I felt that I could get insight into how to format the data, and I didn’t want to miss out on better ways to work with machine learning. I ended up doing more research than needed for my specific task, but I see it not as wasted time but as good self-improvement.

A lot of improvements can be made to this project. First, the structure of the database was not thought through well enough. The database can be redesigned to be easier to understand and more efficient. A simple improvement would be to rework the different files into a single file with a more logical structure. The list of names is currently locked to full names, which should be split into two categories, first name and surname. This was considered during the implementation, but improving it was not a priority. The letter generation works well and its functionality requirements are met.

Using the solution is somewhat cumbersome: if the user wants to tune any of the parameters of the data generator or the simulation, it is necessary to go into the code. This can be solved by implementing a user interface, which would make the solution easier to use. With such an improvement, other functions could also be introduced, such as easier integration with different machine learning algorithms and more options for how the data should be saved.

In comparison to a real logistics operation, the simulation is small and does not contain much detail. A lot of improvements can be made in this area; shipping vehicles can be added to the system so that delivery time can be accounted for in the simulation. This small change makes the system much more complex and can hopefully lead to a more applicable AI. Transport timings and transport damage could be monitored. Fuel optimisation problems could be examined.

4.1 Future Work

If I were to continue working on the project, I would add some additional functions. One feature I would add is errors in the form of misreading. For example, if a letter had a number one or seven, the character could be “badly written” and misinterpreted, and as a result the letter would be sent to the wrong address.

Automation of the whole process would be implemented. As mentioned earlier in the thesis, the link between the simulation and the AI only works in batch mode. Adding the machine learning algorithm to the main loop could automate the process. Another feature that I want to implement is the sender’s details on the letters. Right now the letters only contain information about the receiver, so the AI cannot learn any relation between receiver and sender.


5 Conclusion

The project was on the right track, but it needs some more work to fulfil all its goals. The simulation works in batch mode and the generated data can be applied to a machine learning algorithm, but since real-time streaming, which was one of the goals, is not implemented in the simulation, the project cannot be considered finished. A bit more work is needed to develop the simulation further so that the solution can be used for incremental learning. In this project, a logistics operation simulation is introduced. The simulation generates data as letters and letter deliveries. The data produced is later used by a machine learning algorithm to learn how to correctly send a letter.


References

[1] Russell S, Norvig P. (2010). Artificial Intelligence: A Modern Approach (3rd edition). New Jersey: Pearson Education.

[2] Fasli M. (2014). Analyzing and modeling complex and big data | Professor Maria Fasli | TEDxUniversityofEssex. TEDx Talks. Available at: https://www.youtube.com/watch?v=8DqQCZMawNg [Accessed 31/03 2017].

[3] Creative Punch. (2014). Artificial Dataset Generation for Machine Learning with Python and Numpy / Theano. Available at: http://creative-punch.net/2014/08/artificial-dataset-machine-learning-python/ [Accessed 07/04 2017].

[4] Rief M, Shafait F, Dengel A. (2012). Dataset Generation for Meta-Learning. Available at: http://www.dfki.de/KI2012/PosterDemoTrack/ki2012pd15.pdf [Accessed 07/04 2017].

[5] Human Error. Available at: https://en.wikipedia.org/wiki/Human_error [Accessed 12/04 2017].

[6] Foote K. (2016). Machine Learning: From Then Until Now. Available at: http://www.dataversity.net/machine-learning-now/ [Accessed 18/04 2017].

[7] Brownlee J. (2013). How to Prepare Data For Machine Learning. Machine Learning Process. Available at: http://machinelearningmastery.com/how-to-prepare-data-for-machine-learning/ [Accessed 18/04 2017].

[8] Copeland J. (2000). What is Artificial Intelligence? Available at: http://www.alanturing.net/turing_archive/pages/reference%20articles/what%20is%20ai.html [Accessed 20/04 2017].

[9] Swedish Standards Institute. Brevets yttre. Available at: http://www.sis.se/Documents/TK/TK%20322/Brevets_Yttre.pdf [Accessed 26/04 2017].

[10] Postnummer i Sverige. Available at: https://sv.wikipedia.org/wiki/Postnummer_i_Sverige [Accessed 26/04 2017].

[11] Ford M. (2015). Rise of the Machines: The Future has a Lot of Robots, Few Jobs for Humans. Available at: https://www.wired.com/brandlab/2015/04/rise-machines-future-lots-robots-jobs-humans/ [Accessed 27/04 2017].

[12] Specht P. (2007). The Peltzman Effect: Do Safety Regulations Increase Unsafe Behavior? Available at: http://www.asse.org/assets/1/7/fall07-feature02.pdf [Accessed 27/04 2017].

[13] Adminen. (2013). Brevets väg genom postsystemet. Available at: http://www.startsverige.nu/brevets-vag-genom-postsystemet/ [Accessed 04/05 2017].

[14] Bupe C. (2015). What are the advantages and disadvantages of machine learning? Available at: https://www.quora.com/What-are-the-advantages-and-disadvantages-of-machine-learning [Accessed 05/05 2017].

[15] Joe. ListOfRandomNames. Available at: http://listofrandomnames.com/index.cfm?textarea [Accessed 12/05 2017].


Appendix

Sim10Iterations.xlsx