
Classifying and Mining Job Ads Pertaining to the Offshore Sector in Casablanca

Master: Big Data
Research project

"Data Science for Improving Education and Employability in Morocco"

Mehdi ELOUALI
Advisor: Pr. Ghita MEZZOUR


Abstract

Understanding the job market is a major step toward adapting university curricula to job market needs and preparing people to face the challenges of tomorrow. An intuitive way to understand the job market is to classify job ads into their respective categories. This is why I implemented an automated machine learning classification system. The system is based on supervised learning, meaning it is built from a training corpus containing the correct job label for each ad. I used K-Nearest-Neighbor and Naive Bayes classifiers to classify the texts. Text classification is followed by information extraction: we extracted both the salary and the language required by filtering the ads, as well as the most requested programming languages. The results we obtained are quite surprising and go against popular beliefs. We discovered that Spanish is the best paid language in the offshore sector, ahead of French and English, despite their greater popularity. We also found that the most demanded programming language in Morocco is Java, followed by Javascript in second position and "C" in a close third position. The R language comes in eighth position, ahead of Python and C++, showing the effectiveness of the big data promotion campaign initiated by the Moroccan Ministry of Education. This work is a modest contribution to a much bigger project called "Data Science for Improved Education and Employability in Morocco".


Acknowledgments

This work would not have been possible without the kind support and help of many individuals; I would like to extend my sincere thanks to all of them. I am highly indebted to Pr. Ghita MEZZOUR for her guidance and constant supervision, for providing the necessary information regarding the project, and for her support in completing it. I would like to express my special gratitude and thanks to doctoral student Imane KHAOUJA for giving me so much attention and time. I would like to express my gratitude towards my parents and brother for their encouragement, which helped me complete this project. My thanks and appreciation also go to my colleague in developing the project and to the people who have willingly helped me out with their abilities. This work is supported by USAID under grant AID-OAA-A-11-00012. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of USAID.


Contents

1 Introduction

2 Background
  2.1 Bag of words
  2.2 Machine learning roadmap

3 Data
  3.1 Building a labeled corpus
  3.2 Dealing with imbalanced data
  3.3 Curing Imbalanced Training Data

4 Methodology
  4.1 Automatization protocol
  4.2 Creating a text classifier with R

5 Results
  5.1 KNN machine learning algorithm
  5.2 Balancing the data for better performance
  5.3 Oversampling with efficiency
  5.4 Naive Bayes
  5.5 Injecting more data to the NB algorithm
  5.6 Summary

6 Information extraction
  6.1 Top Programming languages in Morocco
  6.2 Is C popularity really declining?
  6.3 Languages popularity in offshore ads
  6.4 Salary in call centers

7 Limitations

8 Related work

9 Future work

10 Conclusion


List of Tables

5.1 KNN confusion matrix
5.2 Confusion matrix for Naive Bayes
5.3 Confusion matrix for Naive Bayes after overfeeding
5.4 Summary

6.1 Average monthly salary summary


List of Figures

2.1 Bag of words approach

5.1 Machine learning pattern

6.1 Most common programming languages in Morocco
6.2 Evolution of popularity scores of programming languages since 2001
6.3 Top Languages popularity in the Moroccan Offshore industry
6.4 Salary in terms of language in call centers


Chapter 1

Introduction

Over the last century, the job market has witnessed constant transformation and, like many other fields, it could not escape the era of digitalization. Although the digitalization of the job market has revolutionized the way recruiters and candidates interact, the amount of data (job ads) generated every day has become so immense that classifying them manually is no longer possible. With the ever increasing availability of electronic documents and the rapid growth of the internet, automatic document categorization has become a key method for organizing information. Machine learning algorithms offer a sustainable solution; the benefits this technology brings to the table are speed and efficiency.

The data we analyze are job ads. I obtained all job ads in the Casablanca region from the team working on the larger project. They used keywords taken from a Finance Ministry report, obtaining around 39,000 job ads collected from several websites. Mining job ads poses many challenges, because such ads are unstructured and use inconsistent terminology. The need to automatically retrieve useful knowledge from this huge amount of textual data, in order to assist human analysis, is real. This is why I chose to develop an automated machine learning classification system to classify offshore ads. Choosing the offshore sector was not a random move: the offshore industry in Morocco is developing very rapidly, and the country is investing massively in offshoring campuses, docks and infrastructure.

Technically speaking, we create a machine learning model using a number of text documents (called a corpus) as input and their corresponding classes/categories (called labels) as output. We used the KNN and Naive Bayes algorithms for text classification in order to pick the best performing one. We discuss the tuning of KNN and NB, in order to optimize their results, in the Results section. The final section, Information extraction, covers the insights we obtained from this work.

Our work is the first step of a long process that aims to gain insight into the Moroccan job market. We focused in particular on the offshore job market: we used machine learning to classify ads as offshore or not, then we analyzed them. This analysis provides pertinent information about average salary, languages and programming languages. The next step could be classifying the offshore ads into ITO (Information Technology Outsourcing) and BPO (Business Process Outsourcing), and also comparing the programming languages used in the offshore industry with those used in the local market.


Chapter 2

Background

In any data science project, the most important ingredient is data. Data are the starting point of any mathematical or statistical study; that being said, a document classification system is fueled by data. Document classification, or document categorization, consists of assigning documents to one or more classes/categories, either manually or algorithmically. Here we classify job ads algorithmically. Document classification is a supervised machine learning technique. Technically speaking, we create a machine learning model using a number of text documents (called a corpus) as input and their corresponding classes/categories (called labels) as output. The model thus generated will be able to assign new texts to those classes.

The current trend in industry worldwide is not to create something new in terms of technology (machine learning algorithms in our case), but rather to find new applications and purposes for these technologies. Machine learning algorithms are the core of our automated classification system, and this is the best way we could find to tackle an immense number of unstructured texts. Classifying job ads automatically remains the main goal of this report.

2.1 Bag of words

The bag-of-words model is a simplified representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
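As a minimal illustration of this representation in R (the two toy ads and the tm calls below are only a sketch, not the exact corpus used in this work):

    library(tm)  # text mining package providing corpora and term-document matrices

    # Two toy "ads"; the real ads are read from files in the later chapters.
    ads <- c("developpeur java et javascript offshore",
             "teleconseiller centre d'appel espagnol")
    corpus <- VCorpus(VectorSource(ads))

    # Each row is a term, each column a document, each cell a raw count:
    # word order is discarded, only multiplicity is kept.
    tdm <- TermDocumentMatrix(corpus)
    inspect(tdm)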

Page 10: Classifying and Mining Job Ads in Casablancaa Reports...and syntaxical errors are also mandatory in order to successfully classify ads and this is where machine learning algorithms

Figure 2.1: Bag of words approach

2.2 Machine learning roadmap

The machine learning classification roadmap can be divided into the following steps:

- Creation of the corpus
- Preprocessing of the corpus
- Creation of the term-document matrix
- Preparing features and labels for the model
- Creating train and test data
- Running the model
- Testing the model

This is the basic scheme of a machine learning text classification system. Text classification aims to assign a text instance to one or more classes in a predefined set of classes. The classification of the ads constitutes an important part of this data science project; indeed, this is the same data that will be analyzed and tested later on. The goal of text classification is to assign a piece of text to one or more predefined classes or categories. The piece of text could be a document, a news article, a job ad, an email, etc. In our particular case the goal is to classify each ad as either an offshoring ad or a non-offshoring ad. To do so, we chose to build a text classification machine learning system.


Chapter 3

Data

I obtained all job ads in the Casablanca region from the team working on the larger project. They used keywords taken from a Finance Ministry report, obtaining around 39,000 job ads. The keywords used are "offshore", "informatique" (computer science), "banque" (bank), "assurance" (insurance), "teleconseiller" (telephone advisor) and "centre d'appel" (call center). The collected job ads also include non-offshore ads. In order to remove such ads, I built a machine learning classifier that automatically distinguishes between offshore and non-offshore job ads in the collected data.

The ads are for the most part unstructured, and the variability in their structure is even greater when switching from one website to another. Addressing grammatical and syntactic errors is also mandatory in order to successfully classify ads, and this is where machine learning algorithms can make the difference. Machine learning algorithms need a large amount of clean text data to learn from before models can be created for the purpose of ad classification.

One of the first steps in working with text data is to preprocess it. It is an essential step before the data is ready for analysis. The majority of available text data is highly unstructured and noisy in nature; to obtain better insights or to build better algorithms, it is necessary to clean the data. Job ad data, for example, is highly unstructured and informal: typos, bad grammar, and unwanted content such as URLs, stop words and expressions are the usual suspects. The main cleaning steps are:

- Escaping HTML characters: data obtained from the web usually contains many HTML entities, which must be removed.
- Decoding data: transforming information from complex symbols into simple, easier to understand characters. Text data may come in different encodings such as "Latin-1" or "UTF-8"; for better analysis it is necessary to keep all data in a standard encoding format. UTF-8 is widely accepted and recommended.
- Blank space lookup: removing extra white space.
- Removal of stop words: when the analysis is driven by word-level features, commonly occurring words (stop words) should be removed, either with a custom list or with predefined language-specific libraries.
- Removal of punctuation: punctuation marks should be handled according to their priority. For example ".", "," and "?" are important punctuation marks that may be retained, while others can be removed.
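A minimal sketch of these cleaning steps in R, assuming the raw ad texts are already loaded into a character vector ads_raw (the variable name and the French stop-word list are illustrative choices, not necessarily those used in the project):

    library(tm)  # provides removePunctuation, stripWhitespace, stopwords, ...

    clean_ads <- function(texts) {
      texts <- gsub("<[^>]+>", " ", texts)           # strip leftover HTML tags
      texts <- iconv(texts, to = "UTF-8", sub = " ") # keep everything in UTF-8
      texts <- tolower(texts)
      texts <- removePunctuation(texts)              # drop low-priority punctuation
      texts <- removeWords(texts, stopwords("french"))
      stripWhitespace(texts)                         # collapse repeated blanks
    }

    ads_clean <- clean_ads(ads_raw)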


3.1 Building a labeled corpus

The machine learning method used in this paper (KNN) has been widely used for text classification; however, no previous project has applied it to classifying job ads. A random sample of 500 job ads concerning online banking, computer science and telecommunications in the region of Casablanca, Morocco, was selected. From there, 170 ads were randomly chosen and manually labeled as offshoring or non-offshoring ads. Next, we randomly assigned 70% of the dataset to training the model and the remaining 30% to testing. The goal of this paper is to automate the classification of job ads, in order to save time and to reduce misclassification rates due to human error. The KNN classifier is based on the assumption that the classification of an instance is most similar to the classification of other instances that are nearby in the vector space. The main computation is the sorting of training documents in order to find the k nearest neighbors of the test document.

3.2 Dealing with imbalanced data

Imbalanced data typically refers to classification problems where the classes are not represented equally. In our example, we have a 2-class (binary) classification problem with 169 instances (rows): a total of 124 instances are labeled with Class-1 and the remaining 45 instances are labeled with Class-2. This is an imbalanced dataset, with a Class-1 to Class-2 ratio of 124:45, or roughly 3:1. A class imbalance problem can occur in two-class classification problems as well as in multi-class classification problems.

3.3 Curing Imbalanced Training Data

We now understand what class imbalance is and why it produces misleading classification accuracy. We can change the dataset used to build our predictive model so that it contains more balanced data. This change is called sampling the dataset, and there is one main method we use to even up the classes: adding copies of instances from the under-represented class, called oversampling. This approach is easy to implement and fast to run.
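A minimal oversampling sketch in R; `train` and its `label` column are illustrative names for the labeled training data frame:

    set.seed(42)
    minority <- train[train$label == "offshore", ]
    majority <- train[train$label == "non offshore", ]

    # Draw random copies of minority-class ads until the classes are even.
    n_extra <- nrow(majority) - nrow(minority)
    copies  <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]

    balanced <- rbind(majority, minority, copies)
    table(balanced$label)   # even classes (the report itself stopped at roughly 1:1.5)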


Chapter 4

Methodology

4.1 Automatization protocol

Automated text classification has multiple applications, such as classifying books in a library catalog or determining the topic of a conference paper; the new application we found for this technology is classifying job ads. The problem can be simplified by converting documents into vectors, also known as term-document vectors. Each dimension in the model represents a word, and each vector has a classification. To classify a test document, the document is converted into a vector and the k nearest training vectors 'vote' to determine the classification of the test document.

4.2 Creating a text classifier with R

This section explains how to create a text classifier with R. We implement a machine learning algorithm which classifies texts as being either an offshoring ad or a non-offshoring ad. What we need first is a folder containing the offshoring and the non-offshoring ads.

After initializing the R session, a vector with the names of the two categories is created. Then a function is written which cleans the texts: it removes punctuation, strips white space, converts everything to lower case and removes stop words. After writing the cleaning function, the term-document matrix is created and the cleaning function is applied to the texts. A term-document matrix is a matrix that describes the frequency of the terms that occur in a collection of documents. Next, a function is written which creates a data frame for each category by combining its term-document matrix with the name of the respective category, and the two data frames are combined into a single one. We are now in a position to separate the data frame into a training and a test dataset: the training data is used to train our classifier, which is then applied to the test data. Finally, we use K-Nearest-Neighbor classification to classify the texts and determine the accuracy of our classifier. A condensed sketch of these steps follows.
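This sketch assumes one plain-text ad per file under two directories named offshore/ and non_offshore/; the directory names and parameter values (such as k = 3) are illustrative, not necessarily the exact ones used in the project:

    library(tm)      # corpus handling and document-term matrices
    library(class)   # knn()

    read_ads <- function(dir) VCorpus(DirSource(dir, encoding = "UTF-8"))
    offshore     <- read_ads("offshore")
    non_offshore <- read_ads("non_offshore")
    labels <- c(rep("offshore", length(offshore)),
                rep("non offshore", length(non_offshore)))

    # Cleaning function: lower case, punctuation, white space, stop words.
    clean <- function(corpus) {
      corpus <- tm_map(corpus, content_transformer(tolower))
      corpus <- tm_map(corpus, removePunctuation)
      corpus <- tm_map(corpus, stripWhitespace)
      tm_map(corpus, removeWords, stopwords("french"))
    }

    dtm <- as.matrix(DocumentTermMatrix(clean(c(offshore, non_offshore))))

    # 70/30 train/test split, as described in section 3.1.
    set.seed(1)
    train_idx <- sample(nrow(dtm), round(0.7 * nrow(dtm)))
    pred <- knn(train = dtm[train_idx, ],
                test  = dtm[-train_idx, ],
                cl    = factor(labels[train_idx]),
                k     = 3)

    conf <- table(Prediction = pred, Actual = labels[-train_idx])
    conf                           # confusion matrix
    sum(diag(conf)) / sum(conf)    # overall accuracy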


Chapter 5

Results

5.1 KNN machine learning algorithm

The first training dataset was created respecting the real-world ratio of offshore to non-offshore ads of 1:3. The global accuracy of the classifier was 90%, but accuracy is not the metric to use when working with an imbalanced dataset. Other metrics have been designed to tell a more truthful story when working with imbalanced classes. For the 1:3 setup, the offshore class showed 62% precision, 57% recall and a 60% F-score. These results are far less impressive than the 90% accuracy could suggest.
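As a quick check, the F-score is the harmonic mean of precision and recall; the snippet below reproduces the figure quoted above up to rounding:

    precision <- 0.62
    recall    <- 0.57
    fscore <- 2 * precision * recall / (precision + recall)
    round(fscore, 2)   # 0.59, i.e. roughly the 60% F-score reported above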

5.2 Balancing the data for better performance

We now work with a total of 214 job ads: we bulked up the under-represented class with copies of job ads, so that we have 90 offshore ads while the remaining 124 ads are non-offshore, a much more balanced ratio of about 1:1.5. After running the algorithm, the best results we got for the offshore class were 96% precision, 96% recall and a 96% F-score, which at first glance suggests that the performance of the classifier has significantly improved. The best results for the non-offshore class are 95% precision, 95% recall and a 95% F-score, which is a big step forward. These very optimistic results should be taken with caution, because the testing data came from the same initial dataset as the training data. We are dealing with a case of overfitting: the same job ads that constitute the training set can appear in the test data, because we added copies during oversampling. This problem can be solved by testing with new data, but another limitation remains, namely the diversity of the ads. Our learning algorithm needs diversity in order to perform well in a wide range of scenarios; adding copies of the same ads can do the trick for balancing the data, but adding new ads when oversampling has much greater benefits.

Page 15: Classifying and Mining Job Ads in Casablancaa Reports...and syntaxical errors are also mandatory in order to successfully classify ads and this is where machine learning algorithms

                     Actual
Prediction           non offshore   offshore
non offshore                   40          4
offshore                        0         16

Table 5.1: KNN confusion matrix
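From such a confusion matrix, the per-class precision and recall discussed above can be derived directly. A generic sketch, where `conf` holds predictions in rows and actual classes in columns, as in Table 5.1:

    class_metrics <- function(conf) {
      precision <- diag(conf) / rowSums(conf)   # correct predictions per predicted class
      recall    <- diag(conf) / colSums(conf)   # correct predictions per actual class
      fscore    <- 2 * precision * recall / (precision + recall)
      rbind(precision, recall, fscore)
    }

    conf <- matrix(c(40, 0, 4, 16), nrow = 2,
                   dimnames = list(Prediction = c("non offshore", "offshore"),
                                   Actual     = c("non offshore", "offshore")))
    class_metrics(conf)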

5.3 Oversampling with efficiency

Adding copies of ads to balance the data and boost the accuracy was not a bad idea, despite its limitations, but adding new job ads is a much better way to do so. The first reason is that adding new ads makes the diversity of the training data much more profitable for the machine learning algorithm; the second reason is that we can reliably test on a portion of the same dataset, because there are no copies (we want to predict ads that were never seen before).

100% of the offshore ads were well classified, and 90% of the non-offshore ads were also classified correctly. For the offshore ads, recall is 80%, precision is 80%, and the F-score is also 80%. For the non-offshore ads, recall is 85%, precision is 90% and the F-score is 87%. These results seem less impressive than those obtained with oversampling by copies, but they are more representative of reality.

5.4 Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. We decided to create a Naive Bayes classifier and to feed it the exact same data that we used for the KNN classifier. After running the algorithm several times we could not pass the 40% accuracy barrier, which was a very disappointing result. We also have to note that the non-offshore class showed 0% in prediction, which was unexpected. In fact, Naive Bayes will not be reliable if there are significant differences in the attribute distributions compared to the training dataset. An important example of this is the case where a categorical attribute has a value that was not observed in training: the model will then assign a probability of zero and be unable to make a prediction.
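A minimal sketch with the e1071 package; the `laplace` argument applies Laplace smoothing, a common remedy for the zero-probability problem described above (the data frame names are illustrative, and this is not necessarily the exact setup behind the reported results):

    library(e1071)   # provides naiveBayes()

    # train_df / test_df: document-term data frames with a factor column `label`.
    nb_model <- naiveBayes(label ~ ., data = train_df, laplace = 1)
    nb_pred  <- predict(nb_model, newdata = test_df)

    table(Prediction = nb_pred, Actual = test_df$label)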


                     Actual
Prediction           non offshore   offshore
non offshore                    0          0
offshore                       36         23

Table 5.2: Confusion matrix for Naive Bayes

5.5 Injecting more data to the NB algorithm

We decided to train the Naive Bayes algorithm with a bigger job dataset consisting of 432 job ads, 192 non-offshore ads and 240 offshore ads, roughly double the number of ads we used in the first test. The results were much better and we achieved 70% accuracy. For the offshore ads, the precision is 80%, the recall is 100%, and the F-score is 89%. For the non-offshore ads, the precision is 63%, the recall is 71%, and the F-score is 67%.

                     Actual
Prediction           non offshore   offshore
non offshore                   69         40
offshore                       18         73

Table 5.3: Confusion matrix for Naive Bayes after overfeeding

5.6 Summary

The overfeeding method with fresh ads significantly improved the performance of both NB and KNN. This is consistent with machine learning theory: the more you train the model, the less error you get, but at a certain point the model hits a plateau, even if you give it more training examples, as shown in Figure 5.1.

               KNN     NB     KNN overfed   NB overfed
Accuracy       90%     40%        95%           70%
Precision      62%      -         85%           72%
Recall         57%      -         83%           86%
F-score        60%      -         84%           78%

Table 5.4: Summary


Figure 5.1: Machine learning pattern


Chapter 6

Information extraction

6.1 Top Programming languages in Morocco

We analyzed 7,000 IT job ads in Casablanca, posted from January 2017 to June 2017, with the purpose of determining the extent of the demand for each programming language. The first step was to create a list of all programming languages in use worldwide. We then used this list as a dictionary of words to query the job dataset, counting every appearance of each programming language. In total we searched for 256 programming languages in the 7,000 job ads; here are the results for the top languages:

Figure 6.1: Most common programming languages in Morocco


Clearly the most needed programming language in Morocco is Java, followed by Javascript in second position, with C in a close third position. We also noticed that C++ is not popular at all. The R language is ahead of Python and C++, which reflects the impact of the big data wave in Morocco. Cobol no longer attracts attention and is close to extinction.
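The dictionary counting described above can be sketched in R as follows; `ads` (one string per job ad) and `languages` (the 256 language names) are illustrative names. Matching whole tokens rather than substrings matters here, so that short names such as "C" or "R" are not counted inside unrelated words:

    # Split each ad into lowercase tokens on whitespace and common separators;
    # this keeps multi-character names like "c++" and "c#" intact.
    tokens <- strsplit(tolower(ads), "[[:space:],;/()]+")

    count_language <- function(lang) {
      lang <- tolower(lang)
      sum(vapply(tokens, function(tk) lang %in% tk, logical(1)))
    }

    counts <- sapply(languages, count_language)
    head(sort(counts, decreasing = TRUE), 10)   # the ten most requested languages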

6.2 Is C popularity really declining?

Looking back at 2016, the TIOBE index (an indicator of the popularity of programming languages maintained by the TIOBE Programming Community) suggests that developers' interest in the C programming language is declining. At mid-year 2016, the language invented by Dennis Ritchie was at its lowest level since the launch of the TIOBE index in 2001, more than 15 years ago. It is also important to note that C's popularity rating had been falling since November 2015, and the second half of 2016 was no better, with a continued decline in its popularity score. The C language fell from 16.04% in January 2016, on the TIOBE scale, to 9.35% in January 2017, a decline in popularity of more than 40% over 12 months. Last August, TIOBE attributed the decline of the C language to various reasons. The first is that C is struggling to evolve in some markets, especially in fast-growing areas such as web and mobile application development. Moreover, the latest IEEE ranking of programming languages showed that C is not among the top 10 languages for the web, and in terms of mobile development the flagship language for Android (Java) remains popular.

Figure 6.2: Evolution of popularity scores of programming languages since 2001


6.3 Languages popularity in offshore ads

Here are the results I obtained after analysing 7,000 offshore ads in order to see which language is the most popular.

Figure 6.3: Top Languages popularity in the Moroccan Offshore industry

French is by far the most requested language in the offshore industry in Morocco: it alone represents 70% of the demand. English represents slightly more than 15% of the demand, and Arabic only 5%. On the other hand, German, Italian and Spanish are extremely rare, representing a combined demand of 9%.


6.4 Salary in call centers

We calculated the average monthly salary in call centers for each language, with a 95% confidence level; here are the results:

Figure 6.4: Salary in terms of language in call centers

Spanish is the best paid language in the call center market, followed by French, then German, English and finally Arabic.

Language   Avg monthly salary (dhs)   Nb of ads   Standard deviation   Confidence interval
Spanish                        3700          75                 1365            3700 ± 308
French                         3280        1612                  800            3280 ± 39
German                         3260          74                  832            3260 ± 190
English                        3140         342                  583            3140 ± 58
Arabic                         3100         117                  520            3100 ± 95

Table 6.1: Average monthly salary summary
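These intervals follow from the usual normal approximation, mean ± 1.96 · s / √n. A small check in R for the Spanish row, using the values from Table 6.1:

    # Half-width of a 95% confidence interval for the mean (normal approximation).
    ci_half_width <- function(sd, n, level = 0.95) {
      z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% level
      z * sd / sqrt(n)
    }

    round(ci_half_width(sd = 1365, n = 75))   # about 309, in line with the ±308 above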


Chapter 7

Limitations

Our work gravitates around the region of Casablanca. However, Morocco has other important cities, such as Tangier and Kenitra, that host a lot of industrial investment and are expected to host much more in the next decade, notably with the future Chinese industrial park in Tangier. We could extend our work to these hotspots in order to get a better picture of the Moroccan job market. The ads we collected are online job ads; we did not address internal postings or mailing-list job promotions. We also did not address the fact that some ads are duplicated across multiple websites, which could potentially bias the results.


Chapter 8

Related work

Litecky et al. [Litecky et al. (2012)] gathered job ads from five large U.S. job websites. More than 2.7 million computing job advertisements were collected from these sites, and the authors analyzed the frequency of job skills in IT positions. Smith and Ali [Smith & Ali (2014)] showed in their study that data mining techniques could be used by faculty and degree programs to meet the demands of the information technology market, through continuous data collection and routine data analysis revealing the trend of programming jobs. Karakatsanis et al. [Kar] used latent semantic indexing to transform the O*NET occupation data (a database in which job occupations are described) into a latent space. Job ads were then projected into this space in order to detect the O*NET occupations with the greatest demand, identify popular job clusters and study changes in the job market over time.


Chapter 9

Future work

Our work is the first step of a long process that aims to gain insight into the Moroccan job market. We focused in particular on the offshore job market: we used machine learning to classify ads as offshore or not, then we analyzed them. This analysis provides pertinent information about average salary, languages and programming languages. The next step could be classifying the offshore ads into ITO (Information Technology Outsourcing) and BPO (Business Process Outsourcing), and also comparing the programming languages used in the offshore industry with those used in the local market.


Chapter 10

Conclusion

We showed that we can classify job ads using machine learning algorithms such as K-Nearest-Neighbors or Naive Bayes. The unstructured nature of the ads is what motivated us to use machine learning; simple keyword classification was not a good option because everyone's grammar and style differ widely. Classifying job ads as offshore or non-offshore was the first step of a broader effort that aims to better understand the Moroccan job market. This better understanding matters because it could shape university curricula to match the job market's needs and to react more quickly to market changes. Among the expected benefits of this work are reduced unemployment and better adapted education. We also produced some statistics about the job market. We discovered that the most demanded language in the offshore industry is French, but the best paid language is Spanish. Salaries for German are higher than for English, which is quite surprising, and Arabic, despite being the native language, is the least paid. In the IT market, Java is the most used programming language, followed by Javascript, with "C" in a close third position. The R language comes in eighth position, ahead of Python and C++, showing the effectiveness of the big data promotion campaign initiated by the Moroccan Ministry of Education. This work was a true opportunity to analyze the job market in Morocco like never before, using the latest available technologies. The insights gained from this work will hopefully enrich a much bigger project called "Data Science for Improved Education and Employability in Morocco".


Bibliography

[Kar] Data mining approach to monitoring the requirements of the job market: A case study. https://www.infona.pl/resource/bwmeta1.element.elsevier-933fdc83-06c7-3ce2-9569-b9ffe0eac4af

[Litecky et al. (2012)] Litecky, Chuck; Igou, Amy J.; Aken, Andrew: Skills in the Management Oriented IS and Enterprise System Job Markets. In: Proceedings of the 50th Annual Conference on Computers and People Research (SIGMIS-CPR '12). ACM, New York, NY, USA, 2012, pp. 35-44. ISBN 978-1-4503-1110-6.

[Smith & Ali (2014)] Smith, David; Ali, Azad: Analyzing Computer Programming Job Trend Using Web Data Mining. In: Issues in Informing Science & Information Technology 11 (2014), January, p. 203. https://www.questia.com/library/journal/1G1-420050951/analyzing-computer-programming-job-trend-using-web. ISSN 1547-5840.