Automatic image moderation in classifieds
By Jaroslaw Szymczak
PyData Paris @ PyParis 2017, June 12, 2017
Agenda
● Image moderation problem
● Brief sketch of the approach
● Machine learning foundations of the solution:
○ image features
○ listing features (and the combination of both)
● Class imbalance problem:
○ proper training
○ proper testing
○ proper evaluation
● Going live with the product:
○ consistent development and production environments
○ batch model creation
○ live application
○ performance monitoring
Scale of business at OLX
4.4 APP RATING
#1 app in +22 countries (1)
1) Google Play Store; shopping/lifestyle categories. Note: excludes Letgo. Associates at proportionate share.
→ People spend more than twice as long in OLX apps versus competitors
became one of the top 3 classifieds app in US less than a year after its launch
130 countries
+60 million monthly listings
+18 million monthly sellers
+52 million cars are listed every year on our platforms; 77% of the total number of cars manufactured!
+160,000 properties are listed daily
Every second, the following are listed at OLX:
• 2 houses
• 2 cars
• 3 fashion items
• 2.5 mobile phones
✔ real photo of the phones
✔ selfie with a dress
✔ real shoes photo
✘ human on the picture (OLX India)
✘ stock photo (OLX Poland)
CALL 555-555-555
✘ contact details (all sites)
✘ NSFW (all sites)
Binary image classification
Image features:
● CNN fine-tuning
● transfer learning
● image represented as a 1D vector
Classic features:
● category of the listing
● is the listing from a business or a private person
● what is the price?
All features are fed to a gradient boosting classifier
Why not more, e.g. title, description, user history?
Pragmatism: we don't want to overcomplicate the model:
● CNNs are the state of the art for image recognition
● classic features help improve accuracy, but having too many of them would decrease the significance of the image features
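The combination step described above can be sketched as a plain concatenation of the CNN's 1D image vector with the few classic listing features; the function name and the 2048-dimensional vector size here are illustrative assumptions, not details from the talk:

```python
def combine_features(image_vector, category_id, is_business, price):
    """Concatenate the CNN's 1D image representation with the
    classic listing features into a single row for the classifier."""
    listing_features = [float(category_id), 1.0 if is_business else 0.0, float(price)]
    return list(image_vector) + listing_features

# A hypothetical 2048-dimensional bottleneck vector plus 3 listing features.
image_vector = [0.1] * 2048
row = combine_features(image_vector, category_id=7, is_business=False, price=99.0)
```

Keeping the listing features few, as the slide argues, prevents them from dominating the 2048 image dimensions in the boosted trees.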
Convolutional Neural Networks
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017
Fine tuning and transfer learning
Source: lecture notes to Stanford Course CS231n: http://cs231n.stanford.edu/slides/2017
Inception network
Source: http://redcatlabs.com/2016-07-30_FifthElephant-DeepLearning-Workshop/
Inception 21k
Trained on 21,841 classes of the ImageNet set
Top-1 accuracy above 37%
Available for mxnet: https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md
VGG16 network
Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
● used the model from Keras
● easy to freeze arbitrary layers (layer.trainable = False)
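A minimal sketch of that freezing pattern, using a stand-in Layer class so it runs anywhere; in Keras the same loop would iterate over model.layers of the loaded VGG16:

```python
class Layer:
    """Stand-in for a Keras layer; only the `trainable` flag matters here."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_all_but_last(layers, n_trainable):
    """Fine-tuning pattern from the slide: freeze every layer except
    the last `n_trainable` ones via layer.trainable = False."""
    for layer in layers[:len(layers) - n_trainable]:
        layer.trainable = False

layers = [Layer("block%d" % i) for i in range(5)]
freeze_all_but_last(layers, n_trainable=2)
```

With the early convolutional blocks frozen, only the top layers are re-trained on the moderation data, which is the usual transfer-learning setup for small labelled sets.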
Gradient boosting?
● instead of updating weights, in each round you fit a weak learner to the pseudo-residuals (the negative gradients of the loss function)
● similarly to neural networks, a shrinkage parameter (learning rate) is used when adding each weak learner, scaling its contribution to avoid overfitting
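A toy illustration of that loop for squared loss, where the pseudo-residuals are plain residuals and each weak learner is a depth-1 stump on a single feature; this is a pure-Python sketch of the idea, not the talk's actual XGBoost setup:

```python
def fit_stump(x, residuals):
    """Weak learner: a depth-1 split fit to the current residuals."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=20, shrinkage=0.3):
    """Each round fits a weak learner to the residuals and adds it
    scaled by the shrinkage parameter (starting from a 0.0 baseline)."""
    prediction = [0.0] * len(y)
    learners = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        learners.append(stump)
        prediction = [pi + shrinkage * stump(xi) for pi, xi in zip(prediction, x)]
    return lambda xi: sum(shrinkage * s(xi) for s in learners)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 4.0, 4.1, 3.9]
model = gradient_boost(x, y)
```

Without shrinkage the first stump would fit the group means in one step and the ensemble would overfit quickly; the learning rate spreads the fit over many small corrections.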
eXtreme Gradient Boosting (XGBoost)
Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
Class imbalance - proper training
● possibilities to deal with the problem:
○ undersampling the majority class
○ oversampling the minority class:
■ randomly
■ by creating artificial examples (SMOTE)
○ reweighting
● undersampling suits our needs the most:
○ the general population of good images is not hurt much by undersampling
○ given training data size limitations, we can train on more unique examples of bad images
○ we undersample in such a manner that the ratio changes from 99:1 to 9:1
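That undersampling step can be sketched as follows, assuming label 1 marks the rare bad images; the function and parameter names are illustrative:

```python
import random

def undersample(records, labels, target_ratio=9):
    """Keep all minority (bad-image) examples and randomly keep
    `target_ratio` majority examples per minority one, e.g. 99:1 -> 9:1."""
    minority = [r for r, l in zip(records, labels) if l == 1]
    majority = [r for r, l in zip(records, labels) if l == 0]
    keep = random.sample(majority, min(len(majority), target_ratio * len(minority)))
    sampled = [(r, 1) for r in minority] + [(r, 0) for r in keep]
    random.shuffle(sampled)
    return sampled

# 9900 good images and 100 bad ones: the 99:1 ratio from the slide.
records = list(range(10000))
labels = [1] * 100 + [0] * 9900
sampled = undersample(records, labels)
```

Note that a model trained on the 9:1 sample outputs scores calibrated to that ratio, which is one reason the next slides insist on evaluating against the true class balance.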
Class imbalance - proper evaluation
● accuracy is a useless measure in such a case
● sensible measures are:
○ ROC AUC
○ PR AUC
○ Precision @ fixed Recall
○ Recall @ fixed Precision
● ROC AUC:
○ can be interpreted as a concordance probability (a randomly chosen positive example gets a higher score than a randomly chosen negative one with probability equal to the AUC)
○ it is, though, too abstract to use as a standalone quality metric
○ does not depend on the class ratio
● PR AUC:
○ depends on the data balance
○ is not intuitively interpretable
● Precision @ fixed Recall, Recall @ fixed Precision:
○ they heavily depend on the data balance
○ they are the best at reflecting the business requirements
○ and at taking processing capabilities into account (then Precision @ k is actually more accurate)
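Both the concordance interpretation of ROC AUC and Precision @ fixed Recall can be computed directly from scores; a small sketch with illustrative names (not the talk's code), assuming distinct scores:

```python
from itertools import product

def roc_auc(scores, labels):
    """AUC as concordance probability: the chance that a random
    positive example is scored higher than a random negative one
    (ties counted as half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

def precision_at_recall(scores, labels, min_recall):
    """Best precision achievable by any score threshold whose recall
    is at least `min_recall`."""
    total_pos = sum(labels)
    order = sorted(zip(scores, labels), reverse=True)
    tp, best = 0, 0.0
    for k, (_, label) in enumerate(order, start=1):
        tp += label
        if tp / total_pos >= min_recall:
            best = max(best, tp / k)
    return best

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   0,   1,   1,   0,   0]
```

Note how precision_at_recall sweeps thresholds from the top of the ranking, which is also why Precision @ k maps naturally to a fixed moderation capacity.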
Consistent development and production environments
● ensure you have the drivers installed (check with nvidia-smi)
● create docker image
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
...
ENV BUILD_OPTS "USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1"
RUN cd /home && git clone https://github.com/dmlc/mxnet.git mxnet --recursive --branch v0.10.0 --depth 1 \
    && cd mxnet && make -j$(nproc) $BUILD_OPTS
...
RUN pip3 install tensorflow==1.1.0
RUN pip3 install tensorflow-gpu==1.1.0
RUN pip3 install keras==2.0
● use nvidia-docker-compose wrapper
Batch process, using the Luigi framework
● re-usability of processing
● fully automated pipeline
● containerized with Docker
Luigi tips
● create your output at the very end of the task
● you can dynamically create dependencies by yielding the task
● adding the workers parameter to your command parallelizes tasks that are ready to run (e.g. python run.py Task … --workers 15)
● for straightforward workflows, inheritance comes in handy:
class SimpleDependencyTask(luigi.Task):

    def create_simple_dependency(self, predecessor_task_class, additional_parameters_dict=None):
        if additional_parameters_dict is None:
            additional_parameters_dict = {}
        result_dict = {k: v for k, v in self.__dict__.items()
                       if k in predecessor_task_class.get_param_names()}
        result_dict.update(additional_parameters_dict)
        return predecessor_task_class(**result_dict)
ads_from_one_day = yield DownloadAdsFromOneDay(self.site_code, effective_current_date)