
Kaggle Machine Learning Projects

Ashok Kumar Harnal

FORE School of Management, New Delhi


About Kaggle and About Projects

Kaggle is a platform for predictive modelling and analytics competitions on which

companies, public bodies and researchers post their data and pose problems relating to them

from the domain of predictive analytics. Statisticians and data miners from all over the world

compete to produce the best models. The data posted ranges in size from megabytes to terabytes; data in the gigabyte range is common. This competitive approach relies on the fact that there are

countless strategies that can be applied to any predictive modelling task and it is impossible to

know at the outset which technique or analyst will be most effective.

Here is how it works:

1. The competition host (company) prepares the data and a description of the problem.

Kaggle offers a service which helps the host do this, as well as frame the competition,

anonymize the data, and integrate the winning model into their operations.

2. Participants experiment with different techniques and compete against each other to

produce the best models. For most competitions, submissions are scored immediately

(based on their predictive accuracy relative to a hidden solution file) and summarized on

a live leaderboard.

Projects we have participated in

We have participated in a number of projects on Kaggle; some of them are listed below. I also write a technical blog at http://ashokharnal.wordpress.com, where I describe in detail how we executed these projects, along with the project code and the results of the competitions. I have used R and Python at different times.

This compilation is a record of projects that we have executed. For want of time, not all of these are listed in my technical blog. This booklet describes the projects and the associated problems, but not the solutions. If you wish to have access to the solutions as well, please log into the FORE e-learning site (http://203.122.28.250/bigdata) with userid ‘myguest’ and password Qwerty#123, and peruse the project code and results.

Please visit my technical blog (http://ashokharnal.wordpress.com), where many projects are described in detail.


Contents

About Kaggle and About Projects
    Here is how it works
    Projects we have participated in
1. Bosch Production Line Performance
    Problem: Reduce manufacturing failures
    Data
    Data Files
    File descriptions
2. Africa Soil Properties Challenge
    Problem: Predict physical and chemical properties of soil using spectral measurements
    Data
    File descriptions
    Data fields
    Techniques used
3. Rossmann Drug Store
    Problem: Forecast sales using store, promotion, and competitor data
    Data
    Files
    Data fields
    Feature Engineering
    Techniques used
4. Walmart: Acquire Valued Shoppers Challenge
    Problem: Predict which shoppers will become repeat buyers
    Data
    Files
    Fields
    Feature Engineering
5. Avazu CTR Prediction
    Problem: Predict whether a mobile ad will be clicked
    Data
    File descriptions
    Data fields
6. Facial keypoints detection
    Problem: Detect the location of keypoints on face images
    Data
7. Forest Cover Prediction
    Problem: Use cartographic variables to classify forest categories
    Data
    Data Fields
8. Boehringer Ingelheim: Which drugs are effective?
    Problem: Predict a biological response of molecules from their chemical properties
    Data
9. West Nile virus prediction
    Problem: Predict West Nile virus in mosquitos across the city of Chicago
    Data
    Main dataset
    Spray Data
    Weather Data
    File descriptions
10. Caterpillar tube pricing
    Problem: Model quoted prices for industrial tube assemblies
    Data
    File descriptions
    train_set.csv and test_set.csv
    tube.csv
    bill_of_materials.csv
    specs.csv
    tube_end_form.csv
    components.csv
    comp_[type].csv
    type_[type].csv
11. San Francisco Crime Classification
    Problem: Predict the category of crimes that occurred in the city by the bay
    Data
    Data fields
12. Airbnb New User Bookings
    Problem: Where will a new guest book their first travel experience?
    Data
    File descriptions
    Fields
13. TFI: Restaurant Revenue Prediction
    Problem: Predict annual restaurant sales based on objective measurements
    Data
    File descriptions
    Data fields
14. Otto Group Product Classification Challenge
    Problem: Classify products into the correct category
    Data
15. Walmart Recruiting - Store Sales Forecasting
    Problem: Use historical markdown data to predict store sales
    Data
    stores.csv
    train.csv
    test.csv
    features.csv
16. Springleaf: Determine whether to send a direct mail piece to a customer
    Problem: Predict which customers can be directly targeted
    Data
    Results achieved
17. Santander Customer Satisfaction
    Problem: Which customers are happy customers?
    Data Files
    File descriptions
    Results achieved
18. Influencers in Social Networks
    Problem: Predict which people are influential in a social network
    Data Files
19. Predicting Red Hat Business Value
    Problem: Classifying customer potential
    Data Files


1. Bosch Production Line Performance

Problem: Reduce manufacturing failures

(Area: Operation/Manufacturing)

A good chocolate soufflé is decadent, delicious, and delicate. But, it's a challenge to prepare.

When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your

steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing

companies, has an imperative to ensure that the recipes for the production of its advanced

mechanical components are of the highest quality and safety standards. Part of doing so is

closely monitoring its parts as they progress through the manufacturing processes.

Because Bosch records data at every step along its assembly lines, they have the ability to apply

advanced analytics to improve these manufacturing processes. However, the intricacies of the

data and complexities of the production line pose problems for current methods.

In this competition, Bosch is challenging participants to predict internal failures using thousands of

measurements and tests made for each component along the assembly line. This would enable

Bosch to bring quality products at lower costs to the end user.

Data

Data Files

File Name                Available Formats
test_categorical.csv     .zip (19.75 MB)
train_categorical.csv    .zip (19.78 MB)
train_date.csv           .zip (58.77 MB)
test_date.csv            .zip (58.78 MB)
sample_submission.csv    .zip (1.55 MB)
test_numeric.csv         .zip (270.33 MB)
train_numeric.csv        .zip (269.98 MB)

The data for this competition represents measurements of parts as they move through Bosch's

production lines. Each part has a unique Id. The goal is to predict which parts will fail quality

control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named

according to a convention that tells you the production line, the station on the line, and a feature

number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number

3939.

On account of the large size of the dataset, we have separated the files by the type of feature they

contain: numerical, categorical, and finally, a file with date features. The date features provide a

timestamp for when each measurement was taken. Each date column ends in a number that

corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which

L0_S0_F0 was taken.
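As an illustration of this naming convention, the following minimal Python sketch (the helper function is our own convenience, not part of the competition material) splits a column name into its line, station, feature type and feature number:

    import re

    # Measurement columns look like L3_S36_F3939 (line 3, station 36, feature 3939);
    # date columns follow the same pattern with a 'D', e.g. L0_S0_D1.
    COLUMN_PATTERN = re.compile(r"^L(\d+)_S(\d+)_([FD])(\d+)$")

    def parse_column(name):
        """Return (line, station, kind, number) for a Bosch column name."""
        line, station, kind, number = COLUMN_PATTERN.match(name).groups()
        return int(line), int(station), kind, int(number)

    print(parse_column("L3_S36_F3939"))  # (3, 36, 'F', 3939)
    print(parse_column("L0_S0_D1"))      # (0, 0, 'D', 1): timestamp for feature L0_S0_F0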

In addition to being one of the largest datasets (in terms of number of features) ever hosted on

Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes

are expected to make this a challenging problem.

File descriptions

train_numeric.csv - the training set numeric features (this file contains the 'Response' variable)
test_numeric.csv - the test set numeric features (you must predict the 'Response' for these Ids)
train_categorical.csv - the training set categorical features
test_categorical.csv - the test set categorical features
train_date.csv - the training set date features
test_date.csv - the test set date features
sample_submission.csv - a sample submission file in the correct format
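The numeric files are large (the compressed train_numeric.csv alone is about 270 MB), so a common way to inspect them is to stream the file in chunks rather than load it at once. A minimal, hedged pandas sketch, assuming the file has been unzipped into the working directory and using an arbitrary chunk size of 100,000 rows:

    import pandas as pd

    # Stream train_numeric.csv and count the rare positive responses (failed parts).
    positives = 0
    total = 0
    for chunk in pd.read_csv("train_numeric.csv", usecols=["Id", "Response"],
                             chunksize=100_000):
        positives += int(chunk["Response"].sum())
        total += len(chunk)

    print(positives, "failures out of", total, "parts")  # the target is highly imbalanced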


2. Africa Soil Properties Challenge

Problem: Predict physical and chemical properties of soil using spectral

measurements

(Area: Environment/Geology)

Advances in rapid, low cost analysis of soil samples using infrared spectroscopy, georeferencing

of soil samples, and greater availability of earth remote sensing data provide new opportunities

for predicting soil functional properties at unsampled locations. Soil functional properties are

those properties related to a soil’s capacity to support essential ecosystem services such as

primary productivity, nutrient and water retention, and resistance to soil erosion. Digital mapping

of soil functional properties, especially in data sparse regions such as Africa, is important for

planning sustainable agricultural intensification and natural resources management.

Diffuse reflectance infrared spectroscopy has shown potential in numerous studies to provide a

highly repeatable, rapid and low cost measurement of many soil functional properties. The

amount of light absorbed by a soil sample is measured, with minimal sample preparation, at

hundreds of specific wavebands across a range of wavelengths to provide an infrared spectrum

(Fig. 1). The measurement can be typically performed in about 30 seconds, in contrast to

conventional reference tests, which are slow and expensive and use chemicals.


Conventional reference soil tests are calibrated to the infrared spectra on a subset of samples

selected to span the diversity in soils in a given target geographical area. The calibration models

are then used to predict the soil test values for the whole sample set. The predicted soil test

values from georeferenced soil samples can in turn be calibrated to remote sensing covariates,

which are recorded for every pixel at a fixed spatial resolution in an area, and the calibration

model is then used to predict the soil test values for each pixel. The result is a digital map of the

soil properties.

This competition asks one to predict 5 target soil functional properties from diffuse reflectance

infrared spectroscopy measurements.

Data

File descriptions

train.csv - the training set; has 1158 rows.
test.csv - the test set; has 728 rows.
sample_submission.csv - an all-zeros prediction, serving as a sample submission file in the correct format.

Data fields

SOC, pH, Ca, P and Sand are the five target variables for prediction. The data have been monotonically transformed from the original measurements and thus include negative values.

PIDN: unique soil sample identifier
SOC: Soil organic carbon
pH: pH values
Ca: Mehlich-3 extractable Calcium
P: Mehlich-3 extractable Phosphorus
Sand: Sand content
m7497.96 - m599.76: 3,578 mid-infrared absorbance measurements. For example, the "m7497.96" column is the absorbance at wavenumber 7497.96 cm-1. It is suggested that the CO2 spectral bands, which lie in the region m2379.76 to m2352.76, be removed, but one does not have to.
Depth: Depth of the soil sample (2 categories: "Topsoil", "Subsoil")
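As mentioned above, the CO2 bands may optionally be removed. A minimal pandas sketch of dropping them, assuming train.csv from the file list above is in the working directory (the column-name parsing is our own convenience):

    import pandas as pd

    train = pd.read_csv("train.csv")

    # Spectral columns are named m<wavenumber>, e.g. "m7497.96".
    # Drop the suggested CO2 bands, m2379.76 down to m2352.76.
    co2_cols = [c for c in train.columns
                if c.startswith("m") and 2352.76 <= float(c[1:]) <= 2379.76]
    train = train.drop(columns=co2_cols)
    print(len(co2_cols), "CO2 band columns removed")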

We have also included some potential spatial predictors from remote sensing data sources. Short

variable descriptions are provided below and additional descriptions can be found at AfSIS data.

The data have been mean centered and scaled.

BSA: average long-term Black Sky Albedo measurements from MODIS satellite images (BSAN = near-infrared, BSAS = shortwave, BSAV = visible)

CTI: compound topographic index calculated from Shuttle Radar Topography Mission elevation data

ELEV: Shuttle Radar Topography Mission elevation data
EVI: average long-term Enhanced Vegetation Index from MODIS satellite images.


LST: average long-term Land Surface Temperatures from MODIS satellite images (LSTD = day time temperature, LSTN = night time temperature)

Ref: average long-term Reflectance measurements from MODIS satellite images (Ref1 = blue, Ref2 = red, Ref3 = near-infrared, Ref7 = mid-infrared)

Reli: topographic relief calculated from Shuttle Radar Topography Mission elevation data
TMAP & TMFI: average long-term Tropical Rainfall Monitoring Mission data (TMAP = mean annual precipitation, TMFI = modified Fournier index)

Techniques used:

Bayesian Additive Regression Trees, a variation of boosting in which each weak learner is fitted to the errors using a Bayesian approach.

Least error: 0.4692; Worst error: 28.77


3. Rossmann Drug Store

Problem: Forecast sales using store, promotion, and competitor data

(Area: Marketing/Human Resource)

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store

managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales

are influenced by many factors, including promotions, competition, school and state holidays,

seasonality, and locality. With thousands of individual managers predicting sales based on their

unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging one to predict 6 weeks of daily sales

for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create

effective staff schedules that increase productivity and motivation. By helping Rossmann create

a robust prediction model, one will help store managers stay focused on what’s most important to

them: their customers and their teams!

One is provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the

"Sales" column for the test set. Note that some stores in the dataset were temporarily closed for

refurbishment.

Data

Files

train.csv - historical data including Sales
test.csv - historical data excluding Sales
sample_submission.csv - a sample submission file in the correct format
store.csv - supplemental information about the stores

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

Id - an Id that represents a (Store, Date) duple within the test set
Store - a unique Id for each store
Sales - the turnover for any given day (this is what is to be predicted)
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open


StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals in which Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store

Feature Engineering

1. Mean(sales) / mean(customers) for each store
2. Mean(sales) / mean(customers) for each store, by day of week
3. Mean(sales) by day of week
4. Mean(sales) / mean(customers) by day of week
5. Mean(sales) by each store and day of week
6. Mean(sales) by each store and by promo
7. Mean(sales) by each store type and promo
8. Mean(sales) by assortment and store type

(A pandas sketch of a few of these aggregates follows.)
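A minimal pandas sketch of two of the aggregates above, assuming train.csv with the Store, Date, Sales and Customers fields described earlier (the derived column names are our own):

    import pandas as pd

    train = pd.read_csv("train.csv", parse_dates=["Date"])
    train["day_of_week"] = train["Date"].dt.dayofweek

    # Feature 1: mean(sales) / mean(customers) for each store.
    per_store = train.groupby("Store").agg(mean_sales=("Sales", "mean"),
                                           mean_customers=("Customers", "mean"))
    per_store["sales_per_customer"] = per_store["mean_sales"] / per_store["mean_customers"]

    # Feature 5: mean(sales) by each store and day of week.
    store_dow = (train.groupby(["Store", "day_of_week"])["Sales"]
                 .mean().rename("store_dow_mean_sales"))

    # Merge feature 1 back onto the training frame.
    train = train.merge(per_store[["sales_per_customer"]].reset_index(),
                        on="Store", how="left")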

Techniques used

1. Random Forest
2. XGBoost
3. Model mix: α * (1) + (1 - α) * (2), i.e. a weighted average of the Random Forest and XGBoost predictions (a short sketch follows)
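The model mix in item 3 is a simple weighted average of the two sets of predictions. A minimal sketch with dummy values (α = 0.5 is only illustrative; in practice it would be tuned on a validation split):

    import numpy as np

    # Predictions of the two models on the same test rows (dummy values).
    rf_pred = np.array([5200.0, 6100.0, 4800.0])
    xgb_pred = np.array([5000.0, 6300.0, 4700.0])

    alpha = 0.5
    blended = alpha * rf_pred + (1 - alpha) * xgb_pred
    print(blended)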


4. Walmart: Acquire Valued Shoppers Challenge

Problem: Predict which shoppers will become repeat buyers

(Area: Marketing/Human Behavior)

Consumer brands often offer discounts to attract new shoppers to buy their products. The most

valuable customers are those who return after this initial incented purchase. With enough

purchase history, it is possible to predict which shoppers, when presented an offer, will buy a

new item. However, identifying the shopper who will become a loyal buyer -- prior to the initial

purchase -- is a more challenging task.

The Acquire Valued Shoppers Challenge asks participants to predict which shoppers are most

likely to repeat purchase. To aid with algorithmic development, we have provided complete,

basket-level, pre-offer shopping history for a large set of shoppers who were targeted for an

acquisition campaign. The incentive offered to that shopper and their post-incentive behavior is

also provided.

This challenge provides almost 350 million rows of completely anonymised transactional data

from over 300,000 shoppers. It is one of the largest problems run on Kaggle to date.

Warning: this is a large data set. The decompressed files require about 22GB of space.


This data captures the process of offering incentives (a.k.a. coupons) to a large number of

customers and forecasting those who will become loyal to the product. Let's say 100

customers are offered a discount to purchase two bottles of water. Of the 100 customers, 60

choose to redeem the offer. These 60 customers are the focus of this competition. Predict which

of the 60 will return (during or after the promotional period) to purchase the same item again.

To create this prediction, we give a minimum of a year of shopping history prior to each

customer's incentive, as well as the purchase histories of many other shoppers (some of whom

will have received the same offer). The transaction history contains all items purchased, not

just items related to the offer. Only one offer per customer is included in the data. The training

set is comprised of offers issued before 2013-05-01. The test set is offers issued on or after 2013-

05-01.

Data

Files

Four relational files are provided:

transactions.csv - contains transaction history for all customers for a period of at least 1 year prior to their offered incentive

trainHistory.csv - contains the incentive offered to each customer and information about the behavioral response to the offer

testHistory.csv - contains the incentive offered to each customer but does not include their response (one is predicting the repeater column for each id in this file)

offers.csv - contains information about the offers

Fields

All of the fields are anonymized and categorized to protect customer and sales information. The

specific meanings of the fields will not be provided (so don't bother asking). Part of the challenge

of this competition is learning the taxonomy of items in a data-driven way.

history

id - A unique id representing a customer

chain - An integer representing a store chain

offer - An id representing a certain offer

market - An id representing a geographical region

repeattrips - The number of times the customer made a repeat purchase

repeater - A boolean, equal to repeat trips > 0

offerdate - The date a customer received the offer

transactions

id - see above

chain - see above

dept - An aggregate grouping of the Category (e.g. water)

category - The product category (e.g. sparkling water)


company - An id of the company that sells the item

brand - An id of the brand to which the item belongs

date - The date of purchase

productsize - The amount of the product purchase (e.g. 16 oz of water)

productmeasure - The units of the product purchase (e.g. ounces)

purchasequantity - The number of units purchased

purchaseamount - The dollar amount of the purchase

offers

offer - see above

category - see above

quantity - The number of units one must purchase to get the discount

company - see above

offervalue - The dollar value of the offer

brand - see above

Feature Engineering

1. What is the attitude of customer towards the offered product? Has he purchased this in

the past? (Binary attribute: Yes/No)

2. productAffinity- How many times the customer has purchased the offered category (No

of transactions)

3. prod_purchasedamount: Total amount spent on this category?

4. Customer attitude towards offered brand: What are his total expenses on brand?

5. brand_affinity: How interested a customer is in brand of company that offered this

product; Count number of transactions that pertained to this brand

6. category_affinity: What is the attitude of customer towards offered category i.e. even

from other companies?

7. chain_affinity: Attitude of customer towards the store chain? Does he visit it often? And

how often has customer transacted and what are his total purchases

8. Basket of categories: What are customer's variety of category purchases?

9. Build brand popularity score: Which brand is more popular?

10. Build Category popularity score: Which category is more popular

11. Build Company popularity score: Which company is more popular
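A minimal, hedged pandas sketch of features 2 and 3 above (category affinity and amount spent per category), scanning transactions.csv in chunks because of its size; joining the result to the offered category from trainHistory.csv and offers.csv is left out for brevity:

    import pandas as pd

    # Number of transactions and total spend per (customer, category).
    affinity = None
    cols = ["id", "category", "purchaseamount"]
    for chunk in pd.read_csv("transactions.csv", usecols=cols, chunksize=1_000_000):
        part = chunk.groupby(["id", "category"]).agg(
            n_transactions=("purchaseamount", "size"),
            spend=("purchaseamount", "sum"))
        affinity = part if affinity is None else affinity.add(part, fill_value=0)

    # 'affinity' can now be joined, on (id, offered category), with the history and offers tables.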


Top-Entry Score: 0.62703; Worst Entry: 0.4430


5. Avazu CTR Prediction

Problem: Predict whether a mobile ad will be clicked

(Area: Advertising)

In online advertising, click-through rate (CTR) is a very important metric for evaluating ad

performance. As a result, click prediction systems are essential and widely used for sponsored

search and real-time bidding.

For this competition, we have provided 11 days worth of Avazu data to build and test prediction

models. Can one find a strategy that beats standard classification algorithms?

Data

File descriptions

train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks are subsampled according to different strategies.

test - Test set. 1 day of ads for testing one’s model predictions.
sampleSubmission.csv - Sample submission file in the correct format; corresponds to the All-0.5 Benchmark.

Data fields

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables
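As a small, hedged illustration (the file name train.csv and the nrows limit are assumptions), the hour field can be split into hour-of-day and day-of-week features, a common first step before handing the many categorical columns to a classifier:

    import pandas as pd

    train = pd.read_csv("train.csv", usecols=["hour", "click"], nrows=1_000_000)

    # hour is encoded as YYMMDDHH, e.g. 14091123 = 23:00 UTC on 11 September 2014.
    ts = pd.to_datetime(train["hour"].astype(str), format="%y%m%d%H")
    train["hour_of_day"] = ts.dt.hour
    train["day_of_week"] = ts.dt.dayofweek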

Highest score: 0.3791384; Worst score: 23.72


6. Facial keypoints detection

Problem: Detect the location of keypoints on face images

(Area: Computer Vision)

The objective of this task is to predict keypoint positions on face images. This can be used as a

building block in several applications, such as:

tracking faces in images and video

analysing facial expressions

detecting dysmorphic facial signs for medical diagnosis

biometrics / face recognition

Detecting facial keypoints is a very challenging problem. Facial features vary greatly from one

individual to another, and even for a single individual, there is a large amount of variation due to

3D pose, size, position, viewing angle, and illumination conditions. Computer vision research

has come a long way in addressing these difficulties, but there remain many opportunities for

improvement.

Each predicted keypoint is specified by an (x,y) real-valued pair in the space of pixel indices. There are 15

keypoints, which represent the following elements of the face:

left_eye_center, right_eye_center, left_eye_inner_corner, left_eye_outer_corner,

right_eye_inner_corner, right_eye_outer_corner, left_eyebrow_inner_end,

left_eyebrow_outer_end, right_eyebrow_inner_end, right_eyebrow_outer_end, nose_tip,

mouth_left_corner, mouth_right_corner, mouth_center_top_lip, mouth_center_bottom_lip

Left and right here refers to the point of view of the subject.


In some examples, some of the target keypoint positions are missing (encoded as missing entries

in the csv, i.e., with nothing between two commas).

The input image is given in the last field of the data files, and consists of a list of pixels (ordered

by row), as integers in (0,255). The images are 96x96 pixels.
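A minimal sketch of turning such a pixel list into a 96x96 array, assuming it is read from the Image column of training.csv (the column name is an assumption):

    import numpy as np
    import pandas as pd

    train = pd.read_csv("training.csv")

    def to_image(pixel_string):
        """Convert a space-separated pixel string into a 96x96 uint8 array."""
        return np.array(pixel_string.split(), dtype=np.uint8).reshape(96, 96)

    first = to_image(train.loc[0, "Image"])
    print(first.shape)  # (96, 96)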

Data

training.csv: list of 7049 training images. Each row contains the (x,y) coordinates for 15 keypoints, and image data as a row-ordered list of pixels.

test.csv: list of 1783 test images. Each row contains ImageId and image data as row-

ordered list of pixels

submissionFileFormat.csv: list of 27124 keypoints to predict. Each row contains a RowId, ImageId, FeatureName and Location. FeatureName values are "left_eye_center_x", "right_eyebrow_outer_end_y", etc. Location is what needs to be predicted.

Best score: 1.9397; Worst score: 52.07


7. Forest Cover Prediction

Problem: Use cartographic variables to classify forest categories

(Area: Environment)

In this competition one is asked to predict the forest cover type (the predominant kind of tree

cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual

forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS)

Region 2 Resource Information System data. Independent variables were then derived from data

obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and

contains binary columns of data for qualitative independent variables such as wilderness areas

and soil type.

This study area includes four wilderness areas located in the Roosevelt National Forest of

northern Colorado. These areas represent forests with minimal human-caused disturbances, so

that existing forest cover types are more a result of ecological processes rather than forest

management practices.

Each observation is a 30m x 30m patch. One is asked to predict an integer classification for the forest cover type. The seven types are:

1 - Spruce/Fir

2 - Lodgepole Pine

3 - Ponderosa Pine

4 - Cottonwood/Willow


5 - Aspen

6 - Douglas-fir

7 - Krummholz

Data

The training set (15120 observations) contains both features and the Cover_Type. The test set

contains only the features. One must predict the Cover_Type for every row in the test set

(565892 observations).
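The booklet does not record the technique used for this project; purely as a hedged illustration, a simple random-forest baseline on the raw training columns could be set up as below (the Id column is an assumption):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    train = pd.read_csv("train.csv")
    X = train.drop(columns=["Id", "Cover_Type"])   # Id column assumed present
    y = train["Cover_Type"]

    model = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    print(cross_val_score(model, X, y, cv=3).mean())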

Data Fields

Elevation - Elevation in meters

Aspect - Aspect in degrees azimuth

Slope - Slope in degrees

Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features

Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features

Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway

Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice

Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice

Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice

Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points

Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation

Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation

Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation

The wilderness areas are:

1 - Rawah Wilderness Area

2 - Neota Wilderness Area

3 - Comanche Peak Wilderness Area

4 - Cache la Poudre Wilderness Area

The soil types are:

1 Cathedral family - Rock outcrop complex, extremely stony.

2 Vanet - Ratake families complex, very stony.

3 Haploborolis - Rock outcrop complex, rubbly.

4 Ratake family - Rock outcrop complex, rubbly.

5 Vanet family - Rock outcrop complex complex, rubbly.

6 Vanet - Wetmore families - Rock outcrop complex, stony.

7 Gothic family.

8 Supervisor - Limber families complex.

9 Troutville family, very stony.

10 Bullwark - Catamount families - Rock outcrop complex, rubbly.

11 Bullwark - Catamount families - Rock land complex, rubbly.


12 Legault family - Rock land complex, stony.

13 Catamount family - Rock land - Bullwark family complex, rubbly.

14 Pachic Argiborolis - Aquolis complex.

15 unspecified in the USFS Soil and ELU Survey.

16 Cryaquolis - Cryoborolis complex.

17 Gateview family - Cryaquolis complex.

18 Rogert family, very stony.

19 Typic Cryaquolis - Borohemists complex.

20 Typic Cryaquepts - Typic Cryaquolls complex.

21 Typic Cryaquolls - Leighcan family, till substratum complex.

22 Leighcan family, till substratum, extremely bouldery.

23 Leighcan family, till substratum - Typic Cryaquolls complex.

24 Leighcan family, extremely stony.

25 Leighcan family, warm, extremely stony.

26 Granile - Catamount families complex, very stony.

27 Leighcan family, warm - Rock outcrop complex, extremely stony.

28 Leighcan family - Rock outcrop complex, extremely stony.

29 Como - Legault families complex, extremely stony.

30 Como family - Rock land - Legault family complex, extremely stony.

31 Leighcan - Catamount families complex, extremely stony.

32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.

33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.

34 Cryorthents - Rock land complex, extremely stony.

35 Cryumbrepts - Rock outcrop - Cryaquepts complex.

36 Bross family - Rock land - Cryumbrepts complex, extremely stony.

37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.

38 Leighcan - Moran families - Cryaquolls complex, extremely stony.

39 Moran family - Cryorthents - Leighcan family complex, extremely stony.

40 Moran family - Cryorthents - Rock land complex, extremely stony.


Best Score: 1.0; Worst Score: 0.0000


8. Boehringer Ingelheim: Which drugs are

effective?

Problem: Predict a biological response of molecules from their chemical

properties

(Area: Molecular Biology)

The objective of the competition is to help us build as good a model as possible so that we can,

as optimally as this data allows, relate molecular information to an actual biological response.


The problem is to determine which molecular configurations are effective.

Data

The data is in the comma separated values (CSV) format. Each row in this data set represents a

molecule. The first column contains experimental data describing a real biological response; the

molecule was seen to elicit this response (1), or not (0). The remaining columns represent

molecular descriptors (d1 through d1776); these are calculated properties that can capture some

of the characteristics of the molecule - for example size, shape, or elemental constitution. The

descriptor matrix has been normalized.
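The model behind the accuracy figure below is not recorded here; purely as a hedged sketch, a baseline classifier on the descriptor matrix could look like this (the file name train.csv is an assumption, and the target is taken positionally as the first column):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("train.csv")
    y = data.iloc[:, 0]    # first column: observed biological response (1/0)
    X = data.iloc[:, 1:]   # remaining columns: descriptors d1 .. d1776

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())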

Overall accuracy: 76.16%


9. West Nile virus prediction

Problem: Predict West Nile virus in mosquitos across the city of Chicago

(Area: Public Health)

West Nile virus is most commonly spread to humans through infected mosquitos. Around 20%

of people who become infected with the virus develop symptoms ranging from a persistent fever,

to serious neurological illnesses that can result in death.

In 2002, the first human cases of West Nile virus were reported in Chicago. By 2004 the City of

Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive

surveillance and control program that is still in effect today.

Every week from late spring through the fall, mosquitos in traps across the city are tested for the

virus. The results of these tests influence when and where the city will spray airborne pesticides

to control adult mosquito populations.

Given weather, location, testing, and spraying data, this competition asks one to predict when

and where different species of mosquitos will test positive for West Nile virus. A more accurate

method of predicting outbreaks of West Nile virus in mosquitos will help the City of Chicago

and CDPH more efficiently and effectively allocate resources towards preventing transmission of

this potentially deadly virus.

Data

In this competition, one will be analyzing weather data and GIS data and predicting whether or

not West Nile virus is present, for a given time, location, and species.

Every year from late-May to early-October, public health workers in Chicago set up mosquito

traps scattered across the city. Every week from Monday through Wednesday, these traps collect

mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the

week. The test results include the number of mosquitos, the mosquitos species, and whether or

not West Nile virus is present in the cohort.


Main dataset

These test results are organized in such a way that when the number of mosquitos exceeds 50, they are split into another record (another row in the dataset), such that the number of mosquitos is capped at 50.

The location of the traps are described by the block number and street name. For convenience,

we have mapped these attributes into Longitude and Latitude in the dataset. Please note that

these are derived locations. For example, Block=79, and Street= "W FOSTER AVE" gives us an

approximate address of "7900 W FOSTER AVE, Chicago, IL", which translates to (41.974089,-

87.824812) on the map.

Some traps are "satellite traps". These are traps that are set up near (usually within 6 blocks) an

established trap to enhance surveillance efforts. Satellite traps are postfixed with letters. For

example, T220A is a satellite trap to T220.

Spray Data

The City of Chicago also does spraying to kill mosquitos. We give the GIS data for their spray

efforts in 2011 and 2013. Spraying can reduce the number of mosquitos in the area, and therefore

might eliminate the appearance of West Nile virus.


Weather Data

It is believed that hot and dry conditions are more favorable for West Nile virus than cold and

wet. We provide the NOAA dataset of weather conditions from 2007 to 2014, during the months of the tests.

Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933 Elev:

662 ft. above sea level

Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752 Elev: 612 ft. above sea

level


File descriptions

o train.csv, test.csv - the training and test set of the main dataset. The training set consists of data from 2007, 2009, 2011, and 2013, while in the test set one is requested to predict the test results for 2008, 2010, 2012, and 2014.
    Id: the id of the record
    Date: date that the WNV test is performed
    Address: approximate address of the location of trap. This is used to send to the GeoCoder.
    Species: the species of mosquitos
    Block: block number of address
    Street: street name
    Trap: Id of the trap
    AddressNumberAndStreet: approximate address returned from GeoCoder
    Latitude, Longitude: Latitude and Longitude returned from GeoCoder
    AddressAccuracy: accuracy returned from GeoCoder
    NumMosquitos: number of mosquitoes caught in this trap
    WnvPresent: whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present.
o spray.csv - GIS data of spraying efforts in 2011 and 2013
    Date, Time: the date and time of the spray
    Latitude, Longitude: the Latitude and Longitude of the spray
o weather.csv - weather data from 2007 to 2014. Column descriptions are in noaa_weather_qclcd_documentation.pdf.
o sampleSubmission.csv - a sample submission file in the correct format
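A minimal, hedged sketch of joining the weather file onto the training set by date, keeping only Station 1 (O'Hare) for simplicity; the Station column name is an assumption consistent with the two stations listed above:

    import pandas as pd

    train = pd.read_csv("train.csv", parse_dates=["Date"])
    weather = pd.read_csv("weather.csv", parse_dates=["Date"])

    # Keep one weather station and merge its daily readings onto each trap record.
    station1 = weather[weather["Station"] == 1]
    train = train.merge(station1, on="Date", how="left", suffixes=("", "_wx"))
    print(train.shape)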

Best Entry: 0.85991; Worst Entry: 0.40415


10. Caterpillar tube pricing

Problem: Model quoted prices for industrial tube assemblies

(Area: Logistics)

Walking past a construction site, Caterpillar's signature bright yellow machinery is one of

the first things one notices. Caterpillar sells an enormous variety of larger-than-life construction

and mining equipment to companies across the globe. Each machine relies on a complex set of

tubes (yes, tubes!) to keep the forklift lifting, the loader loading, and the bulldozer from dozing

off.

Like snowflakes, it's difficult to find two tubes in Caterpillar's diverse catalogue of machinery

that are exactly alike. Tubes can vary across a number of dimensions, including base materials,

number of bends, bend radius, bolt patterns, and end types.

Currently, Caterpillar relies on a variety of suppliers to manufacture these tube assemblies, each

having their own unique pricing model. This competition provides detailed tube, component, and

annual volume datasets, and challenges one to predict the price a supplier will quote for a given tube

assembly.

The dataset is comprised of a large number of relational tables that describe the physical

properties of tube assemblies.

The competition challenges participants to combine the characteristics of each tube assembly with supplier pricing dynamics in order to forecast a quote price for each tube. The quote price is labeled as cost in the data.

Data

File descriptions

train_set.csv and test_set.csv

These files contain information on price quotes from suppliers. Prices can be quoted in two ways: bracket and non-bracket pricing. Bracket pricing has multiple levels of purchase based on quantity (in other words, the cost is given assuming a purchase of quantity tubes). Non-bracket pricing has a minimum order amount (min_order) for which the price would apply. Each quote is issued with an annual_usage, an estimate of how many tube assemblies will be purchased in a given year.

tube.csv

This file contains information on tube assemblies, which are the primary focus of the

competition. Tube Assemblies are made of multiple parts. The main piece is the tube which has a


specific diameter, wall thickness, length, number of bends and bend radius. Either end of the

tube (End A or End X) typically has some form of end connection allowing the tube assembly to

attach to other features. Special tooling is typically required for short end straight lengths

(end_a_1x, end_a_2x indicate whether the end length is less than 1 times or 2 times the tube diameter,

respectively). Other components can be permanently attached to a tube such as bosses, brackets

or other custom features.

bill_of_materials.csv

This file contains the list of components, and their quantities, used on each tube assembly.

specs.csv

This file contains the list of unique specifications for the tube assembly. These can refer to

materials, processes, rust protection, etc.


tube_end_form.csv

Some end types are physically formed utilizing only the wall of the tube. These are listed here.

components.csv

This file contains the list of all of the components used. Component_type_id refers to the

category that each component falls under.

comp_[type].csv

These files contain the information for each component.

type_[type].csv

These files contain the names for each feature.
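As a minimal sketch of how these relational tables can be flattened onto the quote records (illustrative only; it assumes the join key tube_assembly_id and the quantity_1 ... quantity_8 columns are named as in the competition files):

    import pandas as pd

    # Load the price quotes and the tube geometry table
    train = pd.read_csv("train_set.csv", parse_dates=["quote_date"])
    tube = pd.read_csv("tube.csv")

    # Merge the tube's physical properties onto each quote via the assembly id
    data = train.merge(tube, on="tube_assembly_id", how="left")

    # bill_of_materials.csv lists up to eight (component, quantity) pairs per assembly;
    # a simple summary feature is the total number of components on the assembly
    bom = pd.read_csv("bill_of_materials.csv")
    qty_cols = [c for c in bom.columns if c.startswith("quantity")]
    bom["total_components"] = bom[qty_cols].sum(axis=1, skipna=True)
    data = data.merge(bom[["tube_assembly_id", "total_components"]],
                      on="tube_assembly_id", how="left")

    print(data[["tube_assembly_id", "quantity", "annual_usage",
                "total_components", "cost"]].head())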


11. San Francisco Crime Classification

Problem: Predict the category of crimes that occurred in the city by the bay

(Area: Crime)

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious

criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth

inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work,

there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12

years of crime reports from across all of San Francisco's neighborhoods. Given time and

location, one must predict the category of crime that occurred.

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning weeks 1, 3, 5, 7, ... belong to the test set and weeks 2, 4, 6, 8, ... belong to the training set.

The problem is to build a model to predict the category of crimes that occurred in the city by the bay.


Data

Data fields

Dates - timestamp of the crime incident
Category - category of the crime incident (only in train.csv). This is the target variable to predict.
Descript - detailed description of the crime incident (only in train.csv)
DayOfWeek - the day of the week
PdDistrict - name of the Police Department District
Resolution - how the crime incident was resolved (only in train.csv)
Address - the approximate street address of the crime incident
X - Longitude
Y - Latitude
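Given only time and location, a simple multinomial baseline can be set up as below (an illustrative sketch; the competition is scored on multi-class log loss, so class probabilities for every category are what finally get submitted):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Load the incident reports and work with a sample for speed
    train = pd.read_csv("train.csv", parse_dates=["Dates"]).sample(50000, random_state=0)

    # Derive simple time features from the timestamp and keep the coordinates
    train["Hour"] = train["Dates"].dt.hour
    train["Month"] = train["Dates"].dt.month
    train["Weekday"] = train["Dates"].dt.dayofweek
    X = train[["Hour", "Month", "Weekday", "X", "Y"]]
    y = train["Category"]

    # Multinomial logistic regression gives a probability for each crime category
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    probs = clf.predict_proba(X.head())
    print(pd.DataFrame(probs, columns=clf.classes_).round(3))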

***************


12. Airbnb New User Bookings

Problem: Where will a new guest book their first travel experience?

(Area: Tourism)

Instead of waking to overlooked "Do not disturb" signs, Airbnb travelers find themselves rising

with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat,

or cooking a shared regional breakfast with their hosts.

New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By

accurately predicting where a new user will book their first travel experience, Airbnb can share

more personalized content with their community, decrease the average time to first booking, and

better forecast demand.

Data

In this challenge, you are given a list of users along with their demographics, web session

records, and some summary statistics. You are asked to predict which country a new user's first

booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT',

'NL','DE', 'AU', 'NDF' (no destination found), and 'other'. Please note that 'NDF' is different from

'other' because 'other' means there was a booking, but it is to a country not included in the list,

while 'NDF' means there wasn't a booking.

The training and test sets are split by dates. In the test set, you will predict all the new users with

first activities after 7/1/2014 (note: this is updated on 12/5/15 when the competition restarted). In

the sessions dataset, the data only dates back to 1/1/2014, while the users dataset dates back to

2010.

File descriptions

train_users.csv - the training set of users

test_users.csv - the test set of users

sample_submission.csv - correct format for submitting your predictions


Fields

id: user id

date_account_created: the date of account creation

timestamp_first_active: timestamp of the first activity, note that it can be earlier than

date_account_created or date_first_booking because a user can search before signing up

date_first_booking: date of first booking

gender

age

signup_method

signup_flow: the page a user came to sign up from

language: international language preference

affiliate_channel: what kind of paid marketing

affiliate_provider: where the marketing is e.g. google, craigslist, other

first_affiliate_tracked: what's the first marketing the user interacted with before signing up

signup_app

first_device_type

first_browser

country_destination: this is the target variable you are to predict

sessions.csv - web sessions log for users

user_id: to be joined with the column 'id' in users table

action

action_type

action_detail

device_type

secs_elapsed

countries.csv - summary statistics of destination countries in this dataset and their

locations

age_gender_bkts.csv - summary statistics of users' age group, gender, country of

destination
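A minimal sketch of how the session log might be summarised per user and joined onto the training users (illustrative only; file and column names follow the descriptions above, and the submission format, as we recall, accepts up to five ranked country guesses per user):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    # Load users and their web-session logs
    users = pd.read_csv("train_users.csv", parse_dates=["date_account_created"])
    sessions = pd.read_csv("sessions.csv")

    # Summarise each user's sessions: number of actions and total seconds elapsed
    agg = sessions.groupby("user_id").agg(
        n_actions=("action", "count"),
        total_secs=("secs_elapsed", "sum"),
    ).reset_index()
    users = users.merge(agg, left_on="id", right_on="user_id", how="left")

    # A couple of simple features; missing ages and session stats are filled with 0
    users["age"] = users["age"].fillna(0).clip(upper=100)
    X = users[["age", "n_actions", "total_secs"]].fillna(0)
    y = users["country_destination"]

    # Class probabilities are sorted and the top five destinations kept per user
    clf = GradientBoostingClassifier(n_estimators=50)
    clf.fit(X, y)
    probs = pd.DataFrame(clf.predict_proba(X.head(3)), columns=clf.classes_)
    print(probs.apply(lambda row: row.nlargest(5).index.tolist(), axis=1))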


13. TFI: Restaurant Revenue Prediction

Problem: Predict annual restaurant sales based on objective measurements

(Area: Strategic Planning)

With over 1,200 quick service restaurants across the globe, TFI (Tab Food Investments) is the

company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes,

Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make

significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process

based on the personal judgement and experience of development teams. This subjective

data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the

wrong location for a restaurant brand is chosen, the site closes within 18 months and operating

losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites

would allow TFI (Tab Food Investments) to invest more in other important business areas, like

sustainability, innovation, and training for new employees. Using demographic, real estate, and

commercial data, this competition challenges you to predict the annual restaurant sales of

100,000 regional locations.

TFI has provided a dataset with 137 restaurants in the training set, and a test set of 100000

restaurants. The data columns include the open date, location, city type, and three categories of

obfuscated data: Demographic data, Real estate data, and Commercial data. The revenue column

indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive

analysis.


Data

File descriptions

train.csv - the training set. Use this dataset for training your model.
test.csv - the test set. To deter manual "guess" predictions, Kaggle has supplemented the test set with additional "ignored" data. These are not counted in the scoring.
sampleSubmission.csv - a sample submission file in the correct format

Data fields

o Id: restaurant id.
o Open Date: opening date for a restaurant
o City: city that the restaurant is in. Note that some city names contain unicode characters.
o City Group: type of the city; Big Cities or Other.
o Type: type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile
o P1, P2 - P37: three categories of obfuscated data. Demographic data are gathered from third-party providers with GIS systems; these include population in any given area, age and gender distribution, and development scales. Real estate data mainly relate to the m2 of the location, the front facade of the location, and car park availability. Commercial data mainly include the existence of points of interest, including schools, banks, and other QSR operators.
o Revenue: the revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Note that the values are transformed, so they do not correspond to real dollar values.
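With only 137 training restaurants against 100,000 test restaurants, overfitting is the main risk; a minimal sketch of a cross-validated baseline (assuming the target column is named revenue, as in the competition file) is:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Load the (very small) training set of 137 restaurants
    train = pd.read_csv("train.csv")

    # Use the obfuscated P1..P37 columns plus restaurant age in days as predictors
    train["Open Date"] = pd.to_datetime(train["Open Date"])
    train["age_days"] = (pd.Timestamp("2015-01-01") - train["Open Date"]).dt.days
    p_cols = [c for c in train.columns if c.startswith("P")]
    X = train[p_cols + ["age_days"]]

    # The skewed revenue target is tamed with a log transform before modelling
    y = np.log1p(train["revenue"])

    # With only 137 rows, cross-validation matters more than model complexity
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print("CV RMSE (log scale):", -scores.mean())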


14. Otto Group Product Classification

Challenge

Problem: Classify products into the correct category

(Area: e-commerce)

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more

than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France).

We are selling millions of products worldwide every day, with several thousand products being

added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to our diverse

global infrastructure, many identical products get classified differently. Therefore, the quality of

our product analysis depends heavily on the ability to accurately cluster similar products. The

better the classification, the more insights we can generate about our product range.

Each row corresponds to a single product. There are a total of 93 numerical features, which

represent counts of different events. All features have been obfuscated and will not be defined

any further.

There are nine categories for all products. Each target category represents one of our most

important product categories (like fashion, electronics, etc.). The products for the training and

testing sets are selected randomly.

Data

trainData.csv - the training set
testData.csv - the test set
sampleSubmission.csv - a sample submission file in the correct format

Data fields

id - an anonymous id unique to a product

feat_1, feat_2, ..., feat_93 - the various features of a product

target - the class of a product
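A minimal sketch of a baseline for this nine-class problem (illustrative only, using the file names as listed above; the competition is evaluated on multi-class log loss, so predicted probabilities are submitted for every class):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    # Load the obfuscated count features and the nine-class target
    train = pd.read_csv("trainData.csv")
    feat_cols = [c for c in train.columns if c.startswith("feat_")]
    X, y = train[feat_cols], train["target"]

    # Hold out a validation split; well-calibrated class probabilities matter
    # more than raw accuracy under a log-loss evaluation
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=0, stratify=y)
    clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    clf.fit(X_tr, y_tr)
    print("validation log loss:", log_loss(y_val, clf.predict_proba(X_val)))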


15. Walmart Recruiting - Store Sales

Forecasting

Problem: Use historical markdown data to predict store sales (Area: Retail Sales)

One challenge of modeling retail data is the need to make decisions based on limited history. If

Christmas comes but once a year, so does the chance to see how strategic decisions impacted the

bottom line.

In this recruiting competition, job-seekers are provided with historical sales data for 45 Walmart

stores located in different regions. Each store contains many departments, and participants must

project the sales for each department in each store. To add to the challenge, selected holiday

markdown events are included in the dataset. These markdowns are known to affect sales, but it

is challenging to predict which departments are affected and the extent of the impact.

You are provided with historical sales data for 45 Walmart stores located in different regions.

Each store contains a number of departments, and you are tasked with predicting the department-

wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These

markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor

Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times

higher in the evaluation than non-holiday weeks. Part of the challenge presented by this

competition is modeling the effects of markdowns on these holiday weeks in the absence of

complete/ideal historical data.
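The five-fold holiday weighting can be expressed as a weighted mean absolute error; the toy sketch below shows the idea (our reading of the evaluation rule, not an official implementation):

    import numpy as np
    import pandas as pd

    def wmae(y_true, y_pred, is_holiday, holiday_weight=5):
        """Weighted mean absolute error: holiday weeks count five times as much."""
        w = np.where(is_holiday, holiday_weight, 1)
        return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

    # Toy example: two ordinary weeks and one holiday week
    actual    = pd.Series([100.0, 120.0, 300.0])
    predicted = pd.Series([110.0, 115.0, 250.0])
    holiday   = pd.Series([False, False, True])
    print(wmae(actual, predicted, holiday))   # (10 + 5 + 5*50) / 7 = 37.86 approx.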

Data

stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of

store.


train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file

you will find the following fields:

Store - the store number

Dept - the department number

Date - the week

Weekly_Sales - sales for the given department in the given store

IsHoliday - whether the week is a special holiday week

test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the

sales for each triplet of store, department, and date in this file.

features.csv

This file contains additional data related to the store, department, and regional activity for the

given dates. It contains the following fields:

Store - the store number

Date - the week

Temperature - average temperature in the region

Fuel_Price - cost of fuel in the region

MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is

running. MarkDown data is only available after Nov 2011, and is not available for all

stores all the time. Any missing value is marked with an NA.

CPI - the consumer price index

Unemployment - the unemployment rate

IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays

are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13

Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
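For modelling, the three files are typically joined into a single table. A minimal sketch (IsHoliday appears in both train.csv and features.csv, so one copy is dropped before the merge):

    import pandas as pd

    # Load the three files and assemble one modelling table
    train = pd.read_csv("train.csv", parse_dates=["Date"])
    features = pd.read_csv("features.csv", parse_dates=["Date"])
    stores = pd.read_csv("stores.csv")

    # features.csv is keyed by store and week; stores.csv is keyed by store only
    data = (train
            .merge(features.drop(columns=["IsHoliday"]), on=["Store", "Date"], how="left")
            .merge(stores, on="Store", how="left"))

    # MarkDown columns are NA outside the post-Nov-2011 window; fill with 0 for modelling
    markdown_cols = [f"MarkDown{i}" for i in range(1, 6)]
    data[markdown_cols] = data[markdown_cols].fillna(0)

    print(data[["Store", "Dept", "Date", "Weekly_Sales", "Type", "Size"]
               + markdown_cols].head())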


16. Springleaf: Determine whether to

send a direct mail piece to a customer

Problem: Predict which customers can be directly targeted (Area: Marketing/Sales)

Springleaf puts the humanity back into lending by offering their customers personal and auto

loans that help them take control of their lives and their finances. Direct mail is one important

way Springleaf's team can connect with customers who may be in need of a loan.

Direct offers provide huge value to customers who need them, and are a fundamental part of

Springleaf's marketing strategy. In order to improve their targeted efforts, Springleaf must be

sure they are focusing on the customers who are likely to respond and be good candidates for

their services.

Using a large set of anonymized features, Springleaf is asking you to predict which customers

will respond to a direct mail offer. You are challenged to construct new meta-variables and

employ feature-selection methods to approach this dauntingly wide dataset.

You are provided a high-dimensional dataset of anonymized customer information. Each row

corresponds to one customer. The response variable is binary and labeled "target". You must

predict the target variable for every row in the test set.

The features have been anonymized to protect privacy and are comprised of a mix of continuous

and categorical features. You will encounter many "placeholder" values in the data,

which represent cases such as missing values. We have intentionally preserved their encoding to

match with internal systems at Springleaf. The meaning of the features, their values, and

their types are provided "as-is" for this competition; handling a huge number of messy features is

part of the challenge here.

Data

File train.csv is around 1 GB in size (145,231 rows x 1,934 columns); the test file is also around 1 GB (145,232 rows x 1,933 columns).
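As a minimal sketch of a first pass over such a wide table (assuming the id and label columns are named ID and target, and reading only a slice of the 1 GB file to keep memory in check):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Read a manageable slice of the ~1 GB training file
    train = pd.read_csv("train.csv", nrows=20000)
    y = train["target"]
    X = train.drop(columns=["ID", "target"])

    # Drop columns that never vary and label-encode the categorical/placeholder columns
    X = X.loc[:, X.nunique() > 1].copy()
    for col in X.select_dtypes(include="object").columns:
        X[col] = X[col].astype("category").cat.codes
    X = X.fillna(-1)

    # Tree-based importances give a cheap first pass at feature selection
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X, y)
    top = pd.Series(clf.feature_importances_, index=X.columns).nlargest(20)
    print(top)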

Results achieved

Results achieved are as below:


17. Santander Customer Satisfaction

Problem: Which customers are happy customers?

From frontline support teams to C-suites, customer satisfaction is a key measure of success.

Unhappy customers don't stick around. What's more, unhappy customers rarely voice their

dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their

relationship. Doing so would allow Santander to take proactive steps to improve a customer's

happiness before it's too late.

In this competition, one will work with hundreds of anonymized features to predict whether a customer is

satisfied or dissatisfied with their banking experience.

Data Files

File name Available Formats

sample_submission.csv .zip (175.67 kb)

test.csv .zip (3.31 mb)

train.csv .zip (3.34 mb)

You are provided with an anonymized dataset containing a large number of numeric variables.

The "TARGET" column is the variable to predict. It equals one for unsatisfied customers and 0

for satisfied customers.

The task is to predict the probability that each customer in the test set is an unsatisfied customer.

File descriptions

1. train.csv - the training set including the target


2. test.csv - the test set without the target

3. sample_submission.csv - a sample submission file in the correct format
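A minimal Python sketch of the task (not the SparkR glm model mentioned under "Results achieved" below; it assumes the id and label columns are named ID and TARGET, as in the competition files):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Load the anonymized numeric features and the binary TARGET column
    train = pd.read_csv("train.csv")
    y = train["TARGET"]
    X = train.drop(columns=["ID", "TARGET"])

    # Remove constant columns, then fit a simple binomial (logistic) model
    X = X.loc[:, X.nunique() > 1]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=0, stratify=y)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)

    # The submission asks for the probability that each customer is unsatisfied
    val_probs = clf.predict_proba(X_val)[:, 1]
    print("validation AUC:", roc_auc_score(y_val, val_probs))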

Results achieved

The problem was solved using glm (binomial family) model of sparkR (ver 1.6.1). Results

achieved are as below:

*************


18. Influencers in Social Networks

Problem: Predict which people are influential in a social network

Data Science London and the UK Windows Azure Users Group, in partnership with Microsoft and Peerindex, announce the Influencers in Social Networks competition as part of The Big Data

Hackathon. This competition asks you to predict human judgments about who is more influential on social

media.

The dataset, provided by Peerindex, comprises a standard, pair-wise preference learning task.

Each data point describes two individuals, A and B. For each person, 11 pre-computed, non-

negative numeric features based on twitter activity (such as volume of interactions, number of

followers, etc) are provided.

The binary label represents a human judgment about which one of the two individuals is more

influential. A label '1' means A is more influential than B. 0 means B is more influential than A.

The goal of the challenge is to train a machine learning model which, for pairs of individuals,

predicts the human judgment on who is more influential with high accuracy. Labels for the

dataset have been collected by PeerIndex.

Kaggle Reference: https://www.kaggle.com/c/predict-who-is-more-influential-in-a-social-network

Data Files

File name Available Formats

Sample_submission.csv .csv (102.85 kb)

Test .csv (1.29 mb)

train.csv .csv (1.20 mb)
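Since each row compares two individuals described by the same set of features, a natural baseline is to model the difference between A and B (here, of log-transformed features). A minimal sketch, assuming the label column is named Choice and the per-person columns carry A_ and B_ prefixes:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Load the pairwise comparisons; each row has the same 11 features for A and for B
    train = pd.read_csv("train.csv")
    a_cols = [c for c in train.columns if c.startswith("A_")]
    b_cols = [c for c in train.columns if c.startswith("B_")]
    y = train["Choice"]          # assumed label column: 1 if A is more influential

    # A natural encoding for a pairwise preference task: log-ratio of A's and B's features
    X = np.log1p(train[a_cols].values) - np.log1p(train[b_cols].values)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))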


Keywords: Social media analytics; twitter analytics; social networks


19. Predicting Red Hat Business Value

Problem: Classifying customer potential

Like most companies, Red Hat is able to gather a great deal of information over time about the

behavior of individuals who interact with them. They’re in search of better methods of using this

behavioral data to predict which individuals they should approach—and even when and how to

approach them.

In this competition, Kagglers are challenged to create a classification algorithm that accurately

identifies which customers have the most potential business value for Red Hat based on their

characteristics and activities.

With an improved prediction model in place, Red Hat will be able to more efficiently prioritize

resources to generate more business and better serve their customers.

Data Files

File Name Available Formats

people.csv .zip (3.22 mb)

sample_submission.csv .zip (1.18 mb)

act_test.csv .zip (4.03 mb)

act_train.csv .zip (17.07 mb)

This competition uses two separate data files that may be joined together to create a single,

unified data table: a people file and an activity file.


The people file contains all of the unique people (and the corresponding characteristics) that have

performed activities over time. Each row in the people file represents a unique person. Each

person has a unique people_id.

The activity file contains all of the unique activities (and the corresponding activity

characteristics) that each person has performed over time. Each row in the activity file represents

a unique activity performed by a person on a certain date. Each activity has a unique activity_id.

The challenge of this competition is to predict the potential business value of a person who has

performed a specific activity. The business value outcome is defined by a yes/no field attached to

each unique activity in the activity file. The outcome field indicates whether or not each person

has completed the outcome within a fixed window of time after each unique activity was

performed.

The activity file contains several different categories of activities. Type 1 activities are different

from type 2-7 activities because there are more known characteristics associated with type 1

activities (nine in total) than type 2-7 activities (which have only one associated characteristic).

To develop a predictive model with this data, one has to merge the files together into a single

data set. The two files can be joined together using person_id as the common key. All variables

are categorical, with the exception of 'char_38' in the people file, which is a continuous

numerical variable.
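A minimal sketch of the join described above, using the file names listed earlier (people.csv and act_train.csv):

    import pandas as pd

    # Load the two files; both are assumed to carry a 'date' column, hence parse_dates
    people = pd.read_csv("people.csv", parse_dates=["date"])
    activities = pd.read_csv("act_train.csv", parse_dates=["date"])

    # Join activities to the corresponding person on the shared people_id key;
    # suffixes disambiguate the char_* and date columns present in both files
    merged = activities.merge(people, on="people_id", how="left",
                              suffixes=("_activity", "_person"))

    # Everything except char_38 is categorical; char_38 stays numeric
    print(merged.shape)
    print(merged[["people_id", "activity_id", "outcome", "char_38"]].head())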

(Refer: https://www.kaggle.com/c/predicting-red-hat-business-value/data )

******************