40
PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Embed Size (px)

Citation preview

Page 1: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

PORTAL Data Quality & Aggregation

Sue Ahn, Kristin Tufte

Page 2: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Road Map Initial Data Cleaning (more cleaning later)

Set null values as appropriate Set Data Quality Flags based on row-level data

integrity violations

Aggregate Calculations Calculate 5-minute aggregates Calculate 15-min, 1-hour aggregates

Portal Output Calculations Timeseries Grouped Data Performance

Page 3: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Road Map: Initial Data Cleaning

Set NULL values Data Quality Flags

Page 4: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Set Null Values as Appropriate Case 1: No Data Received from Detector

ODOT indicates vol = occ = speed = -1, status = 0 means no data received from detector ** (status 0 is inhibited)

We set volume, occupancy and speed to NULL in this case

Case 2: No Activity at Detector ODOT indicates vol = occ = speed = 0, status = 0 means no activity at

detector ** We set speed to NULL in this case

Case 3: Disabled (?) Also get readings: vol = occ = speed = -1, status = 1 (status 1 is disabled) Similar to Case 1, except status = 1, so assume this does mean disabled We set volume, occupancy and speed to NULL in this case Should verify the meaning of vol = occ = speed = -1, status = 1 with ODOT

** as of 2-14-05; fields were different before No -999’s in data received live; there are -999s in 2001, 2002 data

Page 5: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Data Quality Flags

Detect data inconsistencies and erroneous values

Row-level/Reading-level flags Flags are set for each (vol, occ, speed) reading we

get For each reading (row), a flag is set to indicate if that

row fails a particular criteria Examples

Max Speed, set if speed > 100 mph Max Occ, set if occ > 95%

Adapted from TTI (see wiki for link)

Page 6: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Data Quality Flags – Single Value Flags Maximum Volume (DQ_MAXVOL)

Condition: volume > 17 (20-sec data) Comments: Equivalent to 3060 vph; may be caused by

motorcycles or lane changes for example Maximum occupancy (DQ_MAXOCC)

Condition: occupancy > 95% Comments: detectors may be “stuck-on”

Maximum speed (DQ_MAXSPD) Condition: speed > 100 mph Comments: based on empirical evidence

Minimum speed (DQ_MINSPD) Condition: speed < 5 mph Comments: based on empirical evidence

Page 7: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Data Quality Flags – Multi-variate Consistency Multi-variate consistency #1 (DQ_MVC_S0VGT0)

Speed = 0 and Volume > 0 Indicates potential problem with: speed Is this necessary? We already eliminate speed < 5 mph

Multi-variate consistency #2 (DQ_MVC_SGT0V0) Speed > 0 and Volume = 0 Indicates potential problem with: speed

Multi-variate consistency #3 (DQ_MVC_OGT0V0) Occupancy > 0 and Volume = 0 Note: alternative: set occupancy = 0 Indicates potential problem with: occupancy

Page 8: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Road Map: Aggregate Calculations

5-minute aggregates 15-min, 1-hr aggregates

Page 9: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Missing Values in 5-min Aggregates of Vol, Speed, Occ

Raw AggregatesSum/avg data with no imputation, rescaling or

cleaning Processed Aggregates

Two issues:#1 Is there enough data to calculate an

aggregate?#2 If there is enough data, how to handle the

fact that we are missing values?

Page 10: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Missing Values in 5-min Aggregates of Vol, Speed, Occ, Cont’d

Issue #1: Determining if a time period has enough valid data to calculate aggregates for the time periodAggregate vol, speed, occ set to null if less

than X% of the readings are validDo sensitivity analysis to pick X (start with

75% - set 5-min aggregate to null if we have less than 75% of the data)

Page 11: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Bad Data Threshold Rob - In general, throughout the development of the aggregation process,

we consistently encountered the issue of how to determine when there is enough data to create an aggregate vs. when an aggregate value should be declared null because of lack of data.

For starters, we’ve chosen a threshold of 75%. If 75% or more of the 20-second readings in a 5-minute interval are valid, we calculate an aggregate for that 5-minute interval.

The same issue arises in the following situations: Calculating 5-min aggregates from 20-second data Calculating 15-min aggregates from 5-min data Calculating 1-hr aggregates from 15-min data Grouped Data: grouping over multiple days (group by hour) Grouped Data: aggregating 1-hr data to daily data for group by day of week and

week of year Related issues:

Aggregating over lanes (how to handle the situation when data is missing from one lane, note that data may be missing from a lane for only a portion of a day)

We wonder if studying this percentage (data sensitivity) would make a good paper topic.

Page 12: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

#2 Handling Missing Values

Two basic methods for dealing with missing values 1) Use preceding/following values (replace missing volume with

preceding volume reading for that detector) 2) Aggregate values in a 5 minute range and scale based on # of

readings Since we are aggregating, suggest method 2

We are aggregating, so interpolation may not be necessary Think 2 will perform similarly to 1 in cases when only a few

readings are missing, when lots of readings missing, should report that data is not available (null)

Method 2 is well-defined (simple to define and express/explain) Method 2 is easier to implement than 1

Page 13: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

5-min Aggregates (vol, spd, occ)

Volume For each detector, sum all valid volume readings in 5-

minute period (eliminate any readings with non-zero flags) and rescale by # of missing readings

Speed, Occupancy For each detector, calculate the volume-weighted

average of all valid readings in a 5-minute period Assume 20-second readings are space-mean speed Weight by volume (to be technically correct, we need to

weight by density, but that requires estimating density)

Page 14: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

5-Min Aggregate – Other Measures

Vehicle Miles Traveled (VMT) 5-min VMT = 5-min volume * length of midpoint

influence area of associated detector

Vehicle Hours Traveled (VHT) 5-min VHT = 5-min volume * (length of midpoint

influence area / 5-min space-mean speed)

Travel Time 5-min travel time = length of midpoint influence area /

(5-min space-mean speed)

Page 15: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

5-min Aggregates – Other – Cont’d

Total Delay (in veh-mins) 5-min delay = 5-min volume * (travel time – (influence

area / free-flow speed)) Where: free-flow speed = min {assumed free-flow

speed, actual 5-min speed} Note: Free-flow speed is computed in this way in

order to avoid negative delay Average Delay/Vehicle (in minutes)

5-min delay = Total Delay/5-min volume Should we report Average or Total Delay or

both?

Page 16: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Summary: Which values/measures affect which calculations?

20-sec volume

20-sec speed

20-sec occupancy

5-min volume

5-min speed

5-min occupancy

5-min VMT

5-min Travel Time

5-min Delay

5-min VHT

**

* May be needed, depends on method used

** Need 5-min speed to calculate free-flow speed

*

Page 17: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Road Map: 15-min, 1-hr Aggregates

Issues:Calculated from 5-min aggregates or from 20-

second data? We chose to use 5-min data

What should the threshold be for % of bad data?

Chose to toss 15-min reading if we are missing more than one 5-min reading

For now, toss 1 hr reading if we have less than 75% of 5-min readings (or less than 9 readings)

Note the “bad data” threshold issue arises here, also see next page.

Page 18: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Aggregation over Larger Intervals

15-min aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than one 5-minute reading

1-hr aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than three 5-minute readings

Page 19: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Aggregation over Time, Cont’d Volume

Sum 5-min volumes Occupancy

Volume-weighted average of 5-min occupancies Speed

Volume-weighted average of 5-min speeds Vehicle Miles Traveled

sum of 5-min VMTs Vehicle Hours Traveled

sum of 5-min VHTs Travel Time

Volume-weighted average of 5-min travel times Delay

Total Delay: Sum of 5-min delays Average Delay: Volume-weighted average of 5-min delays

Page 20: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Road Map: PORTAL Output

We have described how to do the 5-min, 15-min and 1-hr aggregations

For each page in PORTAL, analyze what calculations are needed and how those calculations should be done

Page 21: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Timeseries – Line Graphs Timeseries

Options: graph by station or lane, 7 basic measures, 5-min, 15-min, 1-hr granularity of input data

For Lane Graph These values are in aggregate table (calculation described in

previous slides) Requires 5-min (15-min, 1-hr) aggregates of each value for

each detector If data is missing, put blank space in the graph

For Station Graph: Requires aggregation over lanes Calculation described in subsequent slides If data is missing from one or more lanes, show blank space

on graph

Page 22: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Timeseries – Contour Plots Contour Plot (Surface Plot?) for a Highway

Requires same calculations as for station plot on previous slide

Page 23: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Aggregation over Lanes Volume

Sum Occupancy

Volume-weighted average of by-lane occupancies Speed

Volume-weighted average of by-lane speeds Vehicle Miles Traveled

Sum over lanes Vehicle Hours Traveled

Sum over lanes Travel Time

Volume-weighted average of by-lane travel times Average Delay

Volume-weighted average of by-lane delays Total Delay

Sum of by-lane delays

Page 24: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Grouped Data – Line Plots Line Plots by Station/Detector

Group by: 5 minute (minute of day) Input is 5-minute aggregations Simple arithmetic average, min, max, and std dev of data

from appropriate 5-min period Result shows variation between days

Group by: Hour of day Input is 1-hr aggregations As above, simple arithmetic average, min, max and std dev

of data from appropriate 1-hr periods Result shows variation between days

What to do if we are missing data? What if we are missing 50% of the values for 8:00-9:00, for example? We may be missing different %s for different hour intervals??

Page 25: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Grouped Data – Line Plots, Cont’d Line Plots by Station/Detector

Group by: Day of Week Input must be 1-hr aggregations Aggregate 1-hr data to daily data What if one or more hours in a day is/are missing?? (What if

we are missing a peak hour?) Then simple arithmetic avg, min, max, std dev of aggregates Result shows variation within day of week

Group by: Week of Year Similar to day of week

Note: “bad data threshold” issue again. Seems reasonable to rescale when going from 20-sec data to 5-min data, and again 5-min to 1-hr seems OK. But neither interpolating or rescaling seem really acceptable in this situation.

Page 26: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Aggregation to Day of Week Volume

Sum 1-hr volumes Occupancy

?? Volume-weighted average of 5-min occupancies Speed

( 1-hr VMTs)/( 1-hr VHTs) ensure for VMT and VHT - either are both null or neither are null

Vehicle Miles Traveled 1-hr VMTs Rescale???

Vehicle Hours Traveled 1-hr VHTs

Travel Time Volume-weighted average of 5-min travel times

Delay Total Delay: Sum of 5-min delays Average Delay: Volume-weighted average of 5-min delays

Page 27: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

WARNING: STOP HERE

Following slides are under construction (hard hats required)

Page 28: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Grouped Data – Highway Plots Highway Plots

What does this do anyway??? Group by: 5 minute (minute of day)

Requires input to be 5-minute aggregate data Need to group 5-minute periods from different days together Use appropriate aggregation-by-time method

Group by: Hour of day Input can be 5-min, 15-min or 1-hr For 1hr input need to group 1-hr periods from different days

together Use appropriate aggregation-by-time method

5-min, 15-min input – Should we restrict this to 1-hr input?? Just use appropriate aggregation-by-time method??

Page 29: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Grouped Data – Highway Plots, Cont’d Highway Plots

Group by: Day of Week Input is 5-min, 15-min, 1-hr Need to group 5-min data into 5-min groups, 15-

min data into 15-min groups, etc. Use appropriate aggregation by time method??

Group by: Week of Year Same as Day of Week, I think

Page 30: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Grouped Data – All Highways

All Highways Plots

Page 31: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Performance Reports MONTHLY FIGURES

Avg TT Avg TS Avg VS FFTT FFTS 95th% TT 95th% TS   166.11 Mins   52.05 MPH   53.79 MPH   145.81 Mins   59.29 MPH   232.16

Mins    37.24 MPH   VMT   VHT  % Cngst Hwy Len Days Used   211,298,304   4,118,289   15.09%   144.09 Miles  31 WEATHER NUMBER OF DAYS Rain ≥ 0.5" Rain ≥ 1.0" Frzn Precip Fog 8 2 0 1

TIME WINDOW FIGURES (primarily for daily report – BI, TTI are new) Avg TT Avg TS 95th% TT 95th% TS   166.11 Mins   52.05 MPH   232.16 Mins   37.24 MPH   VMT VHT % Cngst TTI BI   211,298,304   4,118,289   15.09%   1.59   1.97

Page 32: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Future Flags

Calibration Error (not in this presentation, see paper) Multi-variate consistency

Volume > nmax when Occupancy = 0 This should be tested after correcting for calibration error

Maximum reduction in speed Speedn+1 > 0.45*Speedn

Identical records If more than 8 consecutive records (volume, occupancy, and

speed) are identical

Page 33: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Extra Slides

Page 34: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Questions

No vehicles present Speed = 0 when Volume = 0Set Speed = null

Am not doing this because of the -1, -1, -1, and 0, 0, 0 tests

At what point do we call a 5-minute interval invalid?? 5 missing readings? 8? 9? 10?

Aggregation over space – we need something for vol and

Aggregation over space – need to deal with missing data Need aggregation over time (i.e. combining data from

different days together)

Page 35: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

ODOT SWARM Flags

0 - inhibited 1 - disabled 2 - ok 3 - suspect 4 - softfail 5 - hardfail

Page 36: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Data Notes Status 0 Received Data As of Feb 14, 2005, ODOT reports vol -1, occ -1,

speed -1, status 0, for detectors with no controller communications (no data received from detector), and vol 0, occ 0, speed 0, status 0 for detectors reporting no traffic data (no activity) (info from David Gaarsoe). See Data Fidelity and minus 1 and minus 999 for more info.

Data Cleaning The nightly scripts "clean" the -1 and -999 data as follows: Sets volume = 0, speed = NULL, occupancy = 0 where volume = -999 Sets volume = NULL, speed = NULL, occupancy = NULL, where volume

= -1 I don’t belive we receive -999 data anymore (KT)

Unknown We seem to also get vol = 0, speed = 0, status = 2 (or other status), with

occ > 0. What does this mean? What was that weird thing I noted in the file on badger?

Page 37: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Steve H’s notes from emails from ODOT -999

When we received blocks of data from 2001 and 2002, the 20-second data included records where volume=-999. This indicated that no cars passed. As of July 14, 2004 when PORTAL went live, we no longer received volume= -999. -999 are no longer an issue.

-1 From July 14, 2004 to November 17, 2004, we were receiving records with volume=-1. In all cases, when volume=-1, the SWARM algorithm variable status=2, which is the code for "OK." As I understand it, this is supposed to represent a communication error.

From November 18, 2004 to February 13, 2005, we did not receive any records where volume = -1. This is due to "Transport ATMS build 2.1, distributed on Nov 17, included many internal changes to traffic sensor and detector station data, (including station level data smoothing). Among other changes it started sending a 0 detector volume during Communications Failure with the Wapiti controllers, as noted by Robert Bertini." -David Gaarsoe

During this time period, it is impossible to distinguish between communication errors and periods where no cars pass.

Beginning on February 14, 2005, it appears that Transport ATMS build 2.3 was released. "Transport ATMS build 2.3 will ... once again report a -1 vol, -1 occupancy, and -1 speed, Inhibited status (0) for Traffic Detectors with no controller communications, and a 0 vol, 0 occupancy, 0 speed, Inhibited status for detectors of reporting controllers with no traffic data (no activity)." -David Gaarsoe

It appears that, beginning on February 14, 2005 we are now able to accurately distinguish between communication errors and periods where no vehicles pass a detector.

http://wiki.its.pdx.edu/wiki/portal/moin.cgi/ODOT_-1_Email

Page 38: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Data Quality Flags - Implementation

Queries or scripts analyze the data and set flags when criteria are met

We will be able to tell which data values failed which criteria

Page 39: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Aggregation over Lanes/Corridors

Volume Over lanes: Sum (average volume over detector stations doesn’t seem useful)

Occupancy Over lanes: Weighted average of by-lane occupancies by

volumes (average occupancy over detector stations doesn’t seem useful)

Speed Over lanes: Weighted average of by-lane speeds by volumes Over detector stations: Weighted average of speeds by volumes

(?)

Page 40: PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Aggregation over Space, Cont’d

Vehicle Miles Traveled Sum over lanes or loop detector stations

Vehicle Hours Traveled Sum over lanes or loop detector stations

Travel Time Over lanes: Weighted average of travel time by volumes Over loop detector stations: Sum of segment-wide travel times

Delay Over lanes: Weighted average of travel time by volumes Over loop detector stations: Sum of segment-wide travel times