Tools to Optimize Operational Test Analysis
Frank Thomason
Why do we need tools?
» Data, data everywhere
  Managing data affects the success of your test
  Accuracy is critical
  Don’t throw away useful data
  Complexity
» Timeliness
  Customers want results quickly
  Be prepared for the 3pm phone call
» Happiness
  Data scientists spend 50-80% of their time doing “data janitorial” work (NY Times)
  Let the analysts do analysis
Strategy
» Automate as much as possible
» Be flexible
» Don’t reinvent the wheel
  Use existing tools when possible
  For custom software, use pre-built modules
» Don’t try to find one all-encompassing tool
  Break it into smaller chunks
  Keep it simple and focused
Data Analysis Lifecycle
Retrieve → Authenticate → Analyze → Report
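The four lifecycle stages above can be sketched as a simple pipeline. This is only an illustration; every function, field, and value here is invented, not part of any real tool:

```python
# A minimal sketch of the Retrieve -> Authenticate -> Analyze -> Report
# lifecycle. All stage logic and field names are hypothetical.

def retrieve():
    # Pull raw records from the source (hard-coded here for illustration).
    return [{"run": 1, "score": 82.0}, {"run": 2, "score": None}]

def authenticate(records):
    # Flag records that fail quality checks instead of discarding them.
    for r in records:
        r["flagged"] = r["score"] is None
    return records

def analyze(records):
    # Compute a summary over clean records only.
    clean = [r["score"] for r in records if not r["flagged"]]
    return {"n": len(clean), "mean": sum(clean) / len(clean)}

def report(summary):
    return f"{summary['n']} clean record(s), mean score {summary['mean']:.1f}"

print(report(analyze(authenticate(retrieve()))))
```

Keeping each stage a separate function mirrors the deck's advice to break the problem into smaller, focused chunks.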
Steps to Retrieve Data
» Move data from source to your environment
» Convert it to your preferred format
  CSV, XML, etc.
» Load data into your repository
  Database, shared drive, big data repository, etc.
» Run data quality checks to flag records for investigation and cleaning
» Alert when errors occur
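The quality-check and alerting steps above could be sketched roughly like this; the column names and rules are made up for illustration:

```python
import csv, io, logging

# Hypothetical quality rules; real checks depend on the test's data spec.
def quality_check(row):
    errors = []
    if not row["event_id"].strip():
        errors.append("missing event_id")
    try:
        float(row["value"])
    except ValueError:
        errors.append("value is not numeric")
    return errors

raw = "event_id,value\nE1,3.5\nE2,oops\n,7.1\n"  # stand-in for a source file
flagged = []
for row in csv.DictReader(io.StringIO(raw)):
    errors = quality_check(row)
    if errors:
        flagged.append((row, errors))
        logging.warning("flagged %r: %s", row, "; ".join(errors))  # alert

print(f"{len(flagged)} record(s) flagged for investigation")
```

Note the bad records are flagged and logged, not dropped, so they can be investigated and cleaned later.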
Tools to Retrieve Data
» Data warehousing has spawned many tools for retrieving data: ETL (Extract, Transform, Load)
» Open-source tools
  Talend Data Integrator
  Pentaho Kettle
  Hadoop (big data)
» Commercial tools
  IBM DataStage
  SQL Server Integration Services
  Informatica
  Oracle Data Integrator
» Custom tools
  Not recommended
Features in ETL Tools
» Hundreds of pre-built components
» Process-diagram interface makes it easy to understand and debug
» Lots of configuration options
» Create custom components
» Can run on a schedule or manually
ETL Tool Screenshot
Typical Extract-Load Job (from Talend Data Integrator)
ETL Map Fields
ETL Standard Components
Database components: AS400, Access, Amazon RDS, DB Generic, DB JDBC, DB2, Firebird, Greenplum, HSQL DB, Hive, Informix, Ingres, Interbase, JavaDB, LDAP, MaxDB, MS SQL Server, MySQL, Netezza, Ole DB, Oracle, ParAccel, PostgreSQL, Redshift, SAS, SQLite, Sybase, Teradata, Vertica, eXist
File formats: Apache Log, ARFF, Delimited, EBCDIC, Email, Excel, JSON, MS Delimited, MS Positional, MS XML, Positional, Properties, RegEx, XML
Protocols: HTTP Request, FTP, FileFetch, POP, Kerberos, Keystore, Proxy, Socket, Web Service, XMLRPC, SOAP, RSS, Named Pipe, REST, MOM, JMS
Tips to Retrieve Data
» Lots of variability with retrieving source data
  Proprietary formats, software, media, etc.
» Don’t modify the source data
  Keep it as an accurate representation of the original data
  You can transform it later
  Keep the structure the same
» Flexibility is more important than speed
» Use an existing ETL tool
» Consider a big data tool if you have a lot of log data
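One way to enforce "don't modify the source data" is to archive the raw file byte-for-byte and record its hash before any transform runs. A minimal sketch, with example paths only:

```python
import hashlib, os, shutil, tempfile

# Sketch: archive the source file untouched and record its SHA-256, so any
# later transform can always be checked against the original bytes.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.log")       # stand-in for real source
with open(source, "w") as f:
    f.write("2024-01-01 12:00:00,E1,3.5\n")

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

original_hash = sha256(source)
archived = os.path.join(workdir, "archive.log")
shutil.copyfile(source, archived)   # store as-is; transforms work on copies

print("archive verified:", sha256(archived) == original_hash)
```

The hash gives an audit trail: if the archive and the recorded hash ever disagree, the "accurate representation of the original data" has been broken.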
Steps to Authenticate Data
» Investigate flagged records
» Clean data wherever possible
» Present data to the Data Authority for authentication
  Summary of data
  Sampling of records if the dataset is large
  Separate flagged records for closer review
» Update records with the results of authentication
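The presentation step above, summarizing the dataset, sampling it when it is large, and separating flagged records, might look like this in outline (records and flag logic are invented):

```python
import random

# Hypothetical records; in practice the "flagged" field would come from the
# quality checks run during retrieval.
records = [{"id": i, "value": float(i), "flagged": i % 5 == 0}
           for i in range(1, 101)]

flagged = [r for r in records if r["flagged"]]       # for closer review
clean = [r for r in records if not r["flagged"]]

summary = {
    "total": len(records),
    "flagged": len(flagged),
    "min": min(r["value"] for r in records),
    "max": max(r["value"] for r in records),
}
# Sample clean records for the Data Authority when the dataset is large.
sample = random.Random(0).sample(clean, k=10)
print(summary, len(sample))
```

A fixed random seed makes the sample reproducible, so the Data Authority can review the same records again later.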
Tools to Authenticate Data
» Not many off-the-shelf options
» Custom software may be the best choice
  Use pre-existing modules for rapid development
  ASP.NET, Java, and Python all have pre-built interfaces for viewing, sorting, and editing data
  Avoid bells and whistles that will make the code harder to reuse for other tests
Tips to Authenticate Data
» Record revision history automatically
  Show which records have been deleted, modified, and added
  Record the user name that made the change
» Don’t try to compete with Excel
  Allow data to be exported to Excel
» Create summary reports
» Avoid adding a lot of bells and whistles
  Makes it easier to reuse for other tests
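Automatic revision history amounts to writing an audit entry (who, when, old value, new value) on every edit. A minimal sketch, with invented record and user names:

```python
import datetime

# Minimal audit-trail sketch: every change records who made it, when, and the
# before/after values. Record fields and user names are illustrative only.
history = []

def update(record, field, new_value, user):
    history.append({
        "id": record["id"], "field": field,
        "old": record[field], "new": new_value,
        "user": user,
        "at": datetime.datetime.now(datetime.timezone.utc),
    })
    record[field] = new_value

rec = {"id": 7, "status": "flagged"}
update(rec, "status", "authenticated", user="analyst1")
print(history[0]["user"], history[0]["old"], "->", history[0]["new"])
```

Because the old value is captured before the write, deletions and modifications can be reconstructed or rolled back from the history alone.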
DARA Tool
» Data Authentication, Reporting, and Analysis tool
» Review and update data
» Import/export Excel spreadsheets
» Keeps revision history
» Shows the changes made to each record
» Uses a central repository for data
» Developed in ASP.NET and Java to fit most environments
» Generates summary reports
» Code is designed for flexibility so that new tests can be added quickly
DARA Tool
Revision History
Merged Records
Sample Sizes
Measures
Data Flow
Steps to Analyze
» Prep the data
  Change the data structure: merge data, flatten data, dimensional models
  Transform: change time zones, perform calculations, geocodes, categorize
  Optimize the database: aggregate data
» Integrate your favorite analysis tool
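Two of the prep steps above, normalizing time zones and aggregating, can be sketched with the standard library. The offsets and event values are made up for illustration:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Sketch of two common prep steps: normalize event times to UTC, then
# aggregate values by hour. Offsets and values are invented.
local = timezone(timedelta(hours=-5))  # e.g. US Eastern standard time
events = [
    (datetime(2024, 3, 1, 9, 15, tzinfo=local), 2.0),
    (datetime(2024, 3, 1, 9, 45, tzinfo=local), 4.0),
    (datetime(2024, 3, 1, 10, 5, tzinfo=local), 6.0),
]

totals = defaultdict(float)
for ts, value in events:
    utc_hour = ts.astimezone(timezone.utc).replace(minute=0, second=0)
    totals[utc_hour] += value

for hour, total in sorted(totals.items()):
    print(hour.isoformat(), total)
```

Converting everything to UTC before aggregating avoids the classic trap of merging data collected across sites in different local time zones.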
Tools to Analyze
» Open source
  R
  Saiku (Mondrian)
  Weka, RapidMiner, KNIME
» Commercial
  SAS, SPSS
  MATLAB
  STATISTICA
  Tableau
  Rattle
  Excel
  New automated analysis tools
» Custom
  SQL
  Python, Perl
Tips to Analyze
» Let analysts use their favorite tool
» Optimized structures like dimensional models are nice but take time
  Extra time to design and develop
  Requires a lot of testing to ensure data hasn’t been distorted
  May not save time
  May not handle changes to data easily
» OLAP is fun but probably only useful for very large datasets
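For a sense of what a dimensional model buys you, here is a toy star schema, one fact table plus one dimension, queried with a grouped join. Table and column names are invented:

```python
import sqlite3

# A toy star schema: a fact table of trial outcomes joined to a system
# dimension. All names and numbers are illustrative only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_system (system_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_trial (system_id INTEGER, hit INTEGER);
    INSERT INTO dim_system VALUES (1, 'Radar A'), (2, 'Radar B');
    INSERT INTO fact_trial VALUES (1, 1), (1, 0), (1, 1), (2, 1);
""")
rows = db.execute("""
    SELECT d.name, AVG(f.hit)
    FROM fact_trial f JOIN dim_system d USING (system_id)
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)
```

The query is simple once the schema exists, which is exactly the trade-off the tips describe: the modeling and validation effort is paid up front.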
R: output from R comparing two data sets
R: data graphs from ClearPeaks.com
OLAP: Saiku OLAP Tool (using Mondrian)
Steps to Report
» Determine the audience
» Determine the target media
» Determine the level of detail
» Create reports
  Canned reports
  Ad-hoc reports
  Custom reports
» Create visualizations
» Distribute
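A canned report is just a fixed layout filled from analysis results. A minimal text-report sketch; the measure names and values are invented:

```python
# Sketch of a simple canned report: a fixed text layout generated from
# whatever records the analysis stage produced. All content is illustrative.
results = [("Detection range (km)", 42.1), ("Probability of hit", 0.83)]

def build_report(title, rows):
    lines = [title, "=" * len(title)]
    lines += [f"{name:<25}{value:>8.2f}" for name, value in rows]
    return "\n".join(lines)

print(build_report("Operational Test Summary", results))
```

Keeping the layout in one small function makes the report easy to regenerate every time the underlying data is re-authenticated.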
Tools to Report
» Open source
  JasperReports
  Pentaho
» Commercial
  Business Objects
  Tableau
  Yellowfin
  Excel
» Custom development
Tips to Report
» Canned reports are useful, but don’t go nuts: reports change frequently.
» Ad-hoc reporting is useful for non-technical users, but analysts may prefer something else.
  Typically an add-on at extra cost.
» If you will only use most of the reports once, it may not be worth developing them.
JasperReports
Reports created in JasperReports
3000 WILSON BLVD SUITE 250
ARLINGTON, VA 22201
www.definitivelogic.com
TEL: 703.955.4186
FAX: 877.349.4031
Frank Thomason
703-472-8138