Agenda • Current state and background • Issues and problems • Approach • Next steps – issues to be addressed – issues deferred
1/26/11 2
MIT Data Warehouse - Overview
• Most administrative information is available
• Widely used and relied on
• Recognized by peer institutions as one of the best in higher education
• Improvements can make it more useful. Examples:
– adding more data
– cleaning up data
– real time data
– true 24x7 availability
Why the Warehouse was Created
• Primary reporting platform
– provide a common access mechanism for all Institute-wide administrative data
– easier access provided by a simpler view of data
• Support departmental reporting needs
– ability to combine central and local data
– reduce rekeying of data
• Improve data quality through usage
• Self service
– minimize support & operational costs
– DLCs don’t need to rely on a central reporting group
What Data is Available
• Academic Chairs
• Alumni Biographic, Occupation & Degree
• Association of Student Activities Group, Requests and Space
• Benefits
• Budgets (Nimbus)
• CIP (Classification of Instructional Programs)
• Concur Travel
• Coop Textbook
• Course Catalog
• Cybersource MIT Storefront transactions
• Department Master Hierarchy (and other hierarchies)
• EHS Contact, Room set, Sarah, Training Data
• Facilities/Plant Maintenance Orders
• Fidelity – Benefits
• Financial
• Financial Aid
• Fund data (Resource Development)
• Graduate Admissions
• Graduate Awards
• HR (Employees, Positions, & Appointments) Cyborg & SAP
• IAP Credit & Non-Credit Activities
• Instructor
• Institutional Research - Honors & Awards
• Kerberos
• Keys
• Libraries (Financials, Orders, Circulations, Collection Management, Serials Control Processing)
• Metadata (Table, view, & field definitions, data models; report templates)
• MIT Card – Pictures
• MIT ID
• Moira List data
What Data is Available (cont)
• OSP Proposals and Award Cost Sharing
• Overhead Rates
• Payroll (Deductions, Distributions & Certifications)
• Pension
• Person
• Pre-registration & Registration
• Procurement (Purchasing, Requisition & Credit Card)
• Request Tracker
• Roles/Authorization
• Space
• Student Biographic
• Student Degree
• Subject Enrollment
• Student Grades
• Training & Events
• TLO
• Telephone Detail (ice 9)
• Student Term Enrollment
• Undergrad Admissions
• Vacation/Absences/Leaves
• Web Statistical Data
Context Diagram
[Figure labels: Source Systems (DBMS, Files); Extract; Files; Transform & Load; Pull Metadata; Warehouse Environment; User Access (SQL, Web Reporting Tool, Brio); User]
Issues and Problem Collection
• Who
– Data Warehouse Community Survey
– Task Force Reports
– Reporting Tool Review Committee
– DW Staff
– Interview data users • Institutional, School, Department • Central, Research, Academic
• What
– Document • specific issues • brief analysis
– Identify next steps
Categories of Data Issues
• Warehouse infrastructure & operations
• Reporting environment
• Missing data
• Data access policy
• Strategic reporting (hard to answer questions)
Warehouse Infrastructure & Operations
Issues identified by the Warehouse Staff
• Job scheduling
– start time manually set
– extra time used to pad
– hard to maintain
• 7x24 availability
• Congested nightly loads
– scheduling
– full/incremental
– a few intensive jobs
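To make the job scheduling issue concrete: when each start time is set by hand, extra time has to be padded in so that a job never begins before its inputs are ready. One common alternative, shown here only as an illustrative sketch and not as the Warehouse's actual or planned design, is to declare which jobs depend on which and derive the run order from that. The job names and dependencies below are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical nightly jobs: each job maps to the set of jobs it must wait for.
JOB_DEPENDENCIES = {
    "extract_hr": set(),
    "extract_finance": set(),
    "load_person": {"extract_hr"},
    "load_ledger": {"extract_finance"},
    "build_payroll_view": {"load_person", "load_ledger"},
}


def run_order(dependencies):
    """Return one valid execution order in which every job follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, job in enumerate(run_order(JOB_DEPENDENCIES), start=1):
        print(position, job)
```

With an ordering derived this way, a job can start as soon as its prerequisites finish, so no padding between fixed start times is needed.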
Warehouse Infrastructure & Operations (cont)
• Performance of large detailed tables as history grows
• Managing the daily operation
– reduce the number of non-standard processes
• Incomplete documentation
• Lack of real time data feeds
• Lack of a process to build and deliver higher level information
Reporting Environment Issues
• Data definitions need improvement
• Need intuitive table and field names
• Report descriptions aren't sufficient
– hard to find the one you need
– report names should be in the footer, so once you have paper you can find the report used to produce it
– organize reports better
– easy to identify reports that need to be run by DLC monthly
• Easier way to get the data from the Warehouse and format in Excel
• Easy way to join data from multiple domains
Data Missing from Warehouse
• OSP
– Pre-Proposal
– Award Budgets
• Property
• TLO – Basic
• Alumni
• ISO information
• Travel charges that haven't been submitted on a trip report
• Travel Guest Card charges not available for DLCs
• Historical Request Tracker Information
• Profit Center Group History
• Building Data – Date Built and Date Renovated
• Space Survey Data
• Building Depreciation
Data Access Policy Issues
• Room usage type
• Appointment supplements
• HR information for people affiliated with labs & centers
– paid but not appointed
– faculty not charged or appointed
Strategic Reporting (Hard to answer questions)
• What does it cost MIT to deliver its programs?
– Undergraduate Education, Graduate Education and Research
– How do costs vary by school and program?
– How do education costs relate to tuition charged (sticker price) and tuition collected (net tuition revenue)?
– How do research costs relate to research revenues collected (direct and indirect)?
• What does it cost MIT to own and maintain its space?
– capital (interest, depreciation)
– operational (utilities, repair and maintenance costs)
– How do costs vary by building and relate to market rates?
Strategic Reporting (Hard to answer questions)
• How much money has MIT received from all sources (gifts, research revenues, other) in support of major areas such as Cancer Research or Energy compared to total resources used and needed?
• What are MIT’s gross expenses by function before internal charges?
State of Data Practices
• Current software project development does not always include
– reporting requirements
– conceptual data model
– data definitions
• Projects often focus on a transactional orientation
– not what ways the information might be used in the future
Reporting Committee Findings
• IBM/COGNOS Tool Suite meets user requirements for functionality and look & feel.
– Handling of local data
– Web-based solution
• Data issues will not be resolved by a tool
• Rollout of the tool will be a multiyear effort
• Data issues should be addressed as the tool is rolled out
• Requires the involvement of the Business Owners and Community
Approach
• Demonstrate to the community that we are working on all aspects of the problem
• Create a repeatable process that can be applied per subject
• Gain experience with the new tool
• Begin engaging Business Owners
• Partner with Development Projects
• Make demonstrable incremental progress
Areas of Focus
• Improving Reporting Infrastructure
• Improving Development Practices
• Improving Data Practices
Categories of Work
• Day to day operations
– upgrading the environment
– monitoring the loads
– supporting minor changes
• Housekeeping
– Metadata maintenance
– Warehouse documentation
• New reporting tool administration
• Building reports
• Strategic questions
• Modeling
• New information in Warehouse
Resources
• Team
– 2 Data Administrators
– 2 Developers
– 2 Reporting Consultants
• Project Teams
• Business Owners
Next Steps 2011
• Improving Reporting Infrastructure
– migration to VM
– develop plan for 24x7
– reporting tool proof of concept
• Improving Development Practices
– initial data modeling support
• Improving Data Practices
– document problems & issues
– improve metadata documentation
Issues to be Addressed
• Tested in reporting tool proof of concept
– better Excel integration
– easier report delivery • automatic notification and delivery
– better report descriptions
– better report organization
– simpler views of information
– simpler joining of information
• Short term solution for Singapore access
Work Deferred (Post FY2011)
• Improving Reporting Infrastructure
– real time data
– 24x7 implementation
– job scheduling
• Improving Development Practices
– to be determined
• Improving Data Practices
– strategic (hard to answer questions)
– define and publish guidelines
– define and publish roles & responsibilities
– document data access policies
• New data in Warehouse
– should be done with new tool
Improving Understanding of Data
• Simpler, more focused views of information
• Better definitions and documentation
• Reporting tool proof of concept
– including process for developing specialized views
Improving IS&T Practices
• Focus with the business on up front design
– conceptual data model
– business process models
– reporting requirements
• Metadata documentation
• Document reconciliation processes
Improving Data Practices
• Publish principles & guidelines for data management
• Establish processes for data issue resolution and prioritization
• Gain consensus and document roles & responsibilities concerning data
• Develop a method to monitor progress
Improving Warehouse Environment
[Timeline figure, with a marker at July 2011. Labels shown: Improving the Understanding of Data; Improving IS&T Practices; Establishing Modeling Conventions; Improving Meta Data; Improving Data Practices; Defining Roles & Responsibilities; Data Documentation per Area; Reporting Tool Pilot; Space Example; Define Focused Views per Area; Rollout New Tool per Area; Migrate to Linux/VM; Job Scheduling; Partitioning; 7x24]
Appendix
• Warehouse background
– design principles
– Warehouse processing
– statistics
• Example of Addressing Issues in a single domain
– issues
– short term plan
– long term plan
• Issues found so far grouped by subject
Data Warehouse Guiding Principles
• Information in a single place
– access any information in the same manner
– integrated information • join information from several sources
• Structures should support the needs of all types of users
– Central, DLC, Institutional
• Open access via SQL
– not proprietary only access
– uncommon in education and industry
– can be accessed programmatically
• Warehouse is the data distribution hub
– easy to verify when feeds and reports are from a single source
Data Warehouse Guiding Principles (CONT)
• Data should be stable
– consistency between reports generated on the same day
– ability to generate the same report at any point in time
• Data has to be accurate
– stale data is better than inaccurate data
• Structures should make it easy to report
• Data is not corrected or modified in the Warehouse
• Access rules are shared by the Warehouse and transactional systems
– access managed by Business Owners
Data Warehouse Guiding Principles (CONT)
• Data always comes from a system of record
• Users should be able to generate and share reports
• Metadata is available for all information
• Access control done at the database level
– enables direct access via any means
• Metadata driven transformations and loads
– to know how data was arrived at
– same code called thousands of times every night
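As an illustration of what "metadata driven transformations" can look like, the sketch below keeps the per-field rules in a table and lets one generic routine apply them to every record. This is a minimal sketch under assumptions, not the Warehouse's actual code: the function library, the rule layout, the field names, and the sample record are all hypothetical.

```python
from datetime import datetime

# Hypothetical library of reusable transformation functions, looked up by name.
FUNCTIONS = {
    "trim": lambda value: value.strip(),
    "upper": lambda value: value.strip().upper(),
    "to_int": lambda value: int(value),
    "to_date": lambda value: datetime.strptime(value, "%m/%d/%Y").date().isoformat(),
}

# Hypothetical metadata: one rule per target column, naming the source field
# and the function to apply. Reading this table shows how each value was derived.
FIELD_METADATA = [
    {"source": "bldg", "target": "building_key", "function": "upper"},
    {"source": "sqft", "target": "square_feet", "function": "to_int"},
    {"source": "built", "target": "date_built", "function": "to_date"},
]


def transform_record(record, field_metadata=FIELD_METADATA):
    """Apply every metadata rule to one source record and return the target row."""
    row = {}
    for rule in field_metadata:
        function = FUNCTIONS[rule["function"]]
        row[rule["target"]] = function(record[rule["source"]])
    return row


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(transform_record({"bldg": " b1 ", "sqft": "42000", "built": "06/01/1965"}))
```

Because the rules live in data rather than in code, adding a field means adding a metadata row, and the same small set of functions is reused across every load.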
Data Warehouse Process Stages: Metadata Driven
• Extract
– minimal logic
– full or incremental files generated
• Transfer
– usually pushed automatically
– encrypted/decrypted
– simple integrity checking done
• Convert
– data into reporting format and structures • star schemas
• Load
Integrity Checking in all Processing
• Correct files on hand before job runs
• Record & byte counts
• Comparisons of control file to data file
• Conversion type checking, number and types of fields
• DBMS constraints
• Error checking and validation routines
• Mail sent to Warehouse & Business Owners
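The record and byte count checks listed above can be sketched as a comparison between a data file and a small control file that states what the sender expected to deliver. This is a hedged illustration only: the `key=value` control file layout, the file names, and the pipe-delimited sample records are assumptions, not the Warehouse's real formats.

```python
import os
import tempfile


def parse_control_file(path):
    """Read 'key=value' lines such as 'records=3' into a dict of integers."""
    expected = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                key, _, value = line.partition("=")
                expected[key.strip()] = int(value)
    return expected


def check_data_file(data_path, control_path):
    """Return a list of mismatch messages; an empty list means the file passed."""
    expected = parse_control_file(control_path)
    actual_bytes = os.path.getsize(data_path)
    with open(data_path, "rb") as handle:
        actual_records = sum(1 for _ in handle)  # one record per line
    problems = []
    if actual_records != expected.get("records"):
        problems.append(f"record count {actual_records} != {expected.get('records')}")
    if actual_bytes != expected.get("bytes"):
        problems.append(f"byte count {actual_bytes} != {expected.get('bytes')}")
    return problems


def demo():
    """Check a complete file, then the same file with its last record cut off."""
    payload = b"1001|Smith\n1002|Jones\n1003|Lee\n"  # made-up sample records
    with tempfile.TemporaryDirectory() as workdir:
        data_path = os.path.join(workdir, "feed.dat")
        control_path = os.path.join(workdir, "feed.ctl")
        with open(control_path, "w", encoding="utf-8") as handle:
            handle.write(f"records=3\nbytes={len(payload)}\n")
        with open(data_path, "wb") as handle:
            handle.write(payload)
        complete = check_data_file(data_path, control_path)
        with open(data_path, "wb") as handle:
            handle.write(payload[:-9])  # drop the final 9-byte record
        truncated = check_data_file(data_path, control_path)
    return complete, truncated


if __name__ == "__main__":
    print(demo())
```

A job would run a check like this before loading, and a non-empty result would stop the load and trigger the notification mail mentioned in the last bullet.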
System Availability
• Approaches 24 x 7
– except Saturdays from 3:00 pm to 3:10 pm • snapshot
– some tables are temporarily unavailable in the early morning as new day’s data is loaded
Statistics
Jobs per night: ~300
Input files per night: 875
Field transformations: 12,847
Distinct functions: 78
Tables: 1,522
Views: 1,150
Records: 1,904,948,496
Total database size: 575 GB
Statistics
Fiscal Year 2011 Quarter 2 (October 1, 2010 – December 31, 2010)
• 664 distinct users logged into the Data Warehouse
• Representing 117 org units
• Generating 782,177 sessions
Statistics
[Chart: DW User Community: Sessions by Department (400+ Sessions), FY 2011 Q2]
Session counts shown (largest to smallest): 20428, 10490, 5413, 3248, 3167, 2201, 2092, 1752, 1496, 1299, 1151, 965, 902, 840, 808, 753, 730, 615, 547, 492, 483, 451, 441
Department labels shown:
• Environment, Health & Safety Office
• HQ, Information Services & Technology
• Others (research affiliates and scientists, postdoctoral fellows, visiting faculty/students/scholars, non-current employees, etc.)
• Center for International Studies
• Human Resources
• Research Laboratory of Electronics
• Office of Sponsored Programs
• Sloan School of Management
• Information Services & Technology
• Microsystems Technology Laboratories
• GL Ops & Reporting & Journal Vouchers
• Budget, Finance and Treasury
• Sponsored & Genl A/R & Cashier
Data Issues Example (Space)
• Buildings need to be mapped to their cost collectors manually
• Building costs are collected via secondary postings
– many of these are done manually, requiring significant effort
• Utility usage metadata not available in the Warehouse
• Flaws in the accuracy due to the fact that some utility meters cover more than one building
• The older organization unit identifiers (6 digit) make it hard to relate space information to other domains
• All systems do not use a standard format when referencing location information, making integration difficult
• Building built and last renovated dates are not shown in the Warehouse
• Access to space information is not consistent or documented
• Building depreciation information not in the Warehouse
Space Data Short Term Plan
• Use new HR org unit numbers in the feeds and views of the Warehouse
• Add the cost collector that relates to the building in the building feed
– include the cost collector key in the Warehouse view
• Add new fields to the building feeds and views for date built and date renovated
• Add geocode fields to the room feed and views
• Update the Warehouse metadata
– add information about utility usage
Space Long Term Plan
• Work with Facilities and others to agree on guidelines for integrating location information
– building and room identifiers
– geocoding • guidelines for tracking geocode for things like equipment
• Work to gain consensus and document access policies concerning space information
• Ensure that space survey data is in the Warehouse
• Encourage reporting of space information from the Warehouse where appropriate
– possible pilot web reporting
• Bring building depreciation information into the Warehouse
– create an integrated view of all space costs in the Warehouse
HR Data Issues
• Faculty promotion and Tenure dates have to be cleaned up by hand
– SAP & business policy does not support correction of some information
• CIP (Academic Discipline) attached to All Academic Staff
• Race & Ethnicity not consistent over time
– self-identified ethnicity needs to be managed and synchronized with Institutionally identified
• Historical data is inaccurate because old records are changed without the modified date and effective dates properly maintained
• Many places where more than one affiliation occurs but always need one marked "Primary" to support non-duplicated counts
• HR data not accurate.
– How many researchers have a PhD?
• Many things could be calculated in a standard way to make it easier and reduce discrepancies.
– For example: years at MIT could be provided in addition to the appointment dates.
• There seem to be different tallies for the Annual Head Count Report (October 31st), which cause confusion.
• Inaccuracies in Faculty appointment dates
• Inaccuracies in Faculty Leave information
• Ethnicity & Race not maintained accurately
• How many FTE by FY for a DLC for the last 10 years
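The "years at MIT" example above is the kind of value that drifts between reports when each author computes it differently (for instance, subtracting calendar years versus counting completed years). A single shared definition might look like the sketch below. The rule chosen here, whole completed years as of a reference date, is an assumption made for illustration and is not a definition taken from the Warehouse.

```python
from datetime import date


def years_of_service(start, as_of):
    """Return the number of whole years completed between start and as_of."""
    if as_of < start:
        raise ValueError("as_of date is earlier than the start date")
    years = as_of.year - start.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (as_of.month, as_of.day) < (start.month, start.day):
        years -= 1
    return years


if __name__ == "__main__":
    # Hypothetical appointment date, evaluated as of the date of this presentation.
    print(years_of_service(date(2000, 9, 1), date(2011, 1, 26)))
```

Publishing one such function (or an equivalent derived column in a view) alongside the appointment dates would let every report show the same figure.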
Finance Data Issues
• Hard to maintain CEMIT Groups
– can't tell if they're used • so hesitant to delete them
– need to verify that they are all correct each month • maybe an exclusion list would help (to see if the right gl_accounts are excluded and only those) • mistakes lead to reporting errors
• Difficult to separate out internal charging and transfers
– some specific GL_Accounts for internal that can be excluded for this view
– could use SI transactions instead of SA, but no way to ensure people do that correctly
– no way to enforce integrity rule in SAP
• Many Document Types not used, but not filtered out of list
• Sponsor Billing done at Parent WBS level, so children look in deficit
– no common definition of what "deficit" accounts are
– billing lag means DLCs and Central might have different views and definitions
• GL_Account_Report is confusing.
• Travel charges that haven't been submitted on a trip report can't be reported on
• Guest Card charges not available for DLCs
• Hard to see if changes in a Person's status from ON/OFF Campus have been made before the postings
• Need Profit Center Group History so that we can recreate old reports and understand the differences in numbers
Student System Issues
• What is the Cost of Education?
– By Department and Program • per Student
– How much Tuition is Charged? • What is waived?
– How much of Tuition is paid? • by the Student • by a Sponsor
• Don't have accurate info on Grads.
– Where did they get their undergrad degree?
– When did a student enter, and in which degree program?
– They have Grad entry term, but not cohort.
– Are they fully supported (external sources hard)?