37
© Hitachi, Ltd. 2017. All rights reserved. Hitachi, Ltd. OSS Solution Center 5/16/2017 Yusuke Furuyama Shogo Kinoshita Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

  • Upload
    others

  • View
    19

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

© Hitachi, Ltd. 2017. All rights reserved.

Hitachi, Ltd. OSS Solution Center

5/16/2017

Yusuke Furuyama Shogo Kinoshita

Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

Page 2: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

1 © Hitachi, Ltd. 2017. All rights reserved.

Who are we?

• Solutions engineer at Hitachi, Ltd. • Offering and co-creating progressive

Hadoop solutions to customers who are going to build enterprise system.

Yusuke Furuyama Shogo Kinoshita • Solutions Engineer at Hitachi, Ltd. • Focusing on Hadoop eco-system

(including Spark, Hive, Impala) and write web-articles, make presentations about evaluation of Hadoop-related OSS.

Page 3: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

© Hitachi, Ltd. 2017. All rights reserved.

1. Leveraging smart meter data[Sample use case for electric utilities]

2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)

3. Additional evaluation with Spark 2.1

Contents

2

4. Summary

Page 4: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

3 © Hitachi, Ltd. 2017. All rights reserved.

1. Leveraging smart meter data [Sample use case for electric utilities]

Page 5: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

4 © Hitachi, Ltd. 2017. All rights reserved.

1-1 Hitachi Corporate Profile

9,162.2 billion yen

587.3 billion yen

303,887

February 1, 1920

458.7 billion yen

Revenues

Operating Income

Number of Employees

Established

Capital

(as of end of Mar. 2017)

(as of end of Mar. 2017)

(FY2016 Consolidated)

(FY2016 Consolidated)

Hitachi, Ltd.

President & CEO

Toshiaki Higashihara

Page 6: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

5 © Hitachi, Ltd. 2017. All rights reserved.

1-2 Revenue by Segments FY2016

Automotive Systems

Other (Logistics & Other services)

Smart Life & Ecofriendly Systems

Information & Telecommunication Systems

Social Infrastructure & Industrial Systems

Construction Machinery

Financial Services

High Functional Materials & Components

*Figures are on a consolidated basis

Total 9,162.2 billion yen

FY 2016 23%

12% 7% 14%

10%

6% 6% 2% 20%

Electronics Systems & Equipment

Page 7: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

6 © Hitachi, Ltd. 2017. All rights reserved.

1-3 Hitachi’s Social innovation business approach

Page 8: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

7 © Hitachi, Ltd. 2017. All rights reserved.

1-4 Situation of electric utilities in Japan and their needs

• Electric utilities in Japan have to adapt to competitive free market • Request for price cut of power transmission fee from government of Japan

Liberalization of the retail Electric Power Market

Needs to cut the cost for transmission and distribution equipment • Transmission and distribution equipment have been replaced periodically • Decide the timing of replacement by the condition of equipment

Maintenance team needs: Obtain the load status of each equipment

Electricity rate

Power plant Transmission

Lines Substation Distribution

Lines

Home

Transmission and Distribution Companies

Power transmission fee Electric Power

Generation Companies

Electricity retailers

In Japan, the retail sale of electric power was

fully liberalized in April 2016

Page 9: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

8 © Hitachi, Ltd. 2017. All rights reserved.

1-5 Situation of electric utilities in Japan and their needs (future)

• Decreasing nuclear plant as a stable power supplier • Increasing renewable energy supply

Unstable power supply

Needs for high level Demand Response • Rates by time zone (current demand response) • Many and small renewable energy suppliers • Near real-time demand response for each distribution system

Planning team needs: Obtain near real-time load status of each equipment

Electricity rate

Power plant Transmission

Lines Substation Distribution

Lines

Home

Transmission and Distribution Companies

Power transmission fee Electric Power

Generation Companies

Electricity retailers

Page 10: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

9 © Hitachi, Ltd. 2017. All rights reserved.

1-6 Leveraging big data for electric utilities

Power Grid Transformer (500/Distribution Line)

Switch (20/Distribution Line)

Meter Data Management System Data from smart meters(every 30min.)

Analyze the data from smart meters to grasp the load status of equipment

Meet the needs of electric utilities

Substation (1,000)

Distribution Lines (5/Substation)

Smart Meter (4/Trans)

・・・ 0000

0000

0000

0000

0000

0000

0000

0000

0000

0000

0000

0000

Data Analysis System

Total 10,000,000

Concern about performance needed for near real-time processing

Page 11: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

10 © Hitachi, Ltd. 2017. All rights reserved.

Data processing platform (Hadoop, Spark)

Data Analysis System

Target of this session

Planner (Equipment/Demand

response)

Meter Data Management System

1-7 System Component

Raw Data Processed Data

Data visualization tool (Pentaho)

Data from smart meters (every 30min.)

Page 12: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

11 © Hitachi, Ltd. 2017. All rights reserved.

2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)

Page 13: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

12 © Hitachi, Ltd. 2017. All rights reserved.

2-1 Use Case for electric utilities

Points of analysis Needs Point of analysis

Find the equipment that needs to be replaced Find the equipment that has heavy workload

Estimate the timing for replacement Check the trend of load status

Select the proper capacity of new equipment Extract the peak of the load

Needs of electric utilities (recap)

Needs ・Analyze the data from smart meters to grasp the load status of equipment ・Near real-time

Page 14: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

13 © Hitachi, Ltd. 2017. All rights reserved.

2-2 Contents of performance evaluation

Point of evaluation for near real-time processing

Check if MapReduce and Spark can process the data in 30min.

Point of analysis Aggregate per

Find the equipment has heavy workload

Equipment

Check the trend of load status Term

Extract the peak of the load Time zone

Items for evaluation

・Distribution system ・Substation ・Switch ・Transformer ・Meter

・1day ・1month(30days) ・1year(365days)

・Specific 30min of each day ・24h

• Concern about performance for processing 10,000,000 meters

• Data comes from each smart meter every 30min (48/Day, spec of smart meter)

Page 15: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

14 © Hitachi, Ltd. 2017. All rights reserved.

2-3 Target of performance evaluation

Target of evaluation

Data Processing Platform (MapReduce, Spark 1.6)

Data Analysis System

End

Data visualization tool(Pentaho)

Meter Data Management System

Target

Aggregation batch (Hive / Spark SQL)

Start

Time from start of aggregation batch for meter data to end of the batch

Page 16: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

15 © Hitachi, Ltd. 2017. All rights reserved.

2-4 Evaluation environment

System Configuration Spec

Master Node

CPU Core 2

Memory 8 GB

Capacity of disk 80 GB

# of disk 1

Per slave node total

CPU Core 16 64

Memory 128 GB 512 GB

Capacity of disk 900 GB -

# of disk 6 24

Total capacity of disks

5.4 TB (5,400 GB)

21.6 TB (21,600 GB)

10Gbps LAN

10Gbps SW

1Gbps LAN

disk disk ・・・ disk

4 Slave Nodes (Physical Machines)

1 Master Node (Virtual Machine)

Page 17: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

16 © Hitachi, Ltd. 2017. All rights reserved.

2-5 Dataset

Smart meter data

Term # of records Size (CSV) Size (ORCFile)

365days (1year) 3,650 million 2.475 TB 1.325 TB

30days (1month) 300 million 0.205 TB 0.158 TB

1day 10 million 0.007 TB 0.005 TB

0:00-0:30 power usage 0:30-1:00 power usage ・・・ 23:30-0:00 power usage Meter mgmt. info meter1

0:00-0:30 power usage 0:30-1:00 power usage ・・・ 23:30-0:00 power usage Meter mgmt. info meter2

・・・

0:00-0:30 power usage 0:30-1:00 power usage ・・・ 23:30-0:00 power usage Meter mgmt. info meter

10,000,000

Smart meter data /day

Data size

48 columns (every 30min)

10,000,000 records/day

Page 18: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

17 © Hitachi, Ltd. 2017. All rights reserved.

2-6 Contents of performance evaluation (recap)

Items for evaluation

Target of evaluation

Point of analysis Aggregate per

Find the equipment has heavy workload

Equipment

Check the trend of load status Term

Extract the peak of the load Time zone

・Distribution system ・Substation ・Switch ・Transformer ・Meter

・1day ・1month(30days) ・1year(365days)

・Specific 30min of each day ・24h

Point of evaluation

Time from start of aggregation batch for meter data to end of the batch

Check if MapReduce and Spark can process the data in 30min.

Page 19: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

18 © Hitachi, Ltd. 2017. All rights reserved.

Time from start of aggregation batch for meter data to end of the batch

2-6 Contents of performance evaluation (recap)

Point of evaluation

Check if MapReduce and Spark can process the data in 30min.

Items for evaluation Point of analysis Aggregate per

Find the equipment has heavy workload

Equipment

Check the trend of load status Term

Extract the peak of the load Time zone

・Distribution system ・Substation ・Switch ・Transformer ・Meter

・1day ・1month(30days) ・1year(365days)

・Specific 30min of each day ・24h

+ File type ・Text (CSV) ・ORCFile (Column-based)

Target of evaluation

Page 20: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

19 © Hitachi, Ltd. 2017. All rights reserved.

2-7 Comparison of txt with ORCFile (MapReduce)

• Couldn’t finish processing in 30min • Performance improvement by ORCFile

62 sec

224 sec

3286 sec

99 sec

338 sec

6962 sec

0 sec 2000 sec 4000 sec 6000 sec

1day

30days

365days

Hive + TXT

Hive + ORCFile

31 sec

69 sec

492 sec

32 sec

138 sec

3015 sec

0 sec 2000 sec 4000 sec

1day

30days

365days

Hive + TXT

Hive + ORCFile

30min 30min

Result (24h) Result (0.5h)

Aggregate meter data of entire Distribution System

Page 21: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

20 © Hitachi, Ltd. 2017. All rights reserved.

2-8 Comparison of txt with ORCFile (Spark 1.6)

28 sec

50 sec

993 sec

47 sec

154 sec

1408 sec

0 sec 400 sec 800 sec 1200 sec 1600 sec

1day

30days

365days

Spark SQL + TXT

Spark SQL + ORCFile

• Could finish processing in 30min (1,800s) • Performance improvement by ORCFile

26 sec

32 sec

169 sec

30 sec

111 sec

1263 sec

0 sec 400 sec 800 sec 1200 sec 1600 sec

1day

30days

365days

Spark SQL + TXT

Spark SQL + ORCFile

Result (24h) Result (0.5h)

Aggregate meter data of entire Distribution System

Page 22: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

21 © Hitachi, Ltd. 2017. All rights reserved.

2-9 Review of the results

Why the processing was fast with ORCFile

Processing 0.5h data

・・・

・・・

Smart meter data

・・・

0:00 23:30

0:00

0:30

0:00

0:00

0:00

0:00

0:00

0:00

0:00

Use 1 column only

Processing 24h data

・・・

・・・

・・・ ・・・

・・・

0:30 23:30 SUM/day

Use all 48 columns

• Processing big data with ORCFile was more effective

than processing small data

Results

• Processing specific 0.5h data with ORCFile was more

effective than processing 24h data

Compression

Reading specific column

Features of ORCFile

+ +

+ +

+ + +

+ +

SUM/day

SUM/day

SUM/day

SUM/day

Smart meter data

0:00

0:30

0:30

0:30

0:30

23:30

23:30

23:30

23:30

0:00

0:00

0:00

0:00

0:00

0:30

0:30

0:30

0:30

23:30

23:30

23:30

23:30

Page 23: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

22 © Hitachi, Ltd. 2017. All rights reserved.

32 sec

104 sec

69 sec

556 sec

0 sec 500 sec 1000 sec

Distributionsystem

Perequipment

Hive

Spark SQL

2-10 Comparison of MapReduce with Spark 1.6 (ORCFile)

50 sec

145 sec

224 sec

718 sec

0 sec 500 sec 1000 sec

Distributionsystem

Perequipment

Hive

Spark SQL

993 sec

1359 sec

3286 sec

4131 sec

0 sec 1000 sec 2000 sec 3000 sec 4000 sec

Distributionsystem

Perequipment

Hive

Spark SQL

・Could finish processing the data from 10,000,000 meter in 30min using Spark 1.6! ・Spark’s good performance with “per equipment” processing

Result (30days / 24h) Result (30days / 0.5h)

Result (365days / 24h)

30min

80% down

78% down

81% down

54% down

(JOIN) (JOIN)

(JOIN)

(NO JOIN) (NO JOIN)

(NO JOIN)

Page 24: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

23 © Hitachi, Ltd. 2017. All rights reserved.

Trans

2,500,000

Switch

100,000

2-11 Review of the results

Why the processing per equipment was more effective than the processing for entire distribution system when using spark?

Per equipment ・・・

・・・

・・・

・・・

・・・

・・・

・・・

・・・

・・・

・・・

0:00

0:00

0:00

0:00

0:00

0:30

0:30

0:30

0:30

0:30

23:30

23:30

23:30

23:30

23:30

Trans1

Trans2

Trans3

Switch1

Switch2 Line1 Substa.1

1,000 5,000

・・・

・・・

・・・

・・・

・・・

0:00

0:00

0:00 0:30

23:30

23:30

23:30

・・・ 0:00

0:00

0:30

0:30

23:30

23:30

For entire distribution system

JOIN

・Less disk I/O than MapReduce ・Smaller data (including re-distributing data) than total memory of cluster

Smart meter data Transformer Switch Distribution Line

Substation

Distribution System

Distribution System

JOIN JOIN JOIN

JOIN

Smart meter data

0:30

0:30

Page 25: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

24 © Hitachi, Ltd. 2017. All rights reserved.

3. Additional evaluation with Spark 2.1

Page 26: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

25 © Hitachi, Ltd. 2017. All rights reserved.

3-1 Evaluation environment

System Configuration for additional evaluation

Spec

Master Node

CPU Core 2

Memory 8 GB

Capacity of disk 80 GB

# of disk 1

Per slave node total

CPU Core 16 64

Memory 128 GB 512 GB

Capacity of disk 900 GB -

# of disk 6 24

Total capacity of disks

5.4 TB (5,400 GB)

21.6 TB (21,600 GB)

10Gbps LAN

10Gbps SW

disk disk ・・・ disk

4 Slave Nodes (Physical Machines)

1 Master Node (Virtual Machine)

1Gbps LAN

Page 27: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

26 © Hitachi, Ltd. 2017. All rights reserved.

76 sec

97 sec

0 sec 20 sec 40 sec 60 sec 80 sec 100 sec 120 sec

PerEquipment

(JOIN)

Spark1.6 + ORCSpark2.1 + ORC

114 sec

147 sec

0 sec 50 sec 100 sec 150 sec 200 sec

PerEquipment

(JOIN)

Spark1.6 + ORCSpark2.1 + ORC

571 sec

779 sec

0 sec 200 sec 400 sec 600 sec 800 sec 1000 sec

PerEquipment

(JOIN)

Spark1.6 + ORCSpark2.1 + ORC

3-2 Comparison of Spark 2.1 with Spark 1.6(ORCFile)

・Performance improvement 22-27% (including disk I/O) ・More effective with large data

Result (365days / 0.5h) Result (30days / 0.5h)

Result (1day / 0.5h)

Time from start of aggregation batch for meter data to end of the batch

27% down

23% down

22% down

Page 28: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

27 © Hitachi, Ltd. 2017. All rights reserved.

82 sec

126 sec

0 sec 50 sec 100 sec 150 sec

PerEquipment

(JOIN)

Spark1.6 + ParquetSpark2.1 + Parquet

76 sec

105 sec

0 sec 20 sec 40 sec 60 sec 80 sec 100 sec 120 sec

PerEquipment

(JOIN)

Spark1.6 + ParquetSpark2.1 + Parquet

227 sec

401 sec

0 sec 100 sec 200 sec 300 sec 400 sec 500 sec

PerEquipment

(JOIN)

Spark1.6 + ParquetSpark2.1 + Parquet

3-3 Comparison of Spark 2.1 with Spark 1.6(Parquet)

・Performance improvement 28-43% (including disk I/O) ・More effective with large data ・Better improvement than ORCFile

Result (365days / 0.5h) Result (30days / 0.5h)

Result (1day / 0.5h)

Time from start of aggregation batch for meter data to end of the batch

43% down

35% down

28% down

Page 29: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

28 © Hitachi, Ltd. 2017. All rights reserved.

76 sec

79 sec

82 sec

161 sec

227 sec

1132 sec

76 sec

79 sec

114 sec

176 sec

571 sec

1232 sec

0 sec 500 sec 1000 sec 1500 sec

1day×0.5h

1day / 24h

30days / 0.5h

30days / 24h

365days / 0.5h

365days / 24h

Spark2.1 + ORC

Spark2.1 + Parquet105 sec

110 sec

126 sec

202 sec

401 sec

1342 sec

97 sec

97 sec

147 sec

201 sec

779 sec

1459 sec

0 sec 500 sec 1000 sec 1500 sec

1day×0.5h

1day / 24h

30days / 0.5h

30days / 24h

365days / 0.5h

365days / 24h

Spark1.6 + ORC

Spark1.6 + Parquet

3-4 Comparison of ORCFile with Parquet (Spark 1.6/2.1)

Result (Spark 1.6 / Per equipment (JOIN)) Result (Spark 2.1 / Per equipment (JOIN))

Time from start of aggregation batch for meter data to end of the batch

・Basically, performance of Parquet is better than ORCFile ・Performance of Parquet with small data is worse than ORCFile in some cases.

Term Size (Parquet) Size (ORCFile)

365days (1year) 1.363 TB 1.328 TB

30days (1month) 0.112 TB 0.109 TB

1day 3.7 GB 3.6 GB

Page 30: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

29 © Hitachi, Ltd. 2017. All rights reserved.

Demo

Page 31: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

30 © Hitachi, Ltd. 2017. All rights reserved.

Data visualization tool (Pentaho)

Aggregated Data (29 days)

Demo: Data aggregation and visualization

② Execute aggregation batch (Spark SQL)

① Show 29 days data

Aggregated Data (30 days)

③ Show 30 days data

Raw Data (1day)

④ Outlier detected!!

Page 32: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

31 © Hitachi, Ltd. 2017. All rights reserved.

4. Summary

Page 33: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

32 © Hitachi, Ltd. 2017. All rights reserved.

4 Summary

Leveraging data from 10,000,000 smart meters for electric utilities in Japan - Built data analysis system - Concern about performance

Evaluate the performance of batch processing - Spark could process the data from 10,000,000 meters in 30min (4 slave nodes)

Evaluate the performance of Spark 2.1 - Performance improvement 22-27% (compared to 1.6, ORCFile) - Performance improvement 28-43% (compared to 1.6, Parquet)

Data Processing Platform (MapReduce, Spark)

Data Analysis System

Data visualization tool(Pentaho)

Aggregation batch (Hive / Spark SQL)

Meter Data Management System

Page 34: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

© Hitachi, Ltd. 2017. All rights reserved.

Hitachi, Ltd. OSS Solution Center

Comparison of Spark SQL with Hive

Leveraging smart meter data for electric utilities:

5/16/2017

Yusuke Furuyama Shogo Kinoshita

END

33

Page 35: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

34 © Hitachi, Ltd. 2017. All rights reserved.

Trademarks

Hadoop, Spark and Hive are trademarks or registered trademarks of Apache Software Foundation in the United States and other countries.

Other brand names and product names used in this material are trademarks, registered trademarks or trade names of their respective holders.

Page 36: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric
Page 37: Leveraging smart meter data for electric utilities · Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive ... 1-6 Leveraging big data for electric

36 © Hitachi, Ltd. 2017. All rights reserved.

Appendix. Difficulty with large data to be shuffled

Attempted to aggregate raw (48 columns) meter data per equipment - Extremely slow (Spark 2.0) or Job failed (Spark 1.6) - Processing: Iteration of JOIN and GROUP BY+SUM - Huge data to be shuffled (spilled out from page cache)

76 sec 223 sec

13281 sec

85 sec 236 sec

3650 sec

0 sec

2000 sec

4000 sec

6000 sec

8000 sec

10000 sec

12000 sec

14000 sec

1day 30days 365days

Pro

cess

ing t

ime

Target data

Before

adding

After

adding

Add HDFS disks as disks for shuffle - Performance Improved (365days) - Performance degraded (1day/30days)

・Data for Spark (including temporary data) should be smaller than memory.

・Had better to process as a trial to estimate

Adding disks for shuffle (Spark 2.0.0)

Heavy load on a local disk (OS disk) by shuffle