

    Please use this identifier to cite or link to this item: http://140.128.103.80:8080/handle/310901/31103


    Title: 運用Spark於用電大數據湖泊之資料儲存與分析平台實作
    Other Titles: The Implementation of Data Storage and Analytics Platform for Big Data Lake of Electric Loads Using Spark
    Authors: 陳子揚
    CHEN, TZU-YANG
    Contributors: 楊朝棟
    YANG, CHAO-TUNG
    資訊工程學系
    Keywords: 巨量資料;資料湖泊;資料儲存;資料視覺化;電能資料
    Big Data;Data Lake;Data Storage;Data Visualization;Power Data
    Date: 2018
    Issue Date: 2019-01-10T09:05:05Z (UTC)
    Abstract: With the rapid development of Internet of Things and Big Data technologies, the speed at which data is generated and accumulated is astonishing, and the data storage and analysis techniques of traditional architectures are increasingly unable to cope with today's data volumes. Taking our university as an example: in the past, data-center power data and campus power data were stored in two separate database systems, and the amount of data accumulated over the years is enormous. To extract that data for analysis on a big-data platform, the only options were a JDBC connection or exporting each data set individually, which made retrieval cumbersome; importing an existing system into a data lake with big-data technology is therefore both a trend and a challenge. This thesis proposes an architecture that imports the existing storage system into a data lake and big-data platform to store and analyze power data. Historical data from the legacy system is transferred to Hive via Sqoop for data warehousing; Kafka preserves the integrity of the real-time streaming data, and Spark Streaming writes the incoming power data into HBase for real-time storage. A data lake built on Hive and HBase keeps the data complete, with Impala and Phoenix integrated as query engines for Hive and HBase respectively. Using Spark, the thesis also implements analysis modules for power-consumption forecasting and power-failure detection to analyze campus power usage; the analysis results are stored in HBase, and all visualizations in this thesis are produced with Apache Superset.
    With the rapid development of the Internet of Things and Big Data technology, the speed at which data is generated and accumulated is remarkable, and the storage and analysis techniques of traditional architectures are no longer adequate for processing such large volumes of data. Take our campus as an example: in the past, power data from the data center and from the campus were stored separately in two different database systems, and the data accumulated over a long period is very large. There is no doubt that Big Data technology brings significant benefits in efficiency and productivity; however, a successful Big Data migration requires an efficient architecture, and importing existing systems into a Data Lake with Big Data technologies is both a trend and a challenge. In this paper, we propose an architecture that imports the existing power-data storage system of our campus into a Big Data platform with a Data Lake. We use Apache Sqoop to transfer historical data from the existing system to Apache Hive for data warehousing. Apache Kafka ensures the integrity of the streaming data and serves as the input source for Spark Streaming, which writes the data to Apache HBase. To integrate the data, we apply the concept of a Data Lake built on Hive and HBase. Apache Impala and Apache Phoenix are used as search engines for Hive and HBase, respectively. This thesis uses Apache Spark to analyze power-consumption forecasting and power failures; the results of the analysis are stored in HBase. All visualizations in this thesis are presented with Apache Superset.
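    The record does not detail how the forecasting and power-failure modules work, so the sketch below only illustrates the general kind of logic such modules implement: a naive moving-average forecast and a consecutive-zero-reading outage rule, written in plain Python rather than the thesis's actual Spark code. All function names, the window size, and the zero-run threshold are hypothetical choices for illustration, not taken from the thesis.

    ```python
    from statistics import mean

    def forecast_next(readings, window=3):
        """Naive moving-average forecast of the next power reading (kW).
        Hypothetical stand-in for the thesis's Spark-based forecasting module."""
        recent = readings[-window:]
        return mean(recent)

    def detect_outage(readings, zero_run=2):
        """Flag a power failure once `zero_run` consecutive readings are zero.
        Hypothetical stand-in for the power-failure detection module.
        Returns the index at which the outage is confirmed, or None."""
        run = 0
        for i, r in enumerate(readings):
            run = run + 1 if r == 0 else 0
            if run >= zero_run:
                return i
        return None

    readings = [120.0, 118.5, 121.2, 0.0, 0.0, 0.0]
    print(forecast_next(readings[:3]))   # mean of the last `window` readings
    print(detect_outage(readings))       # index where the zero run reaches the threshold
    ```

    In the thesis's architecture this logic would run inside Spark over data read from the Hive/HBase data lake, with the results written back to HBase for Superset to visualize; the snippet above only shows the per-series computation.
    
    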
    Appears in Collections:[資訊工程學系所] 碩士論文

    Files in This Item:

    File: 106THU00394013-001.pdf — 5201 KB, Adobe PDF, 347 views (View/Open)


    All items in THUIR are protected by copyright, with all rights reserved.


