Spark Drives Big Data Analytical Application
Phil Tian, Jianzhong Chen
Cloudera, Inc. All rights reserved.

Agenda
- Spark Brief
- What Cloudera is doing on Spark
- Spark Use Cases
- Cloudera's Position on Spark

A Brief Review of MapReduce
Key advances by MapReduce:
- Data locality: automatic split computation and launch of mappers appropriately
- Fault tolerance: writing out intermediate results and restartable mappers made it possible to run on commodity hardware
- Linear scalability: combination of locality + a programming model that forces developers to write generally scalable solutions to problems
[Diagram: a row of Map tasks feeding a smaller row of Reduce tasks]

What is Spark?
Spark is a general-purpose computational framework with more flexibility than MapReduce.
Key properties:
- Leverages distributed memory
- Full directed graph expressions for data-parallel computations
- Improved developer experience
Yet retains: linear scalability, fault tolerance, and data-locality-based computations.

Spark: Easy and Fast Big Data
Easy to develop:
- Highly productive language support
- Clean and expressive APIs
- Interactive shell
- Out-of-the-box functionality
- 2-5x less code
Fast to run:
- General execution graphs
- In-memory storage
- Up to 10x faster on disk, 100x in memory

Easy: Example Word Count

Hadoop MapReduce (Java; the two classes are nested inside a job class):

    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Mapper: emit (word, 1) for every token of every input line.
    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    // Reducer: sum the counts collected for each word.
    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark (Scala):

    val spark = new SparkContext(master, appName, sparkHome, jars)
    val file = spark.textFile("hdfs://...")
    // Split lines into words, pair each word with 1, then sum per word.
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

Out-of-the-Box Functionality
- Hadoop integration: works with Hadoop data, runs with YARN
- Libraries: MLlib, Spark Streaming, GraphX (alpha)
- Language support: improved Python support, SparkR, Java 8, schema support in Spark's APIs
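
The Spark word count shown earlier chains flatMap, map and reduceByKey. As a rough illustration of what those three steps compute, here is a single-process sketch in plain Python. It does not use Spark; the helper names `flat_map` and `reduce_by_key` and the sample `LINES` are assumptions made for this example only.

```python
def flat_map(func, records):
    """Apply func to each record and flatten the resulting lists."""
    return [out for record in records for out in func(record)]


def reduce_by_key(func, pairs):
    """Combine the values of each key with func, like reduceByKey."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


# Hypothetical input standing in for the lines of a text file.
LINES = ["to be or not to be", "to see or not to see"]

words = flat_map(lambda line: line.split(" "), LINES)   # flatMap
pairs = [(word, 1) for word in words]                   # map
COUNTS = reduce_by_key(lambda a, b: a + b, pairs)       # reduceByKey

for word, count in sorted(COUNTS.items()):
    print(word, count)
```

In Spark the same three steps run in parallel over partitions of the file, and reduceByKey shuffles pairs so that all counts for a word meet on one worker.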

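Example: Logistic Regression

    from math import exp
    import numpy

    # spark is a SparkContext; readPoint, D and iterations are defined elsewhere.
    # cache() keeps the parsed points in memory across iterations.
    data = spark.textFile(...).map(readPoint).cache()

    w = numpy.random.rand(D)

    for i in range(iterations):
        # Gradient of the logistic loss, summed over all points.
        # The "- 1" term is needed for "w -= gradient" to descend;
        # it does not appear in the extracted slide text.
        gradient = data.map(
            lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient

    print("Final w: %s" % w)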
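
To make the update rule in the logistic regression example concrete, the following is a minimal sketch in plain Python, with a sequential loop standing in for Spark's map and reduce. The names `dot`, `gradient`, `train`, `predict` and the toy `POINTS` are assumptions for illustration; labels are +1 or -1, as the use of `p.y` in the example suggests.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def gradient(points, w):
    """Sum over all points of (sigmoid(y * w.x) - 1) * y * x."""
    grad = [0.0] * len(w)
    for x, y in points:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    return grad


def train(points, dims, iterations, seed=0):
    """Full-batch gradient descent with a step size of 1."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dims)]
    for _ in range(iterations):
        g = gradient(points, w)
        w = [wj - gj for wj, gj in zip(w, g)]
    return w


def predict(w, x):
    """Classify x as +1 or -1 by the sign of w.x."""
    return 1 if dot(w, x) >= 0 else -1


# Hypothetical toy data: the label follows the sign of the first feature.
POINTS = [([1.0, 0.5], 1), ([2.0, -0.3], 1), ([-1.0, 0.2], -1), ([-2.0, -0.4], -1)]

W = train(POINTS, dims=2, iterations=20)
print("Final w:", W)
```

Every iteration reads the whole data set again, which is why keeping the points in memory matters for this kind of job.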

Memory Management Leads to Greater Performance
- A Hadoop cluster with 100 nodes contains 10+ TB of RAM today and will double next year
- 1 GB RAM: $10-$20
- Trends: price every 18 months, 2x bandwidth every 3 years
- [Diagram: a node with 64-128 GB RAM, 16 cores, 50 GB per sec]
- Memory can be an enabler for high-performance big data applications

Fast: Using RAM, Operator Graphs
- In-memory caching: data partitions are read from RAM instead of disk
- Operator graphs: scheduling optimizations, fault tolerance
[Diagram: operator graph over RDDs A to F built from map, join, filter, groupBy and take; the legend marks RDDs and cached partitions]

Expressiveness of Programming Model
- Efficient group-by aggregations and other analytics
- Pipelined MapReduce jobs
- Iterative jobs (machine learning)
[Diagrams: chains of Map and Reduce stages for each of the three cases]

Logistic Regression Performance (Data Fits in Memory)
[Chart: running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30), for Hadoop and Spark]
- Hadoop: 110 s / iteration
- Spark: first iteration 80 s, further iterations 1 s
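
The per-iteration figures quoted for the logistic regression chart can be turned into total running times. The sketch below assumes those figures hold exactly for every iteration, which is a simplification; the function names are hypothetical.

```python
HADOOP_PER_ITERATION_S = 110   # from the slide: 110 s / iteration
SPARK_FIRST_ITERATION_S = 80   # from the slide: first iteration 80 s
SPARK_LATER_ITERATION_S = 1    # from the slide: further iterations 1 s


def hadoop_total(iterations):
    """Every Hadoop iteration re-reads its input, so the cost is linear."""
    return HADOOP_PER_ITERATION_S * iterations


def spark_total(iterations):
    """The first Spark iteration loads the data; later ones read from memory."""
    if iterations == 0:
        return 0
    return SPARK_FIRST_ITERATION_S + SPARK_LATER_ITERATION_S * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n))
```

Under these assumptions, 30 iterations take 3300 s on Hadoop and 109 s on Spark, which matches the upper range of the chart's time axis.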

Spark Engineering in Cloudera
- Cloudera embraced Spark in early 2014
- Engineering with Intel to broaden the Spark ecosystem:
  - Hive-on-Spark
  - Pig-on-Spark
  - Spark-over-YARN
  - Spark Streaming reliability
  - General Spark optimization

Hive on Spark
Technology:
- Hive: "standard" SQL tool in Hadoop
- Spark: next-gen distributed processing framework
- Hive + Spark: performance, minimum feature gap
Industry:
- Many customers have invested heavily in Hive
- They want to leverage the Spark engine

Design Principles
- No or limited impact on Hive's existing code path
- Maximize code reuse
- Minimum feature customization
- Low future maintenance cost

Class Hierarchy
- TaskCompiler (generates Task): MapRedCompiler, TezCompiler, SparkCompiler
- Task (described by Work): MapRedTask, TezTask, SparkTask
- Work: MapRedWork, TezWork, SparkWork

Work: Metadata for Task
- MapReduceWork contains one MapWork and a possible ReduceWork
- SparkWork contains a graph of MapWorks and ReduceWorks
Query: select name, sum(value) as v from dec group by name order by v;
- MapReduce: MR Job 1 (MapWork1, ReduceWork1) followed by MR Job 2 (MapWork2, ReduceWork2)
- Spark: one Spark job (MapWork1, ReduceWork1, ReduceWork2)

Data Processing via Spark
- Treat the table as a HadoopRDD (input RDD)
- Apply the function that wraps MR's map-side processing
- Shuffle map output using Spark's transformations (groupByKey, sortByKey, etc.)
- Apply the function that wraps MR's reduce-side processing

Spark Plan
- MapInput: encapsulates a table
- MapTran: map-side processing
- ShuffleTran: shuffling
- ReduceTran: reduce-side processing
Query: select name, sum(value) as v from dec group by name order by v;

Current Status
- All functionality in Hive is implemented
- First round of optimization is completed:
  - Map join, SMB
  - Split generation and grouping
  - CBO, vectorization
- More optimization and benchmarking coming
- Beta in CDH
- http:/archive- http:/ spark/latest/PDF/hive-spark-get-started.pdf

Sample Use Cases
Conviva
- Use case: optimize end users' online video experience through real-time analysis of traffic patterns and finer-grained traffic control
- Spark's value: rapid prototyping; business logic shared between offline and online computation; open-source machine learning algorithms
Yahoo!
- Use case: speed up the model-training pipeline for ad targeting, with feature extraction improved 3x; content recommendation using collaborative filtering
- Spark's value: lower latency in the data pipeline; iterative machine learning; efficient P2P broadcast
Anonymous (Large Tech Company)
- Use case: near-real-time log aggregation and analysis for monitoring and alerting
- Spark's value: low-latency, high-frequency "mini" batch jobs that process the latest data
Technicolor
- Use case: real-time analytics for its (telecom) customers; stream processing and real-time query capability
- Spark's value: simple deployment that needs only Spark and Spark Streaming; ad hoc queries on live data
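
Returning to the Hive on Spark plan described under "Data Processing via Spark" and "Spark Plan": the steps for the example query can be sketched in plain Python as one chained pipeline. This is only an illustration of the data flow, not Hive's or Spark's code; the function names and the rows in `DEC` are assumptions.

```python
from collections import defaultdict

# Hypothetical rows (name, value) of the table `dec`.
DEC = [("a", 3), ("b", 1), ("a", 4), ("c", 2), ("b", 5)]


def map_tran(rows):
    """Map-side processing: emit (name, value) pairs keyed by name."""
    return [(name, value) for name, value in rows]


def shuffle_group(pairs):
    """Shuffle for GROUP BY name: bring all values of a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def reduce_sum(groups):
    """First reduce-side step: compute sum(value) per name."""
    return [(name, sum(values)) for name, values in groups]


def shuffle_sort(rows):
    """Shuffle for ORDER BY v: order rows by the aggregated value."""
    return sorted(rows, key=lambda row: row[1])


def run_query(rows):
    # One chained plan: the group-by result feeds the sort directly,
    # with no intermediate job boundary in between.
    return shuffle_sort(reduce_sum(shuffle_group(map_tran(rows))))


RESULT = run_query(DEC)
print(RESULT)
```

With MapReduce the group-by and the sort would each be a separate job with its own map and reduce phase; in the Spark plan the second reduce step consumes the first one's output inside the same job.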

Cloudera Use Cases in Verticals
- Large tech company: Spark is used for new machine learning investigations for search personalization
- Financial services: process millions of stock positions and future scenarios in 4 hours with Spark (compared with 1 week using MapReduce)
- University: genomics research using Spark pipelines
- Video: Spark and Spark Streaming for video streaming and analysis
- Hospital: Spark for predictive modeling of disease conditions

Cloudera Use Cases with Different Components
- Run ETL on Spark using Pig, to achieve very tight SLAs
- Accenture Smart Water Application
- Spark analytics over HBase: patients' physiological data, experiment and user data, serving researchers
- Traffic analysis using MLlib clustering at Dylan
- Annotated variants analysis on Spark, using the Spark/Java framework at Duke
- Sepsis detection with Spark Streaming

Near Real-Time Dashboard by E
E is a car shopping website where people from all across the nation come to read reviews, compare prices, and in general get help in all car-related matters. The goal was to build, within a couple of weeks, a near-real-time dashboard providing both unique visitor and page view counts per make and per make/model. In the past, these updates were restricted to hourly granularity with an additional hour of delay. Furthermore, because the data was not available in an easy-to-use dashboard, manual processing was needed to visualize it.

Prototype Architecture
[Diagram not preserved]

Advanced Analytics with Spark
- Written by the Cloudera data science team
- First-ever book bridging ML with the Hadoop ecosystem
- Focuses on use cases and examples rather than being a manual
- Targeted at data scientists solving real-world analysis problems
- Generally available in May 2015

Analyzing Big Data
- Build a model to detect credit card fraud using thousands of features and billions of transactions
- Intelligently recommend millions of products to millions of users
- Estimate financial risk through simulations of portfolios including millions of instruments
- Easily manipulate data from thousands of human genomes to detect genetic associations with disease

Cloudera's Investment in Spark
Spark is a fully integrated and supported part of Cloudera's enterprise data hub.
- First vendor to ship and support Spark
- Invested early to make it a cohesive part of the platform
- Complemented by Intel's early investment
- Developed and supported in collaboration with Databricks to ensure success
- Only vendor with Spark committers on staff
- Several Spark use cases in production
- Well-trained support staff and external training courses

Hadoop in the Spark World
[Diagram of components: YARN; Spark, Spark Streaming, GraphX, MLlib, SparkSQL; HDFS, HBase; Hive, Pig, Impala, MapReduce2, Search. Legend: Core Hadoop, Support Spark components, Unsupported add-ons]

Cloudera is Built for Production Success
A modern data platform plus what the enterprise requires.
Hadoop delivers:
- One place for unlimited data
- Unified, multi-framework data access
Cloudera delivers:
- Leading performance
- Open source, open standards
- Enterprise security
- Data governance
- Complete management
[Diagram: Security and Administration; Unlimited Storage; Process, Discover, Model, Serve; Deployment Flexibility: On-Premises, Appliances, Engineered Systems, Public Cloud, Private Cloud, Hybrid Cloud]

Focusing on Open Standards, not just Open Source
Open standards are just as important as open source. Why does it matter?
- Diverse engineering is more sustainable.
- Broad support ensures vendor portability.
- Project utility depends on ecosystem compatibility, which depends on standards.
Cloudera leads in defining the de facto open standards adopted by the market.
[Table: vendor support by component (founder). Vendors: Cloudera, Pivotal, MapR, Amazon, IBM, Hortonworks. Components: Spark (UC Berkeley), Impala (Cloudera), Hue (Cloudera), Sentry (Cloudera), Flume (Cloudera), Parquet (Cloudera/Twitter), Sqoop (Cloudera), Falcon, Knox, Tez, Ranger, ORCfile. The per-vendor support marks are not preserved.]

Cloudera's Position on Spark
Cloudera is a member of, and aligned with, the broader Spark community.
Spark:
- Will replace MapReduce as the general-purpose Hadoop framework
  - Broad community and vendor adoption
  - Hadoop ecosystem integration (native & 3rd party)
  - Goes beyond data science/machine learning
  - Cloudera is working on Spark Core, Streaming, Security, YARN, and MLlib
- Does not replace special-purpose frameworks
  - One size does not fit all for SQL, Search, Graph, Stream

Try It With Cloudera Live
Featuring tutorials on: CDH

Thank You

Appendix: Concepts

Spark Concepts - Overview
- Driver & Workers
- RDD: Resilient Distributed Dataset
- Transformations
- Actions
- Caching

Drivers and Workers
[Diagram: one Driver connected to three Workers, each Worker holding Data and RAM]

RDD: Resilient Distributed Dataset
- Read-only, partitioned collection of records
- Created through:
  - Transformation of data in storage
  - Transformation of RDDs
- Contains lineage to compute from storage
- Lazy materialization
- Users control persistence and partitioning

Operations
- Transformations: Map, Filter, Sample, Join
- Actions: Reduce, Count, First, Take, SaveAs

Operations
- Transformations create a new RDD from an existing one
- Actions run a computation on an RDD and return a value
- Transformations are lazy
- Actions materialize RDDs by computing transformations
- RDDs can be cached to avoid re-computing

Fault-Tolerance
- RDDs contain lineage
- Lineage: source location and list of transformations
- Lost partitions can be re-computed from source data

    # Keep only ERROR lines, then take the third tab-separated field.
    msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                    .map(lambda s: s.split("\t")[2]))

Lineage: HDFS File, then filter (func = startswith(...)) gives Filtered RDD, then map (func = split(...)) gives Mapped RDD

Caching
- persist() and cache() mark data
- An RDD is cached after the first action
- Fault-tolerant: lost partitions will be re-computed
- If there is not enough memory, some partitions will not be cached
- Future actions are performed on cached partitions, so they are much faster
- Use caching for iterative algorithms

Caching: Storage Levels
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
- DISK_ONLY
- MEMORY_ONLY_2, MEMORY_AND_DISK_2

Easy: Expressive API
- map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
- reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
- sample, take, first, partitionBy, mapWith, pipe, save
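
The appendix concepts (lazy transformations, actions, lineage, caching after the first action, and recomputation of lost data) can be illustrated with a toy single-process class. This is a sketch for intuition only and is not how Spark implements RDDs; `MiniRDD`, `lose_cache`, the `computations` counter and the sample `LOG_LINES` are all assumptions. One simplification: the lineage is always replayed from the source, so a derived MiniRDD does not reuse a cached parent.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a source plus a recorded list of transformations."""

    def __init__(self, source, lineage=()):
        self.source = source            # zero-argument callable that re-reads the data
        self.lineage = tuple(lineage)   # (kind, func) pairs, applied only on demand
        self.cache_requested = False
        self.cached = None
        self.computations = 0           # how many times the lineage was replayed

    # Transformations: record the step and return a new MiniRDD; nothing runs yet.
    def map(self, func):
        return MiniRDD(self.source, self.lineage + (("map", func),))

    def filter(self, func):
        return MiniRDD(self.source, self.lineage + (("filter", func),))

    def cache(self):
        """Mark the data for caching; it is stored by the first action."""
        self.cache_requested = True
        return self

    def _materialize(self):
        if self.cached is not None:
            return self.cached
        self.computations += 1
        records = list(self.source())
        for kind, func in self.lineage:
            if kind == "map":
                records = [func(r) for r in records]
            else:
                records = [r for r in records if func(r)]
        if self.cache_requested:
            self.cached = records
        return records

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._materialize())

    def count(self):
        return len(self._materialize())

    def lose_cache(self):
        """Simulate losing the cached data, e.g. when a worker fails."""
        self.cached = None


# Hypothetical tab-separated log lines.
LOG_LINES = ["ERROR\tdb\ttimeout", "INFO\tweb\tok", "ERROR\tweb\t500"]

text_file = MiniRDD(lambda: LOG_LINES)
msgs = (text_file.filter(lambda s: s.startswith("ERROR"))
                 .map(lambda s: s.split("\t")[2])
                 .cache())

before = msgs.computations         # transformations alone compute nothing
first = msgs.collect()             # first action: replay lineage, fill the cache
after_first = msgs.computations
total = msgs.count()               # second action: answered from the cache
after_second = msgs.computations
msgs.lose_cache()
again = msgs.collect()             # recomputed from source via the lineage
after_loss = msgs.computations

print(first, total, before, after_first, after_second, after_loss)
```

The counter makes the behavior visible: building the chain costs nothing, the first action pays for the computation, later actions reuse the cached result, and a lost cache is rebuilt from the recorded lineage rather than from a replica.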
