Spark Drives Intelligent Big Data Analytics Applications.pdf
Spark Drives Big Data Analytical Application
Phil Tian, Jianzhong Chen
Cloudera, Inc. All rights reserved.

Agenda
- Spark Brief
- What Cloudera is doing on Spark
- Spark Use Cases
- Cloudera's Position on Spark

A Brief Review of MapReduce
Key advances by MapReduce:
- Data locality: automatic split computation and launch of mappers appropriately
- Fault tolerance: writing out intermediate results and restartable mappers made it possible to run on commodity hardware
- Linear scalability: a combination of locality and a programming model that forces developers to write generally scalable solutions to problems
[Figure: many Map tasks feeding a smaller number of Reduce tasks]

What is Spark?
Spark is a general-purpose computational framework with more flexibility than MapReduce.
Key properties:
- Leverages distributed memory
- Full directed-graph expressions for data-parallel computations
- Improved developer experience
Yet it retains linear scalability, fault tolerance, and data-locality-based computation.

Spark: Easy and Fast Big Data
Easy to develop:
- Highly productive language support
- Clean and expressive APIs
- Interactive shell
- Out-of-the-box functionality
- 2-5x less code
Fast to run:
- General execution graphs
- In-memory storage
- Up to 10x faster on disk, 100x in memory

Easy: Example Word Count
Hadoop MapReduce:

    // Uses the old org.apache.hadoop.mapred API; both classes are nested in the job class.
    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emit (word, 1) for every token in the input line.
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      // Sum the counts collected for one word.
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark:

    val spark = new SparkContext(master, appName, sparkHome, jars)
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

Out-of-the-Box Functionality
Hadoop integration:
- Works with Hadoop data
- Runs with YARN
Libraries:
- MLlib
- Spark Streaming
- GraphX (alpha)
Language support:
- Improved Python support
- SparkR
- Java 8
- Schema support in Spark's APIs
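The Spark word count shown earlier chains three operations: flatMap, map, and reduceByKey. As a rough illustration of what each step computes, the sketch below traces the same pipeline in plain Python, without Spark. The helpers `flat_map` and `reduce_by_key` and the sample lines are hypothetical, named only to mirror the Spark operators; this is not Spark's API or its implementation.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using func (the idea behind reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical input standing in for the lines of a text file.
lines = ["to be or not to be", "to do is to be"]

words = flat_map(lambda line: line.split(" "), lines)   # flatMap(line => line.split(" "))
pairs = [(word, 1) for word in words]                    # map(word => (word, 1))
counts = reduce_by_key(lambda a, b: a + b, pairs)        # reduceByKey(_ + _)
```

In Spark the same three steps run in parallel over partitions of the file, and only reduceByKey needs to move data between nodes.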
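Example: Logistic Regression

    import numpy
    from math import exp

    # spark, readPoint, D, and iterations are assumed to be defined elsewhere.
    data = spark.textFile(...).map(readPoint).cache()   # keep parsed points in memory
    w = numpy.random.rand(D)
    for i in range(iterations):
        # Sum of per-point logistic-loss gradients; the "- 1" completes the gradient.
        gradient = data.map(
            lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient
    print("Final w: %s" % w)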
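To make the map and reduce steps of the logistic regression example concrete, here is a minimal sketch in plain Python that runs without Spark or NumPy. It assumes labels in {-1, +1} and uses the standard logistic-loss gradient, (1 / (1 + exp(-y * w.x)) - 1) * y * x. The toy data set, the helper names, and the iteration count are all hypothetical.

```python
import math
from functools import reduce

# Hypothetical toy data: (features, label) with label in {-1, +1}.
# The last feature is a constant 1.0 that acts as a bias term.
points = [
    ((2.0, 1.0, 1.0), 1.0),
    ((1.5, 2.0, 1.0), 1.0),
    ((-1.0, -1.5, 1.0), -1.0),
    ((-2.0, -0.5, 1.0), -1.0),
]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, point):
    """Logistic-loss gradient contributed by a single point."""
    x, y = point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]


def train(points, iterations=100):
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        # "map" computes one gradient per point; "reduce" sums them.
        gradient = reduce(add, map(lambda p: point_gradient(w, p), points))
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w


w = train(points)
predictions = [1.0 if dot(w, x) > 0 else -1.0 for x, _ in points]
```

Every iteration scans the full data set, which is why keeping the parsed points in memory with cache() matters for this kind of job.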
Memory Management Leads to Greater Performance
- A Hadoop cluster with 100 nodes contains 10+ TB of RAM today and will double next year
- 1 GB of RAM costs $10-$20
- Trends: 1/2 price every 18 months, 2x bandwidth every 3 years
- 64-128 GB RAM, 16 cores, 50 GB per second
- Memory can be an enabler for high-performance big data applications

Fast: Using RAM, Operator Graphs
- In-memory caching: data partitions are read from RAM instead of disk
- Operator graphs: scheduling optimizations, fault tolerance
[Figure: operator graph over RDDs A-F with map, join, filter, groupBy, and take; the legend marks RDDs and cached partitions]
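The two ideas on this slide, operator graphs that are only evaluated on demand and partitions that are served from RAM once cached, can be sketched with a toy class in plain Python. `LazyDataset`, `read_from_disk`, and the `stats` counter are hypothetical names for illustration only; this is not how Spark implements RDDs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a recipe for data that can optionally be cached."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function that produces a list
        self._use_cache = False
        self._cached = None

    def map(self, func):
        # Builds a new node in the operator graph; nothing runs yet.
        return LazyDataset(lambda: [func(x) for x in self.collect()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        # An action: walks the graph back to the source unless a cached copy exists.
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


stats = {"disk_reads": 0}


def read_from_disk():
    """Hypothetical slow source; counts how often it is actually read."""
    stats["disk_reads"] += 1
    return list(range(10))


base = LazyDataset(read_from_disk)
squares = base.map(lambda x: x * x).cache()          # graph built, nothing computed
reads_before_action = stats["disk_reads"]

evens = squares.filter(lambda x: x % 2 == 0).collect()   # first action reads the source
large = squares.filter(lambda x: x > 50).collect()       # served from the cached copy
```

Without the cache() call, each collect() would walk back to the source and read it again.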
Expressiveness of Programming Model
- Efficient group-by aggregations and other analytics
- Pipelined MapReduce jobs
- Iterative jobs (machine learning)
[Figure: chains of Map and Reduce stages illustrating each case]

Logistic Regression Performance (Data Fits in Memory)
[Figure: running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
- Hadoop: 110 s / iteration
- Spark: first iteration 80 s, further iterations 1 s
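The per-iteration figures from the logistic regression chart can be turned into total running times with simple arithmetic. The sketch below assumes the totals grow linearly with the number of iterations, using only the three annotated values (110 s, 80 s, 1 s); the function and constant names are hypothetical.

```python
HADOOP_PER_ITERATION_S = 110   # Hadoop: every iteration
SPARK_FIRST_ITERATION_S = 80   # Spark: first iteration
SPARK_LATER_ITERATION_S = 1    # Spark: each further iteration


def hadoop_total(iterations):
    return HADOOP_PER_ITERATION_S * iterations


def spark_total(iterations):
    if iterations == 0:
        return 0
    return SPARK_FIRST_ITERATION_S + SPARK_LATER_ITERATION_S * (iterations - 1)


# Iteration counts taken from the chart's x-axis.
totals = {n: (hadoop_total(n), spark_total(n)) for n in (1, 5, 10, 20, 30)}
```

Under this linear assumption, 30 iterations take 3300 s on Hadoop and 109 s on Spark, which is consistent with the chart's 0 to 4000 s axis.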
Spark Engineering in Cloudera
- Cloudera embraced Spark in early 2014
- Engineering with Intel to broaden the Spark ecosystem:
  - Hive-on-Spark
  - Pig-on-Spark
  - Spark-over-YARN
  - Spark Streaming reliability
  - General Spark optimization
Hive on Spark
Technology:
- Hive: "standard" SQL tool in Hadoop
- Spark: next-gen distributed processing framework
- Hive + Spark: performance, minimum feature gap
Industry:
- A lot of customers are heavily invested in Hive
- They want to leverage the Spark engine
Design Principles
- No or limited impact on Hive's existing code path
- Maximize code reuse
- Minimum feature customization
- Low future maintenance cost

Class Hierarchy
- TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler
- Task: MapRedTask, TezTask, SparkTask
- Work: MapRedWork, TezWork, SparkWork
- A TaskCompiler generates a Task; a Task is described by a Work

Work: Metadata for Task
- MapReduceWork contains one MapWork and a possible ReduceWork
- SparkWork contains a graph of MapWorks and ReduceWorks
Query: select name, sum(value) as v from dec group by name order by v;
- MapReduce: MR Job 1 (MapWork1, ReduceWork1), then MR Job 2 (MapWork2, ReduceWork2)
- Spark: one Spark job (MapWork1, ReduceWork1, ReduceWork2)

Data Processing via Spark
- Treat the table as a HadoopRDD (input RDD)
- Apply the function that wraps MR's map-side processing
- Shuffle map output using Spark's transformations (groupByKey, sortByKey, etc.)
- Apply the function that wraps MR's reduce-side processing

Spark Plan
- MapInput: encapsulates a table
- MapTran: map-side processing
- ShuffleTran: shuffling
- ReduceTran: reduce-side processing
Query: select name, sum(value) as v from dec group by name order by v;

Current Status
- All functionality in Hive is implemented
- First round of optimization is completed:
  - Map join, SMB
  - Split generation and grouping
  - CBO, vectorization
- More optimization and benchmarking coming
- Beta in CDH:
  http:/archive-
  http:/ spark/latest/PDF/hive-spark-get-started.pdf

Sample Use Cases
User: Conviva
- Use case: optimize end users' online video experience by analyzing traffic patterns in real time and applying finer-grained traffic control
- Spark's value: rapid prototyping; business logic shared between offline and online computation; open-source machine learning algorithms
User: Yahoo!
- Use case: speed up the model-training pipeline for ad targeting, with feature extraction improved 3x; content recommendation using collaborative filtering
- Spark's value: lower data-pipeline latency; iterative machine learning; efficient P2P broadcast
User: Anonymous (Large Tech Company)
- Use case: near-real-time log aggregation and analysis for monitoring and alerting
- Spark's value: low-latency, high-frequency "mini" batch jobs that process the latest data
User: Technicolor
- Use case: real-time analytics for (telecom) customers; stream processing and real-time query capability
- Spark's value: simple deployment that needs only Spark and Spark Streaming; ad hoc queries on live data
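The Hive on Spark slides earlier in the deck describe the sample query `select name, sum(value) as v from dec group by name order by v` as a plan of map-side processing, shuffles, and reduce-side processing. As a rough sketch of that data flow, the plain-Python trace below walks the query through those stages. The function names loosely follow the slide's MapTran, ShuffleTran, and ReduceTran labels, and the table rows are hypothetical; this is not Hive's or Spark's actual code.

```python
from itertools import groupby

# Hypothetical rows of the table `dec`: (name, value).
dec = [("b", 5), ("a", 1), ("c", 2), ("a", 3), ("b", 4), ("c", 1)]


def map_tran(rows):
    """Map-side processing: emit (key, value) pairs keyed by name."""
    return [(name, value) for name, value in rows]


def shuffle_group_by_key(pairs):
    """Shuffle that brings all values of a key together (the groupByKey idea)."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]


def reduce_tran_sum(grouped):
    """Reduce-side processing for sum(value) ... group by name."""
    return [(name, sum(values)) for name, values in grouped]


def shuffle_sort_by_key(pairs):
    """Shuffle that orders records by key (the sortByKey idea)."""
    return sorted(pairs, key=lambda kv: kv[0])


# Stage 1: group by name and sum the values.
sums = reduce_tran_sum(shuffle_group_by_key(map_tran(dec)))
# Stage 2: re-key by v, sort, and emit (name, v) rows.
by_v = shuffle_sort_by_key([(v, name) for name, v in sums])
result = [(name, v) for v, name in by_v]
```

In the MapReduce layout from the slides, each stage would be its own job (MR Job 1 and MR Job 2), whereas the SparkWork graph chains MapWork1, ReduceWork1, and ReduceWork2 inside a single Spark job.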