呼延：Simple Yet Powerful Hadoop.pdf

上传人：小小飞

文档编号：3332733

上传时间：2019-08-13

格式：PDF

页数：43

大小：1.41MB

《呼延：Simple Yet Powerful Hadoop.pdf》由会员分享，可在线阅读，更多相关《呼延：Simple Yet Powerful Hadoop.pdf（43页珍藏版）》请在三一文库上搜索。

1、Leslie Huyan VP, Internet Technology, Schmap Inc. Hadoop Community Contributor Simple Yet Powerful Hadoop Skills biggest cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search Facebook: To store copies of internal log and dimension data source

2、s and use it as a source for reporting/analytics and machine learning; 320 machine cluster with 2,560 cores and about 1.3 PB raw storage; Hadoop Deep Insights Hadoop core consists of HDFS Map/Reduce What Is HDFS A file system Fundamental underlying HDFS - Structure Files split into 128MB blocks Bloc

3、ks replicated across several datanodes (usually 3) Single namenode stores metadata (file names, block locations, etc) Optimized for large files, sequential reads Files are append-only Namenode Datanodes 1 2 3 4 1 2 4 2 1 3 1 4 3 3 2 4 File1 18 Namenode B replication Rack1 Rack2 Client Blocks Datanod

4、es Datanodes Client Write Read Metadata ops Metadata(Name, replicas) (/home/foo/data,6. Block ops Double Masters - Optimization HDFS - Data Replication HDFS is designed to store very large files across machines in a large cluster. Each file is a sequence of blocks. All blocks in the file except the

5、last are of the same size. Blocks are replicated for fault tolerance. Block size and replicas are configurable per file. The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster. BlockReport contains all the blocks on a Datanode. HDFS - Replica Placement The placement of

6、 the replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from other distributed file systems. Rack-aware replica placement: Goal: improve reliability, availability and network bandwidth utilization Research topic Many racks, communication between

7、 racks are through switches. Network bandwidth between machines on the same rack is greater than those in different racks. Namenode determines the rack id for each DataNode. Replicas are typically placed on unique racks Simple but non-optimal Writes are expensive Replication factor is 3 Another rese

8、arch topic? Replicas are placed: one on a node in a local rack, one on a different node in the local rack and one on a node in a different rack. 1/3 of the replica on a node, 2/3 on a rack and 1/3 distributed evenly across remaining racks. HDFS API HDFS provides Java API for application to use. Pyth

9、on access is also used in many applications. A C language wrapper for Java API is also available. A HTTP browser can be used to browse the files of a HDFS instance. Map/Reduce Framework provided by Hadoop to process large amount of data across a cluster of machines in a parallel manner Comprises of

10、three classes Mapper class Reducer class Driver class Tasktracker/ Jobtracker Reducer phase will start only after mapper is done Takes (k,v) pairs and emits (k,v) pair MapReduce: High Level JobTracker MapReduce job submitted by client computer Master node TaskTracker Slave node Task instance TaskTra

11、cker Slave node Task instance TaskTracker Slave node Task instance In our case: circe.rc.usf.edu Map/Reduce job flow MapReduce Terminology Job A “full program” - an execution of a Mapper and Reducer across a data set Task An execution of a Mapper or a Reducer on a slice of data a.k.a. Task-In-Progre

12、ss (TIP) Task Attempt A particular instance of an attempt to execute a task on a machine MapReduce Programming Model Data type: key-value records Map function: (Kin, Vin) list(Kinter, Vinter) Reduce function: (Kinter, list(Vinter) list(Kout, Vout) Example: Word Count def mapper(line): foreach word i

13、n line.split(): output(word, 1) def reducer(key, values): output(key, sum(values) Word Count Execution the quick brown fox the fox ate the mouse how now brown cow Map Map Map Reduce Reduce brown, 2 fox, 2 how, 1 now, 1 the, 3 ate, 1 cow, 1 mouse, 1 quick, 1 the, 1 brown, 1 fox, 1 quick, 1 the, 1 fox

14、, 1 the, 1 how, 1 now, 1 brown, 1 ate, 1 mouse, 1 cow, 1 Input Map Shuffle public void map(LongWritable key, Text value, OutputCollector out, Reporter reporter) throws IOException String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens() out.collect(

15、new text(itr.nextToken(), ONE); Word Count in Java public class ReduceClass extends MapReduceBase implements Reducer public void reduce(Text key, Iterator values, OutputCollector out, Reporter reporter) throws IOException int sum = 0; while (values.hasNext() sum += values.next().get(); out.collect(k

16、ey, new IntWritable(sum); Word Count in Java public static void main(String args) throws Exception JobConf conf = new JobConf(WordCount.class); conf.setJobName(“wordcount“); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); FileIn

17、putFormat.setInputPaths(conf, args0); FileOutputFormat.setOutputPath(conf, new Path(args1); conf.setOutputKeyClass(Text.class); / out keys are words (strings) conf.setOutputValueClass(IntWritable.class); / values are counts JobClient.runJob(conf); Word Count in Python with Hadoop Streaming import sy

18、s for line in sys.stdin: for word in line.split(): print(word.lower() + “t“ + 1) import sys counts = for line in sys.stdin: word, count = line.split(“t”) dictword = dict.get(word, 0) + int(count) for word, count in counts: print(word.lower() + “t“ + 1) Mapper.py: Reducer.py: Fault Tolerance in MapRe

19、duce 1. If a task crashes: Retry on another node OK for a map because it has no dependencies OK for reduce because map outputs are on disk If the same task fails repeatedly, fail the job or ignore that input block (user-controlled) Fault Tolerance in MapReduce 2. If a node crashes: Re-launch its cur

20、rent tasks on other nodes Re-run any maps the node previously ran Necessary because their output files were lost along with the crashed node Fault Tolerance in MapReduce 3. If a task is going slowly (straggler): Launch second copy of task on another node (“speculative execution”) Take the output of

21、whichever copy finishes first, and kill the other Surprisingly important in large clusters Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc Single straggler may noticeably slow down a job Takeaways By providing a data-parallel programming model, MapReduce can

22、 control job execution in useful ways: Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures & stragglers User focuses on application, not on complexities of distributed computing Further Resources Hadoop Tutorial: http:/ Hadoop Streaming: http:/hadoop.apache.org/common/docs/r0.15.2/streaming.html Hadoop API: http:/hadoop.apache.org/common/docs/current/api HDFS Commands Reference: http:/hadoop.apache.org/hdfs/docs/current/file_system_shell.html Thank you

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

6 元

下载	加入VIP免费专享

版权申诉 word格式文档无特别注明外均可编辑修改；预览文档经过压缩，下载后原文更清晰！ 立即下载

配套讲稿：: 如PPT文件的首页显示word图标，表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
特殊限制：: 部分文档作品中含有的国旗、国徽等图片，仅作为作品整体效果示例展示，禁止商用。设计者仅对作品中独创性部分享有著作权。
关键词：: 呼延：Simple Yet Powerful Hadoop 呼延 Simple

三一文库所有资源均是用户自行上传分享，仅供网友学习交流，未经上传用户书面授权，请勿作他用。

关于本文

本文标题：呼延：Simple Yet Powerful Hadoop.pdf
链接地址：https://www.31doc.com/p-3332733.html