Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
379 views
in Technique[技术] by (71.8m points)

python - Chaining multiple mapreduce tasks in Hadoop streaming

I am in scenario where I have two mapreduce jobs. I am more comfortable with python and planning to use it for writing mapreduce scripts and use hadoop streaming for the same. is there a convenient to chain both the jobs following form when hadoop streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard a lot of methods to accomplish this in java, But i need something for Hadoop streaming.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here is a great blog post on how to use Cascading and Streaming. http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is you can mix java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note, the Cascade object in Cascading allows you to chain multiple Flows (via the above blog post your Streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...