The paper "Making Sense of Performance in Data Analytics Frameworks", published at NSDI 2015, concludes that CPU (not I/O or network) is the performance bottleneck in Spark. In the paper, Kay ran experiments on Spark using BDbench, TPC-DS, and a production workload (was only Spark SQL used?). I wonder whether this conclusion also holds for frameworks built on Spark, such as Spark Streaming: with a continuous data stream received over the network, both network I/O and disk would be under heavy pressure.
Francois G :
Network and disk may suffer less pressure in Spark Streaming because the streams are usually checkpointed, meaning that not all data is kept around forever.

But ultimately, this is a research question: the only way to settle it is to benchmark. Kay's code is open source.
2015-05-15T09:19:01