Spark Performance Rules
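- Do not use count()
when building logger.info() messages, because count() is a very expensive Spark operation: it is an action that runs a job over the whole dataset just to produce one number for the log.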
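- Do not use coalesce()
in the middle of a Spark process; if it is needed, use it only at the end, just before writing the output.
coalesce() vs. repartition(): coalesce() merges partitions without a full shuffle, but in the middle of a job it can also reduce the parallelism of the steps before it. repartition() completely reshuffles the data across the entire Spark cluster, so everything slows down while data moves to particular nodes.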
- Do not use persist()
because it uses memory and disk (depending on the storage level), which makes persist() very expensive. It is mostly used in cases
where data is sensitive and cannot be lost, as in banking. The disadvantage of using persist()
is that the Spark process can become 10x or more slower.
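
The first two rules can be sketched in PySpark-style Python as below. This is only an illustration, not a prescribed implementation: the logger name `spark_job` and the helpers `log_row_count` and `write_output` are hypothetical, and moving the row count behind a DEBUG check is an assumption about how a team might still keep counts available for troubleshooting. The functions only rely on the standard `count()`, `coalesce()`, and `write.mode(...).parquet(...)` DataFrame calls.

```python
from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed for type hints only; the module loads without pyspark installed.
    from pyspark.sql import DataFrame

logger = logging.getLogger("spark_job")  # hypothetical logger name


def log_row_count(df: DataFrame, label: str) -> None:
    """Log a row count only when DEBUG is enabled (assumed policy).

    count() is an action that runs a Spark job over the whole dataset,
    so normal INFO-level runs skip it entirely.
    """
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("%s rows: %d", label, df.count())


def write_output(df: DataFrame, path: str, num_files: int) -> None:
    """Reduce the partition count once, as the last step before the write."""
    # coalesce() is applied here, at the end, and not in the middle of the job.
    df.coalesce(num_files).write.mode("overwrite").parquet(path)
```

In a job, `log_row_count(orders_df, "orders")` would cost nothing unless DEBUG logging is switched on, and `write_output(result_df, path, 8)` would be the only place where the partition count is reduced (`orders_df`, `result_df`, and the value 8 are placeholders).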