Building solid data pipelines with PySpark

Category: build data pipelines

Duration (fully guided training): 24h

Flipped-classroom training duration: 5h16min of videos and 16h of interactive workshops.

About the Course

Apache Spark is an essential tool in a data engineer's toolbelt. With it, you can build powerful data transformation pipelines, especially for large, cloud-native datasets. It also supports streaming applications and machine learning on datasets too large for other well-known tools, such as Pandas, to handle well. In this workshop, you won't just learn the most common operations; you'll apply them to two of the most common business scenarios. You'll also learn how to structure your Spark pipelines, improve their performance, and reduce the chance of mistakes in them. By the end, you will have a firm grasp of Apache Spark and know how to use it effectively.