Building solid data pipelines in PySpark

Create data transformation pipelines that scale & are easy to debug.

  • Starts 12 Oct
  • €1,500
  • Data Minded office

Service Description

Apache Spark is an essential tool in a data engineer's toolbelt. With it, you can build impressive data transformation pipelines, especially for large, cloud-native datasets. Its Python API, PySpark, offers a low barrier to entry to this powerful analytics engine. Spark also handles streaming applications and machine learning on large datasets, workloads that other well-known tools, like Pandas, don't lend themselves well to.

In this workshop, taught by Oliver Willekens, senior data engineer at Data Minded, you won't just learn about distributed computing concepts and the most common operations in Spark; you'll also apply them to the two most common business scenarios. You'll learn how to structure your PySpark pipelines so they perform better and leave less room for mistakes.

By the end, you will have a firm grasp of Apache Spark: you'll be able to process data at scale, stay mindful of Spark's lazy evaluation, and write locally testable PySpark transformations with ease, saving you time and sparing you nasty surprises. Join us!
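To give a taste of what "lazy, locally testable transformations" means in practice, here is a minimal sketch; the function and column names are illustrative only and not taken from the workshop materials:

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F


    def add_vat(df: DataFrame, rate: float = 0.21) -> DataFrame:
        """Return a new DataFrame with a VAT-inclusive price column.

        A pure function of its inputs: no I/O, so it can be unit tested
        with a local SparkSession and a handful of in-memory rows.
        """
        return df.withColumn("price_incl_vat", F.col("price") * (1 + rate))


    if __name__ == "__main__":
        # A local session is enough to exercise the transformation in a test.
        spark = SparkSession.builder.master("local[*]").getOrCreate()
        frame = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "price"])

        # Nothing is computed yet: Spark is lazy until an action (like show) runs.
        result = add_vat(frame)
        result.show()

Because the transformation takes a DataFrame and returns a DataFrame without touching storage, it can be tested on your laptop with a few rows before it ever runs against a full cloud dataset.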


Cancellation Policy

To cancel or reschedule, contact us at least 24 hours before the event. Cancellations received after that deadline will be invoiced at 50% of the price.