Best Apache Spark Courses

Find the best online Apache Spark Courses for you. The courses are sorted based on popularity and user ratings. We do not allow paid placements in any of our rankings. We also have a separate page listing only the Free Apache Spark Courses.

Apache Spark with Scala – Hands On with Big Data!

Apache Spark tutorial with 20+ hands-on examples of analyzing large data sets, on your desktop or on Hadoop with Scala!

Created by Sundog Education by Frank Kane - Founder, Sundog Education. Machine Learning Pro

"]

Students: 72367, Price: $94.99

Students: 72367, Price:  Paid

New! Completely updated and re-recorded for Spark 3, IntelliJ, Structured Streaming, and a stronger focus on the DataSet API.

"Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, EBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think, and you'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

Spark works best when using the Scala programming language, and this course includes a crash-course in Scala to get you up to speed quickly. For those more familiar with Python, however, a Python version of this class is also available: "Taming Big Data with Apache Spark and Python - Hands On".

Learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services in this course.

  • Learn the concepts of Spark's Resilient Distributed Datasets, DataFrames, and Datasets.

  • Get a crash course in the Scala programming language

  • Develop and run Spark jobs quickly using Scala, IntelliJ, and SBT

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Practice using other Spark technologies, like Spark SQL, DataFrames, DataSets, Spark Streaming, Machine Learning, and GraphX
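
To make the abstractions in the list above concrete, here is a minimal, hypothetical sketch (shown in PySpark for consistency with the other examples on this page; the course itself teaches the Scala API, which is directly analogous, and the data is invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ratings-demo").getOrCreate()

    # RDD: the low-level abstraction - an untyped, distributed collection of records.
    ratings_rdd = spark.sparkContext.parallelize([("u1", 5), ("u2", 3), ("u1", 4)])
    print(ratings_rdd.countByKey())          # {'u1': 2, 'u2': 1}

    # DataFrame: the higher-level, columnar abstraction optimized by Spark's planner.
    ratings_df = spark.createDataFrame(ratings_rdd, ["user", "rating"])
    ratings_df.groupBy("user").avg("rating").show()

    spark.stop()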

By the end of this course, you'll be running code that analyzes gigabytes worth of information – in the cloud – in a matter of minutes. 

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you might like in the process! We'll analyze a social graph of superheroes, and learn who the most “popular" superhero is – and develop a system to find “degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to SpiderMan? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. Over 8 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Enroll now, and enjoy the course!

"I studied Spark for the first time using Frank's course "Apache Spark 2 with Scala - Hands On with Big Data!". It was a great starting point for me,  gaining knowledge in Scala and most importantly practical examples of Spark applications. It gave me an understanding of all the relevant Spark core concepts,  RDDs, Dataframes & Datasets, Spark Streaming, AWS EMR. Within a few months of completion, I used the knowledge gained from the course to propose in my current company to  work primarily on Spark applications. Since then I have continued to work with Spark. I would highly recommend any of Franks courses as he simplifies concepts well and his teaching manner is easy to follow and continue with!  " - Joey Faherty

Taming Big Data with Apache Spark and Python – Hands On!

Apache Spark tutorial with 20+ hands-on examples of analyzing large data sets on your desktop or on Hadoop with Python!

Created by Sundog Education by Frank Kane - Founder, Sundog Education. Machine Learning Pro

"]

Students: 62008, Price: $89.99

Students: 62008, Price:  Paid

New! Updated for Spark 3, more hands-on exercises, and a stronger focus on DataFrames and Structured Streaming.

“Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, EBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think.

Learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services in this course. You'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

  • Learn the concepts of Spark's DataFrames and Resilient Distributed Datasets

  • Develop and run Spark jobs quickly using Python

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Learn about other Spark technologies, like Spark SQL, Spark Streaming, and GraphX
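
As a flavor of what "framing a data analysis problem as a Spark problem" looks like, here is a hedged, minimal multi-stage PySpark script in the spirit of the course's examples (the file name book.txt is made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()
    sc = spark.sparkContext

    # Stage 1: tokenize each line; Stage 2: aggregate counts (a shuffle); then sort.
    words = sc.textFile("book.txt").flatMap(lambda line: line.lower().split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    for word, n in counts.sortBy(lambda kv: kv[1], ascending=False).take(10):
        print(word, n)

    spark.stop()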

By the end of this course, you'll be running code that analyzes gigabytes worth of information – in the cloud – in a matter of minutes. 

This course uses the familiar Python programming language; if you'd rather use Scala to get the best performance out of Spark, see my "Apache Spark with Scala - Hands On with Big Data" course instead.

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you might like in the process! We'll analyze a social graph of superheroes, and learn who the most “popular" superhero is – and develop a system to find “degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to The Incredible Hulk? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. 7 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Wrangling big data with Apache Spark is an important skill in today's technical world. Enroll now!

  • " I studied "Taming Big Data with Apache Spark and Python" with Frank Kane, and helped me build a great platform for Big Data as a Service for my company. I recommend the course!  " - Cleuton Sampaio De Melo Jr.

Apache Spark In-Depth (Spark with Scala)


Created by Harish Masand - Technical Lead

"]

Students: 21361, Price: $99.99

Students: 21361, Price:  Paid

Learn Apache Spark From Scratch To In-Depth

From the instructor of successful Data Engineering courses on "Big Data Hadoop and Spark with Scala" and "Scala Programming In-Depth"

  • From a simple word count program to batch processing to Spark Structured Streaming.

  • From developing and deploying Spark applications to debugging.

  • From performance tuning and optimization to troubleshooting.

Covers all you need for an in-depth study of Apache Spark and to clear Spark interviews.

Taught in very simple English so anyone can follow the course easily.

No prerequisites; basic knowledge of Hadoop and Scala is good to have.

Perfect place to start learning Apache Spark

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Speed

Run workloads 100x faster.

Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Ease of Use

Write applications quickly in Java, Scala, Python, R, and SQL.

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.
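
For illustration, here is a hypothetical sketch of a few of those high-level operators (the inline data and column names are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("operators").getOrCreate()
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 28), ("carol", 41)], ["name", "age"])

    df.filter(df.age > 30).select("name").show()           # filter + projection
    df.orderBy(df.age.desc()).limit(2).show()              # sort + limit
    df.groupBy((df.age / 10).cast("int").alias("decade")).count().show()  # aggregation

    spark.stop()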

Generality

Combine SQL, streaming, and complex analytics.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Runs Everywhere

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

Apache Spark 2.0 with Java - Learn Spark from a Big Data Guru

Learn analyzing large data sets with Apache Spark by 10+ hands-on examples. Take your big data skills to the next level.

Created by Tao W. - Software engineer

"]

Students: 19288, Price: $99.99

Students: 19288, Price:  Paid

What is this course about:

This course covers all the fundamentals of Apache Spark with Java and teaches you everything you need to know about developing Spark applications with Java. At the end of this course, you will gain in-depth knowledge about Apache Spark and general big data analysis and manipulation skills to help your company adopt Apache Spark for building a big data processing pipeline and data analytics applications.

This course covers 10+ hands-on big data examples. You will learn valuable knowledge about how to frame data analysis problems as Spark problems. Together we will learn examples such as aggregating NASA Apache web logs from different sources; we will explore the price trend by looking at the real estate data in California; we will write Spark applications to find out the median salary of developers in different countries through the Stack Overflow survey data; we will develop a system to analyze how maker spaces are distributed across different regions in the United Kingdom.  And much much more.

What will you learn from this course:

In particular, you will learn:

  • An overview of the architecture of Apache Spark.

  • Develop Apache Spark 2.0 applications with Java using RDD transformations and actions and Spark SQL.

  • Work with Apache Spark's primary abstraction, resilient distributed datasets (RDDs), to process and analyze large data sets.

  • Deep dive into advanced techniques to optimize and tune Apache Spark jobs by partitioning, caching and persisting RDDs.

  • Scale up Spark applications on a Hadoop YARN cluster through Amazon's Elastic MapReduce service.

  • Analyze structured and semi-structured data using Datasets and DataFrames, and develop a thorough understanding of Spark SQL.

  • Share information across different nodes on an Apache Spark cluster using broadcast variables and accumulators (see the sketch after this list).
  • Best practices of working with Apache Spark in the field.

  • Big data ecosystem overview.
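
Here is the sketch referenced above: a hedged illustration of broadcast variables and accumulators (shown in PySpark for consistency across this page; the course itself uses Java, where the equivalents are Broadcast objects and accumulators, and the lookup data here is invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shared-vars").getOrCreate()
    sc = spark.sparkContext

    lookup = sc.broadcast({"us": "United States", "uk": "United Kingdom"})  # read-only copy on executors
    misses = sc.accumulator(0)                                              # counter updated by tasks

    def expand(code):
        name = lookup.value.get(code)
        if name is None:
            misses.add(1)   # safe to update from tasks; readable only on the driver
        return name

    print(sc.parallelize(["us", "uk", "fr"]).map(expand).collect())
    print("unknown codes:", misses.value)
    spark.stop()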

Why should we learn Apache Spark:

Apache Spark gives us unlimited ability to build cutting-edge applications. It is also one of the most compelling technologies of the last decade in terms of its disruption to the big data world.

Spark provides in-memory cluster computing which greatly boosts the speed of iterative algorithms and interactive data mining tasks.
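
The point is easy to see in code. In this hedged sketch (the file path and columns are invented), the dataset is cached in memory once and every iteration of the loop reuses it instead of re-reading from disk:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()
    points = spark.read.parquet("points.parquet").cache()   # pin in executor memory

    for i in range(10):   # e.g. the iterations of a training or mining loop
        loss = points.selectExpr("avg(abs(x - y)) AS loss").first()["loss"]
        print(i, loss)    # every pass after the first hits the in-memory cache

    spark.stop()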

Apache Spark is the next-generation processing engine for big data.

Tons of companies are adopting Apache Spark to extract meaning from massive data sets; today you have access to that same big data technology right on your desktop.

Apache Spark is becoming a must-have tool for big data engineers and data scientists.

About the author:

Since 2015, James has been helping his company adopt Apache Spark for building their big data processing pipeline and data analytics applications.

James' company has gained massive benefits by adopting Apache Spark in production. In this course, he is going to share with you his years of knowledge and best practices of working with Spark in the real field.

Why choose this course?

This course is very hands-on; James has put lots of effort into providing you with not only the theory but also real-life examples of developing Spark applications that you can try out on your own laptop.

James has uploaded all the source code to GitHub, and you will be able to follow along on Windows, macOS or Linux.

By the end of this course, James is confident that you will gain in-depth knowledge about Spark and general big data analysis and data manipulation skills. You'll be able to develop Spark applications that analyze gigabytes of data, both on your laptop and in the cloud using Amazon's Elastic MapReduce service!

30-day Money-back Guarantee!

You will get 30-day money-back guarantee from Udemy for this course.

If not satisfied, simply ask for a refund within 30 days. You will get a full refund, no questions asked.

Are you ready to take your big data analysis skills and career to the next level? Take this course now!

You will go from zero to Spark hero in 4 hours.

Apache Spark Hands on Specialization for Big Data Analytics

In-depth course to master Apache Spark Development using Scala for Big Data (with 30+ real-world & hands-on examples)

Created by Irfan Elahi - Data Scientist in the world's largest consultancy firm

"]

Students: 12288, Price: $99.99

Students: 12288, Price:  Paid

What if you could catapult your career in one of the most lucrative domains, i.e. Big Data, by learning the state-of-the-art Hadoop technology (Apache Spark), which is considered mandatory in all of the current jobs in this industry?

What if you could develop your skill-set in one of the hottest Big Data technologies, i.e. Apache Spark, with one of the most comprehensive courses out there (with 10+ hours of content), packed with dozens of hands-on real-world examples, use-cases, challenges and best practices?

What if you could learn from an instructor who is working in the world's largest consultancy firm, has worked, end-to-end, in Australia's biggest Big Data projects to date and who has a proven track record on Udemy with highly positive reviews and thousands of students already enrolled in his previous course(s)?

If you have such aspirations and goals, then you and this course are a perfect match made in heaven!

Why Apache Spark?

Apache Spark has revolutionised and disrupted the way big data processing and machine learning were done by virtue of its unprecedented in-memory and optimised computational model. It has been unanimously hailed as the future of Big Data. It's the tool of choice all around the world which allows data scientists, engineers and developers to acquire and process data for a number of use-cases like scalable machine learning, stream processing and graph analytics to name a few. All of the leading organisations like Amazon, Ebay, Yahoo among many others have embraced this technology to address their Big Data processing requirements. 

Additionally, Gartner has repeatedly highlighted Apache Spark as a leader in Data Science platforms. Certification programs of Hadoop vendors like Cloudera and Hortonworks, which have high esteem in current industry, have oriented their curriculum to focus heavily on Apache Spark. Almost all of the jobs in Big Data and Machine Learning space demand proficiency in Apache Spark. 

This is what John Tripier, Alliances and Ecosystem Lead at Databricks has to say, “The adoption of Apache Spark by businesses large and small is growing at an incredible rate across a wide range of industries, and the demand for developers with certified expertise is quickly following suit”.

All of these facts correlate to the notion that learning this amazing technology will give you a strong competitive edge in your career.

Why this course?

Firstly, this is the most comprehensive and in-depth course ever produced on Apache Spark. I've carefully and critically surveyed all of the resources out there, and almost all of them fail to cover this technology in the depth that it truly deserves. Some of them lack coverage of Apache Spark's theoretical concepts like its architecture and how it works in conjunction with Hadoop, some fall short in thoroughly describing how to use Apache Spark APIs optimally for complex big data problems, some ignore the hands-on aspects of demonstrating how to do Apache Spark programming on real-world use-cases, and almost all of them don't cover the best practices in industry and the mistakes that many professionals make in the field.

This course addresses all of the limitations that are prevalent in the currently available courses. Apart from that, as I have attended trainings from leading Big Data vendors like Cloudera (for which they charge thousands of dollars), I've ensured that the course is aligned with the educational patterns and best practices followed in those trainings, to ensure that you get the best and most effective learning experience.

Each section of the course covers concepts in extensive detail and from scratch so that you won't find any challenges in learning even if you are new to this domain. Also, each section will have an accompanying assignment section where we will work together on a number of real-world challenges and use-cases employing real-world data-sets. The data-sets themselves will also belong to different niches ranging from retail, web server logs, telecommunication and some of them will also be from Kaggle (world's leading Data Science competition platform).

The course leverages Scala instead of Python. Wherever possible, reference to Python development is also given, but the course is mainly based on Scala. The decision was made based on a number of rational factors. Scala is the de-facto language for development in Apache Spark. Apache Spark itself is developed in Scala, and as a result all of the new features are initially made available in Scala and then in other languages like Python. Additionally, there is a significant performance difference when it comes to using Apache Spark with Scala compared to Python. Scala is also one of the highest-paid programming languages, and you will be developing strong skills in that language along the way as well.

The course also has a number of quizzes to further test your skills. For further support, you can always ask questions, to which you will get a prompt response. I will also be sharing best practices and tips on a regular basis with my students.

What you are going to learn in this course?

The course consists of two major sections:

  • Section - 1:

We'll start off with the introduction of Apache Spark and will understand its potential and business use-cases in the context of the overall Hadoop ecosystem. We'll then focus on how Apache Spark actually works and will take a deep dive into the architectural components of Spark, as it's crucial for thorough understanding.

  • Section - 2:

After developing an understanding of Spark architecture, we will move to the next section of this course, where we will employ the Scala language to use Apache Spark APIs to develop distributed computation programs. Please note that you don't need prior knowledge of Scala for this course, as I will start with the very basics of Scala; as a result, you will also be developing your skills in one of the highest-paying programming languages.

In this section, we will comprehensively understand how Spark performs distributed computation using abstractions like RDDs, the caveats in loading data into Apache Spark, the different ways to create RDDs, how to leverage parallelism, and much more.

Furthermore, as transformations and actions constitute the gist of Apache Spark APIs, it's imperative to have a sound understanding of these. We will therefore focus on a number of Spark transformations and actions that are heavily used in industry and will go into detail on each. Each API usage will be complemented with a series of real-world examples and datasets, e.g. retail, web server logs, customer churn, and data from Kaggle. Each section of the course will have a number of assignments where you will be able to practically apply the learned concepts to further consolidate your skills.

A significant section of the course will also be dedicated to key-value RDDs, which form the basis of working optimally on a number of big data problems.

In addition to covering the crux of Spark APIs, I will also highlight a number of valuable best practices based on my experience and exposure, and will point out mistakes that many people make in the field. You will rarely find such information anywhere else.

Each topic will be covered in a lot of detail, with a strong emphasis on being hands-on, thus ensuring that you learn Apache Spark in the best possible way.

The course is applicable and valid for all versions of Spark, i.e. 1.6 and 2.0.

After completing this course, you will develop a strong foundation and an extended skill-set for using Spark on complex big data processing tasks. Big data is one of the most lucrative career domains, where data engineers command high salaries. This course will also substantially help in your job interviews. Also, if you are looking to excel further in your big data career by passing Hadoop certifications like those of Cloudera and Hortonworks, this course will prove to be extremely helpful in that context as well.

Lastly, once enrolled, you will have lifetime access to the lectures and resources. It's a self-paced course and you can watch lecture videos on any device like a smartphone or laptop. Also, you are backed by Udemy's rock-solid 30-day money-back guarantee. So if you are serious about learning Apache Spark, enroll in this course now and let's start this amazing journey together!

Databricks Fundamentals & Apache Spark Core

Learn how to process big-data using Databricks & Apache Spark 2.4 and 3.0.0 - DataFrame API and Spark SQL

Created by Wadson Guimatsa - Data Engineer

"]

Students: 11847, Price: $99.99

Students: 11847, Price:  Paid

Welcome to this course on Databricks and Apache Spark 2.4 and 3.0.0

Apache Spark is a Big Data Processing Framework that runs at scale.
In this course, we will learn how to write Spark Applications using Scala and SQL.

Databricks is a company founded by the creator of Apache Spark.
Databricks offers a managed and optimized version of Apache Spark that runs in the cloud.

The main focus of this course is to teach you how to use the DataFrame API & SQL to accomplish tasks such as:

  • Write and run Apache Spark code using Databricks

  • Read and Write Data from the Databricks File System - DBFS

  • Explain how Apache Spark runs on a cluster with multiple Nodes

Use the DataFrame API and SQL to perform data manipulation tasks such as

  • Selecting, renaming and manipulating columns

  • Filtering, dropping and aggregating rows

  • Joining DataFrames

  • Create UDFs and use them with DataFrame API or Spark SQL

  • Writing DataFrames to external storage systems

List and explain the elements of the Apache Spark execution hierarchy such as

  • Jobs

  • Stages

  • Tasks
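
A hedged PySpark sketch of how that hierarchy shows up in practice (partition count and data invented): one action submits one job, each shuffle boundary splits the job into stages, and each stage runs one task per partition.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hierarchy").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1000), 4)   # 4 partitions

    pairs = rdd.map(lambda x: (x % 10, 1))                 # narrow: stays in the same stage
    totals = pairs.reduceByKey(lambda a, b: a + b)         # shuffle: starts a new stage

    totals.collect()   # the action: 1 job -> 2 stages -> one task per partition

    # The Spark Web UI (port 4040 by default) shows this job/stage/task breakdown.
    spark.stop()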

Apache Spark 3 – Spark Programming in Python for Beginners

Data Engineering using Spark Structured API

Created by Prashant Kumar Pandey - Architect, Author, Consultant, Trainer @ Learning Journal

"]

Students: 11298, Price: $19.99

Students: 11298, Price:  Paid

This course does not require any prior knowledge of Apache Spark or Hadoop. We have taken enough care to explain Spark Architecture and fundamental concepts to help you come up to speed and grasp the content of this course.

About the Course

I am creating the Apache Spark 3 - Spark Programming in Python for Beginners course to help you understand Spark programming and apply that knowledge to build data engineering solutions. This course is example-driven and follows a working-session-like approach. We will be taking a live coding approach and explaining all the needed concepts along the way.

Who should take this Course?

I designed this course for software engineers willing to develop a Data Engineering pipeline and application using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organization’s data-centric infrastructure. Another group of people is the managers and architects who do not directly work with Spark implementation. Still, they work with the people who implement Apache Spark at the ground level.

Spark Version used in the Course

This course uses Apache Spark 3.x. I have tested all the source code and examples used in this course on the Apache Spark 3.0.0 open-source distribution.

Apache Spark for Java Developers

Get processing Big Data using RDDs, DataFrames, SparkSQL and Machine Learning - and real time streaming with Kafka!

Created by Richard Chesterwood - Software developer at VirtualPairProgrammers

"]

Students: 10258, Price: $34.99

Students: 10258, Price:  Paid

Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers.

If you're new to Data Science and want to find out about how massive datasets are processed in parallel, then the Java API for Spark is a great way to get started, fast.

All of the fundamentals you need to understand the main operations you can perform in Spark Core, SparkSQL and DataFrames are covered in detail, with easy to follow examples. You'll be able to follow along with all of the examples, and run them on your own local development computer.

Included with the course is a module covering SparkML, an exciting addition to Spark that allows you to apply Machine Learning models to your Big Data! No mathematical experience is necessary!

And finally, there's a full 3 hour module covering Spark Streaming, where you will get hands-on experience of integrating Spark with Apache Kafka to handle real-time big data streams. We use both the DStream and the Structured Streaming APIs.

Optionally, if you have an AWS account, you'll see how to deploy your work to a live EMR (Elastic Map Reduce) hardware cluster. If you're not familiar with AWS you can skip this video, but it's still worthwhile to watch rather than following along with the coding.

You'll be going deep into the internals of Spark and you'll find out how it optimizes your execution plans. We'll be comparing the performance of RDDs vs SparkSQL, and you'll learn about the major performance pitfalls which could save a lot of money for live projects.

Throughout the course, you'll be getting some great practice with Java Lambdas - a great way to learn functional-style Java if you're new to it.

Master Big Data – Apache Spark/Hadoop/Sqoop/Hive/Flume

In-depth course on Big Data - Apache Spark, Hadoop, Sqoop, Flume & Apache Hive, Big Data Cluster setup

Created by Navdeep Kaur - TechnoAvengers.com (Founder)

"]

Students: 5404, Price: $29.99

Students: 5404, Price:  Paid

In this course, you will start by learning what the Hadoop Distributed File System is and the most common Hadoop commands required to work with it.

Then you will be introduced to Sqoop Import

  • Understand the lifecycle of the sqoop command.

  • Use the sqoop import command to migrate data from MySQL to HDFS.

  • Use the sqoop import command to migrate data from MySQL to Hive.

  • Use various file formats, compressions, file delimiters, where clauses and queries while importing the data.

  • Understand split-by and boundary queries.

  • Use incremental mode to migrate the data from MySQL to HDFS.

Further, you will learn Sqoop Export to migrate data.

  • What is sqoop export

  • Using sqoop export, migrate data from HDFS to MySQL.

  • Using sqoop export, migrate data from Hive to MySQL.

Further, you will learn about Apache Flume

  • Understand Flume Architecture.

  • Using Flume, ingest data from Twitter and save to HDFS.

  • Using Flume, ingest data from netcat and save to HDFS.

  • Using Flume, ingest data from exec and show on console.

  • Describe Flume interceptors and see examples of using interceptors.

  • Flume multiple agents

  • Flume Consolidation.

In the next section, we will learn about Apache Hive

  • Hive Intro

  • External & Managed Tables

  • Working with Different Files - Parquet, Avro

  • Compressions

  • Hive Analysis

  • Hive String Functions

  • Hive Date Functions

  • Partitioning

  • Bucketing

Finally You will learn about Apache Spark

  • Spark Intro

  • Cluster Overview

  • RDD

  • DAG/Stages/Tasks

  • Actions & Transformations

  • Transformation & Action Examples

  • Spark DataFrames

  • Spark DataFrames - working with different file formats & compression

  • DataFrame APIs

  • Spark SQL

  • DataFrame Examples

  • Spark with Cassandra Integration

Apache Spark 3 for Data Engineering & Analytics with Python

Learn how to use Python and PySpark 3.0.1 for Data Engineering / Analytics (Databricks) - Beginner to Ninja

Created by David Charles Academy - Senior Big Data Engineer / Consultant at ABN AMRO

"]

Students: 5333, Price: $19.99

Students: 5333, Price:  Paid

The key objectives of this course are as follows:

  • Learn the Spark Architecture

  • Learn Spark Execution Concepts

  • Learn Spark Transformations and Actions using the Structured API

  • Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API

  • Learn how to set up your own local PySpark Environment

  • Learn how to interpret the Spark Web UI

  • Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution

  • Learn the RDD (Resilient Distributed Datasets) API (Crash Course)

    • RDD Transformations

    • RDD Actions

  • Learn the Spark DataFrame API (Structured APIs) - see the sketch after this list

    • Create Schemas and Assign DataTypes

    • Read and Write Data using the DataFrame Reader and Writer

    • Read Semi-Structured Data such as JSON

    • Create and Add New Data Columns to the DataFrame using Expressions

    • Filter the DataFrame using the "Filter" and "Where" Transformations

    • Ensure that the DataFrame has unique rows

    • Detect and Drop Duplicates

    • Augment the DataFrame by Adding New Rows

    • Combine 2 or More DataFrames

    • Order the DataFrame by Specific Columns

    • Rename and Drop Columns from the DataFrame

    • Clean the DataFrame by Detecting and Removing Missing or Bad Data

    • Create User-Defined Spark Functions

    • Read and Write to/from Parquet File

    • Partition the DataFrame and Write to Parquet File

    • Aggregate the DataFrame using Spark SQL functions (count, countDistinct, max, min, sum, sumDistinct, avg)

    • Perform Aggregations with Grouping

  • Learn Spark SQL and Databricks

    • Create a Databricks Account

    • Create a Databricks Cluster

    • Create Databricks SQL and Python Notebooks

    • Learn Databricks shortcuts

    • Create Databases and Tables using Spark SQL

    • Use DML, DQL, and DDL with Spark SQL

    • Use Spark SQL Functions

    • Learn the differences between Managed and Unmanaged Tables

    • Read CSV Files from the Databricks File System

    • Learn to write Complex SQL


    • Create Visualisations with Databricks

    • Create a Databricks Dashboard
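
Here is the sketch referenced in the DataFrame API item above: a hedged, minimal PySpark example chaining several of the listed transformations (the data and column names are invented for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("df-ops").getOrCreate()
    df = spark.createDataFrame(
        [("2023-01-02", "NY", 3, 11.95),
         ("2023-01-02", "NY", 3, 11.95),
         ("2023-01-05", "CA", None, 8.50)],
        ["order_date", "state", "qty", "price"])

    cleaned = (df.dropDuplicates()                        # ensure unique rows
                 .na.drop(subset=["qty"])                 # remove missing/bad data
                 .withColumn("total", F.col("qty") * F.col("price"))  # new column via expression
                 .filter(F.col("state") == "NY"))         # the "filter"/"where" transformation

    cleaned.groupBy("state").agg(F.sum("total").alias("sales")).show()  # aggregation with grouping
    spark.stop()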

The Python Spark projects that we are going to do together:

Sales Data

  • Create a Spark Session

  • Read a CSV file into a Spark Dataframe

  • Learn to Infer a Schema

  • Select data from the Spark Dataframe

  • Produce analytics that shows the topmost sales orders per Region and Country

Convert Fahrenheit to Degrees Centigrade

  • Create a Spark Session

  • Read and Parallelize data using the Spark Context into an RDD

  • Create a Function to Convert Fahrenheit to Degrees Centigrade

  • Use the Map Function to convert data contained within an RDD

  • Filter temperatures greater than or equal to 13 degrees Celsius
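
A hedged sketch of this mini-project (the readings are invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("f-to-c").getOrCreate()
    sc = spark.sparkContext

    def to_celsius(f):
        return (f - 32) * 5.0 / 9.0

    readings = sc.parallelize([32.0, 55.4, 59.0, 71.6])   # degrees Fahrenheit
    warm = (readings.map(to_celsius)                      # ~0, 13, 15, 22 degrees Celsius
                    .filter(lambda c: c >= 13.0)
                    .collect())
    print(warm)
    spark.stop()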

XYZ Research

  • Create a set of RDDs that hold Research Data

  • Use the union transformation to combine RDDs

  • Learn to use the subtract transformation to minus values from an RDD

  • Use the RDD API to answer the following questions

    • How many research projects were initiated in the first three years?

    • How many projects were completed in the first year?

    • How many projects were completed in the first two years?

Sales Analytics

  • Create the Sales Analytics DataFrame from a set of CSV Files

  • Prepare the DataFrame by applying a Structure

  • Remove bad records from the DataFrame (Cleaning)

  • Generate New Columns from the DataFrame

  • Write a Partitioned DataFrame to a Parquet Directory

  • Answer the following questions and create visualizations using Seaborn and Matplotlib

    • What was the best month in sales?

    • What city sold the most products?

    • What time should the business display advertisements to maximize the likelihood of customers buying products?

    • What products are often sold together in the state "NY"?

Technology Spec

  1. Python

  2. Jupyter Notebook

  3. Jupyter Lab

  4. PySpark (Spark with Python)

  5. Pandas

  6. Matplotlib

  7. Seaborn

  8. Databricks

  9. SQL

Apache Spark 3 – Spark Programming in Scala for Beginners

Data Engineering using Spark Structured API

Created by Prashant Kumar Pandey - Architect, Author, Consultant, Trainer @ Learning Journal

"]

Students: 5214, Price: $19.99

Students: 5214, Price:  Paid

This course does not require any prior knowledge of Apache Spark or Hadoop. We have taken enough care to explain Spark Architecture and fundamental concepts to help you come up to speed and grasp the content of this course.

About the Course

I am creating the Apache Spark 3 - Spark Programming in Scala for Beginners course to help you understand Spark programming and apply that knowledge to build data engineering solutions. This course is example-driven and follows a working-session-like approach. We will be taking a live coding approach and explaining all the needed concepts along the way.

Who should take this Course?

I designed this course for software engineers willing to develop a Data Engineering pipeline and application using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organization’s data-centric infrastructure. Another group of people is the managers and architects who do not directly work with Spark implementation. Still, they work with the people who implement Apache Spark at the ground level.

Spark Version used in the Course

This course uses Apache Spark 3.x. I have tested all the source code and examples used in this course on the Apache Spark 3.0.0 open-source distribution.

Apache Spark 3 – Real-time Stream Processing using Scala

Learn to create Real-time Stream Processing applications using Apache Spark

Created by Prashant Kumar Pandey - Architect, Author, Consultant, Trainer @ Learning Journal

"]

Students: 5054, Price: $19.99

Students: 5054, Price:  Paid

About the Course

I am creating the Apache Spark 3 - Real-time Stream Processing using Scala course to help you understand real-time stream processing using Apache Spark and apply that knowledge to build real-time stream processing solutions. This course is example-driven and follows a working-session-like approach. We will be taking a live coding approach and explaining all the needed concepts along the way.

Who should take this Course?

I designed this course for software engineers willing to develop a Real-time Stream Processing Pipeline and application using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organization’s data-centric infrastructure. Another group of people is the managers and architects who do not directly work with Spark implementation. Still, they work with the people who implement Apache Spark at the ground level.

Spark Version used in the Course

This course uses Apache Spark 3.x. I have tested all the source code and examples used in this course on the Apache Spark 3.0.0 open-source distribution.

Master Apache Spark – Hands On!

Learn how to slice and dice data using the next generation big data platform - Apache Spark!

Created by Imtiaz Ahmad - Senior Software Engineer & Trainer @ Job Ready Programmer

"]

Students: 4563, Price: $89.99

Students: 4563, Price:  Paid

LAST UPDATED: November 2020

Apache Spark is the next generation batch and stream processing engine. It's been proven to be almost 100 times faster than Hadoop, and it is much easier to develop distributed big data applications with. Its demand has skyrocketed in recent years, and having this technology on your resume is truly a game changer. Over 3000 companies are using Spark in production right now and the list is growing very quickly! Some of the big names include Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft and Amazon, as well as most of the big world banks and financial institutions!

In this course you'll learn everything you need to know about using Apache Spark in your organization while using their latest and greatest Java Datasets API.  Below are some of the things you'll learn:

  • How to develop Spark Java Applications using Spark SQL Dataframes

  • Understand how the Spark Standalone cluster works behind the scenes

  • How to use various transformations to slice and dice your data in Spark Java

  • How to marshal/unmarshal Java domain objects (POJOs) while working with Spark Datasets

  • Master joins, filters, aggregations and ingest data of various sizes and file formats (txt, CSV, JSON, etc.)

  • Analyze over 18 million real-world comments on Reddit to find the most trending words used

  • Develop programs using Spark Streaming for streaming stock market index files

  • Stream network sockets and messages queued on a Kafka cluster

  • Learn how to develop the most popular machine learning algorithms using Spark MLlib

  • Covers the most popular algorithms: Linear Regression, Logistic Regression and K-Means Clustering

You'll be developing over 15 practical Spark Java applications, crunching through real-world data and slicing and dicing it in various ways using several data transformation techniques. This course is especially important for people who would like to be hired as a Java developer or data engineer, because Spark is a hugely sought-after skill. We'll even go over how to set up a live cluster and configure Spark Jobs to run on the cloud. You'll also learn about the practical implications of performance tuning and scaling out a cluster to work with big data, so you'll definitely be learning a ton in this course. This course has a 30-day money back guarantee. You will have access to all of the code used in this course.

Apache Spark 2.0 + Java : DO Big Data Analytics & ML

Project Based, Hands-on Practices, Spark SQL, Spark Streaming, Java Setup and building real world applications

Created by V2 Maestros, LLC - Big Data / Data Science Experts | 50K+ students

"]

Students: 3758, Price: $19.99

Students: 3758, Price:  Paid

Welcome to our course. Looking to learn Apache Spark 2.0, practice end-to-end projects and take it to a job interview? You have come to the RIGHT course! This course teaches you Apache Spark 2.0 with Java, trains you in building Spark Analytics and machine learning programs, and helps you practice hands-on (2K LOC of code samples!) with an end-to-end real-life application project. Our goal is to help you and everyone learn, so we keep our prices low and affordable.

Java is the main technology used today to build industry-grade applications, and combining it with Spark gives you unlimited ability to build cutting-edge applications.

Apache Spark is the hottest Big Data skill today. More and more organizations are adopting Apache Spark for building their big data processing and analytics applications, and the demand for Apache Spark professionals is skyrocketing. Learning Apache Spark is a great vehicle to good jobs, better quality of work and the best remuneration packages.

The goal of this course is to provide hands-on training that applies directly to real-world Big Data projects. It uses the learn-train-practice-apply methodology where you

  • Learn solid fundamentals of the domain
  • See demos, train and execute solid examples
  • Practice hands-on and validate it with solutions provided
  • Apply knowledge you acquired in an end-to-end real life project

Taught by an expert in the field, you will also get prompt response to your queries and excellent support from Udemy.

Apache Spark 3 – Real-time Stream Processing using Python

Learn to create Real-time Stream Processing applications using Apache Spark

Created by Prashant Kumar Pandey - Architect, Author, Consultant, Trainer @ Learning Journal

"]

Students: 3694, Price: $19.99

Students: 3694, Price:  Paid

About the Course

I am creating the Apache Spark 3 - Real-time Stream Processing using Python course to help you understand real-time stream processing using Apache Spark and apply that knowledge to build real-time stream processing solutions. This course is example-driven and follows a working-session-like approach. We will be taking a live coding approach and explaining all the needed concepts along the way.

Who should take this Course?

I designed this course for software engineers willing to develop a Real-time Stream Processing Pipeline and application using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organization’s data-centric infrastructure. Another group of people is the managers and architects who do not directly work with Spark implementation. Still, they work with the people who implement Apache Spark at the ground level.

Spark Version used in the Course

This course uses Apache Spark 3.x. I have tested all the source code and examples used in this course on the Apache Spark 3.0.0 open-source distribution.

Apache Spark 3 – Databricks Certified Associate Developer

Learn Apache Spark 3 With Scala & Earn the Databricks Associate Certification to prove your skills as a data professional

Created by Wadson Guimatsa - Data Engineer

"]

Students: 2958, Price: $49.99

Students: 2958, Price:  Paid

Do you want to learn how to handle massive amounts of data at scale?

Learn Apache Spark 3 and pass the Databricks Certified Associate Developer for Apache Spark 3.0 exam

Hi, my name is Wadson, and I’m a Databricks Certified Associate Developer for Apache Spark 3.0

In today’s data-driven world, Apache Spark has become the standard big-data cluster processing framework.

Apache Spark is used for Data Engineering, Data Science, and Machine Learning.

I will teach you everything you need to know about getting started with Apache Spark.

You will learn the Architecture of Apache Spark and use its Core APIs to manipulate complex data.
You will write queries to perform transformations such as Join, Union, GroupBy, and more.

This course is for beginners.
You do not need previous knowledge of Apache Spark.

There are Notebooks available to download so that you can follow along with me in the videos.
The Notebooks contain all the source code I use in the course.
There are also Quizzes to help you assess your understanding of the topics.

Apache Spark 2.0 + Python : DO Big Data Analytics & ML

Project Based, Hands-on Practices, Spark SQL, Spark Streaming, Real life Full cycle Project

Created by V2 Maestros, LLC - Big Data / Data Science Experts | 50K+ students

"]

Students: 2340, Price: $19.99

Students: 2340, Price:  Paid

Welcome to our course. Looking to learn Apache Spark 2.0, practice end-to-end projects and take it to a job interview? You have come to the RIGHT course! This course teaches you Apache Spark 2.0 with Python, trains you in building Spark Analytics and machine learning programs, and helps you practice hands-on with an end-to-end real-life application project. Our goal is to help you and everyone learn, so we keep our prices low and affordable.

Apache Spark is the hottest Big Data skill today. More and more organizations are adopting Apache Spark for building their big data processing and analytics applications, and the demand for Apache Spark professionals is skyrocketing. Learning Apache Spark is a great vehicle to good jobs, better quality of work and the best remuneration packages.

The goal of this course is to provide hands-on training that applies directly to real-world Big Data projects. It uses the learn-train-practice-apply methodology where you

  • Learn solid fundamentals of the domain
  • See demos, train and execute solid examples
  • Practice hands-on and validate it with solutions provided
  • Apply knowledge you acquired in an end-to-end real life project

Taught by an expert in the field, you will also get prompt response to your queries and excellent support from Udemy.

Apache Spark SQL – Bigdata In-Memory Analytics Master Course

Master in-memory distributed computing with Apache Spark SQL. Leverage the power of DataFrames and Datasets with real-life demos

Created by MUTHUKUMAR Subramanian - Best Selling Instructor, Big Data, Spark, Cloud, Java, AWS

"]

Students: 607, Price: $94.99

Students: 607, Price:  Paid

This course is designed for everyone from professionals with zero experience to already skilled professionals who want to enhance their Spark SQL skills. Hands-on sessions cover the end-to-end setup of a Spark cluster in AWS and on local systems.

COURSE UPDATED PERIODICALLY SINCE LAUNCH: Last Updated: December

What students are saying:

  • 5 stars, "This is classic. Spark related concepts are clearly explained with real life examples." - Temitayo Joseph

In a data pipeline, whether the data starts out structured or unstructured, the final extracted data will be in structured form, and at the final stage we need to work with that structured data. SQL is a popular query language for doing analysis on structured data.

Apache Spark facilitates distributed in-memory computing. Spark has a built-in module called Spark SQL for structured data processing. Users can mix SQL queries with Spark programs, and Spark SQL integrates seamlessly with other constructs of Spark.
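
A minimal illustration of that mixing, assuming a hypothetical events.json file with a user column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-mix").getOrCreate()

    events = spark.read.json("events.json")
    events.createOrReplaceTempView("events")          # expose the DataFrame to SQL

    per_user = spark.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user")
    per_user.filter(per_user.n > 10).show()           # keep going with DataFrame operators

    spark.stop()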

Spark SQL facilitates loading and writing data from various sources like RDBMS, NoSQL databases and cloud storage like S3, and it can easily handle different data formats like Parquet, Avro, JSON and many more.

Spark provides two types of APIs:

  • Low Level API - RDD

  • High Level API - DataFrames and Datasets

Spark SQL works very well with various components of Spark like Spark Streaming, Spark Core and GraphX, as it has good API integration between the high-level and low-level APIs.

The initial part of the course is an introduction to the Lambda Architecture and the big data ecosystem. The remaining sections concentrate on reading and writing data between Spark and various data sources.

DataFrames and Datasets are the basic building blocks of Spark SQL. We will learn how to work with Transformations and Actions on RDDs, DataFrames and Datasets.

We also cover optimizing tables with partitioning and bucketing, as sketched below.
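
A hedged sketch of what that looks like with the DataFrame writer (the table and column names are invented; note that bucketBy requires saveAsTable rather than a plain save):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("layout").getOrCreate()
    df = spark.createDataFrame(
        [(2023, 1, 10.0), (2023, 2, 12.5), (2024, 1, 9.0)],
        ["year", "user_id", "amount"])

    (df.write
       .partitionBy("year")        # one directory per year -> partition pruning
       .bucketBy(4, "user_id")     # pre-hash rows by user_id -> cheaper joins/aggregations
       .sortBy("user_id")
       .mode("overwrite")
       .saveAsTable("sales_by_year"))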

To facilitate the understanding of data processing, the following use cases have been included to illustrate the complete data flow.

1) NHL Dataset Analysis

2) Bay Area Bike Share Dataset Analysis

Updates:

++ Apache Zeppelin notebook (Installation, configuration, Dynamic Input)

++ Spark Demo with Apache Zeppelin

Mastering Databricks & Apache spark -Build ETL data pipeline

Learn fundamental concepts about Databricks and process big data by building your first data pipeline on Azure

Created by Priyank Singh - An Engineer who loves to build

"]

Students: 523, Price: $99.99

Students: 523, Price:  Paid

Welcome to the course on Mastering Databricks & Apache Spark - Build ETL data pipeline

Databricks combines the best of data warehouses and data lakes into a lakehouse architecture. In this course we will be learning how to perform various operations in Scala, Python and Spark SQL. This will help every student build solutions that create value, and develop the mindset to build batch processes in any of these languages. This course will help you write the same commands in different languages so that, based on your client's needs, you can adapt and deliver a world-class solution. We will be building an end-to-end solution in Azure Databricks.

Key Learning Points

  • We will be building our own cluster, which will process our data, and with a one-click operation we will load data from different sources into Azure SQL and Delta tables

  • After that we will be leveraging Databricks notebooks to prepare a dashboard to answer business questions

  • Based on the needs we will be deploying infrastructure on the Azure cloud

  • These scenarios will give students 360-degree exposure to the cloud platform and how to set up various resources

  • All activities are performed in Azure Databricks

Fundamentals

  • Databricks

  • Delta tables

  • Concept of versions and vacuum on Delta tables (see the sketch after this list)

  • Apache Spark SQL

  • Filtering Dataframe

  • Renaming, drop, Select, Cast

  • Aggregation operations SUM, AVERAGE, MAX, MIN

  • Rank, Row Number, Dense Rank (see the sketch after this list)

  • Building dashboards

  • Analytics
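
Two of the items above deserve a quick illustration. First, a hedged sketch of Delta table versions and vacuum as run from a Databricks notebook (the table name is invented, spark is pre-created in the notebook, and VACUUM's default retention is 7 days, i.e. 168 hours):

    spark.sql("DESCRIBE HISTORY sales_delta").show()               # one row per table version
    spark.sql("SELECT * FROM sales_delta VERSION AS OF 0").show()  # time travel to an old version
    spark.sql("VACUUM sales_delta RETAIN 168 HOURS")               # purge unreferenced data files

Second, a hedged sketch of Rank, Row Number and Dense Rank as window functions (the data is invented):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("windows").getOrCreate()
    df = spark.createDataFrame(
        [("east", "a", 100), ("east", "b", 100), ("east", "c", 80), ("west", "d", 90)],
        ["region", "product", "sales"])

    w = Window.partitionBy("region").orderBy(F.desc("sales"))
    df.select("region", "product", "sales",
              F.row_number().over(w).alias("row_number"),  # 1,2,3 - ties broken arbitrarily
              F.rank().over(w).alias("rank"),              # 1,1,3 - gaps after ties
              F.dense_rank().over(w).alias("dense_rank")   # 1,1,2 - no gaps
              ).show()
    spark.stop()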

This course is suitable for Data Engineers, BI Architects, Data Analysts, ETL Developers and BI Managers

Apache Spark Core 3.0 In-Depth

In-depth, hands-on exposure to the features and concepts of Spark Core, with tips on tuning its performance

Created by Amit Ranjan - Big Data Engineer

"]

Students: 226, Price: $89.99

Students: 226, Price:  Paid

Apache Spark has turned out to be the most sought-after skill for any big data engineer. An evolution of the MapReduce programming paradigm, Spark provides unified data processing, from writing SQL to performing graph processing to implementing Machine Learning algorithms. It effectively uses cluster nodes and better memory management to spread the load across a cluster of nodes to get faster results. Apache Spark drives the mission of data-driven decision-making in thousands of organizations.

In order to fairly appreciate the benefits of the libraries of Apache Spark, it is essential to get the foundations right. This course aims exactly at that part. It starts from the beginner level and gradually explains all the complex concepts in an easy-to-follow manner. It gives a profound description of the features and workings of the framework through 5 different use cases with detailed hands-on implementations. In fact, some hands-on sessions and solutions to the use cases are explained in a full classroom mode, with videos extending over 40 minutes. After taking this course, you will gain expertise in Spark Core, and further libraries like Spark SQL, Structured Streaming, Spark ML and GraphX will be much easier to visualize, implement and optimize.