Best PySpark Courses

Find the best online PySpark courses for you. The courses are sorted based on popularity and user ratings. We do not allow paid placements in any of our rankings. We also have a separate page listing only the Free PySpark Courses.

Spark and Python for Big Data with PySpark

Learn how to use Spark with Python, including Spark Streaming, Machine Learning, Spark 2.0 DataFrames and more!

Created by Jose Portilla - Head of Data Science, Pierian Data Inc.

"]

Students: 81781, Price: $129.99

Students: 81781, Price:  Paid

Learn the latest Big Data Technology - Spark! And learn to use it with one of the most popular programming languages, Python!

One of the most valuable technology skills is the ability to analyze huge data sets, and this course is specifically designed to bring you up to speed on one of the best technologies for this task, Apache Spark! The top technology companies like Google, Facebook, Netflix, Airbnb, Amazon, NASA, and more are all using Spark to solve their big data problems!

Spark can perform up to 100x faster than Hadoop MapReduce, which has caused an explosion in demand for this skill! Because the Spark 2.0 DataFrame framework is so new, you now have the ability to quickly become one of the most knowledgeable people in the job market!

This course will teach the basics with a crash course in Python, continuing on to learning how to use Spark DataFrames with the latest Spark 2.0 syntax! Once we've done that, we'll go through how to use the MLlib Machine Learning Library with the DataFrame syntax and Spark. All along the way you'll have exercises and Mock Consulting Projects that put you right into a real-world situation where you need to use your new skills to solve a real problem!
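
To give a flavor of the DataFrame syntax the course is built around, here is a minimal, hedged sketch (the data and column names are invented for illustration, not taken from the course):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

    # A tiny in-memory DataFrame; in practice you would read real CSV/Parquet files
    df = spark.createDataFrame(
        [("Alice", 34, 62000.0), ("Bob", 41, 81000.0), ("Cara", 29, 55000.0)],
        ["name", "age", "salary"],
    )

    # Typical DataFrame operations: filter, select, and a simple aggregation
    df.filter(df.age > 30).select("name", "salary").show()
    df.groupBy().avg("salary").show()

    spark.stop()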

We also cover the latest Spark technologies, like Spark SQL, Spark Streaming, and advanced models like Gradient Boosted Trees! After you complete this course you will feel comfortable putting Spark and PySpark on your resume! This course also has a full 30-day money-back guarantee and comes with a Certificate of Completion you can add to your LinkedIn profile!

If you're ready to jump into the world of Python, Spark, and Big Data, this is the course for you!

Apache Spark Streaming with Python and PySpark

Add Spark Streaming to your Data Science and Machine Learning Python Projects

Created by Level Up Big Data Program - Big Data Experts

"]

Students: 23633, Price: $99.99

Students: 23633, Price:  Paid

What is this course about? 

This course covers all the fundamentals of Apache Spark Streaming with Python and teaches you everything you need to know about developing Spark Streaming applications using PySpark, the Python API for Spark. At the end of this course, you will have gained in-depth knowledge of Spark Streaming and general big data manipulation skills to help your company adopt Spark Streaming for building big data processing pipelines and data analytics applications. This course will be absolutely critical to anyone trying to make it in data science today.

What will you learn from this Apache Spark streaming course? 

In this Apache Spark streaming course, you'll learn the following:

  • An overview of the architecture of Apache Spark.
  • How to develop Apache Spark streaming applications with PySpark using RDD transformations and actions and Spark SQL.
  • How to work with Spark's primary abstraction, resilient distributed datasets (RDDs), to process and analyze large data sets.
  • Advanced techniques to optimize and tune Apache Spark jobs by partitioning, caching and persisting RDDs.
  • How to analyze structured and semi-structured data using Datasets and DataFrames, and develop a thorough understanding of Spark SQL.
  • How to scale up Spark Streaming applications for both bandwidth and processing speed
  • How to integrate Spark Streaming with cluster computing tools like Apache Kafka
  • How to connect your Spark Stream to a data source like Amazon Web Services (AWS) Kinesis
  • Best practices of working with Apache Spark streaming in the field.
  • Big data ecosystem overview.

Why should you learn Apache Spark streaming? 

Spark streaming is becoming incredibly popular, and with good reason. According to IBM, ninety percent of the data in the world today has been created in the last two years alone. Our current output of data is roughly 2.5 quintillion bytes per day. The world is being immersed in data, more so every day. As such, analyzing static, non-changing datasets becomes the less practical approach for more and more problems. This is where data streaming comes in: the ability to process data almost as soon as it is produced, recognizing the time-dependency of the data.

Apache Spark streaming gives us unlimited ability to build cutting-edge applications. It is also one of the most compelling technologies of the last decade in terms of its disruption to the big data world. Spark provides in-memory cluster computing which greatly boosts the speed of iterative algorithms and interactive data mining tasks.

Spark is also a powerful engine for streaming data as well as processing it, and the synergy between the two makes Spark an ideal tool for processing gargantuan data firehoses.

Tons of companies, including Fortune 500 companies, are adopting Apache Spark Streaming to extract meaning from massive data streams. Today, you have access to that same big data technology right on your desktop.

What programming language is this Apache Spark streaming course taught in? 

This Apache Spark streaming course is taught in Python. Python is currently one of the most popular programming languages in the world! Its rich data community, offering vast amounts of toolkits and features, makes it a powerful tool for data processing. Using PySpark (the Python API for Spark) you will be able to interact with Apache Spark Streaming's main abstraction, RDDs, as well as other Spark components, such as Spark SQL and much more!

Let's learn how to write Apache Spark streaming programs with PySpark Streaming to process big data sources today!
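
As a taste of what such a program can look like, here is a minimal DStream word-count sketch (the socket host and port are placeholders, not part of the course materials):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-word-count")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    # Read lines from a socket source (placeholder host/port)
    lines = ssc.socketTextStream("localhost", 9999)

    # Classic word count using RDD-style transformations on each micro-batch
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()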

30-day Money-back Guarantee!

You will get a 30-day money-back guarantee from Udemy for this Apache Spark streaming course.
If you are not satisfied, simply ask for a refund within 30 days and you will get a full refund, no questions asked.
Are you ready to take your big data analysis skills and career to the next level? Take this course now!
You will go from zero to Spark streaming hero in 4 hours.

PySpark for Data Science – Intermediate

Learn how to use Spark with Python (PySpark) to perform data analysis.

Created by Exam Turf - #1 Brand for Competitive Exam Preparation and Test Series

"]

Students: 20417, Price: $89.99

Students: 20417, Price:  Paid

This module on PySpark Tutorials aims to explain intermediate concepts, such as the use of SparkSession in later versions and of SparkConf and SparkContext in earlier versions. It will also help you understand how the Spark-related environment is set up, the concepts of broadcast variables and accumulators, and other optimization techniques such as parallelism, Tungsten, and the Catalyst optimizer. You will also be taught about various compression techniques such as Snappy and zlib. We will also cover Big Data ecosystem concepts such as HDFS and block storage, the various components of Spark such as Spark Core, MLlib, GraphX, SparkR, Streaming, and SQL, and the basics of the Python language that are relevant when using it with Apache Spark, which together make up PySpark. We will learn the following in this course:

  • Regression

  • Linear Regression

  • Output Column

  • Test Data

  • Prediction

  • Generalized Linear Regression

  • Forest Regression

  • Classification

  • Binomial Logistic Regression

  • Multinomial Logistic Regression

  • Decision Tree

  • Random Forest

  • Clustering

  • K-Means Model

PySpark is a big data solution for real-time streaming with the Python programming language, and it provides an efficient way to do all kinds of calculations and computations. It is also probably the best solution on the market, as it is interoperable: PySpark can easily be managed along with other technologies and other components of the entire pipeline. Earlier big data and Hadoop techniques relied on batch processing.

PySpark is open source, with its codebase exposed through Python, and it is used to perform data-intensive and machine learning operations. It has been widely adopted and is becoming popular in the industry, so PySpark is increasingly seen alongside, or in place of, Spark components written in Java or Scala. Note that PySpark works with DataFrames rather than the typed Dataset API, which is only available in the JVM languages. Practitioners need tools that are reliable and fast when it comes to streaming real-time data. Earlier tools such as MapReduce used the map and reduce concepts: mappers process the data, it is shuffled and sorted, and reducers combine it into a single result, which provided a way of parallel computation. PySpark, in contrast, makes use of in-memory techniques instead of writing intermediate results to disk, providing a general-purpose and faster computation engine.
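
As a rough illustration of two of the concepts mentioned above, SparkSession creation plus broadcast and accumulator variables, here is a minimal sketch (the lookup data is invented for the example):

    from pyspark.sql import SparkSession

    # In recent Spark versions the SparkSession wraps the SparkConf/SparkContext setup
    spark = SparkSession.builder.appName("intermediate-concepts").getOrCreate()
    sc = spark.sparkContext

    lookup = sc.broadcast({"IN": "India", "US": "United States"})  # shipped to executors once
    bad_codes = sc.accumulator(0)                                  # write-only counter

    def expand(code):
        if code not in lookup.value:
            bad_codes.add(1)
            return "Unknown"
        return lookup.value[code]

    print(sc.parallelize(["IN", "US", "XX"]).map(expand).collect())
    print("Unrecognised codes:", bad_codes.value)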

PySpark for Data Science – Advanced

Learn about how to use PySpark to perform data analysis, RFM analysis and Text mining

Created by Exam Turf - #1 Brand for Competitive Exam Preparation and Test Series

"]

Students: 19361, Price: $89.99

Students: 19361, Price:  Paid

This module in the PySpark tutorials section will help you learn certain advanced concepts of PySpark. In the first section of these advanced tutorials, we will be performing a Recency, Frequency, Monetary (RFM) segmentation. RFM analysis is typically used to identify outstanding customer groups, and we shall also look at K-means clustering. Next up in these PySpark tutorials is learning Text Mining and using Monte Carlo Simulation from scratch.
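
A minimal sketch of the K-means step on toy RFM-style features, assuming the recency, frequency, and monetary columns have already been computed (the numbers are invented):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("rfm-kmeans").getOrCreate()

    # Toy RFM table: recency (days), frequency (orders), monetary (spend) per customer
    rfm = spark.createDataFrame(
        [(1, 10, 5, 200.0), (2, 90, 1, 20.0), (3, 5, 12, 540.0)],
        ["customer_id", "recency", "frequency", "monetary"],
    )

    # Assemble the three RFM columns into a feature vector and cluster the customers
    features = VectorAssembler(
        inputCols=["recency", "frequency", "monetary"], outputCol="features"
    ).transform(rfm)

    model = KMeans(k=2, seed=42).fit(features)
    model.transform(features).select("customer_id", "prediction").show()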

PySpark is a big data solution for real-time streaming with the Python programming language, and it provides an efficient way to do all kinds of calculations and computations. It is also probably the best solution on the market, as it is interoperable: PySpark can easily be managed along with other technologies and other components of the entire pipeline. Earlier big data and Hadoop techniques relied on batch processing.

PySpark is open source, with its codebase exposed through Python, and it is used to perform data-intensive and machine learning operations. It has been widely adopted and is becoming popular in the industry, so PySpark is increasingly seen alongside, or in place of, Spark components written in Java or Scala. Note that PySpark works with DataFrames rather than the typed Dataset API, which is only available in the JVM languages. Practitioners need tools that are reliable and fast when it comes to streaming real-time data. Earlier tools such as MapReduce used the map and reduce concepts: mappers process the data, it is shuffled and sorted, and reducers combine it into a single result, which provided a way of parallel computation. PySpark, in contrast, makes use of in-memory techniques instead of writing intermediate results to disk, providing a general-purpose and faster computation engine.

The career benefits of these PySpark Tutorials are many. Apache Spark is among the newest technologies and possibly the best solution on the market today when it comes to real-time programming and processing. There are still very few people with sound knowledge of Apache Spark and its essentials, so the demand for these skills is huge while the supply is very limited. If you are planning to make a career in this technology, there can be no wiser decision than this. The only thing you need to keep in mind while transitioning to this technology is that it is more of a development role, so if you have good coding practice and the right mindset, these PySpark Tutorials are for you. We also have many certifications for Apache Spark which will enhance your resume.

PySpark for Data Science – Beginners

Learn basics of Apache Spark and learn to analyze Big Data for Machine Learning using Python in PySpark

Created by Exam Turf - #1 Brand for Competitive Exam Preparation and Test Series

"]

Students: 12909, Price: $89.99

Students: 12909, Price:  Paid

These PySpark Tutorials aim to explain the basics of Apache Spark and the essentials related to it. They also cover why Apache Spark is a better choice than Hadoop and the best solution when it comes to real-time processing. You will also understand the benefits and disadvantages of the languages that can be used with Spark, and read about the concept of RDDs and other very basic features and terminology used in Spark. This course is for students, professionals, and aspiring data scientists who want to get hands-on training in PySpark (Python for Apache Spark) using real-world datasets and applicable coding knowledge that you'll use every day as a data scientist.

PySpark is a big data solution for real-time streaming with the Python programming language, and it provides an efficient way to do all kinds of calculations and computations. It is also probably the best solution on the market, as it is interoperable: PySpark can easily be managed along with other technologies and other components of the entire pipeline. Earlier big data and Hadoop techniques relied on batch processing.

PySpark is open source, with its codebase exposed through Python, and it is used to perform data-intensive and machine learning operations. It has been widely adopted and is becoming popular in the industry, so PySpark is increasingly seen alongside, or in place of, Spark components written in Java or Scala. Note that PySpark works with DataFrames rather than the typed Dataset API, which is only available in the JVM languages. Practitioners need tools that are reliable and fast when it comes to streaming real-time data. Earlier tools such as MapReduce used the map and reduce concepts: mappers process the data, it is shuffled and sorted, and reducers combine it into a single result, which provided a way of parallel computation. PySpark, in contrast, makes use of in-memory techniques instead of writing intermediate results to disk, providing a general-purpose and faster computation engine.

Data Science:Hands-on Diabetes Prediction with Pyspark MLlib

Diabetes Prediction using Machine Learning in Apache Spark

Created by School of Disruptive Innovation - Creative Learning Solutions for the Digital Age

"]

Students: 11791, Price: $19.99

Students: 11791, Price:  Paid

Would you like to build, train, test and evaluate a machine learning model that is able to detect diabetes using logistic regression?

This is a Hands-on Machine Learning Course where you will practice alongside the classes. The dataset will be provided to you during the lectures. We highly recommend that for the best learning experience, you practice alongside the lectures.

You will learn more in this one hour of practice than in hundreds of hours of purely theoretical lectures.

Learn the most important aspects of Spark machine learning (Spark MLlib):

  • Pyspark fundamentals and implementing spark machine learning

  • Importing and Working with Datasets

  • Process data using a Machine Learning model using spark MLlib

  • Build and train Logistic regression model

  • Test and analyze the model

The entire course has been divided into tasks. Each task has been very carefully created and designed to give you the best learning experience. In this hands-on project, we will complete the following tasks:

  • Task 1: Project overview

  • Task 2: Intro to Colab environment & install dependencies to run spark on Colab

  • Task 3: Clone & explore the diabetes dataset

  • Task 4: Data Cleaning

  • Task 5: Correlation & feature selection

  • Task 6: Build and train Logistic Regression Model using Spark MLlib

  • Task 7: Performance evaluation & Test the model

  • Task 8: Save & load model

About Pyspark:

Pyspark is the collaboration of Apache Spark and Python. PySpark is a tool used in Big Data Analytics.

Apache Spark is an open-source cluster-computing framework built around speed, ease of use, and streaming analytics, whereas Python is a general-purpose, high-level programming language. Python provides a wide range of libraries and is widely used for Machine Learning and Real-Time Streaming Analytics.

In other words, it is a Python API for Spark that lets you harness the simplicity of Python and the power of Apache Spark in order to tame Big Data. We will be using Big data tools in this project.
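
As a compressed, hedged sketch of the core modelling step (Tasks 6 and 7 above), assuming the diabetes data has already been cleaned into numeric feature columns and a 0/1 label column (the rows below are stand-ins, not the course dataset):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("diabetes-logreg-sketch").getOrCreate()

    # Stand-in rows: (glucose, bmi, label); the real dataset is provided in the course
    data = spark.createDataFrame(
        [(148.0, 33.6, 1.0), (85.0, 26.6, 0.0), (183.0, 23.3, 1.0), (89.0, 28.1, 0.0)],
        ["glucose", "bmi", "label"],
    )
    data = VectorAssembler(inputCols=["glucose", "bmi"], outputCol="features").transform(data)

    # Build and train the logistic regression model, then score it
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(data)
    predictions = model.transform(data)
    print("AUC on the same rows (illustration only):",
          BinaryClassificationEvaluator().evaluate(predictions))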

Make a leap into Data science with this Spark MLlib project and showcase your skills on your resume.

Click on the “ENROLL NOW” button and start learning.

Happy Learning.

Apache Spark 3 – Spark Programming in Python for Beginners

Data Engineering using Spark Structured API

Created by Prashant Kumar Pandey - Architect, Author, Consultant, Trainer @ Learning Journal

"]

Students: 11298, Price: $19.99

Students: 11298, Price:  Paid

This course does not require any prior knowledge of Apache Spark or Hadoop. We have taken enough care to explain Spark Architecture and fundamental concepts to help you come up to speed and grasp the content of this course.

About the Course

I am creating the Apache Spark 3 - Spark Programming in Python for Beginners course to help you understand Spark programming and apply that knowledge to build data engineering solutions. This course is example-driven and follows a working-session-like approach. We take a live coding approach and explain all the needed concepts along the way.

Who should take this Course?

I designed this course for software engineers willing to develop a Data Engineering pipeline and application using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organization's data-centric infrastructure. Another group is the managers and architects who do not directly work on the Spark implementation, but who work with the people who implement Apache Spark at the ground level.

Spark Version used in the Course

This course uses Apache Spark 3.x. I have tested all the source code and examples used in this course on the Apache Spark 3.0.0 open-source distribution.

CCA 175 – Spark and Hadoop Developer – Python (pyspark)

Cloudera Certified Associate Spark and Hadoop Developer using Python as Programming Language

Created by Durga Viswanatha Raju Gadiraju - Technology Adviser and Evangelist

"]

Students: 9717, Price: $24.99

Students: 9717, Price:  Paid

CCA 175 Spark and Hadoop Developer is one of the well recognized Big Data certifications. This scenario-based certification exam demands basic programming using Python or Scala along with Spark and other Big Data technologies.

This comprehensive course covers all aspects of the certification using Python as a programming language.

  • Python Fundamentals

  • Spark SQL and Data Frames

  • File formats

Please note that the syllabus has recently changed, and the exam is now primarily focused on Spark DataFrames and/or Spark SQL.
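
Since DataFrames and Spark SQL are the heart of the exam, here is a minimal sketch of the same aggregation written both ways (the table and column names are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-practice").getOrCreate()

    orders = spark.createDataFrame(
        [(1, "COMPLETE", 49.99), (2, "CLOSED", 19.99), (3, "COMPLETE", 5.49)],
        ["order_id", "status", "amount"],
    )

    # DataFrame API version
    orders.groupBy("status").sum("amount").show()

    # Spark SQL version over a temporary view
    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT status, SUM(amount) AS total FROM orders GROUP BY status").show()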

Exercises will be provided so you can prepare before attempting the certification. The intention of the course is to boost your confidence going into the exam.

All the demos are given on our state-of-the-art Big Data cluster. You can avail of one week of complimentary lab access by filling out the form provided as part of the welcome message.

Data Science with Python Course : Hands-on Data Science 2021

Numpy, Pandas, Matplotlib, Scikit-Learn, WebScraping, Data Science, Machine Learning, Pyspark, statistics, Data Science

Created by Ankit Mistry - Software Developer | I want to Improve your life & Income.

"]

Students: 7701, Price: $89.99

Students: 7701, Price:  Paid

Welcome to Complete Ultimate course guide on Data Science and Machine learning with Python.

Have you ever thought about

How Amazon gives you product recommendations,

How Netflix and YouTube decide which movie or video you should watch next,

How Google Translate translates one language into another,

How Google knows what is in your photos,

How Android speech recognition or Apple Siri understands your speech with such high accuracy.

If you would like to know the algorithms and technology running behind all of that, this is the first course to get you started in this direction.

==============================================

This course has more than 100 five-star ratings.

What previous students have said: 

"This is a truly great course! It covers far more than it's written in its name: many data science libraries, frameworks, techniques, tips, starting from basics to advanced level topics. Thanks a lot!  "

"This course has taught me many things I wanted to know about pandas. It covers everything since the installation steps, so it is very good for anyone willing to learn about data analysis in python /jupyter environment."

"learning valuable concepts and feeling great.Thanks for this course."

"Good explanation, I have laready used two online tutorials on data -science and this one is more step by step, but it is good"

"i have studied python from other sources as well but here i found it more basic and easy to grab especially for the beginners. I can say its best course till now . it can be improved by including some more examples and real life data but overall i would suggest every beginner to have this course."

"The instructor is so good, he helps you in all doubts within an average replying time of one hour. The content of the course and the way he delivers is great."

==================================================

Why Data Science Now?

Data Scientist: The Sexiest Job of the 21st Century - By Harvard Business review

There is a huge shortage of data scientists that the software industry is currently facing.

The average data scientist today earns $130,000 a year, according to Glassdoor.

Want to join me on your journey towards becoming a Data Scientist or Machine Learning Engineer?

This course has more than 100 HD-quality video lectures and over 13 hours of content.

This is an introductory course to get started with data analysis and machine learning, and to move towards AI algorithm implementation.

This course will teach you all the basic Python libraries required for the data analysis process:

  • Python  crash course

  • Numerical Python - Numpy

  • Pandas - data analysis

  • Matplotlib for data visualization

  • Plotly and Business intelligence tool Tableau

  • Importing data in Python from different sources like .csv, .tsv, .json, .html, and web REST APIs such as the Facebook API

  • Data Pre-Processing like normalization, train test split, Handling missing data 

  • Web Scraping with python BeautifulSoup - extract  value from structured HTML Data

  • Exploratory data analysis on pima Indian diabetes dataset

  • Visualization of Pima Indian diabetes dataset

  • Data transformation and Scaling Data -  Rescale Data, Standardize Data, Binarize Data, normalise data

  • Basic introduction to what Machine Learning is, an overview of Scikit-learn, its types, and a comparison with traditional systems; supervised learning vs unsupervised learning

  • Understanding of regression, classification and clustering

  • Feature selection and feature elimination technique.

  • And Many Machine learning algorithm yet to come. 

  • Data Science Prerequisite : Basics of Probability and statistics

  • Setup Data Science and Machine learning lab in Microsoft Azure Cloud

This course is for beginners and experienced programmers who want to make a career in Data Science, Machine Learning, and AI.

Prerequisite:

  • Basic knowledge of Python programming (covered in the Python crash course)

  • High School mathematics

Enroll in this course, take a look at its brief curriculum, and take your first step into the wonderful world of data.

See you in the field.

Sincerely,

Ankit Mistry

A Big Data Hadoop and Spark project for absolute beginners

Data Engineering, Spark, Hive, Python, PySpark, Scala, Coding framework, Testing, IntelliJ, Maven, Glue, Streaming,

Created by FutureX Skill - Big Data, Cloud and AI Solution Architects

"]

Students: 6798, Price: $29.99

Students: 6798, Price:  Paid

This course will prepare you for a real-world Data Engineer role!

Get started with Big Data quickly, leveraging a free cloud cluster and solving a real-world use case! Learn Hadoop, Hive, and Spark (both Python and Scala) from scratch!

Learn to code Spark Scala and PySpark like a real-world developer. Understand real-world coding best practices, logging, error handling, and configuration management using both Scala and Python.

Project

A bank is launching a new credit card and wants to identify prospects it can target in its marketing campaign.

It has received prospect data from various internal and 3rd party sources. The data has various issues such as missing or unknown values in certain fields. The data needs to be cleansed before any kind of analysis can be done.

Since the data is in huge volume with billions of records, the bank has asked you to use Big Data Hadoop and Spark technology to cleanse, transform and analyze this data.
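
To make the cleansing idea concrete, here is a minimal hedged sketch of the kind of PySpark operations involved (the column names and the "unknown" marker are assumptions for illustration, not the actual course dataset):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("prospect-cleansing-sketch").getOrCreate()

    # Hypothetical prospect records with missing and unknown fields
    prospects = spark.createDataFrame(
        [("Alice", "unknown", 52000.0), ("Bob", "NY", None), ("Cara", "CA", 71000.0)],
        ["name", "state", "income"],
    )

    cleansed = (prospects
                .replace("unknown", None, subset=["state"])  # normalise the unknown marker
                .fillna({"income": 0.0})                     # impute missing income
                .dropna(subset=["state"]))                   # drop rows we cannot target

    cleansed.show()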

What you will learn :

  • Big Data, Hadoop concepts

  • How to create a free Hadoop and Spark cluster using Google Dataproc

  • Hadoop hands-on - HDFS, Hive

  • Python basics

  • PySpark RDD - hands-on

  • PySpark SQL, DataFrame - hands-on

  • Project work using PySpark and Hive

  • Scala basics

  • Spark Scala DataFrame

  • Project work using Spark Scala

  • Spark Scala Real world coding framework and development using Winutil, Maven and IntelliJ.

  • Python Spark Hadoop Hive coding framework and development using PyCharm

  • Building a data pipeline using Hive , PostgreSQL, Spark

  • Logging , error handling and unit testing of PySpark and Spark Scala applications

  • Spark Scala Structured Streaming

  • Applying spark transformation on data stored in AWS S3 using Glue and viewing data using Athena

Prerequisites :

  • Some basic programming skills

  • Some knowledge of SQL queries

From 0 to 1 : Spark for Data Science with Python

Get your data to fly using Spark for analytics, machine learning and data science​

Created by Loony Corn - An ex-Google, Stanford and Flipkart team

"]

Students: 6571, Price: $89.99

Students: 6571, Price:  Paid

Taught by a 4-person team including 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience in working with Java and with billions of rows of data.

Get your data to fly using Spark for analytics, machine learning and data science 

Let’s parse that.

What's Spark? If you are an analyst or a data scientist, you're used to having multiple systems for working with data. SQL, Python, R, Java, etc. With Spark, you have a single engine where you can explore and play with large amounts of data, run machine learning algorithms and then use the same system to productionize your code.

Analytics: Using Spark and Python you can analyze and explore your data in an interactive environment with fast feedback. The course will show how to leverage the power of RDDs and Dataframes to manipulate data with ease. 

Machine Learning and Data Science : Spark's core functionality and built-in libraries make it easy to implement complex algorithms like Recommendations with very few lines of code. We'll cover a variety of datasets and algorithms including PageRank, MapReduce and Graph datasets. 
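
For instance, a recommendation model of the kind described above can be sketched in a few lines with Spark's ALS implementation (the ratings below are toy values, not the Audioscrobbler data):

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("als-sketch").getOrCreate()

    # Toy (user, item, rating) triples standing in for real listening data
    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 1.0)],
        ["userId", "itemId", "rating"],
    )

    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating", rank=5, maxIter=5)
    model = als.fit(ratings)
    model.recommendForAllUsers(2).show(truncate=False)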

What's Covered:

Lots of cool stuff ...

  • Music Recommendations using Alternating Least Squares and the Audioscrobbler dataset
  • Dataframes and Spark SQL to work with Twitter data
  • Using the PageRank algorithm with Google web graph dataset
  • Using Spark Streaming for stream processing 
  • Working with graph data using the  Marvel Social network dataset 



.. and of course all the Spark basic and advanced features: 

  • Resilient Distributed Datasets, Transformations (map, filter, flatMap), Actions (reduce, aggregate) 
  • Pair RDDs , reduceByKey, combineByKey 
  • Broadcast and Accumulator variables 
  • Spark for MapReduce 
  • The Java API for Spark 
  • Spark SQL, Spark Streaming, MLlib and GraphFrames (GraphX for Python) 

A Crash Course In PySpark

Learn all the fundamentals of PySpark

Created by Kieran Keene - Data Engineer at Kodey

"]

Students: 5688, Price: $19.99

Students: 5688, Price:  Paid

Spark is one of the most in-demand Big Data processing frameworks right now.

This course will take you through the core concepts of PySpark. We will work to enable you to do most of the things you’d do in SQL or Python Pandas library, that is:

  • Getting hold of data

  • Handling missing data and cleaning data up

  • Aggregating your data

  • Filtering it

  • Pivoting it

  • And Writing it back

All of these things will enable you to leverage Spark on large datasets and start getting value from your data.
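
A minimal, hedged sketch of those steps in one place (the sales data and output path are invented for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pyspark-crash-course-sketch").getOrCreate()

    # Hypothetical sales data with a missing value to clean up
    df = spark.createDataFrame(
        [("2021-01", "North", 100.0), ("2021-01", "South", None), ("2021-02", "North", 80.0)],
        ["month", "region", "sales"],
    )

    cleaned = df.fillna({"sales": 0.0})                              # handle missing data
    filtered = cleaned.filter(F.col("sales") > 0)                    # filter it
    aggregated = filtered.groupBy("region").agg(F.sum("sales"))      # aggregate it
    pivoted = cleaned.groupBy("month").pivot("region").sum("sales")  # pivot it

    pivoted.write.mode("overwrite").parquet("/tmp/sales_pivot")      # write it back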

Let’s get started.

Apache Spark 3 for Data Engineering & Analytics with Python

Learn how to use Python and PySpark 3.0.1 for Data Engineering / Analytics (Databricks) - Beginner to Ninja

Created by David Charles Academy - Senior Big Data Engineer / Consultant at ABN AMRO

"]

Students: 5333, Price: $19.99

Students: 5333, Price:  Paid

The key objectives of this course are as follows;

  • Learn the Spark Architecture

  • Learn Spark Execution Concepts

  • Learn Spark Transformations and Actions using the Structured API

  • Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API

  • Learn how to set up your own local PySpark Environment

  • Learn how to interpret the Spark Web UI

  • Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution

  • Learn the RDD (Resilient Distributed Datasets) API (Crash Course)

    • RDD Transformations

    • RDD Actions

  • Learn the Spark DataFrame API  (Structured APIs)

    • Create Schemas and Assign DataTypes

    • Read and Write Data using the DataFrame Reader and Writer

    • Read Semi-Structured Data such as JSON

    • Create New Data Columns in the DataFrame using Expressions

    • Filter the DataFrame using the "Filter" and "Where" Transformations

    • Ensure that the DataFrame has unique rows

    • Detect and Drop Duplicates

    • Augment the DataFrame by Adding New Rows

    • Combine 2 or More DataFrames

    • Order the DataFrame by Specific Columns

    • Rename and Drop Columns from the DataFrame

    • Clean the DataFrame by detecting and Removing Missing or Bad Data

    • Create  User-Defined Spark Functions

    • Read and Write to/from Parquet File

    • Partition the DataFrame and Write to Parquet File

    • Aggregate the DataFrame using Spark SQL functions (count, countDistinct, Max, Min, Sum, SumDistinct, AVG)

    • Perform Aggregations with Grouping

  • Learn Spark SQL and Databricks

    • Create a Databricks Account

    • Create a Databricks Cluster

    • Create Databricks SQL and Python Notebooks

    • Learn Databricks shortcuts

    • Create Databases and Tables using Spark SQL

    • Use DML, DQL, and DDL with Spark SQL

    • Use Spark SQL Functions

    • Learn the differences between Managed and Unmanaged Tables

    • Read CSV Files from the Databricks File System

    • Learn to write Complex SQL

    • Use Spark SQL Functions

    • Create Visualisations with Databricks

    • Create a Databricks Dashboard

The Python Spark project that we are going to do together;

Sales Data

  • Create a Spark Session

  • Read a CSV file into a Spark Dataframe

  • Learn to Infer a Schema

  • Select data from the Spark Dataframe

  • Produce analytics that shows the topmost sales orders per Region and Country

Convert Fahrenheit to Degrees Centigrade

  • Create a Spark Session

  • Read and Parallelize data using the Spark Context into an RDD

  • Create a Function to Convert Fahrenheit to Degrees Centigrade

  • Use the Map Function to convert data contained within an RDD

  • Filter temperatures greater than or equal to 13 degrees Celsius (a minimal sketch of this project follows)
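
A rough sketch of that project with the RDD API (the temperature readings are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fahrenheit-to-celsius").getOrCreate()
    sc = spark.sparkContext

    def to_celsius(fahrenheit):
        return (fahrenheit - 32) * 5.0 / 9.0

    # Parallelize some sample readings, convert them, then keep those >= 13 degrees Celsius
    readings = sc.parallelize([32.0, 59.0, 77.0, 50.0])
    warm = readings.map(to_celsius).filter(lambda c: c >= 13).collect()
    print(warm)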

XYZ Research

  • Create a set of RDDs that hold Research Data

  • Use the union transformation to combine RDDs

  • Learn to use the subtract transformation to minus values from an RDD

  • Use the RDD API to answer the following questions

    • How many research projects were initiated in the first three years?

    • How many projects were completed in the first year?

    • How many projects were completed in the first two years?

Sales Analytics

  • Create the Sales Analytics DataFrame from a set of CSV Files

  • Prepare the DataFrame by applying a Structure

  • Remove bad records from the DataFrame (Cleaning)

  • Generate New Columns from the DataFrame

  • Write a Partitioned DataFrame to a Parquet Directory

  • Answer the following questions and create visualizations using Seaborn and Matplotlib

    • What was the best month in sales?

    • What city sold the most products?

    • What time should the business display advertisements to maximize the likelihood of customers buying products?

    • What products are often sold together in the state "NY"?

Technology Spec

  1. Python

  2. Jupyter Notebook

  3. Jupyter Lab

  4. PySpark (Spark with Python)

  5. Pandas

  6. Matplotlib

  7. Seaborn

  8. Databricks

  9. SQL

Big Data with Apache Spark PySpark: Hands on PySpark, Python

Learn to analyse batch, streaming data with Data Frame of Apache Spark Python and PySpark

Created by Ankit Mistry - Software Developer | I want to Improve your life & Income.

"]

Students: 3683, Price: $19.99

Students: 3683, Price:  Paid

Welcome to the  Apache Spark : PySpark Course.

Have you ever thought about how big companies like Google, Microsoft, Facebook, Apple, or Amazon process petabytes of data on thousands of machines?

This course is a starting point for learning about Apache Spark, the in-memory big data analysis tool.

==============================================

What previous students have said: 

"Very good introduction. Ideal for beginners to obtain a big picture as a starting point. The course should be further developed and supplemented with further practical examples. But overall I would highly recommend."     

"I like the pace at which the instructor is going. I like the fact that he quickly dives into the practical. For me, this helps to put subsequent learning into perspective. He tends to have quite a few typos, but I can overlook those and still give him a 5 star rating. I am still quite early in the. Hope to update my review as I go along."

"Great course, knowledgeable author."

"Curso excelente para quem deseja aprender sobre Big Data e Spache Spark com PySpark."

==================================================

Apache Spark can perform up to 100x faster than the Hadoop MapReduce data processing framework, which makes Apache Spark one of the most in-demand skills.

Top companies like Google, Facebook, Microsoft, Amazon, and Airbnb are using Apache Spark to solve their big data problems! Analysis of huge amounts of data is one of the most valuable skills nowadays, and this course will teach you the skills you need to compete in the big data job market.

This course will teach:

  • Introduction to big data and Apache Spark

  • Getting started with Databricks

  • Detailed installation steps on an Ubuntu Linux machine

  • Python refresher for newcomers

  • Apache Spark DataFrame API

  • Apache Spark Structured Streaming with an end-to-end example

  • Basics of Machine Learning and feature engineering with Apache Spark.

This course is still a work in progress; new content related to Spark ML will be added.

Note: This course teaches only the Spark 2.0 DataFrame-based API, not the RDD-based API, as the DataFrame-based API is the future of Spark.
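
To illustrate the DataFrame-based streaming covered in the course, here is a minimal Structured Streaming sketch using Spark's built-in rate source (purely a demo source, not course material):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

    # The built-in "rate" source generates rows continuously, which is handy for demos
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    query = (stream.groupBy().count()
                   .writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())

    query.awaitTermination(30)  # let it run for about 30 seconds
    query.stop()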

Regards

Ankit Mistry

PySpark Essentials for Data Scientists (Big Data + Python)

Learn how to wrangle Big Data for Machine Learning using Python in PySpark taught by an industry expert!

Created by Layla AI - Seasoned Data Scientist Consultant & Passionate Instructor

"]

Students: 2964, Price: $94.99

Students: 2964, Price:  Paid

This course is for data scientists (or aspiring data scientists) who want to get PRACTICAL training in PySpark (Python for Apache Spark) using REAL WORLD datasets and APPLICABLE coding knowledge that you'll use every day as a data scientist! By enrolling in this course, you'll gain access to over 100 lectures, hundreds of example problems and quizzes, and over 100,000 lines of code!

I’m going to provide the essentials for what you need to know to be an expert in Pyspark by the end of this course, that I’ve designed based on my EXTENSIVE experience consulting as a data scientist for clients like the IRS, the US Department of Labor and United States Veterans Affairs.

I’ve structured the lectures and coding exercises for real world application, so you can understand how PySpark is actually used on the job. We are also going to dive into my custom functions that I wrote MYSELF to get you up and running in the MLlib API fast and make getting started building machine learning models a breeze! We will also touch on MLflow which will help us manage and track our model training and evaluation process in a custom user interface that will make you even more competitive on the job market!
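
As a hedged sketch of what tracking an MLlib model run with MLflow can look like (these are not the instructor's own functions; the toy data and parameters are invented):

    import mlflow
    import mlflow.spark
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mlflow-mllib-sketch").getOrCreate()

    train = spark.createDataFrame(
        [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 0.5, 1.0), (0.5, 3.0, 0.0)],
        ["f1", "f2", "label"],
    )
    train = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(train)

    # Log the parameter, a training metric, and the fitted model to an MLflow run
    with mlflow.start_run():
        model = LogisticRegression(maxIter=10).fit(train)
        mlflow.log_param("maxIter", 10)
        mlflow.log_metric("trainingAccuracy", model.summary.accuracy)
        mlflow.spark.log_model(model, "model")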

Each section will have a concept review lecture as well as code along activities structured problem sets for you to work through to help you put what you have learned into action, as well as the solutions to each problem in case you get stuck. Additionally, real world consulting projects have been provided in every section with AUTHENTIC datasets to help you think through how to apply each of the concepts we have covered.

Lastly, I’ve written up some condensed review notebooks and handouts of all the course content to make it super easy for you to reference later on. This will be super helpful once you land your first job programming in PySpark!

I can’t wait to see you in the lectures! And I really hope you enjoy the course! I’ll see you in the first lecture!

Apache PySpark Fundamentals

Learn PySpark, fundamentals of Apache Spark with Python

Created by Johnny F. - Programmer

"]

Students: 956, Price: $19.99

Students: 956, Price:  Paid

PySpark is the collaboration of Apache Spark and Python. This course covers all the fundamentals of Apache Spark with Python and teaches you everything you need to know about developing Spark applications using PySpark, the Python API for Spark. At the end of this course, you will gain in-depth knowledge about Apache Spark and general big data analysis.

This course helps you get comfortable with PySpark, explaining what it has to offer and how it can enhance your data science work. We'll first get into the Spark ecosystem, detailing its advantages over other data science platforms, APIs, and tool sets.

Next, we'll look at the DataFrame API and how it's the platform's answer to many big data challenges. We'll also go over Resilient Distributed Datasets (RDDs), the building blocks of Spark.

PySpark & AWS: Master Big Data With PySpark and AWS

Learn how to use Spark, Pyspark AWS, Spark applications, Spark EcoSystem, Hadoop and Mastering PySpark

Created by AI Sciences - AI Experts & Data Scientists |4+ Rated | 160+ Countries

"]

Students: 932, Price: $89.99

Students: 932, Price:  Paid

Comprehensive Course Description:

The hottest buzzwords in the Big Data analytics industry are Python and Apache Spark. PySpark supports the collaboration of Python and Apache Spark. In this course, you’ll start right from the basics and proceed to the advanced levels of data analysis. From cleaning data to building features and implementing machine learning (ML) models, you’ll learn how to execute end-to-end workflows using PySpark.

Right through the course, you’ll be using PySpark for performing data analysis. You’ll explore Spark RDDs, Dataframes, and a bit of Spark SQL queries. Also, you’ll explore the transformations and actions that can be performed on the data using Spark RDDs and dataframes. You’ll also explore the ecosystem of Spark and Hadoop and their underlying architecture. You’ll use the Databricks environment for running the Spark scripts and explore it as well.
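
A minimal sketch contrasting the RDD and DataFrame styles mentioned above (toy data and invented names):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
    sc = spark.sparkContext

    # The same aggregation, first with RDD transformations and an action...
    rdd = sc.parallelize([("math", 80), ("math", 90), ("physics", 70)])
    print(rdd.reduceByKey(lambda a, b: a + b).collect())

    # ...then with the DataFrame API and a Spark SQL query
    df = rdd.toDF(["course", "marks"])
    df.groupBy("course").sum("marks").show()

    df.createOrReplaceTempView("enrollments")
    spark.sql("SELECT course, SUM(marks) AS total FROM enrollments GROUP BY course").show()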

Finally, you’ll have a taste of Spark with AWS cloud. You’ll see how we can leverage AWS storages, databases, computations, and how Spark can communicate with different AWS services and get its required data.

How Is This Course Different?

In this Learning by Doing course, every theoretical explanation is followed by practical implementation.

The course ‘PySpark & AWS: Master Big Data With PySpark and AWS’ is crafted to reflect the most in-demand workplace skills. This course will help you understand all the essential concepts and methodologies with regards to PySpark. The course is:

• Easy to understand.

• Expressive.

• Exhaustive.

• Practical with live coding.

• Rich with the state of the art and latest knowledge of this field.

As this course is a detailed compilation of all the basics, it will motivate you to make quick progress and experience much more than what you have learned. At the end of each concept, you will be assigned Homework/tasks/activities/quizzes along with solutions. This is to evaluate and promote your learning based on the previous concepts and methods you have learned. Most of these activities will be coding-based, as the aim is to get you up and running with implementations.

High-quality video content, in-depth course material, evaluating questions, detailed course notes, and informative handouts are some of the perks of this course. You can approach our friendly team in case of any course-related queries, and we assure you of a fast response.

The course tutorials are divided into 140+ brief videos. You’ll learn the concepts and methodologies of PySpark and AWS along with a lot of practical implementation. The total runtime of the HD videos is around 16 hours.

Why Should You Learn PySpark and AWS?

PySpark is the Python library that makes the magic happen.

PySpark is worth learning because of the huge demand for Spark professionals and the high salaries they command. The usage of PySpark in Big Data processing is increasing at a rapid pace compared to other Big Data tools.

AWS, launched in 2006, is the fastest-growing public cloud. The right time to cash in on cloud computing skills—AWS skills, to be precise—is now.

Course Content:

The all-inclusive course consists of the following topics:

1. Introduction:

a. Why Big Data?

b. Applications of PySpark

c. Introduction to the Instructor

d. Introduction to the Course

e. Projects Overview

2. Introduction to Hadoop, Spark EcoSystems, and Architectures:

a. Hadoop EcoSystem

b. Spark EcoSystem

c. Hadoop Architecture

d. Spark Architecture

e. PySpark Databricks setup

f. PySpark local setup

3. Spark RDDs:

a. Introduction to PySpark RDDs

b. Understanding underlying Partitions

c. RDD transformations

d. RDD actions

e. Creating Spark RDD

f. Running Spark Code Locally

g. RDD Map (Lambda)

h. RDD Map (Simple Function)

i. RDD FlatMap

j. RDD Filter

k. RDD Distinct

l. RDD GroupByKey

m. RDD ReduceByKey

n. RDD (Count and CountByValue)

o. RDD (saveAsTextFile)

p. RDD (Partition)

q. Finding Average

r. Finding Min and Max

s. Mini project on student data set analysis

t. Total Marks by Male and Female Student

u. Total Passed and Failed Students

v. Total Enrollments per Course

w. Total Marks per Course

x. Average marks per Course

y. Finding Minimum and Maximum marks

z. Average Age of Male and Female Students

4. Spark DFs:

a. Introduction to PySpark DFs

b. Understanding underlying RDDs

c. DFs transformations

d. DFs actions

e. Creating Spark DFs

f. Spark Infer Schema

g. Spark Provide Schema

h. Create DF from RDD

i. Select DF Columns

j. Spark DF with Column

k. Spark DF with Column Renamed and Alias

l. Spark DF Filter rows

m. Spark DF (Count, Distinct, Duplicate)

n. Spark DF (sort, order By)

o. Spark DF (Group By)

p. Spark DF (UDFs)

q. Spark DF (DF to RDD)

r. Spark DF (Spark SQL)

s. Spark DF (Write DF)

t. Mini project on Employees data set analysis

u. Project Overview

v. Project (Count and Select)

w. Project (Group By)

x. Project (Group By, Aggregations, and Order By)

y. Project (Filtering)

z. Project (UDF and With Column)

aa. Project (Write)

5. Collaborative filtering:

a. Understanding collaborative filtering

b. Developing recommendation system using ALS model

c. Utility Matrix

d. Explicit and Implicit Ratings

e. Expected Results

f. Dataset

g. Joining Dataframes

h. Train and Test Data

i. ALS model

j. Hyperparameter tuning and cross-validation

k. Best model and evaluate predictions

l. Recommendations

6. Spark Streaming:

a. Understanding the difference between batch and streaming analysis.

b. Hands-on with spark streaming through word count example

c. Spark Streaming with RDD

d. Spark Streaming Context

e. Spark Streaming Reading Data

f. Spark Streaming Cluster Restart

g. Spark Streaming RDD Transformations

h. Spark Streaming DF

i. Spark Streaming Display

j. Spark Streaming DF Aggregations

7. ETL Pipeline

a. Understanding the ETL

b. ETL pipeline Flow

c. Data set

d. Extracting Data

e. Transforming Data

f. Loading data (Creating RDS)

g. Load data (Creating RDS)

h. RDS Networking

i. Downloading Postgres

j. Installing Postgres

k. Connect to RDS through PgAdmin

l. Loading Data

8. Project – Change Data Capture / Replication On Going

a. Introduction to Project

b. Project Architecture

c. Creating RDS MySql Instance

d. Creating S3 Bucket

e. Creating DMS Source Endpoint

f. Creating DMS Destination Endpoint

g. Creating DMS Instance

h. MySql WorkBench

i. Connecting with RDS and Dumping Data

j. Querying RDS

k. DMS Full Load

l. DMS Replication Ongoing

m. Stopping Instances

n. Glue Job (Full Load)

o. Glue Job (Change Capture)

p. Glue Job (CDC)

q. Creating Lambda Function and Adding Trigger

r. Checking Trigger

s. Getting S3 file name in Lambda

t. Creating Glue Job

u. Adding Invoke for Glue Job

v. Testing Invoke

w. Writing Glue Shell Job

x. Full Load Pipeline

y. Change Data Capture Pipeline

After the successful completion of this course, you will be able to:

● Relate the concepts and practicals of Spark and AWS with real-world problems.

● Implement any project that requires PySpark knowledge from scratch.

● Know the theory and practical aspects of PySpark and AWS.

Who this course is for:

● People who are beginners and know absolutely nothing about PySpark and AWS.

● People who want to develop intelligent solutions.

● People who want to learn PySpark and AWS.

● People who love to learn the theoretical concepts first before implementing them using Python.

● People who want to learn PySpark along with its implementation in realistic projects.

● Big Data Scientists.

● Big Data Engineers.

Learning PySpark

Building and deploying data-intensive applications at scale using Python and Apache Spark

Created by Packt Publishing - Tech Knowledge in Motion

"]

Students: 487, Price: $89.99

Students: 487, Price:  Paid

Apache Spark is an open-source distributed engine for querying and processing data. In this tutorial, we provide a brief overview of Spark and its stack. This tutorial presents effective, time-saving techniques on how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark.

You'll learn about different techniques for collecting data, and distinguish between (and understand) techniques for processing data. Next, we provide an in-depth review of RDDs and contrast them with DataFrames. We provide examples of how to read data from files and from HDFS and how to specify schemas using reflection or programmatically (in the case of DataFrames). The concept of lazy execution is described and we outline various transformations and actions specific to RDDs and DataFrames.
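
For example, specifying a schema programmatically (rather than relying on reflection) looks roughly like this; the file path and columns are invented:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("schema-example").getOrCreate()

    # Programmatic schema definition for a hypothetical CSV file
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])

    # Lazy execution: reading and filtering only build a plan; show() triggers the work
    people = spark.read.csv("/tmp/people.csv", schema=schema, header=True)
    people.filter(people.age > 21).show()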

Finally, we show you how to use SQL to interact with DataFrames. By the end of this tutorial, you will have learned how to process data using Spark DataFrames and mastered data collection techniques by distributed data processing.

About the Author

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 12 years' international experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting.

Tomasz started his career in 2003 with LOT Polish Airlines in Warsaw, Poland while finishing his Master's degree in strategy management. In 2007, he moved to Sydney to pursue a doctoral degree in operations research at the University of New South Wales, School of Aviation; his research crossed boundaries between discrete choice modeling and airline operations research. During his time in Sydney, he worked as a Data Analyst for Beyond Analysis Australia and as a Senior Data Analyst/Data Scientist for Vodafone Hutchison Australia among others. He has also published scientific papers, attended international conferences, and served as a reviewer for scientific journals.

In 2015 he relocated to Seattle to begin his work for Microsoft. While there, he has worked on numerous projects involving solving problems in high-dimensional feature space.

Databricks Certified Developer for Spark 3.0 Practice Exams

Databricks Associate Certification Practice Questions (PySpark/Python), Tests + Detailed Explanations + Exam Tips&Tricks

Created by Florian Roscheck | Databricks Certified Associate Developer - Sr. Data Scientist, Python Expert, Passionate Instructor

"]

Students: 469, Price: $29.99

Students: 469, Price:  Paid

If you have been looking for a comprehensive set of realistic, high-quality questions to practice for the Databricks Certified Developer for Apache Spark 3.0 exam in Python, look no further!

These up-to-date practice exams provide you with the knowledge and confidence you need to pass the exam with excellence. All 180 questions have been written from scratch, based on the actual distribution of topics and tone in the real exam. The questions cover all themes being tested for in the exam, including specifics to Python and Apache Spark 3.0.

Most questions come with detailed explanations, giving you a chance to learn from your mistakes, and include links to the Spark documentation and expert web content, helping you to understand how Spark works even better.

These practice exams come with valuable exam tips & tricks and code snippets that you can execute for free on the Databricks Community Edition. These supplemental materials will help you understand the many tricky details of the exam and the Spark syntax, giving you the knowledge and confidence you need to be a top performer in the real exam!

SAMPLE QUESTION

Curious about what a high-quality question looks like? Here is an example from the DataFrame API section of the practice exams!

Question:

Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?

1. transactionsDf.take(1000).distinct()

2. transactionsDf.sample(False, 0.5)

3. transactionsDf.take(1000)

4. transactionsDf.sample(True, 0.5)

5. transactionsDf.sample(True, 0.5, force=True)

Correct Answer:

4. transactionsDf.sample(True, 0.5)

Explanation:

To solve this question, you need to know that "DataFrame.sample()" is not guaranteed to return the exact fraction of the number of rows specified as an argument. Furthermore, since duplicates may be returned, you should understand that the operator's "withReplacement" argument should be set to "True". A "force=" argument for the operator does not exist.

While the "take" argument returns an exact number of rows, it will just take the first specified number of rows ("1000" in this question) from the DataFrame. Since the DataFrame does not include duplicate rows, there is no potential of any of those returned rows being duplicates when using "take()", so the correct answer cannot involve "take()".

More info: [Link to the Spark documentation for DataFrame.sample(), available in the practice exams once purchased]
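
If you want to try the difference yourself, a small sketch along these lines runs on the free Databricks Community Edition or any local Spark installation (the DataFrame here is a stand-in for transactionsDf):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sample-vs-take").getOrCreate()

    # A stand-in for the 2000-row transactionsDf from the question (unique rows only)
    transactionsDf = spark.range(2000).withColumnRenamed("id", "transactionId")

    # Roughly 1000 rows, possibly containing duplicates (withReplacement=True, fraction=0.5)
    print(transactionsDf.sample(True, 0.5).count())

    # take() returns exactly the first 1000 rows as a list, never duplicates
    print(len(transactionsDf.take(1000)))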

COURSE CONTENT

The practice exams cover the following topics:

Spark Architecture: Conceptual understanding (ca. 17 %): Spark driver, execution hierarchy, DAGs, execution modes, deployment modes, memory management, cluster configurations, fault tolerance, partitioning, narrow vs. wide transformations, executors, Python vs. Scala, Spark vs. Hadoop

Spark Architecture: Applied understanding (ca. 11%): Memory management, configurations, lazy evaluation, action vs. transformation, shuffles, broadcasting, fault tolerance, accumulators, adaptive query execution, Spark UI, partitioning

Spark DataFrame API Applications (ca. 72%): Selecting/dropping columns, renaming columns, aggregating rows, filtering DataFrames, different types of joins, partitioning/coalescing, reading and writing DataFrames in different formats, string functions, math functions, UDFs, Spark configurations, caching, collect/take

All questions are original, high-quality questions, not anything like Databricks Spark certification dumps.

LET'S GET YOU CERTIFIED!

Ready to pass your Databricks Certified Associate Developer for Apache Spark 3.0 exam? Click “Buy now” and immediately get started with these benefits:

  • Get 3 practice exams with 180 high-quality questions in total, mimicking the original exam

  • Take the exams as many times as you would like

  • Get support from the instructor if you have questions

  • Dive in deeper with the detailed explanations and links to additional resources for most questions

  • Access the exams anywhere, anytime on your desktop, tablet, or mobile device through the Udemy app

  • 30-days money back guarantee if you are not satisfied

I am excited to have you as a student and to see you pass the exam, taking your next career step as a Databricks Certified Associate Developer for Apache Spark 3.0!

Disclaimer: Neither this course nor the certification are endorsed by the Apache Software Foundation. The "Spark", "Apache Spark" and the Spark logo are trademarks of the Apache Software Foundation.

Apache Spark 3 – Databricks Certification Practice (PySpark)

PySpark - Databricks Certified Associate Developer for Apache Spark 3.0

Created by Learning Journal - Online Training Company

"]

Students: 369, Price: $19.99

Students: 369, Price:  Paid

This course brings you FOUR (240 questions) high-quality practice tests in PySpark

Each practice set will help you test yourself and improve your knowledge for Databricks Certified Associate Developer for the Apache Spark 3.0 exam.

About the Certification

The Databricks Certified Associate Developer for Apache Spark 3.0 certification exam assesses the understanding of the Spark DataFrame API and the ability to apply the Spark DataFrame API to complete basic data manipulation tasks within a Spark session.

Exam Details

The exam details are as follows:

The exam consists of 60 multiple-choice questions. Candidates will have 120 minutes to complete the exam.

The minimum passing score for the exam is 70 percent. This translates to correctly answering a minimum of 42 of the 60 questions.

You will be testing your knowledge on the following topics:

Spark Architecture: Conceptual and Applied understanding (~28%):

  • Spark Use Cases

  • Spark Architecture

  • Spark Configurations

  • Spark Query Planning

  • Adaptive Query Execution

  • Garbage Collection

  • Query Performance

  • Scheduling

Spark DataFrame API Applications (~72%):

  • Concepts of Transformations and Actions

  • Selecting and Manipulating Columns

  • Adding, Removing, and Renaming Columns

  • Working with Date and Time

  • Data Type Conversions and Casting

  • Filtering, Dropping, and Sorting Rows

  • Aggregations, Joins, and Broadcasts

  • Partitioning and Coalescing

  • Reading and Writing Data Files

  • CSV and Parquet Options

  • Working with NULLs and Literals

  • Combining DataFrames

  • Data Sampling and Splits

  • DataFrame Schema and Catalog

  • Collecting Data at the Driver

  • Caching and Persistence

  • Code Syntax Problems

  • Code Tracing and Output Determination

  • Working with UDFs

  • Spark SQL, Database, Tables, and Views

  • Spark SQL Functions

  • Text File Options
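
As a quick illustration of a few of the topics above (CSV options, casting, date and time functions, NULL handling, and writing Parquet), here is a hedged sketch; it assumes an existing SparkSession named spark, and the file path and column names are made up:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Reading a CSV file with options (hypothetical path and columns)
    orders = (
        spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("/tmp/orders.csv")
    )

    # Casting, date/time functions, and NULL handling with literals
    cleaned = (
        orders.withColumn("amount", F.col("amount").cast(DoubleType()))
              .withColumn("order_date", F.to_date(F.col("order_date"), "yyyy-MM-dd"))
              .withColumn("order_year", F.year("order_date"))
              .fillna({"amount": 0.0})
              .withColumn("source", F.lit("csv"))
    )

    # Writing the result as Parquet, coalesced into a single file
    cleaned.coalesce(1).write.mode("overwrite").parquet("/tmp/orders_clean")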

Note: These are not exam dumps. The practice tests aim to assess your Apache Spark 3.0 knowledge and exam preparedness, and they may also help you improve it.

Building Big Data Pipelines with PySpark + MongoDB + Bokeh

Build intelligent data pipelines with big data processing and machine learning technologies

Created by EBISYS R&D - Big Data Engineering

"]

Students: 231, Price: $59.99

Welcome to the Building Big Data Pipelines with PySpark & MongoDB & Bokeh course. In this course we will be building an intelligent data pipeline using big data technologies like Apache Spark and MongoDB.

We will be building an ETLP pipeline; ETLP stands for Extract, Transform, Load, and Predict. These are the different stages the data has to go through in order to become useful at the end. Once the data has gone through this pipeline, we will be able to use it for building reports and dashboards for data analysis.

The data pipeline that we will build will comprise data processing using PySpark, predictive modelling using Spark's MLlib machine learning library, and data analysis using MongoDB and Bokeh.
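
To make the ETLP stages concrete, here is a minimal, non-authoritative sketch in PySpark. The file paths, column names, and the choice of Parquet for the load step are assumptions for illustration; the course itself loads into MongoDB:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("etlp-sketch").getOrCreate()

    # Extract: read raw data (hypothetical CSV path and columns)
    raw = (spark.read.option("header", "true")
                     .option("inferSchema", "true")
                     .csv("/tmp/raw_events.csv"))

    # Transform: clean rows and derive features
    events = (raw.dropna(subset=["latitude", "longitude", "magnitude"])
                 .withColumn("depth_km", F.col("depth").cast("double")))

    # Load: persist the curated data (Parquet here; a MongoDB collection in the course)
    events.write.mode("overwrite").parquet("/tmp/curated_events")

    # Predict: fit a simple MLlib model on the curated data
    assembler = VectorAssembler(inputCols=["latitude", "longitude", "depth_km"],
                                outputCol="features")
    model_input = assembler.transform(events).select(
        "features", F.col("magnitude").alias("label"))
    model = LinearRegression().fit(model_input)
    predictions = model.transform(model_input)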

  • You will learn how to create data processing pipelines using PySpark

  • You will learn machine learning with geospatial data using the Spark MLlib library

  • You will learn data analysis using PySpark, MongoDB and Bokeh, inside of a Jupyter notebook

  • You will learn how to manipulate, clean and transform data using PySpark dataframes

  • You will learn basic Geo mapping

  • You will learn how to create dashboards

  • You will also learn how to create a lightweight server to serve Bokeh dashboards

Hands-On PySpark for Big Data Analysis

Use PySpark to productionize analytics over Big Data and easily crush messy data at scale

Created by Packt Publishing - Tech Knowledge in Motion

"]

Students: 200, Price: $89.99

Data is an incredible asset, especially when there are lots of it. Exploratory data analysis, business intelligence, and machine learning all depend on processing and analyzing Big Data at scale. 

How do you go from working on prototypes on your local machine, to handling messy data in production and at scale? 

This is a practical, hands-on course that shows you how to use Spark and its Python API to create performant analytics with large-scale data. Don't reinvent the wheel, and wow your clients by building robust and responsible applications on Big Data.

About the Author

Colibri Digital is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company works to help their clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas such as Big Data, Data Science, Machine Learning, and Cloud Computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to better make sense of their data, and process it in more intelligent ways.

The company lives by their motto: Data -> Intelligence -> Action.

Rudy Lai is the founder of QuantCopy, a sales acceleration startup using AI to write sales emails to prospects. By taking in leads from your pipelines, QuantCopy researches them online and generates sales emails from that data. It also has a suite of email automation tools to schedule, send, and track email performance - key analytics that all feed back into how its AI generates content.

Prior to founding QuantCopy, Rudy ran HighDimension.IO, a machine learning consultancy, where he experienced first hand the frustrations of outbound sales and prospecting. As a founding partner, he helped startups and enterprises with HighDimension.IO’s Machine-Learning-as-a-Service, allowing them to scale up data expertise in the blink of an eye.

In the first part of his career, Rudy spent 5+ years in quantitative trading at leading investment banks such as Morgan Stanley. This valuable experience allowed him to witness the power of data, but also the pitfalls of automation using data science and machine learning. Quantitative trading was also a great platform to learn deeply about reinforcement learning and supervised learning topics in a commercial setting. 

Rudy holds a Computer Science degree from Imperial College London, where he was part of the Dean’s List, and received awards such as the Deutsche Bank Artificial Intelligence prize.

Data Analytics with Pyspark

Learn the basics of Pyspark

Created by Wajahatullah Khan - Data Architect at Afiniti

"]

Students: 141, Price: $19.99

PySpark helps you perform data analysis. It helps you build more scalable analyses and data pipelines. This course starts by introducing you to PySpark's potential for performing analysis of large datasets. You'll learn how to interact with Spark from Python and connect to Spark on Windows as a local machine.

By the end of this course, you will not only be able to perform efficient data analytics but will have also learned to use PySpark to easily analyze large datasets at-scale in your organization.

This course will greatly appeal to data science enthusiasts, data scientists, or anyone who is familiar with Machine Learning concepts and wants to scale out their work to big data.

If you find it difficult to analyze large datasets that keep growing, then this course is the perfect guide for you!

Note: A working knowledge of Python is assumed.

What You Will Learn

  • Gain a solid knowledge of PySpark with Data Analytics concepts via practical use cases

  • Run, process, and analyze large chunks of datasets using PySpark

  • Utilize Spark SQL to easily load big data into DataFrames

  • Use PySpark SQL functions

  • Extract data from multiple sources

We will be using PyCharm as an IDE to run PySpark and Python.
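
For reference, a minimal sketch of what connecting to Spark locally and querying data with Spark SQL can look like (paths and column names are placeholders, not course material):

    from pyspark.sql import SparkSession

    # Start a local Spark session (works the same way from PyCharm on Windows,
    # provided Java and the pyspark package are installed)
    spark = (SparkSession.builder
                         .master("local[*]")
                         .appName("local-analytics-sketch")
                         .getOrCreate())

    # Load a hypothetical CSV and expose it to Spark SQL
    df = spark.read.option("header", "true").csv("data/customers.csv")
    df.createOrReplaceTempView("customers")

    # Query it with plain SQL
    spark.sql("SELECT country, COUNT(*) AS n FROM customers GROUP BY country").show()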

Mastering Big Data Analytics with PySpark

Effectively apply Advanced Analytics to large datasets using the power of PySpark

Created by Packt Publishing - Tech Knowledge in Motion

"]

Students: 133, Price: $89.99

PySpark helps you perform data analysis at-scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into various Spark components and the Spark architecture.

You'll learn to work with Apache Spark and perform ML tasks more smoothly than before. You'll gather and query data using Spark SQL, to overcome the challenges involved in reading it. You'll use the DataFrame API to operate with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks for deploying your code and performance tuning.

By the end of this course, you will not only be able to perform efficient data analytics but will have also learned to use PySpark to easily analyze large datasets at-scale in your organization.
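
As a small, non-authoritative sketch of the DataFrame-based MLlib Pipeline API mentioned above (it assumes an existing SparkSession named spark; the data and column names are made up):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Hypothetical training data: two numeric features and a string label
    train = spark.createDataFrame(
        [(1.0, 3.5, "yes"), (0.0, 1.2, "no"), (2.5, 0.7, "yes")],
        ["f1", "f2", "outcome"],
    )

    indexer = StringIndexer(inputCol="outcome", outputCol="label")
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(maxIter=10)

    # The Pipeline chains the stages and fits them as one estimator
    pipeline = Pipeline(stages=[indexer, assembler, lr])
    model = pipeline.fit(train)
    model.transform(train).select("features", "label", "prediction").show()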

About the Author

Danny Meijer works as the Lead Data Engineer in the Netherlands for the Data and Analytics department of a leading sporting goods retailer. He is a Business Process Expert, big data scientist and additionally a data engineer, which gives him a unique mix of skills—the foremost of which is his business-first approach to data science and data engineering.

He has over 13 years' IT experience across various domains and skills, ranging from (big) data modeling, architecture, design, and development to project and process management; he also has extensive experience with process mining, data engineering on big data, and process improvement.

As a certified data scientist and big data professional, he knows his way around data and analytics, and is proficient in various types of programming language. He has extensive experience with various big data technologies and is fluent in everything: NoSQL, Hadoop, Python, and of course Spark.

Danny is a driven person, motivated by everything data and big-data. He loves math and machine learning and tackling difficult problems.

Complete PySpark & Google Colab Primer For Data Science

Develop Practical Machine Learning & Neural Network Models With PySpark and Google Colab

Created by Minerva Singh - Bestselling Instructor & Data Scientist(Cambridge Uni)

"]

Students: 127, Price: $89.99

YOUR COMPLETE GUIDE TO PYSPARK AND GOOGLE COLAB: POWERFUL FRAMEWORK FOR ARTIFICIAL INTELLIGENCE (AI)

This course covers the main aspects of the PySpark Big Data ecosystem within the Google Colab framework. If you take this course, you can do away with taking other courses or buying books on PySpark-based analytics, as my course has the most updated information and syntax. Plus, you learn to channelise the power of PySpark within a powerful Python AI framework - Google Colab.

In this age of big data, companies across the globe use Pyspark to sift through the avalanche of information at their disposal, courtesy of Big Data. By becoming proficient in machine learning, neural networks and deep learning via a powerful framework such as PySpark, you can give your company a competitive edge and boost your career to the next level!

LEARN FROM AN EXPERT DATA SCIENTIST:

My name is Minerva Singh and I am an Oxford University MPhil (Geography and Environment) graduate. I finished a PhD at Cambridge University, UK, where I specialized in data science models.

I have 5+ years of experience in analyzing real-life data from different sources using data science-related techniques and producing publications for international peer-reviewed journals.

Over the course of my research, I realized almost all the data science courses and books out there do not account for the multidimensional nature of the topic.

This course will give you a robust grounding in the main aspects of working with PySpark - your gateway to Big Data.

Unlike other instructors, I dig deep into the data science features of Pyspark and their implementation via Google Colab, and give you a one-of-a-kind grounding.

You will go all the way from carrying out data reading & cleaning to finally implementing powerful machine learning and neural networks algorithms and evaluating their performance using Pyspark.

Among other things:

  • You will be introduced to Google Colab, a powerful framework for implementing data science via your browser.

  • You will be introduced to important concepts of machine learning without jargon.

  • Learn to install PySpark within the Colab environment and use it for working with data (see the sketch after this list)

  • You will learn how to implement both supervised and unsupervised algorithms using the Pyspark framework

  • Implement both Artificial Neural Networks (ANN) and Deep Neural Networks (DNNs) with the Pyspark framework

  • Work with real data within the framework
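
The sketch referenced above: installing and starting PySpark inside a Google Colab notebook is typically just a couple of cells. This is one common route and may differ in detail from the course:

    # In a Google Colab cell: install PySpark from PyPI
    !pip install -q pyspark

    # Then start a local Spark session and load data as usual
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("colab-sketch").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()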

NO PRIOR PYTHON OR STATISTICS/MACHINE LEARNING OR BIG DATA KNOWLEDGE IS REQUIRED:

You’ll start by absorbing the most valuable Pyspark Data Science basics and techniques. I use easy-to-understand, hands-on methods to simplify and address even the most difficult concepts in Python.

My course will help you implement the methods using real data obtained from different sources. Many courses use made-up data that does not empower students to implement Pyspark-based data science in real life.

After taking this course, you'll easily use the latest Pyspark techniques to implement novel data science techniques straight from your browser. You will get your hands dirty with real-life data and problems.

You'll even grasp the underlying concepts and understand which algorithms and methods are best suited for your data.

We will also work with real data and you will have access to all the code and data used in the course. 

JOIN MY COURSE NOW!

I AM HERE TO SUPPORT YOU THROUGHOUT YOUR JOURNEY

IN CASE YOU ARE NOT SATISFIED, THERE IS A 30-DAY NO-QUIBBLE MONEY-BACK GUARANTEE.

Big Data Analytics with PySpark + Tableau Desktop + MongoDB

Integrating Big Data Processing tools with Predictive Modeling and Visualization with Tableau Desktop

Created by EBISYS R&D - Big Data Engineering

"]

Students: 115, Price: $59.99

Welcome to the Big Data Analytics with PySpark + Tableau Desktop + MongoDB course. In this course we will be creating a big data analytics solution using big data technologies like PySpark for ETL, MLlib for Machine Learning as well as Tableau for Data Visualization and for building Dashboards.

We will be working with earthquake data that we will transform into summary tables. We will then use these tables to train predictive models and predict future earthquakes. We will then analyze the data by building reports and dashboards in Tableau Desktop.

Tableau Desktop is a powerful data visualization tool used for big data analysis and visualization. It allows for data blending, real-time analysis, and collaboration on data. No programming is needed for Tableau Desktop, which makes it a very easy and powerful tool for creating dashboards, apps, and reports.

MongoDB is a document-oriented NoSQL database used for high-volume data storage. It stores data in JSON-like documents rather than row/column tables. The document model maps to the objects in your application code, making the data easy to work with.

  • You will learn how to create data processing pipelines using PySpark

  • You will learn machine learning with geospatial data using the Spark MLlib library

  • You will learn data analysis using PySpark, MongoDB and Tableau

  • You will learn how to manipulate, clean and transform data using PySpark dataframes

  • You will learn how to create Geo Maps in Tableau Desktop

  • You will also learn how to create dashboards in Tableau Desktop
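
To make the summary-table idea concrete, here is a small hedged sketch (it assumes an existing SparkSession named spark; the earthquake DataFrame, its columns, and the output path are hypothetical):

    from pyspark.sql import functions as F

    # Hypothetical cleaned earthquake data with columns: year, magnitude, latitude, longitude
    quakes = spark.read.parquet("/tmp/earthquakes_clean")

    # Build a summary table: average magnitude and quake count per year
    summary = (
        quakes.groupBy("year")
              .agg(F.avg("magnitude").alias("avg_magnitude"),
                   F.count("*").alias("quake_count"))
              .orderBy("year")
    )

    # Persist the summary so Tableau (or any BI tool) can pick it up
    summary.write.mode("overwrite").csv("/tmp/quake_summary", header=True)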

PySpark – Build DataFrames with Python, Apache Spark and SQL

Build amazing DataFrames with Python, Apache Spark, and SQL

Created by Mammoth Interactive - Top-Rated Instructor, 800,000+ Students

"]

Students: 73, Price: $89.99

This course covers all the fundamentals about Apache Spark streaming with Python and teaches you everything you need to know about developing Spark streaming applications using PySpark, the Python API for Spark. At the end of this course, you will gain in-depth knowledge about Spark streaming and general big data manipulation skills to help your company to adapt Spark Streaming for building big data processing pipelines and data analytics applications. This course will be absolutely critical to anyone trying to make it in data science today.

Spark can perform up to 100x faster than Hadoop MapReduce, which has caused an explosion in demand for this skill! Because the Spark 2.0 DataFrame framework is so new, you now have the ability to quickly become one of the most knowledgeable people in the job market!

This course will teach the basics with a crash course in Python, continuing on to learning how to use Spark DataFrames with the latest Spark 2.0 syntax! Once we've done that we'll go through how to use the MLlib Machine Library with the DataFrame syntax and Spark. All along the way, you'll have exercises and Mock Consulting Projects that put you right into a real-world situation where you need to use your new skills to solve a real problem!

We also cover the latest Spark Technologies, like Spark SQL, Spark Streaming, and advanced models like Gradient Boosted Trees! After you complete this course you will feel comfortable putting Spark and PySpark on your resume! This course also has a full 30-day money-back guarantee and comes with a LinkedIn Certificate of Completion!

If you're ready to jump into the world of Python, Spark, and Big Data, this is the course for you!
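
As an illustrative, non-course-specific example of a PySpark streaming application, here is a minimal Structured Streaming word count reading from a local socket (the host, port, and the choice of Structured Streaming rather than DStreams are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Read a stream of text lines from a socket (e.g. `nc -lk 9999` on localhost)
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # Split lines into words and keep a running count per word
    counts = (lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
                   .groupBy("word")
                   .count())

    # Print the updated counts to the console as new data arrives
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()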

Big Data Analytics with PySpark + Power BI + MongoDB

Big Data Analytics with Predictive Modeling and Visualization with Power BI Desktop

Created by EBISYS R&D - Big Data Engineering

"]

Students: 53, Price: $54.99

Welcome to the Big Data Analytics with PySpark + Power BI + MongoDB course. In this course we will be creating a big data analytics pipeline, using big data technologies like PySpark, MLlib, Power BI and MongoDB.

We will be working with earthquake data that we will transform into summary tables. We will then use these tables to train predictive models and predict future earthquakes. We will then analyze the data by building reports and dashboards in Power BI Desktop.

Power BI Desktop is a powerful data visualization tool that lets you build advanced queries, models, and reports. With Power BI Desktop, you can connect to multiple data sources and combine them into a data model. This data model lets you build visuals and dashboards that you can share as reports with other people in your organization.

MongoDB is a document-oriented NoSQL database used for high-volume data storage. It stores data in JSON-like documents rather than row/column tables. The document model maps to the objects in your application code, making the data easy to work with.

  • You will learn how to create data processing pipelines using PySpark

  • You will learn machine learning with geospatial data using the Spark MLlib library

  • You will learn data analysis using PySpark, MongoDB and Power BI

  • You will learn how to manipulate, clean and transform data using PySpark dataframes

  • You will learn how to create Geo Maps using ArcMaps for Power BI

  • You will also learn how to create dashboards in Power BI

Apache Spark 3 Programming | Databricks Certification Python

Go from zero to hero in Apache PySpark 3.0 programming in a fun and easy way. The fastest way to prepare for the Databricks exam.

Created by Vivek Singh Bhadouria - Data Scientist

"]

Students: 38, Price: $29.99

Hello Students,

I welcome you all to this course on Apache Spark 3.0 Programming and Databricks Associate Developer Certification using Python. In this course, you will learn programming with Apache Spark 3.0 using Python, usually referred to as PySpark, while also preparing for the Databricks certification in a fun and easy way from ground zero.

This course requires zero knowledge of PySpark and will take you to an advanced user level of PySpark by the end. We will be using only the Python language in this course. It can also be taken by someone who is starting their journey with Apache Spark using Python.

This course focuses on the most important aspects of Apache Spark 3.0 without going into the esoteric side of the Spark framework. Therefore, you will be productive with PySpark with the help of this course in a couple of hours. Additionally, this course covers all the topics required for Databricks certification using the Python language.

This course also comes with two bonus projects on machine learning using PySpark. In those videos, I will talk about how to prepare your data so that it is ready for applying machine learning algorithms, along with hands-on work with machine learning algorithms from the PySpark machine learning framework. I have chosen very gentle examples to illustrate the power of PySpark's machine learning, so it will be very easy to follow along.

This course is ideal if you are an absolute beginner or someone with less than two years of experience with PySpark or if you wish to get certified as a Databricks Certified Associate Developer for Apache Spark 3.0. This course can also be used by experienced professionals to quickly brush up their basics in PySpark.

In terms of hardware requirements, you just need a computer with an internet connection. We will be using a free Databricks cluster to practice the problems here, so you also don't need to worry about any complicated installations. This is also helpful for many professionals because almost always, we do not have admin access to the computer and we cannot install any software on the computer. I will be teaching you how to use Databricks cloud platform for this course.

Machine Learning in Python – Extras

Explore ML Pipelines with Scikit-Learn,PySpark, Model Fairness and Model Interpretation, and More

Created by Jesse E. Agbe - Developer

"]

Students: 19, Price: $64.99

Machine Learning applications are everywhere nowadays, from Google Translate and NLP APIs to recommendation systems used by YouTube, Netflix, Amazon, Udemy, and more. As we have come to know, data science and machine learning are quite important to the success of any business and sector - so what does it take to build machine learning systems that work?

In performing machine learning and data science projects, the normal workflow is that you have a problem you want to solve, so you perform data collection, data preparation, feature engineering, model building and evaluation, and then you deploy your model. However, that is not all there is; there is a lot more to this life cycle.

In this course we will introduce you to some extra things that are not covered in most machine learning courses, such as working with pipelines (specifically Scikit-learn pipelines, Spark pipelines, etc.) and working with imbalanced datasets.

We will also explore other ML frameworks beyond Scikit-learn, TensorFlow, or PyTorch, such as TuriCreate and Creme for online machine learning, and more.

We will learn about model interpretation and explanation. Certain ML models, when used in production, tend to be biased; hence, in this course we will explore how to assess model fairness and detect bias.

By the end of the course you will have a comprehensive overview of extra concepts and tools in the entire machine learning project life cycle, and things to consider when performing a data science project.

This course is unscripted, fun, and exciting, but at the same time we dive deep into some extra aspects of the machine learning life cycle.

Specifically, you will learn:

  • Pipelines and their advantages.

  • How to build ML Pipelines with Scikit-Learn

  • How to build Spark NLP Pipelines (see the sketch after this list)

  • How to work with and fix Imbalanced Datasets

  • Model Fairness and Bias Detection

  • How to interpret and explain your black-box models using Lime, Eli5, etc.

  • Incremental/Online Machine Learning Frameworks

  • Best practices in data science projects

  • Model Deployment

  • Alternative ML libraries, e.g. TuriCreate

  • How to track your ML experiments, and more
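
The sketch referenced in the Spark NLP Pipelines item above: a minimal text-classification pipeline built with pyspark.ml. It assumes an existing SparkSession named spark; the toy data is made up, and this is not the exact pipeline from the course:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF
    from pyspark.ml.classification import LogisticRegression

    # Toy labelled text data for illustration only
    docs = spark.createDataFrame(
        [("spark makes big data easy", 1.0), ("I dislike slow jobs", 0.0)],
        ["text", "label"],
    )

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=1024)
    idf = IDF(inputCol="raw_features", outputCol="features")
    clf = LogisticRegression(maxIter=10)

    # Chain the text-processing and modelling stages into one pipeline
    pipeline = Pipeline(stages=[tokenizer, tf, idf, clf])
    model = pipeline.fit(docs)
    model.transform(docs).select("text", "prediction").show(truncate=False)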

NB: This course will not cover CI/CD ML Pipelines

Join us as we explore the world of machine learning in Python - the Extras.