Introduction to Spark SQL in Python

Course overview

Provider
DataCamp
Course type
Free trial available
Deadline
Flexible
Duration
4 hours
Certificate
Available on completion
Course author
Mark Plutowski

Description

Learn how to manipulate data and create machine learning feature sets in Spark using SQL in Python.
If you are familiar with SQL and have heard great things about Apache Spark, this course is for you. Apache Spark is a computing framework for processing big data, and Spark SQL is the component of Apache Spark that works with tabular data. Window functions are an advanced feature of SQL that take Spark to a new level of usefulness. Spark combines the power of distributed computing with the ease of use of Python and SQL.

You will use Spark SQL to analyze time series, extract the most common sequences of words from a text document, and create feature sets from natural language text that you then use to predict the last word in a sentence using logistic regression.

The course uses a natural language text dataset that is easy to understand. Sentences are sequences of words, and window functions are well suited to manipulating sequence data. The same techniques can be applied to sequences of song identifiers, video IDs, or podcast IDs. Exercises include discovering frequent word sequences and converting word sequences into machine learning feature sets for training a text classifier.
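
To give a flavor of the frequent word sequence exercise described above, here is a minimal sketch of counting adjacent word pairs with a SQL window function. It is not taken from the course material. To stay runnable without a Spark installation, it uses Python's built-in sqlite3 module as a stand-in engine (window functions need SQLite 3.25 or later). The table name `words`, its columns, and the three sample sentences are made up for illustration. The `LEAD ... OVER (PARTITION BY ... ORDER BY ...)` query is standard SQL, so in Spark it could plausibly be passed to `spark.sql(...)` against a registered table with the same columns.

```python
import sqlite3

# Hypothetical sample data: each sentence is a sequence of words.
sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (sentence_id INTEGER, pos INTEGER, word TEXT)")

# One row per word, keeping its sentence and its position in that sentence.
rows = [
    (sentence_id, pos, word)
    for sentence_id, sentence in enumerate(sentences)
    for pos, word in enumerate(sentence.split())
]
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", rows)

# LEAD looks one row ahead within the same sentence, pairing each word
# with the word that follows it. The outer query counts each pair.
query = """
SELECT w1, w2, COUNT(*) AS n
FROM (
    SELECT word AS w1,
           LEAD(word, 1) OVER (PARTITION BY sentence_id ORDER BY pos) AS w2
    FROM words
) AS pairs
WHERE w2 IS NOT NULL
GROUP BY w1, w2
ORDER BY n DESC, w1, w2
"""
bigrams = conn.execute(query).fetchall()

for w1, w2, n in bigrams[:3]:
    print(w1, w2, n)
```

With this sample data, the three pairs that occur twice ("on the", "sat on", "the cat") come first. Extending the inner query with `LEAD(word, 2)` would give three-word sequences in the same way.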

Similar courses

Foundations: Data, Data, Everywhere
  • Flexible deadline
  • 20 hours
  • Certificate
Ask Questions to Make Data-Driven Decisions
  • Flexible deadline
  • 18 hours
  • Certificate
Introduction to Statistics
  • Flexible deadline
  • 15 hours
  • Certificate
  • English language