PySpark Cheatsheet
May 21, 2022
Spark, Python, Code
Spark is a framework for fast data processing: it keeps data in memory and executes work in parallel across multiple nodes, which makes it well suited to large datasets.
PySpark is a library that allows Spark to be used from Python.
Like pandas, PySpark has a DataFrame concept (a structure for handling tabular data).
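All the snippets below assume an active SparkSession bound to the name spark; a minimal setup sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# entry point for DataFrame operations; getOrCreate() reuses a running session
spark = SparkSession.builder.appName('cheatsheet').getOrCreate()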
Reading a CSV file
df = spark.read.csv('<PATH_OF_CSV>', header=True)
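By default every column is read as a string; a sketch that also infers column types (the file path is a made-up example):

# infer column types instead of reading everything as strings
df = spark.read.csv('/data/sales.csv', header=True, inferSchema=True)
df.printSchema()  # inspect the inferred schema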
Union of DataFrames
df = df1.union(df2).union(df3)
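union matches columns by position, so the DataFrames must share the same schema. If the columns are in a different order, unionByName matches them by name instead:

# match columns by name rather than by position
df = df1.unionByName(df2)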
Sorting a DataFrame in descending order
from pyspark.sql.functions import desc

df.sort(desc('<COLUMN_NAME>'))
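Sort keys can be mixed, for instance descending on one column and ascending on another (column names are placeholders):

from pyspark.sql.functions import asc, desc

df.sort(desc('<COLUMN1>'), asc('<COLUMN2>'))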
Updating column value based on condition
from pyspark.sql.functions import when, lit

df = df.withColumn('<COLUMN_NAME>',
                   when(df['<COLUMN_NAME>'] == '<VALUE1>', lit('<LITERAL_VALUE>'))
                   .otherwise(lit('<VALUE2>')))
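As a concrete sketch with made-up data, mapping a status code to a label:

from pyspark.sql.functions import when, lit

df = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'status'])
# rows with status 'A' become 'active', everything else 'inactive'
df = df.withColumn('status',
                   when(df['status'] == 'A', lit('active'))
                   .otherwise(lit('inactive')))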
Add new Column
from pyspark.sql.functions import lit

df = df.withColumn('<NEW_COLUMN_NAME>', lit(<VALUE>))
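lit() wraps a constant value; a new column can also be derived from existing ones with no lit() needed (the price and quantity columns here are assumptions for illustration):

# derived column computed from existing columns
df = df.withColumn('total', df['price'] * df['quantity'])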
Delete Column
df = df.drop('<COLUMN1>')

# multiple columns
df = df.drop('<COLUMN1>', '<COLUMN2>')
Filter
df.filter(df['<COLUMN_NAME>'] == '<VALUE>')

# OR condition
df.filter((df['<COLUMN1>'] > '<VALUE1>') | (df['<COLUMN2>'] == '<VALUE2>'))

# AND condition
df.filter((df['<COLUMN1>'] > '<VALUE1>') & (df['<COLUMN2>'] == '<VALUE2>'))
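where() is an alias of filter(), and membership tests and negation follow the same column-expression style (values are placeholders):

# membership test
df.filter(df['<COLUMN1>'].isin('<VALUE1>', '<VALUE2>'))

# negation with ~
df.filter(~(df['<COLUMN1>'] == '<VALUE1>'))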
Group By
df.select('<COLUMN1>').groupby('<COLUMN1>').count()
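Beyond counting, groupBy combines with agg() for other aggregates; a sketch assuming hypothetical category and price columns:

from pyspark.sql import functions as F

# one aggregate per group, with a readable output column name
df.groupBy('category').agg(F.sum('price').alias('total_price'))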
Convert to Pandas DataFrame
spark_dataframe.toPandas()
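Note that toPandas() collects the entire dataset onto the driver, so it is only safe when the data fits in driver memory.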
PySpark DataFrame from Pandas DataFrame
spark.createDataFrame(pandas_dataframe)
Check null value count
df.filter(df['<COLUMN>'].isNull()).count()
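To count nulls in every column at once, the same idea generalizes with a comprehension (a sketch; count() skips the nulls that when() produces for non-null rows):

from pyspark.sql import functions as F

# one null count per column, returned as a single-row DataFrame
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()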
Rename Column
df = df.withColumnRenamed('<COLUMN_OLD>', '<COLUMN_NEW>')
Change Column Data Type
from pyspark.sql.types import IntegerType

df = df.withColumn('<COLUMN_NAME>', df['<COLUMN_NAME>'].cast(IntegerType()))
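Other types from pyspark.sql.types (StringType, DoubleType, ...) work the same way; cast() also accepts the type's name as a string:

# equivalent shorthand using a type name string
df = df.withColumn('<COLUMN_NAME>', df['<COLUMN_NAME>'].cast('integer'))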