PySpark Cheatsheet

May 21, 2022

Spark, Python, Code

Spark is a framework for fast data processing, using in-memory computation with parallel execution across multiple nodes. It is useful for handling large datasets.

PySpark is a library that allows using Spark from Python 3.

Like Pandas, PySpark also has the concept of a DataFrame (a structure for handling tabular data).

Reading csv file

df = spark.read.csv(<PATH_OF_CSV>, header=True)

Union of DataFrames

df = df1.union(df2).union(df3)

Sorting DataFrame in Descending order

from pyspark.sql.functions import desc
df = df.sort(desc('<COLUMN_NAME>'))

Updating column value based on condition

from pyspark.sql.functions import when, lit
df = df.withColumn('<COLUMN_NAME>',
    when( df['<COLUMN_NAME>'] == '<VALUE1>', lit('<LITERAL_VALUE>') )
    .otherwise( lit('<VALUE2>') )
)

Add new Column

from pyspark.sql.functions import lit
df = df.withColumn("<NEW_COLUMN_NAME>", lit(<VALUE>))

Delete Column

df = df.drop('<COLUMN1>')
# multiple columns
df = df.drop('<COLUMN1>', '<COLUMN2>')

Filter

df.filter(df['<COLUMN_NAME>'] == '<VALUE>')
# OR condition
df.filter( (df['<COLUMN1>'] > '<VALUE1>') | (df['<COLUMN2>'] == '<VALUE2>') )
# AND condition
df.filter( (df['<COLUMN1>'] > '<VALUE1>') & (df['<COLUMN2>'] == '<VALUE2>') )

Group By

df.select('<COLUMN1>').groupby('<COLUMN1>').count()

Convert to Pandas DataFrame

spark_dataframe.toPandas()

PySpark DataFrame from Pandas DataFrame

spark.createDataFrame(pandas_dataframe)

Check null value count

df.filter(df['<COLUMN>'].isNull()).count()

Rename Column

df = df.withColumnRenamed('<COLUMN_OLD>', '<COLUMN_NEW>')

Change Column DataTypes

from pyspark.sql.types import IntegerType
df = df.withColumn('<COLUMN_NAME>', df['<COLUMN_NAME>'].cast(IntegerType()))