
User-Defined Functions

User‑defined functions (UDFs) let you apply custom Python logic to a DataFrame row by row. They are powerful but typically slower than Spark's built‑in functions.

Creating a Simple UDF

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize(age):
    if age is None:  # Spark passes NULL values to the UDF as None
        return None
    if age < 18:
        return "Child"
    elif age < 65:
        return "Adult"
    else:
        return "Senior"

categorize_udf = udf(categorize, StringType())
df = df.withColumn("category", categorize_udf(df["age"]))

Using a UDF with `select`

df.select(df.name, categorize_udf(df.age).alias("category")).show()

Register UDF for SQL

spark.udf.register("categorize_udf", categorize, StringType())
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name, categorize_udf(age) as category FROM people")
result.show()

Performance Consideration

UDFs force each row to be serialised, shipped from the JVM to a Python worker, processed, and serialised back, which is slow. Prefer built‑in functions (`when`, `otherwise`, `regexp_extract`, etc.) whenever the logic can be expressed with them.


Two Minute Drill
  • UDFs apply Python functions row‑by‑row.
  • Register with `udf()` or `spark.udf.register()`.
  • Specify return type (StringType, IntegerType, etc.).
  • UDFs are slower than native Spark functions – use sparingly.

Need more clarification?

Drop us an email at career@quipoinfotech.com