User-Defined Functions
User‑defined functions (UDFs) allow you to write custom logic in Python that applies row‑by‑row. UDFs are powerful but can be slower than built‑in functions.
Creating a Simple UDF
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize(age):
    if age < 18:
        return "Child"
    elif age < 65:
        return "Adult"
    else:
        return "Senior"

categorize_udf = udf(categorize, StringType())
df = df.withColumn("category", categorize_udf(df["age"]))
```

Using UDF with `select`

```python
df.select(df.name, categorize_udf(df.age).alias("category")).show()
```

Register UDF for SQL
```python
spark.udf.register("categorize_udf", categorize, StringType())
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name, categorize_udf(age) AS category FROM people")
result.show()
```

Performance Consideration
Because Spark executors run on the JVM, every row processed by a Python UDF must be serialised, shipped to a Python worker process, evaluated, and shipped back, which adds significant overhead. Prefer built-in column functions (`when`, `otherwise`, `regexp_extract`, etc.) whenever they can express the same logic.
Two Minute Drill
- UDFs apply Python functions row‑by‑row.
- Create with `udf()`; register for SQL queries with `spark.udf.register()`.
- Specify return type (StringType, IntegerType, etc.).
- UDFs are slower than native Spark functions – use sparingly.
