- Improvement
- Resolution: Won't Fix
- Normal
- None
- None
- None
Spark DataFrames are immutable. We initially decided, for consistency, to use SQL queries everywhere in our codebase for database (DataFrame) operations. Like here
But since DataFrames are immutable, we cannot run SQL queries to update or alter them (in short, to modify existing DataFrames). In such situations we must use DataFrame functions like union, subtract, etc., which create a new DataFrame: the contents of the previous DataFrame are copied into the new one along with our changes. Every operation produces a new DataFrame, similar to how tuples work in Python (an oversimplified analogy).
We have recently decided to use the PySpark SQL module in place of raw SQL queries, again for consistency (since we will have to use the PySpark SQL module for functions like subtract anyway). Most of the SQL queries have been updated to use the PySpark SQL module, except those in all.py.
In this task, you are expected to update all.py to use the PySpark SQL module.