ListenBrainz / LB-473

ListenBrainz-Labs: Use PySpark SQL Module in place of SQL queries.


    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Normal

      Spark dataframes are immutable. We initially decided to be consistent and therefore used SQL queries throughout our codebase for database (dataframe) operations. Like here

      But since dataframes are immutable, we cannot run SQL queries to update or alter our dataframes (in short, modify existing dataframes). In such situations we must use dataframe functions like union, subtract, etc., which create a new dataframe: the contents of the previous dataframe are copied into the new one and our changes are applied to it. Any operation will always create a new dataframe, much like how tuples work in Python (an oversimplified analogy).

      We have recently decided to use the PySpark SQL module in place of raw SQL queries for consistency (since we will have to use the PySpark SQL module for functions like subtract etc. anyway). Most of the SQL queries have been updated to use the PySpark SQL module, except those in all.py.

      In this task you are expected to update all.py to use the PySpark SQL module.

            Assignee: kartik1712 amCap1712
            Reporter: vansika Vansika Pareek
            Votes: 0
            Watchers: 2
