ListenBrainz / LB-473

ListenBrainz-Labs: Use PySpark SQL Module in place of SQL queries.


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Spark DataFrames are immutable. For consistency, we initially decided to use SQL queries everywhere in our codebase when doing database (DataFrame) operations, like here.

      But since DataFrames are immutable, we cannot run SQL queries to update or alter them (in short, to modify existing DataFrames). In such situations we must use DataFrame functions like union, subtract, etc., which create a new DataFrame, copy the contents of the previous DataFrame into it, and apply our changes to the new DataFrame. Any operation always creates a new DataFrame, much like how tuples work in Python (an oversimplified analogy). See the sketch below.
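
      The snippet below is a minimal sketch of this point; the column names and sample rows are invented for illustration and are not taken from the ListenBrainz codebase. union() leaves the original DataFrame untouched and returns a new one, so "modifying" data always means building a new DataFrame.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("immutability-sketch").getOrCreate()

      listens = spark.createDataFrame(
          [("rob", "Song A"), ("vansika", "Song B")],
          ["user_name", "track_name"],
      )
      new_listens = spark.createDataFrame(
          [("sarthak", "Song C")],
          ["user_name", "track_name"],
      )

      # union() does not change `listens`; it returns a brand-new DataFrame
      # containing the rows of both inputs.
      combined = listens.union(new_listens)

      print(listens.count())   # 2 -- the original DataFrame is unchanged
      print(combined.count())  # 3 -- the combined data lives in a new DataFrame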

      We have recently decided to use the PySpark SQL module in place of SQL queries for consistency (since we will have to use the PySpark SQL module for functions like subtract anyway). Most of the SQL queries have already been updated to use the PySpark SQL module, except those in all.py.

      In this task you are expected to update all.py to use the PySpark SQL module.
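
      As a rough sketch of the kind of change this task involves (the query, file path, and column names below are hypothetical and not taken from all.py), a raw SQL query and its PySpark SQL module equivalent look like this:

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("sql-to-dataframe-api").getOrCreate()
      # Placeholder path and schema, not the real ListenBrainz data.
      listens = spark.read.parquet("/path/to/listens.parquet")

      # Old approach: register a temp view and run a SQL query.
      listens.createOrReplaceTempView("listens")
      sql_result = spark.sql("""
          SELECT user_name, count(*) AS listen_count
            FROM listens
        GROUP BY user_name
      """)

      # New approach: the same aggregation via the PySpark SQL module
      # (DataFrame API), which returns a new DataFrame.
      df_result = (
          listens
          .groupBy("user_name")
          .agg(F.count("*").alias("listen_count"))
      )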

            People

            • Assignee: Sarthak Jain (sarthak_jain)
            • Reporter: Vansika Pareek (vansika)