English SDK for Apache Spark
Note
This article covers the English SDK for Apache Spark. The English SDK for Apache Spark is not supported directly by Databricks. To provide feedback, ask questions, or report issues, use the Issues tab of the English SDK for Apache Spark repository on GitHub.
The English SDK for Apache Spark takes English instructions and compiles them into Spark objects. Its goal is to make Spark more user-friendly and accessible, which enables you to focus your efforts on extracting insights from your data.
This article includes an example that shows how to use an Azure Databricks Python notebook to call the English SDK for Apache Spark. The example uses a plain English question to guide the English SDK for Apache Spark to run a SQL query on a table from your Azure Databricks workspace.
- Databricks has found that GPT-4 works optimally with the English SDK for Apache Spark. This article uses GPT-4 and assumes that you have an OpenAI API key that is associated with an OpenAI billing plan. To start an OpenAI billing plan, sign in at https://platform.openai.com/account/billing/overview, click Start payment plan, and follow the on-screen directions. After you start an OpenAI billing plan, to generate an OpenAI API key, sign in at https://platform.openai.com/account/api-keys and click Create new secret key.
- This example uses a Python notebook in an Azure Databricks workspace. The notebook must be attached to an Azure Databricks cluster.
In the notebook's first cell, run the following code, which installs on the attached compute resource the latest version of the Python package for the English SDK for Apache Spark:
%pip install pyspark-ai --upgrade
In the notebook's second cell, run the following code, which restarts the Python kernel to use the updated Python package for the English SDK for Apache Spark and its updated package dependencies:
dbutils.library.restartPython()
In the notebook's third cell, run the following code, which sets an environment variable named OPENAI_API_KEY to the value of your OpenAI API key. The English SDK for Apache Spark uses this OpenAI API key to authenticate with OpenAI. Replace <your-openai-api-key> with the value of your OpenAI API key:
import os
os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>'
Important
In this example, for speed and ease of use, you hard-code your OpenAI API key into the notebook. In production scenarios, it is a security best practice not to hard-code your OpenAI API key into your notebooks. One alternative approach is to set this environment variable on the attached cluster. See Environment variables.
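If you set the environment variable outside of the notebook, a small check before activating the SDK can catch a missing or placeholder key early. The following is a minimal sketch that uses only the Python standard library. The helper name require_openai_api_key is hypothetical and is not part of the English SDK for Apache Spark.

```python
import os

# The placeholder text used in this article; treated as "not set".
PLACEHOLDER = "<your-openai-api-key>"


def require_openai_api_key(env=os.environ):
    """Return the OpenAI API key from `env`, or raise if it is missing.

    Hypothetical helper for illustration only. `env` is any mapping of
    environment variable names to values and defaults to os.environ.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Set it on the attached cluster "
            "or in the notebook before activating the SDK."
        )
    return key
```

You might call require_openai_api_key() in the cell before you create the SparkAI object. In an Azure Databricks notebook, the key could also come from a secret scope through dbutils.secrets.get and be assigned to os.environ['OPENAI_API_KEY'] first. The scope and key names would be your own.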
In the notebook's fourth cell, run the following code, which sets the LLM that you want the English SDK for Apache Spark to use and then activates the English SDK for Apache Spark with the selected model. For this example, you use GPT-4. By default, the English SDK for Apache Spark looks for an environment variable named OPENAI_API_KEY and uses its value to authenticate with OpenAI to use GPT-4:
from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI
chatOpenAI = ChatOpenAI(model = 'gpt-4')
spark_ai = SparkAI(llm = chatOpenAI)
spark_ai.activate()
Tip
To use GPT-4 as the default LLM, you can simplify this code as follows:
from pyspark_ai import SparkAI
spark_ai = SparkAI()
spark_ai.activate()
In the notebook's fifth cell, run the following code, which selects all of the data in the samples.nyctaxi.trips table from your Azure Databricks workspace and stores this data in a DataFrame that is optimized to work with the English SDK for Apache Spark. This DataFrame is represented here by the variable df:
df = spark_ai._spark.sql("SELECT * FROM samples.nyctaxi.trips")
In the notebook's sixth cell, run the following code, which asks the English SDK for Apache Spark to print the average trip distance, to the nearest tenth, for each day during January 2016:
df.ai.transform("What was the average trip distance for each day during the month of January 2016? Print the averages to the nearest tenth.").display()
The English SDK for Apache Spark prints its analysis and final answer as follows:
> Entering new AgentExecutor chain...
Thought: This can be achieved by using the date function to extract the date from the timestamp and then grouping by the date.
Action: query_validation
Action Input: SELECT DATE(tpep_pickup_datetime) as pickup_date, ROUND(AVG(trip_distance), 1) as avg_trip_distance FROM spark_ai_temp_view_2a0572 WHERE MONTH(tpep_pickup_datetime) = 1 AND YEAR(tpep_pickup_datetime) = 2016 GROUP BY pickup_date ORDER BY pickup_date
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT DATE(tpep_pickup_datetime) as pickup_date, ROUND(AVG(trip_distance), 1) as avg_trip_distance FROM spark_ai_temp_view_2a0572 WHERE MONTH(tpep_pickup_datetime) = 1 AND YEAR(tpep_pickup_datetime) = 2016 GROUP BY pickup_date ORDER BY pickup_date
> Finished chain.
The English SDK for Apache Spark runs its final answer and prints the results as follows:
+-----------+-----------------+
|pickup_date|avg_trip_distance|
+-----------+-----------------+
| 2016-01-01| 3.1|
| 2016-01-02| 3.0|
| 2016-01-03| 3.2|
| 2016-01-04| 3.0|
| 2016-01-05| 2.6|
| 2016-01-06| 2.6|
| 2016-01-07| 3.0|
| 2016-01-08| 2.9|
| 2016-01-09| 2.8|
| 2016-01-10| 3.0|
| 2016-01-11| 2.8|
| 2016-01-12| 2.9|
| 2016-01-13| 2.7|
| 2016-01-14| 3.3|
| 2016-01-15| 3.0|
| 2016-01-16| 3.0|
| 2016-01-17| 2.7|
| 2016-01-18| 2.9|
| 2016-01-19| 3.1|
| 2016-01-20| 2.8|
+-----------+-----------------+
only showing top 20 rows
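To make clear what the generated query computes, the following sketch reproduces its filter, grouping, and rounding logic in plain Python. The rows are made up for illustration and do not come from samples.nyctaxi.trips. The function name average_distance_by_day is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample rows: (tpep_pickup_datetime, trip_distance).
trips = [
    (datetime(2016, 1, 1, 0, 15), 2.0),
    (datetime(2016, 1, 1, 9, 30), 4.4),
    (datetime(2016, 1, 2, 8, 0), 1.0),
    (datetime(2016, 1, 2, 12, 40), 2.0),
    (datetime(2016, 1, 2, 18, 5), 4.5),
    (datetime(2016, 2, 1, 7, 45), 9.9),  # outside January, filtered out
]


def average_distance_by_day(rows, year=2016, month=1):
    """Mirror the query: filter by year and month, group by date, average."""
    totals = defaultdict(lambda: [0.0, 0])  # date -> [sum of distances, count]
    for pickup, distance in rows:
        if pickup.year == year and pickup.month == month:
            bucket = totals[pickup.date()]
            bucket[0] += distance
            bucket[1] += 1
    # Sort by date, like ORDER BY pickup_date. Python's round() can differ
    # from SQL ROUND on exact ties, so the sample data avoids ties.
    return [
        (day, round(total / count, 1))
        for day, (total, count) in sorted(totals.items())
    ]


for day, avg in average_distance_by_day(trips):
    print(day, avg)
```

The February row is dropped by the year and month filter, which corresponds to the WHERE clause in the generated SQL.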
- Try creating the DataFrame, represented in this example by the variable df, with different data.
- Try using different plain English questions for the df.ai.transform function.
- Try using different GPT-4 models. See GPT-4.
- Explore additional code examples. See the following additional resources.
- The English SDK for Apache Spark repository in GitHub
- The English SDK for Apache Spark documentation website
- The blog post Introducing English as the New Programming Language for Apache Spark