This workshop is for Python developers and analysts who want to use LLMs such as OpenAI's GPT models to extract structured data from text, or to answer natural-language queries against data and summarise the results.
Prerequisites: comfortable with Python and REST APIs.
We will use OpenAI's function calling via LangChain to write code that can answer natural language questions from data.
For this workshop, you need:
pip install -q pandas # To run queries on data
pip install -q openai # To call OpenAI's API
pip install -q tiktoken # To count tokens in text
pip install -q langchain # To orchestrate function calling
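tiktoken isn't used directly in the steps below, but it's handy for estimating how large a prompt is before you send it. A minimal sketch (the model name matches the one we use later; the sample text is just an illustration):
import tiktoken
# Get the tokenizer that gpt-3.5-turbo uses and count tokens in a sample string
enc = tiktoken.encoding_for_model('gpt-3.5-turbo')
len(enc.encode('Which are the most popular books after 2000?'))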
This workshop uses the Goodreads 10K dataset. YOU SHOULD USE A DIFFERENT DATASET FOR YOUR SUBMISSION. Here are some options:
- Census income
- Wine quality
- Bank marketing
- Student dropout
- Bike sharing
- Air quality
- Obesity
- Apartment for rent
- Diabetes patients
- US Census data
- Online news popularity
- Garment employee productivity
- ... or any others from any source.
For example, download the Goodreads dataset:
from urllib.request import urlretrieve
urlretrieve('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv', 'books.csv')
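It's worth a quick look at the columns before writing functions against them. A minimal check (the column names follow the goodbooks-10k schema and are the ones the functions below rely on; adjust for your own dataset):
import pandas as pd
# Peek at the columns used later: authors, publication year, rating, and ratings count
pd.read_csv('books.csv')[['title', 'authors', 'original_publication_year', 'average_rating', 'ratings_count']].head()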
Set up an OpenAI API key.
import os
os.environ['OPENAI_API_KEY'] = '...'
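If you'd rather not hard-code the key in a notebook cell, one alternative is to prompt for it at runtime:
import getpass
# Ask for the key interactively so it isn't stored in the notebook
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API key: ')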
Next, let's set up LangChain. We:
- Use GPT 3.5 Turbo since it's a reasonably inexpensive and capable model as of 10 Oct 2023
- Set temperature to 0 to get deterministic results
- Set verbose to True to see the API calls
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0, verbose=True)
Let's create a prompt that tells OpenAI to call a function. Try out different prompts to see what works best.
Note the "Today is {today}" in the system message. This passes context, allowing questions like "last year".
from langchain.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages(
[
('system', 'Call right function. Today is {today}'),
('human', '{input}'),
('human', 'Always use right format'),
]
)
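To see how the placeholders are filled in, you can render the template yourself (the values here are just an illustration):
# Render the template to inspect the messages that will be sent to the model
prompt.format_messages(today='2023-10-10', input='Which are the most popular books after 2000?')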
Define functions that can answer questions from the data. Use Google-style Python docstrings to document the functions. Specifically:
- Define the type and default of each argument
- Describe matching questions in the first paragraph, one per line
- Describe the arguments
import pandas as pd
from typing import Optional
books = pd.read_csv('books.csv')
def top_rated_books(count: Optional[int] = 10, start_year: Optional[int] = 0, end_year: Optional[int] = 9999, min_ratings_count: Optional[int] = 0) -> pd.DataFrame:
"""Which are the top rated books?
Which are the best books?
Args:
count: # books to return
start_year: year after which to consider books
end_year: year before which to consider books
min_ratings_count: min # of people who rated
"""
result = books[books['original_publication_year'] >= start_year] if start_year else books
result = result[result['original_publication_year'] <= end_year] if end_year else result
result = result[result['ratings_count'] >= min_ratings_count]
return result.sort_values('average_rating', ascending=False).head(count)
def most_popular_books(count: Optional[int]=10, start_year: Optional[int] = 0, end_year: Optional[int] = 9999, min_rating: Optional[int] = 0) -> pd.DataFrame:
"""Which are the most popular books?
What books did people read/rate the most?
Args:
count: # books to return
start_year: year after which to consider books
end_year: year before which to consider books
min_rating: min average rating
"""
result = books[books['original_publication_year'] >= start_year] if start_year else books
result = result[result['original_publication_year'] <= end_year] if end_year else result
result = result[result['average_rating'] >= min_rating]
return result.sort_values('ratings_count', ascending=False).head(count)
def most_prolific_authors(count: Optional[int]=10, start_year: Optional[int] = 0, end_year: Optional[int] = 9999, min_rating: Optional[int] = 0, min_ratings_count: Optional[int] = 0) -> pd.DataFrame:
"""Who wrote the most books?
Who are the most prolific authors?
Args:
count: # authors to return
start_year: year after which to consider books
end_year: year before which to consider books
min_rating: min average rating
min_ratings_count: min # of people who rated
"""
result = books[books['original_publication_year'] >= start_year] if start_year else books
result = result[result['original_publication_year'] <= end_year] if end_year else result
result = result[result['average_rating'] >= min_rating]
result = result[result['ratings_count'] >= min_ratings_count]
return result.groupby('authors').size().sort_values(ascending=False).head(count).to_frame('count')
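Before wiring these up to OpenAI, it's worth sanity-checking them directly; for example (the argument values are arbitrary):
# Call the functions directly to confirm they behave as expected
top_rated_books(count=3, min_ratings_count=100_000)
most_prolific_authors(count=5, start_year=1990, end_year=1999)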
Now, create a function chain that passes these functions to OpenAI to choose from.
from langchain.chains.openai_functions import create_openai_fn_chain
functions = [
top_rated_books,
most_popular_books,
most_prolific_authors
]
chain = create_openai_fn_chain(functions, llm, prompt, verbose=True)
This allows us to ask questions:
query = 'Which are the most popular books after 2000?'
today = '2023-10-10'
result = chain.generate([{'input': query, 'today': today}])
result
The result looks like this:
LLMResult(
generations=[
[
ChatGeneration(
generation_info={'finish_reason': 'function_call'},
message=AIMessage(
content='',
additional_kwargs={
'function_call': {
'name': 'most_popular_books',
'arguments': '{\n "start_year": 2000\n}'
}
}
)
)
]
],
llm_output={
'token_usage': {
'prompt_tokens': 302,
'completion_tokens': 20,
'total_tokens': 322
},
'model_name': 'gpt-3.5-turbo'
},
run=[RunInfo(run_id=UUID('...'))]
)
This identified the most_popular_books function as the one to call, along with its arguments (as JSON).
Let's write a function that calls the chosen function and also calculates the cost of the query.
import json
from datetime import datetime, timezone
# Cost of gpt-3.5-turbo in cents per token, as of 10 Oct 2023
cents = {
'prompt_tokens': 0.0015 / 1000 * 100,
'completion_tokens': 0.002 / 1000 * 100,
}
fn_map = {fn.__name__: fn for fn in functions}
def answer(query: str, today: str = datetime.now(timezone.utc).strftime('%Y-%m-%d')) -> dict:
response = chain.generate([{'input': query, 'today': today}])
# Calculate cost of query in cents
cost = sum(cents[token] * response.llm_output['token_usage'][token] for token in cents)
result = {'data': None, 'cost': cost, 'query': query, 'response': response}
# Get the function name and arguments
if len(response.generations) == 0 or len(response.generations[0]) == 0:
return result
function = response.generations[0][0].message.additional_kwargs.get('function_call', None)
if function is None or function.get('name', None) not in fn_map:
return result
# Call the function and return the result
kwargs = json.loads(function['arguments'])
result['data'] = fn_map[function['name']](**kwargs)
return result
answer1 = answer('Which are the most popular books after 2000')
answer1['data']
This should give a DataFrame that lists Suzanne Collins' The Hunger Games, followed by Stephenie Meyer's Twilight, and others.
Let's try a few more questions:
answer2 = answer('Which are the best rated books with at least 10,000 ratings in the last century?')
answer3 = answer('Who are the top 3 authors by number of books in the last decade of the 20th century?')
The top rated books last century were Calvin and Hobbes, Harry Potter and Words of Radiance.
answer2['data'].head().T
The top authors in the 1990s were Stephen King, Terry Pratchett and Nora Roberts.
answer3['data']
These queries cost about 0.05 cents each:
answer1['cost'], answer2['cost'], answer3['cost']
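As a rough check, plugging the token counts from the LLMResult shown earlier (302 prompt tokens, 20 completion tokens) into the cents table gives roughly that figure:
# 302 prompt tokens + 20 completion tokens at gpt-3.5-turbo prices (in cents)
302 * cents['prompt_tokens'] + 20 * cents['completion_tokens']  # ~0.049 cents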
The cost is driven by the number of functions we pass to OpenAI. With more functions, the cost per query increases.
Instead, let's find the 2 functions most relevant to the query (based on vector embeddings of their docstring questions) and pass only those to OpenAI. This reduces the cost to about 0.03 cents.
import numpy as np
from langchain.storage.file_system import LocalFileStore
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.cache import CacheBackedEmbeddings
from typing import List
file_store = LocalFileStore('.embeddings/')
base = OpenAIEmbeddings()
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(base, file_store, namespace=base.model)
def classify(docs: List[str], topics: List[str]):
"""Return the similarity between each doc and topic"""
doc_embed = np.array(cached_embeddings.embed_documents(docs))
topic_embed = np.array(cached_embeddings.embed_documents(topics))
return np.dot(doc_embed, topic_embed.T)
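OpenAI embeddings are (approximately) unit-length, so the dot product here is effectively a cosine similarity. A quick illustration (the texts are just examples; exact scores depend on the embedding model):
# The "read the most" query should score highest against the "most popular books" topic
classify(
    ['Which books did people read the most?'],
    ['Which are the top rated books?', 'Which are the most popular books?', 'Who wrote the most books?'],
)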
Now modify answer() to pick the top 2 matching functions and pass only those to OpenAI.
from langchain.chains.openai_functions.base import convert_python_function_to_openai_function
def answer(query: str, today: str = datetime.now(timezone.utc).strftime('%Y-%m-%d'), top=2, min_similarity=0.85) -> dict:
# Convert the functions into a JSON schema
questions = []
for name, fn in fn_map.items():
schema = convert_python_function_to_openai_function(fn)
for desc in schema['description'].split('\n'):
questions.append({'q': desc.strip(), 'fn': fn})
# Find the similarity of each function to the query
similar = classify([query], [question['q'] for question in questions])
for index, similarity in enumerate(similar[0]):
questions[index]['similarity'] = similarity
# Create chain from the top similar questions with min_similarity
questions = sorted(questions, key=lambda q: q['similarity'], reverse=True)
top_functions = [q['fn'] for q in questions if q['similarity'] >= min_similarity][:top]
chain = create_openai_fn_chain(top_functions, llm, prompt, verbose=True)
# Then run the chain. This is the same as before
response = chain.generate([{'input': query, 'today': today}])
# Calculate cost of query in cents
cost = sum(cents[token] * response.llm_output['token_usage'][token] for token in cents)
result = {'data': None, 'cost': cost, 'query': query, 'response': response, 'functions': top_functions}
# Get the function name and arguments
if len(response.generations) == 0 or len(response.generations[0]) == 0:
return result
function = response.generations[0][0].message.additional_kwargs.get('function_call', None)
if function is None or function.get('name', None) not in fn_map:
return result
# Call the function and return the result
kwargs = json.loads(function['arguments'])
result['data'] = fn_map[function['name']](**kwargs)
return result
Now let's run all 3 questions...
answer1a = answer('Which are the most popular books after 2000')
answer2a = answer('Which are the best rated books with at least 10,000 ratings in the last century?')
answer3a = answer('Who are the top 3 authors by number of books in the last decade of the 20th century?')
... and check that their cost is ~0.03 cents each:
answer1a['cost'], answer2a['cost'], answer3a['cost']
Then verify that the answers are the same too.
[
answer1['data'].equals(answer1a['data']),
answer2['data'].equals(answer2a['data']),
answer3['data'].equals(answer3a['data']),
]
- Create an issue titled "Exercise submission". Add a link to your Colab notebook.
To mark a submission as correct:
- Check if the notebook uses a different dataset and functions than the one provided
- Check if the code has been fully executed with the new dataset and without errors
- Add the Function calling skill