first steps #4

Open · wants to merge 21 commits into base: main
8 changes: 8 additions & 0 deletions .flake8
@@ -0,0 +1,8 @@
[flake8]
exclude = .git,.venv,__pycache__

# Config recommended by black:
# https://black.readthedocs.io/en/stable/guides/using_black_with_other_tools.html#bugbear
max-line-length = 80
extend-select = B950
extend-ignore = E203,E501,E701
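This config pairs `max-line-length = 80` with bugbear's `B950`, which, unlike strict `E501`, only flags a line once it exceeds the limit by roughly 10%. As an illustrative sketch of that check (not code from this repo):

```python
def lines_over_limit(text: str, limit: int = 80, tolerance: float = 0.10) -> list[int]:
    """Return 1-based numbers of lines longer than limit * (1 + tolerance).

    Illustrative only: mimics bugbear's B950, which (unlike strict E501)
    flags a line only once it exceeds the limit by ~10%.
    """
    cutoff = int(limit * (1 + tolerance))
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if len(line) > cutoff
    ]
```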
31 changes: 31 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,31 @@
name: Test

on:
push:
branches:
- main
pull_request:

jobs:
test:
runs-on: ubuntu-22.04
strategy:
matrix:
python-version:
# - '3.9'
- '3.12'

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install dev dependencies
run: pip install -r requirements.txt

- name: Test
run: ./ci.sh
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
credentials.yaml

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
17 changes: 17 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,17 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
# Using this mirror lets us use mypyc-compiled black, which is about 2x faster
- repo: https://github.com/psf/black-pre-commit-mirror
rev: 24.8.0
hooks:
- id: black
# It is recommended to specify the latest version of Python
# supported by your project here, or alternatively use
# pre-commit's default_language_version, see
# https://pre-commit.com/#top_level-default_language_version
language_version: python3.11
10 changes: 10 additions & 0 deletions .pytest.ini
@@ -0,0 +1,10 @@
[pytest]

# Treat warnings as errors:
filterwarnings =
error

addopts = --doctest-glob '*.md' --doctest-modules

# If an xfail starts passing unexpectedly, that should count as a failure:
xfail_strict=true
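The `filterwarnings = error` setting makes any warning raised during a test fail it. The same escalation can be reproduced with the standard library, as a sketch:

```python
import warnings


def noisy() -> int:
    # Emits the kind of warning that `filterwarnings = error` would turn
    # into a test failure.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


with warnings.catch_warnings():
    warnings.simplefilter("error")  # what pytest's `error` filter does
    try:
        noisy()
        escalated = False
    except DeprecationWarning:
        escalated = True
```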
17 changes: 17 additions & 0 deletions README.md
@@ -1,2 +1,19 @@
# llm-for-dp-qa
Experiments with LLMs for Q+A about DP

To get started:
```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pre-commit install
cp llm_for_dp_qa/credentials{-template,}.yaml
```

Fill in the git-ignored `credentials.yaml` with the [key and secret for this app](https://portal.apis.huit.harvard.edu/my-apps/6dce5383-bcb6-4c9f-bd14-8f59d356b221).

You should then be able to run queries against the API; the output will also be written to `outputs/`:
```
llm_for_dp_qa/run.py
ls llm_for_dp_qa/outputs
```
5 changes: 5 additions & 0 deletions ci.sh
@@ -0,0 +1,5 @@
#!/bin/bash

set -euo pipefail

pytest -vv --failed-first
Empty file added llm_for_dp_qa/__init__.py
Empty file.
4 changes: 4 additions & 0 deletions llm_for_dp_qa/credentials-template.yaml
@@ -0,0 +1,4 @@
# Fill this in with the values at
# https://portal.apis.huit.harvard.edu/my-apps/6dce5383-bcb6-4c9f-bd14-8f59d356b221
key: fill-in
secret: fill-in
29 changes: 29 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-15T15-01-22.yaml
@@ -0,0 +1,29 @@
- a:
- "Differential privacy is a mathematical framework designed to provide privacy\
\ guarantees when analyzing and sharing data. It aims to protect individual privacy\
\ while allowing for useful insights to be drawn from datasets. The key idea is\
\ to ensure that the output of a data analysis algorithm does not significantly\
\ depend on any single individual's data, making it difficult to infer whether\
\ any specific individual's information was included in the dataset.\n\nHere are\
\ the main concepts behind differential privacy:\n\n1. **Noise Addition**: To\
\ achieve differential privacy, random noise is added to the results of queries\
\ on the dataset. This noise is calibrated based on the sensitivity of the query,\
\ which measures how much the output could change if a single individual's data\
\ were included or excluded.\n\n2. **Privacy Parameter (\u03B5)**: Differential\
\ privacy is often quantified using a parameter \u03B5 (epsilon), which controls\
\ the trade-off between privacy and accuracy. A smaller \u03B5 provides stronger\
\ privacy guarantees but can result in less accurate results, while a larger \u03B5\
\ allows for more accurate results at the expense of privacy.\n\n3. **Indistinguishability**:\
\ The core principle of differential privacy is that an observer should not be\
\ able to determine whether a particular individual's data was included in the\
\ dataset based on the output of the analysis. This is achieved by ensuring that\
\ the outputs of the algorithm are statistically indistinguishable whether or\
\ not the individual's data is included.\n\n4. **Robustness**: Differential privacy\
\ provides a robust framework that can be applied to various types of data analysis,\
\ including statistical queries, machine learning, and data sharing.\n\nDifferential\
\ privacy has gained significant attention and adoption in various fields, including\
\ healthcare, social sciences, and technology companies, particularly for protecting\
\ user data in applications like Apple and Google.\n\nOverall, differential privacy\
\ strikes a balance between the utility of data analysis and the need to protect\
\ individual privacy, making it a powerful tool in the era of big data."
q: What is differential privacy?
62 changes: 62 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-15T15-18-47.yaml
@@ -0,0 +1,62 @@
- q: What is differential privacy?
a:
- 'Differential privacy is a mathematical framework designed to provide a formal
definition of privacy protection when analyzing and sharing data. The core idea
is to ensure that the inclusion or exclusion of a single individual''s data in
a dataset does not significantly affect the outcome of any analysis performed
on that dataset. This means that the results of queries or analyses conducted
on the data should be similar, whether or not any particular individual''s data
is included.


### Key Concepts:


1. **Privacy Budget (ε)**: Differential privacy uses a parameter known as epsilon
(ε), which quantifies the level of privacy protection. A smaller ε indicates stronger
privacy guarantees, while a larger ε allows for more accurate data analysis but
weaker privacy.


2. **Noise Addition**: To achieve differential privacy, noise is typically added
to the results of queries on the data. This noise is often drawn from a mathematical
distribution (like Laplace or Gaussian) and is calibrated based on the sensitivity
of the query—how much the output can change with the addition or removal of a
single individual''s data.


3. **Sensitivity**: The sensitivity of a function is a measure of how much the
function''s output can change when a single individual''s data is added or removed.
Functions with lower sensitivity are generally easier to protect with differential
privacy.


4. **Indistinguishability**: The fundamental goal of differential privacy is to
make it difficult for an observer to determine whether any particular individual''s
data was included in the analysis, thus protecting individual privacy.


### Applications:


Differential privacy has been adopted in various fields, including:


- **Statistics and Data Analysis**: To release aggregate statistics without compromising
individual privacy.

- **Machine Learning**: To train models on sensitive datasets while protecting
the privacy of the training data.

- **Public Data Releases**: Governments and organizations use differential privacy
to release census data or other sensitive information while maintaining the confidentiality
of individuals.


### Conclusion:


Differential privacy provides a robust framework for balancing the needs for data
utility and individual privacy. By carefully managing the trade-offs between accuracy
and privacy, it allows organizations to analyze and share data without exposing
sensitive information about individuals.'
13 changes: 13 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-15T15-43-06.yaml
@@ -0,0 +1,13 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0.7
git_hash: 1219e73
datetime: '2025-01-15T15:43:06.734594'
q_and_a:
- q: In a few words, what is differential privacy?
a:
- Differential privacy is a technique used to ensure that the output of a database
query does not reveal too much information about any individual record in the
database, thereby protecting individual privacy while still allowing for useful
data analysis.
22 changes: 22 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T15-57-55.yaml
@@ -0,0 +1,22 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
git_hash: cdb186f
datetime: '2025-01-16T15:57:55.953321'
q_and_a:
- question: In a few words, what is differential privacy?
human answer: 'Differential privacy is a rigorous mathematical definition of privacy.
Consider an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released statistics
to hide the effects of each individual.

'
llm answers:
- Differential privacy is a mathematical framework that ensures the privacy of individuals
in a dataset by adding noise to the data or query results, making it difficult
to identify any individual's information while still allowing for useful aggregate
insights.
runtime: '0:00:03.099645'
22 changes: 22 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T16-29-09.yaml
@@ -0,0 +1,22 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
datetime: '2025-01-16T16:29:09.845071'
git_hash: 0eb6250
q_and_a:
- question: In a few words, what is differential privacy?
human_answer: 'Differential privacy is a rigorous mathematical definition of privacy.
Consider an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released statistics
to hide the effects of each individual.

'
llm_answers:
- Differential privacy is a mathematical framework that ensures the privacy of individuals
in a dataset by adding noise to the data or the results of queries, making it
difficult to identify any individual's information while still allowing for useful
aggregate insights.
runtime: '0:00:02.060478'
23 changes: 23 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T16-29-59.yaml
@@ -0,0 +1,23 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0.99
datetime: '2025-01-16T16:29:59.318412'
git_hash: 0eb6250
q_and_a:
- question: In a few words, what is differential privacy?
human_answer: 'Differential privacy is a rigorous mathematical definition of privacy.
Consider an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released statistics
to hide the effects of each individual.

'
llm_answers:
- Differential privacy is a framework designed to provide strong privacy guarantees
when analyzing and sharing statistical data. It ensures that the inclusion or
exclusion of an individual's data does not significantly affect the output of
a function, thereby protecting individual privacy while still allowing for useful
insights from aggregated data.
runtime: '0:00:01.193081'
37 changes: 37 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T17-42-52.yaml
@@ -0,0 +1,37 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
datetime: '2025-01-16T17:42:52.450949'
git_hash: 550587f
q_and_a:
- question: In a few words, what is differential privacy?
human_answers:
- evaluation: Does the answer mention the rigorous mathematical definition of differential
privacy?
expected: true
actual:
- 'Yes'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
llm_answers:
- evaluation: Does the answer mention the rigorous mathematical definition of differential
privacy?
expected: true
actual:
- 'No'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
runtime: '0:00:01.043781'
49 changes: 49 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T18-14-39.yaml
@@ -0,0 +1,49 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
datetime: '2025-01-16T18:14:39.582095'
git_hash: a25fc61
q_and_a:
- question: In a few words, what is differential privacy?
human:
? 'Differential privacy is a rigorous mathematical definition of privacy. Consider
an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released
statistics to hide the effects of each individual.

'
: - evaluation: Does the answer mention the rigorous mathematical definition of
differential privacy?
expected: true
actual:
- 'Yes'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
llm:
? Differential privacy is a mathematical framework that ensures the privacy of
individuals in a dataset by adding noise to the data or the query results, making
it difficult to identify any individual's information while still allowing for
useful aggregate insights.
: - evaluation: Does the answer mention the rigorous mathematical definition of
differential privacy?
expected: true
actual:
- 'No'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
runtime: '0:00:01.491051'
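The evaluation records above pair an `expected` boolean with a model's yes/no `actual` judgment, with inconsistent surface forms (`Yes.` vs `'No'`). A sketch of how such a record could be scored — a hypothetical helper, not code from this repo:

```python
def judgment_matches(expected: bool, actual: str) -> bool:
    """Compare a yes/no judgment string against the expected boolean.

    Hypothetical helper; normalizes variants like "Yes." and "'No'"
    seen in the output files above.
    """
    verdict = actual.strip().strip("'\"").rstrip(".").lower()
    return (verdict == "yes") == expected
```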