first steps #4

Open · wants to merge 21 commits into base: main
8 changes: 8 additions & 0 deletions .flake8
@@ -0,0 +1,8 @@
[flake8]
exclude = .git,.venv,__pycache__

# Config recommended by black:
# https://black.readthedocs.io/en/stable/guides/using_black_with_other_tools.html#bugbear
max-line-length = 80
extend-select = B950
extend-ignore = E203,E501,E701
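This config pairs `max-line-length = 80` with bugbear's `B950`, which, unlike strict `E501`, only flags a line once it exceeds the limit by roughly 10%. As an illustrative sketch of that check (not code from this repo):

```python
def lines_over_limit(text: str, limit: int = 80, tolerance: float = 0.10) -> list[int]:
    """Return 1-based numbers of lines longer than limit * (1 + tolerance).

    Illustrative only: mimics bugbear's B950, which (unlike strict E501)
    flags a line only once it exceeds the limit by ~10%.
    """
    cutoff = int(limit * (1 + tolerance))
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if len(line) > cutoff
    ]
```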
31 changes: 31 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,31 @@
name: Test

on:
push:
branches:
- main
pull_request:

jobs:
test:
runs-on: ubuntu-22.04
strategy:
matrix:
python-version:
# - '3.9'
- '3.12'

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install dev dependencies
run: pip install -r requirements.txt

- name: Test
run: ./ci.sh
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
credentials.yaml

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
17 changes: 17 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,17 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
# Using this mirror lets us use mypyc-compiled black, which is about 2x faster
- repo: https://github.com/psf/black-pre-commit-mirror
rev: 24.8.0
hooks:
- id: black
# It is recommended to specify the latest version of Python
# supported by your project here, or alternatively use
# pre-commit's default_language_version, see
# https://pre-commit.com/#top_level-default_language_version
language_version: python3.11
10 changes: 10 additions & 0 deletions .pytest.ini
@@ -0,0 +1,10 @@
[pytest]

# Treat warnings as errors:
filterwarnings =
error

addopts = --doctest-glob '*.md' --doctest-modules

# If an xfail starts passing unexpectedly, that should count as a failure:
xfail_strict=true
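The `filterwarnings = error` setting makes any warning raised during a test fail it. The same escalation can be reproduced with the standard library, as a sketch:

```python
import warnings


def noisy() -> int:
    # Emits the kind of warning that `filterwarnings = error` would turn
    # into a test failure.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


with warnings.catch_warnings():
    warnings.simplefilter("error")  # what pytest's `error` filter does
    try:
        noisy()
        escalated = False
    except DeprecationWarning:
        escalated = True
```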
17 changes: 17 additions & 0 deletions README.md
@@ -1,2 +1,19 @@
# llm-for-dp-qa
Experiments with LLMs for Q+A about DP

To get started:
```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pre-commit install
cp llm_for_dp_qa/credentials{-template,}.yaml
```

Fill in the git-ignored `credentials.yaml` with the [key and secret for this app](https://portal.apis.huit.harvard.edu/my-apps/6dce5383-bcb6-4c9f-bd14-8f59d356b221).

You should then be able to run queries against the API; the output will also be written to `outputs/`:
```
llm_for_dp_qa/run.py
ls llm_for_dp_qa/outputs
```
5 changes: 5 additions & 0 deletions ci.sh
@@ -0,0 +1,5 @@
#!/bin/bash

set -euo pipefail

pytest -vv --failed-first
Empty file added llm_for_dp_qa/__init__.py
Empty file.
4 changes: 4 additions & 0 deletions llm_for_dp_qa/credentials-template.yaml
@@ -0,0 +1,4 @@
# Fill this in with the values at
# https://portal.apis.huit.harvard.edu/my-apps/6dce5383-bcb6-4c9f-bd14-8f59d356b221
key: fill-in
secret: fill-in
29 changes: 29 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-15T15-01-22.yaml
@@ -0,0 +1,29 @@
- a:
- "Differential privacy is a mathematical framework designed to provide privacy\
\ guarantees when analyzing and sharing data. It aims to protect individual privacy\
\ while allowing for useful insights to be drawn from datasets. The key idea is\
\ to ensure that the output of a data analysis algorithm does not significantly\
\ depend on any single individual's data, making it difficult to infer whether\
\ any specific individual's information was included in the dataset.\n\nHere are\
\ the main concepts behind differential privacy:\n\n1. **Noise Addition**: To\
\ achieve differential privacy, random noise is added to the results of queries\
\ on the dataset. This noise is calibrated based on the sensitivity of the query,\
\ which measures how much the output could change if a single individual's data\
\ were included or excluded.\n\n2. **Privacy Parameter (\u03B5)**: Differential\
\ privacy is often quantified using a parameter \u03B5 (epsilon), which controls\
\ the trade-off between privacy and accuracy. A smaller \u03B5 provides stronger\
\ privacy guarantees but can result in less accurate results, while a larger \u03B5\
\ allows for more accurate results at the expense of privacy.\n\n3. **Indistinguishability**:\
\ The core principle of differential privacy is that an observer should not be\
\ able to determine whether a particular individual's data was included in the\
\ dataset based on the output of the analysis. This is achieved by ensuring that\
\ the outputs of the algorithm are statistically indistinguishable whether or\
\ not the individual's data is included.\n\n4. **Robustness**: Differential privacy\
\ provides a robust framework that can be applied to various types of data analysis,\
\ including statistical queries, machine learning, and data sharing.\n\nDifferential\
\ privacy has gained significant attention and adoption in various fields, including\
\ healthcare, social sciences, and technology companies, particularly for protecting\
\ user data in applications like Apple and Google.\n\nOverall, differential privacy\
\ strikes a balance between the utility of data analysis and the need to protect\
\ individual privacy, making it a powerful tool in the era of big data."
q: What is differential privacy?
62 changes: 62 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-15T15-18-47.yaml
@@ -0,0 +1,62 @@
- q: What is differential privacy?
a:
- 'Differential privacy is a mathematical framework designed to provide a formal
definition of privacy protection when analyzing and sharing data. The core idea
is to ensure that the inclusion or exclusion of a single individual''s data in
a dataset does not significantly affect the outcome of any analysis performed
on that dataset. This means that the results of queries or analyses conducted
on the data should be similar, whether or not any particular individual''s data
is included.


### Key Concepts:


1. **Privacy Budget (ε)**: Differential privacy uses a parameter known as epsilon
(ε), which quantifies the level of privacy protection. A smaller ε indicates stronger
privacy guarantees, while a larger ε allows for more accurate data analysis but
weaker privacy.


2. **Noise Addition**: To achieve differential privacy, noise is typically added
to the results of queries on the data. This noise is often drawn from a mathematical
distribution (like Laplace or Gaussian) and is calibrated based on the sensitivity
of the query—how much the output can change with the addition or removal of a
single individual''s data.


3. **Sensitivity**: The sensitivity of a function is a measure of how much the
function''s output can change when a single individual''s data is added or removed.
Functions with lower sensitivity are generally easier to protect with differential
privacy.


4. **Indistinguishability**: The fundamental goal of differential privacy is to
make it difficult for an observer to determine whether any particular individual''s
data was included in the analysis, thus protecting individual privacy.


### Applications:


Differential privacy has been adopted in various fields, including:


- **Statistics and Data Analysis**: To release aggregate statistics without compromising
individual privacy.

- **Machine Learning**: To train models on sensitive datasets while protecting
the privacy of the training data.

- **Public Data Releases**: Governments and organizations use differential privacy
to release census data or other sensitive information while maintaining the confidentiality
of individuals.


### Conclusion:


Differential privacy provides a robust framework for balancing the needs for data
utility and individual privacy. By carefully managing the trade-offs between accuracy
and privacy, it allows organizations to analyze and share data without exposing
sensitive information about individuals.'
13 changes: 13 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-15T15-43-06.yaml
@@ -0,0 +1,13 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0.7
git_hash: 1219e73
datetime: '2025-01-15T15:43:06.734594'
q_and_a:
- q: In a few words, what is differential privacy?
a:
- Differential privacy is a technique used to ensure that the output of a database
query does not reveal too much information about any individual record in the
database, thereby protecting individual privacy while still allowing for useful
data analysis.
22 changes: 22 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T15-57-55.yaml
@@ -0,0 +1,22 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
git_hash: cdb186f
datetime: '2025-01-16T15:57:55.953321'
q_and_a:
- question: In a few words, what is differential privacy?
human answer: 'Differential privacy is a rigorous mathematical definition of privacy.
Consider an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released statistics
to hide the effects of each individual.

'
llm answers:
- Differential privacy is a mathematical framework that ensures the privacy of individuals
in a dataset by adding noise to the data or query results, making it difficult
to identify any individual's information while still allowing for useful aggregate
insights.
runtime: '0:00:03.099645'
22 changes: 22 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T16-29-09.yaml
@@ -0,0 +1,22 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
datetime: '2025-01-16T16:29:09.845071'
git_hash: 0eb6250
q_and_a:
- question: In a few words, what is differential privacy?
human_answer: 'Differential privacy is a rigorous mathematical definition of privacy.
Consider an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released statistics
to hide the effects of each individual.

'
llm_answers:
- Differential privacy is a mathematical framework that ensures the privacy of individuals
in a dataset by adding noise to the data or the results of queries, making it
difficult to identify any individual's information while still allowing for useful
aggregate insights.
runtime: '0:00:02.060478'
23 changes: 23 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T16-29-59.yaml
@@ -0,0 +1,23 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0.99
datetime: '2025-01-16T16:29:59.318412'
git_hash: 0eb6250
q_and_a:
- question: In a few words, what is differential privacy?
human_answer: 'Differential privacy is a rigorous mathematical definition of privacy.
Consider an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released statistics
to hide the effects of each individual.

'
llm_answers:
- Differential privacy is a framework designed to provide strong privacy guarantees
when analyzing and sharing statistical data. It ensures that the inclusion or
exclusion of an individual's data does not significantly affect the output of
a function, thereby protecting individual privacy while still allowing for useful
insights from aggregated data.
runtime: '0:00:01.193081'
37 changes: 37 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T17-42-52.yaml
@@ -0,0 +1,37 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
datetime: '2025-01-16T17:42:52.450949'
git_hash: 550587f
q_and_a:
- question: In a few words, what is differential privacy?
human_answers:
- evaluation: Does the answer mention the rigorous mathematical definition of differential
privacy?
expected: true
actual:
- 'Yes'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
llm_answers:
- evaluation: Does the answer mention the rigorous mathematical definition of differential
privacy?
expected: true
actual:
- 'No'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
runtime: '0:00:01.043781'
49 changes: 49 additions & 0 deletions llm_for_dp_qa/outputs/2025-01-16T18-14-39.yaml
@@ -0,0 +1,49 @@
metadata:
config:
model: gpt-4o-mini
temperature: 0
datetime: '2025-01-16T18:14:39.582095'
git_hash: a25fc61
q_and_a:
- question: In a few words, what is differential privacy?
human:
? 'Differential privacy is a rigorous mathematical definition of privacy. Consider
an algorithm that analyzes a dataset and releases statistics: The algorithm
is differentially private if by looking at the output, you cannot tell whether
any individual''s data was included in the original dataset or not. Differential
privacy achieves this by carefully injecting random noise into the released
statistics to hide the effects of each individual.

'
: - evaluation: Does the answer mention the rigorous mathematical definition of
differential privacy?
expected: true
actual:
- 'Yes'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
llm:
? Differential privacy is a mathematical framework that ensures the privacy of
individuals in a dataset by adding noise to the data or the query results, making
it difficult to identify any individual's information while still allowing for
useful aggregate insights.
: - evaluation: Does the answer mention the rigorous mathematical definition of
differential privacy?
expected: true
actual:
- 'No'
- evaluation: Does the answer describe adding calibrated noise to statistics?
expected: true
actual:
- Yes.
- evaluation: Does the answer describe anonymizing data by stripping identifiers?
expected: false
actual:
- 'No'
runtime: '0:00:01.491051'
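The evaluation records above pair an `expected` boolean with a model's yes/no `actual` judgment, with inconsistent surface forms (`Yes.` vs `'No'`). A sketch of how such a record could be scored — a hypothetical helper, not code from this repo:

```python
def judgment_matches(expected: bool, actual: str) -> bool:
    """Compare a yes/no judgment string against the expected boolean.

    Hypothetical helper; normalizes variants like "Yes." and "'No'"
    seen in the output files above.
    """
    verdict = actual.strip().strip("'\"").rstrip(".").lower()
    return (verdict == "yes") == expected
```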