Data powers how systems learn, how products evolve, and how companies make choices. But getting answers quickly, correctly, and with the right context is often harder than it should be. To make this easier as OpenAI scales, we built a bespoke in-house AI data agent that explores and reasons over our own data platform.

A Custom Tool for Scale

Our agent is a custom internal-only tool built specifically around OpenAI's data, permissions, and workflows. OpenAI's data platform serves more than 3.5k internal users working across Engineering, Product, and Research, spanning over 600 petabytes of data across 70k datasets. At that scale, simply finding the right table can be one of the most time-consuming parts of doing analysis.

Our data agent lets employees go from question to insight in minutes, not days. This lowers the bar for pulling data and running nuanced analysis across all functions, not just the data team. Today, teams across Engineering, Data Science, Go-To-Market, Finance, and Research at OpenAI lean on the agent to answer high-impact data questions.

Six Layers of Intelligent Context

High-quality answers depend on rich, accurate context. The agent is built around multiple layers that ground it in OpenAI's data and institutional knowledge:

1. Metadata Grounding: Schema metadata and table lineage inform SQL writing and provide context on relationships.

2. Query Inference: Historical queries help the agent understand how to write its own queries and which tables are typically joined together.

3. Code-Level Understanding: By deriving code-level definitions of tables, the agent builds a deeper understanding of what the data actually contains.

4. Organizational Context: Access to Slack, Google Docs, and Notion captures critical company context such as launches, reliability incidents, and canonical metric definitions.

5. Self-Learning Memory: When the agent discovers nuances about data questions, it saves learnings for next time, allowing it to constantly improve.

6. Runtime Context: Live queries against the data warehouse let the agent validate schemas and inspect data in real time.
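To make the layering concrete, here is a minimal sketch of how these six context sources might be collected and rendered into a single prompt before query generation. All names and fields here are illustrative assumptions, not OpenAI's actual internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    """Illustrative container for the six context layers described above.
    Each field holds short text snippets gathered from that layer."""
    metadata: list[str] = field(default_factory=list)        # schemas, lineage
    historical_queries: list[str] = field(default_factory=list)
    code_definitions: list[str] = field(default_factory=list)
    org_docs: list[str] = field(default_factory=list)        # Slack / Docs / Notion
    memories: list[str] = field(default_factory=list)        # saved learnings
    runtime_checks: list[str] = field(default_factory=list)  # live schema validation

    def render(self, question: str) -> str:
        """Assemble only the non-empty layers into one grounding prompt."""
        sections = [
            ("Schema metadata", self.metadata),
            ("Similar historical queries", self.historical_queries),
            ("Code-level definitions", self.code_definitions),
            ("Organizational context", self.org_docs),
            ("Saved learnings", self.memories),
            ("Runtime validation", self.runtime_checks),
        ]
        parts = [f"Question: {question}"]
        for title, items in sections:
            if items:
                parts.append(f"## {title}\n" + "\n".join(f"- {i}" for i in items))
        return "\n\n".join(parts)

bundle = ContextBundle(
    metadata=["events.daily_active_users(user_id, day)"],
    memories=["`day` is UTC; exclude internal test accounts"],
)
prompt = bundle.render("How many users were active yesterday?")
```

The key design point is that empty layers are simply omitted, so the model only sees context that was actually retrieved for the question at hand.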

Systematic Evaluation for Trust

Building an always-on, evolving agent means quality can drift just as easily as it can improve. The agent is evaluated on curated question-answer pairs, each with a manually authored "golden" SQL query. For each eval, the team sends the natural-language question to the query-generation endpoint, executes the generated SQL, and compares the output against expected results.

Evaluation doesn't rely on naive string matching. Generated SQL can differ syntactically while still being correct. To account for this, the team compares both the SQL and resulting data, using OpenAI's Evals grader to produce a final score with explanation.
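A minimal sketch of this result-set comparison, using SQLite as a stand-in for the warehouse. The function names are hypothetical; in the real pipeline a model-based grader (such as one built on OpenAI's Evals) would score the comparison and attach an explanation, rather than the simple exact-match check shown here:

```python
import sqlite3

def run_sql(conn, sql):
    """Execute SQL and return rows as a set, for order-insensitive comparison."""
    return set(map(tuple, conn.execute(sql).fetchall()))

def grade(conn, generated_sql, golden_sql):
    """Compare result sets rather than SQL strings: queries that differ
    syntactically but return the same data count as correct."""
    try:
        got = run_sql(conn, generated_sql)
    except sqlite3.Error as e:
        return {"score": 0.0, "explanation": f"execution error: {e}"}
    want = run_sql(conn, golden_sql)
    ok = got == want
    return {"score": 1.0 if ok else 0.0,
            "explanation": "result sets match" if ok else "result sets differ"}

# Toy warehouse stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (day TEXT, n INTEGER)")
conn.executemany("INSERT INTO signups VALUES (?, ?)",
                 [("2024-01-01", 5), ("2024-01-02", 7)])

golden = "SELECT SUM(n) FROM signups"
generated = "SELECT SUM(n) AS total FROM signups"  # different text, same result
result = grade(conn, generated, golden)
```

Because the check runs both queries and compares their outputs, a column alias or reordered join in the generated SQL does not count against the agent.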

Key Lessons Learned

Building the agent from scratch surfaced practical lessons:

  • Exposing too many tools created overlapping functionality that confused the agent
  • Highly prescriptive prompting degraded results; higher-level guidance was more effective
  • Understanding code that produces data was as important as warehouse signals and metadata

The OpenAI tools used to build it are the same tools available to developers: Codex, the GPT-5 flagship model, the Evals API, and the Embeddings API.
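One way the Embeddings API plausibly fits in is semantic table discovery: embed table descriptions and the user's question, then rank tables by cosine similarity. The sketch below uses hand-written toy vectors in place of real embeddings (which would come from a model such as text-embedding-3-small) so it runs offline; the table names are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for real embeddings of table descriptions.
table_vecs = {
    "events.daily_active_users": [0.9, 0.1, 0.0],
    "finance.invoices": [0.0, 0.2, 0.9],
}
question_vec = [0.8, 0.2, 0.1]  # stand-in embedding of the user's question

# Rank candidate tables by similarity to the question.
best = max(table_vecs, key=lambda t: cosine(question_vec, table_vecs[t]))
```

At 70k datasets, this kind of ranking narrows the search before any schema or lineage context is loaded, which is what makes "finding the right table" tractable.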

Source: OpenAI Blog