TL;DR

While working on another project, I wanted a way to use an LLM to get answers from spreadsheets. When I started (Autumn 2024), I was disappointed by the limitations of the approaches available.

This is mainly because all the data is loaded into context, and context is (currently) limited. Tables inherently have structure, so I was interested in seeing whether you could use that structure to get better results.

Below is an approach that combines the power of LLMs while decoupling them from the size of the data.

What?

A proof of concept that uses a large language model (LLM) to generate a bespoke “transformation machine”: a series of “transform steps” that turn natural language into computation.

What follows is a description of a concrete example and implementation; I will return to the more general question later on. The specific problem: I want the ability to ask natural language questions of tabular data and reliably get back accurate results over large datasets.

A working example for the IMDB dataset is coming soon.

Why?

We want the best of both worlds: the ability to ask questions in a natural way, combined with the accuracy and efficiency of direct computation.

We want to be able to ask questions of data without having to write SQL queries or load the data into a database.

We want to be able to ask questions directly of data, even if that data is incomplete or badly formatted.

Let’s take the IMDB dataset (1 million+ rows) as an example (https://www.kaggle.com/datasets/octopusteam/full-imdb-dataset?resource=download). We might want to find a movie set in a particular location, from a particular time period, with a certain rating.

ID        | Title                         | Type  | Genres                       | Rating | Votes | Year
tt0000009 | Miss Jerry                    | movie | Romance                      | 5.4    | 215   | 1894
tt0000147 | The Corbett-Fitzsimmons Fight | movie | Documentary, News, Sport     | 5.2    | 539   | 1897
tt0000502 | Bohemios                      | movie |                              | 4.4    | 18    | 1905
tt0000574 | The Story of the Kelly Gang   | movie | Action, Adventure, Biography | 6.0    | 941   | 1906

How would we do this currently?

  • We could manually read through the table, looking for things that seem interesting - given the table has over 1 million rows, this would be slow.
  • We could load the data into a database and write a SQL query to get the data we want - perhaps worth the effort if this is a common task.
  • We could load the data into an LLM and ask it the question - this can work, but it has some disadvantages: LLMs have limited context windows, they can make mistakes, and they can be slow. (More below.)

We want the best of all worlds, and with the right approach it is possible to have it: turning natural language questions into fast, computable steps that can be run on the data directly.

System in action

How?

The process is roughly:

  1. Metadata Generation: We programmatically inspect the structured data source and create metadata — information about columns, their types, and structure (a minimal sketch of this step follows the list).
  2. LLM Generation of Steps: We feed the LLM the natural language query, the metadata, and carefully crafted instructions. Instead of asking it for the final answer, we ask it to produce a set of transformation steps. For example, if you say, “Get me all customers in Europe,” the LLM might produce a filter step that includes a list of European countries.
  3. Execution on the Data: Another engine (like a Python script with Pandas) runs these transformation steps on the data. If a step fails—say, dates are in inconsistent formats—we can feed those problematic rows back to the LLM and fix them.
  4. Result: The output is a clean, final subset of data or an answer. The LLM only generated the instructions, so it’s not slowed down or confused by huge datasets.
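
As an illustration of step 1, here is a minimal sketch of metadata generation, assuming a CSV source and pandas; the function name and metadata schema are my own for illustration, not a fixed API:

import pandas as pd

def generate_metadata(path: str, sample_rows: int = 3) -> dict:
    # Read a slice of the file - enough to infer types, never the full dataset.
    df = pd.read_csv(path, nrows=1000)
    return {
        "columns": [
            {
                "index": i,
                "name": col,
                "inferred_type": str(df[col].dtype),
                "samples": df[col].head(sample_rows).tolist(),
            }
            for i, col in enumerate(df.columns)
        ]
    }

This, alongside the question and the step instructions, is everything the LLM sees - a few hundred tokens regardless of how many rows the file has.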

How it works

A DSL for Intelligence

I claim that table extraction is just one example of using an LLM to generate accurate, computable steps from natural language. These steps can be designed to better use the AI’s generation capabilities, producing not an answer but, in effect, a custom program for a specific question and data source.

These steps can be built up one by one to define certain capabilities on the data, such as filtering, sorting, grouping, joining, etc. They do not have to be complete or complex, so you avoid the challenge of understanding a complete DSL such as SQL or AWK. A vocabulary of this kind might be described to the LLM as shown below.
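
A hypothetical example of such a vocabulary (filter_row and filter_row_by_range appear in the worked example later in this post; sort_by_column is an invented extension, shown only to illustrate how the set can grow):

{
  "available_methods": [
    { "method": "filter_row", "args": ["row_index", "filters"] },
    { "method": "filter_row_by_range", "args": ["column_index", "column_type", "from", "to"] },
    { "method": "sort_by_column", "args": ["column_index", "descending"] }
  ]
}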

Predefined instructions are carefully crafted to leverage the AI model’s natural language understanding, resulting in an approach that:

  • Produces transformation rules that are more intuitive, flexible, and robust to variations in data structure and query semantics than traditional DSL approaches such as directly generating Structured Query Language (SQL) commands.
  • Disconnects the generation of rules from the size of the data - the system outlined here works in the same way for 10 rows or 10 million rows.
  • Better leverages the capabilities of the models to generate the correct rules (robustly reproducible, with each step easy to debug).
  • Has some surprising results, such as semantic expansion: asking a customer table for every customer in Europe returns a step with ["United Kingdom", "Germany", "France", "Italy", "Spain", "Netherlands", "Greece", "Sweden", "Poland", "Belgium", "Finland", "Denmark", "Ireland", "Portugal", "Austria", "Hungary", "Czech Republic", "Romania", "Bulgaria", "Slovakia", "Croatia", "Estonia", "Slovenia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Cyprus"]
  • Provides error-correcting and data-cleaning steps through the interplay of LLM and computation. Because we can robustly infer type information, we can use computation to find data that cannot be converted to a type and feed it back into the LLM to fix the errors. For example: we infer “signup_date” is a date, but some rows are formatted as ISO 8601 and some as “4th Feb, 2024”. We can use computation to EXTRACT the rows that are not valid dates and feed them back into the LLM to fix the errors (updating the original data).

What is happening?

In the worked example, we are effectively creating a compression of the data: we enter a natural language question and data, and we get back a compressed version of the data. That could be a final answer, or part of a larger pipeline that feeds into other computations or an LLM to reach a final answer.

Transform Steps

One of the key ideas behind this PoC is the generation of transform steps. The ideal properties of these steps:

  • Can be robustly generated by an LLM - this means being designed from the ground up to work with what LLMs do well
  • Are expressed as well-formed JSON (or an equivalent format) that can be used as input to computation
  • Are expressive enough to achieve the intended outcome

The set of rules defines what operations can be performed; it could be a pure extraction machine, or any set of operations the author wants to provide to the user. There is an open trade-off between the complexity of the operations and ultimate performance, and the most effective set of rules remains an open question.

Below are the transform steps generated by asking the IMDB table “Can you get me content from the 60s with rating higher than 7 with London in the title”.

[
  {
    "column_index": 6,
    "column_type": "int",
    "from": 1960,
    "method": "filter_row_by_range",
    "to": 1969
  },
  {
    "column_index": 4,
    "column_type": "float",
    "from": 7,
    "method": "filter_row_by_range",
    "to": null
  },
  {
    "filters": [
      "London"
    ],
    "method": "filter_row",
    "row_index": 1
  }
]

This list of steps is passed to a transform engine along with the data to produce a result, meaning the same steps are applied no matter the size of the data. A minimal sketch of such an engine is shown below.
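
This is not the PoC’s actual engine, just an illustration of the idea, assuming pandas and the step schema above (note that the generated filter_row step carries the column position under the key row_index):

import pandas as pd

def apply_steps(df: pd.DataFrame, steps: list) -> pd.DataFrame:
    for step in steps:
        if step["method"] == "filter_row_by_range":
            col = df.columns[step["column_index"]]
            # Coerce to the inferred type; values that fail to parse become
            # NaN and are excluded by the comparisons below.
            values = pd.to_numeric(df[col], errors="coerce")
            mask = pd.Series(True, index=df.index)
            if step.get("from") is not None:
                mask &= values >= step["from"]
            if step.get("to") is not None:
                mask &= values <= step["to"]
            df = df[mask]
        elif step["method"] == "filter_row":
            # The generated step uses "row_index" for the column position.
            col = df.columns[step["row_index"]]
            pattern = "|".join(step["filters"])
            df = df[df[col].astype(str).str.contains(pattern, case=False, na=False)]
    return df

Running apply_steps(pd.read_csv("imdb.csv"), steps) with the three steps above returns the matching rows with no further LLM involvement.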

Semantic Expansion

One of the surprising moments for me when designing this PoC came from inspecting the generated steps: something I call semantic expansion. This is using the knowledge of the LLM to better answer ambiguous questions. For example, we might have a customer table with a country-of-origin column, but want to ask about something that is not explicitly in the data, such as the idea of Europe. The LLM can expand Europe into a list of countries.
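
For instance, “all customers in Europe” might come back as a single filter step of this shape (a hypothetical example, assuming the country column sits at position 2; the filter list is abridged here - the full expansion appears in the list earlier):

{
  "filters": [
    "United Kingdom",
    "Germany",
    "France",
    "Italy",
    "Spain",
    "Netherlands"
  ],
  "method": "filter_row",
  "row_index": 2
}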

Error Correction

As part of the transform steps we ask the LLM to infer the type of the data (int, date, etc.). Using this information we can then attempt to convert the data into that type. We can extract all values that cannot be converted (due to bad formatting), and these values can be fed back to the LLM to correct the formatting errors, fixing the data.

[
  "2020-02-25 - on a thursday",
  "25-02-2020",
  "1st October, 2022",
  "11/11/24",
  "Thursday, 25th February 2021",
  "2020-02-25"
] => [
  "2020-02-25",
  "2020-02-25",
  "2022-10-01",
  "2024-11-11",
  "2021-02-25",
  "2020-02-25"
]
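
The extraction side of this loop is plain computation. A minimal sketch, assuming pandas and an ISO 8601 target format (the function name is my own):

import pandas as pd

def find_unparseable_dates(series: pd.Series) -> pd.Series:
    # Strict parse: anything not already in YYYY-MM-DD form becomes NaT.
    parsed = pd.to_datetime(series, format="%Y-%m-%d", errors="coerce")
    return series[parsed.isna()]

The values this returns are exactly what gets sent back to the LLM, along with an instruction to rewrite each one as YYYY-MM-DD; the corrected values are then written over the originals.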

Why not just use an LLM?

While Large Language Models are increasingly capable of handling larger datasets and performing complex analyses, several fundamental limitations make a hybrid approach more practical. The advantages over using an LLM directly:

1. Computational Efficiency

  • LLMs process entire datasets token by token, making them inefficient for large-scale data operations
  • A single query on a million-row dataset could cost hundreds of dollars in API fees
  • Traditional computational methods (like indexed searches) are orders of magnitude faster

2. Reliability and Reproducibility

  • LLMs are inherently probabilistic, potentially giving different answers to the same question
  • The transformation steps approach generates deterministic operations that can be verified and will always produce the same results
  • Error handling becomes explicit rather than hidden in the LLM’s black box

3. Resource Management

  • Context window limitations still exist even in modern LLMs
  • Processing large datasets requires chunking and managing complex prompt engineering
  • Our approach maintains constant LLM resource usage regardless of dataset size

4. Debugging and Maintenance

  • When an LLM makes a mistake, it’s often unclear why or how to fix it
  • With transformation steps, each operation is explicit and can be individually tested
  • The system can be gradually improved by refining the step generation process

5. Caching and Optimization

  • Similar queries can reuse cached transformation steps (sketched after this list)
  • Common patterns can be optimized without requiring new LLM calls
  • Intermediate results can be stored and reused
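
A minimal sketch of such a cache, keyed on the query plus the table’s metadata (call_llm is a hypothetical stand-in for whatever call generates the steps):

import hashlib
import json

_step_cache = {}

def steps_for(query: str, metadata: dict) -> list:
    # Same question against the same schema -> same steps, no new LLM call.
    key = hashlib.sha256(
        (query + json.dumps(metadata, sort_keys=True)).encode()
    ).hexdigest()
    if key not in _step_cache:
        _step_cache[key] = call_llm(query, metadata)  # hypothetical LLM call
    return _step_cache[key]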

6. Data Privacy and Security

  • Direct LLM use requires sending entire datasets to external APIs
  • Our approach only shares metadata and transformation logic
  • Sensitive data remains within your computational environment

7. Cost Scaling

  • LLM API costs scale with dataset size
  • In our approach, LLM costs remain constant regardless of data volume
  • Computation costs scale linearly, and traditional optimization techniques remain available

Side-by-side comparison, GPT-4o vs TableAI: [ChatGPT-4o IMDB results] [TableAI IMDB results]

FAQs

Examples of natural language queries

"Can you get me films from the 60s with rating higher than 7 with London in the title"
"Can I get name and email from the last 5 customers from Europe"
"Can you get me all the info for customers who joined between 2021 and 2023"

Example table from IMDB data

ID        | Title                         | Type  | Genres                       | Rating | Votes | Year
tt0000009 | Miss Jerry                    | movie | Romance                      | 5.4    | 215   | 1894
tt0000147 | The Corbett-Fitzsimmons Fight | movie | Documentary, News, Sport     | 5.2    | 539   | 1897
tt0000502 | Bohemios                      | movie |                              | 4.4    | 18    | 1905
tt0000574 | The Story of the Kelly Gang   | movie | Action, Adventure, Biography | 6.0    | 941   | 1906

Why not use SQL?

SQL is powerful but requires technical expertise and a proper database setup. This approach offers several advantages:

  1. Natural language accessibility - users don’t need to learn SQL syntax
  2. Works directly with raw data files without database setup
  3. Handles messy or inconsistent data through LLM-assisted cleaning
  4. Provides semantic understanding (like automatically knowing European countries) that SQL alone cannot provide
  5. Can work with data sources that aren’t easily imported into SQL databases

How is this different from Toolformer / tool use?

While both approaches involve LLMs interacting with tools, this system is distinct because:

  1. It generates transformation steps rather than directly executing commands
  2. The steps are composable and reusable
  3. It separates the “thinking” phase (LLM generating steps) from the “execution” phase (computing on data)
  4. It can handle arbitrary-sized datasets since the LLM only sees metadata and generates instructions
  5. The transformation steps can be cached and optimized, unlike direct tool calls

What about data privacy and security?

Since the LLM only sees metadata and sample problematic rows (not the full dataset), this approach can be more privacy-preserving than sending entire datasets to an LLM. The actual computation happens locally or in your secure environment.

How does it handle ambiguous queries?

The system can:

  1. Generate clarifying steps when needed
  2. Use semantic understanding to expand ambiguous terms (like “Europe” → list of countries)
  3. Return multiple possible interpretations of the query for user confirmation
  4. Include validation steps in the transformation pipeline

What are the performance implications?

The hybrid approach offers several performance benefits:

  1. Only calls the LLM once to generate steps, not for processing data
  2. Can cache commonly used transformation patterns
  3. Executes computations using efficient data processing libraries
  4. Scales linearly with data size since LLM involvement is constant

How reliable are the generated transformations?

The system improves reliability by:

  1. Using predefined step types that are known to work
  2. Allowing validation and testing of steps before execution (sketched after this list)
  3. Providing error correction for data quality issues
  4. Enabling human review of generated steps when needed
  5. Supporting iterative refinement of problematic transformations
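
Validation before execution can be as simple as checking each step against the known vocabulary and the table’s shape. A minimal sketch, with names of my own invention:

ALLOWED_METHODS = {"filter_row", "filter_row_by_range"}

def validate_steps(steps: list, column_count: int) -> None:
    for step in steps:
        if step["method"] not in ALLOWED_METHODS:
            raise ValueError(f"unknown method: {step['method']}")
        # Both step types carry a column position, under different keys.
        index = step.get("column_index", step.get("row_index"))
        if index is None or not 0 <= index < column_count:
            raise ValueError(f"column index out of range: {index}")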

What about maintaining and debugging?

The transformation steps provide several advantages for maintenance:

  1. Steps are explicit and can be logged
  2. Each step can be tested independently
  3. Problems can be isolated to specific transformations
  4. Steps can be version controlled and reviewed
  5. The system can generate documentation for the transformations

How does it handle complex data relationships?

Based on the metadata, the system can:

  1. Understand relationships between different data sources
  2. Generate appropriate join operations
  3. Maintain data integrity through the transformation pipeline
  4. Handle nested and hierarchical data structures

Future work

  • Fine-tuning: We currently use off-the-shelf base models. Training a model specifically for this task might yield even better, more reliable transformation steps.
  • Optimized Instructions: Right now, we have a small set of rules. There’s an opportunity to refine these instructions to produce more robust and complex transformations.
  • Other Domains: This approach isn’t limited to tables. Anywhere you have structured or semi-structured data—like logs or certain configuration formats—you could potentially apply the same pattern.