
OpenAI LLM RAG: Extracting structured data at scale

  • Writer: Pevatrons Engineer
  • Aug 9, 2023
  • 4 min read

Updated: May 23, 2024



A client approached Pevatrons seeking a solution to distill structured data from diverse, unstructured documents like meeting minutes, penned by multiple authors. These documents varied widely: some were concise and spanned just a page, while others were detailed, stretching to six pages, resembling a movie script. The presentation styles ranged from sequential narratives to tabulated summaries, and the use of abbreviations fluctuated from extensive to minimal. Fortunately, the only consistent factor was that all these documents were typewritten and saved as PDFs. Our task was to design a data pipeline capable of processing this continuous influx of PDFs and converting them into JSON format.


This was around the time OpenAI released the GPT-3.5 model, and the fundamental problem of natural language processing was, for practical purposes, solved and at our disposal through this Large Language Model. We approached the task in three major steps:



Pre-processing of PDF files

The old programming adage of “garbage in, garbage out” holds true here as well. If we simply read the PDF files into text and feed them to OpenAI, we cannot expect good results. Although the details varied, the input files shared a broad common structure.



We experimented with multiple libraries to extract text from the PDFs and zeroed in on pdfplumber and PyMuPDF. In our comparison, pdfplumber was better at extracting text from tables embedded in the files, whereas PyMuPDF handled documents whose overall layout was tabular. The parameters of these libraries need to be tuned over a few iterations - for example, the density parameter can be tuned to set the minimum number of characters per point - depending on the general font size of the documents. Contrary to our expectations, this exercise consumed about 40% of the project time.
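As a rough illustration, here is a minimal sketch of the two extraction paths we compared, using the pdfplumber and PyMuPDF (fitz) Python libraries. The function names and file path are ours for illustration, and the tolerance/density settings we actually tuned are omitted.

# Minimal sketch of the two PDF-to-text paths; tuning parameters omitted.
import pdfplumber   # pip install pdfplumber
import fitz         # pip install PyMuPDF


def extract_with_pdfplumber(path: str) -> str:
    # pdfplumber works page by page; extract_text() can return None for empty pages
    with pdfplumber.open(path) as pdf:
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def extract_with_pymupdf(path: str) -> str:
    # PyMuPDF (fitz) extracts plain text per page
    with fitz.open(path) as doc:
        return "\n".join(page.get_text("text") for page in doc)


raw_text = extract_with_pdfplumber("meeting_minutes.pdf")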


Prompt Engineering

What we thought was just a fancy term turned out to be a serious engineering effort. The structured data was hierarchical, so JSON was the right choice. The challenges were:

  • Subsisting with 4k tokens - when the input text is much longer

  • Getting OpenAI to produce well-formed JSON with functions

  • Being “efficiently” specific

Subsisting with 4k tokens

When an input file ran to more than one page, the word count grew, and together with the prompt, the token count exceeded the 4k limit of the gpt-3.5-turbo model. The challenge was complicated by the fact that OpenAI’s prompt calls have no session or persistence of any sort between them. We started with this technique (sketched in code after the list):

  • Step-1: First prompt e.g. “Generate JSON with the <given> structure for <this> text input”

  • Step-2: Subsequent prompts e.g. “Generate JSON wherever incomplete with the <json got from previous call> for <this> text input”

  • Repeat step-2 for every page
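A rough sketch of that page-by-page loop, using the pre-1.0 openai Python SDK that was current at the time, is below; the prompt wording and the page_texts variable are illustrative, not our production prompts.

# Sketch of the multi-call, page-by-page technique (gpt-3.5-turbo, 4k context).
import openai


def ask(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]


page_texts = ["...text of page 1...", "...text of page 2..."]  # from the PDF step

# Step-1: the first page seeds the (minified) JSON
partial_json = ask(
    "Generate minified JSON with the given structure for this text input:\n"
    + page_texts[0]
)

# Step-2: each further page fills in whatever is still incomplete
for page_text in page_texts[1:]:
    partial_json = ask(
        "Generate JSON wherever incomplete with " + partial_json
        + " for this text input:\n" + page_text
    )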


While this achieved its purpose, it had a lot of demerits:

  • The JSON in this case had to be minified - to save tokens (not really a disadvantage)

  • The prompts were zero-shot, to conserve tokens.

  • Multiple calls to OpenAI - the longer the input file, the higher the latency. This was almost a show-stopper at times because each call took 30 seconds on average, and a long document took 2-3 minutes

  • OpenAI producing ill-formed JSON; this meant the output could not be used as-is - the JSON had to undergo “repair” before it could be used

Getting OpenAI to produce well-formed JSON with functions

While we were struggling with the ill-formed JSON issue, OpenAI released function calling in July 2023: you describe functions in the request, and the model produces structured JSON arguments for them. This was a boon for our problem. Essentially, we faked a need to “call” a function, which made OpenAI generate its output as well-formed JSON.

functions = [
  {
    "name": "save_meeting_minutes",
    "description": "Save meeting minutes in JSON format",
    "parameters": {
      "type": "object",
      "properties": {
        "meetingMetaDetails": {
          "type": "object",
          "properties": {
            "date": {
              "type": "string",
              "format": "date"
            },
            "time": {
              "type": "string",
              "pattern": "^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$"
            },
            "duration": {
              "type": "string"
            },
            "location": {
              "type": "string"
            }
          },
          "required": ["date"]
        }
      }
    }
  }
]
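For context, this is roughly how such a request looked with the pre-1.0 openai Python SDK; the system/user prompt wording and the pdf_text sample are illustrative, not our production prompts.

# Sketch of a function-calling request; forcing the "call" yields well-formed JSON.
import json
import openai

pdf_text = "Minutes of the weekly sync held on 01-07-2023 at 10:30 ..."  # from the PDF step

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system", "content": "You extract meeting minutes as structured data."},
        {"role": "user", "content": "Extract the meeting minutes from this text:\n" + pdf_text},
    ],
    functions=functions,
    function_call={"name": "save_meeting_minutes"},  # force the function call
)

# The structured output arrives as the function-call arguments (a JSON string)
arguments = response["choices"][0]["message"]["function_call"]["arguments"]
minutes = json.loads(arguments)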

This, coupled with the gpt-3.5-turbo-16k model, reversed some of the demerits:

  • The prompts could be one-shot - that gave OpenAI some bounds to work within, for example when enums were defined for fields in the JSON

  • OpenAI produced well-formed JSON with functions

  • With 16K tokens, you no longer needed to make multiple calls - so that reduced the overall latency

Being “efficiently” specific

The quality of the output from OpenAI depends on how well we craft the information we pass to it. However, just as “beauty is in the eye of the beholder”, what counts as “well-crafted” can be highly subjective.


“Extract each action item and summarize them” might look specific, but to yield better results we might need to be more specific, like “Extract each action item and summarize them in not more than 10 words”.


In the JSON structure,

"dom":"Date of meeting in dd-mm-yyyy format"

should be more specific like:

"Date_of_meeting": "Date of meeting in dd-mm-yyyy format"
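As an illustrative fragment of the function schema above (the action_items field here is our own example), descriptive keys plus an explicit description tend to guide the model better than terse keys like "dom":

# Descriptive field names and explicit descriptions guide extraction better
properties = {
    "Date_of_meeting": {
        "type": "string",
        "description": "Date of meeting in dd-mm-yyyy format",
    },
    "action_items": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Each action item summarized in not more than 10 words",
    },
}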

Post-processing of output from OpenAI

While the real crux of natural language processing is taken care of by OpenAI’s LLM, you cannot use its output blindly in production, more so when the primary purpose is information extraction. Hence, a post-processing step (sketched in code after the list) is essential to catch hallucinations or improper extraction, such as:


  • Dates can have varied formats in natural language; “1 July ‘23”, “1 2023 07”, “1-7-2023”, “1st July ‘23”, “2023-7-1” can all mean 1st July 2023 - and sometimes OpenAI does not return the date in the format you specified in the JSON structure

  • In information extraction, hallucination can be irritating; the “location” of the meeting can get extracted as “San Jose” - but the input may have no location information at all. So, cross-checking critical information is absolutely necessary.

  • Extraction of structured information like URLs is better done as a post-processing step than by asking OpenAI to do it, because it sometimes fails to extract hyperlinks
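A minimal post-processing sketch along these lines is below; the helper names are our own, and the python-dateutil package is assumed for date normalization.

# Illustrative post-processing guards: date normalization, hallucination
# check for 'location', and URL extraction done outside the LLM.
import re
from typing import Optional
from dateutil import parser as dateparser   # pip install python-dateutil


def normalize_date(raw: str) -> Optional[str]:
    # Coerce whatever date string the model returned into dd-mm-yyyy
    try:
        return dateparser.parse(raw, dayfirst=True).strftime("%d-%m-%Y")
    except (ValueError, OverflowError):
        return None


def verify_location(extracted: Optional[str], source_text: str) -> Optional[str]:
    # Drop a 'location' that never appears in the source text (hallucination guard)
    if extracted and extracted.lower() in source_text.lower():
        return extracted
    return None


def extract_urls(source_text: str) -> list:
    # Pull URLs straight from the source text instead of trusting the model
    return re.findall(r"https?://\S+", source_text)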

Conclusion

While LLMs, particularly those designed for NLP, significantly streamline such tasks, they aren't a silver bullet for production readiness. They address the core challenges, but diligent oversight is essential. For the best results, one must feed them clear data, set explicit instructions, and maintain consistent checks and balances. That is where the experienced Data Engineering folks at Pevatrons come in - making men out of boys, or rather, making prompt engineering production-ready.

