If you have worked with text data, you’re likely aware of the critical role that data quality plays in effectively training an NLP model. However, ensuring data quality can be quite tricky.
In a previous role, I classified the intent of customer emails, well before today’s Large Language Models (LLMs) arrived. Back then, methods such as Bag of Words (BoW) or basic embeddings were commonly used, and it feels like a distant era now. To reduce the noise in emails, I used regular expressions (regex) for data cleansing. I vividly remember spending two intense weeks diving deep into regex, becoming so sick of it 😭 and yet so good at it that I might never reach that level again.
Chances are, I will never need to.
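For a flavor of what that kind of regex cleansing looks like, here is a minimal sketch. It is not the code from that project: the patterns, the placeholder tokens, and the function name `clean_email` are all illustrative assumptions, and real email data would need patterns tuned to its own quirks.

```python
import re

# Hypothetical patterns for illustration only.
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$", re.MULTILINE)  # "On <date>, X wrote:" lines
QUOTED_REPLY = re.compile(r"^>.*$", re.MULTILINE)              # lines quoted from earlier mails
URL = re.compile(r"https?://\S+")
EMAIL_ADDRESS = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
WHITESPACE = re.compile(r"\s+")


def clean_email(text: str) -> str:
    """Strip quoted replies and mask URLs/addresses before feature extraction."""
    text = REPLY_HEADER.sub(" ", text)
    text = QUOTED_REPLY.sub(" ", text)
    text = URL.sub(" <URL> ", text)
    text = EMAIL_ADDRESS.sub(" <EMAIL> ", text)
    # Collapse the leftover whitespace into single spaces.
    return WHITESPACE.sub(" ", text).strip()
```

Masking URLs and addresses with placeholder tokens, rather than deleting them, keeps the signal that "a link was here" available to a Bag of Words model without bloating the vocabulary.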
Today, AI models rely heavily on unstructured data, whether you are training from scratch (not necessarily recommended) or fine-tuning an existing model. However, most of this data, whether it’s language or images, lacks labels or metadata.
What adds to the complexity is that determining what constitutes “good data” versus “bad data” heavily depends on the specific use case and user judgment. This makes it challenging to standardize or define the criteria for data quality.
Lilac is a powerful open-source library that aims to “analyze, structure and clean data with AI”.
The primary objective of Lilac is to make unstructured data more visible, quantifiable, and malleable, ultimately leading to:
- Improved quality of AI models.
- Enhanced actionability in cases where AI models fail.
- Better control and visibility of model bias.
During their research, the Lilac team noticed that, despite the variety of challenges they encountered, one theme kept recurring:
While teams would compute aggregate statistics to understand the general composition of their data, they often overlooked the raw data. When methodically organized and visualized, glaring bugs in datasets would present themselves, often with simple fixes leading to higher quality models.
They summarized their experience as:
Each dataset has its own quirks, and these quirks can have non-obvious implications…