How to Fix Bad Data in Unstructured Datasets ✘✔
If you have worked with text data, you're likely aware of the critical role data quality plays in training an effective NLP model. Ensuring that quality, however, can be quite tricky.
In a previous role, before the advent of today's Large Language Models (LLMs), I worked on classifying the intent of customer emails. Back then, methods such as Bag of Words (BOW) and basic embeddings were the norm, and it feels like a different and distant era now. To reduce the noise in the emails, I used regular expressions (REGEX) for data cleansing. I vividly remember spending two intense weeks deep in REGEX, becoming so sick of it 😭 and yet so fluent that I may never reach that level again.
Chances are, I'll never need to do so again.
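For readers who haven't fought this particular fight, here is a minimal sketch of the kind of REGEX cleanup I mean. The patterns are illustrative stand-ins, not the ones I actually used; real email corpora need far more of them.

```python
import re

def clean_email(text: str) -> str:
    """Strip common email noise before feeding text to a classifier (illustrative patterns only)."""
    # Remove header lines such as "From:", "To:", "Subject:"
    text = re.sub(r"(?im)^(from|to|cc|subject|sent):.*$", "", text)
    # Drop URLs, which rarely carry intent signal
    text = re.sub(r"https?://\S+", "", text)
    # Cut off quoted reply threads
    text = re.sub(r"(?is)-+\s*original message\s*-+.*", "", text)
    # Collapse runs of whitespace left behind by the removals
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "Subject: RE: invoice\nPlease see https://example.com\n-- Original Message --\n(old thread)"
print(clean_email(raw))  # -> "RE: invoice Please see"
```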
Today, AI models rely heavily on unstructured data, whether you are training from scratch (not necessarily recommended) or fine-tuning an existing model. Yet most of this data, whether text or images, lacks labels or metadata.
What adds to the complexity is that what counts as “good data” versus “bad data” depends heavily on the specific use case and on user judgment, which makes the criteria for data quality hard to standardize or even define.
Introducing Lilac
Lilac is a powerful open-source library that aims to “analyze, structure and clean data with AI”.
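Getting started is lightweight. The sketch below follows the project's quickstart at the time of writing; the `project_dir` path is just an example, and the API may have evolved since, so check Lilac's docs before relying on it.

```python
# pip install lilac
import lilac as ll

# Launch Lilac's local web UI to browse, structure, and clean a dataset
# interactively. 'project_dir' is where Lilac stores its project files;
# the path used here is a placeholder.
ll.start_server(project_dir='~/my_lilac_project')
```

From there, the web UI is where most of the work happens: you load a dataset and explore it visually rather than writing cleanup scripts by hand.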