What Is The GLUE Benchmark for NLU Systems?
There are plenty of in-depth explanations elsewhere, so here I'd like to share some example questions as they might appear in an interview setting.
What is the GLUE benchmark for natural language understanding systems?
Here are some example answers for readers’ reference:
GLUE stands for General Language Understanding Evaluation. It is one of the most commonly used benchmarks in natural language processing: a collection of datasets and tasks used to train, evaluate, and analyze natural language understanding systems. Its datasets span several genres and come in different sizes and degrees of difficulty.
Specifically, GLUE consists of:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
Additional discussion about GLUE
The following is a summary of each GLUE task:
Summary of GLUE benchmark
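In case the summary table above does not render, the nine tasks can also be sketched in code. The task types and leaderboard metrics below are summarized from the GLUE paper; dataset sizes are omitted.

```python
# The nine GLUE tasks, with task type and the metric(s) reported
# on the leaderboard (summarized from the GLUE paper).
GLUE_TASKS = {
    "CoLA":  ("single sentence: linguistic acceptability", "Matthews corr."),
    "SST-2": ("single sentence: sentiment",                "accuracy"),
    "MRPC":  ("sentence pair: paraphrase",                 "F1 / accuracy"),
    "STS-B": ("sentence pair: semantic similarity",        "Pearson / Spearman corr."),
    "QQP":   ("sentence pair: question paraphrase",        "F1 / accuracy"),
    "MNLI":  ("sentence pair: natural language inference", "accuracy"),
    "QNLI":  ("sentence pair: QA-derived inference",       "accuracy"),
    "RTE":   ("sentence pair: textual entailment",         "accuracy"),
    "WNLI":  ("sentence pair: Winograd coreference",       "accuracy"),
}

for name, (kind, metric) in GLUE_TASKS.items():
    print(f"{name:6} | {kind:45} | {metric}")
```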
The GLUE benchmark includes a human baseline, which you can see on its leaderboard. On many of the nine tasks, models have already surpassed human performance.
The Winograd task (WNLI) is a good example of humans outperforming machines. This is by design: the task requires common sense and world knowledge to answer correctly, and items are sometimes worded to deliberately foil approaches that rely entirely on statistical co-occurrence. A classic example: in "The trophy doesn't fit in the suitcase because it is too big," deciding whether "it" refers to the trophy or the suitcase requires world knowledge, and swapping "big" for "small" flips the answer.
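For intuition about the leaderboard: the single overall GLUE number is, roughly, an unweighted average of the per-task scores (tasks reporting two metrics contribute the average of the pair). A minimal sketch with made-up per-task scores, not from any real submission:

```python
# Overall GLUE-style score as the unweighted mean of per-task scores.
# These numbers are hypothetical, purely for illustration.
task_scores = {
    "CoLA": 60.5, "SST-2": 94.9, "MRPC": 89.3, "STS-B": 87.6,
    "QQP": 72.1, "MNLI": 86.7, "QNLI": 92.7, "RTE": 70.1, "WNLI": 65.1,
}

glue_score = sum(task_scores.values()) / len(task_scores)
print(f"Overall GLUE score: {glue_score:.1f}")  # -> 79.9 for these numbers
```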
Check the explanation by Dr. Younes Bensouda Mourri from DeepLearning.AI:
To download the GLUE datasets: this script contains instructions and code.
An example of fine-tuning a BERT model for testing on GLUE:
That embedded example is essentially what the implementation of BertForSequenceClassification looks like: BERT plus a small classification head.
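To make that concrete, the head on top of BERT is just a linear layer over the pooled [CLS] vector, producing one logit per class (BertForSequenceClassification also applies dropout before it). A pure-Python sketch with toy dimensions and made-up numbers, not real BERT weights:

```python
import math

def classification_head(pooled, weights, bias):
    """Linear layer over the pooled [CLS] vector: one logit per class.

    Mirrors the shape of BertForSequenceClassification's head
    (dropout omitted); all values here are toy numbers.
    """
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]

def softmax(logits):
    """Convert logits into class probabilities."""
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy 4-dim "pooled output" and a 2-class head (e.g. SST-2: neg/pos).
pooled  = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, -0.3, 0.4], [-0.2, 0.1, 0.5, -0.1]]
bias    = [0.0, 0.1]

logits = classification_head(pooled, weights, bias)
probs  = softmax(logits)
print("predicted class:", max(range(len(probs)), key=probs.__getitem__))
```

In the real model, `pooled` has hidden-size dimensions (768 for BERT-base) and the weights are learned during fine-tuning on the GLUE task.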
Happy practicing!
Thanks for reading my newsletter. You can follow me on LinkedIn!
Note: an interview question can be answered from different angles. This newsletter does not try to answer every question exhaustively; rather, it shares quick insights to help readers think, practice, and do further research as necessary.
Source of video/answers: Natural Language Processing with Attention Models by Dr. Younes Bensouda Mourri from DeepLearning.AI
Source of images/good reads: GLUE Explained: Understanding BERT Through Benchmarks by Chris McCormick