When you have to send your most important writing out into the world, who is your trusted editor? At Grammarly, the question of how to earn and maintain that trust guides every decision we make. We know that even a simple grammar mistake (like using the wrong pronoun) could distract the reader and erode their trust in the writer. Consistent and reliable grammar, spelling, and punctuation suggestions are our foundation—laying the groundwork for more complex communication assistance in areas like tone or clarity.
We talked about our approach to AI broadly in our article about how Grammarly is building the future of communication. Now, we’ll dive deep into how we approach grammatical error correction (GEC). There’s always more to learn—from machine learning (ML) research, the field of linguistics, and our own users—to help us continue raising the bar for GEC quality.
What is GEC?
Grammatical error correction refers to using AI to process text that contains grammar, spelling, and punctuation mistakes and return text that's mistake-free. GEC is an important area of research in the field of natural language processing, with a formal definition and canonical benchmark datasets used to evaluate different solutions. Our applied research team has published papers on approaches to GEC that achieve state-of-the-art results.
However, what we’ve described in our research is just one component of the GEC system that we actually use in our product. In this article, we’ll further explain how the whole system works.
Why is GEC hard?
To address GEC, you need a deep understanding of language usage and norms (which are constantly changing)—and that’s one reason we have a team of expert linguists on staff. But even when the rules are known, the trouble is that there’s often more than one valid suggestion for correcting a mistake in a sentence.
For instance, the sentence fragment "This summary and any other sources is" can be corrected to either "This summary and any other sources are" or "This summary, along with any other sources, is." Both options are grammatically valid but alter the meaning in subtle ways. And further along in the sentence, we face more decision points, leading to completely different results.
With so many different options, the question is: How do we decide that one suggestion is better than another? To do so, we need a way to measure quality.
In the academic research on GEC, quality is generally defined using two metrics: precision and recall. Precision asks: Of all the suggestions that you showed, how many were actually "good" suggestions? Recall asks: Of all the possible good suggestions out there, how many did you actually show? These metrics are in natural tension—increasing precision tends to decrease recall, and vice versa.
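These two definitions can be made concrete with a small sketch. Assuming each suggestion is represented as a hashable item (here, a toy (edit, position) pair—a simplification of how real GEC evaluation aligns edits), precision and recall reduce to set arithmetic:

```python
def precision_recall(shown, gold):
    """Compute precision and recall for a set of suggestions.

    `shown` is the set of suggestions the system surfaced;
    `gold` is the set of suggestions annotators marked as good.
    """
    true_positives = len(shown & gold)
    precision = true_positives / len(shown) if shown else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# The system shows 3 suggestions; 2 of them match the 4
# suggestions that human annotators considered good.
shown = {("is->are", 5), ("their->there", 12), ("a->an", 20)}
gold = {("is->are", 5), ("their->there", 12), ("run->ran", 30), ("add-comma", 8)}
p, r = precision_recall(shown, gold)
# precision = 2/3, recall = 2/4
```

Note how showing more suggestions can only help recall but risks lowering precision—the trade-off described above.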
To measure our models’ precision and recall, we build evaluation datasets with the help of our team of expert analytical and computational linguists. We’re lucky to have some of the world’s most brilliant linguists solving the toughest communication challenges and tracking the evolution of language. You can learn more about how we build and maintain datasets here.
But our work doesn’t stop there. While we value precision and recall metrics for our evaluation datasets, we place even more importance on how users interact with our GEC: How many suggestions are we showing to users? Which suggestions do users accept, ignore, or dismiss? How are users rating the value of our suggestions? We beta test several candidate GEC updates with different precision and recall trade-offs, then look at user feedback and engagement to determine the best path.
How our GEC system works
Leading the industry in achieving high-quality grammatical error correction has been our focus since day one. We are constantly pushing the boundaries of ML research and engineering, improving not only quality but also speed, reliability, and memory consumption. We do all this while keeping the human in the loop—drawing from the expert knowledge of our team of linguists.
Our system is complex and has multiple interconnected components. For GEC suggestions, these are a few of the most important pieces:
One component of our system is a large sequence-to-sequence (seq2seq) machine translation model, based on the modern transformer architecture. It uses deep neural networks to translate text with errors into error-free text by performing a complete, one-step rewrite. When it works well, it's almost like magic—but it's not perfect and needs to be supplemented by other systems that give us more granular insight and control.
If you read our research papers on GEC or browse our GitHub repository, you’ll learn about our system of tagging specific errors in a sentence and then correcting each error separately. This contrasts with machine translation’s approach of rewriting the sentence wholesale. By using our “tag, not rewrite” system in conjunction with our seq2seq model, we get the best of both worlds: We have the context of an entirely rewritten sentence while being able to identify localized issues one by one.
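The core of the tagging approach can be illustrated with a toy sketch. The tag vocabulary below ($KEEP, $DELETE, $REPLACE_x, $APPEND_x) follows the style of published GEC taggers such as GECToR; the hand-written tag list stands in for a neural sequence tagger and is not Grammarly's actual model:

```python
def apply_tags(tokens, tags):
    """Apply one edit tag per token and return the corrected tokens."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)                       # leave token unchanged
        elif tag == "$DELETE":
            continue                                # drop the token
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])      # swap in a new token
        elif tag.startswith("$APPEND_"):
            out.append(token)                       # keep token, then
            out.append(tag[len("$APPEND_"):])       # insert a new one after it
    return out

tokens = ["This", "summary", "and", "other", "sources", "is", "reliable"]
tags   = ["$KEEP", "$KEEP", "$KEEP", "$KEEP", "$KEEP", "$REPLACE_are", "$KEEP"]
corrected = " ".join(apply_tags(tokens, tags))
# "This summary and other sources are reliable"
```

Because each error gets its own tag, the system can surface, explain, and accept or reject corrections one at a time—something a wholesale rewrite cannot do directly.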
Our third system for GEC at Grammarly consists of rules based on syntax patterns that range from capitalizing the word “I” to suggesting where you need a comma. Our computational linguists curate and build on these rules continuously to improve our ability to detect mistakes in the text and quickly react to user feedback.
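As a minimal illustration of what a pattern-based rule looks like, here is the "capitalize the word 'I'" case as a single regular expression. Real rules are curated by linguists and are far more nuanced than this sketch:

```python
import re

def capitalize_i(text):
    """Capitalize a standalone lowercase "i" (a simplified rule sketch)."""
    # \b word boundaries ensure we only match "i" as its own word,
    # not the letter inside words like "in" or "night".
    return re.sub(r"\bi\b", "I", text)

capitalize_i("i think i agree")  # → "I think I agree"
```

Rules like this are cheap to evaluate and easy to adjust, which is what lets the team react quickly to user feedback without retraining a neural model.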
Developing high-quality GEC is work that never stops—the English language keeps changing, and there’s always more we can do to improve precision and recall in order to deepen our users’ trust. One of the key tasks ahead is to make our suggestions more personalized so that we’re helping users communicate in the right ways for them. This means learning not just what quality means to users as a whole but what it means for each person’s preferences regarding grammar, spelling, and punctuation.
Interested in joining us on this journey? We’re hiring! Check out our open roles here.