Grammarly's Journey of On-Device AI at Scale


As LLMs become more capable, user expectations for speed and reliability continue to rise, especially for enterprise and productivity applications. That's why we believe in the power of on-device AI, which can deliver real-time, high-quality experiences that match or exceed cloud-based or hybrid solutions. The challenge? There's no established blueprint for optimizing and scaling on-device models.

We've tackled this challenge head-on by developing a systematic approach to optimizing language models (like T5) to run on-device with no quality degradation. Specifically, we've reduced the memory and latency of our grammatical error correction (GEC) model by over 50%, enabling it to run efficiently on users' devices. We've also created a custom software development kit (SDK) to deliver this optimized model to millions of users across various desktop platforms. By successfully shipping this functionality at scale, we've demonstrated that our on-device approach works, making us one of the trailblazing companies in this space.

In this blog post, we'll share how we solved the technical challenges of optimizing and scaling our GEC model, establishing a foundation for future on-device AI development.

What does it take to run the GEC model on-device?

While Grammarly offers more than just grammatical error correction, it's a core part of how we empower users to communicate effectively. To ensure a seamless writing experience, Grammarly must provide high-quality suggestions in real time: in less than 100 milliseconds. Achieving this requires solving three distinct challenges:

  • Memory management: User devices have limited memory, which is often shared with other applications, making memory optimization critical for performance. Our original model (designed for cloud servers) needed almost 4 GB, roughly the size of an average desktop's RAM, making it impractical to run locally without significant optimization.
  • Computational efficiency: As with memory, user devices have limited processing power that is often shared with other applications. For a real-time experience, the model's resource-intensive operations (like inference) must run quickly without compromising quality. If our model demands too many resources, it could cause lag, interfere with other applications, or quickly drain battery life, leading to a poor user experience.
  • Cross-platform deployment: To serve all Grammarly users, our solution must work across different platforms and devices. This is challenging because each platform uses different hardware, programming languages, and machine learning APIs. We must also account for varying device capabilities, from powerful MacBooks with dedicated GPUs to budget Chromebooks with minimal resources.

To overcome these challenges, we iteratively addressed each problem: memory management, computational latency, and cross-platform deployment.

Reducing memory footprint

To ensure that GEC wouldn't impact the memory usage of other applications, we set an ambitious target of running the model in under 1 GB of memory. A common technique for this problem is quantization, in which model weights (typically 32-bit floats) are converted into smaller, less precise numbers (like 4-bit integers). While quantization greatly reduces the memory footprint, it can also degrade model accuracy.

To balance model quality and memory optimization, we experimented with different quantization levels, including BFLOAT16 and INT4. BFLOAT16 had the best results: It had minimal impact on accuracy while reducing memory usage by 50%. By combining quantization with other optimization techniques (like the graph optimizations described below), the final grammatical error correction model ran in less than 300 MB.
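To make the trade-off concrete, here is a minimal sketch (not Grammarly's implementation) of symmetric INT4-style quantization of a weight matrix using NumPy; the matrix size, per-tensor scaling, and error metric are illustrative assumptions. BFLOAT16, by contrast, is a straight cast that keeps float32's exponent range while dropping mantissa precision.

```python
# Illustrative sketch of weight quantization, not Grammarly's production code.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)   # fp32 weights: 4 bytes each

# INT4: map each weight to one of 16 signed levels using a per-tensor scale.
scale = np.abs(w).max() / 7.0                             # signed 4-bit range is [-8, 7]
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # packed two-per-byte in practice
w_hat = q.astype(np.float32) * scale                      # dequantized weights used at inference

print(f"fp32:        {w.nbytes / 1e6:.2f} MB")
print(f"int4 packed: {q.size * 0.5 / 1e6:.2f} MB")        # 8x smaller than fp32
print(f"mean abs quantization error: {np.abs(w - w_hat).mean():.4f}")
```

The same accounting explains BFLOAT16's 50% reduction: every 4-byte weight becomes 2 bytes, with far less rounding error than INT4, which is why it preserved accuracy.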

Improving computational efficiency

To ensure a real-time user experience, we calculated that the entire GEC model must run at 100+ tokens/second, which required us to reduce latency by 50%.

We started by optimizing the T5 model, the core inference engine for the GEC model. We believed that increasing the T5 model's speed from 70 tokens/second to 200 tokens/second would allow us to cross-apply those learnings to other components in the pipeline and meet our overall performance targets.
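Throughput numbers like these are typically measured by timing generation over representative inputs; the sketch below assumes hypothetical `load_model` and `generate` helpers rather than Grammarly's actual pipeline.

```python
# Minimal sketch for measuring decoder throughput in tokens/second.
# `generate` is a hypothetical callable that returns generated token IDs.
import time

def tokens_per_second(generate, prompts, warmup: int = 3) -> float:
    for prompt in prompts[:warmup]:          # warm up caches before timing
        generate(prompt)
    total_tokens, total_time = 0, 0.0
    for prompt in prompts:
        start = time.perf_counter()
        tokens = generate(prompt)
        total_time += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_time

# Hypothetical usage:
# model = load_model("gec-t5-optimized")
# print(tokens_per_second(model.generate, ["She go to school yesterday."] * 50))
```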

First, we performed manual graph optimizations on the model, reorganizing and streamlining operations in ways similar to what libraries like TensorRT and ONNX Runtime do. Next, we examined the calculations within each operation. Some optimizations were obvious, such as removing unnecessary operations (e.g., needless typecasting), minimizing time-intensive operations (e.g., reshape and transpose operations), and sequencing operations to keep data on the same processor as much as possible (data transfers are time-intensive).
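As a toy illustration of this kind of cleanup (not the actual passes applied to the T5 graph), the sketch below walks a flat list of ops, dropping casts that don't change the dtype and cancelling adjacent transposes that undo each other.

```python
# Toy graph-simplification pass: remove no-op casts and cancel inverse transposes.
from dataclasses import dataclass

@dataclass
class Op:
    kind: str     # e.g. "matmul", "cast", "transpose"
    attrs: dict   # op-specific attributes

def inverse(perm):
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return tuple(inv)

def simplify(graph):
    out = []
    for op in graph:
        if op.kind == "cast" and op.attrs["src"] == op.attrs["dst"]:
            continue                                   # cast to the same dtype is a no-op
        if (out and op.kind == "transpose" and out[-1].kind == "transpose"
                and out[-1].attrs["perm"] == inverse(op.attrs["perm"])):
            out.pop()                                  # back-to-back inverse transposes cancel
            continue
        out.append(op)
    return out

graph = [
    Op("cast", {"src": "float32", "dst": "float32"}),  # redundant typecast
    Op("transpose", {"perm": (0, 2, 1)}),
    Op("transpose", {"perm": (0, 2, 1)}),              # undoes the previous transpose
    Op("matmul", {}),
]
print([op.kind for op in simplify(graph)])             # -> ['matmul']
```

Graph compilers like TensorRT and ONNX Runtime apply many passes of this kind, plus operator fusion, which is roughly what the manual work approximated.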

One big unlock was using optimized kernels and fused operations that take advantage of the device's hardware. For example, one crucial calculation in a neural network is the multi-head attention (MHA) operation, which occurs when the model examines the user's input to determine the query and potential output. This operation requires multiple calculations to run concurrently across different parts of the input. We replaced the naive implementation of MHA with an optimized function from MLX, Apple's ML framework, thereby speeding up the model's performance.
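The sketch below shows the shape of that change using MLX's Python API: a naive attention core that materializes the full attention matrix, versus the fused `mx.fast.scaled_dot_product_attention` kernel. The tensor shapes are illustrative and not the GEC model's actual configuration.

```python
# Naive attention vs. MLX's fused scaled-dot-product-attention kernel (illustrative shapes).
import math
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64                    # batch, heads, sequence length, head dim
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
scale = 1.0 / math.sqrt(D)

# Naive core: builds the full L x L score matrix as an intermediate tensor.
scores = (q @ k.transpose(0, 1, 3, 2)) * scale
naive = mx.softmax(scores, axis=-1) @ v

# Fused kernel: a single optimized call tuned for Apple silicon.
fused = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)

print(mx.abs(naive - fused).max())            # difference should be ~0 (float error only)
```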

With these architectural and computational performance gains, we increased our T5 processing speed from 70 tokens/second to 297 tokens/second, achieving our latency goals.

Seamless cross-platform deployment

Now that we had a performant model, we faced a new challenge: deploying it to millions of Grammarly desktop users across various platforms, each with its own hardware, languages, and APIs for ML operations. A straightforward approach would be to create a separate SDK for each platform. However, that approach makes future maintenance and iteration cumbersome, as every code change must be applied to each SDK.

To solve this, we built a Rust-based SDK that runs Grammarly's machine learning models across three key platforms: Mac, Windows, and the Chrome extension. This approach lets us write code once and compile it for each platform, simplifying model deployment and maintainability. Additionally, the SDK leverages native platform ML libraries (such as Metal for Apple devices), enabling us to take advantage of hardware-specific acceleration.

The future is on-device

Using our SDK, we shipped our on-device GEC model to millions of Grammarly desktop users. Early results show no degradation in quality or performance compared to the prior cloud-based model. This confirms our belief that on-device AI can deliver real-time, high-quality experiences without relying on cloud servers, turning what was once just a vision into reality.

This work is just the first step toward powerful new AI experiences. Leveraging these optimization learnings and our SDK, we're building on-device versions of other complex writing models. If you want to tackle challenging AI optimization problems at scale, come work with us: check out our jobs page.

Special thanks to the entire team that worked on this project: Sri Malireddi, Illia Dzivinskyi, Ignat Blazhko, Dhruv Matani, and John Blatz.
