Acrobat on the Web Powered by WebAssembly
PDF documents are a major part of our digital lives and, in an era where we spend most of our time working inside a web browser, enhancing the PDF experience on the web is crucial for providing a seamless, multi-device experience. As the creators of PDF, we at Adobe envisioned Acrobat Web, and we embarked on that journey with the introduction of the Document Cloud PDF Embed API in 2019.
The PDF Embed API offers Adobe’s pixel-perfect PDF viewing on the web with the promise of performance and ease of integration on all major browsers. It also offers UI customization and integration with Adobe Analytics. You can see the Embed API in action here.
PDF rendering and viewing in the Embed API is done purely on the client’s browser. All the client-side PDF heavy lifting is performed by the core component of the Embed API called Acrobat JS.
What is Acrobat JS?
Acrobat JS is a web-based PDF library powered by WebAssembly. WebAssembly (WASM) is an open standard that enables reusing native C/C++/Rust applications in a web browser at near-native performance. Acrobat JS leverages WebAssembly by running Adobe’s Mobile PDF library on the web. This is the same library that powers Adobe’s Acrobat Mobile Apps. The library’s C++ code has been compiled to WebAssembly to bring Adobe’s high-fidelity PDF rendering to the web.
The Acrobat JS project started in the very early phases of WebAssembly in 2016, which meant the documentation and support were still very new. The browser implementations and the standard were continuously changing. Debugging support was also not very convenient, especially for large codebases like ours. But the fun of working with such an amazing technology made it all worth it.
Acrobat JS Rendering Technology
Ever since we started working on Acrobat JS, we have had two major goals: good performance and high fidelity. To ensure high fidelity, our rendering technology is based on doing a pixel-perfect rendering of the PDF instead of translating the PDF content streams to Canvas or HTML. The output is a high-quality bitmap, which is shown to the user using an <img> tag. The bitmap is compressed to PNG to save memory.
Similarly, to ensure that text selection is accurate, we pass the user interaction events to the PDF library, which gives us the quads to be drawn based on the page content. The alternative, doing selection on a hidden text layer placed on top of the bitmap, may lead to alignment issues when fonts are embedded in the PDF or when there is a font mismatch or font substitution.
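To make the quad-based selection idea concrete, here is a minimal sketch. It is not the Acrobat JS interface, which this post does not show. It assumes a hypothetical `GlyphQuad` type (axis-aligned for brevity, with `index` giving reading order) and illustrates how a pointer-down and pointer-up position could be mapped to the run of quads to highlight. In the real viewer this logic lives inside the PDF library.

```typescript
// Hypothetical types for illustration only.
interface Point { x: number; y: number; }

// One quad per glyph, axis-aligned to keep the sketch short.
// `index` is the glyph's position in reading order.
interface GlyphQuad {
  index: number;
  left: number;
  top: number;
  right: number;
  bottom: number;
}

// Return the index of the glyph containing the point, or of the glyph
// whose centre is nearest to it. Returns -1 for an empty page.
function hitTest(quads: GlyphQuad[], p: Point): number {
  let best = -1;
  let bestDist = Infinity;
  for (const q of quads) {
    if (p.x >= q.left && p.x <= q.right && p.y >= q.top && p.y <= q.bottom) {
      return q.index;
    }
    const dx = (q.left + q.right) / 2 - p.x;
    const dy = (q.top + q.bottom) / 2 - p.y;
    const d = dx * dx + dy * dy;
    if (d < bestDist) {
      bestDist = d;
      best = q.index;
    }
  }
  return best;
}

// Quads to highlight for a drag from `start` to `end`, in either direction.
function selectionQuads(quads: GlyphQuad[], start: Point, end: Point): GlyphQuad[] {
  if (quads.length === 0) return [];
  const a = hitTest(quads, start);
  const b = hitTest(quads, end);
  const lo = Math.min(a, b);
  const hi = Math.max(a, b);
  return quads.filter((q) => q.index >= lo && q.index <= hi);
}
```

Because the highlighted rectangles come from the same geometry that produced the bitmap, they cannot drift from the rendered glyphs the way a separately laid-out text layer can.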
Acrobat JS: The Journey
In this section, I will talk about the challenging problems we solved to meet both of our goals for Acrobat JS: performance and high fidelity.
A major performance metric for us was the time it takes to load and initialize the PDF library as a WebAssembly (WASM) module and show the first page to the user. We call this metric timeTillFirstRender. Currently, in Acrobat JS, this time is under 900ms for 75 percent of files from our benchmarking set (composition derived from real-world analytics data). We reached these numbers after working through an improvement of about 300 percent in timeTillFirstRender by using various strategies that we discuss here.
A major bottleneck for timeTillFirstRender was the time it takes for the WASM module to get compiled and ready for use. This was especially difficult before tiered compilation was part of JS engines. A lot of the work went into keeping the WASM size as low as possible. We used various strategies to accomplish this:
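As an aside, a statement like "under 900ms for 75 percent of files" is a 75th-percentile claim. The sketch below shows one common way to compute such a figure, the nearest-rank method. The post does not say which method was actually used, and the timings here are made-up sample data, not Adobe's benchmark results.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Made-up timeTillFirstRender samples in milliseconds, one per file.
const timesMs = [420, 510, 650, 700, 780, 850, 1200, 2100];
const p75 = percentile(timesMs, 75);
const meetsTarget = p75 < 900;
```

A percentile target of this kind tolerates a slow tail (the 1200ms and 2100ms samples here) while still constraining what most users experience.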
- WASM Swapping.
- Dynamic linking.
- Moving out font data from the binary: Separating out static data like fonts from the binary and, instead, having it load dynamically when needed helped us in reducing WASM file size by about 1MB.
Let’s talk about these in more detail.
Building the library with C++ exception support enabled makes the WASM binary larger and slower to load. To overcome this performance penalty on our critical rendering path of showing the first page to the user, we devised an approach where, by default, we load a thinner WASM file with exceptions disabled and only load the exception-enabled binary when an actual exception is encountered. That’s why we call this strategy WASM Swapping.
We further extended this to implement something similar to a PGO (profile guided optimization), where we identified the hot and cold areas of our code on a big test set and then got rid of all the ‘colder’ areas from the thinner binary.
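The swap itself can be sketched as a small wrapper. This is an illustration under stated assumptions, not the Acrobat JS implementation: `PdfEngine`, `EngineLoader`, and `SwappingEngine` are hypothetical names, and any failure in the thin build is treated as the signal to swap.

```typescript
// Hypothetical shape of a loaded PDF engine, for illustration only.
interface PdfEngine {
  renderPage(pageIndex: number): string;
}

// Hypothetical loader: fetches, compiles, and instantiates one WASM build.
type EngineLoader = () => Promise<PdfEngine>;

class SwappingEngine {
  private active: Promise<PdfEngine>;
  private full: Promise<PdfEngine> | null = null;
  private readonly loadFull: EngineLoader;

  constructor(loadThin: EngineLoader, loadFull: EngineLoader) {
    this.loadFull = loadFull;
    // Only the thin, exceptions-disabled build is on the critical path.
    this.active = loadThin();
  }

  async renderPage(pageIndex: number): Promise<string> {
    const engine = await this.active;
    try {
      return engine.renderPage(pageIndex);
    } catch (err) {
      let full = this.full;
      // A failure on the full build is a genuine error: do not retry.
      if (full !== null && engine === (await full)) throw err;
      if (full === null) {
        // First failure on the thin build: load the full build once
        // and route all later calls to it.
        full = this.loadFull();
        this.full = full;
        this.active = full;
      }
      return (await full).renderPage(pageIndex);
    }
  }

  get usingFullBuild(): boolean {
    return this.full !== null;
  }
}
```

A real swap would presumably also need to reopen the document and restore viewer state in the new module. The sketch leaves that out.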
The graph below shows the gains we observed at the time of implementing this strategy on our benchmarking set:
As can be seen from the graph, this offered about 40 percent improvements in both size and performance for us, which was huge. We later changed our Mobile PDF Library code to reduce its use of exceptions, and with advancements in WebAssembly and JS engines, the penalty of exceptions has become smaller than it was in the relatively early days. We continue to use this strategy because it still offers us about 20 percent gains in both size and load-time performance.
Along our journey of bringing more and more PDF goodness to the web, we encountered situations where adding more PDF features to the library meant adding to the WASM size, thus affecting the overall performance. We could have extended our WASM Swapping approach to get around this, but we needed a longer-term solution. That’s where dynamic linking comes into the picture.
As part of dynamic linking, we divided the WASM module into a main module and several side modules. All the core PDF viewing features reside in the main module and additional features are present in side modules. We load only the main module for the critical rendering path while the side modules are loaded lazily on-demand.
This allowed us to keep adding new features without severely affecting the critical rendering path performance by adding incremental features on side modules. The current main module size is 900kB gzipped.
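The loading policy described above can be sketched as follows. The post does not name the toolchain or show the loader, so everything here is an assumption for illustration: `ModuleLoader`, `LazyModules`, and the feature and module names are hypothetical. (Emscripten, for example, supports this kind of split through its `MAIN_MODULE` and `SIDE_MODULE` build settings.)

```typescript
// Hypothetical loader: resolves once the named module is fetched and linked.
type ModuleLoader = (name: string) => Promise<void>;

class LazyModules {
  // One promise per module, so concurrent requests share a single load.
  private readonly loaded = new Map<string, Promise<void>>();
  private readonly load: ModuleLoader;
  private readonly featureToModule: Record<string, string>;

  constructor(load: ModuleLoader, featureToModule: Record<string, string>) {
    this.load = load;
    this.featureToModule = featureToModule;
  }

  // Load the main module up front, on the critical rendering path.
  start(): Promise<void> {
    return this.ensure("main");
  }

  // Load a feature's side module the first time the feature is used.
  async use(feature: string): Promise<void> {
    const name = this.featureToModule[feature];
    if (name === undefined) throw new Error(`unknown feature: ${feature}`);
    await this.ensure("main"); // side modules link against the main module
    await this.ensure(name);
  }

  private ensure(name: string): Promise<void> {
    let pending = this.loaded.get(name);
    if (pending === undefined) {
      pending = this.load(name);
      this.loaded.set(name, pending);
    }
    return pending;
  }
}
```

With this shape, adding a feature means adding one entry to the feature map and one side module. The cost of the first render depends only on the main module.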
Rendering Non-embedded Fonts With High Fidelity
Fonts required to render text in a PDF are usually embedded in the PDF itself, but among documents in the wild, it is fairly common to find PDFs where the font is not actually embedded. In such a scenario, we had to fall back on font substitution, which meant we could not meet our goal of ensuring high-fidelity rendering.
The following images show the results of this:
This is an excerpt from a PDF that requires the font Tahoma for rendering, but it’s not embedded in the PDF. One can notice the differences in fonts between the above two images especially in characters like lower-case ‘y’ and upper-case ‘I’.
The Future is Bright
WebAssembly has opened so many opportunities on the Web. Performing high octane tasks efficiently on the browser is no longer a far-fetched dream. Our journey with running high-fidelity rendering on the browser in a performant way demonstrates this.
The WebAssembly working group, the browsers, and everyone involved are doing an awesome job of ensuring WebAssembly keeps getting better every day. Some of the recent interesting developments have been around threads and SIMD support, and we are working towards using them in Acrobat JS. We keep a close watch on upcoming WebAssembly advancements and are really excited about features like Exception Handling. We look forward to advancing the Web with the help of WebAssembly.