NYPL Open E-book Hackathon
On January 14, the JSTOR Labs team took part in New York Public Library Labs’ Open Book Hack Day. Hoo-boy, what a great day. We were inspired by a dozen awesome projects, and we met oodles of smart, creative, like-minded people, all of whom are working to increase access to knowledge. Our project for the day was an experiment in improving the reading experience of page-scan content on a mobile device. Dubbed ReflowIt and best experienced using a smartphone, our project re-flows page scan content for a handful of articles so that it renders more easily on a small screen, and does so without having to rely on imperfect OCR text.
The image below, a page scan of one of the articles in JSTOR’s open Early Journal Content (EJC) collection, demonstrates the problem. This content, like many digitized-print collections, consists of scans of the original page with associated OCR’ed text and metadata. This works fine for discovery and display on a large screen, but when a user tries to view this content on a phone’s small screen, the text is small, and they need to pan and zoom and pinch in order to read it.
One way to solve this problem is to present not the image of the page but the OCR’ed text. The challenge with this approach is that OCR is imperfect, especially so with some of the historical fonts found in the EJC content. ReflowIt instead reflows the images of the words themselves for better reading on a small device:
We accomplished this by working with a couple of open-source tools: the venerable ImageMagick and a tool that we’ve only recently discovered called k2pdfopt. k2pdfopt was developed to convert PDFs for use with Kindle e-readers. It has a ton of configuration options and can be coerced into converting PDFs for other types of devices. Once we have a mobile-optimized PDF from k2pdfopt, we then use ImageMagick to extract and resample the individual page images for use with mobile apps (either web apps or native).
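To make the two-step pipeline concrete, here is a minimal sketch of what it looks like on the command line. The filenames, device dimensions, and resolutions below are illustrative assumptions, not the exact settings we used:

```shell
# Step 1: reflow the page-scan PDF for a small screen with k2pdfopt.
#   -ui-  : skip the interactive menu
#   -x    : exit when the conversion finishes
#   -w/-h : target device width/height in pixels (illustrative values)
#   -odpi : output resolution
#   -o    : output filename
k2pdfopt -ui- -x -w 600 -h 800 -odpi 167 -o article-mobile.pdf article.pdf

# Step 2: extract and resample the individual pages with ImageMagick,
# producing one PNG per page (page-000.png, page-001.png, ...) for use
# in a web or native app.
convert -density 150 article-mobile.pdf -resize 600x page-%03d.png
```

The same extraction could be scripted per-article to pre-generate images server-side rather than converting on demand.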
The reflowing of text regions from scanned images works surprisingly well, especially for those documents with simple layouts and modern typefaces. However, when trying to reflow image text in documents with more complex formats the results are spottier, and in a few cases downright awful. It’s likely that the handling of these more difficult cases can be improved with some pre-processing and configuration tuning.
While it’s clear that this sort of approach will never be perfect, this quick proof of concept has shown that it is possible to perform automated reformatting of PDFs generated from scanned images with acceptable levels of accuracy in many cases. Based on this initial exploration we believe both the general approach and the specific tools used hold promise as a way to impact a lot of content economically. This should be viewed as but one in a suite of techniques that could be used for making this type of content more mobile-friendly. We will provide a more in-depth discussion of the approach used, our detailed findings, and possible areas for future research and development in an upcoming blog post. We’re interested in hearing whether the community finds this approach useful, and in suggestions for how it might be applied. For example, one idea we had was to use it to improve the accessibility of page-scan content by increasing the display size of the rendered pages. We’d love to hear more ideas! Email us or toss in a comment below.