
Wolfram|Alpha and Screen Scraping

05.16.09 | 18 Comments

Perhaps the oddest thing about Wolfram|Alpha is that the text that appears in query results is not text at all, but is in fact made up of dynamically generated GIFs:

[Image: Page Info view of Wolfram|Alpha image output]

The Wolfram|Alpha FAQ claims:

All output content is rendered as images, for consistency.

Of course, a sentiment like that would make any web designer jump off a bridge. Considering all the other nice UI nuances that Wolfram|Alpha has, I call bullshit. It’s not about visual consistency.

I think it’s an attempt at preventing what is quickly becoming the bane of any informative website’s existence: screen scraping.

Screen scraping is, of course, using a script or bot to extract data from the visual output of a page. A web screen scraper digs out the needed data from the HTML source and formats it accordingly. This technique is used to subvert APIs, feeds, etc. when these “legally provided” methods of access don’t give you what you need. Of course, it’s also a way to really piss the people who run the website off, because screen scraping is typically immune to throttling and data control, unlike an API or feed, which can be monitored and cached.
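As a rough illustration of what "digging the data out of the HTML source" means, here is a minimal sketch using only Python's standard library. The page markup, the `CellScraper` class, and the `scrape_cells` helper are all made up for this example; a real scraper would fetch the page over HTTP and would have to cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page source, invented for this example.
SAMPLE_HTML = """
<html><body>
  <table id="results">
    <tr><td class="label">Population</td><td class="value">8.4 million</td></tr>
    <tr><td class="label">Area</td><td class="value">303 sq mi</td></tr>
  </table>
</body></html>
"""


class CellScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while we are inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:   # tolerate a <td> with no enclosing <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape_cells(html):
    """Return a {label: value} dict built from two-cell table rows."""
    parser = CellScraper()
    parser.feed(html)
    parser.close()
    return {row[0].strip(): row[1].strip()
            for row in parser.rows if len(row) == 2}


if __name__ == "__main__":
    print(scrape_cells(SAMPLE_HTML))
```

Nothing here asks the site for permission or goes through a metered endpoint, which is exactly why site owners dislike it.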

And notably, it appears that this kind of control is of the utmost interest to Wolfram|Alpha, as it’s part of their “Step 3: Profit!” plan (also from the FAQ):

Subscriptions will be available in the near future with enhanced features for large-scale and commercial use.

That said, I can’t imagine it would be very difficult to write an OCR-like re-texter for data scraped from Wolfram|Alpha.

When that happens, will they have to change it so all the text looks like a smear of CAPTCHA images?

What do you think about this solution to screen scraping?

18 Comments

On 05.16.09 jackb scribed these epic words:

If it is a solution to screen scraping — which, quite frankly, I think you’re mistaken about — then it’s a very bad one, because the alt attribute of each image has all of the text used to render that image.
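To make the point concrete, here is a minimal sketch of pulling the text back out of image attributes with Python's standard library. The markup in `SAMPLE_RESULT`, the image paths, and the helper names are assumptions invented for illustration, not Wolfram|Alpha's actual page structure.

```python
from html.parser import HTMLParser

# Hypothetical result markup; the real page structure may differ.
SAMPLE_RESULT = """
<div class="pod">
  <img src="/render/abc123.gif" alt="distance | 2445 miles" title="distance | 2445 miles">
  <img src="/render/spacer.gif" alt="">
</div>
"""


class AltTextScraper(HTMLParser):
    """Collect the alt text (falling back to title) of every <img>."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Prefer alt; use title when alt is missing or empty.
        text = attr_map.get("alt") or attr_map.get("title")
        if text:
            self.texts.append(text)


def extract_image_text(html):
    """Return the non-empty alt/title strings found in the given HTML."""
    parser = AltTextScraper()
    parser.feed(html)
    parser.close()
    return parser.texts


if __name__ == "__main__":
    for line in extract_image_text(SAMPLE_RESULT):
        print(line)
```

No OCR is involved: if the rendered text is duplicated in an attribute, the image is no obstacle at all.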

On 05.16.09 foo scribed these epic words:

Info seems to be entirely contained in the title attribute, at least for the moment.

On 05.16.09 foo scribed these epic words:

Bah. Jackb bested me.

On 05.16.09 Joe Crawford scribed these epic words:

Wolfram is the creator of Mathematica. The state of mathematical notation on the web across browsers is not that great. The graphs and mathematical formulae that are part of many of the use cases intended for WA simply don’t look that great using typical web typography here in 2009. Sad, since MathML has been around in some form since 1998.

Also, the alt and title of the img tags you refer to have equivalent content that can be scraped just as easily as in-line text. The bigger issue for screen scrapers is that if you disable JavaScript you get the message “To see full output you need to enable Javascript in your browser.” To do the screen scraping of this site one would need to be able to execute JavaScript to make the network calls Wolfram Alpha makes.

On 05.16.09 Paul Ford scribed these epic words:

It’s actually fairly scrape-able given the embedded title tags and cut-and-paste popup boxes. Plus you could just download the PDF and scrape that. Once you’d scraped it you’d have a half-assed Wikipedia without a front end. The data on there–look at the sources–is broad and fairly well-curated, but has little market value outside of the context of responding to parsed search queries; it’s the computability and the underlying taxonomy/quantification that’s the value prop.

So why render text as images? I’d say this could just as likely be a case of “not invented here” where the thing NIH is “the world wide web.” Mathematica has been around forever and has its own way of doing things. They know how to render exactly the way they want, and a certain typographic awareness is obviously boiled into the company’s DNA (the production process on New Kind of Science being a prime example). Rendering to HTML as a target (which likely could be done with a combination of JSMath and eeeevil stylesheet tricks as Google does in its PDF-to-HTML converter) probably looked bad (see how Sage, a competitor to Mathematica that’s open-sourced and web-driven, does it to see how many tricks are necessary–math is [sometimes] rendered to image there too, and Wikipedia also renders its math using LaTeX).

What confuses me most is that Alphram’s output is not clickable. If you’re going to get all GIF on me to allow for absolute positioning, bring back the USEMAP and let me click on key terms to auto-propagate search queries; this would DRASTICALLY lower the learning curve so that I could see which kinds of queries were possible or not without all the guessing. As to Joe’s pt. on the screen-scraping–yeah, Ajax is evil this way and hard to SEO in many cases. That said, given the open-source nature of many HTML/JavaScript engines, writing a spider is probably possible if non-trivial.

On 05.16.09 BoLe scribed these epic words:

I don’t think they could achieve their quality of output without these GIFs, covering all math and a lot of other notations. I guess this is also a reason why Wolfram|MathWorld and the LaTeX plugin for WordPress do it the same way.

On 05.16.09 mattyohe scribed these epic words:

Yeah, to the person that doesn’t run a website typesetting equations all day long it might seem odd for Wolfram to use images, but what’s actually going on here is addressed perfectly in Joe Crawford’s comment.

Also as Joe mentioned, scraping an image for its content would seem to be a harder task than just grabbing the jsonArray associated with every image.

On 05.17.09 substitute scribed these epic words:

Looks like they aren’t interested in accessibility at all. Good luck getting “large-scale” contracts from government, medical, or just well-run organizations.

On 05.17.09 mattyohe scribed these epic words:

substitute: Does it look that way? What makes you so certain?

On 05.17.09 wishi scribed these epic words:

Export is to pdf… no issues.

On 05.17.09 Les scribed these epic words:

I understand the need to do it for formulas, but it should be confined to that. This is a real step back for the semantic web that Wolfram claims to be supporting (at the very least he is using it). On the other hand, it is kewl. My guess is that it is the simplest way for them to capture Mathematica output and that is their motivation.

On 05.17.09 negatendo scribed these epic words:

Things like the .PDF export, Alt text, etc. are theoretically much easier to restrict or pull from the site should they be abused (being part of the site’s presentation and not its underlying architecture, as the image generation seems to be).

That said, you all make some great points.

I especially think Paul Ford’s point, that the real value of Wolfram|Alpha may be its underlying organization and not the data itself, is a very strong counter to the idea that they are concerned about screen scraping.

On 05.17.09 Eric Rice scribed these epic words:

My gut tells me that prevention of screen-scraping could have been the priority–luckily, there are technical issues that can hide that intent. Just like, well, anything really.

However, there’s no issue in copy/pasting from a Wolfram results page and pasting into something plain text. The data is clear as day. Try it. It can be scraped.

On 05.17.09 mike scribed these epic words:

Bah, it has nothing to do with screen scraping and everything to do with the inability of the majority of browsers to display complex equations. Since WA has a ton of mathematical/physical content, displaying equations is of the utmost importance. Hence, they render it all to images.

On 05.18.09 Michael Dennis Smith scribed these epic words:

You do realize that it’s in text form in the alt value of the image tag, right?

On 05.19.09 Benj Arriola scribed these epic words:

I am not here to disclaim your suspicions but I just wanted to share my thoughts why I would like to use images.

1. Subscripts and superscripts mess up line height, and there are a lot of these in the data: H2O, H2SO4, Na+, Cl-.
2. There are many mathematical formulas that do not fit on a single line, usually large fractions where either the numerator or the denominator may be a large formula, or even another fraction.
3. Special characters are used in formulas. Although we can write all of these as numeric character references in the &#NNNN; format, we are not 100% sure they will render well, depending on the character sets installed for each browser.
4. Many images may need labels and might be positioned in weird ways.
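As a small illustration of the character-reference approach from point 3, the sketch below decodes a few numeric references with Python's standard `html` module. The sample strings and the `decode_references` helper are chosen for this example only. Decoding always works; whether the resulting characters actually display depends on the fonts available to the reader, which is the concern raised in the list.

```python
import html


def decode_references(text):
    """Turn numeric character references like &#8322; into real characters."""
    return html.unescape(text)


# Sample strings picked for illustration.
SAMPLES = [
    "H&#8322;O",   # &#8322; is U+2082, SUBSCRIPT TWO
    "Na&#8314;",   # &#8314; is U+207A, SUPERSCRIPT PLUS SIGN
    "x&#178;",     # &#178;  is U+00B2, SUPERSCRIPT TWO
]

if __name__ == "__main__":
    for raw in SAMPLES:
        print(raw, "->", decode_references(raw))
```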

Overall, I think it is just easier to make them images or use Flash than to use tons of pixel positioning, controlling words by the pixel and using tons of floats. Now if the system works and runs fast, then I might as well use it for everything else to be consistent for the whole system.

On 05.26.09 Miercoles scribed these epic words:

DogsheepBeta gives much better answers.
http://www.openendedadventure.com/DogsheepBeta.aspx

On 05.28.09 Anonymous scribed these epic words:

While I agree with Joe’s reply that images are generated for typographical reasons rather than securing their data sets, I don’t think this is any better of a reason. The accessibility concerns raised here are significant. Forcing the rendering of images on all browsers all the time, and requiring not only intensive client-side JS to load the content but also extremely intensive server-side processes, is a bad idea; the latter will probably significantly hinder scalability. Frankly, it’s a copout.

There are browsers that support MathML and there are browsers that support Unicode and most of the notations used. These technologies should be taken advantage of when possible. For a site that claims to be smart enough to understand natural language, I’m sure they could make their algorithms smart enough to drop to plaintext when there are no typographical issues. They already use plaintext with images quite successfully on their sister site MathWorld. I see nothing inconsistent about that UI. The value of having directly copyable text is huge from a user experience point of view, especially for a system where data is meant to be taken out.

On a sidenote, I really don’t understand this “Export to PDF” thing. Are they trying to reimplement the browser on the server side? This is something 99% of browsers and OS’s in general already support via the Print functionality (it might be XPS or PS depending on the system). It seems like yet another service that needlessly tasks their systems.