<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Wolfram&#124;Alpha and Screen Scraping</title>
	<atom:link href="http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/</link>
	<description>Everywhere with LOLercopter</description>
	<lastBuildDate>Tue, 02 Mar 2010 22:54:26 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<item>
		<title>By: Anonymous</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103690</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Thu, 28 May 2009 21:57:10 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103690</guid>
		<description>While I agree with Joe&#039;s reply that images are generated for typographical reasons rather than to secure their data sets, I don&#039;t think this is any better a reason. The accessibility concerns raised here are significant. Forcing the rendering of images on all browsers all the time, and requiring not only intensive client-side JS to load the content but also extremely intensive server-side processing, is a bad idea; the server-side cost in particular will probably significantly hinder scalability. Frankly, it&#039;s a copout.

There are browsers that support MathML and there are browsers that support Unicode and most of the notations used. These technologies should be taken advantage of when possible. For a site that claims to be smart enough to understand natural language, I&#039;m sure they could make their algorithms smart enough to drop to plaintext when there are no typographical issues. They already use plaintext with images quite successfully on their sister site &lt;a href=&quot;http://mathworld.wolfram.com/Pi.html&quot; rel=&quot;nofollow&quot;&gt;MathWorld&lt;/a&gt;. I see nothing inconsistent about that UI. The value of having directly copyable text is huge from a user experience point of view, especially for a system where data is &lt;em&gt;meant&lt;/em&gt; to be taken out.

On a side note, I really don&#039;t understand this &quot;Export to PDF&quot; thing. Are they trying to reimplement the browser on the server side? This is something 99% of browsers and OSes in general already support via the Print functionality (it might be XPS or PS depending on the system). It seems like yet another service that needlessly taxes their systems.</description>
		<content:encoded><![CDATA[<p>While I agree with Joe&#8217;s reply that images are generated for typographical reasons rather than to secure their data sets, I don&#8217;t think this is any better a reason. The accessibility concerns raised here are significant. Forcing the rendering of images on all browsers all the time, and requiring not only intensive client-side JS to load the content but also extremely intensive server-side processing, is a bad idea; the server-side cost in particular will probably significantly hinder scalability. Frankly, it&#8217;s a copout.</p>
<p>There are browsers that support MathML and there are browsers that support Unicode and most of the notations used. These technologies should be taken advantage of when possible. For a site that claims to be smart enough to understand natural language, I&#8217;m sure they could make their algorithms smart enough to drop to plaintext when there are no typographical issues. They already use plaintext with images quite successfully on their sister site <a href="http://mathworld.wolfram.com/Pi.html" rel="nofollow">MathWorld</a>. I see nothing inconsistent about that UI. The value of having directly copyable text is huge from a user experience point of view, especially for a system where data is <em>meant</em> to be taken out.</p>
<p>On a side note, I really don&#8217;t understand this &#8220;Export to PDF&#8221; thing. Are they trying to reimplement the browser on the server side? This is something 99% of browsers and OSes in general already support via the Print functionality (it might be XPS or PS depending on the system). It seems like yet another service that needlessly taxes their systems.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Miercoles</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103688</link>
		<dc:creator>Miercoles</dc:creator>
		<pubDate>Wed, 27 May 2009 03:35:34 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103688</guid>
		<description>DogsheepBeta gives much better answers.
http://www.openendedadventure.com/DogsheepBeta.aspx</description>
		<content:encoded><![CDATA[<p>DogsheepBeta gives much better answers.<br />
<a href="http://www.openendedadventure.com/DogsheepBeta.aspx" rel="nofollow">http://www.openendedadventure.com/DogsheepBeta.aspx</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Benj Arriola</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103672</link>
		<dc:creator>Benj Arriola</dc:creator>
		<pubDate>Tue, 19 May 2009 18:49:57 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103672</guid>
		<description>I am not here to dismiss your suspicions, but I just wanted to share my thoughts on why I would use images.

1. Subscripts and superscripts mess up line height, and there are a lot of these in the data: H2O, H2SO4, Na+, Cl-.
2. There are many mathematical formulas that do not fit on a single line, usually large fractions where the numerator or the denominator may itself be a large formula, or even another fraction.
3. Special characters are used in formulas. Although we can display them with numeric character references in the &amp;#---; format, we cannot be 100% sure they will render well, since that depends on the fonts and character sets installed for each browser.
4. Many images may need labels, which might be positioned in weird ways.

Overall, I think it is just easier to make them images or use Flash than to use tons of pixel positioning, controlling words by the pixel with tons of floats. And if the system works and runs fast, then I might as well use it for everything else, to be consistent across the whole system.</description>
		<content:encoded><![CDATA[<p>I am not here to dismiss your suspicions, but I just wanted to share my thoughts on why I would use images.</p>
<p>1. Subscripts and superscripts mess up line height, and there are a lot of these in the data: H2O, H2SO4, Na+, Cl-.<br />
2. There are many mathematical formulas that do not fit on a single line, usually large fractions where the numerator or the denominator may itself be a large formula, or even another fraction.<br />
3. Special characters are used in formulas. Although we can display them with numeric character references in the &amp;#---; format, we cannot be 100% sure they will render well, since that depends on the fonts and character sets installed for each browser.<br />
4. Many images may need labels, which might be positioned in weird ways.</p>
<p>Overall, I think it is just easier to make them images or use Flash than to use tons of pixel positioning, controlling words by the pixel with tons of floats. And if the system works and runs fast, then I might as well use it for everything else, to be consistent across the whole system.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Dennis Smith</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103666</link>
		<dc:creator>Michael Dennis Smith</dc:creator>
		<pubDate>Mon, 18 May 2009 07:52:19 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103666</guid>
		<description>You do realize that it&#039;s in text form in the alt value of the image tag, right?</description>
		<content:encoded><![CDATA[<p>You do realize that it&#8217;s in text form in the alt value of the image tag, right?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mike</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103662</link>
		<dc:creator>mike</dc:creator>
		<pubDate>Sun, 17 May 2009 17:37:15 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103662</guid>
		<description>Bah, it has nothing to do with screen scraping and everything to do with the inability of the majority of browsers to display complex equations. Since WA has a ton of mathematical/physical content, displaying equations is of the utmost importance. Hence, they render it all to images.</description>
		<content:encoded><![CDATA[<p>Bah, it has nothing to do with screen scraping and everything to do with the inability of the majority of browsers to display complex equations. Since WA has a ton of mathematical/physical content, displaying equations is of the utmost importance. Hence, they render it all to images.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Rice</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103661</link>
		<dc:creator>Eric Rice</dc:creator>
		<pubDate>Sun, 17 May 2009 17:06:54 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103661</guid>
		<description>My gut tells me that prevention of screen-scraping could have been the priority--luckily, there are technical issues that can hide that intent. Just like, well, anything really. 

However, there&#039;s no problem copying from a Wolfram results page and pasting into a plain-text editor. The data is clear as day. Try it. It can be scraped.</description>
		<content:encoded><![CDATA[<p>My gut tells me that prevention of screen-scraping could have been the priority&#8211;luckily, there are technical issues that can hide that intent. Just like, well, anything really. </p>
<p>However, there&#8217;s no problem copying from a Wolfram results page and pasting into a plain-text editor. The data is clear as day. Try it. It can be scraped.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: negatendo</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103659</link>
		<dc:creator>negatendo</dc:creator>
		<pubDate>Sun, 17 May 2009 14:45:08 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103659</guid>
		<description>Things like the PDF export, alt text, etc. are theoretically much easier to restrict or pull from the site should they be abused (being part of the site&#039;s presentation and not its underlying architecture, as the image generation seems to be).

That said, you all make some great points.

I especially think Paul Ford&#039;s point, that the real value of Wolfram&#124;Alpha may be its underlying organization and not the data itself, is a very strong counter to the idea that they may be concerned about screen scraping.</description>
		<content:encoded><![CDATA[<p>Things like the PDF export, alt text, etc. are theoretically much easier to restrict or pull from the site should they be abused (being part of the site&#8217;s presentation and not its underlying architecture, as the image generation seems to be).</p>
<p>That said, you all make some great points.</p>
<p>I especially think Paul Ford&#8217;s point, that the real value of Wolfram|Alpha may be its underlying organization and not the data itself, is a very strong counter to the idea that they may be concerned about screen scraping.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Les</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103658</link>
		<dc:creator>Les</dc:creator>
		<pubDate>Sun, 17 May 2009 11:40:30 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103658</guid>
		<description>I understand the need to do it for formulas, but it should be confined to that. This is a real step back for the semantic web that Wolfram claims to be supporting (at the very least he is using it). On the other hand, it is kewl. My guess is that it is the simplest way for them to capture Mathematica output, and that is their motivation.</description>
		<content:encoded><![CDATA[<p>I understand the need to do it for formulas, but it should be confined to that. This is a real step back for the semantic web that Wolfram claims to be supporting (at the very least he is using it). On the other hand, it is kewl. My guess is that it is the simplest way for them to capture Mathematica output, and that is their motivation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: wishi</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103657</link>
		<dc:creator>wishi</dc:creator>
		<pubDate>Sun, 17 May 2009 11:17:58 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103657</guid>
		<description>Export is to pdf... no issues.</description>
		<content:encoded><![CDATA[<p>Export is to pdf&#8230; no issues.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mattyohe</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103656</link>
		<dc:creator>mattyohe</dc:creator>
		<pubDate>Sun, 17 May 2009 08:46:51 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103656</guid>
		<description>substitute: Does it look that way? What makes you so certain?</description>
		<content:encoded><![CDATA[<p>substitute: Does it look that way? What makes you so certain?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: substitute</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103655</link>
		<dc:creator>substitute</dc:creator>
		<pubDate>Sun, 17 May 2009 07:20:37 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103655</guid>
		<description>Looks like they aren&#039;t interested in accessibility at all. Good luck getting &quot;large-scale&quot; contracts from government, medical, or just well-run organizations.</description>
		<content:encoded><![CDATA[<p>Looks like they aren&#8217;t interested in accessibility at all. Good luck getting &#8220;large-scale&#8221; contracts from government, medical, or just well-run organizations.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mattyohe</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103654</link>
		<dc:creator>mattyohe</dc:creator>
		<pubDate>Sun, 17 May 2009 06:26:45 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103654</guid>
		<description>Yeah, to the person that doesn&#039;t run a website typesetting equations all day long it might seem odd for Wolfram to use images, but what&#039;s actually going on here is addressed perfectly in Joe Crawford&#039;s comment.

Also as Joe mentioned, scraping an image for its content would seem to be a harder task than just grabbing the jsonArray associated with every image.</description>
		<content:encoded><![CDATA[<p>Yeah, to the person that doesn&#8217;t run a website typesetting equations all day long it might seem odd for Wolfram to use images, but what&#8217;s actually going on here is addressed perfectly in Joe Crawford&#8217;s comment.</p>
<p>Also as Joe mentioned, scraping an image for its content would seem to be a harder task than just grabbing the jsonArray associated with every image.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: BoLe</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103653</link>
		<dc:creator>BoLe</dc:creator>
		<pubDate>Sun, 17 May 2009 06:19:57 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103653</guid>
		<description>I don&#039;t think they could achieve their quality of output without these GIFs, covering all math and a lot of other notations. I guess this is also a reason why Wolfram&#124;MathWorld and the LaTeX plugin for WordPress do it the same way.</description>
		<content:encoded><![CDATA[<p>I don&#8217;t think they could achieve their quality of output without these GIFs, covering all math and a lot of other notations. I guess this is also a reason why Wolfram|MathWorld and the LaTeX plugin for WordPress do it the same way.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paul Ford</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103652</link>
		<dc:creator>Paul Ford</dc:creator>
		<pubDate>Sun, 17 May 2009 04:13:11 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103652</guid>
		<description>It&#039;s actually fairly scrape-able given the embedded title tags and cut-and-paste popup boxes. Plus you could just download the PDF and scrape that. Once you&#039;d scraped it you&#039;d have a half-assed Wikipedia without a front end. The data on there--look at the sources--is broad and fairly well-curated, but has little market value outside of the context of responding to parsed search queries; it&#039;s the computability and the underlying taxonomy/quantification that&#039;s the value prop.

So why render text as images? I&#039;d say this could just as likely be a case of &quot;not invented here&quot; where the thing NIH is &quot;the world wide web.&quot; Mathematica has been around forever and has its own way of doing things. They know how to render exactly the way they want, and a certain typographic awareness is obviously baked into the company&#039;s DNA (the production process on A New Kind of Science being a prime example). Rendering to HTML as a target (which likely could be done with a combination of JSMath and eeeevil stylesheet tricks as Google does in its PDF-to-HTML converter) probably looked bad (see how Sage, a competitor to Mathematica that&#039;s open-sourced and web-driven, does it to see how many tricks are necessary--math is [sometimes] rendered to image there too, and Wikipedia also renders its math using LaTeX).

What confuses me most is that Alphram&#039;s output is not clickable. If you&#039;re going to get all GIF on me to allow for absolute positioning, bring back the USEMAP and let me click on key terms to auto-propagate search queries; this would DRASTICALLY lower the learning curve so that I could see which kinds of queries were possible or not without all the guessing. As to Joe&#039;s point on the screen-scraping--yeah, Ajax is evil this way and hard to SEO in many cases. That said, given the open-source nature of many HTML/JavaScript engines, writing a spider is probably possible, if non-trivial.</description>
		<content:encoded><![CDATA[<p>It&#8217;s actually fairly scrape-able given the embedded title tags and cut-and-paste popup boxes. Plus you could just download the PDF and scrape that. Once you&#8217;d scraped it you&#8217;d have a half-assed Wikipedia without a front end. The data on there&#8211;look at the sources&#8211;is broad and fairly well-curated, but has little market value outside of the context of responding to parsed search queries; it&#8217;s the computability and the underlying taxonomy/quantification that&#8217;s the value prop.</p>
<p>So why render text as images? I&#8217;d say this could just as likely be a case of &#8220;not invented here&#8221; where the thing NIH is &#8220;the world wide web.&#8221; Mathematica has been around forever and has its own way of doing things. They know how to render exactly the way they want, and a certain typographic awareness is obviously baked into the company&#8217;s DNA (the production process on A New Kind of Science being a prime example). Rendering to HTML as a target (which likely could be done with a combination of JSMath and eeeevil stylesheet tricks as Google does in its PDF-to-HTML converter) probably looked bad (see how Sage, a competitor to Mathematica that&#8217;s open-sourced and web-driven, does it to see how many tricks are necessary&#8211;math is [sometimes] rendered to image there too, and Wikipedia also renders its math using LaTeX).</p>
<p>What confuses me most is that Alphram&#8217;s output is not clickable. If you&#8217;re going to get all GIF on me to allow for absolute positioning, bring back the USEMAP and let me click on key terms to auto-propagate search queries; this would DRASTICALLY lower the learning curve so that I could see which kinds of queries were possible or not without all the guessing. As to Joe&#8217;s point on the screen-scraping&#8211;yeah, Ajax is evil this way and hard to SEO in many cases. That said, given the open-source nature of many HTML/JavaScript engines, writing a spider is probably possible, if non-trivial.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe Crawford</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103649</link>
		<dc:creator>Joe Crawford</dc:creator>
		<pubDate>Sun, 17 May 2009 02:02:21 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103649</guid>
		<description>Wolfram is the creator of Mathematica. The state of mathematical notation on the web across browsers is &lt;a href=&quot;http://en.wikipedia.org/wiki/MathML#Web_browsers&quot; rel=&quot;nofollow&quot;&gt;not that great&lt;/a&gt;. The graphs and mathematical formulae that are part of many of the use cases intended for WA simply don&#039;t look that great using typical web typography here in 2009. Sad, since MathML has been around in some form since 1998.

Also, the alt and title of the img tags you refer to have equivalent content that can be scraped just as easily as in-line text. The bigger issue for screen scrapers is that if you disable JavaScript you get the message &quot;To see full output you need to enable Javascript in your browser.&quot; To do the screen scraping of this site one would need to be able to execute JavaScript to make the network calls Wolfram Alpha makes.</description>
		<content:encoded><![CDATA[<p>Wolfram is the creator of Mathematica. The state of mathematical notation on the web across browsers is <a href="http://en.wikipedia.org/wiki/MathML#Web_browsers" rel="nofollow">not that great</a>. The graphs and mathematical formulae that are part of many of the use cases intended for WA simply don&#8217;t look that great using typical web typography here in 2009. Sad, since MathML has been around in some form since 1998.</p>
<p>Also, the alt and title of the img tags you refer to have equivalent content that can be scraped just as easily as in-line text. The bigger issue for screen scrapers is that if you disable JavaScript you get the message &#8220;To see full output you need to enable Javascript in your browser.&#8221; To do the screen scraping of this site one would need to be able to execute JavaScript to make the network calls Wolfram Alpha makes.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: foo</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103648</link>
		<dc:creator>foo</dc:creator>
		<pubDate>Sun, 17 May 2009 00:12:39 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103648</guid>
		<description>Bah. Jackb bested me.</description>
		<content:encoded><![CDATA[<p>Bah. Jackb bested me.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: foo</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103647</link>
		<dc:creator>foo</dc:creator>
		<pubDate>Sun, 17 May 2009 00:12:14 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103647</guid>
		<description>Info seems to be entirely contained in the title tag, at least for the moment.</description>
		<content:encoded><![CDATA[<p>Info seems to be entirely contained in the title tag, at least for the moment.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jackb</title>
		<link>http://negatendo.net/blog/2009/05/16/wolframalpha-and-screen-scraping/comment-page-1/#comment-103646</link>
		<dc:creator>jackb</dc:creator>
		<pubDate>Sun, 17 May 2009 00:09:10 +0000</pubDate>
		<guid isPermaLink="false">http://negatendo.net/blog/?p=929#comment-103646</guid>
		<description>If it &lt;em&gt;is&lt;/em&gt; a solution to screen scraping — which, quite frankly, I think you&#039;re mistaken about — then it&#039;s a very bad one, because the alt attribute of each image has all of the text used to render that image.</description>
		<content:encoded><![CDATA[<p>If it <em>is</em> a solution to screen scraping — which, quite frankly, I think you&#8217;re mistaken about — then it&#8217;s a very bad one, because the alt attribute of each image has all of the text used to render that image.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
