updates @ m.blog

A Conversation With Paul Bausch

Early this fall, in the course of doing research for my (now completed) thesis, I conducted an informal interview with Paul Bausch, creator of Weblog Bookwatch. For those who are interested, I’ve decided to post the full text of the interview here.

Paul Bausch has a long history of involvement with the weblog community. In 1999, he was one of the founders of Blogger, an extremely popular software service which made it easy for users to publish their own weblogs. More recently, he developed a system called Weblog Bookwatch (and its new companion site, Weblog MediaWatch), which uses distributed web services technology to automatically read thousands of blogs and find out what types of books people are talking about (surprise: they’re not the bestsellers). I spoke to Paul regarding the implementation of Bookwatch and his views on what types of web services we’ll see in the future.

MR: Let’s start with the simple stuff. How did you first get the idea for Weblog Bookwatch? What were you trying to accomplish?

PB: A few forces came together to form the idea of Bookwatch. First, I’ve been working on a Web application to store and categorize books for over two years in my spare time. (It has been a hobby, but I hope to release it someday.) So I’ve been thinking a lot about how people interact with books, and how the Web can connect readers. I was also a co-creator of Blogger, software that helps people create weblogs. So I’ve been involved in the weblog community from its early days. Finally, I was writing a book about weblogs and knew I would eventually want to track what people were saying about the book once it was released. With all of these thoughts in my head, I saw that the creator of DayPop (a weblog search engine) had implemented a feature that tracked webloggers’ Amazon wishlists.
   I thought that tracking what people were talking about would be a better measure, so I started coding. I think I was just trying to see what books bubbled up to the top of this subsection of the Web population.

MR: What do you see as being the main difference between a traditional metric of book popularity like the New York Times Bestseller List, and something like Bookwatch?

PB: Steven Johnson, author of "Emergence," called it "an interesting corrective to ordinary bestseller lists, in that it measures which books get talked about, rather than which ones get bought." I agree with that, though it’s only a small section of the population. If everyone were publishing weblogs and talking about (and linking to) their book interests, it would be a more accurate picture of the top books.

MR: What exactly do you mean by "a more accurate picture"? If everyone in the world did publish a weblog (a somewhat frightening thought!) would the Weblog Bookwatch list closely mirror the best sellers? Or is there a discernible distinction between what people purchase and what people talk about?

PB: I just meant a more accurate picture of what books are "top of mind" rather than top of the charts. I don’t have info to prove it, but I think there’s a difference between the books people talk about and the books they buy. In an everyone-blogging world you could compare the Bookwatch data and the bestsellers data to determine the exact difference.

MR: Privacy implications aside, it would be interesting to test this theory by having a system cross-reference an individual’s weblog hyperlinks to online booksellers to their actual purchase history on those sites.

PB: Amazon shares some of their demographic purchase data. They call them ‘purchase circles’. You can see the top books for a given geographic area, university, or in some cases, company. It’s very interesting, and I’m guessing some organizations aren’t happy about that data being available. They also allow users to make their past purchases available to their "Amazon friends"–lists of other Amazon customers the user defines. If all of this data were available through XML queries, you could start to quantify some of those ideas.

MR: Have you found anything particularly interesting or surprising about the Top Results lists that Bookwatch generates?

PB: I thought that the Bookwatch might reflect standard bestseller lists a bit more than it does. Instead, it shows the common interests that weblog authors share. Books about the Web and weblogs have consistently been at the top of the list. While standard bestsellers occasionally appear, it’s definitely skewed.

MR: Yes, the insular tendencies are quite readily noticeable when looking at the list. Why do you think the weblog community likes to talk about itself so much?

PB: I think the early weblog adopters were aware that they were blazing a trail, and were self-conscious in the process. Or perhaps the people drawn to weblogs were naturally self-conscious. This probably set a tone for others who followed. I think the navel-gazing will subside as blogging becomes as much a part of the Internet as email or instant messages.
  Also, Bookwatch is skewed by looking for the most popular books among a subgroup. The only interest that all webloggers share is the act of writing a weblog. It makes sense that a popularity index among them would favor the common interest. I don’t think that indicates obsessive navel-gazing.

MR: How does Amazon.com fit into the bargain? It would seem they would love a system that draws people’s attention towards purchasing new books from them. Have you ever actually gotten any feedback from them (official or unofficial) about Bookwatch?

PB: Erik Benson, an Amazon.com employee, has been extending the idea behind Bookwatch at his site, All Consuming. It seems this is an independent project, but I’m guessing he at least has the blessing of his employer. I haven’t heard anything about Bookwatch from Amazon, other than a few conversations with Erik. Recently I changed Bookwatch a bit so it’s now looking for links on weblogs to Barnes & Noble and Powells as well.

MR: Now I’d like to talk a little about the geeky end of things. Could you briefly describe how Bookwatch works from a technical perspective: the spidering, parsing, etc.?

PB: Weblogs.com is a service that weblog authors use to notify the world that they have updated their site. Some tools have this "ping" built in; for those that don’t, the author has to notify weblogs.com manually with a Web form. Weblogs.com continually displays a list of the most recently updated weblogs, and offers it on a web page, or as XML.
   Every two hours, one of my scripts gets the latest XML version of the most recently updated weblogs. Based on that list, the script visits each of the weblogs, searching its text for links to online bookstores. It adds or updates any record of the weblog/book combination.
   Another script then analyzes and sorts this data so the most frequently mentioned books rise to the top. This script also contacts Amazon Web Services for an XML representation of the book information. I use this to display the book title, author, and an image of the cover. Before Amazon implemented Web Services, I was using screen scraping to get the book data…but that’s another conversation. ;)
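[Editor’s sketch] The two scripts Paul describes can be approximated in a few lines of Python. This is only an illustration under stated assumptions, not Bookwatch’s actual code: the function names are hypothetical, the XML list is assumed to contain `weblog` elements with a `url` attribute, and the Amazon link pattern is one assumed URL shape among several a real system would need to handle. Fetching over HTTP is left out, so the functions take the XML and HTML as strings.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from html.parser import HTMLParser

# Assumed shape of a bookstore link; real stores use several URL patterns.
BOOK_LINK = re.compile(r"amazon\.com/exec/obidos/ASIN/([0-9A-Z]{10})")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def updated_weblogs(changes_xml):
    """Return the URLs listed in an XML document of recently updated weblogs."""
    root = ET.fromstring(changes_xml)
    return [w.get("url") for w in root.iter("weblog") if w.get("url")]


def books_in_page(html):
    """Return the set of book identifiers linked from one weblog page."""
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        match = BOOK_LINK.search(href)
        if match:
            found.add(match.group(1))
    return found


def record_mentions(mentions, weblog_url, html):
    """Add or update the weblog/book combinations found in one page."""
    for book_id in books_in_page(html):
        mentions[book_id].add(weblog_url)


def rank_books(mentions):
    """Sort books so the ones mentioned by the most weblogs come first."""
    counts = [(book_id, len(blogs)) for book_id, blogs in mentions.items()]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))
```

Here `mentions` would be a `defaultdict(set)` keyed by book identifier. Storing weblog URLs in a set means a weblog that links to the same book several times still counts once, which matches the "weblog/book combination" record described above.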

MR: But an interesting conversation. :-) The issue of screen-scraping is worth discussing. Let’s return to the Bookwatch/Amazon relationship for a moment. Amazon opened up an XML API for registered ‘affiliates’ to use, with the assumption (somewhat backed up by licensing agreements) that it would only be used to refer people to buy books from Amazon.com’s website. If Bookwatch used this API to grab title/author information and images from Amazon’s server, yet didn’t link your users to Amazon when they wanted to purchase the book, one could imagine Amazon might not be so permissive. Amazon’s willing to open their data to outside websites as long as there is a clear benefit to them. If you remember when those price-comparison search engines first came out, online retailers freaked out and tried to block access while simultaneously attempting to sue everyone involved.
   At the same time, applications that reorganize data via screen-scraping or similar ‘unauthorized’ methods are actually becoming more and more popular–I am thinking of the very well-received shareware program Watson (Karelia Software) on Mac OS X, which scrapes various sites for things like movie times, sports scores… There are a lot of neat little programs that scrape data from weather.com to give you a weather report in your computer’s taskbar. Also we see things like Cerulean Studios making "Trillian," a client for AOL’s Instant Messenger network (OSCAR), which leads to a continuously escalating battle as AOL tries to lock them out and Trillian changes their code to let themselves back in.
   In many of these cases, you have a service provider (of a network, of unique data, etc.) putting out data on a publicly-accessible server but trying to limit its use or restrict access to certain cases. Then you have this large segment of the net population who seems to think that, once you get on the internet and start dealing with digital information, this is bunk: that clients should be able to use the ‘public’ data in any way they see fit.
   The question seems to be to what degree these services are actually public–e.g., does a website have the right to ‘discriminate’ what kind of traffic they get, or is this analogous to a brick-and-mortar bookstore kicking everyone out who appears to be "just browsing"? Whereas in physical space these interactions are fairly defined by social contract; once you start talking about network environments…

PB: That’s a good question, and I’m guessing it will be settled through legal agreements. Everyone who signs up to use Amazon web services agrees to a terms of service. In that TOS, it says that any use of the system must point to Amazon for sales. Through policing the system, they can find people violating the TOS and cancel their access. I think smart businesses will speak to the demands of consumers. Amazon saw that there was a demand for their data in an XML format because so many people were scraping their site. Instead of fighting against the people who are innovating with public information, companies can use the innovators to determine where the market is heading. Tim O’Reilly calls these the "alpha geeks": people who are on the edge of technology, using it in ways the designers didn’t intend. Instead of having a "not invented here" attitude, paying attention to the alpha geeks turns the world into your development team. Those egregious offenders that hurt the core business model could still be stopped with technology or legal agreements.

MR: Were you able to quickly go from concept to a "working" version of the product?

PB: Yes. I had the screen scraping code from the other book project I was working on. And retrieving XML or HTML via HTTP is an easy task for scripts. It went from idea to reality in a couple of days. Of course I’ve constantly been tuning it as it goes, but the initial version was quick.

MR: This is something that could never have been done without a number of existing structural systems in place–SGML parsing libraries for scripting languages, weblog publishing systems like Movable Type which conform to certain agreed-upon standards, existing content aggregators such as weblogs.com… Yet it clearly demonstrates the power of their integration. Until very recently it would have been fairly unthinkable that one person would be able to create, with limited resources, in a few days, a working system that polls daily tens of thousands of individuals without any actual "human" involvement.

MR: Does Bookwatch demand a lot of hardware resources, or is it chugging along happily on an old Pentium in your closet somewhere?

PB: It’s on a friend’s server that is hosted with a DSL line. It’s not on a machine dedicated for this purpose, and it doesn’t take up much bandwidth. I’m only interested in the text of weblogs, so each parse isn’t transferring too much data. The same with Amazon’s Web Services…it’s simple XML.

MR: How large a "barrier of entry" do you think there is into developing services like Bookwatch?

PB: The only barrier to entry is knowledge of a scripting language, and familiarity with HTML and XML. I’ve been developing sites and applications for the Web for over 7 years, so I’m very familiar with these technologies. With more and more sites offering their data as XML, the barrier will drop even further.

MR: So perhaps we’ll see a Dreamweaver or even ‘My Easy Homepage Creator’ which will allow even a neophyte user to create such programs? Tim Berners-Lee once mentioned that in the near future (where he believes metadata will be everywhere) a web browser should be able to understand complex recursive user-queries such as "go to the homepage of all the employees of www.company.com and download their resumes."

PB: Could be. I’ve seen a prototype of a service that allows you to type in the name of your favorite music star, and it automatically creates an Amazon affiliate store for that artist. The instant site has pages of products organized by category…and the site owner makes money when items are purchased through the site. Instant revenue-earning fansite. Attracting an audience is another matter, though.

MR: In a nutshell, you’re taking data that already exists, and giving it a new purpose. Would you agree with that?

PB: Yes, it’s basically syndication of information; repackaging things from different sources in different ways to create something new. The only unique service beyond the new package is the analysis and sorting of the data available.

MR: Along those lines, you publish an RDF feed of Bookwatch and encourage people to come up with new uses for its data. Some people already have.

PB: It’s actually RSS, a simple XML format. Some versions of RSS use RDF, but I publish it in an earlier version, RSS 0.91 (which is RDF free). That’s how Erik Benson got started with his application. He combined my RSS feed of the Bookwatch data with Google’s newly released Web Services to create something new.
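[Editor’s sketch] Because RSS 0.91 is plain XML with no namespaces, a feed like the one Paul mentions can be read with a standard XML parser. The snippet below is a minimal illustration, not Erik Benson’s or Paul’s code: the function name is hypothetical, and it assumes only the standard RSS 0.91 layout of `rss` > `channel` > `item` with `title`, `link`, and `description` children.

```python
import xml.etree.ElementTree as ET


def read_rss_091(feed_xml):
    """Return the items of an RSS 0.91 feed as a list of dictionaries."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        return []
    items = []
    for item in channel.findall("item"):
        items.append({
            # findtext returns None for a missing element, so fall back to "".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items
```

A consumer could pass the resulting list to any other service, for example by searching for each title elsewhere, which is the kind of recombination described above.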

MR: Can you conceive of a situation in which you wouldn’t want someone to re-use your data from Bookwatch?

PB: Not offhand. But people are creative. :)

MR: All of this seems to exist in a system where it’s of primary importance to make data "open." Could you speak a little about this? How do you feel emerging standards such as RDF will work along with this goal?

PB: As copyright laws are continually strengthened and extended, we no longer enjoy a thriving "commons" where derivative works can flourish without fear of being sued. Because of the open nature of Web protocols, we’re seeing a push toward purposely putting creative works into the commons of the Web. For some it’s a personal political issue, but for others (like Amazon) it’s a good business decision, not only in public perception as they become the source of book data for the Web, but also in their control of developers as they come to rely on their service.
   Placing your data within an XML format for public consumption is a way of speeding up syndication agreements. By doing so, you’re letting everyone know that this particular set of data is "part of the commons" and available for use. (Right now there are norms of use for XML data, but no written laws.) Agreed-upon formats help this along by suggesting a use for the data, and because developers create tools to work with the formats.

MR: When the web was first being developed, there was a conscious decision to have unidirectional links instead of bidirectional. While it was known this would cause a number of technical problems such as missing resources (and we do have a lot of 404 errors today), it’s believed by many to be the primary reason for the web’s explosive growth–since you didn’t have to get any sort of permission to link to another server, new nodes could be created quickly.
   Now with something like an RDF feed, the technology is actually there to limit the access to the data to "authorized" clients–the Google API limits the number of XML search queries to 1000/day; other entities might decide to only allow certain parties access to their "raw" XML format content. A lot of corporate parties would really like some sort of a "robots.txt" for dictating what an RDF feed may or may not be used for. Hell, you still have lawsuits from companies trying to dictate who can hyperlink to them.
   Do these entities who favor access control as a business model just not "get it" when it comes to the benefits of opening data to the public online, or is this a legitimate sort of tension which perhaps indicates the internet’s fundamental incompatibility with some "old economy" business models based around scarcity and demand?

PB: Again, I think companies that embrace the open nature of the Web are going to find it helps their bottom line. It puts consumers in a position of power, because they are able to tell companies how they would like to use their data…instead of the other way around. The RIAA lost a big opportunity with Napster. They could have gradually worked in a payment system for the millions of music fans who wanted to download music. Instead, they’re fighting costly court battles to shut them down. It won’t be so easy to gain back the trust of their customers.

MR: One of the properties of the internet’s decentralization that has been so talked about is its resistance to attack/censorship/et al. Now, fast forward to "web services." With a system like Bookwatch, you actually end up with a hierarchy of command which relies upon key centralized players. So what happens if, for example, weblogs.com’s server goes down? Bookwatch stops working, and so does every site which extends Bookwatch’s data list…in more developed systems, you could end up with quite a domino effect of suddenly useless web services. How big a problem is this interdependency? How do you reconcile it with the ‘decentralized’ nature of the internet?

PB: That has happened. Weblogs.com has been down a few times, and I have code in place to warn people it’s currently down if that happens again. I think we’re only at the beginning of Web Services and that’s why there are so few providers. There are dozens of other weblog monitors, but not all of them offer their data as XML. In a perfect world, Bookwatch would rely on several different sources. There are dozens of other book sales sites, but only Amazon offers their data as XML. As more choices emerge, I think the problem of centralization and dependence will be solved. Of course, if commerce is part of the agreement, written contracts would need to solidify guarantees about uptime.
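[Editor’s sketch] The "several different sources" idea, together with the warning Paul mentions, might look something like the following. It is a hypothetical illustration rather than Bookwatch’s code: the function name and the convention that each source is a `(name, fetch)` pair whose `fetch()` raises `OSError` on failure are assumptions made for the example.

```python
def first_available(sources):
    """Try each (name, fetch) source in order.

    Returns (name, data, warnings): the first source that answered, its data,
    and a warning message for every source that failed before it. If every
    source fails, name and data are both None.
    """
    warnings = []
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError as exc:  # covers ConnectionError, TimeoutError, etc.
            warnings.append(f"{name} is currently down ({exc})")
            continue
        return name, data, warnings
    return None, None, warnings
```

With only one provider in the list this reduces to the current situation: a single failure leaves nothing but the warning to show visitors.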

MR: Time to play futurist. Where, in general, do you see the web going in the near future?

PB: I think the idea of an "information commons" that everyone can draw on will gain momentum. The combination of Web services, Weblogs, and XML syndication is a trend toward trust, inclusiveness, and cooperation. I’m an optimist, so I think the trend will continue. The Web won’t mean instant riches like people thought it did in the ’90s…but the hope and enthusiasm driving the Web is still there if you look in the right places.