Finding information in the internet age

The development of the world wide web has changed the way almost all professions operate to some degree, but few as profoundly as librarianship. Prior to the 1990s, the primary way to find information in a library was to consult its catalogue. This ‘fixed database’ takes the unstructured data of the library’s collection (names of authors, titles of books, etc.) and tabulates it into structured information which is far more useful to the user. This database approach to the problems of organising, searching for and retrieving information is still fantastically useful in many circumstances, largely thanks to the ease of use of Structured Query Language (SQL) in conjunction with Relational Database Management Systems (RDBMS).
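To make the catalogue idea concrete, here is a minimal sketch of a catalogue held as a relational table and queried with SQL, using Python's built-in sqlite3 module. The table name, column names and helper function are my own illustrative assumptions, not those of any real library system, and the three rows are just sample records.

```python
import sqlite3

# Hypothetical catalogue: table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalogue ("
    "id INTEGER PRIMARY KEY, author TEXT, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO catalogue (author, title, year) VALUES (?, ?, ?)",
    [
        ("Bush, Vannevar", "As We May Think", 1945),
        ("Foucault, Michel", "The Order of Things", 1966),
        ("Foucault, Michel", "The Archaeology of Knowledge", 1969),
    ],
)


def titles_by_author(surname):
    """Return titles for a given author surname, oldest first."""
    rows = conn.execute(
        "SELECT title FROM catalogue WHERE author LIKE ? ORDER BY year",
        (surname + ",%",),
    ).fetchall()
    return [title for (title,) in rows]


print(titles_by_author("Foucault"))
```

Because the data is structured, the same query always returns the same rows in the same order, whoever runs it.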

Foucault, like Levi-Strauss and other intellectuals labelled as ‘Structuralists’, argued in his studies of epistemology that we think in tables, in which things are compartmentalised and set as binary oppositions. Likewise, our DITA exercise this week required us to log the results of our activity in an Excel spreadsheet, partly to get us to think in a tabular way. It is this human desire for order and structure that makes me believe databases will remain with us for a very long time.

So far, so straightforward. Problems arise, however, when you seek information from beyond the confines of a database. The internet, by connecting together the world’s computers, has led to an explosion in the amount of information being created and in the accessibility of that information. The website graphically illustrates this growth. The number of global users of the world wide web went from almost 40 million in 1995 to approximately 2.3 billion by 2011, and global internet traffic grew from 1 petabyte per month to 27,483 petabytes per month over the same period. This presents great opportunities and also, potentially, great problems.

Even as early as 1945, Vannevar Bush, in his seminal article ‘As We May Think’, identified the problem of processing such a volume of information:

“There is a growing mountain of research. But there is increased evidence that we are being bogged down today as specialization extends. The investigator is staggered by the findings and conclusions of thousands of other workers—conclusions which he cannot find time to grasp, much less to remember, as they appear.”

He recognised that the traditional methods of sharing information were no longer ‘fit for purpose’:

“Professionally our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose.”

He concluded by suggesting that new computer technologies could transform the situation and help overcome these barriers:

“there are signs of a change as new and powerful instrumentalities come into use.”

Information Retrieval is one of the means by which we can deal with data beyond a database; its most common form is the internet search engine. Searching in this way differs from searching a database in one very significant respect. Where two people making the same query of a database would receive the same results, two people making the same query in a search engine may be given differing results. This is because in Information Retrieval the responses are ranked by subjective relevance; if you were using Google, for instance, your previous searches would be taken into account.
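The contrast can be illustrated with a toy ranking function. This is only a sketch under my own assumptions: the three sample documents, the scoring rule (word overlap with the query, plus a half-point boost for each word shared with earlier searches) and the function names are all invented for illustration, and real search engines are far more sophisticated.

```python
import re

# Invented sample documents, for illustration only.
DOCUMENTS = [
    "jaguar car dealership and service",
    "jaguar habitat in the rainforest",
    "python programming tutorial",
]


def tokens(text):
    """Split text into a set of lower-case words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(query, history=()):
    """Rank matching documents; past searches nudge the ordering."""
    query_terms = tokens(query)
    history_terms = set()
    for past_query in history:
        history_terms |= tokens(past_query)

    scored = []
    for doc in DOCUMENTS:
        doc_terms = tokens(doc)
        match = len(query_terms & doc_terms)
        if match == 0:
            continue  # documents sharing no word with the query are left out
        # Assumed scoring rule: query overlap plus a small history boost.
        score = match + 0.5 * len(history_terms & doc_terms)
        scored.append((score, doc))

    # The sort is stable, so tied documents keep their original order.
    scored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in scored]


print(search("jaguar", history=["used car prices"]))
print(search("jaguar", history=["rainforest animals"]))
```

Both users type the same query, yet the one who previously searched for cars sees the dealership first, while the one who searched for rainforest animals sees the habitat page first.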

I’ll discuss Information Retrieval in more depth over the course of this blog, but I’d like to conclude by returning to a point I made earlier: that the world wide web presents difficulties as well as opportunities. It’s true that there is an increasing symbiosis between users’ needs and search technologies (as our DITA exercise showed, even natural language queries return fairly relevant results these days), which makes it appear as if Information Retrieval may well help to overcome the problems highlighted by Bush. Yet a post I read on the British Library’s web archive blog raised some serious concerns in my mind. The UK Web Archive has been archiving pages since 2004 and seeks to prevent the creation of a “potential digital black hole”. Recently they conducted an exercise to see how well they had achieved their objective. They checked whether the URLs they had archived were still live on the web and, if so, whether the information contained therein was still the same or had been changed.
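A check of this kind could be sketched roughly as follows. I don’t know how the UK Web Archive actually compared pages, so this is purely an assumption on my part: it fingerprints the archived copy and the live copy with SHA-256 and treats any difference at all as a change. The function names and the sample data are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest identifying the exact bytes of a page."""
    return hashlib.sha256(content).hexdigest()


def classify(archived: bytes, live):
    """Compare an archived page with its live copy.

    `live` is None when the URL can no longer be retrieved.
    """
    if live is None:
        return "gone"
    if fingerprint(live) == fingerprint(archived):
        return "unchanged"
    return "changed"


# Hypothetical sample: (archived copy, live copy or None) for three URLs.
sample = [
    (b"<p>Opening hours: 9 to 5</p>", b"<p>Opening hours: 9 to 5</p>"),
    (b"<p>Opening hours: 9 to 5</p>", b"<p>Opening hours: 10 to 4</p>"),
    (b"<p>Old campaign page</p>", None),
]

summary = Counter(classify(archived, live) for archived, live in sample)
print(summary)
```

Exact hashing is a deliberately strict test: a changed advert or date stamp would count as a change, so a real study would probably need a looser measure of similarity.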

The results, from an archival point of view, were quite alarming. Over half of the URLs from 2007 and earlier were unobtainable for one reason or another, and (apart from the 2014 URLs) over 90% of those which are still retrievable no longer hold the same material. Could it be, therefore, that Information Retrieval is only good for dealing with the web in the here and now? Are we going to have to develop new methods, new techniques and maybe even new ways of thinking when searching for historical materials on the web?





This entry was posted in Information Retrieval and Relational Databases.

4 Responses to Finding information in the internet age

  1. rdonnison says:

    Wow great overview Steve! An interesting and useful read. Those statistics from the UK Web Archive are indeed alarming – do you think online information generally just seems to be considered more disposable?


    • stevemishkin says:

      I think that, yes, in some ways much of the content on the web is considered disposable. After all, it’s so easy to create and upload that perhaps inevitably it isn’t cared for in the same way that you would preserve a book, which requires an enormous effort to bring to print. It is true that much of the content on the web is fairly ephemeral, but it’s still part of the written record nonetheless, and a valuable insight into our social history. There is also the question of technology to consider. Although popularly we like to think of the web as just being ‘out there’ in the ether, the reality is that everything is stored on a server somewhere, and it’s this vulnerability which perhaps has not been addressed seriously enough thus far when we consider the issue of web archiving.


  2. yxchai says:

    Hmm, I find your last point quite interesting! I’ve always thought that digitising information would be one of the best ways to preserve it; however, I didn’t realise how much of it could be lost with the progression of technology! Perhaps we all take the internet for granted at times.


  3. Ali says:

    Nice post Steve! You’ve explained the difference between searching with a search engine and searching within a fixed database so clearly. Really interesting to think about the issues around archiving the web. I’m not sure how you would even determine what is worthy of preserving when there is so much content created by so many people for so many different purposes. You’ve got me thinking!

