Thursday, November 6, 2008

Week 10 Readings

This week's readings were much easier to assimilate than the ones on XML and DTP and all that.

Web Search Engines --- Part 1 and Part 2
The major search engines were mentioned: Google, Yahoo and Microsoft.
Search engines cannot and should not index every page on the web. One thing that was interesting was "search engines must reject as much low-value automated content as possible."
Who decides what is low-value or not?
I'm guessing that the web crawler machines decide based on how many visitors a website gets.
There are hundreds of distributed web crawler machines going about their business daily, hourly, minute by minute. They communicate with other machines and with millions of different web servers constantly.

There are two phases to indexing algorithms. The first phase is scanning. The indexer looks at the text of each input document and, for each term it finds, writes an entry pairing a document number with a term number to a temporary file.
The second phase is inversion. The indexer sorts the temporary file by term number, with the document number as the secondary sort key. "A temporary file might contain 10 trillion entries."
There is some caching of information done as well.
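
To check my own understanding of the two phases, here is a tiny sketch in Python. It is my toy version, not the code from the article: the function names scan and invert, the sample documents, and the plain list standing in for the temporary file are all assumptions made for illustration.

```python
from collections import defaultdict

def scan(documents):
    """Phase 1 (scanning): write one (term number, document number)
    entry per word. The list `postings` stands in for the temporary file."""
    term_numbers = {}  # word -> term number, assigned in order of first appearance
    postings = []
    for doc_number, text in enumerate(documents):
        for word in text.lower().split():
            # setdefault returns the existing number, or assigns the next free one
            term_number = term_numbers.setdefault(word, len(term_numbers))
            postings.append((term_number, doc_number))
    return term_numbers, postings

def invert(postings):
    """Phase 2 (inversion): sort the entries by term number, then document
    number, and group them into one document list per term."""
    index = defaultdict(list)
    for term_number, doc_number in sorted(postings):
        docs = index[term_number]
        if not docs or docs[-1] != doc_number:  # skip repeats within a document
            docs.append(doc_number)
    return dict(index)

# Hypothetical two-document "web"
documents = ["web search engines", "search engines crawl the web"]
term_numbers, postings = scan(documents)
index = invert(postings)
print(index[term_numbers["crawl"]])  # documents that contain "crawl"
```

A real indexer would sort a file far too large for memory, which is where the "10 trillion entries" comes in. The sketch only shows the shape of the idea.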


"Current Developments and future trends for the OAI Protocol for Metadata Harvesting"
Well, this was interesting, if you know what they are talking about. OAI stands for Open Archives Initiative. This basically began in 2001 with a grant from the Mellon Foundation. There are several companies and universities that are excited about this topic. Some companies are building a "virtual collection" of sheet music, which can be looked at, copied, and annotated in digitized form.

Some shortcomings with OAI are that "there is no search mechanism and fairly limited browsing capabilities," and also that "few of the registries approach a complete list of all available repositories."

While reading articles such as this one, I think, "Why in the world would I ever have to know this information?" Then usually, Dr. He will give us an assignment that invariably makes us use some of the information we've recently read about. I'm really hoping that this is just so we know what's out there and we won't have to actually use this at this point.


The Deep Web: Surfacing Hidden Value - Michael K. Bergman

"Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple: Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it."

Apparently the Deep Web is huge and "is the largest growing category of new information on the Internet." I have no idea if anything I have done on the internet is stored in the Deep Web or not. If so, it is purely unintentional.
Search engines such as Excite, Yahoo and others are only catching the surface web, which is only a tiny portion of the information available. Thus, "According to a recent survey of search-engine satisfaction by market-researcher NPD, search failure rates have increased steadily since 1997."
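
To make the surface/deep split concrete for myself, here is a toy sketch in Python. Everything in it is made up for illustration: the page names, the LINKS table, the CATALOG records, and the search_catalog function standing in for a database behind a search form. The point is only that a crawler following links reaches the static pages, but never the records that appear solely when someone submits a query.

```python
# Hypothetical site: static pages and the links between them.
LINKS = {
    "/home": ["/about", "/search-form"],
    "/about": ["/home"],
    "/search-form": [],  # the form page links to no result pages
}

# Hypothetical database behind the form; its records have no static URLs.
CATALOG = {
    "sheet music": ["record-17", "record-42"],
    "maps": ["record-88"],
}

def crawl(start):
    """Follow links breadth-first and return every page reached, in order."""
    seen = [start]
    queue = [start]
    while queue:
        page = queue.pop(0)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen

def search_catalog(query):
    """What a person gets back by typing a query into the form."""
    return CATALOG.get(query, [])

surface = crawl("/home")              # what a link-following crawler finds
deep = search_catalog("sheet music")  # only reachable by asking the database
print(surface)
print(deep)
```

None of the records in `deep` ever show up in `surface`, which is roughly the situation the article describes for dynamically generated sites.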


Figure 6. Distribution of Deep Web Sites by Content Type

"More than half of all deep Web sites feature topical databases. Topical databases plus large internal site documents and archived publications make up nearly 80% of all deep Web sites. Purchase-transaction sites — including true shopping sites with auctions and classifieds — account for another 10% or so of sites. The other eight categories collectively account for the remaining 10% or so of sites."

Hopefully there will be search engines that will have the capability to retrieve information from the Deep Web so there will be more information available to choose from for the student sitting at their computer, the mom helping her child with homework, or the librarian trying to help a patron with a question.


Muddiest Point: How do you get information stored in the "Deep Web" and how do you get it out again?


Comments: I responded to Rebekah's question on the discussion board - https://courseweb.pitt.edu/webapps/portal/frameset.jsp?tab_id=_2_1&url=%2Fwebapps%2Fblackboard%2Fexecute%2Flauncher%3Ftype%3DCourse%26id%3D_9047_1%26url%3D
I commented on Lori's blog: https://www.blogger.com/comment.g?blogID=6958200230416907745&postID=404487879365965148
Also commented on Allison's blog: http://ab2600.blogspot.com/feeds/posts/default

2 comments:

JPM73 said...

Joan,

I agree with you about hopefully getting information from the deep-web for our patrons. Because if we are supposed to get them the best possible information out there...then I think being able to go deeper than the surface web should be something we can do.

Alison said...

Joan- It was really nice meeting you last weekend! I thought the readings for this week were pretty interesting, more so than the XML/HTML stuff of a few weeks ago! The "crawling" that search engines do is really fascinating to me- I can't fathom the amount of info that they all must sift through constantly. Good information in the articles.