Search Engine History – From WW2 to WWW

Everything old is new again. Would you believe that the origins of Google’s algorithm & mission statement may date back to 1945? What can the history of search tell us about where it’s going in the future?

In writing his next search marketing book (due in the fall), Search Engine Strategies’ Mike Grehan explored these questions & more. At this week’s SES Conference & Expo San Francisco, Mike gave the audience a sneak peek at his new book. It was an informative and enjoyable dive into the curious history of search engine crawling, indexing & rankings.

Introducing our presenter was Tracy Falke, Social Media Specialist at Freestyle Interactive.

Tracy pointed out that Mike wrote one of the first and best books on SEO. He takes the concept deeper in his current book, which will extensively cover information retrieval.

Mike gave fair warning that the next hour of his presentation would be a bit of a history lesson. In writing his book, he discovered much about search engine algorithms and why network theory is so important. Dissecting the chain of events that has unfolded in the search field since 2005, he came to the conclusion that we’ve probably gone in the wrong direction.

Information Retrieval vs Data Retrieval
The difference between information retrieval and data retrieval is sizable. Search is not like a database. Information retrieval is not just pattern matching, or matching keywords. It’s about actually understanding what the user’s intent is. Oftentimes we arrive at search results by way of typing in keywords, then noting, “Wow, that’s really relevant,” even though we don’t actually want any of the results provided.

Google is viewed as a black box, with everyone outside the box stuck wondering, “How do we figure this out?” The funny part is Google is looking at each of us like a black box as well. It’s trying to comprehend each user and answer the same kind of question: “What is it you really want?”

The History of HTTP
How did we end up using the Hypertext Transfer Protocol that we know and love? The end of WWII in 1945 found one Vannevar Bush, who had spent most of the war creating weapons of mass destruction, quite depressed. Do a search for “As We May Think,” a paper published by Bush. You may be astonished by what this man was talking about at that time. He described a way of handling information, something called “Memex.” Memex follows the same basic principle as HTTP. Ergo, Bush may very well have been the first hypertext thinker.

Bush argued that instead of creating weapons of mass destruction, we should work to make all previously collected human knowledge more accessible. Fast-forward sixty years, to the Google mission statement:

“Google’s mission is to organize the world’s information and make it universally accessible and useful.”

The same line of thinking is present here: that we should be able to pull together all the world’s information and archive it in one central location. By 2005, the founders of Google were already beginning to understand that this mission was coming to an end.

Which brings us to Tim Berners-Lee. Mike asked the audience if we were clear on the fact that the Internet and the World Wide Web are two different things: the Internet existed before the WWW, with the WWW running like an application on top of it. Basically, Tim Berners-Lee invented the World Wide Web during his lunch break.

Crawlers & Indexes are no Mystery of the Sphinx
When we’re searching, we tend to have this major fascination with what the crawler is doing. What really happens when a search engine crawler comes to your website and the index is created? Crawlers follow links, collect text and that’s about it. There’s no need to obsess over or feel perplexed by crawler activity: the search engine crawler comes to your website, strips out all of the text, and puts it into the index; it strips out the links and puts them in the frontier of the crawler, where it’s headed next.
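The crawl loop described above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine is built: the three-page `PAGES` "web", the `PageParser` class and the `crawl` function are all hypothetical names made up here so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page "web" so the sketch runs without a network.
PAGES = {
    "/home": '<p>Welcome to our shop</p><a href="/books">Books</a><a href="/about">About</a>',
    "/books": '<p>Limited edition books</p><a href="/home">Home</a>',
    "/about": '<p>About our shop</p>',
}


class PageParser(HTMLParser):
    """Collects the text and the outgoing links of a single page."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Strip out the links: every <a href="..."> is a candidate for the frontier.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Strip out the text: keep every non-empty text fragment.
        if data.strip():
            self.text.append(data.strip())


def crawl(seed):
    frontier = deque([seed])  # where the crawler is headed next
    seen = {seed}             # avoid queueing the same URL twice
    index = {}                # url -> extracted text
    while frontier:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:
            continue          # unknown page, nothing to fetch
        parser = PageParser()
        parser.feed(html)
        index[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


index = crawl("/home")
print(list(index))
```

Starting from `/home`, the crawler reaches the other two pages purely by following links, which is the whole trick.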

Think back to middle school and your biology textbook. Remember the index in the back? The place that tells you every page on which the word “hemoglobin” appears? That’s how the inverted index works in search engines.
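To make the textbook analogy concrete, here is a minimal Python sketch of an inverted index. The page numbers and sentences are invented for illustration, and `build_inverted_index` is a hypothetical helper; real engines also store positions, frequencies and much more.

```python
from collections import defaultdict

# Invented "textbook pages": page number -> text on that page.
pages = {
    12: "hemoglobin carries oxygen in the blood",
    47: "the heart pumps blood",
    103: "low hemoglobin can signal anemia",
}


def build_inverted_index(pages):
    """Map each word to the set of pages on which it appears."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_no)
    return index


index = build_inverted_index(pages)
print(sorted(index["hemoglobin"]))  # every page on which the word appears
```

Looking up a word is then a single dictionary access, instead of a scan through every page.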

In the late ’70s, scientists were working to develop automatic text retrieval by way of weighting words, where one word was considered more important than others on the page, and one page was considered more important than others in the index. This is still how search engines weight certain keywords on a page today. The crawler is trying to read the page exactly like a human being does. Nothing has changed in the way we create or write a webpage in terms of relevance to the end user or the crawler.
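Mike did not name a specific weighting scheme, so the following is an assumption: TF-IDF is one classic way to express “this word matters more on this page than it does across the index.” The documents and the `tf_idf` function below are made up for illustration.

```python
import math
from collections import Counter

# Invented mini-index: document name -> text.
docs = {
    "a": "beethoven symphony symphony review",
    "b": "beethoven concert tickets",
    "c": "beethoven biography",
}


def tf_idf(docs):
    """Weight each word by its frequency on the page, discounted by how common it is in the index."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    weights = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        weights[name] = {
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return weights


weights = tf_idf(docs)
print(weights["a"])
```

In this toy index “beethoven” appears on every page, so it tells you nothing and gets a weight of zero, while “symphony” stands out on page `a`.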

The Shifting Influence of Keywords
Forty years ago, keywords were where it was at – they were the important thing. Imagine a music student struggling to write a term paper on Beethoven’s 5th Symphony. Then imagine André Previn writing on that same piece of music. Who would you say wrote the more authoritative article? Yet if both articles were ranked solely on the presence and number of relevant keywords, authority wouldn’t really matter.

In 1998, Jon Kleinberg, one of the world’s foremost computer scientists, looked at search and some of the problems people were having. He soon realized it was the conundrum of keywords that brought back crappy results. Searches for “Japanese auto manufacturer” would return car dealers in Florida. Kleinberg went back to AltaVista and did a search for the same keyphrase, but rather than examining the keyword density, he scraped the top 200 results and developed the HITS algorithm. This algorithm would only look at pages that were linking to each other. When the algorithm converged, Toyota topped the list, then Honda, etc.

What Kleinberg discovered was that “it’s not what you say about yourself, it’s what others are saying about you.” Influencing the people around you became important. The dawn of PageRank was at hand.

Not All Links Are Equal
Mike looks at getting links from a more creative, philosophical view: aim to get the “pope” to link to you… and in every community, there is a pope. Mike had a friend who bought a website, which Mike then inherited to optimize. He was trying to optimize for “restaurant London” and “London restaurant.” He had a look around and discovered a “pope” and two “cardinals”; the pope happened to be a top food critic in London. Mike knew that all he needed was for these guys to come to the restaurant and he’d get the link he was after. Even if they came and found the food to be total crap, he’d still get the link. (When you start to think about linking, think about the quality, not the quantity.)

Mike recommended we forget about this whole PageRank thing. However, he says, if you’ve got a PageRank 8, that really means something – it means you’ll have one less than 9 and one more than 7.

<slight audience hesitation… then laughter>

So again, Mike stresses we forget PageRank, because that example demonstrates just about how important it is. There’s no real relationship between it and the factors that actually do matter. If a company starts saying they’re reverse-engineering Google’s link algorithm, call bullshit. There are connections beyond just links. Google has a lot of data that you will never be able to get at.

Back to Kleinberg’s algorithm – basically, what he had discovered were hubs and authorities. Hubs are websites that reach out to all the great content sites; authorities are the content sites those hubs point to. If you write great content, it’s likely a hub is linking to you.
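Here is a small Python sketch of the hubs-and-authorities idea, under heavy assumptions: the link graph is invented (three hypothetical hub pages linking to three car makers), and `hits` is a bare-bones version of the iteration, not Kleinberg’s full procedure of building a subgraph from search results.

```python
import math

# Invented link graph: page -> pages it links to.
links = {
    "hub1": ["toyota", "honda"],
    "hub2": ["toyota", "honda", "nissan"],
    "hub3": ["toyota"],
    "toyota": [],
    "honda": [],
    "nissan": [],
}


def normalise(scores):
    """Scale scores so their squares sum to 1, which keeps the iteration stable."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    hub = {page: 1.0 for page in links}
    auth = {page: 1.0 for page in links}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {page: 0.0 for page in links}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {page: sum(auth[dst] for dst in links[page]) for page in links}
        auth = normalise(auth)
        hub = normalise(hub)
    return hub, auth


hub, auth = hits(links)
ranking = sorted(auth, key=auth.get, reverse=True)
print(ranking[:3])
```

No keyword is ever inspected here. The ranking comes entirely from who links to whom, which is the point of “it’s what others are saying about you.”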

The emergence of social communities has encouraged the algorithm to develop even further. These communities can identify the hubs and authorities, and they represent places where we can find a wealth of content. Dig through them and find content that is relevant to you. Start by making yourself known to their community… they’ll make you known to theirs.

In the old days, SERPs were little more than 10 blue links. Now, things have changed completely. Gone are the days of 10 blue links; now when you enter keywords, you get universal results. Your eyes are pulled towards the images that consume the top of the SERP. The end user is changing. Now, we expect a much greater user experience. Simply adding images to the SERP has actually satisfied the more demanding end user, but it has also changed the way the end user thinks.

This evolution has also changed our view of SEO. It used to be that if your competitor was #1 and you were #3, you could beef up your title tags and stand the chance of drawing the attention of more users and more clicks, and eventually you’d see a turnaround. Mike points out that those days are, in a way, over – there’s no title tag in the world that will prevent you from clicking on an image.

Even if, in the same moment, you see a compelling title tag as well as an image, you’re much more likely to click on the image because you know where it’s going.

Look at User Intent to Understand What Google is Doing Now
Search engines have a taxonomy they use for search. It’s an older, 3-prong approach:

  • Informational – This applies to the surfer who is really looking for factual information on the web.
  • Navigational – The user wants to get to a particular site. In the commercial sense, it means that the user understands something about your brand.
  • Transactional – The user wants to do something, such as sign up for a newsletter or download a PDF document. This is where generating leads and eventually getting to conversion comes in.

Think about those 3 points, then think about what kind of content you could create.

Search engines understand query chains: one of their strongest signals is your previous query. If you go to Google and type in…

  • “Special collection”? No, not what I’m looking for.
  • “Special edition”? Nope, not that either.
  • “Limited edition books”? Yes, that’s exactly what I’m looking for!

…Google sees that query change over and over again. The next time Google sees someone typing in “special collection,” it comes back with, “Idiot, we know you want limited edition books!”
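How Google actually uses query chains is not public, so treat the following Python sketch as a toy model. It assumes that the last query in a session is the one that satisfied the user; the `sessions` data and the functions `learn_rewrites` and `likely_intent` are hypothetical.

```python
from collections import Counter, defaultdict

# Invented search sessions: each list is one user's chain of queries, in order.
sessions = [
    ["special collection", "special edition", "limited edition books"],
    ["special collection", "limited edition books"],
    ["special collection", "rare maps"],
]


def learn_rewrites(sessions):
    """Count how often each earlier query ended in each final query."""
    rewrites = defaultdict(Counter)
    for chain in sessions:
        final = chain[-1]  # assumption: the last query is what the user wanted
        for earlier in chain[:-1]:
            rewrites[earlier][final] += 1
    return rewrites


def likely_intent(rewrites, query):
    """Return the most common final query seen after `query`, or None if unseen."""
    if query not in rewrites:
        return None
    return rewrites[query].most_common(1)[0][0]


rewrites = learn_rewrites(sessions)
print(likely_intent(rewrites, "special collection"))
```

With enough sessions, the majority chain wins out over the odd user who really was after rare maps.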

Then, they use user trails. We assume that the link we click on in Google will lead to the landing page. But sometimes, the most relevant page is another six clicks away.

We started with text on the HTML page. Then we learned that linkage data and link anchor text are important. Next we had social media tagging, bookmarking and rating. But there is one thing that provides the strongest search engine signal.

How many people have the Google toolbar? Do you have any idea how much data you are sending to Google by having that plug-in activated? From the number of trails that lead to a page, the data helps Google understand why that page is so important. All of a sudden that page six clicks away stands a greater chance of ranking higher.

End user data is crucial. The more information you share with each other, the more relevant the results become.

While talking with some search engine reps about the taxonomy of search results, Mike asked, “What if you understand that the intent of my query is totally transactional? Is there ever a time where you wouldn’t serve me organic results?”

Their reply: “Hrrm, it could happen.”

So, Mike did a search when he was shopping for a night stand. He was shocked to find so many transactional results on the page.

He advised the audience, “You have to start thinking about the combination you’re using, both organic and paid.”

The reason Google is having difficulty keeping up with the different types of information out there is that, with social media, User Generated Content is now beating mediated content (things journalists write) by a 5:1 margin. It’s not possible using HTTP to get around to all of that in real-time and give it to users. Not only do we want this data, we want it now. All of a sudden we have to start rethinking ideas we had about HTTP.

When his kid has some kind of allergic reaction, Mike can go to Google and, with their fantastic algorithm, get top results with advice like “Drink Disinfectant.” After all, Google’s still just a machine. If you go to Facebook or a parenting site, more people are asking each other questions rather than typing three words into a query box. You might have 100,000 parents who know what this allergy is. Some of those folks may be doctors or nurses.

Applications sidestep the web browser to deliver specialized content, especially with all of these apps available. When we sidestep browsers, we get information a lot more quickly. As content becomes more diverse, HTTP and HTML may not be the right model anymore.

We make this assumption that Google is the Internet and that they have the entire World Wide Web in their database. Google announced in their blog a few years ago that they had sighted 1 trillion unique URLs on the web. But that’s just a tiny fraction of the content that’s being created for the web. The current method of crawling is very polite, i.e.: “Can I bring down this page?” If they tried to pull all of it down immediately, the site would go down and they’d be in court the next day.
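One concrete form that politeness takes is robots.txt, which a crawler can consult before fetching a page. Mike did not go into this, so the Python sketch below is an assumption about what “Can I bring down this page?” looks like in practice. It uses Python’s standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent and the example.com URLs are all made up.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt, parsed from a string so nothing is fetched over the network.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "Can I bring down this page?"
allowed_public = parser.can_fetch("ExampleBot", "http://example.com/books")
allowed_private = parser.can_fetch("ExampleBot", "http://example.com/private/page")

# How many seconds the site asks crawlers to wait between requests.
delay = parser.crawl_delay("ExampleBot")

print(allowed_public, allowed_private, delay)
```

A crawler that honors the delay can only take a couple of pages a minute from such a site, which is part of why nobody gets the whole web in real time.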

HTML was never intended to do rich applications. The technology was created for something else entirely.

Google has a new protocol, SPDY, a new method to speed up the process. Mike thinks there’s an even newer protocol – multi-modal. One of the most important signals is understanding the way people connect together and share. It’s about a change in end user behavior and connecting those people together.


One Comment

  1. Matt Pee on September 9, 2010 at 3:49 pm

    Actually, I quite enjoyed this post :) I don’t care about what the haterz say, it had learnings of knowledge I could never fetch of my own and on my own, well played?
