It was about 4:45 pm on August 9th. I was leaving the Search Engine Strategies Conference in San Jose, both arms full with my laptop in one and a goody bag in the other. I turned away from the main stairs and headed toward the elevator. As I approached, I saw a guy in dark glasses who looked familiar.
“Are you Jon?” I asked.
“I am,” he said.
Jon Glick is the Senior Director of Product Search and Comparison Shopping for the shopping search engine Become.com. Prior to joining Become.com in 2005, Jon spent several years with Yahoo! as a key member of their search team, and was integral to Yahoo!’s launch of its own search engine in 2004. With more than 15 years of experience in the search industry, he’s intimately familiar with crawling, indexing, and search relevancy.
I recognized Jon from his presentation a day earlier during the Search Engine Algorithm Research session. We started talking on the short elevator ride, continued while Jon helped me survive a finicky parking garage pay station, and finished about 15 minutes later next to my rented PT Cruiser. It was a great conversation about all things search.
At some point I mentioned the 2004 interview that Jon did with Mike Grehan, and nominated myself as the person who should update that interview. Jon agreed.
Over the course of the past two months, Jon and I exchanged more than a dozen e-mails, covering a wide range of search- and SEO-related topics. With so much material, I’m going to break up our conversation into three parts (over three weeks) to make it easier to read and discuss. As always, comments are welcome.
Part One: Beating Google, Search Spam, Sandboxing, and more…
Matt: Before we get into the nuts and bolts, let’s start with some “big picture” items. Google has become synonymous with search. How secure do you think their hold is on being the most popular search engine?
Jon: Right now it is very secure. Ironically, it’s due more to their brand strength than the quality of their results. Engines like Yahoo! have results quality close to Google’s, but when people think “search” they think Google. People aren’t going to Yahoo! or MSN for search because those brands stand for too many different features. Yahoo! is my e-mail and finance; it may be someone else’s fantasy sports and maps.
Like most brands, Google’s brand strength is based on their initial superiority. Starting in 1999, Google was basically head and shoulders above all of the other engines for five years running. People still hold on to that first impression. If you’re happy with Google, are you really going to go back and see if Ask has improved? It’s similar to traditional brands like Crest. It was the first fluoride toothpaste and is still the top brand despite the fact that all toothpastes have fluoride today.
However, Google’s brand strength is a double-edged sword. They are so synonymous with web search that as search changes and expands, Google is not necessarily associated with the new forms. Great examples of this changing environment are the various search verticals. When people think of travel search, it’s companies like Orbitz and Travelocity that come to mind. YouTube exploded into video search and Google Video was struggling to catch up … and couldn’t; Google had to spend billions to buy them out. In many of these verticals, Google’s offering ends up riding coattails much like Yahoo!’s or AOL’s search offerings do. People used Froogle mainly because it was available on Google.com, much the same way people use MSN search because MSN is their portal/homepage.
What are the chances we’ll see some combination of Yahoo!, MSN, and/or Ask.com merging to try to beat Google?
Unlikely, in my opinion. It’s more likely that these companies will try to partner with or acquire other businesses that also feel threatened by Google’s expansion. eBay has been one of those discussed recently. I actually thought Ask Jeeves would have been a good fit for Microsoft before IAC acquired them. It would have given Microsoft a great search engine, the Teoma technology and team, from day one to build upon. Instead they are developing a search engine from the ground up and are discovering just how difficult it is to do well, and how long it takes to get right.
Google’s brand is so strong right now that it’s going to be very hard for a competitor to win by just having 10 better URLs in the results. To win back users, competitors probably need to redefine the current Google-dominated search paradigm — which would be a good thing since showing text links and little text ads is not the be-all and end-all for finding information on the internet. Search is barely a decade old, and there’s a lot more progress to be made. Teaming resources probably won’t help these companies get to that next level, which is why I feel such partnerships are unlikely.
Let me ask you about search spam. From keywords to links to blogs (content), everything the engines have used to rank web pages that could be spammed has been spammed. Will it ever end? Is a spam-proof algorithm possible?
From a purely algorithmic perspective, spam proofing will be very difficult if not impossible. What makes search algos scalable across the billions of pages that are now being indexed is that they can identify what, on average, is a good page. It’s always possible for a spammer to design pages with these hallmarks of quality to try to work into the rankings. That said, I would expect the spam resistance of the engines to improve significantly over the next few years. The major engines had let some of the blog and CSS spam get totally out of control, and the new wave is stitched-together, made-for-AdSense (MFA) “collage sites.” Recently Yahoo! and Google have been doing a lot of research into the nature of link structures to identify those that are created artificially, and both have “sandboxes” to limit hit-and-run spamming. At Become.com, our algo looks at both inlinks and outlinks, as well as the “aboutness” of the linking pages, which makes it much harder to set up link farms. Getting lots of links to a gardening website by swapping links with poker sites and online pharmacies is becoming a much less viable way to spam.
I want to ask you about Become.com and shopping search a bit later, but I think the headline you’re creating here is “Jon Glick confirms the sandbox, and says Yahoo has one, too.”
Sandboxing just makes sense. If you suddenly ran into a domain that you’d never crawled before and found it contained three million pages and a million inlinks, the odds are it’s spam. So before you fill up your index with these pages or give them high placement in the rankings, you want to see if the site is legit. Nacho Hernandez of iHispanic.com sent me a great study on crawling (www.drunkmenworkhere.org) that actually shows sandboxing in action. From this analysis it’s clear that Yahoo! and Google both have a sandbox, while MSN does not.
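(Ed. note: the reasoning Jon describes can be sketched in a few lines of Python. This is only an illustration of the idea, not how Yahoo! or Google actually decide; the `DomainStats` fields and every threshold below are invented for the example.)

```python
from dataclasses import dataclass


@dataclass
class DomainStats:
    """Hypothetical per-domain facts a crawler might track."""
    days_since_first_crawl: int
    page_count: int
    inlink_count: int


def should_sandbox(stats, min_age_days=90, max_pages=100_000, max_inlinks=50_000):
    """Hold back a domain that is both brand new and implausibly large.

    All three thresholds are made-up illustration values.
    """
    is_new = stats.days_since_first_crawl < min_age_days
    is_suspiciously_big = (
        stats.page_count > max_pages or stats.inlink_count > max_inlinks
    )
    return is_new and is_suspiciously_big


# The case from the interview: never crawled before, three million pages,
# a million inlinks.
overnight_giant = DomainStats(days_since_first_crawl=0,
                              page_count=3_000_000,
                              inlink_count=1_000_000)
print(should_sandbox(overnight_giant))
```

An established site of the same size, or a brand-new site with a handful of pages, would pass straight through under these assumptions.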
The search engines aren’t going to be able to stamp out spam, but the harder and more complex they make it, the more sense it makes for spammers to switch to white hat tactics, especially for clients. We are seeing this in some of the lucrative non-US markets where the value of Yahoo Search Marketing/AdSense is high enough that the engines are ramping up spam removal efforts. A few years ago, one of the major search engines had 1/8 of all its German pages from a single spammer, including the top 300 results for the term “auto”, something that would never have been allowed in the US. This year I’ve been hearing from optimizers how white hat tactics are actually generating better ROI in major EU markets due to increased spam patrolling.
There is a way to make an engine virtually spam-proof, and that’s to go editorial and charge for it. Bill Gates has said that if it cost $0.05 to send an e-mail, that would end e-mail spam; a similar economic disincentive model would work on the web. If you had to pay Google $10 to validate your URL and for them to regularly verify your content, much of today’s spam would cease to be economically viable. I don’t think anyone would like this solution though. The element of this idea that we are seeing is that search engines are increasingly using editorial input to flag good content and punish poor content. This makes sense; as they serve more users, the editorial cost per user to evaluate popular queries gets lower.
Is there more hand-editing of SERPs than we’re all led to believe?
Search engine algos do a good job most of the time and are unmatched for scalability over billions of crawled pages and billions of unique queries. However, there are cases where the engines need to have better results for an important query. In these cases, the easiest thing for the search engine team to do is make human edits. If someone searches for “Olympics”, odds are the site for the upcoming Beijing 2008 games is what they are actually searching for, but it has fewer inlinks than the Athens 2004 games. Would you retrain your whole index/algo system, or build a shortcut to quickly fix this specific result?
Every major search engine has a method for editorial influence over the results. Some are very direct, e.g., make www.foo.com the #1 result for “foo”. Others are slightly more indirect. A hypothetical example would be to give www.foo.com credit for 100,000 anchor text hits for “foo”. This doesn’t actually place the site at #1, but pretty much ensures that the existing algo will rank it #1 … and you can claim that the ranking was done by the algo, not an editor. Search engines are understandably loath to talk about their editorial processes. If you have a human judgment element, it opens you up to allegations of bias. All the major search engines work very hard to maintain objectivity; there’s no equivalent to Fox News in the search engine space (unless we’re talking about China!). That takes discipline. At Yahoo!, the search team has to tell the VPs of the various Yahoo! properties that they don’t get to be #1 just because it’s Yahoo!’s search engine.
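(Ed. note: to make the difference between the direct and indirect approaches concrete, here is a toy sketch. The ranking function, the competing URLs, and the hit counts are all invented; a real engine weighs far more than anchor text.)

```python
def algo_rank(anchor_hits):
    """Toy 'algorithm': order URLs by anchor-text hits for a single query."""
    return sorted(anchor_hits, key=anchor_hits.get, reverse=True)


def direct_override(ranked_urls, pinned_url):
    """Direct edit: force one URL to #1, leave the rest in algorithmic order."""
    return [pinned_url] + [url for url in ranked_urls if url != pinned_url]


def indirect_override(anchor_hits, url, credit=100_000):
    """Indirect edit: grant extra anchor-text credit, then let the algo rank."""
    boosted = dict(anchor_hits)
    boosted[url] = boosted.get(url, 0) + credit
    return algo_rank(boosted)


# Invented anchor-text counts for the query "foo".
hits = {"www.foo.com": 120, "www.example.org": 900, "www.example.net": 450}

print(algo_rank(hits))                         # untouched algorithmic order
print(direct_override(algo_rank(hits), "www.foo.com"))
print(indirect_override(hits, "www.foo.com"))
```

Both overrides put www.foo.com on top, but only the second one leaves the ranking code path untouched, which is the point Jon makes about being able to say the algo did it.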
You mentioned MFA “collage sites” earlier. Would that be something like this? (http://50th-birthday.info/) (Ed. note: this site has changed since Jon and I discussed it.)
Yeah, that’s a good example. The growth of “Ads by Google” as a monetization tool for spam has made having site content more critical. The spammers need to auto-generate hundreds of thousands of pages that have enough content to attract desirable contextual ads and seem unique enough to foil the duplicate content detection. They look for rich content sources and paste sections together randomly. Popular content sources are RSS news feeds, Wikipedia, and specialized content sites that can be scraped. These then get blended with contextual ads and paid listings.
The upshot is that spam is getting more relevant. In the late ’90s, a porn site would try to rank for “golf” by repeating the word golf dozens of times in hidden text on a page. Now spammers try to rank for golf by cobbling together golf-related content and then showing paid listings. At least the user who clicks on search engine spam is getting some on-topic information. The problem with this type of spam is that it clogs up the search results with sub-optimal, duplicated information.
A couple months ago, I semi-jokingly made an “Oversimplified Search Engine Algorithm Scale” graphic showing how the big three engines rank pages differently. How far off was I, and how do you think that graph will look a year or two from now?
Qualitatively that’s a pretty good comparison of the algos. I’d say Yahoo! and Google are reasonably close to each other in terms of weighting (i.e. Yahoo! is left of center on the graph) and Yahoo! and Google will get closer over time. We will see MSN give linking and other query independent factors more weight as they improve their algo to get competitive. Since link spam is harder to detect than keyword spam, MSN will need to develop much more sophisticated systems to be able to move to the left without decreasing relevancy.
From a search engine perspective, we look at “Query Dependent” vs. “Query Independent” ranking factors because of how the systems are architected. The query-independent factors are those that you can pre-compute, like PageRank (LinkFlux in Yahoo! lingo) and page “spamminess”. These impact your ranking regardless of the search term. Query-dependent factors are things like title tags, body copy, and anchor text, whose value can only be computed once a user enters a search.
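(Ed. note: a rough sketch of that split, for readers who think in code. The documents, the pre-computed numbers, the field weights, and the 50/50 blend are all assumptions made up for illustration; no engine publishes its real formula.)

```python
# Query-independent factors: computed offline and stored per URL.
STATIC = {
    "a.example": {"link_score": 0.4, "spamminess": 0.0},
    "b.example": {"link_score": 0.9, "spamminess": 0.0},
}

# Page text that can only be scored once the query is known.
DOCS = {
    "a.example": {"title": "golf clubs for beginners",
                  "body": "choosing golf clubs",
                  "anchor": "golf clubs"},
    "b.example": {"title": "poker tips",
                  "body": "poker strategy",
                  "anchor": "poker"},
}

# Hypothetical weights for each query-dependent field.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def query_dependent_score(doc, query):
    """Count query-term matches per field, weighted by field importance."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc[field].lower().split()
        score += weight * sum(words.count(term) for term in terms)
    return score


def final_score(url, query, mix=0.5):
    """Blend the stored static score with the query-time score."""
    static = STATIC[url]["link_score"] * (1.0 - STATIC[url]["spamminess"])
    dynamic = query_dependent_score(DOCS[url], query)
    return mix * static + (1.0 - mix) * dynamic


for url in DOCS:
    print(url, final_score(url, "golf"))
```

In this toy data, b.example has the stronger link score, yet a.example wins for “golf” because only the query-dependent half knows what the user asked for.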
One of the panels at SES posed this question, and I wonder how you’d answer: Can you please all the search engines?
You can please all the engines; look how well Wikipedia is doing. It’s all of the fundamentals: great content that attracts users, in-links and great anchor text in droves. The joke at SES was that now that Google indexes Wikipedia, they only have to figure out what the other nine results will be. Seriously though, while different tactics are more impactful on some engines than others (e.g., MSN tends to give more weight to title text than other engines), there are very few cases where changes will help you on one engine and hurt you in others. Since Google has 60% market share of US searches, most sites I know optimize for Google and just see what they get from everyone else.
I would advise sites to start by optimizing for Google, but follow on with specific steps designed to help with the other engines. For example, adding RSS for your site will get you more aggressive and frequent crawling by Slurp, and title tag optimization will pay big dividends in MSN. None of these will hurt you in an algo like Google’s. You get into tradeoffs in two places: time and over-optimization. Since Google has more traffic than Ask, the ROI of spending time on Ask-specific steps might not compare favorably to general ranking improvement steps like link building that will help on all engines. Regarding over-optimization, each engine has its own thresholds for things like maximum keyword density (above which you are considered spam). Increasing keyword density might help you in one engine at the expense of getting banned from another. Likewise, sophisticated CSS spam that might help you in MSN will probably blacklist you in Yahoo!.
I think you just answered a question I was planning to ask later: Keyword density – does it matter? Should webmasters have a calculator handy for each page of content they write?
Keyword density does matter. While there are diminishing returns when increasing the density, it doesn’t hurt a page’s ranking until you go past the level each engine considers spam. A good rule of thumb is that if the page becomes awkward to read due to the excessive use of keywords then you’ve gone too far.
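(Ed. note: for anyone who does want that calculator, keyword density is just the share of a page’s words that are the keyword. A minimal sketch for single-word keywords follows; the tokenizing rule is a simplification, and the engines’ actual spam thresholds are not public, so none is hard-coded here.)

```python
import re


def keyword_density(text, keyword):
    """Fraction of the words in `text` that equal `keyword` (case-insensitive).

    Handles single-word keywords only; phrases would need n-gram counting.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


sample = "Golf clubs and golf balls"
print(f"{keyword_density(sample, 'golf'):.0%}")
```

Jon’s readability test is still the better guide: if the copy sounds awkward when read aloud, the number is already too high.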
How much particular words count in the ranking is also dependent on inverse document frequency (IDF). Search engines look at IDF when ranking based upon the principle that the more a word is used on a given page, relative to its use on the web in general, the more it helps to uniquely define that page. This is a long-standing IR concept, so there’s lots of great research available.
What I would remind site designers of is that while PageRank “flows” through a site, keywords are page-specific. Web designers often forget to use important keywords. For example, this page from the Woolrich site doesn’t use the word “woolrich” to describe the product. As a result, despite being the official site, they are getting outranked for the term “woolrich flannel shirt” by sites like ezflyfish.com. Because of IDF, “woolrich” is actually the most important of the three terms to have on the page.
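(Ed. note: here is a small sketch of why IDF favors “woolrich”. The six-page “web” below is invented purely for illustration, and the formula is one common smoothed textbook variant, not any particular engine’s.)

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency: rarer terms score higher."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    return math.log((1 + n_docs) / (1 + doc_freq))


# A made-up six-page corpus standing in for "the web in general".
toy_web = [
    "red flannel shirt",
    "blue cotton shirt",
    "woolrich flannel shirt",
    "wool flannel blanket",
    "dress shirt sale",
    "cotton pajamas",
]

for term in ("woolrich", "flannel", "shirt"):
    print(term, round(idf(term, toy_web), 3))
```

In this toy corpus “shirt” appears on four pages, “flannel” on three, and “woolrich” on one, so “woolrich” carries the most weight of the three and is the costliest one to leave off the page.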
At SES you explained that the engines watch how often a site changes and that “meaningful changes” on a regular basis can get a site crawled more often. How would you define a “meaningful change”?
What search engines are looking to find are changes a human reader would find to be significant. The early change analyses were little better than checksums and a lot of time was wasted recrawling pages that were only subtly different every time the crawler viewed them (such as showing the current date or weather in the top corner). Engines are now much more sophisticated, using techniques like shingling to measure the extent to which pages change. That changing date won’t count anymore, but posting a new entry at the top of your blog will. Using shingles analysis also means that page content actually has to change; just shuffling it has no impact.
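(Ed. note: shingling is easy to try at home. The sketch below breaks a page into overlapping three-word “shingles” and measures how many differ between two versions. The shingle size and sample text are arbitrary choices for illustration; production systems typically hash and sample the shingles rather than compare full sets.)

```python
def shingles(text, w=3):
    """Return the set of overlapping w-word sequences in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def change_ratio(old, new, w=3):
    """1 minus the Jaccard similarity of the two shingle sets (0.0 = same)."""
    a, b = shingles(old, w), shingles(new, w)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


body = ("the quick brown fox jumps over the lazy dog and "
        "search engines measure how much a page really changes")

monday = "monday " + body
tuesday = "tuesday " + body                      # only the date changed
new_post = "a brand new blog entry about shingling appears here " + monday

print(round(change_ratio(monday, tuesday), 3))   # small change
print(round(change_ratio(monday, new_post), 3))  # larger change
```

Swapping the date touches only the shingles that overlap it, while a new entry adds a whole run of shingles the old version never had; on a real page with hundreds of words the date change would shrink toward zero.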
Coming up in Pt. 2: Personalization, Linking, Local Search, and SEO “Fact or Fiction”…