SMX Advanced: Duplicate Content Summit

Filed in Conferences/Educ. by Matt McGee on June 4, 2007

(This was posted live as it happens; I'll try to clean this up later.)

Eytan Seidman, MSFT Live Search

Dupe content fragments your pages – hurts links (which page should someone link to?) and prevents you from concentrating your rank on one page

Avoid duplicate content
– avoid session IDs in URLs (see the sketch after this list)
– multiple sites with the same content (i.e., separate sites for separate countries) should be differentiated from one another
– use client-side redirects whenever possible, not server-side redirects (REALLY?)
– http and https — make sure content is not duplicated; use absolute links on https pages
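
(Quick illustration on the session-ID and redirect points, since they're easy to get wrong. This is my own minimal sketch, not anything Eytan showed; the "sessionid" parameter name and the example hostname are made up, and it assumes a Python/WSGI setup.)

    # Sketch: 301-redirect any URL carrying a session ID to its clean, canonical
    # form, so crawlers only ever see one URL per page.
    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    SESSION_PARAM = "sessionid"          # hypothetical parameter name
    SITE = "http://www.example.com"      # placeholder hostname

    def canonical_url(url):
        """Return the URL with the session-ID parameter stripped out."""
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query) if k != SESSION_PARAM]
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))

    def app(environ, start_response):
        """Toy WSGI app: redirect session-ID URLs, serve everything else normally."""
        requested = SITE + environ.get("PATH_INFO", "/")
        if environ.get("QUERY_STRING"):
            requested += "?" + environ["QUERY_STRING"]
        clean = canonical_url(requested)
        if clean != requested:
            start_response("301 Moved Permanently", [("Location", clean)])
            return [b""]
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<p>Page content, with absolute links on the https version.</p>"]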

Avoid having your content duplicated
– verify the user-agent; only feed good ones your content (sketch below)
– block unknown IPs from crawling
– but minimize the blocking of legit users
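
(He didn't spell out what "verify the user-agent" means in practice. The usual technique, and this sketch is mine rather than Microsoft's, is a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm it, so a scraper can't just fake a crawler's user-agent string. The domain list below is illustrative; check each engine's own documentation.)

    # Sketch: confirm a request that claims to be a major crawler really comes
    # from one, via reverse DNS plus a forward-DNS confirmation.
    import socket

    CRAWLER_DOMAINS = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

    def is_verified_crawler(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]                 # reverse lookup
            if not host.endswith(CRAWLER_DOMAINS):
                return False
            return ip in socket.gethostbyname_ex(host)[2]      # forward lookup must match
        except (socket.herror, socket.gaierror):
            return False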

How Live Search handles dupe content
– generally, no site-wide penalties
– we look aggressively for session IDs at crawl time; prune down to useful results for searchers; avoid showing pages that are substantially identical
– we use a mechanism for finding near duplicates and looking at the key content on a page
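
(Live Search obviously isn't going to publish its near-duplicate mechanism, so here's the textbook version of the idea as a rough sketch of my own: break each page's main content into overlapping word "shingles" and compare the overlap between pages.)

    # Sketch of a common near-duplicate technique (shingling + Jaccard overlap).
    # Illustration only; not Live Search's actual mechanism.

    def shingles(text, k=5):
        """Set of overlapping k-word shingles from a page's extracted main content."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def similarity(text_a, text_b, k=5):
        """Jaccard similarity of the two shingle sets (1.0 means identical)."""
        a, b = shingles(text_a, k), shingles(text_b, k)
        return len(a & b) / len(a | b)

    # Two pages scoring above some high threshold (say 0.9) would be treated
    # as substantially identical and collapsed to one result.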

Peter Linsley, Ask.com

Issues for Webmasters
– risk of missing votes, harming your site's popularity
– risk of wrong candidate selection; don't leave it up to the search engine to figure out your most important page
– some cases of dupe content are beyond your control

Ask.com and Dupe Content
– not a penalty; effect is similar to not being crawled – we don’t knock you down in the rankings
– we don’t consider templates/fluff; focus is on indexable content
– filter only when confidence factor is high
– candidate for filtering is identified from numerous signals

What to Do
1. Put content at a single URL (see the sketch after this list)
2. Make content unique
3. Make it hard for scrapers to grab your content, including legal action
4. Contact us – re-inclusion requests
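
(For point 1, the most common case is the same pages resolving on both the www and non-www hostnames. A minimal sketch of consolidating them with a 301, my example rather than Peter's, with placeholder hostnames:)

    # Sketch: 301 every request on a non-canonical hostname over to the one
    # canonical host, so each page lives at a single URL.
    CANONICAL_HOST = "www.example.com"   # placeholder

    def canonicalize_host(environ, start_response, serve_page):
        host = environ.get("HTTP_HOST", CANONICAL_HOST)
        if host != CANONICAL_HOST:
            location = "http://" + CANONICAL_HOST + environ.get("PATH_INFO", "/")
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return serve_page(environ, start_response)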

Amit Kumar, Yahoo

Where does Yahoo eliminate dupes?
– at all points, but as much as possible at query-time
– Crawl filtering
— less likely to extract links from known dupes
— less likely to crawl new pages from known dupe sites
– Index filtering
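
(Amit didn't get into mechanics, but conceptually query-time filtering just means collapsing the result list so only one URL from each known duplicate cluster is shown. A rough sketch of that idea, with data structures of my own invention:)

    # Rough sketch of query-time duplicate filtering: keep only the best-ranked
    # URL from each duplicate cluster. Cluster IDs would come from earlier
    # crawl/index analysis; everything here is illustrative.

    def filter_results(ranked_urls, cluster_of):
        """ranked_urls: best-first list; cluster_of: maps URL -> duplicate-cluster ID."""
        seen, filtered = set(), []
        for url in ranked_urls:
            cluster = cluster_of.get(url, url)    # unclustered URLs stand alone
            if cluster not in seen:
                seen.add(cluster)
                filtered.append(url)
        return filtered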

Legitimate reasons to duplicate
– content in multiple formats (HTML, PDF, etc.)
– content in multiple languages
– site navigation, templates, etc. — not considered duplicate content

Accidental duplication
– session IDs in URL; may inhibit crawling; not only applicable to dynamic URLs – can happen on static URLs, too
– soft 404s; if your error page doesn’t return a real 404 status, we can crawl many copies of the same “not found” page (sketch below)
– this is not abusive, but can hamper our ability to display your pages
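
(The soft-404 point is easy to check on your own site: the friendly error page is fine, it just has to be served with a real 404 status instead of a 200. A toy sketch of mine, again assuming a Python/WSGI setup and a placeholder page list:)

    # Sketch: serve a "not found" page WITH a genuine 404 status, so crawlers
    # don't index endless copies of it as if it were a normal page.
    NOT_FOUND_HTML = b"<html><body><h1>Sorry, that page doesn't exist.</h1></body></html>"
    EXISTING_PAGES = {"/", "/products.html"}     # placeholder for a real lookup

    def app(environ, start_response):
        if environ.get("PATH_INFO", "/") not in EXISTING_PAGES:
            start_response("404 Not Found", [("Content-Type", "text/html")])
            return [NOT_FOUND_HTML]              # real 404 status, not a "soft" one
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<p>Real page content.</p>"]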

Abusive duplication (scraping content, deliberate duplication, etc.) may lead to “unanticipated results” for webmasters

1. Avoid bulk duplication of main content – use robots.txt to tell crawlers what to crawl (see the examples after this list)
2. Avoid accidental duplication via session IDs, soft 404s
3. Avoid duplication across different domains
4. Be careful about importing content from other sites — are you adding value?
5. Use the robots-nocontent attribute to mark up low-value content
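
(For points 1 and 5, the markup itself is simple. robots.txt keeps crawlers out of whole URL patterns, and Yahoo's robots-nocontent is a class value that flags page regions Slurp should ignore as content. Illustrative snippets only; the /print/ path and the sample markup are my examples, not Yahoo's:)

    # robots.txt: keep crawlers away from printer-friendly duplicates
    User-agent: *
    Disallow: /print/

    <!-- robots-nocontent: tells Yahoo's crawler this block isn't the page's real content -->
    <div class="robots-nocontent">
      Site-wide navigation, legal boilerplate, etc.
    </div>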

Vanessa Fox, Google

Vanessa is using examples from Buffy the Vampire Slayer to describe dupe content; see Lisa Barone’s post at the Bruce Clay blog for details. Or Tamar’s post at SE Roundtable. I’ve never seen Buffy. They have.

Q&A Session

Eytan: I do include 301 redirects as a client-side redirect; want to clarify that earlier point.
Amit: Yahoo feels the same way about using 301s to redirect users and spiders.
Peter: For the most part, a meta-refresh is treated the same as a 301.

Vanessa: I wouldn’t use “nofollow” to get rid of duplicate pages. Other people might still link to the pages you don’t want.

Eytan: Date/time stamps are a factor, but not so much outside of a news/blog scope.
Peter: We don’t rely on it much because it can be gamed; scrapers can indicate their version was posted months earlier.