SMX Advanced: Duplicate Content Summit
(This was posted live as it happened; I’ll try to clean it up later.)
Eytan Seidman, MSFT Live Search
Dupe content fragments your pages – it hurts links (which page should someone link to?) and prevents you from concentrating your rank on one page
Avoid duplicate content
– avoid session IDs in URLs (see the sketch after this list)
– multiple sites with the same content (e.g., separate sites for separate countries): there should be differentiation between the two sites
– use client-side redirects whenever possible, not server-side redirects (REALLY? Eytan clarifies this in the Q&A below)
– http and https — make sure content is not duplicated; use absolute links on https pages
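My quick sketch of the session-ID point, not from the panel: compute the session-ID-free URL and 301-redirect duplicates to it. The parameter names, the example URL, and the use of Python’s urllib.parse are my own assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session-style parameters; substitute whatever your platform emits.
SESSION_PARAMS = {"sessionid", "sid", "PHPSESSID"}

def canonical_url(url):
    """Return the URL with session-style query parameters stripped."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

if __name__ == "__main__":
    url = "http://www.example.com/product?id=42&sessionid=abc123"
    # A server would send a 301 from the original URL to this canonical one.
    print(canonical_url(url))  # http://www.example.com/product?id=42
```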
Avoid having your content duplicated
– verify the user-agent; only feed good ones your content (see the verification sketch after this list)
– block unknown IPs from crawling
– but, minimize the blocking of legit users
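Another sketch of my own, not shown in the session, of the verify-the-user-agent advice: reverse-DNS the requesting IP, check the hostname suffix, then forward-confirm it. The crawler hostname suffixes below are assumptions to check against each engine’s own documentation.

```python
import socket

# Assumed crawler hostname suffixes; verify against each engine's documentation.
CRAWLER_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def is_verified_crawler(ip):
    """Reverse-DNS the IP, check the hostname suffix, then forward-confirm it."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith(CRAWLER_SUFFIXES):
        return False
    try:
        # The hostname must resolve back to the same IP, or the visitor is spoofing.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```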
How Live Search handles dupe content
– generally, no site-wide penalties
– we look aggressively for session IDs at crawl time; prune down to useful results for searchers; avoid showing pages that are substantially identical
– we use a mechanism for finding near-duplicates that looks at key content on a page
Peter Linsley, Ask.com
Issues for Webmasters
– risk of missing votes, harming your site’s popularity
– risk of wrong candidate selection; don’t leave it up to the search engine to figure out the most important page
– some cases of dupe content are beyond your control
Ask.com and Dupe Content
– not a penalty; effect is similar to not being crawled – we don’t knock you down in the rankings
– we don’t consider templates/fluff; focus is on indexable content
– filter only when confidence factor is high
– candidate for filtering is identified from numerous signals
What to Do
1. Put content at a single URL
2. Make content unique
3. Make it hard for scrapers to grab your content, including legal action
4. Contact us – re-inclusion requests
Amit Kumar, Yahoo
Where does Yahoo eliminate dupes?
– at all points, but as much as possible at query-time
– Crawl filtering
— less likely to extract links from known dupes
— less likely to crawl new pages from known dupe sites
– Index filtering
Legitimate reasons to duplicate
– content in multiple formats (HTML, PDF, etc.)
– content in multiple languages
– site navigation, templates, etc. — not considered duplicate content
Accidental duplication
– session IDs in URL; may inhibit crawling; not only applicable to dynamic URLs – can happen on static URLs, too
– soft 404s; without a real 404 response, we can crawl many copies of the same “not found” page (see the sketch below)
– this is not abusive, but can hamper our ability to display your pages
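A quick sketch of my own for the soft-404 point, using Python’s wsgiref only for illustration (the routes are invented): the “not found” page has to actually return a 404 status, otherwise crawlers see endless 200-status copies of the same page.

```python
from wsgiref.simple_server import make_server

KNOWN_PATHS = {"/": b"<h1>Home</h1>"}  # stand-in for real routing

def app(environ, start_response):
    body = KNOWN_PATHS.get(environ["PATH_INFO"])
    if body is None:
        # A soft 404 would answer "200 OK" here with a "not found" body,
        # producing endless duplicates. Return a real 404 status instead.
        start_response("404 Not Found", [("Content-Type", "text/html")])
        return [b"<h1>Page not found</h1>"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```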
Abusive duplication (scraping content, deliberate duplication, etc.) may lead to “unanticipated results” for webmasters
1. Avoid bulk duplication of main content – use robots.txt to tell crawlers what to crawl (see the robots.txt check after this list)
2. Avoid accidental duplication via session IDs, soft 404s
3. Avoid duplication across different domains
4. Be careful about importing content from other sites — are you adding value?
5. Use the robots-nocontent attribute to mark up low-value content
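One more sketch of my own for point 1: after disallowing duplicate URL variants (a printer-friendly path in this hypothetical example) in robots.txt, Python’s urllib.robotparser can confirm the rules block what you intend.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt keeping crawlers out of a duplicate "print" variant.
rules = """
User-agent: *
Disallow: /print/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

for url in ("http://www.example.com/article/42",
            "http://www.example.com/print/article/42"):
    print(url, "->", "crawlable" if parser.can_fetch("*", url) else "blocked")
```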
Vanessa Fox, Google
Vanessa is using examples from Buffy the Vampire Slayer to describe dupe content; see Lisa Barone’s post at the Bruce Clay blog for details. Or Tamar’s post at SE Roundtable. I’ve never seen Buffy. They have.
Q&A Session
Eytan: I do count 301 redirects as client-side redirects; I want to clarify that earlier point.
Amit: Yahoo feels the same way about using 301s to redirect users and spiders.
Peter: For the most part, a meta-refresh is treated the same as a 301.
Vanessa: I wouldn’t use “nofollow” to get rid of duplicate pages. Other people might still link to the pages you don’t want.
Eytan: Date/time stamps are a factor, but not so much outside of a news/blog scope.
Peter: We don’t rely on it much because it can be gamed; scrapers can indicate their version was posted months earlier.