Data Mining – Web Content Spammers & Blekko

By Dale Reagan | November 10, 2010

Have you noticed that the quality of the pages search engines lead you to has been declining?  Wonder why? [Lots of questions in this post…]

A partial answer: content farms – web sites where articles are created based on hot topics.  The articles typically provide minimal useful information and are heavy with SEE (search-engine-enticements), while the pages are filled with marketing/ads (which, of course, create revenue for the content farm…)  Companies are mining search data for ‘popular’ search terms or ‘trending topics’ and then creating web page content that targets the SEO aspects of the topic – this brings their pages into competition with the quality pages.  Is this SPAM?  The content may actually be semi-good (perhaps scarfed from a ‘real’ web resource…) or possibly custom-created at low or controlled cost.  This semi-good content is being presented from domains which may previously have represented a higher level of originality as well as a higher level of content quality.

Ever used a search engine and ‘found’ a page that simply contained a long list of content that had no relevance to your search?  Was it useful?  Is anyone being scammed here?  You follow the search trail to a page that seems to be relevant or that may even be helpful – what’s wrong with that?  Did you let the search engine know that it was providing bad links/content?

Blekko.Com – slash the web!

I started this post in the spring of 2010.  Now (fall of 2010) a new search engine has presented itself with the lofty goal of removing/not displaying SPAM web sites/pages.  Blekko.Com is using the power of group think to filter search results.  Hmmm – using the power of many minds (PMM?) to tag/flag good or bad web resources?  Sounds like a good idea to me.  Will this offering provide a shorter path or shorter time to useful search results?

I previously wrote a short post about the need for a solution-engine.  I have not found it yet.  There have been several developments in the search arena since the post (along with some private discussions) and now, Blekko might be getting us closer to really-relevant-search-results (RRSR.)

It should be noted that I may not be a typical web searcher – I usually use long tail search terms (i.e. multiple key words or very specific search strings) to reduce/qualify my searches.  Some example long-ish tail search terms:

  1. mod_geoip apache block country sample
  2. mod_geoip apache block countries in Asia
  3. mod_security apache example rules
  4. mod_security apache sample rules
  5. “Unable to connect to CUPS server localhost:631 – Connection refused”
  6. “blocking web spam”

Based on my experience, using three or more terms usually provides significantly better (more relevant) results for review.  Provided that web page authors create search-friendly content, my results are even better (and so are the content farms…)  As I encounter content farms I make mental notes not to use them – here is where a search engine like Blekko might save me some time; other users can/will tag SPAM sites, which will reduce their chances of showing up in the search results.  Again, sounds like a good idea to me.

So, any problems with this approach (i.e. using group think to filter search results)?

The first thing that comes to mind is that your content (or my web pages) might have a smaller chance of showing up in search results since, by its own premise, Blekko (or a similar approach) is using a limited set-of-minds to create filters.  Some enterprising entity could create a bot that trampled any competitor’s chance of placing in the top search results – that would not be good.  There have already been a number of social web sites where user ratings have been compromised due to such activities.  Will Blekko survive such manipulations?  After all, that is part of the current problem with the major search engines (quality results being drowned by what amounts to noise…)

Some trial searches

At this time, based on some tests, Blekko does not seem to be providing (me) with adequate, relevant results when using a long tail search approach.  Some results show the same domain two or more times in the returned list (in some cases 10 or more results from the same domain.)  Many results seem to exclude my long tail terms, leading me to believe that the results are not as relevant as when using other search engines.  I expect to find some level of similarity in quality/relevance across all search engines – if I don’t, then I question the results.  Quality is quality and no serious search engine can ignore it…
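
One way to check a result page for this kind of repetition is to count how many links each domain contributes.  The following is a minimal Python sketch, not something from my original tests: the function names and the sample URLs are made up for illustration, and you would need to collect the result links yourself.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(result_urls):
    """Count how many results each host contributes to one result page."""
    hosts = [urlparse(url).netloc.lower() for url in result_urls]
    return Counter(hosts)


def repeated_domains(result_urls, limit=1):
    """Return the hosts that appear more than `limit` times."""
    counts = domain_counts(result_urls)
    return {host: n for host, n in counts.items() if n > limit}


# Hypothetical result list, invented for illustration only.
sample_results = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/cups-help",
    "http://EXAMPLE.com/c",
]

print(repeated_domains(sample_results))  # prints {'example.com': 3}
```

Raising `limit` lets you decide how many hits from one domain you are willing to tolerate before calling a result page repetitive.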

Using my sample sets of search terms (above), the 5th set, “Unable to connect to CUPS server localhost:631 – Connection refused”, did not return any results on Blekko.  Hmm.  Bing and Google provided many results.  What does this mean, if anything?  At this point it leads me to believe that my assertion is at least partially correct, but it also leads me to believe that Blekko simply has more spidering to do.  If I remove the quotation marks from the search query then I get what seems to be a reasonable set of results (i.e. hopefully relevant, helpful links.)  Hmm.  Ok, it appears that typical approaches for Internet search (i.e. attempting to limit results via careful keyword and Boolean options) may need some re-thinking when using Blekko.  I also noted that search term sequence (order of words) seems to be ignored by Blekko (not good IMO.)
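
The difference between a quoted and an unquoted query can be illustrated with a small Python sketch.  This is only a toy model of the two matching styles and makes no claim about how Blekko (or any other engine) actually works; the function names and the sample page text are assumptions made up for the example.

```python
import re


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def matches_all_terms(query, page_text):
    """Unquoted style: every query word appears somewhere, in any order."""
    page_words = set(tokens(page_text))
    return all(term in page_words for term in tokens(query))


def matches_phrase(query, page_text):
    """Quoted style: the query words appear consecutively and in order."""
    q, p = tokens(query), tokens(page_text)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


# Hypothetical page text, invented for illustration only.
page = "Connection refused: unable to connect to CUPS server"

print(matches_all_terms("CUPS server connection refused", page))
print(matches_phrase("CUPS server connection refused", page))
print(matches_phrase("unable to connect to CUPS server", page))
```

A page can pass the any-order test and still fail the phrase test, which is one possible reason a quoted query returns nothing while the unquoted version returns plenty.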

Some interesting points about Blekko

Search results tend to be ‘less new’ (i.e. there does not seem to be a real-time aspect to the results, at least during my testing.)  The Blekko spider, ScoutJet, seems to fetch robots.txt frequently but crawls less frequently than Google, Yahoo, or Bing.
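
If you want to compare robots.txt fetches against page fetches for a given spider on your own site, a rough Python sketch along these lines could work.  It assumes an Apache-style ‘combined’ access log and that the spider can be recognized by a substring of its user-agent; the function name, the regular expression and the sample log lines (including the user-agent strings) are made up for illustration.

```python
import re
from collections import Counter

# Assumes the 'combined' log layout: "REQUEST" status size "referer" "agent"
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"')


def crawler_fetches(log_lines, agent_substring):
    """Count robots.txt fetches versus other fetches for one crawler."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not look like the assumed format
        path, agent = match.groups()
        if agent_substring.lower() not in agent.lower():
            continue  # request came from some other client
        kind = "robots.txt" if path == "/robots.txt" else "pages"
        counts[kind] += 1
    return counts


# Hypothetical log lines, invented for illustration only.
sample_log = [
    '192.0.2.1 - - [10/Nov/2010:10:00:00 -0500] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; ScoutJet)"',
    '192.0.2.1 - - [10/Nov/2010:10:00:05 -0500] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ScoutJet)"',
    '192.0.2.1 - - [10/Nov/2010:16:00:00 -0500] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; ScoutJet)"',
    '192.0.2.9 - - [10/Nov/2010:16:01:00 -0500] "GET /blog/post HTTP/1.1" 200 5120 "-" "SomeOtherBot/1.0"',
]

print(crawler_fetches(sample_log, "ScoutJet"))
```

Running the same function with a different substring gives a comparable pair of counts for another spider.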

Mobile devices – pages are ok, but somewhat annoying with lots of scrolling involved; could be improved with a new search bar at the end of any results page (on my Win 6.1 mobile using IE…)

Go to any search engine and search for: 

using Search_Engine_Name

What do you get? For both Bing and Google I get top-level results from the search engine itself.  Using Blekko I don’t get any hits from Blekko??? Hmmm.  If you are not the expert on yourself what does that tell us?  Is the site already gamed? Is this a Whoops!? Something else?

Blekko & Search BOR (bill of rights)

Web search bill of rights:

  1. Search shall be open
  2. Search results shall involve people
  3. Ranking data shall not be kept secret
  4. Web data shall be readily available
  5. There is no one-size-fits-all for search
  6. Advanced search shall be accessible
  7. Search engine tools shall be open to all
  8. Search & community go hand-in-hand
  9. Spam does not belong in search results
  10. Privacy of searchers shall not be violated

What a great list! Will Blekko deliver?  During my testing I did find some results that I would label as SPAM.  Try this:

  1. search for your name (or a name that should be closely aligned with keywords) along with keywords that are relevant to your name
  2. any SPAM in the results?

I use the above technique to quickly locate what appear to be SPAM sites.  When appropriate I report such sites to search engines.  However, the problem appears to be that the search engines themselves do not adequately filter such combinations – I suggest that they implement such a filter to more quickly lower the placement of content that seems to be related to search queries but is most likely transient and/or less relevant.  I have found many index sites, or sites that are large and present indexes, in what I would call SPAM search results.  If a spider encounters a dynamic home page then the results from that page should reflect current or at least recent content – I have found many such search result pages which, when visited, no longer presented any content relevant to the search query… Could be that I am too ‘picky’… 🙂

How might the other search engines employ group think (PMM?)  What about a simple ‘not relevant’ button next to search results?  If a searcher clicks the ‘not relevant’ button (NRB, NR Button) then the engine re-examines the results and removes similar ‘not relevant’ matches.  What would that do to the search process?  I see this not so much as a way to flag SPAM but as an additional (missing!) option for refining search results.  No need for a login.  No need to exclude the ‘not relevant’ results for the next searcher (unless, of course, multiple searchers also mark the result as ‘not relevant’…)  In the end, quality should be what survives (at least that’s what I am hoping for.)  [Hmm – how long before we see any search options for my NRB search results approach?]
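
To make the NRB idea a little more concrete, here is a minimal Python sketch of the bookkeeping it would need.  Everything in it is an assumption for illustration: the class name, the threshold of three searchers, and the use of an anonymous session token in place of a login.  It also only hides the exact results that were marked; finding and removing ‘similar’ matches is left out.

```python
from collections import defaultdict


class NotRelevantVotes:
    """Track 'not relevant' clicks per (query, result URL) pair."""

    def __init__(self, threshold=3):
        # Number of distinct searchers needed before a result is hidden
        # from everyone (an arbitrary value chosen for this sketch).
        self.threshold = threshold
        self._votes = defaultdict(set)  # (query, url) -> set of session tokens

    def mark_not_relevant(self, query, url, session):
        """Record one click; repeat clicks from a session count once."""
        self._votes[(query.lower(), url)].add(session)

    def filter_results(self, query, urls, session=None):
        """Drop results this session marked, or that enough sessions marked."""
        kept = []
        for url in urls:
            voters = self._votes.get((query.lower(), url), set())
            if session in voters:
                continue  # hidden for the searcher who clicked the button
            if len(voters) >= self.threshold:
                continue  # hidden for everyone once enough searchers agree
            kept.append(url)
        return kept


# Hypothetical usage with made-up URLs and session tokens.
votes = NotRelevantVotes(threshold=2)
results = ["http://example.com/farm", "http://example.org/howto"]
votes.mark_not_relevant("blocking web spam", results[0], session="s1")
print(votes.filter_results("blocking web spam", results, session="s1"))
print(votes.filter_results("blocking web spam", results, session="s2"))
```

Counting distinct sessions rather than raw clicks is a small guard against one searcher (or one bot) burying a result alone, though a determined bot could still fake many sessions.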

I will continue to explore Blekko and encourage other folks to do the same.  Note that in order to get ‘full value’ you need to sign up for an account with Blekko – I’m on the fence at this point; I don’t need yet-another-account-on-yet-another web site…   Again, my 2 cents and expect your mileage to vary.  🙂

