kottke.org posts about search
After I heard Microsoft’s announcement of yet another iteration of their search engine (named Bing), I went to look at the stats for kottke.org for the past month to see how many visitors each search engine sent to the site. I couldn’t believe how dominant Google was.
Google | 262,946 | 93.8%
MS Live | 4,307 | 1.5%
Yahoo | 4,036 | 1.4%
MSN | 2,796 | 1.0%
It’s a small sample and doesn’t match up with Comscore’s numbers (Google: 64.2%, Yahoo: 20.4%, MS: 8.2%), but wow. As a comparison, the numbers for a year ago for kottke.org had Google at 91%, Yahoo at 4.9%, and Live at 0.7%.
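A quick back-of-the-envelope check on the table above: the four rows add up to only 97.7%, so some referrals must have come from engines that aren’t listed. The sketch below divides Google’s count by its share to get an implied total. That total is an inference from rounded percentages, not a number reported in the stats.

```python
# Visitor counts copied from the table above.
referrals = {"Google": 262_946, "MS Live": 4_307, "Yahoo": 4_036, "MSN": 2_796}

# Assumption: the total is inferred from Google's rounded 93.8% share,
# so it is only approximate.
implied_total = referrals["Google"] / 0.938


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(count, implied_total) for name, count in referrals.items()}

# Whatever is left over came from search engines not shown in the table.
unlisted = implied_total - sum(referrals.values())

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name}: {pct}%")
    print(f"implied total: about {implied_total:,.0f}")
    print(f"unlisted engines: about {unlisted:,.0f}")
```

The recomputed shares line up with the table, which suggests the percentages were taken against all search referrals rather than just these four engines.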
At some event called the Churchill Club Top Tech Trends, VC Steve Jurvetson had an interesting idea about the future direction of search.
He said the aggregate power of distributed human activity will trump centralized control. His main point was that Google, and other search engines that analyze the Web and links, are much less useful than a (theoretical) search engine that knows not what people have linked to (as Google does), but rather what pages are open on people’s browsers at the moment that people are searching. “All the problems of search would be solved if search relevance was ranked by what browsers were displaying,” he said.
I like that idea a lot, but it got me thinking: how many instances of Firefox can you run on a cheapo Linux box, how many tabs could you have open in each of those browsers, and would that be more or less cost-effective than the search term gaming that currently happens? In other words, good luck with that!
If you’re skeptical of WolframAlpha (as I was), you should watch this introduction by Stephen Wolfram. The comparison to Google (usually “is WolframAlpha a Google killer?”) is not a good one but the new service could learn a little something from the reigning champion: hide the math. One of the geniuses of Google is that it took simple input and gave simple output with a whole lot of complexity in between that no one saw and few people cared about. Plus the underlying premise of the complex computation was simplified, branded (PageRank!), and became a value proposition for Google: here’s what the web itself thinks is important about your query.
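To make the “what the web itself thinks is important” idea concrete, here is a toy version of the published PageRank idea. It is a sketch only: the four-page link graph, the 0.85 damping factor, and the iteration count are illustrative assumptions, and this is certainly not Google’s production code.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                portion = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += portion
            else:
                # A page with no outgoing links spreads its score over everyone.
                portion = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += portion
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)

if __name__ == "__main__":
    for page, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(page, round(score, 3))
```

The page that everyone links to ends up on top, and the page it links to comes second, with no one having to look at the math.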
Here’s a small and nerdy measure of the huge change in the executive branch of the US government today. Here’s the robots.txt file from whitehouse.gov yesterday:
User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /omb/search
Disallow: /omb/query.html
Disallow: /expectmore/search
Disallow: /expectmore/query.html
Disallow: /results/search
Disallow: /results/query.html
Disallow: /earmarks/search
Disallow: /earmarks/query.html
Disallow: /help
Disallow: /360pics/text
Disallow: /911/911day/text
Disallow: /911/heroes/text
And it goes on like that for almost 2400 lines! Here’s the new Obamafied robots.txt file:
User-agent: *
Disallow: /includes/
That’s it! BTW, the robots.txt file tells search engines what to include and not include in their indexes. (thx, ian)
Update: Nearly four months later, the White House’s robots.txt file is still short…only four lines.
User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
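For the curious, here is a minimal sketch of how a well-behaved crawler might read the four-line file quoted in the update, using Python’s standard urllib.robotparser module. The paths being checked are made up for illustration, and real search engines may interpret the rules with their own extensions.

```python
from urllib.robotparser import RobotFileParser

# The four-line robots.txt quoted in the update above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /includes/
Disallow: /search/
Disallow: /omb/search/
"""


def allowed(path, agent="*"):
    """Return True if the given crawler may fetch the path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules.
    for path in ["/", "/blog/", "/search/results", "/includes/header.html"]:
        print(path, "allowed" if allowed(path) else "blocked")
```

Each Disallow line is a path prefix, so anything not starting with one of those three prefixes is fair game for indexing.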
TinEye is an image search engine. You give it an image and it’ll find it on the web for you. If it works (I didn’t get to try it too much because it was down), this is great for chasing down attribution and finding other pix by the same photographer and such. (via master kalina)
Google Book Search has added a few magazines to their repertoire.
Today, we’re announcing an initiative to help bring more magazine archives and current magazines online, partnering with publishers to begin digitizing millions of articles from titles as diverse as New York Magazine, Popular Mechanics, and Ebony.
At least I think it’s a few magazines…it might be thousands but there’s no way (that I can find) to view a list of magazines on offer.
Update: Spellbound and Thomas Gruber have lists of some of the magazines on offer.
StateStats is hours of fun. It tracks the popularity of Google searches per state and then correlates the results to a variety of metrics. For instance:
Mittens - big in Vermont, Maine, and Minnesota, moderate positive correlation with life expectancy, and moderate negative correlation with violent crime. (Difficult to commit crimes while wearing mittens?)
Nascar - popular in North and South Carolina, strong positive correlation with obesity, and moderate negative correlation with same sex couples and income.
Sushi - big in NY and CA, moderate positive correlation with votes for Obama, and moderate negative correlation with votes for Bush.
Gun - moderate positive correlation with suicide and moderate negative correlation with votes for Obama. (Obama is gonna take away your guns but, hey, you’ll live.)
Calender (misspelled) - moderate positive correlation with illiteracy and rainfall and moderate negative correlation with suicide.
Diet - moderate positive correlation with obesity and infant mortality and moderate negative correlation with high school graduation rates.
Kottke - popular in WI and MN, moderate positive correlation with votes for Obama, and moderate negative correlation with votes for Bush.
Cuisine - This was my best attempt at a word with strong correlations but wasn’t overly clustered in an obvious way (e.g. blue/red states, urban/rural, etc.). Strong positive correlation with same sex couples and votes for Obama and strong negative correlation with energy consumption and votes for Bush.
I could do this all day. A note on the site about correlation vs. causality:
Be careful drawing conclusions from this data. For example, the fact that walmart shows a moderate correlation with “Obesity” does not imply that people who search for “walmart” are obese! It only means that states with a high obesity rate tend to have a high rate of users searching for walmart, and vice versa. You should not infer causality from this tool: In the walmart example, the high correlation is driven partly by the fact that both obesity and Walmart stores are prevalent in the southeastern U.S., and these two facts may have independent explanations.
Can you find any searches that show some interesting results? Strong correlations are not that easy to find (although foie gras is a good one). (thx, ben)
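I don’t know exactly how StateStats computes its correlations, but a plain Pearson correlation coefficient across states is one plausible way to do it. Here is a sketch under that assumption. The per-state numbers are invented for illustration and are not real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical values for five imaginary states (not real data):
# relative search popularity for some term, and some per-state metric.
search_popularity = [10, 20, 30, 40, 50]
state_metric = [3, 5, 4, 8, 9]

if __name__ == "__main__":
    r = pearson(search_popularity, state_metric)
    print(f"r = {r:.2f}")
```

A value near +1 or -1 would count as a strong correlation and a value near 0 as none. As the site’s note says, none of it tells you anything about causality.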
Rogers Cadenhead has beaten me to the punch in calculating the winner of the Dave Winer/Martin Nisenholtz Long Bet pitting the NY Times vs. blogs to see who ranks higher in end of the year search results for the 5 most important news stories of 2007. The winner? Wikipedia.
The Times has really improved their position in Google since 2005…opening up their archives helped, I bet.
There are indications that Google is changing their PageRank algorithm, possibly to penalize sites running paid links or too many cross-promotional links across blog networks. Affected sites include Engadget, Forbes, and Washington Post. Even Boing Boing, which I think had been at 9, is down to 7. You can check a site’s PR here.
Depending on the site, 30-40% of a site’s total traffic can come from search engines, much of that from Google. It will be interesting to see how much of an impact the PR drop will have on their traffic and revenue. (thx, my moon my mann)
Update: Just got the following from the editor of a site that got its PR bumped down. He says:
Two weeks ago I lost 80% of my search traffic due to, I believe, using ads from Text-Link-Ads, which does not permit the “nofollow” attribute on link ads. That meant an overall drop of more than 44% of my total traffic. It also meant a 65%-95% drop in Google AdSense earnings per day and a loss of PageRank from 7 to 6.
He has removed the text links from his site and is negotiating with Google for reinstatement but estimates a loss in revenue of $10,000 for the year due to this change. And this is for a relatively small site…the Engadget folks must be freaking out.
Speaking of cool Etsy shops, elastiCo is selling pillows and tshirts with the most popular Google News search terms printed on them.
A rerun, because it came up at dinner the other night: EPIC 2014, the recent history of technology and the media as told from the vantage point of 7 years in the future. “2008 sees the alliance that will challenge Microsoft’s ambitions. Google and Amazon join forces to form Googlezon. Google supplies the Google Grid and unparalleled search technology. Amazon supplies the social recommendation engine and its huge commercial infrastructure.”
An interesting somewhat-inside look at Google’s search technology. I found this interesting: “When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds.” No matter how hard CNN or Digg or Twitter works to harness their audience to break news, hooking up Google search queries to Google News in a useful manner would likely scoop them all every time.
Google is the crossword puzzler’s best friend. Several of the top 100 searches on a given day are for crossword clues. This was more apparent a few days ago but it looks like they’ve started to filter the crossword terms out. More here. (thx, peggy & jonah)
Since swearing off Technorati a couple of years ago, I’ve been checking back every few months to see if the situation has improved. The site is definitely more responsive but their data problems seemingly remain, at least with regard to kottke.org; Google Blog Search gives consistently better results and easy access to RSS feeds of searches.
Technorati recently introduced something called the Technorati Authority number, which is a fancy name for the number of blogs linking to a site in the last six months. Curious as to where kottke.org fell on the authority scale, I checked out the top 100 blogs list. Not there, so I proceeded to the “Everything in the known universe about kottke.org” page where a portion of that huge cache of kottke.org knowledge was the authority number: 5,094. Looking at the top 100 list, that should put the site at #47, nestled between The Superficial and fishki.net, but it’s not there. Technorati also currently states that kottke.org hasn’t been updated in the last day, despite several updates since then and my copy of MT pinging Technorati after each update.
Maybe kottke.org has been intentionally excluded because I’ve been so hard on them in the past. Or maybe it’s just a glitch (or two) in their system. Or maybe it’s an indication of larger problems with their service. Either way, as the company is attempting to offer an authentic picture of the blogosphere, this doesn’t seem like the type of rigor and accuracy that should send reputable media sources like the BBC, Washington Post, NY Times, and the Wall Street Journal scurrying to their door looking for reliable data about blogs.
Update: As of 3:45pm EST, the top 100 list has been updated to include kottke.org. The site also picked up this post right away, but failed to note a subsequent post published a few minutes later.
Google buys Doubleclick for $3.1 billion. My assertion more than four years ago that Google is not a search engine isn’t looking too shabby.
The NY Times today:
On Thursday, Google, the Internet search giant, will unveil a package of communications and productivity software aimed at businesses, which overwhelmingly rely on Microsoft products for those functions.
The package, called Google Apps, combines two sets of previously available software bundles. One included programs for e-mail, instant messaging, calendars and Web page creation; the other, called Docs and Spreadsheets, included programs to read and edit documents created with Microsoft Word and Excel, the mainstays of Microsoft Office, an $11 billion annual franchise.
kottke.org from April 2004:
Google isn’t worried about Yahoo! or Microsoft’s search efforts…although the media’s focus on that is probably to their advantage. Their real target is Windows. Who needs Windows when anyone can have free unlimited access to the world’s fastest computer running the smartest operating system? Mobile devices don’t need big, bloated OSes…they’ll be perfect platforms for accessing the GooOS. Using Gnome and Linux as a starting point, Google should design an OS for desktop computers that’s modified to use the GooOS and sell it right alongside Windows ($200) at CompUSA for $10/apiece (available free online of course). Google Office (Goffice?) will be built in, with all your data stored locally, backed up remotely, and available to whomever it needs to be (SubEthaEdit-style collaboration on Word/Excel/PowerPoint-esque documents is only the beginning). Email, shopping, games, music, news, personal publishing, etc.; all the stuff that people use their computers for, it’s all there.
When you swing a hammer in the vicinity of so many nails, you’re bound to hit one on the head every once in a while. Well, I got it in the general area of the nail, anyway.
Jeffrey Toobin, the New Yorker’s legal writer, has penned a piece about Google’s book scanning efforts and the legal challenges it faces. Interestingly, both Google and the publishers who are suing them say that the lawsuit is basically a business negotiation tactic. However, according to Larry Lessig, settling the lawsuit might not be the best thing for anyone outside the lawsuit: “Google wants to be able to get this done, and get permission to resume scanning copyrighted material at all the libraries. For the publishers, if Google gives them anything at all, it creates a practical precedent, if not a legal precedent, that no one has the right to scan this material without their consent. That’s a win for them. The problem is that even though a settlement would be good for Google and good for the publishers, it would be bad for everyone else.”
How to disable the stupid Snap Preview things that are popping up on everyone’s site these days. (via df)
All links on Wikipedia now automatically use the “nofollow” attribute, which means that when Google crawls the site, none of the links it comes across get any PageRank from appearing on Wikipedia. SEO contest concerns aside, this also has the effect of consolidating Wikipedia’s power. Now it gets all the Google juice and doesn’t pass any of it along to the sources from which it gets information. Links are currency on the web and Wikipedia just stopped paying it forward, so to speak.
It’s also unclear how effective nofollow is in curbing spam. It’s too hard for spammers to filter out which sites use nofollow and which do not and much easier & cheaper just to spam everyone and everywhere. Plus there’s a not-insignificant echo effect of links in Wikipedia articles getting posted elsewhere so the effort is still worth it for spammers.
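For anyone who hasn’t looked at the markup, nofollow is just a token in a link’s rel attribute. Here is a rough sketch of how a crawler might separate links that pass credit from links that don’t, using Python’s standard html.parser. The sample HTML and example.com URLs are made up, and how any real search engine treats the attribute is up to that engine.

```python
from html.parser import HTMLParser


class LinkSorter(HTMLParser):
    """Collect link targets, split by whether rel contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel is a space-separated list of tokens, e.g. "external nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def sort_links(html):
    """Return (followed, nofollow) lists of hrefs found in the HTML."""
    sorter = LinkSorter()
    sorter.feed(html)
    sorter.close()
    return sorter.followed, sorter.nofollow


# Hypothetical page fragment.
SAMPLE = (
    '<p><a href="http://example.com/source">a cited source</a> and '
    '<a rel="external nofollow" href="http://example.com/other">another</a></p>'
)

if __name__ == "__main__":
    followed, nofollow = sort_links(SAMPLE)
    print("passes credit:", followed)
    print("nofollow:", nofollow)
```

A spammer would have to run a check like this against every target site to know whether a link was worth placing, which is part of why spamming everywhere is the cheaper option.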
Via Tim O’Reilly comes this comment from Bill Burnham:
A couple of months ago I had the pleasure of moderating a panel at TIECon on the Search Industry. Peter Norvig, Google’s Director of Research, made one comment in particular that stood out in my mind at the time. In response to a question about the prospects for the myriad of search start-ups looking for funding Peter basically said, and I am paraphrasing somewhat, that search start-ups, in the vein of Google, Yahoo, Ask, etc. are dead. Not because search isn’t a great place to be or because they can’t create innovative technologies, but because the investment required to build and operate an Internet-scale, high performance crawling, indexing, and query serving farm was now so great that only the largest Internet companies had a chance of competing.
For Norvig to say what he did seems a little crazy, given the company he works for. The first time that search died was back in 1998. Yahoo, Altavista, Hotbot, Webcrawler, and other sites had the search game all sewn up. They were all about the same in terms of quality and people found what they were looking for much of the time. No one needed another search engine, and starting a search company in such a mature market seemed like folly. Around that time, Google became a company and eventually the world figured out it really did need another search engine.
O students! Pray teachers! Behold: a Shakespeare search engine.
Simply Google, a one-pager for navigating and searching all of Google’s offerings.
I’ve been keeping track of words which return a link to a dictionary definition of the word in Google. Dictionary words are those that are written but not written about, haven’t been subject to the corporate/band/blog word grab, or aren’t otherwise popular words.
germane
paucity
reticent
cantankerous
suppositious
abstruse
whinge
assiduous
surreptitious
proclivity
disparaging
sporadically
hypertrophied
pallor
acerbic
surfeit
Many of the Dictionary.com Words of the Day are probably dictionary words as well.