Very Big Decisions
Story by Mark Whitehorn, 23-10-2008
Greenplum is a provider of data warehousing and BI solutions for large data sets. The company has made its name by providing a system that allows complex, aggregation-type queries to run very rapidly against relational database structures. Greenplum essentially allows you to plug servers together and build up a large parallel processing system. It has been in the news recently because it has introduced support for MapReduce.
“MapReduce?” you say. “Right… something to do with those mashup thingies you were on about last month?” MapReduce is a ridiculous name that does little except wave a large red herring. The technology has little to do with “reduction” in the normal sense of the word and absolutely nothing to do with geographical mapping.
It is, in fact, a parallel processing technique for very large data sets (in the multiple petabyte range) on clusters of computers. Given Greenplum’s background, the support for MapReduce begins to make sense.
The use of MapReduce has been pioneered by Google to meet that company’s requirement for high-speed analysis of massive quantities of data. Its name comes from the two functions in the field of functional/mathematical computing upon which it relies: Map and Reduce.
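To make the naming concrete, here is a minimal Python sketch of the two functional primitives themselves. It is not Google's or Greenplum's implementation; the sample values and the helper names `square` and `add` are assumptions chosen purely for illustration.

```python
from functools import reduce

def square(x):
    """Transform a single element; map applies this to each item independently."""
    return x * x

def add(a, b):
    """Combine two values into one; reduce folds a sequence down with this."""
    return a + b

values = [1, 2, 3, 4]                 # hypothetical sample data

# Map: apply the same function to every element, one at a time.
squares = list(map(square, values))   # [1, 4, 9, 16]

# Reduce: collapse the mapped results into a single summary value.
total = reduce(add, squares, 0)       # 30
```

Because each call to the mapped function is independent of the others, the map step is the part that lends itself to being spread across many machines.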
It’s a specialist form of processing that performs such tasks as counting the number of times each different word makes an appearance in a set of documents. Google used it to regenerate the index of World Wide Web content. (In this context I feel the Web deserves its full name with capitals to emphasise the massive scale of this undertaking.) MapReduce can also be valuable for data mining, indexing, searching and customer relationship management applications. Google says that it runs more than a thousand MapReduce jobs every day.
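The word-counting task can be sketched in a few lines of Python. This is a single-machine illustration of the general map, group and reduce pattern, not the code Google or Greenplum runs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the two sample documents are assumptions. In a real cluster the map calls would run in parallel on different nodes and the grouping step would move data between them.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted values by key so each word's counts sit together."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Collapse all the counts for one word into a single total."""
    return word, sum(counts)

def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    pairs = []
    for doc in documents:          # on a cluster, each document could go to a different node
        pairs.extend(map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

# Hypothetical sample input.
documents = ["the web is big", "the index of the web"]
counts = word_count(documents)
```

With the sample input above, `counts["the"]` is 3 and `counts["web"]` is 2. Splitting on whitespace is a deliberate simplification; punctuation handling is left out to keep the three phases visible.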
Incidentally, MapReduce is not the panacea for all high-volume database applications: it has specialised abilities which are highly effective in certain situations. It has been written about in scathing terms by Michael Stonebraker, who has been much quoted for saying: “As a data processing paradigm, MapReduce represents a giant step backwards.” (Stonebraker is responsible for the open-source relational database engine, Postgres, the engine upon which, ironically, Greenplum’s database is based.) Among other points he states that MapReduce lacks much of what is deemed to be good practice for data handling, citing no separation between application and data, no high-level data access language and incompatibility with database management system tools.
The big job
Greenplum has added MapReduce support to its eponymous database engine. As I said above, Greenplum is no stranger to parallel processing – its engine uses a massively parallel processing architecture – nor to large data volumes, with multiple petabytes being readily digested, so the move seems a natural, if bold, step.
This new means of handling data has been added to the core Greenplum database engine and will let programmers delve into data with MapReduce, while DBAs can use SQL to query the same data (Figure 1). There seems to be a tacit acceptance that MapReduce is likely to be used by “programmers”: Greenplum’s web site states “Greenplum MapReduce enables programmers to run analytics [my italics] against petabyte-scale data sets stored in and outside of the Greenplum database.” It sounds as if there is more than a little truth in Stonebraker’s argument that this is a tool with few refinements.
Figure 1: Greenplum’s parallel dataflow engine
This all may sound as if I’m against MapReduce, but I’m not (apart from the name). The great point in its favour is that it works. Forget the computer science theory for a moment; this is a technology that actually delivers. Who could possibly question Google’s ability to classify large amounts of data?
And Greenplum isn’t alone in its adoption of MapReduce: Aster Data Systems has done the same thing and announced it in the same week. Aster’s nCluster is a massively parallel processing (MPP) relational database system that runs on a large cluster of commodity hardware nodes and is aimed at companies wishing to analyse the data in a warehouse. The social networking web site, MySpace, has deployed nCluster on more than 100 nodes, and the system is capable of loading millions of rows per second.
Over the past few years, the interest in business intelligence has grown rapidly and one reason is that the quality of information we are able to extract from data has improved. Analysis of data drives the commercial world and government thinking, and reaches all of us in whatever daily dose of general news we take. The sets of data that are available for analysis are becoming ever larger and the capability is arriving to analyse and mine them: MapReduce is just one of the developing technologies that make it possible. For more information see www.greenplum.com and www.asterdata.com.
Unripe Greenplum
In June it was reported that Greenplum was planning to open a European office in the UK with an eye towards raising its profile and, of course, market share. At the time of writing, no information could be gleaned about this plan.
Greenplum apparently wants to set up shop in the UK in order to challenge the dominance of Teradata and Netezza. (Teradata has a London office and Netezza opened one in Bracknell in 2004.) Greenplum sees an advantage in its use of off-the-shelf hardware, while Teradata and Netezza systems are both based on proprietary hardware. A presence in the UK would increase the choice for potential buyers who prefer to deal with a local office.
Parallel partnership
In early September, Tableau announced it now offers support for Netezza’s data warehouse applications. We’re all collecting ever larger volumes of data and need ways of handling it in a timely manner, and we also need ways of making sense of all that data. The Tableau-Netezza combo sounds like a good match.
Netezza’s strong suit is the power of its data warehouse applications: its proprietary hardware is optimised for data warehousing and designed to perform complex queries against large data sets. The use of MPP technology ensures rapid query processing.
Tableau is renowned for its prowess in data visualisation for large data sets and publishing the results to the Web. A wide range of graphical displays is available and can be combined into dashboards and other displays. The power of graphics to assist us in our understanding of the meaning of data has long been acknowledged and Tableau is at the forefront. Tableau 4 was launched in early August and includes greatly enhanced mapping capabilities: working with geographical data is easier and users can produce analytical maps quickly, including animated maps to show change over time.
Users of Netezza’s appliances will now be able to use Tableau’s visual analytic capabilities as their front end. I’m inclined to agree with Matt Rollender, director of strategic and technology alliances at Netezza, who says: “This technology partnership is a great example of how two companies can come together and deliver true value to the end user.” To find out more see www.tableausoftware.com and www.netezza.com.
Why buy?
Microsoft has bought DATAllegro, the provider of MPP data warehouse appliances. On the face of it, it seems an unlikely purchase. Not only does Microsoft have a very comprehensive BI offering, but the strategy the company has followed thus far – conventional data warehouse with OLAP cubes in data marts – does not appear to sit well with DATAllegro’s position as a supplier of data warehouse appliances.
On the other hand, it’s clear that Microsoft didn’t make its purchase on a whim. For the moment Microsoft is remaining tight-lipped on the subject, but it transpires that the company has taken the DATAllegro product off the market (although it is supporting existing customers).
I spoke to David Hobbs-Mallyon, product manager for SQL Server in the UK, and he said that it was likely that a significant announcement would be made at Microsoft’s BI conference, to be held in Seattle at the beginning of October (which will have just finished by the time you read this). He did point out that Microsoft currently has an excellent offering that will scale to tens of terabytes and that DATAllegro’s product will scale to hundreds of terabytes. While very few customers require elbow room in the 100TB+ range, the knowledge that scaling up is perfectly feasible should the necessity arise can only enhance the confidence of any potential (or existing) customers in Microsoft’s solution.
Stay tuned for more news after that BI conference.