Author Archives: Ben Bor


About Ben Bor

Ben Bor is a Senior Solutions Architect at Teradata ANZ, a specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben can count some of the largest international organisations amongst his clients, including the UK tax office, Shell, Exxon, Credit Suisse, QBE, Woolworths, Westpac and others. Ben is an international presenter on Information Management (IM) topics, having presented in Europe, Asia, the USA, Canada, NZ and Australia on IM topics ranging from performance through data warehousing and data quality to social media analysis and Big Data. Ben has over 30 years' experience in the IT industry (he wrote his first program in 1969, using punched cards). Prior to joining Teradata, Ben worked for international consultancies for about 15 years (including Capgemini, HP and Logica) and for international banks before that.

Cyber Security and the Art of Data Analytics

Friday June 10th, 2016

Who remembers the 1983 movie starring Matthew Broderick, "WarGames"? It was one of the first movies to pique my interest in computer security. In those days we used acoustic couplers to dial into a remote computer, and the fact that we could actually communicate with a remote machine was fascinating to me. Before long it was dial-up modems, attached via a serial cable to my PC, which meant I could literally dial anywhere I wanted. And yes, such power at my fingertips meant I did perform some ethical hacking, partly as a way of learning the inner workings of computer security and partly because I thought I was Matthew Broderick too!

Fast forward 30 years and I am still involved in computer security, albeit from a different angle: first, through my continuing research on computer security topics as part of my PhD program; and second, through the application of data analytics to detect and counter threats.

But let's take a step back to understand today's cyber-security threat and what it means to you. The proliferation of devices connected to the internet over the past 20 years has given rise to huge volumes of information that are accessible from one device to another. It is a very simple concept: a device connects to another to exchange bits of data across a communications link. In today's connected world, however, it is the availability of that data which attracts cyber criminals.

 Protect your data, but also understand the data protection policies of your trading partners.

Last month we heard from the US Department of Justice about charges against several Chinese nationals identified as stealing trade secrets from US companies and feeding them back to Chinese corporations. This is not your backyard group of ragtag coders, though; it is a sophisticated, state-backed group using techniques developed in-house. And their targets are not military missile silos like the ones depicted in WarGames; they are corporate organisations.

Their targets are patent designs and any other corporate information that can be used as an advantage, and they don't discriminate on organisation size either. I recently spoke to the CEO of a funds-management organisation based in Canberra that specialises in rural properties. I asked him what his organisation was doing to protect its corporate secrets, and his response was sobering: his view was that they weren't a target. "What do we have that would be of interest to them?" After I pointed out the value of any form of data to foreign organisations, he got the picture.

A survey by the Ponemon Institute on cyber-attacks highlighted the state of cyber readiness. In the report I note the following finding:

Less than half agreed that their organisation is vigilant in detecting attacks, and slightly fewer agreed that they were effective in preventing them. I thoroughly recommend reading the report, as it offers some fascinating insights into the state of the art in cyber-attack prevention. Download the report here.

And attacks may not come directly at your organisation either. On a local note here in Canberra, we saw the building design plans for the new ASIO headquarters accessed not through an attack on ASIO itself but via a third-party contractor. Access, then, comes in many forms, shapes and sizes. Protect your data, but also understand the data protection policies of your trading partners.

So, given the context of cyber-attacks on our society, what role does data analytics play in defending against them? The obvious answer lies in the vast amounts of information we have at our fingertips, and in analysing that data to figure out what is happening. A data-analytics system built to combat cyber-attacks should meet a number of key requirements; I have outlined a few below:

Speed – Obviously, the quicker we can analyse the data, the quicker we can detect a threat and put counter-measures in place. Traditionally, though, data analytics has taken a historical view of the data: it was acceptable to send the data off somewhere to be processed and get a result back a few hours later. That is not how cyber-security data can be handled. We must now collect, analyse and act within a fraction of a second; any longer and the attack can be deemed successful. To do this we have to design environments that collect data instantly and process it "in-flight", which means analytical functions have to be performed at the point of capture, in real time.
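To make the "in-flight" idea concrete, here is a minimal sketch in plain Python of scoring events as they arrive rather than in batch. The event fields, window size and threshold are invented for illustration; this is a sketch of the pattern, not a description of any particular product.

```python
from collections import defaultdict, deque
import time

WINDOW_SECONDS = 10      # assumed sliding window
ALERT_THRESHOLD = 100    # assumed maximum events per source within the window

recent = defaultdict(deque)  # source -> timestamps of its recent events

def score_event(event):
    """Score one event the moment it arrives; True means it looks suspicious."""
    ts, src = event["ts"], event["src"]        # assumed event layout
    window = recent[src]
    window.append(ts)
    # Drop timestamps that have fallen out of the sliding window.
    while window and ts - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > ALERT_THRESHOLD

def handle_stream(events):
    """Consume events one by one and act immediately, not hours later."""
    for event in events:
        if score_event(event):
            print(f"ALERT: burst of traffic from {event['src']}")

if __name__ == "__main__":
    # A synthetic burst from a single source trips the alert almost instantly.
    now = time.time()
    burst = [{"ts": now + i * 0.01, "src": "10.0.0.5"} for i in range(150)]
    handle_stream(burst)
```

In a real deployment the same scoring function would sit directly on the capture point (a firewall log stream, a packet tap), which is the "analyse at the point of capture" requirement above.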

Volume – Imagine having to walk around your house constantly, monitoring every fence line to stop burglars coming over; as soon as you turn your back, one slips over in an instant. The same applies to the volume of data we need to keep watch over. Analytics plays a role in analysing web logs, firewall logs, change logs, application logs, packet information and user activity, all in one place. Organisations need to centralise security information so it can be analysed as a single entity, not in isolation. Miss one bit of information and, sure enough, the attack will come through that crack.

Convert to an intelligence-driven security model – Just as the hackers out there evolve quickly, so too must our security models. As organisations, we are far too slow and rigid in our security approaches to adapt to the multiple threats that appear every day. We must therefore move towards an intelligence-driven security model, in which security-related information from both internal and external sources is combined to deliver a comprehensive picture of risk and vulnerability. Current security models rely too heavily on detecting what is already known and protecting the enterprise against those threats; an intelligence-driven model helps us detect the unknowns and predict where the attacks are going to come from, so we can strengthen our defences accordingly. Predictive analytics certainly has a role to play in this space, and Teradata leads the way with our Aster platform.
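As a generic illustration of "detecting the unknowns" (not the Aster implementation itself), here is a short sketch using scikit-learn's IsolationForest to flag sessions that do not look like the normal baseline. The features and numbers are made up for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "normal" sessions: [MB sent out, failed logins, distinct destinations]
normal = np.column_stack([
    rng.normal(5, 2, 1000),    # a few megabytes out per session
    rng.poisson(0.2, 1000),    # the occasional failed login
    rng.poisson(3, 1000),      # a handful of destinations
])

# Two sessions that resemble exfiltration and credential guessing.
suspicious = np.array([
    [400.0, 0, 50],   # huge outbound transfer to many destinations
    [2.0, 30, 1],     # many failed logins against a single host
])

# Learn what "normal" looks like; nothing about the attacks is hard-coded.
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

print(model.predict(suspicious))   # -1 marks outliers: expected [-1 -1]
print(model.predict(normal[:5]))   # mostly 1s (inliers)
```

The point is the shift in posture: instead of matching known signatures, the model learns the baseline and surfaces whatever deviates from it.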

Know the unknowns and be more effective in protecting your organisation through the use of predictive analytics.

On a final note, I recommend you read a news release from last year that highlighted the next big wave of partnerships in combating cyber-attacks: Teradata has formed a partnership with Novetta to develop next-generation cyber-security solutions. Combining proven Teradata technology with Novetta's advanced cyber-security solutions is a no-brainer, especially when you consider that if the US military can trust Novetta for its cyber-security needs, then surely you can too!

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

Schrödinger’s Cat and Big Data

Monday June 6th, 2016

Schrödinger’s Cat is a thought experiment developed by Erwin Schrödinger (1887-1961) to illustrate that micro-scale quantum effects can be made to produce real (and quite bizarre) effects in the real world.

Being Slightly Dead


Figure 1 – Schrödinger's cat, thankfully alive

In this case, Schrödinger uses superposition on the atomic scale to affect the lifespan of a cat. The cat is placed in a sealed box with a radioactive material and a Geiger counter. The half-life of the radioactive material is known. The Geiger counter ensures that if the material has decayed, a poison is released and the cat dies (don't blame me, I am just the messenger).

After one half-life, the material has a 50% chance of having decayed. But according to quantum mechanics, until we open the box and make the measurement, the material is both in a decayed state and in a non-decayed state: it is in a superposition of the two. The act of opening the box "collapses" the two states into one, either decayed or non-decayed. Therefore, says Schrödinger, until we open the box the cat is in a superposition of dead and alive; in other words, both dead and alive.

The Big Data angle


Figure 2 – Don’t let your Data lake look like this

“What does this have to do with Big Data?”
I hear you ask.

Well, it has a lot to do with Data Lakes.

A Data Lake is a data repository used by an organisation to store data, mainly for future use (the data that are actually used are typically stored in other repositories).

The nature of a Data Lake is that it stores large volumes of data whose present value is not always known. And here comes the quantum aspect: until you have actually used the data, you don't know whether it is valuable or not. See the connection? The data is in a superposition of useful and useless.

If a large portion of your Data Lake is useless, the whole Data Lake becomes useless, as the business loses confidence in the data and stops using it.

How to prepare

The main issue with data in the Data Lake is the reliability of the analytics that use it. This lack of trust is caused by data quality issues and missing metadata. When using a Data Lake, I need to know how recent the data is, what its source is, how accurate it is, and so on.


Figure 3 – Data Reservoir: a well-curated Data Lake

Therefore the only way to ensure that your Data Lake is useful is to apply the same rigor you apply to your (more formal) data repositories:
• Appoint data owners
• Appoint data stewards
• Provide data metrics
• Ensure high quality of data at ingestion time, not at usage time
• Collect, curate and publish metadata (a minimal sketch of capturing such metadata at ingestion time follows this list)
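None of this requires exotic tooling. Below is a minimal sketch of the kind of record you might capture for every file at ingestion time and publish alongside the data; the fields and checks are my assumptions, not a prescribed standard.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class IngestionRecord:
    """Metadata captured once, at ingestion time, and published with the data."""
    path: str
    source_system: str      # who produced the data
    owner: str              # accountable data owner
    steward: str            # day-to-day data steward
    ingested_at: str        # when it landed in the lake
    row_count: int          # a basic data metric
    sha256: str             # integrity check
    quality_checks: dict    # e.g. {"null_customer_ids": 0, "duplicate_keys": 3}

def ingest(path, source_system, owner, steward, quality_checks):
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        rows = sum(1 for _ in f)
    record = IngestionRecord(
        path=path,
        source_system=source_system,
        owner=owner,
        steward=steward,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        row_count=rows,
        sha256=digest,
        quality_checks=quality_checks,
    )
    # Publish the metadata next to the data so consumers can judge fitness for use.
    with open(path + ".meta.json", "w") as f:
        json.dump(asdict(record), f, indent=2)
    return record
```

With something this simple attached to every ingested file, the questions above (how recent, from where, how accurate) have answers before anyone runs a query.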

With this simple approach you end up with a well-curated Data Reservoir, not an under-managed under-used lake.

Download the “Definitive Guide to the Data Lake” for free here.

Is Big Data Getting Smaller? — Part 2

Tuesday March 1st, 2016

When most people hear the term “Big Data” they envisage a data centre full of servers, all happily parallel-processing the world’s most important problems (like the data centre in the picture below: analysing particle collisions at the Large Hadron Collider at CERN).


Above: Data Centre at CERN

Well, the whole Big Data thing started with Google and was quickly adopted by similar companies with high-volume requirements (like Facebook and Yahoo) so no wonder the image in our mind is of ginormous data volumes being crunched by ginormous pools of computers.

But these days the technology behind Big Data is quickly becoming mainstream. Yes, not all the bugs have been ironed-out and it is still quite “clunky” when compared with mature technologies, but the adoption of Big Data technology is increasing even for crunching smaller problems. There are several reasons for this:

  • Much of the software is developed by companies who use it internally before releasing it to the public domain, so by then it is highly functional and well tested. See, for example, Presto: developed originally by Facebook for its own use, its development now continues (by Facebook and Teradata) in the open
  • The cost of the software is close enough to zero (at least in the pilot stage …)
  • Running on a large number of parallel computers, these solutions are highly scalable. It is very easy to start small and grow quickly
  • Finally, let's admit it, the hype around Big Data attracts technologists to try out these 'cool' new software gadgets

I recently worked with a company that needs to make real-time data available both internally and externally. The volumes are not high: thousands of events happen every day. They could buy an off-the-shelf streaming solution for a lot of money, or develop an end-to-end solution based on Spark, Kafka and Hive. What's more, they can speed up development and reduce maintenance costs by using Listener, which wraps Kafka, Cassandra, Elasticsearch and Mesos, deploying real-time streams with very little programming and in a very short time frame.
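For the roll-your-own route, the skeleton of such a pipeline is genuinely small. The sketch below is a rough Spark Structured Streaming job that reads events from Kafka and lands them as Parquet that Hive can query; the broker address, topic name, schema and paths are placeholders, and Listener itself is configured rather than coded, so it is not shown.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Requires the spark-sql-kafka connector package on the Spark classpath.
spark = SparkSession.builder.appName("event-stream").getOrCreate()

# Assumed event payload; adjust to the real schema.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder
       .option("subscribe", "events")                        # placeholder topic
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Land micro-batches as Parquet files that Hive or Presto can query in place.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/events")                     # placeholder path
         .option("checkpointLocation", "/data/checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```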

An arguably extreme example of using Big Data on small data is doing social network analysis on a group of dolphins in Doubtful Sound (one of the fiords) in New Zealand. The analysis shows that the network is scale-free and illuminates other fascinating characteristics of this very small (64 individuals) group.


Above: Dolphins at Doubtful Sound, NZ
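At this scale the analysis itself is something you can reproduce on a laptop. Here is a sketch using NetworkX, assuming the dolphin associations are available as a simple edge-list file (the filename is a placeholder).

```python
import networkx as nx
from collections import Counter

# Each line of the (assumed) file names two dolphins observed together, e.g. "Beak Fish".
G = nx.read_edgelist("dolphin_associations.edgelist")

print(G.number_of_nodes(), "dolphins,", G.number_of_edges(), "associations")

# Degree distribution: a heavy tail is the signature the scale-free claim rests on.
degree_counts = Counter(dict(G.degree()).values())
for degree, count in sorted(degree_counts.items()):
    print(f"degree {degree}: {count} dolphins")

# Which individuals hold the group together? Betweenness highlights the brokers.
central = sorted(nx.betweenness_centrality(G).items(), key=lambda kv: -kv[1])[:5]
print("most central individuals:", central)
```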

Big Data is, therefore, no longer the domain of the Big Players: the technology is quickly getting acceptance and being adopted by medium and small players.

So, what is the smallest project using Big Data technologies you know of?

For a different perspective on this, see my companion article titled "Is Big Data Getting Bigger?".

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

Is Big Data Getting Bigger? — Part 1

Tuesday March 1st, 2016

Although definitions of Big Data typically embrace features other than size (and for no good reason these features always begin with a “V” like Velocity, Variety and Veracity) the name “Big Data” instantly brings to mind a lot of data.

So, how big can Big Data get?

In 2010 the list of the largest databases in the world quoted the World Data Centre for Climate database as the largest, at 220 Terabytes (possibly because of the additional 6 Petabytes of tape they hold, albeit not directly accessible data). By the end of 2014, according to the Centre's web site, the database was close to 4 Petabytes (roughly 2 Petabytes of which are internal data).

Facebook claim upwards of 300 Petabyte of data in their (so called) data warehouse; however, as we all know, there is very little analysis done on these data – mainly due to the fact that much of it is pictures of cats :-).

These sizes are about to be dwarfed by new science projects running now or coming to life soon.

The Large Synoptic Survey Telescope (depicted below) is likely to break many data-volume records.


Above: The Large Synoptic Survey Telescope (source: AstronomyNow.com)

The 8.4-metre telescope (quite small compared with the planned European Extremely Large Telescope, with its 40-metre diameter) will boast a 3.2-gigapixel camera, the largest digital camera on Earth, taking a photo of the sky every 15 seconds.

This generates 30 Terabytes of astronomical data per night.

In its planned 10 years of operation the telescope will generate over 60 Petabytes of raw data, plus a (probably several times larger) amount of analysis data. For comparison, humanity has accumulated circa 300,000 Petabytes of data since time immemorial. This telescope alone will add about 0.1%!
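It is worth making the arithmetic behind that last figure explicit. The short calculation below reproduces it, taking "several times larger" to mean roughly four times the raw volume; that multiplier is my assumption for illustration, not a published figure.

```python
raw_pb = 60              # raw data over 10 years of operation (figure quoted above)
analysis_factor = 4      # "several times larger" -- an assumed multiplier
humanity_pb = 300_000    # accumulated human data, as quoted above

total_pb = raw_pb * (1 + analysis_factor)             # raw plus analysis products
share = total_pb / humanity_pb

print(f"telescope total: {total_pb} PB")              # 300 PB
print(f"share of all accumulated data: {share:.1%}")  # 0.1%
```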

And it gets even bigger.

The Large Hadron Collider at CERN generates about 30 Petabytes per year, the product of 600 million collisions per second generating data in its detectors. (Interestingly, scientists had to sift through these data to find the handful of collisions that produced the Higgs boson. They deservedly won the Nobel Prize for their efforts.)


Above: The Large Hadron Collider (source: HowItWorksDaily.com)

The data is too large for a single data centre, so CERN created the Worldwide LHC Computing Grid, which divides the load between computing centres all over the world.

The Internet of Things promises ubiquitous sensors providing data continuously. Some of the data repositories involved are likely to break even these new records.

So, what is the biggest data set you know of? And what is the biggest single data set you are expecting to be involved in?

For a different perspective on this, stay tuned for my next article, titled "Is Big Data Getting Smaller?".

1. A Petabyte is 1,000 Terabytes, or 1,000,000 Gigabytes, or 10^15 bytes.

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

What to Avoid in Customer Analytics

Tuesday November 17th, 2015

Working recently on enabling customer analytics for a large Australian bank, I was once again reminded of the potential of Big Data and the risks involved.

This bank proudly analyses customer transactions to glean important information about its customers. They plan to use the data, for example, to offer you travel insurance when they see that you bought a flight ticket. But they also plan to offer you access to their lounge in Singapore if the flight tickets were bought from Singapore Airlines.

The big question is: as a customer, would you be happy to receive such an offer? Where does the (very subtle) creepiness line pass? How do we decide what’s acceptable and what’s not? Some examples might help.

Let's start with the creepiest. I know of a US-based company that offers software that analyses web clicks and can predict with high precision, for female surfers, whether they are having their period. The creepiest, you must agree. Not surprisingly, the company doesn't get a lot of clients (no female clients, I assume).

A more famous example is the Target story, where the retailer's analytics predicted a young girl's pregnancy before her family knew. What did Target do when the story came to light? "We found out that as long as a pregnant woman thinks she hasn't been spied on, she'll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don't spook her, it works," Target told Forbes.

On the other side of the continuum there are companies who “spy” on me and make it worth my while. Two examples are Google and Amazon. When I type a search item, Google attempts to complete it for me. It uses what I searched for in the past and what you are searching right now. It spies on all of us and we love it. Why? Because the benefits outweigh the loss of privacy.

We rarely feel that our privacy has been compromised by this and we enjoy the benefits.

Similarly, when I buy a book from Amazon, they tell me that "people who bought this book also bought those books…". Again, they are spying on me and on you, creating a detailed profile of my buying habits and comparing it to your profile. Do I feel that my privacy is breached? Not at all; the recommendations are actually quite good, usually. Does this benefit Amazon? You bet it does: Fortune magazine claims that Amazon gets up to a 60% conversion rate on its recommendations.

A middle-way example is Orbitz. As a result of customer-spending analytics, Orbitz decided to present Mac users with more expensive options than PC users. All users had access to the same offers, but Mac users would have to work harder to see the cheaper options. As an Orbitz user, would you be happy with this? Would you complain? Or would you switch to a PC?

Your company is likely to find itself somewhere between these extremes. Yes, you want to know as much as possible about your customers. But you must ensure that you don’t alienate them by compromising their sense of privacy.

My advice: do your customer analytics, but use the results wisely.

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.


Big Data and Evolution – How Big Data is Changing the World

Monday July 20th, 2015

Having written about Big Data and Time Travel and about Big Data and Philosophy, it’s time to write about Big Data and Evolution.

The subject has already been discussed in Daniel Dennett's excellent article in Scientific American in March this year, "How Digital Transparency Became a Force of Nature".

In the article, Dennett, the Tufts University philosopher and cognitive scientist, and Deb Roy, the Massachusetts Institute of Technology professor and Twitter's chief media scientist, compare developments in digital data to the changing environments of the early Earth. Their main idea is that an emerging trend towards digital transparency will put evolutionary pressure on today's companies.

This, in turn, will cause a whole round of “survival of the fittest”, where only companies that can withstand the pressure of digital transparency survive, in the same way that Evolution caused whole families of species to disappear when they couldn’t cope with environmental changes.

Source: “Walking with Dinosaurs film”

Dennett and Roy write: “The impact on our organisations and institutions will be profound. Governments, armies, churches, universities, banks and companies all evolved to thrive in relatively murky epistemological environment, in which most knowledge was local, secrets were easily kept, and individuals were, if not blind, myopic. When these organisations suddenly find themselves exposed to daylight, they quickly discover that they can no longer rely on old methods; they must respond to the new transparency or go extinct.”

Anyone who followed the various governments’ response to Wikileaks would agree that they are struggling to cope with transparency. Will commercial organisations fare any better?

But this is only one way in which Big Data is changing the world.

Source: http://www.rand.org/

At present Big Data is Big Promise – it hasn't yet found its "killer app". Remember the early days of the Internet? It was obviously a thing of great promise, but uptake was slow. Then came email, and the rest is history.

So what will be Big Data’s killer app?

My money is on Health.

Yes, the Internet of Things (IoT) is the current buzzword. But in my view, it is the potential of Big Data to influence public healthcare that will get us all excited.

More and more people are collecting personal health data, by smartphone app, by smart watch or by special equipment. I recently spoke to a marathon-running university professor who can predict his performance in the next marathon based on the data he collects on his daily training runs.

Collect this data and analyse it and you have a treasure trove of information that can help predict epidemics, correctly plan public budgets, improve access to health and discover hidden correlations.

The public is obviously interested, but wary of losing privacy, which brings us back to Dennett's article. When large corporations evolve to survive in a transparent world, the individual will need to evolve with them. A society that shares its health data is a healthier society.

Let’s hope that common sense prevails over privacy-made-public fears.

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

Is Time Travel Possible Without Big Data?

Sunday April 12th, 2015

Is Time Travel possible? Many scientists, including Stephen Hawking, state that Time Travel is not possible for the simple reason that if it were possible, then we would have already seen all those time-travellers visiting us from the future. They are not here, so time travel is not possible.

Source: http://www.insidescience.org/  

In Science, as Karl Popper writes (The Logic of Scientific Discovery, 1934) “Logically, no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counter example is logically decisive”. In other words, lack of evidence is not a disproof.

But Big Data is changing that.

We are entering an era when so much data is available that failing to prove your claim with it is taken as proof of falsehood (whereas Popper says that having no evidence for a theory does not disprove it, so he would not accept the absence of time travellers as proof that time travel is impossible).

An example I read recently concerned disproving homeopathy. Analysis of all available data on homeopathy results shows no difference between homeopathy and placebo. The headline was "homeopathic treatments have been proven to be completely useless", while the scientists used more cautious language: "The available evidence is not compelling and fails to demonstrate that homeopathy is an effective treatment for any of the reported clinical conditions in humans".

Source: https://happyholistichealth.wordpress.com/tag/homeopathy-painkillers/

To you and me both of these come to the same conclusion: even with lots of data, there is no evidence that homeopathy works; so you and I accept the inevitable conclusion: it doesn’t work (even though Karl Popper would have warned us that this is not a proof).

Which brings me to the obvious question:

What else is being disproved by lack of evidence?

Source: http://5writers5novels5months.com/2013/01/

For me (and I expect to see some heated argument on this), the next target is Astrology.

Come on. Data Scientists have access to enough data about the world population to ascertain if it divides nicely into 12 types of people (one for each sign of the zodiac). Have you read any articles proving Astrology by Big Data analysis of the available data?

No? OK, maybe you should accept that Astrology is not real.

Similarly, have any fortune-tellers won the lottery recently?

No? OK, maybe you should accept that they can’t tell the future with any precision.

Numerology? Ditto.

My point? With access to enough data and enough data scientists, the world is changing: lack of proof is becoming proof of falsehood. And Data Science has an important role to play.

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

Certain Uncertainties or Philosophy of Big Data

Monday February 23rd, 2015

What does philosophy have to do with Big Data, I hear you ask. Bear with me – all will be explained.

Donald Rumsfeld famously said “There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.”

Donald Rumsfeld, February 2002

But the data world is not this clear-cut; not only are there things we know or don’t, there are also whole domains of data where we are just not sure.

So we need to add two new boxes to the diagram:

Domain 1: “I am certain of its uncertainty” – I can quantify the level of unreliability.

Domain 2: "I am uncertain of its uncertainty" – I know that the data is not reliable, but I don't know how unreliable it is. A simplistic (and arguable) example: TripAdvisor data; even with a large number of reviews, I can't be certain that they represent reality.

Let’s first place these two new domains on Rumsfeld’s diagram, then look at a real-life example.

So, where would the “certain uncertainties” and “uncertain uncertainties” fit?

I would place them somewhere around the middle, as in the second diagram.

Let’s look at a real-life example

A Telco wants to sell socio-economic information about its customers, for direct-marketing purposes. The problem is that it knows close-to-nothing about its pre-paid customers: they buy a SIM without giving any personal information.

Can the Telco find out any socio-economic parameters about this population?

The only data we have is usage data: we know the location and duration of calls; we know the location and web-address of web-surfing activities.

Using the Teradata Aster solution, we try the following:

  • Identify the gender by analysing web activity. Using known subscribers, we identify the top gender-specific web sites (men use more gambling and sport sites; women use more dating, picture-sharing and online clothes-shopping sites. Hey, don't shoot the messenger :-)). We then apply this to a test set and achieve 75% success in 'guessing' the gender. Now we can be certain of our uncertainty when applying this to an unqualified data set.
  • Identify higher-income customers by locating frequent domestic flyers. We identify subscribers who made a call from the vicinity of one domestic airport and another call from the vicinity of another airport, with a time gap shorter than the possible driving time between them. Once again, trying this on a known data set gives 80% confidence in the approach. Another certain uncertainty.
  • Find where people live, then use this to identify their income level. The team does this by assuming that calls made before 7am and after 10pm are made from home. It identifies calls made at these times from the same location on different dates and takes that as the home location (a simplified sketch of this heuristic appears after this list). It then uses publicly available socio-economic data about neighbourhoods to assign an income band to each subscriber. This technique achieves only a 42% match against known data and is therefore discarded: it is an uncertain uncertainty, and the risk of using it is too high.
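For the curious, here is a simplified pandas version of that home-location heuristic. It is an illustration only, with an invented call-detail-record layout, and is not the Aster implementation used on the project.

```python
import pandas as pd

# Assumed call-detail-record layout: one row per call.
cdr = pd.DataFrame({
    "subscriber": ["A", "A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2015-02-01 06:30", "2015-02-02 22:45", "2015-02-03 23:10",
        "2015-02-03 14:00",   # daytime call: ignored by the heuristic
        "2015-02-01 06:10", "2015-02-02 23:30",
    ]),
    "cell_id": ["C12", "C12", "C12", "C77", "C40", "C40"],
})

# Keep only calls made before 7am or after 10pm ("probably at home").
hours = cdr["timestamp"].dt.hour
night_calls = cdr[(hours < 7) | (hours >= 22)]

# The most frequent night-time cell per subscriber becomes the presumed home location.
home_cell = (night_calls.groupby("subscriber")["cell_id"]
             .agg(lambda s: s.mode().iloc[0]))
print(home_cell)

# From here you would join home_cell to public socio-economic data by neighbourhood
# and, crucially, test the result against known subscribers to measure the uncertainty.
```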

To summarise: we start with completely unknown data and explore several avenues. We use known data to estimate our confidence (our level of uncertainty). Some avenues lead to successful and repeatable results; some are dead ends (which is a very certain uncertainty). We have identified our uncertain uncertainties and converted our certain uncertainties into known knowns.

 Finally, what about the philosophy angle?

Socrates said (and Plato wrote down) "…only these two things, true belief and knowledge, guide correctly, and that if a man possesses these he gives correct guidance." (Socrates, in Plato's Meno dialogue, 99A).

In other words, you need to know your uncertainties to have true knowledge. Otherwise, it’s only a guess.

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

Drowning in the Data Lake

Wednesday October 15th, 2014

IT people always try to use soothing names for complex propositions (don’t we all love the fluffy Cloud, being Service-Oriented [I sometimes wish that the restaurant sector would adopt this] or promising our customers that we are, above all, Agile?).


The new buzzword is the Data Lake, which immediately brings to mind visions of calm waters and natural beauty (like the picture above, Lake Marian in Fiordland, NZ, taken the last time I hiked there).

So, what is a Data Lake?

Simply put, it is about never having to dispose of any data; mainly because it may be useful one day. With Hadoop, you can afford to keep all your data so that at some time in the future, when you really need it, it is all there.

Is this new?

The sceptic may ask: If HDFS is just a File System, surely we could have kept all this data on some other File System before Hadoop?!

Well, yes. But could you easily retrieve it? The big difference between storing your data in, say, a Linux file directory and storing it on Hadoop is that Hadoop offers several access methods to the data (MapReduce and its SQL derivatives such as Hive), while for your Linux directory you would need to write very complex programs.
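To make the "access methods" point concrete: once the files sit in HDFS, a few lines of Spark SQL can query them in place, something like the sketch below (the path and column names are placeholders). Doing the equivalent over a bare directory of files on a Linux box would mean writing the parsing and aggregation yourself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# The raw files already live in HDFS; describe them and query them where they are.
clicks = spark.read.json("hdfs:///lake/raw/clickstream/2014/10/")   # placeholder path
clicks.createOrReplaceTempView("clickstream")

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM clickstream
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```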


Can you really keep everything and retrieve anything?

You got me there…

I have a client that is struggling with this now. The company uses Hadoop to store several terabytes of data that have no natural place anywhere else. A group of users would like to query the data many times a day. The problem is that Hadoop does not include an advanced query optimiser, nor does it support indexes, so queries that would take seconds on a decent RDBMS take up to 30 minutes on the Hadoop cluster.

So are you saying that the Data Lake is not a good idea?

Not at all! Your Data Lake must be part of your Information Architecture. You have to think about what information you need to store, how you plan to retrieve it and, therefore, where the best place to store it is.

So, before diving into your Data Lake adventure:

  1. Ensure that your Data Lake is part of a robust Enterprise Information Strategy.
  2. Use best practice advice to ensure that your approach is robust.

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.

So You Want to Be a Data Scientist?

Tuesday August 26th, 2014

So, you want to be a data scientist?

Congratulations!  You have made the right decision.  You have just chosen a career that will present you with diverse challenges, an interesting life and one that will be in high demand in the short and medium-term future.

The other good news is that if you possess certain key skills, you can just go and call yourself a data scientist.  After all, all these data scientists that you hear about all the time don’t have a degree in Data Scientifiking.  They declared themselves data scientists and got someone to pay them to do what they enjoy doing.

So, what does a data scientist do?

My definition: they specialise in getting convincing answers to important questions using data that is not easy to get and/or not easy to analyse.

This requires a certain way of thinking.  If you are not sure what I mean, check out the talk from Talithia Williams here to find out how a data scientist thinks (in this example, the talk is about health, but it’s her thinking patterns that make her a data scientist.  One can assume she would probably use the same approach with regard to her nutrition, shopping etc.).

Now that you know what data scientists do, what skills do they need?

I expect a successful data scientist to have equal measures of the following:

[Graphic: the key skills of a data scientist]

With extra emphasis on Chutzpah.

If you have all of the above in spades, you are probably ready to become a data scientist.

The rest of us may need training and education (I am happy to share privately my definition of the difference between training and education…).

Several universities have embarked on data science courses.  This could be a great place to start.  Once graduates of these courses get to the market, they will have an advantage over those without formal data science qualifications.

A great way to get "the feel" without paying anything is to take the free data science series of courses from Johns Hopkins University, offered free on Coursera. Let me write the key word again: Free!! It is an interesting curriculum, good fun (if you are that way inclined) and full of practical exercises and useful advice. Have I already mentioned that it's free?

So, good luck with your new career; you have just chosen the sexiest job this century.

Ben Bor is a Senior Solutions Architect at Teradata ANZ, specialist in maximising the value of enterprise data. He gained international experience on projects in Europe, America, Asia and Australia. Ben has over 30 years’ experience in the IT industry. Prior to joining Teradata, Ben worked for international consultancies for about 15 years and for international banks before that. Connect with Ben Bor via Linkedin.