From Millions to Billions of Connected Devices, Aeris Uses Cassandra to Scale
August 5, 2013
Subu Balakrishnan: Head of Analytics at Aeris
Drew Johnson: Head of Engineering at Aeris
Matt Pfeil: Co-Founder at DataStax
Matt: Hello Planet Cassandra, this is Matt Pfeil, the co-founder of DataStax. Today I’m with Drew Johnson, the Head of Engineering at Aeris, as well as Subu Balakrishnan, the Head of Analytics and Billing. Gentlemen, thanks for joining us today. Why don’t we kick this thing off with you guys telling everyone a little bit about what Aeris does?
Drew: Aeris is in the machine-to-machine/internet of things space; so what we do is connect machines into a network. These machines can be as large as cars and trucks, all the way down to the little sensors sitting in a gasoline tank or parking space. We have customers, such as Hyundai and Honda, that have all of their vehicles connected to a network; all of these networked vehicles are connected via our systems.
Matt: That’s really cool. What kind of information do you track?
Drew: With cars, we think about three main categories of applications. The first is ‘under the hood applications’, which is really about the automaker managing and monitoring the vehicle (monitoring all of the engine systems, all of the transmission systems, the tire systems). The second is ‘front seat applications’; these are applications that are most useful for the owner of the vehicle (the owner can lock and unlock the car, adjust the temperature and even remotely start the car). The third is what we call ‘backseat applications’; these are streaming applications like Pandora, Tune-In radio, things like that.
Matt: That’s awesome. It’s really cool to see that technology is evolving to permit all these things to become a reality in this world. What’s your use case for Cassandra?
Drew: We’re really moving from a time where there are millions of devices connected to a network today to a time where there will be billions of devices connected to this network. We’re seeing an order of magnitude change in the scale of the kind of offering that we’re providing. What we were looking for in a database is a horizontally scalable data store that’s going to work and be cost effective for that change that is soon to come.
Matt: That makes sense; so you initially came across Cassandra because of its linear scalability characteristics. Is it your primary data store? Help me understand a little bit about where you’re using it today.
Drew: For our current set of services we’re generally coming from an Oracle basis, as many companies are coming from. This company actually started ten years ago in machine-to-machine internet of things, so it’s really been a pioneer of this space. Subu and I have both been at the company less than two years and we’re part of a group in engineering that have been driving towards newer technologies, including Cassandra.
Matt: That’s awesome. When you did your evaluation, what made you choose Cassandra, other than the linear scalability, and what were some of the other technologies that you evaluated?
Drew: I’ll answer and then let Subu give a more technical answer. We really looked at all of the possibilities; the easiest path, the path of least resistance was to stick with Oracle… but we also looked at other relational databases like MySQL, PostgreSQL and those kinds of solutions.
We looked at NoSQL stores like MongoDB and even HBase. A few things that we like about Cassandra are the horizontal scalability, linear scalability, and ability to handle partitions so easily… but there are also some other specific aspects that we really like about it as well: We enjoy the no master, no single point of failure architecture (compared to some of the other NoSQL solutions).
Cassandra also forces a separation of concerns in the architecture, which helps with something we’re trying to get ourselves out of: there was a lot of tightly coupled business logic in our existing Oracle systems. Oracle is very good at storing data, but unfortunately it also encourages putting business logic tightly coupled with that data.
Like many people, we are leveraging Amazon. Our cloud strategy is to own the base, rent the peak and rent the new; so anything new that we’re developing, we’re deploying it first in Amazon and then, as we understand the usage profile, we have an opportunity to migrate that into our data center. Cassandra works well with that.
Subu: One way of looking at making a choice for our data store is that you can think of data stores as a continuum in terms of latency and other aspects of the storage system. We obviously have different needs for different applications, so if you think about real time scenarios, you’re talking about something that has a read-write latency in tens of milliseconds. In other scenarios you want to actually push the computation to the data and it becomes more of a computational need; that’s when you need a data store that can satisfy the need of computation. You have all these different use cases and different scenarios and Cassandra fits into at least a few of them very well.
In scenarios where we need very low-latency reads and writes, we use something like in-memory solutions, and in scenarios where we need pretty high-throughput writes, we write directly to Cassandra. An example where we don’t use Cassandra is when we need both read and write latency to be low along with a certain level of consistency, such as certain counters that need high consistency.
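The placement rules Subu describes can be written down as a small decision function. This is only an illustrative sketch: the `Workload` fields, the `Store` names and the fall-through case are assumptions made for the example, not Aeris code, and the interview does not say which store is used for the strongly consistent counters.

```python
from dataclasses import dataclass
from enum import Enum


class Store(Enum):
    IN_MEMORY = "in-memory store"
    CASSANDRA = "Cassandra"
    OTHER = "another store (not named in the interview)"


@dataclass(frozen=True)
class Workload:
    # Hypothetical flags summarising what an application needs.
    low_latency_reads_and_writes: bool
    high_write_throughput: bool
    strong_consistency: bool


def choose_store(w: Workload) -> Store:
    """Pick a store following the rules of thumb described in the interview."""
    # Low read AND write latency plus high consistency (e.g. certain
    # counters): the case where Cassandra is explicitly not used.
    if w.low_latency_reads_and_writes and w.strong_consistency:
        return Store.OTHER
    # Very low-latency reads and writes: in-memory solutions.
    if w.low_latency_reads_and_writes:
        return Store.IN_MEMORY
    # High-throughput writes: write directly to Cassandra.
    if w.high_write_throughput:
        return Store.CASSANDRA
    # Anything else is not covered by the interview (assumption).
    return Store.OTHER


if __name__ == "__main__":
    telemetry = Workload(False, True, False)
    print(choose_store(telemetry).value)
```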
Matt: Great. Are you guys storing time-series based data?
Subu: Yes.
Matt: Usually things that are based on machines like to blip a lot. Can you give an example of one of those use cases and a little bit more information about what you are doing there?
Subu: At a higher level, you can think of the type of data that we have as (in telecom, we call it) control plane or data plane data. Control plane data captures the information exchanged between network elements. And then you have the actual data from the device to a backend system or to another device, which we call the data plane.
We have a type of data that represents what is happening in the control plane, which we try to think of as metadata. And then you have the actual data from the customer or from the device that flows across the network.
Of these two classes of data, one is more like metadata and the other is the actual data from the devices. These are the two classes of data we are looking at.
Matt: That’s great. Thank you. Do you have any advice on getting started with Cassandra for someone coming from a relational or Oracle background?
Drew: This is actually the second company that I’ve helped bring Cassandra to and one of the big lessons at the first company is it’s actually difficult to take engineers who are really oriented around relational databases and get them working in Cassandra and thinking in a non-relational way. Taking a hardcore Oracle engineer and turning that person into a Cassandra engineer is probably more challenging than taking a Java engineer and having them work on Cassandra.
The other thing is that Cassandra is not the answer to everything. You really have to look at what exactly is your use case and figure out if Cassandra is the right answer for the questions that you’re asking.
Matt: I like that, “Cassandra isn’t the answer for everything”; I actually think that’s really true for most tools. There are tools that have specialty purposes. In each of your opinions, I’d just like to know: what do you think are the top things that Cassandra is best for?
Drew: Well, we are using Cassandra in a bunch of areas. Anywhere that we’re looking at storing a relatively large amount of data (especially if there is a relatively heavy write aspect to that data), Cassandra is really good. It is especially good for time series, and the TTL aspect of that time series is fantastic; we find that it actually automates a lot of the operational needs. We’re using Cassandra in some of our core 4G telecom network elements. We’re using it from a reporting perspective for a simple kind of reporting repository. We actually built a horizontally scalable machine search infrastructure on top of Cassandra. And then we’re using it basically to build out our storage of the device data (the actual user plane data that Subu was talking about) for our platform as a service, for the internet of things.
There is a huge class of data, particularly where some of the primary aspects are high-write throughput and also access via time series. These are just directly in the sweet part of Cassandra.
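To make the time-series-plus-TTL pattern concrete, here is a toy in-memory model of it. It is not Cassandra’s API or Aeris’s schema; the class and parameter names are invented for illustration. It mimics only the behaviour Drew describes: readings are grouped per device, ordered by time, and each reading stops being readable once its TTL (counted from the time of the write) has passed, so no separate cleanup job is needed.

```python
import bisect
from collections import defaultdict


class TimeSeriesStore:
    """Toy model of a per-device time series with per-write TTL."""

    def __init__(self):
        # device_id -> sorted list of timestamps, kept parallel to _rows.
        self._timestamps = defaultdict(list)
        # device_id -> list of (timestamp, expires_at, value).
        self._rows = defaultdict(list)

    def write(self, device_id, timestamp, value, ttl_seconds, now):
        """Store a reading that expires ttl_seconds after the write time `now`."""
        ts_list = self._timestamps[device_id]
        i = bisect.bisect_right(ts_list, timestamp)
        ts_list.insert(i, timestamp)
        self._rows[device_id].insert(i, (timestamp, now + ttl_seconds, value))

    def read_range(self, device_id, start, end, now):
        """Return (timestamp, value) pairs in [start, end] still live at `now`."""
        ts_list = self._timestamps[device_id]
        lo = bisect.bisect_left(ts_list, start)
        hi = bisect.bisect_right(ts_list, end)
        return [
            (ts, value)
            for ts, expires_at, value in self._rows[device_id][lo:hi]
            if expires_at > now  # expired readings are simply skipped
        ]


if __name__ == "__main__":
    store = TimeSeriesStore()
    store.write("car-1", timestamp=100, value=20.5, ttl_seconds=60, now=100)
    store.write("car-1", timestamp=130, value=21.0, ttl_seconds=60, now=130)
    print(store.read_range("car-1", 0, 200, now=150))
    print(store.read_range("car-1", 0, 200, now=170))
```

In the toy model the first read sees both readings, while the second sees only the later one because the earlier reading’s 60-second TTL has elapsed. Unlike a real store, this sketch never frees the expired rows; it only hides them from reads.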
Subu: The only thing I would add is that when choosing these technologies, many of them are technically very close, with only slight advantages of one over another. From that point of view, Cassandra may only be slightly better or worse than another technology, depending on the context of how you’re using it.
We cannot have five different data stores, because that increases complexity from an operations and development point of view. I could imagine scenarios where Cassandra may not be the best fit but may still be chosen, because running fewer data store technologies reduces operational cost.
Drew: That’s a very good point, excellent.
Matt: Great, guys. I really want to thank you for your time for this interview and, unless there’s anything else you’d like to add, we’ll be signing off.
Subu: Okay, thank you.