Typesafe AMA Podcast Ep. 01 feat. Fast Data with Dean Wampler

Today, it's my personal (me, as a person named Tonya Rae Moore who is typing this for you to read) pleasure to unveil the first ever Typesafe Ask Me Anything (AMA) Podcast feat. the Typesafe experts.

As this podcast evolves, I'd like to start crowd-sourcing questions from our audience for upcoming guests. I'll be recording episode 02 at JavaOne next week with Konrad Malawski. We're going to talk about Akka Streams and how they relate to Reactive Streams, deployment architectures, and what the heck both of us are doing at a Java conference. If there is anything you want to know from or about Konrad, tweet at me with #typesafepodcast and I'll see what I can do!

I hope to eventually talk to all of the amazing Typesafe engineers and thought-leaders who are involved with the latest and greatest technologies. I was lucky enough to have met Dean Wampler my first day working for Typesafe. You can google to see all the links to interesting stuff he's done. Or, you can listen in and find out he's a totally brilliant and interesting dude.

TRANSCRIPT:

Tonya Rae Moore: Hi, and welcome to our first ever Typesafe AMA podcast with our experts. My name is Tonya Rae Moore, and I'm a Product Marketing Specialist with Typesafe. I write some of our blogs, I record all of our webinars, and in general I try and help get the word of our Reactive Platform out into the world.

I also live in Chicago and have had the extreme privilege of hanging out with my fellow Chicago dude, Dean Wampler, who is also my first podcast guest. Dean is Typesafe's smart guy. He's a Chicago transplant. He's an outdoorsy sort, and he is a smart cookie, as my Southern mother would say. And he is also super easy to talk to and approachable, which is why I requested him to be my first guest. So let's get started talking to Dean. Heya, Dean.

Dean Wampler: Hey, how are you?

Tonya Rae Moore: I'm good. I'm really excited to talk to you so do you have your coffee, and you're ready to go?

Dean Wampler: I'm ready to go.

Tonya Rae Moore: Okay. As I referenced earlier, you've probably done way more podcasts than me, so we're going to start me off easy by asking you to talk a little bit about your background and how it got you to Typesafe.

Dean Wampler: Okay, well I've been around the block a few times, but just to cut to the chase I started my professional life as a physicist and then got into software. And more recently, was doing classic enterprise server-side software development. Got really excited about big data and moved into that space. And also about the same time, maybe a little earlier even, I started working with Scala and became an advocate for Scala. Co-wrote the first edition of "Programming Scala" for O'Reilly. Since then, I've updated it to the second edition, I guess, gosh, a year ago now. And right now at Typesafe I'm bringing both of those backgrounds. The Scala functional side, the server-side programming side -- I guess there's three backgrounds -- and the third would be big data experience to help us figure out what the future, and even the present I guess to some degree, of big data should be. And how we can also bring some modern things that maybe are not quite there that need to be brought in, like better programming tools, and so forth.

Tonya Rae Moore: I think everybody wants better programming tools, don't they? That seems like a popular request.

Dean Wampler: It's a never-ending quest, actually. It's like the Holy Grail.

Tonya Rae Moore: Since you're our go-to guy what are the most common questions that you see getting posed by our Typesafe subscribers in getting started with Spark?

Dean Wampler: Most of the people have some Hadoop experience. They may be used to writing MapReduce jobs. So the first obvious thing is they need to learn, "What are the APIs? How do I write this particular job in Spark?" Usually that's not too hard because Spark is a very nice API for writing really powerful code with very little actual lines of code, let's say. But where you quickly need additional help is, "All right, now that I've got this thing running what is it actually doing? How do I know what's the most efficient way to run this thing? What constructs that I used in the program should I actually use that are most efficient? Why is something going wrong? How do I make it faster?" Those kinds of runtime things become really important. A criteria. So most of our customers have come to us because they really need to understand better what's going on inside and how to use Spark more effectively when they're actually doing real production work with it.

Tonya Rae Moore: And what about architectural challenges?

Dean Wampler: So there's an interesting thing happening. Like I said, a lot of customers come from a Hadoop background. We're sort of seeing these architecture big data systems move forward. There's a term that's popped up called "fast data." It kind of describes the idea that whereas Hadoop is much more oriented towards collecting lots and lots of data and then, running big batch jobs over it more and more people want to be able to analyze data as it streams in. So stream processing has gotten to be very important.

Spark plays into that very well. It's a great transition technology to support both camps, if I can call it that, of data processing. But more and more people need to understand, "All right, how do I build systems that can process streams of data running at possibly very large scale? Running reliably so if something fails it doesn't lose data?" Get started up again, and all that. So people are coming to us for help for that as well, and we're trying to, with partners, put together really good solutions to those challenges.

Tonya Rae Moore: You mentioned fast data and data in general a lot, and that's really interesting because I see the terms like data at rest, and data in motion, and fast data all the time so what do those mean to you? What should I be thinking when I hear those terms?

Dean Wampler: I like the phrases data in motion, data at rest because they do encapsulate maybe better than words like batch and streaming what's really going on. Really, I think that the answer is I don't want that data just sitting there and me not extracting useful information from it until later in the day when I have to kick off this batch job. It's really about, "Well, let's get the most information out of it as quickly as possible." So for me, that's what data in motion really means. That the data is live. It's most valuable the moment it hits my environment so that's when I want to try to extract that value. That information from it. And then, so whatever I need to do whether it's alerting for something, if it's something I need to respond to immediately, or if it's updating my intelligence about the world. It's an interesting sub-trend, let's say, is all of these machine learning models that people use to recommend stuff to you or serve you ads, and so forth. We would like that stuff to be based on the latest possible information rather than information that could be hours or days old.

So a lot of that is part of this whole data in motion. And then, the term fast data is kind of in a way another way of describing that, but it ties a little more closely to its predecessor model, which was big data. It's not really fair to say that big data is obsolete and now it's fast data. Big data could easily encompass this new stream-oriented world, but people are very much starting to say fast data to emphasize the importance of streaming as the basis of their infrastructure.
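
Editor's note: As a rough illustration of the data-in-motion idea described above, the Python sketch below updates a running average and raises an alert flag as each event arrives, rather than waiting for a later batch job. It is not Spark code; the function name `running_mean_with_alerts`, the sample readings, and the threshold are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def running_mean_with_alerts(
    events: Iterable[float], threshold: float
) -> Iterator[Tuple[float, bool]]:
    """Yield (running_mean, alert) right after each event is seen.

    Hypothetical helper: the state is two numbers, so every event is
    handled the moment it arrives instead of being stored for later.
    """
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        # The alert reflects the newest event; the mean reflects all events so far.
        yield total / count, value > threshold


# Made-up sensor readings standing in for a live feed.
readings = [10.0, 12.0, 50.0, 11.0]
results = list(running_mean_with_alerts(readings, threshold=40.0))

for mean, alert in results:
    print(f"mean so far: {mean:.2f}  alert: {alert}")
```

The generator never needs the full data set in memory, which is the property that lets the same logic run against a feed that does not end.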

Tonya Rae Moore: Now, I'm relatively new to Typesafe and so, I've been learning a lot about some of the partners who have been wanting to work with us like Databricks, and Mesosphere, and IBM, and some really impressive names to me. So is this part of the reason why they're interested in working with us?

Dean Wampler: I think so. I think we all have our strengths, and we're finding that it's much better if we collaborate, and bring our strengths together to better serve the needs of customers who have highly non-trivial environments. We could try to support everything that somebody is building at a large company but then, we might not be as good at the Mesos side so having Mesosphere as a partner has been really beneficial for us to support people running Spark and other tools on top of Mesos, which for me, I think is the next-generation cluster management infrastructure that I think is eventually going to eclipse Spark…I mean YARN….

Tonya Rae Moore: (laughs)

Dean Wampler:…. So many terms I can't keep them straight.

Tonya Rae Moore: Tell me about it. I'm trying to catch my way up on all of them. I switched languages even.

Dean Wampler: I know, it's tough. There's just so many moving parts.

Tonya Rae Moore: Yeah.

Dean Wampler: But yes, Hadoop is sort of based on YARN to manage everything, but YARN is actually fairly limited in what it can do, and I think longer-term people will migrate to things like Mesos. Working with Databricks. They're the primary company that was created by the creators of Spark so we partner with them to enhance Spark. We've got a lot of work on the streaming side, on the Mesos integration. Even on things like how well building with Scala 2.11 works. We've done some work to help with that. And IBM is another one you mentioned. IBM has really invested heavily in Spark recently. They opened, I think the Spark Development Center in San Francisco. They're hiring a lot of Spark experts and doing their part to contribute to open source. And also, they're very attracted to the Typesafe Reactive Platform. They realize that it's one thing to have a tool that's great at say, analyzing data at scale, but how do I wire all the plumbing together? And how do I build these custom services that do whatever unique thing I need to do? And so, they're also very interested in Akka, and Play, and that constellation of tools as they're just part of this fabric of components that need to be glued together. Need to be reliable, durable. Represent the state of the art in thinking about distributed systems. So among them there's several partners we're working with on this whole, let's call it an evolution.

Tonya Rae Moore: It was fun. I remember you, and I, and Martin Odersky, and a couple other people went to IBM together. It was my first time going out there. And I was such a country mouse in the big city like, "Wow, IBM knows our name. IBM thinks we're cool." It was pretty great. I enjoyed that.

Dean Wampler: Yeah, I have a funny story about that trip, too. Obviously, IBM has a lot of facilities for the listener. This was the San Jose sector of it they have. And I remember Martin and I did talks, and there were a couple of guys in the audience that we met afterwards. And it turns out they were IBM fellows who had been doing fundamental database, and streaming research, and implementations for decades. It just reminded me that people often think of IBM as yesterday's company, but really, they have such a deep bench that the fact that they're having people like that think about Spark is really going to make Spark a much better, much more scalable, durable, reliable tool over the long term. So that was kind of a fun day for me, too.

Tonya Rae Moore: Yeah, I'm glad. So I guess I can wrap it up with the broad, "What is the future for fast data? Where are we going from here?"

Dean Wampler: Yeah, I think that obviously, it's hard to really know exactly, but I do feel that people are migrating away from a batch orientation with their infrastructure, and more towards, "Let's do as much as we can with the stream architecture," because it does actually let you do batch processing, too. If you think of a batch as like a finite stream then, it works well there. I think that there's always going to be the continued pressure for cost reduction, for flexibility. Cost in terms of hardware resources. Spark is good for that because it uses fewer resources than a comparable MapReduce job, let's say. And just better productivity. And so, tools like Spark really help. But also, the need to choreograph special-purpose tools that address particular needs that no one tool can handle. We've seen, from our talking to customers, from surveys, that tools like Kafka are very important for durable ingestion and management of event streams. And databases like Cassandra, and Riak, and some of these other highly-scalable, highly-resilient databases are very important for providing that durability. That actual storage that then gives you also the ability to pull that data back out for other purposes, or even to actually run through Spark in some cases.

So I think you'll see these short-term things happening where tools like this will be the basis of the ecosystem. Like I said, I think Mesos will be really important because it can manage all of these kinds of tools, including older tools like the Hadoop file system. Whereas YARN is somewhat limited in its ability to manage these things.

And then, I think also just the general trend toward more containers like Docker. Towards cloud deployments. I talk to companies that five years ago wouldn't have even considered putting their data in the cloud for the usual security or whatever reasons. And now, we think, "You know, we really need to find a way to make this work." So I think we'll see those trends, too.
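
Editor's note: The remark above that a batch is "like a finite stream" can be sketched in a few lines of Python. This is an illustration only, not Spark or Kafka code; `word_counts`, `batch_result`, and the sample lines are hypothetical. The same streaming function serves both cases: a batch job drains a finite input and keeps the final snapshot, while a streaming consumer reads snapshots from an input that never ends.

```python
import itertools
from typing import Dict, Iterable, Iterator


def word_counts(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield a snapshot of cumulative word counts after each incoming line."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
        # Copy so that earlier snapshots are not mutated by later lines.
        yield dict(counts)


def batch_result(lines: Iterable[str]) -> Dict[str, int]:
    """Batch mode: run the stream to its end and keep only the last snapshot."""
    last: Dict[str, int] = {}
    for last in word_counts(lines):
        pass
    return last


# A finite input is simply a stream that happens to stop.
batch = ["spark kafka", "kafka cassandra"]
final = batch_result(batch)

# An unbounded input (simulated here by cycling the same lines forever):
# the consumer takes as many snapshots as it wants and never needs an "end".
stream = itertools.cycle(batch)
first_three = list(itertools.islice(word_counts(stream), 3))

print(final)
print(first_three[-1])
```

Writing the logic once against a stream and treating batch as the special case is the design choice being described; the reverse, building streaming on top of batch-only code, generally requires re-running over accumulated data.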

Tonya Rae Moore: That's cool. And as always, after talking to you I have seven more books I need to read. I really appreciate you talking with me today. I know you're a super busy guy, and we keep you hopping so remind me to buy you a drink next time I see you, okay?

Dean Wampler: All right, I'll remember that. Thanks a lot. I enjoyed it.

Tonya Rae Moore: Thanks, Dean. All right, so check out www.typesafe.com for more information about Spark and our Reactive Platform. And you can always tweet at us @deanwampler for Dean, @TonyaRaeMoore for me, @typesafe for Typesafe. I'll also be checking out #typesafepodcast for feedback and suggestions on improvements.

This is a work in progress and our inaugural podcast so thanks to everyone who listened!

Learn to master Spark

As your business goes Reactive and uses Spark, Scala and Akka, a ton of development work lies ahead. Now, more than ever, staying ahead of the curve is important.

As the official commercial support provider for Spark, we provide hands-on, lab-based approaches to learning all the relevant functionalities to master Spark.

Introductory Workshop For Developers by your new friend Dean. Designed to teach developers how to implement data processing pipelines and analytics using Apache Spark.
Case study with our customer UniCredit on their Fast Data system built with Spark, Akka, Scala, and Play.
Fast Data: Big Data Evolved. Understand Fast Data architecture, and the role of Kafka, Cassandra, and Spark.

Don't waste your time searching for the right information: We've got the stuff!


October 22, 2015
