Happy 1st birthday MOHUG! It's been one year since we went public and officially started on Meetup.


Doing Data Science with Apache Spark – Dong Meng from MapR.

Spark is a distributed computing framework that makes data science practical on huge datasets. This presentation will start with an introduction to Spark core, then dive into use cases: running ad-hoc analytical queries with SparkSQL, building machine learning pipelines with MLlib, and doing graph modeling with GraphX.

Impala performance benchmarks and use cases – Derek Kane from Cloudera.

Security in the cluster – Erik Nor from Moser Consulting.

As data in enterprise Hadoop clusters continues to grow, securing that data remains an important part of any implementation, yet it is often an afterthought. This presentation will cover best practices for securing a cluster, including authentication via Kerberos, authorization, ongoing administration, auditing via Ranger, access via Knox, and encryption via TDE, SASL, and SSL. It will demonstrate why each aspect of security is needed, how it is implemented, and what each tool does to protect the data. If time allows, live examples of how the tools are configured and how they protect your data will be shown.

Speaker Bios

Derek has spent the last 20 years building solutions with data. The last ten years he was with JP Morgan where he was a Lead Architect. As a part of the JP Morgan Innovation team, Derek led the creation of Big Data solutions for an organization that managed $2 trillion in assets. He has also built out multiple Centers of Excellence covering Business Intelligence and Data Visualization. He is a patent holder for an application that manages Total Cost of Ownership of technology solutions. Derek has worked at Cloudera as a Systems Engineer since 2015 and is based in Columbus, Ohio.

Erik is a Principal Consultant and Big Data Tech Lead at Moser Consulting. He has been working with Hadoop since 2012, when he naively installed it onto a cluster of Solaris servers. Since then he has become certified in a variety of distributions and travels around the country architecting, implementing, and supporting solutions for clients big and small.

Dong Meng is a Data Scientist for MapR, focused on building data science solutions that leverage the MapR tech stack for its customers. He has several years of experience in machine learning, data mining, and big data software development. Previously, Dong was a senior data scientist for ADP, where he built a machine learning pipeline and data products to power ADP Analytics. Prior to ADP, Dong was a staff software engineer for SPSS, where he helped build Analytic Catalyst (now Watson Analytics). During his graduate studies, he served as a research assistant at the Ohio State University, where he concentrated on compressive sensing and solving point estimation problems from a Bayesian perspective.