Next Generation Streamers: an Apache Kafka® Intern Showcase (IN-PERSON EVENT)

:tada: A new Confluent VUG meetup has been posted! We hope to see you there :slight_smile:

  • :spiral_calendar: When: 18 August 2022 at 6:00 PM (PDT)
    :clock1: Click here to see the meetup time in your own timezone. From this link you can also add the meetup directly to your calendar.
  • :speaking_head: Speaker(s): Qi Liu, Jordan Hunt, Shivalika Gupta, Shufan Liu
  • :notebook_with_decorative_cover: Talk(s): Apache Kafka® Intern Showcase (IN-PERSON EVENT)

Every summer, a new group of interns is given the opportunity to deepen their knowledge of Apache Kafka. During their short internships, these hard-working interns gain experience with new technologies and become better developers along the way.

Join us for an in-person meetup on August 18th at 6 pm PDT as we celebrate the accomplishments of these outstanding interns and showcase their Apache Kafka-based summer projects. The agenda and speaker information can be found below.

Join the Community Slack and Forum to ask any follow-up questions!


Agenda in Pacific Daylight Time:

6:00 pm - 6:15 pm: Food, drinks, and networking

6:15 pm - 6:20 pm: Welcome message, Danica Fine

6:20 pm - 6:45 pm: Qi Liu, LinkedIn

6:45 pm - 7:10 pm: Jordan Hunt, Netflix

7:10 pm - 7:35 pm: Shivalika Gupta, LinkedIn

7:35 pm - 7:55 pm: Shufan Liu, Confluent

7:55 pm - 8:00 pm: Q&A + networking


Speaker:

Qi Liu, LinkedIn

Bio:

Qi Liu is a third-year Ph.D. student at the University of Virginia. Her research areas are computer networks and network security. She is a Backend Systems and Infrastructure Engineer Intern at LinkedIn in Summer 2022. She has had the good fortune to work on the LinkedIn Kafka team, where her intern project involves adding latency metrics and implementing time-based consumer lag monitoring.

Title:

Building a time-based consumer lag monitoring system for pub-sub systems

Abstract:

Consumer lag indicates how far a consumer of a topic is lagging behind its producer. Measuring consumer lag is important because it indicates the overall health of the consumer and reflects its performance. A widely adopted way of calculating consumer lag is the offset-based approach, which computes the difference between the latest log end offset and the latest committed offset of the consumer. LinkedIn uses an offset-based lag monitoring service.

However, the offset-based approach has the following limitations: 1) it does not provide the end-to-end latency from the production of records to their actual consumption, and it does not take the size of the records into account; 2) it relies on periodically fetching the log end offsets, which is often not real-time and can be a source of error when a precise lag is required. To address these issues, we propose produce-to-consume latency-based lag computation and monitoring, and we demonstrate how this lightweight approach helps with those pain points and complements the existing offset-based lag monitoring.
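To make the contrast concrete, here is a minimal Python sketch of the two kinds of lag described above. It is only an illustration and not the system presented in the talk: the function names and the sample numbers are hypothetical, and the time-based version simply assumes each record carries a produce timestamp in milliseconds.

```python
import time
from typing import Optional


def offset_lag(log_end_offset: int, committed_offset: int) -> int:
    """Offset-based lag: how many records the consumer has not yet committed."""
    return max(log_end_offset - committed_offset, 0)


def time_lag_ms(record_timestamp_ms: int, consumed_at_ms: Optional[int] = None) -> int:
    """Time-based lag: produce-to-consume latency of one record, in milliseconds.

    Assumes (hypothetically) that record_timestamp_ms is the time the record
    was produced. If consumed_at_ms is omitted, the current wall clock is used.
    """
    if consumed_at_ms is None:
        consumed_at_ms = int(time.time() * 1000)
    return max(consumed_at_ms - record_timestamp_ms, 0)


# Hypothetical sample values, chosen only for illustration.
records_behind = offset_lag(log_end_offset=1_500, committed_offset=1_200)
latency_ms = time_lag_ms(
    record_timestamp_ms=1_660_000_000_000,
    consumed_at_ms=1_660_000_004_250,
)
print(records_behind, latency_ms)
```

The offset-based number says how many records are outstanding, while the time-based number says how long a record waited before it was consumed, which is why the two can complement each other.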


Speaker:
Jordan Hunt, Netflix

Title:
Automated Detection and Notification for Inactive Apache Kafka topics

Speaker Bio:
Jordan Hunt is a rising senior at Harvey Mudd College studying computer science and mathematics. He is currently an intern on the Real-Time Data Infrastructure team at Netflix. His past experience includes internships at Facebook and Asana. When Jordan isn’t coding, you can find him playing basketball, reading, or hanging out with friends.

Abstract:

Within DataMesh, a real-time data movement and processing platform at Netflix, there is no way to identify unused Apache Kafka topics, automatically deprecate them, and notify their owners. As a result, many Kafka topics take up resources when they are no longer in use.

In this presentation, we’ll dive into one of the applications of Kafka topics at Netflix and the reasons topics become inactive. Then, we will explore the solution that was implemented to identify inactive topics and notify users of their status.


Speaker:
Shivalika Gupta, LinkedIn

Bio:

Shivalika Gupta is a rising senior at Cornell University majoring in Computer Science. She grew up in Asbury Park, New Jersey, and is currently staying in San Francisco for the summer! Shivalika is a Backend Systems and Infrastructure Engineer Intern on the Data Pipelines Team at LinkedIn under the mentorship of Aditya Toomula. In her free time, Shivalika loves hiking, traveling, watching Netflix shows, and playing with her pet cat.

Title:

A Deep-dive into the Brooklin MirrorMaker (BMM)

Abstract:

The Brooklin MirrorMaker (BMM) is a tool that LinkedIn uses to copy data between Kafka clusters across data centers. It is important that BMM runs with high performance and stability so that the apps that consume from it do not receive delayed events. To maintain this performance, the team must ensure that the SLA is not violated, as a violation could create significant lag. In particular, the SLA guarantee is a function of source data throughput. Sometimes the source data throughput varies abruptly for extended periods of time, indicating a missed SLA, and it takes further root cause analysis (RCA) to identify whether the miss was due to a bug or to source throughput changing drastically. Hence, my efforts entailed identifying the partitions that exceeded the max throughput so that corrective actions could be taken quickly.
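As a rough illustration of the last step, the sketch below filters a set of per-partition throughput readings against a limit. It is not BMM code: the function name, the partition labels, the byte rates, and the limit are all hypothetical, and the abstract does not say how throughput is actually measured.

```python
from typing import Dict, List


def partitions_over_limit(
    throughput_by_partition: Dict[str, float],
    max_bytes_per_sec: float,
) -> List[str]:
    """Return the partitions whose observed throughput exceeds the limit."""
    return sorted(
        partition
        for partition, rate in throughput_by_partition.items()
        if rate > max_bytes_per_sec
    )


# Hypothetical readings in bytes per second, for illustration only.
observed = {
    "topic-a-0": 800_000.0,
    "topic-a-1": 2_500_000.0,
    "topic-b-0": 1_000_000.0,
}
offenders = partitions_over_limit(observed, max_bytes_per_sec=1_000_000.0)
print(offenders)
```

A partition exactly at the limit is not flagged here; whether the real check is strict or inclusive is an assumption of this sketch.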


Speaker:
Shufan Liu, Confluent

Bio:

Shufan Liu is a Developer Advocate intern at Confluent. He is studying in the Computer and Information Technology master’s program at the University of Pennsylvania, having recently pivoted his career from the business side to the tech field. His job at Confluent is to design end-to-end applications using Apache Kafka on Confluent Cloud to demonstrate use cases to developer communities, write tutorial blogs to introduce his projects, and deliver presentations at meetups like this one!

Title:

Analyzing Subreddit Sentiment Using Apache Kafka

Abstract:

Freedom of speech is an important factor of any online forum – Reddit is the manifestation of that ideology, allowing users to gather in different communities called subreddits to freely express their opinions, emotions, and suggestions. Sometimes the emotions are heart-warming, sometimes they are chilling, and sometimes they are wild. But is there a way to use Apache Kafka to investigate and track these sentiments over time and utilize the result in a scalable application?

Overall sentiment within a subreddit can be quantified using Natural Language Processing (NLP). To reduce coupling, ensure fault tolerance, and improve elasticity, we broke this problem down into a number of Python microservices handling user interaction, Reddit polling, and analysis. A user supplies a subreddit and time range, which is captured and produced to a Kafka topic. A separate microservice reads this input topic and polls all Reddit threads from that subreddit over the specified time range. Another microservice calculates a sentiment score for each thread and appends these scores to another Kafka topic. Finally, a ksqlDB application computes the average sentiment score per user request, yielding the overall subreddit sentiment.
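The final aggregation step can be pictured with a small Python sketch. In the talk this step is done by a ksqlDB application; the code below only mirrors the idea in plain Python, and the request IDs, the score values, and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_sentiment(scores: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the per-thread sentiment scores for each user request.

    Each input item is a (request_id, score) pair, standing in for one
    record on the scores topic.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for request_id, score in scores:
        totals[request_id] += score
        counts[request_id] += 1
    return {request_id: totals[request_id] / counts[request_id] for request_id in totals}


# Hypothetical per-thread scores for two user requests.
sample_scores = [("req-1", 0.5), ("req-1", -0.25), ("req-2", 1.0)]
averages = average_sentiment(sample_scores)
print(averages)
```

Keying the scores by request is what lets several users query different subreddits at the same time without their results mixing.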

This presentation will show how to get the data from Reddit, how data flows across microservices and Kafka topics, and why Kafka and microservices are a great fit for this application.


Address:

Confluent HQ

899 West Evelyn Ave.

Mountain View, CA 94041


If you are interested in speaking at or hosting our next event, please let us know at community@confluent.io.