Distributed Systems

From-scratch Java RMI framework · Docker multi-service deployment · PySpark large-scale data analysis

A collection of distributed systems projects completed for CS328 at SUSTech, spanning three interconnected components: a custom Java RMI-style framework built without using java.rmi, a Docker-based multi-service deployment, and a PySpark pipeline analyzing ~970,000 real-world parking records.


Highlights

  • Implemented a Java RMI-style framework from scratch — service registry, dynamic proxy stubs (java.lang.reflect.Proxy), skeleton threading, and serialized InvocationMsg/ReturnMsg over sockets — without using java.rmi
  • Demonstrated the framework with a concrete MatrixCalculator remote service, with server and client communicating through the custom registry and stub layer
  • Containerized the full system with Docker multi-stage builds and Docker Compose; three isolated services (registry, server, client) discover each other via environment variables
  • Processed ~970,000 real-world parking records with PySpark, producing five analytical outputs including time-windowed utilization rates and Spark DAG visualizations
  • Applied CheckStyle across the Java codebase (~700 lines) to enforce code quality standards

MyRMI — Remote Method Invocation from Scratch

The centerpiece of the work is a Java RMI-style remote invocation framework implemented without the built-in java.rmi library. Every layer of the RMI stack was constructed by hand:

  • Registry: RegistryImpl handles object bind and lookup; LocateRegistry provides the client-facing factory; registry calls are themselves proxied over the network via RegistryStubInvocationHandler
  • Stubs: StubInvocationHandler implements java.lang.reflect.InvocationHandler, intercepting any method call on a remote interface and serializing it into an InvocationMsg for transmission
  • Skeletons: SkeletonReqHandler threads receive incoming messages, deserialize them, dispatch the call via reflection, and return the result as a ReturnMsg
  • Messages: InvocationMsg and ReturnMsg carry method name, argument types, argument values, and return value over a socket connection
  • Exceptions: RemoteException, AlreadyBoundException, and NotBoundException mirror the standard RMI exception hierarchy

[Figure: source layout, 16 Java files across 6 packages]
[Figure: service-side log showing stub creation, registry bind, and lookup in action]
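
The stub, message, and skeleton roles listed above can be sketched end to end in a single file. This is a minimal illustration under stated assumptions, not the project's code: the `add` method on `MatrixCalculator`, the `SimpleMatrixCalculator` class, the `MyRmiSketch` helper, the field names inside `InvocationMsg` and `ReturnMsg`, and the one-socket-per-call design are all invented for the example.

```java
import java.io.*;
import java.lang.reflect.*;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Arrays;

/** Hypothetical remote interface; the project's real method signatures are not shown here. */
interface MatrixCalculator {
    int[][] add(int[][] a, int[][] b);
}

class SimpleMatrixCalculator implements MatrixCalculator {
    public int[][] add(int[][] a, int[][] b) {
        int[][] c = new int[a.length][a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++) c[i][j] = a[i][j] + b[i][j];
        return c;
    }
}

/** Request: method name, argument types, argument values (field names are assumptions). */
class InvocationMsg implements Serializable {
    final String method; final Class<?>[] types; final Object[] args;
    InvocationMsg(String m, Class<?>[] t, Object[] a) { method = m; types = t; args = a; }
}

/** Reply: a return value, or the exception raised by the remote method. */
class ReturnMsg implements Serializable {
    final Object value; final Throwable error;
    ReturnMsg(Object value, Throwable error) { this.value = value; this.error = error; }
}

/** Client side: every call on the proxy becomes one InvocationMsg sent on a fresh socket. */
class StubInvocationHandler implements InvocationHandler {
    private final String host; private final int port;
    StubInvocationHandler(String host, int port) { this.host = host; this.port = port; }
    @Override
    public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
        try (Socket s = new Socket(host, port)) {
            ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
            out.writeObject(new InvocationMsg(m.getName(), m.getParameterTypes(), args));
            out.flush();
            ReturnMsg reply = (ReturnMsg) new ObjectInputStream(s.getInputStream()).readObject();
            if (reply.error != null) throw reply.error; // rethrow the remote failure locally
            return reply.value;
        }
    }
}

/** Server side: one handler per connection; dispatches the call via reflection. */
class SkeletonReqHandler implements Runnable {
    private final Socket socket; private final Object target; private final Class<?> iface;
    SkeletonReqHandler(Socket s, Object t, Class<?> i) { socket = s; target = t; iface = i; }
    @Override
    public void run() {
        try (Socket s = socket) {
            InvocationMsg msg = (InvocationMsg) new ObjectInputStream(s.getInputStream()).readObject();
            ReturnMsg reply;
            try {
                Method m = iface.getMethod(msg.method, msg.types); // resolve on the remote interface
                reply = new ReturnMsg(m.invoke(target, msg.args), null);
            } catch (InvocationTargetException e) {
                reply = new ReturnMsg(null, e.getCause()); // exception thrown by the target method
            }
            ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
            out.writeObject(reply);
            out.flush();
        } catch (IOException | ReflectiveOperationException e) {
            e.printStackTrace();
        }
    }
}

class MyRmiSketch {
    /** Starts an accept loop on an ephemeral port and returns that port. */
    static int startSkeleton(Object target, Class<?> iface) throws IOException {
        ServerSocket server = new ServerSocket(0);
        Thread acceptor = new Thread(() -> {
            try {
                while (true) new Thread(new SkeletonReqHandler(server.accept(), target, iface)).start();
            } catch (IOException e) { /* accept failed; stop serving */ }
        });
        acceptor.setDaemon(true); // do not keep the JVM alive once main returns
        acceptor.start();
        return server.getLocalPort();
    }

    public static void main(String[] args) throws IOException {
        int port = startSkeleton(new SimpleMatrixCalculator(), MatrixCalculator.class);
        MatrixCalculator calc = (MatrixCalculator) Proxy.newProxyInstance(
                MatrixCalculator.class.getClassLoader(), new Class<?>[] {MatrixCalculator.class},
                new StubInvocationHandler("127.0.0.1", port));
        System.out.println(Arrays.deepToString(calc.add(new int[][] {{1, 2}}, new int[][] {{3, 4}})));
    }
}
```

Running `MyRmiSketch` prints `[[4, 6]]`. The caller only ever sees the `MatrixCalculator` interface; the proxy hides the socket and serialization work. A fuller stub would also answer `toString`, `hashCode`, and `equals` locally instead of forwarding them.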

The test service exposes a MatrixCalculator remote interface with two implementations, allowing multiple services to be registered and resolved by name through the same registry instance.
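
Registering two implementations under different names and resolving them through one registry might look like the sketch below. It is a local, in-memory stand-in: the real RegistryImpl is reached over the network, and the `multiply` method, the two calculator classes, the binding names, and `RegistryDemo` are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Local stand-ins for the framework's exceptions (constructors are assumptions). */
class AlreadyBoundException extends Exception {
    AlreadyBoundException(String name) { super("already bound: " + name); }
}

class NotBoundException extends Exception {
    NotBoundException(String name) { super("not bound: " + name); }
}

/** Hypothetical remote interface; the real method set is not shown in this write-up. */
interface MatrixCalculator {
    int[][] multiply(int[][] a, int[][] b);
}

/** Textbook triple loop. */
class NaiveCalculator implements MatrixCalculator {
    public int[][] multiply(int[][] a, int[][] b) {
        int[][] c = new int[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b[0].length; j++)
                for (int k = 0; k < b.length; k++) c[i][j] += a[i][k] * b[k][j];
        return c;
    }
}

/** Same result, but transposes b first so the inner loop reads rows of both operands. */
class TransposedCalculator implements MatrixCalculator {
    public int[][] multiply(int[][] a, int[][] b) {
        int[][] bt = new int[b[0].length][b.length];
        for (int k = 0; k < b.length; k++)
            for (int j = 0; j < b[0].length; j++) bt[j][k] = b[k][j];
        int[][] c = new int[a.length][bt.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < bt.length; j++)
                for (int k = 0; k < b.length; k++) c[i][j] += a[i][k] * bt[j][k];
        return c;
    }
}

/** Name-to-object table with RMI-like bind and lookup semantics. */
class RegistryImpl {
    private final Map<String, Object> bindings = new ConcurrentHashMap<>();

    void bind(String name, Object remote) throws AlreadyBoundException {
        // putIfAbsent makes "check then insert" atomic across threads
        if (bindings.putIfAbsent(name, remote) != null) throw new AlreadyBoundException(name);
    }

    Object lookup(String name) throws NotBoundException {
        Object remote = bindings.get(name);
        if (remote == null) throw new NotBoundException(name);
        return remote;
    }
}

class RegistryDemo {
    public static void main(String[] args) throws Exception {
        RegistryImpl registry = new RegistryImpl();
        registry.bind("calc/naive", new NaiveCalculator());
        registry.bind("calc/transposed", new TransposedCalculator());
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        for (String name : new String[] {"calc/naive", "calc/transposed"}) {
            MatrixCalculator calc = (MatrixCalculator) registry.lookup(name);
            System.out.println(name + " -> " + Arrays.deepToString(calc.multiply(a, b)));
        }
    }
}
```

Both lookups print the product `[[19, 22], [43, 50]]`. Binding an existing name a second time raises `AlreadyBoundException`, and looking up an unknown name raises `NotBoundException`.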


MyRMI_Docker — Containerized Deployment

The framework was extended into a realistic multi-service deployment using Docker and Docker Compose. Registry, server, and client each run as an isolated container:

  • Multi-stage Dockerfile per service: Maven compiles in the build stage; a minimal OpenJDK image runs at runtime
  • docker-compose.yaml defines startup order (registry → server → client) and inter-service dependencies
  • Service discovery uses environment variables (REGISTRY_HOST, SERVER_PORT) — no hardcoded addresses

[Figure: all three containers starting and completing an end-to-end RMI call: the registry binds a Mortgage service, and the client looks it up and executes a remote calculation]
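
On the Java side, environment-based discovery comes down to reading REGISTRY_HOST and SERVER_PORT at startup. The sketch below shows one way to do that; the `ServiceConfig` class and both fallback values are assumptions, not taken from the project.

```java
import java.util.Map;

/** Hypothetical holder for the two settings named above. */
class ServiceConfig {
    final String registryHost;
    final int serverPort;

    ServiceConfig(String registryHost, int serverPort) {
        this.registryHost = registryHost;
        this.serverPort = serverPort;
    }

    /** Builds the config from an environment map so it can be exercised without real env vars. */
    static ServiceConfig fromEnv(Map<String, String> env) {
        String host = env.getOrDefault("REGISTRY_HOST", "localhost"); // fallback is an assumption
        String port = env.getOrDefault("SERVER_PORT", "9000");        // fallback is an assumption
        try {
            return new ServiceConfig(host, Integer.parseInt(port.trim()));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("SERVER_PORT must be an integer, got: " + port, e);
        }
    }

    public static void main(String[] args) {
        ServiceConfig cfg = fromEnv(System.getenv());
        System.out.println("registry=" + cfg.registryHost + " serverPort=" + cfg.serverPort);
    }
}
```

Under Docker Compose, REGISTRY_HOST would typically be set to the registry's service name, which Compose makes resolvable on the project network. Taking a `Map` rather than calling `System.getenv()` directly keeps the parsing testable.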

SparkProcessing — Large-Scale Parking Data Analysis

A PySpark pipeline analyzing Shenzhen parking records (~970,000 rows, 1,930 berths, 43 street sections).

Five analytical tasks:

  1. Berth count per street section
  2. Unique berth-to-section mappings
  3. Average parking duration per berth
  4. Hourly utilization rate with percentage breakdown
  5. Peak-hour identification across sections

[Figure: sample output, berth count by street section]
[Figure: Spark DAG for one task. Earlier stages (CSV scan, sort-aggregate) are skipped because their results are cached; Adaptive Query Execution (AQE) optimizes the final shuffle]
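
Tasks 1 and 3 are both group-by aggregations. The sketch below shows their shape in plain Java collections rather than PySpark, to stay in the same language as the other examples here; it is not the pipeline's code. The `ParkingRecord` fields, the reading of task 1 as "distinct berths per section", and the sample rows are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical row shape; the real dataset's column names are not given here. */
class ParkingRecord {
    final String section; final String berth; final long inSec; final long outSec;
    ParkingRecord(String section, String berth, long inSec, long outSec) {
        this.section = section; this.berth = berth; this.inSec = inSec; this.outSec = outSec;
    }
}

class ParkingAggregates {
    /** Task 1 shape: number of distinct berths seen in each street section. */
    static Map<String, Integer> berthCountPerSection(List<ParkingRecord> rows) {
        Map<String, Set<String>> berths = new TreeMap<>();
        for (ParkingRecord r : rows) {
            berths.computeIfAbsent(r.section, k -> new HashSet<>()).add(r.berth);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : berths.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    /** Task 3 shape: mean of (out - in) in seconds, grouped by berth. */
    static Map<String, Double> avgDurationPerBerth(List<ParkingRecord> rows) {
        Map<String, long[]> acc = new TreeMap<>(); // berth -> {total seconds, record count}
        for (ParkingRecord r : rows) {
            long[] a = acc.computeIfAbsent(r.berth, k -> new long[2]);
            a[0] += r.outSec - r.inSec;
            a[1]++;
        }
        Map<String, Double> avg = new TreeMap<>();
        for (Map.Entry<String, long[]> e : acc.entrySet()) {
            avg.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return avg;
    }

    public static void main(String[] args) {
        // Made-up rows, not taken from the Shenzhen dataset.
        List<ParkingRecord> rows = Arrays.asList(
                new ParkingRecord("S1", "B1", 0, 600),
                new ParkingRecord("S1", "B1", 1000, 2200),
                new ParkingRecord("S1", "B2", 0, 300),
                new ParkingRecord("S2", "B3", 0, 1800));
        System.out.println(berthCountPerSection(rows));
        System.out.println(avgDurationPerBerth(rows));
    }
}
```

On the four sample rows this prints `{S1=2, S2=1}` and then `{B1=900.0, B2=300.0, B3=1800.0}`. In Spark the same grouping runs as a distributed shuffle on the key, which is where the shuffle stage in the DAG comes from.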

Technical Summary

   
Languages: Java 8, Python 3
Infrastructure: Docker, Docker Compose
Frameworks: PySpark, Maven
Key concepts: RPC, dynamic proxy, serialization, service registry, container orchestration, DAG execution
Dataset: ~970,000 rows of real parking data
Code quality: CheckStyle enforced across Java codebase