Req number: R6280
Employment type: Full time
Worksite flexibility: Remote

Who we are
CAI is a global technology services firm with over 8,500 associates worldwide and a yearly revenue of $1 billion+. We have over 40 years of excellence in uniting talent and technology to power the possible for our clients, colleagues, and communities. As a privately held company, we have the freedom and focus to do what is right—whatever it takes. Our tailor-made solutions create lasting results across the public and commercial sectors, and we are trailblazers in bringing neurodiversity to the enterprise.
Job Summary
As a Spark Engineer, you will design, build, and optimize large-scale data processing systems using Apache Spark. You will collaborate with data scientists, analysts, and engineers to ensure scalable, reliable, and efficient data solutions.

Job Description
We are looking for a Spark Engineer with deep expertise in distributed data processing, ETL pipelines, and performance tuning for high-volume data environments. This position will be full-time and remote.
What You'll Do:
Design, develop, and maintain big data solutions using Apache Spark (Batch and Streaming).
Build data pipelines for processing structured, semi-structured, and unstructured data from multiple sources.
Optimize Spark jobs for performance and scalability across large datasets.
Integrate Spark with various data storage systems (HDFS, S3, Hive, Cassandra, etc.).
Collaborate with data scientists and analysts to deliver robust data solutions for analytics and machine learning.
Implement data quality checks, monitoring, and alerting for Spark-based workflows.
Ensure security and compliance of data processing systems.
Troubleshoot and resolve data pipeline and Spark job issues in production environments.
What You'll Need:
Required:
Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s preferred).
3+ years of hands-on experience with Apache Spark (Core, SQL, Streaming).
Strong programming skills in Scala, Java, or Python (PySpark).
Solid understanding of distributed computing concepts and big data ecosystems (Hadoop, YARN, HDFS).
Experience with data serialization formats (Parquet, ORC, Avro).
Familiarity with data lake and cloud environments (AWS EMR, Databricks, GCP DataProc, or Azure Synapse).
Knowledge of SQL; experience with data warehouses (Snowflake, Redshift, BigQuery) is a plus.
Strong background in performance tuning and Spark job optimization.
Experience with CI/CD pipelines and version control (Git).
Familiarity with containerization (Docker, Kubernetes) is an advantage.
Preferred:
Experience with stream processing frameworks (Kafka, Flink).
Exposure to machine learning workflows with Spark MLlib.
Knowledge of workflow orchestration tools (Airflow, Luigi).
Physical Demands
Ability to safely and successfully perform the essential job functions.
Sedentary work that involves sitting or remaining stationary most of the time, with occasional need to move around the office to attend meetings, etc.
Ability to conduct repetitive tasks on a computer, utilizing a mouse, keyboard, and monitor.
Reasonable Accommodation Statement
If you require a reasonable accommodation in completing this application, interviewing, completing any pre-employment testing, or otherwise participating in the employment selection process, please direct your inquiries to [email protected] or (888) 824-8111.