Job Description

Seasoned Site Reliability Engineer (SRE) with 5+ years of experience supporting complex, large-scale distributed systems. Highly skilled in managing production failures, conducting root cause analysis, and driving effective remediation. Strong communicator with expertise in alerting, monitoring, and release management, complemented by automation proficiency and the ability to learn quickly.

This role involves providing 24/7 support as part of the SRE team, ensuring the reliability and performance of mission-critical Java, .NET, and batch applications deployed across GCP, PCF, and on-premises environments.

Years of experience needed: 5+ years

Technical Skills:

• Expertise in large-scale production systems and technologies, such as load balancing, monitoring, distributed systems, microservices, and configuration management.

• Solid hands-on experience troubleshooting and fixing application failures, application performance degradation, code issues, cloud platform issues, batch failures, infrastructure failures, database failures, and network failures.

• Hands-on experience performing production deployments through CI/CD pipelines, plus exposure to deployment strategies.

• Experience troubleshooting Linux/Unix systems.

• Monitor application, service, and batch availability.

• Act quickly on application alerts (performance, availability) and batch job failures.

• Perform the necessary code and log analysis, and escalate to the engineering team as required.

• Initiate and drive tech lines during outages, major incidents, and batch abends, and restore service in the shortest possible time.

• Effectively handle incident, problem, release, and change management.

• Own and deliver the user stories assigned in each sprint.

o User stories range from application code debugging, issue analysis, and code fixes to knowledge base creation, SOP documentation, production deployments, pre- and post-patching/maintenance activities, and service requests.

o Build monitoring solutions using APM and observability tools such as Splunk, AppDynamics, ThousandEyes, ITRS, AppMetrics, Moogsoft, and Kafka.

o Automate day-to-day operational tasks.

o Participate in exit reviews to ensure best practices are followed and the right code is deployed to production systems.

o Provide feedback and recommend system improvements that lead to highly stable systems.

• Strong understanding of networking concepts (TCP/IP, SSL/TLS, IPsec, VPN, etc.), firewalls, and load balancers.

• Experience in scripting: Shell, PowerShell, Python.

• Strong experience working with cloud-based infrastructure (PCF, GCP, AWS, Azure, or others).

Certifications Needed:

As per industry standards

Site Reliability Engineer Job at TalentLynk, Austin, TX