AWS/DevOps Site Reliability Engineer in Charlotte, North Carolina
THE TEAM YOU WILL BE JOINING:
Top 25 U.S. digital financial services company committed to developing award-winning technology and services.
Named one of the top three fastest-growing banking brands in the U.S. in 2020.
Offers a full suite of products including mortgage lending, personal lending, and a variety of deposit and other banking products (savings, money-market, and checking accounts, certificates of deposit (CDs), and individual retirement accounts (IRAs)), self-directed and investment-advisory services, and capital for equity sponsors and middle-market companies.
Where permitted by applicable law, must have received or be willing to receive the COVID-19 vaccine by date of hire to be considered.
WHAT THEY OFFER YOU:
Fast-paced, highly collaborative, teamwork-oriented environment
Make an immediate impact in this high-visibility role
Base salary of $145k with bonus potential and excellent benefits package
Top-notch leadership committed to developing people
Charlotte, NC - 100% remote for now, then hybrid (3 days in office, 2 days remote) when staff transition back into the office after February.
WHAT YOU WILL DO:
Collaborate with Cloud Engineering, Agile squads/developers, and sustain and business partners, contributing significantly to specifications that resolve problems and address enhancement needs, with a focus on logging, monitoring, and metrics for operational readiness.
Use technical knowledge, creativity, and company practices to drive down the occurrence of incidents through the development of proactive alerting and monitoring.
Provide continuous feedback to development teams on system stability, defect analysis and system enhancements.
Develop runbooks and patterns for AWS/DevOps operations
Serve as a mentor to lower-level developers and IT operations teams
Participate in technical discussions with the development team for deployment and code reviews
Drive knowledge transition from development to the sustain team for each functional deployment
Work with IT business and development partners to gather inputs for developing new capabilities for displaying, monitoring, and alerting on key performance indicators (KPIs) by tracking business transactions (BT) in real time
Partner with application owners to develop creative and effective solutions to mitigate risk and successfully remediate any audit issues
Lead RCA and SWAT investigations for the IT Operations team
Plan for validation and verification of changes deployed by infrastructure teams, development teams, and the sustain team
Facilitate day-to-day execution of real-time L2 technical support and troubleshooting
Attend CAB Meetings and approve changes
Support business continuity and disaster recovery activities
Lead maintenance of master documents (e.g., runbooks and playbooks) and help maintain accurate application documentation
Provide guidance in resolving performance-related issues and designing solutions for any technical issues faced by the application
Review and accept the technical documentation
HOW YOU ARE QUALIFIED:
BS (preferably MS) in Computer Science or a related field preferred
5+ years of experience in a similar sustain role and extensive knowledge of associated processes
Shows deep knowledge and understanding of enterprise-scale platforms and architectures
Possesses strong analytical and problem-solving skills and exhibits strong leadership skills
Experience coordinating between upstream applications to resolve incidents
Grasps new technologies and can adapt to rapid shifts in priorities
Experience with implementing sustainable, audit-ready processes to support IT controls such as executing deployment, access management, audits, incident management, change management, etc.
Hands-on AWS/Cloud experience
Hands-on experience with DevOps tools such as Python, Terraform, Jenkins, GitLab, Ansible, and Docker
Experience with Splunk, AppDynamics or other similar monitoring tools preferred
Ability to correlate environment conditions and metrics to application events
Experience debugging problems in a distributed system