Senior Service Engineer - CTJ - Poly
Redmond, WA 
Posted 2 days ago
Job Description
Overview
Microsoft has an exciting opportunity for a Senior Service Engineer on the Cloud + AI Silver Infrastructure and Operations (I&O) Team. This team is responsible for deploying and operating services within an air-gapped environment, including the infrastructure for collaboration. The I&O team manages the infrastructure and day-to-day operations that enable Azure engineers to work, collaborate, and deliver mission success in highly regulated environments. As a Senior Service Engineer, you will work with engineers who enable a broad set of Azure services to be consumed by internal and external customers in highly secured and regulated industries. The services you provide and influence, and the decisions you make, must meet the security policy and assurance requirements of both public- and private-sector customers.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
The scale of our operations is enormous. Microsoft's products and services are overwhelmingly consumed online, and billions of people use them every day. We need people who enjoy analyzing complicated problems, coming up with creative solutions, and working in focused teams to build things no one has thought of before, all in the service of production reliability.
Technical Knowledge and Expertise:
Develops end-to-end expertise in service and/or system design, interactions between technology layers and components, functions of infrastructure, and dependencies at scale.
Takes ownership of service design by driving efforts within an organization to identify, define, recommend, and build optimal configurations of technology solutions with consideration for cost management.
Independently adjusts configurations and defines infrastructure to improve the availability, reliability, efficiency, observability, and/or performance of supported products and services.
Drives reviews with the engineering teams that develop and/or manage services, identifying opportunities for operational efficiencies and sharing learnings and recommendations across engineering teams working on related services within the organization.
Stays current as the technology landscape evolves, maintaining awareness of industry norms. Uses this knowledge to drive the adoption of new solutions across engineering teams working on related products within the organization. Provides guidance to others through knowledge sharing, coaching, conferences, and other means to drive improvements across teams.
Reviews and provides technical guidance, change advisory board authority, and direction on methods of procedure, drawings, and technical documents for electrical, mechanical, and other critical facility maintenance.
Maintains subject matter expertise in Azure critical facilities dependencies and resiliency as a foundation for decision making in planned and unplanned scenarios.
Operational Excellence:
Maximizes uptime and operational excellence, and minimizes disruption and downtime of Azure critical facilities, through proper management, planning, coordination, and assessment of the risk and impact of preventive and corrective maintenance. Leads space improvement projects. Creates standards, and reviews and approves technical physical engineering procedures. Plans, organizes, and executes work with stakeholders and partners.
Maintains operations of the live service as issues arise on an on-call basis. Implements solutions and mitigations for more complex issues affecting the performance or functionality of the Live Site service, and escalates as necessary. Writes and reviews postmortems for issues and shares insights with the team. Independently implements reliable, scalable, high-performance solutions across teams. Contributes to design documents. Owns implementation and rollback plans. Maintains the quality checklist and related documentation.
Creates, monitors, and takes action on telemetry data, and influences telemetry analytics to better identify patterns that reveal errors and unexpected problems affecting the system's availability, reliability, performance, and/or efficiency. Develops solutions and/or automation, and applies an understanding of those solutions to define, develop, measure, track, change, and improve the quality of telemetry pipelines that support automated monitoring and incident response.
Responds to incidents while on call, including complex issues with major customer or business impact, by identifying the level of impact, troubleshooting, contributing to difficult decisions based on business impact, deploying appropriate fixes to resolve root causes, and implementing automation to prevent recurring issues. Coordinates the resources required for incident resolution, which may include product teams, owners, leadership, other engineering teams, and/or subject matter experts.
Escalates resolution of highly complex, ambiguous, and impactful issues as needed. Contributes to postmortems and shares details about incidents and their resolution through postmortem reports and regular review meetings. Provides incident response assistance to other personnel as needed, and develops incident response and resolution guidance.
Adheres to prescriptive guidance for security, privacy, and compliance standards in alignment with direction from the business and technical experts. Works with security, privacy, and compliance teams to identify and address issues relevant to their services. Identifies patterns of violations and implements automation to prevent them. Provides assistance to other personnel as needed.
Collaboration and Knowledge Sharing:
Collaborates within and across teams by proactively and systematically sharing information with an appropriate level of detail for the audience. Overcomes obstacles by resolving conflicts and issues across interdependent teams, and engages with partners and stakeholders so that issues are resolved and mutual objectives are met.
Shares insights and best practices that can be applied to improve development and operations across related systems, platforms, and/or products. Continues to develop an understanding of insights and best practices through interactions with members of product engineering teams and other resources (e.g., conferences, brown bags, wikis, documentation). Mentors and coaches other engineers to help them identify and propose relevant solutions.
Specialty Responsibilities:
Leverages advanced technical expertise, judgment, and decision making to coordinate multiple work streams and resources in crisis situations, driving the mitigation plan and resolving the crisis by engaging the necessary teams and escalating to appropriate stakeholders. Applies diagnostic expertise. Provides guidance to other engineers working to mitigate and resolve issues.
Communicates customer impact and other relevant information to key stakeholders, leadership, and customers. Develops and drives projects and programs to improve crisis response by creating standard practices for consistent response across engineering teams. Fosters increased stability. Reduces noise by tuning telemetry and alerting. Influences key stakeholders to adopt new standards and practices that broadly improve crisis and problem management.
Monitors and maintains security by addressing vulnerabilities through patches, reconfigurations, and/or settings updates. Identifies, prioritizes, and targets solutions to complex security issues that may affect customers and partners, and drives adoption of relevant mitigations. Drives the program and process of mitigation (e.g., automation), troubleshoots system issues, and partners closely with internal customers and engineering teams to conduct root cause analyses, share end-to-end expertise in services, and mitigate and resolve issues. Communicates and drives adherence to security policies and procedures.
Defines and develops standardized, repeatable, scalable procedures and solutions to ensure quality.

 

Job Summary
Company
Microsoft
Start Date
As soon as possible
Employment Term and Type
Regular, Full Time
Required Experience
Open