Job Description
Title: NOC Engineer
Location: Remote
Key Responsibilities:
- Monitor public cloud infrastructure (compute, storage, networking, and Kubernetes clusters) using observability tools such as Prometheus, Grafana, and internal dashboards.
- Identify, triage, and respond to real-time alerts and incidents to prevent or minimize customer impact.
- Perform first-level troubleshooting of system issues, including host failures, degraded services, and latency incidents.
- Escalate critical issues to CloudOps Engineering, Network Infrastructure, or Security teams following predefined runbooks and escalation paths.
- Maintain clear documentation of incidents, resolutions, and system changes in the ticketing and incident management systems (e.g., Jira, PagerDuty, or internal tooling).
- Write and update operational playbooks to standardize response procedures for cloud infrastructure issues.
- Collaborate in post-incident reviews with the Network Infrastructure and CloudOps teams to identify root causes and help implement long-term fixes.
Qualifications:
- 2+ years of experience in a NOC, cloud operations, or system monitoring role, preferably in a public cloud or SaaS environment.
- Strong understanding of Linux systems, networking concepts (TCP/IP, DNS, VPN, BGP), and system administration basics.
- Experience working with Juniper and Arista network equipment, including basic configuration and troubleshooting.
- Familiarity with container orchestration and cloud-native tools (e.g., Kubernetes, Docker) is a plus.
- Excellent troubleshooting skills and ability to work calmly in high-pressure, time-sensitive situations.
- Strong communication skills, including the ability to write clear incident reports and cloud operations playbooks.
- Experience with public cloud services such as Droplets, VPCs, Load Balancers, and Spaces is highly preferred.
Preferred Qualifications:
- Certifications in Juniper (e.g., JNCIA, JNCIS) or Cisco (e.g., CCNA) technologies.
- Familiarity with Infrastructure-as-Code tools (e.g., Terraform) and CI/CD pipelines.
- Prior experience in high-availability cloud environments and large-scale incident management.
Job Tags
Remote work, Night shift