Dallas Network Outage

posted by Robert
Sep 5

At 1:38am I started receiving pages from binarycanary about servers in Dallas being unreachable. Two minutes later one server reportedly passed its check, but it was down again a minute later. The support.virtbiz.com site is down as well, which leads me to believe it is a repeat of the incident on the 28th of August: all the servers were functioning fine, but were not reachable because of a router issue. I’ve emailed Chris with Virtbiz (though I’m not sure it can get through right now) as well as called and left a voicemail. I’m sure they have their own monitoring set up and have been alerted. I will update this as I learn anything.

Edit: support.virtbiz.com came up for a minute or so… Tried to put in a ticket, not sure if it made it or not.

3:30am edit: Was finally able to put in a ticket. The last one did not take; the network crapped out before the submit went through. But it’s there now: Ticket ID: RXU-696907. I’ve also set up a binarycanary monitor for support.virtbiz.com to track it (which is how I could tell it was back up to hit submit again).
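For anyone without a monitoring service handy, the same "tell me when the site is back" check can be roughed out in a few lines of Python. This is only a sketch and not how binarycanary works: the function names, the 30-second interval and the "any status below 500 counts as up" rule are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def site_is_up(url, timeout=5):
    """Return True if the URL answers with an HTTP status below 500.

    Assumption: any non-5xx answer means the site is reachable again.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, just with an error status.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, no route, etc.
        return False


def wait_until_up(check, attempts=60, interval=30, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the attempt number that succeeded, or None if every
    attempt failed. The sleep function is injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(interval)
    return None
```

It could be called as `wait_until_up(lambda: site_is_up("http://support.virtbiz.com/"))`, where the `http://` scheme is a guess. Keeping the check separate from the retry loop means the loop can be tried out without touching the network.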

3:45 : Virtbiz claimed they notified customers of…

Engineers will be performing emergency network maintenance tonight, September 4 2010.

VIRTBIZ customers receiving service at the DAL-1 facility (2805 Canton St) in Dallas TX may be affected by this maintenance. The purpose of this work is to perform important software upgrades to distribution switches to help prevent flood attacks and unplanned outages.

EMERGENCY MAINTENANCE WINDOW: September 4, 2010, 8:00PM – 11:59PM CDT

Customers in the “Bakers Rack” area (ie: non-rack-mountable equipment) will experience an interruption in service that may last up to 30 minutes. Customers in rows 11-16 may experience brief interruptions in service lasting no longer than 30 seconds.

Only problem, couldn’t get to a website to know anything about it. No email was sent. And 30 minutes… try 2 hours plus! Not a happy camper. I understand the need, but the execution was pretty poor.

Oh wait, we’re in row 11… Should have been 30 SECONDS, not 30 minutes… I think what gets me the most was the “you should have known” attitude that came from the response, with no apology, explanation, or anything else. How are you supposed to know if you don’t get notice, can’t get to the support site, and the phone’s network status message was “all networks operational”… Uh, NO they weren’t.

Robert

Robert Porter is the founder and managing member of Lagniappe Internet LLC. Robert holds multiple certifications including Oracle Certified Professional-Java 6 SE (OCP-J6), MCSE, A+, Net+, Project+, Security+, and multiple CIW certifications. He has been in the hosting industry for more than a decade and founded Lagniappe Internet L.L.C. as a privately owned, completely debt-free hosting company based out of New Orleans. Robert's background includes 25+ years in programming, databases, networking and systems administration.


Network outage @ Dallas

posted by Robert
Aug 28

This is the latest update (original at the bottom):
Subject: [#MAU-767967]: Network outages – still happening

As a follow-up, I would like to share with you a postmortem of the
routing event you experienced this morning.

At about 6:30AM we received a preliminary alarm of a failure on one of
our customer gateway routers. The event was logged and the system was
enabled for failover to the redundant processing card on that gateway
router. A short time later, the routing fabric of the gateway failed
and caused a disruption in network service. Our technician brought the
router back up and ran a diagnostic, which ran clear. At that time, the
system was brought back online and routing resumed to your equipment.

After running stably for about 30 minutes, the system and its backup
failed again. Senior technicians implemented a hard-failure plan and
brought up a “warm spare” gateway router and began loading the
configuration. Network service was restored individually to each
customer affected by the outage as the configuration was checked-in and
loaded.

I should note that while you are one of several customers that was
affected by this incident, this was not a full-scale routing outage.
Our network architecture makes extensive use of sandboxing in order to
not put all eggs in one basket. Nevertheless, I understand that while
not everybody was affected, YOU were affected, and I do apologize for
the incident.

At this time, our warm spare has been placed into production and
functioning normally. We have brought in a cold spare and activated it
into standby so that redundancy is still in place. We anticipate that
we will replace the affected gateway router with new hardware. When
that happens, we will migrate your routing onto the new system, place
the warm spare back into standby and pull power from the cold spare.
All this will be seamless and will go unnoticed from a connectivity
standpoint.

Please be assured that your VIRTBIZ team will continue to review this
incident so that we can further improve our service to you.

I hope that you have a pleasant remainder of your weekend.

Original posts and updates:

We are experiencing a network issue between cogent and virtbiz this morning. The first outage started shortly after 6am and lasted approx. 2 minutes. By the time I started to look into it and contact the DC, it was back up.

It is occurring again now. This time I was already on the servers looking into a spam report from a VPS account. Checks showed the link between cogent and virtbiz to be down. It came back up to the point I could send virtbiz a message (sure they already knew, but just in case…), and I am waiting to hear back. It had come back up at 7:50. Actually, the virtbiz support site came up about 10 minutes before that.

At 8:01, it’s out again… Still waiting on an answer from vb…

As you can see by this, it’s having a problem finding a route.

[root@gt24-1 ~]# traceroute support.virtbiz.com
traceroute to support.virtbiz.com (208.77.216.244), 30 hops max, 40 byte packets
1 208.75.228.193 (208.75.228.193) 0.482 ms 0.809 ms 0.952 ms
2 tulip-core-2-ge3-8.tshost.com (208.75.224.5) 0.326 ms 0.318 ms 0.335 ms
3 core-1-gi7-2.tshost.com (208.75.224.13) 0.314 ms 0.344 ms 0.380 ms
4 te8-3.mpd01.atl01.atlas.cogentco.com (38.104.182.45) 0.310 ms 0.274 ms 0.287 ms
5 te0-0-0-6.mpd21.iah01.atlas.cogentco.com (154.54.28.254) 14.704 ms 14.816 ms te0-2-0-1.mpd21.iah01.atlas.cogentco.com (154.54.2.146) 14.697 ms
6 te2-1.mpd01.dfw01.atlas.cogentco.com (154.54.5.133) 20.387 ms 20.471 ms te3-4.mpd01.dfw01.atlas.cogentco.com (154.54.25.93) 20.560 ms
7 vl3834.na01.b000868-0.dfw01.atlas.cogentco.com (38.112.35.54) 21.105 ms 21.519 ms vl3534.na01.b000868-0.dfw01.atlas.cogentco.com (66.250.13.178) 20.402 ms
8 38.107.227.210 (38.107.227.210) 20.385 ms 20.646 ms 20.818 ms
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 ge8-0.brdr2.dal1.virtbiz.com (64.125.196.45) 25.759 ms 25.751 ms 25.755 ms
16 * * *
17 ge8-0.brdr2.dal1.virtbiz.com (64.125.196.45) 25.950 ms 25.892 ms 25.902 ms
18 * * *
19 ge8-0.brdr2.dal1.virtbiz.com (64.125.196.45) 26.025 ms 25.996 ms 26.005 ms
20 * * *
21 ge8-0.brdr2.dal1.virtbiz.com (64.125.196.45) 26.172 ms 26.126 ms 26.668 ms
22 * * *
23 ge8-0.brdr2.dal1.virtbiz.com (64.125.196.45) 26.741 ms 26.330 ms 26.374 ms
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
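When a paste like this gets long, a short script can pull out the two things worth noticing: which hops never answered, and which addresses answered at more than one hop (here 64.125.196.45 at hops 15, 17, 19, 21 and 23). The sketch below assumes the classic Linux traceroute text layout shown above; the function name and the returned dictionary are made up for illustration.

```python
import re
from collections import Counter

# A hop line starts with the hop number; the header and shell prompt do not.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
# Responding addresses are printed in parentheses after the host name.
IP_IN_PARENS = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def summarize_traceroute(text):
    """Return silent hop numbers and addresses seen at more than one hop."""
    hops = []
    for line in text.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue
        hop_no = int(match.group(1))
        ips = IP_IN_PARENS.findall(match.group(2))
        hops.append((hop_no, ips))

    # Hops that printed only "* * *" have no address at all.
    silent = [hop_no for hop_no, ips in hops if not ips]
    # Count each address once per hop, then keep those seen at several hops.
    counts = Counter(ip for _, ips in hops for ip in set(ips))
    repeated = {ip: n for ip, n in counts.items() if n > 1}
    return {"silent_hops": silent, "repeated": repeated}
```

The same address answering at several different hops, with silence in between, is consistent with packets not finding a stable route past that router, though the script only reports the pattern and does not diagnose the cause.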

As of 9am, we are back up at the moment… Just received this from VB:

We have been having an issue with one of our routers. Our technicians are correcting the issue and services should be fully restored shortly.

I’ll close this ticket now. If we can be of further service, just respond to this message.

Thank you
Jack B. – VIRTBIZ Internet Support

Robert



Aug 17

It’s upgrade time again… Time to retire one of the older servers. The new server is another Tyan GT24. We’ve been really pleased with them. This one will have 2 Quad Core AMD Opteron processors, 16GB of RAM, and 4x drives in RAID-10. Can’t say what size the drives will be yet. At least 500GB each, likely to be 750GB or 1TB. And yes, they will be on a PERC hardware RAID controller.
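As a rough guide to what those drive sizes mean in practice: RAID-10 mirrors drives in pairs and stripes across the pairs, so usable space is half the raw total. A minimal sketch of the arithmetic, using nominal drive sizes and ignoring filesystem and controller overhead (the function name is made up for illustration):

```python
def raid10_usable_gb(drive_count, drive_gb):
    """Nominal usable capacity of a RAID-10 array built from 2-way mirrors."""
    if drive_count < 4 or drive_count % 2:
        raise ValueError("RAID-10 needs an even number of drives, at least 4")
    # Every drive has a mirror partner, so only half the drives add capacity.
    return drive_count // 2 * drive_gb


for size_gb in (500, 750, 1000):
    print(f"4 x {size_gb}GB -> {raid10_usable_gb(4, size_gb)}GB usable")
```

So four drives would give a nominal 1000GB, 1500GB or 2000GB of usable space, depending on which size ends up in the server.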

The plan right now is to replace the IBM X335 server. We expect this will be sometime in September – so we don’t rush things. We’d rather see it done properly than quickly. Since all public facing servers are virtualized, downtime should be able to be measured in seconds while the VM migrates.

Robert




We’ve asked Virtbiz to look into the issue and are waiting on word back.

Early this morning, GT24-2 experienced an issue. We remotely reset the machine and brought it back up. About an hour later, it happened a second time. Then at about 1pm we lost connectivity to all equipment there. Servers are coming back online and it appears they had lost power.

Edit -
Virtbiz says they had an issue with the power to the rack and had to shut down the power temporarily.

Robert



Mar 8

We have migrated the cpwhm1 (reseller cpanel) virtual server to a faster node… The new node will have access to faster drives, more memory and additional cpu cores.

Robert



gt24-2 4am issue

posted by Robert
Mar 2

The gt24-2 node had an issue starting at approx. 4:20am Central time. The load started spiking, and the node and all VMs slowed down to the point that they became unresponsive. We’ve run updates on the OS and are still investigating what happened.

As /var/log/messages shows nothing out of the ordinary up until it stopped, we’ll also stay logged into the node to see if we can catch what is occurring in real time. It also means the delay until notification will be gone. We’ll also be sending out a replacement server to the DC in case we need to migrate the VMs off of it. If it does happen again we’ll migrate a couple of VMs off the server as well, but we’re trying not to, as the server they would be migrated to has less RAM, CPU and disk to begin with. If we do, it’s a stopgap measure while the replacement is set up.

Robert




Lagniappe Internet’s founder Robert Porter passed the Certified Internet Web Professional (CIW) Database Design Specialist exam today.
CIW describes this exam…

CIW v5 Database Design Specialists have mastered the knowledge and theory of database design that applies to the most popular database platforms. These professionals help solve the problem of poorly designed databases. Aimed at database programmers and administrators alike, this vendor-neutral certification focuses on universal database design principles and SQL. The CIW v5 Database Design Specialist exam validates foundational knowledge of databases in general, such as Oracle, IBM, DB2, MySQL and others.

CIW v5 Database Design Specialist certification is valuable for individuals working in fields such as IT, database development, application development and other areas that depend on Web-enabled systems for productivity. To become a CIW Database Design Specialist, the candidate must pass one required CIW exam AND complete the CIW Certification Agreement by logging in to the CIW Candidate Information Center.

Together with the previously passed CIW Associate requirements, this gives Robert the “CIW Professional” certification as well. It joins a long list of certifications gained over the years including A+, Network+, Security+, MCSE+Inet, and CCNA.

Robert



Feb 18

We’ll be experiencing some short downtimes while migrating some virtual machines, the largest of which is our own lagniappeinternet.com website. It was down for 9 minutes while migrating. Most others should see no more than a few seconds of downtime. The VMs that are being migrated will be housed on faster nodes… so this is a good move that has been planned for a while.

Robert



GT20-1 issues

posted by Robert
Dec 28

Since the Christmas Eve event where all servers, PDUs, switches, etc. briefly lost power in Dallas, the GT20-1 server has not been behaving well. Virtbiz in Dallas worked hard to get the server back up and going, but it’s simply not reliable. Any change to CMOS settings and it doesn’t want to boot… So we don’t want to put it back into production. They’re going to ship the server back to us, and we’re going to migrate off of that chassis to one of the spares we have here. We’ll likely do a significant upgrade to the hardware at the same time.

For what it’s worth… Our decision to virtualize all customer-facing servers, back them up, and copy the backups to other servers worked. We’re looking at refining that backup procedure to decrease the downtime in the future. We apologize for the downtime and will be contacting affected customers shortly.

Robert



Dallas nodes … power

posted by Robert
Dec 24

All nodes, power distribution system, etc. lost power for a moment. They are restarting now, and should be available shortly.

Update: gt24-2, x335, x306 are back up, and all virtual servers on them are running.
Sent support page to the d.c. for gt20-1

Update 2: Virtbiz’s support and company websites are down. The gt20-1 node is taking a long time to come back up.

Update 3: I’ve paged a virtbiz support tech, but have not gotten any responses. Their support site is down. This affects the DirectAdmin virtual server plus a couple of others. We apologize for the downtime, and will post Virtbiz’s response when we get one.

Update 4: Since we can’t seem to get in touch with Virtbiz in any way, shape or form (we will be getting a LONG-TERM resolution to that problem!), we’re restoring last night’s backups from gt20-1 to alternate nodes. It will take approx. 30 minutes to have the virtual servers done. We apologize, and will be contacting the affected customers in the coming days. One quick note: DirectAdmin customers weren’t affected by the gt20-1 extended outage. That virtual server had been migrated to gt24-2 previously for load balancing.

Robert
