Entries in Quality Control (33)

Thursday
Sep 18, 2014

Learning from when things break: Seattle Tunnel-Boring Machine Repairs

Have you ever noticed how some of the so-called experts rarely talk about the things that went wrong on their projects?  They make it seem like their execution is perfect and anything that goes wrong is an easy fix, given how smart they are.  I don’t know about you, but I know of some really bad data centers out there that have been the vision of some experts.  :-)  In general, their way of dodging accountability is to say the operations crew is to blame.

The real experts know mistakes are made and that they need to learn from them.  Seattle has the largest tunnel-boring machine in the world, and it broke.  The media went wild pointing the finger of blame at politicians, as if politicians know how to design and operate a tunnel-boring project.  A politician is going to say whatever they think will benefit their goals.  The so-called experts make the same mistake, thinking they can say whatever will benefit their goals.

Well, when you dig a big tunnel, things go wrong.  In the case of the Seattle tunnel boring, things went really wrong, requiring repair work that cost more than the boring machine itself.  Popular Mechanics has a post on the repair project, and the author pushes back on the media.

What do you do when the world's largest tunneling machine is, essentially, stuck in the mud? Bertha is 60 feet under the earth, and you're on the surface watching a squirmy public swap rumors of cost and delay on the $1.35 billion tunnel component of an even larger transportation project, and the naysayers are howling: Just you watch, Bertha will be abandoned like an overheated mole, boondoggle to end all boondoggles. Because, don't forget, when you're boring the world's largest tunnel, everything is bigger—not just the machine and the hole and the outsize hopes but the worries too. The cynicism. 

What do you do? 

Here's what you do: You try to tune out the media. You shrug off the peanut gallery's spitballs. You put off the finger-pointing and the lawsuits for now; that's what the lawyers are paid for afterward. You do the only thing you can do. You put your head down and you think big, one more time. You figure out how to reach Bertha and get her moving again.


The post tells the engineering story of trying to repair the tunnel-boring machine.  

The YouTube video embedded in the article is available here.

Sunday
Jul 27, 2014

Airbus Lessons for Debugging the A350 Could Apply to Data Centers

Businessweek has an article on how Airbus is debugging the development of its latest aircraft, the A350.

Reading the article, I found some good tips/lessons that can apply to data centers.

The article uses the term debugging, which also equates to reducing risk.

The company has put unprecedented resources into debugging the A350—“de-risking,” as it’s called.

The big risk is not the safety risk, but the cost of the plane.

The engineering risk with the A350 isn’t that it will have chronic, life-threatening safety problems; it’s cost.

When you get into the details, the discussion can sound like a data center issue.

The challenge, Cousin says, is that “in a complex system there are many, many more failure modes.” A warning light in the cockpit could alert a pilot to trouble in the engine, for instance, but the warning system could also suffer a malfunction itself and give a false alarm that could prompt an expensive diversion or delay. Any downtime for unscheduled maintenance cuts into whatever savings a plane might offer in terms of fuel efficiency or extra seating capacity. For the A350 to be economically viable, says Brégier, “the airlines need an operational reliability above 99 percent.” That means that no more than one flight out of every 100 is delayed by more than 15 minutes because of technical reasons.
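
To put that reliability target in familiar terms, here is a small sketch of the arithmetic (my own illustration, not from the article; the flight counts are made up):

```python
def max_technical_delays(flights: int, reliability_target: float = 0.99) -> int:
    """Flights that can be delayed >15 minutes for technical reasons while
    still meeting the operational-reliability target quoted above."""
    return int(flights * (1 - reliability_target))

# No more than 1 flight in 100, or 100 flights across a 10,000-flight schedule.
print(max_technical_delays(100))      # 1
print(max_technical_delays(10_000))   # 100
```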

Airbus realized that the past method of slowly working out the issues was costly.

Instead of a cautious, incremental upgrade, Airbus went for an entire family of superefficient aircraft ranging from 276 to 369 seats, with a projected development cost of more than $10 billion. The goal was what Airbus internally calls “early maturity”—getting the program as quickly as possible to the kind of bugs-worked-out status that passenger jets typically achieve after years of service.

Many companies make it seem like the data center comes entirely from them, but in reality almost everyone is an integrator, like Boeing and Airbus.

Much of the early work was done not by Airbus but by its suppliers. While the company might look to the outside world like an aircraft manufacturer, it’s more of an integrator: It creates the overall plan of the plane, then outsources the design and manufacture of the parts, which are then fitted together. “We have 7,000 engineers working on the A350,” says Brégier, “and at least half of them are not Airbus employees.”

And a smart move is to change the way you work with suppliers so they become partners.

Throughout the development process, teams of engineers were brought in from suppliers to collaborate with Airbus counterparts in Toulouse in joint working groups called “plateaux.” “You need to have as much transparency with your suppliers as possible,” says Brégier. “With such a program you have plenty of problems every day, so it’s bloody difficult.”

And just as operations is critical to a data center, airplane operations is the reality that needs to be addressed.

The idea is not just to put the systems through every combination of settings, but to see how the whole aircraft responds when individual parts are broken, overexerted, or misused. That, after all, is how the real world works. “Every plane in the air has something wrong with it,” Cousin says.

Try to name the companies that think about their data centers in the above way.  The list is pretty short.

Wednesday
Jul 23, 2014

15 years ago Google placed its largest server order and did something big: starting site reliability engineering

A post from Google recounts placing the largest server order in the company’s history 15 years ago.

 

15 years ago we placed the largest server order in our history: 1680 servers, packed into the now infamous "corkboard" racks that packed four small motherboards onto a single tray. (You can see some preserved racks at Google in Building 43, at the Computer History Museum in Mountain View, and at the National Museum of American History in DC, http://americanhistory.si.edu/press/fact-sheets/google-corkboard-server-1999.)

At the time of the order, we had a grand total of 112 servers so 1680 was a huge step.  But by the summer, these racks were running search for millions of users.  In retrospect the design of the racks wasn't optimized for reliability and serviceability, but given that we only had two weeks to design them, and not much money to spend, things worked out fine.

I read this thinking about how impactful this large server order was.  I couldn’t figure out what I would post on why the order was significant.

Then I ran into this post on Site Reliability Engineering, dated Apr 28, 2014, and realized there was a huge impact from Google starting the idea of a site reliability engineering team.


Here is one of the insights shared.


The solution that we have in SRE -- and it's worked extremely well -- is an error budget.  An error budget stems from this basic observation: 100% is the wrong reliability target for basically everything.  Perhaps a pacemaker is a good exception!  But, in general, for any software service or system you can think of, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and, let's say, 99.999% available.  Because typically there are so many other things that sit in between the user and the software service that you're running that the marginal difference is lost in the noise of everything else that can go wrong.
If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system?  I propose that's a product question. It's not a technical question at all.  It's a question of what will the users be happy with, given how much they're paying, whether it's direct or indirect, and what their alternatives are.
The business or the product must establish what the availability target is for the system. Once you've done that, one minus the availability target is what we call the error budget; if it's 99.99% available, that means that it's 0.01% unavailable.  Now we are allowed to have .01% unavailability and this is a budget.  We can spend it on anything we want, as long as we don't overspend it.  
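
To make the error budget arithmetic concrete, here is a minimal sketch (my own illustration, not from the SRE post) that turns an availability target into an error budget and a rough downtime allowance over a 30-day month:

```python
def error_budget(availability_target: float) -> float:
    """Error budget = 1 - availability target."""
    return 1 - availability_target

def monthly_downtime_minutes(availability_target: float, days: int = 30) -> float:
    """Unavailability the budget allows over the period, in minutes."""
    return error_budget(availability_target) * days * 24 * 60

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} available -> budget {error_budget(target):.3%}, "
          f"~{monthly_downtime_minutes(target):.1f} min/month")
```

A 99.99% target, for example, leaves a 0.01% budget, or roughly 4.3 minutes of allowed unavailability in a 30-day month.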

Here is another rule that is good to think about when running operations.

One of the things we measure in the quarterly service reviews (discussed earlier), is what the environment of the SREs is like. Regardless of what they say, how happy they are, whether they like their development counterparts and so on, the key thing is to actually measure where their time is going. This is important for two reasons. One, because you want to detect as soon as possible when teams have gotten to the point where they're spending most of their time on operations work. You have to stop it at that point and correct it, because every Google service is growing, and, typically, they are all growing faster than the head count is growing. So anything that scales headcount linearly with the size of the service will fail. If you're spending most of your time on operations, that situation does not self-correct! You eventually get the crisis where you're now spending all of your time on operations and it's still not enough, and then the service either goes down or has another major problem.
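
The "does not self-correct" point is easy to see with a toy model (my own, with made-up numbers) of a service that grows faster than its headcount while operations work scales with the size of the service:

```python
# Toy model: ops toil scales with service size; the service grows 30% per
# quarter while headcount grows only 10%. The share of engineer time going
# to operations rises every quarter and eventually exceeds what the team
# can supply.
service_size = 100.0          # arbitrary units of traffic/footprint
headcount = 10.0
OPS_HOURS_PER_UNIT = 2.0      # assumed weekly operational work per unit of service
HOURS_PER_ENGINEER = 40.0     # weekly hours available per engineer

for quarter in range(1, 9):
    service_size *= 1.30
    headcount *= 1.10
    ops_share = (service_size * OPS_HOURS_PER_UNIT) / (headcount * HOURS_PER_ENGINEER)
    print(f"Q{quarter}: {ops_share:.0%} of engineer time spent on operations")
```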

Tuesday
Jul 22, 2014

What do you mean there are bogus repairs? Hell yeah

WSJ had an article on bogus repairs on railcars at a port complex.

TERMINAL ISLAND, Calif.—Ten thousand railcars a month roll into this sprawling port complex in Los Angeles County. While here, most are inspected by a subsidiary of Caterpillar Inc.

When problems are found, the company repairs the railcars and charges the owner. Inspection workers, to hear some tell it, face pressure to produce billable repair work.

Some workers have resorted to smashing brake parts with hammers, gouging wheels with chisels or using chains to yank handles loose, according to current and former employees.

In a practice called "green repairs," they added, workers at times have replaced parts that weren't broken and hid the old parts in their cars out of sight of auditors. One employee said he and others sometimes threw parts into the ocean.

It is a bit ironic that the term “green repairs” is used to describe the practice.  What could be less green (environmentally) than damaging a part to create a repair transaction?

Even so, they said, car men are under pressure to identify repair work to be done. The quickest way to do so, they said, was to smash something or to remove a bolt or other part and report it as missing.

They weren't instructed to do that, the workers said. But they added that some managers made clear the workers would be replaced if they didn't produce enough repair revenue.

"A lot of guys are in fear of losing their jobs because there's no work in California," said one worker, standing in front of his small ranch house a few miles from the Terminal Island ports.

Car men are expected to justify their hourly pay "and then some," this worker said. "If you find no defects, it's a bad night," he added, and that creates a temptation to "break something that's not broken."

This is a consequence of having performance-based systems that are short-sighted.

 

Tuesday
Jun 10, 2014

If 95% of your medical records have errors, how many other parts of your system have errors?

Part of the beauty of all that data out there is that most of it you never use, and almost no one worried about the quality of the data when it was entered.  Now that Big Data is hot and machine learning is too, your data history is ready to be used.  But what about those errors?  What errors?  WSJ writes on medical health care and makes the point that up to 95% of records have errors, and doctors are asking patients to review their medical records.

Health-care providers are giving patients more access to their medical records so they can help spot and correct errors and omissions.

Studies show errors can occur on as many as 95% of the medication lists found in patient medical records.

Errors include outdated data and omissions that many patients could readily identify, including prescription drugs that are no longer taken and incorrect data about frequency or dosage.

Anyone who has worked on asset management, e-waste, or end-of-life of hardware discovers how inaccurate inventory management can be.

If you don’t think about the quality of your data, then you’ll have a much harder time using your data history.