Posted by Pete Sorensen on Thu, Aug 19, 2010 @ 09:18 PM
Virtualization has solved or simplified many IT tasks, or made them safer. However, like all things in the universe, this must be balanced out with new issues and tasks that need to be confronted. In a previous blog post I addressed the challenges of backing up virtual environments. I would now like to discuss managing them, as this has become an important topic for many IT shops.
Currently there is a set of products designed to manage and maintain your infrastructure. Some of these include Microsoft SCOM, HP OpenView, Nagios and Zenoss. I have seen these systems successfully implemented at many of the sites that I have visited. However, they all have one thing in common: their monitoring, alerting and reporting capabilities are focused on a system as an individual asset. That scope makes perfect sense if the server is running a single OS supporting a few applications. It does not work as effectively in a virtualized environment with many operating systems and many applications running on a few hosts.
To help you increase the scope of your monitoring, alerting and reporting in a virtual environment, Veeam has created two products which help bring your whole environment into view.
Monitoring
Veeam Monitor is the tool that actively monitors and alerts on your environment and allows you to view and customize what performance data you see from the VMware environment. Your view can be as high as the whole environment or as low as a single process running in a Windows 2008 guest, and everything in between. Simple things like a list of your highest and lowest utilized ESX hosts, or your top 5 resource-consuming VMs, can give a more complete understanding of your environment.
When analyzing a system, it can be difficult to find the root cause of an issue if your view is limited to the guest VM. For instance, if the ESX host a guest lives on is critically low on memory and swapping, that host's guests will experience performance problems. However, you cannot determine that directly from the guests themselves, as they are unaware of the hypervisor. The Veeam nWorks management pack for your existing Microsoft SCOM or HP OpenView management tools solves this visibility issue. You can now have a complete view of your system, from the virtualization layer down to the guest OS and applications, from one pane of glass.
Reporting
Reporting on the health and status of your virtual infrastructure is important because a virtual infrastructure is fluid: many different workloads operate in the same compute environment. Virtual environments are usually in a state of growth, whether that means adding new guests or adding and upgrading hosts. Veeam Reporter is designed to take the mountain of data that vCenter collects and allow you to compile and view it in a way that makes sense, whether you want a quick list of the VMs on a certain datastore or an inventory of your entire virtual environment.
To help manage the fluid nature of VMware, Veeam Reporter can give you change reports. These reports are composed from the differences in your vCenter database over time. Running a job on a schedule will produce change reports at the granularity of that schedule. A change report shows what has been added, modified and deleted, who did it, and when. A report like that is immensely useful for tracking what is going on in the environment.
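To make the idea of a change report concrete, here is a minimal Python sketch of diffing two inventory snapshots. It is only an illustration of the concept, not how Veeam Reporter works internally; the snapshot layout, the VM names and the change_report function are all hypothetical.

```python
def change_report(before, after):
    """Diff two inventory snapshots, each a dict of VM name -> settings."""
    before_names, after_names = set(before), set(after)
    return {
        "added": sorted(after_names - before_names),
        "deleted": sorted(before_names - after_names),
        "modified": sorted(
            name for name in before_names & after_names
            if before[name] != after[name]
        ),
    }


# Hypothetical snapshots taken by two consecutive scheduled jobs.
monday = {
    "web01": {"datastore": "ds1", "memory_mb": 2048},
    "db01": {"datastore": "ds2", "memory_mb": 8192},
    "old01": {"datastore": "ds1", "memory_mb": 1024},
}
tuesday = {
    "web01": {"datastore": "ds1", "memory_mb": 4096},  # memory was raised
    "db01": {"datastore": "ds2", "memory_mb": 8192},   # unchanged
    "web02": {"datastore": "ds1", "memory_mb": 2048},  # newly deployed
}

report = change_report(monday, tuesday)
for kind, names in report.items():
    print(kind, names)
```

The sketch can only see changes between two snapshots, which is why the job schedule sets the granularity of the report. It does not capture who made a change or when; that information would have to come from the source data itself.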
Both of these products are very feature rich. They can be downloaded for free for 30 days to test in your environment from [http://www.veeam.com]. F3 Technology Partners would be happy to assist with setting up a POC in your environment.
Posted by Steve Putre on Thu, Aug 12, 2010 @ 02:36 PM
Swap space in the Solaris 10 OS is misunderstood or poorly understood by most SAs, myself included. Here is my shot at explaining what I think it means from a systems standpoint (as opposed to a programming or design standpoint).
Q: When I run 'swap -s', what does the output mean?
A:
# swap -s
total: 61541352k bytes allocated + 10156312k reserved = 71697664k used, 5965344k available
'allocated' is the sum total of all the user process address spaces (including shared memory), plus all the data in /tmp.
'reserved' is memory which has been allocated but which is not in use. Solaris attempts to put these pages onto the swap device so as to not occupy physical memory with unused pages.
'used' + 'available' should equal all of the RAM on the host, minus anything used by the kernel.
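As a quick way to check these relationships, the following Python sketch parses the sample 'swap -s' line shown above and does the sums. The regular expression is an assumption based only on that one sample line, and the parse_swap_s helper is hypothetical.

```python
import re

# Pattern written against the single sample line in this post (an assumption).
SWAP_LINE = re.compile(
    r"total:\s+(\d+)k bytes allocated \+ (\d+)k reserved = "
    r"(\d+)k used, (\d+)k available"
)


def parse_swap_s(line):
    """Return the four 'swap -s' figures, in kilobytes, as a dict."""
    match = SWAP_LINE.search(line)
    if match is None:
        raise ValueError("unrecognized swap -s output: %r" % line)
    keys = ("allocated", "reserved", "used", "available")
    return dict(zip(keys, (int(g) for g in match.groups())))


sample = ("total: 61541352k bytes allocated + 10156312k reserved = "
          "71697664k used, 5965344k available")
fields = parse_swap_s(sample)

# 'used' is simply allocated + reserved.
print(fields["allocated"] + fields["reserved"] == fields["used"])

# used + available is the total that the next question compares against RAM.
total_kb = fields["used"] + fields["available"]
print(total_kb, "KB")
```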
Q: Why does 'used' + 'available' not add up to the RAM on my server?
A: Kernel memory pages are not counted in the output of the swap command because they are not directly accessible by processes.
Q: How can I see how much memory the kernel is occupying?
A: Here is one way:
# echo "::memstat" | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    1483266             11588    9%
ZFS File Data             4445637             34731   27%
Anon                      7034973             54960   43%
Exec and libs              249329              1947    2%
Page cache                 787194              6149    5%
Free (cachelist)          2036135             15907   12%
Free (freelist)            416189              3251    3%
Total                    16452723            128536
Physical                 16430557            128363
Q: OK, I see that the kernel is occupying 11GB of memory. So why does the total of used + available only equal about 77GB on a 128GB host? Shouldn't it equal 115GB or 117GB?
A: ZFS has played a sneaky trick on you by using kernel pages for its adaptive replacement cache (ARC). On this system, the ARC has grown large and is occupying 34GB of space. 'swap -s' does not display kernel memory, and therefore your total available memory will shrink as the ARC grows. But the pages in the ARC are readily freed and should not have a great impact on user-level memory allocation unless memory becomes fragmented and user processes are requesting large memory pages.
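The arithmetic behind this answer can be sketched in Python using the figures from the '::memstat' and 'swap -s' outputs above. This is only a rough check of the reasoning: the variable names are mine, the memstat figures are rounded to MB, and the two numbers it prints still differ by several GB, so treat it as showing the direction of the effect rather than an exact accounting.

```python
# Figures copied from the ::memstat output above, in MB.
physical_mb = 128363
kernel_mb = 11588
zfs_file_data_mb = 34731  # the ARC pages discussed in this answer

# used + available from the 'swap -s' output earlier in the post, in KB.
swap_total_kb = 71697664 + 5965344
swap_total_mb = swap_total_kb // 1024

# Memory that 'swap -s' does not show: kernel pages plus the ZFS cache.
hidden_mb = kernel_mb + zfs_file_data_mb
expected_mb = physical_mb - hidden_mb

print("physical RAM:             ", physical_mb, "MB")
print("kernel + ZFS file data:   ", hidden_mb, "MB")
print("RAM minus kernel and ZFS: ", expected_mb, "MB")
print("swap -s used + available: ", swap_total_mb, "MB")
```

Subtracting only the kernel figure would leave well over 110GB, which is the expectation in the question. Subtracting the ZFS file data as well brings the estimate much closer to what 'swap -s' reports.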
Q: What if I am running Solaris Containers?
A: The swap usage of containers can be limited by defining the zone.max-swap resource control (rctl). In a container which has a zone.max-swap rctl defined, 'swap -s' used+available should add up to the value of the rctl. For example, in a container with zone.max-swap = 4GB:
zone# swap -s
total: 3624128k bytes allocated + 0k reserved = 3624128k used, 570176k available
Note that the zone.max-swap rctl also limits the amount of /tmp space which a container can use.
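The container example is easy to verify by hand. Here is the same check as a short Python sketch, using the figures from the zone's 'swap -s' output above and assuming the 4GB cap means 4 x 1024 x 1024 KB.

```python
# Figures from the zone's 'swap -s' output above, in KB.
used_kb = 3624128
available_kb = 570176

# Assumption: a 4GB zone.max-swap cap is 4 * 1024 * 1024 KB.
rctl_kb = 4 * 1024 * 1024

total_kb = used_kb + available_kb
print(total_kb, "KB used + available")
print(rctl_kb, "KB zone.max-swap")
print("match:", total_kb == rctl_kb)
```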
Posted by Steve Putre on Wed, Jul 21, 2010 @ 10:02 AM
A big part of our work here is the provisioning and management of a Solaris Container Farm for a large customer. In the course of migrating an older Solaris 9 server to a Branded container, users began reporting that their Oracle 11g utilities (namely, opatch and dbua) were not working inside the container. When invoking them, they would receive the error message,
"Could not reserve enough space for code cache"
This issue was observed only in branded containers, not in Solaris 10 Native containers.
After performing some research and contacting the vendor for support, it was determined that there is a workaround for the problem: add the option "-XX:-UseLargePages" to the JVM invocation.
The permanent fix should be to add Solaris patchid 143357-03 or greater to the global zone. We have not tested this as yet.
Posted by Will Usher on Mon, Jul 12, 2010 @ 10:49 AM
We recently had an event with our BlackBerry Enterprise Server (BES) where we were down for the better part of a day. It turned out to be a corrupt MAPI profile, which left the BES unable to interact with the Exchange mailbox store. The fix was simple (delete some registry entries and recreate the profile), but a considerable amount of time was spent tracking down a very uncommon and obscure problem.

This got me thinking: what was the cost of this outage to our business? Being a sales organization, this could range from nothing more than an inconvenience up to causing us to lose a major deal or major account. Of course this question is impossible to answer, and very difficult to approximate, but it's not hard to imagine a situation where this could have been a very costly outage.
So why do we have a BES? It's more expensive, it's an additional point of failure, it takes additional time to manage, and we don't use any of the features it provides beyond what ActiveSync does (Remote wipe, Enterprise apps, IT policies, etc). While the BlackBerry was far and away the best device to receive enterprise email on, its competitors have come a long way in the past few years. I have a BlackBerry and I like the device, but I don't think it's a better device than what Apple has with the new iPhone, or some of the new Android phones. So I decided to find out how much more it costs to support a BlackBerry device compared to an ActiveSync compatible phone.
Most of our users are on Verizon, so I used numbers from Verizon for comparison. This is based on the phones having a two year refresh cycle and a new two year contract.
| | Blackberry | ActiveSync Phone |
| Basic Phone | $20 (8830) | $0 (Palm Pixi or Samsung Saga) |
| Plan (450 min, Unlimited Data, Exchange Access) | $85/mo * 24 mo = $2,040 | $70/mo * 24 mo = $1,680 |
| Blackberry Enterprise server support | $500/year = $500/8 users = $62/user/year | $0 |
| Two year reoccurring cost per user | $2,184 | $1,680 |
By my quick calculations that's a $500 per user premium for two years of service, or $250 per year per user. Now that's not much money for a business, but I believe that's only a small part of the total cost. When you factor in how much time our administrator has spent managing the BES server (updates, service packs, version upgrades, troubleshooting) and the additional burden in related tasks (upgrading Exchange), plus the additional downtime we've experienced on the BES, I would imagine the cost per user is significantly higher. This also does not include MS Windows licensing, hardware, power and cooling (our BES lives in a VM, reducing our costs).
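For anyone who wants to plug in their own numbers, the comparison in the table can be reproduced with a few lines of Python. The prices are the ones from the table above; the dictionary layout and the two_year_cost function are just my way of organizing them. Note that this sketch keeps the BES support cost at the unrounded $62.50 per user per year, so its totals come out a dollar higher than the table, which rounds down to $62.

```python
MONTHS = 24  # two-year contract
YEARS = 2

# Prices taken from the table above.
blackberry = {
    "phone": 20,
    "plan_per_month": 85,
    "bes_support_per_user_year": 500 / 8,  # $500/year spread over 8 users
}
activesync = {
    "phone": 0,
    "plan_per_month": 70,
    "bes_support_per_user_year": 0,
}


def two_year_cost(device):
    """Phone price plus plan and per-user support over the contract."""
    return (device["phone"]
            + device["plan_per_month"] * MONTHS
            + device["bes_support_per_user_year"] * YEARS)


premium = two_year_cost(blackberry) - two_year_cost(activesync)
print("BlackBerry:", two_year_cost(blackberry))
print("ActiveSync:", two_year_cost(activesync))
print("Two-year premium per user:", premium)
print("Premium per user per year:", premium / YEARS)
```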
Of course the situation can vary dramatically from our organization to another, especially if the enterprise features of BES are leveraged. I don't think the BlackBerry makes our users any more productive over another Smartphone, and it certainly adds hard costs and soft costs to our IT infrastructure. My recommendation to management will be to begin to phase out the BES as employee contracts expire.
Disagree with me or my numbers? Please post below.
Posted by Will Usher on Tue, Jul 06, 2010 @ 04:00 PM
Buzzwords like SaaS and cloud computing can seem like daunting technologies that only big companies can afford to invest in and migrate to. In my experience, this couldn’t be further from reality.
We were recently approached by a small law firm that had been experiencing problems with email. They ran Microsoft Exchange in house and had problems with their Exchange server crashing, losing their internet connection, and poor power service to their building. As for most modern businesses, email is a business-critical service for them, and the downtime was hurting their ability to communicate with their clients in a timely fashion. They needed help fast, without a large capital investment.
We discussed the options with them and settled on a hosted exchange model. They liked the following advantages of a hosted solution:
- Offsite in a professional datacenter with redundant power, cooling, network and tight security.
- Managed by professional Exchange administrators
- Low capital investment (only the Professional services we charged to migrate them to the service)
- Fixed operating costs: $5 per user, per month. No hardware or software upgrades, ever.
To give you an idea of how quick and easy this process was, we went from first meeting to complete migration in less than a week. Most of the time was spent discussing what solution was best for them, and how we would migrate to the new service. The migration itself took place over one weekend.
The client is very happy with the new solution, and since moving they have not had any downtime. If you’re interested in this for your business, contact us and we can discuss the best options for you.
Posted by Pete Sorensen on Wed, Jun 09, 2010 @ 02:00 PM
Virtualization has made a lot of IT tasks and processes simpler and easier to manage. It has brought the ability to quickly deploy a server or workstation with the click of a mouse, move machines around without causing an outage, or bring High Availability and Fault Tolerance to systems that otherwise would not be able to benefit from those features. However, like every piece of technology that makes our lives easier, it invariably creates new issues that need to be solved.
One of these problems is how to efficiently back up a virtualized environment.
Backing up a virtual environment should NOT be treated like backing up a physical one. Doing so will create performance and management headaches; remember, you are sharing resources. Backing up VMware VMs used to rely on VCB (VMware Consolidated Backup). VCB was a band-aid for the problem, and VMware is getting rid of it, so don’t use it. With the vStorage APIs in vSphere, the backup process is streamlined. The APIs provide well-designed methods for more efficient access to a VM’s data.
Unlike conventional backup systems, there are no agents to install. Veeam utilizes VMware Tools, which you already have on your VMs. Through VMware Tools, Microsoft VSS is supported, so your Active Directory, Exchange, and SQL servers can be quiesced so their databases are consistent on disk. A snapshot is then taken, and the now read-only disk is mounted on the Veeam backup server using the hot disk add feature and backed up to a compressed and deduped file. The file can exist anywhere that VMware ESX or the Veeam backup server can write to (even a USB drive). This gives you great flexibility. From this point, data replication or a conventional tape backup system can be used to send your backup off site.
With Veeam, restoring a single file or a whole VM is simplified. You can restore a VM from a backup to anywhere in your VMware environment accessible to Veeam. This gives you the ability not just to restore a broken VM but also to deploy it as a clone. You can then test patches and software on a VM that is identical to the one in production.
The other feature that is part of the Veeam backup product is replication. The advantage here is that you can replicate at the VM level. Replicating only the systems that are needed for DR can save you bandwidth and complexity in your DR environment. Veeam also allows for near-CDP (continuous data protection). Near-CDP is accomplished through a combination of changed block tracking and FastSCP. Together, these allow you to reach lower recovery point objectives.
Posted by Will Usher on Mon, May 03, 2010 @ 11:31 AM
(Or "How we help our clients pick the best storage for their needs")
This being my first blog post, I thought I'd take a moment to introduce myself. I'm Will Usher and I've been working for F3 Technology Partners for the past two years. Here at F3, my time is split pretty evenly between pre-sales and post-sales engineering work. My two main focuses are VMware Virtualization and Storage. This blog post is going to be about my methodology around helping a client understand their storage options and picking the solution that best fits their needs.
I break down any potential storage solution into two of four categories: frame-based/frame-less and block-based/file-based. Every storage array on the market can be placed into two of these quadrants. So we're all on the same page, let me give a quick background of what each quadrant means.
Frame-based arrays are your traditional SANs. You buy a controller (or two) and however much disk you need, generally added in trays. You can continue to add disk until you have reached the limit of disk that your controller can support.
Frame-less arrays are similar in concept to grid computing systems; every time a node is added, CPU, memory, network and disk are added. You're not locked into the one or two controllers the way you are with frame based computing.
Block Based arrays are traditionally what was considered a SAN. They don't run traditional file systems like Windows' NTFS or Solaris' ZFS. They carve up raw storage into LUNs and serve that out via block based protocols (FC, iSCSI, FCoIP, etc.)
File Based arrays use an internal filesystem, like ZFS, WAFL, or NTFS, and are able to serve files to clients via file based protocols (CIFS/SMB, NFS, HTTP/S, FTP). They have traditionally been called NAS devices, but this is changing as many are now able to serve out block based protocols.
So now we know how to classify storage arrays, but how does this help my clients find the correct storage solution? The answer is that each of these architectures has specific advantages and disadvantages.
Frame based arrays
A frame based array will be easier to design and potentially easier to manage, thus lowering costs. When designing a frame based array you make an assumption that there will be at most two controllers. This makes inter-controller communication easier, and makes adding features easier as well. This can lead to faster development and richer feature sets.
Frame-less arrays
A frame-less array has a few advantages over its frame-based counterparts. They come down to what I call "linear growth". With a frame-less architecture your array is composed of nodes. Generally each node has compute resources (CPU), network resources (FC, Ethernet, etc), cache (DRAM, NVRAM, SSD), and storage (disk). This means that every time you add capacity, you're also adding compute, network, and cache resources, so you maintain a balanced system as you scale the capacity of the array. Contrast this with a frame based array, where you must anticipate your future growth and buy that upfront. If you underestimate your growth, you're going to end up doing a rip-and-replace to move up to the next larger controller. If you overestimate your growth, you've just wasted money on a controller that is too large for your environment. A storage engineer (like me :) ) can help mitigate some of this risk based on our experience with other clients, but unless you have a crystal ball that can tell us how much storage you're going to need in the next three to five years, it's still going to be an approximation.
Block Based arrays
Traditional SANs use a block based design where disks are assigned to RAID groups and carved up into LUNs, which are then presented to servers as block devices. The server formats it with its native filesystem, and proceeds to use it like a local disk. This is easier to design as it has fewer moving parts than a file based array. Block based arrays are generally less expensive in the low and mid range, and scale larger at the high end than file based solutions. People generally associate block based arrays as being faster than file based arrays. This does not necessarily hold true today with high-performance file based products like the NetApp FAS, and the Oracle 7000. Keep reading to find out why.
File Based arrays
Once relegated to mundane file sharing tasks, file based arrays have improved by leaps and bounds over the past five or so years, thanks in no small part to Moore's law. File based arrays require more CPU power to do the same amount of work as their block based counterparts. This was an issue when we had Xeons running at 800 MHz, but now that we have six core Xeons running at 3+ GHz we have more CPU power than we know what to do with. File based arrays can leverage this abundance of CPU power to meet and in some cases exceed the performance of block based arrays. They do this in several ways, including leveraging advanced caching algorithms to prefetch blocks from disk into cache before they're needed by the client.
Of course, having equal performance certainly isn't a good reason to go file based over block based. So why are file based arrays gaining rapidly in popularity? Feature set and TCO. Because a file based array has a local filesystem on the array, it unlocks a huge number of features not possible with traditional block storage, for example deduplication and encryption. One only has to look at the (admittedly dizzying) selection of software features on the NetApp website to understand my point.
The other big feature of file based storage is what's been coined Unified Storage. A unified storage device can serve both block based data (iSCSI, FC) and file level data (NFS, CIFS, HTTP, FTP, etc). This means a single device can take the place of a block based array and consolidate all of your file servers. This can save a lot of time otherwise spent patching and administering Windows file servers. Of course there are disadvantages to file based arrays; they tend to be more expensive and they don't scale much over a petabyte.

When I do this for a client I have the benefit of being much more interactive. I can take their needs and pain points into account and tailor the discussion around them. By the end of the meeting a client has a much better idea of what they want, and then I can offer several different solutions that fit their needs. We can then drill down on those products and the client can weigh the pros and cons of each one. My clients appreciate that we're vendor agnostic and that we can give them solutions and let them pick the best one, rather than trying to force a specific technology on them.
Do you agree with me? Think I've got it all wrong? Please let me know in the comments.
Posted by Steve Putre on Tue, Apr 27, 2010 @ 08:53 AM
In the process of testing LDOMs on a T5220, we came to a point where we decided to place the T5220 under the control of Oracle Enterprise Manager Ops Center. We set up an OS provisioning job which would wipe out any LDOMS and OSes on the host and start fresh.
During the course of the rebuild, Ops Center attempted to reset the hypervisor configuration to its factory default. This operation encountered some kind of problem which placed the T5220 in an unusable state. The host itself was powered off. The ILOM showed the following conditions:
-> show /SYS
Properties:
type = Host System
ipmi_name = /SYS
keyswitch_state = Normal
product_name = SPARC-Enterprise-T5220
product_part_number = 602-3822-09
product_serial_number = xxxxxxx
product_manufacturer = SUN MICROSYSTEMS
fault_state = Faulted
power_state = Off
-> show faulty
Target | Property | Value |
/SP/faultmgmt/0 | fru | /SYS |
/SP/faultmgmt/0/faults/0 | timestamp | Apr 26 xxx |
/SP/faultmgmt/0/faults/0 | sp_detected_fault | Apr 26 xxx ERROR: Unsupported memory configuration |
| | |
| | |
| | |
| | |
Following some procedures in the T5220 documentation, I attempted the following to try and get the host back in a running state:
-> cd /HOST/bootmode
-> set config=factory-default
-> start /SYS
No luck; same problem.
It turns out that the documentation was missing a critical step: after doing the 'set config' above, the next step must be to reset the ILOM itself:
-> reset /SP
Following the ILOM reset, the fault was cleared. Resetting or power-cycling the host itself does not fix the problem.
Posted by Steve Putre on Fri, Apr 16, 2010 @ 10:57 AM
Solaris Zones are an increasingly popular technology for performing server consolidation in large datacenters. Zones are part and parcel of Solaris 10 and will run on any hardware platform (Sparc or X86) where Solaris 10 is supported. On its own, Zones technology addresses the security and isolation aspects of consolidating multiple applications or workloads on a physical host. Combined with other products and technologies, Zones provides the basis for a complete managed solution. Here are some examples of how the zones concept can be expanded and complemented:
Resource Management: Sure, you can cram 20 or more zones onto a single host, but how do you ensure that they all get their fair share of CPU and memory resources? Solaris 10 not only provides free, built-in resource management, but gives you the choice of how to implement it. Want to dedicate CPUs to a workload? Want to limit how much real memory and swap space each zone occupies? Or how about allowing any zone to grab as much CPU and memory as it can but relinquish some of that resource when required by other workloads? All built into Solaris 10.
High Availability: Putting all those eggs into one basket could be risky. Both Solaris Cluster 3.x and Symantec's Veritas Cluster Server (VCS) 5.0 provide full support for monitoring and failover of zones between hosts in a cluster. Solaris Cluster takes the HA concept one step further, by enabling not only the failover of a zone with all of its workloads, but also a 'virtual host' clustering mode, where the applications are monitored and can be moved between two or more zones located on different hosts in the cluster.
Branded Containers: Solaris 8 and Solaris 9 branded containers allow you to retire old hardware by taking a flash image of the physical host and installing it into a container on a Solaris 10 host. The branded container presents its applications with system call interfaces exactly as Solaris 8 or Solaris 9 would; in fact, because the branded container is installed with an image taken directly from a physical host, all binaries and libraries are carried over. The magic occurs at the system call layer, where Solaris 8 or 9 system calls are translated, executed in the Solaris 10 kernel, and the results sent back to the caller in native format. But branded containers are intended as a means to an end: enabling migration off old hardware and operating systems while planning for migration to native Solaris 10 is underway. Vendor support for a given brand follows the support schedule for its corresponding operating system. When Solaris 8 enters its EOSL (end of support life) in March 2012, there will no longer be patches or support for the Solaris 8 container brand.
Capacity planning and management: A major advantage of shared / consolidated environments is their ability to make much more efficient use of compute resources such as CPU and memory. To fully exploit this advantage, it is important to have historical capacity data upon which to base decisions regarding capacity. Most data used in Solaris capacity planning originates with the kernel's extended accounting facility. The collection tool's local agent will take samples of this data and store it in a centralized database for analysis and reporting. Major capacity management solutions (BMC Perform/Predict, Teamquest, Sun Ops Center) are all container-aware and can give planners a graphical, unified view of where their environments have excess capacity and where capacity is short.
Posted by Steve Putre on Mon, Apr 05, 2010 @ 02:15 PM
In the process of configuring zones on Solaris 10 update 8 (10/09) in a clustered environment, F3 engineers came across the following problem:
# zoneadm -z zone_1 detach
zoneadm: zone 'zone_1': These file-systems are mounted on subdirectories of /tech/zones/zone_1.
zoneadm: zone 'zone_1': /tech/zones/zone_1/dev/.devfsadm_synch_door
We have not yet determined what causes this; it seems to be sporadic. It appears that .devfsadm_synch_door is a hidden mount. To work around it, we ran:
# umount -f /tech/zones/zone_1/dev/.devfsadm_synch_door
We can then cleanly detach the zone from the host.