Friday, November 3rd, 2006

I'm Home. Yay.

Not doing a long writeup of the final keynote, "From Lancelot to Lovelace, and Beyond" by Robert 'r0ml' Lefkowitz. In short, Lefkowitz asserts that to be computer literate, one has to be able to read and write the language of computers (i.e. read and write code), and that, currently, we're effectively in the 13th century with respect to the percentage of people who are computer literate.

And now, I become one with my bed. ZZZ.

Thursday, November 2nd, 2006

ZendCon Session Notes - Zend Framework

Presented by John Coggeshall (Zend)

The Zend Framework is a modular collection of PHP classes, based on PHP 5, to simplify common tasks. It's a smaller component of the PHP Collaboration Project. It's also supposed to be a demonstration of PHP 5 best practices.

The Framework is intended to be E_STRICT compatible (that is, returning no warnings when E_STRICT is enabled). It's also completely PHP 5-powered, requiring as few external PHP extensions as necessary.

One of the goals behind the Framework is to provide "clean" IP to enable commercial use: real companies can't just borrow code from the internet without clear licensing. The framework is licensed using a PHP/BSD style license, so anyone can use it for anything, with no strings attached. Contributors also have to sign an agreement saying that any code they commit, they either created or had the rights to contribute.

John reiterated the "easy things should be easy, complex things should be possible" quote I've mentioned in earlier entries.

One of the features of the Zend Framework is that you don't need to use all of it to use part of it. It's also supposed to be entirely self-contained: there are no functions or constants at the global level: everything is inside classes.

John then went on to demonstrate how this could be used to set up a blog site quickly, but because the start of his session was delayed, he didn't get to do all of the presentation he wanted to do.

ZendCon Session Notes - Unicoding With PHP6

Presented by Andrei Zmievski (Yahoo!)
http://www.gravitonic.com/talks/

Today is Andrei's birthday, so his birthday present is getting to present an 8:30 session. Coincidentally (or not), this is also session number 2-11.

Tower of Babel
Dealing with multiple languages and encodings is a pain, but it can't be avoided.

In the past, PHP has always been a binary processor; the string type is byte-oriented and used for everything from text to images. The core language doesn't know anything about text encodings and multilingual data. And while they're a help, the iconv and mbstring extensions are not completely sufficient.

Andrei spent some time talking about some of the features of Unicode. Unicode by itself doesn't mean internationalization. I18N (internationalization) and L10N (localization) rely on consistent and correct locale data. A locale is an identifier (like en_US) that records characteristics like date/time formats, number/currency formats, sorting order, character direction, etc. PHP uses the Unicode Common Locale Data Repository, which contains 360 locales covering 121 languages and 142 territories.

Goals for Unicode in PHP 6
Have a native unicode string type, and a distinct binary string type (that works like PHP's existing string type); update the language semantics to work correctly with unicode strings; maintain backwards compatibility.

PHP 6 uses ICU: International Components for Unicode (provided by IBM), which provides encoding conversions, collation, unicode text processing, and a large number of other features.

Introduced in PHP 6 is a new configuration option, unicode.semantics. No changes to program behavior unless it's enabled; but you can still use Unicode when it's disabled. When it's enabled, PHP converts strings into an internal unicode representation.

With unicode off, 1 character in a string is 1 byte. With unicode on, 1 character may be more than 1 byte: strlen() would return the proper number of characters. To determine the size in bytes of a unicode string, you need to use a different function. (I'm wondering if this means that, for binary safety, you can no longer rely on strlen() when you need to pass a sequence of bytes and a length to an API.)
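
Since I can't run PHP 6 here, this is only a sketch of the byte-versus-character distinction using functions that already exist: strlen() counts bytes, and a regex with the /u modifier counts code points. The helper name count_code_points() is my own invention, not something from Andrei's talk.

```php
<?php
// Hypothetical helper (not from the talk): count UTF-8 code points with PCRE.
function count_code_points(string $utf8): int
{
    // With the /u modifier, "." matches one code point; /s lets it match newlines too.
    $n = preg_match_all('/./su', $utf8);
    if ($n === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

$alef = "\xD7\x90";              // U+05D0 HEBREW LETTER ALEF encoded as UTF-8
$text = "a" . $alef . "b";

echo strlen($text), "\n";            // size in bytes
echo count_code_points($text), "\n"; // number of code points
```

For this string the two numbers differ (4 bytes, 3 code points), which is exactly the gap that worries me when an API wants a byte count.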

In strings, you can use \u or \U and specify the codepoint (e.g. \u05D0), or \C{HEBREW LETTER ALEF} when you don't know the code point but do know the unicode character name.

PHP can automatically change the data encoding for different input and output sources. It will automatically convert string literals to UTF-8 unless the file says otherwise with something like declare(encoding="iso-8859-1"), in which case that code file is interpreted in that character set.

Processing data retrieved from the browser poses a special problem: GET requests have no encoding at all, and POST only rarely comes marked with an encoding. However, browsers are supposed to submit data in the same encoding as the page the form was on, so PHP will attempt to decode based on the unicode.output_encoding setting. If decoding fails, PHP will populate the request arrays with raw binary data, and applications can then use the filter extension to decode the text.

When there is a conversion error to or from Unicode, you can specify how PHP is to handle the error, and even provide an error function so that you can handle the error via PHP code.

Also new is the TextIterator, which allows for fast iteration, forwards and backwards, over text. It allows you to iterate based on code point, character, words, lines, or even sentences.

To date, about 40% of PHP's 3070 built-in functions have been upgraded to handle unicode text.

There should be a preview release of PHP 6 in December.

Wednesday, November 1st, 2006

ZendCon Session Notes - PHP Data Objects

Presented by Wez Furlong (OmniTI)

Problems with PHP Database APIs
There's no consistency of API between DB extensions. Sometimes, the DB extensions aren't even internally consistent. This also means that there's code duplication in PHP internals, which leads to high maintenance costs.

The PDO Solution
Move PHP-specific stuff into one extension, database specifics in their own extensions, and create data access abstraction, rather than database abstraction.

Most people aren't using abstraction layers, or are using home-brew layers because existing generally available abstraction layers are too slow, do too much, or are too complicated.

PDO Features
Native C code, rather than a PHP-based abstraction, helps performance. It also takes advantage of improvements in the PHP 5 internals.

PDO has common database features as a base, and database-specific features are also available.

What Can PDO Do?
In summary: prepared statements, bound parameters, transactions, LOBs, SQLSTATE error codes, flexible error handling, and database portability.

PDO supports MySQL, PostgreSQL, ODBC, DB2, OCI, SQLite, and Sybase/FreeTDS/MSSQL.

Wez then talked about how to use PDO. I'll spare you pages and pages of PHP code, which I'm sure is given in examples on the PHP website.

PDO maps error codes from the database-specific format to standard ANSI SQLSTATE error codes. In the case of errors, PDO also has three error handling strategies: silent (the default), displaying warnings, and throwing exceptions.

PDO implements forward-only cursors by default. (This is similar to mysql_unbuffered_query.) This makes them fast, since you don't have to wait for all the data to come from the network, but it also means that you don't know how many rows there are until you've fetched all the data. It also means that you can't initiate another query until you finish fetching all the data, and it's possible that it might cause other queries to block because the database is still busy servicing your initial query. You can also request buffered queries (like mysql_query) instead.

PDO implements iterators: foreach($dbh->query(...) as $row) {}. Kinda neat, especially since 95% of the time, I wind up doing while($row = mysql_fetch_*()) anyway.
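
For my own reference, here's roughly how those pieces fit together. This is a minimal sketch, not code from Wez's slides: the in-memory SQLite DSN, the posts table, and its columns are made up for illustration, and it assumes the pdo_sqlite driver is installed.

```php
<?php
// Assumption: pdo_sqlite is available; the table and data are invented for this example.
$dbh = new PDO('sqlite::memory:');

// Opt in to the "throw exceptions" error-handling strategy.
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$dbh->exec('CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)');

// Prepared statement with a bound parameter, wrapped in a transaction.
$dbh->beginTransaction();
$insert = $dbh->prepare('INSERT INTO posts (title) VALUES (:title)');
foreach (['First post', 'Second post'] as $title) {
    $insert->execute([':title' => $title]);
}
$dbh->commit();

// A PDOStatement is iterable, so query() can be used directly in foreach.
$titles = [];
foreach ($dbh->query('SELECT id, title FROM posts ORDER BY id', PDO::FETCH_ASSOC) as $row) {
    $titles[] = $row['title'];
}
print_r($titles);
```

Swapping the DSN for another driver should leave the rest of the code alone, which I take to be the point of "data access abstraction".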

PDO handles LOB support via streams. Database support permitting, you can potentially stream content to or from the database without first having to load it entirely in PHP. For example, you could select an image from the database and fpassthru() it to the browser, or fopen() a file from the filesystem and stream it into the database without having to file_get_contents(). This cuts down on both memory usage and latency.

This talk actually has me excited about PDO now, and I'm going to look into implementing it in my applications where possible.

ZendCon - Movable Walls

It's a rather interesting experience to be sitting in a room, and have its configuration change by having the movable walls expand or collapse. One moment, you're in a giant, cavernous room; the next, you're in a room a third its size, with a wall 10 feet away.

On the plus side, it turns out that there's a power strip plugged in in the middle of the room. And fortunately, all the sessions I've been to today have been in that part of the conference room, so I've not had the power issues today that I did yesterday.

The wireless mikes are still having issues, though. I don't think I've been to a single session or talk today where the mikes didn't have an issue at some point.

ZendCon Session Notes - Keynote - The MashUp Economy: What are you waiting for?

Presented by David Berlind (ZDNet)

David's Law of Economics
"Those who handle the most amount of heavy lifting, but do the least amount of it win big*"

*If you're in between, perhaps it's time to rethink life.

Ecosystems
Ecosystems are a cycle, starting with artists, who create a user experience, which attracts consumers, which generate additional artists.

A healthy ecosystem also generates an arms race of technologies and companies using it; media interest; research; venture funding; and conferences and events.

Mashup Defined
Two or more disparate sources of content* or functionality are blended** to form a unique user experience*** that's usually substantially different from any of the original sources

*Dave's opinion: data qualifies as content. (I share this opinion; it seems almost obvious.)
** sometimes referred to as remixing
*** audio, visual, or a combination thereof qualifies

"Mashups" are just a trendy term; we've been doing similar things for years: newspaper articles that quote other press; productions like Forrest Gump (where Gump is shown shaking hands with the president); consumer video and animation; music (which is where the term originated); software.

Key Enabler
The better the tools are, the easier it is for anyone to do it. (This reminds me of the quote "Make easy things easy; make hard things possible.")

David talked about "ecosystems" - the various different software platforms (Mac, Linux, Windows, PHP, Perl, etc), and how they all can be replaced with "The Internet". With the Internet, anyone can add an API (new web service, etc) and have it available to everyone, whereas all the other platforms are relatively closed - they're only controlled by a few people. For example, while anyone can start a new Linux distribution, that doesn't mean that new APIs you might add to that distribution will be picked up and distributed in other kernels.

Programmableweb.com is a website for creating and sharing mashups. It apparently makes it easy enough to create mashups that 5-year-olds can do so.

David also mentioned that there are potential legal issues. It's a giant grey area; the legality of mashups isn't yet worked out. But if you make interesting data available without an API to access it, the chances are very high that someone else will provide one. This is something that potential mashup authors need to keep in mind. During the Q&A session, David gave an example of a mashup that got shut down: it was mapping 911 calls and response times, and apparently, the city whose data was being used was unhappy about this because potential terrorists could use that information to know when to attack emergency responders.

podbop.org provides a list of artists playing concerts in a geographical area. From there, you can listen to MP3s to decide whether or not you want to actually attend those concerts. This is, of course, breaking the stranglehold of the media companies who like to be able to explicitly present a particular image for a given band or concert.

Google Maps makes up 50% of mashups on programmableweb. Flickr is next (11%), followed by Amazon (9%) and Yahoo Maps (5%).

Mashups in the media: David set up a Google Alert for the term Mashups; in one year, he went from getting seven alerts per week, to seven alerts per day.

David also mentioned that mashups are starting to show up in fiction: Mashup Corporations.

ZendCon Session Notes - Caching Systems

Presented by Ilia Alshanetsky.

Ilia presented a number of different caching approaches and talked about their pros and cons:

Complete Page Content Caching
This can be implemented simply: create a cache() function that tries to read in a cache file. If the file is too old or non-existent, call your init_cache() function, which turns on output buffering and sets up a register_shutdown_function() callback that gets the content out of the output buffer, echoes it out, and writes out the cache file.

When writing the cache file out, you will want to use tempnam(), file_put_contents(), and then rename(), so that you don't run into issues with multiple connections attempting to write into the same file at the same time.
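
Here's my rough reconstruction of the idea. It's a sketch under my own assumptions, not Ilia's code: the helper names cache_read() and cache_write(), the file path, and the 300-second TTL are all made up, and I've inlined the output buffering instead of using a shutdown function.

```php
<?php
// Hypothetical helpers; names, paths, and the TTL are assumptions, not Ilia's code.

// Return the cached page, or null if the file is missing or older than $ttl seconds.
function cache_read(string $file, int $ttl): ?string
{
    if (!is_file($file) || time() - filemtime($file) > $ttl) {
        return null;
    }
    $data = file_get_contents($file);
    return $data === false ? null : $data;
}

// Write to a temp file in the same directory, then rename() it into place.
// rename() within one filesystem is atomic, so readers never see a partial file.
function cache_write(string $file, string $content): bool
{
    $tmp = tempnam(dirname($file), 'cache');
    if ($tmp === false) {
        return false;
    }
    if (file_put_contents($tmp, $content) === false || !rename($tmp, $file)) {
        @unlink($tmp);
        return false;
    }
    return true;
}

$file = sys_get_temp_dir() . '/page_cache_example.html';
$page = cache_read($file, 300);
if ($page === null) {
    ob_start();                       // capture everything the page prints
    echo "<html><body>Generated at ", date('c'), "</body></html>";
    $page = ob_get_clean();
    cache_write($file, $page);
}
echo $page;
```

Two requests racing to regenerate the page each write their own temp file, and whichever rename() lands last wins, so the worst case is some duplicated work.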

This is fast, but it requires that the entirety of your page be cacheable, which is not often the case.

Compressed Page Cache
If the browser accepts gzip, we can send Content-Encoding: gzip and use a compressed version. This can be done really easily with:

copy("/tmp/index.html", "compress.zlib://" . $tmp_name);

thanks to the magic of PHP file streams.

Content Pre-Generation
This cache generation code can be simpler than on-the-fly full-content caching, because we're manually triggering the cache generation operation and creating the entire website all at once. This allows us to ignore having to handle locking issues when multiple accesses attempt to write out a page's cache multiple times. However, this may result in the generation of pages no one may visit; and the disk space used may be very large. The time to generate an entire site's worth of content may also be very large.

On-Demand Caching
Instead of creating all the content of the site all at once, we can instead create it on-the-fly, by implementing a 404 error handler which generates the page the user was trying to access. On a 404 to an .html file, the error handler generates the page, then writes it out to that file. Future accesses to the page hit the static content, and are very fast.

Partial Page Caching with APC
APC also has the ability to create and manage shared memory regions, which provides easy access to a shared memory cache:
	apc_store($key, $contents, $ttl);
	$contents = apc_fetch($key);
	apc_delete($key);

However, APC isn't a built-in extension, which limits its availability.

SQL Query Caching
When doing a search that is slow, you can store the ids for the results of a search query into a database table keyed by date and search id. Ilia also suggested limiting query results to 1000 items. More than that indicates that the user is probably doing too broad of a search. (If it's good enough for Google, it's good enough for you.)

In-Memory Caching without APC
If APC is not available, you can use the built-in shmop module to gain the same benefits. However, shmop requires a little more code to use.

Browser Caching
Finally, you can have the browser cache the content by sending Expires, Last-Modified, and ETag headers (and also returning 304 results to tell the browser that it can continue to use its cached copy). This reduces data sent and server resource use to next to zero, but it's not always guaranteed to work and there's flimsy control over the cache age, since you're at the mercy of the browser's cache algorithm.
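
A minimal sketch of the ETag/304 handshake, as I understand it. The cache_status() helper, the one-hour expiry, and the page body are my own assumptions, and the comparison is deliberately simplified: it only handles a single exact ETag, not lists or weak validators.

```php
<?php
// Hypothetical helper: decide between 200 and 304 from the request's If-None-Match header.
function cache_status(string $etag, ?string $ifNoneMatch): int
{
    return ($ifNoneMatch !== null && trim($ifNoneMatch) === $etag) ? 304 : 200;
}

$body = "<html><body>Hello</body></html>";
$etag = '"' . md5($body) . '"';                 // ETag values are quoted strings
$status = cache_status($etag, $_SERVER['HTTP_IF_NONE_MATCH'] ?? null);

if (!headers_sent()) {
    header('ETag: ' . $etag);
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');
    http_response_code($status);
}
if ($status === 200) {
    echo $body;                                 // a 304 response carries no body
}
```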

ZendCon Session Notes - Keynote - Innovation That Matters: Making it Easy for Developers to Rapidly Deploy Usable & Actionable Information

Presented by Anant Jhingran and Mike Smith from IBM

Evolution of Enterprise Application Development
1970s: Applications were a hodgepodge of data and logic
1980s: Databases were developed, allowing data to be removed from the application
1990s: Web App + J2EE + Database
2000s: Web App + J2EE + Database and lots of buzzwords (Content, Search, Federation); Information as a Service

Information as a service: a number of heterogeneous applications and information sources

2004: PHP Web App + Database
2005: PHP Web App + Database + Buzzwords
2006: Information as a Service and Web 2.0

Web 2.0 and "Info 2.0" - essentially the same thing as the separation of data and logic that occurred in the '80s.

Modern Web Data, Modern Web Applications
XML and its interaction with PHP

IBM's clients want to deal with XML, not as a separate system, but as an extension of their current applications.

Anant then talked about doing XML queries in DB2, and searching XML data with SimpleXML.

Ready for the Enterprise
Anant says that PHP and DB2 provide enterprise caliber functionality (via pureXML native XML store and PDO connectivity), scalability, reliability, security, and monitoring. Additionally, there are a number of tools that help, such as DB2 Developer Workbench, Zend Studio, Zend Core for IBM, integration with the Zend Framework, and DB2 Express-C.

(At this point, there was yet another snafu with the wireless mikes.)

Anant then showed some slides on the performance characteristics of PHP/Zend Core + DB2 using XML.

On the "i platform"
Mike then took the stage to promote IBM System i. 98% of Fortune 100 and 85% of Fortune 500 companies are using System i.

(At which point, Mike's wireless mike also shut off.)

He then showed some customer testimonials praising System i's low maintenance costs (e.g. not needing to reboot since 1992, minimal use of IT staff to maintain System i servers, etc).

Mike mentioned that System i servers themselves scale from 1 CPU to 64 CPU racks. He then talked a little about logical partitioning, and how you can create virtual machines using as little as 1/100th of a server's CPU.

IBM is providing Zend Core and Zend Studio Professional for i5/OS for free.

The i5/OS version of Zend Core includes extensions to take advantage of special i5/OS features.

In the first three months of availability, there have been 4000 downloads of PHP for System i. Mike closed with more customer testimonials on using PHP to consolidate and integrate with other applications.

Tuesday, October 31st, 2006

ZendCon Session Notes - Securing PHP Applications

This session, presented by Ilia Alshanetsky, covered the most common PHP security mistakes, as found by searches on Google Code Search.

You can get the specific search terms and examples from Ilia's website once he gets the conference slides online, but this is a quick run-down of the issues:

Cross-Site Scripting (XSS)
User-supplied HTML is displayed as-is to the screen.
Over 90,000 results from Google code search, with specific examples from phpMyAdmin, phpMyEdit, University of Toronto, and Modernbill.

Possible exploits with this class include cookie/session theft, content modification, cross-site request forgery initiation, and social engineering (by displaying a real site's content or forms).

Preventing this involves passing input through htmlspecialchars() or htmlentities(), or using the filter extension (new in PHP 5.2) to sanitize input.
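
As a quick illustration of the htmlspecialchars() route (my own sketch; the h() wrapper name and the sample comment are made up, and UTF-8 is an assumption about the page encoding):

```php
<?php
// Hypothetical helper name; escapes a value for use in HTML text or a quoted attribute.
function h(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$comment = '<script>alert("xss")</script>';
echo '<p>', h($comment), "</p>\n";
```

The browser then shows the script tag as text instead of running it.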

SQL Injection
User-supplied input used as-is in SQL queries.
Over 3,000 results, with examples including Bugs (a bug tracking system), OSTicket, XOOPS, and phpMyAdmin.

Possible exploits include arbitrary query injection, arbitrary data retrieval, denial of service (esp. with BENCHMARK(very-large-number, ...)), and data modification. Ilia's examples used subqueries, which I hadn't previously considered, but which are still quite effective.

Solutions include using prepared statements; some databases can even reject queries if the data types don't match between the query and the table columns. Note that string escaping functions don't work reliably for multibyte character sets, as it's often possible to craft an invalid sequence in a multibyte character set that can cause the escaping functions to not work as intended.
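
To convince myself that prepared statements really do neutralize this, here's a small sketch using PDO. It is not one of Ilia's examples: the users table and the payload are invented, and it assumes the pdo_sqlite driver is installed.

```php
<?php
// Assumption: pdo_sqlite is available; the users table is invented for this example.
$dbh = new PDO('sqlite::memory:');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->exec('CREATE TABLE users (name TEXT)');
$dbh->exec("INSERT INTO users (name) VALUES ('alice'), ('bob')");

// Classic injection payload; concatenated into the SQL it would match every row.
$input = "nobody' OR '1'='1";

$stmt = $dbh->prepare('SELECT COUNT(*) FROM users WHERE name = ?');
$stmt->execute([$input]);          // the value is sent as data, never parsed as SQL
$matches = (int) $stmt->fetchColumn();

echo $matches, "\n";
```

Because the payload is compared as a literal name, no rows match.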

Code Injection
User can make script execute arbitrary PHP Code.
Thousands of results (but a number of these are false positives?). Specific examples include Serendipity, WordPress, YaBB, and Squirrelmail Plugin.

Possible exploits include sensitive file retrieval (e.g. including ../../../../../etc/passwd), arbitrary code execution, or content removal.

One should never pass user input to include(), require(), or eval() (and also, the shell functions, such as system(), exec(), shell_exec()). If it's necessary to do so, one should build a whitelist of acceptable values, and even obfuscate them in some way with a lookup table so that the user input is not directly what is on the filesystem. Ilia also gave a number of solutions that apply to hosting company setups to help mitigate the problem.
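
The whitelist-plus-lookup-table idea might look something like this. It's a sketch of my own: the page keys, file paths, and the resolve_page() name are hypothetical, and a real script would include the resolved file instead of printing it.

```php
<?php
// Hypothetical page map: the keys are what users may send, the values are what is on disk.
const PAGES = [
    'home'    => 'pages/home.php',
    'about'   => 'pages/about.php',
    'contact' => 'pages/contact.php',
];

// Return the file for a requested page, or null when the key is not on the whitelist.
function resolve_page(string $requested): ?string
{
    return PAGES[$requested] ?? null;
}

$requested = $_GET['page'] ?? 'home';
$file = is_string($requested) ? resolve_page($requested) : null;
if ($file === null) {
    $file = PAGES['home'];   // unknown input falls back to a safe default
}
echo $file, "\n";            // a real script would include $file here
```

Since the user only ever supplies a key, something like ../../../../../etc/passwd never reaches the filesystem functions at all.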

Header Injection
Gives hacker the ability to inject arbitrary content headers.
Over 8,000 results, with examples including Cacti, TikiWiki, Horde, and phpMyConference.

Possible exploits include cache poisoning (which is an especially bad problem if proxy servers are involved), arbitrary url redirection (including circular redirects, which can act as a DoS), and cookie overload/session fixation.

Solutions include using PHP 4.4.2 or 5.1.2 or later, where header() will only emit a single header line at a time. Also, avoid passing user input to header(), setcookie() or session_id(), and/or pass user input through urlencode().
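
For cases where user input does have to end up in a header, here's a belt-and-braces sketch of my own: urlencode() the value, and separately refuse anything containing line breaks. The safe_header_value() helper, the next parameter, and /go.php are all hypothetical.

```php
<?php
// Hypothetical helper: refuse any header value containing CR, LF, or NUL.
function safe_header_value(string $value): ?string
{
    return preg_match('/[\r\n\0]/', $value) ? null : $value;
}

// "next" and /go.php are made-up names for a typical redirect-after-login flow.
$target = $_GET['next'] ?? 'index';
$target = is_string($target) ? $target : 'index';

$location = safe_header_value('/go.php?next=' . urlencode($target));
if ($location !== null && !headers_sent()) {
    header('Location: ' . $location);
}
```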

Session Fixation
Hacker can hardcode user's session id to a known value.

If an attacker can get the user to log into a site with a session id specified, the attacker can then try the url later, gaining access if the user is still logged in.

A solution to this is to use session_regenerate_id(), which changes the user's session id. Ilia says this normally only needs to be done at privilege escalation (e.g. when the user logs in, or accesses certain functions), but says that some people use it on every request.

Information Disclosure
Sensitive information is exposed to unauthorized users.

Ilia's specific examples here centered around leaving error messages enabled on production sites. This can result in disclosing the filesystem paths of PHP files, which provide attackers with information that may make subsequent attacks easier. In older versions of PHP, creative exploitation of error output could also lead to XSS. And some functions may expose sensitive parameters as part of the error message. (In particular, one credit card processing extension is known to emit usernames and passwords as part of error messages.)

Arbitrary File Output
Usually caused by a function such as fopen() using user input to identify which file to open.

This seems to be a fairly uncommon attack. Solutions include using open_basedir, disabling allow_url_fopen, and preventing the web server from accessing itself via firewall rules.

There's also one other potential exploit Ilia covered, which I didn't write down because I was running low on battery power. If I remember it, I'll update this entry.

ZendCon Session Notes - Scalability and Performance Best Practices

Presented by George Schlossnagle, OmniTI Computer Consulting
Longer version of talk: http://omniti.com/~george/talks/

George started out by reminding everyone that scalability and performance are different, if linked, concepts. Scalability is the ability to gracefully handle additional traffic or load while maintaining service quality. Performance, on the other hand, is the ability to execute a single task quickly. One can be performant without being scalable (e.g., an application that can perform a single task extremely quickly, but which falls over under load). They're both symbiotic parts in the software development relationship. Another way of looking at this is, performance is the ability to handle the service commitments of today, while scalability means being able to meet the performance commitments of tomorrow.

He also pointed out that with well-designed code, almost any optimization you make will make the code more brittle, complicated, and (possibly) less flexible. So when you're making an optimization, it's important to be aware of what you're doing and why you're doing it. (Which is sound advice for any sort of development, not just optimization-related.) One should design code to be refactored later, so that when changes are necessary, it's possible to make them.

George then described best practices for software development in general, and then as related to scalability and performance. These overlapped considerably with the earlier session on High Volume PHP & MySQL scaling Techniques, and yesterday's tutorial on improving the performance of PHP applications.

His general key points were to:
* Profile code early and often: effective debugging profiling is about spotting deviations from the norm, and effective habitual profiling is about making the norm better.
* Developers and Operations need to have a good working relationship, so that when there is a problem, things go smoothly.
* Always test on production data, so that there are no surprises when dev code is made live.
* Assumptions will burn you. In the long run, any assumption you make will probably be wrong.

His points on scalability (which he tried to condense into one-word bullet points) were to "decouple" (isolate performance faults, refactor only where and when necessary, and split hardware only when needed, as it impairs your ability to efficiently join decoupled application sets); "cache" (the fundamental question here is, how dynamic does the data really have to be?); "federate" (aka "shards", as described in an earlier talk); "replicate" (but be aware that high write rates can cause replication lag); and "avoid straining hard-to-scale resources" (such as third-party data sources or black-box data).

An important point is that caching longevity can be a key issue: sometimes, caching data for only a few seconds can lead to a performance win; it all depends on the frequency of requests.

The key points for performance are to use a compiler cache (which just about everyone is saying now), be mindful of external data sources, avoid recursive or heavy looping code (which, apparently, is expensive in PHP and probably indicates you're doing something wrong), don't try to outsmart PHP (don't try to work around perceived inefficiencies with code that turns out to be less optimal; for example, trying to write a parser in PHP when a simple regex could do); and build software with caching in mind (but watch cache hit rates - if a cache is never actually used, it's a net loss to build it).

Overall, this was a good session, but (as seems to be a trend so far for the conference), it was cut short, and there was a fair bit of overlap with the other similar sessions.

ZendCon - Sneaky WiFi

Hopefully, this will be useful to other conference attendees, if any of you are reading this:

It seems that the Hotel's wifi system is networked together in a rather annoying way: if you first acquire a wifi connection up in the room, it asks you to pay for it. But if you first acquire your connection downstairs in the conference area, it gives it to you for free, no matter where you roam in the hotel.

My laptop ran out of juice during the last session I attended, so when I went back to my room, it didn't have wireless access. Going back downstairs to the front desk area and re-connecting to the network fixed that, and now the wifi works in my room again.

ZendCon Session Notes - Keynote - The Economics of Abundance

Presented by Chris Anderson, Wired

Attendees received a free copy of Chris Anderson's The Long Tail book.

Tim Bray: "As the cost of something decreases, it effectively becomes free."

A number of advances have been made as a result of wasting resources.

Waste Transistors: we go from a CLI environment and the thinking that the computer should be using its resources only on real work, to drawing icons and simple GUIs, to the fancy GUI environments of Mac OS X and Windows today, which make computers more accessible and easy to use.

Waste Storage: as an example, Gmail: "Over 2779.907662 megabytes (and counting) of free storage so you'll never need to delete another message." Gmail's mail storage counter, located on its home page, increases by the second.

Waste Bandwidth: Online video, which leads to YouTube, which leads to user-produced video with network TV-sized audiences.

3D printing - for the first time in history, complexity is free - printing three gears is just as expensive as printing one.

Scarcity vs. Abundance
Scarcity - physical world - limited shelf space
Abundance - online world - unlimited virtual shelf space

Wal-Mart vs. Amazon: there's no cost to add additional products that would sell in very few numbers; unexpected products can sell very well.

Blockbuster Video vs. Netflix: shelf space limits the DVDs Blockbuster can make available, but Netflix doesn't have this limitation, so they can make available items that aren't "popular".

Tower Records vs. iTMS (with a photo of a going-out-of-business Tower Records store)

Entertainment: mass-market television vs. user-produced videos distributed on YouTube.

The Long Tail
The Long Tail is a Power Law: a small number of products sell well and have high availability, but many products sell little and aren't readily available.

Blockbuster Video: if a product doesn't sell 1.5 - 2 times per week, or isn't expected to sell that often, they don't offer it. It's not worth the shelf space it uses, in that case.

"The New Growth Market: Products you can't find anywhere but online"
Rhapsody: 40% of its total sales of music isn't available at Wal-Mart (3m vs 55k tracks)
Netflix: 21% of the DVDs they ship aren't available at Blockbuster (65,000 dvds vs 4,000 dvds)
Amazon: 25% of their books aren't available at a physical bookstore (3.7 million book titles vs. 100k at B&N)

Online sales: fastest growing sector overall

Three Forces of the Long Tail
1) Democratize the tools of production. Result: More stuff
2) Democratize distribution. Result: more sales outside of the hits
3) Connect Supply and Demand. Result: drives business from hits to niches

Case Study: Software
The cost of producing software has decreased dramatically, which increases the number of people who are able to obtain the tools, thus increasing production.

Costs of distribution have fallen too (from tape, to disks, to internet downloads, then to plugins, and then web services that don't actually require any downloads).

So have the costs of finding software (from user groups, to software stores, shareware.com, etc.).

Software timeline:
1980s: shareware, Hypercard
1990s: Excel templates, Java apps, ActiveX
2000s: Firefox plug-ins, hosted apps

Now: hosted app marketplaces, mashups, and embedded hosted web apps, so you don't have to actually download any software at all.

We now have lower "risks and costs" of obtaining and running software. (Obviously, this is discounting potential security risks or malware.)

Salesforce AppExchange - 350+ niche apps
Example: Evacuee management (on salesforce)

Characteristics of Long Tail software
1) Small pieces, loosely joined (e.g. the Web)
2) Focused on a few features, not everything for everyone
3) Bottom-up, not top-down (user input vs. a boardroom deciding what will be popular)
4) User contributions
5) Flexible, evolving

Theory: the natural shape of any marketplace is a power law. Power laws are straight lines on a log-log scale.
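
To see why a power law plots as a straight line on a log-log scale: if sales fall off with popularity rank x as y = C * x^(-a), then log y = log C - a * log x, which is linear in log x. Here's a minimal Python sketch of that. The constant and exponent are made up for illustration; they are not figures from the talk.

```python
import math

def power_law(rank, scale=1000.0, exponent=1.2):
    """Hypothetical sales for a product at the given popularity rank."""
    return scale * rank ** -exponent

def log_log_slope(x1, x2, f=power_law):
    """Slope of the line through two points after taking logs of both axes."""
    return (math.log(f(x2)) - math.log(f(x1))) / (math.log(x2) - math.log(x1))

# The slope is the same everywhere along the curve: minus the exponent.
slopes = [log_log_slope(a, b) for a, b in [(1, 10), (10, 100), (100, 10000)]]
```

A bottleneck like the limited number of movie screens would show up as the point where real data drops below this straight line.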

Movie box office sales: follow a power law until the number of movie screens runs out, and then there is a sharp drop-off, as unpopular films just aren't shown at all. Demand is physically distributed - you can't put a movie on a screen if there isn't a sufficient amount of local demand.

Netflix DVD rentals: follows the power law - no physical bottleneck. The graph that Chris showed has the beginning of the curve (huge box-office hits) with a lower slope than the tail; I'm wondering if this is because those movies were so popular that people bought the DVDs, so they didn't need to get them from Netflix, and in sufficient quantities as to affect the graph.

Sourceforge: 61,974 projects that fall into a power law by downloads - it's not just about making these available, but making them findable

Joe Kraus: "The focus has been on dozens of markets of millions, instead of millions of markets of dozens."

ZendCon Session Notes - High Volume PHP & MySQL scaling Techniques

Presented by Elliott White III of digg.com

Introduction
Performance is a problem, but scaling is always a bigger problem, and you can't always throw hardware at a problem quickly enough.

Standard Solution
Many PHP servers behind a load balancer, and many MySQL slaves all talking to a single master.

Randomized or "planned" PHP to MySQL relations - in the "planned" case, every PHP server has its own MySQL slave. Ideally, you want to use randomized connections, so that you can scale the PHP servers independently of the MySQL servers. (Also, specify weighting so that you can give more load to the more powerful servers.)
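
As a rough illustration of weighted random connections (sketched in Python rather than PHP, and not digg's actual code), each request picks a slave at random in proportion to a weight. The host names and weights here are hypothetical.

```python
import random

# Hypothetical slave pool: host name -> relative weight.
# db-slave-3 is assumed to be the more powerful machine.
SLAVES = {"db-slave-1": 1, "db-slave-2": 1, "db-slave-3": 3}

def pick_slave(slaves=SLAVES, rng=random):
    """Pick a slave at random, in proportion to its weight."""
    hosts = list(slaves)
    weights = [slaves[h] for h in hosts]
    return rng.choices(hosts, weights=weights, k=1)[0]
```

Because the web servers only need the list of slaves and weights, adding another PHP server or another slave doesn't require pairing them up.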

PHP solutions
Ensure that you're using an opcode cache. (Elliott recommends APC.)

Use a faster webserver - e.g. use thttpd for static html pages, images, etc. thttpd apparently scales better. Elliott also suggested the possibility of using a threaded version of Apache for static content.

If pages don't need to be instantly updated, pregenerate on a regular basis and cache.

He also suggests using jpcache as a PHP solution - rather than pregenerating everything, cache dynamic content as it's created. jpcache also enables client-side caching (with ETag headers), and gzips outgoing content when possible.

Memcached
Cache certain parts of a dynamic page that are static with an in-memory persistent cache. This is used on digg for caching your list of friends, so it doesn't have to retrieve it from the mysql server each time.

Memcached server farms: partition the data so that it's spread across multiple (physical) servers, and connect to the servers that contain the proper data. (The php memcache documentation has a class that handles randomized/consistent connections, and failover.) Failover has issues, though, with stale data if a server disappears. A potential solution is to use redundancy and save information on multiple servers.
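
A minimal sketch of the partitioning-plus-redundancy idea, in Python. This is not the class from the PHP memcache documentation; the host names are hypothetical, and it uses a plain CRC32-modulo mapping rather than true consistent hashing.

```python
import zlib

# Hypothetical memcached hosts.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

def servers_for_key(key, servers=SERVERS, copies=2):
    """Return the servers that should hold `key`: a primary plus replicas."""
    start = zlib.crc32(key.encode("utf-8")) % len(servers)
    count = min(copies, len(servers))
    # Walk around the server list so each copy lands on a different host.
    return [servers[(start + i) % len(servers)] for i in range(count)]
```

A write would go to every server in the returned list, and a read would try them in order, so one missing server doesn't lose the entry.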

Memcached has a number of disadvantages, though. The code to determine cache decisions, including the hash to determine where data goes, is going to be site-specific. Memcached can also perpetuate slave lag - if you cache data from a database slave that's old, you wind up caching the old data for even longer. Memcached also doesn't like data segments larger than 1 MB.

Implementation: write a generic class that abstracts away all the decisions, so that you only have to say "store" or "retrieve". Then, abstract away how the data is even stored, so the application "just asks for data", and the data classes retrieve data from db, memcache, etc., without the application having to know what's going on.
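
Roughly what that abstraction might look like, sketched in Python. The class and parameter names are my own invention, and a plain dict stands in for memcached.

```python
class DataStore:
    """Read-through cache: callers just ask for data by key."""

    def __init__(self, cache, load_from_db):
        self.cache = cache                # any dict-like object; stands in for memcached
        self.load_from_db = load_from_db  # callable that does the expensive lookup

    def get(self, key):
        # Serve from the cache when possible; otherwise load and remember.
        if key in self.cache:
            return self.cache[key]
        value = self.load_from_db(key)
        self.cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, e.g. after the underlying row changes."""
        self.cache.pop(key, None)
```

The application only ever calls get(); whether the answer came from the cache or the database is invisible to it.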

==> Obviously, you want to implement memcached only when the connect-store-retrieve-disconnect time is less than the time needed to retrieve/generate the data in the first place.

Purpose Driven MySQL Pools
Put queries that may take a long time to execute on servers that only handle these queries, so that you don't impact the primary servers handling the main parts of the web application - e.g. for searching, or expensive batch jobs.

Shards
Breaking the database into a number of smaller databases.

Pros: greater performance; tweakable and scalable
Cons:
Loss of SQL support - you won't be able to do simple selects, and depending on how the shards are set up, you might not be able to do unions.
This leads to increased PHP load, since you must write PHP code to handle aspects that you can't do in MySQL anymore.
Which, of course, complicates the programming.

Types of shards
Table based: when the amount of data is smaller, there's less data each slave/master has to transfer and keep updated. Also, this enhances the performance of the query cache, since there's less memory pressure. But this completely breaks joining.

Date based: could keep the tables in the same database, or separate them. But the tables themselves are smaller, so there's less work involved in searching and retrieving the data.

Range based: Rows (1 .. 100000) in one table, (100001 .. 200000) in another. (Or users with A-F; G-L; etc) Good for when you know the id you're searching for (e.g. a users table, perhaps).

Interlaced: You specify a number of tables and evenly divide the tables based on a hash of the id.

Partial sharding: table with everything, and then a table with only the most recent items.

But coding for shards can be complicated, and you need to properly abstract out the shard logic so the application doesn't need to know about it.
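
A small Python sketch of keeping the shard logic in one place, covering the range-based and interlaced schemes. The table names and shard count are hypothetical, and a plain modulo stands in for "a hash of the id".

```python
ROWS_PER_SHARD = 100000  # matches the 1..100000, 100001..200000 example
INTERLACED_SHARDS = 4    # hypothetical number of interlaced tables

def range_shard(row_id, base="users"):
    """Range based: ids 1..100000 -> users_0, 100001..200000 -> users_1, ..."""
    return f"{base}_{(row_id - 1) // ROWS_PER_SHARD}"

def interlaced_shard(row_id, base="users"):
    """Interlaced: spread ids evenly using id modulo the shard count."""
    return f"{base}_{row_id % INTERLACED_SHARDS}"
```

The rest of the application would ask a function like this for the table name and never hard-code it, so the sharding scheme can change without touching every query.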

ZendCon Session Report - Panel - How do the Stacks Stack Up?

The panel was moderated by Steve O'Grady (Redmonk), and its participants included Bill Hilf (Microsoft), Tim Bray (Sun), Mike Olson (Oracle), Mårten Mickos (MySQL), and Anant Jhingran (IBM).

Bill explained that Microsoft's "ulterior motive" behind its support of PHP is that, as a tools company, when 75% of PHP developers are developing on Windows but deploying to UNIX, well, that's something they want to resolve, so that more people deploy PHP applications to Windows.

Tim pointed out that Sun machines run Linux "just great", in the event someone doesn't want to run Sun's OS. Tim mentioned that Solaris is one of the most observable operating systems available, pointing out the integrated DTrace debugging tool.

Mike reiterated that Oracle last week announced full support for Linux for Oracle.

Mårten says that the strength of open source is that there are multiple projects at once, and that each can focus on making their tools faster - Zend can work on making PHP faster, while MySQL works on making MySQL faster. He expects that PHP and MySQL will continue to evolve to support the needs of Web 2.0 applications.

Anant briefly talked in general about how IBM iSeries servers are extremely reliable - and that the difference between "Windows" reliability and iSeries reliability is night and day.

Tim says that it's obvious that a large portion of the market has made it clear (with their wallets) that open source is the way people want to go, that there are strong engineering reasons for continuing to support open source development, and that corporate involvement with open source tools is only going to continue to grow.

Mike says that Oracle was led to open source applications - particularly PHP - by their customers, who wanted to use Oracle with PHP. He says that Oracle isn't trying to subvert open source applications, rather that they want to make their customers successful, and if that means supporting open source, then that's what it means.

Bill talked about how one of the key questions of working with open source tools is, "does this help us make more money?" - and any company needs to ensure that it does.

Question from the audience along the lines of - "Buy Oracle, which is Fast, or MySQL, and then buy people to make it fast?" The answers from Mike and Mårten were essentially "given what you're trying to implement, it'll take engineering effort to determine what the best solution is". Tim (I think) said that he's never seen a large-scale application deployment succeed without a substantial investment in people. Bill chipped in that it's really all about the staff - people that aren't good won't result in good performance. Anant talked about how TCO was a strong consideration for IBM's clients.

Mårten pointed out the distinction of "stacks" from a customer and vendor perspective - from a vendor's perspective, it's "how can we lock a customer in?". Tim points out that the most important thing is freedom - especially the freedom to leave your current vendor. Tim thinks that changing the database is the hardest part of the whole stack - changing OS and hardware is easy, and it's not hard (though not necessarily easy, at that) to mix software environments in a given application. Mike actually said that, pragmatically, you don't want to change the underlying software or hardware because you need a stable foundation to work from, and no matter what you start from, it's always going to be hard to change. As soon as you build a system that has to work, you are locked in to the platform you chose. Bill talked about how the important thing is, given an existing investment in tools, to ensure that there's an evolution path, and credits LAMP's success to this - the loosely coupled nature of the LAMP stack forced communication and open standards in each of the tools, allowing them to work better and be replaceable if necessary.

Tim talked about how for many years, Sun's answer was "Java. What was the question?", although they're moving towards more dynamic languages in the future. He also thinks that there will be "no winner" between PHP, Java, .NET, Ruby, etc - they exist, and they'll have their own uses. The interesting problem, he thinks, is that given all the different languages and technologies, the key challenge will be integrating them all together.

IBM has invested significantly in PHP. Anant said that he wants to see more enterprise-class applications in PHP, because then it makes it easier to sell enterprise-class IBM servers.

In closing, the panelists talked about how they'd like to see PHP continue to evolve. Bill wanted to see PHP continue to use FastCGI, and sees potential for desktop PHP applications. Tim thinks PHP is pretty strong, but talked about the proliferation of passwords and identity, and thinks that identity management is something that needs to be addressed, and pointed people at Pat Patterson on Identity Management. Mårten was happy that Zend was doing both ZendBox and Zend Core, and hoped that there would be additional ready-made downloadable stacks in the future. Anant thinks a perfect storm is coming regarding information management.

ZendCon - Rendezvous

For all the Mac laptops I'm seeing at the conference, it's a little surprising that I'm not seeing any other Macs show up in Rendezvous Browser. I wonder if the wireless isn't rebroadcasting Rendezvous traffic?

ZendCon Session Report - Keynote - State of the Union

The keynote was delayed for a bit due to microphone problems, and there were a few ongoing issues (e.g. lack of a remote control for the slide presentation). Most of the presentation was given by Zeev Suraski and Andi Gutmans, the co-founders of Zend.

Once it actually got going, the keynote session started off talking about Web 2.0 CRMs - Joomla, SugarCRM (which has some drag-and-drop configuration), MediaWiki.

PHP is apparently really popular on HotScripts - there are about three times as many PHP scripts as in any other category - 13,000 vs. 4,400 for Perl.

According to Zend, about 74% of PHP development is done with PHP5. Which is good, IMO, because in comparison, PHP4 is really limited - both in its object support and in its smaller number of built-in functions.

Andi mentions that the PHP/Java Bridge that's part of Zend Platform is one of the items he thinks is driving PHP 5 adoption. (Update 7:51 pm: I had earlier suggested that this was related to the php-java-bridge project on SourceForge, but this is not the case. Zend's product is closed-source and proprietary.)

Andrei Zmievski talked briefly about PHP 6's upcoming unicode support, which he started work on in March 2005. Unicode in PHP 6 will have a configuration option to allow PHP to continue running as it is now with regard to strings, so there won't be backwards compatibility issues. He expects a preview release of PHP 6 by the end of the year.

The Zend Framework has had 200k downloads since it was released a few months ago. It consists of 100,000 lines of code. v0.2 was released yesterday. It claims to have new JSON support, as well as an "enhanced" Lucene-compatible search API. I'm currently using the php-java-bridge to use the Java version of Lucene, because when I last looked at the PHP offering, it was fairly out of date. So this might be worth looking at again. Later, during the Q&A, Andi mentioned that the Framework is still under development and is not recommended for production use.

Zeev demoed a PHP Framework API to Google Calendar, and then showed off the Zend Studio IDE, but the demo that I think was supposed to show the IDE's debugging features didn't actually work.

Zeev also showed off PHP debugging in the Eclipse IDE. The first public preview version of this is supposed to be released in December.

Zeev then showed off a beta of Zend Studio 5.5. It looks like it has some pretty nifty code completion support. It's also supposed to have integration with Java - probably some built-in version of the php-java-bridge I mentioned earlier. It has code-completion support for Java as well. I think we're going to start seeing a lot more integration of PHP and Java in the future, especially if the development tools make it easier to use than it already is.

Zeev also showed off the monitoring tools provided by Zend Platform. When errors occur during PHP execution, the monitoring tools actually record all the relevant data necessary to do debugging. It even integrates with the Studio to allow you to debug errors after the fact. Zend Platform is apparently free for use for development purposes. Too bad their website is down right now, or I'd give it a try now.

Zend Platform also has a built-in cron-like facility. I suppose it's useful to have a GUI for entering this stuff, but it seems like extra overhead unless it can do something cron can't.

Zeev also showed off Zend Platform 3.0's BIRT report creation tools. The interesting thing here is that the reports can be exported to PDF. I wonder what they're using to do that...

Mårten Mickos from MySQL briefly evangelized MySQL and their support for Zend Core. New features MySQL is planning to add in the MySQL client include client-side caching of data and prepared statements.

Zend is also announcing a new managed hosted PHP 5 system, ZendBox, to be available in November.

Andi is stealing Steve Jobs' trademark "one more thing", by announcing a partnership with Microsoft that is aimed at improving PHP's performance, reliability, application compatibility, and interoperability on Windows. A guy from Microsoft then came up to present. Someone else called him out on the fact that he was using a MacBook Pro to run Windows for the presentation.

The Microsoft collaboration, part of which has created a new FastCGI component for IIS, has resulted in a significant (almost double) increase in PHP's performance on IIS.

This is also my first chance to see the new Windows Aero interface. The translucent window title bars are really distracting, and I think they really detract from the readability of window titles. Makes me glad that Apple dropped the transparency in inactive Mac window titlebars.

Andi re-iterated that the optimizations for Windows will be in the public version of PHP, and that the "community" version of PHP will remain the main version of PHP - pointing out that this won't be like how Mono's implementation of .NET is perpetually behind the Microsoft .NET implementation.

ZendCon Report - Power!

There is an amazing dearth of power outlets here. I may need to steal the six-way splitter that's in the room and carry it around with me so that when I'm actually lucky enough to be near a power outlet that's occupied, it can be shared.

Seriously, how can you have a technical conference like this without having power strips strung all over the place?

I might be able to run without my laptop's backlight for awhile, though. The floodlights in the back of the keynote room are really bright, and I can actually see what's on the screen without its backlight.

ZendCon Session Report - Extending PHP

I'm not really going to go into too much detail about the Extending PHP session I was at yesterday, which was presented by Marcus Boerger and Sara Golemon. It was a pretty technical look at how to write extensions for PHP. Any discussion on extending PHP, though, has to explain a fair amount of PHP's internal workings, and that's really why I was there. It was fairly informative, although they had planned for a six-hour presentation, and there were only three hours, so they didn't get through the entirety of their material. A fair amount of the material they covered appears to have come from the PHP Extension Writing Tutorials page on Zend's website, so it looks like I'll have some further reading material for later.

Monday, October 30th, 2006

ZendCon Session Report - Improving Performance of PHP Applications

The first session I attended today was on improving the performance of PHP applications, presented by Ilia Alshanetsky. It was pretty informative, but he spent a significant amount of time talking about optimizations that, while relevant, aren't PHP-specific, which makes them useful for all websites, not just PHP applications.

  • Apache configuration tuning (turning off ExtendedStatus and HostnameLookups; turning on FollowSymLinks; tuning of KeepAlive timeouts, etc. See also Apache Performance Tuning)
  • Use a separate server specifically for serving static content; turn off KeepAlive on a purely-dynamic content server
  • setting hdparm to appropriate values on linux
  • Consider using a ramdisk for frequently accessed content (e.g. your php session files).
  • Directories with large numbers of files are slow. (I have plenty of experience with this - trying to use directories that have hundreds of thousands of files simply sucks. It's no fun when a ls wedges the kernel and grinds the whole machine to a halt.)

The interesting PHP-related bits, though, are:

Using an optimizer without an opcode cache can be a net loss. (Though, this is obvious if you think about it - in order to generate optimal executable code, the optimizer may spend more time generating that code than can be saved in a single execution.)

You can have a separate ini file for commandline PHP (php-cli.ini; php-<SAPI>.ini), so you can do things like disable register_argc_argv on the webserver where it's never needed, and enable it for the cli when it is.

Instead of using time(), consider $_SERVER['REQUEST_TIME']. In my own tests, this only appears to be a 15% difference, so I wouldn't be in too much of a hurry to make this change for existing code. There's a few other functions whose values are duplicated in constants, so they can be fetched with even better gains.

preg_* is generally faster than ereg_*; but in any case, don't use regular expressions when there's a PHP API function that does specifically what you need. (This I already knew, but it's worth repeating, since I see a lot of people making this mistake.)

This one's counter-intuitive, since it seems like you're creating more work. When doing string replaces, it's often advantageous to do the replace conditionally: in the case that there's no match, if(strpos() !== false) str_replace(); is significantly faster than blindly calling str_replace(), and barely any slower in the case that there's a text match. (This is due to the fact that str_replace has to duplicate the search, replacement, and source strings.) My tests show that it's about 1.7 times faster for an empty file (non-matching, of course), 2.3 times faster for a non-matching 95kb file, and 3.2 times faster for a 2MB file (non-matching). For a matching replace in the 2MB file, it appears to be about a 1% difference. I'm wondering if it might be possible to modify str_replace so that it can do this by itself - first search for the search string, and then, if it finds one, create the new copies. A quick glance at the str_replace code suggests that there might be potential for some improvement, but it's going to take some time for me to do more research.

Then there's the @ (error-suppression) operator, which I had already decided was evil on the grounds that you shouldn't be writing code that emits errors. (Though I use @ for those PHP functions which are brain-dead enough to emit a warning when the code is perfectly valid but an error occurs - e.g. mysql_connect() to a server that is down.) Ilia, however, says it's evil because it's amazingly slow. And he's right. For a call into a function that does nothing, it's slower by a factor of four. For a no-op (@0; vs. 0;), 100 million iterations of @0; takes about 80 seconds. I can't actually time how long it takes to do 100 million iterations of 0;, though, because it executes so quickly that the noise of the background processes on my computer results in meaningless benchmark values - about half the time, I'm getting a negative value for time elapsed. In any case, @ is really slow, and should be used as close to never as is practical.

I'm really happy Ilia touched on accessing array indexes with unquoted strings (e.g. $foo[bar] = 1). This is a major pet peeve of mine, and while I knew it was slower, I hadn't really realized how much slower it was. (It involves one call to strtolower, two hashtable lookups, an E_NOTICE error, and the creation of a temporary string; none of which is necessary if the index is enclosed in quotes.) Ilia's benchmarks show a 700% difference "on average", depending on the length of the key, but my tests are showing an 1100%+ difference, since I'm factoring out the cost of loop iteration.

I asked Ilia whether using echo with multiple parameters was faster than giving it concatenated strings (e.g. echo $one, $two, $three vs. echo $one . $two . $three). He said it was, but I'm not so sure this is true in all cases. I did some rather pathological benchmarks, and found that when outputting/concatenating two empty strings, two-parameter echo is twice as fast. But increasing the two strings to one character each results in concatenation being twice as fast with output buffering disabled, and about 10-15% faster with output buffering enabled. Clearly, this is deserving of more extensive benchmarks.

Ilia explains that it's important to fix code that's generating errors that aren't displayed by default (E_NOTICE and E_STRICT), because they result in time spent generating the error message, even if it's not displayed, but he didn't say what the speed penalty was. My tests show that accessing an array element that does not exist, for example, takes about 10 times longer than accessing an array element that does exist. I've written a lot of code that runs into this. Often, I'll have

if($array['someProperty']) {do something }

when I should have

if(!empty($array['someProperty'])) { ... }

instead (which is about 7.5 times faster). I don't think it looks as pretty (which is why I've been stubbornly doing the former), but I think I can justify the "uglier" code now in new development. (I'm not sure I could justify going back and changing existing code though, except as an exercise to remove the warnings so that when E_NOTICE/E_STRICT are enabled, they don't drown out any new warnings.)

A reminder that PHP5 passes variables with copy-on-write, so it's not necessary (and actually bad for performance) to pass a variable to a function by reference unless you need the function to modify the original variable.

There's a number of other things that were brought up that are good to know, but which I'm not going to ponder in any great detail here. These include using full pathnames in include|require(_once) calls; using references for loop invariants with multidimensional arrays (e.g. for($x = 0; $x < 5; $x++) $arr['a']['b'][$x] = $x;).
